Blog: Extract or find email addresses from a text file or HTML file using Java

May 28


Problem

 

I want to find all the valid email addresses in a text file, an HTML file, or a web page. The code below tries to extract (find/filter) valid email ids from a given input.

This is just a draft version of the code. You might have better ideas; please share them.


 

package email;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
/*
 * Characters allowed in the local part of an email address:
 * Uppercase and lowercase letters (case insensitive)
 * The digits 0 through 9
 * The characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
 * The character "." provided that it is not the first or last character in the local-part.
 *
 * This program is just for reference; it needs a lot of modification to capture all kinds of
 * email ids.
 * Place the input file anywhere and provide its path to this program. It will extract all email ids
 * from the source file and create a comma-separated email id list.
 */
public class ExtractEmailids {
    /**
     * @param args not used; the input and output paths are hard-coded below
     */
    public static void main(String[] args) {
        try {
            String fileName = "c:\\emailInput.txt"; // this can be replaced by a URL character stream
            String outputFile = "c:\\emailOutput.txt";
            new ExtractEmailids().parseFile(fileName, outputFile);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    private void parseFile(String fileName, String outputFile) throws Exception {
        // try-with-resources closes both files even if reading or writing fails
        try (BufferedReader br = new BufferedReader(new FileReader(fileName));
             FileWriter writer = new FileWriter(outputFile)) {
            String thisLine;
            while ((thisLine = br.readLine()) != null) {
                System.out.println(thisLine);
                // Parse the line and find email ids
                int searchFrom = 0; // everything before this index has already been scanned
                int indexOfAt;
                while ((indexOfAt = thisLine.indexOf('@', searchFrom)) > -1) {
                    // Walk left over all allowed characters before the @
                    int startIndex = indexOfAt;
                    while (startIndex > searchFrom && isCharAllowed(thisLine.charAt(startIndex - 1))) {
                        startIndex--;
                    }
                    // Walk right over the domain (.com, .net, ...) after the @
                    int endIndex = indexOfAt + 1;
                    while (endIndex < thisLine.length() && isCharAllowed(thisLine.charAt(endIndex))) {
                        endIndex++;
                    }
                    // Skip a bare @ that has nothing before or after it
                    if (startIndex < indexOfAt && endIndex > indexOfAt + 1) {
                        String emailId = thisLine.substring(startIndex, endIndex);
                        writer.write(emailId + ",");
                        System.out.println(emailId);
                    }
                    searchFrom = endIndex;
                }
            }
        }
    }
    private boolean isCharAllowed(char ch) {
        // * Uppercase and lowercase letters (case insensitive)
        // * The digits 0 through 9
        // * The characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
        //   (this draft does not yet accept ? and `)
        // * The character "." provided that it is not the first or last character in the local-part.
        if ((ch >= 'A' && ch <= 'Z') ||
                (ch >= 'a' && ch <= 'z') ||
                (ch >= '0' && ch <= '9')) {
            return true;
        }
        if (ch == '.' || ch == '_' || ch == '-' || ch == '!' || ch == '#' || ch == '$' || ch == '%' ||
                ch == '&' || ch == '\'' || ch == '*' || ch == '+' || ch == '/' || ch == '=' || ch == '^' ||
                ch == '{' || ch == '|' || ch == '}' || ch == '~') {
            return true;
        }
        // The @ itself is also treated as an allowed character
        if (ch == '@') {
            return true;
        }
        return false;
    }
}
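
Since the code above is only a draft, here is one possible alternative as a rough sketch: the same extraction done with a regular expression instead of a hand-written character scan. The class name RegexEmailExtractor, the method name extract, and the pattern itself are my own assumptions for illustration. The pattern is deliberately simplified and accepts far fewer characters than the full list in the comment above, so treat it as a starting point and not as a complete validator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexEmailExtractor {

    // Simplified pattern (an assumption, not the full RFC grammar):
    // a local part of letters, digits and . _ % + -, then @, then
    // dot-separated domain labels ending in a label of at least two letters.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)*\\.[A-Za-z]{2,}");

    /** Returns each distinct match in the text, in first-seen order. */
    public static Set<String> extract(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    /** Same as above, but reads the input line by line from any Reader. */
    public static Set<String> extract(Reader source) throws IOException {
        Set<String> found = new LinkedHashSet<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                found.addAll(extract(line));
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        String sample = "Contact <a href=\"mailto:john.doe@example.com\">John</a>, "
                + "or write to jane_roe@mail.example.org.\nNo address on this line.";
        // prints john.doe@example.com,jane_roe@mail.example.org
        System.out.println(String.join(",", extract(new StringReader(sample))));
    }
}
```

Because extract accepts any Reader, a FileReader for the input file works, and so should an InputStreamReader wrapped around a web page's input stream. The LinkedHashSet drops duplicate addresses while keeping the order in which they were found, which the draft above does not do.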


Posted in Software / Software category on May 28 2010, 01:46 PM
Tags: Java, Email, J2ee, Programmer, Software