Introduction
Many company and university websites publish a directory of everyone who works there, listing each employee's name, email address, telephone number, and sometimes postal address. If you are an agent looking to cooperate with such a company or university, you will want to add these people to your Yahoo contact list so that you can contact them, ask for information, or find out whether they are interested in your product.
But how do you add them to your email contact list? Doing it by hand takes time and effort you may not have, because you must enter each person's name (first, middle, and last), email addresses, telephone number, and even gender. Gender matters if your product is aimed only at women or only at men, and working out each person's gender takes still more time.
Email Extractor is a simple program I started writing when I faced this problem. By taking advantage of Regex expressions already available on the internet, it scans a body of text for people and extracts their names, telephone numbers, email addresses, a guessed gender, and other details you may need before contacting them.
The program does more than extract the data. It arranges the data in a DataGridView control, with one row per person and one column per detail, and it can save the result as a “.CSV” file, a format Yahoo Mail accepts when importing contacts. You can also specify exactly what data you need by defining your own Regex expressions, and you can save these expressions in a database under a group name for later use.
Regex expressions can also be used to clean up and prepare mixed data so that the details belonging to each person can be extracted programmatically.
Extracting one person's data from an undivided pool of text is almost impossible, so each person's block of data must be separated from the next by two or more newline characters (“\n”), that is, by at least one blank line. This lets the code split the text into blocks, one per person. You can also supply your own Regex expression for splitting the text into blocks.
Background
Yahoo Mail is one of the most popular mail providers on the internet and offers many tools for handling email. One important tool is contact import, which lets you import contacts from another mail account (Yahoo, Gmail, …) or from files such as CSV files.
How do you add a comma-separated values (.CSV) file to your Yahoo Mail contacts?
1. Open your Yahoo Mail.
2. Press “contacts”.
3. Add a new contact list.
4. Press “import contacts”.
5. Choose “others” to specify that you will import contacts by another method.
6. For step 1, deselect “another Yahoo account”.
7. Select “a desktop email program (Outlook, Apple Mail, etc...)”.
8. Browse to the .CSV file you want to add.
9. Press “continue”.
What Are Regular Expressions?
A regular expression is a set of characters that specifies a pattern. The term "regular" has nothing to do with a high-fiber diet. It comes from a term used to describe grammars and formal languages.
Regular expressions are used when you want to search for specific lines of text containing a particular pattern. Most of the UNIX utilities operate on ASCII files a line at a time. Regular expressions search for patterns on a single line, and not for patterns that start on one line and end on another.
It is simple to search for a specific word or string of characters. Almost every editor on every computer system can do this. Regular expressions are more powerful and flexible. You can search for words of a certain size. You can search for a word with four or more vowels that ends with an "s". Numbers, punctuation characters, you name it, a regular expression can find it. What happens once the program you are using finds it is another matter. Some just search for the pattern. Others print out the line containing the pattern. Editors can replace the string with a new pattern. It all depends on the utility.
Regular expressions confuse people because they look a lot like the file matching patterns the shell uses. They even act the same way, almost. The square brackets are similar, and the shell's asterisk acts similarly, but not identically, to the asterisk in a regular expression. In particular, the Bourne shell, C shell, find, and cpio use file name matching patterns and not regular expressions.
Remember that shell meta-characters are expanded before the shell passes the arguments to the program. To prevent this expansion, the special characters in a regular expression must be quoted when passed as an option from the shell.
You can learn more about Regex expressions here.
Search for People:
This is the program's main method. It takes the text from the main RichTextBox (MixedEmailSRichtextBox) and splits it into blocks, one per person. By default the blocks must be separated by two or more newlines ((\n){2,}), unless the user supplies a different splitting expression. The splitting itself is done by the GetPeopleFromBlock method, which wraps each block in a Person object. The Person class then runs the various Regex expressions that extract one person's details, such as email, name, and telephone number.
private void SearchForPeople()
{
    // Forget which user-defined expressions failed on a previous run.
    WrongRegex.Clear();
    string Block = MixedEmailSRichtextBox.Text;
    // Default splitter: two or more newlines between people.
    string SplitRegex = @"(\n){2,}(\w)*?";
    // Use the user's own block-splitting expression when one is supplied.
    if (BlockSplitingText.Enabled && BlockSplitingText.Text.Length > 0)
        SplitRegex = BlockSplitingText.Text;
    List<Person> People = GetPeopleFromBlock(Block, SplitRegex);
    this.EmailsdataGridView.DataSource = PeopleTable(People, GetAdditionalRegexfrom(RegexDataGridView));
}
static List<Person> GetPeopleFromBlock(string Data, string RegexSplitting)
{
    List<Person> MyPData = new List<Person>();
    try
    {
        // Split the whole text into one block per person.
        string[] M = Regex.Split(Data, RegexSplitting);
        foreach (var Pdata in M)
        {
            // Ignore fragments too short to describe a person.
            if (Pdata.Length > 20)
            {
                Person newPerson = new Person(Pdata);
                MyPData.Add(newPerson);
            }
        }
    }
    catch (Exception e)
    {
        // An invalid splitting expression ends up here.
        MessageBox.Show(e.Message);
    }
    return MyPData;
}
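One detail of the default splitter is worth knowing: Regex.Split includes the text of any capturing group in its result, so a pattern such as (\n){2,} returns stray "\n" entries alongside the blocks (the length check above happens to discard them). The following is only a sketch of an alternative, not the program's own code: BlockSplitter and SplitIntoBlocks are hypothetical names, and the pattern is an assumed variant that uses a non-capturing group and also tolerates Windows line endings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original program.
public static class BlockSplitter
{
    // (?:...) is a non-capturing group, so Regex.Split returns only the
    // text between separators and no stray "\n" entries.
    public const string DefaultPattern = @"(?:\r?\n){2,}";

    public static List<string> SplitIntoBlocks(string text, string pattern = DefaultPattern)
    {
        var blocks = new List<string>();
        foreach (string part in Regex.Split(text, pattern))
        {
            string trimmed = part.Trim();
            if (trimmed.Length > 0)   // skip empty pieces at the start or end
                blocks.Add(trimmed);
        }
        return blocks;
    }
}
```

Calling BlockSplitter.SplitIntoBlocks on text that holds two people separated by blank lines should return two strings, one per person, which could then be passed to the Person constructor.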
People Table function:
This method takes the list of Person objects, each loaded with its block of data, and calls GetPersonFromBlock on each one. That method runs both the Regex expressions built into the class and the user-defined Regex expressions (“AdditionalRegex”). Every value returned is written to the matching column of a new table row.
private DataTable PeopleTable(List<Person> People, List<AdditionalRegexExpression> AdditionalRegex)
{
    DataTable PeopleTable = new DataTable();
    AddCoulmnsForPeopleTable(PeopleTable);
    for (int i = 0; i < People.Count; i++)
    {
        DataRow NewPerson = PeopleTable.NewRow();
        List<PersonData> newPErsonData = People[i].GetPersonFromBlock(AdditionalRegex, NoDataWithoutEmail);
        for (int P = 0; P < newPErsonData.Count; P++)
        {
            if (newPErsonData[P].dataType == null)
                continue;
            // Enum names use '_' where the column names use spaces.
            string ColumnName = newPErsonData[P].dataType.ToString().Replace('_', ' ');
            if (ColumnName == "Email Address")
                ColumnName = "E-mail Address";
            if (ColumnName == "Email Display Name")
                ColumnName = "E-mail Display Name";
            // Append to whatever is already stored in this column.
            string AlreadyExistedData = NewPerson[ColumnName].ToString();
            NewPerson[ColumnName] = AlreadyExistedData + CleanString(newPErsonData[P].Data.ToString());
        }
        if (newPErsonData.Count > 0)
            PeopleTable.Rows.Add(NewPerson);
    }
    return PeopleTable;
}
static string CleanString(string X)
{
    return Regex.Replace(X, RegexExpressionCollection.CleanStringForExcel, " ");
}
Solving the .CSV export problem:
As you may have noticed, the previous code calls a method named CleanString. This method plays an important role: it removes any character that could confuse a .csv reader (Excel, for example) when it arranges cell values into rows. It applies a Regex expression that replaces commas, which .csv uses as the cell separator, along with whitespace and control characters, with a space.
public static string CleanStringForExcel { get { return @"\n|\s|\t|\f|\v|\e|,"; } }
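To make the effect of this pattern concrete, here is a small sketch. The first constant is the pattern quoted above. CollapsingPattern, CleanStringSketch, and the two method names are hypothetical additions for illustration only. Because the original pattern matches one character at a time, a comma followed by a space becomes two spaces; the assumed variant treats a whole run as a single match.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative sketch only; the class and method names are not from the program.
public static class CleanStringSketch
{
    // The pattern quoted above: each matching character is replaced on its own.
    public const string CleanStringForExcel = @"\n|\s|\t|\f|\v|\e|,";

    // Hypothetical variant: a run of commas and whitespace counts as one match.
    public const string CollapsingPattern = @"[\s,]+";

    public static string Clean(string value)
    {
        return Regex.Replace(value, CleanStringForExcel, " ");
    }

    public static string CleanAndCollapse(string value)
    {
        return Regex.Replace(value, CollapsingPattern, " ").Trim();
    }
}
```

With the input "Doe, Jane", Clean yields "Doe  Jane" (two spaces), while CleanAndCollapse yields "Doe Jane". Either way the comma is gone, so the value cannot spill into a neighbouring cell.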
Class Person:
This class receives the block of data in its constructor and runs the Regex expressions required by whichever of its methods is called. It also contains an enumeration, DataType, whose members name the table column that each extracted value belongs to.
class Person
{
    public enum DataType
    {
        First_Name, Middle_Name, Last_Name, Department, Job_Title, Business_State, Home_Postal_Code, Business_Phone, Home_Phone, Birthday, Email_Address, Email_Display_Name, Gender,Language, Notes, Web_Page, Business_Fax
    }
    string DataBlock { get; set; }
    public Person(string DataBlock)
    {
        this.DataBlock = DataBlock;
    }
    public PersonData getEmailPersonFromBlock()
    {
        List<string> Emails = new List<string>();
        Regex emailRegex = new Regex(RegexExpressionCollection.Email, RegexOptions.IgnoreCase);
        MatchCollection emailMatches = emailRegex.Matches(this.DataBlock);
        foreach (Match emailMatch in emailMatches)
        {
            Emails.Add(emailMatch.Value);
        }
        // Only the first address found in the block is used.
        if (emailMatches.Count == 0)
            return new PersonData(DataType.Email_Address, "");
        return new PersonData(DataType.Email_Address, Emails[0]);
    }
    public PersonData Gender()
    {
        string Name = GetPersonNameFromBlock()[0].Data;
        return Gender(Name);
    }
The Person class also contains GetPersonFromBlock, which runs all the available Regex expressions at once and returns the extracted values as a list of PersonData items, each tagged with its Person.DataType.
    public List<PersonData> GetPersonFromBlock(List<AdditionalRegexExpression> AddationalRegex, bool EmailIsMainData)
    {
        PersonData PEmailData = getEmailPersonFromBlock();
        // When an email is required, a block without one yields no data at all.
        if (!Regex.IsMatch(PEmailData.Data.ToString(), RegexExpressionCollection.Email) && EmailIsMainData)
            return new List<PersonData>();
        List<PersonData> PNameData = new List<PersonData>();
        if (PEmailData.Data != "")
            PNameData = GetPersonNameFromBlock(PEmailData.Data.ToString());
        PersonData PPhone = GetPhonePersonFormBlock();
        PersonData PbirthDaY = GetPhonePersonBirthDate();
        PersonData PEMialDisplayName = new PersonData();
        if (PNameData.Count > 0)
            PEMialDisplayName = new PersonData(DataType.Email_Display_Name, PNameData[0].Data);
        PersonData PPostalCode = GetPersonPostalCodeFromBlocK();
        PersonData Gender = new PersonData();
        if (PNameData.Count > 0)
            Gender = this.Gender(PNameData[0].Data);
        // The whole block is kept as the contact's notes.
        PersonData Notes = new PersonData(DataType.Notes, this.DataBlock);
        PersonData WebPage = GetPersonWebPageFromBlock();
        PersonData Business_Phone = GetPersonBussinessPhone();
        PersonData Lang = GetPersonLanguage();
        PersonData Business_Fax = getPersonBusinessFax();
        PersonData Business_State = GetPersonBusiness_State();
        // Note: the job title is filled from the business-state match.
        PersonData Jop_Tile = new PersonData(DataType.Job_Title, Business_State.Data);
        PersonData Department = GetPersonDepartment();
        List<PersonData> AddaationalData = GetAddationalData(AddationalRegex);
        List<PersonData> TotalData = new List<PersonData>();
        TotalData.AddRange(PNameData);
        TotalData.AddRange(new List<PersonData> { Business_Fax, PEmailData, PPhone, PbirthDaY, PEMialDisplayName, PPostalCode, Gender, Notes, WebPage, Business_Phone, Lang, Business_State, Jop_Tile, Department });
        TotalData.AddRange(AddaationalData);
        return TotalData;
    }
    private string[] StringSplittededBySpace(string Data)
    {
        return Regex.Split(Data, @"[\s|\.|\,|\-]+?");
    }
    // Indexes of user-defined expressions that threw an exception.
    static List<int> WrongRegex = new List<int>();
The next method is very important: it runs the Regex expressions that the user has defined in the Regex Expression table.
    public List<PersonData> GetAddationalData(List<AdditionalRegexExpression> AdditionalRegex)
    {
        List<PersonData> AdditionalData = new List<PersonData>();
        for (int i = 0; i < AdditionalRegex.Count; i++)
        {
            // Skip expressions that already failed on an earlier block.
            if (WrongRegex.Contains(i))
                continue;
            try
            {
                MatchCollection RegexMatches = Regex.Matches(this.DataBlock, AdditionalRegex[i].RgExp.ToString());
                // NumOfResults == 0 means "keep every match".
                int count = 0;
                if (AdditionalRegex[i].NumOfResults == 0)
                {
                    count = RegexMatches.Count;
                }
                else
                    count = AdditionalRegex[i].NumOfResults;
                if (RegexMatches.Count < count)
                    count = RegexMatches.Count;
                // Concatenate the kept matches into one cell value.
                string Data = "";
                for (int c = 0; c < count; c++)
                {
                    Data += RegexMatches[c].Value.ToString();
                }
                string targetColumn = AdditionalRegex[i].RgNam;
                AdditionalData.Add(new PersonData(targetColumn, Data));
            }
            catch (Exception)
            {
                // Remember the failing expression, then rethrow without
                // resetting the stack trace.
                WrongRegex.Add(i);
                throw;
            }
        }
        return AdditionalData;
    }
}
Class Regex Expression Type:
When the user adds a new Regex expression to the Regex Expression table, they must specify the target column in which the extracted data will be stored, as well as the number of results to keep (0 keeps every match).
class AdditionalRegexExpression
{
    public AdditionalRegexExpression(string Name, string Regex, int NumberOfResults)
    {
        this.RgNam = Name;
        this.RgExp = Regex;
        this.NumOfResults = NumberOfResults;
    }
    // Name of the target column; GetAddationalData reads it as RgNam.
    public string RgNam { get; set; }
    public string RgExp { get; set; }
    public string RgColumnTarget { get; set; }
    // 0 means "keep every match".
    public int NumOfResults { get; set; }
}
Opening text file in Rich textbox:
This method loads mixed data from a text file into the RichTextBox control; this is the text from which each person's details will be extracted. How well the program can tell one person's data from another's, and therefore the accuracy of the results, depends on the following:
1) There must be two or more “\n” (newline) characters between the data of different people; this lets the program split the data accurately before passing it to the extraction process.
2) If the user enters a different splitting Regex, it must be accurate and must match the text that lies between people's blocks.
3) The data should be homogeneous: each kind of data should follow a general rule, such as “Tel[:]+(\d+[-])+” for telephone numbers. This saves you from writing several Regex expressions to capture a single kind of data, such as emails.
using (OpenFileDialog dlgOpen = new OpenFileDialog())
{
    try
    {
        dlgOpen.Filter = "All files(*.*)|*.*";
        dlgOpen.InitialDirectory = "D:";
        dlgOpen.Title = "Open";
        if (dlgOpen.ShowDialog() == DialogResult.OK)
        {
            // The using block closes the reader even if reading fails.
            using (StreamReader sr = new StreamReader(dlgOpen.FileName, Encoding.Default))
            {
                MixedEmailSRichtextBox.Text = sr.ReadToEnd();
            }
        }
    }
    catch (Exception errorMsg)
    {
        MessageBox.Show(errorMsg.Message);
    }
}
Different Regex for Data:
The RegexExpressionCollection class contains some ready-made Regex expressions that help identify important details about a person. Any of them can be overridden simply by adding a new Regex expression and setting its target column as you wish.
1. Fax number Regex:
The only difference between a fax number and a telephone number is that the fax number is preceded by the word “Fax”, so the Regex looks for that word as the key that distinguishes the two.
public static string Fax{ get { return @"Fax:\s[+]*([\d+[-]*)+"; }}
2. Phone number Regex:
public static string Phone { get { return @"(Tel|phone|telephone|mobile):\s[+]*([\d+[-]*)+"; } }
3. Web page:
There are many Regex expressions for web pages; this is the simplest and most accurate one I have found.
public static string WebPage{get { return @"(http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z 0-9\-\._\?\,\'/\\\+&%\$#\=~])*"; } }
4. Birth Date:
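// As published, this pattern is identical to the Fax pattern, so it matches fax numbers, not dates.
public static string BirthDate{get { return @"Fax:\s[+]*([\d+[-]*)+"; } }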
5. Email:
The email address is the most important item here. If this expression finds no email in a person's block, the program removes that person from the results automatically, unless you tick the check box titled “don’t erase the No email results”.
public static string Email{get { return @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"; } }
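As a quick illustration of how this pattern behaves on a block of text, here is a minimal sketch. It reuses the Email pattern quoted above; EmailSketch and FindEmails are hypothetical names, and the sample block in the usage note is invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative sketch only; not part of the original program.
public static class EmailSketch
{
    // The Email pattern quoted above.
    const string Email = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // Returns every address found in the block, in order of appearance.
    public static List<string> FindEmails(string block)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(block, Email, RegexOptions.IgnoreCase))
            found.Add(m.Value);
        return found;
    }
}
```

For a block such as "Dr. Jane Doe\nEmail: jane.doe@example.edu\nTel: 555-0100", FindEmails returns a single entry, jane.doe@example.edu. A block with no address returns an empty list, which corresponds to the case where the person is dropped from the results.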
6. Gender:
Guessing each client's gender is very important, as it can help determine whether the person is likely to be interested in your product. Suppose you were given a long mailing list and had to write a program to determine whether each name on the list represented a male or female, perhaps so you could preface each name with "Mr." or "Ms." in the salutation of a form letter. How would you do it? The first solution that comes to mind might be to create a table of every personal name and the gender normally associated with it. Then, for each name in the mailing list, look up the corresponding gender in the table, most likely using a hash function for speed.
Such an approach has two key problems. First, variations in spelling, no matter how slight, will defeat the algorithm. A human can tell that Caren, Caryn, Karyn, and Karin are all variations of the same name, but a table-lookup algorithm can't. If one of those four spellings isn't in the table, the program will be unable to match that name with a gender. Second, typing thousands of name-gender pairs into a table is time-consuming and error-prone. Moreover, you might never have a complete list because of variant spellings. What if I told you a gender detection program can be written in about 50 lines of code and data?
Instead of matching straight text, the program matches based on patterns within the text. The strategy says that if a name matches a certain pattern of letters, it must be a certain gender. For example, if a name contains the letter sequence ann (upper or lowercase), it must be female (not including foreign names, diminutives of English names, and ambiguous names). This rule catches a lot of names, including Ann, Anna, Annabelle, Annacarol, Annalisa, Annaliz, Annamarie, Anne, Annemarie, Annette, Annie…
A number of advantages are associated with using a pattern-matching approach, even beyond the amount of time saved by not having to enter thousands of names and their associated genders into a table. First, the program will often work with a name or variation of a name not yet encountered. Second, and in a similar vein, typos and misspelled names still tend to yield the same gender as the correct transcription. And third, the program is guaranteed to return a gender for any name given it. You can't stump the code.
You can read more about these patterns here.
This collection of Regex expressions comes from “Predict Gender Given a First Name” by Scott Pakin, August 1991 (lawker.googlecode.com/svn/fridge/share/pdf/pakin1991.pdf), which is introduced as follows: “This tour-de-force AWK program reveals the strengths and weaknesses of the rule-based paradigm and will prove useful for mailing-list programmers.”
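The rule-based idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the program's implementation: GenderSketch, Rules, and Guess are hypothetical names, the rule table holds just two entries (the "ann" rule from the text and one pattern taken from the male list in this article), and the first matching rule wins. The fallback argument reflects the point that the approach always returns some answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative sketch only; the real rule set is much larger.
public static class GenderSketch
{
    // Each rule pairs a pattern with the gender it suggests.
    private static readonly (Regex Pattern, string Gender)[] Rules =
    {
        // Rule from the text: any name containing "ann", in either case.
        (new Regex("ann", RegexOptions.IgnoreCase), "Female"),
        // One pattern from the article's male list.
        (new Regex("^Br[aou][cd].*[ey]$"), "Male"),
    };

    // Returns the gender of the first matching rule, or the fallback.
    public static string Guess(string firstName, string fallback)
    {
        foreach (var rule in Rules)
        {
            if (rule.Pattern.IsMatch(firstName))
                return rule.Gender;
        }
        return fallback;
    }
}
```

With this table, Guess("Annette", "Unknown") returns "Female" and Guess("Bruce", "Unknown") returns "Male", while a name that matches no rule simply gets the fallback value.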
· Male Name Regex:
public static bool IsMale(string Name)
{
    string[] ArrayOfRegex = {
        "^[^S].*r[rv]e?y?$" ,"^[^G].*v[ei]$" ,"^[^AJKLMNP][^o][^eit]*([glrsw]ey|lie)$" ,"^[CGJWZ][^o][^dnt]*y$" ,"^.*[Rlr][abo]y$" ,"^.*[GRguw][ae]y?ne$" ,"^[CLMQTV].*[^dl][in]c.*[ey]$" ,"^.*[ay][dl]e$" ,"^[^o]*ke$" ,"^[^EL].*o(rg?|sh?)?(e|ua)$" ,"^[^JPSWZ].*[denor]n.*y$" ,"^Br[aou][cd].*[ey]$" ,"^[ILW][aeg][^ir]*e$" ,"^[ABEIUY][euz]?[blr][aeiy]$" ,"^[ART][^r]*[dhn]e?y$" ,"^.*oi?[mn]e$" ,"^D.*[mnw].*[iy]$" ,"^[^BG](e[rst]|ha)[^il]*e$"
    };
    // The name is treated as male as soon as any pattern matches.
    for (int i = 0; i < ArrayOfRegex.Length; i++)
        if (Regex.IsMatch(Name, ArrayOfRegex[i]))
            return true;
    return false;
}
· Female Name Regex:
Note that the listing below, as published, declares a property named Male and repeats the male patterns; the female patterns are not reproduced in this article.
public static string[] Male
{
    get
    {
        return new string[] {
            "^[^S].*r[rv]e?y?$" ,"^[^G].*v[ei]$" ,"^[^AJKLMNP][^o][^eit]*([glrsw]ey|lie)$" ,"^[CGJWZ][^o][^dnt]*y$" ,"^.*[Rlr][abo]y$" ,"^.*[GRguw][ae]y?ne$" ,"^[CLMQTV].*[^dl][in]c.*[ey]$" ,"^.*[ay][dl]e$" ,"^[^o]*ke$" ,"^[^EL].*o(rg?|sh?)?(e|ua)$" ,"^[^JPSWZ].*[denor]n.*y$" ,"^Br[aou][cd].*[ey]$" ,"^[ILW][aeg][^ir]*e$" ,"^[ABEIUY][euz]?[blr][aeiy]$" ,"^[ART][^r]*[dhn]e?y$" ,"^.*oi?[mn]e$" ,"^D.*[mnw].*[iy]$" ,"^[^BG](e[rst]|ha)[^il]*e$"
        };
    }
}
7. Other Regex expressions
Saving DataGridView as Comma-separated values (CSV) file:
This method saves the contents of a DataGridView as a comma-separated values file. It uses a StringBuilder to write the column names first, each followed by “,”, because a .csv file treats its first row as the column names and every following row as values. It then appends the cell values of each row in the same way. The method is very simple and not robust: if a value itself contains “,”, Excel will split it across cells and the values will end up in the wrong columns.
private void ExtractDataToCSV(DataGridView dgv)
{
    if (dgv.Rows.Count == 0)
    {
        return;
    }
    StringBuilder sb = new StringBuilder();
    // First line: the column names.
    string columnsHeader = "";
    for (int i = 0; i < dgv.Columns.Count; i++)
    {
        columnsHeader += dgv.Columns[i].Name + ",";
    }
    sb.Append(columnsHeader + Environment.NewLine);
    // One line per row, skipping the empty "new row" placeholder.
    foreach (DataGridViewRow dgvRow in dgv.Rows)
    {
        if (!dgvRow.IsNewRow)
        {
            for (int c = 0; c < dgvRow.Cells.Count; c++)
            {
                sb.Append(dgvRow.Cells[c].Value + ",");
            }
            sb.Append(Environment.NewLine);
        }
    }
    SaveFileDialog sfd = new SaveFileDialog();
    sfd.Filter = "CSV files (*.csv)|*.csv";
    if (sfd.ShowDialog() == System.Windows.Forms.DialogResult.OK)
    {
        using (System.IO.StreamWriter sw = new System.IO.StreamWriter(sfd.FileName, false))
        {
            sw.WriteLine(sb.ToString());
        }
        // Report success only when a file was actually written.
        MessageBox.Show("CSV file saved.");
    }
}
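The comma problem mentioned above can also be handled by quoting values instead of stripping commas. The sketch below follows the common CSV convention (described in RFC 4180) of wrapping a field in double quotes when it contains a comma, a quote, or a line break, and doubling any quote inside it. CsvSketch, EscapeField, and ToCsvLine are hypothetical names, and whether Yahoo's importer accepts quoted fields is an assumption that should be checked before relying on this.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; not part of the original program.
public static class CsvSketch
{
    // Quote a field only when it contains a character that would
    // otherwise break the row layout.
    public static string EscapeField(string value)
    {
        if (value == null) return "";
        bool mustQuote = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!mustQuote) return value;
        // Inside a quoted field, a double quote is written twice.
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Join the escaped fields with commas, without a trailing comma.
    public static string ToCsvLine(IEnumerable<string> fields)
    {
        var escaped = new List<string>();
        foreach (string field in fields)
            escaped.Add(EscapeField(field));
        return string.Join(",", escaped);
    }
}
```

In ExtractDataToCSV, each cell value could be passed through EscapeField before being appended, so that a value such as Doe, Jane stays in a single cell when the file is opened in Excel.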
Saving Regex Expressions In Database:
There is an integrated database in which you can save your expressions so that you can reuse them later. First click the “new group” button, enter the group name in the text box, and then press “save to database”; the group is then added to the drop-down combo box.
History
Emails Extractor (version 1.0) -------- 9/2012.