Introduction
In almost all cases, it is good practice to validate user input, for many reasons (e.g., security, reliability, conformance to an expected format). One of the most common cases is email validation. This article discusses the topic briefly and presents a very simple C# static class that can be used as a "ready to use" solution for email validation.
Background
Recently I was toying with regular expressions, in particular with expressions for e-mail validation. The expressions I found on the Internet did not seem to accept and reject addresses the way I wanted. For example, one of the common email validation regular expressions successfully matched the following addresses: "..@test.com", ".a@test.com", ".@s.dd", "ab@988.120.150.10", "ab@120.256.256.120", "2@bde.cc", "-@bde.cc", "..@bde.cc", "_@bde.cc", which obviously should not be considered valid email addresses.
Test Class and Sample Code
The validation class (see the code below) and the sample project are written in C#, but the code is simple and can be ported to any programming language that supports regular expressions. I have tried to keep the sample class as simple as possible. The validation expression, though, is more complex and needs further explanation.
using System.Text.RegularExpressions;

public static class TestEmail
{
    // Each fragment must stay on a single line: a line break inside a
    // verbatim string would become part of the pattern.
    public const string MatchEmailPattern =
        @"^(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@"
        + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\."
        + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
        + @"([a-zA-Z0-9]+[\w-]+\.)+[a-zA-Z]{1}[a-zA-Z0-9-]{1,23})$";

    // Returns true if the whole string matches the pattern; null is never valid.
    public static bool IsEmail(string email)
    {
        if (email != null) return Regex.IsMatch(email, MatchEmailPattern);
        else return false;
    }
}
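As a quick illustration of how the class might be called, here is a minimal sketch. The class name IsEmailDemo and the choice of sample addresses are assumptions made for this example; the pattern is repeated only so that the snippet compiles on its own.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; it mirrors TestEmail so the sketch is self-contained.
public static class IsEmailDemo
{
    // Same pattern as MatchEmailPattern in the article's TestEmail class.
    private const string Pattern =
        @"^(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@"
        + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\."
        + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
        + @"([a-zA-Z0-9]+[\w-]+\.)+[a-zA-Z]{1}[a-zA-Z0-9-]{1,23})$";

    public static bool IsEmail(string email)
    {
        return email != null && Regex.IsMatch(email, Pattern);
    }

    public static void Main()
    {
        // Two addresses used as examples in this article, followed by three
        // of the addresses listed in the Background section as invalid.
        string[] samples =
        {
            "John.Connor@test.maildomain.com",
            "max@123reg.co.in",
            "..@test.com",
            "ab@120.256.256.120",
            "2@bde.cc"
        };

        foreach (string s in samples)
            Console.WriteLine("{0} -> {1}", s, IsEmail(s));
    }
}
```

The first two addresses should be reported as matches and the last three as non-matches.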
Validation Expression
The email matching expression given above looks complex because it is, so some explanation is in order. The matching expression assigned to the MatchEmailPattern constant consists of four parts. The whole expression begins with the ^ sign and ends with the $ sign, which ensures that the entire email string is matched and not just some part of it. The expression parts are combined into the whole expression as follows:
Expression = ExpressionPart1 AND (ExpressionPart2 OR (ExpressionPart3 AND ExpressionPart4))
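The combination rule above can be sketched by building the full pattern from named fragments. The class and constant names (ComposedEmailPattern, Octet, Part1 to Part4, Full) are assumptions made for this illustration and are not part of the sample project; the resulting string is intended to be identical to MatchEmailPattern.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper that assembles the article's pattern from its four parts.
public static class ComposedEmailPattern
{
    // One IP address octet (0 to 255), reused four times in Part2.
    private const string Octet = @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])";

    // Part 1: user name followed by the @-sign.
    public const string Part1 = @"(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@";

    // Part 2: host given as an IP address.
    public const string Part2 =
        "(" + Octet + @"\." + Octet + @"\." + Octet + @"\." + Octet + "){1}";

    // Part 3: one or more subdomains, each followed by a dot.
    public const string Part3 = @"([a-zA-Z0-9]+[\w-]+\.)+";

    // Part 4: top-level domain.
    public const string Part4 = @"[a-zA-Z]{1}[a-zA-Z0-9-]{1,23}";

    // Part1 AND (Part2 OR (Part3 AND Part4)), anchored at both ends.
    public const string Full = "^" + Part1 + "(" + Part2 + "|" + Part3 + Part4 + ")$";

    public static bool IsEmail(string email)
    {
        return email != null && Regex.IsMatch(email, Full);
    }
}
```

In a regular expression, AND is simply concatenation and OR is the | operator, which is why the parentheses around Part2 and Part3 + Part4 are needed.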
Now let's describe each part of the matching regular expression.
Expression Part 1
(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@
Matches the user name part of an email address. This expression part is matched as follows:
([\w-]+\.)+[\w-]+
- matches a user name made of dot-separated groups (e.g. John.Connor of John.Connor@test.maildomain.com).
Rules: There must be at least two dot-separated groups. Each group must contain at least one alphanumeric, '_' or '-' character. Every group except the last one before the @-sign is followed by a '.' character.
- or
([a-zA-Z]{1}|[\w-]{2,})
- matches a user name without '.' characters (e.g. Max-Brown of Max-Brown@test.maildomain.com).
Rules: If the user name consists of only one character, it must be a letter (a to z or A to Z); otherwise it must consist of at least two alphanumeric, underscore or minus characters.
@
- matches the @-sign of an email address.
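To experiment with the user name rules in isolation, the first part can be anchored on its own, without the trailing @-sign. This is only a sketch; the class name LocalPartCheck and the method IsValidLocalPart are assumptions made for the example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper that applies only Expression Part 1 (minus the @-sign).
public static class LocalPartCheck
{
    // Either dot-separated groups, or a single letter, or two or more
    // word/minus characters without dots.
    private const string LocalPart = @"^(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))$";

    public static bool IsValidLocalPart(string userName)
    {
        return userName != null && Regex.IsMatch(userName, LocalPart);
    }
}
```

With this helper, John.Connor, Max-Brown and a single letter such as a are accepted, while 2, -, _, .a, a. and a..b are rejected. Note that a two-character name such as -- is accepted, because the [\w-]{2,} branch does not require a letter.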
Expression Part 2
(([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}
Matches the host part of an email address if the host is given as an IP address. As you can see, this expression part consists of four identical groups, which match the first to the fourth octet of the IP address respectively. The whole expression part is matched as follows:
([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.
- matches one of the first three octets of an IP address in the email string (e.g. 195.250.101. of John.Connor@195.250.101.219).
Rules: One of the following alternatives, followed by a mandatory '.' character at the end of the octet group:
- an optional first character '0' to '1', followed by one or two characters '0' to '9'
- the mandatory first characters '25', followed by a mandatory third character '0' to '5'
- a mandatory first character '2', a mandatory second character '0' to '4' and a mandatory third character '0' to '9'
([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])
- matches the fourth and last octet of an IP address in the email string (e.g. 219 of John.Connor@195.250.101.219).
Rules: The same as for the first three octets, but without the '.' character at the end.
{1}
- matches exactly one "four-octet" group right after the @-sign.
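The octet rules can be checked against a plain numeric range test. The sketch below assumes a hypothetical helper class OctetCheck; it anchors the octet group on its own and counts how many values from 0 to 999 are classified differently by the pattern and by the comparison i <= 255.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper that applies only the octet group of Expression Part 2.
public static class OctetCheck
{
    private const string Octet = @"^([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])$";

    public static bool IsOctet(string text)
    {
        return text != null && Regex.IsMatch(text, Octet);
    }

    // Counts values in 0..999 where the pattern disagrees with a range check.
    public static int CountMismatches()
    {
        int mismatches = 0;
        for (int i = 0; i <= 999; i++)
        {
            bool byRegex = IsOctet(i.ToString(CultureInfo.InvariantCulture));
            bool byRange = i <= 255;
            if (byRegex != byRange) mismatches++;
        }
        return mismatches;
    }
}
```

CountMismatches should return 0, which means the three alternatives together cover exactly the range 0 to 255. The pattern also accepts forms with leading zeros, such as 007, because the first alternative allows an optional leading '0' or '1'.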
Expression Part 3
([a-zA-Z0-9]+[\w-]+\.)+
Matches the subdomains of the host name part of an email (e.g. test.maildomain of John.Connor@test.maildomain.com or 123reg.co of max@123reg.co.in).
Rules: There must be at least one subdomain, and each subdomain is followed by a '.' character. The first character is mandatory and must be a letter or a digit (a to z, A to Z or 0 to 9). The second character is mandatory and must be an alphanumeric, underscore or minus character. The third and all following characters are optional and must be alphanumeric, underscore or minus characters.
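A small sketch makes the consequences of these rules visible. The class name SubdomainCheck is an assumption made for the example, and the input is expected to include the trailing dot of each subdomain, exactly as the expression part does.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper that applies only Expression Part 3.
public static class SubdomainCheck
{
    // One or more labels of at least two characters, each followed by a dot.
    private const string Subdomains = @"^([a-zA-Z0-9]+[\w-]+\.)+$";

    public static bool IsSubdomainPrefix(string text)
    {
        return text != null && Regex.IsMatch(text, Subdomains);
    }
}
```

For example, test.maildomain. and 123reg.co. are accepted. A one-character label such as s. is rejected because both the first and the second character are mandatory, and -ab. is rejected because a label may not start with a minus character.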
Expression Part 4
[a-zA-Z]{1}[a-zA-Z0-9-]{1,23}
Matches the top-level domain of the host name part of an email (e.g. com of John.Connor@test.maildomain.com or PHOTOGRAPHY of INFO@MY.PHOTOGRAPHY).
Rules: Must begin with one letter (a to z, A to Z), followed by at least one and at most twenty-three alphanumeric or minus characters (a to z, A to Z, 0 to 9 or -). The current list of top-level domains (TLDs) is maintained by IANA. If you would like to match a top-level domain with more than twenty-four characters (the current maximum TLD length), just change the number 23 in the given expression to the desired length minus one. For example, to match a top-level domain with up to 100 characters, you could use the following expression:
[a-zA-Z]{1}[a-zA-Z0-9-]{1,99}
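The "desired length minus one" rule can be wrapped in a small helper so that the maximum TLD length becomes a parameter. This is a sketch only; the class TldPattern and its methods Build and IsTld are hypothetical names that do not exist in the sample project.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper that builds Expression Part 4 for a given maximum length.
public static class TldPattern
{
    // Returns the TLD fragment for TLDs of 2 to maxLength characters.
    public static string Build(int maxLength)
    {
        if (maxLength < 2)
            throw new ArgumentOutOfRangeException("maxLength");

        // One leading letter, then between 1 and (maxLength - 1) further characters.
        string upper = (maxLength - 1).ToString(CultureInfo.InvariantCulture);
        return "[a-zA-Z]{1}[a-zA-Z0-9-]{1," + upper + "}";
    }

    // Checks a TLD on its own by anchoring the fragment at both ends.
    public static bool IsTld(string tld, int maxLength)
    {
        return tld != null && Regex.IsMatch(tld, "^" + Build(maxLength) + "$");
    }
}
```

Build(24) yields the fragment used in the article's expression, and Build(100) yields the {1,99} variant shown above.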
Using the Project Sample
To use the project sample, download the file EmailValidator.zip from the link above and unzip it into a directory on your computer. The sample project was created and tested in Visual Studio 2005 and was cleaned before being packaged into the zip file. Therefore, if you work with VS 2005, you just have to open the project and build it. If you use VS 2003, you can create a new console application project, remove the automatically created Program.cs file, add the EmailValidator.cs and TestEmail.cs files, and then build the project. With VS 2008, you can convert the sample EmailValidator project into the new format and then build it. Once the sample project is built, you can start the EmailValidator.exe program from the IDE, directly, or by using the test.bat file. The last option is the most interesting one, because it passes many email addresses to the program as start parameters.
How Perfect You'd Like To Be ?
Just as you can express the same statement in different sentences, you can write many different regular expressions that define the same text matching rule. The email validation expression given here, however, does not accept all permitted characters. For example, according to RFC 2822 (see Chapters 3.2.1, 3.4.1), the local part of an email address may use the following ASCII characters: ! # $ % * / ? | ^ { } ` ~ & ' + - = _. Some mail systems, on the other hand, are more restrictive. Hotmail, for example, only allows email addresses with alphanumeric and . _ - characters, and refuses both to create a mail account for and to send mail to an email address containing ! # $ % * / ? | ^ { } ` ~ & ' + = characters in the local part. The presented validation expression was composed with compatibility in mind and should not be seen as a general solution. It can be freely modified to satisfy your particular needs.
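As one possible modification, the user name part could be widened to accept the additional characters listed above. The sketch below is an assumption about how that might look, not part of the sample project: the class name PermissiveLocalPart is hypothetical, and the pattern covers only the dot-separated form of a local part (no quoted strings or comments).

```csharp
using System.Text.RegularExpressions;

// Hypothetical, more permissive replacement for the user name rules of Part 1.
public static class PermissiveLocalPart
{
    // Letters, digits and the special characters listed in the text above.
    private const string Atom = @"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+";

    // One or more atoms separated by single dots; no leading, trailing
    // or doubled dots.
    private const string LocalPart = "^" + Atom + @"(\." + Atom + ")*$";

    public static bool IsValidLocalPart(string userName)
    {
        return userName != null && Regex.IsMatch(userName, LocalPart);
    }
}
```

With this variant, user names such as john+tag or o'brien are accepted, while .a, a. and a..b are still rejected. Whether such addresses are actually usable depends on the mail systems involved, as the Hotmail example shows.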
Conclusion
Even though regular expressions are often cryptic and tricky, they are very powerful for text processing tasks such as validating or constraining user input, search and replace, and so on. Just imagine having to code all the email matching rules in a "normal" way, with "if" statements for example. Instead of a few lines of code, you would get a lot of "if-else" code that is neither more readable nor less buggy. That's all I'd like to say for now. So, if you've found this article useful, you might want to vote for it. :)
Revision History
- 23-May-2015 Long TLDs (top-level domains) and domains with first number-char are now supported
- 18-Aug-2008 Correction of the matching expression to match IP-addresses with octets like 2[0-4][6-9]
- 27-Jul-2008 Clarification of the expression part 1 corrected
- 18-Jan-2008 Web Link to RFC 2822 document added
- 14-Jan-2008 Revision history added
- 12-Jan-2008 Considerations to compatibility of the given validation expression to the actual standards added (see "How perfect you'd like to be ?")
- 08-Jan-2008 Original article