Introduction
There may come a time when you have a document full of various pieces of data, possibly one that contains information on a long list of people, but you only need the addresses. This is where RegEx comes in handy. You can use RegEx to iterate through the characters of the document and locate a string that matches the pattern you specify. RegEx can also be used to check a short string to see if its format and contents match a specified pattern. For a detailed reference on RegEx, check out this article.
Using the Code
To start, the easiest piece of the address to match is the zip code, although it is also the least exact.
A simple pattern to match a zip code would look like the following:
ZIP Code
\b\d{5}(?:-\d{4})?\b
That pattern matches five primary digits, optionally followed by a hyphen and four extended digits. It matches every zip code, but it can also match any other five-digit number that is not a zip code. Combining it with the rest of the address pattern cuts down on those false matches.
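As a quick illustration of both the matches and the false positives, here is a minimal Python sketch using the standard `re` module. The sample text and the name `ZIP_RE` are made up for this example.

```python
import re

# Zip pattern from above: five digits, then an optional hyphen and four digits.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")

# Hypothetical sample text; the order number is not a zip code.
text = "Ship to 62704 or 62704-1234, order number 12345."

# The group is non-capturing, so findall returns the full matches.
matches = ZIP_RE.findall(text)
print(matches)  # ['62704', '62704-1234', '12345']
```

The last hit is the order number, which is exactly the kind of false match described above.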
Next, we need to match a city and state. This pattern will match most cities:
City
(?:[A-Z][a-z.-]+[ ]?)+
There is room for false matches with this pattern too, since it accepts any run of capitalized words, but when we add the state pattern, it will be much more accurate.
The only sure way to test for a state name is to create a pattern that contains the name of each state. It's long, but it only ever matches a correctly spelled state name.
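To see what the city pattern accepts, the following Python sketch runs it against a few city names and against one phrase that is not a city. The sample strings and the name `CITY_RE` are assumptions chosen for illustration.

```python
import re

# City pattern from above: one or more capitalized words, each optionally
# followed by a single space. Periods and hyphens are allowed inside a word.
CITY_RE = re.compile(r"(?:[A-Z][a-z.-]+[ ]?)+")

# Hypothetical samples covering a period, a hyphen, and two words.
samples = ["Springfield", "St. Louis", "Winston-Salem", "New York"]
for name in samples:
    print(name, bool(CITY_RE.fullmatch(name)))

# Any run of capitalized words matches, so this is a false match.
false_hit = CITY_RE.search("Dear John lives here")
print(repr(false_hit.group()))  # 'Dear John '
```

Note that the match can include a trailing space, because each word in the pattern may be followed by one.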
State
Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida|Georgia|Hawaii|
Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|
Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New[ ]Hampshire|New[ ]Jersey|New[ ]Mexico
|New[ ]York|North[ ]Carolina|North[ ]Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode[ ]Island
|South[ ]Carolina|South[ ]Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West[ ]Virginia
|Wisconsin|Wyoming
I've added line breaks for readability, but make sure to remove them when using the pattern. This is a surefire way to test for states that are spelled out, but in some addresses the states are abbreviated.
State Abbreviations
AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT
|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY
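One way to keep the readable line breaks in source code and still remove them before compiling is sketched below in Python. The helper `join_lines` and the variable names are hypothetical. The sketch strips each line instead of removing all whitespace, because the `[ ]` classes contain a space that must survive.

```python
import re

STATE_NAMES = """
Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida|Georgia|Hawaii|
Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|
Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New[ ]Hampshire|New[ ]Jersey|New[ ]Mexico
|New[ ]York|North[ ]Carolina|North[ ]Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode[ ]Island
|South[ ]Carolina|South[ ]Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West[ ]Virginia
|Wisconsin|Wyoming
"""

STATE_ABBREVS = """
AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT
|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY
"""


def join_lines(pattern: str) -> str:
    """Remove the line breaks that were added only for readability."""
    return "".join(line.strip() for line in pattern.splitlines())


STATE_PATTERN = join_lines(STATE_NAMES)
ABBREV_PATTERN = join_lines(STATE_ABBREVS)

# Either a spelled-out name or an abbreviation.
STATE_RE = re.compile(rf"(?:{STATE_PATTERN}|{ABBREV_PATTERN})")

for candidate in ["New Hampshire", "West Virginia", "DC", "Atlantis"]:
    print(candidate, bool(STATE_RE.fullmatch(candidate)))
```

`fullmatch` is used here because the sketch checks a short string that should contain nothing but the state.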
When you are searching for a city, state, zip combination, combine the patterns like this:
City, State, Zip
{city pattern},[ ](?:{state pattern}|{abbrev. state pattern})[ ]{zip pattern}
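The placeholders can be filled in with string interpolation. The Python sketch below does that with an f-string. To keep it short, it uses only a three-state subset of each state list, which is an assumption of this example. Substitute the full lists in real use. The sample text is also made up.

```python
import re

CITY = r"(?:[A-Z][a-z.-]+[ ]?)+"
ZIP = r"\b\d{5}(?:-\d{4})?\b"

# Shortened subsets for illustration only; use the full lists in practice.
STATE = r"Illinois|New[ ]York|Texas"
ABBREV = r"IL|NY|TX"

# Same layout as the combined pattern: city, comma, space, state, space, zip.
CITY_STATE_ZIP = re.compile(rf"{CITY},[ ](?:{STATE}|{ABBREV})[ ]{ZIP}")

# Hypothetical sample text with a name and a stray five-digit number.
text = "Contact: Jane Doe, Springfield, IL 62704-1234. Ref 12345."

match = CITY_STATE_ZIP.search(text)
print(match.group())  # Springfield, IL 62704-1234
```

Neither "Jane Doe" nor "12345" is matched on its own, because each piece now has to appear next to the others.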
To finish off our address RegEx pattern, we need to test for a street address.
Street
\d+[ ](?:[A-Za-z0-9.-]+[ ]?)+(?:Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)\.?
For the full pattern, put this street pattern first, follow it with \s to match either a space or a line break, then add the city, state, zip pattern, and you're done!
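Assembled in Python, the full pattern might look like the sketch below. It assumes the street pattern comes first and again uses a one-state subset in place of the full state lists. The sample addresses and the name `ADDRESS_RE` are invented for the example.

```python
import re

STREET = (
    r"\d+[ ](?:[A-Za-z0-9.-]+[ ]?)+"
    r"(?:Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)\.?"
)
CITY = r"(?:[A-Z][a-z.-]+[ ]?)+"
ZIP = r"\b\d{5}(?:-\d{4})?\b"

# Shortened subset for illustration only; use the full state lists in practice.
STATES = r"Illinois|IL"

# Street, then \s (a space or a line break), then city, state, zip.
ADDRESS_RE = re.compile(rf"{STREET}\s{CITY},[ ](?:{STATES})[ ]{ZIP}")

# Hypothetical inputs: one address split over two lines, one on a single line.
two_lines = "Ship to 123 Main St.\nSpringfield, IL 62704 today."
one_line = "123 Main Street Springfield, Illinois 62704-1234"

for text in (two_lines, one_line):
    found = ADDRESS_RE.search(text)
    print(repr(found.group()))
```

The street suffix list only covers the endings named in the pattern, so an address ending in something else, such as "Terrace", would need that suffix added.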