This week, I have a lot of good challenges in SQL as we are building an SSIS package to unify data from different systems like Solomon, Raisers Edge, Goldmine, etc. A part of that SSIS Package is to cleanse the data and the toughest to cleanse is the addresses (which means putting the correct data on the correct fields like Post Code and Suburbs). Now I was quite challenged in one scenario where I need to extract Suburb
field - I need to get that data form a consolidated address field (everything in there) and I want to do it in one Select
statement!
So what are my components? I have a table of correct suburb
s and a table of unclean addresses.
i.e. Suburb
Table looks like this:
and Address
table looks like this:
So how would I extract the correct suburb
in one Select
? Initially, what I was doing was:
SELECT a.CompleteAddress, s.Suburb FROM tblCrap_Addresses a
INNER JOIN tblLookup_Suburb s ON a.CompleteAddress LIKE
(SELECT TOP 1 '%' + s.Suburb + '%' )
But there’s a problem with this one as if I have an address like 999 X Street, Mount Albert and in my suburbs I have Mount, Mount Al, and Mount Albert I will definitely get a wrong join on the suburb and will get the Mount as the suburb instead of Mt Albert. I tried a lot of stuff like indexing the Suburb
table to the length of Mt Albert so it gets the longest length DESC
so it joins on the longest string
first but still does not work. It's also impossible to put a condition on the LIKE
while doing a JOIN
. Then I suddenly remembered the APPLY
operator.
So what is an APPLY
operator? According to Technet:
“The APPLY
operator allows you to invoke a table-valued function for each row returned by an outer table expression of a query. The table-valued function acts as the right input and the outer table expression acts as the left input. The right input is evaluated for each row from the left input and the rows produced are combined for the final output. The list of columns produced by the APPLY
operator is the set of columns in the left input followed by the list of columns returned by the right input.”
So my new command now is:
SELECT
a.CompleteAddress
s1.Suburb
FROM
tblCrap_Address a
OUTER APPLY
(SELECT TOP 1 s2.Suburb FROM tblLookup_Suburb s2 where
a.CompleteAddress
LIKE '%' + s2.Suburb + '%' ORDER BY LEN(s2.Suburb) DESC) s1