Duplicates - the bane of any good database. Some duplicates are technical - for whatever reason, the primary key wasn't on the table and an extra record was somehow inserted. Good table architecture should minimize these duplicates.
But duplicates from the business domain can be just as bad, and harder to avoid and pinpoint - unless business rules and validation checks are put in place. Is that record a duplicate or just 2 people with the same names and same dates of birth? An order for the same book was put through twice within seconds by the same customer - is it really a duplicate [clicked submit twice?] or did they legitimately order a 2nd copy? Assuming the people and book order records were inserted into tables with identity fields, there's no technical duplicate. But a business domain duplicate? That's harder to judge.
The latter scenario happened to me recently on Amazon.com. When I within seconds decided to order a 2nd copy of a book, Amazon asked me up-front whether I actually intended to make the extra purchase. This helps minimize Amazon's business domain duplicates [at least for orders].
At Boston Public Schools, the most significant business domain duplicate we deal with is with students
. When a new student
is enrolled, a bunch of new records are inserted into various tables with information about the student
's demographic information, enrollment record, language information, etc. We want each student
to be represented once in the various tables. Fine - this shouldn't be a problem.
But let's say a family just returned back to Boston after being away for a few years. Their kids previously attended school in Boston, and now they're back to register for school again. If the family explains they were previously in Boston Public Schools, the kids will get the same student
numbers they had before. However, this doesn't always happen, and it is up to the MyBPS user and/or the system to determine this. Users of MyBPS registering students for BPS generally do a name lookup, but sometimes this doesn't work [misspellings, changed information, etc]. The system needs to be smart enough to sniff out a potential duplicate before it gets inserted.
And that's exactly the logic I put in place a couple years ago. I wrote a stored procedure that takes in student
information inputs [name, date of birth, city of birth, current address, race, gender, parent name, etc.] and outputs a list of students already on the system that might, possibly, represent the student being registered for school. A warning message displays to the MyBPS user, and they can either associate the existing number with the student
or click a checkbox
acknowledging that there's no duplicate.
The procedure itself first grabs students
who might be duplicates simply based on name, date of birth [2 of the 3 parts [of mm, dd, yyyy] must match], and gender [must match]. For the name match, I required the last names start with the same letter and I required the difference SQL function to return a 3 or 4 [see my previous post for information about difference and soundex]. Such potential duplicates are put into a table variable which is then used for another check.
For the students identified as possible duplicates, I then look at 7 different fields and require at least 3 to match. Here's some of the SQL from the stored procedure that is doing the check [this is part of a larger query and will not run stand-alone, but it shows the idea]:
(CASE WHEN @parentlastname1 = 'NONE' THEN 0
WHEN difference(parent1.lastname, @parentlastname1) >= @MinimumSimilarity THEN 1
WHEN difference(parent2.lastname, @parentlastname1) >= @MinimumSimilarity THEN 1
WHEN @parentlastname2 = 'NONE' THEN 0
WHEN difference(parent1.lastname, @parentlastname2) >= @MinimumSimilarity THEN 1
WHEN difference(parent2.lastname, @parentlastname2) >= @MinimumSimilarity THEN 1
ELSE 0 END) AS IsParentSimilarscoreInfo,
(CASE WHEN geocode = @geo THEN 1 ELSE 0 END) AS IsGeoSame,
(CASE WHEN @streetname = 'NONE' THEN 0
WHEN difference(StudentData.street, @streetname) >= @MinimumSimilarity THEN 1
ELSE 0 END) AS IsStreetNameSimilar,
(CASE WHEN @phonenumber = 'NONE' THEN 0
WHEN StudentData.hphone = @phonenumber THEN 1
ELSE 0 END) AS IsPhoneNumberSame,
(CASE WHEN @cityofbirth = 'NONE' THEN 0
WHEN difference(StudentData.cityofbirth, @cityofbirth) >= @MinimumSimilarity THEN 1
ELSE 0 END) AS IsCityOfBirthSimilar,
(CASE WHEN Datediff(DAY, @dob, StudentData.dob) = 0 THEN 1
ELSE 0 END) AS IsDOBSame,
(CASE WHEN difference(StudentData.lastname, @lastname) >= (@MinimumSimilarity+1) THEN 1
ELSE 0 END) AS IsLastNameSimilar,
(CASE WHEN StudentData.lastname = @lastname AND StudentData.firstname = @firstname THEN 1
ELSE 0 END) AS IsNameSame
The query needed a bit of tweaking to get it right. For example, at one point @MinimumSimilarity
was set to 4 [so a perfect soundex score was needed for the text matches], but clear duplicates weren't being returned as such with this setting. Also, I previously required 4 of 7 fields to match, but this was a tad too restrictive. So I kept experimenting until there was a good balance between disregarding real duplicates and having too many false positives.