Preventing Duplicates in the Data

Andrew Zwicker

0.00/5 (No votes)

16 May 2011CPOL3 min read

7.8K

How to prevent duplicates in the data

Duplicates - the bane of any good database. Some duplicates are technical - for whatever reason, the primary key wasn't on the table and an extra record was somehow inserted. Good table architecture should minimize these duplicates.

But duplicates from the business domain can be just as bad, and harder to avoid and pinpoint - unless business rules and validation checks are put in place. Is that record a duplicate or just 2 people with the same names and same dates of birth? An order for the same book was put through twice within seconds by the same customer - is it really a duplicate [clicked submit twice?] or did they legitimately order a 2^nd copy? Assuming the people and book order records were inserted into tables with identity fields, there's no technical duplicate. But a business domain duplicate? That's harder to judge.

The latter scenario happened to me recently on Amazon.com. When I within seconds decided to order a 2^nd copy of a book, Amazon asked me up-front whether I actually intended to make the extra purchase. This helps minimize Amazon's business domain duplicates [at least for orders].

At Boston Public Schools, the most significant business domain duplicate we deal with is with students. When a new student is enrolled, a bunch of new records are inserted into various tables with information about the student's demographic information, enrollment record, language information, etc. We want each student to be represented once in the various tables. Fine - this shouldn't be a problem.

But let's say a family just returned back to Boston after being away for a few years. Their kids previously attended school in Boston, and now they're back to register for school again. If the family explains they were previously in Boston Public Schools, the kids will get the same student numbers they had before. However, this doesn't always happen, and it is up to the MyBPS user and/or the system to determine this. Users of MyBPS registering students for BPS generally do a name lookup, but sometimes this doesn't work [misspellings, changed information, etc]. The system needs to be smart enough to sniff out a potential duplicate before it gets inserted.

And that's exactly the logic I put in place a couple years ago. I wrote a stored procedure that takes in student information inputs [name, date of birth, city of birth, current address, race, gender, parent name, etc.] and outputs a list of students already on the system that might, possibly, represent the student being registered for school. A warning message displays to the MyBPS user, and they can either associate the existing number with the student or click a checkbox acknowledging that there's no duplicate.

The procedure itself first grabs students who might be duplicates simply based on name, date of birth [2 of the 3 parts [of mm, dd, yyyy] must match], and gender [must match]. For the name match, I required the last names start with the same letter and I required the difference SQL function to return a 3 or 4 [see my previous post for information about difference and soundex]. Such potential duplicates are put into a table variable which is then used for another check.

For the students identified as possible duplicates, I then look at 7 different fields and require at least 3 to match. Here's some of the SQL from the stored procedure that is doing the check [this is part of a larger query and will not run stand-alone, but it shows the idea]:

SQL

-- Note: A field with 'NONE' indicates the user hasn't entered this data field yet.
-- Parent similarity [check both parent names].
(CASE WHEN @parentlastname1 = 'NONE' THEN 0
 WHEN difference(parent1.lastname, @parentlastname1) >= @MinimumSimilarity THEN 1
 WHEN difference(parent2.lastname, @parentlastname1) >= @MinimumSimilarity THEN 1
 WHEN @parentlastname2 = 'NONE' THEN 0
 WHEN difference(parent1.lastname, @parentlastname2) >= @MinimumSimilarity THEN 1
 WHEN difference(parent2.lastname, @parentlastname2) >= @MinimumSimilarity THEN 1
 ELSE 0 END) AS IsParentSimilarscoreInfo,
-- Geo Code [Boston is broken into 800+ geo codes each of which is a few city blocks].
  (CASE WHEN geocode = @geo THEN 1 ELSE 0 END) AS IsGeoSame,
-- Street name
  (CASE WHEN @streetname = 'NONE' THEN 0
 WHEN difference(StudentData.street, @streetname) >= @MinimumSimilarity THEN 1
 ELSE 0 END) AS IsStreetNameSimilar,
-- Phone number. 
  (CASE WHEN @phonenumber = 'NONE' THEN 0
 WHEN StudentData.hphone = @phonenumber THEN 1
 ELSE 0 END) AS IsPhoneNumberSame,
-- City of birth
  (CASE WHEN @cityofbirth = 'NONE' THEN 0
 WHEN difference(StudentData.cityofbirth, @cityofbirth) >= @MinimumSimilarity THEN 1
 ELSE 0 END) AS IsCityOfBirthSimilar,
-- Is the DOB the same?  
  (CASE WHEN Datediff(DAY, @dob, StudentData.dob) = 0 THEN 1
 ELSE 0 END) AS IsDOBSame,
-- Is the student's last name very similar?
  (CASE WHEN difference(StudentData.lastname, @lastname) >= (@MinimumSimilarity+1) THEN 1
 ELSE 0 END) AS IsLastNameSimilar, 
-- Is the student's name the same?
  (CASE WHEN StudentData.lastname = @lastname AND StudentData.firstname = @firstname THEN 1
 ELSE 0 END) AS IsNameSame

The query needed a bit of tweaking to get it right. For example, at one point @MinimumSimilarity was set to 4 [so a perfect soundex score was needed for the text matches], but clear duplicates weren't being returned as such with this setting. Also, I previously required 4 of 7 fields to match, but this was a tad too restrictive. So I kept experimenting until there was a good balance between disregarding real duplicates and having too many false positives.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)