How Can I Find Duplicate Values in SQL Server?

essentialSQL

5.00/5 (2 votes)

4 Jan 2016MIT6 min read

17.7K

How to find duplicate values in SQL Server

In this article, find out how to find duplicate values in a table or view using SQL. We’ll go step by step through the process. We’ll start with a simple problem, slowly build up the SQL, until we achieve the end result.

By the end, you’ll understand the pattern used to identify duplicate values and be able to use it in your database.

All the examples for this lesson are based on Microsoft SQL Server Management Studio and the AdventureWorks2012 database. You can get started using these free tools using my Guide Getting Started Using SQL Server.

Find Duplicate Values in SQL Server

Let’s get started. We’ll base this article on a real-world request; the human resource manager would like you to find all employees sharing the same birthday. She would like the list sorted by BirthDate and EmployeeName.

After looking at the database, it becomes apparent the HumanResources.Employee table is the one to use as it contains employee birthdates.

At first glance, it seem like it would be pretty easy to find duplicate values in SQL server. After all, we can easily sort the data.

Duplicate Birth Dates

But once the data is sorted, it gets harder! Since SQL is a set based language, there is no easy way, except for using cursors, to know the previous record’s values.

If we knew these, we could just compare values, and when they were the same, flag the records as duplicates.

Luckily, there is another way for us to do this. We’ll use an INNER JOIN to match employee birthdays. By doing so, we’ll get a list of employees sharing the same birthdate.

This is going to be a build-as you go article. I’ll start out with a simple query, show results, and point out what needs refinement and move on. We’ll start out with getting a list of employees and their birthdates.

Step 1 – Get a List of Employees Sorted By Birthdate

When working with SQL, especially in uncharted territory, I feel it is better to build a statement in small steps, verifying results as you go, rather than writing the “final” SQL in one step, to only find I need to troubleshoot it.

Hint: If you’re working with a very large database, then it may make sense to make a smaller copy as your dev or test version and use that to write your queries. That way, you don’t kill the production database’s performance and get everyone down on you.

So for our first step, we are going to list all employees. To do so, we’ll join the Employee table to the Person table to so we can get the employee’s name.

Here is the query so far

SQL

SELECT   E1.BusinessEntityID,
         P.FirstName + ' ' + p.LastName AS FullName,
         E1.BirthDate
FROM     HumanResources.Employee AS E1
         INNER JOIN
         Person.Person AS P
         ON P.BusinessEntityID = E1.BusinessEntityID
ORDER BY E1.BirthDate, FullName

If you look at the result, you see we have all the elements of the HR manager’s request, except that we’re displaying every employee.

Employee Names and Birth Dates

In the next step, we’ll set up the results so we can start to compare birth dates to find duplicate values.

STEP 2 – Compare Birthdates to Identify Duplicates

Now that we have a list of employees, we now need a means to compare birthdates so we can identify employees with the same birthdates. In general, these are duplicate values.

To do the comparison, we’ll do a self-join on the employee table. A self-join is just a simplified version of an INNER JOIN. We start out using BirthDate as our join condition. This ensures we’re only retrieving employees with the same birth date.

SQL

SELECT   E1.BusinessEntityID,
         E2.BusinessEntityID,
         P.FirstName + ' ' + p.LastName AS FullName,
         E1.BirthDate
FROM     HumanResources.Employee AS E1
         INNER JOIN
         Person.Person AS P
         ON P.BusinessEntityID = E1.BusinessEntityID
         INNER JOIN
         HumanResources.Employee AS E2
         ON E2.BirthDate = E1.BirthDate
ORDER BY E1.BirthDate, FullName

I added E2.BusinessEntityID to the query so you can compare the primary key from both E1 and E2. You see in many cases that they are the same.

The reason we’re focusing on BusinessEntityID is that it is the primary key and the unique identifier for the table. It becomes a highly concise and convenient means to identify a row’s results and to understand its source.

We’re getting closer to obtaining our final result, but once you check out the results, you’ll see we’re picking up the same record in both the E1 and E2 match.

Check out the items circled in red. Those are the false positives we need to eliminate from our results. Those are the same rows matching to themselves.

The good news is we are really close to just identifying the duplicates.

I circled a 100% guaranteed duplicate in blue. Notice that the BusinessEntityIDs are different. This indicates the self-join is matching BirthDate on different rows – true duplicates to be sure.

Duplicates Identified, but with False Positives

In the next step, we’ll take those false positives head on and remove them from our results.

Step 3 – Eliminate Matches to Same Row – Remove False Positives

In the prior step, you may have noticed all the false positive matches have the same BusinessEntityID; whereas, the true duplicates were not equal.

This is our big hint.

If we want to only see duplicates, then we need to only bring back matches from the join where the BusinessEntityID values are not equal.

To do this, we can add:

SQL

E2.BusinessEntityID <> E1.BusinessEntityID

As a join condition to our self-join. I’ve highlighted the added condition in bold.

SQL

SELECT   E1.BusinessEntityID,
         E2.BusinessEntityID,
         P.FirstName + ' ' + p.LastName AS FullName,
         E1.BirthDate
FROM     HumanResources.Employee AS E1
         INNER JOIN
         Person.Person AS P
         ON P.BusinessEntityID = E1.BusinessEntityID
         INNER JOIN
         HumanResources.Employee AS E2
         ON E2.BirthDate = E1.BirthDate
            AND E2.BusinessEntityID <> E1.BusinessEntityID
ORDER BY E1.BirthDate, FullName

Once this query is run, you’ll see there are fewer rows in the results, and those which remain are truly duplicates.

Duplicates with False Positives Removed

Since this was a business request, let’s clean up the query so we are only showing the information requested.

Step 4 – Final Touches

Let’s get rid of the BusinessEntityID values from the query. They were there only to help us troubleshoot.

The final query is listed here:

SQL

SELECT   P.FirstName + ' ' + p.LastName AS FullName,
         E1.BirthDate
FROM     HumanResources.Employee AS E1
         INNER JOIN
         Person.Person AS P
         ON P.BusinessEntityID = E1.BusinessEntityID
         INNER JOIN
         HumanResources.Employee AS E2
         ON E2.BirthDate = E1.BirthDate
            AND E2.BusinessEntityID <> E1.BusinessEntityID
ORDER BY E1.BirthDate, FullName

And here are the results you can present to the HR Manager!

Presentable Results

Mark, one of my readers, pointed out to me that if there are three employees that have the same birth dates, then you would have duplicates in the final results. I verified this and that is true. To return a list show each duplicate only once, you can use the DISTINCT clause. This query works in all cases:

SQL

SELECT   DISTINCT P.FirstName + ' ' + p.LastName AS FullName,
         E1.BirthDate
FROM     HumanResources.Employee AS E1
         INNER JOIN
         Person.Person AS P
         ON P.BusinessEntityID = E1.BusinessEntityID
         INNER JOIN
         HumanResources.Employee AS E2
         ON E2.BirthDate = E1.BirthDate
            AND E2.BusinessEntityID <> E1.BusinessEntityID
ORDER BY E1.BirthDate, FullName

Final Comments

To summarize, here are the steps we took to identify duplicate data in our table.

We first created a query of the data we want to view. In our example, this was the employee and their birth date.
We performed a self-join, INNER JOIN on the same table in geek speak, and using the field we deemed a duplicated. In our case, we wanted to find duplicate birthdays.
Finally, we eliminated matches to the same row by excluding rows where the primary keys were the same.

By taking a step by step approach, you can see we took a lot of the guess work out of creating the query.

If you’re looking to improve how you write your queries or are just confounded by it all and looking for a way to clear the fog, then may I suggest my guide Three Steps to Better SQL.

The post How can I find Duplicate Values in SQL Server? appeared first on Essential SQL.

License

This article, along with any associated source code and files, is licensed under The MIT License