Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / SQL

Remove Duplicate Rows with CTE

5.00/5 (9 votes)
14 Aug 2017CPOL2 min read 17.6K   69  
This article gives you quick tip about how you can remove duplicate rows easily using common table expression in SQL

Introduction

Many times, our database contains duplicate rows which can be filtered out using certain columns. This can happen for many reasons like bad data import, wrong SQL execution, bug in the user interface, etc. A developer had to remove these duplicate rows with some heroic query. Here, I am showing a simple way to remove duplicate rows.

Background

Suppose we have a table called EmployeeDesignation which holds employee ID and designation ID. The table may contain records like this:

EmployeeIDDesignationID
111
212
313
111
111
313

Notice this table has duplicate records like (1,11) 3 times and (3,13) 2 times. Now we want to keep only one distinct pair and remove the rest. Let's see how we can do that in the next section.

Using the Code

Let's quickly learn about Common Table Expression (CTE). CTE can be thought of a temporary result set. CTE gives you more readability and simplification of complex queries. A CTE looks like below:

SQL
WITH 
expression_name_1 (columns) AS
(CTE query definition 1)

[, expression_name_X AS
   (CTE query definition X)
 , etc ]

SELECT expression_A, expression_B, ... --User defined statement using CTE expressions
FROM expression_name_1 

CTE starts with the word "With" followed by an expression name and column names. Then the selection query that populates the expression. We can have multiple set of expression separated by comma followed by the final statement. We are not going into details of CTE as this is out of scope of this article. Let's look at how we can use CTE to remove duplicate rows. Take a look at the code below:

SQL
WITH EmpDesg (EMPID, DESGID, OCCURRENCE ) AS
(select EmployeeId,DesignationID, ROW_NUMBER() _
OVER(PARTITION BY EmployeeId, DesignationID ORDER BY EmployeeId) AS rowNumber from EmployeeDesignation)
select * from EmpDesg

In the above SQL, "EmpDesg" is the CTE expression which acts as a temporary view. It has three columns (EMPID, DESGID and OCCURRENCE). In the query definition, we are using Row_Number function and partitioning the table rows with EmployeeID and DesignationID to assign an occurrence number to each pair. The query will return a result like below:

EMPIDDESGIDOCCURRENCE
1111
1112
1113
2121
3131
3132

Now instead of the final select query, we can delete the rows from our temporary result set which has OCCURRENCE > 1.

SQL
WITH EmpDesg (EMPID, DESGID, OCCURRENCE ) AS
(select EmployeeId,DesignationID, ROW_NUMBER() 
 OVER(PARTITION BY EmployeeId, DesignationID ORDER BY EmployeeId) AS rowNumber from EmployeeDesignation)
delete from EmpDesg where OCCURRENCE >1

This will just keep only one occurrence of each Employee ID and Designation ID pair and remove the duplicates from EmployeeDesignation table.

We can think of CTE as a subquery. If we run delete statement against CTE, it will delete rows from the table until SQL can infer which table to update/delete based on CTE. Otherwise, it will give a error.

That's all!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)