Introduction
Many times, our database contains duplicate rows which can be filtered out using certain columns. This can happen for many reasons like bad data import, wrong SQL execution, bug in the user interface, etc. A developer had to remove these duplicate rows with some heroic query. Here, I am showing a simple way to remove duplicate rows.
Background
Suppose we have a table called EmployeeDesignation
which holds employee ID and designation ID. The table may contain records like this:
EmployeeID | DesignationID |
1 | 11 |
2 | 12 |
3 | 13 |
1 | 11 |
1 | 11 |
3 | 13 |
Notice this table has duplicate records like (1,11) 3 times and (3,13) 2 times. Now we want to keep only one distinct pair and remove the rest. Let's see how we can do that in the next section.
Using the Code
Let's quickly learn about Common Table Expression (CTE). CTE can be thought of a temporary result set. CTE gives you more readability and simplification of complex queries. A CTE looks like below:
WITH
expression_name_1 (columns) AS
(CTE query definition 1)
[, expression_name_X AS
(CTE query definition X)
, etc ]
SELECT expression_A, expression_B, ...
FROM expression_name_1
CTE starts with the word "With
" followed by an expression name and column names. Then the selection query that populates the expression. We can have multiple set of expression separated by comma followed by the final statement. We are not going into details of CTE as this is out of scope of this article. Let's look at how we can use CTE to remove duplicate rows. Take a look at the code below:
WITH EmpDesg (EMPID, DESGID, OCCURRENCE ) AS
(select EmployeeId,DesignationID, ROW_NUMBER() _
OVER(PARTITION BY EmployeeId, DesignationID ORDER BY EmployeeId) AS rowNumber from EmployeeDesignation)
select * from EmpDesg
In the above SQL, "EmpDesg
" is the CTE expression which acts as a temporary view. It has three columns (EMPID
, DESGID
and OCCURRENCE
). In the query definition, we are using Row_Number
function and partitioning the table rows with EmployeeID
and DesignationID
to assign an occurrence number to each pair. The query will return a result like below:
EMPID | DESGID | OCCURRENCE |
1 | 11 | 1 |
1 | 11 | 2 |
1 | 11 | 3 |
2 | 12 | 1 |
3 | 13 | 1 |
3 | 13 | 2 |
Now instead of the final select
query, we can delete the rows from our temporary result set which has OCCURRENCE > 1
.
WITH EmpDesg (EMPID, DESGID, OCCURRENCE ) AS
(select EmployeeId,DesignationID, ROW_NUMBER()
OVER(PARTITION BY EmployeeId, DesignationID ORDER BY EmployeeId) AS rowNumber from EmployeeDesignation)
delete from EmpDesg where OCCURRENCE >1
This will just keep only one occurrence of each Employee ID and Designation ID pair and remove the duplicates from EmployeeDesignation
table.
We can think of CTE as a subquery. If we run delete
statement against CTE, it will delete rows from the table until SQL can infer which table to update/delete based on CTE. Otherwise, it will give a error.
That's all!