Introduction
Duplicate rows cause many problems:
- Incorrect output
- Confusing exceptions
- Failures in working, tested code
- Increased database size
- Application crashes
- And many more.
In this article I explain how to identify duplicate rows in a database table and how to clean them up. I also cover duplicate XML nodes inside XML-type fields, which present the same problem. In real life various unique scenarios arise, each handled with its own technique; it is difficult to cover all of them, with solutions, in a single article. My intention here is to explain the basics of duplicate-row and duplicate-XML-node issues and their solutions, so that we can handle the different scenarios appropriately.
Background
Nobody creates duplicate rows in a database intentionally. Duplicate rows appear easily when tables have no primary or unique key, when data validation is lacking, and when business logic changes but the existing data is never properly corrected. Besides that, a few common scenarios create duplicate rows:
- Importing data from another database whose source already contains duplicate rows.
- Restructuring or reorganizing the database, during which duplicate rows may be created.
- Bugs in the existing application.
- Data-entry operators entering duplicate rows.
Identify Unique and Duplicate Rows
To identify unique and duplicate rows in a database table, I create a sample table, insert sample rows into it, and execute SQL queries against it. The table structure consists of three fields:
- Id
- Name
- DateOfBirth
create table MyTestTable(Id int, Name varchar(50), DateOfBirth datetime);
insert into MyTestTable(Id, Name, DateOfBirth)
values(1, 'A', '01-Jan-2010'), (2, 'B', '01-Jan-2011'), (2, 'B', '01-Jan-2011'),
(3, 'C', '01-Jan-2012'),(4, 'D', '01-Jan-2013'),(4, 'D', '01-Jan-2013');
After analyzing the data, it is clear that ids 1 and 3 are unique rows and the remaining rows are duplicates. Identifying the unique rows:
select max(Id) as Id, Name, DateOfBirth FROM MyTestTable
group by Name, DateOfBirth
having count(*) = 1;
In the query we group by the Name and DateOfBirth fields and apply the Max aggregate function to the Id field (if the table had more columns, we would group by all of them). Using Min instead of Max would give the same result. We filter the groups with the having count(*) = 1 clause; effectively, we ask the database for the groups that contain exactly one record.
To get the duplicate rows from the table instead, we just need to change the filter clause to count(*) > 1:
select max(Id) as Id, Name, DateOfBirth FROM MyTestTable
group by Name, DateOfBirth
having count(*) > 1;
To know the frequency of duplicate rows:
select Id, Name, DateOfBirth, count(*) as Total
from MyTestTable
group by Id, Name, DateOfBirth
having count(*) > 1;
Now let me explain a real-life scenario. In the sample rows I did not include a time in the DateOfBirth field, but a datetime field often stores a time along with the date. To illustrate, I clear the table first and insert the sample rows again, this time with times:
truncate table MyTestTable;
insert into MyTestTable (Id, Name, DateOfBirth)
values (1, 'A', '01-Jan-2010 12:00:01'), (2, 'B', '01-Jan-2011 12:00:05'), (2, 'B', '01-Jan-2011 12:00:06'),
(3, 'C', '01-Jan-2012 12:00:01'),(4, 'D', '01-Jan-2013 12:00:07'),(4, 'D', '01-Jan-2013 12:00:08');
Analyzing the new row-set, you will see that there are no duplicate rows any more: the times differ, and adding a time to the date makes each row unique. Run the duplicate-row identification query again:
select max(Id) as Id, Name, DateOfBirth FROM MyTestTable
group by Name, DateOfBirth
having count(*) > 1
It returns an empty result, which proves no duplicate rows exist in this table. But business policy often dictates that the time component should be ignored when judging duplicates. So I rewrite the query with this in mind:
select max(Id) as Id, Name, max(DateOfBirth) as DateOfBirth FROM MyTestTable
group by Name, convert(varchar(20),DateOfBirth, 104)
having count(*) > 1
To ignore the time portion of the datetime, I use the T-SQL Convert scalar function (style 104 formats the value as dd.mm.yyyy, dropping the time) together with the Max aggregate function.
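As a quick illustration of what the conversion produces (a small sketch; the converted value depends only on the date, not the time):

```sql
select convert(varchar(20), cast('01-Jan-2011 12:00:05' as datetime), 104);
-- returns '01.01.2011': the time is gone, so the two rows for 'B'
-- fall into the same group even though their stored times differ.
```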
Delete Duplicate Rows
If the database contains duplicate rows, we must clean them up. We have already identified the duplicate rows; now we start the clean-up operation. Deleting all duplicate rows may seem very easy and straightforward. See the SQL below:
delete from MyTestTable
where id in(select max(Id) FROM MyTestTable
group by Name, convert(varchar(20),DateOfBirth, 104)
having count(*) > 1)
Now there are no duplicate rows, but the query deleted real data too. How? Analyzing the data, you will find that ids 2 and 4 each had two rows: the first row of each pair was valid and only the second was actually a duplicate. Because the duplicated rows share the same Id value, filtering by Id deleted the valid first row along with its duplicate. So we should keep the first row of each group and delete the rest, which makes the delete query trickier. We need help from a CTE (Common Table Expression) and the row_number function: row_number creates a unique row-id for every row so that we can track each row individually, and the CTE lets us delete by that row-id. The delete query is:
WITH cte AS
(
select row_number() over(order by id asc) as rowid, Id, Name, DateOfBirth
from MyTestTable
)
DELETE FROM cte WHERE rowid in (select max(rowid) from cte
group by id, Name, DateOfBirth having count(*) > 1);
After executing the above query, the duplicate rows are gone.
A note regarding the delete query: the table has no unique field, such as a primary key, by which we can identify each row, so we create a dynamic unique field with the row_number function. The rowid field is virtual; it has no physical existence, so we cannot run a delete statement directly against the table based on that field. Instead, we use the CTE to execute the delete based on the virtual field and so clean the duplicate rows out of the table.
Please wait! The story is not finished yet. In our sample, ids 2 and 4 each have only two rows, so the delete query simply removes the second one. But what happens if a group contains more than two duplicate rows? The query deletes only the last duplicate row of each group, because of the Max aggregate in
select max(rowid) from cte
So the problem is not fully solved yet, but we are very close to a solution. We need a complete T-SQL query that deletes all duplicate rows except the first one in each group.
Solution-1
while (1=1)
begin
WITH cte AS
(
select row_number() over(order by id asc) as rowid, Id, Name, DateOfBirth from MyTestTable
)
delete from cte where rowid in (select max(rowid) from cte group by id, Name, DateOfBirth having count(*) > 1);
if @@ROWCOUNT = 0 break;
end
We start an infinite loop that deletes duplicate rows until none remain. The loop finishes when the SQL Server global variable @@ROWCOUNT returns 0, which happens when the delete statement finds no more duplicate rows in the source table.
Solution-1 works fine, but the code is not very elegant and is a little complex because of the infinite loop; many people dislike it for its lack of simplicity.
Solution-2
WITH cte AS
(
select row_number() over (partition by id,Name,
DateOfBirth order by id asc) as rowid, Id, Name, DateOfBirth
from MyTestTable
)
DELETE FROM cte WHERE rowid > 1;
Solution-2 is simpler than Solution-1. The virtual rowid field is generated group-wise by the row_number function, with the Id, Name, and DateOfBirth fields forming a group via the partition by clause; the generated rowid restarts at 1 for each group. The filter condition is now rowid > 1, which means each group keeps only its first row (rowid = 1) and the remaining rows are deleted.
Duplicate XML Node Identification and Deletion
The XML data type was introduced in SQL Server 2005. There are many reasons to use an XML-type field:
- Store multiple field values in a single field.
- Store schema and data together.
- Store hierarchical data.
- Store dynamic data whose schema is defined at runtime.
- Store a schema that can be reconfigured at runtime.
- Store various data templates.
- Validate data against a fixed schema.
- And many more.
Duplicate nodes can be created inside an XML field for various reasons. Just as duplicate rows must be identified and cleaned from a table, duplicate nodes should be identified and cleaned from XML fields, both for data accuracy and to avoid a whole class of exceptions.
Identify Duplicate XML Nodes
To visualize the problem and its solution clearly, we create a sample table named MyTestTable2 with two columns:
- Id (datatype-int)
- Data (datatype-xml)
create table MyTestTable2(Id int, Data xml);
declare @data1 xml=
'<employee>
<code>1</code>
<name>A</name>
<code>1</code>
</employee>',
@data2 xml='<employee>
<code>2</code>
<code>2</code>
<name>B</name>
</employee>',
@data3 xml='<employee>
<code>3</code>
<name>C</name>
</employee>';
insert into MyTestTable2(Id, data) values(1, @data1), (2, @data2), (3, @data3);
Now the table with its data looks like this:
select id, data from MyTestTable2;
After analyzing the data, we see that the rows with ids 1 and 2 each contain a duplicate XML node: the code node appears twice inside the employee element (after the name node in the first row, and before it in the second). The following SQL identifies the rows in which duplicate XML nodes exist:
select Id, Data from MyTestTable2 where data.value('count(/employee/code)', 'int') > 1;
It returns the two rows whose ids are 1 and 2, so we can see that the query correctly identifies the rows containing duplicate XML nodes.
Clean Up Duplicate XML Nodes
The following query deletes the duplicate nodes in place in the source table. The XPath predicate position() > 1 selects every code node after the first, and the modify method's delete statement removes them:
update MyTestTable2 set
data.modify('delete /employee/code[position()>1]')
where data.value('count(/employee/code)', 'int') > 1;
After executing the above query, the output will be as follows:
select id, data from MyTestTable2;
Now we see that there is no duplicate XML node left in the Data field (XML type) of the table.
When to Delete?
Duplicate rows try to identify and delete as early as possible. If it is late then that duplicate row might be used as a reference row (foreign key). If it is used as a reference key then
a referential integrity error will be raised and delete operation becomes complex. If you face that situation then first you need to update
the reference table/reference field with a valid key and then go for delete.
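A sketch of that repair, assuming a hypothetical Orders table whose CustomerId column references the duplicated rows (all table, column, and variable names here are illustrative, not part of the sample schema):

```sql
-- Re-point child rows from the duplicate key to the surviving valid key,
-- then the duplicate row can be deleted without a referential-integrity error.
update Orders set CustomerId = @validId where CustomerId = @duplicateId;
delete from Customers where Id = @duplicateId;
```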
Is there any way to continuously search for and delete duplicate rows?
We can write a stored procedure that identifies duplicate rows/XML nodes and deletes them, and then execute that stored procedure at a regular interval. We can use the following tools to execute a stored procedure periodically:
- SQL Server Agent
- Windows Service
- Windows Task Scheduler
If you choose Windows Service or Windows Task Scheduler, you need to create a console or Windows-service application and, inside it, use ADO.NET or the Enterprise Library to access the database and execute the stored procedure. In a Windows Service you can also use a Timer object for periodic execution. My personal choice is Windows Task Scheduler: it is simple and easily configurable.
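A minimal sketch of such a clean-up stored procedure, combining Solution-2 with the XML node clean-up shown earlier (the procedure name is illustrative):

```sql
create procedure dbo.usp_CleanupDuplicates
as
begin
    set nocount on;

    -- Remove duplicate rows, keeping the first of each group (Solution-2).
    with cte as
    (
        select row_number() over (partition by Id, Name, DateOfBirth
                                  order by Id asc) as rowid
        from MyTestTable
    )
    delete from cte where rowid > 1;

    -- Remove duplicate code nodes from the XML field.
    update MyTestTable2 set
        data.modify('delete /employee/code[position()>1]')
    where data.value('count(/employee/code)', 'int') > 1;
end
```

A SQL Server Agent job (or a scheduled task) then only needs to run `exec dbo.usp_CleanupDuplicates;` at the chosen interval.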
Any Precautions?
The duplicate-delete stored procedure (SP) must be well tested; take enough time to test it properly. It is better to get help from the testing team or a third person if you have that option.
A delete operation is very risky, especially in a production database. The client never accepts data loss; it is not acceptable to anyone that, while deleting duplicate rows, you delete real data too. So before the delete operation you must make sure the source data is properly backed up, so that you can recover it if any mistake happens.
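A quick way to keep a recovery copy of a single table before running the clean-up (the backup table name is illustrative; a full database backup is safer still):

```sql
-- Snapshot the table into a new table before deleting anything.
select * into MyTestTable_backup from MyTestTable;
```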
Points of Interest
I have tried to explain the various ways duplicate records arise in a database, as well as duplicate nodes in XML-type fields, and how to identify and remove them. Duplicate data creates maintenance overhead and produces various types of bugs, so we need to clean up the production database on a regular basis, especially in projects/products that have a long maintenance life.