Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / database / SQL-Server

Remove duplicate records from SQL tables

3.12/5 (11 votes)
13 May 2007CPOL3 min read 1   503  
A generic procedure to retrieve and delete duplicate records from MS SQL Server tables.

Introduction

A generic stored procedure to retrieve and delete duplicate records from a table, specifying a custom list of fields identifying each record.

Background

When I did optimizations on a customer's production database, I found problems creating proper primary keys because of duplicate records found in tables. Doing a little web search, I found plenty of scripts to remove duplicate records. All existing scripts had two major problems:

  1. They are not generic and must be written for every table
  2. They are all assuming that duplicate records are identical in all fields and not only primary keys

In my situation, I had to process 150 tables, and I didn't want to write a script for every table. Also, in some tables, duplicate records had the same fields designated for the Primary Key, but other fields were different! So Select Distinct, used in all scripts, was still returning these records twice.

Therefore, I had to write my own script to solve the problem.

Using the code

There are two Stored Procedures:

SQL
GetDuplicates @TableName, @PK
DeleteDuplicates @TableName, @PK

Both procedures accept two string parameters: table name and a comma delimited list of primary key fields. Primary key fields are fields which uniquely identify records, they are not necessarily an actual table primary key.

Example to call:

SQL
--Get all duplicate recordes from table 
--        OrderDetails having unique CustomerID, OrderID and ItemID
EXEC GetDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'

Both Stored Procedures are using dynamic SQL to be generic. They are also calling the function GetTableFieldsList to get a string with all the table fields delimited by comma. The function allows to set a padding function to all the fields by adding prefix and suffix parameters. Every field in the returned list will be prefixed by the prefix and suffixed by the suffix parameter. The use of @ within the parameters allows getting the current field name.

Example to call:

SQL
SELECT dbo.GetTableFieldsList('Customers', 'MAX(', ') AS @')

will return MAX(f1) AS f1, MAX(f2) AS f2, ... where f1, f2, ... are fields of the Customers table.

SQL
CREATE     Function GetTableFieldsList(
 @TableName AS VARCHAR(255),
 @FieldPrefix AS VARCHAR(255) = '',
 @FieldSuffix AS VARCHAR(255) = '')
RETURNS VARCHAR(5000)
AS

BEGIN
declare @fields_list as varchar(5000)
set @fields_list = ''
declare names_curr cursor read_only forward_only
for 
select syscolumns.name from syscolumns inner join 
       sysobjects on syscolumns.id = sysobjects.id 
where sysobjects.name=@TableName
order by syscolumns.colid
declare @tmp_field as varchar(255)
declare @tmp_prefix as varchar(255)
declare @tmp_suffix as varchar(255)
open names_curr
fetch next from names_curr into @tmp_field
while (@@FETCH_STATUS = 0)
begin
 set @tmp_prefix = Replace(@FieldPrefix, '@', @tmp_field)
 set @tmp_suffix = Replace(@FieldSuffix, '@', @tmp_field)
 set @tmp_field = @tmp_prefix + @tmp_field + @tmp_suffix
 set @fields_list = CASE WHEN @fields_list='' THEN 
     @tmp_field ELSE @fields_list + ', ' + @tmp_field END
 fetch next from names_curr into @tmp_field
end
close names_curr
deallocate names_curr
RETURN @fields_list
END

The first Stored Procedure, GetDuplicates, will return all duplicate records of the table, ordered by the Primary Key fields list + the DuplicateCount field to indicate the number of duplicate records. This procedure has two string parameters: the table name and the Primary Key fields list. Note that Primary Key fields list must contain the field names only, without the table name.

Example to call:

SQL
EXEC GetDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'

will return all duplicate records of the OrderDetails table, having the same CustomerID, OrderID, and ItemID fields.

SQL
CREATE    PROCEDURE GetDuplicates(@TableName AS VARCHAR(255), @PK AS VARCHAR(4096))
AS
declare @table_pk as varchar(5000)
declare @tmp_pk as varchar(5000)
set @table_pk = replace(@pk, ' ', '') --Remove blanks
set @tmp_pk = @table_pk
--Add table name prefix
set @table_pk = @TableName+'.'+replace(@table_pk, ',', ', '+@TableName+'.')
--Add table name prefix
set @tmp_pk = '#dup_table_tmp.'+replace(@tmp_pk, ',', ', #dup_table_tmp.')
declare @sql as varchar(5000)
set @sql = 'select ' + @PK
set @sql = @sql + ', count(*) AS DuplicateCount '
set @sql = @sql + 'INTO #dup_table_tmp from ' + @TableName
set @sql = @sql + ' group by ' + @PK
set @sql = @sql + ' having count(*) > 1; '
set @sql = @sql + 'select '
set @sql = @sql + dbo.GetTableFieldsList(@TableName, @TableName+'.', ' as @')
set @sql = @sql + ', #dup_table_tmp.DuplicateCount'
set @sql = @sql + ' from ' + @TableName + ' inner join #dup_table_tmp'
set @sql = @sql + ' on Checksum(' + @table_pk + ') = Checksum(' + @tmp_pk + ')'
set @sql = @sql + ' order by ' + @table_pk
--print @sql
exec (@sql)

The next Stored Procedure, DeleteDuplicates, will delete all duplicate records of the table, leaving only one record for each Primary Key fields combination. Assuming that duplicate records might have different fields other than the Primary Key, we will use the MAX aggregate function to select records with maximal values in all fields.

Example to call:

SQL
EXEC DeleteDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'

will delete all duplicate records of the OrderDetails table, having the same CustomerID, OrderID, and ItemID fields, leaving only one record with the same CustomerID, OrderID, and ItemID.

SQL
CREATE PROCEDURE DeleteDuplicates(@TableName AS VARCHAR(255), @PK AS VARCHAR(4096))
AS
declare @sql as varchar(5000)
set @sql = 'begin transaction; '
set @sql = @sql + 'if exists (select ' + @PK + ' from ' + @TableName
set @sql = @sql + ' group by ' + @PK
set @sql = @sql + ' having count(*) > 1)'
set @sql = @sql + ' begin '
set @sql = @sql + 'select '
set @sql = @sql + dbo.GetTableFieldsList(@TableName, 'max(', ') as @')
set @sql = @sql + ' INTO #dup_table_tmp from ' + @TableName
set @sql = @sql + ' group by ' + @PK + '; '
set @sql = @sql + 'delete from ' + @TableName + '; '
set @sql = @sql + 'insert into ' + @TableName
set @sql = @sql + ' select * from #dup_table_tmp;'
set @sql = @sql + ' end '
set @sql = @sql + 'commit;'
--print @sql
exec (@sql)

During the work, this procedure copies the filtered contents of the table to a temporary table and then selects all the contents back into the original table.

The attached zip file includes both the Stored Procedures and the function creation script. There is also a sample script to create a table, fill in some test data, and execute the procedures.

History

  • May 13, 2007 - The first version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)