Introduction
A generic stored procedure to retrieve and delete duplicate records from a table, specifying a custom list of fields identifying each record.
Background
When I did optimizations on a customer's production database, I found problems creating proper primary keys because of duplicate records found in tables. Doing a little web search, I found plenty of scripts to remove duplicate records. All existing scripts had two major problems:
- They are not generic and must be written for every table
- They are all assuming that duplicate records are identical in all fields and not only primary keys
In my situation, I had to process 150 tables, and I didn't want to write a script for every table. Also, in some tables, duplicate records had the same fields designated for the Primary Key, but other fields were different! So Select Distinct
, used in all scripts, was still returning these records twice.
Therefore, I had to write my own script to solve the problem.
Using the code
There are two Stored Procedures:
GetDuplicates @TableName, @PK
DeleteDuplicates @TableName, @PK
Both procedures accept two string parameters: table name and a comma delimited list of primary key fields. Primary key fields are fields which uniquely identify records, they are not necessarily an actual table primary key.
Example to call:
EXEC GetDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'
Both Stored Procedures are using dynamic SQL to be generic. They are also calling the function GetTableFieldsList
to get a string with all the table fields delimited by comma. The function allows to set a padding function to all the fields by adding prefix and suffix parameters. Every field in the returned list will be prefixed by the prefix and suffixed by the suffix parameter. The use of @
within the parameters allows getting the current field name.
Example to call:
SELECT dbo.GetTableFieldsList('Customers', 'MAX(', ') AS @')
will return MAX(f1) AS f1, MAX(f2) AS f2
, ... where f1, f2, ... are fields of the Customers table.
CREATE Function GetTableFieldsList(
@TableName AS VARCHAR(255),
@FieldPrefix AS VARCHAR(255) = '',
@FieldSuffix AS VARCHAR(255) = '')
RETURNS VARCHAR(5000)
AS
BEGIN
declare @fields_list as varchar(5000)
set @fields_list = ''
declare names_curr cursor read_only forward_only
for
select syscolumns.name from syscolumns inner join
sysobjects on syscolumns.id = sysobjects.id
where sysobjects.name=@TableName
order by syscolumns.colid
declare @tmp_field as varchar(255)
declare @tmp_prefix as varchar(255)
declare @tmp_suffix as varchar(255)
open names_curr
fetch next from names_curr into @tmp_field
while (@@FETCH_STATUS = 0)
begin
set @tmp_prefix = Replace(@FieldPrefix, '@', @tmp_field)
set @tmp_suffix = Replace(@FieldSuffix, '@', @tmp_field)
set @tmp_field = @tmp_prefix + @tmp_field + @tmp_suffix
set @fields_list = CASE WHEN @fields_list='' THEN
@tmp_field ELSE @fields_list + ', ' + @tmp_field END
fetch next from names_curr into @tmp_field
end
close names_curr
deallocate names_curr
RETURN @fields_list
END
The first Stored Procedure, GetDuplicates
, will return all duplicate records of the table, ordered by the Primary Key fields list + the DuplicateCount
field to indicate the number of duplicate records. This procedure has two string parameters: the table name and the Primary Key fields list. Note that Primary Key fields list must contain the field names only, without the table name.
Example to call:
EXEC GetDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'
will return all duplicate records of the OrderDetails table, having the same CustomerID
, OrderID
, and ItemID
fields.
CREATE PROCEDURE GetDuplicates(@TableName AS VARCHAR(255), @PK AS VARCHAR(4096))
AS
declare @table_pk as varchar(5000)
declare @tmp_pk as varchar(5000)
set @table_pk = replace(@pk, ' ', '')
set @tmp_pk = @table_pk
set @table_pk = @TableName+'.'+replace(@table_pk, ',', ', '+@TableName+'.')
set @tmp_pk = '#dup_table_tmp.'+replace(@tmp_pk, ',', ', #dup_table_tmp.')
declare @sql as varchar(5000)
set @sql = 'select ' + @PK
set @sql = @sql + ', count(*) AS DuplicateCount '
set @sql = @sql + 'INTO #dup_table_tmp from ' + @TableName
set @sql = @sql + ' group by ' + @PK
set @sql = @sql + ' having count(*) > 1; '
set @sql = @sql + 'select '
set @sql = @sql + dbo.GetTableFieldsList(@TableName, @TableName+'.', ' as @')
set @sql = @sql + ', #dup_table_tmp.DuplicateCount'
set @sql = @sql + ' from ' + @TableName + ' inner join #dup_table_tmp'
set @sql = @sql + ' on Checksum(' + @table_pk + ') = Checksum(' + @tmp_pk + ')'
set @sql = @sql + ' order by ' + @table_pk
exec (@sql)
The next Stored Procedure, DeleteDuplicates
, will delete all duplicate records of the table, leaving only one record for each Primary Key fields combination. Assuming that duplicate records might have different fields other than the Primary Key, we will use the MAX
aggregate function to select records with maximal values in all fields.
Example to call:
EXEC DeleteDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'
will delete all duplicate records of the OrderDetails table, having the same CustomerID
, OrderID
, and ItemID
fields, leaving only one record with the same CustomerID
, OrderID
, and ItemID
.
CREATE PROCEDURE DeleteDuplicates(@TableName AS VARCHAR(255), @PK AS VARCHAR(4096))
AS
declare @sql as varchar(5000)
set @sql = 'begin transaction; '
set @sql = @sql + 'if exists (select ' + @PK + ' from ' + @TableName
set @sql = @sql + ' group by ' + @PK
set @sql = @sql + ' having count(*) > 1)'
set @sql = @sql + ' begin '
set @sql = @sql + 'select '
set @sql = @sql + dbo.GetTableFieldsList(@TableName, 'max(', ') as @')
set @sql = @sql + ' INTO #dup_table_tmp from ' + @TableName
set @sql = @sql + ' group by ' + @PK + '; '
set @sql = @sql + 'delete from ' + @TableName + '; '
set @sql = @sql + 'insert into ' + @TableName
set @sql = @sql + ' select * from #dup_table_tmp;'
set @sql = @sql + ' end '
set @sql = @sql + 'commit;'
exec (@sql)
During the work, this procedure copies the filtered contents of the table to a temporary table and then selects all the contents back into the original table.
The attached zip file includes both the Stored Procedures and the function creation script. There is also a sample script to create a table, fill in some test data, and execute the procedures.
History
- May 13, 2007 - The first version.