Data Version Control on MS SQL Server (using a generic trigger)

Martin Lachman

5.00/5 (4 votes)

4 Oct 2016CPOL9 min read

15.8K

341

How to create a system on MS SQL Server for detection of actual changed records in a database table.

Introduction

This article describes a simple and compact system of detecting changed records in a database table. The main focus of this solution is the detection of actually changed records. The article can serve as an inspiration for others who want to make more complex data version control systems such as auditing, history records accounting, etc.

Background

Problem with actual changed records

Thinking about a general system of data version control (DVC) in a database system (in this case the MS SQL Server), several crucial questions must be asked:

Do we want to monitor all records that are considered as changed, or only those, that have actually changed?
How can we recognize an actual changed record?
What is responsible for the recognition of an actual changed record (some table trigger, a special stored procedure, etc.)?
How should we mark that a particular record has actually changed?

The MS SQL Server doesn't provide own instrument which helps to determine whether an affected record has actually changed or not. If an affected record gets included in the result set of an update statement, it is considered to be changed - despite the fact that all the data in the columns of the record are the same as before the update. Especially, if the ROWVERSION data type column is used, which might be useful in certain situations, a new value is stored there whenever an update statement gets executed, and there is no possibility for us to influence it.

There are several ways to tell whether a record has been changed or not. For example:

By comparing the old and the new value of all changing columns in each SQL update statement. This is very uncomfortable and very difficult for code maintenance. But, if only one update statement per table is used, it could be considered an acceptable solution.
By performing a similar comparison as in point 1, using the Table-Update Trigger.
Computing a check sum value (using the CHECKSUM or BINARY_CHECKSUM built-in functions which return a 32bit integer) for a row and comparing with it against the previous value. We can use an asterisk instead of a list of columns. But the check sum function is designed for the ' ... use in building hash indexes' (as you can see at msdn). A 32bit integer value is too small to avoid the collision. And there are some specifics of its implementation - e.g. null value usage, etc.
A similar solution as in point nr. 3, this time using a built-in hash function: HASHBYTES. The function returns VARBINARY value and its length is determined by the selected hash algorithm. For SHA or SHA1 algorithm this is a 160 bit value. However, while not being entirely impossible, the collision is almost out of the question, though not impossible. The second disadvantage is that a list of columns (or an asterisk) cannot be used at the place of an input parameter. Instead a concatenation function, as e.g. SELECT FOR XML statement, must be used.

A possible solution using a generic trigger

This solution of the issue is based on a common SQL CLR Trigger. Since there is only one code implementation for it (for all the tables), it can be called a generic trigger. The trigger identifies an actual changed row by comparing the old and the new values in each column.

Next, each DVC table has a special attribute that contains a row version number. The number is sequentially assigned an integer value (by an inbuilt database sequencer - on MS SQL Server 2012 or later), which is unique for each row in a given database, and the new value is generated when the row is actually changed (or inserted as new). The actually changed record is marked this way.

Our Data-Version Controlled Tables have these system attributes:

System Column	Data Type	Description
`SysInsertRowVersion`	`BIGINT`	Contains a unique value of the row version valid at the moment of row insertion. The column has a default value constraint - next value from the sequencer.
`SysUpdateRowVersion`	`BIGINT`	Contains the current value of row version - when the row is actually changed, a new unique value is stored here.
`SysRowState`	`TINYINT`	Indicates whether a row is valid or not (i.e. deleted). For example, the value of 0 means the row is valid whereas the value of 1 means the row has been deleted.
`SysRowId`	`BIGINT`	This is a primary key alias implemented as a computed column. It allows us to keep the original primary key column name in legacy tables and to use a common name in the generic trigger at the same time. If we don't have a one-column PK (or the PK is based on non integer values), we can create a new column as a candidate key (identity column) and declare it as a unique key.
`SysTableName`	`VARCHAR(128)`	Contains the table name as a computed column with a constant value for each row. The value is used in the generic trigger. Without it the trigger would not recognize which table is currently being handled and there is would be no other reliable way of finding out. A computed-column constant value is not stored in a table data page and shouldn't be too resource demanding.

When a record gets inserted into the DVC table, the SysInsertRowVersion column contains a new unique value provided by the sequencer. The value is enforced by the default constraint declared on the table. For example, if you store a current value from a previous processing task, we can find out whether the newly inserted record comes earlier than the time at which the task was finished.

If the record has been updated and some of its column values have actually changed, the SysUpdateRowVersion column will show a new unique value given by the sequencer, which is different from the value in the SysInsertRowVersion column. This value is assigned by the generic update trigger.

In this case an update statement is used instead of the delete statement for deleting a row. The update statement assigns the SysRowState column a value which means deleted record. To make DVC tables query easier, you can create a simple view of each DVC table excluding the deleted rows can be created.

Using the code

The following part describes the solution at the example of simple case of a single table. The problem can be divided into a SQL Server part and a SQL CLR part (the generic trigger implementation).

SQL Server part

First, a sequencer providing a unique long-integer (BIGINT) incremented values is needed. In this example, it is marked seq_SysDVC and its DDL code is as follows:

SQL

-- DVC System Sequencer:
CREATE SEQUENCE dbo.seq_SysDVC AS BIGINT
  START WITH 1
  INCREMENT BY 1
GO

Second, the DVC table in this example is marked TbCustomer with the following DDL code:

SQL

-- sample customer table:
CREATE TABLE dbo.TbCustomer(
  -- custom PK column name
  CustomerId INTEGER NOT NULL IDENTITY(1,1),
  -- other business entity attributes:
  FirstName VARCHAR(100) NOT NULL,
  LastName VARCHAR(100) NOT NULL,
  Birthday DATETIME NOT NULL,
  Rating DECIMAL NOT NULL,
  -- DVC System Columns:
  SysInsertRowVersion BIGINT NOT NULL CONSTRAINT dfc_dbo_TbCustomer_SysInsertRowVersion DEFAULT NEXT VALUE FOR dbo.seq_SysDVC,
  SysUpdateRowVersion BIGINT NOT NULL CONSTRAINT dfc_dbo_TbCustomer_SysUpdateRowVersion DEFAULT -1,
  SysRowState TINYINT NOT NULL CONSTRAINT dfc_dbo_TbCustomer_SysRowState DEFAULT 0,
  SysRowId AS (CONVERT(BIGINT, CustomerId)),
  SysTableName AS ('dbo.TbCustomer'),
  -- native PK of the table
  CONSTRAINT pkc_TbCustomer PRIMARY KEY CLUSTERED (CustomerId ASC)
)
GO

On the SQL Server side a common table data type consisting of one attribute: SysRowId is necessary to provide a batch updating statement of the generic trigger. Its DDL code is here:

SQL

-- common table data type for a batch update statement of the generic trigger:
CREATE TYPE dbo.udt_DVCChangeResultTable AS TABLE(
  -- primary key
  SysRowId BIGINT NOT NULL PRIMARY KEY CLUSTERED
)
GO

SQL CLR part

On the SQL CLR side, the generic trigger has a simple assembly project with one public class (CDVCTrigger) containing a single public static method (OnUpdateBaseTrig) that implements its logic. The project contains some further helper classes and structures with evident purposes.

Should an old value be compared with a new one for a given column in a particular record, two trigger pseudo tables have to be used (Inserted and Deleted). However, two simultaneously opened SqlDataReader objects cause problems with the Multiple Active Result Sets related to the context SQL Server connection used in the trigger, we cannot use them. To avoid the possible growth in resource allocation DataSet/DataTable objects should not be used. Finally the use of a separate command per each changed row slows down the trigger process. For this reason, I use my own common structures for in-memory data storage.

For further information, see the Visual Studio project, which provides detailed explanation in the source code.

A compiled binary file of the SQL CLR project: DVC.SQLCLR.dll is created. It can be deployed on the SQL Server by means of:

SQL

-- Allow CLR integration
EXEC sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO

-- Drop the assembly if it already exists
IF EXISTS (SELECT * FROM sys.assemblies asms WHERE asms.name = N'DVC.SQLCLR')
  DROP ASSEMBLY [DVC.SQLCLR]
GO

-- Import the assembly - with set safe mode:
CREATE ASSEMBLY [DVC.SQLCLR] FROM 'C:\Temp\DVC.SQLCLR.dll' WITH PERMISSION_SET = SAFE
GO

Copy the binary file of the assembly into a server side file system, for example into its local path 'c:\Temp\'.
Open the SQL Server Management Studio, switch the context database to the target one and open a new query window.
In the query window run the 'installation' code:

The following code shows how to simply 'declare' a new update trigger (based on the generic trigger) on the table from the example above:

SQL

-- Drop the trigger when it already exists
IF EXISTS (SELECT * FROM dbo.sysobjects WHERE id = object_id(N'dbo.tri_TbCustomer_Upd') AND OBJECTPROPERTY(id, N'IsTrigger') = 1)
  DROP TRIGGER dbo.tri_TbCustomer_Upd
GO

-- Create the trigger - in better words: declare the trigger
CREATE TRIGGER dbo.tri_TbCustomer_Upd ON dbo.TbCustomer FOR UPDATE AS EXTERNAL NAME [DVC.SQLCLR].[DVC.SQLCLR.CDVCTrigger].OnUpdateBaseTrig
GO

Now that the task is finished, it is time to test its results.

Testing of the DVC system

The example table TbCustomer is currently empty and new data must be inserted. And the current value of the DVC sequencer can be obtained in the following way:

SQL

-- This select returns current value of the DVC sequencer (for example, we can encapsulate it into a new function for 3rd party clients):
SELECT CONVERT(BIGINT, (SELECT TOP 1 T.Current_Value FROM SYS.Sequences AS T WHERE T.Name = 'seq_SysDVC'))

-- Insert sample records
INSERT INTO
  dbo.TbCustomer(FirstName, LastName, Birthday, Rating)
VALUES
  ('Graham', 'Greene', '1904-10-02', 50),
  ('Karel', 'Capek', '1890-01-09', 150),
  ('Josef', 'Skvorecky', '1924-09-27', 250)

-- Control select over the table
SELECT * FROM dbo.TbCustomer

As shown below, the new records have valid values in SysInsertRowVersion column. The new value has been assigned here by the column default constraint:

The following update statement activates the generic trigger. The trigger recognizes an actually changed record and inserts new value in the SysUpdateRowVersion column:

SQL

-- The update statement conditionally changes the rating of one record
UPDATE
  T
SET
  T.Rating = CASE WHEN T.CustomerId = 2 THEN (T.Rating * 2) ELSE T.Rating END
FROM
  dbo.TbCustomer AS T

-- Control select over the table
SELECT * FROM dbo.TbCustomer

Although the SQL Server returns a '(3 row(s) affected)' message, there is only one row, which has actually changed:

As seen in the SysUpdateRowVersion column, the actually changed record is marked by a change in the number, while the unchanged records remain unmarked, thus confirming the ability of this solution to differentiate between updated and actually changed data.

Points of Interest

Using a computed column alias instead of the original primary key column

In case of doubt whether a query using the SysRowId column primary-key alias in its WHERE clause is effective enough, the following two execution plans produced by two sample queries can be compared: The first query uses the native primary key column while the second query uses the alias column. The execution plans are not identical:

Detecting a parent table object of SQL CLR trigger current context

There are several 'obscure' ways how to obtain the name of a parent table for which the trigger has been fired. However, in fact, there is no attribute of the SqlTriggerContext where this could be found. For more details, see the following discussion threads:

History

2016-10-04: First version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)