Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / database / SQL-Server / SQL-Server-2008

Microsoft SQL Server 2008 - Change Data Capture (CDC)

4.80/5 (14 votes)
8 Mar 2011CPOL7 min read 91.6K  
CDC is a new feature for logging DML changes. In this article, we will see how CDC could be put to use and how efficiently it could serve our purpose.

Introduction

Microsoft SQL Server 2008 has introduced a very exciting feature for logging DML changes. Previously, whenever there was a need for tracking all the modifications on critical tables done by ETL jobs or manual DMLs, developers needed to create triggers and log changes in explicit tables. With the introduction of 2008, all change tracking has been automated at the product level. Change Data Capture reports work for all such requirements. In this article, we will see how CDC could be put to use and how efficiently it could serve our purpose.

Note: Change Data Capture is available only in SQL Server 2008 Enterprise, Developer, and Evaluation editions.

This article will walk through the CDC configuration setup and use the setup CDC for monitoring the changes between modification intervals. The latter part of this article will give readers some insights on CDC’s utility, i.e., how CDC can help figure out the changes on the record set of objects modified by ETL jobs. For example, say an SSIS package runs daily 4 am EST and transforms data on table ABC. The records modified (inserts/updates/deletes) by this job which ran on 03/04/2011 can be effectively recorded for analysis purposes with the help of CDC. The utility section will give insights on this in detail.

Configuring Setup

CDC is by default disabled and must be enabled at the database level followed by enabling on the table. To enable, CDC EXECUTE sys.sp_cdc_enable_db; it returns 0 once CDC is successfully enabled, or 1 in case of any failure. CDC cannot be enabled on any of the system databases. To validate if CDC has been enabled on the database, query on sys.databases. Only user databases can participate in CDC. To be able to enable CDC, the user’s server roles must be that of a SYSADMIN.

We will follow this up closely with an example.

SQL
CREATE DATABASE TestDB;
Use TestDB
GO
CREATE TABLE Employee(
EID INT IDENTITY(1,1)PRIMARY KEY,
ENAME VARCHAR(50),
DEPT VARCHAR(20)
);

INSERT INTO Employee 
Values ('Rambo','IT'),
('Jason','Finance'),
('Brad','HR');
/*This is a new feature in SQL Server 2008 for constructing your 
Insert query to be able to handle multiple inserts in one go.  */
SELECT * FROM Employee;
/*The output of our select statement shows three records inserted.
EID         ENAME                                           DEPT
----------- ----------------------------------------------- -----------
1           Rambo                                           IT
2           Jason                                           Finance
3           Brad                                            HR
(3 row(s) affected)*/
/*Enabling CDC on the Database TestDB*/
Use TestDB
GO
EXEC sys.sp_cdc_enable_db
--This would create a User 'CDC' and Schema 'CDC'
SELECT name,is_cdc_enabled FROM sys.databases WHERE name='TestDB'
/*
name     is_cdc_enabled
-------- --------------
TestDB         1

(1 row(s) affected)
CDC enabled successfully on the DB TestDB 
*/

CDC has been enabled on the TestDB database; at this point, a user ‘CDC’ and schema ‘CDC’ will be created. Next, CDC needs to be enabled on the specific tables for which changes are required to be captured. In this example, the Employee table is the candidate for CDC, hence enable CDC on Employee.

SQL
EXEC sys.sp_cdc_enable_table
    @source_schema ='dbo',
    @source_name ='Employee',
    @role_name ='EmpCDCRole',
    @supports_net_changes = 1
/*Messages Returned
Job 'cdc.TestDB_capture' started successfully.
Job 'cdc.TestDB_cleanup' started successfully.*/

SELECT name,is_tracked_by_cdc FROM sys.tables WHERE name='Employee'
/*
name        is_tracked_by_cdc
--------    -----------------
Employee     1

(1 row(s) affected)*/

The sys.sp_cdc_enable_table command enables CDC on the tables. An important point to note is, a new role for CDC needs to be created 'EmpCDCRole'. Users who are not associated with the db_owner role and need access to CDC details should be associated with this role. Secondly, the @supports_net_changes parameter needs to be set to 1 if all changes happening on a record needs to be summarized in the form of net change. It is good if that’s set to 1.

Once execution of this command is successful, two new jobs will be created in the SQL Agent to capture and clean up the change. To make sure that CDC has been enabled on the table, validate if the is_tracked_by_cdc column has been set to 1, FROM sys.tables WHERE name='Employee'.

For disabling CDC on the table, sys.sp_cdc_disable_table needs to be executed.

SQL
EXEC sys.sp_cdc_disable_table
  @source_schema ='dbo',
  @source_name ='Employee',
  @capture_instance ='dbo_Employee'

Similarly, for disabling CDC at the database level, we need EXEC sys.sp_cdc_disable_db.

Remember to check sys.databases and sys.tables and validate if CDC has rightly been set to 0 (disabled) once the disable query has completed its execution.

Demonstration

In this section, a demonstration on changed data capturing is given. CDC on the database TestDB and table Employee has been configured. Now fire a few DML for the CDC to be able to capture the changes.

SQL
DELETE FROM Employee WHERE EID=3
INSERT INTO Employee VALUES('Mary','HR')
UPDATE Employee SET ENAME='Nichole' WHERE EID=2
UPDATE Employee SET ENAME='EMMA' WHERE EID=2

SELECT * FROM Employee
/*EID         ENAME                                   DEPT
----------- ----------------------------------- --------------
1           Rambo                                     IT
2           EMMA                                      Finance
4           Mary                                      HR

(3 row(s) affected)*/

A delete, an insert, and two updates have been fired, and the final table looks as above. To see the captured changes, SQL provides two functions:

  1. cdc.fn_cdc_get_net_changes_dbo_Employee and
  2. cdc.fn_cdc_get_all_changes_dbo_Employee

As the name suggests, the 'net changes' function gives the net changes on the records and the 'all changes' function provides all the changes before and after the DML is executed.

SQL
DECLARE @Begin_LSN binary(10), @End_LSN binary(10)
-- get the first LSN 
SELECT @Begin_LSN =sys.fn_cdc_get_min_lsn('dbo_Employee')
-- get the last LSN 
SELECT @End_LSN =sys.fn_cdc_get_max_lsn()
-- returns net changes
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_Employee(
 @Begin_LSN, @End_LSN,'ALL');
/*__$start_lsn           __$operation __$update_mask EID         ENAME DEPT
---------------------- ------------ ---------------- ----------- ----- -----
0x0000001C000000610004 1            NULL             3           Brad  HR
0x0000001C000000620004 2            NULL             4           Mary  HR
0x0000001C0000006E0004 4            NULL             2           EMMA  Finance

(3 row(s) affected)*/

LSN

All the changes are logged in terms of LSN or Log Sequence Number. SQL distinctly identifies each operation of DML via a Log Sequence Number. Any committed modifications on any tables are recorded in the transaction log of the database with a specific LSN provided by SQL Server. The __$operationcolumn values are: 1 = delete, 2 = insert, 3 = update (values before update), 4 = update (values after update).

cdc.fn_cdc_get_net_changes_dbo_Employee gives us all the records net changed falling between the LSN we provide in the function. We have three records returned by the net_change function; there was a delete, an insert, and two updates, but on the same record. In case of the updated record, it simply shows the net changed value after both the updates are complete.

For getting all the changes, execute cdc.fn_cdc_get_all_changes_dbo_Employee; there are options either to pass 'ALL' or 'ALL UPDATE OLD'. The 'ALL' option provides all the changes, but for updates, it provides the after updated values. Hence we find two records for updates. We have one record showing the first update when Jason was updated to Nichole, and one record when Nichole was updated to EMMA.

SQL
DECLARE @Begin_LSN binary(10), @End_LSN binary(10)
-- get the first LSN 
SELECT @Begin_LSN =sys.fn_cdc_get_min_lsn('dbo_Employee')
-- get the last LSN 
SELECT @End_LSN =sys.fn_cdc_get_max_lsn()
-- returns all changes only after update values for ‘UPDATE’
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Employee(
 @Begin_LSN, @End_LSN,'ALL');

/*
 __$start_lsn           __$seqval              __$opera  __$update EID  ENAME   DEPT
                                                  tion   _mask
---------------------- ---------------------- ---------  --------- ---  ------- -----
0x0000001C000000610004 0x0000001C000000610002 1          0x07      3    Brad    HR
0x0000001C000000620004 0x0000001C000000620003 2          0x07      4    Mary    HR
0x0000001C0000006D0004 0x0000001C0000006D0002 4          0x02      2    Nichole Finance
0x0000001C0000006E0004 0x0000001C0000006E0002 4          0x02      2    EMMA    Finance

(4 row(s) affected)
*/

The 'ALL UPDATE OLD' option takes it one step ahead and provides the before and after image of the records in case of updates. The records before (__$operation code=3) and after (__$operation code=4) the update have been captured by CDC here. Initially, Jason was updated to Nichole and again Nichole updated to EMMA.

SQL
DECLARE @Begin_LSN binary(10), @End_LSN binary(10)
-- get the first LSN 
SELECT @Begin_LSN =sys.fn_cdc_get_min_lsn('dbo_Employee')
-- get the last LSN 
SELECT @End_LSN =sys.fn_cdc_get_max_lsn()
-- returns all changes both before & after update values for ‘UPDATE’
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Employee(
 @Begin_LSN, @End_LSN,'ALL UPDATE OLD');
/*
__$start_lsn           __$seqval              __$oper   __$update EID ENAME   DEPT
                                                 ation   _mask
---------------------- ---------------------- --------- --------- --- ------- --------
0x0000001C000000610004 0x0000001C000000610002  1         0x07     3   Brad    HR
0x0000001C000000620004 0x0000001C000000620003  2         0x07     4   Mary    HR
0x0000001C0000006D0004 0x0000001C0000006D0002  3         0x02     2   Jason   Finance
0x0000001C0000006D0004 0x0000001C0000006D0002  4         0x02     2   Nichole Finance
0x0000001C0000006E0004 0x0000001C0000006E0002  3         0x02     2   Nichole Finance
0x0000001C0000006E0004 0x0000001C0000006E0002  4         0x02     2   EMMA    Finance

(6 row(s) affected)
*/

To get the details on the columns being captured, the sys.sp_cdc_get_captured_columns function can be put to use. This returns change data capture metadata information for the captured source columns tracked by the specified capture instance.

SQL
EXEC sys.sp_cdc_get_captured_columns
@capture_instance ='dbo_Employee';
EXEC sys.sp_cdc_help_change_data_capture

To report on the capture instances for the table, run the sys.sp_cdc_help_change_data_capture Stored Procedure.

Utility

The essence of CDC can be effectively put to use. Let us see a very simple demonstration of its utility. If there are certain ETL jobs transforming (insert/update/delete) the target table under CDC scan, a plan can be put in place to be able to monitor the changes undertaken by the ETL job during each specific run. For example, if an SISS package runs 4 AM EST daily, a configuration can be put in place to be able to identify what changes have been done to the data by the job run on 03/04/2011 4 AM EST.

SQL
CREATE TABLE PROCESS
(Process_ID INTIDENTITY (1,1),
 Process_Date DATETIME,
 BEGIN_LSN BINARY(10),
 END_LSN BINARY(10))

A table has been created to log the minimum and maximum LSN during each job run. When the ETL job runs, it will make an entry with the Process_ID (for demo purpose, I have declared the Process_ID as identity; it could rather take the job_id for the running job), the date-time stamp of its run, and the minimum and maximum of the LSN for the records modified by the job. For logging the details in the created table while the job runs, the Stored Procedure below has been created. The job can call and execute the procedure at the very last step before it exits.

SQL
IF EXISTS(SELECT 1 FROM TestDB.sys.objects
WHERE name='ProcessLogEntry'AND [type]='P')
DROP PROCEDURE ProcessLogEntry
GO
CREATE PROCEDURE dbo.ProcessLogEntry

AS

SET NOCOUNTON
SET QUOTED_IDENTIFIERON
SET ANSI_NULLSON

BEGIN
DECLARE @BEGIN_LSN BINARY(10)

      IF ((SELECT COUNT(1)FROM PROCESS)>0)
      BEGIN
            SELECT @BEGIN_LSN=MAX(END_LSN)FROM PROCESS 
            INSERT INTO PROCESS SELECT GETDATE(),
            @BEGIN_LSN,sys.fn_cdc_get_max_lsn();
END
ELSE
INSERT INTO PROCESS SELECT GETDATE(), 
       sys.fn_cdc_get_min_lsn('dbo_employee'),sys.fn_cdc_get_max_lsn();

END;

In this demo session, we execute the procedure manually; it makes the entry for the LSN for the DML we executed previously, i.e.: an insert, a delete, and two updates.

SQL
EXEC ProcessLogEntry
SELECT * FROM PROCESS
Process_ID  Process_Date            BEGIN_LSN              END_LSN
----------- ----------------------- ---------------------- ----------------------
1           2011-03-04 17:33:04.667 0x0000001800000020003A 0x0000001E000000370001

(1 row(s) affected)

As it is logged, let’s execute some more DML:

SQL
INSERT INTO Employee VALUES('Sachin','CEO')
UPDATE Employee SET DEPT='Services' WHERE EID=2

We have inserted a record and updated the department for EID 2. To log the changes, we execute the procedure ProcessLogEntry again.

SQL
EXEC ProcessLogEntry
SELECT * FROM PROCESS
Process_ID  Process_Date            BEGIN_LSN              END_LSN
----------- ----------------------- ---------------------- ----------------------
1           2011-03-04 17:33:04.667 0x0000001800000020003A 0x0000001E000000370001
2           2011-03-04 17:38:32.490 0x0000001E000000370001 0x0000001E000000A80004

(2 row(s) affected)

The second entry has been made in our Process table. Now, this effort enables us to get the changes done by the job on any specific day. The code below gets the net changes done by process_id =2.

SQL
DECLARE @Begin_LSN BINARY(10), @End_LSN BINARY(10)
SELECT @Begin_LSN = BEGIN_LSN FROM PROCESS WHERE Process_ID=2
SELECT @End_LSN = END_LSN FROM PROCESS WHERE Process_ID=2
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Employee(
 @Begin_LSN, @End_LSN,'ALL');

__$start_lsn           __$operation __$update_mask EID   ENAME    DEPT
---------------------- ------------ -------------- --- -------- -------
0x0000001E000000A40004 2            NULL           5   Sachin   CEO
0x0000001E000000A80004 4            NULL           2   EMMA     Services

(2 row(s) affected)

Exciting, isn't it? We get precisely those changes undertaken by process 2.

Similarly, the query below simply gets the net changes undertaken by process_id 1.

SQL
DECLARE @Begin_LSN BINARY(10), @End_LSN BINARY(10)
SELECT @Begin_LSN = BEGIN_LSN FROM PROCESS WHERE Process_ID=1
SELECT @End_LSN = END_LSN FROM PROCESS WHERE Process_ID=1
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Employee(
 @Begin_LSN, @End_LSN,'ALL');
/*__$start_lsn           __$operation __$update_mask EID         ENAME DEPT
---------------------- ------------ ---------------- ----------- ----- -----
0x0000001C000000610004 1            NULL             3           Brad  HR
0x0000001C000000620004 2            NULL             4           Mary  HR
0x0000001C0000006E0004 4            NULL             2           EMMA  Finance

(3 row(s) affected)

Hope this demo was somewhat helpful in providing insights on CDC and the utility section helped how CDC could be put to use.

Conclusion

SQL Server 2008 has taken this a step further with the advent of Audits, discussed in my next article. Audits come really handy in effectively keeping a check on the activities at a very granular level. It could be employed by DBAs in a very efficient manner to track smarter guys trying to gain any unauthorized access or executing DML (update/delete). All the details as to who fired the query, the timestamp, and the actual query can be recorded for helping DBA personnel report any unethical activities on the server. This will go a long way to boost the security of servers and help organizations preserve the most important aspect of their business, i.e., strictly confidential information. Security policies could now be designed and implemented at a very granular level with the helping hand of two awesome features: CDC and Audits.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)