Introduction
Microsoft SQL Server 2008 has introduced a very exciting feature for logging DML changes. Previously, whenever there was a need for tracking all the modifications on critical tables done by ETL jobs or manual DMLs, developers needed to create triggers and log changes in explicit tables. With the introduction of 2008, all change tracking has been automated at the product level. Change Data Capture reports work for all such requirements. In this article, we will see how CDC could be put to use and how efficiently it could serve our purpose.
Note: Change Data Capture is available only in SQL Server 2008 Enterprise, Developer, and Evaluation editions.
This article will walk through the CDC configuration setup and use the setup CDC for monitoring the changes between modification intervals. The latter part of this article will give readers some insights on CDC’s utility, i.e., how CDC can help figure out the changes on the record set of objects modified by ETL jobs. For example, say an SSIS package runs daily 4 am EST and transforms data on table ABC. The records modified (inserts/updates/deletes) by this job which ran on 03/04/2011 can be effectively recorded for analysis purposes with the help of CDC. The utility section will give insights on this in detail.
Configuring Setup
CDC is by default disabled and must be enabled at the database level followed by enabling on the table. To enable, CDC EXECUTE sys.sp_cdc_enable_db
; it returns 0 once CDC is successfully enabled, or 1 in case of any failure. CDC cannot be enabled on any of the system databases. To validate if CDC has been enabled on the database, query on sys.databases. Only user databases can participate in CDC. To be able to enable CDC, the user’s server roles must be that of a SYSADMIN.
We will follow this up closely with an example.
CREATE DATABASE TestDB;
Use TestDB
GO
CREATE TABLE Employee(
EID INT IDENTITY(1,1)PRIMARY KEY,
ENAME VARCHAR(50),
DEPT VARCHAR(20)
);
INSERT INTO Employee
Values ('Rambo','IT'),
('Jason','Finance'),
('Brad','HR');
SELECT * FROM Employee;
Use TestDB
GO
EXEC sys.sp_cdc_enable_db
SELECT name,is_cdc_enabled FROM sys.databases WHERE name='TestDB'
CDC has been enabled on the TestDB database; at this point, a user ‘CDC’ and schema ‘CDC’ will be created. Next, CDC needs to be enabled on the specific tables for which changes are required to be captured. In this example, the Employee table is the candidate for CDC, hence enable CDC on Employee.
EXEC sys.sp_cdc_enable_table
@source_schema ='dbo',
@source_name ='Employee',
@role_name ='EmpCDCRole',
@supports_net_changes = 1
SELECT name,is_tracked_by_cdc FROM sys.tables WHERE name='Employee'
The sys.sp_cdc_enable_table
command enables CDC on the tables. An important point to note is, a new role for CDC needs to be created 'EmpCDCRole'. Users who are not associated with the db_owner role and need access to CDC details should be associated with this role. Secondly, the @supports_net_changes
parameter needs to be set to 1 if all changes happening on a record needs to be summarized in the form of net change. It is good if that’s set to 1.
Once execution of this command is successful, two new jobs will be created in the SQL Agent to capture and clean up the change. To make sure that CDC has been enabled on the table, validate if the is_tracked_by_cdc column has been set to 1, FROM sys.tables WHERE name='Employee'.
For disabling CDC on the table, sys.sp_cdc_disable_table
needs to be executed.
EXEC sys.sp_cdc_disable_table
@source_schema ='dbo',
@source_name ='Employee',
@capture_instance ='dbo_Employee'
Similarly, for disabling CDC at the database level, we need EXEC sys.sp_cdc_disable_db
.
Remember to check sys.databases and sys.tables and validate if CDC has rightly been set to 0 (disabled) once the disable query has completed its execution.
Demonstration
In this section, a demonstration on changed data capturing is given. CDC on the database TestDB and table Employee has been configured. Now fire a few DML for the CDC to be able to capture the changes.
DELETE FROM Employee WHERE EID=3
INSERT INTO Employee VALUES('Mary','HR')
UPDATE Employee SET ENAME='Nichole' WHERE EID=2
UPDATE Employee SET ENAME='EMMA' WHERE EID=2
SELECT * FROM Employee
A delete, an insert, and two updates have been fired, and the final table looks as above. To see the captured changes, SQL provides two functions:
cdc.fn_cdc_get_net_changes_dbo_Employee
andcdc.fn_cdc_get_all_changes_dbo_Employee
As the name suggests, the 'net changes' function gives the net changes on the records and the 'all changes' function provides all the changes before and after the DML is executed.
DECLARE @Begin_LSN binary(10), @End_LSN binary(10)
SELECT @Begin_LSN =sys.fn_cdc_get_min_lsn('dbo_Employee')
SELECT @End_LSN =sys.fn_cdc_get_max_lsn()
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_Employee(
@Begin_LSN, @End_LSN,'ALL');
LSN
All the changes are logged in terms of LSN or Log Sequence Number. SQL distinctly identifies each operation of DML via a Log Sequence Number. Any committed modifications on any tables are recorded in the transaction log of the database with a specific LSN provided by SQL Server. The __$operationcolumn values are: 1 = delete, 2 = insert, 3 = update (values before update), 4 = update (values after update).
cdc.fn_cdc_get_net_changes_dbo_Employee
gives us all the records net changed falling between the LSN we provide in the function. We have three records returned by the net_change
function; there was a delete, an insert, and two updates, but on the same record. In case of the updated record, it simply shows the net changed value after both the updates are complete.
For getting all the changes, execute cdc.fn_cdc_get_all_changes_dbo_Employee
; there are options either to pass 'ALL' or 'ALL UPDATE OLD'. The 'ALL' option provides all the changes, but for updates, it provides the after updated values. Hence we find two records for updates. We have one record showing the first update when Jason was updated to Nichole, and one record when Nichole was updated to EMMA.
DECLARE @Begin_LSN binary(10), @End_LSN binary(10)
SELECT @Begin_LSN =sys.fn_cdc_get_min_lsn('dbo_Employee')
SELECT @End_LSN =sys.fn_cdc_get_max_lsn()
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Employee(
@Begin_LSN, @End_LSN,'ALL');
The 'ALL UPDATE OLD
' option takes it one step ahead and provides the before and after image of the records in case of updates. The records before (__$operation code=3
) and after (__$operation code=4
) the update have been captured by CDC here. Initially, Jason was updated to Nichole and again Nichole updated to EMMA.
DECLARE @Begin_LSN binary(10), @End_LSN binary(10)
SELECT @Begin_LSN =sys.fn_cdc_get_min_lsn('dbo_Employee')
SELECT @End_LSN =sys.fn_cdc_get_max_lsn()
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Employee(
@Begin_LSN, @End_LSN,'ALL UPDATE OLD');
To get the details on the columns being captured, the sys.sp_cdc_get_captured_columns
function can be put to use. This returns change data capture metadata information for the captured source columns tracked by the specified capture instance.
EXEC sys.sp_cdc_get_captured_columns
@capture_instance ='dbo_Employee';
EXEC sys.sp_cdc_help_change_data_capture
To report on the capture instances for the table, run the sys.sp_cdc_help_change_data_capture
Stored Procedure.
Utility
The essence of CDC can be effectively put to use. Let us see a very simple demonstration of its utility. If there are certain ETL jobs transforming (insert/update/delete) the target table under CDC scan, a plan can be put in place to be able to monitor the changes undertaken by the ETL job during each specific run. For example, if an SISS package runs 4 AM EST daily, a configuration can be put in place to be able to identify what changes have been done to the data by the job run on 03/04/2011 4 AM EST.
CREATE TABLE PROCESS
(Process_ID INTIDENTITY (1,1),
Process_Date DATETIME,
BEGIN_LSN BINARY(10),
END_LSN BINARY(10))
A table has been created to log the minimum and maximum LSN during each job run. When the ETL job runs, it will make an entry with the Process_ID
(for demo purpose, I have declared the Process_ID
as identity; it could rather take the job_id
for the running job), the date-time stamp of its run, and the minimum and maximum of the LSN for the records modified by the job. For logging the details in the created table while the job runs, the Stored Procedure below has been created. The job can call and execute the procedure at the very last step before it exits.
IF EXISTS(SELECT 1 FROM TestDB.sys.objects
WHERE name='ProcessLogEntry'AND [type]='P')
DROP PROCEDURE ProcessLogEntry
GO
CREATE PROCEDURE dbo.ProcessLogEntry
AS
SET NOCOUNTON
SET QUOTED_IDENTIFIERON
SET ANSI_NULLSON
BEGIN
DECLARE @BEGIN_LSN BINARY(10)
IF ((SELECT COUNT(1)FROM PROCESS)>0)
BEGIN
SELECT @BEGIN_LSN=MAX(END_LSN)FROM PROCESS
INSERT INTO PROCESS SELECT GETDATE(),
@BEGIN_LSN,sys.fn_cdc_get_max_lsn();
END
ELSE
INSERT INTO PROCESS SELECT GETDATE(),
sys.fn_cdc_get_min_lsn('dbo_employee'),sys.fn_cdc_get_max_lsn();
END;
In this demo session, we execute the procedure manually; it makes the entry for the LSN for the DML we executed previously, i.e.: an insert, a delete, and two updates.
EXEC ProcessLogEntry
SELECT * FROM PROCESS
Process_ID Process_Date BEGIN_LSN END_LSN
1 2011-03-04 17:33:04.667 0x0000001800000020003A 0x0000001E000000370001
(1 row(s) affected)
As it is logged, let’s execute some more DML:
INSERT INTO Employee VALUES('Sachin','CEO')
UPDATE Employee SET DEPT='Services' WHERE EID=2
We have inserted a record and updated the department for EID 2. To log the changes, we execute the procedure ProcessLogEntry
again.
EXEC ProcessLogEntry
SELECT * FROM PROCESS
Process_ID Process_Date BEGIN_LSN END_LSN
1 2011-03-04 17:33:04.667 0x0000001800000020003A 0x0000001E000000370001
2 2011-03-04 17:38:32.490 0x0000001E000000370001 0x0000001E000000A80004
(2 row(s) affected)
The second entry has been made in our Process table. Now, this effort enables us to get the changes done by the job on any specific day. The code below gets the net changes done by process_id =2
.
DECLARE @Begin_LSN BINARY(10), @End_LSN BINARY(10)
SELECT @Begin_LSN = BEGIN_LSN FROM PROCESS WHERE Process_ID=2
SELECT @End_LSN = END_LSN FROM PROCESS WHERE Process_ID=2
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Employee(
@Begin_LSN, @End_LSN,'ALL');
__$start_lsn __$operation __$update_mask EID ENAME DEPT
0x0000001E000000A40004 2 NULL 5 Sachin CEO
0x0000001E000000A80004 4 NULL 2 EMMA Services
(2 row(s) affected)
Exciting, isn't it? We get precisely those changes undertaken by process 2.
Similarly, the query below simply gets the net changes undertaken by process_id
1.
DECLARE @Begin_LSN BINARY(10), @End_LSN BINARY(10)
SELECT @Begin_LSN = BEGIN_LSN FROM PROCESS WHERE Process_ID=1
SELECT @End_LSN = END_LSN FROM PROCESS WHERE Process_ID=1
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Employee(
@Begin_LSN, @End_LSN,'ALL');
Hope this demo was somewhat helpful in providing insights on CDC and the utility section helped how CDC could be put to use.
Conclusion
SQL Server 2008 has taken this a step further with the advent of Audits, discussed in my next article. Audits come really handy in effectively keeping a check on the activities at a very granular level. It could be employed by DBAs in a very efficient manner to track smarter guys trying to gain any unauthorized access or executing DML (update/delete). All the details as to who fired the query, the timestamp, and the actual query can be recorded for helping DBA personnel report any unethical activities on the server. This will go a long way to boost the security of servers and help organizations preserve the most important aspect of their business, i.e., strictly confidential information. Security policies could now be designed and implemented at a very granular level with the helping hand of two awesome features: CDC and Audits.