Introduction
I'm working on a Big Data-warehouse. We have data from number of transactional systems having different database environment and architecture. COBOL flat files, Oracle, SQL server are the main data sources. Nightly process imports daily transactional data from all sources (incremental changes), and combines to form a Single Customer View database.
By comparing Surname, first character of Firstname, DOB, Sex, first two address lines, the ETL process could find Common customers from all transactional systems - Create HASHBYTE
checksum value based on the above criteria field and use a GROUP BY
on Checksum
field. Hence, ETL process creates a De-duplicated Customer
table.
Then the new Staging Customer table is updated with latest contact details (Phone
, Mobile
, email
, address
, etc.). Also, Customer
table undergoes certain data standardizing and verification. For example, make all phone numbers to international format, validate email address, remove bad data, etc.
Each unique Customer
gets a Customer ID (CID
). CID
may merge or de-merge as a part of the changes on the transactional system on the next ETL process. I'm describing only about Customer
table in this article, but ETL process creates number of other tables (e.g., PRODUCT
) based on the CID
.
Finally, I have to synchronize Staging Customer
table with a Production Live Customer table which is too by at daytime and not free even midnight as Web and Mobile APIs are live. As Customer
table has millions of records and chances of CID
merges, it is not easy to update the Live Customer table from staging Customer
table without considerable cost. This process takes a long time and causes other application suffering from timeout. To update live table, I follow the steps below:
- Lock
Customer
table - Load any new
customer
created on live table during ETL process - Truncate live
customer
table - Load data from Staging
Customer
table - Re-build all indexes
- Unlock
Customer
table
Now the issue is above steps taking almost 5 - 10 mins and other Web and Mobile apps suffer from timeout. I have resolved the issue using Temporary table (DROP
and RENAME
). System worked and only lock customer
table less than a second!
Steps:
- Create a temporary
Customer
table on Production
database. - Load data to temporary table from Staging area.
- Create necessary indexes on Temporary table (in my case, a Full-text index as well).
- Load any new
customer
created on live table during ETL process to temporary table. DROP
live Customer
table RENAME
temporary table to live Customer
table
Using the Code
Code creates tmp_Customer
on Production
database and load data from staging table. Next step creates all necessary indices. Collect any missing customer
details from live customer
table - Customer
s created on Production
during the ETL process. Finally, DROP
live customer
table and RENAME
temporary table to live. Code only showing how Customer
table is refreshing on production. You could repeat the code for all other tables. Take a backup of necessary tables before you proceed.
IF OBJECT_ID('dbo.tmp_Customer', 'U') IS NOT NULL DROP TABLE dbo.tmp_Customer;
CREATE TABLE [dbo].[tmp_Customer](
[CID] [bigint] NOT NULL,
[FirstName] [varchar](50) NULL,
[SurName] [varchar](50) NULL,
[Address1] [varchar](50) NULL,
[Address2] [varchar](50) NULL,
[Address3] [varchar](50) NULL,
[Address4] [varchar](50) NULL,
[Address5] [varchar](50) NULL,
[Title] [varchar](10) NULL,
[Sex] [varchar](1) NULL,
[DOB] [varchar](12) NULL,
[Fulladdress] [varchar](200) NULL,
[email] [varchar](100) NULL,
[Mobile] [varchar](20) NULL,
[Fullname] [varchar](100) NULL
PRIMARY KEY CLUSTERED
(
[CID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
insert into tmp_customer
( CID, FirstName, SurName, Address1, Address2, Address3, _
Address4, Title, Sex, DOB, Fulladdress, email, Mobile, Fullname)
select CID, FirstName, SurName, Address1, Address2, Address3, _
Address4, Title, Sex, DOB, Fulladdress, email, Mobile, Fullname from CCF
CREATE NONCLUSTERED INDEX [NonClusteredIndex-Mobile] ON [dbo].[tmp_Customer]
(
[Mobile] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, _
SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, _
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
declare @sql varchar(max)
Declare @CONSTRAINT_NAME varchar (256)
SELECT @CONSTRAINT_NAME=KU.CONSTRAINT_NAME
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS TC
INNER JOIN
INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS KU
ON TC.CONSTRAINT_TYPE = 'PRIMARY KEY' AND
TC.CONSTRAINT_NAME = KU.CONSTRAINT_NAME AND
KU.table_name='tmp_customer'
ORDER BY KU.TABLE_NAME, KU.ORDINAL_POSITION
set @sql=
'CREATE FULLTEXT INDEX ON tmp_customer
(
Email
Language 1033,
Fullname
Language 1033,
Fulladdress
Language 1033
)
KEY INDEX ' + @CONSTRAINT_NAME + '
ON Customer_FullText
WITH STOPLIST = OFF
-- SEARCH PROPERTY LIST = OFF
'
EXEC (@sql)
INSERT INTO tmp_Customer
( CID, FirstName, SurName, Address1, Address2, Address3, Address4, _
Address5, Title, Sex, DOB, Fulladdress, email, Mobile, Fullname)
SELECT CID, FirstName, SurName, Address1, Address2, Address3, Address4, _
Address5, Title, Sex, DOB, Fulladdress, email, Mobile, Fullname
FROM Customer P WHERE NOT EXISTS (SELECT NULL FROM tmp_Customer F WHERE P.CID = F.CID)
IF OBJECT_ID('dbo.Customer', 'U') IS NOT NULL DROP TABLE dbo.Customer;
EXEC sp_rename 'tmp_customer', 'Customer'
History
- 5th June, 2019: Initial version