Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / DevOps / testing

ETL Testing, If Only An Automation Rule Testing Engine was Possible …

5.00/5 (2 votes)
1 Feb 2018CPOL7 min read 9.2K  
ETL Testing and data-validation, dynamic SQL rule generator

Data Warehouse projects are usually complex as there are a lot of involved variables. Project managers and data analysts must understand the client needs, must know and have proper access to data sources (that sometimes is not as simple as it may seem ...). After that, proper outcomes must be suggested to client, the underlying logic data model must be built, a proper infrastructure must exist, a possible ODS database, a staging area database, the final Data Warehouse database must also exist. On the other hand, you need powerful (and expensive) machines to support all that, people to manage those machines, database administrators, system administrators. Or, up to a certain point you have cloud services … that also cost money.

To explore data, a user friendly tool is usually adopted and it also requires machines, installation processes, metadata definition, report design.

After all this is built, a permanent ETL and reporting time response must be monitored so proper actions can be made in case the Data Warehouse reduces its performance. Actions can be at physical level, at software or development level.

As a side note, usually the fastest (not the cheapest) option is: let’s buy more powerful machines … my opinion is to first understand the critical paths and bottlenecks (and I am referring to ETL chain sequence and ETL testing here) as I saw queries response time reducing from one hour to three minutes after a rewriting approach or an index creation.

No doubt, DWH projects are complex as in the end they must provide accurate information to final users. CEOs and CFOs must trust their outcomes as they make decisions based in their reports. So ETL testing is crucial but not a simple task.

Once data can come from a lot of different sources, errors can happen in a lot of different stages. Data Warehouse and ETL developers must be aware that some tests should be part of pre-load data quality processes, others must be executed after data is loaded and others can and should even be executed during the ETL process so data can be purged for verification. While some are obvious, others are not simple decisions as some tests may depend on data aggregation that only happens when the main ETL process runs or can even depend on some lookup stage that also happens when the lookup processes start. Visual schemas are pretty but each project corresponds to a specific situation and all you can read in books and in web should be looked at as guidelines. By the way, take a look at these links as they are interesting (DWH testing link and ETL testing related tools).

I will dedicate the remaining article to writing about the approach of a test rules validation engine. It should be more pointed to validating data after it is loaded although it would not be limited to it. In the end, the final aim is to properly identify the rows that for some reason do not pass a specific rule defined in the rule engine system. There is no other way, if the data warehouse was built over SQL based database engines, prepare yourself to build really complex SQL queries (and I mean really complex and big ones as some tests are really difficult to make).
The base idea is to have quite a simple database relational model that will store rules definitions, their parameters and will run them when required (usually in the end of the ETL process). In the end, we expect to identify the rows that are causing problems. A possible approach could be something like the model below:

Image 1

Entity tbl_Rule stores master rule information, its unique id, name, its type (important when implementing its underlying logic) and a description. Tbl_Outcome will store each test / validation rule outcome. On each row, one will be able to identify the rule that was tested, against what table and each fact surrogate key value that will allow identifying the rows that did not pass the test. Each rule will be linked to one or more parameter sets. The parameter set will define the master info for a group of parameters. Each parameter set can have 1 or N parameters. The query generator engine will know how to handle them (yes, you will have to program to achieve this, there are no miracles).

To test this approach, you should create the required tables and the required procedure to generate the SQL queries based in the parameters defined in tables. Let’s proceed, to create the tables, you should use the following SQL script:

SQL
CREATE TABLE [dbo].[tbl_Outcome](
    [id_Rule_Parameter_Set_Fk] [int] NOT NULL,
    [id_Fact_Table] [int] NOT NULL,
    [fact_Surrogate_Key] [int] NOT NULL,
 CONSTRAINT [PK_tbl_Outcome] PRIMARY KEY CLUSTERED
(
    [id_Rule_Parameter_Set_Fk] ASC,
    [id_Fact_Table] ASC,
    [fact_Surrogate_Key] ASC
)

CREATE TABLE [dbo].[tbl_Parameter](
    [id_Parameter_Set_Fk] [int] NOT NULL,
    [parameter_Name] [nvarchar](50) NOT NULL,
    [parameter_Value] [nvarchar](50) NOT NULL,
    [parameter_Type] [nvarchar](5) NOT NULL,
 CONSTRAINT [PK_tbl_Parameter_Set] PRIMARY KEY CLUSTERED
(
    [id_Parameter_Set_Fk] ASC,
    [parameter_Name] ASC
)

CREATE TABLE [dbo].[tbl_Parameter_Set](
    [id_Parameter_Set] [int] IDENTITY(1,1) NOT NULL,
    [parameter_Set_Name] [nvarchar](50) NOT NULL,
 CONSTRAINT [PK_tbl_Parameter_Set_1] PRIMARY KEY CLUSTERED
(
    [id_Parameter_Set] ASC
)

CREATE TABLE [dbo].[tbl_Rule](
    [id_Rule] [int] IDENTITY(1,1) NOT NULL,
    [rule_Name] [nvarchar](100) NOT NULL,
    [rule_Type] [nvarchar](30) NOT NULL,
    [rule_Description] [nvarchar](max) NULL,
    [flag_execute] [nvarchar](1) NULL,
 CONSTRAINT [PK_tbl_Rule] PRIMARY KEY CLUSTERED
(
    [id_Rule] ASC
)

CREATE TABLE [dbo].[tbl_Rule_Parameter_Set](
    [id_Rule_Parameter_Set] [int] IDENTITY(1,1) NOT NULL,
    [id_Rule_Fk] [int] NOT NULL,
    [id_Parameter_Set_Fk] [int] NOT NULL,
 CONSTRAINT [PK_tbl_Rule_Parameter_Set] PRIMARY KEY CLUSTERED
(
    [id_Rule_Parameter_Set] ASC
)

Now let’s create all foreign keys:

SQL
ALTER TABLE [dbo].[tbl_Outcome]  WITH CHECK ADD  CONSTRAINT _
[FK_tbl_Outcome_tbl_Rule_Parameter_Set] FOREIGN KEY([id_Rule_Parameter_Set_Fk])
REFERENCES [dbo].[tbl_Rule_Parameter_Set] ([id_Rule_Parameter_Set])
ON UPDATE CASCADE
ON DELETE CASCADE
GO
ALTER TABLE [dbo].[tbl_Outcome] CHECK CONSTRAINT [FK_tbl_Outcome_tbl_Rule_Parameter_Set]
GO
ALTER TABLE [dbo].[tbl_Parameter]  WITH CHECK ADD  CONSTRAINT _
[FK_tbl_Parameter_tbl_Parameter_Set] FOREIGN KEY([id_Parameter_Set_Fk])
REFERENCES [dbo].[tbl_Parameter_Set] ([id_Parameter_Set])
ON UPDATE CASCADE
ON DELETE CASCADE
GO
ALTER TABLE [dbo].[tbl_Parameter] CHECK CONSTRAINT [FK_tbl_Parameter_tbl_Parameter_Set]
GO
ALTER TABLE [dbo].[tbl_Rule_Parameter_Set]  WITH CHECK ADD CONSTRAINT _
[FK_tbl_Rule_Parameter_Set_tbl_Parameter_Set] FOREIGN KEY([id_Parameter_Set_Fk])
REFERENCES [dbo].[tbl_Parameter_Set] ([id_Parameter_Set])
ON UPDATE CASCADE
ON DELETE CASCADE
GO
ALTER TABLE [dbo].[tbl_Rule_Parameter_Set] CHECK CONSTRAINT _
[FK_tbl_Rule_Parameter_Set_tbl_Parameter_Set]
GO
ALTER TABLE [dbo].[tbl_Rule_Parameter_Set]  WITH CHECK ADD  CONSTRAINT _
[FK_tbl_Rule_Parameter_Set_tbl_Rule] FOREIGN KEY([id_Rule_Fk])
REFERENCES [dbo].[tbl_Rule] ([id_Rule])
ON UPDATE CASCADE
ON DELETE CASCADE
GO
ALTER TABLE [dbo].[tbl_Rule_Parameter_Set] CHECK CONSTRAINT [FK_tbl_Rule_Parameter_Set_tbl_Rule]
GO

After the base tables are created, let’s insert some dummy rules parameters. The rules defined will generate outer join queries that validate if a specific Fact table misses some foreign key values when joining to a specific Dimension table. Those values are dynamic and defined at parameters table level.

The second defined rule will validate if a specific field in a specific Fact table is numeric. These must be looked at as dummy examples regarding ETL Testing as some could also apply to data validation rules.

All generated queries will insert data into tbl_Outcomes table so their syntax must be correct and generate all fields that table is expected to receive.

The procedure will be presented after. This one was developed using TSQL, but this approach can be developed using any development language that is able to connect to a database, pull data and treat strings. In the end, this code is only generating strings that ultimately are SQL Statements that will be executed.

Please check the insert statements first (pay attention to the insertion sequence as parent data must be inserted first - rules and parameter set are the parent tables).

SQL
SET IDENTITY_INSERT [dbo].[tbl_Rule] ON
GO
INSERT [dbo].[tbl_Rule] ([id_Rule], [rule_Name], [rule_Type], [rule_Description], _
[flag_execute]) VALUES (1, N'Foreign Key Validation', N'ForeignKey', N'Validates Missing FKs', N'T')
GO
INSERT [dbo].[tbl_Rule] ([id_Rule], [rule_Name], [rule_Type], [rule_Description], _
[flag_execute]) VALUES (4, N'Field Type Validation', N'IsNumber', N'Validates Field Type', N'T')
GO
SET IDENTITY_INSERT [dbo].[tbl_Rule] OFF

SET IDENTITY_INSERT [dbo].[tbl_Parameter_Set] ON
GO
INSERT [dbo].[tbl_Parameter_Set] ([id_Parameter_Set], [parameter_Set_Name]) _
VALUES (1, N'Foreign_Key_Validation_Column_A')
GO
INSERT [dbo].[tbl_Parameter_Set] ([id_Parameter_Set], [parameter_Set_Name]) _
VALUES (2, N'Foreign_Key_Validation_Column_B')
GO
INSERT [dbo].[tbl_Parameter_Set] ([id_Parameter_Set], [parameter_Set_Name]) _
VALUES (3, N'Is Number Validation Column Wage')
GO
INSERT [dbo].[tbl_Parameter_Set] ([id_Parameter_Set], [parameter_Set_Name]) _
VALUES (4, N'Is Number Validation Column Age')
GO
SET IDENTITY_INSERT [dbo].[tbl_Parameter_Set] OFF

SET IDENTITY_INSERT [dbo].[tbl_Rule_Parameter_Set] ON
GO
INSERT [dbo].[tbl_Rule_Parameter_Set] ([id_Rule_Parameter_Set], [id_Rule_Fk], _
[id_Parameter_Set_Fk]) VALUES (1, 1, 1)
GO
INSERT [dbo].[tbl_Rule_Parameter_Set] ([id_Rule_Parameter_Set], [id_Rule_Fk], _
[id_Parameter_Set_Fk]) VALUES (2, 1, 2)
GO
INSERT [dbo].[tbl_Rule_Parameter_Set] ([id_Rule_Parameter_Set], [id_Rule_Fk], _
[id_Parameter_Set_Fk]) VALUES (5, 4, 3)
GO
INSERT [dbo].[tbl_Rule_Parameter_Set] ([id_Rule_Parameter_Set], [id_Rule_Fk], _
[id_Parameter_Set_Fk]) VALUES (6, 4, 4)
GO
SET IDENTITY_INSERT [dbo].[tbl_Rule_Parameter_Set] OFF

SET IDENTITY_INSERT [dbo].[tbl_Parameter_Set] OFF
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (1, N'LeftColumn', N'Column_A_Fk', N'C')
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (1, N'LeftTable', N'fac_Table_A', N'C')
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (1, N'RightColumn', N'Column_A_sk', N'C')
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (1, N'RightTable', N'dim_Table_A', N'C')
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (2, N'LeftColumn', N'Wage_Range_Fk', N'C')
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (2, N'LeftTable', N'fac_Table_A', N'C')
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (2, N'RightColumn', N'Wage_Range_Fk', N'C')
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (2, N'RightTable', N'dim_Wage_Range', N'C')
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (3, N'ColumnName', N'valSalesMonth', N'N')
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (3, N'TableName', N'fac_Sales', N'C')
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (4, N'ColumnName', N'valAge', N'N')
GO
INSERT [dbo].[tbl_Parameter] ([id_Parameter_Set_Fk], [parameter_Name], _
[parameter_Value], [parameter_Type]) VALUES (4, N'TableName', N'fac_Sales', N'C')
GO

Now code must be implemented to interpret what we store in the rules parameters relational model. I used TSQL but any other programming language can be used as well. The procedure below will iterate through the defined rules and parameters in the database. Based in the rule type field, it should behave accordingly and look for the parameters that kind of rule is expected to use to build the desired database validation query.

This procedure is only building the query and will print them to the screen so the user understands the logic behind this approach (a full version should execute it). I have inserted only four queries generation parameters but imagine if you need to generate one hundred queries or even more, putting a machine generating and running them for you will make all sense.

Also to add new tests, only the required parameters need to be properly defined in the tables. Once there, the engine will execute all it finds. Please check the procedure outcome after its code below.

SQL
create proc [dbo].[sp_rule_execution] as
    declare @id_rule int, @rule_name nvarchar(100), @rule_type nvarchar(30), _
    @id_Parameter_Set_Fk int, @id_Rule_Parameter_Set int

    declare @dmlSql nvarchar(max)
    declare @id_Fact_Table int

    declare c_test_rules cursor local forward_only static read_only
    for
    select a.id_rule,
           a.rule_name,
           a.rule_type,
           b.id_Parameter_Set_Fk,
           b.id_Rule_Parameter_Set
    from dbo.tbl_rule a
    inner join dbo.tbl_Rule_Parameter_Set b
    on a.id_Rule = b.id_Rule_Fk
    where flag_execute = 'T'

    open c_test_rules

    fetch next
    from c_test_rules
    into  @id_rule,
          @rule_name,
          @rule_type,
          @id_Parameter_Set_Fk,
          @id_Rule_Parameter_Set

    while @@fetch_status = 0 --when fetch is successful
        begin
            /**********************************************************************************
            for this specific rule type a specific set of parameters is required
            this rule will run an outer join query checking for a specific foreign key
            if some value does not match any lookup dimension table
            understand it is possible foreign key constraints may be disabled during the
            load process or even not exist in physical model only in logical one
            ***********************************************************************************/
            if @rule_type = 'ForeignKey'
                begin
                    declare @lefttable nvarchar(50), @righttable nvarchar(50), _
                            @leftcolumn nvarchar(50), @rightcolumn nvarchar(50)

                    select @id_Fact_Table = isnull(object_id(@lefttable), -1)

                    select @lefttable = o.parameter_Value
                    from dbo.tbl_Parameter o
                    where o.id_Parameter_Set_Fk = @id_Parameter_Set_Fk and
                          o.parameter_Name = 'lefttable'

                    select @righttable = o.parameter_Value
                    from dbo.tbl_Parameter o
                    where o.id_Parameter_Set_Fk = @id_Parameter_Set_Fk and
                          o.parameter_Name = 'righttable'

                    select @leftcolumn = o.parameter_Value
                    from dbo.tbl_Parameter o
                    where o.id_Parameter_Set_Fk = @id_Parameter_Set_Fk and
                          o.parameter_Name = 'leftcolumn'

                    select @rightcolumn = o.parameter_Value
                    from dbo.tbl_Parameter o
                    where o.id_Parameter_Set_Fk = @id_Parameter_Set_Fk and
                          o.parameter_Name = 'rightcolumn'

                    set @dmlSql = '/* Rule Id ' + cast(@id_rule as nvarchar(5)) + ' / Parameter Set ' _
                            + cast(@id_Rule_Parameter_Set as nvarchar(5)) + ' */' + char(13) +
                                  'insert into dbo.tbl_Outcome ' + char(13) +
                                  'select ' + cast(@id_Rule_Parameter_Set as nvarchar(20)) + _
                                    ' as id_Rule_Parameter_Set_Fk, ' + _
                                    cast(@id_Fact_Table as nvarchar(5)) + ' as id_Fact_Table, _
                                    a.Surrogate_Key '  + char(13) +
                                  'from ' + @lefttable + ' a '  + char(13) +
                                  'left outer join ' + @righttable + ' b ' + char(13) +
                                  'on a.' + @leftcolumn + ' = b.' + @rightcolumn + ' ' + char(13) +
                                  'where b.' + @rightcolumn + ' is null'

                    print @dmlSql
                    print char(13)

                end
            /**********************************************************************************
            for this specific rule only a source table and column name are required
            ***********************************************************************************/
            else if @rule_type = 'IsNumber'
                begin
                    declare @TableName nvarchar(50), @ColumnName nvarchar(50)

                    select @id_Fact_Table = isnull(object_id(@lefttable), -1)

                    select @TableName = o.parameter_Value
                    from dbo.tbl_Parameter o
                    where o.id_Parameter_Set_Fk = @id_Parameter_Set_Fk and
                          o.parameter_Name = 'TableName'

                    select @ColumnName = o.parameter_Value
                    from dbo.tbl_Parameter o
                    where o.id_Parameter_Set_Fk = @id_Parameter_Set_Fk and
                          o.parameter_Name = 'ColumnName'

                    set @dmlSql = '/* Rule Id ' + cast(@id_rule as nvarchar(5)) + _
                           ' / Parameter Set ' + cast(@id_Rule_Parameter_Set as nvarchar(5)) + _
                           ' */' + char(13) +
                                  'insert into dbo.tbl_Outcome ' + char(13) +
                                  'select ' + cast(@id_Rule_Parameter_Set as nvarchar(20)) + _
                                     ' as id_Rule_Parameter_Set_Fk, ' + _
                                     cast(@id_Fact_Table as nvarchar(5)) + ' as id_Fact_Table, _
                                     a.Surrogate_Key '  + char(13) +
                                  'from ' + @TableName + ' a '  + char(13) +
                                  'where isnumeric(b.' + @ColumnName + ') = 0'

                    print @dmlSql
                    print char(13)
                end

            fetch next
            from c_test_rules
            into  @id_rule,
                  @rule_name,
                  @rule_type,
                  @id_Parameter_Set_Fk,
                  @id_Rule_Parameter_Set
        End

To execute the procedure, one just needs to call it. Of course, this step could (and even should) be automated in the ETL process.

SQL
exec [dbo].[sp_rule_execution]

The generated queries are:

SQL
/* Rule Id 1 / Parameter Set 1 */
insert into dbo.tbl_Outcome
select 1 as id_Rule_Parameter_Set_Fk, -1 as id_Fact_Table, a.Surrogate_Key
from fac_Table_A a
left outer join dim_Table_A b
on a.Column_A_Fk = b.Column_A_sk
where b.Column_A_sk is null

/* Rule Id 1 / Parameter Set 2 */
insert into dbo.tbl_Outcome
select 2 as id_Rule_Parameter_Set_Fk, -1 as id_Fact_Table, a.Surrogate_Key
from fac_Table_A a
left outer join dim_Wage_Range b
on a.Wage_Range_Fk = b.Wage_Range_Fk
where b.Wage_Range_Fk is null

/* Rule Id 4 / Parameter Set 5 */
insert into dbo.tbl_Outcome
select 5 as id_Rule_Parameter_Set_Fk, -1 as id_Fact_Table, a.Surrogate_Key
from fac_Sales a
where isnumeric(b.valSalesMonth) = 0

/* Rule Id 4 / Parameter Set 6 */
insert into dbo.tbl_Outcome
select 6 as id_Rule_Parameter_Set_Fk, -1 as id_Fact_Table, a.Surrogate_Key
from fac_Sales a
where isnumeric(b.valAge) = 0

Once executed, the above queries should insert into tbl_Outcomes all rows that do not validate them. Using that table, it would be possible to build reports over it and identify the columns and rows that returned problems in the ETL process sequence.

This approach is dedicated to validations related to loaded data but this logic can be applied to data quality validations as well as to all parts of the overall ETL testing process. Other validations and tests can and should be executed on a Data Warehouse project. Calculation validations (also possible with this “engine” approach), grouping validations, final comparison metrics validations (number of rows, maximum, minimum values, etc.). The database model could also consider the option of very specific queries that should be saved in the parameter set using a parameter name like “baseDmlQuery”, the rule type could be named “hard_coded_Query”.

Data Warehouse projects also require testing on report layout presentation, but those can hardly be automated.

For sure, this works as I even have more complex solutions developed. To manage the validation rules tables, the best approach is to have a proper visual interface that can be developed using Java, .NET or even Microsoft Access.

I hope this idea will help developers considering dynamic SQL generation as a possibility when a large sets of tests are required. As usual, creativity is the way to go and if you look for interesting reading related to ETL testing, take a look at these useful sources as well: Liaison and tutorialspoint.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)