Introduction
If you ever tried to create SCD2 lookup using standard Microsoft SSIS components (Lookup and Merge), you probably noticed that it quickly becomes the biggest bottleneck in your Data Flow. Although Merge seems reasonably quick, sometimes it can't be used because of the volumes of data in two tables and data itself (a lot of SCD2 changes per record).
The following picture from this blog shows just how slow Lookup and Merge components can be comparing to Script Lookup which I will present in this trick:
There are different approaches you could use to solve this performance issue:
- Convince your client that they actually don't need SCD2 type dimension
- Multicast your data flow onto several Lookup components (parallel execution)
- Buy some fancy SSIS component pack which provides multi-thread SCD2 lookup component
- Etc.
It this tip, I will describe an approach which uses standard Microsoft SSIS components and one very clever piece of code I found in the above mentioned blog which I modified to use any type of column.
Background
Because this is not an article, I won't bother you too much with explaining the code. If you want to completely understand the power behind the code, please familiarize yourself with SCD2, C# Dictionaries and B-tree structures.
Using the Code
Script lookup is used in a similar way as standard Lookup component. After script component is placed in Data Flow and connected to data source above it, copy the attached C# code and paste it into Script component. Note that code is written in C#.
Note that you will need to manually create output columns on this script component.
The only thing left is to modify few lines of code at the beginning of the script. Let's see how to do this and what each configurable line means.
Let's take an example from the attached code:
public Dictionary<String, String> SurrogateKeys = new Dictionary<String, String>{
{ "AccountNumber", "AccountNumber"} };
public readonly Dictionary<string, string> Mappings = new Dictionary<string, string>
{ { "CardId", "CardNumber" } }; public readonly String[] DimDateRangeKeys = { "ValidFrom", "ValidTo" }; public const String FactDateKey = "ViewStart";
public const String TableName = "[dbo].[DimCard]";
public readonly bool IsDateRangeOverlapped = false;
public readonly bool IsDateTimeMode = false;
The best way to describe each variable usage is through the example. Let's say our Data Flow uses one source of data (OLE DB Source for example) and it reads all columns from fact table named Fact1
. T-SQL equivalent would be:
SELECT F.*
FROM Fact1 F
So, let's start with the variables.
TableName
This variable represents the name of the (dimension) table which we want to use in lookup and which contains historic (SCD type 2) data. In our case, it is table DimCard
.
Mappings
Similar to Lookup
component, we need to map input columns with columns from lookup
table in script
component. In our case, we are joining input data flow (Fact1
) with DimCard
table using CardId
column from fact table and CardNumber
column from DimCard
table. T-SQL equivalent would be:
SELECT F.*
FROM Fact1 F
JOIN DimCard C on C.CardNumber = F.CardId
DimDateRangeKeys & FactDateKey
As SCD2 is time dependent, we need to define date
or datetime
columns we want to use to join fact and dim (lookup) tables.
SCD2 dimension table must have from
and to
date columns which represent record validity period. These two columns are then stated in DimDateRangeKeys
.
FactDateKey
represent column in fact table used to "probe" dimension table.
T-SQL equivalent would be:
SELECT F.*
FROM Fact1 F
JOIN DimCard C on C.CardNumber = F.CardId and F.ViewStart between C.ValidFrom and C.ValidTo
SurrogateKeys
These are output (lookup) columns from the script component. The first element is column name from lookup table and the second is output alias.
For example:
SELECT F.*, C.AccountNumber as AccountNumber
FROM Fact1 F
JOIN DimCard C on C.CardNumber = F.CardId and F.ViewStart between C.ValidFrom and C.ValidTo
IsDateRangeOverlapped
Sometimes, SCD2 type dimension ValidTo
value from previous and ValidFrom
value from current record can overlap. In that case, you need to set this parameter to true
to avoid duplication of data.
IsDateTimeMode
ValidFrom
and ValidTo
in dimension tables are usually of date
type, but some systems use datetime
column type. In that case, you need to set this parameter to true
to make script work properly.
Points of Interest
I have tested this code several times and it is significantly faster than standard Lookup
component.
A small downside of the attached code is that it works only in Full Cache mode as I didn't have time to implement Partial Cache. Thus, you need to be careful when using it - if your lookup (dimension) table has many rows, component might take a lot of memory during package execution and performance might degrade. If you are good in C# and you managed to understand the code, you could fix this by implementing partial cache.
A big downside in using this approach is maintenance. What if code has a bug and you already used it in 37 places in your project? You could search and replace piece(s) of code, but script components are validated only when it's script editor is 'closed'. At the time of writing this tip, I haven't found a good way of updating (refreshing) script components. This can be solved easily if you are able to find "Script Component Refresher" or if you create a Custom SSIS Component.
History
- December 2015 - Initial version