Introduction
This article was written to demonstrate using a script transform to convert Julian dates and create extra columns for building a date dimension. It will show the basics of wiring up one of these script components and how to use it to create an output row for every row input/processed. I had trouble finding useful examples the first time that I tried to use one of these SSIS components and I wrote this believing that others might benefit from what I learned. I also want to include a warning. I believe there is a limitation limitation in this code for dates prior to 2000. I have an older version of it in vb6 where it checks to see if a date is prior to 2000 by checking to see if the length is less than 6. It may also need to be modified to handle leap years.
Background
I wrote this script transform as part of a project to move
JD Edwards transactional data out of an OLTP system with a DB2 backend. The
company had decided to close several lines of business and would no longer need
to use the JD Edwards system for new transactions. Looking ahead they decided
that they were not going to renew their license for JD Edwards and would
therefore lose access to the system. A decision was also made to sell off the
IBM iSeries hardware that ran the system.
My job was to create a copy of the data on a SQL Server
database and create reports that would let them review the data at any point in
the future. SQL Server was chosen because the remaining business segment would continue
to maintain the databases on that server to support continuing operations. So I
had a new Data Warehouse project to develop.
The company used JD Edwards for order transactions and so
one of the first parts that I needed to build was an order detail fact table
and some dimensions such as product, customer and date. So far it looks like a pretty typical Data
Warehouse project. In the order detail table there were three important dates.
The first was the transaction date, the second was the ship date and from an
accounting standpoint the all-important invoice date which is when the revenue
is recognized.
The problem I ran into was that the dates were in a format
called Julian. The JD Edwards system does a fine job of presenting a normal
looking date in the user interfaces and reports but the underlying date in the
system was in this Julian format. So for example 1/1/2012 is represented as 112001
and 2/1/2012 is 112032. The first three digits indicate the year. 2011 would be
111. The last three are the day of the year.
I decided that since the company only needed the last four
or five years of history the date dimension would not be very large, only about
365 rows for each year minus weekends, holidays and any other days where there
was no activity. I would use the Julian date as the key to the date dimension
and assign it to an integer data type for efficiency.
What follows is the part of the ETL process which converts
the Julian date into a regular date for populating the date dimension with all
existing dates and new dates entered between now and the end of the year when
the software will no longer be used.
Building the ETL
So my ETL process for the basic conversion looks like this:
I have an OLE DB data source that looks at the order detail
table as it exists on the SQL Server and it queries a distinct list of the
dates.
For simplicity what you see above is a single column
retrieved from the OrderDetail table.
The next step is to add a script component to the data flow.
When you do this you are presented with the choice to use the script component
for a source, destination or a transform. In this case the choice is
transformation.
Now you have a scrip component on the Data Flow and the
first thing that you want to do is connect it to the OLE DB Source so that it
has an input. If you skip this step you will not be able to fully configure the
component and if you try you will get the following warning:
Now right click on the script component and choose edit. The
left pane contains a list of items to configure and the main pane is where you
make changes and settings. So for the first item Script we can set the properties.
The following properties appear in bold and you can change them. You have Name,
Description and Script Language. I use C# for most of my projects so the code
example that follows will be in C#.
Now we can look at the next item which is input columns.
Below you will see the single column returned in the select statement. It is
always better to write an explicit SQL statement telling the SSIS OLE DB source
exactly what you want returned. This keeps things efficient by bringing back
only the data you need and preventing issues that may arise in the future if
the source table is modified. Notice the default name for the input column
(input 0). The names of the inputs and outputs are important because they match
the method signatures in the script code that you see when you click the edit
script button that you see in the image above.
Ok, since InvoiceDate is checked off in the above image we
are done with this item. If we were bringing in multiple columns or the entire
table we could chose columns using the check boxes. Now we move on the next and
final item that we need to configure which is the Inputs and Outputs item. In
the image below I have expanded both inputs and outputs. The input is what you
would expect, InvoiceDate. Notice that we cannot set the data type because the component
knows the underlying data type of that column. In the JD Edwards source database the Julian date field might even be a string but I am using it as an
integer in my project because it will be the key to the date dimension and it
will be an index and integers are more efficient in an index. It also just makes
sense because incrementing the field as an integer does exactly what you would
expect to the date value and that is, it adds a day within the domain of 0-365.
Also notice that there is an output (Output 0) but there are
no columns. You should also know that you can rename both the input(s) and the
output(s). If you do they will be changed in the script. The point is that if
you are going to change them then it is best to do it before you code the
script section. Now in this example I want to add two columns to the output and
I do this by using the add column button. This button will be enabled after
clicking on and highlighting the Output Columns folder.
In the image below you can see that I have added two columns
which are JulianDate and RegularDate. I want the Julian date column to have the
original Julian date as it appears in the source table and the regular date
will be a normal date in the usual format (or at least a format that can be
modified by common built-in functions).
OK now we can do some coding by choosing Script, the first
item on the left. When you do so, you will once again have access to the Edit Script
button that appears in one of the images above. Below is what you would see in
this example which uses the default input and output names.
I know that it is somewhat hard to see the details but you
will see it clearly on your machine. After adding code my Input0_ProcessInputRow
looks like this:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
String julianDate = Convert.ToString(Row.InvoiceDate);
DateTime dt_date;
if (julianDate.Length == 6)
{
try
{
string sDay = julianDate.Substring(3);
string sYear = "20" + julianDate.Substring(1, 2);
dt_date = DateTime.Now;
int iCurrentDay = Convert.ToInt16(sDay) - dt_date.DayOfYear;
dt_date = dt_date.AddDays(iCurrentDay);
dt_date = dt_date.AddYears(Convert.ToInt16(sYear) - dt_date.Year);
julianDate = Convert.ToString(dt_date.ToShortDateString());
Output0Buffer.AddRow();
Output0Buffer.RegularDate = Convert.ToDateTime(julianDate);
Output0Buffer.JulianDate = Row.InvoiceDate;
}
catch
{
}
}
}
The first part of the code does the Julian date conversion and
it is pretty straight forward. The last three lines in the try block create the
output on a row by row basis without any need for a loop. notice the use of
the name Output0
in the Output0Buffer
statements. If you change the input name
the method name will change. If you change the output name you will have to
adapt these statements as well so do any renaming first. So the first of those
statements adds an output row for each input row. Because I have added two
output columns I can now output both the regular date in the second statement
and the original Julian date in the third statement. I will be adding
additional columns for dayOfWeek, Year etc.
Now when I put a dataviewer on the input of the script component
and one on the output and send it to a Union All to hold the data and enable me
to run it you can see one column of Julian values going in and two columns
coming out (one Julian and one Regular). If you would like to see some Webinars
that demo SSIS tasks try Pragmatic Works and look on the training menu. You can
choose to sign up for future webinars and view past ones as well.
Conclusion
Hopefully you can see the power available in this component. Best practice says that you should not use the script component
to accomplish what the existing components can do for you because they will typically be more efficient and besides why waste
the time (unless of course you enjoy it like I do " src="http://www.codeproject.com/script/Forums/Images/smiley_smile.gif" />.
I also hope that this example will help someone get started. The first time I tried to use this component I did so without
any instruction and I did not understand the nuances of adding columns, out put rows and input processing.
Once I figured out how all that worked I was off and running, hopefully you will be too!