Introduction
Loading Excel files with same schema but different file name
or sheet name is a challenge for many SSIS developers. Microsoft does not have
any Dynamic Excel loading component that search Excel files using a pattern and
indentify the sheet contains data and load it.
The article helps you to challenge such scenarios. The
package described here is able to search Excel files using a pattern and load
it without any user intervention. The
Sheet contains data does not have start at first row or first column.
Features
- Filename can be anything(ie it should match the
pattern used for searching)
- Format of excel can be any format (.xls – Office
2003, .xlsx – Office 2007 or later)
- Data to be loaded can be any sheet. Zero
dependency on sheet name.
- Data can start at any different column or row,
ie it does not have to be at A1 or A2 or B10 etc.
- Report out if any additional fields are added to
the excel data after the design.
- Archive processed files to Processed folder.
Building the package
- Create Folder structure(please feel free to create anywhere, for simplicity I use C:\ drive).
a. Create folder SSISLoad in C:\ b. Create subfolder in Data on C:\SSISLoad c. Create subfolder in Country on C:\SSISLoad\Data d. Create subfolder in Test on C:\SSISLoad\Data\Country e. Create subfolder in Processed on C:\SSISLoad\Data\Country<o:p> 2. Package Design Preparations
a. Identify the Sheet that contains data. Rename it to Data b. Delete empty rows and columns such that data starts at cell A1 c. Save it as Country Details - XX.xlsx & Copy to C:\SSISLoad\Data\Country\Test\ d. Create a new SSIS Package Load Dynamic Excel.dtsx e. Create and Set package variable (string type) Input_Share_Root as C:\SSISLoad\Data\ f. Create and Set package variable (string type) DataFile_Input_Path as Country\ g. Create and Set package variable (string type) DataFile_Search_Pattern as Country;.xlsx h. Create package variable (string type) NewFieldsSummary i. Create package variable (string type) NewFields j. Create package variable (boolean type) FileFound k. Create package variable (string type) SourceFile l. Create package variable (string type) Extraction_Statement m. Create package variable (string type) Connection_String Save the package. 3. Set package variable Connection_String
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\SSISLoad\Data\Country\Test\Country Details - XX.xlsx;Extended Properties="EXCEL 12.0 XML;HDR=YES;IMEX=1"; 4. Preparing package variable Extraction_Statement
a. Add a DataFlow Component to package(Drag DataFlow from Tools), name it as Load Excel b. Add a Excel connection to package(Right Click Connection manager area >> New Connection >> Excel >> Browse and Point to C:\SSISLoad\Data\Country\Test\ Country Details - XX.xlsx Name the connection as Excel_Source c. Right click Excel_Source connection >>Properties >> Expression >> Click Ellipse button i. Pick Connection String for Property from drop down , click Ellipse on Expression ii. Set expression as @[User::Connection_String] or Drag it from Variables on the expression builder window. d. Go to Load Excel data flow and add Excel Source by dragging from Tool Box, name it as Country Details e. Edit Country Details f. Set OLEDB Connection manager as Excel_Source (from drop down) g. Set Data Access Mode as SQL Command (from drop down) h. Click Build Query i. Click Add Table symbol and Select Data$ and click Add, Click Close j. Manually check each and every column, do not check *(All Columns) k. Copy the SQL Generated from the query pane: SELECT ID, StateProvinceCode, CountryRegionCode, IsOnlyStateProvinceFlag, Name, TerritoryID
FROM [Data$]
l. Cancel the Build Query process. Cancel Excel Source Editor window. m. Make the SQL in single line and set as value for Extraction_Statement n. Save the package. 5. Set Country Details excel source component
a. Edit Country Details and set Connection manager as Excel_Source (from drop down) b. Set Data Access Mode as SQL Command from variable(from drop down) c. Set Variable name as User:: Extraction_Statement (from drop down) d. Click Columns from Left pane. e. See all columns are populated (you may uncheck any column you do not want). Click OK 6. Set Destination as you like and map columns.
a. Create OLEDB Destination Connection manager OLEDB_Destination_Conn (sample used local server, Test as DB b. Create table CountryDetails and map columns
|
|
|
|
Write Automation Code
1. Search excel file using the pattern specified in variable DataFile_Search_Pattern.
This pattern separated by extension using semicolon format
2. Enumerate Excel sheets for the file obtained by Search
a.Use OLEDB driver to create connection to Excel and read schema.
3. Identify Sheet contains data and data start address.
a.Search each sheet for fields used in Extraction_Statement and determine sheetname and data start address.
b.Modify connection string as per the excel file.
Using the Code
How it works(Concept)
The package search for the file using the specification supplied, determines the sheet contains data and its start address(like A1 OR C10 etc), and finally modifies the connection string to the Excel file to point the file obtained and modifies SQL Query to read the excel file from the sheet identified. This level of dynamic loading is achieved with the help of user defined variables in the package, and thus the run-time setting of Connection Managers, SQL Command to read/pull data are updated with the help of variables.
Let's now analyze concept (code perspective) as below:
The following method searches file in the path supplied. It uses matching pattern as parameter searchstring
and use parameter extension
as filter
public string SearchFile(string path, string extension, string searchstring)
{
DirectoryInfo di = null;
if (Directory.Exists(path))
{
di = new DirectoryInfo(path);
}
else
return "Directory Does not Exist";
string newestFile;
IEnumerable<System.IO.FileInfo> fileList = di.GetFiles("*" + searchstring.ToLower() + "*");
IEnumerable<System.IO.FileInfo> fileQuery =
from file in fileList
where (extension.ToLower().Contains(file.Extension.ToLower()))
orderby file.LastWriteTime
select file;
try
{
var FileSearchedResult = (from file in fileQuery orderby file.LastWriteTime select new { file.FullName, file.Name, file.CreationTime }).Last();
newestFile = FileSearchedResult.FullName;
FileSearchedResult = null;
fileList = null;
di = null;
return newestFile;
}
catch
{
fileList = null;
di = null;
return null;
}
}
The following method GetExcleSheetNames enumerates all available sheets in the excelFile
supplied as parameter. It uses OLEDDB driver to connect the Excel file and reads the Schema using GetOleDbSchemaTable.
private List<String> GetExcelSheetNames(string excelFile)
{
OleDbConnection objConn = null;
System.Data.DataTable dt = null;
try
{
String connString = this.GetConnectionString(excelFile);
objConn = new OleDbConnection(connString);
objConn.Open();
dt = objConn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);
if (dt == null)
{
return null;
}
List<String> excelSheets = new List<string>();
foreach (DataRow row in dt.Rows)
{
string WorkSheet = row["TABLE_NAME"].ToString();
WorkSheet = WorkSheet.StartsWith("'") ? "[" + WorkSheet.Substring(1) : WorkSheet;
WorkSheet = WorkSheet.StartsWith("[") ? WorkSheet : "[" + WorkSheet;
WorkSheet = WorkSheet.EndsWith("'") ? WorkSheet.Substring(0, WorkSheet.Length - 1) + "]" : WorkSheet;
WorkSheet = WorkSheet.Substring(0, WorkSheet.Length - 1).EndsWith("$") ? WorkSheet : WorkSheet.Substring(0, WorkSheet.Length - 1) + "$]";
if (!excelSheets.Exists(delegate(string k) { return (k.ToLower() == WorkSheet.ToLower() || k.ToLower() + "$" == WorkSheet.ToLower()); }))
excelSheets.Add(WorkSheet);
}
return excelSheets;
}
catch (Exception ex)
{
return null;
}
finally
{
if (objConn != null)
{
objConn.Close();
objConn.Dispose();
}
if (dt != null)
{
dt.Dispose();
}
}
}
The following method ScanWorkSheet builds DataSet for first 1000 rows from WorkSheet
(parameter)
private SourceFileDetails ScanWorkSheet(string excelFile, string WorkSheet, string ExtractionStatement, string FindColumn, out bool Success)
{
Success = false;
System.Data.DataSet excelDataSet = new DataSet();
string connectionString = this.GetConnectionString(excelFile);
using (OleDbConnection objConn = new System.Data.OleDb.OleDbConnection(connectionString))
{
try
{
objConn.Open();
OleDbDataAdapter cmd = new OleDbDataAdapter("select top 1000 * from " + WorkSheet, objConn);
cmd.Fill(excelDataSet, WorkSheet);
cmd.Dispose();
}
catch { }
finally
{
if (objConn != null)
{
objConn.Close();
objConn.Dispose();
}
}
}
SourceFileDetails sd = ScanRows(excelDataSet, ExtractionStatement, FindColumn);
if (sd != null)
{
Success = true;
sd.FileNamePath = excelFile;
sd.ConnectionString = this.GetConnectionString(excelFile, true);
return sd;
}
return (SourceFileDetails)null;
}
The following method ScanRows scans each row find our interested columns. This is the way it identifies sheet contains data, and data start address. They key step in this method is build the list of columns/fields we are interested and search for it. Here we assume those fields/columns occur within 1000 rows. 1000 rows are selected here for performance and nobody create an excel file contains data after leaving first 1000 rows as blank.
public SourceFileDetails ScanRows(DataSet excelDataSet, string ExtractionStatement, string FindColumn)
{
if (excelDataSet.Tables.Count < 1)
return (SourceFileDetails)null;
string ExtractFields = this.ReplaceString(this.ReplaceString(ExtractionStatement, "SELECT", ""), "From [Data$]", "");
List<string> FindStrings = ExtractFields.Split(',').Select(s => s.Trim().Replace("[", "").Replace("]", "").Replace("#", ".").Replace("(", "").Replace(")", "")).ToList();
foreach (DataTable dt in excelDataSet.Tables)
{
List<ItemValueHolder> FoundAddress = new List<ItemValueHolder>();
int iDuplicates = 0;
foreach (DataRow dr in dt.Rows)
{
int iItemColumnIndex = 0;
foreach (var fieldValue in dr.ItemArray)
{
object cellData = fieldValue;
string sCellData = cellData.ToString().Replace("#", ".").Replace("[", "").Replace("]", "").Replace("(", "").Replace(")", "");
if (cellData != null)
if (FindStrings.Exists(delegate(string k) { return k.ToLower() == sCellData.ToString().ToLower(); }))
{
if (!FoundAddress.Exists(delegate(ItemValueHolder t) { return t.Item.ToLower() == sCellData.ToLower(); }))
FoundAddress.Add(new ItemValueHolder(sCellData, dt.Rows.IndexOf(dr), iItemColumnIndex));
else
iDuplicates++;
}
iItemColumnIndex++;
}
int iTotalFields = FindStrings.Count - iDuplicates;
if (100 * FoundAddress.Count / (float)iTotalFields >= 90.00)
{
var query = FoundAddress.GroupBy(
item => item.ItemValue,
(itemvalue, items) => new
{
Key = (int)itemvalue,
Count = items.Count(),
});
int dataStartAddress = (from p in query
where p.Count == (query.Max(it => it.Count))
select p.Key).Max();
dataStartAddress += 1;
string SheetName = "[" + dt.TableName + "$A" + dataStartAddress.ToString() + ":IV]";
string _selectCommand = this.ReplaceString(ExtractionStatement, "[Data$]", SheetName);
string reportDate = "";
if (!string.IsNullOrEmpty(FindColumn))
try
{
int ColIndex = (from p in FoundAddress
where p.Item.ToLower() == FindColumn.ToLower()
select p.ColumnIndex).First();
DataRow drRowReport = dt.Rows[dataStartAddress];
reportDate = drRowReport[ColIndex].ToString();
}
catch (Exception e)
{
}
string NewFields = GetNewFieldsAtSource(FindStrings, dr);
return new SourceFileDetails("", "", "", _selectCommand, reportDate, NewFields);
}
}
}
return (SourceFileDetails)null;
}
The following class is used to store Excel file details that we can use to modify connection strings dynamically . The class is nothing but a information bundle; no other important methods that we use inside this class.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ST_b050e37a03e84ca7b15b853949d29aa6.csproj
{
public class SourceFileDetails
{
private string _fileNamePath = "";
public string FileNamePath
{
get { return _fileNamePath; }
set { _fileNamePath = value; }
}
private string _fileExtension = "";
public string FileExtension
{
get { return _fileExtension; }
set { _fileExtension = value; }
}
private string _connectionString = "";
public string ConnectionString
{
get { return _connectionString; }
set { _connectionString = value; }
}
private string _extractQuery = "";
public string ExtractQuery
{
get { return _extractQuery; }
set { _extractQuery = value; }
}
private string _reportDate = "";
public string ReportDate
{
get { return _reportDate; }
set { _reportDate = value; }
}
private string _newFields = "";
public string NewFields
{
get { return _newFields; }
set { _newFields = value; }
}
public SourceFileDetails() { }
public SourceFileDetails(string fileNamePath, string fileExtension, string connectionString, string extractQuery, string reportDate, string newFields)
{
this.FileNamePath = fileNamePath;
this.FileExtension = fileExtension;
this.ConnectionString = connectionString;
this.ExtractQuery = extractQuery;
this.ReportDate = reportDate;
this.NewFields = newFields;
}
}
}
The following class is again an intermediate information storage and no other functionality.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ST_b050e37a03e84ca7b15b853949d29aa6.csproj
{
public class ItemValueHolder
{
private string _item = "";
public string Item
{
get { return _item; }
set { _item = value; }
}
private object _itemValue = "";
public object ItemValue
{
get { return _itemValue; }
set { _itemValue = value; }
}
private int _columnIndex;
public int ColumnIndex
{
get { return _columnIndex; }
set { _columnIndex = value; }
}
public ItemValueHolder() { }
public ItemValueHolder(string item, object itemValue, int columnIndex)
{
this.Item = item;
this.ItemValue = itemValue;
this.ColumnIndex = columnIndex;
}
}
}
The following method builds the connection string for the particular excel file supplied. It can build connection string irrespective of data has Header row not. Variable HasHeader determines if we have Header row or not.
public string GetConnectionString(string FileNamePath, bool HasHeader)
{
string ConnectionString = "";
string Extension = Path.GetExtension(FileNamePath).ToLower();
string BinaryExcelProvider = "Microsoft.Jet.OLEDB.4.0";
string XmlExcelProvider = "Microsoft.ACE.OLEDB.12.0";
string BinaryExcelExtProperties = "Excel 8.0";
string XmlExcelExtProperties = "Excel 12.0";
string XmlMacroExcelExtProperties = "EXCEL 12.0 Macro";
string Provider = "";
string ExtendedProperties = "";
switch (Extension)
{
case ".xls":
Provider = BinaryExcelProvider;
ExtendedProperties = BinaryExcelExtProperties;
break;
case ".xlsx":
Provider = XmlExcelProvider;
ExtendedProperties = XmlExcelExtProperties;
break;
case ".xlsm":
Provider = XmlExcelProvider;
ExtendedProperties = XmlMacroExcelExtProperties;
break;
}
string Header = ";HDR=NO;IMEX=1";
if (HasHeader)
Header = ";HDR=YES;IMEX=1";
string ConnectionStringFormat = "Provider={0};Data Source={1};Extended Properties=\"{2}{3}\";";
ConnectionString = string.Format(ConnectionStringFormat, Provider, FileNamePath, ExtendedProperties, Header);
return ConnectionString;
}
All codes, packages(SSIS Project Solution file), sql scripts to generate destination table and sample excel files that can used for testing are attached.
Points of Interest
Multiple files loading with for each file enumerator instead using Search method is also uploaded. Please come back with questions or doubts, I will be glad to help you.
History
v3