Importing Excel Data to a Generic List Using Open XML SDK and Object Model Mapping

Shenwei Liu

5.00/5 (7 votes)

5 Jun 2014CPOL9 min read

68K

2.9K

C# class that populates a generic list with data retrieved from Excel data file

In this article, I will be presenting and explaining an Open XML SDK Library based ExcelReader class for importing Excel data to a generic list through the object model mapping.

Download source - 1.7 MB

Introduction

The generic list with an underlying data model is the most favorable collection structure used for the data application development due to its true object oriented programming nature. It's also the best form for importing data from the Microsoft Excel spreadsheet to applications and databases. This article and sample application shows a C# class that populates a generic list with the data retrieved from the Excel data file. The article especially addresses these topics:

Using Microsoft Open XML SDK library and merge DLL files to the main assembly. No Office Interop or other third party tool is needed.
Obtaining data values from the Excel Workbook shared string table.
Handling the empty spreadsheet cell issue during row iterations.
Detecting used range to eliminate the empty row processing.
Mapping spreadsheet data to object model with friendly column names instead of index numbers.
Standardizing the string to primitive data type conversions.

Running Sample Application

The downloaded source contains everything for running the sample application with the Visual Studio 2010 - 2013. An Excel file is included in the downloaded source with the sample data shown below:

Here is the corresponding plain object model for the sample data:

public class Product
{
    public int ProductID { get; set; }
    public string ProductName { get; set; }
    public int? CategoryID { get; set; }
    public Decimal? UnitPrice { get; set; }                
    public bool OutOfStock { get; set; }
    public DateTime? StockDate { get; set; }    
}

To see the data import in action, you can run the sample application in debug or release mode. The imported data results can be shown as a formatted string on a console window or in another Excel worksheet window using the GenericListOutput class that is included in the sample application.

Overview of Data Import Function

The GetDataToList function in the ExcelReader.cs class file performs the major tasks of importing data from an Excel file to a generic list using the syntax:

public IList<T> GetDataToList<T>(string filePath, 
       Func<IList<string>, IList<string>, T> dataMappingFunctionName);    
public IList<T> GetDataToList<T>(string filePath, string sheetName, 
       Func<IList<string>, IList<string>, T> dataMappingFunctionName);

The function is designed to import data from a single sheet to a single generic list at a time. The first overloaded version always takes the first worksheet in the Excel file. If you have multiple sheets, you can call the second overloaded version with explicit sheetName values multiple times to create corresponding multiple List objects for your models.

The dataMappingFunctionName points to a function that conducts the mapping of spreadsheet field data to object model under the hood. We will be discussing the data mapping details later.

As demonstrated in the sample application, to import data from the Excel file to the List with the Product model, just call this line of code by passing the file path (excelPath) and the name of the delegated data mapping function (AddProductData):

//Get product list from the spreadsheet
IList<Product> dataList = ExcelReader.GetDataToList(excelPath, AddProductData);

Using Open XML SDK Library

If we develop a server application that needs to import the data from an Excel file, it's not a good practice to use the Office InterOp library for Excel. Third party tools, such as Aspose, function well but have the licence limit and may be an overkill if used just for reading the data from the Excel. Although the Open XML SDK for Office is not so developer-friendly, it's a standardized and mature tool from the Microsoft development network family. Using the tool could be an excellent choice for the Excel related .NET programming work if we can well handle some hassles.

Two Open XML SDK Library assemblies, DocumentFormat.OpenXml.dll and WindowsBase.dll, are needed for obtaining the Excel data. The sample application includes two files with the Open XML SDK version 2.5 in the lib folder. These files are working fine for Excel 2007 - 2013 spreadsheet files and Visual Studio 2010 - 2013 projects with .NET Framework 4.0 and above. If you use .NET Framework versions below 4.0, you can download and install the Open XML SDK version 2.0, replace the two DLL files with the version 2.0 files, and then re-reference them in the project.

An automatic assembly merge process can be used in some situations requiring one consolidated executable or DLL file for easy deployment as shown in the sample application.

Setting two DLL files as resource files on the Resources tab of the project Properties page. This will create the Resources.* file group under the project's Properties folder.

Adding the following event handler routine code into the startup class:

//Merge files specified as resources into one output executable or dll
System.Reflection.Assembly CurrentDomain_AssemblyResolve
                  (object sender, ResolveEventArgs args)
{
    string dllName = args.Name.Contains(',') ? 
    args.Name.Substring(0, args.Name.IndexOf(',')) : args.Name.Replace(".dll", "");
    dllName = dllName.Replace(".", "_");
    if (dllName.EndsWith("_resources")) return null;
    var obj = new Object();
    System.Resources.ResourceManager rm = new System.Resources.ResourceManager
                                 (obj.GetType().Namespace + ".Properties.Resources", 
                                 System.Reflection.Assembly.GetExecutingAssembly());
    byte[] bytes = (byte[])rm.GetObject(dllName);
    return System.Reflection.Assembly.Load(bytes);
}

Raising the event in the initiating routine or constractor of the startup class:

AppDomain.CurrentDomain.AssemblyResolve += 
          new ResolveEventHandler(CurrentDomain_AssemblyResolve);

Getting Cell Values from Shared String Table

The Excel stores potentially repeated string data values in the shared string table XML structure in a workbook for the performance improvement purpose. The worksheet cells are assigned index numbers that point to the shared string table node values. You don't have to care how to get real data values from worksheet cells if working with the Office InterOp. But when using the Open XML SDK Library, you need to check the cell data type attributes and obtain the data items from the shared string table through index numbers stored in cells.

The GetCellValue function in the ExcelReader class checks that if the value of the DataType property for a cell object is SharedString, then the real data item will be retrieved from the SharedStringTable object. The below code snippet shows the logic:

if (cell == null) return null;
string value = cell.InnerText;

//Process values particularly for those data types
if (cell.DataType != null)
{
    switch (cell.DataType.Value)
    {
        //Obtain values from shared string table
        case CellValues.SharedString:                        
            var sstPart = 
                document.WorkbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
            value = sstPart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
            break;
        //Other conditions
        //. . .
    }
}
return value;

Adding Empty Cells into Spreadsheet Rows

When getting a data row from the object with the DocumentFormat.OpenXml.Spreadsheet.Row type, any cell with empty value is not included due to the underlying XML node structures. This causes a major issue on iterating cells for the column mapping in rows. The Open XML SDK library does not provide any solution for the issue. Consumers need to add their own code to handle it. Fortunately, the DocumentFormat.OpenXml.Spreadsheet.Cell class has the CellReference property that can be used to match the built-in column names and thus facilitates inserting empty cells at the missing positions.

To easily detect missing cells, all built-in column names are firstly cached to a generic list during the iteration for the custom header row. The GetColumnAddress called in the loop is a simple function to return letter part of the Cell.CellReference value (See downloaded source for details).

var columnLetters = new List<string>();

//Iterate cells of custom header row
foreach (Cell cell in wsPart.Worksheet.Descendants<Row>().ElementAt(0))
{                        
    //Get custom column names
    //. . .
    
    //Get built-in column names by extracting letters from cell references                
    columnLetters.Add(GetColumnAddress(cell.CellReference));                        
}

The GetCellsForRow function in the ExcelReader class will then add empty cells at positions where they are missing in a spreadsheet row and return an enumerable containing the complete number of cells. Note that the logic checks the missing cells in positions of the beginning, middle, and end of any particular row.

private static IEnumerable<Cell> GetCellsForRow(Row row, List<string> columnLetters)
{
    int workIdx = 0;        
    foreach (var cell in row.Descendants<Cell>())
    {
        //Get letter part of cell address
        var cellLetter = GetColumnAddress(cell.CellReference);

        //Get column index of the matched cell  
        int currentActualIdx = columnLetters.IndexOf(cellLetter);

        //Add empty cell if work index smaller than actual index
        for (; workIdx < currentActualIdx; workIdx++)
        {
            var emptyCell = new Cell() 
                            { DataType = null, CellValue = new CellValue(string.Empty) };
            yield return emptyCell;
        }
                
        //Return cell with data from Excel row
        yield return cell;
        workIdx++;

        //Check if it's ending cell but there still is any unmatched columnLetters item   
        if (cell == row.LastChild)
        {
            //Append empty cells to enumerable 
            for (; workIdx < columnLetters.Count(); workIdx++)
            {
                var emptyCell = new Cell() 
                                { DataType = null, CellValue = new CellValue(string.Empty) };
                yield return emptyCell;
            }
        }                
    }                
}

Detecting Used Range

Unlike the Office InterOp, there is no Range object in the Open XML SDK library. The WorksheetPart.Worksheet.SheetDimention also doesn't work based on my attempts. Detecting and bypassing the empty rows are necessary since Excel worksheets sometimes leave empty rows after users remove entire data in the spreadsheet rows. In such cases, errors or inefficient iterations will occur during the data transport.

The GetUsedRows function shows how to find the rows having at least one cell value and return those rows as an enumerable no matter what the empty row position is.

private static IEnumerable<Row> GetUsedRows(SpreadsheetDocument document, WorksheetPart wsPart)
{
    bool hasValue;        
    //Iterate all rows except the first one.
    foreach (var row in wsPart.Worksheet.Descendants<Row>().Skip(1))
    {
        hasValue = false;
        foreach (var cell in row.Descendants<Cell>())
        {
            //Find at least one cell with value for a row
            if (!string.IsNullOrEmpty(GetCellValue(document, cell)))
            {
                hasValue = true;
                break;
            }
        }
        if (hasValue)
        {
            //Return the row and keep iteration state.
            yield return row;
        }
    }
}

Data Mapping Function

The data mapping functionality depends on spreadsheet columns, object model properties, and data types. The optimal approach is to delegate the mapping process to a function that is passed as a parameter for calling the ExcelReader.GetDataToList function. In the sample application, the AddProductData function performs the data mapping and then loading the data into the Product object:

//Function for mapping and entering data into Product object.
private static Product AddProductData(IList<string> rowData, IList<string> columnNames)
{
    var product = new Product()
    {
        ProductID = rowData[columnNames.IndexFor("ProductID")].ToInt32(),
        ProductName = rowData[columnNames.IndexFor("ProductName")],
        CategoryID = rowData[columnNames.IndexFor("CategoryID")].ToInt32Nullable(),
        UnitPrice = rowData[columnNames.IndexFor("UnitPrice")].ToDecimalNullable(),
        OutOfStock = rowData[columnNames.IndexFor("OutOfStock")].ToBoolean(),
        StockDate = rowData[columnNames.IndexFor("StockDate")].ToDateTimeNullable()
    };            
    return product;
}

The mapping function is actually called within the iteration of spreadsheet rows and return a full data object which will in turn be added to the output generic list.

//Calls the delegated function to add it to the collection.
resultList.Add(addProductData(rowData, columnNames));

There are some add-in features provided by the ExcelReader class to make the data mapping and type conversions convenient and efficient. The following sections will discuss two of these features in details.

Mapping Data with Column Names

Using column names, instead of indexes, for mapping and entering data is a very user-friendly and easy maintenance approach. Column sequences can be ignored both in the data source files and during the mapping process. To make this work, we need to create a generic list to cache the custom column names:

var columnNames = new List<string>();

//Iterate cells of custom header row.
foreach (Cell cell in wsPart.Worksheet.Descendants<Row>().ElementAt(0))
{
    //Get custom column names.
    //Remove spaces, symbols (except underscore), 
    //and make lower cases and for all values in columnNames list.                    
    columnNames.Add(Regex.Replace
                (GetCellValue(document, cell), @"[^A-Za-z0-9_]", "").ToLower());

    //Other code...                        
}

The column names added to the list items are alpha-numeric only, symbol free except the underscore which is a valid character for C# object property names, and all lower cases. This is a very robust feature in that column names can be set in spreadsheets with any cases, spaces in any positions, and any symbols. Looking back at the source Excel spreadsheet screenshot, you can see asterisks appended to the column names such as "Product Name*" indicating the required field. You can also see all user-friendly column names such as "Out of Stock" instead of "OutOfStock". The mapped output will correctly show the Product object properties ProductName and OutOfStock for these examples.

The steps of the case-insensitive mapping can further be tuned by adding a List class extension method IndexFor into the ExcelReader class so that we can use the extension method instead of calling the native IndexOf and ToLower methods from each items in the data mapping function.

public static int IndexFor(this IList<string> list, string name)
{        
    int idx = list.IndexOf(name.ToLower());
    if (idx < 0) 
    {
        throw new Exception(string.Format("Missing required column mapped to: {0}.", name));
    }
    return idx;
}

String to Primitive Data Type Conversions

All cell values retrieved from a spreadsheet are in the form of string which needs to be explicitly converted to the types of data model properties if the underlying data type is not the string. The ExcelReader class provides the String class extension methods making the type conversion code neat and centralized:

ToInt32()
ToDouble()
ToDecimal()
ToDateTime()
ToBoolean()
ToGuid()

Each method also has its nullable version, for example, ToInt32Nullable(). You can see how easy to use the conversion methods in the AddProductData function code shown before. You can also add your own methods for any other type conversions if needed.

It's worth noting that, for date/time entries in the Excel spreadsheet, if the column or cell is not explicitly set as the text format, the Excel will output the OLE Automation date/time (OADate format, a.k.a., the number of Double type referencing to the 12/31/1899 midnight). Below is an example of how the converter handles the situation when the OADate format date/time is received:

public static DateTime? ToDateTimeNullable(this string source)
{
    DateTime outDt;        
    if (DateTime.TryParse(source, out outDt))
    {
        return outDt;
    }
    else
    {
        //Check and handle OLE Automation date time
        if (IsNumeric(source))
        {
            return DateTime.FromOADate(source.ToDouble());
        }
        return (DateTime?)null;                       
    }
}

Although the code in the sample AddProductData function only maps the data for a simple object model, you can use the same scenario for mapping a pretty complex object containing hierarchical structures. Just make sure that the model properties and all child object properties match the spreadsheet columns for your planed data transport.

Summary

The ExcelReader class presented here is easy to use, powerful, and free of Office InterOp for importing the data from an Excel spreadsheet to a generic list. I hope that the article and sample application are helpful for developers who need to work on the similar areas.

History

8^th May, 2014: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)