
CSV/Excel File Parser - A Revisit

7 Apr 2018 · CPOL
An example of evolving code to fit new demands

Introduction

First off, I'll admit it - I'm not really breaking much new ground here. I wrote an earlier article about parsing CSV files - CSV File Parser[^] - and while it provides adequate code, I figured I could make it a little easier on the programmer who was using it. I chose to present this as a new article because a) it serves as a good example of how code evolves when you get a chance to revisit it, and b) it presents what I consider to be a significantly improved approach over the original article(s).

This code was actually extracted from a recent series of articles I posted that allows you to create schedulable agents for SQL Server Express - SQLXAgent - Jobs for SQL Express - Part 1 of 6[^]. That article included most of the code you see here, but with the database support removed because this article is targeted at getting data from a CSV (or Excel) file into a collection in memory. Additional processing beyond that point is (understandably) left to the programmer implementing this code into his own project, and I have provided an example of how this might be accomplished near the end of the article.

I was tempted to just refer everyone to the earlier CSVParser articles, but considered that if I was the person reading this article, I would be annoyed to have to click a link to read stuff that may or may not be applicable to this article, so please excuse any duplication of narrative regarding the CSV parsing code, and remember, I'm doing it for the children.

There are no images, pretty pictures, or fluff in this article. If you're looking for that kind of stuff, go to a nearby news stand and pick up a copy of Cosmopolitan.

Background

I (still) live in a very bizarre programming world. The project for which this code was originally developed has since been abandoned, but it is the reason this code existed in the first place. That project involved importing data from almost five DOZEN different data sources, comprised mostly of Excel XLSX files, at least one CSV file, a web site that provides raw data in XML format, and a couple of actual database queries. The Excel spreadsheets come from a mix of database queries on web sites that can return the results in the form of said spreadsheet files, with the rest being manually generated by humans.

Garbage In

You would think that the human-generated files would present the most quirks because humans are flawed. Mind numbingly boring data entry, a substantial lack of monetary compensation in the form of a paycheck, and the fact that it's a government employee performing the work all conspire to form a perfect storm of inability to muster anything resembling an attention to detail, which leads to "nuances" in the tabular data. However, database pulls can be equally fraught with errors (amusingly enough), especially if the data-entry side of the database doesn't quite catch all of the potential errors that a human is apt to introduce, or if the persona in control of the data extraction decides that a column needs to be renamed, moved, deleted, or a new column added. Of course, they feel no pressing need to inform the consumers of their data regarding changes to format or content, and this leads to a world of wonderful surprises that are best observed in the form of failed SQL server jobs, or as was the case for this now abandoned project, a sometimes spectacular failure of the code because I didn't anticipate the most recent act of wanton human stupidity.

Garbage Out

The primary method of importing said spreadsheets was a library called EPPlus. While it's an adequate library in general, it has some weaknesses. The one that drove me to write the code featured in the original article is that, for some reason (as yet undiscovered by - well - anybody), some .XLSX files simply would not load using the library. This "nuance" forced me to use Excel to save the affected sheet as a CSV file, which in turn caused me to write more code to implement the capability.

I have since switched to a seemingly more robust Excel reader which, while it removed my need to fall back to CSV files, did not necessarily remove the need for the CSV parsing code for you, my legions of adoring fans.

Assumptions

As with most of my articles, this one is not about theory, or the newest doo-dad or gee-gaw that Microsoft seems to think we want to see in .NET. To put it simply, this is real-world code that lives and breathes in an actual project. As it gets used, it gets tested a little more thoroughly, and as problems crop up, they are promptly fixed. The code as presented here appears to work fairly well - today. Tomorrow will be a fresh hell of "might not", because I cannot think of every possible thing that might get thrown at it. I try to head most of the obvious stuff off, but like everything else related to programming, just when you think your code is idiot-proof, the world invents a better idiot, and you end up performing what I call knee-jerk programming.

This article assumes that you're a moderately accomplished developer, but also one who wants some code to put out a fire that you'd really rather not put too much time into yourself. I didn't do anything too fancy or elegant, because "fancy and elegant" code is quite often a lot harder to understand and maintain. The code is heavily commented, so there should be ample explanation about how it works. If I mention something here that you don't understand, there are several search engines at your immediate disposal. My advice - use them.

Sample Files, Suitability, and a Heads-Up

The sample code includes a short XLSX file, as well as a similarly short CSV file. The CSV file has some intentionally invalid lines so that the code can be reasonably well exercised.

To ensure that the class meets your particular needs, use the included sample project to establish the code's suitability, and make any changes that you deem necessary before putting it into your own project(s).

While perusing the code, remember that coding styles differ between pretty much every programmer. Don't get hung up on formatting, or why I did something a certain way. Remember, you have the source code, so you are free to mould the code in your own image. Also keep in mind that you may encounter a bug or three. Nobody writes perfect code, and I may not have taken sufficient steps to keep programmers from abusing the code. If you find something like that, be a programmer and address it.

The Code

This library loads data from an Excel or CSV file, and makes it available in the form of a DataTable to the programmer for further processing within his application. Each column's data type can either be determined automatically by the contents of the columns themselves, or can be manually specified by the programmer.

NuGet Package - ExcelDataReader

When I developed this code for use in the SQLXAgent project cited above, I found a NuGet package called ExcelDataReader that quite handily loaded Excel files and returned that loaded data in the form of a DataTable object. I was honestly shocked that this package didn't go the extra distance and try to determine appropriate data types for loaded data, and a lot of the code in this article is dedicated to that aspect of loading the data.

This package can load both Excel and CSV files, but I only use it to load Excel files, and prefer to use my own code to load CSV files. The reason for this is that ExcelDataReader doesn't allow me enough control over what happens as the CSV file is parsed. For instance, if a problem is encountered in the form of a malformed row, ExcelDataReader throws an exception and you don't get ANY of the data back from the file. I wanted to return as much data as possible, but still report what rows were malformed. I also wanted to control when and how column data types were determined.

General Architecture

The code is comprised of a base importer class, a source-specific importer class that inherits the base class, and several support classes. Also included are a number of extension methods for various .NET classes that I've developed over the years, many of which are not necessarily utilized by the importer code, but that some of you might find valuable in other situations. I leave those methods for you to find via the joy of intellisense (and may the Force be with you).

The ImportBase Class

This class is admittedly where most of the action happens. As you might expect, it contains methods and properties that are commonly used by both the CsvImportBase and ExcelImportBase classes. This class is also abstract, which means you can't directly instantiate it.

You may notice that pretty much all of the methods are virtual. This means if you have a different vision as to how stuff should work, you can override these methods and implement your own version of insanity without destroying the original methods.
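For example, a derived class can extend one of these virtual methods without losing the stock behavior. This is only a sketch; the derived class name and the logging line are my own invention:

```csharp
// Hypothetical derived class that extends a virtual method from the
// inheritance chain; calling base.RemoveEmptyRows() first preserves
// the original behavior before adding our own.
public class LoggingCsvImporter : CsvImportBase
{
    protected override void RemoveEmptyRows()
    {
        base.RemoveEmptyRows();
        Console.WriteLine("{0} empty row(s) removed", this.RemovedRowCount);
    }
}
```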

Importer Configuration Properties

  • string FileName - Represents the fully qualified path of the file being imported
  • ImportFileType FileType - Represents the type of file being imported (Excel or CSV), based on which deriving base-importer class is being used.
  • bool CanImport - A flag indicating whether the file can be imported. This flag is set by the deriving base-importer class as a result of sanity checks performed for the given importer.

Importer Control Properties

To make importing as configurable (yet still automagic) as possible, there are a number of properties that can be set in your inheriting classes.

  • ItemHints ColumnHints - Represents a collection of items that indicate the data type for a given imported column. Please refer to the section that specifically discusses this collection.
  • bool AllFieldsAreStrings (false) - A flag that indicates that all imported columns contain strings.
  • bool DeleteEmptyRows (true) - A flag that indicates that empty rows encountered in the imported data are not to be included in the resulting DataTable object.
  • bool FirstRowDeterminesType (false) - A flag that indicates that only the first row of data should be used to determine column data types in the ColumnHints collection.
  • bool FirstRowIsHeader (true) - A flag that indicates that the first row in the file contains column header names. If you set this to false for a given file, column names will be assigned in the format "Column_n", where "n" indicates the ordinal position of the column in the row.
  • string ImportedTableName (string.Empty) - The table name assigned to the DataTable. If this string is null/empty at the time the Import method is called, the filename (and if the file being imported is an Excel file, the sheet name is also included) is used as the table name.
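By way of illustration, a derived importer might be configured like this before importing. The importer class name and file path are hypothetical; the property names are the ones listed above:

```csharp
// Hypothetical usage: tune the control properties, then import.
var importer = new MyCsvImporter(@"C:\data\sample.csv")
{
    FirstRowIsHeader    = true,   // first line holds the column names
    DeleteEmptyRows     = true,   // drop rows whose columns are all empty
    AllFieldsAreStrings = false,  // let the importer infer column types
    ImportedTableName   = "SampleData"
};
importer.Import();
DataTable data = importer.ImportedData;
```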

Importer Output Properties

  • DataTable ImportedData - This property is populated as a result of importing the data. After the Import method has exited, and assuming no exceptions were thrown during the import process, this object returns all valid rows with column names and appropriate column data types already set.
  • int RemovedRowCount - This value indicates the number of empty rows that were removed during the import process.

Abstract Method - Import()

This is the only abstract method in the class, and as such, must be overridden somewhere in the inheritance chain. In our case, it is overridden by the CsvImportBase and ExcelImportBase classes because importing is somewhat different for the two file types. Each class's version will be discussed individually a little later in this article.

Method - Init

The Init method is called by the deriving class, and serves to perform sanity checks on the filename, and set default values for the control and output properties.

C#
protected virtual void Init()
{
    DebugMsgs.Add(this, DebugLevel.Full, "ImportBase");

    this.DoSanityChecks();

    this.ImportedTableName      = string.Empty;
    this.FirstRowIsHeader       = true;
    this.ColumnHints            = null;
    this.DeleteEmptyRows        = true;
    this.FirstRowDeterminesType = false;
    this.AllFieldsAreStrings    = false;
}	

Method - DoSanityChecks

This method makes sure that a) we have a valid filename, and b) the file exists. I recognize that an empty/null filename will cause File.Exists() to return false, but I separate out the two possible exceptions because they are distinctly different problems.

C#
protected virtual void DoSanityChecks()
{
    this.CanImport = false;
    if (string.IsNullOrEmpty(this.FileName))
    {
        throw new ParserAgentException(ParserExceptionEnum.InvalidFileName);
    }
    else
    {
        if (!File.Exists(this.FileName))
        {
            throw new ParserAgentException(ParserExceptionEnum.ImportFileNotFound, 
                                           new FileNotFoundException(this.FileName));
        }
    }
    this.CanImport = true;
}    

Method - ProcessData

This method is invoked by the derived class' overridden Import method, and takes care of massaging the imported data as indicated by the comments. Each of the methods that it calls can be overridden in the inheriting class to provide additional and/or alternative functionality.

C#
protected virtual void ProcessData()
{
    // remove empty rows (if configured to do so)
    this.RemoveEmptyRows();

    // remove linefeeds, spaces, and non-alphanumeric characters from 
    // column names
    this.NormalizeNames();

    // Create column hints list if necessary. If you want to use your own, 
    // instantiate it in your derived class BEFORE calling the Import() 
    // method.
    this.CreateColumnHints();

    // Correct the datatypes in the ImportedData object based on the 
    // ColumnHints content.
    this.SetColumnTypes();
}	

Method - RemoveEmptyRows

If the DeleteEmptyRows property is true, this method removes every row whose column values are all null or empty strings. Otherwise, no processing is performed.

C#
protected virtual void RemoveEmptyRows()
{
    this.RemovedRowCount = 0;
    if (this.DeleteEmptyRows)
    {
        for (int i = this.ImportedData.Rows.Count-1; i >= 0; i--)
        {
            DataRow row = this.ImportedData.Rows[i];
            if (row.ItemArray.All(x=>x is DBNull || string.IsNullOrEmpty(x.ToString())))
            {
                this.ImportedData.Rows.Remove(row);
                this.RemovedRowCount++;
            }
        }
    }
}	

Method - NormalizeNames

If the FirstRowIsHeader property is true, this method replaces all spaces and non-alphanumeric characters in each column name with underscores. This is a hold-over from the SQLXAgent article that was implemented to avoid weird column names in a SQL table (I personally hate having to use square brackets in queries). Additionally, the table name is set to whatever the filename is (unless the programmer specified a static one).

C#
protected virtual void NormalizeNames()
{
    // If the first row is the header, pull the names from the first row, and 
    // delete the row.
    if (this.FirstRowIsHeader)
    {
        // first row
        DataRow headerRow = this.ImportedData.Rows[0];

        // process each column
        for (int i = 0; i < headerRow.ItemArray.Count(); i++)
        {
            // to ease typing
            DataColumn excelColumn = this.ImportedData.Columns[i];
            string rowColumn = (string)headerRow.ItemArray[i];

            // Set the column name from the first row of the data
            //                        if the column in the row is null/empty
            excelColumn.ColumnName = (string.IsNullOrEmpty(rowColumn)
                                    // keep the name we already have
                                        ? excelColumn.ColumnName
                                    // otherwise set the excel column to whatever 
                                    // the row says it should be
                                        : rowColumn
                                    // trim leading/trailing spaces, and 
                                    // replace linefeeds and embedded spaces 
                                    // with underscores
                                        ).Trim().ReplaceNonAlphaNumeric('_');
        }

        // Delete the header row - i do this here because we've already satisfied 
        // the FirstRowIsHeader condition, and there's really no point in checking 
        // again, just to delete the header row.
        this.ImportedData.Rows.RemoveAt(0);
    }
    // set the table name based on the name of the file (and if Excel file, the 
    // name of the sheet)
    this.ImportedData.TableName = this.BuildTableName();
}    

Method - BuildTableName

This method builds the table name based on the name of the file being imported. Again, spaces and non-alphanumeric characters are replaced with underscores.

C#
protected virtual string BuildTableName()
{
    // We use either the specified ImportedTableName, or we build the table name 
    // based on the filename and (if this is an excel file) sheet name. I try to 
    // avoid nested ternary conditions, but sometimes, it just makes sense to use 
    // them. In these situations, code formatting goes a long way toward assisting 
    // a programmer who is not familiar with the code base.
    string newTableName = (string.IsNullOrEmpty(this.ImportedTableName)) 
                            ? string.Concat("Imported_", Path.GetFileName(this.FileName).Trim())
                            : this.ImportedTableName;
    return newTableName.ReplaceNonAlphaNumeric('_');
}    

Method - CreateColumnHints

This method creates column hints as directed by the various control properties. In a nutshell, column hints are created as indicated below. In the interest of brevity, all of the methods listed below are included in the same code block.

  • If the AllFieldsAreStrings property is true, this method calls the CreateAllStringsColumnHints method, which simply sets the data type for all columns to string. Since no value checking is performed, this is the fastest column hint creation method available in the importer. It's also quite possibly the most useless in terms of processing most imported data. If the AllFieldsAreStrings property is false, and...
  • If the FirstRowDeterminesType property is true, this method calls the CreateColumnHintFromFirstCompleteRow method, which only checks the values in the first row of data to establish the column types. This is the 2nd fastest column hint creation method because it only processes each column in the first row of data to establish column types. If the FirstRowDeterminesType property is false, and...
  • If the programmer has not already manually created the ColumnHints collection, each column in each row is evaluated to determine the most appropriate data type for a given column. How data types are determined is discussed in the section that describes the ColumnHints collection.
C#
protected virtual void CreateColumnHints()
{
    if (this.AllFieldsAreStrings)
    {
        this.CreateAllStringsColumnHints();
    }
    else if (this.FirstRowDeterminesType)
    {
        this.CreateColumnHintFromFirstCompleteRow();
    }
    else
    {
        // if the programmer hasn't already specified a hints list
        if (this.ColumnHints == null || this.ColumnHints.Count == 0)
        {
            // instantiate
            this.ColumnHints = new ItemHints();

            // for each column in the Columns collection
            for (int i = 0; i < this.ImportedData.Columns.Count; i++)
            {
                // get the column
                DataColumn col = this.ImportedData.Columns[i];

                // if the name isn't null/empty (theoretically impossible, but we 
                // check simply because it's a good idea)
                if (!string.IsNullOrEmpty(col.ColumnName))
                {
                    // create a new hint item
                    HintItem hint = new HintItem() 
                    { Name = col.ColumnName, ColNumb = col.Ordinal, ItemType = null };

                    // iterate each row
                    foreach (DataRow row in this.ImportedData.Rows)
                    {
                        // try to determine the best data type based on all of the 
                        // possible values
                        hint.DetermineType(row[col], this.FileType == ImportFileType.CSV);

                        // if we determine at any point that the column should be a 
                        // string, we can quit because a string type is our ultimate 
                        // fallback data type.
                        if (hint.ItemType.Name.IsLike("String"))
                        {
                            break;
                        }
                    }

                    // add the hint to our list
                    this.ColumnHints.Add(hint);
                }
            }
        }
    }
}

public virtual void CreateAllStringsColumnHints()
{
    // instantiate a new collection of hints
    this.ColumnHints = new ItemHints();

    // iterate all of the columns
    foreach (DataColumn col in this.ImportedData.Columns)
    {

        // if the name isn't null/empty (theoretically impossible, but we 
        // check simply because it's a good idea)
        if (!string.IsNullOrEmpty(col.ColumnName))
        {

            // create a new hint item and add it to the column hints collection
            HintItem hint = new HintItem() 
            { Name = col.ColumnName, ColNumb = col.Ordinal, ItemType = typeof(string) };
            this.ColumnHints.Add(hint);
        }
    }
}

protected virtual void CreateColumnHintFromFirstCompleteRow()
{
    // instantiate a new collection of hints
    this.ColumnHints = new ItemHints();

    // get the first row
    DataRow row = this.ImportedData.Rows[0];

    // iterate all of the columns
    foreach (DataColumn col in this.ImportedData.Columns)
    {
        // if the name isn't null/empty (theoretically impossible, but we 
        // check simply because it's a good idea)
        if (!string.IsNullOrEmpty(col.ColumnName))
        {

            // create a new hint item and add it to the column hints collection
            HintItem hint = new HintItem() 
            { Name = col.ColumnName, ColNumb=col.Ordinal, ItemType = null };
            hint.DetermineType(row[col], this.FileType == ImportFileType.CSV);
            this.ColumnHints.Add(hint);
        }
    }

    // if a column in the first row is empty, that column will be determined 
    // to be a string column.
}    

Method - SetColumnTypes

This method uses the ColumnHints collection to reset column data types for the imported data. Here's something I learned while writing this code - once a DataTable object has rows, the column data type can't be changed. If you want to change data types in this situation, you must clone the data table object (which copies JUST the schema to the new datatable), set the column data types, and then import the old data table's rows one at a time to the new data table.

C#
protected virtual void SetColumnTypes()
{
    // if we have column hints
    if (this.ColumnHints != null)
    {
        // you can't change the datatype of a column once there are rows in the 
        // datatable, so clone the datatable (brings over schema, but not data)
        using (DataTable cloned = this.ImportedData.Clone())
        {
            // set the column types
            for (int i = 0; i < cloned.Columns.Count; i++)
            {
                cloned.Columns[i].DataType = ColumnHints[i].ItemType;
            }
            for (int i = 0; i < this.ImportedData.Rows.Count; i++)
            {
                DataRow row = this.ImportedData.Rows[i];
                cloned.ImportRow(row);
            }
            // set the imported data to the cloned data table
            this.ImportedData = cloned;
        }
    }
    else
    {
        using (DataTable cloned = this.ImportedData.Clone())
        {
            // set the column types
            for (int i = 0; i < cloned.Columns.Count; i++)
            {
                cloned.Columns[i].DataType = typeof(object);
            }
            for (int i = 0; i < this.ImportedData.Rows.Count; i++)
            {
                DataRow row = this.ImportedData.Rows[i];
                cloned.ImportRow(row);
            }
            // set the imported data to the cloned data table
            this.ImportedData = cloned;
        }
    }
}    
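The clone/retype/re-import dance is easy to demonstrate in isolation with nothing but System.Data. Below is a minimal, self-contained sketch; the table and column names are invented for the demo:

```csharp
using System;
using System.Data;

class RetypeDemo
{
    static void Main()
    {
        // build a table whose single column holds numbers as strings
        DataTable original = new DataTable("Demo");
        original.Columns.Add("Qty", typeof(string));
        original.Rows.Add("1");
        original.Rows.Add("42");

        // DataType can't be changed once rows exist, so clone the
        // schema (Clone copies structure only, no data)...
        DataTable retyped = original.Clone();
        retyped.Columns["Qty"].DataType = typeof(int);

        // ...and then bring the rows across one at a time
        foreach (DataRow row in original.Rows)
        {
            retyped.ImportRow(row);
        }

        Console.WriteLine(retyped.Rows[1]["Qty"].GetType().Name);
    }
}
```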

Methods - GetColumnValue (two overloads)

These methods retrieve the value contained in the specified row/column. To avoid countless overloads of the method, I chose to use generic types so that the appropriate casting could be performed on the values. Again, in the interest of brevity, I included both overloads in the same code block.

C#
/// <summary>
/// Gets the row/column value by column name. The column index is found, and this 
/// method then calls the method overload.
/// </summary>
/// <param name="row">The row index in the datatable</param>
/// <param name="colName">The name of the column in the data table</param>
/// <param name="defaultValue">The value to return if the value is null</param>
/// <returns>An object representing the value at the specified row index/column name</returns>
public T GetColumnValue<T>(int row, string colName, T defaultValue)
{
    // sanity checks first
    if (row < 0 || row >= this.ImportedData.Rows.Count)
    {
        throw new ParserAgentException(ParserExceptionEnum.RowIndexOutOfRange, 
                                       new ArgumentOutOfRangeException(
                                           string.Format("Index {0} is out of range (max row index is {1})", 
                                                         row, this.ImportedData.Rows.Count - 1)));
    }
    if (string.IsNullOrEmpty(colName))
    {
        throw new ArgumentNullException("Column name parameter cannot be null or empty.");
    }
    if (!this.ImportedData.Columns.Contains(colName))
    {
        throw new ParserAgentException(ParserExceptionEnum.ColumnDoesNotExist, 
                                       new ArgumentException(
                                           string.Format("Specified column name '{0}' not found in data table", 
                                                         colName)));
    }
    if (this.ColumnHints.FindByColumnName(colName) == null)
    {
        throw new ParserAgentException(ParserExceptionEnum.ColumnDoesNotExist, 
                                       new ArgumentException(
                                           string.Format("Specified column name '{0}' not found in column hints", 
                                                         colName)));
    }

    // if we get here, we're guaranteed a valid column index
    int column = -1;
    for (int i = 0; i < this.ImportedData.Columns.Count; i++)
    {
        if (ImportedData.Columns[i].ColumnName.ToUpper() == colName.ToUpper())
        {
            column = i;
        }
    }
    object value = this.ImportedData.Rows[row].ItemArray[column];
    T result = (value != null) ? (T)value : defaultValue;
    return result;
}

/// <summary>
/// Gets the row/column value by index. 
/// </summary>
/// <param name="row">The row index in the datatable</param>
/// <param name="column">The column index in the datatable</param>
/// <param name="defaultValue">The value to return if the value is null</param>
/// <returns>An object representing the value at the specified row/column index</returns>
public T GetColumnValue<T>(int row, int column, T defaultValue)
{
    // sanity checks first
    if (row < 0 || row >= this.ImportedData.Rows.Count)
    {
        throw new ParserAgentException(ParserExceptionEnum.RowIndexOutOfRange, 
                                       new ArgumentOutOfRangeException(
                                           string.Format("Index {0} is out of range (max row index is {1})", 
                                                         row, this.ImportedData.Rows.Count - 1)));
    }

    if (column < 0 || column >= this.ImportedData.Columns.Count)
    {
        throw new ParserAgentException(ParserExceptionEnum.ColumnIndexOutOfRange, 
                                       new ArgumentOutOfRangeException(
                                           string.Format("Column index {0} is out of range (max column index is {1})", 
                                                         column, this.ImportedData.Columns.Count - 1)));
    }
    if (column >= this.ColumnHints.Count)
    {
        throw new ParserAgentException(ParserExceptionEnum.ColumnIndexOutOfRange, 
                                       new ArgumentException(
                                           string.Format("Column index {0} is out of range in column hints collection (max column index is {1})", 
                                                         column, this.ColumnHints.Count - 1)));
    }
    T result;
    object value = this.ImportedData.Rows[row].ItemArray[column];
    result = (value != null) ? (T)value : defaultValue;
    return result;
}
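Typical usage after a successful import might look like this; the importer instance, column name, and indexes are hypothetical:

```csharp
// by column name, with a default of 0 if the stored value is null
int qty = importer.GetColumnValue<int>(0, "Quantity", 0);

// by column index, with a default of "n/a" if the stored value is null
string desc = importer.GetColumnValue<string>(0, 2, "n/a");
```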

ColumnHints

The ColumnHints collection contains a HintItem object for every column found in the imported data. The HintItem contains a Name, DataType, and ColNumb property, but those aren't the most interesting part of the class. The real magic happens in the DetermineType method.

Type determination is essentially handled in descending order of complexity, starting with attempting to parse a value into a DateTime, progressing through the various numeric and bool types, and when all else fails, the type is determined to be a string. There are various methods you can use to override the behavior of the data type determination, and all of those methods have already been discussed.

C#
public void DetermineType(object value, bool fromCSV=false)
{
    // if the type isn't already a string
    if (this.ItemType == null || !this.ItemType.Name.IsLike("String"))
    {
        Type theType = this.ItemType;

        // we have to check if its from a CSV import because everything imported 
        // from a CSV defaults to string.
        if (value is string && !fromCSV)
        {
            theType = typeof(String);
        }
        else
        {
            string valueStr = value.ToString();
            DateTime parsedDateTime;
            Int32    parsedInt32;
            double   parsedDouble;
            bool     parsedBool;

            // We try to parse the value using all standard datetime formats 
            // (custom formats are not supported unless you want to modify the 
            // DateFormatStrings object.).
            if (DateTime.TryParseExact(valueStr, DateFormatStrings.Formats, 
                CultureInfo.CurrentUICulture, DateTimeStyles.None, out parsedDateTime))
            {
                theType = typeof(DateTime);
            }
            else if (Int32.TryParse(valueStr, out parsedInt32))
            {

                // we only want to change the type if it's not already a double
                if (theType == null || !theType.Name.IsLike("Double"))
                {
                    theType = typeof(Int32);
                }
            }
            else if (double.TryParse(valueStr, out parsedDouble))
            {
                theType = typeof(Double);
            }
            else if (bool.TryParse(valueStr, out parsedBool))
            {
                theType = typeof(Boolean);
            }
            else 
            {
                theType = typeof(String);
            }
        }
        this.ItemType = theType;
    }
}    

The CSV Importer

This class is responsible for importing CSV files. Like the ImportBase class, it is abstract and MUST be inherited by a class that YOU write. Fortunately, this class's default settings make it possible to do so with minimal effort on your part (see the Using the Importers section below).

Control Properties

  • bool FailOnMalformedLine (false) - If this property is true, importing will stop with an exception if a malformed row is encountered.
  • string CurrencySymbol - This represents the localized currency symbol. By default, the code will set this value based on the operating system's settings. However, the programmer can set it manually if desired.
  • string[] CurrentData - This represents the array of column data for the last-parsed row.
  • bool RemoveCurrencySymbols (true) - If this property is true, currency symbols will be removed from data BEFORE the column data types are determined.
    Note: If you want your currency data to be seen as numeric data, you must leave this flag set to true. Otherwise, the entire column will be seen as a string.
  • bool RemoveQuotes (true) - If this property is true, enclosing quote characters will be removed from string fields.
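
For example, a stricter importer can be configured by changing these properties from a derived class's constructor. This is a sketch, assuming the CsvImportBase class from the article's download (and that the property setters are accessible to derived classes):

```csharp
public class StrictCsvLoader : CsvImportBase
{
    public StrictCsvLoader(string filename) : base(filename)
    {
        // abort the import on the first malformed row instead of 
        // collecting it in the InvalidLines list
        this.FailOnMalformedLine = true;

        // keep the enclosing quotes on string fields
        this.RemoveQuotes = false;

        // override the operating system's currency symbol
        this.CurrencySymbol = "$";
    }
}
```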

Processing Properties

  • string CurrentLine - Represents the current row of text that's being parsed.
  • bool IsMalformed - Indicates whether the current line was determined to be malformed.

Output Properties

  • List<InvalidLine> InvalidLines - A collection of invalid lines that were encountered while processing the CSV file.
  • int TotalLinesProcessed - Indicates the number of lines that were processed in the CSV file. This number includes invalid lines.
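
After Import() returns, these output properties make it easy to report on what happened. A sketch (the CSVLoader class is defined in the Using the Importers section below, and the file name is hypothetical):

```csharp
CSVLoader loader = new CSVLoader(@"testdata\sample1.csv");
loader.Import();

Console.WriteLine("{0} line(s) processed, {1} invalid.", 
                  loader.TotalLinesProcessed, loader.InvalidLines.Count);

// report each rejected line along with the reason it was rejected
foreach (InvalidLine bad in loader.InvalidLines)
{
    Console.WriteLine("Line {0} ({1}): {2}", 
                      bad.LineNumber, bad.Reason, bad.LineText);
}
```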

Method - Init

This method initializes the CSV importer object. First, it calls the base class' version of the method, and then sets appropriate values for both the base class properties and its own.

C#
protected override void Init()
{
    // call the base class Init method
    base.Init();

    // set the file type
    this.FileType              = ImportFileType.CSV;

    // unlike the excel importer, we have some additional properties we need to set
    this.RemoveCurrencySymbols = true;
    this.CurrencySymbol        = CultureInfo.CurrentUICulture.NumberFormat.CurrencySymbol;
    this.CurrentLine           = string.Empty;
    this.CurrentData           = null;
    this.IsMalformed           = false;
    this.InvalidLines          = new List<InvalidLine>();
    this.ImportedData          = new DataTable();
    this.TotalLinesProcessed   = 0;
    this.FailOnMalformedLine   = false;
    this.RemoveQuotes          = true;
}    

Method - Import

This is the override of the base class's abstract Import method, and it marshals the importing process. In this case, we call a method to import and parse the data, and then call the base class ProcessData method.

C#
public override void Import()
{
    if (this.CanImport)
    {
        this.Parse();
        base.ProcessData();
    }
}    

Method - Parse

This method opens the file, parses one row at a time, and if no problems are encountered, adds the parsed data to the data table. If a problem is encountered in a row, the line number, text of the line, and the reason it was determined to be invalid is stored in the InvalidLines collection.

C#
protected virtual void Parse()
{
    // Yes, I'm aware that the ExcelDataReader can import CSV files, but I wanted 
    // more control over the process so I could report on invalid lines, etc.
    this.InvalidLines.Clear();

    try
    {
        using (StreamReader dataStream = new StreamReader(this.FileName))
        {
            // read the file one line at a time
            string line = string.Empty;
            int lineCounter = -1;

            while ((line = dataStream.ReadLine()) != null)
            {
                lineCounter++;

                // ignore blank lines
                if (string.IsNullOrEmpty(line))
                {
                    continue;
                }

                // read the fields in the string. If the line is malformed, this method 
                // will detect that condition and set the IsMalformed property in this 
                // class.
                this.CurrentData = this.ReadFields(lineCounter, line);

                // initialize the data table (if necessary)
                this.InitDataTable(this.CurrentData);
                Debug.WriteLine(line);

                // if the string is malformed, add its index to the InvalidLines list
                if (this.IsMalformed)
                {
                    if (this.FailOnMalformedLine)
                    {
                        throw new ParserAgentException(ParserExceptionEnum.CsvMalformedLine, 
                                                       new InvalidOperationException(
                                                       string.Format("Line {0} is malformed.", 
                                                                     lineCounter)));
                    }
                }
                else
                {
                    // we add all rows to the datatable, 
                    // even the header row (if the file has one).
                    this.AddToDataTable(this.CurrentData);
                }
                this.TotalLinesProcessed++;
            } // while ((line = this.DataStream.ReadLine()) != null)
        } // using (StreamReader dataStream = new StreamReader(this.FileName))
    }
    catch (Exception ex)
    {
        throw new Exception("Error parsing file.", ex);
    }
}    

Method - ReadFields

This method is called by the Parse method for each row in the CSV file, and parses the supplied line of text that represents a row of data. No attempt is made to fix invalid lines. If a line is determined to be invalid (either a closing quote is missing, or the number of columns found does not match the number expected), the errant line is added to the InvalidLines collection.

C#
protected virtual string[] ReadFields(int lineCounter, string text)
{
    //assume we have a proper line of text
    this.IsMalformed = false;

    // split the string on commas (because this is a CSV file, after all)
    string[] parts = text.Trim().Split(',');

    // create a container for our results
    List<string> newParts = new List<string>();

    // set some initial values
    bool inQuotes = false;
    string currentPart = string.Empty;

    // iterate the parts array
    for (int i = 0; i < parts.Length; i++)
    {
        // get the part at the current index
        string part = parts[i];

        // if we're in a quoted string and the current part starts with a single double 
        // quote AND currentPart isn't empty, assume the currentPart is complete, add it to 
        // the newParts list, and reset for the new part
        if (inQuotes && part.StartsWithSingleDoubleQuote() == true && 
           !string.IsNullOrEmpty(currentPart))
        {
            currentPart = string.Concat(currentPart, "\"");
            newParts.Add(currentPart);
            currentPart = string.Empty;
            inQuotes = false;
        }

        // see if we're in a quoted string
        inQuotes = (inQuotes || (!inQuotes && part.StartsWithSingleDoubleQuote() == true));

        // if so, add the part to the current currentPart
        if (inQuotes)
        {
            currentPart = (string.IsNullOrEmpty(currentPart)) 
		                  ? part : string.Format("{0},{1}", currentPart, part);
        }

        // otherwise, simply set the currentPart to the part
        else
        {
            currentPart = part;
        }

        // see if we're still in a quoted string
        inQuotes = (inQuotes && currentPart.EndsWithSingleDoubleQuote() == false);

        // if not
        if (!inQuotes)
        {
            // remove the quote characters
            currentPart = (this.RemoveQuotes) ? currentPart.Trim('\"') : currentPart;

            // put the currentPart into our container
            newParts.Add(currentPart);

            // reset the currentPart
            currentPart = string.Empty;
        }
    }
    // determine if the line is somehow invalid, and if it is, save it
    this.IsMalformed = (inQuotes || 
                        (this.ImportedData.Columns.Count > 0 && 
                         newParts.Count != this.ImportedData.Columns.Count));

    if (this.IsMalformed)
    {
        string reason = (inQuotes) ? "Missing end-quote" : "Missing at least one column";
        this.InvalidLines.Add(new InvalidLine() 
        { 
		    LineNumber = lineCounter, 
            LineText   = text, 
            Reason     = reason 
        });
    }
    return newParts.ToArray();
}	
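
To make the rejoin logic concrete, here is a trace of what the method produces for a hypothetical line containing an embedded comma, with RemoveQuotes left at its default of true:

```csharp
// Hypothetical input line:
//   1,"Doe, John",$5.00
//
// The initial Split(',') produces four raw parts:
//   1     "Doe     John"     $5.00
//
// The loop sees the opening quote on "Doe, rejoins it with the following
// part (restoring the comma), and trims the enclosing quotes, so the
// returned array contains three fields:
//   fields[0] == "1"
//   fields[1] == "Doe, John"   // embedded comma preserved
//   fields[2] == "$5.00"
```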

Method - InitDataTable

This method is called for every row that is processed, but it only initializes the datatable if there aren't any columns yet defined therein. It simply adds columns.

C#
protected virtual void InitDataTable(string[] parts)
{
    if (this.ImportedData == null)
    {
        this.ImportedData = new DataTable();
    }

    // if the columns have not yet been defined, we need to create them.
    if (this.ImportedData.Columns.Count == 0)
    {
        for (int i = 0; i < parts.Length; i++)
        {

            // If the first row contains headers, use the contents of the parts 
            // array to specify the names. Otherwise, set the column names to 
            // "Column_n" where "n" is the current index into the parts array.
            this.ImportedData.Columns.Add((this.FirstRowIsHeader) ? 
                  parts[i] : string.Format("Column_{0}", i + 1), typeof(object));
        }
    }
}	

Method - AddToDataTable

This method adds the parsed data to the data table. By the time we get here, no validation should be required.

C#
protected virtual void AddToDataTable(string[] parts)
{
    // Create a new row. New rows contain n items in the ItemArray property. The 
    // number of items is determined by the number of columns the datatable has.
    DataRow row = this.ImportedData.NewRow();

    // now add our parts to the row
    for (int i = 0; i < parts.Length; i++)
    {
        string part = parts[i];
        if (!string.IsNullOrEmpty(this.CurrencySymbol))
        {
            part = (part.StartsWith(this.CurrencySymbol) || 
                    part.EndsWith(this.CurrencySymbol)) ? 
                    part.Replace(this.CurrencySymbol, "").Replace(",", "") : part;
        }
        part = part.Replace("\"", "").Trim();
        row[i] = part;
    }

    // add the new row to the datatable.
    this.ImportedData.Rows.Add(row);

    // Yes, we are adding the header row (if there is a header row), but before 
    // we determine data types, we'll be deleting it, so no harm no foul.
}	
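
The currency handling above can be illustrated in isolation. This minimal sketch applies the same rule, assuming a "$" currency symbol:

```csharp
string part = "$1,234.56";
string symbol = "$";

// only strip the symbol if it leads or trails the value, and also remove 
// thousands separators so the value can later be recognized as numeric
if (part.StartsWith(symbol) || part.EndsWith(symbol))
{
    part = part.Replace(symbol, "").Replace(",", "");
}
// part is now "1234.56", which the type determination can treat as numeric
```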

The Excel Importer

Since we're letting the ExcelDataReader library do all of the heavy lifting, there is very little to do in this class.

Configuration Properties

  • string SheetName - Represents the name of the worksheet to be imported.

Constructor

Because we have to specify a worksheet within the file, this constructor has an extra line to handle it.

C#
public ExcelImportBase(string filename, string sheetName) : base(filename)
{
    this.SheetName = sheetName;

    // initialize the properties that control how the importer functions
    this.Init();
}    

Method - Init

Like the CSV importer, the "local" Init method calls the base class version of the method, and sets the file type.

C#
protected override void Init()
{
    base.Init();
    this.FileType = ImportFileType.EXCEL;
}	

Method - DoSanityChecks

The Excel importer needs additional sanity checking to verify that the worksheet name is valid. First, it calls the base class version of the method, and then, it performs the sanity check on the sheet name.

C#
protected override void DoSanityChecks()
{
    base.DoSanityChecks();

    if (this.CanImport)
    {
        if (this.FileType == ImportFileType.EXCEL)
        {
            if (string.IsNullOrEmpty(this.SheetName))
            {
                this.CanImport = false;
                throw new ParserAgentException(ParserExceptionEnum.InvalidSheetName);
            }
        }
    }
}    

Method - Import

Again, like the CSV importer, the overridden abstract Import method loads the file, and then calls the base class ProcessData method.

C#
public override void Import()
{
    if (this.CanImport)
    {
        // load the worksheet from the file
        this.ImportedData  = this.ReadSheet(this.FileName, this.SheetName);

        // process the data
        base.ProcessData();
    }
}	

Method - BuildTableName

Because we want to make the worksheet name part of the table name, we override BuildTableName. The override calls the base version of the method, and then, if ImportedTableName is a null/empty string, appends the worksheet name to the current table name.

C#
protected override string BuildTableName()
{
    string newTableName = base.BuildTableName();
    newTableName = (!string.IsNullOrEmpty(this.ImportedTableName)) 
                    ? this.ImportedTableName 
                    : string.Concat(newTableName, "_", this.SheetName.Trim());
    return newTableName.ReplaceNonAlphaNumeric('_');
}	
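
As a hypothetical walk-through, assume the base method reduces the file name "sales data.xlsx" to "sales_data", and the worksheet is named "Q1 2018":

```csharp
// base.BuildTableName()          -> "sales_data"
// ImportedTableName is empty     -> append "_" + SheetName
// intermediate result            -> "sales_data_Q1 2018"
// ReplaceNonAlphaNumeric('_')    -> "sales_data_Q1_2018"
```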

Method - ReadSheet

This method causes the ExcelDataReader library to load the specified file/sheet into a dataset, which is then used to create the resulting data table. Since that library does all the heavy lifting, there's nothing for us to do but accept it. Amazingly, that library makes little/no effort to properly identify data types for the columns, so we still call ProcessData after this method returns.

C#
protected virtual DataTable ReadSheet(string filename, string sheetName)
{
    DataSet ds = null;
    DataTable data  = null;
    try
    {
        using (System.IO.FileStream stream = 
               File.Open(filename, FileMode.Open, FileAccess.Read))
        {
            // all of the heavy lifting is performed by the ExcelDataReader library.
            using (var reader = ExcelDataReader.ExcelReaderFactory.CreateReader(stream))
            {
                do
                {
                    if (reader.Name.IsLike(sheetName))
                    {
                        ds = reader.AsDataSet();
                    }
                } while (reader.NextResult());
            }
            if (ds != null)
            {
                data = ds.Tables[0];
            }
        }
    }
    catch (Exception ex)
    {
        throw new ParserAgentException(ParserExceptionEnum.ExcelDataReaderError, ex);
    }
    return data;
}    

Using the Importers

You will be STUNNED at how little code you actually have to write to make these classes work. First, you create your importer class. If you're doing a standard import without changing any of the default property values in the base classes, all you need is a class with a constructor in it.

C#
public class CSVLoader : CsvImportBase
{
    public CSVLoader(string filename) : base(filename)
    {
    }
}	

And then you write the code to use it. The sample app that's supplied with this article is much more extensive (and more heavily commented). The code block below is used to illustrate how little effort is required to load a CSV file.

C#
class Program
{
	private static string AppPath { get; set; }
	static void Main(string[] args)
	{
		string codeBase = Assembly.GetExecutingAssembly().CodeBase;
		UriBuilder uri = new UriBuilder(codeBase);
		string path = Uri.UnescapeDataString(uri.Path).ToLower().Replace("/", @"\");
		AppPath = System.IO.Path.GetDirectoryName(path);
		ImportCSV();
	}

	private static void ImportCSV()
	{
        string filename = System.IO.Path.Combine(AppPath, @"testdata\sample1.csv");
		CSVLoader loader = null;
        try
        {
            loader = new CSVLoader(filename);
            loader.Import();
			// this is where you add code to do something meaningful with the data
        }
        catch (Exception ex)
        {
            // sample code only - log or otherwise handle the exception here
            if (ex != null) { }
        }
	}
}	
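
The Excel importer requires just as little code; the only difference is that its constructor also needs a worksheet name. A sketch (the file and sheet names are hypothetical):

```csharp
public class ExcelLoader : ExcelImportBase
{
    public ExcelLoader(string filename, string sheetName) 
        : base(filename, sheetName)
    {
    }
}

// and to use it:
ExcelLoader loader = new ExcelLoader(@"testdata\sample1.xlsx", "Sheet1");
loader.Import();
```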

Pulling Data From the Loader

Ultimately, you'll most likely be writing code to somehow extract the data into objects for further manipulation/use, so I've included a sample that illustrates how to do exactly that.

The CSVItem Class

First, we need an object to hold an item's data. In the interest of sampulation (hey, if you wingnuts can use "elephant" instead of "f*ck", I can damn well make up my own words), this object simply defines the properties necessary to hold the data.

C#
public class CSVItem
{
    public int ID            { get; set; }
    public string Comment    { get; set; }
    public DateTime The_Date { get; set; }
    public double FloatValue { get; set; }
    public int IntValue      { get; set; }
    public string IPAddress  { get; set; }
    public double Currency   { get; set; }
}    

Next, I defined a collection that inherits List<CSVItem>. Inside this class are two overloads of the Add method.

C#
public class CSVItems : List<CSVItem>
{
    /// <summary>
    /// Add the entire data table collection to this collection, 
    /// using DataTableReader extension methods.
    /// </summary>
    /// <param name="data">The datatable that represents the imported data</param>
    public void Add(DataTable data)
    {
        try
        {
            // I already have a set of extension methods to extract column data from a 
            // DataTableReader object (these methods do all the casting and sets default 
            // values), so it's a simple matter to create a DataTableReader from the 
            // DataTable to utilize these existing methods.
            using (DataTableReader reader = data.CreateDataReader())
            {
                if (reader.HasRows)
                {
                    int ORD_ID     = reader.GetOrdinal("ID");
                    int ORD_COMM   = reader.GetOrdinal("Comment");
                    int ORD_DATE   = reader.GetOrdinal("The_Date");
                    int ORD_FLOATV = reader.GetOrdinal("FloatValue");
                    int ORD_INTV   = reader.GetOrdinal("IntValue");
                    int ORD_IP     = reader.GetOrdinal("IPAddress");
                    int ORD_CURR   = reader.GetOrdinal("Currency");
                    while (reader.Read())
                    {
                        this.Add(new CSVItem()
                        {
                            ID         = reader.GetInt32OrDefault   (ORD_ID,     0),
                            Comment    = reader.GetStringOrDefault  (ORD_COMM,   
                                                                     "No Comment specified"),
                            The_Date   = reader.GetDateTimeOrDefault(ORD_DATE,   
                                                                     new DateTime(0)),
                            FloatValue = reader.GetDoubleOrDefault  (ORD_FLOATV, 0d),
                            IntValue   = reader.GetInt32OrDefault   (ORD_INTV,   0),
                            IPAddress  = reader.GetStringOrDefault  (ORD_IP,     "1.0.0.0"),
                            Currency   = reader.GetDoubleOrDefault  (ORD_CURR,   0d),
                        });
                    }
                }
            }
        }
        catch (Exception ex)
        {
            throw new Exception("Exception encountered while storing data " +
                                "into a list of objects", ex);
        }
    }
    /// <summary>
    /// Add the entire data table collection to this collection, using the importer object's 
    /// GetColumnValue method.
    /// </summary>
    /// <param name="loader">The importer that contains the imported data</param>
    public void Add(CsvImportBase loader)
    {
        try
        {
            // The loader has a method called GetColumnValue which requires less code than 
            // the extension methods described above. Most of you will probably prefer to 
            // use those methods.
            for (int row = 0; row < loader.ImportedData.Rows.Count; row++)
            {
                this.Add(new CSVItem()
                {
                    ID         = loader.GetColumnValue(row, "ID",         0),
                    Comment    = loader.GetColumnValue(row, "Comment",    
                                                       "No Comment specified"),
                    The_Date   = loader.GetColumnValue(row, "The_Date",   new DateTime(0)),
                    FloatValue = loader.GetColumnValue(row, "FloatValue", 0d),
                    IntValue   = loader.GetColumnValue(row, "IntValue",   0),
                    IPAddress  = loader.GetColumnValue(row, "IPAddress",  "1.0.0.0"),
                    Currency   = loader.GetColumnValue(row, "Currency",   0d),
                });
            }
        }
        catch (Exception ex)
        {
            throw new Exception("Exception encountered while storing data " +
                                "into a list of objects", ex);
        }
    }
}	
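
Using the collection is then a one-liner with either overload. A sketch (use one overload or the other - calling both would add every item twice):

```csharp
CSVLoader loader = new CSVLoader(filename);
loader.Import();

CSVItems items = new CSVItems();

// either hand the collection the raw DataTable...
items.Add(loader.ImportedData);

// ...or hand it the loader itself and let GetColumnValue do the work
//items.Add(loader);
```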

Closing Statements

I am of the opinion that proper use of object-oriented principles should abstract out as much of the implementation as possible, while allowing as much extensibility as is reasonable. The ultimate goal of this code was to make it as trivial as possible to load a CSV/Excel file, and I think it succeeds.

If you don't want/need the Excel importing code, it's a simple matter to delete it from the project. I would start by using the Nuget Package Manager Console to uninstall the ExcelDataReader package, and allow the Intellisense stuff to go insane. At that point, just go down the Visual Studio Error List window, and delete all of the code that shows up with a red underline. It shouldn't take more than a few minutes.

History

  • 7th April, 2018 - Initial publication

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)