Introduction
First off, I'll admit it - I'm not really breaking much new ground here. I wrote an earlier article about parsing CSV files - CSV File Parser[^] - and while it provides adequate code, I figured I could make it a little easier on the programmer who was using it. I chose to present this as a new article because a) it serves as a good example of how code evolves when you get a chance to revisit it, and b) it presents what I consider to be a significantly improved approach over the original article(s).
This code was actually extracted from a recent series of articles I posted that allows you to create scheduleable agents for SQL Server Express - SQLXAgent - Jobs for SQL Express - Part 1 of 6[^]. That article included most of the code you see here, but with the database support removed because this article is targeted at getting data from a CSV (or Excel) file into a collection in memory. Additional processing beyond that point is (understandably) left to the programmer implementing this code into his own project, and I have provided an example of how this might be accomplished near the end of the article.
I was tempted to just refer everyone to the earlier CSVParser articles, but considered that if I was the person reading this article, I would be annoyed to have to click a link to read stuff that may or may not be applicable to this article, so please excuse any duplication of narrative regarding the CSV parsing code, and remember, I'm doing it for the children.
There are no images, pretty pictures, or fluff in this article. If you're looking for that kind of stuff, go to a nearby news stand and pick up a copy of Cosmopolitan.
Background
I (still) live in a very bizarre programming world. The project for which this code was originally developed has since been abandoned, but it is the reason this code existed in the first place. That project involved importing data from almost five DOZEN different data sources, comprised mostly of Excel XLSX files, at least one CSV file, a web site that provides raw data in XML format, and a couple of actual database queries. The Excel spreadsheets come from a mix of database queries on web sites that can return the results in the form of said spreadsheet files, with the rest being manually generated by humans.
Garbage In
You would think that the human-generated files would present the most quirks because humans are flawed. Mind-numbingly boring data entry, a substantial lack of monetary compensation in the form of a paycheck, and the fact that it's a government employee performing the work all conspire to form a perfect storm of inability to muster anything resembling attention to detail, which leads to "nuances" in the tabular data. However, database pulls can be equally fraught with errors (amusingly enough), especially if the data-entry side of the database doesn't quite catch all of the potential errors that a human is apt to introduce, or if the person in control of the data extraction decides that a column needs to be renamed, moved, or deleted, or a new column added. Of course, they feel no pressing need to inform the consumers of their data regarding changes to format or content, and this leads to a world of wonderful surprises that are best observed in the form of failed SQL server jobs, or as was the case for this now abandoned project, a sometimes spectacular failure of the code because I didn't anticipate the most recent act of wanton human stupidity.
Garbage Out
The primary method of importing said spreadsheets was a library called EPPlus. While it's an adequate library in general, it has some weaknesses. The one that drove me to write the code featured in the original article is that, for some reason (as yet undiscovered by - well - anybody), some .XLSX files simply would not load using the library. This "nuance" forced me to use Excel to save the affected sheet as a CSV file, which then caused me to write more code to implement the capability.
I have since switched to a seemingly more robust Excel reader, which, while obsoleting the need to import CSV files, did not necessarily remove the need for the CSV parsing code for you, my legions of adoring fans.
Assumptions
As with most of my articles, this one is not about theory, or the newest doo-dad or gee-gaw that Microsoft seems to think we want to see in .NET. To put it simply, this is real-world code that lives and breathes in an actual project. As it gets used, it gets tested a little more thoroughly, and as problems crop up, they are promptly fixed. The code as presented here appears to work fairly well - today. Tomorrow will be a fresh hell of "might not", because I cannot think of every possible thing that might get thrown at it. I try to head most of the obvious stuff off, but like everything else related to programming, just when you think your code is idiot-proof, the world invents a better idiot, and you end up performing what I call knee-jerk programming.
This article assumes that you're a moderately accomplished developer, but also one who wants some code to put out a fire that you'd really rather not put too much time into yourself. I didn't do anything too fancy or elegant, because "fancy and elegant" code is quite often a lot harder to understand and maintain. The code is heavily commented, so there should be ample explanation about how it works. If I mention something here that you don't understand, there are several search engines at your immediate disposal. My advice - use them.
Sample Files, Suitability, and a Heads-Up
The sample code includes a short XLSX file, as well as a similarly short CSV file. The CSV file has some intentionally invalid lines so that the code can be reasonably well exercised.
To ensure that the class meets your particular needs, use the included sample project to establish the code's suitability, and make any changes that you deem necessary before putting it into your own project(s).
While perusing the code, remember that coding styles differ between pretty much every programmer. Don't get hung up on formatting, or why I did something a certain way. Remember, you have the source code, so you are free to mould the code in your own image. Also keep in mind that you may encounter a bug or three. Nobody writes perfect code, and I may not have taken sufficient steps to keep programmers from abusing the code. If you find something like that, be a programmer and address it.
The Code
This library loads data from an Excel or CSV file, and makes it available in the form of a DataTable to the programmer for further processing within his application. Each column's data type can either be determined automatically by the contents of the columns themselves, or can be manually specified by the programmer.
NuGet Package - ExcelDataReader
When I developed this code for use in the SQLXAgent project cited above, I found a NuGet package called ExcelDataReader that quite handily loaded Excel files and returned the loaded data in the form of a DataTable object. I was honestly shocked that this package didn't go the extra distance and try to determine appropriate data types for the loaded data, and a lot of the code in this article is dedicated to that aspect of loading the data.
This package can load both Excel and CSV files, but I only use it to load Excel files, and prefer to use my own code to load CSV files. The reason for this is that ExcelDataReader doesn't allow me enough control over what happens as the CSV file is parsed. For instance, if a problem is encountered in the form of a malformed row, ExcelDataReader throws an exception and you don't get ANY of the data back from the file. I wanted to return as much data as possible, but still report which rows were malformed. I also wanted to control when and how column data types were determined.
General Architecture
The code is comprised of a base importer class, a source-specific importer class that inherits the base class, and several support classes. Also included are a number of extension methods for various .NET classes that I've developed over the years, many of which are not necessarily utilized by the importer code, but that some of you might find valuable in other situations. I leave those methods for you to find via the joy of intellisense (and may the Force be with you).
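The layering described above can be sketched as a minimal, self-contained skeleton. The `*Sketch` and `MyCsvLoader` names below are illustrative stand-ins, not the article's actual classes - they only show the shape of the inheritance chain:

```csharp
using System;
using System.Data;

// Minimal stand-in for the article's abstract ImportBase class.
public abstract class ImportBaseSketch
{
    public DataTable ImportedData { get; protected set; } = new DataTable();

    // The single abstract member; every deriving importer must supply Import().
    public abstract void Import();
}

// Source-specific layer, mirroring CsvImportBase; still abstract, so the
// consuming programmer derives from it.
public abstract class CsvImportBaseSketch : ImportBaseSketch
{
    public override void Import()
    {
        // The real class parses the CSV file here, then post-processes the table.
        ImportedData.Columns.Add("Column_1", typeof(object));
        ImportedData.Rows.Add("sample");
    }
}

// The only class the user of the library actually has to write.
public sealed class MyCsvLoader : CsvImportBaseSketch { }
```

Creating a `MyCsvLoader` and calling `Import()` populates `ImportedData` - the same flow the real classes follow, just with the parsing stubbed out.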
The ImportBase Class
This class is admittedly where most of the action happens. As you might expect, it contains methods and properties that are commonly used by both the CsvImportBase and ExcelImportBase classes. This class is also abstract, which means you can't directly instantiate it.
You may notice that pretty much all of the methods are virtual. This means if you have a different vision as to how stuff should work, you can override these methods and implement your own version of insanity without destroying the original methods.
Importer Configuration Properties
- string FileName - Represents the fully qualified path of the file being imported.
- ImportFileType FileType - Represents the type of file being imported (Excel or CSV), based on which deriving base-importer class is being used.
- bool CanImport - A flag indicating whether the file can be imported. This flag is set by the deriving base-importer class as a result of sanity checks performed for the given importer.
Importer Control Properties
To make importing as configurable (yet still automagic) as possible, there are a number of properties that can be set in your inheriting classes.
- ItemHints ColumnHints - Represents a collection of items that indicate the data type for a given imported column. Please refer to the section that specifically discusses this collection.
- bool AllFieldsAreStrings (false) - A flag that indicates that all imported columns contain strings.
- bool DeleteEmptyRows (true) - A flag that indicates that empty rows encountered in the imported data are not to be included in the resulting DataTable object.
- bool FirstRowDeterminesType (false) - A flag that indicates that only the first row of data should be used to determine column data types in the ColumnHints collection.
- bool FirstRowIsHeader (true) - A flag that indicates that the first row in the file contains column header names. If you set this to false for a given file, column names will be assigned in the format "Column_n", where "n" indicates the ordinal position of the column in the row.
- string ImportedTableName (string.Empty) - The table name assigned to the DataTable. If this string is null/empty at the time the Import method is called, the filename (and, if the file being imported is an Excel file, the sheet name) is used as the table name.
Importer Output Properties
- DataTable ImportedData - This property is populated as a result of importing the data. After the Import method has exited, and assuming no exceptions were thrown during the import process, this object contains all valid rows, with column names and appropriate column data types already set.
- int RemovedRowCount - This value indicates the number of empty rows that were removed during the import process.
Abstract Method - Import()
This is the only abstract method in the class, and as such, must be overridden somewhere in the inheritance chain. In our case, it is overridden by the CsvImportBase and ExcelImportBase classes because importing is somewhat different for the two file types. Each class's version will be discussed individually a little later in this article.
Method - Init
The Init method is called by the deriving class, and serves to perform sanity checks on the filename and set default values for the control and output properties.
protected virtual void Init()
{
DebugMsgs.Add(this, DebugLevel.Full, "ImportBase");
this.DoSanityChecks();
this.ImportedTableName = string.Empty;
this.FirstRowIsHeader = true;
this.ColumnHints = null;
this.DeleteEmptyRows = true;
this.FirstRowDeterminesType = false;
this.AllFieldsAreStrings = false;
}
Method - DoSanityChecks
This method makes sure that we have a) a valid filename, and b) that the file exists. I recognize that an empty/null filename will return false from the File.Exists() method, but I separate out the two possible exceptions because they are distinctly different problems.
protected virtual void DoSanityChecks()
{
this.CanImport = false;
if (string.IsNullOrEmpty(this.FileName))
{
throw new ParserAgentException(ParserExceptionEnum.InvalidFileName);
}
else
{
if (!File.Exists(this.FileName))
{
throw new ParserAgentException(ParserExceptionEnum.ImportFileNotFound,
new FileNotFoundException(this.FileName));
}
}
this.CanImport = true;
}
Method - ProcessData
This method is invoked by the derived class's overridden Import method, and takes care of massaging the imported data as indicated by the comments. Each of the methods that it calls can be overridden in the inheriting class to provide additional and/or alternative functionality.
protected virtual void ProcessData()
{
this.RemoveEmptyRows();
this.NormalizeNames();
this.CreateColumnHints();
this.SetColumnTypes();
}
Method - RemoveEmptyRows
If the DeleteEmptyRows property is true, this method removes rows from the data table in which all of the column values are null or an empty string. Otherwise, no processing is performed.
protected virtual void RemoveEmptyRows()
{
this.RemovedRowCount = 0;
if (this.DeleteEmptyRows)
{
for (int i = this.ImportedData.Rows.Count-1; i >= 0; i--)
{
DataRow row = this.ImportedData.Rows[i];
if (row.ItemArray.All(x=>x is DBNull || string.IsNullOrEmpty(x.ToString())))
{
this.ImportedData.Rows.Remove(row);
this.RemovedRowCount++;
}
}
}
}
Method - NormalizeNames
If the FirstRowIsHeader property is true, this method replaces all spaces and non-alphanumeric characters with underscores in each column name. This is a hold-over from the SQLXAgent article that was implemented to avoid weird column names in a SQL table (I personally hate having to use square brackets in queries). Additionally, the table name is set (unless the programmer specified a static one) to whatever the filename is.
protected virtual void NormalizeNames()
{
if (this.FirstRowIsHeader)
{
DataRow headerRow = this.ImportedData.Rows[0];
for (int i = 0; i < headerRow.ItemArray.Count(); i++)
{
DataColumn excelColumn = this.ImportedData.Columns[i];
string rowColumn = (string)headerRow.ItemArray[i];
excelColumn.ColumnName = (string.IsNullOrEmpty(rowColumn)
? excelColumn.ColumnName
: rowColumn
).Trim().ReplaceNonAlphaNumeric('_');
}
this.ImportedData.Rows.RemoveAt(0);
}
this.ImportedData.TableName = this.BuildTableName();
}
Method - BuildTableName
This method builds the table name based on the name of the file being imported. Again, spaces and non-alphanumeric characters are replaced with underscores.
protected virtual string BuildTableName()
{
string newTableName = (string.IsNullOrEmpty(this.ImportedTableName))
? string.Concat("Imported_", Path.GetFileName(this.FileName).Trim())
: this.ImportedTableName;
return newTableName.ReplaceNonAlphaNumeric('_');
}
Method - CreateColumnHints
This method creates column hints as directed by the various control properties. In a nutshell, column hints are created as indicated below. In the interest of brevity, all of the methods listed below are included in the same code block.

- If the AllFieldsAreStrings property is true, this method calls the CreateAllStringsColumnHints method, which simply sets the data type for all columns to string. Since no value checking is performed, this is the fastest column hint creation method available in the importer. It's also quite possibly the most useless in terms of processing most imported data. If the AllFieldsAreStrings property is false, and...
- If the FirstRowDeterminesType property is true, this method calls the CreateColumnHintFromFirstCompleteRow method, which only checks the values in the first row of data to establish the column types. This is the 2nd fastest column hint creation method because it only processes each column in the first row of data. If the FirstRowDeterminesType property is false, and...
- If the programmer has not already manually created the ColumnHints collection, each column in each row is evaluated to determine the most appropriate data type for a given column. How data types are determined is discussed in the section that describes the ColumnHints collection.
protected virtual void CreateColumnHints()
{
if (this.AllFieldsAreStrings)
{
this.CreateAllStringsColumnHints();
}
else if (this.FirstRowDeterminesType)
{
this.CreateColumnHintFromFirstCompleteRow();
}
else
{
if (this.ColumnHints == null || this.ColumnHints.Count == 0)
{
this.ColumnHints = new ItemHints();
for (int i = 0; i < this.ImportedData.Columns.Count; i++)
{
DataColumn col = this.ImportedData.Columns[i];
if (!string.IsNullOrEmpty(col.ColumnName))
{
HintItem hint = new HintItem()
{ Name = col.ColumnName, ColNumb = col.Ordinal, ItemType = null };
foreach (DataRow row in this.ImportedData.Rows)
{
hint.DetermineType(row[col], this.FileType == ImportFileType.CSV);
if (hint.ItemType.Name.IsLike("String"))
{
break;
}
}
this.ColumnHints.Add(hint);
}
}
}
}
}
public virtual void CreateAllStringsColumnHints()
{
this.ColumnHints = new ItemHints();
foreach (DataColumn col in this.ImportedData.Columns)
{
if (!string.IsNullOrEmpty(col.ColumnName))
{
HintItem hint = new HintItem()
{ Name = col.ColumnName, ColNumb = col.Ordinal, ItemType = typeof(string) };
this.ColumnHints.Add(hint);
}
}
}
protected virtual void CreateColumnHintFromFirstCompleteRow()
{
this.ColumnHints = new ItemHints();
DataRow row = this.ImportedData.Rows[0];
foreach (DataColumn col in this.ImportedData.Columns)
{
if (!string.IsNullOrEmpty(col.ColumnName))
{
HintItem hint = new HintItem()
{ Name = col.ColumnName, ColNumb=col.Ordinal, ItemType = null };
hint.DetermineType(row[col], this.FileType == ImportFileType.CSV);
this.ColumnHints.Add(hint);
}
}
}
Method - SetColumnTypes
This method uses the ColumnHints collection to reset column data types for the imported data. Here's something I learned while writing this code - once a DataTable object has rows, the column data types can't be changed. If you want to change data types in this situation, you must clone the data table object (which copies JUST the schema to the new data table), set the column data types, and then import the old data table's rows one at a time into the new data table.
protected virtual void SetColumnTypes()
{
if (this.ColumnHints != null)
{
using (DataTable cloned = this.ImportedData.Clone())
{
for (int i = 0; i < cloned.Columns.Count; i++)
{
cloned.Columns[i].DataType = ColumnHints[i].ItemType;
}
for (int i = 0; i < this.ImportedData.Rows.Count; i++)
{
DataRow row = this.ImportedData.Rows[i];
cloned.ImportRow(row);
}
this.ImportedData = cloned;
}
}
else
{
using (DataTable cloned = this.ImportedData.Clone())
{
for (int i = 0; i < cloned.Columns.Count; i++)
{
cloned.Columns[i].DataType = typeof(object);
}
for (int i = 0; i < this.ImportedData.Rows.Count; i++)
{
DataRow row = this.ImportedData.Rows[i];
cloned.ImportRow(row);
}
this.ImportedData = cloned;
}
}
}
Methods - GetColumnValue (two overloads)
These methods retrieve the value contained in the specified row/column. To avoid countless overloads of the method, I chose to use generic types so that the appropriate casting could be performed on the values. Again, in the interest of brevity, I included both overloads in the same code block.
public T GetColumnValue<T>(int row, string colName, T defaultValue)
{
if (row < 0 || row >= this.ImportedData.Rows.Count)
{
throw new ParserAgentException(ParserExceptionEnum.RowIndexOutOfRange,
new ArgumentOutOfRangeException(
string.Format("Index {0} is out of range (max row index is {1})",
row, this.ImportedData.Rows.Count-1)));
}
if (string.IsNullOrEmpty(colName))
{
throw new ArgumentNullException("colName",
"Column name parameter cannot be null or empty.");
}
if (!this.ImportedData.Columns.Contains(colName))
{
throw new ParserAgentException(ParserExceptionEnum.ColumnDoesNotExist,
new ArgumentException(
string.Format("Specified column name '{0}' not found in data table",
colName)));
}
if (this.ColumnHints.FindByColumnName(colName) == null)
{
throw new ParserAgentException(ParserExceptionEnum.ColumnDoesNotExist,
new ArgumentException(
string.Format("Specified column name '{0}' not found in column hints",
colName)));
}
int column = -1;
for (int i = 0; i < this.ImportedData.Columns.Count; i++)
{
if (ImportedData.Columns[i].ColumnName.ToUpper() == colName.ToUpper())
{
column = i;
}
}
object value = this.ImportedData.Rows[row].ItemArray[column];
T result = (value != null) ? (T)value : defaultValue;
return result;
}
public T GetColumnValue<T>(int row, int column, T defaultValue)
{
if (row < 0 || row >= this.ImportedData.Rows.Count)
{
throw new ParserAgentException(ParserExceptionEnum.RowIndexOutOfRange,
new ArgumentOutOfRangeException(
string.Format("Index {0} is out of range (max row index is {1})",
row, this.ImportedData.Rows.Count-1)));
}
if (column < 0 || column >= this.ImportedData.Columns.Count)
{
throw new ParserAgentException(ParserExceptionEnum.ColumnIndexOutOfRange,
new ArgumentOutOfRangeException(
string.Format("Column index {0} is out of range (max column index is {1})",
column, this.ImportedData.Columns.Count-1)));
}
if (column >= this.ColumnHints.Count)
{
throw new ParserAgentException(ParserExceptionEnum.ColumnIndexOutOfRange,
new ArgumentException(
string.Format("Column index {0} is out of range in column hints collection (max column index is {1})",
column, this.ColumnHints.Count-1)));
}
T result;
object value = this.ImportedData.Rows[row].ItemArray[column];
result = (value != null) ? (T)value : defaultValue;
return result;
}
ColumnHints
The ColumnHints collection contains a HintItem object for every column found in the imported data. The HintItem class exposes Name, ItemType, and ColNumb properties, but those aren't the most interesting part of the class. The real magic happens in the DetermineType method.
Type determination is essentially handled in descending order of complexity, starting with attempting to parse a value into a DateTime, progressing through the various numeric and bool types, and when all else fails, the type is determined to be a string. There are various methods you can use to override the behavior of the data type determination, and all of those methods have already been discussed.
public void DetermineType(object value, bool fromCSV=false)
{
if (this.ItemType == null || !this.ItemType.Name.IsLike("String"))
{
Type theType = this.ItemType;
if (value is string && !fromCSV)
{
theType = typeof(String);
}
else
{
string valueStr = value.ToString();
DateTime parsedDateTime;
Int32 parsedInt32;
double parsedDouble;
bool parsedBool;
if (DateTime.TryParseExact(valueStr, DateFormatStrings.Formats,
CultureInfo.CurrentUICulture, DateTimeStyles.None, out parsedDateTime))
{
theType = typeof(DateTime);
}
else if (Int32.TryParse(valueStr, out parsedInt32))
{
if (theType == null || !theType.Name.IsLike("Double"))
{
theType = typeof(Int32);
}
}
else if (double.TryParse(valueStr, out parsedDouble))
{
theType = typeof(Double);
}
else if (bool.TryParse(valueStr, out parsedBool))
{
theType = typeof(Boolean);
}
else
{
theType = typeof(String);
}
}
this.ItemType = theType;
}
}
The CSV Importer
This class is responsible for importing CSV files. Like the ImportBase class, it is abstract and MUST be inherited by a class that YOU write. Fortunately, this class's default settings make it possible to do so with minimal effort on your part (see the Using the Importers section below).
Control Properties
- bool FailOnMalformedLine (false) - If this property is true, importing will stop with an exception if a malformed row is encountered.
- string CurrencySymbol - This represents the localized currency symbol. By default, the code will set this value based on the operating system's settings. However, the programmer can set it manually if desired.
- string[] CurrentData - This represents the array of column data for the last-parsed row.
- bool RemoveCurrencySymbols (true) - If this property is true, currency symbols will be removed from data BEFORE the column data types are determined. Note: If you want your currency data to be seen as numeric data, you must leave this flag set to true. Otherwise, the entire column will be seen as a string.
- bool RemoveQuotes (true) - If this property is true, enclosing quote characters will be removed from string fields.
Processing Properties
- string CurrentLine - Represents the current row of text that's being parsed.
- bool IsMalformed - Indicates whether the current line was determined to be malformed.
Output Properties
- List<InvalidLine> InvalidLines - A collection of invalid lines that were encountered while processing the CSV file.
- int TotalLinesProcessed - Indicates the number of lines that were processed in the CSV file. This number includes invalid lines.
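After an import, the two output properties above are all you need to report what went wrong. This is a sketch of one way to do that; InvalidLineSketch is a minimal stand-in for the article's InvalidLine class, and Summarize is an illustrative helper of my own, not part of the shipped code:

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-in for the article's InvalidLine class so this reporting
// pattern compiles on its own.
public sealed class InvalidLineSketch
{
    public int LineNumber { get; set; }
    public string LineText { get; set; }
    public string Reason { get; set; }
}

public static class ImportReport
{
    // Turns TotalLinesProcessed and the InvalidLines collection into a
    // human-readable summary after Import() has returned.
    public static string Summarize(int totalLines, IList<InvalidLineSketch> invalid)
    {
        var lines = new List<string>
        {
            string.Format("Processed {0} line(s), {1} invalid.", totalLines, invalid.Count)
        };
        foreach (InvalidLineSketch bad in invalid)
        {
            lines.Add(string.Format("  line {0}: {1} -> {2}",
                                    bad.LineNumber, bad.Reason, bad.LineText));
        }
        return string.Join(Environment.NewLine, lines);
    }
}
```

Because the parser keeps going past malformed rows (unless FailOnMalformedLine is set), a summary like this lets you load the good rows and still account for every bad one.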
Method - Init
This method initializes the CSV importer object. First, it calls the base class's version of the method, and then goes on to set appropriate values for both the base class properties and its own properties.
protected override void Init()
{
base.Init();
this.FileType = ImportFileType.CSV;
this.RemoveCurrencySymbols = true;
this.CurrencySymbol = CultureInfo.CurrentUICulture.NumberFormat.CurrencySymbol;
this.CurrentLine = string.Empty;
this.CurrentData = null;
this.IsMalformed = false;
this.InvalidLines = new List<InvalidLine>();
this.ImportedData = new DataTable();
this.TotalLinesProcessed = 0;
this.FailOnMalformedLine = false;
this.RemoveQuotes = true;
}
Method - Import
This is the abstract method that must eventually be overridden, and it marshals the process of importing. In this case, we call a method to import and parse the data, and then call the base class ProcessData method.
public override void Import()
{
if (this.CanImport)
{
this.Parse();
base.ProcessData();
}
}
Method - Parse
This method opens the file, parses one row at a time, and if no problems are encountered, adds the parsed data to the data table. If a problem is encountered in a row, the line number, the text of the line, and the reason it was determined to be invalid are stored in the InvalidLines collection.
protected virtual void Parse()
{
this.InvalidLines.Clear();
try
{
using (StreamReader dataStream = new StreamReader(this.FileName))
{
this.InvalidLines.Clear();
string line = string.Empty;
int lineCounter = -1;
while ((line = dataStream.ReadLine()) != null)
{
lineCounter++;
if (string.IsNullOrEmpty(line))
{
continue;
}
this.CurrentData = this.ReadFields(lineCounter, line);
this.InitDataTable(this.CurrentData);
Debug.WriteLine(line);
if (this.IsMalformed)
{
if (this.FailOnMalformedLine)
{
throw new ParserAgentException(ParserExceptionEnum.CsvMalformedLine,
new InvalidOperationException(
string.Format("Line {0} is malformed.",
lineCounter)));
}
}
else
{
this.AddToDataTable(this.CurrentData);
}
this.TotalLinesProcessed++;
}
}
}
catch (Exception ex)
{
throw new Exception("Error parsing file.", ex);
}
}
Method - ReadFields
This method is called by the Parse method for each row in the CSV file, and parses the supplied line of text that represents a row of data. No attempt is made to fix invalid lines. If a line is determined to be invalid (either a closing quote is missing, or the number of columns found does not match the number expected), the errant line is added to the InvalidLines collection.
protected virtual string[] ReadFields(int lineCounter, string text)
{
this.IsMalformed = false;
string[] parts = text.Trim().Split(',');
List<string> newParts = new List<string>();
bool inQuotes = false;
string currentPart = string.Empty;
for (int i = 0; i < parts.Length; i++)
{
string part = parts[i];
if (inQuotes && part.StartsWithSingleDoubleQuote() == true &&
!string.IsNullOrEmpty(currentPart))
{
currentPart = string.Concat(currentPart, "\"");
newParts.Add(currentPart);
currentPart = string.Empty;
inQuotes = false;
}
inQuotes = (inQuotes || (!inQuotes && part.StartsWithSingleDoubleQuote() == true));
if (inQuotes)
{
currentPart = (string.IsNullOrEmpty(currentPart))
? part : string.Format("{0},{1}", currentPart, part);
}
else
{
currentPart = part;
}
inQuotes = (inQuotes && currentPart.EndsWithSingleDoubleQuote() == false);
if (!inQuotes)
{
currentPart = (this.RemoveQuotes) ? currentPart.Trim('\"') : currentPart;
newParts.Add(currentPart);
currentPart = string.Empty;
}
}
this.IsMalformed = (inQuotes || this.ImportedData.Columns.Count > 0 &&
newParts.Count != this.ImportedData.Columns.Count);
if (this.IsMalformed)
{
string reason = (inQuotes) ? "Missing end-quote" : "Missing at least one column";
this.InvalidLines.Add(new InvalidLine()
{
LineNumber = lineCounter,
LineText = text,
Reason = reason
});
}
return newParts.ToArray();
}
Method - InitDataTable
This method is called for every row that is processed, but it only initializes the datatable if there aren't any columns yet defined therein. It simply adds columns.
protected virtual void InitDataTable(string[] parts)
{
if (this.ImportedData == null)
{
this.ImportedData = new DataTable();
}
if (this.ImportedData.Columns.Count == 0)
{
for (int i = 0; i < parts.Length; i++)
{
this.ImportedData.Columns.Add((this.FirstRowIsHeader) ?
parts[i] : string.Format("Column_{0}", i + 1), typeof(object));
}
}
}
Method - AddToDataTable
This method adds the parsed data to the data table. By the time we get here, no validation should be required.
protected virtual void AddToDataTable(string[] parts)
{
DataRow row = this.ImportedData.NewRow();
for (int i = 0; i < parts.Length; i++)
{
string part = parts[i];
if (!string.IsNullOrEmpty(this.CurrencySymbol))
{
part = (part.StartsWith(this.CurrencySymbol) ||
part.EndsWith(this.CurrencySymbol)) ?
part.Replace(this.CurrencySymbol, "").Replace(",", "") : part;
}
part = part.Replace("\"", "").Trim();
row[i] = part;
}
this.ImportedData.Rows.Add(row);
}
The Excel Importer
Since we're letting the ExcelDataReader library do all of the heavy lifting, there is very little to do in this class.
Configuration Properties
- string SheetName - Represents the name of the worksheet to be imported.
Constructor
Because we have to specify a worksheet within the file, this constructor has an extra line to handle it.
public ExcelImportBase(string filename, string sheetName) : base(filename)
{
this.SheetName = sheetName;
this.Init();
}
Method - Init
Like the CSV importer, the "local" Init method calls the base class version of the method, and sets the file type.
protected override void Init()
{
base.Init();
this.FileType = ImportFileType.EXCEL;
}
Method - DoSanityChecks
The Excel importer needs additional sanity checking to verify that the worksheet name is valid. First, it calls the base class version of the method, and then, it performs the sanity check on the sheet name.
protected override void DoSanityChecks()
{
base.DoSanityChecks();
if (this.CanImport)
{
if (this.FileType == ImportFileType.EXCEL)
{
if (string.IsNullOrEmpty(this.SheetName))
{
this.CanImport = false;
throw new ParserAgentException(ParserExceptionEnum.InvalidSheetName);
}
}
}
}
Method - Import
Again, like the CSV importer, the overridden abstract Import method loads the file, and then calls the base class ProcessData method.
public override void Import()
{
if (this.CanImport)
{
this.ImportedData = this.ReadSheet(this.FileName, this.SheetName);
base.ProcessData();
}
}
Method - BuildTableName
Because we want to make the worksheet name part of the table name, we override BuildTableName, which calls the base version of the method, and then, if the ImportedTableName is a null/empty string, appends the worksheet name to the current table name.
protected override string BuildTableName()
{
string newTableName = base.BuildTableName();
newTableName = (!string.IsNullOrEmpty(this.ImportedTableName))
? this.ImportedTableName
: string.Concat(newTableName, "_", this.SheetName.Trim());
return newTableName.ReplaceNonAlphaNumeric('_');
}
Method - ReadSheet
This method causes the ExcelDataReader library to load the specified file/sheet into a dataset, which is then used to create the resulting data table. Since that library does all the heavy lifting, there's nothing for us to do but accept what it gives us. Amazingly, that library makes little/no effort to properly identify data types for the columns, so we still call ProcessData after this method returns.
protected virtual DataTable ReadSheet(string filename, string sheetName)
{
DataSet ds = null;
DataTable data = null;
try
{
using (System.IO.FileStream stream =
File.Open(filename, FileMode.Open, FileAccess.Read))
{
using (var reader = ExcelDataReader.ExcelReaderFactory.CreateReader(stream))
{
do
{
if (reader.Name.IsLike(sheetName))
{
ds = reader.AsDataSet();
}
} while (reader.NextResult());
}
if (ds != null)
{
data = ds.Tables[0];
}
}
}
catch (Exception ex)
{
throw new ParserAgentException(ParserExceptionEnum.ExcelDataReaderError, ex);
}
return data;
}
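The IsLike string extension used in the sheet-matching loop above isn't part of the BCL, and its implementation isn't shown here. In this context it only has to compare the reader's current sheet name against the requested one, so a reasonable sketch is a case-insensitive equality check; the actual project version may be more elaborate (wildcard support, for instance):

```csharp
using System;

public static class StringLikeExtensions
{
    // Hypothetical sketch: case-insensitive name comparison, which is
    // all the worksheet-matching loop in ReadSheet actually requires.
    public static bool IsLike(this string text, string pattern)
    {
        return string.Equals(text, pattern, StringComparison.OrdinalIgnoreCase);
    }
}
```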
Using the Importers
You will be STUNNED at how little code you actually have to write to make these classes work. First, create your importer class. If you're doing a standard import without changing any of the default property values in the base classes, all you need is a class with a constructor in it.
public class CSVLoader : CsvImportBase
{
public CSVLoader(string filename) : base(filename)
{
}
}
And then you write the code to use it. The sample app that's supplied with this article is much more extensive (and more heavily commented). The code block below is used to illustrate how little effort is required to load a CSV file.
class Program
{
private static string AppPath { get; set; }
static void Main(string[] args)
{
string codeBase = Assembly.GetExecutingAssembly().CodeBase;
UriBuilder uri = new UriBuilder(codeBase);
string path = Uri.UnescapeDataString(uri.Path).ToLower().Replace("/", @"\");
AppPath = System.IO.Path.GetDirectoryName(path);
ImportCSV();
}
private static void ImportCSV()
{
string filename = System.IO.Path.Combine(AppPath, @"testdata\sample1.csv");
CSVLoader loader = null;
try
{
loader = new CSVLoader(filename);
loader.Import();
}
catch (Exception ex)
{
// The empty test below merely keeps the compiler quiet about the unused
// variable; handle/log the exception as appropriate in real code.
if (ex != null) { }
}
}
}
Pulling Data From the Loader
Ultimately, you'll most likely be writing code to somehow extract the data into objects for further manipulation/use, so I've included a sample that illustrates how to do exactly that.
The CSVItem Class
First, we need an object to hold an item's data. In the interest of sampulation (hey, if you wingnuts can use "elephant" instead of "f*ck", I can damn well make up my own words), this object simply defines the properties necessary to hold the data.
public class CSVItem
{
public int ID { get; set; }
public string Comment { get; set; }
public DateTime The_Date { get; set; }
public double FloatValue { get; set; }
public int IntValue { get; set; }
public string IPAddress { get; set; }
public double Currency { get; set; }
}
Next, I defined a collection that inherits List<CSVItem>. Inside this class are two overloads of the Add method.
public class CSVItems : List<CSVItem>
{
public void Add(DataTable data)
{
try
{
using (DataTableReader reader = data.CreateDataReader())
{
if (reader.HasRows)
{
int ORD_ID = reader.GetOrdinal("ID");
int ORD_COMM = reader.GetOrdinal("Comment");
int ORD_DATE = reader.GetOrdinal("The_Date");
int ORD_FLOATV = reader.GetOrdinal("FloatValue");
int ORD_INTV = reader.GetOrdinal("IntValue");
int ORD_IP = reader.GetOrdinal("IPAddress");
int ORD_CURR = reader.GetOrdinal("Currency");
while (reader.Read())
{
this.Add(new CSVItem()
{
ID = reader.GetInt32OrDefault (ORD_ID, 0),
Comment = reader.GetStringOrDefault (ORD_COMM,
"No Comment specified"),
The_Date = reader.GetDateTimeOrDefault(ORD_DATE,
new DateTime(0)),
FloatValue = reader.GetDoubleOrDefault (ORD_FLOATV, 0d),
IntValue = reader.GetInt32OrDefault (ORD_INTV, 0),
IPAddress = reader.GetStringOrDefault (ORD_IP, "1.0.0.0"),
Currency = reader.GetDoubleOrDefault (ORD_CURR, 0d),
});
}
}
}
}
catch (Exception ex)
{
throw new Exception("Exception encountered while storing data into a list of objects", ex);
}
}
public void Add(CsvImportBase loader)
{
try
{
for (int row = 0; row < loader.ImportedData.Rows.Count; row++)
{
this.Add(new CSVItem()
{
ID = loader.GetColumnValue(row, "ID", 0),
Comment = loader.GetColumnValue(row, "Comment",
"No Comment specified"),
The_Date = loader.GetColumnValue(row, "The_Date", new DateTime(0)),
FloatValue = loader.GetColumnValue(row, "FloatValue", 0d),
IntValue = loader.GetColumnValue(row, "IntValue", 0),
IPAddress = loader.GetColumnValue(row, "IPAddress", "1.0.0.0"),
Currency = loader.GetColumnValue(row, "Currency", 0d),
});
}
}
catch (Exception ex)
{
throw new Exception("Exception encountered while storing data into a list of objects", ex);
}
}
}
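The GetInt32OrDefault, GetStringOrDefault, GetDateTimeOrDefault, and GetDoubleOrDefault calls above are extension methods on the data reader that aren't shown in this excerpt. A hedged sketch of two of them, under the assumption that they read the column at the given ordinal and fall back to the supplied default when the value is DBNull or can't be converted:

```csharp
using System;
using System.Data;

public static class DataReaderExtensions
{
    // Hypothetical sketch of the "OrDefault" family: return the value at
    // the specified ordinal, or the caller's default if the value is
    // DBNull or cannot be converted to the requested type.
    public static int GetInt32OrDefault(this IDataReader reader, int ordinal, int defaultValue)
    {
        try
        {
            return reader.IsDBNull(ordinal)
                ? defaultValue
                : Convert.ToInt32(reader.GetValue(ordinal));
        }
        catch (Exception)
        {
            return defaultValue;
        }
    }

    public static string GetStringOrDefault(this IDataReader reader, int ordinal, string defaultValue)
    {
        return reader.IsDBNull(ordinal)
            ? defaultValue
            : reader.GetValue(ordinal).ToString();
    }
}
```

The remaining methods in the family would follow the same pattern with Convert.ToDateTime and Convert.ToDouble.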
Closing Statements
I am of the opinion that proper use of object-oriented principles should abstract out as much of the implementation as possible, while allowing as much extensibility as is reasonable. The ultimate goal of this code was to make it as trivial as possible to load a CSV/Excel file. I think this code succeeds in those goals.
If you don't want/need the Excel importing code, it's a simple matter to delete it from the project. I would start by using the NuGet Package Manager Console to uninstall the ExcelDataReader package, and allow IntelliSense to go insane. At that point, just go down the Visual Studio Error List window, and delete all of the code that shows up with a red squiggle. It shouldn't take more than a few minutes.
History
- 7th April, 2018 - Initial publication