Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

FileUtilities - a library for reading flat files into POCOs

0.00/5 (No votes)
26 Nov 2015 1  
A utility to help turn flat files (.csv, tab separated etc.) into type-safe CLR objects with validation using attributes
 

Introduction

Whilst many modern applications exchange data in well defined standard ways (such as XML or JSON) the reality is that a great deal of data is still packaged in flat files (by which I mean comma separated values, or fixed width text files). 

This library is concerned with turning these files into arrays of type-safe objects and using semantic meaning in those objects to validate the files concerned.

Background

The use of attributes to tag properties of your classes is quite common in both Entity Framework and also in System.Runtime.Serialization.  This library very much follows that model but if you are not familiar with it then I recommend reading this article to start with.

Using the code

For a worked example, we'll start with a flat file which contains historical price information about Microsoft shares from Yahoo:

What is desired is to turn this into an IEnumerable of a class called StockPrice that can be declared thus:-

Public Class YahooStockPrice

    Public Property PriceDate As Date

    Public Property OpenPrice As Nullable(Of Decimal)

    Public Property HighPrice As Nullable(Of Decimal)

    Public Property LowPrice As Nullable(Of Decimal)

    Public Property ClosingPrice As Nullable(Of Decimal)

    Public Property Volume As Nullable(Of Decimal)

    Public Property AdjustedClose As Nullable(Of Decimal)

End Class

Field separator and field definition

You specify the field separator at the class level with the RecordFieldSeperator attribute.  There are four possible standard separators (comma, tab, pipe or semicolon) or you can specify your own custom separator as an array of characters. 

Then each field that you want to read is specified by the RecordColumn attribute.  This takes a column header name and the ordinal position of the field.  You do not have to specify every field but (obviously) any attempt to specify two properties at the same column position will throw an error.

Imports FileUtilities

<RecordFieldSeperator(FileUtilities.RecordFieldSeperatorAttribute.StandardSeparators.Comma)>
Public Class YahooStockPrice

  <RecordColumn("Date", 0)>
  Public Property PriceDate As Date

  <RecordColumn("Open", 1)>
  Public Property OpenPrice As Nullable(Of Decimal)

  <RecordColumn("High", 2)>
  Public Property HighPrice As Nullable(Of Decimal)

  <RecordColumn("Low", 3)>
  Public Property LowPrice As Nullable(Of Decimal)

  <RecordColumn("Closing", 4)>
  Public Property ClosingPrice As Nullable(Of Decimal)

  <RecordColumn("Volume", 5)>
  Public Property Volume As Nullable(Of Decimal)

  <RecordColumn("Adj Close", 6)>
  Public Property AdjustedClose As Nullable(Of Decimal)

End Class

Fixed-width files

Another possibility for text files is that they don't have any record separator but instead have implied fields of a given fixed width.  (This is often the case for records coming from mainframe / COBOL style applications.)  In this case you do not need to specify a record separator but each field needs to have the additional information of the length of the field and the field start location specified.

For example if your account number field is defined as teh 5 letters that start from character position # 6 it is specified thus:

<RecordColumn("Account Number", 2, False, 5, 6)>

Excel files

In version 2 there is a reader for Excel 2007 files.  This depends on the ClosedXML library from CodePlex (so I have included the version separately - if you don't need Excel, use version 1)

To specify that a record should be read from a specific worksheet within the file mark the class with a RecordExcelWorksheet attribute:-

<RecordExcelSourceSheet("House Summary")>

If no worksheet is specified then the code will use the first worksheet in the file.  You can specify that each property of your class comes from an explicit cell or from a particular column in the excel sheet:-

<RecordsSpecificExcelCell("Business Date:", RecordColumnExcelCellAttribute.ExcelColumn.I, 4)>

Semantic meaning

When initially read in, every field is just considered to be a string.  In order to turn it into the correct type and apply any data format rules you need to add the appropriate semantic meaning tag.  These are RecordColumnDate, RecordColumnNumber etc.

Imports FileUtilities

<RecordFieldSeperator(FileUtilities.RecordFieldSeperatorAttribute.StandardSeparators.Comma)>
Public Class YahooStockPrice

  <RecordColumn("Date", 0)>
  <RecordColumnDate("M/d/yyyy")>
  Public Property PriceDate As Date

  <RecordColumnNumber()>
  <RecordColumn("Open", 1)>
  Public Property OpenPrice As Nullable(Of Decimal)

  <RecordColumnNumber()>
  <RecordColumn("High", 2)>
  Public Property HighPrice As Nullable(Of Decimal)

  <RecordColumnNumber()>
  <RecordColumn("Low", 3)>
  Public Property LowPrice As Nullable(Of Decimal)

  <RecordColumnNumber()>
  <RecordColumn("Closing", 4)>
  Public Property ClosingPrice As Nullable(Of Decimal)

  <RecordColumnNumber()>
  <RecordColumn("Volume", 5)>
  Public Property Volume As Nullable(Of Decimal)

  <RecordColumnNumber()>
  <RecordColumn("Adj Close", 6)>
  Public Property AdjustedClose As Nullable(Of Decimal)

End Class

Validation

Now that we have some semantic meaning attributes attached to our class we can also specify validation rules to skip over any invalid rows.  These are specified using the RecordColumnValidation attribute, and there are a number of validations built in:

Public Enum ValidationRule
  ''' <summary>
  ''' No validation is applied to this column
  ''' </summary>
  NoValidation = 0
  ''' <summary>
  ''' Record field must not be empty
  ''' </summary>
  NotBlank = 1
  ''' <summary>
  ''' Column data does not contain the column name
  ''' </summary>
  NotColumnName = 2
  ''' <summary>
  ''' Column must contain a number that maps to its RecordColumnNumberAttribute settings
  ''' </summary>
  NumberValidation = 3
  ''' <summary>
  ''' Column must contain a date that maps to its RecordColumnDateAttribute settings
  ''' </summary>
  DateValidation = 4
  ''' <summary>
  ''' Record must equal the CompareTo property
  ''' </summary>
  MustEqual = 5
  ''' <summary>
  ''' Record must not equal the CompareTo property
  ''' </summary>
  MustNotEqual = 6
  ''' <summary>
  ''' The value must match the regular expression setting in the CompareTo property
  ''' </summary>
  RegularEpressionMatch = 7
  ''' <summary>
  ''' Test the value against the assigned instrument identifier attributes
  ''' </summary>
  InstrumentIdentifier = 8
End Enum

 

In the case of the Yahoo file we need to validate that the fields do not match the column name and that they are valid numeric or date data.

In version 2 there is also the concept of marking a validation as "fatal" by setting the optional parameter in the validation attribute constructor.  For exampel if a certain field must be set to the capital letter M only you could use the following validation to enforce that:

<RecordColumnValidation(RecordColumnValidationAttribute.ValidationRule.MustEqual, "M", True)>

Unusual formats

As well as the standard number and date formats provided in the .NET framework there are some unusual cases that can be sent (again suually as output from mainframe systems).  For example the number might have a trailing sign and might have an implied number of decimals.  For these cases the RecordColumnCustomNumber attribute can be used to effectively preprocess the number before converting it.

These special formats are expressed as a Flags based enumerated type as they can sometimes be combined together in one field.

<Flags()>
Public Enum CustomNumberFormatFlags
    ''' <summary>
    ''' There are no custom transforms for this number
    ''' </summary>
    StandardNumber = &H0
    ''' <summary>
    ''' There is an implied decimal precision that is fixed for this field
    ''' </summary>
    FixedDecimalPoint = &H1
    ''' <summary>
    ''' The sign comes at the end of the number (e.g. 12345- means -12345)
    ''' </summary>
    ''' <remarks>
    ''' If no - then assume number is positive
    ''' </remarks>
    TrailingSign = &H2
    ''' <summary>
    ''' Negative numbers are in braces
    ''' </summary>
    ''' <remarks>
    ''' e.g. (123.45) means -123.45
    ''' </remarks>
    NegativeBracketed = &H4
    ''' <summary>
    ''' Number can be followed by a fraction - e.g. "37 1/2" or even "1.85 3/8"
    ''' </summary>
    FractionalSuffix = &H8
    'additional madness can be added here if files contain it
End Enum

Modifying values

Sometimes the data passed in from the file needs to be modified in order to work.  For example some files might contain "NULL" or "n/a" in a field when that field should be specified as blank or a field needs to be trimmed as it is read in.  This is done with the attributes such as ValueSubstitution and ValueTrim.

Culture

If you are receiving a file from a different culture (i.e. for example if it was written by a machine in France and is being imported by a machine in the US) you may need to explicitly denote the culture the file comes from in order to convert numeric and date values.  This is done with the ValidCulture attribute.

Reading the data

The hard work of applying all these attributes and turning out the results is done by the ClassStreamReader class, which is a generic class that takes the POCO type specified with the attributes as input.  It can either read one record at a time with the ReadNext method, or return the complete set of record with the ReadToEnd method.

For example to iterate through the Yahoo file specified you could do:-

Dim reader As New ClassStreamReader(GetType(YahooStockPrice), _
             New System.IO.FileInfo("table.csv"))

For Each rowdata As YahooStockPrice In reader.ReadToEnd()
   Debug.WriteLine(rowdata.PriceDate.ToShortDateString() & _
       " closed at " & rowdata.AdjustedClose.ToString())
Next

If your data comes from an Excel file (per vserion 2 of the code) use a ClassExcelReader in place of a ClassStreamReader.

Making use of the semantic meaning

One of the powerful things that comes out of defining the semantic meaning of the fields in your record is that you can validate a given row against that record and - in some special circumstances - even fix up "defective" input files.

The validation is performed by the method GetValidationWarnings of the class  ClassColumnMapping.  This tries the passed in string against the decorated .NET class and returns a list of 0 or more warnings about the data.

History

2015-03-26 Initial version

2015-03-30 Additional detail on how to do fixed width fields and the more unusual data formats

2015-11-26 Added an Excel reader and the concept of fatal/non fatal validations

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here