Introduction
Whilst many modern applications exchange data in well defined standard ways (such as XML or JSON) the reality is that a great deal of data is still packaged in flat files (by which I mean comma separated values, or fixed width text files).
This library is concerned with turning these files into arrays of type-safe objects and using semantic meaning in those objects to validate the files concerned.
Background
The use of attributes to tag properties of your classes is quite common in both Entity Framework and also in System.Runtime.Serialization. This library very much follows that model but if you are not familiar with it then I recommend reading this article to start with.
Using the code
For a worked example, we'll start with a flat file which contains historical price information about Microsoft shares from Yahoo:
What is desired is to turn this into an IEnumerable of a class called StockPrice that can be declared thus:-
Public Class YahooStockPrice
Public Property PriceDate As Date
Public Property OpenPrice As Nullable(Of Decimal)
Public Property HighPrice As Nullable(Of Decimal)
Public Property LowPrice As Nullable(Of Decimal)
Public Property ClosingPrice As Nullable(Of Decimal)
Public Property Volume As Nullable(Of Decimal)
Public Property AdjustedClose As Nullable(Of Decimal)
End Class
Field separator and field definition
You specify the field separator at the class level with the RecordFieldSeperator attribute. There are four possible standard separators (comma, tab, pipe or semicolon) or you can specify your own custom separator as an array of characters.
Then each field that you want to read is specified by the RecordColumn attribute. This takes a column header name and the ordinal position of the field. You do not have to specify every field but (obviously) any attempt to specify two properties at the same column position will throw an error.
Imports FileUtilities
<RecordFieldSeperator(FileUtilities.RecordFieldSeperatorAttribute.StandardSeparators.Comma)>
Public Class YahooStockPrice
<RecordColumn("Date", 0)>
Public Property PriceDate As Date
<RecordColumn("Open", 1)>
Public Property OpenPrice As Nullable(Of Decimal)
<RecordColumn("High", 2)>
Public Property HighPrice As Nullable(Of Decimal)
<RecordColumn("Low", 3)>
Public Property LowPrice As Nullable(Of Decimal)
<RecordColumn("Closing", 4)>
Public Property ClosingPrice As Nullable(Of Decimal)
<RecordColumn("Volume", 5)>
Public Property Volume As Nullable(Of Decimal)
<RecordColumn("Adj Close", 6)>
Public Property AdjustedClose As Nullable(Of Decimal)
End Class
Fixed-width files
Another possibility for text files is that they don't have any record separator but instead have implied fields of a given fixed width. (This is often the case for records coming from mainframe / COBOL style applications.) In this case you do not need to specify a record separator but each field needs to have the additional information of the length of the field and the field start location specified.
For example if your account number field is defined as teh 5 letters that start from character position # 6 it is specified thus:
<RecordColumn("Account Number", 2, False, 5, 6)>
Excel files
In version 2 there is a reader for Excel 2007 files. This depends on the ClosedXML library from CodePlex (so I have included the version separately - if you don't need Excel, use version 1)
To specify that a record should be read from a specific worksheet within the file mark the class with a RecordExcelWorksheet attribute:-
<RecordExcelSourceSheet("House Summary")>
If no worksheet is specified then the code will use the first worksheet in the file. You can specify that each property of your class comes from an explicit cell or from a particular column in the excel sheet:-
<RecordsSpecificExcelCell("Business Date:", RecordColumnExcelCellAttribute.ExcelColumn.I, 4)>
Semantic meaning
When initially read in, every field is just considered to be a string. In order to turn it into the correct type and apply any data format rules you need to add the appropriate semantic meaning tag. These are RecordColumnDate, RecordColumnNumber etc.
Imports FileUtilities
<RecordFieldSeperator(FileUtilities.RecordFieldSeperatorAttribute.StandardSeparators.Comma)>
Public Class YahooStockPrice
<RecordColumn("Date", 0)>
<RecordColumnDate("M/d/yyyy")>
Public Property PriceDate As Date
<RecordColumnNumber()>
<RecordColumn("Open", 1)>
Public Property OpenPrice As Nullable(Of Decimal)
<RecordColumnNumber()>
<RecordColumn("High", 2)>
Public Property HighPrice As Nullable(Of Decimal)
<RecordColumnNumber()>
<RecordColumn("Low", 3)>
Public Property LowPrice As Nullable(Of Decimal)
<RecordColumnNumber()>
<RecordColumn("Closing", 4)>
Public Property ClosingPrice As Nullable(Of Decimal)
<RecordColumnNumber()>
<RecordColumn("Volume", 5)>
Public Property Volume As Nullable(Of Decimal)
<RecordColumnNumber()>
<RecordColumn("Adj Close", 6)>
Public Property AdjustedClose As Nullable(Of Decimal)
End Class
Validation
Now that we have some semantic meaning attributes attached to our class we can also specify validation rules to skip over any invalid rows. These are specified using the RecordColumnValidation attribute, and there are a number of validations built in:
Public Enum ValidationRule
NoValidation = 0
NotBlank = 1
NotColumnName = 2
NumberValidation = 3
DateValidation = 4
MustEqual = 5
MustNotEqual = 6
RegularEpressionMatch = 7
InstrumentIdentifier = 8
End Enum
In the case of the Yahoo file we need to validate that the fields do not match the column name and that they are valid numeric or date data.
In version 2 there is also the concept of marking a validation as "fatal" by setting the optional parameter in the validation attribute constructor. For exampel if a certain field must be set to the capital letter M only you could use the following validation to enforce that:
<RecordColumnValidation(RecordColumnValidationAttribute.ValidationRule.MustEqual, "M", True)>
Unusual formats
As well as the standard number and date formats provided in the .NET framework there are some unusual cases that can be sent (again suually as output from mainframe systems). For example the number might have a trailing sign and might have an implied number of decimals. For these cases the RecordColumnCustomNumber attribute can be used to effectively preprocess the number before converting it.
These special formats are expressed as a Flags based enumerated type as they can sometimes be combined together in one field.
<Flags()>
Public Enum CustomNumberFormatFlags
StandardNumber = &H0
FixedDecimalPoint = &H1
TrailingSign = &H2
NegativeBracketed = &H4
FractionalSuffix = &H8
End Enum
Modifying values
Sometimes the data passed in from the file needs to be modified in order to work. For example some files might contain "NULL" or "n/a" in a field when that field should be specified as blank or a field needs to be trimmed as it is read in. This is done with the attributes such as ValueSubstitution and ValueTrim.
Culture
If you are receiving a file from a different culture (i.e. for example if it was written by a machine in France and is being imported by a machine in the US) you may need to explicitly denote the culture the file comes from in order to convert numeric and date values. This is done with the ValidCulture attribute.
Reading the data
The hard work of applying all these attributes and turning out the results is done by the ClassStreamReader class, which is a generic class that takes the POCO type specified with the attributes as input. It can either read one record at a time with the ReadNext method, or return the complete set of record with the ReadToEnd method.
For example to iterate through the Yahoo file specified you could do:-
Dim reader As New ClassStreamReader(GetType(YahooStockPrice), _
New System.IO.FileInfo("table.csv"))
For Each rowdata As YahooStockPrice In reader.ReadToEnd()
Debug.WriteLine(rowdata.PriceDate.ToShortDateString() & _
" closed at " & rowdata.AdjustedClose.ToString())
Next
If your data comes from an Excel file (per vserion 2 of the code) use a ClassExcelReader in place of a ClassStreamReader.
Making use of the semantic meaning
One of the powerful things that comes out of defining the semantic meaning of the fields in your record is that you can validate a given row against that record and - in some special circumstances - even fix up "defective" input files.
The validation is performed by the method GetValidationWarnings of the class ClassColumnMapping. This tries the passed in string against the decorated .NET class and returns a list of 0 or more warnings about the data.
History
2015-03-26 Initial version
2015-03-30 Additional detail on how to do fixed width fields and the more unusual data formats
2015-11-26 Added an Excel reader and the concept of fatal/non fatal validations