Introduction
This is a very simple CSV file parser written in C# that successfully parses files saved from MS Excel. You can write your own in two or three hours or download this one. Because CSV files have a very simple structure, this implementation can serve as a starting point for more complex parsers.
Background
The comma-separated values (CSV) format represents tabular data as text. Usually a comma is used as the delimiter between individual values in a row, and a new line (CR+LF) separates rows. If a value includes a special character (a comma, a double quote, or a line break), it must be enclosed in double quote characters ("). A double quote inside a quoted value is written as two consecutive double quotes.
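The state transition table for the algorithm implemented in this demo is shown below. It has only five states and four classes of input characters. Each cell gives the number of the next state, followed by the actions to perform (explained in the footnotes).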
| State | Any character | Comma (,) | Quote (") | EOL |
| 0 LineStart | 2C | 1V | 3 | 0L |
| 1 ValueStart | 2C | 1V | 3 | 0VL |
| 2 Value | 2C | 1V | 2C | 0VL |
| 3 QuotedValue | 3C | 3C | 4 | 3C |
| 4 Quote | 3C (?) | 1V | 3C | 0VL |
Footnotes:
C - add character to the current value
V - add current value to the current line
L - finish parsing current line
(?) - marks an error in the source sequence: a quote inside a quoted value that is followed by an ordinary character. The decision was made not to throw an exception, but to ignore the single quote character and continue the quoted value.
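To make the table concrete, here is a minimal table-driven sketch of the same transitions. It is not the implementation from this article, which uses one class per state as described in the next section. The names TransitionTableSketch, Act, and Step are hypothetical, and storing a line break inside a quoted value as '\n' is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical sketch: the transition table encoded as data.
public static class TransitionTableSketch
{
    [Flags]
    private enum Act { None = 0, C = 1, V = 2, L = 4 }

    // Rows are states 0..4; columns are: any character, comma, quote, EOL.
    private static readonly (int Next, Act Actions)[,] Table =
    {
        /* 0 LineStart   */ { (2, Act.C), (1, Act.V), (3, Act.None), (0, Act.L) },
        /* 1 ValueStart  */ { (2, Act.C), (1, Act.V), (3, Act.None), (0, Act.V | Act.L) },
        /* 2 Value       */ { (2, Act.C), (1, Act.V), (2, Act.C),    (0, Act.V | Act.L) },
        /* 3 QuotedValue */ { (3, Act.C), (3, Act.C), (4, Act.None), (3, Act.C) },
        /* 4 Quote       */ { (3, Act.C), (1, Act.V), (3, Act.C),    (0, Act.V | Act.L) },
    };

    public static List<string[]> Parse(string text)
    {
        var lines = new List<string[]>();
        var line = new List<string>();
        var value = new StringBuilder();
        int state = 0; // start in LineStart

        // Looks up the cell for (state, column) and performs its actions.
        void Step(int column, char ch)
        {
            var (next, actions) = Table[state, column];
            if ((actions & Act.C) != 0) value.Append(ch);
            if ((actions & Act.V) != 0) { line.Add(value.ToString()); value.Clear(); }
            if ((actions & Act.L) != 0) { lines.Add(line.ToArray()); line.Clear(); }
            state = next;
        }

        using (var reader = new StringReader(text))
        {
            string row;
            while ((row = reader.ReadLine()) != null)
            {
                foreach (char ch in row)
                {
                    Step(ch == ',' ? 1 : ch == '"' ? 2 : 0, ch);
                }
                // Assumption: a line break inside quotes is kept as '\n'.
                Step(3, '\n');
            }
        }

        // Flush whatever an unterminated quoted value left behind.
        if (value.Length > 0) { line.Add(value.ToString()); }
        if (line.Count > 0) { lines.Add(line.ToArray()); }
        return lines;
    }
}
```

For the input line a,"b,c",d this sketch should yield a single row with the three values a, b,c and d, because the comma between the quotes is handled by the QuotedValue row of the table.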
Using the code
Each state in the table above is represented by a class, and each character class is represented by a method. All state classes derive from a common class named ParserState, declared as follows.
private abstract class ParserState
{
    public static readonly LineStartState LineStartState = new LineStartState();
    public static readonly ValueStartState ValueStartState = new ValueStartState();
    public static readonly ValueState ValueState = new ValueState();
    public static readonly QuotedValueState QuotedValueState = new QuotedValueState();
    public static readonly QuoteState QuoteState = new QuoteState();

    public abstract ParserState AnyChar(char ch, ParserContext context);
    public abstract ParserState Comma(ParserContext context);
    public abstract ParserState Quote(ParserContext context);
    public abstract ParserState EndOfLine(ParserContext context);
}
The Flyweight pattern is used to reuse instances of the state classes. Instead of instantiating a state object on every transition, the mutable parsing data is extracted into a separate class, ParserContext, an instance of which is passed to every transition method. Each method performs the transition by returning the next state. The implementations of ParserState are straightforward and simply mirror the rules described above.
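As an illustration of how the state classes might look, here is a self-contained sketch written from the transition table. It is not the code from the download: the names StateSketch, Context, State, the field names (LineStart, InQuotes, AfterQuote and so on), and ParseLine are hypothetical, Context is a reduced stand-in for ParserContext, and keeping a line break inside quotes as '\n' is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StateSketch
{
    // Reduced, hypothetical stand-in for the article's ParserContext.
    public sealed class Context
    {
        private readonly StringBuilder _value = new StringBuilder();
        private readonly List<string> _line = new List<string>();
        public readonly List<string[]> Lines = new List<string[]>();
        public void AddChar(char ch) { _value.Append(ch); }                      // action C
        public void AddValue() { _line.Add(_value.ToString()); _value.Clear(); } // action V
        public void AddLine() { Lines.Add(_line.ToArray()); _line.Clear(); }     // action L
    }

    public abstract class State
    {
        // Shared, stateless instances (the flyweights).
        public static readonly State LineStart = new LineStartState();
        public static readonly State ValueStart = new ValueStartState();
        public static readonly State InValue = new ValueState();
        public static readonly State InQuotes = new QuotedValueState();
        public static readonly State AfterQuote = new QuoteState();

        public abstract State AnyChar(char ch, Context c);
        public abstract State Comma(Context c);
        public abstract State Quote(Context c);
        public abstract State EndOfLine(Context c);
    }

    private sealed class LineStartState : State
    {
        public override State AnyChar(char ch, Context c) { c.AddChar(ch); return InValue; }
        public override State Comma(Context c) { c.AddValue(); return ValueStart; }
        public override State Quote(Context c) { return InQuotes; }
        public override State EndOfLine(Context c) { c.AddLine(); return LineStart; }
    }

    private sealed class ValueStartState : State
    {
        public override State AnyChar(char ch, Context c) { c.AddChar(ch); return InValue; }
        public override State Comma(Context c) { c.AddValue(); return ValueStart; }
        public override State Quote(Context c) { return InQuotes; }
        public override State EndOfLine(Context c) { c.AddValue(); c.AddLine(); return LineStart; }
    }

    private sealed class ValueState : State
    {
        public override State AnyChar(char ch, Context c) { c.AddChar(ch); return InValue; }
        public override State Comma(Context c) { c.AddValue(); return ValueStart; }
        public override State Quote(Context c) { c.AddChar('"'); return InValue; }
        public override State EndOfLine(Context c) { c.AddValue(); c.AddLine(); return LineStart; }
    }

    private sealed class QuotedValueState : State
    {
        public override State AnyChar(char ch, Context c) { c.AddChar(ch); return InQuotes; }
        public override State Comma(Context c) { c.AddChar(','); return InQuotes; }
        public override State Quote(Context c) { return AfterQuote; }
        // Assumption: a line break inside quotes is kept as '\n'.
        public override State EndOfLine(Context c) { c.AddChar('\n'); return InQuotes; }
    }

    private sealed class QuoteState : State
    {
        // The (?) case: the stray quote is ignored and the quoted value continues.
        public override State AnyChar(char ch, Context c) { c.AddChar(ch); return InQuotes; }
        public override State Comma(Context c) { c.AddValue(); return ValueStart; }
        // Two quotes in a row stand for one literal quote.
        public override State Quote(Context c) { c.AddChar('"'); return InQuotes; }
        public override State EndOfLine(Context c) { c.AddValue(); c.AddLine(); return LineStart; }
    }

    // Drives the states over one line of text and returns its values.
    public static string[] ParseLine(string line)
    {
        var context = new Context();
        State state = State.LineStart;
        foreach (char ch in line)
        {
            state = ch == ',' ? state.Comma(context)
                  : ch == '"' ? state.Quote(context)
                  : state.AnyChar(ch, context);
        }
        state.EndOfLine(context);
        // An unterminated quote leaves the line unfinished; this sketch then returns no values.
        return context.Lines.Count > 0 ? context.Lines[0] : new string[0];
    }
}
```

Because the state objects hold no data of their own, a single instance of each can be shared by every parse, and all mutable data lives in the context object.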
Below is the implementation of the ParserContext class.
private class ParserContext
{
    private readonly StringBuilder _currentValue = new StringBuilder();
    private readonly List<string[]> _lines = new List<string[]>();
    private readonly List<string> _currentLine = new List<string>();

    public void AddChar(char ch)
    {
        _currentValue.Append(ch);
    }

    public void AddValue()
    {
        _currentLine.Add(_currentValue.ToString());
        _currentValue.Remove(0, _currentValue.Length);
    }

    public void AddLine()
    {
        _lines.Add(_currentLine.ToArray());
        _currentLine.Clear();
    }

    public List<string[]> GetAllLines()
    {
        if (_currentValue.Length > 0)
        {
            AddValue();
        }
        if (_currentLine.Count > 0)
        {
            AddLine();
        }
        return _lines;
    }
}
Basically, ParserContext implements the "footnotes" above: it adds a character to the current value, a value to the current line, and a line to the results, and it provides the results at the end.
The parser itself accepts an instance of TextReader, which can provide access to a file, an in-memory string, or any other stream of text data, and returns an array of arrays of strings. It instantiates ParserContext and starts parsing from LineStartState. Here is the main method:
public string[][] Parse(TextReader reader)
{
    var context = new ParserContext();
    ParserState currentState = ParserState.LineStartState;
    string next;
    while ((next = reader.ReadLine()) != null)
    {
        foreach (char ch in next)
        {
            switch (ch)
            {
                case CommaCharacter:
                    currentState = currentState.Comma(context);
                    break;
                case QuoteCharacter:
                    currentState = currentState.Quote(context);
                    break;
                default:
                    currentState = currentState.AnyChar(ch, context);
                    break;
            }
        }
        currentState = currentState.EndOfLine(context);
    }
    List<string[]> allLines = context.GetAllLines();
    return allLines.ToArray();
}
Note: all other classes are nested in the CsvParser class for brevity and simplicity.
Other files available for download include unit tests and an extended implementation of the parser that supports additional options: reading only a certain number of columns and trimming trailing empty lines (useful for parsing CSV files saved from Excel with a long invisible column on the right).
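The two options can also be approximated as post-processing steps on the parser's output. The sketch below is hypothetical and is not the extended implementation from the download: the names CsvPostProcessingSketch, TakeColumns, and TrimTrailingEmptyLines are made up for illustration, and it assumes that an "empty" line is one with no values or with only empty values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helpers applied to the string[][] returned by a parser.
public static class CsvPostProcessingSketch
{
    // Keeps at most maxColumns values from each row.
    public static string[][] TakeColumns(string[][] rows, int maxColumns)
    {
        return rows.Select(row => row.Take(maxColumns).ToArray()).ToArray();
    }

    // Drops rows at the end that have no values or only empty values.
    public static string[][] TrimTrailingEmptyLines(string[][] rows)
    {
        int count = rows.Length;
        while (count > 0 && rows[count - 1].All(string.IsNullOrEmpty))
        {
            count--;
        }
        return rows.Take(count).ToArray();
    }
}
```

Doing this inside the parser, as the extended implementation is described to do, avoids building rows that are discarded afterwards. The post-processing form is shown here only because it is short.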
Points of Interest
This demo shows the use of the State and Flyweight design patterns.