CSV Parser (C#)

13 Mar 2014
A simple implementation of a parser for comma-separated values (CSV) files in C#

Introduction

This is a very simple CSV file parser written in C# that successfully parses files saved from MS Excel. You could write your own in two or three hours, or download this one. Because CSV files have a very simple structure, this implementation can serve as a starting point for more complex parsers.

Background

The comma-separated values format represents tabular data as text. Usually a comma is the delimiter between individual values in a row, and a new line (CR+LF) separates rows. If a value includes a special character (a comma, a double quote, or a line break), it must be enclosed in double quote characters (").

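The state transition table for the algorithm implemented in this demo is shown below. It has only five states and four classes of input characters. Each cell gives the number of the next state, followed by the actions to perform (see the footnotes).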

State            Any character   Comma (,)   Quote (")   EOL
0 LineStart      2C              1V          3           0L
1 ValueStart     2C              1V          3           0VL
2 Value          2C              1V          2C          0VL
3 QuotedValue    3C              3C          4           3C
4 Quote          3C (?)          1V          3C          0VL

Footnotes:

C - add character to the current value
V - add current value to the current line
L - finish parsing current line

(?) - represents an error in the source sequence. The decision was made not to throw an exception, but instead to ignore a stray quote character inside a quoted value.
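
To make the table concrete, here is a minimal table-driven sketch in C#. It is not the implementation used in this article, which has one class per state. The class name TableDrivenCsv and the decision to normalize CR+LF to '\n' are assumptions made only for illustration. Each cell of the table is split into a next-state entry and an action string that uses the footnote letters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical table-driven variant of the state machine (illustration only).
public static class TableDrivenCsv
{
    // Rows: states 0..4. Columns: 0 = any character, 1 = comma, 2 = quote, 3 = EOL.
    private static readonly int[,] Next =
    {
        { 2, 1, 3, 0 }, // 0 LineStart
        { 2, 1, 3, 0 }, // 1 ValueStart
        { 2, 1, 2, 0 }, // 2 Value
        { 3, 3, 4, 3 }, // 3 QuotedValue
        { 3, 1, 3, 0 }, // 4 Quote
    };

    // Actions per cell, written with the footnote letters C, V and L ("" = no action).
    private static readonly string[,] Actions =
    {
        { "C", "V", "",  "L"  },
        { "C", "V", "",  "VL" },
        { "C", "V", "C", "VL" },
        { "C", "C", "",  "C"  },
        { "C", "V", "C", "VL" },
    };

    public static List<string[]> Parse(string text)
    {
        var lines = new List<string[]>();
        var line = new List<string>();
        var value = new StringBuilder();
        int state = 0;

        // Assumption: CR+LF is normalized to '\n', and a final EOL is supplied
        // when the input does not end with one.
        string input = text.Replace("\r\n", "\n");
        if (input.Length > 0 && input[input.Length - 1] != '\n')
        {
            input += "\n";
        }

        foreach (char ch in input)
        {
            int cls = ch == ',' ? 1 : ch == '"' ? 2 : ch == '\n' ? 3 : 0;
            foreach (char action in Actions[state, cls])
            {
                if (action == 'C') { value.Append(ch); }
                else if (action == 'V') { line.Add(value.ToString()); value.Clear(); }
                else { lines.Add(line.ToArray()); line.Clear(); }
            }
            state = Next[state, cls];
        }

        // Flush data left over from an unterminated quoted value.
        if (value.Length > 0) { line.Add(value.ToString()); }
        if (line.Count > 0) { lines.Add(line.ToArray()); }
        return lines;
    }
}
```

For example, TableDrivenCsv.Parse("a,\"b,c\",d") should yield a single row with the three values a, b,c and d, because the comma inside the quotes is handled by the QuotedValue row of the table.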

Using the code

Each state in the table above is represented by a class, and each class of input character is represented by a method. All state classes derive from a common class named ParserState, declared as follows.

        private abstract class ParserState
        {
            // Shared instances, one per row of the transition table.
            public static readonly LineStartState LineStartState = new LineStartState();
            public static readonly ValueStartState ValueStartState = new ValueStartState();
            public static readonly ValueState ValueState = new ValueState();
            public static readonly QuotedValueState QuotedValueState = new QuotedValueState();
            public static readonly QuoteState QuoteState = new QuoteState();

            // One method per class of input character; each returns the next state.
            public abstract ParserState AnyChar(char ch, ParserContext context);
            public abstract ParserState Comma(ParserContext context);
            public abstract ParserState Quote(ParserContext context);
            public abstract ParserState EndOfLine(ParserContext context);
        }

The Flyweight pattern is used to reuse state class instances. Instead of instantiating a state object on every transition, the mutable parsing data is extracted into a separate class, ParserContext, an instance of which is passed to every transition method. Each method performs a transition by returning the next state. The ParserState implementations are straightforward and simply mirror the rules in the table above.

Below is the implementation of the ParserContext class.

        private class ParserContext
        {
            private readonly StringBuilder _currentValue = new StringBuilder();
            private readonly List<string[]> _lines = new List<string[]>();
            private readonly List<string> _currentLine = new List<string>();

            // Action C: add a character to the current value.
            public void AddChar(char ch)
            {
                _currentValue.Append(ch);
            }

            // Action V: add the current value to the current line and reset the buffer.
            public void AddValue()
            {
                _currentLine.Add(_currentValue.ToString());
                _currentValue.Clear();
            }

            // Action L: finish the current line and start a new one.
            public void AddLine()
            {
                _lines.Add(_currentLine.ToArray());
                _currentLine.Clear();
            }

            // Flushes any pending value and line, then returns everything parsed so far.
            public List<string[]> GetAllLines()
            {
                if (_currentValue.Length > 0)
                {
                    AddValue();
                }
                if (_currentLine.Count > 0)
                {
                    AddLine();
                }
                return _lines;
            }
        }

In essence, ParserContext implements the actions from the footnotes above: it adds a character to the current value, a value to the current line, or a line to the results, and it returns the results at the end.

The parser itself accepts a TextReader, which can provide access to a file, an in-memory string, or any other stream of text data, and returns an array of arrays of strings. It instantiates a ParserContext and starts parsing from LineStartState. Here is the main method:

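        // CommaCharacter and QuoteCharacter are char constants declared elsewhere
        // in CsvParser (not shown here; presumably ',' and '"').
        public string[][] Parse(TextReader reader)
        {
            var context = new ParserContext();

            ParserState currentState = ParserState.LineStartState;
            string next;
            // ReadLine strips the line terminator, so the end of each line is
            // reported to the state machine explicitly after the inner loop.
            while ((next = reader.ReadLine()) != null)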
            {
                foreach (char ch in next)
                {
                    switch (ch)
                    {
                        case CommaCharacter:
                            currentState = currentState.Comma(context);
                            break;
                        case QuoteCharacter:
                            currentState = currentState.Quote(context);
                            break;
                        default:
                            currentState = currentState.AnyChar(ch, context);
                            break;
                    }
                }
                currentState = currentState.EndOfLine(context);
            }
            List<string[]> allLines = context.GetAllLines();
            return allLines.ToArray();
        }

Note: all other classes are nested inside CsvParser for brevity and simplicity.
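
The five state classes are not listed in the article text. The following is a minimal sketch of how they might look if each method simply follows its cell in the transition table. It is a guess at the shape of the code, not the downloadable source: the wrapper class StateSketch, the condensed ParserContext, and the choice to append '\n' when a line ends inside a quoted value are all assumptions. Unlike GetAllLines above, this sketch does not flush a value left pending by an unterminated quote.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical wrapper class; the article nests these types inside CsvParser.
public class StateSketch
{
    // Condensed context so that the sketch compiles on its own.
    private class ParserContext
    {
        private readonly StringBuilder _value = new StringBuilder();
        private readonly List<string> _line = new List<string>();
        public readonly List<string[]> Lines = new List<string[]>();
        public void AddChar(char ch) { _value.Append(ch); }                              // C
        public void AddValue() { _line.Add(_value.ToString()); _value.Clear(); }         // V
        public void AddLine() { Lines.Add(_line.ToArray()); _line.Clear(); }             // L
    }

    private abstract class ParserState
    {
        public static readonly LineStartState LineStartState = new LineStartState();
        public static readonly ValueStartState ValueStartState = new ValueStartState();
        public static readonly ValueState ValueState = new ValueState();
        public static readonly QuotedValueState QuotedValueState = new QuotedValueState();
        public static readonly QuoteState QuoteState = new QuoteState();
        public abstract ParserState AnyChar(char ch, ParserContext context);
        public abstract ParserState Comma(ParserContext context);
        public abstract ParserState Quote(ParserContext context);
        public abstract ParserState EndOfLine(ParserContext context);
    }

    private class LineStartState : ParserState // row 0: 2C, 1V, 3, 0L
    {
        public override ParserState AnyChar(char ch, ParserContext c) { c.AddChar(ch); return ValueState; }
        public override ParserState Comma(ParserContext c) { c.AddValue(); return ValueStartState; }
        public override ParserState Quote(ParserContext c) { return QuotedValueState; }
        public override ParserState EndOfLine(ParserContext c) { c.AddLine(); return LineStartState; }
    }

    private class ValueStartState : ParserState // row 1: 2C, 1V, 3, 0VL
    {
        public override ParserState AnyChar(char ch, ParserContext c) { c.AddChar(ch); return ValueState; }
        public override ParserState Comma(ParserContext c) { c.AddValue(); return ValueStartState; }
        public override ParserState Quote(ParserContext c) { return QuotedValueState; }
        public override ParserState EndOfLine(ParserContext c) { c.AddValue(); c.AddLine(); return LineStartState; }
    }

    private class ValueState : ParserState // row 2: 2C, 1V, 2C, 0VL
    {
        public override ParserState AnyChar(char ch, ParserContext c) { c.AddChar(ch); return ValueState; }
        public override ParserState Comma(ParserContext c) { c.AddValue(); return ValueStartState; }
        public override ParserState Quote(ParserContext c) { c.AddChar('"'); return ValueState; }
        public override ParserState EndOfLine(ParserContext c) { c.AddValue(); c.AddLine(); return LineStartState; }
    }

    private class QuotedValueState : ParserState // row 3: 3C, 3C, 4, 3C
    {
        public override ParserState AnyChar(char ch, ParserContext c) { c.AddChar(ch); return QuotedValueState; }
        public override ParserState Comma(ParserContext c) { c.AddChar(','); return QuotedValueState; }
        public override ParserState Quote(ParserContext c) { return QuoteState; }
        // Assumption: a line break inside quotes is stored as '\n'.
        public override ParserState EndOfLine(ParserContext c) { c.AddChar('\n'); return QuotedValueState; }
    }

    private class QuoteState : ParserState // row 4: 3C (?), 1V, 3C, 0VL
    {
        public override ParserState AnyChar(char ch, ParserContext c) { c.AddChar(ch); return QuotedValueState; }
        public override ParserState Comma(ParserContext c) { c.AddValue(); return ValueStartState; }
        public override ParserState Quote(ParserContext c) { c.AddChar('"'); return QuotedValueState; }
        public override ParserState EndOfLine(ParserContext c) { c.AddValue(); c.AddLine(); return LineStartState; }
    }

    // Same loop as the Parse method above, with the comma and quote literals inlined.
    public static string[][] Parse(TextReader reader)
    {
        var context = new ParserContext();
        ParserState state = ParserState.LineStartState;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            foreach (char ch in line)
            {
                if (ch == ',') { state = state.Comma(context); }
                else if (ch == '"') { state = state.Quote(context); }
                else { state = state.AnyChar(ch, context); }
            }
            state = state.EndOfLine(context);
        }
        return context.Lines.ToArray();
    }
}
```

A call such as StateSketch.Parse(new StringReader("1,\"say \"\"hi\"\"\",3")) exercises the Quote state: the doubled quote characters inside the quoted value should come out as single quote characters in the second value.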

Other files available for download include unit tests and an extended implementation of the parser that supports additional options: reading only a certain number of columns, and trimming trailing empty lines (useful for parsing CSV files saved in Excel with a long invisible column on the right).
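
The extended implementation is not shown here. As a rough illustration of what the two options mean, the sketch below applies them as a post-processing step on the string[][] result instead of inside the parser. The class and method names (CsvPostProcessing, TakeColumns, TrimTrailingEmptyLines) are hypothetical and are not taken from the download.

```csharp
using System;
using System.Linq;

// Hypothetical helpers; the downloadable parser may implement these options differently.
public static class CsvPostProcessing
{
    // Keeps at most maxColumns values from each row.
    public static string[][] TakeColumns(string[][] rows, int maxColumns)
    {
        return rows.Select(row => row.Take(maxColumns).ToArray()).ToArray();
    }

    // Drops rows at the end that have no values or only empty values.
    public static string[][] TrimTrailingEmptyLines(string[][] rows)
    {
        int count = rows.Length;
        while (count > 0 && rows[count - 1].All(string.IsNullOrEmpty))
        {
            count--;
        }
        return rows.Take(count).ToArray();
    }
}
```

Doing this after parsing keeps the state machine unchanged, at the cost of holding the unwanted columns and rows in memory until they are trimmed.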

Points of Interest

This demo shows the use of the State and Flyweight design patterns.


License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
