1. Background
The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. There is no “CSV standard”, so the format is operationally defined by the many applications which read and write it. The lack of a standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer.
From Python Standard Library - CSV File Reading and Writing [^]
There has been a spate of articles on the Internet that present methods that claim to be able to process CSV data. Unfortunately, many implementations are just plain wrong!
For example, one published implementation uses the String.Split [^] method to obtain the fields in a CSV data record. In data in which there are no embedded field delimiter characters within a field, this approach will work. For example:
field 1,field 2,field 3<cr><lf>
However, if a field contains an embedded field delimiter character, as in
field 1,"field, 2",field 3<cr><lf>
the String.Split method will return more fields than are actually in the record. And, if the CSV data in the file conforms to either RFC 4180 [^] or CSV Rendering [^], the field containing an embedded field delimiter character will be surrounded by quotation marks, resulting in even more confusion.
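The failure mode is easy to demonstrate. The sketch below (Python, for brevity) contrasts a naive split on the delimiter with a quote-aware parser; Python's standard csv module stands in here for any correct CSV reader.

```python
# Demonstrates why a naive split on the field delimiter mis-parses
# CSV records that contain an embedded delimiter inside a quoted field.
import csv
import io

record = 'field 1,"field, 2",field 3'

naive = record.split(",")                         # splits inside the quoted field
correct = next(csv.reader(io.StringIO(record)))   # honors the quotation marks

print(naive)    # ['field 1', '"field', ' 2"', 'field 3']  -- 4 fields
print(correct)  # ['field 1', 'field, 2', 'field 3']       -- 3 fields
```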
2. CSV Format
There are a number of recognized ways in which to format a CSV file. One is defined in the Internet Engineering Task Force (IETF) RFC 4180 [^]. Another is a Microsoft standard defined in CSV Rendering [^]. In this article, I will only explore IETF RFC 4180 and the Microsoft CSV Rendering. A summary of both appears in the following table.
Notes on the following table.
- The phrase "double quotes" has been replaced by "quotation marks." Likewise, "double quote" has been replaced by "quotation mark."
- The phrase "double-quotes" has been replaced by "quotation marks."
- The phrase "line break" has been replaced by the phrase "record delimiter string."
- The phrase "field delimiter string" has been replaced by "field delimiter character" because its primitive data type is character and not string.
- The phrase "text qualifier string" has been replaced by "text qualifier character" because its primitive data type is character and not string.
- Brackets surround a character sequence that should be treated as a single entity. For example, "[<cr><lf>]" should be read as a single entity that is composed of a carriage-return and a line-feed.
- Parentheses surround an example of a character. For example, the phrase "...and comma (,)..." contains an element providing an example of a "comma."
- The table cells contain the portions of the two documents, mentioned above. The R-rules are from the IETF RFC; the M-rules are from Microsoft CSV Rendering.
- Following the R- and M-rules, are the rules used in the parser (P-rules) and the emitter (E-rules). Within the P- and E-rules are found bolded phrases. These represent public properties whose values can be modified.
IETF RFC 4180 | Microsoft CSV Rendering |
R1 Each record is located on a separate line, delimited by a record delimiter string ([<cr><lf>]).
R2 The last record in the file may or may not have an ending record delimiter string.
| M1 The record delimiter string is the carriage return and line feed ([<cr><lf>]).
|
P1 Prior to parsing, the first record in the CSV data is scanned to determine the record delimiter used.
E1 The record delimiter string may be changed from its default value of [<cr><lf>] by specifying a new value for Record_Delimiter. The supplied value may be one of the following: [<cr><lf>], [<lf>], [<cr>], or [<lf><cr>]. In all cases the record delimiter string will be emitted at the end of each line, including the last record in the file.
|
R3 There may be an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file.
| M2 Only the first row of the file contains the column headers and each row has the same number of columns.
|
P2 Whether or not the first row is treated as a header depends on the value of Has_Header. If set true, the first row will be treated as a header; otherwise, the first row will be treated as a data row. The default value of Has_Header is true.
P3 If Strict_Rendering is set true, the number of fields in the first row will be treated as the required number of fields for all rows; otherwise, no validation of row-by-row field counts will occur. The default value of Strict_Rendering is true.
|
R4 Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.
| M3 The default field delimiter character is a comma (,). You can change the default field delimiter to any character that you want, including <tab>.
|
P4 The field delimiter character may be changed from its default value of comma by specifying a new value for Field_Delimiter. The supplied value may not duplicate the value of Text_Qualifier or any character in Record_Delimiter. The default value of Field_Delimiter is the comma (,).
E2 The field delimiter character will be emitted at the end of each field except for the last field. The last field will be followed by the record delimiter string.
|
R5 Each field may or may not be enclosed in quotation marks ("). If fields are not enclosed with quotation marks, then quotation marks may not appear inside the fields.
R6 Fields containing record delimiter strings, quotation marks, or field delimiter characters should be enclosed in quotation marks.
R7 If quotation marks are used to enclose fields, then a quotation mark appearing inside a field must be escaped by preceding it with another quotation mark.
| M4 The default value of the text qualifier character is a quotation mark ("). The CSV renderer does not add text qualifier characters around all text strings.
M5 The text qualifier character is to be added around fields that contain a field delimiter character, a text qualifier character, or a character found in the record delimiter string. If the field contains the text qualifier character, the text qualifier character within the field is repeated.
|
P5 The text qualifier character may be changed from its default value of quotation mark by specifying a new value for Text_Qualifier. The text qualifier character must be different from the field delimiter character and any character in the record delimiter string. The default value of Text_Qualifier is the quotation mark (").
E3 The text qualifier character will be emitted surrounding a field when the field contains a field delimiter character, a character found in the record delimiter string, or a text qualifier character. If the field contains one or more text qualifier characters, each one will be repeated.
|
It should be noted that the Microsoft CSV Rendering can operate in two modes: one optimized for Excel and the other optimized for third-party applications that require strict compliance with the CSV specification in RFC 4180. Most readers of this article are familiar with the Excel mode.
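Rule P1 above says the first record is scanned to determine the record delimiter in use. One possible way to implement that scan is sketched below in Python (illustrative only; the function name and the returned labels are assumptions, not the article's C# code):

```python
# Naive first-record scan for the record delimiter, per rule P1.
# Assumes the first record contains no line breaks embedded inside
# quoted fields; the first CR or LF encountered decides the ending.
def determine_line_ending(text):
    for i, ch in enumerate(text):
        if ch == "\r":
            return "CRLF" if text[i + 1:i + 2] == "\n" else "CR_ONLY"
        if ch == "\n":
            return "LFCR" if text[i + 1:i + 2] == "\r" else "LF_ONLY"
    return "NOT_SPECIFIED"
```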
3. Implementing the conversion
Recently, I was faced with a task that involved manipulating CSV files that were located on one or more websites or on a local computer. Directly manipulating the contents of these files in some form of application was not appealing. Basically, I did not want to recreate Microsoft Excel.
The DataTable [^] was a relatively simple and powerful data structure that met my needs. All that was required were methods that would convert CSV data into a DataTable and then convert the DataTable back into CSV data.
In searching the web, I found a number of methods that would perform the conversions. But, to my chagrin, all were either too simplistic to process most CSV files or so complex that the underlying algorithms were not readily apparent. I wanted methods that were simple, easily understood at first reading, and able to be implemented using the Microsoft .Net Framework Version 3.5 SP1.
3.1. CSV data to a DataTable
The method I chose to convert CSV data to a DataTable is a parser that requires, at most, one lookahead symbol. The parser is a lexical analyzer that reads a single character at a time and then takes actions appropriate to that character.
Two versions of the parser were required: one for a buffer filled with CSV data and the other for a local file containing CSV data.
At first reading, it may appear strange to have two entry points. But requiring that the whole of a local CSV data file be read at one time into the application space is wasteful. A local file can be accessed one character at a time. However, because the StreamReader Class [^], when constructed with a file path, can only open files on the local machine, files on the web must first be read into a CSV buffer and then passed to CSV_buffer_to_data_table.
When examined, the two versions of the parser are effectively the same with the exception of retrieving the next character to process.
3.1.1. CSV buffer to a DataTable
To convert a buffer filled with CSV data to a DataTable, the StringReader Class [^] is used.
determine_buffer_line_ending ( ... );
data_table = new DataTable ( );
using ( StringReader sr = new StringReader (
                              csv_buffer ) )
{
    ParseState state = new ParseState ( );
    while ( sr.Peek ( ) >= 0 )
    {
        state.ch = ( char ) sr.Read ( );
        if ( sr.Peek ( ) < 0 )
        {
            state.next_ch = ETX;
        }
        else
        {
            state.next_ch = ( char ) sr.Peek ( );
        }
        process_character ( ref state,
                            ref data_table,
                            ending );
        if ( state.advance )
        {
            state.advance = false;
            sr.Read ( );
        }
    }
}
3.1.2. CSV file to a DataTable
To convert a local file filled with CSV data to a DataTable, the StreamReader Class is used.
determine_file_line_ending ( ... );
data_table = new DataTable ( );
using ( StreamReader sr = new StreamReader (
                              path ) )
{
    ParseState state = new ParseState ( );
    while ( sr.Peek ( ) >= 0 )
    {
        state.ch = ( char ) sr.Read ( );
        if ( sr.Peek ( ) < 0 )
        {
            state.next_ch = ETX;
        }
        else
        {
            state.next_ch = ( char ) sr.Peek ( );
        }
        process_character ( ref state,
                            ref data_table,
                            ending );
        if ( state.advance )
        {
            state.advance = false;
            sr.Read ( );
        }
    }
}
3.1.3. The process_character method
Both of the preceding methods have a common character parsing method named process_character. Its signature is:
void process_character ( ref ParseState state,
ref DataTable data_table,
Ending ending )
ParseState is a local class that records the current state of parsing. data_table is the target of the conversion from CSV data. Before the actual parsing of CSV data can begin, the line ending of the CSV data must be determined and the result stored in ending.
The members of ParseState are
public bool advance;
public char ch;
public StringBuilder field;
public List < string > fields;
public bool first_ch;
public bool first_row;
public char next_ch;
public bool quoted_field;
For the declaration of Ending, see Enumerations and Constants, below.
Parsing CSV data is really quite straightforward. A character and its following character (the next character) are retrieved from the CSV data. This next character is known as the lookahead token.
Each character is compared with each of the terminal tokens (i.e., any of the Field_Delimiter, Text_Qualifier, or any character in Record_Delimiter). If a match is made, appropriate processing occurs; if no match is made, the character is appended to the current field. When all fields in a CSV data row have been captured, a new record in the data table is created.
set end of record to false
set advance to false
IF character is a CR OR character is a LF
IF character is in a quoted field
append character to current field
ELSE IF character is the CR
IF ending is CR_ONLY
set end of record to true
ELSE IF ending is a CRLF
IF next character is a LF
set advance to true
set end of record to true
ENDIF
ENDIF
ELSE IF character is the LF
IF ending is a LF_ONLY
set end of record to true
ELSE IF ending is a LFCR
IF next character is a CR
set advance to true
set end of record to true
ENDIF
ENDIF
ENDIF
IF end of record is true
set end of record to false
create a new record in data_table
ENDIF
ELSE IF character is the Text_Qualifier
IF first character
set quoted field to true
ELSE IF quoted field
IF next character is a Text_Qualifier
append character to current field
set advance to true
ELSE IF next character is a Field_Delimiter
set quoted field to false
ELSE IF next character is a Record_Delimiter
set quoted field to false
ENDIF
ELSE
append character to current field
ENDIF
set first character to false
ELSE IF character is the Field_Delimiter
IF quoted field
append character to current field
ELSE
IF current field length greater than 0
append current field to current fields
ELSE
append null to current fields
ENDIF
set current field length to 0
set first character to true
ENDIF
ELSE
append character to current field
set first character to false
ENDIF
This helper method was originally contained within both CSV_buffer_to_data_table and local_CSV_file_to_data_table. However, as modifications were made to one or the other of these methods, it became apparent that a single parsing method was needed.
What is important to note is that process_character processes a single character at a time. It may "look ahead" to the next character. If it is determined that a new character needs to be retrieved from the CSV source, either buffer or file, advance is set true.
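For readers who want to experiment with the algorithm, the pseudo-code above can be rendered as a short Python sketch. This is illustrative only, not the article's C# implementation: it folds the line-ending cases into a single CR/LF branch (handling CRLF, LF, and CR endings), implements "advance" as an extra index increment, and uses None for empty fields, as in the append-null rule.

```python
# One-lookahead, character-at-a-time CSV parser mirroring the
# article's pseudo-code (illustrative Python sketch).
def parse_csv(text, field_delimiter=",", text_qualifier='"'):
    rows, fields, field = [], [], []
    quoted, first_ch = False, True
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        next_ch = text[i + 1] if i + 1 < n else "\x03"   # ETX sentinel
        if ch in "\r\n":
            if quoted:
                field.append(ch)                 # record delimiter inside a quoted field
            else:
                if ch == "\r" and next_ch == "\n":
                    i += 1                       # "advance": consume the LF of CRLF
                fields.append("".join(field) or None)
                rows.append(fields)
                fields, field = [], []
                first_ch, quoted = True, False
        elif ch == text_qualifier:
            if first_ch:
                quoted = True                    # opening qualifier
            elif quoted:
                if next_ch == text_qualifier:
                    field.append(ch)             # doubled qualifier -> literal qualifier
                    i += 1                       # "advance": skip the second qualifier
                else:
                    quoted = False               # closing qualifier
            else:
                field.append(ch)                 # bare qualifier in an unquoted field
            first_ch = False
        elif ch == field_delimiter and not quoted:
            fields.append("".join(field) or None)
            field = []
            first_ch = True
        else:
            field.append(ch)
            first_ch = False
        i += 1
    if field or fields:                          # last record without trailing delimiter
        fields.append("".join(field) or None)
        rows.append(fields)
    return rows
```

For example, parse_csv('a,"b,c"\r\nd,"e""f"\r\n') yields two rows, with the embedded delimiter and the doubled qualifier both resolved.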
3.2. DataTable to CSV data
Converting a DataTable to CSV data is much easier. The pseudo-code for the process is:
IF Has_Header
FOREACH column in DataTable columns
emit a CSV field containing the ColumnName
ENDFOREACH
emit a Record_Delimiter
ENDIF
FOREACH DataRow in DataTable
FOREACH column in DataRow
emit a CSV field containing the column text
ENDFOREACH
emit a Record_Delimiter
ENDFOREACH
The only complications arise in the generation of the new CSV field. This process is found in the generate_csv_field method.
void generate_csv_field ( string field,
HashSet < char > terminals,
ref StringBuilder sb )
{
StringBuilder emitted_field = new StringBuilder ( );
if ( field != null )
{
bool quoted_field = false;
foreach ( char ch in field )
{
emitted_field.Append ( ch );
if ( terminals.Contains ( ch ) )
{
quoted_field = true;
}
if ( ch == Text_Qualifier )
{
emitted_field.Append ( Text_Qualifier );
}
}
if ( quoted_field )
{
emitted_field.Append ( Text_Qualifier );
emitted_field.Insert ( 0, Text_Qualifier );
}
sb.Append ( emitted_field );
}
sb.Append ( Field_Delimiter );
}
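The same quoting rule, together with the row loop from the pseudo-code in section 3.2, can be sketched in Python (illustrative only; rows_to_csv is a hypothetical helper, not part of the CSVToFromDataTable class). One deliberate difference: this sketch joins fields with the delimiter rather than appending a trailing delimiter after every field, which matches rule E2 directly.

```python
# Emit one CSV field, quoting it only when it contains a terminal
# character, and doubling any embedded text qualifiers (rules E3/M5).
def generate_csv_field(field, field_delimiter=",", text_qualifier='"',
                       record_delimiter="\r\n"):
    if field is None:
        return ""                                # null field emits as empty
    terminals = set(field_delimiter + text_qualifier + record_delimiter)
    quoted = any(ch in terminals for ch in field)
    body = field.replace(text_qualifier, text_qualifier * 2)
    return text_qualifier + body + text_qualifier if quoted else body

# Emit an optional header row followed by the data rows (rule E1:
# every record, including the last, ends with the record delimiter).
def rows_to_csv(header, rows, record_delimiter="\r\n"):
    lines = []
    for row in ([header] if header else []) + rows:
        lines.append(",".join(generate_csv_field(f) for f in row))
    return record_delimiter.join(lines) + record_delimiter
```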
4. Using the CSVToFromDataTable Methods
In its current implementation, CSVToFromDataTable resides in the namespace Utilities. For the purposes of the following discussions, the following using statement should be declared.
using CSV = Utilities.CSVToFromDataTable;
4.1. Enumerations and Constants
CSVToFromDataTable exposes the public enumeration Ending:
public enum Ending
{
NOT_SPECIFIED,
CRLF,
LF_ONLY,
CR_ONLY,
LFCR
}
Ending is required as a parameter to data_table_to_CSV_buffer to specify the Record_Delimiter to use. For the methods that convert CSV data to a DataTable, the Record_Delimiter is determined internally.
Each enumeration constant may be accessed using a reference like
CSV.Ending.<enumeration-constant-name>
CSVToFromDataTable also exposes the following public constants:
public const char CR = ( char ) 0x0D;
public const char ETX = ( char ) 0x03;
public const char LF = ( char ) 0x0A;
public const char NUL = ( char ) 0x00;
public const char STX = ( char ) 0x02;
Each constant may be accessed using a reference like
CSV.<name>
4.2. Properties
CSVToFromDataTable exposes the properties described in the following table.
Property | Data type | Default | Description |
Field_Delimiter | char | Comma (,) | The field delimiter character that separates one field from another in a record. It can be changed to any character desired. It is highly recommended that the chosen character be a printing character, not normally found in the CSV data. |
Has_Header | bool | true | Specifies the action to take when parsing or emitting CSV data. If true, the first row is considered to contain header fields; otherwise, the first row is considered to contain data. |
Record_Delimiter | string | <cr><lf> | Specifies the string that terminates each CSV data record. The values that may be supplied are limited to [<cr><lf>], [<lf>], [<cr>], and [<lf><cr>]. |
Strict_Rendering | bool | true | Specifies that all records in the CSV data must contain the same number of fields. If true, the number of fields is determined from the number of fields in the first record. |
Text_Qualifier | char | Quotation Mark (") | The character used to enclose a field that contains a field delimiter character, a text qualifier character, or a character found in the record delimiter string. Any text qualifier character within such a field will be repeated (doubled). |
4.3. Instantiating CSVToFromDataTable
There are two entry points that will instantiate CSVToFromDataTable:
public CSVToFromDataTable ( )
and
public CSVToFromDataTable ( char field_delimiter,
bool has_header,
char qualifier,
string record_delimiter,
bool strict_rendering )
The first causes CSVToFromDataTable to use default values for the properties; the second allows the invoker to specify all of the properties. If the first instantiation is used, there is nothing prohibiting the invoker from accessing the properties directly.
Using CSV as defined earlier, and local variables as declared below, CSVToFromDataTable can be instantiated as follows:
char field_delimiter = ',';
bool has_header = true;
char qualifier = '"';
string record_delimiter = String.Format ( "{0}{1}",
CSV.CR,
CSV.LF );
bool strict_rendering = true;
CSV csv = new CSV ( field_delimiter,
has_header,
qualifier,
record_delimiter,
strict_rendering );
As indicated above, there are two entry points that are used to convert CSV data to a DataTable. CSV_buffer_to_data_table converts a buffer containing CSV data into a DataTable; the other, local_CSV_file_to_data_table, reads a local file containing CSV data, and converts its contents into a DataTable.
4.4. CSV_buffer_to_data_table
This method is invoked when CSV data has been retrieved into a CSV_buffer. It is most applicable when converting CSV data that was retrieved from an FTP web site. The conversion is simply
CSV csv = new CSV ( );
string csv_buffer = String.Empty;
string error_message = String.Empty;
bool successful = true;
:
:
:
:
successful = csv.CSV_buffer_to_data_table ( csv_buffer,
ref data_table,
ref error_message );
if ( successful )
{
:
:
Filling csv_buffer with data from a web site may be accomplished using one of the WebsiteIO methods contained in the Utilities library included in the downloads.
4.5. local_CSV_file_to_data_table
This method is invoked when CSV data is located in a local computer file. The conversion is simply
CSV csv = new CSV ( );
string error_message = String.Empty;
bool successful = true;
successful = csv.local_CSV_file_to_data_table (
path,
ref data_table,
ref error_message );
if ( successful )
{
:
:
local_CSV_file_to_data_table reads the CSV data contained in the file with the fully qualified path of path. There is no need to read the data into a CSV buffer.
5. The Downloads
There are a number of downloads. With the exception of the TestCSV Executable, the downloads are Visual Studio 2008 projects or solutions.
Utilities Project | The Utilities project contains the CSVToFromDataTable source code as well as other possibly useful methods. All compilation units have the namespace Utilities. |
WebOpenFileDialog Project | The WebOpenFileDialog project contains the source code for the web-oriented open file dialog, described in WebOpenFileDialog [^]. It is included here to support the TestCSV project. |
TestCSV Project | The TestCSV project contains a demonstration of CSVToFromDataTable. It allows its user to test conversions of CSV data files residing on either the local computer or on the web. |
TestCSV Executable | The TestCSV Executable is a self-contained demonstration of CSVToFromDataTable. |
TestCSV Solution | The TestCSV Solution combines the TestCSV Project, Utilities Project, and the WebOpenFileDialog into a single Visual Studio 2008 solution. |
6. References
7. Development Environment
The CSVToFromDataTable class was developed in the following environment:
Microsoft Windows 7 Professional Service Pack 1 |
Microsoft Visual Studio 2008 Professional |
Microsoft .Net Framework Version 3.5 SP1 |
Microsoft Visual C# 2008 |
8. History
07/11/2017 | Original article |
07/17/2017 | Repaired HTML formatting issues |