1. Background
The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. There is no “CSV standard”, so the format is operationally defined by the many applications which read and write it. The lack of a standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer.
From Python Standard Library - CSV File Reading and Writing [^]
There has been a spate of articles on the Internet that present methods that claim to be able to process CSV data. Unfortunately, many implementations are just plain wrong!
For example, one published implementation uses the String.Split [^] method to obtain the fields in a CSV data record. In data in which there are no embedded field delimiter characters within a field, this approach will work. For example:
field 1,field 2,field 3<cr><lf>
However, if a field contains an embedded field delimiter character, as in
field 1,"field, 2",field 3<cr><lf>
the String.Split method will return more fields than are actually in the record. And, if the CSV data in the file conforms to either RFC 4180 [^] or CSV Rendering [^], the field containing an embedded field delimiter character will be surrounded by quotation marks, resulting in even more confusion.
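The failure mode is easy to demonstrate. The sketch below (Python, for brevity) contrasts a naive split on the delimiter with a quote-aware parser; Python's standard csv module stands in here for any correct CSV reader.

```python
# Demonstrates why a naive split on the field delimiter mis-parses
# CSV records that contain an embedded delimiter inside a quoted field.
import csv
import io

record = 'field 1,"field, 2",field 3'

naive = record.split(",")                         # splits inside the quoted field
correct = next(csv.reader(io.StringIO(record)))   # honors the quotation marks

print(naive)    # ['field 1', '"field', ' 2"', 'field 3']  -- 4 fields
print(correct)  # ['field 1', 'field, 2', 'field 3']       -- 3 fields
```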
2. CSV Format
There are a number of recognized ways in which to format a CSV file. One is defined in the Internet Engineering Task Force (IETF) RFC 4180 [^]. Another is a Microsoft standard defined in CSV Rendering [^]. In this article, I will only explore IETF RFC 4180 and the Microsoft CSV Rendering. A summary of both appears in the following table.
Notes on the following table.
- The phrase "double quotes" has been replaced by "quotation marks." Likewise, "double quote" has been replaced by "quotation mark."
- The phrase "double-quotes" has been replaced by "quotation marks."
- The phrase "line break" has been replaced by the phrase "record delimiter string."
- The phrase "field delimiter string" has been replaced by "field delimiter character" because its primitive data type is character and not string.
- The phrase "text qualifier string" has been replaced by "text qualifier character" because its primitive data type is character and not string.
- Brackets surround a character sequence that should be treated as a single entity. For example, "[<cr><lf>]" should be read as a single entity that is composed of a carriage-return and a line-feed.
- Parentheses surround an example of a character. For example, the phrase "...and comma (,)..." contains an element providing an example of a "comma."
- The table cells contain the portions of the two documents, mentioned above. The R-rules are from the IETF RFC; the M-rules are from Microsoft CSV Rendering.
- Following the R- and M-rules, are the rules used in the parser (P-rules) and the emitter (E-rules). Within the P- and E-rules are found bolded phrases. These represent public properties whose values can be modified.
IETF RFC 4180 | Microsoft CSV Rendering |
R1 Each record is located on a separate line, delimited by a record delimiter string ([<cr><lf>]).
R2 The last record in the file may or may not have an ending record delimiter string.
| M1 The record delimiter string is the carriage return and line feed ([<cr><lf>]).
|
P1 Prior to parsing, the first record in the CSV data is scanned to determine the record delimiter used.
E1 The record delimiter string may be changed from its default value of [<cr><lf>] by specifying a new value for Record_Delimiter. The supplied value may be one of the following: [<cr><lf>], [<lf>], [<cr>], or [<lf><cr>]. In all cases the record delimiter string will be emitted at the end of each line, including the last record in the file.
|
R3 There may be an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file.
| M2 Only the first row of the file contains the column headers and each row has the same number of columns.
|
P2 Whether or not the first row is treated as a header depends on the value of Has_Header. If set true, the first row will be treated as a header; otherwise, the first row will be treated as a data row. The default value of Has_Header is true.
P3 If Strict_Rendering is set true, the number of fields in the first row will be treated as the required number of fields for all rows; otherwise, no validation of row-by-row field counts will occur. The default value of Strict_Rendering is true.
|
R4 Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.
| M3 The default field delimiter character is a comma (,). You can change the default field delimiter to any character that you want, including <tab>.
|
P4 The field delimiter character may be changed from its default value of comma by specifying a new value for Field_Delimiter. The supplied value may not duplicate the value of Text_Qualifier or any character in Record_Delimiter. The default value of Field_Delimiter is the comma (,).
E2 The field delimiter character will be emitted at the end of each field except for the last field. The last field will be followed by the record delimiter string.
|
R5 Each field may or may not be enclosed in quotation marks ("). If fields are not enclosed with quotation marks, then quotation marks may not appear inside the fields.
R6 Fields containing record delimiter strings, quotation marks, or field delimiter characters should be enclosed in quotation marks.
R7 If quotation marks are used to enclose fields, then a quotation mark appearing inside a field must be escaped by preceding it with another quotation mark.
| M4 The default value of the text qualifier character is a quotation mark ("). The CSV renderer does not add text qualifier characters around all text strings.
M5 The text qualifier character is to be added around fields that contain a field delimiter character, a text qualifier character, or a character found in the record delimiter string. If the field contains the text qualifier character, the text qualifier character within the field is repeated.
|
P5 The text qualifier character may be changed from its default value of quotation mark by specifying a new value for Text_Qualifier. The text qualifier character must be different from the field delimiter character and any character in the record delimiter string. The default value of Text_Qualifier is the quotation mark (").
E3 The text qualifier character will be emitted surrounding a field when the field contains a field delimiter character, a character found in the record delimiter string, or a text qualifier character. If the field contains one or more text qualifier characters, each one will be repeated.
|
It should be noted that the Microsoft CSV Rendering can operate in two modes: one optimized for Excel and the other optimized for third-party applications that require strict compliance with the CSV specification in RFC 4180. Most readers of this article are familiar with the Excel mode.
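Rule P1 above says the first record is scanned to determine the record delimiter in use. One possible way to implement that scan is sketched below in Python (illustrative only; the function name and the returned labels are assumptions, not the article's C# code):

```python
# Naive first-record scan for the record delimiter, per rule P1.
# Assumes the first record contains no line breaks embedded inside
# quoted fields; the first CR or LF encountered decides the ending.
def determine_line_ending(text):
    for i, ch in enumerate(text):
        if ch == "\r":
            return "CRLF" if text[i + 1:i + 2] == "\n" else "CR_ONLY"
        if ch == "\n":
            return "LFCR" if text[i + 1:i + 2] == "\r" else "LF_ONLY"
    return "NOT_SPECIFIED"
```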
3. Implementing the conversion
Recently, I was faced with a task that involved manipulating CSV files that were located on one or more websites or on a local computer. Directly manipulating the contents of these files in some form of application was not appealing. Basically, I did not want to recreate Microsoft Excel.
The DataTable [^] was a relatively simple and powerful data structure that met my needs. All that was required were methods that would convert CSV data into a DataTable and then convert the DataTable back into CSV data.
In searching the web, I found a number of methods that would perform the conversions. But, to my chagrin, all were either too simplistic to process most CSV files or so complex that the underlying algorithms were not readily apparent. I wanted methods that were simple, easily understood at first reading, and able to be implemented using the Microsoft .Net Framework Version 3.5 SP1.
3.1. CSV data to a DataTable
The method I chose to convert CSV data to a DataTable is a parser that requires, at most, one lookahead symbol. The parser is a lexical analyzer that reads a single character at a time and then takes actions appropriate to that character.
Two versions of the parser were required: one for a buffer filled with CSV data and the other for a local file containing CSV data.
At first reading, it may appear strange to have two entry points. But requiring that the whole of a local CSV data file be read at one time into the application space is wasteful. A local file can be accessed one character at a time. However, because the StreamReader Class [^], when constructed with a file path, can only open files on the local machine, files on the web must first be read into a CSV buffer and then passed to CSV_buffer_to_data_table.
When examined, the two versions of the parser are effectively the same with the exception of retrieving the next character to process.
3.1.1. CSV buffer to a DataTable
To convert a buffer filled with CSV data to a DataTable, the StringReader Class [^] is used.
determine_buffer_line_ending ( ... );
data_table = new DataTable ( );
using ( StringReader sr = new StringReader (
                              csv_buffer ) )
{
    ParseState state = new ParseState ( );
    while ( sr.Peek ( ) >= 0 )
    {
        state.ch = ( char ) sr.Read ( );
        if ( sr.Peek ( ) < 0 )
        {
            state.next_ch = ETX;
        }
        else
        {
            state.next_ch = ( char ) sr.Peek ( );
        }
        process_character ( ref state,
                            ref data_table,
                            ending );
        if ( state.advance )
        {
            state.advance = false;
            sr.Read ( );
        }
    }
}
3.1.2. CSV file to a DataTable
To convert a local file filled with CSV data to a DataTable, the StreamReader Class is used.
determine_file_line_ending ( ... );
data_table = new DataTable ( );
using ( StreamReader sr = new StreamReader (
                              path ) )
{
    ParseState state = new ParseState ( );
    while ( sr.Peek ( ) >= 0 )
    {
        state.ch = ( char ) sr.Read ( );
        if ( sr.Peek ( ) < 0 )
        {
            state.next_ch = ETX;
        }
        else
        {
            state.next_ch = ( char ) sr.Peek ( );
        }
        process_character ( ref state,
                            ref data_table,
                            ending );
        if ( state.advance )
        {
            state.advance = false;
            sr.Read ( );
        }
    }
}
3.1.3. The process_character method
Both of the preceding methods have a common character parsing method named process_character. Its signature is:
void process_character ( ref ParseState state,
ref DataTable data_table,
Ending ending )
ParseState is a local class that records the current state of parsing. data_table is the target of the conversion from CSV data. Before the actual parsing of CSV data can begin, the line ending of the CSV data must be determined and the result stored in ending.
The members of ParseState are
public bool advance;
public char ch;
public StringBuilder field;
public List < string > fields;
public bool first_ch;
public bool first_row;
public char next_ch;
public bool quoted_field;
For the declaration of Ending, see Enumerations and Constants, below.
Parsing CSV data is really quite straightforward. A character and its following character (the next character) are retrieved from the CSV data. This next character is known as the lookahead token.
Each character is compared with each of the terminal tokens (i.e., any of the Field_Delimiter, Text_Qualifier, or any character in Record_Delimiter). If a match is made, appropriate processing occurs; if no match is made, the character is appended to the current field. When all fields in a CSV data row have been captured, a new record in the data table is created.
set end of record to false
set advance to false
IF character is a CR OR character is a LF
IF character is in a quoted field
append character to current field
ELSE IF character is the CR
IF ending is CR_ONLY
set end of record to true
ELSE IF ending is a CRLF
IF next character is a LF
set advance to true
set end of record to true
ENDIF
ENDIF
ELSE IF character is the LF
IF ending is a LF_ONLY
set end of record to true
ELSE IF ending is a LFCR
IF next character is a CR
set advance to true
set end of record to true
ENDIF
ENDIF
ENDIF
IF end of record is true
set end of record to false
create a new record in data_table
ENDIF
ELSE IF character is the Text_Qualifier
IF first character
set quoted field to true
ELSE IF quoted field
IF next character is a Text_Qualifier
append character to current field
set advance to true
ELSE IF next character is a Field_Delimiter
set quoted field to false
ELSE IF next character is a Record_Delimiter
set quoted field to false
ENDIF
ELSE
append character to current field
ENDIF
set first character to false
ELSE IF character is the Field_Delimiter
IF quoted field
append character to current field
ELSE
IF current field length greater than 0
append current field to current fields
ELSE
append null to current fields
ENDIF
set current field length to 0
set first character to true
ENDIF
ELSE
append character to current field
set first character to false
ENDIF
This helper method was originally contained within both CSV_buffer_to_data_table and local_CSV_file_to_data_table. However, as modifications were made to one or the other of these methods, it became apparent that a single parsing method was needed.
What is important to note is that process_character processes a single character at a time. It may "look ahead" to the next character. If it is determined that a new character needs to be retrieved from the CSV source, either buffer or file, advance is set true.
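For readers who want to experiment with the algorithm, the pseudo-code above can be rendered as a short Python sketch. This is illustrative only, not the article's C# implementation: it folds the line-ending cases into a single CR/LF branch (handling CRLF, LF, and CR endings), implements "advance" as an extra index increment, and uses None for empty fields, as in the append-null rule.

```python
# One-lookahead, character-at-a-time CSV parser mirroring the
# article's pseudo-code (illustrative Python sketch).
def parse_csv(text, field_delimiter=",", text_qualifier='"'):
    rows, fields, field = [], [], []
    quoted, first_ch = False, True
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        next_ch = text[i + 1] if i + 1 < n else "\x03"   # ETX sentinel
        if ch in "\r\n":
            if quoted:
                field.append(ch)                 # record delimiter inside a quoted field
            else:
                if ch == "\r" and next_ch == "\n":
                    i += 1                       # "advance": consume the LF of CRLF
                fields.append("".join(field) or None)
                rows.append(fields)
                fields, field = [], []
                first_ch, quoted = True, False
        elif ch == text_qualifier:
            if first_ch:
                quoted = True                    # opening qualifier
            elif quoted:
                if next_ch == text_qualifier:
                    field.append(ch)             # doubled qualifier -> literal qualifier
                    i += 1                       # "advance": skip the second qualifier
                else:
                    quoted = False               # closing qualifier
            else:
                field.append(ch)                 # bare qualifier in an unquoted field
            first_ch = False
        elif ch == field_delimiter and not quoted:
            fields.append("".join(field) or None)
            field = []
            first_ch = True
        else:
            field.append(ch)
            first_ch = False
        i += 1
    if field or fields:                          # last record without trailing delimiter
        fields.append("".join(field) or None)
        rows.append(fields)
    return rows
```

For example, parse_csv('a,"b,c"\r\nd,"e""f"\r\n') yields two rows, with the embedded delimiter and the doubled qualifier both resolved.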
3.2. DataTable to CSV data
Converting a DataTable to CSV data is much easier. The pseudo-code for the process is:
IF Has_Header
FOREACH column in DataTable columns
emit a CSV field containing the ColumnName
ENDFOREACH
emit a Record_Delimiter
ENDIF
FOREACH DataRow in DataTable
FOREACH column in DataRow
emit a CSV field containing the column text
ENDFOREACH
emit a Record_Delimiter
ENDFOREACH
The only complications arise in the generation of the new CSV field. This process is found in the generate_csv_field method.
void generate_csv_field ( string field,
HashSet < char > terminals,
ref StringBuilder sb )
{
StringBuilder emitted_field = new StringBuilder ( );
if ( field != null )
{
bool quoted_field = false;
foreach ( char ch in field )
{
emitted_field.Append ( ch );
if ( terminals.Contains ( ch ) )
{
quoted_field = true;
}
if ( ch == Text_Qualifier )
{
emitted_field.Append ( Text_Qualifier );
}
}
if ( quoted_field )
{
emitted_field.Append ( Text_Qualifier );
emitted_field.Insert ( 0, Text_Qualifier );
}
sb.Append ( emitted_field );
}
sb.Append ( Field_Delimiter );
}
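The same quoting rule, together with the row loop from the pseudo-code in section 3.2, can be sketched in Python (illustrative only; rows_to_csv is a hypothetical helper, not part of the CSVToFromDataTable class). One deliberate difference: this sketch joins fields with the delimiter rather than appending a trailing delimiter after every field, which matches rule E2 directly.

```python
# Emit one CSV field, quoting it only when it contains a terminal
# character, and doubling any embedded text qualifiers (rules E3/M5).
def generate_csv_field(field, field_delimiter=",", text_qualifier='"',
                       record_delimiter="\r\n"):
    if field is None:
        return ""                                # null field emits as empty
    terminals = set(field_delimiter + text_qualifier + record_delimiter)
    quoted = any(ch in terminals for ch in field)
    body = field.replace(text_qualifier, text_qualifier * 2)
    return text_qualifier + body + text_qualifier if quoted else body

# Emit an optional header row followed by the data rows (rule E1:
# every record, including the last, ends with the record delimiter).
def rows_to_csv(header, rows, record_delimiter="\r\n"):
    lines = []
    for row in ([header] if header else []) + rows:
        lines.append(",".join(generate_csv_field(f) for f in row))
    return record_delimiter.join(lines) + record_delimiter
```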
4. Using the CSVToFromDataTable Methods
In its current implementation, CSVToFromDataTable resides in the namespace Utilities. For the purposes of the following discussions, the following using statement should be declared.
using CSV = Utilities.CSVToFromDataTable;
4.1. Enumerations and Constants
CSVToFromDataTable exposes the public enumeration Ending:
public enum Ending
{
NOT_SPECIFIED,
CRLF,
LF_ONLY,
CR_ONLY,
LFCR
}
Ending is required as a parameter to data_table_to_CSV_buffer to specify the Record_Delimiter to use. For the methods that convert CSV data to a DataTable, the Record_Delimiter is determined internally.
Each enumeration constant may be accessed using a reference like
CSV.Ending.<enumeration-constant-name>
CSVToFromDataTable also exposes the following public constants:
public const char CR = ( char ) 0x0D;
public const char ETX = ( char ) 0x03;
public const char LF = ( char ) 0x0A;
public const char NUL = ( char ) 0x00;
public const char STX = ( char ) 0x02;
Each constant may be accessed using a reference like
CSV.<name>
4.2. Properties
CSVToFromDataTable exposes the properties described in the following table.
Property | Data type | Default | Description |
Field_Delimiter | char | Comma (,) | The field delimiter character that separates one field from another in a record. It can be changed to any character desired. It is highly recommended that the chosen character be a printing character, not normally found in the CSV data. |
Has_Header | bool | true | Specifies the action to take when parsing or emitting CSV data. If true, the first row is considered to contain header fields; otherwise, the first row is considered to contain data. |
Record_Delimiter | string | <cr><lf> | Specifies the string that terminates each CSV data record. The values that may be supplied are limited to [<cr><lf>], [<lf>], [<cr>], and [<lf><cr>]. |
Strict_Rendering | bool | true | Specifies that all records in the CSV data must contain the same number of fields. If true, the number of fields is determined from the number of fields in the first record. |
Text_Qualifier | char | Quotation Mark (") | The character used to enclose a field that contains a field delimiter character, a text qualifier character, or a character found in the record delimiter string. Any text qualifier character within such a field will be repeated (doubled). |
4.3. Instantiating CSVToFromDataTable
There are two entry points that will instantiate CSVToFromDataTable:
public CSVToFromDataTable ( )
and
public CSVToFromDataTable ( char field_delimiter,
bool has_header,
char qualifier,
string record_delimiter,
bool strict_rendering )
The first causes CSVToFromDataTable to use default values for the properties; the second allows the invoker to specify all of the properties. If the first instantiation is used, there is nothing prohibiting the invoker from accessing the properties directly.
Using CSV as defined earlier, and local variables as declared below, CSVToFromDataTable can be instantiated as follows:
char field_delimiter = ',';
bool has_header = true;
char qualifier = '"';
string record_delimiter = String.Format ( "{0}{1}",
CSV.CR,
CSV.LF );
bool strict_rendering = true;
CSV csv = new CSV ( field_delimiter,
has_header,
qualifier,
record_delimiter,
strict_rendering );
As indicated above, there are two entry points that are used to convert CSV data to a DataTable. CSV_buffer_to_data_table converts a buffer containing CSV data into a DataTable; the other, local_CSV_file_to_data_table, reads a local file containing CSV data, and converts its contents into a DataTable.
4.4. CSV_buffer_to_data_table
This method is invoked when CSV data has been retrieved into a CSV_buffer. It is most applicable when converting CSV data that was retrieved from an FTP web site. The conversion is simply
CSV csv = new CSV ( );
string csv_buffer = String.Empty;
string error_message = String.Empty;
bool successful = true;
:
:
:
:
successful = csv.CSV_buffer_to_data_table ( csv_buffer,
ref data_table,
ref error_message );
if ( successful )
{
:
:
Filling csv_buffer with data from a web site may be accomplished using one of the WebsiteIO methods contained in the Utilities library included in the downloads.
4.5. local_CSV_file_to_data_table
This method is invoked when CSV data is located in a local computer file. The conversion is simply
CSV csv = new CSV ( );
string error_message = String.Empty;
bool successful = true;
successful = csv.local_CSV_file_to_data_table (
path,
ref data_table,
ref error_message );
if ( successful )
{
:
:
local_CSV_file_to_data_table reads the CSV data contained in the file with the fully qualified path of path. There is no need to read the data into a CSV buffer.
5. The Downloads
There are a number of downloads. With the exception of the TestCSV Executable, the downloads are Visual Studio 2008 projects or solutions.
Utilities Project | The Utilities project contains the CSVToFromDataTable source code as well as other possibly useful methods. All compilation units have the namespace Utilities. |
WebOpenFileDialog Project | The WebOpenFileDialog project contains the source code for the web-oriented open file dialog, described in WebOpenFileDialog [^]. It is included here to support the TestCSV project. |
TestCSV Project | The TestCSV project contains a demonstration of CSVToFromDataTable. It allows its user to test conversions of CSV data files residing on either the local computer or on the web. |
TestCSV Executable | The TestCSV Executable is a self-contained demonstration of CSVToFromDataTable. |
TestCSV Solution | The TestCSV Solution combines the TestCSV Project, Utilities Project, and the WebOpenFileDialog into a single Visual Studio 2008 solution. |
6. References
7. Development Environment
The CSVToFromDataTable class was developed in the following environment:
Microsoft Windows 7 Professional Service Pack 1 |
Microsoft Visual Studio 2008 Professional |
Microsoft .Net Framework Version 3.5 SP1 |
Microsoft Visual C# 2008 |
8. History
07/11/2017 | Original article |
07/17/2017 | Repaired HTML formatting issues |