In this article, I look at the CSV standard and what a CSV file can contain, performance benchmarks, some examples of how to use fastCSV, some helper functions for performance, usage scenarios, a quick peek inside the code, and some sample use cases.
Introduction
Most libraries I looked at didn't really meet my requirements, so fastCSV was born.
Also, CSV allows you to load tabular (2-dimensional) data into memory very quickly, as opposed to other serializers like fastJSON.
Features
- Fully CSV standard compliant
  - Multi-line
  - Quoted columns
  - Keeps spaces between delimiters
- Really fast reading and writing of CSV files (see Performance Benchmarks)
- Tiny 8kb DLL compiled to net40 or netstandard20
- Ability to get a typed list of objects from a CSV file
- Ability to filter a CSV file while loading
- Ability to specify a custom delimiter
CSV Standard
You can read the CSV RFC here: https://tools.ietf.org/html/rfc4180. In summary, a CSV file:
- can be multi-line, if a value in a column contains new lines
- must quote a column if it contains a new line, delimiter, or quote character
- treats spaces between the delimiters as part of the column
Below is an example of a complex, standard-compliant CSV file from the Wikipedia page https://en.wikipedia.org/wiki/Comma-separated_values :
Year,Make,Model,Description,Price
1997,Ford,"E350
F150","ac, abs,
moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air,
"",moon,""
roof, loaded",4799.00
1999,BMW,Z3,"used",14900.00
1999, Toyota,Corolla,,7000.00
As you can see, some rows span multiple lines and contain quotes and commas, which gives the table below (line breaks inside a value are shown as \n):
Year | Make   | Model                                  | Description                              | Price
1997 | Ford   | E350\nF150                             | ac, abs,\nmoon                           | 3000.00
1999 | Chevy  | Venture "Extended Edition"             |                                          | 4900.00
1999 | Chevy  | Venture "Extended Edition, Very Large" |                                          | 5000.00
1996 | Jeep   | Grand Cherokee                         | MUST SELL!\nair,\n",moon,"\nroof, loaded | 4799.00
1999 | BMW    | Z3                                     | used                                     | 14900.00
1999 |  Toyota | Corolla                               |                                          | 7000.00
As you can see, some values span multiple lines, and the " Toyota" value in the last row starts with a space.
Performance Benchmarks
Loading the https://www.ncdc.noaa.gov/orders/qclcd/QCLCD201503.zip file (585Mb, 4,496,263 rows) on my machine, as a relative comparison to other libraries:
- fastCSV : 11.20s 639Mb used
- NReco.CSV : 6.76s 800Mb used
- fastCSV string.Split() : 11.50s 638Mb used
- TinyCSVparser : 34s 992Mb used
As a baseline of what is possible on the same dataset:
1. File.ReadAllBytes() : 1.5s 573Mb used
2. File.ReadAllLines() with no processing : 3.7s 1633Mb used
3. File.ReadLines() with no processing : 1.9s
4. File.ReadLines() + string.Split() with no return list : 7.5s

The deltas are simple subtractions of the timings above:
- The difference from 1 to 2 is the overhead of converting the bytes to Unicode strings: 2.2s
- The difference from 2 to 3 is the memory overhead of creating the string[] arrays: 1.8s
- The difference from 4 to fastCSV is the overhead of creating T objects and adding them to a list: 4s
Roads not taken
Using the code
Below are some examples of how to use fastCSV:
public class car
{
    public string Year;
    public string Make;
    public string Model;
    public string Description;
    public string Price;
}
var listcars = fastCSV.ReadFile<car>(
    "csvstandard.csv",  // filename
    true,               // the file has a header line
    ',',                // delimiter
    (o, c) =>           // o = new car instance, c = columns of the current line
    {
        o.Year = c[0];
        o.Make = c[1];
        o.Model = c[2];
        o.Description = c[3];
        o.Price = c[4];
        return true;    // keep this row in the returned list
    });
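Writing a List<T> out to a CSV file with your own headers and delimiter looks like this: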
fastCSV.WriteFile<LocalWeatherData>(
    "filename2.csv",                                  // output filename
    new string[] { "WBAN", "Date", "SkyCondition" },  // header columns
    '|',                                              // delimiter
    list,                                             // List<LocalWeatherData> to write
    (o, c) =>                                         // fill c with the column values for o
    {
        c.Add(o.WBAN);
        c.Add(o.Date.ToString("yyyyMMdd"));
        c.Add(o.SkyCondition);
    });
Helper functions for performance
fastCSV has the following helper functions:
- int ToInt(string s) : creates an int from a string
- int ToInt(string s, int index, int count) : creates an int from a substring
- DateTime ToDateTimeISO(string value, bool UseUTCDateTime) : creates a DateTime from an ISO standard string, i.e. yyyy-MM-ddTHH:mm:ss (with an optional .nnnZ part)
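For example, reading the NOAA weather data and using the substring ToInt() overload to parse the yyyyMMdd date column: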
public class LocalWeatherData
{
    public string WBAN;
    public DateTime Date;
    public string SkyCondition;
}

var list = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
{
    bool add = true;
    o.WBAN = c[0];
    o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4),   // year from "yyyyMMdd"
                          fastCSV.ToInt(c[1], 4, 2),   // month
                          fastCSV.ToInt(c[1], 6, 2));  // day
    o.SkyCondition = c[4];
    return add;
});
Usage Scenarios
- Filtering CSV while loading : in your map function, you can write conditions on the loaded line data and filter out lines you don't want by using return false; (see the sketch after this list)
- Reading CSV to import into other systems : in your map function, you can send the line data to another system and return false;, or process the entire file and use the List<T> returned
- Processing/aggregating data while loading : you can have a List<T> which has no bearing on the columns of the CSV file and sum/min/max/avg/etc. the lines read
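For instance, here is a minimal sketch of the aggregating scenario, using the car class from above: the map function keeps a running total of the Price column and always returns false, so the returned list stays empty and only the aggregate survives.

decimal total = 0;
var none = fastCSV.ReadFile<car>("csvstandard.csv", true, ',', (o, c) =>
{
    // c[4] is the Price column in the sample file above
    total += decimal.Parse(c[4], System.Globalization.CultureInfo.InvariantCulture);
    return false;   // discard every row
});
Console.WriteLine(total);   // 39599.00 for the sample file above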
Inside the code
Essentially, reading is a loop that parses a line, creates a generic element for the list, hands the created object and the columns extracted from the line to the user-defined map function, and adds the object to the list to be returned (if the map function says so):
var c = ParseLine(line, delimiter, cols);   // extract the columns from the line
T o = new T();                              // create the generic object
var b = mapper(o, c);                       // let the user fill (and filter) it
if (b)
    list.Add(o);
Now, the CSV-standard complexity comes from handling multi-line values correctly. This is done by counting the quotes in a line: an odd count means the value continues on the next line, so lines are read and appended until the quote count is even again. This is done in the ReadFile() function.
The beauty of this approach is that it is simple, does no reflection, and is really fast, with the control being in the user's hands.
All the reading code is below:
public static List<T> ReadFile<T>(string filename, bool hasheader, char delimiter, ToOBJ<T> mapper) where T : new()
{
    string[] cols = null;
    List<T> list = new List<T>();
    int linenum = -1;
    StringBuilder sb = new StringBuilder();
    bool insb = false;

    foreach (var line in File.ReadLines(filename))
    {
        try
        {
            linenum++;
            if (linenum == 0)
            {
                if (hasheader)
                {
                    // use the header line to size the column array
                    int cc = CountOccurence(line, delimiter);
                    if (cc == 0)
                        throw new Exception("File does not have '" + delimiter + "' as a delimiter");
                    cols = new string[cc + 1];
                    continue;
                }
                else
                    cols = new string[_COLCOUNT];
            }

            var qc = CountOccurence(line, '\"');
            bool multiline = qc % 2 == 1 || insb;

            string cline = line;
            if (multiline)
            {
                // an odd quote count means the line continues -> accumulate until even
                insb = true;
                sb.Append(line);
                var s = sb.ToString();
                qc = CountOccurence(s, '\"');
                if (qc % 2 == 1)
                {
                    sb.AppendLine();
                    continue;
                }
                cline = s;
                sb.Clear();
                insb = false;
            }

            var c = ParseLine(cline, delimiter, cols);

            T o = new T();
            var b = mapper(o, c);
            if (b)
                list.Add(o);
        }
        catch (Exception ex)
        {
            throw new Exception("error on line " + linenum, ex);
        }
    }

    return list;
}
private unsafe static int CountOccurence(string text, char c)
{
    int count = 0;
    int len = text.Length;
    int index = -1;
    fixed (char* s = text)
    {
        // scan with a raw pointer to avoid per-access bounds checks
        while (++index < len)
        {
            if (*(s + index) == c)
                count++;
        }
    }
    return count;
}
private unsafe static string[] ParseLine(string line, char delimiter, string[] columns)
{
    int col = 0;
    int linelen = line.Length;
    int index = 0;
    fixed (char* l = line)
    {
        while (index < linelen)
        {
            if (*(l + index) != '\"')
            {
                // unquoted column : take everything up to the next delimiter
                var next = line.IndexOf(delimiter, index);
                if (next < 0)
                {
                    columns[col++] = new string(l, index, linelen - index);
                    break;
                }
                columns[col++] = new string(l, index, next - index);
                index = next + 1;
            }
            else
            {
                // quoted column : scan for a delimiter preceded by an even number of quotes
                int qc = 1;
                int start = index;
                char c = *(l + ++index);
                while (index++ < linelen)
                {
                    if (c == '\"')
                        qc++;
                    if (c == delimiter && qc % 2 == 0)
                        break;
                    c = *(l + index);
                }
                // strip the surrounding quotes and un-escape the doubled quotes
                columns[col++] = new string(l, start + 1, index - start - 3).Replace("\"\"", "\"");
            }
        }
    }
    return columns;
}
ParseLine() is responsible for extracting the columns from a line in an optimized unsafe way.
And the writing code is just:
public static void WriteFile<T>(string filename, string[] headers, char delimiter, List<T> list, FromObj<T> mapper)
{
    using (FileStream f = new FileStream(filename, FileMode.Create, FileAccess.Write))
    {
        using (StreamWriter s = new StreamWriter(f))
        {
            if (headers != null)
                s.WriteLine(string.Join(delimiter.ToString(), headers));

            foreach (var o in list)
            {
                List<object> cols = new List<object>();
                mapper(o, cols);

                for (int i = 0; i < cols.Count; i++)
                {
                    var str = cols[i].ToString();

                    // quote the column if it contains a quote, a new line or the delimiter
                    bool quote = false;
                    if (str.IndexOf('\"') >= 0)
                    {
                        quote = true;
                        str = str.Replace("\"", "\"\"");  // escape quotes by doubling them
                    }
                    if (quote == false && str.IndexOf('\n') >= 0)
                        quote = true;
                    if (quote == false && str.IndexOf('\r') >= 0)
                        quote = true;
                    if (quote == false && str.IndexOf(delimiter) >= 0)
                        quote = true;

                    if (quote)
                        s.Write("\"");
                    s.Write(str);
                    if (quote)
                        s.Write("\"");

                    if (i < cols.Count - 1)
                        s.Write(delimiter);
                }
                s.WriteLine();
            }
            s.Flush();
        }
        f.Close();
    }
}
Sample Use Cases
Splitting data sets for testing and training
In data science, you generally split your data into training and testing sets. In the example below, every 3rd row goes to the testing set (you could make the splitting more elaborate):
var testing = new List<LocalWeatherData>();
int line = 0;
var training = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
{
    bool add = true;
    line++;
    o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4),
                          fastCSV.ToInt(c[1], 4, 2),
                          fastCSV.ToInt(c[1], 6, 2));
    o.SkyCondition = c[4];
    if (line % 3 == 0)
    {
        add = false;       // every 3rd row goes to the testing set instead
        testing.Add(o);
    }
    return add;
});
Appendix v2.0
With apologies to NReco.CSV: I messed up the timings for that library, which was pointed out to me by KPixel on GitHub. This prompted me to go back to the drawing board and redo the internals of fastCSV for more speed.
The new code is even faster than a naïve roll-your-own File.ReadLines() with string.Split(), which can't handle multi-line values anyway.
Performance
The new performance numbers are below; in comparison to the v1 code, it is nearly 2x faster, at the expense of a little more memory usage, on the same 4,496,263-row dataset.
- fastCSV .net4 : 6.27s 753Mb used
- fastCSV core : 6.51s 669Mb used
Interestingly, on .net core the library uses less memory.
Changes Made
- First up was creating a buffered StreamReader instead of relying on File.ReadLines(); this is handled by the BufReader class. This class also handles multi-lines without resorting to a StringBuilder, and handles the possible case that a line is larger than the buffer being used to read it.
- While testing, I discovered that using IL to create the generic list objects is faster than using new T().
- Poor man's (.net 4) Span with the MGSpan class, which just passes around the buffer, start, and length of the data and delays the creation of strings until they are actually used in the object-filling delegate (see the sketch after this list).
- ReadFile<T>() looks cleaner now and does not use StringBuilder for multi-lines.
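To illustrate the last two ideas, here is a minimal sketch (illustrative only, not the actual fastCSV internals) of a compiled factory that avoids new T(), and a poor man's span over the read buffer:

using System;
using System.Linq.Expressions;

// a compiled factory beats new T() in a hot loop, because new T() under a
// generic new() constraint compiles down to Activator.CreateInstance<T>()
static class Fast<T> where T : new()
{
    public static readonly Func<T> New =
        Expression.Lambda<Func<T>>(Expression.New(typeof(T))).Compile();
}

// a poor man's Span : just a (buffer, start, length) view over the read buffer;
// the string is only materialized when the mapper actually reads the column
// (the real class is called MGSpan; its members may differ)
struct CharSpan
{
    public char[] Buffer;
    public int Start;
    public int Length;

    public override string ToString() => new string(Buffer, Start, Length);
}

Making it a struct matters, as noted in the next section: instances live on the stack and never touch the garbage collector.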
Some Weird Stuff
The first pass of FillBuffer() used _tr.BaseStream.Seek() to go back in the CSV data file when the end of the buffer was reached and a line was not complete. This was fine at first, but it failed on non-ASCII data and UTF-8 encoded files: in UTF-8, some characters are encoded as 2 bytes rather than 1, so byte positions in the stream don't map one-to-one to the chars in the decoded buffer. This messes up the offset computation when seeking, which results in reading incorrect lines.
To remedy this, I used Array.Copy() to copy the unconsumed characters to the start of the buffer and read the rest from the file until the buffer is full. Interestingly, you can't use Buffer.BlockCopy() here for the same reason: it works in byte offsets, not characters.
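A minimal sketch of that compact-and-refill step (names are illustrative, assuming a TextReader over the file and a char[] read buffer):

// move the unconsumed tail of the buffer to the front, then refill the rest;
// working in chars sidesteps the byte/char mismatch that broke the Seek() approach
static int FillBuffer(System.IO.TextReader reader, char[] buf, int tailStart, int tailLength)
{
    // Array.Copy counts in elements (chars); Buffer.BlockCopy counts in bytes,
    // so its offsets and counts would have to be doubled for a char[]
    Array.Copy(buf, tailStart, buf, 0, tailLength);

    // fill the remainder of the buffer from the current stream position
    int read = reader.Read(buf, tailLength, buf.Length - tailLength);
    return tailLength + read;   // number of valid chars now in the buffer
}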
A simple change from class MGSpan to struct MGSpan results in a very distinct speed-up, probably because the instances are passed around on the stack and don't stay around long, which is faster than using the heap.
History
- Initial release : 1st January 2020
- Update v2.0 : 27th January 2020
  - article corrections
  - speed ups
- Update v2.0.1 : 5th September 2020
  - new overloads
  - ignore escaped characters at the end of a quoted line (thanks to Denis Samuel)
- Update v2.0.2 : 6th September 2020
  - ToString() null check (thanks to Martijn)
- Update v2.0.8 : 6th December 2020
  - added unit tests
  - bug fix : trailing new line sometimes
  - bug fix : empty trailing data columns showing previous row data
  - switched to TextReader
  - more ReadStream() overloads
  - bug fix : reading the first line when hasheader = false