Introduction
This article presents a fast line-count (row-count) routine for CSV files, built on
the BinaryReader API.
Background
A CSV file holds plain-text data. The format places no official limit on the number of rows, the number of columns, or the file size.
Because nothing in the file records the row count up front, counting the lines requires reading the complete file.
Using the Code
On Windows, a line break is represented by CRLF (\r\n). The basic approach is to stream through the entire content and count
the line breaks. The BinaryReader API is used for the stream reads.
This class reads primitive data types as binary values in a specific encoding.
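// Requires: using System.IO;
private static int GetLineCount(string fileName)
{
    // The using block disposes the reader and the underlying file stream.
    using (BinaryReader reader = new BinaryReader(File.OpenRead(fileName)))
    {
        int lineCount = 0;
        char lastChar = '\0';
        // PeekChar returns -1 at the end of the stream, so an empty file yields 0
        // instead of throwing EndOfStreamException.
        while (reader.PeekChar() != -1)
        {
            char newChar = reader.ReadChar();
            // Count each CRLF pair as one line break.
            if (lastChar == '\r' && newChar == '\n')
            {
                lineCount++;
            }
            lastChar = newChar;
        }
        return lineCount;
    }
}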
The above snippet counts CRLF pairs, the line break on Windows, so a final line with no trailing CRLF is not counted. The code can be further improved by checking
the Environment.NewLine
property, which specifies the newline string for the environment the code runs in.
- For Unix environments (including modern macOS), the newline string is LF (\n).
- For classic Mac OS (version 9 and earlier), the newline string is CR (\r).
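Because a file may have been produced on a different platform than the one reading it, one possible refinement is to recognise all three conventions at once rather than rely on Environment.NewLine alone. The following is a minimal sketch, not the routine from this article: the LineBreakCounter class and CountLineBreaks method are hypothetical names, the 64 KB buffer size is an arbitrary choice, and the byte-level scan assumes an ASCII or UTF-8 file.

```csharp
using System;
using System.IO;

public static class LineBreakCounter
{
    // Counts line breaks, treating CRLF, a lone LF, and a lone CR each as one break.
    // Scanning raw bytes is safe for ASCII and UTF-8 (0x0D and 0x0A never occur
    // inside a multi-byte UTF-8 sequence) but not for UTF-16 or UTF-32 files.
    public static long CountLineBreaks(string fileName)
    {
        using (FileStream stream = File.OpenRead(fileName))
        {
            return CountLineBreaks(stream);
        }
    }

    public static long CountLineBreaks(Stream stream)
    {
        byte[] buffer = new byte[64 * 1024]; // arbitrary buffer size
        long count = 0;
        bool previousWasCr = false;          // carried across buffer boundaries
        int bytesRead;

        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < bytesRead; i++)
            {
                byte current = buffer[i];
                if (current == (byte)'\r')
                {
                    count++;                 // a CR always starts a break (CR or CRLF)
                    previousWasCr = true;
                }
                else
                {
                    // An LF directly after a CR belongs to the CRLF already counted.
                    if (current == (byte)'\n' && !previousWasCr)
                    {
                        count++;
                    }
                    previousWasCr = false;
                }
            }
        }
        return count;
    }
}
```

Like the CRLF version, this counts line breaks rather than lines, so a last line without a trailing break is not included in the result.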
Alternatives
- Read all records at once, and take the length of the array returned by the
File.ReadAllLines
API. This is good for small files. For large files (>2GB) an OutOfMemoryException
is expected.
- StreamReader
API: There are two options:
  - Using the
  ReadLine
  function to read lines. The trade-off is that every line is converted
  to a string, which is not needed for counting.
  - Using the
  Read()
  and Peek()
  methods. This is similar to the BinaryReader
  approach, but these methods return an integer rather than
  a char, so a little more logic is required for the character comparisons.
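The two StreamReader options can be sketched as follows. This is an illustration under assumptions, not code from the article: the StreamReaderLineCounter class and its method names are hypothetical, and the methods accept a TextReader (the base class of StreamReader) so that they can also be tried against in-memory text. Peek() is left out of the second method because Read() already returns -1 at the end of the input.

```csharp
using System;
using System.IO;

public static class StreamReaderLineCounter
{
    // Option 1: ReadLine builds a string for every line, which is then discarded.
    public static long CountWithReadLine(TextReader reader)
    {
        long count = 0;
        while (reader.ReadLine() != null)
        {
            count++;
        }
        return count;
    }

    // Option 2: Read returns an int (-1 at end of input), so the values are
    // compared as integers; char literals convert to int implicitly.
    public static long CountCrLfWithRead(TextReader reader)
    {
        long count = 0;
        int last = -1;
        int next;
        while ((next = reader.Read()) != -1)
        {
            if (last == '\r' && next == '\n')
            {
                count++;
            }
            last = next;
        }
        return count;
    }

    // Convenience overload for files; StreamReader detects common byte order marks.
    public static long CountWithReadLine(string fileName)
    {
        using (StreamReader reader = new StreamReader(fileName))
        {
            return CountWithReadLine(reader);
        }
    }
}
```

The two methods can disagree on the same input: ReadLine accepts \r, \n, and \r\n as terminators and also returns a final line that has no terminator, whereas the Read-based loop counts only CRLF pairs.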
Points of Interest
Below are some efficient CSV parsers that I have come across or used.
- TextFieldParser: This is a built-in .NET parser for structured
text files. It ships in the Microsoft.VisualBasic.dll library.
- KBCsv library: This is an efficient, easy-to-use library developed
by Kent Boogaart.
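When the goal is to count CSV records rather than raw line breaks, a parser can do the counting. The sketch below uses TextFieldParser from the Microsoft.VisualBasic.FileIO namespace; the CsvRecordCounter class and CountRecords method are hypothetical names, and a comma delimiter with quoted fields is assumed. It is expected to be slower than the byte-level approaches because every record is split into fields.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO; // requires a reference to Microsoft.VisualBasic

public static class CsvRecordCounter
{
    // Counts the records the parser returns; blank lines are skipped by the parser.
    public static long CountRecords(TextReader input)
    {
        using (TextFieldParser parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");               // assumed delimiter
            parser.HasFieldsEnclosedInQuotes = true; // assumed quoting convention

            long count = 0;
            while (!parser.EndOfData)
            {
                parser.ReadFields();                 // field values are discarded
                count++;
            }
            return count;
        }
    }

    public static long CountRecords(string fileName)
    {
        using (StreamReader reader = new StreamReader(fileName))
        {
            return CountRecords(reader);
        }
    }
}
```

A header row, if present, is counted like any other record, so callers that want only data rows would subtract one.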