A Robust CSV Reader

David A. Gray

14 Jan 2020 | CPOL | 6 min read

The routines in this library can parse any string that I can throw at them, including Common Name strings read from X.509 digital certificates.

Introduction

In the course of writing a program to generate a report about the properties of the contents of a Windows certificate store, I discovered that none of the CSV readers that I had at my disposal could process the unusual CSV strings found in many digital certificates. Turning to The Code Project, I found A Fast CSV Reader, which I eagerly downloaded, built, and tested.

Regrettably, I quickly discovered that it couldn't handle these types of strings either, so I set about writing yet another CSV reader of my own.

Background

The type of string that broke every CSV reader I tried looks like the following actual example from a production store of Trusted Root Certificates:

CN=RapidSSL CA, O="GeoTrust, Inc.", C=US

Specifically, the middle substring, O="GeoTrust, Inc.", is nonstandard, because the opening quotation mark that is intended to protect the comma in the organization name, GeoTrust, Inc., is not the first character of the field; it follows the O= prefix, whereas conventional CSV expects a quoted field to begin with the quotation mark. This is by design, since the Common Name string is a series of name-value pairs, as are many other certificate properties.
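To make the failure concrete, the following minimal sketch splits the example string on every comma, with no regard for quotation marks. The class and method names (NaiveSplitDemo, SplitNaively) are hypothetical and exist only for this illustration; they are not part of any library discussed here.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Splits on every comma, paying no attention to quotation marks.
    public static string[] SplitNaively(string input)
    {
        return input.Split(',');
    }

    public static void Main()
    {
        string subject = "CN=RapidSSL CA, O=\"GeoTrust, Inc.\", C=US";

        // Brackets make leading blanks and stray quotation marks visible.
        foreach (string field in SplitNaively(subject))
        {
            Console.WriteLine("[{0}]", field);
        }
    }
}
```

The split yields four pieces instead of the intended three, because the comma inside the organization name is treated as a delimiter.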

Although every other CSV library in The Code Project and elsewhere that I reviewed mimics a base class, such as System.IO.StreamReader, I chose instead to adhere to the Single Responsibility Principle, and focus entirely on parsing strings, leaving you free to determine how to acquire and manage them.

Getting the Code

Although this library has proven quite stable, the source code lives in one of my oldest (and most stable) GitHub repositories, at https://github.com/txwizard/AnyCSV, in anticipation that it might need maintenance. More recently, I added MSDN-style documentation, which is published at https://txwizard.github.io/AnyCSV/.

There is also a NuGet package available at https://www.nuget.org/packages/WizardWrx.AnyCSV/.

Using the Code

The working code is in a class library, which I built against Microsoft .NET Framework 2.0, and it can be used in projects that target any newer version of the framework. For your convenience, I included both debug and release builds of the library and the test program.

The package includes a test program, AnyCSVTestStand.exe, which is built from a single source file, Program.cs, and a set of string resources. Since it was intended to prove the robustness of the algorithm, the test program confines itself to the static methods.

Class library WizardWrx.AnyCSV.dll exposes a single class, Parser, which has nine constructor overloads and one overloaded static Parse method that does the real work. Creating an instance is optional. Since the worker method is static, and everything it needs can be either passed in or allowed to take default values, the only reason to use an instance is to set non-default values once, through the properties that correspond to the arguments, so that you can omit those arguments when you call the Parse method in a loop to process a set of strings.

The Parse method has four overloads, all of which return a string[] containing the substrings. I shall next summarize the overloads, starting with the fourth, and simplest.

The fourth, and simplest, overload takes two arguments: the string to be parsed and the delimiter character, pchrDelimiter.
Adding a tad of complexity, the third overload adds a second char argument, pchrProtector, which specifies the guard character; a pchrDelimiter that appears between a pair of guard characters is treated as part of the substring rather than as a delimiter.
The second overload adds penmGuardDisposition, which uses a member of the GuardDisposition enumeration to govern disposition of guard characters. Its valid values are self-explanatory: Keep and Strip.
The first, and most complex, overload adds the final parameter, penmTrimWhiteSpace, which uses the TrimWhiteSpace enumeration to specify one of four options for disposing of leading and trailing white space in a substring.

TrimWhiteSpace
Value	Outcome
`Leave`	Leave leading and trailing white space as is.
`TrimLeading`	Trim leading white space. This is designed specifically for use with Issuer and Subject fields of X.509 digital certificates.
`TrimTrailing`	Trim trailing white space. This option is especially useful with CSV files generated by Microsoft Excel, which often have long runs of meaningless white space, especially when a worksheet has blank rows or columns in its UsedRange.
`TrimBoth`	Given that `TrimLeading` and `TrimTrailing` are required use cases, trimming both ends is essentially free. This flag is implemented such that it can be logically processed as `TrimLeading | TrimTrailing`.
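The way the four options combine can be sketched with a flags enumeration. The TrimOption enumeration below is a hypothetical stand-in for the library's TrimWhiteSpace enumeration; its numeric values, and the Apply helper, are assumptions made for illustration, not the library's actual definitions.

```csharp
using System;

public static class TrimSketch
{
    // Hypothetical stand-in for TrimWhiteSpace. The numeric values are
    // assumptions, chosen so that TrimBoth equals TrimLeading | TrimTrailing.
    [Flags]
    public enum TrimOption
    {
        Leave = 0,
        TrimLeading = 1,
        TrimTrailing = 2,
        TrimBoth = TrimLeading | TrimTrailing
    }

    // Applies the requested trimming by testing each flag independently,
    // so TrimBoth needs no special case.
    public static string Apply(string field, TrimOption option)
    {
        if ((option & TrimOption.TrimLeading) != 0)
        {
            field = field.TrimStart();
        }

        if ((option & TrimOption.TrimTrailing) != 0)
        {
            field = field.TrimEnd();
        }

        return field;
    }

    public static void Main()
    {
        string field = "  O=GeoTrust  ";

        // Brackets make the surviving white space visible.
        foreach (TrimOption option in Enum.GetValues(typeof(TrimOption)))
        {
            Console.WriteLine("{0}: [{1}]", option, Apply(field, option));
        }
    }
}
```

Because each flag is tested on its own, handling TrimBoth costs nothing beyond the two tests that the single-ended options already require.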

One final static method, StandardCSVParse, is dedicated to parsing a true Comma Separated Values string. As such, it has a single string argument, pstrAnyCSV. The delimiter character is a comma, and the guard character is the double quotation mark.

Points of Interest

Everything that counts happens in a single method, the most complex of the four static Parse methods. Everything else, including the instance Parse method, calls upon it, specifying default values for omitted arguments or, in the case of the instance method, the corresponding instance properties.

The Parse method gets its robustness from a simple state machine that uses a pair of Boolean variables, fInProgress and fProtectDelimiters, as state variables. The key point is that while fProtectDelimiters is true, indicating that a guard character has been found but its mate has not, delimiter characters are treated as ordinary text. Substrings are assembled by appending characters to a StringBuilder that is initialized with a capacity sufficient to hold the degenerate case, a string that is devoid of delimiters. Reserving that much memory up front means that the StringBuilder never needs to expand; hence, it need never move a partially built string to a bigger buffer.
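The idea can be sketched in a few lines. What follows is a simplified re-creation written for this article, not the library's source: the class and method names (StateMachineSketch, Split) and the stripGuards parameter are hypothetical, only the fProtectDelimiters flag is modeled, and fInProgress is omitted because its role is not spelled out above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StateMachineSketch
{
    // Hypothetical illustration of the technique; not the library's code.
    public static string[] Split(string input, char delimiter, char guard, bool stripGuards)
    {
        List<string> fields = new List<string>();

        // A capacity equal to the input length covers the degenerate case
        // of a string without delimiters, so the builder never has to grow.
        StringBuilder current = new StringBuilder(input.Length);

        // True after a guard character is seen and until its mate is seen.
        bool fProtectDelimiters = false;

        foreach (char c in input)
        {
            if (c == guard)
            {
                // Guards toggle protection wherever they occur in a field.
                fProtectDelimiters = !fProtectDelimiters;
                if (!stripGuards)
                {
                    current.Append(c);
                }
            }
            else if (c == delimiter && !fProtectDelimiters)
            {
                // An unprotected delimiter ends the current field.
                fields.Add(current.ToString());
                current.Length = 0; // empty the builder for the next field
            }
            else
            {
                // Ordinary text, including protected delimiters.
                current.Append(c);
            }
        }

        // The last field has no trailing delimiter.
        fields.Add(current.ToString());
        return fields.ToArray();
    }

    public static void Main()
    {
        string subject = "CN=RapidSSL CA, O=\"GeoTrust, Inc.\", C=US";

        foreach (string field in Split(subject, ',', '"', true))
        {
            Console.WriteLine("[{0}]", field.Trim());
        }
    }
}
```

Because the guard test does not care where in the field the quotation mark appears, the sketch returns three fields for the certificate string, with the comma in GeoTrust, Inc. intact.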

To simplify routine use, the class exposes common delimiter and guard characters as public constants. If you prefer to avoid using raw characters, you can specify them in terms of a pair of enumerations, DelimiterChar and GuardChar. The constants and enumerations are especially useful for specifying guard characters, since quotation mark literals are rather fussy.

Translating the DelimiterChar and GuardChar enumerations is aided by arrays of DelimiterMap and GuardMap structures, which are populated by a static Parser constructor.

The instance properties use s_objSyncLock, a private generic object, to make themselves thread safe. I borrowed this proven concept from other classes that use such an object as part of their implementation of the Singleton design pattern.

Room for Improvement

I anticipate that others will find numerous ways to improve this library. One that occurs to me is the addition of a static StringBuilder that grows, but never shrinks during its lifetime. At a minimum, this would require the Parse method to test for its existence and size, and use the s_objSyncLock object to make itself thread-safe.
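One way this suggestion might look is sketched below. Everything in it is an assumption made for illustration: the SharedBufferSketch class, its StripGuards method, and the s_syncLock and s_buffer fields are hypothetical, and the method only removes guard characters so that the buffer handling stays in view.

```csharp
using System;
using System.Text;

public static class SharedBufferSketch
{
    // Hypothetical lock object and shared buffer; these are not the
    // library's fields.
    private static readonly object s_syncLock = new object();
    private static StringBuilder s_buffer;

    // Removes every guard character from the input, building the result
    // in a shared StringBuilder that is reused across calls.
    public static string StripGuards(string input, char guard)
    {
        lock (s_syncLock)
        {
            if (s_buffer == null)
            {
                // First use: create the buffer at the size of this input.
                s_buffer = new StringBuilder(input.Length);
            }
            else
            {
                // Later uses: empty the buffer, then grow it only if this
                // input is longer than anything seen so far.
                s_buffer.Length = 0;
                s_buffer.EnsureCapacity(input.Length);
            }

            foreach (char c in input)
            {
                if (c != guard)
                {
                    s_buffer.Append(c);
                }
            }

            return s_buffer.ToString();
        }
    }

    public static void Main()
    {
        Console.WriteLine(StripGuards("O=\"GeoTrust, Inc.\"", '"'));
        Console.WriteLine(StripGuards("C=US", '"'));
    }
}
```

The trade-off is that the lock serializes callers for the duration of each call, so the saving in allocations has to be weighed against contention in heavily threaded programs.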

History

Friday, 01 May 2015 - Initial publication
Monday, 29 April 2019 - Add "Getting the Code" section, since the code lives in a GitHub repository
Tuesday, 14 January 2020 - Remove stray markup from the text.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)