Introduction
In the course of writing a program to generate a report about the properties of the contents of a Windows certificate store, I discovered that none of the CSV readers that I had at my disposal could process the unusual CSV strings found in many digital certificates. Turning to The Code Project, I found A Fast CSV Reader, which I eagerly downloaded, built, and tested.
Regrettably, I quickly discovered that it couldn't handle these types of strings, either, and set about rolling yet another CSV reader of my own.
Background
The type of string that broke every CSV reader at which I threw it looks like the following actual example from a production store of Trusted Root Certificates.
CN=RapidSSL CA, O="GeoTrust, Inc.", C=US
Specifically, the middle substring, O="GeoTrust, Inc.", is nonstandard, because the opening quotation mark that is intended to protect the comma in the organization name, GeoTrust, Inc., is the fourth character in the substring, rather than the first, where a conventional CSV reader expects it. This is by design, since the distinguished name string is a series of name-value pairs, as are many other certificate properties.
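To make the failure concrete, here is a minimal sketch of what a quote-unaware split does to the example string. The class and method names (NaiveSplitDemo, NaiveSplit) are hypothetical and are not part of any library discussed here; the sketch uses only String.Split.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Hypothetical helper: splits on every comma, with no awareness of quotation marks.
    public static string[] NaiveSplit(string input)
    {
        return input.Split(',');
    }

    public static void Main()
    {
        string subject = "CN=RapidSSL CA, O=\"GeoTrust, Inc.\", C=US";
        string[] fields = NaiveSplit(subject);

        // Four pieces come back, although the string holds three name-value pairs,
        // because the comma inside the quoted organization name is treated as a delimiter.
        Console.WriteLine(fields.Length);
        foreach (string field in fields)
        {
            Console.WriteLine("[" + field + "]");
        }
    }
}
```

The organization name is cut in two at its embedded comma, which is exactly the damage that a guard-aware parser must prevent.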
Although every other CSV library in The Code Project and elsewhere that I reviewed mimics a base class, such as System.IO.StreamReader, I chose instead to adhere to the Single Responsibility Principle, and focus entirely on parsing strings, leaving you free to determine how to acquire and manage them.
Getting the Code
Although this library has shown itself to be quite stable, in anticipation that it might need maintenance, the source code is in one of my oldest (and most stable) GitHub repositories, at https://github.com/txwizard/AnyCSV. More recently, I added MSDN-style documentation, which is published at https://txwizard.github.io/AnyCSV/.
There is also a NuGet package available at https://www.nuget.org/packages/WizardWrx.AnyCSV/.
Using the Code
The working code is in a class library, which I built against Microsoft .NET Framework 2.0, and it can be used in projects that target any newer version of the framework. For your convenience, I included both debug and release builds of the library and the test program.
The package includes a test program, AnyCSVTestStand.exe, which is built from a single source file, Program.cs, and a set of string resources. Since it was intended to prove the robustness of the algorithm, the test program confines itself to the static methods.
The class library, WizardWrx.AnyCSV.dll, exposes a single class, Parser, which has nine constructor overloads and one overloaded static Parse method that does the real work. Because the worker method is static, and everything it needs can either be passed into the method or allowed to take default values, you never need to create an instance. The only reason to do so is to specify, in advance, non-default values for one or more of the arguments that have corresponding properties on the class, so that you can omit them from the argument list when you call the Parse method in a loop to process a set of strings.
The Parse method has four overloads, all of which return a string[] containing the substrings. I shall next summarize the overloads, starting with the simplest.
- The simplest overload takes two arguments, the string to be parsed and the delimiter character, pchrDelimiter.
- Adding a tad of complexity, the next overload adds a second char argument, pchrProtector, which specifies the character that protects a pchrDelimiter that appears in the middle of a substring, which must be ignored.
- A further overload adds penmGuardDisposition, which uses a member of the GuardDisposition enumeration to govern disposition of guard characters. Its valid values are straightforward and self-explanatory: Keep and Strip.
- The most complex overload adds the final parameter, penmTrimWhiteSpace, which uses the TrimWhiteSpace enumeration to select one of four ways to dispose of leading and trailing white space in a substring.
TrimWhiteSpace value | Outcome |
Leave | Leave leading and trailing white space intact. |
TrimLeading | Trim leading white space. This is designed specifically for use with the Issuer and Subject fields of X.509 digital certificates. |
TrimTrailing | Trim trailing white space. This option is especially useful with CSV files generated by Microsoft Excel, which often have long runs of meaningless white space, especially when a worksheet has blank rows or columns in its UsedRange. |
TrimBoth | Trim both leading and trailing white space. Given that TrimLeading and TrimTrailing are required use cases, trimming both ends is essentially free. This flag is implemented such that it can be logically processed as the bitwise OR of TrimLeading and TrimTrailing. |
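The way a combined flag falls out of the two single-ended flags can be sketched as follows. This is an illustration only: the enumeration TrimOption, its numeric values, and the helper TrimSketch.Apply are hypothetical stand-ins, and I am assuming bit values chosen so that the "both" member equals the bitwise OR of the other two.

```csharp
using System;

// Hypothetical stand-in for a trimming enumeration; the numeric values are assumptions.
[Flags]
public enum TrimOption
{
    Leave = 0,
    TrimLeading = 1,
    TrimTrailing = 2,
    TrimBoth = TrimLeading | TrimTrailing
}

public static class TrimSketch
{
    // Tests each bit independently, so TrimBoth needs no branch of its own.
    public static string Apply(string field, TrimOption option)
    {
        if ((option & TrimOption.TrimLeading) != 0)
        {
            field = field.TrimStart();
        }
        if ((option & TrimOption.TrimTrailing) != 0)
        {
            field = field.TrimEnd();
        }
        return field;
    }

    public static void Main()
    {
        string field = "  O=\"GeoTrust, Inc.\"  ";
        Console.WriteLine("[" + Apply(field, TrimOption.Leave) + "]");
        Console.WriteLine("[" + Apply(field, TrimOption.TrimLeading) + "]");
        Console.WriteLine("[" + Apply(field, TrimOption.TrimTrailing) + "]");
        Console.WriteLine("[" + Apply(field, TrimOption.TrimBoth) + "]");
    }
}
```

Because each bit is tested on its own, supporting the combined case costs nothing beyond the two tests that the single-ended cases already require.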
One final static method, StandardCSVParse, is dedicated to parsing a true Comma Separated Values string. As such, it has a single string argument, pstrAnyCSV. The delimiter character is a comma, and the guard character is the double quotation mark.
Points of Interest
Everything that counts happens in a single method, the most complex of the four static Parse methods. Everything else, including the instance Parse method, calls upon it, specifying default values for omitted arguments or, in the case of the instance method, the corresponding instance properties.
The Parse method gets its robustness from a simple state machine that uses a pair of simple Boolean variables, fInProgress and fProtectDelimiters, as state variables. The key point is that when fProtectDelimiters is true, indicating that a guard character has been found, but its mate has yet to be found, delimiter characters are ignored. Substrings are assembled by appending characters to a StringBuilder that is initialized with a capacity sufficient to hold a degenerate case string, one that is devoid of delimiters. The objective of reserving such a large amount of memory is that the StringBuilder never needs to expand; hence, it need never move a partially built string to a bigger buffer.
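The idea can be sketched in a few lines. This is not the library's code: GuardedSplitSketch and its Split method are hypothetical, the sketch always keeps guard characters and never trims, and it uses a single flag named after fProtectDelimiters (the role of fInProgress is not spelled out above, so it is omitted).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class GuardedSplitSketch
{
    // Hypothetical sketch: split on delimiter, ignoring delimiters between paired guards.
    public static string[] Split(string input, char delimiter, char guard)
    {
        List<string> fields = new List<string>();

        // Capacity covers the degenerate case of an input that has no delimiters,
        // so the builder never has to grow.
        StringBuilder current = new StringBuilder(input.Length);
        bool fProtectDelimiters = false;

        foreach (char c in input)
        {
            if (c == guard)
            {
                // An opening guard turns protection on; its mate turns it off.
                fProtectDelimiters = !fProtectDelimiters;
                current.Append(c);
            }
            else if (c == delimiter && !fProtectDelimiters)
            {
                // An unprotected delimiter ends the current substring.
                fields.Add(current.ToString());
                current.Length = 0; // empty the builder, keeping its buffer
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString()); // the last substring has no trailing delimiter
        return fields.ToArray();
    }

    public static void Main()
    {
        string subject = "CN=RapidSSL CA, O=\"GeoTrust, Inc.\", C=US";
        foreach (string field in Split(subject, ',', '"'))
        {
            Console.WriteLine("[" + field + "]");
        }
        // [CN=RapidSSL CA]
        // [ O="GeoTrust, Inc."]
        // [ C=US]
    }
}
```

Note that the position of the guard character within a substring is irrelevant to this approach, which is why a quotation mark that appears after O= causes no trouble.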
To simplify routine use, the class exposes common delimiter and guard characters as public constants. If you prefer to avoid using raw characters, you can specify them in terms of a pair of enumerations, DelimiterChar and GuardChar. The constants and enumerations are especially useful for specifying guard characters, since quotation mark literals are rather fussy.
Translating the DelimiterChar and GuardChar enumerations is aided by arrays of DelimiterMap and GuardMap structures, which are populated by a static Parser constructor.
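A lookup of this kind might be laid out as in the following sketch. Everything in it is hypothetical: the enumeration DelimiterKind and its members, the DelimiterEntry structure, and the DelimiterLookup class are invented for illustration and should not be read as the library's actual types or values.

```csharp
using System;

// Hypothetical enumeration; the members are assumptions made for this sketch.
public enum DelimiterKind { Comma, Tab, VerticalBar }

// Hypothetical map entry that pairs an enumeration member with its character.
public struct DelimiterEntry
{
    public readonly DelimiterKind Kind;
    public readonly char Character;

    public DelimiterEntry(DelimiterKind kind, char character)
    {
        Kind = kind;
        Character = character;
    }
}

public static class DelimiterLookup
{
    private static readonly DelimiterEntry[] s_map;

    // The static constructor runs once, before the first use of the class.
    static DelimiterLookup()
    {
        s_map = new DelimiterEntry[]
        {
            new DelimiterEntry(DelimiterKind.Comma, ','),
            new DelimiterEntry(DelimiterKind.Tab, '\t'),
            new DelimiterEntry(DelimiterKind.VerticalBar, '|')
        };
    }

    // A linear search is adequate for a map that holds only a handful of entries.
    public static char ToChar(DelimiterKind kind)
    {
        foreach (DelimiterEntry entry in s_map)
        {
            if (entry.Kind == kind)
            {
                return entry.Character;
            }
        }
        throw new ArgumentOutOfRangeException("kind");
    }

    public static void Main()
    {
        Console.WriteLine(ToChar(DelimiterKind.VerticalBar)); // |
    }
}
```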
The instance properties use s_objSyncLock, a private general-purpose object, as a lock to make themselves thread-safe. I borrowed this proven concept from other classes that use such an object as part of their implementation of the Singleton design pattern.
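The pattern looks roughly like the sketch below. The class ParserSettings and its Delimiter property are hypothetical; only the name s_objSyncLock comes from the description above, and I am assuming that it is a static System.Object used solely as a lock target.

```csharp
using System;

// Hypothetical class that shows a property guarded by a private lock object.
public class ParserSettings
{
    // Assumption: one static lock object shared by all instances.
    private static readonly object s_objSyncLock = new object();

    private char _delimiter = ',';

    public char Delimiter
    {
        get
        {
            lock (s_objSyncLock)
            {
                return _delimiter;
            }
        }
        set
        {
            lock (s_objSyncLock)
            {
                _delimiter = value;
            }
        }
    }

    public static void Main()
    {
        ParserSettings settings = new ParserSettings();
        settings.Delimiter = '\t';
        Console.WriteLine(settings.Delimiter == '\t'); // True
    }
}
```

Locking on a private object, rather than on this or on a public member, ensures that no outside code can take the same lock and cause a deadlock.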
Room for Improvement
I anticipate that others will find numerous ways to improve this library. One that occurs to me is the addition of a static StringBuilder that grows, but never shrinks during its lifetime. At a minimum, this would require the Parse method to test for its existence and size, and use the s_objSyncLock object to make itself thread-safe.
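One possible shape for that improvement is sketched here. Nothing like it exists in the library as described; the class SharedBufferSketch and its Prepare method are hypothetical, and a real implementation would also need to hold the lock for as long as the buffer is in use.

```csharp
using System;
using System.Text;

// Hypothetical sketch of a shared builder that grows on demand and never shrinks.
public static class SharedBufferSketch
{
    private static readonly object s_objSyncLock = new object();
    private static StringBuilder s_buffer;

    // Ensures that the shared builder exists and can hold at least "required"
    // characters, empties it, and reports its resulting capacity.
    public static int Prepare(int required)
    {
        lock (s_objSyncLock)
        {
            if (s_buffer == null)
            {
                s_buffer = new StringBuilder(required);
            }
            else if (s_buffer.Capacity < required)
            {
                s_buffer.Capacity = required; // grow only; a smaller request leaves it alone
            }

            s_buffer.Length = 0; // discard old content, keeping the allocated capacity
            return s_buffer.Capacity;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Prepare(100) >= 100); // True
        Console.WriteLine(Prepare(10) >= 100);  // True: the buffer did not shrink
    }
}
```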
History
- Friday, 01 May 2015 - Initial publication
- Monday, 29 April 2019 - Add "Getting the Code" section, since the code lives in a GitHub repository
- Tuesday, 14 January 2020 - Remove stray markup from the text.