In this article, we will take a look at the CSV specifications defined in RFC 4180, along with a brief explanation of the structure a file of this type can have. I will show some examples of how to use the CSV interface, cover some edge cases in which the vast majority of parsers fail, and detail all the features of the library under study.
Introduction
For many users, designing a CSV library seems unnecessary, a premise based on the existence of tools such as Power Query for Excel and the powerful SQL language for Microsoft Access. However, if all we need is to read the information from a CSV file and hold the data in RAM, using Power Query in Excel involves these steps: establish a physical connection to the file, dump the data into a spreadsheet, copy the data to a VBA array, delete the data dumped to the sheet, and delete the connection to the file. Clearly, for this case, the best option is a library that dumps the data directly from the CSV file into RAM. That is the need that led me to design the CSV interface.
Capabilities
- RFC-4180 specs compliant and more. The parser handles a wide variety of situations, although it is preferable to follow the specifications: it supports quoted and multi-line fields, and the user can skip commented and blank records.
- Stable. Fully Test Driven Developed (TDD) library (64/64 tests passed), including 650+ lines of testing code. See the VBA test library by Tim Hall.
- Memory-friendly. CSV/TSV files are processed using a custom stream technique; only 0.5MB is held in memory at a time.
- Robust. Parser and writer accept Unix-style quote escape sequences.
- Easy to use. A few lines of code can do the work!
- Automatic delimiter guesser. Don't worry if you forgot the file configuration; the interface has a solid strategy for guessing delimiters!
- Highly configurable. The user can configure the parser to work with a wide range of CSV files.
- CSV data filtering. Save only the CSV data that meets the specified requirements.
- SQL-like queries on CSV files. Add your own logic to mimic SQL queries and filter data by criteria (=, <>, >=, <=, AND, OR).
- Flexible. Import only a certain range of records from the given file, import fields (columns) by indexes or names, or read records in sequential mode.
- Dynamic typing support. Turn CSV data fields into the desired VBA data types.
- Data sorting. Sort imported CSV data using the hyper-fast Yaroslavskiy dual-pivot Quicksort, as used in Java.
- Microsoft Access compatible. The library has a version for those who feel comfortable working through DAO databases; download it from here.
RFC-4180 Specs
Currently, there is no single defined standard for CSV files; there are a multitude of implementations and variants. Nevertheless, there are specifications such as RFC 4180 that define the basic structure of these files, while variants of this standard are described in the specifications of the Library of Congress.
According to the specifications, a CSV can contain:
- A field (column) containing the field delimiter character, the record delimiter character (multiline fields), or the escape character (usually the double quotation mark). This type of field must be encapsulated ("escaped") using the escape character. Literal escape characters in escaped fields must themselves be escaped, either by duplicating each escape character or by preceding each one with a backslash (Unix style).
- A commented or empty record (row/line).
- The tab character (\t) or semicolon (;) as a field delimiter, and one of the characters CRLF (\r\n), CR (\r), and LF (\n) as a record delimiter. In exceptional cases, we can find CSV files with a mixture of record delimiters.
The library also supports the apostrophe (') as an escape character. Under the rules of the specifications, a field containing the contraction "isn't" would have to be stored in the CSV file as "isn''t", which is confusing. In these cases, users may choose the Unix escape mechanism instead and store the text as "isn\'t", which is more human-readable.
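For instance, here is an illustrative snippet (not taken from the library's test suite) assuming the apostrophe has been configured as the quoting/escape character; both records encode exactly the same field value, isn't:
id,remark
1,'isn''t'
2,'isn\'t'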
An example of a CSV with a structure that conforms to the RFC-4180 specifications would be as follows:
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1
fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
The above example should produce the fields and records shown in the following table (headers added for demonstration purposes; the ⏎ symbol marks a line break inside a multiline field):
| Column 1 | Column 2 | Column 3 | Column 4 |
| rec1, fld1 | | rec1","fld3.1⏎",⏎fld3.2 | rec1⏎fld4 |
| rec2, fld1.1⏎fld1.2 | rec2 fld2.1"fld2.2"fld2.3 | | rec2 fld4 |
In the table, field #3 of record #1 is multiline; the same is true for field #4 of that record and for field #1 of record #2. According to the RFC-4180 specifications, literal escape characters are escaped by duplicating each of them, so field #3 of record #1 and field #2 of record #2 contain escape characters that the parser must unescape. In addition, many fields contain the field delimiter itself, which the parser must also handle.
This makes the example a perfect scenario for testing any CSV parser, and it is already included as a test case in the VBA CSV interface.
Solution Core
Many authors recommend avoiding loading chunks of a file into a buffer to reduce the memory footprint, the reason being that this alternative can be extremely complex to implement correctly. Nevertheless, buffering is the route taken here to achieve a parser capable of working with CSV files efficiently.
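To illustrate the general idea, here is a minimal sketch of a buffered reader, not the library's actual code: it loads a fixed-size chunk, parses only up to the last complete record, and carries the remainder over to the next chunk. It assumes an ANSI text file, LF-terminated records, and that every chunk contains at least one record delimiter; quoted multiline fields need extra care that is omitted here.
' Minimal sketch of the buffered-read idea; NOT the library's actual code.
Sub BufferedReadSketch()
    Const CHUNK_SIZE As Long = 524288           ' 0.5 MB, matching the figure above
    Dim fileNr As Integer
    Dim chunk As String, buffer As String, carry As String
    Dim bytesLeft As Long, cutPos As Long

    fileNr = FreeFile
    Open "C:\Sample.csv" For Binary Access Read As #fileNr
    bytesLeft = LOF(fileNr)
    Do While bytesLeft > 0
        chunk = Space$(IIf(bytesLeft < CHUNK_SIZE, bytesLeft, CHUNK_SIZE))
        Get #fileNr, , chunk                    ' read the next chunk
        bytesLeft = bytesLeft - Len(chunk)
        buffer = carry & chunk                  ' prepend data left over from last pass
        cutPos = InStrRev(buffer, vbLf)         ' position of the last complete record
        If bytesLeft > 0 And cutPos > 0 Then
            carry = Mid$(buffer, cutPos + 1)    ' keep the trailing partial record
            buffer = Left$(buffer, cutPos)      ' parse complete records only
        Else
            carry = vbNullString                ' last chunk: parse everything
        End If
        ' ...hand "buffer" to the parsing routine here...
    Loop
    Close #fileNr
End Sub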
Using the Code
This section will attempt to analyze all the capabilities of the CSV interface.
Import whole CSV file:
Sub CSVimport()
    Dim CSVint As CSVinterface
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"        ' full path to the file
        .fieldsDelimiter = ","         ' fields delimited by commas
        .recordsDelimiter = vbCrLf     ' records delimited by CRLF
    End With
    With CSVint
        .ImportFromCSV .parseConfig    ' import the whole CSV file
    End With
End Sub
Now suppose the user only needs to import a specific range of records from the file "Sample.csv". It is possible to write code like the one shown below:
Sub CSVimportRecordsRange()
    Dim CSVint As CSVinterface
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"
        .fieldsDelimiter = ","
        .recordsDelimiter = vbCrLf
        .startingRecord = 10           ' start the import at the 10th record
        .endingRecord = 20             ' end the import at the 20th record
    End With
    With CSVint
        .ImportFromCSV .parseConfig
    End With
End Sub
If the user wants to sort the imported data, a code like the following can be written:
Sub CSVimportAndSort()
    Dim CSVint As CSVinterface
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"
        .fieldsDelimiter = ","
        .recordsDelimiter = vbCrLf
    End With
    With CSVint
        .ImportFromCSV .parseConfig
        .Sort SortColumn:=1, Descending:=True   ' sort by the first column, in descending order
    End With
End Sub
CSV data is mainly treated as text strings. What if the user wants to do some calculations on the data obtained from a given file? In this situation, the user can switch the parser to dynamic typing mode. Here's an example:
Sub CSVimportAndTypeData()
    Dim CSVint As CSVinterface
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"
        .fieldsDelimiter = ","
        .recordsDelimiter = vbCrLf
        .dynamicTyping = True                         ' turn dynamic typing on
        .DefineTypingTemplate TypeConversion.ToDate, _
                              TypeConversion.ToLong, _
                              TypeConversion.ToDouble ' data types to convert to
        .DefineTypingTemplateLinks 6, _
                                   7, _
                                   10                 ' link each conversion to a field (column) index
    End With
    With CSVint
        .ImportFromCSV .parseConfig
    End With
End Sub
The escape character can be set to one of the members of an enumeration:
Sub SetEscapeChar()
    Dim CSVint As CSVinterface
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .escapeToken = EscapeTokens.DoubleQuotes   ' use the double quote as the escape character
    End With
End Sub
Once the data is imported and saved to the internal object, the user can access it in the same way as a standard VBA array. An example would be:
Sub LoopData(ByRef CSVint As CSVinterface)
    With CSVint
        Dim iCounter As Long
        Dim cRecord() As Variant
        Dim cField As Variant
        For iCounter = 0 To .count - 1
            cRecord() = .item(iCounter)      ' get a whole record
            cField = .item(iCounter, 2)      ' get a single field of a record
        Next
    End With
End Sub
However, it is sometimes disadvantageous to keep the data in containers other than VBA arrays. This becomes especially noticeable when the information must be written to Excel's own objects, such as worksheets, or to VBA user forms, as in the case of list boxes, which can be filled from an array in a single instruction. In those cases, the user can copy the information out of the internal object using code like this:
Sub DumpData(ByRef CSVint As CSVinterface, ByRef oAccessDB As Object)
    Dim oArray() As Variant
    With CSVint
        .DumpToArray oArray            ' dump to a two-dimensional array
        .DumpToJaggedArray oArray      ' dump to a jagged array (array of arrays)
        .DumpToSheet                   ' dump to an Excel worksheet
        .DumpToAccessTable oAccessDB, _
                           "Table1", _
                           2, 3        ' dump to a table of a DAO database
    End With
End Sub
So far, in the examples addressed, the user has been allowed to choose between two actions:
- Import ALL records contained in a CSV file.
- Import a recordset, starting at record X and ending at record Y.
In both options, the user is obliged to import all the fields (columns) present in the file. Most CSV parsers only offer the first option, but what if the user wants to save only the information that is relevant to them? And what if the intent is to store in memory only the records that meet a certain set of requirements?
A user may need to import only 2 of the 12 columns of a CSV file; in this case, code like the following can be used:
Sub CSVimportDesiredColumns()
    Dim CSVint As CSVinterface
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"
        .fieldsDelimiter = ","
        .recordsDelimiter = vbCrLf
    End With
    With CSVint
        .ImportFromCSV .parseConfig, _
                       1, "Revenue"    ' import only the columns addressed by index (1) and by name ("Revenue")
    End With
End Sub
Now let's imagine that a user wants to apply some logic before saving the data. In that case, they can step through the records of the CSV file one by one using the sequential reader, as shown in the following example:
Sub CSVsequentialImport()
    Dim CSVint As CSVinterface
    Dim csvRecord As ECPArrayList
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"
        .fieldsDelimiter = ","
        .recordsDelimiter = vbCrLf
    End With
    With CSVint
        .OpenSeqReader .parseConfig, _
                       1, "Revenue"    ' open a sequential reader over the desired columns
        Do
            Set csvRecord = .GetRecord ' fetch the next record; Nothing at the end of the file
            ' ...apply custom logic to csvRecord here...
        Loop While Not csvRecord Is Nothing
    End With
End Sub
Is there a way to sequentially fetch a set of records at a time instead of a single record? Currently, there is no built-in method to do that in a single instruction, as in the examples above, but with a few extra lines of code and the tools provided by the library, it is possible to achieve that goal. This is illustrated in the following example, where the CSV file is streamed:
Sub CSVimportChunks()
    Dim CSVint As CSVinterface
    Dim StreamReader As ECPTextStream
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .fieldsDelimiter = ","
        .recordsDelimiter = vbCrLf
    End With
    Set StreamReader = New ECPTextStream
    With StreamReader
        .endStreamOnLineBreak = True   ' make each chunk end on a record boundary
        .OpenStream "C:\Sample.csv"
        Do
            .ReadText                  ' read the next chunk into the buffer
            CSVint.ImportFromCSVString .bufferString, _
                                       CSVint.parseConfig, _
                                       1, "Revenue"
        Loop While Not .atEndOfStream
    End With
    Set CSVint = Nothing
    Set StreamReader = Nothing
End Sub
So far, we have outlined how to import records from a CSV file sequentially. The following example shows how to filter records, in an SQL-like way, according to whether they meet a user-defined criterion:
Sub QueryCSV(path As String, ByVal keyIndex As Long, queryFilters As Variant)
    Dim CSVint As CSVinterface
    Dim CSVrecords As ECPArrayList
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = path                   ' use the path received as an argument
        .fieldsDelimiter = ","
        .recordsDelimiter = vbCrLf
    End With
    If path <> vbNullString Then
        Set CSVrecords = CSVint.GetCSVsubset(path, _
                                             queryFilters, _
                                             keyIndex)    ' records matching the filters
        CSVint.DumpToSheet DataSource:=CSVrecords         ' write the result to a worksheet
        Set CSVint = Nothing
        Set CSVrecords = Nothing
    End If
End Sub
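A hypothetical call could look like the one below. Note that the exact shape expected for queryFilters is defined by the library's documentation; the pattern shown here, built from the comparison and logical operators listed earlier (=, <>, >=, <=, AND, OR), is only an illustration:
' Illustrative only: the real filter syntax may differ; check the library docs.
Sub QueryCSVusage()
    QueryCSV "C:\Sample.csv", 3, Array(">= 1000", "AND", "<> 5000")
End Sub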
In some situations, we may encounter a CSV file with a combination of vbCrLf, vbCr and vbLf as record delimiters. This can happen for many reasons, but the most common is appending data to an existing CSV file without checking the configuration of the previously stored information. These cases break the logic of many robust CSV parsers, including the demo of Papa Parse, with its 737K weekly downloads. The next example shows how a user can import CSV files with mixed line breaks as record delimiters, an option that relies on the turnStreamRecDelimiterToLF property of the parseConfig object to work with these special CSV files.
Sub ImportMixedLineEndCSV()
    Dim CSVint As CSVinterface
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Mixed Line Breaks.csv"
        .fieldsDelimiter = ","
        .recordsDelimiter = vbCrLf
        .turnStreamRecDelimiterToLF = True   ' normalize all record delimiters to LF
    End With
    With CSVint
        .ImportFromCSV .parseConfig
    End With
    Set CSVint = Nothing
End Sub
In all the above examples, an implicit assumption has been made: the user knows the configuration of the CSV file to be imported. So the question arises: what if the user does not know the configuration of the file? It is certainly possible, so how can the CSV interface help in these cases?
The tool includes a utility to guess the field delimiter, the record delimiter and the escape character. This can be done with code like the following:
Sub DelimitersGuessing()
    Dim CSVint As CSVinterface
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"
    End With
    With CSVint
        .GuessDelimiters .parseConfig   ' analyze the file and fill in the delimiters
    End With
End Sub
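Assuming GuessDelimiters fills the parseConfig object in place, as the call pattern above suggests, the guessed settings can then feed the import directly (a sketch, not an official usage pattern):
Sub GuessThenImport()
    Dim CSVint As CSVinterface
    Set CSVint = New CSVinterface
    With CSVint.parseConfig
        .path = "C:\Sample.csv"
    End With
    With CSVint
        .GuessDelimiters .parseConfig   ' fill in the guessed configuration
        .ImportFromCSV .parseConfig     ' then import using the guessed settings
    End With
End Sub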
Points of Interest
Guessing the delimiters, or determining the dialect of a CSV file, may seem like a simple problem, but it is definitely not. Even at the time of writing, there is open research on the subject. For example, G.J.J. van den Burg's research led to the implementation of a robust system called CleverCSV, which can determine the dialect of CSV files. The solution proposed by the researcher determines the dialect of a CSV by evaluating the patterns of the rows, in order to check their uniformity, and the number of fields that can be converted to the data types known to the system. Its Achilles' heel, however, shows up when it tries to determine the dialect of a file containing data that does not fit those expected types, as exposed in its issue #37 and issue #35.
# This CSV has caused CleverCSV issue #37
# The parser guesses the delimiter [:] instead of [,]
"{""fake"": ""json"", ""fake2"":""json2""}",13:31:38,06:00:04+01:00
"{""fake"": ""json"", ""fake2"":""json2""}",22:13:29,14:20:11+02:00
"{""fake"": ""json"", ""fake2"":""json2""}",04:37:27,22:04:28+03:00
"{""fake"": ""json"", ""fake2"":""json2""}",04:25:28,23:12:53+01:00
"{""fake"": ""json"", ""fake2"":""json2""}",21:04:15,08:23:58+02:00
"{""fake"": ""json"", ""fake2"":""json2""}",10:37:03,11:06:42+05:30
"{""fake"": ""json"", ""fake2"":""json2""}",10:17:24,23:38:47+06:00
"{""fake"": ""json"", ""fake2"":""json2""}",00:02:51,20:04:45-06:00
For its part, Papa Parse is unable to disambiguate the field delimiter when faced with CSV files in which the comma (',') is used as a decimal separator and the semicolon (';') as the field delimiter. Here are some CSVs that prevent Matt Holt's powerful piece of code from properly determining the delimiters:
Prüfung1;Prüfung2;Prüfung3
1,5;33,33;15,55
2,5;25,44;30,1
3,5;16,67;45,2
4,5;12;60,3

'Neroductions Group';£ 1,80;£ 9000,50
'Hatchworks Ltd.';£ 2,00;£ 100000,30

id;value
1;3,4,5
2;6,7,8
3;9,10,11
4;13,14,15
5;"15,16,17;also;that"
Does this mean that we can't determine the dialect or guess the delimiters of a CSV with complete certainty? The answer is that there is no foolproof way to disambiguate tables generated by arbitrary combinations of delimiters and escape characters. Nonetheless, the statistical scoring of fields, records and tables implemented in the CSV interface handles the files that caused problems for CleverCSV and Papa Parse without trouble.
In the case of the CSV interface, the disambiguation problem becomes insoluble when two dialects produce tables with the same congruence, a situation that can occur in CSV files without headers, such as the one shown below:
1|2;3|4;5
3;a|c;6|6
The CSV interface scoring method will produce exactly the same result for the dialect that uses the vertical bar as the field delimiter and for the dialect that uses the semicolon for that purpose; no machine or human disambiguation is possible for such cases.
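To see why, here is the same two-record sample parsed under both candidate dialects (fields shown between brackets):
Using "|":  [1]   [2;3]  [4;5]        Using ";":  [1|2]  [3|4]  [5]
            [3;a] [c;6]  [6]                      [3]    [a|c]  [6|6]
Both dialects yield a perfectly uniform 2x3 table, so any score based on row uniformity or field counts ends in a tie.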
History
- 11th July, 2021: Initial version