Although Office has all the necessary tools to perform a number of tasks, amateur developers and experienced programmers are tempted to automate each of the processes that are executed on a daily basis. CSV file queries are no exception to this rule, and automating them is a problem that involves many solutions that focus on spreadsheets and a few others that focus on opening a path between VBA and CSV files without the need for any other intermediary.
Introductory Words
The capability and flexibility offered by the application ecosystem integrated in Microsoft Office is known by almost all its users, also by the absolute majority of amateur and full-time programmers. This is not surprising, the Office package includes VBA, which is a subset of functionality that was present in the popular Visual Basic programming language until its version 6 (VB6).
One of the most exciting features, which in turn can be considered as the most powerful tool in the Office package, is the ability to work with external data sources with exceptional fluidity. However, do you know the most efficient way to query CSV files and import the result directly into RAM?
Background
The number of tools that allow working with CSV files is growing every day, most of them adhere to the specifications provided in RFC-4180. Others add new features, making these applications more flexible.
In particular, most of the tools designed for the purpose specified above load the entire contents of the CSV into memory, from where it is parsed and loaded into an array or a collection.
The previous solution offers the advantage of being fast, but we must always keep in mind that memory resources are limited when working in VBA. This is why it is imperative to perform queries on the files and load the most relevant information contained in them into memory.
Until the time of writing this article, there was no tool capable of emulating, from VBA and without using objects inherent to a specific application (Excel sheets or Access tables), the complicated queries that Power Query allows to execute on CSV files without knowing the SQL language.
Study Case
We have a dataset, CSV, containing the sales records of the last 10 years (100k records) of a company and we want to perform a query on the file and select the records that meet the following requirements:
- Purchases made from Central America and the Caribbean
- Online purchases
- Items with profit margin above the median (789165.52)
- Query results sorted in descending order
The problem will be addressed using VBA CSV interface, a set of open source VBA class modules.
Solution
Achieving the task can be tricky when using the majority of the VBA CSV parsers, but with the selected one, the code look so simple. The first thing we need to do is convert the data fields to the proper VBA data type.
.DefineTypingTemplate TypeConversion.ToDate, _
TypeConversion.ToLong, _
TypeConversion.ToDate, _
TypeConversion.ToLong, _
TypeConversion.ToDouble, _
TypeConversion.ToDouble, _
TypeConversion.ToDouble TypeConversion.ToDouble
.DefineTypingTemplateLinks 6, _
7, _
8, _
9, _
10, _
11, _
12
The first method of the above code snippet creates a template for type conversion. The second command links the template with the specified fields indexes. As a result, the snippet will convert the 6th and 8th field to Date
, the 7th and 9th field to Long
and the 10th, 11th and 12th fields to Double
.
For some reason, the Power Query editor throws an error when trying to change fields type from Text
to Date
as shown in this image.
Now, we can implement some logic to choose how our CSV data is retrieved.
Public Sub Excel_VBA_Query_Over_CSV()
Dim path As String
Dim conf As parserConfig
Dim CSVrecord As ECPArrayList
Dim CSVrecords As ECPArrayList
Set CSVint = New CSVinterface
Set conf = CSVint.parseConfig
Set CSVrecords = New ECPArrayList
path = "C:\Demo_100k_records.csv"
With conf
.recordsDelimiter = vbCr
.path = path
.dynamicTyping = True
.headers = True
.DefineTypingTemplate TypeConversion.ToDate, _
TypeConversion.ToLong, _
TypeConversion.ToDate, _
TypeConversion.ToLong, _
TypeConversion.ToDouble, _
TypeConversion.ToDouble, _
TypeConversion.ToDouble
.DefineTypingTemplateLinks 6, _
7, _
8, _
9, _
10, _
11, _
12
End With
CSVint.OpenSeqReader conf
Set CSVrecord = CSVint.GetRecord
If conf.headers Then
If Not CSVrecord Is Nothing Then
CSVrecords.Add CSVrecord(0)
End If
End If
Do While Not CSVrecord Is Nothing
If CSVrecord(0)(0) = "Central America and the Caribbean" Then
If CSVrecord(0)(3) = "Online" Then
If CSVrecord(0)(11) > 789165.52 Then
CSVrecords.Add CSVrecord(0)
End If
End If
End If
Set CSVrecord = CSVint.GetRecord
Loop
CSVrecords.Sort 2, SortColumn:=6, Descending:=True
CSVint.DumpToSheet DataSource:=CSVrecords
Set CSVint = Nothing
Set CSVrecords = Nothing
End Sub
If the host application is MS Access, the problem is solved as shown below:
Public Sub Access_VBA_Query_Over_CSV()
Dim path As String
Dim conf As parserConfig
Dim dBase As DAO.Database
Dim CSVrecord As ECPArrayList
Dim CSVrecords As ECPArrayList
Set CSVint = New CSVinterface
Set conf = CSVint.parseConfig
Set CSVrecords = New ECPArrayList
With conf
.recordsDelimiter = vbCr
path = "C:\Demo_100k_records.csv"
.path = path
.dynamicTyping = True
.headers = True
.DefineTypingTemplate TypeConversion.ToDate, _
TypeConversion.ToLong, _
TypeConversion.ToDate, _
TypeConversion.ToLong, _
TypeConversion.ToDouble, _
TypeConversion.ToDouble, _
TypeConversion.ToDouble
.DefineTypingTemplateLinks 6, _
7, _
8, _
9, _
10, _
11, _
12
End With
CSVint.OpenSeqReader conf
Set CSVrecord = CSVint.GetRecord
If conf.headers Then
If Not CSVrecord Is Nothing Then
CSVrecords.Add CSVrecord(0)
End If
End If
Do While Not CSVrecord Is Nothing
If CSVrecord(0)(0) = "Central America and the Caribbean" Then
If CSVrecord(0)(3) = "Online" Then
If CSVrecord(0)(11) > 789165.52 Then
CSVrecords.Add CSVrecord(0)
End If
End If
End If
Set CSVrecord = CSVint.GetRecord
Loop
Set CSVint.items = CSVrecords
Set dBase = CurrentDb
CSVint.Sort(SortColumn:=6, Descending:=True).DumpToAccessTable dBase, "Demo_100k"
Set CSVint = Nothing
Set dBase = Nothing
Set CSVrecords = Nothing
End Sub
The Access_VBA_Query_Over_CSV
procedure will create a table named Demo_100k
in the current database. The created table has no CSV field indexes.
Points of Interest
Although with VBA CSV interface we can sequentially access the records stored in a CSV file, the cost paid for doing so is the reduced performance of our query. However, it is preferable to load only the required portion of information into memory rather than store all the records in RAM.
History
- 17th March, 2021: Initial version