In this article, we will approach CSV files from the perspective of the analysis of their content, urging within the different situations that require a management of them, while proposing a solution coded in VBA for a variety of common situations in the day to day of a data analyst, without calls to external DLLs or APIs, without intermediary objects other than those of Microsoft's macro language.
Introduction
The CSV format is the king when it comes to shared data in plain text format, to such an extent that it is the format used by official institutions in countries such as the United Kingdom, as well as by the Library of Congress in the United States, whose page states that "the Library of Congress Recommended Formats Statement (RFS) includes CSV as a preferred format for datasets". This format is widely used in the industry, so that most applications that work with data (including spreadsheets and the most relevant database engines) support import and export of CSV files or some variant of it, furthermore, the Python and Ruby programming languages have a library called "csv
" that allows you to work directly with CSV files.
The idea of all this is to achieve that, as Python and Ruby do, VBA can work with CSV files directly without relying on objects outside the programming language or external libraries (COM, DLLs). However, most VBA solutions for working with CSV files focus on the flat reading of these files, acting as intermediaries used exclusively to read the information contained in the files. In other words, most of the tools focus on reading the data from the CSV files in memory, lacking the ability to analyze the information obtained and then serve the results to technologies such as Power Query or DaxStudio for further analysis. These two tools are highly appreciated and valued by data analysts, as they allow them to make sense of a data set with a few clicks, in the case of Power Query, or with a few lines of code, in the case of DaxStudio, from an attractive and intuitive graphical interface.
The Problem
Currently, there are no tools that allow manipulating variants of CSV files, like DSV or CSV with distinct dialect that those specified at RFC-4180, from VBA, so the programmers of this programming language face difficulties when working with these files since the flow of information must be interrupted for users to use external tools from the precise moment in which the slightest modification of the data read is required. This is a situation that puts developers in a difficult situation, since many of the basic operations such as sorting, rearranging fields are extremely necessary.
The Solution
Feature | CSV Interface | CSVKit | XSV | Miller |
Manage fields/records: insert, delete, shift, merge, split, rearranging... | YES | YES* | YES* | YES* |
Dedupe records | YES | NO* | NO* | NO* |
Data sorting | YES | YES | YES | YES |
Data grouping | YES | YES | NO | YES |
Descriptive statistics | NO | YES | YES | YES |
Data select/filtering | YES | YES | YES | YES |
Calculated fields | YES | NO | NO | YES |
Advanced CSV Join | YES | NO** | NO** | NO** |
Split-up CSV files | YES | YES | NO | YES |
Sort CSV files on-disk | YES | NO | NO | NO |
Language | VBA | Python | Rust | Go |
* There is no direct method to do most of the fields/records editing tasks, instead there are commands like SQL SELECT to work the fields and records at load time.
** Join operations are performed using simple or no conditions. It does not support complex expressions.
All the tools analyzed have extensive data management capabilities, almost all of them offer advanced statistical analysis, so it should be noted that they all have their weaknesses and strengths. What is a palpable fact is that CSV Interface offers the most common features in the CSV file management environment and is the only similar tool available for VBA.
Solving CSV Problems with VBA
In all the examples shown, the asterisk (*
) represents a CSVArrayList
object, which is a component of the CSV Interface
library. Let's start.
In data analysis, there is a very common task when working with files and it is nothing more and nothing less than the removal of duplicates. The goal is that the records/rows contain unique information in the specified fields. Let's see how we can achieve this goal.
*.Dedupe("1,5,6")
*.Dedupe("5-8,11")
Another very frequent task is filtering, in which the elimination of records that do not meet a specific criterion is sought. The filtering method, from VBA, accepts four parameters, two of which are mandatory. The first parameter defines the filtering criterion, which is evaluated to decide if a specific record passes the filter (condition that coincides with the return of a True
value). The second parameter indicates the position of the first record to be filtered. The third parameter, of optional character, decides whether records that comply with the filtering rule should be excluded, when True
is returned. The last parameter indicates whether the data has a header field.
*.Filter("Region = 'Asia' & Units Sold > 20 & Units Sold <= 50 & _
Ship Date $ '10/*/2014'", 2)
*.Filter("f1 = 'Asia' & f9 > 20 & f9 <= 50 & f8
$ '10/*/2014'", 2)
As you can see, the filtering criteria can be a complex expression, and can also use more than 50 VBA language functions (text, numerical, financial, date and time). It is important to note that the literal text strings contained in the filter pattern are delimited by apostrophes ('
).
Among the tasks of data analysis is the editing of records, usually requiring the insertion of calculated fields from other existing fields in the data to be analyzed. In these cases, it is also possible, from VBA, to use expressions to obtain values and insert them in each of the records.
*.InsertField 12, "Taxes" , Formula:="FORMAT(Total Revenue * _
Percent(18);'Currency')")
The above example will define the new field as 18% of Total Revenue
field, the result will be formated as VBA Currency
.
On many occasions, it is required to return tables derived from the analysis of data scattered in two or more files, this is where the join functions (left, right and inner) play an important role since they allow to join data using matching fields in the different records, when a optional given condition/predicate is fulfilled.
*.LeftJoin Table1, Table2, "{1,Country};{Total_Revenue}",
"Order_ID;Order_ID", "t2.Total_Revenue>3000000 &
t1.Region='Central America and the Caribbean'"
The above method performs a Left join returning the 1st
and Country
fields of the left table (Table1
) and the Total_Revenue
field of the right table (Table2
) , joined in the Order_ID
field of both tables, of those records that satisfy that the total revenue exceeds $3,000,000
with Central America and the Caribbean
as the Region
. RightJoin
and InnerJoin
commands can also be invoked. Again, the condition/predicate used to select the records can be very complex.
Data sorting is also one of the essential tasks when performing data analysis, so tools that allow you to perform this task are highly valued. Using CSV Interface, it is possible to sort CSV files. Having said that, it is worth mentioning that in the VBA application ecosystem, at the time of writing this article, there were few alternatives that allowed advanced CSV data sorting.
ReDim SortKeys(0 To 2): SortKeys(0) = -1: SortKeys(1) = 5: SortKeys(2) = -11
*.Sort SortingKeys:=SortKeys, SortAlgorithm:=SA_Quicksort
The above code fragment will sort the data in descending order in column 1, then sort in ascending order in column 5 and sort in descending order in column 11. This multilevel sorting is "stable".
Closing Thoughts
Although VBA is a limited programming language, it puts amazing computing power in the hands of developers. CSV Interface is the example that the "macro" language of office applications has quite strong muscles.
If you liked the article, do not hesitate to share it, you can also give a star on GitHub by visiting the following link: https://github.com/ws-garcia/VBA-CSV-interface
History
- 5th January, 2023: First release