This article compares different approaches to solving the problems of loading large amounts of line-based, comma-delimited text from a file into data structures, and writing this data back out to another file. The performance profiles of the approaches are compared and conclusions drawn.
Introduction
I was interested in comparing the performance of using different programming languages and different approaches to solving a basic programming problem, so I developed applications in C#, C++, and C, and drew conclusions for making decisions about which languages and approaches to use in different circumstances.
The Benchmark
The problem I came up with is for a program to load an input CSV file into an array of data structures, and use the data structures to write the data back out to an output CSV file. Your basic input / output data processing problem, without any data processing. This is a file I/O and data structure serialization benchmark.
I chose to use a Unicode-encoded text file for the CSV data, as C# is a Unicode language and C/C++ can work with Unicode data pretty well at this point. Note that .NET's Encoding.Unicode means UTF-16LE, and StreamWriter emits a byte order mark (the bytes FF FE) at the start of the file, a detail the C and C++ readers below have to account for. I chose to have seven data fields per line of CSV text, typical demographic stuff: first name, last name, address line 1, address line 2, city, state, and zip code. To keep things simple, each field is limited to 127 characters of Unicode text.
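For illustration only (real fields run up to 127 characters), a generated line is just seven random uppercase strings joined by commas, something like:
QKJH,ZB,AAXMPQRD,JD,KKLF,MN,PQRSTUV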
The CSV generation program (the "gen
" project in the attached code) outputs 100,000 lines of this random CSV data to a file on the desktop.
using System.Text;
const int FIELD_LEN = 128;
const int FIELD_CHAR_LEN = FIELD_LEN - 1;
const int RECORD_COUNT = 100 * 1000;
var rnd = new Random(0);
string rnd_str()
{
StringBuilder sb = new StringBuilder(FIELD_CHAR_LEN);
int rnd_len = rnd.Next(1, FIELD_CHAR_LEN + 1); // Next's upper bound is exclusive
for (int r = 1; r <= rnd_len; ++r)
sb.Append((char)('A' + rnd.Next(0, 26))); // a random uppercase letter, A-Z
return sb.ToString();
}
string output_file_path =
Path.Combine
(
Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
"recordio.csv"
);
if (output_file_path == null)
throw new NullReferenceException("output_file_path");
using (var output = new StreamWriter(output_file_path, false, Encoding.Unicode))
{
for (int r = 1; r <= RECORD_COUNT; ++r)
output.Write($"{rnd_str()},{rnd_str()},{rnd_str()},{rnd_str()},
{rnd_str()},{rnd_str()},{rnd_str()}\n");
}
Let's Skip to the Punchline: What Are the Results?
Here are the timings of the loading and writing phases of the different programs. The programs run through the load / write cycle four times; I took the best load and write times from all the runs for each program, the best of the best.
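A note on the timings: the C and C++ programs use a little timer helper that the listings below reference but don't define; it lives in the attached code. Here is a minimal sketch of the shape its usage implies, assuming report() prints a label with the elapsed milliseconds and then restarts the clock:
#include <chrono>
#include <stdio.h>
// Hypothetical reconstruction for illustration; the real helper is in the attached code.
class timer
{
public:
timer() : start(std::chrono::steady_clock::now()) {}
void report(const char* label)
{
auto now = std::chrono::steady_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(now - start).count();
printf("%s took %d ms\n", label, (int)ms);
start = now; // restart the clock for the next phase
}
private:
std::chrono::steady_clock::time_point start;
};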
Conclusions and Points of Interest
The C# programs, loops and batch, are clean and easy to read and have good performance. The loops program uses StreamReader / StreamWriter and is intuitive and easy to develop and maintain. The batch program uses the File class functions ReadAllLines / WriteAllLines; it is much faster than the C# loops program for reading, LINQ and all, and slower for writing. Given this, you'd use ReadAllLines / LINQ for loading and StreamWriter for writing.
The big news is that there's something very wrong with Approach 3: C++ loops (class). It comes down to the std::getline call for loading, and the stream output for writing; the other code costs little. I'm interested in folks reproducing these numbers and reporting how to solve these performance problems.
The C batches program handily outruns the others for loading data because it uses fixed-size strings packed together in structs stored sequentially in one array, so the data locality is excellent. We're reading 90 MB of CSV in around 100 ms. Wow! For some reason, the C program is slower than the others for writing the data to the output file; the code looks clean, so I'm not sure why.
Inspired by the C program, the "C++ batches" program does not come close to C at reading the data, but bests them all at writing the data out. You would choose C for reading and C++ for writing.
Approach 6, recordio - rhymes with rodeo - the hybrid of the C and C++ batch approaches in a reusable package, yields 108 / 149, besting C#'s bests of 223 / 178. The difference in write performance is not significant, but the 2X speedup for load performance cannot be ignored. Fixed-length strings packed into a single vector of structs can't be beaten by any storage of C++ wstrings or C# strings. The code is clean, both in the implementation and in the test driver that illustrates use of the reusable class; have at it.
My overall advice would be to do this sort of thing in C/C++, with structs of fixed-width fields, much as you might imagine rows being stored in a database (sketched below), and take advantage of recordio or something similar to get great file and string I/O.
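To make the layout argument concrete, here is a sketch of the two storage shapes being compared (the type names are mine, for illustration):
#include <string>
// What "structs with fixed-width fields" buys you: all records sit contiguously
// in one allocation, record i lives at a fixed offset, and a sequential pass is
// cache-friendly.
struct fixed_record { wchar_t fields[7][128]; };
// By contrast, each wstring owns a separate heap block, so a sequential pass
// chases pointers across the heap.
struct pointer_record { std::wstring fields[7]; };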
If you're in C# code, use File.ReadAllLines and LINQ to load the data, and StreamWriter to write it back out, and get respectable performance with great productivity and security.
Here's more information about the different approaches.
Approach 1: C# with Loops
In the attached code, this is the "net" project. This might be the most intuitive of the languages and approaches. You have StreamReader slurp up the data, and StreamWriter write it all out:
using System.Diagnostics;
using System.Text;
var sw = Stopwatch.StartNew();
string input_file_path =
Path.Combine
(
Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
"recordio.csv"
);
if (input_file_path == null)
throw new NullReferenceException("input_file_path");
for (int iter = 1; iter <= 4; ++iter)
{
var nets = new List<info>();
using (var input = new StreamReader(input_file_path, Encoding.Unicode))
{
while (true)
{
string? line = input.ReadLine();
if (line == null)
break;
else
nets.Add(new info(line.Split(',')));
}
}
Console.WriteLine($".NET load took {sw.ElapsedMilliseconds} ms");
sw.Restart();
using (var output = new StreamWriter("output.csv", false, Encoding.Unicode))
{
foreach (var cur in nets)
output.Write($"{cur.firstname},{cur.lastname},
{cur.address1},{cur.address2},{cur.city},{cur.state},{cur.zipcode}\n");
}
Console.WriteLine($".NET write took {sw.ElapsedMilliseconds} ms");
sw.Restart();
}
class info
{
public info(string[] parts)
{
firstname = parts[0];
lastname = parts[1];
address1 = parts[2];
address2 = parts[3];
city = parts[4];
state = parts[5];
zipcode = parts[6];
}
public string firstname;
public string lastname;
public string address1;
public string address2;
public string city;
public string state;
public string zipcode;
}
Approach 2: C# in Bulk
In the attached source, this is the "net2" project. You might say to yourself, "Loops are tedious. I can use the File class functions like ReadAllLines to do things en masse. In .NET I trust!" It's certainly less code...
using System.Diagnostics;
using System.Text;
var sw = Stopwatch.StartNew();
string input_file_path =
Path.Combine
(
Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
"recordio.csv"
);
if (input_file_path == null)
throw new NullReferenceException("input_file_path");
for (int iter = 1; iter <= 4; ++iter)
{
var nets =
File.ReadAllLines(input_file_path, Encoding.Unicode)
.Select(line => new info(line.Split(',')))
.ToList(); // materialize here so parsing is charged to the load phase, not lazily to the write phase
Console.WriteLine($".NET 2 load took {sw.ElapsedMilliseconds} ms");
sw.Restart();
string[] strs = new string[nets.Count];
int idx = 0;
foreach (var cur in nets)
strs[idx++] = $"{cur.firstname},{cur.lastname},{cur.address1},{cur.address2},{cur.city},{cur.state},{cur.zipcode}"; // no trailing \n: WriteAllLines adds the line terminators
File.WriteAllLines("output.csv", strs, Encoding.Unicode);
Console.WriteLine($".NET 2 write took {sw.ElapsedMilliseconds} ms");
sw.Restart();
}
Approach 3: C++ Loops
C++ has come a long way when it comes to Unicode file I/O and streams. It is now easy to write C++ code that rivals C#'s Loopy Approach 1 for readability and simplicity:
#include <codecvt>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
struct info
{
info(const std::vector<std::wstring>& parts)
: firstname(parts[0])
, lastname(parts[1])
, address1(parts[2])
, address2(parts[3])
, city(parts[4])
, state(parts[5])
, zipcode(parts[6])
{
}
std::wstring firstname;
std::wstring lastname;
std::wstring address1;
std::wstring address2;
std::wstring city;
std::wstring state;
std::wstring zipcode;
};
std::vector<std::wstring> split(const std::wstring& str, const wchar_t seperator)
{
std::vector<std::wstring> retVal;
retVal.reserve(FIELD_COUNT);
std::wstring acc;
acc.reserve(FIELD_CHAR_LEN);
for (wchar_t c : str)
{
if (c == separator)
{
retVal.push_back(acc);
acc.clear();
}
else
acc.push_back(c);
}
if (!acc.empty())
retVal.push_back(acc);
return retVal;
}
int main(int argc, char* argv[])
{
timer t;
for (int iter = 1; iter <= 4; ++iter)
{
std::vector<std::wstring> lines;
{
std::wifstream input(argv[1], std::ios::binary);
input.imbue(std::locale(input.getloc(), new std::codecvt_utf16<wchar_t,
0x10ffff, std::codecvt_mode(std::consume_header | std::little_endian)>));
if (!input)
{
std::cout << "Opening input file failed\n";
return 1;
}
std::wstring line;
while (std::getline(input, line))
lines.push_back(line);
}
std::vector<info> infos;
infos.reserve(lines.size());
for (const auto& line : lines)
infos.emplace_back(split(line, L','));
t.report("class load ");
{
std::wofstream output("output.csv", std::ios::binary);
output.imbue(std::locale(output.getloc(), new std::codecvt_utf16<wchar_t,
0x10ffff, std::codecvt_mode(std::generate_header | std::little_endian)>));
if (!output)
{
std::cout << "Opening output file failed\n";
return 1;
}
for (const auto& record : infos)
{
output
<< record.firstname << ','
<< record.lastname << ','
<< record.address1 << ','
<< record.address2 << ','
<< record.city << ','
<< record.state << ','
<< record.zipcode << '\n';
}
}
t.report("class write");
}
}
For the file load step, a quick timing check shows that all the time is spent on the std::getline() call. And for the file write step, all the time is spent in the output loop (where else would it be?). This mystery is left as an exercise for the reader. What is wrong with this simple code?
Approach 4: C Batches
If we are willing to load the entire text into memory, then we can play tricks with slicing and dicing the data in place, using fixed-length string buffers and character-level string operations that take advantage of data locality and unsafe operations. What fun! (This code is C in style, but it is compiled as C++, which is why nullptr, references, and std::max appear.)
#include <stdio.h>
#include <stdlib.h>
const size_t FIELD_LEN = 128;
const size_t FIELD_CHAR_LEN = FIELD_LEN - 1;
const size_t FIELD_COUNT = 7;
const size_t RECORD_LEN = std::max(FIELD_COUNT * FIELD_LEN + 1, size_t(1024));
struct info
{
wchar_t firstname[FIELD_LEN];
wchar_t lastname[FIELD_LEN];
wchar_t address1[FIELD_LEN];
wchar_t address2[FIELD_LEN];
wchar_t city[FIELD_LEN];
wchar_t state[FIELD_LEN];
wchar_t zipcode[FIELD_LEN];
};
void read_str(const wchar_t*& input, wchar_t* output)
{
size_t copied = 0;
while (*input && *input != ',')
*output++ = *input++;
*output = '\0';
if (*input == ',')
++input;
}
void set_record(info& record, const wchar_t* buffer)
{
read_str(buffer, record.firstname);
read_str(buffer, record.lastname);
read_str(buffer, record.address1);
read_str(buffer, record.address2);
read_str(buffer, record.city);
read_str(buffer, record.state);
read_str(buffer, record.zipcode);
}
wchar_t* add_to_buffer(const wchar_t* input, wchar_t* output, wchar_t separator)
{
while (*input)
*output++ = *input++;
*output++ = separator;
return output;
}
int64_t output_record(const info& record, wchar_t* buffer)
{
const wchar_t* original = buffer;
buffer = add_to_buffer(record.firstname, buffer, ',');
buffer = add_to_buffer(record.lastname, buffer, ',');
buffer = add_to_buffer(record.address1, buffer, ',');
buffer = add_to_buffer(record.address2, buffer, ',');
buffer = add_to_buffer(record.city, buffer, ',');
buffer = add_to_buffer(record.state, buffer, ',');
buffer = add_to_buffer(record.zipcode, buffer, '\n');
return buffer - original;
}
int main(int argc, char* argv[])
{
timer t;
for (int iter = 1; iter <= 4; ++iter)
{
FILE* input_file = nullptr;
if (fopen_s(&input_file, argv[1], "rb") != 0)
{
printf("Opening input file failed\n");
return 1;
}
fseek(input_file, 0, SEEK_END);
int file_len = ftell(input_file);
fseek(input_file, 0, SEEK_SET);
wchar_t* file_contents = (wchar_t*)malloc(file_len + 2);
if (file_contents == nullptr)
{
printf("Allocating input buffer failed\n");
return 1;
}
if (fread(file_contents, file_len, 1, input_file) != 1)
{
printf("Reading input file failed\n");
return 1;
}
size_t char_len = file_len / 2;
file_contents[char_len] = '\0';
fclose(input_file);
input_file = nullptr;
// Skip the UTF-16 byte order mark the C# generator writes, so it doesn't
// end up as data in the first field of the first record.
wchar_t* text = file_contents;
if (char_len > 0 && *text == 0xFEFF)
{
++text;
--char_len;
}
size_t record_count = 0;
for (size_t idx = 0; idx < char_len; ++idx)
{
if (text[idx] == '\n')
{
++record_count;
text[idx] = '\0';
}
}
info* records = (info*)malloc(record_count * sizeof(info));
if (records == nullptr)
{
printf("Allocating records list failed\n");
return 1;
}
wchar_t* cur_str = text;
wchar_t* end_str = cur_str + char_len;
size_t record_idx = 0;
while (cur_str < end_str)
{
set_record(records[record_idx++], cur_str);
cur_str += wcslen(cur_str) + 1;
}
if (record_idx != record_count)
{
printf("Record counts differ: idx: %d - count: %d\n",
(int)record_idx, (int)record_count);
return 1;
}
t.report("struct load ");
wchar_t* file_output = (wchar_t*)malloc(record_count * RECORD_LEN);
if (file_output == nullptr)
{
printf("Allocating file output buffer failed\n");
return 1;
}
size_t output_len = 0;
for (size_t r = 0; r < record_count; ++r)
{
int new_output = output_record(records[r], file_output + output_len);
if (new_output < 0)
{
printf("Writing to output buffer failed\n");
return 1;
}
output_len += new_output;
}
FILE* output_file = nullptr;
if (fopen_s(&output_file, "output.csv", "wb") != 0)
{
printf("Opening output file failed\n");
return 1;
}
if (fwrite(file_output, output_len * 2, 1, output_file) != 1)
{
printf("Writing output file failed\n");
return 1;
}
fclose(output_file);
output_file = nullptr;
t.report("struct write");
free(file_contents);
file_contents = nullptr;
free(records);
records = nullptr;
}
return 0;
}
Approach 5: C++ Batches
Perhaps that pile of C leaves a bad taste in your mouth. Could we apply the same batch approach in C++?
#include <codecvt>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
struct info
{
info(const std::vector<wchar_t*>& parts)
: firstname(parts[0])
, lastname(parts[1])
, address1(parts[2])
, address2(parts[3])
, city(parts[4])
, state(parts[5])
, zipcode(parts[6])
{
}
std::wstring firstname;
std::wstring lastname;
std::wstring address1;
std::wstring address2;
std::wstring city;
std::wstring state;
std::wstring zipcode;
};
void parse_parts(wchar_t* buffer, wchar_t separator, std::vector<wchar_t*>& ret_val)
{
ret_val.clear();
ret_val.push_back(buffer);
while (*buffer)
{
if (*buffer == separator)
{
*buffer = '\0';
if (*(buffer + 1))
ret_val.push_back(buffer + 1);
}
++buffer;
}
}
int main(int argc, char* argv[])
{
timer t;
for (int iter = 1; iter <= 4; ++iter)
{
std::vector<char> file_contents;
{
std::ifstream file(argv[1], std::ios::binary | std::ios::ate);
std::streamsize size = file.tellg();
file.seekg(0, std::ios::beg);
file_contents.resize(size + 2); if (!file.read(file_contents.data(), size))
{
std::cout << "Reading file failed\n";
return 1;
}
file_contents.push_back(0);
file_contents.push_back(0);
}
std::vector<wchar_t*> line_pointers;
parse_parts(reinterpret_cast<wchar_t*>
(file_contents.data()), '\n', line_pointers);
std::vector<info> infos;
infos.reserve(line_pointers.size());
std::vector<wchar_t*> line_parts;
for (wchar_t* line : line_pointers)
{
parse_parts(line, ',', line_parts);
infos.emplace_back(line_parts);
}
t.report("C++ 2 load");
std::wstring output_str;
output_str.reserve(file_contents.size() / 2);
for (const auto& record : infos)
{
output_str += record.firstname;
output_str += ',';
output_str += record.lastname;
output_str += ',';
output_str += record.address1;
output_str += ',';
output_str += record.address2;
output_str += ',';
output_str += record.city;
output_str += ',';
output_str += record.state;
output_str += ',';
output_str += record.zipcode;
output_str += '\n';
}
std::ofstream output_file("output.csv", std::ios::binary);
if (!output_file)
{
std::cout << "Opening output file failed\n";
return 1;
}
output_file.write(reinterpret_cast<const char*>
(output_str.c_str()), output_str.size() * 2);
output_file.close();
t.report("C++ 2 write");
}
}
Approach 6: C/C++ Hybrid
Let's take the best from approaches 4 and 5, and make a reusable class that has a prayer of real-world usability.
First, the reusable class. It is templated on the record type so the record class can define its field and record length limits:
#include <fstream>
#include <memory>
#include <stdexcept>
#include <stdio.h>
#include <string>
#include <vector>
#include <wchar.h>
namespace recordio
{
template<class record>
class lineio
{
public:
static void load(const char* inputFilePath, std::vector<record>& records)
{
records.clear();
FILE* input_file = nullptr;
if (fopen_s(&input_file, inputFilePath, "rb") != 0)
{
throw std::runtime_error("Opening input file failed");
}
fseek(input_file, 0, SEEK_END);
int file_len = ftell(input_file);
fseek(input_file, 0, SEEK_SET);
size_t char_len = file_len / 2;
std::unique_ptr<wchar_t[]> file_contents(new wchar_t[file_len / 2 + 1]);
if (fread(reinterpret_cast<void*>
(file_contents.get()), file_len, 1, input_file) != 1)
{
throw std::runtime_error("Reading input file failed\n");
}
file_contents[char_len] = '\0';
fclose(input_file);
input_file = nullptr;
// Skip the UTF-16 byte order mark, if present, so it doesn't become data.
wchar_t* text = file_contents.get();
if (char_len > 0 && *text == 0xFEFF)
{
++text;
--char_len;
}
size_t record_count = 0;
for (size_t idx = 0; idx < char_len; ++idx)
{
if (text[idx] == '\n')
{
++record_count;
text[idx] = '\0';
}
}
records.reserve(record_count);
wchar_t* cur_str = text;
wchar_t* end_str = cur_str + char_len;
while (cur_str < end_str)
{
records.emplace_back(cur_str);
cur_str += wcslen(cur_str) + 1;
}
}
static void write(const char* outputFilePath, const std::vector<record>& records)
{
std::wstring output_str;
output_str.reserve(record::max_record_length * records.size());
for (const auto& cur : records)
{
cur.get_record_str(output_str);
output_str += '\n';
}
std::ofstream output_file(outputFilePath, std::ios::binary);
if (!output_file)
{
throw std::runtime_error("Opening output file failed");
}
output_file.write(reinterpret_cast<const char*>
(output_str.c_str()), output_str.size() * 2);
}
The "derived" record
types have a little work to do, but it's pretty minimal:
struct info
{
info() {}
info(const wchar_t* str)
{
recordio::lineio<info>::read_str(str, firstname);
recordio::lineio<info>::read_str(str, lastname);
recordio::lineio<info>::read_str(str, address1);
recordio::lineio<info>::read_str(str, address2);
recordio::lineio<info>::read_str(str, city);
recordio::lineio<info>::read_str(str, state);
recordio::lineio<info>::read_str(str, zipcode);
}
void get_record_str(std::wstring& str) const
{
str += firstname;
str += ',';
str += lastname;
str += ',';
str += address1;
str += ',';
str += address2;
str += ',';
str += city;
str += ',';
str += state;
str += ',';
str += zipcode;
}
const static size_t max_field_length = FIELD_LEN;
const static size_t max_record_length = RECORD_LEN;
wchar_t firstname[FIELD_LEN];
wchar_t lastname[FIELD_LEN];
wchar_t address1[FIELD_LEN];
wchar_t address2[FIELD_LEN];
wchar_t city[FIELD_LEN];
wchar_t state[FIELD_LEN];
wchar_t zipcode[FIELD_LEN];
};
int main(int argc, char* argv[])
{
if (argc != 2)
{
printf("Usage: recordio <input CSV file path>\n");
return 0;
}
timer t;
for (int iter = 1; iter <= 4; ++iter)
{
std::vector<info> records;
recordio::lineio<info>::load(argv[1], records);
t.report("recordio load ");
recordio::lineio<info>::write("output.csv", records);
t.report("recordio write ");
}
printf("All done.\n");
return 0;
}
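To show the reuse story, here is a sketch of a second record type, a hypothetical contact record that is not in the attached code, plugging into the same recordio::lineio class:
struct contact
{
contact() {}
contact(const wchar_t* str)
{
recordio::lineio<contact>::read_str(str, name);
recordio::lineio<contact>::read_str(str, phone);
}
void get_record_str(std::wstring& str) const
{
str += name;
str += ',';
str += phone;
}
const static size_t max_field_length = 128;
const static size_t max_record_length = 2 * 128 + 2; // two fields, separator, newline, slack
wchar_t name[128];
wchar_t phone[128];
};
Loading and writing then mirror the info driver: recordio::lineio<contact>::load(path, contacts) and recordio::lineio<contact>::write(path, contacts).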
And that's it! Looking forward to the comments!
History
- 25th September, 2022: Initial version