This article describes the implementation of a Rust program that performs the file and text I/O benchmark from my earlier article and compares its performance with the other implementations.
Introduction
This article describes a Rust program and compares its performance with similar programs written in C#, C, and C++.
I'm embarrassed to admit this, but this is my first Rust program. I've been making my way through the O'Reilly book, Programming Rust, 2nd Edition. Good book. Quite long. Once you get past the initial "what in the...!?!?" of Rust, it comes down to "I know how to do this in C... what are the calls in Rust?" So I just went for it. I'm sure I have some major blind spots; I think we're all still learning Rust...
I know how to write this file and text I/O benchmark: here's the article. To sum up, you've got 90 MB of Unicode CSV in a file, read it into objects (all 100K of them), then write the objects back out to a Unicode CSV file.
Let's see some Rust!
But wait, let's get those performance numbers! The Rust program on my system takes 117 ms to do the read, and 195 ms to do the write. So it's right there with the C-like program on the read, which takes only 107 ms, and not too great on the write, where C-like took 147 ms and C++ took just 136 ms. Read on for why the Rust write code might be slow.
Rust Program Source
All the source is in one file, main.rs.
Source Header
The program starts with the usual imports:
use stopwatch::Stopwatch;
use std::fs::File;
use std::io::Read;
use std::io::Write;
Stopwatch comes from a third-party crate, and is enough like .NET's Stopwatch to fit the bill here.
"use std::fs::File" makes it possible to refer to the File type by its short name. You could say "use std::fs::*" to bring in the whole module.
std::io::Read and std::io::Write aren't actually types, they're traits, and you need to bring them into scope if you want to call their methods, like read_to_end() and write_all(), to do any reading or writing.
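To see why the Read and Write imports matter, here is a minimal sketch. The helper names slurp and dump are made up for this illustration and are not part of the benchmark; the point is only that read_to_end() and write_all() are trait methods, so they work on anything that implements the trait, and they only resolve when the trait is in scope.

```rust
use std::io::{Read, Write};

// Hypothetical helper: read every byte from any reader into a Vec<u8>.
// read_to_end() is a method of the Read trait, so the trait must be imported.
fn slurp<R: Read>(mut reader: R) -> std::io::Result<Vec<u8>> {
    let mut buffer = Vec::new();
    reader.read_to_end(&mut buffer)?;
    Ok(buffer)
}

// Hypothetical helper: write a byte slice to any writer.
// write_all() comes from the Write trait.
fn dump<W: Write>(mut writer: W, bytes: &[u8]) -> std::io::Result<()> {
    writer.write_all(bytes)
}
```

A File works with both helpers, and so does an in-memory buffer, which is handy for testing. Remove the use line and the method calls stop compiling.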
The Data Record
This is the type of the object we'll be loading from CSV and writing back out to CSV.
struct LineData<'a> {
    first: &'a str,
    last: &'a str,
    address1: &'a str,
    address2: &'a str,
    city: &'a str,
    state: &'a str,
    zip: &'a str
}
The 'a stuff is a lifetime annotation: it says the text the fields point at must live at least as long as the struct itself. This is the lifetime business, some of the "what in the...!?!?" of Rust. Enough said: a LineData object can't outlive the text it borrows from, and the compiler enforces that. This prevents use-after-free bugs.
The &str is not a String in .NET or a std::wstring in C++. It's a borrowed reference into somebody else's character buffer, a pointer plus a length. Rust strings are UTF-8, so think of it as a well-groomed const char* with a length attached, much like C++'s std::string_view. That's what really makes this program fly. Imagine a struct in C with raw character pointers out to who knows where. That'd be optimal, but scary as hell, right? Well, Rust has pulled it off, it's the real deal for just this sort of thing.
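Here is a small sketch of the borrowing idea, using a cut-down, hypothetical two-field record called Name and a made-up parse_name function rather than the real LineData. Both fields point straight into the input line, with no copying.

```rust
// Hypothetical two-field record, for illustration only.
struct Name<'a> {
    first: &'a str,
    last: &'a str,
}

// Split "first,last" at the first comma without copying any text.
// The returned Name borrows from `line`, so it cannot outlive it.
fn parse_name<'a>(line: &'a str) -> Option<Name<'a>> {
    let (first, last) = line.split_once(',')?;
    Some(Name { first, last })
}
```

If the String that owns the text is dropped while a Name made from it is still in use, the program does not compile. That is the whole trick: the dangling pointer is caught at build time instead of at 3 AM.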
Unicode Bytes To String
We know from our previous attempts at this benchmark that reading the entire file into memory is a good start. Once we have all those bytes, in this attempt, we want to turn that into a String object for later processing.
fn utf16le_buffer_to_string(buffer: &[u8]) -> String {
    let char_len = buffer.len() / 2;
    let mut output = String::with_capacity(char_len);
    for idx in 0..char_len {
        // Treats each 16-bit unit as a whole character; unwrap() panics on a surrogate
        output.push(char::from_u32(u16::from_le_bytes
            ([buffer[idx * 2], buffer[idx * 2 + 1]]) as u32).unwrap());
    }
    output
}
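The buffer input parameter is a &[u8], which is a reference to a slice of u8s, bytes. Pre-allocating the String is probably a good idea. Then we just loop over the range of indexes from 0 to char_len - 1, doing some Unicode fun: each pair of bytes becomes one 16-bit code unit, and each code unit becomes one char. That works for text in the Basic Multilingual Plane; a surrogate pair would make the unwrap() panic. The unadorned "output" after the loop is the return value, weird, I know.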
main() Begins
What little benchmark app is complete without a simple straight-through main() function?
fn main() {
    let args: Vec<String> = std::env::args().collect();
    if args.len() != 3 {
        println!("Usage: {} <input file> <output file>", args[0]);
        std::process::exit(0);
    }
    let input_file_path = &args[1];
    let output_file_path = &args[2];
    println!("{} -> {}", input_file_path, output_file_path);
We collect the iterator of Strings that std::env::args() returns into a Vec<String>, which is easier to work with (think std::vector<std::string>). The &args[x] lets the input_file_path / output_file_path variables refer to the arguments without copying or modifying anything. There are a lot of &s in Rust. It's pretty scary at first, like pointer addresses everywhere, and there are, but it's safe. That's Rust's big gamble, that you'll spackle in &s and other incantations like "mut" and then you'll trust the compiler with memory correctness, and everything will be okay. The println! macro is like printf, with the {} placeholders playing the role of the % specifiers, minus the type letters.
Stopwatch Timing
let mut sw = Stopwatch::start_new();
let mut cur_ms : i64;
let mut total_read_ms : i64 = 0;
The "mut" business says the variable can be modified after it is initialized. No "mut", no modifications are possible, kind of like const.
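If you'd rather not pull in a crate for the timing, the standard library's std::time::Instant can do the same job. This is a sketch, not what the benchmark uses; the time_ms helper is a name invented here.

```rust
use std::time::Instant;

// Hypothetical helper: run a closure and return its result
// together with the elapsed time in whole milliseconds.
fn time_ms<T>(work: impl FnOnce() -> T) -> (T, u128) {
    let start = Instant::now();
    let result = work();
    (result, start.elapsed().as_millis())
}
```

Each timed step in main() could then be wrapped in a closure, for example let (buffer, ms) = time_ms(|| std::fs::read(input_file_path).unwrap());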
Input File I/O: File -> Buffer
sw.restart();
let mut buffer = Vec::new();
File::open(input_file_path).unwrap().read_to_end(&mut buffer).unwrap();
cur_ms = sw.elapsed_ms();
total_read_ms += cur_ms;
println!("buffer: {} - {} ms", buffer.len(), cur_ms);
We create a new vector, we don't have to say the element type, the compiler figures it out. In one line, we read the file into the vector of... bytes, it's gotta be bytes. The &mut means that we're passing (they call it borrowing) a read-write reference to read_to_end() so it can modify our vector.
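The standard library also has a shortcut for the open-then-read_to_end dance, std::fs::read. Here is a sketch that uses it and hands errors back to the caller instead of unwrapping; read_all_bytes is a made-up name, and I have not measured whether it changes the timing.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical helper: read an entire file into a byte vector in one call.
// Any I/O error is returned to the caller instead of panicking.
fn read_all_bytes(path: &Path) -> io::Result<Vec<u8>> {
    fs::read(path)
}
```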
Input Text I/O: Buffer -> String
let str_val = utf16le_buffer_to_string(&buffer);
Here, we use the function we defined above for this purpose, passing in a reference to our byte vector.
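The conversion function treats every 16-bit unit as a complete character, which is fine for this data. If the input could contain characters outside the Basic Multilingual Plane, such as emoji, a variation along these lines would combine surrogate pairs instead of panicking. This is a sketch with a made-up name, utf16le_to_string_lossy, and it is untimed, so it may well be slower than the hand-rolled loop.

```rust
// Decode UTF-16LE bytes into a String.
// Invalid sequences (for example a lone surrogate) become U+FFFD.
fn utf16le_to_string_lossy(buffer: &[u8]) -> String {
    // Pair up bytes into little-endian u16 code units.
    // chunks_exact(2) ignores a trailing odd byte.
    let units = buffer
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    // decode_utf16 merges surrogate pairs into single chars.
    std::char::decode_utf16(units)
        .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect()
}
```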
Object Input I/O: String -> Objects
let mut objects = Vec::new();
let mut parts: [&str; 7] = ["", "", "", "", "", "", ""];
let field_len = parts.len();
let mut idx: usize;
for line in str_val.lines() {
    // split() yields one empty part even for an empty line, so skip blank lines up front
    if line.is_empty() {
        continue;
    }
    idx = 0;
    for part in line.split(',') {
        assert!(idx < field_len);
        parts[idx] = part;
        idx += 1;
    }
    assert_eq!(idx, field_len);
    objects.push(LineData {
        first: parts[0],
        last: parts[1],
        address1: parts[2],
        address2: parts[3],
        city: parts[4],
        state: parts[5],
        zip: parts[6]
    });
}
I had to optimize this code a bit to make it fly. The Vec objects is where we collect our records. The array parts holds seven &str references, one for each field in our record type, a little array of string slices. In the loop, we walk the lines, collect the fields into parts, and put them into the records. Picture character pointers making their way out of the lines() and split() calls, into the parts array, then into our records. No string copying at all, just shuffling references in and out of data structures. Amazing!
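The same zero-copy splitting can be packaged as a function that reports a bad line to the caller instead of asserting. This is a sketch only; split_seven is a name invented here, and the benchmark code above is what was actually timed.

```rust
// Hypothetical helper: split one CSV line into exactly seven borrowed fields.
// Returns None when the line has more or fewer than seven parts.
fn split_seven(line: &str) -> Option<[&str; 7]> {
    let mut parts = [""; 7];
    let mut fields = line.split(',');
    for slot in parts.iter_mut() {
        // Running out of fields early means the line was too short.
        *slot = fields.next()?;
    }
    // Any leftover field means the line was too long.
    if fields.next().is_some() {
        return None;
    }
    Some(parts)
}
```

Every &str in the returned array still points into the original line, so nothing is copied here either.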
All Done With Reading
Here is the profile of the reading:
buffer: 90316528 - 29 ms
str_val: 45158266 - 63 ms
objects: 100000 - 25 ms
total read: 117 ms
Smokin'!
Object Output I/O: Objects -> String
let mut big_str: String = String::with_capacity(str_val.len());
for obj in objects {
    big_str += obj.first;
    big_str += ",";
    big_str += obj.last;
    big_str += ",";
    big_str += obj.address1;
    big_str += ",";
    big_str += obj.address2;
    big_str += ",";
    big_str += obj.city;
    big_str += ",";
    big_str += obj.state;
    big_str += ",";
    big_str += obj.zip;
    big_str += "\n";
}
This matches the relatively fast C++ benchmark application's output code.
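All those += lines can be folded into a small loop over the fields. A sketch, with append_record being a hypothetical helper; String's += on a &str does the same appending as push_str(), so I would not expect a speed difference, though I have not measured it.

```rust
// Hypothetical helper: append one CSV record to the output string.
// Fields are separated by commas and the record ends with a newline.
fn append_record(out: &mut String, fields: &[&str]) {
    for (i, field) in fields.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        out.push_str(field);
    }
    out.push('\n');
}
```

In the loop over objects, the call would pass the seven fields as a slice, for example append_record(&mut big_str, &[obj.first, obj.last, obj.address1, obj.address2, obj.city, obj.state, obj.zip]);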
String Output I/O: String -> Buffer
let big_char_buffer = big_str.encode_utf16();
let mut big_output_buffer = Vec::<u8>::with_capacity(big_str.len() * 2);
for c in big_char_buffer {
    big_output_buffer.push(c as u8);
    big_output_buffer.push((c >> 8) as u8);
}
Writing loops in 2023 seems so 2000s; I could not find anything off the shelf. There should be a way, though.
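One thing that is on the shelf is u16::to_le_bytes(), which replaces the manual shift-and-cast. The loop stays, though. A sketch with a made-up function name, string_to_utf16le; I have not timed it against the version above.

```rust
// Encode a string as UTF-16LE bytes (no byte-order mark is added).
fn string_to_utf16le(text: &str) -> Vec<u8> {
    // A string never has more UTF-16 code units than UTF-8 bytes,
    // so len() * 2 is always enough capacity.
    let mut bytes = Vec::with_capacity(text.len() * 2);
    for unit in text.encode_utf16() {
        // to_le_bytes() returns the low byte first, then the high byte.
        bytes.extend_from_slice(&unit.to_le_bytes());
    }
    bytes
}
```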
File Output I/O: Buffer -> File
File::create(output_file_path).unwrap().write_all(&big_output_buffer).unwrap();
Another fun file I/O one-liner.
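write_all() returns a Result, and it's easy to drop it on the floor in a one-liner. Here is a sketch of the same write with the errors passed up using the ? operator; write_all_bytes is a hypothetical name, not something from the benchmark.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

// Hypothetical helper: create (or truncate) the file and write all the bytes.
// Any error from either step is returned to the caller.
fn write_all_bytes(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(bytes)?;
    Ok(())
}
```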
Dissecting (Bad) Output Performance
That fun Stopwatch code yields these tea leaves about where time is spent in the output side of things:
big_str: 12 ms
big_char_buffer: 0 ms
big_output_buffer: 60 ms
output_file: 123 ms
total write: 195 ms
big_str seems fine. big_char_buffer is surprisingly thrifty, but that's because encode_utf16() returns a lazy iterator: no encoding happens until the loop pulls values out of it, so that cost lands in big_output_buffer. And big_output_buffer, where we turn all that u16 goodness into u8s, costs a lot. It's actually about identical to the cost of turning the buffer into the String on the read side. In C/C++, you can take a wchar_t* and just say it's a uint8_t* and then, presto! it's bytes! Rust is not a fan of that sort of cavalier memory voodoo. And in Rust, Strings are stored in memory as UTF-8, so you can't just make it so like in C/C++. UTF-8 may seem like the lingua franca of the internet, but text in a lot of the world's scripts is more compact in UTF-16 (three bytes per character in UTF-8 versus two), so I don't understand that language decision; it seems colorblind and inefficient. Hmmm.
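To put numbers on the UTF-8 versus UTF-16 trade-off, here is a sketch that reports both encoded sizes for a piece of text. The function name encoded_sizes is invented here. It also shows the laziness of encode_utf16(): nothing is encoded until count() walks the iterator.

```rust
// Return (UTF-8 byte count, UTF-16 byte count) for a piece of text.
fn encoded_sizes(text: &str) -> (usize, usize) {
    // str::len() is the length in UTF-8 bytes, not characters.
    let utf8_bytes = text.len();
    // encode_utf16() is lazy; count() is what actually does the encoding work.
    let utf16_bytes = text.encode_utf16().count() * 2;
    (utf8_bytes, utf16_bytes)
}
```

ASCII text doubles in size going to UTF-16, while three-byte characters, such as CJK text, shrink by a third. The read profile above fits the ASCII case: the 90316528-byte file holds 45158264 code units, and str_val is 45158266 bytes, which is what you'd get if everything were ASCII apart from a single byte-order mark (three bytes in UTF-8, one code unit in UTF-16).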
Conclusion and Points of Interest
I hope you have picked up enough Rust exposure to be interested in learning a lot more about it. The performance benchmark comparison showed Rust flexing its muscles reading data, but seeming a bit tired when it comes to writing data. This benchmark brought out the good and the bad; on balance I think it's good enough to use Rust for my future projects.
I'm interested in your first blush impression of Rust code and your thoughts on how it went with programming and measuring the benchmark and interpreting the results.
History
- 5th January, 2023: Initial version