I've experienced that raw arrays, structs, and classes have different performance profiles. I wanted to put this experience to the test with programs written in C++ and C#.
Introduction
Let's say you're writing a particle simulator physics program. You need to simulate a lot of particles, say 10 million. To represent each particle, you have position, velocity, and acceleration. At each iteration of your program, you need to update the position with the velocity, the velocity with the acceleration, and for the sake of this article, the acceleration is constant. There would be x, y, and z, but let's just focus on one dimension.
To model your data in a computer program, you could use raw arrays for the position, velocity, and acceleration for each particle, or you could create a struct - C++ or C# - and put the position, velocity, and acceleration in there, or you could create a class like the struct.
It would seem that raw arrays would have the best performance, then C++ structs, then C++ classes, then C# structs, then C# classes. I say this because C++ structs are so thin, and C++ classes are identical to C++ structs in memory, and C# structs are pretty lean, but C# classes incur a whole other level of performance overhead, namely that an array of C# structs have data locality, but C# classes are all over the place.
The Test Programs
Using the Code
The code.zip contains the solution. plusplusms
is the C++ test rig, and sharpms
is the C# one.
plusplusms
has a types.h file with the C++ data structures:
#pragma once
struct BareStruct
{
double p, v, a, dummy;
};
class HeaderClass
{
public:
HeaderClass()
{
set(0, 0, 0);
}
void set(double p, double v, double a)
{
m_p = p;
m_v = v;
m_a = a;
}
void updatep()
{
m_p += m_v;
}
void updatev()
{
m_v += m_a;
}
void update()
{
updatep();
updatev();
}
private:
double m_p, m_v, m_a, dummy;
};
class SourceClass
{
public:
SourceClass();
void set(double p, double v, double a);
void update();
private:
double m_p, m_v, m_a, dummy;
};
types.cpp has the implementation of the SourceClass
, same code as HeaderClass
, just separated out. This separation was to test Link Time Code Generation (LTCG) performance, whether it would result in the same performance as the header-only implementation. UPDATE: LTCG definitely works as advertised.
The C++ main.cpp has macros for the scale of the tests and timing results, and functions that do the data processing.
#include "types.h"
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#define SAMPLE_SIZE 10 * 1024 * 1024
#define TEST_RUNS 4
double parr[SAMPLE_SIZE];
double varr[SAMPLE_SIZE];
double aarr[SAMPLE_SIZE];
BareStruct structs[SAMPLE_SIZE];
HeaderClass headers[SAMPLE_SIZE];
SourceClass sources[SAMPLE_SIZE];
typedef std::chrono::high_resolution_clock timer;
typedef std::chrono::milliseconds ms;
typedef std::chrono::duration<int> fsec;
#define START_RUN(name) { timer t; printf("%s: ", name); auto start = t.now();
#define END_RUN() \
auto end = t.now(); \
int elapsedMs = (int)std::chrono::duration_cast<ms>(end - start).count(); \
printf("%d ms\n", elapsedMs); \
}
void updateArrays1()
{
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
parr[i] += varr[i];
}
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
varr[i] += aarr[i];
}
}
void updateArrays2()
{
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
parr[i] += varr[i];
varr[i] += aarr[i];
}
}
void updateStructs1()
{
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
structs[i].p += structs[i].v;
}
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
structs[i].v += structs[i].a;
}
}
void updateStructs2()
{
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
structs[i].p += structs[i].v;
structs[i].v += structs[i].a;
}
}
void updateHeaders1()
{
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
headers[i].updatep();
}
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
headers[i].updatev();
}
}
void updateHeaders2()
{
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
headers[i].update();
}
}
void updateSources1()
{
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
sources[i].updatep();
}
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
sources[i].updatev();
}
}
void updateSources2()
{
for (size_t i = 0; i < SAMPLE_SIZE; ++i)
{
sources[i].update();
}
}
int main()
{
printf("Setting up...\n");
srand(0);
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
structs[i].p = parr[i] = rand();
structs[i].v = varr[i] = rand();
structs[i].a = aarr[i] = rand();
headers[i].set(structs[i].p, structs[i].v, structs[i].a);
sources[i].set(structs[i].p, structs[i].v, structs[i].a);
}
for (int o = 1; o <= 3; ++o)
{
START_RUN("arrays separate");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateArrays1();
}
END_RUN();
START_RUN("arrays together");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateArrays2();
}
END_RUN();
START_RUN("structs separate");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateStructs1();
}
END_RUN();
START_RUN("structs together");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateStructs2();
}
END_RUN();
START_RUN("header classes separate");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateHeaders1();
}
END_RUN();
START_RUN("header classes together");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateHeaders2();
}
END_RUN();
START_RUN("source classes separate");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateSources1();
}
END_RUN();
START_RUN("source classes together");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateSources2();
}
END_RUN();
printf("\n");
}
}
The C# code is more concise, as to be expected.
using System;
using System.Diagnostics;
namespace sharpms
{
struct Struct
{
public double p, v, a, dummy;
}
sealed class Class
{
public void updatep()
{
p += v;
}
public void updatev()
{
v += a;
}
public void update()
{
p += v;
v += a;
}
public double p, v, a, dummy;
}
class Program
{
const int SAMPLE_SIZE = 10 * 1024 * 1024;
const int TEST_RUNS = 4;
static double[] parr = new double[SAMPLE_SIZE];
static double[] varr = new double[SAMPLE_SIZE];
static double[] aarr = new double[SAMPLE_SIZE];
static Struct[] structs = new Struct[SAMPLE_SIZE];
static Class[] classes = new Class[SAMPLE_SIZE];
static void updateArrays1()
{
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
parr[i] += varr[i];
}
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
varr[i] += aarr[i];
}
}
static void updateArrays2()
{
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
parr[i] += varr[i];
varr[i] += aarr[i];
}
}
static unsafe void updateArrays3()
{
fixed (double* pparr0 = &parr[0])
fixed (double* pvarr0 = &varr[0])
fixed (double* paarr0 = &aarr[0])
{
double* pparr = pparr0;
double* pvarr = pvarr0;
double* paarr = paarr0;
int i = SAMPLE_SIZE;
while (i-- > 0)
{
*pparr++ += *pvarr;
*pvarr++ += *paarr++;
}
}
}
static void updateArrays4()
{
var pref = parr;
var vref = varr;
var aref = aarr;
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
pref[i] += vref[i];
vref[i] += aref[i];
}
}
static void updateStructs1()
{
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
structs[i].p += structs[i].v;
}
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
structs[i].v += structs[i].a;
}
}
static void updateStructs2()
{
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
structs[i].p += structs[i].v;
structs[i].v += structs[i].a;
}
}
static void updateClasses1()
{
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
classes[i].updatep();
}
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
classes[i].updatev();
}
}
static void updateClasses2()
{
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
classes[i].update();
}
}
static void updateClasses3()
{
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
classes[i].p += classes[i].v;
classes[i].v += classes[i].a;
}
}
static Stopwatch sw = new Stopwatch();
static void Start(string phase)
{
Console.Write(phase + ": ");
sw.Restart();
}
static void Finish()
{
Console.WriteLine(sw.ElapsedMilliseconds + " ms");
}
static void Main(string[] args)
{
Random rnd = new Random(0);
for (int i = 0; i < SAMPLE_SIZE; ++i)
{
structs[i] = new Struct();
classes[i] = new Class();
classes[i].p = structs[i].p = parr[i] = rnd.Next();
classes[i].v = structs[i].v = varr[i] = rnd.Next();
classes[i].a = structs[i].a = aarr[i] = rnd.Next();
}
for (int run = 1; run <= 3; ++run)
{
Start("arrays separate");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateArrays1();
}
Finish();
Start("arrays together");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateArrays2();
}
Finish();
Start("arrays unsafe");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateArrays3();
}
Finish();
Start("arrays with refs");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateArrays4();
}
Finish();
Start("structs separate");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateStructs1();
}
Finish();
Start("structs together");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateStructs2();
}
Finish();
Start("classes separate");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateClasses1();
}
Finish();
Start("classes together");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateClasses2();
}
Finish();
Start("classes inlined");
for (int r = 1; r <= TEST_RUNS; ++r)
{
updateClasses3();
}
Finish();
}
}
}
}
The C++ Results
arrays separate: 75 ms
arrays together: 76 ms
structs separate: 182 ms
structs together: 100 ms
header classes separate: 182 ms
header classes together: 98 ms
source classes separate: 189 ms
source classes together: 100 ms
You can see that using raw arrays is much faster, 2X, than doing the separate technique with structs or classes. Note that LTCG inlines the Source class's code perfectly, and classes are a bit faster than structs. Interesting.
The C# Results
arrays separate: 129 ms
arrays together: 112 ms
arrays unsafe: 91 ms
arrays with refs: 90 ms
structs separate: 196 ms
structs together: 104 ms
classes separate: 384 ms
classes together: 191 ms
classes inlined: 204 ms
Arrays in C# do not have the great performance gain over others as in C++. Note that C# structs have comparable performance with C++ classes. And C# structs are faster than C# arrays. Strange. As expected, C# classes do not perform well, especially with separated updating.
Conclusion and Points of Interest
- If you want optimum data processing performance, parallel arrays in C/C++ are the way to go. You had to know that going in.
- If that code pattern is unacceptable, using C++ classes or C# structs are good choices. The code is more manageable than parallel arrays, and the performance is good.
- C# classes should be avoided. Sealing them and inlining their code does not make them worthwhile.
- Using separate loops for handling different fields in classes or structs does not pay off at all, and should be avoided due to the negative effect is has on code quality. Data caching seems to trump code caching big time in this use case.
- In C#, for parallel arrays you can get the same performance as unsafe code by using local variables to "pin" the data in place while you work with it.
History
- 13th August, 2021: Initial version
- 26th August, 2021: Took Andre_Prellwitz's comment and applied it to the code and the article