So You've Got 10M Particles to Simulate

Michael Sydney Balloni

4.76/5 (21 votes)

8 Sep 2021CPOL3 min read

38.9K

259

Performance of data structures in C++ vs. C#

I've experienced that raw arrays, structs, and classes have different performance profiles. I wanted to put this experience to the test with programs written in C++ and C#.

Download source - 5.4 KB

Introduction

Let's say you're writing a particle simulator physics program. You need to simulate a lot of particles, say 10 million. To represent each particle, you have position, velocity, and acceleration. At each iteration of your program, you need to update the position with the velocity, the velocity with the acceleration, and for the sake of this article, the acceleration is constant. There would be x, y, and z, but let's just focus on one dimension.

To model your data in a computer program, you could use raw arrays for the position, velocity, and acceleration for each particle, or you could create a struct - C++ or C# - and put the position, velocity, and acceleration in there, or you could create a class like the struct.

It would seem that raw arrays would have the best performance, then C++ structs, then C++ classes, then C# structs, then C# classes. I say this because C++ structs are so thin, and C++ classes are identical to C++ structs in memory, and C# structs are pretty lean, but C# classes incur a whole other level of performance overhead, namely that an array of C# structs have data locality, but C# classes are all over the place.

The Test Programs

Using the Code

The code.zip contains the solution. plusplusms is the C++ test rig, and sharpms is the C# one.

plusplusms has a types.h file with the C++ data structures:

C++

#pragma once

// UPDATE: adding a dummy field for cache alignment (RE: Daniel Pfeffer) did not pay off
struct BareStruct
{
    double p, v, a, dummy;
};

class HeaderClass
{
public:
	HeaderClass()
	{
		set(0, 0, 0);
	}

	void set(double p, double v, double a)
	{
		m_p = p;
		m_v = v;
		m_a = a;
	}

    // UPDATE: Separating out updating position from updating velocity did not pay off
	void updatep()
	{
		m_p += m_v;
	}

	void updatev()
	{
		m_v += m_a;
	}

    // NOTE: Having one update function that calls the other two performed very well,
    //       inlining all around.
	void update()
	{
		updatep();
		updatev();
	}

private:
	double m_p, m_v, m_a, dummy;
};

class SourceClass
{
public:
	SourceClass();

	void set(double p, double v, double a);

    // UPDATE: Separating out...did not pay off
	//void updatep();
	//void updatev();
	void update();

private:
	double m_p, m_v, m_a, dummy;
};

types.cpp has the implementation of the SourceClass, same code as HeaderClass, just separated out. This separation was to test Link Time Code Generation (LTCG) performance, whether it would result in the same performance as the header-only implementation. UPDATE: LTCG definitely works as advertised.

The C++ main.cpp has macros for the scale of the tests and timing results, and functions that do the data processing.

C++

#include "types.h"

#include <stdio.h>
#include <stdlib.h>

#include <chrono>

#define SAMPLE_SIZE 10 * 1024 * 1024
#define TEST_RUNS 4

double parr[SAMPLE_SIZE];
double varr[SAMPLE_SIZE];
double aarr[SAMPLE_SIZE];

BareStruct structs[SAMPLE_SIZE];

HeaderClass headers[SAMPLE_SIZE]; 
SourceClass sources[SAMPLE_SIZE];

typedef std::chrono::high_resolution_clock timer;
typedef std::chrono::milliseconds ms;
typedef std::chrono::duration<int> fsec;

#define START_RUN(name) { timer t; printf("%s: ", name); auto start = t.now();
#define END_RUN() \
    auto end = t.now(); \
    int elapsedMs = (int)std::chrono::duration_cast<ms>(end - start).count(); \
    printf("%d ms\n", elapsedMs); \
}

void updateArrays1()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        parr[i] += varr[i];
    }

    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        varr[i] += aarr[i];
    }
}

void updateArrays2()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        parr[i] += varr[i];
        varr[i] += aarr[i];
    }
}

void updateStructs1()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        structs[i].p += structs[i].v;
    }

    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        structs[i].v += structs[i].a;
    }
}

void updateStructs2()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        structs[i].p += structs[i].v;
        structs[i].v += structs[i].a;
    }
}

void updateHeaders1()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        headers[i].updatep();
    }

    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        headers[i].updatev();
    }
}

void updateHeaders2()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        headers[i].update();
    }
}

void updateSources1()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        sources[i].updatep();
    }

    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        sources[i].updatev();
    }
}

void updateSources2()
{
    for (size_t i = 0; i < SAMPLE_SIZE; ++i)
    {
        sources[i].update();
    }
}

int main()
{
    printf("Setting up...\n");
    srand(0);
    for (int i = 0; i < SAMPLE_SIZE; ++i)
    {
        structs[i].p = parr[i] = rand();
        structs[i].v = varr[i] = rand();
        structs[i].a = aarr[i] = rand();

        headers[i].set(structs[i].p, structs[i].v, structs[i].a);
        sources[i].set(structs[i].p, structs[i].v, structs[i].a);
    }

    for (int o = 1; o <= 3; ++o)
    {
        START_RUN("arrays separate");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateArrays1();
        }
        END_RUN();

        START_RUN("arrays together");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateArrays2();
        }
        END_RUN();

        START_RUN("structs separate");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateStructs1();
        }
        END_RUN();

        START_RUN("structs together");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateStructs2();
        }
        END_RUN();

        START_RUN("header classes separate");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateHeaders1();
        }
        END_RUN();

        START_RUN("header classes together");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateHeaders2();
        }
        END_RUN();

        START_RUN("source classes separate");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateSources1();
        }
        END_RUN();

        START_RUN("source classes together");
        for (int r = 1; r <= TEST_RUNS; ++r)
        {
            updateSources2();
        }
        END_RUN();

        printf("\n");
    }
}

The C# code is more concise, as to be expected.

using System;
using System.Diagnostics;

namespace sharpms
{
    // UPDATE: Adding the dummy field did not pay off in C# either
    struct Struct
    {
        public double p, v, a, dummy; 
    }

    // UPDATE: marked class sealed (RE: Andre_Prellwitz) did not pay off
    sealed class Class 
    {
        public void updatep()
        {
            p += v;
        }

        public void updatev()
        {
            v += a;
        }

        public void update()
        {
            p += v;
            v += a;
        }

        public double p, v, a, dummy;
    }

    class Program
    {
        const int SAMPLE_SIZE = 10 * 1024 * 1024;
        const int TEST_RUNS = 4;

        static double[] parr = new double[SAMPLE_SIZE];
        static double[] varr = new double[SAMPLE_SIZE];
        static double[] aarr = new double[SAMPLE_SIZE];

        static Struct[] structs = new Struct[SAMPLE_SIZE];
        static Class[] classes = new Class[SAMPLE_SIZE];

        static void updateArrays1()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                parr[i] += varr[i];
            }

            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                varr[i] += aarr[i];
            }
        }

        static void updateArrays2()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                parr[i] += varr[i];
                varr[i] += aarr[i];
            }
        }

        // UPDATE: Resorting to unsafe code (// RE: igrulya) performs well,
        //         but is not necessary as the next solution shows
        static unsafe void updateArrays3() 
        {
            fixed (double* pparr0 = &parr[0])
            fixed (double* pvarr0 = &varr[0])
            fixed (double* paarr0 = &aarr[0])
            {
                double* pparr = pparr0;
                double* pvarr = pvarr0;
                double* paarr = paarr0;
                int i = SAMPLE_SIZE;
                while (i-- > 0)
                {
                    *pparr++ += *pvarr;
                    *pvarr++ += *paarr++;
                }
            }
        }

        // RE: <a href="https://www.codeproject.com/script/Membership/View.aspx?mid=15342246">OracPrime</a>
        static void updateArrays4() 
        {
            var pref = parr;
            var vref = varr;
            var aref = aarr;
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                pref[i] += vref[i];
                vref[i] += aref[i];
            }
        }

        static void updateStructs1()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                structs[i].p += structs[i].v;
            }

            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                structs[i].v += structs[i].a;
            }
        }

        static void updateStructs2()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                structs[i].p += structs[i].v;
                structs[i].v += structs[i].a;
            }
        }

        static void updateClasses1()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                classes[i].updatep();
            }

            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                classes[i].updatev();
            }
        }

        static void updateClasses2()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                classes[i].update();
            }
        }

        static void updateClasses3()
        {
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                classes[i].p += classes[i].v;
                classes[i].v += classes[i].a;
            }
        }

        static Stopwatch sw = new Stopwatch();

        static void Start(string phase)
        {
            Console.Write(phase + ": ");
            sw.Restart();
        }

        static void Finish()
        {
            Console.WriteLine(sw.ElapsedMilliseconds + " ms");
        }

        static void Main(string[] args)
        {
            Random rnd = new Random(0);
            for (int i = 0; i < SAMPLE_SIZE; ++i)
            {
                structs[i] = new Struct();
                classes[i] = new Class();

                classes[i].p = structs[i].p = parr[i] = rnd.Next();
                classes[i].v = structs[i].v = varr[i] = rnd.Next();
                classes[i].a = structs[i].a = aarr[i] = rnd.Next();
            }

            for (int run = 1; run <= 3; ++run)
            {
                Start("arrays separate");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateArrays1();
                }
                Finish();

                Start("arrays together");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateArrays2(); // bug fix RE: Kermel
                }
                Finish();

                Start("arrays unsafe");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateArrays3();
                }
                Finish();

                Start("arrays with refs");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateArrays4();
                }
                Finish();

                Start("structs separate");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateStructs1();
                }
                Finish();

                Start("structs together");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateStructs2();
                }
                Finish();

                Start("classes separate");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateClasses1();
                }
                Finish();

                Start("classes together");
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateClasses2();
                }
                Finish();

                Start("classes inlined"); // RE: Andre_Prellwitz
                for (int r = 1; r <= TEST_RUNS; ++r)
                {
                    updateClasses3();
                }
                Finish();
            }
        }
    }
}

The C++ Results

C++

arrays separate: 75 ms
arrays together: 76 ms

structs separate: 182 ms
structs together: 100 ms

header classes separate: 182 ms
header classes together: 98 ms

source classes separate: 189 ms
source classes together: 100 ms

You can see that using raw arrays is much faster, 2X, than doing the separate technique with structs or classes. Note that LTCG inlines the Source class's code perfectly, and classes are a bit faster than structs. Interesting.

The C# Results

arrays separate: 129 ms
arrays together: 112 ms

arrays unsafe: 91 ms
arrays with refs: 90 ms

structs separate: 196 ms
structs together: 104 ms

classes separate: 384 ms
classes together: 191 ms
classes inlined: 204 ms

Arrays in C# do not have the great performance gain over others as in C++. Note that C# structs have comparable performance with C++ classes. And C# structs are faster than C# arrays. Strange. As expected, C# classes do not perform well, especially with separated updating.

Conclusion and Points of Interest

If you want optimum data processing performance, parallel arrays in C/C++ are the way to go. You had to know that going in.
If that code pattern is unacceptable, using C++ classes or C# structs are good choices. The code is more manageable than parallel arrays, and the performance is good.
C# classes should be avoided. Sealing them and inlining their code does not make them worthwhile.
Using separate loops for handling different fields in classes or structs does not pay off at all, and should be avoided due to the negative effect is has on code quality. Data caching seems to trump code caching big time in this use case.
In C#, for parallel arrays you can get the same performance as unsafe code by using local variables to "pin" the data in place while you work with it.

History

13^th August, 2021: Initial version
26^th August, 2021: Took Andre_Prellwitz's comment and applied it to the code and the article

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)