After playing with the tools, we have some more options to improve the performance of the particle system. This time, we need to rewrite some parts of the code.
In total, the particle system runs almost twice as fast as initially! Read more to see what pieces of code were changed.
Plan for this Post
- Start
- SIMD Preparation
- SSE and AVX Instructions
- Pointer Aliasing: __restrict Keyword
- Random Number Generator
- Wrap Up
Start
We start from the following numbers; see the previous post for the last results.
Core i5 Sandy Bridge
| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 429.195 | 608.598    | 460.299  |
| 181000 | 460.649 | 647.825    | 490.412  |
| 191000 | 489.206 | 688.603    | 520.302  |
Core i5 Ivy Bridge
| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 529.188 | 746.594    | 570.297  |
| 181000 | 565.648 | 792.824    | 605.912  |
| 191000 | 593.956 | 832.478    | 640.739  |
(time in milliseconds)
SIMD Preparation
Previously, I tried to force the compiler to use SSE2 or AVX instructions. As we saw, there was a nice performance boost (around 10% for AVX). But SIMD should calculate things 4x or 8x faster... so why did we get only a small improvement?
In real life, it's not that simple:
- SIMD can process 4 or 8 elements at a time, but we still have to wait for memory. See my summary of the talk "Native code performance on modern CPUs" for more information. In general, we can get at most around a 2.5x speedup with SSE2/4, and that assumes ideally 'vectorizable' code. Not all code is in such a perfect state.
- Current CPUs are superscalar, which means they can run several independent instructions in parallel. Sometimes hand-written SIMD code is even slower than the code a compiler generates.
- A smaller additional problem: SSE registers require memory chunks aligned to 128 bits (16 bytes). We need to take care of this when allocating new memory, so not every variable or array is suitable for SSE code.
What can we do?
- Since particles operate mostly on `glm::vec4`, there is a high chance of using the full power of SSE: each vector is 4 floats, 16 bytes.
- `glm` has a very nice feature, `glm::simdVec4`, which basically adds SSE code to the common vector functions. So I simply changed `glm::vec4` to `glm::simdVec4`.
- Memory must be aligned, so I used `_aligned_malloc` and `_aligned_free`.
Some code examples:
```cpp
glm::simdVec4 *m_pos;
glm::simdVec4 *m_col;

// allocate with 16-byte alignment for SSE
m_pos = (glm::simdVec4 *)_aligned_malloc(sizeof(glm::vec4) * maxSize, 16);
m_col = (glm::simdVec4 *)_aligned_malloc(sizeof(glm::vec4) * maxSize, 16);

// ...

_aligned_free(m_pos);
_aligned_free(m_col);
```
The results after changes (Visual Studio):
Sandy Bridge:
| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 387.563 | 495.281    | 394.641  |
| 181000 | 417.320 | 529.660    | 426.330  |
| 191000 | 447.665 | 563.833    | 450.416  |
Ivy Bridge:
| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 476.625 | 596.313    | 483.656  |
| 181000 | 514.328 | 639.664    | 523.332  |
| 191000 | 552.666 | 682.333    | 558.667  |
Wow: almost a 20% improvement! All from proper data structures (for vectors) and memory alignment.
SSE and AVX Instructions
So far, we've got a nice speedup. Now let's write some SSE code for the most critical loops. Will it run faster?
Euler update, SSE:
```cpp
__m128 ga = globalA.Data;
__m128 *pa, *pb, pc;
__m128 ldt = _mm_set_ps1(localDT);
size_t i;

// acc += globalA
for (i = 0; i < endId; i++)
{
    pa = (__m128 *)(&p->m_acc[i].x);
    *pa = _mm_add_ps(*pa, ga);
}
// vel += acc * dt
for (i = 0; i < endId; i++)
{
    pa = (__m128 *)(&p->m_vel[i].x);
    pb = (__m128 *)(&p->m_acc[i].x);
    pc = _mm_mul_ps(*pb, ldt);
    *pa = _mm_add_ps(*pa, pc);
}
// pos += vel * dt
for (i = 0; i < endId; i++)
{
    pa = (__m128 *)(&p->m_pos[i].x);
    pb = (__m128 *)(&p->m_vel[i].x);
    pc = _mm_mul_ps(*pb, ldt);
    *pa = _mm_add_ps(*pa, pc);
}
```
Readability is much worse in this case.
The results:
Sandy Bridge
| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 386.453 | 492.727    | 393.363  |
| 181000 | 416.182 | 529.591    | 423.795  |
| 191000 | 444.398 | 564.199    | 450.099  |
Ivy Bridge:
| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 481.172 | 584.086    | 486.543  |
| 181000 | 516.271 | 623.136    | 514.068  |
| 191000 | 547.034 | 656.517    | 541.258  |
Not much, unfortunately. This is because `glm::simdVec4` already uses SSE code internally, so there was little point in rewriting it by hand: we lose readability and the performance gain is questionable.
Pointer Aliasing: __restrict Keyword
In my previous post, I got a very interesting comment from Matías N. Goldberg:
... In my experience, you get massive performance improvements by just telling the compiler the pointers do not alias to each other...
Matías suggests using the `__restrict` keyword to tell the compiler that the pointers do not alias. For instance:

```cpp
glm::vec4 * __restrict acc = p->m_acc;
glm::vec4 * __restrict vel = p->m_vel;
glm::vec4 * __restrict pos = p->m_pos;
```

Then, instead of `p->m_pos`, we use the local `pos` pointer.
When I made this change in all the updater (and generator) code, I got the following results:
Sandy Bridge
| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 372.641 | 476.820    | 376.410  |
| 181000 | 401.705 | 508.353    | 404.176  |
| 191000 | 427.588 | 542.794    | 432.397  |
Ivy Bridge
| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 475.609 | 591.805    | 480.402  |
| 181000 | 502.201 | 620.601    | 512.300  |
| 191000 | 534.150 | 667.575    | 541.788  |
This is not a massive improvement, but it's still worth testing.
Random Number Generator
So far I have focused mostly on the updaters, but the generators can be improved a bit as well. This module makes heavy use of a random number generator: right now, the standard C `rand()` function is called. For a particle system we probably do not need anything more advanced (like a normal-distribution generator); a uniform distribution is just fine. But maybe there are some faster generators than the default one?
After some searching, I found a few candidate generators.
I've tried to use this generator:
```cpp
static unsigned int mirand = 1;

float sfrand(void)
{
    unsigned int a;
    mirand *= 16807;                         // Park-Miller "minimal standard" multiplier
    a = (mirand & 0x007fffff) | 0x40000000;  // keep 23 mantissa bits, set exponent for [2, 4)
    return (*((float *)&a)) - 3.0f;          // map [2, 4) down to [-1, 1)
}
```
It has a uniform distribution and 23 bits of precision, while the C `rand()` on MSVC gives only 15 bits (RAND_MAX is 32767).
The results:
Sandy Bridge:
| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 334.633 | 443.816    | 348.908  |
| 181000 | 363.954 | 474.477    | 372.739  |
| 191000 | 384.869 | 501.435    | 394.217  |
Ivy Bridge:
| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 412.172 | 531.586    | 429.293  |
| 181000 | 450.146 | 573.073    | 463.037  |
| 191000 | 473.518 | 606.759    | 484.880  |
Wow! That is around 28% total improvement on Sandy Bridge, and almost as much on Ivy Bridge.
Wrap Up
Final Results
| CPU   | count  | tunnel           | attractors       | fountain         |
|-------|--------|------------------|------------------|------------------|
| Sandy | 191000 | 384.869 (-21.3%) | 501.435 (-27.2%) | 394.217 (-24.2%) |
| Ivy   | 191000 | 473.518 (-20.3%) | 606.759 (-27.1%) | 484.880 (-24.3%) |
Total improvement (measured against the times from before the tools-based optimizations):
| CPU   | tunnel | attractors | fountain |
|-------|--------|------------|----------|
| Sandy | 35.5%  | 43.5%      | 39.7%    |
| Ivy   | 33.2%  | 38.2%      | 35.6%    |
We can 'reverse' those numbers and say that the attractor effect now runs almost twice as fast! Not bad!
Conclusion
- Memory alignment and proper data structures are the key factors.
- Write SIMD code only when needed; usually it is better to rely on the compiler and third-party libraries.
- Describe your code better, for instance with the `__restrict` keyword. That way the compiler can generate better code.
- The random number generator can make a difference.
What's Next
The renderer is very simple so far, and there may be ways to improve its code. For sure, we have to look at CPU-to-GPU memory transfers and better buffer usage.