Introduction
Intel SSE intrinsics can boost the performance of floating-point calculations by operating on several values at once. Both GCC and Microsoft Visual Studio support SSE intrinsics. SSE uses the 128-bit XMM registers for floating-point calculations: xmm0-xmm15 (16 registers) in 64-bit mode, or xmm0-xmm7 (8 registers) in 32-bit mode. SSE operations on single-precision and double-precision floating-point numbers differ slightly. My objective is to point out the differences between these two data types using a simple summation over a floating-point array.
SSE Programming
The SSE intrinsics and data types are declared in a family of headers: #include <xmmintrin.h> covers the original SSE single-precision operations, <emmintrin.h> adds the SSE2 double-precision operations, and <pmmintrin.h> adds the SSE3 horizontal adds (with GCC, SSE3 also needs the -msse3 compiler option). __m128 holds four single-precision floating-point numbers and __m128d holds two double-precision numbers. _mm_load_pd loads double-precision numbers and _mm_load_ps loads single-precision numbers. Similarly, _mm_add_ps and _mm_hadd_ps add single-precision numbers, while _mm_add_pd and _mm_hadd_pd add double-precision numbers. _mm_load_ps and _mm_load_pd require the array to be aligned on a 16-byte boundary, which can be achieved by allocating it with _mm_malloc.
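To make the alignment requirement concrete, here is a minimal sketch of allocating a 16-byte aligned float array. The helper names alloc_aligned_floats and is_aligned_16 are my own and only for illustration; I am assuming a GCC or Clang style toolchain on x86, where <xmmintrin.h> also makes _mm_malloc and _mm_free available.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumed to pull in _mm_malloc/_mm_free */

/* Hypothetical helper: allocate n floats starting on a 16-byte boundary.
   Returns NULL on failure. Release the memory with _mm_free, not free. */
static float *alloc_aligned_floats(size_t n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);
}

/* Hypothetical helper: returns 1 if p is suitable for _mm_load_ps/_mm_load_pd. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

A caller would allocate with alloc_aligned_floats(n), fill the array, run the SSE loop, and finally call _mm_free on the pointer. Memory obtained from _mm_malloc should not be passed to free.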
_mm_add_ps
adds the four single precision floating-point values
r0 := a0 + b0
r1 := a1 + b1
r2 := a2 + b2
r3 := a3 + b3
_mm_add_pd
adds the two double precision floating-point values
r0 := a0 + b0
r1 := a1 + b1
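The lane-by-lane behaviour described above can be checked with a small sketch. The function names add4_floats and add2_doubles are hypothetical, and I use the unaligned load and store variants (_mm_loadu_ps, _mm_storeu_ps and their _pd counterparts) so that ordinary arrays work without special alignment.

```c
#include <emmintrin.h>  /* SSE2; also brings in the SSE declarations */

/* r[k] = a[k] + b[k] for k = 0..3, using one _mm_add_ps. */
static void add4_floats(const float *a, const float *b, float *r)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_add_ps(va, vb));
}

/* r[k] = a[k] + b[k] for k = 0..1, using one _mm_add_pd. */
static void add2_doubles(const double *a, const double *b, double *r)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(r, _mm_add_pd(va, vb));
}
```

For example, adding {1, 2, 3, 4} and {10, 20, 30, 40} with add4_floats should leave {11, 22, 33, 44} in r.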
Code
This is the plain C code that we wish to convert to SSE.
float sum = 0;
for (int i = 0; i < n; i++) {
    sum += a[i];
}
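Sample code for single-precision floating-point addition:
float sum = 0.0f;
__m128 rsum = _mm_set1_ps(0.0f);      /* four partial sums, all zero */
/* Assumes n is a multiple of 4 and a is 16-byte aligned. */
for (int i = 0; i < n; i += 4)
{
    __m128 mr = _mm_load_ps(&a[i]);   /* load a[i] .. a[i+3] */
    rsum = _mm_add_ps(rsum, mr);      /* add lane by lane */
}
rsum = _mm_hadd_ps(rsum, rsum);       /* pairwise sums of adjacent lanes */
rsum = _mm_hadd_ps(rsum, rsum);       /* every lane now holds the total */
_mm_store_ss(&sum, rsum);             /* write lane 0 to sum */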
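The loop above assumes that n is a multiple of 4 and that the array is 16-byte aligned. As a hedged sketch of how those assumptions could be relaxed, the hypothetical function sum_floats_sse below uses unaligned loads and finishes the leftover elements with a scalar loop. It also replaces _mm_hadd_ps with a store of the four lanes, so it needs only the original SSE instructions; this is my variation, not the method used in the sample code.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Sum n floats; n may be any size and a needs no special alignment. */
static float sum_floats_sse(const float *a, size_t n)
{
    __m128 acc = _mm_setzero_ps();          /* four partial sums */
    size_t i = 0;

    /* Vector part: four elements per iteration. */
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(&a[i]));

    /* Horizontal reduction: spill the four lanes and add them. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Scalar tail: the 0 to 3 elements that did not fill a vector. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

Because the elements are added in a different order than in the plain loop, the result can differ from the scalar sum in the last few bits for general data.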
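Sample code for double-precision floating-point addition:
double sum = 0.0;
__m128d rsum = _mm_set1_pd(0.0);            /* partial sums of a[i], a[i+1] */
__m128d rsum1 = _mm_set1_pd(0.0);           /* partial sums of a[i+2], a[i+3] */
/* Assumes n is a multiple of 4 and a is 16-byte aligned. */
for (int i = 0; i < n; i += 4)
{
    __m128d mr = _mm_load_pd(&a[i]);        /* load a[i], a[i+1] */
    __m128d mr1 = _mm_load_pd(&a[i+2]);     /* load a[i+2], a[i+3] */
    rsum = _mm_add_pd(rsum, mr);
    rsum1 = _mm_add_pd(rsum1, mr1);
}
rsum = _mm_hadd_pd(rsum, rsum1);            /* lane 0: sum of rsum, lane 1: sum of rsum1 */
rsum = _mm_hadd_pd(rsum, rsum);             /* both lanes now hold the total */
_mm_store_sd(&sum, rsum);                   /* write lane 0 to sum */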
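In the same spirit, here is a sketch of the double-precision sum that accepts any n and any alignment. The function name sum_doubles_sse2 is hypothetical. It keeps the two accumulators of the sample code but combines them with _mm_add_pd and a store instead of _mm_hadd_pd, so only SSE2 is assumed.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 */

/* Sum n doubles; n may be any size and a needs no special alignment. */
static double sum_doubles_sse2(const double *a, size_t n)
{
    __m128d acc0 = _mm_setzero_pd();   /* partial sums of a[i], a[i+1] */
    __m128d acc1 = _mm_setzero_pd();   /* partial sums of a[i+2], a[i+3] */
    size_t i = 0;

    /* Vector part: two 2-wide adds cover four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 = _mm_add_pd(acc0, _mm_loadu_pd(&a[i]));
        acc1 = _mm_add_pd(acc1, _mm_loadu_pd(&a[i + 2]));
    }

    /* Merge the accumulators, then add the two remaining lanes. */
    double lanes[2];
    _mm_storeu_pd(lanes, _mm_add_pd(acc0, acc1));
    double sum = lanes[0] + lanes[1];

    /* Scalar tail: the 0 to 3 leftover elements. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```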
The difference between single precision and double precision is the number of values handled per operation. With single precision, one operation adds 4 values:
rsum = _mm_add_ps(rsum, mr);
With double precision, one operation adds only 2 values, so you need two operations for 4 values:
rsum = _mm_add_pd(rsum, mr);
rsum1 = _mm_add_pd(rsum1, mr1);
Adding a timer shows that the SSE code is much faster than the plain code. On my PC, I observed that the SSE code is almost 4 times faster than the plain code.
Hence, SSE instructions can be used to build faster applications where execution time matters.
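For readers who want to repeat the measurement, here is one possible timing sketch. It is not the timer I used for the figure above; the names sum_fn, sum_scalar and seconds_per_call are hypothetical, and it relies on clock() from the C standard library, which measures CPU time at a fairly coarse resolution.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical signature shared by the plain and the SSE sum functions. */
typedef float (*sum_fn)(const float *, size_t);

/* Plain C reference implementation. */
static float sum_scalar(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Call fn reps times and return the average seconds per call.
   The last result is written to *result so the work is not optimized away. */
static double seconds_per_call(sum_fn fn, const float *a, size_t n,
                               int reps, float *result)
{
    volatile float sink = 0.0f;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink = fn(a, n);
    clock_t end = clock();
    *result = sink;
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Calling seconds_per_call once with sum_scalar and once with an SSE version on the same large array gives two numbers to compare. The ratio you see will depend on the compiler flags, the array size, and the machine.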
Last of All
This is my first post in CodeProject. There may be mistakes in this article. Please let me know and give me feedback.