Objective
My objective in posting this article is to share some simple optimization methods. In the future, I will try to find time to write more articles.
Introduction
This article demonstrates Intel's SIMD (Single Instruction, Multiple Data) extension technology. Using newer Intel instructions such as movdqa, you can move (copy) data faster than with the typical instructions.
Recall
Before we move on, let's recall what we already know. Most of us use a 32-bit processor at home, and even in industry. General-purpose registers such as eax, ebx, etc. are 32 bits wide, and sizeof(int) = 4 (bytes). But not all registers are 32 bits; some are wider. About a decade ago, Intel introduced the MMX extension, which has eight 64-bit registers, mm0, mm1 .. mm7. After that, Intel introduced the SSE extension, which added another eight registers, xmm0, xmm1 .. xmm7, each 128 bits wide. If you want more details, please go to my Links section and look for Intel.
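As a rough illustration of those widths (a sketch, not part of the original demo), the compiler's SSE2 header exposes a 128-bit type, __m128i, that maps onto an xmm register. The helper name IntsPerXmm is made up for this example, and the code assumes a compiler that ships emmintrin.h.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics header: declares the 128-bit __m128i type
#include <stddef.h>     // size_t

// Hypothetical helper (not in the original demo):
// how many ints fit in one 128-bit xmm register.
size_t IntsPerXmm(void)
{
    return sizeof(__m128i) / sizeof(int);
}
```

With a 4-byte int, one xmm register holds four ints, which is where the "four int per loop" figure later in the article comes from.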
Requirement
First, ask yourself what machine you are using. The movdqa instruction used here belongs to SSE2, so it should be an Intel Pentium 4 or newer (the Pentium III supports SSE but not SSE2). Bear in mind that this optimization method is machine dependent: if your hardware does not support the instruction, you won't be able to see the difference.
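If you are not sure what your machine supports, the CPU can be asked directly. The following is a minimal sketch, assuming a compiler that provides intrin.h (newer MSVC versions) or cpuid.h (GCC/Clang); the helper name HasSse2 is made up for this example and is not part of the demo.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid (MSVC)
#else
#include <cpuid.h>      // __get_cpuid (GCC/Clang)
#endif

// Hypothetical helper: returns 1 if the CPU reports SSE2 support, 0 otherwise.
int HasSse2(void)
{
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};      // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);                // CPUID leaf 1: feature flags
    return (regs[3] >> 26) & 1;      // EDX bit 26 = SSE2
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                    // CPUID leaf 1 not available
    return (edx >> 26) & 1;          // EDX bit 26 = SSE2
#endif
}
```

A program could call HasSse2() once at startup and fall back to the typical copy when it returns 0.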
Code
I purposely kept the sample simple; it runs in console mode. Don't just cut and paste; I would rather the reader understand it and try it themselves. Here is where the sample starts.
The demo code lets you see the difference between two functions that serve the same purpose. From here on I won't explain much; you will be on your own, so please read the comments within the code. I'm sure you will be able to catch up. =)
Wait! Get your breakpoints ready first and sit tight. When you debug, please step through both functions and you will notice the difference.
"DataTransferTypical" copies one int per loop iteration (sizeof(int) = 4 bytes), whereas "DataTransferOptimised" copies four ints per loop iteration (4 * sizeof(int) = 16 bytes).
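The four-ints-per-iteration idea can also be sketched with compiler intrinsics instead of inline assembly. This is not the demo's code: DataTransferIntrinsics is a made-up name, and the sketch assumes both pointers are 16-byte aligned and SizeInBytes is a multiple of 16.

```cpp
#include <assert.h>
#include <stdint.h>     // uintptr_t
#include <emmintrin.h>  // SSE2 intrinsics: _mm_load_si128, _mm_store_si128

// Hypothetical intrinsics counterpart of the optimised copy.
// Assumes 16-byte aligned pointers and a size that is a multiple of 16.
int DataTransferIntrinsics(int* piDst, const int* piSrc, unsigned long SizeInBytes)
{
    assert(((uintptr_t)piDst & 15) == 0);
    assert(((uintptr_t)piSrc & 15) == 0);
    assert(SizeInBytes % 16 == 0);

    unsigned long dwNumPacks = SizeInBytes / sizeof(__m128i);  // 16-byte packs to copy
    const __m128i* pSrc = (const __m128i*)piSrc;
    __m128i* pDst = (__m128i*)piDst;

    for (unsigned long i = 0; i < dwNumPacks; i++)
    {
        __m128i pack = _mm_load_si128(pSrc + i);  // aligned 16-byte load (four ints)
        _mm_store_si128(pDst + i, pack);          // aligned 16-byte store
    }
    return 0;
}
```

Each pass through the loop moves one 128-bit pack, so a buffer of N ints takes N/4 iterations instead of N.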
Setting up your Watch window: in your Watch window, watch "piDst, 101". Then you will see how it changes...
P.S.: You need to install the Processor Pack to get MSVC++ to compile this code. See the Links section.
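#include <stdio.h>   // printf
#include <string.h>  // memset
#include <time.h>    // clock
#include <malloc.h>  // _aligned_malloc, _aligned_free
#include <conio.h>   // getche
// DATA_SIZE (number of ints) and ITERATION (number of repetitions) must also be defined; their values are not shown in this listing.
int DataTransferTypical(int* piDst, int* piSrc, unsigned long SizeInBytes);
int DataTransferOptimised(int* piDst, int* piSrc, unsigned long SizeInBytes);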
int main(int argc, char* argv[])
{
    unsigned long dwTimeStart = 0;
    unsigned long dwTimeEnd = 0;
    int *piSrc = NULL;
    int *piDst = NULL;
    int i = 0;
    char cKey = 0;
    unsigned long dwDataSizeInBytes = sizeof(int) * DATA_SIZE;

    // movdqa needs 16-byte aligned addresses, so allocate both buffers on a 16-byte boundary.
    piSrc = (int *)_aligned_malloc(dwDataSizeInBytes, 16);
    piDst = (int *)_aligned_malloc(dwDataSizeInBytes, 16);

    do
    {
        // Time the plain int-by-int copy.
        memset(piSrc, 1, dwDataSizeInBytes);
        memset(piDst, 0, dwDataSizeInBytes);
        dwTimeStart = clock();
        for(i = 0; i < ITERATION; i++)
            DataTransferTypical(piDst, piSrc, dwDataSizeInBytes);
        dwTimeEnd = clock();
        printf("== Typical Transfer of %d * %d times of %d bytes data ==\nTime Elapsed = %d msec\n\n",
            ITERATION, DATA_SIZE, (int)sizeof(int), (int)(dwTimeEnd - dwTimeStart));

        // Time the 16-bytes-at-a-time copy.
        memset(piSrc, 1, dwDataSizeInBytes);
        memset(piDst, 0, dwDataSizeInBytes);
        dwTimeStart = clock();
        for(i = 0; i < ITERATION; i++)
            DataTransferOptimised(piDst, piSrc, dwDataSizeInBytes);
        dwTimeEnd = clock();
        printf("== Optimised Transfer of %d * %d times of %d bytes data ==\nTime Elapsed = %d msec\n\n",
            ITERATION, DATA_SIZE, (int)sizeof(int), (int)(dwTimeEnd - dwTimeStart));

        printf("Rerun? (y/n) ");
        cKey = getche();
        printf("\n\n");
    }while(cKey == 'y');

    _aligned_free(piSrc);
    _aligned_free(piDst);
    return 0;
}
#pragma warning(push)
#pragma warning(disable:4018 4102)
int DataTransferTypical(int* piDst, int* piSrc, unsigned long SizeInBytes)
{
    unsigned long dwNumElements = SizeInBytes / sizeof(int);
    for(int i = 0; i < dwNumElements; i++)
    {
        *(piDst + i) = *(piSrc + i);
    }
    return 0;
}
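// Assumes piDst and piSrc are 16-byte aligned and SizeInBytes is a multiple of 16.
int DataTransferOptimised(int* piDst, int* piSrc, unsigned long SizeInBytes)
{
    unsigned long dwNumElements = SizeInBytes / sizeof(int);
    unsigned long dwNumPacks = dwNumElements / (128/(sizeof(int)*8));
    _asm
    {
        pusha;                      // save all general-purpose registers
    begin:
        mov ecx,SizeInBytes;        // ecx = bytes still to copy
        mov edi,piDst;              // edi = destination base address
        mov esi,piSrc;              // esi = source base address
    begina:
        cmp ecx,0;                  // nothing left?
        jz end;                     // then we are done
    body:
        mov ebx,SizeInBytes;
        sub ebx,ecx;                // ebx = byte offset copied so far
        movdqa xmm1,[esi+ebx];      // load 16 bytes (four ints) from the source
        movdqa [edi+ebx],xmm1;      // store those 16 bytes to the destination
    bodya:
        sub ecx,16;                 // 16 bytes fewer to go
        jmp begina;
    end:
        popa;                       // restore the saved registers
    }
    return 0;
}
#pragma warning(pop)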
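One detail worth checking when you experiment: movdqa faults if its memory operand is not on a 16-byte boundary, which is why the buffers in main come from an aligned allocator. A minimal sketch of verifying that, using the _mm_malloc and _mm_free helpers from xmmintrin.h instead of the demo's _aligned_malloc; the names IsAligned16 and AllocInts16 are made up for this example.

```cpp
#include <stddef.h>     // size_t, NULL
#include <stdint.h>     // uintptr_t
#include <xmmintrin.h>  // _mm_malloc, _mm_free

// Hypothetical helper: 1 if p sits on a 16-byte boundary, 0 otherwise.
int IsAligned16(const void* p)
{
    return ((uintptr_t)p & 15) == 0;
}

// Hypothetical helper: allocates count ints on a 16-byte boundary.
// Release the result with _mm_free, not free.
int* AllocInts16(size_t count)
{
    return (int*)_mm_malloc(count * sizeof(int), 16);
}
```

Checking IsAligned16 on both pointers before calling the optimised function turns a hard-to-read crash into an early, obvious failure.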
Finally
This is my first article on Code Project, so please bear with me if something is not right. I also hope the demo I uploaded is simple enough for beginners. Nothing fancy. Learning is fun, right? =)
Links
History
I will only update this article when people request it. The sample code will not be maintained.