In this article, I focus on an array-sum example to show how to use modern MMX, SSE, and AVX instructions for optimization in x86 Assembly language with MASM. I create a benchmark that compares four implementations, a traditional loop against three SIMD methods, and reports significant differences in measured execution time.
Introduction
SIMD (Single Instruction Multiple Data) is a computing technique that performs the same operation on multiple data items simultaneously. A single Assembly instruction can operate on several values in parallel to exploit data-level parallelism. Typical SIMD operations include basic arithmetic such as addition, subtraction, multiplication, and division, as well as shifts, compares, and data conversions. Processors facilitate SIMD operations by reinterpreting the bit pattern of an operand in a register or memory location. A 32-bit register, for example, containing a single 32-bit DWORD integer can also accommodate two 16-bit WORDs or four 8-bit BYTEs.
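As a small illustration of that reinterpretation, here is a minimal C++ sketch of my own (not from the original article) that views the same 32 bits as one DWORD, two WORDs, or four BYTEs on a little-endian x86 machine:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    uint32_t dword = 0x04030201;               // one 32-bit DWORD
    uint16_t words[2];
    uint8_t  bytes[4];
    std::memcpy(words, &dword, sizeof dword);  // the same bits as two 16-bit WORDs
    std::memcpy(bytes, &dword, sizeof dword);  // the same bits as four 8-bit BYTEs
    std::printf("DWORD: %08X\n", dword);
    std::printf("WORDs: %04X %04X\n", words[0], words[1]);      // 0201 0403
    std::printf("BYTEs: %02X %02X %02X %02X\n",
                bytes[0], bytes[1], bytes[2], bytes[3]);         // 01 02 03 04
    return 0;
}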
Most modern CPU architectures include SIMD instructions to improve software performance. SIMD is particularly applicable to common tasks such as processing pixels in an image or adjusting digital volume in audio. Among other things, software compilers rely heavily on modern SIMD instructions to optimize the code they generate for production, especially in release builds. Visual C++, for example, can make a release build of a project run much faster than the same logic written in traditional Assembly language.
SIMD technology developed in several stages alongside CPU designs over the years. The first widely deployed extension was Intel's MMX for x86 in 1996. SSE (Streaming SIMD Extensions) followed in 1999 and was subsequently expanded to SSE2, SSE3, SSSE3, and SSE4. AVX (Advanced Vector Extensions) was announced in 2008 and later expanded to AVX2 and AVX-512 in 2013.
In this article, I won't go into the details of MMX, SSE, and AVX. I simply focus on an array-sum example to show how to use some modern instructions for optimization. You should be familiar with traditional x86 Assembly language, preferably with MASM on Windows. To learn SIMD, I recommend reading the first edition of the Apress book Modern X86 Assembly Language Programming by Daniel Kusswurm, which offers a nice and convenient approach to SIMD programming with the help of Visual Studio. Some of the MMX material cited here also comes from that book.
About Array Sum Example
The code logic of the array sum example can be seen in the following C/C++ function, where the parameter x is a BYTE array and n is the array count:
DWORD SumArrayCPP(const BYTE* x, unsigned int n)
{
DWORD sum = 0;
for (unsigned int i = 0; i < n; i++)
sum += x[i];
return sum;
}
I have implemented this function in x86 Assembly procedures using either traditional (original) or modern (SIMD) instructions, along with input and output. To measure execution time, I run each procedure repeatedly for a number of iterations and compare efficiency. The benchmark lets you enter the number of iterations, e.g., one million (1,000,000). The test generates a large array of 64K (65536) pseudo-random BYTE integers ranging from 1 to 255. After a run, the results appear as in the following console output, captured on my i7-8565U CPU with 16.0 GB of memory and 64-bit Windows 10.
As you can see, all four tasks get the same sum result. The original method with traditional x86 instructions requires over 80 seconds. Among the SIMD technologies, the MMX method takes only about 10% of that, finishing in 7 seconds; the SSE method takes less, at 4 seconds; and the AVX method cuts that roughly in half again, down to 2 seconds. A remarkable improvement from radical changes in x86 Assembly execution.
Now let's take a look at each implementation one by one. I'll also show the entry point of the benchmark test later, along with the Irvine32 library support.
Simple Logic in Traditional Way
The following procedure SumArrayOriginal is easy to understand. Throughout this article, I use the same prototype with two parameters, the byte array address arrayPtr and the array count, matching those of SumArrayCPP. I set up a loop that simply adds each byte, zero-extended to a doubleword, and returns the sum in EAX:
SumArrayOriginal PROC uses ebx esi ecx, arrayPtr:PTR BYTE, count:DWORD
mov eax,0                      ; EAX accumulates the sum
mov esi,arrayPtr               ; ESI walks the byte array
mov ecx,count                  ; ECX counts the remaining bytes
cmp ecx,0
jle L2                         ; nothing to do for an empty array
L1:
movzx ebx,BYTE PTR [esi]       ; zero-extend the current byte to a doubleword
add eax,ebx
add esi,TYPE BYTE              ; advance to the next byte
loop L1
L2:
ret
SumArrayOriginal ENDP
MMX Execution Logic and Implementations
MMX technology adds eight 64-bit registers, named MM0 to MM7, as shown below in the Visual Studio Registers window in debugging.
They can be used to perform SIMD operations using eight 8-bit integers as packed BYTEs, four 16-bit integers as packed WORDs, or two 32-bit integers as packed DWORDs:
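To get a feel for packed operation, here is a small C++ sketch of my own using the MMX intrinsics, which mirror these instructions (the intrinsics live in <mmintrin.h>; note that 64-bit MSVC does not support them, so build it as 32-bit or use GCC/Clang):

#include <mmintrin.h>   // MMX intrinsics
#include <cstdio>
#include <cstring>

int main()
{
    // Two __m64 values, each holding eight packed 8-bit integers
    __m64 a = _mm_set_pi8(8, 7, 6, 5, 4, 3, 2, 1);
    __m64 b = _mm_set_pi8(10, 10, 10, 10, 10, 10, 10, 10);

    __m64 r = _mm_add_pi8(a, b);        // a single paddb: eight byte additions at once

    unsigned char out[8];
    std::memcpy(out, &r, sizeof r);
    _mm_empty();                        // emms: clear the MMX state before other code

    for (int i = 0; i < 8; ++i)
        std::printf("%d ", out[i]);     // prints 11 12 13 14 15 16 17 18
    std::printf("\n");
    return 0;
}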
The procedure SumArrayMMX returns the same result as the previous SumArrayOriginal but cuts the execution time by roughly 90%. The improvement comes from packed additions. Ten MMX instructions are used here: pxor, movq, punpcklbw, punpckhbw, paddw, punpcklwd, punpckhwd, paddd, psrlq, and movd. Detailed descriptions can be found in the x86 and amd64 instruction reference. Here, I walk through the SumArrayMMX implementation step by step:
SumArrayMMX keeps the same two-parameter prototype as before. In the initialization, I set the loop counter ECX to count divided by 16, because each iteration processes 16 bytes at once in packed additions. For simplicity, this requires the array size to be a multiple of 16, as 64K (65536) is here. Then I set EAX to the array address and clear mm4, mm5, and mm7:
SumArrayMMX PROC uses ECX, arrayPtr:PTR BYTE, count:DWORD
mov ecx, count
shr ecx, 4
mov eax,arrayPtr
pxor mm4,mm4
pxor mm5,mm5
pxor mm7,mm7
Next, start a loop in which the first 8 bytes are loaded into mm0 and copied to mm2, while the next 8 bytes are loaded into mm1 and copied to mm3:
@@:
movq mm0,[eax]
movq mm1,[eax+8]
movq mm2,mm0
movq mm3,mm1
An example of mm1, mm3, mm0, and mm2 can look like:
Now I have to extend each of the bytes above to a word before the packed additions. Simply use punpcklbw to unpack the low bytes to words and punpckhbw to unpack the high bytes to words.
punpcklbw mm0,mm7
punpcklbw mm1,mm7
punpckhbw mm2,mm7
punpckhbw mm3,mm7
With this example, mm1, mm3, mm0, and mm2 can be:
Now it is time to perform the packed word additions in parallel:
paddw mm0,mm2
paddw mm1,mm3
These are the results:
Another packed word addition adds the above mm0 and mm1, with the result in mm0:
paddw mm0,mm1
We now have 4 words:
Next, do similar steps for the words in mm0. First copy mm0 to mm1, then extend each word to a doubleword with punpcklwd (low words to doublewords) and punpckhwd (high words to doublewords):
movq mm1,mm0
punpcklwd mm0,mm7
punpckhwd mm1,mm7
Then I have:
Finally, packed-add two doublewords to mm4 and the other two doublewords to mm5. These doublewords keep accumulating in mm4 and mm5 on each iteration of the loop until it is done.
paddd mm4,mm0
paddd mm5,mm1
add eax,16
dec ecx
jnz @B
When everything is calculated, I exit the loop with mm4 holding two doublewords and mm5 another two. Now perform a packed doubleword addition of mm4 into mm5, with the result in mm5. To calculate the final sum, I copy mm5 to mm6 and shift the high doubleword of mm6 right into the low position. The last packed addition then adds mm5 into mm6. The final sum is the low doubleword of mm6, which is moved into EAX as the return value.
paddd mm5,mm4
movq mm6,mm5
psrlq mm6,32
paddd mm6,mm5
movd eax,mm6
ret
SumArrayMMX ENDP
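For readers who prefer to see the same flow in C++, here is a rough intrinsics translation of my own; it is a sketch under the same assumptions (count is a multiple of 16, and a 32-bit MSVC build or GCC/Clang for the MMX intrinsics), not the article's MASM code:

#include <mmintrin.h>   // MMX intrinsics

typedef unsigned char BYTE;
typedef unsigned long DWORD;

// Intrinsics sketch of the SumArrayMMX logic; register comments map to the MASM version.
DWORD SumArrayMMX_Intrin(const BYTE* x, unsigned int count)
{
    __m64 zero = _mm_setzero_si64();                 // pxor mm7,mm7
    __m64 acc0 = _mm_setzero_si64();                 // mm4
    __m64 acc1 = _mm_setzero_si64();                 // mm5

    for (unsigned int i = 0; i < count; i += 16)
    {
        __m64 lo = *(const __m64*)(x + i);           // movq mm0,[eax]
        __m64 hi = *(const __m64*)(x + i + 8);       // movq mm1,[eax+8]

        // Bytes -> words (punpcklbw / punpckhbw against zero), then packed word adds
        __m64 w = _mm_add_pi16(
                      _mm_add_pi16(_mm_unpacklo_pi8(lo, zero), _mm_unpackhi_pi8(lo, zero)),
                      _mm_add_pi16(_mm_unpacklo_pi8(hi, zero), _mm_unpackhi_pi8(hi, zero)));

        // Words -> doublewords (punpcklwd / punpckhwd), accumulate with paddd
        acc0 = _mm_add_pi32(acc0, _mm_unpacklo_pi16(w, zero));
        acc1 = _mm_add_pi32(acc1, _mm_unpackhi_pi16(w, zero));
    }

    __m64 s = _mm_add_pi32(acc0, acc1);              // paddd mm5,mm4
    s = _mm_add_pi32(s, _mm_srli_si64(s, 32));       // psrlq + paddd: fold the high dword
    DWORD sum = (DWORD)_mm_cvtsi64_si32(s);          // movd eax,mm6
    _mm_empty();                                     // emms
    return sum;
}

Calling SumArrayMMX_Intrin(array, 65536) should give the same value as the MASM SumArrayMMX.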
SSE Execution Logic and Implementations
SSE technology adds eight 128-bit registers, named XMM0 to XMM7, as shown below in the Visual Studio Registers window in debugging.
They can be used to perform SIMD operations using sixteen 8-bit integers as packed BYTEs, eight 16-bit integers as packed WORDs, four 32-bit integers as packed DWORDs, or two 64-bit integers as packed QWORDs:
The procedure SumArraySSE returns the same result as the previous ones but cuts the execution time of SumArrayMMX roughly in half. The improvement is intuitive, since the registers double in size for the packed operations. The SSE instructions used here are almost the same as their MMX counterparts, as shown in the following sections.
SumArraySSE keeps the same two-parameter prototype. Now I can process 32 bytes per block in packed additions, so I initialize the loop counter ECX to count divided by 32. Again, for simplicity, this requires the array size to be a multiple of 32, as the size 64K (65536) is here. Then I set EAX to the array address and clear xmm4, xmm5, and xmm7:
SumArraySSE PROC uses ECX, arrayPtr:PTR BYTE, count:DWORD
mov ecx, count
shr ecx, 5
mov eax,arrayPtr
pxor xmm4,xmm4
pxor xmm5,xmm5
pxor xmm7,xmm7
This time, I'll trace an example directly from the Visual Studio debug environment. The first block of 32 bytes in memory could look like:
Similar to the MMX method, I load the first two 16-byte blocks into xmm0 and xmm1:
@@:
movdqa xmm0,xmmword ptr [eax]
movdqa xmm1,xmmword ptr [eax+16]
Note the little-endian byte order in memory, per the Intel specification; xmm0 and xmm1 become:
Next, extend bytes to words using four registers:
movdqa xmm2,xmm0
movdqa xmm3,xmm1
punpcklbw xmm0,xmm7
punpcklbw xmm1,xmm7
punpckhbw xmm2,xmm7
punpckhbw xmm3,xmm7
Now xmm0, xmm1, xmm2, and xmm3 are all packed words promoted by punpcklbw and punpckhbw:
Then perform the packed word additions:
paddw xmm0,xmm2
paddw xmm1,xmm3
The results are eight packed words in xmm0 and another eight words in xmm1:
Now extend the values from packed words to doublewords by copying into xmm2 and xmm3 and applying four promotions:
movdqa xmm2,xmm0
movdqa xmm3,xmm1
punpcklwd xmm0,xmm7
punpcklwd xmm1,xmm7
punpckhwd xmm2,xmm7
punpckhwd xmm3,xmm7
I now have four packed doublewords in each of xmm0, xmm1, xmm2, and xmm3:
Then form the sums with packed doubleword additions:
paddd xmm0,xmm2
paddd xmm1,xmm3
As a result, there are four packed doublewords in xmm0 and four in xmm1 as well:
Finally, similar to the MMX code logic, I packed-add four doublewords into xmm4 and four doublewords into xmm5. The registers xmm4 and xmm5 keep accumulating from xmm0 and xmm1 on each iteration until the loop is done.
paddd xmm4,xmm0
paddd xmm5,xmm1
add eax,32
dec ecx
jnz @B
On exit from the loop, xmm4 and xmm5 each contain four doublewords as packed sums.
Now perform a packed doubleword addition of xmm4 into xmm5. To calculate the final sum, a few more steps are needed. I copy xmm5 to xmm6 and shift the high quadword of xmm6 down by 64 bits (8 bytes with PSRLDQ). Unlike MMX's psrlq, which takes a shift count in bits, PSRLDQ (Packed Shift Right Logical Double Quadword) takes its shift count in bytes. For details, please check the x86 and amd64 instruction reference. I then do a packed addition to get the sum in xmm5.
paddd xmm5,xmm4
movdqa xmm6,xmm5
PSRLDQ xmm6, 8
paddd xmm5,xmm6
Actually, the sum is now contained in the lower quadword of xmm5 as two doublewords, so the high quadword of xmm5 no longer matters:
The last step is to add the two lower doublewords in xmm5 together. I copy xmm5 to xmm6 and shift the high doubleword of xmm6 down by 32 bits (4 bytes with PSRLDQ). After the last packed doubleword addition, the final sum in the lower doubleword of xmm6 is passed to EAX as the return value.
movdqa xmm6,xmm5
PSRLDQ xmm6, 4
paddd xmm6,xmm5
movd eax,xmm6
ret
SumArraySSE ENDP
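Here is the corresponding C++ intrinsics sketch of my own for the SSE version; it assumes count is a multiple of 32 and, because movdqa is an aligned load, that the array is 16-byte aligned:

#include <emmintrin.h>  // SSE2 intrinsics

typedef unsigned char BYTE;
typedef unsigned long DWORD;

// Intrinsics sketch of the SumArraySSE logic; not the article's MASM code.
DWORD SumArraySSE_Intrin(const BYTE* x, unsigned int count)
{
    const __m128i zero = _mm_setzero_si128();        // xmm7
    __m128i acc0 = _mm_setzero_si128();              // xmm4
    __m128i acc1 = _mm_setzero_si128();              // xmm5

    for (unsigned int i = 0; i < count; i += 32)
    {
        __m128i a = _mm_load_si128((const __m128i*)(x + i));       // movdqa
        __m128i b = _mm_load_si128((const __m128i*)(x + i + 16));

        // Bytes -> words, then packed word additions
        __m128i w0 = _mm_add_epi16(_mm_unpacklo_epi8(a, zero), _mm_unpackhi_epi8(a, zero));
        __m128i w1 = _mm_add_epi16(_mm_unpacklo_epi8(b, zero), _mm_unpackhi_epi8(b, zero));

        // Words -> doublewords, then accumulate with paddd
        acc0 = _mm_add_epi32(acc0, _mm_add_epi32(_mm_unpacklo_epi16(w0, zero),
                                                 _mm_unpackhi_epi16(w0, zero)));
        acc1 = _mm_add_epi32(acc1, _mm_add_epi32(_mm_unpacklo_epi16(w1, zero),
                                                 _mm_unpackhi_epi16(w1, zero)));
    }

    // Horizontal reduction, mirroring the PSRLDQ byte shifts
    __m128i s = _mm_add_epi32(acc0, acc1);
    s = _mm_add_epi32(s, _mm_srli_si128(s, 8));      // fold the high quadword
    s = _mm_add_epi32(s, _mm_srli_si128(s, 4));      // fold the high doubleword
    return (DWORD)_mm_cvtsi128_si32(s);
}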
AVX Execution Logic and Implementations
AVX technology adds another 128 bits to extend the SSE registers XMM0 to XMM7 into eight 256-bit registers, named YMM0 to YMM7, as shown below in the debugger:
They can be used to perform SIMD operations using thirty-two 8-bit integers as packed BYTEs, sixteen 16-bit integers as packed WORDs, eight 32-bit integers as packed DWORDs, or four 64-bit integers as packed QWORDs.
The procedure SumArrayAVX returns the same result as the previous three but reduces the execution time even further. This is reasonable, because the register size doubles again from 128 to 256 bits. The AVX instructions used here are also similar, apart from the 'v' prefix and the three-operand form.
Since a YMM register is so large, at 256 bits, I would like to simplify the SumArrayAVX logic rather than follow the previous MMX and SSE pattern. In the loop, I process 32 bytes at a time in a single block instead of 64 bytes in two branches. For this purpose, I still initialize the loop counter ECX to count divided by 32. When the loop starts, I simply load 32 bytes into ymm1:
SumArrayAVX PROC uses ECX, arrayPtr:PTR BYTE, count:DWORD
mov ecx, count
shr ecx, 5
mov eax,arrayPtr
vpxor ymm0, ymm0, ymm0
vpxor ymm5, ymm5, ymm5
@@:
vmovdqa ymm1,ymmword ptr [eax]
Let's still use the same array example as before, with the first 32 bytes in memory like:
At the moment of loading, ymm1 gets all 32 bytes from the little-endian memory like this:
Next, extend bytes to words and perform a packed word addition:
VPUNPCKLBW ymm2, ymm1, ymm0
VPUNPCKHBW ymm3, ymm1, ymm0
vpaddw ymm4, ymm2, ymm3
This is the packed word result in ymm4, where you can manually add ymm2 and ymm3 to verify:
Next, extend words to doublewords in ymm2 and ymm3, and perform a packed addition into ymm4:
vmovdqa ymm1,ymm4
VPUNPCKLWD ymm2, ymm1, ymm0
VPUNPCKHWD ymm3, ymm1, ymm0
vpaddd ymm4, ymm2, ymm3
This is again the packed doubleword result in ymm4, which can be manually verified:
The register ymm5 is the sum collector, continuously adding the eight doublewords from ymm4 on each iteration until the loop is done:
vpaddd ymm5, ymm5, ymm4
add eax,32
dec ecx
jnz @B
When done, to get the final sum, I have to play a few more tricks here. I need to use VPERMQ with the encoding 00001110b to move the upper two QWORDs of ymm5 into the lower half of ymm4. Please consult the x86 and amd64 instruction reference for this Qwords Element Permutation instruction.
vmovdqa ymm4, ymm5
vpermq ymm4, ymm4, 00001110b
Then I have the lower two quadwords ready in both ymm4 and ymm5:
Now I only need to focus on the two lower quadwords of ymm4 and ymm5, each containing 4 doublewords. This is exactly the same scenario as at the end of SumArraySSE. Because YMM0, ..., YMM7 are just 128-bit extensions of XMM0, ..., XMM7, I can access their lower two quadwords under the XMM names used in SSE. Thus, I simply reuse that part of the code for convenience:
paddd xmm5,xmm4
movdqa xmm6,xmm5
PSRLDQ xmm6, 8
paddd xmm5,xmm6
movdqa xmm6,xmm5
PSRLDQ xmm6, 4
paddd xmm6,xmm5
movd eax,xmm6
ret
SumArrayAVX ENDP
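And the AVX counterpart as a C++ intrinsics sketch of my own; the 256-bit integer operations require AVX2, and the vmovdqa-style aligned load assumes a 32-byte-aligned array with count a multiple of 32:

#include <immintrin.h>  // AVX/AVX2 intrinsics

typedef unsigned char BYTE;
typedef unsigned long DWORD;

// Intrinsics sketch of the SumArrayAVX logic; not the article's MASM code.
DWORD SumArrayAVX_Intrin(const BYTE* x, unsigned int count)
{
    const __m256i zero = _mm256_setzero_si256();     // ymm0
    __m256i acc = _mm256_setzero_si256();            // ymm5

    for (unsigned int i = 0; i < count; i += 32)
    {
        __m256i v = _mm256_load_si256((const __m256i*)(x + i));    // vmovdqa

        // Bytes -> words and packed word addition (vpunpcklbw / vpunpckhbw / vpaddw)
        __m256i w = _mm256_add_epi16(_mm256_unpacklo_epi8(v, zero),
                                     _mm256_unpackhi_epi8(v, zero));

        // Words -> doublewords and packed doubleword addition, accumulated in ymm5
        __m256i d = _mm256_add_epi32(_mm256_unpacklo_epi16(w, zero),
                                     _mm256_unpackhi_epi16(w, zero));
        acc = _mm256_add_epi32(acc, d);
    }

    // vpermq-style fold of the upper 128 bits onto the lower 128 bits
    __m256i hi = _mm256_permute4x64_epi64(acc, 0x0E);    // like vpermq ..., 00001110b
    __m128i s  = _mm_add_epi32(_mm256_castsi256_si128(acc),
                               _mm256_castsi256_si128(hi));

    // Same tail as the SSE version
    s = _mm_add_epi32(s, _mm_srli_si128(s, 8));
    s = _mm_add_epi32(s, _mm_srli_si128(s, 4));
    return (DWORD)_mm_cvtsi128_si32(s);
}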
The Benchmark Test Drive
To test the four procedures, from SumArrayOriginal to SumArrayMMX/SSE/AVX, I have to create a benchmark program that lets you enter the loop count and shows the measurements. One way is simply to delegate all input/output tasks to a high-level language, which is as easy as in some bootcamp tutorials. But for our benchmark, I prefer to make it more direct and accurately measured in pure Assembly code, with I/O called straight from the OS. For this purpose, I recommend linking to the Irvine32 library from Assembly Language for x86 Processors. You can go to Getting started with MASM and Visual Studio 2019 to download and install that statically linked library and build my attached project. For reference, please see the online Irvine32 Library Help borrowed from California State University Dominguez Hills.
I need two helpers; one is to generate pseudo-random bytes and save them in an array:
FillArray PROC USES eax edi ecx, pArray: DWORD, Count: DWORD
LOWVAL = 1h
HIGHVAL = 0FFh
mov edi,pArray           ; EDI is the destination pointer for stosb
mov ecx,Count
L1:
mov eax,HIGHVAL
call RandomRange         ; Irvine32: random integer in [0, EAX-1]
add eax,LOWVAL           ; shift the range to 1..255
stosb                    ; store AL and advance EDI
loop L1
ret
FillArray ENDP
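For comparison, a C++ counterpart of this helper might look like the following sketch of mine, with std::rand standing in for the Irvine32 RandomRange call:

#include <cstdlib>

typedef unsigned char BYTE;

// Fill a buffer with pseudo-random bytes in the range 1..255.
void FillArrayCPP(BYTE* pArray, unsigned int count)
{
    for (unsigned int i = 0; i < count; ++i)
        pArray[i] = (BYTE)(std::rand() % 0xFF + 1);   // like RandomRange(0FFh) + LOWVAL
}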
The other is a reusable one that outputs the result of each of the four procedure calls:
OutputResult PROC uses edx,
resSum:DWORD, msec:DWORD, labMethod:PTR BYTE, labTime:PTR BYTE
call GetMseconds         ; Irvine32: milliseconds elapsed since midnight
sub eax, msec            ; elapsed time = now - start
mov msec, eax
mov EDX, labMethod
call WriteString         ; print the method label
mov eax, resSum
call WriteDec            ; print the sum
mov EDX, labTime
call WriteString         ; print the time label
mov eax, msec
call WriteDec            ; print the elapsed milliseconds
ret
OutputResult ENDP
Our benchmark entry point can start like this:
mainBenchmark PROC
mov EDX, offset InputLabel2
call WriteString                               ; prompt for the number of iterations
call ReadDec                                   ; read it into EAX
mov LoopCounter, eax
INVOKE FillArray, ADDR array, LENGTHOF array   ; fill the 64K array with random bytes
Then I do each test like this:
call GetMseconds                               ; record the start time
mov tm,eax
mov ECX, LoopCounter                           ; repeat the procedure LoopCounter times
LL1:
INVOKE SumArrayOriginal, ADDR array, MAX_ARRAY_SIZE
loop LL1
INVOKE OutputResult, eax, tm, addr LbSumOrg, addr TimeElapsed
call Crlf
call GetMseconds
mov tm,eax
mov ECX, LoopCounter
LL2:
INVOKE SumArrayMMX, ADDR array, MAX_ARRAY_SIZE
loop LL2
INVOKE OutputResult, eax, tm, addr LbSumMMX, addr TimeElapsed
call crlf
...
...
call WaitMsg
exit
mainBenchmark ENDP
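In C++ terms, each timed test follows the pattern in this sketch of mine, with std::chrono standing in for GetMseconds and a constant-filled vector standing in for the random array:

#include <chrono>
#include <cstdio>
#include <vector>

typedef unsigned char BYTE;
typedef unsigned long DWORD;

static DWORD SumArrayCPP(const BYTE* x, unsigned int n)
{
    DWORD sum = 0;
    for (unsigned int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

int main()
{
    const unsigned int size = 65536, iterations = 1000000;
    std::vector<BYTE> data(size, 1);                 // stand-in for the random array

    using namespace std::chrono;
    DWORD sum = 0;
    auto start = steady_clock::now();                // like GetMseconds before the loop
    for (unsigned int i = 0; i < iterations; ++i)    // same pattern as the LL1 loop
        sum = SumArrayCPP(data.data(), size);
    auto ms = duration_cast<milliseconds>(steady_clock::now() - start).count();

    std::printf("Sum = %lu, elapsed = %lld ms\n", (unsigned long)sum, (long long)ms);
    return 0;
}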
Last but not least is the MASM 32-bit directive .XMM, which must be declared at the beginning. The directive name sounds vague, or even inaccurate, because it enables assembly of all the SIMD instructions required in this article, including MMX, SSE, and AVX.
Attention to Assembly Programming
The Assembly code discussed in this article can be downloaded along with its MASM project file and run in a recent version of Visual Studio. However, even when Visual Studio works fine with VC++, that does not mean it can build and run a pure Assembly code project. Mainly for security reasons, building or launching an Assembly executable can be blocked by an anti-virus application or by system settings. The resulting fatal error from Visual Studio can be LNK1104, ML A1000, etc., which is hard to diagnose because no specific information is given.
This is an example of Assembly code being blocked by Defender in Windows 10:
Here are some workarounds. For an anti-virus application, if there is an editable exception or exclusion option, add your Assembly test folders to bypass the scan. The following is the Exclusions option in Defender on Windows 10:
If there is no such option but the scanner can be temporarily disabled, just disable it for a while to test, though this will leave your computer unprotected. You can also read Was your program's EXE file blocked by an anti-virus scanner?
Summary
I have spent a great deal of effort exploring optimizations for such a simple program. As you can imagine, far more effort must be invested in a software compiler to make it an intelligent utility. To peek at compiler behavior, you can reverse engineer a C/CPP program into an Assembly listing with a disassembler. Two builds are available: the Debug build is relatively readable, mapping almost directly to the original logic in traditional Assembly, while the Release build is hard to understand because it is usually optimized with modern Assembly such as SIMD. You can try building the SumArrayCPP function mentioned here in Visual Studio to see this, although disassemblers on other platforms will produce a different Assembly listing. Note also that compilers are regularly upgraded to incorporate new instructions as the technology advances. Thus, the Release disassembly listing can differ between Visual Studio 2010 and 2019, as different optimization methods are applied over time, much as in this article.
The immediate motivation for this article is to create a concise supplement for my students in CSCI 241 Assembly Language Programming. The course outline mainly focuses on traditional x86-64 Assembly language programming. After that intensive learning, the students would like a broader look and feel for modern Assembly language. Although you can find nice books that provide plenty of detailed SIMD descriptions, it is hard to pick one to lecture from in a couple of hours. Therefore, I hope this article can serve as useful reading and practice material for beginners interested in modern Assembly language.
In today's software industry, different types of assembly languages play an important role in new devices, mobile platforms, microchip design, and so on. In particular, time and space efficiency is always emphasized as a critical factor in Assembly project performance. Academically, Assembly Language Programming is considered one of the more demanding courses in Computer Science at most colleges. As technology advances, optimization remains a good topic to discuss in ever more areas.
History
- 26th March, 2021: Original version posted