Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / VC++

Using SIMD to Optimize x86 Assembly Code in Array Sum Example

4.83/5 (5 votes)
26 Mar 2021CPOL14 min read 19.5K   109  
A discussion on optimization using SIMD instructions including MMX, SSE, and AVX in x86 Assembly Language Programming
In this article, I simply focus on an example of array sum to show how to use some modern instructions to do optimization with MMX, SSE, and AVX in x86 Assembly language on MASM platform. I create a benchmark test by comparing four implementations between a traditional way and three SIMD methods with significant results of time measurement.

Contents

Introduction

SIMD (Single Instruction Multiple Data) is a computing element that performs the same operation on multiple data items simultaneously. An instruction in Assembly language can work on the parallel computations to exploit data level parallelism at a moment. Typical SIMD operations include basic arithmetic such as addition, subtraction, multiplication, division, and also shifts, compares, and data conversions. Processors facilitate SIMD operations by reinterpreting the bit pattern of an operand in a register or memory location. A 32-bit register, for example, containing a single 32-bit DWORD integer can also accommodate two 16-bit WORDs or four 8-bit BYTEs.

Most modern CPU architectures include SIMD instructions to improve software performance. SIMD is particularly applicable to common tasks such as processing pixels in image or adjusting digital volume in audio. Among other things, software compilers heavily rely on modern SIMD instructions to optimize code execution in production, by compiling a program into its binaries in release version. Visual C++, for example, can make its project in release build to run much faster than the similar code logic in the traditional Assembly language.

The development of SIMD technology went through several stages corresponding to the modern CPU designs in years. The first widely deployed was Intel's MMX extensions to x86 in 1996. Then SSE (Streaming SIMD Extensions) was introduced in 1999 and subsequently expanded to SSE2, SSE3, SSSE3, and SSE4. Then AVX (Advanced Vector Extensions) came in 2008 and expanded to AVX2 and AVX-512 in 2013.

In this article, I’ll not talk about details of MMX, SSE, and AVX. I simply focus on an example of array sum to show how to use some modern instructions to do our optimizations. You have to be familiar with traditional x86 Assembly language, preferred on MASM in Windows system. To learn SIMD, I recommend to read the first edition of Apress’ book Modern X86 Assembly Language Programming by Daniel Kusswurm. This could be a nice and convenient approach to SIMD programming with help of Visual Studio. Some of MMX materials cited here are also from this book.

About Array Sum Example

The code logic of array sum example can be seen in the following C/C++ function where parameter x is a BYTE array and n is the array count:

C++
// Code logic of array sum in C/C++
DWORD SumArrayCPP(const BYTE* x, unsigned int n)
{
    DWORD sum = 0;
    for (unsigned int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

I have implemented such a function in x86 Assembly procedures by either traditional (original) or modern (SIMD) instructions, as well as input and output. In order to measure execution time, I run each procedure repeatedly in a number of iterations for efficiency comparison. The benchmark test will let you enter the number of iterations, e.g., one million (1,000,000), The test generates a large array of 64K (65536) pseudo random BYTE integers ranged from 1 to 255. After run, the results could be shown as the following console output, which is on my i7-8565U CPU, 16.0 GB memory and 64-bit Windows 10.

   ArySumSimdFig1: console output

As you can see, all four tasks get the same sum result. The original method with x86 traditional instructions requires over 80 seconds. For SIMD technologies, MMX method takes only 10% to make it done in 7 seconds, SSE method takes less in 4 seconds, and AVX method makes another half down to 2 seconds. A remarkable improvement with radical changes in x86 Assembly execution.

Now let’s take a look at each implementation one by one. Also, I’ll show the entry point of benchmark test later followed with irvine32 library support.

Simple Logic in Traditional Way

The following procedure SumArrayOriginal is easy to understand. In this article, I always use such a prototype with two parameters, the byte array address arrayPtr and an array count, the same as those in SumArrayCPP. I set a loop to simply add each byte as zero-extended to doubleword and return the sum in EAX:

ASM
;-------------------------------------------------------------
SumArrayOriginal PROC uses ECX, arrayPtr:PTR BYTE, count:DWORD
; Performs a summation for an 8-bit BYTE array
; Returns: sum in EAX
;-------------------------------------------------------------
   mov  eax,0                  ; set the sum to zero
   mov  esi,arrayPtr
   mov  ecx,count
   cmp  ecx,0                  ; array size <= 0?
   jle  L2                     ; yes: quit

 L1:   
   movzx  EBX, BYTE PTR[esi]   ; add each integer to sum
   add EAX, EBX
   add  esi,TYPE BYTE          ; point to next integer
   loop L1                     ; repeat for array size
 L2:   
   ret   
SumArrayOriginal ENDP

MMX Execution Logic and Implementations

MMX technology adds eight 64-bit registers, named MM0 to MM7, as shown below in the Visual Studio Registers window in debugging.

   ArySumSimdFig2: MMX technology adds eight 64-bit registers

They can be used to perform SIMD operations using eight 8-bit integers as packed BYTEs, four 16-bit integers as packed WORDs, or two 32-bit integers as packed DWORDs:

ArySumSimdFig3: eight 8-bit integers as packed BYTEs

The procedure SumArrayMMX returns the same result as the previous SumArrayOriginal but significantly reduces 90% execution time. The improvement is achieved based on packed additions. Ten MMX instructions mentioned here, pxor, movq, punpcklbw, punpckhbw, paddw, punpcklwd, punpckhwd, paddd, psrlq, and movd. Detailed description can be found to search x86 and amd64 instruction reference. Here, I talk about the SumArrayMMX implementation in several code logic step by step:

SumArrayMMX maintains the same prototype with two parameters as before. In initialization, I set the loop counter ECX as count divided by 16, because of each iteration to process 16 bytes once in packed additions. For simplicity, this requires an array size to be multiple of 16 like 64K (65536) here. Then set EAX as the array address with mm4, mm5, and mm7 cleared:

ASM
;-------------------------------------------------------------
SumArrayMMX PROC uses ECX, arrayPtr:PTR BYTE, count:DWORD
; Performs a summation for an 8-bit BYTE array
; Returns: sum in EAX
;-------------------------------------------------------------
; Initialize
   mov ecx, count                   ;ecx = 'n'
   shr ecx, 4                       ;ecx = number of 16-byte blocks
   mov eax,arrayPtr                 ;pointer to array 'x'
   pxor mm4,mm4
   pxor mm5,mm5                     ;mm5:mm4 = packed sum (4 dwords)
   pxor mm7,mm7                     ;mm7 = packed zero for promotions

Next, start a loop where the first 8 bytes are saved in mm0 and mm2, while the second 8 bytes immediately followed are saved in mm1 and mm3:

ASM
; Load the next block of 16 array values 
@@:  
   movq mm0,[eax]
   movq mm1,[eax+8]                 ;mm1:mm0 = 16 byte block
   movq mm2,mm0
   movq mm3,mm1

An example of mm1, mm3, mm0, and mm2 can look like:

ArySumSimdFig4: An example of mm1, mm3, mm0, and mm2 can look like

Now I have to extend above each byte to word before packed additions. Simply use punpcklbw, lower byte to word, and punpckhbw, high byte to word.

ASM
; Promote array values from bytes to words
   punpcklbw mm0,mm7                ;mm0 = 4 words
   punpcklbw mm1,mm7                ;mm1 = 4 words
   punpckhbw mm2,mm7                ;mm2 = 4 words
   punpckhbw mm3,mm7                ;mm3 = 4 words

With this example, mm1, mm3, mm0, and mm2 can be:

ArySumSimdFig5: With this example, mm1, mm3, mm0, and

Now is the time to perform packed additions for words parallelly:

ASM
; Packed additions for words   
   paddw mm0,mm2
   paddw mm1,mm3

This is the results:

ArySumSimdFig6: This is the results

Another packed addition for words is to add above mm0 and mm1 as results in mm0:

ASM
; Packed additions for words 
   paddw mm0,mm1                    ;mm0 = pack sums (4 words)

Have 4 words now:

ArySumSimdFig7: Have 4 words now

Next, do the similar steps for words in mm0. First copy mm0 to mm1 and extend each word to doubleword by punpcklwd, lower word to doubleword, and punpckhwd, high word to doubleword:

ASM
; Promote packed sums to dwords, then update dword sums in mm5:mm4
   movq mm1,mm0
   punpcklwd mm0,mm7                ;mm0 = packed sums (2 dwords)
   punpckhwd mm1,mm7                ;mm1 = packed sums (2 dwords)

Then I have:

ArySumSimdFig8: Then I have

Finally, just packed add two doublewords to mm4 and another two doublewords to mm5. These doublewords are continuously collected in mm4 and mm5 for each iteration in the loop till done.

ASM
; Packed additions for doublewords    
   paddd mm4,mm0
   paddd mm5,mm1                    ;mm5:mm4 = packed sums (4 dwords)

   add eax,16                       ;eax = next 16 byte block
   dec ecx
   jnz @B                           ;repeat loop if not done

When all calculated, I exit the loop with mm4 containing two doublewords and mm5 another two respectively. Now perform packed doubleword addition as mm5 plus mm4, resulted in mm5. In order to calculate the final sum, I copy mm5 to mm6 and right shift the high doubleword in mm6. So, last packed addition is mm5 plus mm6. The final sum in mm6 is only lower doubleword that is moved into EAX as the return value.

ASM
; Compute final sum_x
   paddd mm5,mm4                    ;mm5 = packed sums (2 dwords)
   movq mm6,mm5
   psrlq mm6,32                     ;mm6 Shifted Right Logical 32 bits 

   paddd mm6,mm5                    ;mm6[31:0] = final sum_x
   movd eax,mm6                     ;eax = sum_x
   ret   
SumArrayMMX ENDP

SSE Execution Logic and Implementations

SSE technology adds eight 128-bit registers, named XMM0 to XMM7, as shown below in the Visual Studio Registers window in debugging.

ArySumSimdFig9: SSE technology adds eight 128-bit registers

They can be used to perform SIMD operations using sixteen 8-bit integers as packed BYTEs, eight 16-bit integers as packed WORDs, four 32-bit integers as packed DWORDs, or two 64-bit integers as packed QWORDs :

ArySumSimdFig10: sixteen 8-bit integers as packed BYTEs

The procedure SumArraySSE returns the same result as the previous but reduces half execution time as SumArrayMMX. This improvement is intuitive, as registers doubled size for packed operations. SSE instructions used here almost the same as those of MMX, as shown in the following sections.

SumArraySSE maintains the same prototype with two parameters. Now I can process 32 bytes a block in packed additions and initialize the loop counter ECX as count divided by 32, Again for simplicity, this requires an array size to be multiple of 32 as the size 64K (65536) here. Then set EAX the array address with xmm4, xmm5, and xmm7 cleared:

ASM
;-------------------------------------------------------------
SumArraySSE PROC uses ECX, arrayPtr:PTR BYTE, count:DWORD
; Performs a summation for an 8-bit BYTE array
; Returns: sum in EAX
;-------------------------------------------------------------
; Initialize
   mov ecx, count                   ;ecx = 'n'
   shr ecx, 5                       ;ecx = number of 32-byte blocks
   mov eax,arrayPtr                 ;pointer to array 'x'
   pxor xmm4,xmm4
   pxor xmm5,xmm5                   ;xmm5:xmm4 = packed sum (4 dwords)
   pxor xmm7,xmm7                   ;xmm7 = packed zero for promotions

This time, I’ll trace an example directly from Visual Studio debug environment. The first block of 32 bytes in memory could be like:

ArySumSimdFig11: first 32-byte memory block

Similar to MMX method, I save first two 16-byte blocks in xmm0 and xmm1:

ASM
; Load the next block of 16 array values 
@@:  
   movdqa xmm0,xmmword ptr [eax]
   movdqa xmm1,xmmword ptr [eax+16]  ;xmm1:xmm0 = 32 byte block

Notice the Little-Endian concept in memory form the Intel specification, and xmm0 and xmm1 become:

ArySumSimdFig12: Notice the little-Endian

Next, extend byte to word with four registers:

ASM
; Promote array values from bytes to words, then sum the words
   movdqa xmm2,xmm0
   movdqa xmm3,xmm1
   punpcklbw xmm0,xmm7              ;xmm0 = 8 words
   punpcklbw xmm1,xmm7              ;xmm1 = 8 words
   punpckhbw xmm2,xmm7              ;xmm2 = 8 words
   punpckhbw xmm3,xmm7              ;xmm3 = 8 words

Now xmm0, xmm1, xmm2, and xmm3 are all packed words promoted by punpcklbw and punpckhbw:

ArySumSimdFig13: Now xmm0, xmm1, xmm2, and xmm3

After packed word additions:

ASM
; Packed additions for words 
   paddw xmm0,xmm2
   paddw xmm1,xmm3

The results are eight packed words in xmm0 and another eight words in xmm1:

ArySumSimdFig14: results are eight packed words in xmm0

Now extend values from packed word to doubleword by copying xmm2 and xmm3 with four promotions:

ASM
; Promote array values from WORDs to DWORDs
   movdqa xmm2,xmm0
   movdqa xmm3,xmm1
   punpcklwd xmm0,xmm7              ;xmm0 = 4 dwords
   punpcklwd xmm1,xmm7              ;xmm1 = 4 dwords
   punpckhwd xmm2,xmm7              ;xmm2 = 4 dwords
   punpckhwd xmm3,xmm7              ;xmm3 = 4 dwords

I have four packed doublewords in xmm0, amm1, xmm2, and xmm3 respectively

ArySumSimdFig15: I have four packed doublewords in xmm0

Then make sums from the packed doublewords additions:

ASM
; Then sum the DWORDs
   paddd xmm0,xmm2
   paddd xmm1,xmm3

As a result, four packed doublewords in xmm0 and four packed doublewords in xmm1 as well:

ArySumSimdFig16: four packed doublewords in xmm0

Finally, similar to MMX code logic, I packed add four doublewords to xmm4 and four doublewords to xmm5. The registers xmm4 and xmm5 continuously collect from xmm0 and xmm1 each iteration till the loop done.

ASM
; Already packed sums to dwords, just update dword sums in xmm5:xmm4
   paddd xmm4,xmm0
   paddd xmm5,xmm1                  ;xmm5:xmm4 = packed sums (4 dwords)

   add eax,32                       ;eax = next 32 byte block
   dec ecx
   jnz @B                           ;repeat loop if not done

When exit from the loop, xmm4 and xmm5 containing four doublewords as packed sums respectively.

ArySumSimdFig17: When exit from the loop, xmm4 and xmm5

Now perform packed doubleword addition to xmm5 from xmm4. In order to calculate the final sum, more steps needed. I copy xmm5 to xmm6 and shift mm6 high quadword 64 bits (8 bytes with PSRLDQ), which is different from MMX’s psrlq with bits’ shift, I have to use PSRLDQ as Packed Shift Double Quadword Right Logical required with bytes' shift, instead of bits. For details, please check x86 and amd64 instruction reference. I then do packed addition to get the sum in xmm5.

ASM
; Compute final sum_x
   paddd xmm5,xmm4                  ;xmm5 = packed sums (4 dwords)
   movdqa xmm6,xmm5                          
   PSRLDQ xmm6, 8                   ;xmm6 Shifted Right Logical 8 bytes
   paddd xmm5,xmm6

Actually, the sum is now contained in lower quadword as two doublewords in xmm5 without need of caring xmm5 high quadword:

ArySumSimdFig18: Actually, the sum is now contained

The last step is to add two lower doublewords in xmm5 together. I copy it to xmm6 and shift xmm6 high doubleword 32 bits (4 bytes with PSRLDQ). After the last packed doubleword addition, the final sum in lower doubleword in xmm6 is passed to EAX as return value.

ASM
   movdqa xmm6,xmm5
   PSRLDQ xmm6, 4                   ;xmm6 Shifted Right Logical 4 bytes

   paddd xmm6,xmm5                  ;xmm6[31:0] = final sum_x
   movd eax,xmm6                    ;eax = sum_x
   ret   
SumArraySSE ENDP

AVX Execution Logic and Implementations

AVX technology adds another 128 bits to extend SSE XMM0 to XMM7 to make it as eight 256-bit registers, named YMM0 to YMM7, as shown below in debug:

ArySumSimdFig19: YMM0 to YMM7

They can be used to perform SIMD operations using thirty-two 8-bit integers as packed BYTEs, sixteen 16-bit integers as packed WORDs, eight 32-bit integers as packed DWORDs, or four 64-bit integers as packed QWORDs.

ArySumSimdFig20: thirty-two 8-bit integers as packed BYTEs

The procedure SumArrayAVX returns the same result as the previous three but reduces more execution time. This is reasonable because again doubled register size from 128 to 256 bits. The AVX instructions used here are also similar, except for letter ‘v’ prefix and with three operands.

Since YMM so large in 256 bits, I would like to simplify SumArrayAVX logic not as the previous pattern in MMX and SSE. In the loop, I process 32 bytes once in a single block instead of 64 bytes in two branches. For this purpose, I initialize the loop counter ECX still as count divided by 32. When the loop starts, I just load 32 bytes into ymm1 only:

ASM
;-------------------------------------------------------------
SumArrayAVX PROC  uses ECX, arrayPtr:PTR BYTE, count:DWORD
; Performs a summation for an 8-bit BYTE array
; Returns: sum in EAX
;-------------------------------------------------------------
; Initialize
   mov ecx, count                   ;ecx = 'n'
   shr ecx, 5                       ;ecx = number of 32-byte blocks
   mov eax,arrayPtr                 ;pointer to array 'x'
   vpxor   ymm0, ymm0, ymm0
   vpxor   ymm5, ymm5, ymm5
@@:  
   ; Load the next block of 32 bytes from array
   vmovdqa ymm1,ymmword ptr [eax]   

Let’s still use the same array example as before with the first 32 bytes in memory like:

ArySumSimdFig21: first 32-byte memory

At this time of loading, ymm1 gets all 32 bytes from the Little-Endian concept memory as this:

ArySumSimdFig22: all 32 bytes from the little-Endian

Next extend byte to word and perform packed word addition:

ASM
; Packed additions for words 
   VPUNPCKLBW ymm2, ymm1, ymm0
   VPUNPCKHBW ymm3, ymm1, ymm0
   vpaddw ymm4, ymm2, ymm3          ;add words low and high

This is the packed word results in ymm4 where you can manually add ymm2 and ymm3 to verify:

ArySumSimdFig23: packed word results in ymm4

Next extend word to doubleword in ymm2 and ymm3, and perform packed addition to ymm4:

ASM
; Packed additions for doublewords 
   vmovdqa ymm1,ymm4   
   VPUNPCKLWD ymm2, ymm1, ymm0
   VPUNPCKHWD ymm3, ymm1, ymm0
   vpaddd ymm4, ymm2, ymm3          ;add dwords low and high

This is again, the packed doubleword results in ymm4 and can be manually verified:

ArySumSimdFig24: packed doubleword results in ymm4

The register ymm5 is the sum collector to continuously add eight doublewords in each iteration from ymm4 before the loop done:

ASM
; Packed addition last for doublewords
   vpaddd ymm5, ymm5, ymm4          ;ymm5 = packed sums (8 dwords)
   add eax,32                       ;eax = next 32 byte block
   dec ecx
   jnz @B                           ;repeat loop if not done

When done, to get final sum, I have to play more tricks here. I need to use VPERMQ with encoding 00001110b to move upper two QWORDs from ymm5 to ymm4. Please consult with x86 and amd64 instruction reference for this Qwords Element Permutation instruction.

ASM
; Compute final sum_x
   vmovdqa ymm4, ymm5   
   ; Qwords Element Permutation to move upper QWORD to xmm4
   vpermq  ymm4, ymm4, 00001110b

Then I have lower two quadwords ready in both ymm4 and ymm5:

ArySumSimdFig25: have lower quadwords ready

Now I only need to focus on two lower quadwords in ymm4 and ymm5 each containing 4 doublewords respectively. This is exactly the same scenario at the end of SumArraySSE. Because YMM0, ... ,YMM7 are just 128 bits extended from XMM0, ... ,XMM7, I can use two lower quadwords in YMM as XMM named in SSE. Thus I simply reuse that part of code for convenience:

ASM
; Reuse this part of code in SSE 
   paddd xmm5,xmm4                  ;xmm5 = packed sums (4 dwords)
   movdqa xmm6,xmm5                          
   PSRLDQ xmm6, 8                   ;xmm6 Shifted Right Logical 8 bytes
   paddd xmm5,xmm6

   movdqa xmm6,xmm5
   PSRLDQ xmm6, 4                   ;xmm6 Shifted Right Logical 4 bytes
   paddd xmm6,xmm5                  ;xmm6[31:0] = final sum_x
   movd eax,xmm6                    ;eax = sum_x
   ret   
SumArrayAVX ENDP

The Benchmark Test Drive

In order to test four procedures from SumArrayOriginal to SumArrayMMX/SSE/AVX, I have to create a benchmark program to let you enter the loop count and show measurements. One way is to simply delegate all input/output tasks to the high-level-language that is as easy as some bootcamp tutorials used. But for our benchmark, I prefer to make it more straight and accurately measured in the pure Assembly code with I/O directly called from OS. For this purpose, I recommend to link to irvine32 library at Assembly Language for x86 Processors. You can go Getting started with MASM and Visual Studio 2019 to download and install that static-linked library to build my attached project. For reference, please see online Irvine32 Library Help borrowed from California State University Dominguez Hills.

I need two helpers; one is to generate pseudo-random bytes and save them in an array:

ASM
;---------------------------------------------------------------- 
FillArray PROC USES eax, pArray: DWORD, Count: DWORD   
; Fills an array with a random sequence of 32-bit signed 1 to 255
; integers between LowerRange and (UpperRange - 1).
; Returns: nothing
;---------------------------------------------------------------- 
LOWVAL = 1h           ; minimum value
HIGHVAL = 0FFh        ; maximum value

;   call Randomize
   mov   edi,pArray   ; EDI points to the array
   mov   ecx,Count    ; loop counter

L1:   
   mov   eax,HIGHVAL   ; get absolute range
   call   RandomRange
   add   eax,LOWVAL    ; bias the result
   stosb               ; store AL into [edi]
   loop   L1

   ret
FillArray ENDP

Another is a reusable one to output the result from four procedure calls:

ASM
;---------------------------------------------------------------- 
OutputResult PROC uses edx, 
   resSum:DWORD, msec:DWORD, labMethod:PTR BYTE, labTime:PTR BYTE    
;----------------------------------------------------------------
   call   GetMseconds   ; get stop time
   sub   eax, msec
   mov   msec, eax
   mov EDX, labMethod
   call WriteString
   mov eax, resSum
   call WriteDec
   mov EDX, labTime
   call WriteString
   mov eax, msec
   call   WriteDec      ; display elapsed time
   ret
OutputResult ENDP   

Our benchmark entry point can start like:

ASM
;------------------------------------------------------------
mainBenchmark PROC
;------------------------------------------------------------
   ; Ask for count input
   mov EDX, offset InputLabel2
   call WriteString
   call ReadDec
   mov LoopCounter, eax

; Fill an array with random signed integers
   INVOKE FillArray, ADDR array, LENGTHOF array

Then I do each test like this:

ASM
; Original approach:
   call   GetMseconds   
   mov   tm,eax
   mov ECX, LoopCounter
LL1:
   INVOKE SumArrayOriginal, ADDR array, MAX_ARRAY_SIZE
loop LL1
   INVOKE OutputResult, eax, tm, addr LbSumOrg, addr TimeElapsed    
   call   Crlf

; MMX approach:
   call   GetMseconds   
   mov   tm,eax
   mov ECX, LoopCounter
LL2:
   INVOKE SumArrayMMX, ADDR array, MAX_ARRAY_SIZE
loop LL2
   INVOKE OutputResult, eax, tm, addr LbSumMMX, addr TimeElapsed    
   call crlf

; SSE approach:
   ...

; AVX approach:
   ...

   call   WaitMsg
   exit
mainBenchmark ENDP   

The last but not least is the MASM 32-bit directive .XMM that must be declared at the beginning. The directive name sounds vague or inaccurate because it enables Assembly of SIMD instructions all required in this article so far including MMX, SSE, and AVX.

Attention to Assembly Programming

The Assembly code discussed in this article can be downloaded with its MASM project file to run in the recent Visual Studio. However, when your Visual Studio working fine with VC++, it doesn’t mean that it can run a pure Assembly code project. Mainly for the security reason, building or launching an Assembly executable could be blocked by an Anti-Virus application or system settings. The fatal error from the Visual Studio can be LNK1104, ML A1000, etc. It’s hard to understand, since no specific info to tell.

This is an example of an Assembly code blocked by Defender in Windows 10:

ArySumSimdFig26: Assembly code blocked

Here is some workaround. For an Anti-Virus application, if there is an editable exception or exclusion option, add your Assembly code test folders to bypass. The following is the Exclusions option from Defender in Windows 10:

ArySumSimdFig27: Exclusions option from Defender

If no such an option, but there is a choice allowed to disable, just disable it for a while to test, yet this will leave your computer unprotected. You can also read,  Was your program's EXE file blocked by an anti-virus scanner?

Summary

I spend so much to explore optimizations just for such a simple program. As you can imagine, for a software compiler, how much effort should be invested to create an intelligent utility. To peek the compiler behavior, you can reverse engineer from a C/CPP program to Assembly listing by a disassembler. Two builds available, the Debug build is relatively readable, almost mapping to original logic often with traditional Assembly, while the Release build is hard to understand as often optimized with modern Assembly like SIMD. You can try to build what here mentioned SumArrayCPP in Visual Studio to watch, although other platform disassemblers will produce a different Assembly listing. Another attention is that compilers usually upgrade to incorporate new instructions with technology advanced. Thus, the Release disassembly listing can be different from Visual Studio 2010 to 2019, as different optimization methods applied from time to time as in this article.

The immediate motivation of this writing is to create a concise supplement for my students at CSCI 241Assemly Language Programming. The course outline mainly focused on traditional x86-64 Assembly language programming. Based on an intensive learning, the students would like to have an extensive look and feel about modern Assembly language. Although you can find nice books, as they provided a lot detailed SIMD descriptions, it’s hard to pick up one to lecture in a couple of hours. Therefore, I hope this article could serve as a useful reading and practice material for beginners interested in modern Assembly language.

Recently in software industry, different types of assembly languages play an important role in the new device, mobile and microchip design, etc. In particular, time and space efficiency in Assembly project performance is always emphasized as a critical factor. Academically, Assembly Language Programming is considered as one of demanding courses in Computer Science in most colleges. With advanced technology from time to time, optimization could be a good topic to discuss in more areas.

History

  • 26th March, 2021: Original version posted

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)