MASM - Stack Memory Alignment

9 Sep 2016, CPOL, 5 min read
SIMD instruction sets may expect a special alignment of memory, but when that memory is on the stack MASM does not provide alignment facilities.

Introduction

In MASM, the ALIGN directive does not align local (or stack) variables, i.e. those variables that you declare at the start of a procedure by using the LOCAL directive. The only guarantee you have for local variables is that 32-bit Windows aligns them on a 4-byte boundary and 64-bit Windows aligns them on an 8-byte boundary.

Of course, MASM does align variables declared in the .DATA section of the source code, but these are static and may not be what you require, particularly if the code is meant to be thread safe.

The lack of stack data alignment facilities did not become critical until the appearance of the SSE instruction set. Many SSE instructions that read data from memory require the data to be aligned on a 16-byte boundary; otherwise a general-protection fault is raised.
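To make the 16-byte requirement concrete, here is a minimal C++ sketch of the test an address has to pass. The helper names is_aligned_address and is_aligned_pointer are hypothetical and exist only for this illustration; they are not part of the download.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `address` is a multiple of `alignment`.
// `alignment` must be a power of two (16 for aligned SSE loads and stores).
inline bool is_aligned_address(std::uintptr_t address, std::size_t alignment = 16) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // A multiple of a power of two has all the lower bits clear.
    return (address & (alignment - 1)) == 0;
}

// Hypothetical helper: the same check for a real pointer.
inline bool is_aligned_pointer(const void* p, std::size_t alignment = 16) {
    return is_aligned_address(reinterpret_cast<std::uintptr_t>(p), alignment);
}
```

An address such as 1000h passes the 16-byte test, whereas 1004h only passes the 4-byte test, which is all that a LOCAL variable is guaranteed on 32-bit Windows.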

Most recent C/C++ compilers have directives to align stack data, but we are dealing with MASM. If you are linking C/C++ with Assembly Language, or doing applications in Assembly Language, you need to be aware of the potential problems.

SSE provides instructions to load potentially misaligned data into registers or to store data from the SSE registers into potentially misaligned memory, namely the "movups" and "movdqu" instructions. The performance penalty is not as evident on modern CPUs as it used to be on the old Pentium 3 and 4, and this is the route to take in most cases.
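For readers coming from the C/C++ side, the same choice is exposed through intrinsics. The sketch below assumes an x86 or x64 compiler that ships xmmintrin.h; the function name sum4_any_alignment is hypothetical. Compilers normally translate _mm_loadu_ps into "movups" and _mm_store_ps into "movaps", although they are free to pick other encodings.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; x86/x64 compilers only

// Hypothetical helper: adds four floats that may start at any address.
inline float sum4_any_alignment(const float* p) {
    assert(p != nullptr);
    __m128 v = _mm_loadu_ps(p);   // unaligned load, normally emitted as movups
    alignas(16) float lanes[4];   // the compiler aligns this local for us
    _mm_store_ps(lanes, v);       // aligned store, normally emitted as movaps
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Passing a pointer that is offset by one float from the start of an array is enough to exercise the misaligned case in most builds.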

Still, it continues to be useful to know how to align stack memory in MASM. For example, you might need to call external functions from within Assembly Language that expect the received data to be 16-byte aligned.

Using the code

The problem of aligning stack memory in Assembly Language has been discussed in various forums for years but I never found a really manageable solution, so I decided to propose my own recipe.

This solution permits an unlimited number of 16-byte aligned memory variables. It can easily be modified for 32-byte or higher alignment, if needed; this is left as an exercise.

It works in the following stages:

1) Save the current stack position, so that we can restore it later.
This is done through a macro (here is the 32-bit version; both the 32-bit and 64-bit versions can be downloaded from the link above):

SAVE_STACK_POSITION macro
    mov TopOfAllocatedStackMem, esp ;; Save the current top of stack
endm

2) Reserve a chunk of 16-byte aligned memory on the stack for some variable. A variable containing a pointer to it has been previously declared through a LOCAL directive, so that we can access it later.
Another macro does this job (here is the 32-bit version):

CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT macro memsize, PtrToStackMem
    sub esp, memsize ;; Make room for the new variable
    and esp, -10h ;; Then align down to a 16-byte boundary (safe for any memsize)
    mov PtrToStackMem, esp
endm
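The arithmetic behind the "and" with -10h can be checked away from the debugger. The following C++ sketch models a 32-bit stack pointer as a plain integer; the function names and the sample addresses are assumptions made for this illustration only. It also shows why the order of the two steps matters when memsize is not a multiple of 16.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of "sub esp, size" followed by "and esp, -10h".
inline std::uint32_t reserve_then_align(std::uint32_t sp, std::uint32_t size) {
    assert(sp >= size + 16u);   // the modelled stack must not wrap around
    sp -= size;                 // make room for the variable
    sp &= 0xFFFFFFF0u;          // -10h as a mask: clear the low four bits
    return sp;
}

// Hypothetical model of the opposite order: "and esp, -10h", then "sub esp, size".
inline std::uint32_t align_then_reserve(std::uint32_t sp, std::uint32_t size) {
    assert(sp >= size + 16u);
    sp &= 0xFFFFFFF0u;
    sp -= size;                 // stays aligned only if size is a multiple of 16
    return sp;
}
```

Starting from 0012FF74h, both orders give 0012FF60h for a 16-byte slot. For a 24-byte slot, subtracting first gives 0012FF50h, while aligning first gives 0012FF58h, which is no longer 16-byte aligned.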

3) Now that we have memory for the variable, we can save some data there.
A third macro (here, the 32-bit version) does this job:

SAVE_XMM_IN_ALIGNED_STACK_MEMORY macro PtrToStackMem, reg
    push eax ;; Now we are safe to "push" and "pop" registers.
    mov eax, PtrToStackMem
    movaps [eax], reg ;; If the memory were not aligned, an exception would occur here.
    pop eax
endm

4) When we need to retrieve a variable, we make use of a fourth macro (here, the 32-bit version):

RETRIEVE_XMM_FROM_ALIGNED_STACK_MEMORY macro PtrToStackMem, reg
    push eax
    mov eax, PtrToStackMem
    movaps reg, [eax] ;; If the memory were not aligned, an exception would occur here.
    pop eax
endm

5) When the procedure is returning to the caller we need to release all the memory we have allocated from the stack.
So insert the following macro (here, the 32-bit version), just before the "ret" instruction (the MASM compiler will issue a "leave" before the "ret"):

RESTORE_STACK_POSITION macro
    mov esp, TopOfAllocatedStackMem
endm
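The life cycle of the recipe (save, reserve, reserve again, restore) can be followed with a small C++ model. This is only a sketch: StackModel is a hypothetical type that tracks a number, it does not touch the real stack, and it assumes slot sizes that are multiples of 16, as in the demo.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the stack pointer as used by the macros.
struct StackModel {
    std::uint32_t esp;
    std::uint32_t saved;

    explicit StackModel(std::uint32_t initial) : esp(initial), saved(initial) {}

    // Models SAVE_STACK_POSITION.
    void save_position() { saved = esp; }

    // Models CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT; returns the slot address.
    std::uint32_t create_slot(std::uint32_t size) {
        assert(size % 16u == 0u);   // assumption of this sketch
        esp &= 0xFFFFFFF0u;         // align down to a 16-byte boundary
        esp -= size;                // reserve the slot below it
        return esp;
    }

    // Models RESTORE_STACK_POSITION: every slot is released at once.
    void restore_position() { esp = saved; }
};
```

With an initial value of 0012FF74h, two 16-byte slots land at 0012FF60h and 0012FF50h, and restoring brings the pointer back to 0012FF74h regardless of how many slots were created.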

 

Our demo

Our demo consists of a callable ASM function (AsmMemAlignDemo) and a mini C++ project containing its caller. AsmMemAlignDemo is called with 2 parameters: a __m128, which corresponds in ASM to an XMMWORD, and a float, which corresponds in ASM to a REAL4. It returns a __m128.

Its C++ declaration is:

__m128 AsmMemAlignDemo(__m128 param1, float param2);

AsmMemAlignDemo is called with param1 containing a vector of 4 floats (1.0, 2.0, 3.0, 4.0) and param2 containing the float 10.0.
 

Within the ASM function, 3 operations will take place to obtain the final result.
1) The float is multiplied by the vector, obtaining the partial result:
(10.0, 20.0, 30.0, 40.0)
2) A value of 17.0 is added to the vector, obtaining the partial result:
(27.0, 37.0, 47.0, 57.0)
3) The vector is divided by 3, giving the final result of:
(9.000000, 12.333333, 15.666667, 19.000000)
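The expected numbers are easy to reproduce with scalar code, which is handy when checking the ASM output. This is a plain C++ sketch, not the demo itself; the function name demo_pipeline is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference for the three steps: multiply, add, divide.
inline std::array<float, 4> demo_pipeline(const std::array<float, 4>& v,
                                          float scale, float addend, float divisor) {
    assert(divisor != 0.0f);
    std::array<float, 4> r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = (v[i] * scale + addend) / divisor;
    }
    return r;
}
```

Calling it with the vector (1.0, 2.0, 3.0, 4.0), a scale of 10.0, an addend of 17.0 and a divisor of 3.0 should agree with the values listed above to within float precision.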

Of course, this is also an opportunity to demonstrate our recipe for 16-byte stack memory alignment, which is after all the main purpose of the article (here, the 32-bit version):

; __m128 AsmMemAlignDemo(__m128 param1, float param2);
; *See important comments in the "C++" project

; This function will:
; 1) multiply the param1 vector by the float param2
; 2) add 17.0 then divide the result by 3.0
AsmMemAlignDemo proc public par2:REAL4 

; These are the stack variables. On 64-bit Windows they are aligned on 8-byte boundaries,
; on 32-bit Windows they are aligned on 4-byte boundaries.
LOCAL valueToAdd : DWORD
LOCAL valueToDivideFor : DWORD

LOCAL TopOfAllocatedStackMem : DWORD;
LOCAL PointerTo16ByteAlignedvalueToAdd : ptr XMMWORD
LOCAL PointerTo16ByteAlignedvalueToDivideFor : ptr XMMWORD

    movss xmm5, par2 ; Move the passed float to the first 32 bits of xmm5
    shufps xmm5, xmm5, 0 ; Replicate it across the register to obtain 4 identical floats
    ; cdecl: the __m128 param1 came in xmm0
    mulps xmm0, xmm5 ; Part 1) is completed. The partial result is in xmm0
    
    ; Set some data to compose our example. First the 17.0 to add to the partial result
    mov dword ptr valueToAdd, 17
    movss xmm3, valueToAdd
    shufps xmm3, xmm3,0 ; replicate across
    cvtdq2ps xmm3, xmm3 ; convert to a vector of 4 floats 
    ; Set the value to divide for, which is 3.0
    mov dword ptr valueToDivideFor, 3 
    movss xmm2, valueToDivideFor
    shufps xmm2, xmm2,0 ; replicate across
    cvtdq2ps xmm2, xmm2 ; convert to a vector of 4 floats 

    ; Now, we will begin the part that demonstrates how to align stack memory.
    ; This is the real purpose of the article, till now, everything was just a "mise-en-scene"
    ; First, save the current stack position.
    SAVE_STACK_POSITION
    ; Reserve a chunk of 16-byte aligned memory on the stack for the addition vector
    CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT sizeof XMMWORD, PointerTo16ByteAlignedvalueToAdd
    ; Save the xmm3 contents in there.
    SAVE_XMM_IN_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToAdd, xmm3
    ; Reserve a chunk of 16-byte aligned memory on the stack for the division vector    
    CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT sizeof XMMWORD, PointerTo16ByteAlignedvalueToDivideFor
    ; Save the xmm2 contents in there.    
    SAVE_XMM_IN_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToDivideFor, xmm2        
    
    xorps xmm3, xmm3 ; zero out the xmm registers to prove we are not cheating :)
    xorps xmm2, xmm2 
    ; Check with a debugger that the stored vectors are loaded back successfully by the "movaps" instructions. But we will not be using xmm3 and xmm2 for the final calculation.
    RETRIEVE_XMM_FROM_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToAdd, xmm3
    RETRIEVE_XMM_FROM_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToDivideFor, xmm2
    ; Instead we will be doing the calculations directly with the aligned memory 
    mov eax, PointerTo16ByteAlignedvalueToAdd 
    addps xmm0, [eax] ; Add Packed Single-Precision Floating-Point Values
    mov eax, PointerTo16ByteAlignedvalueToDivideFor
    divps xmm0, [eax]  ; Divide Packed Single-Precision Floating-Point Values
    ; That's it, the result will be returned in xmm0
    ; Finally deallocate our stack memory, all you need to do is restore the stack pointer.
    RESTORE_STACK_POSITION
    ret ; The MASM compiler will issue a leave instruction before the ret
AsmMemAlignDemo endp

In the "C++" project, we added a few comments about the way __m128 parameters are passed and the __m128 result is received under the cdecl and Microsoft x64 calling conventions in Visual Studio. You will see that they depart from the specifications as they are understood by other compiler vendors.

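#include <stdio.h>     // printf, getchar
#include <tchar.h>     // _tmain, _TCHAR
#include <xmmintrin.h> // __m128

extern "C" {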
    __m128 AsmMemAlignDemo(__m128 param1, float param2);
}

int _tmain(int argc, _TCHAR* argv[])
{
    // Through some compiler magic the __m128 data type is automatically aligned on 16-byte boundaries.
    __m128 mappedXMMRegister = { 1.0, 2.0, 3.0, 4.0 };

    // On x86, we are using the cdecl calling convention, yet what happens in this particular case contradicts most of the common wisdom
    // and be aware that not every compiler does it this way:
    // 1) The __m128 variable is sent in xmm0 (yes, it is!)
    // 2) the float is sent on the stack (no surprises here)
    // 3) The return value comes in xmm0 (yes, it does!)

    // On x64, the Microsoft x64 calling convention is standard. The __m128 return value comes in xmm0, which may surprise some people because
    // this information is somewhat hidden. Be aware that not every compiler does it this way. So:
    // 1) The return value will come in xmm0 (structures are usually returned through memory pointed to by the RCX register, but not __m128).
    // 2) The float parameter is passed in an XMM register. Registers are assigned by argument position, so as the second argument it is passed in xmm1.
    // 3) __m128 parameters are not passed in registers: the caller stores the value in memory, typically on its stack, and RCX points to where it is.

    // As a conclusion, it appears that cdecl deals better with __m128 parameters than the x64 calling convention.
    // But with the new "Vector Calling Convention" the things do get better.
    
    __m128 result = AsmMemAlignDemo(mappedXMMRegister, 10.0);
    printf("The test was a success!\n");
    printf("Results: %f, %f, %f, %f", result.m128_f32[0], result.m128_f32[1], result.m128_f32[2], result.m128_f32[3]);
    getchar();
    return 0;
}

Finally, the ASM compilation:

To compile the 32-bit ASM, you run MASM with:
"Path to Visual Studio"\VC\bin\ml /c asmtest32.asm
Note: You can also compile with JWasm without any change:
"Path to JWasm"\jwasm -coff asmtest32.asm

To compile the 64-bit ASM, you run MASM with:
"Path to Visual Studio"\VC\bin\amd64\ml64 -c asmtest64.asm
Note: You can also compile with JWasm without any change:
"Path to JWasm"\jwasm -c -win64 asmtest64.asm

 

After compilation you will have to link your Visual Studio 32-bit build with the asmtest32.obj  and the Visual Studio 64-bit build with asmtest64.obj.

Important:

The reliability of this recipe is based on the assumption that all unwinding will take place after RESTORE_STACK_POSITION has run.
This happens in our demo: the MASM compiler will issue a 'leave' followed by a 'ret' after RESTORE_STACK_POSITION.
If an ASM module needs to handle SEH (Structured Exception Handling) or preserve some registers across the whole procedure (i.e., it "pushes" registers on the stack before SAVE_STACK_POSITION), some extra care needs to be taken. The same applies if the ASM module is not a leaf (i.e., it calls other procedures). JWasm makes it easy to deal with these cases, but MASM requires that you know exactly what you are doing.

History

6th September 2016 - CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT macro

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)