Introduction
In MASM, the ALIGN directive does not align local (or stack) variables, i.e. those variables that you declare at the start of a procedure by using the LOCAL directive. The only guarantee you have for local variables is that 32-bit Windows aligns them on a 4-byte boundary and 64-bit Windows aligns them on an 8-byte boundary.
Of course, MASM does align variables declared in the .DATA section of the source code, but these are static and may not be what you require, particularly if the code is meant to be thread safe.
The lack of stack data alignment facilities did not become really critical until the SSE instruction set appeared. Many SSE instructions that read data from memory require the data to be aligned on a 16-byte boundary; otherwise a general-protection fault is raised.
Most recent C/C++ compilers have directives to align stack data, but we are dealing with MASM. If you are linking C/C++ with Assembly Language, or writing applications in Assembly Language, you need to be aware of the potential problems.
SSE provides instructions to load potentially misaligned data into registers or to store data from the SSE registers into potentially misaligned memory, namely the "movups" and "movdqu" instructions. The performance penalty is not as evident on modern CPUs as it used to be on the old Pentium 3 and 4, and this is the route to take in most cases.
Still, it continues to be useful to know how to align stack memory in MASM. For example, you might need to call external functions from within Assembly Language that expect the received data to be 16-byte aligned.
Using the code
The problem of aligning stack memory in Assembly Language has been discussed in various forums for years but I never found a really manageable solution, so I decided to propose my own recipe.
This solution permits any number of 16-byte aligned memory variables, limited only by the available stack space. It can easily be modified for 32-byte or higher alignment if needed; this is left as an exercise.
It works in the following stages:
1) Save the current stack position, so that we can restore it later.
This is done through a macro (here is the 32-bit version; both the 32-bit and 64-bit versions can be downloaded from the link above):
SAVE_STACK_POSITION macro
mov TopOfAllocatedStackMem, esp ;; Save the current top of stack
endm
2) Reserve a chunk of 16-byte aligned memory on the stack for some variable. A variable containing a pointer to it has been previously declared through a LOCAL directive, so that we can access it later.
Another macro does this job (here is the 32-bit version):
CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT macro memsize, PtrToStackMem
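and esp, -10h ;; Round esp down to a 16-byte boundary
sub esp, memsize ;; Reserve memsize bytes (must be a multiple of 16 to keep the alignment)
mov PtrToStackMem, esp ;; Keep a pointer to the slot for later use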
endm
3) Now that we have memory for the variable, we can save some data there.
A third macro (here, the 32-bit version) does this job:
SAVE_XMM_IN_ALIGNED_STACK_MEMORY macro PtrToStackMem, reg
push eax
mov eax, PtrToStackMem
movaps [eax], reg
pop eax
endm
4) When we need to retrieve a variable, we make use of a fourth macro (here, the 32-bit version):
RETRIEVE_XMM_FROM_ALIGNED_STACK_MEMORY macro PtrToStackMem, reg
push eax
mov eax, PtrToStackMem
movaps reg, [eax]
pop eax
endm
5) When the procedure is returning to the caller we need to release all the memory we have allocated from the stack.
So insert the following macro (here, the 32-bit version) just before the "ret" instruction (MASM will emit a "leave" before the "ret"):
RESTORE_STACK_POSITION macro
mov esp, TopOfAllocatedStackMem
endm
Our demo
Our demo consists of a callable ASM function (AsmMemAlignDemo) and a mini C++ project containing its caller. AsmMemAlignDemo is called with 2 parameters: a __m128, which corresponds in ASM to an XMMWORD, and a float, which corresponds in ASM to a REAL4. It returns a __m128.
Its C++ declaration is:
__m128 AsmMemAlignDemo(__m128 param1, float param2);
AsmMemAlignDemo is called with param1 containing a vector of 4 floats (1.0, 2.0, 3.0, 4.0) and param2 containing the float 10.0.
Within the ASM function, 3 operations take place to obtain the final result.
1) The float is multiplied by the vector, obtaining the partial result:
(10.0, 20.0, 30.0, 40.0)
2) A value of 17.0 is added to the vector, obtaining the partial result:
(27.0, 37.0, 47.0, 57.0)
3) The vector is divided by 3, giving the final result of:
(9.000000, 12.333333, 15.666667, 19.000000)
The function is also our opportunity to demonstrate the recipe for 16-byte stack memory alignment, which is after all the main purpose of the article (here, the 32-bit version):
; __m128 AsmMemAlignDemo(__m128 param1, float param2);
; *See important comments in the "C++" project
; This function will:
; 1) multiply the param1 vector by the float param2
; 2) add 17.0 then divide the result by 3.0
AsmMemAlignDemo proc public par2:REAL4
; These are the stack variables. On 64-bit Windows they are aligned on 8-byte boundaries,
; on 32-bit Windows they are aligned on 4-byte boundaries.
LOCAL valueToAdd : DWORD
LOCAL valueToDivideFor : DWORD
LOCAL TopOfAllocatedStackMem : DWORD;
LOCAL PointerTo16ByteAlignedvalueToAdd : ptr XMMWORD
LOCAL PointerTo16ByteAlignedvalueToDivideFor : ptr XMMWORD
movss xmm5, par2 ; Move the passed float to the first 32 bits of xmm5
shufps xmm5, xmm5, 0 ; Replicate it across the register to obtain 4 identical floats
; cdecl: the __m128 param1 came in xmm0
mulps xmm0, xmm5 ; Part 1) is completed. The partial result is in xmm0
; Set some data to compose our example. First the 17.0 to add to the partial result
mov dword ptr valueToAdd, 17
movss xmm3, valueToAdd
shufps xmm3, xmm3,0 ; replicate across
cvtdq2ps xmm3, xmm3 ; convert to a vector of 4 floats
; Set the value to divide by, which is 3.0
mov dword ptr valueToDivideFor, 3
movss xmm2, valueToDivideFor
shufps xmm2, xmm2,0 ; replicate across
cvtdq2ps xmm2, xmm2 ; convert to a vector of 4 floats
; Now, we will begin the part that demonstrates how to align stack memory.
; This is the real purpose of the article, till now, everything was just a "mise-en-scene"
; First, save the current stack position.
SAVE_STACK_POSITION
; Reserve a chunk of 16-byte aligned memory on the stack for the addition vector
CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT sizeof XMMWORD, PointerTo16ByteAlignedvalueToAdd
; Save the xmm3 contents in there.
SAVE_XMM_IN_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToAdd, xmm3
; Reserve a chunk of 16-byte aligned memory on the stack for the division vector
CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT sizeof XMMWORD, PointerTo16ByteAlignedvalueToDivideFor
; Save the xmm2 contents in there.
SAVE_XMM_IN_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToDivideFor, xmm2
xorps xmm3, xmm3 ; zero out the xmm registers to prove we are not cheating :)
xorps xmm2, xmm2
; Check with a debugger that the stored vectors are loaded back successfully by the "movaps" instructions. We will not, however, use xmm3 and xmm2 for the final calculation.
RETRIEVE_XMM_FROM_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToAdd, xmm3
RETRIEVE_XMM_FROM_ALIGNED_STACK_MEMORY PointerTo16ByteAlignedvalueToDivideFor, xmm2
; Instead we will be doing the calculations directly with the aligned memory
mov eax, PointerTo16ByteAlignedvalueToAdd
addps xmm0, [eax] ; Add Packed Single-Precision Floating-Point Values
mov eax, PointerTo16ByteAlignedvalueToDivideFor
divps xmm0, [eax] ; Divide Packed Single-Precision Floating-Point Values
; That's it, the result will be returned in xmm0
; Finally, deallocate our stack memory. All you need to do is restore the stack pointer.
RESTORE_STACK_POSITION
ret ; MASM will emit a leave instruction before the ret
AsmMemAlignDemo endp
In the "C++" project, we added a few comments about the way __m128 parameters are passed and the __m128 result is received under the cdecl and Microsoft x64 calling conventions in Visual Studio. You will see that they depart from the specifications as understood by other compiler vendors.
extern "C" {
__m128 AsmMemAlignDemo(__m128 param1, float param2);
}
int _tmain(int argc, _TCHAR* argv[])
{
__m128 mappedXMMRegister = { 1.0, 2.0, 3.0, 4.0 };
__m128 result = AsmMemAlignDemo(mappedXMMRegister, 10.0);
printf("The test was a success!\n");
printf("Results: %f, %f, %f, %f", result.m128_f32[0], result.m128_f32[1], result.m128_f32[2], result.m128_f32[3]);
getchar();
return 0;
}
Finally, the ASM compilation:
To compile the 32-bit ASM, you run MASM with:
"Path to Visual Studio"\VC\bin\ml /c asmtest32.asm
Note: You can also compile with JWasm without any change:
"Path to JWasm"\jwasm -coff asmtest32.asm
To compile the 64-bit ASM, you run MASM with:
"Path to Visual Studio"\VC\bin\amd64\ml64 -c asmtest64.asm
Note: You can also compile with JWasm without any change:
"Path to JWasm"\jwasm -c -win64 asmtest64.asm
After compilation you will have to link your Visual Studio 32-bit build with the asmtest32.obj and the Visual Studio 64-bit build with asmtest64.obj.
Important:
The reliability of this recipe is based on the assumption that all unwinding will take place after the call to RESTORE_STACK_POSITION.
This holds in our demo: MASM emits a 'leave' followed by a 'ret' after RESTORE_STACK_POSITION.
If an ASM module needs to handle SEH (Structured Exception Handling) or preserve some registers across the whole procedure (i.e., it pushes registers on the stack before SAVE_STACK_POSITION), some extra care needs to be taken. The same applies if the ASM procedure is not a leaf (it calls other procedures). JWasm makes it easy to deal with these cases, but MASM requires that you know exactly what you are doing.
History
6th September 2016 - CREATE_16BYTE_ALIGNED_STACK_MEM_SLOT macro