This article explains how to create fully functional .DLL and .LIB modules for Windows using pure assembly language. Although much of the discussion centers on working with Visual Studio, the .DLL and .LIB modules you create can be integrated into any language that allows their use. What's generated are standard .DLLs, with nothing to distinguish the final product from .DLL modules created any other way.
Visual Studio allows inline assembly only in 32-bit mode; in 64-bit mode it isn't available at all, and you have to use the enormously complex and confusing intrinsics instead. Either way, you're only getting a fraction of the power that ASM offers for handling processor-intensive tasks effectively. Most of what you can do in a full-fledged ASM .DLL module can't be touched with inline assembly or intrinsics.
Creating an all-assembly .DLL module is nowhere near as complicated as many may think – you can do it with Notepad alone (assuming you have a suitable assembler and linker). It opens up the full power of the language – including functions, macros, and a host of other benefits that are unavailable in Visual Studio (or any other environment that lets you work with some form of ASM).
In Windows, the Portable Executable format – PE – is used universally for executables, drivers, and .DLLs. The only real difference between them is what the loader chooses to do with them. There are other subtle changes in various fields within the files but the overall format is identical between all three file types.
Getting Started – the Main Module
ASM comments use the ; character. There is no open/close comment pair, although you can use:
comment ^
This is comment text.
It can run on forever as many politicians do.
Vote for me and put the Purple Party in power!
^
The ^ character can be anything, but keep in mind that whatever is used will close the comment block the first time it's encountered by the assembler. So pick something that isn't going to be part of the comment block itself.
Aside from comments, the first line of your ASM source file should be:
.data
Once declared, you can then include files that contain data declarations, or enter those declarations directly. The .data block ends implicitly when the .code directive is encountered.
Although they can go anywhere, I typically throw my macro definitions into the data block as well. Do what works for you, but macros, of course, must be defined before they’re actually used.
After your data declarations come:
.code
Now you’re in the code segment.
At this point, you’re thinking well duh!, but that’s about the extent of how complex an assembly language app is so get used to "easy."
For .DLL modules, the traditional entry point is DllMain, and you'll have to declare it as a function:
DllMain proc ; 64-bit function
… code goes here …
ret
DllMain endp ; End function
If you’re using 32-bit code, the declaration is:
DllMain proc near hInstDll:dword, fdwReason:dword, lpReserved:dword
… code goes here …
ret 12 ; Return to caller
DllMain endp ; End function
In the 32-bit version, the parameters hInstDll, fdwReason, and lpReserved can be accessed directly, by name. In the 64-bit version, the 64-bit calling convention is followed, which means that on entry into the DllMain function:
RCX = hInstDll
RDX = fdwReason
R8 = lpReserved
The Windows loader sets up the parameters to pass, so when entering DllMain, the input values will always be as specified above. The loader doesn't know or care which language was used to create a .DLL module. If it's formatted correctly, DllMain will enter with standard parameters passed.
The fdwReason parameter will contain one of only four possible values: DLL_PROCESS_ATTACH (1), DLL_PROCESS_DETACH (0), DLL_THREAD_ATTACH (2), and DLL_THREAD_DETACH (3). These values should be declared as constants somewhere in your data segment (or before, if you prefer, as equates are insensitive to which segment they live in), as follows:
DLL_PROCESS_ATTACH equ 1
DLL_PROCESS_DETACH equ 0
DLL_THREAD_ATTACH equ 2
DLL_THREAD_DETACH equ 3
This allows you to work with these values by their standard names instead of using hard-coded integers.
Handling fdwReason Within DllMain
The fdwReason parameter answers the question "why are you calling me?" I employ a method of message routing that I've used primarily in window callback functions, but I also apply it to DllMain functions. I've done it since the dawn of mankind. This method looks up the incoming message (in this case fdwReason) in a lookup table, and jumps to the handler at the same position in a router table. This allows lookup/router table pairs to be employed in place of switch statements. Since all the values in both tables are in static memory, oodles of code are saved over using the brute-force switch statement, which compares the incoming message against a list of possible values, one at a time. In addition to hard-coding the values in the code stream, that approach is also slow and inefficient. The impact of using the lookup method in DllMain will actually be negligible, considering the function is only called twice for each process or thread that attaches to it (once on attach and once on detach), but I use it nonetheless, if for no other reason than it involves a lot less coding.
The lookup table for the fdwReason value is shown below. Don't type it in, because it's going to be ditched shortly; it's presented here for informational purposes:
dll_reasons qword ( dll_reasons_e - dll_reasons_s ) / 8 ; Count of values in the table
;--------------------------------------------------
dll_reasons_s qword DLL_PROCESS_DETACH ; DLL_PROCESS_DETACH
qword DLL_PROCESS_ATTACH ; DLL_PROCESS_ATTACH
qword DLL_THREAD_ATTACH ; DLL_THREAD_ATTACH
qword DLL_THREAD_DETACH ; DLL_THREAD_DETACH
;--------------------------------------------------
dll_reasons_e label qword ; Reference label
The router table is listed below:
dll_router qword DllMain_P_Detach ; DLL_PROCESS_DETACH = 0
qword DllMain_P_Attach ; DLL_PROCESS_ATTACH = 1
qword DllMain_T_Attach ; DLL_THREAD_ATTACH = 2
qword DllMain_T_Detach ; DLL_THREAD_DETACH = 3
To use this process, the CPU provides the instruction repnz scasq. It's short for repeat while zero flag clear (or repeat while not zero): scan string quadword. The instruction searches 64-bit quadwords (qwords) starting at the location pointed to by the RDI register, for a count of RCX, comparing the value in RAX against each successive qword. This register usage is hardwired for this instruction so it cannot change. After setting the RAX, RCX, and RDI registers, the instruction is issued. The CPU then scans qword after qword beginning at memory location RDI, advancing RDI by 8 bytes and decrementing RCX with each scan. This continues until either a match is found or the count in RCX reaches zero. Since the scan of a matching value must complete to determine that it's a match, RDI will always point to the qword after the matching value.
Coding of this process is shown below:
lea rdi, dll_reasons ; Set the scan start pointer
mov rcx, [rdi] ; Load the first qword – the entry count for the table
scasq ; Skip over the entry count
mov rsi, rdi ; Save the location of table entry 0
mov eax, edx ; Set the scan target: in 64-bit code fdwReason arrives in EDX (writing EAX zero-extends into RAX)
repnz scasq ; Execute the scan
jnz <no_match> ; Not found – do <whatever>
sub rdi, rsi ; Set byte count into table, remembering target was passed over
sub rdi, 8 ; Undo the scan overshoot (one qword = 8 bytes)
lea rax, dll_router ; Get the router table offset
jmp qword ptr [rdi + rax] ; Jump to the target process
This code handles tables of any size. It eliminates the need to check the incoming message (fdwReason, in this case) against each possible value one at a time, and allows for very clean source that lists all the possible values in one place. The dll_reasons table is set up so that new entries can be added without having to update the table's entry count by hand; the count is computed by the assembler. Note that RDI and RSI are nonvolatile registers in the 64-bit calling convention, so a function that uses them this way must save and restore them.
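For readers who want to check the pointer arithmetic, the scan-and-route steps can be modeled in C++. This is only a sketch of the logic, not the assembly itself: the handler functions and their return codes (100 through 103) are hypothetical stand-ins for the DllMain_* targets, and route_by_lookup is a name invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reason codes, matching the equates declared earlier.
constexpr std::uint64_t DLL_PROCESS_DETACH = 0;
constexpr std::uint64_t DLL_PROCESS_ATTACH = 1;
constexpr std::uint64_t DLL_THREAD_ATTACH  = 2;
constexpr std::uint64_t DLL_THREAD_DETACH  = 3;

// Hypothetical handlers; each returns a distinct code so the route taken is visible.
int on_process_detach() { return 100; }
int on_process_attach() { return 101; }
int on_thread_attach()  { return 102; }
int on_thread_detach()  { return 103; }

using Handler = int (*)();

// Lookup table and router table: entry N of one pairs with entry N of the other.
const std::uint64_t dll_reasons[] = {
    DLL_PROCESS_DETACH, DLL_PROCESS_ATTACH, DLL_THREAD_ATTACH, DLL_THREAD_DETACH};
const Handler dll_router[] = {
    on_process_detach, on_process_attach, on_thread_attach, on_thread_detach};
constexpr std::size_t kEntryCount = sizeof(dll_reasons) / sizeof(dll_reasons[0]);
static_assert(kEntryCount == sizeof(dll_router) / sizeof(dll_router[0]),
              "lookup and router tables must stay the same length");

// Models the scan: returns the handler's result, or -1 for the <no_match> path.
int route_by_lookup(std::uint64_t reason) {
    const std::uint64_t* rsi = dll_reasons;  // saved address of table entry 0
    const std::uint64_t* rdi = dll_reasons;  // scan pointer
    std::size_t rcx = kEntryCount;           // remaining entries to scan
    bool matched = false;
    while (rcx != 0 && !matched) {           // models repnz scasq
        matched = (*rdi == reason);
        ++rdi;                               // advances even on the matching entry
        --rcx;
    }
    if (!matched) return -1;

    // Byte distance from entry 0, then back up one 8-byte qword for the overshoot.
    std::size_t byte_offset =
        static_cast<std::size_t>(rdi - rsi) * sizeof(std::uint64_t) - sizeof(std::uint64_t);
    assert(byte_offset % sizeof(std::uint64_t) == 0);
    return dll_router[byte_offset / sizeof(std::uint64_t)]();
}
```

Calling route_by_lookup(DLL_THREAD_ATTACH) runs the third handler, and a value that isn't in the table falls through to the no-match result. The overshoot correction is a full qword because the scan pointer moves 8 bytes per entry.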
Now that I've presented the lookup process, I'm going to abandon the lookup part. There's no point in using it in this relatively unique case. Since fdwReason can only have the values 0, 1, 2, or 3, simply multiply the value by 8 (shift left 3 bits) and use the result as an offset directly into the router table. In the almost-exclusive case of handling fdwReason within DllMain, the lookup table can't return any information that isn't already contained in fdwReason.
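mov eax, edx ; Get the incoming value; in 64-bit code fdwReason arrives in EDX
shl rax, 3 ; Scale to * 8 for qword size
lea r10, dll_router ; Get the router table address (RIP-relative, so no 32-bit absolute address is needed)
jmp qword ptr [r10 + rax] ; Jump to the target process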
The above handles the routing for the fdwReason value perfectly.
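A C++ model of the direct-index version may help show what the shift accomplishes. It is a sketch under assumptions: the handlers and their return codes are hypothetical, route_direct is an invented name, and the range check is an addition for illustration that the assembly above does not perform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical handlers standing in for the DllMain_* targets.
int on_process_detach() { return 100; }
int on_process_attach() { return 101; }
int on_thread_attach()  { return 102; }
int on_thread_detach()  { return 103; }

using Handler = int (*)();

// Router table ordered by reason value: 0, 1, 2, 3.
const Handler dll_router[] = {
    on_process_detach, on_process_attach, on_thread_attach, on_thread_detach};
constexpr std::size_t kEntryCount = sizeof(dll_router) / sizeof(dll_router[0]);

int route_direct(std::uint32_t fdwReason) {
    if (fdwReason >= kEntryCount) return -1;  // guard added here; not in the assembly
    // shl rax, 3: the reason value becomes a byte offset of 0, 8, 16, or 24.
    std::size_t byte_offset = static_cast<std::size_t>(fdwReason) << 3;
    // C++ indexes by entry rather than by byte, so convert the offset back.
    return dll_router[byte_offset >> 3]();
}
```

The guard matters only if something other than the loader can call the function with an arbitrary value; with the four documented reasons, the table index is always in range.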
Moving On…
Once you've executed the desired handler for the incoming fdwReason value, you can simply exit DllMain. The final return value must be present in RAX on return from DllMain: 1 means success, 0 means failure. If the Windows loader receives a return code of 0 from DllMain during DLL_PROCESS_ATTACH, it will unload the library and call it a wash, so the proper return value is critical. For the other three reasons, the return value is ignored.
External Functions
Linking to Windows libraries requires declaring the functions you're going to be calling as external. Nobody likes name mangling. The 64-bit Windows libraries have dropped the @24-style suffix from the end of the function name, but __imp_ still precedes each function name, and nobody wants to work with that every time a function is called.
Assembly's text equates rescue the developer from this nightmare. The full declaration for the Windows function LoadLibrary is shown below. Note that the A or W variant must still be specified for any function that takes string parameters.
First, the 64-bit version:
extrn __imp_LoadLibraryA:qword ; In 64 bit mode, externals are always declared as qwords
LoadLibrary textequ <__imp_LoadLibraryA>
The 32 bit version looks like this:
extrn _imp__LoadLibraryA@4:dword ; In 32 bit mode, externals are always declared as dwords
LoadLibrary textequ <_imp__LoadLibraryA@4>
Count the underscores carefully! They're not the same between 32- and 64-bit declarations.
For those not familiar with abhorrent name mangling in 32-bit mode, the number after the @ is the total size in bytes of the parameters passed to the function, which works out to the number of parameters times four when every parameter is 32 bits wide. LoadLibrary takes one parameter so it's declared with @4.
Once the externals are declared, you can simply:
call LoadLibrary ; Load the target library
and you're off and running.
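The arithmetic behind the 32-bit @ suffix can be sketched in C++. The function names below are invented for this illustration, and the sketch covers only the byte count, not the leading underscores or the _imp_ prefix. It assumes each parameter occupies a whole number of 4-byte stack slots, which is how 32-bit stdcall lays out its parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Total stack bytes for a 32-bit stdcall parameter list.
// Each parameter is rounded up to a whole number of 4-byte stack slots.
std::size_t stdcall_param_bytes(const std::vector<std::size_t>& param_sizes) {
    std::size_t total = 0;
    for (std::size_t size : param_sizes) {
        assert(size > 0);
        total += (size + 3) / 4 * 4;  // round up to the next multiple of 4
    }
    return total;
}

// Appends the "@N" suffix to a function name supplied by the caller.
std::string with_stdcall_suffix(const std::string& name,
                                const std::vector<std::size_t>& param_sizes) {
    return name + "@" + std::to_string(stdcall_param_bytes(param_sizes));
}
```

With one 4-byte parameter, with_stdcall_suffix("LoadLibraryA", {4}) gives LoadLibraryA@4. Three 4-byte parameters give a count of 12, the same number used in the ret 12 that closes the 32-bit DllMain.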
64-Bit Parameter Passing
There are many articles online covering 64-bit parameter passing. In 64-bit mode, the first four parameters are passed in registers, not on the stack. It can become confusing when it comes to floats, but the following table should clarify:
Parameter Number | 1 | 2 | 3 | 4
Non-Float | RCX | RDX | R8 | R9
Float | XMM0 | XMM1 | XMM2 | XMM3
Float values are passed in the XMM registers, for both single-precision and double-precision values. The register is chosen by the parameter's position in the list, not by how many floats precede it.
All parameters after the fourth are passed on the stack, so register usage is irrelevant for them. Floats and doubles beyond the fourth parameter are passed by value on the stack like any other parameter; larger types, such as 128-bit vectors, are passed by reference (a pointer to the value is passed instead of the actual value).
If the function foo were being called with the following C++ code:
int hr = foo ( "Hello", "world!", 3.14f, 95 );
"Hello" would be pointed to by RCX, "world!" would be pointed to by RDX, 3.14 would be contained in XMM2 (it's the third parameter), and the integer value 95 would be in R9.
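The position-based register assignment can be expressed as a small C++ lookup. This is a sketch for illustration only: ArgKind and arg_location are invented names, positions are zero-based, and the stack offsets are given as the caller sees them immediately before the call instruction, assuming the standard 32 bytes the 64-bit Windows convention has the caller reserve for the four register parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class ArgKind { Integer, Float };  // Integer also covers pointers

// Where the caller places the argument at zero-based `position`.
std::string arg_location(std::size_t position, ArgKind kind) {
    static const char* const integer_regs[4] = {"RCX", "RDX", "R8", "R9"};
    static const char* const float_regs[4]   = {"XMM0", "XMM1", "XMM2", "XMM3"};
    if (position < 4) {
        // The slot is fixed by position; the kind only selects the register file.
        return kind == ArgKind::Float ? float_regs[position] : integer_regs[position];
    }
    // Past the fourth argument: one 8-byte stack slot each, located above the
    // 32 bytes reserved for the four register arguments.
    return "[RSP+" + std::to_string(8 * position) + "]";
}
```

For the foo call above, arg_location(2, ArgKind::Float) reports XMM2 and arg_location(3, ArgKind::Integer) reports R9. A fifth argument, at position 4, lands at [RSP+32].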
I’ve written in-depth about the 64-bit calling convention, as have many others. My article on the subject is on CodeProject at Nightmare on (Overwh)Elm Street: The 64-bit Calling Convention.
For 32-bit code, parameters are passed on the stack. Under the stdcall convention used by the Windows API, the function that's called removes them when it returns, so no cleanup is required by the caller. Remember to push parameters in reverse order, last parameter first. Assuming foo follows that convention, 32-bit code for calling it as shown in the C++ code above might appear as follows (no registers need to be loaded, since foo only looks at the stack for its parameter data):
push 95
push pi ; "pi" is a 32-bit real4 variable initialized at 3.14
push offset world_string
push offset hello_string
call foo
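The reason for pushing in reverse can be seen in a small C++ model of a downward-growing stack. It is an illustrative sketch only; push_in_reverse is an invented name and the values stand in for 32-bit parameter slots.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simulates pushing 32-bit parameters onto a stack that grows toward lower
// addresses. The returned vector lists the slots from the lowest address
// upward, which is the order in which the called function reads them.
std::vector<std::uint32_t> push_in_reverse(
        const std::vector<std::uint32_t>& args_in_source_order) {
    std::vector<std::uint32_t> slots;
    for (auto it = args_in_source_order.rbegin();
         it != args_in_source_order.rend(); ++it) {
        slots.insert(slots.begin(), *it);  // each push lands below the previous one
    }
    return slots;
}
```

Because every push lands below the one before it, pushing the last parameter first leaves the first parameter at the lowest address, so the callee finds the parameters in source order.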
Wrapping Up DllMain
When you're done coding the function, there's little left to do beyond closing out DllMain:
ret ; Return from DllMain (for 32-bit code, use ret 12)
DllMain endp ; End procedure
The last line of the main module is simply end for 64-bit code, and end DllMain for 32-bit code.
Compiling the Project
The batch file to compile the code is shown below:
@echo off
rem Set this value to the location of rc.exe under the VC directory; it contains the RC.EXE executable
set rc_directory="C:\Program Files (x86)\Windows Kits\10\bin\x86
rem Set this value to the location of ml64.exe under the VC directory
set ml_directory="C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64
rem Set this value to the location of link.exe under the VC directory; it contains the LINK.EXE executable
set link_directory="C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin
rem Set this directory to the INCLUDE location for ASM source
set asm_source_directory="C:\[your ASM directory]
rem Set this directory to the include path for Windows libraries.
rem Use c:\dir1;c:\dir2 format for multiple directories.
rem NOTE THAT THERE CAN BE NO TERMINATING \ IN THE STRING DEFINITION OR THE PATH WILL NOT BE FOUND.
set lib_directory="C:\Program Files (x86)\Windows Kits\10\Lib\10.0.10586.0\um\x64
%rc_directory%\rc.exe" %asm_source_directory%\resource.rc"
%ml_directory%\ml64.exe" /c /Cp /Cx /Fm /FR /W2 /Zd /Zf /Zi /Ta %asm_source_directory%\your_dll.asm" /I%asm_source_directory%" > %asm_source_directory%\asm_errors.txt"
%link_directory%\link.exe" %asm_source_directory%\your_dll.obj" /debug /def:%asm_source_directory%\your_dll.def" /entry:DllMain /manifest:no /machine:x64 /map /dll /out:%asm_source_directory%\your_dll.dll" /pdb:%asm_source_directory%\your_dll.pdb" /subsystem:windows,6.0 /LIBPATH:%lib_directory%" user32.lib kernel32.lib
rem In the link command: use /debug for debug symbols, or /debug:none for the release version.
rem In the link command: use /machine:x86 for 32-bit code.
rem copy *.dll [wherever you want the .DLL and .LIB files to copy to]
type %asm_source_directory%\asm_errors.txt"
There are so many variations in directory setups between any two developers that common sense will have to be applied to get the batch file functioning correctly. What's in it is straightforward and should not pose any problems.
Conclusion
This article has been aimed at guiding you through the process of creating an all-assembly .DLL module for the extreme situations that might warrant it. Beyond what's been covered here, the rest of what's involved is simply creating your functions as you normally would in ASM.
Using ASM in high-demand situations can potentially save thousands and thousands of cumulative man-hours for a team of developers over being restricted to inline assembly or intrinsics - depending, of course, on the specific situation. For example, in today's world, it's considered normal for compiles that inherently require seconds to complete to take hours. Creating ASM-coded .DLL modules opens up the power of the entire language. If you have Visual Studio, you can use ml64.exe and the C++ linker; if you use another language you can use its linker and any number of alternative assemblers available online. Of course, if you use another assembler besides Microsoft's, you'll have to adjust the code and data snippets shown here for the particulars of that assembler.
Creating just one .DLL project using the direction in this article should be all that’s needed to make any developer comfortable with the process, illustrating how relatively simple the task really is.
There are situations in the real world that demand special handling. Being properly armed with the capability to apply ASM to a given task when it's called for can translate to enormous time savings for many developers and end users over time. Assessing the situation realistically, Windows is so bogged down with machine-specific dependencies, OS dependencies, and dependencies on dependencies, with ten gazillion versions of all the above on any given day, that nothing about adding an assembly language .DLL to your project should be out of line.