The Library
Introduction
Previously: PolyHook V1 Article
I've spent the last 2 years re-writing PolyHook to fix a lot of the known edge cases in V1. I'll briefly cover how the implemented hooking methods work, but this is an advanced topic and you should read my other article first, which goes in depth on that. This article will focus on the edge cases, and why it took me 2 years to get it working in release mode with modern compilers on multiple architectures. It's still not perfect, but it's significantly better in all ways. There's a lot to be said about just how deep the rabbit hole goes; I've only just recently crawled back out of it.
Background
Hooking is the process of redirecting the control flow of a program from its original path. Typically, it is used when access to the source code is not available, so it is an inherently low level process that operates at the assembly level or at least after the compilation stage. Depending on the method used, different effects can be achieved. All methods allow executing a callback that fires just before a hooked method would be called. Some methods allow changing function arguments or return values. Furthermore, some methods modify the compiled program's code while others abuse techniques transparent to the running program.
The Bugs
In V1, there were a few unhandled edge cases of inline hooks:
- Jmps back into prologue not supported
- Indirect prologue (jmp at beginning)
- x64 stack touched
- Failure to hook left original function malformed in a partially overwritten state
- Hooking would race trampoline creation
And also a lot of bugs in other hooking methods:
- Mutex acquired in Vectored Exception Handler
- Breakpoint type and width not set in Dr7
- IAT failed to find import thunk to hook
Let's see what all that means. We'll start with my favorite.
Jmps into prologue (1)
0: 55 push ebp
1: 89 e5 mov ebp,esp <-
3: 89 e5 mov ebp,esp |
5: 89 e5 mov ebp,esp |
7: 89 e5 mov ebp,esp |
9: 90 nop |
a: 90 nop |
b: 7f f4 jg 0x1 ----
Notice the jg assembly instruction jumps back to address 0x1. When performing a hook on x86, the above prologue is overwritten with a 5 byte e9 style jump so that it becomes the following:
0: e9 ef be ad de jmp hook_callback <--
5: 89 e5 mov ebp,esp | <--- callback executes, runs the
7: 89 e5 mov ebp,esp | overwritten instructions and
9: 90 nop | returns here once done
a: 90 nop |
b: 7f f4 jg 0x1 -------------
That jg now points to byte ef, belonging to the jmp. This is a problem: when it's executed, the CPU will start decoding in the middle of the instruction and won't interpret it as a jmp, but rather as some garbage. There are many ways to fix this, some more complex than others. We could re-encode the jg to point to 0x0 so that it follows the jmp and no longer executes garbage, but when the jmp landed, it would break the control flow: the user callback would fire a second time, and execution would not continue at mov ebp, esp like it did originally, so this is wrong.
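To make the arithmetic concrete, here is a minimal sketch of how one might detect this case. It is not PolyHook's actual code, and the helper names shortJumpTarget and landsInsidePatch are made up for illustration. It decodes the target of a 2-byte short jump and checks whether that target falls in the middle of the bytes the hook jmp overwrote.

```cpp
#include <cassert>
#include <cstdint>

// Target of a 2-byte short conditional jump (opcode 0x70-0x7f followed by rel8).
// The displacement is relative to the address of the *next* instruction.
inline uint64_t shortJumpTarget(uint64_t insnAddr, uint8_t rel8) {
    return insnAddr + 2 + static_cast<int64_t>(static_cast<int8_t>(rel8));
}

// True if `target` lands strictly inside the bytes replaced by the hook jmp,
// i.e. in the middle of the new instruction rather than at its first byte.
inline bool landsInsidePatch(uint64_t target, uint64_t patchStart, uint64_t patchLen) {
    assert(patchLen > 0);
    return target > patchStart && target < patchStart + patchLen;
}
```

For the listing above, the jg at 0xb with displacement byte f4 decodes to target 0x1, which is inside the 5 byte patch that starts at 0x0.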
We could also try to build a jmp table, and overwrite a little bit more of the prologue to make room for the jmp table entries, each of which is a wider jmp type that reaches all the way to wherever the trampoline is. The whole prologue section would be copied to the trampoline, and we could just place a jmp to there when we want to execute those instructions by redirecting the conditional jmp to the bigger jmp.
0: e9 ef be ad de jmp hook_callback
5: e9 ef be ad de jmp trampoline_mov_ebp_esp <- points to copy in trampoline
a: 90 nop
b: 7f f8 jg 0x5
But this has a really big problem. IT'S SUPER HARD. The jmp table must be in the prologue because we only have -128 to +127 bytes of displacement to work with (the single signed byte f4 of jg 7f f4). This makes it so that the more fixups we have to do, the more of the prologue we overwrite, which could potentially mean even more fixups, which means...yea it's an unbounded recursive problem trying to be solved in a fixed amount of space. And what happens when you need to do so many fixups that your jmp table grows to a size that it hits the first jump you fixed (address b in this example)? I tried to implement this many times, but it introduces more edge cases than it fixes and can be solved better and simpler with the method mentioned next.
The general solution that I chose was to K.I.S.S and just expand the prologue section that is copied to the trampoline, and fix the jump there if it was in range. Here is what the current example turns into:
0: e9 ef be ad de jmp hook_callback
.... nops all the way down ...
b: 90 nop
trampoline:
100: 55 push ebp
101: 89 e5 mov ebp,esp <-
103: 89 e5 mov ebp,esp |
105: 89 e5 mov ebp,esp |
107: 89 e5 mov ebp,esp |
109: 90 nop |
10a: 90 nop |
10b: 7f f4 jg 0x101 ---
Let's look at a more complicated example that also requires a jmp table entry in the trampoline:
Original function:
145804c [1]: 55 push ebp <--
145804d [2]: 8b ec mov ebp, esp | <-
145804f [2]: 74 fb je 0x145804c -- | <-
1458051 [2]: 74 ea je 0x145803d ----- |
1458053 [2]: 74 fa je 0x145804f ---------
1458055 [2]: 8b ec mov ebp, esp
1458057 [2]: 8b ec mov ebp, esp
1458059 [2]: 8b ec mov ebp, esp
Trampoline:
c11a20 [1]: 55 push ebp <-
c11a21 [2]: 8b ec mov ebp, esp |
c11a23 [2]: 74 fb je 0xc11a20 --- <-
c11a25 [2]: 74 07 je 0xc11a2e ---- |
c11a27 [2]: 74 fa je 0xc11a23 -- | --
c11a29 [5]: e9 27 66 84 00 jmp 0x1458055 |
c11a2e [5]: e9 0a 66 84 00 jmp 0x145803d <-
These jmps make it complicated to just move the prologue section. We have to move the whole thing as a chunk and then redirect the conditional je to point to a bigger jmp once it's relocated to the trampoline. This is because the je only has -128 to +127 bytes of displacement to work with and it's extremely unlikely the trampoline's buffer happened to be allocated that close. Therefore, this solution of expanding the prologue works, but it gets really complicated to redirect all the jmps to preserve code flow and stay within the displacement size of each instruction. This is implemented in PolyHook V2.
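The fixups in the trampoline listing come down to two small encoders. The following is a sketch under my own naming (encodeJmp32 and encodeRel8 are hypothetical helpers, not PolyHook's API): one builds the 5 byte e9 jmp used for a jmp table entry, the other computes the single displacement byte of a relocated short jump.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Encode a 5-byte `e9 rel32` jmp located at `from` that lands on `to`.
// rel32 is relative to the end of the jmp instruction (from + 5).
inline std::array<uint8_t, 5> encodeJmp32(uint64_t from, uint64_t to) {
    const int64_t rel = static_cast<int64_t>(to) - static_cast<int64_t>(from + 5);
    assert(rel >= INT32_MIN && rel <= INT32_MAX);
    const uint32_t u = static_cast<uint32_t>(static_cast<int32_t>(rel));
    return {{0xe9, uint8_t(u), uint8_t(u >> 8), uint8_t(u >> 16), uint8_t(u >> 24)}};
}

// Displacement byte for a 2-byte short jump at `from` that lands on `to`.
// Fails the assertion if the target is outside the -128..+127 window.
inline uint8_t encodeRel8(uint64_t from, uint64_t to) {
    const int64_t rel = static_cast<int64_t>(to) - static_cast<int64_t>(from + 2);
    assert(rel >= -128 && rel <= 127);
    return static_cast<uint8_t>(static_cast<int8_t>(rel));
}
```

Plugging in the addresses from the trampoline above, the je at c11a25 gets displacement 07 to reach the table entry at c11a2e, and that entry encodes as e9 0a 66 84 00 to reach 0x145803d.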
Indirect Prologue (2)
Turns out compilers like to optimize stuff! In release mode, many calls do not go directly to the function, but rather to a jmp table first. The following demonstrates this:
foo();
typical asm:
call foo
optimized asm:
jmp 0x0
jmp table:
0: jmp foo_implementation <- jmp to actual guts of foo
5: jmp bar_implementation
a: jmp foobar_implementation
...
So hooking would fail because this:
void (*pFnFoo)() = &foo;
would not point to the guts of foo but actually to the jmp in the jmp table. Overwriting it is where things would go horribly wrong: the jmp table would be malformed, and other seemingly random functions would do who knows what since they now pointed to who knows where. The fix was to follow these jmps until we landed at code. This also fixes hooking a function multiple times, as the second hook will just follow the first hook's jmp to the callback and hook the callback, chaining callback hooks at runtime in assembly...isn't that neat.
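Following the jmps can be sketched roughly like this. This is a simplified illustration rather than the library's implementation: it works on a byte vector instead of live memory, only understands the 5 byte e9 form, and the name followJmps and the hop limit are my own assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Follow a chain of `e9 rel32` jmps inside a byte image and return the offset
// of the first instruction that is not such a jmp. `maxHops` guards against
// cycles. Assumes a little-endian host when reading rel32.
inline std::size_t followJmps(const std::vector<uint8_t>& image, std::size_t offset,
                              int maxHops = 16) {
    assert(maxHops > 0);
    for (int hop = 0; hop < maxHops; ++hop) {
        if (offset + 5 > image.size() || image[offset] != 0xe9) {
            return offset;  // not a jmp (or truncated): treat this as real code
        }
        int32_t rel = 0;
        std::memcpy(&rel, &image[offset + 1], sizeof(rel));
        const int64_t next = static_cast<int64_t>(offset) + 5 + rel;
        if (next < 0 || static_cast<std::size_t>(next) >= image.size()) {
            return offset;  // target is outside the image: stop here
        }
        offset = static_cast<std::size_t>(next);
    }
    return offset;
}
```

A real implementation would also have to handle the other jmp encodings (short eb jumps and indirect ff 25 jumps), which this sketch skips to stay small.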
Stack touched (3)
56360477b000 [1]: 55 push rbp
56360477b001 [3]: 48 89 e5 mov rbp, rsp
56360477b004 [3]: 89 7d fc mov dword ptr [rbp - 4], edi
56360477b007 [4]: 83 7d fc 00 cmp dword ptr [rbp - 4], 0
56360477b00b [2]: 7e 15 jle 0x56360477b022
56360477b00d [5]: b8 0f 00 00 00 mov eax, 0xf
56360477b012 [1]: 50 push rax <- oopsies just overwrote edi
56360477b013 [a]: 48 b8 4d 5a 53 04 36 56 00 00 movabs rax, 0x563604535a4d
56360477b01d [4]: 48 87 04 24 xchg qword ptr [rsp], rax
56360477b021 [1]: c3 ret
On x64 in PolyHook V1 the gadget push, mov, xchg, ret was used to jmp back to the original function, and the push from that gadget clobbers stack values. This caused hard to diagnose behavior differences in hooked functions. In V2, this is fixed by using the FF 25 style jmp.
ff 25 ef be ad de jmp [0xdeadbeef]
deadbeef: &original_function
As you can see, there is no stack or register usage involved, so it's fine. It does mix code and data however as the destination to jmp to is actually written into memory somewhere in the .text section...it's fine with careful book-keeping and in V2, I write this data at the very end of the trampoline where the data can never be accidentally executed as code.
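As a sketch of what emitting such a jmp involves (the function name encodeIndirectJmp64 is an assumption of mine, not PolyHook's API): in 64-bit mode the four bytes after ff 25 are a displacement relative to the end of the instruction, and the CPU reads the 8 byte destination from that address.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Encode a 6-byte `ff 25 disp32` jmp located at `jmpAddr` that reads its
// 8-byte destination from the data slot at `slotAddr`. In 64-bit mode the
// displacement is RIP-relative, i.e. measured from jmpAddr + 6.
inline std::array<uint8_t, 6> encodeIndirectJmp64(uint64_t jmpAddr, uint64_t slotAddr) {
    const int64_t disp = static_cast<int64_t>(slotAddr) - static_cast<int64_t>(jmpAddr + 6);
    assert(disp >= INT32_MIN && disp <= INT32_MAX);
    const uint32_t u = static_cast<uint32_t>(static_cast<int32_t>(disp));
    return {{0xff, 0x25, uint8_t(u), uint8_t(u >> 8), uint8_t(u >> 16), uint8_t(u >> 24)}};
}
```

With the slot placed at the very end of the trampoline, the displacement is simply the distance from the end of the jmp to that slot, and the slot itself holds the absolute address of the original function.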
Malformed Prologue on Errors
There are various errors that could occur that cause a hook to fail mid-way through modification of the assembly. An allocation could fail, the disassembler could hit a bad instruction, we might fail to resolve a jmp, etc. If one of these cases were hit in V1, the assembly would be left in a partially overwritten state and it would be up to the user to fix. This is bad design. In V2, all of the hooking logic operates on a cached byte buffer of the instructions. When writes occur, they write to the buffer (one small buffer per instruction). Only once the hooking operation has finished and we are reasonably sure all is well are these byte buffers actually written and the original assembly modified. As an added bonus, the features to do this were upstreamed to Capstone 'next'. Now, unlike V1, PolyHook does not require a fork of Capstone to work properly.
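The stage-then-commit idea can be illustrated with a small stand-in. StagedPatch is a hypothetical class written for this article, not the V2 implementation, and it patches a byte vector rather than live code. Every write is queued first, and memory is only touched once all queued writes have been validated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: collects pending writes and only modifies `memory`
// inside commit(), after every write has been checked.
class StagedPatch {
public:
    void stage(std::size_t offset, std::vector<uint8_t> bytes) {
        assert(!bytes.empty());
        m_writes.push_back({offset, std::move(bytes)});
    }

    // All-or-nothing: returns false and leaves `memory` untouched if any
    // staged write would fall outside of it.
    bool commit(std::vector<uint8_t>& memory) const {
        for (const auto& w : m_writes) {
            if (w.offset > memory.size() || w.bytes.size() > memory.size() - w.offset) {
                return false;
            }
        }
        for (const auto& w : m_writes) {
            std::copy(w.bytes.begin(), w.bytes.end(),
                      memory.begin() + static_cast<std::ptrdiff_t>(w.offset));
        }
        return true;
    }

private:
    struct Write {
        std::size_t offset;
        std::vector<uint8_t> bytes;
    };
    std::vector<Write> m_writes;
};
```

The point of the design is that a failure anywhere before commit() costs nothing: the target function has not been modified yet, so there is nothing for the user to repair.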
Trampoline Creation Race Condition
The API in V1 was meant to be simple. You call setup, then hook, then a method to get the allocated trampoline to call the original:
Detour detour;
detour.setup(&hookMe, &myCallback);
detour.hook();
pTrampoline = detour.getOriginal();
The problem however was that you could only get a pointer to the allocated trampoline AFTER you had hooked the function. So it was possible that your callback would be dispatched in between when you called hook and when you filled the pTrampoline variable. If this happened, the callback would fire and attempt to call pTrampoline, which would hold an invalid value. And then you'd crash. The allocation of the trampoline occurs inside the hook() routine so there was no simple fix for this in V1. In V2 however, the interface was changed. The constructor now takes pTrampoline as an argument and fills it for you just before the hook is committed to memory. Because the trampoline variable you pass is filled before the hook overwrites the original function, you get the guarantee that your callback only fires once your trampoline variable is valid.
Detour detour(&hookMe, &myCallback, &trampoline);
detour.hook();
Vectored Exception and Vectored Continue Handlers
To implement the hooking types that throw exceptions, PolyHook needs to register an exception handler. This exception handler needs to catch the exception so that it can call the callback and resume as if the hook never threw an exception in the first place. This is done with the API:
PVOID WINAPI AddVectoredExceptionHandler(
_In_ ULONG FirstHandler,
_In_ PVECTORED_EXCEPTION_HANDLER VectoredHandler
);
It takes a pointer to a function to be called when the exception occurs, and potentially multiple hook types will generate different exceptions, but they all will be routed to the same handler. If we take a look at the MSDN remarks, the first thing it says is:
Quote:
The handler should not call functions that acquire synchronization objects or allocate memory, because this can cause problems. Typically, the handler will simply access the exception record and return.
Now let's go look at the first line for the handler code for V1:
std::lock_guard<std::mutex> m_Lock(m_TargetMutex);
Whoops, that's exactly what MSDN says not to do. V2 fixes this. There's also another interesting type of exception handler though, a VectoredContinueHandler. A VectoredExceptionHandler is called when the exception is thrown, but a VectoredContinueHandler is called once another handler has decided to return EXCEPTION_CONTINUE_EXECUTION. Turns out debuggers return this if you click play (not single stepping). This is a nice method to detect if BP hooks are being used, or debuggers are attached. Here's a good post about these things.
There is also a secret magic number C++ exceptions throw which I found during development:
0xE06D7363:
BP Type and Size
When you place a hardware breakpoint, the debugger actually writes the type of breakpoint, the address to hit on, and the size to hit on into special registers on your CPU. These registers are Dr0-Dr7. You are allowed to place up to 4 BPs per thread: Dr0-Dr3 hold the addresses you want to break on, and a few bits in Dr7 control if they are enabled, their type, and their size. In V1, I had a bug where I didn't set the bits in Dr7 correctly. I wrote the address to hit on, and then enabled the breakpoint by writing:
switch (m_regIdx) {
case 0:
ctx.Dr0 = (decltype(ctx.Dr0))m_fnAddress;
break;
case 1:
ctx.Dr1 = (decltype(ctx.Dr1))m_fnAddress;
break;
case 2:
ctx.Dr2 = (decltype(ctx.Dr2))m_fnAddress;
break;
case 3:
ctx.Dr3 = (decltype(ctx.Dr3))m_fnAddress;
break;
}
ctx.Dr7 |= 1ULL << (2 * m_regIdx);
This tells the CPU to turn on one of the HW bp's and to hit on address m_fnAddress, but not whether to hit on read, write, or execute, and also not the size of memory it should monitor. To do that, I needed:
ctx.Dr7 &= ~(3ULL << (16 + 4 * m_regIdx)); // type bits = 00b: break on execution
ctx.Dr7 &= ~(3ULL << (18 + 4 * m_regIdx)); // size bits = 00b: 1 byte
which sets a 1 byte breakpoint to hit on execution. For reference, here is the bit layout of Dr7 from https://wiki.osdev.org/CPU_Registers_x86#Debug_Registers:
bit | Description |
0 | local DR0 enable |
1 | global DR0 enable |
2 | local DR1 enable |
3 | global DR1 enable |
4 | local DR2 enable |
5 | global DR2 enable |
6 | local DR3 enable |
7 | global DR3 enable |
16-17 | type DR0 |
18-19 | size DR0 |
20-21 | type DR1 |
22-23 | size DR1 |
24-25 | type DR2 |
26-27 | size DR2 |
28-29 | type DR3 |
30-31 | size DR3 |
Quote:
00b condition means execution break, 01b means a write watchpoint, and 11b means an R/W watchpoint. 10b is reserved for I/O R/W (unsupported).
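Putting the enable, type, and size bits together, the whole Dr7 update can be written as one pure function. This is a sketch for illustration, and enableExecBreakpoint is a name I made up, but the bit positions follow the table above.

```cpp
#include <cassert>
#include <cstdint>

// Returns `dr7` with debug register slot `idx` (0-3) locally enabled as a
// 1-byte execute breakpoint. All other bits are left unchanged.
inline uint64_t enableExecBreakpoint(uint64_t dr7, unsigned idx) {
    assert(idx < 4);
    dr7 |= 1ULL << (2 * idx);          // local enable bit for DR<idx>
    dr7 &= ~(3ULL << (16 + 4 * idx));  // type = 00b: break on execution
    dr7 &= ~(3ULL << (18 + 4 * idx));  // size = 00b: 1 byte
    return dr7;
}
```

Keeping the bit twiddling separate from the thread context calls makes it easy to check against the layout without touching real debug registers.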
Currently, I still set the debug registers with a call to SetThreadContext from the same thread, which is undefined according to Microsoft. I'm wagering this is ok because I only set the debug registers and I've never had it fail in any of my testing, but I have not done any in-depth analysis to check if this is truly ok.
Finding IAT Thunks Failed
In V1, the IAT hook would sometimes fail because it couldn't find the import. This was because I made the mistake of only walking my own process's IAT, and not also those of the other modules it had loaded. If you want to resolve the thunk of an entry you have to kind of do it recursively. A process loads a few modules (what I call DLLs) and those DLLs export some entries. Those DLLs however ALSO have IATs and can load other things which also have...which also...you get it. And this is where my mistake was: I naively only went the first level deep in V1, so it failed to find APIs sometimes. I also used dbghelp.lib to find the IMAGE_DIRECTORY_ENTRY_IMPORT, which was nice but added a dependency. So the fix was to walk the PEB to find all loaded modules, and then for each loaded module to walk its IAT.
The PEB stores a linked list of modules at Peb->Ldr->InLoadOrderModuleList and you can grab an image base from there. Then to get the IAT, you cast the image base to a DosHeader, follow DosHeader->e_lfanew to the NTHeader, and read NTHeader->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT]. You also need to carefully check for null pointers as some of the fields in the IAT are zero'd depending on the compiler. Full code is on GitHub.
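The header walk can be shown without any Windows headers by parsing the raw bytes. The sketch below is not the code from the repository: findDataDirectory, readAt, and DataDirectory are names I chose, it works on a byte vector instead of a mapped module, and it assumes a little-endian host. The offsets are the standard PE32 and PE32+ ones.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct DataDirectory {
    uint32_t rva;
    uint32_t size;
};

// Bounds-checked read of a T at byte offset `off`.
template <typename T>
bool readAt(const std::vector<uint8_t>& img, std::size_t off, T& out) {
    if (off > img.size() || sizeof(T) > img.size() - off) return false;
    std::memcpy(&out, img.data() + off, sizeof(T));
    return true;
}

// Locate data directory `index` in a PE image (index 1 is the import directory).
// Returns false for malformed images or when the directory is absent.
inline bool findDataDirectory(const std::vector<uint8_t>& img, unsigned index,
                              DataDirectory& out) {
    assert(index < 16);  // the PE format defines 16 data directory slots
    uint16_t mz = 0, magic = 0;
    uint32_t lfanew = 0, sig = 0, count = 0;
    if (!readAt(img, 0, mz) || mz != 0x5A4D) return false;               // "MZ"
    if (!readAt(img, 0x3C, lfanew)) return false;                        // e_lfanew
    if (!readAt(img, lfanew, sig) || sig != 0x00004550) return false;    // "PE\0\0"
    const std::size_t opt = std::size_t(lfanew) + 4 + 20;  // signature + file header
    if (!readAt(img, opt, magic)) return false;
    std::size_t dirs = 0;
    if (magic == 0x10B) dirs = opt + 96;        // PE32 optional header
    else if (magic == 0x20B) dirs = opt + 112;  // PE32+ optional header
    else return false;
    if (!readAt(img, dirs - 4, count) || index >= count) return false;   // NumberOfRvaAndSizes
    if (!readAt(img, dirs + 8 * std::size_t(index), out)) return false;
    return out.rva != 0;  // a zero RVA means the module has no such table
}
```

The final zero check corresponds to the case seen in the output for ntdll.dll, which has no import table at all.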
The result is V2 can search the IAT correctly and recursively now (capped the list to show only a few APIs):
Module: PolyHook_2.exe
--DLL: KERNEL32.dll
----API: GetStdHandle
----API: IsDebuggerPresent
----API: OutputDebugStringA
----API: AddVectoredExceptionHandler
----API: RemoveVectoredExceptionHandler
----API: SetThreadStackGuarantee
----API: GetConsoleScreenBufferInfo
--DLL: MSVCP140.dll
----API: ?_Getgloballocale@locale@std@@CAPEAV_Locimp@12@XZ
----API: ?always_noconv@codecvt_base@std@@QEBA_NXZ
----API: ?tolower@?$ctype@D@std@@QEBADD@Z
----API: ?tolower@?$ctype@D@std@@QEBAPEBDPEADPEBD@Z
----API: ?_Getcat@?$ctype@D@std@@SA_KPEAPEBVfacet@locale@2@PEBV42@@Z
----API: ?in@?$codecvt@DDU_Mbstatet@@@std@@QEBAHAEAU_Mbstatet@@PEBD1AEAPEBDPEAD3AEAPEAD@Z
----API: ?out@?$codecvt@DDU_Mbstatet@@@std@@QEBAHAEAU_Mbstatet@@PEBD1AEAPEBDPEAD3AEAPEAD@Z
--DLL: VCRUNTIME140.dll
----API: strrchr
----API: _purecall
----API: __std_terminate
----API: __std_type_info_destroy_list
----API: memchr
----API: memmove
----API: strchr
--DLL: api-ms-win-crt-runtime-l1-1-0.dll
----API: _seh_filter_dll
----API: _configure_narrow_argv
----API: _initialize_narrow_environment
----API: _initialize_onexit_table
----API: _register_onexit_function
----API: _execute_onexit_table
----API: _crt_atexit
--DLL: api-ms-win-crt-heap-l1-1-0.dll
----API: _callnewh
----API: free
----API: realloc
----API: calloc
----API: _set_new_mode
----API: malloc
--DLL: api-ms-win-crt-utility-l1-1-0.dll
----API: rand
----API: srand
----API: qsort
--DLL: api-ms-win-crt-math-l1-1-0.dll
----API: _dtest
----API: __setusermatherr
----API: pow
----API: _fdtest
--DLL: api-ms-win-crt-stdio-l1-1-0.dll
----API: _set_fmode
----API: _get_stream_buffer_pointers
----API: fclose
----API: fflush
----API: fgetc
----API: fgetpos
----API: __stdio_common_vsprintf
--DLL: api-ms-win-crt-filesystem-l1-1-0.dll
----API: _lock_file
----API: _unlock_file
--DLL: api-ms-win-crt-string-l1-1-0.dll
----API: isalnum
----API: tolower
----API: strncpy
----API: strncmp
--DLL: api-ms-win-crt-time-l1-1-0.dll
----API: strftime
----API: _gmtime64_s
----API: _time64
--DLL: api-ms-win-crt-convert-l1-1-0.dll
----API: atoi
--DLL: api-ms-win-crt-locale-l1-1-0.dll
----API: _configthreadlocale
Module: ntdll.dll
[!]ERROR:PEs without import tables are unsupported
Module: KERNEL32.DLL
--DLL: api-ms-win-core-rtlsupport-l1-1-0.dll
----API: RtlVirtualUnwind
----API: RtlUnwindEx
----API: RtlRestoreContext
----API: RtlLookupFunctionEntry
----API: RtlInstallFunctionTableCallback
----API: RtlRaiseException
----API: RtlDeleteFunctionTable
--DLL: ntdll.dll
----API: RtlSizeHeap
----API: RtlLCIDToCultureName
----API: RtlUnicodeStringToInteger
----API: _wcslwr
----API: RtlGetUILanguageInfo
----API: EtwEventEnabled
----API: RtlpConvertLCIDsToCultureNames
--DLL: KERNELBASE.dll
----API: lstrlenA
----API: BaseFormatObjectAttributes
----API: GetVolumeNameForVolumeMountPointW
----API: AppContainerFreeMemory
----API: AppContainerLookupMoniker
----API: BasepNotifyTrackingService
----API: MoveFileWithProgressTransactedW
--DLL: api-ms-win-core-processthreads-l1-1-0.dll
----API: GetProcessTimes
----API: GetProcessId
----API: GetThreadId
----API: GetCurrentProcess
----API: GetCurrentProcessId
----API: GetThreadPriority
----API: GetThreadPriorityBoost
--DLL: api-ms-win-core-processthreads-l1-1-3.dll
----API: GetProcessInformation
----API: SetProcessInformation
----API: SetThreadIdealProcessor
----API: GetProcessShutdownParameters
--DLL: api-ms-win-core-processthreads-l1-1-2.dll
----API: GetThreadIOPendingFlag
----API: SetThreadInformation
----API: GetSystemTimes
----API: GetThreadInformation
----API: SetProcessPriorityBoost
----API: GetProcessPriorityBoost
--DLL: api-ms-win-core-processthreads-l1-1-1.dll
----API: GetProcessHandleCount
----API: SetProcessMitigationPolicy
----API: GetProcessMitigationPolicy
----API: SetThreadIdealProcessorEx
----API: GetThreadIdealProcessorEx
----API: GetThreadContext
----API: GetThreadTimes
--DLL: api-ms-win-core-registry-l1-1-0.dll
----API: RegLoadMUIStringW
----API: RegLoadMUIStringA
----API: RegNotifyChangeKeyValue
----API: RegLoadKeyA
----API: RegGetValueA
----API: RegFlushKey
----API: RegEnumValueW
--DLL: api-ms-win-core-heap-l1-1-0.dll
----API: HeapCreate
----API: HeapWalk
----API: HeapAlloc
----API: GetProcessHeap
----API: HeapFree
----API: HeapUnlock
----API: HeapSetInformation
--DLL: api-ms-win-core-heap-l2-1-0.dll
----API: LocalFree
--DLL: api-ms-win-core-memory-l1-1-1.dll
----API: QueryMemoryResourceNotification
----API: CreateMemoryResourceNotification
----API: GetLargePageMinimum
----API: GetProcessWorkingSetSizeEx
----API: GetSystemFileCacheSize
----API: SetProcessWorkingSetSizeEx
----API: SetSystemFileCacheSize
--DLL: api-ms-win-core-memory-l1-1-0.dll
----API: MapViewOfFileEx
----API: OpenFileMappingW
----API: MapViewOfFile
----API: CreateFileMappingW
----API: VirtualQueryEx
----API: VirtualQuery
----API: VirtualProtectEx
... AND SO ON ...
Compiler Optimization WTF moments
An optimizing compiler used to be my best friend... we've since parted ways:
- The compiler may inline a function you took a function pointer to, leaving your pointer pointing to the middle of another block of code. Likely this was because the function pointer was never called, but used to get an address to the assembly to modify. Mark the function __declspec(noinline).
- The compiler may completely remove a function you took a function pointer to if it's not called, leaving you with a dangling pointer to invalid memory. WTF Compiler!?! Marking it __declspec(noinline) and using lots of volatiles inside seems to fix it. Also adding printf or other calls to functions with side effects keeps this behavior at bay.
- The compiler may re-order statements to occur in a different order. Well known but this bit me a few times. Marking volatile fixes this... sometimes.
- The compiler may remove reads and writes to unused variables or parameters. Mark everything volatile.
- Release mode calls are sometimes indirected through a jmp table. Why?
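A test target that survives the optimizer might look roughly like the following. This is only a sketch of the workarounds listed above: the NOINLINE macro, hookMe, and callThroughPointer are illustrative names, and the non-MSVC branch assumes a GCC or Clang style attribute.

```cpp
#include <cassert>

#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))  // assumption: GCC/Clang
#endif

// Hypothetical hook target. The volatile local gives the function reads and
// writes that the optimizer is not allowed to drop.
NOINLINE int hookMe(int x) {
    volatile int result = x;
    result = result + 1;
    return result;
}

// Keeping the pointer in a volatile variable makes the call go through the
// pointer at runtime instead of being resolved at compile time.
inline int callThroughPointer(int x) {
    int (*volatile fn)(int) = &hookMe;
    assert(fn != nullptr);
    return fn(x);
}
```

None of this is guaranteed by the language to keep a function hookable; it just reflects what has tended to work in practice.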
Conclusion
Hooking is really hard, but fun.