Before we start
This article assumes the reader has some skills in Win32 API. In particular, such terms as virtual memory, threads, thread stack, should be clear prior to reading this article. To learn those subjects, I'd personally advise a book of J. Richter's.
Introduction
We'll discuss the Win32 thread stack allocation mechanism here. I believe this mechanism lacks some very basic functionality which may be useful sometimes. In some situations, a lot of system resources can be saved by a relatively easy way.
Thread stack
Every thread in Win32 has its own stack. It serves as a placeholder for automatic (local) variables, function parameters, function return address, and some other stuff. Whenever you call a function - more stack is consumed. And it is freed after the function is exited.
The stack allocation mechanism in Win32 is the following: each thread has its maximum stack size which must by specified upon creation. For the primary thread created by for the application by the system, it's written in the EXE by the linker; for other threads, the stack size is specified as an argument in the CreateThread
function. If you give 0 to CreateThread
, then it takes the same value used to create the primary application thread. The default value MS puts in the EXE is 1Mb. It can be overridden by the following:
#pragma comment (linker, "/STACK:0x500000")
Well, you'd probably ask why so much memory is reserved for the stack? The answer is that the memory is reserved, not necessarily allocated. Upon creation of the stack, the system reserves the specified size of the virtual memory in the process' address space, with only a few pages allocated. When the thread consumes all the allocated pages, new pages area allocated on-the-fly by the system. Up until the whole reserved range is allocated. If the thread attempts to consume even more memory - an EXCEPTION_STACK_OVERFLOW
exception is raised.
The next question that would probably arise is why limit the stack at all? The answer is that the stack of the thread must be a contiguous memory block (in the virtual address space). When you start a thread, the system finds a free contiguous memory range in your address space with the adequate size. This range is marked as reserved (so that its parts won't be used for anything else). So that if you reserve too much space for every thread - you can run out of virtual address space, even though the system still has more memory. For 64-bit applications, you can freely give every thread 10Gigs, there shouldn't be a problem with that (well, I haven't checked that), but in Win32, with its 4Gb address space, with 2Gb accessible in the user mode, you should be careful about how much the stack of the thread is allowed to grow.
We said the system allocates the memory pages for the stack on-the-fly. How is this implemented? When the thread is created, the range for its stack is reserved and a few pages are initially allocated. The last allocated page has a PAGE_GUARD
attribute. When it is eventually accessed, the system raises a STAUS_GUARD_PAGE
exception, which is handled by the system. If there's no room for the stack to grow then an EXCEPTION_STACK_OVERFLOW
is raised. Otherwise, the next consecutive page is allocated with the PAGE_GUARD
attribute. In such a way, the stack for the thread becomes allocated.
Note: This means that the thread should access its stack page-by-page, without any trespassing. Otherwise, there's a chance to miss the guard page and fall on the unallocated one, which will lead to GPF (EXCEPTION_ACCESS_VIOLATION
exception). For example, you could declare a large array on the stack, and access it from the end. In order to avoid such situations, when you declare large variables on the stack, the compiler automatically adds the so-called stack probes (which is an explicit access of the stack in the needed order).
What is missing
Up until now, we talked about how the stack grows. But what if we'd like to shrink the stack? Consider a thread calls a very stack-consuming function (some deep recursion or etc.). During the call, memory pages are allocated (unless the stack was already large enough). Than, after the function exits, the stack remains allocated, even though the thread may never need so much stack ever again. There is no mechanism to automatically free unneeded stack pages. Once the page is allocated - there is no way back! What can be done in such a case?
See the following example:
void Consume900K()
{
volatile BYTE pSomeBuf[900*1024];
pSomeBuf[sizeof(pSomeBuf) - 1] = 30;
}
int WINAPI WinMain(HINSTANCE, HINSTANCE, LPSTR, int)
{
MessageBox(NULL, "Default stack", "", 0);
Consume900K();
MessageBox(NULL,
"900K consumed for the 1st time", "", 0);
Consume900K();
MessageBox(NULL,
"900K consumed for the 2nd time", "", 0);
return 0;
}
Try running this program and check by the task manager how much committed resources (either physical memory or the page file) this program consumes. You'll see that on the second message box, this amount grows dramatically, by about 900K. On the third message box, however, it is not significantly changed.
I remember J. Richter wrote in one of his books that such a situation is unwanted, and an application should create another thread to perform the stack-consuming operation. After that thread exits, the whole stack will be freed by the OS. Well, needless to say, this method is not ideal. Creating another thread apart from the obvious overhead (context switches, and etc.) sometimes can create too much mess, since every thread has its own TLS, APC, messaging subsystem, and etc. Besides that, sometimes you don't know from the beginning that a particular operation could be significantly stack-consuming.
Explicit stack release
Once we know the mechanism used to allocate the stack, we can make explicit changes ourselves, can't we?
Let's start. The memory page on x86 32 bit processors is 4K, on Alpha processors, it is 8K. Fortunately, there's a programmatic way to determine the page size. Let's define a global variable (although I hate using global variables) g_dwProcessorPageSize
, and initialize it by the following code:
SYSTEM_INFO stSysInfo;
GetSystemInfo(&stSysInfo);
ASSERT(stSysInfo.dwPageSize);
g_dwProcessorPageSize = stSysInfo.dwPageSize;
Next, before we make changes to the stack, let us have some diagnostic function to view the stack:
void DbgDumpStack()
{
#ifdef _DEBUG
OutputDebugString(_T("### Stack Dump Start\n"));
PBYTE pPtr;
_asm mov pPtr, esp;
MEMORY_BASIC_INFORMATION stMemBasicInfo;
VERIFY(VirtualQuery(pPtr, &stMemBasicInfo, sizeof(stMemBasicInfo)));
PBYTE pPos = (PBYTE) stMemBasicInfo.AllocationBase;
do
{
VERIFY(VirtualQuery(pPos, &stMemBasicInfo, sizeof(stMemBasicInfo)));
VERIFY(stMemBasicInfo.RegionSize);
TCHAR szTxt[0x100];
wsprintf(szTxt, _T("### Range: 0x%08X - 0x%08X,
Protect=%X, State=%X, Pages=%u\n"),
pPos, pPos + stMemBasicInfo.RegionSize,
stMemBasicInfo.Protect, stMemBasicInfo.State,
stMemBasicInfo.RegionSize / g_dwProcessorPageSize);
OutputDebugString(szTxt);
pPos += stMemBasicInfo.RegionSize;
} while (pPos < pPtr);
OutputDebugString(_T("### Stack Dump Finish\n"));
#endif
}
Note: The stack is used in the reverse order. The push
command decreases the stack pointer, that's how the processor works. Hence the memory is allocated for the stack in the reverse order too. So, this function will first display free (unallocated) range of pages, and then would come a guard page, and then you'll see the committed pages.
And now, finally, let's try to make some changes to the stack. Make a function StackShrink
which would de-commit as much stack as it is currently possible.
void StackShrink()
{
PBYTE pPtr;
_asm mov pPtr, esp;
PBYTE pAllocate = pPtr - (((DWORD) pPtr) %
g_dwProcessorPageSize) - g_dwProcessorPageSize;
PBYTE pGuard = pAllocate - g_dwProcessorPageSize;
PBYTE pFree = pGuard - g_dwProcessorPageSize;
MEMORY_BASIC_INFORMATION stMemBasicInfo;
VERIFY(VirtualQuery(pPtr, &stMemBasicInfo,
sizeof(stMemBasicInfo)));
VERIFY(VirtualQuery(stMemBasicInfo.AllocationBase,
&stMemBasicInfo, sizeof(stMemBasicInfo)));
ASSERT(stMemBasicInfo.State == MEM_RESERVE);
PBYTE pFirstAllocated =
((PBYTE) stMemBasicInfo.BaseAddress) +
stMemBasicInfo.RegionSize;
if (pFirstAllocated <= pFree)
{
volatile BYTE nVal = *pAllocate;
VERIFY(VirtualFree(pFirstAllocated,
pGuard - pFirstAllocated, MEM_DECOMMIT));
VERIFY(VirtualAlloc(pGuard, g_dwProcessorPageSize,
MEM_COMMIT, PAGE_READWRITE | PAGE_GUARD));
DbgDumpStack();
}
}
And that seems to be all.
To test this technique, let's also make a function that probes the stack to consume the specified stack size.
void StackConsume(DWORD dwSizeExtra)
{
PBYTE pPtr;
_asm mov pPtr, esp;
for ( ; dwSizeExtra >= g_dwProcessorPageSize;
dwSizeExtra -= g_dwProcessorPageSize)
{
pPtr -= g_dwProcessorPageSize;
volatile BYTE nVal = *pPtr;
}
DbgDumpStack();
}
And, finally, make a test application:
int WINAPI WinMain(HINSTANCE, HINSTANCE, LPSTR, int)
{
SYSTEM_INFO stSysInfo;
GetSystemInfo(&stSysInfo);
ASSERT(stSysInfo.dwPageSize);
g_dwProcessorPageSize = stSysInfo.dwPageSize;
DbgDumpStack();
MessageBox(NULL, "Default stack", "", 0);
StackConsume(100*1024);
MessageBox(NULL, "100K consumed", "", 0);
StackShrink();
MessageBox(NULL, "Stack compacted", "", 0);
StackConsume(900*1024);
MessageBox(NULL, "900K consumed", "", 0);
StackShrink();
MessageBox(NULL, "Stack compacted", "", 0);
return 0;
}
You can run this application and monitor it with the task manager. You can see that it frees the stack after it is used.
Where to use
You can place an explicit call to StackShrink
after every heavy (stack-consuming) operation. Also, it should be called before a call to any Win32 wait
function, because while your thread is asleep, there is no need to consume extra memory pages. If you're implementing a GUI application with the message loop, then it'd be a good idea to call StackShrink
every time the message queue is empty (before GetMessage
). For MFC applications, there is an OnIdle
callback. If you're writing your own message loop, then instead of writing it like this:
MSG msg;
while (GetMessage(&msg, NULL, 0, 0) > 0)
DispatchMessage(&msg);
Redesign to do this:
while (true)
{
MSG msg;
while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
DispatchMessage(&msg);
StackShrink();
if (GetMessage(&msg, NULL, 0, 0) <= 0)
break;
DispatchMessage(&msg);
}
Precautions
I've tested this method under Windows XP, and it seems to work OK. I believe there shouldn't be any problem in any Win32-based OS, since their stack allocation mechanism is the same. The only possible problem I suspect is that the OS theoretically may hold, in addition, the information about the thread stack in some other place, so that making explicit modifications makes the stack inconsistent. For example, each thread has a TIB (thread information block) associated with it, which has members such as StackBase
, StackLimit
, and etc. But up until now, I haven't seen such a problem, so that it seems not to be such a place.
Another note: I've seen pieces of code where functions return addresses to local buffers. Example:
PBYTE StupidFunc()
{
BYTE pVal[0x1000];
return pVal;
}
Of course it is immoral to do so, it contradicts to the programmers' ethics. But the fact is that, it works! After a call to such a function, you can access the buffer it returns, until you call another function that overwrites that piece of the stack.
And now, if you place StackShrink
after the call to such a function, you'll not only damage a couple of the first bytes in the returned buffer, but more likely, you'll make a part of this buffer inaccessible. So, such a dirty technique must not be mixed with explicit stack shrinking.