Introduction
I've just completed a long project making an embedded system work on Windows 98. Our criterion for release was that the system should be able to run for weeks (if not months) under continuous usage. My job was to make that the case on Windows 98.
Before you roll your eyes and think that's not possible, let me say that much of the perceived instability of Windows 9x can be blamed on the hardware (or more precisely, bad software drivers for the hardware) or the poor quality of application software used on Windows 9x. Very few of the stability problems I've encountered can be squarely blamed on Windows 9x. In our case, we controlled the hardware, because what our users are buying isn't a PC; it's a piece of test equipment.
You're probably wondering why we didn't base our product on Windows NT or Windows 2000 if we had such strong stability concerns. The reason is pretty simple: our product had to look like a piece of test equipment. This means that we needed to control the power-up sequence so that the equipment booted to a running instrument rather than a log-on screen. There were several other issues related to drivers and software cost, but the main issue was that we had more control over Windows 9x than NT. We also believed from the beginning that Windows 98 would be stable enough.
How'd It Get So Dark In Here?
As you'd probably guess, my fellow developers were first to find a large assortment of heap and resource leaks in my code. The symptom of the leak was that the equipment would die after just a few hours of running. The vexing part was that it would die even if the equipment was sitting in a steady state. I ran BoundsChecker and it gave a clean bill of health, even though the application was clearly chomping large amounts of the heap (at a rate of about 8 MB an hour).
BoundsChecker is a very good tool. It's probably indispensable for any professional developer. It finds a large number of memory leaks and resource leaks. It also misses a large number of leaks. Personally, I always run my version of the code with BoundsChecker enabled. Until this project, I considered a clean bill of health from BoundsChecker to be sufficient to say that all classes of leaks had been handled. This assumption was quickly proved wrong.
At this point, I realized I was working in the dark.
Illuminating Bugs
After spending a week (including evenings and weekends) inspecting code with little success, I was desperate enough to sit down and write a tool. Since the application I was working on was actually an ATL out-of-process server and some client applications, I wrote a tool to log heap usage for each process that the tool encounters when it is invoked.
It turns out that NT/2000 and Win9x use different APIs to access process and heap information (that shouldn't be much of a surprise to anyone who has worked in both environments). I've included the project ProcMan, which collects information about all of the running processes and saves it as a .csv file on the desktop. ProcMan consumes a fair amount of the CPU, so it only snaps the state every few seconds. The result, when plotted in Excel, can be very illuminating.
After running ProcMan and logging the data overnight, it was clear that both my server and UI client application were leaking at a pretty good clip.
Clearing the Fog
Now I had an idea of where to look. I needed a way to test what in the program was leaking. So I stole some code out of ProcMan and modified it. The outcome was a function called GetHeapSize() (see the source below). This function returns the total size, in bytes, of the blocks currently allocated from the process's heaps (every block not marked free). I used it by calling it at the beginning of a function that I was testing and again just before the function exited. The difference would tell me if any items had been added to the heap. After a couple of days, a pattern emerged.
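#include <windows.h>   // must be included before tlhelp32.h
#include <tlhelp32.h>
#include <string.h>    // memset

// Returns the total size, in bytes, of all in-use (non-free) blocks
// across every heap of the current process.
long GetHeapSize()
{
    DWORD bytes = 0;
    // A process ID of 0 snapshots the heaps of the current process.
    HANDLE snapshot = CreateToolhelp32Snapshot(TH32CS_SNAPHEAPLIST, 0);
    if (snapshot != INVALID_HANDLE_VALUE) {
        HEAPLIST32 hl = {sizeof(hl)};
        HEAPENTRY32 he;
        // Walk every heap in the snapshot...
        for (BOOL fOKHL = Heap32ListFirst(snapshot, &hl); fOKHL;
             fOKHL = Heap32ListNext(snapshot, &hl))
        {
            memset(&he, 0, sizeof(he));
            he.dwSize = sizeof(he);
            // ...and every block in that heap.
            BOOL fOKHE = Heap32First(&he, hl.th32ProcessID, hl.th32HeapID);
            for (; fOKHE; fOKHE = Heap32Next(&he))
            {
                // Count only blocks that are not marked free.
                if ((he.dwFlags & LF32_FREE) == 0)
                {
                    bytes += he.dwBlockSize;
                }
            }
        }
        CloseHandle(snapshot);
    }
    return (long) bytes;
}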
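The before-and-after measurement described above can be wrapped in a small scope guard so that every exit path of the function under test is covered. This is only a sketch: HeapDeltaProbe, HeapDelta, and the injected measure callback are hypothetical names that do not come from the original project. On Windows the callback would simply be GetHeapSize; it is passed in here so the sketch also builds without the Windows headers.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of how much the measurement changed across one scope.
struct HeapDelta {
    std::string label;
    long bytes;
};

// Hypothetical scope guard: measures on entry, measures again on exit,
// and appends the difference to a caller-supplied report.
class HeapDeltaProbe {
public:
    HeapDeltaProbe(std::string label,
                   std::function<long()> measure,   // e.g. GetHeapSize
                   std::vector<HeapDelta>& report)
        : label_(std::move(label)),
          measure_(std::move(measure)),
          report_(report),
          start_(0) {
        assert(measure_ && "a measurement function is required");
        start_ = measure_();
    }

    ~HeapDeltaProbe() {
        // Take the reading first, so the bookkeeping below is not counted.
        const long delta = measure_() - start_;
        report_.push_back({label_, delta});
    }

    HeapDeltaProbe(const HeapDeltaProbe&) = delete;
    HeapDeltaProbe& operator=(const HeapDeltaProbe&) = delete;

private:
    std::string label_;
    std::function<long()> measure_;
    std::vector<HeapDelta>& report_;
    long start_;
};
```

Declaring a probe as the first statement of a suspect function records one delta per call. Note that the report itself lives on the heap, so with a real heap measurement it is worth reserving its capacity up front; otherwise its growth would show up in later readings.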
I Really Hate BSTRs
After finding several of the problems this way, it turned out that most of the problem was in handling BSTRs. After reading a few Microsoft MSDN articles, some cussing, and trial and error, I determined how to solve these problems. The funny thing is I had inspected all the code that had the problem without noticing the leak. This goes to show that there is no substitute for tools in many cases.
Resource Leaks
The next set of problems we had were resource leaks. Most of these problems showed up in stress testing. Stress testing means using an automated tool to randomly exercise all the functionality in your application. We used a commercial tool to randomly bring up dialogs, push buttons, exercise menu selections and so forth. It didn't take long to notice that resources were slowly disappearing. The GDI resources were the biggest loss.
The GDI resource losses turned out to be the classic problem. We were selecting objects with SelectObject() and then not putting back the original object when we were done. We inspected our code and got most of those. However, the system resource leak was different.
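The select-then-restore discipline is easy to get wrong when a drawing routine has several exit paths, so one common remedy is a guard object that restores the previous selection in its destructor. The sketch below is an assumption about how that could look, not the code we shipped. FakeDC and its integer "handles" are hypothetical stand-ins so the example runs anywhere; in real GDI code the guard would hold an HDC plus the HGDIOBJ returned by SelectObject().

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a device context. Like SelectObject(), select()
// installs a new object and returns the one that was selected before.
struct FakeDC {
    int selected = 0;   // 0 plays the role of the default object
    int select(int object) {
        return std::exchange(selected, object);
    }
};

// Selects `object` on construction and puts the previous object back on
// destruction, so early returns and exceptions cannot skip the restore.
class SelectionGuard {
public:
    SelectionGuard(FakeDC& dc, int object)
        : dc_(dc), previous_(dc.select(object)) {}
    ~SelectionGuard() { dc_.select(previous_); }

    SelectionGuard(const SelectionGuard&) = delete;
    SelectionGuard& operator=(const SelectionGuard&) = delete;

private:
    FakeDC& dc_;
    int previous_;
};

// Hypothetical drawing routine with an early exit.
bool drawWithPen(FakeDC& dc, int pen, bool abortEarly) {
    assert(pen != 0);   // 0 is reserved for the default object in this sketch
    SelectionGuard guard(dc, pen);
    if (abortEarly) {
        return false;   // the guard still restores the original object
    }
    // ... drawing calls would go here ...
    return true;
}
```

Restoring the original object matters because a pen or brush should not be deleted while it is still selected into a device context; once the old object is back, the temporary one can be released safely.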
The system resource leak only occurred when the system was stressed to the maximum. If we had it saturated with events from our hardware, the system resources would leak until they were gone and the system would lock up. It turned out that many of these hardware events turned into Windows messages. The developer who wrote the code to post and handle these messages always posted a message, whether it was needed or not, and filtered the messages when handling them. When the system was stressed, the messages weren't getting handled fast enough and the message queue would fill. When it was full, the system resources would be at 0 and the system would lock up. The solution was to filter the messages before they were posted. This solved that goofy problem.
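To make the difference concrete, here is a minimal sketch of filtering on the posting side. Everything in it is hypothetical: HardwareEvent, FilteringPoster, and the bounded deque that stands in for the window message queue are illustrations, not the project's code. The point is only that an unwanted event is rejected before it occupies a queue slot.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical event raised by the hardware driver.
struct HardwareEvent {
    int id;
    int value;
};

// Hypothetical poster: the predicate runs BEFORE the event is queued, so
// events nobody wants never consume space in the (bounded) queue.
class FilteringPoster {
public:
    FilteringPoster(std::function<bool(const HardwareEvent&)> isNeeded,
                    std::size_t capacity)
        : isNeeded_(std::move(isNeeded)), capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // Returns true if the event was queued, false if it was filtered out
    // or the queue is already full.
    bool post(const HardwareEvent& event) {
        if (!isNeeded_(event)) {
            return false;               // filtered at the source
        }
        if (queue_.size() >= capacity_) {
            return false;               // queue full: caller must cope
        }
        queue_.push_back(event);
        return true;
    }

    std::size_t pending() const { return queue_.size(); }

private:
    std::function<bool(const HardwareEvent&)> isNeeded_;
    std::size_t capacity_;
    std::deque<HardwareEvent> queue_;
};
```

With the filter on the handling side instead, every event would take a slot until the handler got around to discarding it, which is the situation that let the queue fill under load.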
Connection Points and Asynchronous Event Handling
One of the last really nasty problems we tackled had several seemingly unrelated symptoms. When our software was running and certain features were enabled, the Windows clock would stop. We also found that when all the features were enabled, we couldn't keep up with the event traffic. The result was that the software would stop working and the UI wouldn't respond. When debugging the code, it was clear that the programs weren't dead. They were just consumed with handling update events from the server.
The problem turned out to be how connection points were serviced in the COM server. When we started the project, the out-of-process COM server was created with the Visual C++ 5.0 project wizard. The COM object was created using the default settings in the ATL Project Wizard. This meant that the object was single threaded. That didn't seem like a problem initially.
When we integrated the hardware event handling with the COM server, we had a thread that monitored events from our driver, posted each event as a Windows message, and then sent the event out to the clients through the connection point. This sequence was a problem. It was far easier to saturate the update mechanism with events than I had anticipated. The solution was to queue the events, service them in OnIdle(), and combine any duplicate events. This solved many goofy problems we were having, including having the Windows clock lose time and even stop for long periods of time.
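The queue-and-combine idea can be sketched as a small coalescing queue. This is an illustration under assumptions, not the server's actual code: UpdateEvent, CoalescingQueue, and the rule that two events with the same id are duplicates (with the newest value winning) are all hypothetical. The monitoring thread would call push(), and the idle handler would call drain() with a small batch size so the UI stays responsive.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical update destined for connection-point clients.
struct UpdateEvent {
    int id;
    int value;
};

class CoalescingQueue {
public:
    // Called from the hardware-monitoring thread.
    void push(const UpdateEvent& event) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = latest_.find(event.id);
        if (it != latest_.end()) {
            it->second = event.value;   // duplicate: keep only the newest
            return;
        }
        latest_.emplace(event.id, event.value);
        order_.push_back(event.id);     // remember first-arrival order
    }

    // Called from the idle handler: removes and returns at most maxEvents.
    std::vector<UpdateEvent> drain(std::size_t maxEvents) {
        assert(maxEvents > 0);
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<UpdateEvent> batch;
        while (!order_.empty() && batch.size() < maxEvents) {
            const int id = order_.front();
            order_.pop_front();
            batch.push_back({id, latest_.at(id)});
            latest_.erase(id);
        }
        return batch;
    }

    std::size_t pending() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return order_.size();
    }

private:
    mutable std::mutex mutex_;
    std::deque<int> order_;                 // ids waiting to be sent
    std::unordered_map<int, int> latest_;   // id -> most recent value
};
```

However fast the hardware produces events, the pending work is bounded by the number of distinct ids, which is why combining duplicates takes the pressure off the update path.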
In retrospect, it's pretty clear that we should have spent more time on the details of the COM server and servicing updates to clients.
Beware of Testing Tools
The last set of problems was particularly vexing. Our UI stress tests would run for hours and then die. We thought we had the heap leaks whipped, so we didn't bother to run ProcMan on all the tests that were run. The odd thing was that we would only see this during UI stress testing. No one had ever seen it in actual use.
We thought we'd missed something small, and it just hadn't shown up in general use... yet. So we ran that same test with ProcMan running. The outcome was odd. There clearly was a leak. However, it was registered against kernel32.dll, rather than against any of the running applications. After much code inspection and cussing, it turned out that the problem was how the testing tool logged its actions. Rather than saving them to a file, the tool kept the logged actions in memory. After 20+ hours of UI stress testing, enough memory had been consumed to keep anything else from running correctly. When we turned off this type of logging, the problem went away.
Another issue was that the testing tool required us to add an OCX to every dialog we created. Though it was an annoyance, we added the OCX to every dialog. During UI stress testing, we would occasionally have oleaut32.dll crashes. After some research, it turned out that the testing tool was making calls into the OCX that would, very infrequently, cause this crash. The tool manufacturer is interested in solving this problem, but in the meantime we are working around it.
Last Thoughts
Since I work in a company that has traditionally developed equipment using an embedded processor and a real-time Unix-like OS, there was much concern about whether a Win 9x based system would be reliable. By the end of the project, we had proven that the system was as reliable, out of the box, as any piece of equipment we'd developed before. Of course, users can shoot themselves in the foot, but the benefit of an open system has outweighed any problems that it has caused.
The end result was a piece of equipment that can be expected to work 24/7 and is powered by Intel and Microsoft (not to mention the custom stuff we've added). I know many folks (especially the Linux folks) would consider Intel+Win98 = reliable to be unattainable. But the truth is it can be.
License
This article has no explicit license attached to it, but may contain usage terms in the article text or the download files themselves. If in doubt, please contact the author via the discussion board below.