Htmlhelp Forensics

12 Aug 2012
Cracking the htmlhelp .chm storage format to work around an annoying file-lock bug, and for the sheer fun of it!

Sample Image - Htmlhelp lockfile tester

Introduction

What started out as a simple bug hunt in the htmlhelp API ended up, many days later, as an improved understanding of the inner workings of the htmlhelp API and the .chm file format. The initial problem was an annoying feature that keeps htmlhelp based files locked even after they are closed, preventing updates of the file until you manually close all processes touching it. The problem only turns up if you use context IDs and do not close your application immediately after the htmlhelp window is closed. Strangely enough, topic based lookups do not provoke the same lock.

So what if you simply switched all context based lookups to topic based ones? Changing every call by hand is out of the question given the sheer number of code lines involved, so the translation has to happen at runtime - But how do you access the context/topic mapping at runtime? The solution turned out rather nasty: Crack the chm file format, decode the internal streams yourself, build a map between contexts and topics and use this map to translate all context based lookups to topic based ones.

All of the samples are shown in C++ since the hardcore ITStorage cracking would be difficult, if not impossible, to implement in a language without native support for COM and the Windows APIs - Feel free to prove me wrong!

But first a very brief introduction to compressed html based help files.

Getting started

The basic idea behind htmlhelp is to move multiple html and graphics files into one single compressed file with the extension chm (or its). When the user views the file, the contents are displayed by decompressing the internal files just-in-time when needed. The viewer is the already installed Internet Explorer, cunningly invoked by either hh.exe or the more obscure control shdocvw.ocx. Files with the CHM extension are launched with the hh.exe viewer, whereas files with the ITS extension are launched directly in Internet Explorer. The two extensions share the same format, controlled by the COM interface ITStorage, which will be described later.

You can create your chm-files in the free HTML Help Workshop supplied by Microsoft. The program is barely adequate and needs a separate HTML editor combined with a good understanding of the htmlhelp format. Make sure you search the links below for alternatives if you are in any way serious about help creation. For testing purposes I've created a small help project with this tool to have something well known to reverse engineer. Reverse engineering your own simple file is considerably easier than reverse engineering a large file with unknown and unmapped contents.

Create your own chm-file with two topics and context aliases for both - or simply steal the one found with the source code for this article.

Invoking the help - HtmlHelp API

To actually link your program to the chm-file, you'll need to use the HtmlHelp API documented in the MSDN library. The API is really just a single function HtmlHelp with the following signature:

// remember to #include <htmlhelp.h>
HWND WINAPI HtmlHelp(
  HWND hwndCaller,
  LPCSTR pszFile,
  UINT uCommand,
  DWORD_PTR dwData
);

The command given in the third argument controls the behavior of the API call. Something like 19 different commands are available; only a few of the most important ones are used in the following example:

HWND res;
LPCSTR pszFile = "sample.chm";
// show specific topic (pointer arguments need the pointer-sized DWORD_PTR cast)
res = HtmlHelp(NULL, pszFile, HH_DISPLAY_TOPIC, (DWORD_PTR) "first.htm");
// show specific context
res = HtmlHelp(NULL, pszFile, HH_HELP_CONTEXT, (DWORD_PTR) 1001);
// try to show invalid context
res = HtmlHelp(NULL, pszFile, HH_HELP_CONTEXT, (DWORD_PTR) 1002);
if (res == NULL)
{
  // grab that error
  HH_LAST_ERROR hherr = {0};
  hherr.cbStruct = sizeof(HH_LAST_ERROR);
  HtmlHelp(NULL, pszFile, HH_GET_LAST_ERROR, (DWORD_PTR) &hherr);
  if (FAILED(hherr.hr) && hherr.description != NULL)
  {
    // description is a BSTR (wide string), hence %S
    printf("ERROR %08X\n%S\n", hherr.hr, hherr.description);
    // free the string after use
    ::SysFreeString(hherr.description);
  }
}
// close the htmlhelp window
res = HtmlHelp(NULL, NULL, HH_CLOSE_ALL, 0);

The above example uses a typedef HH_LAST_ERROR not found in my version of the HtmlHelp API, but it can easily be copied from the MSDN library:

typedef struct tagHH_LAST_ERROR { 
  int cbStruct; 
  HRESULT hr;
  BSTR description;
} HH_LAST_ERROR;

Talking loud, saying nothing!

All interaction in the previous section is one-way: messages are only sent from the application to the help system. But the API supports two flavors of "talkback" from the chm-file to your own application - Training Cards and Notification Messages.

Training Cards are simply WM_TCARD messages sent on user interaction to your main message loop, with a 32-bit unsigned integer in wParam and an optional LPCSTR in lParam. They are easily inserted in your html files using the HTML Help Workshop wizard for ActiveX Control Commands.

<OBJECT type="application/x-oleobject" 
  classid="clsid:adb880a6-d8ff-11cf-9377-00aa003b7a11" 
  codebase="hhctrl.ocx#Version=4,74,8793,0" width=100 height=100
>
  <PARAM name="Command" value="TCard">
  <PARAM name="Button" value="Text:Train">
  <PARAM name="Item1" value="9999, SendText">
</OBJECT>

Watch out for the optional text parameter, which my version of this application erroneously inserts as <PARAM Name="Item2"...>, whereas the documentation clearly states that it should be inserted as shown above. When a user clicks the resulting button in the helpfile, you will be able to catch this event in your message loop and use it for your own devious purposes:

BOOL CALLBACK DialogProc(HWND HwndDlg, UINT uMsg, WPARAM wParam, LPARAM lParam)
:
  if (uMsg==WM_TCARD)
  {
    int idAction = (int) wParam;
    LPCSTR pText = (LPCSTR) lParam;
    printf("TRAIN: %d - %s", idAction, pText);
  }
:

Quite simple to use and easy to integrate if you have a need for sending predefined messages from the helpfile to your own application upon user action.
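
Once more than a couple of training cards are in play, the WM_TCARD branch tends to grow into a long if/else chain. The following is a minimal, portable sketch of one way to route the action id to a handler instead. The class name TrainingCardDispatcher and its methods are hypothetical helpers made up for this illustration and are not part of the HtmlHelp API.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: routes training card action ids to handlers.
class TrainingCardDispatcher
{
public:
    using Handler = std::function<void(const std::string& text)>;

    // Register the handler for one action id (the number given in Item1).
    void on(std::uint32_t idAction, Handler handler)
    {
        handlers_[idAction] = std::move(handler);
    }

    // Call from the WM_TCARD branch with wParam and lParam.
    // Returns false when no handler is registered for the id.
    bool dispatch(std::uint32_t idAction, const char* text) const
    {
        auto it = handlers_.find(idAction);
        if (it == handlers_.end())
            return false;
        // the text is optional, so treat a null pointer as an empty string
        it->second(text != nullptr ? text : "");
        return true;
    }

private:
    std::map<std::uint32_t, Handler> handlers_;
};
```

In the dialog procedure the WM_TCARD branch would then shrink to a single dispatch((std::uint32_t) wParam, (LPCSTR) lParam) call, with the handlers registered once at startup.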

Should you instead be in need of a tracking mechanism, carefully tracking the user interaction with your helpfile, then start looking at Notification Messages. They serve the more complicated need of tracking each user action in the helpfile and are obviously a bit harder to set up and use. First of all you'll have to change the default window type settings in order to enable message tracking. This is not quite as easy as it sounds, given the state of the documentation - But this is how I do it:

#define ID_NOTIFICATION 4242
HH_WINTYPE wt = {0}, *pwt = NULL;
LPCSTR pszFile = "test.chm";
LPCSTR pszWin = "test.chm>main";
// get the wintype
HtmlHelp(HwndDlg, pszWin, HH_GET_WIN_TYPE, (DWORD_PTR) &pwt);
if (pwt != NULL)
{
  // copy contents of returned wintype
  wt = *pwt;
  // set notification id
  wt.idNotify = ID_NOTIFICATION;
  // remember to force the correct size
  wt.cbStruct = sizeof(HH_WINTYPE);
  // now send the new window type
  HtmlHelp(HwndDlg, pszFile, HH_SET_WIN_TYPE, (DWORD_PTR) &wt);
}

Obviously the notifications can be turned back off by repeating the process and setting idNotify to zero.

Having enabled notifications from your chm-file, the messages will start popping up in your message loop as WM_NOTIFY messages, correctly identifying themselves with the given idNotify constant in the wParam parameter. lParam will point at an HHN_NOTIFY structure and the notification type can be read from its hdr.code member. Should the code be HHN_TRACK, the pointer can be cast to a pointer to an HHNTRACK struct and the specific action can be extracted from the idAction member. Put into readable, verbose code, it looks something like this:

// Make sure it's the correct notification id
if (uMsg == WM_NOTIFY && wParam == ID_NOTIFICATION)
{
  // get notify info
  HHN_NOTIFY* pno = (HHN_NOTIFY*) lParam; 
  // check notification type
  if (pno->hdr.code == HHN_WINDOW_CREATE)
    printf("HHN_WINDOW_CREATE");
  else if (pno->hdr.code == HHN_NAVCOMPLETE)
    printf("HHN_NAVCOMPLETE");
  else if (pno->hdr.code == HHN_TRACK)
  {
    printf("HHN_TRACK");
    // stringized versions of the HHACT_ enum
    LPCSTR pActions[] = { "HHACT_TAB_CONTENTS", "HHACT_TAB_INDEX", ---, "HHACT_NOTES" };
    // cast pointer to a HHNTRACK struct
    HHNTRACK* ptr = (HHNTRACK*) lParam;
    // get the action as a string if possible
    if (ptr->idAction >= HHACT_LAST_ENUM)
      printf(", Custom=%08X", ptr->idAction);
    else
      printf(", Action=%s", pActions[ptr->idAction]);
  }
  else
    printf("Unknown code");
  printf(", url=%s\n", pno->pszUrl);
}

And you will now be able to track each user-click within your chm-file. The possibilities might be endless, however I can only think of a single possible use of that information - Create your own chm-file statistics, cunningly collecting the number of clicks on each topic and maybe the most used path leading up to each topic...
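
The statistics idea can be sketched in a few lines of portable C++. This is only an illustration under my own assumptions: the class name TopicStats and its methods are hypothetical, and the collector simply counts how often each URL is reported, for instance by feeding it pno->pszUrl on every HHN_NAVCOMPLETE notification.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical collector: counts how often each topic URL was visited.
class TopicStats
{
public:
    // Call once per navigation notification with the reported URL.
    void record(const char* url)
    {
        if (url != nullptr && *url != '\0')
            ++hits_[url];
    }

    // Number of recorded visits for one URL (0 if never seen).
    std::size_t hits(const std::string& url) const
    {
        auto it = hits_.find(url);
        return it == hits_.end() ? 0 : it->second;
    }

    // URL with the most visits, or an empty string if nothing was recorded.
    std::string mostVisited() const
    {
        std::string best;
        std::size_t bestHits = 0;
        for (const auto& entry : hits_)
        {
            if (entry.second > bestHits)
            {
                best = entry.first;
                bestHits = entry.second;
            }
        }
        return best;
    }

private:
    std::map<std::string, std::size_t> hits_;
};
```

Tracking the most used path leading up to each topic would need the previous URL as well, which this sketch deliberately leaves out.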

I aint seen no bug

Enough interaction and demonstration - Let us see the actual bug that triggered this article using a few lines of code, where you try to open the chm-file in append mode immediately after issuing the HH_CLOSE_ALL command:

LPCSTR pszFile = "test\\sample.chm";
// show specific existing context
res = HtmlHelp(NULL, pszFile, HH_HELP_CONTEXT, (DWORD) 1001);
// close the htmlhelp window
res = HtmlHelp(NULL, NULL, HH_CLOSE_ALL, NULL);
// open file in append mode
FILE* fp = fopen(pszFile, "a");
if (fp!=NULL)
  fclose(fp);
else
  printf("Failure, err=%d\n", GetLastError());

The failed fopen will make GetLastError() return ERROR_SHARING_VIOLATION which in plain text tells us "The process cannot access the file because it is being used by another process". Who on earth is touching the chm-file and why? Now try to repeat the test, but replace the command HH_HELP_CONTEXT with HH_DISPLAY_TOPIC and the corresponding topic:

LPCSTR pszFile = "test\\sample.chm";
// show specific existing topic
res = HtmlHelp(NULL, pszFile, HH_DISPLAY_TOPIC, (DWORD_PTR) "first.htm");
// close the htmlhelp window
res = HtmlHelp(NULL, NULL, HH_CLOSE_ALL, NULL);
// test the file state
FILE* fp = fopen(pszFile, "a");
if (fp!=NULL)
  fclose(fp);
else
  printf("Failure, err=%d\n", GetLastError());

And the lock on the chm-file has mysteriously disappeared just after issuing HH_CLOSE_ALL.

Using a process explorer you can actually see that the Internet Explorer dlls are kept in memory, even after HH_CLOSE_ALL closes the help window - But only if you use the HH_HELP_CONTEXT command! Stick with HH_DISPLAY_TOPIC and the dlls will be unloaded as expected and the chm-file lock disappears. My best guess is that after reading the context mapping information "somebody" forgets to close an internal stream in the chm-file...

Anyway - A pretty annoying bug - But what could we possibly do about it?

Needles & pins - Storage & streams

To do dirty deeds on chm-files you need access to the "Microsoft® InfoTech Storage System Library", which exposes the IITStorage interface and is implemented in itss.dll, placed somewhere beneath your Windows directory. The interface definition is not really public, but I extracted the following, with a little help from Keyworks Software:

DECLARE_INTERFACE_(IITStorage, IUnknown)
{
  STDMETHOD(StgCreateDocfile)
    (const WCHAR* pwcsName, DWORD grfMode, DWORD reserved, IStorage** ppstgOpen) PURE;
  STDMETHOD(StgCreateDocfileOnILockBytes) 
    (ILockBytes * plkbyt, DWORD grfMode, DWORD reserved, IStorage ** ppstgOpen) PURE;
  STDMETHOD(StgIsStorageFile) 
    (const WCHAR * pwcsName) PURE;
  STDMETHOD(StgIsStorageILockBytes) 
    (ILockBytes * plkbyt) PURE;
  STDMETHOD(StgOpenStorage) 
    (const WCHAR * pwcsName, IStorage * pstgPriority, DWORD grfMode, SNB snbExclude, 
       DWORD reserved, IStorage ** ppstgOpen) PURE;
  STDMETHOD(StgOpenStorageOnILockBytes)
    (ILockBytes * plkbyt, IStorage * pStgPriority, DWORD grfMode, SNB snbExclude, 
       DWORD reserved, IStorage ** ppstgOpen ) PURE;
  STDMETHOD(StgSetTimes)
    (WCHAR const * lpszName, FILETIME const * pctime, FILETIME const * patime, 
       FILETIME const * pmtime) PURE;
  STDMETHOD(SetControlData)
    (PITS_Control_Data pControlData) PURE;
  STDMETHOD(DefaultControlData)
    (PITS_Control_Data *ppControlData) PURE;
  STDMETHOD(Compact)
    (const WCHAR* pwcsName, ECompactionLev iLev) PURE;
};

This declaration along with a few structs/enums and matching static GUID spells is enough to get you going - To actually open an ITStorage file you'll need to do something like this:

IITStorage* pITStorage = NULL;
IStorage* pStorage = NULL;
LPCWSTR pwzFile = L"sample.chm";
// don't forget to init COM
HRESULT hr = CoInitialize(NULL);
// get ITStorage interface
hr = CoCreateInstance(CLSID_ITStorage, NULL, CLSCTX_INPROC_SERVER, IID_ITStorage, 
                      (void **) &pITStorage);
// open the chm-file (only if the interface was obtained)
if (SUCCEEDED(hr))
  hr = pITStorage->StgOpenStorage(pwzFile, NULL, STGM_READ | STGM_SHARE_DENY_WRITE, 
                                  NULL, 0, &pStorage);

Use enumeration to discover the contents of your newly opened ITStorage - This will dump the root level contents of the storage along with the type and size of each element:

IEnumSTATSTG* pEnum = NULL;
STATSTG entry = {0};
LPCSTR typnam[] = { "STGTY_STORAGE", "STGTY_STREAM", "STGTY_LOCKBYTES", 
                    "STGTY_PROPERTY" };
hr = pStorage->EnumElements(0, NULL, 0, &pEnum);
if (SUCCEEDED(hr))
{
  while (pEnum->Next(1, &entry, NULL)==S_OK)
  {
    printf("%S, type=%s, size=%I64u\n", entry.pwcsName, typnam[entry.type-1], 
           entry.cbSize.QuadPart);
    // the name is allocated by the enumerator and must be freed by the caller
    CoTaskMemFree(entry.pwcsName);
  }
  pEnum->Release();
}

The typnam array is just added to improve readability by indicating the type of each element. Remember that stream sizes are 64-bit unsigned integers.
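
Indexing the name array with type-1 reads outside the array if a type value other than 1 to 4 ever turns up. A small bounds-checked lookup avoids that. The sketch below is portable and takes the type as a plain integer; the function name StgTypeName and the "STGTY_UNKNOWN" fallback text are my own placeholders, not Windows definitions.

```cpp
#include <cstddef>

// Returns a printable name for a STATSTG type value.
// STGTY_STORAGE, STGTY_STREAM, STGTY_LOCKBYTES and STGTY_PROPERTY are
// numbered 1 to 4; anything else gets a placeholder name.
inline const char* StgTypeName(unsigned long type)
{
    static const char* const names[] = {
        "STGTY_STORAGE", "STGTY_STREAM", "STGTY_LOCKBYTES", "STGTY_PROPERTY"
    };
    const std::size_t count = sizeof(names) / sizeof(names[0]);
    if (type < 1 || type > count)
        return "STGTY_UNKNOWN";   // placeholder chosen for this sketch
    return names[type - 1];
}
```

In the enumeration loop, typnam[entry.type-1] would simply become StgTypeName(entry.type).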

Each of the elements of type STGTY_STORAGE can be recursively opened and enumerated (like directories), and the elements of type STGTY_STREAM can be read into memory (like plain files). Reading a specific stream like the internal #STRINGS into memory could be done like this:

IStream* pStream = NULL;
PWCHAR pwzStream = L"#STRINGS";
// open stream
hr = pStorage->OpenStream(pwzStream, NULL, STGM_READ, 0, &pStream);
// get size
hr = pStream->Stat(&entry, STATFLAG_NONAME);
ULONG cbRead, cbSize = (ULONG) entry.cbSize.QuadPart;
// allocate buffer
LPVOID pBuffer = (LPVOID) LocalAlloc(LPTR, cbSize);
// read stream
hr = pStream->Read(pBuffer, cbSize, &cbRead);
// write buffer to file (binary mode, so no newline translation corrupts the data)
FILE* fp = _wfopen(pwzStream, L"wb");
size_t cbWrote = fwrite(pBuffer, sizeof(char), cbRead, fp);
// cleanup
fclose(fp);
LocalFree(pBuffer);
pStream->Release();

Nothing much here - Just open the stream, obtain its size with a Stat call, allocate a matching buffer, read entire stream into buffer and dump it to a file with the same name as the stream. Which is about as much as you need to get started with cracking the .chm fileformat.

Cracking the format

Using the above techniques you can build a small tool capable of decompiling all contents of a htmlhelp file. For a standard htmlhelp file this will reveal several files prefixed with # - namely the interesting internal streams:

#IDXHDR, type=STGTY_STREAM, size=4096
#ITBITS, type=STGTY_STREAM, size=0
#STRINGS, type=STGTY_STREAM, size=827
#SYSTEM, type=STGTY_STREAM, size=4264
#TOPICS, type=STGTY_STREAM, size=544
#URLSTR, type=STGTY_STREAM, size=1399
#URLTBL, type=STGTY_STREAM, size=408
#WINDOWS, type=STGTY_STREAM, size=400

The hh.exe tool is actually capable of decompiling your chm-file, but will leave out all of the interesting internal files. Reverse engineering a help project using the output from hh.exe is not possible since the [Alias] section of the hhp file cannot be reproduced without the #IVB information - Enter cracking!

Use HTML Help Workshop to create a small chm-file with a few topics and add a context for each. Now run your hacker decompile tool on the chm-file and use your favorite hex editor (probably not VI) to browse the resulting internal files prefixed with #. The format of these streams is pretty tricky, but this is a list of my discoveries so far:

  • #STRINGS - all strings are represented in ANSI and separated by nulls. Each string can be referenced from other streams by its offset in the file.
  • #WINDOWS - definitions of all windows.
  • #SYSTEM - the hh compiler version as pure text and the name of the default topic.
  • #URLTBL - all URLs in the file, separated by 9 nulls.
  • #IVB - links between context IDs and offsets in #STRINGS.
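
To make the #STRINGS description concrete, here is a minimal portable sketch that splits such a null-separated block into an index from byte offset to string. It assumes nothing beyond what is listed above (ANSI strings separated by nulls, referenced by offset); the function name IndexStrings is a hypothetical helper of mine.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Splits a null-separated string block into an offset -> string index.
// Empty strings (runs of nulls) are skipped; a final string without a
// terminating null is kept as-is.
inline std::map<std::uint32_t, std::string>
IndexStrings(const char* data, std::size_t size)
{
    std::map<std::uint32_t, std::string> index;
    std::size_t start = 0;
    while (start < size)
    {
        // find the end of the string beginning at 'start'
        std::size_t end = start;
        while (end < size && data[end] != '\0')
            ++end;
        if (end > start)
            index[static_cast<std::uint32_t>(start)] =
                std::string(data + start, end - start);
        start = end + 1;   // continue after the separator
    }
    return index;
}
```

Dumping the resulting index next to a hex view of the stream is a quick way to check which offsets other streams are pointing at.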

Notice that the #IVB stream is not present if the [Alias] section is left out of the help project file - This reveals quite a bit of information about where the context mapping is placed.

Converting Contexts to Topics   

The two streams #STRINGS and #IVB are the only interesting files for my purpose - Converting a context id to a topic string. After a bit of puzzling back and forth I discovered that the format of the #IVB stream is like this (all 32 bit values):

<size> <context#1> <offset #1> <context#2> <offset #2> ... <context#n> <offset#n>

  • <size> is the size of the file
  • <context> denotes each context that you have defined in the alias section (32 bit unsigned integers)
  • <offset> is the byte offset of the corresponding #STRINGS stream entry.
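
The layout above can be illustrated with a small portable decoder. This is a sketch under one assumption of mine: the 32-bit values are stored little-endian, as is usual for files written on x86 Windows. The names ReadU32LE and DecodeIvb are hypothetical helpers.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Reads one 32-bit little-endian value from a byte buffer.
inline std::uint32_t ReadU32LE(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Decodes <size> <context> <offset> ... into (context, offset) pairs.
// The leading <size> value is skipped; a trailing partial pair is ignored.
inline std::vector<std::pair<std::uint32_t, std::uint32_t>>
DecodeIvb(const unsigned char* data, std::size_t size)
{
    std::vector<std::pair<std::uint32_t, std::uint32_t>> pairs;
    for (std::size_t pos = 4; pos + 8 <= size; pos += 8)
        pairs.emplace_back(ReadU32LE(data + pos), ReadU32LE(data + pos + 4));
    return pairs;
}
```

With a made-up 20 byte buffer of 14 00 00 00, E8 03 00 00, 01 00 00 00, E9 03 00 00, 0B 00 00 00, the decoder yields the pairs (1000, 1) and (1001, 11).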

This makes reverse engineering quite easy - Open the two streams, read the contents into a buffer and close the streams again:

DWORD cbRead, cbSTRINGS, cbIVB, idxContext, idxString;
LPCSTR pSTRINGS, pTopic;
LPDWORD pIVB;
IStream* pStream = NULL;
// read #STRINGS
hr = pStorage->OpenStream(L"#STRINGS", NULL, STGM_READ, 0, &pStream);
hr = pStream->Stat(&entry, STATFLAG_NONAME);
cbSTRINGS = (DWORD) entry.cbSize.QuadPart;
pSTRINGS = (LPCSTR) LocalAlloc(LPTR, cbSTRINGS);
hr = pStream->Read((LPVOID)pSTRINGS, cbSTRINGS, &cbRead);
pStream->Release();
// read #IVB
hr = pStorage->OpenStream(L"#IVB", NULL, STGM_READ, 0, &pStream);
hr = pStream->Stat(&entry, STATFLAG_NONAME);
cbIVB = (DWORD) entry.cbSize.QuadPart;
pIVB = (LPDWORD) LocalAlloc(LPTR, cbIVB);
hr = pStream->Read((LPVOID)pIVB, cbIVB, &cbRead);
pStream->Release();

Now use the IVB buffer combined with the STRINGS buffer to display the mapping from context id to topic string - Just remember to free your buffers after use.

// show mapping between context and topic - first DWORD unused (contains size)
int nItems = cbIVB/sizeof(DWORD);
// stop when no complete context/offset pair is left, so pIVB[i+1] stays in range
for (int i = 1; i+1<nItems; i+=2)
{
  idxContext= pIVB[i];
  idxString = pIVB[i+1];
  pTopic = pSTRINGS + idxString;
  printf("%lu=%s\n", idxContext, pTopic);
}
// cleanup
LocalFree((LPVOID)pIVB);
LocalFree((LPVOID)pSTRINGS);
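
The loop above trusts every offset found in #IVB. For a file of unknown origin it may be worth validating the offset before treating it as a string. The following portable sketch shows one way to do that; the function name TopicAt is a hypothetical helper and the checks are my own defensive assumptions, not something the format demands.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the topic at 'offset' in the #STRINGS block, or nullptr when the
// offset is out of range or no terminating null is found inside the block.
inline const char* TopicAt(const char* strings, std::size_t size,
                           std::uint32_t offset)
{
    if (strings == nullptr || offset >= size)
        return nullptr;
    for (std::size_t i = offset; i < size; ++i)
    {
        if (strings[i] == '\0')
            return strings + offset;   // terminated inside the block
    }
    return nullptr;   // ran off the end without finding a terminator
}
```

In the dumping loop, pTopic would then be assigned from TopicAt(pSTRINGS, cbSTRINGS, idxString) and skipped when the result is null.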

Putting it all together

To use the context/topic mapping on a large scale chm-file, it would probably be wise to store the actual map in a container built for fast keyed lookups instead of scanning the #IVB array for every request. I have used the STL map template to create an efficient map between context DWORDs and char pointers into the #STRINGS memory block. Simply replace the context mapping dumping routine above with the following (stuff left out to improve readability):

#pragma warning(disable: 4786)
#include <list>
#include <map>
using namespace std;
:
// the default comparator and allocator are fine here
typedef map<DWORD, LPCSTR> tMapIntString;
:
// create mapping between context and topic
tMapIntString map;
int nItems = cbIVB/sizeof(DWORD);
for (int i = 1; i+1<nItems; i+=2)
{
  idxContext= pIVB[i];
  idxString = pIVB[i+1];
  pTopic = (char*)(pSTRINGS) + idxString;
  map[idxContext] = pTopic;
}
// cleanup, leave pSTRINGS in memory (used by map)
LocalFree((LPVOID)pIVB);

The STL based map can easily be traversed like this if needed (using the iterator class, "first" returns your key and "second" returns your data):

// traverse map
tMapIntString::iterator pos;
for (pos = map.begin(); pos!=map.end(); pos++)
  printf("%d=%s\n", (*pos).first, (*pos).second);

Using the map is now as simple as this (using the "find" function to return the iterator for the matching element and extracting the data with the parameter "second"):

// use the map
pos = map.find(1000);
if (pos != map.end())
  HtmlHelp(NULL, pszFile, HH_DISPLAY_TOPIC, (DWORD_PTR) (*pos).second);
pos = map.find(1001);
if (pos != map.end())
  HtmlHelp(NULL, pszFile, HH_DISPLAY_TOPIC, (DWORD_PTR) (*pos).second);
// invalid context
pos = map.find(1002);
if (pos == map.end())
  printf("invalid context\n");
// close the chmfile
HtmlHelp(NULL, NULL, HH_CLOSE_ALL, NULL);
// free the map & string buffer
map.clear();
LocalFree((LPVOID)pSTRINGS);

Testing will reveal that creation of the context map is not an expensive operation, and that translation from context to topic is rather efficient, with very few lines of code. The clever part is to leave the STRINGS buffer in consecutive memory and exploit the fact that the strings are already separated by null chars.
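
The map stores raw pointers into the STRINGS buffer, so the buffer has to outlive the map. One way to make that lifetime rule hard to break is to let a single object own both. The sketch below is portable C++ and only an illustration: the class name ContextMap is hypothetical, and it takes the context/offset pairs as an already decoded vector rather than reading #IVB itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical wrapper: owns a copy of the #STRINGS data together with the
// context -> offset table, so lookups never point into freed memory.
class ContextMap
{
public:
    ContextMap(std::string strings,
               const std::vector<std::pair<std::uint32_t, std::uint32_t>>& ivb)
        : strings_(std::move(strings))
    {
        for (const auto& entry : ivb)
        {
            if (entry.second < strings_.size())   // drop out-of-range offsets
                offsets_[entry.first] = entry.second;
        }
    }

    // Returns the topic for a context id, or nullptr if the id is unknown.
    // The pointer stays valid for as long as this object is alive.
    const char* topic(std::uint32_t context) const
    {
        auto it = offsets_.find(context);
        return it == offsets_.end() ? nullptr : strings_.c_str() + it->second;
    }

    std::size_t size() const { return offsets_.size(); }

private:
    std::string strings_;   // raw block, strings separated by nulls
    std::map<std::uint32_t, std::uint32_t> offsets_;
};
```

The same trick as in the article is used here: the strings are never copied one by one, the lookup just returns a pointer into the block and relies on the existing null separators.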

Sample App - CHM Explorer

The source file is a single zip archive, containing three subdirectories.

  • Chm - Sample chm project with two topics and contexts plus all the needed stuff for training cards and tracking.
  • Chmexp - A VC++ project using the techniques shown in this article for obtaining the topic/context mapping and using it (including trainingcard/tracking code).
  • Sample - All code samples in this article plastered in one single VC++ project.

Chmexp has been stuffed with a lot of functionality in order to demonstrate all the issues described in this article.

Enter the full path of a chm-file in the chm textbox and press the "Open" button. The status of the chm file is refreshed every second and shown just beside the chm filename, revealing whether the file is locked or not. "Close" will issue a HH_CLOSE_ALL to the htmlhelp API, "Dump" will extract all streams to physical files in the entered directory and "Notif" will toggle tracking information. Navigation in the help viewer window will be shown in the status box if notification is turned on.

The comboboxes at the bottom of the dialog will contain all valid context IDs and topics after a successful open. Select or enter a context and press "ConTop" to use the context/topic mapping or "Context" to send the context directly (invoking the lock error). Select or enter a topic and press the "Topic" button to have that topic displayed in the help window.

So long and thanks for all the fish

But where does that leave us?

  • A tolerable workaround for the mysterious chm-file lock has been found - Open the chm-file at the beginning of your program, read the internal #IVB mapping and replace every HH_HELP_CONTEXT call with the matching HH_DISPLAY_TOPIC call.
  • Gaining access to the ITStorage gives you sufficient information to build a cunning chm-file explorer, revealing the secrets of previously hidden information.
  • Part of the previously internal format of the helpfile has been reverse engineered to the extent that you could actually create an improved decompile application. You'd still need to work out the format of the rest of the internal streams, but at least the [Alias] section is cracked.
  • Helpfiles are no longer black box material and the runtime access to the context/topic information opens up for interesting integration possibilities with your standard application. You could actually build a "helpfile consistency tool" capable of testing which contexts are valid and which are unused.
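
The "helpfile consistency tool" mentioned in the last point boils down to comparing two sets of context ids. The following portable sketch shows the core of such a check. The names ContextReport and CheckContexts are hypothetical, and how the application's own list of used contexts is collected is left open.

```cpp
#include <cstdint>
#include <set>
#include <vector>

struct ContextReport
{
    std::vector<std::uint32_t> missing;   // used by the app, absent from the helpfile
    std::vector<std::uint32_t> unused;    // present in the helpfile, never used by the app
};

// Compares the context ids an application uses with those found in #IVB.
inline ContextReport CheckContexts(const std::set<std::uint32_t>& usedByApp,
                                   const std::set<std::uint32_t>& inHelpFile)
{
    ContextReport report;
    for (std::uint32_t id : usedByApp)
    {
        if (inHelpFile.count(id) == 0)
            report.missing.push_back(id);
    }
    for (std::uint32_t id : inHelpFile)
    {
        if (usedByApp.count(id) == 0)
            report.unused.push_back(id);
    }
    return report;
}
```

Both result lists come out in ascending order because std::set iterates its elements sorted.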

Anyway, play around with the format and make up more uses - The following links might help you find additional information regarding the htmlhelp format:

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.