CHM Snatcher

TSzatkowski

12 Feb 2008

A way of mass-converting online content to CHM files for delivering user help and manuals, sharing a company knowledge base, and distributing any sort of documentation.

Introduction

This article presents what is possibly one of the fastest and most reliable ways of mass-converting almost any online content to the CHM format. The conversion is designed to be as automatic and cost-efficient as possible. Although the process described in this article focuses on the following products:

online content authoring (MediaWiki),
the retriever (Mozilla Firefox with the iMacros add-on),
CHM generator (Microsoft HTML Help Workshop),

it can probably be adapted to other suitable products (e.g. DokuWiki as the online content authoring system).

To orchestrate the cooperation between these three layers (and save our precious time), CHM Snatcher, a custom-built command-line tool, has been created.

Why Convert to CHM Format?

Despite its enormous advantages, every online system (such as a company or hobby wiki) has its limitations and adds complexity:

it requires an HTTP server,
a database server,
a server-side language interpreter/compiler (e.g. PHP, C#/ASP.NET), and
a connection to the Internet or to the intranet.

A problem emerges when someone wants to access the content just to look something up but, for some reason, cannot connect to the system (e.g. due to access restrictions or high server load).

Before We Start

Several things are required to start working with CHM Snatcher:

.NET Framework 2.0,
Mozilla Firefox with the iMacros for Firefox add-on (as far as I know, iMacros is also available for Internet Explorer),
Microsoft HTML Help Workshop.

The Solution Outline

The following diagram represents the general idea:

First, users edit the online content, so that the content is ready to be saved offline,
Then, the CHM Snatcher XML configuration file is prepared (please read the next section for details),
Then, the CHM Snatcher command-line utility is run in the pre-process mode:
- the iMacros macro file is generated,
- the HTML Help project files are generated.
Then, the user starts Mozilla Firefox, selects the macro generated in the previous step in the iMacros add-on, and runs it. The macro physically retrieves (processes) the URL targets, which can take a while.
Then, the CHM Snatcher command-line utility is run in the post-process mode:
- the appropriate replacements are made in the downloaded content (e.g. a given *.css file is replaced with another one),
- the CHM file is generated (by using the Microsoft HTML Help Workshop).

Note that CHM Snatcher does not:

retrieve the web pages: it just generates a macro that performs the retrieval (more precisely, the macro is run by the iMacros add-on in Mozilla Firefox),
generate the CHM files: again, it just generates the necessary HTML Help Workshop project files, which HTML Help Workshop then compiles into a CHM file.

Putting it all together: CHM Snatcher is just a "man in the middle" that reduces the user's effort by coordinating various tools.
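The macro-generation half of the pre-process step can be pictured with a minimal Python sketch. It is not CHM Snatcher's actual output: the function name build_macro, the page_N file-naming scheme and the choice of SAVEAS TYPE=CPL are assumptions made for illustration. Only URL GOTO and SAVEAS themselves are regular iMacros commands, and <SP> is the iMacros notation for a space inside a parameter value.

```python
def build_macro(urls, pages_folder):
    """Return iMacros macro text that opens each URL and saves the page."""
    # iMacros uses <SP> to represent a space inside a parameter value.
    folder = pages_folder.replace(" ", "<SP>")
    lines = []
    for index, url in enumerate(urls, start=1):
        lines.append("URL GOTO=" + url)
        # Hypothetical naming scheme: page_1, page_2, ...
        lines.append(
            "SAVEAS TYPE=CPL FOLDER={0} FILE=page_{1}".format(folder, index)
        )
    return "\n".join(lines) + "\n"


macro = build_macro(
    [
        "http://en.wikipedia.org/wiki/NES?printable=yes",
        "http://en.wikipedia.org/wiki/SNES?printable=yes",
    ],
    r"C:\Temp\VintageGameSystems\Pages",
)
print(macro)
```

The resulting text would be written to the *.iim file and run from the iMacros add-on by hand, which is exactly the manual step in the outline above.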

CHM Snatcher Overview

As stated above, CHM Snatcher is a command line utility. Currently, it supports only two commands:

CHMSnatcher.exe preProcess [path to XML configuration file]
CHMSnatcher.exe postProcess [path to XML configuration file]

Each of the two commands requires an XML configuration file. Here is an example, followed by a more detailed description:

<snatchProject>
 <snatchExpression value="http://en.wikipedia.org/wiki/{0}?printable=yes" />
 <outputFolder value="C:\Temp\VintageGameSystems" />
 <helpProjectName value="Vintage Game Systems" />
 <iMacrosScriptFileName
value="C:\Documents and Settings\Profile\My Documents\iMacros\Macros\VintageGamesSystems.iim" />
 <pages>
  <page urlName="Video_game_system" title="Video game system - overview" />
  <page urlName="" title="Nintendo">
   <page urlName="Nintendo" title="Nintendo - Company Info" />
   <page urlName="NES" title="NES" />
   <page urlName="SNES" title="SNES" />
   <page urlName="Nintendo_64" title="Nintendo 64" />
  </page>
 </pages>
 <postProcessing>
  <replaceAll searchedFile="commonPrint.css" replacementFile="Replacements\commonPrint.css" />
  <compile pathToHTMLHelpWorkshop="C:\Program Files\HTML Help Workshop" />
 </postProcessing>
</snatchProject>

The snatchExpression tag configures how CHM Snatcher builds the URL for every retrieved page (see the pages tag); e.g. here we want to retrieve the pages in their printable version. Note that the {0} parameter is mandatory: it will be replaced with the page.@urlName attribute of every page.
The outputFolder tag decides:
- where the Microsoft HTML Help Workshop project files will be generated,
- where the HTML page files retrieved by Mozilla Firefox iMacros will be saved (note that they are saved in the Pages subfolder of outputFolder.@value).
The helpProjectName tag sets the name of the help project. It influences:
- the *.hhp, *.hhc and *.chm file names,
- the CHM document name (displayed in the title bar of the generated document).
The iMacrosScriptFileName tag denotes where the *.iim macro file will be generated. It is recommended to point it at the directory where the Mozilla Firefox iMacros add-on looks for macros; please note that the path will almost certainly differ on your system.
The pages tag encapsulates the pages in a tree hierarchy, exactly the way they will be represented in the document tree of the CHM file.
The page.@urlName attribute is the fragment of the URL for a page that will be combined with the snatchExpression.@value attribute. The page.@title attribute is the page title as it will be displayed in the left-side document tree of the generated CHM file. Please note that page.@urlName can be left empty (as for the Nintendo node in the example above).
The postProcessing tag encapsulates actions that are performed AFTER the user has retrieved the pages with the Mozilla Firefox iMacros add-on.
- The replaceAll tag denotes a file replacement action: all replaceAll.@searchedFile files found in the outputFolder.@value folder will be replaced by the replaceAll.@replacementFile file. Why do we need replacements? In the example presented, the Wikipedia printable pages have no margins, which is good for printing but looks clumsy in the CHM file. Thus, every downloaded copy of that *.css file is replaced by a custom *.css file.
- The compile tag represents the compile action. For it to run successfully, the compile.@pathToHTMLHelpWorkshop attribute must be set properly. Please note that CHM Snatcher assumes the *.hhp file will be named [helpProjectName.@value].hhp.
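To illustrate how snatchExpression and the pages tree fit together, here is a minimal Python sketch that reads a trimmed-down version of the configuration and builds one URL per page. It is only an approximation of the idea, not the tool's C# code: the function name collect_urls is hypothetical, and treating a page with an empty urlName as a heading with nothing to download is an assumption based on the Nintendo node in the example.

```python
import xml.etree.ElementTree as ET

CONFIG = """<snatchProject>
 <snatchExpression value="http://en.wikipedia.org/wiki/{0}?printable=yes" />
 <pages>
  <page urlName="Video_game_system" title="Video game system - overview" />
  <page urlName="" title="Nintendo">
   <page urlName="NES" title="NES" />
   <page urlName="SNES" title="SNES" />
  </page>
 </pages>
</snatchProject>"""


def collect_urls(page_elements, expression):
    """Walk the page tree depth-first and return (title, url) pairs."""
    result = []
    for page in page_elements:
        url_name = page.get("urlName", "")
        if url_name:  # assumed: an empty urlName means "heading only"
            url = expression.replace("{0}", url_name)
            result.append((page.get("title"), url))
        # Recurse into nested <page> elements.
        result.extend(collect_urls(page.findall("page"), expression))
    return result


root = ET.fromstring(CONFIG)
expression = root.find("snatchExpression").get("value")
pairs = collect_urls(root.find("pages").findall("page"), expression)
for title, url in pairs:
    print(title, "->", url)
```

The sketch substitutes {0} with a plain string replacement so that any other braces in the URL are left untouched.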

The following table shows which tags and attributes are mandatory.

XML entity name	Mandatory
snatchExpression.@value	yes
outputFolder.@value	yes
helpProjectName.@value	yes
iMacrosScriptFileName.@value	yes
page.@urlName	yes
page.@title	yes
replaceAll	no
compile	no

Using the Code

The source code submitted to CodeProject is quite straightforward: it contains some XML document manipulation and a lot of file content generation (the File and StringBuilder classes of the .NET Framework are used most extensively). The interesting places at a glance:

CHMSnatcher.PreProcessor.FillTreeOfPages() and CHMSnatcher.PreProcessor.FillListOfPages() methods use recursion to build/analyze the tree of pages,
CHMSnatcher.PreProcessor.GenerateHppProject() and CHMSnatcher.PreProcessor.GenerateHppTableOfContents() methods generate the HTML Help Workshop files. No third-party library was required to do the job, since HTML Help Workshop files are just plain text files with an unsophisticated structure (see for yourself).
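As a rough illustration of how little is needed to emit such a file, the Python sketch below renders a page tree as an HTML Help table of contents (*.hhc), which is essentially nested UL/LI lists of text/sitemap objects. This is not the code from the download: the function names, the tuple layout of the tree and the Pages\*.htm file names are all assumptions made for the example.

```python
from html import escape


def toc_items(nodes, depth=0):
    """Recursively render page nodes as nested <UL>/<LI> sitemap entries."""
    pad = "\t" * depth
    lines = [pad + "<UL>"]
    for title, local, children in nodes:
        lines.append(pad + '\t<LI> <OBJECT type="text/sitemap">')
        lines.append(
            pad + '\t\t<param name="Name" value="{0}">'.format(escape(title, quote=True))
        )
        if local:  # heading-only nodes have no target file
            lines.append(
                pad + '\t\t<param name="Local" value="{0}">'.format(escape(local, quote=True))
            )
        lines.append(pad + "\t\t</OBJECT>")
        if children:
            lines.extend(toc_items(children, depth + 1))
    lines.append(pad + "</UL>")
    return lines


def build_hhc(nodes):
    """Return the full text of a table-of-contents file."""
    header = [
        '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">',
        "<HTML>",
        "<HEAD>",
        "</HEAD><BODY>",
    ]
    return "\n".join(header + toc_items(nodes) + ["</BODY></HTML>"]) + "\n"


# Each node is (title, local file or None, children); file names are hypothetical.
tree = [
    ("Video game system - overview", "Pages\\Video_game_system.htm", []),
    ("Nintendo", None, [
        ("NES", "Pages\\NES.htm", []),
        ("SNES", "Pages\\SNES.htm", []),
    ]),
]
hhc = build_hhc(tree)
print(hhc)
```

The recursion mirrors the tree in the pages tag: every level of nesting in the configuration becomes one nested list in the table of contents.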

Points of Interest

I think some things can be done even better:

A convenient GUI could be added on top of the engine.
Saving the web pages offline could be embedded into CHM Snatcher (using a third-party library) or, which should be even simpler, done by a command-line tool. Unfortunately, some tools like HTTrack didn't work well for me (the saved results were far from perfect: there were missing images, css files etc.). But maybe you know an offline grabber that would do the job?
Post-processing plug-ins/scripts could be useful (e.g. imagine replacing some tags in the retrieved *.html files; see the note below).

Note that CHM Snatcher can probably be used to archive most CodeProject pages. However, a small problem emerges with the downloaded offline page versions. Namely, some JavaScript scripts embedded in the downloaded HTML files make the CHM document display nasty JavaScript error dialogs, which is, of course, unacceptable. One way to overcome this is to replace the <script ... </script> tags in the downloaded HTML documents with an inert substitute. Nevertheless, such a post-processing task has not been implemented so far, so feel free to extend the tool!
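The script clean-up described in the note could be sketched in Python as below. This is only an illustration of the idea and is not part of CHM Snatcher: the function name strip_scripts and the default of replacing each script block with an empty string are assumptions, and a regular expression is a pragmatic shortcut rather than a full HTML parser.

```python
import re

# Matches an opening <script ...> tag, its body, and the closing </script>.
SCRIPT_BLOCK = re.compile(
    r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL
)


def strip_scripts(html, replacement=""):
    """Replace every <script>...</script> block with the given text."""
    # A function is passed so the replacement is inserted literally.
    return SCRIPT_BLOCK.sub(lambda _match: replacement, html)


sample = (
    '<html><head><SCRIPT type="text/javascript">alert("x");</SCRIPT></head>'
    '<body><p>Text</p><script src="a.js"></script></body></html>'
)
cleaned = strip_scripts(sample)
print(cleaned)  # <html><head></head><body><p>Text</p></body></html>
```

Such a function would be applied to every *.html file in the Pages subfolder before the compile step.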

To the Readers

I hope you liked the article and found it useful. However, if you think otherwise and want to express it by rating, please leave a comment explaining why you think this article deserves a low grade. I think uncommented low grades are definitely not the way to go, so please help me improve it!

Anyway, thanks for reading!

History

12th of February 2008 - First version created.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.