Introduction
I have been working on a project with many "encyclopedia" articles. Some of the articles are lengthy, and I thought it would be nice to add a table of contents. But manually writing and maintaining a TOC while the document was still under revision... I know from experience that such a path leads to madness.
I did some digging and found a very good method using JavaScript. Two problems, though: I didn't like needing to rely on my users having JavaScript enabled, and the TOC was not displaying at all with IE8 in compatibility mode. (It was there according to the DOM, but not displaying, which was very frustrating.) I was also annoyed at the noticeable time lag between text first appearing in the browser and the TOC being displayed.
The logical way around these issues was to do the processing server side. I posted a Quick Question about how to intercept the text of a page, and CP member sTOPs provided a link that helped me to figure out my solution.
This article demonstrates how to dynamically create a table of contents and insert it into the web page before the page gets delivered to the user. Part 2, which still needs some work, will demonstrate how to use a pseudo-tag to generate inline references, similar to what Wikipedia does with its <ref> pseudo-tag.
Dealing with the HTML
This technique requires minimal changes in your HTML. Put the token {{toc}} where you want the table of contents to go, and the filter will do the rest. Using it "naked" will cause the Visual Studio validator to complain about text not being allowed in the <body> tag; if this annoys you, you can put the token inside a <div>. The filter is case insensitive, so you can also use {{TOC}}, {{Toc}}, and so on.
You will want to look at the <hx> headers that will be used to generate the TOC. On my site, <h1>, <h2>, and <h6> all have special uses, so my filter ignores them. This leaves <h3>, <h4>, and <h5> for use as content headings. Again, the filter is case insensitive, and should work just fine with, say, <H4>. The filter will also handle any existing id attributes on the header tags; if one is not provided, then the filter will auto-generate one for you so that your TOC link will have a place to land. Other attributes, if any, will be handled gracefully if they are in proper HTML format. And, please note: the tags are translated into XElement objects, which means that their content must be valid XML. If your headers include entities like &ldquo; that XML does not understand, you will get an error. Using numeric codes in your headers instead of entities will work with some codes; other codes cause an offset problem, so remember to check your work. It might be easier to use ASCII replacements (straight quotes instead of the fancy curved ones) or not use entities at all.
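The entity restriction is easy to check outside the filter. The following is a minimal sketch, not part of the filter itself; the module name EntityDemo and the helper CanParse are made up for illustration. It shows that a named HTML entity is rejected by XElement.Parse while a numeric reference and a predefined XML entity are accepted.

```vb
Imports System
Imports System.Xml
Imports System.Xml.Linq

Module EntityDemo
    ' Hypothetical helper: True when the header markup parses as XML.
    Function CanParse(ByVal headerMarkup As String) As Boolean
        Try
            XElement.Parse(headerMarkup)
            Return True
        Catch ex As XmlException
            Return False
        End Try
    End Function

    Sub Main()
        ' Named HTML entity: not declared in XML, so parsing fails.
        Console.WriteLine(CanParse("<h3>&ldquo;Quoted&rdquo; heading</h3>"))   ' False
        ' Numeric character reference: valid XML.
        Console.WriteLine(CanParse("<h3>&#8220;Quoted&#8221; heading</h3>"))   ' True
        ' One of the five predefined XML entities: valid XML.
        Console.WriteLine(CanParse("<h3>Q &amp; A</h3>"))                      ' True
    End Sub
End Module
```

Note that a numeric reference that parses is stored as the character it stands for, so the element's text no longer has the same length as the original markup. That is a likely source of the offset problem mentioned above.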
Lastly, look at your CSS. If you use my filter out of the box, make sure you include the style sheet so that your TOC looks nice and is properly formatted. A tiny bit of JavaScript is useful too if you want to let users minimize the TOC. Because the table of contents starts out open, it will not damage any functionality or layout if the user has scripting disabled.
Response.Filter: How it works
In the bad old pre-.NET days, writing a web filter was frustrating. Filters had to be written in C++, and the development cycle was basically compile, install on IIS, test, uninstall, try to find your bug, recompile, install, ad nauseam.
Nowadays, filters have become much easier. Internally, the Framework keeps track of the page's assembly and rendering with Streams. Intercept the stream, and you can tinker with the page before sending it on its merry way. Doing this is almost trivial:
If Response.ContentType = "text/html" Then
    Response.Filter = New MyNewFilter(Response.Filter)
End If
If the page has a content type of text/html, then set Response.Filter to be your new filter. This will prevent your filter from being invoked when the server is delivering images, PDFs, and other types of content. The constructor for MyNewFilter takes the previous filter; when yours has done its work, processing will move on to the next one in the chain. Yup, it really is that simple.
The next issue to consider is where to set the filter. From what I've been able to tell, you can do this pretty much any time before the page is delivered, from either the page itself, its master page, or even globally in Global.asax. Because most of my pages will have a table of contents, I have implemented the filter globally, in the PostReleaseRequestState event of the Application object. This event is one of the last to be fired before delivering the page, which makes it a logical choice. In the Global.asax file, I added this code:
Protected Sub Application_PostReleaseRequestState _
        (ByVal sender As Object, ByVal e As System.EventArgs)
    If Response.ContentType = "text/html" Then
        Response.Filter = New TOCFilter(Response.Filter)
    End If
End Sub
The exact placement of this code may not matter: I have seen examples using the Page_Load event when only individual pages are being filtered. Very likely, all you need to do is use it in some event that every page will call.
Auxiliary class HeaderClass
To encapsulate some of the processing, I have an auxiliary class named HeaderClass.
Private Class HeaderClass
    Private pRank As String
    Private pTag As XElement

    Public ReadOnly Property Id() As String
        Get
            If pTag Is Nothing Then
                Throw New Exception("Member 'Tag' was not instantiated.")
            End If
            If pTag.Attribute("id") IsNot Nothing Then
                Return pTag.Attribute("id").Value
            Else
                Return pTag.Value
            End If
        End Get
    End Property

    Public ReadOnly Property Length() As Integer
        Get
            Return pTag.ToString.Length
        End Get
    End Property

    Public ReadOnly Property Rank() As String
        Get
            Return pRank
        End Get
    End Property

    Public ReadOnly Property Tag() As XElement
        Get
            Return pTag
        End Get
    End Property

    Public ReadOnly Property TagReplacement() As String
        Get
            If pTag Is Nothing Then
                Throw New Exception("Member 'Tag' was not instantiated.")
            End If
            Dim NewTag As New XElement(pTag)
            NewTag.SetAttributeValue("id", _
                "{0}_" + Me.Id.Replace(" ", "_"))
            Return NewTag.ToString
        End Get
    End Property

    Public ReadOnly Property Text() As String
        Get
            If pTag Is Nothing Then
                Throw New Exception("Member 'Tag' was not instantiated.")
            End If
            Return pTag.Value
        End Get
    End Property

    Public Sub New(ByVal Rank As String, ByVal Tag As String)
        pRank = Rank
        Try
            pTag = XElement.Parse(Tag)
        Catch ex As Exception
            Throw New ArgumentException("Not a valid element.", "Tag", ex)
        End Try
    End Sub
End Class
The main purpose of this class is to hold the tag for later reference, with some extra functionality added to make things a bit cleaner.
Class TOCFilter
Now we are ready to look at the filter itself. Response.Filter is a Stream object, so our filter needs to be based on System.IO.Stream. That class is abstract, so inheriting from it directly would mean overriding a number of members; inheriting from MemoryStream avoids that work and works fine.
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Public Class TOCFilter
    Inherits MemoryStream

    Private Output As Stream        ' The filter this one replaces in the chain
    Private HTML As StringBuilder   ' Accumulates the page text
    Private EOP As Regex            ' Detects the end of the page

    Public Sub New(ByVal LinkedStream As Stream)
        Output = LinkedStream
        HTML = New StringBuilder
        EOP = New Regex("</html>", RegexOptions.IgnoreCase)
    End Sub

    Public Overrides Sub Write(ByVal buffer() As Byte, _
            ByVal offset As Integer, ByVal count As Integer)
        ' Decode only the bytes that belong to this call.
        Dim BufferStr As String = UTF8Encoding.UTF8.GetString(buffer, offset, count)
        HTML.Append(BufferStr)
        If EOP.IsMatch(BufferStr) Then
            Dim PageContent As String = HTML.ToString
            ' The outgoing array is new, so it is written from position 0.
            Dim PageBytes() As Byte = UTF8Encoding.UTF8.GetBytes(PageContent)
            Output.Write(PageBytes, 0, PageBytes.Length)
        End If
    End Sub
End Class
The constructor takes the filter stream that it is replacing and puts it aside. When the new filter's Write method is invoked, the value of buffer() is accumulated until we have the whole page, which is then processed and chained to the next filter. The encoding work makes sure that the text gets stored in memory correctly; if you are not using UTF-8 (which is pretty standard nowadays), you will need to reference whatever encoding system your pages are using.
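To see the accumulate-then-forward idea in isolation, here is a minimal standalone sketch. It is not the article's TOCFilter: the class BufferingFilter and the module BufferingDemo are hypothetical names. The sketch also makes two assumptions of its own: it decodes only the offset/count slice of each buffer, and it looks for the closing tag in the accumulated text so that a tag split across two writes is still found.

```vb
Imports System
Imports System.IO
Imports System.Text

' Hypothetical demo class; it forwards the page unchanged once it is complete.
Public Class BufferingFilter
    Inherits MemoryStream

    Private ReadOnly Output As Stream
    Private ReadOnly HTML As New StringBuilder

    Public Sub New(ByVal linkedStream As Stream)
        Output = linkedStream
    End Sub

    Public Overrides Sub Write(ByVal buffer() As Byte, _
            ByVal offset As Integer, ByVal count As Integer)
        ' Decode only the bytes that belong to this call.
        HTML.Append(Encoding.UTF8.GetString(buffer, offset, count))

        ' Search everything received so far, not just the latest chunk.
        Dim page As String = HTML.ToString()
        If page.IndexOf("</html>", StringComparison.OrdinalIgnoreCase) >= 0 Then
            Dim pageBytes() As Byte = Encoding.UTF8.GetBytes(page)
            Output.Write(pageBytes, 0, pageBytes.Length)
            HTML.Length = 0
        End If
    End Sub
End Class

Module BufferingDemo
    Sub Main()
        Dim sink As New MemoryStream
        Dim demoFilter As New BufferingFilter(sink)

        ' The closing tag is deliberately split across two writes.
        Dim part1() As Byte = Encoding.UTF8.GetBytes("<html><body>Hi</body></ht")
        Dim part2() As Byte = Encoding.UTF8.GetBytes("ml>")

        demoFilter.Write(part1, 0, part1.Length)
        Console.WriteLine(sink.Length)   ' 0: nothing forwarded yet

        demoFilter.Write(part2, 0, part2.Length)
        Console.WriteLine(Encoding.UTF8.GetString(sink.ToArray()))
    End Sub
End Module
```

One limitation of this sketch: decoding chunk by chunk can still break a multi-byte UTF-8 character that straddles two writes. Accumulating the raw bytes and decoding once at the end would avoid that.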
The table of contents itself
Before we look at how the table of contents is generated, let's look at the markup it will produce. Here is a sample layout:
<table id="TOC">
  <tr>
    <th id="TOC_Header">Contents [<span id="TOC_Toggle"
      onclick="ShowHideToc();">Hide</span>]</th>
  </tr>
  <tr id="TOC_Content" style="display:block">
    <td>
      <table>
        <tr>
          <td class="TOC_level_H3">
            <a href="#1_H3_Header">1&nbsp;&nbsp;H3 Header</a>
          </td>
        </tr>
      </table>
    </td>
  </tr>
</table>
What we have here is a table that has an id of TOC and two rows. The first row is the TOC's header, and its cell has an id of, oddly enough, TOC_Header. The span TOC_Toggle is linked to a very small bit of JavaScript which will toggle the visibility of the second row, TOC_Content. The cell in that row holds another table, where each TOC entry has its own row. Those cells have one of three classes, which pad the left side to give the cells their indent. The link inside the cell points to the matching hx element further down the page.
Making this case-insensitive
There are two utility methods in TOCFilter, which allow the filter to work without regard to case.
Private Function StringContains(ByVal ThisString As String, _
        ByVal SearchText As String) As Boolean
    Dim i As Integer = ThisString.IndexOf(SearchText, _
        StringComparison.CurrentCultureIgnoreCase)
    ' IndexOf returns -1 when the text is not found; 0 is a valid match position.
    Return (i <> -1)
End Function

Private Function StringReplace(ByVal ThisString As String, _
        ByVal Pattern As String, _
        ByVal ReplacementText As String) As String
    Return Regex.Replace(ThisString, Pattern, ReplacementText, RegexOptions.IgnoreCase)
End Function
StringContains does a case-insensitive IndexOf operation, and returns True if a match is found. StringReplace uses Regular Expressions to do a case-insensitive replace. Please note that while StringReplace is sufficient for this filter, it is not robust enough for most real-world situations. If you want to use it as-is, you do so at your own risk.
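One reason StringReplace is fragile is that Regex.Replace treats both of its text arguments specially: the search text is a pattern, and the replacement text may contain substitutions such as $0. A possible hardening, shown here only as a sketch, is to escape the search text and supply the replacement through a MatchEvaluator. The names ReplaceDemo and LiteralReplace are invented for this example and are not part of the filter.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ReplaceDemo
    ' Hypothetical helper: case-insensitive replace that treats both
    ' the search text and the replacement text as plain literals.
    Function LiteralReplace(ByVal source As String, _
            ByVal literal As String, _
            ByVal replacement As String) As String
        ' Escape regex metacharacters in the search text.
        Dim pattern As String = Regex.Escape(literal)
        ' A MatchEvaluator result is inserted verbatim, so "$0" is not expanded.
        Return Regex.Replace(source, pattern, _
            Function(m As Match) replacement, RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim page As String = "<div>{{TOC}}</div>"
        ' Prints: <div><b>Only $0.99</b></div>
        Console.WriteLine(LiteralReplace(page, "{{toc}}", "<b>Only $0.99</b>"))
    End Sub
End Module
```

With a plain string replacement, the $0 in this example would be read as "the whole match" and the token would be copied back into the output.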
Overrides sub Write
Now that the theory and infrastructure are out of the way, let's look at the heart of the filter. First, we check to see if the TOC token is present; if not, there is no point generating a table that will not be inserted.
If StringContains(PageContent, "{{toc}}") Then
    Dim Headers As New SortedList(Of Integer, HeaderClass)
    Dim Tag As String = ""
    Dim i As Integer = 0
    Dim j As Integer = 0
    ' IndexOf returns -1 when there are no more matches.
    i = PageContent.IndexOf("<h3", StringComparison.CurrentCultureIgnoreCase)
    Do While i >= 0
        j = PageContent.IndexOf("</h3>", i + 1, _
            StringComparison.CurrentCultureIgnoreCase)
        ' j - i covers the tag up to "</h3>"; + 5 adds the closing tag itself.
        Tag = PageContent.Substring(i, j - i + 5)
        Headers.Add(i, New HeaderClass("H3", Tag))
        i = PageContent.IndexOf("<h3", j, _
            StringComparison.CurrentCultureIgnoreCase)
    Loop
    ...
End If
This code searches for <h3> tags. If one is found, its text is copied from PageContent into Tag and added to Headers. The "+ 5" accounts for the five characters of the closing tag, so the substring ends exactly at the closing angle bracket; a larger value would pull in the character that follows the header, and XElement.Parse rejects trailing text that is not white space. The code then looks for the next tag starting from the end of the previous, until there are no more tags left. After this loop, two more retrieve the <h4> and <h5> tags.
Notice that Headers is a sorted list whose key is the starting position of the tag. This means that, no matter the order in which the tags are retrieved, they will come out of the list in the order they appear in the page text.
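The ordering behavior can be seen with a few made-up entries. This sketch uses plain strings in place of HeaderClass, and the positions and titles are invented for illustration.

```vb
Imports System
Imports System.Collections.Generic

Module SortedListDemo
    Sub Main()
        ' Key = position of the tag in the page (made-up values).
        Dim headers As New SortedList(Of Integer, String)

        ' Added the way the filter finds them: all H3 tags first, then H4.
        headers.Add(120, "H3 Overview")
        headers.Add(480, "H3 Details")
        headers.Add(300, "H4 Background")

        ' Enumeration follows the keys, so this prints 120, 300, 480.
        For Each kvp As KeyValuePair(Of Integer, String) In headers
            Console.WriteLine("{0}: {1}", kvp.Key, kvp.Value)
        Next
    End Sub
End Module
```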
Once we have a list of the headers being indexed, we need to generate the table.
If Headers.Count > 0 Then
    Dim TocStr As New StringBuilder
    Dim H3 As Integer = 0
    Dim H4 As Integer = 0
    Dim H5 As Integer = 0
    Dim Index As String = ""
    Dim NewBufferStr As StringBuilder = Nothing
    Dim shift As Integer = 0
    Dim fudge As Integer = 0

    TocStr.AppendLine("<table id=""TOC"">")
    TocStr.Append(" <tr><th id=""TOC_Header"">")
    TocStr.Append("Contents [<span id=""TOC_Toggle"" onclick=""ShowHideToc();"">Hide</span>]")
    TocStr.AppendLine("</th></tr>")
    TocStr.AppendLine(" <tr style=""display:block;"" id=""TOC_Content"">")
    TocStr.AppendLine(" <td><table>")

    For Each kvp As KeyValuePair(Of Integer, HeaderClass) In Headers
        Select Case kvp.Value.Rank
            Case "H3"
                H3 += 1
                H4 = 0
                H5 = 0
                Index = String.Format("{0}", H3)
                fudge = 3 - Index.Length
            Case "H4"
                H4 += 1
                H5 = 0
                Index = String.Format("{0}.{1}", H3, H4)
                fudge = 3 - Index.Length
            Case "H5"
                H5 += 1
                Index = String.Format("{0}.{1}.{2}", H3, H4, H5)
                fudge = 3 - Index.Length
        End Select

        NewBufferStr = New StringBuilder
        NewBufferStr.Append(PageContent.Substring(0, shift + kvp.Key))
        NewBufferStr.AppendFormat(kvp.Value.TagReplacement, Index.Replace(".", "_"))
        NewBufferStr.Append(PageContent.Substring(shift + kvp.Key + kvp.Value.Length))
        shift += (kvp.Value.TagReplacement.Length - fudge - kvp.Value.Tag.ToString.Length)

        TocStr.AppendFormat("<tr><td class=""TOC_level_{0}"">", kvp.Value.Rank)
        TocStr.AppendFormat("<a href=""#{0}_{1}"">{2} {3}</a>", _
            Index.Replace(".", "_"), kvp.Value.Id.Replace(" ", "_"), Index, kvp.Value.Text)
        TocStr.AppendLine("</td></tr>")

        PageContent = NewBufferStr.ToString
    Next

    TocStr.AppendLine(" </table></td>")
    TocStr.AppendLine(" </tr>")
    TocStr.AppendLine("</table>")
    PageContent = StringReplace(PageContent, "{{toc}}", TocStr.ToString)
End If
This code is run only if headers were found. After initializing the variables, it constructs the start of the table of contents in TocStr. Then it goes through every item in Headers. Depending on the type of the current header tag, the index values are reset and the index string is generated. fudge holds the difference in length between the three-character {0} placeholder and the Index string that will replace it.
Once we have these values, we splice out the old header tag and insert the new and improved one. The first part of PageContent is copied over to NewBufferStr, up to the location of the old tag. Then we append the new tag out of HeaderClass.TagReplacement. Because the property is already set up with a {0} placeholder for the index, we can use it to format the append. Then we move to the end of the old tag and copy the rest. shift is updated so we know how far out of sync the main buffer has gotten, then we add the link to the new header tag into TocStr. We reset PageContent to include the new tag, and move on to the next header in the list.
Please note that because TagReplacement is used as a format string, it should not contain any format characters other than the one that HeaderClass puts in. If you absolutely must have curly braces in your header text, you will need to give the header tag a safe id.
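The underlying rule is that String.Format and AppendFormat read a single brace as the start of a placeholder, while a doubled brace is emitted as a literal one. The sketch below is not something the filter does; EscapeBraces and BraceDemo are hypothetical names. It shows how header text containing braces could be made safe before a {0} placeholder is added around it.

```vb
Imports System

Module BraceDemo
    ' Hypothetical helper: doubles braces so that String.Format
    ' emits them literally instead of parsing them as placeholders.
    Function EscapeBraces(ByVal text As String) As String
        Return text.Replace("{", "{{").Replace("}", "}}")
    End Function

    Sub Main()
        Dim headerText As String = "Using {braces}"

        ' Only the {0} added here is a real placeholder.
        Dim formatString As String = _
            "<h3 id=""{0}_intro"">" & EscapeBraces(headerText) & "</h3>"

        ' Prints: <h3 id="1_2_intro">Using {braces}</h3>
        Console.WriteLine(String.Format(formatString, "1_2"))
    End Sub
End Module
```

Without the escaping step, the same call would throw a FormatException, because "{braces}" is not a valid format item.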
It is really important to keep track of shift. Remember, the tag locations were based on the original scan. When we rewrite the tags, they will be longer: even if the tag already has an id, we are still adding the sequence index. shift allows us to keep track of where we are supposed to perform our cut.
After stepping through Headers, we close the table of contents and use the case-insensitive StringReplace to swap out the {{toc}} token with the generated mark-up. The last thing that must be done is to write the modified text out to the next filter in the chain.
Those are features, not bugs
While making some corrections, I noticed a possible complaint that I want to head off.
The index for a level is initialized when that level is passed. That means that when you skip a level, like I did with the last TOC entry in the image at the article's start, you end up with an index value of zero. I'm calling this a feature, as good design means you do not skip levels. The zero will let you quickly find when you've done this.
The current version of the filter will remove the {{toc}} token if no headers are found.
Moving on
The ability to easily intercept and alter a web page before it is delivered to the user opens some interesting possibilities, one of which I will cover in my next article. If you find other uses, I would enjoy hearing about them. And as always, if this article was useful to you, please vote it up.
History
- Version 1 2011-03-09 - Initial release.
- Version 2 2011-03-03 - Corrections and some minor additions.
- Version 3 2011-03-15 - Fixed a bug in the source code: StringContains now returns (i <> -1).
- Version 4 2011-03-29 - Rewrote the filter to handle the situation where the page content does not come in all at once.