I needed to convert HTML to RTF so that I could use the output in another application that supports RTF but not HTML.
My client wanted to print mail-merged documents generated on the fly from a web application. The report-writing component only supports RTF, while the web interface editors only produce HTML.
I looked at some possible solutions:
- Use Word - Not a viable proposition, as it means installing Office on a web server, using OLE automation and hoping that the web server does not run out of memory. Costs around $250.
- Find a DLL that can be added to the Bin folder - I only found a couple of solutions, and they were each about $299 for a developer licence and 1 server licence.
- Change to Cute-Editor, as this can save in RTF format as well as HTML format. Costs $250.
Telling clients that their software requirements are going to cost them more money is not really a great option as you risk losing business and recommendations, so I had to rethink this from the ground up.
After a bit more research, I started to see which parts of an RTF document are fixed and which change. I also referred to Wikipedia, which gives a simple example of the RTF format:
{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
This is some {\b bold} text.\par
}
The above RTF text is simplified and breaks down as follows:
\rtf1
= Tells you the document is RTF
\ansi
= Uses the ANSI character set
{\fonttbl\f0\fswiss Helvetica;}
= Defines font 0 as swiss (sans-serif) Helvetica, if available
\f0
= Use font 0
\pard
= Start paragraph marker (resets paragraph formatting to the defaults)
{\b bold}
= Only bold this word
\par
= End paragraph marker
- The first { and last } are the delimiters for the whole document.
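To make the breakdown concrete, here is a minimal sketch that assembles the same document shell around a piece of body text. BuildMinimalRtf is a hypothetical helper name, not part of the converter shown later, and it assumes the body text is already valid RTF (any \, { or } in plain text has been escaped).

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module MinimalRtfSketch
    ' Hypothetical helper: wraps RTF body text in the shell from the
    ' Wikipedia example (header, font table, one paragraph, closing brace).
    Function BuildMinimalRtf(ByVal body As String) As String
        Return "{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard" & vbCrLf &
               body & "\par" & vbCrLf &
               "}"
    End Function

    Sub Main()
        ' Prints the three-line example document shown above.
        Console.WriteLine(BuildMinimalRtf("This is some {\b bold} text."))
    End Sub
End Module
```

Saving the printed text with an .rtf extension should give a file that WordPad or Word can open.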
After this, I decided to play around with RTF and see the results of each change.
I started with a simple "Hello World" using WordPad, which I then opened in Notepad (RTF is like HTML in that it is saved in a human-readable format). This ends up looking something like this:
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}}
{\*\generator Riched20 6.3.9600}\viewkind4\uc1
\pard\sa200\sl276\slmult1\qj\f0\fs22\lang9 Hello World\par
}
I then changed the text to bold, italic and underline and viewed the changes.
I noted that there were extra items like \viewkind4\uc1, \sa200\sl276\slmult1\qj and \fs22\lang9.
Going back to research these extras, I found that some of them are important and some less so.
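- {\*\generator Riched20 6.3.9600} - This is effectively a comment entry; it names the program that wrote the file
- \viewkind4 - Opens the document in Normal view
- \uc1 - One fallback character follows each \u Unicode escape
- \sa200 - Space after the paragraph, in twips (200 twips = 10pt)
- \sl276 - Line spacing value, read together with \slmult1
- \slmult1 - Treat \sl as a multiple of single line spacing (276 / 240 = 1.15 lines)
- \qj - Justify the text (\ql, \qr, \qc and \qd are the other available options)
- \fs22 - Font size 11pt (the value is the point size * 2)
- \lang9 - Language of the text (9 is English)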
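The unit arithmetic behind these values can be sketched as follows. The function names are hypothetical and only illustrate the conversions: \fs takes half-points, \sa takes twips (20 per point), and \sl with \slmult1 takes 240ths of a single line.

```vbnet
Imports System

Module RtfUnitsSketch
    ' \fsN: N is the font size in half-points.
    Function FontSizeWord(ByVal points As Double) As String
        Return "\fs" & CInt(Math.Round(points * 2)).ToString()
    End Function

    ' \saN: N is the space after the paragraph in twips (1 point = 20 twips).
    Function SpaceAfterWord(ByVal points As Double) As String
        Return "\sa" & CInt(Math.Round(points * 20)).ToString()
    End Function

    ' \slN\slmult1: N is the line spacing in 240ths of a single line.
    Function LineSpacingWord(ByVal lines As Double) As String
        Return "\sl" & CInt(Math.Round(lines * 240)).ToString() & "\slmult1"
    End Function

    Sub Main()
        ' Rebuilds the paragraph settings from the WordPad output:
        ' 10pt space after, 1.15 line spacing, justified, font 0 at 11pt.
        Console.WriteLine(SpaceAfterWord(10) & LineSpacingWord(1.15) & "\qj\f0" & FontSizeWord(11))
    End Sub
End Module
```

Running this prints \sa200\sl276\slmult1\qj\f0\fs22, which matches the paragraph settings WordPad wrote for "Hello World".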
So at this point, I had the basics for what the client required, and I started to code this. I only needed to support bold, italic and underlined text for this project, so that is what I concentrated on at the beginning. I also wanted something that would not fall over if it met an HTML element that was not supported.
The Code
Const RTF_START = "{\rtf1\ansi\ansicpg1252\deff0\deflang1033"
Const RTF_END = "}"
Const FONT_TABLE_START = "{\fonttbl"
Const FONT_TABLE_END = "}"
Const LINE_BREAK = "\par" & vbCrLf
Const NEW_PARAGRAPH = "\pard"
Const BOLD_START = "\b "
Const BOLD_END = "\b0 "
Const UNDERLINE_START = "\ul "
Const UNDERLINE_END = "\ul0 "
Const ITALIC_START = "\i "
Const ITALIC_END = "\i0 "
Const DOCUMENT_START = "\viewkind4\uc1"
' Needs "Imports System" and "Imports System.Web" at the top of the file.

' Returns the RTF text for one character code (0 to 65535).
Private Function RtfChar(ByVal intCode As Integer) As String
    If intCode = 92 OrElse intCode = 123 OrElse intCode = 125 Then
        ' \ { and } are RTF syntax, so they must be escaped with a backslash.
        Return "\" & ChrW(intCode)
    ElseIf intCode <= 127 Then
        Return ChrW(intCode)
    ElseIf intCode <= 255 Then
        ' Hex escape, read in code page 1252 (see \ansicpg1252 in RTF_START).
        Return "\'" & intCode.ToString("x2")
    Else
        ' \uN takes a signed 16-bit decimal value; "?" is the fallback character.
        If intCode > 32767 Then intCode -= 65536
        Return "\u" & intCode.ToString() & "?"
    End If
End Function

Private Function RenderHTML(ByVal strInput As String) As String
    Const IGNORE_CASE As StringComparison = StringComparison.OrdinalIgnoreCase
    Dim intPos As Integer
    Dim intCode As Integer
    Dim strText As String
    strText = RTF_START + FONT_TABLE_START + "{\f0\fnil\fcharset0 Arial;}" + FONT_TABLE_END + vbCrLf
    strText += "\sa200\sl276\slmult1\qj\f0\fs22\lang9 "
    ' Consume the input from the front, one tag, entity or character at a time.
    Do While strInput <> ""
        If Mid(strInput, 1, 1) = "<" Then
            If strInput.StartsWith("<p>", IGNORE_CASE) Then
                strInput = Mid(strInput, 4)
            ElseIf strInput.StartsWith("</p>", IGNORE_CASE) Then
                strText += LINE_BREAK
                strInput = Mid(strInput, 5)
            ElseIf strInput.StartsWith("<b>", IGNORE_CASE) Then
                strText += BOLD_START
                strInput = Mid(strInput, 4)
            ElseIf strInput.StartsWith("</b>", IGNORE_CASE) Then
                strText += BOLD_END
                strInput = Mid(strInput, 5)
            ElseIf strInput.StartsWith("<i>", IGNORE_CASE) Then
                strText += ITALIC_START
                strInput = Mid(strInput, 4)
            ElseIf strInput.StartsWith("</i>", IGNORE_CASE) Then
                strText += ITALIC_END
                strInput = Mid(strInput, 5)
            ElseIf strInput.StartsWith("<strong>", IGNORE_CASE) Then
                strText += BOLD_START
                strInput = Mid(strInput, 9)
            ElseIf strInput.StartsWith("</strong>", IGNORE_CASE) Then
                strText += BOLD_END
                strInput = Mid(strInput, 10)
            ElseIf strInput.StartsWith("<em>", IGNORE_CASE) Then
                strText += ITALIC_START
                strInput = Mid(strInput, 5)
            ElseIf strInput.StartsWith("</em>", IGNORE_CASE) Then
                strText += ITALIC_END
                strInput = Mid(strInput, 6)
            ElseIf strInput.StartsWith("<u>", IGNORE_CASE) Then
                strText += UNDERLINE_START
                strInput = Mid(strInput, 4)
            ElseIf strInput.StartsWith("</u>", IGNORE_CASE) Then
                strText += UNDERLINE_END
                strInput = Mid(strInput, 5)
            Else
                intPos = InStr(strInput, ">")
                If intPos = 0 Then
                    ' A stray "<" with no closing ">": emit it as text so the loop always advances.
                    strText += "<"
                    strInput = Mid(strInput, 2)
                Else
                    ' Report the unsupported tag (HTML-encoded so the browser shows it) and skip it.
                    HttpContext.Current.Response.Write("UNSUPPORTED : " + HttpUtility.HtmlEncode(Mid(strInput, 1, intPos)) + "<br/>")
                    strInput = Mid(strInput, intPos + 1)
                End If
            End If
        ElseIf Mid(strInput, 1, 1) = "&" Then
            intPos = InStr(strInput, ";")
            If Mid(strInput, 1, 6) = "&nbsp;" Then
                strText += "\~" ' RTF non-breaking space
                strInput = Mid(strInput, 7)
            ElseIf Mid(strInput, 1, 5) = "&amp;" Then
                strText += "&"
                strInput = Mid(strInput, 6)
            ElseIf Mid(strInput, 1, 4) = "&lt;" Then
                strText += "<"
                strInput = Mid(strInput, 5)
            ElseIf Mid(strInput, 1, 4) = "&gt;" Then
                strText += ">"
                strInput = Mid(strInput, 5)
            ElseIf Mid(strInput, 1, 6) = "&copy;" Then
                strText += "\'a9"
                strInput = Mid(strInput, 7)
            ElseIf Mid(strInput, 1, 5) = "&reg;" Then
                strText += "\'ae"
                strInput = Mid(strInput, 6)
            ElseIf Mid(strInput, 1, 7) = "&trade;" Then
                strText += "\'99"
                strInput = Mid(strInput, 8)
            ElseIf Mid(strInput, 1, 7) = "&pound;" Then
                strText += "\'a3"
                strInput = Mid(strInput, 8)
            ElseIf Mid(strInput, 1, 6) = "&euro;" Then
                strText += "\'80"
                strInput = Mid(strInput, 7)
            ElseIf Mid(strInput, 1, 2) = "&#" AndAlso intPos > 3 AndAlso
                   Integer.TryParse(Mid(strInput, 3, intPos - 3), intCode) AndAlso
                   intCode >= 0 AndAlso intCode <= 65535 Then
                ' Decimal numeric entity such as &#169;
                strText += RtfChar(intCode)
                strInput = Mid(strInput, intPos + 1)
            ElseIf intPos > 1 AndAlso intPos <= 10 AndAlso Not Mid(strInput, 1, intPos).Contains(" ") Then
                ' Heuristic: a short "&...;" run without spaces is an entity we do not support.
                HttpContext.Current.Response.Write("UNSUPPORTED : " + HttpUtility.HtmlEncode(Mid(strInput, 1, intPos)) + "<br/>")
                strInput = Mid(strInput, intPos + 1)
            Else
                ' A bare ampersand: treat it as ordinary text.
                strText += "&"
                strInput = Mid(strInput, 2)
            End If
        Else
            ' Ordinary text: copy one character, escaping it for RTF where needed.
            strText += RtfChar(AscW(Mid(strInput, 1, 1)))
            strInput = Mid(strInput, 2)
        End If
    Loop
    strText += RTF_END
    Return strText
End Function
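As an aside, the tag handling could also be driven by a lookup table rather than a chain of ElseIf tests. The following is only a sketch of that alternative, not the code used in the project: TagMap and ConvertTags are hypothetical names, unknown tags are skipped silently, and entities and RTF escaping are left out to keep it short.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module TagTableSketch
    ' Hypothetical lookup table: lower-case HTML tag -> RTF control word.
    Private ReadOnly TagMap As New Dictionary(Of String, String) From {
        {"<p>", ""}, {"</p>", "\par "},
        {"<b>", "\b "}, {"</b>", "\b0 "},
        {"<strong>", "\b "}, {"</strong>", "\b0 "},
        {"<i>", "\i "}, {"</i>", "\i0 "},
        {"<em>", "\i "}, {"</em>", "\i0 "},
        {"<u>", "\ul "}, {"</u>", "\ul0 "}
    }

    ' Converts only the tags; the result is RTF body text without the document shell.
    Function ConvertTags(ByVal html As String) As String
        Dim sb As New StringBuilder()
        Dim i As Integer = 0
        While i < html.Length
            ' Look for the end of a tag only when we are standing on "<".
            Dim closeAt As Integer = If(html(i) = "<"c, html.IndexOf(">"c, i), -1)
            If closeAt >= 0 Then
                Dim tag As String = html.Substring(i, closeAt - i + 1).ToLowerInvariant()
                Dim rtfWord As String = Nothing
                If TagMap.TryGetValue(tag, rtfWord) Then sb.Append(rtfWord)
                i = closeAt + 1 ' Unknown tags are skipped in this sketch.
            Else
                sb.Append(html(i))
                i += 1
            End If
        End While
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(ConvertTags("<p>Hello <B>World</B></p>"))
    End Sub
End Module
```

With a table like this, supporting another simple tag would be a one-line change, although tags that carry attributes would still need separate handling.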
Current Status
As you can tell, there are some shortcomings:
- No additional font support
- No colour support
- Table support is lacking
I will start to look at these items when I get some time and there is enough interest from other people. Most of them are embedded in style sheets or style tags, so with the current code model I would have to parse the HTML multiple times: once to build the font and colour tables, and then again to get the actual text. I am working on a more elegant solution to this. There are other complications too: a font can be a list of options (the style attribute could be "Arial, Verdana, Helvetica, sans-serif"), and colours can be #RRGGBB or named. When I looked at the time scale for just these changes, it became obvious that additional steps would be needed.
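As a rough illustration of two of those extra steps, the sketch below turns a #RRGGBB value into an RTF colour table entry and picks the first name from a font-family list. ColourTableEntry and FirstFontName are hypothetical helpers, not part of the current code, and named colours such as "red" are not handled here.

```vbnet
Imports System
Imports System.Globalization

Module StyleSketch
    ' "#RRGGBB" -> "\redR\greenG\blueB;", one entry for an RTF {\colortbl ...} group.
    Function ColourTableEntry(ByVal hexColour As String) As String
        Dim h As String = hexColour.Trim().TrimStart("#"c)
        If h.Length <> 6 Then Throw New ArgumentException("Expected a colour in #RRGGBB form")
        Dim r As Integer = Integer.Parse(h.Substring(0, 2), NumberStyles.HexNumber, CultureInfo.InvariantCulture)
        Dim g As Integer = Integer.Parse(h.Substring(2, 2), NumberStyles.HexNumber, CultureInfo.InvariantCulture)
        Dim b As Integer = Integer.Parse(h.Substring(4, 2), NumberStyles.HexNumber, CultureInfo.InvariantCulture)
        Return "\red" & r.ToString() & "\green" & g.ToString() & "\blue" & b.ToString() & ";"
    End Function

    ' Takes the first option from a CSS font-family list and strips any quotes.
    Function FirstFontName(ByVal fontFamily As String) As String
        Dim first As String = fontFamily.Split(","c)(0)
        Return first.Trim().Trim(""""c, "'"c)
    End Function

    Sub Main()
        Console.WriteLine("{\colortbl;" & ColourTableEntry("#FF0000") & "}")
        Console.WriteLine(FirstFontName("Arial, Verdana, Helvetica, sans-serif"))
    End Sub
End Module
```

The first line printed is {\colortbl;\red255\green0\blue0;} and the second is Arial. Text would then select the colour with \cf1, since the empty entry before the first semicolon is the default colour at index 0.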
I hope that this code helps someone else, and if you want to contribute to this tip in a constructive way, please feel free to get in touch with any suggestions.