HTML to RTF - VB.NET

22 Jul 2015
Work in progress, but a simple HTML to RTF converter

Introduction

I needed to convert HTML to RTF because I wanted to use the RTF output in another application that did not support HTML but did support RTF.

Background

My client wanted to print mail-merged documents that were generated on the fly from a web application. The report-writing software component only supports RTF, whereas the web interface editors only produce HTML.

I looked at some possible solutions:

  1. Use Word - Not a viable proposition, as that means installing Office on a web server, using OLE automation and hoping that the web server does not end up running out of memory. Costs around $250.
  2. Find a DLL that can be added to the Bin folder - I only found a couple of solutions, and each was about $299 for a developer licence and 1 server licence.
  3. Change to Cute-Editor, as this can save in RTF format as well as HTML format. Costs $250.

Telling clients that their software requirements are going to cost them more money is not really a great option as you risk losing business and recommendations, so I had to rethink this from the ground up.

After a bit more research, I started to see which parts of the format were fixed and which changed. I also referred to Wikipedia, which gives a simple example of the RTF format:

{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
This is some {\b bold} text.\par
}

The above RTF text is simplified and is broken down as follows:

  • \rtf1 = Identifies the document as RTF (version 1)
  • \ansi = Uses the ANSI character set
  • {\fonttbl\f0\fswiss Helvetica;} = Defines font 0 as swiss (sans-serif) Helvetica, if available
  • \f0 = Use font 0
  • \pard = Reset paragraph formatting to its defaults, which effectively starts a fresh paragraph
  • {\b bold} = Only bold this word
  • \par = End paragraph marker
  • The first { and last } delimit the outermost group, which holds the whole document.
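
As a minimal sketch of the breakdown above, the same header can be assembled from its parts in VB.NET. The module and the function name BuildMinimalRtf are made up for illustration, and the body text is assumed to be valid RTF already (nothing is escaped here).

```vbnet
Imports System
Imports System.Text

Module MinimalRtfSketch
  ' Hypothetical helper: wraps a piece of RTF body text in the
  ' minimal header from the Wikipedia example.
  Function BuildMinimalRtf(ByVal strBody As String) As String
    Dim sb As New StringBuilder()
    sb.Append("{\rtf1\ansi")                      ' open the outer group: RTF version 1, ANSI
    sb.Append("{\fonttbl\f0\fswiss Helvetica;}")  ' font table defining font 0
    sb.Append("\f0\pard ")                        ' select font 0, default paragraph settings
    sb.Append(strBody)                            ' body is assumed to be RTF-safe already
    sb.Append("\par")                             ' end the paragraph
    sb.Append("}")                                ' close the outer group
    Return sb.ToString()
  End Function

  Sub Main()
    Console.WriteLine(BuildMinimalRtf("This is some {\b bold} text."))
  End Sub
End Module
```

Saving the printed string to a file with an .rtf extension and opening it in WordPad is a quick way to check each change.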

After this, I decided that I could play around with RTF and see the result of each change.

I started with a simple "Hello World" in WordPad, which I then opened in Notepad (RTF is like HTML in that it is saved in a human-readable format). This ends up looking something like this:

{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}}
{\*\generator Riched20 6.3.9600}\viewkind4\uc1
\pard\sa200\sl276\slmult1\qj\f0\fs22\lang9 Hello World\par
}

I then changed the text to bold, italic and underline and viewed the changes.

I noted that there were extra items like \viewkind4\uc1, \sa200\sl276\slmult1\qj and \fs22\lang9.

Going back to research these extras, I found that some of them are important and some not so much.

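  • \* - Marks a destination (here \generator) that a reader may skip if it does not recognise it
  • \viewkind4 - Opens the document in Normal view
  • \uc1 - One fallback character follows each \u Unicode character, for readers that cannot display Unicode
  • \sa200 - Space after the paragraph, in twips (200 twips = 10 points)
  • \sl276 - Line spacing; with \slmult1 the value is in 240ths of a line, so 276 is 1.15 lines
  • \slmult1 - Treat the \sl value as a multiple of single line spacing
  • \qj - Justify the text (\ql, \qr, \qc and \qd are the other available options)
  • \fs22 - Font size 11, as the value is in half-points (Size * 2)
  • \lang9 - Language ID for the text that follows (9 is English)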
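
The numbers in these control words follow simple unit rules: \fs is in half-points, \sa is in twips (twentieths of a point) and, when \slmult1 is present, \sl is in 240ths of a single line. The sketch below is only an illustration of that arithmetic; BuildParagraphPrefix and its parameters are hypothetical names, not part of the converter.

```vbnet
Imports System

Module RtfUnitsSketch
  ' Hypothetical helper: builds a paragraph prefix from
  ' human-friendly units (points and a line-spacing multiple).
  Function BuildParagraphPrefix(ByVal dblFontPoints As Double,
                                ByVal dblSpaceAfterPoints As Double,
                                ByVal dblLineMultiple As Double) As String
    ' \fs is measured in half-points
    Dim intFs As Integer = CInt(Math.Round(dblFontPoints * 2))
    ' \sa is measured in twips (1/20 of a point)
    Dim intSa As Integer = CInt(Math.Round(dblSpaceAfterPoints * 20))
    ' with \slmult1, 240 means single line spacing
    Dim intSl As Integer = CInt(Math.Round(dblLineMultiple * 240))
    Return "\sa" & intSa & "\sl" & intSl & "\slmult1\qj\f0\fs" & intFs & " "
  End Function

  Sub Main()
    ' 11 pt text, 10 pt after each paragraph, 1.15 line spacing
    Console.WriteLine(BuildParagraphPrefix(11, 10, 1.15))
    ' prints \sa200\sl276\slmult1\qj\f0\fs22 (with a trailing space)
  End Sub
End Module
```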

Progression

So at this point, I had the basics for what the client required, and I started to code this. I only needed to support bold, italic and underlined text for this project, so that is what I concentrated on at the beginning. I also wanted something that would not fall over if it met an HTML element that was not supported.

The Code

' =========================================================================
' = Define constants that are needed to build the outline RTF file format =
' =========================================================================

' Define Start and End strings for RTF Formatting
Const RTF_START = "{\rtf1\ansi\ansicpg1252\deff0\deflang1033" 
Const RTF_END = "}"
' Define Start and End strings for FONT tables
Const FONT_TABLE_START = "{\fonttbl"
Const FONT_TABLE_END = "}"
' Define Start and End strings for COLOR tables (Next version maybe)
'Const COLOR_TABLE_START = "{\colortbl "
'Const COLOR_TABLE_END = ";}"
' Define End Paragraph constant
Const LINE_BREAK = "\par" & vbCrLf
' Define Indent constant - No HTML equivalent ;-)
'Const RTF_TAB = "\tab "
' Define New Paragraph constant
Const NEW_PARAGRAPH = "\pard"
' Define Bold constants (a start/end pair, as bold usually spans more than one character)
Const BOLD_START = "\b "
Const BOLD_END = "\b0 "
' Define Underline constants
Const UNDERLINE_START = "\ul "
Const UNDERLINE_END = "\ul0 "
' Define Italic constants
Const ITALIC_START = "\i "
Const ITALIC_END = "\i0 "
' Define view kind and Unicode fallback length
Const DOCUMENT_START = "\viewkind4\uc1" 
' ======================================================
' = RenderHTML                                         =
' =                                                    =
' = Input : HTML encoded string (Content of body only) =
' =                                                    =
' = Output : RTF encoded string                        =
' =                                                    = 
' ======================================================
Private Function RenderHTML(ByVal strInput As String) As String
  Dim intPos As Integer
  Dim strText As String

  ' Create initial header strings for the RTF format - Set font to Arial
  strText = RTF_START + FONT_TABLE_START + "{\f0\fnil\fcharset0 Arial;}" + FONT_TABLE_END + vbCrLf
  ' Add view and charset
  strText += DOCUMENT_START
  ' Start the document
  strText += NEW_PARAGRAPH
  ' Set the font to Arial 11 and Justify the text
  strText += "\sa200\sl276\slmult1\qj\f0\fs22\lang9 "

  ' Start Processing the HTML string
  Do
    ' Check for < in HTML string
    If Mid(strInput, 1, 1) = "<" Then
      ' Look for different tags and move input to next element in HTML
      If Mid(strInput, 1, 3) = "<p>" Or Mid(strInput, 1, 3) = "<P>" Then
        strInput = Mid(strInput, 4)
      ElseIf Mid(strInput, 1, 4) = "</p>" Or Mid(strInput, 1, 4) = "</P>" Then
        strText += LINE_BREAK
        strInput = Mid(strInput, 5)
      ElseIf Mid(strInput, 1, 3) = "<b>" Or Mid(strInput, 1, 3) = "<B>" Then
        strText += BOLD_START
        strInput = Mid(strInput, 4)
      ElseIf Mid(strInput, 1, 4) = "</b>" Or Mid(strInput, 1, 4) = "</B>" Then
        strText += BOLD_END
        strInput = Mid(strInput, 5)
      ElseIf Mid(strInput, 1, 3) = "<i>" Or Mid(strInput, 1, 3) = "<I>" Then
        strText += ITALIC_START
        strInput = Mid(strInput, 4)
      ElseIf Mid(strInput, 1, 4) = "</i>" Or Mid(strInput, 1, 4) = "</I>" Then
        strText += ITALIC_END
        strInput = Mid(strInput, 5)
      ElseIf Mid(strInput, 1, 8) = "<strong>" Or Mid(strInput, 1, 8) = "<STRONG>" Then
        strText += BOLD_START
        strInput = Mid(strInput, 9)
      ElseIf Mid(strInput, 1, 9) = "</strong>" Or Mid(strInput, 1, 9) = "</STRONG>" Then
        strText += BOLD_END
        strInput = Mid(strInput, 10)
      ElseIf Mid(strInput, 1, 4) = "<em>" Or Mid(strInput, 1, 4) = "<EM>" Then
        strText += ITALIC_START
        strInput = Mid(strInput, 5)
      ElseIf Mid(strInput, 1, 5) = "</em>" Or Mid(strInput, 1, 5) = "</EM>" Then
        strText += ITALIC_END
        strInput = Mid(strInput, 6)
      ElseIf Mid(strInput, 1, 3) = "<u>" Or Mid(strInput, 1, 3) = "<U>" Then
        strText += UNDERLINE_START
        strInput = Mid(strInput, 4)
      ElseIf Mid(strInput, 1, 4) = "</u>" Or Mid(strInput, 1, 4) = "</U>" Then
        strText += UNDERLINE_END
        strInput = Mid(strInput, 5)
      Else
        ' ============================================================================
        ' = Catch all remaining HTML and show on the browser the unsupported element = 
        ' ============================================================================
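        intPos = InStr(strInput, ">")
        ' If there is no closing ">", consume the rest of the input so the loop cannot stall
        If intPos = 0 Then intPos = Len(strInput)
        ' HTML-encode the tag so that it is displayed rather than interpreted by the browser
        HttpContext.Current.Response.Write("UNSUPPORTED : " + HttpUtility.HtmlEncode(Mid(strInput, 1, intPos)) + "<br/>")
        strInput = Mid(strInput, intPos + 1)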
      End If
    Else
      ' Check for & in the HTML input and replace
      If Mid(strInput, 1, 1) = "&" Then
        If Mid(strInput, 1, 6) = "&nbsp;" Then
          strText += " "
          strInput = Mid(strInput, 7)
        ElseIf Mid(strInput, 1, 5) = "&amp;" Then
          strText += "&"
          strInput = Mid(strInput, 6)
        ElseIf Mid(strInput, 1, 4) = "&lt;" Then
          strText += "<"
          strInput = Mid(strInput, 5)
        ElseIf Mid(strInput, 1, 4) = "&gt;" Then
          strText += ">"
          strInput = Mid(strInput, 5)
        ElseIf Mid(strInput, 1, 6) = "&copy;" Then
          strText += "\'a9"
          strInput = Mid(strInput, 7)
        ElseIf Mid(strInput, 1, 5) = "&reg;" Then
          strText += "\'ae"
          strInput = Mid(strInput, 6)
        ElseIf Mid(strInput, 1, 7) = "&trade;" Then
          strText += "\'99"
          strInput = Mid(strInput, 8)
        ElseIf Mid(strInput, 1, 7) = "&pound;" Then
          strText += "\'a3"
          strInput = Mid(strInput, 8)
        ElseIf Mid(strInput, 1, 6) = "&euro;" Then
          strText += "\'80"
          strInput = Mid(strInput, 7)
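        ElseIf Mid(strInput, 1, 2) = "&#" Then
          ' Handle decimal numeric entities such as &#169;
          ' (assumes decimal digits and a terminating ";" - hex forms like &#xA9; are not handled)
          intPos = InStr(strInput, ";")
          Dim intCode As Integer = CType(Mid(strInput, 3, intPos - 3), Integer)
          If intCode <= 127 Then
            ' Escape \ { } as they are special characters in RTF
            If intCode = 92 OrElse intCode = 123 OrElse intCode = 125 Then strText += "\"
            strText += Chr(intCode)
          ElseIf intCode <= 255 Then
            strText += "\'" + Hex(intCode).ToLower()
          Else
            ' \u takes a signed 16-bit decimal value followed by a fallback character
            ' (code points above 65535 would need surrogate pairs and are not handled)
            If intCode > 32767 Then intCode -= 65536
            strText += "\u" + intCode.ToString() + "?"
          End If
          strInput = Mid(strInput, intPos + 1)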
        Else
          ' Catch all remaining entities and show the unsupported one on the browser
          intPos = InStr(strInput, ";")
          If intPos = 0 Then
            ' A bare "&" with no terminating ";" - copy it through as literal text
            strText += "&"
            strInput = Mid(strInput, 2)
          Else
            HttpContext.Current.Response.Write("UNSUPPORTED : " + HttpUtility.HtmlEncode(Mid(strInput, 1, intPos)) + "<br/>")
            strInput = Mid(strInput, intPos + 1)
          End If
        End If
      Else
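        Dim strChar As String = Mid(strInput, 1, 1)
        ' Escape the characters that RTF treats as special: \ { }
        If strChar = "\" OrElse strChar = "{" OrElse strChar = "}" Then strText += "\"
        strText += strChar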
        strInput = Mid(strInput, 2)
      End If
    End If
  Loop Until strInput = ""
  strText += RTF_END
  Return strText
End Function
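
One possible way to shorten the tag handling above is a lookup table instead of the If/ElseIf chain. This is only a sketch of an alternative, not the code used in the project; TagMap and TryTranslateTag are hypothetical names. The case-insensitive comparer removes the need to test the upper-case and lower-case forms of each tag separately.

```vbnet
Imports System
Imports System.Collections.Generic

Module TagTableSketch
  ' Maps each supported tag to the RTF it emits. The comparer makes
  ' lookups case-insensitive, so <B>, <b>, </P> and </p> all match.
  Private ReadOnly TagMap As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase) From {
    {"<p>", ""}, {"</p>", "\par" & vbCrLf},
    {"<b>", "\b "}, {"</b>", "\b0 "},
    {"<strong>", "\b "}, {"</strong>", "\b0 "},
    {"<i>", "\i "}, {"</i>", "\i0 "},
    {"<em>", "\i "}, {"</em>", "\i0 "},
    {"<u>", "\ul "}, {"</u>", "\ul0 "}
  }

  ' Hypothetical helper: returns True and fills strRtf when the tag is supported.
  Function TryTranslateTag(ByVal strTag As String, ByRef strRtf As String) As Boolean
    Return TagMap.TryGetValue(strTag, strRtf)
  End Function

  Sub Main()
    Dim strRtf As String = Nothing
    ' A supported tag in upper case still matches
    If TryTranslateTag("</STRONG>", strRtf) Then Console.WriteLine(strRtf)
    ' An unsupported tag is reported as False rather than raising an error
    Console.WriteLine(TryTranslateTag("<div>", strRtf))
  End Sub
End Module
```

The caller would still need to cut the tag out of the input (everything up to the next ">") before looking it up.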

Current Status

As you can tell, there are some shortcomings:

  • No additional font support
  • No colour support
  • Table support is lacking

I will start to look at these items when I get some time and there is enough interest from other people. Most of this information is embedded in style sheets or style attributes, so with the current code model I would have to parse the HTML multiple times: once to build the font and colour tables, and then again to get the actual text. I am working on a more elegant solution to these issues. Fonts can also be given as a list of options (the style attribute could be "Arial, Verdana, Helvetica, sans-serif"), and colours can be #RRGGBB or named. When I looked at the time scale for just these changes, it became obvious that additional steps would be needed.
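
To give an idea of the style parsing involved, here is a minimal sketch of two of the steps: turning a #RRGGBB colour into an RTF colour table entry (which has the form \redN\greenN\blueN;) and picking the first family from a font list. HexColourToRtf and FirstFontFamily are hypothetical helpers, named colours are not handled, and none of this is in the converter yet.

```vbnet
Imports System
Imports System.Globalization

Module StyleParsingSketch
  ' Hypothetical helper: turns "#RRGGBB" into an RTF colour table entry.
  ' Named colours (e.g. "red") are not handled here.
  Function HexColourToRtf(ByVal strColour As String) As String
    Dim strHex As String = strColour.Trim().TrimStart("#"c)
    If strHex.Length <> 6 Then Throw New ArgumentException("Expected #RRGGBB", "strColour")
    Dim intRed As Integer = Integer.Parse(strHex.Substring(0, 2), NumberStyles.HexNumber)
    Dim intGreen As Integer = Integer.Parse(strHex.Substring(2, 2), NumberStyles.HexNumber)
    Dim intBlue As Integer = Integer.Parse(strHex.Substring(4, 2), NumberStyles.HexNumber)
    Return "\red" & intRed & "\green" & intGreen & "\blue" & intBlue & ";"
  End Function

  ' Hypothetical helper: picks the first family from a CSS font list
  ' and strips any surrounding quotes.
  Function FirstFontFamily(ByVal strFontList As String) As String
    Dim strFirst As String = strFontList.Split(","c)(0)
    Return strFirst.Trim().Trim(""""c, "'"c)
  End Function

  Sub Main()
    Console.WriteLine(HexColourToRtf("#FF8000"))   ' prints \red255\green128\blue0;
    Console.WriteLine(FirstFontFamily("Arial, Verdana, Helvetica, sans-serif"))   ' prints Arial
  End Sub
End Module
```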

I hope that this code helps someone else, and if you want to contribute to this tip in a constructive way, please feel free to get in touch with any suggestions.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
