
Removing HTML from the text in ASP

27 Sep 2000
Exploring the options for removing HTML tags from text in ASP.

Why remove HTML tags?

There are many reasons why you, as a developer, might want to remove HTML tags from text. The most common situation is displaying text on a web page when that text was submitted by an unknown user or came from some other source that you have no control over. You have no idea what the text contains: it could include a damaging script, or HTML formatting that completely messes up the look of your site. Perhaps you just don't want any HTML tags in the text because of some application restriction. Or you might want to allow a few simple text formatting tags but stop users from adding links and images. Whether you have a good reason or you just want HTML out of your text because you are a member of the "HTML Hatred Club", you have to find a way to get those tags out. This article looks at the options you have for removing HTML tags from text in ASP.

First Option - Disable HTML

The first and probably the easiest solution is to disable the HTML tags in the text without removing them. You can do it with the Replace() function. For example, to disable all the SCRIPT tags (the final argument, 1, makes the comparison case-insensitive) you could do this:

strText = Replace(strText, "<script", "&lt;script", 1, -1, 1)
or to make sure that all HTML tags are disabled:
strText = Replace(strText, "<", "&lt;")

No opening brackets - no valid HTML tags - no problem. Right?

It is a good (quick) security measure to prevent people from embedding damaging client-side scripts within the text they submit, but it is hardly a user-friendly feature.

The problem with this approach is that all the HTML tags are now displayed along with the rest of the text, which makes it very hard to read. It's like showing the HTML source to the user - not a very nice thing to do.
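To make that readability problem concrete, here is a minimal sketch of the same idea in JavaScript, the other language used in this article. The function name `disableHTML` and the sample string are made up for illustration. Unlike the one-line Replace() call above, this sketch also escapes "&" and ">".

```javascript
// Hypothetical helper: neutralize markup by escaping it instead of removing it.
function disableHTML(strText) {
    return strText
        .replace(/&/g, "&amp;")  // escape ampersands first so the entities added below are not escaped again
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

var sample = 'Hello <b>world</b><script>alert("hi")</script>';
console.log(disableHTML(sample));
// Hello &lt;b&gt;world&lt;/b&gt;&lt;script&gt;alert("hi")&lt;/script&gt;
```

A browser renders that output as the literal tags, so the script can no longer run, but the reader sees all the markup.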

Second Option - Use the brackets

How do we make HTML tags disappear from the text? Well, we can just remove them: take everything between the opening bracket "<" and the closing bracket ">" of each HTML tag and delete it. It sounds easy ...

Well, it is easier said than done, especially in VBScript. :-)

People who code in Perl or JavaScript can tell you that it is a piece of cake, and they are absolutely right. For example, a JavaScript function that removes everything between the brackets could look like this:

function RemoveHTML( strText )
{
	var regEx = /<[^>]*>/g;
	return strText.replace(regEx, "");
}

For those of you who don't know what "/<[^>]*>/g" means - it's called a regular expression. "Regular expressions are patterns used to match character combinations in strings." You can learn more about them by following this link: http://developer.netscape.com/docs/manuals/js/client/jsguide/regexp.htm.
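As a rough guide to reading that pattern, the sketch below breaks it into its parts and lists what it matches in a sample string. The sample text and variable names are invented for the example.

```javascript
// /<[^>]*>/g reads as:
//   <      a literal opening bracket
//   [^>]*  any run of characters that are not a closing bracket
//   >      a literal closing bracket
//   g      "global": find every match, not just the first one
var regEx = /<[^>]*>/g;
var sample = '<p>Some <b>bold</b> text</p>';

var matches = sample.match(regEx);
console.log(matches);                    // [ '<p>', '<b>', '</b>', '</p>' ]
console.log(sample.replace(regEx, ""));  // Some bold text
```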

Back in the VBScript world, those of us who run Scripting Engine 5.0 or later (you can check your version by calling the ScriptEngineMajorVersion and ScriptEngineMinorVersion functions) can use the RegExp object as well. The RemoveHTML function could look like this:

Function RemoveHTML( strText )
	Dim RegEx

	Set RegEx = New RegExp

	RegEx.Pattern = "<[^>]*>"
	RegEx.Global = True

	RemoveHTML = RegEx.Replace(strText, "")
End Function

It doesn't look too complicated, does it? Provided that you know how to build those patterns ... ;-)

For the rest of the VBScript people (who have an older Scripting Engine or don't want to mess with regular expressions), writing your own little parser is the way to go. Below is an example of such a function. My friend Chris Coursey and I used this function in one of our projects a couple of years ago:

Function RemoveHTML( strText ) 
    Dim nPos1
    Dim nPos2
    
    nPos1 = InStr(strText, "<") 
    Do While nPos1 > 0 
        nPos2 = InStr(nPos1 + 1, strText, ">") 
        If nPos2 > 0 Then 
            strText = Left(strText, nPos1 - 1) & Mid(strText, nPos2 + 1) 
        Else 
            Exit Do 
        End If 
        nPos1 = InStr(strText, "<") 
    Loop 
    
    RemoveHTML = strText 
End Function 

While all of the above solutions work and do exactly what they were meant to do (remove everything between the brackets), there are at least two problems with this approach:

First of all, because these functions take only the bracket characters into account, any brackets within the body of the text that were never meant to be HTML tags will be removed, together with any text that happens to be between them. In other words, any attempt by a user to include "<" or ">" characters in the text might cause these functions to produce unpredictable and at times very ugly results.
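The effect is easy to reproduce with the JavaScript version of the function. The sample strings below are invented for the illustration.

```javascript
// Same bracket-based function as shown earlier in the article.
function RemoveHTML(strText) {
    var regEx = /<[^>]*>/g;
    return strText.replace(regEx, "");
}

// A comparison written in plain text loses everything between "<" and ">".
console.log(RemoveHTML("Use it when a < b and b > c."));
// Use it when a  c.

// A stray "<" swallows the text up to the end of the next real tag.
console.log(RemoveHTML("1 < 2, see <b>this</b>"));
// 1 this

// A "<" with no ">" after it is left alone, so the result depends on what follows.
console.log(RemoveHTML("I <3 ASP"));
// I <3 ASP
```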

Second, these functions remove all the HTML tags unconditionally. You cannot control which tags are removed and which are left untouched. That is a problem when you want to let your users enter some harmless HTML tags like "<b>" and "<i>" but remove the other tags.
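One possible way to get that kind of control, sketched here in JavaScript, is to have the regular expression capture the tag name and decide tag by tag. This is only an illustration, not the function presented later in this article. The names `RemoveHTMLExcept` and `allowedTags` are hypothetical, and re-emitting kept tags without their attributes is an extra precaution assumed for this sketch.

```javascript
// Hypothetical sketch: remove every tag except the names listed in allowedTags.
function RemoveHTMLExcept(strText, allowedTags) {
    // Capture an optional "/" and the tag name; the rest of the tag is matched but ignored.
    var regEx = /<(\/?)([a-z][a-z0-9]*)[^>]*>/gi;
    return strText.replace(regEx, function (tag, slash, name) {
        name = name.toLowerCase();
        if (allowedTags.indexOf(name) >= 0) {
            return "<" + slash + name + ">";  // keep the tag, drop its attributes
        }
        return "";                            // any other tag is removed
    });
}

var input = '<b onclick="x()">Bold</b> and <a href="#">a link</a>';
console.log(RemoveHTMLExcept(input, ["b", "i"]));
// <b>Bold</b> and a link
```

Because the pattern requires a letter right after the bracket, text such as "a < b and b > c" passes through unchanged. The sketch ignores comments and does not remove the content of SCRIPT blocks; it only demonstrates the tag-name check.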

Third Option - Use IE and other tools

The only way to overcome both of the previously discussed problems is to make your code aware of the specific HTML tags that you want removed. I am currently unaware of any third-party ASP components that would do the job for you, but they might very well be out there. I did, however, attempt to write one myself based on the MSHTML library, and I've seen that somebody has used Internet Explorer's Application object to strip HTML tags. Both of these solutions seemed to work, but with the IE solution you will most likely take a huge performance hit, and neither seems to be a very safe thing to do according to the MSKB:

"It may be desirable to parse HTML files inside a Web server process in response to a browser page request. However, the WebBrowser control, DHTML Editing Control, MSHTML, and other Internet Explorer components may not function properly in an Active Server Pages (ASP) page or other application run in a Web server application." (http://support.microsoft.com/support/kb/articles/Q244/0/85.ASP?LN=EN-US&SD=gn&FR=0)

In other words - think twice before using any IE components on the server side.

Fourth Option - Another VBScript attempt

Having explored all of the above options, I took on the challenge of writing an ASP function in VBScript that would be intelligent enough to remove only known HTML tags and at the same time let the developer control which tags to remove. The result of this attempt follows.

A few words about the function:

  • The list of HTML tags to be removed is controlled by adding tags to or removing them from the TAGLIST string. For example, to leave all <B> tags in the text you must remove B from TAGLIST. The current list contains every tag listed in the index of HTML tags in the MSDN Library, with the addition of the LAYER tag. Please note that every tag must be surrounded by semicolons (";") for this function to work properly.
  • Both the start and the end tags will be removed. For example, both <A ...> and </A> tags will be removed.
  • If a tag is present in both TAGLIST and BLOCKTAGLIST, this function will remove everything between the start and the end tag. For example, if the SCRIPT tag is included in both TAGLIST and BLOCKTAGLIST, everything between the <SCRIPT ...> and </SCRIPT> tags will be removed.
  • Tags without a closing bracket are not considered valid HTML tags and therefore will not be removed. That is in compliance with the HTML standard as far as I know.
  • A block tag that does not have an end tag will cause the entire portion of the text from the start tag to the end of the text to be removed. For example, if </SCRIPT> is missing, everything from <SCRIPT ...> to the end of the text will be removed.
  • If the first portion of the comment tag ("<!--") is followed by any character other than a space or a line break, the comment tag will not be removed.
  • I've done some performance testing on this function just to get an idea of its speed. It removed 1,000 tags from a 24K text string in one second, and 2,300 tags from a 60K text string in about 4.5 seconds. A relatively short string with a few tags is processed very fast. :-)

Usage of the function is simple:

strPlainText = RemoveHTML(strTextWithHTML)

And here is the function:

Function RemoveHTML( strText )
    Dim TAGLIST
    TAGLIST = ";!--;!DOCTYPE;A;ACRONYM;ADDRESS;APPLET;AREA;B;BASE;BASEFONT;" &_
              "BGSOUND;BIG;BLOCKQUOTE;BODY;BR;BUTTON;CAPTION;CENTER;CITE;CODE;" &_
              "COL;COLGROUP;COMMENT;DD;DEL;DFN;DIR;DIV;DL;DT;EM;EMBED;FIELDSET;" &_
              "FONT;FORM;FRAME;FRAMESET;HEAD;H1;H2;H3;H4;H5;H6;HR;HTML;I;IFRAME;IMG;" &_
              "INPUT;INS;ISINDEX;KBD;LABEL;LAYER;LEGEND;LI;LINK;LISTING;MAP;MARQUEE;" &_
              "MENU;META;NOBR;NOFRAMES;NOSCRIPT;OBJECT;OL;OPTION;P;PARAM;PLAINTEXT;" &_
              "PRE;Q;S;SAMP;SCRIPT;SELECT;SMALL;SPAN;STRIKE;STRONG;STYLE;SUB;SUP;" &_
              "TABLE;TBODY;TD;TEXTAREA;TFOOT;TH;THEAD;TITLE;TR;TT;U;UL;VAR;WBR;XMP;"

    Const BLOCKTAGLIST = ";APPLET;EMBED;FRAMESET;HEAD;NOFRAMES;NOSCRIPT;OBJECT;SCRIPT;STYLE;"
    
    Dim nPos1
    Dim nPos2
    Dim nPos3
    Dim strResult
    Dim strTagName
    Dim bRemove
    Dim bSearchForBlock
    
    nPos1 = InStr(strText, "<")
    Do While nPos1 > 0
        nPos2 = InStr(nPos1 + 1, strText, ">")
        If nPos2 > 0 Then
            strTagName = Mid(strText, nPos1 + 1, nPos2 - nPos1 - 1)
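            ' Treat line breaks and tabs inside the tag as spaces so the tag name can be isolated
            strTagName = Replace(Replace(Replace(strTagName, vbCr, " "), vbLf, " "), vbTab, " ")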

            nPos3 = InStr(strTagName, " ")
            If nPos3 > 0 Then
                strTagName = Left(strTagName, nPos3 - 1)
            End If
            
            If Left(strTagName, 1) = "/" Then
                strTagName = Mid(strTagName, 2)
                bSearchForBlock = False
            Else
                bSearchForBlock = True
            End If
            
            If InStr(1, TAGLIST, ";" & strTagName & ";", vbTextCompare) > 0 Then
                bRemove = True
                If bSearchForBlock Then
                    If InStr(1, BLOCKTAGLIST, ";" & strTagName & ";", vbTextCompare) > 0 Then
                        nPos2 = Len(strText)
                        nPos3 = InStr(nPos1 + 1, strText, "</" & strTagName, vbTextCompare)
                        If nPos3 > 0 Then
                            nPos3 = InStr(nPos3 + 1, strText, ">")
                        End If
                        
                        If nPos3 > 0 Then
                            nPos2 = nPos3
                        End If
                    End If
                End If
            Else
                bRemove = False
            End If
            
            If bRemove Then
                strResult = strResult & Left(strText, nPos1 - 1)
                strText = Mid(strText, nPos2 + 1)
            Else
                strResult = strResult & Left(strText, nPos1)
                strText = Mid(strText, nPos1 + 1)
            End If
        Else
            strResult = strResult & strText
            strText = ""
        End If
        
        nPos1 = InStr(strText, "<")
    Loop
    strResult = strResult & strText
    
    RemoveHTML = strResult
End Function
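For readers who want to try the behaviour described in the list above without an ASP server, here is a hedged JavaScript sketch of the same scanning logic. It is a port written for illustration, not the author's code: the tag lists are shortened for the example, and the helper name `inList` is made up.

```javascript
// Hypothetical JavaScript port of the VBScript RemoveHTML above.
// The tag lists are shortened for the example; they are not the full lists.
var TAGLIST = ";!--;A;B;I;P;BR;DIV;SPAN;IMG;SCRIPT;STYLE;";
var BLOCKTAGLIST = ";SCRIPT;STYLE;";

// Case-insensitive lookup of ";NAME;" in a semicolon-delimited list.
function inList(list, name) {
    return list.indexOf(";" + name.toUpperCase() + ";") >= 0;
}

function RemoveHTML(strText) {
    var strResult = "";
    var nPos1 = strText.indexOf("<");
    while (nPos1 >= 0) {
        var nPos2 = strText.indexOf(">", nPos1 + 1);
        if (nPos2 < 0) break;  // no closing bracket: not a tag, keep the rest as is

        // Tag name: everything after "<" up to the first whitespace character.
        var strTagName = strText.substring(nPos1 + 1, nPos2).split(/\s/)[0];
        var bEndTag = strTagName.charAt(0) === "/";
        if (bEndTag) strTagName = strTagName.substring(1);

        if (inList(TAGLIST, strTagName)) {
            if (!bEndTag && inList(BLOCKTAGLIST, strTagName)) {
                // Block tag: remove through the end tag, or to the end of the text.
                var rest = strText.substring(nPos1 + 1);
                var nEnd = rest.search(new RegExp("</" + strTagName, "i"));
                var nPos3 = nEnd >= 0 ? strText.indexOf(">", nPos1 + 1 + nEnd) : -1;
                nPos2 = nPos3 >= 0 ? nPos3 : strText.length - 1;
            }
            strResult += strText.substring(0, nPos1);
            strText = strText.substring(nPos2 + 1);
        } else {
            // Unknown tag: keep the "<" and resume scanning after it.
            strResult += strText.substring(0, nPos1 + 1);
            strText = strText.substring(nPos1 + 1);
        }
        nPos1 = strText.indexOf("<");
    }
    return strResult + strText;
}

console.log(RemoveHTML("Hi<script>alert(1)</script>there"));  // Hithere
console.log(RemoveHTML("Hi<script>alert(1)"));                // Hi
console.log(RemoveHTML("if a < b and b > c"));                // if a < b and b > c
console.log(RemoveHTML("<!--note-->text"));                   // <!--note-->text
```

The last two calls show the trade-off of the list-driven approach: plain brackets in text survive because " b" is not a known tag name, while a comment written without a space after "<!--" is not recognized and is kept.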

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
