Why remove HTML tags?
There are a number of reasons why you, as a developer, might want to remove HTML tags from text.
The most common situation is when you are going to display some text on a web page and the text was submitted by an unknown user or came from some other source that you have no control over.
You have no idea what the text contains: it could include a damaging script or HTML formatting that completely messes up the look of your site.
It could be that you just don't want any HTML tags in the text because of some application restrictions.
You might want to limit the use of HTML to some simple text formatting tags, but restrict the users from using links and inserting images.
Whether you have a good reason or you just want HTML out of your text because you are a member of the "HTML Hatred Club", you have to find a way to get those tags out of the text.
This article looks at the options you have for removing HTML tags from text in ASP.
First Option - Disable HTML
The first and probably the easiest solution is to disable the HTML tags in the text without removing them. You can do that with the Replace() function. For example, if you want to disable all the SCRIPT tags you could do this:
strText = Replace(strText, "<script", "&lt;script", 1, -1, 1)
or to make sure that all HTML tags are disabled:
strText = Replace(strText, "<", "&lt;")
No opening brackets - no valid HTML tags - no problem. Right?
It is a good (quick) security measure to prevent people from embedding damaging client-side scripts within the text they submit, but it is hardly a user-friendly feature.
The problem with this approach is that all the HTML tags are now displayed along with the rest of the text, which makes it very hard to read. It is like showing the HTML source to the user, which is not a very nice thing to do.
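The idea of disabling markup can be extended to every character that is special in HTML, not just the opening bracket. In ASP itself the built-in Server.HTMLEncode method does this. The sketch below shows the same idea in JavaScript, the language of the regular-expression example later in this article; the function name escapeHTML is made up for illustration.

```javascript
// Hypothetical helper (not part of the original article): neutralize the
// characters that have special meaning in HTML so the text is shown literally.
function escapeHTML(strText) {
  return String(strText)
    .replace(/&/g, "&amp;")   // must run first so the entities added below are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

console.log(escapeHTML('<b>"a" & b</b>'));
```

The ampersand is handled first on purpose: if it ran last, the "&" in each freshly inserted entity would be escaped a second time.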
Second Option - Use the brackets
How do we make HTML tags disappear from the text? Well, we can just remove them. We can take everything between the opening bracket "<" and the closing bracket ">" of each HTML tag and remove it. It sounds easy ...
Well, it is easier said than done, especially in VBScript. :-)
People who code in Perl or JavaScript will tell you that it is a piece of cake, and they are absolutely right. For example, a JavaScript function that removes everything between the brackets could look like this:
function RemoveHTML( strText )
{
    var regEx = /<[^>]*>/g;
    return strText.replace(regEx, "");
}
For those of you who don't know what "/<[^>]*>/g" means: it is called a regular expression. "Regular expressions are patterns used to match character combinations in strings." You can learn more about them by following this link: http://developer.netscape.com/docs/manuals/js/client/jsguide/regexp.htm.
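To make the pattern less cryptic, here is a small sketch that lists what it matches instead of removing it. The helper name listTags and the sample string are made up for illustration.

```javascript
// Reading the pattern piece by piece:
//   <      a literal opening bracket
//   [^>]*  any run of characters that are not a closing bracket
//   >      a literal closing bracket
//   g      "global" flag: find every match, not only the first one
var tagPattern = /<[^>]*>/g;

// Hypothetical helper: return every bracketed chunk found in the text.
function listTags(strText) {
  return strText.match(tagPattern) || [];
}

console.log(listTags("<p>Hello <b>world</b></p>"));
// [ '<p>', '<b>', '</b>', '</p>' ]
```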
Back in the VBScript world, those of us who run Scripting Engine 5.0 or later (you can check your version by calling the ScriptEngineMajorVersion and ScriptEngineMinorVersion functions) can use the RegExp object as well. A RemoveHTML function could look like this:
Function RemoveHTML( strText )
    Dim RegEx
    Set RegEx = New RegExp
    RegEx.Pattern = "<[^>]*>"
    RegEx.Global = True
    RemoveHTML = RegEx.Replace(strText, "")
End Function
It doesn't look too complicated, does it? Provided that you know how to build those patterns ... ;-)
For the rest of the VBScript people (who have an older Scripting Engine or don't want to mess with regular expressions), writing your own little parser is the way to go. Below is an example of such a function. My friend Chris Coursey and I used it in one of our projects a couple of years ago:
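Function RemoveHTML( ByVal strText )
    ' ByVal keeps the caller's variable intact; VBScript passes arguments ByRef by default
    Dim nPos1
    Dim nPos2
    nPos1 = InStr(strText, "<")
    Do While nPos1 > 0
        nPos2 = InStr(nPos1 + 1, strText, ">")
        If nPos2 > 0 Then
            ' Cut out everything from "<" to ">" inclusive
            strText = Left(strText, nPos1 - 1) & Mid(strText, nPos2 + 1)
        Else
            ' No closing bracket left: nothing more to remove
            Exit Do
        End If
        nPos1 = InStr(strText, "<")
    Loop
    RemoveHTML = strText
End Function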
While all of the above solutions work and do exactly what they were meant to do (remove everything between the brackets), there are at least a couple of problems with this approach:
First of all, because these functions only take the bracket characters into account, any brackets within the body of the text that were never meant to be HTML tags will be removed.
They will be removed together with any text that happens to be between those brackets. In other words, any attempt by a user to include "<" or ">" characters in the text might cause these functions to produce unpredictable and at times very ugly results.
On the other hand, these functions remove all HTML tags unconditionally. You cannot control which tags are removed and which are left untouched. That is a problem when you want to let your users enter some harmless HTML tags like "<b>" and "<i>" but remove the others.
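Both problems are easy to reproduce. The sketch below reuses the bracket pattern from the JavaScript example; the function name stripBrackets and the sample strings are made up for illustration.

```javascript
// Same technique as the bracket-based functions above: drop "<", ">" and
// everything in between, with no knowledge of what a real tag looks like.
function stripBrackets(strText) {
  return strText.replace(/<[^>]*>/g, "");
}

// Works as intended on real markup ...
console.log(stripBrackets("Use <b>bold</b> text"));  // "Use bold text"

// ... but plain comparison signs are treated as one big tag,
// and the text between them is lost.
console.log(stripBrackets("if a < b and c > d"));    // "if a  d"
```

The first call also shows the second problem: the harmless "<b>" tag is removed along with everything else, because the pattern cannot tell one tag from another.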
Third Option - Use IE and other tools
The only way to overcome both of the previously discussed problems is to make your code aware of the specific HTML tags that you want removed.
I am currently unaware of any third-party ASP components that would do the job for you, but they might very well be out there.
I did, however, attempt to write one myself based on the MSHTML library, and I have seen somebody use Internet Explorer's Application object to strip HTML tags.
Both of these solutions seemed to work, but with the IE solution you will most likely take a huge performance hit, and neither of them seems to be very safe according to the MSKB:
"It may be desirable to parse HTML files inside a Web server process in response to a browser page request. However, the WebBrowser control, DHTML Editing Control, MSHTML, and other Internet Explorer components may not function properly in an Active Server Pages (ASP) page or other application run in a Web server application." (http://support.microsoft.com/support/kb/articles/Q244/0/85.ASP?LN=EN-US&SD=gn&FR=0)
In other words, think twice before using any IE components on the server side.
Fourth Option - Another VBScript attempt
Having explored all of the above options, I took up the challenge of writing an ASP function in VBScript that would be intelligent enough to remove only known HTML tags and at the same time give the developer the ability to control which tags to remove. The following is the result of this attempt.
A few words about the function:
- The list of HTML tags to be removed is controlled by adding tags to or removing them from the TAGLIST string. For example, to leave all <B> tags in the text you must remove B from TAGLIST. The current list contains every tag listed in the index of HTML tags of the MSDN Library, with the addition of the LAYER tag. Please note that every tag must be surrounded by semicolons (";") for this function to work properly.
- Both the start and the end tags will be removed. For example, both <A ...> and </A> tags will be removed.
- If a tag is present in both TAGLIST and BLOCKTAGLIST, this function will remove everything between the start and the end tag.
For example, if the SCRIPT tag is included in both TAGLIST and BLOCKTAGLIST, everything between the <SCRIPT ...> and </SCRIPT> tags will be removed.
- Tags without a closing bracket are not considered valid HTML tags and therefore will not be removed. As far as I know, that complies with the HTML standard.
- A block tag that does not have an end tag will cause the entire portion of the text from the start tag to the end of the text to be removed. For example, if </SCRIPT> is missing, everything from <SCRIPT ...> to the end of the text will be removed.
- If the first portion of the comment tag ("<!--") is followed by any character other than a space, the comment tag will not be removed.
- I've done some performance testing on this function just to get an idea of its speed. It removed 1,000 tags from a 24K text string in one second, and 2,300 tags from a 60K text string in about 4.5 seconds. With a relatively short string and a few tags it is very fast. :-)
Usage of the function is simple:
strPlainText = RemoveHTML(strTextWithHTML)
And here is the function:
Function RemoveHTML( ByVal strText )
    ' ByVal: strText is consumed piece by piece below, and VBScript passes
    ' arguments ByRef by default, so without it the caller's variable is clobbered
    Dim TAGLIST
    TAGLIST = ";!--;!DOCTYPE;A;ACRONYM;ADDRESS;APPLET;AREA;B;BASE;BASEFONT;" & _
              "BGSOUND;BIG;BLOCKQUOTE;BODY;BR;BUTTON;CAPTION;CENTER;CITE;CODE;" & _
              "COL;COLGROUP;COMMENT;DD;DEL;DFN;DIR;DIV;DL;DT;EM;EMBED;FIELDSET;" & _
              "FONT;FORM;FRAME;FRAMESET;HEAD;H1;H2;H3;H4;H5;H6;HR;HTML;I;IFRAME;IMG;" & _
              "INPUT;INS;ISINDEX;KBD;LABEL;LAYER;LEGEND;LI;LINK;LISTING;MAP;MARQUEE;" & _
              "MENU;META;NOBR;NOFRAMES;NOSCRIPT;OBJECT;OL;OPTION;P;PARAM;PLAINTEXT;" & _
              "PRE;Q;S;SAMP;SCRIPT;SELECT;SMALL;SPAN;STRIKE;STRONG;STYLE;SUB;SUP;" & _
              "TABLE;TBODY;TD;TEXTAREA;TFOOT;TH;THEAD;TITLE;TR;TT;U;UL;VAR;WBR;XMP;"
    Const BLOCKTAGLIST = ";APPLET;EMBED;FRAMESET;HEAD;NOFRAMES;NOSCRIPT;OBJECT;SCRIPT;STYLE;"
    Dim nPos1
    Dim nPos2
    Dim nPos3
    Dim strResult
    Dim strTagName
    Dim bRemove
    Dim bSearchForBlock
    nPos1 = InStr(strText, "<")
    Do While nPos1 > 0
        nPos2 = InStr(nPos1 + 1, strText, ">")
        If nPos2 > 0 Then
            ' Tag name = text inside the brackets up to the first whitespace
            strTagName = Mid(strText, nPos1 + 1, nPos2 - nPos1 - 1)
            strTagName = Replace(Replace(strTagName, vbCr, " "), vbLf, " ")
            nPos3 = InStr(strTagName, " ")
            If nPos3 > 0 Then
                strTagName = Left(strTagName, nPos3 - 1)
            End If
            ' An end tag never starts a block search
            If Left(strTagName, 1) = "/" Then
                strTagName = Mid(strTagName, 2)
                bSearchForBlock = False
            Else
                bSearchForBlock = True
            End If
            If InStr(1, TAGLIST, ";" & strTagName & ";", vbTextCompare) > 0 Then
                bRemove = True
                If bSearchForBlock Then
                    If InStr(1, BLOCKTAGLIST, ";" & strTagName & ";", vbTextCompare) > 0 Then
                        ' Block tag: remove through its end tag, or to the end of the text
                        nPos2 = Len(strText)
                        nPos3 = InStr(nPos1 + 1, strText, "</" & strTagName, vbTextCompare)
                        If nPos3 > 0 Then
                            nPos3 = InStr(nPos3 + 1, strText, ">")
                        End If
                        If nPos3 > 0 Then
                            nPos2 = nPos3
                        End If
                    End If
                End If
            Else
                bRemove = False
            End If
            If bRemove Then
                ' Keep the text before the tag, skip the tag (or the whole block)
                strResult = strResult & Left(strText, nPos1 - 1)
                strText = Mid(strText, nPos2 + 1)
            Else
                ' Unknown tag: keep the "<" and continue scanning after it
                strResult = strResult & Left(strText, nPos1)
                strText = Mid(strText, nPos1 + 1)
            End If
        Else
            ' No closing bracket anywhere: the rest is plain text
            strResult = strResult & strText
            strText = ""
        End If
        nPos1 = InStr(strText, "<")
    Loop
    strResult = strResult & strText
    RemoveHTML = strResult
End Function
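For readers who prefer regular expressions, the same tag-aware idea can be roughly sketched in JavaScript, as in the earlier example. This is not a port of the function above. It is a minimal sketch with deliberately shortened lists, and the names TAGS_TO_REMOVE, BLOCK_TAGS and removeListedTags are made up for illustration. Comment tags ("<!--") are not handled.

```javascript
// Assumed, shortened lists for illustration only (lower-case tag names).
const TAGS_TO_REMOVE = new Set(["a", "img", "font", "script", "style"]);
const BLOCK_TAGS = new Set(["script", "style"]);

function removeListedTags(strText) {
  // Pass 1: remove block tags together with their content. If the end tag
  // is missing, everything up to the end of the text is removed.
  for (const tag of BLOCK_TAGS) {
    if (!TAGS_TO_REMOVE.has(tag)) continue;
    const block = new RegExp(
      "<" + tag + "\\b[^>]*>[\\s\\S]*?(</" + tag + "\\b[^>]*>|$)", "gi");
    strText = strText.replace(block, "");
  }
  // Pass 2: remove start and end tags whose name is listed; leave every
  // other tag, and any stray "<" or ">" in the text, untouched.
  return strText.replace(/<\/?([a-z][a-z0-9]*)\b[^>]*>/gi, (match, name) =>
    TAGS_TO_REMOVE.has(name.toLowerCase()) ? "" : match
  );
}

console.log(removeListedTags('<b>Hi</b> <a href="x">link</a><script>alert(1)</script>!'));
// "<b>Hi</b> link!"
```

Because the second pattern requires a letter right after the opening bracket, text such as "a < b and c > d" passes through unchanged.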