Remove all the HTML tags and display a plain text only inside (in case XML is not well formed)

KevinAG

18 Jan 2011 | CPOL | 2 min read

Sorry, but I have to vote this way down. Your regular expression (or @Chris's) is not robust enough for what I would consider "real world" data. If it is used on any kind of public web site, I would worry about JavaScript injection attacks and similar problems, depending on how the output is used. Here is a quick example of where your regular expression fails for some completely valid HTML code:

<b title="test > fail"><i>The tag is about to be removed</i></b>

Applying your regular expression results in:

fail">The tag is about to be removed
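The failure is easy to reproduce. The sketch below is only an illustration: the original tip's exact pattern is not quoted here, so it assumes a typical naive pattern, "<[^>]*>", which stops at the first '>' even when that character sits inside a quoted attribute value. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveStripDemo
{
    // Assumption: a stand-in for the kind of pattern being criticised,
    // not the original tip's actual regular expression.
    public const string NaiveTagPattern = @"<[^>]*>";

    public static string StripNaive(string html)
    {
        // "[^>]*" ends the "tag" at the first '>' it meets,
        // even if that '>' is part of an attribute value.
        return Regex.Replace(html, NaiveTagPattern, string.Empty).Trim();
    }
}
```

Calling NaiveStripDemo.StripNaive with the sample markup above leaves the fail"> fragment in front of the text, because the first match ends inside the title attribute.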

While you may argue that it did in fact remove the tags, it also left part of an attribute value behind as visible text, so again, I don't think it is safe to use in most cases. Here is a comment and the two regular expressions that we use.

The tag pattern is taken from http://haacked.com/archive/2005/04/22/Matching_HTML_With_Regex.aspx and slightly modified. I changed the first "\w" to "\S" so that tags like <namespace:tagname> will be found, because the colon character is not part of "\w". While "\S" may be a bit overboard, I'm fine with that. I did the same thing with the second "\w", which matches an attribute name, but put it inside a character class subtraction, "[\S-[=]]", so that the "=" that separates the attribute name from its value is not eaten up. Each of these is matched one or more times, as few times as possible, which is what the "+?" after them means.
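The difference between the three character classes can be checked in isolation. This is a minimal sketch, assuming .NET's Regex class (character class subtraction such as "[\S-[=]]" is .NET syntax and may not work in other regex engines); the helper name IsWholeMatch is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameClassDemo
{
    // Returns true when every character of the input is matched
    // by the given character class (one or more times).
    public static bool IsWholeMatch(string input, string charClass)
    {
        return Regex.IsMatch(input, "^" + charClass + "+$");
    }
}
```

With this helper, IsWholeMatch("namespace:tagname", @"\w") is false because of the colon, while the same input with @"\S" is true. For attribute names, IsWholeMatch("a=b", @"[\S-[=]]") is false, which is what keeps the "=" delimiter available for the rest of the pattern.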

Also, when pasting markup from Microsoft Word, it will include funny comment sections of the form:



Because there is a '>' soon after the opening comment tag, AllHtmlTagsPattern will pick it up. It will be caught as an unrecognized tag and removed, but the content in the middle will be left alone. This means that after submitting, the user will have a bunch of meaningless text scattered throughout their description. To prevent this, we could parse the entire HTML without regular expressions, or we can use another regular expression to match and remove all comments first.
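The effect of removing comments first can be sketched as follows. The two patterns are copied from the code section of this tip; everything else is an assumption made for illustration: the class and method names are hypothetical, the sketch strips every tag rather than only unrecognized ones, and the Word-style sample in the usage note is made up because the original example is not shown here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Both patterns are copied verbatim from this tip.
    public const string AllHtmlTagsPattern = @"</?\S+?((\s+[\S-[=]]+?(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";
    public const string CommentTagsPattern = @"<![\s\S]*?--[\s]*>";

    // Single pass: the tag pattern alone also matches the opening of a comment.
    public static string StripTagsOnly(string html)
    {
        return Regex.Replace(html, AllHtmlTagsPattern, string.Empty);
    }

    // Two passes: drop whole comments (including their content), then the tags.
    public static string StripCommentsThenTags(string html)
    {
        string withoutComments = Regex.Replace(html, CommentTagsPattern, string.Empty);
        return Regex.Replace(withoutComments, AllHtmlTagsPattern, string.Empty);
    }
}
```

For a hypothetical input such as "Before<!--[if gte mso 9]><xml>Normal 0 false</xml><![endif]-->After", StripTagsOnly leaves the stray "Normal 0 false" text in the result, while StripCommentsThenTags returns "BeforeAfter".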

You can use the "Expresso" regular expression tool to help analyze and explain this expression. Remember that the doubled quotation marks ("") are there only because of the C# verbatim string; in the regular expression itself, each pair is a single quotation mark.
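As a small illustration of the quote doubling, the sketch below writes the quoted-value part of the pattern twice, once as a verbatim string and once as a regular string with backslash escapes. The class and constant names are hypothetical; the fragment is a simplified piece of the full pattern, not a replacement for it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimQuoteDemo
{
    // Verbatim string: each "" stands for one literal quotation mark.
    public const string Verbatim = @"(?:"".*?""|'.*?')";

    // Regular string: the same regex text, using \" instead.
    public const string Regular = "(?:\".*?\"|'.*?')";
}
```

Both constants hold the same text, so the version to paste into a tool like Expresso is the one with single quotation marks.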

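// Matches an opening, closing, or self-closing tag, including quoted attribute values.
const string AllHtmlTagsPattern = @"</?\S+?((\s+[\S-[=]]+?(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";
// Matches a comment from "<!" up to the closing "--", optional whitespace, and ">".
const string CommentTagsPattern = @"<![\s\S]*?--[\s]*>";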

Hopefully the code section above is readable enough.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)