While I was participating in a forum discussion, the need came up to strip "dangerous" constructs from HTML.
In this case, the SCRIPT, OBJECT, APPLET, EMBED, FRAMESET, IFRAME, FORM, INPUT, BUTTON and TEXTAREA elements (as far as I can think of) had to be removed from the HTML source. Every event attribute (ONEVENT) should also be removed, while all other attributes are kept.
HTML is very loose and extremely hard to parse. Elements can be defined as a start tag (<element-name>) and an end tag (</element-name>) although some elements don't require the end tag. If XHTML is being parsed, elements without an end tag require the tag to be terminated with /> instead of just >.
Attributes are no easier to parse. XHTML requires attribute values to be delimited by single quotes (') or double quotes ("), but HTML allows some values to be written without any delimiter, and browsers accept them.
We could build a parser, but then it will become costly to add or remove elements or attributes. Using a regular expression to remove unwanted elements and attributes seems like the best option.
First, let's capture all unwanted elements with start and end tags. To capture these elements, we must:
- Capture the begin tag character followed by the element name, stored in a group named t: <(?<t>element-name)
- Capture optional white space followed by any characters: (\s+.*?)?
- Capture the end tag character: >
- Capture any characters, if present: .*?
- Capture the begin tag character followed by the closing tag character, the element name (referenced by the name t) and the end tag character: </\k<t>>
<(?<t>element-name)(\s+.*?)?>.*?</\k<t>>
To capture all unwanted element types, we end up with the following regular expression:
<(?<t>script|object|applet|embed|frameset|iframe|form|textarea)(\s+.*?)?>.*?</\k<t>>
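To check how the named group and its backreference behave, here is a minimal sketch written as C# top-level statements (C# 9 or later). The helper name RemovePairedElements is made up for this illustration, and no RegexOptions are set yet, so matching is case-sensitive and limited to a single line.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the text, with "embed" spelled correctly.
// A verbatim string (@"...") avoids doubling every backslash.
var pairedElements = new Regex(
    @"<(?<t>script|object|applet|embed|frameset|iframe|form|textarea)(\s+.*?)?>.*?</\k<t>>");

// Hypothetical helper: strips every matched element, content included.
string RemovePairedElements(string html) => pairedElements.Replace(html, string.Empty);

// \k<t> forces the closing tag to repeat the captured name, so the match that
// starts at <form> runs to </form> and not to the inner </textarea>.
Console.WriteLine(RemovePairedElements(
    "a<form action=\"/x\"><textarea>t</textarea></form>b"));  // prints "ab"
```

Because the content between the tags is matched lazily, the match stops at the first end tag that carries the same name as the start tag.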
Next, let's capture all unwanted elements without an end tag. To capture these elements, we must:
- Capture the begin tag character followed by the element name: <element-name
- Capture optional white space followed by any characters: (\s+.*?)?
- Capture an optional closing tag character: /?
- Capture the end tag character: >
<element-name(\s+.*?)?/?>
To capture all unwanted element types, we end up with the following regular expression:
<(script|object|applet|embed|frameset|iframe|form|textarea|input|button)(\s+.*?)?/?>
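The single-tag expression can be tried the same way. This is only a sketch under the same assumptions as before (C# top-level statements, no RegexOptions, a made-up helper name RemoveSingleTags).

```csharp
using System;
using System.Text.RegularExpressions;

// Single-tag pattern from the text, with "embed" spelled correctly.
var singleTags = new Regex(
    @"<(script|object|applet|embed|frameset|iframe|form|textarea|input|button)(\s+.*?)?/?>");

// Hypothetical helper: strips every matched start (or self-closed) tag.
string RemoveSingleTags(string html) => singleTags.Replace(html, string.Empty);

// The lazy (\s+.*?)? stops at the first "/>" or ">" it can reach.
Console.WriteLine(RemoveSingleTags(
    "<p>Name: <input type=\"text\" name=\"n\" /></p>"));  // prints "<p>Name: </p>"
```

On its own this expression removes only the tag it matches, so an input such as <button type="submit">Go</button> comes back as Go</button>.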
To remove those unwanted elements from the source HTML, we can combine these two previous regular expressions into one and replace any match with an empty string:
Regex.Replace(
    sourceHtml,
    "(<(?<t>script|object|applet|embed|frameset|iframe|form|textarea)(\\s+.*?)?>.*?</\\k<t>>)"
    + "|(<(script|object|applet|embed|frameset|iframe|form|input|button|textarea)(\\s+.*?)?/?>)",
    string.Empty);
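The order of the two alternatives matters, because the .NET engine tries them left to right at each position. The sketch below (C# top-level statements; Strip is a hypothetical helper) runs the combined expression in both orders on the same input.

```csharp
using System;
using System.Text.RegularExpressions;

string pairedElements =
    @"(<(?<t>script|object|applet|embed|frameset|iframe|form|textarea)(\s+.*?)?>.*?</\k<t>>)";
string singleTags =
    @"(<(script|object|applet|embed|frameset|iframe|form|input|button|textarea)(\s+.*?)?/?>)";

// Hypothetical helper: removes everything the given pattern matches.
string Strip(string html, string pattern) => Regex.Replace(html, pattern, string.Empty);

string sample = "<form action=\"/x\"><input name=\"q\"/></form><p>ok</p>";

// Paired alternative first: the whole form, content included, is one match.
Console.WriteLine(Strip(sample, pairedElements + "|" + singleTags));  // prints "<p>ok</p>"

// Single-tag alternative first: only the start tags match, so </form> is left behind.
Console.WriteLine(Strip(sample, singleTags + "|" + pairedElements));  // prints "</form><p>ok</p>"
```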
And finally, the unwanted attributes. This one is trickier because we want to capture unwanted attributes inside an element's start tag. To achieve that, we need to match an element's opening tag and capture all attribute definitions. To capture these attributes, we must:
- Match but ignore the begin tag character followed by any element name: (?<=<\w+)
- Match all:
  - Don’t capture mandatory white space: (?:\s+)
  - Capture attribute definition:
    - Capture mandatory attribute name: \w+
    - Capture mandatory equals sign: =
    - Capture value specification in one of the forms:
      - Capture double quoted value: "[^"]*"
      - Capture single quoted value: '[^']*'
      - Capture unquoted value: .*?
- Match but ignore end tag: (?=/?>)
(?<=<\w+)((?:\s+)(\w+=(("[^"]*")|('[^']*')|(.*?))))*(?=/?>)
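To see what this expression actually returns, the following sketch (C# top-level statements; AttributesOfFirstTag is a hypothetical helper) prints the match for one start tag and then the individual captures that .NET records for the repeated group. It assumes the balanced form of the expression, with the repeated group closed before the *.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The attribute-list expression as a verbatim string ("" is a literal quote).
var attributeList = new Regex(
    @"(?<=<\w+)((?:\s+)(\w+=((""[^""]*"")|('[^']*')|(.*?))))*(?=/?>)");

// Hypothetical helper: returns each attribute definition of the first start tag.
List<string> AttributesOfFirstTag(string html)
{
    var attributes = new List<string>();
    Match first = attributeList.Match(html);
    // Group 2 is the "\w+=value" group. It sits inside the repeated group,
    // so .NET keeps one Capture per repetition.
    foreach (Capture capture in first.Groups[2].Captures)
    {
        attributes.Add(capture.Value);
    }
    return attributes;
}

Match whole = attributeList.Match("<a href=\"x\" onclick=\"y\">text</a>");
Console.WriteLine($"[{whole.Value}]");  // prints [ href="x" onclick="y"]

foreach (string attribute in AttributesOfFirstTag("<a href=\"x\" onclick=\"y\">text</a>"))
{
    Console.WriteLine(attribute);       // prints href="x" and then onclick="y"
}
```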
The problem with the previous regular expression is that it matches the start tag and captures the whole list of attributes and not each unwanted attribute by itself. This prevents us from replacing each match with a fixed value (empty string).
To solve this, we have to name what we want to capture and use the Replace overload that uses a MatchEvaluator.
We could capture unwanted attributes as we did for the unwanted elements, but then we would need to remove them from the list of all the element’s attributes. Instead, we’ll capture the wanted attributes and build the list of attributes. To identify the wanted attributes, we’ll need to name them (a). The resulting code will be something like this:
Regex.Replace(
    sourceHtml,
    "((?<=<\\w+)((?:\\s+)((?:on\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))"
    + "|(?<a>(?!on)\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))))*(?=/?>))",
    match =>
    {
        // No wanted attributes in this start tag: drop the whole attribute list.
        if (!match.Groups["a"].Success)
        {
            return string.Empty;
        }
        // Rebuild the attribute list from the wanted attributes only.
        var attributesBuilder = new StringBuilder();
        foreach (Capture capture in match.Groups["a"].Captures)
        {
            attributesBuilder.Append(' ');
            attributesBuilder.Append(capture.Value);
        }
        return attributesBuilder.ToString();
    }
);
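As a quick check of the evaluator, here is the same logic wrapped in a function and applied to a sample tag. It is a sketch only: RemoveEventAttributes is a hypothetical name, the code uses C# top-level statements, and no RegexOptions are set, so only lowercase on... attributes are recognised at this point.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

var eventAttributes = new Regex(
    @"((?<=<\w+)((?:\s+)((?:on\w+=((""[^""]*"")|('[^']*')|(.*?)))"
    + @"|(?<a>(?!on)\w+=((""[^""]*"")|('[^']*')|(.*?)))))*(?=/?>))");

// Hypothetical helper: rebuilds each attribute list from the "a" captures only.
string RemoveEventAttributes(string html) => eventAttributes.Replace(html, match =>
{
    var kept = new StringBuilder();
    // An unsuccessful group has no captures, so the result is then empty.
    foreach (Capture capture in match.Groups["a"].Captures)
    {
        kept.Append(' ').Append(capture.Value);
    }
    return kept.ToString();
});

Console.WriteLine(RemoveEventAttributes(
    "<a href=\"x\" onclick=\"steal()\" title='t'>link</a>"));
// prints <a href="x" title='t'>link</a>
```

Unquoted values go through the lazy .*? alternative, so an input such as <img src=a.png onerror=alert(1)> should come back as <img src=a.png>.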
To avoid parsing the source HTML more than once, we can combine all the regular expressions into a single one.
Because we are still outputting only the wanted attributes, there’s no change to the match evaluator.
A few options (RegexOptions) will also be added to increase functionality and performance:
- IgnoreCase: for case-insensitive matching.
- CultureInvariant: for ignoring cultural differences in language.
- Multiline: for multiline mode (^ and $ match at the start and end of every line).
- ExplicitCapture: for capturing only named captures.
- Compiled: for compiling the regular expression into an assembly. Only worthwhile if the regular expression is to be used many times.
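The effect of an option is easy to check in isolation. The sketch below uses a deliberately simplified pattern and a hypothetical helper MatchesScript (C# top-level statements). It also shows one point worth knowing about .NET: Multiline only changes ^ and $, while it is Singleline, an option not used in this article, that lets . match line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

// Simplified pattern, just enough to show how the options affect matching.
string scriptPattern = @"<script>.*?</script>";

// Hypothetical helper: reports whether the pattern matches under the given options.
bool MatchesScript(string html, RegexOptions options) =>
    Regex.IsMatch(html, scriptPattern, options);

string upperCase = "<SCRIPT>alert(1)</SCRIPT>";
string multiLine = "<script>\nalert(1)\n</script>";

Console.WriteLine(MatchesScript(upperCase, RegexOptions.None));        // False
Console.WriteLine(MatchesScript(upperCase, RegexOptions.IgnoreCase));  // True
Console.WriteLine(MatchesScript(multiLine, RegexOptions.Multiline));   // False
Console.WriteLine(MatchesScript(multiLine, RegexOptions.Singleline));  // True
```

A script block that spans several lines is therefore only matched by .*? when Singleline is set.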
The resulting code will be this:
Regex.Replace(
    sourceHtml,
    "(<(?<t>script|object|applet|embed|frameset|iframe|form|textarea)(\\s+.*?)?>.*?</\\k<t>>)"
    + "|(<(script|object|applet|embed|frameset|iframe|form|input|button|textarea)(\\s+.*?)?/?>)"
    + "|((?<=<\\w+)((?:\\s+)((?:on\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))"
    + "|(?<a>(?!on)\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))))*(?=/?>))",
    match =>
    {
        // Unwanted element, or a start tag with no wanted attributes.
        if (!match.Groups["a"].Success)
        {
            return string.Empty;
        }
        // Rebuild the attribute list from the wanted attributes only.
        var attributesBuilder = new StringBuilder();
        foreach (Capture capture in match.Groups["a"].Captures)
        {
            attributesBuilder.Append(' ');
            attributesBuilder.Append(capture.Value);
        }
        return attributesBuilder.ToString();
    },
    RegexOptions.IgnoreCase
    | RegexOptions.Multiline
    | RegexOptions.ExplicitCapture
    | RegexOptions.CultureInvariant
    | RegexOptions.Compiled
);
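Since Compiled only pays off when the expression is reused, one possible arrangement is to build the Regex once and call it through a small function. This is a sketch, not a hardened sanitizer: the names sanitizer and Sanitize are made up, the code uses C# top-level statements, and the options are exactly the ones listed above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Built once and reused, so the cost of RegexOptions.Compiled is paid a single time.
var sanitizer = new Regex(
    @"(<(?<t>script|object|applet|embed|frameset|iframe|form|textarea)(\s+.*?)?>.*?</\k<t>>)"
    + @"|(<(script|object|applet|embed|frameset|iframe|form|input|button|textarea)(\s+.*?)?/?>)"
    + @"|((?<=<\w+)((?:\s+)((?:on\w+=((""[^""]*"")|('[^']*')|(.*?)))"
    + @"|(?<a>(?!on)\w+=((""[^""]*"")|('[^']*')|(.*?)))))*(?=/?>))",
    RegexOptions.IgnoreCase
    | RegexOptions.Multiline
    | RegexOptions.ExplicitCapture
    | RegexOptions.CultureInvariant
    | RegexOptions.Compiled);

// Hypothetical wrapper around the match evaluator from the text.
string Sanitize(string html) => sanitizer.Replace(html, match =>
{
    var kept = new StringBuilder();
    // Element matches have no "a" captures, so they are replaced by nothing.
    foreach (Capture capture in match.Groups["a"].Captures)
    {
        kept.Append(' ').Append(capture.Value);
    }
    return kept.ToString();
});

Console.WriteLine(Sanitize(
    "<p onclick=\"x()\" class=\"c\">Hi</p><script>alert(1)</script><input type=\"text\">"));
// prints <p class="c">Hi</p>
```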
This was not extensively tested and there might be some wanted HTML removed and some unwanted HTML kept, but it’s probably very close to a good solution.