Introduction
Sometimes we want to show a summary of a full HTML article or post, for example its first few lines on the main page. If we cut the HTML string in the middle like a regular string, we are left with many open HTML tags that are never closed, and the browser cannot find the correct closing tags for them. For example, an unclosed <div> must be closed explicitly. If it is not, it will be closed by the next </div> outside the post area, and the posts will run into each other.
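To make the problem concrete, here is a minimal sketch of such a naive cut. The class name TruncationDemo, the method NaiveSummary, and the sample markup are made up for illustration only.

```csharp
using System;

public static class TruncationDemo
{
    // Cuts the HTML like a plain string, ignoring tag boundaries.
    public static string NaiveSummary(string html, int maxLength)
    {
        return html.Length <= maxLength ? html : html.Substring(0, maxLength);
    }
}

// Example:
//   TruncationDemo.NaiveSummary(
//       "<div><p>Hello <b>world</b> and more text</p></div>", 20)
//   returns "<div><p>Hello <b>wor", which leaves <div>, <p> and <b> open.
```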
Background
In this simple script, I use two regular expressions to extract and compare tags: one for start tags and one for end tags. I then reverse the order of the start tag list, so that the most recently opened tag comes first, and compare it position by position with the end tag list. The table below illustrates this:
Order | Start Tag List (normal)   | Start Tag List (reversed) | End Tag (false, as found) | End Tag (true, as required)
------+---------------------------+---------------------------+---------------------------+----------------------------
1     | <html>                    | <p>                       | </p>                      | </p>
2     | <div>                     | <input>                   | </input>                  | </input>
3     | <span style="color:red;"> | <form>                    | </form>                   | </form>
4     | <form>                    | <span style="color:red;"> | NO END TAG                | </span>
5     | <input>                   | <div>                     | NO END TAG                | </div>
6     | <p>                       | <html>                    | NO END TAG                | </html>
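The comparison in the table can be sketched in isolation, working on bare tag names instead of real markup. The names TableSketch and MissingEndTags are hypothetical helpers for this illustration and are not part of the script itself.

```csharp
using System;
using System.Collections.Generic;

public static class TableSketch
{
    // Takes start and end tag names in document order and returns the
    // closing tags that are missing.
    public static List<string> MissingEndTags(List<string> startNames, List<string> endNames)
    {
        var reversed = new List<string>(startNames);
        reversed.Reverse(); // the most recently opened tag comes first

        var missing = new List<string>();
        for (int i = 0; i < reversed.Count; i++)
        {
            // An end tag is missing when the end list has run out
            // or holds a different name at this position.
            if (i >= endNames.Count || reversed[i] != endNames[i])
            {
                missing.Add("</" + reversed[i] + ">");
            }
        }
        return missing;
    }
}
```

With the start names html, div, span, form, input, p and the end names p, input, form from the table, this returns </span>, </div> and </html>, which are rows 4 to 6 of the last column.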
The code is as follows:
// Requires: using System; using System.Collections.Generic;
//           using System.Text; using System.Text.RegularExpressions;
public static string AutoCloseHtmlTags(string inputHtml)
{
    // Tag names shared by both patterns. Headings are written as h[1-6]; the
    // comment and <!DOCTYPE> entries are dropped because they are never closed.
    // Void elements such as br and input are still listed, so they are
    // treated like any other tag.
    const string tagNames =
        "a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdi|bdo|" +
        "big|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|" +
        "command|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|" +
        "figcaption|figure|font|footer|form|frame|frameset|h[1-6]|head|header|hr|html|" +
        "i|iframe|img|input|ins|kbd|keygen|label|legend|li|link|map|mark|menu|meta|" +
        "meter|nav|noframes|noscript|object|ol|optgroup|option|output|p|param|pre|" +
        "progress|q|rp|rt|ruby|s|samp|script|section|select|small|source|span|strike|" +
        "strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|" +
        "title|tr|track|tt|u|ul|var|video|wbr";

    // Start tag: name, optional attributes, and no "/" right before ">", so
    // self-closed tags such as <br /> are skipped.
    var regexStartTag = new Regex(
        "<(" + tagNames + @")(\s[^>]*)?(?<!/)>", RegexOptions.IgnoreCase);
    var regexCloseTag = new Regex(
        "</(" + tagNames + @")\s*>", RegexOptions.IgnoreCase);

    // Collect only the tag names (group 1), so attributes never reach the output.
    var startTagList = new List<string>();
    var closeTagList = new List<string>();
    foreach (Match startTag in regexStartTag.Matches(inputHtml))
    {
        startTagList.Add(startTag.Groups[1].Value);
    }
    foreach (Match closeTag in regexCloseTag.Matches(inputHtml))
    {
        closeTagList.Add(closeTag.Groups[1].Value);
    }

    // Reverse the start tags so the most recently opened tag comes first.
    startTagList.Reverse();

    // Iterate over the start tags: a closing tag is missing when the close
    // list has run out or holds a different name at this position.
    var resultClose = new StringBuilder();
    for (int i = 0; i < startTagList.Count; i++)
    {
        if (i >= closeTagList.Count ||
            !string.Equals(startTagList[i], closeTagList[i],
                           StringComparison.OrdinalIgnoreCase))
        {
            resultClose.Append("</").Append(startTagList[i]).Append(">");
        }
    }
    return inputHtml + resultClose.ToString();
}
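Comparing the two lists by position assumes that the existing closing tags appear in exactly the reverse order of the opening tags, which does not hold once the markup contains sibling elements such as <p>a</p><p>b. One possible alternative, shown here only as a sketch and not as the method used above, is to track open tags on a stack. The names StackCloser, CloseOpenTags, VoidElements and TagPattern are hypothetical, and the pattern is a simplification that does not handle comments or a ">" inside attribute values.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class StackCloser
{
    // HTML void elements never take an end tag, so they are not tracked.
    private static readonly HashSet<string> VoidElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "param", "source", "track", "wbr"
        };

    // Group 1 is "/" for an end tag; group 2 is the tag name.
    private static readonly Regex TagPattern =
        new Regex(@"<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*>");

    public static string CloseOpenTags(string inputHtml)
    {
        var open = new Stack<string>();
        foreach (Match m in TagPattern.Matches(inputHtml))
        {
            string name = m.Groups[2].Value;
            bool isEndTag = m.Groups[1].Value == "/";
            bool selfClosed = m.Value.EndsWith("/>", StringComparison.Ordinal);
            if (VoidElements.Contains(name) || selfClosed)
            {
                continue;
            }

            if (!isEndTag)
            {
                open.Push(name);
            }
            else if (open.Count > 0 &&
                     string.Equals(open.Peek(), name, StringComparison.OrdinalIgnoreCase))
            {
                open.Pop(); // this end tag closes the most recently opened tag
            }
            // Stray end tags that match nothing are ignored.
        }

        // Whatever is left on the stack was never closed; close innermost first.
        var result = new StringBuilder(inputHtml);
        while (open.Count > 0)
        {
            result.Append("</").Append(open.Pop()).Append(">");
        }
        return result.ToString();
    }
}
```

For example, StackCloser.CloseOpenTags("<ul><li>one</li><li>two") returns "<ul><li>one</li><li>two</li></ul>", and StackCloser.CloseOpenTags("<p>line<br>next") returns "<p>line<br>next</p>" because br is a void element.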
Please let me know about your ideas...