Introduction
Use this code snippet to extract the inner text from HTML. It is lightweight, simple, and efficient, works well even with malformed HTML, and needs no extra DLL such as HtmlAgilityPack.
Note:
This method is intended for simple HTML that is free of scripts, styles, and comments. It only strips tags, so the contents of script and style elements would remain in the output.
Background
Some tasks require you to extract text from HTML, especially in web scraping. One popular solution is to use HtmlAgilityPack (DocumentNode.InnerText). However, this requires you to add an extra library to your project, and it has drawbacks in some edge cases.
One drawback I noticed is that it may join two words into one. For example, consider the HTML string "<p>this<b>is</b> a test</p>": using HtmlAgilityPack to extract the text will result in "thisis a test".
Using the code
To use this code, import the System.Text.RegularExpressions namespace, then add the following function to your Utilities class or as an extension method:
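public static string ExtractHtmlInnerText(string htmlText)
{
    // Match one or more consecutive tags, each optionally followed by whitespace.
    // RegexOptions.Singleline lets '.' match newlines, so tags that span lines are matched too.
    Regex regex = new Regex("(<.*?>\\s*)+", RegexOptions.Singleline);
    // Replace each run of tags with a single space, then trim both ends.
    string resultText = regex.Replace(htmlText, " ").Trim();
    return resultText;
}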
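As a quick check, here is a minimal sketch of the extension-method variant together with a small console program that calls it. The class names HtmlTextExtensions and Program are placeholders chosen for this example, not part of the original snippet, and the first input is the string from the Background section written with a properly closed </b> tag. The regular expression itself is unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical holder class for the extension method; use any static class you like.
public static class HtmlTextExtensions
{
    // Same logic as ExtractHtmlInnerText above, callable as html.ExtractHtmlInnerText().
    public static string ExtractHtmlInnerText(this string htmlText)
    {
        Regex regex = new Regex("(<.*?>\\s*)+", RegexOptions.Singleline);
        return regex.Replace(htmlText, " ").Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        // The inline <b> tag is replaced by a space, so the two words stay separate.
        string html = "<p>this<b>is</b> a test</p>";
        Console.WriteLine(html.ExtractHtmlInnerText()); // prints: this is a test

        // A tag broken across lines is still matched because of RegexOptions.Singleline,
        // and the adjacent closing/opening tags collapse into a single space.
        string multiLine = "<div\n class=\"x\">Hello</div>\n<p>world</p>";
        Console.WriteLine(multiLine.ExtractHtmlInnerText()); // prints: Hello world
    }
}
```

Each run of consecutive tags becomes one space, which is what keeps words on either side of an inline tag from being joined.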