Extract inner text from HTML using Regex

jahmani

17 Oct 2012

CPOL

How to extract the inner text from HTML using a Regular Expression.

Introduction

Use this code snippet to extract the inner text from HTML. It is lightweight, simple, and efficient, it works well even with malformed HTML, and it needs no extra DLL such as HtmlAgilityPack.

Note:

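This method is intended for simple HTML that is free of scripts, styles, and comments. The regular expression removes only the tags themselves, so the contents of <script> and <style> elements would remain in the output.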

Background

Some tasks, especially web scraping, require you to extract text from HTML. One popular solution is to use the HtmlAgilityPack DocumentNode.InnerText property; however, this requires adding an extra library to your project and has drawbacks in some edge cases.

One drawback I noticed is that it might concatenate two words into a single word. For example, consider the HTML string "<p>this<b>is</b> a test</p>": using HtmlAgilityPack to extract the text results in "thisis a test".

Using the code

To use this code, you need to import the System.Text.RegularExpressions namespace. Add the following function to your utilities class, or adapt it as an extension method:

// Requires: using System.Text.RegularExpressions;
public static string ExtractHtmlInnerText(string htmlText)
{
    // Match one or more consecutive HTML tags (opening or closing),
    // each followed by any amount of whitespace.
    // RegexOptions.Singleline lets '.' match newlines, so a tag
    // that spans several lines is still matched.

    Regex regex = new Regex("(<.*?>\\s*)+", RegexOptions.Singleline);

    // Replace each run of tags (and trailing whitespace) with a single space,
    // then trim the leading and trailing spaces.

    string resultText = regex.Replace(htmlText, " ").Trim();

    return resultText;
}
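
The following is a minimal usage sketch, not part of the original snippet. It wraps the same regular expression in a small console program to show what the function returns for a string like the one in the Background section, for malformed HTML with unclosed tags, and for input that contains a script (the limitation mentioned in the note). The class name HtmlTextDemo and the sample strings are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the name is not from the article.
public static class HtmlTextDemo
{
    // Same pattern as the article's function: a run of tags,
    // each optionally followed by whitespace, becomes one space.
    public static string ExtractHtmlInnerText(string htmlText)
    {
        Regex regex = new Regex("(<.*?>\\s*)+", RegexOptions.Singleline);
        return regex.Replace(htmlText, " ").Trim();
    }

    public static void Main()
    {
        // Adjacent inline tags: the words stay separated.
        Console.WriteLine(ExtractHtmlInnerText("<p>this<b>is</b> a test</p>"));
        // prints: this is a test

        // Malformed HTML (unclosed tags) is handled the same way.
        Console.WriteLine(ExtractHtmlInnerText("<div><p>Hello<br>world"));
        // prints: Hello world

        // Limitation: script contents are not tags, so they are kept.
        Console.WriteLine(ExtractHtmlInnerText("<p>Hi</p><script>var x = 1;</script>"));
        // prints: Hi var x = 1;
    }
}
```

The last call shows why the input should be free of scripts and styles: only the tags are removed, and whatever sits between them is treated as text.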

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)