Introduction
After months of looking at various ways to get only the text displayed in a web browser using C#, it all boiled down to a few simple lines of code. I looked at several very robust solutions, such as the HTML Agility Pack and the Majestic 12 open source .NET parser. However, for applications that only need the tag-free / HTML-free text of a web page, these solutions seemed like overkill, at least in my case.
Here are three very simple ways to get only the displayed text from a web page:
Method 1 – In Memory Cut and Paste
Use a WebBrowser control to process the web page, and then copy the text out of the control.
Use the following code to download the web page:
// The WebBrowser control must be created on an STA UI thread with a message loop.
WebBrowser wb = new WebBrowser();
wb.DocumentCompleted +=
    new WebBrowserDocumentCompletedEventHandler(DisplayText);
wb.Url = urlPath;   // urlPath is a System.Uri; navigation runs asynchronously
Use the following event code to process the downloaded web page text:
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;

    // Select the entire rendered document, copy it to the clipboard,
    // and read the plain-text representation back out.
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    textResultsBox.Text = CleanText(Clipboard.GetText());
}
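The CleanText helper used above is not shown in the snippet, and its exact behavior will depend on your needs. A minimal sketch, assuming all you want is to collapse runs of whitespace and trim the result, could look something like this:
// Hypothetical helper, not part of the original snippet: collapses runs of
// whitespace in the clipboard text into single spaces and trims the result.
private string CleanText(string rawText)
{
    if (string.IsNullOrEmpty(rawText))
        return string.Empty;

    return System.Text.RegularExpressions.Regex.Replace(rawText, @"\s+", " ").Trim();
}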
Method 2 – In Memory Selection Object
This is a second way to process the downloaded web page text. It seems to take just a bit longer (the difference is minimal), but it avoids the clipboard and the limitations that come with it.
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;

    // Work directly against the underlying MSHTML document instead of the clipboard.
    IHTMLDocument2 htmlDocument =
        wb.Document.DomDocument as IHTMLDocument2;

    wb.Document.ExecCommand("SelectAll", false, null);

    // Read the plain text of the current selection through a text range.
    IHTMLSelectionObject currentSelection = htmlDocument.selection;
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    textResultsBox.Text = range.text;
}
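Note that the MSHTML interfaces used here (IHTMLDocument2, IHTMLSelectionObject, and IHTMLTxtRange) are not part of the .NET Framework itself. Assuming a standard Windows Forms project, they come from the Microsoft.mshtml interop assembly (the "Microsoft HTML Object Library" COM reference), so the file needs directives along these lines:
// Requires a project reference to Microsoft.mshtml (Microsoft HTML Object Library).
using System.Windows.Forms;   // WebBrowser control, Clipboard
using mshtml;                 // IHTMLDocument2, IHTMLSelectionObject, IHTMLTxtRange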
Method 3 – The Elegant, Simple, Slower XmlDocument Approach
A good friend shared this example with me. I am a huge fan of simple, and this example wins the simplicity contest hands down. Unfortunately, it is very slow compared to the other two approaches.
The XmlDocument object will load and process HTML files with only three simple lines of code:
XmlDocument document = new XmlDocument();
// The target page must be well-formed markup (e.g. XHTML) for XmlDocument to parse it.
document.Load("http://www.yourwebsite.com");
string allText = document.InnerText;
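Because many real-world pages are not well-formed XML, a more defensive version of the same idea (a sketch, using the same placeholder URL) would wrap the load in a try/catch:
XmlDocument document = new XmlDocument();
try
{
    document.Load("http://www.yourwebsite.com");   // placeholder URL
    string allText = document.InnerText;
}
catch (System.Xml.XmlException ex)
{
    // Pages that are not valid XHTML typically fail here;
    // fall back to Method 1 or Method 2 in that case.
    Console.WriteLine("Could not parse the page as XML: " + ex.Message);
}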
There you have it! Three simple ways to scrape only displayed text from web pages with no external “packages” involved.
Packages
I have recently used the WatiN web application testing package to get website text using C#. WatiN was not the easiest package to set up for website text retrieval from C#, as it required references to the WatiN core DLL, Microsoft.mshtml, and System.Windows.Forms, plus several additional classes included in my project. However, I still think it is worth mentioning because I like the results it produces. The package is stable and very simple to use once it is set up. In fact, the website text can be obtained using only three lines of code:
var browser = new MsHtmlBrowser();
browser.GoTo("www.YourURLHere.com");
commandLog.Text = browser.Text;
I have included a simple Visual Studio ASP.NET project for download here.