Introduction
After months of looking at various ways to get only the text displayed in a web browser using C#, it all boiled down to a few simple lines of code. I looked at several very robust solutions, such as the HTML Agility Pack and the Majestic 12 open source .NET parser. However, for applications that only need the tag-free / HTML-free text of a web page, these solutions seemed like overkill, at least in my case.
Here are three very simple ways to get only the displayed text from a web page:
Method 1 – In Memory Cut and Paste
Use a WebBrowser control to process the web page, and then copy the text out of the control.
Use the following code to download the web page:
// The WebBrowser control must be created on an STA UI thread with a message loop.
WebBrowser wb = new WebBrowser();
wb.DocumentCompleted +=
    new WebBrowserDocumentCompletedEventHandler(DisplayText);
wb.Url = urlPath;   // urlPath is a System.Uri; navigation runs asynchronously
Use the following event code to process the downloaded web page text:
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;

    // Select the entire rendered document, copy it to the clipboard,
    // and read the plain-text representation back out.
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    textResultsBox.Text = CleanText(Clipboard.GetText());
}
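The CleanText helper used above is not shown in the snippet, and its exact behavior will depend on your needs. A minimal sketch, assuming all you want is to collapse runs of whitespace and trim the result, could look something like this:
// Hypothetical helper, not part of the original snippet: collapses runs of
// whitespace in the clipboard text into single spaces and trims the result.
private string CleanText(string rawText)
{
    if (string.IsNullOrEmpty(rawText))
        return string.Empty;

    return System.Text.RegularExpressions.Regex.Replace(rawText, @"\s+", " ").Trim();
}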
Method 2 – In Memory Selection Object
This is a second way to process the downloaded web page text. It seems to take just a bit longer (the difference is minimal), but it avoids the clipboard and the limitations that come with it.
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;

    // Work directly against the underlying MSHTML document instead of the clipboard.
    IHTMLDocument2 htmlDocument =
        wb.Document.DomDocument as IHTMLDocument2;

    wb.Document.ExecCommand("SelectAll", false, null);

    // Read the plain text of the current selection through a text range.
    IHTMLSelectionObject currentSelection = htmlDocument.selection;
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    textResultsBox.Text = range.text;
}
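Note that the MSHTML interfaces used here (IHTMLDocument2, IHTMLSelectionObject, and IHTMLTxtRange) are not part of the .NET Framework itself. Assuming a standard Windows Forms project, they come from the Microsoft.mshtml interop assembly (the "Microsoft HTML Object Library" COM reference), so the file needs directives along these lines:
// Requires a project reference to Microsoft.mshtml (Microsoft HTML Object Library).
using System.Windows.Forms;   // WebBrowser control, Clipboard
using mshtml;                 // IHTMLDocument2, IHTMLSelectionObject, IHTMLTxtRange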
Method 3 – The Elegant, Simple, Slower XmlDocument Approach
A good friend shared this example with me. I am a huge fan of simple, and this example wins the simplicity contest hands down. Unfortunately, it is very slow compared to the other two approaches.
The XmlDocument object will load and process HTML files with only three simple lines of code:
XmlDocument document = new XmlDocument();
// The target page must be well-formed markup (e.g. XHTML) for XmlDocument to parse it.
document.Load("http://www.yourwebsite.com");
string allText = document.InnerText;
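Because many real-world pages are not well-formed XML, a more defensive version of the same idea (a sketch, using the same placeholder URL) would wrap the load in a try/catch:
XmlDocument document = new XmlDocument();
try
{
    document.Load("http://www.yourwebsite.com");   // placeholder URL
    string allText = document.InnerText;
}
catch (System.Xml.XmlException ex)
{
    // Pages that are not valid XHTML typically fail here;
    // fall back to Method 1 or Method 2 in that case.
    Console.WriteLine("Could not parse the page as XML: " + ex.Message);
}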
There you have it! Three simple ways to scrape only displayed text from web pages with no external “packages” involved.
Packages
I have recently used the WatiN web application testing package to get website text using C#. WatiN was not the easiest package to set up for website text retrieval from C#, as it required references to the WatiN core DLL, Microsoft.mshtml, and System.Windows.Forms, plus several additional classes included in my project. However, I still think it is worth mentioning because I like the results it produces. The package is stable and very simple to use once it is set up. In fact, the website text can be obtained using only three lines of code:
var browser = new MsHtmlBrowser();
browser.GoTo("www.YourURLHere.com");
commandLog.Text = browser.Text;
I have included a simple Visual Studio ASP.NET project for download here.