Text Scraping using Regex in C#

Mohammed Ibrahim.L

13 Jul 2015

How to extract specific text from a website

Introduction

In this tip, I am going to show how to extract specific text from a website.

Text can be scraped from a website in many ways. I am going to show two of the simplest, which use the following classes:

WebClient class
WebRequest/WebResponse classes

I have used both in the sample project.

I have scraped the text highlighted below from a website.

Using the WebClient

I start the code explanation with the WebClient class.

The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI.

WebClient offers several methods for downloading data from a URL. Here, I am going to use the method called DownloadString.

DownloadString downloads the resource identified by a URI (local, intranet, or Internet) and returns it as a String.

See the code below:

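// Requires: using System.Net;
String searchquery = textBox1.Text;              // URL typed by the user
String scrapdata;
using (WebClient wb = new WebClient())           // dispose the client when done
{
    scrapdata = wb.DownloadString(searchquery);  // download the page source as a string
}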

The above code works as follows:

A WebClient object, wb, is created, and its DownloadString method downloads the resource at the URI. The returned string is assigned to scrapdata. At this point, the code has downloaded only the source code of the web page. We still have to extract the specific text from that source by using Regex.
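As a minimal sketch of the DownloadString call in a form that runs without a form or a network connection, the example below writes a small HTML file to a temporary location and downloads it through a file:// URI. The class name DownloadStringSketch, the helper Fetch, and the sample HTML are hypothetical and only serve the illustration; with a real site, you would pass an http:// or https:// address instead.

```csharp
using System;
using System.IO;
using System.Net;

static class DownloadStringSketch
{
    // Downloads the resource at the given URI and returns its content as a string.
    public static string Fetch(string uri)
    {
        using (WebClient wb = new WebClient())
        {
            return wb.DownloadString(uri);
        }
    }

    static void Main()
    {
        // Hypothetical stand-in for a web page: a temporary local file.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "<html><head><title>Sample page</title></head></html>");
            string scrapdata = Fetch(new Uri(path).AbsoluteUri); // file:// URI of the temp file
            Console.WriteLine(scrapdata);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The using block disposes the WebClient once the download has finished, which is why the sketch wraps the call in a small helper.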

What is Regex?

A regular expression is a pattern that can be matched against an input text.

Before using Regex, include the namespace below in the C# code.

using System.Text.RegularExpressions;

I am going to use Regex.Matches.

Regex.Matches finds every instance of a pattern in the input and returns them as a MatchCollection of Match objects. It is useful for extracting values based on a pattern when many values are expected.

Regex.Matches allows us to extract the text between specific tags. See the code below:

MatchCollection data = Regex.Matches(scrapdata, @"<title>\s*(.+?)\s*</title>", RegexOptions.Singleline);

This extracts the text between the opening and closing title tags. The captured text is available in group 1 of each match.

The RegexOptions.Singleline option, or the s inline option, causes the regular expression engine to treat the input string as if it consists of a single line. It does this by changing the behavior of the period (.) language element so that it matches every character, instead of matching every character except for the newline character \n or \u000A.
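To make the effect of RegexOptions.Singleline concrete, here is a small sketch that runs the same title pattern with and without the option against a title that spans several lines. The class name SinglelineSketch, the helper CountTitles, and the sample input are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglelineSketch
{
    const string Pattern = @"<title>\s*(.+?)\s*</title>";

    // Counts how many title elements the pattern finds under the given options.
    public static int CountTitles(string html, RegexOptions options)
    {
        return Regex.Matches(html, Pattern, options).Count;
    }

    static void Main()
    {
        // Hypothetical input: the title text is broken across two lines.
        string html = "<title>\nFirst line\nsecond line\n</title>";

        // Without Singleline, the period cannot cross the newline inside the title.
        Console.WriteLine(CountTitles(html, RegexOptions.None));       // prints 0

        // With Singleline, the period also matches \n, so the whole title is captured.
        Console.WriteLine(CountTitles(html, RegexOptions.Singleline)); // prints 1
    }
}
```

For titles that sit on a single line, both calls find the same match, so the option matters only when the text between the tags contains line breaks.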

Then, I have used a foreach loop to find the exact value.

See the code below:

foreach (Match m in data)
{
    // Groups[1] holds the text captured by (.+?), i.e. the text between the tags.
    String downtitle = m.Groups[1].Value;
    MessageBox.Show(downtitle);
}

Two matching values are found. See the image below:

So, I have extracted the second matching value using the foreach loop.
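The loop can also be sketched as a console program that collects every captured value into a list, which makes it easy to pick a particular match such as the second one. The class name TitleMatchesSketch, the helper ExtractTitles, and the sample HTML with two title elements are hypothetical; the page used in this tip may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TitleMatchesSketch
{
    // Returns the text captured between every pair of title tags, in document order.
    public static List<string> ExtractTitles(string html)
    {
        var titles = new List<string>();
        MatchCollection data = Regex.Matches(html, @"<title>\s*(.+?)\s*</title>", RegexOptions.Singleline);
        foreach (Match m in data)
        {
            titles.Add(m.Groups[1].Value);
        }
        return titles;
    }

    static void Main()
    {
        // Hypothetical page source containing two title elements.
        string html = "<head><title> Home </title></head><svg><title>Logo</title></svg>";
        List<string> titles = ExtractTitles(html);

        Console.WriteLine(titles.Count); // prints 2
        Console.WriteLine(titles[1]);    // prints Logo (the second match)
    }
}
```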

Using the WebRequest/WebResponse Classes

WebRequest is an abstract base class, so we cannot instantiate it directly. Instead, we work with instances of its derived classes.
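To illustrate the point about derived classes: the static WebRequest.Create method returns an instance of a derived class chosen from the URI scheme. The sketch below only constructs the request objects and prints their runtime types; it does not send anything. The class name WebRequestTypeSketch, the helper RequestTypeName, and both URIs are hypothetical.

```csharp
using System;
using System.Net;

static class WebRequestTypeSketch
{
    // Creates a request for the URI and reports the name of its concrete class.
    public static string RequestTypeName(string uri)
    {
        WebRequest request = WebRequest.Create(uri);
        return request.GetType().Name;
    }

    static void Main()
    {
        // An http:// URI should yield an HttpWebRequest.
        Console.WriteLine(RequestTypeName("http://example.com/"));

        // A file:// URI should yield a FileWebRequest.
        Console.WriteLine(RequestTypeName("file:///tmp/page.html"));
    }
}
```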

We use the static Create method of WebRequest to obtain a request instance for a URL. GetResponse returns the WebResponse, and its GetResponseStream method returns the data stream. See the following code:

// Requires: using System.IO; using System.Net;
WebRequest request = WebRequest.Create(searchquery);          // Create a request for the URL.
request.Credentials = CredentialCache.DefaultCredentials;     // Set the credentials, if required by the server.
string responseFromServer;
using (WebResponse response = request.GetResponse())          // Get the response.
using (Stream dataStream = response.GetResponseStream())      // Get the stream containing the content returned by the server.
using (StreamReader reader = new StreamReader(dataStream))    // Open the stream using a StreamReader for easy access.
{
    responseFromServer = reader.ReadToEnd();                  // Read the content.
}

The page source has now been downloaded into the responseFromServer string. We can pass this string to Regex.Match or Regex.Matches to extract the specific text, as before.
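Putting the two steps together, the following sketch downloads a page with WebRequest/WebResponse and then extracts the first title with Regex.Match. So that it runs without a network connection, it reads a temporary local file through a file:// URI; the class name WebRequestScrapeSketch, the helpers Download and FirstTitle, and the sample HTML are hypothetical.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

static class WebRequestScrapeSketch
{
    // Downloads the resource at the URI and returns its content as a string.
    public static string Download(string uri)
    {
        WebRequest request = WebRequest.Create(uri);
        using (WebResponse response = request.GetResponse())
        using (Stream dataStream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(dataStream))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns the text of the first title element, or null if there is none.
    public static string FirstTitle(string html)
    {
        Match m = Regex.Match(html, @"<title>\s*(.+?)\s*</title>", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Hypothetical stand-in for a web page: a temporary local file.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "<html><head><title>Sample page</title></head></html>");
            string responseFromServer = Download(new Uri(path).AbsoluteUri);
            Console.WriteLine(FirstTitle(responseFromServer)); // prints Sample page
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Regex.Match returns only the first match, so it suits pages where a single value is expected; Regex.Matches remains the better fit when several values must be collected.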
