Introduction
In this tip, I am going to show how to extract specific text from a website.
Text can be scraped from a website in many ways. Here, I am going to show two of the simplest. They are:
WebClient class
WebRequest / WebResponse class
I have used both classes in the sample project.
I have scraped the highlighted text below from a website.
Using the WebClient
I start the code explanation with the WebClient class.
The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI.
The WebClient class has several methods for downloading data from a URL. I am going to use the method called DownloadString.
The DownloadString method downloads the resource identified by a URI (local, intranet, or Internet) and returns it as a String.
Check the code below:
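// Requires: using System.Net;
string searchquery = textBox1.Text;   // URL typed by the user
string scrapdata;
using (WebClient wb = new WebClient())
{
    // Download the page source as a single string.
    scrapdata = wb.DownloadString(searchquery);
}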
The above code creates a wb object of the WebClient class. The wb object uses the DownloadString method to download a string from the URI, and the returned string is assigned to scrapdata. The code downloads only the source code of the web page. We have to split the specific text out of the source code by using Regex.
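As a rough, self-contained sketch of the download step outside a Windows Forms project: the PageDownloader class, the temporary file name, and the sample HTML below are hypothetical and only for illustration. A local file URI stands in for the website so the sketch can run offline; I am assuming an http or https URL can be passed in the same way.

```csharp
using System;
using System.IO;
using System.Net;

public static class PageDownloader
{
    // Downloads the resource at the given URI and returns it as a string.
    public static string Download(string uri)
    {
        // WebClient implements IDisposable, so it is wrapped in a using block.
        using (WebClient wb = new WebClient())
        {
            return wb.DownloadString(uri);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical stand-in for a web page: a small local HTML file.
        string path = Path.Combine(Path.GetTempPath(), "scrape-sample.html");
        File.WriteAllText(path, "<html><head><title>Sample page</title></head></html>");

        // Convert the file path to a file URI and download it.
        string scrapdata = PageDownloader.Download(new Uri(path).AbsoluteUri);
        Console.WriteLine(scrapdata);
    }
}
```

As with the original snippet, the returned string is the raw page source, so the text of interest still has to be cut out of it.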
What is Regex?
A regular expression is a pattern that can be matched against input text.
Before using Regex, include the namespace below in the C# code.
using System.Text.RegularExpressions;
I am going to use Regex.Matches.
Regex.Matches returns multiple Match objects. It matches multiple instances of a pattern and returns a MatchCollection.
It is advantageous for extracting values based on a pattern when many values are expected.
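To illustrate how Regex.Matches returns every instance of a pattern, here is a minimal sketch. The MatchesDemo class, the list-item pattern, and the input string are made up for this example and are not part of the sample project.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchesDemo
{
    // Returns how many times the pattern occurs in the input.
    public static int CountMatches(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        return matches.Count;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical input with three values to extract.
        string html = "<li>one</li><li>two</li><li>three</li>";
        string pattern = @"<li>(.+?)</li>";

        Console.WriteLine(MatchesDemo.CountMatches(html, pattern)); // prints 3

        // Each Match exposes the captured text through Groups[1].
        foreach (Match m in Regex.Matches(html, pattern))
        {
            Console.WriteLine(m.Groups[1].Value); // one, two, three (one per line)
        }
    }
}
```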
Regex.Matches allows us to extract the text between specific tags. See the code below:
MatchCollection data = Regex.Matches(scrapdata, @"<title>\s*(.+?)\s*</title>", RegexOptions.Singleline);
I have extracted the data between the title tags.
The RegexOptions.Singleline option, or the s inline option, causes the regular expression engine to treat the input string as if it consists of a single line. It does this by changing the behavior of the period (.) language element so that it matches every character, instead of matching every character except for the newline character \n or \u000A.
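The effect of RegexOptions.Singleline is easiest to see on a title that spans two lines. The sketch below uses the same title pattern as above; the SinglelineDemo class and the input string are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    private const string Pattern = @"<title>\s*(.+?)\s*</title>";

    // Counts the title matches found with the given regex options.
    public static int CountTitles(string html, RegexOptions options)
    {
        return Regex.Matches(html, Pattern, options).Count;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical page source whose title contains a line break.
        string html = "<title>First line\nSecond line</title>";

        // Without Singleline, the period cannot cross the newline, so nothing matches.
        Console.WriteLine(SinglelineDemo.CountTitles(html, RegexOptions.None));       // prints 0

        // With Singleline, the period also matches \n, so the title is found.
        Console.WriteLine(SinglelineDemo.CountTitles(html, RegexOptions.Singleline)); // prints 1
    }
}
```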
Then, I have used a foreach loop to find the exact value.
See the code below:
foreach (Match m in data)
{
    // Groups[1] holds the text captured between the title tags.
    string downtitle = m.Groups[1].Value;
    MessageBox.Show(downtitle);
}
Two matching values were found. See the image below:
So, I have extracted the second matching value using the foreach loop.
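When only one particular match is wanted, such as the second one, a MatchCollection can also be indexed directly instead of looping. This is a minimal sketch under that assumption; the TitleExtractor class, its NthTitle helper, and the sample HTML are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Returns the title at the given zero-based position, or null if there is no such match.
    public static string NthTitle(string html, int index)
    {
        MatchCollection data = Regex.Matches(
            html, @"<title>\s*(.+?)\s*</title>", RegexOptions.Singleline);

        if (index < 0 || index >= data.Count)
        {
            return null;
        }
        return data[index].Groups[1].Value;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical page source containing two title elements.
        string html = "<title>Site name</title><svg><title>Icon label</title></svg>";

        Console.WriteLine(TitleExtractor.NthTitle(html, 1)); // prints "Icon label"
    }
}
```

Checking the index against Count avoids an exception on pages that contain fewer matches than expected.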
Using the WebRequest / WebResponse Class
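The WebRequest class is an abstract base class, so we don't actually use it directly. We use it through its derived classes.
We have to use the Create method of WebRequest to create an instance. Create returns an object of a derived class that suits the URI scheme (for example, HttpWebRequest for an http URI). GetResponse returns the WebResponse, and its GetResponseStream method returns the data stream. See the following code: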
// Requires: using System.IO; using System.Net;
WebRequest request = WebRequest.Create(searchquery);
request.Credentials = CredentialCache.DefaultCredentials;
WebResponse response = request.GetResponse();
Stream dataStream = response.GetResponseStream();
StreamReader reader = new StreamReader(dataStream);
string responseFromServer = reader.ReadToEnd();
// Release the reader (and its stream) and the response when done.
reader.Close();
response.Close();
The source code has been downloaded into the responseFromServer string. Now, we can pass this string to Regex.Matches to extract the specific text, like before.
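Putting the two steps together, the following is a sketch of downloading with WebRequest / WebResponse and then extracting the first title. The RequestScraper class, its method names, and the temporary file are hypothetical. A local file URI is used so the sketch runs offline, and the Credentials line is left out because it is not needed for a local file.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public static class RequestScraper
{
    // Downloads the resource at the given URI using WebRequest / WebResponse.
    public static string Download(string uri)
    {
        WebRequest request = WebRequest.Create(uri);

        // The using blocks dispose the response, stream, and reader in reverse order.
        using (WebResponse response = request.GetResponse())
        using (Stream dataStream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(dataStream))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns the first title found in the page source, or null if there is none.
    public static string FirstTitle(string html)
    {
        MatchCollection data = Regex.Matches(
            html, @"<title>\s*(.+?)\s*</title>", RegexOptions.Singleline);
        return data.Count > 0 ? data[0].Groups[1].Value : null;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical stand-in for a web page: a small local HTML file.
        string path = Path.Combine(Path.GetTempPath(), "scrape-request-sample.html");
        File.WriteAllText(path, "<html><head><title> Sample page </title></head></html>");

        string responseFromServer = RequestScraper.Download(new Uri(path).AbsoluteUri);
        Console.WriteLine(RequestScraper.FirstTitle(responseFromServer)); // prints "Sample page"
    }
}
```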