Grabbing Information from Web Pages using Regular Expressions

qazro

19 Jun 2007

This article describes how to automate navigation through Web pages, that is, how to send parameters and grab the required information automatically.

Introduction

First of all, the author is not responsible for any misuse of information provided below such as generating automated requests to Web servers or violating someone's copyright.

Problem Discussed in this Article

This article describes how to accomplish the following task. Sometimes you have to visit pages on the Internet and grab the required information automatically. For example, suppose you have a Web site and need to gather statistics about it every day and store them in your database. Most likely, your hosting provider offers detailed statistics, but you must log in first to access them. You may also need information from other sources, such as Google, for example about external links to your site (the query link:www.yoursite.com). To accomplish this task, you may want to create a bot that does this laborious work for you.

To grab the required information from any Web page, do the following:

Get the HTML code of the page that contains the information.
Create Regex class instances with correct regular expressions.
Use Regex.Match(string) or Regex.Matches(string) method to analyze your text.

To do the first step, use code like this:

WebClient wc = new WebClient();
string webtext=wc.DownloadString("url of your page here");
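
Steps two and three can be tried without any network access. The following is a minimal sketch, assuming a hard-coded string stands in for the downloaded page; the LinkGrabber class and its link pattern are hypothetical and only handle simple double-quoted href attributes. It shows the difference between Regex.Match (first hit only) and Regex.Matches (every hit).

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkGrabber
{
    // Hypothetical pattern: captures the href value of simple, double-quoted anchors
    static readonly Regex LinkPattern =
        new Regex("<a\\s+href=\"(?<Url>[^\"]+)\"", RegexOptions.IgnoreCase);

    // Regex.Match returns only the first hit; null is returned when nothing matches
    public static string FirstLink(string html)
    {
        Match m = LinkPattern.Match(html);
        return m.Success ? m.Groups["Url"].Value : null;
    }

    // Regex.Matches returns every non-overlapping hit
    public static string[] AllLinks(string html)
    {
        MatchCollection mc = LinkPattern.Matches(html);
        string[] result = new string[mc.Count];
        for (int i = 0; i < mc.Count; i++)
            result[i] = mc[i].Groups["Url"].Value;
        return result;
    }

    static void Main()
    {
        // Stand-in for the text returned by DownloadString
        string webtext =
            "<p><a href=\"http://a.example/\">A</a> <a href=\"http://b.example/\">B</a></p>";
        Console.WriteLine(FirstLink(webtext));
        foreach (string url in AllLinks(webtext))
            Console.WriteLine(url);
    }
}
```

Use Match when a page holds a single value of interest and Matches when the same kind of item repeats.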

However, usually this is not so simple. First of all, the majority of Web sites have some protection against automated requests. Second, most likely you will have to pass your parameters before getting to the right page. Third, these parameters are most likely to be passed using the POST method instead of GET, i.e. you won't be able to write those parameters in the URL string of WebRequest or WebClient classes explicitly. Therefore, you need to imitate the user.

Imitating the User

To imitate the user, create a Windows Forms application and use the System.Windows.Forms.WebBrowser class. Suppose, for example, you need to grab the statistics for your Web site from your hosting provider. Most likely, there is a statistics page that requires you to log in first, and the username and password must be sent with the POST method over HTTPS. Do the following:

//Enter the data you want to pass to the Web page in the following way
string post_data_str = "login=yourlogin&password=yourpassword";
byte[] post_data = Encoding.UTF8.GetBytes(post_data_str);
string additional_headers =
	"Content-Type: application/x-www-form-urlencoded" + Environment.NewLine;

//Handle the Navigated event to implement further logic;
//subscribe before navigating so that the event cannot be missed
webBrowser1.Navigated += new WebBrowserNavigatedEventHandler(webBrowser1_Navigated);
//Navigate to your page, posting your data
webBrowser1.Navigate("https://yourloginpage", "", post_data, additional_headers);

void webBrowser1_Navigated(object sender, WebBrowserNavigatedEventArgs e)
{
	//From here, if everything was correct, your webBrowser control
	//has the required credentials for navigating through your pages.
	//Remove this handler for the Navigated event since it is not required anymore
	webBrowser1.Navigated -= webBrowser1_Navigated;
	//Use the DocumentCompleted event to wait for the page to be completely downloaded;
	//add the handler for this event before navigating
	webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler
				(webBrowser1_DocumentCompleted);
	webBrowser1.Navigate("Final URL to navigate");
}

void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
	/*Process your page using regular expressions */
	//Remove this handler since it is not required now
	webBrowser1.DocumentCompleted -= webBrowser1_DocumentCompleted;
	//Navigate further if required and add another handler
	//(webBrowser1_DocumentCompleted2, not shown) for the DocumentCompleted event
	webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler
				(webBrowser1_DocumentCompleted2);
	webBrowser1.Navigate("Next URL to Process");
}
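
A hand-written POST string like the one above breaks as soon as a value contains a character such as & or =. The following is a minimal sketch of one way to build the body safely, assuming a hypothetical PostData.Build helper that is not part of the original code; it percent-encodes each name and value with Uri.EscapeDataString before joining them.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class PostData
{
    // Hypothetical helper: joins fields as name=value&name=value,
    // percent-encoding every name and value
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        StringBuilder sb = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            if (sb.Length > 0) sb.Append('&');
            sb.Append(Uri.EscapeDataString(field.Key));
            sb.Append('=');
            sb.Append(Uri.EscapeDataString(field.Value));
        }
        return sb.ToString();
    }

    static void Main()
    {
        List<KeyValuePair<string, string>> fields = new List<KeyValuePair<string, string>>();
        fields.Add(new KeyValuePair<string, string>("login", "yourlogin"));
        fields.Add(new KeyValuePair<string, string>("password", "p&ss=word"));

        string body = Build(fields);
        // These bytes are what would be passed to webBrowser1.Navigate as the POST data
        byte[] postData = Encoding.UTF8.GetBytes(body);
        Console.WriteLine(body); // prints login=yourlogin&password=p%26ss%3Dword
    }
}
```

Note that Uri.EscapeDataString encodes a space as %20 rather than +; servers generally accept both in form data, but this is worth checking against the target site.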

There are two important things to consider when using this logic. First, the webBrowser control must be visible and placed on a visible Form. Second, some pages may raise the DocumentCompleted event more than once, for example, if they use AJAX or frames. Google, for instance, raises this event twice while displaying its search results. Check the Web page in debug mode to see how many times the event fires, and implement a simple counter to overcome this issue.
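
The counter can be as small as the following sketch. CompletionCounter is a hypothetical helper, and the expected count (2 here) is an assumption that has to be found by observing the particular page in the debugger.

```csharp
using System;

// Hypothetical helper: reports true only on the Nth DocumentCompleted call
class CompletionCounter
{
    private readonly int expected;
    private int seen;

    public CompletionCounter(int expected)
    {
        if (expected < 1) throw new ArgumentOutOfRangeException("expected");
        this.expected = expected;
    }

    // Call once per DocumentCompleted event; true means the page is ready
    public bool Signal()
    {
        seen++;
        return seen == expected;
    }

    // Call before navigating to the next page
    public void Reset()
    {
        seen = 0;
    }
}

static class CounterDemo
{
    static void Main()
    {
        CompletionCounter counter = new CompletionCounter(2);
        Console.WriteLine(counter.Signal()); // first event: not ready yet
        Console.WriteLine(counter.Signal()); // second event: process the page
    }
}
```

Inside a DocumentCompleted handler, the first statement would then be something like if (!counter.Signal()) return; so that the page is processed only once.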

Using Regular Expressions

Formally, a regular expression describes a set of strings that can be recognized by a finite automaton. For simplicity, we will consider regular expressions as templates written according to some rules. In .NET, the System.Text.RegularExpressions namespace contains all the classes required for dealing with regular expressions. The regular expression language elements and the rules for creating regular expressions are documented on MSDN.
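
As an illustration of a few common language elements, here is a minimal sketch that pulls a number out of a statistics page. The markup "Visitors: <b>1234</b>" and the StatGrabber class are assumptions made up for this example; a real page will need its own pattern.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class StatGrabber
{
    // \s*            zero or more whitespace characters
    // [0-9]+         one or more ASCII digits
    // (?<Count>...)  a named group, read back through Groups["Count"]
    static readonly Regex VisitorsPattern =
        new Regex(@"Visitors:\s*<b>(?<Count>[0-9]+)</b>", RegexOptions.IgnoreCase);

    // Returns the visitor count, or null when the pattern is not found
    // or the number does not fit into an int
    public static int? ParseVisitors(string html)
    {
        Match m = VisitorsPattern.Match(html);
        int value;
        if (!m.Success ||
            !int.TryParse(m.Groups["Count"].Value, NumberStyles.None,
                          CultureInfo.InvariantCulture, out value))
            return null;
        return value;
    }

    static void Main()
    {
        // Hypothetical fragment of a statistics page
        string html = "<td>Visitors: <b>1234</b></td>";
        Console.WriteLine(ParseVisitors(html)); // prints 1234
    }
}
```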

For example, the following code gets all e-mails from the Web page.

Create the console application and use the following simple code:

using System; 
using System.Collections.Generic; 
using System.Text; 
using System.Text.RegularExpressions; 
using System.Net; 
using System.IO; 
namespace SimpleRegexExample 
{ 
class Program 
 { 
  static void Main(string[] args) 
  { 
   //Getting the HTML content of our webpage 
   WebClient wc = new WebClient(); 
   string webtext=wc.DownloadString("url of your page here"); 
   //now if everything is OK we have content 
   //Creating regular expressions for our needs 
   //Regexp class should be created with a correct regular expression for e-mail 
   //(a regular string literal cannot span several lines, so keep the pattern on one line)
   Regex regexp = new Regex(
    "(?<Email>\\w+([-+.']\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*)"); 
   //Search content for required elements 
   MatchCollection mc = regexp.Matches(webtext); 
   //Display what was found 
   Console.WriteLine("Emails:"); 
   for (int i = 0; i < mc.Count; i++) 
    Console.WriteLine(mc[i].Groups["Email"].Value); 
   Console.ReadKey(); 
  } 
 } 
}

Conclusion

The main shortcoming of this approach is that, because the webBrowser control must be visible, you cannot implement it as a Windows Service. However, you can do the following:

Implement all the logic in the handler of Load event of the form.
Create a service that runs in the context of a special user (say, some "Service User") and starts your Windows Forms application with the System.Diagnostics.Process.Start("yourWinFormApp.exe") method when necessary.

Since this service will run within a context of another user, the Windows Forms application will also run within such context, therefore "real" users won't see any additional forms.
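
The launching part of such a service might look like the following sketch. GrabberLauncher, its method names, and the five-minute timeout are assumptions for illustration; only Process.Start and the executable name come from the text above, and the service plumbing itself is omitted.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class GrabberLauncher
{
    // Hypothetical helper: describes how the Windows Forms grabber is started
    public static ProcessStartInfo BuildStartInfo(string exePath)
    {
        ProcessStartInfo info = new ProcessStartInfo(exePath);
        info.UseShellExecute = false; // start the executable directly, not through the shell
        return info;
    }

    // Starts the grabber and waits for it; returns the exit code, or null on timeout
    public static int? RunAndWait(string exePath, int timeoutMs)
    {
        using (Process p = Process.Start(BuildStartInfo(exePath)))
        {
            if (!p.WaitForExit(timeoutMs))
            {
                p.Kill(); // the grabber hung, for example on a page that never completed
                return null;
            }
            return p.ExitCode;
        }
    }

    static void Main()
    {
        string exe = "yourWinFormApp.exe";
        if (!File.Exists(exe))
        {
            Console.WriteLine("Grabber executable not found: " + exe);
            return;
        }
        int? code = RunAndWait(exe, 5 * 60 * 1000);
        Console.WriteLine(code.HasValue ? "Exit code " + code.Value : "Timed out");
    }
}
```

Waiting with a timeout keeps a stuck browser session from blocking the service indefinitely.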

About the Author

I am a Software Engineer at INTSPEI.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)