Introduction
First of all, the author is not responsible for any misuse of information provided below such as generating automated requests to Web servers or violating someone's copyright.
Problem Discussed in this Article
This article describes how to grab information from Web pages automatically. Sometimes you have to visit certain pages on the Internet and collect the required information without manual work. For example, suppose you have a Web site and need to gather various statistics about it every day and store them in your database. Most likely, your hosting provider offers detailed statistics, but you must log in first to access them. You may also need information from many other sources, such as Google, for example about external links to your site (the query link:www.yoursite.com). To accomplish this task, you may want to create a bot that does this laborious work for you.
To grab the required information from any Web page, do the following:
- Get the HTML code of the page to grab information.
- Create Regex class instances with the correct regular expressions.
- Use the Regex.Match(string) or Regex.Matches(string) method to analyze your text.
To do the first step, use code like this:
WebClient wc = new WebClient();
string webtext=wc.DownloadString("url of your page here");
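The second and third steps can be sketched in the same way. The example below is only an illustration under stated assumptions: the class name GrabDemo, the helper names FirstTitle and AllLinks, the two patterns, and the sample HTML string are all hypothetical and stand in for whatever your own page requires. It uses Regex.Match when a single value is expected and Regex.Matches when there may be many.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GrabDemo
{
    // Returns the text of the first <title> element, or null if there is none.
    public static string FirstTitle(string html)
    {
        Match m = Regex.Match(html, @"<title>\s*(?<Title>.*?)\s*</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups["Title"].Value : null;
    }

    // Returns every href value written as a double-quoted attribute.
    public static string[] AllLinks(string html)
    {
        MatchCollection mc = Regex.Matches(html, "href=\"(?<Url>[^\"]*)\"",
                                           RegexOptions.IgnoreCase);
        string[] urls = new string[mc.Count];
        for (int i = 0; i < mc.Count; i++)
            urls[i] = mc[i].Groups["Url"].Value;
        return urls;
    }

    public static void Main()
    {
        // Stand-in for the string returned by wc.DownloadString(...)
        string webtext = "<html><head><title> Stats </title></head>" +
                         "<body><a href=\"/a\">A</a> <a href=\"/b\">B</a></body></html>";
        Console.WriteLine(FirstTitle(webtext));   // Stats
        foreach (string url in AllLinks(webtext)) // /a, then /b
            Console.WriteLine(url);
    }
}
```

Named groups such as Title and Url keep the extraction code readable when a pattern grows, which is the same technique the e-mail example later in this article relies on.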
However, usually this is not so simple. First of all, the majority of Web sites have some protection against automated requests. Second, most likely you will have to pass your parameters before getting to the right page. Third, these parameters are most likely to be passed using the POST method instead of GET, i.e. you won't be able to write those parameters explicitly in the URL string of the WebRequest or WebClient classes. Therefore, you need to imitate the user.
Imitating the User
Use the System.Windows.Forms.WebBrowser class in a Windows Forms application to imitate the user. Suppose, for example, that you need to grab the statistics for your Web site from your hosting provider. Most likely, there is a statistics page that requires you to log in first. And, of course, the username and password must be passed using the POST method over HTTPS. Do the following:
// Form fields expected by the login page (placeholder names and values)
string postDataStr = "login=yourlogin&password=yourpassword";
byte[] postData = Encoding.UTF8.GetBytes(postDataStr);
string additionalHeaders =
    "Content-Type: application/x-www-form-urlencoded" + Environment.NewLine;
// Subscribe before navigating so that the handler is in place when the event fires
webBrowser1.Navigated += new WebBrowserNavigatedEventHandler(webBrowser1_Navigated);
webBrowser1.Navigate("https://yourloginpage", "", postData, additionalHeaders);
void webBrowser1_Navigated(object sender, WebBrowserNavigatedEventArgs e)
{
    // The login request went through: detach this handler and open the target page
    webBrowser1.Navigated -= webBrowser1_Navigated;
    webBrowser1.Navigate("Final URL to navigate");
    webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler
        (webBrowser1_DocumentCompleted);
}
void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // The page is loaded: read webBrowser1.DocumentText here, then move on.
    // webBrowser1_DocumentCompleted2 is the handler for the next page (not shown).
    webBrowser1.DocumentCompleted -= webBrowser1_DocumentCompleted;
    webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler
        (webBrowser1_DocumentCompleted2);
    webBrowser1.Navigate("Next URL to Process");
}
There are two important things to consider while using this logic. First, the WebBrowser control must itself be visible and must be placed on a visible Form. Second, some pages may raise the DocumentCompleted event more than once, for example, if they use AJAX. Google, for instance, raises this event twice while displaying its search results. Check the Web page in debug mode and implement a simple counter to overcome this issue.
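One possible shape for such a counter is sketched below. The class name CompletionCounter and its Signal method are hypothetical, not part of the WebBrowser API, and the expected number of events (2 in the demo) is an assumption that you would have to determine for each page in debug mode.

```csharp
using System;

// Hypothetical helper: reports true only on the Nth DocumentCompleted event.
public sealed class CompletionCounter
{
    private readonly int expected;
    private int seen;

    public CompletionCounter(int expected)
    {
        if (expected < 1) throw new ArgumentOutOfRangeException("expected");
        this.expected = expected;
    }

    // Call once per DocumentCompleted event; true means the page is ready.
    public bool Signal()
    {
        seen++;
        if (seen < expected) return false;
        seen = 0; // reset so that the counter can be reused for the next page
        return true;
    }
}

public static class CounterDemo
{
    public static void Main()
    {
        // Assumption: the page raises DocumentCompleted twice.
        CompletionCounter counter = new CompletionCounter(2);
        Console.WriteLine(counter.Signal()); // False
        Console.WriteLine(counter.Signal()); // True
    }
}
```

Inside a DocumentCompleted handler, the first statement would then be something like "if (!counter.Signal()) return;", so that the page text is read only after the last expected event.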
Using Regular Expressions
Formally, a regular expression is an expression that describes a language recognizable by a finite automaton. For simplicity, we will consider regular expressions as templates written according to some rules. In .NET, there is a special namespace, System.Text.RegularExpressions, containing all the required classes for dealing with regular expressions. The regular expression language elements and the rules for creating regular expressions are described on MSDN.
For example, the following code gets all e-mail addresses from a Web page.
Create a console application and use the following simple code:
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.Net;
using System.IO;
namespace SimpleRegexExample
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download the HTML code of the page
            WebClient wc = new WebClient();
            string webtext = wc.DownloadString("url of your page here");
            // The pattern must stay on one line: a regular string literal cannot span lines
            Regex regexp = new Regex(
                "(?<Email>\\w+([-+.']\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*)");
            MatchCollection mc = regexp.Matches(webtext);
            Console.WriteLine("Emails:");
            // Print the named group "Email" of every match
            for (int i = 0; i < mc.Count; i++)
                Console.WriteLine(mc[i].Groups["Email"].Value);
            Console.ReadKey();
        }
    }
}
Conclusion
The main shortcoming of this approach is that, since the WebBrowser control must be visible, you cannot implement it as a Windows Service. But you can do the following:
- Implement all the logic in the handler of the form's Load event.
- Create a service that runs in the context of a special user (say, some "Service User") and starts your Windows Forms application with the System.Diagnostics.Process.Start("yourWinFormApp.exe") method when necessary.
Since this service runs in the context of another user, the Windows Forms application also runs in that context, so "real" users won't see any additional forms.
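The launching part of such a service might look roughly like the sketch below. It is not a complete service: the class name BotLauncher, the Run helper, the timeout value, and the convention of returning -1 on failure are all assumptions made for illustration, and the account under which the process runs is determined by how the service itself is configured.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;

public static class BotLauncher
{
    // Starts the bot executable and waits for it to finish.
    // Returns its exit code, or -1 if it could not be started or timed out.
    public static int Run(string exePath, int timeoutMs)
    {
        ProcessStartInfo info = new ProcessStartInfo(exePath);
        info.UseShellExecute = false;
        try
        {
            using (Process p = Process.Start(info))
            {
                if (p == null) return -1;
                if (!p.WaitForExit(timeoutMs))
                {
                    p.Kill(); // do not leave a hung browser form behind
                    return -1;
                }
                return p.ExitCode;
            }
        }
        catch (Win32Exception)
        {
            return -1; // executable is missing or cannot be run
        }
    }

    public static void Main()
    {
        // Placeholder path from the article; replace it with your own application.
        Console.WriteLine(Run("yourWinFormApp.exe", 60000));
    }
}
```

Waiting with a timeout is a design choice rather than a requirement: it keeps a page that never finishes loading from blocking the next scheduled run.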
About the Author
I am a Software Engineer at INTSPEI.
Visit my personal Web page to view information about me and my projects.