
Semi generated crawler

22 Jun 2012

Leverage the Visual Studio Web Test Framework for your crawling needs...





Introduction

Well, I have created lots of crawlers in my life... more than I wanted to. So to keep my
passion alive, I create new ways of doing the same stuff in less time!

In this article you will learn how to leverage the Web Test Framework and recorder
of Visual Studio Ultimate (the Professional and Premium editions do not have it) to create website crawlers for you.

Maybe you have read my article about browser automation.

If you did, I will use the same use case example.

If you have not, here is a reminder: at the end of this article you will have an
application that will vote for my article!

The process is quick and simple:


The coffee step is the hardest one.

Use case: Vote 5 for this article automatically

Given the number of votes I got for my browser automation article, I decided to use the same trick to boost my vote here.


This project will vote for me! I am an egocentric person.

So with a classic crawler you might say:

  • GET request to http://www.codeproject.com/Articles/338036/BrowserAutomationCrawler
  • POST request with login parameter and password parameter to action "submitLogin"
  • Save the cookie
  • POST vote=5, articleId=MyArticleId to action=Vote, with the cookie

It might work, except that I completely made up the parameters and action names, so
you would need Fiddler to fine-tune the requests correctly. (And depending on
the website that can be very, very hard, especially with AJAX stuff.)
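To make that concrete, here is a minimal sketch of such a hand-rolled crawler in C#. Remember that the "submitLogin" and "Vote" actions and their parameters are the same made-up ones as above, so take it as the shape of a classic crawler rather than working code for CodeProject.

// Sketch of a classic hand-written crawler. The action URLs and form field
// names below are invented for illustration; you would need Fiddler to find
// the real ones.
using System;
using System.Net;
using System.Text;

class ClassicCrawler
{
	static void Main()
	{
		var cookies = new CookieContainer();

		// 1. GET the article page (and collect the session cookies).
		var get = (HttpWebRequest)WebRequest.Create(
			"http://www.codeproject.com/Articles/338036/BrowserAutomationCrawler");
		get.CookieContainer = cookies;
		get.GetResponse().Close();

		// 2. POST the login and password to the (made-up) "submitLogin" action.
		Post("http://www.codeproject.com/submitLogin",
			"login=me%40example.com&password=secret", cookies);

		// 3. POST the vote, reusing the saved cookies.
		Post("http://www.codeproject.com/Vote",
			"vote=5&articleId=MyArticleId", cookies);
	}

	static void Post(string url, string body, CookieContainer cookies)
	{
		var request = (HttpWebRequest)WebRequest.Create(url);
		request.Method = "POST";
		request.ContentType = "application/x-www-form-urlencoded";
		request.CookieContainer = cookies; // cookies are sent and updated here
		byte[] bytes = Encoding.UTF8.GetBytes(body);
		using (var stream = request.GetRequestStream())
			stream.Write(bytes, 0, bytes.Length);
		request.GetResponse().Close();
	}
}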

Another way to do the same thing is to say:

  • Go to http://www.codeproject.com/Articles/338036/BrowserAutomationCrawler
  • If "logout" is present then you are already logged,

    else


    Wait that I click on sign in (so I can manually fill email and password)
  • Then click the option vote 5
  • Fill the comment textbox with "5 for me, great Nicolas, thanks for you work ! Smile | <img src= " /> "
  • Click on vote button

That's the way I did it for my article on browser automation.
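For reference, here is a rough sketch of that browser-automation style using the WinForms WebBrowser control. The element ids (RateItem5, RatingComment, VoteButton) are hypothetical; the real approach is detailed in the browser automation article.

// Rough sketch of the browser-automation approach (see the browser automation
// article for the real thing). The element ids below are hypothetical.
using System;
using System.Windows.Forms;

class BrowserVoter
{
	[STAThread]
	static void Main()
	{
		var form = new Form { Width = 1024, Height = 768 };
		var browser = new WebBrowser { Dock = DockStyle.Fill, ScriptErrorsSuppressed = true };
		form.Controls.Add(browser);

		browser.DocumentCompleted += (sender, e) =>
		{
			var doc = browser.Document;
			if (doc == null || doc.Body == null)
				return;

			// If "logout" is not on the page, you are not logged in yet:
			// click sign in and fill the email and password manually.
			string text = doc.Body.InnerText ?? string.Empty;
			if (text.IndexOf("logout", StringComparison.OrdinalIgnoreCase) < 0)
				return;

			// Hypothetical ids for the rating widget.
			HtmlElement vote5 = doc.GetElementById("RateItem5");
			if (vote5 != null) vote5.InvokeMember("click");

			HtmlElement comment = doc.GetElementById("RatingComment");
			if (comment != null) comment.SetAttribute("value", "5 for me, great Nicolas, thanks for your work!");

			HtmlElement voteButton = doc.GetElementById("VoteButton");
			if (voteButton != null) voteButton.InvokeMember("click");
		};

		browser.Navigate("http://www.codeproject.com/Articles/338036/BrowserAutomationCrawler");
		Application.Run(form);
	}
}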

Now I have a new way:

  • Open IE with the Web Test Recorder
  • Click click click
  • Generate code
  • Modify the login/password part of the code to use a custom property of the crawler
    class.

So first, you need to create a new console project that will hold the code: GeneratedCrawler.Sample.

Then create a new test project; I called it GeneratedCrawler.Tests.


In this test project, right-click on the project, then Add / New Test, and select a new
Web Performance Test; I called it CodeProjectVoter.


Then a browser session opens in IE...

Browse to CodeProject, then to this article's link, log in, and vote 5 for me with a
nice comment...

If your browser is already logged in to CodeProject when it is launched, you need to
log out first, stop the recording, and start from the beginning.


If you are already logged in, the recording will not contain the code we need to log
in successfully to CodeProject.

And don't forget to click on Vote!!


You can see I used another article... That's because of the chicken-and-egg
problem: writing my article about the project of my article... When I noticed that,
my mind stack-overflowed.

Then you can stop the recording and generate the code.


Now here is the trick... The generated code has heavy dependencies on tons of DLLs
and on MSTest.


This code is not easy to run in your own project without using MSTest.exe directly...


So I decompiled the MSTest DLLs, checked what was going on, and created a project with the same classes, cutting all the dependencies.

LightWebTestFramework was born.

So first, copy the generated code from the test project to the console project created earlier: GeneratedCrawler.Sample.

Obviously, the code does not compile since GeneratedCrawler.Sample does not reference any MSTest assembly... reference LightWebTestFramework instead.

Then use the LightWebTestFramework namespaces instead of Microsoft's.
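Concretely, the only change at the top of the generated file is the set of using directives:

//using Microsoft.VisualStudio.TestTools.WebTesting;
//using Microsoft.VisualStudio.TestTools.WebTesting.Rules;
using LightWebTestFramework;
using LightWebTestFramework.Rules;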


Now call the crawler in your code.

class Program
{
	static void Main(string[] args)
	{
		new CodeProjectVoterCoded().Execute();
	}
}

But wait... we forgot to specify the login and password as properties of the crawler.

public class CodeProjectVoterCoded : WebTest
{
	public CodeProjectVoterCoded()
	{
		this.PreAuthenticate = true;
	}
 
	public string Login
	{
		get;
		set;
	}
	public string Password
	{
		get;
		set;
	}

	// ... the rest of the generated code comes here (full listing below)
}

Update Program.cs...

class Program
{
	static void Main(string[] args)
	{
		new CodeProjectVoterCoded()
			{
				Login = "slashene@gmail.com",
				Password = "blabla"
			}.Execute();
	}
}

Then just find where the login/password you used during the recording step appear in CodeProjectVoterCoded,
and use your properties instead.
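The recorder captures the literal values you typed into the login form. In the generated request it looks roughly like this (the hard-coded values are placeholders here); swap them for the properties:

// Generated by the recorder (hard-coded placeholders):
// request2Body.FormPostParameters.Add("Email", "you@example.com");
// request2Body.FormPostParameters.Add("Password", "your-password");

// After the change, the crawler uses its own properties:
request2Body.FormPostParameters.Add("Email", Login);
request2Body.FormPostParameters.Add("Password", Password);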


You can then generalize for other properties at will... 

Clean up the dependent requests you don't need (like the ScorecardResearch requests)... and then you have your full crawler.
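For example, once you delete the b.scorecardresearch.com dependent request recorded with the first page, that first request boils down to the page itself plus the hidden-field extraction (the full listing below keeps the dependent request so you can see what the recorder produces):

WebTestRequest request1 = new WebTestRequest("http://www.codeproject.com/");
// The ScorecardResearch dependent request has been deleted; the hidden-field
// extraction stays because request2 needs $HIDDEN1.FormName.
ExtractHiddenFields extractionRule1 = new ExtractHiddenFields();
extractionRule1.Required = true;
extractionRule1.HtmlDecode = true;
extractionRule1.ContextParameterName = "1";
request1.ExtractValues += new EventHandler<ExtractionEventArgs>(extractionRule1.Extract);
yield return request1;
request1 = null;

With that done, the full crawler is driven like this: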

static void Main(string[] args)
{
	new CodeProjectVoterCoded()
		{
			Login = "slashene@gmail.com",
			Password = "blabla",
			ArticleId = 409009,
			Vote = 5,
			Comment = "5 for me, great work Nicolas !!!"
		}.Execute();
} 

Here is the full code of the final crawler, entirely generated and then customized:

namespace GeneratedCrawler.Tests
{
	using System;
	using System.Collections.Generic;
	using System.Text;
	//using Microsoft.VisualStudio.TestTools.WebTesting;
	//using Microsoft.VisualStudio.TestTools.WebTesting.Rules;
	using LightWebTestFramework;
	using LightWebTestFramework.Rules;
	using System.Web;


	public class CodeProjectVoterCoded : WebTest
	{
		public CodeProjectVoterCoded()
		{
			this.PreAuthenticate = true;
		}


		public string Login
		{
			get;
			set;
		}
		public string Password
		{
			get;
			set;
		}
		public int ArticleId
		{
			get;
			set;
		}
		public int Vote
		{
			get;
			set;
		}
		public string Comment
		{
			get;
			set;
		}
		public override IEnumerator<WebTestRequest> GetRequestEnumerator()
		{
			// Initialize validation rules that apply to all requests in the WebTest
			if((this.Context.ValidationLevel >= ValidationLevel.Low))
			{
				ValidateResponseUrl validationRule1 = new ValidateResponseUrl();
				this.ValidateResponse += new EventHandler<ValidationEventArgs>(validationRule1.Validate);
			}
			if((this.Context.ValidationLevel >= ValidationLevel.Low))
			{
				ValidationRuleResponseTimeGoal validationRule2 = new ValidationRuleResponseTimeGoal();
				validationRule2.Tolerance = 0D;
				this.ValidateResponseOnPageComplete += new EventHandler<ValidationEventArgs>(validationRule2.Validate);
			}

			WebTestRequest request1 = new WebTestRequest("http://www.codeproject.com/");
			WebTestRequest request1Dependent1 = new WebTestRequest("http://b.scorecardresearch.com/b");
			request1Dependent1.ThinkTime = 24;
			request1Dependent1.QueryStringParameters.Add("c1", "2", false, false);
			request1Dependent1.QueryStringParameters.Add("c2", "13507173", false, false);
			request1Dependent1.QueryStringParameters.Add("ns__t", "1340383504029", false, false);
			request1Dependent1.QueryStringParameters.Add("ns_c", "utf-8", false, false);
			request1Dependent1.QueryStringParameters.Add("c8", "CodeProject%20-%20Your%20Development%20Resource", false, false);
			request1Dependent1.QueryStringParameters.Add("c7", "http%3A%2F%2Fwww.codeproject.com%2F", false, false);
			request1Dependent1.QueryStringParameters.Add("c9", "", false, false);
			request1.DependentRequests.Add(request1Dependent1);
			ExtractHiddenFields extractionRule1 = new ExtractHiddenFields();
			extractionRule1.Required = true;
			extractionRule1.HtmlDecode = true;
			extractionRule1.ContextParameterName = "1";
			request1.ExtractValues += new EventHandler<ExtractionEventArgs>(extractionRule1.Extract);
			yield return request1;
			request1 = null;

			WebTestRequest request2 = new WebTestRequest("http://www.codeproject.com/script/Membership/LogOn.aspx");
			request2.Method = "POST";
			request2.ExpectedResponseUrl = "http://www.codeproject.com/";
			request2.QueryStringParameters.Add("rp", "%2f", false, false);
			FormPostHttpBody request2Body = new FormPostHttpBody();
			request2Body.FormPostParameters.Add("FormName", this.Context["$HIDDEN1.FormName"].ToString());
			request2Body.FormPostParameters.Add("Email", Login);
			request2Body.FormPostParameters.Add("Password", Password);
			request2.Body = request2Body;
			yield return request2;
			request2 = null;

			WebTestRequest request4 = new WebTestRequest("http://www.codeproject.com/Script/Ratings/Ajax/RateItem.aspx");
			request4.QueryStringParameters.Add("obid", ArticleId.ToString(), false, false);
			request4.QueryStringParameters.Add("obtid", "2", false, false);
			request4.QueryStringParameters.Add("obstid", "1", false, false);
			request4.QueryStringParameters.Add("rvv", Vote.ToString(), false, false);
			request4.QueryStringParameters.Add("rvc", HttpUtility.UrlEncode(Comment), false, false);
			yield return request4;
			request4 = null;
		}
	}
} 

If the login/password is wrong you will get no exception or crash; it will just not work. I leave that to you. :)

If you need any information for your own needs, don't forget that the generated crawler is a child class of WebTest, and it comes with some cool properties.
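For example, assuming LightWebTestFramework also ports the PreRequest/PostRequest events of the original WebTest class (an assumption, check the source), you could watch every response go by, which is also a cheap way to see whether the login actually worked:

class Program
{
	static void Main(string[] args)
	{
		var crawler = new CodeProjectVoterCoded
		{
			Login = "slashene@gmail.com",
			Password = "blabla",
			ArticleId = 409009,
			Vote = 5,
			Comment = "5 for me, great work Nicolas !!!"
		};

		// Assumption: the PostRequest event and its Request/Response arguments
		// are ported from the Microsoft WebTest class.
		crawler.PostRequest += (sender, e) =>
			Console.WriteLine("{0} -> {1}", e.Request.Url, e.Response.StatusCode);

		crawler.Execute();
	}
}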

Conclusion 

With great power comes great responsibility. I hope you will use crawling for good!
