Introduction
"MyCrawler" is a class that is written in C# and functions as a "Crawler", a program that crawls a website given a root URL (or several roots) and downloads them and any links within those pages, using multiple threads, filters, and depth for precise and fast downloading.
Background
This class was written as part of a larger project for data mining.
Classes
Besides the main class "CrawlerGeneric", there are a few more helper classes:

Network - inherits the "WebClient" class and overloads the "DownloadString" method for more specific tasks.
IO - responsible for saving pages locally.
MyRegex - a helper class that includes various methods for manipulating strings.
Compresser - used by the Network class to handle compressed data.
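As a rough illustration (not the article's actual code), a WebClient-derived helper along the lines of "Network" might request compressed content like this; the class and member names below are just an example:

using System;
using System.Net;

// Illustrative sketch only - the real "Network" class may work differently.
public class CompressedWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = (HttpWebRequest)base.GetWebRequest(address);
        // Ask the server for gzip/deflate content and let the framework
        // decompress the response transparently.
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }
}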
Using the code
To use the crawler, we first create an instance of "CrawlerGeneric"; however, it is recommended to use it as a base class and create your own class that implements the "ExtractLinks" and "Filter" methods (described below):
public class MyCrawler : CrawlerGeneric
{
    // Return the list of links found in the downloaded page.
    // BuiltInExtractLinks collects every URL found in "href" attributes;
    // you can add or remove links here before they are queued.
    protected override List<string> ExtractLinks(string pageData, string url)
    {
        List<string> links = BuiltInExtractLinks(pageData, url);
        return links;
    }

    // Decide which URLs get downloaded: return false to download a URL,
    // true to skip it.
    protected override bool Filter(string url)
    {
        url = url.ToLower();
        if (url.Contains("search?") && url.Contains("&start="))
            return false;
        return true;
    }
}
The "ExtractLinks
" method allows you to get the downloaded page data and its URL and add or remove links from the links list. The links list holds the list of URLs that were extracted from the page. BuiltInExtractLinks
is a private function that extracts all the URLs inside the "href
" attribute in the page.
The "Filter
" method examines each URL in the link list; only URLs that return false
will be downloaded.
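For instance, if you wanted the crawl to stay within a single site, a derived class might look like the following sketch (the class name and the host used here are illustrative, not part of the library):

using System.Collections.Generic;

// Hypothetical example: only URLs on one host are downloaded.
// Remember that Filter must return false for URLs you want downloaded.
public class SingleSiteCrawler : CrawlerGeneric
{
    protected override List<string> ExtractLinks(string pageData, string url)
    {
        return BuiltInExtractLinks(pageData, url);
    }

    protected override bool Filter(string url)
    {
        // Accept only pages under www.example.com; skip everything else.
        return !url.ToLower().StartsWith("http://www.example.com/");
    }
}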
Create an instance of MyCrawler
After writing our own class, we instantiate it and set its properties.
MyCrawler crawler = new MyCrawler();
crawler.CrawlerDepth = 2;
crawler.NumOfConcurrentThreads = 10;
crawler.OutputDirectory = "c:\\test";
crawler.Roots = new string[] { "http://www.google.co.il/search?" +
    "hl=en&lr=&rlz=1G1GGLQ_IWIL297&q=codeproject&start=0&sa=N" };
crawler.AllowCompression = false;
crawler.ThreadIdleTime = 500;
crawler.ThreadSleepTime = 500;
The "AllowComperssion
" property allows us to choose if the data that is downloaded could be downloaded using the "gzip/deflate" compressors. In this case, I set it to false
because I am downloading pages from Google, and I noticed that compressed data that is sent from Google is compressed using the zip method and not gzip, thus receiving an exception "The magic number in GZip header is not correct. Make sure you are passing in a GZip stream ".
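If you do enable compression against servers that behave like this, one defensive approach (my own sketch, not the article's Compresser class) is to check for the GZip magic bytes (0x1F 0x8B) before decompressing:

using System.IO;
using System.IO.Compression;

// Illustrative helper: only attempt GZip decompression when the data
// actually starts with the GZip magic bytes; otherwise return it as-is.
static byte[] DecompressIfGzip(byte[] data)
{
    bool looksLikeGzip = data.Length > 2 && data[0] == 0x1F && data[1] == 0x8B;
    if (!looksLikeGzip)
        return data;

    using (MemoryStream input = new MemoryStream(data))
    using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
    using (MemoryStream output = new MemoryStream())
    {
        byte[] buffer = new byte[4096];
        int read;
        while ((read = gzip.Read(buffer, 0, buffer.Length)) > 0)
            output.Write(buffer, 0, read);
        return output.ToArray();
    }
}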
Hooking up events
The class can be consumed by any UI application, so it raises an event to report progress changes, and another event to signal when the download process has ended.
// ...
crawler.DataCanged += new EventHandler<Analayza.Crawler.Events.MyEventArgs>(crawler_DataCanged);
crawler.Finished += new EventHandler<Analayza.Crawler.Events.FinishedDownloadingEventArgs>(crawler_Finished);
if (!crawler.IsConnectedToInternet())
{
Console.WriteLine("Not connected to the internet");
return;
}
crawler.Go();
// Block until the Finished event handler sets the fin flag.
while (fin == false)
{
System.Threading.Thread.Sleep(1000);
}
static void crawler_Finished(object sender,
Analayza.Crawler.Events.FinishedDownloadingEventArgs e)
{
fin = true;
}
static void crawler_DataCanged(object sender,
Analayza.Crawler.Events.MyEventArgs e)
{
Console.WriteLine(e.Message);
}
Points of interest
Because the crawler downloads the site tree and uses the URL path as the file name, the full path on the local disk can exceed 260 characters, which is the Windows limit on path length. So the crawler uses a method called MyRegex.ToValidWindowsPath to transform the URL into a valid Windows path (stripping illegal characters) and, if needed, to shorten the name (limiting it to 250 characters, just in case).
public static string ToValidWindowsPath(string directory, string file)
{
    // Strip characters that are illegal in Windows file names.
    string newfile = ToValidWindowsFileName(file);
    string fullPath = directory + newfile;

    if (fullPath.Length > 250)
    {
        // The directory alone is already too long to fix; give up on this page.
        if (directory.Length > 230)
            return "";

        // Keep only the first few characters of the file name and append
        // a random number to reduce the chance of name collisions.
        if (newfile.Length > 10)
            newfile = newfile.Substring(0, 10);
        Random rand = new Random();
        fullPath = directory + newfile + rand.Next(0, 400000).ToString();
        return fullPath;
    }
    return fullPath;
}
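For example, saving a crawled page might look roughly like this (the directory and file name below are made up for illustration):

// Illustrative usage of ToValidWindowsPath.
string directory = @"c:\test\";
string fileName = "www.google.co.il_search_q=codeproject_start=0.html";
string localPath = MyRegex.ToValidWindowsPath(directory, fileName);
if (localPath != "")
    Console.WriteLine("Saving page to: " + localPath);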
Thanks to..
Thanks to gizmo ... :)