Introduction
"MyCrawler" is a class that is written in C# and functions as a "Crawler", a program that crawls a website given a root URL (or several roots) and downloads them and any links within those pages, using multiple threads, filters, and depth for precise and fast downloading.
Background
This class was written as part of a larger project for data mining.
Classes
Besides the main class "CrawlerGeneric", there are a few more helper classes:

Network - inherits the "WebClient" class and overloads the "DownloadString" method for more specific tasks.
IO - responsible for saving pages locally.
MyRegex - a helper class that includes various methods for manipulating strings.
Compresser - used by the Network class to handle compressed data.
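As a rough illustration (not the article's actual code), a WebClient-derived helper along the lines of "Network" might request compressed content like this; the class and member names below are just an example:

using System;
using System.Net;

// Illustrative sketch only - the real "Network" class may work differently.
public class CompressedWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = (HttpWebRequest)base.GetWebRequest(address);
        // Ask the server for gzip/deflate content and let the framework
        // decompress the response transparently.
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }
}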
Using the code
To use the crawler, we first create an instance of "CrawlerGeneric"; however, it is recommended to use it as a base class and create your own class that implements the "ExtractLinks" and "Filter" methods (described below):
public class MyCrawler : CrawlerGeneric
{
    // Return the list of links found in the downloaded page.
    // BuiltInExtractLinks collects every URL found in "href" attributes;
    // you can add or remove links here before they are queued.
    protected override List<string> ExtractLinks(string pageData, string url)
    {
        List<string> links = BuiltInExtractLinks(pageData, url);
        return links;
    }

    // Decide which URLs get downloaded: return false to download a URL,
    // true to skip it.
    protected override bool Filter(string url)
    {
        url = url.ToLower();
        if (url.Contains("search?") && url.Contains("&start="))
            return false;
        return true;
    }
}
The "ExtractLinks
" method allows you to get the downloaded page data and its URL and add or remove links from the links list. The links list holds the list of URLs that were extracted from the page. BuiltInExtractLinks
is a private function that extracts all the URLs inside the "href
" attribute in the page.
The "Filter
" method examines each URL in the link list; only URLs that return false
will be downloaded.
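For instance, if you wanted the crawl to stay within a single site, a derived class might look like the following sketch (the class name and the host used here are illustrative, not part of the library):

using System.Collections.Generic;

// Hypothetical example: only URLs on one host are downloaded.
// Remember that Filter must return false for URLs you want downloaded.
public class SingleSiteCrawler : CrawlerGeneric
{
    protected override List<string> ExtractLinks(string pageData, string url)
    {
        return BuiltInExtractLinks(pageData, url);
    }

    protected override bool Filter(string url)
    {
        // Accept only pages under www.example.com; skip everything else.
        return !url.ToLower().StartsWith("http://www.example.com/");
    }
}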
Create an instance of MyCrawler
After writing our own class, we instantiate it and set its properties.
MyCrawler crawler = new MyCrawler();
crawler.CrawlerDepth = 2;
crawler.NumOfConcurrentThreads = 10;
crawler.OutputDirectory = "c:\\test";
crawler.Roots = new string[] { "http://www.google.co.il/search?" +
    "hl=en&lr=&rlz=1G1GGLQ_IWIL297&q=codeproject&start=0&sa=N" };
crawler.AllowCompression = false;
crawler.ThreadIdleTime = 500;
crawler.ThreadSleepTime = 500;
The "AllowComperssion
" property allows us to choose if the data that is downloaded could be downloaded using the "gzip/deflate" compressors. In this case, I set it to false
because I am downloading pages from Google, and I noticed that compressed data that is sent from Google is compressed using the zip method and not gzip, thus receiving an exception "The magic number in GZip header is not correct. Make sure you are passing in a GZip stream ".
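If you do enable compression against servers that behave like this, one defensive approach (my own sketch, not the article's Compresser class) is to check for the GZip magic bytes (0x1F 0x8B) before decompressing:

using System.IO;
using System.IO.Compression;

// Illustrative helper: only attempt GZip decompression when the data
// actually starts with the GZip magic bytes; otherwise return it as-is.
static byte[] DecompressIfGzip(byte[] data)
{
    bool looksLikeGzip = data.Length > 2 && data[0] == 0x1F && data[1] == 0x8B;
    if (!looksLikeGzip)
        return data;

    using (MemoryStream input = new MemoryStream(data))
    using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
    using (MemoryStream output = new MemoryStream())
    {
        byte[] buffer = new byte[4096];
        int read;
        while ((read = gzip.Read(buffer, 0, buffer.Length)) > 0)
            output.Write(buffer, 0, read);
        return output.ToArray();
    }
}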
Hooking up events
The class can be consumed by any UI application, so it raises an event to report progress changes, and another event to signal when the download process has ended.
// ...
crawler.DataCanged += new EventHandler<Analayza.Crawler.Events.MyEventArgs>(crawler_DataCanged);
crawler.Finished += new EventHandler<Analayza.Crawler.Events.FinishedDownloadingEventArgs>(crawler_Finished);
if (!crawler.IsConnectedToInternet())
{
Console.WriteLine("Not connected to the internet");
return;
}
crawler.Go();
// Block until the Finished event handler sets the fin flag.
while (fin == false)
{
System.Threading.Thread.Sleep(1000);
}
static void crawler_Finished(object sender,
Analayza.Crawler.Events.FinishedDownloadingEventArgs e)
{
fin = true;
}
static void crawler_DataCanged(object sender,
Analayza.Crawler.Events.MyEventArgs e)
{
Console.WriteLine(e.Message);
}
Points of interest
Because the crawler downloads the site tree and uses the URL path as the file name, the full path on the local disk can exceed 260 characters, which is the Windows limit on path length. So the crawler uses a method called MyRegex.ToValidWindowsPath to transform the URL into a valid Windows path (stripping illegal characters) and, if needed, to shorten the name (limiting it to 250 characters, just in case).
public static string ToValidWindowsPath(string directory, string file)
{
    // Strip characters that are illegal in Windows file names.
    string newfile = ToValidWindowsFileName(file);
    string fullPath = directory + newfile;

    if (fullPath.Length > 250)
    {
        // The directory alone is already too long to fix; give up on this page.
        if (directory.Length > 230)
            return "";

        // Keep only the first few characters of the file name and append
        // a random number to reduce the chance of name collisions.
        if (newfile.Length > 10)
            newfile = newfile.Substring(0, 10);
        Random rand = new Random();
        fullPath = directory + newfile + rand.Next(0, 400000).ToString();
        return fullPath;
    }
    return fullPath;
}
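For example, saving a crawled page might look roughly like this (the directory and file name below are made up for illustration):

// Illustrative usage of ToValidWindowsPath.
string directory = @"c:\test\";
string fileName = "www.google.co.il_search_q=codeproject_start=0.html";
string localPath = MyRegex.ToValidWindowsPath(directory, fileName);
if (localPath != "")
    Console.WriteLine("Saving page to: " + localPath);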
Thanks to..
Thanks to gizmo ... :)