A Visual Studio 2013 demo project including the WebpageDownloader and LinkCrawler can be downloaded here.
Introduction
The US digital universe currently doubles in size approximately every three years [1]. In fact, Hewlett Packard estimates that by the end of this decade, the digital universe will be measured in ‘Brontobytes’, each of which represents one billion exabytes or around two quadrillion years of music [2]. Every minute, internet users send over 204 million emails, Google receives over 4 million search requests, YouTube users upload 72 hours of video, and 277,000 tweets are posted on Twitter [3]. Yet it is estimated that in 2012, only half a percent of the data in the US digital universe was analyzed [1].
Data mining web content efficiently is becoming an all too common task. When crawling web pages or downloading website content, a single-threaded approach can often leave you waiting an unreasonable amount of time for critical information. In previous articles, I have written about many techniques for extracting valuable information from the data contained in webpages (references are included below). In the machine learning world, we typically refer to this as extracting “features” from webpages.
Webpage Features
One of the most basic features which can be extracted from web content is a webpage’s HTML, which can be downloaded with just a few lines of C#. That HTML can then be processed further to produce more elaborate and complex features. For example, the HTML downloaded from a webpage contains very different information from the text and images which the browser renders for the webpage’s URL. The list below provides just a few examples of valuable features which can be mined from webpages, along with references for further reading; a minimal code sketch follows the list:
- Displayed text can be extracted from the HTML content downloaded from a webpage [4].
- Displayed text could be processed to divide text into sentences using Natural Language Processing [6].
- Screenshots of webpages can be taken for the purposes of clustering similar images [5].
- HTML Document Object Model (DOM) tags could be counted to compare webpages for similarity [6].
- Displayed text from webpages could be processed to create n-grams for calculating webpage similarity [7].
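To make the idea of a webpage feature concrete, the short sketch below (my own illustration, not part of the demo project) computes one of the simplest features from the list above, a count of HTML tag names, using a regular expression. The class and method names are hypothetical.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagFeatures
{
    // Count how many times each HTML tag name appears in a page's HTML.
    // Two pages can then be compared by their tag-frequency profiles.
    public static Dictionary<string, int> CountTags(string html)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        // Match opening tags such as <div ...> or <p> and capture the tag name.
        foreach (Match m in Regex.Matches(html, @"<([a-zA-Z][a-zA-Z0-9]*)"))
        {
            string tag = m.Groups[1].Value;
            int n;
            counts.TryGetValue(tag, out n);
            counts[tag] = n + 1;
        }
        return counts;
    }
}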
However, almost all successful webpage feature extraction begins with simply downloading webpages. The download process can also be the largest bottleneck in a feature extraction system, especially when you are downloading content that you did not create and may have no idea where the actual content is located. Webpages can be full of errors, missing scripts, or malicious content, or they may be hosted on very old or unresponsive servers, among many other problems which can cause grief during the download process. In fact, when you are mining thousands, tens of thousands, or millions of webpages, you are guaranteed to run into webpages that cause the download process to either fail or hang.
The remainder of this article demonstrates how to successfully mine large volumes of webpages by downloading them in parallel, keeping a detailed log of your progress, and even retrying downloads that fail or time out.
The C# Webpage Downloader
I have created a tool in C# called the “Webpage Downloader”. This class can be used within any C# program to download large volumes of webpage content in parallel. Creating an instance of the class is very simple, as shown in Figure 1.
downloader = new WebpageDownloader(100, 3, 60, outputDirectory + "downloadLog.txt");
Figure 1 – Creating the Webpage Downloader.
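Although the article does not show the WebpageDownloader constructor itself, the four arguments in Figure 1 correspond to settings discussed later: the maximum number of concurrent downloads (100), the maximum attempts per URL (3), the timeout in seconds (60), and the path of the download log. A rough sketch of such a constructor is shown below; the parameter names and the _MaxAttempts and _LogFilePath fields are my own assumptions, while _MaxConcurrentDownloads and _TimeoutMs do appear in the code later in the article.
// Assumed constructor shape, inferred from how the class is used in this article.
public WebpageDownloader(int maxConcurrentDownloads, int maxAttempts,
                         int timeoutSeconds, string logFilePath)
{
    _MaxConcurrentDownloads = maxConcurrentDownloads;  // used to throttle DownloadUrls()
    _MaxAttempts = maxAttempts;                        // assumed field for the retry limit
    _TimeoutMs = timeoutSeconds * 1000;                // DownloadURL() applies this to HttpClient.Timeout
    _LogFilePath = logFilePath;                        // assumed field used by writeToLog()
}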
Each webpage downloaded by the Webpage Downloader is returned in a WebPageItem class. The WebPageItem includes all of the relevant details for a particular download and could easily be extended to meet the requirements of a particular project. Figure 2 shows each of the properties currently supported by the WebPageItem class, which are visible in the class constructor.
public WebPageItem(string Url, string Html, string ResponseUrl)
{
    _Url = Url;
    _Html = Html;
    _ResponseUrl = ResponseUrl;
    _Error = false;
    _ErrorMessage = "";
    _ServerSuccessResponse = true;
    _FileData = null;
}
Figure 2 – The Webpage Downloader’s Class Constructor.
The WebPageItem in Figure 2 contains each downloaded webpage’s URL, HTML, the server’s response URL (important for detecting a redirect), an error flag, an error message, a flag indicating whether the server provided a successful response, and the URL’s byte array data if the requested URL was downloaded as a binary file. Any number of new properties or methods could be added to the WebPageItem class to extend the webpage-related features collected during the download process.
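The catch block in Figure 3 also uses a second, error-oriented constructor overload. The demo project’s actual signature may differ, but based on the call new WebPageItem(URL, true, e.Message, successResponse), it presumably looks something like this:
// Assumed error overload, inferred from the catch block in Figure 3.
public WebPageItem(string Url, bool Error, string ErrorMessage,
                   bool ServerSuccessResponse)
{
    _Url = Url;
    _Html = "";
    _ResponseUrl = "";
    _Error = Error;
    _ErrorMessage = ErrorMessage;
    _ServerSuccessResponse = ServerSuccessResponse;
    _FileData = null;
}
A similar overload accepting a byte[] presumably backs the binary download path shown in Figure 3.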
The Download URL Method
Figure 3 shows the asynchronous DownloadURL() method, which downloads a single webpage when provided a valid URL. It also writes any download errors to a log file (when logging is enabled). The URL’s content can be downloaded as either binary or text data, which allows the user to download HTML as text or download other file types, such as images, in a binary format.
public async Task<WebPageItem> DownloadURL(string URL, bool log = false, bool binary = false)
{
    HttpClient client = new HttpClient();
    bool successResponse = false;
    if (_TimeoutMs > 0)
        client.Timeout = TimeSpan.FromMilliseconds(_TimeoutMs);
    try
    {
        using (client)
        using (HttpResponseMessage response = await client
            .GetAsync(URL, CancelToken)
            .ConfigureAwait(false))
        {
            successResponse = response.IsSuccessStatusCode;
            response.EnsureSuccessStatusCode();
            using (HttpContent content = response.Content)
            {
                string responseUri = response.RequestMessage
                    .RequestUri.ToString();
                WebPageItem webpageItem = null;
                if (binary)
                {
                    byte[] fileData = await content
                        .ReadAsByteArrayAsync()
                        .ConfigureAwait(false);
                    webpageItem = new WebPageItem(URL, fileData, responseUri);
                }
                else
                {
                    string html = await content
                        .ReadAsStringAsync()
                        .ConfigureAwait(false);
                    webpageItem = new WebPageItem(URL, html, responseUri);
                }
                int attempts;
                _RequeueCounts.TryRemove(URL, out attempts);
                if (log) writeToLog(URL + "|Success" + "\r\n");
                return webpageItem;
            }
        }
    }
    catch (Exception e)
    {
        if (log)
            writeToLog(URL + "|FAILED|" + e.Message + "\r\n");
        return new WebPageItem(URL, true, e.Message, successResponse);
    }
}
Figure 3 – The DownloadURL() Method Asynchronously Downloads a Single Webpage in Either a Binary or String Format.
Portions of this method are asynchronous, and the method itself is marked with the “async” keyword in its signature. Using the C# HttpClient, a timeout is also set so that when a download “hangs” for any reason, it will eventually time out after a duration specified by the user. The method also includes a boolean parameter named “binary”, which allows the URL to be downloaded asynchronously as either a byte array or a string. The C# “await” keyword lets us use the HttpContent asynchronous read methods ReadAsByteArrayAsync() and ReadAsStringAsync() to download multiple webpages’ content simultaneously in the background. The DownloadURL() method could also be extended to determine the download type dynamically based upon a URL’s file extension instead of using the “binary” parameter. For instance, image file extensions could be downloaded as binary data while HTML file extensions are downloaded as text.
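As a sketch of that extension-based approach (my own helper, not part of the demo project), a small lookup of known binary file extensions could replace the “binary” parameter:
// Hypothetical helper: decide whether a URL should be downloaded as binary
// data based on its file extension (the extension list is illustrative only).
// Requires System, System.Collections.Generic, and System.IO.
private static readonly HashSet<string> _BinaryExtensions =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { ".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip" };

private static bool IsBinaryUrl(string url)
{
    Uri uri;
    if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) return false;
    return _BinaryExtensions.Contains(Path.GetExtension(uri.AbsolutePath));
}

// Inside DownloadURL(), the flag could then be derived from the URL itself:
// bool binary = IsBinaryUrl(URL);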
Regardless of the download’s format or success, the DownloadURL() method returns a WebPageItem which contains any downloaded data, the server’s response, any redirect URL which may have occurred, and any relevant error messages. This allows each download to be tracked. When downloading large volumes of URLs in parallel, this becomes very important, since the user will want detailed control over things such as how long to wait before timing out, the download format to use, and how many times to retry downloads which fail.
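For a single URL, the method can also be awaited directly. A quick usage sketch, assuming the downloader instance from Figure 1 and an async calling context:
WebPageItem page = await downloader.DownloadURL("http://www.jakemdrew.com/", log: true);

if (page.Error == false)
    Console.WriteLine(page.Url + " returned " + page.Html.Length + " characters of HTML");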
Downloading Multiple Webpages in Parallel
Since the DownloadURL() method executes asynchronously, we can now download multiple webpages at the same time. The Webpage Downloader accomplishes this using two thread-safe collections and two lists, as shown in Figure 4.
public BlockingCollection<WebPageItem> Downloads;
private ConcurrentDictionary<string, int> _RequeueCounts;
private List<Task<WebPageItem>> _DownloadTasks;
private List<string> _ReTry = new List<string>();
Figure 4 – Data Structures used by the Webpage Downloader.
The thread-safe BlockingCollection “Downloads” contains each successful webpage download as it occurs. Since the collection is thread-safe, multiple worker threads can safely place downloaded WebPageItem content into the collection at the same time. This also allows the user to create a parallel producer / consumer style processing pipeline where downloaded webpages can immediately be processed further in a manner specified by the user, as described in greater detail later on.
The _RequeueCounts data structure keeps track of how many times a download has been attempted for a given URL. When the number of attempts is less than the maximum specified by the user, the URL is added back into the _ReTry list for another download attempt. Otherwise, the URL is added to the log with a message that the download failed after exceeding the maximum attempts, along with the last error message received.
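Figure 6 below calls a handleDownloadError() method which is not listed in this article. A sketch of what it might look like, based on the retry behavior just described (field and property names are partly assumed):
private void handleDownloadError(WebPageItem download)
{
    // Increment the attempt count for this URL in the thread-safe dictionary.
    int attempts = _RequeueCounts.AddOrUpdate(download.Url, 1, (url, n) => n + 1);

    if (attempts < _MaxAttempts)   // _MaxAttempts: assumed name for the user's retry limit
    {
        _ReTry.Add(download.Url);  // queue the URL for another recursive DownloadUrls() pass
    }
    else
    {
        // Give up and record the final failure, including the last error message.
        writeToLog(download.Url + "|FAILED|Exceeded maximum attempts|"
            + download.ErrorMessage + "\r\n");   // ErrorMessage: assumed public property
        _RequeueCounts.TryRemove(download.Url, out attempts);
    }
}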
The _DownloadTasks list is used within the DownloadUrls() method to act as a “concurrency” throttle. When the DownloadUrls() method first executes, one DownloadURL() task is created for each URL to be downloaded, up to the maximum number of concurrent downloads specified by the user. Figure 5 shows the creation of these download tasks using the DownloadURL() method.
int dlCount = URLs.Length < _MaxConcurrentDownloads
    ? URLs.Length : _MaxConcurrentDownloads;

for (int i = 0; i < dlCount; i++)
{
    _DownloadTasks.Add(DownloadURL(URLs[i], false, binary));
    downloadIndex++;
}
Figure 5 – Starting a Maximum Number of Concurrent Webpage Downloads.
Once the _DownloadTasks list is full, a second while loop executes which “awaits” any executing download task to complete. When the first task completes, it is removed from the _DownloadTasks list and a new download is started. The completed download is also added to the Downloads BlockingCollection for further downstream processing, or possibly re-queued in the event of an error. This process is illustrated in Figure 6.
while (_DownloadTasks.Count > 0)
{
    Task<WebPageItem> nextDownload =
        await Task.WhenAny(_DownloadTasks);
    _DownloadTasks.Remove(nextDownload);

    if (downloadIndex < URLs.Length)
    {
        _DownloadTasks.Add(DownloadURL(URLs[downloadIndex], false, binary));
        downloadIndex++;
    }

    WebPageItem download = nextDownload.Result;
    if (download.Error)
    {
        handleDownloadError(download);
    }
    else
    {
        writeToLog(download.Url + "|Success" + "\r\n");
        Downloads.Add(download);
    }
}
Figure 6 – Throttling Parallel Downloads Based upon a User-Specified Concurrency Level.
Using the Webpage Downloader in a Parallel Pipeline
In Figure 7, the WebpageDownloader created in Figure 1 is used to download a list of unique links. First, the DownloadUrls() method is called to begin downloading webpages in parallel. Based on the settings in Figure 1, the WebpageDownloader will download up to 100 webpages at a time, retrying each failed download up to 3 times. Downloads will also time out after 60 seconds, and each download will be documented in the downloadLog.txt file.
downloader.DownloadUrls(uniqueLinks.Keys.ToArray());

Parallel.ForEach(downloader.Downloads.GetConsumingEnumerable(), webpageItem =>
{
    if (webpageItem.Error == false)
    {
        string fileName = Path.GetFileName(webpageItem.Url);
        string filePath = outputDirectory + fileName;
        File.WriteAllText(filePath, webpageItem.Html);
    }
});
Figure 7 – Using Downloads from the Webpage Downloader in a Parallel Pipeline.
In the Figure 7 example, as soon as the first download completes, a Parallel.ForEach loop immediately processes the downloaded content. Keep in mind that other URLs could still be downloading at the exact same time, since the DownloadUrls() method is asynchronous and runs its downloads in the background. In this simple example, the Parallel.ForEach loop saves each of the downloaded items to an output directory specified by the user.
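If you prefer to make the background producer explicit in your own code, one way to wire the same pipeline is sketched below, under the assumption that DownloadUrls() can be started on a worker task and calls Downloads.CompleteAdding() when it finishes, as shown later in Figure 8.
// Start the producer without blocking the current thread.
Task producer = Task.Run(() => downloader.DownloadUrls(uniqueLinks.Keys.ToArray()));

// Consume downloads as they arrive; GetConsumingEnumerable() ends once
// CompleteAdding() has been called and the collection is drained.
Parallel.ForEach(downloader.Downloads.GetConsumingEnumerable(), webpageItem =>
{
    // ... process each WebPageItem here ...
});

producer.Wait();   // surface any exception thrown inside DownloadUrls()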
Using Recursion to Retry Failed Downloads
The DownloadUrls() method recursively calls itself when not all URLs were successfully downloaded and additional retries were specified by the user. During processing, any URLs placed into the _ReTry collection are simply provided as input to another recursive call to the same DownloadUrls() method, as shown in Figure 8. In this manner, failed downloads continue to be re-attempted until they either succeed or exceed the maximum number of attempts.
if (_ReTry.Count > 0)
{
    DownloadUrls(_ReTry.ToArray());
    _ReTry.Clear();
}
else
{
    Downloads.CompleteAdding();
    _RequeueCounts.Clear();
}
Figure 8 – Using Recursion to Retry Failed Downloads.
Crawling Webpage Links in Parallel
Have you ever wondered how a search engine might crawl the web looking for links connected to a single webpage? The following section demonstrates using the WebpageDownloader to start from a single webpage URL and crawl the links within every connected webpage until the collection reaches a certain maximum size. The purpose of this demo application is merely to illustrate using the Webpage Downloader class to process downloaded webpages. There are multiple improvements which could be made to this demo project to crawl links more efficiently; however, they would also add many more lines of code, which could make the illustration harder to follow.
downloader.DownloadUrls(startingURL);

foreach (WebPageItem webpageItem in
         downloader.Downloads.GetConsumingEnumerable())
{
    if (webpageItem.Error == false)
    {
        if (downloadCount < maxDownloadCount)
        {
            newLinks.AddRange(FindAllLinks(webpageItem.Html));
            string fileName = webpageItem.Url.GetHashCode().ToString();
            string filePath = outputDirectory + fileName;
            if (!File.Exists(filePath))
            {
                File.WriteAllText(filePath, "Html from: "
                    + webpageItem.Url + "\r\n\r\n"
                    + webpageItem.Html);
                Interlocked.Increment(ref downloadCount);
            }
        }
        else
        {
            downloader.Downloads.CompleteAdding();
            break;
        }
    }
}
Figure 9 – Crawling Webpage Links using the Webpage Downloader.
Figure 9 shows part of the CrawlLinks() method. We begin by downloading a single starting URL. Next, a foreach loop processes each downloaded WebPageItem provided back by the WebpageDownloader. For each downloaded WebPageItem, we call the FindAllLinks() method to extract the URLs from any <a> link tags in the downloaded HTML. For the demo, this method also filters the results so that only links to common HTML file types are kept.
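The FindAllLinks() method itself is not listed in this article. A regex-based sketch of what it might do (my own version; the demo project’s implementation may differ) is shown below:
// Hypothetical link extractor: pull href values from <a> tags and keep
// only links that look like ordinary HTML pages.
private static List<string> FindAllLinks(string html)
{
    List<string> links = new List<string>();
    Regex hrefPattern = new Regex(@"<a[^>]+href\s*=\s*[""']([^""']+)[""']",
                                  RegexOptions.IgnoreCase);

    foreach (Match m in hrefPattern.Matches(html))
    {
        string url = m.Groups[1].Value;
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) continue;

        // Keep extension-less paths and a few common HTML extensions.
        string ext = Path.GetExtension(uri.AbsolutePath).ToLowerInvariant();
        if (ext == "" || ext == ".html" || ext == ".htm" || ext == ".aspx" || ext == ".php")
            links.Add(url);
    }
    return links;
}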
However, the FindAllLinks() method is only called when the program has not yet exceeded the maximum number of downloads specified by the user. Each new link found is added to a list of new URLs to download. Once all of the URLs provided to the current run of the CrawlLinks() method have been processed, the method checks the “newLinks” list to determine whether there are any additional links left to crawl. When “newLinks” are available and additional downloads are required, the CrawlLinks() method recursively starts over using the “newLinks” as the starting URLs and continues crawling. Once the method has collected enough link files, the foreach loop is simply exited, ignoring any remaining downloads.
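The recursive restart described above might look roughly like the fragment below at the end of CrawlLinks(); this is an assumption based on the article’s description (including a CrawlLinks() overload that accepts an array of starting URLs), not code copied from the demo project.
// If more downloads are needed and new links were discovered, crawl them next.
if (downloadCount < maxDownloadCount && newLinks.Count > 0)
{
    string[] nextUrls = newLinks.Distinct().ToArray();   // requires System.Linq
    CrawlLinks(nextUrls, outputDirectory, maxDownloadCount);
}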
Benchmarking the Webpage Downloader
Using my own website as the starting URL, I performed 100 link-crawl downloads using the Webpage Downloader. Figure 10 shows the CrawlLinks() method being used in combination with a WebpageDownloader configured with 100 concurrent download workers to download 100 crawled links in parallel.
Stopwatch sw = new Stopwatch();
downloader = new WebpageDownloader(100, 3, 60, outputDirectory + "downloadLog.txt");

sw.Restart();
CrawlLinks(startingURL, outputDirectory, 100);
sw.Stop();

Console.WriteLine("Crawl and Download 100 links using 100 workers: " + sw.Elapsed);
Figure 10 – Crawling 100 Links using 100 Concurrent WebpageDownloader Workers.
I performed benchmarks to identify the first 100 links associated with www.jakemdrew.com using 2, 50, and 100 concurrent download workers. The elapsed times are shown below:
- 100 downloads using 2 concurrent workers – 00:00:44.77
- 100 downloads using 50 concurrent workers – 00:00:16.42
- 100 downloads using 100 concurrent workers – 00:00:17.54
Looking at the Webpage Downloader’s log in Figure 11, we can see that the LinkCrawler located 39 new links to download from the first URL provided. Since the crawler had not yet located 100 links, it continued to download HTML and search for new links within those 39 new URLs. The crawler then found an additional 2,541 links in the HTML of the 39 pages during its third recursive pass. However, only a fraction of those links were actually downloaded, since we set a threshold of 100 total webpage downloads.
Downloads Requested: 1
3/18/2015 7:54:10 PM
###################################################################
http://www.jakemdrew.com/|Success
Downloads Requested: 39
3/18/2015 7:54:10 PM
###################################################################
http://browsehappy.com/|Success
http://blog.jakemdrew.com/About|Success
http://blog.jakemdrew.com/|Success
http://www.facebook.com/jakemdrew|Success
…
Downloads Requested: 2541
3/18/2015 7:54:21 PM
###################################################################
http://www.google.com/chrome|Success
http://www.apple.com/safari/|Success
http://www.opera.com/|Success
http://www.firefox.com/|Success
…
Figure 11 – The Webpage Downloader Log.
Conclusion
It is clear that sizable performance gains can be achieved by using a parallel download approach when mining data from the web. However, there is a diminishing return on adding additional concurrent workers to the download process: while moving from 2 to 50 concurrent workers reduced the execution time from roughly 45 seconds down to 16, increasing from 50 to 100 workers actually increased the total download time to almost 18 seconds. The optimal level of concurrency depends on numerous factors, including both processing power and connection speed, so carefully choosing the appropriate parameters can dramatically impact execution times when dealing with larger volumes of data. It is also worth mentioning that, in certain situations, utilizing an anonymity network such as TOR [8] can be very beneficial for avoiding download performance penalties or blocks which some webpage providers place at the IP level.
References
- International Data Corporation (IDC), The Digital Universe in 2020, http://www.emc.com/collateral/analyst-reports/idc-digital-universe-united-states.pdf, accessed on 03/18/2015.
- Sridhar Pappu, Hewlett Packard, To Handle the Big Data Deluge, HP Plots a Giant Leap Forward, https://ssl.www8.hp.com/hpmatter/issue-no-1-june-2014/handle-big-data-deluge-hp-plots-giant-leap-forward?jumpid=sc_pur3bc5n8v/dm:_N5823.186294OUTBRAININC_109138141_282642904_0_2879120, accessed on 03/18/2015.
- DOMO, Data Never Sleeps 2.0, http://www.domo.com/learn/data-never-sleeps-2, accessed on 03/18/2015.
- Jake Drew, Getting Only The Text Displayed On A Webpage Using C#, http://www.codeproject.com/Articles/587458/GettingplusOnlyplusTheplusTextplusDisplayedplusOnp, accessed on 03/18/2015.
- Jake Drew, Clustering Similar Images Using MapReduce Style Feature Extraction with C# and R, http://blog.jakemdrew.com/2014/06/26/clustering-similar-images-using-mapreduce-style-feature-extraction-with-c-and-r/, accessed on 03/18/2015.
- Jake Drew and Tyler Moore, Optimized combined-clustering methods for finding replicated criminal websites, http://www.jakemdrew.com/pubs/Optimized_Combined_Clustering_EURASIP_2014_Final.pdf, accessed on 03/18/2015.
- Jake Drew, Creating N-grams Using C#, http://www.codeproject.com/Articles/582095/CreatingplusN-gramsplusUsingplusC, accessed on 03/18/2015.
- TOR, Homepage, https://www.torproject.org/projects/torbrowser.html.en, accessed on 03/18/2015.