This article presents a simple parallel web crawler with a GUI, in about 150 lines of code. We first design the GUI and then concentrate on the web crawler code itself.
Introduction
This article uses the U++ framework. Please refer to Getting Started with Ultimate++ for an introduction to the environment.
The U++ framework provides an HttpRequest class capable of asynchronous operation. In this example, we will exploit this capability to construct a simple single-threaded web crawler using up to 60 parallel HTTP connections.
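To give a feel for what asynchronous operation means here, this is roughly the nonblocking pattern the whole crawler is built on (a condensed sketch; the URL is just a placeholder, and only calls that appear later in the article are used):

HttpRequest r;
r.Url("www.example.com")   // placeholder URL
 .Timeout(0);              // Timeout(0) switches the request to nonblocking mode
while(r.InProgress()) {    // InProgress() stays true until success or failure
    SocketWaitEvent we;
    we.Add(r);             // wait until the socket can be read or written...
    we.Wait(10);           // ...but at most 10 ms
    r.Do();                // advance the request by one step
}
String html = r.GetContent();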
Designing the GUI
We shall provide a simple GUI to display the crawling progress:
First of all, we shall design a simple GUI layout for our application. The GUI here is fairly simple, but it is still just about worth using the layout designer:
The layout consists of three ArrayCtrl widgets, which are basically tables. We shall use work to display the progress of individual HTTP requests, finished to display the results of HTTP requests that have ended, and, to have some fun, path, which for any line of finished shows the 'path' of URLs from the seed URL to the finished URL.
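For reference, the .lay file produced by the layout designer is just a plain text file along these lines (the widget positions and sizes below are made-up placeholders; only the layout name and the three ArrayCtrl members matter):

LAYOUT(CrawlerLayout, 632, 440)
    ITEM(ArrayCtrl, work, LeftPosZ(8, 200).VSizePosZ(8, 8))
    ITEM(ArrayCtrl, finished, HSizePosZ(216, 8).VSizePosZ(8, 240))
    ITEM(ArrayCtrl, path, HSizePosZ(216, 8).BottomPosZ(8, 224))
END_LAYOUT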
Now let us use this layout and set a few things up in the code:
#define LAYOUTFILE <GuiWebCrawler/GuiWebCrawler.lay>
#include <CtrlCore/lay.h>
struct WebCrawler : public WithCrawlerLayout<TopWindow> {
    WebCrawler();
};
WebCrawler will be the main class of our application. The weird #include before it 'imports' the designed layout into the code; specifically, it defines the WithCrawlerLayout template class that represents our layout. By deriving from it, we add the work, finished and path ArrayCtrl widgets as member variables of WebCrawler. We shall finish setting things up in the WebCrawler constructor:
WebCrawler::WebCrawler()
{
    CtrlLayout(*this, "WebCrawler");
    work.AddColumn("URL");
    work.AddColumn("Status");
    finished.AddColumn("Finished");
    finished.AddColumn("Response");
    finished.WhenCursor = [=] { ShowPath(); };
    finished.WhenLeftDouble = [=] { OpenURL(finished); };
    path.AddColumn("Path");
    path.WhenLeftDouble = [=] { OpenURL(path); };
    total = 0;
    Zoomable().Sizeable();
}
CtrlLayout is a WithCrawlerLayout method that places the widgets into their designed positions. The rest of the code sets up the lists' columns and connects some user actions on the widgets with the corresponding methods of WebCrawler (we shall add these methods later).
Data Model
Now, with the boring GUI stuff out of the way, we can concentrate on the fun part: the web crawler code. First, we will need some structures to keep track of things:
struct WebCrawler : public WithCrawlerLayout<TopWindow> {
    VectorMap<String, int> url;                   // key: url, value: index of the 'parent' url
    BiVector<int>          todo;                  // queue of url indices waiting to be crawled
    struct Work { HttpRequest http; int urli; };  // a single active request plus its url index
    Array<Work> http;                             // concurrent requests in progress
    int64       total;                            // total bytes downloaded so far
VectorMap is a unique U++ container that can be thought of as a mix of array and map. It provides index-based access to keys and values and a quick way to find the index of a key. We will use url to avoid duplicate URL requests (the URL goes into the key), and we shall store the index of the 'parent' URL as the value, so that we can later display the path from the seed URL.
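As a minimal sketch of how this mix of array and map is used here (the URL is just a placeholder; every call shown also appears in the crawler code below):

VectorMap<String, int> url;
url.Add("www.example.com", 0);        // key = the url, value = index of its parent url
int i = url.Find("www.example.com");  // quick lookup: index of the key, negative if absent
String key = url.GetKey(i);           // index-based access to the key...
int parent = url[i];                  // ...and to the value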
Next, we have a queue of URLs to process. When extracting URLs from HTML, we will put them into the url VectorMap. That means each URL has a unique index in url, so the queue only needs to store indices; this is todo.
Finally, we need a buffer to keep our concurrent requests in. The processing record Work simply combines an HttpRequest with the URL index (just so we know which URL we are trying to process). Array is the U++ container capable of storing objects that do not have any form of copy.
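A quick aside on that choice: since HttpRequest has no copy, Work cannot live in a container that requires one; Array instead constructs elements in place and hands out references. A minimal sketch (the URL is a placeholder):

Array<Work> http;
Work& w = http.Add();            // Add() constructs the element in place and returns a reference
w.urli = 0;
w.http.Url("www.example.com");   // fill the fields through the reference
http.Remove(0);                  // elements are removed by index, as the main loop does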
The Main Loop
We have the data model, so let us start writing the code. Simple things first: let us ask the user for the seed URL:
void WebCrawler::Run()
{
    String seed = "www.codeproject.com";
    if(!EditText(seed, "GuiWebSpider", "Seed URL"))
        return;
    todo.AddTail(0);
    url.Add(seed, 0);
The seed is the first URL, so we know it will have index 0. We shall simply add it to url and todo. Now the real work begins:
    Open();
    while(IsOpen()) {
        ProcessEvents();
We shall be running this loop until the user closes the window, and we need to process GUI events inside it. The rest of the loop handles the real work:
        while(todo.GetCount() && http.GetCount() < 60) {
            int i = todo.Head();
            todo.DropHead();
            Work& w = http.Add();
            w.urli = i;
            w.http.Url(url.GetKey(i))
                  .UserAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20100101 Firefox/11.0")
                  .Timeout(0);
            work.Add(url.GetKey(i));
            work.HeaderTab(0).SetText(Format("URL (%d)", work.GetCount()));
        }
If we have something in todo and fewer than 60 concurrent requests, we add a new concurrent request.
The next thing to do is to progress all active HTTP requests. The HttpRequest class does this with its Do method, which, in nonblocking mode, tries to advance the connection request by one step. All we need to do is call this method for all active requests and then read the status. However, even though it would be possible to do this in 'active' mode without waiting for actual socket events, a well-behaved program should first wait until the socket can be read from or written to, to save system resources. U++ provides the SocketWaitEvent class exactly for this:
        SocketWaitEvent we;
        for(int i = 0; i < http.GetCount(); i++)
            we.Add(http[i].http);
        we.Wait(10);
The only problem here is that SocketWaitEvent waits only on sockets, and we also have a GUI to run. We work around this by setting the maximum wait limit to 10 ms (we know that within that time at least a periodic timer event will arrive that needs to be processed by ProcessEvents).
With this issue cleared, we can go on to actually processing the requests:
        int i = 0;
        while(i < http.GetCount()) {
            Work& w = http[i];
            w.http.Do();
            String u = url.GetKey(w.urli);
            int q = work.Find(u);
            if(w.http.InProgress()) {
                if(q >= 0)
                    work.Set(q, 1, w.http.GetPhaseName());
                i++;
            }
            else {
                String html = w.http.GetContent();
                total += html.GetCount();
                finished.Add(u, w.http.IsError() ? String().Cat() << w.http.GetErrorDesc()
                                                 : String().Cat() << w.http.GetStatusCode()
                                                       << ' ' << w.http.GetReasonPhrase()
                                                       << " (" << html.GetCount() << " bytes)",
                             w.urli);
                finished.HeaderTab(0).SetText(Format("Finished (%d)", finished.GetCount()));
                finished.HeaderTab(1).SetText(Format("Response (%` KB)", total >> 10));
                if(w.http.IsSuccess()) {
                    ExtractUrls(html, w.urli);
                    Title(AsString(url.GetCount()) + " URLs found");
                }
                http.Remove(i);
                work.Remove(q);
            }
        }
This loop seems complex, but most of the code deals with updating the GUI. The HttpRequest class has a handy GetPhaseName method that describes what is currently going on in the request. InProgress returns true until the request is finished (either with success or with some kind of failure). If the request succeeds, we use ExtractUrls to pull new URLs to test out of the HTML code.
Getting New URLs
For simplicity, ExtractUrls is quite a naive implementation: all we do is scan for "http://" or "https://" strings and then read the following characters that look like part of a URL:
bool IsUrlChar(int c)
{
    return c == ':' || c == '.' || IsAlNum(c) || c == '_' || c == '%' || c == '/';
}

void WebCrawler::ExtractUrls(const String& html, int srci)
{
    int q = 0;
    while(q < html.GetCount()) {
        int http = html.Find("http://", q);
        int https = html.Find("https://", q);
        q = min(http < 0 ? https : http, https < 0 ? http : https);
        if(q < 0)
            return;
        int b = q;
        while(q < html.GetCount() && IsUrlChar(html[q]))
            q++;
        String u = html.Mid(b, q - b);
        if(url.Find(u) < 0) {
            todo.AddTail(url.GetCount());
            url.Add(u, srci);
        }
    }
}
We put all candidate URLs into url and todo to be processed by the main loop.
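To make the behaviour concrete, here is a hypothetical call (the HTML fragment is made up for illustration):

// Assume srci == 0, i.e. the link was found on the seed page.
String html = "<a href=\"https://www.example.com/docs/page_1\">docs</a>";
ExtractUrls(html, 0);
// "https://www.example.com/docs/page_1" is not yet in 'url', so it is added with
// value 0 (its parent index) and its new index is queued in 'todo' for the main loop.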
Final Touches
At this point, all the hard work is done. The rest of the code is just two convenience functions. The first one opens a URL when a row of the finished or path list is double-clicked:
void WebCrawler::OpenURL(ArrayCtrl& a)
{
    String u = a.GetKey();
    WriteClipboardText(u);
    LaunchWebBrowser(u);
}
(As a bonus, we also put the URL on the clipboard.)
The other function fills the path list to show the path from the seed URL to the URL selected in the finished list:
void WebCrawler::ShowPath()
{
    path.Clear();
    if(!finished.IsCursor())
        return;
    int i = finished.Get(2);            // the hidden third column holds the url index
    Vector<String> p;
    for(;;) {
        p.Add(url.GetKey(i));
        if(i == 0)                      // index 0 is the seed url
            break;
        i = url[i];                     // move to the parent url
    }
    for(int i = p.GetCount() - 1; i >= 0; i--)
        path.Add(p[i]);
}
Here, we are using the 'double nature' of VectorMap to traverse from the child URL back to the seed using indices.
The only little piece of code still missing is MAIN:
GUI_APP_MAIN
{
    WebCrawler().Run();
}
And there we go: a simple parallel web crawler with a GUI, in about 150 lines of code.
Useful Links
History
- 5th May, 2020: Initial version
- 7th May, 2020: ExtractUrls now scans for "https" too
- 20th September, 2020: Crosslink with "Getting started" article