
Simple WebPageCheck (Spider)

10 Jan 2007
Small application that checks a list of websites for specified text
Sample Image - WebPageCheck.gif

Introduction

The goal: an application that checks whether each site in a list exists and, if it does, whether it contains a specified text.

I designed the GUI shown in the picture above. The interface has three multiline textboxes:

  1. txtImitialList (paste the list of URLs to check here, one URL per line)
  2. txtGood (URLs that pass the check, meaning the page exists and contains the text we are searching for, appear here)
  3. txtBad (URLs that fail the check appear here)
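
Since the list is pasted one URL per line, the text has to be split into clean entries before checking. The following is a minimal sketch of one way to do that; the class name UrlListParser and its Parse method are hypothetical and not part of the original application. It assumes the pasted text may use Windows line endings and may contain blank lines.

```csharp
using System;
using System.Collections.Generic;

static class UrlListParser
{
    // Hypothetical helper: turns pasted text into a list with one URL per entry.
    public static List<string> Parse(string text)
    {
        List<string> urls = new List<string>();
        if (text == null)
            return urls;

        // Split on both '\r' and '\n' so that "\r\n" line endings
        // do not leave a stray '\r' attached to each URL.
        string[] lines = text.Split(new char[] { '\r', '\n' },
                                    StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            // Drop surrounding whitespace and skip lines that are only spaces.
            string url = line.Trim();
            if (url.Length > 0)
                urls.Add(url);
        }
        return urls;
    }
}
```

With this helper, blank lines in the pasted list are skipped instead of being reported as bad pages.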

We also need a textbox for the text we are looking for, txtMustContain, and a check box, chkCaseSensitive, that selects a case-sensitive or case-insensitive search.

Finally, btnStart is the button that starts the process. The main job is done by a class that has only one static function:

public static string WebFetch(string url)

This function receives a string argument, the URL of the page, and returns the source of that page, as a string.

using System;
using System.Text;
using System.Net;
using System.IO;

namespace WindowsApplication1
{
    class WebFetchClass
    {
        public static string WebFetch(string url)
        {
            // used to build entire input
            StringBuilder sb = new StringBuilder();

            // used on each read operation
            byte[] buf = new byte[8192];

            // prepare the web page we will be asking for
            HttpWebRequest request = (HttpWebRequest)
                WebRequest.Create(url);

            // execute the request; the response and its stream are
            // disposed as soon as the page has been read
            using (HttpWebResponse response = (HttpWebResponse)
                request.GetResponse())
            using (Stream resStream = response.GetResponseStream())
            {
                int count;

                do
                {
                    // fill the buffer with data
                    count = resStream.Read(buf, 0, buf.Length);

                    // make sure we read some data
                    if (count != 0)
                    {
                        // translate from bytes to ASCII text
                        // (non-ASCII bytes are decoded as '?')
                        sb.Append(Encoding.ASCII.GetString(buf, 0, count));
                    }
                }
                while (count > 0); // any more data to read?
            }

            // return page source
            return sb.ToString();
        }
    }
}

Don't forget to include the namespaces System.Net (for HttpWebResponse and HttpWebRequest) and System.IO (for stream functions):

using System.Net;
using System.IO;
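
The read loop inside WebFetch works on any Stream, so it can be tried without a network connection. The sketch below pulls the loop into a separate method and is only an illustration; the names StreamText and ReadAllAscii are hypothetical and do not appear in the original code.

```csharp
using System;
using System.IO;
using System.Text;

static class StreamText
{
    // Hypothetical helper: reads a stream to its end in 8 KB chunks and
    // decodes the bytes as ASCII, mirroring the loop inside WebFetch.
    public static string ReadAllAscii(Stream input)
    {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[8192];
        int count;

        // Read returns 0 only when the end of the stream has been reached.
        while ((count = input.Read(buf, 0, buf.Length)) > 0)
        {
            sb.Append(Encoding.ASCII.GetString(buf, 0, count));
        }
        return sb.ToString();
    }
}
```

Passing a MemoryStream built from Encoding.ASCII.GetBytes("...") returns the same text back, which makes the loop easy to check in isolation. Decoding chunk by chunk is safe here only because ASCII uses one byte per character; a multi-byte encoding could be split across two reads.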

In the form1.cs file, I wrote three functions to make the program easier to understand. Each function does almost the same job, with some small differences.

The three functions are:

  • CheckPageLoad()          // checks only whether the specified page exists on the server
  • DoCheckCaseSensitive()   // case-sensitive check for the specified text
  • DoCheck()                // case-insensitive check for the specified text

The code for these functions is here (url_arr is a string[] field of the form, declared outside the functions):

        private void CheckPageLoad()
        {
            int totalLinks;
            int count = 0;
            url_arr = txtImitialList.Text.Split('\n');
            totalLinks = url_arr.Length;

            for (int i = 0; i < totalLinks; i++)
            {
                count++;
                try
                {
                    if (WebFetchClass.WebFetch(url_arr[i]).Trim().Length > 10)
                        txtGood.Text += url_arr[i] + "\n";
                    else
                        txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                    txtGood.Update();
                }
                catch
                {
                    txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                }
                lblStatusCurrentPos.Text = count.ToString() + "/" + 
                        totalLinks.ToString();
                this.Update();
            }
        }
        private void DoCheckCaseSensitive()
        {
            int totalLinks;
            int count = 0;
            string to_check = txtMustContain.Text.Trim();
            url_arr = txtImitialList.Text.Split('\n');
            totalLinks = url_arr.Length;

            for (int i = 0; i < totalLinks; i++)
            {
                count++;
                try
                {
                    // IndexOf returns -1 when the text is absent; 0 is a valid match
                    if (WebFetchClass.WebFetch(url_arr[i]).Trim().IndexOf(to_check) >= 0)
                        txtGood.Text += url_arr[i] + "\n";
                    else
                        txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                    txtGood.Update();
                }
                catch
                {
                    txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                }
                lblStatusCurrentPos.Text = count.ToString() + "/" + 
                    totalLinks.ToString();
                this.Update();
            }
        }
        private void DoCheck()
        {
            int totalLinks;
            int count = 0;
            string to_check = txtMustContain.Text.Trim().ToLower();
            url_arr = txtImitialList.Text.Split('\n');
            totalLinks = url_arr.Length;

            for (int i = 0; i < totalLinks; i++)
            {
                count++;
                try
                {
                    // IndexOf returns -1 when the text is absent; 0 is a valid match
                    if (WebFetchClass.WebFetch(url_arr[i]).Trim().ToLower().IndexOf
                                (to_check) >= 0)
                        txtGood.Text += url_arr[i] + "\n";
                    else
                        txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                    txtGood.Update();
                }
                catch
                {
                    txtBad.Text += url_arr[i] + "\n";
                    txtBad.Update();
                }
                lblStatusCurrentPos.Text = count.ToString() + "/" + 
                    totalLinks.ToString();
                this.Update();
            }
        }
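
The three functions above differ only in the test they apply to the page source. As a rough sketch, that test could live in a single helper; the names PageCheck and Matches are hypothetical. This version assumes ordinal string comparison, which is a swapped-in choice and differs from the culture-sensitive ToLower() and IndexOf calls in the listings above, and it treats a match at position 0 as found.

```csharp
using System;

static class PageCheck
{
    // Hypothetical helper combining the three checks in one place.
    public static bool Matches(string pageSource, string mustContain, bool caseSensitive)
    {
        if (pageSource == null)
            return false;

        string source = pageSource.Trim();
        string wanted = (mustContain ?? "").Trim();

        // No search text: only check that the page returned some content,
        // using the same "more than 10 characters" rule as CheckPageLoad.
        if (wanted.Length == 0)
            return source.Length > 10;

        // Assumption: ordinal comparison is acceptable for this search.
        StringComparison mode = caseSensitive
            ? StringComparison.Ordinal
            : StringComparison.OrdinalIgnoreCase;

        // IndexOf returns -1 when the text is not found.
        return source.IndexOf(wanted, mode) >= 0;
    }
}
```

Each loop body would then reduce to one call such as PageCheck.Matches(WebFetchClass.WebFetch(url), txtMustContain.Text, chkCaseSensitive.Checked), which removes the need for three nearly identical loops.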

OK, now let's start the application with the Start button.

private void btnStart_Click(object sender, EventArgs e)

This is the event that is raised when the Start button is pressed. Let's write some code for this event:

        // requires: using System.Threading;
        private void btnStart_Click(object sender, EventArgs e)
        {
            // clear any previous results
            txtBad.Clear();
            txtGood.Clear();
            lblStatusCurrentPos.Text = "Starting...";

            // pick the worker that matches the current settings
            ThreadStart worker;
            if (txtMustContain.Text.Trim() == "")
                worker = new ThreadStart(CheckPageLoad);        // no search text: existence check only
            else if (chkCaseSensitive.Checked)
                worker = new ThreadStart(DoCheckCaseSensitive); // case-sensitive search
            else
                worker = new ThreadStart(DoCheck);              // case-insensitive search

            // run the check on a background thread so the window stays responsive
            Thread t = new Thread(worker);
            t.IsBackground = true;
            t.Start();
        }

As you can see, the function that accesses the Web runs in a separate thread, so the main window does not freeze while the process runs.
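
The background-thread pattern can also be sketched independently of the form. In the example below, BackgroundChecker and its parameters are hypothetical names: fetch stands in for WebFetchClass.WebFetch, isGood for the text check, and report for whatever displays the result. It assumes a framework version that provides Func, Action and lambda expressions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class BackgroundChecker
{
    // Starts a background thread that fetches each URL in turn and reports
    // (url, isGood). Note that 'report' is called on the worker thread.
    public static Thread Start(IList<string> urls,
                               Func<string, string> fetch,
                               Func<string, bool> isGood,
                               Action<string, bool> report)
    {
        Thread t = new Thread(() =>
        {
            foreach (string url in urls)
            {
                bool good;
                try
                {
                    good = isGood(fetch(url));
                }
                catch (Exception)
                {
                    // Any failure while fetching counts as a bad page.
                    good = false;
                }
                report(url, good);
            }
        });
        t.IsBackground = true; // do not keep the process alive for this thread
        t.Start();
        return t;
    }
}

// In a form, 'report' would typically marshal back to the UI thread, e.g.:
//   (url, good) => this.Invoke(new MethodInvoker(delegate {
//       (good ? txtGood : txtBad).AppendText(url + Environment.NewLine);
//   }));
```

Keeping the loop free of control access makes it possible to run it with a fake fetch function, and leaves the decision about how to update the window to the caller.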

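This is a very simple application with minimal error handling: any failure simply sends the URL to txtBad. Note that the worker functions update the controls directly from the background thread; Windows Forms controls should only be accessed from the thread that created them, so a more robust version would marshal those updates through Control.Invoke. The application can also be improved by adding more threads or more detailed error checking.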
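
As one possible direction for the error checking, the exception caught in each loop could be turned into a short reason instead of being discarded. This is only a sketch under that assumption; FetchErrors and Describe are hypothetical names, and the original catch blocks do not capture the exception at all.

```csharp
using System;
using System.Net;

static class FetchErrors
{
    // Hypothetical helper: maps an exception thrown while fetching a page
    // to a short, human-readable reason.
    public static string Describe(Exception ex)
    {
        WebException web = ex as WebException;
        if (web != null)
        {
            // For protocol errors the server did answer, with an HTTP status such as 404.
            HttpWebResponse response = web.Response as HttpWebResponse;
            if (web.Status == WebExceptionStatus.ProtocolError && response != null)
                return "HTTP " + (int)response.StatusCode;

            // Otherwise report the transport-level status (Timeout, ConnectFailure, ...).
            return web.Status.ToString();
        }

        // WebRequest.Create throws UriFormatException for a malformed URL.
        if (ex is UriFormatException)
            return "invalid URL";

        // Fall back to the exception type name for anything else.
        return ex.GetType().Name;
    }
}
```

A loop could then use catch (Exception ex) and append url + " (" + FetchErrors.Describe(ex) + ")" to the bad list, so that a typo in a URL can be told apart from a server that is down.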

History

  • 10th January, 2007: Initial post

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
