Web Extraction and Submission toolkit (Library and tool)

inuka_g

4.10/5 (10 votes)

27 Jul 20056 min read

1.8K

A tool and an extensible library that allows the user to interact with web based forms.

Introduction

This article covers:

Extracting data, and performing GET/POST submissions using HTTPWebRequest and HTTPWebResponse.
Parsing regular expressions through RegEx. This toolkit has two parts:
- An extensible library for extracting and submitting forms.
- A tool that implements the library, so that a user can interact with tags in a form directly.

This is a toolkit that provides a simple HTML parser in order to allow the user to interact with web based forms. There are two parts to this toolkit.

An extensible library that can perform simple HTML parsing as well as form extraction and submission. There is also an extensible object that forms an adapter with a DataSet and web based information. An implementation is given for inserting information in a simple table into a DataSet.
A form extraction tool that implements the library. This tool allows the user to extract information from web based forms. This tool displays all the form tags as well as all the tags inside the form tags. Furthermore the submission method and the URL of each form tag is displayed as well. The user can change the values and attributes of any of these tags and submit it to see the results.

The whole thing started as an experiment, a proof of concept of sorts before I started doing some data extraction. Since I implemented quite a bit of functionality, and experimented quite a bit, I figured somebody might benefit from my folly :). This is not the fastest or the most elegant solution, but it probably is one of the simplest complete C# solutions. If you want an even simpler solution you can use the WebControl, but that's a topic of another write up.

Background

Extracting data from a web page

The main functionality of this program depends on extracting and submitting data to a web server. To do this the HttpWebRequest and the HttpWebResponse objects are used. This is how it's done, quite simple really:

 //String Page 
public void getUrlData(String url) 
{
    conn = url;  
    // the URL
    HttpWebRequest myHttpWebRequest = 
         (HttpWebRequest)WebRequest.Create(conn);  
    // create a request                       
    HttpWebResponse myHttpWebResponse = 
         (HttpWebResponse)myHttpWebRequest.GetResponse(); 
    //get the respnse
    Stream receiveStream = myHttpWebResponse.GetResponseStream(); 
    //read in the response stream
    Encoding encode = System.Text.Encoding.GetEncoding("utf-8");  
    StreamReader readStream = 
         new StreamReader( receiveStream, encode );
     //convert the whole thing to a string
    page = readStream.ReadToEnd();
    //..... some good stuff     using RegeEx
}

Submitting data to a form (POST and GET)

The other part to this tool is submitting the data. There are two ways of doing this; GET and POST. The link I have provided above gives you a detailed explanation, but as far as programming is concerned GET is encoded in the URL while POST is in the HTTP body.

The values are passed using key value pairs. They are passed in the format:

name1=value1&name2=value2

For a GET request the values are passed in the format:

URL?name1=value1&name2=value2

Keep in mind that the names and values must be encoded according to the standard. Certain characters (space etc.) must be passed as ASCII value between the two % signs. Read the standard for more info. Now for the code:

The simplest of the two to implement is the GET. So, here it is:

//FormTag.cs

public string postURL()
{
    // ... do stuff well does the post and value name paring here
    // valpairs have the name value pairs set according to the 
    // standards, this.formurl holds the url.

    else
    {
        webreq = 
          (HttpWebRequest)WebRequest.Create(this.formurl+"?"+valpairs);
            
    }

    res= (HttpWebResponse)webreq.GetResponse();
    Stream receiveStream = res.GetResponseStream();
    Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
    // Pipes the stream to a higher level stream reader with
    // the required encoding format. 
    StreamReader readStream = new StreamReader( receiveStream, encode );
    // Reads 256 characters at a time.
    return readStream.ReadToEnd(); // returns the stream
}

And now for the POST. Since POST requires the data in the HTTP header a little bit of ASYNC talk has to happen. I got it off the MSDN article I posted in the background section. This is the top bit of the method posted above and a bit more.

public static ManualResetEvent allDone = 
                  new ManualResetEvent(false);
private static byte[] b;

/// <summary>
/// This method is used to submit the 
/// form data to a given webpage.
/// The relevant methods in this object 
/// must be called inorder to 
/// set the relevant properites of this object
/// beforehand.
/// </summary>
/// <returns></returns>
public string postURL()
{
    //UpdateTagSet();
    if(formurl==null) return null;
        
    HttpWebRequest webreq ;
    HttpWebResponse res;
    string valpairs="";
    IEnumerator i= valuelist.Keys.GetEnumerator();
    while(i.MoveNext())
    {
        ValueNameTag v=
               (ValueNameTag)valuelist[i.Current];
        v.SetNVPairs(ref valpairs);
    }
        
        
    // Set the content length of the 
    // string being posted.
    if(!Regex.IsMatch(formmethod,"get",
                        RegexOptions.IgnoreCase))
    {
        ASCIIEncoding encoding=new ASCIIEncoding();
        b = encoding.GetBytes(valpairs);
        webreq = 
          (HttpWebRequest)WebRequest.Create(this.formurl);
        webreq.ContentLength=b.Length;
        //webreq.AllowWriteStreamBuffering = true;
        webreq.Method = this.formmethod.ToUpper();
        webreq.ContentType=
                "application/x-www-form-urlencoded";
        webreq.BeginGetRequestStream(
                new AsyncCallback(ReadCallback), webreq);
        allDone.WaitOne();        
            
    }
    else
    {
        webreq = (HttpWebRequest)WebRequest.Create(
                             this.formurl+"?"+valpairs);
            
    }

    res= (HttpWebResponse)webreq.GetResponse();
    Stream receiveStream = res.GetResponseStream();
    Encoding encode = 
         System.Text.Encoding.GetEncoding("utf-8");
    // Pipes the stream to a higher 
    // level stream reader with
    // the required encoding format. 
    StreamReader readStream = 
            new StreamReader( receiveStream, encode );
    // Reads 256 characters at a time.
    return readStream.ReadToEnd();

}

private static void ReadCallback(
                   IAsyncResult asynchronousResult)
{    
    HttpWebRequest request = 
       (HttpWebRequest)asynchronousResult.AsyncState;
    // End the operation.
    Stream newStream= 
       request.EndGetRequestStream(asynchronousResult);
    newStream.Write(b,0,b.Length);                
    newStream.Close(); 
    allDone.Set();  
}

Parsing the tags

The data that is read from a web page is read as a text stream. Since each tag needs to be manipulated, the tags must be extracted from the stream. The solution is regular expressions and the RegEx object/toolkit.

Here is a simple example:

public class TextAreaTag : Tag
{
    string parsestring=@"<\s*textarea[\s|\w\|=]*>";
    // doing stuff
    
    //extracting data here
    public override ArrayList GetValueTags(String page) 
    {
        
        ArrayList vals= new ArrayList();
        Match m = Regex.Match(page,parsestring,
                      RegexOptions.IgnoreCase); //match
        while(m.Success) //go through each match
        {
            
            //extracting the name whats inside 
            //the brackets is captured into a group
            string name =@"name\s*=\s*"+'"'+@"(\w+)"+'"'; 
                                                          
            string curr = m.ToString();
            Match name_match = Regex.Match(curr,name,
                                RegexOptions.IgnoreCase);
            ValueNameTag tg;
            if(name_match.Success )
            {
              //Groups are 1 based indecies, so this is the
              //the group captured above(first group)
              tg = 
                 new ValueNameTag(name_match.Groups[1].ToString(), 
                                               TagTypes.TEXTAREA);
            }    
            else
            {
                throw new SystemException("Incorrect " + 
                             "Area tag no name found", null);
            }
            vals.Add(tg);
            m.NextMatch();            
        }
        return vals;

    }
}

There are probably better ways to write the expressions. I just found it easier to break up the whole thing into segments.

The design

We have figured out how to extract data, submit data, parse data and now what remains to be done is to put it all together.

The premise behind the design is that an HTML page can contain many FORM tags. The FORM tag in turn contains fields that describe the form. The HTMLSet is the facade to this whole setup. It also holds a list of FORM tags. The user is able to interact with these FORM tags by getting a reference (an instance) of one of the FORM tags in the HTMLSet. Each tag inside a form is stored as a ValueNameTag no matter what type it is. The ValueNameTag does hold an enumerated type that describes what type it is.

The builder pattern is used to generate the ValueNameTags. Each tag that inherits the Tag class contains information on how the specified HTML tag is parsed and how the ValueNameTags should be generated. Note that the Input tag can be used to describe many tag types (see W3 standards above).

The tool

The Web Extraction tool implements the library that was discussed earlier. Below is a screenshot of it:

In order to interact with a web site, enter the URL into textbox (1) and click on submit. Once the web page is parsed the tag information will be displayed. Select a FORM tag from ListBox (6) and the tag information for that page will be displayed in ListBox (7). If a tag is selected in ListBox (7), the value information will be displayed in boxes 9 and 11.

To set a value for tag, select the form, and then the tag. Then, either select the value from the value list in textbox (9) and click the select button, or add a value to the value list by typing in the value to textbox (13) and then click on the add button. The values that are selected are displayed in box (11). Please note that multiple values are only allowed if the tag is specified as multiselect (see the standards).

Once the values are set, select the desired form from box (12), and click the submit button (15).

Note that the JavaScript in the form (except listeners) is displayed in box 12. It helps to figure out why certain pages won't work.

Points of interest

I just want to point out that I don't format the values according to the standard (replacing spaces with the ASCII values and such), but it still seems to work. I am not sure if it really works for all web pages but because it worked when I tested I have been too lazy to implement the formatting.

If you experiment with the program, you will begin to realize that certain pages don't work (dice for example). Most of the time this is because there are JavaScripts involved that manipulate the values submitted. The solution for this is to use the WebControl (IE DLLs) to submit the form.

Conclusion

Oops for some reason, links to the source and executable didn't work. Apologies.

Home page

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here