Simple Web Scraper

Rumman92

14 Mar 2013

A simple web scraper that loads only the readable contents of a website.

Introduction

A web scraper is a tool that downloads the contents of a web page. Because the download includes all of the page's markup, scripts, and styles, I format the result so that only the readable text remains.

Using the code

You can use this code in a console application or in a Windows or web application. I used a console application because it keeps this introduction simple.

In the console application, add the following namespaces:

using System.Net; // to handle internet operations
using System.IO; // to use streams
using System.Text.RegularExpressions; // To format the loaded data

Loading the content

Create the WebRequest and WebResponse objects.

string url = "http://www.google.com"; // the page to load
WebRequest request = WebRequest.Create(url);
WebResponse response = request.GetResponse();

Create a StreamReader over the response stream, read the whole response into a string variable, and close the reader.

StreamReader sr=new StreamReader(response.GetResponseStream()); 
string result=sr.ReadToEnd();
sr.Close();

To view the unformatted result, simply write it on the console.

Console.WriteLine(result);
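The loading steps above can be collected into one small program. This is a minimal sketch rather than a definitive implementation: the names PageLoader, ReadAll, and Load are hypothetical, and the using blocks and the WebException handler are additions that the snippets above do not include. The using blocks make sure the response and the reader are released even if reading fails. On recent .NET versions WebRequest is marked obsolete in favor of HttpClient; the sketch keeps the API used in this article.

```csharp
using System;
using System.IO;
using System.Net;

static class PageLoader // hypothetical name, for illustration only
{
    // Reads everything from a stream as text; the reader is disposed on exit.
    public static string ReadAll(Stream stream)
    {
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

    // Downloads the body of the page at the given URL as a string.
    public static string Load(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        {
            return ReadAll(response.GetResponseStream());
        }
    }

    static void Main()
    {
        try
        {
            // Same example URL as in the snippet above.
            Console.WriteLine(Load("http://www.google.com"));
        }
        catch (WebException ex)
        {
            // Raised for network failures and HTTP error responses.
            Console.WriteLine("Request failed: " + ex.Message);
        }
    }
}
```

Keeping ReadAll separate from Load means the reading step can be tried on any stream, for example a MemoryStream, without a network connection.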

Formatting the result

To format the result, we use the static Regex.Replace method to strip everything that is not readable text.

result = Regex.Replace(result, "<script.*?</script>", "",
  RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove scripts
result = Regex.Replace(result, "<style.*?</style>", "",
  RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove inline stylesheets
result = Regex.Replace(result, "</?[a-z][a-z0-9]*[^<>]*>", "",
  RegexOptions.IgnoreCase); // Remove HTML tags, including upper-case ones
result = Regex.Replace(result, "<!--.*?-->", "", RegexOptions.Singleline); // Remove HTML comments
result = Regex.Replace(result, "<!.*?>", "", RegexOptions.Singleline); // Remove Doctype
result = Regex.Replace(result, @"\s+", " "); // Collapse runs of whitespace into a single space

Now print the results on screen.

Console.WriteLine(result);
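To see the formatting steps work without a network connection, they can be wrapped in a method and run on a small in-memory page. The sketch below makes some assumptions: HtmlText and Strip are hypothetical names, the sample markup is invented for illustration, and the patterns are a slightly adjusted version of the ones above (case-insensitive tag matching, comments removed before tags, whitespace collapsed with \s+). Regular expressions only approximate HTML parsing, so unusual markup may still slip through.

```csharp
using System;
using System.Text.RegularExpressions;

static class HtmlText // hypothetical name, for illustration only
{
    const RegexOptions Options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Returns the readable text of an HTML string.
    public static string Strip(string html)
    {
        string text = html;
        text = Regex.Replace(text, "<script.*?</script>", "", Options);       // scripts
        text = Regex.Replace(text, "<style.*?</style>", "", Options);         // inline stylesheets
        text = Regex.Replace(text, "<!--.*?-->", "", Options);                // comments
        text = Regex.Replace(text, "</?[a-z][a-z0-9]*[^<>]*>", "", Options);  // tags
        text = Regex.Replace(text, "<!.*?>", "", Options);                    // doctype
        text = Regex.Replace(text, @"\s+", " ");                              // collapse whitespace
        return text.Trim();
    }

    static void Main()
    {
        // Invented sample page: doctype, style, script, comment, mixed-case tags.
        string sample = "<!DOCTYPE html><html><head><style>p { color: red; }</style>"
            + "<script>alert('x');</script></head>"
            + "<body><!-- note --><P>Hello,\n\t<b>world</b>!</P></body></html>";
        Console.WriteLine(Strip(sample)); // prints "Hello, world!"
    }
}
```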

Update

In this update, I match the loaded content against a specific pattern. I use it to find the URLs in the loaded content, but you can substitute any pattern of your own.

Using the code

The focus here is pattern matching. To match a pattern, specify it as a regular expression. I use it to extract the sites referenced in the loaded content. The pattern below matches host names that end in one of a few common top-level domains:

string pattern=@"\b[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)\b";

Now create a Regex object with the given pattern as its parameter.

Regex r = new Regex(pattern);

To find every occurrence of the pattern, we use the Matches() method. Iterate through the matches and print each one on the screen.

// result = content loaded from the website by the scraper, before formatting
foreach (Match m in r.Matches(result))
{
    Console.WriteLine(m.Value);
}

Every match found in the content is printed on screen, one per line. The same approach works for any other pattern you want to match.
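A page often links to the same site many times, so it can be useful to print each match only once. The following sketch shows one possible way to do that, under stated assumptions: HostFinder and Find are hypothetical names, the sample markup is invented, and RegexOptions.IgnoreCase replaces the duplicated upper-case alternatives in the pattern (which also lets mixed-case endings such as ".Com" match, unlike the original pattern).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HostFinder // hypothetical name, for illustration only
{
    // The article's pattern, made case-insensitive instead of listing COM, ORG, ...
    static readonly Regex HostPattern = new Regex(
        @"\b[a-z0-9\-\.]+\.(com|org|net|mil|edu)\b", RegexOptions.IgnoreCase);

    // Returns each distinct match in order of first appearance.
    public static List<string> Find(string content)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var hosts = new List<string>();
        foreach (Match m in HostPattern.Matches(content))
        {
            if (seen.Add(m.Value)) // Add returns false for a repeat
            {
                hosts.Add(m.Value);
            }
        }
        return hosts;
    }

    static void Main()
    {
        // Invented sample content with one repeated site.
        string sample = "<a href=\"http://www.example.com/a\">A</a> "
            + "<a href=\"http://docs.example.org\">B</a> "
            + "<a href=\"http://www.example.com/b\">C</a>";
        foreach (string host in Find(sample))
        {
            Console.WriteLine(host);
        }
        // prints:
        // www.example.com
        // docs.example.org
    }
}
```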

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)