Introduction
A web scraper is a tool that loads the contents of a website. Because it downloads everything the
page contains, markup included, I prefer to format the result so that it is readable.
Using the code
You can use this code in a console application or in a Windows or web application. I used
a console application because this is an introductory example.
In the console application, add the following namespaces:
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
Loading the content
Create the WebRequest and WebResponse objects.
WebRequest request=System.Net.HttpWebRequest.Create(url);
WebResponse response=request.GetResponse();
Create a StreamReader object to read the website's response, store the content in a string
variable, and close the stream.
StreamReader sr=new StreamReader(response.GetResponseStream());
string result=sr.ReadToEnd();
sr.Close();
To view the unformatted result, simply write it on the console.
Console.WriteLine(result);
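The loading steps above can be gathered into one helper. What follows is only a minimal sketch: the names PageLoader, ReadAll and Load are hypothetical, the address is a placeholder, and returning null on a WebException is an assumption about how a caller might want failures reported. The using blocks dispose the response and the reader even when reading throws.

```csharp
using System;
using System.IO;
using System.Net;

static class PageLoader
{
    // Reads a whole stream as text; disposing the reader also closes the stream.
    public static string ReadAll(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

    // Downloads a page and returns its raw content,
    // or null when the request fails with a WebException (an assumed convention).
    public static string Load(string url)
    {
        try
        {
            WebRequest request = WebRequest.Create(url);
            using (WebResponse response = request.GetResponse())
            {
                return ReadAll(response.GetResponseStream());
            }
        }
        catch (WebException ex)
        {
            Console.Error.WriteLine("Request failed: " + ex.Message);
            return null;
        }
    }

    static void Main()
    {
        // Placeholder address, not taken from the article.
        string result = Load("http://example.com/");
        if (result != null)
        {
            Console.WriteLine(result);
        }
    }
}
```

On .NET 6 and later, WebRequest.Create still compiles but raises an obsolescence warning (SYSLIB0014) that points to HttpClient, so treat this as a sketch for the classic API used in this article.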
Formatting the result
To format the result, we will use the static Replace() method of the Regex class.
result = Regex.Replace(result, "<script.*?</script>", "",
RegexOptions.Singleline | RegexOptions.IgnoreCase);
result = Regex.Replace(result, "<style.*?</style>", "",
RegexOptions.Singleline | RegexOptions.IgnoreCase);
result = Regex.Replace(result, "</?[a-z][a-z0-9]*[^<>]*>", "",
RegexOptions.IgnoreCase);
result = Regex.Replace(result, "<!--(.|\\s)*?-->", "");
result = Regex.Replace(result, "<!(.|\\s)*?>", "");
result = Regex.Replace(result, "[\t\r\n]", " ");
Now print the results on screen.
Console.WriteLine(result);
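As a rough illustration, the replacements above can be wrapped in one method and tried on a small inline page, with no download needed. The names HtmlText and Strip and the sample markup are hypothetical. Two details differ from the steps above and are assumptions of this sketch: the comment and declaration patterns use .*? with RegexOptions.Singleline instead of (.|\s)*?, which should match the same text with less backtracking, and a final step collapses runs of whitespace and trims the result.

```csharp
using System;
using System.Text.RegularExpressions;

static class HtmlText
{
    // Singleline lets '.' cross line breaks; IgnoreCase also catches tags such as <P>.
    const RegexOptions Opts = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    public static string Strip(string html)
    {
        string text = html;
        text = Regex.Replace(text, "<script.*?</script>", "", Opts);      // script blocks
        text = Regex.Replace(text, "<style.*?</style>", "", Opts);        // style blocks
        text = Regex.Replace(text, "</?[a-z][a-z0-9]*[^<>]*>", "", Opts); // ordinary tags
        text = Regex.Replace(text, "<!--.*?-->", "", Opts);               // comments
        text = Regex.Replace(text, "<!.*?>", "", Opts);                   // declarations such as DOCTYPE
        // Assumption beyond the article: collapse whitespace runs, then trim.
        text = Regex.Replace(text, @"\s+", " ");
        return text.Trim();
    }

    static void Main()
    {
        // Hypothetical sample page.
        string sample = "<html><head><style>p{color:red}</style>" +
            "<script>var x = 1;</script></head>" +
            "<body><!-- note --><P>Hello <b>world</b></P>\n</body></html>";
        Console.WriteLine(Strip(sample)); // prints "Hello world"
    }
}
```

Regular expressions are adequate for a quick readable dump like this one, but they do not parse HTML reliably, and character entities such as &amp; are left as they are.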
Update
In this update I match the loaded content against a specific
pattern. I use it to find and list the URLs in the loaded
content, but you can choose your own pattern to match.
Using the code
The focus here is pattern matching. To match a pattern,
specify it as a regular expression. I use it to extract the
list of URLs in the content, so the pattern for a URL (strictly,
a host name ending in one of a few top-level domains) is:
string pattern=@"\b[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)\b";
Now create a Regex object with the given pattern as its parameter.
Regex r = new Regex(pattern);
To find the matches, use the Matches() method. Iterate over the matches it
returns and print each one on the screen.
foreach (Match m in r.Matches(result))
{
Console.WriteLine(m.Value);
}
The list of matched URLs is printed on screen. The same approach works for any other pattern.
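A page often repeats the same address many times, so it can help to collect the matches instead of printing them directly. The sketch below is one possible way to do that; HostFinder, FindHosts and the sample text are hypothetical names and data. It assumes RegexOptions.IgnoreCase in place of the uppercase alternatives in the pattern, which is slightly broader because it also accepts mixed case such as ".Com", and it keeps only the first occurrence of each match, compared without regard to case.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HostFinder
{
    // The article's pattern, with IgnoreCase standing in for the uppercase alternatives.
    static readonly Regex HostPattern = new Regex(
        @"\b[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu)\b",
        RegexOptions.IgnoreCase);

    // Returns each distinct match once, in order of first appearance.
    public static List<string> FindHosts(string text)
    {
        HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        List<string> hosts = new List<string>();
        foreach (Match m in HostPattern.Matches(text))
        {
            if (seen.Add(m.Value))
            {
                hosts.Add(m.Value);
            }
        }
        return hosts;
    }

    static void Main()
    {
        // Hypothetical sample text; in the article this would be the loaded result.
        string sample = "Visit www.example.com or mail.example.org; see also WWW.EXAMPLE.COM.";
        foreach (string host in FindHosts(sample))
        {
            Console.WriteLine(host);
        }
        // prints:
        // www.example.com
        // mail.example.org
    }
}
```

Returning a list rather than writing to the console keeps the matching step reusable, for example to feed each host back into the loading code.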