
Get page HTML from a URL using WebClient, strip HTML using Regex, and export a list of anchors to Excel or XML

6 Nov 2012
Get a page's HTML with the .NET System.Net.WebClient class, strip HTML with Regex, and export a list to Excel or XML.

Introduction 

This article addresses a common developer requirement: finding the links on a web page, or getting the HTML of any web page (in an internal project or on an external website). It covers how to get a page's HTML with the .NET System.Net.WebClient class, how to strip a particular HTML tag with Regex, and how to export a list to Excel or XML.

Background 

Over the past few days, in forum discussions, I found several developers asking about topics such as:

(I) How do I get a page's HTML, anchor tags, or div content from a URL, or from web pages whose code I cannot access?

(II) How do I export a list or collection to Excel or XML and download it?

(III) How do I strip a particular tag, or strip HTML altogether?

Based on these requirements, I have combined the solutions and discuss each topic here according to my findings.

Using the code 

I have created two projects: a class library, and a web project that uses the library.

First, create a class to hold the values read from a URL and exported later:

[Serializable]
public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}  

The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI.

The WebClient class uses the WebRequest class to provide access to resources. WebClient instances can access data with any WebRequest descendant registered with the WebRequest.RegisterPrefix method. Learn more on MSDN: http://msdn.microsoft.com/en-us/library/system.net.webclient%28v=vs.80%29.aspx

Then create another class that gets the HTML from any URL using the System.Net.WebClient class:

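// Requires: using System.IO; using System.Net; using System.Text;
protected string GetString(string url)
{
    // The using statements dispose the client, stream, and reader even if reading fails.
    using (WebClient wc = new WebClient())
    using (Stream resStream = wc.OpenRead(url))
    // Encoding.Default is the system code page; a page served in another charset may decode incorrectly.
    using (StreamReader sr = new StreamReader(resStream, Encoding.Default))
    {
        return sr.ReadToEnd();
    }
}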
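As an alternative to opening and reading the stream by hand, WebClient also exposes DownloadString, which returns the response body as a string decoded with the client's Encoding property. The sketch below is only an illustration: the PageFetcher class and its Fetch method are hypothetical names, not part of the article's library, and the demo reads a temporary local file through a file:// URI so that it runs without network access. On recent .NET versions WebClient is marked obsolete in favor of HttpClient, so treat this as matching the article's era.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Hypothetical helper (not from the article): fetch a resource as text.
    public static string Fetch(string url, Encoding encoding)
    {
        using (var wc = new WebClient())
        {
            // DownloadString decodes the downloaded bytes with this encoding.
            wc.Encoding = encoding;
            return wc.DownloadString(url);
        }
    }

    public static void Main()
    {
        // A local file:// URI keeps the demo offline; an http:// URL is passed the same way.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "<a href=\"http://example.com\">Example</a>");

        string html = Fetch(new Uri(path).AbsoluteUri, Encoding.UTF8);
        Console.WriteLine(html);

        File.Delete(path);
    }
}
```

Passing the encoding explicitly avoids depending on the machine's default code page, which is the main practical difference from the GetString method above.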

Next, get the anchor tags from the HTML and store them in a collection using Regex.

The System.Text.RegularExpressions namespace contains the Regex class, which is used to build and evaluate regular expressions. Regex offers both static and instance methods for matching a pattern against a string: IsMatch() reports whether the string contains a match, Match() returns the first match, and Matches() returns a collection of all matches.

Learn more about Regex on MSDN: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx. The following code collects the anchors:
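To make the difference between these three methods concrete, here is a minimal sketch using the static overloads. The input string and the \d+ pattern are made up for illustration and have nothing to do with anchor parsing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexBasics
{
    public static void Main()
    {
        string input = "order 12, order 345";
        string pattern = @"\d+"; // one or more digits

        bool any = Regex.IsMatch(input, pattern);            // is there at least one match?
        Match first = Regex.Match(input, pattern);           // the first match only
        MatchCollection all = Regex.Matches(input, pattern); // every non-overlapping match

        Console.WriteLine(any);         // True
        Console.WriteLine(first.Value); // 12
        Console.WriteLine(all.Count);   // 2
    }
}
```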

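// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
// html holds the page source, for example the value returned by GetString(url).
List<AnchorValues> _list = new List<AnchorValues>();

// url captures the href value up to the next quote; name captures the markup between <a ...> and </a>.
// The class after the url group is [""'] so that a '?' in a query string does not end the URL early.
string initialURL = @"<a.*?href=([""'])?(?<url>.*?)[""'].*?>(?<name>.*?)</a>";
Regex regex = new Regex(initialURL, RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
MatchCollection matches = regex.Matches(html);

foreach (Match mt in matches)
{
    AnchorValues obj = new AnchorValues();
    obj.Name = mt.Groups["name"].Value;
    obj.Url = mt.Groups["url"].Value;
    _list.Add(obj);
}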
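The name group captures whatever sits between the opening and closing anchor tags, so it can still contain markup such as bold or span tags. A minimal sketch of stripping that markup with Regex.Replace follows. The HtmlStripper class and StripTags method are hypothetical names introduced for this example, and the pattern is a simplification: it removes anything that looks like a tag and would be confused by comments, scripts, or a '>' inside an attribute value, where a real HTML parser is the safer choice.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Hypothetical helper: remove anything shaped like <...>, then decode entities.
    public static string StripTags(string html)
    {
        string noTags = Regex.Replace(html, "<[^>]+>", string.Empty);
        // WebUtility.HtmlDecode turns entities such as &amp; back into characters.
        return WebUtility.HtmlDecode(noTags).Trim();
    }

    public static void Main()
    {
        Console.WriteLine(StripTags("<b>Home</b> &amp; <i>News</i>")); // Home & News
    }
}
```

Applied to the loop above, this would mean assigning StripTags(mt.Groups["name"].Value) to obj.Name when only the visible link text is wanted.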

Finally, export the list as Excel or XML, depending on the user's choice; alternatively, return it as an IDictionary object without exporting to any file format.

Export as XML:

HttpContext.Current.Response.Clear();
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.ContentType = "application/xml";
HttpContext.Current.Response.Charset = "";
HttpContext.Current.Response.AppendHeader("Content-Disposition", "attachment; filename=AnchorFile.xml");

// SerializeToXML (not shown here) is expected to return the list as an XML string.
HttpContext.Current.Response.Write(SerializeToXML(source));
HttpContext.Current.Response.Flush();
// CompleteRequest replaces Response.End, which throws ThreadAbortException.
HttpContext.Current.ApplicationInstance.CompleteRequest();

Export as Excel:

HttpContext.Current.Response.Clear();
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.ContentType = "application/vnd.ms-excel";
HttpContext.Current.Response.Charset = "";
HttpContext.Current.Response.AppendHeader("Content-Disposition", "attachment; filename=AnchorFile.xls");

// Render the bound GridView into a string and send that markup as the .xls payload.
// Requires: using System.IO; using System.Web.UI;
StringWriter oStringWriter = new StringWriter();
HtmlTextWriter oHtmlTextWriter = new HtmlTextWriter(oStringWriter);
GridView1.DataSource = source;
GridView1.DataBind();
GridView1.RenderControl(oHtmlTextWriter);
HttpContext.Current.Response.Write(oStringWriter.ToString());
HttpContext.Current.Response.Flush();
HttpContext.Current.ApplicationInstance.CompleteRequest();
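The XML export calls SerializeToXML, whose body is not shown in the article. One plausible shape for it, assuming the source is a List<AnchorValues>, uses XmlSerializer with a StringWriter. This is a sketch under that assumption and not necessarily how the library implements it; the AnchorExport class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

[Serializable]
public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}

public static class AnchorExport
{
    // Assumed signature and body; the article only shows the call site.
    public static string SerializeToXML(List<AnchorValues> source)
    {
        var serializer = new XmlSerializer(typeof(List<AnchorValues>));
        using (var writer = new StringWriter())
        {
            // Writes one element per list item, with Name and Url as child elements.
            serializer.Serialize(writer, source);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        var list = new List<AnchorValues>
        {
            new AnchorValues { Name = "Example", Url = "http://example.com" }
        };
        Console.WriteLine(SerializeToXML(list));
    }
}
```

Because StringWriter reports UTF-16 as its encoding, the generated XML declaration says encoding="utf-16". If the response is sent in another encoding, serializing to a stream with the intended encoding would keep the declaration and the bytes consistent.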

Points of Interest

One additional point is worth noting. When we export a list through the HttpContext.Current.Response object from a class, Response.End raises the exception "Thread was being aborted." To avoid it, call this instead of Response.End:

HttpContext.Current.ApplicationInstance.CompleteRequest(); 

History

This is a quick solution to a few requirements, assembled into one article. I will update it with more detail soon.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
