Introduction
This article addresses a common developer requirement: finding the links on a web page, or getting the HTML of any page, whether it belongs to an internal project or an external website. It covers how to fetch a page's HTML with the System.Net.WebClient class of .NET, how to extract or strip a particular HTML tag using Regex, and how to export the resulting list to Excel or XML.
Background
Over the past few days I took part in several forum discussions in which developers kept raising the same topics:
(I) How to get a page's HTML, anchor tags, or div content from a URL, including pages whose code they cannot access?
(II) How to export a list or collection to Excel or XML and download it?
(III) How to strip a particular tag, or strip HTML altogether?
Based on these requirements, I have combined the solutions and discuss each topic here according to my findings.
Using the code
I have created two projects: a class library, and a web project that uses the library.
First, create a class to hold the values extracted from a URL, which will later be exported:
[Serializable]
public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}
The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI. The WebClient class uses the WebRequest class to provide access to resources. WebClient instances can access data with any WebRequest descendant registered with the WebRequest.RegisterPrefix method. Learn more on MSDN:
http://msdn.microsoft.com/en-us/library/system.net.webclient%28v=vs.80%29.aspx
Then create another class with a method that gets the HTML of any URL using the System.Net.WebClient class:
protected string GetString(string url)
{
    // Dispose the client, the response stream, and the reader once the page has been read.
    using (WebClient wc = new WebClient())
    using (Stream resStream = wc.OpenRead(url))
    using (StreamReader sr = new StreamReader(resStream, System.Text.Encoding.Default))
    {
        // Read the whole response body as a single string.
        return sr.ReadToEnd();
    }
}
Next, get the anchor tags from the HTML and store them in a collection using Regex.
The System.Text.RegularExpressions namespace contains the Regex class, which is used to build and evaluate regular expressions. Use the IsMatch() method to test whether a string matches a regular expression, or the Matches() method to get the collection of all matches.
Learn more about Regex on MSDN: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
The extraction looks like this:
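List<AnchorValues> _list = new List<AnchorValues>();
// Group 1 captures the opening quote; \1 then requires the same closing quote,
// so a '?' or '|' inside the URL does not cut the match short.
// Note: this pattern only matches href values that are quoted.
string anchorPattern = @"<a\b[^>]*?\bhref\s*=\s*([""'])(?<url>.*?)\1[^>]*>(?<name>.*?)</a>";
// Singleline lets '.' match newlines, so anchors that span several lines are found too.
Regex regex = new Regex(anchorPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
// html is the page source returned by GetString(url).
MatchCollection matches = regex.Matches(html);
foreach (Match mt in matches)
{
    AnchorValues obj = new AnchorValues();
    obj.Name = mt.Groups["name"].Value;
    obj.Url = mt.Groups["url"].Value;
    _list.Add(obj);
}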
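The captured anchor text can itself contain markup (for example a <b> inside the link), which ties in with the tag-stripping question from the Background section. The following is a minimal, self-contained sketch of extracting anchors and stripping tags with Regex. The class name AnchorExtractor and the helpers StripAllTags and StripTag are hypothetical names introduced for illustration only, and the pattern assumes quoted href values. Regex-based HTML handling is a quick approach that can miss malformed markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}

// Hypothetical helper class, not part of the original library.
public static class AnchorExtractor
{
    // Assumption: href values are quoted. Group 1 is the opening quote and \1
    // requires the same closing quote; \b keeps <abbr> and similar tags out.
    private static readonly Regex AnchorRegex = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*([""'])(?<url>.*?)\1[^>]*>(?<name>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Removes every tag and keeps only the text between tags.
    public static string StripAllTags(string html)
    {
        return Regex.Replace(html, @"<[^>]+>", string.Empty);
    }

    // Removes the opening and closing forms of one named tag, keeping its inner text.
    public static string StripTag(string html, string tagName)
    {
        string pattern = @"</?" + Regex.Escape(tagName) + @"\b[^>]*>";
        return Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
    }

    // Collects all anchors, with nested markup removed from the link text.
    public static List<AnchorValues> Extract(string html)
    {
        var list = new List<AnchorValues>();
        foreach (Match m in AnchorRegex.Matches(html))
        {
            list.Add(new AnchorValues
            {
                Name = StripAllTags(m.Groups["name"].Value).Trim(),
                Url = m.Groups["url"].Value
            });
        }
        return list;
    }
}
```

For example, AnchorExtractor.Extract(html) can be called on the string returned by GetString, and AnchorExtractor.StripTag(html, "b") removes only the <b> and </b> tags while leaving other markup, such as <br>, in place.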
Finally, export the list as Excel or XML, depending on the user's choice. Alternatively, it can be returned as an IDictionary object without exporting to any file format.
// Export as XML: send the serialized list to the browser as a file download.
HttpContext.Current.Response.Clear();
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.ContentType = "application/xml";
HttpContext.Current.Response.Charset = "";
HttpContext.Current.Response.AppendHeader("Content-Disposition", "attachment; filename=AnchorFile.xml");
HttpContext.Current.Response.Write(SerializeToXML(source));
HttpContext.Current.Response.Flush();
// CompleteRequest is used instead of Response.End, which throws ThreadAbortException.
HttpContext.Current.ApplicationInstance.CompleteRequest();
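// Export as Excel: render a GridView bound to the list and send its HTML as an .xls download.
HttpContext.Current.Response.Clear();
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.ContentType = "application/vnd.ms-excel";
HttpContext.Current.Response.Charset = "";
HttpContext.Current.Response.AppendHeader("Content-Disposition", "attachment; filename=AnchorFile.xls");
// The writers capture the HTML that the GridView renders.
StringWriter oStringWriter = new StringWriter();
HtmlTextWriter oHtmlTextWriter = new HtmlTextWriter(oStringWriter);
GridView1.DataSource = source;
GridView1.DataBind();
GridView1.RenderControl(oHtmlTextWriter);
HttpContext.Current.Response.Write(oStringWriter.ToString());
HttpContext.Current.Response.Flush();
// CompleteRequest is used instead of Response.End, which throws ThreadAbortException.
HttpContext.Current.ApplicationInstance.CompleteRequest();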
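The XML export above calls SerializeToXML, whose body is not shown in this article. One possible shape for it, as a minimal sketch, uses System.Xml.Serialization.XmlSerializer. The class name AnchorXml and the List<AnchorValues> parameter type are assumptions made for illustration; the actual implementation in the library may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

[Serializable]
public class AnchorValues
{
    public string Name { get; set; }
    public string Url { get; set; }
}

// Hypothetical holder class for the serialization helper.
public static class AnchorXml
{
    // Serializes the list to an XML string. XmlSerializer needs a public type
    // with a parameterless constructor, which AnchorValues satisfies.
    public static string SerializeToXML(List<AnchorValues> source)
    {
        var serializer = new XmlSerializer(typeof(List<AnchorValues>));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, source);
            return writer.ToString();
        }
    }
}
```

With this sketch the root element is ArrayOfAnchorValues, and each item becomes an AnchorValues element with Name and Url children. Because StringWriter reports UTF-16 as its encoding, the XML declaration will say encoding="utf-16", so the response encoding may need to be matched to it.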
Points of Interest
One additional thing to note: when exporting a list through the HttpContext.Current.Response object from a class, calling Response.End throws a ThreadAbortException ("Thread was being aborted."). To avoid it, call this instead of Response.End:
HttpContext.Current.ApplicationInstance.CompleteRequest();
History
This article is a quick write-up that brings the solutions to a few requirements together. I will update it with more detail soon.