Introduction
The software development landscape has changed significantly with the proliferation of Web technologies. As a result, the majority of applications now have some form of connectivity or integration with another application, web service, web application, or remote database.
This article focuses on one specific area, HTML content and the DOM, and looks at two approaches available in .NET for combining the two for practical purposes.
The examples are based on .NET code and libraries. However, the concepts remain the same, because HTML and DOM are independent of any programming language. This article is not exhaustive; references are provided for those seeking more in-depth coverage.
Background
According to W3C [1], HTML is the publishing language for the World Wide Web. This basically means that HTML is the language that is used to display content in your web browser when you visit any website.
HTML (HyperText Markup Language) is a markup language in which predefined tags instruct the browser how content should appear. For example, <h1>This is a heading</h1> uses the heading tag to tell the browser that the text “This is a heading” should be displayed in bold and larger than the rest of the text on the web page. Different tags are used for different purposes. These tags are defined by the W3C in its language specifications. At the time of writing, the latest W3C specification for HTML is HTML 4.01 [2]. The purpose of a specification is to describe how a language should be used, i.e. the recommendations of the creators of the language. The HTML 4.01 specification recommends how HTML should be used in your websites and what the language is supposed to do. There is also XHTML 1.0 [3], a reformulation of HTML 4 as an XML application, which was designed to bring the power of XML to web pages.
DOM (Document Object Model) is an interface that allows applications to dynamically access the content, structure and style of documents. It is not restricted to a specific platform or language [4]. W3C has defined several levels of DOM (e.g. DOM Level 1 – 3) and several modules for each level (e.g. Core, XML, HTML, etc.; 14 modules altogether). An implementation (application, agent, library, API, SDK, etc.) is said to conform to a certain DOM level or module if it supports all the interfaces for that module and the associated semantics [5] [6].
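To make the idea of a DOM interface concrete, here is a minimal sketch that walks a small, well-formed XHTML fragment with System.Xml.XmlDocument, which Microsoft documents as an implementation of the W3C DOM Core. This class is not the library used later in the article, and the DomSketch class and ImageSources method are hypothetical names chosen for illustration. It also assumes the input is well-formed XML, which real-world HTML often is not.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class DomSketch
{
    // Returns the 'src' attribute of every <img> element in a
    // well-formed XHTML fragment, in document order.
    public static List<string> ImageSources(string xhtml)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(xhtml); // throws XmlException if not well-formed

        List<string> sources = new List<string>();
        foreach (XmlNode node in document.GetElementsByTagName("img"))
        {
            XmlAttribute src = node.Attributes["src"];
            if (src != null)
            {
                sources.Add(src.Value);
            }
        }
        return sources;
    }
}
```

Calling DomSketch.ImageSources with a fragment such as <body><img src="a.png" /></body> would return a list containing a.png. The same access pattern (get a collection of elements, then read their attributes) appears again in Step 2 with a different library.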
Approach
The steps that will be taken to demonstrate how HTML DOM can be used in .NET are:
- Step 1. Retrieve the HTML Content
- Step 2. Process the HTML Content using DOM
- Step 3. Make use of the processed HTML Content
The following are the details of these steps.
Step 1. Retrieve the HTML Content
To retrieve the HTML content, the System.Net.WebClient class will be used. This class provides common methods for sending data to and receiving data from a resource identified by a URI [7]. According to RFC3986 [8], a URI (Uniform Resource Identifier) is a compact sequence of characters that identifies an abstract or physical resource. It provides a simple and extensible means for identifying a resource. The commonly used term “URL” is basically a subset of URI. More on this topic can be explored in the RFC3986 document (link in the reference section).
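As a small illustration of what a URI is made of, the following sketch uses the System.Uri class to split an address into its scheme, host and path. The UriParts class and Describe method are hypothetical names, and the address is only an example.

```csharp
using System;

public static class UriParts
{
    // Returns "scheme|host|path" for an absolute URI string.
    public static string Describe(string address)
    {
        Uri uri = new Uri(address); // throws UriFormatException if invalid
        return uri.Scheme + "|" + uri.Host + "|" + uri.AbsolutePath;
    }
}
```

For example, UriParts.Describe("http://www.cnn.com/index.html") yields http|www.cnn.com|/index.html.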
The following code (Code Listing 1) retrieves the HTML content from www.cnn.com and displays it in the textBox1 control. Before the code can be executed, remember to import the System.Net and System.IO namespaces.
Code Listing 1:
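// Requires: using System; using System.IO; using System.Net;
string htmlContent;
// The using statements dispose the client, stream and reader even if an error occurs.
using (WebClient client = new WebClient())
using (Stream data = client.OpenRead(new Uri("http://www.cnn.com")))
using (StreamReader reader = new StreamReader(data))
{
    // Read the whole response body as text.
    htmlContent = reader.ReadToEnd();
}
textBox1.Text = htmlContent;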
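When only the text of the page is needed, Code Listing 1 can probably be shortened with WebClient.DownloadString, which opens the stream and reads it in one call. The sketch below is one way to wrap it. The HtmlFetcher class and Fetch method are hypothetical names, and the UTF-8 encoding is an assumption that will not hold for every page.

```csharp
using System;
using System.Net;
using System.Text;

public static class HtmlFetcher
{
    // Downloads the resource at 'address' and returns it as a string.
    // Assumes UTF-8 content; set a different Encoding for other pages.
    public static string Fetch(string address)
    {
        using (WebClient client = new WebClient())
        {
            client.Encoding = Encoding.UTF8;
            return client.DownloadString(address);
        }
    }
}
```

A call such as textBox1.Text = HtmlFetcher.Fetch("http://www.cnn.com"); would then replace the body of Code Listing 1.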
An alternative to using the System.Net.WebClient class is the WebBrowser control. Simply place a WebBrowser control on the Form and use the following code to navigate to a preferred URI.
webBrowser1.Navigate(new Uri(txtURL.Text));
Following this, the DocumentCompleted event can be used to capture the HTML content through the DocumentText property after the page has loaded (Code Listing 2).
The WebBrowser control may be suitable if you wish to display the page to the user. However, if you only want to capture the HTML content, the WebClient class is more suitable and efficient.
Code Listing 2:
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // The page has finished loading, so its HTML is now available.
    textBox1.Text = webBrowser1.DocumentText;
}
Step 2. Process the HTML Content using DOM
Once the HTML content has been retrieved, DOM can be used to process it according to your needs. You could construct a full DOM tree, or you could simply extract specific tags, ids, or even content from the web page. For this part, I will use the Microsoft HTML Object Library, a COM library that has to be referenced in your Visual Studio project. Once the reference has been added, import the mshtml namespace in the code.
The mshtml namespace consists of different interfaces that can be used to access the Dynamic HTML (DHTML) Object Model [9][10]. The IHTMLDocument2 interface will be used in this article. This interface can be used to get information about the document, and also to examine and modify HTML elements and text in the document [11].
First, obtain a document object through the IHTMLDocument2 interface. All the elements in the document can be accessed using this interface. The following code shows how this can be done purely with the mshtml interfaces. Most of the available examples use the WebBrowser control; this is an alternative approach. After the document object is created, the document is constructed from the HTML content (htmlContent) retrieved in Code Listing 1. Following this, other interfaces can be used to access the document's elements. The following code (Code Listing 3) shows how all the tags are traversed and displayed in a ListBox by tag name.
Code Listing 3:
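// Requires: using mshtml;
// Create an empty document and fill it with the HTML retrieved in Code Listing 1.
IHTMLDocument2 htmlDocument = (IHTMLDocument2)new mshtml.HTMLDocument();
htmlDocument.write(htmlContent);
htmlDocument.close(); // signal that no more content will be written
listBox1.Items.Clear();
// List the tag name of every element in the document.
IHTMLElementCollection allElements = htmlDocument.all;
foreach (IHTMLElement element in allElements)
{
    listBox1.Items.Add(element.tagName);
}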
More specific queries can also be run on the elements, such as extracting all the links or all the images in a particular web page. The following code (Code Listing 4) shows how all the image elements can be extracted and their sources displayed in a ListBox.
Code Listing 4:
// Remove earlier entries so the list holds image sources only.
listBox1.Items.Clear();
IHTMLElementCollection imgElements = htmlDocument.images;
foreach (IHTMLImgElement img in imgElements)
{
    listBox1.Items.Add(img.src);
}
Step 3. Make Use of the Processed HTML Content
There are numerous applications for this approach of extracting HTML content. Once the content is extracted, selected elements can be processed using DOM. For example, using the above examples, a simple image gallery can be built from the images used in a website. The following code iterates through all the items in the ListBox and dynamically adds PictureBoxes to a FlowLayoutPanel using the image sources retrieved in the previous step.
Code Listing 5:
// Iterate over every item, including the last one.
for (int i = 0; i < listBox1.Items.Count; i++)
{
    PictureBox pic = new PictureBox();
    pic.Width = 100;
    pic.Height = 100;
    pic.SizeMode = PictureBoxSizeMode.StretchImage;
    // Setting ImageLocation makes the PictureBox load the image from that address.
    pic.ImageLocation = listBox1.Items[i].ToString();
    flowLayoutPanel1.Controls.Add(pic);
}
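Image sources taken from a page are not always absolute web addresses, so it may be worth normalising each one before handing it to a PictureBox. The following is a minimal sketch of that idea using System.Uri. The ImageUrlHelper class and Resolve method are hypothetical, and the choice to accept only http and https results is an assumption made for this example.

```csharp
using System;

public static class ImageUrlHelper
{
    // Resolves 'src' against the address of the page it came from.
    // Returns null when the result is not an http or https URL.
    public static string Resolve(string pageAddress, string src)
    {
        Uri baseUri = new Uri(pageAddress);
        Uri resolved;
        if (!Uri.TryCreate(baseUri, src, out resolved))
        {
            return null;
        }

        bool isWeb = resolved.Scheme == Uri.UriSchemeHttp
                  || resolved.Scheme == Uri.UriSchemeHttps;
        return isWeb ? resolved.AbsoluteUri : null;
    }
}
```

Inside the loop of Code Listing 5, the result of ImageUrlHelper.Resolve(pageAddress, listBox1.Items[i].ToString()) could be assigned to pic.ImageLocation, skipping the item when the result is null. Here pageAddress stands for the address the HTML was downloaded from.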
Conclusion
This article has covered just a little of HTML DOM and how it can be used within .NET. While the applications are numerous, I hope that readers will have some direction and know where to start when solving problems in this domain. If you wish to build photo galleries from pictures on other websites, monitor changes made to web pages, or even develop spiders that crawl several sites, this approach is a simple and efficient way of going about it.
There are several other techniques that can be used, and very helpful third-party tools and libraries built just for this purpose are also available, some free, others commercial. The following is a screenshot of the demo application built using the code presented in this article. It is available for download from my blog. If you have any feedback, ideas or queries, please drop me a mail.
Figure 1: Demo Application Screenshot
References
[1] http://www.w3.org/html/
[2] http://www.w3.org/TR/html401/
[3] http://www.w3.org/TR/xhtml1/
[4] http://www.w3.org/DOM/
[5] http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/introduction.html
[6] http://www.w3.org/2003/02/06-dom-support.html
[7] http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx
[8] ftp://ftp.rfc-editor.org/in-notes/rfc3986.txt
[9] http://msdn.microsoft.com/en-us/library/bb508515%28v=VS.85%29.aspx
[10] http://msdn.microsoft.com/en-us/library/ms533044%28v=VS.85%29.aspx
[11] http://msdn.microsoft.com/en-us/library/aa752574%28v=VS.85%29.aspx
History
- 5th June, 2010: Initial post