Extract RSS feeds from Web pages

Alex Furman

25 Sep 2004

Shows how to extract RSS feeds from Web pages.

Introduction

I love RSS readers. They save me a lot of time. Wouldn't it be nice if we could convert any Web data into RSS format? Then we could view bank records, credit card records, online shop promotions, e-mail subscriptions, etc. in one standard way.

Unfortunately, few Web sites provide RSS or Atom feeds. In this article, I will show that RSS extraction is a simple task, especially when the proper technology is used.

How to extract

We will only consider Web pages that are built with HTML or DHTML. At first glance, the task looks very simple: download the HTML pages locally and parse them. In practice, it can take hours to write the code even for a simple web site, and the code is hard to keep working because changes to the site can break it.

The following approaches can be used to extract data from Web pages: "Raw" HTTP, IE Automation, and SWExplorerAutomation.

"Raw" HTTP

The "raw" HTTP approach uses WebRequest (.NET) to download the page source locally. The RSS data can then be extracted with XPath or regular expressions. To use XPath, the page source must first be converted to XML (XHTML) using HTML Tidy.
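As a minimal sketch of this approach, the following C# class downloads a page with WebRequest and pulls anchor links out of the source with a regular expression. The names RawHttpExtractor, Download, ExtractLinks, and LinkPattern are hypothetical and chosen only for illustration, and the pattern is deliberately naive. It skips the HTML Tidy and XPath route entirely.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the article's download.
public static class RawHttpExtractor
{
    // Matches <a ... href="URL" ...>TEXT</a> with a double-quoted href.
    // Naive on purpose: it ignores single-quoted or unquoted attributes.
    private static readonly Regex LinkPattern = new Regex(
        "<a\\s[^>]*href\\s*=\\s*\"(?<url>[^\"]*)\"[^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Downloads the page source as a string using WebRequest.
    public static string Download(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns one (url, text) pair for every anchor found in the HTML.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["url"].Value, m.Groups["text"].Value.Trim()));
        }
        return links;
    }
}
```

Calling ExtractLinks(Download(url)) returns every anchor on the page. Nothing ties a match to a particular region of the page, so real code needs a more specific pattern, and that pattern is exactly what breaks when the layout changes.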

Pros

Performance is very fast.

Cons

Requires knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.
Because HTML is often not well formed, HTML-to-XML conversion will not always work.
Very unstable. Even simple changes to a web page layout will break the extraction.
Does not work with pages whose content is generated by JavaScript.
Time-consuming to develop.

IE automation

This solution is based on accessing the HTML DOM. We can either automate Internet Explorer or host the Web Browser control to get access to the HTML DOM data model.

Pros

Can work with any web page shown in IE.
Doesn't require knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.

Cons

Changes to the web site layout will break the extraction.
Requires a good knowledge of Web Browser events, the HTML DOM, and COM.
Not as fast as the raw HTTP approach.
Time-consuming to develop.

SWExplorerAutomation

Picture 1. SWExplorerAutomation class diagram.

SWExplorerAutomation is a framework that converts a web application into programmable objects: scenes (pages) and controls. These objects are defined visually in a visual designer and are accessible from any .NET language.

Pros

Can work with any web page shown in IE.
Doesn't require knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.
Separates data extraction from program logic.
Effectively handles error conditions.
Takes minutes to write code.

Cons

Not as fast as the raw HTTP approach.

SWExplorerAutomation Example

To illustrate how SWExplorerAutomation can be used to extract RSS feeds from web pages, I wrote a sample application that extracts an RSS feed from the CNN web site. I created the following definitions (scenes) for the CNN pages: [CnnNews], [Sport], [Money], [Main]. Each scene contains an HtmlContent control, which extracts data from a defined place on the page.

First, we create and initialize an ExplorerManager instance. ExplorerManager is initialized from the [cnn_rss.htp] project file, which was created visually in the SWExplorerAutomation designer. The ExplorerManager Connect() function starts an Internet Explorer instance and connects to it. ExplorerManager then navigates the browser to the main CNN page.

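ExplorerManager explorerManager = new ExplorerManager();
// Start Internet Explorer and attach to it.
explorerManager.Connect();
// Load the scene definitions created in the visual designer.
explorerManager.LoadProject(@"..\..\cnn_rss.htp");
explorerManager.Navigate("http://www.cnn.com/");
// Look the scene up by name here; the scene variable is only assigned later.
rssw.WriteChannel("CNN", "CNN News", explorerManager["CnnNews"].Descriptor.Url);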

The code waits until the scene defined for the main CNN page is activated. It then uses XPathDataExtractor to extract the list of article links from the web page.

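scene = explorerManager["CnnNews"];
// Give up if the main CNN page is not recognized within 60 seconds.
if (!scene.WaitForActive(60000))
    return "";
// Run the named "ItemList" XPath expression to get the article links.
XmlNodeList nodeList = ((HtmlContent)(scene["HtmlContent_0"]))
    .XPathDataExtractor.Expressions["ItemList"].SelectNodes();
for (int i = 0; i < nodeList.Count; i++) {
    // ...
}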

The same set of actions (Navigate, Wait, Extract) is repeated for every article link.

for (int i = 0; i < nodeList.Count; i++) {
    string link = nodeList[i].Attributes["href"].Value;
    explorerManager.Navigate(link);
    // Wait up to 20 seconds for any of the article scenes to become active.
    Scene[] scenes = explorerManager.WaitForActive(
        new string[] { "Main", "Money", "Sport" }, 20000);
    if (scenes == null)
        continue; // page not recognized; skip this link
    scene = scenes[0];
    XPathDataExtractor xe =
        ((HtmlContent)(scene["HtmlContent_0"])).XPathDataExtractor;
    string title = xe.Expressions["Title"].SelectNodes()[0].InnerText;
    string pubDateStr = xe.Expressions["PubDate"].SelectNodes()[0].InnerText;
    WriteRssItem(title, link, pubDateStr,
        xe.Expressions["Content"].SelectNodes());
    scene.Deactivate();
}

The code is completely metadata driven and does not need to change if the CNN site design changes; only the scene definitions in the project file would need updating.
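The snippets above call rssw.WriteChannel and WriteRssItem without showing them. As a rough sketch of what such a writer could look like, the class below emits a minimal RSS 2.0 document with XmlWriter. SimpleRssWriter and its method signatures are assumptions made for illustration, not the sample's actual code, and the item body is passed as a plain string rather than the XmlNodeList used in the article.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper; the article's own RSS writer is not shown.
public sealed class SimpleRssWriter : IDisposable
{
    private readonly XmlWriter writer;
    private bool channelOpen;

    public SimpleRssWriter(TextWriter output)
    {
        var settings = new XmlWriterSettings();
        settings.Indent = true;
        settings.OmitXmlDeclaration = true;
        writer = XmlWriter.Create(output, settings);
    }

    // Opens <rss><channel> and writes the channel's title, link, and description.
    public void WriteChannel(string title, string description, string link)
    {
        writer.WriteStartElement("rss");
        writer.WriteAttributeString("version", "2.0");
        writer.WriteStartElement("channel");
        writer.WriteElementString("title", title);
        writer.WriteElementString("link", link);
        writer.WriteElementString("description", description);
        channelOpen = true;
    }

    // Writes one <item>. pubDate is written as given; RSS 2.0 expects an
    // RFC 822 date, so scraped text may need converting first.
    public void WriteItem(string title, string link, string pubDate, string description)
    {
        writer.WriteStartElement("item");
        writer.WriteElementString("title", title);
        writer.WriteElementString("link", link);
        writer.WriteElementString("pubDate", pubDate);
        writer.WriteElementString("description", description);
        writer.WriteEndElement();
    }

    // Closes </channel> and </rss> if they were opened, then flushes the writer.
    public void Dispose()
    {
        if (channelOpen)
        {
            writer.WriteEndElement(); // </channel>
            writer.WriteEndElement(); // </rss>
            channelOpen = false;
        }
        writer.Close();
    }
}
```

XmlWriter escapes special characters such as & and < in titles and content, which is the main reason to prefer it over building the feed by string concatenation.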

Using Visual Designer to create cnn_rss.htp

Screenshot 1. SWExplorerAutomation Visual Designer

To create cnn_rss.htp using SWDesigner

1. On the Explorer menu, click Run.
2. Navigate IE to http://www.cnn.com/.
3. On the Scene Editor menu, click Start.
4. Right-click to show the IE context menu, then click SceneEditor\Text Selection Mode.
5. Mark text on the CNN page, then click SceneEditor\Select control on the context menu. The HtmlContent control is added to the project.
6. Rename the control to CnnNews.
7. Run the XPathDataExtractor custom property editor.
8. Define a named XPath expression: point at an HTML link with the mouse cursor and left-click to calculate its XPath expression. Change the expression so that it selects the list of links (for example, DIV[1]/DIV[position() != 7]/A[1]).
9. Click the Add button and rename the named expression to "ItemList".
10. Click the Exec button to test the expression, then close the XPathDataExtractor dialog.
11. Navigate to one of the news articles. Mark text on the page and create a control (as in step 5).
12. Create the following named XPath expressions: PubDate, Content, and Title.
13. Change the scene descriptor URL pattern to the regular expression "http://www\.cnn\.com/2004(.*)" and the title pattern to "CNN\.com\ -(.*)".
14. Repeat steps 11-13 for Money and Sport.
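The URL and title patterns set on the article scene descriptor are ordinary regular expressions. The sketch below shows what they would accept, assuming the designer applies them with .NET Regex semantics and without anchoring, which the article does not state. The class name SceneDescriptorPatterns is hypothetical, and any URL or title passed to it is a made-up example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical illustration of the descriptor patterns; not SWExplorerAutomation code.
public static class SceneDescriptorPatterns
{
    // Patterns copied from the scene descriptor settings in the steps above.
    public const string UrlPattern = @"http://www\.cnn\.com/2004(.*)";
    public const string TitlePattern = @"CNN\.com\ -(.*)";

    // True when both the page URL and the page title match their patterns.
    public static bool Matches(string url, string title)
    {
        return Regex.IsMatch(url, UrlPattern) && Regex.IsMatch(title, TitlePattern);
    }
}
```

Under these assumptions, a page at http://www.cnn.com/2004/... titled "CNN.com - ..." matches, while a page on another host does not. Because the URL pattern hard-codes 2004, it would have to be updated for articles published under a different year path.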

To view cnn_rss.htp using SWDesigner

1. On the Project menu, click Open.
2. Select "cnn_rss.htp".
3. On the Scene Editor menu, click Start.
4. Select the CnnNews scene. On the context menu, click Navigate.
5. Run the XPathDataExtractor custom property editor.
6. Repeat steps 4-5 for all scenes.

Using the code

Before running the sample, remember to register SWExplorerAutomation.dll. It is a Browser Helper Object, so it must be registered before it can be used.

Summary

This article explains how to extract RSS feeds from web pages using SWExplorerAutomation. It took me less than 10 minutes to write and test the example code. Future articles will cover SWExplorerAutomation in more detail and in more complex situations.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.