Introduction
The purpose of this article is to describe how you can implement a generic “HTTP Data Client” (I apologize if it sounds fussy) in C# that lets you query any web-based resource in an elegant manner. I would like to mention from the beginning that this is not “the perfect solution” and that it can certainly be improved in many ways, so please feel free to do so. The entire concept is based on the HttpWebRequest object offered by .NET under the System.Net namespace.
Prerequisites
Before I delve into the architecture and code, there are some extra libraries which are used and required by the “HTTP Data Client” project. Here is the list of libraries:
- Db4Object (this is an object oriented database; I am using it mainly for embedded applications; there are two assembly files which are referenced: Db4objects.Db4o.dll and Db4objects.Db4o.NativeQueries.dll; you can get DB4Object from the following location: http://www.db4o.com/DownloadNow.aspx).
- HTML Agility Pack (this is a library which allows you to process HTML content using various techniques, it is very handy when you would like to convert HTML DOM to XML; there is one assembly file referenced: HtmlAgilityPack.dll; you can get the library from the following location: http://htmlagilitypack.codeplex.com).
- Microsoft MSHTML (its purpose is to render and parse HTML and JavaScript content).
If you are wondering why I have decided to use two different libraries to parse HTML content, the answer is straightforward. The HTML Agility Pack performs really well most of the time; the output you get is usually what you are expecting, but not always. So if one library fails to provide the expected results, I can switch to the other one. The major drawback of the MSHTML library in my opinion is the slow processing speed when it is integrated in a non-desktop application (e.g.: web sites, web services, etc.). The role of DB4Object in this project is to store configuration settings and cache content. One important thing that has to be mentioned about DB4Object is that the non-server version doesn’t support multi-threading (you can easily replace it with any other storage which is suitable for you).
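The "switch to the other library" idea can be sketched as a small fallback helper. This is only an illustration, not code from the project: the HtmlToXmlFallback class, the converter delegates, and the looksValid check are all hypothetical stand-ins for the HTML Agility Pack and MSHTML wrappers.

```csharp
using System;
using System.Xml;

public static class HtmlToXmlFallback
{
    // Tries each converter in order and returns the first document that is
    // non-null and passes the caller's sanity check. The converters are
    // hypothetical stand-ins for the two HTML parsing libraries.
    public static XmlDocument Convert(
        string html,
        Func<XmlDocument, bool> looksValid,
        params Func<string, XmlDocument>[] converters)
    {
        foreach (Func<string, XmlDocument> converter in converters)
        {
            try
            {
                XmlDocument doc = converter(html);
                if (doc != null && looksValid(doc))
                    return doc;
            }
            catch (XmlException)
            {
                // This converter produced malformed XML; try the next one.
            }
        }
        return null; // No converter produced an acceptable document.
    }
}
```

The order of the converters decides which library is preferred, so the faster one can go first and the slower one is only reached when the first result is rejected.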
Architecture
My solution contains four projects:
- HtmlAgilityPack (the actual HTML Agility Pack project with source code)
- HttpData.Client (the main project which implements the rules of HTML processing)
- HttpData.Client.MsHtmlToXml (the wrapper project over MSHTML and some extensions of it)
- HttpData.Client.Pdf (the project which implements some PDF processing using IFilter; not important for this article)
There is no point in discussing the HTML Agility Pack since you can find all the details and documentation about it at http://htmlagilitypack.codeplex.com. I will focus mainly on HttpData.Client and try to offer as many details and explanations as possible. The HTTP data client is designed to work in a similar way to the .NET SQL client (System.Data.SqlClient); you will notice that the classes included in the project and their logic closely resemble it (I hope it is not just my imagination). I will enumerate the interfaces and classes and provide details about their logic and purpose.
IHDPAdapter and HDPAdapter
The purpose of the HDPAdapter class is to allow integration of XML data with other data objects such as DataTable and DataSet. The IHDPAdapter interface exposes two methods which convert XML data into either a DataTable or a DataSet. Currently, only the DataTable conversion method is implemented. Here is the code for the interface and class:
IHDPAdapter code:
using System.Data;
using System.Xml;
namespace HttpData.Client
{
public interface IHDPAdapter
{
#region PROPERTIES
IHDPCommand SelectCommand{ get; set; }
#endregion
#region METHODS
int Fill(DataTable table, XmlDocument source, bool useNodes);
int Fill(DataSet dataset, XmlDocument source, bool useNodes);
#endregion
}
}
HDPAdapter code:
using System;
using System.Xml;
using System.Xml.XPath;
using System.Data;
using System.Text;
namespace HttpData.Client
{
public class HDPAdapter : IHDPAdapter
{
#region PRIVATE VARIABLES
private IHDPCommand _selectCommand;
#endregion
#region Properties
IHDPCommand IHDPAdapter.SelectCommand
{
get{ return _selectCommand; }
set{ _selectCommand = value; }
}
public HDPCommand SelectCommand
{
get{ return (HDPCommand)_selectCommand; }
set{ _selectCommand = value; }
}
public string ConnectionString { get; set; }
#endregion
#region .ctor
public HDPAdapter()
{
}
public HDPAdapter(string connectionString)
{
this.ConnectionString = connectionString;
}
#endregion
#region Public Methods
public int Fill(DataTable table, XmlDocument source, bool useNodes)
{
bool columnsCreated = false;
bool resetRow = false;
if(table == null || source == null)
return 0;
if (table.TableName.Length == 0)
return 0;
StringBuilder sbExpression = new StringBuilder("//");
sbExpression.Append(table.TableName);
XPathNavigator xpNav = source.CreateNavigator();
if (xpNav != null)
{
XPathNodeIterator xniNode = xpNav.Select(sbExpression.ToString());
while(xniNode.MoveNext())
{
XPathNodeIterator xniRowNode =
xniNode.Current.SelectChildren(XPathNodeType.Element);
while (xniRowNode.MoveNext())
{
if(resetRow)
{
xniRowNode.Current.MoveToFirst();
resetRow = false;
}
DataRow row = null;
if (columnsCreated)
row = table.NewRow();
if(useNodes)
{
XPathNodeIterator xniColumnNode =
xniRowNode.Current.SelectChildren(XPathNodeType.Element);
while (xniColumnNode.MoveNext())
{
if (!columnsCreated)
{
DataColumn column =
new DataColumn(xniColumnNode.Current.Name);
table.Columns.Add(column);
}
else
row[xniColumnNode.Current.Name] =
xniColumnNode.Current.Value;
}
}
else
{
XPathNodeIterator xniColumnNode = xniRowNode.Clone();
bool onAttribute = xniColumnNode.Current.MoveToFirstAttribute();
while (onAttribute)
{
if (!columnsCreated)
{
DataColumn column =
new DataColumn(xniColumnNode.Current.Name);
table.Columns.Add(column);
}
else
row[xniColumnNode.Current.Name] =
xniColumnNode.Current.Value;
onAttribute = xniColumnNode.Current.MoveToNextAttribute();
}
}
if (!columnsCreated)
{
columnsCreated = true;
resetRow = true;
}
if (row != null)
table.Rows.Add(row);
}
}
}
return table.Rows.Count;
}
public int Fill(DataSet dataset, XmlDocument source, bool useNodes)
{
throw new NotImplementedException();
}
#endregion
#region Private Methods
#endregion
}
}
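To make the intended XML-to-DataTable mapping easier to follow, here is a compact, self-contained sketch of the same idea using XmlNode instead of XPathNodeIterator. It is not the project's implementation: the XmlTableFiller class name is hypothetical, and unlike the listing above it creates columns lazily as new element or attribute names appear. As in the original, every element named after the table is treated as a container whose child elements are rows.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Xml;

public static class XmlTableFiller
{
    // Each child element of a <TableName> node becomes one row. Columns come
    // from the row's child elements (useNodes = true) or from its attributes
    // (useNodes = false).
    public static int Fill(DataTable table, XmlDocument source, bool useNodes)
    {
        if (table == null || source == null || table.TableName.Length == 0)
            return 0;

        foreach (XmlNode tableNode in source.SelectNodes("//" + table.TableName))
        {
            foreach (XmlNode rowNode in tableNode.ChildNodes)
            {
                if (rowNode.NodeType != XmlNodeType.Element)
                    continue;

                // Collect the row's cells first so that any missing columns
                // exist before the DataRow is created.
                var cells = new List<KeyValuePair<string, string>>();
                if (useNodes)
                {
                    foreach (XmlNode child in rowNode.ChildNodes)
                        if (child.NodeType == XmlNodeType.Element)
                            cells.Add(new KeyValuePair<string, string>(child.Name, child.InnerText));
                }
                else
                {
                    foreach (XmlAttribute attribute in rowNode.Attributes)
                        cells.Add(new KeyValuePair<string, string>(attribute.Name, attribute.Value));
                }

                foreach (var cell in cells)
                    if (!table.Columns.Contains(cell.Key))
                        table.Columns.Add(cell.Key);

                DataRow row = table.NewRow();
                foreach (var cell in cells)
                    row[cell.Key] = cell.Value;
                table.Rows.Add(row);
            }
        }
        return table.Rows.Count;
    }
}
```

With a DataTable named "cities" and a document shaped like `<cities><city><name>Miami</name></city></cities>`, calling Fill with useNodes set to true produces one row with a "name" column.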
IHDPConnection and HDPConnection
As the name says, this represents the connection class, which manages in an abstract way how a connection behaves. The interface exposes a set of methods and properties relevant to it. There are only three methods exposed and implemented:
- Open method (changes the connection state to open; this method has an overload which accepts as a parameter the URL of the web resource which will be opened)
- Close method (changes the connection state to closed and, if there is a cache storage in use, closes it)
- CreateCommand method (creates a new HDPCommand object and assigns the current connection to it)
Now let us take a look at the properties exposed by IHDPConnection and implemented by HDPConnection:
- ConnectionURL (represents the web resource URL which will be opened using the current connection)
- KeepAlive (defines whether the connection should be kept open once the querying is done)
- AutoRedirect (defines whether the connection allows auto-redirects to be performed)
- MaxAutoRedirects (defines how many auto-redirects can be performed)
- UserAgent (defines what user agent will be associated with the connection, e.g.: Internet Explorer, Chrome, Opera, etc.)
- ConnectionState (read-only property which provides information about the connection state, i.e., whether the connection is open or closed)
- Proxy (defines what proxy will be used when querying is performed)
- Cookies (cookies associated with the connection currently or when the querying takes place)
- ContentType (defines what content type is expected when querying takes place, e.g.: application/x-www-form-urlencoded, application/json, etc.)
- Headers (contains the headers associated with the connection currently or when the querying takes place)
- Referer (contains the referrer which is going to be used when querying the connection URL)
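Most of these properties have a direct counterpart on HttpWebRequest. The sketch below shows one plausible mapping; it is my own illustration rather than the code from HDPCommand, and the RequestFactory class and its parameter list are hypothetical. The request is only built, not sent. Note that recent .NET versions mark WebRequest.Create as obsolete in favor of HttpClient, so expect a compiler warning there.

```csharp
using System;
using System.Net;

public static class RequestFactory
{
    // Builds (but does not send) a request configured from connection-style
    // settings. Parameter names mirror the IHDPConnection properties.
    public static HttpWebRequest Create(
        string connectionUrl, bool keepAlive, bool autoRedirect,
        int maxAutoRedirects, string userAgent, string contentType,
        string referer, CookieCollection cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(connectionUrl);
        request.KeepAlive = keepAlive;
        request.AllowAutoRedirect = autoRedirect;
        // Must be greater than zero, otherwise the setter throws.
        request.MaximumAutomaticRedirections = maxAutoRedirects;
        request.UserAgent = userAgent;
        request.ContentType = contentType;
        request.Referer = referer;

        // Cookies are only sent when a CookieContainer is attached.
        request.CookieContainer = new CookieContainer();
        if (cookies != null)
            request.CookieContainer.Add(new Uri(connectionUrl), cookies);
        return request;
    }
}
```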
IHDPConnection code:
using System.Collections.Generic;
using System.Net;
namespace HttpData.Client
{
public interface IHDPConnection
{
#region MEMBERS
#region METHODS
void Open();
void Close();
IHDPCommand CreateCommand();
#endregion
#region PROPERTIES
string ConnectionURL { get; set; }
bool KeepAlive { get; set; }
bool AutoRedirect { get; set; }
int MaxAutoRedirects { get; set; }
string UserAgent { get; set; }
HDPConnectionState ConnectionState { get; }
HDPProxy Proxy { get; set; }
CookieCollection Cookies { get; set; }
string ContentType { get; set; }
List<HDPConnectionHeader> Headers { get; set; }
string Referer { get; set; }
#endregion
#endregion
}
}
HDPConnection code:
using System.Collections.Generic;
using System.Net;
namespace HttpData.Client
{
public class HDPConnection : IHDPConnection, System.IDisposable
{
#region Private Variables
private HDPConnectionState _connectionState;
private string _connectionURL;
private HDPCache cache;
private bool useCache;
#endregion
#region Properties
public bool UseCahe
{
get { return useCache; }
}
public HDPCache Cache
{
get { return cache; }
}
#endregion
#region .ctor
public HDPConnection()
{
_connectionState = HDPConnectionState.Closed;
_connectionURL = "";
Cookies = new CookieCollection();
MaxAutoRedirects = 1;
}
public HDPConnection(string connectionURL)
{
_connectionState = HDPConnectionState.Closed;
_connectionURL = connectionURL;
Cookies = new CookieCollection();
MaxAutoRedirects = 1;
}
public HDPConnection(HDPCacheDefinition cacheDefinitions)
{
_connectionState = HDPConnectionState.Closed;
_connectionURL = "";
Cookies = new CookieCollection();
MaxAutoRedirects = 1;
cache = cacheDefinitions != null ? new HDPCache(cacheDefinitions) : null;
useCache = true;
}
public HDPConnection(string connectionURL, HDPCacheDefinition cacheDefinitions)
{
_connectionState = HDPConnectionState.Closed;
_connectionURL = connectionURL;
Cookies = new CookieCollection();
MaxAutoRedirects = 1;
cache = cacheDefinitions != null ? new HDPCache(cacheDefinitions) : null;
useCache = true;
}
#endregion
#region Public Methods
#endregion
#region IHDPConnection Members
#region Methods
public void Open()
{
_connectionState = HDPConnectionState.Open;
}
public void Open(string connectionURL)
{
_connectionURL = connectionURL;
_connectionState = HDPConnectionState.Open;
}
public void Close()
{
_connectionState = HDPConnectionState.Closed;
if (cache != null)
cache.CloseStorageConnection();
}
IHDPCommand IHDPConnection.CreateCommand()
{
HDPCommand command = new HDPCommand { Connection = this };
return command;
}
public HDPCommand CreateCommand()
{
HDPCommand command = new HDPCommand { Connection = this };
return command;
}
#endregion
#region Properties
public string ConnectionURL
{
get { return _connectionURL; }
set { _connectionURL = value; }
}
public bool AutoRedirect { get; set; }
public int MaxAutoRedirects { get; set; }
public bool KeepAlive { get; set; }
public string UserAgent { get; set; }
public string ContentType { get; set; }
public CookieCollection Cookies { get; set; }
public HDPConnectionState ConnectionState
{
get { return _connectionState; }
}
public HDPProxy Proxy { get; set; }
public List<HDPConnectionHeader> Headers { get; set; }
public string Referer { get; set; }
#endregion
#endregion
#region IDisposable Members
public void Dispose()
{
this.dispose();
System.GC.SuppressFinalize(this);
}
private void dispose()
{
if (_connectionState == HDPConnectionState.Open)
this.Close();
}
#endregion
}
}
IHDPCommand and HDPCommand
This represents our engine, which provides the functionality for querying web resources and processing the result (response). It offers a variety of ways to process the response content of the query, such as XPath, RegEx, XSLT, Reflection, etc. I will discuss in detail only the main methods; the rest build on those, and I assume the comments which accompany the methods will suffice to provide guidance in the right direction. But before I reach the methods, let me present the properties. I will not post the content of the HDPCommand class here due to its large number of lines of code. You will be able to analyze it in detail using the source code provided.
- Connection (defines the connection object associated with this command)
- Parameters (defines the parameters used in the querying process)
- CommandType (defines the command type used in the querying process; it is either GET or POST)
- CommandText (defines the content of the command which is going to be executed; for a GET command, the URL with its query parameters is stored; for a POST command, the body content of the POST action is stored)
- CommandTimeout (defines the time period in which a response is expected from the web resource)
- Response (contains the response string received from the web resource based on a query action)
- Uri (contains the URI of the queried web resource)
- Path (contains the path of the queried web resource)
- LastError (contains the last error message encountered in the process)
- ContentLength (contains the length of the content received from the web resource based on a query action)
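The difference between a GET and a POST CommandText can be illustrated with a small helper that encodes the same name/value pairs either as a query string or as a request body. This is a hedged sketch, not the logic inside HDPCommand: the CommandTextBuilder class is hypothetical, and it uses Uri.EscapeDataString, which encodes a space as %20 rather than the + often seen in form posts.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CommandTextBuilder
{
    // Encodes name/value pairs as "a=1&b=2" with percent-encoding.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> parameters)
    {
        StringBuilder sb = new StringBuilder();
        foreach (KeyValuePair<string, string> p in parameters)
        {
            if (sb.Length > 0)
                sb.Append('&');
            sb.Append(Uri.EscapeDataString(p.Key));
            sb.Append('=');
            sb.Append(Uri.EscapeDataString(p.Value ?? ""));
        }
        return sb.ToString();
    }

    // GET: the parameters travel in the URL as a query string.
    public static string ForGet(string url, IEnumerable<KeyValuePair<string, string>> parameters)
    {
        string query = Encode(parameters);
        if (query.Length == 0)
            return url;
        return url + (url.Contains("?") ? "&" : "?") + query;
    }

    // POST: the same encoding is used, but the text becomes the request body.
    public static string ForPost(IEnumerable<KeyValuePair<string, string>> parameters)
    {
        return Encode(parameters);
    }
}
```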
We can now move to the exposed/implemented methods.
- GetParametersCount (gets the number of parameters used in the query process)
- CreateParameter (creates a new parameter to be used in the query process)
- ExecuteNonQuery (executes a query on a web resource using either the GET or POST method, and returns the number of results received; it has a parameter which specifies if the collection of parameters used in the query process should be cleared at the end)
- Execute (executes a query on a web resource using either the GET or POST method, and returns a boolean value: true if the query was executed with success, false if it failed)
- ExecuteStream (executes a query on a web resource using either the GET or POST method, and returns the underlying HTTP response stream)
- CloseResponse (closes the HTTP response stream opened by the ExecuteStream method)
- ExecuteNavigator (executes a query on a web resource using either the GET or POST method, and returns an XPathNavigator object used to navigate through the response converted to XML; it has a parameter which specifies if the collection of parameters used in the query process should be cleared at the end)
- ExecuteDocument (it has an overload; executes a query on a web resource using either the GET or POST method, and returns an IXPathNavigable object used to navigate through the response converted to XML; the “expression” parameter represents an XPath expression which will be used in the processing of the result)
- ExecuteBinary (executes a query on a web resource using either the GET or POST method, and returns the result as a byte array; this is mostly used when querying binary content from web resources, e.g.: PDF files, images; one of the overloads takes a parameter which imposes a limit on the output buffer)
- ExecuteBinaryConversion (executes a query on a web resource using either the GET or POST method, and returns the result as a string; this is used when querying binary content from web resources, e.g.: PDF files, where the content of the PDF file is converted from binary to string)
- ExecuteString (executes a query on a web resource using either the GET or POST method, and returns the result as a plain string)
- ExecuteValue (executes a query on a web resource using either the GET or POST method, and returns as a string the result of applying an XPath expression to the response; instead of an XPath expression, a RegEx can be used)
- ExecuteCollection (executes a query on a web resource using either the GET or POST method, and returns the result as a generic string collection; the result is produced by applying either an XPath or a RegEx expression to the response)
- ExecuteArray (executes a query on a web resource using either the GET or POST method, and returns the result as a string array; the result is produced by applying either an XPath or a RegEx expression to the response)
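The XPath-or-RegEx choice shared by ExecuteValue, ExecuteCollection, and ExecuteArray can be sketched without any HTTP involved, working directly on a response that has already been converted to XML. This is an illustration under my own assumptions, not the project's code; the ResponseQuery class and its method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.XPath;

public static class ResponseQuery
{
    // Returns every match of 'expression' in 'xml'. The expression is treated
    // as a regular expression when isRegEx is true, otherwise as XPath.
    public static List<string> Collect(string xml, string expression, bool isRegEx)
    {
        List<string> results = new List<string>();
        if (isRegEx)
        {
            foreach (Match match in Regex.Matches(xml, expression))
                results.Add(match.Value);
            return results;
        }

        XPathNavigator navigator = new XPathDocument(new StringReader(xml)).CreateNavigator();
        XPathNodeIterator nodes = navigator.Select(expression);
        while (nodes.MoveNext())
            results.Add(nodes.Current.Value);
        return results;
    }

    // Single-value variant: the first match, or null when nothing matched.
    public static string Value(string xml, string expression, bool isRegEx)
    {
        List<string> all = Collect(xml, expression, isRegEx);
        return all.Count > 0 ? all[0] : null;
    }
}
```

An array-returning variant would simply call ToArray() on the collected list.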
IHDPCommand code:
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;
namespace HttpData.Client
{
public interface IHDPCommand
{
#region Members
#region Properties
IHDPConnection Connection { get; set; }
IHDPParameterCollection Parameters { get; }
HDPCommandType CommandType { get; set; }
string CommandText { get; set; }
int CommandTimeout { get; set; }
string Response { get; }
string Uri { get; }
string Path { get; }
string LastError { get; }
long ContentLength { get; }
#endregion
#region Methods
int GetParametersCount();
IHDPParameter CreateParameter();
int ExecuteNonQuery(bool clearParams);
bool Execute();
bool Execute(bool clearParams);
Stream ExecuteStream(bool clearParams);
void CloseResponse();
XPathNavigator ExecuteNavigator(bool clearParams);
IXPathNavigable ExecuteDocument(bool clearParams);
IXPathNavigable ExecuteDocument(string expression, bool clearParams);
byte[] ExecuteBinary(bool clearParams);
byte[] ExecuteBinary(int boundaryLimit, bool clearParams);
string ExecuteBinaryConversion(bool clearParams);
string ExecuteString(bool clearParams);
string ExecuteValue(string expression, bool clearParams);
string ExecuteValue(string expression, bool clearParams, bool isRegEx);
List<string> ExecuteCollection(string expression, bool clearParams, bool isRegEx);
List<string> ExecuteCollection(string expression, bool clearParams);
string[] ExecuteArray(string expression, bool clearParams, bool isRegEx);
string[] ExecuteArray(string expression, bool clearParams);
#endregion
#endregion
}
}
HDPCache, HDPCacheDefinition, HDPCacheObject, and HDPCacheStorage
HDPCache, HDPCacheDefinition, HDPCacheObject, and HDPCacheStorage are the classes which handle the cache. I will not dwell on this subject since it is not so important here. If you like, you can study those classes in more detail yourself; I think the code comments will help you grasp their purpose and functionality quite fast. The class HDPCacheDefinition is straightforward; it contains a set of public fields which define the cache behavior. Here they are:
- StorageActiveUntil (defines the date until which the cache is considered to be valid)
- MemorySizeLimit (imposes a memory size limit on the cache)
- ObjectsNumberLimit (imposes a limit on the number of objects in the cache)
- UseStorage (defines whether the cache should be persisted on disk)
- RetrieveFromStorage (defines whether a specific value should be searched for in the cache persisted on disk)
- RealtimePersistance (defines whether the cache will be persisted on disk in real time once a new value has been added to it)
- StorageName (defines the file name of the persisted cache on disk)
The class HDPCacheObject is just a key/value pair plus a time stamp field used to identify the age of the cached object. The caching system works in a very clear and simple manner. When a web resource is queried, its URL represents the cache object key and the result of the query represents the cache object value. If the cache is activated when using an HDPCommand object to query web resources, each query URL and response content is stored in the memory cache. If the same web resource is queried again using the same URL, the HTTP request is not performed and the response content is retrieved from the memory cache. There are extra options defined in HDPCacheDefinition which allow you to control how the cache behaves. For example, if you impose a cache memory limit of 1024 KB, then every time a new value is added to the cache, its memory footprint is calculated. If the imposed limit is exceeded, then, depending on the other behavior definitions, the cache content is either stored on disk or deleted. I would like to mention that MemorySizeLimit and ObjectsNumberLimit are mutually exclusive. So if you define a value greater than 0 for MemorySizeLimit, there is no point in defining a value for ObjectsNumberLimit because it will not be taken into consideration, and vice versa.
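The URL-as-key behavior described above can be sketched with a dictionary and a fetch delegate. This is a simplified illustration and not the HDPCache implementation: the UrlCache class is hypothetical, only the object-count limit is modeled, and evicting the oldest entry when the limit is reached is my assumption (the real cache can also persist to disk).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class UrlCache
{
    private sealed class Entry
    {
        public string Value;
        public DateTime CachedAt;
    }

    private readonly Dictionary<string, Entry> entries = new Dictionary<string, Entry>();
    private readonly int objectsNumberLimit;

    // A limit of 0 or less means "no limit".
    public UrlCache(int objectsNumberLimit)
    {
        this.objectsNumberLimit = objectsNumberLimit;
    }

    public int Count
    {
        get { return entries.Count; }
    }

    // Returns the cached response for 'url', or calls 'fetch' (the actual
    // HTTP request) once and stores its result under the URL.
    public string GetOrFetch(string url, Func<string, string> fetch)
    {
        Entry entry;
        if (entries.TryGetValue(url, out entry))
            return entry.Value;

        string response = fetch(url);
        if (objectsNumberLimit > 0 && entries.Count >= objectsNumberLimit)
            EvictOldest();
        entries[url] = new Entry { Value = response, CachedAt = DateTime.UtcNow };
        return response;
    }

    // Assumed policy: drop the entry with the oldest time stamp.
    private void EvictOldest()
    {
        string oldest = entries.OrderBy(e => e.Value.CachedAt).First().Key;
        entries.Remove(oldest);
    }
}
```

Passing the request as a delegate keeps the cache independent of how the response is obtained, which also makes it easy to try out without network access.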
HDPCacheDefinition code:
using System;
namespace HttpData.Client
{
public class HDPCacheDefinition
{
#region Public Variables
public DateTime StorageActiveUntil = DateTime.Now.AddDays(1);
public long MemorySizeLimit;
public int ObjectsNumberLimit = 10000;
public bool UseStorage = true;
public bool RetrieveFromStorage;
public bool RealtimePersistance;
public string StorageName = "HttpDataProcessorCahe.che";
#endregion
}
}
HDPCacheObject code:
using System;
namespace HttpData.Client
{
[Serializable]
public class HDPCacheObject
{
#region Private Variables
private string key;
private object value;
private DateTime cacheDate;
#endregion
#region Properties
public string Key
{
get { return key; }
set { key = value; }
}
public object Value
{
get { return value; }
set { this.value = value; }
}
public DateTime CacheDate
{
get { return cacheDate; }
}
#endregion
#region .ctor
public HDPCacheObject()
{
cacheDate = DateTime.Now;
}
public HDPCacheObject(string key, object value)
{
this.key = key;
this.value = value;
cacheDate = DateTime.Now;
}
#endregion
#region Public Methods
#endregion
#region Private Methods
#endregion
}
}
Using the Code
I will provide a couple of examples so you can figure out how things work. I consider this to be the best way to understand how the earth spins. Let us say, for example, that we would like to retrieve all Florida cities from the following page: http://www.stateofflorida.com/Portal/DesktopDefault.aspx?tabid=34. Here is the code to achieve the above mentioned task.
using System;
using System.Collections.Generic;
using HttpData.Client;
namespace CityStates
{
class Program
{
static void Main(string[] args)
{
const string connectionUrl =
"http://www.stateofflorida.com/Portal/DesktopDefault.aspx?tabid=34";
HDPCacheDefinition cacheDefinition = new HDPCacheDefinition
{
UseStorage = false,
StorageActiveUntil = DateTime.Now,
ObjectsNumberLimit = 10000,
RealtimePersistance = false,
RetrieveFromStorage = false,
StorageName = null
};
HDPConnection connection = new HDPConnection(connectionUrl, cacheDefinition)
{
ContentType = HDPContentType.TEXT,
AutoRedirect = true,
MaxAutoRedirects = 10,
UserAgent = HDPAgents.FIREFOX_3,
Proxy = null
};
connection.Open();
HDPCommand command = new HDPCommand(connection)
{
ActivatePool = true,
CommandType = HDPCommandType.Get,
CommandTimeout = 60000,
UseMsHtml = true
};
List<string> cities =
command.ExecuteCollection("//ul/li/b//text()[normalize-space()]", true);
foreach (string city in cities)
Console.WriteLine(city);
connection.Close();
}
}
}
Here is a different example now. Let us say that we would like to login on to the LinkedIn network using a user name and password. Here is the code to achieve that:
using System;
using System.Collections.Generic;
using HttpData.Client;
namespace CityStates
{
class Program
{
static void Main(string[] args)
{
const string connectionUrl =
"https://www.linkedin.com/secure/login?trk=hb_signin";
HDPCacheDefinition cacheDefinition = new HDPCacheDefinition
{
UseStorage = false,
StorageActiveUntil = DateTime.Now,
ObjectsNumberLimit = 10000,
RealtimePersistance = false,
RetrieveFromStorage = false,
StorageName = null
};
HDPConnection connection = new HDPConnection(connectionUrl, cacheDefinition)
{
ContentType = HDPContentType.TEXT,
AutoRedirect = true,
MaxAutoRedirects = 10,
UserAgent = HDPAgents.FIREFOX_3,
Proxy = null
};
connection.Open();
HDPCommand command = new HDPCommand(connection)
{
ActivatePool = true,
CommandType = HDPCommandType.Get,
CommandTimeout = 60000,
UseMsHtml = false
};
HDPParameterCollection parameters = new HDPParameterCollection();
HDPParameter pToken =
new HDPParameter("@csrfToken", "ajax:-3801133150663455891");
HDPParameter pSessionKey =
new HDPParameter("@session_key", "YOUR_EMAIL@gmail.com");
HDPParameter pSessionPass =
new HDPParameter("@session_password", "YOUR_PASSWORD");
HDPParameter pSessionLogin =
new HDPParameter("@session_login", "Sign+In");
HDPParameter pSessionLogin_ = new HDPParameter("@session_login", "");
HDPParameter pSessionRiKey = new HDPParameter("@session_rikey", "");
parameters.Add(pToken);
parameters.Add(pSessionKey);
parameters.Add(pSessionPass);
parameters.Add(pSessionLogin);
parameters.Add(pSessionLogin_);
parameters.Add(pSessionRiKey);
string value = command.ExecuteValue(
"//a[@id='manual_redirect_link']/@href", true);
if (value != null && String.Compare(value,
"http://www.linkedin.com/home") == 0)
{
command.Connection.ConnectionURL = value;
command.CommandType = HDPCommandType.Get;
string content =
command.ExecuteValue("//title[contains(.,'Welcome,')]", true);
if (!String.IsNullOrEmpty(content))
Console.WriteLine(content);
else
Console.WriteLine("Login failed!");
}
connection.Close();
}
}
}
In your sample project, please add the following app.config content if you are going to use MSHTML.
<?xml version="1.0" encoding="utf-8"?>
<configuration>
<appSettings>
<add key="LogFilePath" value="..\Log\My-Log.txt"/>
<add key="HtmlTagsPath" value="HtmlTags.txt"/>
<add key="AttributesTagsPath" value="HtmlAttributes.txt"/>
</appSettings>
</configuration>
Notes
- HttpData.Client.Pdf - not all content belongs to me. I do not recall from where I got parts of it.
- HDPUtils.cs - I am not proud of its content, I find it to be quite dirty so please ignore that for now.
Issues
HtmlAgilityPack - when used, sometimes the content converted by it doesn't match the actual HTML DOM structure, especially when it comes to the form element.
MSHTML - when used, it strips all content between the html tag and the body tag (including the html tag). It also validates the input HTML content against a list of valid elements and attributes, so everything that doesn't match will be removed. One important thing to note is that by default the JavaScript content is removed. You can change this behavior in the HtmlLoader.cs class found in the HttpData.Client.MsHtmlToXml project.
Points of Interest
It should be fairly obvious what sort of applications could make use of the library above: for example, extracting data from web pages or automating a login, as in the two examples shown.
History
No updates yet, but I am sure there will be some in the future.