Contents
- Introduction
- Background
- Using the code
- AngleSharp API
- Give Me the Code
- Points of Interest
- History
Introduction
Once in a while we face the following problem: We need to access some data that is only exposed via some webpage. Unfortunately the page is only accessible after submitting a login form. What can we do? Most people would instantly go for a solution like PhantomJS, which is quite heavy and restricted to powerful platforms. For instance we can't deploy an app that is using PhantomJS on common smartphones.
However, we are lucky. There are many C# projects that try to solve this problem. We could also just go for a standard HttpClient and combine it with a state-of-the-art HTML5 parser. But doing so correctly is tedious and the W3C specifications are vast. For a simple problem we might be able to come up with our own, home-baked solution, which just works. But when the page - and therefore the problem - changes, we might need to reconsider.
Finally we may be interested in a state of the art solution that solves all our problems. This article discusses the AngleSharp library, which forms the basis of a (headless) browser completely written in managed code.
Background
It is already more than two years since I started the AngleSharp project. Initially planned as just a little HTML5 parser component, the project quickly transformed into the basis for a browser in C#. The core project contains an HTML5 parser, a CSS3 parser, a simple (but mostly sufficient) HTTP requester and many more utilities. There are other libraries (available or to be released), which take care of scripting (e.g., connecting a JavaScript engine) or even rendering.
The initial release of AngleSharp is also documented on CodeProject. The first article goes into details of parts of the implementation. Some things changed internally and the API matured. Right now we are close to the release of AngleSharp v0.9, which comes shortly before the real deal, AngleSharp v1. AngleSharp uses semantic versioning (see http://semver.org), which will trigger quite drastic jumps in the version number if any breaking changes occur. It is therefore required to make the API as stable and extensible as possible prior to v1.
Connecting a JavaScript engine has also been discussed. The second article on AngleSharp outlines some of the internal advances of the library and the future roadmap. It is a rather technical document describing what JavaScript engines are out there (especially for .NET) and why we chose Jint over the alternatives. Also there are some things to learn about JavaScript engines.
This article will be rather user focused. We will discuss a (more or less) standard demo of AngleSharp and learn its API. Most interestingly we will see how AngleSharp deals with the concept of reading / manipulating websites.
Using the code
The supplied code contains two projects:
- A very simple ASP.NET MVC webpage.
- A basic WPF desktop application.
The former is used to represent the website, which we want to access. The particular data we are interested in is only accessible for authenticated users. The latter is a desktop application that contains a button, which triggers an action that will log in, get the data and log out from the website.
The website looks as follows. The following screenshot shows the landing page.
The page with the secret information only contains a small panel. Rendered it looks as follows.
The WPF client really consists only of a single button. A screenshot of the application (after the data has been received):
Nevertheless, the WPF client uses the MVVM pattern to keep the code clean. Let's have a look at the VM:
sealed class MainViewModel : BaseViewModel
{
    State _state;
    String _content;
    RelayCommand _submit;

    public MainViewModel()
    {
        _state = State.Idle;
        _content = String.Empty;
        _submit = new RelayCommand(async () =>
        {
            ChangeState(State.Loading);
            // Blank spot: the actual work (log in, read the data) is filled in later.
            ChangeState(State.Finished);
        });
    }

    public Boolean IsIdle
    {
        get { return _state == State.Idle; }
    }

    public Boolean IsLoading
    {
        get { return _state == State.Loading; }
    }

    public Boolean IsFinished
    {
        get { return _state == State.Finished; }
    }

    void ChangeState(State state)
    {
        _state = state;
        TriggerChanged("IsIdle");
        TriggerChanged("IsLoading");
        TriggerChanged("IsFinished");
    }

    public RelayCommand Submit
    {
        get { return _submit; }
    }

    public String Content
    {
        get { return _content; }
        set
        {
            _content = value;
            TriggerChanged();
        }
    }

    enum State
    {
        Idle,
        Loading,
        Finished
    }
}
As there is not much going on in the UI it is not really required to talk about the XAML. Basically we only have a button and a textbox. Once the button's action is triggered the state is changed from idle to loading. Finally once we receive the desired data we change the state to finished.
The state machinery is as follows: In the idle state only the enabled button is shown. The loading state shows a textbox with "Loading ..." as content and disables the button. Finally in the finished state we only see the textbox. The content of the textbox is the content of the secret page that we've been interested in.
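The view model above relies on two helpers that are not shown in this article, BaseViewModel and RelayCommand. The following is only a minimal sketch of what they might look like; the implementation details (the use of CallerMemberName, a Func<Task> based command that is always executable) are assumptions, and the versions in the supplied code may differ.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;
using System.Windows.Input;

// Assumed base class: only provides the change notification used by the VM.
abstract class BaseViewModel : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    // Without an argument the compiler inserts the name of the calling property.
    protected void TriggerChanged([CallerMemberName] String propertyName = null)
    {
        var handler = PropertyChanged;

        if (handler != null)
            handler(this, new PropertyChangedEventArgs(propertyName));
    }
}

// Assumed command: wraps an asynchronous action and is always executable.
sealed class RelayCommand : ICommand
{
    readonly Func<Task> _action;

    public RelayCommand(Func<Task> action)
    {
        _action = action;
    }

    // The sketch never changes its executable state, hence empty accessors.
    public event EventHandler CanExecuteChanged
    {
        add { }
        remove { }
    }

    public Boolean CanExecute(Object parameter)
    {
        return true;
    }

    public async void Execute(Object parameter)
    {
        await _action();
    }
}
```

With such a base class the call TriggerChanged() in the Content setter raises the notification for "Content", while ChangeState names the three state properties explicitly.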
AngleSharp API
AngleSharp exposes a fully functional DOM to the user. This requires the interplay of a lot of components. The library itself does not contain all of these components. Instead, AngleSharp has extension points, which allow users to provide the desired functionality. The set of offered (and specific) functionality is aggregated into an instance of the IConfiguration interface.
The core library comes with a standard implementation of IConfiguration, called Configuration. The standard implementation offers the static Default property, which yields a default (usually empty) configuration. The default configuration can be set. It will be used internally if no configuration is provided and is therefore really useful.
There are many ways to parse HTML in AngleSharp. But the best way is by opening a dedicated IBrowsingContext. A browsing context can be seen as a tab in common browsers. It is an independent unit with its own security settings, configuration and history. It also follows best practices for loading documents and therefore it knows how to talk to any given HTTP requester. It is also useful for navigating to pages or submitting forms. The latter is especially interesting for us.
We should use the standard implementation accessible via the BrowsingContext class. Creating a new instance can be done either classically by using the new operator, or by using the static New method. The latter looks better in chained scenarios. So let's create a new context using the default configuration explicitly.
var context = BrowsingContext.New(Configuration.Default);
Adding functionality to the configuration works by using extension methods. Any plugin for AngleSharp would follow the same pattern. The most important concept here is that the IConfiguration interface only defines getters. It is therefore regarded as immutable. Since no plugin can expect a specific implementation (such as Configuration) to be used, it is impossible to alter the current state. Therefore we will always receive a new IConfiguration instance, which will be the aggregate of the former configuration with the new abilities.
We use the With...() extension methods to create a new, extended, instance of an IConfiguration object. In our special case we care about having an HTTP requester. Even though AngleSharp comes with one by default, it is not included in the default setting for Configuration.Default. We need to include it.
var configuration = Configuration.Default.WithDefaultLoader();
Alternatively we may start with a completely fresh configuration. The one obtained from Configuration.Default may already contain unwanted abilities, depending on other code. Here we write:
var configuration = new Configuration().WithDefaultLoader();
There is yet another advantage to instantiating the Configuration besides being sure it does not contain any services already. We can set the locale information. This has no influence on the treatment of numbers etc. (they are all invariant), but may influence some culture dependent parts of the specification. For instance the default encoding service uses the culture to determine the default encoding. We can also integrate our own encoding service, which will, e.g., always use UTF-8. But keep in mind that many of AngleSharp's default services are created to follow the standard exactly. If we replace these services by our own components we may get non-standard results.
Now that we have successfully created a new IBrowsingContext with an IConfiguration that contains all services required for the upcoming task, we may load a page to inspect. Methods for an IBrowsingContext are again supplied in form of extension methods. This makes them independent of a concrete implementation. They only require the implementation of the basic set of properties defined by the IBrowsingContext.
All methods are async. AngleSharp tries to use the TPL for everything. Anything that uses (maybe external) streams, or should be queued somehow, is transformed to a Task. The loading mechanisms are no exception.
If we want to open an "empty" page we can use OpenNewAsync. Optionally we can specify an address for this empty resource. This will then be the baseURI of the new document. The base URL is only interesting for navigation, form submission and other things, but might be handy if we plan to manipulate the empty document.
If we want to open a "local" page, either with an existing Stream (maybe from disk), or with an existing string instance, we can use the virtual response interface exposed in form of an Action, which is an overload of the OpenAsync method. The virtual response lets us dictate what response we would want to see from a hypothetical server.
As an example, if we have the (fixed) source for Google's homepage in a string variable called sourceCode, we could use the following instruction:
var document = await context.OpenAsync(res => res.
    Address("http://www.google.com").
    Status(200).
    Header("Content-Language", "en").
    Content(sourceCode));
Chaining makes it quite easy to pack a lot of semantics into a single line of code. For readability the statement has been split into multiple lines. Note we are using await to unpack the Task<IDocument> to an IDocument after the document loading finished.
The opening methods will all do the same. They will send a request (if necessary), obtain the response and use the response to construct a new document. The document is then filled by an HTML parser, which constructs the DOM from the body of the response asynchronously.
The current document can also be retrieved from the context itself. The context has a property called Active. The property references the currently active IDocument. It is the answer to the question: "Hey browsing tab, what document are you currently displaying?"
Now that we have the document the next question is - what shall we do with it? We could get elements by using QuerySelectorAll or just the first one with QuerySelector.
var anchors = document.QuerySelectorAll("a");
var firstAnchor = anchors.FirstOrDefault();
var firstAnchorDirect = document.QuerySelector("a");
There is a subtle, but maybe important, difference between using QuerySelector and a combination of QuerySelectorAll and the FirstOrDefault LINQ extension method: The former will stop at the first match, while the latter will definitely iterate over all elements. The reason is simple: QuerySelectorAll will already return a fully evaluated set. It does not use lazy evaluation. Nevertheless, the big message is that the returned type implements IEnumerable<IElement> and therefore allows using LINQ statements on the result.
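To illustrate the LINQ part, here is a small sketch that collects the distinct href attribute values of all anchors in a given HTML source. The class and method names, as well as the address http://localhost/, are made up for this illustration; the document is created from a string with the virtual response shown earlier.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Extensions;

static class LinqOverDomSample
{
    // Hypothetical helper: returns the distinct href values found in the source.
    public static async Task<String[]> GetLinkTargetsAsync(String sourceCode)
    {
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(res => res.
            Address("http://localhost/").
            Content(sourceCode));

        // QuerySelectorAll has already evaluated the full set at this point;
        // LINQ only post-processes the resulting elements.
        return document.QuerySelectorAll("a")
            .Select(anchor => anchor.GetAttribute("href"))
            .Where(href => !String.IsNullOrEmpty(href))
            .Distinct()
            .ToArray();
    }
}
```

Since GetAttribute yields the raw attribute value, relative URLs are returned as written in the source and are not resolved against the base URL.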
A single element is represented by the base interface IElement. But IElement may not expose the properties or methods we want from an anchor element (IHtmlAnchorElement). We could use LINQ to perform the cast. Or we include AngleSharp.Extensions for some convenience methods:
var anchorsWithLinq = document.QuerySelectorAll("a").OfType<IHtmlAnchorElement>();
var anchorsConvenient = document.QuerySelectorAll<IHtmlAnchorElement>("a");
Casting is one of the annoying things that makes working with the DOM less pleasant than with JavaScript. But it makes working with the DOM also much more reliable and stable.
An anchor element also implements IUrlUtilities. This associates a URL with the element. Of course we might want to navigate to this URL. But we do not have to get the URL, contact the browsing context and start the navigation. Instead we just call the Navigate method from the set of extensions.
var anchor = document.QuerySelector<IHtmlAnchorElement>("a");
if (anchor != null)
    document = await anchor.Navigate();
The check for null might be redundant, but it is better to use it. The QuerySelector method returns null if no such element could be found. The navigation method is, as expected, asynchronous. It returns the document, which has been the navigation target. Theoretically we could use context.Active, but for convenience we just reassign the document variable.
Manipulating the document is easy as well. We can either use the official DOM methods or convenient wrappers. These wrappers are sometimes similar to jQuery. Usually they work on a set of elements, given as an IEnumerable<T>, where T has to implement IElement.
var anchors = document.QuerySelectorAll<IHtmlAnchorElement>("a");
anchors.AddClass("my-anchor-class").Attr(new { foo = "bar" });
document.QuerySelector("body").ClassList.Add("cs-body-element");
There are also useful extensions for ordinary DOM operations. Most importantly the CreateElement method of the IDocument got a nice addition for C#. Usually this factory method just returns an IElement instance tailored to the requested element name, e.g.,
var div = document.CreateElement("div");
but using this approach may require an additional cast if we want to access some of the more specialized properties or methods. Also there may only be a single class implementing the DOM interface we are after. So for instance we could write the following to create a new HTML anchor element:
var newAnchor = document.CreateElement<IHtmlAnchorElement>();
There is no string required here. Overall this approach should be favored in C#, but only if there is only a single implementing class (usually the case for the more specialized interfaces) and therefore if the name is mapped 1:1 to the interface. In this case the tag name a is mapped 1:1 to IHtmlAnchorElement.
Creating a new element is only half of the story. As long as an element is not attached to the tree, it won't be integrated into queries and any kind of rendering. The AppendNode method can be chained, but only returns an INode instance. Lucky for us there is an AppendElement<T> method, returning the appended element along with its corresponding type.
This allows code such as the following to work.
document.Body.AppendElement(document.CreateElement<IHtmlAnchorElement>()).Href = "http://www.google.com";
Form elements can be constructed, manipulated and used in AngleSharp. Like in the browser the most important element is the IHtmlFormElement itself. Then we have a mixture of elements, with IHtmlInputElement, IHtmlButtonElement and IHtmlTextAreaElement, just to name a few. Of course the IHtmlInputElement may be the most used one. It is itself a host to many states, which are set by changing the Type property.
var input = document.CreateElement<IHtmlInputElement>();
input.Type = "hidden";
By default the Type is set to text. The type influences the behavior of some (especially validation) methods. AngleSharp implements the full suite of HTML5 input types, including the constraint validation model. This allows form validation as in the browser.
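As a rough sketch of the constraint validation, we could create an input of type email in an empty document and ask it whether a given value is acceptable. The helper name below is made up for illustration; the sketch assumes that CheckValidity, as known from the DOM, is available on IHtmlInputElement, and it deliberately makes no claim about the result for a particular value.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom.Html;
using AngleSharp.Extensions;

static class InputValidationSample
{
    // Hypothetical helper: validates a value against the rules of the email input type.
    public static async Task<Boolean> IsAcceptedAsEmailAsync(String value)
    {
        var context = BrowsingContext.New(Configuration.Default);

        // An empty document is sufficient, we only need an element factory.
        var document = await context.OpenNewAsync();
        var input = document.CreateElement<IHtmlInputElement>();

        // Switching the type changes the state and therefore the validation rules.
        input.Type = "email";
        input.Value = value;

        // Assumption: CheckValidity runs the HTML5 constraint validation.
        return input.CheckValidity();
    }
}
```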
With these basics let's see what the code for our example must look like.
Give Me the Code
We start by installing AngleSharp via NuGet. We right click on the project and select "Manage NuGet Packages ...". Then we search online for AngleSharp and click install. The package can also be found on the NuGet website.
Now we need to add some important namespaces. Most importantly we need the AngleSharp namespace, as this one contains the BrowsingContext, the Configuration and extensions for the IConfiguration. We also need the AngleSharp.Extensions namespace, since it contains useful helpers for working with the AngleSharp API. Finally we also need to add AngleSharp.Dom, or even more specialized, AngleSharp.Dom.Html. For this simple example only the latter is required.
using AngleSharp;
using AngleSharp.Dom.Html;
using AngleSharp.Extensions;
Now let's decide on the webserver. This could be entered by the user of the application, or given in some configuration file. We hardcode it in form of a global static readonly field.
static readonly String WebsiteUrl = "http://localhost:54361";
As the final step we need to fill out the blank spot in our ViewModel definition. We wanted to specify the action of the Submit RelayCommand.
Before we dissect the code we should have a glance at it. The code is not long, but it does a lot.
_submit = new RelayCommand(async () =>
{
    ChangeState(State.Loading);
    var configuration = Configuration.Default.WithDefaultLoader().WithCookies();
    var context = BrowsingContext.New(configuration);
    await context.OpenAsync(WebsiteUrl);
    await context.Active.QuerySelector<IHtmlAnchorElement>("a.log-in").Navigate();
    await context.Active.QuerySelector<IHtmlFormElement>("form").Submit(new
    {
        User = "User",
        Password = "secret"
    });
    await context.Active.QuerySelector<IHtmlAnchorElement>("a.secret-link").Navigate();
    Content = context.Active.QuerySelector("p").Text();
    ChangeState(State.Finished);
});
Let's recap what the code above does. There are actually quite a few steps and the fact that we are indeed able to do them all in a very short time (much below a second for local connections, e.g., 200ms on my machine with the debug build) is remarkable. Even more amazing is the speed of development. Implementing this solution may take less than 5 minutes.
- Load the landing page
- Navigate to the URL (href) of an anchor element with the class log-in
- Wait for the login page to be loaded
- Submit the form (the first / only form of the page) with the key-value-pairs provided in an anonymous object
- Wait for the form to be submitted and the response to be received
- Navigate to the URL (href) of an anchor element with the class secret-link
- Wait for the content page to be loaded
- Read the content (text) of the first / only paragraph on the page
What are the most important parts in the code? The right configuration definitely matters. Without the loader we do not have any HTTP requester. We would be screwed. Without cookies we cannot transport the authentication from one page to the next. Also we would not be able to verify the verification token (more on that later). Hence cookies are a must for login forms or forms, which are validated.
var configuration = Configuration.Default.WithDefaultLoader().WithCookies();
While the navigation process has been pretty much explained previously, we did not go into many details of form submission. There are many ways to do form submission in AngleSharp. The two, probably most popular, ways are:
- Iterate over the contained input elements, such as IHtmlInputElement or IHtmlTextAreaElement, and fill out Value if the Name is matched.
- Use a helper method to deliver an IEnumerable of key-value-pairs, which carry corresponding name-value-pairs. Or use the helper to provide an anonymous object, which will be transformed to such a dictionary.
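The first way is not used in our demo, so here is a minimal sketch of how it could look. The helper name and the dictionary based lookup are assumptions made for this illustration; only input elements are considered, and the form is submitted without any additional fields afterwards.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom.Html;
using AngleSharp.Extensions;

static class ManualFormFillSample
{
    // Hypothetical helper: fills matching inputs by hand and submits the first form.
    public static async Task FillAndSubmitAsync(IBrowsingContext context, IDictionary<String, String> values)
    {
        var form = context.Active.QuerySelector<IHtmlFormElement>("form");

        if (form == null)
            return; // no form on the current page, nothing to submit

        foreach (var input in form.QuerySelectorAll<IHtmlInputElement>("input"))
        {
            String value;

            // Only touch inputs whose name has a corresponding entry.
            if (input.Name != null && values.TryGetValue(input.Name, out value))
                input.Value = value;
        }

        // Hidden fields that were not touched are sent with their original values.
        await form.Submit();
    }
}
```

For the demo page the dictionary would contain the keys User and Password. This variant is more verbose, but it also works for input names that are not valid C# identifiers.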
In our example we use the latter. Our input names are well suited for using the anonymous object approach. If they were more exotic, they might not be valid C# identifiers and we could not use this approach. We end up with a pretty short line of code that does everything from selecting the (hopefully right!) form, to filling it out and submitting it.
context.Active.QuerySelector<IHtmlFormElement>("form").Submit(new
{
    User = "User",
    Password = "secret"
});
These steps do everything we need to gather the data from the homepage without knowing the exact URL of the login. We only need to know some selectors to choose the right elements and navigate to their URL. We also need to know the input fields (names and in this case demanded values). All that can be investigated step by step.
As far as the website is concerned there is nothing special. We use the standard forms authentication model. The following code snippet illustrates the most important actions of the HomeController class. Since this is a simple demo we do not use a database or any advanced authentication mechanism. We just check if the provided credentials match the expected one (there is only one).
Most importantly the information behind the Secret action is protected. We use a standard AuthorizeAttribute to hand over the responsibility of authentication checking to the framework.
[HttpGet]
public ViewResult LogIn()
{
    return View();
}

[HttpPost]
[ValidateAntiForgeryToken]
public ActionResult LogIn(LogInModel model)
{
    if (model.User == "User" && model.Password == "secret")
    {
        FormsAuthentication.SetAuthCookie(model.User, false);
        return RedirectToAction("Index");
    }

    return View(model);
}

[HttpGet]
public RedirectToRouteResult LogOut()
{
    FormsAuthentication.SignOut();
    return RedirectToAction("LogIn");
}

[HttpGet]
[Authorize]
public ViewResult Secret()
{
    return View();
}
Another thing that is important to realize is the use of an anti-forgery-token for evaluating the login action. Obviously we need to load the login page explicitly. If we sent the form data directly to the server we would miss the generated anti-forgery-token. Hence the ValidateAntiForgeryTokenAttribute would yield a negative result upon evaluation. As a result we would not be able to log in. This is, of course, unwanted. It is another reason to use AngleSharp from the beginning with a valid BrowsingContext.
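To see the token in action we could load the login page and look at the hidden field that ASP.NET MVC renders into the form, which carries the name __RequestVerificationToken. The following sketch is only meant for inspection; the helper name and the loginUrl parameter are assumptions, as the exact address of the login page is not part of this article.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom.Html;
using AngleSharp.Extensions;

static class AntiForgeryTokenSample
{
    // Hypothetical helper: returns the token value of the login form, or null.
    public static async Task<String> ReadTokenAsync(String loginUrl)
    {
        // Cookies are required, since the token is verified against a cookie.
        var configuration = Configuration.Default.WithDefaultLoader().WithCookies();
        var context = BrowsingContext.New(configuration);
        var document = await context.OpenAsync(loginUrl);

        // The hidden input is part of the form and is therefore submitted with it.
        var tokenField = document.QuerySelector<IHtmlInputElement>(
            "input[name=__RequestVerificationToken]");

        return tokenField != null ? tokenField.Value : null;
    }
}
```

In the demo we never have to read this value ourselves. Submitting the form through AngleSharp sends the hidden field along with the fields we filled out.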
Points of Interest
I've presented a more complete variant of this demo at some conferences and user group meetings. The first talk was given at the Developer Week 2015. You can find the original samples on GitHub. If you are interested in the presentation then have a look at the slides on my page.
The reaction of the audience is always quite enthusiastic, even though I realize that demos, which show the connection to JavaScript engines, are more popular. I believe that correct form submission and HTTP handling is essential for any HTML tool. In the end most of the HTML code we are interested in will come from servers. Communicating with these servers should be possible without installing other libraries or providing custom implementations.
Why is the HTTP requester from the AngleSharp core library so limited? Personally I would love to boost this functionality, but since AngleSharp is deployed as a PCL (profile 259), we cannot access platform specific functionality. Luckily the used PCL profile comes with an HTTP requester (WebRequest and derived, HttpWebRequest). This feature makes basic HTTP requests from the core library possible. Nevertheless, the provided requester has some platform-dependent hiccups and some platform-independent flaws. For instance we cannot accept the certificate for some HTTPS connections.
In the future there will be a library (called AngleSharp.Io), which will deliver a better solution. This one, however, will naturally have stronger dependencies than the PCL. These libraries will all be part of the AngleSharp GitHub organization.
History
- v1.0.0 | Initial Release | 08.08.2015
- v1.0.1 | Added some links | 11.08.2015
- v1.0.2 | Fixed some typos | 12.08.2015