Introduction
Usually a website provides an RSS feed that we can parse to get the latest news. However, some websites do not provide one, so we have to parse the site's HTML directly.
Using the Code
First of all, we should add a reference to the Html Agility Pack (http://htmlagilitypack.codeplex.com/). You can install it from NuGet in Visual Studio.
P.S.: If you are working on Windows Phone, that DLL causes some problems, so you must also add these two DLL files: System.Net.Http and System.Xml.XPath. You can find them on NuGet as well.
We create a new function that takes the URL of the website you want to parse as its parameter, and call it like this:
Parsing("http://www.mytek.tn/");
Then, we send a request to the website to download the HTML of the page:
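// Requires: using System.Net.Http; using System.Text; using HtmlAgilityPack;
HttpClient http = new HttpClient();
byte[] response = await http.GetByteArrayAsync(website);
// Decode the whole byte array (Length, not Length - 1) with the page's charset.
string source = Encoding.GetEncoding("utf-8").GetString(response, 0, response.Length);
// Entities stay encoded here: decoding the full page first can turn &lt; into
// real markup and corrupt the parse. Decode the extracted text instead.
HtmlDocument resultat = new HtmlDocument();
resultat.LoadHtml(source);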
P.S.: Pay attention to the encoding, because each website declares its own. This example uses utf-8; you can find it in the charset attribute in the website's HTML.
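Since the charset differs from site to site, the decoding step can be sketched as a small helper. This is only a sketch: the names CharsetHelper, Resolve and Decode are hypothetical, and falling back to UTF-8 when the charset is missing or unknown is an assumption, not something the sample does.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original sample.
static class CharsetHelper
{
    // Maps a charset name (for example "utf-8") to an Encoding.
    // Assumption: fall back to UTF-8 when the name is missing or unknown.
    public static Encoding Resolve(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;
        try
        {
            // Header values are sometimes quoted, so strip spaces and quotes.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return Encoding.UTF8;
        }
    }

    // Decodes the complete byte array with the resolved encoding.
    public static string Decode(byte[] bytes, string charset)
    {
        return Resolve(charset).GetString(bytes, 0, bytes.Length);
    }
}
```

The charset name can be the value you read from the page's HTML. If you fetch the page with GetAsync instead of GetByteArrayAsync, it may also be available from response.Content.Headers.ContentType.CharSet when the server sends it.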
After that, we inspect the element that we want to parse and get its id or class, so that we can retrieve it easily.
As you can see in the picture, we want to parse the information about these devices, which are all wrapped in a ul, but first we must find the ancestor div that has an id or a class. In this example, the div has a class named block_content.
So now we filter the HTML down to the content of this div, then get all the li tags that contain the information we want.
// Keep only the <div> elements whose class attribute contains "block_content".
List<HtmlNode> toftitle = resultat.DocumentNode.Descendants()
    .Where(x => x.Name == "div" && x.Attributes["class"] != null &&
                x.Attributes["class"].Value.Contains("block_content"))
    .ToList();
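Note that Contains("block_content") is a substring test, so it would also match a class such as block_content_old (a made-up name used here only for illustration). If you want exact class names, a stricter filter might look like the sketch below. ClassFilterDemo and DivsWithClass are hypothetical names, and whole-token matching is an assumption about what you need.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original sample.
static class ClassFilterDemo
{
    // Returns every <div> whose class list contains className as a whole token.
    public static List<HtmlNode> DivsWithClass(HtmlDocument doc, string className)
    {
        return doc.DocumentNode.Descendants("div")
            .Where(d => d.GetAttributeValue("class", "")
                         .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                         .Contains(className))
            .ToList();
    }
}
```

With the document from this article, it would be called as ClassFilterDemo.DivsWithClass(resultat, "block_content").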
After each filter, it is a good idea to set a breakpoint and inspect the result to verify your work.
As a result, we get 11 divs that have a class named block_content, so you should verify which one contains the information that we want. In this example, it is the item at index 6 (toftitle[6]).
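// toftitle[6] is the block that holds the list we want on this page.
var li = toftitle[6].Descendants("li").ToList();
foreach (var item in li)
{
    // FirstOrDefault() yields null instead of throwing when a tag is missing.
    var link = item.Descendants("a").Select(a => a.GetAttributeValue("href", null)).FirstOrDefault();
    var img = item.Descendants("img").Select(i => i.GetAttributeValue("src", null)).FirstOrDefault();
    var title = item.Descendants("h5").Select(h => h.InnerText).FirstOrDefault();
}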
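The loop above only reads the three values into local variables. One way to keep them is to collect them in a list, as in the following sketch. The Product class, the ProductExtractor name and the baseUri parameter are hypothetical. Resolving relative links against the site address and decoding entities in the title with HtmlEntity.DeEntitize are assumptions about what you want from the result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical container for one parsed list item.
class Product
{
    public string Link { get; set; }
    public string Image { get; set; }
    public string Title { get; set; }
}

// Hypothetical helper, not part of the original sample.
static class ProductExtractor
{
    // Reads link, image and title from every <li> inside the given block.
    public static List<Product> Extract(HtmlNode block, Uri baseUri)
    {
        var products = new List<Product>();
        foreach (var item in block.Descendants("li"))
        {
            var a = item.Descendants("a").FirstOrDefault();
            var img = item.Descendants("img").FirstOrDefault();
            var h5 = item.Descendants("h5").FirstOrDefault();
            if (a == null || h5 == null)
                continue; // skip list items without a link or a title

            string href = a.GetAttributeValue("href", "");
            string src = img == null ? "" : img.GetAttributeValue("src", "");
            products.Add(new Product
            {
                // new Uri(base, relative) turns "/p/1" into an absolute address.
                Link = new Uri(baseUri, href).ToString(),
                Image = src.Length == 0 ? "" : new Uri(baseUri, src).ToString(),
                // InnerText keeps entities such as &amp; so decode them here.
                Title = HtmlEntity.DeEntitize(h5.InnerText).Trim()
            });
        }
        return products;
    }
}
```

With the variables from this article, the call would be ProductExtractor.Extract(toftitle[6], new Uri(website)).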
Inside each li item, we get the link, the image and the title.
Descendants allows you to get all tags with the specified name inside the item.
GetAttributeValue allows you to get an attribute of the tag.
InnerText allows you to get the text between the tags.
InnerHtml allows you to get the HTML inside the tag.
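To see how these four members differ, here is a small sketch that runs them on an inline HTML string. NodeMembersDemo and Inspect are hypothetical names, and the markup in the usage note is made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical demo, not part of the original sample.
static class NodeMembersDemo
{
    // Returns { href, title-or-default, InnerText, InnerHtml } of the first <a>.
    public static string[] Inspect(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants("a") walks the whole tree below the node.
        HtmlNode a = doc.DocumentNode.Descendants("a").First();

        return new[]
        {
            a.GetAttributeValue("href", "(none)"),  // value of an existing attribute
            a.GetAttributeValue("title", "(none)"), // missing attribute: default is returned
            a.InnerText,                            // text only, child tags stripped
            a.InnerHtml                             // markup between <a> and </a>
        };
    }
}
```

For example, with the input `<ul><li><a href="/p/1"><b>Phone</b> X</a></li></ul>`, the href is /p/1, the missing title falls back to (none), InnerText is the plain text Phone X, and InnerHtml still contains the b tag.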
History
The difficulty of parsing HTML depends on the structure of the website.