Introduction
How can we get content from a website?
We can do it in one of three ways:
1. Open the website in a browser engine, i.e. the standard WebBrowser control or some third-party engine (here is an article about WebBrowser and third-party engines), and get the content of a DOM element on the page.
2. Download the HTML content via System.Net.WebClient and then parse it with String.IndexOf()/Substring(), regular expressions, or the HtmlAgilityPack library.
3. Use the website's API (if one exists): send a query to the API and get a response, also using System.Net.WebClient or other System.Net classes.
Way 1 - Via A Browser Engine
For example, suppose we have a weather website with the following HTML content:
<html>
  <head><title>Weather</title></head>
  <body>
    City: <div id="city">Monte-Carlo</div>
    Precipitation:
    <div id="precip">
      <img src="/rain.jpg" />
    </div>
    Day temperature: <div class="t">20 C</div>
    Night temperature: <div class="t">18 C</div>
  </body>
</html>
Tip: if you don't have internet access or can't reach my site (or want to create your own), you can navigate to a local *.html file with this HTML content.
Let's get the city name (i.e. Monte-Carlo).
You create a WebBrowser (programmatically or in the form designer), navigate to the website, and when the page has loaded (in the DocumentCompleted event; make sure the page really is fully loaded), you get the DOM element (the first div) by its id "city" and read its inner text ("Monte-Carlo"):
var divCity = webBrowser1.Document.GetElementById("city");
var city = divCity.InnerText;
label1.Text = "City: " + city;
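Here is a minimal sketch of how this wiring might look in a form (control names such as webBrowser1 and label1 are taken from the snippets in this article; adjust to your own form):
public Form1()
{
    InitializeComponent();
    webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
    webBrowser1.Navigate("http://csharp-novichku.ucoz.org/pagetoparse.html");
}

private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // DocumentCompleted also fires for frames; react only to the top-level document
    if (e.Url != webBrowser1.Url) return;

    var divCity = webBrowser1.Document.GetElementById("city");
    if (divCity != null)
    {
        label1.Text = "City: " + divCity.InnerText;
    }
}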
Next, let's get the precipitation image link and show it in a PictureBox:
var divPrecip = webBrowser1.Document.GetElementById("precip");
var img = divPrecip.Children[0];
var imgSrc = img.GetAttribute("src");
pictureBox1.ImageLocation = imgSrc;
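Depending on the document mode, GetAttribute("src") may return the relative path from the markup ("/rain.jpg") rather than an absolute URL. If that happens, you can resolve it against the current page address; a small sketch:
// Resolve a site-relative path like "/rain.jpg" against the page URL
var absoluteSrc = new Uri(webBrowser1.Url, imgSrc).ToString();
pictureBox1.ImageLocation = absoluteSrc;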
Lastly, let's get the day and night temperatures. The WebBrowser's document object has no GetElementsByClassName() method, so we first define a small helper:
private HtmlElement[] GetElementsByClassName(WebBrowser wb, string tagName, string className)
{
    var l = new List<HtmlElement>();
    // The IE-based WebBrowser exposes the HTML "class" attribute as "className"
    var els = wb.Document.GetElementsByTagName(tagName);
    foreach (HtmlElement el in els)
    {
        if (el.GetAttribute("className") == className)
        {
            l.Add(el);
        }
    }
    return l.ToArray();
}
var divsTemp = GetElementsByClassName(webBrowser1, "div", "t");
var divDayTemp = divsTemp[0];
var dayTemp = divDayTemp.InnerText;
label2.Text = "Day temperature: " + dayTemp;
var divNightTemp = divsTemp[1];
var nightTemp = divNightTemp.InnerText;
label3.Text = "Night temperature: " + nightTemp;
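The inner text here is a string like "20 C". If you need the numeric value, you can parse it; a sketch assuming the "<number> C" format from the sample page:
// "20 C" -> 20 (will throw on any other format)
int dayDegrees = int.Parse(dayTemp.Split(' ')[0]);
int nightDegrees = int.Parse(nightTemp.Split(' ')[0]);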
Way 2 - Via WebClient And HtmlAgilityPack
You can download the full HTML content of a page via System.Net.WebClient:
using System.Net;

string HTML;
using (var wc = new WebClient())
{
    HTML = wc.DownloadString("http://csharp-novichku.ucoz.org/pagetoparse.html");
}
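One caveat: DownloadString decodes the response using the headers the server sends, falling back to WebClient.Encoding (the system ANSI code page by default), so a UTF-8 page without a charset header can come back garbled. If you know the encoding, set it explicitly; assuming UTF-8:
using (var wc = new WebClient())
{
    wc.Encoding = System.Text.Encoding.UTF8; // assumption: the page is UTF-8
    HTML = wc.DownloadString("http://csharp-novichku.ucoz.org/pagetoparse.html");
}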
Then you can parse it with the third-party HtmlAgilityPack library, much as you would with the browser engine:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(HTML);
label1.Text = "City: " + doc.GetElementbyId("city").InnerText;
Note that HtmlAgilityPack does NOT support all of the WebBrowser engine's methods! For example, there is no GetElementsByTagName() method; you have to build such lookups yourself, as sketched below.
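For example, the tag-and-class lookup from Way 1 can be expressed with HtmlAgilityPack's Descendants() and GetAttributeValue() methods, or with an XPath query; a sketch:
using System.Linq;

// Equivalent of GetElementsByClassName(webBrowser1, "div", "t") from Way 1
var divsTemp = doc.DocumentNode
                  .Descendants("div")
                  .Where(d => d.GetAttributeValue("class", "") == "t")
                  .ToArray();

label2.Text = "Day temperature: " + divsTemp[0].InnerText;
label3.Text = "Night temperature: " + divsTemp[1].InnerText;

// The same via XPath:
var divsTempXPath = doc.DocumentNode.SelectNodes("//div[@class='t']");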
Way 3 - Via Website API
To be continued...
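Until that section is written, here is a minimal sketch of the idea. The endpoint URL, parameter name, and response format below are invented for illustration; a real API defines its own:
using System.Net;

using (var wc = new WebClient())
{
    // Hypothetical endpoint: GET http://example.com/api/weather?city=Monte-Carlo
    wc.QueryString.Add("city", "Monte-Carlo");
    string response = wc.DownloadString("http://example.com/api/weather");
    // The response is already structured data (e.g. JSON), so no HTML
    // parsing is needed; process it according to the API's documentation.
    label1.Text = response;
}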
Which way is better?
A website API is usually the most convenient and lightweight way. But it usually comes with limitations, mainly for security reasons.
The WebBrowser way is the easiest way if the website has no API. It also simulates user actions very naturally, which sometimes lets you bypass a website's anti-bot protection. And it is the only way if the site's content is loaded by JavaScript, because WebClient cannot execute JavaScript.
The WebClient way is faster and usually more reliable than the WebBrowser way.