Introduction
This article deals with two major issues in automatic web data extraction:
- How to use the WinHTTP 5 library for crawling (reading Web data).
- How to take the raw HTML returned by WinHTTP and instantiate a DOM from it.
At the end, we discuss how to build a recursive crawler.
Background
The WinHTTP library follows the HTTP 1.0/1.1 model, which is based on a persistent (keep-alive) connection: we first connect to a web server and then request documents from it. Subsequent requests to the same web server (hostname, in our case) do not involve breaking and re-making the connection. We discuss here how to extract the HTML data given a URL string. The main problem I ran into is that for crawling you may be handed a long URL, and it has to be broken up into a hostname (for the connection) and the remaining URL path (for the request). You might say the WinHttpCrackUrl function does this job, but it does not quite: it gives you the correct URL path, yet not a hostname you can use to connect to the server, because the returned host-name pointer points into the original URL string and is not null-terminated, so it drags the rest of the URL along with it.
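As an aside, WinHttpCrackUrl can also copy each component into a caller-supplied buffer, in which case the components do come back null-terminated and no trimming is needed. Below is a minimal sketch of that alternative; the buffer sizes are arbitrary and the URL is the example used later in this article.
// A hedged alternative: supply our own buffers so WinHttpCrackUrl copies the
// components out null-terminated instead of pointing into the original URL.
URL_COMPONENTS uc;
WCHAR szHost[256]   = L"";
WCHAR szPath[1024]  = L"";
WCHAR szExtra[1024] = L"";
ZeroMemory(&uc, sizeof(uc));
uc.dwStructSize   = sizeof(uc);
uc.lpszHostName   = szHost;   uc.dwHostNameLength  = 256;
uc.lpszUrlPath    = szPath;   uc.dwUrlPathLength   = 1024;
uc.lpszExtraInfo  = szExtra;  uc.dwExtraInfoLength = 1024;
LPCWSTR url = L"http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq";
if (WinHttpCrackUrl(url, 0, 0, &uc))
{
    // szHost now holds just "news.yahoo.com"; the request target is the
    // path followed by the query string (szPath then szExtra).
    wprintf(L"host: %s  request: %s%s\n", szHost, szPath, szExtra);
}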
For operating on the DOM, the most widely used interface family is IHTMLDocument, but an object of this type is usually instantiated and populated by the Browser object (via its get_document method). The issue here is how to populate such an object with the plain-text HTML we get from WinHTTP.
These two steps go a long way toward laying the foundations of a tool that can crawl the web and operate on DOM models of web pages, rather than doing the plain string post-processing that most tools do. A similar result can be obtained by invoking the Navigate method on Internet Explorer and analyzing the resulting DOM, but it is easy to see how inefficient it would be to load the whole document (including images) and render it in the browser just to get at the DOM.
Why WinHTTP when there is WinInet
A traditional developer might ask why use WinHTTP when we have WinInet, which Microsoft promotes for HTTP (as well as FTP and Gopher) client applications. But WinInet poses a serious stumbling block to full automation: for authentication and some other operations it displays a user interface. WinHTTP, however, handles these operations programmatically.
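For instance, if a crawled server demands credentials, WinHTTP lets you attach them to the request handle directly instead of popping up a dialog. A minimal sketch, assuming hRequest is an open request handle (as created later in this article) and using placeholder credentials:
// Hypothetical sketch: supply credentials programmatically on an open request
// handle. "myuser"/"mypass" are placeholders for whatever the server expects.
if (!WinHttpSetCredentials( hRequest,
                            WINHTTP_AUTH_TARGET_SERVER,
                            WINHTTP_AUTH_SCHEME_BASIC,
                            L"myuser", L"mypass", NULL))
    printf("Error %u in WinHttpSetCredentials.\n", GetLastError());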
How the Program looks
For example, if the URL to be traversed is http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq, then WinHTTP expects to connect to news.yahoo.com, the hostname of the web server, and then issue a request for /fc?tmpl=fc&cid=34&in=world&cat=iraq.
Given below is a complete description of how to take the URL, split it (using WinHttpCrackUrl), adjust the result afterwards because the cracking does not give us exactly what we want, and then feed this data to the WinHTTP calls. After that, the data extraction comes into the picture: we connect to the server, send the request, and query the size of the data available with WinHttpQueryDataAvailable. The catch is that we do not get all the data of a web page in one shot, so we keep appending the chunks returned by WinHttpReadData to a buffer and have the complete page only when all the data has been read (indicated by the available data size dropping to zero). This is exactly how the equivalent URLReader class in Java works. Given below is the complete code to do this, with comments at each step.
USES_CONVERSION;

// The URL to crawl (here, the example URL from above).
LPCWSTR varURL = L"http://news.yahoo.com/fc?tmpl=fc&cid=34&in=world&cat=iraq";

// Ask WinHttpCrackUrl to locate every component inside the URL string.
// With NULL pointers and lengths of -1, the returned pointers point into
// varURL itself and are therefore not null-terminated.
URL_COMPONENTS urlComp;
ZeroMemory(&urlComp, sizeof(urlComp));
urlComp.dwStructSize = sizeof(urlComp);
urlComp.dwSchemeLength = -1;
urlComp.dwHostNameLength = -1;
urlComp.dwUrlPathLength = -1;
urlComp.dwExtraInfoLength = -1;
if (!WinHttpCrackUrl( varURL, wcslen(varURL), 0, &urlComp))
{
    printf("Error %u in WinHttpCrackUrl.\n", GetLastError());
}
// Because the cracked components are not null-terminated, myhostname actually
// contains the hostname followed by the rest of the URL. Trim it by cutting
// everything from the point where the URL path begins.
String myhostname(W2T(urlComp.lpszHostName));
String myurlpath(W2T(urlComp.lpszUrlPath));
int strindex = myhostname.IndexOf(myurlpath);
String newhostname(myhostname.SubString(0,strindex));
strindex = 0;
DWORD dwSize = 0;
DWORD dwDownloaded = 0;
LPSTR pszOutBuffer;
BOOL bResults = FALSE;
HINTERNET hSession = NULL,
          hConnect = NULL,
          hRequest = NULL;

// Obtain a WinHTTP session handle.
hSession = WinHttpOpen( L"WinHTTP Example/1.0",
                        WINHTTP_ACCESS_TYPE_DEFAULT_PROXY,
                        WINHTTP_NO_PROXY_NAME,
                        WINHTTP_NO_PROXY_BYPASS, 0);
// Connect to the web server using the trimmed hostname.
if (hSession)
    hConnect = WinHttpConnect( hSession, T2W(newhostname),
                               INTERNET_DEFAULT_HTTP_PORT, 0);

// Open a GET request for the URL path (which, being a pointer into the
// original URL, still carries the query string with it).
if (hConnect)
    hRequest = WinHttpOpenRequest( hConnect, L"GET",
                                   urlComp.lpszUrlPath,
                                   NULL, WINHTTP_NO_REFERER,
                                   WINHTTP_DEFAULT_ACCEPT_TYPES,
                                   WINHTTP_FLAG_REFRESH);

// Send the request and wait for the response headers.
if (hRequest)
    bResults = WinHttpSendRequest( hRequest,
                                   WINHTTP_NO_ADDITIONAL_HEADERS, 0,
                                   WINHTTP_NO_REQUEST_DATA, 0,
                                   0, 0);
if (bResults)
    bResults = WinHttpReceiveResponse( hRequest, NULL);
String respage="";
if (bResults)
do
{
dwSize = 0;
if (!WinHttpQueryDataAvailable( hRequest, &dwSize))
printf("Error %u in WinHttpQueryDataAvailable.\n",
GetLastError());
pszOutBuffer = new char[dwSize+1];
if (!pszOutBuffer)
{
printf("Out of memory\n");
dwSize=0;
}
else
{
ZeroMemory(pszOutBuffer, dwSize+1);
if (!WinHttpReadData( hRequest,
(LPVOID)pszOutBuffer,
dwSize, &dwDownloaded))
printf("Error %u in WinHttpReadData.\n",
GetLastError());
else
respage.Append(pszOutBuffer);
delete [] pszOutBuffer;
}
} while (dwSize>0);
When we are done with this, we have the HTML page as a string in the respage buffer. The aim now is to get a DOM model of it, so that we can operate on the data programmatically: query nodes, access particular elements, and so on. The best way to do DOM manipulation is through the Microsoft-provided interfaces IHTMLDocument, IHTMLDocument2, IHTMLDocument3 and IHTMLDocument4. The following code takes the data from that buffer and builds an IHTMLDocument2 out of it. We can then use its methods (get_body on the document, get_innerHTML on the elements, and so on) to access the DOM, or query it for a related interface such as IHTMLDocument3 and walk the nodes of the DOM tree.
// Create an empty HTML document object.
IHTMLDocument2Ptr myDocument;
HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL,
    CLSCTX_INPROC_SERVER, IID_IHTMLDocument2, (void **)&myDocument);

// IHTMLDocument2::write expects a SAFEARRAY of VARIANTs holding BSTRs,
// so wrap the page text in a one-element array.
HRESULT hresult = S_OK;
VARIANT *param;
SAFEARRAY *tmpArray = SafeArrayCreateVector(VT_VARIANT, 0, 1);
_bstr_t bsData = (LPCTSTR) respage;
hresult = SafeArrayAccessData(tmpArray, (LPVOID*) &param);
param->vt = VT_BSTR;
param->bstrVal = bsData.copy();   // the array gets its own copy of the string
hresult = SafeArrayUnaccessData(tmpArray);

// Feed the HTML to the document; it parses the text and builds the DOM.
hresult = myDocument->write(tmpArray);

// SafeArrayDestroy also frees the BSTR held inside the VARIANT.
if (tmpArray != NULL) {
    SafeArrayDestroy(tmpArray);
}
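Once the document has been populated, pulling content back out of it is straightforward. A minimal sketch (error handling kept to a minimum) that fetches the inner HTML of the body element through IHTMLDocument2 and IHTMLElement might look like this:
// Hypothetical usage sketch: read the body's inner HTML back out of the DOM.
IHTMLElement *pBody = NULL;
if (SUCCEEDED(myDocument->get_body(&pBody)) && pBody != NULL)
{
    BSTR bstrInner = NULL;
    if (SUCCEEDED(pBody->get_innerHTML(&bstrInner)) && bstrInner != NULL)
    {
        // bstrInner now holds the markup inside <body> ... </body>.
        SysFreeString(bstrInner);
    }
    pBody->Release();
}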
Further Enhancements
I have highlighted the basics of crawling here; the complete crawler design is left to the reader's discretion. For a complete crawler we need to extract the links from a given web page and then extract data from those links. Traditional tools do string processing to look for anchor or href tags and pull out the hyperlink strings, which is obviously inefficient because all the page data has to be parsed. Querying the DOM for that purpose is much more efficient: we can simply look for all the anchor nodes and read their href attributes. Making a web-site grabber is very easy with the code I have given above. You can use the get_anchors method of IHTMLDocument2 to get the hyperlinks from a page (see the sketch below) and then recursively call the code above after implementing proper checks for link loops. Such a program can crawl all the hyperlink-accessible pages from a given base URL, down to any number of levels.
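As one possible shape of that link-extraction step, here is a hedged sketch that, assuming myDocument is the IHTMLDocument2 populated earlier, walks the anchors collection and reads each hyperlink's href (error handling kept minimal):
// Hypothetical sketch: enumerate the anchors collection of the populated
// document and collect each hyperlink for the next level of the crawl.
IHTMLElementCollection *pAnchors = NULL;
if (SUCCEEDED(myDocument->get_anchors(&pAnchors)) && pAnchors != NULL)
{
    long count = 0;
    pAnchors->get_length(&count);
    for (long i = 0; i < count; i++)
    {
        VARIANT vIndex, vEmpty;
        VariantInit(&vIndex); VariantInit(&vEmpty);
        vIndex.vt = VT_I4; vIndex.lVal = i;

        IDispatch *pDisp = NULL;
        if (SUCCEEDED(pAnchors->item(vIndex, vEmpty, &pDisp)) && pDisp != NULL)
        {
            IHTMLAnchorElement *pAnchor = NULL;
            if (SUCCEEDED(pDisp->QueryInterface(IID_IHTMLAnchorElement,
                                                (void **)&pAnchor)))
            {
                BSTR bstrHref = NULL;
                if (SUCCEEDED(pAnchor->get_href(&bstrHref)) && bstrHref != NULL)
                {
                    // bstrHref is a candidate URL to feed back into the
                    // WinHTTP code above (after duplicate/loop checks).
                    SysFreeString(bstrHref);
                }
                pAnchor->Release();
            }
            pDisp->Release();
        }
    }
    pAnchors->Release();
}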