This article presents a utility that lets you retrieve raw information from web servers using HTTP's GET and POST commands.
Description
This utility is just a wrapper around reusable functions that allow programmatic access to the web through a sort of 'mini-browser' embedded inside your program.
There are many uses for such code. Programs that look at a series of web pages, much like a user surfing from one page to the next, are often called spiders, bots, or crawlers. Such programs are often used to catalog websites, import external data from the Web, or simply to send commands to a web server. You could extend the functionality of the classes presented here to retrieve information from the Internet in a variety of ways.
There are many third-party DLLs and solutions which retrieve data from websites. The functions presented in this article are totally self-contained. There is no reliance on WinInet, Internet Explorer, Netscape, or any requirement that similar software be installed, apart from WinSock. WinSock is an integral part of the Windows TCP/IP stack and is available on any computer capable of running a browser.
Internet protocols are documented in RFC (Request For Comments) documents. HTTP/1.0 is documented in RFC1945. Additionally, RFC1630, RFC1738 and RFC1808 document the format of a URL.
A complete set of RFCs can be found at http://www.rfc-editor.org.
Implementation
The engine of the utility is in the Request class. The key function is SendHTTP(). This function accepts five parameters and returns an integer. The first parameter is the URL to POST to or GET from. The second parameter specifies any additional HTTP headers to be passed during this request. The third and fourth parameters specify the data to post and its length. The fifth parameter is a pointer to an HTTPRequest structure that will hold the headers and messages sent and returned by the web server. SendHTTP() returns 0 if the POST or GET was successful, otherwise 1 to indicate an error.
SendHTTP() begins by parsing the URL string. A URL is an address that specifies the exact location of a resource on the Internet. A URL has several parts, some of which are optional. An example of a URL would be:
http://www.codetools.com:80/index.html
The first part of the URL is the protocol, which specifies how to retrieve the resource. Following the protocol is the host name. This can be either a domain name or an IP address. Following the host is an optional port number. Every protocol has a default port number to be used if no port is specified. The default HTTP port is port 80. Following the port is the request being made of the specified web server. If not specified, it defaults to just '/', which requests the root document of the web server.
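The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not the code used inside SendHTTP(): the ParsedUrl structure and the ParseUrl() function are hypothetical names, and the sketch assumes a plain "protocol://host[:port][/request]" URL.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper type; not part of the article's Request class.
struct ParsedUrl {
    std::string protocol;  // e.g. "http"
    std::string host;      // domain name or IP address
    int         port;      // 80 when the URL names no port
    std::string request;   // "/" when the URL names no path
};

// Splits "protocol://host[:port][/request]" into its parts.
// Returns false if the "://" separator or the host is missing.
bool ParseUrl(const std::string& url, ParsedUrl& out)
{
    const std::string::size_type npos = std::string::npos;

    const std::string::size_type schemeEnd = url.find("://");
    if (schemeEnd == npos)
        return false;
    out.protocol = url.substr(0, schemeEnd);

    // Everything between "://" and the next '/' is "host" or "host:port".
    const std::string::size_type hostStart = schemeEnd + 3;
    const std::string::size_type pathStart = url.find('/', hostStart);
    const std::string hostPort = (pathStart == npos)
        ? url.substr(hostStart)
        : url.substr(hostStart, pathStart - hostStart);

    // The request defaults to the root document.
    out.request = (pathStart == npos) ? "/" : url.substr(pathStart);

    const std::string::size_type colon = hostPort.find(':');
    if (colon == npos) {
        out.host = hostPort;
        out.port = 80;  // default HTTP port
    } else {
        out.host = hostPort.substr(0, colon);
        out.port = std::atoi(hostPort.substr(colon + 1).c_str());
    }
    return !out.host.empty();
}
```

For the example URL above, this sketch yields the protocol "http", the host "www.codetools.com", the port 80 and the request "/index.html".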
Next, SendHTTP() initializes the WinSock library by calling WinSock's WSAStartup(). After establishing a socket connection, SendHTTP() transmits a request to the server. There are two forms of HTTP requests. The first, and simpler form, is the HTTP GET.
An HTTP GET does not send any additional information to the web server other than the request headers and the URL. An HTTP GET often uses the URL itself to send additional information:
http://localhost/projects/HTTP/TestGet.asp?name=fred&age=22
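To make the shape of a GET request concrete, here is a small sketch of how the request text could be assembled before it is written to the socket. BuildGetRequest() is a hypothetical helper, not a function from the Request class, and the choice of HTTP/1.0 with a Host header is an assumption made for illustration.

```cpp
#include <string>

// Hypothetical helper: builds the text of an HTTP/1.0 GET request.
// 'request' is the path plus optional query string,
// e.g. "/projects/HTTP/TestGet.asp?name=fred&age=22".
// 'extraHeaders' holds zero or more complete header lines,
// each already terminated by "\r\n".
std::string BuildGetRequest(const std::string& host,
                            const std::string& request,
                            const std::string& extraHeaders)
{
    std::string text = "GET " + request + " HTTP/1.0\r\n";
    text += "Host: " + host + "\r\n";
    text += extraHeaders;
    text += "\r\n";  // a blank line ends the headers; a GET has no body
    return text;
}
```

All of the parameters (name=fred and age=22) travel inside the first line of the request, which is why a GET needs nothing after the blank line.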
The second form, an HTTP POST, sends data along with the request, separate from the URL.
Usually, an HTTP POST includes the header:
Content-Type:application/x-www-form-urlencoded
Without this header, some web servers (particularly ASP running on IIS) will not recognize your parameters. An HTTP POST has 2 parts. The first is the HTTP headers, just as in the GET. The headers contain the actual request and additional pieces of information. Unlike a GET, a POST contains data after the headers (separated from them by a blank line).
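The two-part layout of a POST can be sketched in the same way. BuildPostRequest() is again a hypothetical helper rather than the article's own code; it assumes the data is already form-encoded and adds a Content-Length header so the server knows how many bytes follow the blank line.

```cpp
#include <string>

// Hypothetical helper: builds the text of an HTTP/1.0 POST request.
// 'postData' is assumed to be form-encoded already, e.g. "name=fred&age=22".
std::string BuildPostRequest(const std::string& host,
                             const std::string& request,
                             const std::string& postData)
{
    // Part 1: the headers, starting with the request line.
    std::string text = "POST " + request + " HTTP/1.0\r\n";
    text += "Host: " + host + "\r\n";
    text += "Content-Type: application/x-www-form-urlencoded\r\n";
    text += "Content-Length: " + std::to_string(postData.size()) + "\r\n";

    // A blank line separates the headers from the data.
    text += "\r\n";

    // Part 2: the posted data.
    text += postData;
    return text;
}
```

Compared with a GET, the parameters have moved out of the URL and into the section after the blank line.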
After the web server receives the GET or POST request, it sends back a response. The response has 2 parts: headers followed by data (with a blank line separating the two).
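Separating the two parts of a response comes down to finding the first blank line. The following sketch shows one way to do it; SplitResponse() is a hypothetical name, and the sketch assumes the whole response has already been read from the socket into a single string.

```cpp
#include <string>

// Hypothetical helper: splits a raw HTTP response at the first blank line.
// Returns false if no blank line is found; in that case 'headers' receives
// the whole response and 'body' is left empty.
bool SplitResponse(const std::string& response,
                   std::string& headers,
                   std::string& body)
{
    const std::string separator = "\r\n\r\n";  // end of last header + blank line
    const std::string::size_type pos = response.find(separator);
    if (pos == std::string::npos) {
        headers = response;
        body.clear();
        return false;
    }
    headers = response.substr(0, pos);
    body = response.substr(pos + separator.size());
    return true;
}
```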
The first line of the HTTP headers specifies the status of the request. It consists of the HTTP version, a numeric status code, and a short reason phrase. The status codes fall into the following ranges:
100-199 is an informational message and is not generally used.
200-299 means a successful request.
300-399 indicates that the requested resource has been moved; web servers use this for redirection.
400-499 indicates client errors.
500-599 indicates server errors.
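A program that consumes the response will typically want to map the status code onto these ranges. The sketch below assumes a status line of the form "HTTP/1.0 200 OK"; the StatusClass enumeration and both functions are hypothetical names introduced only for illustration.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical classification of the ranges listed above.
enum StatusClass {
    STATUS_INFORMATIONAL,  // 100-199
    STATUS_SUCCESS,        // 200-299
    STATUS_REDIRECT,       // 300-399
    STATUS_CLIENT_ERROR,   // 400-499
    STATUS_SERVER_ERROR,   // 500-599
    STATUS_UNKNOWN         // anything else
};

// Extracts the numeric code from a status line such as "HTTP/1.0 200 OK".
// Returns -1 if no code in the range 100-599 is found.
int ParseStatusCode(const std::string& statusLine)
{
    const std::string::size_type space = statusLine.find(' ');
    if (space == std::string::npos)
        return -1;
    // atoi stops at the first non-digit, i.e. before the reason phrase.
    const int code = std::atoi(statusLine.c_str() + space + 1);
    return (code >= 100 && code <= 599) ? code : -1;
}

StatusClass ClassifyStatus(int code)
{
    if (code < 100 || code > 599)
        return STATUS_UNKNOWN;
    switch (code / 100) {
        case 1:  return STATUS_INFORMATIONAL;
        case 2:  return STATUS_SUCCESS;
        case 3:  return STATUS_REDIRECT;
        case 4:  return STATUS_CLIENT_ERROR;
        default: return STATUS_SERVER_ERROR;  // only 5 remains here
    }
}
```

A caller could, for example, treat STATUS_REDIRECT as a signal to read the Location header and repeat the request against the new URL.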
After the headers comes the data returned by the GET or POST request. This is usually what the browser displays on screen.
Dialog box wrapper
The MFC dialog project serves as a wrapper around the Request class. The dialog contains an instance of the Microsoft Web Browser control, which makes it easy to navigate the data and issue GET or POST commands. The control is used in two ways:
1. When the user makes a request from the browser, the control fires the OnBeforeNavigate2 event, which is captured by the dialog program. The OnBeforeNavigate2Explorer1 handler then determines whether the request is a GET or a POST, and captures the headers sent to the web server and the posted data.
2. If the user wants to use the SendHTTP engine, they enter the required URL, complete the 'SendHTTPrequest' and 'PostData' (for a POST) fields, check the GET or POST radio button, and click the 'Go' button. The IE control then loads the HTML-formatted data that the SendHTTP() function returned in the m_HTTPbody string variable. The HTML loading is done in OnButtonViewHttp():
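    // Replace the body of the document currently loaded in the browser
    // control with the HTML stored in m_HTTPbody.
    LPDISPATCH lpDispatch = m_Browser.GetDocument();
    if (lpDispatch)
    {
        IHTMLDocument2* pHTMLDocument2 = NULL;
        HRESULT hr = lpDispatch->QueryInterface(IID_IHTMLDocument2,
                                                (LPVOID*)&pHTMLDocument2);
        lpDispatch->Release();
        if (SUCCEEDED(hr) && pHTMLDocument2)
        {
            IHTMLElement* pBody = NULL;
            hr = pHTMLDocument2->get_body(&pBody);
            if (SUCCEEDED(hr) && pBody)
            {
                BSTR bstr = m_HTTPbody.AllocSysString();
                pBody->put_innerHTML(bstr);
                SysFreeString(bstr);
                pBody->Release();
            }
            // The original snippet never released the document interface.
            pHTMLDocument2->Release();
        }
    }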
Usage
Enter the URL and click the 'Go' button. The mini-browser on the right displays your page. As you follow links and press buttons on this page, the 'PostData', 'SendHTTPrequest' and 'ReceiveHTTPrequest' fields are filled with the corresponding data. The GET/POST radio buttons are updated automatically: the IE instance knows whether you made a GET (you clicked a link) or a POST (you pressed a button).
You can also enter your own headers in the 'SendHTTPrequest' edit box and your POST data in the 'PostData' edit box, and then click the 'Go' button. The browser will navigate to your address using the headers and data from the 'SendHTTPrequest' and 'PostData' fields.
Use the TestGet.asp and TestPost.asp files from the Web directory to test your GET/POST utility: