Introduction
I first encountered the data: protocol when I came across the JavaScript Draw site, an AJAX implementation of a scribble application. The site did not work with Internet Explorer. Going through the source code, I found that one of the reasons was its use of data: URLs as the source for dynamically created images, which Internet Explorer does not support.
The data: protocol is described in RFC 2397. Currently, the only browsers that support the data: protocol are Opera and Mozilla Firefox. This article describes an asynchronous pluggable protocol implementation to support the data: protocol in Internet Explorer. One possible use of the data: protocol is to embed small images in the HTML itself to avoid server hits. It can also be useful in AJAX applications like JavaScript Draw to return images, encoded in base64, as the response text.
The data: URL format
The protocol itself, as described in the RFC, is quite simple. The format is:
dataurl := "data:" [ mediatype ] [ ";base64" ] "," data
mediatype := [ type "/" subtype ] *( ";" parameter )
data := *urlchar
parameter := attribute "=" value
The media type indicates the type of the data and its encoding. The default media type is text/plain;charset=US-ASCII. For an image, the media type can be image/gif, image/png, etc. The optional base64 part of the URL indicates that the data in the URL is encoded in base64 format. Although the primary use of base64 encoding in URLs is for binary data, it can also be used to encode text. Finally, the data portion of the URL is the actual encoded data represented by the URL.
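To make the grammar concrete, here is a minimal portable C++ sketch that assembles a base64 data: URL from raw bytes. The function names EncodeBase64 and MakeDataUrl are illustrative assumptions and are not part of the handler described in this article.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standard base64 alphabet.
static const char kBase64Alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encodes raw bytes as base64 with '=' padding (hypothetical helper).
std::string EncodeBase64(const std::vector<std::uint8_t>& bytes)
{
    std::string out;
    std::size_t i = 0;
    // Every full group of 3 bytes becomes 4 characters.
    for (; i + 2 < bytes.size(); i += 3)
    {
        std::uint32_t n = (std::uint32_t(bytes[i]) << 16) |
                          (std::uint32_t(bytes[i + 1]) << 8) |
                          std::uint32_t(bytes[i + 2]);
        out += kBase64Alphabet[(n >> 18) & 63];
        out += kBase64Alphabet[(n >> 12) & 63];
        out += kBase64Alphabet[(n >> 6) & 63];
        out += kBase64Alphabet[n & 63];
    }
    // A trailing group of 1 or 2 bytes is padded with '='.
    if (i + 1 == bytes.size())
    {
        std::uint32_t n = std::uint32_t(bytes[i]) << 16;
        out += kBase64Alphabet[(n >> 18) & 63];
        out += kBase64Alphabet[(n >> 12) & 63];
        out += "==";
    }
    else if (i + 2 == bytes.size())
    {
        std::uint32_t n = (std::uint32_t(bytes[i]) << 16) |
                          (std::uint32_t(bytes[i + 1]) << 8);
        out += kBase64Alphabet[(n >> 18) & 63];
        out += kBase64Alphabet[(n >> 12) & 63];
        out += kBase64Alphabet[(n >> 6) & 63];
        out += '=';
    }
    return out;
}

// Builds "data:<mediatype>;base64,<data>" as laid out by the grammar.
std::string MakeDataUrl(const std::string& mediaType,
                        const std::vector<std::uint8_t>& bytes)
{
    return "data:" + mediaType + ";base64," + EncodeBase64(bytes);
}
```

For example, calling MakeDataUrl with the media type text/plain and the two bytes of "Hi" yields data:text/plain;base64,SGk=.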
The next sections describe how the protocol was implemented.
Parsing the URL
ATL's regular expression classes come in pretty handy to parse the data: URLs. The following is a regular expression to parse the URL:
data:{(.*?/.*?)}?(;{.*?}={.*?})?{;(base64)?}?,{.*}
The regular expression captures the various portions of the URL into five different groups:
| Group | Capture |
|-------|---------|
| 0 | The type/subtype portion of the media type or the MIME type |
| 1 | The attribute name of any additional parameter specified with the MIME type |
| 2 | The value of the attribute captured in group 1 |
| 3 | The base64 string |
| 4 | The actual data string |
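For readers without ATL at hand, the same five captures can be approximated with std::regex. This is a sketch under assumptions: the pattern is my own ECMAScript translation rather than the ATL pattern itself, it handles at most one parameter, and the DataUrlParts structure and ParseDataUrl function are hypothetical names.

```cpp
#include <regex>
#include <string>

// Hypothetical holder for the five captures listed in the table above.
struct DataUrlParts
{
    std::string mimeType;   // group 0: type/subtype
    std::string attribute;  // group 1: parameter name, e.g. "charset"
    std::string value;      // group 2: parameter value
    bool isBase64 = false;  // group 3: whether ";base64" was present
    std::string data;       // group 4: the encoded payload
};

// Returns true and fills 'parts' when 'url' matches the data: URL shape.
bool ParseDataUrl(const std::string& url, DataUrlParts& parts)
{
    // std::regex numbers sub-matches from 1, so the table's group N
    // corresponds to sub-match N + 1 here.
    static const std::regex pattern(
        R"(data:([^/;,]+/[^;,]+)?(?:;([^=;,]+)=([^;,]*))?(;base64)?,(.*))",
        std::regex::icase);

    std::smatch m;
    if (!std::regex_match(url, m, pattern))
        return false;

    parts.mimeType = m[1].str();
    parts.attribute = m[2].str();
    parts.value = m[3].str();
    parts.isBase64 = m[4].matched;
    parts.data = m[5].str();
    return true;
}
```

Unmatched optional groups come back as empty strings, so a URL such as data:,Hello leaves mimeType empty and the caller can apply the default media type.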
After capturing the different portions of the URL, the base64-encoded data is converted into bytes. The ATL function Base64Decode is used for this.
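// Upper bound on the decoded size for an input of this length.
int nReqLen = Base64DecodeGetRequiredLength(strData.GetLength());
m_pvData = new BYTE[nReqLen];
// On input: capacity of the buffer. On output: number of bytes decoded.
int nDestLen = nReqLen;
bRet = Base64Decode(strData, strData.GetLength(), m_pvData, &nDestLen) != 0;
m_dwDataLength = nDestLen;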
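To show what the decoding step does without depending on ATL, here is a small portable sketch. It is not the ATL implementation; DecodeBase64 and Base64Value are hypothetical names, and the handling of whitespace and padding is a simplifying assumption.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Maps one base64 character to its 6-bit value, or -1 if it is not
// part of the alphabet.
static int Base64Value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Decodes base64 text into 'out'. Stops at the first '=' padding
// character, skips whitespace, and returns false on any other
// character outside the alphabet.
bool DecodeBase64(const std::string& text, std::vector<std::uint8_t>& out)
{
    out.clear();
    std::uint32_t buffer = 0;  // accumulates 6 bits per input character
    int bits = 0;              // number of not yet emitted bits in 'buffer'
    for (char c : text)
    {
        if (c == '=') break;
        if (c == ' ' || c == '\r' || c == '\n' || c == '\t') continue;
        int v = Base64Value(c);
        if (v < 0) return false;
        buffer = (buffer << 6) | static_cast<std::uint32_t>(v);
        bits += 6;
        if (bits >= 8)
        {
            // A full byte is available; emit its 8 bits.
            bits -= 8;
            out.push_back(static_cast<std::uint8_t>((buffer >> bits) & 0xFF));
        }
    }
    return true;
}
```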
Converting the Text Data to Unicode
If the data format is text, the text is converted into Unicode so that Internet Explorer can handle it correctly. The encoding of the source data comes from the charset attribute specified in the parameter portion of the media type. An example of such a URL is data:text/plain;charset=iso-8859-8-i,%f9%ec%e5%ed - which is some Hebrew text encoded in ISO-8859-8-i.
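In the example URL above, the text bytes arrive as %XX escapes, so they have to be turned back into raw bytes before any charset conversion. The article does not show that step; the following is a minimal portable sketch of one way to do it, with PercentDecode and HexValue as hypothetical names.

```cpp
#include <cstddef>
#include <string>

// Returns the value of a hexadecimal digit, or -1 if 'c' is not one.
static int HexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Replaces each %XX escape with the byte it encodes. Malformed or
// truncated escapes are copied through unchanged.
std::string PercentDecode(const std::string& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (text[i] == '%' && i + 2 < text.size())
        {
            int hi = HexValue(text[i + 1]);
            int lo = HexValue(text[i + 2]);
            if (hi >= 0 && lo >= 0)
            {
                out += static_cast<char>((hi << 4) | lo);
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out += text[i];
    }
    return out;
}
```

Applied to %f9%ec%e5%ed, this yields the four bytes F9 EC E5 ED, which are then interpreted in the charset named by the URL.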
To convert the multibyte text to Unicode, we use the well-known MultiByteToWideChar function, which requires a numeric (UINT) code page identifier. A little research revealed that the IMultiLanguage2 interface in the MLang API can be used to obtain the code page identifier for a named charset.
CComPtr<IMultiLanguage2> spMLang;
if (SUCCEEDED(hr = spMLang.CoCreateInstance(CLSID_CMultiLanguage)))
{
    MIMECSETINFO mi;
    if (SUCCEEDED(hr = spMLang->GetCharsetInfo(CComBSTR(GetCharset()), &mi)))
        ...
}
The MIMECSETINFO structure is declared as:
typedef struct tagMIMECSETINFO {
    UINT uiCodePage;
    UINT uiInternetEncoding;
    WCHAR wszCharset[MAX_MIMECSET_NAME];
} MIMECSETINFO, *PMIMECSETINFO;
At first glance, it seems that the uiCodePage member gives the required code page identifier. In my experience, however, uiCodePage is the required code page under some circumstances, and uiInternetEncoding is the required value under others. Unfortunately, I could not locate any documentation describing when to use which. As a result, the code to convert charsets becomes a little ugly: it tries uiInternetEncoding first and falls back to uiCodePage if the conversion fails.
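int nSrcLen = strData.GetLength();
// Try the Internet encoding first; fall back to uiCodePage if it is rejected.
UINT uCodePage = mi.uiInternetEncoding;
// A call with a NULL output buffer returns the required length in WCHARs.
int nWideChar = MultiByteToWideChar(uCodePage, 0,
    (LPCSTR)strData, nSrcLen, NULL, 0);
if (nWideChar == 0)
{
    uCodePage = mi.uiCodePage;
    nWideChar = MultiByteToWideChar(uCodePage, 0,
        (LPCSTR)strData, nSrcLen, NULL, 0);
}
if (nWideChar != 0)
{
    // Reserve one extra WCHAR at the front for the byte order mark.
    WCHAR* sz = new WCHAR[nWideChar + 1];
    MultiByteToWideChar(uCodePage, 0,
        (LPCSTR)strData, nSrcLen, sz + 1, nWideChar);
    m_pvData = (BYTE*)sz;
    m_dwDataLength = (nWideChar + 1) * sizeof(WCHAR);
    // UTF-16 little-endian byte order mark (U+FEFF stored as FF FE).
    m_pvData[0] = 0xFF;
    m_pvData[1] = 0xFE;
}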
Once the characters are converted to a Unicode stream of bytes, the stream must be prefixed with the Unicode byte order mark (BOM) so that Internet Explorer recognizes the encoding. The lead bytes are 0xFF 0xFE, which is U+FEFF stored in little-endian order.
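The resulting buffer layout can be illustrated without the Windows API. The sketch below assumes ISO-8859-1 input, chosen only because each of its bytes equals the corresponding Unicode code point, so no code page table is needed; the function name Latin1ToUtf16LeWithBom is hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Builds a UTF-16 little-endian byte stream with a leading BOM from
// ISO-8859-1 text. Other charsets would need a real conversion table.
std::vector<std::uint8_t> Latin1ToUtf16LeWithBom(const std::string& latin1)
{
    std::vector<std::uint8_t> out;
    out.reserve(2 + latin1.size() * 2);
    out.push_back(0xFF);  // BOM: low byte of U+FEFF
    out.push_back(0xFE);  // BOM: high byte of U+FEFF
    for (unsigned char c : latin1)
    {
        out.push_back(c);     // low byte of the code unit
        out.push_back(0x00);  // high byte is zero for U+0000 to U+00FF
    }
    return out;
}
```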
The URL parsing gave us the data and the MIME type of the data. The actual implementation of the pluggable protocol is pretty simple.
Implementing the Asynchronous Pluggable Protocol Handler
An asynchronous pluggable protocol handler is a COM object that implements the IInternetProtocol and IInternetProtocolInfo interfaces. For Internet Explorer to use the URL scheme handled by the protocol, registration entries must be added under HKEY_CLASSES_ROOT\PROTOCOLS\Handler. The following is an extract from the .rgs file for the protocol COM object.
HKCR
{
    ...
    NoRemove PROTOCOLS
    {
        NoRemove Handler
        {
            ForceRemove data = s 'data: pluggable protocol'
            {
                val CLSID = s '{C79BF22F-25C4-4D3D-8183-14149EAB9C0C}'
            }
        }
    }
}
The only interesting methods in the implementation of the pluggable protocol handler are IInternetProtocol::Start and IInternetProtocol::Read. Internet Explorer (actually, urlmon.dll) calls IInternetProtocol::Start to tell the handler that data needs to be downloaded from a given URL. The pluggable protocol handler parses the URL and downloads the data. It notifies the caller of its progress through IInternetProtocolSink, a callback interface supplied by the caller. Depending on the status information received from the protocol handler, the caller calls IInternetProtocol::Read to read chunks of data. The Start method of the data protocol handler is implemented as:
STDMETHODIMP CDataPluggableProtocol::Start(
    LPCWSTR szUrl,
    IInternetProtocolSink *pIProtSink,
    IInternetBindInfo *pIBindInfo,
    DWORD grfSTI,
    DWORD dwReserved)
{
    HRESULT hr = S_OK;
    if (m_url.Parse(szUrl))
    {
        // Parsing has already decoded the data, so reading starts at offset 0.
        m_dwPos = 0;
        CAtlString strData = m_url.GetDataString();
        pIProtSink->ReportProgress(BINDSTATUS_FINDINGRESOURCE, strData);
        pIProtSink->ReportProgress(BINDSTATUS_CONNECTING, strData);
        pIProtSink->ReportProgress(BINDSTATUS_SENDINGREQUEST, strData);
        // Tell the caller which MIME type the data has.
        pIProtSink->ReportProgress(BINDSTATUS_VERIFIEDMIMETYPEAVAILABLE,
            m_url.GetMimeType());
        // All of the data is available immediately.
        pIProtSink->ReportData(BSCF_FIRSTDATANOTIFICATION, 0,
            m_url.GetDataLength());
        pIProtSink->ReportData(BSCF_LASTDATANOTIFICATION |
            BSCF_DATAFULLYAVAILABLE,
            m_url.GetDataLength(),
            m_url.GetDataLength());
    }
    else
    {
        // With PI_PARSE_URL the caller only wants a syntax check, and
        // S_FALSE reports an invalid URL. Otherwise the bind must fail
        // rather than return S_OK without delivering any data.
        hr = (grfSTI & PI_PARSE_URL) ? S_FALSE : INET_E_INVALID_URL;
    }
    return hr;
}
The function parses the URL, which automatically extracts the data. The code then sends a series of notifications to the caller. The important call is ReportProgress(BINDSTATUS_VERIFIEDMIMETYPEAVAILABLE, m_url.GetMimeType()), which tells the caller the MIME type of the data so that the caller can handle the data accordingly. The caller then calls IInternetProtocol::Read to read the data.
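The Read method is not listed in this article. As a rough model of what such a method has to do, the following portable sketch copies the next chunk from an in-memory buffer and advances a read position. BufferReader and the kS_OK and kS_FALSE constants are hypothetical stand-ins so that the sketch compiles without the Windows headers; a real handler would return S_OK and S_FALSE from IInternetProtocol::Read.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Stand-ins for the Windows HRESULT values (assumed for this sketch).
const long kS_OK = 0;     // data was returned; more may follow
const long kS_FALSE = 1;  // no data left to read

// Hypothetical model of the handler's read state; not the article's class.
class BufferReader
{
public:
    explicit BufferReader(std::vector<std::uint8_t> data)
        : m_data(std::move(data)), m_pos(0) {}

    // Mirrors the shape of Read(void* pv, ULONG cb, ULONG* pcbRead).
    long Read(void* pv, unsigned long cb, unsigned long* pcbRead)
    {
        *pcbRead = 0;
        if (m_pos >= m_data.size())
            return kS_FALSE;  // everything has already been handed out

        // Copy at most 'cb' bytes, limited by what is left in the buffer.
        const std::size_t remaining = m_data.size() - m_pos;
        const std::size_t count = std::min<std::size_t>(cb, remaining);
        std::memcpy(pv, m_data.data() + m_pos, count);
        m_pos += count;
        *pcbRead = static_cast<unsigned long>(count);
        return kS_OK;
    }

private:
    std::vector<std::uint8_t> m_data;  // decoded bytes of the data: URL
    std::size_t m_pos;                 // plays the role of m_dwPos
};
```

The caller keeps calling Read until it gets the end-of-data result, which is why Start resets the position to zero before reporting that data is available.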
Testing the Protocol Handler
The protocol handler is automatically registered when the project is built. Once the handler is registered, data: URLs will start working in Internet Explorer. The protocol handler has been tested with the data: URL Tests at the mozilla.com testing website. The handler passes all the tests except one, which fails because of Internet Explorer's limit on URL length. So far, no security issues have been identified. I welcome readers to point out any possible security issues with the protocol handler.
History
- January 28, 2006 - Initial release.