Introduction
First, let me explain why I called this article the "3rd way". I have already seen
articles on CodeGuru explaining how to load and parse an HTML file from
memory. You may ask why I am writing another guide. Below I describe the
advantages and disadvantages I found in those approaches.
The first way, which is also shown in MSDN, is to load the HTML code through the
IStream interface. The article about it is listed in the References (Load HTML
from Stream). If all you want is to put new code into your document,
you should definitely use this one. But if you try to get tags from your
document right after loading the HTML, you will get nothing, because the document
is still being parsed. You have to create an OnDocumentComplete handler and only
then start to look inside your document.
When I realized this, I went looking for another way that would give me the
parsed document immediately after submitting the code. And yes, I found it! Take
a look at the great article by Asher Kobin at CodeGuru. It uses an interface
called IMarkupServices, introduced with MS Internet Explorer 5.0. I took this
code, adapted it and started using it... but then I noticed that when I saved my
document to disk, the BODY tag had no attributes! I worked on this problem for a
whole day, trying to get it working, but got nowhere. When you load your HTML
code from memory into the document this way, all attributes of the BODY tag are
gone. I still have no idea why this happens and will be glad if someone can tell
me.
So I went back to MSDN and found another, third way to load and parse
HTML. I was so happy that I decided to write my first article for CodeProject
about it, which you are reading now :)
Code
For advanced programmers who don't want to read the whole article, here is
a hint: the HTML code is loaded with the write() method of the IHTMLDocument2
interface.
Now I'll explain how to do this from the beginning.
Headers and imports
I'll assume here that you have a standard MFC application (Dialog, SDI or
MDI). First of all, you have to initialize COM, since we are going to use the
MSHTML COM interfaces. This can be done in the InitInstance() function of
your application. Remember also to uninitialize COM in your ExitInstance():
BOOL CYourApp::InitInstance()
{
    CoInitialize(NULL);
    ...
}
int CYourApp::ExitInstance()
{
    ...
    CoUninitialize();
    return CWinApp::ExitInstance();
}
Now, in the file where you are going to use the MSHTML interfaces,
include mshtml.h and comdef.h (for smart pointers) and import mshtml.tlb:
#include <comdef.h>
#include <mshtml.h>
#pragma warning(disable : 4146)
#import <mshtml.tlb> no_auto_exclude
Where do I get a document?
Now let's get a pointer to the IHTMLDocument2 interface. How do you get it?
That depends on what you already have :) If you are hosting a WebBrowser control
or using CHtmlView in your application, you can call the GetDocument() function
and store the return value in your pointer. Here, however, I will explain how to
get a 'free' document, which is not attached to any control or view. This can be
done with a simple call to the CoCreateInstance() function:
MSHTML::IHTMLDocument2Ptr pDoc;
HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER,
IID_IHTMLDocument2, (void**)&pDoc);
Check that the call succeeded and that you have a valid pointer (not NULL), then
move on.
Converting your HTML code
I'll assume that you have all the HTML code you want to load
in a variable called lpszHTMLCode. This can be a CString or any other buffer,
loaded for example from a file on disk. We need to prepare it before passing it to
MSHTML. The problem is that the MSHTML function we are going to use takes only
a SAFEARRAY as its parameter. So let's convert our string to a SAFEARRAY:
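SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
VARIANT *param;
_bstr_t bsData = (LPCTSTR)lpszHTMLCode;
hr = SafeArrayAccessData(psa, (LPVOID*)&param);
param->vt = VT_BSTR;
// Give the array its own copy of the string: SafeArrayDestroy() frees the
// element, and bsData frees its own BSTR when it goes out of scope
param->bstrVal = bsData.copy();
// Unlock the array before handing it to MSHTML
hr = SafeArrayUnaccessData(psa);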
Last jump
Now we are ready to pass our SAFEARRAY to the write() function. These two lines
of code will do all the dirty parsing work for you:
hr = pDoc->write(psa);
hr = pDoc->close();
SafeArrayDestroy(psa);
Of course, remember to check the result of every step so that your
program never crashes. I skipped the checks to keep the code simple.
Now, after all this work, you have a pointer to the IHTMLDocument2
interface, which gives you a lot of features, such as getting a particular tag
and searching, inserting, replacing or deleting tags, just as you would do in
JavaScript.
And remember, if you are using smart pointers (as I do
here), you don't need to call the Release() function; the object will be freed
automatically.
"about:blank" bug workaround
Since we have no site "attached" to our document interface, all links (href, src) that are relative to the document will start with "about:blank" if you read them through the IHTMLAnchorElement::href property. The way to get the exact link, as it is in the HTML source, is to use the IHTMLElement interface and its handy getAttribute function. Just remember that the second parameter of this function should be 2, which tells the parser to return the text as is.
Of course, you should work the same way with IMG, LINK and other tags. The example project has been updated with this fix as well. You can download it and see how I did it.
References
Asher Kobin's article about parsing with IMarkupServices (CodeGuru)
Load HTML from Stream (MSDN)
MSHTML Reference (MSDN)
IHTMLDocument2 Reference (MSDN)