Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / desktop / MFC

HTML Parser C++ (Demo Project)

5.00/5 (12 votes)
4 Oct 2013CPOL1 min read 71.5K   4K  
This is a sample project for HTML Reader C++ Class Library

Introduction

This is a sample project developed using this tiny HTML Parser library. Its main purpose is to show the use of that library. However I have added some additional features to the lib. The project has support for UNICODE builds. The code warps the HTML tags to a tree model, exposing a function to retrieve a specific HTML element.  

An HTML element is an individual component of an HTML document or "web page", once this has been parsed into the Document Object Model.

In the HTML syntax, most elements are written with a start tag and an end tag, with the content in between. An HTML tag is composed of the name of the element, surrounded by angle brackets. An end tag also has a slash after the opening angle bracket, to distinguish it from the start tag.

<p>In the HTML syntax, most elements are written ...</p>

Between the starting/ending tags, any number of other tags may exist. This project offers a way to search for a specific tag, and also specify an attribute with a value for that tag. Then extract the content of that element. It's a cheap alternative to Microsoft's MSHTML parser (full of leaks).

Image 1

Using the Code

Add to your project the files form AClass directory.

Include some headers you may need like:

C++
#include "AClass/LiteHTMLReader.h"  
#include "AClass/HtmlElementCollection.h"


Instantiate the reader which will parse the HTML string.

C++
CLiteHTMLReader theReader;
CHtmlElementCollection theElementCollectionHandler;
theReader.setEventHandler(&theElementCollectionHandler);  

If you want to get a specific set of tags with a specific attrib use:

C++
theElementCollectionHandler.InitWantedTag(_T("style"), _T("id"),_T("sss"));

Call the parser function. At the end, the theElementCollectionHandler will be filled with the parsed structure.

C++
theReader.Read(m_szHtmlPage)

Now start retrieving the elements' text to a CString var.

C++
for (int i=0;i<theElementCollectionHandler.GetNumElementsFiltered();i++){
    theElementCollectionHandler.GetOuterHtml(i, szTxt, 1);
}

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)