Introduction
Before going into detail, let me explain what EfTidy actually is. EfTidy is a wrapper component for the Tidy library. If you don't know what Tidy is, here is a short description:
"TidyLib is an open source utility for tidying up HTML. Tidy is composed from an HTML parser and an HTML pretty printer. The parser goes to considerable lengths to correct common markup errors. It also provides advice on how to make your pages more accessible to people with disabilities, and can be used to convert HTML content into XML as XHTML. Tidy is W3C open source and available free. It has been successfully compiled on a large number of platforms, and is being integrated into many HTML authoring tools".
- By Mr. Dave Raggett
So What Am I Doing With This Library?
Recently, one of my company's clients asked us to build a TidyAtl class for the new Tidy library, because the last ATL component or ActiveX wrapper for the Tidy library was built in 2002. My company assigned me the task of creating an ATL library for this component. After the component was complete, Mr. Saurabh told me: "Alok, this is an open source component and other programmers deserve to use it". So here I am, presenting this component with supporting source code and a brief overview of each function.
Component Reference
EfTidy contains four interfaces:
- IEfTidyAttr (two properties)
- IEfTidyNode (one property and four methods)
- ItidyOption (sixty-six properties)
- ItidyCom (five methods and four properties)
EfTidy also contains five enumerations:
CharEncodingType
typedef [public] enum tagCharEncodingType
{
    ASCII, LATIN1, RAW, UTF8, ISO2022, MAC, WIN1252,
    UTF16LE, UTF16BE, UTF16, BIG5, SHIFTJIS
} CharEncodingType;
OutputType
typedef [public] enum tagOutputType
{
    XmlOut,
    XhtmlOut,
    HtmlOut
} OutputType;
IndentScheme
typedef [public] enum IndentScheme
{
    NOINDENT = 0,
    INDENTBLOCKS,
    AUTOINDENT
} IndentScheme;
DoctypeModes
typedef [public] enum
{
    DoctypeOmit,
    DoctypeAuto,
    DoctypeStrict,
    DoctypeLoose,
    DoctypeUser
} DoctypeModes;
EfTidyMainNode
typedef [public] enum
{
    TIDY_ROOT, TIDY_HTML, TIDY_HEAD, TIDY_BODY
} EfTidyMainNode;
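Clients that cannot see the named constants in the type library (a late-bound script, for example) need the numeric values instead. The following is only a reference sketch and not part of EfTidy: it mirrors the five enumerations in Python, assuming the usual C/IDL rule that enumerators without an explicit value are numbered from 0 in declaration order. The Python layout itself is purely illustrative.

```python
from enum import IntEnum

# Reference mirrors of the EfTidy enumerations (illustrative only).
# Assumption: as in C and IDL, enumerators without an explicit value
# are numbered 0, 1, 2, ... in declaration order.
CharEncodingType = IntEnum(
    "CharEncodingType",
    "ASCII LATIN1 RAW UTF8 ISO2022 MAC WIN1252 "
    "UTF16LE UTF16BE UTF16 BIG5 SHIFTJIS",
    start=0,
)
OutputType = IntEnum("OutputType", "XmlOut XhtmlOut HtmlOut", start=0)
IndentScheme = IntEnum(
    "IndentScheme", "NOINDENT INDENTBLOCKS AUTOINDENT", start=0
)
DoctypeModes = IntEnum(
    "DoctypeModes",
    "DoctypeOmit DoctypeAuto DoctypeStrict DoctypeLoose DoctypeUser",
    start=0,
)
EfTidyMainNode = IntEnum(
    "EfTidyMainNode", "TIDY_ROOT TIDY_HTML TIDY_HEAD TIDY_BODY", start=0
)

ALL_ENUMS = (
    CharEncodingType, OutputType, IndentScheme, DoctypeModes, EfTidyMainNode
)

if __name__ == "__main__":
    # Print every enumeration as a name -> number table.
    for enum_type in ALL_ENUMS:
        values = {member.name: int(member) for member in enum_type}
        print(enum_type.__name__, values)
```

If the component's type library ever assigns explicit values that differ from this ordering, the type library is authoritative and the table above should be regenerated from it.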
Now, let's take each interface one by one.
1. ItidyCom
First, let's check out each and every method or property present in this interface, and the functions they perform:
Property/Method name | Parameters | Get/Put | Description |
TidyFiletoMem | [in] BSTR sourceFile, [out, retval] BSTR* result | n/a | Takes a file as input and writes the output to memory. |
TidyFileToFile | [in] BSTR sourceFile, [in] BSTR destFile | n/a | Takes a file as input and writes the output to a file. |
TidyMemToMem | [in] BSTR sourceStr, [out, retval] BSTR* result | n/a | Takes a string as input and writes the output to memory. |
TidyMemtoFile | [in] BSTR buffer, [in] BSTR destFile | n/a | Takes a buffer as input and writes the output to a file. |
TotalWarnings | [out, retval] long *pVal | Get | Returns the total number of warnings produced by any of the four operations above. |
TotalErrors | [out, retval] long *pVal | Get | Returns the total number of errors produced by any of the four operations above. |
ErrorWarning | [out, retval] BSTR *pVal | Get | Returns the buffer that contains the human-readable errors and warnings. |
Option | [out, retval] ItidyOption* *pVal | Get | Returns the ItidyOption object used to set options for the Tidy library. |
EfTidyNode | [in] EfTidyMainNode Type, [out, retval] IEfTidyNode **ppNewEfTidyNode | n/a | Because an HTML page has a tree structure, this method returns the tidy node of the requested type, which lets you read each tag and its attributes. This is the latest addition to the Tidy library. |
2. ItidyOption
Here is a list of properties and methods for the ItidyOption interface:
Property/Method name | Parameter | Get/Put | Description |
LoadConfigFile | BSTR | n/a | Load option settings from a configuration file. |
ResetToDefaultValue | void | n/a | Reset options to default settings. |
Doctype | BSTR | Both | Doctype declaration generated by Tidy. |
TidyMark | VARIANT_BOOL | Both | Add a meta element indicating a tidied document. |
HideEndTag | VARIANT_BOOL | Both | Suppress optional end tags. |
EncloseText | VARIANT_BOOL | Both | If yes, text in the body is wrapped in <p>. |
EncloseBlockText | VARIANT_BOOL | Both | If yes, text in blocks is wrapped in <p>. |
LogicalEmphasis | VARIANT_BOOL | Both | Replace i by em and b by strong. |
DefaultAltText | BSTR | Both | Default text for the alt attribute. |
Clean | VARIANT_BOOL | Both | Replace presentational clutter by style rules. |
DropFontTags | VARIANT_BOOL | Both | Discard presentation tags. |
DropEmptyParas | VARIANT_BOOL | Both | Discard empty p elements. |
Word2000 | VARIANT_BOOL | Both | Draconian cleaning for Word 2000. |
FixBadComment | VARIANT_BOOL | Both | Fix comments with adjacent hyphens. |
FixBackslash | VARIANT_BOOL | Both | Fix URLs by replacing \ with /. |
NewEmptyTags | BSTR | Both | Declared empty tags. |
NewInlineTags | BSTR | Both | Declared inline tags. |
NewBlockLevelTags | BSTR | Both | Declared block tags. |
NewPreTags | BSTR | Both | Declared pre tags. |
OutputType | OutputType | Both | Sets the output type: XML, XHTML or plain HTML. |
InputAsXML | VARIANT_BOOL | Both | Treat input as XML. |
ADDXmlDecl | VARIANT_BOOL | Both | Add <?xml ?> for XML docs. |
AddXmlSpace | VARIANT_BOOL | Both | If set to yes, adds the xml:space attribute as needed. |
Bare | VARIANT_BOOL | Both | Make bare HTML. |
AssumeXmlProcins | VARIANT_BOOL | Both | If set to yes, PIs must end with ?>. |
CharEncoding | CharEncodingType | Both | Character encoding for both input and output. |
InCharEncoding | CharEncodingType | Both | Input character encoding (if different). |
OutCharEncoding | CharEncodingType | Both | Output character encoding (if different). |
NumericsEntities | VARIANT_BOOL | Both | Use numeric entities for symbols. |
QuoteMarks | VARIANT_BOOL | Both | Output " marks as &quot;. |
QuoteNBSP | VARIANT_BOOL | Both | Output non-breaking space as an entity. |
QuoteAmpersand | VARIANT_BOOL | Both | Output naked ampersand as &amp;. |
OutputTagInUpperCase | VARIANT_BOOL | Both | Output tags in upper case instead of lower case. |
OutputAttrInUpperCase | VARIANT_BOOL | Both | Output attributes in upper case instead of lower case. |
WrapScriptlets | VARIANT_BOOL | Both | Wrap within JavaScript string literals. |
WrapAttVals | VARIANT_BOOL | Both | Wrap within attribute values. |
WrapSection | VARIANT_BOOL | Both | Wrap within section tags. |
WrapAsp | VARIANT_BOOL | Both | Wrap within ASP pseudo elements. |
WrapJste | VARIANT_BOOL | Both | Wrap within JSTE pseudo elements. |
WrapPhp | VARIANT_BOOL | Both | Wrap within PHP pseudo elements. |
Indent | IndentScheme | Both | Indent the content of appropriate tags. |
IndentSpace | long | Both | Indentation of n spaces. |
WrapLen | long | Both | Set wrap margin for output. |
TabSize | long | Both | Expand tabs to n spaces. |
IndentAttributes | long | Both | New-line + indent before each attribute. |
BreakBeforeBR | VARIANT_BOOL | Both | Output a new-line before <br> or not. |
LiteralAttribs | VARIANT_BOOL | Both | If true, attributes may use new-lines. |
MarkUp | VARIANT_BOOL | Both | |
ShowWarnings | VARIANT_BOOL | Both | Turn warnings on or off. |
Quiet | VARIANT_BOOL | Both | No 'Parsing X', guessed DTD or summary. |
KeepTime | VARIANT_BOOL | Both | If yes, last modified time is preserved. |
ErrorFile | BSTR | Both | File name to write errors to. |
GnuEmacs | VARIANT_BOOL | Both | If true, format error output for GNU Emacs. |
FixUrl | VARIANT_BOOL | Both | Applies URI encoding if necessary. |
BodyOnly | VARIANT_BOOL | Both | Output BODY content only. |
HideComments | VARIANT_BOOL | Both | Hides all (real) comments in output. |
DoctypeMode | DoctypeModes | Both | Sets the doctype mode for output. |
3. IEfTidyNode
Here is a list of properties and methods for the IEfTidyNode interface:
Property/Method name | Parameter | Get/Put | Description |
Name | BSTR *pVal | Get | Returns the name of the current tag. |
GetFirstChildNode | IEfTidyNode | n/a | Returns the first child node. |
GetNextChildNode | IEfTidyNode | n/a | Returns the next child node; use it to enumerate the rest of the tags. |
GetFirstAttribute | IEfTidyAttr | n/a | Returns the first attribute of the current tag. |
GetNextAttribute | IEfTidyAttr | n/a | Returns the rest of the attributes one by one. |
4. IEfTidyAttr
Here is a list of properties for the IEfTidyAttr interface:
Property/Method name | Parameter | Get/Put | Description |
Name | BSTR *pVal | Get | Name of the attribute. |
Value | BSTR *pVal | Get | Value of the attribute. |
Using the code
Components like this one are mostly developed for use from Visual Basic and other COM-friendly languages, so all the code described here is in Visual Basic. I am going to use some test cases to explain how the component works.
I used Test.htm (included with the project) to test EfTidy's responses. Here is what Test.htm contains:
<html>
<head><title>tidy Library</title></head>
<body>
<blockquote>
<p> </p> --(1)
<p><font size="5" color=
"#FF00FF">Tidy Library</font></p>
</blockquote>
<P><p><font size="5" color="#FF00FF"></font></p>
<table border="1" cellpadding="0" cellspacing="0"
style="border-collapse: collapse"
bordercolor="#111111" width="100%" id="AutoNumber1">
<tr>
<td width="50%" style="border-left-style: solid;
border-left-width: 1; border-right-style: none;
border-right-width: medium; border-top-style: solid;
border-top-width: 1; border-bottom-style:
none; border-bottom-width: medium"> --(2)
</td>
<td width="50%" style="border-left-style: none;
border-left-width: medium; border-right-style:solid;
border-right-width: 1; border-top-style: solid;
border-top-width: 1;border-bottom-style: none;
border-bottom-width: medium">
</td>
</tr>
</table>
<b>Tidy --- (3)
</h1> <tidy> ---(4)
</body>
</html>
In Test.htm, I have added the following mistakes:
- A dummy <tidy> tag at (4),
- A missing <h1> opening tag for the </h1> at (4),
- An empty paragraph <p> tag at (1),
- An unclosed <b> tag at (3).
Test case # 1 using ITidyCOM
First, create an object of our component. Here is a listing of how to achieve that:
' Declared at module level so that the button handlers below can use it
Private TidyCOMObj As EFTIDYLib.tidyCom

Private Sub Form_Load()
    Set TidyCOMObj = New EFTIDYLib.tidyCom
End Sub
Now, clean the test.htm file using this object. The code listing for that is given below:
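Private Sub cmdMemtoMem_Click()
    ' Clean test.htm and write the result to test1.htm
    ' (no parentheses: the method is called as a statement)
    TidyCOMObj.TidyFileToFile "test.htm", "test1.htm"
    ' Show the error and warning counts of this run
    txtError = TidyCOMObj.TotalErrors
    txtWarning = TidyCOMObj.TotalWarnings
End Sub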
And here is the result produced by Tidy, showing what test1.htm (created by EfTidyCom) contains:
<html>
<head>
<meta name="generator"
content="HTML Tidy for Windows (vers 1st September 2004),
see www.w3.org">
<title>tidy Library</title>
</head>
<body>
<blockquote>
<p> </p>
<p><font size="5" color="#FF00FF">Tidy Library</font>
</p>
</blockquote>
<p><font size="5" color= "#FF00FF"> </font></p>
<table border="1" cellpadding="0" cellspacing="0"
style= "border-collapse: collapse" bordercolor="#111111"
width="100%" id= "AutoNumber1">
<tr>
<td width="50%" style= "border-left-style: solid;
border-left-width: 1; border-right-style: none;
border-right-width: medium; border-top-style: solid;
border-top-width: 1; border-bottom-style: none;
border-bottom-width: medium">
</td>
<td width="50%"
style= "border-left-style: none;border-left-width: medium;
border-right-style: solid; border-right-width: 1;
border-top-style: solid; border-top-width: 1;
border-bottom-style: none;border-bottom-width: medium">
</td>
</tr>
</table>
<b>Tidy</b> --(1)
</body>
</html>
If you look at the cleaned HTML page above, the dummy <tidy> tag and the </h1> have been removed near (1), and a </b> has been added after Tidy at (1).
Here is a summary of the errors/warnings produced by EfTidyCom, showing the details of each action it performed:
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1 - Error: <tidy> is not recognized!
line 23 column 1 - Warning: discarding unexpected <tidy>
line 15 column 1 - Warning: <table> proprietary attribute
"bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary
5 warnings, 1 error were found!
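A report like the one above is easy to post-process once it has been read from the ErrorWarning property. The sketch below is not part of EfTidy and runs on the report text alone; it assumes the buffer has the shape shown in this article ("line N column M - Severity: message", with long messages wrapped onto the next line). The names parse_report and SAMPLE are made up for the illustration.

```python
import re
from collections import Counter

# Shape of one diagnostic line in the report shown in the article, e.g.
#   line 22 column 10 - Warning: discarding unexpected </h1>
DIAGNOSTIC = re.compile(
    r"^line (?P<line>\d+) column (?P<column>\d+)\s*-\s*"
    r"(?P<severity>Warning|Error): (?P<message>.*)$"
)
# Closing summary line, e.g. "5 warnings, 1 error were found!"
SUMMARY = re.compile(r"^\d+ warnings?, \d+ errors? were found!$")

# Report text copied from the article (test case 1).
SAMPLE = """\
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1 - Error: <tidy> is not recognized!
line 23 column 1 - Warning: discarding unexpected <tidy>
line 15 column 1 - Warning: <table> proprietary attribute
"bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary
5 warnings, 1 error were found!
"""


def parse_report(report):
    """Return one dict per diagnostic found in the report text."""
    diagnostics = []
    for raw in report.splitlines():
        text = raw.strip()
        if not text or text.startswith("Info:") or SUMMARY.match(text):
            continue  # blank, informational and summary lines carry no diagnostic
        match = DIAGNOSTIC.match(text)
        if match:
            entry = match.groupdict()
            entry["line"] = int(entry["line"])
            entry["column"] = int(entry["column"])
            diagnostics.append(entry)
        elif diagnostics:
            # Anything else is treated as a wrapped continuation line.
            diagnostics[-1]["message"] += " " + text
    return diagnostics


if __name__ == "__main__":
    found = parse_report(SAMPLE)
    print(Counter(entry["severity"] for entry in found))
    for entry in found:
        print(entry["line"], entry["column"], entry["severity"], entry["message"])
```

Counting the parsed entries by severity gives a quick cross-check against the TotalErrors and TotalWarnings properties.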
Test case # 2 using ITidyCOM
Now, let's apply some options to Test.htm to get custom output. I am using these options:
- Clean = TRUE (to move the inline styles into separate style classes),
- DoctypeMode = DoctypeUser (to enable a user-defined doctype string),
- Doctype = "Ef Tidy library" (the doctype string),
- OutputType = XhtmlOut (the output type),
- NewInlineTags = "tidy" (to make our dummy <tidy> tag legal).
Here is the code listing to achieve the above:
Private Sub cmdMemtoMem_Click()
    ' Set the options before running Tidy
    TidyCOMObj.Option.Clean = True
    TidyCOMObj.Option.NewInlineTags = "tidy"
    TidyCOMObj.Option.OutputType = XhtmlOut
    TidyCOMObj.Option.DoctypeMode = DoctypeUser
    TidyCOMObj.Option.Doctype = "Ef Tidy library"
    ' Clean test.htm and write the result to test1.htm
    ' (no parentheses: the method is called as a statement)
    TidyCOMObj.TidyFileToFile "test.htm", "test1.htm"
    txtError = TidyCOMObj.TotalErrors
    txtWarning = TidyCOMObj.TotalWarnings
End Sub
And here is the result produced by Tidy, showing what test1.htm (created by EfTidyCom) contains after applying our options:
<!DOCTYPE html PUBLIC "Ef Tidy library" ""> --(1)
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Windows (vers 1st September 2004),
see www.w3.org" />
<title>tidy Library</title>
<style type="text/css"> --(2)
/*<![CDATA[*/
table.c4 {border-collapse: collapse}
td.c3 {border-left-style: none;
border-left-width: medium; border-right-style: solid;
border-right-width: 1; border-top-style: solid;
border-top-width: 1;
border-bottom-style: none; border-bottom-width: medium}
td.c2 {border-left-style: solid; border-left-width: 1;
border-right-style: none;
border-right-width: medium; border-top-style: solid;
border-top-width: 1;
border-bottom-style: none; border-bottom-width: medium}
h2.c1 {color: #FF00FF}
/*]]>*/
</style>
</head>
<body>
<blockquote>
<p> </p>
<h2 class="c1">Tidy Library</h2>
</blockquote>
<h2 class="c1">
</h2>
<table border="1" cellpadding="0" cellspacing="0" class="c4"
bordercolor="#111111" width="100%" id="AutoNumber1">
<tr>
<td width="50%" class="c2"> </td> ----(3)
<td width="50%" class="c3"> </td>
</tr>
</table>
<b>Tidy <tidy></tidy></b> ----(4)
</body>
</html>
Now, let us see what Tidy cleans for us:
- In (1), our custom string "Ef Tidy library" is visible.
- In (2) and (3), the inline styles are cleaned up and a class is created for each of them.
- In (4), our <tidy> tag becomes legal, though it does nothing in the actual HTML page.
Here is a summary of all the errors/warnings:
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 22 column 10- Warning: discarding unexpected </h1>
line 23 column 1 - Warning: <tidy> is not approved by W3C
line 23 column 1 - Warning: missing </tidy> before </body>
line 22 column 2 - Warning: missing </b> before </body>
line 15 column 1 - Warning: <table> proprietary attribute
"bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary
7 warnings, 0 errors were found!
Test case # 3 using IEfTidyNode and IEfTidyAttr
These two interfaces help you gather information node by node and attribute by attribute from the HTML tree structure cleaned by the Tidy library. Here is the code listing for finding the <head> tag and enumerating all of its attributes.
Note: Always use these two interfaces on HTML that has already been cleaned by Tidy.
Private Sub cmdGetNode_Click()
    ' TidyDoc is the tidyCom object of this test form
    Set tidyNode = TidyDoc.EfTidyNode(TIDY_HEAD)
    ' Read the node only after checking that one was returned
    If Not tidyNode Is Nothing Then
        txtNodeName = tidyNode.Name
        ' Walk the attributes until no more are returned
        Set atr = tidyNode.GetFirstAttribute
        Do Until atr Is Nothing
            lstAttr.AddItem atr.Name & " " & atr.Value
            Set atr = tidyNode.GetNextAttribute
        Loop
    End If
End Sub
Next, let's enumerate the children of the head node and get the attributes of each. The code listing below finds the first child:
Private Sub cmdGetFirstChildNode_Click()
    Dim localnode As EfTidyNode
    Set localnode = tidyNode.GetFirstChildNode
    ' Read the node only after checking that one was returned
    If Not localnode Is Nothing Then
        txtNodeName = localnode.Name
        ' Walk the attributes until no more are returned
        Set atr = localnode.GetFirstAttribute
        Do Until atr Is Nothing
            lstAttr.AddItem atr.Name & " " & atr.Value
            Set atr = localnode.GetNextAttribute
        Loop
    End If
End Sub
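Both handlers above use the same first/next pattern: ask for the first attribute, then keep asking for the next one until nothing comes back. As a language-neutral illustration, here is a minimal sketch of that loop wrapped in a reusable generator. EfTidy itself is a Windows COM component, so the sketch uses hypothetical stand-in classes (FakeNode and FakeAttr, not part of EfTidy) that imitate only the GetFirstAttribute / GetNextAttribute contract as described in the tables above, with None playing the role of Nothing.

```python
class FakeAttr:
    """Stand-in for an attribute object exposing Name and Value."""

    def __init__(self, name, value):
        self.Name = name
        self.Value = value


class FakeNode:
    """Hypothetical stand-in that imitates the first/next attribute cursor."""

    def __init__(self, name, attrs):
        self.Name = name
        self._attrs = [FakeAttr(n, v) for n, v in attrs]
        self._pos = 0

    def GetFirstAttribute(self):
        # Rewind the cursor, then behave like a normal "next" call.
        self._pos = 0
        return self.GetNextAttribute()

    def GetNextAttribute(self):
        # None marks the end of the list (Nothing in Visual Basic).
        if self._pos >= len(self._attrs):
            return None
        attr = self._attrs[self._pos]
        self._pos += 1
        return attr


def iter_attributes(node):
    """Yield every attribute of node using the first/next cursor calls."""
    attr = node.GetFirstAttribute()
    while attr is not None:
        yield attr
        attr = node.GetNextAttribute()


if __name__ == "__main__":
    table = FakeNode("table", [("border", "1"), ("width", "100%")])
    for attr in iter_attributes(table):
        print(attr.Name, attr.Value)
```

The same wrapper shape would apply to GetFirstChildNode / GetNextChildNode, assuming they follow the same first/next contract.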
Here is a snapshot taken after clicking the button for the code above:
This has been only a small overview of the Tidy library and EfTidyCom. For more information on the Tidy library, visit the Tidy home page.
Author Comment
I know there is much scope for improvement in this component, especially in the IEfTidyNode and IEfTidyAttr interfaces. I promise these improvements will be there in the next version/update of the library.
Files Listed with the Project
EfTidy version 1.3.2
- Source zip contains:
  - TidyLib (updated Tidy library released on 20th October, 2009) source code
  - EfTidyCom source code released on 28th December, 2011
- Project zip contains:
  - Unicode Release Mini Dependency version of the EfTidy component
EfTidy version 1.3.1.2
- Source zip contains:
  - TidyLib (updated Tidy library released on 20th November, 2005) source code
  - EfTidyCom source code released on 12th January, 2006
- Project zip contains:
  - Unicode Release Mini Dependency version of the EfTidy component
EfTidy version 1.3.0
- Source zip contains:
  - TidyLib (updated Tidy library released on 20th November, 2005) source code
  - EfTidyCom source code released on 12th December, 2005
- Project zip contains:
  - Release minimum size version of the EfTidy component
EfTidy version 1.2
- Source zip contains:
  - TidyLib (updated Tidy library released on 16th February, 2005) source code
  - EfTidyCom source code released on 18th February, 2005
- Project zip contains:
  - Release version of the EfTidy component
EfTidy version 1.0
- Source zip contains:
  - TidyLib (original Tidy library) source code
  - EfTidyCom source code
- Project zip contains:
  - Release version of the EfTidy component
  - Visual Basic test project for ItidyCom and ItidyOption (with source)
  - Visual Basic test project for iTidyNode and iTidyAttr (with source)
  - Test.htm
Special Thanks
- Mr. Saurabh Gupta [Director, Efextra eSolutions Pvt. Ltd.]
- Paul E. Bible for his CCOMString class (in EfTidy version 1.0)
- The Tidy SourceForge group for the Tidy library
Update History
- 28th November, 2004: EfTidy version 1.0 introduced
- 18th February, 2005: EfTidy version 1.1
  - Some minor bug fixes
  - Use of ATL::CString instead of CCOMString
  - New Tidy source included
- 29th June, 2005: EfTidy version 1.2
- 12th December, 2005: EfTidy version 1.3.0
  - New Visual C++ .NET 2003 compliant wrapper/COM
  - Other minor bug fixes
- 12th January, 2006: EfTidy version 1.3.1.2
  - New Unicode-enabled wrapper; STL::String and STL::Wstring are used instead of ATL::CString
- 28th December, 2011: EfTidy version 1.3.2
  - Reversing the change made in the last version, ATL::CString is used instead of STL::String and STL::Wstring
  - Removed the _bstr_t dependency from the project
  - Some performance enhancements and bug fixes
  - Updated source code to support Visual Studio 2010
Other Version
An alternative .NET-based version of EfTidyCom is available here; it is written in managed VC++.