EfTidy: The Tidy Library Wrapper

6 Sep 2013
A free component for HTML parsing and cleaning

Introduction

Before going into detail, let me explain what EfTidy actually is. EfTidy is a wrapper component for the Tidy library. If you don't know what Tidy is, here is a short description:

"TidyLib is an open source utility for tidying up HTML. Tidy is composed from an HTML parser and an HTML pretty printer. The parser goes to considerable lengths to correct common markup errors. It also provides advice on how to make your pages more accessible to people with disabilities, and can be used to convert HTML content into XML as XHTML. Tidy is W3C open source and available free. It has been successfully compiled on a large number of platforms, and is being integrated into many HTML authoring tools".

- By Mr. Dave Raggett

So What Am I Doing With This Library?

Recently, one of my company's clients asked us to build a TidyAtl class for the new Tidy library, because the last ATL component or ActiveX wrapper for the Tidy library was built in 2002. My company assigned me the task of creating an ATL library for this component. After I completed the component, Mr. Saurabh told me: "Alok, this is an open source component and other programmers deserve to use it". So here I am, presenting this component with its supporting source code and a brief overview of each function.

Component Reference

EfTidy contains four interfaces:

  • IEfTidyAttr (two properties)
  • IEfTidyNode (one property and four methods)
  • ItidyOption (sixty-six properties)
  • ItidyCom (five methods and four properties)

EfTidy also contains five enumerations:

  • CharEncodingType
    typedef [public] enum tagCharEncodingType 
      { ASCII, LATIN1, RAW, UTF8, ISO2022, MAC, WIN1252, 
        UTF16LE, UTF16BE, UTF16, BIG5, SHIFTJIS }
    CharEncodingType;
  • OutputType
    typedef [public] enum tagOutputType
    {
      XmlOut,   /**< Create output as XML */
      XhtmlOut, /**< Output extensible HTML */
      HtmlOut   /**< Output plain HTML, even for XHTML input. */
    } OutputType;
  • IndentScheme
    typedef [public] enum IndentScheme
    {
      NOINDENT=0,
      INDENTBLOCKS,
      AUTOINDENT 
    }IndentScheme;
  • DoctypeModes
    typedef [public] enum {
      DoctypeOmit,   /**< Omit DOCTYPE altogether */
      DoctypeAuto,   /**< Keep DOCTYPE in input. Set version to content */
      DoctypeStrict, /**< Convert document to HTML 4 strict content model */
      DoctypeLoose,  /**< Convert document to HTML 4 transitional content model */
      DoctypeUser    /**< Set DOCTYPE FPI explicitly */
    } DoctypeModes;
  • EfTidyMainNode
    typedef [public] enum { 
      TIDY_ROOT, //Return Tidy ROOT Node 
      TIDY_HTML, //Return Tidy HTML Node 
      TIDY_HEAD, //Return Tidy HEAD Node
      TIDY_BODY //Return Tidy BODY Node
    }EfTidyMainNode;

Now, let's take each interface one by one.

1. ItidyCom

First, let's check out each and every method or property present in this interface, and the functions they perform:

| Property/Method name | Parameters | Get/Put | Description |
|---|---|---|---|
| TidyFiletoMem | [in] BSTR sourceFile, [out, retval] BSTR* result | n/a | Take input from a file and write the output to memory. |
| TidyFileToFile | [in] BSTR sourceFile, [in] BSTR destFile | n/a | Take input from a file and write the output to a file. |
| TidyMemToMem | [in] BSTR sourceStr, [out, retval] BSTR* result | n/a | Take input from a buffer and write the output to memory. |
| TidyMemtoFile | [in] BSTR buffer, [in] BSTR destFile | n/a | Take input from a buffer and write the output to a file. |
| TotalWarnings | [out, retval] long *pVal | Get | Return the total number of warnings after any of the above four operations. |
| TotalErrors | [out, retval] long *pVal | Get | Return the total number of errors after any of the above four operations. |
| ErrorWarning | [out, retval] BSTR *pVal | Get | Return the buffer that contains the human-readable errors/warnings. |
| Option | [out, retval] ItidyOption* *pVal | Get | Return the ItidyOption object used to set the options for the Tidy library. |
| EfTidyNode | [in] EfTidyMainNode Type, [out, retval] IEfTidyNode **ppNewEfTidyNode | n/a | As an HTML page has a tree structure, this method returns the tidyNode that lets you read each tag and its attributes. This is the latest addition to the Tidy library. |

2. ItidyOption

Here is a list of properties and methods for the ItidyOption interface:

| Property/Method name | Parameter | Get/Put | Description |
|---|---|---|---|
| LoadConfigFile | BSTR | n/a | Load option settings from a configuration file. |
| ResetToDefaultValue | Void | n/a | Reset options to default settings. |
| Doctype | BSTR | Both | Doctype declaration generated by Tidy. |
| TidyMark | VARIANT_BOOL | Both | Add a meta element indicating a tidied document. |
| HideEndTag | VARIANT_BOOL | Both | Suppress optional end tags. |
| EncloseText | VARIANT_BOOL | Both | If yes, text in the body is wrapped in <p>. |
| EncloseBlockText | VARIANT_BOOL | Both | If yes, text in blocks is wrapped in <p>. |
| LogicalEmphasis | VARIANT_BOOL | Both | Replace i by em and b by strong. |
| DefaultAltText | BSTR | Both | Default text for alt attribute. |
| Clean | VARIANT_BOOL | Both | Replace presentational clutter by style rules. |
| DropFontTags | VARIANT_BOOL | Both | Discard presentation tags. |
| DropEmptyParas | VARIANT_BOOL | Both | Discard empty p elements. |
| Word2000 | VARIANT_BOOL | Both | Draconian cleaning for Word 2000. |
| FixBadComment | VARIANT_BOOL | Both | Fix comments with adjacent hyphens. |
| FixBackslash | VARIANT_BOOL | Both | Fix URLs by replacing \ with /. |
| NewEmptyTags | BSTR | Both | Declared empty tags. |
| NewInlineTags | BSTR | Both | Declared inline tags. |
| NewBlockLevelTags | BSTR | Both | Declared block tags. |
| NewPreTags | BSTR | Both | Declared pre tags. |
| OutputType | OutputType *pVal | Both | Set the output type: XML, XHTML, or plain HTML. |
| InputAsXML | VARIANT_BOOL | Both | Treat input as XML. |
| ADDXmlDecl | VARIANT_BOOL | Both | Add <?xml ?> for XML docs. |
| AddXmlSpace | VARIANT_BOOL | Both | If set to yes, adds the xml:space attribute as needed. |
| Bare | VARIANT_BOOL | Both | Make bare HTML. |
| AssumeXmlProcins | VARIANT_BOOL | Both | If set to yes, PIs must end with ?>. |
| CharEncoding | CharEncodingType | Both | Character encoding for both input and output. |
| InCharEncoding | CharEncodingType | Both | Input character encoding (if different). |
| OutCharEncoding | CharEncodingType | Both | Output character encoding (if different). |
| NumericsEntities | VARIANT_BOOL | Both | Use numeric entities for symbols. |
| QuoteMarks | VARIANT_BOOL | Both | Output " marks as &quot;. |
| QuoteNBSP | VARIANT_BOOL | Both | Output non-breaking space as entity. |
| QuoteAmpersand | VARIANT_BOOL | Both | Output naked ampersand as &amp;. |
| OutputTagInUpperCase | VARIANT_BOOL | Both | Output tags in upper case instead of lower case. |
| OutputAttrInUpperCase | VARIANT_BOOL | Both | Output attributes in upper case instead of lower case. |
| WrapScriptlets | VARIANT_BOOL | Both | Wrap within JavaScript string literals. |
| WrapAttVals | VARIANT_BOOL | Both | Wrap within attribute values. |
| WrapSection | VARIANT_BOOL | Both | Wrap within section tags. |
| WrapAsp | VARIANT_BOOL | Both | Wrap within ASP pseudo elements. |
| WrapJste | VARIANT_BOOL | Both | Wrap within JSTE pseudo elements. |
| WrapPhp | VARIANT_BOOL | Both | Wrap within PHP pseudo elements. |
| Indent | IndentScheme | Both | Indent the content of appropriate tags. |
| IndentSpace | long | Both | Indentation of n spaces. |
| WrapLen | long | Both | Set wrap margin for output. |
| TabSize | long | Both | Expand tabs to n spaces. |
| IndentAttributes | long | Both | New-line + indent before each attribute. |
| BreakBeforeBR | VARIANT_BOOL | Both | Output new-line before <br> or not. |
| LiteralAttribs | VARIANT_BOOL | Both | If true, attributes may use new-lines. |
| MarkUp | VARIANT_BOOL | Both | |
| ShowWarnings | VARIANT_BOOL | Both | Turn warnings on or off. |
| Quiet | VARIANT_BOOL | Both | No 'Parsing X', guessed DTD or summary. |
| KeepTime | VARIANT_BOOL | Both | If yes, last modified time is preserved. |
| ErrorFile | BSTR | Both | File name to write errors to. |
| GnuEmacs | VARIANT_BOOL | Both | If true, format error output for GNU Emacs. |
| FixUrl | VARIANT_BOOL | Both | Applies URI encoding if necessary. |
| BodyOnly | VARIANT_BOOL | Both | Output BODY content only. |
| HideComments | VARIANT_BOOL | Both | Hides all (real) comments in output. |
| DoctypeMode | DoctypeModes | Both | Sets the doctype mode for output. |

3. IEfTidyNode

Here is a list of properties and methods for the IEfTidyNode interface:

| Property/Method name | Parameter | Get/Put | Description |
|---|---|---|---|
| Name | BSTR *pVal | Get | Return the name of the current tag. |
| GetFirstChildNode | IEfTidyNode | n/a | Return the first child node. |
| GetNextChildNode | IEfTidyNode | n/a | Using this, you can enumerate the rest of the tags. |
| GetFirstAttribute | IEfTidyAttr | n/a | Return the first attribute of the current tag. |
| GetNextAttribute | IEfTidyAttr | n/a | Return the rest of the attributes one by one. |

4. IEfTidyAttr

Here is a list of properties for the IEfTidyAttr interface:

| Property/Method name | Parameter | Get/Put | Description |
|---|---|---|---|
| Name | BSTR *pVal | Get | Name of the attribute. |
| Value | BSTR *pVal | Get | Value of the attribute. |

Using the code

The component was developed for use with Visual Basic and other COM-friendly languages, so all the code shown here is in Visual Basic. I am going to use some test cases to explain how the component works.

I used Test.htm (included with the project) to test EfTidy's responses. Here is what Test.htm contains:

<html>
    <head><title>tidy Library</title></head>
    <body> 
      <blockquote> 
        <p> </p> --(1)
        <p><font size="5" color=
      "#FF00FF">Tidy Library</font></p>
      </blockquote>
      <P><p><font size="5" color="#FF00FF"></font></p>

      <table border="1" cellpadding="0" cellspacing="0" 
         style="border-collapse: collapse" 
         bordercolor="#111111" width="100%" id="AutoNumber1">
       <tr> 
         <td width="50%" style="border-left-style: solid; 
           border-left-width: 1; border-right-style: none; 
           border-right-width: medium; border-top-style: solid; 
           border-top-width: 1; border-bottom-style: 
           none; border-bottom-width: medium"> --(2)
         </td>
         <td width="50%" style="border-left-style: none; 
           border-left-width: medium; border-right-style:solid; 
           border-right-width: 1; border-top-style: solid;
           border-top-width: 1;border-bottom-style: none; 
           border-bottom-width: medium">
         </td> 
       </tr>
      </table> 
      <b>Tidy  --- (3)
      </h1> <tidy> ---(4)
    </body> 
</html>

In test.htm, I have deliberately introduced the following mistakes:

  • A dummy <tidy> tag at (4),
  • A closing </h1> tag with no matching opening <h1> tag at (4),
  • An empty <p> paragraph at (1),
  • An unclosed <b> tag at (3).

Test case # 1 using ITidyCOM

First, create an object of our component. Here is a listing of how to achieve that:

'Declared at module level so that every event handler can use it
Private TidyCOMObj As EFTIDYLib.tidyCom

Private Sub Form_Load()
    Set TidyCOMObj = New EFTIDYLib.tidyCom
End Sub

Now, clean the test.htm file using this object. The code listing for that is given below:

Private Sub cmdMemtoMem_Click()
    'Clean test.htm and write the result to test1.htm
    '(no parentheses: the method is called as a statement)
    TidyCOMObj.TidyFileToFile "test.htm", "test1.htm"

    'Number of errors found in the HTML
    txtError = TidyCOMObj.TotalErrors
    'Number of warnings found in the HTML
    txtWarning = TidyCOMObj.TotalWarnings
End Sub

Here is what test1.htm (created by EfTidyCom) contains:

<html> 
<head> 
 <meta name="generator" 
       content="HTML Tidy for Windows (vers 1st September 2004), 
                see www.w3.org"> 
    <title>tidy Library</title> 
</head>
<body> 
    <blockquote> 
        <p> </p> 
        <p><font size="5" color="#FF00FF">Tidy Library</font>
        </p> 
    </blockquote> 
    <p><font size="5" color= "#FF00FF"> </font></p> 
    <table border="1" cellpadding="0" cellspacing="0" 
         style= "border-collapse: collapse" bordercolor="#111111" 
         width="100%" id= "AutoNumber1">
     <tr> 
        <td width="50%" style= "border-left-style: solid; 
           border-left-width: 1; border-right-style: none; 
           border-right-width: medium; border-top-style: solid; 
           border-top-width: 1; border-bottom-style: none;
           border-bottom-width: medium">
        </td> 
        <td width="50%" 
           style= "border-left-style: none;border-left-width: medium;
           border-right-style: solid; border-right-width: 1;
           border-top-style: solid; border-top-width: 1; 
           border-bottom-style: none;border-bottom-width: medium"> 
        </td> 
     </tr> 
    </table> 
    <b>Tidy</b> --(1) 
</body>
</html>

In the cleaned HTML above, the dummy <tidy> tag and the stray </h1> have been removed, and the missing </b> has been added after "Tidy" at (1).

Here is a summary of the errors/warnings produced by EfTidyCom, showing you the details of each action it has performed:

line 1 column 1   - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1  - Error: <tidy> is not recognized!
line 23 column 1  - Warning: discarding unexpected <tidy>
line 15 column 1  - Warning: <table> proprietary attribute 
                    "bordercolor"
line 15 column 1  - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary

5 warnings, 1 error were found!
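
If you want to process this report in code rather than just display it, the text can be split into structured entries. The following is a minimal sketch in Python, assuming the report always uses the "line N column M - Severity: message" layout shown above; the names DIAGNOSTIC, parse_report and SAMPLE are mine and are not part of EfTidy.

```python
import re
from collections import Counter

# Matches lines such as "line 22 column 10 - Warning: discarding unexpected </h1>"
DIAGNOSTIC = re.compile(
    r"^line\s+(\d+)\s+column\s+(\d+)\s*-\s*(Warning|Error):\s*(.*)$"
)


def parse_report(report):
    """Return a list of (line, column, severity, message) tuples."""
    entries = []
    for raw in report.splitlines():
        match = DIAGNOSTIC.match(raw.strip())
        if match:
            line, column, severity, message = match.groups()
            entries.append((int(line), int(column), severity, message))
        elif entries and raw.startswith(" ") and raw.strip():
            # An indented line is assumed to continue the previous message.
            line, column, severity, message = entries[-1]
            entries[-1] = (line, column, severity, message + " " + raw.strip())
    return entries


# The report from test case 1, copied from the listing above.
SAMPLE = """\
line 1 column 1   - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1  - Error: <tidy> is not recognized!
line 23 column 1  - Warning: discarding unexpected <tidy>
line 15 column 1  - Warning: <table> proprietary attribute
                    "bordercolor"
line 15 column 1  - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary
"""

entries = parse_report(SAMPLE)
counts = Counter(severity for _, _, severity, _ in entries)
print(counts["Warning"], "warnings,", counts["Error"], "errors")
```

The counts obtained this way should agree with the TotalWarnings and TotalErrors properties; lines that do not follow the assumed layout, such as the "Info:" line, are simply skipped.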

Test case # 2 using ITidyCOM

Now, let's apply some options to Test.htm to get customized output. I am using these options:

  • Clean = True (to move inline styles into separate style classes),
  • DoctypeMode = DoctypeUser (to enable the custom doctype string),
  • Doctype = "Ef Tidy library" (the custom doctype string),
  • OutputType = XhtmlOut (output type),
  • NewInlineTags = "tidy" (to make our dummy <tidy> tag legal).

Here is the code listing to achieve the above:

Private Sub cmdMemtoMem_Click()
    TidyCOMObj.Option.Clean = True
    TidyCOMObj.Option.NewInlineTags = "tidy"
    TidyCOMObj.Option.OutputType = XhtmlOut

    'Our string is shown in the cleaned HTML
    'only if the doctype mode is DoctypeUser
    TidyCOMObj.Option.DoctypeMode = DoctypeUser
    TidyCOMObj.Option.Doctype = "Ef Tidy library"

    '(no parentheses: the method is called as a statement)
    TidyCOMObj.TidyFileToFile "test.htm", "test1.htm"
    txtError = TidyCOMObj.TotalErrors
    txtWarning = TidyCOMObj.TotalWarnings
End Sub

Here is what test1.htm (created by EfTidyCom) contains after applying our options:

<!DOCTYPE html PUBLIC "Ef Tidy library" ""> --(1)

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <meta name="generator" 
    content="HTML Tidy for Windows (vers 1st September 2004), 
            see www.w3.org" />
  <title>tidy Library</title>
  <style type="text/css">  --(2)

     /*<![CDATA[*/
       table.c4 {border-collapse: collapse}
       td.c3 {border-left-style: none; 
          border-left-width: medium; border-right-style: solid; 
          border-right-width: 1; border-top-style: solid; 
          border-top-width: 1; 
          border-bottom-style: none; border-bottom-width: medium}
       td.c2 {border-left-style: solid; border-left-width: 1; 
          border-right-style: none; 
          border-right-width: medium; border-top-style: solid; 
          border-top-width: 1;
          border-bottom-style: none; border-bottom-width: medium}
       h2.c1 {color: #FF00FF}
     /*]]>*/
  </style>
  </head>
  <body>
    <blockquote>
      <p> </p>
      <h2 class="c1">Tidy Library</h2>
    </blockquote>
    <h2 class="c1">
    </h2>
    <table border="1" cellpadding="0" cellspacing="0" class="c4"
           bordercolor="#111111" width="100%" id="AutoNumber1">
        <tr>
            <td width="50%" class="c2"> </td> ----(3)
            <td width="50%" class="c3"> </td>
        </tr>
    </table>
    <b>Tidy <tidy></tidy></b> ----(4)
  </body>
</html>

Now, let us see what Tidy has cleaned for us:

  • In (1), our custom string "Ef Tidy library" is visible.
  • In (2) and (3), the inline styles have been moved into a style sheet, and a class has been created for each of them.
  • In (4), our <tidy> tag is now accepted as legal, though it does nothing in the actual HTML page.
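
To make the idea behind points (2) and (3) concrete, here is a minimal Python sketch of how identical inline styles can be collapsed into generated classes. It is only an illustration and not Tidy's actual algorithm; the function name extract_style_classes and the numbering of classes in order of first appearance are my assumptions.

```python
def extract_style_classes(elements):
    """Collapse repeated inline styles into generated class names.

    elements is a list of (tag, inline_style) pairs in document order.
    Returns (rules, assigned): rules maps "tag.cN" selectors to their
    declarations, assigned lists the class given to each element.
    """
    rules = {}
    seen = {}       # (tag, style) -> class name already generated
    assigned = []
    for tag, style in elements:
        key = (tag, style)
        if key not in seen:
            # First time this tag/style pair appears: create a new class.
            seen[key] = "c%d" % (len(seen) + 1)
            rules["%s.%s" % (tag, seen[key])] = style
        assigned.append(seen[key])
    return rules, assigned


# Hypothetical input, loosely modelled on the elements of Test.htm.
elements = [
    ("h2", "color: #FF00FF"),
    ("h2", "color: #FF00FF"),
    ("td", "border-left-style: solid"),
    ("td", "border-left-style: none"),
    ("table", "border-collapse: collapse"),
]

rules, assigned = extract_style_classes(elements)
for selector, declarations in rules.items():
    print("%s {%s}" % (selector, declarations))
```

Both h2 elements share one class because their styles are identical, which mirrors the way the two headings in the cleaned page both use class c1.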

Here is a summary of all the errors/warnings:

line 1 column 1  - Warning: missing <!DOCTYPE> declaration
line 22 column 10- Warning: discarding unexpected </h1>
line 23 column 1 - Warning: <tidy> is not approved by W3C
line 23 column 1 - Warning: missing </tidy> before </body>
line 22 column 2 - Warning: missing </b> before </body>

line 15 column 1 - Warning: <table> proprietary attribute 
                   "bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary

7 warnings, 0 errors were found!

Test case # 3 using IEfTidyNode and IEfTidyAttr

These two interfaces let you read the HTML tree produced by the Tidy library node by node and attribute by attribute. The code listing below finds the <head> tag and enumerates all of its attributes.

Note: Always use these two interfaces on the HTML cleaned by Tidy.

Private Sub cmdGetNode_Click()
    'Assumes TidyDoc is an ITidyCom object that already holds HTML
    'cleaned by any of the four ITidyCom methods.
    'tidyNode and atr are assumed to be declared at module level,
    'because the next handler reuses tidyNode.

    'Request the <head> node
    Set tidyNode = TidyDoc.EfTidyNode(TIDY_HEAD)

    'Only touch the node if one was actually returned
    If Not tidyNode Is Nothing Then
        'Display the node name
        txtNodeName = tidyNode.Name

        'Enumerate all attributes of <head>, if any
        Set atr = tidyNode.GetFirstAttribute
        Do Until atr Is Nothing
            lstAttr.AddItem atr.Name & "   " & atr.Value
            Set atr = tidyNode.GetNextAttribute
        Loop
    End If
End Sub

Next, let's enumerate the children of the <head> node and read the attributes of each. The code listing below finds the first child:

Private Sub cmdGetFirstChildNode_Click()
    Dim localnode As EfTidyNode

    'tidyNode is the <head> node obtained in cmdGetNode_Click
    Set localnode = tidyNode.GetFirstChildNode

    'Only touch the child if one was actually returned
    If Not localnode Is Nothing Then
        txtNodeName = localnode.Name

        'Enumerate all attributes of the first child, if any
        Set atr = localnode.GetFirstAttribute
        Do Until atr Is Nothing
            lstAttr.AddItem atr.Name & "   " & atr.Value
            Set atr = localnode.GetNextAttribute
        Loop
    End If
End Sub
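
The two listings above handle one node each. The same first/next pattern extends to a walk over the whole subtree, sketched below in Python. MockNode and Attr are hypothetical stand-ins that only imitate the method names of IEfTidyNode and IEfTidyAttr, not the real COM objects, and the sketch assumes that GetNextChildNode is called on the parent node and advances an internal cursor, the same way GetNextAttribute is used above.

```python
from collections import namedtuple

# Hypothetical stand-in for IEfTidyAttr: just a name/value pair.
Attr = namedtuple("Attr", "Name Value")


class MockNode:
    """Hypothetical stand-in for IEfTidyNode with first/next cursors."""

    def __init__(self, name, attributes=(), children=()):
        self.Name = name
        self._attributes = list(attributes)
        self._children = list(children)
        self._attr_pos = 0
        self._child_pos = 0

    def GetFirstAttribute(self):
        self._attr_pos = 0
        return self.GetNextAttribute()

    def GetNextAttribute(self):
        if self._attr_pos < len(self._attributes):
            self._attr_pos += 1
            return self._attributes[self._attr_pos - 1]
        return None  # plays the role of Nothing in Visual Basic

    def GetFirstChildNode(self):
        self._child_pos = 0
        return self.GetNextChildNode()

    def GetNextChildNode(self):
        if self._child_pos < len(self._children):
            self._child_pos += 1
            return self._children[self._child_pos - 1]
        return None


def walk(node, depth=0, out=None):
    """Collect one indented line per node, with its attributes."""
    out = [] if out is None else out
    attrs = []
    attr = node.GetFirstAttribute()
    while attr is not None:
        attrs.append("%s=%s" % (attr.Name, attr.Value))
        attr = node.GetNextAttribute()
    out.append("  " * depth + " ".join([node.Name] + attrs))
    child = node.GetFirstChildNode()
    while child is not None:
        walk(child, depth + 1, out)
        child = node.GetNextChildNode()
    return out


head = MockNode("head", children=[
    MockNode("meta", attributes=[Attr("name", "generator")]),
    MockNode("title"),
])
lines = walk(head)
print("\n".join(lines))
```

Because every node keeps its own cursor in this sketch, recursing into a child does not disturb the parent's position. Whether the real component behaves the same way is something to verify against its source.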

[Screenshot: the test application after clicking the button for the code above]

This article gives only a brief overview of the Tidy library and EfTidyCom. For more information on the Tidy library, visit the Tidy home page.

Author Comment

I know there is much scope for improvement in this component, especially in the interfaces IEfTidyNode and IEfTidyAttr. I promise these improvements will be there in the next version/update of the library.

Files Listed with the Project

EfTidy version 1.3.2

  • Source zip contains:
    • TidyLib (Updated Tidy library released on 20th October, 2009) source code
    • EfTidyCom source code released on 28th December, 2011
  • Project zip contains:
    • Unicode Release Mini Dependency version of EfTidy component

EfTidy version 1.3.1.2

  • Source zip contains:
    • TidyLib (Updated Tidy library released on 20th November, 2005) source code
    • EfTidyCom source code released on 12th January, 2006
  • Project zip contains:
    • Unicode Release Mini Dependency version of EfTidy component

EfTidy version 1.3.0

  • Source zip contains:
    • TidyLib (Updated Tidy library released on 20th November, 2005) source code.
    • EfTidyCom source code released on 12th December, 2005.
  • Project zip contains:
    • Release minimum size version of EfTidy component.

EfTidy version 1.2

  • Source zip contains:
    • TidyLib (Updated Tidy library released on 16th February, 2005) source code.
    • EfTidyCom source code released on 18th February, 2005.
  • Project zip contains:
    • Release version of EfTidy component.

EfTidy version 1.0

  • Source zip contains:
    • TidyLib (original Tidy library) source code.
    • EfTidyCom source code.
  • Project zip contains:
    • Release version of EfTidy component.
    • Visual Basic test project for ItidyCom and ItidyOption (with source).
    • Visual Basic test project for iTidyNode and iTidyAttr (with source).
    • Test.htm.

Special Thanks

  • Mr. Saurabh Gupta [Director Efextra eSolutions Pvt. Ltd.]
  • Paul E. Bible for his CCOMString class (in EfTidy version 1.0)
  • Tidy SourceForge group for this nice library, i.e. Tidy library

Update History

  • 28th November, 2004: EfTidy version 1.0 introduced
  • 18th February, 2005: EfTidy version 1.1
    • Some minor bug fixes
    • Use of ATL::CString instead of CCOMString
    • New Tidy source included
  • 29th June, 2005: EfTidy version 1.2
  • 12th December, 2005: EfTidy version 1.3.0
    • New Visual C++ .NET 2003 compliant wrapper/COM
    • Other minor bug fixes
  • 12th January, 2006: EfTidy version 1.3.1.2
    • New Unicode-enabled wrapper
    • std::string and std::wstring are used instead of ATL::CString
  • 28th December, 2011: EfTidy version 1.3.2
    • Reversed the change made in the previous version: ATL::CString is used again instead of std::string and std::wstring
    • Removed _bstr_t dependency from the project
    • Some performance enhancements and bug fixes
    • Updated source code to support Visual Studio 2010

Other Version

An alternate .NET-based version of EfTidyCom, written in managed VC++, is also available.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
