
Wrap your HTML parser to exclude scripting

25 Dec 2003
Processing complex HTML pages will require sectional or content exclusion

Introduction

Most parser-enabled internet applications need to exclude scripts. This wrapper keeps script elements out of the parser's tests and guards against possible script tainting. After the file is read, it is split into an array for line-by-line processing. If you are trying to disable anomalies caused by IE, clear line 2 of a saved document to keep IE from reasserting the original document object model. It WILL do that on fresh documents. Clearing the line forces it to create a new model. Note that this is done in preparation for subsequent browser navigations, NOT for this parsing session.

    Dim loc As Long, z As Long

    ' Split the document text into lines for line-by-line processing
    Elements = Split(s, vbCrLf)

    ' Clear line 2 of the saved document (see the note above)
    Elements(1) = ""
    in_script = False

    For i = 2 To UBound(Elements)
        z = 1
        If in_script = False Then
            loc = InStr(z, UCase(Elements(i)), "<SCRIPT ", vbBinaryCompare)
            If loc > 0 Then
                If (InStr(z, UCase(Elements(i)), _
                    "<SCRIPT ", vbBinaryCompare) > 0 And _
                    InStr(z, UCase(Elements(i)), _
                    "</SCRIPT>", vbBinaryCompare) > 0) Then
                    ' Script opens and closes on this line
                    in_script = False
                    Elements(i) = Replace(Elements(i), _
                      Page & "_files/", myscriptsfolder)
                    ' Later searches on this line start after the opening tag
                    z = loc + 8
                Else
                    ' Script continues on the following lines
                    Elements(i) = Replace(Elements(i), _
                      Page & "_files/", myscriptsfolder)
                    in_script = True
                End If
            End If

            '/////////////////////////////////////////////////
            '  ADD MORE PARSER METHODS HERE

            'insert basetag method calls InsertBaseElement method
            loc = InStr(z, Elements(i), "<HEAD>", vbBinaryCompare)
            If loc > 0 Then
                If (objDocument.getElementsByTagName("BASE").length = 0) Then
                    Elements(i) = InsertBaseElement(Elements(i), loc)
                Else
                    Elements(i) = Replace(Elements(i), s, ARCRoot)
                End If
            End If
            '/////////////////////////////////////////////////

            DoEvents

            '/////////////////////////////////////////////////
            'This code can be modified to suit special requirements.
            'It is useful for chopping off a document with dynamic
            'footer content written by script methods.
            '(Coders may be trying to ensure some kind of difficulty getting
            ' a clean archive document from their service.) This code attempts
            ' to clean up the non-compliant HTML footer.
            loc = InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare)
            If loc > 0 Then
                in_script = False
                i = i + 1
                Elements(i) = "</BODY></HTML>"
                i = i + 1
                Do While i < UBound(Elements)
                    Elements(i) = ""
                    i = i + 1
                Loop
            End If
        Else
            'in_script = true so look for endtag
            loc = InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare)
            If loc > 0 Then
                in_script = False
            End If
        End If
    Next
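
For readers without a VB6 environment, the core idea of the wrapper (an in_script flag carried from line to line) can be sketched in Python. This is an illustration only, not the BOWSER parser: the function name `non_script_lines` is made up for the example, and it simplifies the wrapper by skipping every line that touches a script element. It leaves out the path rewriting, the base-tag step, and the footer cleanup.

```python
def non_script_lines(html):
    """Yield (index, line) for lines that contain no part of a script block.

    Hypothetical helper: the in_script flag mirrors the one in the VB wrapper.
    """
    in_script = False
    # The VB code splits on vbCrLf, so this sketch does the same.
    for index, line in enumerate(html.split("\r\n")):
        upper = line.upper()
        if not in_script:
            start = upper.find("<SCRIPT")
            if start == -1:
                # Ordinary markup: safe to hand to a parser method.
                yield index, line
            elif upper.find("</SCRIPT>", start) == -1:
                # Script opens here and continues on later lines.
                in_script = True
            # Otherwise the script opens and closes on this line; skip it.
        elif "</SCRIPT>" in upper:
            # Closing tag found: the next line is ordinary markup again.
            in_script = False
```

A caller would run its own link or image rewriting only on the lines this generator yields, which is the protection the wrapper is meant to provide.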

Using the Code

Insert your own methods to replace links or image tags, insert a table or footer, and so on. Leaving this wrapper intact protects the script sections and also keeps the parser methods from misbehaving. I added the z value so that the parser can process HTML on lines that have markup after the found </SCRIPT> tag (as is possible with NYTimes pages).
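
The case of markup trailing a closing script tag on the same line can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the wrapper's own logic: the helper name `split_after_script` is hypothetical, and it simply cuts the line after the last closing script tag it finds.

```python
import re

# Matches a closing script tag in any letter case, e.g. </script> or </SCRIPT>.
CLOSE_SCRIPT = re.compile(r"</script\s*>", re.IGNORECASE)


def split_after_script(line):
    """Return (script_part, trailing_markup) for one line of HTML.

    Hypothetical helper: when the line has no closing script tag, the whole
    line comes back as trailing markup, so the caller still has to track
    whether it is currently inside a script block.
    """
    last = None
    for last in CLOSE_SCRIPT.finditer(line):
        pass  # keep only the final match
    if last is None:
        return "", line
    return line[:last.end()], line[last.end():]
```

Only the second element of the returned pair would be passed on to the link or image methods. In the VB wrapper, z plays a comparable role as the start offset given to InStr for the later searches on the same line.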

I need to apologize for not providing a working demonstration. It's difficult to put together a useful demonstration at this time without disclosing too much about the BOWSER parse method. The wrapper is used in my BOWSER demonstration.

Interesting Points

It seems that providers are using complex structures to prevent commercial-quality archiving of their content. I have no problem handling the content of the average HTML website, but the NYT, with its dynamic content insertions, plays havoc with my parser, using techniques that cause it to either skip content or otherwise misbehave. Presently I'm adding code to process HTML found after the </SCRIPT> tags, in what is like a cat-and-mouse game. The more sophisticated the parser becomes, the easier it will be to break.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
