Introduction
Most parser-enabled Internet applications need to exclude scripts. This wrapper keeps script elements out of the parser's tests and guards against script tainting. After the file is read, its contents are split into an array for line-by-line processing. If you are trying to suppress anomalies caused by IE, clear line 2 of a saved document to keep IE from reasserting the original document object model. It WILL do that on fresh documents. Clearing the line forces IE to create a new model. Note that this is done in preparation for subsequent browser navigations, NOT for this parsing session.
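' s, Page, myscriptsfolder, ARCRoot, objDocument and InsertBaseElement are supplied by the caller.
Dim loc As Long, z As Long
Dim i As Long
Dim in_script As Boolean
Dim Elements() As String

' Split the saved page into lines for line-by-line processing.
Elements = Split(s, vbCrLf)
' Clear line 2 so IE builds a new document object model on later navigations.
Elements(1) = ""
in_script = False

For i = 2 To UBound(Elements)
    z = 1
    If Not in_script Then
        loc = InStr(z, UCase(Elements(i)), "<SCRIPT ", vbBinaryCompare)
        If loc > 0 Then
            ' The script stays open unless its closing tag is on this same line.
            in_script = (InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare) = 0)
            If Not in_script Then z = loc + 8   ' resume scanning just past "<SCRIPT "
            ' Point saved script references at the local scripts folder.
            Elements(i) = Replace(Elements(i), Page & "_files/", myscriptsfolder)
        End If

        ' UCase added so that a lowercase <head> tag is matched as well.
        loc = InStr(z, UCase(Elements(i)), "<HEAD>", vbBinaryCompare)
        If loc > 0 Then
            If objDocument.getElementsByTagName("BASE").length = 0 Then
                Elements(i) = InsertBaseElement(Elements(i), loc)
            Else
                ' NOTE: s is the full document text, so this matches only when the
                ' line equals the whole document. Substitute the string you mean to rewrite.
                Elements(i) = Replace(Elements(i), s, ARCRoot)
            End If
        End If
        DoEvents

        loc = InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare)
        If loc > 0 Then
            in_script = False
            ' Close the document on the next line and blank everything after it.
            ' The bounds check avoids a subscript error when this is the last line.
            If i < UBound(Elements) Then
                i = i + 1
                Elements(i) = "</BODY></HTML>"
                i = i + 1
                Do While i <= UBound(Elements)   ' <= so the final line is blanked too
                    Elements(i) = ""
                    i = i + 1
                Loop
            End If
        End If
    Else
        ' Inside a multi-line script: leave the line alone and watch for the closing tag.
        loc = InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare)
        If loc > 0 Then
            in_script = False
        End If
    End If
Next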
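The wrapper calls InsertBaseElement, which is not shown in this article. As a rough idea of what such a helper might do, here is a minimal sketch under a hypothetical name, InsertBaseAfterHead. It assumes the base URL is passed in as a third argument (the real helper takes only the line and the position) and that headPos is the 1-based position of the <HEAD> tag as returned by InStr.

```vb
' Hypothetical sketch, not the author's InsertBaseElement.
' Inserts a BASE element immediately after the <HEAD> tag found at headPos.
' baseUrl is an assumed extra parameter carrying the URL to use as the base.
Private Function InsertBaseAfterHead(ByVal sLine As String, _
                                     ByVal headPos As Long, _
                                     ByVal baseUrl As String) As String
    Const HEAD_TAG_LEN As Long = 6          ' Len("<HEAD>")
    Dim insertAt As Long

    insertAt = headPos + HEAD_TAG_LEN       ' first character after the <HEAD> tag
    InsertBaseAfterHead = Left$(sLine, insertAt - 1) & _
                          "<BASE href=""" & baseUrl & """>" & _
                          Mid$(sLine, insertAt)
End Function
```

Because the position comes from InStr on the upper-cased line, the original casing of the rest of the line is left untouched.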
Using the Code
Insert your own methods to replace links or image tags, insert a table or footer, and so on. Leaving this wrapper intact protects the script sections and also keeps the parser method from misbehaving. I added the z value so the parser can process HTML that follows a </SCRIPT> tag on the same line (as is possible with NYTimes pages).
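To illustrate the kind of same-line content the z offset is meant to reach, here is a small sketch of a helper that returns whatever markup follows a closing script tag. The name TextAfterScript is hypothetical and not part of the wrapper. It assumes startPos is at least 1, as InStr requires.

```vb
' Hypothetical helper, not part of the original wrapper.
' Returns the markup that follows the first </SCRIPT> found at or after startPos,
' or an empty string when the line contains no closing script tag.
Private Function TextAfterScript(ByVal sLine As String, ByVal startPos As Long) As String
    Const CLOSE_TAG As String = "</SCRIPT>"
    Dim loc As Long

    ' Search the upper-cased copy so the match ignores tag casing.
    loc = InStr(startPos, UCase$(sLine), CLOSE_TAG, vbBinaryCompare)
    If loc > 0 Then
        TextAfterScript = Mid$(sLine, loc + Len(CLOSE_TAG))
    End If
End Function
```

For a line such as `<script src="a.js"></script><p>Hi</p>`, calling the helper with a start position of 1 returns `<p>Hi</p>`, which your own link or image methods can then process.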
I apologize for not providing a working demonstration. It's difficult to put out a useful demonstration at this time without disclosing too much about the BOWSER parse method. The wrapper is used in my BOWSER demonstration.
Interesting Points
It seems that providers are using complex structures to prevent commercial-quality archiving of their content. I have no problem handling the content of the average HTML website, but the NYT, with its dynamic content insertions, plays havoc, using techniques that cause my parser to either skip content or otherwise misbehave. Presently I'm adding code to process HTML found after the </SCRIPT> tags, in what is like a cat-and-mouse game. The more sophisticated the parser becomes, the easier it will be to break.