Download demo project - 56 Kb
TRUE OFFLINE NYT browsing mode is attained with the script filter option.
- New version 1.04: Leaves parsed documents with images relative to the site ROOT. This is in preparation for a release version that will let users MAP exports to removable media.
- 12-31 Truncated the trailing garbage from the end of NYTimes pages. (Kill before save solved it.)
- Filters CGI queries, leaving a pretty clean version of the NY Times page that doesn't prompt for a connection. It also loads quickly.
- Caught an error occurring with external applets and none in the root domain. This will be helpful for other sites as well.
- Added support for embedded HTML segments.
- Added support for maximized view.
- Considering adding parser support for the 'embedded html' segments. IE makes little subfolders with all the files. Designers put the name of the html segments into an image source tag. I just need to rework what is implemented, retaining the folder label in the HTML and copying the folder to the images folder.
Introduction
"A New York minute" is an expression about lifestyle. In a fast-paced world we need every time-saving appliance we can afford. By automating its content-saving function, the "NY Times Browser" reduces stressful moments spent archiving online content. It also transforms the use of information services by offering sophisticated options for archival and retrieval. A large realm of possible use is opened to less-sophisticated users who would otherwise have more limited access.
In the demonstration application, click a link in the "NY Times" pages index, wait for it to load, and click the save button. The demo application has used the request to preconfigure the folder for you. Go ahead and navigate the site, saving page content as needed. Links in the front page are now transformed to access archived content in the main page's subfolders. (The archiving feature should work for 99% of websites. CNN is a known exception.)
Background
"I found myself with a laptop in front of me one morning. I was somewhere in NYC getting coffee and a donut in one of those fast, chic Internet cafes. With only thirty minutes to get to the airport, I wanted to download the Times for my morning flight back to Cali. (I had met a very nice girl this time, so I wanted to check for work. It's hard to believe I couldn't just find a job talking to people in NYC, but without a degree and at my age, you have to do anything they'll still LET you do.) I made my flight with 5 whole minutes to spare. During much of the flight my offline access to the entire edition of the "NY Times" proved to be intriguing to one of the flight attendants, as well as to the suit across the aisle. I even handed the computer to him once, letting him check it out.
Hey, don't get the idea that I just wrote the BOWSER so I could get the paper. I didn't. I actually wrote it a couple of weeks before making the trip. Take it easy."
I was once paid good California money to rip [1] a WWW e-commerce database using Perl. That job automated the acquisition of product information, as well as the merging of online databases. "NY Times Browser" does have similar capabilities. It is intended, however, to enable offline WWW applications on portable machines (like laptops).
"NY Times Browser" is my initial attempt at automation with the VB6 Browser control object. After tinkering with the available events and returns, I managed to configure a reliable condition in Visual Basic for indicating when even a very complex page is loaded. Also, I discovered that an IE registry setting can be manipulated to set the user SAVE path. This allows the "NY Times Browser" to be oriented specifically for archiving WWW content, making it easier to acquire and manage.
Use of the Method
The document_complete detection algorithm is featured in a scaled-down test project in the download. A document initially qualified for testing to determine if it is loaded when the ratio of the two returned progress values was unity. Detection also uses a global enable setting, which makes the code more efficient by enabling tests only after most of the document is loaded. Determining that it is loaded is a composite indication, acquired when all the indicators are True. I also found it essential to check the returned pDisp value for a match with the topmost browser control. (Documentation suggesting to expect a null indicator when loaded appears misleading.) After more refinement, I decided that the progress indicator is relative, so I backed the threshold down from unity to .5 and use it as a loose qualifier for enabling tests. Works much better! Testing for unity was too stringent.
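PSEUDOCODE:
Dim progress1 As Long, progress2 As Long 'the two returned progress values (current, maximum)
Dim enable As Boolean
Dim loaded As Boolean
Do While loaded = False
    DoEvents 'let the browser events refresh progress1, progress2 and pDisp
    enable = False
    If progress2 > 0 Then enable = (progress1 / progress2 > .5) 'Works much better.
    'pDisp is the value returned with the document_complete event
    If enable = True And pDisp Is WebBrowser1.Object Then
        RaiseEvent Document_isloaded
        loaded = True
    End If
Loop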
In this application, after a document has been loaded, it is saved by the Microsoft IE default save function. The ExecWB saveas command uses the OLECMDEXECOPT_DODEFAULT tag to save the entire content. Processing for archival can then commence: the index file is loaded and its embedded links are altered so that they access other archived page documents. Maintaining the folder structure is automated according to the virtual path names used by the "NY Times." Archive folder names are derived from document paths used in page requests.
Insight
I still have not finished coding the BOWSER application. It required some investigating of the limitations Microsoft has placed on WebBrowser functionality. Though it is not possible to save a page without a dialog, I still think the application will be useful, affording users an archiving session and a chance to skim through information while capturing it. There is still the matter of links, and when/how to convert them to query archived pages. Users may want to simultaneously maintain an online index file and an offline version (for browsing the archive).
In my investigation, I discovered that it's possible to alter a loaded HTML document and save it as modified. My initial test altered the title in the document before saving. While WWW links reference pages by filename, IE defaults to saving documents under the page title. As a result, the offline multipage document is broken unless you can a.) stay online or b.) override the IE mechanisms. THAT will be the subject of my next Web BOWSER article, "Taking a Bite. < ^v^v^v^ >"
Points of Interest
Microsoft's document object model assumes developers will have the same forgiving features in tests. When IE saves a document, it renames or converts some images. I've seen it with IMG/GIF and JPG/JPEG naming. The problem is that the document object model retains the same image list as that originally downloaded. I would have expected it to change when I reload the file, but there is a list associated with it that does not get updated with the strict names. Microsoft uses loose naming and expects developers to do it as well. It would be nice if it could be explained, but it's the kind of thing you need to learn firsthand.
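The renaming can be pictured as a small lookup applied to each image name in the DOM list. The following Python sketch is only an illustration (the application is VB6); it covers just the two cases reported in this article, IMG to GIF and JPG to JPEG, and the name saved_image_name and the sample paths are hypothetical.

```python
import posixpath

# The two extension changes this article reports after an IE save.
SAVED_EXTENSION = {".img": ".gif", ".jpg": ".jpeg"}

def saved_image_name(src: str) -> str:
    """Guess the file name an image gets in the save folder."""
    # Drop any query string, then keep only the file name part.
    name = posixpath.basename(src.split("?", 1)[0])
    stem, ext = posixpath.splitext(name)
    # Unknown extensions are left unchanged.
    return stem + SAVED_EXTENSION.get(ext.lower(), ext)
```

For example, a DOM entry ending in "logo.jpg" would be looked up on disk as "logo.jpeg", while a ".png" name would pass through untouched.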
It's apparently a bit of bad news/good news that Microsoft's IE save function doesn't support saving EVERY image in a complex document ("appears" is the keyword). Looking for missing files requires a program such as BOWSER to note discrepancies. I get errors when files are missing from the save folder, so I can do analysis. A solution could be to get them from the cache using the DOM, obtaining the URL after it is determined they do not exist in the IE_files folder. Otherwise it will be necessary to download them.
I discovered that it's possible to use a timer to suppress the document_complete event process when booting the application. After the timer has executed the delay, the is_loaded status changes to allow the event process to execute.
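Noting the discrepancies amounts to comparing the names the document expects against what is actually in the save folder. A minimal Python sketch of that comparison follows (again, not the VB6 code from the demo); the function name missing_from_save_folder and the case-insensitive match are assumptions.

```python
import os

def missing_from_save_folder(expected_names, save_folder):
    """Return the expected file names that are absent from save_folder."""
    # Assumption: names are compared without regard to case, as on Windows.
    present = {name.lower() for name in os.listdir(save_folder)}
    return sorted(n for n in set(expected_names) if n.lower() not in present)
```

Each name returned is a candidate for retrieval from the cache or for a fresh download.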
Microsoft has inserted a filter to repair HREF content, switching the forward slashes to backslashes if the reference uses the other tag initially. I substituted only the root portion and found all slashes were switched next time I accessed the file. (I still need to verify this behavior.)
It looks like it may be possible to wrap a silent browser, using it to acquire pages without user prompting. The ExecWB saveas command with the default OLECMDEXECOPT_DODEFAULT save mode may suppress prompting in silent mode. This would be ideal for the next extension of this application after navigating to an index.
Development is moving forward. In addition to testing the 'silent' mode, I'm looking at using a TCP socket control to query a list of files. This would make it possible to eliminate the multiple image subfolders that IE creates with each save.
History
(Sorry, I won't post the entire source for the demo hack application.)
Thanks for checking out my VB code.
Be good. -Q o0O0o
-PS. "What's gotten into Lou Dobbs lately?"
References
1. RIP - "extract from a source"
UPDATES
- There was a small bug in the initial demo hack application upload. Some directories would give an exception for an array variable being out of range. Fixed. Also added a line to suppress popups. 12-9 4am CST
- 12-9 Added a line to initialize the default root save path as "c:\__Internet"
- 12-9 Also automated link conversion so links work when navigating the saved archive. (Now everyone needs to have this.) Whoa - It's "stillabeta."
- Added source for demo project. 30Kb
- 12-12 Fixed link edit and made the navigation window larger. Added SAVE button. Link edit uses DOM to acquire links and replaces them in the offline HTML.
- 12-15 Made a vast improvement in version 1.0.1. It works quite well. Next version will likely see automated links processing with a spider to get the content from index pages unattended.
- 12-16 More fixes, support for multiple filenames in a folder, delay added at boot to disable save, cleared external links to eliminate most connection prompts during offline browsing.
- 12-17-2003 (Caught a few problems with IE save dialog. Abort does not overwrite an existing offline file unless the online version is fresher than 20 seconds. Abort save doesn't crash anymore. Also fixed the URL parser so that it adds the needed "/" if missing. Now you can use: http://nytimes.com/ and it will work.)
- MAJOR revision - Naming uses the entire URL. Also set up links analysis and query building methods with a tree depth setting to open a restrictive window to be used in qualifying new queries.
- 12-23 Addressed the image renaming issue without reloading the image after IE SAVE. Thus, the DOM is inaccurate, necessitating patches to allow for what Microsoft does to the document. I found that they rename IMG files to GIF, and JPG to JPEG. The long-term solution is probably to reload the SAVED version, re-asserting its DOM. It's impractical to read every file from the cache. At this time I'm aware of the image naming issues, which can also be addressed with a quick rename of all DOM images according to the Microsoft convention. Then the patches can be removed.
- Several updates have occurred. BOWSER is pretty stable, now copying Java applets properly and handling IE's saved FLASH conversions when possible. Parsing is now done in a single pass.
- Added support for RESIZE.
- Made images relative to root.
- Added optional scripts filter. (Option to remove ALL scripts.) Parsed NYT is left without prompts to connect.
Known problems: If site already uses a BASE REF: then BOWSER does not work. I need to program the support for it.