Introduction
The MSN Desktop Search application is an evolutionary version of the standard Indexing Service that has been shipping as part of Windows since Windows 2000. Both make use of the same IFilter components to allow third parties to add indexing support for their file formats. One of the standard IFilter components parses HTML documents. The beta version of MSN Desktop Search included an option to index your web browser history, but this feature was removed from the final version due to privacy concerns.
I often find myself searching for a web page that I remember reading a couple of months ago. If I then use Google or MSN to search for the page, it can take me a long time to find it again, since I'm often overwhelmed with other pages containing my search terms; rather than being the page I want, they're sales pages.
The ideal thing would be to perform an indexed search based on terms you remember reading in the page, but limit the search to only pages you've actually visited on the web. In other words, if you had a complete copy of every web page you've visited, as opposed to just a small cached subset, then you could run a local desktop search against this complete browser history.
Taking this to the next level, in terms of keeping an electronic copy of everything you've visited or received, including paper copies, is the MyLifeBits project. Channel9 also has an interview with a couple of the people involved in the MyLifeBits project.
Implementation Details
There were a couple of options I could have taken in order to store a copy of every web page visited. One approach is to implement a proxy server that sees every response coming back to the browser; the other is to enumerate the browser's cache periodically. The advantage of the proxy server approach is that it is browser independent, whereas enumerating the browser cache is browser specific.
I went with the simpler approach of periodically enumerating the browser's cache using the WinInet API. This does mean that it will only work for browsers that use the WinInet API for downloading URLs, the main browser using this API being Internet Explorer.
In order to enumerate the browser cache, the FindFirstUrlCacheEntry and FindNextUrlCacheEntry functions are used. Since there is no date-based query filter built into the API, the first thing we do when we retrieve an INTERNET_CACHE_ENTRY_INFO structure is to compare the date/time that the URL was downloaded with the date/time that our program was last run. This avoids processing and copying the same cache entry every time our cache copying program is run.
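As a rough sketch, the enumeration loop could look something like the following. EnumerateNewCacheEntries is my own name for it, ftLastRun is assumed to hold the FILETIME recorded at the previous run, and I'm using the entry's LastSyncTime as the closest available match for "the time the URL was downloaded".

#include <windows.h>
#include <wininet.h>
#include <vector>

#pragma comment(lib, "wininet.lib")

void EnumerateNewCacheEntries(const FILETIME &ftLastRun)
{
    // An initial call with a NULL buffer fails with ERROR_INSUFFICIENT_BUFFER
    // and reports how large a buffer the first entry needs.
    DWORD cbEntry = 0;
    HANDLE hFind = FindFirstUrlCacheEntry(NULL, NULL, &cbEntry);
    if (hFind != NULL || GetLastError() != ERROR_INSUFFICIENT_BUFFER)
        return; // empty cache or unexpected failure

    std::vector<BYTE> buffer(cbEntry);
    INTERNET_CACHE_ENTRY_INFO *pEntry =
        reinterpret_cast<INTERNET_CACHE_ENTRY_INFO *>(&buffer[0]);

    hFind = FindFirstUrlCacheEntry(NULL, pEntry, &cbEntry);
    if (hFind == NULL)
        return;

    for (;;)
    {
        // Only entries downloaded since the program last ran are of interest.
        if (CompareFileTime(&pEntry->LastSyncTime, &ftLastRun) > 0)
        {
            // This is where the MIME-type check and the file copy
            // described below would take place.
        }

        // Advance to the next entry, growing the buffer on demand.
        cbEntry = static_cast<DWORD>(buffer.size());
        if (!FindNextUrlCacheEntry(hFind, pEntry, &cbEntry))
        {
            if (GetLastError() != ERROR_INSUFFICIENT_BUFFER)
                break; // ERROR_NO_MORE_ITEMS ends the enumeration

            buffer.resize(cbEntry);
            pEntry = reinterpret_cast<INTERNET_CACHE_ENTRY_INFO *>(&buffer[0]);
            if (!FindNextUrlCacheEntry(hFind, pEntry, &cbEntry))
                break;
        }
    }
    FindCloseUrlCache(hFind);
}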
Next we check the MIME type to determine whether we are interested in making a copy of this particular cache entry. The current set of MIME types that I check for is the following:
text/html
application/pdf
application/msword
So I basically only make a copy of the text associated with a web page and don't copy the associated images. I also make copies of any PDF and Word documents that I may have read in the browser.
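A check along these lines could look like the following sketch; IsInterestingMimeType is my own helper name, and it assumes the Content-Type value has already been parsed out of the cache entry's lpHeaderInfo HTTP headers.

#include <string.h>

// Returns true if the entry's MIME type is one we want to copy.
bool IsInterestingMimeType(const char *pszContentType)
{
    static const char *s_rgTypes[] = {
        "text/html",
        "application/pdf",
        "application/msword"
    };
    for (size_t i = 0; i < sizeof(s_rgTypes) / sizeof(s_rgTypes[0]); ++i)
    {
        // Prefix match so "text/html; charset=utf-8" still qualifies.
        if (_strnicmp(pszContentType, s_rgTypes[i],
                      strlen(s_rgTypes[i])) == 0)
            return true;
    }
    return false;
}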
Next I determine where to store the cache entry and create a file name for the copy we're going to make. An example of a cache entry's name and location is shown below:
My Documents\WebCache\2005\4\10\1c582c0-cab8d650-18be.html
I store my WebCache history under the "My Documents" folder so that the contents will automatically be indexed by MSN Desktop Search. The sub-directory tree is based on the date the URL was visited, and the filename is derived from the FILETIME at which the URL was visited. This scheme also handles the case of a user visiting the same URL, e.g. www.cnn.com, more than once on the same day.
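A sketch of how such a path could be derived from the visit time follows; BuildCachePath and the exact hex layout of the file name are my own illustration of the scheme, not the program's actual code.

#include <windows.h>
#include <shlobj.h>
#include <stdio.h>

#pragma comment(lib, "shell32.lib")

void BuildCachePath(const FILETIME &ftVisited, LPCWSTR pszExtension,
                    LPWSTR pszPath, size_t cchPath)
{
    // Resolve "My Documents" so the copies land somewhere MSN Desktop
    // Search indexes by default.
    WCHAR szDocs[MAX_PATH];
    SHGetFolderPathW(NULL, CSIDL_PERSONAL, NULL, SHGFP_TYPE_CURRENT, szDocs);

    // The sub-directories come from the visit date...
    SYSTEMTIME st;
    FileTimeToSystemTime(&ftVisited, &st);

    // ...and the file name from the raw FILETIME, so visiting the same
    // URL twice on one day still yields two distinct file names.
    _snwprintf(pszPath, cchPath,
               L"%ls\\WebCache\\%u\\%u\\%u\\%08lx-%08lx%ls",
               szDocs, st.wYear, st.wMonth, st.wDay,
               ftVisited.dwHighDateTime, ftVisited.dwLowDateTime,
               pszExtension);
    pszPath[cchPath - 1] = L'\0'; // _snwprintf may not null-terminate
}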
If the MIME type is HTML, then a header similar to the headers displayed by Google and MSN is added to the top of the HTML file so that you can easily load the current version of the URL when viewing the cached version.
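Something along the following lines could produce such a header; the markup and wording are my own guess at what "similar to the headers displayed by Google and MSN" looks like, not the actual banner.

#include <windows.h>
#include <stdio.h>

// Formats a banner to prepend to the cached HTML file.
void FormatCacheBanner(LPCWSTR pszURL, LPWSTR pszBanner, size_t cchBanner)
{
    _snwprintf(pszBanner, cchBanner,
        L"<div style=\"background:#ffffcc;border-bottom:1px solid gray;"
        L"padding:4px\">This is a locally cached copy of "
        L"<a href=\"%ls\">%ls</a>; follow the link to load the current "
        L"version of the page.</div>\n",
        pszURL, pszURL);
    pszBanner[cchBanner - 1] = L'\0'; // _snwprintf may not null-terminate
}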
The last modification made to the file being copied is to create a property set on the file and set the Keywords property to the URL of the entry. The reason for recording the original URL as a keyword is so that you can filter your query based on the URL. For example, if you want to find a web page that you remember reading at CNN containing the term "space shuttle launch", you can issue the following query in MSN Desktop Search:
path:webcache keywords:cnn "space shuttle launch"
This will limit the query to items that have 'webcache' in their path name, i.e. only files in our WebCache directory and not in any other document locations or in email messages etc. In addition, the query will be limited to files that contain 'cnn' in the Keywords property.
The AddURLKeywordProperty function below creates the property set for the file by making use of the NTFS implementation of the IPropertySetStorage interface.
#include <windows.h>
#include <ole2.h>

// Stamps the original URL onto the copied file as its Keywords property,
// using the NTFS property set support exposed through IPropertySetStorage.
// Link with ole32.lib.
void AddURLKeywordProperty(LPCWSTR pszFileName, LPWSTR pszURL)
{
    IPropertySetStorage *pPropSetStg = NULL;
    IPropertyStorage *pPropStg = NULL;

    // STGFMT_FILE opens the plain NTFS file (not a compound document)
    // and returns the system-provided IPropertySetStorage implementation.
    HRESULT hr = StgOpenStorageEx(
        pszFileName,
        STGM_SHARE_EXCLUSIVE | STGM_READWRITE,
        STGFMT_FILE, 0, NULL, 0,
        IID_IPropertySetStorage,
        reinterpret_cast<void **>(&pPropSetStg));
    if (SUCCEEDED(hr))
    {
        // Create (or overwrite) the standard Summary Information
        // property set, which is where the Keywords property lives.
        hr = pPropSetStg->Create(
            FMTID_SummaryInformation, NULL,
            PROPSETFLAG_DEFAULT,
            STGM_CREATE | STGM_READWRITE | STGM_SHARE_EXCLUSIVE,
            &pPropStg);
        if (SUCCEEDED(hr))
        {
            PROPSPEC propspec;
            PROPVARIANT propvarWrite;

            // Address the property by its well-known ID and write the
            // URL as its value.
            propspec.ulKind = PRSPEC_PROPID;
            propspec.propid = PIDSI_KEYWORDS;
            propvarWrite.vt = VT_LPWSTR;
            propvarWrite.pwszVal = pszURL;
            hr = pPropStg->WriteMultiple(1, &propspec, &propvarWrite,
                                         PID_FIRST_USABLE);
            pPropStg->Release();
        }
        pPropSetStg->Release();
    }
}
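For completeness, a hypothetical call site, where szCachePath is assumed to hold the full path produced by the naming scheme above and the URL is taken from the cache entry being copied:

// Both variables are illustrative; in the real program they would come
// from the INTERNET_CACHE_ENTRY_INFO structure being processed.
WCHAR szURL[] = L"http://www.cnn.com/";
AddURLKeywordProperty(szCachePath, szURL);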
Conclusion
I've been using this application for just over three months now, and my WebCache has a total of 8,126 files with a total size of 368 MB, which NTFS file compression reduces to 218 MB. So at this rate my WebCache will consume just under 1 GB per year of browsing. In conjunction with MSN Desktop Search, it has made it a lot easier to find web pages that I've visited in the past when I need them again at a later stage.