Reading the Internet Explorer Cache

26 Apr 2006
An article on using two different methods to return information stored in the IE cache.

[Figure: sample image showing the custom IE cache view]

Introduction

There are two basic ways to read the cache files that Internet Explorer produces. One method is to use the WinInet cache functions to do the job. The other is to use a custom-built solution that reads the cache files directly. There is no clear advantage to using one method over the other, except perhaps that one is Microsoft based and the other is not. In this article, I will present both methods of reading the cache.

Cache Structure

The cache file begins with a 28-byte header tag that identifies the cache version: Client UrlCache MMF Ver 5.2. At offset 0x48 from the beginning of the file is a two-byte value containing the number of cache folders. Immediately following are the folder names, 8 bytes each, and each name is followed by a 4-byte value whose purpose I do not know. From the end of the folder list up to about offset 0x6000 is data I have not been able to identify. After that come the entries, which are typically one of four types: Leak, Redr, URL, and Hash. I do not know what the Hash entries are for, so the code simply reads them, discards them, and continues. The Leak and URL entries appear to share the same structure.
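As a rough illustration of the header layout described above, the following sketch checks the header tag and reads the folder count from a buffer holding the start of the file. It is not the code from the download: HasCacheSignature and ReadFolderCount are hypothetical names, and I am assuming that the 28th byte of the tag is a terminating NUL and that the count is stored little-endian.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Header text quoted in the article: 27 visible characters.
// Assumption: the 28th byte is a terminating NUL, which is not checked here.
static const char kCacheSignature[] = "Client UrlCache MMF Ver 5.2";
static const std::size_t kSignatureTextLength = sizeof(kCacheSignature) - 1;

// Offset of the two-byte folder count, as given in the article.
static const std::size_t kFolderCountOffset = 0x48;

// Returns true when the buffer starts with the expected header text.
bool HasCacheSignature(const std::vector<std::uint8_t>& data)
{
    return data.size() >= kSignatureTextLength &&
           std::memcmp(data.data(), kCacheSignature, kSignatureTextLength) == 0;
}

// Reads the two-byte folder count at offset 0x48 (assumed little-endian).
// Returns false if the buffer is too short to contain it.
bool ReadFolderCount(const std::vector<std::uint8_t>& data, std::uint16_t& count)
{
    if (data.size() < kFolderCountOffset + 2)
        return false;
    count = static_cast<std::uint16_t>(
        data[kFolderCountOffset] | (data[kFolderCountOffset + 1] << 8));
    return true;
}
```

Working on an in-memory buffer rather than a HANDLE keeps the sketch independent of the Windows file API, so the layout can be checked in isolation.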

//this is the basic structure for the url entries
typedef struct UrlEntry
{
    //=URL_ID; the signature on disk is four single-byte
    //characters, so this is char rather than TCHAR
    char szRecordType[4];
    //="02 00 00 00" :
    // ActualSize = dwRecordSize * 0x80 (128 bytes)
    DWORD dwRecordSize;
    FILETIME modifieddate;

    FILETIME accessdate;
    DWORD dwUnsure1;
    DWORD dwUnsure2;

    DWORD wFileSizeLow;
    DWORD wFileSizeHigh;//???
    BYTE uBlank[8];//expire time?
#ifdef __IE40__
    DWORD dwExtra;//Extra one here if its IE4
#endif
    DWORD uSame; //= "60 00 00 00"
    DWORD dwCookieOffset;//="68 00 00 00"
    BYTE uFolderNumber;//="FE" FE=No Folder
    BYTE unknown[3];//="00 10 10"
    DWORD uFilenameOffset;

    DWORD dwCacheEntryType; //= "01 00 10 00"
    DWORD unSure;//="00 00 00 00"
    DWORD dwHeaderSize;// 00 00 00 00
    DWORD dwUnknown;//"00 00 00 00" 

    DWORD dwUnsure3;//???
    DWORD wHitCount;
    DWORD dwUseCount;//00 00 00 00
    DWORD dwData2;//??

    BYTE uMiscExtraData[8];
    //this will contain the url, filename, 
    //http response, and user with 0x00, 
    //0xF0, 0xAD, and 0x0B as separating characters.
    BYTE lpText[1];
    //lpText containing:
    //Format:
    //WebUrl
    //LocalFileName
    // The order of the following changes and is optional
    //HTTP 1.1 / OK
    //Pragma: no cache
    //Content Type
    //ETag
    //Content Length
    //~U : username
}URLENTRY, *LPURLENTRY;
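The modifieddate and accessdate fields are FILETIME values, which count 100-nanosecond intervals since 1 January 1601 (UTC). If you want to display them without going through the Windows conversion functions, a minimal sketch like this one can turn the two 32-bit halves into Unix seconds. FileTimeToUnixSeconds is a hypothetical helper name, not something from the download.

```cpp
#include <cstdint>

// Seconds between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
const std::int64_t kEpochDifferenceSeconds = 11644473600LL;

// Converts the low and high halves of a FILETIME
// (dwLowDateTime, dwHighDateTime) to whole seconds since the Unix epoch.
std::int64_t FileTimeToUnixSeconds(std::uint32_t lowPart, std::uint32_t highPart)
{
    const std::uint64_t ticks =
        (static_cast<std::uint64_t>(highPart) << 32) | lowPart;
    // One tick is 100 ns, so 10,000,000 ticks make one second.
    return static_cast<std::int64_t>(ticks / 10000000ULL) - kEpochDifferenceSeconds;
}
```

A zeroed FILETIME converts to a large negative number, so it is probably worth treating an all-zero field as "not set" before displaying it.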

The Redr has the following structure:

//a redr entry
typedef struct RedrEntry
{
    //=REDR_ID; four single-byte characters on disk,
    //so this is char rather than TCHAR
    char szRecordType[4];
    //="01 00 00 00" : ActualSize =
    //  dwRecordSize * 0x80 (128 bytes)
    DWORD dwRecordSize;
    FILETIME dwNotSur;
    BYTE lpWebUrl[1];//Url till end
}REDRENTRY, *LPREDRENTRY;

And the Hash structure is:

//a hash entry
typedef struct HashEntry
{
    //=HASH_ID; four single-byte characters on disk,
    //so this is char rather than TCHAR
    char szRecordType[4];
    //="20 00 00 00" : ActualSize =
    //  dwRecordSize * 0x80 (128 bytes)
    DWORD dwRecordSize;
    BYTE lpHashText[1];
}HASHENTRY,*LPHASHENTRY;

This continues until the end of the file. The size of each entry is expressed in blocks, where one block is 0x80 (128) bytes. Most entries are two or three blocks long. They do not appear to be ordered by type; they may be ordered by date, but I have not looked that deeply into them.

Using the code

The first way to read the cache files is using custom built functions to read the file structures. The functions to accomplish this are:

//this opens the cache file
HANDLE OpenCacheFile(TCHAR* szCacheFilePath);
//deal with the cache folders
WORD GetCacheFolderNum(HANDLE hFile);
void GetCacheFolders(HANDLE hFile, 
     WORD wFolders,LPCACHEFOLDERS& pFolders);
CString GetFolderName(LPCACHEFOLDERS pFolders, WORD wFolderNum);
//these get the cache entries
DWORD GetFirstCacheEntry(HANDLE hFile,DWORD* lpdwOffset);
//Takes a current offset, returns new offset
DWORD GetNextCacheEntry(HANDLE hFile,DWORD* lpdwOffset);
//these read the various types of entries
void ReadCacheEntry(HANDLE hFile, 
     DWORD* lpdwOffset, LPURLENTRY& lpData);
void ReadCacheLeakEntry(HANDLE hFile, 
     DWORD* lpdwOffset, LPLEAKENTRY& lpData);
void ReadCacheRedrEntry(HANDLE hFile, 
     DWORD* lpdwOffset, LPREDRENTRY& lpData);
void ReadCacheHashEntry(HANDLE hFile, 
     DWORD* lpdwOffset, LPHASHENTRY& lpData);

Using this method is a bit complicated. It involves looping through the file, calling GetNextCacheEntry and then the matching ReadCache*Entry function. The GetNext call merely moves the file pointer to the correct position for the next entry and returns the entry type, while the Read calls actually read the data, fill the structure, and leave the pointer at the end of the block. You could call these functions with arbitrary file positions, but they will quickly break if the position does not point at the start of an entry. An example of using these functions is:

//set cursor to busy
HCURSOR hCur = SetCursor(LoadCursor(NULL, IDC_WAIT));
CString str;
//get the path to the cache to open
m_path.GetWindowText(str);
//open the cache specified
HANDLE hFile = 
    OpenCacheFile(str.GetBuffer(str.GetLength()));
str = _T("This is a custom View ") 
      _T("reading bytes from the DAT files.\r\n");
m_test.SetWindowText(str);
//get the number of cache folders
WORD wNum = GetCacheFolderNum(hFile);
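// if the number of cache folders is 0 then we
// did not read the file correctly, so clean up and exit
if (wNum == 0)
{
    CloseHandle(hFile);//close cache file
    SetCursor(hCur);//return cursor to normal
    return;
}
CacheFolders* pFolders = NULL;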
//get the list of cache folders
GetCacheFolders(hFile,wNum,pFolders);
//loop the list and write out the folder names
for (int n = 0; n < wNum; n++)
{
    const CacheFolder& folder = pFolders->folders[n];
    str.Format(_T("Folder: %s\r\n"), folder.szFolderName);
    //do something with folder name
}
//we do not delete the list here 
//because we will reference these later
DWORD dwOffset = 0;
//retrieve only first 50 entries because 
//they will be very large text on the screen
int nEntries = 50;
DWORD dwType = GetFirstCacheEntry(hFile, 
               &dwOffset);//get first entry
do
{
    if (dwType == URL_ID)//if it's a url do this
    {
        URLENTRY *url;
        ReadCacheEntry(hFile,&dwOffset,url);
        //process data here
        CoTaskMemFree(url);//free url info
    }
    else if (dwType == LEAK_ID)//if its a leak do this
    {
        LEAKENTRY *url;
        ReadCacheLeakEntry(hFile,&dwOffset,url);
        //process data here
        CoTaskMemFree(url);//free leak info
    }
    else if (dwType == REDR_ID)//if its a redr do this
    {
        REDRENTRY *url;
        ReadCacheRedrEntry(hFile,&dwOffset,url);
        //process data here
        CoTaskMemFree(url);//free redr info
    }
    else if (dwType == HASH_ID)//if its a hash do this
    {
        HASHENTRY *url;
        ReadCacheHashEntry(hFile,&dwOffset,url);
        //we dont know what the hash stuff 
        //is so we just read it and do not display it.
        CoTaskMemFree(url);//free hash info
    }
    dwType = GetNextCacheEntry(hFile,&dwOffset);
}
//while (dwType != 0);
//This would read the entire file, which
//takes about 20-25 minutes for an 8.5MB file

//This is limited to 50 entries because otherwise
//the edit box and strings run out of memory
while ((--nEntries > 0) && (dwType != 0));
//free folders info
::CoTaskMemFree(pFolders->folders);
//free cache folders holder
::CoTaskMemFree((void*)pFolders);
CloseHandle(hFile);//close cache file
SetCursor(hCur);//return cursor to normal

Using WinInet to read the cache

The other method of reading the cache is to use the WinInet functions. This is a rather simple method, and it returns pretty much the same information. A few function calls are needed, but they can easily be wrapped into two functions: one to get the first cache entry, and one to get each subsequent entry. These functions do not appear to return results in the same order as the entries in the file, and they do not allow you to read data from a location other than the default Internet Explorer cache. The two functions are:

//these two functions use the WinInet
//functions for dealing with caches
//the LPINTERNET_CACHE_ENTRY_INFO must be 
//allocated space before these functions are called.
//Ideally you can allocate only one structure, 
//and then re-use it for each call...
HANDLE GetFirstInetCacheEntry(LPINTERNET_CACHE_ENTRY_INFO 
              lpCacheEntry, DWORD &dwEntrySize/*=4096*/);
//It is important to note that the dwEntrySize is 
//modified from within the function calls to represent 
//the size of the data actually returned. 
//Therefore if you are using one 
//variable as the size, you need to 
//reset the variable to the actual allocated 
//entry size before the next call to the function.
BOOL GetNextInetCacheEntry(HANDLE &hCacheDir, 
     LPINTERNET_CACHE_ENTRY_INFO lpCacheEntry, 
     DWORD &dwEntrySize/*=4096*/);

And a sample of using them is:

//this does not use the cache path, 
//it automatically uses the default IE cache
//I have not figured out how 
//to open another cache file yet.
LPINTERNET_CACHE_ENTRY_INFO lpCacheEntry;
DWORD MAX_CACHE_ENTRY_INFO_SIZE=4096;
DWORD dwEntrySize=MAX_CACHE_ENTRY_INFO_SIZE;
HANDLE hCacheDir;
int nCount=0;//set count to 0
//init cache entry holder
lpCacheEntry = 
  (LPINTERNET_CACHE_ENTRY_INFO) new char[dwEntrySize];
lpCacheEntry->dwStructSize = dwEntrySize;
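//get first cache entry using WinInet functions
hCacheDir = GetFirstInetCacheEntry(lpCacheEntry,dwEntrySize);
//stop if nothing was returned (this assumes the wrapper
//returns NULL on failure, as FindFirstUrlCacheEntry does)
if (hCacheDir == NULL)
{
    delete [] lpCacheEntry;
    return;
}

//process entry here
nCount++;//increase count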
do 
{
    //reset entry size because it was changed in the cache call
    dwEntrySize = MAX_CACHE_ENTRY_INFO_SIZE;
    //attempt to get the cache entry
    if (!GetNextInetCacheEntry(hCacheDir, 
         lpCacheEntry,dwEntrySize))
        break;
    //process entry here
    nCount++;
}
//while (TRUE);
//This would read the entire file, 
//takes about 20-25 minutes for a 8.5MB File 

//loop for first 100 strings only because 
//otherwise you get a cstring format error.
while (nCount < 100);
//delete the cache entry
delete [] lpCacheEntry;
//close the cache if it is not already closed
FindCloseUrlCache(hCacheDir);
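The sample above uses a fixed 4096-byte buffer and resets dwEntrySize before every call. One alternative, sketched here under assumptions, is to grow the buffer whenever a call reports that it is too small and then retry once. FetchResult, FetchEntry, and NextEntry are hypothetical names that are not part of the download. The _WIN32 section relies on the documented WinInet behaviour that FindNextUrlCacheEntry fails with ERROR_INSUFFICIENT_BUFFER and writes the required size back through its size parameter; I have not tested it against the wrappers in this article.

```cpp
#include <vector>
#ifdef _WIN32
#include <windows.h>
#include <wininet.h>
#endif

enum class FetchResult { Ok, BufferTooSmall, NoMoreItems, Failed };

// Calls fetch(buffer, &size). 'size' is in/out, like the size parameter of
// the WinInet enumeration calls: on BufferTooSmall the callee stores the
// required byte count in it. The buffer is then grown and the call retried.
template <typename Fetch>
FetchResult FetchEntry(Fetch fetch, std::vector<char>& buffer)
{
    unsigned long size = static_cast<unsigned long>(buffer.size());
    FetchResult result = fetch(buffer.data(), &size);
    if (result == FetchResult::BufferTooSmall)
    {
        buffer.resize(size); // size now holds the required byte count
        size = static_cast<unsigned long>(buffer.size());
        result = fetch(buffer.data(), &size);
    }
    return result;
}

#ifdef _WIN32
// Binds the helper to FindNextUrlCacheEntry (link with wininet.lib).
inline FetchResult NextEntry(HANDLE hFind, std::vector<char>& buffer)
{
    return FetchEntry(
        [hFind](char* data, unsigned long* size) -> FetchResult
        {
            if (FindNextUrlCacheEntry(
                    hFind,
                    reinterpret_cast<LPINTERNET_CACHE_ENTRY_INFO>(data),
                    size))
                return FetchResult::Ok;
            switch (GetLastError())
            {
            case ERROR_INSUFFICIENT_BUFFER: return FetchResult::BufferTooSmall;
            case ERROR_NO_MORE_ITEMS:       return FetchResult::NoMoreItems;
            default:                        return FetchResult::Failed;
            }
        },
        buffer);
}
#endif
```

Because the size is reloaded from the vector on every call, there is no separate variable to reset, and entries larger than the initial buffer are no longer an error. The retry logic is kept separate from the WinInet binding so it can be exercised with a stand-in callback on any platform.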

History

  • v1.0 - Tested and worked fine with IE4 using Win98 and VS 6.0. After I moved to XP and .NET 2003, it no longer worked, so I updated the code to work with IE6, XP, and .NET 2003. I do not have any other systems, so I cannot test on other platforms.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
