Abstract
I've been involved in some projects that required gathering files across directories, and this class does just that: it gathers file information recursively by directory. As a bonus track, it also calculates a 32-bit file checksum (note this is not the NT executable checksum calculated with MapFileAndCheckSum) and a 32-bit file CRC (with borrowed code: I didn't feel like re-inventing the wheel, and the other option was to review my Codification Theory notes, and I'm a bit allergic to dust).
The second part of this article presents FCompare, a sample application showing CFileInfo and CFileInfoArray usage. This application does the following:
- Recursive search of source and target files to compare, given a directory and a filemask.
- Binary comparison of source with target files by their size, partial/total content, partial/total checksum or partial/total CRC.
- Feeds a listview with matched filenames and paths.
Building environment
VC++ 6.0, with warning level 4.
Tested on Windows NT 4.0 and Windows 95.
Although not tested, I guess CFileInfo, CFileInfoArray and FCompare can be safely recompiled for Unicode.
CFileInfo and CFileInfoArray
class CFileInfo {
public:
CFileInfo();
CFileInfo(const CFileInfo& finf);
~CFileInfo();
void Create(const WIN32_FIND_DATA* pwfd, const CString strPath,
LPARAM lParam=NULL);
void Create(const CString strFilePath, LPARAM lParam = NULL);
DWORD GetChecksum(const ULONGLONG uhUpto=0, const BOOL bRecalc = FALSE,
const volatile BOOL* pbAbort=NULL, volatile ULONG* pulCount = NULL);
DWORD GetCRC(const ULONGLONG dhUpto=0, const BOOL bRecalc = FALSE,
const volatile BOOL* pbAbort=NULL, volatile ULONG* pulCount = NULL);
DWORD GetLength(void) const { return (DWORD) m_uhFileSize; };
ULONGLONG GetLength64(void) const { return m_uhFileSize; };
CString GetFileDrive(void) const;
CString GetFileDir(void) const;
CString GetFileTitle(void) const;
CString GetFileExt(void) const;
CString GetFileRoot(void) const { return GetFileDrive() + GetFileDir(); };
CString GetFileName(void) const { return GetFileTitle() + GetFileExt(); };
const CString& GetFilePath(void) const { return m_strFilePath; }
const CTime& GetCreationTime(void) const { return m_timCreation; };
const CTime& GetLastAccessTime(void) const { return m_timLastAccess; };
const CTime& GetLastWriteTime(void) const { return m_timLastWrite; };
DWORD GetAttributes(void) const { return m_dwAttributes; };
BOOL IsDirectory(void) const {
return m_dwAttributes & FILE_ATTRIBUTE_DIRECTORY; };
BOOL IsArchived(void) const {
return m_dwAttributes & FILE_ATTRIBUTE_ARCHIVE; };
BOOL IsReadOnly(void) const {
return m_dwAttributes & FILE_ATTRIBUTE_READONLY; };
BOOL IsCompressed(void) const {
return m_dwAttributes & FILE_ATTRIBUTE_COMPRESSED; };
BOOL IsSystem(void) const {
return m_dwAttributes & FILE_ATTRIBUTE_SYSTEM; };
BOOL IsHidden(void) const {
return m_dwAttributes & FILE_ATTRIBUTE_HIDDEN; };
BOOL IsTemporary(void) const {
return m_dwAttributes & FILE_ATTRIBUTE_TEMPORARY; };
BOOL IsNormal(void) const { return m_dwAttributes == 0; };
LPARAM m_lParam;
private:
CString m_strFilePath;
DWORD m_dwAttributes;
ULONGLONG m_uhFileSize;
CTime m_timCreation;
CTime m_timLastAccess;
CTime m_timLastWrite;
DWORD m_dwChecksum;
DWORD m_dwCRC;
DWORD m_uhCRCBytes;
DWORD m_uhChecksumBytes;
};
class CFileInfoArray : public CArray<CFileInfo, CFileInfo&> {
public:
CFileInfoArray();
enum {
AP_NOSORT=0,
AP_SORTASCENDING=0,
AP_SORTDESCENDING=1,
AP_SORTBYSIZE=2,
AP_SORTBYNAME=4
};
int AddDir(const CString strDirName, const CString strMask,
const BOOL bRecurse,
LPARAM lAddParam=AP_NOSORT, const BOOL bIncludeDirs=FALSE,
const volatile BOOL* pbAbort = NULL, volatile ULONG* pulCount = NULL);
int AddFile(const CString strFilePath, LPARAM lAddParam);
protected:
virtual int AddFileInfo(CFileInfo& finf, LPARAM lAddParam);
};
How to use it
I recommend reading the class header above thoroughly to get an overall view of the classes and their methods. For further reference, you can inspect FCompare's source code (see the second half of the article).
Anyway, here is some sample code:
This code adds all files in the root directory and its subdirectories (but not the directories themselves) to the array and TRACEs them:
CFileInfoArray fia;
fia.AddDir(
"C:\\", // Directory
"*.*", // Filemask (all files)
TRUE, // Recurse subdirs
CFileInfoArray::AP_SORTBYNAME | CFileInfoArray::AP_SORTASCENDING, // Sort by name and ascending
FALSE // Do not add array entries for
// directories (only for files)
);
TRACE("Dumping directory contents\n");
for (int i=0;i<fia.GetSize();i++) TRACE(fia[i].GetFilePath()+"\n");
You can also call AddDir multiple times. This example lists the files in the root directories (but not the subdirectories) of C:\ and D:\:
CFileInfoArray fia;
fia.AddDir("C:\\", "*.*", FALSE, CFileInfoArray::AP_SORTBYNAME |
CFileInfoArray::AP_SORTASCENDING, FALSE );
fia.AddDir("D:\\", "*.*", FALSE, CFileInfoArray::AP_SORTBYNAME |
CFileInfoArray::AP_SORTASCENDING, FALSE );
TRACE("Dumping directory contents for C:\\ and D:\\ \n");
for (int i=0;i<fia.GetSize();i++) TRACE(fia[i].GetFilePath()+"\n");
Or you can add individual files:
CFileInfoArray fia;
fia.AddDir("C:\\WINDOWS\\", "*.*", FALSE,
CFileInfoArray::AP_SORTBYNAME | CFileInfoArray::AP_SORTDESCENDING,
FALSE );
fia.AddFile("C:\\AUTOEXEC.BAT", CFileInfoArray::AP_SORTBYNAME
| CFileInfoArray::AP_SORTDESCENDING);
TRACE(
"Dumping directory contents for C:\\WINDOWS\\ and file C:\\AUTOEXEC.BAT\n");
for (int i=0;i<fia.GetSize();i++) TRACE(fia[i].GetFilePath()+"\n");
And mix directories with individual files:
CFileInfoArray fia;
fia.AddDir("C:\\WINDOWS\\", "*.EXE;*.COM", FALSE,
CFileInfoArray::AP_SORTBYNAME | CFileInfoArray::AP_SORTDESCENDING, FALSE );
fia.AddFile("C:\\AUTOEXEC.BAT", CFileInfoArray::AP_SORTBYNAME |
CFileInfoArray::AP_SORTDESCENDING);
// Note no trailing backslash for next AddFile (we want to
// insert an entry for the directory
// itself, not for the files inside the directory)
fia.AddFile("C:\\PROGRAM FILES", CFileInfoArray::AP_SORTBYNAME
| CFileInfoArray::AP_SORTDESCENDING);
TRACE(
"Dumping directory contents for C:\\WINDOWS\\, file C:\\AUTOEXEC.BAT and "
" directory \"C:\\PROGRAM FILES\" \n");
for (int i=0;i<fia.GetSize();i++) TRACE(fia[i].GetFilePath()+"\n");
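The last example passes a semicolon-separated mask list ("*.EXE;*.COM") to AddDir. As a rough illustration of how such a list might be broken into individual masks, here is a minimal sketch in standard C++. SplitMasks is a hypothetical helper written for this example, not part of CFileInfoArray, and the real AddDir may well parse its mask differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not the article's code):
// splits "*.EXE;*.COM" into {"*.EXE", "*.COM"}, skipping empty entries.
std::vector<std::string> SplitMasks(const std::string& masks)
{
    std::vector<std::string> result;
    std::string::size_type start = 0;
    while (start <= masks.size()) {
        // Find the next separator, or treat the end of the string as one.
        std::string::size_type end = masks.find(';', start);
        if (end == std::string::npos) end = masks.size();
        // Keep only non-empty masks (ignores ";;" and a trailing ';').
        if (end > start) result.push_back(masks.substr(start, end - start));
        start = end + 1;
    }
    return result;
}
```

Each resulting mask could then drive one file search per directory, with the results merged into the same array.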
Implementation details and rationale
- I could have made CFileInfo a descendant of CFileFind, but I don't like its FindFile, FindNextFile and Close methods at all (I don't need them), and CFileFind stores information as pointers, which I also didn't like (see the "to pointer or not to pointer" discussion below about whether to store pointers to elements or the elements themselves as a CArray's contents).
- I wanted it to be sort of Win64 compliant, so I used the Win32 API file access functions (when calculating checksum and CRC), which can address files with 64-bit sizes. I studied the possibility of going memory-mapped, but I don't think it would be worth the effort (volunteers welcome).
- Windows doesn't seem to buffer its API file access functions (at least not the way fread does), so I wrote a few more lines in the file reading loops to get a bit of buffered access.
- To store file sizes, instead of the API's nFileSizeHigh and nFileSizeLow scheme, I used the type ULONGLONG, an MS-proprietary 64-bit unsigned integer. BTW, Visual C++ 6.0 doesn't support the unsigned long long type (although it defines ULONGLONG for this purpose).
- I wanted the code to be abortable, thread safe and progress-reportable, since checksum/CRC calculation and directory retrieval can be quite time-consuming. After aborting any of the abortable functions, the stored values are correct, although they may be incomplete:
- If AddDir is aborted, some CFileInfos will be missing, but all the CFileInfos contained in the array are OK.
- If GetCRC or GetChecksum is aborted, the CRC or checksum will not be entirely calculated, and the function returns the correct value calculated up to the moment of the abort.
In any case, you don't have to do anything special to call either function again and obtain correct results.
- You can see quite a bunch of volatile qualifiers in AddDir's definition. That's because those parameters are meant to be set in multithreaded applications, where they are read by the AddDir loop and set by another thread, so they must not be cached in a register.
- I don't think it's strictly necessary to use any kind of safe access to variables shared between threads (InterlockedIncrement and the like): just don't rely too much on a temporarily weird pulCount value. But just for the sake of correctness, I use InterlockedExchange and InterlockedIncrement to update pulCount in CFileInfoArray::AddDir, CFileInfo::GetCRC and CFileInfo::GetChecksum.
- Due to the volatile qualifier, the main application doesn't need to modify the shared variables with thread-safe functions (InterlockedIncrement...): the only variable of this kind the application needs to modify is pbAbort and, due to its boolean nature, it is not prone to errors caused by a non-atomic modification.
- When using MFC's array template classes, I always think twice about whether to store pointers to elements or the elements themselves in the array. This time I've decided to store elements and not pointers because of the overhead that memory allocation produces: recursively gathering files (at least on my chock-full HD) often involves allocating several thousand CFileInfo structures.
Storing elements in the array reduces memory fragmentation and, with an appropriate CArray growth increment, it also reduces the number of calls to memory allocation routines (at least it is far from the one-allocation-call-per-element ratio that would otherwise be necessary).
It has some inconveniences, though, for example when moving elements from place to place for sorting, or when inserting elements in the middle of the array: it's almost always quicker to move a pointer than a structure. Another caveat that appears when dealing with elements instead of pointers is that when you reference elements externally by pointer (for example via the lParam of a listview item, as happens in the FCompare app) and you add new elements to the array, those references aren't up to date anymore and you have to update them somehow (in FCompare I do it by rebuilding the listview).
- Another benefit of storing elements in a CArray rather than pointers is that the elements are automagically deallocated when the array is deallocated. When storing pointers, a template function DeleteItems ought to be written to deallocate individual elements as they are removed from the array.
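To make the abort and progress-report ideas above more concrete, here is a minimal sketch in portable C++. It is not the code of GetChecksum: the function name ByteSumChecksum is hypothetical, the plain byte sum is an assumed stand-in for the real checksum, a std::istream replaces the Win32 file handle, and std::atomic is swapped in for the volatile/Interlocked combination described above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical sketch: sums the bytes of `in` into a 32-bit value.
// upTo == 0 means "use the whole stream". `abort` and `count` may be null.
std::uint32_t ByteSumChecksum(std::istream& in,
                              std::uint64_t upTo = 0,
                              const std::atomic<bool>* abort = nullptr,
                              std::atomic<std::uint64_t>* count = nullptr)
{
    std::vector<char> buffer(4096);   // read in blocks, not byte by byte
    std::uint32_t sum = 0;
    std::uint64_t done = 0;
    // The abort flag is polled once per block, so another thread can stop us.
    while (in && (abort == nullptr || !abort->load())) {
        std::uint64_t want = buffer.size();
        if (upTo != 0) {
            if (done >= upTo) break;                    // byte limit reached
            if (upTo - done < want) want = upTo - done; // do not read past it
        }
        in.read(buffer.data(), static_cast<std::streamsize>(want));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;                            // end of stream
        for (std::streamsize i = 0; i < got; ++i)
            sum += static_cast<unsigned char>(buffer[i]);
        done += static_cast<std::uint64_t>(got);
        if (count != nullptr) count->store(done);       // progress for the UI
    }
    return sum;  // if aborted, this is the sum of the bytes read so far
}
```

A UI thread would own the two atomics, set the abort flag when the user cancels, and read the counter from a timer to update a progress display.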
Sample application: FCompare
FCompare or Binary File Compare is an application to binary compare a group of files, selectable recursively from a given directory and filemask.
Binary comparison can be done by comparing files' size, CRC, checksum or contents. When comparing by CRC, checksum and contents you can limit the number of bytes the comparison will take into account.
Technical Features
- MFC Dialogbox-style multithreaded application.
- Threads can be aborted at any time.
- Progress report of threads through the WM_TIMER message, isolating worker threads from UI tasks as much as possible and avoiding overloading them (graphical information is displayed and updated at a constant time rate, not at the worker thread's looping rate).
- Locking of listviews to speed up element insertion and to avoid continuous redrawing while inserting elements.
- Use of CTabCtrl.
- Use of LPSTR_TEXTCALLBACK listview items.
How to use it
I think it's pretty straightforward to use; anyway, here is the normal procedure:
- Fill the Directory editbox either by typing a directory or by selecting one through the browse directory dialog that appears when pressing the browse button. If you want to recurse subdirectories, check the Recurse dirs checkbox.
- Fill the File masks editbox with a semicolon-separated list of filemasks, for example *.htm;*.html;*.shtml;*.asp to find all HTML-related files.
- Press Add to Source button. The files in the selected directory will be gathered and the Source files listview filled.
- Select another (or the same) directory and filemasks.
- Press Add to Target button. The files in the selected directory will be gathered and the Target files listview filled.
- Select a comparison method:
- By Size: files with equal sizes will match.
- By Checksum: files with equal checksum will match.
- By CRC: files with equal CRC will match.
- By Contents: files with equal contents (byte per byte) will match.
For checksum, CRC and contents you can enter in the UpTo editbox the number of bytes of the file that will be used to calculate the value (thus speeding up the calculation). Enter 0 to use all the bytes of the file for the calculation.
- If you want to suppress duplicated files (files that appear in both the target and source listviews) from appearing in the matched listview, uncheck Compare duplicates.
- Press Compare button.
- Matched files will appear in Compare tab. You can export the three lists to a file by pressing Export... button and selecting a file.
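The By Contents method with an UpTo limit can be pictured with the following minimal sketch. It is an illustration under assumptions, not FCompare's code: SameContents is a hypothetical name, and std::istream is used so the example stays portable and self-contained (the article's implementation uses fread).

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical sketch: returns true if the two streams hold the same bytes.
// upTo == 0 compares everything; otherwise only the first upTo bytes count.
bool SameContents(std::istream& a, std::istream& b, std::uint64_t upTo = 0)
{
    std::uint64_t compared = 0;
    while (upTo == 0 || compared < upTo) {
        const int ca = a.get();
        const int cb = b.get();
        // Either a differing byte, or one stream ended before the other.
        if (ca != cb) return false;
        // Both streams ended at the same position: contents are equal.
        if (ca == std::char_traits<char>::eof()) return true;
        ++compared;
    }
    return true;  // the first upTo bytes matched
}
```

With a limit of 6, "hello world" and "hello there" would match, since they only differ from the seventh byte on; with a limit of 0 they would not.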
Implementation details and rationale
- Using a property sheet (CPropertySheet) embedded in a bigger dialog is quite a pain:
- You can't edit property sheets as normal dialogs in MS's Dialog Editor, so it's quite an adventure to place controls in a property sheet that has something more than the typical three or four overlapping property pages.
- Even if you could edit property sheets with the Dialog Editor, the way of editing property pages (creating independent dialogs) is confusing, at least for me, because you can't figure out how the page will fit in the universal harmony of its big-brother dialog box.
That's why I've used the tab control directly, rather than the CPropertySheet/CPropertyPage classes.
- Dialogbox applications generated by AppWizard don't trigger OnIdle or WM_ENTERIDLE messages. Also, PumpMessage doesn't work properly. Because of that, the only way I've found to make a progress-report loop (even if the work is being done in a worker thread) is to set a timer.
The desired approach would have been to use OnIdle or a similar hack based on the PeekMessage / PumpMessage pair but, as I stated before, they don't work for Dialogbox apps (or at least I haven't been able to make them work).
- The contents matching option is not "win64 friendly", i.e., it uses ANSI C fread, so it can't address 64-bit sized files.
- I use a cute algorithm for file comparing that turns an O(n^2) "normal operation" into O(n) (2n to be precise). At first glance, the most obvious algorithm for file comparing is to compare each source file with each target file, which is O(n^2).
As I have the arrays sorted by increasing file size, I can convert it to O(n):
- Let iSource and iTarget be indexes into the source[] and target[] arrays.
- iSource = 0, iTarget = 0
- while target[] and source[] have elements do
- if the size of target[iTarget] = the size of source[iSource], there is a probable match; do further comparing by checksum or whatever (if you inspect FCompare's source code, you'll see some tricky code here to ensure every needed comparison is made). If the further comparison is positive, add the pair to the match array.
- if target[iTarget] >= source[iSource] then iSource++ else iTarget++
- end while
BTW, I didn't say the algo was the top-work of computer-science, I just stated that it was cute (and I guess it appears somewhere in Knuth's Art of Computer Programming series).
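The walk over two size-sorted arrays can be sketched as follows. This is a minimal illustration in standard C++, not FCompare's actual code: Entry and MatchBySize are hypothetical names, and the handling of runs of equal sizes is just one possible way to make sure every needed comparison is made. Note that within such a run every source is paired with every target, so the cost is linear only when equal sizes are rare.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the file information needed here.
struct Entry { std::string name; std::uint64_t size; };

// Both inputs must be sorted by ascending size. Returns (source index,
// target index) pairs with equal sizes: candidates for a further
// checksum, CRC or contents comparison.
std::vector<std::pair<std::size_t, std::size_t>>
MatchBySize(const std::vector<Entry>& source, const std::vector<Entry>& target)
{
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    std::size_t s = 0, t = 0;
    while (s < source.size() && t < target.size()) {
        if (source[s].size < target[t].size) {
            ++s;                       // source too small, advance it
        } else if (target[t].size < source[s].size) {
            ++t;                       // target too small, advance it
        } else {
            // Equal sizes: find the end of the run on both sides.
            const std::uint64_t size = source[s].size;
            std::size_t sEnd = s, tEnd = t;
            while (sEnd < source.size() && source[sEnd].size == size) ++sEnd;
            while (tEnd < target.size() && target[tEnd].size == size) ++tEnd;
            // Pair every source in the run with every target in the run.
            for (std::size_t i = s; i < sEnd; ++i)
                for (std::size_t j = t; j < tEnd; ++j)
                    matches.emplace_back(i, j);
            s = sEnd;
            t = tEnd;
        }
    }
    return matches;
}
```

Each returned pair would then go through the selected comparison method before being added to the match list.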
History:
- 1999-9-23 ATL (v1.4)
- Corrected yet another bug in GetCRC and GetChecksum as suggested (again!) by Róbert Szucs: it ought to be 4-(dwRead & 0x3) instead of dwRead & 0x3 when calc'ing the padding mask.
- 1999-9-16 ATL (v1.3)
- Corrected bug in GetCRC and GetChecksum as suggested by Róbert Szucs: there was a buffer overflow, and the checksum and CRC for last dword +1 were calc'ed instead of the ones for the last dword. Instead of accessing buffer[dwRead +3...] it ought to access buffer[dwRead...] (shame on me! :'().
- 1999-9-2 ATL (v1.2)
- Corrected bug in Create(CString, LPARAM) as suggested by Nhycoh: there was some weird stuff at CFileInfo::Create(strFilePath, lparam) stating strFilePath.GetLength()-nBarPos instead of nBarPos+1 (I'm quite sure I left my head on my pillow the day I did that %-#).
- Updated GetCRC & GetChecksum to avoid some bug cases
- 1999-4-30 ATL (v1.1, Internal Release)
- Corrected a bug when setting timers: requested timer was id 0 and MSDN help states that the timer id must be greater than 0. This bug was pointed out to me by Javier Maura (although the timer _does_ work even if its id is 0!!!). The user is also warned about progress not being reported if SetTimer fails.
- 1999-4-7 ATL (v1.1, Internal Release)
- Updated source code doc to conform to the Autoduck 2.0 standard
- Corrected bug in CFileInfoArray::AddDir as suggested by Zhuang Yuyao: bIncludeDirs wasn't used if bRecurse was false.
Recycling bits
For FCompare I've borrowed:
- CDirDialog, the directory browsing not-so-common-dialog-box wrapper initially by Girish Bharadwaj and Lars Klose and later enhanced by Vladimir Kvash. BTW, I slightly modified DirDialog.h so that MS VC++ doesn't complain about csStatusText and lpcsSelection not being used in the SelChanged declaration.
- Fancy CHyperLinkEx by Giancarlo Iovino, where I also commented out the unused parameter nFlag at CHyperLink::OnMouseMove.