Introduction
I wrote this tool to find duplicate files in my system, and I wanted to share it with you. As the main purpose of this article is to present the tool, there is relatively little to write about. However, I'll try to describe the basic mechanisms of the tool's implementation.
How to Use
Actually, it's very easy to use the tool. You can see its interface in action here:
In addition to having a Start button, the tool also provides Stop and Pause buttons as it may be a lengthy operation to find duplicates in a large number of files. Duplicated files found during scanning are immediately inserted into the tree control (CTreeCtrl
), as you can see. It's a two level hierarchy, that holds duplicated file names in its second level, and general information in its first level. You also can log the resulting hierarchy in a textual form (the Log... button will be enabled when the search process ends up). In general, the tool works very fast; for example, it takes ~40 minutes to scan a directory of ~70 GB size. But, isn't that size too much for a single shot scan?
How it Works
Here, I will describe the underlying strategy of the implementation. The user interface is implemented via MFC (VC7.1 was used, sorry I don't have VC9). The scanning process is done in a separate worker thread so that the user can stop/pause the execution at any point. This thread starts when the user starts a new scan, and finishes when the user stops it or no more files remain to scan. And, as you guess, the main engine is concentrated in the worker thread itself. The steps it performs may be itemized into the following outline:
- Gather necessary information about the directory size
- Reserve enough amount of virtual memory to keep scanned file names
- Start scanning the directory
- Free the resources
Here is the description of the points above, in a little bit more details:
- Gather necessary information about the directory size: This process is accomplished with the
CFileFind
MFC utility class by which the procedure walks through each file to compute the overall file count in a directory. I was struggling to find a Win32 equivalent, but in vein. And, if you happen to know a Win32 API returning the number of files residing in a directory, please let me know. - Reserve enough amount of virtual memory to keep scanned file names: The implementation uses
VirtualAlloc
and VirtualFree
to allocate/free virtual memory chunks. As the directory to scan may be very large, it's proper to store the scanned file names in virtual memory, not in RAM directly. So, this virtual memory is needed to store file names that have already been scanned. The next file that is being scanned should be tested against duplicates in the already scanned file table. To optimize this process (i.e., not to traverse all the entries in the table), it keeps the table sorted (by file content, not their names), and uses binary search to find a duplicate (and if not found, it finds the position where to insert the file name in the table). - Start scanning the directory: The scanning process picks the next file and tests it against being a duplicate. To pick the next file, again, the
CFileFind
MFC class is used. So, how do we compare two files? The tool performs a bitwise comparison of two files. That is, it opens them, and compares their content in lexicographical order (as if two strings of bytes). To optimize this process, the files are opened as memory mapped filesm and their contents are compared by the C-runtime routine memcmp
. To automate this process, I wrote a class CMMFile
which handles all this procedure. It also ensures that if a file cannot be mapped (due to insufficient memory), it gets compared part by part (these parts will be small enough to be successfully mapped). - Free the resources: After the scanning process finishes, it's necessary to free the resources, e.g., the allocated virtual memory.
Acknowledgment
There are two articles that were used by this tool. Thanks to Dana Holt for his article about browsing folders. And, the second one was written by me, which provides a simple class encapsulating the VirtualAlloc
/VirtualFree
Win32 API functions. And certainly, many thanks to the readers and users of this article!