Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / desktop / MFC

Duplicate File Finder

3.89/5 (16 votes)
2 Aug 2008CPOL4 min read 1   3.1K  
A tool that finds duplicate files in your system.

Introduction

I wrote this tool to find duplicate files in my system, and I wanted to share it with you. As the main purpose of this article is to present the tool, there is relatively little to write about. However, I'll try to describe the basic mechanisms of the tool's implementation.

How to Use

Actually, it's very easy to use the tool. You can see its interface in action here:

duplicatefinder_src

In addition to having a Start button, the tool also provides Stop and Pause buttons as it may be a lengthy operation to find duplicates in a large number of files. Duplicated files found during scanning are immediately inserted into the tree control (CTreeCtrl), as you can see. It's a two level hierarchy, that holds duplicated file names in its second level, and general information in its first level. You also can log the resulting hierarchy in a textual form (the Log... button will be enabled when the search process ends up). In general, the tool works very fast; for example, it takes ~40 minutes to scan a directory of ~70 GB size. But, isn't that size too much for a single shot scan?

How it Works

Here, I will describe the underlying strategy of the implementation. The user interface is implemented via MFC (VC7.1 was used, sorry I don't have VC9). The scanning process is done in a separate worker thread so that the user can stop/pause the execution at any point. This thread starts when the user starts a new scan, and finishes when the user stops it or no more files remain to scan. And, as you guess, the main engine is concentrated in the worker thread itself. The steps it performs may be itemized into the following outline:

  • Gather necessary information about the directory size
  • Reserve enough amount of virtual memory to keep scanned file names
  • Start scanning the directory
  • Free the resources

Here is the description of the points above, in a little bit more details:

  • Gather necessary information about the directory size: This process is accomplished with the CFileFind MFC utility class by which the procedure walks through each file to compute the overall file count in a directory. I was struggling to find a Win32 equivalent, but in vein. And, if you happen to know a Win32 API returning the number of files residing in a directory, please let me know.
  • Reserve enough amount of virtual memory to keep scanned file names: The implementation uses VirtualAlloc and VirtualFree to allocate/free virtual memory chunks. As the directory to scan may be very large, it's proper to store the scanned file names in virtual memory, not in RAM directly. So, this virtual memory is needed to store file names that have already been scanned. The next file that is being scanned should be tested against duplicates in the already scanned file table. To optimize this process (i.e., not to traverse all the entries in the table), it keeps the table sorted (by file content, not their names), and uses binary search to find a duplicate (and if not found, it finds the position where to insert the file name in the table).
  • Start scanning the directory: The scanning process picks the next file and tests it against being a duplicate. To pick the next file, again, the CFileFind MFC class is used. So, how do we compare two files? The tool performs a bitwise comparison of two files. That is, it opens them, and compares their content in lexicographical order (as if two strings of bytes). To optimize this process, the files are opened as memory mapped filesm and their contents are compared by the C-runtime routine memcmp. To automate this process, I wrote a class CMMFile which handles all this procedure. It also ensures that if a file cannot be mapped (due to insufficient memory), it gets compared part by part (these parts will be small enough to be successfully mapped).
  • Free the resources: After the scanning process finishes, it's necessary to free the resources, e.g., the allocated virtual memory.

Acknowledgment

There are two articles that were used by this tool. Thanks to Dana Holt for his article about browsing folders. And, the second one was written by me, which provides a simple class encapsulating the VirtualAlloc/VirtualFree Win32 API functions. And certainly, many thanks to the readers and users of this article!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)