This is a small guide on how to use md5 in combination with find and locate on MacOS Unix to identify duplicate files on the same drive or across drives.
Introduction
This is a follow-up on the usage of the locate Unix command. In my previous article, Using Locate Databases on MacOS Unix, I explained the internal workings of name databases and the locate command used to access them. This tip now shows how to use them to identify duplicate files.
The examples below as well as the previous article are based on MacOS High Sierra. I have noticed newer MacOS versions have a newer locate command.
Background
The background to this article is that I recently found an old compact flash card on which I backed up files from a computer's hard drive years ago before formatting the drive and selling it along with the computer. I was searching for a quick way to check if there are files on this compact flash card that I had forgotten to copy to my new computer.
Using the Code
As explored in my previous article mentioned in the Introduction section above, the find command lies at the heart of the locate.updatedb command that is used to populate the names database, which is then queried by the locate command.
Out of the box, the find command prints the file name with its full path, which is then stored in the names database. Interestingly, this output can be modified, so the information stored in the names database can be enriched and then searched with locate.
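The difference between the default output and an enriched listing can be tried on a throwaway directory before touching any database. The following is a minimal sketch, not the script used in this article: the directory and file names are hypothetical, -type f is added here so that directories are skipped, and the md5sum fallback is an assumption for systems that lack the macOS md5 command.

```shell
# Throwaway directory with one sample file (hypothetical names).
demo=$(mktemp -d "${TMPDIR:-/tmp}/md5demo.XXXXXX")
printf 'hello\n' > "$demo/a.txt"

# Default behaviour: find prints only the path.
plain=$(find "$demo" -type f -print)

# Enriched behaviour: each path is prefixed with its md5 fingerprint.
if command -v md5 >/dev/null 2>&1; then
    # macOS: -r prints the digest first, then one blank, then the path.
    enhanced=$(find "$demo" -type f -exec md5 -r {} \;)
else
    # Assumption: md5sum is available elsewhere; its two-blank separator
    # is squeezed to one so the layout matches md5 -r.
    enhanced=$(find "$demo" -type f -exec md5sum {} \; | sed 's/  / /')
fi

printf '%s\n' "$plain" "$enhanced"
rm -rf "$demo"
```

The second line of output has the same "fingerprint, blank, path" layout as the database entries shown later in this article.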
To demonstrate this, I modified the enhanced version of the locate.updatedb command from my previous article and extended its output with the md5 fingerprint:
$ cat locate.updatedb.md5
: ${LOCATE_CONFIG="/etc/locate.rc"}
if [ -f "$LOCATE_CONFIG" -a -r "$LOCATE_CONFIG" ]; then
    . $LOCATE_CONFIG
fi
: ${FCODES:=/var/db/locate.database}
if [ "$(id -u)" = "0" ]; then
    rc=0
    export TMP_FCODES=`sudo -u nobody mktemp -t updatedb`
    chown nobody $TMP_FCODES
    tmpdb=`su -fm nobody -c "$0"` || rc=1
    if [ $rc = 0 ]; then
        install -m 0444 -o nobody -g wheel $TMP_FCODES $FCODES
    fi
    rm $TMP_FCODES
    exit $rc
fi
: ${LIBEXECDIR:=/usr/libexec}; export LIBEXECDIR
: ${TMPDIR:=/tmp}; export TMPDIR
if ! TMPDIR=`mktemp -d $TMPDIR/locateXXXXXXXXXX`; then
    exit 1
fi
PATH=$LIBEXECDIR:/bin:/usr/bin:$PATH; export PATH
set -o noglob
: ${mklocatedb:=locate.mklocatedb}
: ${TMP_FCODES=$FCODES}
: ${SEARCHPATHS:="/"}
: ${PRUNEPATHS:="/private/tmp /private/var/folders /private/var/tmp */Backups.backupdb"}
: ${FILESYSTEMS:="hfs ufs apfs"}
: ${find:=find}
case X"$SEARCHPATHS" in
    X) echo "$0: empty variable SEARCHPATHS"; exit 1;; esac
case X"$FILESYSTEMS" in
    X) echo "$0: empty variable FILESYSTEMS"; exit 1;; esac
excludes="! (" or=""
for fstype in $FILESYSTEMS
do
    excludes="$excludes $or -fstype $fstype"
    or="-or"
done
excludes="$excludes ) -prune"
case X"$PRUNEPATHS" in
    X) ;;
    *) for path in $PRUNEPATHS
       do
           excludes="$excludes -or -path $path -prune"
       done;;
esac
tmp=$TMPDIR/_updatedb$$
trap 'rm -f $tmp; rmdir $TMPDIR; exit' 0 1 2 3 5 10 15
if $find -s $SEARCHPATHS $excludes -or -exec md5 -r {} \; 2> /dev/null |
    $mklocatedb -presort > $tmp
then
    case X"`$find $tmp -size -257c -print`" in
        X) cat $tmp > $TMP_FCODES;;
        *) echo "updatedb: locate database $tmp is empty"
           exit 1
    esac
fi
Using diff, you can check the modifications. The relevant hunk is 92,93c95,96, where -exec md5 -r is used instead of -print.
$ diff /usr/libexec/locate.updatedb locate.updatedb.md5
29a30,37
>
>
>
> : ${LOCATE_CONFIG="/etc/locate.rc"}
> if [ -f "$LOCATE_CONFIG" -a -r "$LOCATE_CONFIG" ]; then
> . $LOCATE_CONFIG
> fi
> : ${FCODES:=/var/db/locate.database}
33,34c41,42
< export FCODES=`sudo -u nobody mktemp -t updatedb`
< chown nobody $FCODES
---
> export TMP_FCODES=`sudo -u nobody mktemp -t updatedb`
> chown nobody $TMP_FCODES
37c45
< install -m 0444 -o nobody -g wheel $FCODES /var/db/locate.database
---
> install -m 0444 -o nobody -g wheel $TMP_FCODES $FCODES
39c47
< rm $FCODES
---
> rm $TMP_FCODES
42,45d49
< : ${LOCATE_CONFIG="/etc/locate.rc"}
< if [ -f "$LOCATE_CONFIG" -a -r "$LOCATE_CONFIG" ]; then
< . $LOCATE_CONFIG
< fi
60c64
< : ${FCODES:=/var/db/locate.database}
---
> : ${TMP_FCODES=$FCODES}
90d93
<
92,93c95,96
<
< if $find -s $SEARCHPATHS $excludes -or -print 2>/dev/null |
---
>
> if $find -s $SEARCHPATHS $excludes -or -exec md5 -r {} \; 2> /dev/null |
97c100
< X) cat $tmp > $FCODES;;
---
> X) cat $tmp > $TMP_FCODES;;
Along with the locate.updatedb.md5 script, I was using the two configuration files below (locate.Kingston.rc is the one for the compact flash drive mounted on /Volumes/Kingston and locate.Documents.rc is the one for my document repository $HOME/Documents on my Mac):
$ cat locate.Kingston.rc
TMPDIR="/tmp"
FCODES="locate.Kingston.database"
SEARCHPATHS="/Volumes/Kingston"
PRUNEPATHS="/Volumes/Kingston/.Spotlight-V100 /Volumes/Kingston/.Trashes /Volumes/Kingston/Ignore"
FILESYSTEMS="msdos"
$ cat locate.Documents.rc
TMPDIR="/tmp"
FCODES="locate.Documents.database"
SEARCHPATHS="$HOME/Documents"
FILESYSTEMS="hfs ufs apfs"
The name databases for locate can then be created by running the enhanced locate.updatedb.md5 script once with each of the above configuration files:
$ export LOCATE_CONFIG="./locate.Documents.rc";./locate.updatedb.md5
$ export LOCATE_CONFIG="./locate.Kingston.rc";./locate.updatedb.md5
If no error message is printed, the name databases were created successfully and you can browse their contents and structure as below:
$ locate -d locate.Documents.database "*"|less
d4c5320cf104b5629d490976c1d8059e /Users/(...)/Documents/program.jpg
cc2f48fbb82a8840a9fcb9ab0d94d0b4 /Users/(...)/Documents/program.vthought
38279e78475f437c5ebdb508b9524c72 /Users/(...)/Documents/setArray.BAK.vthought
58a2e7ffd785b72868bf28bbcacb49c3 /Users/(...)/Documents/setArray.vthought
4d42423c69db19795f8169324ecf579e /Users/(...)/Documents/structure.jpg
c63b631236fb415838cb4b57ddb1ed92 /Users/(...)/Documents/structure.vthought
$ locate -d locate.Kingston.database "*"|less
...
a3d301ed3cc4442ff2df681a9f6dd0e1 /Volumes/Kingston/Addresses/Mac/Cards2ascii.vcf
ea4177dbe0aacdd4f0b962cc3a6ffe57 /Volumes/Kingston/Addresses/Mac/vCards.vcf
bf10d18437911231b07c172729e5516c /Volumes/Kingston/Addresses/Mac/vCards2.vcf
c00ca5c3b168936ebdb4d81172e11071 /Volumes/Kingston/Addresses/Mac/vCards2utf.vcf
a3d301ed3cc4442ff2df681a9f6dd0e1 /Volumes/Kingston/Addresses/Mac/vCards2utf16.vcf
a3d301ed3cc4442ff2df681a9f6dd0e1 /Volumes/Kingston/Addresses/Mac/vCards2utf8.vcf
162b1606424bca82fa234c726e94eeab /Volumes/Kingston/Addresses/Mac/vCards2w.vcf
2aace6876876d77691fb23d69319bab0 /Volumes/Kingston/Addresses/Mac/vCards3w.vcf
...
As you can see, the names databases have the md5 fingerprint separated by a space character from the full filename.
Intermediate Result
So let's see what we have achieved so far. We now have two name databases, each containing filenames and their md5 fingerprints: one for the files in the Documents folder on the local hard drive (locate.Documents.database) and one for the files on the compact flash card (locate.Kingston.database).
Any part of an entry (md5 value or file name) can be queried, as shown below:
$ locate -d locate.Kingston.database bf10d18437911231b07c172729e5516c
bf10d18437911231b07c172729e5516c /Volumes/Kingston/Addresses/Mac/vCards2.vcf
$ locate -d locate.Kingston.database vCards2.vcf
effbec4ddbd7fd21dac46e11cda782e8 /Volumes/Kingston/Addresses/Mac/._vCards2.vcf
The advantage of using the md5 value instead of the filename is that a file can be recognized by its content, no matter where it is stored in the directory structure or what its current name is.
However, md5 values are not guaranteed to be unique. It is unlikely but still possible that two files produce the same md5 fingerprint even though they are not identical. To be sure that two files are identical, you will need to run cmp on them. This will, however, not be covered in this article.
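For orientation only, the basic form of such a check is short. The sketch below assumes nothing beyond the standard cmp utility; the helper name same_content and the sample files are hypothetical and are not part of the workflow in this article.

```shell
# Succeed only if both files are identical byte for byte.
# cmp -s prints nothing and reports the result through its exit status.
same_content() {
    cmp -s "$1" "$2"
}

# Demonstration with throwaway files (hypothetical names).
demo=$(mktemp -d "${TMPDIR:-/tmp}/cmpdemo.XXXXXX")
printf 'same bytes\n'  > "$demo/original.txt"
printf 'same bytes\n'  > "$demo/copy.txt"
printf 'other bytes\n' > "$demo/different.txt"

if same_content "$demo/original.txt" "$demo/copy.txt"; then
    result_copy="identical"
else
    result_copy="different"
fi

if same_content "$demo/original.txt" "$demo/different.txt"; then
    result_other="identical"
else
    result_other="different"
fi

printf 'copy: %s\nother: %s\n' "$result_copy" "$result_other"
rm -rf "$demo"
```

In practice, the two arguments would be a pair of paths that share the same md5 value in the name databases.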
You can now use the two databases to check for files that are only on the compact flash card as shown in the next section.
Finding Files Already Existing on the Target
You may now use the standard Unix tool comm to check and extract the md5 values that are only in the compact flash names database (locate.Kingston.database):
$ comm -23 <(locate -d locate.Kingston.database "*" |cut -f 1 -d " "|sort) \
  <(locate -d locate.Documents.database "*" |cut -f 1 -d " "|sort) > KingstonOnly.md5.srt
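To see what comm -23 does without a real database, the same steps can be run on a few sample lines. This is a sketch under stated assumptions: the tokens aaa, bbb, ccc and ddd stand in for md5 values, the paths are invented, and temporary files are used instead of the locate output.

```shell
work=$(mktemp -d "${TMPDIR:-/tmp}/commdemo.XXXXXX")

# Sample listings in the "fingerprint, blank, path" layout (invented data).
printf '%s\n' 'aaa /card/one.txt' 'bbb /card/two.txt' 'ccc /card/three.txt' > "$work/card.lst"
printf '%s\n' 'bbb /docs/two-copy.txt' 'ddd /docs/other.txt' > "$work/docs.lst"

# Keep only the fingerprint column and sort it, as comm expects sorted input.
cut -f 1 -d ' ' "$work/card.lst" | sort > "$work/card.md5"
cut -f 1 -d ' ' "$work/docs.lst" | sort > "$work/docs.md5"

# -2 hides lines only in the second file, -3 hides lines common to both,
# so what remains are the fingerprints found only on the card.
card_only=$(comm -23 "$work/card.md5" "$work/docs.md5")

printf '%s\n' "$card_only"
rm -rf "$work"
```

Here bbb is dropped because it appears in both listings, even though the file has a different name and location in each.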
Using the standard Unix tool join, you can expand the md5 value to the full filename:
$ join <(cat KingstonOnly.md5.srt) \
  <(locate -d locate.Kingston.database "*"|sort)|grep vCards2.vcf
bf10d18437911231b07c172729e5516c /Volumes/Kingston/Addresses/Mac/vCards2.vcf
In the example above, a grep is done on a particular file. If you want to see the complete output, you may pipe to less instead of grep or redirect the output to some file.
However, you need to be careful with filenames that contain two or more consecutive blanks: join uses blanks as its default delimiter and collapses consecutive ones in its output. It is therefore better to use the slash (/) as the delimiter, since it cannot be part of a filename (at least under Unix). This can be accomplished by including awk in your command to append a slash to each line of the KingstonOnly.md5.srt file. The output then looks like this:
$ join -t / <(cat KingstonOnly.md5.srt|awk '{print $1 " / "}') \
  <(locate -d locate.Kingston.database "*"|sort)|grep vCards2.vcf
bf10d18437911231b07c172729e5516c / /Volumes/Kingston/Addresses/Mac/vCards2.vcf
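The effect of the delimiter can be reproduced with a single sample line. The sketch below is illustrative only: the token aaa stands in for an md5 value, the path is invented, and temporary files replace the database queries.

```shell
work=$(mktemp -d "${TMPDIR:-/tmp}/joindemo.XXXXXX")

# One key and one listing line; the file name contains two consecutive blanks.
printf '%s\n' 'aaa' > "$work/keys"
printf '%s\n' 'aaa /card/my  file.txt' > "$work/list"

# Default delimiter: join splits on blanks and re-joins with a single blank.
collapsed=$(join "$work/keys" "$work/list")

# Slash delimiter: " / " is appended to the key so that its first field
# ("aaa" plus a blank) equals the first slash-delimited field of the listing.
awk '{print $1 " / "}' "$work/keys" > "$work/keys.slash"
preserved=$(join -t / "$work/keys.slash" "$work/list")

printf '%s\n' "$collapsed" "$preserved"
rm -rf "$work"
```

The first result has lost one of the two blanks in the file name, while the second keeps the name intact at the cost of an extra " / " after the fingerprint.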
From here, it is up to you to decide what to do with the files that were identified as existing only on the compact flash card. The easiest option is to simply tar them into an archive on your local hard drive and leave them for later investigation:
$ join -t / <(cat KingstonOnly.md5.srt|awk '{print $1 " /"}') \
  <(locate -d locate.Kingston.database "*"|sort)|cut -c 35-|tar -cf KingstonOnly.tar -T -
The additional cut in the example above is necessary to remove the MD5 fingerprint before piping the filenames to tar.
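The offset 35 follows from the layout of the joined line, which can be checked on one entry. This is a small sketch using the sample fingerprint and path shown earlier in this article; the doubled slash is what join -t / produces when the key ends in " /".

```shell
# A joined line: 32 hex digits, one blank, an extra slash, then the path.
line='bf10d18437911231b07c172729e5516c //Volumes/Kingston/Addresses/Mac/vCards2.vcf'

# Characters 1-32 are the fingerprint, 33 is the blank and 34 is the extra
# slash, so the path itself starts at character 35.
path=$(printf '%s\n' "$line" | cut -c 35-)

printf '%s\n' "$path"
```

If the awk format string is changed (for example back to " / " with a trailing blank), the offset has to be adjusted accordingly.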
The files on the compact flash card are now redundant and the card can be formatted and used for other purposes.
Finding Duplicate Files Within a Drive
People with some background in relational database management systems might already have become concerned at the point where the join command was used to expand the md5 value to the full filename: if there is a duplicate file on the compact flash card, the join will no longer produce a 1:1 match.
This leads to another nice feature of using name databases with md5 values: you can easily identify duplicate files within a database using the uniq command.
$ join <(locate -d locate.Kingston.database "*"|cut -f 1 -d " "|sort|uniq -d) \
  <(locate -d locate.Kingston.database "*"|sort)|less
$ join <(locate -d locate.Documents.database "*"|cut -f 1 -d " "|sort|uniq -d) \
  <(locate -d locate.Documents.database "*"|sort)|less
(The above examples ignore the case that filenames might contain two or more consecutive blanks. To handle it, you need to use the slash as delimiter in the join statement, as shown in the examples in the previous section.)
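The duplicate search can also be tried on sample data. The following sketch assumes invented paths and uses the tokens aaa and bbb in place of md5 values; temporary files stand in for the database queries.

```shell
work=$(mktemp -d "${TMPDIR:-/tmp}/dupdemo.XXXXXX")

# Sample listing: two entries share the fingerprint aaa (invented data).
printf '%s\n' 'aaa /card/a.vcf' 'bbb /card/c.vcf' 'aaa /card/b.vcf' | sort > "$work/list"

# uniq -d keeps one copy of each fingerprint that occurs more than once.
cut -f 1 -d ' ' "$work/list" | uniq -d > "$work/dupkeys"

# Expanding those fingerprints lists every file that has a twin.
duplicates=$(join "$work/dupkeys" "$work/list")

printf '%s\n' "$duplicates"
rm -rf "$work"
```

Only the two aaa entries are reported; the bbb entry is unique and therefore dropped.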
Points of Interest
The <(..) <(..) notation (process substitution), which feeds two pipes into one command, is taken from here: https://unix.stackexchange.com/questions/31653/two-pipes-to-one-command/31654
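A minimal illustration of the notation, independent of locate, is sketched below. Process substitution is not part of POSIX sh, so the example calls bash explicitly; that bash is installed is an assumption, and the letters are placeholder data.

```shell
# Each <(...) runs a command and hands its output to comm as if it were a file.
only_first=$(bash -c 'comm -23 <(printf "%s\n" a b c) <(printf "%s\n" b d)')

printf '%s\n' "$only_first"
```

The lines a and c are printed because they occur only in the first input.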
History
- October 2020 - November 2022: Initial version