Hashes can find exact duplicates of image files, but that is where their usefulness ends. I've written a class that reduces each image to a smaller thumbnail, keeping the original width-to-height ratio, down to a maximum of 128 pixels. The next step builds a dictionary of the unique colors in the thumbnail and tallies the count of each color. A flat array of color information is extracted for speed, and that array is used in the sort and equality routines. Exact duplicate checking is quite fast: if the original width, height, or count of unique colors differs, the extra step of comparing colors can be skipped. If an image is flipped left to right or top to bottom, the set of colors and their counts is still the same. I did not discover a way to determine whether two images were similar but not identical. I tried comparing color usage percentages to some degree, but my results were sporadic. Sorting does work to a degree in the current build: images are generally sorted dark to light, and some are grouped by subject.
Introduction
I created a picture management program a few years ago that I could use to rename image files into groups under their directory. I was interested in being able to sort the images to speed up the process of finding images that belonged to certain groups. It occurred to me that if I extracted the color information I might be able to compare the images in a sort.
I knew it would take far too long to extract the colors of the entire image, so I shrank each one down to thumbnail size. I was surprised by the sorted results but somewhat disappointed that the sorts weren't perfect. I kept trying different, more elaborate comparison methods, but the results were nearly the same. I tried using XYZ and LAB colors, but again the results turned out nearly the same, so I stuck with RGB. I also tried extracting the color information of the image borders separately and offered their comparison as a secondary sort option in the GUI.
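The shrinking step itself is not listed in this article. As a rough sketch only, assuming the longer side is scaled down to at most 128 pixels and the other side follows the original ratio, the thumbnail size could be computed like this. The ThumbnailSize class and its Fit method are hypothetical names and are not part of ImageParser.

```csharp
using System;

public static class ThumbnailSize
{
    // Assumed limit for the longer side, taken from the 128 pixel figure in the text.
    public const int MaxSide = 128;

    // Returns thumbnail dimensions that keep the width/height ratio.
    // Images already within the limit keep their original size.
    public static (int Width, int Height) Fit(int width, int height, int maxSide = MaxSide)
    {
        if (width <= 0 || height <= 0 || maxSide <= 0)
            throw new ArgumentOutOfRangeException(nameof(width), "Dimensions must be positive.");

        int longest = Math.Max(width, height);
        if (longest <= maxSide) return (width, height);

        // Round to the nearest pixel; clamp to 1 so a very thin image never gets a zero side.
        int w = Math.Max(1, (int)Math.Round((double)width * maxSide / longest));
        int h = Math.Max(1, (int)Math.Round((double)height * maxSide / longest));
        return (w, h);
    }
}
```

With these assumptions, a 1920 by 1080 image maps to a 128 by 72 thumbnail, and a 100 by 50 image is left as it is.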
I tried sorting the color array by color usage, thinking that would make a difference in the sorted outcome. It made a slight difference, but it didn't seem to matter that one image might contain more of one particular shade of magenta than another. I could never really look at the raw color data color by color and come to any conclusion about how some form of complex comparison might affect the sort.
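For readers who want to repeat that experiment, here is a minimal sketch of ordering a color array by usage. The nested Entry type is a simplified stand-in for PicColor (only R, G, B and Amount), and the tie-breaking rule is my assumption rather than something taken from the original code.

```csharp
using System;

public static class UsageSort
{
    // Simplified stand-in for PicColor; only the fields used here.
    public sealed class Entry
    {
        public byte R, G, B;
        public int Amount;   // number of thumbnail pixels with this color
    }

    // Orders colors in place from most used to least used.
    // Ties fall back to R, then G, then B so the result is deterministic.
    public static void SortByUsage(Entry[] colors)
    {
        Array.Sort(colors, (a, b) =>
        {
            int k = b.Amount.CompareTo(a.Amount);   // descending by count
            if (k != 0) return k;
            k = a.R.CompareTo(b.R);
            if (k != 0) return k;
            k = a.G.CompareTo(b.G);
            if (k != 0) return k;
            return a.B.CompareTo(b.B);
        });
    }
}
```

After this pass, the first entries of two arrays are the dominant colors of each image, so a comparison that walks the arrays front to back weighs those colors first.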
In the end I eliminated as many steps as possible to speed up the color extraction process. I extracted the colors, filled the array in order of color appearance starting from the top left, detected monochrome images, and kept an average of the red, green, and blue values along the image border. Eventually I stored the thumbnail in a database as a PNG, along with the file name, width, and height, to skip opening the original image and shrinking it each time. I did not attempt to save the color array in the database as individual components, knowing it would be very slow, but I suppose the array could be stored as a blob somehow for very fast load times.
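The extraction pass described above can be sketched roughly as follows. This is not the ImageParser code: it assumes the thumbnail pixels have already been copied into an int array of packed 0xRRGGBB values in row-major order, it treats "monochrome" as every pixel having R equal to G equal to B, and the ColorTally, Entry and Result names are made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class ColorTally
{
    public sealed class Entry
    {
        public int Rgb;      // packed 0xRRGGBB
        public int Amount;   // pixels with this color
    }

    public sealed class Result
    {
        public Entry[] Colors;                  // in order of first appearance, top left first
        public bool Monochrome;                 // assumed meaning: R == G == B for every pixel
        public int BorderR, BorderG, BorderB;   // average over the outer ring of pixels
    }

    // pixels: row-major 0xRRGGBB values, length == width * height.
    public static Result Extract(int[] pixels, int width, int height)
    {
        if (width <= 0 || height <= 0 || pixels.Length != width * height)
            throw new ArgumentException("Pixel buffer does not match the dimensions.");

        var index = new Dictionary<int, Entry>();
        var ordered = new List<Entry>();
        bool mono = true;
        long br = 0, bg = 0, bb = 0, borderCount = 0;

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                int rgb = pixels[y * width + x] & 0xFFFFFF;
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;

                if (!index.TryGetValue(rgb, out Entry e))
                {
                    e = new Entry { Rgb = rgb };
                    index.Add(rgb, e);
                    ordered.Add(e);   // remembers the order of first appearance
                }
                e.Amount++;

                if (r != g || g != b) mono = false;

                if (x == 0 || y == 0 || x == width - 1 || y == height - 1)
                {
                    br += r; bg += g; bb += b; borderCount++;
                }
            }
        }

        return new Result
        {
            Colors = ordered.ToArray(),
            Monochrome = mono,
            BorderR = (int)(br / borderCount),
            BorderG = (int)(bg / borderCount),
            BorderB = (int)(bb / borderCount)
        };
    }
}
```

The dictionary gives constant-time lookups while tallying, and the separate list keeps the top-left-first order so the result can be flattened into an array for the comparison routines.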
Using the code
Compare class instance to another image for use in sort routines
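// Orders this image against py by walking both color arrays in parallel.
// Returns -1, 0, or 1, as a sort comparison expects.
public int CompareRGB(in ImageParser py)
{
    // Only the entries present in both arrays can be compared.
    int c = Colors.Length <= py.Colors.Length ? Colors.Length : py.Colors.Length;
    for (int i = 0; i < c; i++)
    {
        int k = PicColor.CompareRGB(Colors[i], py.Colors[i]);
        if (k != 0) return k;
        // Same color: fall back to how many pixels use it.
        if (Colors[i].Amount > py.Colors[i].Amount) return 1;
        if (Colors[i].Amount < py.Colors[i].Amount) return -1;
    }
    // Equal over the shared entries; array length is not used as a tie-breaker.
    return 0;
}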
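One property of the routine above is worth knowing when using it in a sort: it returns 0 when one color array is a prefix of the other, so images A and B can compare as equal, A and C as equal, and yet B and C differ. If that ever causes unstable ordering, one possible fix is a length tie-breaker. The sketch below is an assumption-laden variation, not the behavior of the original class; Color and Parsed are reduced stand-ins for PicColor and ImageParser.

```csharp
using System;

public static class SortSketch
{
    // Stand-ins for PicColor and ImageParser, reduced to what the comparison needs.
    public sealed class Color { public byte R, G, B; public int Amount; }
    public sealed class Parsed { public string Name; public Color[] Colors; }

    // R, then G, then B, then pixel count, as in the routines in this article.
    static int CompareColor(Color a, Color b)
    {
        int k = a.R.CompareTo(b.R);
        if (k == 0) k = a.G.CompareTo(b.G);
        if (k == 0) k = a.B.CompareTo(b.B);
        if (k == 0) k = a.Amount.CompareTo(b.Amount);
        return k;
    }

    // Same walk as CompareRGB, plus a length tie-breaker (the addition being illustrated).
    public static int Compare(Parsed x, Parsed y)
    {
        int c = Math.Min(x.Colors.Length, y.Colors.Length);
        for (int i = 0; i < c; i++)
        {
            int k = CompareColor(x.Colors[i], y.Colors[i]);
            if (k != 0) return k;
        }
        // Shorter array sorts first when it is a prefix of the longer one.
        return x.Colors.Length.CompareTo(y.Colors.Length);
    }

    // Sorts in place using the comparison above.
    public static void Sort(Parsed[] items)
    {
        Array.Sort(items, (a, b) => Compare(a, b));
    }
}
```

Whether the tie-breaker changes the visible order much is something I have not measured; it only guarantees that two images compare as equal when their color arrays match entry for entry.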
The PicColor class CompareRGB routine called above
// Compares two colors channel by channel: red first, then green, then blue.
public static int CompareRGB(in PicColor p1, in PicColor p2)
{
    if (p1.R > p2.R) return 1;
    if (p1.R < p2.R) return -1;
    if (p1.G > p2.G) return 1;
    if (p1.G < p2.G) return -1;
    if (p1.B > p2.B) return 1;
    if (p1.B < p2.B) return -1;
    return 0;
}
Check class instance against another image to test for duplicates
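// Returns true when py looks like the same image as this instance.
public bool Equal(in ImageParser py)
{
    // Cheap checks first: any mismatch here rules out a duplicate
    // without touching the color arrays.
    if (Width != py.Width) return false;
    if (Height != py.Height) return false;
    if (Colors.Length != py.Colors.Length) return false;
    int c = Colors.Length;
    for (int i = 0; i < c; i++)
    {
        // Every color and its pixel count must match, in the same order.
        if (PicColor.CompareRGB(Colors[i], py.Colors[i]) != 0) return false;
        if (Colors[i].Amount != py.Colors[i].Amount) return false;
    }
    return true;
}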
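Equal works on one pair of images at a time. To check a whole folder without comparing every image against every other, one option is to bucket images by the same three cheap values that Equal tests first. The following is a sketch under that assumption. Parsed is a reduced stand-in for ImageParser with colors packed as 0xRRGGBB, and FindDuplicates is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateSketch
{
    // Stand-in for ImageParser: colors packed as 0xRRGGBB with their counts.
    public sealed class Parsed
    {
        public string Name;
        public int Width, Height;   // original image size
        public int[] Rgb;           // unique colors in order of first appearance
        public int[] Amount;        // Amount[i] is the pixel count for Rgb[i]
    }

    // Mirrors the checks in Equal: cheap tests first, then the color walk.
    public static bool SameImage(Parsed a, Parsed b)
    {
        if (a.Width != b.Width || a.Height != b.Height) return false;
        if (a.Rgb.Length != b.Rgb.Length) return false;
        for (int i = 0; i < a.Rgb.Length; i++)
            if (a.Rgb[i] != b.Rgb[i] || a.Amount[i] != b.Amount[i]) return false;
        return true;
    }

    // Groups items by (width, height, color count) so that only
    // plausible matches are compared color by color.
    public static List<(Parsed First, Parsed Second)> FindDuplicates(IEnumerable<Parsed> items)
    {
        var buckets = new Dictionary<(int, int, int), List<Parsed>>();
        var pairs = new List<(Parsed First, Parsed Second)>();
        foreach (var item in items)
        {
            var key = (item.Width, item.Height, item.Rgb.Length);
            if (!buckets.TryGetValue(key, out var bucket))
                buckets[key] = bucket = new List<Parsed>();
            foreach (var seen in bucket)
                if (SameImage(seen, item)) pairs.Add((seen, item));
            bucket.Add(item);
        }
        return pairs;
    }
}
```

Images that differ in size or color count never meet in the same bucket, so the color walk only runs for candidates that already passed the quick tests.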
Points of Interest
I kept trying to find the best way of comparing the color information arrays to obtain the perfect sort where sets of pictures would be magically grouped together. The results were always a bit different but never perfect no matter how complex the comparison routines were. I tried things like sorting the array by the number of each color present after the extraction. I tried limiting the array comparison depth to an adjustable set number of colors.
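As an illustration of the depth-limited idea, here is a minimal sketch that compares only the first maxDepth colors of two arrays. It assumes each color is packed as a non-negative 0xRRGGBB integer, so that plain integer order matches the R, then G, then B order used elsewhere in this article. It ignores the Amount values for brevity, and DepthSketch is a made-up name.

```csharp
using System;

public static class DepthSketch
{
    // Compares at most maxDepth leading colors of x and y.
    // Colors are packed 0xRRGGBB, so integer order equals R, then G, then B order.
    public static int Compare(int[] x, int[] y, int maxDepth)
    {
        int c = Math.Min(Math.Min(x.Length, y.Length), Math.Max(0, maxDepth));
        for (int i = 0; i < c; i++)
        {
            int k = x[i].CompareTo(y[i]);
            if (k != 0) return k;
        }
        // No difference within the compared depth.
        return 0;
    }
}
```

A small depth makes many images compare as equal, which groups them loosely; a large depth behaves like the full comparison.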
I think the main reason the sort results aren't perfect is that the shades of colors in images vary quite a bit. A group of images might all contain cyans, but the individual cyan shades can be a bit darker or lighter across the entire subgroup of images. A cloud passing overhead during filming would alter the color shades.
Are there words to describe how incomprehensible raw image color data is to me? I set breakpoints in the comparison routines and tried to spot trends in the color data, to no avail. The best I could do was let the sort run and look at the resulting order of the thumbnails in a ListView.
Comparing for duplicate images based on the color extraction has always worked perfectly. I didn't experiment with the size of the thumbnail to see how it affected the chance of catching a duplicate. I suppose two similar images could be shrunk to 128 pixels and be considered duplicates by the program, but I never encountered that condition.
The included ImageParser class contains a reference to a PictureItem class which stores the image data to SQL. The reference and any code that uses it can be safely removed. Had I included the PictureItem class, I might as well have uploaded the whole project, because the references would pile up quickly.
The Sorts.cs file also references various classes of mine that won't be included at this time. Note that the LVItem class is a simple ListViewItem wrapper.
Come to think of it, there is nothing revolutionary here: extracting color from a bitmap and sorting it. But when I put it all together and the images came up sorted, it was a surprise. I think the image duplication test could be commercially viable if the color array could be saved in the database as a blob.
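I have not implemented the blob idea, but a minimal sketch of one way to pack the color array into a byte array might look like this. The record layout is purely an assumption (a 32-bit count, then 7 bytes per color), and ColorBlob and Entry are hypothetical names; the resulting byte array would go into whatever binary column type the database offers.

```csharp
using System;
using System.IO;

public static class ColorBlob
{
    // Stand-in for PicColor; only the stored fields.
    public struct Entry { public byte R, G, B; public int Amount; }

    // Assumed layout: int32 count, then count records of
    // R, G, B (1 byte each) followed by Amount (int32, little-endian).
    public static byte[] ToBlob(Entry[] colors)
    {
        using (var ms = new MemoryStream(4 + colors.Length * 7))
        using (var w = new BinaryWriter(ms))
        {
            w.Write(colors.Length);
            foreach (var c in colors)
            {
                w.Write(c.R);
                w.Write(c.G);
                w.Write(c.B);
                w.Write(c.Amount);
            }
            w.Flush();
            return ms.ToArray();
        }
    }

    // Reads back an array written by ToBlob.
    public static Entry[] FromBlob(byte[] blob)
    {
        using (var ms = new MemoryStream(blob))
        using (var r = new BinaryReader(ms))
        {
            int count = r.ReadInt32();
            if (count < 0 || (long)count * 7 > blob.Length - 4)
                throw new InvalidDataException("Color count does not match blob size.");

            var colors = new Entry[count];
            for (int i = 0; i < count; i++)
            {
                colors[i].R = r.ReadByte();
                colors[i].G = r.ReadByte();
                colors[i].B = r.ReadByte();
                colors[i].Amount = r.ReadInt32();
            }
            return colors;
        }
    }
}
```

With this layout, and assuming neither side of a thumbnail exceeds 128 pixels, the worst case of 16,384 unique colors comes to 114,692 bytes per image, and loading is a single read instead of one row per color.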