I have a very large database of jpeg images, about 2 million. I would like to do a fuzzy search for duplicates among those images. Duplicate images are two images that have many (around half) of their pixels with identical values and the rest are off by about +/- 3 in their R/G/B values. The images are identical to the naked eye. It's the kind of difference you'd get from re-compressing a jpeg.
I already have a foolproof way to detect if two images are identical: I sum the delta-brightness over all the pixels and compare to a threshold. This method has proven 100% accurate but doing 1 photo against 2 million is incredibly slow (hours per photo).
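For reference, the exact check amounts to something like this (a minimal numpy sketch; per-channel deltas stand in for the brightness delta, and the threshold is a placeholder):

    import numpy as np

    def is_duplicate(a, b, threshold):
        """a, b: HxWx3 uint8 arrays of the same shape.
        Sum of per-pixel deltas, compared against a threshold
        (the threshold scales with the image size)."""
        delta = np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()
        return delta < threshold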
I would like to fingerprint the images in a way that I could just compare the fingerprints in a hash table. Even if I can reliably whittle down the number of images that I need to compare to just 100, I would be in great shape to compare 1 to 100. What would be a good algorithm for this?
Have a look at O. Chum, J. Philbin, and A. Zisserman, Near duplicate image detection: min-hash and tf-idf weighting, in Proceedings of the British Machine Vision Conference, 2008. They solve the problem you have and demonstrate the results for 146k images. However, I have no first-hand experience with their approach.
Naive idea: create a small thumbnail (50x50 pixels) to find "probably identical" images, then increase thumbnail size to discard more images.
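A rough sketch of that idea (thumbnail size and tolerance are arbitrary choices here):

    import numpy as np
    from PIL import Image

    def thumb(path, size=(50, 50)):
        """Small grayscale thumbnail used as a coarse fingerprint."""
        return np.asarray(Image.open(path).convert("L").resize(size), dtype=np.int16)

    def probably_identical(t1, t2, tol=4):
        """Candidate pair if no thumbnail pixel differs by more than tol."""
        return np.abs(t1 - t2).max() <= tol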
Building on the idea of minHash...
My idea is to make 100 look-up tables using all the images currently in the database. Each look-up table maps the brightness of one particular pixel to the list of images that have that same brightness at that pixel. To search for an image, just feed it into the hash tables, get 100 lists, and score a point for each image every time it shows up in a list. Each image will have a score from 0 to 100. The image with the most points wins.
There are many issues with how to do this within reasonable memory constraints and how to do it quickly. Proper data structures are needed for storage on disk. Tweaking of the hashing value, number of tables, etc, is possible, too. If more information is needed, I can expand on this.
My results have been very good. I'm able to index one million images in about 24 hours on one computer and I can lookup 20 images per second. Accuracy is astounding as far as I can tell.
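A minimal sketch of the scheme (the sampled pixel positions, the brightness bucketing used to absorb the +/- 3 jitter, and the shortlist size are all knobs to tweak):

    from collections import Counter, defaultdict

    N_TABLES, BUCKET = 100, 8  # 100 sampled pixel positions, brightness bucketed by 8

    def keys(gray, positions):
        """Bucketed brightness at each sampled (x, y) position of a grayscale image.
        `positions` is a fixed list of N_TABLES sample points chosen once."""
        return [int(gray[y][x]) // BUCKET for (x, y) in positions]

    def build_index(images, positions):
        """images: iterable of (image_id, gray) pairs; one table per sampled pixel."""
        tables = [defaultdict(list) for _ in range(N_TABLES)]
        for image_id, gray in images:
            for table, k in zip(tables, keys(gray, positions)):
                table[k].append(image_id)
        return tables

    def candidates(gray, tables, positions, shortlist=100):
        """Score one point per table hit; return the best-scoring image ids."""
        score = Counter()
        for table, k in zip(tables, keys(gray, positions)):
            score.update(table.get(k, []))
        return [image_id for image_id, _ in score.most_common(shortlist)]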
I don't think this problem can be solved by hashing alone. Here's the difficulty: suppose you look at one pixel's red value, and you want 3 and 5 to hash to the same value. Well, then you also want 5 and 7 to hash to the same value, and 7 and 9, and so on... you can construct a chain that forces every value to hash to the same thing.
Here's what I would try instead:
Build a huge B-tree, with 32-way fanout at each node, containing all of the images.
All images in the tree are the same size, or they're not duplicates.
Give each color component of each pixel a unique number starting at zero. The upper-left pixel might be numbered 0, 1, 2 for its R, G, B components, or you might be better off with a random permutation, because you're going to compare images in the order of that numbering.
An internal node at depth n discriminates 32 ways on the value of pixel n divided by 8 (this smooths out some of the noise in nearby pixel values).
A leaf node contains some small number of images, let's say 10 to 100. Or maybe the number of images is an increasing function of depth, so that if you have 500 duplicates of one image, after a certain depth you stop trying to distinguish them.
Once all two million images are inserted in the tree, two images are duplicates only if they're at the same node. Right? Wrong! If the pixel values in two images are 127 and 128, one goes into outedge 15 and the other into outedge 16. So when you discriminate on a pixel, you may actually insert the image into one or two children:
For brightness B, insert at B/8, (B-3)/8, and (B+3)/8 (integer division). Sometimes all 3 will be equal, and at least 2 of the 3 always will be. But with probability 3/4 (6 of the 8 residues mod 8 land within 3 of a bucket boundary), you double the number of outedges on which the image appears. Depending on how deep things go, you could have lots of extra nodes.
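A tiny sketch of that child-bucket computation (integer division, with the bucket width and tolerance used above):

    def child_buckets(brightness, width=8, tol=3):
        """Child edges a brightness value falls into, allowing +/- tol of noise.
        One bucket when the value is well inside an interval, two near a boundary."""
        lo = max(brightness - tol, 0)
        hi = min(brightness + tol, 255)
        return {lo // width, brightness // width, hi // width}

    # child_buckets(123) == {15}; child_buckets(127) == {15, 16}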
Someone else will have to do the math and see if you have to divide by something larger than 8 to keep images from being duplicated too much. The good news is that even if the true fanout is only around 4 instead of 32, you only need a tree of depth 10. Four duplications in 10 takes you up to 32 million images at the leaves. I hope you have plenty of RAM at your disposal! If not, you can put the tree in the filesystem.
Let me know how this goes!
Another good thing about hashing from thumbnails: scaled duplicates are recognized as well (with little modification).
I'm currently making a custom dataset with 1 class. The images I am labeling contain several of these objects each (between 30 and 70). I therefore wonder if I should count each object in each image as "1 datapoint" when evaluating the size of the dataset?
I.e.: do more objects per image mean fewer images are required?
Since this is a detection problem, the size of the dataset is given by both the number of images and the number of objects. There is no reason to choose one of the two, because both numbers are equally important.
If you really want to define "size", you probably have to start from the error metric. For object detection, mIoU (Mean Intersection over Union) is usually used. This metric works at the object level, so it doesn't care whether you have 10 images or 1 million.
Finally, it could be that having many objects per image will allow you to use a smaller number of total images, but this can only be confirmed experimentally.
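For reference, IoU is computed per ground-truth/predicted box pair, which is why the number of images never enters the metric directly. A minimal sketch, with boxes given as (x1, y1, x2, y2):

    def iou(a, b):
        """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union else 0.0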
I'm using example code to compare HSV histograms using EMD (Earth Mover's Distance).
I want to find similar images in people's (mobile) picture library. It's quite common that people take several images of the same subject (in a row) with just slight changes: zooming in/out a bit, different angle, different exposure as a result of changing position, other pose, ....
I selected 4 sets of 4 similar images to test this algorithm. When comparing the images inside the sets, I get 22 EMD-L1 values between roughly 0.25 and 2.25 (average 1.47) and 2 outliers around 7.2.
When I cross-compare between sets, I get values between 2 and 15, with an average around 8.
Yes, there is a significant difference between the two ranges. But I was disappointed that there was no gap between them; instead there is a small overlap, [2.0, 2.25]. I'm hoping to improve the algorithm.
How can I optimise my comparison for my particular use-case? There are various histogram forms, various histogram comparison algorithms, and then each has various parameters.
Does OpenCV implement the fastest known EMD algorithm? I was surprised that the comparison of some histograms took up to a second; especially with the relatively small bin numbers.
Then, some cross-comparisons give good EMD results, but have totally different RGB histograms. Here are two images:
My current EMD-L1 says 1.95, but the RGB histograms are totally different.
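For reference, the comparison is roughly of this shape (a simplified sketch, not my actual code; the file names, bin counts, and normalization are placeholders, with L1 as the ground distance):

    import cv2
    import numpy as np

    def hs_signature(path, h_bins=8, s_bins=8):
        """Normalized H/S histogram turned into an EMD signature: rows of (weight, h, s)."""
        hsv = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [h_bins, s_bins], [0, 180, 0, 256])
        hist /= hist.sum()
        rows = [[hist[h, s], float(h), float(s)]
                for h in range(h_bins) for s in range(s_bins) if hist[h, s] > 0]
        return np.array(rows, dtype=np.float32)

    dist, _, _ = cv2.EMD(hs_signature("a.jpg"), hs_signature("b.jpg"), cv2.DIST_L1)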
Probably you've already refined your comparison method by now, but in case it isn't obvious: you could divide the image into overlapping subregions and then compute the EMD for each of the 4 parts.
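Sketching that idea (the 25% overlap is an arbitrary choice; each pair of corresponding crops would then go through the same histogram/EMD comparison as above, and the four distances can be summed or averaged):

    def overlapping_quadrants(img, overlap=0.25):
        """Four overlapping quadrant crops of an image (numpy array, H x W x C).
        Plain slicing only; each crop covers half the image plus the overlap."""
        h, w = img.shape[:2]
        ch, cw = int(h * (0.5 + overlap / 2)), int(w * (0.5 + overlap / 2))
        return [img[:ch, :cw], img[:ch, w - cw:], img[h - ch:, :cw], img[h - ch:, w - cw:]]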
I'm having trouble working with the underlying memory of an MPSImage. I've been using the methods getBytes and replace on MPSImage's texture member variable to read and write the underlying data. The problem is I can't find documentation of how the memory is interpreted as an image (i.e. how the rows, columns, and channels are laid out). Part of what complicates the issue is that regardless of the number of feature channels, the data is stored as a stack of RGBA texture slices, with some channels possibly left unused. For instance, with 3 feature channels, there will be one RGBA texture slice, and one channel's worth of space will be left unused.
The problem is, how is the MPSImage data actually arranged within the texture? It seems more complicated than I originally would have guessed.
After much experimentation, it seems like the data is arranged differently depending on whether the number of feature channels is < 4 or > 4. But I'm still having trouble figuring it out.
Can anyone explain the MPSImage data layout to me?
The first four feature channels are encoded as they would be for a standard RGBA texture. Feature channel 0 is in the "R" position, feature channel 1 is in the "G" position and so forth.
The next four feature channels are present as the next slice in a texture2d_array. If you have a 100x100 image with 20 feature channels, this will be encoded as a 100x100 texture array with (20/4=) 5 slices in the array.
To make matters more complicated, you can have MPSImages that contain multiple images, each with more than 4 feature channels. This is frequently referred to as batching. The second image is found in the texture array immediately after the first image. If we have multiple 100x100x20 images in the MPSImage, then the second one starts at slice 5, the third one at slice 10, and so forth.
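Just to make the addressing concrete, this is the index arithmetic that layout implies (not an MPS API call, only a sketch of the slice/component bookkeeping described above):

    import math

    def slice_and_component(image_index, feature_channel, channels_per_image):
        """Texture-array slice and RGBA component for a given feature channel,
        assuming 4 channels packed per slice and images stored back to back."""
        slices_per_image = math.ceil(channels_per_image / 4)
        slice_index = image_index * slices_per_image + feature_channel // 4
        return slice_index, "RGBA"[feature_channel % 4]

    # 100x100x20 images: channel 13 of the second image (index 1) -> slice 8, "G"
    print(slice_and_component(1, 13, 20))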
I'm using GPUImage in my project and I need an efficient way of taking the column sums. The naive way would obviously be to retrieve the raw data and add up the values of every column. Can anybody suggest a faster way to do that?
One way to do this would be to use the approach I take with the GPUImageAverageColor class (as described in this answer), only instead of reducing the total size of each frame at each step, only do this for one dimension of the image.
The average color filter determines the average color of the overall image by stepping down in a factor of four in both X and Y, averaging 16 pixels into one at each step. If operating in a single direction, you should be able to use hardware interpolation to get an 18X reduction in a single direction per step with good performance. Your final step might either require a quick CPU-based iteration on the much smaller image or a tweaked version of this shader that pulls the last few pixels in a column together into the final result pixel for that column.
You'll notice that I've been talking about averaging here, because the output values of any OpenGL ES operation need to be expressed as colors, which only have a 0-255 range per channel. A sum will easily overflow this, but you could use an average as an approximation of your sum, with a more limited dynamic range.
If you only care about one color channel, you could possibly encode a larger value into the RGBA channels and maintain a 32-bit sum that way.
Beyond what I describe above, you could look at performing this sum with the help of the Accelerate framework. While probably not quite as fast as doing a shader-based reduction, it might be good enough for your needs.
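Not GPUImage code, but a small CPU sketch (numpy) of the one-dimensional averaging reduction described above; it can serve as a reference when validating a shader version:

    import numpy as np

    def column_sums_via_averaging(channel, step=4):
        """Repeatedly average groups of `step` rows until a single row remains.
        When the height is a power of `step` this equals channel.mean(axis=0)
        times the height; averaging (instead of summing) keeps intermediate
        values inside the 0-255 range."""
        img = channel.astype(np.float32)
        height = img.shape[0]
        while img.shape[0] > 1:
            pad = (-img.shape[0]) % step
            if pad:  # repeat the last row so every group has `step` rows
                img = np.vstack([img, np.repeat(img[-1:], pad, axis=0)])
            img = img.reshape(-1, step, img.shape[1]).mean(axis=1)
        return img[0] * height  # approximate per-column sums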
Given a dataset with 23 points spread over 6 dimensions, the first part of this exercise asks us to do the following (I am stuck on the second half):
Compute the first step of the CLIQUE algorithm (detection of all dense cells). Use three equal intervals per dimension in the domain 0..100, and consider a cell as dense if it contains at least five objects.
Now this is trivial and simply a matter of counting. The next part asks the following though:
Identify a way to compute the above CLIQUE result by only using the functions of Weka provided in the tabs Preprocess, Classify, Cluster, or Associate.
Hint: just two tabs are needed.
I've been trying this for over an hour now, but I can't seem to get anywhere near a solution. If anyone has a hint, or maybe a useful tutorial that gives me a little more insight into Weka, it would be very much appreciated!
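(For what it's worth, the counting in the first part boils down to something like this per dimension; a quick sketch with placeholder names.)

    from collections import Counter

    def dense_units_1d(points, bins=3, lo=0.0, hi=100.0, min_points=5):
        """For each dimension, the intervals (0, 1 or 2) holding at least
        `min_points` of the points; CLIQUE builds higher-dimensional dense
        cells bottom-up from these 1-D dense units."""
        width = (hi - lo) / bins
        dense = {}
        for d in range(len(points[0])):
            counts = Counter(min(int((p[d] - lo) // width), bins - 1) for p in points)
            dense[d] = sorted(c for c, n in counts.items() if n >= min_points)
        return dense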
I am assuming you have 23 instances (rows) and 6 attributes (dimensions).
Use three equal intervals per dimension
Use the Preprocess tab to discretize your data into 3 equal bins; see the filter command line below. The 3 bins correspond to the 3 intervals. You may also try switching useEqualFrequency between false and true; I think true may give better results.
weka.filters.unsupervised.attribute.Discretize -B 3 -M -1.0 -R first-last
After that, cluster your data (the Cluster tab). This will show you which instances are near each other. Since you would like to find dense cells, I think SOM may be appropriate.
a cell as dense if it contains at least five objects.
You have 23 instances. Therefore try 2x2=4 cluster centers, then 2x3=6, 2x4=8, and 3x3=9. If your data points really are near each other, some of the cluster centers should always hold at least 5 instances, no matter how many cluster centers you choose.