I'm currently making a custom dataset with one class. The images I am labeling contain several of these objects each (between 30 and 70). I therefore wonder whether I should count each object in each image as "one data point" when evaluating the size of the dataset.
I.e.: do more objects per image mean I need fewer images?
Since this is a detection problem, the size of the dataset is given by both the number of images and the number of objects. There is no reason to choose one of the two, because they are both equally important numbers.
If you really want to define "size", you probably have to start from the error metric. For object detection, mIoU (mean Intersection over Union) is commonly used. This metric is computed at the object level, so it doesn't care whether you have 10 or 1 million images.
Finally, it may be that having many objects per image allows you to use a smaller total number of images, but this can only be confirmed experimentally.
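As a point of reference, here is a minimal sketch of how IoU is computed for one pair of boxes; the coordinates and the `iou` helper are purely illustrative, not something from the question:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Every predicted/ground-truth object pair contributes one IoU value,
# so the metric scales with the number of objects, not the number of images.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```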
I am classifying data using a trained model, and the results vary with the dataset size. For example, suppose I have n rows initially, classify them, and get a set of results X. If I then add m rows, so that the dataset has n + m rows, and classify it again, the results for the first n rows are different as well, and the change is not negligible. Can anyone provide some insight into this? Please let me know if the question is not clear. I am using R and the classifier is an SVM.
If I understood you correctly, the reason is that an SVM model is a representation of all your samples as points in space.
Just from Wikipedia:
That means all your data is mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
All your examples are mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
Since all of the data is mapped, a new dataset can mean a new division: adding the m rows and refitting can move the gap, so the predictions for the original n rows change as well.
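You mention R, but here is a hedged illustration of the same effect in Python with scikit-learn on synthetic data (the data, kernel, and class labels are all made up for the example): refitting the SVM on n + m rows can change the predictions for the original n rows.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, m = 100, 100
X_n = rng.normal(size=(n, 2))
y_n = (X_n[:, 0] + X_n[:, 1] > 0).astype(int)
X_m = rng.normal(loc=1.0, size=(m, 2))          # extra rows from a shifted region
y_m = (X_m[:, 0] - X_m[:, 1] > 0).astype(int)

pred_small = SVC(kernel="rbf").fit(X_n, y_n).predict(X_n)
pred_large = SVC(kernel="rbf").fit(np.vstack([X_n, X_m]),
                                   np.concatenate([y_n, y_m])).predict(X_n)

# The decision boundary moved, so some of the first n predictions differ.
print("changed predictions:", (pred_small != pred_large).sum())
```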
The Davies-Bouldin index is basically the ratio of within-cluster scatter to between-cluster distance. We compute that for all clusters and finally take the maximum. My question is: why the maximum and not the minimum?
Thank you.
Consider the following scenario:
Three clusters. One is well separated from the others, two are conflated.
Let the within-cluster scatter S_i be 0.5 for all of them.
For the conflated pair, the distance between the cluster means, M_ij, is close to zero. For the well-separated cluster, the distance to the other means is much larger. Since R_ij = (S_i + S_j) / M_ij, the resulting ratio is large for the conflated clusters and small for the separated one.
If you take the maximum, the index says "two clusters are mixed up, the result is thus bad - not all clusters are well separated". If you used the minimum, it would ignore this problem and say "well, at least it separated them from one of the other clusters".
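A small numeric sketch of this scenario (the centroid positions are chosen for illustration; the standard definition R_ij = (S_i + S_j) / M_ij is assumed):

```python
import numpy as np

# Within-cluster scatter S_i = 0.5 for all three clusters.
S = np.array([0.5, 0.5, 0.5])
# Centroids: clusters 0 and 1 are conflated, cluster 2 is far away.
centroids = np.array([[0.0, 0.0], [0.2, 0.0], [10.0, 0.0]])

k = len(S)
R = np.zeros((k, k))
for i in range(k):
    for j in range(k):
        if i != j:
            M_ij = np.linalg.norm(centroids[i] - centroids[j])
            R[i, j] = (S[i] + S[j]) / M_ij

D_max = R.max(axis=1)                            # per-cluster worst case (the usual choice)
D_min = np.where(R > 0, R, np.inf).min(axis=1)   # hypothetical "minimum" variant

print("max per cluster:", D_max)   # clusters 0 and 1 score ~5.0 -> clearly bad
print("min per cluster:", D_min)   # every cluster looks fine (~0.1), hiding the overlap
print("DB index:", D_max.mean())
```

With the maximum, the conflated pair dominates the score; with the minimum, the overlap would be invisible.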
I work with a lot of histograms. In particular, these histograms are of basecalls along segments on the human genome.
Each point along the x-axis is one of the four nitrogenous bases (A, C, T, G) that compose DNA, and the y-axis represents how many times a base was "called" (i.e., recognized by a sequencing machine; sequencing the genome is simply determining the identity of each base along it).
Many of these histograms display roughly linear dropoffs (when the machines aren't able to get sufficient read depth) that fall from plateau-like regions to zero or almost zero. When the score drops to zero, it means the sequencer isn't able to determine the identity of the base; if you've seen the double helix, it means the sequencer can't figure out the identity of one half of a rung of the helix. Certain regions of the genome are more difficult to characterize than others.
Bases (x data points) with high numbers of basecalls, on the order of >= 100, can be definitively identified. For example, if there were a total of 250 calls for one base, and we had 248 T's, 1 G, and 1 A called, we would call that base a T. Regions with 0 basecalls are of concern, because then we have to infer the identity of the low-read region from neighboring regions.
Is there a straightforward algorithm for assigning these plots a score that reflects this tendency? See box.net/shared/nbygq2x03u for an example histogram.
You could just use the count of positions where the read depth was 0. The slope of the dropoff line could also be a useful indicator (a steep negative slope means a sharp drop from the plateau).
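A hedged sketch of what such a score might look like in Python; the weights, the linear fit via `numpy.polyfit`, and the example data are assumptions, not something taken from the question:

```python
import numpy as np

def dropoff_score(depths, zero_weight=1.0, slope_weight=1.0):
    """Score a per-base read-depth histogram: the count of zero-depth positions
    plus the steepness of the overall downward trend."""
    depths = np.asarray(depths, dtype=float)
    zero_count = int((depths == 0).sum())
    slope, _ = np.polyfit(np.arange(len(depths)), depths, 1)  # linear fit over the segment
    return zero_weight * zero_count + slope_weight * max(0.0, -slope)

# A plateau around 250 calls that drops off to zero near the end of the segment.
example = [250] * 80 + list(range(250, 0, -25)) + [0] * 10
print(dropoff_score(example))
```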
I have a very large database of JPEG images, about 2 million of them, and I would like to do a fuzzy search for duplicates. Duplicate images are two images that have many of their pixels (around half) with identical values, and the rest off by about +/- 3 in their R/G/B values. The images are identical to the naked eye; it's the kind of difference you'd get from re-compressing a JPEG.
I already have a foolproof way to detect whether two images are identical: I sum the delta-brightness over all the pixels and compare the sum to a threshold. This method has proven 100% accurate, but checking 1 photo against 2 million is incredibly slow (hours per photo).
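For context, a minimal sketch of that kind of brute-force check in Python, assuming 8-bit brightness arrays and an illustrative threshold parameter (neither is from the question):

```python
import numpy as np

def probably_identical(img_a, img_b, threshold):
    """Sum the per-pixel brightness deltas and compare against a threshold,
    as described above. img_a and img_b are 2D uint8 brightness arrays."""
    if img_a.shape != img_b.shape:
        return False
    delta = np.abs(img_a.astype(np.int64) - img_b.astype(np.int64))
    return delta.sum() <= threshold
```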
I would like to fingerprint the images in a way that I could just compare the fingerprints in a hash table. Even if I can reliably whittle down the number of images that I need to compare to just 100, I would be in great shape to compare 1 to 100. What would be a good algorithm for this?
Have a look at O. Chum, J. Philbin, and A. Zisserman, Near duplicate image detection: min-hash and tf-idf weighting, in Proceedings of the British Machine Vision Conference, 2008. They solve the problem you have and demonstrate the results for 146k images. However, I have no first-hand experience with their approach.
Naive idea: create a small thumbnail (50x50 pixels) to find "probably identical" images, then increase thumbnail size to discard more images.
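A rough sketch of that idea, using a 50x50 grayscale thumbnail as a coarse fingerprint; Pillow and the brightness bucketing are assumptions added for the example:

```python
from PIL import Image
import numpy as np

def coarse_fingerprint(path, size=(50, 50), bucket=16):
    """Shrink the image, quantize its brightness, and hash the result.
    Images with the same fingerprint become 'probably identical' candidates."""
    img = Image.open(path).convert("L").resize(size)
    quantized = (np.asarray(img) // bucket).tobytes()
    return hash(quantized)
```

Candidates that collide can then be re-checked with a larger thumbnail or the exact comparison.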
Building on the idea of minHash...
My idea is to build 100 look-up tables using all the images currently in the database. Each look-up table maps the brightness of a particular pixel to the list of images that have the same brightness at that pixel. To search for an image, feed it into the hash tables, get 100 lists, and give an image a point each time it shows up in a list. Each image gets a score from 0 to 100, and the image with the most points wins.
There are many issues with how to do this within reasonable memory constraints and how to do it quickly. Proper data structures are needed for storage on disk. Tweaking of the hashing value, number of tables, etc, is possible, too. If more information is needed, I can expand on this.
My results have been very good. I'm able to index one million images in about 24 hours on one computer and I can lookup 20 images per second. Accuracy is astounding as far as I can tell.
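A simplified sketch of that voting scheme; the pixel positions, the table count, and the use of in-memory dicts are illustrative (the real thing needs the disk-backed structures mentioned above):

```python
import random
from collections import defaultdict

NUM_TABLES = 100
random.seed(0)
# One fixed pixel position (x, y) per table, sampled once up front.
PIXEL_POSITIONS = [(random.randrange(50), random.randrange(50)) for _ in range(NUM_TABLES)]
tables = [defaultdict(list) for _ in range(NUM_TABLES)]

def index_image(image_id, pixels):
    """pixels: 2D array of brightness values for a normalized 50x50 image."""
    for t, (x, y) in enumerate(PIXEL_POSITIONS):
        tables[t][pixels[y][x]].append(image_id)

def query(pixels):
    """Return candidate images sorted by how many tables they matched (0-100)."""
    votes = defaultdict(int)
    for t, (x, y) in enumerate(PIXEL_POSITIONS):
        for image_id in tables[t].get(pixels[y][x], []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
```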
I don't think this problem can be solved by hashing. Here's the difficulty: suppose you have a red pixel, and you want the values 3 and 5 to hash to the same value. Then you also want 5 and 7 to hash to the same value, and 7 and 9, and so on; you can construct a chain showing that you would want all values to hash to the same value.
Here's what I would try instead:
Build a huge B-tree, with 32-way fanout at each node, containing all of the images.
All images in the tree are the same size, or they're not duplicates.
Give each colored pixel a unique number starting at zero. Upper left might be numbered 0, 1, 2 for the R, G, B components, or you might be better off with a random permutation, because you're going to compare images in order of that numbering.
An internal node at depth n discriminates 32 ways on the value of pixel n divided by 8 (this removes some of the noise in nearby pixel values).
A leaf node contains some small number of images, let's say 10 to 100. Or maybe the number of images is an increasing function of depth, so that if you have 500 duplicates of one image, after a certain depth you stop trying to distinguish them.
Once all two million images are inserted in the tree, two images are duplicates only if they're at the same node. Right? Wrong! If the pixel values in two images are 127 and 128, one goes into out-edge 15 and the other into out-edge 16. So when you discriminate on a pixel, you may have to insert the image into one or two children:
For brightness B, insert at B/8, (B-3)/8, and (B+3)/8. Sometimes all three will be equal, and at least two of the three will always be equal. But with probability 3/8 you double the number of out-edges on which the image appears, and depending on how deep things go, you could end up with lots of extra nodes.
Someone else will have to do the math to see whether you have to divide by something larger than 8 to keep images from being duplicated too much. The good news is that even if the true fanout is only around 4 instead of 32, you only need a tree of depth 10, and four duplications over those 10 levels take your 2 million images up to 32 million entries at the leaves. I hope you have plenty of RAM at your disposal! If not, you can put the tree in the filesystem. A toy sketch of the insertion rule is below.
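Here is that toy sketch of the insertion rule described above; the in-memory dict nodes, the depth cutoff, and the flat pixel list are all simplifications for illustration:

```python
def child_keys(brightness, noise=3, bucket=8):
    """Out-edges an image can fall into for one pixel: B/8, (B-3)/8, (B+3)/8."""
    return {max(0, brightness - noise) // bucket,
            brightness // bucket,
            min(255, brightness + noise) // bucket}

def insert(node, image_id, pixels, depth=0, max_depth=10):
    """Insert an image into a 32-way tree keyed on successive pixel values.
    `pixels` is the flat list of brightness values in the chosen pixel order."""
    if depth == max_depth or depth >= len(pixels):
        node.setdefault("images", []).append(image_id)   # leaf bucket of candidates
        return
    for key in child_keys(pixels[depth]):
        child = node.setdefault(key, {})
        insert(child, image_id, pixels, depth + 1, max_depth)
```

Images that land in the same leaf bucket are the only candidates that still need the exact delta-brightness comparison.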
Let me know how this goes!
Another good thing about hashing thumbnails: scaled duplicates are recognized as well (with little modification).