I trained an object detector with CreateML, and when I test the model in CreateML I get a high number of identified objects for a single test image (many overlapping boxes for the same object).
Notes:
The model was trained on a small data set of ~30 images, with that particular label (face-gendermale) occurring ~20 times.
Each training image has 1-3 labelled objects.
There are 5 labels in total.
Questions:
Is that expected or is there something wrong with the model?
If this is expected, how should I evaluate these multiple results, or even count the number of objects found by the model?
Cross-posted in Apple Developer Forums.
A typical object detection model will make about 1000 predictions for every image (although it can be much more, depending on the model architecture). Most of these predictions have very low confidence, so they are filtered out. The ones that are left over are then sent through non-maximum suppression (NMS), which removes bounding boxes that overlap too much.
In your case, it seems that the NMS threshold is set too permissively, because many overlapping boxes survive.
However, it also seems that the model hasn't been trained very well yet, probably because you used very few images.
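As an illustration of those two filtering stages, here is a minimal NumPy sketch of confidence thresholding followed by greedy NMS; the box format [x1, y1, x2, y2] and the threshold values are assumptions for the example, not what CreateML uses internally.

import numpy as np

def iou(box, boxes):
    # Intersection-over-union between one box and an array of boxes, all as [x1, y1, x2, y2]
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def filter_predictions(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    # 1) Drop low-confidence predictions
    keep = scores >= conf_thresh
    boxes, scores = boxes[keep], scores[keep]
    # 2) Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much
    order = np.argsort(scores)[::-1]
    kept = []
    while order.size > 0:
        best = order[0]
        kept.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return boxes[kept], scores[kept]

Raising conf_thresh and lowering iou_thresh both make the filtering stricter, which is usually what you want when many overlapping boxes survive.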
Related
I am training an object detection model for multi-class objects in images. The dataset is custom-collected and labelled, with bounding boxes and class labels as the ground truth.
I trained the MobileNet+SSD, SqueezeDet and YOLOv3 networks with this custom data but get poor results. The rationale for choosing these models is their fast performance and light weight (low memory footprint). Their single-shot detector approach has also been shown to perform well in the literature.
The class instance distribution in the dataset is as below
Class 1 -- 2469
Class 2 -- 5660
Class 3 -- 7614
Class 4 -- 13253
Class 5 -- 35262
Each image can have objects from any of the five classes. Classes 4 and 5 have a very high incidence.
The performance is very skewed, with high recall and Average Precision for classes 4 and 5, and scores an order of magnitude lower for the other three classes.
I have tried fine-tuning different filtering parameters, the NMS threshold, and model training parameters, to no avail.
Question:
How can I tackle this class imbalance to boost the Average Precision and detection accuracy for all classes in object detection models?
Low precision means your model is suffering from false positives, so you can try hard negative mining: run your model, find the false positives, and include them in your training data as negative examples. You can even try using only these mined false positives as your negative examples.
As you might expect, another way is to collect more data, if possible.
If that is not possible, you may consider adding synthetic data (e.g. change the brightness of an image, or change the viewpoint by applying a perspective transform so the image looks stretched).
One last option is to have a comparable amount of data for each class, e.g. around 5k examples per class.
PS: Keep in mind that the flexibility (capacity) of your model has a great impact, so be aware of overfitting and underfitting.
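For the synthetic-data suggestion above, here is a minimal sketch using torchvision transforms; the jitter and perspective parameters are arbitrary choices for illustration, and for object detection the geometric transforms would also have to be applied to the bounding boxes (e.g. with albumentations or torchvision.transforms.v2).

from torchvision import transforms

# Randomly vary brightness and viewpoint, applied independently to each sample
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3),                     # random brightness change
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),  # random "stretched" viewpoint
    transforms.RandomHorizontalFlip(p=0.5),
])

# augmented = augment(pil_image)  # apply to a PIL image on the fly during training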
When generating your synthetic data as mentioned by the previous author, do not apply illumination or viewpoint variations, etc. to your whole dataset, but rather apply them randomly. The class counts are also way off balance, and it is best to either limit the numbers for the over-represented classes or gather more data for the under-represented ones. You could also try applying class weights to penalize the over-represented classes more, as sketched below. You are making a lot of assumptions; simple experimentation will yield results that could surprise you. Remember that deep learning is part science and a lot of art.
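As a sketch of the class-weighting idea, here is one common scheme (inverse-frequency weights), shown with PyTorch's cross-entropy loss as a stand-in for the classification part of a detection loss; the exact loss your detector uses will differ, and the counts are taken from the distribution given in the question.

import torch
import torch.nn as nn

# Instance counts per class, from the question's distribution
counts = torch.tensor([2469., 5660., 7614., 13253., 35262.])

# Inverse-frequency weights, normalized so they average to 1
weights = counts.sum() / (len(counts) * counts)

# Classification loss that penalizes errors on the rare classes more heavily
criterion = nn.CrossEntropyLoss(weight=weights)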
I have a dataset of images for classification purposes. The dataset is very large and most of the images are duplicates of each other. So essentially, the same image occurs multiple times. Moreover, the dataset is unbalanced.
I understand the motivation for cleaning the dataset of duplicates, but doing so is extensive and very time-consuming.
Is there a way to train a net on this dataset, and not overfit the model?
Could enforcing harsher regularization, dropout, or loss penalties still produce a usable model?
As suggested by Jon.H in the comments, instead of training your model on a dataset with duplicates, you could use image hashing to detect and remove them from the dataset. Although cryptographic hashing (like MD5 and SHA1) will suffice to find exact duplicates, according to your comment you would also like to get rid of similar images, not just exact duplicates (do you really want to do this? Having a bigger dataset is usually better for training, and keeping similar images with small variations, e.g. in color, is not necessarily a bad thing -- see "data augmentation").
Generating a hash for images is not robust to slight changes in pixel values, say minor lighting changes which aren't visible to the eye but where the pixel values differ. - Ronica Jethwa
One solution to this is to use perceptual hashing, which is quite robust to minor differences in color, rotation, aspect ratio of images, etc. In particular, I would suggest you try the pHash algorithm based on the Discrete Cosine Transform, as described in Looks-Like-It. There is a Python library that implements it, called imagehash. Here's how to use it:
from PIL import Image
import imagehash
# Compute the perceptual hash values (64-bit) for two images
phash_1 = imagehash.phash(Image.open('image_1')) # e.g. d58e11ce51ee15aa
phash_2 = imagehash.phash(Image.open('image_2')) # e.g. d58e01ae519e559e
# Compare the images using the Hamming distance of their perceptual hashes
dist = phash_1 - phash_2
Then it's up to you to choose the similarity threshold for the Hamming distance.
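Building on the snippet above, here is a minimal deduplication sketch; the folder name and the Hamming-distance threshold of 8 are assumptions to be tuned on a sample of known duplicates, and the folder is assumed to contain only image files.

import os
from PIL import Image
import imagehash

folder = 'dataset/'          # hypothetical image directory
threshold = 8                # max Hamming distance to treat two images as near-duplicates

seen = {}                    # kept hash -> file name
duplicates = []
for name in sorted(os.listdir(folder)):
    h = imagehash.phash(Image.open(os.path.join(folder, name)))
    match = next((kept for kept in seen if h - kept <= threshold), None)
    if match is None:
        seen[h] = name                          # first time we see this (perceptual) image
    else:
        duplicates.append((name, seen[match]))  # near-duplicate of an earlier image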
Duplicates don't imply over-fitting; they just give that image more weight in the training. Yes, you can train on the data set; the results will be valid. For instance, if you have the same quantity of duplicates (say, 10 of everything), then you'll get the same results as if you had just one of each -- or almost: the shuffling order can slightly affect the balance of training, since a single image can now appear multiple times near the start of epoch 1.
The various counter-measures you list are good tools against over-fitting, but your main danger is simply what you have anyway: a potentially small set of unique examples.
Adding my two cents to this old question.
During training the problem arises only if you have a high chance of having many duplicates in a single batch.
Let's say you choose a batch size of 64; since you will randomly sample the images to compose the batch it could be that on average you have only 2 duplicates. This really depends on how many times (on average) an image is duplicated in proportion to the total number of images.
Anyway the problem is alleviated by using (online) data augmentation which introduces some differences, even between identical images.
The biggest problem is on the test set because the accuracy estimation will be biased towards the images with more duplicates, so I would embrace the effort and deduplicate the test (and validation) sets.
If you have the same images in the validation set as in the train set, but different ones in the test set, the validation will give a better (accuracy) score than the test set. In this case, it will look like overfitting. Duplicates occur naturally everywhere, therefore it must be OK.
Train with the duplicate data. Then use the representation vector, i.e. the output of the last convolutional layer (if you are using a pretrained CNN model, use its final output). Apply kNN or clustering on the representation vectors to identify duplicates. Remove the duplicates and retrain your model.
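A minimal sketch of that embedding-plus-nearest-neighbour idea, using a pretrained ResNet-18 from torchvision as the feature extractor and scikit-learn for the neighbour search; the folder name and the distance cutoff are assumptions you would tune on known duplicates.

import glob
import torch
from torchvision import models, transforms
from sklearn.neighbors import NearestNeighbors
from PIL import Image

# Pretrained CNN with the classifier removed: each image -> 512-d representation vector
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    batch = torch.stack([preprocess(Image.open(p).convert('RGB')) for p in paths])
    return backbone(batch).numpy()

paths = sorted(glob.glob('dataset/*.jpg'))   # hypothetical image folder
vectors = embed(paths)

# For each image, find its nearest neighbour (other than itself) and flag very close pairs
nn_index = NearestNeighbors(n_neighbors=2).fit(vectors)
dist, idx = nn_index.kneighbors(vectors)
cutoff = 1.0                                 # assumed distance threshold; tune on known duplicates
dup_pairs = [(paths[i], paths[idx[i, 1]]) for i in range(len(paths)) if dist[i, 1] < cutoff]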
I have downloaded a dataset of 10 object classes for object detection. The dataset is not divided into training, validation, and testing. However, the author has mentioned in his paper to divide the dataset into 20% training, 20% validation, and 60% testing, with images chosen randomly.
Following the criteria given by the author, I have randomly selected 20% of the images for training, 20% for validation, and 60% for testing.
I want to know a couple of things:
1) Do I need to put difficult images in the training set, the validation set, or the testing set? For example, currently there are 41 difficult images in the test set, 30 in the training set, and 20 in the validation set.
2) How can I ensure that all ten object classes are equally distributed?
Updated
3) Ideally, for a balanced split, should difficult images be equally distributed? And how much does it affect the result if the test set has more difficult images, or the training set does, or the validation set does?
Ten classes: Airplane, Storage tank, Baseball ground, Tennis Court, Basketball court, ground track field, Bridge, Ship, Harbor, and Vehicle.
I have 650 images in total; among them, 466 images contain exactly one class (an image can still contain more than one object).
Airplane = 88 images, Storage tank = 10 images, Baseball ground = 46 images, Tennis Court = 29 images, Basketball court = 32 images, ground track field = 55 images, Bridge = 58 images, Ship = 36 images, Harbor = 27 images, and Vehicle = 85 images.
Remaining 184 images have multiple classes.
In total 757 airplanes, 302 ships, 655 storage tanks, 390 baseball diamonds, 524 tennis courts, 159 basketball courts, 163 ground track fields, 224 harbors, 124 bridges, and 477 vehicles
The most common technique is random selection. For example, if you have 1000 images, you can create an array that contains the names of every file and shuffle the elements using a random permutation. Then you can use the first 200 elements for training, the next 200 elements for validation, and the remaining elements for testing (in the case of a 20%/20%/60% split).
If there is an extremely unbalanced class, you can force the same proportion of classes in every set. To do that, you must apply the procedure that I mentioned class by class.
You shouldn't choose images by hand. Even if you know that there are some difficult images in your dataset, you should not hand-pick them into the train, validation, and test sets.
If you want a fair comparison of your algorithm, and a few images can strongly affect the accuracy, you can repeat the random split several times. In some cases there will be many difficult images in the training set, and in other cases in the validation or test set. Then you can provide the mean and standard deviation of your accuracy (or whichever metric you are using), as sketched below.
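A minimal sketch of both ideas (a per-class 20/20/60 random split, repeated several times to report the mean and standard deviation of the metric); evaluate() is a placeholder for your real train-and-score routine, the small files_by_class dict is hypothetical, and for images containing several classes you would have to decide which class to stratify on.

import numpy as np

def split_20_20_60(files_by_class, rng):
    # Stratified split: shuffle and cut each class separately, then pool the parts
    train, val, test = [], [], []
    for files in files_by_class.values():
        files = list(files)
        rng.shuffle(files)
        n_tr = int(0.2 * len(files))
        n_va = int(0.2 * len(files))
        train += files[:n_tr]
        val += files[n_tr:n_tr + n_va]
        test += files[n_tr + n_va:]
    return train, val, test

def evaluate(train, val, test):
    # Placeholder for your real train-and-score routine
    return np.random.uniform(0.7, 0.8)

# Hypothetical mapping from class name to its image files (counts from the question)
files_by_class = {"Airplane": [f"air_{i}.jpg" for i in range(88)],
                  "Storage tank": [f"tank_{i}.jpg" for i in range(10)]}

# Repeat the random split several times and report mean and standard deviation
rng = np.random.default_rng(0)
scores = [evaluate(*split_20_20_60(files_by_class, rng)) for _ in range(5)]
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")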
UPDATED:
I see; from your description you have more than one object in an image, don't you?
For example, can you have two ships and one bridge?
I used to work with datasets that contain a single object in every image. To detect several objects in an image, I then scan different parts of the image looking for single objects.
Probably the author of the paper that you have mentioned divided the dataset randomly. If you use a more complex division in a research paper you should mention it.
About your question on the effect of having more difficult images in each set, the answer is complex. It depends on the algorithm and on how similar the images of the training set are to the images of the validation and test sets.
With a complex model (for example a neural net with a lot of layers and neurons) you can obtain whatever accuracy you want on the training set (for example 100%). Then, if the images are very similar to the images in the validation and test sets, the accuracy there will be similar. But if they are not very similar, you have overfitted and the accuracy will be lower on the validation and test sets. To solve that you need a simpler model (for example reducing the number of neurons or using a good regularization technique); in that case the accuracy will be lower on the training set, but the accuracy on the validation and test sets will be closer to the accuracy obtained on the training set.
What does dimensionality reduction mean exactly?
I searched for its meaning and only found that it means the transformation of raw data into a more useful form. So what is the benefit of having data in a more useful form? I mean, how can I use it in a practical application?
Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality, such that each of the lower dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example. Suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i'th movie, 1 otherwise].
You can perform your machine learning task on these vectors directly, but instead you could decide upon 5 genres of movies and, using the data you already have, figure out whether the person likes or dislikes each entire genre, and in this way reduce your data from a vector of size 100 to a vector of size 5 [position i is 1 if the person likes genre i].
The vector of length 5 can be thought of as a good representative of the vector of length 100 because most people might be liking movies only in their preferred genres.
However, it's not going to be an exact representative, because there might be cases where a person hates all movies of a genre except one.
The point is, that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
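A toy NumPy version of that genre idea; the like/dislike data is random and the genre assignment of each movie is made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
likes = rng.integers(0, 2, size=(1000, 100))    # 1000 people x 100 movies, 1 = likes
genre_of_movie = np.arange(100) % 5             # hypothetical genre label (0-4) for each movie

# Reduce each length-100 vector to length 5: does the person like most movies in the genre?
genre_pref = np.zeros((1000, 5), dtype=int)
for g in range(5):
    genre_pref[:, g] = likes[:, genre_of_movie == g].mean(axis=1) > 0.5

print(likes.shape, "->", genre_pref.shape)      # (1000, 100) -> (1000, 5)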
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis (incidentally, plotting its results was my first real-world programming task).
It's a neat but clever technique which is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things where one has two independent variables: to analyse the relationship between these, one obviously plots in two dimensions, and you might see a scatter of points. If you've got three variables you can use a 3D graph, but after that one starts to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which need to be plotted on perpendicular axes. Using PCA one does this, then analyses the resultant multidimensional graph to find the set of two or three axes within the graph which contain the largest amount of information. For example, the first principal coordinate will be a composite axis (i.e. at some angle through n-dimensional space) along which the plotted points carry the most information. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, and so on.
Plotting the resultant graph in 2D or 3D will typically give you a visualization of the data which contains a significant amount of the information in the original dataset. The technique is usually considered valid if it yields a representation that captures around 70% of the original variance: enough to visualize, with some confidence, relationships that would otherwise not be apparent in the raw statistics. Notice that the technique requires that all factors have the same weight, but it's an extremely widely applicable method that deserves to be more widely known and is available in most statistical packages (I did my work on an ICL 2700 in 1980, which is about as powerful as an iPhone).
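For a concrete sketch of the same idea with scikit-learn (random data standing in for the n-dimensional measurements; the columns are standardized first so all factors get the same weight, and explained_variance_ratio_ reports how much of the original variance the first few axes retain, the roughly 70% mentioned above):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                   # 200 samples described by 30 factors
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # make two factors strongly correlated

Xs = StandardScaler().fit_transform(X)           # give every factor the same weight
pca = PCA(n_components=3)
coords = pca.fit_transform(Xs)                   # coordinates along the first 3 principal axes

print(coords.shape)                              # (200, 3): plottable in 2D/3D
print(pca.explained_variance_ratio_.cumsum())    # fraction of the original variance retained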
http://en.wikipedia.org/wiki/Dimension_reduction
Maybe you have heard of PCA (principal component analysis), which is a dimension reduction algorithm.
Others include LDA, matrix factorization based methods, etc.
Here's a simple example. You have a lot of text files, and each file consists of some words. These files can be classified into two categories. You want to visualize each file as a point in a 2D/3D space so that you can see the distribution clearly. So you need to do dimension reduction to transform a file containing a lot of words into only 2 or 3 dimensions.
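A minimal sketch of that text example with scikit-learn; the documents are made up, and TruncatedSVD is used here simply because it works directly on sparse TF-IDF matrices.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs and cats are pets",       # hypothetical category A
        "stocks fell sharply today", "the market rallied again"]  # hypothetical category B

tfidf = TfidfVectorizer().fit_transform(docs)    # each file -> high-dimensional word vector
coords_2d = TruncatedSVD(n_components=2).fit_transform(tfidf)

print(tfidf.shape, "->", coords_2d.shape)        # e.g. (4, 17) -> (4, 2), ready to scatter-plot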
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space is 3 (x, y and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring longitude, latitude and height measurements to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get given a certain quantity of fuel, then it will be far easier to work with the 1 dimensional data than the 3 dimensional version.
It's a technique of data mining. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analyzing patterns in visual data, but can process a maximum of three dimensions (four if you use time, i.e. animated displays), so any data with more than 3 dimensions needs to somehow be compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
BTW, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
Suppose you're building a database of information about a large collection of adult human beings. It's also going to be quite detailed. So we could say that the database is going to have large dimensions.
As a matter of fact, each database record will actually include a measure of the person's IQ and shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQs, shoe sizes may be easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data for later. We would still be able to estimate IQs using shoe sizes, because the two measures are correlated.
We would be using a very simple form of practical dimension reduction by leaving IQ out of records initially. Principal components analysis, various forms of factor analysis and other methods are extensions of this simple idea.
I'm trying to build an app to detect images which are advertisements on webpages. Once I detect those, I'll not be allowing them to be displayed on the client side.
From the help that I got on this Stackoverflow question, I thought SVM would be the best approach for my aim.
So, I have coded the SVM and SMO myself. The dataset, which I got from the UCI data repository, has 3280 instances (Link to Dataset), where around 400 of them are from the class representing advertisement images and the rest represent non-advertisement images.
Right now I'm taking the first 2800 input samples and training the SVM on them. But after looking at the accuracy rate, I realised that most of those 2800 input samples are from the non-advertisement image class, so I'm getting very good accuracy for that class.
So what can I do here? About how many input samples should I give the SVM for training, and how many of them for each class?
Thanks. Cheers. (I basically made a new question because the context was different from my previous question, Optimization of Neural Network input data.)
Thanks for the reply.
I want to check whether I'm deriving the C values for the ad and non-ad classes correctly or not.
Please give me feedback on this.
Or you can see the doc version here.
You can see the graph for y1 equal to y2 here,
and for y1 not equal to y2 here.
There are two ways of going about this. One would be to balance the training data so it includes an equal number of advertisement and non-advertisement images. This could be done by either oversampling the 400 advertisement images or undersampling the thousands of non-advertisement images. Since training time can increase dramatically with the number of data points used, you should probably first try undersampling the non-advertisement images and create a training set with the 400 ad images and 400 randomly selected non-advertisements.
The other solution would be to use a weighted SVM so that margin errors for the ad images are weighted more heavily than those for non-ads, for the package libSVM this is done with the -wi flag. From your description of the data, you could try weighing the ad images about 7 times more heavily than the non-ads.
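In scikit-learn (which wraps libSVM) the same idea is exposed through the class_weight parameter of SVC; here is a minimal sketch with the roughly 7:1 weighting suggested above, where X and y would be your feature vectors and 0/1 ad labels.

from sklearn.svm import SVC

# Penalize margin errors on the rare "ad" class (label 1) about 7x more heavily
clf = SVC(kernel='rbf', C=1.0, class_weight={0: 1, 1: 7})
# clf.fit(X, y)   # X: feature vectors, y: 1 for ad, 0 for non-ad
# class_weight='balanced' would derive the weights from the class frequencies automatically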
The required size of your training set depends on the sparseness of the feature space. As far as I can see, you are not discussing which image features you have chosen to use. Before you can train, you need to convert each image into a vector of numbers (features) that describe the image, hopefully capturing the aspects that you care about.
Oh, and unless you are reimplementing SVM for sport, I'd recommend just using libSVM.