Object Detection Model training from scratch vs. pretrained weights

Hey, I have a question related to training an object detection model:
Let's say I have trained a model to detect one class with, say, 500 images including positive and negative samples, and saved the best weights (say best.weights is the weights file).
What are the pros and cons of the following training schedules with respect to object detection performance?
1) Training the model on a totally different set of 500 images (similar illumination and object sizes), starting from those pre-trained weights (best.weights).
2) Training the model on all of the 1000 images mentioned, from scratch (randomly initialized weights).
3) Training the model on all of the 1000 images, starting from weights pre-trained on the COCO dataset, knowing that COCO and those 1000 images have completely different characteristics, e.g., a different object-size/image-size ratio and different illumination, since the 1000 images span weather conditions (snowy days and summer days are bright, winter days are dark, etc.).
P.S.: assume the object class occurs in COCO.
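For concreteness, here is a minimal sketch of the three initialization schemes above. It uses torchvision's Faster R-CNN purely as a stand-in detector; your actual architecture and the best.weights checkpoint format are assumptions here, not part of the question.

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

def build_model(schedule: int):
    if schedule == 3:
        # Schedule 3: start from COCO-pretrained weights
        return fasterrcnn_resnet50_fpn(
            weights=FasterRCNN_ResNet50_FPN_Weights.COCO_V1)
    # Schedule 2: fully random initialization (backbone included)
    model = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None)
    if schedule == 1:
        # Schedule 1: resume from your own checkpoint trained on the
        # first 500 images (file name taken from the question)
        model.load_state_dict(torch.load("best.weights"))
    return model
```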

Related

Does an ML model classify between desired image classes or by datasets?

If I had Dataset 1 with 90% cat images and 10% dog images, and I combined it with Dataset 2, which contains only dogs, to equalize the class imbalance, will my model classify cats vs. dogs, or Dataset 1 images vs. Dataset 2 images?
If it's the latter, how do I get the model to classify between cats and dogs?
Your model will only do what it is trained for, regardless of what your dataset(s) are named.
The name of a dataset is purely organizational; it does not enter the training process and does not affect the loss produced during a training step. What does affect your model's responses are the properties of the data.
Sometimes data from different datasets have different properties even though the datasets serve the same purpose, e.g., images with different illumination, background, resolution, etc. That surely has an effect on model performance, which is why mixing datasets should be done with caution. You might find it useful to have a look at this paper.

Tackle class imbalance in a single-shot object detector

I am training an object detection model for multi-class objects in images. The dataset is custom-collected and labelled, with bounding boxes and class labels in the ground truth.
I trained MobileNet+SSD, SqueezeDet, and YOLOv3 networks with this custom data but get poor results. The rationale for choosing these models is their fast inference and light weight (low memory footprint), and their single-shot detector approach is shown to perform well in the literature.
The class instance distribution in the dataset is as follows:
Class 1 -- 2469
Class 2 -- 5660
Class 3 -- 7614
Class 4 -- 13253
Class 5 -- 35262
Each image can have objects from any of the five classes. Classes 4 and 5 have very high incidence.
The performance is very skewed, with high recall and Average Precision for classes 4 and 5, and an order of magnitude lower for the other three classes.
I have tried tuning various filtering parameters, the NMS threshold, and model training parameters, to no avail.
Question:
How can such class imbalance be tackled to boost the Average Precision and detection accuracy for all classes in object detection models?
Low precision means your model is suffering from false positives, so you can try hard negative mining: run your model, find the false positives, and include them in your training data as negative examples. You can even try using only those false positives as your negative examples.
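A minimal, framework-agnostic sketch of that mining step (boxes are (x1, y1, x2, y2) tuples; the detection list would come from running your model, which is assumed here, not shown):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def mine_hard_negatives(detections, gt_boxes, conf_thresh=0.5, iou_thresh=0.5):
    """Confident detections that overlap no ground-truth box, i.e. false
    positives worth adding back to the training set as explicit negatives."""
    return [(box, score) for box, score in detections
            if score >= conf_thresh
            and all(iou(box, gt) < iou_thresh for gt in gt_boxes)]
```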
Another option, as you would expect, is collecting more data if possible.
If that is not possible, you may consider adding synthetic data (e.g., change the brightness of the image, or change the viewpoint by multiplying with a matrix so the image looks stretched).
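A small OpenCV sketch of those two augmentations; the parameter ranges are arbitrary assumptions, and for detection data you would also apply the same warp to the bounding boxes:

```python
import cv2
import numpy as np

def augment(img, rng=np.random.default_rng()):
    # Random brightness shift
    out = np.clip(img.astype(np.int16) + int(rng.integers(-40, 41)), 0, 255)
    out = out.astype(np.uint8)
    # Random "stretch": jitter the image corners and warp with the
    # resulting perspective matrix
    h, w = out.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = rng.uniform(-0.05, 0.05, (4, 2)) * [w, h]
    dst = (src + jitter).astype(np.float32)
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(out, M, (w, h))
```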
One last option is having a balanced amount of data for each class, e.g., 5k per class.
P.S.: Keep in mind that the capacity of your model has a great impact, so watch out for overfitting and underfitting.
When generating your synthetic data as mentioned by the previous author, do not apply illumination or viewpoint variations etc. to your whole dataset, but rather apply them randomly. The class distribution is also way off, so it is best to either cap the dominant classes or gather more data for the under-represented ones. You could also apply class weights so that errors on the rare classes are penalized more. You are making a lot of assumptions; simple experimentation may yield results that surprise you. Remember, deep learning is part science and a lot of art.
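For the class-weight suggestion, one common recipe is inverse-frequency weighting; a sketch using the counts from the question and PyTorch's CrossEntropyLoss (where exactly this plugs into a particular detector's loss depends on its implementation):

```python
import numpy as np
import torch

# Inverse-frequency weights from the counts in the question:
# rarer classes get proportionally larger weights.
counts = np.array([2469, 5660, 7614, 13253, 35262], dtype=np.float64)
weights = counts.sum() / (len(counts) * counts)

criterion = torch.nn.CrossEntropyLoss(
    weight=torch.tensor(weights, dtype=torch.float32))
```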

Finding a suitable CNN architecture for classification

I want to use a convolutional neural network (CNN) to classify between two classes of images. I built several CNN architectures, but I always get the same result: the network classifies all cases as the second class. Therefore, I always get 50% accuracy in leave-one-out evaluation. The data is balanced in terms of the number of samples per class (16 from the 1st and 16 from the 2nd). Could you please clarify what this means?
With such a small number of training samples, your CNN model is very likely to overfit the data, giving good training accuracy but poor test accuracy.
Alternatively, your model may have collapsed to predicting the same class at all times.
Below are some of the solutions you can try:
1) As you have commented, if you cannot get any more images, then try creating new images by modifying the ones already available. For example, say you have 16 images of a cat (cat is the class): you can crop the cat and paste it onto different backgrounds, vary the brightness and intensity, and apply rotation and translation operations.
This will help you create a good training set.
2) Try creating a smaller model (with one or two layers) and check if it improves your accuracy.
3) You can do transfer learning with a good pre-trained model, as it tends to learn much better from few examples than a model built from scratch.
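A minimal Keras sketch of option 3, assuming 224x224 RGB inputs and MobileNetV2 as the pretrained base (any ImageNet model would do; these choices are illustrative):

```python
import tensorflow as tf

# Frozen ImageNet features + a new 2-class head trained on your images
base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```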

Minimum requirements for a Google TensorFlow image classifier

We are planning to build image classifiers using Google TensorFlow.
I wonder what the minimum and the optimum requirements are to train a custom image classifier using a convolutional deep neural network.
The questions are specifically:
How many images per class should be provided at a minimum?
Do we need to provide approximately the same number of training images per class, or can the amount per class be disparate?
What is the impact of wrong image data in the training set? E.g., 500 images of a tennis shoe and 50 of other shoes.
Is it possible to train a classifier with many more classes than the recently published Inception-v3 model? Let's say 30,000.
"how many images per class should be provided at a minimum?"
Depends how you train.
If training a new model from scratch, purely supervised: For a rule of thumb on the number of images, you can look at the MNIST and CIFAR tasks. These seem to work OK with about 5,000 images per class. That's if you're training from scratch.
You can probably bootstrap your network by beginning with a model trained on ImageNet. This model will already have good features, so it should be able to learn to classify new categories without as many labeled examples. I don't think this is well-studied enough to tell you a specific number.
If training with additional unlabeled data, maybe only 100 labeled images per class are needed. There is a lot of recent research on this topic, though it has not been scaled to tasks as large as ImageNet.
Simple to implement:
http://arxiv.org/abs/1507.00677
Complicated to implement:
http://arxiv.org/abs/1507.02672
http://arxiv.org/abs/1511.06390
http://arxiv.org/abs/1511.06440
"do we need to appx. provide the same amount of training images per class or can the amount per class be disparate?"
It should work with different numbers of examples per class.
"what is the impact of wrong image data in the training data? E.g. 500 images of a tennis shoe and 50 of other shoes."
You should use the label smoothing technique described in this paper:
http://arxiv.org/abs/1512.00567
Smooth the labels based on your estimate of the label error rate.
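That recipe reduces to mixing the one-hot targets with a uniform distribution over the classes; a minimal sketch:

```python
import numpy as np

def smooth_labels(one_hot, epsilon):
    """(1 - epsilon) * one_hot + epsilon / num_classes."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / k

# With an estimated 10% label error rate and 3 classes, the correct class
# gets ~0.933 and each wrong class ~0.033:
targets = smooth_labels(np.eye(3), epsilon=0.1)
```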
"is it possible to train a classifier with much more classes than the recently published inception-v3 model? Let's say: 30.000."
Yes
"How many images per class should be provided at a minimum?"
"Do we need to provide approximately the same number of training images per class, or can the amount per class be disparate?"
"What is the impact of wrong image data in the training set? E.g., 500 images of a tennis shoe and 50 of other shoes."
These three questions are not really TensorFlow-specific. But the short answer is: it depends on how resilient your model is to an unbalanced data set and noisy labels.
"Is it possible to train a classifier with many more classes than the recently published Inception-v3 model? Let's say 30,000."
Yes, definitely. This would mean a much larger classifier layer, so your training time might be longer. Other than that, there are no limitations in TensorFlow.
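To get a feel for "much larger": with Inception-v3's 2048-dimensional pooled features, a 30,000-way softmax head alone costs about 61M parameters. Back-of-the-envelope arithmetic:

```python
features = 2048        # width of Inception-v3's final pooling layer
classes = 30_000
params = features * classes + classes   # weight matrix + biases
print(f"{params:,}")                    # 61,470,000 parameters in the head alone
```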

How to train a caffemodel on our own dataset?

I tried using the pre-trained bvlc_reference_caffenet.caffemodel for object recognition in images. I got good results for images containing only a single object. For images with multiple objects, I removed the argmax() from the prediction step, which otherwise returns only the class label with the maximum probability.
Still, the accuracy of the labels I am getting is very low. So I am thinking of training the same caffemodel on my own dataset (containing images with multiple objects). How should I proceed? Is there any way to retrain a pre-trained caffemodel on a different dataset?
What you are after is called "finetuning": taking a deep net trained for task A, reusing its weights, and re-training it to accomplish task B.
You can start with this tutorial, but you will find much more information simply by googling "finetune caffe model".
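Assuming pycaffe, the usual finetuning recipe looks like the sketch below: write a solver and train prototxt for your own data, rename the final layer so it can have a different number of outputs, and copy the pretrained weights into every layer whose name still matches. The solver.prototxt file name is a placeholder for your own solver definition.

```python
import caffe

caffe.set_mode_gpu()
# solver.prototxt points at your train/val nets for the new dataset
solver = caffe.SGDSolver("solver.prototxt")
# Copy pretrained weights into layers with matching names; renamed layers
# (e.g., the new final classifier) stay randomly initialized
solver.net.copy_from("bvlc_reference_caffenet.caffemodel")
solver.solve()
```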
You may also be interested in this post regarding training Caffe with multiple categories per input image.
