Can the Inception model be used for object counting in an image? - image-processing

I have already gone through the image classification part of the Inception model, but I need to count the objects in an image.
Considering the flowers dataset, one image can contain multiple instances of a flower, so how can I get that count?

What you describe is known to the research community as instance-level segmentation.
In the last year alone there has been a significant spike in papers addressing this problem.
Here are some of the papers:
https://arxiv.org/pdf/1412.7144v4.pdf
https://arxiv.org/pdf/1511.08498v3.pdf
https://arxiv.org/pdf/1607.03222v2.pdf
https://arxiv.org/pdf/1607.04889v2.pdf
https://arxiv.org/pdf/1511.08250v3.pdf
https://arxiv.org/pdf/1611.07709v1.pdf
https://arxiv.org/pdf/1603.07485v2.pdf
https://arxiv.org/pdf/1611.08303v1.pdf
https://arxiv.org/pdf/1611.08991v2.pdf
https://arxiv.org/pdf/1611.06661v2.pdf
https://arxiv.org/pdf/1612.03129v1.pdf
https://arxiv.org/pdf/1605.09410v4.pdf
As you can see from these papers, a simple object-classification network won't solve the problem.
If you search GitHub you will find a few repositories with basic frameworks that you can build on top of:
https://github.com/daijifeng001/MNC (caffe)
https://github.com/bernard24/RIS/blob/master/RIS_infer.ipynb (torch)
https://github.com/jr0th/segmentation (keras, tensorflow)

indraforyou answered the question of how to solve the problem you are having. I want to add something about the Inception model specifically. In https://arxiv.org/pdf/1312.6229.pdf they propose a regressor network trained on the output of a model trained on the ImageNet dataset, like the Inception model. This regressor model is then used to propose object boundaries, which you can use for counting. The advantage of this approach is that you do not have to annotate any training examples; you can just use the ImageNet dataset for training.
If you do not want to train anything, I would propose using a heuristic to find object boundaries. The literature on image segmentation (https://en.wikipedia.org/wiki/Image_segmentation) should help you find a suitable heuristic. I do think using a heuristic will decrease your accuracy, though.
Last but not least, this is an open problem in computer vision research. You should not expect to get 100% accuracy, or even 95% accuracy, on counting. Many very smart people have tried this and reported mixed results. Still, some very cool things can be accomplished.

Any classification model, like the Inception model, will identify the object (a flower in your case). However, when multiple items are present, classification won't work (it gets confused, in simple language).
Thus:
You have to segment the main image into child images with one object per image and then run classification on each segment. This is termed image segmentation in image processing.
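As a very rough illustration of this segment-then-classify idea, here is a minimal sketch (my own, not based on the Inception codebase): it splits the image into candidate regions with a naive threshold-based segmentation and counts the regions the classifier accepts. classify_crop is a placeholder for whatever classifier you plug in (e.g. an Inception-based one).

    import cv2
    import numpy as np

    def count_objects(image_path, classify_crop, min_area=500):
        img = cv2.imread(image_path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

        # Very naive segmentation: Otsu threshold + connected components.
        _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        num, _, stats, _ = cv2.connectedComponentsWithStats(mask)

        count = 0
        for x, y, w, h, area in stats[1:]:  # skip the background component
            if area < min_area:
                continue
            crop = img[y:y + h, x:x + w]
            if classify_crop(crop) == "flower":  # placeholder classifier
                count += 1
        return count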

Related

Evaluation of generative models like variational autoencoder

I hope everyone is doing well.
I need some help with generative models.
I'm working on a project where the main task is to build a binary classification model. The dataset contains 300,000 samples and 100 features, and there is an imbalance between the two classes: the majority class is much larger than the minority class.
To handle this, I'm using a VAE (variational autoencoder).
I started by training the VAE on the minority class, and I then use the decoder part of the VAE to generate new (fake) samples that are similar to the minority class. I concatenate this new data with the training set in order to obtain a new, balanced training set.
My question is: is there any way to evaluate generative models like VAEs, i.e. is there a way to know whether the generated data is similar to the real data?
I have read that there are some metrics for evaluating generated data, like the Inception Score and the Fréchet Inception Distance, but I have only seen them used on image data.
I want to know if I can use them on my dataset too.
Thanks in advance.
I believe your data is not image data, since you say there are 100 features. What you can do is check the similarity between the synthesised features and the original features (the ones belonging to the minority class), and keep only the ones above a certain similarity. Cosine similarity would be useful for this problem.
It would also be very useful to look at a scatter plot of the synthesised features together with the original ones, to see whether they are close to each other. t-SNE would be useful at this point.
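As a minimal sketch of both checks (my own illustration), assuming real and generated are NumPy arrays of shape (n_samples, 100) and the similarity threshold is purely hypothetical:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    def filter_by_similarity(generated, real, threshold=0.9):
        # Keep only generated samples whose best cosine similarity to any
        # real minority sample exceeds the (hypothetical) threshold.
        sims = cosine_similarity(generated, real).max(axis=1)
        return generated[sims >= threshold]

    def plot_tsne(real, generated):
        # Embed real and generated samples together and plot them in 2D.
        emb = TSNE(n_components=2, random_state=0).fit_transform(
            np.vstack([real, generated]))
        n = len(real)
        plt.scatter(emb[:n, 0], emb[:n, 1], label="real minority", alpha=0.5)
        plt.scatter(emb[n:, 0], emb[n:, 1], label="generated", alpha=0.5)
        plt.legend()
        plt.show()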

Can we use multiple models for object detection?

I have been working on object detection for a long time, and I have seen that models like YOLO and Mask R-CNN use a single deep model to classify objects. Is it possible to instead build multiple comparatively small networks to identify each object separately in order to increase accuracy, and what would be the effect on speed? I'm a little bit confused.
If you look inside the "black box" of models such as YOLO and Mask R-CNN, you will realize that, to a certain extent, they already contain "multiple small networks" for object detection.
Actually, Mask R-CNN is roughly a Faster R-CNN with an additional branch for segmentation. However, regarding detection, there is "somewhere" a classification layer that gives a score for each object class (and a regression layer to estimate the box). All the object classes are estimated from a common representation (all the rest of the network), and only the last layer is specialized for each class. The point, nevertheless, is that there are advantages to computing the common representation jointly for all object classes, in particular because a positive sample for class i is usually also a negative sample for class j.
The idea is quite different for YOLO (v1), but "somewhere" at the end of the network there is a stack of neuron layers. There is a layer for each object class, and it computes the probability of the presence of the corresponding object in a region of the image. Once again, the layers are computed from a "common representation", so in that sense they are fairly independent "classifiers". But once again, these "classifiers" benefit from the representation that is computed jointly for all object classes.
To be honest, these explanations are quite approximate, in an attempt to be clear. If you really want to understand, the best option is to read the publications on YOLO and Mask R-CNN. However, they are quite technical and require a good understanding of deep-learning basics. There are also some good tutorials on the web.
That being said, you can modify the architecture of YOLO or Mask R-CNN to put more complex "small neural networks" in place of the existing layers. It may improve performance since you will have more neurons, but it will also be more complex to train. As said in the comments by #jakub, you can also train multiple specific networks and add a layer to choose between them, but that would be a "new" architecture, and I doubt you will obtain a better compromise between performance and computational efficiency than YOLO or Mask R-CNN.
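To make the "common representation plus per-class heads" idea concrete, here is a minimal Keras sketch (my own illustration, not the actual YOLO or Mask R-CNN code): a single shared backbone computed jointly for all classes, with a small per-class "classifier" head predicting the presence of each class.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    NUM_CLASSES = 3  # hypothetical number of object classes

    inputs = layers.Input(shape=(224, 224, 3))

    # Shared backbone: the "common representation", computed once for all classes.
    x = layers.Conv2D(32, 3, activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # One small head per class, each predicting the presence of that class.
    outputs = [layers.Dense(1, activation="sigmoid", name=f"class_{i}")(x)
               for i in range(NUM_CLASSES)]

    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.summary()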
You can ensemble multiple models. I found this article, which has a better explanation of ensembling multiple models.

Using tensorflow Object-detection on only 1 classes

I am using TensorFlow (object detection) on my own dataset (drone recognition), with only one class named 'drone'. After about 30,000 training steps, my resulting model can detect drones with very high accuracy. However, I have a problem: I used the ssd_inception_v2_coco model and its fine_tune_checkpoint from the model zoo, and now, in real-time detection, it sometimes detects a human face as a drone (two very different objects). I think this is because of the old checkpoint.
How can I prevent the detection of objects that are very different from my drone object, like humans, dogs, cats...? Or can someone describe the problem here?
Sorry for my bad English.
Even if you train an SSD for one class, it automatically creates another class called background. The background is trained using the regions of the training images that are not labeled as the desired classes (in your case, drone).
An easy way out is to add training samples that include images that have both drones and the things that you don't want to recognize as drones, in the same scene. Doing this and then increasing the number of epochs should improve the precision.
If you are doing an application where some objects frequently appear alongside drones, another possibility is to actually train the network for those things too. This will increase your training workload, but it will improve the accuracy.
Some implementations of SSD have an option for hard negative mining, so that mistakes made during validation are specifically used in training. If you are familiar with the code, you might want to check if this is available.
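For illustration, here is a minimal NumPy sketch of the hard-negative-mining idea (my own sketch, not taken from any particular SSD implementation): keep all positive anchors and only the highest-loss background anchors, with a typical negative-to-positive ratio of 3.

    import numpy as np

    def hard_negative_mining(losses, labels, neg_pos_ratio=3):
        # losses: per-anchor classification loss, shape (N,)
        # labels: 1 for the object class (drone), 0 for background, shape (N,)
        pos_mask = labels == 1
        num_pos = int(pos_mask.sum())
        num_neg = neg_pos_ratio * max(num_pos, 1)

        # Exclude positives, then pick the negatives with the largest loss.
        neg_losses = np.where(pos_mask, -np.inf, losses)
        hardest_neg = np.argsort(neg_losses)[::-1][:num_neg]

        keep = np.zeros(len(labels), dtype=bool)
        keep[pos_mask] = True
        keep[hardest_neg] = True
        return keep

    # Toy example: 6 anchors, 1 positive; the 3 hardest negatives are kept.
    losses = np.array([0.2, 2.5, 0.1, 1.7, 0.05, 0.9])
    labels = np.array([1, 0, 0, 0, 0, 0])
    print(hard_negative_mining(losses, labels))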

Image classification, narrow domain with custom labels

Let's suppose I would like to classify motorbikes by model.
There are a couple of hundred models of motorbikes I'm interested in.
I have tens, sometimes hundreds, of pictures of each motorbike model.
Can you please point me to a practical example that demonstrates how to train a model on your own data and then use it to classify images? It needs to be a deep learning model, not simple logistic regression.
I'm not sure about this, but it seems like I can't use a pre-trained neural net, because it has been trained on a wide range of objects like cats, humans, cars, etc. It may not be very good at distinguishing the motorbike nuances I'm interested in.
I found a couple of such examples (TensorFlow has one), but sadly, all of them use a pre-trained model. None of them had an example of how to train on your own dataset.
In cases like yours you use either transfer learning or fine-tuning. If you have more than a thousand images of motorbikes I would use fine-tuning, and if you have fewer, transfer learning.
Fine-tuning means taking a pre-trained model and replacing the classifier part. The new classifier part, typically the last 1-2 layers of the model, is then trained on your dataset.
Transfer learning means using a pre-trained model and letting it output features for an input image. You then train a new classifier, maybe an SVM or a logistic regression, on those features.
An example of this can be seen here: https://github.com/cpra/dlvc2016/blob/master/lectures/lecture10.pdf, slide 33.
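As a minimal transfer-learning sketch (my own illustration, assuming your motorbike images are already loaded as a NumPy array of shape (n, 299, 299, 3) with integer labels), you can use a pre-trained ImageNet backbone as a fixed feature extractor and fit an SVM on the extracted features:

    import numpy as np
    from tensorflow.keras.applications import InceptionV3
    from tensorflow.keras.applications.inception_v3 import preprocess_input
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Pre-trained ImageNet backbone, used only as a fixed feature extractor.
    backbone = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

    def extract_features(images):
        # images: float array of shape (n, 299, 299, 3)
        return backbone.predict(preprocess_input(images.copy()), verbose=0)

    # Assuming `images` and `labels` hold your motorbike data:
    # features = extract_features(images)
    # X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
    # clf = SVC(kernel="linear").fit(X_train, y_train)
    # print("accuracy:", clf.score(X_test, y_test))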
This paper, Quick, Draw! Doodle Recognition, from a Kaggle challenge, may be similar enough to what you are doing. The code is on GitHub. You may need some data augmentation if you only have a few hundred images per category.
What you want is pretty easy. Follow the darknet YOLO implementation.
Instruction: https://pjreddie.com/darknet/yolo/
Code https://github.com/pjreddie/darknet
Training YOLO on COCO
You can train YOLO from scratch if you want to play with different training regimes, hyper-parameters, or datasets. Here's how to get it working on the COCO dataset.
Get The COCO Data
To train YOLO you will need all of the COCO data and labels. The script scripts/get_coco_dataset.sh will do this for you. Figure out where you want to put the COCO data and download it, for example:
cp scripts/get_coco_dataset.sh data
cd data
bash get_coco_dataset.sh
Add your own data and make sure it is in the same format as the testing samples.
Now you should have all the data and the labels generated for Darknet.
Then call the training script with the pre-trained weights.
Keep in mind that training only on your motorcycles may not give a good result; the model can come out biased. I read that somewhere before.
The rest is all inside the link. Good luck

Image similarity detection with TensorFlow

Recently I started to play with TensorFlow, and while trying to learn the popular algorithms I ran into a situation where I need to find the similarity between images.
Image A is supplied to the system by me, and user X supplies an image B; the system should return image A to user X if image B is similar (in color and class).
Now I have a few questions:
Do we consider this scenario to be supervised learning? I am asking because I don't see it as a classification problem (confused!!).
What algorithms should I use to train, etc.?
Re-training should be done quite often; how should I tackle this problem so I don't have to train from scratch every time (fine-tuning??).
Do we consider this scenario to be supervised learning?
It is supervised learning when you have labels to optimize your model. So for most neural networks, it is supervised.
However, you might also look at the complete task. I guess you don't have any ground truth for image pairs and the "desired" similarity value your model should output?
One way to solve this problem, which sounds inherently unsupervised, is to take a CNN (convolutional neural network) trained (in a supervised way) on the 1000 classes of ImageNet. To get the similarity of two images, you could then simply take the Euclidean distance between the output probability distributions. This will not lead to excellent results, but it is probably a good starting point.
What algorithms should I use to train, etc.?
First, you should define what "similar" means for you. Are two images similar when they contain the same objects (classes)? Are they similar if the general color of the images is the same?
For example, how similar are the following 3 pairs of images?
Have a look at FaceNet and search for "Content based image retrieval" (CBIR):
Wikipedia
Google Scholar
This can be supervised learning. You can classify the images into categories; if two images are in the same category (or close within a category), you can consider them similar.
You can use a deep convolutional neural network trained for ImageNet, such as the Inception model. The Inception model outputs a probability distribution over 1000 classes (a vector whose values sum to 1). You can calculate the distance between the vectors of two images to get their similarity.
On the same page as the Inception model, you will also find instructions for retraining the model: https://github.com/tensorflow/models/tree/master/inception#how-to-fine-tune-a-pre-trained-model-on-a-new-task
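As a minimal sketch of this idea (my own illustration; 'a.jpg' and 'b.jpg' are placeholder file names), you can run two images through an ImageNet-trained Inception model and compare the output probability vectors:

    import numpy as np
    from tensorflow.keras.applications import InceptionV3
    from tensorflow.keras.applications.inception_v3 import preprocess_input
    from tensorflow.keras.preprocessing import image

    model = InceptionV3(weights="imagenet")  # outputs 1000-class probabilities

    def class_probs(path):
        # Load an image, preprocess it, and return the 1000-dim probability vector.
        img = image.load_img(path, target_size=(299, 299))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        return model.predict(x, verbose=0)[0]

    p_a, p_b = class_probs("a.jpg"), class_probs("b.jpg")
    print("Euclidean distance:", np.linalg.norm(p_a - p_b))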
