Recognizing multiple objects in an image with convolutional neural networks

I've seen quite a few CNN code examples for identifying images, but they generally relate to a one-to-one input-to-target relationship (like the MNIST handwritten digits set), and most seem to use similar image dimensions (pixels) for the input image and the training images.
So... what is the usual approach for identifying multiple objects in one image (like several people, or any other relatively complex scene)? I've seen it done often enough, but haven't seen the design approaches mentioned. Does this require some type of preprocessing, or can it be handled directly by a CNN?

I would say the best-known family of techniques for retrieving multiple objects from an image is the Detection family.
With Detection, the basic idea is to have one or more Proposal windows of different sizes and aspect ratios within an image, generated either in a calculated way (for example a regular grid of sliding windows) or at random.
For each Proposal window, the classification algorithm is then run to reveal what that specific area of the image represents.
The next step is usually a Merge process that combines all neighbouring areas with the same classification into a single output.
Note: a None class is often also used to represent an area in which no specific class was found.
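A rough sketch of that pipeline in Python, assuming a hypothetical classify_patch() function that returns a (label, score) pair for an image crop; the merge step here is just a simple greedy non-maximum suppression, not any particular detector's implementation:

    import numpy as np

    def sliding_windows(image, sizes=((64, 64), (128, 128)), stride=32):
        """Generate proposal windows of several sizes across the image."""
        h, w = image.shape[:2]
        for win_h, win_w in sizes:
            for y in range(0, h - win_h + 1, stride):
                for x in range(0, w - win_w + 1, stride):
                    yield (x, y, win_w, win_h), image[y:y + win_h, x:x + win_w]

    def iou(a, b):
        """Intersection-over-union of two (x, y, w, h) boxes."""
        ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
        bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
        ix = max(0, min(ax2, bx2) - max(ax1, bx1))
        iy = max(0, min(ay2, by2) - max(ay1, by1))
        inter = ix * iy
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    def detect(image, classify_patch, score_threshold=0.8, iou_threshold=0.5):
        """Classify every proposal, drop 'None' / low-score windows, then merge
        overlapping detections of the same class (greedy non-maximum suppression)."""
        detections = []
        for box, patch in sliding_windows(image):
            label, score = classify_patch(patch)   # hypothetical classifier
            if label != "None" and score >= score_threshold:
                detections.append((box, label, score))
        detections.sort(key=lambda d: d[2], reverse=True)
        kept = []
        for det in detections:
            if all(det[1] != k[1] or iou(det[0], k[0]) < iou_threshold for k in kept):
                kept.append(det)
        return kept

Modern detectors (Faster R-CNN, SSD, YOLO) fold the proposal, classification, and merge steps into the network itself, but the underlying idea is the same.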

Related

Neural Network for Learning Cut VS Uncut Grass

I've got a script to take pictures like the one provided, with colored loops encircling either uncut grass, cut grass, or other background details (for the purpose of rejecting non-grass regions), and to generate training data in the form of many small images taken from inside those colored loops. I'm struggling to find which type of neural network would work best for learning from this training data and telling me, in real time from a video feed mounted on a lawn mower, which sections of the image are uncut grass or cut grass as it mows through a field. Is there anyone here experienced with neural networks who can either suggest some I could use, or just point me in the right direction?
Try a segmentation network. There are many types of segmentation.
Mind that for neural networks, training data is necessary. Your case (detecting cut and uncut grass) is a special one, which means existing pretrained models may not fit your purpose. If so, you'll need a dataset of images and annotations. There are also tools for labeling segmentation images.
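As a starting point, a minimal fully convolutional model for binary (cut vs. uncut) segmentation could look something like the Keras sketch below; this is only an illustration, assuming per-pixel 0/1 masks as labels, not a tuned architecture:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def tiny_segnet(input_shape=(128, 128, 3)):
        """Small encoder-decoder that outputs a per-pixel cut/uncut probability."""
        inputs = layers.Input(shape=input_shape)
        # Encoder: two downsampling blocks.
        x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
        # Decoder: two upsampling blocks back to the input resolution.
        x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
        x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
        outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)  # per-pixel probability
        return models.Model(inputs, outputs)

    model = tiny_segnet()
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(train_images, train_masks, ...)  # masks: (128, 128, 1) arrays in {0, 1}

For a real mower feed you would likely move to a proper U-Net or another published architecture, but this shows the input/output shape of the problem.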
Hope it helps.

Does the presence of a particular object in all the images of a data set affect a CNN's performance

Context: I have partial images of different types of vehicles in my data set (partial images because of the limited field of view of my camera lens). These partial images cover more than half the vehicle and can be considered good representative images of the vehicle. The vehicle categories are car, bus, and truck. I always get a wheel of the vehicle in these images, and because I am capturing these images during different parts of the day, the colour intensity of the wheels varies throughout the day. However, a wheel is definitely present in all the images.
Question: I wanted to know whether the presence of an object in every image of a data set, one that is not logically useful for classification, will affect the CNN in any way. Basically, I wanted to know whether, before training the CNN, I should mask the object (i.e. black it out) in all the images, or just let it be there.
A CNN creates a hierarchical decomposition of the image into combinations of various discriminatory patterns. These patterns are learnt during training to find those that separate the classes well.
If an object is present in every image, it is likely that it is not needed to separate the classes and won't be learnt. If there is some variation in the object that is class dependent, then maybe it will be used. It is really difficult to know beforehand which features are important. Maybe buses have shinier wheels than cars, and this is something you have not noticed, so having the wheel in the image is beneficial.
If you have inadvertently introduced some class-specific variation, this can cause a problem for later classification. For example, if you only took photos of buses at night, the network might learn night = bus, and when you show it a photo of a bus during the day it won't classify it correctly.
However, using dropout in the network forces it to learn multiple features for classification, and not just rely on one. So if there is variation, this might not have as big an impact.
I would use the images without blanking anything out. Unless it is something simple such as background removal, finding and blacking out the object adds another layer of complexity. You can test whether the wheels make a big difference by training the network on the normal images, then classifying a few training examples with the object blacked out and seeing if the class probabilities change.
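A rough sketch of that check, assuming a trained Keras classifier and that you know the approximate wheel region in a few sample images (the coordinates here are placeholders for your own annotations):

    import numpy as np

    def blackout(image, y0, y1, x0, x1):
        """Return a copy of the image with the given region set to zero (blacked out)."""
        masked = image.copy()
        masked[y0:y1, x0:x1, :] = 0
        return masked

    def compare_probabilities(model, samples, wheel_boxes):
        """Compare class probabilities with and without the wheel region visible.
        `model` is your trained classifier, `samples` a few training images,
        `wheel_boxes` the (y0, y1, x0, x1) wheel regions you annotated."""
        for image, (y0, y1, x0, x1) in zip(samples, wheel_boxes):
            original = model.predict(image[np.newaxis])[0]
            masked = model.predict(blackout(image, y0, y1, x0, x1)[np.newaxis])[0]
            print("original:", np.round(original, 3),
                  " wheel blacked out:", np.round(masked, 3))

If the probabilities barely move, the network is probably not relying on the wheel.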
Focus your energy on doing good data augmentation; that is where you will get the most gains.
You can see an example of which features are learnt on MNIST in this paper.

Can we explicitly specify which features are extracted from an image when using a CNN

Recently I learned about convolutional neural networks and went through some implementations of CNNs using TensorFlow. All the implementations only specify the size, number of filters, and strides for the filters. But when I read about filters, it was said that the filters in each layer extract different features, like edges, corners, etc.
My question is: can we explicitly specify which features the filters should extract, or which portion of the image is more important, etc.?
All the explanations say that we take a small filter and slide it across the input image, convolving as we go. If so, do we take all parts of the image and convolve across the whole image?
Can we explicitly specify which features the filters should extract, or which portion of the image is more important?
Sure, this could be done. But the advantage of CNNs is that they learn the best features themselves (or at least very good ones; better ones than we can come up with in most cases).
One famous example is the ImageNet dataset:
In 2012 the first end-to-end learned CNN was used. End-to-end means that the network gets the raw data on one end as input and the optimization objective on the other end.
Before CNNs, the computer vision community used manually designed features for many years. After AlexNet in 2012, nobody did so (for "typical" computer vision - there are special applications where it is still worth a shot).
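For illustration, one way a hand-specified filter could be injected into a network is to fix the weights of a convolution layer, here with a classic Sobel edge kernel in Keras; this is a sketch of what is possible, not a recommendation:

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    # A classic hand-designed feature: the horizontal Sobel edge kernel.
    sobel_x = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=np.float32).reshape(3, 3, 1, 1)

    # A convolution layer whose single filter is fixed to the Sobel kernel.
    edge_layer = layers.Conv2D(filters=1, kernel_size=3, padding="same",
                               use_bias=False, trainable=False)
    edge_layer.build(input_shape=(None, None, None, 1))
    edge_layer.set_weights([sobel_x])

    # Applying it to a grayscale image batch gives an explicit edge map,
    # which could be fed into (or concatenated with) the rest of a CNN.
    image = tf.random.uniform((1, 28, 28, 1))
    edges = edge_layer(image)

In practice you almost never do this, for the reason above: the learned filters usually end up better than the hand-picked ones.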
All the explanations say that we take a small filter and slide it across the input image, convolving as we go. If so, do we take all parts of the image and convolve across the whole image?
It is always the complete image that is convolved with the small filter. The convolution operation is local, meaning you can compute much of it in parallel, because the result of the convolution in the upper-left corner does not depend on the convolution in the lower-left corner.
I think you may be confusing filters and channels. A filter is the small window of weights that is slid over the input during the convolution, and each filter produces one channel of the convolution output. It is typically these channels that represent different features:
In this car identification example you can see some of the earlier channels picking up things like the hood, doors, and other borders of the car. It is hard to specify exactly which features the network is extracting. If you already know of features that are important, you can feed them in as an additional mask layer or apply some type of weighting matrix to them.
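If you already have such a feature, one simple way to "feed it in" is to concatenate it to the input as an extra channel; the sketch below assumes a precomputed single-channel feature map (for example an edge map or a region-of-interest mask):

    import numpy as np

    def add_feature_channel(rgb_image, feature_map):
        """Stack a precomputed single-channel feature (e.g. an edge map or a
        region-of-interest mask) onto the RGB channels of one image."""
        feature_map = feature_map.astype(rgb_image.dtype)[..., np.newaxis]
        return np.concatenate([rgb_image, feature_map], axis=-1)  # shape (H, W, 4)

    # The network's first layer then simply expects 4 input channels instead of 3,
    # e.g. tf.keras.layers.Input(shape=(height, width, 4)).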

Semantic segmentation for large images

I am working on a limited number of large images, each of which can be 3072*3072 pixels. To train a semantic segmentation model using FCN or U-Net, I construct a large training set in which each training image is 128*128.
In the prediction stage, what I do is cut a large image into small pieces of the same 128*128 size as the training set, feed these small pieces into the trained model, and get the predicted masks. Afterwards, I just stitch these small patches together to get the mask for the whole image. Is this the right mechanism for performing semantic segmentation on large images?
Your solution is often used for this kind of problem. However, I would argue that it depends on the data whether it truly makes sense. Let me give you two examples you can still find on Kaggle.
If you wanted to mask certain parts of satellite images, you would probably get away with this approach without a drop in accuracy. These images are highly repetitive and there's likely no correlation between the segmented area and where in the original image it was taken from.
If you wanted to segment a car from its background, it wouldn't be desirable to break the image into patches. Over several layers the network learns the global distribution of a car in the frame: it's very likely that the mask is positive in the middle and negative in the corners of the image.
Since you didn't give any specifics about what you're trying to solve, I can only give a general recommendation: try to keep the input images as large as your hardware allows. In many situations I would rather downsample the original images than break them into patches.
Concerning the recommendation of curio1729, I can only advise against training on small patches and testing on the original images. While it's technically possible thanks to fully convolutional networks, you're changing the data to an extent that might very likely hurt performance. CNNs are known for their extraction of local features, but there's a large amount of global information that is learned over the abstraction of multiple layers.
Input image data:
I would not advise feeding the big image (3072x3072) directly into Caffe.
A batch of small images will fit better into memory, and parallel processing will also come into play.
Data augmentation will also be feasible.
Output for the big image:
As for the output for the big image, you had better recast the input size of the FCN to 3072x3072 during the test phase, because the layers of an FCN can accept inputs of any size.
Then you will get a 3072x3072 segmented image as output.
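A minimal sketch of that test-time trick in Keras rather than Caffe, assuming a fully convolutional model with no dense layers: the weights trained on 128x128 patches can be reused on the full 3072x3072 image by declaring the spatial dimensions as None.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_fcn(input_shape):
        """Fully convolutional model: no Dense layers, so any input size works."""
        inputs = layers.Input(shape=input_shape)
        x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
        outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
        return models.Model(inputs, outputs)

    # Train on 128x128 patches.
    train_model = build_fcn((128, 128, 3))
    train_model.compile(optimizer="adam", loss="binary_crossentropy")
    # train_model.fit(patches, patch_masks, ...)

    # At test time, rebuild with unspecified spatial size and reuse the weights.
    test_model = build_fcn((None, None, 3))
    test_model.set_weights(train_model.get_weights())
    # full_mask = test_model.predict(big_image[None])   # big_image: (3072, 3072, 3)

Whether this actually works well is a separate question, as the answer above notes: the statistics of 128x128 patches and full images can differ.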

Faster-RCNN, why don't we just use only RPN for detection?

As we know, Faster R-CNN has two main parts: one is the region proposal network (RPN), and the other is Fast R-CNN.
My question is: now that the region proposal network (RPN) can output class scores and bounding boxes and is trainable, why do we need Fast R-CNN?
Am I right in thinking that the RPN is enough for detection (red circle) and that Fast R-CNN has become redundant (blue circle)?
Short answer: no, they are not redundant.
The R-CNN article and its variants popularized the use of what we used to call a cascade.
Back then it was fairly common to chain different detectors, often very similar in structure, because of their complementary power.
If the detections are partly orthogonal, this allows false positives to be removed along the way.
Furthermore, by design the two parts of R-CNN have different roles: the first is used to discriminate objects from the background, and the second to discriminate fine-grained categories of objects from one another (and also from the background).
But you are right: if there is only one class vs. the background, one could use only the RPN part for detection. Even in that case, though, chaining two different classifiers would probably improve the result (or not, see e.g. this article).
PS: I answered because I wanted to, but this question is definitely unsuited for Stack Overflow.
If you just add a classification head to the RPN, you would indeed get detections, with scores and class estimates.
However, the second stage is used mainly to obtain more accurate detection boxes.
Faster R-CNN is a two-stage detector, like Fast R-CNN.
In Fast R-CNN, Selective Search was used to generate rough estimates of the object locations, and the second stage then refined or rejected them.
Now why is this necessary for the RPN? So why are they only rough estimates?
One reason is the limited receptive field:
The input image is transformed by a CNN into a feature map with limited spatial resolution. For each position on the feature map, the RPN heads estimate whether the features at that position correspond to an object, and they regress the detection box.
The box regression is done based on the final feature map of the CNN. In particular, it may happen that the correct bounding box in the image is larger than the receptive field corresponding to that position of the feature map.
Example: Let's say we have an image depicting a person, and the features at one position of the feature map indicate a high probability for the person. Now, if the corresponding receptive field contains only part of the body, the regressor has to estimate a box enclosing the entire person, although it "sees" only that body part.
Therefore, the RPN creates a rough estimate of the bounding box. The second stage of Faster R-CNN uses all the features contained in the predicted bounding box and can correct this estimate.
In the example, the RPN creates a bounding box that is too large but still encloses the person (since it cannot see the pose of the person), and the second stage uses all the information in this box to reshape it so that it is tight. This can be done much more accurately, since more of the object's content is accessible to the network.
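As a rough illustration (not the actual Faster R-CNN code), the second stage can be thought of as cropping the backbone feature map with the RPN's proposal box, approximated here with tf.image.crop_and_resize in place of RoI pooling, and letting a small head regress refined box offsets from everything inside the proposal:

    import tensorflow as tf
    from tensorflow.keras import layers

    # feature_map: (1, H, W, C) output of the backbone CNN.
    # proposal:    (y1, x1, y2, x2) from the RPN, normalised to [0, 1].
    def refine_proposal(feature_map, proposal, refine_head):
        roi = tf.image.crop_and_resize(
            feature_map,
            boxes=[list(proposal)],     # one proposal box
            box_indices=[0],            # taken from image 0 in the batch
            crop_size=(7, 7))           # fixed-size RoI, as in RoI pooling
        # The head sees all features inside the proposal and predicts box deltas
        # (dy, dx, dh, dw) that tighten the rough RPN box.
        deltas = refine_head(layers.Flatten()(roi))
        return deltas

    # A tiny stand-in for the second-stage box-regression head.
    refine_head = tf.keras.Sequential([
        layers.Dense(256, activation="relu"),
        layers.Dense(4)  # refined box offsets
    ])

The key point is that the head operates on the whole cropped region rather than on a single feature-map position, which is what allows the tighter box.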
Faster R-CNN is a two-stage method, compared to one-stage methods like YOLO and SSD. The reason Faster R-CNN is accurate is its two-stage architecture: the RPN is the first stage, generating proposals, and the second classification and localisation stage learns more precise results based on the coarse-grained output of the RPN.
So yes, you can, but the performance will not be as good.
I think the blue circle is completely redundant; just adding a classification layer (which gives a class for each bounding box containing an object) should work just fine, and that's what single-shot detectors do, with compromised accuracy.
According to my understanding, the RPN just does a binary check of whether or not there is an object in the bounding box, and the final detector part classifies the classes, e.g. car, human, phone, etc.
