How to add an extra parameter to a CNN while training - machine-learning

So, I have to train a network where I have an image, ground-truth, and an extra parameter related to an image (current image state).
There's a camera which captures images at different zoom levels. For a particular scene, I have four images at different zoom levels (0, 25, 50, 75). I need to train the network such that, given a test image, I can classify whether I want to zoom in or zoom out.
So, the dataset I have is the image, ground-truth (zoom in or zoom out or no zoom), and the current zoom level.
How can I add this current zoom level in my network so that the network trains properly?
I'm planning to use VGG or AlexNet for now and then move to Inception or ResNet in future.

What you could do is create a model which processes the image via a CNN and then combines the other input with it. So your model would have two inputs, the image and the current zoom level, with the zoom decision (zoom in, zoom out, or no zoom) as the target. You pass the image through the CNN (or a few CNN layers), flatten the feature map, append the other input values, and then continue through some further layers. Or you augment the image at the beginning (if you have to zoom out, zoom out, ...) and then pass the image to the CNN. I don't know which framework you are using, but I would try to prototype it in Keras with the functional API.
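A minimal sketch of that two-input idea, assuming Keras with TensorFlow, 224x224 RGB inputs, and three output classes; the layer sizes and names are illustrative placeholders, not something from the question:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

image_in = layers.Input(shape=(224, 224, 3), name="image")
zoom_in_level = layers.Input(shape=(1,), name="current_zoom_level")  # e.g. 0, 25, 50, 75

# A few convolutional blocks standing in for a VGG/AlexNet-style backbone.
x = layers.Conv2D(32, 3, activation="relu", padding="same")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

# Append the scalar zoom level to the flattened image features.
x = layers.Concatenate()([x, zoom_in_level])
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(3, activation="softmax", name="zoom_decision")(x)

model = Model(inputs=[image_in, zoom_in_level], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit([images, zoom_levels], labels, ...)
```

The same pattern applies if you swap the small backbone for a pretrained VGG/ResNet feature extractor and concatenate the zoom level after its pooled features.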

Related

Deep Learning - How to perform RANDOM CROP and not lose any information in data (change ground truth label)

I have image patches from DDSM Breast Mammography that are 150x150 in size. I would like to augment my dataset by randomly cropping each of these images twice to 120x120. So, if my dataset contains 6500 images, augmenting it with random crops should get me to 13000 images. The thing is, I do NOT want to lose potential information in the image or possibly change the ground truth label.
What would be the best way to do this? Should I crop them randomly from 150x150 to 120x120 and hope for the best, or maybe pad them first and then perform the cropping? What is the standard way to approach this problem?
If your ground truth contains the exact location of what you are trying to classify, use the ground truth to crop your images in an informed way. That is, adjust the ground truth if you are removing what you are trying to classify.
If you don't know the location of what you are classifying, you could
attempt to train a classifier on your un-augmented dataset,
find out what regions of the images your classifier reacts to,
make note of these locations,
crop your images in an informed way
train a new classifier
But how do you "find out what regions your classifier reacts to"?
Multiple ways are described in Visualizing and Understanding Convolutional Networks by Zeiler and Fergus:
Imagine your classifier classifies breast cancer or no breast cancer. Now simply take an image that contains positive evidence for breast cancer, occlude part of the image with some blank color (the gray square in Zeiler et al.'s figure), and predict cancer or not. Now move the occluding square around. In the end you'll get rough prediction scores for all parts of your original image (see panel (d) in their figure), because when you cover up the important part that is responsible for a positive prediction, you (should) get a negative cancer prediction. A sketch of this occlusion scan follows below.
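A rough sketch of that occlusion scan, assuming a trained Keras model that outputs P(cancer) for a single 150x150x3 patch; the patch size, stride, and fill value are illustrative choices, not values from the answer:

```python
import numpy as np

def occlusion_map(model, image, patch=20, stride=10, fill=0.5):
    """Slide a blank square over the image and record the model's score each time."""
    h, w, _ = image.shape
    heatmap = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch, :] = fill  # blank out one square
            heatmap[i, j] = model.predict(occluded[np.newaxis], verbose=0)[0, 0]
    # Positions where the score drops sharply mark the regions the classifier relies on.
    return heatmap
```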
If you have someone who can actually recognize cancer in an image, this is also a good way to check for and guard against confounding factors.
BTW: You might want to crop on-the-fly and randomize how you crop even more to generate way more samples.
If the 150x150 is already the region of interest (ROI), you could try the following data augmentations (a small sketch follows the list):
use a larger patch, e.g. 170x170 that always contains your 150x150 patch
use a larger patch, e.g. 200x200, and scale it down to 150x150
add some gaussian noise to the image
rotate the image slightly (by random amounts)
change image contrast slightly
artificially emulate whatever other (image-)effects you see in the original dataset
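A hedged sketch of a few of the augmentations above, assuming TensorFlow/Keras preprocessing layers and 170x170 patches that contain the 150x150 ROI; the rotation, contrast, and noise magnitudes are arbitrary placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomRotation(0.02),   # rotate slightly (factor is a fraction of a full turn)
    layers.RandomCrop(150, 150),   # random 150x150 crop from the larger patch
    layers.RandomContrast(0.1),    # slight contrast change
    layers.GaussianNoise(0.02),    # mild additive Gaussian noise
])

# batch = augment(images, training=True)  # the random ops only apply in training mode
```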

Labeling runways for localization and detection using deep learning

Shown above is a sample image of a runway that needs to be localized (a bounding box around the runway).
I know how image classification is done in TensorFlow; my question is how do I label this image for training?
I want the model to output 4 numbers to draw a bounding box.
In CS231n they say that we use a classifier and a localization head.
But how does my model know where the runway is in a 400x400 image?
In short: how do I LABEL this image for training, so that after training my model detects and localizes (draws a bounding box around) runways in input images?
Please feel free to give me links to lectures, videos, github tutorials from where I can learn about this.
**********Not CS231n********** I already took that lecture and couldn't understand how to solve this using their approach.
Thanks
If you want to predict bounding boxes, then the labels are also bounding boxes. This is what most object detection systems use for training. You can have just bounding box labels, or, if you want to detect multiple object classes, class labels for each bounding box as well.
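For example, a purely illustrative label layout (not prescribed by the answer) is one record per image with the box given as pixel coordinates; the file names and numbers below are made up:

```python
# One entry per training image: the box corners in pixels plus a class name.
labels = [
    {"file": "runway_0001.jpg", "class": "runway", "box": [35, 112, 388, 240]},  # x_min, y_min, x_max, y_max
    {"file": "runway_0002.jpg", "class": "runway", "box": [10, 150, 395, 260]},
]
# The regression target for each image is then those 4 numbers, often normalized to [0, 1].
```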
Collect data from Google or any other resource that contains only runway photos (from a closer view). I would suggest using a pre-trained image classification network (like VGG, AlexNet, etc.) and fine-tuning it on the downloaded runway data.
After building a good image classifier on the runway data set, you can use any popular algorithm to generate region proposals from the image.
Now take all region proposals and pass them to the classification network one by one, checking whether the network classifies each proposal as positive or negative. If a proposal is classified as positive, then your object (runway) is most probably present in that region; otherwise it is not.
If there are a lot of region proposals in which the object is present according to the classifier, you can use non-maximum suppression to reduce the number of positive proposals, as sketched below.
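A minimal sketch of greedy non-maximum suppression over the positive proposals; boxes are assumed to be [x_min, y_min, x_max, y_max], and the IoU threshold of 0.5 is an arbitrary choice:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in [x_min, y_min, x_max, y_max] form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    order = list(np.argsort(scores)[::-1])  # highest-scoring proposals first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining proposal that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```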

CNN Object Localization Preprocessing?

I'm trying to use a pretrained VGG16 as an object localizer in TensorFlow on ImageNet data. In their paper, the group mentions that they basically just strip off the softmax layer and toss on either a 4-D or a 4000-D (per-class) fc layer for bounding box regression. I'm not trying to do anything fancy here (sliding windows, RCNN), just to get some mediocre results.
I'm sort of new to this and I'm just confused about the preprocessing done here for localization. In the paper, they say that they scale the image to 256 as its shortest side, then take the central 224x224 crop and train on this. I've looked all over and can't find a simple explanation on how to handle localization data.
Questions: How do people usually handle the bounding boxes here?...
Do you use something like the tf.sample_distorted_bounding_box command, and then rescale the image based on that?
Do you just rescale/crop the image itself, and then interpolate the bounding box with the transformed scales? Wouldn't this result in negative box coordinates in some cases?
How are multiple objects per image handled?
Do you just choose a single bounding box from the beginning, crop to that, and then train on this crop?
Or, do you feed it the whole (centrally cropped) image, and then try to predict 1 or more boxes somehow?
Does any of this generalize to the detection or segmentation challenges (like MS COCO), or is it completely different?
Anything helps...
Thanks
Localization is usually performed as an intersection of sliding windows in which the network identifies the presence of the object you want.
Generalizing that to multiple objects works the same.
Segmentation is more complex. You can train your model on a pixel mask with your object filled in, and have it output a pixel mask of the same size.
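On the bounding-box bookkeeping the question asks about (scale the shortest side to 256, then take the central 224x224 crop): one common approach, not taken from the answer above, is to apply the same transform to the box coordinates and clip them to the crop, so negative coordinates never survive. A hedged sketch:

```python
def transform_box(box, orig_w, orig_h, short_side=256, crop=224):
    """Map [x_min, y_min, x_max, y_max] through resize-shortest-side then central crop."""
    scale = short_side / min(orig_w, orig_h)
    new_w, new_h = orig_w * scale, orig_h * scale
    off_x, off_y = (new_w - crop) / 2.0, (new_h - crop) / 2.0  # central-crop offsets
    x_min, y_min, x_max, y_max = [c * scale for c in box]
    x_min, x_max = x_min - off_x, x_max - off_x
    y_min, y_max = y_min - off_y, y_max - off_y
    # Clip to the crop; a box that ends up empty was cropped away entirely.
    x_min, x_max = max(0.0, x_min), min(float(crop), x_max)
    y_min, y_max = max(0.0, y_min), min(float(crop), y_max)
    if x_max <= x_min or y_max <= y_min:
        return None
    return [x_min, y_min, x_max, y_max]
```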

CNN that generate a new image from input image

Can a CNN be created such that it outputs the input image with a new feature added to it?
For example, if an image of a person's face is the input, it outputs an image of the same face wearing glasses.
There are several options, but basically, in the same way that you have one input for every pixel, you must have one output for every pixel of the output image.
In MLPs you must have the same number of neurons in the input layer as in the output layer.
In CNNs you can have convolutional layers at the beginning and deconvolutional (transposed convolution) layers after that, as in the sketch below.
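A minimal sketch of that conv-then-deconv (encoder-decoder) idea in Keras, assuming 128x128 RGB faces; the depths, filter counts, and loss are arbitrary placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(128, 128, 3))
x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(inp)             # 64x64
x = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(x)              # 32x32
x = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(x)      # 64x64
out = layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="sigmoid")(x)  # 128x128x3

model = Model(inp, out)
model.compile(optimizer="adam", loss="mae")  # e.g. pixel-wise L1 between output and target image
# model.fit(faces_without_glasses, faces_with_glasses, ...)
```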
Take a look at this paper (it is awesome) on creating very realistic images from other images (for example, satellite and map views in Google Maps). It is a neural network that tries to solve the problem while also trying to create images that another neural network is not able to distinguish from real images (the source code is also available):
https://phillipi.github.io/pix2pix/
To add to the answer above, another way of doing this is neural style transfer, where we feed two images to a CNN which then generates a new image combining the content from the second image and the style from the first. Check out this paper for further details, https://arxiv.org/abs/1508.06576
We could, of course, always use GANs to achieve full perfection.

Having a neural network output a gaussian distribution rather than one single value?

Let's consider I have a neural network with one single output neuron. To outline the scenario: the network gets an image as input and should find one single object in that image. To simplify the scenario, it should just output the x-coordinate of the object.
However, since the object can be at various locations, the network's output will certainly have some noise on it. Additionally the image can be a bit blurry and stuff.
Therefore I thought it might be a better idea to have the network output a gaussian distribution of the object's location.
Unfortunately I am struggling to model this idea. How would I design the output? A flattened 100-dimensional vector if the image has a width of 100 pixels? So that the network can fit a Gaussian distribution into this vector, and I just need to locate the peak to get the approximate object location?
Additionally, I fail to figure out the cost function and teacher signal. Would the teacher signal be a perfect Gaussian distribution centered on the exact x-coordinate of the object?
How would I model the cost function, then? Currently I have either a softmax cross-entropy or simply a squared error between the network's output and the real x-coordinate.
Is there maybe a better way to handle this scenario? Like a better distribution, or any other way to avoid having the network output just a single value with no information about the noise, and so on?
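One way to realize the idea described in the question (this is not from the answer below) is to treat the 100 output units as a discretized distribution over x-positions, build a Gaussian-shaped soft label centered on the true x-coordinate, and train with cross-entropy against the softmax output. A hedged sketch, with the width sigma as an arbitrary choice:

```python
import numpy as np
import tensorflow as tf

def gaussian_target(true_x, width=100, sigma=3.0):
    """Soft label: a normalized Gaussian bump over the discretized x-positions."""
    xs = np.arange(width, dtype=np.float32)
    t = np.exp(-0.5 * ((xs - true_x) / sigma) ** 2)
    return t / t.sum()

# logits: (batch, 100) raw network outputs; targets: (batch, 100) Gaussian soft labels
def loss(targets, logits):
    return tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=logits))

# At test time, the predicted x-coordinate is e.g. the argmax (or expectation) of softmax(logits),
# and the spread of the predicted distribution gives a rough notion of the uncertainty.
```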
Sounds like what you really need is a convolutional network.
You could train a network to recognize your target object when it's positioned in the center of the network's receptive field. You can then create a moving window, at each step feeding the portion of the larger image under that window into the net. If you keep track of the outputs of the trained network for each (x,y) position of the window, some locations of the window will produce better matches than others. Once you've covered the whole image, you can pick the position with the maximum network output as the position where the target object is most likely located.
To handle scale and rotation variations, consider creating an image pyramid, or sets of images at different scales and rotations that are versions of the original image. Then sweep the same window search over those images to find the target object.
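A rough sketch of the moving-window search described above, assuming `net` is a trained Keras classifier that scores a single fixed-size window; the window size and stride are illustrative placeholders:

```python
import numpy as np

def best_window(net, image, win=64, stride=8):
    """Scan a window over the image and return the (x, y) position with the highest score."""
    best_score, best_xy = -np.inf, None
    h, w, _ = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = net.predict(image[np.newaxis, y:y + win, x:x + win, :], verbose=0)[0, 0]
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score  # top-left corner of the best-matching window
```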
