Pixel wise classification using Convolutional Neural Network? - image-processing

The question is conceptual. I basically understand how MNIST example works, the feedforward net takes an image as input and output a predicted label 0 to 9.
I'm working on a project that ideally will take an image as the input, and for every pixel on that image, I will output a probability of that pixel being a certain label or not.
So my input, for example is of the size 600 * 800 * 3 pixels, and my output would be 600 * 800, where every single entry on my output is a probability.
How can I design the pipeline for that using Convolutional Neural Network? I'm working with Tensorflow. Thanks
Elaboration:
Basically I wanted to label every pixel as either foreground or background (The probability of the pixel being foreground). My intuition is that in convolutional layers, the neurons will be able to pick up information in a patch around that pixel, and finally be able to tell how likely this pixel could be the foreground.

Although it wouldn't be very efficient, a naive method could be to color a window (say, 5px x 5px) of pixels black, record the probabilities for each output class, then slide the window over a bit, then record again. This would be repeated until the window passed over the whole image.
Now we have some interesting information. For each window position, we know the delta of the probability distribution over the labels compared to the probabilities when the classifier received the whole image. That delta corresponds to the amount that that region contributed to the classifier making that decision.
If you want this mapped down to a per-pixel level for visualization purposes, you could use a stride length of 1 pixel when sliding the window and map the probability delta to the centermost pixel of the window.
Note that you don't want to make the window too small, otherwise the deltas will be too small to make a difference. Also, you'll probably want to be a bit smart about how you choose the color of the window so the window itself doesn't appear to be a feature to the classifier.
Edit in response to your elaboration:
This would still work for what you're trying to do. In fact, it becomes a bit nicer even. Instead of keeping all the label probability deltas separate, you would sum them. This would give you measurement which tells you "how much does this region make the image more like a number" (or in other words, the foreground). Also, you wouldn't measure the deltas against the uncovered image, but rather against the vector of probabilities where P(x)=0 for each label.

Related

How does background removal help reduce computation in CNN?

I read in many papers that a preprocessing of background removal help reduce the amount of computation. But why is this the case? My understanding is that he CNN works on a rectangular window no matter how is it filled up, 0 or positive.
See this for an example.
In the paper you provide, it seems that they do not pass the entire image to the network. Instead, they seem to be selecting smaller patches from the non-white background. This makes sense because it reduces the noise in their data, but it also reduces computational complexity, because of the effect it has on fully connected layers.
Suppose the input image is of size h*w. In your CNN, the image passes through a series of convolutions and max-poolings, and as a result, right before the first fully connected layer, you end up with a feature map of size
sz=m*(h/k)*(w/d)
where m is the number of feature planes, and where k and d depend on the number of layers, the parameters of each convolution and max pooling modules (e.g. the size of the convolution kernel, etc). Usually, we'll have d==k. Now, assume that you feed this to a fully connected layer, to produce a vector of q parameters. What this layer does is basicaly a matrix multiplication
A*x
where A is a matrix of size q*sz, and x is just your feature map written as a vector.
Now, assume you pass a patch of size (h/t)*(w/t) to the network. You end up with a feature map of size
sz/(t^2)
Given the size of the images in their datasets, this is a considerable reduction in the number of parameters. Also, small patches also means larger batches, and that too can accelerate training (better gradient approximation.).
I hope this helps.
Edit, following #wlnirvana's comment : Yes, patch size is a hyper parameter. In the example I gave, it is set via selecting t. Given the size of the images in the dataset, I'd say something like t>=6 would be realistic. As for how this relate to background removal, to quote the paper (section 3.1):
"To reduce computation time and to focus our analysis on regions of the slide most likely to contain cancer metastasis..."
This means that they select patches only around areas that are not background. This makes sense, since passing a completely white patch to the network would just be a waste of time (in figure 1, you can have so many white/gray/useless patches if you select them randomly, without removing the background). I didn't find any explanation on how patch selection is done in their paper, but I assume something like selecting a number of pixels p_1,...,p_n in the non-background regions, and considering n patches of size (h/t)*(w/t) around each of them would make sense.

In a feedforward neural network, am I able to put in a feature input of "don't care"?

I've created a feedforward neural network using DL4J in Java.
Hypothetically and to keep things simple, assume this neural network is a binary classifier of squares and circles.
The input, a feature vector, would be composed of say... 5 different variables:
[number_of_corners,
number_of_edges,
area,
height,
width]
Now so far, my binary classifier can tell the two shapes apart quite well as I'm giving it a complete feature vector.
My question: is it possible to input only maybe 2 or 3 of these features? Or even 1? I understand results will be less accurate while doing so, I just need to be able to do so.
If it is possible, how?
How would I do it for a neural network with 213 different features in the input vector?
Let's assume, for example, that you know the area, height, and width features (so you don't know the number_of_corners and number_of_edges features).
If you know that a shape can have, say, a maximum of 10 corners and 10 edges, you could input 10 feature vectors with the same area, height and width but where each vector has a different value for the number_of_corners and number_of_edges features. Then you can just average over the 10 outputs of the network and round to the nearest integer (so that you still get a binary value).
Similarly, if you only know the area feature you could average over the outputs of the network given several random combinations of input values, where the only fixed value is the area and all the others vary. (I.e. the area feature is the same for each vector but every other feature has a random value.)
This may be a "trick" but I think that the average will converge to a value as you increase the number of (almost-)random vectors.
Edit
My solution would not be a good choice if you have a lot of features. In this case you could try to use maybe a Deep Belief Network or some autoencoder to infer the values of the other features given a small number of them. For example, a DBN can "reconstruct" a noisy output (if you train it enough, of course); you could then try to give the reconstructed input vector to your feed-forward network.

How to fit a classifier with high accuracy on the training set with low features?

I have input (r,c) in range (0, 1] as the coordinate of a pixel of an image and its color 1 or 2 only.
I have about 6,400 pixels.
My attempt of fitting X=(r,c) and y=color was a failure the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image, the 2nd is the image I use to train on, it has only 2 colors. The last is the image that the neural network generated with about 500 weights training with 50 iterations. Input Layer is 2, one hidden layer of size 100, and the output layer is 2. (for binary classification like this, I may need only one output layer but I am just preparing for multi-class classification)
The classifier failed to fit the training set, why is that? I tried generating high polynomial terms of those 2 features but it doesn't help. I tried using Gaussian kernel and random 20-100 landmarks on the picture to add more features, also got similar output. I tried using logistic regressions, doesn't help.
Please help me increase the accuracy.
Here's the input:input.txt (you can load it into Octave the variable is coordinate (r,c features) and idx (color)
You can try plotting it first to make sure that you understand the input then try training on it and tell me if you get better result.
Your problem is hard to model. You are trying to fit function from R^2 to R, which has lots of complexity - lots of "spikes", lots of discontinuous regions (pixels that are completely separated from the rest). This is not an easy problem, and not usefull one.. In order to overfit your network to such setting you will need plenty of hidden units. Thus, what are the options to do so?
General things that are missing in the question, and are important
Your output variable should be {0, 1} if you are fitting your network through cross entropy cost (log likelihood), which you should use for classification.
50 iteraions (if you are talking about some mini-batch iteraions) is orders of magnitude to small, unless you mean 50 epochs (iterations over whole training set).
Actual things, that will probably need to be done (at least one of the below):
I assume that you are using ReLU activations (or Tanh, hard to say looking at the output) - you can instead use RBF activations, and increase number of hidden neurons to ~5000,
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit function of this complexity. Try architecture of type 100-100-100 instaed.
If the above fails - increase number of hidden units, that's all you need - enough capacity.
In general: neural networks are not designed for working with low dimensional datasets. This is nice example from the web, that you can learn pix-pos to color mapping, but it is completely artificial and seems to actually harm people intuitions.

Having a neural network output a gaussian distribution rather than one single value?

Let's consider I have a neural network with one single output neuron. To outline the scenario: the network gets an image as input and should find one single object in that image. For simplifying the scenario, it should just output the x-coordinate of the object.
However, since the object can be at various locations, the network's output will certainly have some noise on it. Additionally the image can be a bit blurry and stuff.
Therefore I thought it might be a better idea to have the network output a gaussian distribution of the object's location.
Unfortunately I am struggling to model this idea. How would I design the output? A flattened 100 dimensional vector if the image has a width of 100 pixels? So that the network can fit in a gaussian distribution in this vector and I just need to locate the peaks for getting the approximated object's location?
Additionally I fail in figuring out the cost function and teacher signal. Would the teacher signal be a perfect gaussian distribution on the exact x-coordination of the object?
How to model the cost function, then? Currently I have a softmax cross entropy or simply a squared error: network's output <-> real x coordinate.
Is there maybe a better way to handle this scenario? Like a better distribution or any other way to have the network not output a single value without any information of the noise and so on?
Sounds like what you really need is a convolutional network.
You could train a network to recognize your target object when it's positioned in the center of the network's receptive field. You can then create a moving window, at each step feeding the portion of the larger image under that window into the net. If you keep track of the outputs of the trained network for each (x,y) position of the window, some locations of the window will produce better matches than others. Once you've covered the whole image, you can pick the position with the maximum network output as the position where the target object is most likely located.
To handle scale and rotation variations, consider creating an image pyramid, or sets of images at different scales and rotations that are versions of the original image. Then sieve over those images to find the target image.

Determine if an image needs contrasting automatically in OpenCV

OpenCV has a handy cvEqualizeHist() function that works great on faded/low-contrast images.
However when an already high-contrast image is given, the result is a low-contrast one. I got the reason - the histogram being distributed evenly and stuff.
Question is - how do I get to know the difference between a low-contrast and a high-contrast image?
I'm operating on Grayscale images and setting their contrast properly so that thresholding them won't delete the text i'm supposed to extract (thats a different story).
Suggestions welcome - esp on how to find out if the majority of the pixels in the image are light gray (which means that the equalise hist is to be performed)
Please help!
EDIT: thanks everyone for many informative answers. But the standard deviation calculation was sufficient for my requirements and hence I'm taking that to be the answer to my query.
You can probably just use a simple statistical measure of the image to determine whether an image has sufficient contrast. The variance of the image would probably be a good starting point. If the variance is below a certain threshold (to be empirically determined) then you can consider it to be "low contrast".
If you're adjusting contrast just so you can threshold later on, you may be able to avoid the contrast adjustment step if you set your threshold adaptively using Ohtsu's method.
If you're still interested in finding out the image contrast, then read on.
While there are a number of different ways to calculate "contrast". Often, those metrics are applied locally as opposed to the entire image, to make the result more sensitive to image content:
Divide the image into adjacent non-overlaying neighborhoods.
Pick neighborhood sizes that are approximate to size of the features of your image (e.g. if your main feature is horizontal text, make neighborhoods tall enough to capture 2 lines of text, and just as wide).
Apply the metric to each neighborhood individually
Threshold the metric result to separate low and high variance blocks. This will prevent such things as large, blank areas of page skewing your contrast estimates.
From there, you can use a number of features to determine contrast:
The proportion of high metric blocks to low metric blocks
High metric block mean
Intensity distance between the high and low metric blocks (using means, modes, etc)
This may serve as a better indication of image contrast than global image variance alone. Here's why:
(stddev: 50.6)
(stddev: 7.9)
The two images are perfectly in contrast (the grey background is just there to make it obvious it's an image), but their standard deviations (and thus variance) are completely different.
Calculate cumulative histogram of image.
Make linear regression of cumulative histogram in the form y(x) = A*x + B.
Calculate RMSE of real_cumulative_frequency(x)-y(x).
If that RMSE is close to zero - image is already equalized. (That means that for equalized images cumulative histograms must be linear)
Idea is taken from here.
EDIT:
I've illustrated this approach in my blog (C example code included).
There is a support provided in skimage for this. skimage.exposure.is_low_contrast. reference
example :
>>> image = np.linspace(0, 0.04, 100)
>>> is_low_contrast(image)
True
>>> image[-1] = 1
>>> is_low_contrast(image)
True
>>> is_low_contrast(image, upper_percentile=100)
False

Resources