I have read in many papers that a background-removal preprocessing step helps reduce the amount of computation. But why is this the case? My understanding is that the CNN works on a rectangular window no matter how it is filled, with zeros or positive values.
See this for an example.
In the paper you provide, it seems that they do not pass the entire image to the network. Instead, they seem to be selecting smaller patches from the non-white background. This makes sense because it reduces the noise in their data, but it also reduces computational complexity, because of the effect it has on fully connected layers.
Suppose the input image is of size h*w. In your CNN, the image passes through a series of convolutions and max-poolings, and as a result, right before the first fully connected layer, you end up with a feature map of size
sz=m*(h/k)*(w/d)
where m is the number of feature planes, and where k and d depend on the number of layers and on the parameters of each convolution and max-pooling module (e.g. the size of the convolution kernel, etc.). Usually, we'll have d == k. Now, assume that you feed this to a fully connected layer to produce a vector of q outputs. What this layer does is basically a matrix multiplication
A*x
where A is a matrix of size q*sz, and x is just your feature map written as a vector.
Now, assume you pass a patch of size (h/t)*(w/t) to the network. You end up with a feature map of size
sz/(t^2)
Given the size of the images in their dataset, this is a considerable reduction in the number of parameters. Smaller patches also mean larger batches, and that too can accelerate training (better gradient approximation).
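To make the reduction concrete, here is a small back-of-the-envelope calculation in Python. The concrete numbers (m, k, q, h, w, t) are made up purely for illustration; only the formulas sz = m*(h/k)*(w/k) and the q*sz weight count come from the argument above.

```python
# Back-of-the-envelope comparison of fully connected layer sizes.
# All concrete numbers are hypothetical; the formulas follow the answer above.

m, k, q = 64, 16, 1000        # feature planes, total downsampling factor (k == d), FC outputs
h, w = 2048, 2048             # full-image size (hypothetical)
t = 8                         # patch factor: patches are (h/t) x (w/t)

sz_full  = m * (h // k) * (w // k)                # feature-map size for the full image
sz_patch = m * (h // (k * t)) * (w // (k * t))    # feature-map size for a (h/t) x (w/t) patch

print("FC weights, full image:", q * sz_full)     # q * sz
print("FC weights, patch:     ", q * sz_patch)    # roughly q * sz / t^2
print("reduction factor:      ", sz_full // sz_patch)
```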
I hope this helps.
Edit, following #wlnirvana's comment: yes, the patch size is a hyperparameter. In the example I gave, it is set by choosing t. Given the size of the images in the dataset, I'd say something like t >= 6 would be realistic. As for how this relates to background removal, to quote the paper (section 3.1):
"To reduce computation time and to focus our analysis on regions of the slide most likely to contain cancer metastasis..."
This means that they select patches only around areas that are not background. This makes sense, since passing a completely white patch to the network would just be a waste of time (in figure 1, you would get many white/gray/useless patches if you selected them randomly, without removing the background). I didn't find any explanation of how patch selection is done in their paper, but I assume something like selecting a number of pixels p_1, ..., p_n in the non-background regions and taking n patches of size (h/t)*(w/t) around each of them would make sense.
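The paper does not spell out the selection procedure, so purely as an illustration of the guess above, here is a rough NumPy sketch that thresholds away near-white background and cuts patches around randomly chosen non-background pixels (the threshold value, patch size, and attempt cap are all arbitrary):

```python
import numpy as np

def sample_foreground_patches(image, n_patches, patch_size, white_thresh=230, rng=None):
    # `image` is an H x W x 3 uint8 slide; all parameter values here are arbitrary.
    rng = np.random.default_rng() if rng is None else rng
    gray = image.mean(axis=2)                       # crude intensity estimate
    fg_ys, fg_xs = np.where(gray < white_thresh)    # non-background (non-white) pixels

    half = patch_size // 2
    patches = []
    for _ in range(20 * n_patches):                 # cap attempts so this always terminates
        if len(patches) >= n_patches or len(fg_ys) == 0:
            break
        i = rng.integers(len(fg_ys))
        y, x = fg_ys[i], fg_xs[i]
        # keep only patches that fit entirely inside the image
        if half <= y < image.shape[0] - half and half <= x < image.shape[1] - half:
            patches.append(image[y - half:y + half, x - half:x + half])
    return patches
```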
My understanding is that we use padding when we convolve because convolving with filters shrinks the output and loses information from the edges/corners of the input matrix. However, we also use a pooling layer after a number of conv layers in order to downsample our feature maps. Doesn't this seem contradictory? We use padding because we do NOT want to reduce the spatial dimensions, but we later use pooling to reduce the spatial dimensions. Could someone provide some intuition behind these two?
Without loss of generality, assume we are dealing with images as inputs. The reason behind padding is not only to keep the dimensions from shrinking; it is also to ensure that input pixels on the corners and edges of the input are not "disadvantaged" in affecting the output. Without padding, a pixel in the corner of an image overlaps with just one filter region, while a pixel in the middle of the image overlaps with many filter regions. Hence, the pixel in the middle affects more units in the next layer and therefore has a greater impact on the output.

Secondly, you actually do want to shrink the dimensions of your input (remember, deep learning is largely about compression, i.e. finding low-dimensional representations of the input that disentangle the factors of variation in your data). The shrinking induced by convolutions with no padding is not ideal: with a really deep net you would quickly end up with very low-dimensional representations that lose most of the relevant information in the data. Instead, you want to shrink your dimensions in a smart way, which is achieved by pooling. In particular, max pooling has been found to work well. This is really an empirical result, i.e. there isn't a lot of theory to explain why this is the case. You could imagine that by taking the max over nearby activations, you still retain the information about the presence of a particular feature in this region, while losing information about its exact location. This can be good or bad: good because it buys you translation invariance, and bad because the exact location may be relevant for your problem.
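A quick way to see the two effects side by side is to print the shapes a tensor takes through a padded convolution, an unpadded convolution, and a pooling layer. This is just an illustrative sketch assuming PyTorch is available; the channel counts and input size are arbitrary:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)                             # a dummy 64x64 RGB image

conv_nopad = nn.Conv2d(3, 16, kernel_size=3)              # no padding: shrinks 64 -> 62
conv_pad   = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # "same" padding: stays 64
pool       = nn.MaxPool2d(2)                              # deliberate 2x downsampling

print(conv_nopad(x).shape)       # torch.Size([1, 16, 62, 62])
print(conv_pad(x).shape)         # torch.Size([1, 16, 64, 64])
print(pool(conv_pad(x)).shape)   # torch.Size([1, 16, 32, 32])
```

Padding keeps the uncontrolled shrinking (and the corner-pixel bias) out of the convolutions, while pooling reintroduces downsampling only where you explicitly choose to have it.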
What is the general consensus on rescaling images that have different sizes? I have read that one approach is to rescale the largest side of an image to a fixed size. It's not clear to me how rescaling only one of the dimensions would lead to uniform image shapes across the dataset.
Are there other approaches, e.g. would it work to take the average size of the two dimensions and then rescale the dimensions of each image to the mean of each dimension across the dataset?
Is it important which interpolation method is used in the rescaling?
Would it make sense to simply take an nxm part of each image and cut off the rest of each image?
Is there a list of approaches people have used, and how they perform in different scenarios?
It depends on the target application of the CNN. For object detection/classification, usually a sliding-window approach or cropping is used. For the first option, a sliding window is moved around the image and a prediction is made for every patch (with different overlap criteria); these predictions are then filtered with pooling or other filtering strategies.
For image segmentation (aka semantic segmentation), similar approaches are used: 1) image scaling + segmenting + scaling back to the original size; 2) taking different image patches and segmenting each of them; or 3) sliding-window segmentation + max pooling. With option (3), each pixel gets N = H x W votes (where H x W is the size of the sliding window). These N predictions are then aggregated by a maximum-voting classifier (similar to ensembling in random forests and other classifiers).
So, in short, I believe there is no short or unique answer to this question. The decision you take will depend on the goal you are trying to achieve with the CNN, and of course the quality of your approach will have an impact on the CNN's performance. I don't know of any study of this kind, though.
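To illustrate two of the options raised in the question (plain rescaling vs. cropping a fixed n x m region), here is a small sketch using Pillow; the target size and the bilinear filter are arbitrary choices, not a recommendation:

```python
from PIL import Image

def to_fixed_size(path, size=224):
    # Two common ways to get a uniform input size; `size` is arbitrary here.
    img = Image.open(path).convert("RGB")

    # Option 1: ignore the aspect ratio and resize both dimensions directly.
    stretched = img.resize((size, size), Image.BILINEAR)

    # Option 2: scale the shorter side to `size` (preserving the aspect ratio),
    # then cut a centered size x size crop from the result.
    w, h = img.size
    scale = size / min(w, h)
    resized = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    left = (resized.width - size) // 2
    top = (resized.height - size) // 2
    cropped = resized.crop((left, top, left + size, top + size))

    return stretched, cropped
```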
Is there a pixel-based region-growing algorithm that can be employed for feature extraction (segmentation) in an image, by adding pixels to the seed based on the minimization of a certain metric? Ideally, a pixel could also be removed if the metric is not optimized when that pixel is added (i.e. the possibility to backtrack and go back to the seed obtained in previous iterations).
I'll try to explain further my objectives:
This algorithm starts from a central pixel selected as an initial seed on the image.
Afterwards, each of the 4 neighbors is explored (right, left, bottom and top neighbors) separately, to see if the metric is optimized by growing the seed in the selected direction.
A neighboring pixel might not optimize the metric immediately, even if the seed created by adding this pixel will be optimal in future iterations.
There is a possibility that a neighboring pixel is added to the seed but is removed later, if the obtained seed is not optimal.
Can anyone suggest an artificial-intelligence technique (or a greedy approach) that is adequate for solving this kind of problem? Also, what would be a good criterion for judging that adding a pixel will optimize the metric, even if the improvement only shows up in future iterations?
P.S.: I started implementing what's explained above in Python but got stuck on determining whether a path (neighboring pixel) is worth exploring or not. Right now, I add a neighboring pixel only if the resulting seed improves (i.e. minimizes) the error relative to the metric. However, even if adding the right or left neighbor does not optimize the metric immediately, one of these two paths might lead to the optimal solution later on (as explained in the third objective).
You've basically outlined the most successful algorithm you could get with this approach. Its success will depend heavily on the metric you use to add/remove pixels, but there are a few things you can do to emulate the behavior you want.
Definitions
We'll call the metric we're optimizing M, where M(R) is the metric's value for a region R, and a region R is some collection of pixels. I will assume that optimizing the metric means maximizing M, but this approach works just as well if the goal is to minimize M.
Methodology
This approach is going to be slightly backwards to your original outline, but it should satisfy both requirements of adding pixels that lie in non-optimal paths from the seed and removing pixels that do not contribute significantly to the optimization.
We will begin at a seed s, but instead of evaluating paths as we go, we will add all pixels in the image (or up to the maximum feature size) to our region iteratively. At each step we assign the pixel a value M(p) based on how much it improves the metric for the current region. This is not the same as the value of the region containing the pixel (M(R) where p is in R); rather, it is the difference between the value of the region with the pixel and the value of the region before the pixel was added (M(p) = M(R) - M(R') where R = R' + p). If you have the capacity to evaluate a single pixel, you could simply use that instead.
The next change is to include a regularization term in M(R) that penalizes the score based on the number of pixels included: N(R) = M(R) - a * |R|, where a is some arbitrary positive constant and |R| is the cardinality (number of pixels) of the region. Note: if the goal is to minimize M, then a should be negative. This has the effect of penalizing the region's score if it includes too many pixels.
Finally, after all pixels have been added to the region and N(p) has been evaluated for each pixel, we iterate over the region again. This time we begin at the last pixel added and iterate backwards over our set of pixels, ending at the seed s. At each iteration we compute the score of the region, N(R). If N(R) has decreased since the last iteration, we remove the pixel p with the lowest score N(p). This should leave the smallest set of pixels that contributes the most to the score.
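Here is a minimal Python sketch of this add-then-prune idea, assuming a user-supplied metric(mask) to maximize over a boolean region mask. The frontier handling and the backward pruning rule are simplifications of the description above, not a definitive implementation, and alpha is an arbitrary choice; in practice you would also cap the forward pass at a maximum feature size rather than growing over the whole image.

```python
import numpy as np

def grow_then_prune(shape, seed, metric, alpha=0.1):
    # `metric(mask)` returns the score M(R) to maximize for a boolean mask;
    # `seed` is a (row, col) tuple; `alpha` is the size-penalty constant a.
    h, w = shape
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    order = [seed]                                   # pixels in order of addition

    def frontier(mask):
        # 4-connected neighbours of the region that are not yet in it
        cand = set()
        for y, x in zip(*np.where(mask)):
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                    cand.add((ny, nx))
        return cand

    # Forward pass: keep adding the best frontier pixel, even if its marginal
    # gain M(p) = M(R' + p) - M(R') is temporarily negative.
    while True:
        cand = frontier(region)
        if not cand:
            break
        base = metric(region)
        gains = {}
        for p in cand:
            region[p] = True
            gains[p] = metric(region) - base
            region[p] = False
        best = max(gains, key=gains.get)
        region[best] = True
        order.append(best)

    # Backward pass: walk from the last pixel added back to the seed and drop
    # any pixel that does not pay for itself under N(R) = M(R) - alpha * |R|.
    def regularized(mask):
        return metric(mask) - alpha * mask.sum()

    for p in reversed(order[1:]):                    # never remove the seed
        trial = region.copy()
        trial[p] = False
        if regularized(trial) >= regularized(region):
            region = trial

    return region
```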
Additional Considerations
If the remaining pixels lie on non-contiguous paths after pruning, you could run a secondary algorithm to add in adjoining pixels. You'll need to do some testing to determine a value of a such that enough pixels are kept to reconstruct the building without including every pixel from the image.
My Opinion (that you didn't ask for)
In general I think you would have more luck with more robust algorithms such as Convolutional Neural Networks for feature classification. They'll likely be faster and definitely more accurate than the algorithm described above.
[This question is now also posed at Cross Validated]
The question in short
I'm studying convolutional neural networks, and I believe that these networks do not treat every input neuron (pixel/parameter) equivalently. Imagine we have a deep network (many layers) that applies convolution to some input image. The neurons in the "middle" of the image have many unique pathways to many deeper-layer neurons, which means that a small variation in the middle neurons has a strong effect on the output. However, the neurons at the edge of the image have only one pathway (or, depending on the exact implementation, on the order of one) through which their information flows through the graph. It seems that these are "under-represented".
I am concerned about this, because this discrimination against edge neurons scales exponentially with the depth (number of layers) of the network. Even adding a max-pooling layer won't halt the exponential increase; only a fully connected layer brings all neurons onto an equal footing. I'm not convinced that my reasoning is correct, though, so my questions are:
Am I right that this effect takes place in deep convolutional networks?
Is there any theory about this, has it ever been mentioned in literature?
Are there ways to overcome this effect?
Because I'm not sure if this gives sufficient information, I'll elaborate a bit more about the problem statement, and why I believe this is a concern.
More detailed explanation
Imagine we have a deep neural network that takes an image as input. Assume we apply a convolutional filter of 64x64 pixels over the image, where we shift the convolution window by 4 pixels each time. This means that every neuron in the input sends its activation to 16x16 = 256 neurons in layer 2. Each of these neurons might send their activation to another 256, so that our topmost neuron is represented in 256^2 output neurons, and so on. This is, however, not true for neurons on the edges: these might be represented in only a small number of convolution windows, thus causing them to activate (on the order of) only 1 neuron in the next layer. Using tricks such as mirroring along the edges won't help: the second-layer neurons that they project to are still at the edges, which means that the second-layer neurons will be under-represented (thus limiting the importance of our edge neurons as well). As can be seen, this discrepancy scales exponentially with the number of layers.
I have created an image to visualize the problem, which can be found here (I'm not allowed to include images in the post itself). This network has a convolution window of size 3. The numbers next to neurons indicate the number of pathways down to the deepest neuron. The image is reminiscent of Pascal's Triangle.
https://www.dropbox.com/s/7rbwv7z14j4h0jr/deep_conv_problem_stackxchange.png?dl=0
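The counting argument is easy to check numerically: for a given kernel size and stride, count how many windows contain each input position. A 1-D sketch with the kernel size and stride from the example above (the 200-pixel input length is arbitrary):

```python
import numpy as np

def coverage(length=200, kernel=64, stride=4):
    # For each 1-D input position, count how many sliding windows contain it.
    counts = np.zeros(length, dtype=int)
    for start in range(0, length - kernel + 1, stride):
        counts[start:start + kernel] += 1
    return counts

c = coverage()
print(c[:8])       # edge positions: covered by only 1-2 windows
print(c[96:104])   # central positions: covered by kernel // stride = 16 windows
```

In 2-D the per-axis counts multiply, giving 16 x 16 = 256 windows for a central pixel versus a single window for a corner pixel.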
Why is this a problem?
This effect doesn't seem to be a problem at first sight: in principle, the weights should automatically adjust in such a way that the network does its job. Moreover, the edges of an image are not that important anyway in image recognition. This effect might not be noticeable in everyday image-recognition tests, but it still concerns me for two reasons: 1) generalization to other applications, and 2) problems arising in the case of very deep networks.
1) There might be other applications, like speech or sound recognition, where it is not true that the middle-most neurons are the most important. Applying convolution is often done in this field, but I haven't been able to find any papers that mention the effect that I'm concerned with.
2) Very deep networks will suffer an exponentially worse discrimination against boundary neurons, which means that central neurons can be over-represented by many orders of magnitude (imagine we have 10 layers, so that the above example would give 256^10 ways for the central neurons to project their information). As one increases the number of layers, one is bound to hit a limit where the weights cannot feasibly compensate for this effect. Now imagine we perturb all neurons by a small amount. The central neurons will cause the output to change more strongly by several orders of magnitude, compared to the edge neurons. I believe that for general applications, and for very deep networks, ways around my problem should be found?
I will quote your sentences and below I will write my answers.
Am I right that this effect takes place in deep convolutional networks?
I think you are wrong in general, but right in the case of your 64x64 convolution filter example. When you structure the filter sizes of your convolution layers, they would never be bigger than what you are looking for in your images. In other words, if your images are 200x200 and you convolve over 64x64 patches, you are saying that these 64x64 patches will learn some part (or exactly the patch) of the image that identifies your category. The idea in the first layer is to learn edge-like, partial, important image parts, not the entire cat or car itself.
Is there any theory about this, has it ever been mentioned in literature? and Are there ways to overcome this effect?
I have never seen it mentioned in any paper I have looked through so far, and I do not think this would be an issue even for very deep networks.
There is no such effect. Suppose your first layer, which learned 64x64 patches, is in action. If a patch in the top-left corner gets fired (becomes active), then it will show up as a 1 in the top-left corner of the next layer, and hence the information will be propagated through the network.
(not quoted) You should not think of a pixel as becoming more useful to more neurons as it gets closer to the center. Think about a 64x64 filter with a stride of 4:
if the pattern that your 64x64 filter looks for is in the top-left corner of the image, then it will be propagated to the top-left corner of the next layer; otherwise there will be nothing in the next layer.
The idea is to keep the meaningful parts of the image alive while suppressing the non-meaningful, dull parts, and to combine these meaningful parts in the following layers. For the case of learning "an uppercase letter A", please look at the images in the very old paper of Fukushima 1980 (http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf), figures 5 and 7. Hence there is no importance of a single pixel; what matters is the image patch the size of your convolution filter.
The central neurons will cause the output to change more strongly by several orders of magnitude, compared to the edge neurons. I believe that for general applications, and for very deep networks, ways around my problem should be found?
Suppose you are looking for a car in an image, and suppose that in your first example the car is entirely within the top-left 64x64 region of your 200x200 image, while in your second example the car is entirely within the bottom-right 64x64 region.
In the second layer, nearly all activations will be almost 0: for the first image, everything except the very top-left corner, and for the second image, everything except the very bottom-right corner.
Now, the center part of the image will mean nothing to the forward and backward propagation because its values will already be 0, but the corner values will never be discarded and will affect the learned weights.
I have to do foreground/background segmentation using the maxflow algorithm in C++ (http://wiki.icub.org/iCub/contrib/dox/html/poeticon_2src_2objSeg_2src_2maxflow-v3_802_2maxflow_8cpp_source.html). I get an array of pixels from a PNG file according to their RGB values, but what are the next steps? How could I use this algorithm for my problem?
I recognize that source very well. That's the Boykov-Kolmogorov Graph Cuts library. What I would recommend you do first is read their paper.
Graph Cuts is an interactive image segmentation algorithm. You mark the pixels in your image that you believe belong to the object (a.k.a. the foreground) and the pixels that do not belong to the object (a.k.a. the background). That's what you need first. Once you do this, the Graph Cuts algorithm makes its best guess at the labels of the remaining pixels in the image: it goes through each unlabeled pixel and figures out whether it belongs to the foreground or the background.
The whole premise behind Graph Cuts is that image segmentation is akin to energy minimization. Image segmentation can be formulated as a cost function with a summation of two terms:
Self-Penalty: This is the cost of assigning each pixel as either foreground or background. This is also known as a data cost.
Neighbouring Penalties: This enforces that neighbouring pixels more or less should share the same classification label. This is also known as a smoothness cost.
This kind of formulation is well known as the maximum a posteriori Markov random field (MAP-MRF) classification problem. The goal is to minimize that cost function so that you achieve the best image segmentation possible; in its general form this is an NP-hard problem.
Boykov and Kolmogorov showed that the MAP-MRF problem can be translated into graph theory: solving it is akin to taking your image and forming a graph with source and sink links, as well as links that connect neighbouring pixels together. To solve the MAP-MRF, you run a maximum-flow/minimum-cut algorithm on this graph. There are many ways to do this, but Boykov and Kolmogorov devised a more efficient method that is much faster in practice than more established algorithms such as push-relabel, Ford-Fulkerson, etc.
The self-penalties are what are known as t-links, while the neighbouring penalties are what are known as n-links. You should read the paper to see exactly how these are computed, but the t-links describe the classification penalty: basically, how much it would cost to classify each pixel as belonging to the foreground or the background. These costs are usually based on negative log probability distributions estimated from the image. You create one histogram of the colours of the pixels marked as foreground and another of the pixels marked as background.
Usually, a uniform quantization of each colour channel for both foreground and background suffices. You then turn these histograms into PDFs by dividing by the total number of elements in each histogram. Then, when you calculate the t-links for each pixel, you take its colour, see where it lies in each histogram, and take the negative log. This tells you how much it would cost to classify that pixel as foreground or background.
The neighbouring pixel costs are more intuitive. People usually take the Euclidean distance between one pixel and a neighbouring pixel and pass this distance through a Gaussian. To keep things simple, a 4-pixel neighbourhood is usually used (north, south, east and west).
Once you figure out how to compute the cost, you follow this procedure:
1. Mark pixels as foreground or background.
2. Create a graph structure using their library.
3. Compute the histograms of the foreground and background pixels.
4. Calculate the t-links and add them to the graph.
5. Calculate the n-links and add them to the graph.
6. Invoke the maxflow routine on the graph to segment the image.
7. Go through each pixel and figure out whether it belongs to the foreground or the background.
8. Create a binary map that reflects this, then copy over the image pixels where the binary map is true, and don't copy where it is false.
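As a rough, language-agnostic illustration of steps 3-5, here is a NumPy sketch of how the histogram-based t-link costs and the Gaussian n-link costs could be computed. The bin count, sigma, and epsilon are arbitrary, and wiring these costs into the library's graph-construction calls is described in its README rather than here:

```python
import numpy as np

def tlink_costs(image, fg_mask, bg_mask, bins=16, eps=1e-6):
    # Data costs from colour histograms of the user-marked pixels (steps 3-4).
    # `image` is H x W x 3 uint8; the masks are boolean H x W arrays.
    quant = (image // (256 // bins)).astype(int).reshape(-1, 3)   # uniform quantization
    idx = quant[:, 0] * bins * bins + quant[:, 1] * bins + quant[:, 2]

    def neg_log_pdf(mask):
        hist = np.bincount(idx[mask.ravel()], minlength=bins ** 3).astype(float)
        pdf = hist / max(hist.sum(), 1.0)
        return -np.log(pdf[idx] + eps).reshape(image.shape[:2])

    # -log P(colour | foreground), -log P(colour | background); how these map to
    # source/sink capacities follows the convention in the paper / library README.
    return neg_log_pdf(fg_mask), neg_log_pdf(bg_mask)

def nlink_costs(image, sigma=10.0):
    # Smoothness costs for a 4-neighbourhood (step 5): a Gaussian of the colour
    # difference between each pixel and its right / bottom neighbour.
    img = image.astype(float)
    d_right = np.linalg.norm(img[:, 1:] - img[:, :-1], axis=2)
    d_down = np.linalg.norm(img[1:, :] - img[:-1, :], axis=2)
    w_right = np.exp(-(d_right ** 2) / (2 * sigma ** 2))
    w_down = np.exp(-(d_down ** 2) / (2 * sigma ** 2))
    return w_right, w_down
```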
The original source of maxflow can be found here: http://pub.ist.ac.at/~vnk/software/maxflow-v3.03.src.zip
It also has a README so you can see how the library is supposed to work given some example images.
You have a lot to digest, but Graph Cuts is one of the most powerful interactive segmentation tools out there.
Good luck!