CNN: Why is the image matrix transformed to (channel, width, height)? - machine-learning

I am going through some CNN articles. I see that they transform the input image to (channel, width, height).
Here is a code example taken from the MXNet CNN tutorial:
def transform(data, label):
    # transposing with (2, 0, 1) turns (height, width, channel) into (channel, height, width)
    return nd.transpose(data.astype(np.float32), (2, 0, 1)) / 255, label.astype(np.float32)
Can anyone explain why we do this transformation?

There are several image formats for 2-dimensional convolution, the main ones are:
Channel-first or NCHW format, i.e., (batch, channels, height, width).
Channel-last or NHWC format, i.e., (batch, height, width, channels).
They are basically equivalent and can be easily converted from one to another, though there is evidence that certain low-level implementations perform more efficiently when a particular data format is used (see this question).
Computational engines usually accept both formats, but have different defaults, e.g.,
TensorFlow accepts both and uses NHWC by default.
Theano accepts only the NCHW format.
Keras also works with both and has a dedicated setting for it. The latest version also uses NHWC by default.
MXNet accepts both formats too, but the default is NCHW:
The default data layout is NCHW, namely (batch_size, channel, height, width). We can choose other layouts such as NHWC.
This default is pretty much the only reason to transpose the tensors: it simply avoids passing a layout argument throughout the network.
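For illustration, a minimal numpy sketch of converting between the two layouts (the 224x224 size and the batch of 8 are just example values); the single-image transpose is exactly what the transform function in the question performs:

import numpy as np

img_hwc = np.zeros((224, 224, 3), dtype=np.float32)    # HWC: (height, width, channels)
img_chw = np.transpose(img_hwc, (2, 0, 1))              # -> CHW: (channels, height, width)
img_back = np.transpose(img_chw, (1, 2, 0))             # -> back to HWC

# For a whole batch, NHWC <-> NCHW:
batch_nhwc = np.zeros((8, 224, 224, 3), dtype=np.float32)
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))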

Related

Do convolutional neural networks run faster on binary images?

I am trying some DCNNs to recognize handwritten words (word spotting), where the images are binary, and I am wondering whether the computation will be faster than with gray-level or color images.
In addition, how can one equalize the image sizes, since normalizing the word images produces words at different scales?
Any suggestions?
Computation is certainly faster for gray-scale images, but not because of the zeros; it is simply a matter of input tensor size. Color images are [batch, width, height, 3], while gray-scale images are [batch, width, height, 1]. The difference in depth, as well as in spatial size, affects the time spent on the first convolutional layer, which is usually one of the most time-consuming. That is why you should consider resizing the images as well.
You may also want to read about the 1x1 convolution trick to speed up computation. It is usually applied in the middle of the network, when the number of filters becomes large.
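As a rough illustration of that trick (the layer widths here are made up, not taken from any particular network), a 1x1 convolution can squeeze the channels before a more expensive 3x3 convolution; a tf.keras sketch:

from tensorflow import keras

# Illustrative only: squeeze 256 channels down to 64 with a 1x1 convolution
# before the 3x3 convolution, reducing the work done by the expensive layer.
inp = keras.Input(shape=(None, None, 256))
squeezed = keras.layers.Conv2D(64, kernel_size=1, activation="relu")(inp)
out = keras.layers.Conv2D(64, kernel_size=3, padding="same", activation="relu")(squeezed)
model = keras.Model(inp, out)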
As for the second question (if I understand it correctly), ultimately you have to resize the images. If the images contain text of different font sizes, one possible strategy is resize + pad or crop + resize. You have to know the font size in each particular image to select the right padding or crop size. This method may need a fair amount of manual work.
A completely different way would be to ignore these differences and let the network learn OCR despite the font-size discrepancy. It is a viable solution that doesn't require much manual pre-processing, but it needs more training data to avoid overfitting. If you examine the MNIST dataset, you will notice the digits are not always the same size, yet CNNs achieve 99.5% accuracy quite easily.
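If you go the resize + pad route, a minimal Pillow/numpy sketch could look like the following; the 64x256 target and the white padding value are assumptions for binary word images, not values from the question:

import numpy as np
from PIL import Image

def resize_and_pad(img, target_h=64, target_w=256):
    # Hypothetical helper: scale the word image to the target height,
    # then pad the width with white pixels to a fixed (target_h, target_w).
    w, h = img.size                                   # PIL reports (width, height)
    new_w = min(int(round(w * target_h / h)), target_w)
    img = img.convert("L").resize((new_w, target_h), Image.BILINEAR)
    canvas = np.full((target_h, target_w), 255, dtype=np.uint8)
    canvas[:, :new_w] = np.array(img)
    return canvas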

Backward pass on convolutional layer with 3-channel images

I know that deconvolution is basically convolution of the output with flipped filters, and I have implemented it for 2D data. But I am not able to generalize it to 3D data. For example, consider an input of dimension 3x5x5, a filter of dimension 3x3x3, and a stride of 1. The output will then be of dimension 1x3x3. What I don't understand is how to calculate the deconvolution for this output. The flipped filter will again be of dimension 3x3x3, while the output of the convolution is of dimension 1x3x3, which are incompatible for convolution. So how can we calculate the deconvolution?
Perhaps this post will help you out a bit. You are correct in saying that a filter of the same size cannot fit the deconvolution dimensions. To remedy that, the 1x3x3 output gets padded throughout with zeros, mean values, nearest-neighbour values, etc. until it reaches the size you require. Depth can be handled the same way. In your example, you want a 3x3x3 filter to 'deconvolve' the 1x3x3 output to a 3x5x5. So we pad out the 1x3x3 to 5x7x7 (with whichever method you prefer) and apply the filter. There are definite drawbacks to this process, stemming from the fact that you are trying to extrapolate more information from less.
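A minimal numpy/scipy sketch of the same idea, treating the depth dimension as channels (the standard multi-channel 2D convolution view): the gradient with respect to each input channel is a 'full' convolution of the 1x3x3 output gradient with that channel's (flipped) 3x3 filter slice, and the implicit zero-padding is exactly the padding described above.

import numpy as np
from scipy.signal import correlate2d, convolve2d

# Shapes from the question: input 3x5x5, filter 3x3x3, stride 1.
x = np.random.randn(3, 5, 5)
w = np.random.randn(3, 3, 3)

# Forward pass: sum the per-channel 'valid' cross-correlations -> 3x3 output map.
y = sum(correlate2d(x[c], w[c], mode="valid") for c in range(3))

# Backward pass w.r.t. the input: 'full' convolution of the output gradient
# with each channel of the filter (convolve2d flips the kernel itself).
dy = np.ones_like(y)                                      # pretend upstream gradient
dx = np.stack([convolve2d(dy, w[c], mode="full") for c in range(3)])
assert dx.shape == x.shape                                # (3, 5, 5)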

How to enable a Convolutional NN to take variable size input?

So, I've seen that many of the first CNN examples in machine learning use the MNIST dataset. Each image there is 28x28, so we know the shape of the input beforehand. How would this be done for variable-size input? Let's say you have some images that are 56x56 and some that are 28x28.
I'm looking for a language- and framework-agnostic answer if possible, or otherwise in TensorFlow terms.
In some cases, resizing the images appropriately (for example, to keep the aspect ratio) will be sufficient. But this can introduce distortion, and if that is harmful, another solution is Spatial Pyramid Pooling (SPP). The problem with different image sizes is that they produce layers of different sizes: for example, taking the features of the n-th layer of some network, you can end up with a feature map of size 128*fw*fh where fw and fh vary depending on the size of the input example. What SPP does to alleviate this problem is turn this variable-size feature map into a fixed-length vector of features. It operates at different scales by dividing the image into equal patches and performing max pooling on them. I think this paper does a great job of explaining it. An example application can be seen here.
As a quick explanation, imagine you have a feature map of size k*fw*fh. You can consider it as k maps of the form
X Y
Z T
where each block is of size (fw/2)*(fh/2). Now, performing max pooling on each of those blocks separately gives you a vector of size 4 per map, and therefore you can coarsely describe the k*fw*fh map as a fixed-size vector of k*4 features.
Now, call this fixed-size vector w and set it aside. This time, consider the k*fw*fh feature map as k feature planes written as
A B C D
E F G H
I J K L
M N O P
and again perform max pooling separately on each block. Using this, you obtain a more fine-grained representation as a vector v of length k*16.
Now, concatenating the two vectors into u = [v; w] gives you a fixed-size representation. This is exactly what a 2-scale SPP does (of course, you can change the number and sizes of the divisions).
Hope this helps.
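As a concrete (if simplified) version of the explanation above, here is a minimal numpy sketch of 2-scale SPP on a (k, fh, fw) feature map; feature maps of different spatial sizes come out as vectors of the same length k*(4+16):

import numpy as np

def spp_2scale(fmap):
    # Max-pool over a 2x2 and a 4x4 grid of (roughly) equal patches,
    # then concatenate into a fixed-length vector of size k*(4 + 16).
    k, fh, fw = fmap.shape
    pooled = []
    for n in (2, 4):                                   # pyramid levels
        hs = np.linspace(0, fh, n + 1).astype(int)     # row boundaries
        ws = np.linspace(0, fw, n + 1).astype(int)     # column boundaries
        for i in range(n):
            for j in range(n):
                patch = fmap[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                pooled.append(patch.max(axis=(1, 2)))  # one value per channel
    return np.concatenate(pooled)

# Feature maps of different spatial sizes give vectors of the same length:
assert spp_2scale(np.random.randn(128, 13, 17)).shape == (128 * 20,)
assert spp_2scale(np.random.randn(128, 7, 9)).shape == (128 * 20,)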
When you use a CNN for a classification task, your network has two parts:
Feature generator. This part generates a feature map of size WF x HF with CF channels from an image of size WI x HI with CI channels. The relation between the image size and the feature map size depends on the structure of your NN (for example, on the number of pooling layers and their strides).
Classifier. This part classifies vectors with WF*HF*CF components into classes.
You can feed images of different sizes into the feature generator and get feature maps of different sizes, but the classifier can only be trained on vectors of some fixed length. Therefore you have to train your network for some fixed image size. If you have images of a different size, you resize them to the input size of the network or crop some part of the image.
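A minimal TensorFlow sketch of that resize-or-crop preprocessing (the 224x224 target is only an example, not a value from this answer):

import tensorflow as tf

def to_fixed_size(image, target=224):
    # Either rescale (may distort the aspect ratio) ...
    resized = tf.image.resize(image, (target, target))
    # ... or center-crop / zero-pad to the target size.
    cropped = tf.image.resize_with_crop_or_pad(image, target, target)
    return resized, cropped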
Another way is described in the article
K. He, X. Zhang, S. Ren, J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," arXiv:1406.4729, 2014.
The authors propose spatial pyramid pooling, which solves the problem of different image sizes at the input of a CNN. But I am not sure whether a spatial pyramid pooling layer exists in TensorFlow.

Rule of thumb to select some of the Gabor filter parameters in OpenCV based on image size

I'm using a Gabor kernel on equal-area blocks of an image; let's say each block is 36 by 56. I am using OpenCV's cv2.getGaborKernel to get several filters, mainly by changing the theta parameter, but I am not sure what values or range of values to choose for ksize, sigma, and lambd. So, my question is: is there any rule of thumb for choosing values for these three parameters based on the size of the image (of course, here by image I mean blocks of size 36 by 56)? In particular, what if you want to apply Gabor filters on image blocks, rather than the whole image, where the size of each block is relatively small? At the least, are there any reasonable values to start with rather than just trying different values?
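Not a rule of thumb, but for reference, a minimal OpenCV sketch of building a small bank of Gabor kernels over several orientations; the ksize, sigma, lambd, gamma, and psi values below are arbitrary starting points to tune for 36x56 blocks, not recommendations:

import numpy as np
import cv2

# getGaborKernel(ksize, sigma, theta, lambd, gamma, psi)
kernels = []
for theta in np.arange(0, np.pi, np.pi / 8):            # 8 orientations
    kern = cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5, 0)
    kernels.append(kern)

# Apply one kernel to a block (block is a 36x56 grayscale numpy array):
# response = cv2.filter2D(block, cv2.CV_32F, kernels[0])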

Can Keras deal with input images with different size?

Can Keras deal with input images of different sizes? For example, in a fully convolutional neural network, the input images can have any size. However, we need to specify the input shape when we create a network with Keras. So how can we use Keras to deal with different input sizes without resizing the input images to the same size? Thanks for any help.
Yes.
Just change your input shape to shape=(n_channels, None, None), where n_channels is the number of channels in your input image.
I'm using the Theano backend though, so if you are using TensorFlow you might have to change it to (None, None, n_channels).
You should use:
input_shape=(1, None, None)
None in a shape denotes a variable dimension. Note that not all layers will work with such variable dimensions, since some layers require shape information (such as Flatten).
https://github.com/fchollet/keras/issues/1920
For example, using Keras's functional API, your input layer would be:
For an RGB dataset:
inp = Input(shape=(3,None,None))
For a grayscale dataset:
inp = Input(shape=(1,None,None))
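Putting it together, a hedged sketch of a small fully convolutional tf.keras model that accepts variable-size inputs (channels-last here; the layer widths and the 10 output classes are made up for illustration). Since Flatten will not work with unknown spatial dimensions, as noted above, global average pooling is used to get a fixed-length vector:

from tensorflow import keras

inp = keras.Input(shape=(None, None, 3))                  # any height/width, 3 channels
x = keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
x = keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = keras.layers.GlobalAveragePooling2D()(x)              # collapses the variable spatial dims
out = keras.layers.Dense(10, activation="softmax")(x)
model = keras.Model(inp, out)
model.summary()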
Implementing arbitrarily sized input arrays with the same computational kernels can pose many challenges - e.g., on a GPU you need to know how big a buffer to reserve and, less critically, how much to unroll your loops. This is the main reason that Keras requires constant input shapes: variable-sized inputs are too painful to deal with.
This more commonly occurs when processing variable-length sequences like sentences in NLP. The common approach is to establish an upper bound on the size (and crop longer sequences), and then pad the sequences with zeros up to this size.
(You could also include masking on zero values to skip computations on the padded areas, except that the convolutional layers in Keras might still not support masked inputs...)
I'm not sure whether the overhead of padding is prohibitive for 3D data structures - if you start getting memory errors, the easiest workaround is to reduce the batch size. Let us know about your experience with applying this trick to images!
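A minimal numpy sketch of that pad-to-an-upper-bound idea for a batch of channels-last images, where the bound is simply the largest image in the batch:

import numpy as np

def pad_batch(images):
    # images: list of (H, W, C) arrays with differing H and W.
    max_h = max(im.shape[0] for im in images)
    max_w = max(im.shape[1] for im in images)
    batch = np.zeros((len(images), max_h, max_w, images[0].shape[2]),
                     dtype=images[0].dtype)
    for i, im in enumerate(images):
        batch[i, :im.shape[0], :im.shape[1], :] = im      # zero-pad bottom/right
    return batch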
Just use None when specifying the input shape. But I still do not know how to pass differently shaped images into the fit function.
