How does the size of the patch/kernel impact the result of a convnet? - machine-learning

I am playing around with convolutional neural networks at home with TensorFlow (btw I have done the Udacity deep learning course, so I have the theory basis). What impact does the size of the patch have when one runs a convolution? Does that size have to change when the image is bigger/smaller?
One of the exercises I did involved the CIFAR-10 database of images (32x32 px); there I used 3x3 convolutions (with a padding of 1) and got decent results.
But let's say now I want to play with images larger than that (say 100x100): should I make my patches bigger? Do I keep them 3x3? Furthermore, what would be the impact of making a patch really big? (Say 50x50.)
Normally I would test this at home directly, but running this on my computer is a bit slow (no nvidia GPU!)
So the question can be summarized as:
Should I increase/decrease the size of my patches when my input images are bigger/smaller?
What is the impact (in terms of performance/overfitting) of increasing/decreasing my patch size?

If you are not using padding, a larger kernel makes the number of neurons in the next layer smaller.
Example: a 1x1 kernel gives the next layer the same number of neurons as the input; an NxN kernel on an NxN input gives only one neuron in the next layer.
The impact of a larger kernel:
Computation is faster and memory usage is smaller.
You lose a lot of detail. Imagine an NxN input and a kernel of size NxN too: the next layer gives you only one neuron. Losing a lot of detail can lead you to underfitting.
The answer:
It depends on the images. If you need a lot of detail from the image, you don't want to increase your kernel size. If your image is a 1000x1000-pixel large version of an MNIST image, I would increase the kernel size.
A smaller kernel gives you a lot of detail and can lead you to overfitting; a larger kernel loses a lot of detail and can lead you to underfitting. You should tune your model to find the best size. Sometimes time and machine specification should be considered as well.
If you are using padding, you can adjust it so that the number of neurons after the convolution stays the same. I can't say it will be better than not using padding, but a larger kernel still loses more detail than a smaller one.
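Since you can't test this quickly without a GPU, here is a minimal sketch (assuming TensorFlow 2.x; the filter count and the 100x100 input are arbitrary choices, not from the question) that just prints the output shape of a single convolution for a few kernel sizes and padding modes:

```python
import tensorflow as tf

x = tf.zeros([1, 100, 100, 3])  # a batch of one 100x100 RGB image

for k in (3, 5, 50):
    for pad in ("valid", "same"):
        y = tf.keras.layers.Conv2D(filters=16, kernel_size=k, padding=pad)(x)
        print(f"kernel {k}x{k}, padding={pad}: output shape {y.shape}")

# With "valid" (no) padding the spatial size shrinks by k - 1 per layer;
# with "same" padding it stays 100x100 regardless of the kernel size.
```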

It depends more on the size of the objects you want to detect, or in other words, on the size of the receptive field you want to have. Nevertheless, choosing the kernel size has always been a challenging decision. That is why the Inception model was created, which uses different kernel sizes (1x1, 3x3, 5x5) in parallel. The creators of this model also went deeper and tried to decompose the convolutional layers into ones with smaller patch size while maintaining the same receptive field, to try to speed up training (e.g., 5x5 was decomposed into two 3x3 convolutions, and 3x3 was decomposed into 3x1 and 1x3), creating different versions of the Inception model.
You can also check the Inception V2 paper for more details https://arxiv.org/abs/1512.00567
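To make the factorization concrete, here is a minimal sketch (assuming TensorFlow 2.x; the 64-channel input and filter counts are arbitrary) comparing the parameter count of one 5x5 convolution against two stacked 3x3 convolutions that cover the same 5x5 receptive field:

```python
import tensorflow as tf

inp = tf.keras.Input(shape=(32, 32, 64))

# One 5x5 convolution vs. two stacked 3x3 convolutions (same receptive field).
single_5x5 = tf.keras.layers.Conv2D(64, 5, padding="same")(inp)
stacked = tf.keras.layers.Conv2D(64, 3, padding="same")(inp)
stacked = tf.keras.layers.Conv2D(64, 3, padding="same")(stacked)

print(tf.keras.Model(inp, single_5x5).count_params())  # 102,464 parameters
print(tf.keras.Model(inp, stacked).count_params())     # 73,856 parameters - cheaper
```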

Related

CNN padding and striding

In a CNN, if padding is used so that the size of the image doesn't shrink after several convolutional layers, then why do we use strided convolutions? I wonder because strided convolutions also reduce the size of the image.
Because we want to reduce the size of the image. There are several reasons:
Reduce computational and memory requirement.
Aggregate local features to higher level features.
Subsequent convolutions would have a larger receptive field in the original scale.
Traditionally we have used pooling, like max-pooling, to reduce the size of the image. Strided convolution is another way to do this (and it's getting more popular).
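As a small illustration (a sketch assuming TensorFlow 2.x; the shapes and filter count are arbitrary), a 2x2 max-pool and a stride-2 "same"-padded convolution both halve the spatial size of a feature map:

```python
import tensorflow as tf

x = tf.zeros([1, 64, 64, 32])  # an arbitrary feature map

pooled = tf.keras.layers.MaxPooling2D(pool_size=2)(x)
strided = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same")(x)

print(pooled.shape)   # (1, 32, 32, 32) - pooling halves the spatial size
print(strided.shape)  # (1, 32, 32, 32) - so does a stride-2 convolution
```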

Why does the kernel size become greater while the spatial size of the feature map goes down in Inception networks?

In Inception networks like Inception-v3 and Inception-v4, the kernel sizes are smaller in the lower layers, such as 3*3, but in the higher layers the kernel sizes seem to be larger, such as 5*5 and 7*7, although they may later be factorized into n*1 and 1*n. But as the network goes deeper, the spatial size of the feature map goes down. Is there any relationship between these two things?
PS: My question is why the kernel sizes in the lower layers seem to be smaller (no more than 3*3), while you can find larger kernel sizes like 7*7 in the higher layers (more precisely, in the middle layers). Is there any relationship between the spatial size of the feature map and the spatial size of the conv kernels? Take Inception-v3 as an example: when the spatial sizes of the feature maps are larger than 35 in the first few layers of the network, the biggest kernel size is 5*5, but when the spatial size becomes 17, a kernel size like 7*7 is used.
Any help will be appreciated.
Generally, as you go deep into the network:
The spatial size of the feature map decreases to localize the object and to reduce the computational cost.
The number of filters/kernels increases, as the initial layers usually represent generic features while the deeper layers represent more detailed features. Since initial layers learn only primitive regularities in the data, you do not need a high volume of filters there. However, as you go deep, you should try to look into as much detail as possible, hence the increase in the number of filters/kernels. Increasing the number of filters in deeper layers therefore increases the representational power of the network.
In an Inception module, at each layer, kernels of multiple sizes (1x1, 3x3, 5x5) are used, and the resulting feature maps are concatenated and passed to the next layer.
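A minimal sketch of that idea (assuming TensorFlow 2.x; the 35x35x192 input and the filter count per branch are arbitrary, not taken from the actual Inception-v3 architecture):

```python
import tensorflow as tf

def inception_block(x, filters=32):
    # Parallel branches with different kernel sizes, concatenated on channels.
    b1 = tf.keras.layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    b5 = tf.keras.layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
    return tf.keras.layers.Concatenate(axis=-1)([b1, b3, b5])

inp = tf.keras.Input(shape=(35, 35, 192))
out = inception_block(inp)
print(out.shape)  # (None, 35, 35, 96) - channels are the sum of the branches
```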

TensorFlow for image recognition, size of images

How can the size of an image affect training the model for this task?
My current training set holds images that are 2880 X 1800, but I am worried this may be too large to train. In total my sample size will be about 200-500 images.
Would this just mean that I need more resources (GPU, RAM, distribution) when training my model?
If this is too large, how should I go about resizing? -- I want to mimic real-world photo resolutions as best as possible for better accuracy.
Edit:
I would also be using TFRecord format for the image files
Your memory and processing requirements will be proportional to the pixel size of your image. Whether this is too large for you to process efficiently will depend on your hardware constraints and the time you have available.
With regards to resizing the images there is no one answer, you have to consider how to best preserve information that'll be required for your algorithm to learn from your data while removing information that won't be useful. Reducing the size of your input images won't necessarily be a negative for accuracy. Consider two cases:
Handwritten digits
Here the images can be reduced considerably in size and still maintain all the structural information necessary to be correctly identified. Have a look at the MNIST data set: these images are distributed at 28 x 28 resolution and are identifiable to 99.7%+ accuracy.
Identifying Tree Species
Imagine a set of images of trees where individual leaves could help identify the species. Here you might find that reducing the image size removes small-scale detail on leaf shape in a way that's detrimental to the model, but that you get a similar result with a tight crop (which preserves individual leaves) rather than an image resize. If this is the case, you may find that creating multiple crops from the same image gives you an augmented data set for training that considerably improves results (which is something to consider, if possible, given your training set is very small).
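If you go the cropping route, here is a minimal sketch (assuming TensorFlow 2.6+ where tf.keras.layers.RandomCrop is available; the 512x512 crop size and the number of crops are arbitrary) of building several detail-preserving crops from one large image:

```python
import tensorflow as tf

crop = tf.keras.layers.RandomCrop(height=512, width=512)

image = tf.zeros([1800, 2880, 3])        # one large training image
crops = tf.stack([crop(image, training=True) for _ in range(5)])
print(crops.shape)                       # (5, 512, 512, 3) - five random crops
```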
Deep learning models are achieving results around human level in many image classification tasks: if you struggle to identify your own images then it's less likely you'll train an algorithm to. This is often a useful starting point when considering the level of scaling that might be appropriate.
If you are using GPUs to train, this will definitely affect your training time. TensorFlow does most of the GPU allocation, so you don't have to worry about that. But with big photos you will experience long training times even though your dataset is small. You should consider data augmentation.
You could complement your resizing with data augmentation: resize to equal dimensions and then perform reflections and translations (i.e., geometric transformations).
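For example, a minimal input-pipeline sketch (assuming TensorFlow 2.6+ with the Keras preprocessing layers; the 300x300 target size and translation factors are arbitrary choices) that combines resizing with flips and translations:

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.Resizing(300, 300),                    # equal dimensions
    tf.keras.layers.RandomFlip("horizontal"),              # reflection
    tf.keras.layers.RandomTranslation(height_factor=0.1,
                                      width_factor=0.1),   # translation
])

image = tf.zeros([1, 1800, 2880, 3])          # one large example image
print(augment(image, training=True).shape)    # (1, 300, 300, 3)
```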
If your images are too big, your GPU might run out of memory before it can start training, because it has to store the convolution outputs in its memory. If that happens, you can do some of the following things to reduce memory consumption:
resize the image
reduce batch size
reduce model complexity
To resize your images, there are many scripts just one Google search away, but I will add that in your case 1440 by 900 is probably a sweet spot.
Higher resolution images will result in a higher training time and an increased memory consumption (mainly GPU memory).
Depending on your concrete task, you might want to reduce the image size in order to fit a reasonable batch size of, say, 32 or 64 on the GPU for stable learning.
Your accuracy is probably affected more by the size of your training set, so instead of going for image size you might want to go for 500-1000 sample images. Recent publications like SSD - Single Shot MultiBox Detector achieve high accuracy values, such as an mAP of 72% on the PascalVOC dataset, while "only" using a 300x300 image resolution.
Resizing and augmentation: SSD, for instance, just scales every input image down to 300x300, independent of the aspect ratio, and this does not seem to hurt. You could also augment your data by mirroring, translating, etc. (and I assume there are built-in methods in TensorFlow for that).

Rescaling input to CNNs

What is the general consensus on rescaling images that have different sizes? I have read that one approach is to rescale the largest dimension of each image to a fixed size. It's not clear to me how rescaling only one of the dimensions would lead to uniform image shapes across the dataset.
Are there other approaches, e.g. would it work to take the average size of the two dimensions and then rescale the dimensions of each image to the mean of each dimension across the dataset?
Is it important which interpolation method is used in the rescaling?
Would it make sense to simply take an n x m part of each image and cut off the rest of each image?
Is there a list of approaches people have used and how they perform in different scenarios?
It depends on the target application of the CNN. For object detection/classification, usually a sliding-window approach or cropping is used. With the first option, a sliding window is moved around the image and a prediction is made for every patch (with different overlapping criteria). These predictions are then filtered with pooling or other filtering strategies.
For image segmentation (aka semantic segmentation), similar approaches are used: 1) image scaling + segmenting + scaling back to the original size, 2) different image patches + segmentation of each, or 3) sliding-window segmentation + max-pooling. With option (3), each pixel has N = H x W votes (where N is the size of the sliding window). These N predictions are then aggregated by a maximum-voting classifier (similar to ensembling in random forests and other classifiers).
So, in short, I believe there is no short or unique answer to this question. The decision you take will depend on the goal you are trying to achieve with the CNN, and of course the quality of your approach will have an impact on the performance of the CNN. I don't know of any study of this kind, though.
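As an illustration of the sliding-window option, here is a minimal sketch (assuming TensorFlow 2.x; the window size, stride, and image shape are arbitrary) that extracts overlapping fixed-size patches which could then be classified or segmented individually:

```python
import tensorflow as tf

image = tf.zeros([1, 480, 640, 3])   # an image of arbitrary size

patches = tf.image.extract_patches(
    images=image,
    sizes=[1, 224, 224, 1],          # window height/width
    strides=[1, 112, 112, 1],        # 50% overlap between windows
    rates=[1, 1, 1, 1],
    padding="VALID",
)
print(patches.shape)  # (1, 3, 4, 224*224*3): one flattened patch per window position
```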

How to fit a classifier with high accuracy on the training set with low features?

I have inputs (r, c) in the range (0, 1], the coordinates of a pixel of an image, and its color, which is 1 or 2 only.
I have about 6,400 pixels.
My attempt at fitting X=(r,c) and y=color was a failure; the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image; the second is the image I train on, which has only 2 colors. The last is the image that the neural network generated with about 500 weights after training for 50 iterations. The input layer has 2 units, there is one hidden layer of size 100, and the output layer has 2 units. (For binary classification like this I may need only one output unit, but I am just preparing for multi-class classification.)
The classifier failed to fit the training set; why is that? I tried generating high-order polynomial terms of those 2 features, but it doesn't help. I tried using a Gaussian kernel with 20-100 random landmarks on the picture to add more features, and also got similar output. I tried using logistic regression; it doesn't help.
Please help me increase the accuracy.
Here's the input: input.txt (you can load it into Octave; the variables are coordinate (the r, c features) and idx (the color)).
You can try plotting it first to make sure that you understand the input, then try training on it and tell me if you get a better result.
Your problem is hard to model. You are trying to fit a function from R^2 to R which has lots of complexity: lots of "spikes", lots of discontinuous regions (pixels that are completely separated from the rest). This is not an easy problem, and not a useful one. In order to overfit your network to such a setting you will need plenty of hidden units. Thus, what are the options to do so?
General things that are missing in the question and are important:
Your output variable should be in {0, 1} if you are fitting your network with a cross-entropy cost (log likelihood), which you should use for classification.
50 iterations (if you are talking about mini-batch iterations) is orders of magnitude too small, unless you mean 50 epochs (iterations over the whole training set).
Actual things that will probably need to be done (at least one of the below):
I assume that you are using ReLU activations (or tanh, it's hard to say looking at the output) - you could instead use RBF activations and increase the number of hidden neurons to ~5000,
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit a function of this complexity. Try an architecture of the type 100-100-100 instead.
If the above fails - increase the number of hidden units; that's all you need - enough capacity.
In general: neural networks are not designed for working with low-dimensional datasets. This is a nice example from the web showing that you can learn a pixel-position-to-color mapping, but it is completely artificial and seems to actually harm people's intuitions.
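For reference, a minimal sketch of the suggested deeper 100-100-100 architecture with a cross-entropy loss (assuming TensorFlow 2.x rather than the Octave setup in the question; the optimizer, epoch count, and batch size are illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),                       # (r, c) coordinates
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # color encoded as {0, 1}
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hypothetical usage with coordinates X of shape (6400, 2) and labels y in {0, 1};
# train for many epochs, not just 50 mini-batch iterations:
# model.fit(X, y, epochs=2000, batch_size=256)
```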
