Sub-patch generation mechanism for training a fully convolutional neural network

I have an image set consisting of 300 image pairs, i.e., a raw image and a mask image. A typical mask image is shown as follows. Each image has a size of 800*800. I am trying to train a fully convolutional neural network on this image set to perform semantic segmentation, and I want to generate small patches (256*256) from the original images to construct the training set. Are there any recommended strategies for this patch sampling process? Naturally, random sampling is a trivial approach. Here the area marked in yellow, the foreground class, usually takes up about 25% of the whole image area across the image set, so the data set tends to be imbalanced.
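For illustration only, a foreground-aware variant of the random sampling mentioned above might look like the sketch below (NumPy). The function, the `min_fg_ratio` threshold, and `max_tries` are my own made-up knobs, not anything prescribed in the thread.

```python
import numpy as np

def sample_patch(image, mask, patch_size=256, min_fg_ratio=0.10, max_tries=20, rng=None):
    """Randomly crop a patch, re-drawing until it contains enough foreground.

    image: (H, W, C) raw image; mask: (H, W) binary mask (1 = yellow/foreground).
    min_fg_ratio and max_tries are illustrative parameters.
    """
    rng = rng or np.random.default_rng()
    h, w = mask.shape
    best = None
    for _ in range(max_tries):
        top = rng.integers(0, h - patch_size + 1)
        left = rng.integers(0, w - patch_size + 1)
        patch_img = image[top:top + patch_size, left:left + patch_size]
        patch_msk = mask[top:top + patch_size, left:left + patch_size]
        fg_ratio = patch_msk.mean()
        # Keep the most foreground-rich crop seen so far as a fallback.
        if best is None or fg_ratio > best[2]:
            best = (patch_img, patch_msk, fg_ratio)
        if fg_ratio >= min_fg_ratio:
            break
    return best[0], best[1]
```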

If you train a fully convolutional architecture on 800x800 inputs, you get 25x25 outputs (after five 2x2 pooling layers, 25 = 800/2^5). Try to build the 25x25 target maps directly and train on them. You can add higher weights in the loss function for the "positive" labels to balance them against the "negative" ones.
I definitely do not recommend sampling, because it is an expensive process and is not really fully convolutional.
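As a rough illustration of the weighting idea in the answer above, a per-pixel weighted binary cross-entropy in Keras could be sketched as follows; `pos_weight=3.0` is an illustrative value for a roughly 25%/75% foreground/background split, not a figure given in the answer.

```python
import tensorflow as tf

def weighted_bce(pos_weight=3.0):
    """Binary cross-entropy that up-weights foreground pixels.

    pos_weight ~ 3.0 roughly compensates for a 25% foreground share;
    treat it as a tunable hyperparameter.
    """
    def loss(y_true, y_pred):
        # Per-pixel BCE, shape (batch, H, W).
        bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
        # Weight map: pos_weight where the mask is 1, 1.0 elsewhere.
        weights = y_true[..., 0] * (pos_weight - 1.0) + 1.0
        return tf.reduce_mean(bce * weights)
    return loss

# e.g. model.compile(optimizer="adam", loss=weighted_bce(3.0))
```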

Related

Best practice for large size image handling/processing with neural network

I have tried some neural network architectures for object classification and recognition. Such neural networks can distinguish cats from dogs, classify numbers from the MNIST dataset, and recover private keys from public ones. A feature of such models is a small number of neurons in the last layer, and the input images are scaled down to rather small sizes, for example 224x224 pixels. Now I would like to try to solve more complex (for me) problems using a neural network. I'm interested in neural networks for image super-resolution. For this purpose I want to use autoencoders or a fully convolutional network like UNET. At the moment I don't understand how exactly to handle large images. Should the complete image be fed to the input of the neural network, or should the image be processed in parts, divided into smaller tiles, with the final image assembled from the fragments received at the output of the network? I think that in the first case the network will become very large and will not converge to good results during training, and in the second case artifacts will appear in the final image at the junctions of the fragments. All the papers and articles I've read use small image sizes as examples.
But how, then, do generative adversarial network (GAN) models, denoising autoencoders, semantic segmentation, instance segmentation, or image upscaling networks work? After all, the output of such a network should be a large image, for example at 2K, 4K, or 8K resolution. Do I understand correctly that the number of input and output neurons in such networks is in the millions? How does this affect training time and convergence? Or are there other ways to process large images with neural networks?
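The question mentions splitting the image into tiles and reassembling the outputs; a minimal sketch of that approach, with overlapping tiles averaged to soften seam artifacts, might look like this. It assumes a `model` callable that maps a tile to a same-sized output; the tile and overlap values are illustrative.

```python
import numpy as np

def tiled_predict(model, image, tile=256, overlap=32):
    """Run a tile-sized model over a large image and blend overlapping outputs.

    model: callable mapping a (1, tile, tile, C) array to (1, tile, tile, C_out).
    Assumes the image is at least tile x tile; averaging overlaps reduces seams.
    """
    h, w, _ = image.shape
    step = tile - overlap
    tops = list(range(0, h - tile + 1, step))
    lefts = list(range(0, w - tile + 1, step))
    if tops[-1] != h - tile:
        tops.append(h - tile)    # cover the bottom strip
    if lefts[-1] != w - tile:
        lefts.append(w - tile)   # cover the right strip
    out_sum, count = None, np.zeros((h, w, 1), dtype=np.float32)
    for top in tops:
        for left in lefts:
            patch = image[top:top + tile, left:left + tile]
            pred = np.asarray(model(patch[None]))[0]
            if out_sum is None:
                out_sum = np.zeros((h, w, pred.shape[-1]), dtype=np.float32)
            out_sum[top:top + tile, left:left + tile] += pred
            count[top:top + tile, left:left + tile] += 1.0
    return out_sum / count
```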

How does the size of the patch/kernel impact the result of a convnet?

I am playing around with convolutional neural networks at home with TensorFlow (by the way, I have done the Udacity deep learning course, so I have the theoretical basis). What impact does the size of the patch have when one runs a convolution? Does that size have to change when the image is bigger or smaller?
One of the exercises I did involved the CIFAR-10 database of images (32x32 px); there I used 3x3 convolutions (with a padding of 1) and got decent results.
But let's say I now want to play with images larger than that (say 100x100): should I make my patches bigger? Do I keep them at 3x3? Furthermore, what would be the impact of making a patch really big? (Say 50x50.)
Normally I would test this at home directly, but running it on my computer is a bit slow (no NVIDIA GPU!).
So the question can be summarized as:
Should I increase/decrease the size of my patches when my input images are bigger/smaller?
What is the impact (in terms of performance/overfitting) of increasing/decreasing my patch size?
If you are not using padding, a larger kernel makes the number of neurons in the next layer smaller.
Example: a kernel of size 1x1 gives the next layer the same number of neurons; a kernel of size NxN (on an NxN input) gives only one neuron in the next layer.
The impact of a larger kernel:
Computation is faster and memory usage is smaller, because the following layers have fewer neurons.
You lose a lot of detail. Imagine an NxN input and a kernel of size NxN too; then the next layer gives you only one neuron. Losing a lot of detail can lead to underfitting.
The answer:
It depends on the images. If you need a lot of detail from the image, don't increase your kernel size; if your image is a 1000x1000-pixel large version of an MNIST image, I would increase the kernel size.
A smaller kernel gives you a lot of detail, which can lead to overfitting, while a larger kernel loses a lot of detail, which can lead to underfitting. You should tune your model to find the best size. Sometimes time and machine specifications should also be considered.
If you are using padding, you can adjust it so that the number of neurons after the convolution stays the same. I can't say it will be better than not using padding, but more detail is still lost than with a smaller kernel.
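A quick arithmetic check of the "fewer neurons" point, using the standard output-size formula for a convolution; the numbers below are just the CIFAR-10 case from the question.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Standard formula: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# No padding: a larger kernel shrinks the next layer more.
print(conv_output_size(32, 3))             # 30  (CIFAR-10 with a 3x3 kernel)
print(conv_output_size(32, 32))            # 1   (kernel as large as the input: one neuron)

# 'Same'-style padding for a 3x3 kernel keeps the size unchanged.
print(conv_output_size(32, 3, padding=1))  # 32
```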
It depends more on the size of the objects you want to detect or, in other words, on the size of the receptive field you want to have. Nevertheless, choosing the kernel size has always been a challenging decision. That is why the Inception model was created, which uses different kernel sizes in parallel (1x1, 3x3, 5x5). The creators of this model also went deeper and tried to decompose the convolutional layers into ones with a smaller patch size while maintaining the same receptive field, to speed up training (e.g., 5x5 was decomposed into two 3x3, and 3x3 was decomposed into 3x1 and 1x3), creating different versions of the Inception model.
You can also check the Inception V2 paper for more details: https://arxiv.org/abs/1512.00567
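To make the parallel-kernel idea concrete, here is a minimal Keras sketch in the spirit of an Inception block (filter counts are illustrative, and the real modules also add 1x1 bottlenecks and a pooling branch):

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_like_block(x, filters=32):
    """Parallel 1x1 / 3x3 / 5x5 branches concatenated along the channel axis."""
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
    return layers.Concatenate()([b1, b3, b5])

inputs = tf.keras.Input(shape=(100, 100, 3))
outputs = inception_like_block(inputs)
model = tf.keras.Model(inputs, outputs)  # output shape: (None, 100, 100, 96)
```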

Understanding Faster R-CNN

I'm trying to understand Fast(er) R-CNN, and these are the questions I'm looking to answer:
1. To train a Fast R-CNN model, do we have to give bounding box information in the training phase?
2. If we have to give bounding box information, then what is the role of the ROI layer?
3. Can we use a pre-trained model that is only trained for classification, not object detection, for Fast(er) R-CNN?
Your answers:
1.- Yes.
2.- The ROI layer is used to produce a fixed-size vector from variable-sized regions. This is done with max-pooling, but instead of using the typical n-by-n cells, the region is divided into an n-by-n grid of non-overlapping bins (which vary in size) and the maximum value in each bin is output. The ROI layer also does the job of projecting the bounding box from input space to feature space. (A toy sketch of this pooling appears after this list.)
3.- Faster R-CNN must be used with a pretrained network (typically pretrained on ImageNet); it is not trained from scratch. This might be a bit hidden in the paper, but the authors do mention that they use features from a pretrained network (VGG, ResNet, Inception, etc.).
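The sketch referenced in answer 2: a simplified NumPy illustration of ROI max-pooling, dividing an already-projected region into a fixed grid of bins and taking the maximum in each. It is for intuition only, not the implementation used in the Fast(er) R-CNN code bases.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=7):
    """Max-pool one ROI of a feature map into a fixed output_size x output_size grid.

    feature_map: (H, W, C) array; roi: (y0, x0, y1, x1) already projected into
    feature-map coordinates.
    """
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w, c = region.shape
    # Bin edges that split the region into output_size parts along each axis.
    ys = np.linspace(0, h, output_size + 1, dtype=int)
    xs = np.linspace(0, w, output_size + 1, dtype=int)
    out = np.zeros((output_size, output_size, c), dtype=feature_map.dtype)
    for i in range(output_size):
        for j in range(output_size):
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max(axis=(0, 1))
    return out
```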

How to make multi-scale images to train a CNN

I am working on convolutional neural networks with satellite images, and I want to try a multi-scale problem. Can you please suggest how I can build the multi-scale dataset? Since the input image size of a CNN is fixed (e.g., 100x100),
how can images of different scales be used to train the system for the multi-scale problem?
There is a similar question about YOLO9000: Multi-Scale Training?
Since there are only convolutional and pooling layers, the number of weight parameters stays the same when you input multi-scale images, so multi-scale images can be trained with a single CNN model.
The method differs between tasks. For example, in a classification task we can add a global pooling layer after the last layer, while in a detection task the output size is not fixed if we input multi-scale images.
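A minimal Keras sketch of the classification case: a network built only from convolution, pooling, and global average pooling, so the spatial input size is free to vary. The filter counts and the 10-class head are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

# No fixed spatial size: 100x100, 200x200, ... images share the same weights.
inputs = tf.keras.Input(shape=(None, None, 3))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)               # collapses whatever H x W remains
outputs = layers.Dense(10, activation="softmax")(x)  # 10 classes, illustrative
model = tf.keras.Model(inputs, outputs)

model(tf.zeros((1, 100, 100, 3)))  # works
model(tf.zeros((1, 224, 224, 3)))  # also works, same weights
```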

Poor performance on digit recognition with CNN trained on MNIST dataset

I trained a CNN (on tensorflow) for digit recognition using MNIST dataset.
Accuracy on test set was close to 98%.
I wanted to predict digits using data that I created myself, and the results were bad.
What did I do to the images I wrote myself?
I segmented out each digit, converted it to grayscale, resized it to 28x28, and fed it to the model.
How come I get such low accuracy on my own data set but such high accuracy on the test set?
Are there other modifications that I'm supposed to make to the images?
EDIT:
Here is the link to the images and some examples:
Excluding bugs and obvious errors, my guess would be that your problem is that you are capturing your handwritten digits in a way that is too different from your training set.
When capturing your data, you should try to mimic as closely as possible the process used to create the MNIST dataset.
From the official MNIST dataset website:
The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.
If your data is processed differently in the training and test phases, then your model will not be able to generalize from the training data to the test data.
So I have two pieces of advice for you:
Try to capture and process your digit images so that they look as similar as possible to the MNIST dataset;
Add some of your own examples to your training data, so that your model can train on images similar to the ones you are classifying.
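To mimic the normalization quoted above on your own digits, a rough sketch with Pillow and NumPy could look like this. It is my own illustration of the quoted procedure (20x20 size normalization, then center-of-mass centering in a 28x28 field), not code from the thread.

```python
import numpy as np
from PIL import Image

def mnistify(digit_img):
    """Roughly mimic the MNIST normalization described above.

    digit_img: PIL image of a single digit, white ink on black background
    (invert yours first if needed). Returns a 28x28 float array in [0, 1].
    """
    # 1. Size-normalize into a 20x20 box, preserving aspect ratio.
    w, h = digit_img.size
    scale = 20.0 / max(w, h)
    digit = digit_img.convert("L").resize(
        (max(1, int(round(w * scale))), max(1, int(round(h * scale)))),
        Image.BILINEAR)
    arr = np.asarray(digit, dtype=np.float32) / 255.0

    # 2. Paste into a 28x28 field, shifting so the center of mass lands near (14, 14).
    canvas = np.zeros((28, 28), dtype=np.float32)
    ys, xs = np.nonzero(arr)
    if len(ys) == 0:
        return canvas
    weights = arr[ys, xs]
    cy = (ys * weights).sum() / weights.sum()
    cx = (xs * weights).sum() / weights.sum()
    top = min(max(int(round(14 - cy)), 0), 28 - arr.shape[0])
    left = min(max(int(round(14 - cx)), 0), 28 - arr.shape[1])
    canvas[top:top + arr.shape[0], left:left + arr.shape[1]] = arr
    return canvas
```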
For those who still have a hard time with the poor performance of CNN-based MNIST models on their own digits:
https://github.com/christiansoe/mnist_draw_test
Normalization was the key.
