I am trying to train a CNN to predict the resolution of images. I have train/test data with the following naming convention:
labelName_imgName.jpg
where labelName is an integer representing the resolution of the image. I want to extract the labels as integers so that my neural network can have a single output as the predicted label. Any help would be appreciated.
Currently my CNN produces a single number, but only after passing through a 100-unit dense layer (since resolutions range from 1 to 100), and the performance is poor. I would like to verify whether this way of handling the labels, with the dense layer, is what causes the poor results.
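A minimal sketch of the label extraction, assuming every file name really follows labelName_imgName.jpg and that the label is everything before the first underscore. The function names and the example file names are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def label_from_filename(path):
    """Return the integer label encoded before the first underscore."""
    stem = Path(path).stem                  # "72_cat001.jpg" -> "72_cat001"
    label_text, _, _ = stem.partition("_")  # keep only the part before "_"
    return int(label_text)                  # raises ValueError if not an integer


def labels_for(paths):
    """Extract one integer label per file path."""
    return [label_from_filename(p) for p in paths]


# Hypothetical file names, including one with a directory prefix
# and one with extra underscores inside the image name.
files = ["72_cat001.jpg", "data/train/5_dog_17.jpg", "100_x.jpg"]
print(labels_for(files))  # prints [72, 5, 100]
```

Integer labels obtained this way could serve directly as targets for a network with a single output unit, if the task is treated as regression; whether that performs better than the 100-unit layer is something to test, not something this sketch shows.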
Related
I am trying to design a neural network whose output (feature) vector has the same size as the input vector. In essence, I have an image (my input) and I wish to perform a regression task on each of the pixels (i.e., my output is a prediction of how I should act on each pixel).
However, my experience with ML (I am a newbie) suggests that the output vector is usually small compared to the input vector. Is there a reason why I must design my network in a similar manner? Are there any pitfalls in having an output vector as long as the input vector?
You can safely make the output of the network as big as the input. Look, for example, at U-Net for semantic segmentation: there is one output for each pixel, representing the category (class) of that pixel.
I'm trying to understand Fast(er) R-CNN, and these are the questions I'm trying to answer:
1.- To train a Fast R-CNN model, do we have to give bounding box information in the training phase?
2.- If we have to give bounding box information, then what is the role of the ROI layer?
3.- Can we use a pre-trained model that was trained only for classification, not object detection, for Fast(er) R-CNN?
Your answers:
1.- Yes.
2.- The ROI layer is used to produce a fixed-size vector from variable-sized regions of interest. This is done with max-pooling, but instead of using the typical fixed n by n cells, the region is divided into n by n non-overlapping bins (which vary in size with the region) and the maximum value in each bin is output. The ROI layer also does the job of projecting the bounding box from input space to feature space.
3.- Yes. In practice Faster R-CNN is used with a network pretrained for classification (typically on ImageNet) as its backbone, rather than being trained from scratch. This might be a bit hidden in the paper, but the authors do mention that they use features from a pretrained network (VGG, ResNet, Inception, etc.).
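The pooling described in answer 2 can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the implementation from the paper or from any library: the feature map is a single-channel 2D list, boxes are (top, left, bottom, right) with exclusive bottom/right, and the rounding used to project the box (floor for the start, ceiling for the end) is one possible convention. The names `project_box` and `roi_max_pool` are hypothetical.

```python
def project_box(box, stride):
    """Map a box from input-image pixels to feature-map cells.

    Assumes the feature map is `stride` times smaller than the input.
    Start coordinates are floored and end coordinates are ceiled.
    """
    top, left, bottom, right = box
    return (top // stride, left // stride,
            -(-bottom // stride), -(-right // stride))


def roi_max_pool(feature_map, roi, out_size):
    """Max-pool one region of a 2D feature map into out_size x out_size bins."""
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    if height < out_size or width < out_size:
        raise ValueError("region is smaller than the output grid")
    pooled = []
    for i in range(out_size):
        # Integer bin edges: bins differ in size when height % out_size != 0.
        r0 = top + (i * height) // out_size
        r1 = top + ((i + 1) * height) // out_size
        row = []
        for j in range(out_size):
            c0 = left + (j * width) // out_size
            c1 = left + ((j + 1) * width) // out_size
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 6x6 feature map whose value at (r, c) is r * 6 + c.
fmap = [[r * 6 + c for c in range(6)] for r in range(6)]
roi = project_box((0, 0, 80, 64), stride=16)  # -> (0, 0, 5, 4)
print(roi_max_pool(fmap, roi, out_size=2))    # prints [[7, 9], [25, 27]]
```

Whatever the size of the region, the output is always out_size by out_size, which is what lets the following fully connected layers have a fixed input size.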
I've created a feedforward neural network using DL4J in Java.
Hypothetically and to keep things simple, assume this neural network is a binary classifier of squares and circles.
The input, a feature vector, would be composed of say... 5 different variables:
[number_of_corners,
number_of_edges,
area,
height,
width]
Now so far, my binary classifier can tell the two shapes apart quite well as I'm giving it a complete feature vector.
My question: is it possible to input only 2 or 3 of these features? Or even just 1? I understand the results will be less accurate, but I need to be able to do it.
If it is possible, how?
How would I do it for a neural network with 213 different features in the input vector?
Let's assume, for example, that you know the area, height, and width features (so you don't know the number_of_corners and number_of_edges features).
If you know that a shape can have, say, a maximum of 10 corners and 10 edges, you could input 10 feature vectors with the same area, height and width but where each vector has a different value for the number_of_corners and number_of_edges features. Then you can just average over the 10 outputs of the network and round to the nearest integer (so that you still get a binary value).
Similarly, if you only know the area feature you could average over the outputs of the network given several random combinations of input values, where the only fixed value is the area and all the others vary. (I.e. the area feature is the same for each vector but every other feature has a random value.)
This may be a "trick" but I think that the average will converge to a value as you increase the number of (almost-)random vectors.
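The averaging idea above can be sketched like this. Everything here is an assumption made for illustration: `predict_with_missing` is a hypothetical helper, `toy_model` is a stand-in for the trained network (it is not DL4J code), and the sketch tries every combination of the candidate values rather than only 10 vectors. The final label uses a 0.5 threshold on the mean instead of Python's `round`, which rounds exact halves to the nearest even integer.

```python
import itertools


def predict_with_missing(model, known, candidates):
    """Average model outputs over all combinations of unknown feature values.

    known:      dict of feature name -> known value
    candidates: dict of unknown feature name -> list of values to try
    Returns (mean output, binary label thresholded at 0.5).
    """
    names = list(candidates)
    outputs = []
    for combo in itertools.product(*(candidates[n] for n in names)):
        features = dict(known)             # copy the known features
        features.update(zip(names, combo)) # fill in one guess for the unknowns
        outputs.append(model(features))
    mean = sum(outputs) / len(outputs)
    return mean, int(mean >= 0.5)


def toy_model(features):
    """Stand-in for the trained classifier: 1.0 = square, 0.0 = circle."""
    return 1.0 if features["number_of_corners"] >= 4 else 0.0


known = {"area": 4.0, "height": 2.0, "width": 2.0}
candidates = {
    "number_of_corners": list(range(0, 11)),
    "number_of_edges": list(range(0, 11)),
}
mean, label = predict_with_missing(toy_model, known, candidates)
print(mean, label)
```

The number of combinations grows exponentially with the number of unknown features, so with many missing features one would have to average over random samples instead of the full grid.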
Edit
My solution would not be a good choice if you have a lot of features. In that case you could try using a Deep Belief Network or some autoencoder to infer the values of the other features given a small number of them. For example, a DBN can "reconstruct" a noisy input (if you train it enough, of course); you could then give the reconstructed input vector to your feed-forward network.
I trained a CNN (in TensorFlow) for digit recognition using the MNIST dataset.
Accuracy on the test set was close to 98%.
I wanted to predict digits using data that I created myself, and the results were bad.
What did I do to the images I wrote myself?
I segmented out each digit, converted it to grayscale, resized it to 28x28 and fed it to the model.
How come I get such low accuracy on my own data set when I get such high accuracy on the test set?
Are there other modifications that I'm supposed to make to the images?
EDIT:
Here is the link to the images and some examples:
Excluding bugs and obvious errors, my guess is that you are capturing your handwritten digits in a way that is too different from your training set.
When capturing your data you should try to mimic as closely as possible the process used to create the MNIST dataset.
From the official MNIST dataset website:
The original black and white (bilevel) images from NIST were size
normalized to fit in a 20x20 pixel box while preserving their aspect
ratio. The resulting images contain grey levels as a result of the
anti-aliasing technique used by the normalization algorithm. the
images were centered in a 28x28 image by computing the center of mass
of the pixels, and translating the image so as to position this point
at the center of the 28x28 field.
If your data is processed differently in the training and test phases, your model will not be able to generalize from the training data to the test data.
So I have two pieces of advice for you:
Try to capture and process your digit images so that they look as similar as possible to the MNIST dataset.
Add some of your own examples to your training data, so that your model can train on images similar to the ones you are classifying.
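As a rough sketch of the centering step from the MNIST description quoted above: the function below places a digit on a 28x28 canvas so that its center of mass lands at the canvas center. It assumes the digit has already been scaled to fit a 20x20 box with its aspect ratio preserved, and that it is stored as a 2D list with bright strokes on a dark background, as in MNIST. The function name, the choice of (28 - 1) / 2 as the center, and the rounding of the shift to whole pixels are assumptions of this sketch, not details taken from the MNIST site.

```python
def center_by_mass(digit, canvas_size=28):
    """Place `digit` on a square canvas with its center of mass at the center.

    digit: 2D list of non-negative intensities (bright digit, dark background).
    Pixels that would fall outside the canvas are dropped.
    """
    total = sum(sum(row) for row in digit)
    if total == 0:
        raise ValueError("image is empty, center of mass is undefined")
    # Intensity-weighted mean row and column.
    cy = sum(r * v for r, row in enumerate(digit) for v in row) / total
    cx = sum(c * v for row in digit for c, v in enumerate(row)) / total
    center = (canvas_size - 1) / 2
    dy = round(center - cy)  # whole-pixel shift along rows
    dx = round(center - cx)  # whole-pixel shift along columns
    canvas = [[0] * canvas_size for _ in range(canvas_size)]
    for r, row in enumerate(digit):
        for c, value in enumerate(row):
            rr, cc = r + dy, c + dx
            if 0 <= rr < canvas_size and 0 <= cc < canvas_size:
                canvas[rr][cc] = value
    return canvas


# Toy 20x20 "digit": a 2x2 bright block in the top-left corner.
digit = [[0] * 20 for _ in range(20)]
for r in (0, 1):
    for c in (0, 1):
        digit[r][c] = 255
centered = center_by_mass(digit)
print(centered[13][13], centered[14][14])  # prints 255 255
```

If your own images are dark ink on light paper, they would need to be inverted before this step so that they match the polarity of MNIST.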
For those still having a hard time with the poor quality of CNN-based models trained on MNIST:
https://github.com/christiansoe/mnist_draw_test
Normalization was the key.
I'm training a neural network to classify objects in an image. I crop 40x40 pixel images and classify each one as being some object or not. So the network has 1600 input neurons, 3 hidden layers (500, 200, 30) and 1 output neuron that must say 1 or 0. I use the Flood library.
I cannot train it with QuasiNewtonMethod, because the algorithm uses a big matrix that does not fit in my memory. So I use GradientDescent, and the ObjectiveFunctional is NormalizedSquaredError.
The problem is that during training the weights overflow, and the output of the neural network is INF or NaN for every input.
Also, my dataset is too big (about 800 MB as CSV) to load fully. So I made many InputTargetDataSets with 1000 instances each, saved them as XML (the default format for Flood), and train for one epoch on each dataset, randomly shuffled. But it also overflows when I train on just one big dataset (10000 instances).
Why is this happening and how can I prevent it?
I would recommend normalizing the inputs. Also keep in mind that with 1600 input neurons, the weighted sum fed into each neuron of the next layer (if you use sigmoid neurons) can become very large, and that can cause many problems.
It is quite useful to print out some intermediate steps, for example to see at which step it overflows.
There are some tips for the initial weights of the neurons. I would recommend very small values, < 0.01. If you could give more information about the NN, the ranges of the inputs, the weights, etc., I could give you some other ideas.
And by the way, I think it has been mathematically proven that two layers should be enough, so there is no need for three hidden layers unless you are using some specialized algorithm that simulates the human eye.
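The two suggestions above, normalized inputs and small initial weights, can be sketched in plain Python. This is not Flood code (Flood is a C++ library and its API is not shown here); the function names, the per-column standardization, and the uniform range of 0.01 are assumptions made for illustration.

```python
import random


def standardize_columns(rows, eps=1e-8):
    """Scale each input column to zero mean and (roughly) unit variance.

    rows: list of equally long lists of numbers (one list per training instance).
    Returns (scaled rows, column means, column standard deviations) so that the
    same means and deviations can be reused on new data.
    """
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [(sum((r[j] - means[j]) ** 2 for r in rows) / n) ** 0.5
            for j in range(dims)]
    # eps avoids division by zero for constant columns.
    scaled = [[(r[j] - means[j]) / (stds[j] + eps) for j in range(dims)]
              for r in rows]
    return scaled, means, stds


def small_random_weights(n_inputs, n_outputs, scale=0.01, seed=0):
    """Initial weight matrix with entries drawn uniformly from [-scale, scale]."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(n_inputs)]
            for _ in range(n_outputs)]


# Three toy instances with raw pixel-like values in [0, 255].
raw = [[0.0, 10.0], [255.0, 10.0], [128.0, 10.0]]
scaled, means, stds = standardize_columns(raw)
weights = small_random_weights(n_inputs=2, n_outputs=3)
print(scaled)
print(weights)
```

With inputs on this scale and weights this small, the weighted sums entering the first hidden layer start out close to zero, which makes an overflow in the first training steps much less likely.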