I'm fairly new to the subject and have built a convolutional neural network based on Google's TensorFlow. I want to classify a test data set of pictures belonging to 10 categories. My CNN setup follows the TensorFlow tutorial, with some amendments to match my images' size.
I ran the training step 20 times over a random sample of 500 images, and then repeated that procedure 50 times on different samples of size 500. I used a sample of 200 images as a validation data set (kept fixed for all runs). As a result I got an accuracy of about 35%, which isn't too bad in my eyes, since I didn't do any optimization and the images are kind of hard to assign to a single category even for humans.
So here are my questions:
Does it really make sense to run a step 20 times over the same batch? (I did this because it's about what fits in RAM and loading a new batch took quite a while, so I could get more runs in less time.)
In the training accuracy diagram (see below) there's a jump at some point around step 120-130. From there on the accuracy goes up close to 100% for each 20-run of the same random batch. What does that jump mean in terms of network structure / learning?
Your spikes are likely due to the network overfitting on the batch that you are repeatedly showing it, while not really learning something that is useful in general. This also answers your first question - in this case, it doesn't make sense.
tl;dr - I use an autoencoder to try to reduce input dimensions for a reinforcement-learning (RL) agent to learn how to play Atari-KungFu. But it fails at encoding/decoding thrown knives, because they are only a couple pixels and getting them wrong probably has negligible impact on the autoencoder MSE loss (see green arrows in bottom left of image). This will probably permanently hobble the results. I want to figure out if there is a way to solve this -- preferably with a generalized solution, but I'd be happy for now with something specific to this problem.
Background:
I am working on Week5 of the "Practical Reinforcement Learning" course on Coursera (National Research University HSE), and I decided to spend extra time trying to expand performance on the Atari-KungFu assignment using Actor-Critic architecture. This post is not about actor-critic, but more about an interesting sub-problem I ran into related to autoencoders.
I create an encoder which outputs a tanh-64-neuron layer, which is used as a common input to the decoder, policy learner (actor), and value learner (critic). During training, the simulator returns batches of four sequential frames (64 x 144 x 4) and rewards from the last action. Then images are first used to train the autoencoder, then used with the rewards to train the actor & critic branches.
I display some metrics and example frames every 25000 iterations to see how it's doing. If the reconstructed images are accurate, then the inputs to the actor & critic branches should be getting good distilled information for efficient learning.
You can see below that the autoencoder is pretty good except for the thrown knives (see bottom-left). Arguably this is because missing those couple of pixels barely increases the MSE loss of the reconstructed image, so the autoencoder has little incentive to learn them (and there also aren't many frames that contain knives). Yet seeing those knives is critical for the RL agent to learn how to survive.
I haven't seen this kind of problem addressed before. A tiny artifact in the input images is crucial for learning, but is unlikely to be learned by the autoencoder. Can we fix/improve this?
IMO your problem is loss-specific. Some things that would probably help the autoencoder reconstruct the knives as well:
Find knives in the input image using image processing techniques. Regions where knives are present should get a higher weight in the MSE loss, say 10 times more. One way to find them semi-automatically could be a convolution with a big kernel that responds strongly when there are white pixels at the exact center and only zeros around them. Something along these lines should pick out only the regions where knives are located (the knife throwers wouldn't match, as they contain too many white pixels and holes). Thresholding the kernel response at an empirically chosen value should be enough to find them correctly.
Lower the loss for images in which no knife was found, say by halving it. This would focus the autoencoder harder on the rarely seen cases in which a knife is present.
On the downside, I suppose this could introduce some artifacts. In that case you may consider using a pretrained encoder (like some version of ResNet) and increasing the model's capacity.
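As a rough illustration of the weighting idea above, here is a minimal NumPy sketch. It assumes a boolean `knife_mask` has already been produced by some detection step (not shown here); the function name, the factor of 10 and the factor of 0.5 are only the illustrative values from this answer, not tuned settings.

```python
import numpy as np

def weighted_mse(target, recon, knife_mask, knife_weight=10.0, no_knife_scale=0.5):
    """Per-pixel weighted MSE for a single frame.

    knife_mask is a boolean array of the same shape as target, True where a
    knife was detected (assumed to come from a separate detection step).
    """
    target = np.asarray(target, dtype=np.float64)
    recon = np.asarray(recon, dtype=np.float64)
    # Every pixel counts once, knife pixels count knife_weight times.
    weights = np.ones_like(target)
    weights[knife_mask] = knife_weight
    loss = np.mean(weights * (target - recon) ** 2)
    # Frames without any knife contribute less to the total loss.
    if not knife_mask.any():
        loss *= no_knife_scale
    return loss

# Tiny example: a 4x4 frame with one bright "knife" pixel that the
# reconstruction misses completely.
frame = np.zeros((4, 4))
frame[1, 1] = 1.0
blank_recon = np.zeros((4, 4))
mask = frame > 0.5
loss_with_knife = weighted_mse(frame, blank_recon, mask)
loss_without_mask = weighted_mse(frame, blank_recon, np.zeros((4, 4), dtype=bool))
```

In a real training setup the same weighting would have to be expressed in the framework's own tensor operations so that gradients flow through it.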
I am experimenting with classification using neural networks (I am using tensorflow).
And unfortunately the training of my neural network gets stuck at 42% accuracy.
I have 4 classes, into which I try to classify the data.
And unfortunately, my data set is not well balanced, meaning that:
43% of the data belongs to class 1 (and yes, my network gets stuck predicting only this)
37% to class 2
13% to class 3
7% to class 4
The optimizer I am using is AdamOptimizer and the cost function is tf.nn.softmax_cross_entropy_with_logits.
I was wondering if the reason for my training getting stuck at 42% is really the fact that my data set is not well balanced, or because the nature of the data is really random, and there are really no patterns to be found.
Currently my NN consists of:
input layer
2 convolution layers
7 fully connected layers
output layer
I tried changing this structure of the network, but the result is always the same.
I also tried Support Vector Classification, and the result is pretty much the same, with small variations.
Did somebody else encounter similar problems?
Could anybody please provide me some hints how to get out of this issue?
Thanks,
Gerald
I will assume that you have already double, triple and quadruple checked that the data going in is matching what you expect.
The question is quite open-ended, and even a topic for research. But there are some things that can help.
In terms of better training, there are two common ways in which people train neural networks on an unbalanced dataset.
Oversample the examples with lower frequency, such that the proportion of examples for each class that the network sees is equal. e.g. in every batch, enforce that 1/4 of the examples are from class 1, 1/4 from class 2, etc.
Weight the error for misclassifying each class by the inverse of its proportion, e.g. incorrectly classifying an example of class 1 is worth 100/43, while incorrectly classifying an example of class 4 is worth 100/7.
That being said, if your learning rate is good, neural networks will often eventually (after many hours of just sitting there) jump out of only predicting for one class, but they still rarely end well with a badly skewed dataset.
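To make the two options above concrete, here is a small plain-Python sketch using the class percentages from the question. The helper names (`weighted_cross_entropy`, `balanced_batch`) are made up for illustration, and the loss is written for a single example rather than as a TensorFlow op.

```python
import math
import random

# Class shares from the question, in percent.
class_percent = {1: 43, 2: 37, 3: 13, 4: 7}

# Option 2: weight each class by the inverse of its share (100/43, ..., 100/7).
class_weight = {c: 100.0 / pct for c, pct in class_percent.items()}

def weighted_cross_entropy(probs, label):
    """Cross-entropy for one example, scaled by the weight of its true class.

    probs maps class -> predicted probability, label is the true class.
    """
    return -class_weight[label] * math.log(probs[label])

# Option 1: build batches with the same number of examples per class.
def balanced_batch(examples_by_class, per_class, rng):
    """Sample per_class examples from every class, with replacement."""
    batch = []
    for cls, examples in examples_by_class.items():
        batch.extend((x, cls) for x in rng.choices(examples, k=per_class))
    rng.shuffle(batch)
    return batch

uniform = {c: 0.25 for c in class_percent}
loss_rare = weighted_cross_entropy(uniform, 4)
loss_common = weighted_cross_entropy(uniform, 1)

toy_data = {c: list(range(pct)) for c, pct in class_percent.items()}
batch = balanced_batch(toy_data, per_class=5, rng=random.Random(0))
```

With the same predicted probability, a mistake on class 4 costs 43/7 times as much as one on class 1, which is the effect the weighting is after.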
If you want to know whether or not there are patterns in your data which can be determined, there is a simple way to do that.
Create a new dataset by randomly selecting elements from all of your classes such that you have an equal number from each (i.e. if there are 700 examples of class 4, then construct a dataset by randomly selecting 700 examples from every class).
Then you can use all of your techniques on this new dataset.
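A possible way to build such a balanced dataset, as a sketch in plain Python: the function name `balanced_subset` and the seed are arbitrary, and `samples` and `labels` stand in for however your data is actually stored.

```python
import random
from collections import defaultdict

def balanced_subset(samples, labels, seed=0):
    """Return (sample, label) pairs with the same count for every class.

    The count is the size of the rarest class; the other classes are
    randomly subsampled without replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    n = min(len(xs) for xs in by_class.values())
    subset = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, n)]
    rng.shuffle(subset)
    return subset

# Toy data with the 43/37/13/7 split from the question.
labels = [1] * 43 + [2] * 37 + [3] * 13 + [4] * 7
samples = list(range(len(labels)))
subset = balanced_subset(samples, labels)
```

If a model trained on this subset still cannot beat 25% accuracy on four classes, that is a hint that the features carry little signal.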
Although, this paper suggests that even with random labels, it should be able to find some pattern that it understands.
First you should check whether your model is overfitting or underfitting, either of which could cause low accuracy. Check the accuracy on both the training set and the dev set: if accuracy on the training set is much higher than on the dev/test set, the model may be overfitting, and if accuracy on the training set is as low as it is on the dev/test set, it could be underfitting.
For overfitting, more data or a simpler model structure may help, while a more complex structure and longer training time may solve an underfitting problem.
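The check described above can be written down as a tiny helper. This is only a sketch: the function name and both thresholds (`gap_tol`, `target_acc`) are assumptions that you would pick for your own problem.

```python
def diagnose(train_acc, dev_acc, gap_tol=0.10, target_acc=0.80):
    """Rough fit diagnosis from training and dev accuracy (both in 0..1).

    gap_tol and target_acc are illustrative thresholds, not standard values.
    """
    if train_acc - dev_acc > gap_tol:
        return "overfitting"      # much better on training data than on dev data
    if train_acc < target_acc and dev_acc < target_acc:
        return "underfitting"     # equally poor on both sets
    return "ok"

stuck_model = diagnose(train_acc=0.42, dev_acc=0.41)
memorizing_model = diagnose(train_acc=0.99, dev_acc=0.60)
```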
I'm doing some programming with neural network backpropagation.
I have about 90 data points and train on all of them (90 data points), then test on the same data (90 data points). With an iteration threshold of about 2 iterations, it gave me quite a big error (about 60% MAPE, Mean Absolute Percentage Error).
I'm afraid I've got the algorithm wrong, since the only way to get the training error below the 10% threshold is to use an iteration threshold of around 3000k iterations, and training then takes quite a long time (I'm not using momentum, just a plain backpropagation neural network). But under those conditions the test accuracy is around 95-99%.
Is this normal? Or is my program not working as it should?
Of course, it will depend on the data set used, but I wouldn't be surprised if you get an error below 1% even for highly nonlinear data (I've seen this for example in sales data). As long as you separate training and test data sets, the error is expected to rise, but with the same set, it should drop to zero if there are enough hidden units. The capacity of an ANN to fit nonlinear data is huge (and, of course, the more fitted, the less general).
So, I would look for some program bug instead.
You say 3000k iterations, but I assume you mean 3k or 3000. The other answer says there might be a bug in your code, but 3000 iterations for a problem with 90 samples is definitely normal.
You cannot expect a neural network to fit a training set with just 2 iterations, especially with a low learning rate.
TL;DR - you have nothing to worry about. 3000 iterations is fine.
I am trying to over-fit my model over my training data that consists of only a single sample. The training accuracy comes out to be 1.00. But, when I predict the output for my test data which consists of the same single training input sample, the results are not accurate. The model has been trained for 100 epochs and the loss ~ 1e-4.
What could be the possible sources of error?
As mentioned in the comments of your post, it isn't possible to give specific advice without you first providing more details.
Generally speaking, your approach to overfitting a tiny batch (in your case one image) is in essence providing three sanity checks, i.e. that:
backprop is functioning
the weight updates are doing their job
the learning rate is in the correct order of magnitude
As is pointed out by Andrej Karpathy in Lecture 5 of CS231n course at Stanford - "if you can't overfit on a tiny batch size, things are definitely broken".
This means, given your description, that your implementation is incorrect. I would start by checking each of the three points listed above. For example, alter your test by picking several different images or a batch size of 5 images instead of one. You could also revise your predict function, as that is where there is definitely some discrepancy, given you are getting zero error during training (and so validation?).
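For reference, this is roughly what a passing version of that sanity check looks like, sketched with a tiny softmax classifier in NumPy rather than your actual model. The feature size, learning rate and class count are arbitrary assumptions. The key point is that training and prediction go through the same `predict_proba` function, so the prediction on the training sample has to agree with the training loss.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))   # a single training sample with 8 features
y = np.array([2])             # its class label, out of 3 classes
W = np.zeros((8, 3))
b = np.zeros(3)

def predict_proba(x, W, b):
    """Softmax probabilities; used both for training and for prediction."""
    logits = x @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

learning_rate = 0.5
for epoch in range(100):
    p = predict_proba(x, W, b)
    # Gradient of the cross-entropy loss with respect to the logits.
    grad_logits = p.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0
    W -= learning_rate * (x.T @ grad_logits)
    b -= learning_rate * grad_logits.sum(axis=0)

# "Test" on the very same sample: loss should be tiny, prediction correct.
p = predict_proba(x, W, b)
loss = -np.log(p[0, y[0]])
pred = p.argmax(axis=1)
```

If your own model reaches a low training loss but the prediction path disagrees on the same input, look for differences between the two paths, such as different preprocessing or layers that behave differently in training and inference mode.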
I am relatively new to deep learning and its frameworks. Currently, I am experimenting with the Caffe framework and trying to fine-tune Vgg16_places_365.
I am using an Amazon EC2 g2.8xlarge instance with 4 GPUs (each has 4 GB of RAM). However, when I try to train my model (using a single GPU), I get this error:
Check failed: error == cudaSuccess (2 vs. 0) out of memory
After I did some research, I found that one of the ways to solve this out of memory problem is by reducing the batch size in my train.prototxt
Caffe | Check failed: error == cudaSuccess (2 vs. 0) out of memory.
Initially, I set the batch size to 50 and iteratively reduced it down to 10 (since it worked when batch_size = 10).
Now the model is being trained, and I am pretty sure it will take quite a long time. However, as a newcomer to this domain, I am curious about the relation between the batch size and other parameters such as the learning rate, the stepsize and even the max iteration that we specify in solver.prototxt.
How significantly does the batch size affect the quality of the model (e.g. accuracy)? How can the other parameters be used to improve the quality? Also, instead of reducing the batch size or scaling up my machine, is there another way to fix this problem?
To answer your first question regarding the relationship between parameters such as batch size, learning rate and maximum number of iterations, you are best off reading about the mathematical background. A good place to start might be this stats.stackexchange question: How large should the batch size be for stochastic gradient descent?. The answer briefly discusses the relation between batch size and learning rate (from your question I assume learning rate = stepsize) and also provides some references for further reading.
To answer your last question, with the dataset you are finetuning on and the model (i.e. the VGG16) being fixed (i.e. the input data of fixed size, and the model of fixed size), you will have a hard time avoiding the out of memory problem for large batch sizes. However, if you are willing to reduce the input size or the model size you might be able to use larger batch sizes. Depending on how (and what) exactly you are finetuning, reducing the model size may already be achieved by discarding learned layers or reducing the number/size of fully connected layers.
The remaining questions, i.e. how significantly the batch size influences quality/accuracy and how other parameters influence quality/accuracy, are hard to answer without knowing the concrete problem you are trying to solve. The influence of e.g. the batch size on the achieved accuracy might depend on various factors such as the noise in your dataset, the dimensionality of your dataset, the size of your dataset, as well as other parameters such as learning rate (=stepsize) or momentum. For these sorts of questions, I recommend the textbook by Goodfellow et al.; e.g. chapter 11 may provide some general guidelines on choosing these hyperparameters (i.e. batch size, learning rate etc.).
Another way to solve your problem is to use all the GPUs on your machine. If you have 4x4=16GB RAM on your GPUs, that would be enough. If you are running Caffe from the command line, just add the --gpu argument as follows (assuming you have 4 GPUs with the default indices 0,1,2,3):
build/tools/caffe train --solver=solver.prototxt --gpu=0,1,2,3
However if you are using the python interface, running with multiple GPUs is not yet supported.
I can point out some general hints to answer your question on the batchsize:
- The smaller the batchsize is, the more stochastic your learning would be --> less probability of overfitting on the training data; higher probability of not converging.
- each iteration in caffe fetches one batch of data, runs forward and ends with a backpropagation.
- Let's say your training data is 50'000 and your batchsize is 10; then in 1000 iterations, 10'000 of your data has been fed to the network. In the same scenario, if your batchsize is 50, in 1000 iterations, all your training data are seen by the network. This is called one epoch. You should design your batchsize and maximum iterations in a way that your network is trained for a certain number of epochs.
- stepsize in caffe, is the number of iterations your solver will run before multiplying the learning rate with the gamma value (if you have set your training approach as "step").
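The arithmetic in the last two points can be spelled out in a few lines of Python. The function names are made up for illustration, and the `base_lr`, `gamma` and `stepsize` values in the example are placeholders rather than recommendations; the learning rate formula is the one Caffe uses for the "step" policy, base_lr * gamma ^ floor(iter / stepsize).

```python
import math

def epochs_seen(n_train, batch_size, iterations):
    """How many passes over the training set a run amounts to."""
    return batch_size * iterations / n_train

def max_iter_for_epochs(n_train, batch_size, epochs):
    """Iterations needed to train for the given number of epochs."""
    return math.ceil(epochs * n_train / batch_size)

def step_lr(base_lr, gamma, stepsize, iteration):
    """Learning rate under the "step" policy at a given iteration."""
    return base_lr * gamma ** (iteration // stepsize)

# The example from above: 50'000 training images, 1000 iterations.
small_batch = epochs_seen(50000, batch_size=10, iterations=1000)
large_batch = epochs_seen(50000, batch_size=50, iterations=1000)

# With batch_size = 10, this many iterations are needed for 30 epochs.
needed = max_iter_for_epochs(50000, batch_size=10, epochs=30)

# Placeholder solver values: the rate drops by 10x every 1000 iterations.
lr_at_2500 = step_lr(base_lr=0.01, gamma=0.1, stepsize=1000, iteration=2500)
```

So after shrinking the batch size from 50 to 10, max_iter (and usually stepsize) would need to grow by about a factor of 5 to cover the same number of epochs.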