I am fine-tuning a VGG16 network on a 32-CPU machine using TensorFlow, with sparse cross-entropy loss. I have to classify clothing images into 50 classes. After 2 weeks of training this is how the loss is going down, which feels like very slow convergence. My batch size is 50. Is this normal, or what do you think is going wrong here? Accuracy is also really bad. And now it has crashed with a bad memory allocation error.
terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc
The last line in my log file looks like this:
2016-12-13 08:56:57.162186: step 31525, loss = 232179.64 (1463843.280 sec/batch)
I also tried a Tesla K80 GPU, and after 20 hrs of training this is what the loss looks like. All parameters are the same. The worrying part is that using the GPU didn't increase the iteration rate, which means each step takes the same time on the 32-CPU machine with 50 threads as on the Tesla K80.
I definitely need some practical advice here.
Another -- and drastically better -- option is to not use VGG16. If you look at Figure 5 in this paper, you'll note that VGG16 does very badly in terms of accuracy vs. FLOPs (floating-point operations). If you need speed, MobileNet or a reduced-size ResNet will do much better. Even Inception-v2 will outperform VGG in accuracy at much lower computational cost.
This will drastically reduce your training time and memory use.
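If it helps, here is a minimal sketch of that swap, assuming tf.keras, 224x224 RGB inputs and integer class labels; the train_dataset / val_dataset names are placeholders rather than anything from your setup:

```python
# Minimal sketch: fine-tune MobileNet instead of VGG16 for a 50-class classifier.
import tensorflow as tf

# ImageNet-pretrained backbone without its classification head.
base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet", pooling="avg")
base.trainable = False  # freeze the backbone for the first stage of fine-tuning

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(50, activation="softmax"),  # 50 clothing classes
])

# Sparse cross-entropy, matching the integer-label setup in the question.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_dataset, validation_data=val_dataset, epochs=10)
```

Freezing the backbone and training only the new head first (optionally unfreezing the top few layers later) is a common fine-tuning recipe and keeps both memory use and step time low.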
Currently I am reading the following paper: "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size".
In section 4.2.3 (Activation function layer) of this paper, there is the following statement:
"The ramifications of the activation function is almost entirely constrained to the training phase, and it has little impact on the computational requirements during inference."
I understand the influence of the activation function as follows.
An activation function (ReLU, etc.) is applied to each unit of the feature map after the convolution operation, and I think that this processing is the same in both training mode and inference mode. Why, then, can we say that it has a big influence on training but not much influence on inference?
Can someone please explain this?
I think that this processing is the same in both training mode and inference mode.
You are right, the processing time of the activation function is the same.
But there is still a big difference between training time and test time:
Training time involves applying the forward pass for a number of epochs, where each epoch usually covers the whole training dataset. Even for a small dataset such as MNIST (60,000 training images), this amounts to tens of thousands of invocations. The exact runtime impact depends on a number of factors, e.g. GPUs allow a lot of computation in parallel, but in any case it is several orders of magnitude more invocations than at test time, where a single batch is usually processed exactly once.
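To make "orders of magnitude" concrete, here is a back-of-envelope sketch; the epoch count and batch size are assumptions picked purely for illustration:

```python
# Rough count of forward passes (and hence activation evaluations) during training
# vs. inference, assuming MNIST-sized data, 30 epochs and a batch size of 128.
train_images, epochs, batch_size = 60_000, 30, 128

training_batches = epochs * (train_images // batch_size)  # 30 * 468 = 14,040 batches
inference_batches = 1                                      # one batch, processed once

print(training_batches)                      # 14040
print(training_batches / inference_batches)  # ~1.4e4x more activation invocations
```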
On top of that, you shouldn't forget about the backward pass, in which the derivative of the activation is applied for the same number of epochs. For some activations the derivative is significantly more expensive, e.g. ELU vs. ReLU (the ELU derivative involves an exponential, whereas the ReLU derivative is just a comparison with zero).
In the end, you are likely to ignore a 5% slowdown at inference time, because inference of a neural network is blazingly fast anyway. But you might care about the extra minutes to hours of training for a single architecture, especially if you need to do cross-validation or hyper-parameter tuning over a number of models.
I'm doing some programming with neural network backpropagation.
I have about 90 samples, and I'm training on all of them (90 samples) and testing on the same data (90 samples). With an iteration threshold of about 2 iterations, it gives me quite a big error (about 60% MAPE / Mean Absolute Percentage Error).
I'm afraid I've got the algorithm wrong, since the only way to get the training error below the 10% threshold is to use an iteration threshold of around 3000k iterations, and that training takes quite a long time (I'm not using momentum, just a plain backpropagation neural network). But the test accuracy is around 95-99% after training under those conditions.
Is this normal? Or is my program not working as it should?
Of course it will depend on the data set used, but I wouldn't be surprised if you got an error below 1% even for highly nonlinear data (I've seen this, for example, with sales data). As long as you separate the training and test data sets, the error is expected to be higher, but with the same set it should drop to zero if there are enough hidden units. The capacity of an ANN to fit nonlinear data is huge (and, of course, the more fitted, the less general).
So, I would look for some program bug instead.
You say 3000k iterations, but I assume you mean 3k, i.e. 3000. The other answer says there might be a bug in your code, but 3000 iterations for a problem with 90 samples is definitely normal.
You cannot expect a neural network to fit a training set with just 2 iterations, especially with a low learning rate.
TL;DR - you have nothing to worry about. 3000 iterations is fine.
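To see this for yourself, here is a small sketch with scikit-learn on synthetic data of roughly your size (90 samples, trained and evaluated on the same data); the network shape and the target function are made up for illustration:

```python
# Compare 2 vs. 3000 iterations of plain SGD (no momentum) on ~90 samples,
# evaluating the training MAPE on the same data, as in the question.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(90, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 2.0  # nonlinear target, kept positive for MAPE

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

for iters in (2, 3000):
    net = MLPRegressor(hidden_layer_sizes=(20,), solver="sgd", momentum=0.0,
                       learning_rate_init=0.01, max_iter=iters, random_state=0)
    net.fit(X, y)  # a ConvergenceWarning for max_iter=2 is expected
    print(f"{iters} iterations -> training MAPE: {mape(y, net.predict(X)):.1f}%")
```

With only 2 iterations the weights have barely moved from their random initialization, so a large error is expected; after a few thousand iterations the network should fit the 90 training samples much more closely.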
I am designing a new network architecture for semantic segmentation. The training loss decreases as the number of training iterations increases. However, when I measure the testing accuracy, I get the figure below.
From 0 to 20,000 iterations the accuracy increases. However, after 20,000 iterations the testing accuracy decreases. I guess it is an overfitting issue.
I tried adding dropout to the network, but the graph trend is similar. Could you suggest a reason and a way to solve it? I don't think early stopping is a good solution. Thanks.
Be sure to shuffle your training data. You can also start with a higher learning rate (say 0.1) to get out of local minima, then decrease it to a very small value to let things settle down. To do this, set the step size to, say, 1000 iterations so that the learning rate is reduced every 1000 iterations.
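The question doesn't say which framework you use, so here is a minimal sketch of that schedule with tf.keras (in Caffe the equivalent would be the "step" lr_policy with stepsize: 1000 and gamma: 0.1):

```python
# Start at a higher learning rate (0.1) and divide it by 10 every 1000 iterations.
import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,  # higher starting LR to escape poor local minima
    decay_steps=1000,           # "step size": reduce every 1000 training iterations
    decay_rate=0.1,             # multiply the LR by 0.1 at each step
    staircase=True)             # drop in discrete steps rather than continuously

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
# model.compile(optimizer=optimizer, loss=..., metrics=[...])
# model.fit(shuffled_training_data, ...)  # remember to shuffle the training data
```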
I am trying to build a prediction model. Initially I trained a variational autoencoder and reduced the features from 2100 to 64.
Now I have (5000 x 64) samples for training and (2000 x 64) for testing. With those I tried to build a fully connected feed-forward (MLP) network, but the mean absolute error plateaus at 161 and won't go down. I tried varying all the hyper-parameters and also the hidden layers, but to no avail.
Can anyone suggest what the reason might be and how I can overcome this problem?
First of all, training a neural network can be a bit tricky; the performance of the network after training (and even the training process itself) depends on a large number of factors. Secondly, you have to be more specific about your dataset (rather, about the problem) in your question.
Just by looking at your question, what can be said is this:
What is the range of values in your data? A mean absolute error of 161 is quite high, which suggests you have large values in your data. (Try normalizing the data, i.e. subtract the mean and divide by the standard deviation of each of your features/variables; see the sketch after these points.)
How did you initialize the weights of your network? Training performance depends very much on the initial weight values; a bad initialization can lead to a poor local minimum. (Try initializing with Glorot's initialization method.)
You have reduced the dimensionality from 2100 to 64. Isn't this too much? (It might actually be fine, but it really depends on your data.)
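As a concrete sketch of the first two points (assuming scikit-learn and tf.keras; the random arrays below are only placeholders with the shapes from your question):

```python
# Standardize the 64 features and use Glorot (Xavier) initialization in the MLP.
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

# Placeholder data with the shapes from the question; replace with the real encodings.
X_train, y_train = np.random.randn(5000, 64), np.random.randn(5000)
X_test = np.random.randn(2000, 64)

# Point 1: subtract the mean and divide by the standard deviation of each feature.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit the statistics on the training set only
X_test_std = scaler.transform(X_test)        # reuse them for the test set

# Point 2: Glorot initialization for the weights (the tf.keras default for Dense layers).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_initializer="glorot_uniform", input_shape=(64,)),
    tf.keras.layers.Dense(64, activation="relu", kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(1),  # regression output
])
model.compile(optimizer="adam", loss="mae")
# model.fit(X_train_std, y_train, validation_split=0.1, epochs=100)
```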
I want to train CaffeNet on the MNIST dataset in Caffe. However, I noticed that after 100 iterations the loss just slightly dropped (from 2.66364 to 2.29882).
However, when I use LeNet on MNIST, the loss goes from 2.41197 to 0.22359 after 100 iterations.
Does this happen because CaffeNet has more layers and therefore needs more training time to converge? Or is it due to something else? I made sure the solver.prototxt files of the two nets were the same.
While I know 100 iterations is extremely short (CaffeNet usually trains for ~300-400k iterations), I find it odd that LeNet is able to get a loss so small, so soon.
I am not familiar with the architecture of these nets, but in general there are several possible reasons:
1) One of the nets is really much more complicated.
2) One of the nets was trained with a bigger learning rate.
3) Or maybe one of them was trained with momentum while the other wasn't.
4) It's also possible that both use momentum during training, but one of them has a bigger momentum coefficient.
Really, there are tons of possible explanations for that; the toy sketch below shows how points 2)-4) alone can change how fast the loss falls.
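Here is a toy numerical sketch of how points 2)-4) play out; the 1-D quadratic objective is made up purely for illustration and is not either network's real loss:

```python
# Minimize f(w) = w^2 (gradient 2w) with SGD, varying learning rate and momentum.
def loss_after(lr, momentum, steps=20, w0=5.0):
    w, v = w0, 0.0
    for _ in range(steps):
        v = momentum * v - lr * (2.0 * w)  # velocity accumulates past gradients
        w = w + v
    return w * w  # loss value after `steps` updates

print(loss_after(lr=0.01, momentum=0.0))  # ~11   : small LR, loss drops slowly
print(loss_after(lr=0.10, momentum=0.0))  # ~0.003: bigger LR, much faster drop
print(loss_after(lr=0.01, momentum=0.9))  # ~1.8  : same small LR, momentum speeds it up
```

The same model and data can therefore appear to converge quickly or barely move at all depending only on these solver settings.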