Reusing Inception V3 Conv Neural Net (tensorflow) with 0% accuracy - machine-learning

EDIT1: My code is the same as here, https://github.com/tensorflow/models/blob/master/inception/inception. The only difference is that I pack my files into TFRecords and feed them batch-wise. Also, the ratio of Class 0 : Class 1 is 70:30.
I'm currently working on a project in which I'm using the Inception-V3 CNN model to train a classifier. Currently, I am working on a binary classifier (predict either 1 or 0), but my model only predicts class 0 for everything. While troubleshooting I've found that the predicted probability is 100% for class 0 every time. I have verified everything from the input queuing system to the eval and testing code, and everything seems to be working well.
Strangely, the loss value decreases in a perfect semi-parabolic fashion, which makes me think that the loss has converged to a local minimum. Upon testing, the script only churns out class 0 (with 100% probability) each time. Another thing I've noticed is that the activations across the various conv layers are always constant, which could imply that the neurons are just not firing at all.
My question is,
1. Is my model working? The loss seems to converge, but the activations across the various layers seem stagnant.
2. I am using the training code available from the models section of TensorFlow (https://github.com/tensorflow/models/blob/master/inception/inception/inception_train.py).
I am reusing the train, eval and supporting code to train my model with a custom input pipeline created by me (which is also working). Can someone help guide me in the right direction on this?
Thanks.

I know I am a little late in answering this question 😅
First of all, your model didn't learn anything at all. All it did (cleverly 😂) was predict class 0 for every case, so that it achieves a baseline accuracy of 70% without any effort. (Probably the model was lazy 😪😋) JK. This is a very well-known problem in machine learning, called the class imbalance problem. See http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/.
Apart from the techniques mentioned there, the one technique that works wonders is using class weights: basically telling the network to be biased towards the weaker class. In your case the class weights would be class 0 : class 1 = 3:7. This is a hyperparameter too, but it is a good starting point (see the sketch below).
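A minimal sketch of class weighting, assuming a tf.keras setup rather than the models/inception training script from the question; x_train / y_train are placeholders for your images and integer labels (0 or 1):
```python
import tensorflow as tf

# Inception V3 with a fresh 2-class head (no pretrained weights here).
model = tf.keras.applications.InceptionV3(weights=None, classes=2)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Weight each class inversely to its frequency (class 0 is 70% of the data,
# class 1 is 30%), i.e. the 3:7 ratio mentioned above.
class_weight = {0: 3.0, 1: 7.0}

# x_train / y_train are hypothetical placeholders for your own data:
# model.fit(x_train, y_train, epochs=10, class_weight=class_weight)
```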
Moreover, you didn't give any info about your dataset size, or whether you are fine-tuning or training from scratch. Without that it's hard to speculate. By default I would suggest fine-tuning.
Also, by loss do you mean the training loss or the validation loss? Training loss alone says very little about the performance of the model, and in my opinion even training and validation losses together give limited insight into the model's performance. Use other metrics like the confusion matrix, F1 score, recall, precision, etc.
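For example, a minimal scikit-learn sketch (y_true and y_pred are toy placeholders for your labels and the model's predictions):
```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 0, 1, 1, 0, 1]   # toy ground-truth labels
y_pred = [0, 0, 0, 0, 0, 0, 0]   # a model that only predicts class 0

print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, zero_division=0))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```
These make the "always predicts class 0" failure obvious even when accuracy looks respectable.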
Finally, there is absolutely no single answer to your question. The only way is the hard way: you will learn along with the model 😉. I consider training a NN, especially a CNN, an art in which intuition plays a crucial role, because most of the time the least expected changes give the best results. Anyway, that's the fun part of training a NN.
Happy training 💪
P.S.: Try using visualisation tools like Grad-CAM to check whether the model is looking at the correct part of the image for classification. This is very important!
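A minimal Grad-CAM sketch, assuming a tf.keras model; the layer name "mixed10" (the last mixed block of Keras' InceptionV3) and the preprocessed_image input are assumptions, not part of the original question:
```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name="mixed10", class_index=None):
    # Map the input image to the last conv feature map and the predictions.
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output])

    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]

    # Gradient of the class score w.r.t. the conv feature map, averaged over
    # the spatial dimensions to get one weight per channel.
    grads = tape.gradient(class_score, conv_out)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))

    # Channel-weighted sum of the feature maps, ReLU, then normalise to [0, 1].
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

# Example (hypothetical inputs):
# model = tf.keras.applications.InceptionV3(weights="imagenet")
# heatmap = grad_cam(model, preprocessed_image)  # resize and overlay on the image
```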

Related

Where do # of epochs and batch size belong in the hyperparameter tuning process?

I'm fairly new to machine learning, and working on optimizing hyperparameters for my model. I'm doing this via a randomized search. My question is: should I be searching over # of epochs and batch size along with my other hyperparameters (e.g. loss function, number of layers, etc.)? If not, should I fix these values first, find the other parameters, then return to tune these?
My concern is a) that searching over many epochs will be extremely time-consuming, so leaving it at one low value for the initial scan would be useful and b) that these parameters, esp. # of epochs, will disproportionately affect the results when the model is behaving well, and won't really give me much information about the rest of my architecture, as there should be a regime where more epochs, up to a point, are better. I know this isn't totally accurate, i.e. # of epochs is a real hyperparameter and too many can lead to overfitting issues, for example. Currently, my model is not clearly improving with # of epochs, though it was suggested by someone working on a similar problem within my area of research that this may be mitigated by implementing batch normalization, which is another parameter I am testing. Finally, I am worried that batch size will be quite affected by the fact that I am scaling my data down to 60% to allow my code to run reasonably (and I think the final model will be trained on vastly more data than the simulated data currently available to me).
I agree with your intuition on epochs. It is common to keep this value as low as possible in order to complete more training "experiments" in the same number of working hours. I don't have a great reference here, but I would welcome one in the comments.
For almost everything else, there is a paper by Leslie N. Smith that I can't recommend enough, A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay.
As you can see, batch size is included but epochs are not. You will also notice that the model architecture is not included (number of layers, layer size, etc). Neural Architecture Search is a huge research field in its own right, separate from hyper-parameter tuning.
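As a rough illustration, here is a sketch of a randomized search that includes batch size but not the number of epochs: a low epoch cap is fixed and early stopping decides when to quit. build_model and the x/y arrays are hypothetical placeholders, not from the original question.
```python
import random
import tensorflow as tf

search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [16, 32, 64, 128],
    "dropout": [0.0, 0.25, 0.5],
}

results = []
for _ in range(20):                                   # 20 random trials
    params = {k: random.choice(v) for k, v in search_space.items()}
    model = build_model(dropout=params["dropout"])    # hypothetical helper
    model.compile(optimizer=tf.keras.optimizers.Adam(params["learning_rate"]),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True)
    history = model.fit(x_train, y_train,             # hypothetical data
                        validation_data=(x_val, y_val),
                        batch_size=params["batch_size"],
                        epochs=30,                     # low, fixed cap
                        callbacks=[early_stop],
                        verbose=0)
    results.append((min(history.history["val_loss"]), params))

print(min(results, key=lambda r: r[0]))               # best trial
```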
As for the loss function, I can't think of any reason to "tune" that except in the context of an Auxiliary Loss for training only, which I suspect is not what you are talking about.
The loss function that will be applied to your validation or test set is part of the problem statement. That, along with the data, defines the problem you are solving. You don't change it by tuning; you change it by convincing a product manager that your alternative is better for the business need.

How to overcome overfitting in CNN - standard methods don't work

I've been recently playing around with car data set from Stanford (http://ai.stanford.edu/~jkrause/cars/car_dataset.html).
From the very beginning I had an overfitting problem so decided to:
Add regularization (L2, dropout, batch norm, ...)
Tried different architectures (VGG16, VGG19, InceptionV3, DenseNet121, ...)
Tried transfer learning using models trained on ImageNet
Used data augmentation
Every step moved me a little bit forward. However, I finished with 50% validation accuracy (having started below 20%) compared to 99% train accuracy.
Do you have an idea of what more I can do to get to around 80-90% accuracy?
Hope this can help some people! :)
Things you should try include:
Early stopping, i.e. use a portion of your data to monitor validation loss and stop training if performance does not improve for some epochs.
Check whether you have unbalanced classes, use class weighting to equally represent each class in the data.
Regularization parameter tuning: different l2 coefficients, different dropout values, different regularization constraints (e.g. l1).
Another general suggestion is to try to replicate state-of-the-art models on this particular dataset and see whether they perform as they should.
Also make sure to have all implementation details ironed out (e.g. convolution is being performed along width and height, and not along the channels dimension - this is a classic rookie mistake when starting out with Keras, for instance).
It would also help to have some more details on the code that you are using, but for now these suggestions will do.
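For concreteness, here is a minimal tf.keras sketch combining several of the suggestions above (transfer learning with a frozen base, dropout, L2 regularization, early stopping); train_ds / val_ds are placeholders for your own data pipeline, not part of the original question.
```python
import tensorflow as tf

# Pretrained Inception V3 as a frozen feature extractor.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet",
    input_shape=(299, 299, 3), pooling="avg")
base.trainable = False                        # unfreeze later for fine-tuning

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(
        196, activation="softmax",            # 196 classes in the Stanford cars set
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# train_ds / val_ds are hypothetical tf.data pipelines:
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])
```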
50% accuracy on a 200-class problem doesn't sound so bad anyway.
Cheers
For those who encounter the same problem: I managed to get 66.11% accuracy, mainly by playing with dropout, the learning rate and learning rate decay.
The best results were achieved on VGG16 architecture.
The model is on https://github.com/michalgdak/car-recognition

Neural Network gets stuck

I am experimenting with classification using neural networks (I am using tensorflow).
And unfortunately the training of my neural network gets stuck at 42% accuracy.
I have 4 classes, into which I try to classify the data.
And unfortunately, my data set is not well balanced, meaning that:
43% of the data belongs to class 1 (and yes, my network gets stuck predicting only this)
37% to class 2
13% to class 3
7% to class 4
The optimizer I am using is AdamOptimizer and the cost function is tf.nn.softmax_cross_entropy_with_logits.
I was wondering if the reason for my training getting stuck at 42% is really the fact that my data set is not well balanced, or because the nature of the data is really random, and there are really no patterns to be found.
Currently my NN consists of:
input layer
2 convolution layers
7 fully connected layers
output layer
I tried changing the structure of the network, but the result is always the same.
I also tried Support Vector Classification, and the result is pretty much the same, with small variations.
Did somebody else encounter similar problems?
Could anybody please provide me some hints how to get out of this issue?
Thanks,
Gerald
I will assume that you have already double, triple and quadruple checked that the data going in is matching what you expect.
The question is quite open-ended, and even a topic for research. But there are some things that can help.
In terms of better training, there are two common ways in which people train neural networks on an unbalanced dataset.
Oversample the examples with lower frequency, such that the proportion of examples for each class that the network sees is equal. e.g. in every batch, enforce that 1/4 of the examples are from class 1, 1/4 from class 2, etc.
Weight the error for misclassifying each class inversely to its proportion, e.g. incorrectly classifying an example of class 1 is worth 100/43, while incorrectly classifying an example of class 4 is worth 100/7 (see the sketch below).
That being said, if your learning rate is good, neural networks will often eventually (after many hours of just sitting there) jump out of only predicting for one class, but they still rarely end well with a badly skewed dataset.
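A minimal sketch of that weighting, assuming TF2 and integer labels (the question's tf.nn.softmax_cross_entropy_with_logits takes one-hot labels; the sparse variant is used here for brevity); the weights come from the class proportions above:
```python
import tensorflow as tf

# Inverse-frequency weights for the 43% / 37% / 13% / 7% split.
class_weights = tf.constant([100/43, 100/37, 100/13, 100/7])

def weighted_cross_entropy(labels, logits):
    # labels: shape (batch,), integer class ids; logits: shape (batch, 4)
    per_example = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    weights = tf.gather(class_weights, labels)   # one weight per example
    return tf.reduce_mean(per_example * weights)

# toy check
logits = tf.random.normal((8, 4))
labels = tf.constant([0, 0, 1, 1, 2, 3, 0, 1])
print(weighted_cross_entropy(labels, logits).numpy())
```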
If you want to know whether or not there are patterns in your data which can be determined, there is a simple way to do that.
Create a new dataset by randomly selecting elements from each of your classes such that you have an equal number of each (i.e. if there are 700 examples of class 4, then construct a dataset by randomly selecting 700 examples from every class).
Then you can use all of your techniques on this new dataset.
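A minimal NumPy sketch of building such a balanced subset (keep as many examples per class as the rarest class has); x and y are placeholders for your features and integer labels:
```python
import numpy as np

def balanced_subset(x, y, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()                              # size of the rarest class
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes])
    rng.shuffle(keep)
    return x[keep], y[keep]

# toy check: 4 classes with a 43/37/13/7 split
y = np.repeat([0, 1, 2, 3], [430, 370, 130, 70])
x = np.arange(len(y)).reshape(-1, 1)
xb, yb = balanced_subset(x, y)
print(np.bincount(yb))                            # 70 examples of each class
```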
Although, this paper suggests that even with random labels, the network should be able to find some pattern that it can fit.
Firstly, you should check whether your model is overfitting or underfitting; both can cause low accuracy. Check the accuracy on both the training set and the dev set: if accuracy on the training set is much higher than on the dev/test set, the model may be overfitting, and if accuracy on the training set is as low as on the dev/test set, it could be underfitting.
As for overfitting, more data or a simpler model may help, while making your model more complex and training longer may solve an underfitting problem.

Using Convolution Neural network as Binary classifiers

Given any image, I want my classifier to tell whether it is a sunflower or not. How can I go about creating the second class? Keeping the set of all possible images minus {Sunflower} in the second class is overkill. Is there any research in this direction? Currently my classifier uses a neural network in the final layer. I have based it upon the following tutorial:
https://github.com/torch/tutorials/tree/master/2_supervised
I am taking 254x254 images as the input.
Would an SVM help in the final layer? Also, I am open to using any other classifier/features that might help me with this.
The standard approach in ML is that:
1) Build model
2) Try to train on some data with positive/negative examples (start with a 50/50 split of pos/neg in the training set)
3) Validate it on a test set (again, try a 50/50 split of pos/neg examples in the test set)
If results not fine:
a) Try different model?
b) Get more data
For case #b, when deciding which additional data you need the rule of thumb which works for me nicely would be:
1) If the classifier gives lots of false positives (says it is a sunflower when it is actually not a sunflower at all) - get more negative examples
2) If the classifier gives lots of false negatives (says it is not a sunflower when it actually is a sunflower) - get more positive examples
Generally, start with some reasonable amount of data and check the results; if results on the train set or test set are bad, get more data. Stop getting more data once the results stop improving.
And another thing you need to consider: if your results with the current data and current classifier are not good, you need to understand whether the problem is high bias (bad results on the train set and the test set) or high variance (nice results on the train set but bad results on the test set). If you have a high bias problem, more data or a more powerful classifier will definitely help. If you have a high variance problem, a more powerful classifier is not needed and you need to think about generalization - introduce regularization, or maybe remove a couple of layers from your ANN. Another possible way of fighting high variance is getting much, MUCH more data.
So to sum up, you need to use iterative approach and try to increase the amount of data step by step, until you get good results. There is no magic stick classifier and there is no simple answer on how much data you should use.
It is a good idea to use a CNN as the feature extractor: peel off the original fully connected layer that was used for classification and add a new classifier. This is also known as the transfer learning technique that has been widely used in the deep learning research community. For your problem, using a one-class SVM as the added classifier is a good choice.
Specifically,
a good CNN feature extractor can be trained on a large dataset, e.g. ImageNet,
the one-class SVM can then be trained using your 'sunflower' dataset.
The essential part of solving your problem is the implementation of the one-class SVM, which is also known as anomaly detection or novelty detection. You may refer to http://scikit-learn.org/stable/modules/outlier_detection.html for some insights into the method.
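A minimal sketch of that pipeline, assuming tf.keras and scikit-learn rather than the Torch tutorial from the question; sunflower_images and new_images are hypothetical arrays of shape (N, 299, 299, 3):
```python
import tensorflow as tf
from sklearn.svm import OneClassSVM

# InceptionV3 without its classification head; global average pooling
# yields a 2048-dim feature vector per image.
extractor = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
preprocess = tf.keras.applications.inception_v3.preprocess_input

def features(images):
    return extractor.predict(preprocess(images.astype("float32")))

# Train the novelty detector on positive (sunflower) examples only:
# feats = features(sunflower_images)                        # hypothetical data
# oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(feats)

# At test time, +1 means "looks like a sunflower", -1 means "not a sunflower":
# print(oc_svm.predict(features(new_images)))
```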

Things to try when Neural Network not Converging

One of the most popular questions regarding Neural Networks seem to be:
Help!! My Neural Network is not converging!!
See here, here, here, here and here.
So, after eliminating any error in the implementation of the network, what are the most common things one should try?
I know that the things to try would vary widely depending on network architecture.
But tweaking which parameters (learning rate, momentum, initial weights, etc) and implementing what new features (windowed momentum?) were you able to overcome some similar problems while building your own neural net?
Please give answers which are language-agnostic if possible. This question is intended to give some pointers to people stuck with neural nets which are not converging.
If you are using ReLU activations, you may have a "dying ReLU" problem. In short, under certain conditions, any neuron with a ReLU activation can be subject to a (bias) adjustment that leads to it never being activated ever again. It can be fixed with a "Leaky ReLU" activation, well explained in that article.
For example, I produced a simple MLP (3-layer) network with ReLU output which failed. I provided data it could not possibly fail on, and it still failed. I turned the learning rate way down, and it failed more slowly. It always converged to predicting each class with equal probability. It was all fixed by using a Leaky ReLU instead of standard ReLU.
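A minimal sketch of that swap, assuming tf.keras; the layer sizes are arbitrary:
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128),
    tf.keras.layers.LeakyReLU(),   # small negative slope instead of plain ReLU
    tf.keras.layers.Dense(128),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```
Unlike a standard ReLU, the leaky variant keeps a non-zero gradient for negative pre-activations, so a unit pushed into the negative region can still recover.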
If we are talking about classification tasks, then you should shuffle the examples before training your net. I mean, don't feed your net thousands of examples of class #1 followed by thousands of examples of class #2, etc. If you do that, your net most probably won't converge, but will tend to predict the last class it was trained on.
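A minimal shuffling sketch, assuming the data lives in NumPy arrays (with tf.data you would call dataset.shuffle(buffer_size=...) instead); x_train / y_train are placeholders:
```python
import numpy as np

def shuffled(x, y, seed=0):
    perm = np.random.default_rng(seed).permutation(len(y))
    return x[perm], y[perm]

# x_train, y_train = shuffled(x_train, y_train)
```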
I had faced this problem while implementing my own back prop neural network. I tried the following:
Implemented momentum (and kept the value at 0.5)
Kept the learning rate at 0.1
Charted the error, weights, input and output of each and every neuron; seeing the data as a graph is more helpful in figuring out what is going wrong
Tried out different activation functions (all sigmoid), but this did not help me much
Initialized all weights to random values between -0.5 and 0.5 (My network's output was in the range -1 and 1)
I did not try this, but gradient checking can be helpful as well (see the sketch below)
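A minimal numerical gradient-check sketch with NumPy, assuming you can evaluate the loss as a function of a flat parameter vector; loss_fn, params and backprop_gradient are hypothetical placeholders:
```python
import numpy as np

def numerical_gradient(f, params, eps=1e-5):
    # Central-difference estimate of df/dparams, one coordinate at a time.
    grad = np.zeros_like(params, dtype=float)
    for i in range(params.size):
        bump = np.zeros_like(params, dtype=float)
        bump[i] = eps
        grad[i] = (f(params + bump) - f(params - bump)) / (2 * eps)
    return grad

# Compare against the backprop gradient (both hypothetical here):
# analytic = backprop_gradient(params)
# numeric = numerical_gradient(loss_fn, params)
# print(np.max(np.abs(analytic - numeric)))   # should be tiny
```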
If the problem is only convergence (not the actual "well-trained network", which is way too broad a problem for SO), then the only thing that can be the problem once the code is OK is the training method parameters. If one uses naive backpropagation, then these parameters are the learning rate and momentum. Nothing else matters, since for any initialization and any architecture, a correctly implemented neural network should converge for a good choice of these two parameters (in fact, for momentum = 0 it should converge to some solution too, for a small enough learning rate).
In particular, there is a good heuristic approach called "resilient backprop" (Rprop), which is in fact a parameterless approach that should (almost) always converge (assuming a correct implementation).
After you've tried different meta-parameters (optimization / architecture), the most probable place to look is THE DATA.
As for myself, to minimize fiddling with meta-parameters I keep my optimizer automated - Adam is my optimizer of choice.
There are some rules of thumb regarding application vs. architecture... but it's really best to work those out on your own.
To the point:
In my experience, after you've debugged the net (the easy debugging) and it still doesn't converge or gets to an undesired local minimum, the usual suspect is the data.
Whether you have contradictory samples or just incorrect ones (outliers), a small number of them can make the difference between, say, 0.6 accuracy and (after cleaning) 0.9 accuracy.
A smaller but golden (clean) dataset is much better than a big, slightly dirty one...
With augmentation you can push the results even further.

Resources