Deep Neural Network - Order of the Parameters to tune - machine-learning

I am new to this DNN field and I am fed up with tunning hyperparameters and other parameters in a DNN cause there are a lot of parameters to tune and it is like a multivariable analysis without the help of a computer. How human can move towards the highest accuracy that can be achieved for a task using DNN due to the huge number of variables inside a DNN. And how will we know what accuracy is possible to get by using DNN or do I have to give up on DNN? I am lost. Help is appreciated.
Main problems I have :
1. What are the limits of DNN / when we have to give up on DNN
2. What is the proper way of tunning without missing good parameter values
Here is the summary I got by learning theory in this field. Corrections are much appreciated if I am wrong or misunderstood. You can add anything I missed. Sorted by the importance according to my knowledge.
for overfitting -
1. reduce the number of layers
2. reduce the number of nodes of layers
3. add regularizers (l1/ l2/ l1-l2) - have to decide the factors
4. add dropout layers and -have to decide the dropout factor
5. reduce batch size
6. stop earlier
for underfitting
1. increase the number of layers
2. increase number of nodes of layers
3. Add different types of layers (Conv, LSTM, ...)
4. add learning rate decay (decide the type and parameters for the type)
5. reduce the learning rate
other than that generally we can do,
1. number of epochs (by seeing what is happening while model training)
2. Adjust Learning Rate
3. batch normalization -for fast learning
4. initializing techniques (zero/ random/ Xavier / he)
5. different optimization algorithms
auto tunning methods
- Gridsearchcv - but for this, we have to choose what we want to change and it takes a lot of time.

Short Answer: You should experiment a lot!
Long Answer: At first, you may be overwhelmed by having plenty of knobs that you can tweak, but you gradually become experienced. A very quick way to gain some intuition on how you should tune the hyperparameters of your model is trying to replicate what other researchers have published. By replicating the results (and trying to improve the state-of-the-art), you acquire the intuition about deep learning.
I, personally, follow no particular order in tuning the hyperparameters of the model. Instead, I try to implement a dirty model and try to improve it. For instance, if I see that there are overshoots in validation accuracy, which might be an indicator of the fact that the model is bouncing around the sweet spot, I divide the learning rate by ten and see how it goes. If I see the model begins to overfit, I use early stopping to save the best parameters before overfitting. I also play with dropout rates and weight decay to find the best combination of them in order to have the model fit enough while maintaining the regularization effect. And so on.
To correct some of your assumptions, adding different types of layers will not necessarily help your model not to overfit. Moreover, sometimes (especially when using transfer learning, which is a trend these days), you cannot simply add a convolutional layer to your neural network.
Assuming you are dealing with computer vision tasks, Data Augmentation is another useful approach to increase the amount of available data to train your model and perform its performance.
Also, note that Batch Normalization also has a regularization effect. Weight Decay is another implementation of l2 regularization that is widely used.
Another interesting technique that can improve the training of neural networks is the One Cycle policy for learning rate and momentum (if applicable). Check this paper out: https://doi.org/10.1109/WACV.2017.58

Related

How to deal with dataset of different features?

I am working to create an MLP model on a CEA Classification Dataset (Binary Classification). Each sample contains different 4 features, such as resistance and other values, each in its own range (resistance in hundreds, another in micros, etc.). I am still new to machine learning and this is the first real model to build. How can I deal with such data? I have tried feeding each sample to the neural network with a sigmoid activation function, but I am not getting accurate results. My assumption to deal with this kind of data is to scale it? If so, what are some resources which are useful to look at, since I do not quite understand when is scaling required.
Scaling your data can be an important step in building a machine-learning model, especially when working with neural networks. Scaling can help to ensure that all of the features in your dataset are on a similar scale, which can make it easier for the model to learn.
There are a few different ways to scale your data, such as normalization and standardization. Normalization is the process of scaling the data so that it has a minimum value of 0 and a maximum value of 1. Standardization is the process of scaling the data so that it has a mean of 0 and a standard deviation of 1.
When working with your CEA Classification dataset, it might be helpful to try both normalization and standardization to see which one works better for your specific dataset. You can use scikit-learn library's preprocessing functions like MinMaxScaler() and StandardScaler() for normalization and standardization respectively.
Additionally, it might be helpful to try different activation functions, such as ReLU or LeakyReLU, to see if they lead to more accurate results. Also, you can try adding more layers and neurons in your neural network to see if it improves the performance.
It's also important to remember that feature engineering, which includes the process of selecting the most important features, can be more important than scaling.

How to overcome overfitting in CNN - standard methods don't work

I've been recently playing around with car data set from Stanford (http://ai.stanford.edu/~jkrause/cars/car_dataset.html).
From the very beginning I had an overfitting problem so decided to:
Add regularization (L2, dropout, batch norm, ...)
Tried different architectures (VGG16, VGG19, InceptionV3, DenseNet121, ...)
Tried trasnfer learning using models trained on ImageNet
Used data augmentation
Every step moved me a little bit forward. However I finished with 50% validation accuracy (started below 20%) compared to 99% train accuracy.
Do you have an idea what more can I do to get to around 80-90% accuracy?
Hope this can help some people!:)
Things you should try include:
Early stopping, i.e. use a portion of your data to monitor validation loss and stop training if performance does not improve for some epochs.
Check whether you have unbalanced classes, use class weighting to equally represent each class in the data.
Regularization parameter tuning: different l2 coefficients, different dropout values, different regularization constraints (e.g. l1).
Other general suggestions may be to try and replicate the state of the art models on this particular dataset, see if those perform as they should.
Also make sure to have all implementation details ironed out (e.g. convolution is being performed along width and height, and not along the channels dimension - this is a classic rookie mistake when starting out with Keras, for instance).
It would also help to have some more details on the code that you are using, but for now these suggestions will do.
50% accuracy on a 200-class problem doesn't sound so bad anyway.
Cheers
For those who encounter the same problem I managed to get 66,11% accuracy by playing with drop out, learning rate and learning decay mainly.
The best results were achieved on VGG16 architecture.
The model is on https://github.com/michalgdak/car-recognition

Determining number of epochs for model fitting in Keras

I'm trying to automatically determine when a Keras autoencoder converges. For example, look at this link under "Let's build the simplest autoencoder possible." The number of epochs is hardcoded at 50 (when the loss value converges). However, how would you code this using Keras if you didn't know the number was 50? Would you just keep calling fit()?
This question is actually ridiculously wide and hard. There are many techniques on how to set the number of epochs:
Early stopping- in this case you set the number of epochs to a really high number and you turn off the training when the improvement over next epochs is not satisfying. In Keras you have a special object called EarlyStopping which does the job for you.
Model Checkpoint - here you once again set up a really high number of epochs and you simply save only the best model w.r.t. to a metric chosen. Once again you have a special callback for this scenario.
Of course, there are other scenarios like e.g. using Reinforcement learning to find the stopping time or more complexed scenarios when you choose this in a Bayesian hyperparameter set up but those are much harder methods which are often not introducing any improvement.
One sure thing is that restarting a fit method might end up in unexpected behaviour as many inner states of a model are reset which could cause instability. For this scenario I strongly advise you to use train_on_batch which is not resetting model states and makes a lot of fancy training scenarios possible.

when to apply feature selection

I am developing a software used to automate machine learning .
I have observed in some of the datasets with less number of features (4,5),if we apply feature selection and consequently my classifiers models the performance actually decreases(due to the loss of information)... But in cases of datasets with larger number of features if we apply feature selection the performance actually improves.......
So I am looking for some heurestic so as to determine whether to apply feature selection or not ?
Is there any paper /work which addresses this issue ?When to apply feature selection and when not to ?
There are quite a few heuristics. I don't know a single paper or source that addresses them all in a trivial amount of time.
When you say 'performance' I'm assuming you're referring to the accuracy of prediction for your test data set by your model which has been trained and cross validated by a training data set and cross validation data set.
There are a large number of ML algorithms as well, feature selection may not affect them all the same. Which are you using?
For example Applying feature selection for a Neural Network will result in changes that affect the Bias and Variance of you model which in turn will affect the accuracy of prediction on the test set:
too many features may result in overfitting (depending on sample training size) due to high varience
too few you may end up underfitting or high bias (regardless of sample training size)
Either will cause prediction on test sets to suffer. Also, accuracy alone isn't enough when 'tuning' a models (figuring out feature, degrees, regularization lambda's, etc...) To figure out what's best what you'll need to look at is the precision and recall of your model.
Unfortunately, there's no quick-and-easy way I can explain in a short SO answer in detail what you need to do to optimize your model.
I suggest you spend the time to take something like Andrew Ng's intro to machine learning course https://www.coursera.org/learn/machine-learning/home/welcome. Chapter 6 discusses how to determine how to optimize NN model.

How does pre-training improve classification in neural networks?

Many of the papers I have read so far have this mentioned "pre-training network could improve computational efficiency in terms of back-propagating errors", and could be achieved using RBMs or Autoencoders.
If I have understood correctly, AutoEncoders work by learning the
identity function, and if it has hidden units less than the size of
input data, then it also does compression, BUT what does this even have
anything to do with improving computational efficiency in propagating
error signal backwards? Is it because the weights of the pre
trained hidden units does not diverge much from its initial values?
Assuming data scientists who are reading this would by theirselves
know already that AutoEncoders take inputs as target values since
they are learning identity function, which is regarded as
unsupervised learning, but can such method be applied to
Convolutional Neural Networks for which the first hidden layer is
feature map? Each feature map is created by convolving a learned
kernel with a receptive field in the image. This learned kernel, how
could this be obtained by pre-training (unsupervised fashion)?
One thing to note is that autoencoders try to learn the non-trivial identify function, not the identify function itself. Otherwise they wouldn't have been useful at all. Well the pre-training helps moving the weight vectors towards a good starting point on the error surface. Then the backpropagation algorithm, which is basically doing gradient descent, is used improve upon those weights. Note that gradient descent gets stuck in the closes local minima.
[Ignore the term Global Minima in the image posted and think of it as another, better, local minima]
Intuitively speaking, suppose you are looking for an optimal path to get from origin A to destination B. Having a map with no routes shown on it (the errors you obtain at the last layer of the neural network model) kind of tells you where to to go. But you may put yourself in a route which has a lot of obstacles, up hills and down hills. Then suppose someone tells you about a route a a direction he has gone through before (the pre-training) and hands you a new map (the pre=training phase's starting point).
This could be an intuitive reason on why starting with random weights and immediately start to optimize the model with backpropagation may not necessarily help you achieve the performance you obtain with a pre-trained model. However, note that many models achieving state-of-the-art results do not use pre-training necessarily and they may use the backpropagation in combination with other optimization methods (e.g. adagrad, RMSProp, Momentum and ...) to hopefully avoid getting stuck in a bad local minima.
Here's the source for the second image.
I don't know a lot about autoencoder theory, but I've done a bit of work with RBMs. What RBMs do is they predict what the probability is of seeing the specific type of data in order to get the weights initialized to the right ball park- it is considered an (unsupervised) probabilistic model, so you don't correct using the known labels. Basically, the idea here is that having a learning rate that is too big will never lead to convergence but having one that is too small will take forever to train. Thus, by "pretraining" in this way you find out the ball park of the weights and then can set the learning rate to be small in order to get them down to the optimal values.
As for the second question, no, you don't generally prelearn kernels, at least not in an unsupervised fashion. I suspect that what is meant by pretraining here is a bit different than in your first question- this is to say, that what is happening is that they are taking a pretrained model (say from model zoo) and fine tuning it with a new set of data.
Which model you use generally depends on the type of data you have and the task at hand. Convnets I've found to train faster and efficiently, but not all data has meaning when convolved, in which case dbns may be the way to go. Unless say, you have a small amount of data then I'd use something other than neural networks entirely.
Anyways, I hope this helps clear some of your questions.

Resources