I know what a dropout ratio is and how it helps, but I have two questions:
First, I know that it switches off some of the neurons, but I don't understand how that helps increase accuracy.
Second, I have seen people apply very large dropout ratios and sometimes very low ones, which confuses me: is there any ideal value for the dropout ratio?
There is no single dropout ratio that you can use everywhere; it depends on your model and a number of trial-and-error experiments. The value should not be too high, otherwise your model will not train properly, and it should not be too low, otherwise the model will still overfit.
As for your first question: turning off some of the neurons means the network cannot rely too heavily on any particular ones, which prevents overfitting. Overfitting occurs when the model learns your sample data too closely and fails to generalize, so reducing it usually improves accuracy on unseen data.
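For illustration, here is a minimal sketch (assuming tf.keras; the layer sizes and the 0.3 rate are placeholders, not recommendations) of where dropout layers sit in a model and how the rate is just another hyperparameter to tune:

```python
# Minimal sketch, assuming tf.keras; layer sizes and dropout rates are placeholders.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),   # randomly zero 30% of this layer's outputs each training step
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),   # try rates roughly in the 0.2-0.5 range and compare validation loss
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Note that dropout is only active during training; at inference time all neurons are used.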
I am doing deep learning using a multi-layer perceptron for regression. The loss curve flattens out by the third epoch, while the accuracy curve stays flat from the beginning. I wonder whether this makes sense.
Since you didn't provide the code, it is hard to narrow down what the problem is. That being said, here are some pointers that might help you find it:
Your validation set is either too small or a bad representation of your training set. (Bear in mind that if you are using validation_split in the fit function, Keras will only take the last percentage of your training set and will keep it the same for all epochs, so it may not be representative.)
You are not using any regularization (dropout, weight penalties, constraints); see the sketch after this list.
The model could be too small (layers- and neurons-wise), so it is underfitting.
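Here is the sketch mentioned above: a minimal example (assuming tf.keras and synthetic placeholder data) of passing an explicit, shuffled validation set instead of relying on validation_split, plus a dropout layer for some regularization:

```python
# Minimal sketch, assuming tf.keras; the data here is synthetic placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 10).astype("float32")   # placeholder features
y = np.random.rand(1000, 1).astype("float32")    # placeholder regression targets

# Shuffle before splitting so the validation set is representative,
# unlike validation_split, which always takes the last slice of the data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),                           # some regularization (pointer 2)
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=5, batch_size=32,
          validation_data=(X_val, y_val))          # explicit validation set (pointer 1)
```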
Hope these pointers help you with your problem.
I'm using batch normalization with a batch size of 10 for face detection, and I wanted to know whether it is better to remove the batch norm layers or keep them.
And if it is better to remove them, what can I use instead?
This question depends on a few things, the first being the depth of your neural network. Batch normalization is useful for speeding up training when there are a lot of hidden layers. It can decrease the number of epochs it takes to train your model and help regularize your data. By standardizing the inputs to each layer, you reduce the risk of chasing a 'moving target', a situation in which your learning algorithm does not perform as well as it could.
My advice would be to include batch normalization layers in your code if you have a deep neural network. As a reminder, you should probably include some dropout in your layers as well.
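For example, here is a minimal sketch (assuming tf.keras; the architecture is a placeholder, not your face-detection model) of interleaving batch normalization and dropout in a deeper network:

```python
# Minimal sketch, assuming tf.keras; the architecture is a placeholder.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3, padding="same"),
    layers.BatchNormalization(),        # standardize the activations fed to the nonlinearity
    layers.Activation("relu"),
    layers.Conv2D(64, 3, padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),                # some dropout as well
    layers.Dense(1, activation="sigmoid"),
])
```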
Let me know if this helps!
Yes, it works for smaller sizes; it will work even with the smallest possible batch size you set.
The trick is that the batch size also adds to the regularization effect, not only the batch norm.
I will show you a few plots:
We are on the same scale, tracking the batch loss. The left-hand side is a model without the batch norm layer (black), the right-hand side is with the batch norm layer.
Note how the regularization effect is evident even for bs=10.
When we set bs=64, the batch-loss regularization is super evident. Note the y scale is always [0, 4].
My examination was purely with nn.BatchNorm1d(10, affine=False), i.e. without the learnable parameters gamma and beta (the scale and shift).
This is why it still makes sense to use the BatchNorm layer even when you have a low batch size.
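If you want to reproduce this kind of comparison, here is a minimal sketch (assuming PyTorch; synthetic data and a placeholder architecture) that trains the same small model with and without nn.BatchNorm1d(10, affine=False) at bs=10:

```python
# Minimal sketch, assuming PyTorch; synthetic data, placeholder architecture.
import torch
import torch.nn as nn

def make_model(use_bn: bool) -> nn.Sequential:
    blocks = [nn.Linear(20, 10)]
    if use_bn:
        blocks.append(nn.BatchNorm1d(10, affine=False))  # no learnable gamma/beta
    blocks += [nn.ReLU(), nn.Linear(10, 1)]
    return nn.Sequential(*blocks)

X = torch.randn(1000, 20)
y = torch.randn(1000, 1)

for use_bn in (False, True):
    torch.manual_seed(0)                                 # same weight init for both runs
    model = make_model(use_bn)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for i in range(0, len(X), 10):                       # bs = 10
        xb, yb = X[i:i + 10], y[i:i + 10]
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # in practice you would log every batch loss and plot it, as in the figures above
    print("with BN" if use_bn else "without BN", float(loss))
```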
I am training a Naive Bayes classifier on a balanced dataset with an equal number of positive and negative examples. At test time I compute the accuracy separately for the examples in the positive class, the negative class, and the subsets which make up the negative class. However, for some subsets of the negative class I get accuracy values lower than 50%, i.e. worse than random guessing. Should I worry about these results being so much lower than 50%? Thank you!
It's impossible to fully answer this question without specific details, so here are some guidelines instead:
If you have a dataset with an equal number of examples in each class, then random guessing would give you 50% accuracy on average.
To be clear, are you certain your model has learned something on your training dataset? Is the training dataset accuracy higher than 50%? If yes, continue reading.
Assuming that your validation set is large enough to rule out statistical fluctuations, then lower than 50% accuracy suggests that something is indeed wrong with your model.
For example, are your classes accidentally switched somehow in the validation dataset? Notice that if that were the case, using 1 - model.predict(x) instead would give you accuracy above 50%.
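A tiny numeric illustration of that last point (plain NumPy with made-up labels, no model): if binary accuracy is below 50%, flipping every prediction pushes it above 50%.

```python
# Tiny illustration with made-up labels: flipping a worse-than-chance binary
# predictor yields accuracy of exactly 1 - original accuracy.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 0, 1])    # a deliberately bad predictor

acc = (y_pred == y_true).mean()                 # 0.25
acc_flipped = ((1 - y_pred) == y_true).mean()   # 0.75
print(acc, acc_flipped)
```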
I have been using the training method proposed in the cifar10_multi_gpu_train example for (local) multi-GPU training, i.e., creating several towers and then averaging the gradients. However, I was wondering the following: what would happen if I just took the losses coming from the different GPUs, summed them up, and then applied gradient descent to that new loss?
Would that work? Probably this is a silly question, and there must be a limitation somewhere, so I would be happy if you could comment on this.
Thanks and best regards,
G.
It would not work with the sum. You would get a bigger loss and consequently bigger, and probably erroneous, gradients. When averaging the gradients you get an average of the direction that the weights have to take in order to minimize the loss, and each single direction is the one computed for its exact loss value.
One thing that you can try is to run the towers independently and then average the weights from time to time; this gives a slower convergence rate but faster processing on each node.
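To see how the scale of the gradient changes, here is a toy sketch (assuming PyTorch and a placeholder per-tower loss, not the actual cifar10_multi_gpu_train code): summing the tower losses gives a gradient N times larger than averaging them.

```python
# Toy sketch, assuming PyTorch; two "towers" are simulated as two mini-batches.
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
tower_batches = [torch.tensor([0.5, 1.0]), torch.tensor([1.5, -0.5])]  # one batch per GPU

def tower_loss(w, x):
    return ((w * x) ** 2).sum()        # placeholder loss

# Summing the tower losses: the gradient is the sum of per-tower gradients.
grad_sum = torch.autograd.grad(
    sum(tower_loss(w, x) for x in tower_batches), w)[0]

# Averaging the tower losses: the gradient is 1/N of the summed gradient,
# which matches averaging the per-tower gradients.
grad_avg = torch.autograd.grad(
    sum(tower_loss(w, x) for x in tower_batches) / len(tower_batches), w)[0]

print(grad_sum, grad_avg)              # grad_sum == len(tower_batches) * grad_avg
```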
How can I make Weka classify the smaller class? I have a data set where the positive class is 35% of the data and the negative class is 65%. I want Weka to predict the positive class, but in some cases the resulting model predicts all instances to be negative, i.e. it always picks the larger class. How can I force it to predict the positive (smaller) class?
One simple solution is to adjust your training set to be more balanced (50% positive, 50% negative) to encourage classification of both cases. I would guess that more of your cases are negative in the problem space, so you would need to find some way to ensure that the negative cases still represent the problem well.
Since the ratio of positive to negative is roughly 1:2, you could also try duplicating the positive cases in the training set to make it 2:2 and see how that goes; a sketch of the idea is below.
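Outside Weka, here is a minimal sketch (plain NumPy with placeholder data) of the same duplication idea, i.e. randomly oversampling the minority class until the counts match:

```python
# Minimal sketch with placeholder data: random oversampling of the positive class.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))                          # placeholder features
y = np.array([1] * 35 + [0] * 65)                 # 35% positive, 65% negative

pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]

# draw extra positive rows (with replacement) until both classes have equal counts
extra = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])

print(np.bincount(y_balanced))                    # [65 65]
```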
Use stratified sampling (e.g. train on a 50%/50% sample) or class weights/class priors. It would also help greatly if you told us which specific classifier you are using; Weka seems to have at least 50.
Is the penalty for Type I errors equal to the penalty for Type II errors?
This is a special case of receiver operating characteristic (ROC) curve analysis.
If the penalties are not equal, experiment with the cutoff value and the AUC.
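On the cutoff point, here is a minimal sketch (assuming scikit-learn and placeholder scores; this is not Weka) of sweeping the decision threshold along the ROC curve:

```python
# Minimal sketch with placeholder scores: inspect TPR/FPR at each candidate threshold.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.55, 0.9])  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  TPR={t:.2f}  FPR={f:.2f}")
```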
You probably also want to read the sister site CrossValidated for statistics.
Use CostSensitiveClassifier, which is available under the "meta" classifiers.
You will need to change "classifier" to your J48 and (!) change the cost matrix to something like [(0,1), (2,0)]. This will tell J48 that misclassification of a positive instance is twice as costly as misclassification of a negative instance. Of course, adjust your cost matrix according to your business values.
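To make the effect of the cost matrix concrete, here is a small illustration (plain NumPy with a made-up confusion matrix, not Weka syntax) of how the [(0,1), (2,0)] weighting doubles the cost of missed positives:

```python
# Small illustration with a made-up confusion matrix; rows are the true class
# (negative, positive) and columns are the predicted class (negative, positive).
import numpy as np

confusion = np.array([[50, 15],    # 15 negatives predicted as positive (false positives)
                      [10, 25]])   # 10 positives predicted as negative (false negatives)

cost = np.array([[0, 1],           # a false positive costs 1
                 [2, 0]])          # a false negative costs 2

total_cost = (confusion * cost).sum()
print(total_cost)                  # 15 * 1 + 10 * 2 = 35
```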