Classification with imbalanced dataset using Multi Layer Perceptrons - machine-learning

I am having a trouble in classification problem.
I have almost 400k number of vectors in training data with two labels, and I'd like to train MLP which classifies data into two classes.
However, the dataset is so imbalanced. 95% of them have label 1, and others have label 0. The accuracy grows as training progresses, and stops after reaching 95%. I guess this is because the network predict the label as 1 for all vectors.
So far, I tried dropping out layers with 0.5 probabilities. But, the result is the same. Is there any ways to improve the accuracy?

I think the best way to deal with unbalanced data is to use weights for your class. For example, you can weight your classes such that sum of weights for each class will be equal.
import pandas as pd
df = pd.DataFrame({'x': range(7),
'y': [0] * 2 + [1] * 5})
df['weight'] = df['y'].map(len(df)/2/df['y'].value_counts())
print(df)
print(df.groupby('y')['weight'].agg({'samples': len, 'weight': sum}))
output:
x y weight
0 0 0 1.75
1 1 0 1.75
2 2 1 0.70
3 3 1 0.70
4 4 1 0.70
5 5 1 0.70
6 6 1 0.70
samples weight
y
0 2.0 3.5
1 5.0 3.5

You could try another classifier on subset of examples. SVMs, may work good with small data, so you can take let's say 10k examples only, with 5/1 proportion in classes.
You could also oversample small class somehow and under-sample the another.
You can also simply weight your classes.
Think also about proper metric. It's good that you noticed that the output you have predicts only one label. It is, however, not easily seen using accuracy.
Some nice ideas about unbalanced dataset here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.

That's a common situation: the network learns a constant and can't get out of this local minimum.
When the data is very unbalanced, like in your case, one possible solution is a weighted cross entropy loss function. For instance, in tensorflow, apply a built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
But I should say that getting more data to balance both classes (if that's possible) will always help.

Related

Number of outputs in dense softmax layer

I've been working through a Coursera course for extra practice and ran into an issue I don't understand.
Link to Collab
So as far as I've worked on ML neural network problems, I've always been taught that the output layer of a multiclass classification problem will be Dense, with number of nodes equal to the number of classes. E.g. Dog, cat, horse - 3 classes = 3 nodes.
However, in the notebook, there are 5 classes in the labels, checked using len(label_tokenizer.word_index) but using 5 nodes I had terrible results and with 6 nodes the model worked properly.
Can anyone please explain why this is the case? I can't find any online example explaining this. Cheers!
I figured it out. The output of the dense layer with loss of categorical cross entropy expects labels/targets to be starting from zero. For example:
cat - 0
dog - 1
horse - 2
In this case, the number of dense nodes are 3.
However, in the collab, the labels were generated using keras tokenizer, which tokenizes starting from 1 (because padding is usually 0).
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
# {'business': 2, 'entertainment': 5, 'politics': 3, 'sport': 1, 'tech': 4}
This leads to a weird case where, if we have 5 dense nodes, we have output classes from 0-4, which doesn't match up with our labels with predictions 1-5.
I proved this empirically by rerunning the code with all labels reduced by 1 and the model trains successfully with 5 dense nodes, since labels are 0-4 now.
I suspect that with labels 1-5 and 6 dense nodes work because the model simply learns that label 0 is not used and focuses on 1-5.
If anyone understands the inner workings of categorical cross entropy, do feel free to add on!

Is there any regularizer that penalizes redundant neurons?

I trained a network on a real-value labels (floating point numbers from 0.0 to 1.0) - several residual blocks in the beginning, and the last layers are
fully-connected layer with 64 neurons + ELU activation,
fully-connected layer with 16 neurons + ELU activation,
output logistic regression layer ( 1 neuron with y = 1 / (1 + exp(-x) ).
After training, I visualised weights of the layer with 16 neurons:
figure rows represents weights that every single 1 of 16 neurons developed for every single 1 of 64 neurons of previous layer, indices are 0..15 and 0..63;
UPD: figure shows neurons weights correlation (Pearson);
UPD: figure shows neurons weights MAD (mean absolute difference) - this proves redundancy event better than correlation.
Now the detailed questions:
Can we say that there are redundant features? I see several redundant groups of neurons: 0,4; 1,6,7 (maybe 8,11,15 too); 2,14; 12,13 (maybe) .
is it bad ?
if so, is there any regularizer, that penalizes redundant neuron weights, and makes neurons develop uncorrelated weights?
I use adam regularizer, Xavier initialization (the best of the tested), weight decay 1e-5/batch (the best of the tested), other output layers did not work as well as logistic regression (by means of precison & recall & lack of overfitting).
I use only 10 filters in each resnet blocks (which are 10, too) to address overfitting.
Are you using Tensorflow ? if yes, is post training quantization an option ?
tensorflow.org/lite/performance/post_training_quantization
This has some similar effect to what you need but also makes other improvements.
Alternatively maybe you can also try to use Quantization-aware training
https://github.com/tensorflow/tensorflow/tree/r1.14/tensorflow/contrib/quantize

Why my CNN based on Alexnet fails in classification?

I'm trying to build a CNN to classify dogs.In fact , my data set consists of 5 classes of dogs. I've 50 images of dogs splitted into 40 images for training and 10 for testing.
I've trained my network based on AlexNet pretrained model over 100,000 and 140,000 iterations but the accuracy is always between 20 % and 30 %.
In fact, I have adapted AlexNet to my problem as following : I changed the name of last fully connected network and num_output to 5. Also , I ve changed the name of the first fully connected layer (fc6).
So why this model failed even I' ve used data augmentation (cropping )?
Should I use a linear classification on top layer of my network since I have a little bit of data and similar to AlexNet dataset ( as mentioned here transfer learning) or my data set is very different of original data set of AlexNet and I should train linear classifier in earlier network ?
Here is my solver :
net: "models/mymodel/train_val.prototxt"
test_iter: 1000
test_interval: 1000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 20
max_iter: 200000
momentum: 0.9
weight_decay: 0.0005
snapshot: 1000
snapshot_prefix: "models/mymodel/my_model_alex_net_train"
solver_mode: GPU
Although you haven't given us much debugging information, I suspect that you've done some serious over-fitting. In general, a model's "sweet spot" for training is dependent on epochs, not iterations. Single-node AlexNet and GoogleNet, on an ILSVRC-style of data base, train in 50-90 epochs. Even if your batch size is as small as 1, you've trained for 2500 epochs with only 5 classes. With only 8 images per class, the AlexNet topology is serious overkill and is likely adapted to each individual photo.
Consider this: you have only 40 training photos, but 96 kernels in the first convolution layer and 256 in the second. This means that your model can spend over 2 kernels in conv1 and 6 in conv 2 for each photograph! You get no commonality of features, no averaging ... instead of edge detection generalizing to finding faces, you're going to have dedicated filters tuned to the individual photos.
In short, your model is trained to find "Aunt Polly's dog on a green throw rug in front of the kitchen cabinet with a patch of sun to the left." It doesn't have to learn to discriminate a basenji from a basset, just to recognize whatever is randomly convenient in each photo.

Optimal parameter estimation for a classifier with multiple parameters

The image on the left shows a standard ROC curve formed by sweeping a single threshold and recording the corresponding True Positive Rate (TPR) and False Positive Rate (FPR).
The image on the right shows my problem setup where there are 3 parameters, and for each, we have only 2 choices. Together, it produces 8 points as depicted on the graph. In practice, I intend to have thousands of possible combinations of 100s of parameters, but the concept remains the same in this down-scaled case.
I intend to find 2 things here:
Determine the optimum parameter(s) for the given data
Provide an overall performance score for all combinations of parameters
In the case of the ROC curve on the left, this is done easily using the following methods:
Optimal parameter: Maximal difference of TPR and FPR with a cost component (I believe it is called the J-statistic?)
Overall performance: Area under the curve (the shaded portion in the graph)
However, for my case in the image on the right, I do not know if the methods I have chosen are the standard principled methods that are normally used.
Optimal parameter set: Same maximal difference of TPR and FPR
Parameter score = TPR - FPR * cost_ratio
Overall performance: Average of all "parameter scores"
I have found a lot of reference material for the ROC curve with a single threshold and while there are other techniques available to determine the performance, the ones mentioned in this question is definitely considered a standard approach. I found no such reading material for the scenario presented on the right.
Bottomline, the question here is two-fold: (1) Provide methods to evaluate the optimal parameter set and overall performance in my problem scenario, (2) Provide reference that claims the suggested methods to be a standard approach for the given scenario.
P.S.: I had first posted this question on the "Cross Validated" forum, but didn't get any responses, in fact, got only 7 views in 15 hours.
I'm going to expand a little on aberger's previous answer on a Grid Search. As with any tuning of a model it's best to optimise hyper-parameters using one portion of the data and evaluate those parameters using another proportion of the data, so GridSearchCV is best for this purpose.
First I'll create some data and split it into training and test
import numpy as np
from sklearn import model_selection, ensemble, metrics
np.random.seed(42)
X = np.random.random((5000, 10))
y = np.random.randint(0, 2, 5000)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
This gives us a classification problem, which is what I think you're describing, though the same would apply to regression problems too.
Now it's helpful to think about what parameters you may want to optimise. A cross-validated grid search is a computational expensive process, so the smaller the search space the quicker it gets done. I will show an example for a RandomForestClassifier because it's my go to model.
clf = ensemble.RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30],
'max_features': [5, 8, 10],
'max_depth': [None, 10, 20]}
So now I have my base estimator and a list of parameters that I want to optimise. Now I just have to think about how I want to evaluate each of the models that I'm going to build. It seems from your question that you're interested in the ROC AUC, so that's what I'll use for this example. Though you can chose from many default metrics in scikit or even define your own.
gs = model_selection.GridSearchCV(clf, param_grid=parameters,
scoring='roc_auc', cv=5)
gs.fit(X_train, y_train)
This will fit a model for all possible combinations of parameters that I have given it, using 5-fold cross-validation evaluate how well those parameters performed using the ROC AUC. Once that's been fit, we can look at the best parameters and pull out the best performing model.
print gs.best_params_
clf = gs.best_estimator_
Outputs:
{'max_features': 5, 'n_estimators': 30, 'max_depth': 20}
Now at this point you may want to retrain your classifier on all of the training data, as currently it's been trained using cross-validation. Some people prefer not to, but I'm a retrainer!
clf.fit(X_train, y_train)
So now we can evaluate how well the model performs on both our training and test set.
print metrics.classification_report(y_train, clf.predict(X_train))
print metrics.classification_report(y_test, clf.predict(X_test))
Outputs:
precision recall f1-score support
0 1.00 1.00 1.00 1707
1 1.00 1.00 1.00 1793
avg / total 1.00 1.00 1.00 3500
precision recall f1-score support
0 0.51 0.46 0.48 780
1 0.47 0.52 0.50 720
avg / total 0.49 0.49 0.49 1500
We can see that this model has overtrained by the poor score on the test set. But this is not surprising as the data is just random noise! Hopefully when performing these methods on data with a signal you will end up with a well-tuned model.
EDIT
This is one of those situations where 'everyone does it' but there's no real clear reference to say this is the best way to do it. I would suggest looking for an example close to the classification problem that you're working on. For example using Google Scholar to search for "grid search" "SVM" "gene expression"
I feeeeel like we're talking about Grid Search in scikit-learn. It (1), provides methods to evaluate optimal (hyper)parameters and (2), is implemented in a massively popular and well referenced statistical software package.

Parameter tuning/model selection using resampling

I have been trying to get into more details of resampling methods and implemented them on a small data set of 1000 rows. The data was split into 800 training set and 200 validation set. I used K-fold cross validation and repeated K-fold cross validation to train the KNN using the training set. Based on my understanding I have done some interpretations of the results - however, I have certain doubts about them (see questions below):
Results :
10 Fold Cv
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6600 0.07010791
7 0.6775 0.09432414
9 0.6800 0.07054371
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
Repeated 10 fold with 10 repeats
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.670250 0.10436607
7 0.676875 0.09288219
9 0.683125 0.08062622
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold, 1000 repeats
k Accuracy Kappa
5 0.6680438 0.09473128
7 0.6753375 0.08810406
9 0.6831800 0.07907891
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold with 2000 repeats
k Accuracy Kappa
5 0.6677981 0.09467347
7 0.6750369 0.08713170
9 0.6826894 0.07772184
Doubts:
While selecting the parameter, K=9 is the optimal value for highest accuracy. However, I don't understand how to take Kappa into consideration while finally choosing parameter value?
Repeat number has to be increased until we get stabilised result, the accuracy changes when the repeats are increased from 10 to 1000. However,the results are similar for 1000 repeats and 2000 repeats. Will it be right to consider the results of 1000/2000 repeats to be stabilised performance estimate?
Any thumb rule for the repeat number?
Finally,should I train the model on my complete training data (800 rows) now test the accuracy on the validation set ?
Accuracy and Kappa are just different classification performance metrics. In a nutshell, their difference is that Accuracy does not take possible class imbalance into account when calculating the metrics, while Kappa does. Therefore, with imbalanced classes, you might be better off using Kappa. With R caret you can do so via the train::metric parameter.
You could see a similar effect of slightly different performance results when running e.g. the 10CV with 10 repeats multiple times - you will just get slightly different results for those as well. Something you should look out for is the variance of classification performance over your partitions and repeats. In case you obtain a small variance you can derive that you by training on all your data, you likely obtain a model that will give you similar (hence stable) results on new data. But, in case you obtain a huge variance, you can derive that just by chance (being lucky or unlucky) you might instead obtain a model that either gives you rather good or rather bad performance on new data. BTW: the prediction performance variance is something e.g. R caret::train will give you automatically, hence I'd advice on using it.
See above: look at the variance and increase the repeats until you can e.g. repeat the whole process and obtain a similar average performance and variance of performance.
Yes, CV and resampling methods exist to give you information about how well your model will perform on new data. So, after performing CV and resampling and obtaining this information, you will usually use all your data to train a final model that you use in your e.g. application scenario (this includes both train and test partition!).

Resources