I have an ML problem. I have an machine learning classification task where the classifications are either -1, 0 or 1. In reality, the vast majority of the time the correct classification is 0, and approx 1% of the time, the answer is -1 or 1.
When training (I'm using auto_ml but I think this is a general problem) I'm finding that my model decides it can get a 99% accuracy by just predicting a 0 every time.
Is this a known phenomenon? Is there anything I can do to work around this other than come up with more classifications? Maybe something which splits the 0s into different classes.
Any advice, or pointers at what to read up on next are appreciated.
Thanks.
You should look deeper into your dataset. Seems, your dataset is imbalanced. Possible solutions:
try to balance your dataset - add more data with labels 1 and -1 or reduce number of rows with 0 label;
if it's not possible make your dataset balanced, try change approach. You can assume, that labels 1 and -1 are outliers and try to solve a problem of finding outliers. Here are some examples how to deal with outliers using library scikit-learn;
Yeah, ML can be lazy ;-)
You could try including more of the rare cases into your training set. Though, you use the word 'Event' which makes me wonder if you're doing some kind of time series analysis - is this some kind of recurrent net? If so then training with more of the rare events might be unrealistic.
Related
I am experimenting with classification using neural networks (I am using tensorflow).
And unfortunately the training of my neural network gets stuck at 42% accuracy.
I have 4 classes, into which I try to classify the data.
And unfortunately, my data set is not well balanced, meaning that:
43% of the data belongs to class 1 (and yes, my network gets stuck predicting only this)
37% to class 2
13% to class 3
7% to class 4
The optimizer I am using is AdamOptimizer and the cost function is tf.nn.softmax_cross_entropy_with_logits.
I was wondering if the reason for my training getting stuck at 42% is really the fact that my data set is not well balanced, or because the nature of the data is really random, and there are really no patterns to be found.
Currently my NN consists of:
input layer
2 convolution layers
7 fully connected layers
output layer
I tried changing this structure of the network, but the result is always the same.
I also tried Support Vector Classification, and the result is pretty much the same, with small variations.
Did somebody else encounter similar problems?
Could anybody please provide me some hints how to get out of this issue?
Thanks,
Gerald
I will assume that you have already double, triple and quadruple checked that the data going in is matching what you expect.
The question is quite open-ended, and even a topic for research. But there are some things that can help.
In terms of better training, there's two normal ways in which people train neural networks with an unbalanced dataset.
Oversample the examples with lower frequency, such that the proportion of examples for each class that the network sees is equal. e.g. in every batch, enforce that 1/4 of the examples are from class 1, 1/4 from class 2, etc.
Weight the error for misclassifying each class by it's proportion. e.g. incorrectly classifying an example of class 1 is worth 100/43, while incorrectly classifying an example of class 4 is worth 100/7
That being said, if your learning rate is good, neural networks will often eventually (after many hours of just sitting there) jump out of only predicting for one class, but they still rarely end well with a badly skewed dataset.
If you want to know whether or not there are patterns in your data which can be determined, there is a simple way to do that.
Create a new dataset by randomly select elements from all of your classes such that you have an even number of all of them (i.e. if there's 700 examples of class 4, then construct a dataset by randomly selecting 700 examples from every class)
Then you can use all of your techniques on this new dataset.
Although, this paper suggests that even with random labels, it should be able to find some pattern that it understands.
Firstly you should check if your model is overfitting or underfitting, both of which could cause low accuracy. Check the accuracy of both training set and dev set, if accuracy on training set is much higher than dev/test set, the model may be overfiiting, and if accuracy on training set is as low as it on dev/test set, then it could be underfitting.
As for overfiiting, more data or simpler learning structures may work while make your structure more complex and longer training time may solve underfitting problem
I found that while training some CNNs and RNNs on imbalanced training data, that my training converges relatively quickly, with the accuracy being around the percentage of the bigger class (so e.g. if there are 80% yes examples it will probably always output yes). I find that explainable .. that this solution is a local optimum and the network cannot escape it while training. Is this explantion correct and this behaviour thus mostly found in these cases?
What can I do against it? Synthesize more training data to make the set more even? What else?
Thanks a lot!
Yes, you are right. Imbalanced training data do impacts the accuracy. Some of the solutions to come over imbalanced class problem are following:
1) More data collection: This is not easy in some cases. For example, there are a very small number of cases of frauds compared to non fraud cases.
2) Undersampling: Removing the data from the majority class. You can remove it randomly or informative (taking help from the distribution to decide what parts/patches to be removed)
3) Oversampling: Replicating observations belonging to minority class.
Your question has nothing to do with TF, this is a standard problem in machine learning. Just type "dealing with imbalanced data in machine learning" in google and read a few pages.
Here are a few approaches:
get more data
use other metric (f1)
undersampling/oversampling/weighting
I am trying to use a neural network to solve a problem. I learned about them from the Machine Learning course offered on Coursera, and was happy to find that FANN is a Ruby implementation of neural networks, so I didn't have to re-invent the airplane.
However, I'm not really understanding why FANN is giving me such strange output. Based on what I learned from the class,
I have a set of training data that's results of matches. The player is given a number, their opponent is given a number, and the result is 1 for a win and 0 for a loss. The data is a little noisy because of upsets, but not terribly so. My goal is to find which rating gaps are more prone to upsets - for instance, my intuition tells me that lower-rated matches tend to entail more upsets because the ratings are less accurate.
So I got a training set of about 100 examples. Each example is (rating, delta) => 1/0. So it's a classification problem, but not really one that I think lends itself to a logistic regression-type chart, and a neural network seemed more correct.
My code begins
training_data = RubyFann::TrainData.new(:inputs => inputs, :desired_outputs => outputs)
I then set up the neural network with
network = RubyFann::Standard.new(
:num_inputs=>2,
:hidden_neurons=>[8, 8, 8, 8],
:num_outputs=>1)
In the class, I learned that a reasonably default is to have each hidden layer with the same number of units. Since I don't really know how to work this or what I'm doing yet, I went with the default.
network.train_on_data(training_data, 1000, 1, 0.15)
And then finally, I went through a set of sample input ratings in increments and, at each increment, increased delta until the result switched from being > 0.5 to < 0.5, which I took to be about 0 and about 1, although really they were more like 0.45 and 0.55.
When I ran this once, it gave me 0 for every input. I ran it again twice with the same data and got a decreasing trend of negative numbers and an increasing trend of positive numbers, completely opposite predictions.
I thought maybe I wasn't including enough features, so I added (rating**2 and delta**2). Unfortunately, then I started getting either my starting delta or my maximum delta for every input every time.
I don't really understand why I'm getting such divergent results or what Ruby-FANN is telling me, partly because I don't understand the library but also, I suspect, because I just started learning about neural networks and am missing something big and obvious. Do I not have enough training data, do I need to include more features, what is the problem and how can I either fix it or learn how to do things better?
What about playing a little with parameters? At first I would highly recommend only two layers..there should be mathematical proof somewhere that it is enough for many problems. If you have too many neurons your NN will not have enough epochs to really learn something.. so you can also play with number of epochs as well as gama..I think that in your case it's 0.15 ..if you use a little bigger value your NN should learn a little bit faster(don't be afraid to try 0.3 or even 0.7), right value of gama usually depends on weight's intervals or input normalization.
Your NN shows such a different results most probably because in each run there is new initialization and then there is totally different network and it will learn in different way as the previous one(different weights will have higher values so different parts of NN will learn same things).
I am not familiar with this library I am just writing some experiences with NN. Hope something from these will help..
I'm implementing an one-versus-rest classifier to discriminate between neural data corresponding (1) to moving a computer cursor up and (2) to moving it in any of the other seven cardinal directions or no movement. I'm using an SVM classifier with an RBF kernel (created by LIBSVM), and I did a grid search to find the best possible gamma and cost parameters for my classifier. I have tried using training data with 338 elements from each of the two classes (undersampling my large "rest" class) and have used 338 elements from my first class and 7218 from my second one with a weighted SVM.
I have also used feature selection to bring the number of features I'm using down from 130 to 10. I tried using the ten "best" features and the ten "worst" features when training my classifier. I have also used the entire feature set.
Unfortunately, my results are not very good, and moreover, I cannot find an explanation why. I tested with 37759 data points, where 1687 of them came from the "one" (i.e. "up") class and the remaining 36072 came from the "rest" class. In all cases, my classifier is 95% accurate BUT the values that are predicted correctly all fall into the "rest" class (i.e. all my data points are predicted as "rest" and all the values that are incorrectly predicted fall in the "one"/"up" class). When I tried testing with 338 data points from each class (the same ones I used for training), I found that the number of support vectors was 666, which is ten less than the number of data points. In this case, the percent accuracy is only 71%, which is unusual since my training and testing data are the exact same.
Do you have any idea what could be going wrong? If you have any suggestions, please let me know.
Thanks!
Test dataset being same as training data implies your training accuracy was 71%. There is nothing wrong about it as the data was possibly not well separable by the kernel you used.
However, one point of concern is the number of support vectors being high suggests probable overfitting .
Not sure if this amounts to an answer - it would probably be hard to give one without actually seeing the data - but here are some ideas regarding the issue you describe:
In general, SVM tries to find a hyperplane that would best separate your classes. However, since you have opted for 1vs1 classification, you have no choice but to mix all negative cases together (your 'rest' class). This might make the 'best' separation much less fit to solve your problem. I'm guessing that this might be a major issue here.
To verify if that's the case, I suggest trying to use only one other cardinal direction as the negative set, and see if that improves results. In case it does, you can train 7 classifiers, one for each direction. Another option might be to use the multiclass option of libSVM, or a tool like SVMLight, which is able to classify one against many.
One caveat of most SVM implementations is their inability to support big differences between the positive and negative sets, even with weighting. From my experience, weighting factors of over 4-5 are problematic in many cases. On the other hand, since your variety in the negative side is large, taking equal sizes might also be less than optimal. Thus, I'd suggest using something like 338 positive examples, and around 1000-1200 random negative examples, with weighting.
A little off your question, I would have considered also other types of classification. To start with, I'd suggest thinking about knn.
Hope it helps :)
I have a binary class dataset (0 / 1) with a large skew towards the "0" class (about 30000 vs 1500). There are 7 features for each instance, no missing values.
When I use the J48 or any other tree classifier, I get almost all of the "1" instances misclassified as "0".
Setting the classifier to "unpruned", setting minimum number of instances per leaf to 1, setting confidence factor to 1, adding a dummy attribute with instance ID number - all of this didn't help.
I just can't create a model that overfits my data!
I've also tried almost all of the other classifiers Weka provides, but got similar results.
Using IB1 gets 100% accuracy (trainset on trainset) so it's not a problem of multiple instances with the same feature values and different classes.
How can I create a completely unpruned tree?
Or otherwise force Weka to overfit my data?
Thanks.
Update: Okay, this is absurd. I've used only about 3100 negative and 1200 positive examples, and this is the tree I got (unpruned!):
J48 unpruned tree
------------------
F <= 0.90747: 1 (201.0/54.0)
F > 0.90747: 0 (4153.0/1062.0)
Needless to say, IB1 still gives 100% precision.
Update 2: Don't know how I missed it - unpruned SimpleCart works and gives 100% accuracy train on train; pruned SimpleCart is not as biased as J48 and has a decent false positive and negative ratio.
Weka contains two meta-classifiers of interest:
weka.classifiers.meta.CostSensitiveClassifier
weka.classifiers.meta.MetaCost
They allows you to make any algorithm cost-sensitive (not restricted to SVM) and to specify a cost matrix (penalty of the various errors); you would give a higher penalty for misclassifying 1 instances as 0 than you would give for erroneously classifying 0 as 1.
The result is that the algorithm would then try to:
minimize expected misclassification cost (rather than the most likely class)
The quick and dirty solution is to resample. Throw away all but 1500 of your positive examples and train on a balanced data set. I am pretty sure there is a resample component in Weka to do this.
The other solution is to use a classifier with a variable cost for each class. I'm pretty sure libSVM allows you to do this and I know Weka can wrap libSVM. However I haven't used Weka in a while so I can't be of much practical help here.