I'm new to data mining and I'm trying to train a decision tree, but the data set I've chosen is very imbalanced, so the results I'm getting are biased as well. I searched online and came across balanced accuracy, but I'm still not satisfied with the result.
Would it be a good idea to sample my data set so that the classes are equally represented, e.g. 1000 cases of YES and 1000 of NO?
One way to handle class imbalance is to undersample the larger class so that the class distribution is approximately half and half.
The answer to your question is yes, provided 1000 is the size of the smaller class, so that you only discard larger-class data points.
Note: when selecting which larger-class data points to keep, prefer to drop the ones with more missing values.
You can also use class weights while modelling: assigning a higher weight to the minority class compensates for the imbalance.
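If it helps, here is a rough sketch of that kind of undersampling, assuming a pandas DataFrame with a binary "label" column (the column name and toy data are made up for illustration); it keeps the most complete majority-class rows until the two classes are the same size:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data set: "NO" is the larger class and some
# feature values are missing.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "f1": rng.normal(size=3000),
    "f2": rng.normal(size=3000),
    "label": ["YES"] * 1000 + ["NO"] * 2000,
})
df.loc[rng.choice(3000, size=300, replace=False), "f1"] = np.nan

minority = df[df["label"] == "YES"]
majority = df[df["label"] == "NO"]

# Keep the most complete majority rows until both classes have 1000 cases.
majority_kept = (majority.assign(n_missing=majority.isna().sum(axis=1))
                         .sort_values("n_missing")
                         .head(len(minority))
                         .drop(columns="n_missing"))

balanced = pd.concat([minority, majority_kept]).sample(frac=1, random_state=0)
print(balanced["label"].value_counts())
```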
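Since the question mentions a decision tree and balanced accuracy, a minimal sketch of class weighting with scikit-learn (the synthetic data is only a stand-in for the real set) might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 1 YES for every 9 NO.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" weights classes inversely to their frequency,
# so errors on the minority class cost more during training.
tree = DecisionTreeClassifier(class_weight="balanced", random_state=0)
tree.fit(X_tr, y_tr)
print(balanced_accuracy_score(y_te, tree.predict(X_te)))
```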
Related
Suppose you have an imbalanced dataset. Without generating new data, how can you handle it efficiently? I know we can use sample weights or downsampling, but I'm not sure which of the two to choose. Also, if you need to build a classification model on this imbalanced data, how will these two techniques influence model performance differently?
It depends largely on how many observations you would have left after downsampling, and on whether the downsampled class still captures the variety of the original class.
For example, suppose class 1 has 100 observations and class 2 has 2,000 (class 1 is ~5%). Downsampling won't make sense here, because there won't be enough observations left to train a model effectively; 100 observations per class is very little, and the model would have high training error.
But if class 1 has 100,000 observations and class 2 has 2,000,000 (5% again), it still makes sense to downsample, because you have enough observations to train a model.
So the answer depends entirely on the data you have. Personally, I would go with SMOTE. Hope this helps.
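If you go the SMOTE route, a minimal sketch with the imbalanced-learn package (assuming it is installed; the toy data is just a stand-in) looks like this:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy data with roughly 5% minority class, as in the example above.
X, y = make_classification(n_samples=2100, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE creates new minority samples by interpolating between existing
# minority neighbours instead of merely duplicating them.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```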
I am building a sentiment analysis program using some tweets I have gathered. The labeled data goes through a neural network that classifies each tweet into one of two classes, positive and negative.
The data is still being labeled. So far I have observed that the positive category has a very small number of observations.
The positive records could make up about 5% of the training set (the same ratio probably holds in the population as well).
Would this create problems in the final "program" ?
Size of the data set is about 5000 records.
Yes, yes it can. There are two things to consider:
5% of 5000 is 250, so you will be trying to model the distribution of the positive class from just 250 samples. This might be orders of magnitude too small for a neural network; you might need 40x more data to obtain a representative sample. While you can easily reduce the majority class through subsampling without much risk of destroying its structure, there is no way to get "more structure" from fewer points (you can replicate points, add noise, etc., but this does not add structure, it just adds assumptions).
Class imbalance can also lead to convergence to naive solutions like "always negative", which has 95% accuracy. Here you can simply adjust the cost function to make it more robust to imbalance. (In particular, the resampled training splits suggested by @PureW are essentially a "black box" way of reweighting the loss towards the minority class. When you have direct access to your classifier's loss, as you do with a neural network, you shouldn't do that; instead, change the cost function and keep all the data.)
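As a rough illustration of "change the cost function and keep all the data": if the network happened to be built in PyTorch (the question does not say which framework is used), a class-weighted loss could look roughly like this, with the 19x weight coming from the ~5% positive rate:

```python
import torch
import torch.nn as nn

# With ~5% positives, weight positive errors about 19x (0.95 / 0.05)
# so both classes contribute comparably to the loss.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([19.0]))

# Tiny stand-in classifier and batch, just to show the call.
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(32, 20)
y = (torch.rand(32, 1) < 0.05).float()  # ~5% positive labels

loss = criterion(model(x), y)
loss.backward()
```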
Without even splits of the different classes, you may want to introduce weights in your loss function such that errors in the smaller class are deemed more important.
Another solution, since 5000 samples may or may not be much data depending on your problem, is to sample additional datasets. You take the set of 5000 samples and draw data points from it so that the new dataset has an even split of the classes. The new dataset is only about 10% of the original, but it is balanced. You can repeat this sampling many times and end up with several datasets, which is useful for bootstrap aggregating.
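A small sketch of that repeated balanced sampling, using NumPy on synthetic data as a stand-in for the 5000 tweets; each resulting subset could then train one model whose predictions are averaged (bagging):

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_subsets(X, y, n_sets=10):
    """Draw several datasets with an even class split from imbalanced (X, y)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    subsets = []
    for _ in range(n_sets):
        # All positives plus an equally sized random draw of negatives.
        idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
        rng.shuffle(idx)
        subsets.append((X[idx], y[idx]))
    return subsets

# Toy data: about 5% positives out of 5000 samples.
X = rng.normal(size=(5000, 20))
y = (rng.random(5000) < 0.05).astype(int)
sets = balanced_subsets(X, y)
print(len(sets), sets[0][1].mean())  # 10 subsets, each ~50% positive
```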
I realized that the related question Positives/negatives proportion in train set suggested that a 1-to-1 ratio of positive to negative training examples is favorable for the Rocchio algorithm.
However, this question differs from the related question in that it concerns a random forest model and also in the following two ways.
1) I have plenty of training data to work with, and the main bottleneck on using more training examples is training iteration time. That is, I'd prefer not to take more than a night to train one ranker because I want to iterate quickly.
2) In practice, the classifier will probably see 1 positive example for every 4 negative examples.
In this situation, should I train using more negative examples than positive examples, or still equal numbers of positive and negative examples?
See the section titled "Balancing prediction error" from the official documentation on random forests here: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
I marked some parts in bold.
In summary, this seems to suggest that your training and test data should either
reflect the 1:4 ratio of classes that your real-life data will have,
or
use a 1:1 mix, but then carefully adjust the weights per class as demonstrated below until the OOB error rate on your desired (smaller) class is lowered.
Hope that helps.
In some data sets, the prediction error between classes is highly
unbalanced. Some classes have a low prediction error, others a high.
This occurs usually when one class is much larger than another. Then
random forests, trying to minimize overall error rate, will keep the
error rate low on the large class while letting the smaller classes
have a larger error rate. For instance, in drug discovery, where a
given molecule is classified as active or not, it is common to have
the actives outnumbered by 10 to 1, up to 100 to 1. In these
situations the error rate on the interesting class (actives) will be
very high.
The user can detect the imbalance by outputting the error rates for the
individual classes. To illustrate, 20-dimensional synthetic data is
used. Class 1 occurs in one spherical Gaussian, class 2 on another. A
training set of 1000 class 1's and 50 class 2's is generated, together
with a test set of 5000 class 1's and 250 class 2's.
The final output of a forest of 500 trees on this data (columns: number of trees, overall test error %, class 1 error %, class 2 error %) is:
500 3.7 0.0 78.4
There is a low overall test set error (3.73%) but class 2 has over 3/4
of its cases misclassified.
The error balancing can be done by setting different weights for
the classes.
The higher the weight a class is given, the more its error rate is
decreased. A guide as to what weights to give is to make them
inversely proportional to the class populations. So set weights to 1
on class 1, and 20 on class 2, and run again. The output is:
500 12.1 12.7 0.0
The weight of 20 on class 2 is too high. Set it to 10 and try again,
getting:
500 4.3 4.2 5.2
This is pretty close to balance. If exact balance is wanted, the
weight on class 2 could be jiggled around a bit more.
Note that in getting this balance, the overall error rate went up.
This is the usual result - to get better balance, the overall error
rate will be increased.
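The quoted example uses Breiman's original Fortran program; if you happen to be working in scikit-learn instead (an assumption on my part), the analogous knob is the class_weight parameter, for example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced data in the spirit of the quoted example (5% minority class).
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" uses weights inversely proportional to class
# frequencies; you can also tune them by hand (e.g. {0: 1, 1: 10}) while
# watching the per-class error, as the quoted text suggests.
rf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                            oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)
print("OOB score:", rf.oob_score_)
print(confusion_matrix(y_te, rf.predict(X_te)))
```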
This might seem like a trivial answer, but the best thing I can suggest is to try it on a small subset of your data (small enough that the algorithm trains quickly) and observe what your accuracy is when you use 1:1, 1:2, 1:3, etc.
Plot the results as you gradually increase the total number of examples for each ratio and see how the performance responds. Very often you'll find that a fraction of the data gets very close to the performance of training on the full dataset, in which case you can make an informed decision on your question.
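For instance, a rough sketch of the ratio comparison with scikit-learn on synthetic stand-in data (the real data, model, and metric would replace these) could be:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-in data with the 1:4 positive-to-negative ratio from the question.
X, y = make_classification(n_samples=20000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
pos = np.flatnonzero(y_tr == 1)
neg = np.flatnonzero(y_tr == 0)
rng = np.random.default_rng(0)

# Train on 1:1, 1:2, 1:3 and 1:4 ratios and compare on the untouched test set.
for ratio in (1, 2, 3, 4):
    n_neg = min(ratio * len(pos), len(neg))
    idx = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr[idx], y_tr[idx])
    print(f"1:{ratio}  F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")
```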
Hope that helps.
Is there a rule-of-thumb for how to best divide data into training and validation sets? Is an even 50/50 split advisable? Or are there clear advantages of having more training data relative to validation data (or vice versa)? Or is this choice pretty much application dependent?
I have mostly been using an 80% / 20% split of training and validation data, respectively, but I chose this division without any principled reason. Can someone who is more experienced in machine learning advise me?
There are two competing concerns: with less training data, your parameter estimates have greater variance. With less testing data, your performance statistic will have greater variance. Broadly speaking you should be concerned with dividing data such that neither variance is too high, which is more to do with the absolute number of instances in each category rather than the percentage.
If you have a total of 100 instances, you're probably stuck with cross validation as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 split or a 90:10 split (indeed you may choose to use less training data if your method is particularly computationally intensive).
Assuming you have enough data to set aside a proper held-out test set (rather than using cross-validation), the following is an instructive way to get a handle on the variances:
Split your data into training and testing (80/20 is indeed a good starting point)
Split the training data into training and validation (again, 80/20 is a fair split).
Subsample random selections of your training data, train the classifier with this, and record the performance on the validation set
Try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times and observe performance on the validation data, then do the same with 40%, 60%, 80%. You should see both greater performance with more data and lower variance across the different random samples
To get a handle on variance due to the size of test data, perform the same procedure in reverse. Train on all of your training data, then randomly sample a percentage of your validation data a number of times, and observe performance. You should now find that the mean performance on small samples of your validation data is roughly the same as the performance on all the validation data, but the variance is much higher with smaller numbers of test samples
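A rough sketch of the subsampling experiment described above, using scikit-learn and synthetic data purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real problem.
X, y = make_classification(n_samples=20000, random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.2, random_state=0)

# Train on random fractions of the training data and watch how the mean
# validation performance and its spread change with sample size.
rng = np.random.default_rng(0)
for frac in (0.2, 0.4, 0.6, 0.8):
    scores = []
    for _ in range(10):
        idx = rng.choice(len(X_train), size=int(frac * len(X_train)), replace=False)
        clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
        scores.append(clf.score(X_val, y_val))
    print(f"{frac:.0%} of training data: "
          f"mean={np.mean(scores):.3f}  std={np.std(scores):.3f}")
```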
You'd be surprised to find out that 80/20 is quite a commonly occurring ratio, often referred to as the Pareto principle. It's usually a safe bet if you use that ratio.
However, depending on the training/validation methodology you employ, the ratio may change. For example: if you use 10-fold cross validation, then you would end up with a validation set of 10% at each fold.
There has been some research into what is the proper ratio between the training set and the validation set:
The fraction of patterns reserved for the validation set should be
inversely proportional to the square root of the number of free
adjustable parameters.
In their conclusion they specify a formula:
Validation set (v) to training set (t) size ratio, v/t, scales like
ln(N/h-max), where N is the number of families of recognizers and
h-max is the largest complexity of those families.
What they mean by complexity is:
Each family of recognizer is characterized by its complexity, which
may or may not be related to the VC-dimension, the description
length, the number of adjustable parameters, or other measures of
complexity.
Taking the first rule of thumb (i.e. the validation set should be inversely proportional to the square root of the number of free adjustable parameters), you can conclude that if you have 32 adjustable parameters, the square root of 32 is ~5.66, so the fraction should be 1/5.66, or about 0.177. Roughly 17.7% should be reserved for validation and 82.3% for training.
Last year, I took Prof. Andrew Ng's online machine learning course. His recommendation was:
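As a quick sanity check of that arithmetic, a trivial sketch:

```python
from math import sqrt

def validation_fraction(n_free_parameters: int) -> float:
    """Rule of thumb: fraction reserved for validation ~ 1 / sqrt(#parameters)."""
    return 1.0 / sqrt(n_free_parameters)

frac = validation_fraction(32)
print(f"validation: {frac:.1%}, training: {1 - frac:.1%}")
# validation: 17.7%, training: 82.3%
```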
Training: 60%
Cross-validation: 20%
Testing: 20%
Well, you should think about one more thing.
If you have a really big dataset, say 1,000,000 examples, an 80/10/10 split may be unnecessary, because 10% = 100,000 examples is probably more than you need just to confirm that the model works well.
Maybe 99/0.5/0.5 is enough, because 5,000 examples can represent most of the variance in your data, and you can easily tell that the model works well based on those 5,000 examples in the test and dev sets.
Don't use 80/20 just because you've heard it's ok. Think about the purpose of the test set.
Perhaps a 63.2% / 36.8% split is a reasonable choice. The reason is that if you had a total sample size n and wanted to randomly sample with replacement (a.k.a. re-sample, as in the statistical bootstrap) n cases out of the initial n, the probability of an individual case being selected in the re-sample would be approximately 0.632, provided that n is not too small, as explained here: https://stats.stackexchange.com/a/88993/16263
For a sample of n=250, the probability of an individual case being selected for a re-sample to 4 digits is 0.6329.
For a sample of n=20000, the probability is 0.6321.
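Those probabilities are easy to verify in a couple of lines of Python:

```python
# Probability that a given case appears in a re-sample of size n drawn
# with replacement from n cases: 1 - (1 - 1/n)**n, which tends to 1 - 1/e.
for n in (250, 20000):
    p = 1 - (1 - 1 / n) ** n
    print(n, round(p, 4))
# 250   0.6329
# 20000 0.6321
```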
It all depends on the data at hand. If you have a considerable amount of data, then 80/20 is a good choice, as mentioned above. But if you do not, cross-validation with a 50/50 split might help you a lot more and prevent you from creating a model that over-fits your training data.
If you have little data, I suggest trying 70%, 80%, and 90% training splits and testing which gives the better result. With 90%, there is a chance that the 10% test set is too small to give a reliable accuracy estimate.
I'm implementing a one-versus-rest classifier to discriminate between neural data corresponding (1) to moving a computer cursor up and (2) to moving it in any of the other seven cardinal directions or no movement. I'm using an SVM classifier with an RBF kernel (created by LIBSVM), and I did a grid search to find the best possible gamma and cost parameters for my classifier. I have tried using training data with 338 elements from each of the two classes (undersampling my large "rest" class) and have used 338 elements from my first class and 7218 from my second one with a weighted SVM.
I have also used feature selection to bring the number of features I'm using down from 130 to 10. I tried using the ten "best" features and the ten "worst" features when training my classifier. I have also used the entire feature set.
Unfortunately, my results are not very good, and moreover, I cannot find an explanation why. I tested with 37759 data points, where 1687 of them came from the "one" (i.e. "up") class and the remaining 36072 came from the "rest" class. In all cases, my classifier is 95% accurate BUT the values that are predicted correctly all fall into the "rest" class (i.e. all my data points are predicted as "rest" and all the values that are incorrectly predicted fall in the "one"/"up" class). When I tried testing with 338 data points from each class (the same ones I used for training), I found that the number of support vectors was 666, which is ten less than the number of data points. In this case, the percent accuracy is only 71%, which is unusual since my training and testing data are the exact same.
Do you have any idea what could be going wrong? If you have any suggestions, please let me know.
Thanks!
Since your test dataset is the same as your training data, this means your training accuracy was 71%. There is nothing wrong with that in itself; the data may simply not be well separable by the kernel you used.
However, one point of concern is that the high number of support vectors suggests probable overfitting.
Not sure if this amounts to an answer - it would probably be hard to give one without actually seeing the data - but here are some ideas regarding the issue you describe:
In general, SVM tries to find a hyperplane that best separates your classes. However, since you have opted for one-versus-rest classification, you have no choice but to mix all negative cases together (your 'rest' class). This might make the 'best' separation much less suited to your problem. I'm guessing that this might be a major issue here.
To verify if that's the case, I suggest trying to use only one other cardinal direction as the negative set, and see if that improves results. In case it does, you can train 7 classifiers, one for each direction. Another option might be to use the multiclass option of libSVM, or a tool like SVMLight, which is able to classify one against many.
One caveat of most SVM implementations is that they handle large differences between the positive and negative set sizes poorly, even with weighting. In my experience, weighting factors of over 4-5 are problematic in many cases. On the other hand, since the variety on your negative side is large, taking equal sizes might also be less than optimal. Thus, I'd suggest using something like the 338 positive examples and around 1000-1200 random negative examples, with weighting.
A little off-topic from your question, but I would also consider other types of classifiers. To start with, I'd suggest looking at kNN.
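For example, a sketch of that setup with scikit-learn's SVC (which wraps LIBSVM; the synthetic data here is only a placeholder for the neural recordings) might be:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data: roughly 338 positives and ~1000 negatives, as suggested above.
X, y = make_classification(n_samples=1352, weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight scales C per class (the same idea as LIBSVM's -wi option),
# so mistakes on the rare "up" class are penalised more heavily.
clf = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight={0: 1, 1: 3})
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```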
Hope it helps :)