Distribution of classes in training set - machine-learning

When making a predictive model (specifically in telecommunications, regarding churn), is it essential to have a 1:1 split between the classes in the training set (the actual distribution is more like 1:50)? When reading about what other people have done, this seems to be the case, but they don't necessarily state it as a requirement. What is recommended?

Your problem is frequently referred to as "Class Imbalance". Whether and how it will impact your result depends on the algorithm and the evaluation metric you use. The logistic regression algorithm, and the model accuracy, for example, can be very susceptible to this problem. Simple envelope models, and the model AUC, on the other hand, are more resilient against class imbalance. I am aware of five broad possible approaches to deal with this:
1) Up-sampling: Basically, artificially increase the number of rare-class observations. This may be the go-to solution when you have very little data but you are confident that it is quite representative of the wider population.
2) Down-sampling: Just leave out a part of the abundant class. This is an option when you have a very large quantity of data.
3) Weighting: Telling your algorithm to give more importance to the information obtained from the rare class.
4) Bagging: Here, you are randomly sub-sampling your data and fitting "weak" learners to each subsample. Later, these weak learners are aggregated to create one final prediction.
5) Boosting: Similar to bagging, but each "weak" learner is not agnostic to the previously fitted ones. Instead, each new learner is fitted to the residuals of the ensemble built so far.
There is a really nice article here that goes through these in great detail, including some worked examples in R, and another one here which focuses more on Python.
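For illustration, here is a minimal sketch of option 3 (weighting) in Python with scikit-learn; the data is synthetic, standing in for a churn table with a roughly 1:50 class ratio, and all names and numbers are assumptions.

```python
# Sketch: class weighting with a logistic regression on synthetic, imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Roughly 1:50 class ratio, as in the question.
X, y = make_classification(n_samples=10_000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" scales each class inversely to its frequency, so the
# rare class contributes as much to the loss as the abundant one.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```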

Related

Machine learning testing data

I am new to machine learning, and this might be a bit of a stupid question.
I have implemented my model and it's working. I have a question about running it on test data. It's a binary classification problem. If I know the proportions of the classes in the test data, how could I use that to improve my model or the predictions made by the model?
So let's say 75% of the test data belongs to class 1 and 25% to class 0.
Any help is greatly appreciated
Thanks
Well, first things first: your data should be balanced. Also, in the usual machine learning setup, test data is treated as something you know nothing about.
Any tuning of your model using some held-out data should be done with a validation dataset instead.
Look up "validation dataset", why you need one, and "balancing a dataset". These terms will help you proceed further.
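As a rough sketch (assuming scikit-learn and NumPy, with random dummy data standing in for the real features and labels), carving out a stratified validation set looks like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the real features/labels (75% class 1, 25% class 0).
X = np.random.randn(1000, 5)
y = np.random.choice([0, 1], size=1000, p=[0.25, 0.75])

# stratify=y keeps the 75/25 class mix identical in the train and validation parts,
# so whatever you measure on the validation set reflects the real class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```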
There are two broad approaches to addressing imbalanced data: the algorithm-level and the data-level approach.
Algorithm approach: As mentioned above, ML algorithms penalize False Positives and False Negatives equally. A way to counter that is to modify the algorithm itself to boost predictive performance on minority class. This can be executed through either recognition-based learning or cost-sensitive learning. Feel free to check Drummond & Holte (2003); Elkan (2001); and Manevitz & Yousef (2001) in case you want to learn more about the topic.
Data approach: This consists of re-sampling the data in order to mitigate the effect caused by class imbalance. The data approach has gained popular acceptance among practitioners as it is more flexible and allows for the use of latest algorithms. The two most common techniques are over-sampling and under-sampling.
Over-sampling increases the number of minority class members in the training set. The advantage of over-sampling is that no information from the original training set is lost, as all observations from the minority and majority classes are kept. On the other hand, it is prone to overfitting.
Under-sampling, in contrast to over-sampling, aims to reduce the number of majority samples to balance the class distribution. Since it removes observations from the original data set, it might discard useful information.
For more detail, see: https://medium.com/james-blogs/handling-imbalanced-data-in-classification-problems-7de598c1059f
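As a hedged sketch of the two re-sampling techniques (using pandas and sklearn.utils.resample on a toy frame; the column names are illustrative only):

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame standing in for real data: 950 majority rows, 50 minority rows.
df = pd.DataFrame({"x": range(1000), "label": [0] * 950 + [1] * 50})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Over-sampling: draw the minority class with replacement up to the majority size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
oversampled = pd.concat([majority, minority_up])

# Under-sampling: keep only a random subset of the majority class.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=0)
undersampled = pd.concat([majority_down, minority])
```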

What should be the proportion of positive and negative examples to make a training set result in an unskewed classifier?

My training data set contains 46071 examples from one class and 33606 examples from another class. Does this result in a skewed classifier?
I am using SVM but don't want to use SVM's options to deal with skewed data.
A dataset is skewed if the classification categories are not approximately equally represented (I don't think there is a precise threshold).
Yours isn't a highly unbalanced dataset. Still, it could introduce a bias toward the majority (potentially uninteresting) class, especially if you use accuracy for evaluating classifiers.
Skewed training sets can be managed in various ways. Two frequently used approaches are:
At the data level, a form of re-sampling such as:
random oversampling with replacement,
random undersampling,
directed oversampling (no new examples are created, the choice of samples to replace is informed rather than random),
directed undersampling,
oversampling with informed generation of new samples,
combinations of the above techniques.
At the algorithmic level, adjusting the costs of the various classes so as to counter the class imbalance.
Even if you don't like this approach, with SVM you can change the class weighting scheme (see, e.g., "How should I teach machine learning algorithm using data with big disproportion of classes? (SVM)"). You might prefer this to sub-sampling, as it means there is no variability in the results due to the particular sub-sample used.
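For example, a sketch of that class weighting scheme with scikit-learn's SVC (synthetic data; the 1:10 weight is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# class_weight multiplies the penalty C per class; class_weight="balanced" would
# instead infer the weights from the class frequencies automatically.
clf = SVC(class_weight={0: 1, 1: 10}).fit(X, y)
```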
It's worth noting that (from Issue on Learning from Imbalanced Data Sets):
in certain domains (e.g. fraud detection) the class imbalance is intrinsic to the problem: there are typically very few cases of fraud as compared to the large number of honest use of the facilities. However, class imbalances sometimes occur in domains that do not have an intrinsic imbalance.
This will happen when the data collection process is limited (e.g. due to economic or privacy reasons), thus creating artificial imbalances. Conversely, in certain cases, the data abounds and it is for the scientist to decide which examples to select and in what quantity.
In addition, there can also be an imbalance in costs of making different errors, which could vary per case.
So it all depends on your data, really!
Further details:
Extreme re-balancing for SVMs: a case study - Bhavani Raskutti, Adam Kowalczyk
Learning from Imbalanced Data - Haibo He, Edwardo A. Garcia - IEEE Transactions on Knowledge and Data Engineering

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The predictor variables are all the phrases used in these papers (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with a very high accuracy (more than 90%), while removing this phrase results in the accuracy dropping to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly, you ask whether some feature should be removed because it is a good predictor (it makes your classifier work better). The answer is short and simple: do not remove it. In fact, the whole point is to find exactly such features.
The only reason to remove such a feature would be that this phenomenon only occurs in the training set and not in real data. But in that case you have bad data, data which does not represent the underlying distribution, and you should gather better data or "clean" the current data so it has characteristics analogous to the "real" ones.
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.
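As an illustration of the second suggestion (ranking n-gram features by their weight in the fitted logistic regression), here is a small sketch with scikit-learn; the documents, labels, and features are toy stand-ins:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["paper accepted on march first", "we propose a novel method",
        "the reviewers rejected the claim"]
labels = [1, 1, 0]  # 1 = accepted, 0 = rejected (toy labels)

vec = CountVectorizer(ngram_range=(1, 3))
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# Sort features by coefficient: large positive weights point to class 1,
# large negative weights point to class 0.
order = np.argsort(clf.coef_[0])
features = np.array(vec.get_feature_names_out())
print("most 'rejected'-like:", features[order[:5]])
print("most 'accepted'-like:", features[order[-5:]])
```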

Document Classification using Naive Bayes classifier

I am making a document classifier in Mahout using the simple naive Bayes algorithm. Currently, 98% of the data (documents) I have is of Class A and only 2% is of Class B. My question is, since there is such a wide gap in the percentage of Class A docs vs Class B docs, would the classifier still be able to train accurately?
What I'm thinking of doing is ignoring a whole bunch of Class A documents and "manipulating" the dataset I have so that there isn't such a wide gap in the composition of the documents. Thus, the dataset I'll end up with will consist of 30% Class B and 70% Class A. But are there any repercussions of doing that I am not aware of?
A lot of this gets into how good "accuracy" is as a measure of performance, and that depends on your problem. If misclassifying "A" as "B" is just as bad/ok as misclassifying "B" as "A", then there is little reason to do anything other than just mark everything as "A", since you know it will reliably get you a 98% accuracy (so long as that unbalanced distribution is representative of the true distribution).
Without knowing your problem (and whether accuracy is the measure you should use), the best answer I could give is "it depends on the data set". It is possible that you could get past 99% accuracy with standard naive Bayes, though it may be unlikely. For naive Bayes in particular, one thing you could do is to disable the use of priors (the prior is essentially the proportion of each class). This has the effect of pretending that every class is equally likely to occur, though the model parameters will still have been learned from uneven amounts of data.
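The question is about Mahout, but as an illustration of what disabling the priors looks like, here is the corresponding knob in scikit-learn's MultinomialNB (toy counts and labels; the 98/2 split mirrors the question):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.random.randint(0, 5, size=(100, 20))   # toy term-count matrix
y = np.array([0] * 98 + [1] * 2)              # 98% Class A, 2% Class B

# fit_prior=False uses a uniform class prior, so the 98/2 split no longer pushes
# every prediction towards Class A; the per-class word likelihoods are still
# learned from the (uneven) data.
clf = MultinomialNB(fit_prior=False).fit(X, y)
```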
Your proposed solution is a common practice, and it sometimes works well. Another practice is to create fake data for the smaller class (how to do so would depend on your data; for text documents I'm not aware of any particularly good way). Another practice is to increase the weights of the data points in the under-represented classes.
You can search for "imbalanced classification" and find a lot more information about these types of problems (they are one of the harder ones).
If accuracy is not actually a good measure for your problem, you can search for more information about "cost sensitive classification" which should be helpful.
You should not necessarily sample dataset A to reduce its instances. Several methods are available for efficient learning from imbalanced datasets, such as majority under-sampling (exactly what you propose), minority over-sampling, SMOTE, etc. Here is an empirical comparison of these methods: http://machinelearning.org/proceedings/icml2007/papers/62.pdf
Alternatively, you may define a custom cost matrix for the classifier. In other words, assuming B = the positive class, you may define cost(False Positive) < cost(False Negative). In this case, the classifier's output will be biased towards the positive class. Here is a very helpful tutorial: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.4418&rep=rep1&type=pdf
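As a rough sketch, SMOTE is available in the imbalanced-learn package (a separate library, not part of Mahout or scikit-learn itself); the data below is synthetic:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)

# SMOTE synthesises new minority examples by interpolating between neighbours
# in feature space, instead of simply duplicating the existing ones.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
```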

Does prior distribution matter in classification?

Currently I have a classification problem with two classes. What I want to do is, given a bunch of candidates, find out which ones are more likely to be class 1. The problem is that class 1 is very rare (around 1%), which I guess makes my predictions quite inaccurate.
For the training dataset, can I sample half class 1 and half class 0? This will change the prior distribution, but I don't know whether the prior distribution affects the classification results.
Indeed, a very imbalanced dataset can cause problems in classification, because simply defaulting to the majority class 0 already gives a very low error rate.
There are some workarounds that may or may not work for your particular problem, such as giving equal weight to the two classes (thus weighting instances from the rare class more strongly), oversampling the rare class (i.e. learning each instance multiple times), producing slight variations of the rare objects to restore balance (e.g. SMOTE), and so on.
You really should grab a classification or machine learning book and check the index for "imbalanced classification" or "unbalanced classification". If the book is any good, it will discuss this problem. (I just assume you did not know the term that they use.)
If you're forced to pick exactly one candidate from a group, then the prior distribution over classes won't matter, because it is constant for all members of that group. If you must look at each candidate in turn and make an independent decision as to whether it belongs to class 1 or class 0, the prior will potentially change the decision, depending on which classification method you choose. I would suggest you get hold of as many examples of the rare class as possible, but beware that blindly feeding a 50-50 split to a classifier as training data may make it implicitly fit a model that assumes this is the distribution at test time.
Sampling your two classes evenly doesn't change the assumed priors unless your classification algorithm computes (and uses) priors based on the training data. You stated that your problem is, given a bunch of candidates, to find out which is more likely to be class 1. I read this to mean that you want to determine which observation is most likely to belong to class 1. To do this, you want to pick the observation $x_i$ that maximizes $p(c_1|x_i)$. Using Bayes' theorem, this becomes:
$$
p(c_1|x_i)=\frac{p(x_i|c_1)p(c_1)}{p(x_i)}
$$
You can ignore $p(c_1)$ in the equation above since it is a constant. However, computing the denominator will still involve using prior probabilities. Since your problem is really more of a target detection problem than a classification problem, an alternate approach for detecting low probability targets is to take the likelihood ratio of the two classes:
$$
\Lambda=\frac{p(x_i|c_1)}{p(x_i|c_0)}
$$
To pick which of your candidates is most likely to belong to class 1, pick the one with the highest value of $\Lambda$. If your two classes are described by multivariate Gaussian distributions, you can replace $\Lambda$ with its natural logarithm, resulting in a simpler quadratic detector. If you further assume that the target and background have the same covariance matrices, this results in a linear discriminant (http://en.wikipedia.org/wiki/Linear_discriminant_analysis).
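A small numerical sketch of this likelihood-ratio detector, assuming both classes really are multivariate Gaussian and estimating their parameters from synthetic training data:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, size=(5000, 2))   # abundant class c0
X1 = rng.normal(loc=1.5, size=(50, 2))     # rare class c1

pdf0 = multivariate_normal(X0.mean(axis=0), np.cov(X0, rowvar=False))
pdf1 = multivariate_normal(X1.mean(axis=0), np.cov(X1, rowvar=False))

# Lambda = p(x | c1) / p(x | c0); the candidate with the largest ratio is the one
# most likely to belong to class 1, and the class priors never enter the ratio.
candidates = rng.normal(loc=0.5, size=(10, 2))
lam = pdf1.pdf(candidates) / pdf0.pdf(candidates)
print("best candidate index:", int(np.argmax(lam)))
```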
You may want to consider Bayesian utility theory to re-weight the costs of different kinds of error to get away from the problem of the priors dominating the decision.
Let A be the 99% prior probability class, B be the 1% class.
If we just say that all errors incur the same cost (negative utility), then it's possible that the optimal decision approach is to always declare "A". Many classification algorithms (implicitly) assume this.
If instead we declare that the cost of declaring "A" when, in fact, the instance was "B" is much bigger than the cost of the opposite error, then the decision logic becomes, in a sense, more sensitive to slighter differences in the features.
This kind of situation frequently comes up in fault detection: faults in the monitored system will be rare, but you want to be sure that if we see any data that points to an error condition, action is taken (even if it is just reviewing the data).
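As a toy sketch of that decision rule (the cost figures and the 3% posterior are illustrative assumptions only):

```python
# Pick the class with the lowest expected cost rather than the highest posterior.
p_B = 0.03                         # posterior probability of the rare class B
cost_miss_B = 100.0                # declaring "A" when the instance is really "B"
cost_false_alarm = 1.0             # declaring "B" when the instance is really "A"

expected_cost_declare_A = p_B * cost_miss_B              # 3.0
expected_cost_declare_B = (1 - p_B) * cost_false_alarm   # 0.97

decision = "B" if expected_cost_declare_B < expected_cost_declare_A else "A"
print(decision)  # "B": a 3% posterior is enough when misses are 100x costlier
```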
