Accuracy below 50% for binary classification - machine-learning

I am training a Naive Bayes classifier on a balanced dataset with an equal number of positive and negative examples. At test time I compute the accuracy separately for the examples in the positive class, the negative class, and the subsets that make up the negative class. However, for some subsets of the negative class I get accuracy values lower than 50%, i.e. worse than random guessing. Should I worry about these results being much lower than 50%? Thank you!

It's impossible to fully answer this question without specific details, so here instead are guidelines:
If you have a dataset with an equal number of examples in each class, then random guessing would give you 50% accuracy on average.
To be clear, are you certain your model has learned something on your training dataset? Is the training dataset accuracy higher than 50%? If yes, continue reading.
Assuming that your validation set is large enough to rule out statistical fluctuations, then lower than 50% accuracy suggests that something is indeed wrong with your model.
For example, are your classes accidentally switched somehow in the validation dataset? Note that if you instead use 1 - model.predict(x), your accuracy would be above 50%.
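The last point can be checked with a few lines of Python. This is only a sketch: the labels and predictions are made up for illustration, and accuracy is a small helper defined here, not part of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical binary labels and predictions (invented for this example).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

acc = accuracy(y_true, y_pred)

# Flipping every binary prediction is the same as using 1 - model.predict(x).
flipped = [1 - p for p in y_pred]
acc_flipped = accuracy(y_true, flipped)

print(acc, acc_flipped)
```

For binary labels the two accuracies always sum to 1, so a model that is consistently well below 50% carries real signal with the wrong sign, which is why switched labels are worth ruling out.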

Related

Naive Bayes accuracy increases as the alpha value increases

I'm using naive Bayes for text classification. I have 100k records, of which 88k are positive class records and 12k are negative class records. I converted the sentences to unigrams and bigrams using CountVectorizer, took 50 alpha values in the range [0, 10], and drew the plot.
With Laplace (additive) smoothing, if I keep increasing the alpha value, the accuracy on the cross-validation dataset also keeps increasing. My question is: is this trend expected or not?
If you keep increasing the alpha value, the naive Bayes model becomes biased towards the class with more records and turns into a dumb model (underfitting), so choosing a small alpha value is a good idea.
You have 88k positive points and 12k negative points, which means your dataset is unbalanced.
You can add more negative points to balance the dataset, or you can clone (replicate) your negative points, which is called upsampling. Once the dataset is balanced, you can apply naive Bayes with alpha and it will work properly, because the model is no longer a dumb model. Earlier your model was a dumb model, which is why increasing alpha increased your accuracy.
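The effect of alpha described above can be reproduced with a tiny multinomial naive Bayes written from scratch. This is a sketch under assumptions: the four-word toy corpus, the 8:2 class ratio, and the helpers train_nb and predict are all invented for illustration and are not the asker's setup or scikit-learn's implementation.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha):
    """Fit class priors and additively smoothed word log-likelihoods."""
    vocab = sorted({w for d in docs for w in d})
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + alpha * len(vocab)
        loglik[c] = {w: math.log((counts[c][w] + alpha) / total) for w in vocab}
    return prior, loglik

def predict(model, doc):
    """Return the class with the highest log-posterior score."""
    prior, loglik = model
    scores = {
        c: math.log(prior[c]) + sum(loglik[c].get(w, 0.0) for w in doc)
        for c in prior
    }
    return max(scores, key=scores.get)

# Toy unbalanced corpus: 8 positive documents, 2 negative documents.
docs = [["good", "great"]] * 8 + [["bad", "awful"]] * 2
labels = ["pos"] * 8 + ["neg"] * 2

clearly_negative = ["bad", "awful"]
small_alpha_pred = predict(train_nb(docs, labels, alpha=0.1), clearly_negative)
large_alpha_pred = predict(train_nb(docs, labels, alpha=10.0), clearly_negative)
print(small_alpha_pred, large_alpha_pred)
```

With a large alpha the smoothed word probabilities become nearly identical across classes, so the class prior decides the prediction and even a clearly negative document is labelled with the majority class. On an 88%/12% split, always predicting the majority class already scores about 88% accuracy, which is why accuracy alone can look better as alpha grows.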

How to interpret the results of a training/validation learning curve?

I am using the Random Forest classifier in the Scikit package and have plotted F1 scores versus training set size. The red is the training set F1 scores and the green is the scores for the validation set. This is about what I expected but I would like some advice on interpretation.
I see that there is some significant variance, yet the validation curve appears to be converging. Should I assume that adding data would do little to affect the variance given the convergence or am I jumping to conclusions about the rate of convergence?
Is the amount of variance here significant enough to warrant taking further actions that may increase the bias slightly? I realize this is a fairly domain-specific question but I wonder if there are any general guidelines for how much variance is worth a bit of bias tradeoff?
"I see that there is some significant variance, yet the validation curve appears to be converging. Should I assume that adding data would do little to affect the variance given the convergence or am I jumping to conclusions about the rate of convergence?"
This seems true conditional on your learning procedure, and in particular on your selection of hyperparameters. It does not mean that the same effect would occur with a different set of hyperparameters. It only suggests that, in the current setting, the rate of convergence is relatively slow, so getting to 95% would probably require a significant amount of data.
"Is the amount of variance here significant enough to warrant taking further actions that may increase the bias slightly? I realize this is a fairly domain-specific question but I wonder if there are any general guidelines for how much variance is worth a bit of bias tradeoff?"
Yes, in general curves like these at least do not rule out going for higher bias. You clearly overfit the training set. On the other hand, trees usually do that, so increasing bias might be hard without changing the model. One option I would suggest is Extremely Randomized Trees, which is nearly the same as Random Forest but with randomly chosen thresholds instead of full optimization. They have significantly higher bias and should bring these curves a bit closer to each other.
Obviously there is no guarantee. As you said, this is data specific, but the general characteristic looks promising (although it might require changing the model).
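The difference between an optimized and a randomly chosen threshold can be sketched for a single feature. This is a simplified illustration, not the library implementation: the helper names and the six-point dataset are hypothetical, and a real Extremely Randomized Trees learner also draws several candidate features and keeps the best of the random splits.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(xs, ys, threshold):
    """Weighted Gini impurity after splitting on x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def optimized_threshold(xs, ys):
    """Random Forest style: scan all midpoints and keep the purest split."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_impurity(xs, ys, t))

def random_threshold(xs, rng):
    """Extra-Trees style: draw the cut point uniformly within the feature range."""
    return rng.uniform(min(xs), max(xs))

# Hypothetical one-feature dataset, perfectly separable at 3.5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

best_t = optimized_threshold(xs, ys)
rand_t = random_threshold(xs, random.Random(0))
print(best_t, split_impurity(xs, ys, best_t))
print(rand_t, split_impurity(xs, ys, rand_t))
```

The optimized split always fits the training data at least as well as a random one. Giving up that fit at each node is where the extra bias comes from.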

How can I make Weka classify the smaller class, with a 2:1 class imbalance?

How can I make Weka classify the smaller classification? I have a data set where the positive classification is 35% of the data set and the negative classification is 65% of the data set. I want Weka to predict the positive classification but in some cases, the resultant model predicts all instances to be the negative classification. Regardless, it is classifying the negative (larger) class. How can I force it to classify the positive (smaller) classification?
One simple solution is to adjust your training set to be more balanced (50% positive, 50% negative) to encourage classification for both cases. I would guess that more of your cases are negative in the problem space, and therefore you would need to find some way to ensure that the negative cases still represent the problem well.
Since the ratio of positive to negative is 1:2, you could also try duplicating the positive cases in the training set to make it 2:2 and see how that goes.
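Outside Weka, the duplication idea can be sketched in a few lines of Python. The oversample helper and the 35/65 toy data are assumptions made for illustration. The sketch resamples the smaller class with replacement until the classes are the same size, a slight variation on duplicating every positive case exactly once.

```python
import random
from collections import Counter

def oversample(rows, labels, seed=0):
    """Resample smaller classes with replacement until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Hypothetical training set: 35 positive rows and 65 negative rows.
rows = [("pos", i) for i in range(35)] + [("neg", i) for i in range(65)]
labels = ["positive"] * 35 + ["negative"] * 65

balanced_rows, balanced_labels = oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

Only the training set should be resampled this way. The evaluation set should keep the original class ratio so that the reported metrics reflect the real problem.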
Use stratified sampling (e.g. train on a 50%/50% sample) or class weights/class priors. It would help greatly if you told us which specific classifier you are using; Weka seems to have at least 50.
Is the penalty for Type I errors equal to the penalty for Type II errors?
This is a special case of the receiver operating characteristic (ROC) curve.
If the penalties are not equal, experiment with the cutoff value and the AUC.
You probably also want to read the sister site CrossValidated for statistics.
Use CostSensitiveClassifier, which is available under "meta" classifiers. You will need to change "classifier" to your J48 and (!) change the cost matrix to be like [(0,1), (2,0)]. This will tell J48 that misclassifying a positive instance is twice as costly as misclassifying a negative instance. Of course, you should adjust your cost matrix according to your business values.
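To see what a 2:1 cost ratio does to the decision, here is a Python sketch of a minimum expected cost rule. This is not Weka code and does not claim to mirror how CostSensitiveClassifier is implemented. The function names and the cost values (1 for a false positive, 2 for a missed positive) are assumptions chosen to match the ratio in the answer above.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting positive has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)

def predict(p_positive, cost_fp=1.0, cost_fn=2.0):
    """Pick the class with the smaller expected misclassification cost."""
    expected_cost_if_negative = p_positive * cost_fn        # risk of missing a positive
    expected_cost_if_positive = (1.0 - p_positive) * cost_fp  # risk of a false alarm
    return 1 if expected_cost_if_positive < expected_cost_if_negative else 0

threshold = cost_threshold(cost_fp=1.0, cost_fn=2.0)
print(threshold)      # cutoff moves from 1/2 down to 1/3
print(predict(0.4))   # predicted positive although p < 0.5
print(predict(0.3))   # still predicted negative
```

Under these assumed costs, an instance is labelled positive as soon as its estimated probability exceeds 1/3 rather than 1/2, which is how the smaller class starts being predicted at all.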

Word2Vec: Number of Dimensions

I am using Word2Vec with a dataset of roughly 11,000,000 tokens, looking to do word similarity (as part of synonym extraction for a downstream task), but I don't have a good sense of how many dimensions I should use with Word2Vec. Does anyone have a good heuristic for the range of dimensions to consider based on the number of tokens/sentences?
The typical range is 100-300 dimensions. I would say you need at least 50 dimensions to reach even a minimal level of accuracy. If you pick fewer dimensions, you will start to lose the properties of high-dimensional spaces. If training time is not a big deal for your application, I would stick with 200 dimensions, as it gives nice features. The highest accuracy can be obtained with 300 dimensions. Beyond 300, word features won't improve dramatically, and training will be extremely slow.
I do not know of a theoretical explanation or strict bounds for dimension selection in high-dimensional spaces (and there might not be an application-independent explanation), but I would refer you to Pennington et al., Figure 2a, where the x axis shows the vector dimension and the y axis shows the accuracy obtained. That should provide empirical justification for the argument above.
I think the number of dimensions for word2vec depends on your application. The most common empirical value is about 100, which tends to perform well.
The number of dimensions affects over- and underfitting. 100-300 dimensions is the common rule of thumb. Start with one value and compare the accuracy on your test set with the accuracy on your training set. The larger the dimension, the easier it is to overfit the training set and perform badly on the test set. This parameter needs tuning if you have high accuracy on the training set and low accuracy on the test set: that means the dimension is too large, and reducing it might solve the overfitting problem.

SVM Hard margin: why imbalanced dataset may cause bad results?

I can understand why soft margin SVMs are affected by an imbalanced training set: minimizing the error of the optimization problem can lead to classifying all training data as negative (if |negative examples| >> |positive examples|).
But in a hard margin SVM I have no slack variables and no C constant, so I am not minimizing the error, because a hard margin SVM expects no errors (by the definition of the problem)! A hard margin SVM just searches for the support vectors and maximizes the margin between the class supporting hyperplanes "identified" by the support vectors. Now, if "behind" the negative support vectors (i.e. the negative class supporting hyperplane) I have a lot of points, or the same number of points as in the positive class, these do not affect my margin or separating hyperplane; it is always the same, since it depends only on the support vectors, and they stay the same regardless of whether I increase the number of points! Why are hard margin SVMs affected by imbalanced datasets, or where is my reasoning wrong?
thanks!
For a true hard margin SVM there are two options for any data set, regardless of how it's balanced:
The training data is perfectly separable in feature space, you get a resulting model with 0 training errors.
The training data is not separable in feature space, you will not get anything (no model).
Additionally, take note that you could train a hard margin SVM on any data set given a kernel that is complex enough (RBF with very large gamma, for instance). The resulting model is generally bad, though, as it is a total overfit of the training data.
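The two outcomes listed above can be illustrated in one dimension, where the maximum-margin separator is simply the midpoint between the closest points of opposite classes. This is a toy sketch under that assumption, not a general SVM solver: hard_margin_1d and the data are hypothetical, and labels are taken to be -1 and +1.

```python
def hard_margin_1d(xs, ys):
    """Return (threshold, margin) for separable 1-D data, or None if not separable."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    if max(neg) < min(pos):        # negatives on the left, positives on the right
        lo, hi = max(neg), min(pos)
    elif max(pos) < min(neg):      # positives on the left, negatives on the right
        lo, hi = max(pos), min(neg)
    else:
        return None                # classes overlap: no hard-margin model exists
    return ((lo + hi) / 2, (hi - lo) / 2)

# Balanced, separable data: the support vectors are -1.0 and 1.0.
balanced = hard_margin_1d([-2.0, -1.0, 1.0, 2.0], [-1, -1, 1, 1])

# Add 100 extra negatives "behind" the negative support vector.
extra_neg = [-3.0 - i for i in range(100)]
imbalanced = hard_margin_1d(
    [-2.0, -1.0, 1.0, 2.0] + extra_neg,
    [-1, -1, 1, 1] + [-1] * len(extra_neg),
)

# One negative point placed inside the positive region breaks separability.
overlapping = hard_margin_1d([-2.0, -1.0, 1.0, 2.0, 1.5], [-1, -1, 1, 1, -1])

print(balanced, imbalanced, overlapping)
```

In this toy setting the imbalance changes nothing as long as the added points stay behind the support vectors. The failure mode is the second case, where a single overlapping point leaves no hard-margin solution at all.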
