Naive Bayes accuracy increasing as the alpha value increases - machine-learning

I'm using Naive Bayes for text classification and I have 100k records, of which 88k are positive-class records and 12k are negative-class records. I converted sentences to unigrams and bigrams using CountVectorizer, took 50 alpha values from the range [0, 10], and drew the plot.
With Laplace (additive) smoothing, if I keep increasing the alpha value, accuracy on the cross-validation dataset also keeps increasing. My question is: is this trend expected or not?
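For reference, a minimal sketch of the setup described above, assuming scikit-learn; the variables sentences and labels are placeholders for the actual data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# sentences, labels are placeholders for the 100k records described above
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(sentences)

alphas = np.linspace(0.01, 10, 50)  # 50 values; exactly 0 disables smoothing
scores = [cross_val_score(MultinomialNB(alpha=a), X, labels, cv=5).mean()
          for a in alphas]

plt.plot(alphas, scores)
plt.xlabel("alpha")
plt.ylabel("cross-validation accuracy")
plt.show()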

If you keep increasing the alpha value, the Naive Bayes model becomes biased towards the class that has more records and turns into a dumb model (underfitting), so choosing a small alpha value is a good idea.

Because you have 88k positive points and 12k negative points, you have an imbalanced dataset.
You can add more negative points to balance the dataset: clone or replicate your negative points, which is called upsampling. After that your dataset is balanced and you can apply Naive Bayes with alpha and it will work properly. Your model will no longer be a dumb model; earlier your model was dumb, which is why increasing alpha increased your accuracy.
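A rough sketch of that upsampling step, assuming scikit-learn; X_pos and X_neg are placeholder names for the majority- and minority-class feature rows.

from scipy.sparse import vstack
from sklearn.utils import resample

# replicate minority-class rows (with replacement) until the classes match
X_neg_up = resample(X_neg, replace=True, n_samples=X_pos.shape[0],
                    random_state=42)
X_balanced = vstack([X_pos, X_neg_up])
y_balanced = [1] * X_pos.shape[0] + [0] * X_neg_up.shape[0]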

Related

LSTM cannot predict well when the ground truth is near zero

While training the LSTM model, I encountered one problem that I couldn't solve. To begin with, let me describe my model: I used a stacked LSTM model in PyTorch with 3 layers and 256 hidden units in each layer to predict human joint torques and joint angles from EMG features. After training, the model predicts well when the ground truth is far away from 0, but when the ground truth is near zero, there is always an offset between the predicted value and the ground truth. I guess the reason is that large ground-truth values have more impact on the loss function during training.
This is the result:
[Figures: predictions vs. ground truth for the validation set and the training set]
As you can see from the figures, in both datasets the model predicts well when the ground truth is above 20 degrees. I have tried different loss functions, but the situation did not improve. Since I am just a beginner in this field, I hope someone can point out the problem in my method and how to solve it. Thank you!
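For context, a minimal sketch of the model as described (stacked LSTM, 3 layers, 256 hidden units per layer); the input and output sizes are placeholders, since the number of EMG features and predicted joints is not given.

import torch.nn as nn

class EMGRegressor(nn.Module):
    def __init__(self, n_features=8, n_outputs=2, hidden=256, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=layers,
                            batch_first=True)
        self.head = nn.Linear(hidden, n_outputs)

    def forward(self, x):        # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out)    # (batch, time, n_outputs)

model = EMGRegressor()
criterion = nn.MSELoss()         # squared error: large targets dominate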

Accuracy below 50% for binary classification

I am training a Naive Bayes classifier on a balanced dataset with an equal number of positive and negative examples. At test time I compute the accuracy in turn for the examples in the positive class, the negative class, and the subsets which make up the negative class. However, for some subsets of the negative class I get accuracy values lower than 50%, i.e. worse than random guessing. I am wondering, should I worry about these results being much lower than 50%? Thank you!
It's impossible to fully answer this question without specific details, so here instead are guidelines:
If you have a dataset with equal amounts of classes, then random guessing would give you 50% accuracy on average.
To be clear, are you certain your model has learned something on your training dataset? Is the training dataset accuracy higher than 50%? If yes, continue reading.
Assuming that your validation set is large enough to rule out statistical fluctuations, then lower than 50% accuracy suggests that something is indeed wrong with your model.
For example, are your classes accidentally switched somehow in the validation dataset? Because notice that if you instead use 1 - model.predict(x), your accuracy would be above 50%.
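A tiny illustration of that label-flip check, using made-up labels: if binary predictions score well below 50%, their complement scores above 50%.

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])   # systematically inverted

print(accuracy_score(y_true, y_pred))        # 0.0
print(accuracy_score(y_true, 1 - y_pred))    # 1.0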

Weighted accuracy for image segmentation using FCN

I have built an FCN for image segmentation. The object to be segmented occupies only very few pixels relative to the image size (1024x1024). As a result the accuracy is very high, even if I train with only 10 images instead of 18,000 (my full training set).
My approach to solving this is to use some kind of weighted accuracy, so that the metric actually says something about the performance of identifying the small object (right now the accuracy is high because most pixels are not the object, so a model that classifies nothing as the object still scores high).
How do I decide the weight? Anybody with some experience?
As you wrote, use a custom weight function that penalizes misclassification of underrepresented pixels more heavily. You can derive the weight from the ratio of object pixels to all pixels in the image, or tune it by hand; just make sure you track a metric that tells you the accuracy on the object pixels. Hope it helps.
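A sketch of one such class-weighted pixel accuracy, assuming binary NumPy masks, with inverse class frequencies as the weights (the function name and shapes are illustrative).

import numpy as np

def weighted_pixel_accuracy(y_true, y_pred):
    """y_true, y_pred: binary masks of shape (H, W)."""
    object_frac = y_true.mean()                   # fraction of object pixels
    w = np.where(y_true == 1,
                 1.0 / object_frac,               # rare object pixels weigh more
                 1.0 / (1.0 - object_frac))
    return np.sum(w * (y_true == y_pred)) / np.sum(w)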
You can use the infogain loss layer for a "weighted" loss.
The infogain loss is a generalization of the commonly used cross-entropy loss. It is defined using a weight matrix H (of size L-by-L, where L is the number of classes): for a sample with true label y and predicted class-probability vector p,
L(p) = -sum_l H[y, l] * log(p[l])
Setting H to the identity matrix recovers the usual cross-entropy loss.
You can find more details on this loss here.
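A NumPy sketch of that loss; the 2-class weight matrix H below is a hypothetical choice that penalizes object-pixel mistakes more than background mistakes.

import numpy as np

def infogain_loss(p, y, H):
    """p: (N, L) class probabilities; y: (N,) integer labels; H: (L, L)."""
    return -np.mean(np.sum(H[y] * np.log(p + 1e-12), axis=1))

H = np.array([[1.0, 0.0],    # background row: standard weight
              [0.0, 5.0]])   # object row: errors cost 5x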

receiver operating characteristic (ROC) on a test set

The following image definitely makes sense to me.
Say you have a few trained binary classifiers A, B (B not much better than random guessing etc. ...) and a test set composed of n test samples to go with all those classifiers. Since Precision and Recall are computed for all n samples, those dots corresponding to classifiers make sense.
Now, sometimes people talk about ROC curves, and I understand that precision is expressed as a function of recall, i.e., Precision(Recall) is simply plotted.
I don't understand where this variability comes from, since you have a fixed number of test samples. Do you just pick some subsets of the test set and compute precision and recall on them, so as to plot many discrete values (or an interpolated line)?
The ROC curve is well-defined for a binary classifier that expresses its output as a "score." The score can be, for example, the probability of being in the positive class, or it could also be the probability difference (or even the log-odds ratio) between probability distributions over each of the two possible outcomes.
The curve is obtained by setting the decision threshold for this score at different levels and measuring the true-positive and false-positive rates, given that threshold.
There's a good example of this process in Wikipedia's "Receiver Operating Characteristic" page:
For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (black vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
If code speaks more clearly to you, here's the code in scikit-learn that computes an ROC curve given a set of predictions for each item in a dataset. The fundamental operation seems to be (direct link):
import numpy as np

def roc_counts(y_true, y_score):
    # sort samples by classifier score, highest first
    desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
    y_score = y_score[desc_score_indices]
    y_true = y_true[desc_score_indices]
    # accumulate the true positives with decreasing threshold
    tps = y_true.cumsum()
    # at index i, i + 1 samples are predicted positive; those that are
    # not true positives are false positives
    fps = 1 + np.arange(len(y_true)) - tps
    return fps, tps, y_score
(I've omitted a bunch of code in there that deals with the (common) cases of weighted samples and of the classifier giving near-identical scores to multiple samples.) Basically, the true labels are sorted in descending order by the score the classifier assigned to them, and then their cumulative sum is computed, giving the true-positive count (and, after normalization, the true-positive rate) as a function of the score assigned by the classifier.
And here's an example showing how this gets used: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
The ROC curve simply shows how much sensitivity (TPR) you gain if you allow the FPR to increase by some amount: it is the tradeoff between TPR and FPR. The variability comes from varying some parameter of the classifier (for the logistic regression case below, it is the threshold value).
For example, logistic regression gives you the probability that an object belongs to the positive class (a value in [0, 1]), but it is just a probability, not a class label. So in the general case you have to specify a probability threshold above which you classify an object as positive. You can train a logistic regression, obtain from it the positive-class probability for each object in your set, and then vary this threshold parameter with some step from 0 to 1. Thresholding your probabilities at each value gives class labels for every object, from which you compute TPR and FPR. Thus you get a (TPR, FPR) pair for every threshold; mark them on a plot and, once you have computed pairs for all thresholds, draw a line through them.
Also, for linear binary classifiers, you can think of this varying process as choosing the distance between the decision line and the positive (or negative, if you prefer) class cluster. If you move the decision line away from the positive class, you will classify more objects as positive (because you enlarged the positive-class region), and at the same time you will increase the FPR by some amount (because the negative-class region shrank).
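A sketch of that threshold sweep, assuming scikit-learn; X and y are placeholders for the dataset.

import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]      # positive-class probabilities

for t in np.linspace(0, 1, 11):         # vary the threshold from 0 to 1
    pred = (proba >= t).astype(int)
    tpr = np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)
    fpr = np.sum((pred == 1) & (y == 0)) / np.sum(y == 0)
    print(f"threshold={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")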

How can I make Weka classify the smaller class, with a 2:1 class imbalance?

How can I make Weka classify the smaller class? I have a dataset where the positive class is 35% of the data and the negative class is 65%. I want Weka to predict the positive class, but in some cases the resultant model predicts all instances to be the negative class. Regardless, it keeps classifying the negative (larger) class. How can I force it to classify the positive (smaller) class?
One simple solution is to adjust your training set to be more balanced (50% positive, 50% negative) to encourage classification for both cases. I would guess that more of your cases are negative in the problem space, and therefore you would need to find some way to ensure that the negative cases still represent the problem well.
Since the ratio of positive to negative is 1:2, you could also try duplicating the positive cases in the training set to make it 2:2 and see how that goes.
Use stratified sampling (e.g. train on a 50%/50% sample) or class weights/class priors. It would help greatly if you told us which specific classifier you are using; Weka seems to have at least 50.
Is the penalty for Type I errors equal to the penalty for Type II errors?
This is a special case of the receiver operating characteristic (ROC) curve.
If the penalties are not equal, experiment with the cutoff value and the AUC.
You probably also want to read the sister site CrossValidated for statistics.
Use CostSensitiveClassifier, which is available under the "meta" classifiers.
You will need to set "classifier" to your J48 and (!) change the cost matrix
to something like [(0,1), (2,0)]. This tells J48 that misclassification of a positive instance is twice as costly as misclassification of a negative instance. Of course, adjust the cost matrix according to your business values.
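Not Weka, but the same idea expressed in scikit-learn, in case that helps: a decision tree (roughly comparable to J48) where errors on the positive class count twice as heavily; X_train and y_train are placeholders.

from sklearn.tree import DecisionTreeClassifier

# class_weight plays the role of the cost matrix: mistakes on class 1
# (the positive, smaller class) count double
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 2})
clf.fit(X_train, y_train)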
