How to get only the predictions with a probability greater than x - random-forest

I used a random forest to classify texts into certain categories. On my test data I got an accuracy of 0.98, but on another data set the overall accuracy drops to 0.7. I think most of the rows are still predicted with high confidence, though.
So now I want to show only the predicted categories that have a high confidence.
The random forest gives me a column "probability", which contains a vector of class probabilities. How do I get the actual probability of the chosen prediction?
val randomForrest = new RandomForestClassifier()
  .setLabelCol(labelIndexer.getOutputCol)
  .setFeaturesCol(vectorAssembler.getOutputCol)
  .setProbabilityCol("probability")
  .setSeed(123)
  .setPredictionCol("prediction")

I eventually came up with the following UDF to get the best prediction together with its probability.
If there is a more convenient way, please comment.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Returns the index of the most probable class together with that probability.
// (rawPrediction is accepted to match the model's output columns but is not used here.)
def getBestPrediction = udf((rawPrediction: Vector, probability: Vector) => {
  val bestPrediction = probability.argmax
  val bestProbability = probability(bestPrediction)
  (bestPrediction, bestProbability)
})
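For what it's worth, in PySpark (Spark 3.0+) the same idea can be sketched without a UDF by expanding the probability vector and filtering on its maximum. This is only a sketch; the DataFrame name predictions and the 0.8 cutoff are illustrative assumptions, not part of the original code:
from pyspark.ml.functions import vector_to_array
import pyspark.sql.functions as F

# Expand the probability vector, take its maximum as the confidence of the
# chosen class, and keep only the rows above an illustrative threshold.
with_conf = (predictions
    .withColumn("probs", vector_to_array("probability"))
    .withColumn("confidence", F.array_max("probs")))
confident = with_conf.filter(F.col("confidence") > 0.8)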

Related

Display inverted ROC Curve

My anomaly detection algorithm gave me an array of scores, where all values greater than 0 should belong to the positive class (label 0) and all the others should be classified as anomalies (label 1). I built my classifier as follows (I have three datasets: one containing only non-anomalous values and two containing only anomalous values):
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, RocCurveDisplay

normal = np.load('normal_score.pkl')
anom_1 = np.load('anom1_score.pkl')
anom_2 = np.load('anom2_score.pkl')
y_normal = np.asarray([0]*len(normal))  # I know they are normal
y_anom_1 = np.asarray([1]*len(anom_1))  # I know they are anomalies
y_anom_2 = np.asarray([1]*len(anom_2))  # I know they are anomalies
score = np.concatenate([normal, anom_1, anom_2])
y = np.concatenate([y_normal, y_anom_1, y_anom_2])
auc = roc_auc_score(y, score)
fpr, tpr, thresholds = roc_curve(y, score)
display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc)
The AUC score I get is 0.02 and the resulting ROC curve looks inverted.
From what I understand this result is actually good, because I could just reverse the labels to get almost 0.98. My question is: is there a way to specify this and have it reversed automatically through a function?
The scores in my normal data all lie in the range (21, 57) and the anomaly scores lie in the range (-1836, -1090), so it should be easy to tell them apart.
"I should just reverse the labels to make it almost 0.98"
That's not how it should be done: if you can predict "normal" with, say, 95% confidence, you cannot infer from this that you can also predict "anomaly" with the same confidence.
This becomes crucial with heavily imbalanced data, which is probably the case here.
You should decide which of the two classes you want to predict with high confidence and what your target prediction metrics are. For example, if you have precision and recall targets for predicting "anomaly", then that should be your class "1" and you should calculate the metrics accordingly, and vice versa.
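As a minimal sketch of making "anomaly" the positive class, reusing the y and score arrays from the question: since anomalous points receive lower scores than normal ones, negating the score makes higher values mean "more anomalous", and the metrics can then be computed directly:
from sklearn.metrics import roc_auc_score, roc_curve, RocCurveDisplay

# Negate the score so that larger values indicate anomalies (class 1).
auc = roc_auc_score(y, -score)            # equals 1 minus the AUC of the original score
fpr, tpr, thresholds = roc_curve(y, -score)
RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc).plot()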

Low confidence score in SVC for example from training set

Here is my code for SVC classifier.
from sklearn import svm
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(training_data)
classifier_linear = svm.LinearSVC()
clf = CalibratedClassifierCV(classifier_linear)
linear_svc_model = clf.fit(train_vectors, train_labels)
Here, training_data is a list of English sentences and train_labels are the associated labels. I do the usual stop-word removal and some preprocessing before creating the final version of training_data. Here is my testing code:
import operator
import numpy as np

test_lables = ["no"]
test_vectors = vectorizer.transform(test_lables)
prediction_linear = clf.predict_proba(test_vectors)
counter = 0
class_probability = {}
lables = []
# Collect the distinct labels in the order they appear in the training data.
for item in train_labels:
    if item in lables:
        continue
    else:
        lables.append(item)
# Pair each predicted probability with a label.
for val in np.nditer(prediction_linear):
    new_val = val.item(0)
    class_probability[lables[counter]] = new_val
    counter = counter + 1
sorted_class_probability = sorted(class_probability.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_class_probability)
Now when I run the code with a phrase that is already in the training set (the word 'no' in this case), it identifies it correctly, but the confidence score is below 0.9. The output is as follows:
[('no', 0.8474342514152964), ('hi', 0.06830103628879058), ('thanks', 0.03070201906552546), ('confused', 0.02647134535600733), ('ok', 0.015857384248465656), ('yes', 0.005961945963546264), ('bye', 0.005272017662368208)]
From what I have seen online, the confidence score for data already in the training set is usually close to 1, and the rest are negligible. What can I do to get a better confidence score? Should I be worried that if I add more classes the confidence score will dip further, making it difficult to single out one standout class?
As long as your scores help you classify your inputs correctly, you shouldn't worry at all. If anything, if your confidence on the input already in your training data is too high, that probably means your method has overfit to the data, and cannot generalize to the unseen data.
However, you can tune the complexity of your method by changing the penalization parameters. In the case of a LinearSVC, you have both the penalty and the C parameter. Try different values of those two and observe the effect. Make sure you also observe the effect on an unseen test set.
Just a note that the values of C should be spaced exponentially, e.g. [0.001, 0.01, 0.1, 1, 10, 100, 1000], for you to see meaningful effects.
The SGDClassifier may be relevant to your case if you're interested in such linear models and tuning your parameters.
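A minimal sketch of such a search, reusing train_vectors and train_labels from the question: tune C on an exponential grid first, then calibrate the best linear SVM so that predict_proba is available (the grid and cv value are only illustrative):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Cross-validated search over an exponential grid of C values.
grid = GridSearchCV(LinearSVC(), {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}, cv=5)
grid.fit(train_vectors, train_labels)
print(grid.best_params_, grid.best_score_)

# Refit and calibrate the best model to obtain probability estimates.
clf = CalibratedClassifierCV(LinearSVC(C=grid.best_params_["C"]))
clf.fit(train_vectors, train_labels)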

How to adjust Logistic Regression classification threshold value in Scikit-learn? [duplicate]

I am using the LogisticRegression() method in scikit-learn on a highly unbalanced data set. I have even set the class_weight parameter to auto.
I know that in logistic regression it should be possible to know the threshold value for a particular pair of classes.
Is it possible to know what the threshold value is in each of the one-vs-all classifiers that the LogisticRegression() method builds?
I did not find anything in the documentation page.
Does it by default apply 0.5 as the threshold for all classes regardless of the parameter values?
There is a little trick that I use: instead of model.predict(test_data), use model.predict_proba(test_data). Then sweep a range of threshold values to analyze the effect on the predictions:
import pandas as pd
from sklearn import metrics
from sklearn.metrics import confusion_matrix

pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]
for i in threshold_list:
    print('\n******** For i = {} ******'.format(i))
    # Label as 1 wherever the predicted probability exceeds the threshold.
    Y_test_pred = pred_proba_df.applymap(lambda x: 1 if x > i else 0)
    test_accuracy = metrics.accuracy_score(Y_test.to_numpy().reshape(-1, 1),
                                           Y_test_pred.iloc[:, 1].to_numpy().reshape(-1, 1))
    print('Our testing accuracy is {}'.format(test_accuracy))
    print(confusion_matrix(Y_test.to_numpy().reshape(-1, 1),
                           Y_test_pred.iloc[:, 1].to_numpy().reshape(-1, 1)))
Logistic regression chooses the class that has the highest probability. In the case of 2 classes, the threshold is 0.5: if P(Y=0) > 0.5 then obviously P(Y=0) > P(Y=1). The same holds in the multiclass setting: again, it chooses the class with the highest probability (see e.g. Ng's lectures, the bottom lines).
Introducing special thresholds only affects the proportion of false positives/false negatives (and thus the precision/recall trade-off), but it is not a parameter of the LR model. See also the similar question.
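As a quick sanity check of the argmax behaviour described above (a sketch assuming a fitted scikit-learn model and a feature matrix X):
import numpy as np

# predict() should agree with taking the most probable class from predict_proba().
proba = model.predict_proba(X)
assert np.array_equal(model.predict(X), model.classes_[proba.argmax(axis=1)])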
Yes, scikit-learn uses a threshold of P >= 0.5 for binary classification. I am going to build on some of the answers already posted with two options to check this:
One simple option is to extract the probabilities of each classification using the output of model.predict_proba(test_x) in the code below, along with the class predictions (the output of model.predict(test_x)). Then append the class predictions and their probabilities to your test dataframe as a check.
As another option, one can graphically view precision vs. recall at various thresholds using the following code.
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import precision_recall_curve

# Predict test_y values and probabilities based on the fitted logistic regression model.
pred_y = log.predict(test_x)
probs_y = log.predict_proba(test_x)
# probs_y is a 2-D array: probability of label 0 (first column) vs label 1 (second column).

# Retrieve the probability of being 1 (second column of probs_y).
precision, recall, thresholds = precision_recall_curve(test_y, probs_y[:, 1])
pr_auc = metrics.auc(recall, precision)

plt.title("Precision-Recall vs Threshold Chart")
plt.plot(thresholds, precision[:-1], "b--", label="Precision")
plt.plot(thresholds, recall[:-1], "r--", label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.ylim([0, 1])
We can also use a wrapper as follows:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)

def custom_predict(X, threshold):
    # Predict class 1 whenever its probability exceeds the chosen threshold.
    probs = model.predict_proba(X)
    return (probs[:, 1] > threshold).astype(int)

new_preds = custom_predict(X=X, threshold=0.4)

Too small RMSE. Recommender systems

Sorry, I'm a newbie at recommender systems, but I wrote a few lines of code using the Apache Mahout library. My dataset is pretty small: 500x100 with 8102 cells known.
My dataset is actually a subset of the Yelp dataset from the "Yelp business rating prediction" competition: I took the 100 most-reviewed restaurants and then the 500 most active customers.
I created an SVDRecommender and then evaluated the RMSE. The result is about 0.4... Why is it so small? Maybe I just don't understand something and my dataset is not that sparse, but then I tried a larger and sparser dataset and the RMSE became even smaller (about 0.18)! Could anyone explain this behaviour?
DataModel model = new FileDataModel(new File("datamf.csv"));
final RatingSGDFactorizer factorizer = new RatingSGDFactorizer(model, 20, 200);
final Factorization f = factorizer.factorize();

RecommenderBuilder builder = new RecommenderBuilder() {
    public Recommender buildRecommender(DataModel model) throws TasteException {
        // build here whatever existing or customized recommendation algorithm
        return new SVDRecommender(model, factorizer);
    }
};

RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
double score = evaluator.evaluate(builder, null, model, 0.6, 1);
System.out.println(score);
RMSE is calculated by looking at predicted ratings versus their hidden ground-truth. So a sparse dataset may only have very few hidden ratings to predict, or your algorithm may not be able to predict for many hidden ratings because there's no correlation to other ratings. This means that even though your RMSE is low ("better"), your coverage will be low because you aren't predicting very many items.
There's another issue: RMSE is completely dataset dependent. On the MovieLens ratings dataset which has star ratings 0.5 to 5.0 stars, an RMSE of roughly 0.9 is common. But on another dataset with 0.0 to 1.0 points, I've observed an RMSE of around 0.2. Look at the properties of your dataset and see if 0.4 makes sense.
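To make the scale dependence concrete, here is a small synthetic sketch (purely illustrative numbers, not from the Yelp data): the same prediction errors give an RMSE of roughly 0.9 on a 1-5 star scale but only about 0.22 once the ratings are rescaled to [0, 1].
import numpy as np

rng = np.random.default_rng(0)
true_stars = rng.uniform(1, 5, size=1000)
pred_stars = true_stars + rng.normal(0, 0.9, size=1000)    # ~0.9-star prediction error
rmse_stars = np.sqrt(np.mean((true_stars - pred_stars) ** 2))

# Rescale both to [0, 1]: identical relative errors, much smaller RMSE.
true_unit = (true_stars - 1) / 4
pred_unit = (pred_stars - 1) / 4
rmse_unit = np.sqrt(np.mean((true_unit - pred_unit) ** 2))

print(rmse_stars, rmse_unit)   # roughly 0.9 vs roughly 0.22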

RMSE in Naive Bayes Classifier

I have a very basic question about calculating RMSE in a Naive Bayes classification scenario. My training data X has some 1000-odd reviews with ratings in [1, 5], which serve as the class labels Y.
So what I am doing is something like this:
model = nb_classifier_train(trainingX,Y)
Yhat = nb_classifier_test(model,testingX)
My testing data has some 400-odd reviews with missing ratings (whose labels/ratings I need to predict). Now, to calculate the RMSE:
RMSE = sqrt(mean((Y - Yhat).^2))
What is Y in this scenario? I understand that RMSE is calculated from the difference between predicted and actual values. What are the actual values here? Or is something missing?
Y in this case is the set of labels for your training data, so the RMSE you're calculating does not make much sense: you are making predictions on the test examples but comparing them against the training labels. In fact, there is no reason the Y and Yhat vectors would even be the same length. Instead, you should replace Y with your test labels; if you don't have test labels, then you simply have no way of calculating your test error.
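As a minimal sketch of the correct computation, assuming NumPy arrays y_test (the held-out true ratings) and y_hat (the predictions for those same reviews):
import numpy as np

def rmse(y_test, y_hat):
    # Root-mean-squared error between true and predicted ratings of the same examples.
    y_test = np.asarray(y_test, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y_test - y_hat) ** 2))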
