I am using SPSS to produce a ROC curve, but the ROC curve does not give me confidence intervals for sensitivity and specificity.
So if anyone can help me produce confidence intervals for sensitivity and specificity in SPSS, it would be a huge help.
Thank you
Version 26 has these statistics in its ROC ANALYSIS procedure. (Note: NOT the ROC CURVE procedure you were using; this is a new one.)
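If upgrading to version 26 is not an option, one workaround is to compute the intervals outside SPSS, for example with a percentile bootstrap. Below is a minimal sketch in Python; the arrays y_true and y_pred (0/1 true status and dichotomized test result) and the choice of 2000 resamples are illustrative assumptions, not anything produced by SPSS:

```python
import numpy as np

rng = np.random.default_rng(0)

def sens_spec(y_true, y_pred):
    # sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05):
    # percentile bootstrap: resample (truth, prediction) pairs with replacement
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        stats.append(sens_spec(y_true[idx], y_pred[idx]))
    stats = np.array(stats)
    lo, hi = 100 * alpha / 2, 100 * (1 - alpha / 2)
    # rows: lower/upper bound, columns: sensitivity, specificity
    return np.percentile(stats, [lo, hi], axis=0)
```

Calling bootstrap_ci(y_true, y_pred) on your exported data then gives approximate 95% intervals for sensitivity and specificity.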
I am training a multivariable linear regression model (with, say, 10 predictors) on my dataset. I have to choose the parameters, but this results in overfitting. I have read that tuning the parameters with a genetic algorithm may achieve the minimum possible cost-function error.
I am a complete beginner in this area and cannot see how a genetic algorithm can help choose the parameters using the parents' MSE. Any help is appreciated.
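To make the idea concrete, here is a minimal sketch of a genetic algorithm that chooses which of the 10 predictors to keep, scoring each candidate subset (a "parent") by its cross-validated MSE; fitter parents are kept, recombined, and mutated each generation. All names and GA settings (population size, mutation rate, number of generations) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # fitness = negative cross-validated MSE of the selected feature subset
    if not mask.any():
        return -np.inf
    scores = cross_val_score(LinearRegression(), X[:, mask], y,
                             scoring="neg_mean_squared_error", cv=5)
    return scores.mean()

def ga_select(X, y, pop_size=20, n_gen=30, mut_rate=0.1):
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, (pop_size, n_feat)).astype(bool)  # random initial subsets
    for _ in range(n_gen):
        fit = np.array([fitness(ind, X, y) for ind in pop])
        order = np.argsort(fit)[::-1]
        parents = pop[order[: pop_size // 2]]                   # keep the fitter half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feat)                       # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_feat) < mut_rate                # random mutation
            children.append(child ^ flip)
        pop = np.vstack([parents, children])
    fit = np.array([fitness(ind, X, y) for ind in pop])
    return pop[fit.argmax()]                                    # best feature mask found
```

The key point is that each parent's MSE is only used to rank candidates; the search itself proceeds by crossover and mutation rather than by gradients.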
The dataset is extremely imbalanced: the positive class is only about 10% of the data compared to the negative class, e.g. 0: 11401, 1: 1280.
I have tried
1. RandomForestClassifier with GridSearchCV for hyperparameter tuning
2. Weighted random forest with class_weight="balanced"
3. Penalised SVC
4. Upsampling and downsampling
I still don't get good precision or recall with any of the above methods.
I'm aware that prevalence is related to PPV, and my dataset has very few examples of class 1. I also know that random forests can lean toward the majority class.
But I was hoping sampling would help, and it didn't. Am I missing something? Any suggestion would be really appreciated.
A few methods should help you:
predict probabilities and do manual thresholding (see the sketch after this list).
change the loss/metric you are using.
for an imbalanced dataset (outlier detection) you shouldn't rely on class_weight="balanced"; put more weight on the outliers (the rare class) instead.
try other algorithms to see if some do better (XGBoost, CatBoost, LightGBM if you want to stick with tree-based solutions).
you can also use TPOT to find the best scikit-learn algorithm for your particular dataset.
Tell me if any of those helped you.
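For the first suggestion (manual thresholding), a minimal scikit-learn sketch could look like the following; the model choice and the use of F1 to pick the threshold are illustrative assumptions, and X_train, y_train, X_val, y_val are assumed to already exist:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, f1_score

# X_train, y_train, X_val, y_val are assumed to already exist
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_val)[:, 1]          # probability of the positive (rare) class

# scan candidate thresholds and pick the one with the best F1 on the validation set
precisions, recalls, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-12)
best_t = thresholds[f1.argmax()]

y_pred = (proba >= best_t).astype(int)          # predictions at the tuned threshold
print("chosen threshold:", best_t, "F1:", f1_score(y_val, y_pred))
```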
I've used scikit-learn to build a random forest model to predict insurance renewals. This is tricky because, in my data set, 96.24% renew while only 3.76% do not renew. After I ran the model I evaluated model performance with a confusion matrix, classification report, and ROC curve.
Confusion matrix:
[[  2448   8439]
 [     3 278953]]
Classification report:
             precision  recall  f1-score  support
          0       1.00    0.22      0.37    10887
          1       0.97    1.00      0.99   278956
avg / total       0.97    0.97      0.96   289843
My ROC curve looks like this:
The model predicted renewals with recall just under 100% (rounded to 1.00; see the recall column) and non-renewals with recall of about 22% (again, see the recall column). From the shape of the ROC curve I would expect an area under the curve much greater than the value shown in the bottom-right portion of the plot (area = 0.61).
Does anyone understand why this is happening?
Thank you!
In cases where the classes are highly imbalanced, ROC turns out to be an inappropriate metric. A better metric is average precision, i.e. the area under the precision-recall (PR) curve.
This supporting Kaggle link discusses exactly the same issue in a similar problem setting.
This answer and the linked paper explain that optimizing for the best area under the PR curve will also give the best ROC.
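For reference, both the PR curve and average precision are available directly in scikit-learn; a short sketch, where y_val and proba (true labels and predicted probabilities of the positive class) are placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve

# y_val: true 0/1 labels, proba: predicted probability of the positive class
ap = average_precision_score(y_val, proba)
precisions, recalls, _ = precision_recall_curve(y_val, proba)

plt.plot(recalls, precisions, label=f"AP = {ap:.3f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```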
I'm working on a binary classification problem using Apache Mahout. The algorithm I use is OnlineLogisticRegression, and the model I currently have strongly tends to produce predictions that are either 1 or 0, with hardly any intermediate values.
Please suggest a way to tune or tweak the algorithm so that it produces more intermediate values in its predictions.
Thanks in advance!
What is the test error rate of the classifier? If it's near zero then being confident is a feature, not a bug.
If the test error rate is high (or at least not low), then the classifier might be overfitting the training set: measure the difference between the training error and the test error. In that case, increasing regularization as rrenaud suggested might help.
If your classifier is not overfitting, then there might be an issue with probability calibration. Logistic regression models (i.e. models using the logit link function) should yield reasonably well-calibrated probabilities, provided the problem is approximately linearly separable and the labels are not too noisy. You can check the calibration of the probabilities with a plot, as explained in this paper. If it really is a calibration issue, then a custom calibration step based on Platt scaling or isotonic regression might fix it.
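As a rough illustration of that check (sketched with scikit-learn rather than Mahout, so only an analogue of your setup; y_test and proba are assumed to exist):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# y_test: true 0/1 labels, proba: the model's predicted probabilities (assumed to exist)
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], "--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()

# If the curve sits far from the diagonal, Platt scaling ("sigmoid") or isotonic
# regression fitted on a held-out set can recalibrate the scores, e.g.
# (base_model, X_cal, y_cal are hypothetical names):
# calibrated = CalibratedClassifierCV(base_model, method="isotonic", cv=5).fit(X_cal, y_cal)
```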
From reading the Mahout AbstractOnlineLogisticRegression docs, it looks like you can control the regularization parameter lambda. Increasing lambda should mean your weights are closer to 0, and hence your predictions are more hedged.
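To see what that hedging looks like in practice, here is a small scikit-learn analogue (not Mahout); C is the inverse of the regularization strength, so a smaller C plays the role of a larger lambda, and the data and C values are made up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for C in (10.0, 0.1, 0.001):                    # smaller C == stronger regularization
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    proba = clf.predict_proba(X)[:, 1]
    # distance of the predictions from 0.5: shrinks as regularization grows
    print(f"C={C:>6}: mean |p - 0.5| = {np.abs(proba - 0.5).mean():.3f}")
```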
I'm using the Weka 3.6 GUI to compare the performance of a group of supervised learning algorithms on a dataset. I'm generating separate ROC curves for each learning algorithm. My problem is: is there a way in Weka to plot all the ROC curves for all the algorithms on the same set of axes (which would make comparison easier)? If not, what could I do? Thanks.
This is possible. You need to use the KnowledgeFlow GUI though instead of the Experimenter.
In KnowledgeFlow you can load your dataset and perform different algorithms on it. The result of each algorithm can then be combined into the same Model PerformanceChart resulting in a plot which combines the multiple ROC curves. Detailed steps can be found in section 4.2 in this guide.
As far as my experience tells me: no. You can view the ROC of one classifier at a time, not the ROCs of all classifiers in one place. However, to compare them, you can take the ROC area value from the classifier tab for each algorithm and compare the values (closer to 1 means a better classifier).
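If you are willing to script outside the Weka GUI entirely, overlaying ROC curves for several classifiers is also straightforward with scikit-learn and matplotlib; this sketch is not a Weka feature, and the models dict, X_test, and y_test are placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# models: dict mapping a name to a fitted classifier; X_test, y_test assumed to exist
for name, model in models.items():
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, proba):.2f})")

plt.plot([0, 1], [0, 1], "--", color="grey")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```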