I get different results from model.predict_proba(X)[:, 0] compared to model.decision_function(X) for a regular gradient boosting decision tree classifier in scikit-learn, so I know the two are not the same.
I want the scores of the model, e.g. to plot ROC curves. How can I get the decision function for an XGBoost classifier using the scikit-learn wrapper? And why is predict_proba different from the scores?
In general, I would not expect sklearn.GradientBoostingClassifier and xgboost.XGBClassifier to agree, as they use very different implementations. But there are also conceptual differences between the quantities you have tried to compare:
And why is predict_proba different from scores?
Probabilities (the output of model.predict_proba(X)) are obtained from the scores (the output of model.decision_function(X)) by applying the inverse link of the loss/objective function, i.e. the logistic sigmoid for binary classification; see here for the call to the loss function and here for the actual transformation.
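A minimal sketch of that relationship, assuming a fitted binary sklearn.ensemble.GradientBoostingClassifier named model trained with the default log-loss (model and X are placeholder names):

import numpy as np
from scipy.special import expit  # the logistic sigmoid

scores = model.decision_function(X)           # raw, unbounded scores
probs_pos = model.predict_proba(X)[:, 1]      # probability of the positive class
print(np.allclose(probs_pos, expit(scores)))  # should print True for the default loss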
I want the scores of the model. To plot ROC curves etc. How can I get the decision function for XGBoost classifier using the SKLearn wrapper?
For the ROC curve you will want to use xgbmodel.predict_proba(X)[:, 1], i.e. the second column, which corresponds to class 1.
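For example, a minimal sketch assuming a fitted classifier xgbmodel and true labels y (placeholder names):

from sklearn.metrics import roc_curve, roc_auc_score

y_score = xgbmodel.predict_proba(X)[:, 1]     # score for the positive class
fpr, tpr, thresholds = roc_curve(y, y_score)  # points of the ROC curve
print(roc_auc_score(y, y_score))              # area under that curve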
Related
I have a question. Is the best score from GridSearchCV, which corresponds to the mean cross-validation score, the right metric to evaluate an algorithm trained on unbalanced data?
GridSearchCV can be used to find appropriate parameter values for your model.
For the right metric to evaluate an algorithm trained on unbalanced data, you want to look at the area under the precision-recall curve (PR AUC, also called 'average precision'), or maybe even a cost-sensitive metric (Jason Brownlee has a bunch of blog posts on this topic).
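A minimal sketch of wiring such a metric into the grid search via the scoring parameter (the estimator, grid, and training data here are just placeholders):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                      scoring="average_precision", cv=5)  # PR AUC instead of accuracy
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)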
In sklearn, roc_curve requires (y_true, y_scores). Generally, for y_scores, I feed in the probabilities output by a classifier's predict_proba function. But in the sklearn example, I see both predict_proba and decision_function are used.
I wonder what the difference is in terms of real-life model evaluation.
The functional form of logistic regression is -
$$f(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$
This is what is returned by predict_proba.
The term inside the exponential i.e.
$$d(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
is what is returned by decision_function. The "hyperplane" referred to in the documentation is
$$\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k = 0$$
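A minimal sketch of that correspondence with a binary sklearn.linear_model.LogisticRegression (the data variables are placeholders):

import numpy as np
from scipy.special import expit  # the logistic function 1 / (1 + e^(-z))
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X_train, y_train)
d = clf.decision_function(X_test)    # d(x) = beta_0 + beta_1*x_1 + ... + beta_k*x_k
p = clf.predict_proba(X_test)[:, 1]  # f(x), the probability of class 1
print(np.allclose(p, expit(d)))      # should print True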
My understanding after reading a few resources:
Decision Function: Gives the signed distances from the hyperplane. These are therefore unbounded and cannot be equated to probabilities. For getting probabilities, there are two solutions: Platt scaling, and Multi-Attribute Spaces, which calibrates outputs using Extreme Value Theory.
Predict Proba: Gives the actual probabilities (0 to 1); however, the parameter probability has to be set to True when the model is created. It uses Platt scaling, which is known to have theoretical issues.
Refer to this in the documentation.
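A minimal sketch of the two outputs for an SVC, plus explicit Platt (sigmoid) calibration via CalibratedClassifierCV (the data variables are placeholders):

from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

clf = SVC(probability=True).fit(X_train, y_train)
d = clf.decision_function(X_test)  # signed distances to the hyperplane, unbounded
p = clf.predict_proba(X_test)      # Platt-scaled probabilities in [0, 1]

# The same idea done explicitly: sigmoid (Platt) calibration wrapped around an SVC
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5).fit(X_train, y_train)
p2 = calibrated.predict_proba(X_test)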
I would like to compare the accuracies of running logistic regression on a dataset following PCA and LDA. The dataset I am using is the Wisconsin cancer dataset, which contains two classes (malignant or benign tumors) and 30 features. I have already conducted PCA on this data and have been able to get good accuracy scores with 10 principal components. I know that LDA is similar to PCA. My understanding is that you calculate the mean vectors of each feature for each class, compute the scatter matrices and then get the eigenvalues for the dataset. Is LDA similar to PCA in the sense that I can choose 10 LDA eigenvalues to better separate my data? I have tried LDA with scikit-learn, however it has only given me one LDA component back. Is this because I only have 2 classes, or do I need to do an additional step? I would like to have 10 LDA components in order to compare them with my 10 principal components. Is this even possible?
Actually, both LDA and PCA are linear transformation techniques: LDA is supervised, whereas PCA is unsupervised (it ignores class labels). You can picture PCA as a technique that finds the directions of maximal variance, and LDA as a technique that also cares about class separability. Remember that LDA makes assumptions about normally distributed classes and equal class covariances (at least the multiclass version; the generalized version by Rao). As to your specific question: LDA yields at most (number of classes − 1) discriminant components, so with two classes scikit-learn can only give you a single LDA component; you cannot extract 10 of them the way you can with PCA.
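A minimal sketch on the same data showing that limit (load_breast_cancer is scikit-learn's copy of the Wisconsin dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)
X_pca = PCA(n_components=10).fit_transform(X)             # 10 principal components
X_lda = LinearDiscriminantAnalysis().fit_transform(X, y)  # at most n_classes - 1 = 1 component
print(X_pca.shape, X_lda.shape)                           # (569, 10) (569, 1)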
Imagine a machine learning problem where you have 20 classes and about 7000 sparse boolean features.
I want to figure out what the 20 most distinctive features per class are. In other words, features that are used a lot in a specific class but are hardly or never used in the other classes.
What would be a good feature selection algorithm or heuristic that can do this?
When you train a multi-class logistic regression classifier, the trained model is a num_class x num_feature weight matrix, where entry [i, j] is the weight of feature j in class i. The feature indices are the same as in your input feature matrix.
In scikit-learn you can access the parameters of the model.
If you use scikit-learn classification algorithms, you can find the most important features per class like this:
import numpy as np
from sklearn.linear_model import SGDClassifier

# regul is the regularization strength; note that 'log' / n_iter are spelled 'log_loss' / max_iter in newer scikit-learn versions
clf = SGDClassifier(loss='log', alpha=regul, penalty='l1', l1_ratio=0.9, learning_rate='optimal', n_iter=10, shuffle=False, n_jobs=3, fit_intercept=True)
clf.fit(X_train, Y_train)
for i in range(clf.coef_.shape[0]):
    top20_indices = np.argsort(clf.coef_[i])[-20:]  # indices of the 20 largest weights for class i
    print(top20_indices)
clf.coef_ is the matrix containing the weight of each feature in each class, so clf.coef_[0][2] is the weight of the third feature in the first class.
If, when you build your feature matrix, you keep track of the index of each feature in a dictionary where dic[id] = feature_name, you can retrieve the names of the top features using that dictionary.
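For instance, a minimal sketch assuming the features came from a CountVectorizer-like object named vectorizer (a hypothetical name; in older scikit-learn versions the method is get_feature_names()):

dic = dict(enumerate(vectorizer.get_feature_names_out()))  # index -> feature name
top20_names = [dic[j] for j in top20_indices]              # names of the top-20 features
print(top20_names)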
For more information, refer to the scikit-learn text classification example.
Random Forest and Naive Bayes should be able to handle this for you. Given the sparsity, I'd try Naive Bayes first. Random Forest would be better if you're looking for combinations of features.
I am interested in understanding how probability estimates are calculated by random forests, both in general and specifically in Python's scikit-learn library (where probability estimates are returned by the predict_proba function).
Thanks,
Guy
The probabilities returned by a forest are the mean probabilities returned by the trees in the ensemble (docs).
The probabilities returned by a single tree are the normalized class histograms of the leaf a sample lands in.
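A minimal sketch of that averaging, shown here on the iris data just for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# The forest's probabilities are the mean of the per-tree class probabilities
tree_probs = np.mean([tree.predict_proba(X) for tree in rf.estimators_], axis=0)
print(np.allclose(tree_probs, rf.predict_proba(X)))  # should print True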
In addition to what Andreas/Dougal said,
when you train the RF, have a look at the feature importances (older scikit-learn versions required compute_importances=True; current versions compute them automatically).
Then inspect classifier.feature_importances_ to see which features occur high up in the RF's trees.
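A minimal sketch, assuming a fitted RandomForestClassifier called classifier (a placeholder name):

import numpy as np

importances = classifier.feature_importances_  # one importance value per feature
top20 = np.argsort(importances)[::-1][:20]     # indices of the 20 most important features
print(top20, importances[top20])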