I have the following code which performs 5-fold cross validation and returns several metric values.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

iris = load_iris()
clf = SVC()
scoring = {'acc': 'accuracy',
           'prec_macro': 'precision_macro',
           'rec_micro': 'recall_macro'}
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                        cv=5, return_train_score=True)
I want to know if this can be modified to print the predicted values for each fold.
If you're using sklearn you can use cross_val_predict:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(clf_name, X_train, y_train_5, cv=3)
cross_val_score gives the score for each fold, while cross_val_predict gives the cross-validated prediction for each sample (each prediction is made by the model trained on the other folds).
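For the iris/SVC setup in the question, a minimal sketch would look like the following (assuming a plain KFold splitter so the per-fold indices can be recovered afterwards):
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.svm import SVC

iris = load_iris()
clf = SVC()
cv = KFold(n_splits=5)

# One prediction per sample, produced by the model trained on the other folds
y_pred = cross_val_predict(clf, iris.data, iris.target, cv=cv)

# Group the predictions back by fold using the same splitter
for fold, (train_idx, test_idx) in enumerate(cv.split(iris.data)):
    print("Fold %d predictions:" % fold, y_pred[test_idx])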
Since I also needed this feature in scikit-learn, I hacked the code in my fork of the sklearn repo.
If you still need it, you can find it on my GitHub, on the group_cv branch:
https://github.com/robbisg/scikit-learn/tree/group_cv
The modified cross_validate function is here:
https://github.com/robbisg/scikit-learn/blob/group_cv/sklearn/model_selection/_validation.py
You need to call cross_validate with return_predictions=True.
Hope this helps.
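If you install that branch, the call would look roughly like the snippet below. Note that return_predictions is only available on that patched branch, not in released scikit-learn, and exactly how the predictions appear in the returned dict is an assumption on my part:
# Works only with the group_cv branch linked above, not with stock scikit-learn
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                        cv=5, return_train_score=True,
                        return_predictions=True)
# scores should then contain the per-fold predictions alongside the usual metric arrays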
I'm trying to solve a binary classification task. The training data set contains 9 features, and after my feature engineering I ended up with 14 features. I want to use a stacking classifier approach with
mlxtend.classifier.StackingClassifier using 4 different classifiers, but when trying to predict the test data set I got the error: ValueError: query data dimension must match training data dimension
%%time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from xgboost import XGBClassifier
from mlxtend.classifier import StackingCVClassifier

models = [KNeighborsClassifier(weights='distance'),
          GaussianNB(), SGDClassifier(loss='hinge'), XGBClassifier()]
calibrated_models = Calibrated_classifier(models, return_names=False)
meta = LogisticRegression()
stacker = StackingCVClassifier(classifiers=calibrated_models, meta_classifier=meta,
                               use_probas=True).fit(X.values, y.values)
Remark: In my code I just wrote a function that returns a list of calibrated classifiers to pass to StackingCVClassifier; I have checked that this is not what causes the error.
Remark 2: I had already tried to build a stacker from scratch with the same result, so I thought something was wrong with my own stacker:
import pandas as pd
from sklearn.linear_model import LogisticRegression

def StackingClassifier(X, y, models, stacker=LogisticRegression(), return_data=True):
    names, ls = [], []
    predictions = pd.DataFrame()
    for model in models:
        names.append(str(model)[:str(model).find('(')])
    for i, model in enumerate(models):
        model.fit(X, y)
        ls = model.predict_proba(X)[:, 1]
        predictions[names[i]] = ls
    if return_data:
        return predictions
    else:
        return stacker.fit(predictions, y)
Could you please help me understand the correct usage of stacking classifiers?
EDIT:
This is my code for the calibrated classifiers. The function takes a list of n classifiers, applies sklearn's CalibratedClassifierCV to each one, and returns a list with n calibrated classifiers. There is an option to return them as a zipped list of (name, classifier) pairs, since this function is mainly intended to be used together with sklearn's VotingClassifier.
from sklearn.calibration import CalibratedClassifierCV

def Calibrated_classifier(models, method='sigmoid', return_names=True):
    calibrated, names = [], []
    for model in models:
        names.append(str(model)[:str(model).find('(')])
    for model in models:
        clf = CalibratedClassifierCV(base_estimator=model, method=method)
        calibrated.append(clf)
    if return_names:
        return zip(names, calibrated)
    else:
        return calibrated
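For reference, a hedged sketch of the VotingClassifier usage mentioned above (the model list is just an example; the names come straight from the helper's zipped output):
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

models = [KNeighborsClassifier(weights='distance'), GaussianNB()]
named_calibrated = list(Calibrated_classifier(models, return_names=True))

# (name, estimator) pairs feed directly into VotingClassifier; soft voting uses predict_proba
voter = VotingClassifier(estimators=named_calibrated, voting='soft')
# voter.fit(X, y)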
I have tried your code with the Iris dataset and it works fine, so I think the problem is with the dimensions of your test data and not with the calibration.
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)
models = [KNeighborsClassifier(weights='distance'),
          SGDClassifier(loss='hinge')]
calibrated_models = Calibrated_classifier(models, return_names=False)
meta = LogisticRegression(multi_class='ovr')
stacker = StackingCVClassifier(classifiers=calibrated_models,
                               meta_classifier=meta, use_probas=True, cv=3).fit(X, y)
Prediction
stacker.predict([X[0]])
#array([0])
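Since the error comes from the nearest-neighbour query inside the stack, the first thing to check is that your test matrix has exactly the same number of columns as the training matrix. A small sketch (X_test here is a stand-in for your own engineered test set, which I don't have):
# Both shapes should report the same number of features (columns);
# if they differ, the test set did not go through the same feature engineering
print(X.values.shape)       # matrix the stacker was fitted on
print(X_test.values.shape)  # matrix passed to stacker.predict()

# A single test sample must also be 2-D, i.e. shape (1, n_features)
stacker.predict(X_test.values[0].reshape(1, -1))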
I know this is a very classic question which has probably been answered many times in this forum; however, I could not find any answer which explains it clearly from scratch.
Firstly, imagine that my dataset, called my_data, has 4 variables such as
my_data = variable1, variable2, variable3, target_variable
So, let's come to my problem. I'll explain all my steps and ask for your help where I've been stuck:
# STEP1 : split my_data into [predictors] and [targets]
predictors = my_data[['variable1',
                      'variable2',
                      'variable3']]
targets = my_data.target_variable
# STEP2 : import the required libraries
from sklearn import cross_validation
from sklearn.ensemble import RandomForestRegressor
#STEP3 : define a simple Random Forest model attributes
model = RandomForestClassifier(n_estimators=100)
#STEP4 : Simple K-Fold cross validation. 3 folds.
cv = cross_validation.KFold(len(my_data), n_folds=3, random_state=30)
# STEP 5
At this step, I want to fit my model on the training dataset, then use that model on the test dataset to predict the test targets. I also want to calculate the required statistics, such as MSE, r2 etc., to understand the performance of my model.
I'd appreciate it if someone could help me with some basic code lines for Step 5.
First off, you are using the deprecated cross_validation package of the scikit library. The new package is named model_selection, so I am using that in this answer.
Second, you are importing RandomForestRegressor but defining a RandomForestClassifier in the code. I am using RandomForestRegressor here, because the metrics you want (MSE, R2, etc.) are only defined for regression problems, not classification.
There are multiple ways to do what you want. I assume that since you are trying to use KFold cross-validation here, you want to use the left-out data of each fold as the test fold. To accomplish this, we can do:
predictors = my_data[['variable1',
                      'variable2',
                      'variable3']]
targets = my_data.target_variable

from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

model = RandomForestRegressor(n_estimators=100)
cv = model_selection.KFold(n_splits=3)

for train_index, test_index in cv.split(predictors):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = predictors.iloc[train_index], predictors.iloc[test_index]
    y_train, y_test = targets.iloc[train_index], targets.iloc[test_index]

    # For training, fit() is used
    model.fit(X_train, y_train)

    # Default metric is R2 for regression, which can be accessed by score()
    print(model.score(X_test, y_test))

    # For other metrics, we need the predictions of the model
    y_pred = model.predict(X_test)
    print(metrics.mean_squared_error(y_test, y_pred))
    print(metrics.r2_score(y_test, y_pred))
For all this, the documentation is your best friend, and the scikit-learn documentation is among the best I have ever seen. The following links may help you learn more:
http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance
http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
http://scikit-learn.org/stable/user_guide.html
Also note that the model only needs to be instantiated once before the loop, and that the for loop must iterate over cv.split(predictors), i.e. the KFold object created above.
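As a side note, once you are on model_selection you can get the same per-fold metrics in a single call with cross_validate. A sketch, reusing the predictors and targets defined above ('neg_mean_squared_error' is scikit-learn's sign-flipped MSE scorer):
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100)
results = cross_validate(model, predictors, targets, cv=3,
                         scoring=['r2', 'neg_mean_squared_error'])
print(results['test_r2'])                        # R2 per fold
print(-results['test_neg_mean_squared_error'])   # MSE per fold (flip the sign back)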
I have written my custom scorer object which is necessary for my problem and which I've called "p_value_scoring_object".
For the function sklearn.cross_validation.cross_val_score, one of the parameters is "scoring", which allows this scorer object to be used.
However, this option is not available for the score method of a classifier. Is sklearn just lacking that feature, or is there a way around it?
from sklearn.datasets import load_iris
from sklearn.cross_validation import cross_val_score
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10, scoring=p_value_scoring_object)
This works. However, this doesn't:
clf.fit(iris.data,iris.target)
clf.score(iris.data,iris.target,scoring=p_value_scoring_object)
sklearn is just lacking that feature. score is internally bound to a different metric for each type of estimator: for example, classifiers are bound to the classification accuracy metric, while for regressors it is bound to r2_score.
You can look at these bindings in sklearn.base; every mixin (for example ClassifierMixin) provides this score method.
Instead of this you can just run:
p_value_scoring_object(clf, iris.data, iris.target)
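For context, a scorer object is simply a callable with the signature scorer(estimator, X, y_true). Below is a hedged sketch of how such an object is typically built with make_scorer; p_value_metric is a hypothetical stand-in, since the question does not show how p_value_scoring_object is implemented:
from sklearn.metrics import make_scorer

def p_value_metric(y_true, y_pred):
    # hypothetical metric body; replace with your own p-value computation
    return float((y_true == y_pred).mean())

p_value_scoring_object = make_scorer(p_value_metric)

# A scorer is always called as scorer(estimator, X, y_true)
clf.fit(iris.data, iris.target)
print(p_value_scoring_object(clf, iris.data, iris.target))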
I am still very new to machine learning and trying to figure things out myself. I am using SciKit learn and have a data set of tweets with around 20,000 features (n_features=20,000). So far I achieved a precision, recall and f1 score of around 79%. I would like to use RFECV for feature selection and improve the performance of my model. I have read the SciKit learn documentation but am still a bit confused on how to use RFECV.
This is the code I have so far:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import RFECV
from sklearn import metrics

# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# feature extraction
count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1, 3))
X_CV = count_vect.fit_transform(docs_train)

tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_CV)

# Create the RFECV object
nb = MultinomialNB(alpha=0.5)

# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=nb, step=1, cv=2, scoring='accuracy')
rfecv.fit(X_tfidf, y_train)
X_rfecv = rfecv.transform(X_tfidf)
print("Optimal number of features : %d" % rfecv.n_features_)

# train classifier
clf = MultinomialNB(alpha=0.5).fit(X_rfecv, y_train)

# test clf on test data
X_test_CV = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_CV)
X_test_rfecv = rfecv.transform(X_test_tfidf)
y_predicted = clf.predict(X_test_rfecv)

# print the mean accuracy on the given test data and labels
print("Classifier score is: %s " % clf.score(X_test_rfecv, y_test))
Three questions:
1) Is this the correct way to use cross validation and RFECV? I am especially interested to know if I am running any risk of overfitting.
2) The accuracy of my model before and after I implemented RFECV with the above code are almost the same (around 78-79%), which puzzles me. I would expect performance to improve by using RFECV. Anything I might have missed here or could do differently to improve the performance of my model?
3) What other feature selection methods could you recommend me to try? I have tried RFE and SelectKBest so far, but they both haven't given me any improvement in terms of model accuracy.
To answer your questions:
There is cross-validation built into the RFECV feature selection (hence the name), so you don't really need additional cross-validation for this single step. However, since I understand you are running several tests, it's good to have an overall cross-validation to ensure you're not overfitting to a specific train-test split. I'd like to mention two points here:
I doubt the code behaves exactly like you think it does ;).
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# feature extraction
count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1, 3))
X_CV = count_vect.fit_transform(docs_train)
Here we first go through the loop, which has 5 iterations (the n_iter parameter of StratifiedShuffleSplit). Then we leave the loop and simply run all of your remaining code with the last values of train_index and test_index. So this is equivalent to a single train-test split, where you probably meant to have 5. You should move your code back into the loop if you want it to run like a 'proper' cross-validation.
You are worried about overfitting: indeed, when 'looking for the best method' there is a risk of picking the method that works best... only on the small sample you are testing the methods on.
Here the best practice is to have a first train-test split, then to perform cross-validation only using the train set. The test set can be used 'sparingly' when you think you found something, to make sure the scores you get are consistent and you're not overfitting.
It may look like you're throwing away 30% of your data (your test set), but it's absolutely worth it.
It can be puzzling to see that feature selection does not have that big an impact. To introspect a bit more, you could look at how the score evolves with the number of selected features (see the example from the docs).
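For instance, something along these lines (a sketch assuming the older scikit-learn used throughout this question, where RFECV exposes the per-step scores as grid_scores_; newer releases expose cv_results_ instead):
import matplotlib.pyplot as plt

# One cross-validated score per number of selected features (step=1)
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.xlabel("Number of features selected")
plt.ylabel("Cross-validated accuracy")
plt.show()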
That being said, I don't think this is the right use case for RFE. Basically, with your code you are eliminating features one by one, which probably takes a long time to run and does not make much sense when you have 20000 features.
Other feature selection methods: here you mention SelectKBest but you don't tell us which method you use to score your features! SelectKBest will pick the K best features according to a score function. I'm guessing you were using the default which is ok, but it's better to have an idea of what the default does ;).
I would try SelectPercentile with chi2 as a score function. SelectPercentile is probably a bit more convenient than SelectKBest because if your dataset grows a percentage probably makes more sense than a hardcoded number of features.
Another example from the docs that does just that (and more).
Additional remarks:
You could use a TfidfVectorizer instead of a CountVectorizer followed by a TfidfTransformer. This is strictly equivalent.
You could use a pipeline object to pack the different steps of your classifier into a single object you can run cross validation on (I encourage you to read the docs, it's pretty useful).
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline(steps=[
    ("vectorizer", TfidfVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1, 3))),
    ("selector", SelectPercentile(score_func=chi2, percentile=70)),
    ('NB', MultinomialNB(alpha=0.5))
])
Then you'd be able to run cross validation on the pipeline object to find the best combination of alpha and percentile, which is much harder to do with separate estimators.
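For example, a sketch of that joint search (the grid values below are placeholders; the parameter names follow the step names given to the Pipeline above, and GridSearchCV lives in sklearn.grid_search on the older release used elsewhere in this question):
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases

param_grid = {
    'selector__percentile': [50, 70, 90],  # placeholder values
    'NB__alpha': [0.1, 0.5, 1.0],          # placeholder values
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(docs_train, y_train)
print(search.best_params_, search.best_score_)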
Hope this helps, happy learning ;).
I built a random forest and I want to find the out-of-bag score. But my out-of-bag score is coming out to be 1.0, when it should be less than 1. My sample size consists of 20000 elements. Here is the Python code; please tell me what changes need to be made. Here X is a numpy array of the data and Z contains the true labels.
import csv
import numpy as np
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

with open(r'C:\Users\Harsh Bhandari\Desktop\letter.csv') as f:
    reader = csv.reader(f, delimiter='\t')
    data = [(col1, int(col2), int(col3), int(col4), int(col5), int(col6), int(col7), int(col8),
             int(col9), int(col10), int(col11), int(col12), int(col13), int(col14), int(col15),
             int(col16), int(col17))
            for col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12,
                col13, col14, col15, col16, col17 in reader]

X = []
Y = []
i = 0
while i < 20000:
    t = data[i][1:]
    X.append(t)
    t = data[i][0]
    Y.append(t)
    i = i + 1

X = np.asarray(X)
Y = np.asarray(Y)

le = preprocessing.LabelEncoder()
Z = le.fit_transform(Y)

clf = RandomForestClassifier(n_estimators=100, oob_score=True)
clf = clf.fit(X, Z)
a = clf.predict(X)
scores = clf.score(X, a)
print scores
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
In score you send the test data and its actual labels; here you are passing the predicted labels themselves, which by construction match the prediction, hence you are getting a score of 1.0.
I see a couple of things here.
You are doing clf.score(X, a), but you should be doing clf.score(X, Z), where Z is the true label for X.
The score method is defined as clf.score(X, true_labels_for_X). You instead passed the values that you predicted as y_true, which doesn't make sense: sklearn will already run predict on X internally, so you don't need to pass a.
Also, you can find the out-of-bag score by doing
print clf.oob_score_
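Putting both corrections together, a minimal sketch would be:
clf = RandomForestClassifier(n_estimators=100, oob_score=True)
clf = clf.fit(X, Z)

print clf.score(X, Z)   # accuracy against the true labels (still measured on the training data)
print clf.oob_score_    # out-of-bag estimate, a more honest generalisation number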