I have written a custom scorer object, which my problem requires, and which I've called "p_value_scoring_object".
For the function sklearn.model_selection.cross_val_score (formerly sklearn.cross_validation.cross_val_score), one of the parameters is "scoring", which allows you to use this scorer object.
However, this option is not available for the score method of a classifier. Is sklearn just lacking that feature, or is there a way around it?
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10, scoring=p_value_scoring_object)
This works. However, this doesn't:
clf.fit(iris.data, iris.target)
clf.score(iris.data, iris.target, scoring=p_value_scoring_object)
sklearn is just lacking that feature. score is internally bound to a fixed metric for each type of estimator: classifiers are bound to the classification accuracy metric, and regressors to r2_score.
You can look at these bindings in sklearn.base; every mixin (for example ClassifierMixin) provides this score method.
Instead of this you can just run:
p_value_scoring_object(clf, iris.data, iris.target)
(Note the first argument is the fitted estimator, not the scorer itself; a scorer is a callable with the signature scorer(estimator, X, y).)
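For context, a scorer built with make_scorer is itself a callable with the (estimator, X, y) signature, so it can be applied directly to a fitted classifier. A minimal sketch, with a placeholder metric since the original p-value computation isn't shown:
from sklearn.datasets import load_iris
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeClassifier

# Stand-in metric: any function of (y_true, y_pred) works with make_scorer.
def dummy_p_value(y_true, y_pred):
    return (y_true == y_pred).mean()  # placeholder for the real p-value logic

p_value_scoring_object = make_scorer(dummy_p_value)

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# A scorer is called as scorer(estimator, X, y), not via clf.score(...):
print(p_value_scoring_object(clf, iris.data, iris.target))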
Related
Let's say I have this Python code:
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.decomposition import PCA
from sklearn import neighbors
selector = VarianceThreshold()
scaler = StandardScaler()
ros = RandomOverSampler()
pca = PCA()
clf = neighbors.KNeighborsClassifier(n_jobs=-1)
pipe = Pipeline(steps=[('selector', selector), ('scaler', scaler), ('sampler', ros), ('pca', pca), ('kNN', clf)])
pipe.fit(X_train,y_train)
preds = pipe.predict(X_test)
What this does is import four preprocessing steps and an estimator from scikit-learn and imblearn. Then it fits these on the data and finally it predicts. If I understand correctly, the fit method applies the four transformers to the data and the predict method makes the final estimation (in our case using kNN). My question is this: for the scaler as well as PCA, the alterations made to the training data must also be applied to the test data. But we don't pass the test set to fit, so the test data won't be altered. How does that make sense? Is there something I am missing?
The model only learns its parameters from the training data and assumes that the test data will have similar patterns. When you call pipe.predict(X_test), the pipeline applies each fitted step's transform method (not fit) to the test data, so the test data is altered using the parameters learned from the training data. You cannot have test data that is completely different from the training data and expect good predictions, hence the same fitted PCA and scaler are reused on the test set. If you instead fit a scaler on a smaller test set, the results could come out completely different from what the model was originally trained on.
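To make this concrete, here is a minimal sketch of what the pipeline does under the hood: fit learns the transformation parameters from the training data only, and predict reuses those learned parameters on the test data via transform.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler()
scaler.fit(X_train)              # mean/std learned from training data only
print(scaler.transform(X_test))  # test data scaled with the *training* mean/std
# Inside a Pipeline, pipe.predict(X_test) calls transform (never fit)
# on each fitted step before the final estimator predicts.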
Showing the same accuracy in decision tree and naive Bayes algorithms with different symptoms
I tried to get different accuracies, but all the results remain the same.
This project is about disease prediction.
#decision_tree
from sklearn import tree
from sklearn.metrics import accuracy_score
decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(train_x,train_y)
res_pred = decision_tree.predict(x_test)
print(accuracy_score(y_test,res_pred))
#naive_bayes
import numpy as np
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb = gnb.fit(train_x, np.ravel(train_y))
y_pred = gnb.predict(x_test)
print(accuracy_score(y_test, y_pred))
The result is 0.9512195121951219 every time.
There are some ML problems that are so simple that almost every model performs equally well on them. To get different results from the two models, try changing their hyperparameters (for example, set the max depth of the decision tree to 1), as in the sketch below.
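A minimal sketch of constraining the tree, using a built-in dataset as a stand-in for the question's disease data:
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
train_x, x_test, train_y, y_test = train_test_split(X, y, random_state=0)

# A depth-1 tree (a "decision stump") is far weaker than an unconstrained one,
# so the two accuracies should now differ.
stump = tree.DecisionTreeClassifier(max_depth=1).fit(train_x, train_y)
full = tree.DecisionTreeClassifier().fit(train_x, train_y)
print(accuracy_score(y_test, stump.predict(x_test)))
print(accuracy_score(y_test, full.predict(x_test)))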
I'm trying to solve a binary classification task. The training data set contains 9 features, and after my feature engineering I ended up with 14 features. I want to use a stacking classifier approach with
mlxtend.classifier.StackingClassifier, using 4 different classifiers, but when trying to predict the test data set I get the error: ValueError: query data dimension must match training data dimension
%%time
from mlxtend.classifier import StackingCVClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

models = [KNeighborsClassifier(weights='distance'),
          GaussianNB(), SGDClassifier(loss='hinge'), XGBClassifier()]
calibrated_models = Calibrated_classifier(models, return_names=False)
meta = LogisticRegression()
stacker = StackingCVClassifier(classifiers=calibrated_models, meta_classifier=meta,
                               use_probas=True).fit(X.values, y.values)
Remark: in my code I just wrote a function that returns a list of calibrated classifiers for StackingCVClassifier; I have checked that this is not causing the error.
Remark 2: I had already tried to build a stacker from scratch with the same results, so I had thought something was wrong with my own stacker:
import pandas as pd
from sklearn.linear_model import LogisticRegression

def StackingClassifier(X, y, models, stacker=LogisticRegression(), return_data=True):
    names, ls = [], []
    predictions = pd.DataFrame()
    for model in models:
        names.append(str(model)[:str(model).find('(')])
    for i, model in enumerate(models):
        model.fit(X, y)
        ls = model.predict_proba(X)[:, 1]
        predictions[names[i]] = ls
    if return_data:
        return predictions
    else:
        return stacker.fit(predictions, y)
Could you please help me to understand the correct usage of a stacking classifiers?
EDIT:
This is my code for the calibrated classifier. The function takes a list of n classifiers, applies sklearn's CalibratedClassifierCV to each one, and returns a list of n calibrated classifiers. There is an option to return a zip list, since this function is mainly intended to be used with sklearn's VotingClassifier.
from sklearn.calibration import CalibratedClassifierCV

def Calibrated_classifier(models, method='sigmoid', return_names=True):
    calibrated, names = [], []
    for model in models:
        names.append(str(model)[:str(model).find('(')])
    for model in models:
        clf = CalibratedClassifierCV(base_estimator=model, method=method)
        calibrated.append(clf)
    if return_names:
        return zip(names, calibrated)
    else:
        return calibrated
I have tried your code with the Iris dataset and it works fine; I think the problem is with the dimensions of your test data, not with the calibration.
from sklearn import datasets
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.classifier import StackingCVClassifier

X, y = datasets.load_iris(return_X_y=True)
models = [KNeighborsClassifier(weights='distance'),
          SGDClassifier(loss='hinge')]
calibrated_models = Calibrated_classifier(models, return_names=False)
meta = LogisticRegression(multi_class='ovr')
stacker = StackingCVClassifier(classifiers=calibrated_models,
                               meta_classifier=meta, use_probas=True, cv=3).fit(X, y)
Prediction
stacker.predict([X[0]])
#array([0])
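If you want to verify the dimension mismatch on your side, a quick sanity check before predicting (X_test here stands for your test set, which isn't shown in the question):
# Both arrays must have the same number of feature columns.
print(X.values.shape, X_test.shape)
assert X.values.shape[1] == X_test.shape[1], "train/test feature counts differ"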
I have the following code which performs 5-fold cross validation and returns several metric values.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

iris = load_iris()
clf = SVC()
scoring = {'acc': 'accuracy',
           'prec_macro': 'precision_macro',
           'rec_macro': 'recall_macro'}
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                        cv=5, return_train_score=True)
I want to know if this can be modified to print the predicted values for each fold.
If you're using sklearn you can use cross_val_predict (adapted here to the question's variables):
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(clf, iris.data, iris.target, cv=5)
cross_val_score gives a score for each fold, while cross_val_predict gives the out-of-fold prediction for each sample.
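If you specifically want the predictions grouped per fold (cross_val_predict returns one flat array), a minimal sketch using KFold directly, assuming the iris/SVC setup from the question:
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

iris = load_iris()
clf = SVC()

# Refit the classifier on each training split and predict its test split.
for i, (train_idx, test_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(iris.data)):
    clf.fit(iris.data[train_idx], iris.target[train_idx])
    preds = clf.predict(iris.data[test_idx])
    print(f"fold {i}:", preds)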
Since I also need this feature in scikit-learn, I've hacked the code in my sklearn repo.
If you still need this, you can find it on my GitHub, on the branch group_cv:
https://github.com/robbisg/scikit-learn/tree/group_cv
The modified cross_validate function is here:
https://github.com/robbisg/scikit-learn/blob/group_cv/sklearn/model_selection/_validation.py
You need to call cross_validate with return_predictions=True.
Hope this helps.
I want to use PCA for dimensionality reduction and then use its output for a one-class SVM classifier in Python. My training data set is of the order of 16000x60. Also, how do I map a principal component back to the original columns to use it in the SVM, or can I use the principal components directly?
It is unclear what the problem is and what you have already tried. Of course you can. You can either add the PCA output to your original feature set or just use the output as a single feature. I encourage you to use sklearn pipelines.
Simple example:
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn import svm
svc = svm.SVC()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('svc', svc)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
pipe.fit(X_digits, y_digits)
print(pipe.score(X_digits, y_digits))
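Since the question asks about a one-class SVM specifically, here is a hedged sketch of the same pipeline idea with OneClassSVM, plus one way to relate principal components back to the original columns via pca.components_; the 16000x60 training set is replaced by random numbers for illustration:
import numpy as np
from sklearn import decomposition, svm
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.rand(1000, 60)  # stand-in for the 16000x60 training set

pca = decomposition.PCA(n_components=10)
ocsvm = svm.OneClassSVM(nu=0.05)
pipe = Pipeline(steps=[('pca', pca), ('ocsvm', ocsvm)])
pipe.fit(X)  # unsupervised: no labels needed

# Each row of components_ gives the weight of every original column
# in one principal component, so you can see which columns dominate.
print(pca.components_.shape)                  # (10, 60)
print(np.argmax(np.abs(pca.components_[0])))  # most influential column for PC1

# predict returns +1 for inliers, -1 for outliers.
print(pipe.predict(X[:5]))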