I am trying to run some code, but it gives me this error:
NameError: name 'cross_validation' is not defined
when I run this line:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.2)
My sklearn version is 0.19.1.
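The NameError means that the name cross_validation was never imported; that module is also deprecated in favor of sklearn.model_selection. Use cross_val_score and train_test_split separately. Import them using
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
Then, before computing the cross-validation score, you need a model to evaluate. Follow the code below as an example and adapt it to your data:
from sklearn.ensemble import RandomForestClassifier
# train_test_split returns X_train, X_test, y_train, y_test, in that order
xtrain,xtest,ytrain,ytest=train_test_split(balancedData.iloc[:,0:29],balancedData['Left'],test_size=0.25,random_state=123)
rf=RandomForestClassifier(max_depth=8,n_estimators=5)
# cross-validate on the training features and the training labels
rf_cv_score=cross_val_score(estimator=rf,X=xtrain,y=ytrain,cv=5)
print(rf_cv_score)
Remember to import RandomForestClassifier from sklearn.ensemble before using it.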
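For reference, here is a minimal self-contained sketch of the same idea. It assumes the iris dataset as a stand-in for the asker's X and y (which are not shown in the question) and only illustrates the imports and the order of the values returned by train_test_split:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in data (assumption): iris replaces the X and y from the question
X, y = load_iris(return_X_y=True)

# train_test_split returns X_train, X_test, y_train, y_test, in that order
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Cross-validate the model on the training portion only
rf = RandomForestClassifier(max_depth=8, n_estimators=5, random_state=0)
scores = cross_val_score(rf, X_train, y_train, cv=5)
print(scores)
```

The held-out X_test and y_test stay untouched here, so they can still be used for a final evaluation after cross-validation.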
I just tried to implement logistic regression on a very simple, small dataset in a Jupyter notebook. But the output I get at the end, after fitting the algorithm, is not what I expected: I get only LogisticRegression() and nothing else.
import numpy as np
import pandas as pd
df = pd.read_csv('placement.csv')
df.head()
df.info()
df = df.iloc[:,1:]
df.head()
import matplotlib.pyplot as plt
plt.scatter(df['cgpa'],df['iq'],c=df['placement'])
X = df.iloc[:,0:2]
y = df.iloc[:,-1]
X
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1)
X_train
y_train
X_test
y_test
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_train
X_test = scaler.transform(X_test)
X_test
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train,y_train)
LogisticRegression() ## at the end I get this.
Please bear with me for the way I have uploaded the code.
How can I fix this LogisticRegression() output? I need help.
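From looking into this, I learned that it is not an error: some time ago the scikit-learn model repr implementations were updated so that they no longer display parameters left at their default values. LogisticRegression() is simply the repr of the fitted estimator, which clf.fit() returns and the notebook prints.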
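If you want to see every parameter anyway, the following sketch shows two options. It assumes a reasonably recent scikit-learn (the print_changed_only option of set_config exists since 0.21 and the short repr is the default since 0.23) and uses made-up toy data rather than placement.csv:

```python
import numpy as np
from sklearn import set_config
from sklearn.linear_model import LogisticRegression

# Toy data (assumption): two random features and a binary label
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression()
fitted = clf.fit(X, y)        # fit() returns the estimator itself
short_repr = repr(clf)        # what the notebook shows by default

# Option 1: ask scikit-learn to print default parameters as well
set_config(print_changed_only=False)
full_repr = repr(clf)
set_config(print_changed_only=True)   # restore the default behaviour

# Option 2: read the parameters as a dictionary
params = clf.get_params()

# The model is fitted and usable regardless of how it is printed
preds = clf.predict(X[:5])
print(short_repr)
print(full_repr)
print(preds)
```

In a notebook, ending the cell with a semicolon or assigning the result of fit() to a variable also suppresses the echoed repr.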
I got 100% accuracy with a decision tree but only 75% accuracy with a random forest.
Is there something wrong with my model, or is a decision tree simply best suited to the dataset provided?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state= 30)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)
At first it may look like your model is overfitted, but that is not the case, since you have put the test set aside.
The likely reason is a data leak. A random forest randomly excludes some features when building each tree. Now suppose you have the label among the features: in some trees the label gets excluded and the accuracy drops, while in the decision tree the label is always among the features and predicts the result perfectly.
How can you find out whether this is the case?
Visualize the decision tree; if my guess is right, you will find that it has only a few decision nodes. You can also compute the correlation between the label and every feature and check whether any of them is perfectly correlated.
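Both checks can be sketched as follows. The data here is synthetic and deliberately leaky (the column name leaked_label and the noise features are assumptions made for illustration), so it shows what a leak looks like rather than diagnosing the asker's dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): four noise features plus a copy of the label
rng = np.random.RandomState(0)
n_samples = 200
y = rng.randint(0, 2, size=n_samples)
noise = rng.normal(size=(n_samples, 4))
X = np.column_stack([noise, y.astype(float)])
feature_names = ["f0", "f1", "f2", "f3", "leaked_label"]

# Check 1: absolute correlation between each feature and the label
correlations = {
    name: abs(np.corrcoef(X[:, i], y)[0, 1])
    for i, name in enumerate(feature_names)
}
print(correlations)

# Check 2: a tree that separates the classes with very few nodes is suspicious
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
depth = tree.get_depth()
n_leaves = tree.get_n_leaves()
print(depth, n_leaves)
```

A feature whose correlation with the label is close to 1, together with a tree of depth 1, points to a leaked label rather than a genuinely easy problem.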
I am learning how to use Scikit-Learn and I am trying to get the feature importance of a multi-label classification problem.
I defined and trained the model in the following way:
classifier = OneVsRestClassifier(
make_pipeline(RandomForestClassifier(random_state=42))
)
classifier.classes_ = classes
y_train_pred = cross_val_predict(classifier, X_train_prepared, y_train, cv=3)
The code seems to be working fine until I try to get the feature importance. This is what I tried:
classifier.feature_importances_
But I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-98-a9c91f6f2504> in <module>
----> 1 classifier.feature_importances_
AttributeError: 'OneVsRestClassifier' object has no attribute 'feature_importances_'
I also tried the solution proposed in this question, but I think it is outdated.
Would you be able to propose a newer smart and elegant solution to display the feature importances of a multi-label classification problem?
I would say that the solution within the referenced post is not outdated; instead, you have a slightly different setting to take care of.
The estimator that you're passing to OneVsRestClassifier is a Pipeline; in the referenced post it was a RandomForestClassifier directly.
Therefore you'll have to access one of the pipeline's steps to get to the RandomForestClassifier instance on which you'll be finally able to access the feature_importances_ attribute. That's one way of proceeding:
classifier.estimators_[0].named_steps['randomforestclassifier'].feature_importances_
Finally, be aware that you'll have to fit your OneVsRestClassifier instance to be able to access its estimators_ attribute. Although cross_val_predict takes care of fitting the estimator, as you might see here, it fits clones of it and does not return the estimator instance the way the .fit() method does. Therefore, outside cross_val_predict, classifier itself has not been fitted, which is why you're not able to access its estimators_ attribute.
Here is a toy example:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
classifier = OneVsRestClassifier(
make_pipeline(RandomForestClassifier(random_state=42))
)
classifier.fit(X_train, y_train)
y_train_pred = cross_val_predict(classifier, X_train, y_train, cv=3)
classifier.estimators_[0].named_steps['randomforestclassifier'].feature_importances_
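Since OneVsRestClassifier fits one pipeline per class, you may want the importances for every class rather than only the first. A possible sketch, again on iris and assuming the default step name that make_pipeline generates (randomforestclassifier):

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

iris = datasets.load_iris()
X, y = iris.data, iris.target

classifier = OneVsRestClassifier(
    make_pipeline(RandomForestClassifier(random_state=42))
)
classifier.fit(X, y)

# One row per class (one fitted pipeline each), one column per feature
importances = np.vstack([
    est.named_steps["randomforestclassifier"].feature_importances_
    for est in classifier.estimators_
])

for label, row in zip(classifier.classes_, importances):
    print(iris.target_names[label], dict(zip(iris.feature_names, row.round(3))))
```

Each row describes the binary "this class versus the rest" problem, so the rows are not directly comparable to the importances of a single multiclass forest.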
I'm trying to use the DiscriminationThreshold visualizer on my fitted models. They're all binary classifiers (logistic regression, LightGBM, and XGBClassifier); however, based on the documentation, I am having a hard time producing the plot for already fitted models. My code is the following:
# test is a logistic regression model
from yellowbrick.classifier import DiscriminationThreshold
visualizer = DiscriminationThreshold(test, is_fitted = True)
visualizer.show()
The output of this is an empty plot.
Can someone please help me understand how to use DiscriminationThreshold properly on a fitted model? I tried with the others (lgbm and xgb) and got an empty plot as well.
The DiscriminationThreshold visualizer works as an evaluator of a model and requires an evaluation data set. This means you need to fit the visualizer regardless of whether your model is already fitted. You seem to have omitted this step because your model is already fitted.
Try something like this:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import DiscriminationThreshold
from yellowbrick.datasets import load_spam
# Load a binary classification dataset and split
X, y = load_spam()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Instantiate and fit the LogisticRegression model
model = LogisticRegression(multi_class="auto", solver="liblinear")
model.fit(X_train, y_train)
visualizer = DiscriminationThreshold(model, is_fitted=True)
visualizer.fit(X_test, y_test) # Fit the test data to the visualizer
visualizer.show()
How can I achieve the task mentioned in the title? Is there a parameter in the RBF kernel to set the distance metric to the chi-squared distance? I can see a chi2_kernel in the sklearn library.
Below is the code that I have written.
import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.preprocessing import Imputer
from numpy import genfromtxt
from sklearn.metrics.pairwise import chi2_kernel
file_csv = 'dermatology.data.csv'
dataset = genfromtxt(file_csv, delimiter=',')
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=1)
dataset = imp.fit_transform(dataset)
target = dataset[:, [34]].flatten()
data = dataset[:, range(0,34)]
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
# TODO : willing to set chi-squared distance metric instead. How to do that ?
clf = svm.SVC(kernel='rbf', C=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
Are you sure you want to compose rbf and chi2? Chi2 on its own defines a valid kernel, and all you have to do is
clf = svm.SVC(kernel=chi2_kernel, C=1)
since sklearn accepts functions as kernels (however, this will require O(N^2) memory and time). If you would like to compose the two, it is a bit more complex, and you will have to implement your own kernel to do that. For a bit more control (and other kernels) you might also try pykernels; however, it does not support composition yet.
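As a rough sketch of what this looks like in practice: sklearn's chi2_kernel already computes exp(-gamma * sum((x - y)^2 / (x + y))), which has the same form as an RBF kernel with the squared Euclidean distance replaced by the chi-squared distance. The example below uses iris as stand-in data (an assumption; the kernel requires non-negative features) and an arbitrary gamma of 0.5, and shows both a callable kernel and a precomputed Gram matrix:

```python
import numpy as np
from functools import partial
from sklearn import datasets, svm
from sklearn.metrics.pairwise import additive_chi2_kernel, chi2_kernel
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): iris features are all non-negative
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

gamma = 0.5  # arbitrary value chosen for illustration

# Option 1: pass the kernel as a callable, fixing gamma with partial
clf = svm.SVC(kernel=partial(chi2_kernel, gamma=gamma), C=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# The exponential chi-squared kernel written out by hand:
# additive_chi2_kernel returns the negated chi-squared distance
K_exp = chi2_kernel(X_train, gamma=gamma)
K_manual = np.exp(gamma * additive_chi2_kernel(X_train))

# Option 2: precompute the Gram matrices yourself
clf_pre = svm.SVC(kernel="precomputed", C=1)
clf_pre.fit(K_exp, y_train)
y_pred_pre = clf_pre.predict(chi2_kernel(X_test, X_train, gamma=gamma))
```

The hand-written K_manual is the place to start if you want a different composition: any function that maps two sample matrices to a valid Gram matrix can be passed as the kernel or used to fill a precomputed one.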