I just tried to implement logistic regression on a very simple and small dataset in a Jupyter notebook. But the output I get at the end, after applying the algorithm, is unexpected: all it prints is LogisticRegression() and nothing else.
import numpy as np
import pandas as pd
df = pd.read_csv('placement.csv')
df.head()
df.info()
df = df.iloc[:,1:]
df.head()
import matplotlib.pyplot as plt
plt.scatter(df['cgpa'],df['iq'],c=df['placement'])
X = df.iloc[:,0:2]
y = df.iloc[:,-1]
X
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1)
X_train
y_train
X_test
y_test
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_train
X_test = scaler.transform(X_test)
X_test
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train,y_train)
LogisticRegression() ## at the end I get this.
Please bear with me regarding the way I have pasted the code.
How can I fix this output of LogisticRegression()? I need help.
From looking into this I gathered that the scikit-learn estimator repr was updated some time back so that default parameters are no longer displayed when an estimator is printed.
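That last line is not an error: in a notebook, the result of the last expression in a cell is echoed, and clf.fit(X_train, y_train) returns the fitted estimator itself, whose repr is simply LogisticRegression(). The model is trained; to get useful output, evaluate it on the held-out data. A minimal sketch using the variables already defined above:
y_pred = clf.predict(X_test)        # class predictions for the held-out samples
print(clf.score(X_test, y_test))    # mean accuracy on the test split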
Related
I am having some trouble implementing cross-validation. I understand that after cross-validation I have to re-train the model, but I have the following doubts:
Do the train_test_split before cross-validation, use X_train and y_train for the cross-validation process, and then re-train the model with X_train and y_train?
Or split the data into features (X) and labels (y), use those variables for the cross-validation process, and only then do the train_test_split and train the model with X_train and y_train?
If I use the features and labels variables, what is the next step after cross-validation?
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('../data/pima-indians-diabetes.csv')
data.head()
# All the columns except the one we want to predict
features = data.drop(['Outcome'], axis=1)
# Only the column we want to predict
labels = data['Outcome']
from sklearn.model_selection import train_test_split
test_size = 0.33
seed = 12
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=test_size,
random_state=seed)
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
# Option 1: cross-validate on the training split, then re-train on it
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression()
scores = cross_val_score(model, X_train, Y_train, cv=kfold)
model.fit(X_train, Y_train)
# Option 2: cross-validate on the full data, then split and train
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression()
scores = cross_val_score(model, features, labels, cv=kfold)
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.2,
random_state=42)
model.fit(X_train, Y_train)
Which of the two code blocks is correct or is there another way to implement cross-validation correctly?
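For reference, the usual pattern is essentially the first option: hold out a test set, cross-validate only on the training portion to estimate performance, then re-fit on the whole training portion and evaluate once on the untouched test set. A minimal sketch using the features and labels defined above (the max_iter bump is only an assumption, added so the solver converges on unscaled data):
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.33, random_state=12)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, Y_train, cv=kfold)   # performance estimate from the training split only
model.fit(X_train, Y_train)                                      # re-train on the entire training split
print(cv_scores.mean(), model.score(X_test, Y_test))             # CV estimate vs. one final check on the test set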
I got 100% accuracy on my decision tree using the decision tree algorithm, but only got 75% accuracy with a random forest.
Is there something wrong with my model, or is a decision tree best suited for the dataset provided?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# X and y (the feature matrix and labels) are assumed to be prepared from the dataset beforehand
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=30)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)
At first it may look like your model is overfitted, but that is not the case, since you have put the test set aside.
The reason is a data leak. A random forest randomly excludes some features for every tree. Now suppose you have the label as one of the features: in some trees that feature gets excluded and the accuracy drops, while in the decision tree the label is always among the features and predicts the result perfectly.
How can you find out whether this is the case?
Use the visualization for the decision tree; if my guess is right, you will find that there are only a few decision nodes. You can also visualize the correlation between the label and every feature and check whether there is any perfect correlation.
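A quick way to run both of those checks, assuming the fitted classifier from the code above and a DataFrame df that still contains the label column (df and 'target' are placeholder names here):
from sklearn import tree
import matplotlib.pyplot as plt
# 1) Plot the fitted tree: only a handful of decision nodes hints that one feature predicts the label outright
plt.figure(figsize=(12, 6))
tree.plot_tree(classifier, filled=True)
plt.show()
# 2) Inspect feature/label correlations: values close to 1.0 flag a leaking feature
print(df.corr()['target'].sort_values(ascending=False))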
import math
import numpy as np
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2)
clf = LinearRegression()
clf.fit(X_train, y_train)
# Test the accuracy
accuracy = clf.score(X_test, y_test)
I'm trying to fit the split data with LinearRegression.fit(), but it raises: Input contains NaN, infinity or a value too large for dtype('float64').
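That message usually means X or y still contains missing or infinite values. A quick check before splitting, assuming X is a pandas DataFrame and y a Series:
import numpy as np
# flag rows with NaN or +/-inf in the features, or NaN in the target
mask = X.replace([np.inf, -np.inf], np.nan).isna().any(axis=1) | y.isna()
print(mask.sum(), 'rows contain NaN or inf')
# one simple remedy is to drop them (imputation is another option)
X_clean, y_clean = X[~mask], y[~mask]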
I am trying some code but it shows me this error:
NameError: name 'cross_validation' is not defined
when I run this line:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.2)
My scikit-learn version is 0.19.1.
Use cross_val_score and train_test_split separately; both live in sklearn.model_selection. Import them using
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
Then, to get a cross-validation score, you need to pass an estimator to cross_val_score along with the data. Follow the code below as an example and adapt it to your data:
from sklearn.ensemble import RandomForestClassifier
xtrain, xtest, ytrain, ytest = train_test_split(balancedData.iloc[:,0:29], balancedData['Left'], test_size=0.25, random_state=123)
rf = RandomForestClassifier(max_depth=8, n_estimators=5)
rf_cv_score = cross_val_score(estimator=rf, X=xtrain, y=ytrain, cv=5)
print(rf_cv_score)
As shown above, RandomForestClassifier is imported from sklearn.ensemble before it is used.
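As for the original error itself: the sklearn.cross_validation module was deprecated in favor of sklearn.model_selection, and the name cross_validation is not defined unless it is imported explicitly. The modern equivalent of the failing line would be, for example:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)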
How can I achieve the task mentioned in the title? Is there a parameter in the RBF kernel to set the distance metric to the chi-squared distance? I can see a chi2_kernel in the scikit-learn library.
Below is the code that I have written.
import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.preprocessing import Imputer  # note: replaced by sklearn.impute.SimpleImputer in scikit-learn >= 0.22
from numpy import genfromtxt
from sklearn.metrics.pairwise import chi2_kernel
file_csv = 'dermatology.data.csv'
dataset = genfromtxt(file_csv, delimiter=',')
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=1)
dataset = imp.fit_transform(dataset)
target = dataset[:, [34]].flatten()
data = dataset[:, range(0,34)]
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
# TODO : willing to set chi-squared distance metric instead. How to do that ?
clf = svm.SVC(kernel='rbf', C=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
Are you sure you want to compose rbf and chi2? Chi2 on its own defines a valid kernel, and all you have to do is
clf = svm.SVC(kernel=chi2_kernel, C=1)
since sklearn accepts functions as kernels (however, this will require O(N^2) memory and time). If you would like to compose these two, it is a bit more complex and you will have to implement your own kernel to do that. For a bit more control (and other kernels) you might also try pykernels; however, there is no support for composing kernels yet.
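A sketch of both routes, re-using the variables from the question's code (the precomputed variant is just the explicit form of the same chi-squared kernel):
from sklearn import svm
from sklearn.metrics.pairwise import chi2_kernel
# Option A: pass the kernel function directly (chi2 assumes non-negative features)
clf = svm.SVC(kernel=chi2_kernel, C=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Option B: precompute the Gram matrices explicitly
K_train = chi2_kernel(X_train, X_train)
K_test = chi2_kernel(X_test, X_train)     # rows: test samples, columns: training samples
clf_pre = svm.SVC(kernel='precomputed', C=1)
clf_pre.fit(K_train, y_train)
y_pred_pre = clf_pre.predict(K_test)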