Troubles with Cross-Validation - machine-learning

I'm having some trouble implementing cross-validation. I understand that after cross-validation I have to re-train the model, but I have the following doubts:
Should I do train_test_split before cross-validation, use X_train and y_train for the cross-validation process, and then re-train the model with X_train and y_train?
Or should I split the data into features (X) and labels (y), use those variables in the cross-validation process, and then do train_test_split and train the model with X_train and y_train?
If I use the features and labels variables, what is the next step after cross-validation?
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
%matplotlib inline

data = pd.read_csv('../data/pima-indians-diabetes.csv')
data.head()

# All the columns except the one we want to predict
features = data.drop(['Outcome'], axis=1)
# Only the column we want to predict
labels = data['Outcome']

# Block 1: split first, then cross-validate on the training set only
test_size = 0.33
seed = 12
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=test_size, random_state=seed)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression()
scores = cross_val_score(model, X_train, Y_train, cv=kfold)
model.fit(X_train, Y_train)

# Block 2: cross-validate on the full dataset, then split and train
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression()
scores = cross_val_score(model, features, labels, cv=kfold)
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
model.fit(X_train, Y_train)
Which of the two code blocks is correct, or is there another way to implement cross-validation correctly?
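For reference, here is a minimal sketch of one common workflow, assuming the same file and column names as above: hold out a test set first, cross-validate only on the training portion, then re-fit on the whole training set and evaluate once on the held-out test set (max_iter=1000 is only there to avoid convergence warnings on unscaled data):

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = pd.read_csv('../data/pima-indians-diabetes.csv')
features = data.drop(['Outcome'], axis=1)
labels = data['Outcome']

# Hold out a test set that cross-validation never sees
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33, random_state=12)

# Estimate generalization performance with CV on the training data only
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=kfold)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Re-fit on the full training set and evaluate once on the untouched test set
model.fit(X_train, y_train)
print("Test accuracy: %.3f" % accuracy_score(y_test, model.predict(X_test)))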

Related

ERROR: Output of Logistic Regression is displayed as LogisticRegression()

I just tried to implement logistic regression on a very simple and small dataset in a Jupyter notebook, but the output I get at the end, after applying the algorithm, is not what I expected. All I get is LogisticRegression() and nothing else.
import numpy as np
import pandas as pd
df = pd.read_csv('placement.csv')
df.head()
df.info()
df = df.iloc[:, 1:]   # drop the first column
df.head()
import matplotlib.pyplot as plt
plt.scatter(df['cgpa'], df['iq'], c=df['placement'])
X = df.iloc[:, 0:2]   # features: cgpa and iq
y = df.iloc[:, -1]    # target: placement
X
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1)
X_train
y_train
X_test
y_test
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_train
X_test = scaler.transform(X_test)
X_test
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train,y_train)
LogisticRegression() ## at the end I get this.
Please bear with me regarding the way I have pasted the code.
How can I fix this LogisticRegression() output? I need help.
This is not actually an error: some time ago the scikit-learn estimator repr implementations were updated so that they no longer display default parameters when printed. The model is fitted correctly; LogisticRegression() is simply how the fitted estimator is now displayed.
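If you prefer the old behaviour, where printing an estimator shows every parameter, recent scikit-learn versions let you change the display configuration; a minimal sketch:

import sklearn
from sklearn.linear_model import LogisticRegression

# Show all parameters, not only the non-default ones, when printing estimators
sklearn.set_config(print_changed_only=False)

clf = LogisticRegression()
print(clf)  # now prints the full parameter list instead of just LogisticRegression()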

Get 100% accuracy score on Decision tree model

I got 100% accuracy with the decision tree algorithm but only 75% accuracy with a random forest.
Is there something wrong with my model, or is a decision tree simply best suited for the dataset provided?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# X and y (features and labels) are assumed to be defined earlier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=30)

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
At first it may look like your model is overfitted, but that is not the case, since you have put the test set aside.
The likely reason is a data leak. A random forest randomly excludes some features for every tree. Now suppose the label is present as one of the features: in some trees the label gets excluded and the accuracy drops, while in the decision tree the label is always among the features and predicts the result perfectly.
How can you find out whether this is the case?
Visualize the decision tree; if my guess is right, you will find that it has only a few decision nodes. You can also look at the correlation between the label and every feature and check whether there is a perfect correlation, as sketched below.
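A minimal sketch of both checks; it assumes the DataFrame is called data and the target column is called label (placeholder names, adjust them to your dataset):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# 1) Inspect the fitted tree: a leaked label usually produces a very shallow tree
plt.figure(figsize=(12, 6))
plot_tree(classifier, filled=True)
plt.show()
print("Depth:", classifier.get_depth(), "Leaves:", classifier.get_n_leaves())

# 2) Check feature/label correlations: values close to 1.0 or -1.0 are suspicious
print(data.corr()['label'].sort_values(ascending=False))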

Prediction in machine learning

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
clf.fit(X_train, y_train)
y_true, y_pred = y_test, clf.predict(X_test)
acc = accuracy_score(y_pred, y_test)
The above four lines of code are from a model I came across. Can somebody tell me the difference between y_test and y_pred?
y_test holds the actual labels and is used to evaluate your final model's performance by comparing it with y_pred, which is what the fitted model predicts for X_test.
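A tiny illustrative sketch (the values are made up, not from the question):

from sklearn.metrics import accuracy_score

y_test = [0, 1, 1, 0, 1]   # actual labels from the held-out test set
y_pred = [0, 1, 0, 0, 1]   # labels the fitted model predicted for X_test

# 4 of the 5 predictions match the actual labels, so the accuracy is 0.8
print(accuracy_score(y_test, y_pred))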

DNN binary classifier's accuracy not increasing

My binary classifier DNN's accuracy seems to be stuck since epoch 1. I think this means that the model is not learning. Any insight into why this is happening?
Problem statement: I would like to classify a given sequence of readings for sensors (ex. [0 1 15 1 0 3]) into either 0 or 1 (0 equivalent to "idle" state, 1 equivalent to "active" state).
About the dataset: Dataset is available here
The "state" column is the target, while the rest of the columns are the features.
I've tried using SGD instead of Adam, tried different kernel initializers, tried changing the number of hidden layers and the number of neurons per layer, and tried using sklearn's StandardScaler instead of the MinMaxScaler. None of these approaches seemed to change the outcome.
This is the code:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam
from keras.initializers import he_uniform
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
seed = 7
random_state = np.random.seed(seed)
data = pd.read_csv('Dataset/Reformed/Model0_Dataset.csv')
X = data.drop(['state'], axis=1).values
y = data['state'].values
#min_max_scaler = MinMaxScaler()
std_scaler = StandardScaler()
# X_scaled = min_max_scaler.fit_transform(X)
X_scaled = std_scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=random_state)
# One Hot encode targets
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
enc = OneHotEncoder(categories='auto')
y_train_enc = enc.fit_transform(y_train).toarray()
y_test_enc = enc.fit_transform(y_test).toarray()
epochs = 500
batch_size = 100
model = Sequential()
model.add(Dense(700, input_shape=(X.shape[1],), kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(1400, activation='relu', kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(700, activation='relu', kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(800, activation='relu', kernel_initializer=he_uniform(seed)))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.summary()
early_stopping_monitor = EarlyStopping(patience=25)
# model.compile(SGD(lr=.01, decay=1e-6, momentum=0.9, nesterov=True), loss='binary_crossentropy', metrics=['accuracy'])
model.compile(Adam(lr=.01, decay=1e-6), loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train_enc, validation_split=0.2, batch_size=batch_size,
                    callbacks=[early_stopping_monitor], epochs=epochs, shuffle=True, verbose=1)
eval = model.evaluate(X_test, y_test_enc, batch_size=batch_size, verbose=1)
Expected results: Accuracy increasing (and loss decreasing) with each epoch (at least for the early epochs).
Actual results: The following values are fixed throughout the entire training process:
loss: 8.0118 - acc: 0.5001 - val_loss: 8.0366 - val_acc: 0.4987
You are using the wrong loss: with a two-output softmax you should use categorical_crossentropy together with one-hot encoded labels. If you want to use binary_crossentropy, then the output layer should be a single unit with a sigmoid activation, as sketched below.
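A minimal sketch of the two consistent configurations (the hidden layer size is a placeholder, not a recommendation):

from keras.models import Sequential
from keras.layers import Dense

n_features = X_train.shape[1]  # assumes X_train from the question

# Option A: two-unit softmax output + categorical_crossentropy + one-hot targets
model_a = Sequential([
    Dense(64, activation='relu', input_shape=(n_features,)),
    Dense(2, activation='softmax'),
])
model_a.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model_a.fit(X_train, y_train_enc, ...)  # one-hot encoded targets

# Option B: single sigmoid unit + binary_crossentropy + plain 0/1 targets
model_b = Sequential([
    Dense(64, activation='relu', input_shape=(n_features,)),
    Dense(1, activation='sigmoid'),
])
model_b.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model_b.fit(X_train, y_train, ...)      # integer 0/1 targets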

LinearRegression().fit() Input contains NaN, infinity or a value too large for dtype('float64')

import math
import numpy as np
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# X and y (features and labels) are assumed to be defined earlier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LinearRegression()
clf.fit(X_train, y_train)

# Evaluate on the test set (score returns the R^2 coefficient)
accuracy = clf.score(X_test, y_test)
I'm trying to fit the split data with LinearRegression().fit(), but it raises: Input contains NaN, infinity or a value too large for dtype('float64').
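The error message points at the data rather than the model: a minimal sketch of how the offending values could be located and removed before fitting, assuming the features and target live in a pandas DataFrame named df with the target in a column called 'target' (both names are placeholders):

import numpy as np
import pandas as pd

# Replace infinities with NaN, then see how many bad values each column holds
df = df.replace([np.inf, -np.inf], np.nan)
print(df.isna().sum())

# Simplest fix: drop the rows containing NaN (imputation is another option)
df = df.dropna()

X = df.drop(columns=['target']).astype('float64')
y = df['target'].astype('float64')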