ValueError: Found input variables with inconsistent numbers of samples: [559, 140] - machine-learning

Here's my code below. I don't know what's wrong with it; please help. The error occurs on the line clf.fit(X_train, y_train):
import numpy as np
from sklearn import preprocessing, neighbors
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('breast-cancer-wisconsin.txt')
df.replace('?', -99999, inplace=True)
df.drop(['id'], axis=1, inplace=True)
X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
print(X_train.shape)
print(y_train.shape)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print (accuracy)

The problem is in your X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.2) line.
According to the scikit-learn documentation, the correct order of the values returned by train_test_split is:
X_train,
X_test,
y_train,
y_test
The order in your code is wrong: with your unpacking, y_train actually receives X_test, so clf.fit is called with 559 feature rows but only 140 labels, which is exactly the inconsistent-samples error you see. Replace the line where you call train_test_split with this one:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Hopefully, this will resolve your issue.
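As a quick sanity check (a minimal sketch reusing the variables from the question's code), the corrected unpacking makes the row counts line up before fitting:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape[0], y_train.shape[0])  # equal: 559 and 559, matching the error message
print(X_test.shape[0], y_test.shape[0])    # equal: 140 and 140
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)  # no longer raises the ValueError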

Related

AutoSKLearn predict_proba equivalent?

Is there an equivalent to SKLearn's predict_proba in AutoSKLearn? I can't seem to find a way to determine the confidence of AutoSKLearn's predictions.
A predict_proba method is implemented for AutoSklearnClassifier.
From auto-sklearn documentation:
predict_proba(X, batch_size=None, n_jobs=1)
Predict probabilities of classes for all samples X.
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
...
Returns:
y : array of shape = [n_samples, n_classes] or [n_samples, n_labels]
Which in context looks something like this:
from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = AutoSklearnClassifier(time_left_for_this_task=30)
clf.fit(X_train, y_train)
predictions = clf.predict_proba(X_test)
print(predictions)
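Since the question asks about confidence specifically, a simple way to read it off (a small addition to the snippet above) is the maximum class probability per sample:
confidence = predictions.max(axis=1)  # highest class probability for each sample
print(confidence)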

LSTM multiclass classifier: why do my predictions have nothing to do with my target set?

I am trying to design an LSTM model for forecasting price movement.
I have issues with the results I obtain for my predictions. I did not normalize my target set y (neither train nor test), only X, because it's a classification problem (-1, 0, 1), but the predictions I obtain are floats.
Maybe I did not normalize the right sets. My code is below.
Many thanks for your help, and feel free to comment on my other lines of code too; I am a beginner.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from datetime import datetime as dt
from pandas_datareader import data as pdr
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import LSTM
startdate=dt(2018,3,31)
enddate=dt(2022,3,31)
tickers = ['ETH-USD']
Data=pdr.get_data_yahoo(tickers,start=startdate, end=enddate)['Adj Close']
df_change = Data.apply(lambda x: np.log(x) - np.log(x.shift(1)))
df_change.drop(index=df_change.index[0], axis=0, inplace=True)
df_change = df_change*100
pd.options.mode.chained_assignment = None  # suppress the chained-assignment warning when writing into the copied dataframe
df_y = df_change.copy()
df_y.columns = ['ETH-y']
def Target(df, column, df2, column2):
    for i in range(len(df)):
        if df[column].iloc[i] > 0:
            df2[column2][i] = 1   # value is up compared to the previous day
        elif -0.5 < df[column].iloc[i] < 0.5:
            df2[column2][i] = 0   # value is steady
        else:
            df2[column2][i] = -1  # value is down
Target(df_change, 'ETH-USD', df_y, 'ETH-y')
print(df_y['ETH-y'].value_counts())
Data.drop(index=Data.index[0], axis=0, inplace=True) #drop first row to have same values
X = Data
y = df_y
## split my train val and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify = y)
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler().fit(X_train)
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)
#reshaping for 3D array
X_train = np.reshape(X_train,(1169,1,1))
X_test = np.reshape(X_test,(293,1,1))
from keras.models import Sequential
from keras.layers import Dense, LSTM
model = Sequential()
model.add(LSTM(64, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True))
model.add(LSTM(32, activation='relu', return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(y_train.shape[1]))
model.compile(optimizer='adam', loss='mse')
model.summary()
history = model.fit(X_train, y_train, epochs=10, batch_size=16, validation_split=0.1, verbose=1)
pred = model.predict(X_test)
pred = sc.inverse_transform(pred)
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.legend()
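One note on why the predictions come out as floats: the final Dense layer above has no activation and the model is trained with an mse loss, so the network is doing regression and will output continuous values no matter how y is scaled. A classification setup would look something like this instead (a minimal sketch, not the asker's exact code: it assumes the labels {-1, 0, 1} are remapped to {0, 1, 2}, and the layer sizes are illustrative):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
# Remap labels {-1, 0, 1} -> {0, 1, 2} so they can serve as class indices
y_train_cls = (np.asarray(y_train).ravel() + 1).astype(int)
model = Sequential()
model.add(LSTM(64, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(3, activation='softmax'))  # one probability per class
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train_cls, epochs=10, batch_size=16, validation_split=0.1)
# Predicted class = index of the highest probability, mapped back to {-1, 0, 1}
pred_cls = model.predict(X_test).argmax(axis=1) - 1
Note that class predictions need no inverse_transform; the scaler was only ever fitted on X.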

one-hot encoding not working properly with logistic regression

I am trying to train a logistic regression model to recognize handwritten English letters. In my test data I have 74880 images. Each image has 784 pixels. The labels correspond to the place in the English alphabet. For example, A is 1, B is 2 and so on. In total there are 26 classes.
In order to optimize the model I decided to one-hot encode the labels. This means for an image with the label 23 (the letter W) after encoding the label will become: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]. However, when encoding the labels I receive this weird error: ValueError: y should be a 1d array, got an array of shape (74880, 26) instead. This error does not occur when using another model like multilayer perceptron. Weird fact: sometimes I receive (37440, 26) instead of the (74880, 26) in my error after running the same exact code again.
Does anyone have an explanation? Thanks in advance.
Here is the source code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
def binarize(y_train, y_val, y_test):
    one_hot = LabelBinarizer()
    Y_train = one_hot.fit_transform(y_train)
    Y_val = one_hot.transform(y_val)
    Y_test = one_hot.transform(y_test)
    return Y_train, Y_val, Y_test
def lgr(X_train, X_val, X_test, Y_train, Y_val, Y_test):
    lgr = LogisticRegression(random_state=999, verbose=2)
    parameters = {
        'solver': ['sag'],
        'max_iter': [10]
    }
    clf = GridSearchCV(lgr, parameters, n_jobs=-1, cv=2, verbose=2)
    print(X_train.shape)
    print(Y_train.shape)
    clf.fit(X_train, Y_train)
    print(clf.best_score_, clf.best_params_)  # was grid_result.best_score_, an undefined name
    # Y_pred = lgr.predict(X_val)
    # acc = accuracy_score(Y_val, Y_pred)
    # print(acc)
def main():
    # loading dataset
    with np.load('training-dataset.npz') as data:
        img = data['x']
        lbl = data['y']
    # train 60% / validation 20% / test 20% split
    X_train, X_test, y_train, y_test = train_test_split(img, lbl, test_size=0.2, random_state=999)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=999)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)
    # one-hot encoding
    Y_train, Y_val, Y_test = binarize(y_train, y_val, y_test)
    lgr(X_train, X_val, X_test, Y_train, Y_val, Y_test)
if __name__ == '__main__':
    main()
You are dealing with a multiclass classification problem. scikit-learn's LogisticRegression supports this type of problem without one-hot encoding the output: it expects a 1-D array of class labels, which is why your shapes don't match. (The (37440, 26) variant of the error is likely GridSearchCV at work: with cv=2, each fold is fit on half of the 74880 samples.)
So use your y without one-hot encoding. LogisticRegression will switch its multi_class parameter to "multinomial" automatically to deal with it.
If you prefer, you can use these parameters instead: multi_class='ovr', solver='liblinear'. That uses the one-vs-rest (OvR) technique.
LogisticRegression and MLPClassifier handle multiclass targets differently; each algorithm is different, so you have to check how each one works.
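A minimal sketch of the fix in the code above (names follow the question's code): skip the binarize step and pass the original 1-D integer labels straight to the grid search.
# Use the raw labels; no call to binarize()
lgr = LogisticRegression(random_state=999, verbose=2)
parameters = {'solver': ['sag'], 'max_iter': [10]}
clf = GridSearchCV(lgr, parameters, n_jobs=-1, cv=2, verbose=2)
clf.fit(X_train, y_train)            # y_train has shape (n_samples,)
y_pred = clf.predict(X_val)
print(accuracy_score(y_val, y_pred))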

ValueError when making predictions with LinearRegression

I have started learning ML.
This is my code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 1].values
# Split the data set into Training Set and Test Set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test =\
train_test_split(X, Y, test_size=1/3, random_state=0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fitting Simple Linear Regression to Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train , Y_train)
# Predicting the Test set Results
y_pred = regressor.predict(X_test)
I am getting the following error on the last line:
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.
How can I resolve this?
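The shape (0, 1) in the message means X_test ended up with zero rows, so the split produced an empty test set; that usually points to the CSV loading fewer data rows than expected. A quick diagnostic sketch (reusing the names from the code above) narrows down where the rows disappear:
dataset = pd.read_csv('Salary_Data.csv')
print(dataset.shape)                 # how many rows did the CSV actually load?
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 1].values
print(X.shape, Y.shape)              # features and target before the split
X_train, X_test, Y_train, Y_test = \
    train_test_split(X, Y, test_size=1/3, random_state=0)
print(X_train.shape, X_test.shape)   # X_test must have at least 1 row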

label_binarize does not fit for sklearn Naive Bayes classifier, showing bad input shape

I was trying to create a ROC curve for a multiclass problem using Naive Bayes, but it ends with
ValueError: bad input shape.
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.naive_bayes import BernoulliNB
from scipy import interp
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
# Learn to predict each class against the other
classifier = BernoulliNB(alpha=1.0, binarize=6, class_prior=None, fit_prior=True)
y_score = classifier.fit(X_train, y_train).predict(X_test)
This raises:
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (75, 6)
The error is caused by binarizing the y variable; the estimator can work with the original class labels directly.
Remove the following lines:
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
You are good to go!
To get the predicted probabilities for roc_curve, use the following:
classifier.fit(X_train, y_train)
y_score = classifier.predict_proba(X_test)
y_score.shape
# (75, 3)
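If you still need the per-class ROC curves, binarize the true labels only for the metric, not for fitting (a sketch building on the snippet above; the three classes are those of iris):
# Binarize y_test just to score each class against the rest
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
fpr, tpr, roc_auc = {}, {}, {}
for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
print(roc_auc)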
