one-hot encoding not working properly with logistic regression - machine-learning

I am trying to train a logistic regression model to recognize handwritten English letters. In my test data I have 74880 images. Each image has 784 pixels. The labels correspond to the place in the English alphabet. For example, A is 1, B is 2 and so on. In total there are 26 classes.
In order to optimize the model I decided to one-hot encode the labels. This means for an image with the label 23 (the letter W) after encoding the label will become: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]. However, when encoding the labels I receive this weird error: ValueError: y should be a 1d array, got an array of shape (74880, 26) instead. This error does not occur when using another model like multilayer perceptron. Weird fact: sometimes I receive (37440, 26) instead of the (74880, 26) in my error after running the same exact code again.
Does anyone have an explanation? Thanks in advance.
Here is the source code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
def binarize(y_train, y_val, y_test):
    # fit the encoder on the training labels only, then reuse it
    one_hot = LabelBinarizer()
    Y_train = one_hot.fit_transform(y_train)
    Y_val = one_hot.transform(y_val)
    Y_test = one_hot.transform(y_test)
    return Y_train, Y_val, Y_test
def lgr(X_train, X_val, X_test, Y_train, Y_val, Y_test):
    lgr = LogisticRegression(random_state=999, verbose=2)
    parameters = {
        'solver': ['sag'],
        'max_iter': [10]
    }
    clf = GridSearchCV(lgr, parameters, n_jobs=-1, cv=2, verbose=2)
    print(X_train.shape)
    print(Y_train.shape)
    # the ValueError is raised here because Y_train is a 2-D one-hot matrix
    clf.fit(X_train, Y_train)
    print(clf.best_score_, clf.best_params_)
    # Y_pred = lgr.predict(X_val)
    # acc = accuracy_score(Y_val, Y_pred)
    # print(acc)
def main():
    # loading dataset
    with np.load('training-dataset.npz') as data:
        img = data['x']
        lbl = data['y']
    # train 60% validation 20% test 20% split
    X_train, X_test, y_train, y_test = train_test_split(img, lbl, test_size=0.2, random_state=999)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=999)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)
    # one-hot encoding
    Y_train, Y_val, Y_test = binarize(y_train, y_val, y_test)
    lgr(X_train, X_val, X_test, Y_train, Y_val, Y_test)
if __name__ == '__main__':
    main()

This is a multiclass classification problem. scikit-learn's LogisticRegression supports this type of problem directly and expects y to be a 1-D array of class labels, not a one-hot encoded matrix. That's why your shapes don't match.
So pass your labels without one-hot encoding. With the default multi_class setting, LogisticRegression selects the "multinomial" formulation automatically for multiclass targets.
If you prefer, you can use these parameters: multi_class='ovr', solver='liblinear'. That uses the One-vs-Rest (ovr) technique.
LogisticRegression and MLPClassifier handle multiclass targets differently (MLPClassifier also accepts a 2-D indicator matrix), so check the documentation of each estimator to see which target shapes it accepts.
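The point above can be sketched on synthetic data. This is a minimal illustration, not the asker's pipeline: the make_classification data set, the sizes and the variable names are assumptions made for the example. It shows that LogisticRegression fits on 1-D integer labels, rejects a one-hot matrix, and that labels which are already one-hot encoded can be mapped back with LabelBinarizer.inverse_transform.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer

# Synthetic stand-in for the letter data: 300 samples, 4 classes (assumed sizes).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# One-hot encode the labels, as in the question.
one_hot = LabelBinarizer()
Y = one_hot.fit_transform(y)  # 2-D indicator matrix, one column per class

model = LogisticRegression(max_iter=1000)

# Fitting on the 2-D one-hot matrix is rejected with a ValueError.
try:
    model.fit(X, Y)
    one_hot_rejected = False
except ValueError as err:
    one_hot_rejected = True
    print("one-hot target rejected:", err)

# Fitting on the 1-D label vector works for multiclass problems.
model.fit(X, y)
proba_shape = model.predict_proba(X[:3]).shape
print(proba_shape)

# If only the one-hot matrix is available, recover the 1-D labels first.
y_recovered = one_hot.inverse_transform(Y)
print(np.array_equal(y_recovered, y))
```

The one-hot matrix is still useful for estimators that accept indicator targets; for LogisticRegression only the 1-D labels are needed.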

Related

AutoSKLearn predict_proba equivalent?

Is there an equivalent to SKLearn's predict_proba in AutoSKLearn? I can't seem to find a way to determine the confidence of AutoSKLearn's predictions.
AutoSklearnClassifier provides a predict_proba method.
From auto-sklearn documentation:
predict_proba(X, batch_size=None, n_jobs=1)
Predict probabilities of classes for all samples X.
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
...
Returns:
y : array of shape = [n_samples, n_classes] or [n_samples, n_labels]
Which in context looks something like this:
from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = AutoSklearnClassifier(time_left_for_this_task=30)
clf.fit(X_train, y_train)
predictions = clf.predict_proba(X_test)
print(predictions)
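To turn those probabilities into a per-sample confidence, one common approach is to take the largest class probability in each row. The sketch below uses a hand-written array in place of the output of clf.predict_proba(X_test), so the numbers and the names proba, confidence and predicted_index are illustrative assumptions.

```python
import numpy as np

# Stand-in for clf.predict_proba(X_test): one row per sample, one column per class.
proba = np.array([
    [0.90, 0.10],
    [0.35, 0.65],
    [0.50, 0.50],
])

# Confidence of each prediction: the highest class probability in the row.
confidence = proba.max(axis=1)

# Column index of the most probable class (ties resolve to the first column).
predicted_index = proba.argmax(axis=1)

print(confidence)
print(predicted_index)
```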

SMOTE resampling produces nan values

I am using SMOTE to oversample the minority of a dataset. My code is as follows:
from imblearn.over_sampling import SMOTE
X_train, X_test, y_train, y_test = train_test_split(features_coded, labels, test_size=0.2, random_state=42)
sm = SMOTE(random_state=42, sampling_strategy='all')
# also tried the following, same result
# sm = SMOTE(random_state=42, sampling_strategy=0.5)
X_train, y_train = sm.fit_resample(X_train, y_train)
I check features_coded, labels, X_train and y_train using statements like the following:
features_coded[features_coded.isnull().any(axis=1)]
I am pretty sure that they do not contain any nan values before oversampling. However, after resampling, there are a lot of nan values in the X_train dataframe.
Just in case you are wondering:
This is my dataframe (saved as csv file) before oversampling, nothing is missing.
This is my dataframe (saved as csv file) after oversampling, a lot of empty values!
Is anything wrong?
I had a similar issue. I converted my inputs X and Y to arrays with X_arr = numpy.array(X) and y_arr = numpy.array(Y), then passed them to train_test_split() as follows:
X_train, X_test, y_train, y_test = train_test_split(X_arr, y_arr, test_size = 0.2, random_state = 2)
smote = SMOTE(random_state=2)
X_train_balanced, Y_train_balanced = smote.fit_resample(X_train, y_train)
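A small sketch of that conversion step, with a NaN check before resampling. The DataFrame below is made-up data, and its non-contiguous index is only an assumption about what a filtered DataFrame might look like; the source does not confirm why the NaN values appear.

```python
import numpy as np
import pandas as pd

# Hypothetical features and labels with a non-contiguous index,
# as often left behind after filtering rows from a DataFrame.
features = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.5, 0.1, 0.9, 0.3]},
                        index=[0, 2, 5, 7])
labels = pd.Series([0, 1, 0, 1], index=[0, 2, 5, 7])

# Convert to plain NumPy arrays so that no pandas index is carried along.
X_arr = np.array(features)
y_arr = np.array(labels)

# Check for missing values before handing the arrays to
# train_test_split and SMOTE.
has_nan = bool(np.isnan(X_arr).any())
print(X_arr.shape, y_arr.shape, has_nan)
```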

Unable to do Stacking for a Multi-label classifier

I am working on a multi-label text classification problem (Total target labels 90). The data distribution has a long tail and class imbalance and around 100k records. I am using the OAA strategy (One against all). I am trying to create an ensemble using Stacking.
Text features : HashingVectorizer(number of features 2**20, char analyzer)
TSVD to reduce the dimensionality (n_components=200).
text_pipeline = Pipeline([
    ('hashing_vectorizer', HashingVectorizer(n_features=2**20,
                                             analyzer='char')),
    ('svd', TruncatedSVD(algorithm='randomized',
                         n_components=200, random_state=19204))])
feat_pipeline = FeatureUnion([('text', text_pipeline)])
estimators_list = [
    ('ExtraTrees',
     OneVsRestClassifier(ExtraTreesClassifier(n_estimators=30,
                                              class_weight="balanced",
                                              random_state=4621))),
    ('linearSVC',
     OneVsRestClassifier(LinearSVC(class_weight='balanced')))]
estimators_ensemble = StackingClassifier(
    estimators=estimators_list,
    final_estimator=OneVsRestClassifier(
        LogisticRegression(solver='lbfgs', max_iter=300)))
classifier_pipeline = Pipeline([
    ('features', feat_pipeline),
    ('clf', estimators_ensemble)])
Error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-41-ad4e769a0a78> in <module>()
1 start = time.time()
----> 2 classifier_pipeline.fit(X_train.values, y_train_encoded)
3 print(f"Execution time {time.time()-start}")
4
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
795 return np.ravel(y)
796
--> 797 raise ValueError("bad input shape {0}".format(shape))
798
799
ValueError: bad input shape (89792, 83)
StackingClassifier did not support multi-label classification at the time of writing. You can check this kind of support by looking at the shape expected for y in the documentation of the estimator's fit method.
The solution is to put the OneVsRestClassifier wrapper on top of StackingClassifier rather than on the individual models.
Example:
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import StackingClassifier
from sklearn.multiclass import OneVsRestClassifier
X, y = make_multilabel_classification(n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)
estimators_list = [('ExtraTrees', ExtraTreesClassifier(n_estimators=30,
                                                       class_weight="balanced",
                                                       random_state=4621)),
                   ('linearSVC', LinearSVC(class_weight='balanced'))]
estimators_ensemble = StackingClassifier(estimators=estimators_list,
                                         final_estimator=LogisticRegression(solver='lbfgs', max_iter=300))
ovr_model = OneVsRestClassifier(estimators_ensemble)
ovr_model.fit(X_train, y_train)
ovr_model.score(X_test, y_test)
# 0.45454545454545453

label_binarize Does not fit for sklearn Naive Bayes classifier showing bad input shape

I was trying to create a ROC curve for a multiclass problem using Naive Bayes, but it ends with
ValueError: bad input shape.
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.naive_bayes import BernoulliNB
from scipy import interp
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
random_state=0)
# Learn to predict each class against the other
classifier = BernoulliNB(alpha=1.0, binarize=6, class_prior=None, fit_prior=True)
y_score = classifier.fit(X_train, y_train).predict(X_test)
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (75, 6)
The error occurs because the y variable is binarized. The estimator can work with the original class labels (including string values) directly.
Remove the following lines:
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
You are good to go!
To get the predicted probabilities for roc_curve, use the following:
classifier.fit(X_train, y_train)
y_score = classifier.predict_proba(X_test)
y_score.shape
# (75, 3)
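Putting the two steps together: the classifier is trained on the 1-D labels, and only the test labels are binarized, solely so that roc_curve can be evaluated one class at a time. This is a sketch on the plain iris data; the noisy features from the question are left out, and stratify=y is an added assumption that keeps all three classes in the test set.

```python
from sklearn import datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import label_binarize

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# Train on the original 1-D labels.
classifier = BernoulliNB(alpha=1.0, binarize=6)
classifier.fit(X_train, y_train)

# One column of scores per class.
y_score = classifier.predict_proba(X_test)

# Binarize the test labels only for the per-class ROC computation.
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])

roc_auc = {}
for i in range(y_test_bin.shape[1]):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr, tpr)

print(y_score.shape)
print(roc_auc)
```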

ValueError: Found input variables with inconsistent numbers of samples: [559, 140]

Here is my code. I don't know what's wrong with it. Please help. The error occurs on this line:
clf.fit(X_train, y_train)
import numpy as np
from sklearn import preprocessing, neighbors
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('breast-cancer-wisconsin.txt')
df.replace('?', -99999, inplace=True)
df.drop(['id'], 1, inplace=True)
X = np.array(df.drop(['class'],1))
y = np.array(df['class'])
X_train,y_train,X_test,y_test = train_test_split(X,y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
print(X_train.shape)
print(y_train.shape)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print (accuracy)
The problem is in this line:
X_train,y_train,X_test,y_test = train_test_split(X,y, test_size=0.2)
According to the scikit-learn documentation, train_test_split returns its values in this order:
X_train,
X_test,
y_train,
y_test
The order in your code is wrong, so the name y_train actually receives the test features (140 rows) while X_train has 559 rows. Replace the train_test_split line with this one:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
Hopefully, this will resolve your issue.
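The effect of the unpacking order can be checked on synthetic data. The sketch below assumes 699 rows, which matches the 559 + 140 in the error message, and 9 feature columns; the array contents and the names a, b, c, d are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data set: 699 rows, 9 feature columns (assumed).
X = np.arange(699 * 9).reshape(699, 9)
y = np.arange(699) % 2

# Correct order: both feature splits first, then both label splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, y_train.shape)  # (559, 9) (559,)
print(X_test.shape, y_test.shape)    # (140, 9) (140,)

# Order used in the question: the second name receives the test features,
# so the "X" and "y" passed to fit() have 559 and 140 rows.
a, b, c, d = train_test_split(X, y, test_size=0.2, random_state=0)
print(len(a), len(b))                # 559 140
```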
