SMOTE resampling produces nan values - machine-learning

I am using SMOTE to oversample the minority of a dataset. My code is as follows:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_coded, labels, test_size=0.2, random_state=42)
sm = SMOTE(random_state=42, sampling_strategy='all')
# also tried the following, same result
# sm = SMOTE(random_state=42, sampling_strategy=0.5)
X_train, y_train = sm.fit_resample(X_train, y_train)
I check features_coded, labels, X_train and y_train using statements like the following:
features_coded[features_coded.isnull().any(axis=1)]
I am fairly sure that they do not contain any NaN values before oversampling. After resampling, however, there are a lot of NaN values in the X_train dataframe.
Just in case you are wondering:
This is my dataframe (saved as csv file) before oversampling, nothing is missing.
This is my dataframe (saved as csv file) after oversampling, a lot of empty values!
Is anything wrong?

I had a similar issue. I converted my inputs X and Y to NumPy arrays with X_arr = numpy.array(X) and y_arr = numpy.array(Y), then passed those arrays to train_test_split() as follows:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_arr, y_arr, test_size=0.2, random_state=2)
smote = SMOTE(random_state=2)
X_train_balanced, Y_train_balanced = smote.fit_resample(X_train, y_train)

Related

I want to impute using my regressor model in PySpark

I am fairly new to PySpark. I am building my own ML imputation function to fill in missing values in a PySpark dataframe. I just want to know the correct syntax for selecting all values of the respective row inside when(), so that I can pass those values as the argument to my regressor.predict().
for Y in arrEmptyColumns:
    #===============BUILDING MODEL===============
    X_train, X_test, Y_train, Y_test = train_test_split(tblTempX.toPandas(), tblTempDataSet.select(Y).toPandas(), test_size = 0.20)
    regressor = RandomForestRegressor(n_estimators = int(len(tblTempX.columns)*.50), random_state=0)
    #===============TRAINING===============
    regressor.fit(X_train,Y_train)
    Y_pred = regressor.predict(X_test)
    print("==============================")
    print("Mean Squared Error for ",Y,": ")
    print(np.sqrt(mean_squared_error(Y_test,Y_pred)))
    print("R2_Score; Performance Accuracy: ")
    print(r2_score(Y_test, Y_pred))
    #===============TESTING===============
    tblFinalOutputDataSet = tblFinalOutputDataSet.withColumn(Y, when(tblFinalOutputDataSet[Y].isNull(), regressor.predict('''*code here to select all values of respective row to be used as argument for the model*''')))
    tblFinalOutputDataSet.display()

one-hot encoding not working properly with logistic regression

I am trying to train a logistic regression model to recognize handwritten English letters. In my test data I have 74880 images. Each image has 784 pixels. The labels correspond to the place in the English alphabet. For example, A is 1, B is 2 and so on. In total there are 26 classes.
In order to optimize the model I decided to one-hot encode the labels. This means for an image with the label 23 (the letter W) after encoding the label will become: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]. However, when encoding the labels I receive this weird error: ValueError: y should be a 1d array, got an array of shape (74880, 26) instead. This error does not occur when using another model like multilayer perceptron. Weird fact: sometimes I receive (37440, 26) instead of the (74880, 26) in my error after running the same exact code again.
Does anyone have an explanation? Thanks in advance.
Here is the source code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
def binarize(y_train, y_val, y_test):
    one_hot = LabelBinarizer()
    Y_train = one_hot.fit_transform(y_train)
    Y_val = one_hot.transform(y_val)
    Y_test = one_hot.transform(y_test)
    return Y_train, Y_val, Y_test
def lgr(X_train, X_val, X_test, Y_train, Y_val, Y_test):
    lgr = LogisticRegression(random_state=999, verbose=2)
    parameters = {
        'solver': ['sag'],
        'max_iter': [10]
    }
    clf = GridSearchCV(lgr, parameters, n_jobs=-1, cv=2, verbose=2)
    print(X_train.shape)
    print(Y_train.shape)
    clf.fit(X_train, Y_train)
    print(grid_result.best_score_, grid_result.best_params_)
    # Y_pred = lgr.predict(X_val)
    # acc = accuracy_score(Y_val, Y_pred)
    # print(acc)
def main():
    # loading dataset
    with np.load('training-dataset.npz') as data:
        img = data['x']
        lbl = data['y']
    # train 60% validation 20% test 20% split
    X_train, X_test, y_train, y_test = train_test_split(img, lbl, test_size=0.2, random_state=999)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=999)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)
    # one-hot encoding
    Y_train, Y_val, Y_test = binarize(y_train, y_val, y_test)
    lgr(X_train, X_val, X_test, Y_train, Y_val, Y_test)
if __name__ == '__main__':
    main()
This is a multiclass classification problem, and scikit-learn's LogisticRegression supports it without one-hot encoding the output. It expects y to be a 1-D array of class labels, so passing the (74880, 26) one-hot matrix is what triggers the ValueError.
So use your y without one-hot encoding. With the default multi_class setting and the 'sag' solver, LogisticRegression will fit a multinomial model automatically.
If you prefer, you can use these parameters: multi_class='ovr', solver='liblinear'. That selects the one-vs-rest (OvR) scheme instead.
LogisticRegression and MLPClassifier handle multiclass targets differently, so check the documentation of each estimator for the label format it expects.
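To illustrate the point about label shapes, here is a minimal sketch. It assumes scikit-learn's built-in digits dataset (10 classes) as a stand-in for the letters data, and max_iter=1000 is an arbitrary choice. It fits LogisticRegression on 1-D integer labels, then shows that a one-hot encoded target is rejected:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer, StandardScaler

# Stand-in data (assumption): 8x8 digit images, 10 classes, not the letters dataset
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=999
)

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 1-D integer labels: this is the format LogisticRegression expects
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print("accuracy with 1-D labels:", accuracy)

# One-hot labels: a 2-D target is rejected with a ValueError
Y_train_onehot = LabelBinarizer().fit_transform(y_train)
try:
    LogisticRegression(max_iter=1000).fit(X_train, Y_train_onehot)
    onehot_error = None
except ValueError as exc:
    onehot_error = str(exc)
print("error with one-hot labels:", onehot_error)
```

If you need the one-hot matrix for another estimator, keep both versions of the labels and pass the 1-D one to LogisticRegression.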

making predictions using classification models with multiple independent variables in hand

I am trying to make a simple classification using logistic regression. I fit the model and scale the values using a StandardScaler. How can I make a single prediction after that? I am getting the same result for different values: for every value, I get 0. The predictions I get from single inputs do not resemble the predictions made on the testing dataset. Can someone please give me a hand?
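import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
dataset = pd.read_csv("Social_Network_Ads.csv")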
x = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 4].values
print(dataset)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
classifier = LogisticRegression()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
x_values = [36, 36000]
x_values = np.array(x_values).reshape(1, -1)
x_values = scaler.transform(x_values)
pred = classifier.predict(x_values)
print("single prediction: ", pred)

Training and testing ML from two different sources

I am using sklearn for a classification task. I want to train my model on data from the table "train" and test it on data from a different table, "test". Both tables have exactly the same features, but different numbers of rows. I have the code below, but I am getting the error:
(<class 'ValueError'>, ValueError('Found input variables with inconsistent numbers of samples: [123, 174]',), <traceback object at 0x0000016476E10C48>).
What am I doing wrong?
get_train_data = 'select * from train;'
get_test_data = 'select * from test;'
df_train = pd.read_sql_query(get_train_data, con=connection)
df_test = pd.read_sql_query(get_test_data, con=connection)
X = df_train[:, 2:30]
Y = df_test[:, :30]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
split_mat=confusion_matrix(Y_test, predictions)
If you want to train on dataframe df_train and test on dataframe df_test, why are you taking the features of df_train and the target column of df_test and passing them to the train_test_split function? The two tables have different numbers of rows, which is exactly what the error reports.
You can simply do the following:
get_train_data = 'select * from train;'
get_test_data = 'select * from test;'
df_train = pd.read_sql_query(get_train_data, con=connection)
df_test = pd.read_sql_query(get_test_data, con=connection)
X_train = df_train.iloc[:, 2:30] # positional slicing on a DataFrame needs .iloc
y_train = df_train.y # assuming y is the name of your target variable in df_train
X_test = df_test.iloc[:, i:j] # choose i and j so that you take the same columns as X_train
y_test = df_test.y # assuming y is the name of your target variable in df_test
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Do something with predictions, e.g. compute the accuracy
(predictions == y_test).mean()
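As a self-contained illustration of the same idea, here is a rough sketch. It assumes two synthetic tables built by a hypothetical make_table helper, with made-up columns f1, f2 and y, in place of the SQL tables, and LogisticRegression as a placeholder model. The tables have different row counts, and no train_test_split call is needed:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

def make_table(n_rows):
    """Build a hypothetical table with two features and a binary target 'y'."""
    features = rng.normal(size=(n_rows, 2))
    target = (features[:, 0] + features[:, 1] > 0).astype(int)
    return pd.DataFrame({"f1": features[:, 0], "f2": features[:, 1], "y": target})

# Same columns, different numbers of rows (123 and 174, as in the error message)
df_train = make_table(123)
df_test = make_table(174)

# Features and target always come from the same table
feature_cols = ["f1", "f2"]
X_train, y_train = df_train[feature_cols], df_train["y"]
X_test, y_test = df_test[feature_cols], df_test["y"]

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

split_mat = confusion_matrix(y_test, predictions)
accuracy = (predictions == y_test).mean()
print(split_mat)
print("accuracy:", accuracy)
```

Selecting the feature columns by name rather than by position makes it harder to pick different columns for the two tables by accident.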

knn.fit() error: valueError: Found input variables with inconsistent numbers of samples

I'm taking a supervised learning course on DataCamp and trying to reproduce the code in a Jupyter notebook.
I do the following:
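import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
url = 'https://assets.datacamp.com/production/repositories/628/datasets/444cdbf175d5fbf564b564bd36ac21740627a834/diabetes.csv'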
df2 = pd.read_csv(url)
y = df2['diabetes'].values
X = df2.loc[:,['pregnancies', 'bmi','age']]
X = np.array(X)
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
When I call knn.fit(), it gives me an error:
ValueError: Found input variables with inconsistent numbers of samples: [460, 308]
I looked through some solutions here; they are basically all about the X and y array dimensions. I changed them, but it didn't help.
Thank you in advance!
print(X.shape, y.shape)
print(type(X), type(y))
(768, 3) (768,)
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
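According to the sklearn documentation, train_test_split returns the train and test subsets of each input, in the same order as the arguments passed to it: for (X, y) it returns X_train, X_test, y_train, y_test. Your code unpacks the result as X_train, y_train, X_test, y_test, so the variable named y_train actually holds the 308-row test split of X, which does not match the 460 rows of X_train.
This will fix your problem: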
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
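A quick way to check the return order is to print the shapes of the four outputs. The sketch below uses placeholder zero arrays (an assumption) with the same shapes as X and y in the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays (assumption) with the same shapes as in the question
X = np.zeros((768, 3))
y = np.zeros(768)

# For inputs (X, y) the outputs are: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

shapes = [a.shape for a in (X_train, X_test, y_train, y_test)]
print(shapes)  # [(460, 3), (308, 3), (460,), (308,)]
```

The 460 and 308 here are the same two numbers that appear in the error message.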
