Why does the output change after I perform cross-validation? - machine-learning

I have built a neural network for regression. However, if I perform cross-validation before making predictions, the output changes. Below are the graphs with and without cross-validation.
With Cross Validation
Without Cross Validation
The code that I use for cross-validation:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import KFold, cross_val_score

epoch = 5000
n_cols = X_train.shape[1]

def baseline_model():
    # One hidden layer with 3 sigmoid units and a linear output for regression
    model = Sequential()
    model.add(Dense(3, activation='sigmoid', input_shape=(n_cols,)))
    model.add(Dense(1, activation='linear'))
    model.compile(optimizer='adam', loss='mse')
    return model

estimator = KerasRegressor(build_fn=baseline_model, epochs=epoch, batch_size=16, verbose=0)
kfold = KFold(n_splits=5)
# One score per fold
results = cross_val_score(estimator, X_train, y_train, cv=kfold)
print("Results: %.10f (%.10f) MSE" % (results.mean(), results.std()))
print("RMSE:", np.sqrt(abs(results.mean())))
print(results)
The code that I use for prediction:
epoch = 5000
n_cols = X_train.shape[1]

def modelling():
    # One hidden layer with 4 tanh units and a linear output for regression
    model = Sequential()
    model.add(Dense(4, activation='tanh', input_shape=(n_cols,)))
    model.add(Dense(1, activation='linear'))
    model.compile(optimizer='adam', loss='mse')
    return model

model = modelling()
history = model.fit(X_train, y_train, epochs=epoch, validation_split=0.3, batch_size=16, verbose=0)
I am using Keras with the TensorFlow backend.

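That's the essence of cross-validation. Instead of a single evaluation, it yields the mean and standard deviation of several evaluations. In your example, you are using a 5-split KFold, which means the model is trained on 4/5 of the training data and tested on the remaining 1/5, five times, with a freshly built model for each fold. Note also that your two snippets build different networks (3 sigmoid units versus 4 tanh units), so their outputs are not directly comparable.
Cross-validation is used to estimate how well your model generalizes, which helps you check that it is not overfitting.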
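To make the "train on 4/5, test on 1/5, five times" idea concrete, here is a minimal sketch in plain Python. It is not the Keras/scikit-learn code from the question: the helper names (k_fold_indices, fit_mean_model, cross_validate), the toy "model" that always predicts the training mean, and the made-up targets are all assumptions chosen for illustration.

```python
import statistics


def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_mean_model(y_train):
    """Toy stand-in for a real model: always predicts the training mean."""
    return statistics.fmean(y_train)


def mse(y_true, prediction):
    """Mean squared error of a constant prediction."""
    return statistics.fmean((y - prediction) ** 2 for y in y_true)


def cross_validate(y, n_splits=5):
    """Fit a new toy model per fold and return one MSE per fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), n_splits):
        model = fit_mean_model([y[i] for i in train_idx])
        scores.append(mse([y[i] for i in test_idx], model))
    return scores


y = [float(v) for v in range(10)]  # made-up targets 0.0 .. 9.0
fold_scores = cross_validate(y, n_splits=5)
print(fold_scores)
print("mean MSE:", statistics.fmean(fold_scores), "std:", statistics.pstdev(fold_scores))
```

Each fold score comes from a different model trained on a different subset, which is why the cross-validated figures need not match a single fit on all of the training data.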

Related

Poor predictions on second dataset from trained LSTM model

I've trained an LSTM model with 8 features and 1 output. I split one dataset into two separate files: I train and test on the first half, then use the trained model to predict the second half. The model predicts the training and testing sets from the first half pretty well (RMSE of around 5-7), but when I predict on the second half I get very poor predictions (RMSE of around 50-60). How can I get my trained model to predict well on outside datasets?
dataset at this link
file = r'/content/drive/MyDrive/only_force_pt1.csv'
df = pd.read_csv(file)
df.head()
X = df.iloc[:, 1:9]
y = df.iloc[:,9]
print(X.shape)
print(y.shape)
plt.figure(figsize = (20, 6), dpi = 100)
plt.plot(y)
WINDOW_LEN = 50
def window_size(size, inputdata, targetdata):
    X = []
    y = []
    i = 0
    while (i + size) <= len(inputdata) - 1:
        X.append(inputdata[i: i+size])
        y.append(targetdata[i+size])
        i += 1
    assert len(X) == len(y)
    return (X, y)
X_series, y_series = window_size(WINDOW_LEN, X, y)
print(len(X))
print(len(X_series))
print(len(y_series))
X_train, X_val, y_train, y_val = train_test_split(np.array(X_series),np.array(y_series),test_size=0.3, shuffle = True)
X_val, X_test,y_val, y_test = train_test_split(np.array(X_val),np.array(y_val),test_size=0.3, shuffle = False)
n_timesteps, n_features, n_outputs = X_train.shape[1], X_train.shape[2],1
[verbose, epochs, batch_size] = [1, 300, 32]
input_shape = (n_timesteps, n_features)
model = Sequential()
# LSTM
model.add(LSTM(64, input_shape=input_shape, return_sequences = False))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu', kernel_regularizer=keras.regularizers.l2(0.001)))
#model.add(Dropout(0.2))
model.add(Dense(32, activation='relu', kernel_regularizer=keras.regularizers.l2(0.001)))
model.add(Dense(1, activation='relu'))
earlystopper = EarlyStopping(monitor='val_loss', min_delta=0, patience = 30, verbose =1, mode = 'auto')
model.summary()
model.compile(loss = 'mse', optimizer = Adam(learning_rate = 0.001), metrics=[tf.keras.metrics.RootMeanSquaredError()])
history = model.fit(X_train, y_train, batch_size = batch_size, epochs = epochs, verbose = verbose, validation_data=(X_val,y_val), callbacks = [earlystopper])
Second dataset:
tests = r'/content/drive/MyDrive/only_force_pt2.csv'
df_testing = pd.read_csv(tests)
X_testing = df_testing.iloc[:4038,1:9]
torque = df_testing.iloc[:4038,9]
print(X_testing.shape)
print(torque.shape)
plt.figure(figsize = (20, 6), dpi = 100)
plt.plot(torque)
X_testing = X_testing.to_numpy()
X_testing_series, y_testing_series = window_size(WINDOW_LEN, X_testing, torque)
X_testing_series = np.array(X_testing_series)
y_testing_series = np.array(y_testing_series)
scores = model.evaluate(X_testing_series, y_testing_series, verbose =1)
X_prediction = model.predict(X_testing_series, batch_size = 32)
If your model works well on the training data but performs badly on validation data, then it did not learn the "true" relationship between the input and output variables; it simply memorized the outputs that correspond to your training inputs. To tackle this, you can do several things:
Typically you would use 80% of your data to train and 20% to test. This presents more data to the model, which should help it learn more of the true underlying function.
If your model is too complex, it will have neurons that are only used to memorize input-output pairs. Try reducing the complexity of your model (layers, neurons) so that the remaining layers have to learn rather than memorize.
Look into more detail on training performance here
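The 80/20 split suggested above can be sketched as follows. This is a plain-Python illustration, not the train_test_split call from the question; the function name ordered_split and the choice to keep samples in their original order (no shuffling) are assumptions made here, not something the answer specifies.

```python
def ordered_split(samples, targets, train_fraction=0.8):
    """Split paired sequences into train/test parts without shuffling."""
    if len(samples) != len(targets):
        raise ValueError("samples and targets must have the same length")
    cut = int(len(samples) * train_fraction)  # index of the first test sample
    return samples[:cut], samples[cut:], targets[:cut], targets[cut:]


samples = list(range(10))           # made-up inputs
targets = [2 * s for s in samples]  # made-up outputs
X_tr, X_te, y_tr, y_te = ordered_split(samples, targets)
print(len(X_tr), len(X_te))
```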

Improve Accuracy in neural network with Keras

Below is the code for what I'm trying to do, but my accuracy is always under 50%, so I'm wondering how I should fix this. I'm trying to use the first 1885 days of daily unit sales data as input and the remaining daily unit sales data (from day 1885 onward) as output. After training on these data, I need to use the model to predict 20 more days of daily unit sales in the future.
The data I used here is provided in this link
https://drive.google.com/file/d/13qzIZMD6Wz7e1GpOsNw1_9Yq-4PI2HrC/view?usp=sharing
import pandas as pd
import numpy as np
import keras
import keras.backend as k
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.callbacks import EarlyStopping
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
data = pd.read_csv('sales_train.csv')
#Since there are 3 departments and 10 store from 3 different areas, thus I categorized the data into 30 groups and numerize them
Unique_dept = data["dept_id"].unique()
Unique_state = data['state_id'].unique()
Unique_store = data["store_id"].unique()
data0 = data.copy()
for i in range(3):
    data0["dept_id"] = data0["dept_id"].replace(to_replace=Unique_dept[i], value = i)
    data0["state_id"] = data0["state_id"].replace(to_replace=Unique_state[i], value = i)
for j in range(10):
    data0["store_id"] = data0["store_id"].replace(to_replace=Unique_store[j], value = int(Unique_store[j][3]) -1)
# Select the three numerized categorical variables and daily unit sale data
pt = 6 + 1885
X = pd.concat([data0.iloc[:,2],data0.iloc[:, 4:pt]], axis = 1)
Y = data0.iloc[:, pt:]
# Remove the daily unit sale data that are highly correlated to each other (corr > 0.9)
correlation = X.corr(method = 'pearson')
corr_lst = []
for i in correlation:
    for j in correlation:
        if (i != j) & (correlation[i][j] >= 0.9) & (j not in corr_lst) & (i not in corr_lst):
            corr_lst.append(i)
x = X.drop(corr_lst, axis = 1)
x_value = x.values
y_value = Y.values
sc = StandardScaler()
X_scale = sc.fit_transform(x_value)
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(x_value, y_value, test_size=0.2)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.5)
print(X_train.shape, X_val.shape, X_test.shape, Y_train.shape, Y_val.shape, Y_test.shape)
#create model
model = Sequential()
#get number of columns in training data
n_cols = X_train.shape[1]
#add model layers
model.add(Dense(32, activation='softmax', input_shape=(n_cols,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='softmax'))
model.add(Dense(1))
#compile model using rmsse as a measure of model performance
model.compile(optimizer='Adagrad', loss= "mean_absolute_error", metrics = ['accuracy'])
#set early stopping monitor so the model stops training when it won't improve anymore
#early_stopping_monitor = EarlyStopping(patience=3)
early_stopping_monitor = EarlyStopping(patience=20)
#train model
model.fit(X_train, Y_train,batch_size=32, epochs=10, validation_data=(X_val, Y_val))
Here is what I got
The plots are also pretty strange:
Accuracy
Loss
Two mistakes:
Accuracy is meaningless in regression settings such as yours (it is meaningful only for classification); see "What function defines accuracy in Keras when the loss is mean squared error (MSE)?" (the argument is identical when MAE loss is used, as here). Your performance measure here is the same as your loss (i.e. MAE).
We never use softmax activations in anything but the final layer of a classification model; replace both softmax activations in your model with relu (keep the last layer as is, since no activation means linear, which is indeed correct for regression).
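To illustrate the first point, here is a small plain-Python sketch comparing an equality-based "accuracy" with MAE on continuous targets. The function exact_match_accuracy is a simplified stand-in defined here for illustration, not Keras's own metric, and the numbers are made up.

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions that equal the target exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between target and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 0.0, 7.0, 2.0]  # made-up unit sales
y_pred = [2.9, 0.1, 7.2, 2.0]  # close predictions, mostly not exact

print("accuracy:", exact_match_accuracy(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
```

The predictions are close to the targets, so the MAE is small, yet only one of the four matches exactly; an equality-based score says little about the quality of a regression model.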

sklearn and weka kNN predictions exactly same for all except for one data point

I wrote kNN code using sklearn and compared its predictions with those of WEKA's kNN. The comparison used the 10 test-set predictions; only one of them shows a large difference (>1.5), while all the others are exactly the same. So I am not sure whether my code is working correctly. Here is my code:
df = pd.read_csv('xxxx.csv')
X = df.drop(['Name', 'activity'], axis=1)
y = df['activity']
Xstd = StandardScaler().fit_transform(X)
x_train, x_test, y_train, y_test = train_test_split(Xstd, y, test_size=0.2,
shuffle=False, random_state=None)
print(x_train.shape, x_test.shape)
X_train_trans = x_train
X_test_trans = x_test
for i in range(2, 3):
    knn_regressor = KNeighborsRegressor(n_neighbors=i, algorithm='brute',
                                        weights='uniform', metric='euclidean', n_jobs=1, p=2)
    CV_pred_train = cross_val_predict(knn_regressor, X_train_trans, y_train,
                                      n_jobs=-1, verbose=0, cv=LeaveOneOut())
    print("LOO Q2: ", metrics.r2_score(y_train, CV_pred_train).round(2))
# Train Test predictions
knn_regressor.fit(X_train_trans, y_train)
train_r2 = knn_regressor.score(X_train_trans, y_train)
y_train_pred = knn_regressor.predict(X_train_trans).round(3)
train_r2_1 = metrics.r2_score(y_train, y_train_pred)
y_test_pred = knn_regressor.predict(X_test_trans).round(3)
train_r = stats.pearsonr(y_train, y_train_pred)
abs_error_train = (y_train - y_train_pred)
train_predictions = pd.DataFrame({'Actual': y_train, 'Predcited':
y_train_pred, "error": abs_error_train.round(3)})
MAE_train = metrics.mean_absolute_error(y_train, y_train_pred)
abs_error_test = (y_test_pred - y_test)
test_predictions = pd.DataFrame({'Actual': y_test, 'predcited':
y_test_pred, 'error': abs_error_test.round(3)})
test_r = stats.pearsonr(y_test, y_test_pred)
test_r2 = metrics.r2_score(y_test, y_test_pred)
MAE_test = metrics.mean_absolute_error(y_test, y_test_pred).round(3)
print(test_predictions)
The training-set statistics are almost the same in both sklearn and WEKA kNN.
The sklearn predictions are:
Actual predcited error
6.00 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.440 0.480
7.96 7.150 -0.810
7.30 6.660 -0.640
6.68 7.200 0.520
***4.60 6.950 2.350***
and the WEKA predictions are:
actual predicted error
6 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.44 0.48
7.96 7.15 -0.81
7.3 6.66 -0.64
6.68 7.2 0.52
***4.6 5.285 0.685***
The parameters used in both algorithms are: k = 2, brute-force distance calculation, metric: euclidean.
Any suggestions as to what causes the difference?

CNN architecture the same but getting different results

I have a CNN that saves the bottleneck features of the training and test data with the VGG16 architecture, then uploads the features to my custom fully connected layers to classify the images.
#create data augmentations for training set; helps reduce overfitting and find more features
train_datagen = ImageDataGenerator(rescale=1./255,
shear_range = 0.2,
zoom_range = 0.2,
horizontal_flip=True)
#use ImageDataGenerator to upload validation images; data augmentation not necessary for validating process
val_datagen = ImageDataGenerator(rescale=1./255)
#load VGG16 model, pretrained on imagenet database
model = applications.VGG16(include_top=False, weights='imagenet')
#generator to load images into NN
train_generator = train_datagen.flow_from_directory(
train_dir,
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode=None,
shuffle=False)
#total number of images used for training data
num_train = len(train_generator.filenames)
#save features to numpy array file so features do not overload memory
bottleneck_features_train = model.predict_generator(train_generator, num_train // batch_size)
val_generator = val_datagen.flow_from_directory(
val_dir,
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode=None,
shuffle=False)
num_val = len(val_generator.filenames)
bottleneck_features_validation = model.predict_generator(val_generator, num_val // batch_size)
#used to retrieve the labels of the images
label_datagen = ImageDataGenerator(rescale=1./255)
#generators can create class labels for each image in either
train_label_generator = label_datagen.flow_from_directory(
train_dir,
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode=None,
shuffle=False)
#total number of images used for training data
num_train = len(train_label_generator.filenames)
#load features from VGG16 and pair each image with corresponding label (0 for normal, 1 for pneumonia)
#train_data = np.load('xray/bottleneck_features_train.npy')
#get the class labels generated by train_label_generator
train_labels = train_label_generator.classes
val_label_generator = label_datagen.flow_from_directory(
val_dir,
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode=None,
shuffle=False)
num_val = len(val_label_generator.filenames)
#val_data = np.load('xray/bottleneck_features_validation.npy')
val_labels = val_label_generator.classes
#create fully connected layers, replacing the ones cut off from the VGG16 model
model = Sequential()
#converts model's expected input dimensions to same shape as bottleneck feature arrays
model.add(Flatten(input_shape=bottleneck_features_train.shape[1:]))
#ignores a fraction of input neurons so they do not become co-dependent on each other; helps prevent overfitting
model.add(Dropout(0.7))
#normal fully-connected layer with relu activation. Replaces all negative inputs with 0 and does not fire neuron,
#creating a lighter network
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.7))
#output layer to classify 0 or 1
model.add(Dense(1, activation='sigmoid'))
#compile model and specify which optimizer and loss function to use
#optimizer used to update the weights to optimal values; adam optimizer maintains separate learning rates
#for each weight and updates accordingly
#cross-entropy function measures the ability of model to correctly classify 0 or 1
model.compile(optimizer=optimizers.Adam(lr=0.0007), loss='binary_crossentropy', metrics=['accuracy'])
#used to stop training if NN shows no improvement for 5 epochs
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=5, verbose=1)
#checks each epoch as it runs and saves the weight file from the model with the lowest validation loss
checkpointer = ModelCheckpoint(filepath=top_model_weights_dir, verbose=1, save_best_only=True)
#fit the model to the data
history = model.fit(bottleneck_features_train, train_labels,
                    epochs=epochs,
                    batch_size=batch_size,
                    callbacks=[early_stop, checkpointer],
                    verbose=2,
                    validation_data=(bottleneck_features_validation, val_labels))
After calling train_top_model(), the CNN reaches 86% accuracy after around 10 epochs.
However, when I try implementing this architecture by building the fully connected layers directly on top of the VGG16 layers, the network gets stuck at a val_acc of 0.5000 and basically does not train. Are there any issues with the code?
epochs = 10
batch_size = 20
train_datagen = ImageDataGenerator(rescale=1./255,
shear_range = 0.2,
zoom_range = 0.2,
horizontal_flip=True)
val_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
train_dir,
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode='binary',
shuffle=False)
num_train = len(train_generator.filenames)
val_generator = val_datagen.flow_from_directory(
val_dir,
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode='binary',
shuffle=False)
num_val = len(val_generator.filenames)
base_model = applications.VGG16(weights='imagenet', include_top=False, input_shape=(img_width,
img_height, 3))
x = base_model.output
x = Flatten()(x)
x = Dropout(0.7)(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.7)(x)
predictions = Dense(1, activation='sigmoid')(x)
model = Model(inputs=base_model.input, outputs=predictions)
for layer in model.layers[:19]:
    layer.trainable = False
checkpointer = ModelCheckpoint(filepath=top_model_weights_dir, verbose=1, save_best_only=True)
model.compile(optimizer=optimizers.Adam(lr=0.0007), loss='binary_crossentropy', metrics=
['accuracy'])
history = model.fit_generator(train_generator,
steps_per_epoch=(num_train//batch_size),
validation_data=val_generator,
validation_steps=(num_val//batch_size),
callbacks=[checkpointer],
verbose=1,
epochs=epochs)
The reason is that in the second approach you have not frozen the VGG16 layers; in other words, you are training the whole network, whereas in the first approach you are training only the weights of your fully connected layers.
Use something like this:
for layer in base_model.layers[:end_layer]:
    layer.trainable = False
where end_layer is the index of the last layer you are importing.

Keras accuracy metrics differ from manual computation

I am working on a binary classification problem in Keras. The loss function I use is binary_crossentropy and the metric is metrics=['accuracy']. Since the two classes are imbalanced, I use class_weight='auto' when I fit the model to the training data set.
To see the performance, I print out the accuracy by
print GNN.model.test_on_batch([test_sample_1, test_sample_2], test_label)[1]
The output is 0.973. But the result is different when I use the following lines to get the prediction accuracy
predict_label = GNN.model.predict([test_sample_1, test_sample_2])
rounded = predict_label.round(1)
print (rounded == test_label).sum()/float(rounded.shape[0])
which is 0.953.
So I am wondering how metrics=['accuracy'] evaluates the model performance and why the result is different.
For details, I attached the model summary below.
input_size = self.n_feature
encoder_size = 2000
dropout_rate = 0.5
X1 = Input(shape=(input_size, ), name='input_1')
X2 = Input(shape=(input_size, ), name='input_2')
encoder = Sequential()
encoder.add(Dropout(dropout_rate, input_shape=(input_size, )))
encoder.add(Dense(encoder_size, activation='tanh'))
encoded_1 = encoder(X1)
encoded_2 = encoder(X2)
merged = concatenate([encoded_1, encoded_2])
comparer = Sequential()
comparer.add(Dropout(dropout_rate, input_shape=(encoder_size * 2, )))
comparer.add(Dense(500, activation='relu'))
comparer.add(Dropout(dropout_rate))
comparer.add(Dense(200, activation='relu'))
comparer.add(Dropout(dropout_rate))
comparer.add(Dense(1, activation='sigmoid'))
Y = comparer(merged)
model = Model(inputs=[X1, X2], outputs=Y)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
self.model = model
And I train the model with
self.hist = self.model.fit(
x=[train_sample_1, train_sample_2],
y=train_label,
class_weight = 'auto',
validation_split=0.1,
batch_size=batch_size,
epochs=epochs,
callbacks=callbacks)
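One thing worth checking, offered as a possible explanation rather than a confirmed diagnosis: for a single sigmoid output, Keras's accuracy compares the labels with the predictions thresholded at 0.5, whereas round(1) keeps one decimal place instead of rounding to 0 or 1. The sketch below is plain Python, not the Keras implementation; the function names and the probabilities are made up for illustration.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability equals the 0/1 label."""
    hits = sum(1 for t, p in zip(y_true, y_prob) if int(p > threshold) == t)
    return hits / len(y_true)


def accuracy_after_rounding(y_true, y_prob, decimals):
    """Mimics comparing probabilities rounded to `decimals` places with the labels."""
    hits = sum(1 for t, p in zip(y_true, y_prob) if round(p, decimals) == t)
    return hits / len(y_true)


labels = [1, 1, 0, 0]
probs = [0.97, 0.62, 0.04, 0.31]  # made-up sigmoid outputs

print(binary_accuracy(labels, probs))             # thresholding at 0.5
print(accuracy_after_rounding(labels, probs, 1))  # keeps one decimal place
print(accuracy_after_rounding(labels, probs, 0))  # rounds to 0.0 or 1.0
```

With one decimal place, a prediction such as 0.62 becomes 0.6 and no longer equals the label 1, so the manual figure can come out lower than the thresholded one. If this applies to your case, calling round() with no argument (or comparing against 0.5 directly) should bring the two numbers closer.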
