Learning curve is the same for training and validation? - machine-learning

I've been training a neural network with scikit-learn's MLPRegressor using ShuffleSplit with 10 splits and 20% of the data set aside for testing. First I use GridSearchCV to find good parameters. I then instantiate a new (unfitted) estimator with those params, and finally use the plot_learning_curve function, with a MAPE scorer and the same ShuffleSplit cv.
In most of the learning-curve examples I've seen, the validation and training curves are distinctly separate. However, I've consistently been getting learning curves where the cross-validation and training curves are almost identical. How should I interpret this - does it seem realistic, or have I made a mistake somewhere?
[Learning curve plot]
As requested, here's the code:
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.neural_network import MLPRegressor

node_range = list(range(1, 16))
layer_range = range(1, 6)
hidden_sizes = [(nodes,) * layers for layers in layer_range for nodes in node_range]
param_grid = [{'hidden_layer_sizes': hidden_sizes,
               'activation': ['relu'],
               'learning_rate_init': [0.5]}]

cv = ShuffleSplit(n_splits=10, test_size=0.2)
estimator = MLPRegressor()
search = GridSearchCV(estimator, param_grid, cv=cv, scoring=neg_MAPE, refit=True)
search.fit(X, y)

# Re-instantiate an unfitted estimator with the best parameters and plot its learning curve
best_params = search.best_params_
estimator = MLPRegressor().set_params(**best_params)
plot_learning_curve(estimator, X, y, cv=cv, scoring=neg_MAPE)
And here is my scorer:
import numpy as np
from sklearn.metrics import make_scorer

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

neg_MAPE = make_scorer(mean_absolute_percentage_error, greater_is_better=False)
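For what it's worth, the same scorer can be sanity-checked outside the plotting helper with cross_val_score; a minimal sketch reusing estimator, X, y, cv and neg_MAPE from above:
from sklearn.model_selection import cross_val_score

# Sketch: scores come back negated because greater_is_better=False.
scores = cross_val_score(estimator, X, y, cv=cv, scoring=neg_MAPE)
print('CV MAPE: %.2f%% (+/- %.2f)' % (-scores.mean(), scores.std()))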

Related

Discriminator's loss stuck at value = 1 while training conditional GAN

I am training a conditional GAN that generates image time series (similar to video prediction). I built a conditional GAN based on this paper. However, several problems came up while training the cGAN.
Problems when training the cGAN:
The discriminator's loss is stuck at 1.
The generator's loss seems unaffected by the discriminator, no matter how I adjust the discriminator-related hyperparameters.
Training loss of discriminator
D_loss = (fake_D_loss + true_D_loss) / 2
fake_D_loss = Hinge_loss(D(G(x, z)))
true_D_loss = Hinge_loss(D(x, y))
The margin of hinge loss = 1
Training loss of generator
GD_loss = -torch.mean(D(G(x, z)))
G_loss = weighted MAE
[Figures: gradient flow of the discriminator and of the generator]
Several settings of the cGAN:
The output layer of the discriminator is a linear sum.
The discriminator is trained twice per epoch, while the generator is trained only once.
The numbers of neurons in the generator and discriminator are exactly the same as in the paper.
I replaced ReLU (the original setting) with LeakyReLU to avoid NaNs.
I added gradient norm clipping to avoid the vanishing-gradient problem.
Other hyperparameters are listed below:

Hyperparameter               Paper   Mine
number of input images       4       4
number of predicted images   18      10
batch size                   16      16
opt_g, opt_d                 Adam    Adam
lr_g                         5e-5    5e-5
lr_d                         2e-4    2e-4
The loss function I use for the discriminator:
import torch.nn.functional as F

def HingeLoss(pred, validity, margin=1.):
    # validity=True for real pairs, False for generated ones
    if validity:
        loss = F.relu(margin - pred)
    else:
        loss = F.relu(margin + pred)
    return loss.mean()
The loss function that scores the validity of the images predicted by the generator:
def HingeLossG(pred):
    return -torch.mean(pred)
I use the pytorch_lightning Trainer to train the model. The training code I wrote is as follows.
def training_step(self, batch, batch_idx, optimizer_idx):
    x, y = batch
    x.requires_grad = True
    if self.n_sample > 1:
        # average several stochastic forward passes of the generator
        pred = [self(x) for _ in range(self.n_sample)]
        pred = torch.mean(torch.stack(pred, dim=0), dim=0)
    else:
        pred = self(x)

    ##### TRAIN DISCRIMINATOR #####
    if optimizer_idx == 1:
        true_D_loss = self.discriminator_loss(self.discriminator(x, y), True)
        fake_D_loss = self.discriminator_loss(self.discriminator(x, pred.detach()), False)
        D_loss = (fake_D_loss + true_D_loss) / 2
        return D_loss

    ##### TRAIN GENERATOR #####
    if optimizer_idx == 0:
        G_loss = self.generator_loss(pred, y)
        GD_loss = self.generator_d_loss(self.discriminator(x, pred.detach()))
        train_G_loss = G_loss + GD_loss
        return train_G_loss
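For context, a minimal configure_optimizers matching the optimizer_idx convention above might look like the sketch below. This assumes an older pytorch_lightning release where training_step receives optimizer_idx, and that the generator module is stored as self.generator; the "frequency" entries are one way to make the discriminator take more steps than the generator.
def configure_optimizers(self):
    # List order defines optimizer_idx: 0 -> generator, 1 -> discriminator,
    # matching the branches in training_step above.
    opt_g = torch.optim.Adam(self.generator.parameters(), lr=5e-5)      # self.generator is assumed
    opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)
    return (
        {"optimizer": opt_g, "frequency": 1},
        {"optimizer": opt_d, "frequency": 2},  # two D updates per G update
    )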
I have several guesses about why these problems occur:
Since the original model predicts 18 frames rather than 10 (my version), the number of neurons in the original generator may be too large for my case, leading to an overly powerful generator that breaks the training balance. However, lowering the generator's learning rate to 1e-5 (originally 5e-5) or training the discriminator 3 to 5 times as often barely changed the generator's loss curve.
[Figure: various results of training the cGAN]
I have also adjusted the weights of the generator's loss, but the same problems still occur.
The architecture code of this model: https://github.com/hyungting/DGMR-pytorch

Model validation with separate dataset

Apologies, I'm quite new to sklearn. I'm trying to validate a model for binary classification of text strings using an external dataset. I've trained the model, but I want to use it for prediction on another dataset of a different size rather than include that data in the initial dataset split. Is this even possible?
Initial split
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(3, 3))
Xprod = vectorizer.fit_transform(prod_good)
X = vectorizer.fit_transform(total_requests)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=21)
Test the model
linear_svm=LinearSVC(C=1)
linear_svm.fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)
score_test = metrics.accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
New prediction
newpred = linear_svm.predict(Xprod)
...
Error:
ValueError: X has 4553 features per sample; expecting 24422
I think I'm misunderstanding some basic concepts here.
The function fit_transform does a fit followed by a transform. So this line fits your vectorizer and then transforms total_requests into X:
X = vectorizer.fit_transform(total_requests)
Since your vectorizer must be fitted only once (so that every dataset is mapped into the same feature space), you only need transform to compute Xprod:
Xprod = vectorizer.transform(prod_good)
Also, Xprod must be computed after the vectorizer has been fitted, i.e. after X.
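Putting the pieces together, the corrected flow would look roughly like this (a sketch reusing the names from the question):
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(3, 3))
X = vectorizer.fit_transform(total_requests)   # fit once on the training corpus, then transform it
# ... train_test_split, LinearSVC fitting and evaluation as before ...
Xprod = vectorizer.transform(prod_good)        # transform only: same feature space as X (24422 columns)
newpred = linear_svm.predict(Xprod)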

Linear regression gives worse results after normalization or standardization

I'm performing linear regression on this dataset:
archive.ics.uci.edu/ml/datasets/online+news+popularity
It contains various types of features - rates, binary flags, counts, etc.
I've tried using scikit-learn's Normalizer, StandardScaler and PowerTransformer, but they've all given worse results than not using them.
I'm using them like this:
from sklearn.preprocessing import StandardScaler
X = df.drop(columns=['url', 'shares'])
Y = df['shares']
transformer = StandardScaler().fit(X)
X_scaled = transformer.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
perform_linear_and_ridge_regression(X=X_scaled, Y=Y)
The function on the last line, perform_linear_and_ridge_regression(), is definitely correct; it uses GridSearchCV to determine the best hyperparameters.
Just to make sure I include the function as well:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

def perform_linear_and_ridge_regression(X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=10)
    lin_reg_parameters = {'fit_intercept': [True, False]}
    lin_reg = GridSearchCV(LinearRegression(), lin_reg_parameters, cv=5)
    lin_reg.fit(X=X_train, y=Y_train)
    Y_pred = lin_reg.predict(X_test)
    print('Linear regression MAE =', median_absolute_error(Y_test, Y_pred))
The results are surprising as all of them provide worse results:
Linear reg. on original data: MAE = 1620.510555135375
Linear reg. after using Normalizer: MAE = 1979.8525218964242
Linear reg. after using StandardScaler: MAE = 2915.024521207241
Linear reg. after using PowerTransformer: MAE = 1663.7148884463259
Is this just a special case, where Standardization doesn't help, or am I doing something wrong?
EDIT: Even when I leave the binary features out, most of the transformers give worse results.
Your dataset has many categorical and ordinal features; you should take care of those separately first. It also looks like you are applying normalization to the categorical variables, which is wrong.
Here is a nice link which explains how to handle categorical features for regression problems.
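One way to do that with scikit-learn, reusing X, Y and perform_linear_and_ridge_regression from the question, is a ColumnTransformer that scales only the non-binary columns; a sketch, where the binary-column detection is just a heuristic:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Heuristic: treat 0/1-valued columns (data channel and weekday dummies, etc.) as binary
# and leave them untouched; standardize everything else.
binary_cols = [c for c in X.columns if set(X[c].unique()) <= {0, 1}]
numeric_cols = [c for c in X.columns if c not in binary_cols]

preprocessor = ColumnTransformer([('scale', StandardScaler(), numeric_cols)],
                                 remainder='passthrough')  # binary columns pass through unchanged

X_scaled = pd.DataFrame(preprocessor.fit_transform(X), columns=numeric_cols + binary_cols)
perform_linear_and_ridge_regression(X=X_scaled, Y=Y)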

Keras: model with one input and two outputs, trained jointly on different data (semi-supervised learning)

I would like to code with Keras a neural network that acts both as an autoencoder AND a classifier for semi-supervised learning. Take for example this dataset, where there are a few labeled images and a lot of unlabeled images: https://cs.stanford.edu/~acoates/stl10/
Some papers listed here achieved that, or very similar things, successfully.
To sum up: the model would have the same input shape and the same "encoding" convolutional layers, but would split into two heads (fork-style), a classification head and a decoding head, so that the unsupervised autoencoder contributes to good learning for the classification head.
With TensorFlow there would be no problem doing that, since we have full control over the computational graph.
But with Keras things are more high-level, and I feel that every call to ".fit" must always provide all the data at once (which would force me to tie the classification head and the autoencoding head together into one time-step).
One way in keras to almost do that would be with something that goes like this:
input = Input(shape=(32, 32, 3))
cnn_feature_map = sequential_cnn_trunk(input)
classification_predictions = Dense(10, activation='sigmoid')(cnn_feature_map)
autoencoded_predictions = decode_cnn_head_sequential(cnn_feature_map)

model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit([images], [labels, images], epochs=10)
However, I think and fear that if I just try to fit one head at a time like this, it will fail and ask for the missing target:
for epoch in range(10):
    # classification step
    model.fit([images], [labels, None], epochs=1)
    # "semi-unsupervised" autoencoding step
    model.fit([images], [None, images], epochs=1)
    # note: ".train_on_batch" could probably be used rather than ".fit" to avoid doing a whole epoch each time.
How should one implement that behavior with Keras? And could the training be done jointly without having to split the two calls to the ".fit" function?
Sometimes, when you don't have a label, you can pass a zero vector instead of a one-hot encoded vector. This should not change your result, because a zero vector contributes no error signal under the categorical cross-entropy loss.
My custom to_categorical function looks like this:
import numpy as np

def tricky_to_categorical(y, translator_dict):
    encoded = np.zeros((y.shape[0], len(translator_dict)))
    for i in range(y.shape[0]):
        if y[i] in translator_dict:
            encoded[i][translator_dict[y[i]]] = 1
    return encoded
where y contains the labels and translator_dict is a Python dictionary that maps each label to a unique index, like this:
{'unisex': 2, 'female': 1, 'male': 0}
If an unknown label can't be found in this dictionary, its encoded label will be a zero vector.
If you use this trick, you also have to modify your accuracy function to see real accuracy numbers: you have to filter out all zero vectors from the metric.
import tensorflow as tf
from keras import backend as K

def tricky_accuracy(y_true, y_pred):
    mask = K.not_equal(K.sum(y_true, axis=-1), K.constant(0))  # mask out zero-vector (unlabeled) targets
    y_true = tf.boolean_mask(y_true, mask)
    y_pred = tf.boolean_mask(y_pred, mask)
    return K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())
Note: you have to use larger batches (e.g. 32) to avoid updates from all-zero label matrices; these can make the accuracy metric behave strangely, and I don't know why.
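For reference, the masked metric then plugs in like any ordinary Keras metric; a minimal sketch, where images and raw_labels are hypothetical arrays and unlabeled samples receive all-zero targets from tricky_to_categorical:
# Unlabeled samples carry all-zero targets: they contribute no gradient under
# categorical cross-entropy and are filtered out of the accuracy by tricky_accuracy.
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=[tricky_accuracy])
model.fit(images, tricky_to_categorical(raw_labels, translator_dict),
          batch_size=32, epochs=10)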
Alternative solution
Use Pseudo Labeling :)
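Pseudo-labeling here would mean training on the labeled subset, predicting labels for the unlabeled pool, and retraining on the union of both; a rough sketch, assuming a plain single-output classifier and illustrative array names (labeled_images, labels, unlabeled_images):
import numpy as np

# 1. Train on the labeled subset only.
model.fit(labeled_images, labels, epochs=10)

# 2. Predict on the unlabeled pool and keep only confident predictions.
probs = model.predict(unlabeled_images)
confident = probs.max(axis=1) > 0.95
pseudo_labels = np.eye(probs.shape[1])[probs.argmax(axis=1)]

# 3. Retrain on labeled + confidently pseudo-labeled data.
X_aug = np.concatenate([labeled_images, unlabeled_images[confident]])
y_aug = np.concatenate([labels, pseudo_labels[confident]])
model.fit(X_aug, y_aug, epochs=10)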
You can train jointly; you just have to pass an array of targets instead of a single label. I used fit_generator, e.g.:
model.fit_generator(
    batch_generator(),
    steps_per_epoch=len(dataset) // batch_size,
    epochs=epochs)

def batch_generator():
    batch_x = np.empty((batch_size, img_height, img_width, 3))
    gender_label_batch = np.empty((batch_size, len(gender_dict)))
    category_label_batch = np.empty((batch_size, len(category_dict)))
    while True:
        i = 0
        for idx in np.random.choice(len(dataset), batch_size):
            image_id = dataset[idx][0]
            batch_x[i] = load_and_convert_image(image_id)
            gender_label_batch[i] = gender_labels[idx]
            category_label_batch[i] = category_labels[idx]
            i += 1
        # two label arrays -> one per output head
        yield batch_x, [gender_label_batch, category_label_batch]
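Mapping this back to the autoencoder-plus-classifier model from the question, the joint version just gives each head its own loss at compile time (a sketch; the reconstruction loss and the loss weights are illustrative choices, not prescribed values):
# One model, two heads, trained jointly on (labels, images) targets.
model = Model(inputs=[input],
              outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss=['binary_crossentropy', 'mse'],  # classification head, reconstruction head
              loss_weights=[1.0, 0.5])              # illustrative weighting of the two heads
model.fit([images], [labels, images], epochs=10)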

How do you plot learning curves for Random Forest models?

Following Andrew Ng's machine learning course, I'd like to try his method of plotting learning curves (cost versus number of samples) in order to evaluate the need for additional data samples. However, with Random Forests I'm confused about how to plot a learning curve: they don't seem to have a basic cost function like, for example, linear regression, so I'm not sure what exactly to use on the y-axis.
You can use this function to plot the learning curve of any estimator (including a random forest).
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def learning_curves(estimator, data, features, target, train_sizes, cv):
    train_sizes, train_scores, validation_scores = learning_curve(
        estimator, data[features], data[target], train_sizes=train_sizes,
        cv=cv, scoring='neg_mean_squared_error')
    train_scores_mean = -train_scores.mean(axis=1)
    validation_scores_mean = -validation_scores.mean(axis=1)
    plt.plot(train_sizes, train_scores_mean, label='Training error')
    plt.plot(train_sizes, validation_scores_mean, label='Validation error')
    plt.ylabel('MSE', fontsize=14)
    plt.xlabel('Training set size', fontsize=14)
    title = 'Learning curves for a ' + str(estimator).split('(')[0] + ' model'
    plt.title(title, fontsize=18, y=1.03)
    plt.legend()
    plt.ylim(0, 40)

Plotting the learning curves using this function:

from sklearn.ensemble import RandomForestRegressor

plt.figure(figsize=(16, 5))
model = RandomForestRegressor()
plt.subplot(1, 2, 1)  # i was undefined in the original snippet; use the first panel
learning_curves(model, data, features, target, train_sizes, 5)
It might be possible that you're confusing a few categories here.
To begin with, in machine learning, the learning curve is defined as
Plots relating performance to experience.... Performance is the error rate or accuracy of the learning system, while experience may be the number of training examples used for learning or the number of iterations used in optimizing the system model parameters.
Both random forests and linear models can be used for regression or classification.
For regression, the cost is usually a function of the l2 norm (although sometimes the l1 norm) of the difference between the prediction and the signal.
For classification, the cost is usually the misclassification rate or the log loss.
The point is that it's not a question of whether the underlying mechanism is a linear model or a forest. You should decide what type of problem it is and what the cost function is. Once you've decided that, plotting the learning curve is just a function of the signal and the predictions.
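Concretely, once you've picked the problem type and the cost, the same scikit-learn helper works for a forest; a minimal sketch for a classifier with log loss as the cost, assuming an existing classification dataset X, y:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Learning curve of a random forest classifier, with log loss as the cost.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100), X, y,
    cv=5, scoring='neg_log_loss', train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(train_sizes, -train_scores.mean(axis=1), label='Training log loss')
plt.plot(train_sizes, -val_scores.mean(axis=1), label='Validation log loss')
plt.xlabel('Training set size')
plt.ylabel('Log loss')
plt.legend()
plt.show()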
