Applying encoding to train and test data separately - machine-learning

I have a feature in my dataset called State. After splitting, I apply encoding to the train set like this:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['State'])], remainder='passthrough')
encoded_X_train = ct.fit_transform(X_train)
and train the model like this:
regressor = LinearRegression()
regressor.fit(encoded_X_train, y_train)
then encode the test set and predict like this:
encoded_X_test = ct.fit_transform(X_test)
y_pred = regressor.predict(encoded_X_test)
Is this the right process of doing so, or am I doing something wrong?

No. You should fit the encoder on the training data only.
fit_transform first fits the transformer to the data and then transforms it, so calling it again on the test set refits the encoder on data the model should never learn from.
Thus, you should use the following code instead.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['State'])], remainder='passthrough')
encoded_X_train = ct.fit_transform(X_train)
encoded_X_test = ct.transform(X_test)
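If you want to make it harder to accidentally refit on the test data, one option (a minimal sketch, reusing the ColumnTransformer definition from the question) is to put the transformer and the regressor in a Pipeline, so that fit and predict apply the encoding consistently:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

pipe = Pipeline(steps=[('encode', ct), ('regress', LinearRegression())])
pipe.fit(X_train, y_train)       # fits the encoder and the regression on the train split only
y_pred = pipe.predict(X_test)    # the test split is only transformed, never fitted on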

Related

Using random forest model to predict values from a new dataset

I have built a random forest model to predict 'size' from several values of 'shape'. I now want to input new data, not from the data the model was created on, and have it predict the values of 'size'. I was wondering how to format this, and how to get the predicted results. I have attached what the original data looks like, as well as the new data I want to input into my model. Should I change X_test to be the new data? I have also attached the code I used to create the model. Thank you so much!
X = dataset[headers]
y = dataset["size"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
sc = StandardScaler()
scaled_X_train = sc.fit_transform(X_train)
scaled_X_test = sc.transform(X_test)
classifier = RandomForestClassifier(n_estimators=20, random_state = 0)
classifier = classifier.fit(scaled_X_train,y_train)
y_predicted = classifier.predict(scaled_X_test)
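For new, unseen data the same pattern applies: the scaler and the classifier are fitted once on the training data and only reused afterwards, never refitted. A minimal sketch, where new_data is a hypothetical DataFrame with the same feature columns as the training data:
new_X = new_data[headers]                        # hypothetical new rows, same columns as X
scaled_new_X = sc.transform(new_X)               # reuse the scaler fitted on X_train
size_predictions = classifier.predict(scaled_new_X)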

scikit-learn and imblearn: does GridSearchCV/RandomizedSearchCV apply preprocessing to the validation set as well?

I'm currently using sklearn for a school project and I have some questions about how GridSearchCV applies preprocessing algorithms such as PCA or Factor Analysis. Let's suppose I perform hold-out:
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = 0.1, stratify = y)
Then, I declare some hyperparameters and perform a GridSearchCV (it would be the same with RandomizedSearchCV, but whatever):
params = {
    'svc__C' : [...],
    'svc__tol' : [...],
    'svc__degree' : [...]
}
clf = make_pipeline(PCA(), SVC(kernel='linear'))
model = GridSearchCV(clf, params, cv = 5, verbose = 2, n_jobs = -1)
model.fit(X_tr, y_tr)
My issue is: my teacher told me that you should never fit the preprocessing algorithm (here PCA) on the validation set in a k-fold CV, but only on the train split (here both the train split and the validation split are subsets of X_tr, and of course they change at every fold). So PCA should be fitted on the part of the fold used for training the model, and when the resulting model is evaluated against the validation split, that split should be preprocessed with the PCA model fitted on the training part. This ensures no leaks whatsoever.
Does sklearn account for this?
And if it does: suppose that now I want to use imblearn to perform oversampling on an unbalanced set:
clf = make_pipeline(SMOTE(), SVC(kernel='linear'))
Still according to my teacher, you shouldn't perform oversampling on the validation split either, as this could lead to misleading accuracy figures. So the statement above that held for PCA, about transforming the validation set afterwards, does not apply here.
Does sklearn/imblearn account for this as well?
Many thanks in advance
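For reference, an sklearn Pipeline inside GridSearchCV is refitted on the training part of each fold only, and imblearn's pipeline applies samplers such as SMOTE only during fitting, never when scoring the validation fold. A minimal sketch of the imblearn variant, with illustrative values that are not from the question:
from imblearn.pipeline import make_pipeline  # note: imblearn's pipeline, not sklearn's
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

clf = make_pipeline(SMOTE(), SVC(kernel='linear'))
params = {'svc__C': [0.1, 1, 10]}            # illustrative grid, not from the question
model = GridSearchCV(clf, params, cv=5, n_jobs=-1)
model.fit(X_tr, y_tr)                        # SMOTE resamples only the training part of each fold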

Keras Denoising Autoencoder (tabular data)

I have a project where I am doing a regression with Gradient Boosted Trees using tabular data. I want to see if using a denoising autoencoder on my data can find a better representation of my original data and improve my original GBT scores. Inspiration is taken from the popular Kaggle winner here.
AFAIK I have two main choices for extracting the activations of the DAE: creating a bottleneck structure and taking the single middle layer's activations, or concatenating every layer's activations as the representation.
Let's assume I want all layer activations from the 3x 512 node layers below:
inputs = Input(shape=(31,))
encoded = Dense(512, activation='relu')(inputs)
encoded = Dense(512, activation='relu')(encoded)
decoded = Dense(512, activation='relu')(encoded)
decoded = Dense(31, activation='linear')(decoded)
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='Adam', loss='mse')
history = autoencoder.fit(x_train_noisy, x_train_clean,
                          epochs=100,
                          batch_size=128,
                          shuffle=True,
                          validation_data=(x_test_noisy, x_test_clean),
                          callbacks=[reduce_lr])
My questions are:
Taking the activations of the above will give me a new representation of x_train, right? Should I repeat this process for x_test? I need both to train my GBT model.
How can I do inference? Each new data point will need to be "converted" into this new representation format. How can I do that with Keras?
Do I actually need to provide validation_data= to .fit in this situation?
Taking the activations of the above will give me a new representation
of x_train, right? Should I repeat this process for x_test? I need
both to train my GBT model.
Of course, you need the denoised representation for both the training and the test data, because the GBT model that you train afterwards only accepts the denoised features as input.
How can I do inference? Each new data point will need to be
"converted" into this new representation format. How can I do that
with Keras?
If you want to use the denoised/reconstructed feature, you can directly use autoencoder.predict( X_feat ) to extract features. If you want to use the middle layer, you need to build a new model encoder_only=Model(inputs, encoded) first and use it for feature extraction.
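A minimal sketch of that extraction step, assuming the autoencoder from the question has already been trained (layer indices 1-3 point at the three 512-unit Dense layers in that particular model):
from keras.models import Model
from keras.layers import concatenate

# middle-layer representation only
encoder_only = Model(autoencoder.input, autoencoder.layers[2].output)
x_train_repr = encoder_only.predict(x_train_noisy)

# or concatenate the activations of all three hidden layers
hidden = [autoencoder.layers[i].output for i in (1, 2, 3)]
all_layers = Model(autoencoder.input, concatenate(hidden))
x_train_repr_concat = all_layers.predict(x_train_noisy)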
Do I actually need to provide validation_data= to .fit in this
situation?
It is better to hold out some of the training data for validation to detect overfitting. However, you can always train multiple models, e.g. in a leave-one-out fashion, and combine them in an ensemble so that all data is used.
Additional remarks:
512 hidden neurons seems to be too many for your task
consider using Dropout
be careful with tabular data, especially when the columns have different dynamic ranges (i.e. MSE does not weight the reconstruction errors of the different columns fairly); see the sketch below
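On the last remark, a minimal sketch of one way to put the columns on a comparable scale before training, assuming x_train_clean, x_train_noisy, x_test_clean and x_test_noisy are NumPy arrays:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train_clean)     # fit on the clean training data only
x_train_clean = scaler.transform(x_train_clean)
x_train_noisy = scaler.transform(x_train_noisy)
x_test_clean = scaler.transform(x_test_clean)
x_test_noisy = scaler.transform(x_test_noisy)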
A denoising autoencoder is a model that learns to remove noise from noisy data: the noisy data is used as the input and the clean data as the target.
The model you are describing above is not a typical autoencoder: in the encoding part the number of units should gradually decrease from layer to layer, and in the decoding part it should gradually increase again.
Simple autoencoder model should look like this:
input = Input(shape=(31,))
encoded = Dense(128, activation='relu')(input)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)
decoded = Dense(32, activation='relu')(encoded)
decoded = Dense(64, activation='relu')(decoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(31, activation='sigmoid')(decoded)
autoencoder = Model(input, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(x_train_noisy, x_train_noisy,
                epochs=100,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test_noisy, x_test_noisy))

Using different loss functions for different outputs simultaneously Keras?

I'm trying to make a network that outputs a depth map, and semantic segmentation data separately.
In order to train the network, I'd like to use categorical cross entropy for the segmentation branch, and mean squared error for the branch that outputs the depth map.
I couldn't find any info in the Keras documentation for the Functional API on using a different loss function for each branch.
Is it possible for me to use these loss functions simultaneously during training, or would it be better for me to train the different branches separately?
From the documentation of Model.compile:
loss: String (name of objective function) or objective function. See
losses. If the model has multiple outputs, you can use a different
loss on each output by passing a dictionary or a list of losses. The
loss value that will be minimized by the model will then be the sum of
all individual losses.
If your outputs are named, you can use a dictionary mapping the names to the corresponding losses:
x = Input((10,))
out1 = Dense(10, activation='softmax', name='segmentation')(x)
out2 = Dense(10, name='depth')(x)
model = Model(x, [out1, out2])
model.compile(loss={'segmentation': 'categorical_crossentropy', 'depth': 'mse'},
              optimizer='adam')
Otherwise, use a list of losses (in the same order as the corresponding model outputs).
x = Input((10,))
out1 = Dense(10, activation='softmax')(x)
out2 = Dense(10)(x)
model = Model(x, [out1, out2])
model.compile(loss=['categorical_crossentropy', 'mse'], optimizer='adam')
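If the two losses have very different magnitudes, compile also accepts a loss_weights argument to rebalance their contributions to the summed loss; the weights below are illustrative, not prescribed values:
model.compile(loss={'segmentation': 'categorical_crossentropy', 'depth': 'mse'},
              loss_weights={'segmentation': 1.0, 'depth': 0.5},
              optimizer='adam')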

Keras: model with one input and two outputs, trained jointly on different data (semi-supervised learning)

I would like to code with Keras a neural network that acts both as an autoencoder AND a classifier for semi-supervised learning. Take for example this dataset where there is a few labeled images and a lot of unlabeled images: https://cs.stanford.edu/~acoates/stl10/
Some papers listed here achieved that, or very similar things, successfully.
To sum up: the model would have the same input data shape and the same "encoding" convolutional layers, but would split into two heads (fork-style), a classification head and a decoding head, so that the unsupervised autoencoder contributes to good learning for the classification head.
With TensorFlow there would be no problem doing that as we have full control over the computational graph.
But with Keras, things are more high-level and I feel that all the calls to ".fit" must always provide all the data at once (so it would force me to tie together the classification head and the autoencoding head into one time-step).
One way in keras to almost do that would be with something that goes like this:
input = Input(shape=(32, 32, 3))
cnn_feature_map = sequential_cnn_trunk(input)
classification_predictions = Dense(10, activation='sigmoid')(cnn_feature_map)
autoencoded_predictions = decode_cnn_head_sequential(cnn_feature_map)
model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit([images], [labels, images], epochs=10)
However, I think and I fear that if I just want to fit things in that way it will fail and ask for the missing head:
for epoch in range(10):
    # classification step
    model.fit([images], [labels, None], epochs=1)
    # "semi-unsupervised" autoencoding step
    model.fit([images], [None, images], epochs=1)
    # note: ".train_on_batch" could probably be used rather than ".fit" to avoid doing a whole epoch each time.
How should one implement that behavior with Keras? And could the training be done jointly without having to split the two calls to the ".fit" function?
Sometimes, when you don't have a label for a sample, you can pass a zero vector instead of a one-hot encoded vector. It should not change your result, because a zero vector produces no error signal with the categorical cross-entropy loss.
My custom to_categorical function looks like this:
import numpy as np

def tricky_to_categorical(y, translator_dict):
    encoded = np.zeros((y.shape[0], len(translator_dict)))
    for i in range(y.shape[0]):
        if y[i] in translator_dict:
            encoded[i][translator_dict[y[i]]] = 1
    return encoded
where y contains the labels, and translator_dict is a Python dictionary which maps labels to their unique indices, like this:
{'unisex':2, 'female': 1, 'male': 0}
If a label can't be found in this dictionary (an unknown label), its encoding will be a zero vector.
If you use this trick, you also have to modify your accuracy function to see meaningful accuracy numbers: you have to filter all zero vectors out of the metric.
from keras import backend as K
import tensorflow as tf

def tricky_accuracy(y_true, y_pred):
    mask = K.not_equal(K.sum(y_true, axis=-1), K.constant(0))  # zero-vector mask
    y_true = tf.boolean_mask(y_true, mask)
    y_pred = tf.boolean_mask(y_pred, mask)
    return K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())
Note: you have to use larger batches (e.g. 32) in order to avoid updates from batches that contain only zero vectors, because those can make the accuracy metric behave erratically; I don't know why.
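A minimal usage sketch, assuming a single-output classification model and the dictionary above; y_raw and images are hypothetical arrays of string labels and input samples:
labels = tricky_to_categorical(y_raw, {'unisex': 2, 'female': 1, 'male': 0})
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=[tricky_accuracy])   # custom metrics are passed by function reference
model.fit(images, labels, batch_size=32, epochs=10)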
Alternative solution
Use Pseudo Labeling :)
You can train jointly; you have to pass an array of targets instead of a single label.
I used fit_generator, e.g.
model.fit_generator(
    batch_generator(),
    steps_per_epoch=len(dataset) / batch_size,
    epochs=epochs)

def batch_generator():
    batch_x = np.empty((batch_size, img_height, img_width, 3))
    gender_label_batch = np.empty((batch_size, len(gender_dict)))
    category_label_batch = np.empty((batch_size, len(category_dict)))
    while True:
        i = 0
        for idx in np.random.choice(len(dataset), batch_size):
            image_id = dataset[idx][0]
            batch_x[i] = load_and_convert_image(image_id)
            gender_label_batch[i] = gender_labels[idx]
            category_label_batch[i] = category_labels[idx]
            i += 1
        yield batch_x, [gender_label_batch, category_label_batch]
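Tying this back to the model sketched in the question, a minimal joint-training setup could look like the following; the second loss and the weights are assumptions, not the only possible choice:
model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss=['binary_crossentropy', 'mse'],   # one loss per head, summed by Keras
              loss_weights=[1.0, 1.0])
model.fit([images], [labels, images], epochs=10, batch_size=32)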
