I have a project where I am doing a regression with Gradient Boosted Trees using tabular data. I want to see if using a denoising autoencoder on my data can find a better representation of my original data and improve my original GBT scores. Inspiration is taken from the popular Kaggle winner here.
AFAIK I have two main choices for extracting the activations of the DAE: creating a bottleneck structure and taking the single middle layer's activations, or concatenating every layer's activations as the representation.
Let's assume I want all layer activations from the 3x 512 node layers below:
inputs = Input(shape=(31,))
encoded = Dense(512, activation='relu')(inputs)
encoded = Dense(512, activation='relu')(encoded)
decoded = Dense(512, activation='relu')(encoded)
decoded = Dense(31, activation='linear')(decoded)
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='Adam', loss='mse')
history = autoencoder.fit(x_train_noisy, x_train_clean,
                          epochs=100,
                          batch_size=128,
                          shuffle=True,
                          validation_data=(x_test_noisy, x_test_clean),
                          callbacks=[reduce_lr])
My questions are:
Taking the activations of the above will give me a new representation of x_train, right? Should I repeat this process for x_test? I need both to train my GBT model.
How can I do inference? Each new data point will need to be "converted" into this new representation format. How can I do that with Keras?
Do I actually need to provide validation_data= to .fit in this situation?
Taking the activations of the above will give me a new representation
of x_train, right? Should I repeat this process for x_test? I need
both to train my GBT model.
Of course, you need the denoised representation for both the training and the testing data, because the GBT model that you train later only accepts the denoised features as input.
How can I do inference? Each new data point will need to be
"converted" into this new representation format. How can I do that
with Keras?
If you want to use the denoised/reconstructed features, you can directly use autoencoder.predict(X_feat) to extract them. If you want to use the middle layer, you need to build a new model, encoder_only = Model(inputs, encoded), first and use it for feature extraction.
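For example, a minimal sketch of both options, reusing the tensors from the snippet in the question (x_train, x_test, y_train and the LightGBM estimator are placeholders for whatever you already use):

from tensorflow.keras.models import Model
from tensorflow.keras.layers import concatenate

# Option 1: middle-layer features only.
# `encoded` is the output tensor of the second 512-unit Dense layer in the question.
encoder_only = Model(inputs, encoded)
z_train = encoder_only.predict(x_train)
z_test = encoder_only.predict(x_test)

# Option 2: concatenate the activations of all three 512-unit layers.
# Layers 1-3 of `autoencoder` are the Dense(512) layers (layer 0 is the InputLayer).
hidden = [autoencoder.layers[i].output for i in range(1, 4)]
all_activations = Model(inputs, concatenate(hidden))
z_train = all_activations.predict(x_train)
z_test = all_activations.predict(x_test)

# Either representation then feeds the GBT model, e.g.
# gbt = lightgbm.LGBMRegressor().fit(z_train, y_train)
# preds = gbt.predict(z_test)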
Do I actually need to provide validation_data= to .fit in this
situation?
You had better hold out some of the training data for validation to detect overfitting. However, you can always train multiple models, e.g. in a leave-one-out / cross-validation fashion, so that all of the data is used, and combine them in an ensemble.
Additional remarks:
512 hidden neurons seem to be too many for a task with only 31 input features
consider using Dropout
be careful with tabular data, especially when different columns have different dynamic ranges (i.e. MSE does not weight the reconstruction errors of the different columns fairly); see the sketch below
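To illustrate that last remark, a minimal sketch of standardizing the columns before training the DAE, assuming scikit-learn is available (x_train_clean / x_test_clean are the arrays from the question):

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to both splits,
# so every column contributes to the MSE on a comparable scale.
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train_clean)
x_test_scaled = scaler.transform(x_test_clean)
# Add the noise to the scaled data before training the DAE,
# and reuse the same scaler at inference time for new rows.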
A denoising autoencoder is a model that helps to denoise noisy data. As training data we use our train data, with the same data as the target.
The model you are describing above is not a denoising autoencoder model. In an autoencoder, the number of units should gradually decrease from layer to layer in the encoding part and then gradually increase again in the decoding part.
A simple autoencoder model should look like this:
input = Input(shape=(31,))
encoded = Dense(128, activation='relu')(input)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)
decoded = Dense(32, activation='relu')(encoded)
decoded = Dense(64, activation='relu')(decoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(31, activation='sigmoid')(decoded)
autoencoder = Model(input, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(x_train_noisy, x_train_noisy,
                epochs=100,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test_noisy, x_test_noisy))
Related
I think this is a comprehension issue, but I would appreciate any help.
I'm trying to learn how to use PyTorch for autoencoding. In the nn.Linear function, there are two specified parameters,
nn.Linear(input_size, hidden_size)
When reshaping a tensor to its minimum meaningful representation, as one would in autoencoding, it makes sense that the hidden_size would be smaller. However, in the PyTorch tutorial there is a line specifying identical input_size and hidden_size:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
I guess my question is, what is the purpose of having the same input and hidden size? Wouldn't this just return an identical tensor?
I suspect that this is just a requirement after calling the nn.ReLU() activation function.
As well stated by wikipedia:
An autoencoder is a type of artificial neural network used to learn
efficient codings of unlabeled data. The
encoding is validated and refined by attempting to regenerate the
input from the encoding.
In other words, the idea of the autoencoder is to learn an identity function. This identity function will be learned only for particular inputs (i.e. inputs without anomalies). From this, the following points follow:
Input will have same dimensions as output
Autoencoders are (generally) built to learn the essential features of the input
Because of point (1), the autoencoder is built as a series of layers (e.g. nn.Linear() or nn.Conv2d()) whose final output has the same dimensions as the input.
Because of point (2), you generally have an Encoder which compresses the information (in your code snippet, from 28x28 down to 10) and a Decoder which decompresses it (10 -> 28x28). In most implementations of this architecture the latent-space dimensionality (here 10) is much smaller than the input (28x28). Keeping the end goal of the Encoder in mind, you can see that an intermediate layer may temporarily expand the representation (nn.Linear(28*28, 512)); that extra width disappears as the stack of layers compresses down to the final output (10).
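To make the Encoder/Decoder split concrete, here is a minimal PyTorch sketch of a full autoencoder (the layer sizes are my own choice, not from the tutorial):

import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 28*28 -> 512 -> 10 (the latent representation)
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
        # Decoder: 10 -> 512 -> 28*28 (back to the input dimensionality)
        self.decoder = nn.Sequential(
            nn.Linear(10, 512),
            nn.ReLU(),
            nn.Linear(512, 28 * 28),
        )

    def forward(self, x):
        z = self.encoder(x)     # compressed code
        return self.decoder(z)  # reconstruction, same shape as x

x = torch.rand(8, 28 * 28)      # dummy batch
recon = AutoEncoder()(x)        # recon.shape == (8, 784)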
Note that because the model in your question includes a nonlinearity after the linear layer, the model will not learn an identity transform between the input and output. In the specific case of the relu nonlinearity, the model could learn an identity transform if all of the input values were positive, but in general this won't be the case.
I find it a little easier to imagine the issue if we had an even smaller model consisting of Linear --> Sigmoid --> Linear. In such a case, the input will be mapped through the first matrix transform and then "squashed" into the space [0, 1] as the "hidden" layer representation. The next ("output") layer would need to take this squashed view of the input and come up with some way of "unsquashing" it back into the original. But with an affine output layer, it's not possible to do this, so the model will have to learn some other, non-identity, transforms for the two matrices.
There are some neat visualizations of this concept on Chris Olah's blog that are well worth a look.
I have a feature State in my dataset, so after splitting I apply encoding to the train set like this
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['State'])], remainder='passthrough')
encoded_X_train = ct.fit_transform(X_train)
and train the model like this
regressor = LinearRegression()
regressor.fit(encoded_X_train, y_train)
then encode and predict like this
encoded_X_test = ct.fit_transform(X_test)
y_pred = regressor.predict(encoded_X_test)
Is this the right process of doing so, or am I doing something wrong?
No. You should train the encoding model using the train data only.
fit_transform fits the encoder on the data it is given and then transforms that same data, so calling it on the test set refits the encoder on the test data.
Thus, you should use the following code instead.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['State'])], remainder='passthrough')
encoded_X_train = ct.fit_transform(X_train)
encoded_X_test = ct.transform(X_test)
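As a small illustration with toy data of my own (drop='first' omitted for brevity): fitting on the train split fixes the one-hot columns, transform then reuses them, and a category unseen during training is encoded as all zeros because of handle_unknown='ignore':

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
train = pd.DataFrame({'State': ['NY', 'CA', 'NY']})
test = pd.DataFrame({'State': ['CA', 'TX']})   # 'TX' never appears in training

print(enc.fit_transform(train).toarray())  # columns are defined by the train categories
print(enc.transform(test).toarray())       # the 'TX' row becomes all zeros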
When I was doing classification work with TextCNN, I had experience finetuning TextCNN using pretrained word embeddings such as Word2Vec and fastText. I used this process:
1. Create an embedding layer in TextCNN.
2. Load the embedding matrix of the words used this time from Word2Vec or fastText.
3. Since the vector values of the embedding layer change during training, the network is being finetuned.
Recently I also wanted to try BERT for this. I thought, 'Since using pretrained BERT embeddings to initialize another network's embedding layer and then finetuning should not be very different, it should be easy!' But in fact I tried all day yesterday and still could not do it.
What I found is that, because BERT's embeddings are contextual, the vector of a given word varies from sentence to sentence when the word embeddings are extracted, so it seems there is no way to use them to initialize the embedding layer of another network in the usual way...
Finally, I came up with one method to 'finetune', with the following steps:
1. Do not define an embedding layer in TextCNN.
2. Instead of using an embedding layer, in the network training part I first pass the sequence tokens to the pretrained BERT model and get the word embeddings for each sentence.
3. Put the BERT word embeddings from step 2 into TextCNN and train the TextCNN network.
By using this method I was finally able to train, but thinking about it seriously, I don't think I'm doing any finetuning at all...
Because as you can see, every time I start a new training loop, the word embeddings generated by BERT are always the same vectors, so just feeding these unchanged vectors into TextCNN wouldn't let the TextCNN be finetuned at all, right?
UPDATE:
I thought up a new method to use the BERT embeddings and 'train' BERT and textcnn together.
Some part of my code is:
BERTmodel = AutoModel.from_pretrained('bert-base-uncased',
                                      output_hidden_states=True).to(device)
TextCNNmodel = TextCNN(EMBD_DIM, CLASS_NUM, KERNEL_NUM, KERNEL_SIZES).to(device)

optimizer = torch.optim.Adam(TextCNNmodel.parameters(), lr=LR)
loss_func = nn.CrossEntropyLoss()

for epoch in range(EPOCH):
    TextCNNmodel.train()
    BERTmodel.train()
    for step, (token_batch, seg_batch, y_batch) in enumerate(train_loader):
        token_batch = token_batch.to(device)
        y_batch = y_batch.to(device)
        BERToutputs = BERTmodel(token_batch)
        # I want to use the second-to-last hidden layer as the embedding, so
        x_batch = BERToutputs[2][-2]
        output = TextCNNmodel(x_batch)
        output = output.squeeze()
        loss = loss_func(output, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
I think that by enabling BERTmodel.train() and removing torch.no_grad() when getting the embeddings, the loss gradient can be backpropagated to BERTmodel, too. The training of TextCNNmodel also went smoothly.
To use this model later, I saved the parameters of both TextCNNmodel and BERTmodel.
Then, to check whether the BERTmodel had really been trained and changed, in another program I loaded the saved BERTmodel and fed it a sentence to test whether it had really been trained.
However, I found that the outputs (the embeddings) of the original 'bert-base-uncased' model and my 'BERTmodel' are the same, which is disappointing...
I really have no idea why the BERTmodel part did not change...
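A minimal sketch of the kind of comparison described above; the checkpoint filename and the test sentence are hypothetical, and the Hugging Face transformers API is assumed:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
original = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
finetuned = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
finetuned.load_state_dict(torch.load('bert_finetuned.pt'))  # hypothetical checkpoint path

original.eval()
finetuned.eval()

inputs = tokenizer("a test sentence", return_tensors='pt')
with torch.no_grad():
    emb_original = original(**inputs)[2][-2]    # second-to-last hidden layer
    emb_finetuned = finetuned(**inputs)[2][-2]

# If finetuning actually updated BERT, the two embeddings should differ.
print(torch.allclose(emb_original, emb_finetuned))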
Here I would like to thank @Jindřich for giving me the important hint!
I was almost there with my updated code above, but I forgot to set an optimizer for BERTmodel.
After I set the optimizer and ran the training again, this time when I loaded my BERTmodel I found that the outputs (the embeddings) of the original 'bert-base-uncased' model and my 'BERTmodel' are finally different, which means the BERTmodel has changed and should be finetuned.
Here is my final code; I hope it can help you, too.
BERTmodel = AutoModel.from_pretrained('bert-base-uncased',
                                      output_hidden_states=True).to(device)
TextCNNmodel = TextCNN(EMBD_DIM, CLASS_NUM, KERNEL_NUM, KERNEL_SIZES).to(device)

optimizer = torch.optim.Adam(TextCNNmodel.parameters(), lr=LR)
optimizer_bert = torch.optim.AdamW(BERTmodel.parameters(), lr=2e-5, weight_decay=1e-2)
loss_func = nn.CrossEntropyLoss()

for epoch in range(EPOCH):
    TextCNNmodel.train()
    BERTmodel.train()
    for step, (token_batch, seg_batch, y_batch) in enumerate(train_loader):
        token_batch = token_batch.to(device)
        y_batch = y_batch.to(device)
        BERToutputs = BERTmodel(token_batch)
        # I want to use the second-to-last hidden layer as the embedding, so
        x_batch = BERToutputs[2][-2]
        output = TextCNNmodel(x_batch)
        output = output.squeeze()
        loss = loss_func(output, y_batch)
        optimizer.zero_grad()
        optimizer_bert.zero_grad()
        loss.backward()
        optimizer.step()
        optimizer_bert.step()
I will continue my experiments to see if my BERTmodel is really being finetuned.
I implemented a neural net in Keras, with the following structure:
model = Sequential([... layers ...])
model.compile(optimizer=..., loss=...)
hist=model.fit(x=X,y=Y, validation_split=0.1, epochs=100)
Is there a way to extract from either model or hist the train and validation sets? That is, I want to know which indices in X and Y were used for training and which were used for validation.
Keras splits the dataset at
split_at = int(x.shape[0] * (1 - validation_split))
into the training and validation parts, taking the validation samples from the end of the data you pass (before any shuffling). So if you have n samples, the first int(n * (1 - validation_split)) samples are the training samples and the remainder is the validation set.
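So you can recover the split yourself without touching Keras internals; a minimal sketch, assuming X and Y are NumPy arrays as in the question:

import numpy as np

validation_split = 0.1
split_at = int(len(X) * (1 - validation_split))

train_idx = np.arange(split_at)         # indices used for training
val_idx = np.arange(split_at, len(X))   # indices used for validation

X_train, Y_train = X[train_idx], Y[train_idx]
X_val, Y_val = X[val_idx], Y[val_idx]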
If you want to have more control, you can split the dataset yourself and pass the validation dataset with the parameter validation_data:
model.fit(train_x, train_y, …, validation_data=(validation_x, validation_y))
I would like to code with Keras a neural network that acts both as an autoencoder AND a classifier for semi-supervised learning. Take for example this dataset where there is a few labeled images and a lot of unlabeled images: https://cs.stanford.edu/~acoates/stl10/
Some papers listed here achieved that, or very similar things, successfully.
To sum up: the model would have the same input data shape and the same "encoding" convolutional layers, but would split into two heads (fork-style), a classification head and a decoding head, so that the unsupervised autoencoder contributes to good learning for the classification head.
With TensorFlow there would be no problem doing that as we have full control over the computational graph.
But with Keras, things are more high-level and I feel that all the calls to ".fit" must always provide all the data at once (so it would force me to tie the classification head and the autoencoding head together into a single training step).
One way in Keras to almost do that would be something like this:
input = Input(shape=(32, 32, 3))
cnn_feature_map = sequential_cnn_trunk(input)
classification_predictions = Dense(10, activation='sigmoid')(cnn_feature_map)
autoencoded_predictions = decode_cnn_head_sequential(cnn_feature_map)
model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit([images], [labels, images], epochs=10)
However, I think and I fear that if I just try to fit things in that way, it will fail and ask for the missing head:
for epoch in range(10):
    # classification step
    model.fit([images], [labels, None], epochs=1)
    # "semi-unsupervised" autoencoding step
    model.fit([images], [None, images], epochs=1)
    # note: ".train_on_batch" could probably be used rather than ".fit" to avoid doing a whole epoch each time.
How should one implement that behavior with Keras? And could the training be done jointly without having to split the two calls to the ".fit" function?
Sometimes when you don't have a label you can pass a zero vector instead of a one-hot encoded vector. It should not change your result, because a zero vector produces no error signal with the categorical cross-entropy loss.
My custom to_categorical function looks like this:
def tricky_to_categorical(y, translator_dict):
    encoded = np.zeros((y.shape[0], len(translator_dict)))
    for i in range(y.shape[0]):
        if y[i] in translator_dict:
            encoded[i][translator_dict[y[i]]] = 1
    return encoded
where y contains the labels, and translator_dict is a Python dictionary which maps labels to their unique indices, like this:
{'unisex':2, 'female': 1, 'male': 0}
If an unknown label can't be found in this dictionary, its encoded label will be a zero vector.
If you use this trick, you also have to modify your accuracy function to see real accuracy numbers: you have to filter out all zero vectors from the metric.
def tricky_accuracy(y_true, y_pred):
    mask = K.not_equal(K.sum(y_true, axis=-1), K.constant(0))  # zero-vector mask
    y_true = tf.boolean_mask(y_true, mask)
    y_pred = tf.boolean_mask(y_pred, mask)
    return K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())
Note: you have to use larger batches (e.g. 32) in order to prevent an update from an all-zero label matrix, because it can make your accuracy metric go crazy; I don't know why.
Alternative solution
Use Pseudo Labeling :)
You can train jointly; you have to pass an array of targets instead of a single label.
I used fit_generator, e.g.
model.fit_generator(
    batch_generator(),
    steps_per_epoch=len(dataset) // batch_size,
    epochs=epochs)
def batch_generator():
    batch_x = np.empty((batch_size, img_height, img_width, 3))
    gender_label_batch = np.empty((batch_size, len(gender_dict)))
    category_label_batch = np.empty((batch_size, len(category_dict)))
    while True:
        i = 0
        for idx in np.random.choice(len(dataset), batch_size):
            image_id = dataset[idx][0]
            batch_x[i] = load_and_convert_image(image_id)
            gender_label_batch[i] = gender_labels[idx]
            category_label_batch[i] = category_labels[idx]
            i += 1
        yield batch_x, [gender_label_batch, category_label_batch]
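For completeness, a minimal sketch of the kind of two-head model such a generator could feed, compiled with one loss per output so both heads are trained jointly in a single fit call (the shared trunk here is a toy of my own, not the answerer's network):

from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(img_height, img_width, 3))
features = Flatten()(Conv2D(16, 3, activation='relu')(inputs))  # shared trunk (toy-sized)

gender_head = Dense(len(gender_dict), activation='softmax', name='gender')(features)
category_head = Dense(len(category_dict), activation='softmax', name='category')(features)

model = Model(inputs, [gender_head, category_head])
model.compile(optimizer='adam',
              loss=['categorical_crossentropy', 'categorical_crossentropy'],
              loss_weights=[1.0, 1.0])  # both heads are optimized in the same step

# In modern tf.keras, fit accepts the generator directly:
model.fit(batch_generator(), steps_per_epoch=len(dataset) // batch_size, epochs=epochs)

With named outputs you could also pass the losses and loss_weights as dicts keyed by 'gender' and 'category'.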