I would like to ask whether it would be possible (or rather, whether it makes any sense) to use a variational autoencoder for feature extraction. I ask because the encoding part samples from a distribution, which means the same sample can get a different encoding each time (due to the stochastic nature of the sampling process). Thanks!
Yes, the feature extraction goal is the same for VAEs as for sparse autoencoders.
Once you have an encoder, plug a classifier in on the extracted features.
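For instance, a minimal sketch (all names here are illustrative: it assumes a trained encoder with a predict method and labeled arrays X_train, y_train, X_test, y_test):

from sklearn.linear_model import LogisticRegression

# Treat the encoder output as a fixed feature vector and train any
# off-the-shelf classifier on top of it.
z_train = encoder.predict(X_train)   # extracted features for the train split
z_test = encoder.predict(X_test)

clf = LogisticRegression(max_iter=1000).fit(z_train, y_train)
print(clf.score(z_test, y_test))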
Best regards,
Yes, the output of the encoder network can be used as your features.
Just think about it this way: using the output of the encoder network as input, the decoder network can generate an image quite like your original image. Therefore the encoder's output covers most of the information in your original image; in other words, it captures the most important features of your original image, the ones that distinguish it from other images.
The only thing to pay attention to is that a variational autoencoder is a stochastic feature extractor, while a feature extractor is usually deterministic. You can either use the mean and variance as your extracted features, or use a Monte Carlo method, drawing from the Gaussian distribution defined by the mean and variance, to obtain "sampled" extracted features.
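A minimal sketch of both options, assuming a trained encoder that returns the Gaussian parameters (the names encoder, x, mu, log_var are illustrative):

import numpy as np

# The encoder maps x to the parameters (mu, log_var) of a Gaussian
# over the latent code (a common VAE parameterization).
mu, log_var = encoder.predict(x)

# Option 1: deterministic features -- use the mean (optionally with the variance)
features = mu

# Option 2: "sampled extracted features" -- Monte Carlo draws via the
# reparameterization z = mu + sigma * eps, with eps ~ N(0, I)
eps = np.random.randn(*mu.shape)
sampled_features = mu + np.exp(0.5 * log_var) * eps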
Yes, you can.
I used the code below to extract the important features from my dataset.
prostate_df <- read.csv('your_data')
prostate_df <- prostate_df[, -1]  # drop the first column (IDs)
train_df <- prostate_df
outcome_name <- 'subtype'  # my label column
feature_names <- setdiff(names(prostate_df), outcome_name)
library(h2o)
localH2O <- h2o.init()
prostate.hex <- as.h2o(train_df, destination_frame = "train.hex")
# Unsupervised autoencoder: note that no y (response) is given.
# (Other settings I experimented with: input_dropout_ratio, l2,
# validation_frame, reproducible/seed, Tanh-family activations,
# regression_stop, stopping_metric.)
prostate.dl <- h2o.deeplearning(x = feature_names,
                                training_frame = prostate.hex,
                                model_id = "AE100",
                                autoencoder = TRUE,
                                hidden = c(2),  # 2-unit bottleneck, so layer 1 yields the two features plotted below
                                epochs = 700,
                                activation = "Rectifier",
                                standardize = TRUE,
                                train_samples_per_iteration = 0,
                                variable_importances = TRUE)
label1 <- ncol(train_df)
# Extract the activations of the bottleneck layer as features
train_supervised_features2 <- h2o.deepfeatures(prostate.dl, prostate.hex, layer = 1)
plotdata <- as.data.frame(train_supervised_features2)
plotdata$label <- as.character(as.vector(train_df[, label1]))
library(ggplot2)
qplot(DF.L1.C1, DF.L1.C2, data = plotdata, color = label,
      main = "Cancer Normal Pathway data")
# Per-row reconstruction error (anomaly score)
prostate.anon <- h2o.anomaly(prostate.dl, prostate.hex, per_feature = FALSE)
head(prostate.anon)
err <- as.data.frame(prostate.anon)
h2o.scoreHistory(prostate.dl)
head(h2o.varimp(prostate.dl), 10)
h2o.varimp_plot(prostate.dl)
I have a feature State in my dataset. After splitting, I apply encoding to the train set like this:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['State'])], remainder='passthrough')
encoded_X_train = ct.fit_transform(X_train)
and train the model like this:
regressor = LinearRegression()
regressor.fit(encoded_X_train, y_train)
then encode and predict like this:
encoded_X_test = ct.fit_transform(X_test)
y_pred = regressor.predict(encoded_X_test)
Is this the right process of doing so, or am I doing something wrong?
No. You should fit the encoder using the training data only.
fit_transform fits the transformer on the data it is given and then transforms that same data, so calling it again on the test set refits the encoder on the test data and can produce inconsistent encodings.
Thus, you should use the following code instead:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['State'])], remainder='passthrough')
encoded_X_train = ct.fit_transform(X_train)
encoded_X_test = ct.transform(X_test)
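As a refinement (just a sketch, not required), you can wrap the transformer and the regressor in a Pipeline so the fit/transform split is handled for you:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# The pipeline fits the encoder on the training data only, and applies
# plain transform() to anything passed to predict().
pipe = Pipeline([
    ('encode', ColumnTransformer(
        transformers=[('encoder',
                       OneHotEncoder(drop='first', handle_unknown='ignore'),
                       ['State'])],
        remainder='passthrough')),
    ('regress', LinearRegression()),
])

pipe.fit(X_train, y_train)     # encoder is fitted here, on the train split
y_pred = pipe.predict(X_test)  # encoder is only applied here, never refitted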
I have two images as input, x1 and x2, and I am trying to use convolution as a similarity measure. The idea is that the learned weights substitute for a more traditional similarity measure (cross-correlation, NN, ...). I define my forward function as follows:
def forward(self, x1, x2):
    # shared-weight (Siamese) branches: the same conv layers process both inputs
    out_conv1a = self.conv1(x1)
    out_conv2a = self.conv2(out_conv1a)
    out_conv3a = self.conv3(out_conv2a)

    out_conv1b = self.conv1(x2)
    out_conv2b = self.conv2(out_conv1b)
    out_conv3b = self.conv3(out_conv2b)
Now for the similarity measure:
out_cat = torch.cat([out_conv3a, out_conv3b], dim=1)
# nn.Conv2d is constructed with channel counts, then applied to the tensor:
further_conv = nn.Conv2d(in_channels=out_cat.size(1), out_channels=1, kernel_size=3)
similarity = further_conv(out_cat)
My questions are as follows:
1) Would depthwise/separable convolutions, as in the Google paper, yield any advantage over a 2D convolution of the concatenated input? For that matter, can convolution serve as a similarity measure at all? Cross-correlation and convolution are, after all, very similar.
2) It is my understanding that the groups=2 option in Conv2d would provide two separate sets of weights, one for each of the previous networks' outputs. How are these combined afterwards?
For a basic concept see here.
A nn.Conv2d layer assumes its weights are trainable parameters. However, if you want to filter one feature map with another, you can dive deeper and use torch.nn.functional.conv2d, which lets you explicitly pass both the input and the filter yourself:
out = torch.nn.functional.conv2d(out_conv3a, out_conv3b)
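Note that torch.nn.functional.conv2d expects the input as (N, C_in, H, W) and the weight as (C_out, C_in / groups, kH, kW), so the second feature map has to be reshaped into the weight layout first. A minimal sketch with made-up shapes:

import torch
import torch.nn.functional as F

a = torch.randn(1, 8, 32, 32)  # out_conv3a: a batch of 8-channel feature maps
b = torch.randn(1, 8, 5, 5)    # out_conv3b: a smaller map used as the kernel

weight = b.view(1, 8, 5, 5)    # 1 output channel, 8 input channels, 5x5 kernel
out = F.conv2d(a, weight)      # (1, 1, 28, 28): a correlation-style response map
print(out.shape)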
I would like to code, with Keras, a neural network that acts both as an autoencoder AND as a classifier for semi-supervised learning. Take for example this dataset, where there are a few labeled images and a lot of unlabeled ones: https://cs.stanford.edu/~acoates/stl10/
Some papers listed here achieved that, or very similar things, successfully.
To sum up: the model would have the same input data shape and the same "encoding" convolutional layers, but would split into two heads (fork-style), a classification head and a decoding head, so that the unsupervised autoencoding task contributes to good learning for the classification head.
With TensorFlow there would be no problem doing that as we have full control over the computational graph.
But Keras is more high-level, and I feel that every call to ".fit" must always provide all the data at once (so it would force me to tie the classification head and the autoencoding head together into one training step).
One way in Keras to almost do that would be something like this:
input = Input(shape=(32, 32, 3))
cnn_feature_map = sequential_cnn_trunk(input)
classification_predictions = Dense(10, activation='sigmoid')(cnn_feature_map)
autoencoded_predictions = decode_cnn_head_sequential(cnn_feature_map)

# both heads must be listed as outputs so that .fit can target each of them
model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit([images], [labels, images], epochs=10)
However, I think and fear that if I just try to fit things one head at a time, as below, it will fail and complain about the missing output:
for epoch in range(10):
    # classification step
    model.fit([images], [labels, None], epochs=1)
    # "semi-unsupervised" autoencoding step
    model.fit([images], [None, images], epochs=1)
    # note: ".train_on_batch" could probably be used rather than ".fit"
    # to avoid doing a whole epoch each time.
How should one implement this behavior with Keras? And could the training be done jointly, without having to split it into two calls to the ".fit" function?
Sometimes, when you don't have a label, you can pass a zero vector instead of a one-hot encoded vector. This should not change your result, because a zero vector produces no error signal with the categorical cross-entropy loss.
My custom to_categorical function looks like this:
import numpy as np

def tricky_to_categorical(y, translator_dict):
    # one-hot encode known labels; unknown labels become all-zero rows
    encoded = np.zeros((y.shape[0], len(translator_dict)))
    for i in range(y.shape[0]):
        if y[i] in translator_dict:
            encoded[i][translator_dict[y[i]]] = 1
    return encoded
where y contains the labels and translator_dict is a Python dictionary that maps each label to a unique index, like this:
{'unisex':2, 'female': 1, 'male': 0}
If an unknown (UNK) label can't be found in this dictionary, its encoded label will be a zero vector.
If you use this trick, you also have to modify your accuracy function to see real accuracy numbers: you have to filter all zero vectors out of the metric.
import tensorflow as tf
from keras import backend as K

def tricky_accuracy(y_true, y_pred):
    mask = K.not_equal(K.sum(y_true, axis=-1), K.constant(0))  # zero-vector mask
    y_true = tf.boolean_mask(y_true, mask)
    y_pred = tf.boolean_mask(y_pred, mask)
    return K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())
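You would then pass it in as a custom metric when compiling, e.g. (the optimizer and loss here are illustrative):

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=[tricky_accuracy])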
Note: you have to use larger batches (e.g. 32) in order to prevent updates from batches that are all zero vectors, because those can make your accuracy metrics go crazy; I don't know why.
Alternative solution
Use Pseudo Labeling :)
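A minimal pseudo-labeling sketch (the threshold and all array names are illustrative assumptions): train on the labeled data first, predict on the unlabeled data, and keep only confident predictions as new labels.

import numpy as np
from keras.utils import to_categorical

probs = model.predict(unlabeled_images)   # model pre-trained on labeled data
confident = probs.max(axis=1) > 0.95      # confidence threshold (assumption)
pseudo_labels = to_categorical(probs[confident].argmax(axis=1), num_classes)

# retrain on the union of real labels and confident pseudo-labels
model.fit(np.concatenate([labeled_images, unlabeled_images[confident]]),
          np.concatenate([labels, pseudo_labels]),
          epochs=5)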
You can train jointly; you just have to pass an array of targets instead of a single label.
I used fit_generator, e.g.
import numpy as np

def batch_generator():
    batch_x = np.empty((batch_size, img_height, img_width, 3))
    gender_label_batch = np.empty((batch_size, len(gender_dict)))
    category_label_batch = np.empty((batch_size, len(category_dict)))
    while True:
        i = 0
        for idx in np.random.choice(len(dataset), batch_size):
            image_id = dataset[idx][0]
            batch_x[i] = load_and_convert_image(image_id)
            gender_label_batch[i] = gender_labels[idx]
            category_label_batch[i] = category_labels[idx]
            i += 1
        # one input batch and a list of label batches, one per output head
        yield batch_x, [gender_label_batch, category_label_batch]

model.fit_generator(
    batch_generator(),
    steps_per_epoch=len(dataset) // batch_size,
    epochs=epochs)
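Note that the model has to be compiled with one loss per output head before fitting, e.g. (the loss choices and weights here are illustrative):

model.compile(optimizer='rmsprop',
              loss=['categorical_crossentropy', 'categorical_crossentropy'],
              loss_weights=[1.0, 1.0],  # rescale a head's contribution here if needed
              metrics=['accuracy'])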
The general idea I am trying to realize is a seq2seq model (taken from the translate.py example in the models repository, based on the seq2seq class). This trains well.
Furthermore, I am using the hidden state of the RNN after all the encoding is done, right before decoding starts (I call it the "hidden state at end of encoding"). I feed this hidden state into a further sub-graph, which I call "prices" (see below). The training gradients of this sub-graph back-propagate not only through the additional sub-graph, but also back into the encoder part of the RNN (which is what I want and need).
The plan is to add more such sub-graphs to the hidden state at end of encoding, as I want to analyze the input phrases in a variety of ways.
Now, during training, when I evaluate and train both sub-graphs (encoder+prices AND encoder+decoder) at the same time, the net does NOT converge. However, if I execute the training in the following way (pseudo-code):
if global_step % 10 == 0:
    execute-the-price-training_code
else:
    execute-the-decoder-training_code
So I am not training both sub-graphs simultaneously. Now it does converge, but the encoder+decoder part converges MUCH slower than if I ONLY train that part and never train the prices sub-graph.
My question is: I should be able to train both sub-graphs simultaneously, but probably I have to rescale the gradients flowing back into the hidden state at end of encoding, since that state receives gradients from both the prices sub-graph AND the decoder sub-graph. How should this rescaling be done? I didn't find any papers describing such an undertaking, but maybe I am searching with the wrong keywords.
Here is the training part of the code. This is the (almost original) training-op preparation:
if not forward_only:
    self.gradient_norms = []
    self.updates = []
    opt = tf.train.AdadeltaOptimizer(self.learning_rate)
    for bucket_id in xrange(len(buckets)):
        tf.scalar_summary("seq2seq loss", self.losses[bucket_id])
        gradients = tf.gradients(self.losses[bucket_id], var_list_seq2seq)
        clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
        self.gradient_norms.append(norm)
        self.updates.append(opt.apply_gradients(zip(clipped_gradients, var_list_seq2seq),
                                                global_step=self.global_step))
Now, additionally, I am running a second sub-graph that takes the hidden state at end of encoding as input:
with tf.name_scope('prices') as scope:
    # First layer
    W_price_first_layer = tf.Variable(tf.random_normal([num_layers * size, self.prices_hidden_layer_size], stddev=0.35), name="W_price_first_layer")
    B_price_first_layer = tf.Variable(tf.zeros([self.prices_hidden_layer_size]), name="B_price_first_layer")
    self.output_price_first_layer = tf.add(tf.matmul(self.hidden_state, W_price_first_layer), B_price_first_layer)
    self.activation_price_first_layer = tf.nn.sigmoid(self.output_price_first_layer)

    # Second layer to softmax (price ranges)
    W_price = tf.Variable(tf.random_normal([self.prices_hidden_layer_size, self.prices_bit_size], stddev=0.35), name="W_price")
    W_price_t = tf.transpose(W_price)
    B_price = tf.Variable(tf.zeros([self.prices_bit_size]), name="B_price")
    self.output_price_second_layer = tf.add(tf.matmul(self.activation_price_first_layer, W_price), B_price)
    self.price_prediction = tf.nn.softmax(self.output_price_second_layer)
    self.label_price = tf.placeholder(tf.int32, shape=[self.batch_size], name="price_label")

    # Remember the prices trainables
    var_list_prices = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "prices")
    var_list_all = tf.trainable_variables()

    # Backprop: minimize the mean cross-entropy over the batch
    self.loss_price = tf.nn.sparse_softmax_cross_entropy_with_logits(self.output_price_second_layer, self.label_price)
    self.loss_price_scalar = tf.reduce_mean(self.loss_price)
    self.optimizer_price = tf.train.AdadeltaOptimizer(self.learning_rate_prices)
    self.training_op_price = self.optimizer_price.minimize(self.loss_price_scalar, var_list=var_list_all)
Thanks a bunch!
I expect that running two optimizers simultaneously will lead to inconsistent gradient updates on the common variables, and this might be causing your training not to converge.
Instead, if you add the scalar loss from each sub-network to the "losses collection" (e.g. via tf.contrib.losses.add_loss() or tf.add_to_collection(tf.GraphKeys.LOSSES, ...)), you can use tf.contrib.losses.get_total_loss() to get a single loss value that can be passed to a single, standard tf.train.Optimizer subclass. TensorFlow will derive the appropriate back-prop computation for your split network.
The get_total_loss() method simply computes an unweighted sum of the values that have been added to the losses collection. I'm not familiar with the literature on how or if you should scale these values, but you can use any arbitrary (differentiable) TensorFlow expression to combine the losses and pass the result to a single optimizer.
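A minimal sketch of that pattern with the TF 1.x-era API this answer refers to (the loss variable names are illustrative):

import tensorflow as tf

# Register each sub-graph's scalar loss in the losses collection ...
tf.add_to_collection(tf.GraphKeys.LOSSES, seq2seq_loss)   # decoder head
tf.add_to_collection(tf.GraphKeys.LOSSES, price_loss)     # prices head

# ... then minimize their (unweighted) sum with a single optimizer, so
# TensorFlow derives one consistent back-prop pass through the shared encoder.
total_loss = tf.losses.get_total_loss(add_regularization_losses=False)
train_op = tf.train.AdadeltaOptimizer(learning_rate).minimize(total_loss)

# Alternatively, combine the terms with any differentiable expression, e.g.:
# total_loss = seq2seq_loss + 0.1 * price_loss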
I am using OpenCV's CvSVM to learn SVM classifiers for 29 classes.
The application is face recognition, and I divide the face image into a 3x6 grid. For each block in the grid, I train an SVM classifier on the SURF features extracted from that block.
I read here http://www.csie.ntu.edu.tw/~cjlin/talks.html that it is very important to scale the training and testing data similarly.
Does CvSVM scale the data? If not, does OpenCV provide any function I can use to do the scaling?
There's a very good cross-platform library for utilizing machine learning algorithms in practice:
http://www.nickgillian.com/software/grt
I don't know whether OpenCV provides a way to scale training/test data to achieve a better classification result, but with GRT you can do it:
https://github.com/nickgillian/grt/blob/master/GRT/ClassificationModules/SVM/SVM.h
...
@param bool useScaling: sets whether the training/prediction data will be scaled to the default range of [-1, 1]. The SVM algorithm commonly achieves a better classification result if scaling is turned on. The default value is useScaling=true
...
SVM(UINT kernelType = LINEAR_KERNEL, UINT svmType = C_SVC, bool useScaling = true,
    bool useNullRejection = false, bool useAutoGamma = true, double gamma = 0.1,
    UINT degree = 3, double coef0 = 0, double nu = 0.5, double C = 1,
    bool useCrossValidation = false, UINT kFoldValue = 10);
This class is also a wrapper around libsvm.
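If you want to stay with OpenCV's CvSVM, a libsvm-style min-max scaling is also easy to do yourself before training and prediction. A minimal NumPy sketch (array names are illustrative); the key point is that the scaling parameters are estimated on the training data and reused unchanged on the test data:

import numpy as np

def fit_scaler(X_train):
    # per-feature minimum and range, estimated on the training data only
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi - lo == 0, 1, hi - lo)  # avoid division by zero
    return lo, span

def scale(X, lo, span):
    # map each feature linearly into [-1, 1]
    return 2 * (X - lo) / span - 1

lo, span = fit_scaler(train_features)
train_scaled = scale(train_features, lo, span)
test_scaled = scale(test_features, lo, span)  # same parameters as training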