How to use stacked autoencoders for pretraining

Let's say I wish to use stacked autoencoders as a pretraining step.
Let's say my full autoencoder is 40-30-10-30-40.
My steps are:
Train a 40-30-40 autoencoder using the original 40-feature data set as both input and output.
Using only the trained encoder part of the above (the 40-30 encoder), derive a new 30-feature representation of the original 40 features.
Train a 30-10-30 autoencoder using the new 30-feature data set (derived in step 2) as both input and output.
Take the trained encoder from step 1 (40-30) and feed it into the encoder from step 3 (30-10), giving a 40-30-10 encoder.
Take the 40-30-10 encoder from step 4 and use it as the input of the NN.
a) Is that correct?
b) Do I freeze the weights in the 40-30-10 encoder when training the NN? That would be the same as pre-generating the 10-feature representation from the original 40-feature data set and training on that new 10-feature representation.
PS. I already have a question out asking about whether I need to tie the weights of the encoder and decoder

a) Is that correct?
This is one of the typical approaches. You could also try to fit the autoencoder directly, as a "raw" autoencoder with that many layers should be possible to fit right away. As an alternative, you might consider fitting stacked denoising autoencoders instead, which might benefit more from "stacked" training.
b) Do I freeze the weights in the 40-30-10 encoder when training the NN? That would be the same as pre-generating the 10-feature representation from the original 40-feature data set and training on that new 10-feature representation.
When you train the whole NN you do not freeze anything. Pretraining is only a kind of preconditioning for the optimization process: you show your method where to start, but you do not want to limit the fitting procedure of the actual supervised learning.
PS. I already have a question out asking about whether I need to tie the weights of the encoder and decoder
No, you do not have to tie the weights, especially since you throw away the decoder anyway. Tying the weights is important for some more probabilistic models in order to make the minimization procedure possible (like in the case of RBMs), but for an autoencoder there is no point.
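For concreteness, here is a minimal Keras sketch of the layer-wise pretraining procedure described in the question. The placeholder data, layer sizes, activations and variable names are illustrative assumptions, not something prescribed by the question or answer:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 40).astype("float32")  # placeholder for the 40-feature data set

# Step 1: train a 40-30-40 autoencoder on the raw features.
ae1 = keras.Sequential([
    keras.Input(shape=(40,)),
    layers.Dense(30, activation="relu", name="enc1"),
    layers.Dense(40, activation="linear"),
])
ae1.compile(optimizer="adam", loss="mse")
ae1.fit(X, X, epochs=20, batch_size=32, verbose=0)

# Step 2: derive the 30-feature representation with the trained encoder part only.
enc1 = keras.Sequential([keras.Input(shape=(40,)), ae1.get_layer("enc1")])
H = enc1.predict(X)

# Step 3: train a 30-10-30 autoencoder on that 30-feature representation.
ae2 = keras.Sequential([
    keras.Input(shape=(30,)),
    layers.Dense(10, activation="relu", name="enc2"),
    layers.Dense(30, activation="linear"),
])
ae2.compile(optimizer="adam", loss="mse")
ae2.fit(H, H, epochs=20, batch_size=32, verbose=0)

# Steps 4-5: stack the two encoders into a 40-30-10 encoder and add a supervised head.
# The pretrained weights only initialise the network; nothing is frozen during fine-tuning.
model = keras.Sequential([
    keras.Input(shape=(40,)),
    layers.Dense(30, activation="relu", name="pre1"),
    layers.Dense(10, activation="relu", name="pre2"),
    layers.Dense(1, activation="sigmoid"),  # e.g. a binary classification head
])
model.get_layer("pre1").set_weights(ae1.get_layer("enc1").get_weights())
model.get_layer("pre2").set_weights(ae2.get_layer("enc2").get_weights())
model.compile(optimizer="adam", loss="binary_crossentropy")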

Related

How to classify images with Variational Autoencoder

I have trained an autoencoder on both labeled images (1200) and unlabeled images (4000), and I have both models saved separately (vae_fake_img and vae_real_img). So I was wondering what to do next. I know Variational Autoencoders are not useful for a classification task, but feature extraction seems like a good try. So here are my attempts:
I labeled my unlabeled data using k-means clustering on the latent space of the labeled images.
My supervisor suggested training the VAE on the unlabeled images, then visualizing the latent space with t-SNE, then K-means clustering, then an MLP for final prediction.
I want to train a Conditional VAE to create more labeled samples, retrain the VAE, and use the reconstruction output (64, 64, 3) with the last three fully connected (FC) layers of the VGGNet16 architecture for final classification, as done in this paper: Encoder as feature extraction paper.
I have tried so many methods for my thesis and I really need to achieve high accuracy if I want to get a job in my current internship, so any suggestion or guidance is highly appreciated. I've read so many autoencoder papers, but the architecture for classification is not fully explained (or I'm not understanding it properly). I want to know which part of the VAE holds more information for multiclass classification, as I believe the latent space of the encoder has more useful information than the decoder reconstruction. In short, which part of the autoencoder gives better features for a final classification?
In the case of autoencoders you don't need labels for reconstructing the input data. So I think these approaches might bring slight improvements:
Use a VAE (Variational Autoencoder) instead of a plain AE.
Use a Conditional VAE (CVAE), combine all the data, and train the network by feeding all of it in.
Consider the batch (labeled vs. unlabeled) as the condition, and use a one-hot encoding of the batch as that condition.
Inject the condition into both the encoder and the decoder.
Then the latent space won't have any batch effect, and you can use KNN to give each unlabeled sample the label of its nearest labeled neighbours.
Alternatively, you can train a simple MLP to classify every sample in your latent space (in this approach you should train the MLP only with labeled data and then apply it to the unlabeled data); a rough sketch of both options follows below.
Don't forget batch normalization and dropout layers.
P.S. The most meaningful layer of an AE is the latent space.
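Here is a rough sketch of the two options above: KNN label transfer and a simple MLP trained only on the labeled latents. The latent vectors are assumed to come from the already trained (C)VAE encoder; the placeholder shapes, class count and hyperparameters are purely illustrative:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Placeholder latent vectors; in practice these come from encoder.predict(images).
z_labeled = np.random.rand(1200, 32)        # latents of the 1200 labeled images
y_labeled = np.random.randint(0, 5, 1200)   # their labels (5 classes, illustrative)
z_unlabeled = np.random.rand(4000, 32)      # latents of the 4000 unlabeled images

# Option 1: KNN label transfer from the nearest labeled neighbours in latent space.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(z_labeled, y_labeled)
pseudo_labels = knn.predict(z_unlabeled)

# Option 2: a simple MLP trained only on the labeled latents, then applied to the rest.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
mlp.fit(z_labeled, y_labeled)
mlp_predictions = mlp.predict(z_unlabeled)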

Predicting inputs in neural network

Is it possible to predict the inputs of a Keras neural network for a particular output?
For example, I have a dataset with 28 inputs and 3 outputs. I have trained the model in Keras and it works fine. Now I need to enter particular values for the outputs and predict what the inputs would be for that particular output.
I'm not 100% sure I understand the question correctly, but if you're trying to build a model that takes the outputs and predicts the inputs, then you will need to train a second model with the inputs and outputs swapped, so that the outputs become your inputs and your inputs become the outputs. Although it might be annoying, you may also have to build a separate network to predict each of your input variables.
To get around this problem, you can consider autoencoders if you're okay with getting a close approximation of the input. An autoencoder is an unsupervised artificial neural network that learns how to efficiently compress and encode data, and then learns how to reconstruct the data from the reduced encoded representation back to something as close to the original input as possible (you can read more here: https://towardsdatascience.com/auto-encoder-what-is-it-and-what-is-it-used-for-part-1-3e5c6f017726).
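As a minimal sketch of the first suggestion (a second, "inverse" model trained with inputs and outputs swapped), here is what it could look like in Keras; the placeholder data and layer sizes are illustrative assumptions:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder data: X holds the 28 original inputs, Y the 3 original outputs.
X = np.random.rand(1000, 28).astype("float32")
Y = np.random.rand(1000, 3).astype("float32")

# The "inverse" network: it takes the 3 outputs and predicts the 28 inputs.
inverse_model = keras.Sequential([
    keras.Input(shape=(3,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(28, activation="linear"),
])
inverse_model.compile(optimizer="adam", loss="mse")
inverse_model.fit(Y, X, epochs=50, batch_size=32, verbose=0)  # note: Y and X are swapped

# Given a desired output vector, estimate the corresponding inputs.
estimated_inputs = inverse_model.predict(Y[:1])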
Yes, it is definitely possible to predict inputs from the output. In fact, what you're describing is essentially an autoencoder.
Let's say you have a NN trained on MNIST. If you then use the outputs of the classification layer to train the decoder of an autoencoder, you will get a rough indication of the input.
However, this is not the best way to do it. The best way is to simply treat the latent space as the "output", then feed this output into:
a) a one-layer classification head to give you the predicted label, and
b) the decoder.
This will give you both the predicted label and a reconstruction of the original image.
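A rough Keras functional-API sketch of this idea, with one encoder whose latent vector feeds both a one-layer classification head and a decoder, could look as follows; the flattened-MNIST shapes, layer sizes and loss choices are illustrative assumptions:

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(784,))                      # flattened 28x28 image
h = layers.Dense(256, activation="relu")(inputs)
latent = layers.Dense(32, activation="relu", name="latent")(h)

# a) a one-layer classification head on the latent space
class_head = layers.Dense(10, activation="softmax", name="label")(latent)
# b) a decoder that reconstructs the image from the same latent vector
decoder_out = layers.Dense(784, activation="sigmoid", name="reconstruction")(latent)

model = keras.Model(inputs, [class_head, decoder_out])
model.compile(
    optimizer="adam",
    loss={"label": "sparse_categorical_crossentropy", "reconstruction": "mse"},
)
# model.fit(x_train, {"label": y_train, "reconstruction": x_train}, epochs=10)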

Data Science Scaling/Normalization real case

When doing data pre-processing, it is suggested to do either scaling or normalization. That is easy when you have the data at hand: you have all of it and can do it right away. But after the model is built and running, does the first data point that comes in need to be scaled or normalized? If so, it is only a single row, so how do you scale or normalize it? How do we know the min/max/mean/stdev of each feature, and how does the incoming data relate to the min/max/mean of each feature?
Please advise
First of all, you should know when to use scaling and normalization.
Scaling - scaling is nothing but transforming your features to comparable magnitudes. Let's say you have a feature like a person's income and you notice that some values are of order 10^3 and some are of order 10^6. If you model your problem with these features, algorithms like KNN and ridge regression will give higher weight to the attributes with higher magnitudes. To prevent this, you need to scale your features first. Min-max scaling is one of the most commonly used methods.
Mean normalisation -
If, after examining the distribution of a feature, you find that it is not centered around zero, then for algorithms like SVM, whose objective function already assumes zero mean and same-order variance, you could have problems in modeling. In that case you should do mean normalisation.
Standardization - for algorithms like SVM, neural networks and logistic regression, it is necessary for the features' variances to be of the same order, so why not make them one? In standardization, we transform each feature's distribution to zero mean and unit variance.
Now let's try to answer your question in terms of the training and testing sets.
So let's say you are training your model on a 50k-row dataset and testing on a 10k-row dataset.
For the above three transformations, the standard approach says that you should fit any normalizer or scaler on the training dataset only, and use only its transform on the testing dataset.
In our case, if we want to use standardization, we first fit our standardizer on the 50k training dataset and then use it to transform both the 50k training dataset and the testing dataset.
Note - we shouldn't fit our standardizer to the test dataset; instead, we use the already fitted standardizer to transform the testing dataset.
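A minimal scikit-learn sketch of this fit-on-train, transform-only-on-test convention (placeholder data, illustrative shapes):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(50000, 5)   # placeholder for the 50k training rows
X_test = np.random.rand(10000, 5)    # placeholder for the 10k testing rows

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit AND transform on the training data
X_test_scaled = scaler.transform(X_test)        # transform only; never refit on the test data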
Yes, you need to apply normalization to the input data, else the model will predict nonsense.
You also have to save the normalization coefficients that were used during training, i.e. those computed from the training data. Then you apply the same coefficients to incoming data.
For example if you use min-max normalization:
f_n = (f - min(f)) / (max(f) - min(f))
Then you need to save the min(f) and max(f) in order to perform normalization for new data.
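For example, a small sketch of persisting the fitted scaler and applying it to a single incoming row; the file name, placeholder data and values are illustrative:

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.random.rand(1000, 3)            # placeholder training data
scaler = MinMaxScaler().fit(X_train)         # stores per-feature min(f) and max(f)
joblib.dump(scaler, "scaler.joblib")         # save it alongside the trained model

# Later, at prediction time, for one incoming row:
scaler = joblib.load("scaler.joblib")
new_row = np.array([[0.2, 0.7, 0.4]])        # a single sample, shape (1, n_features)
new_row_scaled = scaler.transform(new_row)   # uses the training min/max, not the row's own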

Applying PCA before sending data to SVM

Before applying SVM on my data I want to reduce its dimension by PCA. Should I separate the train data and test data and then apply PCA on each of them separately, or apply PCA on both sets combined and then separate them?
Actually, both provided answers are only partially right. The crucial point here is what exact problem you are trying to solve. There are two basic settings that can be considered, and both are valid under some assumptions.
Case 1
You have some data (which you split into train and test) and in the future you will get more data coming from the same distribution.
If this is the case, you should fit PCA on the training data, then fit the SVM on its projection; for testing you just apply the already fitted PCA followed by the already fitted SVM, and you do exactly the same for new data that comes in. This way your test error (under some "size assumptions") should approximate your expected error.
Case 2
You have some data (which you split into train and test), and in the future you will obtain a big chunk of unlabeled data and will be able to fit your model then.
In such a case, you fit PCA on the whole data provided, learn the SVM on the labeled part (the train set), and evaluate on the test set. This way, once new data arrives you can fit PCA using both your data and the new data, and then train the SVM on your old data (as this is the only part that has labels). Under the assumption that, again, the data comes from the same distribution, everything is correct here. You use more data to fit PCA only to obtain a better estimator (maybe your data is really high-dimensional and PCA fails with a small sample?).
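For Case 1, a minimal scikit-learn sketch of that workflow (PCA and SVM fitted on the training data only, then both reused as-is for test or future data) might look like this; the placeholder data and n_components are illustrative:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X_train, y_train = np.random.rand(200, 50), np.random.randint(0, 2, 200)  # placeholders
X_test, y_test = np.random.rand(50, 50), np.random.randint(0, 2, 50)

model = make_pipeline(PCA(n_components=20), SVC())
model.fit(X_train, y_train)                  # PCA and SVM are fitted on the training data only
test_accuracy = model.score(X_test, y_test)  # test data is only transformed, never refitted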
You should do them separately. If you run PCA on both sets combined, then you are going to introduce a bias into your SVM. The goal of the test set is to see how your algorithm will perform without prior knowledge of the data.
Learn the Projection Matrix of PCA on the train set and use this to reduce the dimensions of the test data.
One benefit is that this way you don't have to rely on collecting sufficient data in the test set, which matters if you are applying your classifier at actual run time, where test data comes one sample at a time.
Also, I think fitting separate train and test PCAs will fail. Why?
Think of PCA as giving you features, and then you learn a classifier over these features. If over time your data shifts, then the test features you get from a separately fitted PCA would be different, and you don't have a classifier trained on these features. Even if the set of PCA directions/features remains the same, if their order varies your classifier will still fail.

Fine Tuning of GoogLeNet Model

I trained the GoogLeNet model from scratch, but it did not give me promising results.
As an alternative, I would like to fine-tune the GoogLeNet model on my dataset. Does anyone know which steps I should follow?
Assuming you are trying to do image classification, these should be the steps for fine-tuning a model:
1. Classification layer
The original classification layer "loss3/classifier" outputs predictions for 1000 classes (its num_output is set to 1000). You'll need to replace it with a new layer with an appropriate num_output. Replacing the classification layer:
Change the layer's name (so that when you read the original weights from the caffemodel file there will be no conflict with the weights of this layer).
Change num_output to the right number of output classes you are trying to predict.
Note that you need to change ALL classification layers. Usually there is only one, but GoogLeNet happens to have three: "loss1/classifier", "loss2/classifier" and "loss3/classifier".
2. Data
You need to make a new training dataset with the new labels you want to fine tune to. See, for example, this post on how to make an lmdb dataset.
3. How extensive a fine-tuning do you want?
When fine-tuning a model, you can train ALL of the model's weights or choose to fix some weights (usually the filters of the lower/deeper layers) and train only the weights of the top-most layers. This choice is up to you, and it usually depends on the amount of training data available (the more examples you have, the more weights you can afford to fine-tune).
Each layer (that holds trainable parameters) has param { lr_mult: XX }. This coefficient determines how susceptible these weights are to SGD updates. Setting param { lr_mult: 0 } means you FIX the weights of this layer and they will not be changed during the training process.
Edit your train_val.prototxt accordingly.
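Purely as an illustration, the edited classification layer in train_val.prototxt might look roughly like the excerpt below; the new layer name, num_output and lr_mult values are example choices, and the bottom/top blob names should match your original prototxt:

layer {
  name: "loss3/classifier_ft"    # renamed so the original 1000-class weights are not loaded
  type: "InnerProduct"
  bottom: "pool5/7x7_s1"
  top: "loss3/classifier_ft"
  param { lr_mult: 1 }           # set lr_mult: 0 on layers you want to keep frozen
  param { lr_mult: 2 }
  inner_product_param {
    num_output: 20               # number of classes in the new dataset
  }
}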
4. Run caffe
Run caffe train but supply it with the caffemodel weights as initial weights:
~$ $CAFFE_ROOT/build/tools/caffe train -solver /path/to/solver.prototxt -weights /path/to/orig_googlenet_weights.caffemodel
Fine-tuning is a very useful trick to achieve promising accuracy compared to hand-crafted features. Shai already posted a good tutorial for fine-tuning GoogLeNet using Caffe, so I just want to give some recommendations and tricks for fine-tuning in general cases.
Most of the time, we face a classification task where the new dataset (e.g. the Oxford 102 flower dataset or Cat&Dog) falls into one of the following four common situations (CS231n):
The new dataset is small and similar to the original dataset.
The new dataset is small but different from the original dataset (the most common case).
The new dataset is large and similar to the original dataset.
The new dataset is large but different from the original dataset.
In practice, most of the time we do not have enough data to train the network from scratch, but it may be enough for fine-tuning a pre-trained model. Whichever of the cases above applies, the one thing we must ask is: do we have enough data to train the CNN?
If yes, we can train the CNN from scratch. However, in practice it is still beneficial to initialize the weights from a pre-trained model.
If not, we need to check whether the data is very different from the original dataset. If it is very similar, we can just fine-tune the fully connected layers, or train an SVM on the extracted features. However, if it is very different from the original dataset, we may need to fine-tune the convolutional layers as well to improve generalization.
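As a general illustration of this decision outside of Caffe, here is a minimal Keras-style sketch that freezes a pre-trained convolutional base (InceptionV3 standing in for GoogLeNet) and trains only a new classification head; the class count and other details are illustrative assumptions, not a prescribed recipe:

from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.InceptionV3(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                       # small/similar dataset: keep the conv filters fixed

model = keras.Sequential([
    base,
    layers.Dense(20, activation="softmax"),  # new classification head for the new labels
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# For a larger or more different dataset, set base.trainable = True (possibly only for the
# top few blocks) and fine-tune with a small learning rate.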
