Why flatten last encoder layer in a convolutional VAE? - machine-learning

I am quite new in the deep learning game, I was wondering why do we flatten the last layer of the encoder in a VAE and then give the flattened output to a linear layer, which then approximates a location and scale parameter for the prior? Can't we just split the output of a convolutional layer and get the location and scale from here directly, or do the spatial information captured by a convolution mess up the scale and location?
Thanks a lot!

Why do we flatten the last layer of the encoder in a VAE?
There isn't really a good reason other than to make it convenient for printing or reporting. If right before flattening the encoder is of shape [BatchSize,2,2,32] , flattening it to [BatchSize,128] just makes it handy to just list all 128 encoded values per sample. When the decoder then reshapes it to [BatchSize,2,2,32] all the spacial information is put back where it was. No spacial information was lost.
Of course, one may decide to use the encoder of a trained VAE as an image feature extractor. This is actually very useful when we have a LOT of unlabeled images to train a VAE with, but only a few labeled images. After training the VAE on the large unlabeled image set, the encoder effectively becomes a feature extractor. We can then feed the feature extractor into a dense layer whos purpose is to learn the labels. Having the encoder output a flattened data set is very useful in this situation.

Related

Is there a fundamental limit on how accurate location information is encoded in CNNs

Each layer in a CNN reduces the size of the input via convolution and max-pooling operations. Convolution is translation equivariant, but max-pooling is translation invariant. Correct me if this is wrong : each time max-pooling applied, the precise location of a feature is reduced. So the feature maps of the final conv layer in a very deep CNN will have a large receptive field (w.r.t the original image), but the location of this feature (in the original image) is not discernible from looking at this feature map alone.
If this is true, how can the accuracy of bounding boxes when we do localisation be so good with a deep CNN? I understand how classification works, but making accurate bounding box predictions is confusing me.
Perhaps a toy example will clarify my confusion;
Say we have a dataset of images with dimension 256x256x1, and we want to predict whether a cat is present, and if so, where it is, so our target is something like [sigmoid_cat_present, cat_location].
Our vanilla CNN (let's assume something like VGG) will take in the image and transform it to something like 16x16x256 in the last convolutional layer. Each pixel in this final 16x16 feature map can be influenced by a much larger region in the original image. So if we determine a cat is present, how can the [cat_location] be refined to value more granular than this effective receptive field?
To add to your question - how about pixel perfect accuracy of segmentation boundary !!
Your intuition regarding down-sampling via max-pooling is correct. Normal CNNs have that limit. However, there have been some improvements recently to overcome it.
The breakthrough to this problem came in 2015-6 in the form of U-net and atrous/dilated convolution introduced in DeepLab.
Dilated convolutions or atrous convolutions, previously described for wavelet analysis without signal decimation, expands window size without increasing the number of weights by inserting zero-values into convolution kernels. Dilated convolutions have been shown to decrease blurring in semantic segmentation maps, and are purported to work at least in part by extracting long range information without the need for pooling.
Using U-Net architectures is another method that seeks to retain high spatial frequency information by directly adding skip connections between early and late layers. In other words, up-sampling followed by down-sampling.
In TensorFlow, atrous convolutions are implemented with function:
tf.nn.atrous_conv2d
There are many more methods and this is an ongoing research area.

Predicting Dance Frames with Pytorch

I created a VAE archictecture to encode dance frames into latent representations.
Then I planned to use LSTM to take a sequence of those latent vectors to predict the next one. Then decode it and thus generate a new dance sequence.
However, this is not working. The LSTMs prediction is slightly off which makes the predicted frame not entirely accurate. Since that inaccurate frame is used as input to the LSTM next, the prediction of the LSTM becomes even worse off until the predictions are just black.
Does anyone have experience with encoded sequential data prediction who knows how to solve the issue? Or maybe could suggest another architecture that can work to predict dance frames?

Image enhancement before CNN helpful?

I have a Deep learning model ( transfer learning based in keras) to do regression problem on medical images. Does it help or have any logical idea or doing some image enhancements like strengthening the edges or doing histogram equalization before feeding the inputs to the CNN?
It is possible to train model accurately by using something you told.
For training CNN model with data, they almost use image augmentation in pre-processing phase.
There are list usually used in augmentation.
color noise
transform
rotate
whitening
affine
crop
flip
etc...
You can refer to here

Can i Retrain Inception's Final Layer using depth images from Kinect.?

I like to know whether I can use data set of signs that is made using Kinect to retrain inception's final layer like mentioned in the Tensor Flow tutorial website that uses ordinary RGB images.I am new to this field. Opinions are much appreciated.
The short answer is "No. You cannot just fine tune only the last layer. But you can fine tune the whole pre-trained network.". The first layers of the pre-trained network is looking for RGB features. Your depth frames will hardly provide enough entropy to match that. Your options are:
If the recognised/tracked objects (hands) are not masked and you have actual depth data for the background, you can train from scratch on depth images with few contrast stretching and data whitening ((x-mu)/sigma). This will take very long time for the ivy league networks like Inception and ResNet. Also, keep in mind that most python based deep learning frameworks rely on PIL image loaders which by default assumes images are of 8bits channels mapped in the range [0, 1]. These image loaders cast all 16bits pixels ones.
If the recognised/tracked object (hands) are masked which means your background is set to the same value or barely have gradient in it, the network will overfit on the silhouette of the object because this is where the strongest edges are. The solution for this is to colorise the depth image using normal maps, HSA, HSV, JET colour coding to convert it into 3x8bits channeled image. This makes the training converge much faster and in my late experiments we found that you can fine tune the ivy league networks on the colorised depth.
Since you are new to this field.I would like to suggest you to read what is transfer learning all the three types mentioned.I would like to tell you to apply any of the mentioned forms of transfer learning basing on your data set.If your data set is very similar to the type of model you are using then you can pass through last layers.If you data is not similar you have to tune the existing model and use it.
As the layers of the neural networks increases the data specific feature extraction increases so you have to take care of the specific layers if your dataset is not very similar to the pre-built model dataset. The starting layers will contain more generic features.

Poor performance on digit recognition with CNN trained on MNIST dataset

I trained a CNN (on tensorflow) for digit recognition using MNIST dataset.
Accuracy on test set was close to 98%.
I wanted to predict the digits using data which I created myself and the results were bad.
What I did to the images written by me?
I segmented out each digit and converted to grayscale and resized the image into 28x28 and fed to the model.
How come that I get such low accuracy on my data set where as such high accuracy on test set?
Are there other modifications that i'm supposed to make to the images?
EDIT:
Here is the link to the images and some examples:
Excluding bugs and obvious errors, my guess would be that your problem is that you are capturing your hand written digits in a way that is too different from your training set.
When capturing your data you should try to mimic as much as possible the process used to create the MNIST dataset:
From the oficial MNIST dataset website:
The original black and white (bilevel) images from NIST were size
normalized to fit in a 20x20 pixel box while preserving their aspect
ratio. The resulting images contain grey levels as a result of the
anti-aliasing technique used by the normalization algorithm. the
images were centered in a 28x28 image by computing the center of mass
of the pixels, and translating the image so as to position this point
at the center of the 28x28 field.
If your data has a different processing in the training and test phases then your model is not able to generalize from the train data to the test data.
So I have two advices for you:
Try to capture and process your digit images so that they look as similar as possible to the MNIST dataset;
Add some of your examples to your training data to allow your model to train on images similar to the ones you are classifying;
For those still have a hard time with the poor quality of CNN based models for MNIST:
https://github.com/christiansoe/mnist_draw_test
Normalization was the key.

Resources