I am not able to understand how to define (samples, timesteps) for the ConvLSTM2D layer when the data are still images rather than videos, for an image compression model.
I am working on the MNIST data set and I do not know how to define the input shape.
I am stuck on how to define the input shape for my network, because the ConvLSTM2D layer takes a 5D input tensor (samples, time, width, height, channels).
I am trying to build a model like this:
An image compression model using Conv-RNN
Should I also reshape the train and test splits?
Thanks
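For reference, here is a minimal sketch, assuming TensorFlow/Keras and treating each image as a sequence of length 1, of one way to reshape MNIST into the 5D tensor ConvLSTM2D expects; the layer sizes are illustrative only.

# Minimal sketch: reshaping MNIST for ConvLSTM2D (assumes TensorFlow/Keras).
# ConvLSTM2D expects (samples, timesteps, height, width, channels); with still
# images we can treat each image as a sequence of length 1.
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist

(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Reshape both splits: (samples, 28, 28) -> (samples, 1, 28, 28, 1)
x_train = x_train.reshape(-1, 1, 28, 28, 1)
x_test = x_test.reshape(-1, 1, 28, 28, 1)

model = models.Sequential([
    # input_shape omits the samples dimension: (timesteps, height, width, channels)
    layers.ConvLSTM2D(16, (3, 3), padding="same", return_sequences=False,
                      input_shape=(1, 28, 28, 1)),
    layers.Conv2D(1, (3, 3), padding="same", activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")
model.summary()

# For a compression/autoencoder-style setup, the targets are the images themselves:
# model.fit(x_train, x_train.reshape(-1, 28, 28, 1), epochs=5, batch_size=128)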
I am quite new to the deep learning game, and I was wondering: why do we flatten the last layer of the encoder in a VAE and then feed the flattened output to a linear layer, which then approximates a location and a scale parameter for the prior? Can't we just split the output of a convolutional layer and get the location and scale from it directly, or does the spatial information captured by a convolution mess up the scale and location?
Thanks a lot!
Why do we flatten the last layer of the encoder in a VAE?
There isn't really a good reason other than to make it convenient for printing or reporting. If right before flattening the encoder output has shape [BatchSize,2,2,32], flattening it to [BatchSize,128] just makes it handy to list all 128 encoded values per sample. When the decoder then reshapes it back to [BatchSize,2,2,32], all the spatial information is put back where it was. No spatial information is lost.
Of course, one may decide to use the encoder of a trained VAE as an image feature extractor. This is actually very useful when we have a LOT of unlabeled images to train a VAE with, but only a few labeled images. After training the VAE on the large unlabeled image set, the encoder effectively becomes a feature extractor. We can then feed the extracted features into a dense layer whose purpose is to learn the labels. Having the encoder output a flattened representation is very useful in this situation.
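As an illustration of that flatten/reshape round trip, and of the flatten-then-dense pattern from the question, here is a minimal sketch assuming Keras; the layer sizes are made up and only chosen so the encoder ends at a 2x2x32 shape.

# Minimal sketch of the flatten -> Dense(mu, log_var) pattern and the decoder
# reshape (assumes TensorFlow/Keras; shapes and layer sizes are illustrative).
from tensorflow.keras import layers, models

latent_dim = 16

# Encoder: conv stack ending in shape (2, 2, 32), then flatten to 128 values.
enc_in = layers.Input(shape=(32, 32, 1))
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(enc_in)  # 16x16
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)       # 8x8
x = layers.Conv2D(32, 3, strides=4, padding="same", activation="relu")(x)       # 2x2x32
x = layers.Flatten()(x)                      # (batch, 128)
z_mean = layers.Dense(latent_dim)(x)         # location
z_log_var = layers.Dense(latent_dim)(x)      # scale (as a log-variance)
encoder = models.Model(enc_in, [z_mean, z_log_var])

# Decoder: dense back to 128 values, reshape to (2, 2, 32), then upsample.
dec_in = layers.Input(shape=(latent_dim,))
y = layers.Dense(2 * 2 * 32, activation="relu")(dec_in)
y = layers.Reshape((2, 2, 32))(y)            # spatial layout restored here
y = layers.Conv2DTranspose(32, 3, strides=4, padding="same", activation="relu")(y)
y = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(y)
dec_out = layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid")(y)
decoder = models.Model(dec_in, dec_out)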
I have an image of roughly 5000x5000x8. I converted this big image into small patches (e.g. 400 images of dimension 256x256x8 (x, y, channel)) and imported them into a NumPy array. I now have an array of shape (400, 256, 256, 8) for one of my classes and another array of shape (285, 256, 256, 8) for my second class, and I saved these arrays to .npy files. I want to classify this image pixel by pixel, and I have a label matrix with 2 classes. I want to classify the image with a customised U-Net method, and I use the Deep Cognition and Peltarion websites to configure my network and data, so I need a method to help me classify my image pixel-wise. Please help me.
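For what it's worth, here is a minimal sketch of a pixel-wise (semantic segmentation) setup for patch arrays like these, assuming Keras; the file names (class_a_patches.npy, class_b_patches.npy, pixel_labels.npy) are placeholders, and the tiny network with a single skip connection only hints at a real customised U-Net.

# Minimal pixel-wise classification sketch (assumes TensorFlow/Keras; file
# names and the small network are placeholders, not the asker's actual setup).
import numpy as np
from tensorflow.keras import layers, models

# The (400, 256, 256, 8) and (285, 256, 256, 8) patch arrays saved earlier.
class_a = np.load("class_a_patches.npy")
class_b = np.load("class_b_patches.npy")
x = np.concatenate([class_a, class_b], axis=0).astype("float32")

# Per-pixel labels, assumed shape (685, 256, 256, 1) with values 0 or 1.
y = np.load("pixel_labels.npy").astype("float32")

# A small fully convolutional network; a real U-Net adds more levels and skips.
inputs = layers.Input(shape=(256, 256, 8))
c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
p1 = layers.MaxPooling2D()(c1)
c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
u1 = layers.UpSampling2D()(c2)
u1 = layers.Concatenate()([u1, c1])                       # U-Net style skip
outputs = layers.Conv2D(1, 1, activation="sigmoid")(u1)   # per-pixel class prob.

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, batch_size=8, epochs=10, validation_split=0.2)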
So far I have trained my neural network on the MNIST data set (from this tutorial). Now I want to test it by feeding it my own images.
I've processed the image using OpenCV by resizing it to 28x28 pixels, converting it to grayscale, and applying adaptive thresholding. Where do I go from here?
Here, an 'image' is a 28x28 array of values from 0 to 1, so not really an image. Just greyscaling your original image will not make it fit for input. You have to go through the following steps:
1. Load your image into your programming language, as 784 RGB pixel values.
2. For each pixel, take the average of its R, G and B values, then divide it by 255. You now have the greyscale value of that pixel, a number between 0 and 1.
3. Replace the RGB values with these greyscale values.
You will now have a 28x28 array of values between 0 and 1, in the same format as the MNIST samples.
So you must do everything through your programming language. If you just greyscale an image with a photo editor, the pixels will still be stored as R, G, B.
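A minimal sketch of these steps in Python/NumPy (it assumes the image has already been resized to 28x28 and loaded as an RGB array; the helper name is illustrative):

# Average the R, G, B channels, scale to [0, 1], and flatten to 784 values.
import numpy as np

def to_mnist_input(rgb_image):
    """rgb_image: uint8 array of shape (28, 28, 3)."""
    grey = rgb_image.mean(axis=-1)      # average of R, G and B per pixel
    grey = grey / 255.0                 # values now between 0 and 1
    return grey.reshape(1, 784)         # 784 inputs, one sample

# Example: a random stand-in for a 28x28 RGB image.
fake_image = np.random.randint(0, 256, size=(28, 28, 3), dtype=np.uint8)
x = to_mnist_input(fake_image)
print(x.shape, x.min(), x.max())        # (1, 784), values in [0, 1]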
You can use libraries like PIL or skimage to load the data into NumPy arrays in Python; they also support many image operations such as grayscaling, scaling, etc.
After you have processed the image and read the data into a NumPy array, you can feed it to your network.
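For example, a short sketch using Pillow and NumPy (the file name my_digit.png is a placeholder):

# Sketch of the same preprocessing using Pillow ("my_digit.png" is a placeholder).
import numpy as np
from PIL import Image

img = Image.open("my_digit.png").convert("L")    # grayscale
img = img.resize((28, 28))                       # match MNIST dimensions
arr = np.asarray(img, dtype=np.float32) / 255.0  # scale to [0, 1]

# MNIST digits are light strokes on a dark background, so depending on your
# photo you may also need to invert: arr = 1.0 - arr

# Depending on the network, either flatten to 784 values or keep 28x28x1.
x_flat = arr.reshape(1, 784)
x_conv = arr.reshape(1, 28, 28, 1)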
Can a CNN be created that outputs the input image with a feature added to it?
For example, given an image of a person's face as input, it outputs an image of that person's face wearing glasses.
There are several options, but basically, in the same way that you have one input for every pixel, you must have one output for every pixel of the output image.
In MLPs you must have the same number of neurons in the input layer as in the output layer.
In CNNs you can have convolutional layers at the beginning and deconvolutional (transposed convolution) layers after them.
Take a look at this paper (it is awesome) on creating very realistic images from other images (for example, satellite and map views from Google Maps). It is a neural network that tries to solve this problem while also trying to create images that another neural network cannot distinguish from real ones (the source code is also available):
https://phillipi.github.io/pix2pix/
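To make the encoder/decoder idea concrete, here is a minimal sketch assuming Keras; the sizes are illustrative, and a real pix2pix model additionally uses skip connections and an adversarial loss.

# Minimal encoder-decoder sketch for image-to-image output (assumes
# TensorFlow/Keras; far simpler than pix2pix itself).
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(128, 128, 3))   # e.g. a face photo

# Encoder: convolutional layers that downsample the input.
x = layers.Conv2D(32, 4, strides=2, padding="same", activation="relu")(inputs)  # 64x64
x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(x)       # 32x32

# Decoder: "deconvolutional" (transposed convolution) layers that upsample back.
x = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(x)  # 64x64
x = layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu")(x)  # 128x128

# Output has the same spatial size as the input: one value per output pixel.
outputs = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mae")  # pix2pix combines an L1 loss with a GAN loss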
To add to the answer above, another way of doing this is neural style transfer, where we feed two images to a CNN, which then generates a new image combining the content of the second image and the style of the first. Check out this paper for further details: https://arxiv.org/abs/1508.06576
We could of course also use GANs to push the realism further.
I modified the MNIST example, and when I train it with my 3 image classes it reaches an accuracy of 91%. However, when I modify the C++ example with a deploy prototxt file and a labels file and try to test it on some images, it returns a prediction of the second class (1 circle) with a probability of 1.0 no matter what image I give it, even for images that were used in the training set. I've tried a dozen images and it consistently predicts just that one class.
To clarify, in the C++ example I modified, I did scale the image to be predicted just as the images were scaled during training:
img.convertTo(img, CV_32FC1);
img = img * 0.00390625;
If that was the right thing to do, then it makes me wonder whether I've done something wrong with the output layers that calculate the probabilities in my deploy_arch.prototxt file.
I think you have forgotten to scale the input image during classification time, as can be seen in line 11 of the train_test.prototxt file. You should probably multiply by that factor somewhere in your C++ code, or alternatively use a Caffe layer to scale the input (look into ELTWISE or POWER layers for this).
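For example, a Power layer that applies the same 1/256 scaling could look roughly like this in the deploy prototxt (the layer and blob names here are placeholders, not taken from the original files):

layer {
  name: "scale_data"
  type: "Power"
  bottom: "data"
  top: "data"
  power_param {
    power: 1
    scale: 0.00390625   # the same 1/256 factor used during training
    shift: 0
  }
}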
EDIT:
After a conversation in the comments, it turned out that the image mean was mistakenly being subtracted in the classification.cpp file, whereas it was not subtracted in the original training/testing pipeline.
Are your training classes balanced?
You may end up with a network that is stuck predicting one majority class.
To find the issue, I suggest comparing the predictions made during training with the predictions from the C++ forward example on the same training images, for each class.