I created a VAE architecture to encode dance frames into latent representations.
Then I planned to use an LSTM that takes a sequence of those latent vectors and predicts the next one. Decoding that prediction should then generate a new dance sequence.
However, this is not working. The LSTM's prediction is slightly off, which makes the predicted frame not entirely accurate. Since that inaccurate frame is then used as the next input to the LSTM, each prediction gets worse than the last until the predictions are just black.
Does anyone with experience in predicting encoded sequential data know how to solve this issue? Or could anyone suggest another architecture that can predict dance frames?
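To make the failure mode concrete, here is a minimal toy sketch of how a small one-step error compounds once predictions are fed back as inputs. It is not the VAE/LSTM setup itself: the "true dynamics" (a 2-D rotation) and the slightly wrong "model" (a rotation with a small angle error and a slight shrink) are hypothetical stand-ins chosen only for illustration.

```python
import math

def rotate(z, angle, scale=1.0):
    """Rotate a 2-D point by `angle` radians and scale its length."""
    x, y = z
    c, s = math.cos(angle), math.sin(angle)
    return (scale * (c * x - s * y), scale * (s * x + c * y))

def true_step(z):
    # Hypothetical ground-truth latent dynamics.
    return rotate(z, 0.30)

def model_step(z):
    # Hypothetical learned predictor: slightly wrong angle, slight shrink.
    return rotate(z, 0.32, scale=0.98)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def compare(n_steps=30, z0=(1.0, 0.0)):
    truth = z0
    rollout = z0
    teacher_forced_err, rollout_err = [], []
    for _ in range(n_steps):
        next_truth = true_step(truth)
        # One-step prediction from the true previous state.
        teacher_forced_err.append(dist(model_step(truth), next_truth))
        # Prediction from the model's own previous output.
        rollout = model_step(rollout)
        rollout_err.append(dist(rollout, next_truth))
        truth = next_truth
    return teacher_forced_err, rollout_err

teacher_forced_err, rollout_err = compare()
print("one-step error at last step:", round(teacher_forced_err[-1], 4))
print("rollout error at last step: ", round(rollout_err[-1], 4))
```

In this toy, the one-step error stays small at every step, while the error of the fed-back rollout keeps growing, which mirrors the behaviour described above.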
I am quite new to deep learning, and I was wondering: why do we flatten the last layer of the encoder in a VAE and then feed the flattened output to a linear layer, which approximates the location and scale parameters for the prior? Can't we just split the output of a convolutional layer and take the location and scale from it directly, or does the spatial information captured by a convolution interfere with the scale and location?
Thanks a lot!
Why do we flatten the last layer of the encoder in a VAE?
There isn't really a good reason other than convenience for printing or reporting. If the encoder output right before flattening has shape [BatchSize,2,2,32], flattening it to [BatchSize,128] makes it easy to list all 128 encoded values per sample. When the decoder then reshapes it to [BatchSize,2,2,32], all the spatial information is put back where it was. No spatial information is lost.
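As a small illustration of the round trip, the sketch below flattens one sample of shape [2,2,32] into 128 values and reshapes it back, using plain nested lists. The helper names `flatten` and `reshape` are written here for illustration and assume row-major ordering, which is also the default in common array libraries.

```python
def flatten(nested):
    """Recursively flatten nested lists in row-major order."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat

def reshape(flat, shape):
    """Rebuild nested lists of the given shape from a flat list (row-major)."""
    if len(shape) == 1:
        return list(flat)
    step = len(flat) // shape[0]
    return [reshape(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]

# One sample of encoder output with shape [2, 2, 32]; values are just unique ids.
feature_map = [[[h * 64 + w * 32 + c for c in range(32)]
                for w in range(2)]
               for h in range(2)]

code = flatten(feature_map)            # 128 values per sample
restored = reshape(code, (2, 2, 32))   # what the decoder's reshape does

print(len(code))
print(restored == feature_map)
```

The flattened code and the restored feature map hold exactly the same values, only indexed differently.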
Of course, one may decide to use the encoder of a trained VAE as an image feature extractor. This is actually very useful when we have a LOT of unlabeled images to train a VAE with, but only a few labeled images. After training the VAE on the large unlabeled image set, the encoder effectively becomes a feature extractor. We can then feed the feature extractor's output into a dense layer whose purpose is to learn the labels. Having the encoder output a flattened vector is very useful in this situation.
While training an LSTM model, I encountered a problem that I couldn't solve. To begin with, let me describe my model: I used a stacked LSTM in PyTorch with 3 layers and 256 hidden units in each layer to predict human joint torques and joint angles from EMG features. After training, the model predicts well when the ground truth is far from 0, but when the ground truth is near zero, there is always an offset between the predicted value and the ground truth. My guess is that large ground-truth values have more influence on the loss during training.
This is the result:
[Figure: the prediction for the validation set]
[Figure: the prediction for the training set]
As you can see from the figures, in both datasets, the model can predict well when the ground truth is above 20 degrees. I have tried with different loss functions but the situation did not improve. Since I am just a beginner in this field, I hope someone can point out the problem in my method and how to solve it. Thank you!
Thank you for viewing my question. I'm trying to do image classification based on some pre-trained models; the images should be classified into 40 classes. I want to use the pre-trained VGG and Xception models to convert each image into two 1000-dimensional vectors and stack them into a 1*2000-dimensional vector as the input of my network, which has a 40-dimensional output. The network has 2 hidden layers, one with 1024 neurons and the other with 512 neurons.
Structure:
image-> vgg(1*1000 dimensions), xception(1*1000 dimensions)->(1*2000 dimensions) as input -> 1024 neurons -> 512 neurons -> 40 dimension output -> softmax
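For a sense of scale, the size of the classifier in the structure above can be worked out directly. This is a minimal sketch that assumes plain fully connected layers with one bias per output unit, which the post does not state explicitly.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# 2000-d input -> 1024 -> 512 -> 40 outputs, as in the structure above.
layer_sizes = [2000, 1024, 512, 40]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [2049024, 524800, 20520]
print(total)      # 2594344
```

Under these assumptions the classifier alone has roughly 2.6 million trainable parameters, most of them in the first hidden layer.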
However, using this structure I can only achieve about 30% accuracy. So my question is: how could I optimize the structure of my network to achieve higher accuracy? I'm new to deep learning, so I'm not quite sure my current design is 'correct'. I'm really looking forward to your advice.
I'm not entirely sure I understand your network architecture, but some pieces don't look right to me.
There are two major transfer learning scenarios:
ConvNet as fixed feature extractor. Take a pretrained network (either VGG or Xception will do; you do not need both), remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. For example, in an AlexNet, this would compute a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.
Tip #1: take only one pretrained network.
Tip #2: no need for multiple hidden layers for your own classifier.
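The "train a linear classifier on the extracted codes" step can be sketched without any framework. The example below is only an illustration: the 2-D points stand in for the high-dimensional codes a real pretrained ConvNet would produce, and the three toy classes, the learning rate and the epoch count are all assumed values.

```python
import math
import random

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

def train_softmax(features, labels, n_classes, lr=0.1, epochs=100, seed=0):
    """Single linear layer + softmax, trained with SGD on cross-entropy."""
    rng = random.Random(seed)
    n_features = len(features[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            probs = softmax(class_scores(W, b, x))
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)
                b[k] -= lr * grad
                for j in range(n_features):
                    W[k][j] -= lr * grad * x[j]
    return W, b

def predict(W, b, x):
    scores = class_scores(W, b, x)
    return scores.index(max(scores))

# Toy stand-in for fixed, pre-extracted feature codes (3 classes, 2-D).
rng = random.Random(0)
centers = [(2.0, 0.0), (-2.0, 0.0), (0.0, 3.0)]
features, labels = [], []
for label, (cx, cy) in enumerate(centers):
    for _ in range(20):
        features.append((cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)))
        labels.append(label)

W, b = train_softmax(features, labels, n_classes=3)
accuracy = sum(predict(W, b, x) == y for x, y in zip(features, labels)) / len(labels)
print("training accuracy:", accuracy)
```

The pretrained network never changes in this scenario; only `W` and `b` of the small classifier are learned.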
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset.
Tip #3: keep the early pretrained layers fixed.
Tip #4: use a small learning rate for fine-tuning because you don't want to distort other pretrained layers too quickly and too much.
This architecture much more closely resembles the ones I have seen solve the same problem, and it has a higher chance of reaching high accuracy.
There are a couple of steps you may try when the model is not fitting well:
Increase training time and decrease learning rate. It may be stopping at very bad local optima.
Add additional layers that can extract specific features for the large number of classes.
Create multiple two-class deep networks for each class ('yes' or 'no' output class). This will let each network be more specialized for each class, rather than training one single network to learn all 40 classes.
Increase training samples.
I am attempting to train a tanh neural network with 2 hidden layers on the MNIST data set using the ADADELTA algorithm.
Here are the parameters of my setup:
Tanh activation function
2 Hidden layers with 784 units (same as the number of input units)
I am using softmax with cross entropy loss on the output layer
I randomly initialized the weights with a fan-in of ~15, using Gaussian-distributed weights with a standard deviation of 1/sqrt(15)
I am using a minibatch size of 10 with 50% dropout.
I am using the default parameters of ADADELTA (rho=0.95, epsilon=1e-6)
I have checked my derivatives vs automatic differentiation
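The initialization in the list above can be sketched as follows. This is one possible reading, not necessarily the exact code used: it assumes "fan-in of ~15" means each hidden unit receives exactly 15 nonzero incoming weights at randomly chosen inputs, with all other weights starting at zero. The function name `sparse_init` and the seed are illustrative.

```python
import math
import random

def sparse_init(n_in, n_out, fan_in=15, seed=0):
    """Return an n_out x n_in weight matrix with `fan_in` nonzero
    Gaussian weights (std = 1/sqrt(fan_in)) per output unit."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(fan_in)
    weights = [[0.0] * n_in for _ in range(n_out)]
    for row in weights:
        # Pick which inputs this unit is connected to at initialization.
        for j in rng.sample(range(n_in), fan_in):
            row[j] = rng.gauss(0.0, std)
    return weights

# First layer: 784 inputs -> 784 hidden units, as in the setup above.
W1 = sparse_init(784, 784)
nonzeros_per_unit = [sum(1 for w in row if w != 0.0) for row in W1]
print(min(nonzeros_per_unit), max(nonzeros_per_unit))
```

Under this reading, each unit's initial weight image is mostly zero with a handful of isolated nonzero pixels.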
If I run ADADELTA, at first it makes gains in the error, and I can see that the first layer is learning to identify the shapes of digits. It does a decent job of classifying the digits. However, when I run ADADELTA for a long time (30,000 iterations), it's clear that something is going wrong. While the objective function stops improving after a few hundred iterations (and the internal ADADELTA variables stop changing), the first layer weights still have the same sparse noise they were initialized with (despite real features being learned on top of that noise).
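For reference, here is a minimal sketch of the per-parameter ADADELTA update as it is usually stated, with the default values mentioned above (rho=0.95, epsilon=1e-6). It is not the poster's implementation; the class name `Adadelta` and the toy objective are chosen for illustration only.

```python
import math

class Adadelta:
    def __init__(self, n_params, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.avg_sq_grad = [0.0] * n_params    # running average of g^2
        self.avg_sq_delta = [0.0] * n_params   # running average of (delta x)^2

    def step(self, params, grads):
        """Update `params` in place from `grads` and return them."""
        rho, eps = self.rho, self.eps
        for i, g in enumerate(grads):
            self.avg_sq_grad[i] = rho * self.avg_sq_grad[i] + (1 - rho) * g * g
            delta = -(math.sqrt(self.avg_sq_delta[i] + eps)
                      / math.sqrt(self.avg_sq_grad[i] + eps)) * g
            self.avg_sq_delta[i] = rho * self.avg_sq_delta[i] + (1 - rho) * delta * delta
            params[i] += delta
        return params

# Toy objective f(x, y) = x^2: the second parameter never receives a gradient.
params = [1.0, 0.37]
opt = Adadelta(n_params=2)
for _ in range(500):
    grads = [2.0 * params[0], 0.0]
    opt.step(params, grads)

print(params)
```

One property visible in this sketch is that the update is proportional to the gradient, so a parameter whose gradient is exactly zero on every step is never moved from its initial value. Whether that applies to the weights in question depends on the actual gradients in the network.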
To illustrate what I mean, here is the example output from the visualization of the network.
Notice the pixel noise in the weights of the first layer, despite them having structure. This is the same noise that they were initialized with.
None of the training examples have discontinuous values like this noise, but for some reason the ADADELTA algorithm never reduces these outlier weights to be in line with their neighbors.
What is going on?
I am using One-Class SVM for outlier detection. It appears that as the number of training samples increases, the sensitivity TP/(TP+FN) of the One-Class SVM detection result drops, while the classification rate and specificity both increase.
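For clarity, the three quantities can be computed from confusion-matrix counts as in the sketch below. It assumes that "classification rate" means overall accuracy, and the counts in the usage lines are made-up numbers for illustration, not results from the experiment.

```python
def detection_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)                 # true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall classification rate
    return sensitivity, specificity, accuracy

# Hypothetical counts, for illustration only.
sensitivity, specificity, accuracy = detection_metrics(tp=23, fn=2, tn=70, fp=5)
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 3))
```

Because sensitivity depends only on the positive class and specificity only on the negative class, one can fall while the other and the overall rate rise.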
What's the best way of explaining this relationship in terms of hyperplane and support vectors?
Thanks
The more training examples you have, the less your classifier is able to detect true positives correctly.
It means that the new data does not fit correctly with the model you are training.
Here is a simple example.
Below you have two classes, and we can easily separate them using a linear kernel.
The sensitivity of the blue class is 1.
As I add more yellow training data near the decision boundary, the generated hyperplane can't fit the data as well as before.
As a consequence, there are now two misclassified blue data points.
The sensitivity of the blue class is now 0.92
As the amount of training data increases, the support vectors generate a somewhat less optimal hyperplane. The extra data may turn a linearly separable data set into one that is not linearly separable. In that case, trying a different kernel, such as the RBF kernel, can help.
EDIT: More information about the RBF kernel:
In this video you can see what happens with an RBF kernel.
The same logic applies: if the training data is not easily separable in n dimensions, you will get worse results.
You should try to select a better C using cross-validation.
In this paper, Figure 3 illustrates that the results can be worse if C is not properly selected:
More training data could hurt if we did not pick a proper C. We need to
cross-validate on the correct C to produce good results