Goal
I am trying to run an LSTM autoencoder over a dataset of multivariate time series:
X_train (200, 23, 178) - X_val (100, 23, 178) - X_test (100, 23, 178)
Current situation
A plain autoencoder gets better results than a simple LSTM AE architecture.
I have some doubts about how I use the RepeatVector layer, which, as far as I understand, is supposed to simply repeat the last state of the LSTM/GRU cell a number of times equal to the sequence length, in order to match the input shape expected by the decoder layer.
The model architecture does not raise any error, but the results are an order of magnitude worse than those of the simple AE, while I was expecting them to be at least as good, since I am using an architecture that should be a better fit for a temporal problem.
Are these results comparable, first of all?
Nevertheless, the reconstruction error of the LSTM-AE does not look good at all.
My AE model:
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 178) 31862
_________________________________________________________________
batch_normalization (BatchNo (None, 178) 712
_________________________________________________________________
dense_1 (Dense) (None, 59) 10561
_________________________________________________________________
dense_2 (Dense) (None, 178) 10680
=================================================================
optimizer: sgd
loss: mse
activation function of the dense layers: relu
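For reference, the summary above is consistent with a model built roughly like the following (a sketch inferred from the parameter counts, not the original code):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

# inferred: 178 inputs -> 59-unit bottleneck -> 178-value reconstruction
model = Sequential([
    Dense(178, activation='relu', input_shape=(178,)),  # 178*178 + 178 = 31862 params
    BatchNormalization(),                               # 4*178 = 712 params
    Dense(59, activation='relu'),                       # 178*59 + 59 = 10561 params
    Dense(178, activation='relu'),                      # 59*178 + 178 = 10680 params
])
model.compile(optimizer='sgd', loss='mse')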
My LSTM/GRU AE:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 23, 178) 0
_________________________________________________________________
gru (GRU) (None, 59) 42126
_________________________________________________________________
repeat_vector (RepeatVector) (None, 23, 59) 0
_________________________________________________________________
gru_1 (GRU) (None, 23, 178) 127092
_________________________________________________________________
time_distributed (TimeDistri (None, 23, 178) 31862
=================================================================
optimizer: sgd
loss: mse
activation function of the gru layers: relu
The two models you have above do not seem to be comparable in a meaningful way. The first model is attempting to compress your vectors of 178 values. It is quite possible that these vectors contain some redundant information, so it is reasonable to assume that you will be able to compress them.
The second model is attempting to compress a sequence of 23 x 178 vectors via a single GRU layer, a task with a significantly higher number of parameters. The RepeatVector simply takes the output of the first GRU layer (the encoder) and makes it the input of the second GRU layer (the decoder). But then you take a single value of the decoder. Instead of the TimeDistributed layer, I'd recommend that you use return_sequences=True in the second GRU (the decoder). Otherwise you are saying that you expect the 23x178 sequence to consist of elements that all have the same value; that has to lead to a very high error / no solution.
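For reference, a common Keras wiring of this encoder-decoder pattern looks like the following (a sketch using the sizes from your summary; return_sequences=True on the decoder is the key part):
from tensorflow.keras.layers import Input, GRU, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

inp = Input(shape=(23, 178))
encoded = GRU(59)(inp)                                # encoder: last state only
repeated = RepeatVector(23)(encoded)                  # repeat it once per timestep
decoded = GRU(178, return_sequences=True)(repeated)   # decoder emits the full sequence
out = TimeDistributed(Dense(178))(decoded)            # per-timestep reconstruction
model = Model(inp, out)
model.compile(optimizer='sgd', loss='mse')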
I'd recommend you take a step back. Is your goal to find similarity between the sequences, or to be able to make predictions? An autoencoder approach is preferable for a similarity task. To make predictions, I'd recommend you move towards an approach where you apply a Dense(1) layer to the output of the sequence steps.
Is your dataset open/available? I'd be curious to take it for a spin if that were possible.
Related
I have many images of documents and I would like to cluster them to create categories (invoices, receipts, etc.). I would like to explore the image approach (I know I could use the text), so I decided to build a CNN autoencoder to compress the images to a lower-dimensional space and then run a clustering algorithm like DBSCAN.
My issue is that I have no idea how to select the network layers, the different activation functions, etc. This is my current model; what do you think?
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, Conv2DTranspose, LeakyReLU,
                                     Flatten, Dense, Reshape, Activation)
from tensorflow.keras.regularizers import l1, l2
import numpy as np

model = Sequential()
# encoder: two strided convolutions, each halving the spatial dimensions
model.add(Conv2D(16, (3, 3), strides=2, padding='same', kernel_regularizer=l2(), input_shape=image_rgb_dims_top))
model.add(LeakyReLU(alpha=0.2))
#model.add(AveragePooling2D(pool_size=(2,2), padding='same'))
model.add(Conv2D(32, (3, 3), strides=2, padding='same', kernel_regularizer=l2()))
model.add(LeakyReLU(alpha=0.2))
#model.add(AveragePooling2D(pool_size=(2,2), padding='same'))
# bottleneck: sparse 96-dimensional code
model.add(Flatten())
model.add(Dense(96, activity_regularizer=l1(10e-6)))
# decoder: mirror the encoder back to the input resolution
model.add(Dense(np.prod(model.layers[-2].output_shape[1:]), activation='relu'))
model.add(Reshape(model.layers[-4].output_shape[1:]))
model.add(Conv2DTranspose(32, (3, 3), strides=(2, 2), padding='same'))
model.add(LeakyReLU(alpha=0.2))
#model.add(UpSampling2D((2, 2)))
model.add(Conv2DTranspose(16, (3, 3), strides=(2, 2), padding='same'))
model.add(LeakyReLU(alpha=0.2))
#model.add(UpSampling2D((2, 2)))
model.add(Conv2D(1, (3, 3), padding='same'))
model.add(Activation('sigmoid'))
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 20, 76, 16) 160
_________________________________________________________________
leaky_re_lu (LeakyReLU) (None, 20, 76, 16) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 10, 38, 32) 4640
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 10, 38, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 12160) 0
_________________________________________________________________
dense (Dense) (None, 96) 1167456
_________________________________________________________________
dense_1 (Dense) (None, 12160) 1179520
_________________________________________________________________
reshape (Reshape) (None, 10, 38, 32) 0
_________________________________________________________________
conv2d_transpose (Conv2DTran (None, 20, 76, 32) 9248
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 20, 76, 32) 0
_________________________________________________________________
conv2d_transpose_1 (Conv2DTr (None, 40, 152, 16) 4624
_________________________________________________________________
leaky_re_lu_3 (LeakyReLU) (None, 40, 152, 16) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 40, 152, 1) 145
_________________________________________________________________
activation (Activation) (None, 40, 152, 1) 0
=================================================================
Total params: 2,365,793
Trainable params: 2,365,793
Non-trainable params: 0
_________________________________________________________________
I use MSE loss and the Adam optimizer.
The problems I encountered are:
The model overfits to the images most present in the dataset, so it creates many categories for the same type of document even when there is very little difference between them (just a small logo added, and it considers it a new cluster).
The images that are less present are not learned well enough; I get a very blurry output, and most of them are clustered as -1 by DBSCAN.
Any idea how to make the model more effective? I don't want it to overfit, yet it is underfitting some images.
What are good layers/activation functions/regularizers to use? Should I increase the compressed representation size or decrease it? It is very difficult to benchmark the effect of changes in the network; all I can do is run the DBSCAN clustering and look at the output classes, but this still depends on the DBSCAN epsilon parameter, so I can't tell whether the model did well or not.
There are many things you could try, and it is hard to know whether something will be effective before actually trying it.
I will first address the problems you explicitly stated.
"The model overfits to the images most present in the dataset"
You may try using a bigger dataset or, if you can't, a smaller model might work.
You may try using a pretrained model and running transfer learning on it.
You may try early stopping; a minimal sketch follows.
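For instance, a hedged sketch with tensorflow.keras (the monitored metric, the patience value, and the commented-out fit call are assumptions about your training setup):
from tensorflow.keras.callbacks import EarlyStopping

# stop once the validation loss stops improving; patience=5 is an assumption
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)

# an autoencoder reconstructs its input, so inputs double as targets
# model.fit(x_train, x_train, validation_data=(x_val, x_val),
#           epochs=100, callbacks=[early_stop])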
"The images that are less present are not learned enough"
You may try using datasets that are not biased towards particular
classes.
"What are good layers/activation functions/regluarizers to use ?"
For activation function, ReLU and its variants mostly work well.
There are various layers and combination of them that you can use. Why don't you try using modern SOTA(State of the art) CNN network's architecture as a reference? (You can easily find some of them on Here )
"This is my current model, what do you think ?"
At least, the architecture seems old. If it works well, it would be fine. However, if needed, try using the modern SOTA architectures as mentioned earlier.
"Should I increase the compressed representation size or decrease it ?"
It is unclear. You should try both and pick the better performing method.
You could try different training methods too!
Hand-labeling all of the images would be a complete nightmare, so you might try labeling some of them and run a semi-supervised learning. Ex) SimCLR
Or else, you could search for researches about document-image-classification and use them as a reference.
Hope the answer helped!
I am working on a deep learning project. I have an input array with shape (101, 3) and an output array with shape (101, 3), meaning each row in the input data corresponds to the same row in the output data. My goal is to create a deep learning model to train on my dataset. I did some research and found a few examples, one of which is at this link. As I understand it, I need a many-to-many model, but I don't know how to create one. Can you help me with this? How can I create this model, or are there any resources you can suggest?
You can use something like the following, which you may have to adjust based on network performance:
from tensorflow.keras.layers import TimeDistributed, Dense, LSTM
from tensorflow.keras.models import Sequential

model = Sequential()
# encoder layer: return the full sequence so the decoder sees every timestep
model.add(LSTM(100, activation='relu', return_sequences=True, input_shape=(101, 3)))
# decoder layer
model.add(LSTM(100, activation='relu', return_sequences=True))
# one 3-dimensional output per timestep
model.add(TimeDistributed(Dense(3)))
model.compile(optimizer='adam', loss='mse')
print(model.summary())
You could also use something like the following. Please start your journey with this tutorial. You can play with removing some of the layers below, or adding more, and see how your results change.
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(3,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(3)  # one 3-value output per 3-value input row
])
The model architecture is as follows:
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_6 (Dense) (None, 128) 512
_________________________________________________________________
dense_7 (Dense) (None, 64) 8256
_________________________________________________________________
dense_8 (Dense) (None, 32) 2080
_________________________________________________________________
dense_9 (Dense) (None, 3) 99
=================================================================
Total params: 10,947
Trainable params: 10,947
Non-trainable params: 0
_________________________________________________________________
GaussianNoise in Keras seems to add noise only at training time, but I need to add noise to activations at test time. My architecture is a ResNet50 pretrained on ImageNet with all layers frozen, except that Gaussian noise needs to be added to the last activation layer, just before the FC layer.
How can this be done? The GaussianNoise layer I added at the end (shown below) has no effect, since the documentation says it is only active during training. What is the alternative to this layer at test time?
__________________________________________________________________________________________________
bn5c_branch2c (BatchNormalizati (None, 7, 7, 2048) 8192 res5c_branch2c[0][0]
__________________________________________________________________________________________________
add_80 (Add) (None, 7, 7, 2048) 0 bn5c_branch2c[0][0]
activation_242[0][0]
__________________________________________________________________________________________________
activation_245 (Activation) (None, 7, 7, 2048) 0 add_80[0][0]
__________________________________________________________________________________________________
gaussian_noise_1519 (GaussianNo (None, 7, 7, 2048) 0 activation_245[0][0]
__________________________________________________________________________________________________
avg_pool (GlobalAveragePooling2 (None, 2048) 0 gaussian_noise_1519[0][0]
__________________________________________________________________________________________________
fc1000 (Dense) (None, 1000) 2049000 avg_pool[19][0]
==================================================================================================
Total params: 25,636,712
Trainable params: 0
Non-trainable params: 25,636,712
You can keep layers that behave differently in the test phase (e.g. Dropout) active by passing the training=True argument when calling them on a tensor:
out = SomeLayer(**configs)(inp, training=True)
With that, SomeLayer would be active in both training and test phases.
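Applied to your case, a minimal sketch might look like the following (the noise stddev, include_top=False, and the rebuilt classification head are assumptions; the pretrained fc1000 weights would need to be copied over if you want the original classifier):
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GaussianNoise, GlobalAveragePooling2D, Dense

# frozen backbone, cut right after the last activation (before avg_pool / fc1000)
base = ResNet50(include_top=False, weights='imagenet')
base.trainable = False

# training=True keeps the noise active at inference time as well
x = GaussianNoise(0.1)(base.output, training=True)  # stddev 0.1 is an assumption
x = GlobalAveragePooling2D(name='avg_pool')(x)
# freshly initialized head; copy fc1000 weights from the full model if needed
out = Dense(1000, activation='softmax', name='fc1000')(x)

model = tf.keras.Model(base.input, out)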
Assume this is my model:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_16 (Embedding)     (None, 10, 500)           71500
_________________________________________________________________
lstm_31 (LSTM)               (None, 10, 500)           2002000
_________________________________________________________________
dropout_15 (Dropout)         (None, 10, 500)           0
_________________________________________________________________
time_distributed_16          (None, 10, 500)           250500
_________________________________________________________________
softmax (Activation)         (None, 10, 500)           0
=================================================================
But I want to have in my last layer:
softmax (Activation) (None, 100, 1000) 0
I have been trying to do this for hours. I don't know whether it is even possible. I don't think you can change the output size of the LSTM itself (looking at its model), but is there a layer I can add so that it generates, say, 10 outputs per input?
In simple words, assume I want my model to generate 10 words for each word I put in. I hope I have managed to explain this.
There are different ways of looking at the "multiple outputs" here (and by "here" I am guessing that you are using the Keras library; it seems so from the printout).
In a simple case, having e.g. a Dense(10) layer would solve it. The "secret sauce" is using the TimeDistributed layer wrapper, as explained in this SO post.
The other approach requires the functional API of Keras. How to get multiple outputs is explained in the docs.
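A minimal sketch of the simple case (the vocabulary size of 143 is inferred from your embedding's parameter count, 71500 / 500; the dropout rate and activation are assumptions):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dropout, TimeDistributed, Dense

model = Sequential([
    Input(shape=(10,)),                                # 10 input tokens
    Embedding(input_dim=143, output_dim=500),          # vocab size inferred
    LSTM(500, return_sequences=True),
    Dropout(0.5),                                      # rate is an assumption
    TimeDistributed(Dense(10, activation='softmax')),  # 10 outputs per input step
])
model.summary()  # final output shape: (None, 10, 10)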
I'm new to Keras and am wondering how to train an LSTM with (interrupted) time series of different lengths. Consider, for example, a continuous series from day 1 to day 10 and another continuous series from day 15 to day 20. Simply concatenating them into a single series might yield wrong results. I see basically two options to bring them to shape (batch_size, timesteps, output_features):
Extend the shorter series by some default value (0), i.e. for the above example we would have the following batch:
d1, ..., d10
d15, ..., d20, 0, 0, 0, 0
Compute the GCD of the lengths, cut the series into pieces, and use a stateful LSTM, i.e.:
d1, ..., d5
d6, ..., d10
reset_state
d15, ..., d20
Are there any other / better solutions? Is training a stateless LSTM with a complete sequence equivalent to training a stateful LSTM with pieces?
Have you tried feeding the LSTM layer with inputs of different lengths? The input time series can have different lengths when an LSTM is used (even the batch sizes can differ from one batch to another, but obviously the dimension of the feature vectors should be the same). Here is an example in Keras:
from keras import models, layers

n_feats = 32
latent_dim = 64

# the timesteps axis is None, so sequences of any length are accepted
lstm_input = layers.Input(shape=(None, n_feats))
lstm_output = layers.LSTM(latent_dim)(lstm_input)

model = models.Model(lstm_input, lstm_output)
model.summary()
Output:
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) (None, None, 32) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 64) 24832
=================================================================
Total params: 24,832
Trainable params: 24,832
Non-trainable params: 0
As you can see, the first and second axes of the Input layer are None, meaning they are not pre-specified and can take any value. You can think of the LSTM as a loop: no matter the input length, as long as there are remaining data vectors of the same length (i.e. n_feats), the LSTM layer processes them. Therefore, as you can see above, the number of parameters in an LSTM layer does not depend on the batch size or the time-series length; it depends only on the input feature vector's length and the latent dimension of the LSTM.
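For reference, the 24,832 figure follows directly from the LSTM parameter formula: four gates, each with a weight matrix over the input and the recurrent state plus a bias, i.e. 4 x latent_dim x (n_feats + latent_dim + 1) = 4 x 64 x (32 + 64 + 1) = 24,832.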
import numpy as np

# feed the LSTM with batch_size=10, timesteps=5
model.predict(np.random.rand(10, 5, n_feats))   # this works
# feed the LSTM with batch_size=5, timesteps=100
model.predict(np.random.rand(5, 100, n_feats))  # this also works
However, depending on the specific problem you are working on, this may not be appropriate; though I can't think of a specific example right now where this behavior would not be suitable, in such cases you would need to make sure all the time series have the same length.
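If you do need equal lengths within a batch (the questioner's option 1), a hedged sketch of padding plus masking might look like this (the mask value of 0.0 and the LSTM size are assumptions, and 0.0 must not occur as real data):
import numpy as np
from keras import models, layers
from keras.preprocessing.sequence import pad_sequences

n_feats = 3
series_a = np.random.rand(10, n_feats)  # e.g. d1..d10
series_b = np.random.rand(6, n_feats)   # e.g. d15..d20

# pad the shorter series with zeros at the end -> shape (2, 10, n_feats)
batch = pad_sequences([series_a, series_b], padding='post',
                      dtype='float32', value=0.0)

model = models.Sequential([
    layers.Masking(mask_value=0.0, input_shape=(None, n_feats)),  # skip padded steps
    layers.LSTM(16),
])
model.predict(batch)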