How do you build a CNN autoencoder? - machine-learning

I have many images of documents and I would like to cluster them to create categories (invoices, receipts, etc.). I would like to explore the image-based approach (I know I could use the text instead), so I decided to build a CNN autoencoder to compress the images into a lower-dimensional space and then run a clustering algorithm such as DBSCAN.
My issue is that I have no idea how to select the network layers, the activation functions, etc. This is my current model; what do you think?
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, Conv2DTranspose, LeakyReLU,
                                     Flatten, Dense, Reshape, Activation)
from tensorflow.keras.regularizers import l1, l2

model = Sequential()
# Encoder: two strided convolutions, each halving the spatial dimensions.
# image_rgb_dims_top is defined elsewhere; (40, 152, 1) per the summary below.
model.add(Conv2D(16, (3, 3), strides=2, padding='same', kernel_regularizer=l2(),
                 input_shape=image_rgb_dims_top))
model.add(LeakyReLU(alpha=0.2))
# model.add(AveragePooling2D(pool_size=(2, 2), padding='same'))
model.add(Conv2D(32, (3, 3), strides=2, padding='same', kernel_regularizer=l2()))
model.add(LeakyReLU(alpha=0.2))
# model.add(AveragePooling2D(pool_size=(2, 2), padding='same'))
# Bottleneck: flatten and compress to a 96-dimensional code.
model.add(Flatten())
model.add(Dense(96, activity_regularizer=l1(10e-6)))  # note: 10e-6 == 1e-5
# Decoder: expand back to the pre-flatten shape, then upsample with
# transposed convolutions.
model.add(Dense(np.prod(model.layers[-2].output_shape[1:]), activation='relu'))
model.add(Reshape(model.layers[-4].output_shape[1:]))
model.add(Conv2DTranspose(32, (3, 3), strides=(2, 2), padding='same'))
model.add(LeakyReLU(alpha=0.2))
# model.add(UpSampling2D((2, 2)))
model.add(Conv2DTranspose(16, (3, 3), strides=(2, 2), padding='same'))
model.add(LeakyReLU(alpha=0.2))
# model.add(UpSampling2D((2, 2)))
model.add(Conv2D(1, (3, 3), padding='same'))
model.add(Activation('sigmoid'))
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 20, 76, 16) 160
_________________________________________________________________
leaky_re_lu (LeakyReLU) (None, 20, 76, 16) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 10, 38, 32) 4640
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 10, 38, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 12160) 0
_________________________________________________________________
dense (Dense) (None, 96) 1167456
_________________________________________________________________
dense_1 (Dense) (None, 12160) 1179520
_________________________________________________________________
reshape (Reshape) (None, 10, 38, 32) 0
_________________________________________________________________
conv2d_transpose (Conv2DTran (None, 20, 76, 32) 9248
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 20, 76, 32) 0
_________________________________________________________________
conv2d_transpose_1 (Conv2DTr (None, 40, 152, 16) 4624
_________________________________________________________________
leaky_re_lu_3 (LeakyReLU) (None, 40, 152, 16) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 40, 152, 1) 145
_________________________________________________________________
activation (Activation) (None, 40, 152, 1) 0
=================================================================
Total params: 2,365,793
Trainable params: 2,365,793
Non-trainable params: 0
_________________________________________________________________
I use MSE loss and the Adam optimizer.
The problems I have encountered are:
The model overfits to the images most present in the dataset, so it creates many categories for the same type of document even when there is very little difference between them (just a small logo added, and it considers it a new cluster).
The images that are less present are not learned well enough: I get a very blurry output, and most of them are clustered as -1 (noise) by DBSCAN.
Any idea how to make the model more effective? I don't want it to overfit, yet it's underfitting some images.
What are good layers/activation functions/regularizers to use? Should I increase the compressed representation size or decrease it? It's very difficult to benchmark the effect of changes in the network; all I can do is run the DBSCAN clustering and look at the output classes, but that still depends on the DBSCAN epsilon parameter, so I can't tell whether the model did well or not.

There are many things that you could try, and it is hard to know whether something will be effective before actually trying it.
I will first address the problems you explicitly stated.
"The model overfits to the images most present in the dataset"
You may try using a bigger dataset or, if you can't, a smaller model might work.
You may try using a pretrained model and running transfer learning on it.
You may try early stopping (see the sketch below).
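For early stopping specifically, Keras ships a ready-made callback. A minimal sketch, where the patience value is a placeholder:
from tensorflow.keras.callbacks import EarlyStopping

# x_train / x_val stand for your image arrays; for an autoencoder the
# targets are the inputs themselves.
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(x_train, x_train, validation_data=(x_val, x_val),
          epochs=100, callbacks=[early_stop])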
"The images that are less present are not learned enough"
You may try using a dataset that is not biased towards particular classes.
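If you have even a rough grouping of the images (say, from a first clustering pass), one option is to oversample the underrepresented group so the autoencoder sees it more often. A sketch using scikit-learn; the path lists are hypothetical, since the question provides no labels:
from sklearn.utils import resample

# Hypothetical rough groups of image file paths.
rare_paths = ['receipt_001.png', 'receipt_002.png']
common_paths = ['invoice_%03d.png' % i for i in range(100)]

# Upsample the rare group to the size of the common one.
rare_upsampled = resample(rare_paths, replace=True,
                          n_samples=len(common_paths), random_state=42)
balanced_paths = common_paths + rare_upsampled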
"What are good layers/activation functions/regluarizers to use ?"
For activation function, ReLU and its variants mostly work well.
There are various layers and combination of them that you can use. Why don't you try using modern SOTA(State of the art) CNN network's architecture as a reference? (You can easily find some of them on Here )
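As a concrete starting point for the pretrained route, you could even skip training an autoencoder and cluster the embeddings of a frozen ImageNet model. A sketch, where the model choice, input size, and DBSCAN parameters are all placeholders:
import numpy as np
import tensorflow as tf
from sklearn.cluster import DBSCAN

# Frozen pretrained encoder; pooling='avg' yields one vector per image.
encoder = tf.keras.applications.MobileNetV2(
    include_top=False, weights='imagenet', pooling='avg',
    input_shape=(224, 224, 3))

images = np.random.rand(8, 224, 224, 3).astype('float32') * 255.0  # stand-in for your documents
features = encoder.predict(
    tf.keras.applications.mobilenet_v2.preprocess_input(images))
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)  # -1 marks outliers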
"This is my current model, what do you think ?"
At least, the architecture seems old. If it works well, it would be fine. However, if needed, try using the modern SOTA architectures as mentioned earlier.
"Should I increase the compressed representation size or decrease it ?"
It is unclear. You should try both and pick the better performing method.
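One way to make that comparison concrete without going through DBSCAN is to compare validation reconstruction error across bottleneck sizes. A sketch, assuming a hypothetical build_autoencoder(bottleneck) helper that builds the model above with a given Dense width:
results = {}
for size in (48, 96, 192):                       # candidate code sizes (placeholders)
    model = build_autoencoder(bottleneck=size)   # hypothetical builder function
    # x_train / x_val: your image arrays (hypothetical here)
    history = model.fit(x_train, x_train,
                        validation_data=(x_val, x_val),
                        epochs=30, batch_size=32, verbose=0)
    results[size] = min(history.history['val_loss'])
best_size = min(results, key=results.get)        # lowest validation MSE wins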
You could also try different training methods!
Hand-labeling all of the images would be a complete nightmare, so you might try labeling some of them and running semi-supervised learning, e.g. with SimCLR.
Alternatively, you could search for research on document image classification and use it as a reference.
Hope this helps!

Related

How can I create a Deep Learning Model for 2D input and output?

I am working on a deep learning project. I have an input array with shape (101, 3) and an output array with shape (101, 3), meaning each row in the input data corresponds to the same row in the output data. My goal is to create a deep learning model to train on my dataset. I did some research and found a few examples. As I understand it, I need a many-to-many model, but I don't know how to create one. Can you help me? How can I create this model, or are there any resources you can suggest?
You can use something like the following, which you may have to adjust based on network performance:
from tensorflow.keras.layers import TimeDistributed, Dense, LSTM
from tensorflow.keras.models import Sequential

model = Sequential()
# encoder layer
model.add(LSTM(100, activation='relu', return_sequences=True, input_shape=(101, 3)))
# decoder layer
model.add(LSTM(100, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(3)))
model.compile(optimizer='adam', loss='mse')
print(model.summary())
You could use something like the following; a basic Keras tutorial is a good place to start. You can experiment by removing some of the layers below or adding more, and see how your results change.
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(3,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(3)
])
The model architecture is as follows:
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_6 (Dense) (None, 128) 512
_________________________________________________________________
dense_7 (Dense) (None, 64) 8256
_________________________________________________________________
dense_8 (Dense) (None, 32) 2080
_________________________________________________________________
dense_9 (Dense) (None, 3) 99
=================================================================
Total params: 10,947
Trainable params: 10,947
Non-trainable params: 0
_________________________________________________________________

How to add noise to activations in Keras at inference time?

GaussianNoise in Keras seems to add noise only at training time. I need to add noise to activations at test time. My architecture is a ResNet50 pretrained on ImageNet with all layers frozen, except that Gaussian noise needs to be added to the last activation layer, prior to the FC layer.
How can this be done? The GaussianNoise layer I added at the end (see below) has no effect, since the documentation says it is only active during training. What is the alternative to this layer at test time?
__________________________________________________________________________________________________
bn5c_branch2c (BatchNormalizati (None, 7, 7, 2048) 8192 res5c_branch2c[0][0]
__________________________________________________________________________________________________
add_80 (Add) (None, 7, 7, 2048) 0 bn5c_branch2c[0][0]
activation_242[0][0]
__________________________________________________________________________________________________
activation_245 (Activation) (None, 7, 7, 2048) 0 add_80[0][0]
__________________________________________________________________________________________________
gaussian_noise_1519 (GaussianNo (None, 7, 7, 2048) 0 activation_245[0][0]
__________________________________________________________________________________________________
avg_pool (GlobalAveragePooling2 (None, 2048) 0 gaussian_noise_1519[0][0]
__________________________________________________________________________________________________
fc1000 (Dense) (None, 1000) 2049000 avg_pool[19][0]
==================================================================================================
Total params: 25,636,712
Trainable params: 0
Non-trainable params: 25,636,712
You can keep layers that behave differently in the test phase (e.g. Dropout, GaussianNoise) active by passing the training=True argument when calling them on a tensor:
out = SomeLayer(**configs)(inp, training=True)
With that, SomeLayer will be active in both the training and test phases.
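Applied to the ResNet50 setup in the question, that looks roughly like the sketch below; the noise stddev and input size are placeholder assumptions:
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GaussianNoise, GlobalAveragePooling2D, Dense

base = ResNet50(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
base.trainable = False                      # all layers frozen, as in the question

inp = tf.keras.Input(shape=(224, 224, 3))
x = base(inp, training=False)               # keep BatchNorm in inference mode
x = GaussianNoise(0.1)(x, training=True)    # training=True keeps the noise on at test time
x = GlobalAveragePooling2D()(x)
out = Dense(1000, activation='softmax')(x)
model = tf.keras.Model(inp, out)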

LSTM/GRU autoencoder convergency

Goal
I am trying to run an LSTM autoencoder over a dataset of multivariate time series:
X_train (200, 23, 178) - X_val (100, 23, 178) - X_test (100, 23, 178)
Current situation
A plain autoencoder gets better results than a simple LSTM AE architecture.
I have some doubts about how I use the RepeatVector wrapper layer which, as far as I understand, is supposed to simply repeat the last state of the LSTM/GRU cell a number of times equal to the sequence length, in order to feed the input shape of the decoder layer.
The model architecture does not raise any error, but the results are still an order of magnitude worse than those of the simple AE, while I was expecting them to be at least as good, since I am using an architecture that should better fit the temporal problem.
Are these results comparable, first of all?
Nevertheless, the reconstruction error of the LSTM-AE does not look good at all.
My AE model:
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 178) 31862
_________________________________________________________________
batch_normalization (BatchNo (None, 178) 712
_________________________________________________________________
dense_1 (Dense) (None, 59) 10561
_________________________________________________________________
dense_2 (Dense) (None, 178) 10680
=================================================================
optimizer: sgd
loss: mse
activation function of the dense layers: relu
My LSTM/GRU AE:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 23, 178) 0
_________________________________________________________________
gru (GRU) (None, 59) 42126
_________________________________________________________________
repeat_vector (RepeatVector) (None, 23, 59) 0
_________________________________________________________________
gru_1 (GRU) (None, 23, 178) 127092
_________________________________________________________________
time_distributed (TimeDistri (None, 23, 178) 31862
=================================================================
optimizer: sgd
loss: mse
activation function of the gru layers: relu
The two models you have above do not seem to be comparable in a meaningful way. The first model is attempting to compress your vectors of 178 values. It is quite possible that these vectors contain some redundant information, so it is reasonable to assume that you will be able to compress them.
The second model is attempting to compress a sequence of 23 x 178 vectors via a single GRU layer. This is a task with a significantly higher number of parameters. The RepeatVector simply takes the output of the first GRU layer (the encoder) and makes it the input of the second GRU layer (the decoder). But then you take a single value of the decoder. Instead of the TimeDistributed layer, I'd recommend that you use return_sequences=True in the second GRU (the decoder). Otherwise you are saying that you expect the 23 x 178 sequence to consist of elements that all have the same value; that has to lead to a very high error / no solution.
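A sketch of the decoder wired that way, using the (23, 178) shapes and the 59-unit encoder from the question; the optimizer choice is a placeholder:
from tensorflow.keras.layers import Input, GRU, RepeatVector
from tensorflow.keras.models import Model

inp = Input(shape=(23, 178))
encoded = GRU(59)(inp)                            # encoder keeps only the last state
repeated = RepeatVector(23)(encoded)              # repeat it for every time step
out = GRU(178, return_sequences=True)(repeated)   # decoder emits the full sequence
model = Model(inp, out)
model.compile(optimizer='adam', loss='mse')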
I'd recommend you take a step back. Is your goal to find similarity between the sequences, or to be able to make predictions? An autoencoder approach is preferable for a similarity task. To make predictions, I'd recommend that you move towards an approach where you apply a Dense(1) layer to the output of the sequence steps.
Is your dataset open/available? I'd be curious to take it for a spin, if possible.

How to generate more than 1 output per input in LSTM?

Assume this is my model:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_16 (Embedding) (None, 10, 500) 71500
_________________________________________________________________
lstm_31 (LSTM) (None, 10, 500) 2002000
_________________________________________________________________
dropout_15 (Dropout) (None, 10, 500) 0
_________________________________________________________________
time_distributed_16 (None, 10, 500) 250500
_________________________________________________________________
softmax (Activation) (None, 10, 500) 0
=================================================================
But I want to have in my last layer:
softmax (Activation) (None, 100, 1000) 0
I have been trying to do this for hours. I don't know whether it is possible. I don't think you can change the output size of an LSTM (looking at its model), but is there a layer I can add so that it generates, say, 10 outputs per input?
In simple terms, assume I want my model to generate 10 words for each word I put in. I hope I was able to explain.
There are different ways of looking at the "multiple outputs" here (and by "here" I take a guess that you are using the Keras library; it seems so from the printout).
In a simple case, having e.g. a Dense(10) layer would solve it. The "secret sauce" is the TimeDistributed layer wrapper (see the sketch below).
The other approach requires using the functional API of Keras; how to get multiple outputs is explained in the Keras documentation.
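For the exact shapes in the question, one concrete reading of the Dense-plus-TimeDistributed idea is to emit 10 x 1000 values per input step and then reshape. A sketch, where the vocabulary size of 143 is inferred from the Embedding parameter count and the final Reshape is my own addition:
from tensorflow.keras.layers import (Embedding, LSTM, Dropout,
                                     TimeDistributed, Dense, Reshape, Activation)
from tensorflow.keras.models import Sequential

model = Sequential([
    Embedding(input_dim=143, output_dim=500, input_length=10),
    LSTM(500, return_sequences=True),
    Dropout(0.5),
    TimeDistributed(Dense(10 * 1000)),  # 10 vectors of size 1000 per step
    Reshape((100, 1000)),               # (None, 10, 10000) -> (None, 100, 1000)
    Activation('softmax'),              # softmax over the last axis
])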

input_shape of a 2D convolutional layer in Keras

In the Keras documentation for Convolution2D, the input_shape for 128x128 RGB pictures is given as input_shape=(3, 128, 128), so I figured the first component should be the number of planes (or feature layers).
If I run the following code:
from keras.models import Sequential
from keras.layers import Convolution2D

model = Sequential()
model.add(Convolution2D(4, 5, 5, border_mode='same', input_shape=(3, 19, 19), activation='relu'))
print(model.output_shape)
I get an output_shape of (None, 3, 19, 4), whereas in my understanding this should be (None, 4, 19, 19), with 4 being the number of filters.
Is this an error in the example from the keras documentation or am I missing something?
(I am trying to recreate a part of AlphaGo, so the 19x19 is the board size, which corresponds to the image size.)
You are using the Theano dimension ordering (channels, rows, cols) as input, but your Keras installation seems to use the TensorFlow one, which is (rows, cols, channels).
You can either switch to the Theano dimension ordering directly in your code:
import keras.backend as K
K.set_image_dim_ordering('th')
or edit the keras.json file (usually in ~/.keras) and switch
"image_dim_ordering": "tf" to "image_dim_ordering": "th"
Or you can keep the TensorFlow dimension ordering and switch your input_shape to (19, 19, 3).
Yes, it should be (None, 4, 19, 19). There is a setting called dim_ordering in Keras that decides at which index the number of input channels is placed. Check the "dim_ordering" parameter in the documentation; mine is set to 'tf'.
So just change the input shape to (19, 19, 3), like so:
model.add(Convolution2D(4, 5, 5, border_mode='same', input_shape=(19, 19, 3), activation='relu'))
Then check the output shape.
You can also modify dim_ordering in the keras.json file (usually at ~/.keras/keras.json) to your liking.
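To confirm which ordering your installation uses before changing anything, the old backend API has a getter; a quick sketch, assuming the Keras 1.x-era API:
import keras.backend as K

print(K.image_dim_ordering())   # prints 'tf' (channels last) or 'th' (channels first)
K.set_image_dim_ordering('th')  # switch to Theano-style (channels, rows, cols)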
