I'm trying to stack vgg16 and lstm layers. Any ideas how to it make it work?
Pss, LSTM requires time distributed over all other layers, which don't return sequences.
Thanks for considering my problem
Related
my CNN network
Above is my config of the network.
l am training a CNN network on picture size of 192*192.
my target is a classification network of 11 kinds.
However, the loss and the accuracy on testing dataset appears to be very unstable. l have to run 15+ epochs to get a stable accuracy and loss. The maximum accuracy is only 50%.
What can l do to improve the performance?
I would recommend you to first refer to models which are widely known like VGG-16, LeNET or VGG-19 and check out the way how the conv2D and max-pooling layers are placed.
Start with a very basic model without any batch normalization and Leaky ReLU layers. You just keep the conv2D and max pooling layers and train your model for a few epochs.
Next, try other activations like ReLU to TanH. Try Changing the max pooling to average pooling.
If you are solving a classification problem then use the softmax layer at the end. Also, introduce Dense layer(s) after flattening.
Your dataset should be large and also the target should be one-hot encoded if you wish to use the softmax layer.
I'm using the transformers library of HuggingFace. As far as I know changing the number of hidden layers in the config file leads to loading the first x layers of the pre-trained BERT. I want to load the even layers (or the last x layers) of the pre-trained BERT and then fine-tune them for a classification task.
An example for classification tasks can be found here : run_glue.py
Thanks in advance
I can't seem to find a concrete answer to the question. I am currently doing transfer learning from a VGG19 network, and my target domain is document classification (either solely by visual classification or using CNN's feature extraction for another model).
I want to understand in which cases is it desirable to keep all fully connected layers of the model, and in which cases should I remove the fully connected layers and make a new fully-connected layer on top of the last convolutional layer. What does each of these choices imply for the training, predictions, etc. ?
These are code examples using Keras of what I mean:
Extracting the last fully connected layer:
original_model = VGG19(include_top=True, weights='imagenet', input_shape=(224, 224, 3))
layer_name = 'fc2'
x = Dropout(0.5)(original_model.get_layer(layer_name).output)
x = BatchNormalization()(x)
predictions = Dense(num_classes, activation='softmax')(x)
features_model = Model(inputs=original_model.input, outputs=predictions)
adam = optimizers.Adam(lr=0.001)
features_model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
features_model.summary()
return features_model
Adding one fully connected layer after the last convolutional layer:
original_model = VGG19(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
x = Flatten()(base_model.output)
x = Dense(4096, activation='relu')(x)
x = Dropout(0.5)(x)
x = BatchNormalization()(x)
predictions = Dense(num_classes, activation='softmax')(x)
head_model = Model(input=base_model.input, output=predictions)
adam = optimizers.Adam(lr=0.001)
head_model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
head_model.summary()
return head_model
Is there a rule of thumb for what to choose when doing transfer-learning?
According to my past experience, applying transfer learning from stock market to business forecast successfully, you should keep original structure, because if you are doing transfer learning, you will want to load weights trained from original structure, without issues regarding differences in neural net architecture. Then you unfreeze parts of the CNN and your neural net training will start training from a high accuracy and adapt weights for the target problem.
However, if you remove a Flatten layer, computational cost will decrease as you will have fewer parameters to train.
I follow the rule of keeping neural nets as simple as possible (equals bigger generalization properties), with high efficiency.
#Kamen, as a complement to your comment, regarding how much data you will need, it depends on the variance of your data. More variance, you will need more layers and weights to learn the details. However, when you increase complexity in the architecture, your neural net will be more prone to overfit, than can be decreased using Dropout, for instance.
As fully connected layers are the more expensive part of a neural net, if you add one or two of them your parameter number will increase a lot, demanding more time to train. With more layers you will get a higher accuracy, but you may have overfit.
For instance, MNIST with 10,000 examples can reach an accuracy bigger than 99% with a quite simple architecture. However, IMAGENET has 1,000,000 examples (155 GB) and then demands a more complex structure, like VGG16.
I am confused between how to correctly use dropout with RNN in keras, specifically with GRU units. The keras documentation refers to this paper (https://arxiv.org/abs/1512.05287) and I understand that same dropout mask should be used for all time-steps. This is achieved by dropout argument while specifying the GRU layer itself. What I don't understand is:
Why there are several examples over the internet including keras own example (https://github.com/keras-team/keras/blob/master/examples/imdb_bidirectional_lstm.py) and "Trigger word detection" assignment in Andrew Ng's Coursera Seq. Models course, where they add a dropout layer explicitly "model.add(Dropout(0.5))" which, in my understanding, will add a different mask to every time-step.
The paper mentioned above suggests that doing this is inappropriate and we might lose the signal as well as long-term memory due to the accumulation of this dropout noise over all the time-steps.
But then, how are these models (using different dropout masks at every time-step) are able to learn and perform well.
I myself have trained a model which uses different dropout masks at every time-step, and although I haven't gotten results as I wanted, the model is able to overfit the training data. This, in my understanding, invalidates the "accumulation of noise" and "signal getting lost" over all the time-steps (I have 1000 time-step series being input to the GRU layers).
Any insights, explanations or experience with the situation will be helpful. Thanks.
UPDATE:
To make it more clear I'll mention an extract from keras documentation of Dropout Layer ("noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features").
So, I believe, it can be seen that when using Dropout layer explicitly and needing the same mask at every time-step (as mentioned in the paper), we need to edit this noise_shape argument which is not done in the examples I linked earlier.
As Asterisk explained in his comment, there is a fundamental difference between dropout within a recurrent unit and dropout after the unit's output. This is the architecture from the keras tutorial you linked in your question:
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
You're adding a dropout layer after the LSTM finished its computation, meaning that there won't be any more recurrent passes in that unit. Imagine this dropout layer as teaching the network not to rely on the output for a specific feature of a specific time step, but to generalize over information in different features and time steps. Dropout here is no different to feed-forward architectures.
What Gal & Ghahramani propose in their paper (which you linked in the question) is dropout within the recurrent unit. There, you're dropping input information between the time steps of a sequence. I found this blogpost to be very helpful to understand the paper and how it relates to the keras implementation.
I am training MNIST on 8 layers (1568-784-512-256-128-64-32-10) fully-connected deep neural network with the newly created activation function as shown in the figure below.This function looks a bit similar to the ReLU, however, it gives a litter curve at the "kink".
It was working fine when I used it to train 5 layers, 6 layers and 7 layers fully-connected neural networks. The problem arises when I use it in 8 layers fully-connected neural networks. Where it will only learn at the 1st few epochs then stop learning (Test Loss gives "nan" and Test accuracy drop to 9.8%). Why does this happen?
My other configurations are as follow: Dropout=0.5, Weight initialization= Xavier initialization, Learning rate=0.1
I believe this is called Gradient vanishing problem which usually occurs in deep network. There is no hard and fast rule for solving it. My advice would be to reshape your network architecture
See here [Avoiding vanishing gradient in deep neural networks