Can the keras.backend.switch() function be used in the layers of a model?
I have a situation in which at certain layers, there will be branching into 3 different layers. But the forward and backward operations need to only pass through one of the 3 branches at this layer.
I was thinking of using K.switch at these layers. Will it work out? Will the backpropagation work when K.switch() is used in the model layers?
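For illustration, here is a minimal sketch of the kind of construction being asked about, selecting between branch outputs with K.switch inside a Lambda layer (the layer sizes and the condition are made up for the example; whether gradients really flow through only the chosen branch depends on how the condition is built):
import tensorflow as tf
from tensorflow.keras import backend as K

inputs = tf.keras.Input(shape=(16,))
branch_a = tf.keras.layers.Dense(8, activation='relu')(inputs)
branch_b = tf.keras.layers.Dense(8, activation='tanh')(inputs)

# Pick one branch based on a scalar condition computed from the input batch.
selected = tf.keras.layers.Lambda(
    lambda t: K.switch(K.mean(t[0]) > 0, t[1], t[2])
)([inputs, branch_a, branch_b])

outputs = tf.keras.layers.Dense(1)(selected)
model = tf.keras.Model(inputs, outputs)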
I am trying to use the pretrained BERT model from TensorFlow, which has roughly 110 million params, and it is near impossible to train all of them on my GPU. Freezing an entire layer makes all of its params untrainable.
Is it possible to make a layer partially trainable, i.e. have a couple million params trainable and the rest untrainable?
input_ids_layer = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name='input_ids')
input_attention_layer = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name='attention_mask')
model = TFAutoModel.from_pretrained("bert-base-uncased")
for layer in model.layers:
    for i in range(len(layer.weights)):
        # assuming there are 199 weights
        if i > 150:
            layer.weights[i]._trainable = True
        else:
            layer.weights[i]._trainable = False
I don't know about training only some weights inside a layer, but I would still suggest the "standard way": freezing layers is what is usually done in these cases to avoid retraining everything. However, you must not freeze all the layers, since that would be useless. What you want to do is freeze everything except the last few layers, and then train the neural network.
This works because the first layers usually learn very abstract features and are therefore transferable across many problems. The last layers, on the other hand, learn the features that actually solve the task at hand, based on the current dataset.
Therefore, if you want to re-train a pretrained model on another dataset, you only need to retrain the last few layers. You can also edit the last layers of the neural network by adding some Dense layers and changing the output of the last layer, which is useful if, for example, the number of classes to predict differs from the original dataset. There are a lot of short and easy tutorials online that you can follow to do that.
To summarize:
Freeze all the layers except the last few
(optional) Create new layers and link them to the output of the second-to-last layer
Train the network (a rough sketch follows below)
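A minimal Keras sketch of that recipe, assuming a generic pretrained base (ResNet50 here just as a stand-in) and a hypothetical num_classes; it is not specific to the BERT model in the question:
import tensorflow as tf

# Stand-in pretrained base; any Keras model with its original head removed works.
base = tf.keras.applications.ResNet50(include_top=False, pooling='avg', weights='imagenet')

# Freeze everything except the last few layers.
for layer in base.layers[:-10]:
    layer.trainable = False

num_classes = 5  # hypothetical, not from the question

# (optional) new layers linked to the output of the base
x = tf.keras.layers.Dense(256, activation='relu')(base.output)
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)

model = tf.keras.Model(base.input, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(train_data, ...)  # train the network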
I want to train an Inceptionv3 model where I give it 3 different views of a single image. So I want to feed three images as my input in a single pass.
Use case:
I want to predict the type of footwear. In this problem a lot of information is spread across the different views, so I just want to try this approach.
The easy way would be to input all 3 images separately into the Inceptionv3 model, and then perform some weighted decision on all 3 outputs together.
A better approach would be to use the Inceptionv3 model as each of 3 input branches, then take the embedding layer of each branch (the layer before last) and combine them all with one fully connected classification layer (with softmax activation). The 3 branches can be trained either view-specific or together with shared weights (with such a big model, shared weights will work fine), as sketched below.
By the way, for a shoe type classification task I would suggest using a simpler model (Inceptionv3 is overkill).
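A rough sketch of the shared-weight version, with hypothetical input sizes and class count (one Inceptionv3 backbone applied to each view, embeddings concatenated into a single softmax classifier):
import tensorflow as tf

# One shared backbone: applying the same model object to each view shares weights.
backbone = tf.keras.applications.InceptionV3(include_top=False, pooling='avg', weights='imagenet')

num_classes = 4  # hypothetical number of footwear types

views = [tf.keras.Input(shape=(299, 299, 3), name=f'view_{i}') for i in range(3)]
embeddings = [backbone(v) for v in views]
merged = tf.keras.layers.Concatenate()(embeddings)
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(merged)

model = tf.keras.Model(views, outputs)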
I think you have different ways of acting:
Remove the first layer of Inception and create your own that supports the 3x3 input dimensions (i.e. the three RGB views stacked along the channel axis); a sketch of this option follows below.
Use the first Inception blocks for each input, then concatenate them in some FC layer (or before). If the features to extract are similar you can use shared parameters.
The first case will merge all dimensions and diffuse the information provided by each image.
The second one will extract specific features from each image.
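A rough sketch of the first option, assuming the three RGB views are stacked channel-wise into a 9-channel input; with a non-standard channel count the backbone has to start from random weights, since the ImageNet weights expect a 3-channel first layer:
import tensorflow as tf

# Three RGB views stacked along the channel axis: (299, 299, 9) input.
backbone = tf.keras.applications.InceptionV3(include_top=False, weights=None, pooling='avg', input_shape=(299, 299, 9))

num_classes = 4  # hypothetical
stacked = tf.keras.Input(shape=(299, 299, 9))
features = backbone(stacked)
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(features)
model = tf.keras.Model(stacked, outputs)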
I searched for the reason a lot but did not find a clear explanation. Could someone explain it in some more detail, please?
In theory you do not have to attach a fully connected layer, you could have a full stack of convolutions till the very end, as long as (due to custom sizes/paddings) you end up with the correct number of output neurons (usually number of classes).
So why do people usually not do that? If one goes through the math, it becomes visible that each output neuron (thus - the prediction w.r.t. some class) depends only on a subset of the input dimensions (pixels). This would be something along the lines of a model which decides whether an image belongs to class 1 depending only on the first few "columns" (or, depending on the architecture, rows, or some patch of the image), then whether this is class 2 based on a few next columns (maybe overlapping), ..., and finally some class K depending on a few last columns. Data usually does not have this characteristic: you cannot classify an image of a cat based on the first few columns while ignoring the rest.
However, if you introduce fully connected layer, you provide your model with ability to mix signals, since every single neuron has a connection to every single one in the next layer, now there is a flow of information between each input dimension (pixel location) and each output class, thus the decision is based truly on the whole image.
So intuitively you can think about these operations in terms of information flow. Convolutions and pooling are local operations. Fully connected layers are global (they can introduce any kind of dependence). This is also why convolutions work so well in domains like image analysis - due to their local nature they are much easier to train, even though mathematically they are just a subset of what fully connected layers can represent.
Note:
I am considering here the typical use of CNNs, where kernels are small. In general one can even think of an MLP as a CNN where the kernel is the size of the whole input, with specific spacing/padding. However, these are just corner cases which are not really encountered in practice and do not really affect the reasoning, since then they end up being MLPs anyway. The whole point here is simple - to introduce global relations; if one can do it by using CNNs in a specific manner, then MLPs are not needed. MLPs are just one way of introducing this dependence.
Every fully connected (FC) layer has an equivalent convolutional layer (but not vice versa). Hence it is not necessary to add FC layers. They can always be replaced by convolutional layers (+ reshaping). See details.
Why do we use FC layers then?
Because (1) we are used to it and (2) it is simpler. (1) is probably the reason for (2). For example, you would need to adjust the loss function / the shape of the labels / add a reshape at the end if you used a convolutional layer instead of an FC layer.
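A small Keras sketch of that equivalence, with an illustrative 7x7x512 feature map and 10 classes: a Dense layer on the flattened map and a Conv2D whose kernel covers the whole map have exactly the same number of parameters and can represent the same function.
import tensorflow as tf

features = tf.keras.Input(shape=(7, 7, 512))  # illustrative feature map size

# FC head: flatten, then Dense
fc_logits = tf.keras.layers.Dense(10)(tf.keras.layers.Flatten()(features))

# Equivalent convolutional head: a single 7x7 'valid' convolution with 10 filters
conv_logits = tf.keras.layers.Conv2D(10, kernel_size=7, padding='valid')(features)
conv_logits = tf.keras.layers.Reshape((10,))(conv_logits)  # (1, 1, 10) -> (10,)

model = tf.keras.Model(features, [fc_logits, conv_logits])
model.summary()  # both heads have 7*7*512*10 + 10 parameters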
I found this answer by Anil-Sharma on Quora helpful.
We can divide the whole network (for classification) into two parts:
Feature extraction:
In the conventional classification algorithms, like SVMs, we used to extract features from the data to make the classification work. The convolutional layers are serving the same purpose of feature extraction. CNNs capture better representation of data and hence we don’t need to do feature engineering.
Classification:
After feature extraction we need to classify the data into various classes, this can be done using a fully connected (FC) neural network. In place of fully connected layers, we can also use a conventional classifier like SVM. But we generally end up adding FC layers to make the model end-to-end trainable.
The CNN gives you a representation of the input image. To learn the sample classes, you should use a classifier (such as logistic regression, an SVM, etc.) that learns the relationship between the learned features and the sample classes. A fully-connected layer is itself a linear classifier, much like logistic regression, and it is used for this reason.
Convolution and pooling layers extract features from the image, so these layers do some "preprocessing" of the data. Fully connected layers then perform classification based on these extracted features.
I have a Multi Task Network with two similar branches and a pre-trained network with only one branch (which is also the same).
I want to initialize the weights of the layers in the two branches(in my multi task network) with the weights of the layers in my pre-trained network.
Now, I can initialize one of the branches correctly by using the same names for its layers as in the pre-trained network.
But, I have to keep the names of the layers in the other branch different, and thus those layers won't take the pre-trained weights.
Also, I don't want to share the weights in the two branches. So, giving the same name to the weights in the corresponding layers in the two branches won't work.
Is there a nice way/hack to do this?
PS: I would want to avoid Network Surgery, but any comments, explaining a nice way to do it, are also welcome.
Clarification : I just want to initialize the two branches with the same weights. They can learn different weights during the training phase, since they are governed by different loss layers.
The answer by Przemak D is a nice hack to do the above:
give different names to the layers in the two branches and enable weight sharing
initialize the network and train for 1-2 iterations
then train the original network (without weight sharing), initializing the weights with the caffemodel obtained as a result of step 2
The above is a nice hack, but net surgery is a better way to do this; a sketch follows below.
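For completeness, a rough pycaffe net-surgery sketch; the prototxt/caffemodel paths and the layer-name mapping below are entirely hypothetical, just to show the idea of copying the pretrained single-branch weights into both branches:
import caffe

pretrained = caffe.Net('single_branch.prototxt', 'single_branch.caffemodel', caffe.TEST)
multitask = caffe.Net('multi_task.prototxt', caffe.TEST)

# e.g. 'conv1' in the pretrained net feeds both 'conv1_a' and 'conv1_b'
name_map = {'conv1': ['conv1_a', 'conv1_b'],
            'conv2': ['conv2_a', 'conv2_b']}

for src, dsts in name_map.items():
    for dst in dsts:
        for k in range(len(pretrained.params[src])):  # 0: weights, 1: biases
            multitask.params[dst][k].data[...] = pretrained.params[src][k].data

multitask.save('multi_task_init.caffemodel')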
Are there known methods of continuous training and graceful degradation of a neural net while it shrinks or grows in size (by number of nodes, connections, whatever)?
To the best of my memory, everything I've read about neural networks is from a static perspective. You define the net and then train it.
If there is some neural network X with N nodes (neurons, whatever), is it possible to train the network (X) so that while N increases or decreases, the network is still useful and capable of performing?
In general, changing the network architecture (adding new layers, adding more neurons to existing layers) once the network has already been trained makes sense and is a rather common operation in the Deep Learning domain. One example is dropout - during training, half of the neurons are randomly switched off completely and only the remaining half participates in training during a specific iteration (each iteration, or 'epoch' as it is often named, has a different random list of switched-off neurons). Another example is transfer learning - where you train the network on one set of input data, cut off part of the output layers, replace them with new layers, and re-train the model on another dataset.
To better explain why this makes sense, let's step back for a moment. In deep networks, where you have lots of hidden layers, each layer learns some abstraction from the incoming data. Each additional layer uses the abstract representations learned by the previous layer and builds upon them, combining such abstractions to form a higher-level representation of the data. For instance, you could be trying to classify images with a DNN. The first layer will learn rather simple concepts from the images - like edges or points in the data. The next layer could combine these simple concepts to learn primitives - like triangles, circles or squares. The next layer could take it further and combine these primitives to represent objects which you could find in images, like 'a car' or 'a house', and using softmax it calculates the probabilities of the answer you are looking for (what to actually output). I need to mention that these learned representations can actually be checked: you can visualize the activations of your hidden layers and see what they learned. For example, this was done with Google's 'Inceptionism' project. With that in mind, let's get back to what I mentioned earlier.
Dropout is used to improve the generalization of the network. It forces each neuron to 'not be so sure' that some pieces of information from the previous layer will be available, and makes it try to learn representations relying on less favorable and less informative pieces of abstraction from the previous layer. It forces it to consider all of the representations from the previous layer when making decisions, instead of putting all of its weight into the couple of neurons it 'likes most of all'. By doing this the network is usually better prepared for new data where the input differs from the training set.
Q: "As far as you're aware is the quality of the stored knowledge (whatever training has done to the net) still usable following the dropout? Maybe random halves could be substituted by random 10ths with a single 10th dropping, that might result in less knowledge loss during the transition period."
A: Unfortunately I can't properly answer why precisely half of the neurons are switched off and not 10% (or any other number). Maybe there is an explanation but I haven't seen it. In general it just works and that's it.
Also, I need to mention that the task of dropout is to ensure that each neuron doesn't rely on just a few of the neurons from the previous layer and is ready to make a decision even if the neurons which usually help it make the correct decision are not available. This is used for generalization only and helps the network cope better with data it hasn't seen previously; nothing else is achieved with dropout.
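For reference, a minimal Keras sketch of dropout as described above (the layer sizes and the 0.5 rate are illustrative):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(64,)),
    tf.keras.layers.Dropout(0.5),  # randomly zeroes half of the activations, during training only
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax'),
])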
Now let's consider Transfer Learning again. Consider that you have a network with 4 layers. You train it to recognize specific objects in pictures (cat, dog, table, car, etc.). Then you cut off the last layer, replace it with three additional layers, and train the resulting 6-layer network on a dataset which, for instance, produces short sentences about what is shown in the image ('a cat is on the car', 'house with windows and a tree nearby', etc.). What did we do with such an operation? Our original 4-layer network was capable of understanding whether some specific object is in the image we feed it. Its first 3 layers learned good representations of the images - the first layer learned about possible edges or points or some extremely primitive geometric shapes in images. The second layer learned some more elaborate geometric figures like 'circle' or 'square'. The last layer knows how to combine them to form some higher-level objects - 'car', 'cat', 'house'. Now we can just re-use this good representation in a different domain and add several more layers. Each of them will use the abstractions from the last (3rd) layer of the original network and learn how to combine them to create meaningful descriptions of images. While you perform learning on the new dataset with images as input and sentences as output, it will adjust the first 3 layers which we got from the original network, but these adjustments will be mostly minor, while the 3 new layers will be adjusted significantly by learning. What we achieve with transfer learning is:
1) We can learn much better data representations. We can create a network which is very good at a specific task and then build upon that network to perform something different.
2) We can save training time - the first layers of the network will already be trained well enough so that the layers which are closer to the output already receive a rather good data representation. So training should finish much faster using pre-trained first layers.
So the bottom line is that pre-training some network and then re-using part or all of it in another network makes perfect sense and is not something uncommon.
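A rough Keras sketch of the cut-and-extend operation described above; old_model, the layer sizes and the new head are all hypothetical stand-ins for an already-trained network:
import tensorflow as tf

# Stand-in for an already-trained 4-layer classifier.
old_model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),  # original head to cut off
])
# old_model.load_weights(...) would restore the trained weights here.

# Keep the learned representation up to the layer before the old head.
features = old_model.layers[-2].output

# Stack new layers on top for the new task (sizes are illustrative).
x = tf.keras.layers.Dense(64, activation='relu')(features)
x = tf.keras.layers.Dense(64, activation='relu')(x)
new_outputs = tf.keras.layers.Dense(32, activation='softmax')(x)

new_model = tf.keras.Model(old_model.input, new_outputs)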
This is something I have seen in the likes of this video...
https://youtu.be/qv6UVOQ0F44
There are links to further resources in the video description.
It is based on a process called NEAT: NeuroEvolution of Augmenting Topologies.
It uses a genetic algorithm and an evolutionary process to design and evolve a neural net from scratch, with no prior assumptions about the structure or complexity of the net.
I believe this is what you are looking for.