I'm still pretty new to Keras/Theano, so please bear with me. After googling and reading the documentation, I did not find anything (except for writing your own optimizer).
I've downloaded the convnets from https://github.com/heuritech/convnets-keras, and from there I want to use and retrain the AlexNet. I just want to retrain the top layers (dense 1, 2, 3) for 8 classes instead of 1000, so the network would retain the "concepts" of images in the conv layers but generate output for my data (8 classes).
Is it somehow possible to restrict learning to the top layers, i.e. freeze the bottom layers?
Thanks
Related
My task is to adapt a pre-trained network from Keras for classification of aerial images (we have a database of 30 categories of aerial images, each containing 200-400 images).
Now, what I don't really understand is this next part.
We must use mid-level fine tuning using a smaller image database, which contains 21 aerial categories.
How can I achieve this?
Should I first fine-tune a VGG16 network on the smaller database, then save the model and train it further on the larger database?
I'm guessing that they want you to fine-tune a trained model by freezing its first X layers and only updating the weights of the last few layers (maybe just the last one, not sure what "mid-level fine-tuning" means).
You need to take your trained model and replace its last layer (the one with 30 outputs) with a new layer of 21 outputs. Then you need to freeze all the other layers (everything except the new one) and train the model on the new dataset.
In Keras you just need to set trainable=False on each layer you want to freeze.
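A minimal sketch of the idea, assuming base is your loaded Keras Model and that its penultimate layer outputs the features you want to keep (the names and class count are illustrative):

```python
from keras.layers import Dense
from keras.models import Model

for layer in base.layers:
    layer.trainable = False            # freeze every pre-trained layer

features = base.layers[-2].output      # branch off just below the old top layer
new_top = Dense(21, activation='softmax')(features)  # new class count (21, 8, ...)

model = Model(base.input, new_top)     # the new Dense layer stays trainable
model.compile(optimizer='sgd', loss='categorical_crossentropy')
```

Note that trainable has to be set before compile is called, otherwise the freeze won't take effect.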
See also: How can I "freeze" Keras layers?
I'd like to know whether I can use a dataset of signs made with a Kinect to retrain Inception's final layer, as described on the TensorFlow tutorial website, which uses ordinary RGB images. I am new to this field. Opinions are much appreciated.
The short answer is: "No, you cannot fine-tune only the last layer. But you can fine-tune the whole pre-trained network." The first layers of the pre-trained network are looking for RGB features. Your depth frames will hardly provide enough entropy to match that. Your options are:
If the recognised/tracked objects (hands) are not masked and you have actual depth data for the background, you can train from scratch on depth images with some contrast stretching and data whitening ((x - mu) / sigma). This will take a very long time for the ivy-league networks like Inception and ResNet. Also keep in mind that most Python-based deep learning frameworks rely on PIL image loaders, which by default assume 8-bit channels mapped into the range [0, 1]; these loaders will cast your 16-bit pixels down to 8-bit ones.
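Roughly what that preprocessing could look like (a sketch; the percentile clip for the contrast stretch is my own choice):

```python
import numpy as np

def preprocess_depth(depth16):
    """Contrast-stretch a 16-bit depth frame, then whiten it: (x - mu) / sigma."""
    d = depth16.astype(np.float32)
    lo, hi = np.percentile(d, (2, 98))            # clip outliers before stretching
    d = np.clip((d - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
    return (d - d.mean()) / (d.std() + 1e-6)      # zero mean, unit variance
```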
If the recognised/tracked objects (hands) are masked, meaning your background is set to a constant value or barely has any gradient in it, the network will overfit on the silhouette of the object, because that is where the strongest edges are. The solution is to colorise the depth image using normal maps or HSA/HSV/JET colour coding, converting it into a 3x8-bit-channel image. This makes training converge much faster, and in my recent experiments we found that you can fine-tune the ivy-league networks on the colorised depth.
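For the colourising option, a minimal OpenCV sketch (the per-frame min/max normalisation is an assumption; a fixed depth range may suit your sensor better):

```python
import cv2
import numpy as np

def colorize_depth(depth16):
    """Map a 16-bit depth frame to a 3x8-bit image with the JET colormap."""
    d = depth16.astype(np.float32)
    d = (d - d.min()) / max(float(d.max() - d.min()), 1e-6)  # normalise to [0, 1]
    d8 = (d * 255).astype(np.uint8)                          # down to 8 bits
    return cv2.applyColorMap(d8, cv2.COLORMAP_JET)           # shape (H, W, 3)
```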
Since you are new to this field, I would suggest reading up on transfer learning and all three of the commonly described types, and then applying whichever form fits your dataset. If your dataset is very similar to the one your model was trained on, you can get away with retraining just the last layers; if it is not similar, you will have to fine-tune more of the existing model.
As you go deeper into a neural network, the features it extracts become more specific to the training data, so take care with which layers you reuse if your dataset is not very similar to the pre-trained model's dataset. The earlier layers contain the more generic features.
I have studied ordinary fully connected ANNs and am now starting to study convnets, and I am struggling to understand how the hidden layers connect. I do understand how the input matrix feeds a smaller field of values forward to the feature maps in the first hidden layer: the local receptive field moves along one step at a time and feeds forward through the same shared weights (for each feature map), so there is only one group of weights per feature map, with the same structure as the local receptive field. Please correct me if I am wrong. Then the feature maps use pooling to simplify the maps. The next part is where I get confused; here is a link to a 3D CNN visualisation to help explain my confusion:
http://scs.ryerson.ca/~aharley/vis/conv/
Draw a digit between 0-9 into the top-left pad and you'll see how it works. It's really cool. On the layer after the first pooling layer (the 4th row up, containing 16 filters), if you hover your mouse over the filters you can see how the weights connect to the previous pooling layer. Try different filters on this row; what I do not understand is the rule that connects the second convolution layer to the previous pooling layer. E.g. the filters on the very left are fully connected to the pooling layer, but the ones nearer the right only connect to about 3 of the previous pooled maps. It looks random.
I hope my explanation makes sense. I am essentially confused about what the pattern is that connects hidden pooled layers to the following hidden convolution layer. Even if my example is a bit odd, I would still appreciate some sort of explanation or link to a good explanation.
Thanks a lot.
Welcome to the magic of self-trained CNNs. It's confusing because the network makes up these rules as it trains. This is an image-processing example; most of these happen to train in a fashion that loosely parallels the learning in a simplified model of the visual cortex in vertebrates.
In general, the first layer's kernels "learn" to "recognize" very simple features of the input: lines and edges in various orientations. The next layer combines those for more complex features, perhaps a left-facing half-circle, or a particular angle orientation. The deeper you go in the model, the more complex the "decisions" get, and the kernels get more complex, and/or less recognizable.
The difference in connectivity from left to right may be an intentional sorting by the developer, or mere circumstance in the model. Some features need to "consult" only a handful of the previous layer's kernels; others need a committee of the whole. Note how simple features connect to relatively few kernels, while the final decision has each of the ten categories checking in with a large proportion of the "pixel"-level units in the last FC layer.
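To make the "consult only a handful" idea concrete, here is a sketch of a hand-designed partial connection table, in the spirit of LeNet-5's C3 layer (the table entries and sizes here are made up):

```python
import tensorflow as tf
from tensorflow.keras import layers, Input, Model

inputs = Input(shape=(14, 14, 6))   # e.g. 6 pooled feature maps

# Which previous-layer maps each new filter "consults" (an illustrative table).
connection_table = [(0, 1, 2), (1, 2, 3), (0, 3, 4, 5), (0, 1, 2, 3, 4, 5)]

feature_maps = []
for subset in connection_table:
    # Pick out only the chosen input maps, then convolve them together.
    chosen = layers.Lambda(lambda t, idx=list(subset): tf.gather(t, idx, axis=-1))(inputs)
    feature_maps.append(layers.Conv2D(1, (5, 5), activation='relu')(chosen))

outputs = layers.Concatenate(axis=-1)(feature_maps)  # 4 partially connected maps
model = Model(inputs, outputs)
model.summary()
```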
You might look around for some kernel visualizations for larger CNN implementations, such as those in the ILSVRC: GoogleNet, ResNet, VGG, etc. Those have some striking kernels through the layers, including fuzzy matches to a wheel & fender, the front part of a standing mammal, various types of faces, etc.
Does that help at all?
All of this is the result of organic growth over the training period.
I'm interested in taking advantage of some partially labeled data that I have in a deep learning task. I'm using a fully convolutional approach, not sampling patches from the labeled regions.
I have masks that outline regions of definite positive examples in an image, but the unmasked regions in the images are not necessarily negative - they may be positive. Does anyone know of a way to incorporate this type of class in a deep learning setting?
Triplet/contrastive loss seems like it may be the way to go, but I'm not sure how to accommodate the "fuzzy" or ambiguous negative/positive space.
Try label smoothing, as described in section 7.5.1 of the Deep Learning book:
We can assume that for some small constant eps, the training set label y is correct with probability 1 - eps, and otherwise any of the other possible labels might be correct.
Label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of eps / k and 1 - (k - 1) * eps / k, respectively.
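A quick numpy sketch of that replacement (eps = 0.1 is just an example value):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Turn hard 0/1 targets into eps/k and 1 - (k - 1)*eps/k."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k   # each row still sums to 1

y = np.eye(4)[[0, 2]]            # two one-hot rows, k = 4 classes
print(smooth_labels(y))          # wrong classes -> 0.025, right class -> 0.925
```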
See my question about implementing label smoothing in Pandas.
Otherwise, if you know for sure that some areas are negative and others positive while some are uncertain, you can introduce a third "uncertain" class. I have worked with datasets that contained an uncertain class, which corresponded to samples that could belong to any of the available classes.
I'm assuming that you are struggling with a segmentation task with an ill-defined background (e.g. you are not sure whether all examples are correctly labeled). I recently ran into a similar problem, and this is what I found during my research:
In the old days before deep learning, and at the beginning of the deep learning era, the common way to deal with this was to smooth your output with some kind of probability model that takes the possibility of noisy labels into account (you can read about this in the Learning to Label from Noisy Data chapter of this book). It's important to distinguish these probabilistic models from models used to smooth your labels w.r.t. image or label structure, like classical CRFs for bilateral smoothing.
What we finally used (and it worked really well) is the Channel Inhibited Softmax idea from this paper. In terms of its mathematical properties, it makes your network much more robust to some objects not being labeled, because it pushes the network to output much higher positive-valued logits at correctly labeled objects.
You could treat this as a semi-supervised problem. Use the full dataset without labels to train a bottleneck autoencoder (or a GAN approach). This pretrained model can then be adjusted (e.g. removing the last layers and adding a better layer structure on top of the bottleneck features) and fine-tuned on the labeled data.
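A rough sketch of that pipeline in Keras (the architecture, sizes, and names are all illustrative):

```python
from tensorflow.keras import layers, Input, Model

num_classes = 10   # placeholder for your label count

# 1) Bottleneck autoencoder, pretrained on the unlabeled images.
inp = Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(inp)
x = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)
bottleneck = layers.Conv2D(16, 3, padding='same', activation='relu')(x)
y = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(bottleneck)
y = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(y)
out = layers.Conv2D(3, 3, padding='same', activation='sigmoid')(y)

autoencoder = Model(inp, out)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(unlabeled_images, unlabeled_images, epochs=10)

# 2) Drop the decoder, add a classifier head, fine-tune on the labeled subset.
head = layers.GlobalAveragePooling2D()(bottleneck)
head = layers.Dense(num_classes, activation='softmax')(head)
classifier = Model(inp, head)
classifier.compile(optimizer='adam', loss='categorical_crossentropy')
```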
I'm working on an image retrieval task (not involving faces), and one of the things I am trying is to swap out the softmax layer in the CNN model and use an LMNN classifier. For this purpose I fine-tuned the model and then extracted the features at the fully connected layer. I have about 3000 images right now. The fully connected layer gives a 4096-dim vector, so my final matrix is 3000x4096, with about 700 classes (each class has 2+ images). I believe this dimensionality is far too large, and the LMNN algorithm is going to take forever on it (it really did take forever).
How can I reduce the number of dimensions? I tried PCA, but that didn't squeeze the dimensions down much (it got down to 3000x3000). I am thinking a 256/512/1024-dim vector should help. If I were to add another layer to reduce dimensions, say a new fully connected layer, would I have to fine-tune my network again? Input on how to do that would be great!
I am also currently trying to augment my data to get more images per class and increase the size of my dataset.
Thank you.
PCA should let you reduce the data further; you should be able to specify the desired dimensionality (see the Wikipedia article).
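For example, with scikit-learn (assuming features holds your 3000x4096 matrix):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=512)                   # try 256/512/1024 as you suggested
reduced = pca.fit_transform(features)         # features: shape (3000, 4096)
print(reduced.shape)                          # (3000, 512)
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained
```

Incidentally, PCA can produce at most min(n_samples, n_features) components, which is why you stopped at 3000x3000 with 3000 images; requesting 512 components explicitly works fine.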
As well as PCA, you can try t-distributed stochastic neighbor embedding (t-SNE). I really enjoyed Wattenberg et al.'s article; it's worth a read if you want some insight into how t-SNE works and its pitfalls.
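A sketch (note that scikit-learn's default Barnes-Hut t-SNE only supports up to 3 output dimensions, so it's better suited to visualising your 700 classes than to producing a 512-dim embedding):

```python
from sklearn.manifold import TSNE

embedded = TSNE(n_components=2, perplexity=30).fit_transform(features)
print(embedded.shape)   # (3000, 2), handy for plotting the classes
```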
In a neural net, the standard way to reduce dimensionality is by adding more, smaller layers, as you suggested. Since they can only learn during training, you'll need to re-run your fine-tuning. Ideally you would re-run the entire training process whenever you change the model structure, but if you have enough data, fine-tuning alone may still be OK.
To add new layers in TensorFlow, you would add a fully connected layer whose input is the output of your 4096-element layer and whose output size is the desired number of elements. You may repeat this if you want to go down gradually (e.g. 4096 -> 1024 -> 512). You would then perform your training (or fine-tuning) again.
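A hedged sketch of that with tf.keras, assuming base is your fine-tuned model and "fc1" is the name of its 4096-dim layer (both names are illustrative):

```python
from tensorflow.keras import layers, Model

features = base.get_layer('fc1').output            # the 4096-dim activations
x = layers.Dense(1024, activation='relu')(features)
x = layers.Dense(512, activation='relu')(x)        # 4096 -> 1024 -> 512
out = layers.Dense(700, activation='softmax')(x)   # 700 classes, for fine-tuning

model = Model(base.input, out)
model.compile(optimizer='adam', loss='categorical_crossentropy')
# After fine-tuning, take the 512-dim layer's output as your LMNN input.
```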
Lastly, I did a quick search and found a paper that claims to scale LMNN to large datasets through random sampling. You might be able to use that to save a few headaches: Fast LMNN Algorithm through Random Sampling.