How to train a model with three images as a single input - machine-learning

I want to train an Inceptionv3 model where I am trying to give 3 different views of a single object and train on them. So I want to give three images as my input in a single feed.
Use case:
I want to predict the type of footwear. In this problem a lot of information is usually spread across the different views, so I just want to try this approach.

The easy way would be to input all 3 images separately into the Inceptionv3 model, and then perform some weighted decision on all 3 outputs together.
A better approach would be to use the Inceptionv3 model as 1 of 3 input branches, then take the embedding layer of each branch (the layer before the last) and combine them all with one fully connected classification layer (with softmax activation). The 3 branches can be trained either view-specific or together with shared weights (with such a big model, shared weights will work fine).
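For illustration, here is a minimal Keras sketch of the shared-weights variant; the image size, number of classes and pooling choice are assumptions, not part of the answer:

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

NUM_CLASSES = 10  # e.g. number of footwear types (assumption)

# One shared backbone; its weights are reused for all three views.
backbone = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                       input_shape=(299, 299, 3))

view_inputs = [layers.Input(shape=(299, 299, 3), name=f"view_{i}") for i in range(3)]
embeddings = [backbone(x) for x in view_inputs]            # shared weights across views
merged = layers.Concatenate()(embeddings)                  # combine the three embeddings
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = Model(inputs=view_inputs, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit([view1_batch, view2_batch, view3_batch], labels, ...)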
By the way, for a shoe-type classification task I would suggest using a simpler model (Inceptionv3 is overkill).

I think you have a few different ways of approaching this:
Remove the first layer of Inception and create your own that supports the 3×3 (three views × three channels) input dimensions.
Use the first Inception blocks for each input, then concatenate them at some fully connected layer (or earlier). If the features to look for are similar, you can use shared parameters.
The first option will merge all dimensions and diffuse the information provided by each image.
The second one will extract specific features from each image.
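A minimal sketch of the first option, stacking the three RGB views into a single 9-channel input (shapes and the class count are assumptions for illustration):

import numpy as np
from tensorflow.keras import layers, Model

view1 = np.random.rand(8, 299, 299, 3).astype("float32")  # batch of view-1 images (placeholder)
view2 = np.random.rand(8, 299, 299, 3).astype("float32")
view3 = np.random.rand(8, 299, 299, 3).astype("float32")
stacked = np.concatenate([view1, view2, view3], axis=-1)   # shape (8, 299, 299, 9)

# Replace the usual 3-channel stem with one that accepts 9 channels,
# then continue with whatever backbone blocks you like.
inputs = layers.Input(shape=(299, 299, 9))
x = layers.Conv2D(32, 3, strides=2, activation="relu")(inputs)  # custom 9-channel first layer
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)             # 10 classes is an assumption

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(stacked, labels, ...)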

Related

How to apply tf-idf on multiple predictors, don't want to concatenate into a single column

I have two predictors and I want to vectorize each of them using tf-idf (I don't want to concatenate them, since we need a separate vocabulary for each). Should I apply the tf-idf vectorizers to each and then join the features?
For example, if I apply tf-idf on predictor1, I get 100 features from that and 200 from predictor2. My features for the training data would simply be 300 (100 + 200). Am I thinking correctly here?
I will get two matrices from this (one for each predictor); can I concatenate these using numpy functions and use them as features?
Your suggested way of getting this done is correct. The most common way of using two vectors like this is to concatenate them into a longer vector and then feed it to the model.
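A minimal sketch of the concatenation approach, assuming two hypothetical text columns predictor1 and predictor2 and scikit-learn's TfidfVectorizer:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

df = pd.DataFrame({
    "predictor1": ["some text for field one", "more text for field one"],
    "predictor2": ["text for field two", "other text for field two"],
})

vec1 = TfidfVectorizer()  # separate vocabulary for predictor1
vec2 = TfidfVectorizer()  # separate vocabulary for predictor2

X1 = vec1.fit_transform(df["predictor1"])  # e.g. (n_samples, 100)
X2 = vec2.fit_transform(df["predictor2"])  # e.g. (n_samples, 200)

X = hstack([X1, X2])  # sparse concatenation -> (n_samples, 300)
# X can now be fed to any scikit-learn estimator as the feature matrix.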
If, for some reason, this doesn't work out for you, we can explore alternatives based on what your constraints are.
For example, if your constraint is total dimension size, one way to solve this would be to create a multi-layered MLP autoencoder. We can train it with the combined vectors as both input and output until the encoder is trained. Subsequently, we can use any intermediate layer's activations as input to our model.
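A minimal sketch of that autoencoder idea, assuming a 300-dimensional combined tf-idf vector and a 64-dimensional bottleneck (both numbers are illustrative):

from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(300,))
encoded = layers.Dense(128, activation="relu")(inputs)
encoded = layers.Dense(64, activation="relu")(encoded)      # bottleneck layer
decoded = layers.Dense(128, activation="relu")(encoded)
decoded = layers.Dense(300, activation="linear")(decoded)

autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_dense, X_dense, ...)   # train with input == output (X.toarray() if sparse)
# encoder.predict(X_dense) then gives the reduced features for the downstream model.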
It would be easier to suggest a solution if you can describe your constraints in the question.

What is a multi-headed model? And what exactly is a 'head' in a model?

What is a multi-headed model in deep learning?
The only explanation I found so far is this: every model might be thought of as a backbone plus a head, and if you pre-train the backbone and put a random head on it, you can fine-tune it and that is a good idea.
Can someone please provide a more detailed explanation?
The explanation you found is accurate. Depending on what you want to predict from your data, you require an adequate backbone network and a certain number of prediction heads.
For a basic classification network, for example, you can view ResNet, AlexNet, VGGNet, Inception, ... as the backbone and the fully connected layer as the sole prediction head.
A good example of a problem where you need multiple heads is localization, where you not only want to classify what is in the image but also want to localize the object (find the coordinates of the bounding box around it).
In the general architecture, the backbone network ("convolution and pooling") is responsible for extracting a feature map from the image that contains higher-level, summarized information. Each head uses this feature map as input to predict its desired outcome.
The loss that you optimize for during training is usually a weighted sum of the individual losses for each prediction head.
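A minimal Keras sketch of this backbone-plus-two-heads setup (layer sizes, the 20-class head and the loss weights are illustrative assumptions):

from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(224, 224, 3))

# Backbone: convolution and pooling, producing a shared feature representation.
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

# Head 1: what is in the image?
class_head = layers.Dense(20, activation="softmax", name="class")(x)
# Head 2: where is it? (4 bounding-box coordinates)
box_head = layers.Dense(4, activation="linear", name="box")(x)

model = Model(inputs, [class_head, box_head])
model.compile(
    optimizer="adam",
    loss={"class": "categorical_crossentropy", "box": "mse"},
    loss_weights={"class": 1.0, "box": 0.5},  # weighted sum of the per-head losses
)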
The head is the top of a network. For instance, on the bottom (where data comes in) you take the convolution layers of some model, say ResNet. If you call ConvLearner.pretrained, ConvnetBuilder will build a network with an appropriate head for your data in Fast.ai (if you are working on a classification problem it will create a head with a cross-entropy loss; if you are working on a regression problem it will create a head suited to that).
But you could build a model that has multiple heads. The model could take inputs from the base network (the ResNet conv layers) and feed the activations to some model, say head1, and then the same data to head2. Or you could have some number of shared layers built on top of ResNet, and only those layers feeding head1 and head2.
You could even have different layers feed different heads! There are some nuances to this (for instance, with regard to the fastai lib, ConvnetBuilder will add an AdaptivePooling layer on top of the base network if you don't specify the custom_head argument, and if you do, it won't), but this is the general picture.
https://forums.fast.ai/t/terminology-question-head-of-neural-network/14819/2
https://youtu.be/h5Tz7gZT9Fo?t=3613 (1:13:00)

How can I train a single model where I have two different set of data to train with, simultaneously?

I am currently working on a model where I have to predict materials like ladders, nuts, bolts, mice, bottles, etc. I have written an algorithm for this which is working okay as of now. The set of images that I have is available on my local computer, and I have enough training data to do the training and testing. As of now, I have a total of 26 image classes to predict from, all of them material types.
Now, this is fine, but I also want a case where, if an image doesn't belong to the said image classes, the model would return something indicating that this is not a material but a different picture altogether.
To do this I am thinking of double-training my model with a different set of images (e.g. ImageNet), so that just by looking at any non-material image it would return something like "this is not a material!"
So basically, the same model would get trained on two different datasets: one dataset is my material dataset, the other is anything other than materials, like images in ImageNet.
My question is: how do I approach this? Do I even need to do this? Or should I just write a simple if-else and treat anything that is not recognized as a material as the non-material type?
You can just merge the two datasets and label the ones that do not belong to the said 26 classes as a special 27th class. Whenever your model predicts that class, you know the image is not part of your dataset. For example:
pred = [0.1, 0.1, 0.8] # Assume label 2 is not-this-dataset label
Then you can use images from the other dataset with label 2 and train as usual in a training cycle. Make sure to balance the dataset, i.e. make sure there aren't proportionally too many special not-this-dataset labels, so that your model doesn't overfit and predict that everything is not from your original dataset.
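A minimal sketch of the merge-and-relabel idea; the arrays below are placeholders, and only the 26 material classes come from the question:

import numpy as np

# X_materials, y_materials: your 26-class material data (labels 0..25)
# X_other: images from another source (e.g. an ImageNet subset)
X_materials = np.random.rand(100, 224, 224, 3)
y_materials = np.random.randint(0, 26, size=100)
X_other = np.random.rand(30, 224, 224, 3)

UNKNOWN_CLASS = 26                        # special 27th "not a material" class
y_other = np.full(len(X_other), UNKNOWN_CLASS)

X = np.concatenate([X_materials, X_other])
y = np.concatenate([y_materials, y_other])
# Shuffle and train a single 27-way classifier as usual; a prediction of class 26
# means "this is not one of my materials". Keep the class proportions reasonable.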

Caffe's way of representing negative examples on benchmark dataset for binary classification

I would like to know how to define or represent a negative training set if I want to train a binary classifier from a pre-trained model, say AlexNet, on the ILSVRC12 (ImageNet) dataset. What I am currently thinking of is to take one of the classes which is not related as the negative training set, while the related one is the positive set. Is there any better, more elegant way?
The CNNs trained on the ILSVRC data set are already discriminating among 1000 classes of images. Yes, you can use one of those topologies to train a binary classifier, but I suggest that you start with an untrained model and run it through your two chosen classes. If you start with a trained model, you have to unlearn a lot, and your result is still trying to discriminate among 1000 classes: that last FC layer is going to give you trouble.
There are ways to work around the 1000-class problem. If your application already overlaps with one or more of the trained classes, then simply add a layer that maps those classes to label "1" and all the others to label "0".
If you're insistent on retaining the trained kernels, then try replacing the final FC layer (1000 outputs) with a 2-class FC layer. Then choose your two classes (applicable images vs. everything else) and run your training.
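The question is about Caffe, but here is a minimal Keras sketch of the same idea (a pretrained backbone with its kernels frozen and the 1000-way FC layer swapped for a 2-class one; ResNet50 is just a stand-in topology):

from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50  # stand-in for a pretrained topology

base = ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                    # retain the trained kernels

inputs = layers.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
outputs = layers.Dense(2, activation="softmax")(x)  # 2-class head replaces the 1000-way FC

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(positive_and_negative_images, labels, ...)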

Multiple neural networks with one output each or one with multiple outputs?

I want to classify the input as one of 3 possibilities. Is it better to use 3 networks with one output each or 1 network with 3 outputs?
(i.e., 3 networks that each output 0 or 1, or 1 network that outputs a one-hot vector of length 3, e.g. [1, 0, 0])
Does the answer change depending on how complex the incoming data is to classify?
At what amount of outputs does it make sense to partition the networks (if ever)? For example, if I want to classify into 20 groups, does it make a difference?
I would say it would make more sense to use a single network with multiple outputs.
The main reason is that hidden layers (I'm assuming you'll have at least one hidden layer) can be interpreted as transforming the data from the original space (feature space) into a different space that is more suitable for the task (classification in your case). For example, when training a network to recognize faces from raw pixels, it might use a hidden layer to first detect simple shapes such as small lines based on pixels, then use another hidden layer to detect more complex shapes such as eyes/noses based on the lines from the first layer, etc. (it may not be entirely as "clean" as this, but this is an easy-to-understand example).
Such a transformation that a network can learn is typically useful for the classification task, regardless of what class the specific example has. For example, it is useful to be able to detect eyes in images regardless of whether or not the actual image contains a face; if you do indeed detect two eyes, you can classify it as a face, and otherwise you classify it as not being a face. In both cases, you were looking for eyes.
So, by splitting up into multiple networks, you may end up learning quite similar patterns in all networks anyway. Then you might as well have saved yourself the computational effort and just learned it once.
Another disadvantage of splitting up into multiple networks would be that you would probably cause your dataset to become imbalanced (or more imbalanced if it already is imbalanced). Suppose you have three classes, with exactly 1/3 of the dataset belonging to each class. If you use three networks for three binary classification tasks, you suddenly always have 1/3 "1" classes and 2/3 "0" classes. A network may then become biased towards predicting 0s everywhere, since those are the majority classes in each of the three separate problems.
Note that this is all based on my intuition; the best solution if you have time would be to simply try both approaches and test! I don't think I have ever seen someone using multiple networks for a single classification task in practice though, so if you only have time for one approach I'd recommend going for a single network.
I think the only case where it would really make sense to use multiple networks would be if you actually want to predict multiple unrelated values (or at least values that are not strongly related). For example, if, given images, you want to 1) predict whether or not there is a dog in the image, and 2) whether it is a photograph or a painting. Then it may be better to use two networks with two outputs each, instead of a single network with four outputs.
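A minimal sketch of the recommended setup, a single network with one 3-way softmax output (input size and hidden width are illustrative assumptions):

from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(32,))
hidden = layers.Dense(64, activation="relu")(inputs)       # shared representation
outputs = layers.Dense(3, activation="softmax")(hidden)    # one-hot-style 3-class output

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# The same pattern scales to 20 groups by changing Dense(3, ...) to Dense(20, ...).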
