Transfer Learning and linear classifier - machine-learning

The cs231n handout says:
New dataset is small and similar to original dataset. Since the data
is small, it is not a good idea to fine-tune the ConvNet due to
overfitting concerns... Hence, the best idea might be to train a
linear classifier on the CNN codes.
I'm not sure what "linear classifier" means here. Does it refer to the last fully connected layer? (For example, AlexNet has three fully connected layers. Is the linear classifier the last of them?)

Usually when people say "linear classifier" they refer to a linear SVM (support vector machine). A linear classifier learns a weight vector w and a threshold (aka "bias") b such that for each example x the sign of
<w, x> + b
is positive for the "positive" class and negative for the "negative" class.
The last (usually fully connected) layer of a neural-net can be considered as a form of a linear classifier.
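As a minimal sketch of the decision rule above: the weights and bias here are hypothetical values picked by hand, not learned ones. In the transfer-learning setting from the question, x would be the CNN code (the feature vector produced by the pretrained network) rather than a two-element list.

```python
def linear_classify(w, b, x):
    """Return +1 or -1 depending on the sign of <w, x> + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

# Hypothetical weights and bias, chosen by hand for illustration only.
w = [2.0, -1.0]
b = -0.5

print(linear_classify(w, b, [1.0, 0.5]))  # score = 2.0 - 0.5 - 0.5 = 1.0
print(linear_classify(w, b, [0.0, 1.0]))  # score = -1.0 - 0.5 = -1.5
```

Training (with an SVM, logistic regression, or a single fully connected layer) only changes how w and b are found; the prediction rule stays the same.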

Related

Numerically stable cross entropy loss calculation for Mixture of Experts model in Pytorch

I am stuck with a supposedly simple problem. I have a mixture of experts system, consisting of multiple neural networks for classification, whose mixture weights are determined by the data as well. For the posterior probability of label y, given data x and K different expert networks, we have:
p(y|x) = sum over k = 1..K of p(y|z_k,x) p(z_k|x)
In this scheme, p(y|z_k,x) are expert network posterior probabilities, which are Softmax functions applied to network outputs. p(z_k|x) are the network weights.
My problem is the following. Usually, in Pytorch we feed the outputs of the last layer (logits) into the cross entropy loss function. Pytorch handles the numerical stability issues with the log-sum-exp trick (see "How is log_softmax() implemented to compute its value (and gradient) with better speed and numerical stability?"). In my case, however, the model output is not the logits but the probabilities themselves, due to the nature of the mixture model. Taking the logarithm of the mixture probabilities and feeding it into the NLL loss crashes after a couple of iterations, since some probabilities quickly become very close to 0 and underflow/overflow issues start to appear. The calculation is numerically very unstable.
In this particular case, what would be the correct way to calculate CE (or NLL) loss, without losing the numerical stability?
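One common way around this is to stay in log space throughout: log p(y|x) = logsumexp over k of [log p(z_k|x) + log p(y|z_k,x)], with both terms computed by a log-softmax instead of a softmax followed by a log. The sketch below uses only the standard library and assumes that the raw scores (logits) of both the gating network and each expert are available; the function and argument names are hypothetical. In PyTorch, the same idea can be expressed with `torch.nn.functional.log_softmax` and `torch.logsumexp`.

```python
import math

def logsumexp(values):
    """log(sum(exp(v))) computed without overflow by shifting by the max."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_softmax(logits):
    """Log of the softmax, computed directly from the logits."""
    lse = logsumexp(logits)
    return [v - lse for v in logits]

def mixture_nll(gate_logits, expert_logits, label):
    """Negative log-likelihood of `label` under a mixture of experts.

    gate_logits:   K raw gating scores (before softmax)
    expert_logits: K lists, each holding one expert's raw class scores
    """
    log_gate = log_softmax(gate_logits)
    # log p(z_k|x) + log p(y|z_k,x) for every expert k
    joint = [
        log_gate[k] + log_softmax(expert_logits[k])[label]
        for k in range(len(gate_logits))
    ]
    return -logsumexp(joint)

# Extreme logits: the naive route would compute log(0) for class 1.
print(mixture_nll([0.0, 0.0], [[1000.0, 0.0], [1000.0, 0.0]], label=1))
```

The mixture probability is never materialised, so a probability that underflows to 0 in ordinary floating point is still represented as a large negative log value.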

Basic classifier and BackPropagation

I'm following the Machine Learning course on Coursera and I have a question.
[Figure: multiple classifiers making an XOR classifier]
In this picture we can see that, in order to make an XOR classifier, we build smaller classifiers, each trained on a linearly separable gate.
So each classifier has a job (for example AND, OR, etc) defined and the network must be trained for this task.
But in a bigger neural net it's impossible to define a task for each neuron (or classifier).
So my question is: is assigning these tasks the job of the backpropagation algorithm (in addition to updating the weights)?
If someone is wondering the same thing: yes, that is the case.
Backpropagation ends up giving each neuron (or classifier) its own "smaller, linearly solvable" sub-problem.
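To make the XOR decomposition concrete, here is a small sketch with hand-set weights (all hypothetical, chosen for illustration): two hidden units act as OR and NAND, and the output unit is an AND of the two. Backpropagation would not necessarily arrive at these exact gates; it only needs to find some set of hidden units whose combination solves the task.

```python
def step_unit(weights, bias, inputs):
    """A single threshold neuron: 1 if the weighted sum is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def xor(a, b):
    h_or = step_unit([1.0, 1.0], -0.5, [a, b])      # fires if a OR b
    h_nand = step_unit([-1.0, -1.0], 1.5, [a, b])   # fires unless a AND b
    return step_unit([1.0, 1.0], -1.5, [h_or, h_nand])  # AND of the two

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))
```

Each unit on its own is a linear classifier; only their composition is non-linear.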

Can a probability score from a machine learning model XGBClassifier or Neural Network be treated as a confidence score?

If there are 4 classes and the output probabilities from the model are A=0.30, B=0.40, C=0.20, D=0.10, can I say that the output of the model is class B with 40% confidence? If not, why?
Although a softmax activation ensures that the outputs satisfy the basic Kolmogorov axioms (the probabilities sum to one, and none is below zero or above one) and the individual values can be seen as a measure of the network's confidence, you would need to calibrate the model (train it not as a classifier but rather as a probability predictor) or use a Bayesian network before you could formally claim that the output values are your per-class prediction confidences. (https://arxiv.org/pdf/1706.04599.pdf)
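One simple calibration method discussed in the linked paper is temperature scaling: the logits are divided by a scalar T before the softmax. The sketch below only illustrates the effect, under the assumption that the logits are available. The logit values and the temperature T = 2.0 are hypothetical; in practice T would be fitted on held-out validation data.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for classes A, B, C, D.
logits = [1.1, 1.4, 0.7, 0.0]

for t in (1.0, 2.0):
    probs = softmax(logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

The predicted class does not change with T, but the reported "confidence" for it does, which is why the raw softmax value alone cannot be read as a calibrated confidence.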

LDA and PCA on a dataset containing two classes

I would like to compare the accuracies of running logistic regression on a dataset following PCA and LDA. The dataset I am using is the Wisconsin cancer dataset, which contains two classes (malignant or benign tumors) and 30 features. I have already conducted PCA on this data and have been able to get good accuracy scores with 10 PCAs. I know that LDA is similar to PCA. My understanding is that you calculate the mean vectors of each feature for each class, compute scatter matrices and then get the eigenvalues for the dataset. Is LDA similar to PCA in the sense that I can choose 10 LDA eigenvalues to better separate my data? I have tried LDA with scikit-learn, but it has only given me one LDA back. Is this because I only have 2 classes, or do I need to do an additional step? I would like to have 10 LDAs in order to compare them with my 10 PCAs. Is this even possible?
Actually, both LDA and PCA are linear transformation techniques: LDA is supervised, whereas PCA is unsupervised (it ignores class labels). You can picture PCA as a technique that finds the directions of maximal variance, and LDA as a technique that also cares about class separability (note that here, LD 2 would be a very bad linear discriminant). Remember that LDA makes assumptions about normally distributed classes and equal class covariances (at least the multiclass version; the generalized version by Rao).
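On the specific point of getting only one discriminant back: this follows from the rank of the between-class scatter matrix. A short sketch of the standard argument is given below. The notation is an assumption introduced here, not taken from the post: C is the number of classes, N_c and \mu_c are the size and mean of class c, and \mu is the overall mean.

```latex
S_B = \sum_{c=1}^{C} N_c \,(\mu_c - \mu)(\mu_c - \mu)^{\top},
\qquad
\sum_{c=1}^{C} N_c \,(\mu_c - \mu) = 0
\;\Longrightarrow\;
\operatorname{rank}(S_B) \le C - 1 .
```

With C = 2 classes there is therefore at most one discriminant direction with a non-zero eigenvalue, so a 10-component LDA is not available for this dataset. scikit-learn enforces the same bound: `n_components` cannot exceed min(n_classes - 1, n_features).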

What's the relationship between an SVM and hinge loss?

My colleague and I are trying to wrap our heads around the difference between logistic regression and an SVM. Clearly they are optimizing different objective functions. Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? Or is it more complex than that? How do the support vectors come into play? What about the slack variables? Why can't you have deep SVMs the way you can have a deep neural network with sigmoid activation functions?
I will answer one thing at a time.
Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss?
SVM is simply a linear classifier, optimizing hinge loss with L2 regularization.
Or is it more complex than that?
No, it is "just" that. However, there are different ways of looking at this model that lead to complex, interesting conclusions. In particular, this specific choice of loss function leads to extremely efficient kernelization, which is not true for log loss (logistic regression) or MSE (linear regression). Furthermore, you can show very important theoretical properties, such as those related to Vapnik-Chervonenkis dimension reduction, leading to a smaller chance of overfitting.
Intuitively look at these three common losses:
hinge: max(0, 1-py) (y is a label in {-1, +1}, p is the raw score)
log: -y log p (cross-entropy; here y is a 0/1 label and p the predicted probability)
mse: (p-y)^2
Only the first one has the property that once something is classified correctly (with a margin of at least 1), it has 0 penalty. All the remaining ones still penalize your linear model even if it classifies samples correctly. Why? Because they are more related to regression than classification: they want a perfect prediction, not just a correct one.
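A small numerical sketch of this difference, for a positive example that is already classified correctly. To keep the three losses comparable, the log loss is written here in its margin form log(1 + exp(-py)) with labels in {-1, +1}, which is an assumption about notation rather than something stated in the answer. The scores are hypothetical.

```python
import math

def hinge_loss(score, y):
    """Hinge loss; y in {-1, +1}, score is the raw linear output."""
    return max(0.0, 1.0 - score * y)

def log_loss(score, y):
    """Logistic loss in margin form; y in {-1, +1}."""
    return math.log(1.0 + math.exp(-score * y))

def squared_loss(score, y):
    """Squared error between the raw score and the label."""
    return (score - y) ** 2

y = 1  # true label
for score in (0.5, 1.0, 3.0):  # all on the correct side of the boundary
    print(score, hinge_loss(score, y), log_loss(score, y), squared_loss(score, y))
```

Once the margin py reaches 1, the hinge loss is exactly zero. The log loss keeps shrinking but never reaches zero, and the squared loss starts growing again as the score moves past the label.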
How do the support vectors come into play?
Support vectors are simply samples placed near the decision boundary (loosely speaking). For the linear case this does not change much, but as most of the power of SVMs lies in kernelization, there SVs are extremely important. Once you introduce a kernel, thanks to the hinge loss the SVM solution can be obtained efficiently, and the support vectors are the only samples remembered from the training set, thus building a non-linear decision boundary from a subset of the training data.
What about the slack variables?
This is just another definition of the hinge loss, more useful when you want to kernelize the solution and show convexity.
Why can't you have deep SVMs the way you can have a deep neural network with sigmoid activation functions?
You can; however, as an SVM is not a probabilistic model, its training might be a bit tricky. Furthermore, the whole strength of SVMs comes from efficiency and a global solution, and both would be lost once you create a deep network. However, such models do exist: in particular, an SVM (with squared hinge loss) is nowadays often the choice for the topmost layer of deep networks, so the whole optimization is actually a deep SVM. Adding more layers in between has nothing to do with the SVM or any other cost: those layers are defined completely by their activations. You can, for example, use an RBF activation function, but it has been shown numerous times that this leads to weak models (the features detected are too local).
To sum up:
there are deep SVMs; such a model is simply a typical deep neural network with an SVM layer on top.
there is no such thing as putting an SVM layer "in the middle", as the training criterion is only applied to the output of the network.
using "typical" SVM kernels as activation functions is not popular in deep networks due to their locality (as opposed to the very global relu or sigmoid)
