Variational Autoencoders: MSE vs BCE - machine-learning

I'm working with a Variational Autoencoder and I have seen that some people use MSE loss and others use BCE loss. Does anyone know whether one is more correct than the other, and why?
As far as I understand, if you assume that the latent space vector of the VAE follows a Gaussian distribution, you should use MSE Loss. If you assume it follows a multinomial distribution, you should use BCE. Also, BCE is biased towards 0.5.
Could someone clarify this concept for me? I know that it's related to the expectation term of the variational lower bound...
Thank you so much!

In short: maximizing the likelihood of a model whose predictions follow a normal distribution is equivalent to minimizing MSE, and maximizing the likelihood of a model whose predictions follow a Bernoulli distribution (the two-outcome case of the multinomial) is equivalent to minimizing BCE.
Mathematical details:
The real reason you use MSE and cross-entropy loss functions
DeepMind has an excellent lecture on Modern Latent Variable Models (mainly about Variational Autoencoders) that covers everything you need.
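The equivalence stated in this answer can be checked numerically. The following is a minimal sketch, assuming a Gaussian decoder with a fixed standard deviation `sigma` and a Bernoulli decoder over binary targets; the function names and the toy data are illustrative and do not come from the linked material.

```python
import numpy as np

def gaussian_nll(x, mu, sigma=1.0):
    # Negative log-density of N(mu, sigma^2) evaluated at x.
    return 0.5 * np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / (2 * sigma**2)

def bernoulli_nll(x, p):
    # Negative log-probability of a binary x under Bernoulli(p).
    likelihood = np.where(x == 1, p, 1 - p)
    return -np.log(likelihood)

def bce(x, p):
    # Binary cross-entropy as usually written in deep learning libraries.
    return -(x * np.log(p) + (1 - x) * np.log(1 - p))

rng = np.random.default_rng(0)

# Gaussian case: real-valued targets and reconstructions (toy data).
x_real = rng.uniform(0.0, 1.0, size=1000)
x_hat = x_real + 0.1 * rng.standard_normal(1000)
sigma = 1.0
mse = np.mean((x_real - x_hat) ** 2)
const = 0.5 * np.log(2 * np.pi * sigma**2)
nll_from_mse = mse / (2 * sigma**2) + const  # MSE rescaled and shifted

# Bernoulli case: binary targets and predicted probabilities (toy data).
x_bin = rng.integers(0, 2, size=1000)
p_hat = rng.uniform(0.01, 0.99, size=1000)

print(np.mean(gaussian_nll(x_real, x_hat, sigma)), nll_from_mse)
print(np.mean(bernoulli_nll(x_bin, p_hat)), np.mean(bce(x_bin, p_hat)))
```

With `sigma` fixed, the Gaussian negative log-likelihood differs from MSE only by a scale factor and an additive constant, so both have the same minimizer. The Bernoulli negative log-likelihood is BCE term by term.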

Related

Cross-entropy loss influence over F-score

I'm training an FCN (Fully Convolutional Network) and using "Sigmoid Cross Entropy" as a loss function.
My metrics are F-measure and MAE.
The Train/Dev loss vs. iteration graph looks something like the one below:
Although the Dev loss increases slightly after iteration 2200, my metrics on the Dev set keep improving up to around iteration 10000. Is this possible in machine learning at all? If the F-measure has improved, shouldn't the loss also have decreased? How do you explain it?
Every answer would be appreciated.
Short answer: yes, it's possible.
The explanation comes from how the cross-entropy loss differs from the metrics. Generally speaking, loss functions for classification optimize models through their predicted probabilities (e.g. 0.1/0.9), while metrics usually use the predicted labels (0/1).
If the model is highly confident in its probabilities (close to 0 or to 1), a single wrong prediction greatly increases the loss but only slightly decreases the F-measure.
Likewise, in the opposite scenario, a small change in the probabilities of a low-confidence model (e.g. 0.49/0.51) has a small impact on the loss function (from a numerical perspective) but can flip the predicted label, and so has a greater impact on the metrics.
Plotting the distribution of your predictions would help to confirm this hypothesis.
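The effect described in this answer can be reproduced with a small hand-made example. This is a sketch with made-up probabilities (the arrays `early` and `late` are hypothetical stand-ins for two training checkpoints), using plain NumPy implementations of binary cross-entropy and F1 at a 0.5 threshold.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob):
    # Mean negative log-likelihood of the true labels.
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

def f1_score(y_true, y_prob, threshold=0.5):
    # F1 computed on hard labels obtained by thresholding the probabilities.
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return float(2 * tp / (2 * tp + fp + fn))

y_true = np.array([1, 1, 1, 0, 0, 0])

# Hypothetical early checkpoint: hesitant, misses two positives.
early = np.array([0.6, 0.4, 0.4, 0.4, 0.4, 0.4])
# Hypothetical late checkpoint: confident, finds all positives,
# but makes one very confident mistake on the last negative.
late = np.array([0.99, 0.99, 0.99, 0.01, 0.01, 0.999])

loss_early, f1_early = binary_cross_entropy(y_true, early), f1_score(y_true, early)
loss_late, f1_late = binary_cross_entropy(y_true, late), f1_score(y_true, late)

print(f"early: loss={loss_early:.3f}  F1={f1_early:.3f}")
print(f"late:  loss={loss_late:.3f}  F1={f1_late:.3f}")
```

The late checkpoint has both a higher loss and a higher F1: the single confident error dominates the cross-entropy, while the thresholded metric only counts it as one false positive.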

Need help choosing loss function

I have used resnet50 to solve a multi-class classification problem. The model outputs probabilities for each class. Which loss function should I choose for my model?
After choosing binary cross entropy:
After choosing categorical cross entropy:
The above results are for the same model with just different loss functions. This model is supposed to classify images into 26 classes, so categorical cross entropy should work.
Also, in the first case accuracy is about 96% but the loss is very high. Why?
edit 2:
Model architecture:
You definitely need to use categorical_crossentropy for a multi-class classification problem. binary_crossentropy will reduce your problem to a binary classification problem in a way that is unclear without looking into it further.
I would say that the reason you are seeing high accuracy in the first (and to some extent the second) case is that you are overfitting. The first dense layer you are adding contains 8 million parameters (call model.summary() to see this), and you only have 70k images and 8 epochs to train it. This architectural choice is very demanding in both computing power and data. You are also using a very basic optimizer (SGD). Try a more capable optimizer such as Adam.
Finally, I am a bit surprised at your choice of a 'sigmoid' activation function in the output layer. Why not the more classic 'softmax'?
For a multi-class classification problem you use the categorical_crossentropy loss, because what it does is match the ground-truth probability distribution with the one predicted by the model.
This is exactly what is used for multi-class classification; it is a misconception to think you can't use this loss.
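To make "matching the ground-truth distribution with the predicted one" concrete, here is a minimal NumPy sketch of a softmax output followed by categorical cross-entropy for 26 classes. It is not the Keras implementation; the helper names and the toy logits are assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_onehot, probs, eps=1e-12):
    # Mean over samples of -sum_k y_k * log(p_k).
    probs = np.clip(probs, eps, 1.0)
    return float(-np.sum(y_onehot * np.log(probs), axis=1).mean())

n_classes = 26
labels = np.array([0, 5, 25])            # toy integer labels
y_onehot = np.eye(n_classes)[labels]     # one-hot ground-truth distributions

# A model that knows nothing: equal logits, hence uniform probabilities.
uniform_probs = softmax(np.zeros((3, n_classes)))
loss_uniform = categorical_crossentropy(y_onehot, uniform_probs)

# A model that puts a large logit on the correct class.
confident_probs = softmax(20.0 * y_onehot)
loss_confident = categorical_crossentropy(y_onehot, confident_probs)

print(loss_uniform, np.log(n_classes))
print(loss_confident)
```

A uniform prediction over 26 classes gives a loss of log(26), and the loss approaches 0 as the predicted distribution concentrates on the correct class. Softmax guarantees the outputs form one distribution over the classes, which independent sigmoid outputs do not.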

Can linear classification take non binary targets?

I'm following a TensorFlow example that takes a bunch of features (real estate related) and "expensive" (i.e. derived from the house price) as the binary target.
I was wondering if the target could take more than just a 0 or 1. Let's say, 0 (not expensive), 1 (expensive), 3 (very expensive).
I don't think this is possible as the logistic regression model has asymptotes nearing 0 and 1.
This might be a stupid question, but I'm totally new to ML.
I think I found the answer myself. From Wikipedia:
First, the conditional distribution y|x is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0,1) through the logistic distribution function because logistic regression predicts the probability of particular outcomes.
Logistic regression is defined for binary classification tasks (for more details, please see logistic_regression). For multi-class classification problems, you can use the softmax classification algorithm. The following tutorial shows how to write a softmax classifier with the TensorFlow library.
Softmax_Regression in Tensorflow
However, if your data set is not linearly separable (which is the case for most real-world datasets), you have to use an algorithm that can handle nonlinear decision boundaries. Algorithms such as neural networks or SVMs with kernels would be a good choice. The following IPython notebook shows how to create a simple neural network in TensorFlow.
Neural Network in Tensorflow
Good Luck!
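As a rough illustration of the softmax classification mentioned in this answer, here is a minimal NumPy sketch of softmax regression with three target classes (for example 0, 1 and 2 for the three price levels). It is not the TensorFlow tutorial code; the synthetic data, function names, learning rate and iteration count are all assumptions.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-shift for stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y, n_classes, lr=0.1, n_iter=500):
    # Batch gradient descent on the mean cross-entropy loss.
    n = X.shape[0]
    Xb = np.hstack([X, np.ones((n, 1))])      # append a bias column
    Y = np.eye(n_classes)[y]                  # one-hot targets
    W = np.zeros((Xb.shape[1], n_classes))
    for _ in range(n_iter):
        P = softmax(Xb @ W)
        W -= lr * Xb.T @ (P - Y) / n          # gradient of the mean loss
    return W

def predict_proba(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return softmax(Xb @ W)

# Synthetic stand-in for the data: three classes in a 2-D feature space.
rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
y = np.repeat(np.arange(3), 50)
X = centers[y] + 0.5 * rng.standard_normal((150, 2))

W = fit_softmax_regression(X, y, n_classes=3)
probs = predict_proba(W, X)
accuracy = float((probs.argmax(axis=1) == y).mean())
print(accuracy)
```

Each row of `probs` is a distribution over the three classes, so the target is no longer restricted to 0 or 1. With two classes this model reduces to ordinary logistic regression.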

When should one use LinearSVC or SVC?

From my research, I found three conflicting results:
SVC(kernel="linear") is better
LinearSVC is better
Doesn't matter
Can someone explain when to use LinearSVC vs. SVC(kernel="linear")?
It seems like LinearSVC is marginally better than SVC and is usually more finicky. But if scikit decided to spend time on implementing a specific case for linear classification, why wouldn't LinearSVC outperform SVC?
Mathematically, optimizing an SVM is a convex optimization problem, usually with a unique minimizer. This means that there is only one solution to this mathematical optimization problem.
The differences in results come from several aspects: SVC and LinearSVC are supposed to optimize the same problem, but in fact all liblinear estimators penalize the intercept, whereas libsvm ones don't (IIRC). This leads to a different mathematical optimization problem and thus different results. There may also be other subtle differences such as scaling and default loss function (edit: make sure you set loss='hinge' in LinearSVC). Next, in multiclass classification, liblinear does one-vs-rest by default whereas libsvm does one-vs-one.
SGDClassifier(loss='hinge') is different from the other two in that it uses stochastic gradient descent rather than an exact batch solver, so it may not converge to the same solution. However, the solution it finds may generalize better.
Between SVC and LinearSVC, one important decision criterion is that LinearSVC tends to be faster to converge the larger the number of samples is. This is due to the fact that the linear kernel is a special case, which is optimized for in Liblinear, but not in Libsvm.
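The differences listed in this answer can be seen by fitting both estimators on the same data. This is a small sketch on synthetic blobs (the data and the value of `max_iter` are arbitrary choices), with `loss="hinge"` set on LinearSVC as suggested above; even then the two are not expected to match exactly, because LinearSVC also regularizes the intercept.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC

# Two well-separated synthetic clusters (toy data, not from the question).
X, y = make_blobs(n_samples=200, centers=[[-2.0, -2.0], [2.0, 2.0]],
                  cluster_std=1.0, random_state=0)

# libsvm-based linear SVM: hinge loss, intercept not penalized.
svc = SVC(kernel="linear", C=1.0).fit(X, y)

# liblinear-based linear SVM: the default loss is squared hinge,
# so hinge is requested explicitly (this requires the dual formulation).
lin = LinearSVC(loss="hinge", dual=True, C=1.0, max_iter=100000).fit(X, y)

print("SVC       coef:", svc.coef_, "intercept:", svc.intercept_)
print("LinearSVC coef:", lin.coef_, "intercept:", lin.intercept_)

svc_acc = svc.score(X, y)
lin_acc = lin.score(X, y)
print("train accuracy:", svc_acc, lin_acc)
```

On easy data like this the two decision boundaries are usually close but not identical. Raising `intercept_scaling` on LinearSVC reduces the effect of the intercept penalty.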
The actual problem lies in the scikit-learn approach, which calls something an SVM that is not an SVM. By default, LinearSVC minimizes squared hinge loss instead of plain hinge loss; furthermore, it penalizes the size of the bias (which is not part of the SVM formulation). For more details, refer to this other question:
Under what parameters are SVC and LinearSVC in scikit-learn equivalent?
So which one to use? It is purely problem-specific. By the no free lunch theorem, it is impossible to say "this loss function is best, period". Sometimes squared hinge loss will work better, sometimes plain hinge.

SGD model "overconfidence"

I'm working on a binary classification problem using Apache Mahout. The algorithm I use is OnlineLogisticRegression, and the model I currently have strongly tends to produce predictions that are either 1 or 0, without any middle values.
Please suggest a way to tune or tweak the algorithm to make it produce more intermediate values in predictions.
Thanks in advance!
What is the test error rate of the classifier? If it's near zero then being confident is a feature, not a bug.
If the test error rate is high (or at least not low), then the classifier might be overfitting the training set: measure the difference between the training error and the test error. In that case, increasing regularization as rrenaud suggested might help.
If your classifier is not overfitting, then there might be an issue with the probability calibration. Logistic Regression models (e.g. using the logit link function) should yield good enough probability calibrations (if the problem is approximately linearly separable and the label not too noisy). You can check the calibration of the probabilities with a plot as explained in this paper. If this is really a calibration issue, then implementing a custom calibration based on Platt scaling or isotonic regression might help fix the issue.
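A calibration check of the kind described here can be sketched without any plotting: bin the predicted probabilities and compare the mean prediction in each bin with the observed fraction of positives. The code below is written in Python with NumPy rather than Mahout's Java API, and the simulated "overconfident" model is an assumption made purely for illustration.

```python
import numpy as np

def calibration_gap(y_true, y_prob, n_bins=10):
    # Weighted mean of |mean predicted probability - observed positive rate|
    # over equal-width probability bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(y_prob, edges[1:-1])   # bin index in 0..n_bins-1
    gap = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap += mask.sum() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return float(gap / len(y_true))

rng = np.random.default_rng(0)
true_prob = rng.uniform(0.05, 0.95, size=20000)
y = (rng.uniform(size=20000) < true_prob).astype(float)

# Simulated overconfident model: same ranking, but logits scaled by 5,
# which pushes predictions towards 0 and 1.
logit = np.log(true_prob / (1 - true_prob))
overconfident = 1 / (1 + np.exp(-5 * logit))

gap_calibrated = calibration_gap(y, true_prob)
gap_overconfident = calibration_gap(y, overconfident)
print(gap_calibrated, gap_overconfident)
```

A large gap for a model whose test error is otherwise acceptable points to a calibration problem, which is the case where Platt scaling or isotonic regression applies.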
From reading the Mahout AbstractOnlineLogisticRegression docs, it looks like you can control the regularization parameter lambda. Increasing lambda should mean your weights are closer to 0, and hence your predictions are more hedged.
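The effect of the regularization strength on the predictions can be illustrated with a small logistic regression written in NumPy. This is not Mahout's implementation or API: the L2 penalty, the name `lam`, the learning rate and the synthetic data are all assumptions chosen for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lam, lr=0.1, n_iter=2000):
    # Gradient descent on mean log-loss + (lam / 2) * ||w||^2.
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y) + lam * w
        w -= lr * grad
    return w

# Linearly separable toy data: an unregularized model becomes very confident.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w_weak = fit_logreg(X, y, lam=0.0)     # no regularization
w_strong = fit_logreg(X, y, lam=1.0)   # strong regularization

# Average distance of the predictions from the undecided value 0.5.
conf_weak = float(np.mean(np.abs(sigmoid(X @ w_weak) - 0.5)))
conf_strong = float(np.mean(np.abs(sigmoid(X @ w_strong) - 0.5)))

print(np.linalg.norm(w_weak), conf_weak)
print(np.linalg.norm(w_strong), conf_strong)
```

The regularized model has a smaller weight norm and predictions that sit closer to 0.5, which is the hedging behaviour described above.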
