Is Kullback-Leibler divergence already implemented in TensorFlow? - machine-learning

I am working with TensorFlow and using neural networks to solve a multi-label classification problem. I was using softmax cross entropy as my loss function:
# Softmax cross-entropy loss
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
Now I think I should use a KL divergence loss function instead, but I couldn't find one in TensorFlow. Can anybody help me use a KL divergence loss function instead of the softmax loss?

Here you go:
tf.contrib.distributions.kl(distribution_1, distribution_2)
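Note that tf.contrib.distributions.kl takes two distribution objects rather than raw tensors. If you just want the KL divergence between your target distribution y and the softmax of your logits as a loss, here is a minimal TF 1.x-style sketch (the epsilon and variable names are mine, not from the question):
# Sketch: KL(y || softmax(pred)), assuming each row of y is a valid
# probability distribution (one-hot or soft labels).
probs = tf.nn.softmax(pred)
eps = 1e-8  # avoid log(0)
kl_per_example = tf.reduce_sum(y * tf.log((y + eps) / (probs + eps)), axis=-1)
cost = tf.reduce_mean(kl_per_example)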

Related

Variational Autoencoders: MSE vs BCE

I'm working with a Variational Autoencoder and I have seen that some people use MSE loss and some people use BCE loss. Does anyone know if one is more correct than the other, and why?
As far as I understand, if you assume that the latent space vector of the VAE follows a Gaussian distribution, you should use MSE Loss. If you assume it follows a multinomial distribution, you should use BCE. Also, BCE is biased towards 0.5.
Could someone clarify this concept for me? I know that it's related to the expectation term in the variational lower bound...
Thank you so much!
In short: maximizing the likelihood of a model whose predictions follow a Gaussian (Bernoulli) distribution is equivalent to minimizing MSE (BCE).
Mathematical details:
The real reason you use MSE and cross-entropy loss functions
DeepMind has an excellent lecture on Modern Latent Variable Models (mainly about Variational Autoencoders); you can find everything you need there.
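To make the equivalence concrete, here is a sketch of the standard derivation, with \hat{x}(z) denoting the decoder output for input x:
Gaussian decoder with fixed variance \sigma^2:
  -\log p(x \mid z) = \frac{1}{2\sigma^2} \lVert x - \hat{x}(z) \rVert^2 + \text{const}
so maximizing the likelihood is the same as minimizing the MSE.
Bernoulli decoder with outputs \hat{x}_i \in (0, 1):
  -\log p(x \mid z) = -\sum_i \left[ x_i \log \hat{x}_i + (1 - x_i) \log (1 - \hat{x}_i) \right]
which is exactly the BCE.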

What is pixel-wise softmax loss?

What is pixel-wise softmax loss? In my understanding it's just a cross-entropy loss, but I didn't find the formula. Can someone help me? It would be great to have the PyTorch code.
You can read all about it here (there's also a link to the source code there).
As you already observed, the "softmax loss" is basically a cross-entropy loss whose computation combines the softmax function and the loss for numerical stability and efficiency.
In your example the loss is computed for a pixel-wise prediction, so you have a per-pixel prediction, a per-pixel target and a per-pixel loss term.
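As a minimal PyTorch sketch (the tensor shapes are made up for illustration): nn.CrossEntropyLoss already handles the pixel-wise case when the logits have shape (N, C, H, W) and the target has shape (N, H, W).
import torch
import torch.nn as nn

logits = torch.randn(2, 3, 4, 4)           # per-pixel class scores: (N, C, H, W)
target = torch.randint(0, 3, (2, 4, 4))    # per-pixel class labels: (N, H, W)

# Applies log-softmax over the class dimension at every pixel and
# averages the per-pixel negative log-likelihoods.
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)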

How can I determine the "loss function" for MLPClassifier in scikit-learn?

I want to use MLPClassifier from scikit-learn:
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4,
                    solver='sgd', verbose=10, tol=1e-4, random_state=1,
                    learning_rate_init=.1)
I didn't find any parameter for the loss function; I want it to be mean_squared_error. Is it possible to set it for this model?
According to the docs:
This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
Log-loss is basically the same as cross-entropy.
There is no way to pass another loss function to MLPClassifier, so you cannot use MSE. But MLPRegressor uses MSE, if you really want that.
However, the general advice is to stick to cross-entropy loss for classification; it is said to have some advantages over MSE. So you may just want to use MLPClassifier as is for your classification problem.
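As a sketch of the two options (the synthetic dataset here is only for illustration): MLPClassifier always minimizes log-loss, while MLPRegressor minimizes squared error.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier, MLPRegressor

X, y = make_classification(n_samples=200, n_features=20, random_state=1)

# Cross-entropy (log-loss) is built in; there is no loss parameter to change.
clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4,
                    solver='sgd', tol=1e-4, random_state=1,
                    learning_rate_init=.1)
clf.fit(X, y)

# MLPRegressor minimizes squared error; treating the labels as real-valued
# targets is the closest you can get to "MSE training" with these estimators.
reg = MLPRegressor(hidden_layer_sizes=(50,), max_iter=10, random_state=1)
reg.fit(X, y.astype(float))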

Tensorflow: Output probabilities from sigmoid cross entropy loss

I have a CNN for a multi-label classification problem, and as a loss function I use tf.nn.sigmoid_cross_entropy_with_logits.
From the cross-entropy equation I would expect the output to be probabilities for each class, but instead I get floats in (-∞, ∞).
After some googling I found that, due to an internal normalizing operation, each row of logits is interpretable as a probability before being fed to the equation.
I'm confused about how I can actually output the posterior probabilities instead of floats in order to draw a ROC curve.
tf.sigmoid(logits) gives you the probabilities.
You can see in the documentation of tf.nn.sigmoid_cross_entropy_with_logits that tf.sigmoid is the function that normalizes the logits to probabilities.
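A minimal sketch (the example logits and labels are made up): keep feeding the raw logits to the loss, and apply the sigmoid separately whenever you need probabilities, e.g. for the ROC curve.
import tensorflow as tf

logits = tf.constant([[ 2.0, -1.0,  0.5],
                      [ 0.0,  3.0, -2.0]])
labels = tf.constant([[1., 0., 1.],
                      [0., 1., 0.]])

# The loss expects raw logits; it applies the sigmoid internally.
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))

# Per-label probabilities in (0, 1) for ROC curves / thresholding.
probs = tf.sigmoid(logits)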

Ada-Delta method doesn't converge when used in Denoising AutoEncoder with MSE loss & ReLU activation?

I just implemented AdaDelta (http://arxiv.org/abs/1212.5701) for my own Deep Neural Network Library.
The paper more or less claims that SGD with AdaDelta is not sensitive to hyperparameters and that it always converges to somewhere good (at least, the final reconstruction loss of AdaDelta-SGD is comparable to that of a well-tuned Momentum method).
When I used AdaDelta-SGD as the learning method in a Denoising AutoEncoder, it did converge in some specific settings, but not always.
When I used MSE as the loss function and sigmoid as the activation function, it converged very quickly, and after 100 epochs the final reconstruction loss was better than plain SGD, SGD with Momentum, and AdaGrad.
But when I used ReLU as the activation function, it didn't converge; it stayed stuck (oscillating) at a high (bad) reconstruction loss, just like plain SGD with a very high learning rate.
The reconstruction loss it got stuck at was about 10 to 20 times higher than the final reconstruction loss reached with the Momentum method.
I really don't understand why this happened, since the paper suggests AdaDelta just works.
Please let me know the reason behind this phenomenon and how I can avoid it.
The activation of a ReLU is unbounded, which makes its use in autoencoders difficult, since your training vectors likely do not have arbitrarily large and unbounded responses! ReLU simply isn't a good fit for that type of network.
You can force a ReLU into an autoencoder by applying some transformation to the output layer, as is done here. However, they don't discuss the quality of the results as an autoencoder, only as a pre-training method for classification. So it's not clear that it's a worthwhile endeavor for building an autoencoder either.
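One common workaround along those lines, as a minimal Keras sketch (not the setup from the question; the layer sizes and noise level are made up): keep ReLU in the hidden layers but use a bounded output activation such as sigmoid, so the MSE reconstruction target stays in a fixed range.
import tensorflow as tf
from tensorflow.keras import layers, models

# Assumes the inputs are scaled to [0, 1] so a sigmoid output can match them.
inputs = tf.keras.Input(shape=(784,))
noisy = layers.GaussianNoise(0.2)(inputs)              # denoising corruption
h = layers.Dense(256, activation='relu')(noisy)
code = layers.Dense(64, activation='relu')(h)
h = layers.Dense(256, activation='relu')(code)
outputs = layers.Dense(784, activation='sigmoid')(h)   # bounded reconstruction

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer=tf.keras.optimizers.Adadelta(), loss='mse')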
