Non-Reference Loss Function - machine-learning

I was going through this awesome research paper and I have found the term Non-Reference Loss Functions. Can someone help me to understand what it is? Some resource link is more than enough, I have googled this and I have found no clue.
What is this Non-Reference loss function and how they are training a model without paired or unpaired data?
Paper PDF
Any help is appreciated.

Basically, "Non-Reference loss function" is a fancy title for "Unsupervised learning".
The authors of the paper were able to define a loss function (sec. 3.3) that describes how a "good looking image" should look like without using a "clean reference" image: The four loss terms they defined compare the output image Y to the input image I and check that the contrast in Y and it's exposure is better than I, yet boundaries, colorness and spatial consistency remains.
By defining a loss function that does not requires a "ground truth" image allows the authors to train their model on "corrupt" images only - which are much easier to come by.

Related

How to balance your data-set when in the real world there isn't a constant balance

I can't decide how to balance my dataset on "distress situations" since it isn't something that can be measured as "the percentage of rotten apples in a factory".
For now, I've chosen to just use "50%-50%" of distress voice snippets and random none-distress snippets.
I'll be glad for some advice from the community, what are the best practises in this situation? I've chosen the 50-50 approach to avoid statistical biases and I'm using a Sequential (Keras) model.
Try to modify the loss function instead of the dataset if you cannot modify the dataset. But I think the question is not completely formulated.

How do sample weights work in classification models?

What does it mean to provide weights to each sample for
classification? How does a classification algorithm like Logistic regression or SVMs use weights to emphasize certain examples more than others? I would love going into details to unpack how these algorithms leverage sample weights.
If you look at the sklearn documentation for logistic regression, you can see that the fit function has an optional sample_weight parameter which is defined as an array of weights assigned to individual samples.
this option is meant for imbalance dataset. Let's take an example: i've got a lot of datas and some are just noise. But other are really important to me and i'd like my algorithm to consider them a lot more than the other points. So i assigne a weight to it in order to make sure that it will be dealt with properly.
It change the way the loss is calculate. The error (residues) will be multiplie by the weight of the point and thus, the minimum of the objective function will be shifted. I hope it's clear enough. i don't know if you're familiar with the math behind it so i provide here a small introduction to have everything under hand (apologize if this was not needed)
https://perso.telecom-paristech.fr/rgower/pdf/M2_statistique_optimisation/Intro-ML-expanded.pdf
See a good explanation here: https://www.kdnuggets.com/2019/11/machine-learning-what-why-how-weighting.html .

Regression. Optimize median instead of mean for skewed distribution

Let say I do the DNN regression task for some skewed data distribution. Now I am using mean absolute error as loss function.
All typical approaches in machine learning are minimizing mean loss, but for skewed that is unappropriating. It is better from a practical point of view to minimize median loss. I think one way is to penalize big losses with some coefficient. And then mean will be close to the median. But how to calculate that coef for the unknown distribution type? Are there other approaches? What can you to advice?
(I am using tensorflow/keras)
Just use the mean absolute error loss function in keras, instead of the mean squared.
The mean absolute is pretty much equivalent the median, and anyway would be more robust to outliers or skewed data. you should have a look at all of the possible keras losses:
https://keras.io/losses/
and obviously, you can create your own too.
But for most data sets it just empirically turns out that mean square gets you better accuracy. so i would recommend to at least try both methods before settling on the mean absolute one.
If you have skewed error distributions, you can use tfp.stats.percentile as your Keras loss function, with something like:
def loss_fn(y_true, y_pred):
return tfp.stats.percentile(tf.abs(y_true - y_pred), q=50)
model.compile(loss=loss_fn)
It gives gradients, so works with Keras, although isn't as fast as MAE / MSE.
https://www.tensorflow.org/probability/api_docs/python/tfp/stats/percentile
Customizing loss (/objective) functions is tough. Keras does theoretically allow you to do this, though they seem to have removed the documentation specifically describing it in their 2.0 release.
You can check their docs on loss functions for ideas, and then head over to the source code to see what kind of an API you should implement.
However, there are a number of issues filed by people who are having trouble with this, and the fact that they've removed the documentation on it is not inspiring.
Just remember that you have to use Keras own backend to compute your loss function. If you get it working, please write a blog post, or update with an answer here, because this is something quite a few other people have struggled/are struggling with!

Why does one not use IOU for training?

When people try to solve the task of semantic segmentation with CNN's they usually use a softmax-crossentropy loss during training (see Fully conv. - Long). But when it comes to comparing the performance of different approaches measures like intersection-over-union are reported.
My question is why don't people train directly on the measure they want to optimize? Seems odd to me to train on some measure during training, but evaluate on another measure for benchmarks.
I can see that the IOU has problems for training samples, where the class is not present (union=0 and intersection=0 => division zero by zero). But when I can ensure that every sample of my ground truth contains all classes, is there another reason for not using this measure?
Checkout this paper where they come up with a way to make the concept of IoU differentiable. I implemented their solution with amazing results!
It is like asking "why for classification we train log loss and not accuracy?". The reason is really simple - you cannot directly train for most of the metrics, because they are not differentiable wrt. to your parameters (or at least do not produce nice error surface). Log loss (softmax crossentropy) is a valid surrogate for accuracy. Now you are completely right that it is plain wrong to train with something that is not a valid surrogate of metric you are interested in, and the linked paper does not do a good job since for at least a few metrics they are considering - we could easily show good surrogate (like for weighted accuracy all you have to do is weight log loss as well).
Here's another way to think about this in a simple manner.
Remember that it is not sufficient to simply evaluate a metric such as accuracy or IoU while solving a relevant image problem. Evaluating the metric must also help the network learn in which direction the weights must be nudged towards, so that a network can learn effectively over iterations and epochs.
Evaluating this direction is what the earlier comments mean that the errors are differentiable. I suppose that there is nothing about the IoU metrics that the network can use to say: "hey, it's not exactly here, but I have to maybe move my bounding box a little to the left!"
Just a trickle of an explanation, but hope it helps..
I always use mean IOU for training a segmentation model. More exactly, -log(MIOU). Plain -MIOU as a loss function will easily trap your optimizer around 0 because of its narrow range (0,1) and thus its steep surface. By taking its log scale, the loss surface becomes slow and good for training.

Estimating parameters in multivariate classification

Newbie here typesetting my question, so excuse me if this don't work.
I am trying to give a bayesian classifier for a multivariate classification problem where input is assumed to have multivariate normal distribution. I choose to use a discriminant function defined as log(likelihood * prior).
However, from the distribution,
$${f(x \mid\mu,\Sigma) = (2\pi)^{-Nd/2}\det(\Sigma)^{-N/2}exp[(-1/2)(x-\mu)'\Sigma^{-1}(x-\mu)]}$$
i encounter a term -log(det($S_i$)), where $S_i$ is my sample covariance matrix for a specific class i. Since my input actually represents a square image data, my $S_i$ discovers quite some correlation and resulting in det(S_i) being zero. Then my discriminant function all turn Inf, which is disastrous for me.
I know there must be a lot of things go wrong here, anyone willling to help me out?
UPDATE: Anyone can help how to get the formula working?
I do not analyze the concept, as it is not very clear to me what you are trying to accomplish here, and do not know the dataset, but regarding the problem with the covariance matrix:
The most obvious solution for data, where you need a covariance matrix and its determinant, and from numerical reasons it is not feasible is to use some kind of dimensionality reduction technique in order to capture the most informative dimensions and simply discard the rest. One such method is Principal Component Analysis (PCA), which applied to your data and truncated after for example 5-20 dimensions would yield the reduced covariance matrix with non-zero determinant.
PS. It may be a good idea to post this question on Cross Validated
Probably you do not have enough data to infer parameters in a space of dimension d. Typically, the way you would get around this is to take an MAP estimate as opposed to an ML.
For the multivariate normal, this is a normal-inverse-wishart distribution. The MAP estimate adds the matrix parameter of inverse Wishart distribution to the ML covariance matrix estimate and, if chosen correctly, will get rid of the singularity problem.
If you are actually trying to create a classifier for normally distributed data, and not just doing an experiment, then a better way to do this would be with a discriminative method. The decision boundary for a multivariate normal is quadratic, so just use a quadratic kernel in conjunction with an SVM.

Resources