What is the difference between Keras model.evaluate() and model.predict()?

I used Keras for biomedical image segmentation to segment brain neurons. When I used model.evaluate(), it gave me a Dice coefficient of 0.916. However, when I used model.predict() and then looped through the predicted images, calculating the Dice coefficient for each, the Dice coefficient was 0.82. Why are these two values different?

The model.evaluate function predicts the output for the given input, then computes the metric functions specified in model.compile from y_true and y_pred, and returns the computed metric values as the output.
The model.predict function just returns y_pred.
So if you use model.predict and then compute the metrics yourself, the computed metric value should turn out to be the same as the one from model.evaluate, at least for metrics that are plain averages over samples.
For example, one would use model.predict instead of model.evaluate when evaluating RNN/LSTM-based models, where the output needs to be fed back as input at the next time step.

The problem lies in the fact that every metric in Keras is evaluated in the following manner:
For each batch, a metric value is computed.
The current value of the metric (after k batches) is equal to the mean of the metric across those k batches.
The final result is obtained as the mean of the values computed for all batches.
Most popular metrics (like mse, categorical_crossentropy, mae, etc.), being a mean of the loss value over examples, have the property that this kind of evaluation ends up with the proper result. But in the case of the Dice coefficient, the mean of its value across all batches is not equal to the actual value computed on the whole dataset. Since model.evaluate() computes it this way, this is the direct cause of your problem.
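To make the point above concrete, here is a minimal NumPy sketch (not Keras code). The dice helper, the toy masks, and the batch size are all made up for illustration; the example only shows that averaging Dice per batch, per image, and over the whole dataset can give three different numbers on the same predictions.

```python
import numpy as np

def dice(y_true, y_pred, eps=1e-7):
    """Dice coefficient pooled over every element of the given arrays."""
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)

# Toy binary masks: 4 "images" of 4 pixels each (hypothetical data).
y_true = np.array([[1, 1, 1, 1],
                   [1, 0, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 0, 1]], dtype=float)
y_pred = np.array([[1, 1, 1, 1],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 0]], dtype=float)

batch_size = 2  # assumed batch size, chosen only for the illustration

# Mean of per-batch Dice values (the batch-wise averaging described above).
batch_scores = [dice(y_true[i:i + batch_size], y_pred[i:i + batch_size])
                for i in range(0, len(y_true), batch_size)]
mean_of_batches = float(np.mean(batch_scores))

# Mean of per-image Dice values (looping over predictions one by one).
mean_of_images = float(np.mean([dice(t, p) for t, p in zip(y_true, y_pred)]))

# Dice computed once over the whole dataset.
whole_dataset = float(dice(y_true, y_pred))

print(mean_of_batches, mean_of_images, whole_dataset)
```

On this toy data the three values are roughly 0.73, 0.50 and 0.75, so the way the metric is aggregated matters as much as the predictions themselves.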

The model.evaluate() function gives you the loss value (and any metrics passed to model.compile), computed batch by batch and averaged. The model.predict() function gives you the actual predictions for all samples in a batch, for all batches. So even if you use the same data, the two will differ, because the value of a loss function is almost always different from the predicted values. These are two different things.

It is about regularization. model.predict() returns the final output of the model, i.e. the answer, while model.evaluate() returns the loss. The loss is used to train the model (via backpropagation); it is not the answer.
This video of ML Tokyo should help to understand the difference between model.evaluate() and model.predict().

Related

Is the loss function='Multiclass' in catboost same as log loss if I am doing a multiclassification problem?

I am making a multiclass prediction model using catboost. The final solution should have minimum Logloss error, but Logloss is not present in catboost; they have something called 'Multiclass' as the loss function. Are they both the same? If not, how can I measure the accuracy of the catboost model in terms of Logloss?
Are they both the same? Effectively, yes...
The catboost documentation describes the calculation of the 'MultiClass' loss as what is generally considered Multinomial/Multiclass Cross Entropy Loss. That is, effectively, a Log Softmax applied to the classifier output 'a' to produce values that can be interpreted as probabilities, followed by the Negative Log Likelihood Loss (NLLLoss); see wiki1 & wiki2.
Their documentation also describes the calculation of 'LogLoss', which again is NLLLoss, but applied to 'p', which they describe as the result of applying the sigmoid function to the classifier output. Since the NLLLoss is reworked for the binary problem, only a single class probability is calculated, using 'p' and '1-p' for the two classes. In this special (binary) case, the use of sigmoid and softmax is equivalent.
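The binary equivalence mentioned above can be checked numerically. This is a plain NumPy sketch, not catboost code: the function names and the raw scores are hypothetical, and the losses are written in the usual "mean negative log likelihood" form, which may differ in sign or weighting from what a given library reports.

```python
import numpy as np

def softmax(a):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = a - np.max(a, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def multiclass_log_loss(a, labels):
    """Mean negative log likelihood of the true class after softmax."""
    p = softmax(a)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels])))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_log_loss(a, labels):
    """Mean negative log likelihood using p for class 1 and 1 - p for class 0."""
    p = sigmoid(a)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

# Hypothetical raw classifier outputs and binary labels.
raw = np.array([2.0, -1.0, 0.5])
labels = np.array([1, 0, 1])

# A two-class softmax with scores [0, a] gives the same probability as sigmoid(a).
two_class_scores = np.stack([np.zeros_like(raw), raw], axis=1)

loss_softmax = multiclass_log_loss(two_class_scores, labels)
loss_sigmoid = binary_log_loss(raw, labels)
print(loss_softmax, loss_sigmoid)
```

Both printed values agree up to floating-point error, which is the sense in which the binary log loss is a special case of the multiclass cross entropy.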
How can I measure the catboost model in terms of Logloss?
They describe a method to produce desired metrics on given data.
Be careful not to confuse the loss/objective function 'loss_function' with the evaluation metric 'eval_metric'. In this instance, however, the same function can be used for both, as listed in their supported metrics.
Hope this helps!
Log loss is not a loss function but a metric to measure the performance of a classification model where the prediction is a probability value between 0 and 1.
Learn more here.

Magnitude of Sample Weights in Keras

Keras model.fit supports per-sample weights. What is the range of acceptable values for these weights? Must they sum to 1 across all training samples? Or does keras accept any weight values and then perform some sort of normalization? The keras source includes, e.g. training_utils.standardize_weights but that does not appear to be doing statistical standardization.
After looking at the source here, I've found that you should be able to pass any acceptable numerical values (within overflow bounds) for both sample weights and class weights. They do not need to sum to 1 across all training samples and each weight may be greater than one. The only sort of normalization that appears to be happening is taking the max of 2D class weight inputs.
If both class weights and sample weights are provided, the product of the two is used.
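As a rough illustration of the product described above, the following NumPy sketch combines a class-weight mapping with per-sample weights. It is not the Keras implementation; the losses, labels, and weights are made up, and how Keras reduces the weighted losses to a single number is deliberately left out because it is not covered here.

```python
import numpy as np

# Hypothetical per-sample losses and integer class labels.
per_sample_loss = np.array([0.2, 1.0, 0.5, 0.1])
labels = np.array([0, 1, 1, 0])

# Hypothetical weights: neither set sums to 1, and some exceed 1.
class_weight = {0: 1.0, 1: 5.0}
sample_weight = np.array([1.0, 2.0, 0.5, 3.0])

# Look up the class weight of each sample, then multiply by its sample weight.
per_sample_class_weight = np.array([class_weight[int(c)] for c in labels])
combined_weight = sample_weight * per_sample_class_weight

# Each sample's loss is scaled by its combined weight.
weighted_loss = per_sample_loss * combined_weight
print(combined_weight, weighted_loss)
```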
I think the unspoken component here is that the activation function should be dealing with normalization.

Accuracy of Neural Networks in case of doing prediction of a continuous variable

Is there a way to calculate Accuracy instead of Error metrics for neural networks when doing regression (prediction of continuous variable) the same way we do when classifying categorical variables?
Although the concept of accuracy belongs to classification, you can print the predicted values and compare them with the dependent variable.
The problem with a continuous variable is that the probability of reproducing a given value exactly is (practically) zero. For instance, if your neural network produces 2.000001 and the actual value is 2, this would count as a wrong prediction because the two values differ (although they are very close). An error metric like the root mean square error therefore measures the average (squared) difference instead.
However, depending on your application, you could introduce a threshold value ϵ, consider a given output of your neural network correct if the absolute value of the difference between the observed value and the output is smaller than ϵ, and compute the percentage of correct predictions.
In practice such a metric is not minimized directly, because it is piecewise constant and so gives no useful gradient, but it is still a useful quantity to compute.
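The threshold idea can be sketched in a few lines of NumPy. The function name tolerance_accuracy, the sample values, and the choice ϵ = 0.1 are assumptions made only for this example.

```python
import numpy as np

def tolerance_accuracy(y_true, y_pred, eps):
    """Fraction of predictions whose absolute error is smaller than eps."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) < eps))

# Hypothetical observed values and network outputs.
y_true = [2.0, 3.5, 10.0, 7.0]
y_pred = [2.000001, 3.0, 10.05, 9.0]

# Exact-match "accuracy" is useless for continuous targets.
exact = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# With a tolerance, close predictions count as correct.
within_eps = tolerance_accuracy(y_true, y_pred, eps=0.1)
print(exact, within_eps)
```

Here the exact-match score is 0.0 while two of the four predictions fall within the tolerance, giving 0.5. The right ϵ depends entirely on the application.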

Machine learning multi-classification: Why use 'one-hot' encoding instead of a number

I'm currently working on a classification problem with tensorflow, and I'm new to the world of machine learning, but there is something I don't get.
I have successfully tried to train models that output the y tensor like this:
y = [0,0,1,0]
But I can't understand the principle behind it...
Why not just train the same model to output classes such as y = 3 or y = 4?
This seems much more flexible, because I can imagine having a multi-classification problem with 2 million possible classes, and it would be much more efficient to output a number between 0-2,000,000 than to output a tensor of 2,000,000 items for every result.
What am I missing?
Ideally, you could train your model to classify input instances and produce a single output. Something like
y=1 means input=dog, y=2 means input=airplane. An approach like that, however, brings a lot of problems:
How do I interpret the output y=1.5?
Why am I trying to regress a number as if I were working with continuous data when, in reality, I'm working with discrete data?
In fact, what you are doing is treating a multi-class classification problem like a regression problem.
This is logically wrong (unless you're doing binary classification; in that case, a positive and a negative output are all you need).
To avoid these (and other) issues, we use a final layer of neurons and associate a high activation with the right class.
The one-hot encoding represents the fact that you want to force your network to have a single high-activation output when a certain input is present.
Thus, every input=dog will have 1, 0, 0 as output, and so on.
In this way, you're correctly treating a discrete classification problem, producing a discrete and easily interpretable output. In fact, you'll always extract the output neuron with the highest activation using tf.argmax, so even if your network hasn't learned to produce a perfect one-hot encoding, you'll still be able to extract the most likely correct output.
The answer is in how that final tensor, or single value, is calculated. In an NN, your y=3 would be built by a weighted sum over the values of the previous layer.
Trying to train towards single values would then imply a linear relationship between the category IDs where none exists: for the true value y=4, the output y=3 would be considered better than y=1 even though the categories are arbitrary, and may be 1: dogs, 3: cars, 4: cats.
Neural networks use gradient descent to optimize a loss function. In turn, this loss function needs to be differentiable.
A discrete output would be (indeed is) a perfectly valid and valuable output for a classification network. Problem is, we don't know how to optimize this net efficiently.
Instead, we rely on a continuous loss function. This loss function is usually based on something that is more or less related to the probability of each label -- and for this, you need a network output that has one value per label.
Typically, the output that you describe is then deduced from this soft, continuous output by taking the argmax of these pseudo-probabilities.
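A small NumPy sketch of that last step, using np.argmax as a stand-in for tf.argmax. The class names and the probability rows are invented for the example.

```python
import numpy as np

class_names = ["dog", "airplane", "cat"]  # hypothetical label set

# Hypothetical network outputs: one pseudo-probability per class, per input.
probs = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.30, 0.60],
                  [0.25, 0.50, 0.25]])

# The discrete class ID is recovered by taking the most activated output.
pred_ids = np.argmax(probs, axis=1)
pred_names = [class_names[i] for i in pred_ids]

# One-hot targets make a continuous, differentiable loss possible:
# cross entropy only looks at the probability assigned to the true class.
true_ids = np.array([0, 2, 1])
one_hot = np.eye(len(class_names))[true_ids]
cross_entropy = float(-np.mean(np.sum(one_hot * np.log(probs), axis=1)))

print(pred_ids, pred_names, cross_entropy)
```

None of the rows is a perfect one-hot vector, yet the argmax still yields an unambiguous class for each input.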

What is a loss function in simple words?

Can anyone please explain in simple words and possibly with some examples what is a loss function in the field of machine learning/neural networks?
This came out while I was following a Tensorflow tutorial:
https://www.tensorflow.org/get_started/get_started
It describes how far off the result your network produced is from the expected result. It indicates the magnitude of the error your model made on its prediction.
You can then take that error and 'backpropagate' it through your model, adjusting its weights and making it get closer to the truth the next time around.
The loss function is how you're penalizing your output.
The following example is for a supervised setting, i.e. when you know what the correct result should be, although loss functions can be applied even in unsupervised settings.
Suppose you have a model that always predicts 1. Just the scalar value 1.
You can have many loss functions applied to this model. L2 is the (squared) Euclidean distance.
If I pass in some value, say 2, and I want my model to learn the x**2 function, then the result should be 4 (because 2*2 = 4). If we apply the L2 loss, then it's computed as ||4 - 1||^2 = 9.
We can also make up our own loss function. We can say the loss function is always 10. So no matter what our model outputs the loss will be constant.
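The two examples above can be written out directly. The function names here are made up for illustration; the numbers are the ones from the text.

```python
def model(x):
    """The toy model from the text: it always predicts the scalar 1."""
    return 1.0

def l2_loss(y_true, y_pred):
    """Squared difference between target and prediction."""
    return (y_true - y_pred) ** 2

def constant_loss(y_true, y_pred):
    """A made-up loss that ignores the prediction entirely."""
    return 10.0

x = 2.0
target = x ** 2  # we want the model to learn x**2, so the target is 4

loss = l2_loss(target, model(x))
useless = constant_loss(target, model(x))
print(loss, useless)
```

The L2 loss comes out as 9.0 and would shrink as the prediction approaches 4, whereas the constant loss stays at 10.0 whatever the model does, so it carries no signal to learn from.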
Why do we care about loss functions? Well, they determine how poorly the model did, and in the context of backpropagation and neural networks, they also determine the gradients from the final layer to be propagated so the model can learn.
As other comments have suggested I think you should start with basic material. Here's a good link to start off with http://neuralnetworksanddeeplearning.com/
It is worth noting that we can speak of different kinds of loss functions:
Regression loss functions and classification loss functions.
A regression loss function describes the difference between the values that a model is predicting and the actual values of the labels.
So the loss function has meaning on labeled data, when we compare the prediction to the label at a single point in time.
This loss function is often called the error function or the error formula.
Typical error functions we use for regression models are L1, L2, Huber loss, quantile loss, and log-cosh loss.
Note: L1 loss is also known as Mean Absolute Error. L2 loss is also known as Mean Square Error or Quadratic loss.
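For comparison, here is a minimal NumPy sketch of three of the regression losses named above. The sample values and the Huber threshold delta = 1.0 are assumptions chosen for the example, and the Huber loss is written in its common textbook form.

```python
import numpy as np

def mae(y_true, y_pred):
    """L1 loss / Mean Absolute Error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true, y_pred):
    """L2 loss / Mean Square Error."""
    return float(np.mean((y_true - y_pred) ** 2))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for errors larger than delta."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

# Hypothetical labels and predictions; the last prediction is an outlier.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 8.0])

print(mae(y_true, y_pred), mse(y_true, y_pred), huber(y_true, y_pred))
```

On this data the outlier dominates the squared error far more than the absolute or Huber error, which is the usual reason to pick one over the other.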
Loss functions for classification represent the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to).
To name a few: log loss, focal loss, exponential loss, hinge loss, relative entropy loss, and others.
Note: While more commonly used in regression, the square loss function can be re-written and utilized for classification.
