I'm working through this notebook -- https://github.com/aamini/introtodeeplearning/blob/master/lab1/solutions/Part2_Music_Generation_Solution.ipynb -- where we are using an embedding layer, LSTM, and final dense layer w/ softmax to generate music.
I'm a little confused, however, on how we're calculating loss; it is my understanding that in this notebook (in compute_loss()), in any given batch, we are comparing expected labels (which are the notes themselves) to the logits (i.e. predictions from the dense layer). However, aren't these predictions supposed to be a probability distribution? When are we actually selecting the label that we are predicting against?
A little more clarification on my question: if the shape of our labels is (batch_size, # of time steps), and the shape of our logits is (batch_size, # of time steps, vocab_size), at what point in the compute_loss() function are we actually selecting a label for each time step?
The short answer is that the Keras loss function sparse_categorical_crossentropy() does everything you need.
At each timestep of the LSTM model, the top dense layer and softmax function inside that loss function together generate a probability distribution over the model's vocabulary, which in this case are musical notes. Suppose the vocabulary comprises the notes A, B, C, D. Then one possible probability distribution generated is: [0.01, 0.70, 0.28, 0.01], meaning that the model is putting a lot of probability on note B (index 1), like so:
Label: A B C D
---- ---- ---- ---- ----
Index: 0 1 2 3
---- ---- ---- ---- ----
Prob: 0.01 0.70 0.28 0.01
Suppose the true note should be C, which is represented by the number 2, since it is at index 2 in the distribution array (with indexing starting at 0). To measure the difference between the predicted distribution and the true label, use the sparse_categorical_crossentropy() function, which produces a floating-point number representing the loss.
More information can be found on this TensorFlow documentation page. On that page, they have the example:
y_true = [1, 2]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
You can see in that example there is a batch of two instances. For the first instance, the true label is 1 and the predicted distribution is [0.05, 0.95, 0], and for the second instance, the true label is 2 while the predicted distribution is [0.1, 0.8, 0.1].
This function is used in your Jupyter Notebook in section 2.5:
To train our model on this classification task, we can use a form of the crossentropy loss (negative log likelihood loss). Specifically, we will use the sparse_categorical_crossentropy loss, as it utilizes integer targets for categorical classification tasks. We will want to compute the loss using the true targets -- the labels -- and the predicted targets -- the logits.
So to answer your questions directly:
it is my understanding that in this notebook (in compute_loss()), in any given batch, we are comparing expected labels (which are the notes themselves) to the logits (i.e. predictions from the dense layer).
Yes, your understanding is correct.
However, aren't these predictions supposed to be a probability distribution?
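Yes, once the softmax is applied. The logits themselves are unnormalized scores; the softmax inside the loss function turns them into a probability distribution.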
When are we actually selecting the label that we are predicting against?
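It is done inside the sparse_categorical_crossentropy() function. If your distribution is [0.05, 0.95, 0], then that implicitly means that the model is predicting 0.05 probability for index 0, 0.95 probability for index 1, and 0.0 probability for index 2. The function uses the integer label as an index into that distribution and takes the negative log of the probability it finds there.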
A little more clarification on my question: if the shape of our labels is (batch_size, # of time steps), and the shape of our logits is (batch_size, # of time steps, vocab_size), at what point in the compute_loss() function are we actually selecting a label for each time step?
It's inside that function.
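To make "inside that function" concrete, here is a minimal NumPy sketch of what a sparse categorical cross-entropy computes when it is given logits. It is not the Keras implementation; the name sparse_cce_from_logits and the toy shapes are assumptions made for illustration. The point is that each integer label is used as an index into the softmax output for its own time step.

```python
import numpy as np

def sparse_cce_from_logits(labels, logits):
    """Per-position loss for integer labels (batch, time) and logits (batch, time, vocab)."""
    # Log-softmax over the vocabulary axis (shifted by the max for numerical stability).
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # The "selection" step: every label indexes into its own distribution.
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)  # shape (batch, time)

# The A/B/C/D example from above: the true note is C (index 2).
probs = np.array([[[0.01, 0.70, 0.28, 0.01]]])  # (batch=1, time=1, vocab=4)
labels = np.array([[2]])                        # (batch=1, time=1)

# log(probs) serves as logits here, because softmax(log(p)) == p when p sums to 1.
loss = sparse_cce_from_logits(labels, np.log(probs))
print(loss)         # per-time-step loss, i.e. -ln(0.28) for this single step
print(loss.mean())  # scalar that would be minimized during training
```

The output keeps the (batch_size, time steps) shape of the labels; averaging it gives the single number used for training.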
I am new to the deep learning field, and I am trying to use the log-likelihood method to compare against the MSE metric. Could anyone show how to calculate it for the following 2 predicted output examples, with 3 output neurons each? Thanks
yt = [ [1,0,0],[0,0,1]]
yp = [ [0.9, 0.2,0.2], [0.2,0.8,0.3] ]
MSE or Mean Squared Error is simply the expected value of the squared difference between the predicted and the ground truth labels, represented as
\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]
where \theta is the ground-truth labels and \hat{\theta} is the predicted labels.
I am not sure what exactly you are referring to: a theoretical question or a piece of code.
As a Python implementation
import numpy as np

def mean_squared_error(A, B):
    # Mean of the element-wise squared differences over all entries
    return np.square(np.subtract(A, B)).mean()

yt = [[1, 0, 0], [0, 0, 1]]
yp = [[0.9, 0.2, 0.2], [0.2, 0.8, 0.3]]
mse = mean_squared_error(yt, yp)
print(mse)
This will give a value of 0.21
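To see where 0.21 comes from, the calculation can be broken down per element and per sample. The sketch below uses only the yt and yp values from the question; the variable names are my own. The squared errors sum to 0.09 for the first sample and 1.17 for the second, and averaging all six values gives 1.26 / 6 = 0.21.

```python
import numpy as np

yt = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
yp = np.array([[0.9, 0.2, 0.2], [0.2, 0.8, 0.3]])

squared_errors = (yp - yt) ** 2               # one value per output neuron
per_sample_mse = squared_errors.mean(axis=1)  # average over the 3 neurons of each sample
overall_mse = squared_errors.mean()           # average over all 6 values

print(squared_errors)
print(per_sample_mse)
print(overall_mse)
```

The second sample dominates the error because the network puts 0.8 on the wrong neuron and only 0.3 on the correct one.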
If you are using one of the DL frameworks like TensorFlow, it already provides a function that calculates the MSE loss between tensors (the signature shown here is from the TensorFlow 1.x API):
tf.losses.mean_squared_error
where
tf.losses.mean_squared_error(
    labels,
    predictions,
    weights=1.0,
    scope=None,
    loss_collection=tf.GraphKeys.LOSSES,
    reduction=Reduction.SUM_BY_NONZERO_WEIGHTS
)
Args:
    labels: The ground truth output tensor, same dimensions as 'predictions'.
    predictions: The predicted outputs.
    weights: Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions must be either 1, or the same as the corresponding losses dimension).
    scope: The scope for the operations performed in computing the loss.
    loss_collection: collection to which the loss will be added.
    reduction: Type of reduction to apply to loss.
Returns:
    Weighted loss float Tensor. If reduction is NONE, this has the same shape as labels; otherwise, it is scalar.
Hello everyone, I'm new in this area, I wondered if anyone could help me understand the results of logistic regression.
I would need to understand if the independent variables can be used to make a good classification.
=== Run information ===
Scheme: weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -num-decimal-places 4
Relation: Train
Instances: 14185
Attributes: 5
ATTR_1
ATTR_2
ATTR_3
ATTR_4
DEPENDENT_VAR
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
Class
Variable 0
====================
ATTR_1 0.0022
ATTR_2 0.0022
ATTR_3 0.0034
ATTR_4 -0.0021
Intercept 0.9156
Odds Ratios...
Class
Variable 0
====================
ATTR_1 1.0022
ATTR_2 1.0022
ATTR_3 1.0034
ATTR_4 0.9979
Time taken to build model: 0.13 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0.07 seconds
=== Summary ===
Correctly Classified Instances 51240 72.2453 %
Incorrectly Classified Instances 19685 27.7547 %
Kappa statistic -0.0001
Mean absolute error 0.3992
Root mean squared error 0.4467
Relative absolute error 99.5581 %
Root relative squared error 99.7727 %
Total Number of Instances 70925
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1,000 1,000 0,723 1,000 0,839 -0,005 0,545 0,759 0
0,000 0,000 0,000 0,000 0,000 -0,005 0,545 0,305 1
Weighted Avg. 0,722 0,723 0,522 0,722 0,606 -0,005 0,545 0,633
=== Confusion Matrix ===
a b <-- classified as
51240 5 | a = 0
19680 0 | b = 1
In particular, I am interested in understanding the values of the coefficients and the odds-ratios.
Thanks.
Off the top of my head:
Odds ratios and coefficient values carry the same information and can be calculated from each other: the odds ratio is exp(coefficient), and the coefficient is ln(odds ratio).
For ATTR_1, exp(0.0022) = 1.0022
For doing more calculations and fitting/predicting, coefficients are "better". However the coefficients are values that must be plugged into exp(x) functions and are somewhat difficult to "visualize in your head".
For human understanding, odds ratios are sometimes more convenient - easier to interpret/visualize, but you can't do certain calculations directly with them.
Weka does not know what you are more interested in, so it gives you both for convenience.
By the way, weka does regularized logistic regression (Logistic Regression with ridge parameter of 1.0E-8), so coefficients might differ slightly from those that a different software package might give you.
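The relation between the two tables can be checked in a few lines. The sketch below copies the coefficients from the Weka output and exponentiates them. The prob_class_0 helper and the example instances are hypothetical, and the helper assumes the usual binary logistic form, P(class 0) = sigmoid(intercept + sum of coefficient * value); I have not verified Weka's internal parameterization beyond that.

```python
import math

# Coefficients copied from the Weka output above (column "Class 0").
coefficients = {"ATTR_1": 0.0022, "ATTR_2": 0.0022, "ATTR_3": 0.0034, "ATTR_4": -0.0021}
intercept = 0.9156

# Odds ratio = exp(coefficient): the factor by which the odds of class 0 are
# multiplied when the attribute increases by one unit.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: exp({coefficients[name]}) = {ratio:.4f}")

def prob_class_0(x):
    """Assumed model form: sigmoid of the linear combination (hypothetical helper)."""
    z = intercept + sum(coefficients[name] * value for name, value in x.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical instances: all attributes at 0, then ATTR_3 raised by 100 units.
baseline = {"ATTR_1": 0, "ATTR_2": 0, "ATTR_3": 0, "ATTR_4": 0}
shifted = dict(baseline, ATTR_3=100)
print(prob_class_0(baseline), prob_class_0(shifted))
```

Because every odds ratio is very close to 1, a one-unit change in any attribute barely moves the prediction, which fits the evaluation output where nearly every instance is assigned to class 0.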
I'm trying to train a CNN to categorize text by topic. When I use binary cross-entropy I get ~80% accuracy, with categorical cross-entropy I get ~50% accuracy.
I don't understand why this is. It's a multiclass problem, doesn't that mean that I have to use categorical cross-entropy and that the results with binary cross-entropy are meaningless?
model.add(embedding_layer)
model.add(Dropout(0.25))
# convolution layers
model.add(Conv1D(nb_filter=32,
                 filter_length=4,
                 border_mode='valid',
                 activation='relu'))
model.add(MaxPooling1D(pool_length=2))
# dense layers
model.add(Flatten())
model.add(Dense(256))
model.add(Dropout(0.25))
model.add(Activation('relu'))
# output layer
model.add(Dense(len(class_id_index)))
model.add(Activation('softmax'))
Then I compile it either like this, using categorical_crossentropy as the loss function:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
or
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Intuitively it makes sense why I'd want to use categorical cross-entropy, I don't understand why I get good results with binary, and poor results with categorical.
The reason for this apparent performance discrepancy between categorical & binary cross entropy is what user xtof54 has already reported in his answer below, i.e.:
the accuracy computed with the Keras method evaluate is just plain
wrong when using binary_crossentropy with more than 2 labels
I would like to elaborate more on this, demonstrate the actual underlying issue, explain it, and offer a remedy.
This behavior is not a bug; the underlying reason is a rather subtle & undocumented issue with how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you simply include metrics=['accuracy'] in your model compilation. In other words, while your first compilation option
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
is valid, your second one:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
will not produce what you expect, but the reason is not the use of binary cross entropy (which, at least in principle, is an absolutely valid loss function).
Why is that? If you check the metrics source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected binary cross entropy as your loss function and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns - while in fact you are interested in the categorical_accuracy.
Let's verify that this is the case, using the MNIST CNN example in Keras, with the following modification:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # WRONG way
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,  # only 2 epochs, for demonstration purposes
          verbose=1,
          validation_data=(x_test, y_test))
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.9975801164627075
# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98780000000000001
score[1]==acc
# False
To remedy this, i.e. to use indeed binary cross entropy as your loss function (as I said, nothing wrong with this, at least in principle) while still getting the categorical accuracy required by the problem at hand, you should ask explicitly for categorical_accuracy in the model compilation as follows:
from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[categorical_accuracy])
In the MNIST example, after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.98580000000000001
# Actual accuracy calculated manually:
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98580000000000001
score[1]==acc
# True
System setup:
Python version 3.5.3
Tensorflow version 1.2.1
Keras version 2.0.4
UPDATE: After my post, I discovered that this issue had already been identified in this answer.
It all depends on the type of classification problem you are dealing with. There are three main categories
binary classification (two target classes),
multi-class classification (more than two exclusive targets),
multi-label classification (more than two non exclusive targets), in which multiple target classes can be on at the same time.
In the first case, binary cross-entropy should be used and targets should be encoded as one-hot vectors.
In the second case, categorical cross-entropy should be used and targets should be encoded as one-hot vectors.
In the last case, binary cross-entropy should be used and targets should be encoded as multi-hot vectors (a 1 for every class that is on). Each output neuron (or unit) is considered as a separate random binary variable, and the likelihood of the entire vector of outputs is the product of the likelihoods of the single binary variables. The loss, being a negative log-likelihood, is therefore the sum of the binary cross-entropies of the single output units.
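The binary cross-entropy is defined as
\text{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]
and categorical cross-entropy is defined as
\text{CCE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log p_{ic}
where i runs over the N samples, c is the index running over the number of classes C, y is the target and p is the predicted probability.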
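The two definitions can be sketched in NumPy as follows, using the standard averaged forms: binary cross-entropy sums y*log(p) + (1-y)*log(1-p) per output, while categorical cross-entropy sums y*log(p) over the classes of each sample. The function names, the eps clipping, and the toy inputs are my own choices for illustration, not taken from any library.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean over samples of -[y*log(p) + (1-y)*log(1-p)]."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y, p, eps=1e-12):
    """Mean over samples of -sum_c y_c*log(p_c), with one-hot rows in y."""
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y * np.log(p), axis=-1))

# Binary case: two samples with scalar targets.
y_bin = np.array([1.0, 0.0])
p_bin = np.array([0.9, 0.2])
print(binary_cross_entropy(y_bin, p_bin))

# Multi-class case: one sample, three exclusive classes, true class at index 1.
y_cat = np.array([[0.0, 1.0, 0.0]])
p_cat = np.array([[0.1, 0.8, 0.1]])
print(categorical_cross_entropy(y_cat, p_cat))
```

Note that the categorical version only reads the probability of the true class, while the binary version also penalizes probability mass placed on outputs whose target is 0.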
I came across an "inverted" issue: I was getting good results with categorical_crossentropy (with 2 classes) and poor results with binary_crossentropy. It seems the problem was the wrong activation function. The correct settings were:
for binary_crossentropy: sigmoid activation, scalar target
for categorical_crossentropy: softmax activation, one-hot encoded target
It's a really interesting case. Actually, in your setup the following statement is true:
binary_crossentropy = len(class_id_index) * categorical_crossentropy
This means that, up to a constant multiplication factor, your losses are equivalent. The weird behaviour that you are observing during the training phase might be an example of the following phenomenon:
At the beginning, the most frequent class dominates the loss, so the network learns to predict mostly this class for every example.
After it has learnt the most frequent pattern, it starts discriminating among less frequent classes. But when you are using adam, the learning rate by then has a much smaller value than it had at the beginning of training (because of the nature of this optimizer). This makes training slower and makes it less likely that your network will, e.g., leave a poor local minimum.
That's why this constant factor might help in the case of binary_crossentropy: after many epochs, the learning rate value is greater than in the categorical_crossentropy case. I usually restart training (and the learning phase) a few times when I notice such behaviour, and/or adjust the class weights using the following pattern:
class_weight = 1 / class_frequency
This makes the loss from less frequent classes balance the influence of the dominant class's loss, both at the beginning of training and later in the optimization process.
EDIT:
Actually, I checked that even though mathematically
binary_crossentropy = len(class_id_index) * categorical_crossentropy
should hold, in Keras it's not true, because Keras automatically normalizes all outputs to sum up to 1. This is the actual reason behind this weird behaviour, as in the case of multi-class classification such normalization harms training.
After commenting on @Marcin's answer, I checked more carefully one of my students' code, where I found the same weird behavior, even after only 2 epochs! (So @Marcin's explanation was not very likely in my case.)
And I found that the answer is actually very simple: the accuracy computed with the Keras method evaluate is just plain wrong when using binary_crossentropy with more than 2 labels. You can check that by recomputing the accuracy yourself (first call the Keras method "predict" and then compute the number of correct answers returned by predict): you get the true accuracy, which is much lower than the Keras "evaluate" one.
a simple example under a multi-class setting to illustrate
suppose you have 4 classes (onehot encoded) and below is just one prediction
true_label = [0,1,0,0]
predicted_label = [0,0,1,0]
when using categorical_crossentropy, the (categorical) accuracy is just 0: it only cares about whether you get the concerned class right.
however, when using binary_crossentropy, the (binary) accuracy is calculated over all classes, so it would be 50% for this prediction. in both cases the final result is the mean of the individual accuracies.
it is recommended to use categorical_crossentropy for multi-class problems (classes are mutually exclusive) but binary_crossentropy for multi-label problems.
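The two accuracy figures in this example can be reproduced with a small NumPy sketch. The functions below mimic the idea of a per-output (binary) accuracy and an argmax-based (categorical) accuracy; they are simplified stand-ins written for illustration, not the Keras metric implementations, and the 0.5 threshold is an assumption.

```python
import numpy as np

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual outputs that match after thresholding."""
    return np.mean((y_pred > threshold).astype(float) == y_true, axis=-1)

def categorical_accuracy(y_true, y_pred):
    """1.0 if the highest-scoring class is the true class, else 0.0."""
    return (np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1)).astype(float)

true_label = np.array([[0.0, 1.0, 0.0, 0.0]])
predicted_label = np.array([[0.0, 0.0, 1.0, 0.0]])

print(binary_accuracy(true_label, predicted_label))       # 2 of the 4 outputs match
print(categorical_accuracy(true_label, predicted_label))  # the predicted class is wrong
```

With many classes the effect gets stronger: a wrong prediction still matches the target on all but two outputs, so the per-output accuracy stays high even when the predicted class is wrong.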
As it is a multi-class problem, you have to use the categorical_crossentropy, the binary cross entropy will produce bogus results, most likely will only evaluate the first two classes only.
50% for a multi-class problem can be quite good, depending on the number of classes. If you have n classes, then 100/n is the accuracy (in percent) you would expect from outputting a random class, assuming balanced classes, so it serves as the baseline to beat.
You are passing a target array of shape (x-dim, y-dim) while using as loss categorical_crossentropy. categorical_crossentropy expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:
from keras.utils import to_categorical
y_binary = to_categorical(y_int)
Alternatively, you can use the loss function sparse_categorical_crossentropy instead, which does expect integer targets.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample).
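The two target formats can be illustrated without Keras. In the sketch below, np.eye(num_classes)[y_int] builds the same kind of one-hot matrix that to_categorical is described as producing, and the loss computed from one-hot targets matches the loss computed by indexing with the integer targets directly. The toy labels and probabilities are made up for the example.

```python
import numpy as np

y_int = np.array([2, 0, 1, 2])        # integer class labels, shape (samples,)
num_classes = 3
y_onehot = np.eye(num_classes)[y_int]  # one-hot matrix, shape (samples, classes)
print(y_onehot)

# Hypothetical predicted distributions, one row per sample.
probs = np.array([[0.1, 0.2, 0.7],
                  [0.8, 0.1, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.2, 0.6]])

# Categorical cross-entropy with one-hot targets ...
cce = -np.sum(y_onehot * np.log(probs), axis=-1)
# ... equals the "sparse" version that indexes with the integer labels.
sparse_cce = -np.log(probs[np.arange(len(y_int)), y_int])
print(cce, sparse_cce)
```

So the choice between the two losses is about the format of the targets, not about a different training objective.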
Take a look at the equation: binary cross-entropy punishes not only the cases where label = 1 and predicted = 0, but also those where label = 0 and predicted = 1.
However, categorical cross-entropy only punishes the cases where label = 1 but predicted = 0. That's why we make the assumption that only ONE label is positive.
The main point is answered satisfactorily with the brilliant piece of sleuthing by desernaut. However, there are occasions when BCE (binary cross entropy) could give different results than CCE (categorical cross entropy) and may be the preferred choice. While the rules of thumb shared above (which loss to select) work fine for 99% of the cases, I would like to add a few new dimensions to this discussion.
The OP had a softmax activation, and this produces a probability distribution as the predicted value. It is a multi-class problem. The preferred loss is categorical CE. Essentially this boils down to -ln(p), where 'p' is the predicted probability of the lone positive class in the sample. This means that the negative predictions don't have a role to play in calculating CE. This is by intention.
On a rare occasion, it may be needed to make the -ve voices count. This can be done by treating the above sample as a series of binary predictions. So if expected is [1 0 0 0 0] and predicted is [0.1 0.5 0.1 0.1 0.2], this is further broken down into:
expected = [1,0], [0,1], [0,1], [0,1], [0,1]
predicted = [0.1, 0.9], [.5, .5], [.1, .9], [.1, .9], [.2, .8]
Now we proceed to compute 5 different cross entropies - one for each of the above 5 expected/predicted combo and sum them up. Then:
CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.8)]
The CE has a different scale but continues to be a measure of the difference between the expected and predicted values. The only difference is that in this scheme, the -ve values are also penalized/rewarded along with the +ve values. In case your problem is such that you are going to use the output probabilities (both +ve and -ves) instead of using the max() to predict just the 1 +ve label, then you may want to consider this version of CE.
How about a multi-label situation where expected = [1 0 0 0 1]? Conventional approach is to use one sigmoid per output neuron instead of an overall softmax. This ensures that the output probabilities are independent of each other. So we get something like:
expected = [1 0 0 0 1]
predicted = [0.1 0.5 0.1 0.1 0.9]
By definition, CE measures the difference between 2 probability distributions. But the above two lists are not probability distributions. Probability distributions should always add up to 1. So conventional solution is to use same loss approach as before - break the expected and predicted values into 5 individual probability distributions, proceed to calculate 5 cross entropies and sum them up. Then:
CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.9)] = 3.3
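Both sums of binary cross-entropies can be checked numerically. The helper below treats every output as its own two-way distribution and adds up the per-output terms; the function name summed_binary_ce is my own, and the inputs are the expected/predicted vectors from the two examples.

```python
import numpy as np

def summed_binary_ce(expected, predicted):
    """Sum over outputs of -[y*ln(p) + (1-y)*ln(1-p)]."""
    expected = np.asarray(expected, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    per_output = -(expected * np.log(predicted)
                   + (1 - expected) * np.log(1 - predicted))
    return per_output.sum()

# Softmax example treated as 5 binary predictions.
softmax_case = summed_binary_ce([1, 0, 0, 0, 0], [0.1, 0.5, 0.1, 0.1, 0.2])
# Multi-label example with two positive classes.
multilabel_case = summed_binary_ce([1, 0, 0, 0, 1], [0.1, 0.5, 0.1, 0.1, 0.9])

print(round(softmax_case, 2), round(multilabel_case, 2))
```

In both cases most of the total comes from the first output, where the true class received a probability of only 0.1.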
The challenge happens when the number of classes may be very high - say a 1000 and there may be only couple of them present in each sample. So the expected is something like: [1,0,0,0,0,0,1,0,0,0.....990 zeroes]. The predicted could be something like: [.8, .1, .1, .1, .1, .1, .8, .1, .1, .1.....990 0.1's]
In this case the CE =
- [ ln(.8) + ln(.8) for the 2 +ve classes and 998 * ln(0.9) for the 998 -ve classes]
= 0.44 (for the +ve classes) + 105 (for the negative classes)
You can see how the -ve classes are beginning to create a nuisance value when calculating the loss. The voice of the +ve samples (which may be all that we care about) is getting drowned out. What do we do? We can't use categorical CE (the version where only +ve samples are considered in calculation). This is because, we are forced to break up the probability distributions into multiple binary probability distributions because otherwise it would not be a probability distribution in the first place. Once we break it into multiple binary probability distributions, we have no choice but to use binary CE and this of course gives weightage to -ve classes.
One option is to drown the voice of the -ve classes by a multiplier. So we multiply all -ve losses by a value gamma where gamma < 1. Say in above case, gamma can be .0001. Now the loss comes to:
= 0.44 (for the +ve classes) + 0.0105 (for the negative classes)
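The effect of the multiplier can be reproduced with the 1000-class example. The sketch below builds the expected and predicted vectors described above (positives at positions 0 and 6, every negative predicted at 0.1), splits the summed binary cross-entropy into its positive and negative parts, and then scales the negative part by gamma. The variable names are my own.

```python
import numpy as np

num_classes = 1000
expected = np.zeros(num_classes)
expected[[0, 6]] = 1.0                  # the two +ve classes
predicted = np.full(num_classes, 0.1)   # every -ve class predicted at 0.1
predicted[[0, 6]] = 0.8                 # the +ve classes predicted at 0.8

per_output = -(expected * np.log(predicted)
               + (1 - expected) * np.log(1 - predicted))
pos_loss = per_output[expected == 1].sum()  # contribution of the 2 +ve classes
neg_loss = per_output[expected == 0].sum()  # contribution of the 998 -ve classes

gamma = 1e-4                                # down-weighting factor for -ve losses
weighted_total = pos_loss + gamma * neg_loss

print(pos_loss, neg_loss)
print(gamma * neg_loss, weighted_total)
```

Without the multiplier the negatives contribute more than 99% of the loss; with it, the two positive classes dominate again.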
The nuisance value has come down. 2 years back Facebook did that and much more in a paper they came up with where they also multiplied the -ve losses by p to the power of x. 'p' is the probability of the output being a +ve and x is a constant>1. This penalized -ve losses even further especially the ones where the model is pretty confident (where 1-p is close to 1). This combined effect of punishing negative class losses combined with harsher punishment for the easily classified cases (which accounted for majority of the -ve cases) worked beautifully for Facebook and they called it focal loss.
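A minimal NumPy sketch of the binary focal loss idea follows, written in the commonly published form -alpha_t * (1 - p_t)^gamma * ln(p_t), where p_t is the probability the model assigned to the true outcome. For a negative example p_t = 1 - p, so the modulating factor becomes p^gamma, which matches the description above. The defaults gamma=2.0 and alpha=0.25 are settings I recall being commonly cited; treat them, and the function name, as assumptions rather than a reference implementation.

```python
import numpy as np

def binary_focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """Per-output focal loss; alpha weights positives, (1 - alpha) weights negatives."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability given to the true outcome
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-dependent weight
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy negative (p=0.1), a hard negative (p=0.9), and a positive (p=0.8).
y = np.array([0, 0, 1])
p = np.array([0.1, 0.9, 0.8])

plain_bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
focal = binary_focal_loss(y, p)
print(plain_bce)
print(focal)
print(focal / plain_bce)  # how much each term was scaled down
```

The easy negative is scaled down far more than the hard one, so the many confidently rejected classes stop dominating the total.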
So in response to OP's question of whether binary CE makes any sense at all in his case, the answer is - it depends. In 99% of the cases the conventional thumb rules work but there could be occasions when these rules could be bent or even broken to suit the problem at hand.
For a more in-depth treatment, you can refer to: https://towardsdatascience.com/cross-entropy-classification-losses-no-math-few-stories-lots-of-intuition-d56f8c7f06b0
The binary_crossentropy(y_target, y_predict) loss doesn't have to be applied only to binary classification problems.
In the source code of binary_crossentropy(), the nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output) of tensorflow was actually used.
And, in the documentation, it says that:
Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time.
I am trying to use Vowpal Wabbit to do a binary classification, i.e. given feature values vw will classify it either 1 or 0. This is how I have the training data formatted.
1 'name | feature1:0 feature2:1 feature3:48 feature4:4881 ...
-1 'name2 | feature1:1 feature2:0 feature3:5 feature4:2565 ...
etc
I have about 30,000 1 data points, and about 3,000 0 data points. I have 100 1 and 100 0 data points that I'm using to test on, after I create the model. These test data points are classified by default as 1. Here is how I format the prediction set:
1 'name | feature1:0 feature2:1 feature3:48 feature4:4881 ...
From my understanding of the VW documentation, I need to use either the logistic or hinge loss_function for binary classifications. This is how I've been creating the model:
vw -d ../training_set.txt --loss_function logistic/hinge -f model
And this is how I try the predictions:
vw -d ../test_set.txt --loss_function logistic/hinge -i model -t -p /dev/stdout
However, this is where I'm running into problems. If I use the hinge loss function, all the predictions are -1. When I use the logistic loss function, I get arbitrary values between 5 and 11. There is a general trend for data points that should be 0 to be lower values, 5-7, and for data points that should be 1 to be from 6-11. What am I doing wrong? I've looked around the documentation and checked a bunch of articles about VW to see if I can identify what my problem is, but I can't figure it out. Ideally I would get a 0,1 value, or a value between 0 and 1 which corresponds to how strong VW thinks the result is. Any help would be appreciated!
If the output should be just -1 and +1 labels, use the --binary option (when testing).
If the output should be a real number between 0 and 1, use --loss_function=logistic --link=logistic. The loss_function=logistic is needed when training, so the number can be interpreted as probability.
If the output should be a real number between -1 and 1, use --link=glf1.
If your training data is unbalanced, e.g. 10 times more positive examples than negative, but your test data is balanced (and you want to get the best loss on this test data), set the importance weight of the positive examples to 0.1 (because there are 10 times more positive examples).
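Two of these points can be sketched in Python. The first helper inserts an importance weight after the label of positive examples, relying on the VW input format in which an optional importance weight follows the label. The second shows the logistic function that a logistic link applies to a raw score, which is why raw outputs such as 5 to 11 correspond to probabilities close to 1. The helper names and the shortened example lines are hypothetical.

```python
import math

def add_importance(line, positive_weight):
    """Insert an importance weight after the label of positive ("1") examples."""
    label, rest = line.split(" ", 1)
    if label == "1":
        return f"{label} {positive_weight} {rest}"
    return line  # negative examples keep the default weight of 1

def logistic_link(raw_score):
    """Map a raw score to (0, 1) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-raw_score))

# Shortened versions of the training lines from the question.
train_lines = [
    "1 'name | feature1:0 feature2:1",
    "-1 'name2 | feature1:1 feature2:0",
]
reweighted = [add_importance(line, 0.1) for line in train_lines]
for line in reweighted:
    print(line)

# Raw scores in the range reported in the question map to values near 1.
for raw in (0.0, 5.0, 11.0):
    print(raw, logistic_link(raw))
```

Since even the lowest reported score (5) maps to a probability above 0.99, the model is predicting the positive class almost regardless of input, which is what you would expect from a 10:1 class imbalance without reweighting.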
Independently of your tool and/or specific algorithm, you can use "learning curves" and train/cross-validation/test splitting to diagnose your algorithm and determine what your problem is. After diagnosing the problem you can adjust your algorithm; for example, if you find you are over-fitting, you can take actions like:
Add regularization
Get more training data
Reduce the complexity of your model
Eliminate redundant features.
You can refer to Andrew Ng's "Advice for machine learning" videos on YouTube for more details on this subject.
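As a tool-independent illustration of a learning curve, here is a small NumPy sketch. It uses synthetic data and an ordinary least-squares model purely as stand-ins (neither comes from the question), trains on increasingly large subsets, and records the training and validation error for each size. A large gap between the two curves suggests over-fitting; two high, close curves suggest under-fitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption, just to have something to fit).
n_samples, n_features = 600, 10
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + rng.normal(scale=0.5, size=n_samples)

# Hold out the last 200 samples for validation.
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

train_sizes = [12, 25, 50, 100, 200, 400]
train_errors, val_errors = [], []
for m in train_sizes:
    # Fit on the first m training samples only.
    w, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
    train_errors.append(mse(X_train[:m], y_train[:m], w))
    val_errors.append(mse(X_val, y_val, w))

for m, tr, va in zip(train_sizes, train_errors, val_errors):
    print(f"m={m:4d}  train MSE={tr:.3f}  validation MSE={va:.3f}")
```

With very few samples the model fits the training subset almost perfectly but generalizes poorly; as more data is added, the two errors move toward each other, which is the pattern that tells you "get more training data" would help.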