How to calculate accuracy score of a random classifier? - machine-learning

Say for example, a dataset contains 60% instances for "Yes" class and 30% instances for "NO" class.
In this scenario, Precision, Recall for the random classifier are
Precision =60%
Recall =50%
Then, what will be the accuracy for random classifier in this scenario?

Some caution is required here, since the very definition of a random classifier is somewhat ambiguous; this is best illustrated in cases of imbalanced data.
By definition, the accuracy of a binary classifier is
acc = P(class=0) * P(prediction=0) + P(class=1) * P(prediction=1)
where P stands for probability.
Indeed, if we stick to the intuitive definition of a random binary classifier as giving
P(prediction=0) = P(prediction=1) = 0.5
then the accuracy computed by the above formula is always 0.5, irrespectively of the class distribution (i.e. the values of P(class=0) and P(class=1)).
However, in this definition, there is an implicit assumption, i.e. that our classes are balanced, each one consisting of 50% of our dataset.
This assumption (and the corresponding intuition) breaks down in cases of class imbalance: if we have a dataset where, say, 90% of samples are of class 0 (i.e. P(class=0)=0.9), then it doesn't make much sense to use the above definition of a random binary classifier; instead, we should use the percentages of the class distributions themselves as the probabilities of our random classifier, i.e.:
P(prediction=0) = P(class=0) = 0.9
P(prediction=1) = P(class=1) = 0.1
Now, plugging these values to the formula defining the accuracy, we get:
acc = P(class=0) * P(prediction=0) + P(class=1) * P(prediction=1)
= (0.9 * 0.9) + (0.1 * 0.1)
= 0.82
which is nowhere close to the naive value of 0.5...
As I already said, AFAIK there are no clear-cut definitions of a random classifier in the literature. Sometimes the "naive" random classifier (always flip a fair coin) is referred to as a "random guess" classifier, while what I have described is referred to as a "weighted guess" one, but still this is far from being accepted as a standard...
The bottom line here is the following: since the main reason for using a random classifier is as a baseline, it makes sense to do so only in relatively balanced datasets. In your case of a 60-40 balance, the result turns out to be 0.52, which is admittedly not far from the naive one of 0.5; but for highly imbalanced datasets (e.g. 90-10), the usefulness itself of the random classifier as a baseline ceases to exist, since the correct baseline has become "always predict the majority class", which here would give an accuracy of 90%, in contrast to the random classifier accuracy of just 82% (let alone the 50% accuracy of the naive approach)...

As #desertnaut mentioned, if you're after a naïve benchmark for your model you're always better using "always predict the majority class" as your benchmark, achieving accuracy of %of_samples_in_majority_class (which is always better than either a random guess or a weighted guess).
In Deepchecks (a package I maintain) we have a check that automatically compares the performance of your model to a simple model (either weighted random, majority class or simple decision tree).
from deepchecks.checks import SimpleModelComparison
from deepchecks import Dataset
SimpleModelComparison().run(Dataset(train_df, label='target'), Dataset(test_df, label='target'), model)

Related

Which metric to use for imbalanced classification problem?

I am working on a classification problem with very imbalanced classes. I have 3 classes in my dataset : class 0,1 and 2. Class 0 is 11% of the training set, class 1 is 13% and class 2 is 75%.
I used and random forest classifier and got 76% accuracy. But I discovered 93% of this accuracy comes from class 2 (majority class). Here is the Crosstable I got.
The results I would like to have :
fewer false negatives for class 0 and 1 OR/AND fewer false positives for class 0 and 1
What I found on the internet to solve the problem and what I've tried :
using class_weight='balanced' or customized class_weight ( 1/11% for class 0, 1/13% for class 1, 1/75% for class 2), but it doesn't change anything (the accuracy and crosstable are still the same). Do you have an interpretation/explenation of this ?
as I know accuracy is not the best metric in this context, I used other metrics : precision_macro, precision_weighted, f1_macro and f1_weighted, and I implemented the area under the curve of precision vs recall for each class and use the average as a metric.
Here's my code (feedback welcome) :
from sklearn.preprocessing import label_binarize
def pr_auc_score(y_true, y_pred):
y=label_binarize(y_true, classes=[0, 1, 2])
return average_precision_score(y[:,:],y_pred[:,:])
pr_auc = make_scorer(pr_auc_score, greater_is_better=True,needs_proba=True)
and here's a plot of the precision vs recall curves.
Alas, for all these metrics, the crosstab remains the same... they seem to have no effect
I also tuned the parameters of Boosting algorithms ( XGBoost and AdaBoost) (with accuracy as metric) and again the results are not improved.. I don't understand because boosting algorithms are supposed to handle imbalanced data
Finally, I used another model (BalancedRandomForestClassifier) and the metric I used is accuracy. The results are good as we can see in this crosstab. I am happy to have such results but I notice that, when I change the metric for this model, there is again no change in the results...
So I'm really interested in knowing why using class_weight, changing the metric or using boosting algorithms, don't lead to better results...
As you have figured out, you have encountered the "accuracy paradox";
Say you have a classifier which has an accuracy of 98%, it would be amazing, right? It might be, but if your data consists of 98% class 0 and 2% class 1, you obtain a 98% accuracy by assigning all values to class 0, which indeed is a bad classifier.
So, what should we do? We need a measure which is invariant to the distribution of the data - entering ROC-curves.
ROC-curves are invariant to the distribution of the data, thus are a great tool to visualize classification-performances for a classifier whether or not it is imbalanced. But, they only work for a two-class problem (you can extend it to multiclass by creating a one-vs-rest or one-vs-one ROC-curve).
F-score might a bit more "tricky" to use than the ROC-AUC since it's a trade off between precision and recall and you need to set the beta-variable (which is often a "1" thus the F1 score).
You write: "fewer false negatives for class 0 and 1 OR/AND fewer false positives for class 0 and 1". Remember, that all algorithms work by either minimizing something or maximizing something - often we minimize a loss function of some sort. For a random forest, lets say we want to minimize the following function L:
L = (w0+w1+w2)/n
where wi is the number of class i being classified as not class i i.e if w0=13 we have missclassified 13 samples from class 0, and n the total number of samples.
It is clear that when class 0 consists of most of the data then an easy way to get a small L is to classify most of the samples as 0. Now, we can overcome this by adding a weight instead to each class e.g
L = (b0*w0+b1*w1+b2*x2)/n
as an example say b0=1, b1=5, b2=10. Now you can see, we cannot just assign most of the data to c0 without being punished by the weights i.e we are way more conservative by assigning samples to class 0, since assigning a class 1 to class 0 gives us 5 times as much loss now as before! This is exactly how the weight in (most) of the classifiers work - they assign a penalty/weight to each class (often proportional to it's ratio i.e if class 0 consists of 80% and class 1 consists of 20% of the data then b0=1 and b1=4) but you can often specify the weight your self; if you find that the classifier still generates to many false negatives of a class then increase the penalty for that class.
Unfortunately "there is no such thing as a free lunch" i.e it's a problem, data and usage specific choice, of what metric to use.
On a side note - "random forest" might actually be bad by design when you don't have much data due to how the splits are calculated (let me know, if you want to know why - it's rather easy to see when using e.g Gini as splitting). Since you have only provided us with the ratio for each class and not the numbers, I cannot tell.

Can intercept and regression coefficients (Beta values) be very high?

I have 38 variables, like oxygen, temperature, pressure, etc and have a task to determine the total yield produced every day from these variables. When I calculate the regression coefficients and intercept value, they seem to be abnormal and very high (Impractical). For example, if 'temperature' coefficient was found to be +375.456, I could not give a meaning to them saying an increase in one unit in temperature would increase yield by 375.456g. That's impractical in my scenario. However, the prediction accuracy seems right. I would like to know, how to interpret these huge intercept( -5341.27355) and huge beta values shown below. One other important point is that I removed multicolinear columns and also, I am not scaling the variables/normalizing them because I need beta coefficients to have meaning such that I could say, increase in temperature by one unit increases yield by 10g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly correlated, so I would suggest that you have a look at simple non-linear regression techniques, such as Decision Trees or Kernel Ridge Regression. These are however more difficult to interpret.
Going back to your issue, these high weights might well be due to there being some high amount of correlation between the variables, or that you simply don't have very much training data.
If you instead of linear regression use Lasso Regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example on how to do this in scikit-learn, including cross validation of the regularization hyper-parameter:
from sklearn.linear_model LassoCV
# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linear dependent on the features
y = np.sum(np.random.random((1,n_features)) * X, axis=1)
model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X,y)
print(model.intercept_)
If you have a linear regression, the formula looks like this (y= target, x= features inputs):
y= x1*b1 +x2*b2 + x3*b3 + x4*b4...+ c
where b1,b2,b3,b4... are your modl.coef_. AS you already realized one of your bigges number is 3.319+02 = 331 and the intercept is also quite big with -5431.
As you already mentioned the coeffiecient variables means how much the target variable changes, if the coeffiecient feature changes with 1 unit and all others features are constant.
so for your interpretation, the higher the absoult coeffienct, the higher the influence of your analysis. But it is important to note that the model is using a lot of high coefficient, that means your model is not depending only of one variable

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I should find the optimal parameters to give to such functions and I do not know if these functions are satisfactory solutions. I was thinking that I have the cosine theta score, I can calculate the percentage of overlap between two sentences two (e.g. the amount of overlapping words divided by the amount of words in the article) and maybe some more interesting things. Then with the data, I could maybe write a function (what type of function I do not know and is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# then 4 overlapping words on articles of 492 words).
p = min(len(v1), len(v2)) / len(v3)
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity
if not denominator:
return 0.0
else:
return (float(numerator) / denominator)
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (I.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, in theory let's pretend you have a model which out of 100 samples only predicts 1 answer (Giving 99 unknowns). Technically if that one answer is correct, that is a model with 100% accuracy, but which has a very low recall. Generally in machine learning you will find a trade off between recall and accuracy.
Some people like to go for certain metrics which combine the two (The most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about accuracy - I would only want to target consumers who are likely to buy my product. If however, we are looking to test for a deadly disease or markers for bank fraud, then it's feasible for that test to be accurate only 10% of the time - if its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut off value which you believe indicates a good match. This is would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (Either the "related" or "unrelated" cluster) the point belongs to. Note however that here your features feel like they would be incredibly hard to define.

Why binary_crossentropy and categorical_crossentropy give different performances for the same problem?

I'm trying to train a CNN to categorize text by topic. When I use binary cross-entropy I get ~80% accuracy, with categorical cross-entropy I get ~50% accuracy.
I don't understand why this is. It's a multiclass problem, doesn't that mean that I have to use categorical cross-entropy and that the results with binary cross-entropy are meaningless?
model.add(embedding_layer)
model.add(Dropout(0.25))
# convolution layers
model.add(Conv1D(nb_filter=32,
filter_length=4,
border_mode='valid',
activation='relu'))
model.add(MaxPooling1D(pool_length=2))
# dense layers
model.add(Flatten())
model.add(Dense(256))
model.add(Dropout(0.25))
model.add(Activation('relu'))
# output layer
model.add(Dense(len(class_id_index)))
model.add(Activation('softmax'))
Then I compile it either it like this using categorical_crossentropy as the loss function:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
or
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Intuitively it makes sense why I'd want to use categorical cross-entropy, I don't understand why I get good results with binary, and poor results with categorical.
The reason for this apparent performance discrepancy between categorical & binary cross entropy is what user xtof54 has already reported in his answer below, i.e.:
the accuracy computed with the Keras method evaluate is just plain
wrong when using binary_crossentropy with more than 2 labels
I would like to elaborate more on this, demonstrate the actual underlying issue, explain it, and offer a remedy.
This behavior is not a bug; the underlying reason is a rather subtle & undocumented issue at how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you include simply metrics=['accuracy'] in your model compilation. In other words, while your first compilation option
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
is valid, your second one:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
will not produce what you expect, but the reason is not the use of binary cross entropy (which, at least in principle, is an absolutely valid loss function).
Why is that? If you check the metrics source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected binary cross entropy as your loss function and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns - while in fact you are interested in the categorical_accuracy.
Let's verify that this is the case, using the MNIST CNN example in Keras, with the following modification:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # WRONG way
model.fit(x_train, y_train,
batch_size=batch_size,
epochs=2, # only 2 epochs, for demonstration purposes
verbose=1,
validation_data=(x_test, y_test))
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.9975801164627075
# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98780000000000001
score[1]==acc
# False
To remedy this, i.e. to use indeed binary cross entropy as your loss function (as I said, nothing wrong with this, at least in principle) while still getting the categorical accuracy required by the problem at hand, you should ask explicitly for categorical_accuracy in the model compilation as follows:
from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[categorical_accuracy])
In the MNIST example, after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.98580000000000001
# Actual accuracy calculated manually:
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98580000000000001
score[1]==acc
# True
System setup:
Python version 3.5.3
Tensorflow version 1.2.1
Keras version 2.0.4
UPDATE: After my post, I discovered that this issue had already been identified in this answer.
It all depends on the type of classification problem you are dealing with. There are three main categories
binary classification (two target classes),
multi-class classification (more than two exclusive targets),
multi-label classification (more than two non exclusive targets), in which multiple target classes can be on at the same time.
In the first case, binary cross-entropy should be used and targets should be encoded as one-hot vectors.
In the second case, categorical cross-entropy should be used and targets should be encoded as one-hot vectors.
In the last case, binary cross-entropy should be used and targets should be encoded as one-hot vectors. Each output neuron (or unit) is considered as a separate random binary variable, and the loss for the entire vector of outputs is the product of the loss of single binary variables. Therefore it is the product of binary cross-entropy for each single output unit.
The binary cross-entropy is defined as
and categorical cross-entropy is defined as
where c is the index running over the number of classes C.
I came across an "inverted" issue — I was getting good results with categorical_crossentropy (with 2 classes) and poor with binary_crossentropy. It seems that problem was with wrong activation function. The correct settings were:
for binary_crossentropy: sigmoid activation, scalar target
for categorical_crossentropy: softmax activation, one-hot encoded target
It's really interesting case. Actually in your setup the following statement is true:
binary_crossentropy = len(class_id_index) * categorical_crossentropy
This means that up to a constant multiplication factor your losses are equivalent. The weird behaviour that you are observing during a training phase might be an example of a following phenomenon:
At the beginning the most frequent class is dominating the loss - so network is learning to predict mostly this class for every example.
After it learnt the most frequent pattern it starts discriminating among less frequent classes. But when you are using adam - the learning rate has a much smaller value than it had at the beginning of training (it's because of the nature of this optimizer). It makes training slower and prevents your network from e.g. leaving a poor local minimum less possible.
That's why this constant factor might help in case of binary_crossentropy. After many epochs - the learning rate value is greater than in categorical_crossentropy case. I usually restart training (and learning phase) a few times when I notice such behaviour or/and adjusting a class weights using the following pattern:
class_weight = 1 / class_frequency
This makes loss from a less frequent classes balancing the influence of a dominant class loss at the beginning of a training and in a further part of an optimization process.
EDIT:
Actually - I checked that even though in case of maths:
binary_crossentropy = len(class_id_index) * categorical_crossentropy
should hold - in case of keras it's not true, because keras is automatically normalizing all outputs to sum up to 1. This is the actual reason behind this weird behaviour as in case of multiclassification such normalization harms a training.
After commenting #Marcin answer, I have more carefully checked one of my students code where I found the same weird behavior, even after only 2 epochs ! (So #Marcin's explanation was not very likely in my case).
And I found that the answer is actually very simple: the accuracy computed with the Keras method evaluate is just plain wrong when using binary_crossentropy with more than 2 labels. You can check that by recomputing the accuracy yourself (first call the Keras method "predict" and then compute the number of correct answers returned by predict): you get the true accuracy, which is much lower than the Keras "evaluate" one.
a simple example under a multi-class setting to illustrate
suppose you have 4 classes (onehot encoded) and below is just one prediction
true_label = [0,1,0,0]
predicted_label = [0,0,1,0]
when using categorical_crossentropy, the accuracy is just 0 , it only cares about if you get the concerned class right.
however when using binary_crossentropy, the accuracy is calculated for all classes, it would be 50% for this prediction. and the final result will be the mean of the individual accuracies for both cases.
it is recommended to use categorical_crossentropy for multi-class(classes are mutually exclusive) problem but binary_crossentropy for multi-label problem.
As it is a multi-class problem, you have to use the categorical_crossentropy, the binary cross entropy will produce bogus results, most likely will only evaluate the first two classes only.
50% for a multi-class problem can be quite good, depending on the number of classes. If you have n classes, then 100/n is the minimum performance you can get by outputting a random class.
You are passing a target array of shape (x-dim, y-dim) while using as loss categorical_crossentropy. categorical_crossentropy expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:
from keras.utils import to_categorical
y_binary = to_categorical(y_int)
Alternatively, you can use the loss function sparse_categorical_crossentropy instead, which does expect integer targets.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample).
Take a look at the equation you can find that binary cross entropy not only punish those label = 1, predicted =0, but also label = 0, predicted = 1.
However categorical cross entropy only punish those label = 1 but predicted = 1.That's why we make assumption that there is only ONE label positive.
The main point is answered satisfactorily with the brilliant piece of sleuthing by desernaut. However there are occasions when BCE (binary cross entropy) could throw different results than CCE (categorical cross entropy) and may be the preferred choice. While the thumb rules shared above (which loss to select) work fine for 99% of the cases, I would like to add a few new dimensions to this discussion.
The OP had a softmax activation and this throws a probability distribution as the predicted value. It is a multi-class problem. The preferred loss is categorical CE. Essentially this boils down to -ln(p) where 'p' is the predicted probability of the lone positive class in the sample. This means that the negative predictions dont have a role to play in calculating CE. This is by intention.
On a rare occasion, it may be needed to make the -ve voices count. This can be done by treating the above sample as a series of binary predictions. So if expected is [1 0 0 0 0] and predicted is [0.1 0.5 0.1 0.1 0.2], this is further broken down into:
expected = [1,0], [0,1], [0,1], [0,1], [0,1]
predicted = [0.1, 0.9], [.5, .5], [.1, .9], [.1, .9], [.2, .8]
Now we proceed to compute 5 different cross entropies - one for each of the above 5 expected/predicted combo and sum them up. Then:
CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.8)]
The CE has a different scale but continues to be a measure of the difference between the expected and predicted values. The only difference is that in this scheme, the -ve values are also penalized/rewarded along with the +ve values. In case your problem is such that you are going to use the output probabilities (both +ve and -ves) instead of using the max() to predict just the 1 +ve label, then you may want to consider this version of CE.
How about a multi-label situation where expected = [1 0 0 0 1]? Conventional approach is to use one sigmoid per output neuron instead of an overall softmax. This ensures that the output probabilities are independent of each other. So we get something like:
expected = [1 0 0 0 1]
predicted is = [0.1 0.5 0.1 0.1 0.9]
By definition, CE measures the difference between 2 probability distributions. But the above two lists are not probability distributions. Probability distributions should always add up to 1. So conventional solution is to use same loss approach as before - break the expected and predicted values into 5 individual probability distributions, proceed to calculate 5 cross entropies and sum them up. Then:
CE = -[ ln(.1) + ln(0.5) + ln(0.9) + ln(0.9) + ln(0.9)] = 3.3
The challenge happens when the number of classes may be very high - say a 1000 and there may be only couple of them present in each sample. So the expected is something like: [1,0,0,0,0,0,1,0,0,0.....990 zeroes]. The predicted could be something like: [.8, .1, .1, .1, .1, .1, .8, .1, .1, .1.....990 0.1's]
In this case the CE =
- [ ln(.8) + ln(.8) for the 2 +ve classes and 998 * ln(0.9) for the 998 -ve classes]
= 0.44 (for the +ve classes) + 105 (for the negative classes)
You can see how the -ve classes are beginning to create a nuisance value when calculating the loss. The voice of the +ve samples (which may be all that we care about) is getting drowned out. What do we do? We can't use categorical CE (the version where only +ve samples are considered in calculation). This is because, we are forced to break up the probability distributions into multiple binary probability distributions because otherwise it would not be a probability distribution in the first place. Once we break it into multiple binary probability distributions, we have no choice but to use binary CE and this of course gives weightage to -ve classes.
One option is to drown the voice of the -ve classes by a multiplier. So we multiply all -ve losses by a value gamma where gamma < 1. Say in above case, gamma can be .0001. Now the loss comes to:
= 0.44 (for the +ve classes) + 0.105 (for the negative classes)
The nuisance value has come down. 2 years back Facebook did that and much more in a paper they came up with where they also multiplied the -ve losses by p to the power of x. 'p' is the probability of the output being a +ve and x is a constant>1. This penalized -ve losses even further especially the ones where the model is pretty confident (where 1-p is close to 1). This combined effect of punishing negative class losses combined with harsher punishment for the easily classified cases (which accounted for majority of the -ve cases) worked beautifully for Facebook and they called it focal loss.
So in response to OP's question of whether binary CE makes any sense at all in his case, the answer is - it depends. In 99% of the cases the conventional thumb rules work but there could be occasions when these rules could be bent or even broken to suit the problem at hand.
For a more in-depth treatment, you can refer to: https://towardsdatascience.com/cross-entropy-classification-losses-no-math-few-stories-lots-of-intuition-d56f8c7f06b0
The binary_crossentropy(y_target, y_predict) doesn't need to apply to binary classification problem.
In the source code of binary_crossentropy(), the nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output) of tensorflow was actually used.
And, in the documentation, it says that:
Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time.

How to compute probabilities instead of actual classifications on ML problems

Let's assume that we have a few data points that can be used as the training set. Each row is consisted of 4 say columns (features) that take boolean values. The 5th column expresses the class and it also takes boolean values. Here is an example (they are almost random):
1,1,1,0,1
0,1,1,0,1
1,1,0,0,1
0,0,0,0,0
1,0,0,1,0
0,0,0,0,0
Now, what I want to do is to build a model such that for any given input (new line) the system does not return the class itself (like in the case of a regular classification problem) but instead the probability this particular input belongs to class 0 or class 1. How can I do that? What's more, how can I generate a confidence interval or error rate associated with that computation?
Not all classification algorithms return probabilities, because not all of them have an underlying probabilistic model. For example, a classification tree is just a set of rules that you follow to assign each new input to a particular class.
An example of a classification algorithm that does have an underlying probabilistic model is logistic regression. In this algorithm, the probability that a particular input x is in the class is
prob = 1 / (1 + exp( -theta * x ))
where theta is a vector of coefficients with the same number of dimensions as x. Generally to move from probabilities to classifications, you simply threshold, e.g.
if prob < 0.5
return 0;
else
return 1;
end
Other classification algorithms may have probabilistic interpretations, for example random forests are essentially a voting algorithm with multiple classification trees. If 80% of the trees vote for class 1 and 20% vote for class 2, then you could output an 80% probability of being in class 1. But this is a side effect of how the model works, rather than an explicit underlying probability model.

Resources