Binary Classification vs. Multi Class Classification - machine-learning

I have a machine learning classification problem with 3 possible classes (Class A, Class B and Class C). Please let me know which one would be the better approach.
- Split the problem into 2 binary classifications: first identify whether it is Class A or Class 'Not A'. Then, if it is Class 'Not A', run another binary classification to classify it into Class B or Class C.
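As a rough sketch of that two-stage scheme, the prediction logic could look like the following. The scorer functions are hypothetical stand-ins for trained binary classifiers (they are not part of any library), and the 0.5 threshold is an assumption.

```python
from typing import Callable, Sequence

# Hypothetical binary scorers: each maps a feature vector to a probability.
Scorer = Callable[[Sequence[float]], float]


def cascade_predict(x, p_is_a: Scorer, p_is_b_given_not_a: Scorer,
                    threshold: float = 0.5) -> str:
    """Two-stage decision: 'A' vs 'not A' first, then 'B' vs 'C'."""
    if p_is_a(x) >= threshold:
        return "A"
    return "B" if p_is_b_given_not_a(x) >= threshold else "C"


# Toy stand-ins for trained models (assumptions for illustration only).
def toy_is_a(x):
    return 0.9 if x[0] > 0 else 0.1


def toy_is_b(x):
    return 0.8 if x[1] > 0 else 0.2


samples = [(1, 0), (-1, 1), (-1, -1)]
labels = [cascade_predict(x, toy_is_a, toy_is_b) for x in samples]
print(labels)
```

Note that in such a cascade a mistake made by the first stage cannot be corrected by the second one, which is one reason to compare it against a single multi-class model.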

Binary classification typically ends with a sigmoid function, which goes smoothly from 0 to 1. Thresholding that output is how we decide between the two classes.
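from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(1, input_dim=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='relu'))
# A single sigmoid unit outputs the probability of the positive class
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))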
For multi-class classification you would typically use softmax at the very last layer. In the next example the last layer has 10 neurons, meaning 10 possible classes.
from keras.layers import Dropout
model.add(Dense(10, activation='softmax'))
However, you can also use softmax with 2 neurons in the last layer for the binary classification as well:
model.add(Dense(2, activation='softmax'))
Hope this provides a little intuition on classifiers.
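To illustrate why a 2-neuron softmax can stand in for a single sigmoid unit, here is a small plain-Python check (not Keras code; the logit value is arbitrary). It shows that a softmax over the logits [z, 0] assigns the first class the same probability as sigmoid(z).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


z = 1.3  # arbitrary logit chosen for illustration
p_sigmoid = sigmoid(z)
p_softmax = softmax([z, 0.0])[0]
print(abs(p_sigmoid - p_softmax) < 1e-12)
```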

What you describe is one method used for Multi Class Classification.
It is called One vs. All / One vs. Rest.
The best way is to choose a good classifier framework that supports both options and pick the better one using cross-validation.


I have used softmax function with loss='binary_crossentropy' for a two class classification problem

For a two-class classification problem, sigmoid + binary_crossentropy is fine, and softmax + categorical_crossentropy is fine. But in my case I used softmax (2 dense layers) + binary_crossentropy and trained a DL model. Is this correct? Is the accuracy it produces genuine?
Please let me know if softmax(2 dense layers) + binary_crossentropy is correct or not.
The number of layers is irrelevant at this stage. If you use softmax, then the loss should be either categorical_crossentropy or sparse_categorical_crossentropy, depending on whether you one-hot-encoded the targets or not. A softmax output layer and loss='binary_crossentropy' are not a consistent pairing, so the output is likely to be wacky.
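model.add(Dense(2, activation='softmax'))  # 2 because it's a two-class problem
model.compile(loss='categorical_crossentropy',  # the loss that pairs with softmax and one-hot targets
              optimizer='adagrad',  # optimizer can be whatever works best
              metrics=['accuracy'])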
Whether to use softmax or sigmoid depends on your classification problem: is it something like 'A vs. NOT A' or 'A or B'? Plot the model performance, compare, and draw conclusions.
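To make the one-hot vs. integer-label distinction concrete, here is a minimal NumPy sketch. The two functions are plain re-implementations written for illustration (they are not the Keras functions themselves), and the probabilities are made up. Both variants compute the same quantity and differ only in how the targets are encoded.

```python
import numpy as np


def categorical_crossentropy(y_onehot, probs, eps=1e-7):
    # Targets are one-hot rows, e.g. [0, 1]
    probs = np.clip(probs, eps, 1.0)
    return float(np.mean(-np.sum(y_onehot * np.log(probs), axis=1)))


def sparse_categorical_crossentropy(y_int, probs, eps=1e-7):
    # Targets are integer class ids, e.g. 1
    probs = np.clip(probs, eps, 1.0)
    picked = probs[np.arange(len(y_int)), y_int]
    return float(np.mean(-np.log(picked)))


# Made-up softmax outputs for three samples of a two-class problem
probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
y_int = np.array([0, 1, 1])
y_onehot = np.eye(2)[y_int]

loss_onehot = categorical_crossentropy(y_onehot, probs)
loss_sparse = sparse_categorical_crossentropy(y_int, probs)
print(np.isclose(loss_onehot, loss_sparse))
```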

Machine learning approach to facial recognition

First of all, I'm very new to the field, so maybe my question is a bit too naive or even trivial.
I'm currently trying to understand how I can go about recognizing different faces.
Here is what I have tried so far and the main issues with each approach:
1) Haar Cascade -> HOG -> SVM:
The main issue is that the algorithm becomes very indecisive when more than 4 people are trained. The same occurs when we swap the Haar Cascade for a pre-trained CNN to detect faces.
2) dlib facial landmarks -> distance between points -> SVM or Simple Neural Network Classification:
This is the current approach and it behaves very well when 4 people are trained. When more people are trained it becomes very messy, jumping from decision to decision and never settling on a choice.
I've read online that triplet loss is the way to go, but I'm very confused as to how I'd go about implementing it. Can I use the current distance vectors found using dlib, or should I scrap everything and train my own CNN?
If I can use the distance vectors, how would I pass the data to the algorithm? Is triplet loss just an ordinary neural network with its loss function altered?
I've taken the liberty of showing exactly how the distance vectors are being calculated:
The green lines represent the distances being calculated
A 33 float list is returned which is then fed to the classifier
Here is the relevant code for the classifier (Keras):
def fit_classifier(self):
    x_train, y_train = self._get_data(self.train_data_path)
    x_test, y_test = self._get_data(self.test_data_path)
    # One-hot encode the integer person labels
    encoding_train_y = np_utils.to_categorical(y_train)
    encoding_test_y = np_utils.to_categorical(y_test)
    model = Sequential()
    model.add(Dense(10, input_dim=33, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(30, activation='relu'))
    model.add(Dense(40, activation='relu'))
    model.add(Dense(30, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(10, activation='relu'))
    # One output unit per person
    model.add(Dense(max(y_train)+1, activation='softmax'))
    model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, encoding_train_y, epochs=100, batch_size=10)
I think this is more of a theoretical question than anything else. If someone with good experience in the field could help me out, I'd be very happy!
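For readers unfamiliar with the term, the triplet loss mentioned above is commonly written as max(d(anchor, positive) - d(anchor, negative) + margin, 0), computed on embedding vectors produced by a network. The NumPy sketch below only illustrates that formula. The squared Euclidean distance, the margin of 0.2 and the toy vectors are assumptions, and it says nothing about whether the dlib distance vectors are suitable inputs.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Per-triplet loss on batches of embedding vectors (rows)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # same identity
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # different identity
    return np.maximum(d_pos - d_neg + margin, 0.0)


# Toy 2-D embeddings (made up for illustration)
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.1, 0.0]])
easy_negative = np.array([[1.0, 1.0]])  # already far from the anchor
hard_negative = np.array([[0.2, 0.0]])  # almost as close as the positive

loss_easy = triplet_loss(anchor, positive, easy_negative)
loss_hard = triplet_loss(anchor, positive, hard_negative)
print(loss_easy, loss_hard)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only "hard" triplets contribute to training.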

When and where should we use these keras LSTM models

I know how RNNs, LSTMs, neural nets and activation functions work, but from the various available LSTM models I don't know which one I should use for which data, and when. I created these 5 models as a sample of the different varieties of LSTM models I have seen, but I don't know which kind of sequence dataset each one suits best. Most of my confusion is about the second/third lines of these models. Are model1 and model4 the same? Why is model1.add(LSTM(10, input_shape=(max_len, 1), return_sequences=False)) different from model4.add(Embedding(X_train.shape[1], 128, input_length=max_len))? I would much appreciate it if someone could explain these five models in simple English.
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional, TimeDistributed
from keras.models import Sequential
model1 = Sequential()
model1.add(LSTM(10, input_shape=(max_len, 1), return_sequences=False))
model1.add(Dense(1, activation='sigmoid'))
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model1.summary()  # summary() prints the table itself
model2 = Sequential()
model2.add(LSTM(10, batch_input_shape=(1, 1, 1), return_sequences=False, stateful=True))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.summary()
model3 = Sequential()
model3.add(TimeDistributed(Dense(X_train.shape[1]), input_shape=(X_train.shape[1],1)))
model3.add(LSTM(10, return_sequences=False))
model3.add(Dense(1, activation='sigmoid'))
model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model3.summary()
model4 = Sequential()
model4.add(Embedding(X_train.shape[1], 128, input_length=max_len))
model4.add(LSTM(10))
model4.add(Dense(1, activation='sigmoid'))
model4.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model4.summary()
model5 = Sequential()
model5.add(Embedding(X_train.shape[1], 128, input_length=max_len))
model5.add(Bidirectional(LSTM(10)))
model5.add(Dense(1, activation='sigmoid'))
model5.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model5.summary()
First network is the best one for classification. It simply analyses the whole sequence, and once all input steps have been fed to the model it is able to make a decision. There are other variants of this architecture (using e.g. GlobalAveragePooling1D or its max counterpart) which are pretty similar from a conceptual point of view.
Second network is, from a design point of view, quite similar to the first architecture. The difference is that in the first approach two consecutive fit and predict calls are totally independent, whereas here the starting state for the second call is the final state of the first. This enables a lot of cool scenarios, e.g. analysis of varying-length sequences or decision-making processes, because you can effectively pause the inference / training process, modify the network or the input, and come back to it with an updated state.
Third network is the best one when you don't want to use a recurrent network at all stages of your computation. Especially when your network is big, introducing recurrent layers is quite costly in terms of parameter count (introducing a recurrent connection usually increases the number of parameters by a factor of at least 2). So you can apply a static network as a preprocessing stage and then feed its results to a recurrent part. This makes training easier.
Fourth network is a special case of case 3. Here you have a sequence of tokens which are one-hot encoded and then transformed using Embedding. This makes the process less memory-consuming.
Fifth network: a bidirectional network gives you the advantage of knowing, at each step, not only the previous history of the sequence but also its further steps. This comes at a computational cost, and you also lose the possibility of feeding data sequentially, as you need to have the full sequence when the analysis is performed.
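The remark about one-hot tokens and Embedding can be illustrated with a small NumPy sketch. It is not Keras code, and the vocabulary size, embedding width and token ids are made up. It shows that an embedding lookup returns the same vectors as multiplying one-hot rows by the embedding matrix, without ever materialising the one-hot matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4  # toy sizes chosen for illustration
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

tokens = np.array([2, 5, 0])  # a short sequence of integer token ids

# Embedding-style lookup: index rows directly
looked_up = embedding_matrix[tokens]

# Equivalent dense formulation: one-hot encode, then multiply
one_hot = np.eye(vocab_size)[tokens]
projected = one_hot @ embedding_matrix

print(np.allclose(looked_up, projected))
```

This is also the practical difference from model1, which feeds the raw numeric values of the sequence straight into the LSTM, whereas the Embedding-based models treat each input value as a token id to look up.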

How to calculate prediction uncertainty using Keras?

I would like to calculate NN model certainty/confidence (see What my deep model doesn't know) - when NN tells me an image represents "8", I would like to know how certain it is. Is my model 99% certain it is "8" or is it 51% it is "8", but it could also be "6"? Some digits are quite ambiguous and I would like to know for which images the model is just "flipping a coin".
I have found some theoretical writings about this but I have trouble putting this in code. If I understand correctly, I should evaluate a testing image multiple times while "killing off" different neurons (using dropout) and then...?
Working on MNIST dataset, I am running the following model:
from keras.models import Sequential
from keras.layers import Dense, Activation, Conv2D, Flatten, Dropout
model = Sequential()
model.add(Conv2D(128, kernel_size=(7, 7),
                 input_shape=(28, 28, 1,)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Flatten())  # flatten the feature maps before the dense layers
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=10, activation='softmax'))
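model.compile(...,  # loss and optimizer arguments are missing from the original post
              metrics=['accuracy'])
model.fit(train_data, train_labels, batch_size=100, epochs=30, validation_data=(test_data, test_labels,))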
How should I predict with this model so that I get its certainty about predictions too? I would appreciate some practical examples (preferably in Keras, but any will do).
To clarify, I am looking for an example of how to get certainty using the method outlined by Yarin Gal (or an explanation of why some other method yields better results).
If you want to implement the dropout approach to measure uncertainty, you should do the following:
Implement a function which applies dropout at test time as well:
import keras.backend as K
f = K.function([model.layers[0].input, K.learning_phase()],
               [model.layers[-1].output])
Use this function as an uncertainty predictor, e.g. in the following manner:
import numpy

def predict_with_uncertainty(f, x, n_iter=10):
    # f takes a list [inputs, learning_phase] and returns a list of outputs;
    # learning_phase=1 keeps dropout active on every pass
    result = numpy.stack([f([x, 1])[0] for _ in range(n_iter)])
    prediction = result.mean(axis=0)
    uncertainty = result.var(axis=0)
    return prediction, uncertainty
Of course you may use any different function to compute uncertainty.
Made a few changes to the top voted answer. Now it works for me.
It's a way to estimate model uncertainty. For other source of uncertainty, I found helpful.
import numpy as np
import keras.backend as K

f = K.function([model.layers[0].input, K.learning_phase()],
               [model.layers[-1].output])

def predict_with_uncertainty(f, x, n_iter=10):
    result = []
    for i in range(n_iter):
        result.append(f([x, 1]))  # learning_phase=1 keeps dropout active
    result = np.array(result)
    prediction = result.mean(axis=0)
    uncertainty = result.var(axis=0)
    return prediction, uncertainty
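For intuition about what these repeated stochastic passes compute, here is a framework-free NumPy sketch of the same idea. The tiny two-layer network, its random weights, the dropout rate of 0.5 and the number of passes are all assumptions made for illustration. It is not the Keras mechanism itself, only the arithmetic behind it.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 16))  # toy hidden-layer weights
W2 = rng.normal(size=(16, 3))  # toy output-layer weights (3 classes)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def stochastic_forward(x, rate=0.5):
    """One forward pass with dropout left switched on."""
    h = np.maximum(x @ W1, 0.0)          # ReLU hidden layer
    mask = rng.random(h.shape) >= rate   # randomly drop hidden units
    h = h * mask / (1.0 - rate)          # inverted-dropout rescaling
    return softmax(h @ W2)


def mc_dropout_predict(x, n_iter=100):
    samples = np.stack([stochastic_forward(x) for _ in range(n_iter)])
    return samples.mean(axis=0), samples.var(axis=0)


x = rng.normal(size=(2, 4))  # two made-up input rows
mean_probs, var_probs = mc_dropout_predict(x)
print(mean_probs.shape, var_probs.shape)
```

A large variance for the predicted class across passes suggests the model is unsure about that input; how to turn the variance into a single score is a design choice.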
Your model uses a softmax activation, so the simplest way to obtain some kind of uncertainty measure is to look at the output softmax probabilities:
probs = model.predict(some_input_data)[0]  # some_input_data: a batch containing the image(s) to classify
The probs array will then be a 10-element vector of numbers in the [0, 1] range that sum to 1.0, so they can be interpreted as probabilities. For example the probability for digit 7 is just probs[7].
Then with this information you can do some post-processing, typically the predicted class is the one with highest probability, but you can also look at the class with second highest probability, etc.
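A minimal sketch of that kind of post-processing, using a made-up probability vector in place of a real model output (the numbers mirror the "51% an 8, but could be a 6" case from the question):

```python
import numpy as np

# Made-up softmax output for one image; entries sum to 1
probs = np.array([0.01, 0.01, 0.02, 0.01, 0.01,
                  0.01, 0.40, 0.01, 0.51, 0.01])

order = np.argsort(probs)[::-1]       # class ids, most probable first
best, second = int(order[0]), int(order[1])
margin = float(probs[best] - probs[second])      # small margin = ambiguous
entropy = float(-np.sum(probs * np.log(probs)))  # higher = more spread out

print(best, second, round(margin, 2))
```

Keep in mind that softmax outputs are not guaranteed to be well calibrated, so a margin or entropy threshold is best treated as a heuristic and checked on held-out data.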
A simpler way is to set training=True on any dropout layers you want to run during inference as well (essentially tells the layer to operate as if it's always in training mode - so it is always present for both training and inference).
import keras
inputs = keras.Input(shape=(10,))
x = keras.layers.Dense(3)(inputs)
outputs = keras.layers.Dropout(0.5)(x, training=True)
model = keras.Model(inputs, outputs)
Code above is from this issue.

Why is binary_crossentropy more accurate than categorical_crossentropy for multiclass classification in Keras?

I'm learning how to create convolutional neural networks using Keras. I'm trying to get a high accuracy for the MNIST dataset.
Apparently categorical_crossentropy is for more than 2 classes and binary_crossentropy is for 2 classes. Since there are 10 digits, I should be using categorical_crossentropy. However, after training and testing dozens of models, binary_crossentropy consistently outperforms categorical_crossentropy significantly.
On Kaggle, I got 99+% accuracy using binary_crossentropy and 10 epochs. Meanwhile, I can't get above 97% using categorical_crossentropy, even using 30 epochs (which isn't much, but I don't have a GPU, so training takes forever).
Here's what my model looks like now:
model = Sequential()
model.add(Convolution2D(100, 5, 5, border_mode='valid', input_shape=(28, 28, 1), init='glorot_uniform', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(100, 3, 3, init='glorot_uniform', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())  # flatten the feature maps before the dense layers
model.add(Dense(100, init='glorot_uniform', activation='relu'))
model.add(Dense(100, init='glorot_uniform', activation='relu'))
model.add(Dense(10, init='glorot_uniform', activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adamax', metrics=['accuracy'])
Short answer: it is not.
To see that, simply try to calculate the accuracy "by hand", and you will see that it is different from the one reported by Keras with the model.evaluate method:
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.99794011611938471
# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
# 0.98999999999999999
The reason it seems to be so is a rather subtle issue in how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you simply include metrics=['accuracy'] in your model compilation.
If you check the source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected binary cross entropy as your loss function and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns.
To avoid that, i.e. to use indeed binary cross entropy as your loss function (nothing wrong with this, in principle) while still getting the categorical accuracy required by the problem at hand (i.e. MNIST classification), you should ask explicitly for categorical_accuracy in the model compilation as follows:
from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adamax', metrics=[categorical_accuracy])
And after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:
sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000 == score[1]
# True
(HT to this great answer to a similar problem, which helped me understand the issue...)
UPDATE: After my post, I discovered that this issue had already been identified in this answer.
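To see why the two metrics disagree, here is a small NumPy sketch. The two functions are re-implementations that mirror the definitions as I understand them (element-wise comparison after rounding vs. argmax comparison), not the Keras source, and the predictions are made up.

```python
import numpy as np


def binary_accuracy(y_true, y_pred):
    # Compares every one of the 10 outputs separately after rounding
    return float(np.mean(y_true == np.round(y_pred)))


def categorical_accuracy(y_true, y_pred):
    # Compares only the predicted class with the true class
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))


# Two samples with true digits 3 and 7, one-hot encoded
y_true = np.eye(10)[[3, 7]]

# Made-up softmax outputs: the first is right, the second predicts digit 1
y_pred = np.full((2, 10), 0.02)
y_pred[0, 3] = 0.82
y_pred[1, 1] = 0.82

print(binary_accuracy(y_true, y_pred), categorical_accuracy(y_true, y_pred))
```

Only one of the two digits is classified correctly, yet 18 of the 20 individual outputs match after rounding, because a misclassified sample still gets 8 of its 10 zeros "right". That is why the binary variant reports a much higher number.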
First of all, binary_crossentropy is not only for the case where there are two classes.
The "binary" in the name is because it is designed for binary outputs: each number of the softmax output is pushed towards being 0 or 1.
Here, it checks each number of the output separately.
That doesn't explain your result, though, since categorical_crossentropy exploits the fact that it is a classification problem.
Are you sure that when you read your data there is one and only one class per sample? It's the only explanation I can give.
