Multiple binary classifiers combining - machine-learning

I'm trying to implement a multilayer perceptron classifier, and I have a data set of 1000 samples. There are 6 features and 5 possible labels.
As I understand OneVsAll, we create one binary classifier per label and train each classifier on the training data.
However, I don't understand how to combine the results of the 5 binary classifiers. What if the data is noisy and 2 binary classifiers both predict that the test sample is positive? And if every binary classifier predicts that a sample is negative, how do we label it?

Each unit in your output layer should return a value h where 0 < h < 1. Usually, in a binary classifier, you would choose a threshold, say 0.5, to decide whether the output is a positive or a negative result. In the case of 1vsAll, you instead take the label of the output unit with the highest value of h as your predicted label. That covers both of your edge cases: if two units come out positive, the larger h wins, and if no unit exceeds 0.5, you still pick the largest one.
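To make the "pick the highest h" rule concrete, here is a minimal sketch in plain Python. The label names and score values are made up for illustration, and one_vs_all_predict is a hypothetical helper, not part of any library.

```python
def one_vs_all_predict(scores, labels):
    """Return the label whose binary classifier produced the highest score h."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]

# Hypothetical outputs of the 5 binary classifiers for two test samples.
labels = ["A", "B", "C", "D", "E"]
two_positive = [0.10, 0.85, 0.70, 0.05, 0.20]  # two units above 0.5
all_negative = [0.10, 0.30, 0.45, 0.05, 0.20]  # no unit above 0.5

print(one_vs_all_predict(two_positive, labels))
print(one_vs_all_predict(all_negative, labels))
```

No threshold is involved here, so the rule always yields exactly one label. Whether a very low winning score should be treated as "unknown" instead is a separate design choice.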

Related

SVM model predicts instances with probability scores greater than 0.1(default threshold 0.5) as positives

I'm working on a binary classification problem using the logistic regression and support vector machine models from sklearn. Both models were fit on the same imbalanced training data with adjusted class weights, and they achieved comparable performance. When I used these two trained models to predict on a new dataset, the LR and SVM models predicted a similar number of instances as positive, and the predicted positives overlap heavily.
However, when I looked at the probability scores of the instances classified as positive, the LR scores range from 0.5 to 1 while the SVM scores start at around 0.1. I called model.predict(prediction_data) to get the predicted class of each instance and model.predict_proba(prediction_data) to get the probability scores for classes 0 (neg) and 1 (pos), assuming both use a default threshold of 0.5.
There is no error in my code, and I have no idea why the SVM labels instances with probability scores < 0.5 as positive. Any thoughts on how to interpret this situation?
That's a known behaviour of sklearn's SVC() on binary classification problems. It is reported, for instance, in these GitHub issues (here and here), and it is also documented in the User Guide, which says:
In addition, the probability estimates may be inconsistent with the scores:
the “argmax” of the scores may not be the argmax of the probabilities; in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.
or directly in the libsvm FAQ, which says:
Let's just consider two-class classification here. After probability information is obtained in training, we do not have prob >= 0.5 if and only if decision value >= 0.
All in all, the point is that:
on one side, predictions are based on decision_function values: if the decision value computed on a new instance is positive, the predicted class is the positive class, and vice versa.
on the other side, as stated in one of the GitHub issues, np.argmax(self.predict_proba(X), axis=1) != self.predict(X), which is where the inconsistency comes from. In other words, to always get consistent results on binary classification problems you would need a classifier whose predictions are based on the output of predict_proba() (which is, by the way, what you get with calibrators), like so:
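import numpy as np

def predict(self, X):
    # Predict from the probabilities so predict() always agrees with predict_proba().
    y_proba = self.predict_proba(X)
    return np.argmax(y_proba, axis=1)  # column index; equals the label when classes are 0 and 1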
I'd also suggest this post on the topic.
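To see why the two thresholds need not line up, here is a simplified sketch of a Platt-style sigmoid of the form libsvm fits, P = 1 / (1 + exp(A*f + B)), where f is the decision value. The values of A and B below are hypothetical, chosen only for illustration; the real ones are fitted during training, and the sign conventions inside sklearn are left out.

```python
import math

def platt_probability(decision_value, A, B):
    """Platt-style sigmoid: P(positive | f) = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * decision_value + B))

# Hypothetical fitted parameters (not taken from any real model).
A, B = -2.0, 1.5

f = 0.3                      # positive decision value, so predict() says "positive"
p = platt_probability(f, A, B)
crossover = -B / A           # decision value at which the probability equals 0.5

print(f"decision value {f} -> probability {p:.3f}")
print(f"probability reaches 0.5 only at decision value {crossover}")
```

Whenever the fitted B is not zero, the point where the probability crosses 0.5 is shifted away from decision value 0, so a sample with a small positive decision value can be labelled positive while its probability stays below 0.5.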

Multiclass classification for n classes with number of output neurons = ceiling of log2 (n)

Suppose I want to use a multilayer perceptron to classify 3 classes. When it comes to the number of output neurons, anybody would instantly say: use 3 output neurons with softmax activation. But what if I use 2 output neurons with sigmoid activations to output [0,0] for class 1, [0,1] for class 2 and [1,0] for class 3? That is, a binary-encoded output with one bit per output neuron. Wouldn't this technique decrease the number of output neurons (and hence the number of parameters) by a lot? A 100-class word classification for a simple NLP application would require 100 output neurons with softmax, whereas you can cover it with 7 output neurons using the above technique. One disadvantage is that you won't get probability scores for all the classes. My question is: is this approach correct? If so, would you consider it more efficient than softmax for datasets with a large number of classes?
You could do this, but then you would have to rethink your loss function. The cross-entropy loss used in training a model for classification is the negative log-likelihood of a categorical distribution, which assumes you have a probability associated with every class. The loss function requires 3 output probabilities and you only have 2 output values.
However, there are ways to do it anyway: you could use a binary cross-entropy loss on each element of your output, but this would be a different probabilistic assumption about your model. You'd be assuming that your classes have some shared characteristics: for example, [0,0] and [0,1] share the value of the first output. The decreased degrees of freedom are probably going to give you marginally worse performance (but other parts of the MLP may pick up the slack).
If you're really worried about the parameter cost of the final layer, then you might be better just not training it at all. This paper shows a fixed Hadamard matrix on the final layer is as good as training it.
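For reference, the binary-coded targets from the question can be sketched like this. The helper names are hypothetical, classes are indexed from 0, and the code only covers the encoding and decoding, not the training.

```python
def bits_needed(n_classes):
    """Number of sigmoid outputs needed for n_classes labels: ceil(log2(n))."""
    return max(1, (n_classes - 1).bit_length())

def encode(class_index, n_bits):
    """Class index -> list of 0/1 targets, most significant bit first."""
    return [(class_index >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]

def decode(outputs, threshold=0.5):
    """Threshold each sigmoid output and read the bits back as a class index."""
    index = 0
    for value in outputs:
        index = (index << 1) | (1 if value >= threshold else 0)
    return index

print(bits_needed(3), bits_needed(100))
print([encode(i, bits_needed(3)) for i in range(3)])
print(decode([0.9, 0.2]), decode([0.9, 0.8]))
```

Note that with 3 classes the code [1,1] is unused, so decode([0.9, 0.8]) returns index 3, which does not correspond to any class. A softmax layer cannot produce that kind of invalid answer, which is one more thing to weigh against the saved parameters.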

Multi-label classification involving range of numbers as labels

I have a classification problem where my labels are ratings from 0 to 100 in increments of 1 (e.g. 1, 2, 3, 4, ...).
I have a data set where each row has a name, text corpus, and a rating (0 - 100).
From the text corpus I am trying to extract features that I can feed into my classifier, which will output a corresponding rating per row (0 - 100).
For feature selection, I am thinking of starting with a basic bag of words. My question is about the classification algorithm, however. Is there a classification algorithm in scikit-learn that supports this kind of problem?
I was reading http://scikit-learn.org/stable/modules/multiclass.html, but the algorithms described seem to support labels that are completely discrete, whereas I have a set of continuous labels.
EDIT: What about the case where I bin my ratings? For example, I could have 10 labels, 1 to 10.
You can use regression instead of classification. You can cluster the n-gram features from the text corpus to form a dictionary and use it to build a feature set. With this feature set, train a regression model whose output is a continuous value. You can then round the real-valued output to get a discrete label in 1-100.
You can preprocess your data with OneHotEncoder to convert your single 1-to-100 rating into 100 binary indicators, one for each value in the interval [1..100]. Then you'll have 100 labels and can learn a multiclass classifier.
That said, I suggest using regression instead.
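As a minimal sketch of the post-processing step, the following turns a real-valued regression output into a rating, and a rating into one of 10 bins as in the question's EDIT. The 0-100 range, the equal-width bins, and both helper names are assumptions made for illustration.

```python
def to_rating(prediction, low=0, high=100):
    """Round a real-valued regression output and clip it to the rating range."""
    return min(high, max(low, round(prediction)))

def to_bin(rating, n_bins=10, low=0, high=100):
    """Map a rating to one of n_bins equal-width bins, numbered 0 .. n_bins - 1."""
    width = (high - low + 1) / n_bins
    return min(n_bins - 1, int((rating - low) // width))

# Hypothetical raw outputs of a regression model.
for raw in (87.6, -3.2, 104.9):
    rating = to_rating(raw)
    print(raw, "->", rating, "-> bin", to_bin(rating))
```

Clipping matters because a regression model is free to predict values outside the range seen in training.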

how to use weight when training a weak learner for adaboost

The following is the AdaBoost algorithm:
It mentions "using weights wi on the training data" at part 3.1.
I am not very clear about how to use the weights. Should I resample the training data?
It depends on what classifier you are using.
If your classifier can take instance weights (weighted training examples) into account, then you don't need to resample the data. Examples are a naive Bayes classifier that accumulates weighted counts, or a weighted k-nearest-neighbor classifier.
Otherwise, you want to resample the data using the instance weights, i.e., instances with more weight may be sampled multiple times, while instances with little weight might not appear in the training data at all. Most other classifiers fall in this category.
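The resampling option can be sketched with the standard library alone. resample_by_weight is a hypothetical helper, and the samples and weights below are toy values.

```python
import random

def resample_by_weight(samples, weights, rng):
    """Draw len(samples) items with replacement, with probability proportional to weight."""
    return rng.choices(samples, weights=weights, k=len(samples))

rng = random.Random(0)  # fixed seed so the draw is repeatable
samples = ["x1", "x2", "x3", "x4"]
weights = [0.1, 0.1, 0.1, 0.7]  # hypothetical AdaBoost weights; x4 was misclassified

print(resample_by_weight(samples, weights, rng))
```

The weak learner is then trained on the resampled list as if it were an ordinary unweighted training set.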
In Practice
In practice, boosting performs better if you rely only on a pool of very naive classifiers, e.g., decision stumps or linear discriminants. In this case, the algorithm you listed has an easy-to-implement form (see here for details):
Where alpha is chosen by (epsilon is defined similarly as yours).
An Example
Define a two-class problem in the plane (for example, a circle of points inside a square) and build a strong classifier out of a pool of randomly generated linear discriminants of the type sign(ax1 + bx2 + c).
The two class labels are represented with red crosses and blue dots. Here we use a bunch of linear discriminants (yellow lines) to construct the pool of naive/weak classifiers. We generate 1000 data points for each class in the graph (inside the circle or not), and 20% of the data is reserved for testing.
This is the classification result (on the test dataset) I got using 50 linear discriminants. The training error is 1.45% and the testing error is 2.3%.
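A rough sketch of this kind of experiment, using the weights directly instead of resampling, could look like the following. It uses the standard discrete AdaBoost update with alpha = 0.5 * ln((1 - eps) / eps), which may differ in constant factors from the algorithm in the question. The data sizes, circle radius, pool size and number of rounds are all arbitrary choices, much smaller than in the example above, so the error rates will not match the ones quoted.

```python
import math
import random

def make_data(n, rng):
    """Points in the square [-1, 1]^2, labelled +1 inside a circle and -1 outside."""
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(((x1, x2), 1 if x1 * x1 + x2 * x2 < 0.5 else -1))
    return data

def make_pool(size, rng):
    """Random weak classifiers sign(a*x1 + b*x2 + c), stored as (a, b, c)."""
    return [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(size)]

def weak_predict(clf, x):
    a, b, c = clf
    return 1 if a * x[0] + b * x[1] + c >= 0 else -1

def weighted_error(clf, data, w):
    """Sum of the weights of the samples that clf misclassifies."""
    return sum(wi for wi, (x, y) in zip(w, data) if weak_predict(clf, x) != y)

def adaboost(data, pool, rounds):
    n = len(data)
    w = [1.0 / n] * n                       # start with equal weights
    ensemble = []
    for _ in range(rounds):
        # The weights enter here: pick the pool member with the lowest weighted error.
        best = min(pool, key=lambda clf: weighted_error(clf, data, w))
        eps = weighted_error(best, data, w)
        if eps <= 0 or eps >= 0.5:          # nothing useful left in the pool
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, best))
        # Raise the weight of misclassified samples, lower it for correct ones.
        w = [wi * math.exp(-alpha * y * weak_predict(best, x))
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble, w

def strong_predict(ensemble, x):
    score = sum(alpha * weak_predict(clf, x) for alpha, clf in ensemble)
    return 1 if score >= 0 else -1

def error_rate(ensemble, data):
    return sum(strong_predict(ensemble, x) != y for x, y in data) / len(data)

rng = random.Random(42)
train, test = make_data(200, rng), make_data(100, rng)
ensemble, final_weights = adaboost(train, make_pool(100, rng), rounds=20)
print("weak classifiers used:", len(ensemble))
print("training error:", error_rate(ensemble, train))
print("testing error:", error_rate(ensemble, test))
```

The weak learner here never sees a resampled data set. The weights only change which pool member counts as "best" in each round, which is why a different discriminant gets picked each time.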
The weights are the values applied to each example (sample) in step 2. These weights are then updated at step 3.3 (wi).
So initially all weights are equal (step 2), and they are then increased for wrongly classified data and decreased for correctly classified data. In step 3.1 you therefore have to take these values into account when determining a new classifier, giving more importance to samples with higher weights. If you did not change the weights, you would produce exactly the same classifier each time you execute step 3.1.
These weights are only used for training purpose, they're not part of the final model.

Query in training

I am using libsvm for image classification in MATLAB. What exactly does training_label_vector mean in the "svmtrain" command? What do testing_label_vector and testing_instance_matrix mean in "svmpredict"? After training, how do I use the results?
For SVM, each example contains two parts: an input object (typically a vector containing the data) and a desired output value or label specifying which class it belongs to. The training label vector holds the class of each training vector. For two-class classification, the values in the training label vector will be 1 or -1, so some of the feature vectors are labeled 1 and the others -1. The same applies to the testing label vector. The testing instance matrix holds the data you are testing the model with.
After training, you get a model as output, and you then test it with the testing instance matrix and testing labels to get the accuracy of the classifier.
To read more on SVM, this is a good link: http://www.tristanfletcher.co.uk/SVM%20Explained.pdf
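To illustrate how these pieces line up, here is a small sketch in plain Python rather than MATLAB. It does not call libsvm at all. The feature values and the predicted labels are made up, and accuracy is a hypothetical helper that mirrors the comparison done at prediction time.

```python
# One row per training image, one column per feature (toy values).
training_instance_matrix = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.4],
    [0.1, 0.9, 0.7],
    [0.2, 0.8, 0.6],
]
# One label per row of the instance matrix: +1 or -1 for a two-class problem.
training_label_vector = [1, 1, -1, -1]

# The testing data has the same layout; the labels are the known true classes.
testing_instance_matrix = [[0.85, 0.15, 0.35], [0.15, 0.85, 0.65], [0.5, 0.5, 0.5]]
testing_label_vector = [1, -1, 1]

def accuracy(predicted_labels, true_labels):
    """Fraction of test instances whose predicted label matches the true label."""
    matches = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return matches / len(true_labels)

# Hypothetical predictions, standing in for what a trained model would return.
predicted_labels = [1, -1, -1]
print(accuracy(predicted_labels, testing_label_vector))
```

The key constraint is that each label vector has exactly one entry per row of the matching instance matrix.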

Resources