Combining classifiers' output based on F1-scores - machine-learning

I have 4 classifiers available (already trained), for a 4-class classification problem. For a given dataset, I have the output of each of the classifiers, as well as their Recall, Precision and F1-scores for each of the classes.
What would be the best algorithm (or available existing algorithms) to combine the predictions of these classifiers to get 1 single final prediction, taking into consideration that some of the classifiers have higher f1-scores than others for specific classes?
EDIT
My main problem is that some classifiers have better F1 for specific classes.
So let's say Classifier1 (C1) predicted class A and has an F1 of 0.90 for class A, while Classifier2 (C2) predicted class B and has an F1 of 0.80 for class B.
My first thought would be to choose C1's prediction based on its higher F1. But what if, for example, we also know that C2 has an F1-score of 0.999 for class A? If C2 is that good at predicting class A (even better than C1) but did not predict it, that should increase the probability that the real class is not A, I believe.
On the other hand, if C2 had a really low F1 for class A, that should make it even more likely that the real class is A: not only because C1 predicted A and is good at it, but also because C2 is not good at predicting that class, which would explain why it might have failed to detect it.
I'm not sure how to deal with these questions in practice though.

One way would be to use weighted votes, where the vote from the classifier with the highest F1-score for its predicted class carries the highest weight. Then you simply choose the class with the highest total score.
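A minimal sketch of that weighted-vote idea in Python (the classifier names and F1 values below are made up for illustration; you would plug in your own scores):

# Hypothetical per-class F1-scores for the 4 classifiers (illustrative values only)
f1_scores = {
    "C1": {"A": 0.90, "B": 0.70, "C": 0.65, "D": 0.80},
    "C2": {"A": 0.99, "B": 0.80, "C": 0.60, "D": 0.75},
    "C3": {"A": 0.85, "B": 0.75, "C": 0.70, "D": 0.60},
    "C4": {"A": 0.70, "B": 0.65, "C": 0.80, "D": 0.85},
}

def weighted_vote(predictions):
    """predictions: dict mapping classifier name -> predicted class label."""
    totals = {}
    for clf, pred in predictions.items():
        # Each classifier votes for its predicted class, weighted by its F1 for that class
        totals[pred] = totals.get(pred, 0.0) + f1_scores[clf][pred]
    # Pick the class with the highest total weighted vote
    return max(totals, key=totals.get)

print(weighted_vote({"C1": "A", "C2": "B", "C3": "A", "C4": "D"}))  # -> "A"

A more principled extension would treat each classifier's full confusion matrix (not just the F1 of the class it predicted) as evidence, which would also capture the case from the question where C2 is excellent at class A and yet did not predict A.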

Related

Per class weighted loss for multiclass-multilabel classification in ML.net

I want to classify several classes, let's say A, B, C and D, but the dataset is unbalanced (class A can account for 60% of cases). For that reason, multiclass classification algorithms in ML.NET tend to predict A.
This unbalanced situation is typical of the population in my problem: A is more frequent than the others, B tends to be more frequent than C, and C tends to be more frequent than D. For now, I'm not interested in up/down-sampling the dataset or increasing its size to solve this problem (unless there are no other options).
In the context of my problem, successfully predicting B is more valuable than predicting A, predicting C is more valuable than B, and predicting D is more valuable than C. So I'm interested in giving more weight to classes B, C and D in order to tell the algorithm to take more risks and try to predict the other classes.
But I cannot find a way to do it in ML.NET. I know that it can be done with loss functions, but there is not much information about it and I could not find any example in ML.NET. I tried to implement a custom loss function (class CustomLoss : ISupportSdcaClassificationLoss, ISupportSdcaLoss, IScalarLoss, ILossFunction<float, float>, IClassificationLoss) and to inject it into MulticlassClassification.Trainers.SdcaNonCalibrated, but with no success, because the ground truth passed in is always 1 (it does not represent the true class, so I cannot know which class I'm calculating the loss for).
Any ideas how to solve that with ML.NET? If not, are there good alternatives to ML.NET in C# for this problem?
Try TensorFlow.NET as a good alternative: https://github.com/SciSharp/TensorFlow.NET
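If it helps to see the idea itself, here is a conceptual NumPy sketch (not ML.NET code) of a per-class weighted cross-entropy; the weights are illustrative values that up-weight the rarer, more valuable classes:

import numpy as np

# Illustrative weights for classes A, B, C, D: the rarer / more valuable the class,
# the larger its weight, so mistakes on it cost more
class_weights = np.array([0.5, 1.0, 2.0, 4.0])

def weighted_cross_entropy(probs, y_true):
    """probs: (n_samples, n_classes) predicted probabilities;
    y_true: (n_samples,) integer class labels (0=A, 1=B, 2=C, 3=D)."""
    eps = 1e-12
    # Standard cross-entropy of the true class for each sample
    per_sample = -np.log(probs[np.arange(len(y_true)), y_true] + eps)
    # Scale each sample's loss by the weight of its true class
    return np.mean(class_weights[y_true] * per_sample)

probs = np.array([[0.7, 0.1, 0.1, 0.1],   # a sample whose true class is A
                  [0.2, 0.5, 0.2, 0.1]])  # a sample whose true class is D
print(weighted_cross_entropy(probs, np.array([0, 3])))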

How do classifiers classify?

After training any classifier, the classifier returns the probability of a data point belonging to each class:
y_pred = clf.predict_proba(test_point)
Does the classifier predict the class with the maximum probability, or does it treat the probabilities as a distribution and draw from it?
In other words, suppose the output probabilities are:
C1: 0.1, C2: 0.2, C3: 0.7
Will the output be C3 always, or only 70% of the time?
When clf predicts, it does not sample from the per-class probabilities. It uses the final fully connected layer to produce an array of shape [items_num, classes_num]; you then take the argmax over the class axis to get each item's class.
By the way, during training clf uses softmax to obtain the probability of each class, which is smoother to optimize; you can find documentation about softmax if you are interested in the training process.
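As a small illustration (the logits are made up so that the softmax lands near the 0.1 / 0.2 / 0.7 example above):

import numpy as np

logits = np.array([0.5, 1.2, 2.4])             # made-up raw scores for C1, C2, C3
probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> roughly [0.1, 0.2, 0.7]
print(probs)

# predict() takes the argmax: with these probabilities the answer is always C3;
# it is not drawn at random 70% of the time
print(np.argmax(probs))  # -> 2, i.e. C3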
How to go from class probability scores to a class is often called the 'decision function', and it is often considered separate from the classifier itself. In scikit-learn, many estimators have a default decision function accessible via predict(); for multi-class problems this generally just returns the class with the largest score (an argmax).
However, this may be extended in various ways, depending on needs. For instance, if wrongly predicting one of the classes is very costly, one might weight that class's probabilities down (class weighting). Or one can have a decision function that only gives a class as output if the confidence is high, and otherwise returns an error or a fallback class.
One can also have multi-label classification, where the output is not a single class but a list of classes, e.g. [0.6, 0.1, 0.7, 0.2] -> (class0, class2). These can use a common threshold or a per-class threshold. This is common in tagging problems.
But in almost all cases the decision function is a deterministic function, not a probabilistic one.
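A short sketch of these decision functions, assuming the class probabilities are already available as a NumPy array:

import numpy as np

probs = np.array([[0.1, 0.2, 0.7],     # clear winner: class 2
                  [0.4, 0.35, 0.25]])  # low confidence everywhere

# Default decision function: argmax, deterministic
print(np.argmax(probs, axis=1))             # -> [2 0]

# Reject option: only emit a class when the top probability clears a threshold
top = probs.max(axis=1)
decisions = np.where(top >= 0.6, probs.argmax(axis=1), -1)  # -1 = fallback / "don't know"
print(decisions)                            # -> [ 2 -1]

# Multi-label: a threshold per class instead of a single argmax
scores = np.array([0.6, 0.1, 0.7, 0.2])
thresholds = np.array([0.5, 0.5, 0.5, 0.5])
print(np.nonzero(scores >= thresholds)[0])  # -> [0 2], i.e. (class0, class2)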

How to find the best combination of features which will maximise the probability of a particular class?

Suppose we have a classifier with two output classes, C1 and C2, and 8 features X1, X2, ..., X8. How do you find the combination of features (possibly a subset) such that the likelihood of class C1 is maximized?
You could look at something called "backward elimination" to find the combination of features that most impacts the outcome of each class. From what I can interpret from your question, you want to maximize the likelihood of class C1 (be biased, basically); you could consider a weighted approach for that (higher weights for the features which impact the outcome of class C1).
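As a possible starting point, here is a sketch on synthetic data using scikit-learn's greedy backward selection (assumes a reasonably recent scikit-learn with SequentialFeatureSelector; biasing the selection specifically towards class C1 would additionally require a custom scoring function, e.g. recall on C1):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic 2-class problem with 8 features standing in for X1..X8
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)

clf = LogisticRegression(max_iter=1000)
# Start from all 8 features and greedily drop the least useful ones
selector = SequentialFeatureSelector(clf, n_features_to_select=4, direction="backward")
selector.fit(X, y)
print(selector.get_support())  # boolean mask over X1..X8 of the retained features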

Are NN Classification Outputs Probabilities?

I have just begun to work with neural networks using TensorFlow and I am really new to this. I trained my first model to make 2-category classifications and I'm a little curious about the output. Let's say we are making a prediction on whether or not a house price will go up, and we get an output like
House A: .99
House B: .75
House C: .55
House D: .40
Can I assume that these outputs are probabilities? So it's more likely that House B will go up than House C. Or is it just classifying that Houses B and C will go up and House D will not? Thanks!
Not exactly. A neural network will output a prediction of whatever you trained it for. So if you trained it to predict probabilities, it will indeed output (predictions of) probabilities. However, if you trained it on the observation that the price actually did go up, say with a single output which is 1.0 if the price went up and 0.0 if it didn't, then the output is a regression of that observation given the input. This is not necessarily a probability but can rather be viewed as the model's confidence.
Yes, each number can be thought of as a probability representing how likely it is that a house will go up in price. Just to clarify further, the probability estimate for one house does not affect the estimates for the others, as they are treated as separate samples. So B being more likely doesn't make C less likely; it's just that B happens to be more likely to go up.
And the classification depends on your threshold. By default I believe most classifiers use 0.5 as their threshold, so in this case A, B, and C are classified to go up and D is classified to go down.
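A tiny sketch of that per-house thresholding, using the scores from the question and an assumed 0.5 cut-off:

scores = {"A": 0.99, "B": 0.75, "C": 0.55, "D": 0.40}
threshold = 0.5  # assumed default cut-off

for house, p in scores.items():
    label = "up" if p >= threshold else "not up"
    print(f"House {house}: score={p:.2f} -> {label}")
# Each house is thresholded independently; B's score does not affect C's.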

classification algorithms that return confidences?

Given a machine learning model built on top of scikit-learn, how can I classify new instances and then choose only those with the highest confidence? How do we define confidence in machine learning, and how do we generate it (if it is not generated automatically by scikit-learn)? What should I change in this approach if I had more than 2 potential classes?
This is what I have done so far:
# load libraries
from sklearn import neighbors
# initialize k-nearest-neighbours classifier
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
# train model
knn.fit([[1], [2], [3], [4], [5], [6]], [0, 0, 0, 1, 1, 1])
# predict ::: get class probabilities (predict_proba expects a 2D array of samples)
print(knn.predict_proba([[1.5]]))
print(knn.predict_proba([[37]]))
print(knn.predict_proba([[3.5]]))
Example:
Let's assume that we have created a model using the XYZ machine learning algorithm. Let's also assume that we are trying to classify users based on their gender using information such as location, hobbies, and income. Then, say we have 10 new instances that we want to classify. As usual, upon applying the model, we get 10 outputs, either M (for male) or F (for female). So far so good. However, I would like to somehow measure the precision of these results and then, using a hard-coded threshold, leave out those with low precision. My question is how to measure that precision. Is probability (as given by the predict_proba() function) a good measure? For example, can I say that if the probability is between 0.9 and 1 then "keep" (otherwise "omit")? Or should I use a more sophisticated method for doing that? As you can see, I lack the theoretical background, so any help would be highly appreciated.
While this is more of a stats question I can give answers relative to scikit-learn.
Confidence in machine learning depends on the method used for the model. For example with 3-NN (what you used), predict_proba(x) will give you n/3, where n is the number of "class 1" points among the 3 nearest neighbours of x. You can easily say that if n/3 is smaller than 0.5, there are fewer than 2 "class 1" points among the nearest neighbours and at least 2 "class 0" points. That means your x is more likely to be from "class 0". (I assume you knew that already.)
For another method like SVM, the confidence can be the distance from the point considered to the hyperplane; for ensemble models it could be the number of aggregated votes towards a certain class. Scikit-learn's predict_proba() uses whatever is available from the model.
For multiclass problems (imagine Y can be equal to A, B or C) you have two main approaches, which are sometimes handled directly in scikit-learn.
The first approach is OneVsOne. It basically runs every new sample through an AvsB, an AvsC and a BvsC model and takes the most probable class (imagine if A wins against B and against C, it is very likely that the right class is A; the ambiguous cases are resolved by taking the class that has the highest confidence in its match-ups, e.g. if A wins against B, B wins against C and C wins against A, and the confidence of A winning against B is higher than the rest, it will most likely be A).
The second approach is OneVsAll (OneVsRest), in which you compute A vs (B and C), B vs (A and C), and C vs (A and B), and take the class that is the most likely by looking at the confidence scores.
Using scikit-learn's predict() will always give the most likely class based on the confidence scores that predict_proba would give.
I suggest you read this http://scikit-learn.org/stable/modules/multiclass.html very carefully.
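For reference, scikit-learn exposes both strategies as explicit wrappers (a small sketch on the iris dataset; many estimators also apply one of these strategies internally by default):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # a 3-class problem, like A/B/C above

# OneVsOne: one binary model per pair of classes (A vs B, A vs C, B vs C)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# OneVsRest (OneVsAll): one binary model per class (A vs rest, B vs rest, C vs rest)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(ovo.predict(X[:5]))
print(ovr.predict(X[:5]))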
EDIT :
Ah, I see what you are trying to do. predict_proba() has a big flaw: let's assume you have a big outlier in your new instances (e.g. a female with video games and guns as hobbies, a software developer as a job, etc.). If you use, for instance, k-NN and your outlier sits in the middle of the other class's cloud of points, predict_proba() could give 1 as a confidence score for Male while the instance is Female. However, it will work well for undecisive cases (e.g. male or female, with video games and guns as hobbies, and works in a nursery), as predict_proba() will give something around ~0.5.
I don't know if something better can be used, though. If you have enough training samples for doing cross-validation, I suggest you look toward ROC and PR curves for optimizing your threshold.
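A sketch of the "keep only confident predictions" idea, continuing the k-NN example from the question (the 0.9 cut-off is just the asker's hard-coded threshold, not a recommendation):

import numpy as np
from sklearn import neighbors

knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit([[1], [2], [3], [4], [5], [6]], [0, 0, 0, 1, 1, 1])

X_new = np.array([[1.5], [3.5], [37]])
probs = knn.predict_proba(X_new)            # shape (n_samples, n_classes)
confidence = probs.max(axis=1)              # confidence = highest class probability
predictions = knn.classes_[probs.argmax(axis=1)]

threshold = 0.9                             # the asker's hard-coded cut-off
keep = confidence >= threshold
print(predictions[keep], confidence[keep])  # only the "confident" predictions survive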
