Per class weighted loss for multiclass-multilabel classification in ML.net - machine-learning

I want to classify several classes, let's say A, B, C and D, but the dataset is unbalanced (class A can have 60% of the cases). For that reason, multiclass classification algorithms in ML.NET tend to predict A.
This unbalanced situation is common in the population of my problem: A is more frequent than the others, B tends to be more frequent than C, and C tends to be more frequent than D. For now, I'm not interested in up/down-sampling the dataset or increasing its size to solve this problem (unless there are no other options).
In the context of my problem, successfully predicting B is more valuable than predicting A, predicting C is more valuable than B, and predicting D is more valuable than C. So I'm interested in giving more weight to classes B, C and D in order to tell the algorithm to take more risks and try to predict the other classes.
But I cannot find a way to do this in ML.NET. I know that it can be done with loss functions, but there is not much information about it and I could not find any example for ML.NET. I tried to implement a custom loss function (class CustomLoss : ISupportSdcaClassificationLoss, ISupportSdcaLoss, IScalarLoss, ILossFunction<float, float>, IClassificationLoss) and inject it into MulticlassClassification.Trainers.SdcaNonCalibrated, but with no success, because the ground truth passed to the loss is always 1 (it does not identify the true class, so I cannot know which class I am calculating the loss for).
Any ideas on how to solve this with ML.NET? If not, are there good alternatives to ML.NET in C# for solving this problem?

Try Tensorflow.NET as a good alternative: https://github.com/SciSharp/TensorFlow.NET
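Whatever framework you end up on, the idea you are after is simply a class-weighted loss. Below is a minimal numpy sketch (not ML.NET or Tensorflow.NET API code) of per-class weighted cross-entropy; the weight values for A, B, C and D are made up for illustration. If the trainer you use exposes an example-weight column, attaching per-row weights derived from the class has a similar effect, but whether a particular ML.NET trainer supports that is something to check in its documentation.

import numpy as np

# Hypothetical per-class weights: rarer / more valuable classes get larger weights.
# Order corresponds to classes A, B, C, D.
class_weights = np.array([1.0, 2.0, 4.0, 8.0])

def weighted_cross_entropy(logits, true_class, weights=class_weights):
    # logits: raw scores for each class, shape (n_classes,)
    # true_class: integer index of the ground-truth class
    # weights: per-class loss weights, shape (n_classes,)
    z = logits - logits.max()                 # softmax with numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    # standard cross-entropy scaled by the weight of the true class
    return -weights[true_class] * np.log(probs[true_class])

# Misclassifying a D example now costs 8x more than misclassifying an A example.
print(weighted_cross_entropy(np.array([2.0, 0.5, 0.1, 0.1]), true_class=3))
print(weighted_cross_entropy(np.array([2.0, 0.5, 0.1, 0.1]), true_class=0))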

Related

Polynomial regression in machine learning coursera

While going through Andrew Ng's Coursera course on machine learning, I came across this particular point: the price of a house might go down after a certain value of x in a quadratic regression equation. Can anyone explain why that is so?
Andrew Ng is trying to show that a quadratic function doesn't really make sense for representing the price of houses.
This is what the graph of a quadratic function might look like:
The values of a, b and c were chosen randomly for this example.
As you can see in the figure, the graph first rises to a maximum and then begins to dip. This isn't representative of the real-world since the price of a house wouldn't normally come down with an increasingly larger house.
He recommends that we use a different polynomial function to represent this problem better, such as the cubic function.
The values of a, b, c and d were chosen randomly for this example.
In reality, we would use a different method altogether for choosing the best polynomial function to fit a problem. We would try different polynomial functions on a cross-validation dataset and have an algorithm choose the best-suited one. We could also manually choose a polynomial function for a dataset if we already know the trend that our data would follow (due to prior mathematical or physical knowledge).
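As a rough illustration of that last paragraph, here is a small scikit-learn sketch (the house-size data is synthetic and the degrees tried are arbitrary) that scores several polynomial degrees by cross-validation and keeps the best one:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(0.5, 4.0, size=(200, 1))                   # house size in 1000s of sq ft (synthetic)
y = 50 + 120 * np.sqrt(X[:, 0]) + rng.normal(0, 5, 200)    # made-up, monotonically increasing price

scores = {}
for degree in (1, 2, 3, 4):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold cross-validation; higher (less negative) score is better
    scores[degree] = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()

best_degree = max(scores, key=scores.get)
print(scores)
print("best polynomial degree:", best_degree)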

Is there any algorithm good at picking out a special category?

When I look at machine learning, especially classification, I find that some algorithms, for example decision trees, are designed to classify without the consideration described next:
For a two-category problem with categories A and B, people are interested in one special category, for example A. Assume we have 100 samples of A and 1000 of B. A good classifier may produce a result that mixes 100 A and 100 B in one part and leaves 900 B in the other part. That is good from a classification point of view. But is there an algorithm that can pick, for example, 50 A and 5 B into one part and 50 A and 995 B into the other part? This may not be as good from a classification point of view, but if someone is interested in category A, the second algorithm gives a purer A result, so it is better.
In short: is there an algorithm that can purify a special category, rather than classifying all categories without bias?
It would be even better if scikit-learn includes such an algorithm.
Look into a matching algorithm such as the "Stable Marriage Problem."
https://en.wikipedia.org/wiki/Stable_marriage_problem
If I understand you correctly, I think you're asking for a machine learning algorithm that gives a higher weight to certain classes and is therefore proportionally more likely to predict those "special" classes.
If that's what you're asking, you could use any algorithm that outputs a probability for each class during prediction. I think most algorithms take that approach, actually, but I know specifically that neural nets do. Then, you can either train the network on proportionally more data from the "special" classes, or manually post-process the prediction output (the array of probabilities for each class) to adapt the probabilities to your specification.
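For the post-processing route, a minimal scikit-learn sketch (the data is synthetic and the weight values are made up) could look like this: multiply the predicted probabilities by per-class weights that favour the "special" class, re-normalise, and take the argmax.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data: class 1 plays the role of the rare "special" category.
X, y = make_classification(n_samples=1100, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)                  # shape (n_samples, 2)
class_weights = np.array([1.0, 3.0])          # hypothetical: favour class 1

adjusted = proba * class_weights
adjusted /= adjusted.sum(axis=1, keepdims=True)   # re-normalise each row
biased_pred = adjusted.argmax(axis=1)

print("plain predictions of class 1: ", (proba.argmax(axis=1) == 1).sum())
print("biased predictions of class 1:", (biased_pred == 1).sum())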

Multiclass classification growing number of classes

I am building an intent recognition system using multiclass classification with SVM.
Currently I only have a small number of classes, limited by the training data. However, in the future, I may get data with new classes. I can, of course, put all the data together and re-train the model, which is time-consuming and inefficient.
My current idea is to do one-against-one classification at the beginning, and when a new class comes in, I can just train it against all the existing classes and get n new classifiers. I am wondering if there are better methods to do that. Thanks!
The most efficient approach would be to focus on one-class classifiers; then you just need to add one new model to the ensemble. Just to compare:
Let us assume that we have K classes and you get 1 new class plus P new points from it, your whole dataset consists of N points (for simplicity, equally distributed among classes), your training algorithm's complexity is f(N), and, if your classifier supports incremental learning, its complexity is g(P, N).
OVO (one vs one) - in order to get exact results you need to train K new classifiers, each with about N/K data points, leading to O(K f(P + N/K)); there is no place to use incremental training.
OVA (one vs all) - in order to get exact results you retrain all classifiers; if done in batch fashion you need O(K f(N + P)), worse than the above. However, if you can train in incremental fashion you just need O(K g(P, N)), which might be better (depending on the classifier).
One-class ensemble - it might seem a bit weird, but Naive Bayes, for example, can be seen as such an approach: you have a generative model for each class-conditional distribution, so the model for each class is independent of the remaining ones. Thus the complexity of adding a new class is O(f(P)).
This list is obviously not exhaustive, but it should give you a general idea of what to analyze; a small sketch of the one-class ensemble idea follows below.
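Here is a hedged sketch of that last option, assuming each class can be modelled by its own Gaussian density (a simplification, of course): every class is fitted independently, so a brand-new class costs only one extra fit.

import numpy as np
from scipy.stats import multivariate_normal

class PerClassGaussianEnsemble:
    # One independent Gaussian per class; adding a class never touches the other models.
    def __init__(self):
        self.models = {}   # label -> (mean, covariance)
        self.counts = {}   # label -> number of training points (used as the prior)

    def fit_class(self, label, X):
        # Only this class's data is used: O(f(P)) for P new points.
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        self.models[label] = (X.mean(axis=0), cov)
        self.counts[label] = len(X)

    def predict(self, X):
        labels = list(self.models)
        total = sum(self.counts.values())
        # log-likelihood under each class model plus log prior
        scores = np.column_stack([
            multivariate_normal.logpdf(X, mean=self.models[l][0], cov=self.models[l][1])
            + np.log(self.counts[l] / total)
            for l in labels
        ])
        return np.array(labels)[scores.argmax(axis=1)]

rng = np.random.RandomState(0)
ens = PerClassGaussianEnsemble()
ens.fit_class("A", rng.normal(0, 1, size=(50, 2)))
ens.fit_class("B", rng.normal(3, 1, size=(50, 2)))
# A new class arrives later: only its own model is trained, nothing is retrained.
ens.fit_class("C", rng.normal(-3, 1, size=(50, 2)))
print(ens.predict(np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, -3.0]])))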

classification algorithms that return confidences?

Given a machine learning model built on top of scikit-learn, how can I classify new instances and then choose only those with the highest confidence? How do we define confidence in machine learning, and how do we generate it (if it is not generated automatically by scikit-learn)? What should I change in this approach if I had more than 2 potential classes?
This is what I have done so far:
# load libraries
from sklearn import neighbors
# initialize k-nearest-neighbours classifier with k = 3
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
# train model on six 1-D points labelled 0 or 1
knn.fit([[1], [2], [3], [4], [5], [6]], [0, 0, 0, 1, 1, 1])
# predict ::: get class probabilities (scikit-learn expects a 2-D array, one row per instance)
print(knn.predict_proba([[1.5]]))
print(knn.predict_proba([[37]]))
print(knn.predict_proba([[3.5]]))
Example:
Let's assume that we have created a model using the XYZ machine learning algorithm. Let's also assume that we are trying to classify users based on their gender using information such as location, hobbies, and income. Then, say we have 10 new instances that we want to classify. As usual, upon applying the model, we get 10 outputs, either M (for male) or F (for female). So far so good. However, I would like to somehow measure the precision of these results and then, by using a hard-coded threshold, leave out those with low precision. My question is how to measure that precision. Is the probability (as given by the predict_proba() function) a good measure? For example, can I say that if the probability is between 0.9 and 1 then "keep" (otherwise "omit")? Or should I use a more sophisticated method for doing that? As you can see, I lack theoretical background, so any help would be highly appreciated.
While this is more of a stats question I can give answers relative to scikit-learn.
Confidence in machine learning depends on the method used for the model. For example, with 3-NN (what you used), predict_proba(x) will give you n/3, where n is the number of "class 1" points among the 3 nearest neighbours of x. You can easily say that if n/3 is smaller than 0.5, there are fewer than 2 "class 1" points among the nearest neighbours and at least 2 "class 0" points. That means your x is more likely to be from "class 0". (I assume you knew that already.)
For another method like SVM, the confidence can be the distance from the point considered to the hyperplane, or for ensemble models it could be the number of aggregated votes towards a certain class. Scikit-learn's predict_proba() uses what is available from the model.
For multiclass problems (imagine Y can be equal to A, B or C), you have two main approaches, which are sometimes handled directly in scikit-learn.
The first approach is OneVsOne. It basically runs every new sample through A-vs-B, A-vs-C and B-vs-C models and takes the most probable class (imagine if A wins against B and against C, it is very likely that the right class is A; the annoying cases are resolved by taking the class with the highest confidence in its match-ups, e.g. if A wins against B, B wins against C and C wins against A, and the confidence of A winning against B is higher than the rest, it will most likely be A).
The second approach is OneVsAll, in which you compute A vs (B and C), B vs (A and C), C vs (A and B) and take the class that is the most likely by looking at the confidence scores.
Using scikit-learn's predict() will always give the most likely class based on the confidence scores that predict_proba would give.
I suggest you read this http://scikit-learn.org/stable/modules/multiclass.html very carefully.
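To make the two strategies concrete, here is a small scikit-learn sketch on a synthetic three-class problem (the data is made up); note how predict() just picks the class whose confidence score is the highest.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=0)

# One-vs-All (called One-vs-Rest in scikit-learn): one binary model per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
proba = ovr.predict_proba(X[:5])
print(proba)                                         # one probability per class and sample
print(ovr.predict(X[:5]) == proba.argmax(axis=1))    # predict() is the argmax of the scores

# One-vs-One: one binary model per pair of classes, ties broken by confidence.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovo.predict(X[:5]))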
EDIT :
Ah, I see what you are trying to do. predict_proba() has a big flaw: let's assume you have a big outlier in your new instances (e.g. a female with video games and guns as hobbies, software developer as a job, etc.). If you use, for instance, k-NN, and your outlier sits inside the other class's cloud of points, predict_proba() could give 1 as a confidence score for Male while the instance is Female. However, it will work well for indecisive cases (e.g. male or female, with video games and guns as hobbies, who works in a nursery), as predict_proba() will give something around ~0.5.
I don't know if something better can be used, though. If you have enough training samples for cross-validation, I suggest you look toward ROC and PR curves for optimizing your threshold.
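As a hedged sketch of that last suggestion (binary case, synthetic data), this picks the probability threshold from a precision/recall curve on a held-out validation set and then only keeps predictions whose confidence clears it:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
proba_val = knn.predict_proba(X_val)[:, 1]          # probability of class 1

# Pick the smallest threshold that reaches (say) 90% precision on the validation set.
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
ok = precision[:-1] >= 0.9                          # precision has one more entry than thresholds
threshold = thresholds[ok][0] if ok.any() else 0.5
print("chosen threshold:", threshold)

# At prediction time, only act on instances whose confidence clears the threshold.
confident = proba_val >= threshold
print("kept", confident.sum(), "of", len(proba_val), "validation instances")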

training time and overfitting with gamma and C in libsvm

I am now using libsvm for a support vector machine classifier with a Gaussian kernel. On its website, it provides a python script grid.py to select the best C and gamma.
I just wonder how training time and overfitting/underfitting change with gamma and C?
Is it correct that:
suppose C changes from 0 to +infinity: will the trained model go from underfitting to overfitting, and will the training time increase?
suppose gamma changes from almost 0 to +infinity: will the trained model go from underfitting to overfitting, and will the training time increase?
In grid.py, the default search order is for C from small to big BUT gamma from big to small. Is this so that the training time goes from short to long and the trained model from underfitting to overfitting? So that we can perhaps save time in selecting the values of C and gamma?
Thanks and regards!
Good question for which I don't have a sure answer, because I myself would like to know. But in response to the question:
So we can perhaps save time in selecting the values of C and gamma?
... I find that, with libsvm, there is definitely a "right" value for C and gamma that is highly problem-dependent. So regardless of the order in which gamma is searched, many candidate values for gamma must be tested. Ultimately, I don't know of any shortcut around this time-consuming (depending upon your problem) but necessary parameter search.
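Not a shortcut either, but for reference, the same exhaustive search can be run with scikit-learn's GridSearchCV, and its timing output lets you see for yourself how fit time moves with C and gamma on your own data (the grid values below are just examples):

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],            # example values
    "gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

# mean_fit_time shows how training time varies across the (C, gamma) grid
for params, fit_time, score in zip(search.cv_results_["params"],
                                   search.cv_results_["mean_fit_time"],
                                   search.cv_results_["mean_test_score"]):
    print(params, "fit_time=%.3fs" % fit_time, "cv_accuracy=%.3f" % score)
print("best:", search.best_params_)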
