Loss function for OneVsRestClassifier - machine-learning

I have a OneVsRestClassifier (scikit-learn) which has been trained.
clf = OneVsRestClassifier(LogisticRegression(C=1.2, penalty='l1')).fit(X_train, y_train)
I want to find out the loss for my test data. I used the log_loss function, but it does not seem to work because each test case has multiple classes as outputs. What should I do?

The classification problem you are referring to is known as multi-label classification. Using OneVsRestClassifier for it is a good choice. By default, the score method uses subset accuracy, which is a very harsh metric because it requires the entire set of labels for a sample to be predicted correctly.
Some other loss functions, provided by scikit-learn, that you can use are as follows:
Hamming Loss - This measures the Hamming distance between your predicted labels and the true labels, i.e. the fraction of individual labels that are predicted incorrectly.
Jaccard Similarity Coefficient Score - This measures the Jaccard similarity between your predicted labels and the true labels.
Precision, Recall and F-Measures - In multi-label classification, precision, recall and F-measures can be computed for each class independently and then combined across all labels by averaging (for example micro, macro, weighted or per-sample averaging).
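As a minimal sketch of the metrics listed above, the snippet below scores a small hand-made multi-label prediction. The toy arrays y_true and y_pred are assumptions for illustration only; with the trained classifier from the question, y_pred would presumably come from clf.predict(X_test).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss, jaccard_score

# Toy label-indicator matrices: 3 samples, 3 labels (illustrative data, not from the question)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

# Subset accuracy: a sample counts only if every one of its labels is right
subset_acc = accuracy_score(y_true, y_pred)

# Hamming loss: fraction of individual label entries that are wrong
hl = hamming_loss(y_true, y_pred)

# Jaccard score averaged over samples: |intersection| / |union| per sample
jac = jaccard_score(y_true, y_pred, average="samples")

# Micro-averaged F1: pools true/false positives over all labels
f1_micro = f1_score(y_true, y_pred, average="micro")

print(subset_acc, hl, jac, f1_micro)
```

Here only one of the three samples is matched exactly, so subset accuracy is much lower than what the per-label metrics report.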
If you also need to rank the labels, as is done in multi-label ranking problems, scikit-learn provides further ranking metrics, which are well documented with examples in its documentation. If you are dealing with this kind of problem, let me know in the comments and I will explain each of these metrics in more detail.
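For the ranking case, a short sketch using three ranking metrics from sklearn.metrics might look like the following. The y_true and y_score values are made-up toy data; in practice y_score would be per-label scores such as those from decision_function or predict_proba.

```python
import numpy as np
from sklearn.metrics import (
    coverage_error,
    label_ranking_average_precision_score,
    label_ranking_loss,
)

# Toy data: 2 samples, 3 labels, one relevant label per sample
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1.0],
                    [1.0, 0.2, 0.1]])

# How far down the ranking you must go, on average, to cover all true labels
cov = coverage_error(y_true, y_score)

# Average precision of the ranking of true labels (1.0 is a perfect ranking)
lrap = label_ranking_average_precision_score(y_true, y_score)

# Fraction of (true, false) label pairs that are ordered incorrectly
rank_loss = label_ranking_loss(y_true, y_score)

print(cov, lrap, rank_loss)
```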
Hope this helps!

Related

Sklearn models: decision function vs predict_proba for roc curve

In sklearn, roc_curve requires (y_true, y_scores). Generally, for y_scores, I feed in the probabilities output by a classifier's predict_proba function. But in the sklearn example, I see that both predict_proba and decision_function are used.
I wonder what the difference is in terms of real-life model evaluation?
The functional form of logistic regression is
$$f(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$
This is what is returned by predict_proba.
The term inside the exponential, i.e.
$$d(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
is what is returned by decision_function. The "hyperplane" referred to in the documentation is
$$\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k = 0$$
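The relationship between the two outputs can be checked numerically. The sketch below assumes a binary LogisticRegression fitted on synthetic data from make_classification (the data and variable names are illustrative). Because the sigmoid is monotonic, both outputs rank the samples the same way, so the ROC AUC should agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary problem (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

d = clf.decision_function(X)      # d(x): unbounded linear score
p = clf.predict_proba(X)[:, 1]    # f(x): probability of the positive class

# Applying the sigmoid to d(x) should reproduce f(x)
sigmoid_of_d = 1.0 / (1.0 + np.exp(-d))

# ROC analysis depends only on the ordering of the scores
auc_from_scores = roc_auc_score(y, d)
auc_from_probas = roc_auc_score(y, p)

print(np.allclose(sigmoid_of_d, p), auc_from_scores, auc_from_probas)
```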
My understanding after reading a few resources:
Decision Function: Gives the distances from the hyperplane. These are therefore unbounded and cannot be equated to probabilities. To get probabilities there are two approaches: Platt scaling, and Multi-Attribute Spaces, which calibrates outputs using Extreme Value Theory.
Predict Proba: Gives actual probabilities (between 0 and 1). For SVC, however, the 'probability' parameter has to be set to True when the model is created, before fitting. It uses Platt scaling, which is known to have theoretical issues.
Refer to the scikit-learn documentation for details.

How to use Genetic Algorithm to find weight of voting classifier in WEKA?

I am working from this article: "A novel method for predicting kidney stone type using ensemble learning". The authors used a genetic algorithm to find the optimal weight vector for voting with WEKA, but I don't see how they did that. How can I use a genetic algorithm to find the weights of a voting classifier with WEKA?
The paragraph below has been extracted from the article:
In order to enhance the performance of the voting algorithm, a weighted
majority vote is used. Simple majority vote algorithm is usually an
effective way to combine different classifiers, but not all
classifiers have the same effect on the classification problem. To
optimize the results from weight majority vote classifier, we need to
find the optimal weight vector. Applying Genetic algorithms is our
solution for finding the optimal weight vector in this problem.
Assuming you have some trained classifiers and a test set, you can create a method calculateFitness(double[] weights). In this method, for each Instance, calculate the predictions of all classifiers and a merged prediction according to the weights. Then use the merged predictions and the real values to calculate the total score you want to maximize or minimize.
Using the calculateFitness method as the fitness function, you can write a custom GA to find the best weights.
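The idea can be sketched independently of WEKA. The Python example below is not the article's implementation and does not use the WEKA API: it assumes each classifier's class probabilities are already available as an array, uses accuracy as the fitness, and runs a deliberately simple GA (elitism, uniform crossover, Gaussian mutation). All names, the synthetic data, and the GA settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def calculate_fitness(weights, probas, y_true):
    """Accuracy of the weighted vote; probas is (n_classifiers, n_samples, n_classes)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    merged = np.tensordot(w, probas, axes=1)  # weighted average, (n_samples, n_classes)
    return float(np.mean(merged.argmax(axis=1) == y_true))

def evolve_weights(probas, y_true, pop_size=30, generations=40, mutation_scale=0.1):
    n_clf = probas.shape[0]
    population = rng.random((pop_size, n_clf)) + 1e-6
    population[0] = 1.0  # seed one individual with equal weights
    n_keep = pop_size // 2
    for _ in range(generations):
        scores = np.array([calculate_fitness(ind, probas, y_true) for ind in population])
        # Elitism: the best half survives unchanged
        parents = population[np.argsort(scores)[::-1][:n_keep]]
        n_children = pop_size - n_keep
        idx_a = rng.integers(0, n_keep, size=n_children)
        idx_b = rng.integers(0, n_keep, size=n_children)
        # Uniform crossover: each gene comes from one of the two parents
        mask = rng.random((n_children, n_clf)) < 0.5
        children = np.where(mask, parents[idx_a], parents[idx_b])
        # Gaussian mutation, keeping weights strictly positive
        children = children + rng.normal(0.0, mutation_scale, size=children.shape)
        children = np.clip(children, 1e-6, None)
        population = np.vstack([parents, children])
    scores = np.array([calculate_fitness(ind, probas, y_true) for ind in population])
    best = population[scores.argmax()]
    return best / best.sum(), float(scores.max())

# Synthetic stand-in for three classifiers of different quality (illustrative only)
n_samples, n_classes = 300, 3
y_true = rng.integers(0, n_classes, size=n_samples)
one_hot = np.eye(n_classes)[y_true]
probas = []
for noise in (0.6, 1.0, 2.0):
    logits = one_hot + rng.normal(0.0, noise, size=one_hot.shape)
    exp = np.exp(logits)
    probas.append(exp / exp.sum(axis=1, keepdims=True))
probas = np.stack(probas)

equal_score = calculate_fitness(np.ones(3), probas, y_true)
best_weights, best_score = evolve_weights(probas, y_true)
print(equal_score, best_score, best_weights)
```

In a real setup the fitness would more safely be computed on a held-out validation set rather than on the final test set, so that the reported test score is not tuned on the same data.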

Multi-label classification Keras metrics

Which metric is better for multi-label classification in Keras: accuracy or categorical_accuracy? Obviously the last activation function is sigmoid and the loss function is binary_crossentropy in this case.
I would not use accuracy for classification tasks with unbalanced classes.
Especially for multi-label tasks, most of your labels are probably False. That is, each data point has only a small set of labels compared to the cardinality of all possible labels.
For that reason accuracy is not a good metric: if your model predicts all False (sigmoid activation output < 0.5), you would still measure a very high accuracy.
I would analyze either the AUC or recall/precision at each epoch.
Alternatively, a multi-label task can be seen as a ranking task (as in recommender systems), and you could evaluate precision@k or recall@k, where k is the number of top predicted labels considered.
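A small NumPy sketch can make both points concrete. The label matrix, the scores, and the helper functions precision_at_k and recall_at_k below are hypothetical illustrations, not Keras APIs.

```python
import numpy as np

# Sparse multi-label targets: 4 samples, 10 labels, only 5 positives (toy data)
y_true = np.zeros((4, 10), dtype=int)
y_true[0, 2] = y_true[1, 5] = y_true[2, 0] = y_true[2, 7] = y_true[3, 9] = 1

# A useless model that predicts False for every label
y_all_false = np.zeros_like(y_true)
elementwise_accuracy = float(np.mean(y_all_false == y_true))  # high despite finding nothing
true_positives = int(np.sum((y_all_false == 1) & (y_true == 1)))
recall = true_positives / int(y_true.sum())

def precision_at_k(y_true, y_score, k):
    """Mean fraction of the k top-scored labels that are true."""
    top_k = np.argsort(-y_score, axis=1)[:, :k]
    hits = np.take_along_axis(y_true, top_k, axis=1)
    return float(hits.mean())

def recall_at_k(y_true, y_score, k):
    """Mean fraction of each sample's true labels found among its k top-scored labels."""
    top_k = np.argsort(-y_score, axis=1)[:, :k]
    hits = np.take_along_axis(y_true, top_k, axis=1).sum(axis=1)
    return float(np.mean(hits / y_true.sum(axis=1)))

# Hypothetical sigmoid outputs of a model that ranks most true labels near the top
y_score = np.full((4, 10), 0.1)
y_score[0, 2] = 0.9
y_score[1, 5] = 0.8
y_score[2, 0], y_score[2, 3], y_score[2, 7] = 0.7, 0.6, 0.2
y_score[3, 4], y_score[3, 9] = 0.9, 0.5

p_at_2 = precision_at_k(y_true, y_score, k=2)
r_at_2 = recall_at_k(y_true, y_score, k=2)
print(elementwise_accuracy, recall, p_at_2, r_at_2)
```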
If your Keras back-end is TensorFlow, check out the full list of supported metrics here: https://www.tensorflow.org/api_docs/python/tf/keras/metrics.
Actually, there is no metric named accuracy in Keras. When you set metrics=['accuracy'] in Keras, the correct accuracy metric is inferred automatically based on the loss function used. As a result, since you have used binary_crossentropy as the loss function, binary_accuracy will be chosen as the metric.
Now, you should definitely choose binary_accuracy over categorical_accuracy in a multi-label classification task since classes are independent from each other and the prediction for each class should be considered independently of the predictions for other classes.
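To see why this matters, here is a sketch with simplified NumPy re-implementations of the two metrics (thresholded element-wise comparison versus argmax comparison). These functions only mimic the behaviour described above and are not the Keras implementations; the arrays are toy data.

```python
import numpy as np

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Each label is thresholded and compared on its own."""
    return float(np.mean((y_pred > threshold).astype(int) == y_true))

def categorical_accuracy(y_true, y_pred):
    """Only the position of the largest value per sample is compared."""
    return float(np.mean(y_true.argmax(axis=1) == y_pred.argmax(axis=1)))

# Two samples, each with two true labels; the model finds only one label per sample
y_true = np.array([[1, 1, 0],
                   [0, 1, 1]])
y_pred = np.array([[0.9, 0.2, 0.1],
                   [0.1, 0.8, 0.3]])

bin_acc = binary_accuracy(y_true, y_pred)
cat_acc = categorical_accuracy(y_true, y_pred)
print(bin_acc, cat_acc)
```

The argmax-based metric reports a perfect score even though one true label per sample was missed, while the per-label metric reflects the two misses out of six labels.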

Imbalanced dataset doesn't produce good 'Precision' or 'Recall'

The dataset is extremely imbalanced: the positive results are only approximately 10% compared to the negative results, e.g. (0: 11401, 1: 1280).
I have tried
1. RandomForestClassifier with GridSearchCV - hyper parameter tuning.
2. Weighted RandomForest with class_weight="balanced"
3. Penalised SVC
4. UpSampling and DownSampling
Still, I don't get good precision or recall with any of the above methods.
I'm aware that prevalence is related to PPV, and my dataset has very few samples of class 1. Random Forest may also lean towards the majority class.
But I was hoping sampling would work, and it didn't. Am I missing something? Any suggestions would be really appreciated.
A few methods should help you:
Predict probabilities and do manual thresholding.
Change the loss/metric you are using.
For an imbalanced dataset (outlier detection) you shouldn't use class_weight="balanced" but put more weight on the outliers.
Try other algorithms to see if some do better (XGBoost, CatBoost, LightGBM if you want to stick with tree-based solutions).
You can also use TPOT to find the best algorithm in sklearn for your particular dataset.
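The first suggestion, manual thresholding, could look roughly like the sketch below. It assumes synthetic data with about 10% positives from make_classification and picks the threshold that maximizes F1 on a validation split; the data, the split, and the choice of F1 as the target are illustrative assumptions, not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, roughly 90% negatives and 10% positives
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

# Precision and recall for every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_val, proba)
p, r = precision[:-1], recall[:-1]  # the last point has no associated threshold
denom = p + r
f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)
best_threshold = float(thresholds[f1.argmax()])

# Compare the default 0.5 cut-off with the tuned one
default_f1 = f1_score(y_val, (proba >= 0.5).astype(int))
tuned_f1 = f1_score(y_val, (proba >= best_threshold).astype(int))
print(best_threshold, default_f1, tuned_f1)
```

The threshold is chosen on the validation split here; a separate test set would still be needed for an unbiased estimate of the final precision and recall.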
Let me know if any of these helped you.

What's the meaning of logistic regression dataset labels?

I've been learning logistic regression for a few days, and I think the labels of a logistic regression dataset need to be 1 or 0. Is that right?
But when I look up the libSVM library's regression datasets, I see that the label values are continuous numbers (e.g. 1.0086, 1.0089 ...). Did I miss something?
Note that the libSVM library can be used for regression problems.
Thanks so much!
Contrary to its name, logistic regression is a classification algorithm and it outputs class probability conditioned on the data point. Therefore the training set labels need to be either 0 or 1. For the dataset you mentioned, logistic regression is not a suitable algorithm.
SVM is a classification algorithm and it uses the input labels -1 or 1. It is not a probabilistic algorithm and it doesn't output class probabilities. It also can be adapted to regression.
Are you using a 3rd party library or programming this yourself? Generally the labels are used as ground truth so you can see how effective your approach was.
For example, if your algorithm is trying to predict what a particular instance is, it might output -1 while the ground truth label is +1, which means you did not successfully classify that particular instance.
Note that "regression" is a general term. To say someone will perform regression analysis doesn't necessarily tell you what algorithm they will be using, nor all of the nature of the data sets. All it really tells you is that you have a set of samples with features which you want to use to predict a single outcome value (a model for conditional probability).
One major difference between logistic regression and linear regression is that the former is usually trained on categorical, binary-labeled sample sets; while the latter is trained on real-labeled (ℝ) sample sets.
Any time your labels are real valued, it means you're probably going to use linear regression or similar, or else convert those real valued labels to categorical labels (e.g. via thresholds or bins) if you want to in fact use logistic regression. There is potentially a big difference in the quality and interpretation of your results though, if you try to convert from one such problem setup to another.
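As a rough illustration of the difference, the sketch below uses scikit-learn on synthetic data (the data, the coefficients, and the median threshold are assumptions made for the example). Linear regression accepts the real-valued labels directly, whereas scikit-learn's LogisticRegression is expected to reject them until they are converted to classes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Real-valued labels (illustrative): a linear signal plus a little noise
y_real = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=200)

# Linear regression is trained on the real-valued labels as they are
reg = LinearRegression().fit(X, y_real)
r2 = reg.score(X, y_real)

# Logistic regression in scikit-learn is a classifier and refuses continuous targets
try:
    LogisticRegression().fit(X, y_real)
    raised = False
except ValueError:
    raised = True

# Converting to categorical labels via a threshold makes logistic regression applicable
y_binary = (y_real > np.median(y_real)).astype(int)
clf = LogisticRegression().fit(X, y_binary)
acc = clf.score(X, y_binary)

print(r2, raised, acc)
```

Thresholding throws away information about how far each value lies from the cut-off, which is one reason the two setups can give results that differ in quality and interpretation.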
See also Regression Analysis.
