My target is a range from 1 to 5. Is there a way to force the model to predict only within this range?
Regardless of the model I use, I sometimes get negative values and values greater than 5.
You can use a model that supports multiclass classification, such as Softmax Regression. This algorithm is a generalization of logistic regression that can classify N classes where N > 2.
The hard prediction of your model can be:
1 2 3 4 5
0 0 0 1 0
Which means that the prediction is 4
or it can be a soft prediction:
1 2 3 4 5
0.1 0.1 0.6 0.1 0.1
Which is a probability distribution over the classes, so you can also tell how confident your model is (here the prediction would be 3, with 60% confidence).
Scikit-learn implements softmax regression within the LogisticRegression algorithm itself: you enable it by specifying the parameter multi_class="multinomial".
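A minimal sketch of that setup (the data here is randomly generated purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 4)             # hypothetical feature matrix
y = np.random.randint(1, 6, size=100)  # hypothetical targets in {1, ..., 5}

clf = LogisticRegression(multi_class="multinomial", solver="lbfgs")
clf.fit(X, y)

print(clf.predict(X[:3]))        # hard predictions, always one of 1..5
print(clf.predict_proba(X[:3]))  # soft predictions: one probability per class

Because the targets are treated as classes rather than as a continuous value, the predictions can never fall outside the set {1, ..., 5}.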
Hello everyone, I'm new to this area, and I was wondering if anyone could help me understand the results of a logistic regression.
I need to understand whether the independent variables can be used to make a good classification.
=== Run information ===
Scheme: weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -num-decimal-places 4
Relation: Train
Instances: 14185
Attributes: 5
ATTR_1
ATTR_2
ATTR_3
ATTR_4
DEPENDENT_VAR
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
Class
Variable 0
====================
ATTR_1 0.0022
ATTR_2 0.0022
ATTR_3 0.0034
ATTR_4 -0.0021
Intercept 0.9156
Odds Ratios...
Class
Variable 0
====================
ATTR_1 1.0022
ATTR_2 1.0022
ATTR_3 1.0034
ATTR_4 0.9979
Time taken to build model: 0.13 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0.07 seconds
=== Summary ===
Correctly Classified Instances 51240 72.2453 %
Incorrectly Classified Instances 19685 27.7547 %
Kappa statistic -0.0001
Mean absolute error 0.3992
Root mean squared error 0.4467
Relative absolute error 99.5581 %
Root relative squared error 99.7727 %
Total Number of Instances 70925
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 1.000 0.723 1.000 0.839 -0.005 0.545 0.759 0
0.000 0.000 0.000 0.000 0.000 -0.005 0.545 0.305 1
Weighted Avg. 0.722 0.723 0.522 0.722 0.606 -0.005 0.545 0.633
=== Confusion Matrix ===
a b <-- classified as
51240 5 | a = 0
19680 0 | b = 1
In particular, I am interested in understanding the values of the coefficients and the odds ratios.
Thanks.
Off the top of my head:
Odds ratios and coefficient values are directly related and can be calculated from each other: the odds ratio is simply exp(coefficient).
For ATTR_1, exp(0.0022) ≈ 1.0022.
For doing further calculations and fitting/predicting, coefficients are "better". However, coefficients are values that must be plugged into exp(x) functions, which makes them somewhat difficult to visualize in your head.
For human understanding, odds ratios are often more convenient: they are easier to interpret and visualize, but you can't do certain calculations directly with them.
Weka does not know which one you are more interested in, so it gives you both for convenience.
By the way, Weka does regularized logistic regression ("Logistic Regression with ridge parameter of 1.0E-8"), so the coefficients might differ slightly from those that a different software package would give you.
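As a quick check of that relationship against the output above, a small Python sketch:

import math

# coefficients copied from the Weka output; exp(coef) reproduces the odds ratios
coefficients = {"ATTR_1": 0.0022, "ATTR_2": 0.0022, "ATTR_3": 0.0034, "ATTR_4": -0.0021}
for name, coef in coefficients.items():
    print(name, round(math.exp(coef), 4))  # 1.0022, 1.0022, 1.0034, 0.9979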
I'm trying to use Vowpal Wabbit to predict probabilities given an existing set of statistics. My txt file looks like this:
0.22 | Features1
0.28 | Features2
Now, given this example, I want to predict the label (probability) for Features3. I'm trying to use logistic regression:
vw -d ds.vw.txt -f model.p --loss_function=logistic --link=logistic -p probs.txt
But I get the error:
You are using label 0.00110011 not -1 or 1 as loss function expects!
You are using label 0.00559702 not -1 or 1 as loss function expects!
etc..
How can I use these statistics as labels to predict probabilities?
To predict a continuous label you need to use one of the following loss functions:
--loss_function squared # optimizes for min loss vs mean
--loss_function quantile # optimizes for min loss vs median
--loss_function squared is the vw default, so you may leave it out.
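So the command from the question could simply become (a sketch, keeping the question's file names and letting the squared loss apply by default):

vw -d ds.vw.txt -f model.p -p preds.txt

Note that --link=logistic is dropped as well, since the raw prediction is then already on the label's scale.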
Another trick you may use is to map your probability range into [-1, 1] by mapping the mid-point 0.5 to 0.0 using the function (2*probability - 1). You can then use --loss_function logistic, which requires binary labels (-1 and 1), but follow the labels with abs(probability) as a floating-point weight:
1 0.22 | features...
-1 0.28 | features...
This may or may not work better for your particular data (you'll have to hold out some of your data and test the different models for accuracy).
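A sketch of one reading of that relabeling (the file names are the question's; here the sign of the mapped value becomes the binary label and its magnitude the importance weight):

# rewrite "probability | features" lines into signed-label, weighted vw lines
with open("ds.vw.txt") as src, open("ds_signed.vw.txt", "w") as dst:
    for line in src:
        prob_str, features = line.split("|", 1)
        centered = 2 * float(prob_str) - 1  # map [0, 1] onto [-1, 1]
        label = 1 if centered >= 0 else -1
        dst.write(f"{label} {abs(centered):.4f} |{features}")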
Background regarding binary outcomes: vw's "starting point" (i.e. the null, or initial, model) is 0.0 weights everywhere. This is why, when you're doing a logistic regression, the negative and positive labels must be -1 and 1 (rather than 0 and 1), respectively.
I'd like to classify a set of 3D images (MRI). There are 4 classes (i.e. grades of disease A, B, C, D), and the distinction between the 4 grades is not trivial, so the labels I have for the training data are not one class per image. Each label is a set of 4 probabilities, one per class, e.g.
0.7 0.1 0.05 0.15
0.35 0.2 0.45 0.0
...
... would basically mean that
The first image belongs to class A with a probability of 70%, class B with 10%, C with 5% and D with 15%
etc., I'm sure you get the idea.
I don't understand how to fit a model with these labels, because scikit-learn classifiers expect only one label per training sample. Using just the class with the highest probability gives miserable results.
Can I train my model with scikit-learn multilabel classification (and how)?
Please note:
Feature extraction is not the problem.
Prediction is not the problem.
Can I handle this somehow with the multilabel classification framework?
For predict_proba to return the probability for each class A, B, C, D, the classifier needs to be trained with one label per image.
If yes: How?
Use the image class as the label (Y) in your training set. That is, your input dataset will look something like this:
F1 F2 F3 F4 Y
1 0 1 0 A
0 1 1 1 B
1 0 0 0 C
0 0 0 1 D
(...)
where the F# columns are the features of each image and Y is the class assigned by the doctors.
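A minimal sketch of that setup (the feature values and labels are the toy ones from the table above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 0, 1]])         # F1..F4 per image
y = np.array(["A", "B", "C", "D"])   # one doctor-assigned class per image

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict_proba(X))  # one probability per class, per image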
If no: Any other approaches?
For the case where you have more than one label per image, that is, multiple potential classes or their respective probabilities, multilabel models might be a more appropriate choice, as documented in Multiclass and multilabel algorithms.
I am having trouble with a classification problem.
I have almost 400k vectors in the training data with two labels, and I'd like to train an MLP that classifies the data into two classes.
However, the dataset is very imbalanced: 95% of the samples have label 1, and the others have label 0. The accuracy grows as training progresses and stops after reaching 95%. I guess this is because the network predicts label 1 for all vectors.
So far, I have tried dropout layers with 0.5 probability, but the result is the same. Is there any way to improve the accuracy?
I think the best way to deal with imbalanced data is to use class weights. For example, you can weight your classes such that the sum of the weights for each class is equal.
import pandas as pd

df = pd.DataFrame({'x': range(7),
                   'y': [0] * 2 + [1] * 5})
# give each sample a weight so that the total weight per class is len(df) / 2
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())
print(df)
print(df.groupby('y')['weight'].agg(samples='count', weight='sum'))
output:
   x  y  weight
0  0  0    1.75
1  1  0    1.75
2  2  1    0.70
3  3  1    0.70
4  4  1    0.70
5  5  1    0.70
6  6  1    0.70
   samples  weight
y
0        2     3.5
1        5     3.5
You could try another classifier on a subset of the examples. SVMs may work well with small data, so you could take, say, only 10k examples, with a 5/1 proportion between the classes.
You could also oversample the small class and undersample the other.
You can also simply weight your classes (a sketch follows below).
Also think about a proper metric. It's good that you noticed that your model predicts only one label; that, however, is not easily seen using accuracy.
Some nice ideas about unbalanced datasets are collected here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.
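For the class-weighting suggestion above, a minimal sketch using scikit-learn's helper (the 95/5 split mirrors the question; how you feed these weights into training depends on your framework, since sklearn's own MLPClassifier does not accept class weights):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([1] * 95 + [0] * 5)  # hypothetical 95/5 imbalanced labels
classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
print(dict(zip(classes, weights)))  # {0: 10.0, 1: ~0.53}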
That's a common situation: the network learns a constant and can't get out of that local minimum.
When the data is very unbalanced, as in your case, one possible solution is a weighted cross-entropy loss function. For instance, in TensorFlow, you can apply the built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
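A minimal sketch of that loss (the labels and logits are made up; pos_weight > 1 up-weights errors on the positive class):

import tensorflow as tf

labels = tf.constant([[1.0], [0.0], [1.0]])   # hypothetical binary targets
logits = tf.constant([[0.3], [-0.2], [1.5]])  # raw (pre-sigmoid) model outputs

# pos_weight scales the loss on positive examples; e.g. 19.0 ~ 95/5 would be
# a natural choice if the positive class made up only 5% of the data
loss = tf.nn.weighted_cross_entropy_with_logits(
    labels=labels, logits=logits, pos_weight=19.0)
print(tf.reduce_mean(loss))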
But I should say that getting more data to balance both classes (if that's possible) will always help.
I have a multiclass classification task with 10 classes. As such, I used sklearn's OneHotEncoder to transform the one-column labels into 10-column labels. I was trying to fit the training data. Although I was able to do this with RandomForestClassifier, I got the error message below when fitting with GaussianNB:
ValueError: bad input shape (1203L, 10L)
I understand the allowed shape of y in these two classifiers is different:
GaussianNB:
y : array-like, shape (n_samples,)
RandomForest:
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The question is, why is this? Wouldn't this contradict "All classifiers in scikit-learn do multiclass classification out-of-the-box"? Is there any way to get around it? Thanks!
The question is, why is this?
It is because of a slight misunderstanding: in scikit-learn you do not encode the labels; you pass them as a one-dimensional vector of labels. Thus, instead of
1 0 0
0 1 0
0 0 1
you literally pass
1 2 3
So why does random forest accept a different scheme? Because it is not for the multiclass setting! It is for the multilabel setting, where each instance can have many labels, like
1 1 0
1 1 1
0 0 0
Wouldn't this contradict "All classifiers in scikit-learn do multiclass classification out-of-the-box"?
On the contrary, it is the easiest solution: never ask for one-hot encoding unless the task is multilabel.
Is there any way to get around it?
Yup, just do not encode; pass the raw labels :-)
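A minimal sketch of that (random data, just to show the expected label shape):

import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.random.rand(50, 4)              # hypothetical features
y = np.random.randint(0, 10, size=50)  # raw labels 0..9, shape (n_samples,)

clf = GaussianNB().fit(X, y)  # no shape error: y is not one-hot encoded
print(clf.predict(X[:5]))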