I'd like to classify a set of 3D images (MRI). There are 4 classes (i.e. grades of disease A, B, C, D), where the distinction between the 4 grades is not trivial; therefore the labels I have for the training data are not one class per image. Each label is a set of 4 probabilities, one per class, e.g.
0.7 0.1 0.05 0.15
0.35 0.2 0.45 0.0
...
This would basically mean that the first image belongs to class A with a probability of 70%, class B with 10%, C with 5% and D with 15%, etc. I'm sure you get the idea.
I don't understand how to fit a model with these labels, because scikit-learn classifiers expect only one label per training sample. Using just the class with the highest probability gives miserable results.
Can I train my model with scikit-learn multilabel classification (and how)?
Please note:
Feature extraction is not the problem.
Prediction is not the problem.
Can I handle this somehow with the multilabel classification framework?
For predict_proba to return the probability for each class A, B, C, D, the classifier needs to be trained with one label per image.
If yes: How?
Use the image class as the label (Y) in your training set. That is, your input dataset will look something like this:
F1 F2 F3 F4 Y
1  0  1  0  A
0  1  1  1  B
1  0  0  0  C
0  0  0  1  D
(...)
where F# are the features for each image and Y is the class as assigned by the doctors.
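A minimal sketch of this single-label route, assuming the features are already extracted (the hard label is the argmax of each probability vector; RandomForestClassifier is just an example estimator):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: (n_samples, n_features) extracted image features (placeholder values here)
X = np.random.rand(2, 10)
# P: (n_samples, 4) per-class probabilities provided with the training data
P = np.array([[0.7, 0.1, 0.05, 0.15],
              [0.35, 0.2, 0.45, 0.0]])

y = P.argmax(axis=1)  # hard label = most probable class per image

clf = RandomForestClassifier().fit(X, y)
print(clf.predict_proba(X))  # per-class probabilities at prediction time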
If no: Any other approaches?
For the case where you have more than one label per image, that is, multiple potential classes or their respective probabilities, multilabel models might be a more appropriate choice, as documented in Multiclass and multilabel algorithms.
My dataset is approximately balanced: 52/48. I evaluate both ACC and the F1-score. The results returned by the random forest model are below:
Acc: 52%
F1: 68%
Confusion matrix:

              Predicted
Label        0        1
0           52   122109
1           19   134802
I know that if I switch the labels (0 as 1 and vice versa), the F1 score will be very small. So, when using F1, should I always switch the labels?
The interpretation of the F1 score depends entirely on an arbitrary choice of labels (this is buried within its formulation). The F1 score is therefore most suitable for cases where the class labels actually mean and correspond to negative and positive in real life (e.g., presence of cancer) and where there is an imbalanced class distribution (particularly when the negatives significantly outnumber the positives). Since your data is balanced and it seems you can arbitrarily switch the labels too, the F1 score may not be a suitable metric to use.
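A quick way to see this dependence with scikit-learn's f1_score (the toy labels below are made up):

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1]  # degenerate classifier that always predicts 1

print(f1_score(y_true, y_pred, pos_label=1))  # 0.75
print(f1_score(y_true, y_pred, pos_label=0))  # 0.0 (precision is undefined, so it is set to 0)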
These may also help: 1 2 3
I am having trouble with a classification problem.
I have almost 400k vectors in the training data with two labels, and I'd like to train an MLP that classifies the data into two classes.
However, the dataset is very imbalanced: 95% of the samples have label 1, and the rest have label 0. The accuracy grows as training progresses and stops after reaching 95%. I guess this is because the network predicts label 1 for all vectors.
So far, I have tried dropout layers with 0.5 probability, but the result is the same. Are there any ways to improve the accuracy?
I think the best way to deal with unbalanced data is to use class weights. For example, you can weight your classes such that the sum of the weights for each class is equal.
import pandas as pd

df = pd.DataFrame({'x': range(7),
                   'y': [0] * 2 + [1] * 5})
# weight each sample so that both classes contribute the same total weight
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())
print(df)
print(df.groupby('y')['weight'].agg(samples='size', weight='sum'))
output:
   x  y  weight
0  0  0    1.75
1  1  0    1.75
2  2  1    0.70
3  3  1    0.70
4  4  1    0.70
5  5  1    0.70
6  6  1    0.70

   samples  weight
y
0        2     3.5
1        5     3.5
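These weights can then be fed to whatever trainer you use; a sketch, assuming a small Keras MLP (the architecture here is made up) and continuing from the df above:

import tensorflow as tf

X = df[['x']].values  # toy feature matrix from the example above
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu'),
                             tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')
# per-sample weights make the rare class count as much as the common one
model.fit(X, df['y'].values, sample_weight=df['weight'].values, epochs=5)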
You could also try another classifier on a subset of the examples. SVMs may work well with small data, so you could take, say, only 10k examples with a 5:1 class proportion.
You could also somehow oversample the minority class and under-sample the majority class (see the sketch below).
You can also simply weight your classes.
Also think about a proper metric. It's good that you noticed that your model predicts only one label; this is, however, not easily seen using accuracy.
Some nice ideas about unbalanced dataset here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.
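For the resampling suggestion, a sketch using the third-party imbalanced-learn package (an assumed tooling choice; manual index resampling with numpy would do the same):

import numpy as np
from imblearn.over_sampling import RandomOverSampler

X = np.arange(6).reshape(-1, 1)   # toy features
y = np.array([0, 1, 1, 1, 1, 1])  # imbalanced toy labels

# duplicates minority-class rows until both classes have equal counts
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))  # [5 5]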
That's a common situation: the network learns a constant and can't get out of this local minimum.
When the data is very unbalanced, like in your case, one possible solution is a weighted cross-entropy loss function. For instance, in TensorFlow you can use the built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
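A minimal sketch of that loss, assuming a binary classifier with raw logits (pos_weight up-weights the rare class; the value 19.0 here just reflects a 95/5 split):

import tensorflow as tf

labels = tf.constant([[0.], [1.], [1.]])      # true labels (toy data)
logits = tf.constant([[-1.2], [0.3], [2.1]])  # raw model outputs before the sigmoid

# weight positives by roughly (number of negatives / number of positives)
loss = tf.reduce_mean(
    tf.nn.weighted_cross_entropy_with_logits(labels=labels, logits=logits,
                                             pos_weight=19.0))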
But I should say that getting more data to balance both classes (if that's possible) will always help.
Example: I have a sentence from a job description: "Java senior engineer in UK".
I want to use a deep learning model to assign it to 2 categories: English and IT jobs. A traditional classification model can only predict 1 label, via the softmax function at the last layer. I could train one neural network per category to predict "Yes"/"No", but with more categories that becomes too expensive. So is there any deep learning or machine learning model that can predict 2 or more categories at the same time?
Edit: with 3 labels, the traditional approach would encode a target as [1,0,0], but in my case it would be encoded as [1,1,0] or [1,1,1].
Example: if we have 3 labels and a sentence may fit all of them, and the output from the softmax function is [0.45, 0.35, 0.2], should we classify it into 3 labels, 2 labels, or maybe just one?
The main problem is: what is a good threshold for deciding between 1, 2, or 3 labels?
If you have n different categories which can be true at the same time, have n outputs in your output layer with a sigmoid activation function. This will give each output a value between 0 and 1 independently.
Your loss function should be the mean of the negative log likelihood of the outputs. In tensorflow, this is:
linear_output = ...  # the output layer before applying the activation function
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=linear_output, labels=correct_outputs))
output = tf.sigmoid(linear_output)  # 0 to 1 for each category
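As for the threshold question: with independent sigmoid outputs there is no competition between classes, so a fixed cut-off (commonly 0.5, tunable per label on validation data) is a reasonable starting point, e.g.:

predicted_labels = tf.cast(output > 0.5, tf.int32)  # 1 for every category above the cut-off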
I have to implement a naive Bayes classifier for classifying a document into a class. So, for the conditional probability of a term belonging to a class, with Laplace smoothing, we have:
prob(t | c) = (Num(word occurrences in the docs of class c) + 1) / (Num(documents in class c) + |V|)
It's a Bernoulli model, so the features are either 1 or 0, and the vocabulary is really large, perhaps 20000 words or so. So won't the Laplace smoothing give really small values due to the large size of the vocabulary, or am I doing something wrong?
According to the pseudocode from this link: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html, for the Bernoulli model we just add 2 instead of |V|. Why so?
Consider the case of multinomial naive Bayes. The smoothing you defined above is such that you can never get a zero probability.
With the multivariate/Bernoulli case, there is an additional constraint: probabilities of exactly 1 are not allowed either. This is because when some t from the known vocabulary is not present in the document d, a probability of 1 - prob(t | c) is multiplied to the document probability. If prob(t | c) is 1, then once again this is going to produce a posterior probability of 0.
(Likewise, when using logs instead, log(1 - prob(t | c)) is undefined when the probability is 1)
So in the Bernoulli equation (Nct + 1) / (Nc + 2), both cases are protected against. If Nct == Nc, then the probability will be 1/2 rather than 1. This also has the consequence of producing a likelihood of 1/2 regardless of whether t exists (P(t | c) == 1/2) or not (1 - P(t | c) == 1/2).
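A small sketch of the smoothed estimate and why it keeps the absence factor safe (the names here are made up for illustration):

def bernoulli_prob(n_tc, n_c):
    """Smoothed P(t | c): n_tc = docs of class c containing t, n_c = docs in class c."""
    return (n_tc + 1) / (n_c + 2)

# Even when t occurs in every document of class c (n_tc == n_c), the estimate
# stays below 1, so 1 - P(t | c) can never zero out the document likelihood:
p = bernoulli_prob(100, 100)  # 101/102, not 1.0
absent_factor = 1 - p         # > 0, safe to multiply in or take the log of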
My samples can either belong to class 0 or class 1, but for some of my samples I only have a probability of belonging to class 1. So far I've discretized my target variable by applying a threshold, i.e. all y >= t were assigned to class 1, and I've discarded all remaining samples with a non-zero probability of belonging to class 1. Then I fitted a linear SVM to the data using scikit-learn.
Of course, this way I throw away quite a bit of the training data. One idea I had was to skip the discretization and use regression instead, but it's usually not a good idea to approach classification with regression; for example, it doesn't guarantee that predicted values lie in the interval [0,1].
By the way, the nature of my features x is similar, as for some of them I also only have probabilities for the respective feature being present. For the error, it didn't make a big difference whether I discretized my features in the same way I discretized the dependent variable.
You might be able to approximate this using sample weighting: assign each sample to the class with the highest probability, but weight that sample by the probability of it actually belonging to that class. Many of the scikit-learn estimators allow for this.
Example:
X = [1, 2, 3, 4] with class 0 probability .7 would become X = [1, 2, 3, 4], y = [0] with a sample weight of .7. You might also normalize so the sample weights are between 0 and 1 (since your probabilities and sample weights will only range from .5 to 1 in this scheme). You could also incorporate non-linear penalties to "strengthen" the influence of high-probability samples.
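A minimal sketch with LinearSVC on toy data (any scikit-learn estimator whose fit accepts sample_weight works the same way):

import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[1, 2, 3, 4],
              [4, 3, 2, 1]])
p_class1 = np.array([0.3, 0.9])  # given probabilities of belonging to class 1

y = (p_class1 >= 0.5).astype(int)                   # hard label = most probable class
weights = np.where(y == 1, p_class1, 1 - p_class1)  # confidence in that hard label

clf = LinearSVC().fit(X, y, sample_weight=weights)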