I run SVM-Light classifier but the recall/precision row it outputs seem to be corrupted:
Reading model...OK. (20 support vectors read)
Classifying test examples..100..200..done
Runtime (without IO) in cpu-seconds: 0.00
Accuracy on test set: 95.50% (191 correct, 9 incorrect, 200 total)
Precision/recall on test set: 0.00%/0.00%
What should I configure to get valid precision and recall?
For example, if your classifier is always predicting "-1" -- the negative class; your test dataset, however, contains 191 "-1" and 9 "+1" as golden labels, you will get 191 of them correctly classified and 9 of them incorrect.
True positives : 0 (TP)
True negatives : 191 (TN)
False negatives: 9 (FN)
False positives: 0 (FP)
Thus:
TP 0
Precision = ----------- = --------- = undefined
TP + FP 0 + 0
TP 0
Recall = ----------- = --------- = 0
TP + FN 0 + 9
From the formula above, you know that as long as your TP is zero, your precision/recall is either zero or undefined.
To debug, you should output (for each test example) the golden label and the predicted label so that you know where the issue is.
Thank you greeness. your answer helped me too.
To avoid this issue, make sure that the test and training datasets are chosen/grouped such that they have a fair mix of positive and negative values.
Related
Summary & Questions
I'm using liblinear 2.30 - I noticed a similar issue in prod, so I tried to isolate it through a simple reduced training with 2 classes, 1 train doc per class, 5 features with same weight in my vocabulary and 1 simple test doc containing only one feature which is present only in class 2.
a) what's the feature value being used for?
b) I wanted to understand why this test document containing a single feature which is only present in one class is not being strongly predicted into that class?
c) I'm not expecting to have different values per features. Is there any other implications by increasing each feature value from 1 to something-else? How can I determine that number?
d) Could my changes affect other more complex trainings in a bad way?
What I tried
Below you will find data related to a simple training (please focus on feature 5):
> cat train.txt
1 1:1 2:1 3:1
2 2:1 4:1 5:1
> train -s 0 -c 1 -p 0.1 -e 0.01 -B 0 train.txt model.bin
iter 1 act 3.353e-01 pre 3.333e-01 delta 6.715e-01 f 1.386e+00 |g| 1.000e+00 CG 1
iter 2 act 4.825e-05 pre 4.824e-05 delta 6.715e-01 f 1.051e+00 |g| 1.182e-02 CG 1
> cat model.bin
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
0.3374141436539016
0
0.3374141436539016
-0.3374141436539016
-0.3374141436539016
0
And this is the output of the model:
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
0.3374141436539016
0
0.3374141436539016
-0.3374141436539016
-0.3374141436539016
0
1 5:10
Below you will find my model's prediction:
> cat test.txt
1 5:1
> predict -b 1 test.txt model.bin test.out
Accuracy = 0% (0/1)
> cat test.out
labels 1 2
2 0.416438 0.583562
And here is where I'm a bit surprised because of the predictions being just [0.42, 0.58] as the feature 5 is only present in class 2. Why?
So I just tried with increasing the feature value for the test doc from 1 to 10:
> cat newtest.txt
1 5:10
> predict -b 1 newtest.txt model.bin newtest.out
Accuracy = 0% (0/1)
> cat newtest.out
labels 1 2
2 0.0331135 0.966887
And now I get a better prediction [0.03, 0.97]. Thus, I tried re-compiling my training again with all features set to 10:
> cat newtrain.txt
1 1:10 2:10 3:10
2 2:10 4:10 5:10
> train -s 0 -c 1 -p 0.1 -e 0.01 -B 0 newtrain.txt newmodel.bin
iter 1 act 1.104e+00 pre 9.804e-01 delta 2.508e-01 f 1.386e+00 |g| 1.000e+01 CG 1
iter 2 act 1.381e-01 pre 1.140e-01 delta 2.508e-01 f 2.826e-01 |g| 2.272e+00 CG 1
iter 3 act 2.627e-02 pre 2.269e-02 delta 2.508e-01 f 1.445e-01 |g| 6.847e-01 CG 1
iter 4 act 2.121e-03 pre 1.994e-03 delta 2.508e-01 f 1.183e-01 |g| 1.553e-01 CG 1
> cat newmodel.bin
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
0.19420510395364846
0
0.19420510395364846
-0.19420510395364846
-0.19420510395364846
0
> predict -b 1 newtest.txt newmodel.bin newtest.out
Accuracy = 0% (0/1)
> cat newtest.out
labels 1 2
2 0.125423 0.874577
And again predictions were still ok for class 2: 0.87
a) what's the feature value being used for?
Each instance of n features is considered as a point in an n-dimensional space, attached with a given label, say +1 or -1 (in your case 1 or 2). A linear SVM tries to find the best hyperplane to separate those instance into two sets, say SetA and SetB. A hyperplane is considered better than other roughly when SetA contains more instances labeled with +1 and SetB contains more those with -1. i.e., more accurate. The best hyperplane is saved as the model. In your case, the hyperplane has formulation:
f(x)=w^T x
where w is the model, e.g (0.33741,0,0.33741,-0.33741,-0.33741) in your first case.
Probability (for LR) formulation:
prob(x)=1/(1+exp(-y*f(x))
where y=+1 or -1. See Appendix L of LIBLINEAR paper.
b) I wanted to understand why this test document containing a single feature which is only present in one class is not being strongly predicted into that class?
Not only 1 5:1 gives weak probability such as [0.42,0.58], if you predict 2 2:1 4:1 5:1 you will get [0.337417,0.662583] which seems that the solver is also not very confident about the result, even the input is exactly the same as the training data set.
The fundamental reason is the value of f(x), or can be simply seen as the distance between x and the hyperplane. It can be 100% confident x belongs to a certain class only if the distance is infinite large (see prob(x)).
c) I'm not expecting to have different values per features. Is there any other implications by increasing each feature value from 1 to something-else? How can I determine that number?
TL;DR
Enlarging both training and test set is like having a larger penalty parameter C (the -c option). Because larger C means a more strict penalty on error, intuitively speaking, the solver has more confidence with the prediction.
Enlarging every feature of the training set is just like having a smaller C.
Specifically, logistic regression solves the following equation for w.
min 0.5 w^T w + C ∑i log(1+exp(−yi w^T xi))
(eq(3) of LIBLINEAR paper)
For most instance, yi w^T xi is positive and larger xi implies smaller ∑i log(1+exp(−yi w^T xi)).
So the effect is somewhat similar to having a smaller C, and a smaller C implies smaller |w|.
On the other hand, enlarging the test set is the same as having a large |w|. Therefore, the effect of enlarging both training and test set is basically
(1). Having smaller |w| when training
(2). Then, having larger |w| when testing
Because the effect is more dramatic in (2) than (1), overall, enlarging both training and test set is like having a larger |w|, or, having a larger C.
We can run on the data set and multiply every features by 10^12. With C=1, we have the model and probability
> cat model.bin.m1e12.c1
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
3.0998430106024949e-12
0
3.0998430106024949e-12
-3.0998430106024949e-12
-3.0998430106024949e-12
0
> cat test.out.m1e12.c1
labels 1 2
2 0.0431137 0.956886
Next we run on the original data set. With C=10^12, we have the probability
> cat model.bin.m1.c1e12
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
3.0998430101989314
0
3.0998430101989314
-3.0998430101989314
-3.0998430101989314
0
> cat test.out.m1.c1e12
labels 1 2
2 0.0431137 0.956886
Therefore, because larger C means more strict penalty on error, so intuitively the solver has more confident with prediction.
d) Could my changes affect other more complex trainings in a bad way?
From (c) we know your changes is like having a larger C, and that will result in a better training accuracy. But it almost can be sure that the model is over fitting the training set when C goes too large. As a result, the model cannot endure the noise in training set and will perform badly in test accuracy.
As for finding a good C, a popular way is by cross validation (-v option).
Finally,
it may be off-topic but you may want to see how to pre-process the text data. It is common (e.g., suggested by the author of liblinear here) to instance-wise normalize the data.
For document classification, our experience indicates that if you normalize each document to unit length, then not only the training time is shorter, but also the performance is better.
I am looking for any methodology to assign a risk score to an individual based on certain events. I am looking to have a 0-100 scale with an exponential assignment. For example, for one event a day the score may rise to 25, for 2 it may rise to 50-60 and for 3-4 events a day the score for the day would be 100.
I tried to Google it but since I am not aware of the right terminology, I am landing up on random topics. :(
Is there any mathematical terminology for this kind of scoring system? what are the most common methods you might know?
P.S.: Expert/experience data scientist advice highly appreciated ;)
I would start by writing some qualifications:
0 events trigger a score of 0.
Non edge event count observations are where the score – 100-threshold would live.
Any score after the threshold will be 100.
If so, here's a (very) simplified example:
Stage Data:
userid <- c("a1","a2","a3","a4","a11","a12","a13","a14","u2","wtf42","ub40","foo","bar","baz","blue","bop","bob","boop","beep","mee","r")
events <- c(0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,2,3,6,122,13,1)
df1 <- data.frame(userid,events)
Optional: Normalize events to be in (1,2].
This might be helpful for logarithmic properties. (Otherwise, given the assumed function, score=events^exp, as in this example, 1 event will always yield a score of 1) This will allow you to control sensitivity, but it must be done right as we are dealing with exponents and logarithms. I am not using normalization in the example:
normevents <- (events-mean(events))/((max(events)-min(events))*2)+1.5
Set the quantile threshold for max score:
MaxScoreThreshold <- 0.25
Get the non edge quintiles of the events distribution:
qts <- quantile(events[events>min(events) & events<max(events)], c(seq(from=0, to=100,by=5)/100))
Find the Events quantity that give a score of 100 using the set threshold.
MaxScoreEvents <- quantile(qts,MaxScoreThreshold)
Find the exponent of your exponential function
Given that:
Score = events ^ exponent
events is a Natural number - integer >0: We took care of it by
omitting the edges)
exponent > 1
Exponent Calculation:
exponent <- log(100)/log(MaxScoreEvents)
Generate the scores:
df1$Score <- apply(as.matrix(events^exponent),1,FUN = function(x) {
if (x > 100) {
result <- 100
}
else if (x < 0) {
result <- 0
}
else {
result <- x
}
return(ceiling(result))
})
df1
Resulting Data Frame:
userid events Score
1 a1 0 0
2 a2 0 0
3 a3 0 0
4 a4 0 0
5 a11 0 0
6 a12 0 0
7 a13 0 0
8 a14 0 0
9 u2 0 0
10 wtf42 0 0
11 ub40 0 0
12 foo 0 0
13 bar 1 1
14 baz 2 100
15 blue 3 100
16 bop 2 100
17 bob 3 100
18 boop 6 100
19 beep 122 100
20 mee 13 100
21 r 1 1
Under the assumption that your data is larger and has more event categories, the score won't snap to 100 so quickly, it is also a function of the threshold.
I would rely more on the data to define the parameters, threshold in this case.
If you have prior data as to what users really did whatever it is your score assess you can perform supervised learning, set the threshold # wherever the ratio is over 50% for example. Or If the graph of events to probability of ‘success’ looks like the cumulative probability function of a normal distribution, I’d set threshold # wherever it hits 45 degrees (For the first time).
You could also use logistic regression if you have prior data but instead of a Logit function ingesting the output of regression, use the number as your score. You can normalize it to be within 0-100.
It’s not always easy to write a Data Science question. I made many assumptions as to what you are looking for, hope this is the general direction.
I have a one-vs-all classifier set. This set consists of, let's say, 3 classifiers (LibSVM SVMs) each trained on data for a class and all other class data. The current setup for a sample is that the classifier of the 3 classes that gives the highest score is said to be the matching class.
This setup gives a FAR and FRR result. The issue is that the FAR and FRR results are not enough to construct an ROC curve, which I need. I am wondering what I can do to produce and ROC curve.
This can be done using "multiclass ROC curves" (see e.g. this answer for more details). Usually, you either look at each class individually, or even at each pair of classes individually. I'll provide a short R example of how the first one could look, which is less complicated and still gives a good feeling for how well individual classes could be recognized.
You first need to obtain some class probabilities (for reproducibility, this is what you already have):
# Computing some class probabilities for a 3 class problem using repeated cross validation
library(caret)
model <- train(x = iris[,1:2], y = iris[,5], method = 'svmLinear', trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 10, classProbs = T, savePredictions = T))
# those are the class probabilities for each sample
> model$pred
pred obs setosa versicolor virginica rowIndex C Resample
[...]
11 virginica virginica 1.202911e-02 0.411723759 0.57624713 116 1 Fold01.Rep01
12 versicolor virginica 4.970032e-02 0.692146087 0.25815359 122 1 Fold01.Rep01
13 virginica virginica 5.258769e-03 0.310586094 0.68415514 125 1 Fold01.Rep01
14 virginica virginica 4.321882e-05 0.202372698 0.79758408 131 1 Fold01.Rep01
15 versicolor virginica 1.057353e-03 0.559993337 0.43894931 147 1 Fold01.Rep01
[...]
Now you can look at the ROC curve for each class individually. For each curve, the FRR indicates the rate how often samples of this class were predicted as samples of some other class, while the FAR indicates the rate how often a sample of some other class was predicted as a sample of this class:
plot(roc(predictor = model$pred$setosa, response = model$pred$obs=='setosa'), xlab = 'FAR', ylab = '1-FRR')
plot(roc(predictor = model$pred$versicolor, response = model$pred$obs=='versicolor'), add=T, col=2)
plot(roc(predictor = model$pred$virginica, response = model$pred$obs=='virginica'), add=T, col=3)
legend('bottomright', legend = c('setosa', 'versicolor', 'virginica'), col=1:3, lty=1)
As mentioned before, you could instead also look at the ROC curve for each pair of classes, but IMHO this transports much more information, hence it is more complicated/takes longer to grasp the contained information.
I know that for k-cross validation, I'm supposed to divide the corpus into k equal parts. Of these k parts, I'm to use k-1 parts for training and the remaining 1 part for testing. This process is to be repeated k times, such that each part is used once for testing.
But I don't understand what exactly does training mean and what exactly does testing mean .
What I think is (please correct me if I'm wrong):
1. Training sets (k-1 out of k): These sets are to be used build to the Tag transition probabilities and Emission probabilities tables. And then, apply some algorithm for tagging using these probability tables (Eg. Viterbi Algorithm)
2. Test set (1 set): Use the remaining 1 set to validate the implementation done in step 1. That is, this set from the corpus will have untagged words and I should use the step 1 implementation on this set.
Is my understanding correct? Please explain if not.
Thanks.
I hope this helps:
from nltk.corpus import brown
from nltk import UnigramTagger as ut
# Let's just take the first 100 sentences.
sents = brown.tagged_sents()[:1000]
num_sents = len(sents)
k = 10
foldsize = num_sents/10
fold_accurracies = []
for i in range(10):
# Locate the test set in the fold.
test = sents[i*foldsize:i*foldsize+foldsize]
# Use the rest of the sent not in test for training.
train = sents[:i*foldsize] + sents[i*foldsize+foldsize:]
# Trains a unigram tagger with the train data.
tagger = ut(train)
# Evaluate the accuracy using the test data.
accuracy = tagger.evaluate(test)
print "Fold", i
print 'from sent', i*foldsize, 'to', i*foldsize+foldsize
print 'accuracy =', accuracy
print
fold_accurracies.append(accuracy)
print 'average accuracy =', sum(fold_accurracies)/k
[out]:
Fold 0
from sent 0 to 100
accuracy = 0.785714285714
Fold 1
from sent 100 to 200
accuracy = 0.745431364216
Fold 2
from sent 200 to 300
accuracy = 0.749628896586
Fold 3
from sent 300 to 400
accuracy = 0.743798291989
Fold 4
from sent 400 to 500
accuracy = 0.803448275862
Fold 5
from sent 500 to 600
accuracy = 0.779836277467
Fold 6
from sent 600 to 700
accuracy = 0.772676371781
Fold 7
from sent 700 to 800
accuracy = 0.755679184052
Fold 8
from sent 800 to 900
accuracy = 0.706402915148
Fold 9
from sent 900 to 1000
accuracy = 0.774622079707
average accuracy = 0.761723794252
I'm new to WEKA and advanced statistics, starting from scratch to understand the WEKA measures. I've done all the #rushdi-shams examples, which are great resources.
On Wikipedia the http://en.wikipedia.org/wiki/Precision_and_recall examples explains with an simple example about a video software recognition of 7 dogs detection in a group of 9 real dogs and some cats.
I perfectly understand the example, and the recall calculation.
So my first step, let see in Weka how to reproduce with this data.
How do I create such a .ARFF file?
With this file I have a wrong Confusion Matrix, and the wrong Accuracy By Class
Recall is not 1, it should be 4/9 (0.4444)
#relation 'dogs and cat detection'
#attribute 'realanimal' {dog,cat}
#attribute 'detected' {dog,cat}
#attribute 'class' {correct,wrong}
#data
dog,dog,correct
dog,dog,correct
dog,dog,correct
dog,dog,correct
cat,dog,wrong
cat,dog,wrong
cat,dog,wrong
dog,?,?
dog,?,?
dog,?,?
dog,?,?
dog,?,?
cat,?,?
cat,?,?
Output Weka (without filters)
=== Run information ===
Scheme:weka.classifiers.rules.ZeroR
Relation: dogs and cat detection
Instances: 14
Attributes: 3
realanimal
detected
class
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
ZeroR predicts class value: correct
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 4 57.1429 %
Incorrectly Classified Instances 3 42.8571 %
Kappa statistic 0
Mean absolute error 0.5
Root mean squared error 0.5044
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 7
Ignored Class Unknown Instances 7
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 1 0.571 1 0.727 0.65 correct
0 0 0 0 0 0.136 wrong
Weighted Avg. 0.571 0.571 0.327 0.571 0.416 0.43
=== Confusion Matrix ===
a b <-- classified as
4 0 | a = correct
3 0 | b = wrong
There must be something wrong with the False Negative dogs,
or is my ARFF approach totally wrong and do I need another kind of attributes?
Thanks
Lets start with the basic definition of Precision and Recall.
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
Where TP is True Positive, FP is False Positive, and FN is False Negative.
In the above dog.arff file, Weka took into account only the first 7 tuples, it ignored the remaining 7. It can be seen from the above output that it has classified all the 7 tuples as correct(4 correct tuples + 3 wrong tuples).
Lets calculate the precision for correct and wrong class.
First for the correct class:
Prec = 4/(4+3) = 0.571428571
Recall = 4/(4+0) = 1.
For wrong class:
Prec = 0/(0+0)= 0
recall =0/(0+3) = 0