I am using libsvm for document classification. I use svm.cc and svm.h in my project. I then call svm_train. I save the model in a file using svm_save_model.
I have there categories. The svm model file is:
svm_type c_svc
kernel_type rbf
gamma 0.001002
nr_class 3
total_sv 9
rho -0.000766337 0.00314423 0.00387654
label 0 1 2
nr_sv 3 3 3
SV
1 1 1:0.001 2:0.001 3:0.012521912 5:0.001 15:0.012521912 17:0.012521912 23:0.001
1 1 1:0.001 2:0.014176543 4:0.093235799 6:0.001 7:0.0058630699 9:0.040529628 10:0.001
1 1 11:0.38863495 33:0.08295242 46:0.041749886 58:0.08295242 89:0.08295242 127:0.15338862 -1 1 5:0.001 8:0.0565 10:0.001 13:0.001 18:0.0565 21:0.021483399 34:0.12453384 36:0.001
-1 1 13:0.034959612 34:0.090130132 36:0.034959612 45:0.034959612 47:0.12019824
-1 1 5:0.001 8:0.048037273 13:0.001 18:0.048037273 29:0.14715472 30:0.018360058 36:0.001
-1 -1 9:0.0049328688 12:0.090902344 18:0.1156038 27:0.0049328688 31:0.015144206
What are 1 and -1 before the vector values in the form of index:value ?
From the libsvm FAQ:
Q: Can you explain more about the model file?
In the model file,
after parameters and other informations such as labels , each line
represents a support vector. Support vectors are listed in the order
of "labels" shown earlier. (i.e., those from the first class in the
"labels" list are grouped first, and so on.) If k is the total number
of classes, in front of a support vector in class j, there are k-1
coefficients y*alpha where alpha are dual solution of the following
two class problems: 1 vs j, 2 vs j, ..., j-1 vs j, j vs j+1, j vs
j+2, ..., j vs k and y=1 in first j-1 coefficients, y=-1 in the
remaining k-j coefficients. For example, if there are 4 classes, the
file looks like:
+-+-+-+--------------------+
|1|1|1| |
|v|v|v| SVs from class 1 |
|2|3|4| |
+-+-+-+--------------------+
|1|2|2| |
|v|v|v| SVs from class 2 |
|2|3|4| |
+-+-+-+--------------------+
|1|2|3| |
|v|v|v| SVs from class 3 |
|3|3|4| |
+-+-+-+--------------------+
|1|2|3| |
|v|v|v| SVs from class 4 |
|4|4|4| |
+-+-+-+--------------------+
http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f402
Related
Summary & Questions
I'm using liblinear 2.30 - I noticed a similar issue in prod, so I tried to isolate it through a simple reduced training with 2 classes, 1 train doc per class, 5 features with same weight in my vocabulary and 1 simple test doc containing only one feature which is present only in class 2.
a) what's the feature value being used for?
b) I wanted to understand why this test document containing a single feature which is only present in one class is not being strongly predicted into that class?
c) I'm not expecting to have different values per features. Is there any other implications by increasing each feature value from 1 to something-else? How can I determine that number?
d) Could my changes affect other more complex trainings in a bad way?
What I tried
Below you will find data related to a simple training (please focus on feature 5):
> cat train.txt
1 1:1 2:1 3:1
2 2:1 4:1 5:1
> train -s 0 -c 1 -p 0.1 -e 0.01 -B 0 train.txt model.bin
iter 1 act 3.353e-01 pre 3.333e-01 delta 6.715e-01 f 1.386e+00 |g| 1.000e+00 CG 1
iter 2 act 4.825e-05 pre 4.824e-05 delta 6.715e-01 f 1.051e+00 |g| 1.182e-02 CG 1
> cat model.bin
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
0.3374141436539016
0
0.3374141436539016
-0.3374141436539016
-0.3374141436539016
0
And this is the output of the model:
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
0.3374141436539016
0
0.3374141436539016
-0.3374141436539016
-0.3374141436539016
0
1 5:10
Below you will find my model's prediction:
> cat test.txt
1 5:1
> predict -b 1 test.txt model.bin test.out
Accuracy = 0% (0/1)
> cat test.out
labels 1 2
2 0.416438 0.583562
And here is where I'm a bit surprised because of the predictions being just [0.42, 0.58] as the feature 5 is only present in class 2. Why?
So I just tried with increasing the feature value for the test doc from 1 to 10:
> cat newtest.txt
1 5:10
> predict -b 1 newtest.txt model.bin newtest.out
Accuracy = 0% (0/1)
> cat newtest.out
labels 1 2
2 0.0331135 0.966887
And now I get a better prediction [0.03, 0.97]. Thus, I tried re-compiling my training again with all features set to 10:
> cat newtrain.txt
1 1:10 2:10 3:10
2 2:10 4:10 5:10
> train -s 0 -c 1 -p 0.1 -e 0.01 -B 0 newtrain.txt newmodel.bin
iter 1 act 1.104e+00 pre 9.804e-01 delta 2.508e-01 f 1.386e+00 |g| 1.000e+01 CG 1
iter 2 act 1.381e-01 pre 1.140e-01 delta 2.508e-01 f 2.826e-01 |g| 2.272e+00 CG 1
iter 3 act 2.627e-02 pre 2.269e-02 delta 2.508e-01 f 1.445e-01 |g| 6.847e-01 CG 1
iter 4 act 2.121e-03 pre 1.994e-03 delta 2.508e-01 f 1.183e-01 |g| 1.553e-01 CG 1
> cat newmodel.bin
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
0.19420510395364846
0
0.19420510395364846
-0.19420510395364846
-0.19420510395364846
0
> predict -b 1 newtest.txt newmodel.bin newtest.out
Accuracy = 0% (0/1)
> cat newtest.out
labels 1 2
2 0.125423 0.874577
And again predictions were still ok for class 2: 0.87
a) what's the feature value being used for?
Each instance of n features is considered as a point in an n-dimensional space, attached with a given label, say +1 or -1 (in your case 1 or 2). A linear SVM tries to find the best hyperplane to separate those instance into two sets, say SetA and SetB. A hyperplane is considered better than other roughly when SetA contains more instances labeled with +1 and SetB contains more those with -1. i.e., more accurate. The best hyperplane is saved as the model. In your case, the hyperplane has formulation:
f(x)=w^T x
where w is the model, e.g (0.33741,0,0.33741,-0.33741,-0.33741) in your first case.
Probability (for LR) formulation:
prob(x)=1/(1+exp(-y*f(x))
where y=+1 or -1. See Appendix L of LIBLINEAR paper.
b) I wanted to understand why this test document containing a single feature which is only present in one class is not being strongly predicted into that class?
Not only 1 5:1 gives weak probability such as [0.42,0.58], if you predict 2 2:1 4:1 5:1 you will get [0.337417,0.662583] which seems that the solver is also not very confident about the result, even the input is exactly the same as the training data set.
The fundamental reason is the value of f(x), or can be simply seen as the distance between x and the hyperplane. It can be 100% confident x belongs to a certain class only if the distance is infinite large (see prob(x)).
c) I'm not expecting to have different values per features. Is there any other implications by increasing each feature value from 1 to something-else? How can I determine that number?
TL;DR
Enlarging both training and test set is like having a larger penalty parameter C (the -c option). Because larger C means a more strict penalty on error, intuitively speaking, the solver has more confidence with the prediction.
Enlarging every feature of the training set is just like having a smaller C.
Specifically, logistic regression solves the following equation for w.
min 0.5 w^T w + C ∑i log(1+exp(−yi w^T xi))
(eq(3) of LIBLINEAR paper)
For most instance, yi w^T xi is positive and larger xi implies smaller ∑i log(1+exp(−yi w^T xi)).
So the effect is somewhat similar to having a smaller C, and a smaller C implies smaller |w|.
On the other hand, enlarging the test set is the same as having a large |w|. Therefore, the effect of enlarging both training and test set is basically
(1). Having smaller |w| when training
(2). Then, having larger |w| when testing
Because the effect is more dramatic in (2) than (1), overall, enlarging both training and test set is like having a larger |w|, or, having a larger C.
We can run on the data set and multiply every features by 10^12. With C=1, we have the model and probability
> cat model.bin.m1e12.c1
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
3.0998430106024949e-12
0
3.0998430106024949e-12
-3.0998430106024949e-12
-3.0998430106024949e-12
0
> cat test.out.m1e12.c1
labels 1 2
2 0.0431137 0.956886
Next we run on the original data set. With C=10^12, we have the probability
> cat model.bin.m1.c1e12
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
3.0998430101989314
0
3.0998430101989314
-3.0998430101989314
-3.0998430101989314
0
> cat test.out.m1.c1e12
labels 1 2
2 0.0431137 0.956886
Therefore, because larger C means more strict penalty on error, so intuitively the solver has more confident with prediction.
d) Could my changes affect other more complex trainings in a bad way?
From (c) we know your changes is like having a larger C, and that will result in a better training accuracy. But it almost can be sure that the model is over fitting the training set when C goes too large. As a result, the model cannot endure the noise in training set and will perform badly in test accuracy.
As for finding a good C, a popular way is by cross validation (-v option).
Finally,
it may be off-topic but you may want to see how to pre-process the text data. It is common (e.g., suggested by the author of liblinear here) to instance-wise normalize the data.
For document classification, our experience indicates that if you normalize each document to unit length, then not only the training time is shorter, but also the performance is better.
I am working on a dataset which has a feature that has multiple categories for a single example.
The feature looks like this:-
Feature
0 [Category1, Category2, Category2, Category4, Category5]
1 [Category11, Category20, Category133]
2 [Category2, Category9]
3 [Category1000, Category1200, Category2000]
4 [Category12]
The problem is similar to the this question posted:- Encode categorical features with multiple categories per example - sklearn
Now, I want to vectorize this feature. One solution is to use MultiLabelBinarizer as suggested in the answer of the above similar question. But, there are around 2000 categories, which results into a sparse and very high dimentional encoded data.
Is there any other encoding that can be used? Or any possible solution for this problem. Thanks.
Given an incredibly sparse array one could use a dimensionality reduction technique such as PCA (Principal component analysis) to reduce the feature space to the top k features that best describe the variance.
Assuming the MultiLabelBinarizered 2000 features = X
from sklearn.decomposition import PCA
k = 5
model = PCA(n_components = k, random_state = 666)
model.fit(X)
Components = model.predict(X)
And then you can use the top K components as a smaller dimensional feature space that can explain a large portion of the variance for the original feature space.
If you want to understand how well the new smaller feature space describes the variance you could use the following command
model.explained_variance_
In many cases when I encountered the problem of too many features being generated from a column with many categories, I opted for binary encoding and it worked out fine most of the times and hence is worth a shot for you perhaps.
Imagine you have 9 features, and you mark them from 1 to 9 and now binary encode them, you will get:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
This is the basic intuition behind Binary Encoder.
PS: Given that 2 power 11 is 2048 and you may have 2000 categories or so, you can reduce your categories to 11 feature columns instead of many (for example, 1999 in the case of one-hot)!
I also encountered these same problems but I solved using Countvectorizer from sklearn.feature_extraction.text just by giving binary=True, i.e CounterVectorizer(binary=True)
I have built a RNN LSTM model as three class classifier. It is giving around 94% train and test accuracy. But on testing it on real data it is giving pretty inaccurate results.Its application is related to NLP.
My raw dataset format is like
DATA | Class
1. String1 | 0
2. String2 | 1
3. String3 | 2
Accuracy, F1 Score and Confusion Matrix:-
Accuracy = 91.07142857142857
F1 Score = 0.9104914905591774
[[39 3 0]
[ 2 36 4]
[ 0 1 27]]
Graph of training and validation acc
So what I want to achieve is that on feeding some string, if that string contain some common words of class 0 then it should predict that string is from class 0. Any help?
I'm working with classification with imbalanced data set using Sklearn. Sklearn has calculated the false_positive_rate and true_positive_rate wrong; when I want to calculate the AUC score, the result is different from what I have gotten from the confusion matrix.
From Sklearn I got the following confusion matrix:
confusion = confusion_matrix(y_test, y_pred)
array([[ 9100, 4320],
[109007, 320068]], dtype=int64)
of course, I understand the output as:
+-----------------------------------+------------------------+
| | Predicted | Predicted |
+-----------------------------------+------------------------+
| Actual | True positive = 9100 | False-negative = 4320 |
| Actual | False-positive = 109007 | True negative = 320068|
+--------+--------------------------+------------------------+
However, for FPR and TPR, I got the following result:
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
(false_positive_rate, true_positive_rate)
(array([0. , 0.3219076, 1. ]),
array([0. , 0.7459488, 1. ]))
The result is different from the confusion_matrix. According to my table, the FPR is actually FNR, and the TPR is actually TNR. Then I checked the confusion matrix document, I found out that:
Thus, in binary classification, the count of true negatives is C0,0, false negatives is C1,0, true positives is C1,1 and false positives is C0,1.
This means that the confusion_matrix, according to Sklearn, looks like this:
+-----------------------------------+---------------------------+
| | Predicted | Predicted |
+-----------------------------------+---------------------------+
| Actual | True-Positive = 320068 | False-Negative = 109007 |
| Actual | False-Positive = 4320 | True-Negative = 9100 |
+--------+--------------------------+---------------------------+
According to the theory, for binary classification, the rare class is denoted as the positive class.
Why does Sklearn treat the majority class as positive?
After some experiments, I found out that when IsolationForest from sklearn is used for imbalanced data, if you check confusion_matrix, It can be seen that IsolationForest treats the majority (Normal) class as a positive class whereas minor class should be the positive class in Fraud/Outlier/Anomaly detection tasks.
To overcome this challenge there are 2 solutions:
Interpret the confusion matrix results in vice versa. FP instead of FN and TP instead of TN.
If you want to pass the results correctly due to this bad treatment of IF for imbalanced data you can use the following trick:
Typically IF returns -1 for outliers and 1 for inliers so if you replace 1 with -1 and then -1 with 1 in the output from the IsolationForest, Then you could use the standard metric calculations correctly in this case.
IF_model = IsolationForest(max_samples="auto",
random_state=11,
contamination = 0.1,
n_estimators=100,
n_jobs=-1)
IF_model.fit(X_train_sf, y_train_sf)
y_pred_test = IF_model.predict(X_test_sf)
counts = np.unique(y_pred_test, return_counts=True)
#(array([-1, 1]), array([44914, 4154]))
#replace 1 with -1 and then -1 with 1
if (counts[1][0] < counts[1][1] and counts[0][0] == -1) or (counts[1][0] > counts[1][1] and counts[0][0] == 1): y_pred_test = -y_pred_test
Considering confusion matrix documentation, and problem definition here, above trick should work and right form of confusion matrix for Fraud/Outlier/Anomaly detection or Binary classifiers based on litertures Ref.1, Ref.2, Ref.3 is as follows:
+----------------------------+---------------+--------------+
| | Predicted | Predicted |
+----------------------------+---------------+--------------+
| Actual (Positive class)[1] | TP | FN |
| Actual (Negative class)[-1]| FP | TN |
+----------------------------+---------------+--------------+
tn, fp, fn, tp = confusion_matrix(y_test_sf, y_pred_test).ravel()
print("TN: ",tn,"\nFP: ", fp,"\nFN: " ,fn,"\nTP: ", tp)
print("Number of positive class instances: ",tp+fn,"\nNumber of negative class instances: ", tn+fp)
check the evaluation:
print(classification_report(y_test_sf, y_pred_test, target_names=["Anomaly", "Normal"]))
The subject of this question is joining tensors for neural networks with torch/nn and torch/nngraph libraries for Lua. I started coding in Lua a few weeks ago so my experience is very minimal. In the text below, I refer to lua tables as arrays.
Context
I am working with a recurrent neural network for speech recognition.
At some point in the network there are N number of arrays of m Tensors.
a = {a1, a2, ..., aM},
b = {b1, b2, ..., bM},
... N times
Where ai and bi are tensors and {} represents an array.
What needs to be done is join all those arrays element-wise so that output is an array of M Tensors where output[i] is the result of joining every ith Tensors from the N arrays over the second dimension.
output = {z1, z2, ..., zM}
Example
|| used to represent Tensors
x = {|1 1|, |2 2|}
|1 1| |2 2|
Tensors of size 2x2
y = {|3 3 3|, |4 4 4|}
|3 3 3| |4 4 4|
Tensors of size 2x3
|
| Join{x,y}
\/
z = {|1 1 3 3 3|, |2 2 4 4 4|}
|1 1 3 3 3| |2 2 4 4 4|
Tensors of size 2x5
So the first Tensor of x of size 2x2 was joined with the first Tensor of y of size 2x3 over the second dimension and same thing for second Tensor of each array resulting in z an array of Tensors 2x5.
Problem
Now this is a basic concatenation, but I can't seem to find a module in the torch/nn library that would allow me to do that. I could write my own module of course, but if an already existing module does it then I would rather go with that.
The only existing module I know that joins table is (obviously) JoinTable. It takes an array of Tensors and joins them together. I want to join arrayS of tensors element-wise.
Also, as we are feeding input to our network, the number of Tensors in the N arrays varies, so m from the context above is not constant.
Idea
What I thought I could do in order to use the module JoinTable is convert my arrays into Tensors instead and then JoinTable on the converted N Tensors. But then again I would need a module that does such a conversion and and another one to convert back to an array in order to feed it to the next layers of the network.
Last resort
Write a new module that iterates over all given arrays and concatenates element-wise. Of course it's do-able, but the whole purpose of this post is to find a way to avoid writing smelly modules. It seems weird to me that such a module doesn't already exist.
Conclusion
I finally decided to do as I wrote in Last resort. I wrote a new module that iterates over all given arrays and concatenates element-wise.
Though, the answer given by #fmguler does the same without having to write a new module.
You can do it with nn.SelectTable and nn.JoinTable like this;
require 'nn'
x = {torch.Tensor{{1,1},{1,1}}, torch.Tensor{{2,2},{2,2}}}
y = {torch.Tensor{{3,3,3},{3,3,3}}, torch.Tensor{{4,4,4},{4,4,4}}}
res = {}
res[1] = nn.JoinTable(2):forward({nn.SelectTable(1):forward(x),nn.SelectTable(1):forward(y)})
res[2] = nn.JoinTable(2):forward({nn.SelectTable(2):forward(x),nn.SelectTable(2):forward(y)})
print(res[1])
print(res[2])
If you want this to be done in a module, wrap it in nnGraph;
require 'nngraph'
x = {torch.Tensor{{1,1},{1,1}}, torch.Tensor{{2,2},{2,2}}}
y = {torch.Tensor{{3,3,3},{3,3,3}}, torch.Tensor{{4,4,4},{4,4,4}}}
xi = nn.Identity()()
yi = nn.Identity()()
res = {}
--you can loop over columns here>>
res[1] = nn.JoinTable(2)({nn.SelectTable(1)(xi),nn.SelectTable(1)(yi)})
res[2] = nn.JoinTable(2)({nn.SelectTable(2)(xi),nn.SelectTable(2)(yi)})
module = nn.gModule({xi,yi},res)
--test like this
result = module:forward({x,y})
print(result)
print(result[1])
print(result[2])
--gives the result
th> print(result)
{
1 : DoubleTensor - size: 2x5
2 : DoubleTensor - size: 2x5
}
th> print(result[1])
1 1 3 3 3
1 1 3 3 3
[torch.DoubleTensor of size 2x5]
th> print(result[2])
2 2 4 4 4
2 2 4 4 4
[torch.DoubleTensor of size 2x5]