Why doesn't Mahout logistic regression give a good AUC when the model is tested on training data?

I'm using Mahout's logistic regression (version 0.9), but when I evaluate the trained model on the same data set it was trained on, the AUC is not high. I would expect it to be very high, since it is the same data set.
My data set is a CSV file with about 7 million lines and has 18 attributes, some numerical and some categorical.
This is how I create the model for logistic regression (I ignore some of the attributes):
$ mahout trainlogistic --input train.csv \
--output ./model \
--categories 2 \
--predictors attribute1 ... attribute15 \
--types w w w n n w w w w w w w n n n \
--target is_delayed \
--rate 100 \
--passes 2 \
--features 500000
And then when I check the AUC value using the model on the same data set:
$ mahout runlogistic --input train.csv --model ./model --auc --confusion
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.9-cdh5.3.0-job.jar
AUC = 0.48
confusion: [[1703477.0, 761921.0], [3034369.0, 1137161.0]]
entropy: [[NaN, NaN], [-16.5, -17.4]]
15/01/18 06:50:50 INFO driver.MahoutDriver: Program took 98213 ms (Minutes: 1.6368833333333332)
I'm really confused why I only get AUC = 0.48 instead of 1.00 or something very close, since it is the same data set.
Am I missing something?
I also tried with only a few attributes, but the AUC was still very low, around 0.47, which means the model is doing no better than random guessing.
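As a quick sanity check on the numbers above, the overall accuracy can be read off the confusion matrix. This is a minimal Python sketch that assumes the usual layout, in which the diagonal cells hold the rows where the predicted and actual class agree; the matrix values are copied from the runlogistic output.

```python
# Confusion matrix copied from the runlogistic output above.
confusion = [[1703477.0, 761921.0],
             [3034369.0, 1137161.0]]

# Assumption: the diagonal holds the rows where prediction and truth agree.
total = sum(sum(row) for row in confusion)
correct = confusion[0][0] + confusion[1][1]
accuracy = correct / total

print(f"rows counted: {total:.0f}")
print(f"accuracy:     {accuracy:.3f}")
```

Under that assumption the accuracy comes out below 0.5, which is consistent with the AUC of 0.48: on its own training data the model is no better than chance.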


Weka RF doesn't give any confusion matrix or expected results

I am using WEKA for binary classification on a small dataset with only 27 instances. With bigger datasets, WEKA shows the confusion matrix and the other metrics, but with my main, small 27-instance dataset it only shows this:
Scheme: weka.classifiers.trees.RandomForest -P 100 -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1
Relation: t_PROMIS_mtbi-weka.filters.unsupervised.attribute.Remove-R1
Instances: 27
Attributes: 7
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Bagging with 100 iterations and base learner
weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities
Time taken to build model: 0.01 seconds
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.0348
Mean absolute error 0.4544
Root mean squared error 0.529
Relative absolute error 91.7269 %
Root relative squared error 102.952 %
Total Number of Instances 27
I don't understand why this is happening. Is it a size thing?
I have already solved it. The problem was that I was using the numbers 1/0 for my class variable, so WEKA treated the class as numeric and reported regression metrics. I changed it to a "Yes"/"No" variable and now it works.
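For anyone who wants to apply the same fix outside WEKA, here is a minimal Python sketch that rewrites a 1/0 class column as "Yes"/"No" in a CSV file before loading it. The column name `outcome` and the sample rows are made up for illustration; they are not from the original dataset.

```python
import csv
import io

def relabel_binary_class(src, dst, class_column, mapping=None):
    """Copy a CSV stream, replacing 1/0 in class_column with Yes/No labels."""
    mapping = mapping or {"1": "Yes", "0": "No"}
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Raises KeyError on any unexpected class value instead of guessing.
        row[class_column] = mapping[row[class_column].strip()]
        writer.writerow(row)

# Hypothetical sample data; "outcome" is an assumed column name.
sample = io.StringIO("score_a,score_b,outcome\n12.5,3,1\n9.0,7,0\n")
out = io.StringIO()
relabel_binary_class(sample, out, class_column="outcome")
print(out.getvalue())
```

With text labels in the class column, the attribute is read as nominal rather than numeric, which is what the fix above relies on.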

What modifications can I make in svmtrain of LIBSVM to improve the accuracy of my spam classifier?

I am using Octave version 5.2.0 and LIBSVM 3.24 to build a spam classifier.
Without using LIBSVM I got an accuracy of >99% on both test and train data.
But with LIBSVM I only get an accuracy of 68-69%. What modifications should I make to my LIBSVM options?
This is the code I used:
model = svmtrain(X, y,'-c 0.1 -t 2 -s 0 -g 1000');
p = svmpredict(y,X,model);
Are you aware of the settings of LibSVM?
% libSVM options:
% -s svm_type: set type of SVM (default 0)
% 0 -- C-SVC
% 1 -- nu-SVC
% 2 -- one-class SVM
% 3 -- epsilon-SVR
% 4 -- nu-SVR
% -t kernel_type: set type of kernel function (default 2)
% 0 -- linear: u'*v
% 1 -- polynomial: (gamma*u'*v + coef0)^degree
% 2 -- radial basis function: exp(-gamma*|u-v|^2)
% 3 -- sigmoid: tanh(gamma*u'*v + coef0)
% -d degree: set degree in kernel function (default 3)
% -g gamma: set gamma in kernel function (default 1/num_features)
% -r coef0: set coef0 in kernel function (default 0)
% -c cost: set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
% -n nu: set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
% -p epsilon: set the epsilon in loss function of epsilon-SVR (default 0.1)
% -m cachesize: set cache memory size in MB (default 100)
% -e epsilon: set tolerance of termination criterion (default 0.001)
% -h shrinking: whether to use the shrinking heuristics, 0 or 1 (default 1)
% -b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
% -wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1)
So your -s 0 -t 2 -g 1000 -c 0.1 settings translate to a C-SVC (-s 0) with a Gaussian (RBF) kernel (-t 2), a very large gamma (-g 1000), and a smaller-than-default cost for violations (-c 0.1).
I suggest trying it first with the default values (-s 0 -t 2), and then increasing the cost -c. Your gamma looks extremely large, but without knowing your data nobody can judge this. Have a look at hyperparameter optimization, which is exactly about setting those values. There is plenty of work on this, but I am only familiar with it for regression analysis. If in doubt, do a global optimization of those parameters through grid search or a genetic algorithm (ga).
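To see why a gamma of 1000 is suspicious, it helps to evaluate the RBF kernel from the option list directly. The sketch below is plain Python rather than Octave, and the two feature vectors are made up for illustration. It also builds a coarse, log-spaced list of candidate option strings; `-v 5` is LIBSVM's n-fold cross-validation mode, and the exponent ranges are just one possible choice.

```python
import math

def rbf_kernel(u, v, gamma):
    """exp(-gamma * |u - v|^2), the -t 2 kernel from the option list."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

# Two made-up, fairly similar feature vectors (illustrative only).
u = [0.0, 0.0]
v = [0.1, 0.1]

for gamma in (0.5, 1000.0):
    print(f"gamma={gamma:<7} K(u, v)={rbf_kernel(u, v, gamma):.3g}")

# Candidate option strings for a coarse log-spaced grid search.
# -v 5 asks LIBSVM for 5-fold cross-validation instead of a final model.
grid = [f"-s 0 -t 2 -c {2.0 ** c} -g {2.0 ** g} -v 5"
        for c in range(-5, 16, 4)
        for g in range(-15, 4, 4)]
print(len(grid), "settings, e.g.", grid[0])
```

With the large gamma, even nearby points get a kernel value close to zero, so every training point looks similar only to itself. Each string in `grid` could be passed to svmtrain in turn, keeping the setting with the best cross-validation accuracy.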

Is there a reason why a feature only present in a given class is not being predicted strongly into that class?

Summary & Questions
I'm using liblinear 2.30. I noticed a similar issue in production, so I tried to isolate it with a simple reduced training: 2 classes, 1 training doc per class, 5 features with the same weight in my vocabulary, and 1 simple test doc containing only one feature, which is present only in class 2.
a) what's the feature value being used for?
b) I wanted to understand why this test document containing a single feature which is only present in one class is not being strongly predicted into that class?
c) I'm not expecting to have different values per feature. Are there any other implications of increasing each feature value from 1 to something else? How can I determine that number?
d) Could my changes affect other more complex trainings in a bad way?
What I tried
Below you will find data related to a simple training (please focus on feature 5):
> cat train.txt
1 1:1 2:1 3:1
2 2:1 4:1 5:1
> train -s 0 -c 1 -p 0.1 -e 0.01 -B 0 train.txt model.bin
iter 1 act 3.353e-01 pre 3.333e-01 delta 6.715e-01 f 1.386e+00 |g| 1.000e+00 CG 1
iter 2 act 4.825e-05 pre 4.824e-05 delta 6.715e-01 f 1.051e+00 |g| 1.182e-02 CG 1
> cat model.bin
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
And this is the output of the model:
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
1 5:10
Below you will find my model's prediction:
> cat test.txt
1 5:1
> predict -b 1 test.txt model.bin test.out
Accuracy = 0% (0/1)
> cat test.out
labels 1 2
2 0.416438 0.583562
And here is where I'm a bit surprised: the predictions are just [0.42, 0.58], even though feature 5 is only present in class 2. Why?
So I tried increasing the feature value for the test doc from 1 to 10:
> cat newtest.txt
1 5:10
> predict -b 1 newtest.txt model.bin newtest.out
Accuracy = 0% (0/1)
> cat newtest.out
labels 1 2
2 0.0331135 0.966887
And now I get a better prediction: [0.03, 0.97]. So I retrained with all feature values set to 10:
> cat newtrain.txt
1 1:10 2:10 3:10
2 2:10 4:10 5:10
> train -s 0 -c 1 -p 0.1 -e 0.01 -B 0 newtrain.txt newmodel.bin
iter 1 act 1.104e+00 pre 9.804e-01 delta 2.508e-01 f 1.386e+00 |g| 1.000e+01 CG 1
iter 2 act 1.381e-01 pre 1.140e-01 delta 2.508e-01 f 2.826e-01 |g| 2.272e+00 CG 1
iter 3 act 2.627e-02 pre 2.269e-02 delta 2.508e-01 f 1.445e-01 |g| 6.847e-01 CG 1
iter 4 act 2.121e-03 pre 1.994e-03 delta 2.508e-01 f 1.183e-01 |g| 1.553e-01 CG 1
> cat newmodel.bin
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
> predict -b 1 newtest.txt newmodel.bin newtest.out
Accuracy = 0% (0/1)
> cat newtest.out
labels 1 2
2 0.125423 0.874577
And again the prediction was still OK for class 2: 0.87
a) what's the feature value being used for?
Each instance with n features is considered a point in an n-dimensional space, attached to a given label, say +1 or -1 (in your case 1 or 2). A linear SVM tries to find the best hyperplane to separate those instances into two sets, say SetA and SetB. Roughly, one hyperplane is considered better than another when SetA contains more instances labeled +1 and SetB contains more of those labeled -1, i.e., when it is more accurate. The best hyperplane is saved as the model. In your case, the hyperplane has the formulation:
f(x)=w^T x
where w is the model, e.g. (0.33741, 0, 0.33741, -0.33741, -0.33741) in your first case.
Probability (for LR) formulation:
prob(x) = 1 / (1 + exp(−y w^T x))
where y=+1 or -1. See Appendix L of the LIBLINEAR paper.
b) I wanted to understand why this test document containing a single feature which is only present in one class is not being strongly predicted into that class?
Not only does 1 5:1 give a weak probability such as [0.42, 0.58]; if you predict 2 2:1 4:1 5:1 you will get [0.337417, 0.662583], which shows that the solver is also not very confident about the result even when the input is exactly the same as a training instance.
The fundamental reason is the value of f(x), which can simply be seen as the distance between x and the hyperplane. The model can be 100% confident that x belongs to a certain class only if the distance is infinitely large (see prob(x)).
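The probabilities in the question can be reproduced by hand from the weights quoted above. This minimal Python sketch assumes the standard logistic form P(y|x) = 1 / (1 + exp(−y w^T x)) and that label 1, the first label listed in the model file, plays the role of y = +1. The weights are rounded to five digits, so the results only match to about four decimals.

```python
import math

# Weights quoted in the answer for the first model (rounded to 5 digits).
w = [0.33741, 0.0, 0.33741, -0.33741, -0.33741]

def decision_value(x):
    """f(x) = w^T x for a sparse document given as {feature_index: value}."""
    return sum(w[i - 1] * value for i, value in x.items())

def prob_label_1(x):
    """P(label 1 | x) = 1 / (1 + exp(-f(x))); label 1 is assumed to be y = +1."""
    return 1.0 / (1.0 + math.exp(-decision_value(x)))

docs = [("5:1", {5: 1.0}),
        ("5:10", {5: 10.0}),
        ("2:1 4:1 5:1", {2: 1.0, 4: 1.0, 5: 1.0})]

for name, x in docs:
    p1 = prob_label_1(x)
    print(f"{name:12s} f(x)={decision_value(x):+.4f}  P(1)={p1:.4f}  P(2)={1 - p1:.4f}")
```

The only thing that changes between 5:1 and 5:10 is the size of f(x): the same weight multiplied by a larger feature value moves the point further from the hyperplane, and the probability moves towards 0 or 1 accordingly.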
c) I'm not expecting to have different values per feature. Are there any other implications of increasing each feature value from 1 to something else? How can I determine that number?
Enlarging the feature values of both the training and test sets is like having a larger penalty parameter C (the -c option). Because a larger C means a stricter penalty on errors, intuitively speaking, the solver is more confident in its predictions.
Enlarging every feature of the training set alone is just like having a smaller C.
Specifically, logistic regression solves the following optimization problem for w.
min 0.5 w^T w + C ∑i log(1+exp(−yi w^T xi))
(eq(3) of LIBLINEAR paper)
For most instances, yi w^T xi is positive, and a larger xi implies a smaller ∑i log(1+exp(−yi w^T xi)).
So the effect is somewhat similar to having a smaller C, and a smaller C implies a smaller |w|.
On the other hand, enlarging the test set is the same as having a larger |w|. Therefore, the effect of enlarging both the training and test sets is basically
(1) Having a smaller |w| when training
(2) Then, having a larger |w| when testing
Because the effect in (2) is more dramatic than in (1), overall, enlarging both the training and test sets is like having a larger |w|, or a larger C.
We can run on the data set with every feature multiplied by 10^12. With C=1, we get the following model and probabilities:
> cat model.bin.m1e12.c1
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
> cat test.out.m1e12.c1
labels 1 2
2 0.0431137 0.956886
Next we run on the original data set. With C=10^12, we get the following probabilities:
> cat model.bin.m1.c1e12
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
> cat test.out.m1.c1e12
labels 1 2
2 0.0431137 0.956886
Therefore, because a larger C means a stricter penalty on errors, intuitively the solver is more confident in its predictions.
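The link between feature scaling and C can also be checked numerically on the two-document training set. Substituting v = s·w into the objective shows that training on features multiplied by s with penalty C is the same problem as training on the original features with penalty C·s², up to that rescaling of w. The sketch below is plain Python with a simple gradient-descent solver; it is not LIBLINEAR's trust-region solver, so it illustrates the equivalence rather than reproducing the tool's exact output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(docs, labels, C, iters=5000):
    """Minimise 0.5*w^T w + C*sum_i log(1+exp(-y_i w^T x_i)) by gradient descent."""
    n = len(docs[0])
    # Step 1/L, where L is an upper bound on the curvature of the objective.
    L = 1.0 + 0.25 * C * sum(sum(v * v for v in x) for x in docs)
    w = [0.0] * n
    for _ in range(iters):
        grad = list(w)  # gradient of the 0.5*w^T w term
        for x, y in zip(docs, labels):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            coeff = -C * y * sigmoid(-margin)
            for j in range(n):
                grad[j] += coeff * x[j]
        w = [wj - gj / L for wj, gj in zip(w, grad)]
    return w

# train.txt from the question as dense vectors; label 1 -> +1, label 2 -> -1.
train = [[1, 1, 1, 0, 0], [0, 1, 0, 1, 1]]
labels = [+1, -1]

s = 10.0
scaled = [[s * v for v in x] for x in train]

w_scaled = train_lr(scaled, labels, C=1.0)    # features * 10, C = 1
w_equiv = train_lr(train, labels, C=s * s)    # original features, C = 100

print("s * w_scaled:", [round(s * wj, 6) for wj in w_scaled])
print("w_equiv:     ", [round(wj, 6) for wj in w_equiv])

# Probability of label 2 for the scaled test document "5:10".
p2 = 1.0 - sigmoid(w_scaled[4] * 10.0)
print("P(label 2 | 5:10) =", round(p2, 3))
```

The two weight vectors agree after rescaling, and the probability for the scaled test document should land close to the 0.87 reported in the question. The small difference from LIBLINEAR's number is expected, since the question used a loose stopping tolerance (-e 0.01).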
d) Could my changes affect other more complex trainings in a bad way?
From (c) we know your changes are like having a larger C, which will result in better training accuracy. But it is almost certain that the model will overfit the training set when C becomes too large. As a result, the model fits the noise in the training set and performs badly on test data.
As for finding a good C, a popular way is cross-validation (the -v option).
It may be off-topic, but you may want to look at how to pre-process the text data. It is common (e.g., suggested by the author of liblinear here) to normalize the data instance-wise.
For document classification, our experience indicates that if you normalize each document to unit length, then not only the training time is shorter, but also the performance is better.
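As a minimal sketch of that instance-wise normalization, the Python function below rescales each line of a LIBLINEAR-format file so that its feature vector has unit Euclidean length. The function name and the six-digit output precision are choices made for this example.

```python
import math

def normalize_line(line):
    """L2-normalise one 'label idx:val idx:val ...' line to unit length."""
    label, *pairs = line.split()
    feats = [(idx, float(val)) for idx, val in (p.split(":") for p in pairs)]
    norm = math.sqrt(sum(val * val for _, val in feats))
    if norm == 0.0:
        return line.strip()  # nothing to scale in an empty document
    scaled = " ".join(f"{idx}:{val / norm:.6g}" for idx, val in feats)
    return f"{label} {scaled}"

for line in ["1 1:1 2:1 3:1", "2 2:10 4:10 5:10", "1 5:10"]:
    print(normalize_line(line))
```

After normalization, a document whose features are all 1 and the same document with all features set to 10 become identical, so the choice of raw feature value no longer influences the model. The same normalization has to be applied to test documents.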

Vowpal Wabbit same results always

I am using VW for a multiclass prediction problem. The strangest part is that no matter which parameters I use, the result is always the same.
Is that expected, maybe because of my data?
Around 90k lines of data. A line of the data:
1 2334225|SUBDEPT "D1SUB1" "D2SUB1" |DEPT "DEPT1" "DEPT2" |SCANCODE "11223442" "65434533543" |WDAY Friday |AMTBOUGHT 2
It's a multiclass problem, so the command line is:
vw --ect 38 ../Processed/train.vw.txt --loss_function logistic --link=logistic
The only parameter that changes anything is switching from --ect to --oaa. I have tried adding the following, but none of them changes the final validation values:
-c -k --passes 20 (it only runs until pass 8)
--l1 or --l2
--ignore D or --ignore d (or s or su...)
The results are always:
average loss = 0.911153 h
Is there something that I am missing here?

Step by step guide to train a multilayer perceptron for the XOR case in Weka?

I'm just getting started with Weka and having trouble with the first steps.
We've got our training set:
@relation PerceptronXOR
@attribute X1 numeric
@attribute X2 numeric
@attribute Output numeric
The first thing I want to do is just train, and then classify a set using the Weka GUI.
What I've been doing so far:
Using Weka 3.7.0.
Start GUI.
Open file -> choose my arff file.
Classify tab.
Use training set radio button.
Choose-> functions>multilayer_perceptron
Click the 'multilayer perceptron' text at the top to open settings.
Set Hidden layers to '2'. (If GUI is set to true, this shows that this is the correct network we want.) Click OK.
Click start.
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H 2 -R
Relation: PerceptronXOR
Instances: 4
Attributes: 3
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Node 0
Inputs Weights
Threshold 0.21069691964232443
Node 1 1.8781169869419072
Node 2 -1.8403146612166397
Sigmoid Node 1
Inputs Weights
Threshold -3.7331156814378685
Attrib X1 3.6380519730323164
Attrib X2 -1.0420815868133226
Sigmoid Node 2
Inputs Weights
Threshold -3.64785119182632
Attrib X1 3.603244645539393
Attrib X2 0.9535137571446323
Node 0
Time taken to build model: 0 seconds
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.7047
Mean absolute error 0.6073
Root mean squared error 0.7468
Relative absolute error 60.7288 %
Root relative squared error 74.6842 %
Total Number of Instances 4
It seems odd that 500 iterations at a learning rate of 0.3 doesn't get the error down, but 5000 iterations at 0.1 does, so let's go with that.
Now use the test data set:
@relation PerceptronXOR
@attribute X1 numeric
@attribute X2 numeric
@attribute Output numeric
Radio button to 'Supplied test set'
Select my test set arff.
Click start.
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.1 -M 0.2 -N 5000 -V 0 -S 0 -E 20 -H 2 -R
Relation: PerceptronXOR
Instances: 4
Attributes: 3
Test mode: user supplied test set: size unknown (reading incrementally)
=== Classifier model (full training set) ===
Linear Node 0
Inputs Weights
Threshold -1.2208619057226187
Node 1 3.1172079341507497
Node 2 -3.212484459911485
Sigmoid Node 1
Inputs Weights
Threshold 1.091378074639599
Attrib X1 1.8621040828953983
Attrib X2 1.800744048145267
Sigmoid Node 2
Inputs Weights
Threshold -3.372580743113282
Attrib X1 2.9207154176666386
Attrib X2 2.576791630598144
Node 0
Time taken to build model: 0.04 seconds
=== Evaluation on test set ===
=== Summary ===
Correlation coefficient 0.8296
Mean absolute error 0.3006
Root mean squared error 0.6344
Relative absolute error 30.0592 %
Root relative squared error 63.4377 %
Total Number of Instances 8
Why is it unable to classify these correctly?
Is it just because it's reached a local minimum quickly on the training data, and doesn't 'know' that that doesn't fit all the cases?
Why does 500 iterations at 0.3 not work? Seems odd for such a simple problem.
Why does it fail on the test set?
How do I pass in a set to classify?
Using a learning rate of 0.5 does the job with 500 iterations for both examples.
The learning rate controls how much weight the network gives to new examples.
Apparently the problem is difficult and it is easy to get stuck in local minima with the 2 hidden nodes. If you use a low learning rate with a high number of iterations, the learning process will be more conservative and more likely to hit a good minimum.
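To confirm that the failure is an optimization issue and not a limit of the architecture, here is a small Python sketch of a network with the same shape as in the Weka output (two sigmoid hidden nodes feeding one linear output node) whose weights are set by hand instead of learned. The weights and the 0/1 input encoding are assumptions chosen for illustration; they are not values Weka would produce.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    """2-2-1 network: two sigmoid hidden nodes and one linear output node."""
    h_or = sigmoid(20.0 * x1 + 20.0 * x2 - 10.0)    # fires if x1 OR x2
    h_and = sigmoid(20.0 * x1 + 20.0 * x2 - 30.0)   # fires if x1 AND x2
    return h_or - h_and                              # OR but not AND = XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor_net(x1, x2), 3))
```

Since a good set of weights exists for this shape, a run that ends with a high error has stopped in a poor region of the weight space, which is why changing the learning rate or the number of iterations changes the outcome.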
