WEKA Changing number of decimal places in predictions - machine-learning

I'm trying to get precise predictions from WEKA, and I need to increase the number of decimal places that it outputs for its prediction data.
My .arff training set looks like this:
@relation TrainSet
@attribute TimeDiff1 numeric
@attribute TimeDiff2 numeric
@attribute TimeDiff3 numeric
@attribute TimeDiff4 numeric
@attribute TimeDiff5 numeric
@attribute TimeDiff6 numeric
@attribute TimeDiff7 numeric
@attribute TimeDiff8 numeric
@attribute TimeDiff9 numeric
@attribute TimeDiff10 numeric
@attribute LBN/Distance numeric
@attribute LBNDiff1 numeric
@attribute LBNDiff2 numeric
@attribute LBNDiff3 numeric
@attribute Size numeric
@attribute RW {R,W}
@attribute 'Response Time' numeric
@data
0,0,0,0,0,0,0,0,0,0,203468398592,0,0,0,32768,R,0.006475
0.004254,0,0,0,0,0,0,0,0,0,4564742206976,4361273808384,0,0,65536,R,0.011025
0.002128,0.006382,0,0,0,0,0,0,0,0,4585966117376,21223910400,4382497718784,0,4096,R,0.01389
0.001616,0.003744,0,0,0,0,0,0,0,0,4590576115200,4609997824,25833908224,4387107716608,4096,R,0.005276
0.002515,0.004131,0.010513,0,0,0,0,0,0,0,233456156672,-4357119958528,-4352509960704,-4331286050304,32768,R,0.01009
0.004332,0.006847,0.010591,0,0,0,0,0,0,0,312887472128,79431315456,-4277688643072,-4273078645248,4096,R,0.005081
0.000342,0.004674,0.008805,0,0,0,0,0,0,0,3773914294272,3461026822144,3540458137600,-816661820928,8704,R,0.004252
0.000021,0.000363,0.00721,0,0,0,0,0,0,0,3772221901312,-1692392960,3459334429184,3538765744640,4096,W,0.00017
0.000042,0.000063,0.004737,0.01525,0,0,0,0,0,0,3832104423424,59882522112,58190129152,3519216951296,16384,W,0.000167
0.005648,0.00569,0.006053,0.016644,0,0,0,0,0,0,312887476224,-3519216947200,-3459334425088,-3461026818048,19456,R,0.009504
I'm trying to get predictions for the Response Time, which is the right-most column. As you can see, my data goes to the 6th decimal place.
However, WEKA's predictions only go to the 3rd. Here are the contents of the file named "predictions":
inst# actual predicted error
1 0.006 0.005 -0.002
2 0.011 0.017 0.006
3 0.014 0.002 -0.012
4 0.005 0.022 0.016
5 0.01 0.012 0.002
6 0.005 0.012 0.007
7 0.004 0.018 0.014
8 0 0.001 0
9 0 0.001 0
10 0.01 0.012 0.003
As you can see, this greatly limits the accuracy of my predictions. Very small numbers less than 0.0005 (like rows 8 and 9) show up as 0 instead of a more accurate small decimal value.
I'm using WEKA on the "Simple Command Line" instead of the GUI. My command to build the model looks like this:
java weka.classifiers.trees.REPTree -M 2 -V 0.00001 -N 3 -S 1 -L -1 -I 0.0 -num-decimal-places 6 \
-t [removed path]/TrainSet.arff \
-T [removed path]/TestSet.arff \
-d [removed path]/model1.model > \
[removed path]/model1output
([removed path]: I just removed the full pathname for privacy)
As you can see, I found this "-num-decimal-places" switch for creating the model.
Then I use the following command to make the predictions:
java weka.classifiers.trees.REPTree \
-T [removed path]/LUN0train.arff \
-l [removed path]/model1.model -p 0 > \
[removed path]/predictions
I can't use the "-num-decimal-places" switch here because WEKA doesn't allow it in this case for some reason. "predictions" is my desired predictions file.
So I run these two commands, and it doesn't change the number of decimal places in the predictions! It's still only 3.
I've already looked at this answer, Weka decimal precision, and this answer on the Pentaho forum, but neither gave enough information to answer my question. These answers hinted that changing the number of decimal places might not be possible, but I just want to be sure.
Does anyone know of an option to fix this? Ideally a solution would be on the command line, but if you only know how to do it in the GUI, that's OK.

I just figured out a workaround, which is to simply scale/multiply the data by 1000, get the predictions, and then multiply by 1/1000 when done to get back to the original scale. Kind of outside the box, but it works.
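If you want to automate that scaling instead of editing the files by hand, something like the following sketch could work (this assumes the weka.filters.unsupervised.attribute.MathExpression filter that ships with recent WEKA versions; the output file name is only an illustration, and -R last targets the Response Time column):
java weka.filters.unsupervised.attribute.MathExpression \
-E "A*1000" -R last \
-i [removed path]/TrainSet.arff \
-o [removed path]/TrainSet_x1000.arff
Run the same filter over the test set, then divide the resulting predictions by 1000 to get back to the original scale.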
EDIT: An alternative way to do it: Answer from Peter Reutemann from http://weka.8497.n7.nabble.com/Changing-decimal-point-precision-td43393.html:
This has been around for a long time. ;-) "-p" is the really
old-fashioned way of outputting the predictions. Using the
"-classifications" option, you can specify what format the output is
to be in (eg CSV). The class that you specify with that option has to
be derived from
"weka.classifiers.evaluation.output.prediction.AbstractOutput":
http://weka.sourceforge.net/doc.dev/weka/classifiers/evaluation/output/prediction/AbstractOutput.html
Here is an example of using 12 decimals for the prediction output
using Java:
https://svn.cms.waikato.ac.nz/svn/weka/trunk/wekaexamples/src/main/java/wekaexamples/classifiers/PredictionDecimals.java
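On the simple command line, that approach would look roughly like this (a sketch only: it assumes WEKA 3.7+, where -classifications replaces -p, and that the PlainText output class accepts the -decimals option; check the AbstractOutput documentation linked above for your version):
java weka.classifiers.trees.REPTree \
-l [removed path]/model1.model \
-T [removed path]/LUN0train.arff \
-classifications "weka.classifiers.evaluation.output.prediction.PlainText -decimals 6" > \
[removed path]/predictions
Swapping PlainText for weka.classifiers.evaluation.output.prediction.CSV should give the same predictions in CSV form.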

Related

Extract first and second number after a match is found into a variable

I have a file that looks like this:
DATA REGRESSION SECTION
TEMPERATURE UNITS : C
AVERAGE ABSOLUTE DEVIATION = 0.1353242
AVG. ABS. REL. DEVIATION = 0.3980671E-01
DATA REGRESSION SECTION
PRESSURE UNITS : BAR
AVERAGE ABSOLUTE DEVIATION = 0.8465562E-12
AVG. ABS. REL. DEVIATION = 0.8381744E-12
DATA REGRESSION SECTION
COMPOSITION LIQUID1 METHANOL UNITS : MOLEFRAC
AVERAGE ABSOLUTE DEVIATION = 0.8718076E-02
AVG. ABS. REL. DEVIATION = 0.3224882E-01
I would like to extract the first number after the occurrence of the string "TEMPERATURE" into a variable. Then, I would like to extract the second number after the occurrence of the string "TEMPERATURE" into another variable. Thus, I would have:
var1 = 0.1353242
var2 = 0.3980671E-01
I have tried the following, which works for the most part but will not keep the decimal point or 'E' character.
var1=$(grep -A 1 TEMPERATURE input.txt)| echo $var1 | sed 's/[^0-9]*//g'
If anyone is curious, I found the following solution that seems to work:
var1=$(grep -A 1 TEMPERATURE input.txt)
var1=$(sed 's/.*= //g' <<< $var1)
var2=$(grep -A 2 TEMPERATURE input.txt)
var2=$(sed 's/.*= //g' <<< $var2)
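A shorter alternative in the same spirit is an awk sketch (assuming, as above, that the two deviation values sit one and two lines below the TEMPERATURE line and that the number is always the last field on its line):
var1=$(awk '/TEMPERATURE/ {getline; print $NF; exit}' input.txt)
var2=$(awk '/TEMPERATURE/ {getline; getline; print $NF; exit}' input.txt)
Because awk prints the whole last field, the decimal point and the E-notation exponent are preserved.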

Preprocessing categorical data already converted into numbers

I'm fairly new to machine learning, so I don't know the correct terminology, but I converted two categorical columns into numbers the following way. These columns are part of my feature inputs, akin to the sex column in the Titanic dataset.
(They are not the target data y which I have already created)
changed p_changed
Date
2010-02-17 0.477182 0 0
2010-02-18 0.395813 0 0
2010-02-19 0.252179 1 1
2010-02-22 0.401321 0 1
2010-02-23 0.519375 1 1
Now the rest of my data X looks something like this:
Open High Low Close Volume Adj Close log_return \
Date
2010-02-17 2.07 2.07 1.99 2.03 219700.0 2.03 -0.019513
2010-02-18 2.03 2.03 1.99 2.03 181700.0 2.03 0.000000
2010-02-19 2.03 2.03 2.00 2.02 116400.0 2.02 -0.004938
2010-02-22 2.05 2.05 2.02 2.04 188300.0 2.04 0.009852
2010-02-23 2.05 2.07 2.01 2.05 255400.0 2.05 0.004890
close_open Daily_Change 30_Avg_Vol 20_Avg_Vol 15_Avg_Vol \
Date
2010-02-17 0.00 -0.04 0.909517 0.779299 0.668242
2010-02-18 0.00 0.00 0.747470 0.635404 0.543015
2010-02-19 0.00 -0.01 0.508860 0.417706 0.348761
2010-02-22 0.03 -0.01 0.817274 0.666903 0.562414
2010-02-23 0.01 0.00 1.078411 0.879007 0.742730
As you can see the rest of my data is continuous (containing many variables) as opposed to the two categorical columns which only have two values (0 and 1).
I was planning to preprocess all this data in one shot via this simple preprocess method
X_scaled = preprocessing.scale(X)
I was wondering if this is a mistake? Is there something else I need to do to the categorical values before using this simple preprocessing?
EDIT: I tried two ways. First, I tried scaling the full data, including the categorical data converted to 1's and 0's.
Full_X = OPK_df.iloc[:-5, 0:-5]
Full_X_scaled = preprocessing.scale( Full_X) # First way, which scales everything in one shot.
Then I tried dropping the last two columns, scaling, then adding the dropped columns via this code.
X = OPK_df.iloc[:-5, 0:-7] # Here I go down to -7 instead of the original -5, which drops the two extra (categorical) columns.
I created another dataframe which has those two columns I dropped
x2 = OPK_df.iloc[:-5, -7:-5]
x2 = np.array(x2)  # convert it to an array
# preprocess the data without the last two columns
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
# then concatenate X_scaled with x2 (the originally dropped columns)
X = np.concatenate((X_scaled, x2), axis=1)
#Creating a classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn2 = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_scaled, y)
knn2.fit(X,y)
knn.score(Full_X_scaled, y)
0.71396522714526078
knn2.score(X, y)
0.71789119461581608
So there is a higher score when I do indeed drop the two columns during standardization.
You're doing pretty well so far. Do not scale your classification data. Since those appear to be binary classifications, think of them as "Yes" and "No". What would it mean to scale these?
Even worse, consider that you might have classifications such as flower types: you've coded Zinnia=0, Rose=1, Orchid=2, etc. What would it mean to scale those? It doesn't make any sense to re-code these as Zinnia=-0.257, Rose=+0.448, etc.
Scaling your continuous input data is the necessary part: it keeps the values within comparable ranges (comparable mathematical influence), allowing you to use a single treatment for your loss function. Otherwise, the feature with the largest spread of values would have the greatest influence on training, until your model's weights learned how to properly discount the large values.
For your beginning explorations, don't do any other preprocessing: just scale the input data and start your fitting exercises.

Vowpal Wabbit same results always

I am using VW to try to predict multiple classes. The strangest part is that no matter which parameters I use, the result is always the same.
Should that happen, maybe because of my data?
Details:
Around 90k lines of data. A line of the data:
1 2334225|SUBDEPT "D1SUB1" "D2SUB1" |DEPT "DEPT1" "DEPT2" |SCANCODE "11223442" "65434533543" |WDAY Friday |AMTBOUGHT 2
It's a multiclass problem, so the command line is:
vw --ect 38 ../Processed/train.vw.txt --loss_function logistic --link=logistic
The only parameter that changes anything is switching from --ect to --oaa. I have tried adding the following, but none of them changes the final validation values:
-c -k --passes 20 (goes until 8)
--l1 or --l2
--power_t
--ignore D or --ignore d (or s or su...)
The results are always:
average loss = 0.911153 h
Is there something that I am missing here?

Learning Weka - Precision and Recall - Wiki example to .Arff file

I'm new to WEKA and advanced statistics, starting from scratch to understand the WEKA measures. I've done all the rushdi-shams examples, which are great resources.
The Wikipedia article http://en.wikipedia.org/wiki/Precision_and_recall explains the concepts with a simple example about video recognition software detecting 7 dogs in a group of 9 real dogs and some cats.
I understand the example and the recall calculation perfectly.
So as my first step, let's see how to reproduce this data in Weka.
How do I create such an .ARFF file?
With the file below I get a wrong confusion matrix and wrong accuracy by class:
recall should not be 1, it should be 4/9 (0.4444).
@relation 'dogs and cat detection'
@attribute 'realanimal' {dog,cat}
@attribute 'detected' {dog,cat}
@attribute 'class' {correct,wrong}
@data
dog,dog,correct
dog,dog,correct
dog,dog,correct
dog,dog,correct
cat,dog,wrong
cat,dog,wrong
cat,dog,wrong
dog,?,?
dog,?,?
dog,?,?
dog,?,?
dog,?,?
cat,?,?
cat,?,?
Weka output (without filters):
=== Run information ===
Scheme:weka.classifiers.rules.ZeroR
Relation: dogs and cat detection
Instances: 14
Attributes: 3
realanimal
detected
class
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
ZeroR predicts class value: correct
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 4 57.1429 %
Incorrectly Classified Instances 3 42.8571 %
Kappa statistic 0
Mean absolute error 0.5
Root mean squared error 0.5044
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 7
Ignored Class Unknown Instances 7
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 1 0.571 1 0.727 0.65 correct
0 0 0 0 0 0.136 wrong
Weighted Avg. 0.571 0.571 0.327 0.571 0.416 0.43
=== Confusion Matrix ===
a b <-- classified as
4 0 | a = correct
3 0 | b = wrong
There must be something wrong with the False Negative dogs,
or is my ARFF approach totally wrong and do I need another kind of attributes?
Thanks
Let's start with the basic definitions of Precision and Recall.
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
Where TP is True Positive, FP is False Positive, and FN is False Negative.
In the above dog.arff file, Weka took into account only the first 7 tuples; it ignored the remaining 7. It can be seen from the above output that it has classified all 7 of those tuples as correct (4 correct tuples + 3 wrong tuples).
Let's calculate the precision and recall for the correct and wrong classes.
First, for the correct class:
Prec = 4/(4+3) = 0.571428571
Recall = 4/(4+0) = 1
For the wrong class:
Prec = 0/(0+0) = 0
Recall = 0/(0+3) = 0
Because the 7 instances with an unknown class value are ignored, there are no false negatives for the correct class from Weka's point of view, which is why the reported recall is 1 rather than the 4/9 from the Wikipedia detection example.

Step by step guide to train a multilayer perceptron for the XOR case in Weka?

I'm just getting started with Weka and having trouble with the first steps.
We've got our training set:
@relation PerceptronXOR
@attribute X1 numeric
@attribute X2 numeric
@attribute Output numeric
@data
1,1,-1
-1,1,1
1,-1,1
-1,-1,-1
The first step I want to do is just train, and then classify a set using the Weka GUI.
What I've been doing so far:
Using Weka 3.7.0.
Start GUI.
Explorer.
Open file -> choose my arff file.
Classify tab.
Use training set radio button.
Choose -> functions > MultilayerPerceptron
Click the 'MultilayerPerceptron' text at the top to open the settings.
Set hidden layers to '2'. (If GUI is set to true, this shows that this is the correct network we want.) Click OK.
Click start.
outputs:
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H 2 -R
Relation: PerceptronXOR
Instances: 4
Attributes: 3
X1
X2
Output
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Node 0
Inputs Weights
Threshold 0.21069691964232443
Node 1 1.8781169869419072
Node 2 -1.8403146612166397
Sigmoid Node 1
Inputs Weights
Threshold -3.7331156814378685
Attrib X1 3.6380519730323164
Attrib X2 -1.0420815868133226
Sigmoid Node 2
Inputs Weights
Threshold -3.64785119182632
Attrib X1 3.603244645539393
Attrib X2 0.9535137571446323
Class
Input
Node 0
Time taken to build model: 0 seconds
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.7047
Mean absolute error 0.6073
Root mean squared error 0.7468
Relative absolute error 60.7288 %
Root relative squared error 74.6842 %
Total Number of Instances 4
It seems odd that 500 iterations at 0.3 doesn't get the error down, but 5000 at 0.1 does, so let's go with that.
Now use the test data set:
@relation PerceptronXOR
@attribute X1 numeric
@attribute X2 numeric
@attribute Output numeric
@data
1,1,-1
-1,1,1
1,-1,1
-1,-1,-1
0.5,0.5,-1
-0.5,0.5,1
0.5,-0.5,1
-0.5,-0.5,-1
Radio button to 'Supplied test set'
Select my test set arff.
Click start.
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.1 -M 0.2 -N 5000 -V 0 -S 0 -E 20 -H 2 -R
Relation: PerceptronXOR
Instances: 4
Attributes: 3
X1
X2
Output
Test mode: user supplied test set: size unknown (reading incrementally)
=== Classifier model (full training set) ===
Linear Node 0
Inputs Weights
Threshold -1.2208619057226187
Node 1 3.1172079341507497
Node 2 -3.212484459911485
Sigmoid Node 1
Inputs Weights
Threshold 1.091378074639599
Attrib X1 1.8621040828953983
Attrib X2 1.800744048145267
Sigmoid Node 2
Inputs Weights
Threshold -3.372580743113282
Attrib X1 2.9207154176666386
Attrib X2 2.576791630598144
Class
Input
Node 0
Time taken to build model: 0.04 seconds
=== Evaluation on test set ===
=== Summary ===
Correlation coefficient 0.8296
Mean absolute error 0.3006
Root mean squared error 0.6344
Relative absolute error 30.0592 %
Root relative squared error 63.4377 %
Total Number of Instances 8
Why is it unable to classify these correctly?
Is it just because it's reached a local minimum quickly on the training data, and doesn't 'know' that that doesn't fit all the cases?
Questions:
Why does 500 iterations at 0.3 not work? It seems odd for such a simple problem.
Why does it fail on the test set?
How do I pass in a set to classify?
Using a learning rate of 0.5 does the job with 500 iterations for both examples.
The learning rate is how much weight the network gives to new examples.
Apparently the problem is difficult and it is easy to get stuck in local minima with the 2-node hidden layer. If you use a low learning rate with a high iteration count, the learning process will be more conservative and more likely to reach a good minimum.
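For reference, the equivalent experiment can be run from the simple command line with something like this sketch (the option flags match the scheme line in the output above: -L learning rate, -M momentum, -N epochs, -H hidden layer definition; the ARFF file names are placeholders for your own training and test files):
java weka.classifiers.functions.MultilayerPerceptron \
-L 0.5 -M 0.2 -N 500 -H 2 \
-t PerceptronXOR_train.arff \
-T PerceptronXOR_test.arff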
