WEKA: Changing the number of decimal places in predictions
I'm trying to get precise predictions from WEKA, and I need to increase the number of decimal places that it outputs for its prediction data.
My .arff training set looks like this:
@relation TrainSet
@attribute TimeDiff1 numeric
@attribute TimeDiff2 numeric
@attribute TimeDiff3 numeric
@attribute TimeDiff4 numeric
@attribute TimeDiff5 numeric
@attribute TimeDiff6 numeric
@attribute TimeDiff7 numeric
@attribute TimeDiff8 numeric
@attribute TimeDiff9 numeric
@attribute TimeDiff10 numeric
@attribute LBN/Distance numeric
@attribute LBNDiff1 numeric
@attribute LBNDiff2 numeric
@attribute LBNDiff3 numeric
@attribute Size numeric
@attribute RW {R,W}
@attribute 'Response Time' numeric
@data
0,0,0,0,0,0,0,0,0,0,203468398592,0,0,0,32768,R,0.006475
0.004254,0,0,0,0,0,0,0,0,0,4564742206976,4361273808384,0,0,65536,R,0.011025
0.002128,0.006382,0,0,0,0,0,0,0,0,4585966117376,21223910400,4382497718784,0,4096,R,0.01389
0.001616,0.003744,0,0,0,0,0,0,0,0,4590576115200,4609997824,25833908224,4387107716608,4096,R,0.005276
0.002515,0.004131,0.010513,0,0,0,0,0,0,0,233456156672,-4357119958528,-4352509960704,-4331286050304,32768,R,0.01009
0.004332,0.006847,0.010591,0,0,0,0,0,0,0,312887472128,79431315456,-4277688643072,-4273078645248,4096,R,0.005081
0.000342,0.004674,0.008805,0,0,0,0,0,0,0,3773914294272,3461026822144,3540458137600,-816661820928,8704,R,0.004252
0.000021,0.000363,0.00721,0,0,0,0,0,0,0,3772221901312,-1692392960,3459334429184,3538765744640,4096,W,0.00017
0.000042,0.000063,0.004737,0.01525,0,0,0,0,0,0,3832104423424,59882522112,58190129152,3519216951296,16384,W,0.000167
0.005648,0.00569,0.006053,0.016644,0,0,0,0,0,0,312887476224,-3519216947200,-3459334425088,-3461026818048,19456,R,0.009504
I'm trying to get predictions for the Response Time, which is the right-most column. As you can see, my data goes to the 6th decimal place.
However, WEKA's predictions only go to the 3rd. Here are the contents of the output file named "predictions":
inst# actual predicted error
1 0.006 0.005 -0.002
2 0.011 0.017 0.006
3 0.014 0.002 -0.012
4 0.005 0.022 0.016
5 0.01 0.012 0.002
6 0.005 0.012 0.007
7 0.004 0.018 0.014
8 0 0.001 0
9 0 0.001 0
10 0.01 0.012 0.003
As you can see, this greatly limits the accuracy of my predictions. Very small numbers less than 0.0005 (like rows 8 and 9) show up as 0 instead of a more accurate smaller decimal value.
I'm using WEKA on the "Simple Command Line" instead of the GUI. My command to build the model looks like this:
java weka.classifiers.trees.REPTree -M 2 -V 0.00001 -N 3 -S 1 -L -1 -I 0.0 -num-decimal-places 6 \
-t [removed path]/TrainSet.arff \
-T [removed path]/TestSet.arff \
-d [removed path]/model1.model > \
[removed path]/model1output
([removed path]: I just removed the full pathname for privacy)
As you can see, I found this "-num-decimal-places" switch for creating the model.
Then I use the following command to make the predictions:
java weka.classifiers.trees.REPTree \
-T [removed path]/LUN0train.arff \
-l [removed path]/model1.model -p 0 > \
[removed path]/predictions
I can't use the "-num-decimal-places" switch here because WEKA doesn't allow it in this case for some reason. "predictions" is the predictions file I want.
So I run these two commands, and the number of decimal places in the predictions doesn't change! It's still only 3.
I've already looked at this answer, Weka decimal precision, and this answer on the Pentaho forum, but neither gave enough information to answer my question. They hinted that changing the number of decimal places might not be possible, but I just want to be sure.
Does anyone know of an option to fix this? Ideally the solution would be on the command line, but if you only know how to do it in the GUI, that's OK.
I just figured out a workaround, which is to simply scale/multiply the response times by 1000, get the predictions, and then multiply them by 1/1000 when done to return to the original scale. Kinda outside the box, but it works.
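For anyone who wants to script that scaling step instead of editing the ARFF files by hand, here is a minimal sketch using the WEKA Java API. The file names are placeholders, and I'm assuming 'Response Time' is the last attribute, as in the training set above:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;

public class ScaleResponseTime {
    public static void main(String[] args) throws Exception {
        // Load the training set (file name is a placeholder for the real path).
        Instances data = DataSource.read("TrainSet.arff");
        // 'Response Time' is the last attribute in the ARFF above.
        int col = data.numAttributes() - 1;
        // Multiply every response time by 1000.
        for (int i = 0; i < data.numInstances(); i++) {
            data.instance(i).setValue(col, data.instance(i).value(col) * 1000.0);
        }
        // Write the scaled copy back out as a new ARFF file.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("TrainSet_x1000.arff"));
        saver.writeBatch();
    }
}

Run the same scaling over the test set, build the model and predict on the scaled files, and divide the resulting predictions by 1000 to get back to the original scale.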
EDIT: An alternative way to do it, from Peter Reutemann's answer at http://weka.8497.n7.nabble.com/Changing-decimal-point-precision-td43393.html:
This has been around for a long time. ;-) "-p" is the really old-fashioned way of outputting the predictions. Using the "-classifications" option, you can specify what format the output is to be in (e.g. CSV). The class that you specify with that option has to be derived from "weka.classifiers.evaluation.output.prediction.AbstractOutput":
http://weka.sourceforge.net/doc.dev/weka/classifiers/evaluation/output/prediction/AbstractOutput.html
Here is an example of using 12 decimals for the prediction output using Java:
https://svn.cms.waikato.ac.nz/svn/weka/trunk/wekaexamples/src/main/java/wekaexamples/classifiers/PredictionDecimals.java
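Putting that together with the prediction command above, something along these lines should produce 6-decimal output. I'm assuming a WEKA version whose PlainText prediction output class accepts a -decimals option; run the class with -h on your install to confirm:

java weka.classifiers.trees.REPTree \
-T [removed path]/LUN0train.arff \
-l [removed path]/model1.model \
-classifications "weka.classifiers.evaluation.output.prediction.PlainText -decimals 6" > \
[removed path]/predictions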
Related
Extract first and second number after a match is found into a variable
I have a file that looks like this:
DATA REGRESSION SECTION
TEMPERATURE UNITS : C
AVERAGE ABSOLUTE DEVIATION = 0.1353242
AVG. ABS. REL. DEVIATION = 0.3980671E-01
DATA REGRESSION SECTION
PRESSURE UNITS : BAR
AVERAGE ABSOLUTE DEVIATION = 0.8465562E-12
AVG. ABS. REL. DEVIATION = 0.8381744E-12
DATA REGRESSION SECTION
COMPOSITION LIQUID1 METHANOL UNITS : MOLEFRAC
AVERAGE ABSOLUTE DEVIATION = 0.8718076E-02
AVG. ABS. REL. DEVIATION = 0.3224882E-01
I would like to extract the first number after the occurrence of the string "TEMPERATURE" into a variable. Then, I would like to extract the second number after the occurrence of the string "TEMPERATURE" into another variable. Thus, I would have:
var1 = 0.1353242
var2 = 0.3980671E-01
I have tried the following, which works for the most part but will not keep the decimal point or 'E' character:
var1=$(grep -A 1 TEMPERATURE input.txt)| echo $var1 | sed 's/[^0-9]*//g'
If anyone is curious, I found the following solution that seems to work:
var1=$(grep -A 1 TEMPERATURE output.txt)
var1=$(sed 's/.*= //g' <<< $var1)
var2=$(grep -A 2 TEMPERATURE input.txt)
var2=$(sed 's/.*= //g' <<< $var2)
Preprocessing categorical data already converted into numbers
I'm fairly new to machine learning, so I don't know the correct terminology, but I converted two categorical columns into numbers the following way. These columns are part of my feature inputs, akin to the sex column in the Titanic dataset. (They are not the target data y, which I have already created.)

            changed  p_changed
Date
2010-02-17  0.477182  0  0
2010-02-18  0.395813  0  0
2010-02-19  0.252179  1  1
2010-02-22  0.401321  0  1
2010-02-23  0.519375  1  1

Now the rest of my data X looks something like this:

            Open  High  Low   Close  Volume    Adj Close  log_return
Date
2010-02-17  2.07  2.07  1.99  2.03   219700.0  2.03       -0.019513
2010-02-18  2.03  2.03  1.99  2.03   181700.0  2.03        0.000000
2010-02-19  2.03  2.03  2.00  2.02   116400.0  2.02       -0.004938
2010-02-22  2.05  2.05  2.02  2.04   188300.0  2.04        0.009852
2010-02-23  2.05  2.07  2.01  2.05   255400.0  2.05        0.004890

            close_open  Daily_Change  30_Avg_Vol  20_Avg_Vol  15_Avg_Vol
Date
2010-02-17  0.00        -0.04         0.909517    0.779299    0.668242
2010-02-18  0.00         0.00         0.747470    0.635404    0.543015
2010-02-19  0.00        -0.01         0.508860    0.417706    0.348761
2010-02-22  0.03        -0.01         0.817274    0.666903    0.562414
2010-02-23  0.01         0.00         1.078411    0.879007    0.742730

As you can see, the rest of my data is continuous (containing many variables), as opposed to the two categorical columns which only have two values (0 and 1). I was planning to preprocess all this data in one shot via this simple preprocessing method:

X_scaled = preprocessing.scale(X)

I was wondering if this is a mistake? Is there something else I need to do to the categorical values before using this simple preprocessing?

EDIT: I tried two ways. First I tried scaling the full data, including the categorical data converted to 1's and 0's:

Full_X = OPK_df.iloc[:-5, 0:-5]
Full_X_scaled = preprocessing.scale( Full_X) # First way, which scales everything in one shot.

Then I tried dropping the last two columns, scaling, then adding the dropped columns back via this code:

X =OPK_df.iloc[:-5, 0:-7] # Here I'm dropping both -7 while originally the offset was only till -5, which means two extra columns were dropped.

I created another dataframe which has those two columns I dropped:

x2 =OPK_df.iloc[:-5, -7:-5]
x2 = np.array(x2) # convert it to an array

# preprocessing the data without last two columns
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)

# Then concat the X_scaled with x2 (originally dropped columns)
X =np.concatenate((X_scaled, x2), axis =1)

#Creating a classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn2 = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_scaled, y)
knn2.fit(X,y)

knn.score(Full_X_scaled, y)
0.71396522714526078
knn2.score(X, y)
0.71789119461581608

So there is a higher score when I do indeed drop the two columns during standardization.
You're doing pretty well so far. Do not scale your categorical columns. Since those appear to be binary classifications, think of them as "Yes" and "No". What does it mean to scale these? Even worse, consider that you might have categories such as flower types: you've coded Zinnia=0, Rose=1, Orchid=2, etc. What does it mean to scale those? It doesn't make any sense to re-code these as Zinnia=-0.257, Rose=+0.448, etc. Scaling your continuous input data is the necessary part: it keeps the values within comparable ranges (mathematical influence), allowing you to readily use a single treatment for your loss function. Otherwise, the feature with the largest spread of values would have the greatest influence on training, until your model's weights learned how to properly discount the large values. For your beginning explorations, don't do any other preprocessing: just scale the continuous input data and start your fitting exercises.
Vowpal Wabbit same results always
I am using VW to try to predict multiple classes. The strangest part is that it doesn't matter which parameters I use, the result is always the same. Should that happen, maybe because of my data?

Details: around 90k lines of data. A line of the data looks like this:

1 2334225|SUBDEPT "D1SUB1" "D2SUB1" |DEPT "DEPT1" "DEPT2" |SCANCODE "11223442" "65434533543" |WDAY Friday |AMTBOUGHT 2

It's a multiclass problem, so the command line is:

vw --ect 38 ../Processed/train.vw.txt --loss_function logistic --link=logistic

The single parameter that changes something is switching from --ect to --oaa. I have tried adding the following, but none of them changes the final validation values:
-c -k --passes 20 (goes until 8)
--l1 or --l2
--power_t
--ignore D or --ignore d (or s or su...)
The results are always:
average loss = 0.911153 h
Is there something that I am missing here?
Learning Weka - Precision and Recall - Wiki example to .Arff file
I'm new to WEKA and advanced statistics, starting from scratch to understand the WEKA measures. I've done all the @rushdi-shams examples, which are great resources. The Wikipedia article http://en.wikipedia.org/wiki/Precision_and_recall explains precision and recall with a simple example about video software that detects 7 dogs in a group of 9 real dogs and some cats. I perfectly understand the example and the recall calculation. So as a first step, I want to see how to reproduce this data in Weka. How do I create such a .ARFF file? With the file below I get a wrong Confusion Matrix and wrong Detailed Accuracy By Class: Recall should not be 1, it should be 4/9 (0.4444).

@relation 'dogs and cat detection'
@attribute 'realanimal' {dog,cat}
@attribute 'detected' {dog,cat}
@attribute 'class' {correct,wrong}
@data
dog,dog,correct
dog,dog,correct
dog,dog,correct
dog,dog,correct
cat,dog,wrong
cat,dog,wrong
cat,dog,wrong
dog,?,?
dog,?,?
dog,?,?
dog,?,?
dog,?,?
cat,?,?
cat,?,?

Output from Weka (without filters):

=== Run information ===
Scheme:       weka.classifiers.rules.ZeroR
Relation:     dogs and cat detection
Instances:    14
Attributes:   3
              realanimal
              detected
              class
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: correct
Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances           4               57.1429 %
Incorrectly Classified Instances         3               42.8571 %
Kappa statistic                          0
Mean absolute error                      0.5
Root mean squared error                  0.5044
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances                7
Ignored Class Unknown Instances          7

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               1        1        0.571      1       0.727      0.65      correct
               0        0        0          0       0          0.136     wrong
Weighted Avg.  0.571    0.571    0.327      0.571   0.416      0.43

=== Confusion Matrix ===
 a b   <-- classified as
 4 0 | a = correct
 3 0 | b = wrong

There must be something wrong with the False Negative dogs, or is my ARFF approach totally wrong and do I need another kind of attributes? Thanks
Let's start with the basic definitions of Precision and Recall:
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
where TP is True Positives, FP is False Positives, and FN is False Negatives.
In the above dog.arff file, Weka took into account only the first 7 tuples; it ignored the remaining 7. It can be seen from the above output that it classified all 7 tuples as correct (4 correct tuples + 3 wrong tuples). Let's calculate the precision and recall for each class.
For the correct class:
Prec = 4/(4+3) = 0.571428571
Recall = 4/(4+0) = 1
For the wrong class:
Prec = 0/(0+0) = 0
Recall = 0/(0+3) = 0
Step by step guide to train a multilayer perceptron for the XOR case in Weka?
I'm just getting started with Weka and having trouble with the first steps.

We've got our training set:

@relation PerceptronXOR
@attribute X1 numeric
@attribute X2 numeric
@attribute Output numeric
@data
1,1,-1
-1,1,1
1,-1,1
-1,-1,-1

The first step I want to do is just train, and then classify a set using the Weka GUI. What I've been doing so far:

1. Using Weka 3.7.0.
2. Start GUI.
3. Explorer.
4. Open file -> choose my arff file.
5. Classify tab.
6. Use training set radio button.
7. Choose -> functions > multilayer_perceptron
8. Click the 'multilayer perceptron' text at the top to open settings.
9. Set Hidden layers to '2'. (If GUI is set to true, this shows that this is the correct network we want.)
10. Click OK.
11. Click Start.

It outputs:

=== Run information ===
Scheme:       weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H 2 -R
Relation:     PerceptronXOR
Instances:    4
Attributes:   3
              X1
              X2
              Output
Test mode:    evaluate on training data

=== Classifier model (full training set) ===
Linear Node 0
    Inputs    Weights
    Threshold    0.21069691964232443
    Node 1    1.8781169869419072
    Node 2    -1.8403146612166397
Sigmoid Node 1
    Inputs    Weights
    Threshold    -3.7331156814378685
    Attrib X1    3.6380519730323164
    Attrib X2    -1.0420815868133226
Sigmoid Node 2
    Inputs    Weights
    Threshold    -3.64785119182632
    Attrib X1    3.603244645539393
    Attrib X2    0.9535137571446323
Class
    Input
    Node 0

Time taken to build model: 0 seconds

=== Evaluation on training set ===
=== Summary ===
Correlation coefficient                  0.7047
Mean absolute error                      0.6073
Root mean squared error                  0.7468
Relative absolute error                 60.7288 %
Root relative squared error             74.6842 %
Total Number of Instances                4

It seems odd that 500 iterations at 0.3 doesn't get the error down, but 5000 @ 0.1 does, so let's go with that.

Now use the test data set:

@relation PerceptronXOR
@attribute X1 numeric
@attribute X2 numeric
@attribute Output numeric
@data
1,1,-1
-1,1,1
1,-1,1
-1,-1,-1
0.5,0.5,-1
-0.5,0.5,1
0.5,-0.5,1
-0.5,-0.5,-1

1. Radio button to 'Supplied test set'.
2. Select my test set arff.
3. Click Start.

=== Run information ===
Scheme:       weka.classifiers.functions.MultilayerPerceptron -L 0.1 -M 0.2 -N 5000 -V 0 -S 0 -E 20 -H 2 -R
Relation:     PerceptronXOR
Instances:    4
Attributes:   3
              X1
              X2
              Output
Test mode:    user supplied test set: size unknown (reading incrementally)

=== Classifier model (full training set) ===
Linear Node 0
    Inputs    Weights
    Threshold    -1.2208619057226187
    Node 1    3.1172079341507497
    Node 2    -3.212484459911485
Sigmoid Node 1
    Inputs    Weights
    Threshold    1.091378074639599
    Attrib X1    1.8621040828953983
    Attrib X2    1.800744048145267
Sigmoid Node 2
    Inputs    Weights
    Threshold    -3.372580743113282
    Attrib X1    2.9207154176666386
    Attrib X2    2.576791630598144
Class
    Input
    Node 0

Time taken to build model: 0.04 seconds

=== Evaluation on test set ===
=== Summary ===
Correlation coefficient                  0.8296
Mean absolute error                      0.3006
Root mean squared error                  0.6344
Relative absolute error                 30.0592 %
Root relative squared error             63.4377 %
Total Number of Instances                8

Why is it unable to classify these correctly? Is it just because it's reached a local minimum quickly on the training data, and doesn't 'know' that that doesn't fit all the cases?

Questions:
1. Why does 500 @ 0.3 not work? It seems odd for such a simple problem.
2. Why does it fail on the test set?
3. How do I pass in a set to classify?
Using a learning rate of 0.5 does the job with 500 iterations for both examples. The learning rate determines how much weight is given to new examples. Apparently the problem is difficult, and it is easy to get stuck in a local minimum with only 2 hidden nodes. If you use a low learning rate with a high iteration number, the learning process will be more conservative and more likely to reach a good minimum.
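For reference, a command-line equivalent of that run might look like the line below; it simply mirrors the scheme string from the question's run information with -L raised to 0.5 and -N kept at 500 (the ARFF file names are placeholders for your own paths):

java weka.classifiers.functions.MultilayerPerceptron -L 0.5 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H 2 -R \
-t PerceptronXOR-train.arff \
-T PerceptronXOR-test.arff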