I am using the UCI ML breast cancer dataset to build a classifier with SVMs. I am using LIBSVM and its fselect.py script to calculate F-scores for feature selection. My dataset has 8 features, and their scores are as follows:
5: 1.765716
2: 1.413180
1: 1.320096
6: 1.103449
8: 0.790712
3: 0.734230
7: 0.698571
4: 0.580819
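For reference, the F-score that fselect.py reports for each feature is a ratio of between-class separation to within-class variance. A minimal numpy sketch of that computation (toy data, not the UCI set; the function name is mine):

```python
import numpy as np

def f_score(X, y):
    """Per-feature F-score for binary labels y: between-class separation
    of the class means over the pooled within-class variance."""
    pos, neg = X[y == 1], X[y == 0]
    mean_all = X.mean(axis=0)
    numer = (pos.mean(axis=0) - mean_all) ** 2 + (neg.mean(axis=0) - mean_all) ** 2
    denom = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return numer / denom

# Toy data: feature 0 separates the classes, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = np.column_stack([
    np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)]),
    rng.normal(0, 1, 100),
])
y = np.array([0] * 50 + [1] * 50)

scores = f_score(X, y)
ranking = np.argsort(scores)[::-1]  # feature indices, best first
```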
This implies that the 5th feature is the most discriminative and the 4th the least. My next piece of code looks something like this:
subsets = {5, [5 2], [5 2 6], [5 2 6 8], [5 2 6 8 3], [5 2 6 8 3 7], [5 2 6 8 3 7 4]};
errors2 = zeros(7,1);
for k = 1:7
    errors2(k) = svmtrain(y, x(:,subsets{k}), '-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
end
Note: gamma and C were computed using grid search, and x is the complete matrix with 8 columns (corresponding to the 8 features).
When I print the errors2 matrix, I get the following output:
errors2 =
88.416
92.229
93.109
94.135
94.282
94.575
94.575
This means that I get the highest accuracy when I use all the features and the lowest when I use only the most discriminating feature. As far as I know, I should get the highest accuracy when I use a subset of features containing the most discriminating one. Why is the program behaving this way? Can someone point out any errors I might have made?
(My intuition says that I've computed C wrong, since it is so small.)
The results you are getting are as expected: adding an extra feature should reduce the error rate, because the model has more information.
As an example, consider trying to work out what model a car is. The most discriminative feature is probably the manufacturer, but adding features such as engine size, height, width, length, weight etc will narrow it down further.
If you are considering lots of features, some of which have very low discriminative power, you might run into problems with overfitting to your training data. Here you have just 8 features, but it already looks like adding the last feature has no effect. (In the car example, such features might be how dirty the car is, the amount of tread left on the tyres, the channel the radio is tuned to, etc.)
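The incremental experiment from the question can be sketched with scikit-learn's SVC (a LIBSVM wrapper). The feature ranking below is an illustrative assumption, since scikit-learn's copy of the breast-cancer data has 30 features rather than 8:

```python
# Hedged sketch of the "add features best-first" experiment.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical best-first feature ranking (indices into X's columns),
# standing in for the F-score ranking from the question.
ranking = [22, 27, 7, 20, 2, 23, 3, 0]

accuracies = []
for k in range(1, len(ranking) + 1):
    subset = ranking[:k]  # top-k ranked features
    clf = SVC(kernel="rbf", C=0.0625, gamma=0.0039062)  # question's grid-searched values
    accuracies.append(cross_val_score(clf, X[:, subset], y, cv=10).mean())
```

If accuracy keeps climbing as k grows and then flattens, that is exactly the behaviour described above: informative features help, and the weakest ones add little.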
I am following this link and trying to implement the scenarios there.
So I need to generate data for MANET nodes representing their locations in this format:
Current time – latest x – latest y – latest update time – previous x – previous y – previous update time
using the setdest tool with these options:
1500 by 300 grid, ran for 300 seconds and used pause times of 20s and maximum velocities of 2.5 m/s.
So I came up with this command:
./setdest -v 2 -n 10 -s 2.5 -m 10 -M 50 -t 300 -p 20 -x 1500 -y 300 > test1.tcl
which worked and generated a tcl file, but I don't know how to obtain the data in the required format.
setdest -v 2 -n 10 -s 2.5 -m 10 -M 50 -t 300 -p 20 -x 1500 -y 300 > test1.tcl
This is not a tcl file: it is a "scen" / scenario file with 1,700 "ns" commands. Your file was renamed to "test1.scen" and is now used in the manet examples, in the simulation example aodv-manet-20.tcl:
set val(cp) "test1.scen" ;#Connection Pattern
Please be aware that the time settings are maximum times. "Long time settings" were useful ~20 years ago when computers were slow. (Though there are complex simulations lasting half an hour to an hour.)
Link, manet-examples-1.tar.gz https://drive.google.com/file/d/0B7S255p3kFXNR05CclpEdVdvQm8/view?usp=sharing
Edit: new example added, manet0-16-nam.tcl: https://drive.google.com/file/d/0B7S255p3kFXNR0ZuQ1l6YnlWRGc/view?usp=sharing
I am using weka 3.6.13 and trying to use a model to classify data:
java -cp weka-stable-3.6.13.jar weka.classifiers.Evaluation weka.classifiers.trees.RandomForest -l Parking.model -t Data_features_class_ques-2.arff
java.lang.Exception: training and test set are not compatible
though the model works when we use the GUI: Explorer -> Classify -> Supplied test set (load the arff file) -> right-click on the result list and load the model -> right-click again -> re-evaluate model on current data set.
Any pointers would be appreciated.
If your data contains "String" attributes, first run StringToWordVector in batch mode, i.e. on both datasets in a single command (command 1), then use commands 2 and 3.
Command 1.
java weka.filters.unsupervised.attribute.StringToWordVector -b -R first-last -i training.arff -o training_s2w.arff -r test.arff -s test_s2w.arff
Command 2.
java weka.classifiers.trees.RandomForest -t training_s2w.arff -d model.model
Command 3.
java weka.classifiers.trees.RandomForest -T test_s2w.arff -l model.model -p 0 > result.txt
PS: add path for weka.jar accordingly.
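The idea behind batch mode can be sketched with scikit-learn's CountVectorizer as an analogy (not Weka's API): the word vocabulary is learned once from the training data and then reused for the test data, so both sets end up with identical attribute sets.

```python
# Analogy for Weka's StringToWordVector batch mode (-b): fit the
# vocabulary on the training set, then apply the same mapping to the
# test set, so the two feature spaces stay compatible.
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free parking downtown", "paid parking garage"]
test_docs = ["free garage downtown"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # learns the vocabulary
X_test = vec.transform(test_docs)        # reuses it; no new columns appear

assert X_train.shape[1] == X_test.shape[1]  # compatible feature spaces
```

Filtering each file independently would give each its own vocabulary, which is one common cause of the "training and test set are not compatible" error.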
My test.csv file:
==================
1,54,1341775056478
2,1568,1341775056478
1,1622,1341775056498
2,3136,1341775056498
1,3190,1341775056671
2,4704,1341775056671
1,4758,1341775056693
2,6272,1341775056693
1,6326,1341775056714
2,7840,1341775056714
1,7894,1341775056735
2,9408,1341775056735
1,9462,1341775056951
2,10976,1341775056951
1,11030,1341775056972
2,12544,1341775056972
1,12598,1341775056994
2,14112,1341775056994
1,14166,1341775057014
2,15680,1341775057014
1,15734,1341775057065
2,17248,1341775057065
1,17302,1341775057087
2,18816,1341775057087
1,18870,1341775057119
2,20384,1341775057119
....
....
I am trying to cluster this data using the Mahout k-means algorithm.
I followed these steps:
1) Create a sequence file from the test.csv file:
mahout seqdirectory -c UTF-8 -i /user/mahout/input/test.csv -o /user/sample/out_seq -chunk 64
2) Create a sparse vector from the sequence file:
mahout seq2sparse -i /user/mahout/out_seq/ -o /user/mahout/sparse_dir --maxDFPercent 85 --namedVector
3) Perform k-means clustering:
mahout kmeans -i /user/mahout/sparse_dir/tfidf-vectors/ -c /user/mahout/cluster -o /user/mahout/kmeans_out
-dm org.apache.mahout.common.distance.CosineDistanceMeasure --maxIter 10 --numClusters 20 --ow --clustering
At step 3, I'm facing this error:
Exception in thread "main" java.lang.IllegalStateException: No input clusters found in /user/mahout/text/cluster/part-randomSeed. Check your -c argument.
at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:213)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:147)
....
....
How can I overcome this error? I completed the clustering example successfully using the Reuters dataset, but with my dataset it shows this issue. Is there a problem with the dataset, or is some other issue causing this error?
Any suggestions would be appreciated.
Thanks in advance
So I'm using weka 3.7.11 on a Windows machine (running bash scripts with cygwin), and I found an inconsistency regarding the AODE classifier (which, in this version of weka, comes from an add-on package).
Using Averaged N-Dependence Estimators from the GUI, I get the following configuration (from an example that worked fine in the Weka Explorer):
weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.Discretize -F -B 10 -M -1.0 -R first-last" -W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S
So I modified this to get the following command in my bash script:
java -Xmx60G -cp "C:\work\weka-3.7.jar;C:\Users\Oracle\wekafiles\packages\AnDE\AnDE.jar" weka.classifiers.meta.FilteredClassifier \
-t train_2.arff -T train_1.arff \
-classifications "weka.classifiers.evaluation.output.prediction.CSV -distribution -p 1 -file predictions_final_multi.csv -suppress" \
-threshold-file umbral_multi.csv \
-F "weka.filters.unsupervised.attribute.Discretize -F -B 10 -M -1.0 -R first-last" \
-W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S
But this gives me the error:
Weka exception: No value given for -S option.
Which is weird, since this was not a problem in the GUI. There, the Information box says that -S is just a flag ("Subsumption Resolution can be achieved by using -S option"), so it shouldn't expect a number at all, which is consistent with what I got using the Explorer.
So then, what's the deal with the -S option when using the command line? Looking at the error text given by weka, I found this:
Options specific to classifier weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE:
-D
Output debugging information
-F <int>
Impose a frequency limit for superParents (default is 1)
-M <double>
Specify a weight to use with m-estimate (default is 1)
-S <int>
Specify a critical value for specialization-generalization SR (default is 100)
-W
Specify if to use weighted AODE
So it seems that this class works in two different ways, depending on which method I use (GUI vs. Command Line).
The solution I found, at least for the meantime, was to write -S 100 on my script. Is this really the same as just putting -S in the GUI?
Thanks in advance.
JM
I've had a play with this classifier, and can confirm that what you are experiencing on your end is consistent with what I see here. From the GUI, the -S option (Subsumption Resolution) takes no parameter, while from the command prompt it does (specialization-generalization SR).
They don't sound like the same parameter, so you may need to raise this issue with the developer of the third-party package if you would like more information. You can find their contact details via Tools -> Package Manager -> AnDE.
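For what it's worth, the mismatch is the classic difference between a boolean flag and an option that consumes a value; a minimal sketch with Python's argparse (illustrative only, not Weka's own option parser):

```python
import argparse

# GUI-style: -S is a bare switch, present or absent.
flag_parser = argparse.ArgumentParser()
flag_parser.add_argument("-S", action="store_true")

# CLI-style: -S consumes an integer, defaulting to 100.
valued_parser = argparse.ArgumentParser()
valued_parser.add_argument("-S", type=int, default=100)

print(flag_parser.parse_args(["-S"]).S)           # True
print(valued_parser.parse_args(["-S", "100"]).S)  # 100
# valued_parser.parse_args(["-S"]) would abort with "expected one argument",
# the argparse analogue of Weka's "No value given for -S option".
```

Whether -S 100 on the command line reproduces the GUI's bare -S depends on whether 100 really is the default critical value, which the help text above suggests it is.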
I tried to run the kmeans example in Mahout 0.5, but failed! I found in kmeans.props that it requires a strange parameter, -c, which means path_to_initial_clusters.
What is this? How do I prepare it?
kmeans.props:
The following parameters must be specified
i|input = /path/to/input
c|clusters = /path/to/initial/clusters
Mahout needs input in a specific format to carry out its clustering algorithms.
So have a look at:
seq2sparse: sparse vector generation from text sequence files
seqdirectory: generate sequence files (of Text) from a directory
As an example, take the Reuters-21578 dataset.
The steps are:
1. mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
2. mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
3. mahout kmeans -i reuters-vectors/tfidf-vectors/ \
   -c reuters-initial-clusters \
   -o reuters-kmeans-clusters \
   -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
   -cd 1.0 -k 20 -x 20 -cl
Hope it helps
K-means requires initial clusters in order to iteratively update the centroids (the centers of the clusters) until they converge.
-c, path_to_initial_clusters, just asks you for a directory in which Mahout can store its initial clusters.
You can specify any path: Mahout will compute the initial clusters and store them in that directory. Alternatively, you can compute the initial clusters yourself (e.g. by canopy clustering or another method) and give Mahout the directory of those precomputed clusters to initialize the k-means run.
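To make the role of the initial clusters concrete, here is a bare-bones k-means sketch in numpy (illustrative; Mahout's part-randomSeed output plays the role of init here):

```python
import numpy as np

def kmeans(X, init, iters=10):
    """Refine the given initial centroids by alternating assignment
    and centroid-update steps (plain Lloyd's algorithm)."""
    centroids = init.copy()
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
init = np.array([[0.0, 0.0], [5.0, 5.0]])  # the "initial clusters"
centroids, labels = kmeans(X, init)
```

Without init (Mahout's -c), the loop has no centroids to start from, which is exactly why the driver aborts when it finds no input clusters.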