Exception in thread "main" java.lang.IllegalArgumentException: Training dataset /tmp/iris.csv cannot be found in Mahout

When I try to train an MLP in Mahout:
/home/batu/Documents/mahout/trunk/bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron -i /tmp/iris.csv -labels Iris-setosa Iris-versicolor Iris-virginica -mo /tmp/mlp.model -ls 4 8 3 -l 0.2 -m 0.35 -r 0.0001
this exception occurs:
Exception in thread "main" java.lang.IllegalArgumentException: Training dataset /tmp/iris2.csv cannot be found!
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
at org.apache.mahout.classifier.mlp.TrainMultiLayerPerceptron.main(TrainMultiLayerPerceptron.java:124)
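A first sanity check (my suggestion, not part of the original post): the command passes /tmp/iris.csv while the exception mentions /tmp/iris2.csv, so it is worth confirming which path the trainer actually receives and that the file exists there. Depending on the Hadoop configuration the path may resolve against the local filesystem or HDFS, so check both:
# Check the local filesystem and HDFS; the -i path may resolve against either.
ls -l /tmp/iris.csv
hadoop fs -ls /tmp/iris.csv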

Related

Segmentation fault while training an ML model using YOLOv5

I'm doing a project on object detection using YOLOv5. After I run the train.py file, training stops with a segmentation fault. How do I solve this?
zsh: segmentation fault python3 train.py --img 640 --cfg yolov5m.yaml --hyp --batch 5 --epochs 1
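A hedged first step (my suggestion, not from the original thread): segmentation faults during training often come from a crash in a DataLoader worker process, so running with --workers 0 can turn the bare segfault into a readable Python traceback. The bare --hyp flag is omitted here, since it normally takes the path to a hyperparameter YAML:
# Single-process data loading, to isolate a crashing DataLoader worker.
python3 train.py --img 640 --cfg yolov5m.yaml --batch 5 --epochs 1 --workers 0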

Feature selection in SVM classification: weird behaviour

I am using the UCI ML breast cancer dataset to build a classifier using SVMs. I am using LIBSVM and its fselect.py script to calculate f-scores for feature selection. My dataset has 8 features, and their scores are as follows:
5: 1.765716
2: 1.413180
1: 1.320096
6: 1.103449
8: 0.790712
3: 0.734230
7: 0.698571
4: 0.580819
This implies that the 5th feature is the most discriminative and the 4th is the least. My next piece of code looks something like this:
% Feature indices in decreasing order of f-score, as used in the original
% subsets above.
ranked = [5 2 6 8 3 7 4];
% errors2(k) holds the 10-fold cross-validation accuracy reported by
% svmtrain when training on the first k features of the ranking.
errors2 = zeros(7,1);
for k = 1:7
    xk = x(:, ranked(1:k));
    errors2(k) = svmtrain(y, xk, '-s 0 -t 2 -c 0.062500 -g 0.0039062 -v 10');
end
Note: gamma and C were computed using a grid search, and x is the complete matrix with 8 columns (corresponding to the 8 features).
When I print the errors2 matrix, I get the following output:
errors2 =
88.416
92.229
93.109
94.135
94.282
94.575
94.575
This means that I get the highest accuracy when I use all the features and the lowest when I use only the most discriminating feature. As far as I know, I should get the highest accuracy when I use a subset of features containing the most discriminating one. Why is the program behaving this way? Can someone point out any errors I might have made?
(My intuition says that I've calculated C wrong, since it is so small.)
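(For what it's worth: C = 0.0625 = 2^-4 and gamma = 0.0039062 ≈ 2^-8 are ordinary points on the power-of-two grid that LIBSVM's grid search explores, so a small C is not by itself a sign of a miscalculation.)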
The numbers you are getting are as would be expected; note that with -v they are cross-validation accuracies, not error rates. Adding an extra feature should reduce the error rate (that is, increase the accuracy you are printing), because you have more information.
As an example, consider trying to work out what model a car is. The most discriminative feature is probably the manufacturer, but adding features such as engine size, height, width, length, weight etc will narrow it down further.
If you are considering lots of features, some of which may have very low discriminative power, you might run into problems with overfitting to your training data. Here you have just 8 features, but it already looks like adding the last feature has no effect. (In the car example, low-power features might be how dirty the car is, the amount of tread left on the tyres, the channel the radio is tuned to, etc.)

No input clusters found in /user/mahout/cluster/part-randomSeed. Check your -c argument

My test.csv file:
==================
1,54,1341775056478
2,1568,1341775056478
1,1622,1341775056498
2,3136,1341775056498
1,3190,1341775056671
2,4704,1341775056671
1,4758,1341775056693
2,6272,1341775056693
1,6326,1341775056714
2,7840,1341775056714
1,7894,1341775056735
2,9408,1341775056735
1,9462,1341775056951
2,10976,1341775056951
1,11030,1341775056972
2,12544,1341775056972
1,12598,1341775056994
2,14112,1341775056994
1,14166,1341775057014
2,15680,1341775057014
1,15734,1341775057065
2,17248,1341775057065
1,17302,1341775057087
2,18816,1341775057087
1,18870,1341775057119
2,20384,1341775057119
....
....
I am trying to cluster this data using the Mahout k-means algorithm.
I followed these steps:
1) Create a sequence file from the test.csv file
mahout seqdirectory -c UTF-8 -i /user/mahout/input/test.csv -o /user/sample/out_seq -chunk 64
2) Create sparse vectors from the sequence file
mahout seq2sparse -i /user/mahout/out_seq/ -o /user/mahout/sparse_dir --maxDFPercent 85 --namedVector
3) Perform k-means clustering
mahout kmeans -i /user/mahout/sparse_dir/tfidf-vectors/ -c /user/mahout/cluster -o /user/mahout/kmeans_out -dm org.apache.mahout.common.distance.CosineDistanceMeasure --maxIter 10 --numClusters 20 --ow --clustering
At step 3, I'm facing this error:
Exception in thread "main" java.lang.IllegalStateException: No input clusters found in /user/mahout/text/cluster/part-randomSeed. Check your -c argument.
at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:213)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:147)
....
....
How do I overcome this error? I actually ran the clustering example successfully with the Reuters dataset, but with my dataset it shows this issue. Is there a problem with the dataset, or am I facing this error due to some other issue?
Can anyone suggest anything regarding this issue?
Thanks in advance
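A couple of hedged sanity checks (my suggestions, not from the original post). The exception mentions /user/mahout/text/cluster/part-randomSeed while the command passes -c /user/mahout/cluster, and step 1 writes to /user/sample/out_seq while step 2 reads /user/mahout/out_seq, so it is worth listing the actual paths to confirm each stage wrote its output where the next stage expects it:
# Confirm the tf-idf vectors and the seed clusters exist and are non-empty.
hadoop fs -ls /user/mahout/sparse_dir/tfidf-vectors
hadoop fs -ls /user/mahout/cluster
Also bear in mind that seqdirectory and seq2sparse are designed for directories of text documents, so feeding them a single numeric CSV may not produce the vectors you expect.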

How to run the DistributedLanczosSolver on Mahout

I am trying to run the Lanczos example of Mahout.
I am having trouble finding the input file, and I am not sure what the format of the input file should be.
I converted the .txt file into SequenceFile format by running:
bin/mahout seqdirectory -i input.txt -o outseq -c UTF-8
bin/mahout seq2sparse -i outseq -o ttseq
bin/hadoop jar mahout-examples-0.9-SNAPSHOT-job.jar org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver --input /user/hduser/outputseq --output /out1 --numCols 2 --numRows 4 --cleansvd "true" --rank 5
14/03/20 13:36:12 INFO lanczos.LanczosSolver: Finding 5 singular vectors of matrix with 4 rows, via Lanczos
14/03/20 13:36:13 INFO mapred.FileInputFormat: Total input paths to process : 7
Exception in thread "main" java.lang.IllegalStateException: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/user/hduser/ttseq/df-count/data
at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:245)
at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:200)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:152)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:111)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:283)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:289)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/user/hduser/ttseq/df-count/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:51)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:237)
... 13 more
Any idea please?
In your case you are doing input.txt -> outseq -> ttseq.
You are using outputseq (not outseq) as input to generate out1,
and yet you are getting an error about ttseq. That is strange; perhaps you are missing some step in your post.
For me:
This PASSES: text-files -> output-seqdir -> output-seq2sparse-normalized
This FAILS: text-files -> output-seqdir -> output-seq2sparse -> output-seq2sparse-normalized
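In other words, the pipeline fails exactly when the input to seq2sparse is the output of a previous seq2sparse run rather than the output of seqdirectory, which suggests these tools expect the raw SequenceFile text produced by seqdirectory as their input.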
More details.
I am seeing this error in a different situation:
Create sequence files
$ mahout seqdirectory -i /data/lda/text-files/ -o /data/lda/output-seqdir -c UTF-8
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 20:47:25 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/data/lda/ohsumed_full_txt/ohsumed_full_txt/], --keyPrefix=[], --output=[/data/lda/output], --startPhase=[0], --tempDir=[temp]}
14/03/24 20:57:20 INFO driver.MahoutDriver: Program took 594764 ms (Minutes: 9.912733333333334)
Convert sequence files to sparse vectors. Use TFIDF by default.
$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse/ -ow
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:00:08 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
14/03/24 21:00:10 INFO input.FileInputFormat: Total input paths to process : 1
14/03/24 21:00:11 INFO mapred.JobClient: Running job: job_201403241418_0001
.....
14/03/24 21:02:51 INFO driver.MahoutDriver: Program took 162906 ms (Minutes: 2.7151)
The following command fails (using /data/lda/output-seq2sparse as input):
$ mahout seq2sparse -i /data/lda/output-seq2sparse -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 -nr 5
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/data/lda/output-seq2sparse/df-count/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
....SKIPPED....
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
However, this works just fine (using /data/lda/output-seqdir as input):
$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 -nr 5
Running on hadoop, using .../hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ..../mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:35:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 2
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 50.0
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 5
14/03/24 21:35:57 INFO input.FileInputFormat: Total input paths to process : 1
...SKIPPED...
14/03/24 21:45:11 INFO common.HadoopUtil: Deleting /data/lda/output-seq2sparse-normalized/partial-vectors-0
14/03/24 21:45:11 INFO driver.MahoutDriver: Program took 556420 ms (Minutes: 9.273666666666667)

Error while creating a Mahout model

I am training a Mahout classifier on my data.
I issued the following commands to create the Mahout model:
./bin/mahout seqdirectory -i /tmp/mahout-work-root/MyData-all -o /tmp/mahout-work-root/MyData-seq
./bin/mahout seq2sparse -i /tmp/mahout-work-root/MyData-seq -o /tmp/mahout-work-root/MyData-vectors -lnorm -nv -wt tfidf
./bin/mahout split -i /tmp/mahout-work-root/MyData-vectors/tfidf-vectors --trainingOutput /tmp/mahout-work-root/MyData-train-vectors --testOutput /tmp/mahout-work-root/MyData-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
./bin/mahout trainnb -i /tmp/mahout-work-root/Mydata-train-vectors -el -o /tmp/mahout-work-root/model -li /tmp/mahout-work-root/labelindex -ow
When I try to create the model using the trainnb command, I get the following exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:119)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:152)
What could be the problem here?
Note: the original example mentioned here works fine.
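One detail worth noting (an observation on the commands above, not from the original answer): the split command writes the training vectors to /tmp/mahout-work-root/MyData-train-vectors, but trainnb reads /tmp/mahout-work-root/Mydata-train-vectors; the lowercase "d" means the two paths differ.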
I think it might be a problem with how you have laid out your training files.
The files should be organized as follows:
MyData-all
  \classA
    - file1
    - file2
    - ...
  \classB
    - filex
    - ...
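A minimal sketch of that layout (my example; the class and file names are placeholders). In the standard Mahout classification examples, seqdirectory prefixes each key with the subdirectory name, and trainnb -el extracts that prefix as the class label:
# One subdirectory per class; the directory name becomes the label.
mkdir -p /tmp/mahout-work-root/MyData-all/classA /tmp/mahout-work-root/MyData-all/classB
echo "first training document" > /tmp/mahout-work-root/MyData-all/classA/file1
echo "another training document" > /tmp/mahout-work-root/MyData-all/classB/filex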
