How to run the DistributedLanczosSolver on Mahout

I am trying to run the Lanczos example of Mahout.
I am having trouble finding the input file, and I don't know what format the input file should have.
I converted a .txt file into SequenceFile format by running:
bin/mahout seqdirectory -i input.txt -o outseq -c UTF-8
bin/mahout seq2sparse -i outseq -o ttseq
bin/hadoop jar mahout-examples-0.9-SNAPSHOT-job.jar org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver --input /user/hduser/outputseq --output /out1 --numCols 2 --numRows 4 --cleansvd "true" --rank 5
14/03/20 13:36:12 INFO lanczos.LanczosSolver: Finding 5 singular vectors of matrix with 4 rows, via Lanczos
14/03/20 13:36:13 INFO mapred.FileInputFormat: Total input paths to process : 7
Exception in thread "main" java.lang.IllegalStateException: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/user/hduser/ttseq/df-count/data
at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:245)
at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:200)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:152)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:111)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:283)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:289)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/user/hduser/ttseq/df-count/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:51)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:237)
... 13 more
Any ideas, please?

In your case you are doing input.txt -> outseq -> ttseq.
But you are using outputseq (not outseq) as the input to generate out1,
and yet the error mentions ttseq. That is strange; perhaps a step is missing from your post.
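Also, seq2sparse writes several subdirectories (tfidf-vectors, tf-vectors, df-count, dictionary.file-0, ...), and if the solver was actually given the top-level ttseq directory it would try to read non-matrix files such as df-count/data, which matches your stack trace. A sketch of a consistent pipeline (the paths are placeholders, and the tfidf-vectors detail is from my own runs, so verify it against your layout):

```shell
# seqdirectory expects a directory of text files, not a single .txt file
bin/mahout seqdirectory -i text-input-dir -o outseq -c UTF-8
bin/mahout seq2sparse -i outseq -o ttseq
# point the solver at the vectors subdirectory, not the seq2sparse root
bin/hadoop jar mahout-examples-0.9-SNAPSHOT-job.jar \
  org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver \
  --input ttseq/tfidf-vectors --output out1 \
  --numCols 2 --numRows 4 --cleansvd true --rank 5
```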
For me:
This PASSES: text-files -> output-seqdir -> output-seq2sparse-normalized
This FAILS: text-files -> output-seqdir -> output-seq2sparse -> output-seq2sparse-normalized
More details: I am seeing this error in a different situation.
Create sequence files
$ mahout seqdirectory -i /data/lda/text-files/ -o /data/lda/output-seqdir -c UTF-8
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 20:47:25 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/data/lda/ohsumed_full_txt/ohsumed_full_txt/], --keyPrefix=[], --output=[/data/lda/output], --startPhase=[0], --tempDir=[temp]}
14/03/24 20:57:20 INFO driver.MahoutDriver: Program took 594764 ms (Minutes: 9.912733333333334)
Convert sequence files to sparse vectors. Use TFIDF by default.
$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse/ -ow
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:00:08 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
14/03/24 21:00:10 INFO input.FileInputFormat: Total input paths to process : 1
14/03/24 21:00:11 INFO mapred.JobClient: Running job: job_201403241418_0001
.....
14/03/24 21:02:51 INFO driver.MahoutDriver: Program took 162906 ms (Minutes: 2.7151)
The following command fails (using /data/lda/output-seq2sparse as input):
$ mahout seq2sparse -i /data/lda/output-seq2sparse -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 -nr 5
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/data/lda/output-seq2sparse/df-count/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
....SKIPPED....
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
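This failure is consistent with seq2sparse expecting its input to be the raw <Text,Text> documents that seqdirectory produces, not an already-vectorized directory. One way to confirm what a directory actually contains is Mahout's seqdumper utility (the chunk-0 and part-r-00000 file names below are typical of this layout but may differ in your run):

```shell
# dump a few records to see the key/value classes and contents
mahout seqdumper -i /data/lda/output-seqdir/chunk-0 | head -n 20
mahout seqdumper -i /data/lda/output-seq2sparse/tfidf-vectors/part-r-00000 | head -n 20
```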
However, this works just fine (using /data/lda/output-seqdir as input):
$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 -nr 5
Running on hadoop, using .../hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ..../mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:35:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 2
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 50.0
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 5
14/03/24 21:35:57 INFO input.FileInputFormat: Total input paths to process : 1
...SKIPPED...
14/03/24 21:45:11 INFO common.HadoopUtil: Deleting /data/lda/output-seq2sparse-normalized/partial-vectors-0
14/03/24 21:45:11 INFO driver.MahoutDriver: Program took 556420 ms (Minutes: 9.273666666666667)

Related

Nextflow Docker problem -- permission problem with Docker? Not able to run pomoxis container

I'm new to Nextflow and Docker containers.
I am trying to de novo assemble some reads and map reads to a reference genome, but I keep getting the following error:
shaun@shaun-HP-Z6-G4-Workstation:~/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble$ nextflow pomoxis_map_assemble_nextflow.nf
N E X T F L O W ~ version 22.04.5
Launching `pomoxis_map_assemble_nextflow.nf` [goofy_hoover] DSL1 - revision: 6e4be1e0bd
executor > local (1)
executor > local (1)
[cf/bc2b69] process > pomoxis (1) [100%] 1 of 1, failed: 1 ✘
WARN: There's no process matching config selector: fastqc
WARN: There's no process matching config selector: porechop
WARN: There's no process matching config selector: bioawk
WARN: There's no process matching config selector: fastqconvert
WARN: There's no process matching config selector: blast_raw
Error executing process > 'pomoxis (1)'
Caused by:
Process `pomoxis (1)` terminated with an error exit status (127)
Command executed:
mini_align -i output.fastq -r /home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble/ref/*.fasta -o results -p > output_test_final.fa
Command exit status:
127
Command output:
(empty)
Command error:
.command.sh: line 2: mini_align: command not found
Work dir:
/home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble/work/cf/bc2b696aea0863d76c2c9221315c39
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
The WARNs appear because I use a master config file for all of my .nf scripts.
Here is the logfile from the above error:
Oct.-12 10:33:04.759 [main] DEBUG nextflow.cli.Launcher - $> nextflow pomoxis_map_assemble_nextflow.nf
Oct.-12 10:33:04.822 [main] INFO nextflow.cli.CmdRun - N E X T F L O W ~ version 22.04.5
Oct.-12 10:33:04.837 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: /home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble/nextflow.config
Oct.-12 10:33:04.838 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble/nextflow.config
Oct.-12 10:33:04.856 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Oct.-12 10:33:05.414 [main] DEBUG nextflow.cli.CmdRun - Applied DSL=1 by probing script field
Oct.-12 10:33:05.428 [main] INFO nextflow.cli.CmdRun - Launching `pomoxis_map_assemble_nextflow.nf` [goofy_hoover] DSL1 - revision: 6e4be1e0bd
Oct.-12 10:33:05.439 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; plugins-dir=/home/shaun/.nextflow/plugins; core-plugins: nf-amazon#1.7.2,nf-azure#0.13.2,nf-console#1.0.3,nf-ga4gh#1.0.3,nf-google#1.1.4,nf-sqldb#0.4.0,nf-tower#1.4.0
Oct.-12 10:33:05.440 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[]
Oct.-12 10:33:05.449 [main] INFO org.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Oct.-12 10:33:05.449 [main] INFO org.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Oct.-12 10:33:05.452 [main] INFO org.pf4j.DefaultPluginManager - PF4J version 3.4.1 in 'deployment' mode
Oct.-12 10:33:05.478 [main] INFO org.pf4j.AbstractPluginManager - No plugins
Oct.-12 10:33:05.527 [main] DEBUG nextflow.Session - Session uuid: ddab5124-cc54-4c39-840d-6cc1bf664773
Oct.-12 10:33:05.527 [main] DEBUG nextflow.Session - Run name: goofy_hoover
Oct.-12 10:33:05.527 [main] DEBUG nextflow.Session - Executor pool size: 16
Oct.-12 10:33:05.547 [main] DEBUG nextflow.cli.CmdRun -
Version: 22.04.5 build 5708
Created: 15-07-2022 16:09 UTC (16-07-2022 01:39 ACDT)
System: Linux 5.13.0-52-generic
Runtime: Groovy 3.0.10 on OpenJDK 64-Bit Server VM 11.0.15+10-Ubuntu-0ubuntu0.21.10.1
Encoding: UTF-8 (UTF-8)
Process: 665521#shaun-HP-Z6-G4-Workstation [127.0.1.1]
CPUs: 16 - Mem: 62.5 GB (32.1 GB) - Swap: 2 GB (2 GB)
Oct.-12 10:33:05.564 [main] DEBUG nextflow.Session - Work-dir: /home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble/work [ext2/ext3]
Oct.-12 10:33:05.564 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble/bin
Oct.-12 10:33:05.573 [main] DEBUG nextflow.executor.ExecutorFactory - Extension executors providers=[]
Oct.-12 10:33:05.583 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Oct.-12 10:33:05.605 [main] DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.DefaultCacheFactory
Oct.-12 10:33:05.613 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 17; maxThreads: 1000
Oct.-12 10:33:05.700 [main] DEBUG nextflow.Session - Session start invoked
Oct.-12 10:33:05.706 [main] DEBUG nextflow.trace.TraceFileObserver - Flow starting -- trace file: /home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble/pipeline_trace.txt
Oct.-12 10:33:05.921 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Oct.-12 10:33:05.969 [PathVisitor-1] DEBUG nextflow.file.PathVisitor - files for syntax: glob; folder: /home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble/; pattern: *.fastq; options: [:]
Oct.-12 10:33:05.998 [PathVisitor-1] DEBUG nextflow.file.PathVisitor - files for syntax: glob; folder: /home/shaun/nextflow_pipelines/pipelines/pomoxis_map_assemble/ref/; pattern: *.fasta; options: [:]
Oct.-12 10:33:06.048 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: null
Oct.-12 10:33:06.048 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'local'
Oct.-12 10:33:06.052 [main] DEBUG nextflow.executor.Executor - [warm up] executor > local
Oct.-12 10:33:06.056 [main] DEBUG n.processor.LocalPollingMonitor - Creating local task monitor for executor 'local' > cpus=16; memory=62.5 GB; capacity=16; pollInterval=100ms; dumpInterval=5m
Oct.-12 10:33:06.121 [main] DEBUG nextflow.Session - Workflow process names [dsl1]: pomoxis
Oct.-12 10:33:06.133 [main] WARN nextflow.Session - There's no process matching config selector: fastqc
Oct.-12 10:33:06.134 [main] WARN nextflow.Session - There's no process matching config selector: porechop
Oct.-12 10:33:06.135 [main] WARN nextflow.Session - There's no process matching config selector: bioawk
Oct.-12 10:33:06.135 [main] WARN nextflow.Session - There's no process matching config selector: fastqconvert
Oct.-12 10:33:06.135 [main] WARN nextflow.Session - There's no process matching config selector: blast_raw
Oct.-12 10:33:06.135 [main] DEBUG nextflow.script.ScriptRunner - > Await termination
Oct.-12 10:33:06.135 [main] DEBUG nextflow.Session - Session await
Oct.-12 10:33:06.344 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
Oct.-12 10:33:06.349 [Task submitter] INFO nextflow.Session - [cf/bc2b69] Submitted process > pomoxis (1)
Oct.-12 10:33:06.457 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 1; name: pomoxis (1); status: COMPLETED; exit: 127; error: -; workDir: /home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble/work/cf/bc2b696aea0863d76c2c9221315c39]
Oct.-12 10:33:06.491 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'pomoxis (1)'
Caused by:
Process `pomoxis (1)` terminated with an error exit status (127)
Command executed:
mini_align -i output.fastq -r /home/shaun/nextflow_pipelines/pipelines/pomoxis_map_assemble/ref/*.fasta -o results -p > output_test_final.fa
Command exit status:
127
Command output:
(empty)
Command error:
.command.sh: line 2: mini_align: command not found
Work dir:
/home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble/work/cf/bc2b696aea0863d76c2c9221315c39
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
Oct.-12 10:33:06.499 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Process `pomoxis (1)` terminated with an error exit status (127)
Oct.-12 10:33:06.503 [main] DEBUG nextflow.Session - Session await > all process finished
Oct.-12 10:33:06.520 [main] DEBUG nextflow.Session - Session await > all barriers passed
Oct.-12 10:33:06.540 [main] DEBUG nextflow.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=0; failedCount=1; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=0ms; failedDuration=1.2s; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=1; peakCpus=16; peakMemory=60 GB; ]
Oct.-12 10:33:06.540 [main] DEBUG nextflow.trace.TraceFileObserver - Flow completing -- flushing trace file
Oct.-12 10:33:06.715 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Oct.-12 10:33:06.728 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye
I have the same error when I try to use pomoxis to de novo assemble the reads:
shaun@shaun-HP-Z6-G4-Workstation:~/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble$ nextflow pomoxis_denovo_nextflow.nf
N E X T F L O W ~ version 22.04.5
Launching `pomoxis_denovo_nextflow.nf` [intergalactic_sammet] DSL1 - revision: 1927bae9ab
executor > local (1)
[fb/3868bc] process > pomoxis (1) [100%] 1 of 1, failed: 1 ✘
WARN: There's no process matching config selector: fastqc
WARN: There's no process matching config selector: porechop
WARN: There's no process matching config selector: bioawk
WARN: There's no process matching config selector: fastqconvert
WARN: There's no process matching config selector: blast_raw
Error executing process > 'pomoxis (1)'
Caused by:
Process `pomoxis (1)` terminated with an error exit status (127)
Command executed:
mini_assemble -i output.fastq -o results -p > output_test_final.fa
Command exit status:
127
Command output:
(empty)
Command error:
.command.sh: line 2: mini_assemble: command not found
Work dir:
/home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_map_assemble/work/fb/3868bc97cabfa0d2934f4344c13c89
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
This is the output of .command.sh:
#!/bin/bash -ue
mini_assemble -i output.fastq -o results -p > output_test_final.fa
I have used the flags for pomoxis from here: https://nanoporetech.github.io/pomoxis/programs.html
Below are the two .nf files:
#!/usr/bin/env nextflow
//data_location
params.outdir = './results'
params.in = "$PWD/*.fastq"
datasetA = Channel.fromPath(params.in)
params.ref = "$PWD/ref/*.fasta"
datasetB = Channel.fromPath(params.ref)
//map and assemble
process pomoxis {
publishDir "${params.outdir}", mode:'copy'
input:
path (c) from datasetA
output:
path "${c.baseName}_test_final.fa" into mapped_ch
script:
"""
mini_align -i $c -r $params.ref -o results -p > ${c.baseName}_test_final.fa
"""
}
#!/usr/bin/env nextflow
//data_location
params.outdir = './results'
params.in = "$PWD/*.fastq"
dataset = Channel.fromPath(params.in)
//de novo only
process pomoxis {
publishDir "${params.outdir}", mode:'copy'
input:
path (c) from dataset
output:
path "${c.baseName}_test_final.fa" into mapped_ch
script:
"""
mini_assemble -i $c -o results -p > ${c.baseName}_test_final.fa
"""
}
Here is the config file. I have tried multiple Docker containers from Docker Hub, but I get the same error message.
//resume = true
process {
cpus = 16
accelerator = 'Quadro-RTX-5000'
memory = 60.GB
}
trace {
enabled = true
file = 'pipeline_trace.txt'
fields = 'task_id,hash,process,name,status,exit,module,container,cpus,time,disk,memory,attempt,submit,start,complete,duration,realtime,queue,%cpu,%mem'
}
docker {
enabled = true
temp = 'auto'
runOption = '--user root'
}
params {
nt_db_20221011 = '/home/shaun/blast/nt_db_20221011/nt'
}
process {
withName:fastqc {container = 'staphb/fastqc:latest' }
withName:porechop {container = 'quay.io/biocontainers/porechop:0.2.3_seqan2.1.1--py36h2d50403_3' }
withName:bioawk {container = 'wslhbio/bioawk:1.0-wslh-signed' }
withName:fastqconvert{container = 'staphb/seqtk:1.3' }
withName:blast_raw {container = 'staphb/blast:2.13.0' }
withname:pomoxis {container = 'dpirdmk/pomoxis:0.1.11' }
}
Update:
After correcting the syntax error, I changed the container as suggested and added the -h flag to the script section.
The following is the output from this:
shaun@shaun-HP-Z6-G4-Workstation:~/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_align$ nextflow pomoxis_align_nextflow.nf
N E X T F L O W ~ version 22.04.5
Launching `pomoxis_align_nextflow.nf` [voluminous_solvay] DSL1 - revision: 4dca1d1d46
executor > local (1)
[55/4dad0b] process > pomoxis (1) [ 0%] 0 of 1
WARN: There's no process matching config selector: fastqc
WARN: There's no process matching config selector: porechop
WARN: There's no process matching config selector: bioawk
WARN: There's no process matching config selector: fastqconvert
WARN: There's no process matching config selector: blast_raw
Error executing process > 'pomoxis (1)'
Caused by:
Missing output file(s) `output` expected by process `pomoxis (1)`
Command executed:
mini_align -h
Command exit status:
0
Command output:
(empty)
executor > local (1)
[55/4dad0b] process > pomoxis (1) [100%] 1 of 1, failed: 1 ✘
WARN: There's no process matching config selector: fastqc
WARN: There's no process matching config selector: porechop
WARN: There's no process matching config selector: bioawk
WARN: There's no process matching config selector: fastqconvert
WARN: There's no process matching config selector: blast_raw
Error executing process > 'pomoxis (1)'
Caused by:
Missing output file(s) `output` expected by process `pomoxis (1)`
Command executed:
mini_align -h
Command exit status:
0
Command output:
(empty)
Command error:
mini_align [-h] -r <reference> -i <fastq>
Align fastq/a formatted reads to a genome using minimap2.
-h show this help text.
-r reference, should be a fasta file. If correspondng minimap indices
do not exist they will be created. (required).
-i fastq/a input reads (required).
-I split index every ~NUM input bases (default: 16G, this is larger
than the usual minimap2 default).
-d set the minimap2 preset, e.g. map-ont, asm5, asm10, asm20 [default: map-ont]
-f force recreation of index file.
-a aggressively extend gaps (sets -A1 -B2 -O2 -E1 for minimap2).
-P filter to only primary alignments (i.e. run samtools view -F 2308).
Deprecated: this filter is now default and can be disabled with -A.
-y filter to primary and supplementary alignments (i.e. run samtools view -F 260)
-A do not filter alignments, output all.
-n sort bam by read name.
-c chunk size. Input reads/contigs will be broken into chunks
prior to alignment.
-t alignment threads (default: 1).
-p output file prefix (default: reads).
-m fill MD tag.
-s fill cs(=long) tag.
-X only create reference index files.
-x log all commands before running.
-M match score
-S mismatch score
-O open gap penalty
-E extend gap penalty.
Work dir:
/home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_align/work/55/4dad0b1df7105e31362f58b0c67f1f
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Although there is an error, I'm assuming this worked, as I can see the help menu?
I then added flags to the script section of the .nf as per the Pomoxis website.
#!/usr/bin/env nextflow
//data_location
params.outdir = './results'
params.in = "$PWD/*.fastq"
datasetA = Channel.fromPath(params.in)
params.ref = "$PWD/SLCMV.fasta"
datasetB = Channel.fromPath(params.ref)
//map and assemble; input fastq only
//output; pomoxis is very particular about names, minimap can output
//.sam or .paf
process pomoxis {
publishDir "${params.outdir}", mode:'copy'
input:
path (c) from datasetA
path (d) from datasetB
output:
path "${c.simpleName}" into mapped_ch
script:
"""
mini_align -r $d -i $c -p ${c.simpleName}
"""
}
The results are below:
shaun@shaun-HP-Z6-G4-Workstation:~/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_align$ nextflow pomoxis_align_nextflow.nf
N E X T F L O W ~ version 22.04.5
Launching `pomoxis_align_nextflow.nf` [determined_bassi] DSL1 - revision: 1ad3e75c1b
executor > local (1)
[50/860f9f] process > pomoxis (1) [ 0%] 0 of 1
WARN: There's no process matching config selector: fastqc
WARN: There's no process matching config selector: porechop
WARN: There's no process matching config selector: bioawk
WARN: There's no process matching config selector: fastqconvert
WARN: There's no process matching config selector: blast_raw
Error executing process > 'pomoxis (1)'
Caused by:
Missing output file(s) `output` expected by process `pomoxis (1)`
executor > local (1)
[50/860f9f] process > pomoxis (1) [100%] 1 of 1, failed: 1 ✘
WARN: There's no process matching config selector: fastqc
WARN: There's no process matching config selector: porechop
WARN: There's no process matching config selector: bioawk
WARN: There's no process matching config selector: fastqconvert
WARN: There's no process matching config selector: blast_raw
Error executing process > 'pomoxis (1)'
Caused by:
Missing output file(s) `output` expected by process `pomoxis (1)`
Command executed:
mini_align -r SLCMV.fasta -i output.fastq -p output
Command exit status:
0
Command output:
(empty)
Command error:
Creating fai index file SLCMV.fasta.fai
Creating mmi index file SLCMV.fasta.map-ont.mmi
[M::mm_idx_gen::0.003*2.10] collected minimizers
[M::mm_idx_gen::0.005*2.42] sorted minimizers
[M::main::0.009*1.77] loaded/built the index for 1 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.009*1.74] distinct minimizers: 522 (100.00% are singletons); average occurrences: 1.000; average spacing: 5.282; total length: 2757
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -I 16G -x map-ont -d SLCMV.fasta.map-ont.mmi SLCMV.fasta
[M::main] Real time: 0.010 sec; CPU: 0.017 sec; Peak RSS: 0.003 GB
[M::main::0.005*1.39] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.005*1.37] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.005*1.35] distinct minimizers: 522 (100.00% are singletons); average occurrences: 1.000; average spacing: 5.282; total length: 2757
[M::worker_pipeline::9.159*1.00] mapped 87369 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -x map-ont --secondary=no -L -t 1 -a SLCMV.fasta.map-ont.mmi output.fastq
[M::main] Real time: 9.182 sec; CPU: 9.145 sec; Peak RSS: 0.299 GB
Work dir:
/home/shaun/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_align/work/50/860f9ffe37f7a17a4d5e216f67b512
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
I'm not sure how to fix this; I've tried changing the output name, but this hasn't helped.
Interestingly, when I run the mini_assemble .nf:
#!/usr/bin/env nextflow
//data_location
params.outdir = './results'
params.in = "$PWD/*.fastq"
dataset = Channel.fromPath(params.in)
//de novo only; input is fastq
//output; pomoxis is very particular about names, the name that it outputs is
//prefix_test_final.fa where prefix can be anything and it will add _test_final.fa
//to the end of the file
process pomoxis {
tag "$c"
publishDir "${params.outdir}", mode:'copy'
input:
path (c) from dataset
output:
path ("${c.simpleName}_test_final.fa") into mapped_ch
script:
"""
mini_assemble -i $c -p ${c.simpleName}
"""
}
I get the following error:
shaun@shaun-HP-Z6-G4-Workstation:~/nextflow_pipelines/pipelines/WA_MomicaKehoe/Asad/nf_pipeline/pomoxis_assemble$ nextflow pomoxis_assemble_nextflow.nf
N E X T F L O W ~ version 22.04.5
Launching `pomoxis_assemble_nextflow.nf` [intergalactic_fermi] DSL1 - revision: ce7a466307
executor > local (1)
[04/28f5e6] process > pomoxis (output.fastq) [ 0%] 0 of 1
Error executing process > 'pomoxis (output.fastq)'
Caused by:
Missing output file(s) `output_test_final.fa` expected by process `pomoxis (output.fastq)`
Command executed:
mini_assemble -i output.fastq -p output
Command exit status:
0
Command output:
Copying FASTX input to workspace: output.fastq > assm/output.fa.gz
Skipped adapter trimming.
Skipped pre-assembly correction.
Overlapping reads...
Assembling graph...
Running racon read shuffle 1...
Running round 1 consensus...
Running round 2 consensus...
Running round 3 consensus...
Running round 4 consensus...
Waiting for cleanup.
Final assembly written to assm/output_final.fa. Have a nice day.
Command error:
[M::mm_idx_stat::0.065*1.04] distinct minimizers: 69519 (92.83% are singletons); average occurrences: 1.101; average spacing: 5.337; total length: 408518
[M::worker_pipeline::9.872*0.97] mapped 87369 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -L -K 500M -t 1 racon_1_3.fa.gz output.fa.gz
[M::main] Real time: 9.880 sec; CPU: 9.620 sec; Peak RSS: 0.165 GB
[racon::Polisher::initialize] loaded target sequences 0.011974 s
[racon::Polisher::initialize] loaded sequences 2.047803 s
[racon::Polisher::initialize] loaded overlaps 0.106485 s
[racon::Polisher::initialize] aligning overlaps [=> ] 1.252726 s
[racon::Polisher::initialize] aligning overlaps [==> ] 1.827141 s
[racon::Polisher::initialize] aligning overlaps [===> ] 2.413558 s
[racon::Polisher::initialize] aligning overlaps [====> ] 2.974846 s
[racon::Polisher::initialize] aligning overlaps [=====> ] 3.609222 s
[racon::Polisher::initialize] aligning overlaps [======> ] 4.263570 s
[racon::Polisher::initialize] aligning overlaps [=======> ] 4.934294 s
[racon::Polisher::initialize] aligning overlaps [========> ] 5.554714 s
[racon::Polisher::initialize] aligning overlaps [=========> ] 6.164404 s
[racon::Polisher::initialize] aligning overlaps [==========> ] 6.746845 s
executor > local (1)
[04/28f5e6] process > pomoxis (output.fastq) [100%] 1 of 1, failed: 1 ✘
Error executing process > 'pomoxis (output.fastq)'
Caused by:
Missing output file(s) `output_test_final.fa` expected by process `pomoxis (output.fastq)`
Command executed:
mini_assemble -i output.fastq -p output
Command exit status:
0
Command output:
Copying FASTX input to workspace: output.fastq > assm/output.fa.gz
Skipped adapter trimming.
Skipped pre-assembly correction.
Overlapping reads...
Assembling graph...
Running racon read shuffle 1...
Running round 1 consensus...
Running round 2 consensus...
Running round 3 consensus...
Running round 4 consensus...
Waiting for cleanup.
Final assembly written to assm/output_final.fa. Have a nice day.
Command error:
[M::mm_idx_stat::0.065*1.04] distinct minimizers: 69519 (92.83% are singletons); average occurrences: 1.101; average spacing: 5.337; total length: 408518
[M::worker_pipeline::9.872*0.97] mapped 87369 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -L -K 500M -t 1 racon_1_3.fa.gz output.fa.gz
[M::main] Real time: 9.880 sec; CPU: 9.620 sec; Peak RSS: 0.165 GB
[racon::Polisher::initialize] loaded target sequences 0.011974 s
[racon::Polisher::initialize] loaded sequences 2.047803 s
[racon::Polisher::initialize] loaded overlaps 0.106485 s
[racon::Polisher::initialize] aligning overlaps [=> ] 1.252726 s
[racon::Polisher::initialize] aligning overlaps [==> ] 1.827141 s
[racon::Polisher::initialize] aligning overlaps [===> ] 2.413558 s
[racon::Polisher::initialize] aligning overlaps [====> ] 2.974846 s
[racon::Polisher::initialize] aligning overlaps [=====> ] 3.609222 s
[racon::Polisher::initialize] aligning overlaps [======> ] 4.263570 s
[racon::Polisher::initialize] aligning overlaps [=======> ] 4.934294 s
[racon::Polisher::initialize] aligning overlaps [========> ] 5.554714 s
[racon::Polisher::initialize] aligning overlaps [=========> ] 6.164404 s
[racon::Polisher::initialize] aligning overlaps [==========> ] 6.746845 s
[racon::Polisher::initialize] aligning overlaps [===========> ] 7.282895 s
[racon::Polisher::initialize] aligning overlaps [============> ] 7.878856 s
[racon::Polisher::initialize] aligning overlaps [=============> ] 8.459672 s
[racon::Polisher::initialize] aligning overlaps [==============> ] 8.984196 s
[racon::Polisher::initialize] aligning overlaps [===============> ] 9.558089 s
[racon::Polisher::initialize] aligning overlaps [================> ] 10.072586 s
[racon::Polisher::initialize] aligning overlaps [=================> ] 10.648603 s
[racon::Polisher::initialize] aligning overlaps [==================> ] 11.220553 s
[racon::Polisher::initialize] aligning overlaps [===================>] 11.753881 s
[racon::Polisher::initialize] aligning overlaps [====================] 12.364460 s
[racon::Polisher::initialize] transformed data into windows 0.018851 s
[racon::Polisher::polish] generating consensus [=> ] 0.515508 s
[racon::Polisher::polish] generating consensus [==> ] 1.077083 s
[racon::Polisher::polish] generating consensus [===> ] 1.561330 s
[racon::Polisher::polish] generating consensus [====> ] 1.967063 s
[racon::Polisher::polish] generating consensus [=====> ] 2.574379 s
[racon::Polisher::polish] generating consensus [======> ] 3.037187 s
[racon::Polisher::polish] generating consensus [=======> ] 3.740370 s
[racon::Polisher::polish] generating consensus [========> ] 20.112573 s
[racon::Polisher::polish] generating consensus [=========> ] 46.718507 s
[racon::Polisher::polish] generating consensus [==========> ] 48.913841 s
[racon::Polisher::polish] generating consensus [===========> ] 52.859126 s
[racon::Polisher::polish] generating consensus [============> ] 57.495382 s
[racon::Polisher::polish] generating consensus [=============> ] 74.111321 s
[racon::Polisher::polish] generating consensus [==============> ] 79.382051 s
[racon::Polisher::polish] generating consensus [===============> ] 113.129291 s
[racon::Polisher::polish] generating consensus [================> ] 115.971094 s
[racon::Polisher::polish] generating consensus [=================> ] 123.322513 s
[racon::Polisher::polish] generating consensus [==================> ] 128.155934 s
[racon::Polisher::polish] generating consensus [===================>] 130.509134 s
[racon::Polisher::polish] generating consensus [====================] 131.095442 s
[racon::Polisher::] total = 145.794273 s
Work dir:
/home/shaun/nextflow_pipelines/pipelines/WA_MomicaKehoe/Asad/nf_pipeline/pomoxis_assemble/work/04/28f5e6eed13d3fc986739e287594d9
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
However, the file that I need is in:
shaun@shaun-HP-Z6-G4-Workstation:~/nextflow_pipelines/pipelines/nf_pipeline/pomoxis_assemble/work/04/28f5e6eed13d3fc986739e287594d9/assm$ cat output_final.fa
When I cat this file, I can see that there are contigs in the file, as expected.
I can see that pomoxis is expecting output_test_final.fa, which is indeed in the work dir, but pomoxis can't see it or use it?
Do you have any suggestions on why these errors are occurring?
I appreciate your time.
Command error:
.command.sh: line 2: mini_assemble: command not found
You will get this error when you try to run a command outside of a container and the executable does not exist in your $PATH. Usually we see this when we forget to add docker.enabled = true (or singularity.enabled = true if using Singularity) to our nextflow.config. However, in this case it's just caused by a typo in your config file: the withName config selector must be camel case. You will also get a 'command not found' error using the chosen container. If we instead use the pomoxis biocontainer, we get the expected result. For example:
Contents of nextflow.config:
docker {
    enabled = true
}

process {
    withName: pomoxis {
        container = 'quay.io/biocontainers/pomoxis:0.3.10--pyhdfd78af_0'
    }
}
Contents of map_and_assembly.nf:
process pomoxis {
    """
    mini_align -h
    """
}
Results:
$ nextflow run map_and_assembly.nf -dsl1
N E X T F L O W ~ version 22.04.4
Launching `map_and_assembly.nf` [loving_thompson] DSL1 - revision: 6efbc53bf8
executor > local (1)
[22/23c938] process > pomoxis [100%] 1 of 1 ✔
Completed at: 13-Oct-2022 01:05:27
Duration : 2m 46s
CPU hours : (a few seconds)
Succeeded : 1
$ cat work/22/23c938ca9312037ee10087ba5d48a2/.command.err
Unable to find image 'quay.io/biocontainers/pomoxis:0.3.10--pyhdfd78af_0' locally
0.3.10--pyhdfd78af_0: Pulling from biocontainers/pomoxis
c1a16a04cedd: Already exists
4ca545ee6d5d: Already exists
af0d0c971daf: Pulling fs layer
af0d0c971daf: Verifying Checksum
af0d0c971daf: Download complete
af0d0c971daf: Pull complete
Digest: sha256:b42d95b742be3dc8333f57892c4aa2cc5cd739e796b33c7f310696856dcdea4d
Status: Downloaded newer image for quay.io/biocontainers/pomoxis:0.3.10--pyhdfd78af_0
mini_align [-h] -r <reference> -i <fastq>
Align fastq/a formatted reads to a genome using minimap2.
-h show this help text.
-r reference, should be a fasta file. If correspondng minimap indices
do not exist they will be created. (required).
-i fastq/a input reads (required).
-I split index every ~NUM input bases (default: 16G, this is larger
than the usual minimap2 default).
-d set the minimap2 preset, e.g. map-ont, asm5, asm10, asm20 [default: map-ont]
-f force recreation of index file.
-a aggressively extend gaps (sets -A1 -B2 -O2 -E1 for minimap2).
-P filter to only primary alignments (i.e. run samtools view -F 2308).
Deprecated: this filter is now default and can be disabled with -A.
-y filter to primary and supplementary alignments (i.e. run samtools view -F 260)
-A do not filter alignments, output all.
-n sort bam by read name.
-c chunk size. Input reads/contigs will be broken into chunks
prior to alignment.
-t alignment threads (default: 1).
-p output file prefix (default: reads).
-m fill MD tag.
-s fill cs(=long) tag.
-X only create reference index files.
-x log all commands before running.
-M match score
-S mismatch score
-O open gap penalty
-E extend gap penalty.
The problem with using the dpirdmk/pomoxis:0.1.11 container is that it specifies an alternate entrypoint which makes it difficult to use out of the box without either modifying the command to be run or by supplying some additional Docker configuration:
$ docker run --rm dpirdmk/pomoxis:0.1.11 bash -c 'cat /init.sh'
#!/bin/bash
. /apps/pomoxis/venv/bin/activate
exec "$@"
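As for the "additional Docker configuration" mentioned above: one way to neutralize a container's custom entrypoint (an untested sketch, assuming Docker is the container engine) is to override it in your nextflow.config:

```groovy
docker {
    enabled = true

    // Override the image's custom ENTRYPOINT so that Nextflow can run
    // .command.sh with its usual `bash -ue` invocation.
    runOptions = '--entrypoint ""'
}
```

Note this applies the override to every Docker process in the run, so it may be overkill if only one container misbehaves.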
Reply to follow-up questions:
You receive the Missing output file(s) `output` expected by process error when the command completes successfully (i.e. exit status zero) but Nextflow (not pomoxis) couldn't find one or more output file(s) in the working directory as specified in the output declaration. I think what you want is the following. I've taken the liberty of renaming some of the variables to make things a little more clear. The code below is of course untested:
params.outdir = './results'
params.ref_fasta = "SLCMV.fasta"
params.reads = "*.fastq"

reads = Channel.fromPath( params.reads )
ref_fasta = file( params.ref_fasta )

process mini_align {

    tag { fastq.name }

    publishDir "${params.outdir}/mini_align", mode: 'copy'

    input:
    path fastq from reads
    path ref_fasta

    output:
    path "${fastq.simpleName}.bam{,.bai}" into mapped_ch

    script:
    """
    mini_align \\
        -r "${ref_fasta}" \\
        -i "${fastq}" \\
        -p "${fastq.simpleName}"
    """
}
params.outdir = './results'
params.reads = "*.fastq"

reads = Channel.fromPath( params.reads )

process mini_assemble {

    tag { fastq.name }

    publishDir "${params.outdir}/mini_assemble", mode: 'copy'

    input:
    path fastq from reads

    output:
    path "assm/${fastq.simpleName}_final.fa" into mapped_ch

    script:
    """
    mini_assemble \\
        -i "${fastq}" \\
        -p "${fastq.simpleName}"
    """
}
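Assuming the snippet above is saved as, say, map_and_assembly.nf (a hypothetical filename, matching the one earlier in this thread), it could be run with something like:

```
nextflow run map_and_assembly.nf -dsl1 --reads '*.fastq'
```

Quoting the glob is deliberate: it lets Nextflow expand the pattern rather than the shell, so all matching FASTQ files reach the reads channel.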

gcov generating correct output but gcovr does not

Running through the setup example from gcovr here: https://gcovr.com/en/stable/guide.html#getting-started I can build the file, and I see the following output from running gcovr -r .:
% gcovr -r .
------------------------------------------------------------------------------
GCC Code Coverage Report
Directory: .
------------------------------------------------------------------------------
File Lines Exec Cover Missing
------------------------------------------------------------------------------
example.cpp 0 0 --%
------------------------------------------------------------------------------
TOTAL 0 0 --%
------------------------------------------------------------------------------
If I run gcov example.cpp directly I can see that the generated .gcov data is correct:
% gcov example.cpp
File 'example.cpp'
Lines executed:87.50% of 8
Creating 'example.cpp.gcov'
I am unsure where the disconnect between this gcov output and the gcovr interpretation of it is.
I have tried downgrading to an older gcovr version, running the command on other projects, and switching python versions, but have not seen any different behavior.
My gcov and gcc are from the Xcode command line tools. gcovr was pip-installed (within pyenv, with Python 3.8.5).
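One thing worth checking (an assumption on my part, not confirmed by the output above): which gcov binary gcovr is actually invoking. The gcov shipped with the Xcode command line tools is LLVM's, and its report format can differ from GNU gcov's, which may be the disconnect here. gcovr lets you point at a specific gcov with --gcov-executable, e.g.:

```
# See which gcov is first in PATH and what it really is
gcov --version

# If the binary was built with a GNU compiler (gcc-12 here is a
# hypothetical Homebrew version), use the matching GNU gcov
gcovr -r . --gcov-executable gcov-12
```

The general rule is that the gcov used to read the .gcda/.gcno files must come from the same toolchain (and version) as the compiler that produced them.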
Edit: adding verbose output:
gcovr -r . -v
Filters for --root: (1)
- re.compile('^/Test/')
Filters for --filter: (1)
- DirectoryPrefixFilter(/Test/)
Filters for --exclude: (0)
Filters for --gcov-filter: (1)
- AlwaysMatchFilter()
Filters for --gcov-exclude: (0)
Filters for --exclude-directories: (0)
Scanning directory . for gcda/gcno files...
Found 2 files (and will process 1)
Pool started with 1 threads
Processing file: /Test/example.gcda
Running gcov: 'gcov /Test/example.gcda --branch-counts --branch-probabilities --preserve-paths --object-directory /Test' in '/var/folders/bc/20q4mkss6457skh36yzgm2bw0000gp/T/tmpo4mr2wh4'
Finding source file corresponding to a gcov data file
currdir /Test
gcov_fname /var/folders/bc/20q4mkss6457skh36yzgm2bw0000gp/T/tmpo4mr2wh4/example.cpp.gcov
[' -', ' 0', 'Source', 'example.cpp\n']
source_fname /Test/example.gcda
root /Test
fname /Test/example.cpp
Parsing coverage data for file /Test/example.cpp
Gathered coveraged data for 1 files
------------------------------------------------------------------------------
GCC Code Coverage Report
Directory: .
------------------------------------------------------------------------------
File Lines Exec Cover Missing
------------------------------------------------------------------------------
example.cpp 0 0 --%
------------------------------------------------------------------------------
TOTAL 0 0 --%
------------------------------------------------------------------------------

OpenCV Error: Assertion failed (_img.cols == winSize.width)

So I've been trying to run the command
opencv_traincascade -data HandsData -vec hands.vec -bg HandsNeg.txt -numPos 3641 -numNeg 2578 -numStages 20 -w 27 -h 48 -mode ALL -minHitRate 0.999 -maxFalseAlarmRate 0.5 -precalcValBufSize 1024 -precalcIdxBufSize 1024
and I get the error
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 3641 : 3641
OpenCV Error: Assertion failed (_img.cols == winSize.width) in get, file /builddir/build/BUILD/OpenCV-2.0.0/apps/traincascade/imagestorage.cpp, line 86
terminate called after throwing an instance of 'cv::Exception'
Aborted
I've seen suggestions to change the number of positive images; using 1043 instead still gives the same error. I've also seen suggestions to edit the source code. The problem is that I installed OpenCV with yum and would prefer not to rebuild from source.
sudo find / -name imagestorage.cpp
turns up nothing.
I'm at a complete loss of what to do.
Additional info: Steps I took to get to this point
I created everything from some videos using ffmpeg. These were from a phone and VLC lists the info as
Resolution: 1920x1090
Display resolution: 1920x1080
The ffmpeg command was (replacing input/output with respective videos and locations)
steven ~/computer_vision $ ffmpeg -i videos/Not\ hands\ stuff.mp4 -y -r 40 -s 27x48 -f image2 NotHandsorFists/Negs-%4d.png
Files in Hands.txt are of the form
steven ~/computer_vision $ cat Hands.txt | head -n 1
Hands/LeftHand-0001.png 1 0 0 27 48
I compiled the vec file with
steven ~/computer_vision $ opencv_createsamples -info Hands.txt -num 3641 -w 27 -h 48 -vec hands.vec
The negative file is in the form
steven ~/computer_vision $ cat HandsNeg.txt | head -n 1
Fists/LeftFist-0001.png
and the working directory is
steven ~/computer_vision $ ls
Fists fists.txt Hands HandsData HandsNeg.txt Hands.txt hands.vec NotHandsorFists NotHandsorFists.txt videos
Edit:
I've tried changing png to jpg and bmp to get rid of the channels. No help.
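The assertion _img.cols == winSize.width suggests that at least one sample image does not match the -w 27 -h 48 window. A quick way to verify this without rebuilding OpenCV is to read each PNG's dimensions straight from its IHDR header. The sketch below is untested against your data; png_size and check_sizes are helper names I've made up, and the Hands.txt format is assumed from the listing above:

```python
import struct

def png_size(path):
    """Return (width, height) from a PNG file's IHDR chunk."""
    with open(path, "rb") as f:
        header = f.read(24)  # 8-byte signature + chunk length + b'IHDR' + w/h
    if header[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError(f"{path} is not a PNG file")
    return struct.unpack(">II", header[16:24])

def check_sizes(listing, width=27, height=48):
    """Print every entry in an opencv_createsamples info file whose
    image dimensions differ from the expected window size."""
    with open(listing) as f:
        for line in f:
            path = line.split()[0]
            size = png_size(path)
            if size != (width, height):
                print(path, size)

# check_sizes("Hands.txt")
# check_sizes("HandsNeg.txt")
```

Note that negative (background) images are allowed to be larger than the window, so a size mismatch only matters for the positives packed into the .vec file.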

No input clusters found in /user/mahout/cluster/part-randomSeed. Check your -c argument

My test.csv file:
==================
1,54,1341775056478
2,1568,1341775056478
1,1622,1341775056498
2,3136,1341775056498
1,3190,1341775056671
2,4704,1341775056671
1,4758,1341775056693
2,6272,1341775056693
1,6326,1341775056714
2,7840,1341775056714
1,7894,1341775056735
2,9408,1341775056735
1,9462,1341775056951
2,10976,1341775056951
1,11030,1341775056972
2,12544,1341775056972
1,12598,1341775056994
2,14112,1341775056994
1,14166,1341775057014
2,15680,1341775057014
1,15734,1341775057065
2,17248,1341775057065
1,17302,1341775057087
2,18816,1341775057087
1,18870,1341775057119
2,20384,1341775057119
....
....
I am trying to cluster this data using the Mahout k-means algorithm.
I followed these steps:
1) Create a sequence file from the test.csv file:
mahout seqdirectory -c UTF-8 -i /user/mahout/input/test.csv -o /user/sample/out_seq -chunk 64
2) Create a sparse vector from the sequence file:
mahout seq2sparse -i /user/mahout/out_seq/ -o /user/mahout/sparse_dir --maxDFPercent 85 --namedVector
3) Perform k-means clustering:
mahout kmeans -i /user/mahout/sparse_dir/tfidf-vectors/ -c /user/mahout/cluster -o /user/mahout/kmeans_out -dm org.apache.mahout.common.distance.CosineDistanceMeasure --maxIter 10 --numClusters 20 --ow --clustering
At step 3, I'm facing this error:
Exception in thread "main" java.lang.IllegalStateException: No input clusters found in /user/mahout/text/cluster/part-randomSeed. Check your -c argument.
at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:213)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:147)
....
....
How can I overcome this error? I did the clustering example successfully using the Reuters dataset, but with my dataset it shows this issue. Is there a problem with the dataset, or am I facing this error due to some other issue?
Can anyone advise me on this?
Thanks in advance.

Error while creating mahout model

I am training a Mahout classifier on my data.
I issued the following commands to create the Mahout model:
./bin/mahout seqdirectory -i /tmp/mahout-work-root/MyData-all -o /tmp/mahout-work-root/MyData-seq
./bin/mahout seq2sparse -i /tmp/mahout-work-root/MyData-seq -o /tmp/mahout-work-root/MyData-vectors -lnorm -nv -wt tfidf
./bin/mahout split -i /tmp/mahout-work-root/MyData-vectors/tfidf-vectors --trainingOutput /tmp/mahout-work-root/MyData-train-vectors --testOutput /tmp/mahout-work-root/MyData-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
./bin/mahout trainnb -i /tmp/mahout-work-root/Mydata-train-vectors -el -o /tmp/mahout-work-root/model -li /tmp/mahout-work-root/labelindex -ow
When I try to create the model using the trainnb command, I get the following exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:119) at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:152)
What could be the problem here?
Note: Original Example mentioned here works fine.
I think it might be a problem with how your training files are organized.
The files should be organized as follows:
MyData-all
  \classA
    -file1
    -file2
    -...
  \classB
    -filex
    -...
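A minimal sketch of that layout on disk (class and file names here are placeholders, not from your data):

```shell
# One sub-directory per class label. seqdirectory folds the directory
# name into each sequence-file key, which trainnb -el then uses to
# extract the labels.
mkdir -p MyData-all/classA MyData-all/classB
printf 'some training text\n' > MyData-all/classA/file1
printf 'other training text\n' > MyData-all/classB/filex
```

If all the training files sit directly under MyData-all with no class sub-directories, label extraction has nothing to split on, which would be consistent with the ArrayIndexOutOfBoundsException in writeLabelIndex.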
