snakemake: MissingOutputException within docker - docker

I am trying to run a pipeline within a docker using snakemake. I am having problem using the sortmerna tool to produce {sample}_merged_sorted_mRNA and {sample}_merged_sorted output from control_merged.fq and treated_merged.fq input files.
Here my Snakefile:
SAMPLES = ["control","treated"]
for smp in SAMPLES:
print("Sample " + smp + " will be processed")
rule final:
input:
expand('/output/{sample}_merged.fq', sample=SAMPLES),
expand('/output/{sample}_merged_sorted', sample=SAMPLES),
expand('/output/{sample}_merged_sorted_mRNA', sample=SAMPLES),
rule sortmerna:
input: '/output/{sample}_merged.fq',
output: merged_file='/output/{sample}_merged_sorted_mRNA', merged_sorted='/output/{sample}_merged_sorted',
message: """---SORTING---"""
shell:
'''
sortmerna --ref /usr/share/sortmerna/rRNA_databases/silva-bac-23s-id98.fasta,/ usr/share/sortmerna/rRNA_databases/index/silva-bac-23s-id98: --reads {input} --paired_in -a 16 --log --fastx --aligned {output.merged_file} --other {output.merged_sorted} -v
'''
When runnig this I get:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 57 of /input/Snakefile:
Missing files after 5 seconds:
/output/control_merged_sorted_mRNA
/output/control_merged_sorted
This might be due to filesystem latency. If that is the case, consider to increase the wait $ime with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /input/.snakemake/log/2018-11-05T091643.911334.snakemake.log
I tried to increase the latency with --latency-wait but I get the same result. Funny thing is that two output files control_merged_sorted_mRNA.fq and control_merged_sorted.fq are produced but the program fails and exits. The version of snakemake is 5.3.0. Any help?

snakemake fails because the outputs described by the rule sortmerna are not produced. This is not a latency problem, it is a problem with your outputs.
Your rule sortmerna expects as output:
/output/control_merged_sorted_mRNA
and
/output/control_merged_sorted
but the program you are using (I know nothing about sortmerna) is apparently producing
/output/control_merged_sorted_mRNA.fq
and
/output/control_merged_sorted.fq
Make sure that when you specify the options --aligned and --other on the command line of your program, it should be the real names of the files produced or if it is only the basename and the program will add a suffix .fq. If you are in the latter case, I suggest you use:
rule final:
input:
expand('/output/{sample}_merged.fq', sample=SAMPLES),
expand('/output/{sample}_merged_sorted', sample=SAMPLES),
expand('/output/{sample}_merged_sorted_mRNA', sample=SAMPLES),
rule sortmerna:
input:
'/output/{sample}_merged.fq',
output:
merged_file='/output/{sample}_merged_sorted_mRNA.fq',
merged_sorted='/output/{sample}_merged_sorted.fq'
params:
merged_file_basename='/output/{sample}_merged_sorted_mRNA',
merged_sorted_basename='/output/{sample}_merged_sorted'
message: """---SORTING---"""
shell:
"""
sortmerna --ref /usr/share/sortmerna/rRNA_databases/silva-bac-23s-id98.fasta,/usr/share/sortmerna/rRNA_databases/index/silva-bac-23s-id98: --reads {input} --paired_in -a 16 --log --fastx --aligned {params.merged_file_basename} --other {params.merged_sorted_basename} -v
"""

Related

parallel: Error: Command line too long (68914 >= 65524) at input 0

Given a file with long lines, parallel fails to pass these lines as an argument to any command:
$> cat johny_long_lines.txt | parallel echo {}
parallel: Error: Command line too long (68906 >= 65524) at input 0: 2236439425|\x308286873082856fa003020102020c221ff03...
This gets more confusing when I see that the line is 68900 characters long:
$> cat johny_long_lines.txt | head -n 1 | wc -m
68900
while the max line length allowed by parallel is way more longer that my input:
$> parallel --max-line-length-allowed
131049
Also: if you think that it's a problem of execve, this might interest you:
$> getconf ARG_MAX
2097152
Any idea what I'm doing here wrong?
UPDATE
I figured out that the problem persists for versions 20161222 and 20220522 but not for 20210822 (delivered with Ubuntu 22.04 LTS). Further inspection reveals that this line causes the problem:
# Usable len = maxlen - 3000 for wrapping, div 2 for hexing
int(($Global::minimal_command_line_length - 3000)/2);
Which I can confirm using --show-limits:
$> parallel --show-limits
[...]
Maximal size of command: 131063
Maximal usable size of command: 64031
This annoying feature does not exist in version 20210822 and I my file goes through as expected.
Can this be disabled?
I got the message
parallel --show-limits
Maximal size of command: 131049
Maximal used size of command: 83906
and had some trouble finding the source for the 83906
but then found it in
~/.parallel/tmp/sshlogin/$(hostname)/linelen
and had no clue how it was set to this small value.
But the Version of parallel was quite old:
GNU parallel 20180922

Moses tuning fails to run extractor

When I attempt to tune my Moses tuner (according to the moses baseline), it reaches the end of my tuning dataset (75k lines) and then exits on the code:
Executing: /home/alexm/Desktop/dissertation/models/cy-en/transsystem/mert-work/extractor.sh > extract.out 2> extract.err
Exit code: 127
ERROR: Failed to run '/home/alexm/Desktop/dissertation/models/cy-en/transsystem/mert-work/extractor.sh'. at ../../../mosesdecoder/scripts/training/mert-moses.pl line 1775.
It exits without giving the current tuning weights too, causing the loss of hours of progress.
It seems like you're missing some file or access to it
127 typically means command not found
Are you missing the file /home/alexm/Desktop/dissertation/models/cy-en/transsystem/mert-work/extractor.sh ?

Rule in snakemake using singularity: unterminated quoted string

I'm running a snakemake pipeline that for a specific rule loads a container:
rule counts:
params:
transcriptome=os.environ["INDEX"],
outdir= (os.environ["OUTDIR"] + "/counts/"),
indir= (os.environ["INDIR"] + "{sample}"),
name = lambda wildcards: SAMPLES[wildcards.sample]
output:
(os.environ["OUTDIR"] + "counts/" + "{sample}" + "/outs/web_summary.html")
container:
"docker://marcusczi/cellranger_clean"
shell:
"""
cellranger count --id={wildcards.sample} --transcriptome={params.transcriptome} --fastqs={params.indir} --sample={params.name}
mkdir -p {params.outdir}
mv ./{wildcards.sample}/ {params.outdir}
"""
Dry run looks fine, the rule itself I'm sure it works (tried it without the container). However, when I run it with docker I get this error:
Activating singularity image /some/path/.snakemake/singularity/c288fbc3fef5771f055a688c6678c24d.simg
/bin/sh: syntax error: unterminated quoted string
[ 1.228141] reboot: Power down
And then it waits for the missing files, and fails.
I think the answer to this situation might be related to this previous question, but I have tried everything i can think of in terms of escaping characters (except for the wildcards and variables within curly brackets because I'm guessing it should be fine, and if not why am i even using snakemake :-( ). The paths for the directories I'm using are valid and exist, the name and wildcard "sample" are in the shape "sample_123", nothing fancy.
It's also worth saying that there are no single or double quotes in any of these variables.
Thank you!!
Software and OS:
I am in macos catalina 10.15.5, running snakemake 5.20.1, and I have been using the beta version of singularity for macos (3.3.0-rc.1.658.g7427b73f1.dirty).
Running singularity outside Snakemake:
I tried running the singularity outside snakemake, the software that I'm trying to run starts, but then complains that there is no disk left on space (which is not true). I'm running the singularity as sudo singularity run -B "$(pwd):$(pwd)" docker://marcusczi/cellranger_clean
I think this latest error might be either 1) I'm not running singularity as I should..? Or 2) A false statement of what is happening since cellranger (the software I'm trying to run) often has misleading error messages.
Minimal reproducible example:
If you install snakemake, you should be able to reproduce my error when running snakemake -j1 --use-singularity in the same directory of the Snakefile.
Snakefile:
rule all:
input:
"output.txt"
rule counts:
output:
"output.txt"
container:
"docker://marcusczi/cellranger_clean"
shell:
"""
cellranger count --help
echo "hurray!" > {output}
"""

Replacement string not working in GNU parallel

I have the script run_md.py which produces the file test.dcd from the input file named test.pdb.
I want to execute the same command on multiple input files (test*.pdb) on a remote server using GNU parallel and transfer the result back to the local computer. Therefore, I'm using the following command:
parallel --trc {.}.dcd -j 2 -S $SERVER1 './run_md.py {} 1000' ::: test*.pdb
The command is running as expected on the server using 2 slots. However, the files are not transferred back and I get the following error:
rsync: link_stat "/home/bougui/{.}.dcd" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.1]
It looks like the replacement string is not working. How can I make it works?
Below is the output of parallel --version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using GNU Parallel for a publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
What you are doing is 100% correct. So something on your system is breaking this. Please try this on another system and if possible follow REPORTING BUGS from man parallel.
The bug reported in that thread has been fixed and this feature works well with the latest version of GNU parallel (20160622). The GNU parallel version 20130922 packaged with Debian 8.5 is buggy for the usage of {.} string replacement, as described below:
With more test I found that the output file must be specified with a replacement string in the command run in parallel.
For testing purpose, you can find below a complete example that others can run:
echo This is input_file > input_file && parallel --trc {}.out -S $SERVER1 cat {} ">"{}.out ::: input_file
The example above works well. When I use the substitution string {.} as below:
echo This is input_file > input_file.in && parallel --trc {.}.out -S $SERVER1 cat {} ">"{.}.out ::: input_file
It works, as well. However, if I didn't specify {.}.out in the command run in parallel as below:
echo This is input_file > input_file.in && parallel --trc {.}.out -S $SERVER1 cat {} ">"input_file.out ::: input_file
... I reproduce the error:
rsync: link_stat "/home/bouvier/{.}.out" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.1]
rsync: [Receiver] write error: Broken pipe (32)
Therefore the output file must be specified in the command run in parallel.

bash gnu parallel argfile syntax

I just discovered GNU parallel and I'm having some trouble running a simple parallel task. I have a simulation running over multiple values and I'd like to split it up to run in parallel using command line args. From the docs , it seems you can run parallel mycommand :::: myargfile in which myargfile contains the various arguments you would like to feed your command, in parallel. However, I didn't see any information on how the args should be listed and assumed a myargfile like this would work:
--pmin 0 --pmax 0.1
--pmin 0.1 --pmax 0.2
...
mycommand --pmin 0 --pmax 0.1 executes no problem. But when I run parallel mycommand :::: myargfile I get error: unknown option pmin 0 --pmax 0.1 (caught and decoded courtesy boost program options). parallel echo :::: myargfile correctly prints out the arguments. It's as if they are being wrapped in a string which the program can't read and not fed like they are from a standard bash script.
What's going on? How can I make this work?
Following #DmitriChubarov's link to https://stackoverflow.com/a/6258206/1328439 , I discovered that I was lacking the colsep flag:
parallel --colsep ' ' mycommand :::: myargfile
successfully executes.
After digging through manual and help pages I came up with this example. Perhaps it will save someone out there. :)
#!/usr/bin/env bash
COMMANDS=(
"cnn -a mode=flat"
"cnn -a mode=xxx"
"cnn_x -a mode=extreme"
)
parallel --verbose --progress --colsep ' ' scrapy crawl {.} ::: "${COMMANDS[#]}"

Resources