parallel: Error: Command line too long (68914 >= 65524) at input 0

Given a file with long lines, parallel fails to pass these lines as an argument to any command:
$> cat johny_long_lines.txt | parallel echo {}
parallel: Error: Command line too long (68906 >= 65524) at input 0: 2236439425|\x308286873082856fa003020102020c221ff03...
This gets more confusing when I see that the line is 68900 characters long:
$> cat johny_long_lines.txt | head -n 1 | wc -m
68900
while the maximum line length allowed by parallel is far greater than my input:
$> parallel --max-line-length-allowed
131049
Also: if you think it's a problem with execve, this might interest you:
$> getconf ARG_MAX
2097152
Any idea what I'm doing wrong here?
UPDATE
I figured out that the problem persists for versions 20161222 and 20220522 but not for 20210822 (delivered with Ubuntu 22.04 LTS). Further inspection reveals that this line causes the problem:
# Usable len = maxlen - 3000 for wrapping, div 2 for hexing
int(($Global::minimal_command_line_length - 3000)/2);
Which I can confirm using --show-limits:
$> parallel --show-limits
[...]
Maximal size of command: 131063
Maximal usable size of command: 64031
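The arithmetic matches: int((131063 - 3000) / 2) = int(64031.5) = 64031, exactly the usable size reported above.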
This annoying feature does not exist in version 20210822, where my file goes through as expected.
Can this be disabled?

I got the message
parallel --show-limits
Maximal size of command: 131049
Maximal used size of command: 83906
and had some trouble finding the source of the 83906,
but then found it in
~/.parallel/tmp/sshlogin/$(hostname)/linelen
and had no clue how it was set to this small value.
But the version of parallel was quite old:
GNU parallel 20180922
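If a stale cached value like this is the culprit, one workaround worth trying (my assumption; I have not seen it documented) is to delete the cached file so that parallel re-measures the limit on its next run:
rm ~/.parallel/tmp/sshlogin/$(hostname)/linelen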

Related

Vertica's vsql.exe returns errorlevel 0 when facing ERROR 3326: Execution time exceeded run time cap

I am using vsql.exe on an external Vertica database for which I don't have any administrative access. I use some views with simple SELECT+FROM+WHERE queries.
These queries work just fine 90% of the time, but sometimes, randomly, I get this error:
ERROR 3326:  Execution time exceeded run time cap of 00:00:45
The strange thing is that this error can happen way after those 45 seconds, even after 3 minutes. I've been told this is related to having different resource pools, but I don't want to dig into that anyway.
The problem is that when this occurs, vsql.exe returns errorlevel 0 and there is (apparently almost) no way to know this failed.
The output of the query is stored in a CSV file. When it succeeds, it ends with (#### rows). But when it fails with this error, it just stops at any point of the CSV, and the resulting size is around half of what's expected. This is of course not what you would expect when an error occurs; you would expect no output, or an empty file.
If there is a connection error or if the query has syntax errors, the errorlevel is not 0, so in those cases it behaves as expected.
I've tried many things, like increasing the timeout or adding -v ON_ERROR_STOP=ON to the vsql.exe parameters, but none of that helped.
I've googled a lot and found many people having this error, but the solutions are mostly related to increasing the timeouts, not related to the errorlevel returned.
Any help will be greatly appreciated.
TL;DR: how can I detect an error 3326 in a batch file like this?
@echo off
vsql.exe -h <hostname> -U <user> -w <pwd> -o output.csv -Ac "SELECT ....;"
echo %errorlevel% is always 0
if errorlevel 1 echo Error!! But this is never displayed.
Now that's really unexpected to me. I don't have Windows available just now, but trying on my Mac - at first just triggering a deliberate error:
$ vsql -h zbook -d sbx -U dbadmin -w $VSQL_PASSWORD -v ON_ERROR_STOP=ON -Ac "select * from foobarfoo"
ERROR 4566: Relation "foobarfoo" does not exist
$ echo $?
1
With ON_ERROR_STOP set to ON, this should be the behaviour everywhere.
Could you try what I did above through Windows, just with echo %ERRORLEVEL% instead of echo $?, just from the Windows command prompt and not in a batch file?
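Something along these lines, typed straight at the Windows prompt (same placeholders as in your batch file):
vsql.exe -h <hostname> -U <user> -w <pwd> -v ON_ERROR_STOP=ON -Ac "select * from foobarfoo"
echo %ERRORLEVEL%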
Next test: I run on the resource pool general in my little test database, so I temporarily set a runtime cap of 30 seconds, run a silly query that will take over 30 seconds with ON_ERROR_STOP set to ON, collect the value returned by vsql, and set the runtime cap of general back to NONE. I also have the VSQL_* environment variables set so I don't have to repeat them all the time:
rem Windows way to set environment variables for vsql:
set VSQL_HOST=zbook
set VSQL_DATABASE=sbx
set VSQL_USER=dbadmin
set VSQL_PASSWORD=***masked***
Now for the test (backslashes in Linux/macOS escape a newline, which lets you "word wrap" a shell command; use the caret (^) in Windows for that):
marco ~/1/Vertica/supp $ # set a runtime cap
marco ~/1/Vertica/supp $ vsql -i -c \
"alter resource pool general runtimecap '00:00:30'"
ALTER RESOURCE POOL
Time: First fetch (0 rows): 116.326 ms. All rows formatted: 116.730 ms
marco ~/1/Vertica/supp $ vsql -v ON_ERROR_STOP=ON -iAc \
"select count(*) from one_million_rows a cross join one_million_rows b"
ERROR 3326: Execution time exceeded run time cap of 00:00:30
marco ~/1/Vertica/supp $ # test the return code
marco ~/1/Vertica/supp $ echo $?
1
marco ~/1/Vertica/supp $ # clear the runtime cap
marco ~/1/Vertica/supp $ vsql -i -c \
"alter resource pool general runtimecap NONE "
ALTER RESOURCE POOL
Time: First fetch (0 rows): 11.148 ms. All rows formatted: 11.383 ms
So it works in my case. Your line:
if errorlevel 1 echo Error!! But this is never displayed.
... never echoes anything because the previous line, the echo itself, returns 0 to the shell, overwriting the errorlevel you wanted to test.
Try it command by command on your Windows command prompt, and see what happens. Just echo %errorlevel%, without evaluating it.
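If you need the check inside a batch file, a sketch of the usual pattern is to save the exit code into a variable immediately, before any other command can overwrite it (placeholders as in your original):
@echo off
vsql.exe -h <hostname> -U <user> -w <pwd> -v ON_ERROR_STOP=ON -o output.csv -Ac "SELECT ....;"
rem Capture the exit code before echo or anything else resets it
set RC=%errorlevel%
echo vsql returned %RC%
if not "%RC%"=="0" echo Error!!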
And I notice that you are trying to export to CSV format. Then, try this:
- format the output unaligned (-A)
- set the field separator to a comma (-F ',')
- remove the footer '(n rows)' (-P footer)
- limit the output to 5 rows in the query for the test
(I show the output before redirecting to a file):
marco ~/1/Vertica/supp $ vsql -A -F ',' -P footer -c "select * from one_million_rows limit 5"
id,id_desc,dob,category,busid,revenue
0,0,1950-01-01,1,====== boss ========,0.000
1,-1,1950-01-02,2,kbv-000001kbv-000001,0.010
2,-2,1950-01-03,3,kbv-000002kbv-000002,0.020
3,-3,1950-01-04,4,kbv-000003kbv-000003,0.030
4,-4,1950-01-05,5,kbv-000004kbv-000004,0.040
Not aligning is much faster than aligning.
Then, as most of the time is spent fetching the rows (that's why you get the timeout in the middle of writing the output file), try fetching more rows at a time than the default 1000. You will need to play with the value, depending on the network settings at your site, until you find your best value:
-v ROWS_AT_A_TIME=10000
Once you're happy with the tested output, try this command (change the SELECT for your needs, of course ....):
marco ~/1/Vertica/supp $ vsql -A -F ',' -P footer \
-v ON_ERROR_STOP=ON -v ROWS_AT_A_TIME=10000 -o one_million_rows.csv \
-c "select * from one_million_rows"
marco ~/1/Vertica/supp $ wc -l one_million_rows.csv
1000001 one_million_rows.csv
The table actually contains one million rows. Note the line count in the file: 1,000,001. That's with the header line included and the footer (1000000 rows) removed.
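Translated to your Windows batch file, the final export would look something like this (carets instead of backslashes for line continuation, as noted above, and connection details via the VSQL_* variables):
vsql.exe -A -F ',' -P footer ^
-v ON_ERROR_STOP=ON -v ROWS_AT_A_TIME=10000 -o output.csv ^
-c "SELECT ....;"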

snakemake: MissingOutputException within docker

I am trying to run a pipeline within a Docker container using snakemake. I am having a problem using the sortmerna tool to produce the {sample}_merged_sorted_mRNA and {sample}_merged_sorted outputs from the control_merged.fq and treated_merged.fq input files.
Here is my Snakefile:
SAMPLES = ["control","treated"]

for smp in SAMPLES:
    print("Sample " + smp + " will be processed")

rule final:
    input:
        expand('/output/{sample}_merged.fq', sample=SAMPLES),
        expand('/output/{sample}_merged_sorted', sample=SAMPLES),
        expand('/output/{sample}_merged_sorted_mRNA', sample=SAMPLES),

rule sortmerna:
    input: '/output/{sample}_merged.fq',
    output: merged_file='/output/{sample}_merged_sorted_mRNA', merged_sorted='/output/{sample}_merged_sorted',
    message: """---SORTING---"""
    shell:
        '''
        sortmerna --ref /usr/share/sortmerna/rRNA_databases/silva-bac-23s-id98.fasta,/usr/share/sortmerna/rRNA_databases/index/silva-bac-23s-id98: --reads {input} --paired_in -a 16 --log --fastx --aligned {output.merged_file} --other {output.merged_sorted} -v
        '''
When running this I get:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 57 of /input/Snakefile:
Missing files after 5 seconds:
/output/control_merged_sorted_mRNA
/output/control_merged_sorted
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /input/.snakemake/log/2018-11-05T091643.911334.snakemake.log
I tried to increase the latency with --latency-wait but I get the same result. The funny thing is that the two output files control_merged_sorted_mRNA.fq and control_merged_sorted.fq are produced, but the program fails and exits anyway. The version of snakemake is 5.3.0. Any help?
snakemake fails because the outputs described by the rule sortmerna are not produced. This is not a latency problem, it is a problem with your outputs.
Your rule sortmerna expects as output:
/output/control_merged_sorted_mRNA
and
/output/control_merged_sorted
but the program you are using (I know nothing about sortmerna) is apparently producing
/output/control_merged_sorted_mRNA.fq
and
/output/control_merged_sorted.fq
Check whether the names you give to the --aligned and --other options must be the real names of the files produced, or only basenames to which the program appends a .fq suffix. If you are in the latter case, I suggest you use:
rule final:
    input:
        expand('/output/{sample}_merged.fq', sample=SAMPLES),
        expand('/output/{sample}_merged_sorted.fq', sample=SAMPLES),
        expand('/output/{sample}_merged_sorted_mRNA.fq', sample=SAMPLES),

rule sortmerna:
    input:
        '/output/{sample}_merged.fq',
    output:
        merged_file='/output/{sample}_merged_sorted_mRNA.fq',
        merged_sorted='/output/{sample}_merged_sorted.fq'
    params:
        merged_file_basename='/output/{sample}_merged_sorted_mRNA',
        merged_sorted_basename='/output/{sample}_merged_sorted'
    message: """---SORTING---"""
    shell:
        """
        sortmerna --ref /usr/share/sortmerna/rRNA_databases/silva-bac-23s-id98.fasta,/usr/share/sortmerna/rRNA_databases/index/silva-bac-23s-id98: --reads {input} --paired_in -a 16 --log --fastx --aligned {params.merged_file_basename} --other {params.merged_sorted_basename} -v
        """

Replacement string not working in GNU parallel

I have the script run_md.py which produces the file test.dcd from the input file named test.pdb.
I want to execute the same command on multiple input files (test*.pdb) on a remote server using GNU parallel and transfer the result back to the local computer. Therefore, I'm using the following command:
parallel --trc {.}.dcd -j 2 -S $SERVER1 './run_md.py {} 1000' ::: test*.pdb
The command is running as expected on the server using 2 slots. However, the files are not transferred back and I get the following error:
rsync: link_stat "/home/bougui/{.}.dcd" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.1]
It looks like the replacement string is not working. How can I make it work?
Below is the output of parallel --version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using GNU Parallel for a publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
What you are doing is 100% correct. So something on your system is breaking this. Please try this on another system and if possible follow REPORTING BUGS from man parallel.
The bug reported in that thread has been fixed, and this feature works well with the latest version of GNU parallel (20160622). The GNU parallel version 20130922 packaged with Debian 8.5 is buggy in its handling of the {.} replacement string, as described below.
With more testing I found that the output file must be specified with a replacement string in the command run in parallel.
For testing purpose, you can find below a complete example that others can run:
echo This is input_file > input_file && parallel --trc {}.out -S $SERVER1 cat {} ">"{}.out ::: input_file
The example above works well. When I use the substitution string {.} as below:
echo This is input_file > input_file.in && parallel --trc {.}.out -S $SERVER1 cat {} ">"{.}.out ::: input_file.in
It works as well. However, if I don't specify {.}.out in the command run in parallel, as below:
echo This is input_file > input_file.in && parallel --trc {.}.out -S $SERVER1 cat {} ">"input_file.out ::: input_file.in
... I reproduce the error:
rsync: link_stat "/home/bouvier/{.}.out" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.1]
rsync: [Receiver] write error: Broken pipe (32)
Therefore the output file must be specified in the command run in parallel.

Grep regular expression stop after first match

I am trying to figure out what the grep syntax would be to get only one (unique) result of a search.
For example :
grep "^SEVERE" server.out
SEVERE: Cannot connect repository, Error occurred when calling service PING_SERVER.
SEVERE: Cannot connect repository, Error occurred when calling service PING_SERVER.
SEVERE: Cannot connect repository, Error occurred when calling service PING_SERVER.
I would like the output to show only the first occurrence found.
Any help would be great!
GNU grep 2.16, which comes with Cygwin, has this option (from 'man grep'):
-m NUM, --max-count=NUM
Stop reading a file after NUM matching lines. If the input is
standard input from a regular file, and NUM matching lines are
output, grep ensures that the standard input is positioned to
just after the last matching line before exiting, regardless of
the presence of trailing context lines. This enables a calling
process to resume a search. When grep stops after NUM matching
lines, it outputs any trailing context lines. When the -c or
--count option is also used, grep does not output a count
greater than NUM. When the -v or --invert-match option is also
used, grep stops after outputting NUM non-matching lines.
So try:
$ grep -m 1 "^SEVERE" server.out
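If your grep is too old to support -m, piping through head gives the same single-match result:
$ grep "^SEVERE" server.out | head -n 1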

bash gnu parallel argfile syntax

I just discovered GNU parallel and I'm having some trouble running a simple parallel task. I have a simulation running over multiple values and I'd like to split it up to run in parallel using command line args. From the docs, it seems you can run parallel mycommand :::: myargfile, in which myargfile contains the various arguments you would like to feed your command, in parallel. However, I didn't see any information on how the args should be listed and assumed a myargfile like this would work:
--pmin 0 --pmax 0.1
--pmin 0.1 --pmax 0.2
...
mycommand --pmin 0 --pmax 0.1 executes without a problem. But when I run parallel mycommand :::: myargfile I get error: unknown option pmin 0 --pmax 0.1 (caught and decoded courtesy of Boost program options). parallel echo :::: myargfile correctly prints out the arguments. It's as if each line is being passed as a single quoted string that the program can't parse, rather than being split into separate arguments as it would be in a standard bash script.
What's going on? How can I make this work?
Following @DmitriChubarov's link to https://stackoverflow.com/a/6258206/1328439, I discovered that I was lacking the --colsep flag:
parallel --colsep ' ' mycommand :::: myargfile
successfully executes.
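To see the splitting in action, a quick check is to echo the individual columns with parallel's positional replacement strings:
parallel --colsep ' ' echo flag1={1} value1={2} flag2={3} value2={4} :::: myargfile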
After digging through the manual and help pages I came up with this example. Perhaps it will save someone out there. :)
#!/usr/bin/env bash

COMMANDS=(
    "cnn -a mode=flat"
    "cnn -a mode=xxx"
    "cnn_x -a mode=extreme"
)

parallel --verbose --progress --colsep ' ' scrapy crawl {.} ::: "${COMMANDS[@]}"
