GNU Parallel multiple sets of commands

I'd like to use GNU Parallel to run the same command with two different parameters and two different globs. For example, I want the following jobs to run:
mycmd A apples1
mycmd A apples2
mycmd A apples3
mycmd B bananas1
mycmd B bananas2
I can do it with two separate calls, but this defeats the purpose of having my jobs managed by one call to parallel:
parallel mycmd A ::: apples*
parallel mycmd B ::: bananas*
Is there a way to do it with a single invocation?

I assume you don't want Bs with your apples. Otherwise it is as simple as:
parallel mycmd ::: [A-Z] ::: [a-z]*
If A can be computed as the first character of the second argument, then from version 20140722 you can do this:
parallel mycmd '{= $_=uc(substr($_,0,1)) =}' {} ::: [a-z]*
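The {= ... =} construct evaluates a Perl expression on the input argument; this one replaces the argument with its uppercased first character, while the bare {} keeps the original. A --dry-run makes the expansion visible (a sketch, assuming files apples1 and bananas1 exist):
$ parallel --dry-run mycmd '{= $_=uc(substr($_,0,1)) =}' {} ::: apples1 bananas1
mycmd A apples1
mycmd B bananas1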
If you have a list of apples and the corresponding As like this:
A,apples1
A,apples2
B,bananas1
B,bananas2
B,bananas3
then you can split on ,:
cat file | parallel --colsep , mycmd {1} {2}
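With the file above, a --dry-run should show each row expanding into one job:
$ parallel --dry-run --colsep , mycmd {1} {2} < file
mycmd A apples1
mycmd A apples2
mycmd B bananas1
mycmd B bananas2
mycmd B bananas3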
If this is also not how you have your input, then you need to explain a bit more about how you have your As and apples.

Related

Error in GNU parallel dynamic string replacement

I have more than 50 file pairs with names in the following format: AA-7R-76L1.clean.R1.fastq.gz, AA-7R-76L1.clean.R2.fastq.gz
I tried to use parallel in the following way:
parallel --plus echo {%R..fastq.gz} ::: *.fastq.gz | parallel 'repair.sh in1={}.R1.fastq.gz in2={}.R2.fastq.gz out1={}.repd.R1.fastq.gz out2={}.repd.R2.fastq.gz outs={}.singletons.fastq.gz repair'
The --plus echo stage should dynamically strip R1.fastq.gz / R2.fastq.gz to capture the sample name, e.g. HB-7R-25L0.clean, and then feed it to repair.sh.
The error I get is that the first stage passes through the entire filename instead of capturing the sample name, so in1 and in2 become AA-7R-76L1.clean.R1.fastq.gz.R1.fastq.gz and AA-7R-76L1.clean.R2.fastq.gz.R2.fastq.gz.
What is the error here?
Something like:
$ parallel --plus --dry-run 'repair.sh in1={} in2={/R1/R2} out1={/R1/fixed.R1} out2={/R1/fixed.R2} outs={%.R1.fastq.gz}_singletons.fastq repair' ::: *R1.fastq.gz
(Assuming R1 and R2 are not part of the *-part of the name.)
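Here {/R1/R2} is one of the --plus dynamic replacement strings (an s/R1/R2/ on the argument) and {%.R1.fastq.gz} strips that suffix. For the sample pair above, the dry run should print something like:
repair.sh in1=AA-7R-76L1.clean.R1.fastq.gz in2=AA-7R-76L1.clean.R2.fastq.gz out1=AA-7R-76L1.clean.fixed.R1.fastq.gz out2=AA-7R-76L1.clean.fixed.R2.fastq.gz outs=AA-7R-76L1.clean_singletons.fastq repair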

How to fix 'Unable to open [{2}]' error in Gnu Parallel

I want to parallelize an image processing step which uses two programs at the same time. My code works fine for a single image but when I try to parallelize it, it fails.
The two programs I am using are fx and getkey from the USGS Integrated Software for Imagers and Spectrometers. I use fx to perform an arithmetic operation on my input image ('f1' in the code below) and write the result to a new file (the 'to' parameter). getkey outputs the value of a requested keyword, which in this case is a number.
In the following code, I am subtracting the output of getkey from my input image, f1, and writing the result to a new file, which is defined by the 'to' parameter. This code works as I expect it to:
fx f1=W1660432760_1_overclocks_average_lwps5.cub to=testing_fx2.cub equation=f1-$(getkey from=W1660432760_1_overclocks_average_lwps5_stats.txt grpname=results keyword=average)
The problem comes when I try to parallelize it. The following code gives an error, saying 'Unable to open [{2}].'
parallel fx f1={1} to={1.}_minus_avg.cub equation=f1-$(getkey from={2} grpname=results keyword=average) ::: $(find *lwps5.cub) ::: $(find *stats.txt)
The result I am expecting is an output image with pixel values that are smaller by the getkey value compared to the input image.
The $(getkey ...) is expanded by your shell before parallel ever runs, so getkey is called with the literal string {2}, which is exactly the 'Unable to open [{2}]' error; quoting the equation defers the command substitution until each job runs. If the two inputs should be combined in all ways:
parallel fx f1={1} to={1.}_minus_avg.cub 'equation=f1-$(getkey from={2} grpname=results keyword=average)' ::: *lwps5.cub ::: *stats.txt
If the two inputs should be linked:
parallel fx f1={1} to={1.}_minus_avg.cub 'equation=f1-$(getkey from={2} grpname=results keyword=average)' ::: *lwps5.cub :::+ *stats.txt
If neither of these solves your issue, then make a shell function that takes 2 arguments:
doit() {
    arg1="$1"
    arg2="$2"
    # Do all your stuff with getkey and fx
}
export -f doit
# all combinations
parallel doit ::: *lwps5.cub ::: *stats.txt
# or linked
parallel doit ::: *lwps5.cub :::+ *stats.txt
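The function body can be filled in from the single-image command that already works. A minimal sketch (the ${1%.cub} suffix-stripping mirrors parallel's {1.}; the output naming is my assumption):
doit() {
    # Extract the average from the stats file, then subtract it from the image
    avg=$(getkey from="$2" grpname=results keyword=average)
    fx f1="$1" to="${1%.cub}_minus_avg.cub" equation="f1-$avg"
}
export -f doit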

Match Lines From Two Lists With Wildcards In One List

I have two lists, one of which contains wildcards (in this case represented by *). I would like to compare the two lists and create an output of those that match, with each wildcard * representing a single character.
For example:
File 1
123456|Jane|Johnson|Pharmacist|janejohnson@gmail.com
09876579|Frank|Roberts|Butcher|frankie1@hotmail.com
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
File 2
1***6|Jane|Johnson|Pharmacist|janejohnson@gmail.com
09876579|Frank|Roberts|Butcher|f**1@hotmail.com
092362936|Joe|Jordan|J*****|joe@joesjoinery.com
928|Bob|Horton|Farmer|b*****n@f*********.co.uk
Output
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
Explanation
The first two lines are not considered matches because the number of *s is not equal to the number of characters shown in the first file. The latter two are, so they are added to output.
I have tried to work out ways to do this with awk and join, but I don't know how to even start. Any help would be greatly appreciated.
$ cat tst.awk
NR==FNR {
    file1[$0]
    next
}
{
    # Make every non-* char literal (see https://stackoverflow.com/a/29613573/1745001):
    gsub(/[^^*]/,"[&]")   # Convert every char X to [X] except ^ and *
    gsub(/\^/,"\\^")      # Convert every ^ to \^
    # Convert every * to .:
    gsub(/\*/,".")
    # Add line start/end anchors
    $0 = "^" $0 "$"
    # See if the current file2 line matches any line from file1
    # and if so print that line from file1:
    for ( line in file1 ) {
        if ( line ~ $0 ) {
            print line
        }
    }
}
$ awk -f tst.awk file1 file2
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
sed 's/\./\\./g; s/\*/./g' file2 | xargs -I{} grep {} file1
Explanation:
I'd take advantage of regular expression matching. To do that, we need to turn every asterisk * into a dot ., which represents any character in regular expressions. As a side effect of enabling regular expressions, we need to escape all special characters, particularly the ., in order for them to be taken literally. In a regular expression, we need to use \. to represent a dot (as opposed to any character).
The first step is to perform these substitutions with sed; the second is to pass every resulting line as a search pattern to grep, searching file1 for that pattern. The glue that allows us to do this is xargs, where {} is a placeholder representing a single line from the output of the sed command.
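For instance, the last line of File 2 is turned into the following pattern (dots escaped first, then asterisks converted):
$ echo '928|Bob|Horton|Farmer|b*****n@f*********.co.uk' | sed 's/\./\\./g; s/\*/./g'
928|Bob|Horton|Farmer|b.....n@f.........\.co\.uk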
Note:
This is not a general, safe solution you can simply copy and paste: watch out for any characters in the file containing the asterisks that grep treats as special in regular expressions.
Update:
jhnc extends the escaping to all of the characters .\^$[], thus accounting for almost all sorts of email addresses, and avoids xargs by using -f - to pass the output of sed to grep as a list of search patterns:
sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1
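On the sample files this produces exactly the expected output:
$ sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk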
This solution is both more general and more efficient.

How to quote each argument from gnu parallel?

Given some pipe-delimited content (data.tsv):
Test|One|Two|Three
Again|||Another
And a bash function:
function print_last() {
    echo "$4"
}
export -f print_last
And the parallel command:
parallel -C "\|" print_last :::: data.tsv
My expected output is:
Three
Another
However, Another never prints because the function only receives two arguments for that row of data. This is caused by the empty cells in the tabular data. My data will have blank cells and a varying number of columns.
So, without changing my command to include numbered arguments (print_last "{1}" "{2}" "{3}" "{4}"), how can I ensure that blank values are sent to the function?
Since your function is called print_last maybe it will be enough to simply get the last element:
parallel -C "\|" echo {-1} :::: data.tsv
Otherwise, abuse the fact that -X repeats the context around {}:
parallel -C "\|" -X print_last \"\"{} :::: data.tsv
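The \"\"{} glues an explicit empty string onto each value, so blank cells still arrive as empty arguments instead of vanishing. For the sample data, either variant should print:
Three
Another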

grep to find words with unique letters

How can I use grep to find occurrences of words from a dictionary file which contain a given set of letters, with the restriction that each letter occurs once and only once?
E.g. if the letters are abc then the expected output is:
cab
EDIT:
Given a dictionary file (that is, a file containing one word per line, such as /usr/share/dict/words on the Mac OS X operating system) and a set of (unique) characters, I want to print out all of the dictionary file's words that contain each character of the input set once and only once. For example, if the set of characters is {a,b,c}, then print out all (3-letter) words that contain each character of the set.
I am looking, preferably, for a solution that uses just grep expressions.
Given a series of letters, for example abc, you can convert each one to a lookahead, like this:
^(?=[^a]*a[^a]*$)(?=[^b]*b[^b]*$)(?=[^c]*c[^c]*$).{3}$
Each lookahead requires exactly one occurrence of its letter, and .{3} pins the word length to the number of input letters. Since lookaheads are a Perl-style feature, you need grep's -P (PCRE) flag rather than -E to use this regex.
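A quick sanity check against the system word list (output depends on your dictionary; on many systems only cab qualifies):
$ grep -P '^(?=[^a]*a[^a]*$)(?=[^b]*b[^b]*$)(?=[^c]*c[^c]*$).{3}$' /usr/share/dict/words
cab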
To create this regex from a string, you could use sed (an exercise for the reader)
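For example, here is a minimal bash sketch that builds the pattern with a shell loop instead of sed (script structure and variable names are mine):
#!/bin/bash
letters=$1
re='^'
for (( i=0; i<${#letters}; i++ )); do
    c=${letters:i:1}
    re+="(?=[^$c]*$c[^$c]*\$)"   # exactly one occurrence of $c
done
re+=".{${#letters}}\$"           # total length = number of letters
grep -P "$re" /usr/share/dict/words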
grep -E '^[abc]{3}.$' <Dictionary file> | grep -v -e 'a.*a' -e 'b.*b' -e 'c.*c'
i.e. Find all three letter strings matching the input and pipe these through inverse grep to remove strings with double letters.
I'm using the '.' after {3} because my dictionary file is Windows-based, so each line has an extra carriage return. That's probably not necessary for your file.
Below is a Perl solution. Note, you'll need to add more words to the dictionary and read input into the $input variable. An array of valid words will end up in @results.
#!/usr/bin/env perl
use Data::Dumper;

my $input = "abc";

my @dictionary = qw(aaa aac aad aal aam aap aar aas aat aaw aba abc abd abf abg
abh abm abn abo abr abs abv abw aca acc ace aci ack acl acp acs act acv ada adb
adc add adf adh adl adn ado adp adq adr ads adt adw aea aeb aec aed aef aes aev
afb afc afe aff afg afi afk afl afn afp aft afu afv agb agc agl agm agn ago agp
...
PUT A REAL DICTIONARY HERE!
...
zie zif zig zii zij zik zil zim zin zio zip zir zis zit ziu ziv zlm zlo zlx zma
zme zmi zmu zna zoa zob zoe zog zoi zol zom zon zoo zor zos zot zou zov zoy zrn
zsr zub zud zug zui zuk zul zum zun zuo zur zus zut zuz zva zwo zye zzz);

# Generate a lookahead expression for each character in the input word
my $regexp = join("", map { "(?=.*$_)" } split(//, $input));

my @results;
foreach my $word (@dictionary) {
    # If the size of the input doesn't match the dictionary word, skip to the
    # next word.
    if (length($input) != length($word)) {
        next;
    }
    if ($word =~ /$regexp/) {
        push(@results, $word);
    }
}

print Dumper @results;
The solution I found involves using grep first to extract all n-letter words that contain only letters from the input set (some letters might appear more than once, some may not appear at all); again I am assuming that the input letters are unique. It then does a series of single-letter greps to make sure each letter occurs at least once. Because the words are of length n, this ensures each word contains each letter once and only once. For example, if the input character set is {a,b,c} then the solution would be:
grep -E '^[abc]{3}$' /usr/share/dict/words | grep a | grep b | grep c
A simple bash script can be written which builds this grep pipeline and executes it against the word file, using $1 as the input letter set. It might not be the most efficient way to generate the command, but as I am not familiar with sed or awk it does solve my problem. The script I created is:
#!/bin/bash
slen=${#1}
g2="'^[$1]{$slen}\$'"
g3=""
ix1=0
# Build one '| grep <letter>' stage per input letter
while [ $ix1 -lt $slen ]
do
    g3="$g3 | grep ${1:$ix1:1}"
    ix1=$((ix1+1))
done
eval grep -E $g2 /usr/share/dict/words $g3
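Assuming the script is saved as unique3.sh (the name is mine), a run for the letter set abc effectively evaluates the pipeline shown earlier:
$ ./unique3.sh abc
cab
i.e. the eval executes: grep -E '^[abc]{3}$' /usr/share/dict/words | grep a | grep b | grep c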
