How to take substring from input file as an argument to a program to be executed in GNU-parallel? - gnu-parallel

I am trying to run a program (say, biotool) with GNU Parallel. It takes 3 arguments, -i, -o and -a:
the input file (-i)
the name of the output file to write (-o)
an argument that takes a substring of the input file name (-a)
For example, say I have 10 text files like this:
1_a_test.txt
2_b_test.txt
3_c_test.txt
...
10_j_test.txt
I want to run my tool (say biotool) on all the 10 text files. I tried this
parallel biotool -i {} -o {.}.out -a {} ::: *.txt
I want to pass whatever precedes the first underscore in the input file name as the argument to the -a option, so that the commands actually run look like this (dry run):
biotool -i 1_a_test.txt -o 1_a_test.out -a 1
biotool -i 2_b_test.txt -o 2_b_test.out -a 2
biotool -i 3_c_test.txt -o 3_c_test.out -a 3
...
{} supplies the complete file name to -a, but I only want the substring before the first underscore to be supplied to -a.

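The shortest way, though harder to read, is to use a Perl expression in the replacement string that deletes everything from the first underscore onwards: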
parallel --dry-run biotool -i {} -o {.}.out -a '{= s/_.*// =}' ::: *test.txt
Alternatively, you can write a bash function that uses bash parameter substitution to extract the part before the underscore, then export it so that GNU Parallel can see it:
#!/bin/bash
doit(){
i=$1
o=$2
# Use internal bash parameter substitution to extract whatever precedes "_"
# See https://www.tldp.org/LDP/abs/html/parameter-substitution.html
a=${i/_*/}
echo biotool -i "$i" -o "$o" -a "$a"
}
export -f doit
parallel doit {} {.}.out ::: *test.txt
Sample Output
biotool -i 10_j_test.txt -o 10_j_test.out -a 10
biotool -i 1_a_test.txt -o 1_a_test.out -a 1
biotool -i 2_b_test.txt -o 2_b_test.out -a 2
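The extraction step can be tried on its own, without GNU Parallel. The sketch below is only an illustration: the helper name prefix_before_underscore is hypothetical, and it uses the ${name%%_*} form (remove the longest suffix matching "_*"), which gives the same result as ${i/_*/} for these file names.

```shell
# Hypothetical helper: print the part of a file name before the first underscore.
prefix_before_underscore() {
    local name=$1
    # ${name%%_*} strips the longest suffix that starts with an underscore.
    printf '%s\n' "${name%%_*}"
}

# Build the commands for a few sample names (echo only, nothing is executed).
for f in 1_a_test.txt 2_b_test.txt 10_j_test.txt; do
    echo "biotool -i $f -o ${f%.txt}.out -a $(prefix_before_underscore "$f")"
done
```

A name that contains no underscore is returned unchanged, so it is worth checking the inputs if that case can occur.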

Related

How to grep lines non-repeatedly for same command?

I have a space-separated file that looks like this:
$ cat in_file
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004927566.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004919950.1 FAD_binding_3
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 FAD_binding_3
I am using the following shell script utilizing grep to search for strings:
$ cat search_script.sh
grep "GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1" Pfam_anntn_temp.txt
grep "GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1" Pfam_anntn_temp.txt
The problem is that I want each grep command to return only the first instance of the string it finds exclusive of the previous identical grep command's output.
I need an output which would look like this:
$ cat out_file
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 FAD_binding_3
in which line 1 is exclusively the output of the first grep command and line 2 is exclusively the output of the second grep command. How do I do it?
P.S. I am running this on a big file (>125,000 lines). So, search_script.sh is mostly composed of unique grep commands. It is the identical commands' execution that is messing up my downstream analysis.
I'm assuming you are generating search_script.sh automatically from the contents of in_file. If you can count how many times the same grep command will be repeated, you can run grep once and pipe it to head. For example, if you know it will be used 2 times:
grep "foo" bar.txt | head -2
This outputs the first 2 lines of bar.txt that contain "foo".
If you have to run the grep commands separately, for example because there is other code between them, you can combine head and tail:
grep "foo" bar.txt | head -1 | tail -1
Some other commands...
grep "foo" bar.txt | head -2 | tail -1
head -n displays the first n lines of the input
tail -n displays the last n lines of the input
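The head/tail combination can be wrapped in a small function. This is a minimal sketch, not part of the original answer: the name nth_match is hypothetical, and it uses sed -n to pick the line instead of head and tail.

```shell
# Hypothetical helper: print only the n-th line of FILE that matches PATTERN.
nth_match() {
    local pattern=$1 file=$2 n=$3
    # grep lists all matching lines; sed -n "Np" prints just line N of that list.
    grep -- "$pattern" "$file" | sed -n "${n}p"
}
```

Usage would be nth_match "foo" bar.txt 1 for the first grep and nth_match "foo" bar.txt 2 for the second. One difference from head -2 | tail -1: if there are fewer than n matches, this prints nothing rather than repeating the last match.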
If you really MUST always use the same command, but ensure that the outputs always differ, the only way I can think of to achieve this is using temporary files and a complex sequence of commands:
cat foo.bar.txt.tmp 2>&1 | xargs -I xx echo "| grep -v \\'xx\\' " | tr '\n' ' ' | xargs -I xx sh -c "grep 'foo' bar.txt xx | head -1 | tee -a foo.bar.txt.tmp"
So to explain this command, given foo as a search string and bar.txt as the filename, then foo.bar.txt.tmp is a unique name for a temporary file. The temporary file will hold the strings that have already been output:
cat foo.bar.txt.tmp 2>&1 : outputs the contents of the temporary file. If none is present, will output an error message to stdout, (important because if the output was empty the rest of the command wouldn't work.)
xargs -I xx echo "| grep -v \\'xx\\' " adds | grep -v to the start of each line in the temporary file, grep -v something excludes lines that include something.
tr '\n' ' ' replaces newlines with spaces, to have on a single string a sequence of grep -vs.
xargs -I xx sh -c "grep 'foo' bar.txt xx | head -1 | tee -a foo.bar.txt.tmp" runs a new command, grep 'foo' bar.txt xx | head -1 | tee -a foo.bar.txt.tmp, replacing xx with the previous output. xx should be the sequence of grep -vs that exclude previous outputs.
head -1 makes sure only one line is output at a time
tee -a foo.bar.txt.tmp appends the new output to the temporary file.
Just be sure to clear the temporary files, rm *.tmp, at the end of your script.
If I understand the question correctly and you want to remove duplicates based on the last field of each line, then try the following (this is an easy task for awk):
awk '!a[$NF]++' Input_file

Why is xargs' exit code different based on the presence of "-I" option?

After reading the xargs man page, I am unable to understand the difference in exit codes from the following xargs invocations.
(The original purpose was to combine find and grep to check whether an expression exists in ALL the given files when I came across this behaviour.)
To reproduce:
(use >>! if using zsh to force creation of file)
# Create the input files.
echo "a" >> 1.txt
echo "ab" >> 2.txt
# The end goal is to check for a pattern (in this case simply 'b') inside
# ALL the files returned by a find search.
find . -name "1.txt" -o -name "2.txt" | xargs -I {} grep -q "b" {}
echo $?
123 # Works as expected since 'b' is not present in 1.txt
find . -name "1.txt" -o -name "2.txt" | xargs grep -q "b"
echo $?
0 # Am more puzzled by why the behaviour is inconsistent
The EXIT_STATUS section on the man page says:
xargs exits with the following status:
0 if it succeeds
123 if any invocation of the command exited with status 1-125
124 if the command exited with status 255
125 if the command is killed by a signal
126 if the command cannot be run
127 if the command is not found
1 if some other error occurred.
I would have thought that "123 if any invocation of the command exited with status 1-125" should apply irrespective of whether or not -I is used.
Could you share any insights to explain this conundrum please?
Here is evidence of the effect of -I option with xargs with the help of a wrapper script which shows the number of invocations:
cat ./grep.sh
#!/bin/bash
echo "I am being invoked at $(date +%Y%m%d_%H-%M-%S)"
# Pass all arguments through to grep unchanged
grep "$@"
(the actual command being invoked, in this case grep doesn't really matter)
Now execute the same commands as in the question using the wrapper script instead:
❯ find . -name "1.txt" -o -name "2.txt" | xargs -I {} ./grep.sh -q "b" {}
I am being invoked at 20190410_09-46-29
I am being invoked at 20190410_09-46-30
❯ find . -name "1.txt" -o -name "2.txt" | xargs ./grep.sh -q "b"
I am being invoked at 20190410_09-46-53
I have just discovered a comment on the answer of a similar question that answers this question (complete credit to https://superuser.com/users/49184/daniel-andersson for his wisdom):
https://superuser.com/questions/557203/xargs-i-behaviour#comment678705_557230
Also, unquoted blanks do not terminate input items; instead the separator is the newline character. — this is central to understanding the behavior. Without -I, xargs only sees the input as a single field, since newline is not a field separator. With -I, suddenly newline is a field separator, and thus xargs sees three fields (that it iterates over). That is a real subtle point, but is explained in the man page quoted.
-I replace-str
Replace occurrences of replace-str in the initial-arguments
with names read from standard input. Also, unquoted blanks do
not terminate input items; instead the separator is the
newline character. Implies -x and -L 1.
Based on that,
find . -name "1.txt" -o -name "2.txt"
#returns
# ./1.txt
# ./2.txt
xargs -I {} grep -q "b" {}
# interprets the above as two separate lines since,
# with -I option the newline is now a *field separator*.
# So this results in TWO invocations of grep and since one of them fails,
# the overall output is 123 as documented in the EXIT_STATUS section
xargs grep -q "b"
# reads both lines but passes them as arguments to a single grep invocation,
# which returns a successful exit code of 0 since the pattern was found in one of the files.
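The difference can be reproduced in a throwaway directory. This is a sketch under the assumption that GNU xargs and grep are in use; the function name demo_xargs_status is hypothetical, and the file names are fed with printf instead of find so the order is fixed.

```shell
# Hypothetical demo: compare xargs exit status with and without -I.
demo_xargs_status() {
    local dir
    dir=$(mktemp -d) || return 1
    echo "a"  > "$dir/1.txt"
    echo "ab" > "$dir/2.txt"

    # With -I: one grep per input line; the grep on 1.txt exits with 1.
    printf '%s\n' "$dir/1.txt" "$dir/2.txt" | xargs -I {} grep -q "b" {}
    echo "with -I: $?"

    # Without -I: both names go to one grep, which succeeds on 2.txt.
    printf '%s\n' "$dir/1.txt" "$dir/2.txt" | xargs grep -q "b"
    echo "without -I: $?"

    rm -rf "$dir"
}

demo_xargs_status
```

With GNU xargs this should report 123 for the first pipeline and 0 for the second, matching the results in the question.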

grep combine -f and -E

I want to combine the content of a pattern file with a regular expression, i.e. grep -E -f.
The input file has the format
2 List_of_anthropologists<!!>Q1279970
3 List_of_Governors_of_Alabama<!!>Q558677
2027476 12th_Dalai_Lama<!!>Q25240
etc..
and the pattern file has the format:
13th_Dalai_Lama
5th_Dalai_Lama
etc...
I can make it work by manually putting in the pattern "13th_Dalai_Lama":
grep -E "^(\d*)(?:\t)13th_Dalai_Lama" input_file
But how do I combine this with the -f option so that 13th_Dalai_Lama is replaced by the lines in the pattern file?
With GNU grep, GNU sed and bash:
grep -f <(sed 's/.*/\\b&\\b/' pattern_file) input_file
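A variation on the same idea is to have sed wrap every pattern line in the fixed part of the regular expression and feed the result to grep -E. This is only a sketch under several assumptions: the function name grep_anchored is hypothetical, the number and the name are separated by whitespace, the name is always followed by <!!>, the pattern lines contain no regex metacharacters, and grep accepts "-f -" to read patterns from standard input (GNU grep does).

```shell
# Hypothetical helper: match only lines whose name column equals a pattern line.
grep_anchored() {
    local pattern_file=$1 input_file=$2
    # Turn each pattern P into the ERE  ^[0-9]+[[:space:]]+P<!!>
    sed 's/.*/^[0-9]+[[:space:]]+&<!!>/' "$pattern_file" |
        grep -E -f - "$input_file"
}
```

Usage would be grep_anchored pattern_file input_file. Because the name is anchored on both sides, a pattern such as 5th_Dalai_Lama cannot match inside a longer name.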

Gnu Parallel with multiple commands and multiple configurations

I start several gnu parallel jobs from a bash file like this:
parallel -a jobs_A.sh --workdir workDir_A_Path --results logDir_A_Path --joblog logDir_A_Path
parallel -a jobs_B.sh --workdir workDir_B_Path --results logDir_B_Path --joblog logDir_B_Path
I can concatenate jobs_A.sh and jobs_B.sh.
Now I want a single parallel call to submit all the jobs to the workers.
However, how can I tell parallel which workdir, results and joblog folders to use for each set of jobs?
You cannot do that, because neither --results nor --joblog is computed per job.
You can get the workdir, though:
parallel --xapply --workdir {1} --results logDir_Path --joblog logDir_common_Path {2} \
:::: <(perl -ne 'print "workDir_A_Path\n"' jobs_A.sh; perl -ne 'print "workDir_B_Path\n"' jobs_B.sh;) \
:::: <(cat jobs_A.sh jobs_B.sh)

How do you exclude symlinks in a grep?

I want to grep -R a directory but exclude symlinks. How do I do it?
Maybe something like grep -R --no-symlinks or something?
Thank you.
GNU grep v2.11-8 and later, when invoked with -r, skips symlinks that are not specified on the command line; when invoked with -R, it follows them.
If you already know the name(s) of the symlinks you want to exclude:
grep -r --exclude-dir=LINK1 --exclude-dir=LINK2 PATTERN .
If the name(s) of the symlinks vary, maybe exclude symlinks with a find command first, and then grep the files that this outputs:
find . -type f -a -exec grep -H PATTERN '{}' \;
The '-H' to grep adds the filename to the output (which is the default if grep is searching recursively, but is not here, where grep is being handed individual file names.)
I commonly want to modify grep to exclude source control directories. That is most efficiently done by the initial find command:
find . -name .git -prune -o -type f -a -exec grep -H PATTERN '{}' \;
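The find-based approach can be wrapped in a function. This is a minimal sketch with a hypothetical name, grep_skip_symlinks. It ends -exec with + instead of \; so that find passes many files to each grep call rather than starting one grep per file.

```shell
# Hypothetical helper: grep regular files under DIR, never following symlinks.
grep_skip_symlinks() {
    local pattern=$1 dir=$2
    # -type f matches regular files only, so symlinks are skipped;
    # -H makes grep print the file name even when it is given a single file.
    find "$dir" -type f -exec grep -H -- "$pattern" {} +
}
```

Usage would be grep_skip_symlinks PATTERN . to search the current directory.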
For now, here is how I would exclude symbolic links when using grep.
If you want just file names matching your search:
for f in $(grep -Rl 'search' *); do if [ ! -h "$f" ]; then echo "$f"; fi; done;
Explanation:
grep -R # recursive
grep -l # file names only
if [ ! -h "file" ] # bash if not a symbolic link
If you want the matched content output, how about a double grep:
srch="whatever"; for f in $(grep -Rl "$srch" *); do if [ ! -h "$f" ]; then
echo -e "\n## $f";
grep -n "$srch" "$f";
fi; done;
Explanation:
echo -e # enable interpretation of backslash escapes
grep -n # adds line numbers to output
It's not perfect, of course, but it could get the job done!
If you're using an older grep that does not have the -r behavior described in Aryeh Leib Taurog's answer, you can use a combination of find, xargs and grep:
find . -type f | xargs grep "text-to-search-for"
If you are using BSD grep (Mac), the following works similarly to the -r option of GNU grep:
grep -OR <PATTERN> <PATH> 2> /dev/null
From the man page:
-O If -R is specified, follow symbolic links only if they were explicitly listed on the command line.
