Combining replacement strings and regular expressions in GNU Parallel - gnu-parallel

I have a list of file paths of the format:
/data/nicotine_sensi/bam/9-2_box_1_S23_starAligned.sortedByCoord.out.bam
/data/nicotine_sensi/bam/9-2_box_3_S101_starAligned.sortedByCoord.out.bam
/data/nicotine_sensi/bam/9-3_box_1_S24_starAligned.sortedByCoord.out.bam
/data/nicotine_sensi/bam/9-3_box_3_S102_starAligned.sortedByCoord.out.bam
I want to feed these into a GNU Parallel command so that a predefined replacement string and a Perl or --plus replacement string are applied together, but I couldn't find a solution in the tutorials. Ideally, {/...} and {%_starAligned} would work together to produce:
9-2_box_1_S23
9-2_box_3_S101
9-3_box_1_S24
9-3_box_3_S102
However, the closest I have got is:
parallel --rpl '{..} s:/data/nicotine_sensi/bam/::;s:_starAligned.sortedByCoord.out.bam::' \
echo {..} ::: $(ls $bam_dir/*.bam)
which is messy and not very portable for other directories.

The definition of {/...} is:
s:.*/::; s:\.[^/.]+$::; s:\.[^/.]+$::; s:\.[^/.]+$::;
The definition of {%(.*)} is:
s/$$1$//;
So combined you could do:
echo /data/nicotine_sensi/bam/9-3_box_1_S24_starAligned.sortedByCoord.out.bam |
parallel --rpl '{¤([^}]+?)} s:.*/::; s:\.[^/.]+$::; s:\.[^/.]+$::; s:\.[^/.]+$::; s/$$1$//;' echo {¤_starAligned}
If you know you will always remove _something then:
echo /data/nicotine_sensi/bam/9-3_box_1_S24_starAligned.sortedByCoord.out.bam |
parallel --rpl '{¤} s:.*/::; s:\.[^/.]+$::; s:\.[^/.]+$::; s:\.[^/.]+$::; s/_[^_]+$//;' echo {¤}
If you will be using this a lot then putting it in a profile will probably be a good idea.
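Before wiring the expressions into --rpl, it can help to try the same substitutions with sed. This is only a sketch: strip_name is a hypothetical helper (not part of GNU parallel), it assumes a sed that supports -E, and it treats the suffix as a regex that does not contain a colon.

```shell
# Hypothetical helper: applies the same substitutions as the --rpl definition, via sed.
# $1 = path, $2 = suffix to drop (used as a regex, like $$1 in the --rpl)
strip_name() {
  printf '%s\n' "$1" |
    sed -E -e 's:.*/::' \
           -e 's:\.[^/.]+$::' -e 's:\.[^/.]+$::' -e 's:\.[^/.]+$::' \
           -e "s:$2\$::"
}

strip_name /data/nicotine_sensi/bam/9-3_box_1_S24_starAligned.sortedByCoord.out.bam _starAligned
# prints 9-3_box_1_S24
```

The first expression drops the directory, the next three drop one extension each, and the last one drops the given suffix.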

Related

Delete lines of many files using grep and GNU parallel

I have a directory with many files that all end in "_all.txt". I want to delete all lines in each of these files containing either a "*" or a "-" and send them to files ending in "_all_cleaned.txt".
Right now I am using a for loop as follows:
for file in *_all.txt;
do
filename=$(echo $file | cut -d '_' -f 1)
grep -vwE "(*|-)" ${file}> "${filename}_all_cleaned.txt"
done
I would like to be able to do this in parallel using GNU parallel so that the command will be executed on each file on a different compute node instead of waiting for one node to do all in a row.
How can I incorporate
If the files are in the login dir on the servers (i.e. the dir you get by ssh server1 pwd):
parallel -Sserver1,server2 'grep -vwE "(*|-)" {} > {=s/\.txt$/_cleaned.txt/=}' ::: *.txt
If it is the same dir relative to $HOME (e.g. /home/me/my/dir):
parallel --wd . -Sserver1,server2 'grep -vwE "(*|-)" {} > {=s/\.txt$/_cleaned.txt/=}' ::: *.txt
If it is /different/dir:
parallel --wd /different/dir -Sserver1,server2 'grep -vwE "(*|-)" {} > {=s/\.txt$/_cleaned.txt/=}' ::: *.txt
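To check the output naming and the pattern on a single file before distributing the work, a local sketch may be useful. clean_file is a hypothetical helper, and the bracket expression [*-] is an assumption about the intent (drop every line containing a literal * or -); it is not the same as the -w pattern in the question.

```shell
# Hypothetical helper: cleans one file locally, no remote servers involved.
clean_file() {
  in=$1
  out=${in%.txt}_cleaned.txt   # foo_all.txt -> foo_all_cleaned.txt
  # [*-] matches a literal * or - anywhere on the line (assumed intent).
  # grep exits 1 when every line was dropped; treat that as success.
  grep -v '[*-]' "$in" > "$out" || [ $? -eq 1 ]
}
```

Usage would be something like clean_file sample_all.txt, which writes sample_all_cleaned.txt next to the input.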

Executing bash script on multiple lines inside multiple files in parallel using GNU parallel

I want to use GNU parallel for the following problem:
I have a few files each with several lines of text. I would like to understand how I can run a script (code.sh) on each line of text of each file and for each file in parallel. I should be able to write out the output of the operation on each input file to an output file with a different extension.
Seems this is a case of multiple parallel commands running parallel over all files and then running parallel for all lines inside each file.
This is what I used:
ls mydata_* |
parallel -j+0 'cat {} | parallel -I ./explore-bash.sh > {.}.out'
I do not know how to do this using GNU parallel. Please help.
Your solution seems reasonable. You just need to remove -I:
ls mydata_* | parallel -j+0 'cat {} | parallel ./explore-bash.sh > {.}.out'
Depending on your setup this may be faster, as it runs only n jobs at a time, whereas the solution above can run up to n*n jobs in parallel (n = number of cores):
ls mydata_* | parallel -j1 'cat {} | parallel ./explore-bash.sh > {.}.out'
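For reference, parallel without -I appends each input line as the last argument of the command. Below is a sequential sketch of what the inner parallel computes for one file, assuming the script takes the line as its first argument. run_per_line is a hypothetical helper, and the sed expression mirrors what {.} does to the file name.

```shell
# Hypothetical helper: run a command once per input line, sequentially.
# $1 = command (stands in for ./explore-bash.sh), $2 = input file
run_per_line() {
  # Same effect as {.}: drop the last extension of the file name, if any
  out=$(printf '%s\n' "$2" | sed -E 's:\.[^/.]+$::').out
  while IFS= read -r line; do
    # </dev/null keeps the command from consuming the remaining input lines
    "$1" "$line" < /dev/null
  done < "$2" > "$out"
}
```

Calling run_per_line ./explore-bash.sh mydata_1.txt would write mydata_1.out, one script invocation per line.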

grep in pipeline: why it does not work

I want to extract certain information from the output of a program. But my method does not work. I write a rather simple script.
#!/usr/bin/env python
print "first hello world."
print "second"
After making the script executable, I type ./test | grep "first|second". I expect it to show the two sentences. But it does not show anything. Why?
Escape the expression.
$ ./test | grep "first\|second"
first hello world.
second
Also bear in mind that the shebang is #!/usr/bin/env python, not just #/usr/bin/env python.
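Use \| instead of |. In grep's default basic regular expressions a bare | matches a literal pipe character; GNU grep accepts \| for alternation: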
./test | grep "first\|second"
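Two other spellings that avoid the backslash, shown as a small sketch. The sample function is a hypothetical stand-in for the output of ./test, with one extra line added to show that non-matching lines are filtered out.

```shell
# Sample input standing in for the output of ./test (third line is made up)
sample() { printf 'first hello world.\nsecond\nthird line\n'; }

sample | grep -E 'first|second'        # extended regex: | is alternation
sample | grep -e 'first' -e 'second'   # two patterns, no alternation needed
```

Both commands print the "first hello world." and "second" lines and skip the third.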

run command taking two arguments with GNU parallel

I have a Perl program that takes two arguments: a dictionary file of
English words, one per line, and a file of concatenated words, also one per
line, something like this:
lovetoplayguitar
...
...
So normally program is used like:
perl ./splitwords.pl words-en.txt bigfile.txt
It prints results to stdout.
I am trying to put it through GNU parallel like this:
time parallel -n 2 -j8 -k perl ./splitwords.pl {1} {2} ::: words-en.txt bigfile.txt > splitted.txt
but it doesn't work that way. I have tried many combinations so far but was unable
to run it using parallel.
EDIT
Actually this seems to be working, but it is using only one core. Why?
This will chop bigfile into 1 MB chunks:
cat bigfile.txt | parallel --pipe --cat -k perl ./splitwords.pl words-en.txt {}
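If the Perl script only reads the file once from start to end, this will be faster, since --fifo passes a named pipe instead of writing each chunk to a temporary file: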
cat bigfile.txt | parallel --pipe --fifo -k perl ./splitwords.pl words-en.txt {}

How do I extract partial path from pwd in tcsh?

I want to basically implement an alias (using cd) which takes me to the 5th directory in my pwd. i.e.
If my pwd is /hm/foo/bar/dir1/dir2/dir3/dir4/dir5, I want my alias, say cdf to take me to /hm/foo/bar/dir1/dir2 .
So basically I am trying to figure how I strip a given path to a given number of levels of directories in tcsh.
Any pointers?
Edit:
Okay, I came this far to print out the dir I want to cd into using awk:
alias cdf 'echo `pwd` | awk -F '\''/'\'' '\''BEGIN{OFS="/";} {print $1,$2,$3,$4,$5,$6,$7;}'\'''
I am finding it difficult to do a cd over this as it already turned into a mess of escaped characters.
This should do the trick:
alias cdf source ~/.tcsh/cdf.tcsh
And in ~/.tcsh/cdf.tcsh:
cd "`pwd | cut -d/ -f1-6`"
We use the pwd tool to get the current path and pipe it to cut, which splits on the delimiter / (-d/) and keeps the first 6 fields (-f1-6). Because the path starts with /, the first field is empty, so this leaves the first 5 directory levels.
You can see cut as a very light awk; in many cases it's enough, and hugely simplifies things.
The problem with your alias is tcsh's quirky quoting rules. I'm not even going to try to fix that. We use source to evade all of that;
tcsh lacks functions, but you can sort of emulate them with this. Never said it was pretty.
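To check the field arithmetic outside tcsh, here is a small POSIX sh sketch (sh rather than tcsh, since tcsh has no functions). trim_path is a hypothetical helper; the cut pipeline is the same one used in the alias.

```shell
# Hypothetical helper: keep the first N directory levels of an absolute path.
# $1 = absolute path, $2 = number of levels to keep
trim_path() {
  # Field 1 is the empty string before the leading /, so N levels need N+1 fields
  printf '%s\n' "$1" | cut -d/ -f1-"$(( $2 + 1 ))"
}

trim_path /hm/foo/bar/dir1/dir2/dir3/dir4/dir5 5
# prints /hm/foo/bar/dir1/dir2
```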
@carpetsmoker's solution using cut is nice and simple. But since it awkwardly uses another file and source, here's a demonstration of how to avoid that. Using single quotes prevents premature evaluation.
% alias cdf 'cd "`pwd | cut -d/ -f1-6`"'
% alias cdf
cd "`pwd | cut -d/ -f1-6`"
Here's a simple demonstration of how single quotes can work with backticks:
% alias pwd2 'echo `pwd`'
% alias pwd2
echo `pwd`
% pwd2
/home/shx2
