run command taking two arguments with GNU parallel - gnu-parallel

I have a perl program that takes two arguments, dictionary file composed of
english words one per line, and file with concatenated words also one per
line, something like this:
lovetoplayguitar
...
...
So normally program is used like:
perl ./splitwords.pl words-en.txt bigfile.txt
It prints results to stdout.
I am trying to put it through GNU parallel like this:
time parallel -n 2 -j8 -k perl ./splitwords.pl {1} {2} ::: words-en.txt bigfile.txt > splitted.txt
but it doesn't work that way.. Tried many combinations so far but was unable
to run it using parallel.
EDIT
Actually this seems to be working, however it is using only one core..? Why..?

This will chop bigfile into 1 MB chunks:
cat bigfile.txt | parallel --pipe --cat -k perl ./splitwords.pl words-en.txt {}
If the perlscript only reads the file then this will be faster:
cat bigfile.txt | parallel --pipe --fifo -k perl ./splitwords.pl words-en.txt {}

Related

Extract specific number from command outout

I have the following issue.
In a script, I have to execute the hdparm command on /dev/xvda1 path.
From the command output, I have to extract the MB/sec values calculated.
So, for example, if executing the command I have this output:
/dev/xvda1:
Timing cached reads: 15900 MB in 1.99 seconds = 7986.93 MB/sec
Timing buffered disk reads: 478 MB in 3.00 seconds = 159.09 MB/sec
I have to extract 7986.93 and 159.09.
I tried:
grep -o -E '[0-9]+', but it returns to me all the six number in the output
grep -o -E '[0-9]', but it return to me only the first character of the six values.
grep -o -E '[0-9]+$', but the output is empty, I suppose because the number is not the last character set of outoput.
How can I achieve my purpose?
To get the last number, you can add a .* in front, that will match as much as possible, eating away all the other numbers. However, to exclude that part from the output, you need GNU grep or pcregrep or sed.
grep -Po '.* \K[0-9.]+'
Or
sed -En 's/.* ([0-9.]+).*/\1/p'
Consider using awk to just print the fields you want rather than matching on numbers. This will work using any awk in any shell on every Unix box:
$ hdparm whatever | awk 'NF>1{print $(NF-1)}'
7986.93
159.09

Output file much larger than input files after cat + grep

I have 18 csv files, all between 1mb and 14mb. The sum of all files is 64mb. I want to create a new csv file that contains a subset of those files-- only the lines featuring the pattern "Hello" (or "HELLO", or "hello" ...). Here's what I'm doing
cat *.csv | head -n 1 > new.csv # I want to create a header first
cat *.csv | grep -i "hello" >> new.csv
I'm running Debian on WSL. The output file is much, much larger than the original 64mb (I stopped the process after 1+ hour, and the file was 300+ GB).
How can a subset of a text file be larger than the original files? Does it have anything to do with WSL?
This is not an OS issue. When you redirect your output to new.csv, shell creates that file first, before the glob expression *.csv is evaluated. That means the expansion of *.csv would include new.csv as well. That seems like the root cause of the recursive grep issue you are facing.
You are reading all the files twice, which is not necessary. You can make your operation a lot simpler and efficient with a single awk command:
awk 'NR==1 {print} tolower($0) ~ /hello/ {print}' *.csv > csv.new
mv csv.new new.csv
since the output file is named csv.new it won't interfere with the glob *.csv
NR==1 picks up the first line (header) from the very first file
The awk command can be written more succinctly as:
awk 'NR==1 || tolower($0) ~ /hello/' *.csv > csv.new
You are using *.csv and redirecting the output to new.csv which falls under *.csv which is causing recursion in grep result. perhaps you can try,
grep -i hello *.csv --exclude="new.csv" >> new.csv

Executing bash script on multiple lines inside multiple files in parallel using GNU parallel

I want to use GNU parallel for the following problem:
I have a few files each with several lines of text. I would like to understand how I can run a script (code.sh) on each line of text of each file and for each file in parallel. I should be able to write out the output of the operation on each input file to an output file with a different extension.
Seems this is a case of multiple parallel commands running parallel over all files and then running parallel for all lines inside each file.
This is what I used:
ls mydata_* |
parallel -j+0 'cat {} | parallel -I ./explore-bash.sh > {.}.out'
I do not know how to do this using GNU parallel. Please help.
Your solution seems reasonable. You just need to remove -I:
ls mydata_* | parallel -j+0 'cat {} | parallel ./explore-bash.sh > {.}.out'
Depending on your setup this may be faster as it will only run n jobs, where as the solution above will run n*n jobs in parallel (n = number of cores):
ls mydata_* | parallel -j1 'cat {} | parallel ./explore-bash.sh > {.}.out'

How to find a pattern and surrounding content in a very large SINGLE line file?

I have a very large file 100Mb+ where all the content is on one line.
I wish to find a pattern in that file and a number of characters around that pattern.
For example I would like to call a command like the one below but where -A and -B are number of bytes not lines:
cat very_large_file | grep -A 100 -B 100 somepattern
So for a file containing content like this:
1234567890abcdefghijklmnopqrstuvwxyz
With a pattern of
890abc
and a before size of -B 3
and an after size of -A 3
I want it to return:
567890abcdef
Any tips would be great.
Many thanks.
You could try the -o option:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
and use a regular expression to match your pattern and the 3 preceding/following characters i.e.
grep -o -P ".{3}pattern.{3}" very_large_file
In the example you gave, it would be
echo "1234567890abcdefghijklmnopqrstuvwxyz" > tmp.txt
grep -o -P ".{3}890abc.{3}" tmp.txt
Another one with sed (you may need it on systems where GNU grep is not available):
sed -n '
s/.*\(...890abc...\).*/\1/p
' infile
Best way I can think of doing this is with a tiny Perl script.
#!/usr/bin/perl
$pattern = $ARGV[0];
$before = $ARGV[1];
$after = $ARGV[2];
while(<>) {
print $& if( /.{$before}$pattern.{$after}/ );
}
You would then execute it thusly:
cat very_large_file | ./myPerlScript.pl 890abc 3 3
EDIT: Dang, Paolo's solution is much easier. Oh well, viva la Perl!

Extracting n rows of text from a large csv file

I have a CSV file (foo.csv) with 200,000 rows. I need to break it into four files (foo1.csv, foo2.csv... etc.) with 50,000 rows each.
I already tried simple ctrl-v/-c using gui text editors, but the my computer slows to a halt.
What unix command(s) could I use to accomplish this task?
I don't have a terminal handy to try it out, but it should be just split -d -l 50000 foo.csv.
Hopefully the naming isn't terribly important because with the -d option, the output files will be named foo.csv00 .. foo.csv03. You can add the -a 1 option so that the suffixes are 0-3, but there's no simple way to get the suffix to be injected into the middle of the filename.
you should use head and tail.
head -n 50000 myfile > part1.csv
head -n 100000 myfile | tail -n 50000 > part2.csv
head -n 150000 myfile | tail -n 50000 > part3.csv
etc ...
Else, but with no control on file names, you can use unix command split.
sed -n 2000,4000p somefile.txt
will print from lines 2000 to 4000 to stdout.
split -l50000 foo.csv
You can use sed
I wrote this little shell script for this topic very similar at yours.
This shell script + awk works fine for me:
#!/bin/bash
awk -v initial_line=$1 -v end_line=$2 '{
if (NR >= initial_line && NR <= end_line)
print $0
}' $3
Used with this sample file (file.txt):
one
two
three
four
five
six
The command (it will extract from second to fourth line in the file):
edu#debian5:~$./script.sh 2 4 file.txt
Output of this command:
two
three
four
Of course, you can improve it, for example by testing that all argument values are the expected :-)

Resources