I have a directory with many files that all end in "_all.txt". I want to delete all lines containing either a "*" or a "-" from each of these files and write the results to files ending in "_all_cleaned.txt".
Right now I am using a for loop as follows:
for file in *_all.txt; do
    filename=$(echo "$file" | cut -d '_' -f 1)
    grep -vwE "(*|-)" "$file" > "${filename}_all_cleaned.txt"
done
I would like to do this in parallel using GNU parallel, so that the command is executed for each file on a different compute node instead of one node doing them all in a row. How can I incorporate GNU parallel into this loop?
If the files are in the login dir on the servers (i.e. the dir you get by ssh server1 pwd):
parallel -Sserver1,server2 'grep -vwE "(*|-)" {} > {=s/.txt$/_cleaned.txt=}' ::: *.txt
If it is the same dir relative to $HOME (e.g. /home/me/my/dir):
parallel --wd . -Sserver1,server2 'grep -vwE "(*|-)" {} > {=s/.txt$/_cleaned.txt=}' ::: *.txt
If it is /different/dir:
parallel --wd /different/dir -Sserver1,server2 'grep -vwE "(*|-)" {} > {=s/.txt$/_cleaned.txt=}' ::: *.txt
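To sanity-check the command on the local machine first, the same replacement string works without -S; a minimal local sketch:
parallel 'grep -vwE "(*|-)" {} > {=s/.txt$/_cleaned.txt=}' ::: *_all.txt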
I have 18 csv files, all between 1 MB and 14 MB. The sum of all files is 64 MB. I want to create a new csv file that contains a subset of those files: only the lines featuring the pattern "Hello" (or "HELLO", or "hello", ...). Here's what I'm doing:
cat *.csv | head -n 1 > new.csv # I want to create a header first
cat *.csv | grep -i "hello" >> new.csv
I'm running Debian on WSL. The output file is much, much larger than the original 64 MB (I stopped the process after an hour, at which point the file was 300+ GB).
How can a subset of a text file be larger than the original files? Does it have anything to do with WSL?
This is not an OS issue. By the time your second command runs, new.csv already exists (your first command created it), so the glob *.csv expands to include new.csv as well. cat is then still reading new.csv while grep keeps appending matches to it, so grep keeps re-reading its own output. That feedback loop is the root cause of the runaway file size you are seeing.
You are also reading all the files twice, which is not necessary. You can make your operation a lot simpler and more efficient with a single awk command:
awk 'NR==1 {print} tolower($0) ~ /hello/ {print}' *.csv > csv.new
mv csv.new new.csv
Since the output file is named csv.new, it won't interfere with the glob *.csv.
NR==1 picks up the first line (the header) from the very first file.
The awk command can be written more succinctly as:
awk 'NR==1 || tolower($0) ~ /hello/' *.csv > csv.new
You are using *.csv and redirecting the output to new.csv, which itself matches *.csv; grep therefore keeps reading its own output, which is what causes the recursion. Perhaps you can try:
grep -i hello *.csv --exclude="new.csv" >> new.csv
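If you also want the header line, a sketch combining the two approaches (assuming GNU grep, where --exclude also applies to files named on the command line, and that all input files share the same header):
set -- *.csv                                        # "$1" is now the first csv file
head -n 1 "$1" > new.csv                            # take the header from the first file only
grep -ih hello --exclude=new.csv *.csv >> new.csv   # -h drops the "file:" prefix grep adds with multiple files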
I want to use GNU parallel for the following problem:
I have a few files, each with several lines of text. I would like to understand how I can run a script (code.sh) on each line of each file, for all files in parallel. The output of the operation on each input file should be written to an output file with a different extension.
This seems to be a case of nesting parallel commands: running parallel over all the files, and then running parallel over all the lines inside each file. This is what I tried:
ls mydata_* |
parallel -j+0 'cat {} | parallel -I ./explore-bash.sh > {.}.out'
but I do not know how to do this properly using GNU parallel. Please help.
Your solution seems reasonable. You just need to remove -I:
ls mydata_* | parallel -j+0 'cat {} | parallel ./explore-bash.sh > {.}.out'
Depending on your setup this may be faster, as it will only run n jobs at a time, whereas the solution above will run n*n jobs in parallel (n = number of cores):
ls mydata_* | parallel -j1 'cat {} | parallel ./explore-bash.sh > {.}.out'
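For concreteness, here is a hypothetical stand-in for code.sh / explore-bash.sh; the inner parallel invokes it once per input line, passing the line as its argument:
#!/bin/bash
# hypothetical explore-bash.sh: receives one line of an input file as $1
# and writes its result to stdout, which parallel collects into the *.out file
line="$1"
echo "processed: ${line}"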
I have a perl program that takes two arguments: a dictionary file composed of English words, one per line, and a file with concatenated words, also one per line, something like this:
lovetoplayguitar
...
...
So normally the program is used like this:
perl ./splitwords.pl words-en.txt bigfile.txt
It prints results to stdout.
I am trying to put it through GNU parallel like this:
time parallel -n 2 -j8 -k perl ./splitwords.pl {1} {2} ::: words-en.txt bigfile.txt > splitted.txt
but it doesn't work that way. I have tried many combinations so far but was unable to run it using parallel.
EDIT
Actually, this seems to be working; however, it is using only one core. Why?
With -n 2 and the two arguments after :::, GNU parallel builds exactly one job (one pair of arguments), which is why only one core is used. To spread the work, split bigfile.txt itself. This will chop bigfile.txt into 1 MB chunks:
cat bigfile.txt | parallel --pipe --cat -k perl ./splitwords.pl words-en.txt {}
If the perl script only reads the file, then this will be faster:
cat bigfile.txt | parallel --pipe --fifo -k perl ./splitwords.pl words-en.txt {}
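The chunk size is tunable: GNU parallel's --block option changes it if 1 MB blocks produce too many small jobs. A sketch, with the file names from the question and output collected in order thanks to -k:
cat bigfile.txt | parallel --pipe --block 10M --cat -k perl ./splitwords.pl words-en.txt {} > splitted.txt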
I have created a small bash script which finds the unused images (*.jpg and *.png) in an iOS project. Here is the code:
# Set IFS to newline so file and folder names containing spaces are not split
IFS=$'\n'
# Find all files with extension *.png or *.jpg
for i in $(find . -name "*.png" -o -name "*.jpg"); do
    # Get the base name of the found file
    file=$(basename -s .jpg "$i" | xargs basename -s .png)
    # Replace "\n" with the original space so that ack can search for the exact file name in all files
    filenameWithSpace="${file//$'\n'/ }"
    # Merge the basename and the extension for the ack command
    # "</dev/null" is added to the ack command so that it works on the Jenkins setup
    extension="${i##*.}"
    filename="$filenameWithSpace.$extension"
    result=$(ack -i --ignore-file=match:/.\.pbxproj/ "$filename" </dev/null)
    # result=$(ack -i "$filename" </dev/null)
    # If result is empty then ack did not find any occurrence of the filename in any file
    if [ -z "$result" ]; then
        echo "$i"
    fi
done
Should ".pbxproj" be included in my search? I ask this question since including and excluding ".pbxproj" provides me different results.
Thanks in advance.
Don't include the .pbxproj. It will have the image references.
FYI: http://jeffhodnett.github.io/Unused/ is a tool to find the unused images.
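That also explains the differing results: the project file lists every image added to Xcode, whether or not it is actually used, so a match there says nothing about real usage. To inspect where a given image is referenced, a sketch with a hypothetical image name:
grep -rl "myimage" . --include="*.pbxproj"    # hits here only mean the image is listed in the project
grep -rl "myimage" . --exclude="*.pbxproj"    # hits here mean it is referenced from code or resources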
I am in the process of creating a script that checksums the files listed in lsof output. I would like to checksum specific files and ignore directories from that output, but am at a loss to do so effectively. For example (I'm using FreeBSD, btw):
lsof | awk '/\//{print $9}' | sort -u | head -n 5
prints:
/
/bin/sleep
/dev/bpf
What I'd like to do is: from that output, ignore any directories and perform an md5 on the files (not directories).
Any pointers?
Give the following perl command a try:
lsof | perl -MDigest::MD5 -ane '
$f = $F[ $#F ];
-f $f and open( my $fh, q|<|, $f )
  and printf qq|%s %s\n|, $f, Digest::MD5->new->addfile( $fh )->hexdigest
'
It filters the lsof output down to plain files (the -f file test) and checksums each file's contents. Take a look at perlfunc if you want to change it to include different kinds of files.
It outputs each file and its md5 separated by a space character. Example output on my system:
/usr/lib/libm-2.17.so a2d3b2de9a1f59fb99427714fefb49ca
/usr/lib/libdl-2.17.so d74d8ac16c2d13128964353d4be7061a
/usr/lib/libnsl-2.17.so 34b6909ec60c337c21b044642b9baa3d
/usr/lib/ld-2.17.so 3d0e7b5b5c4e59c5c4b6a858cc79fcf1
/usr/sbin/lsof b9b8fbc8f296e47969713f6369d97c0d
/usr/lib/locale/locale-archive 3ea56273193198a718b9a5de33d553db
/usr/lib/libc-2.17.so ba51eeb4025b7f5d7f400f1968f4b5f9
/usr/lib/ld-2.17.so 3d0e7b5b5c4e59c5c4b6a858cc79fcf1
...
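Alternatively, on FreeBSD (which the question mentions) you could keep the original awk pipeline and use the base system's md5(1); a sketch, assuming no file names contain whitespace:
lsof | awk '/\//{print $9}' | sort -u |
while read -r f; do
    [ -f "$f" ] && md5 -r "$f"    # -f skips directories; -r prints "hash filename", like md5sum
done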