Intersect large file: memory requirement issue on cluster - grep

I would like to intersect a series of small files with a relatively large file. Following the many topics on stackoverflow and after some tests I choose to use this function which was the fastest on my data:
for file1 in ./myfiles*
do
# Do other things to create file1 and file2
# Then
grep -f $file1 file2.txt | awk -F '\t' '{print $1}' > myResults_$file1.txt
done
where file1 is a single-column file of 50 to 100000 lines and file2 is two-columns tab-delimited ~1 million lines file.
Ex:
file1
A
B
C
file2
A 1
B 2
C 3
I run the command on a cluster with 1 thread and 48Gb RAM. However I soon as it reaches a file1 bigger than 10000 lines it crashes with the following error:
slurmstepd: Job 3312063 exceeded memory limit (50359784 > 50331648), being killed
Can someone explain me why this command is storing such much in memory and how can I solve this issue?

Related

Output file much larger than input files after cat + grep

I have 18 csv files, all between 1mb and 14mb. The sum of all files is 64mb. I want to create a new csv file that contains a subset of those files-- only the lines featuring the pattern "Hello" (or "HELLO", or "hello" ...). Here's what I'm doing
cat *.csv | head -n 1 > new.csv # I want to create a header first
cat *.csv | grep -i "hello" >> new.csv
I'm running Debian on WSL. The output file is much, much larger than the original 64mb (I stopped the process after 1+ hour, and the file was 300+ GB).
How can a subset of a text file be larger than the original files? Does it have anything to do with WSL?
This is not an OS issue. When you redirect your output to new.csv, shell creates that file first, before the glob expression *.csv is evaluated. That means the expansion of *.csv would include new.csv as well. That seems like the root cause of the recursive grep issue you are facing.
You are reading all the files twice, which is not necessary. You can make your operation a lot simpler and efficient with a single awk command:
awk 'NR==1 {print} tolower($0) ~ /hello/ {print}' *.csv > csv.new
mv csv.new new.csv
since the output file is named csv.new it won't interfere with the glob *.csv
NR==1 picks up the first line (header) from the very first file
The awk command can be written more succinctly as:
awk 'NR==1 || tolower($0) ~ /hello/' *.csv > csv.new
You are using *.csv and redirecting the output to new.csv which falls under *.csv which is causing recursion in grep result. perhaps you can try,
grep -i hello *.csv --exclude="new.csv" >> new.csv

grep every fourth line in .fastq

I am working on a linux machine using bash.
My question is, how can I skip lines in the query file using grep?
I am working with a large ~16Gb .fastq file named example.fastq which has the following format.
example.fastq
#SRR6750041.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
#SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
#SRR6750041.3 3/1
ATCCANAATGATGTGTTGCTCTGGAGGTACAGAGATAACGTCAGCTGGAATAGTTTCCCCTCACAG
+
AAAAA#EE6E6EEEEEE6EEEEAEEEEEEEEEEE//EAEEEEEAAEAEEEAE/EAEEA6/EEA<E/
#SRR6750041.4 4/1
ACACCNAATGCTCTGGCCTCTCAAGCACGTGGATTATGCCAGAGAGGCCAGAGCATTCTTCGTACA
+
/AAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE/E/<//AEA/EA//E//
#SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E
I need to extract lines containing a strings of interest #SRR6750041.2 #SRR6750041.5 stored in a bash array called IDarray as well as the 3 lines following each match. The following grep command allows me to do this
for ID in "${IDarray[#]}";
do
grep -F -A 3 "$ID " example.fastq
done
This correctly output the following.
#SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
#SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E
I am looking for ways to speed this process up... one way would be to reduce the number of lines searched by grep by restricting the search to lines beginning with # or skipping lines that can not possibly contain the match #SRR6750041.1 such as lines 2,3,4 and 6,7,8 etc. Is there a way to do this using grep? Alternative methods are also welcome!
Here are some thoughts with examples. For test purposes I created test case as mini version of Yours example_mini.fastq is 145 MB big and IDarray has 999 elements (interests).
Your version has this performance (more than 2 mins in user space):
$ time for i in "${arr[#]}"; do grep -A 3 "${i}" example_mini.fastq; done 1> out.txt
real 3m16.310s
user 2m9.645s
sys 0m53.092s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
First upgrade of grep to end grep after first match -m 1, I am assuming that interest ID is unique. This narrow down by 50% of complexity and takes approx 1 min in user space:
$ time for i in "${arr[#]}"; do grep -m 1 -A 3 "${i}" example_mini.fastq; done 1> out.txt
real 1m19.325s
user 0m55.844s
sys 0m21.260s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
These solutions are linearly dependent on number of elements. Call n times grep on huge file.
Now let's implement in AWK only for one run, I am exporting IDarray into input file so I can process in one run. I am loading big file into associative array per ID and then looping 1x through You array of IDs to search. This is generic scenario where You can define regexp and number of lines after to print. This has complexity with only one run through file + N comparisons. This is 2000% speed up:
$ for i in "${arr[#]}"; do echo $i; done > IDarray.txt
$ time awk '
(FNR==NR) && (linesafter-- > 0) { arr[interest]=arr[interest] RS $0; next; }
(FNR==NR) && /^#/ { interest=$1; arr[interest]=$0; linesafter=3; next; }
(FNR!=NR) && arr[$1] { print(arr[$1]); }
' example_mini.fastq IDarray.txt 1> out.txt
real 0m7.044s
user 0m6.628s
sys 0m0.307s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
As in Your title If You really can confirm that every fourth line is id of interest and three lines after are about to be printed. You can simplify into this and speed up by another 20%:
$ for i in "${arr[#]}"; do echo $i; done > IDarray.txt
$ time awk '
(FNR==NR) && (FNR%4==1) { interest=$1; arr[interest]=$0; next; }
(FNR==NR) { arr[interest]=arr[interest] RS $0; next; }
(FNR!=NR) && arr[$1] { print(arr[$1]); }
' example_mini.fastq IDarray.txt 1> out.txt
real 0m5.944s
user 0m5.593s
sys 0m0.242s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
On 1.5 GB file with 999 elements to search time is:
real 1m4.333s
user 0m59.491s
sys 0m3.460s
So per my predictions on my machine Your 15 GB example with 10k elements would take approx 16 minutes in user space to process.

run command taking two arguments with GNU parallel

I have a perl program that takes two arguments, dictionary file composed of
english words one per line, and file with concatenated words also one per
line, something like this:
lovetoplayguitar
...
...
So normally program is used like:
perl ./splitwords.pl words-en.txt bigfile.txt
It prints results to stdout.
I am trying to put it through GNU parallel like this:
time parallel -n 2 -j8 -k perl ./splitwords.pl {1} {2} ::: words-en.txt bigfile.txt > splitted.txt
but it doesn't work that way.. Tried many combinations so far but was unable
to run it using parallel.
EDIT
Actually this seems to be working, however it is using only one core..? Why..?
This will chop bigfile into 1 MB chunks:
cat bigfile.txt | parallel --pipe --cat -k perl ./splitwords.pl words-en.txt {}
If the perlscript only reads the file then this will be faster:
cat bigfile.txt | parallel --pipe --fifo -k perl ./splitwords.pl words-en.txt {}

tar pre-run to evaluate expected size or amount of files

The problem:
I have a back-end process that at some point he collect and build a big tar file.
This tar receive few directories and an exclude files.
the process can take up to few minutes and i want to report in my front-end process (GUI) about the progress of the taring process (This is a big issue for a user that press download button and it seems like nothing is happening...).
i know i can use -v -R in the tar command and count files and size progress but i am looking for some kind of tar pre-run mode / dry run to help me evaluate either the expected number of files or the expected tar size.
the command I am using: tar -jcf 'FILE.tgz' 'exclude_files' 'include_dirs_and_files'
10x for everyone who is willing to assist.
You can pipe the output to the wc tool instead of actually making a file.
With file listing (verbose):
[git#server]$ tar czvf - ./test-dir | wc -c
./test-dir/
./test-dir/test.pdf
./test-dir/test2.pdf
2734080
Without:
[git#server]$ tar czf - ./test-dir | wc -c
2734080
Why don't you run a
DIRS=("./test-dir" "./other-dir-to-test")
find ${DIRS[#]} -type f | wc -l
beforehand. This gets all the files (-type f) one per line and counts the number of files. DIRS is an array in bash, so you can store the folders in a variable
If you want to know the size of all the stored files, you can use du
DIRS=("./test-dir" "./other-dir-to-test")
du -c -d 0 ${DIRS[#]} | tail -1 | awk -F ' ' '{print $1}'
This prints the disk usage with du, calculates a grand total (-c flag), gets the last line (example 4378921 total), and uses just the first column with awk

Extracting n rows of text from a large csv file

I have a CSV file (foo.csv) with 200,000 rows. I need to break it into four files (foo1.csv, foo2.csv... etc.) with 50,000 rows each.
I already tried simple ctrl-v/-c using gui text editors, but the my computer slows to a halt.
What unix command(s) could I use to accomplish this task?
I don't have a terminal handy to try it out, but it should be just split -d -l 50000 foo.csv.
Hopefully the naming isn't terribly important because with the -d option, the output files will be named foo.csv00 .. foo.csv03. You can add the -a 1 option so that the suffixes are 0-3, but there's no simple way to get the suffix to be injected into the middle of the filename.
you should use head and tail.
head -n 50000 myfile > part1.csv
head -n 100000 myfile | tail -n 50000 > part2.csv
head -n 150000 myfile | tail -n 50000 > part3.csv
etc ...
Else, but with no control on file names, you can use unix command split.
sed -n 2000,4000p somefile.txt
will print from lines 2000 to 4000 to stdout.
split -l50000 foo.csv
You can use sed
I wrote this little shell script for this topic very similar at yours.
This shell script + awk works fine for me:
#!/bin/bash
awk -v initial_line=$1 -v end_line=$2 '{
if (NR >= initial_line && NR <= end_line)
print $0
}' $3
Used with this sample file (file.txt):
one
two
three
four
five
six
The command (it will extract from second to fourth line in the file):
edu#debian5:~$./script.sh 2 4 file.txt
Output of this command:
two
three
four
Of course, you can improve it, for example by testing that all argument values are the expected :-)

Resources