Matching pattern across multiple files: perl or grep? - grep

I have a pattern.txt file which looks like this:
2gqt+FAD+A+601 2i0z+FAD+A+501
1n1e+NDE+A+400 2qzl+IXS+A+449
1llf+F23+A+800 1y0g+8PP+A+320
1ewf+PC1+A+577 2a94+AP0+A+336
2ydx+TXP+E+1339 3g8i+RO7+A+1
1gvh+HEM+A+1398 1v9y+HEM+A+1140
2i0z+FAD+A+501 3m2r+F43+A+1
1h6d+NDP+A+500 3rt4+LP5+C+501
1w07+FAD+A+1660 2pgn+FAD+A+612
2qd1+PP9+A+701 3gsi+FAD+A+902
There is another file called data (approx 8gb in size) which has lines like this.
2gqt+FAD+A+601 2i0z+FAD+A+501 0.874585 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.145278 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.784512 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.362542 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.251452 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.784521 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.369856 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.925478 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.584785 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.874526 0.125453
However, the data file is not as simple as the sample above suggests. It is so large because each distinct first-column string begins approx 18000 lines, i.e. 18000 lines beginning with 2gqt+FAD+A+601, followed by 18000 lines beginning with 1n1e+NDE+A+400, and so on. Only one line in each such group matches a pair given in pattern.txt.
I am trying to match the lines in pattern.txt with data and want to print out:
2gqt+FAD+A+601 2i0z+FAD+A+501 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.125453
As of now I am using something in perl, like this:
use warnings;
open AS, "combi_output_2_fixed.txt";
open AQ, "NAMES.txt";
@arr=<AS>;
@arr1=<AQ>;
foreach $line(@arr)
{
    @split=split(' ',$line);
    foreach $line1(@arr1)
    {
        @split1=split(' ',$line1);
        if($split[0] eq $split1[0] && $split[1] eq $split1[1])
        { print $split1[0],"\t",$split1[1],"\t",$split1[3],"\n";}
    }
}
close AQ;
close AS;
Doing this uses up all the memory and ends with an "Out of memory" error.
I am aware that this can be done using grep, but I do not know how to do it.
Can anyone please let me know how I can do this using grep -F without using up all the memory?
Thanks.

Does pattern.txt fit in memory?
If it does, you could use a command like grep -F -f pattern.txt data.txt to match lines in data.txt against the patterns. You would get the full line though, and extra processing would be required to get only the second column of numbers.
Or you could fix the Perl script. You run out of memory because you read the entire 8 GB file into memory, when you could be processing it line by line as grep does. For the 8 GB file you should use code like this:
open(my $fh, "<", "data.txt") or die "Cannot open data.txt: $!";
while (my $line = <$fh>) {
    # check $line against list of patterns ...
}
close $fh;
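A minimal sketch of that line-by-line idea, written with awk instead of Perl (a swap made here for brevity, not the answerer's code). It loads the pairs from pattern.txt into an array and then streams the data file once, so only the small file is held in memory. The function name match_pairs and the sample files are made up for illustration, and the sketch assumes that pattern.txt is non-empty, fits in memory, and that columns are separated by whitespace.

```shell
# Hypothetical helper: print columns 1, 2 and 4 of every data line whose
# first two columns appear together as a pair in the pattern file.
match_pairs() {
    # $1 = pattern file (small, held in memory), $2 = data file (streamed)
    awk 'NR == FNR { want[$1 " " $2]; next }        # first file: remember each pair
         ($1 " " $2) in want { print $1, $2, $4 }   # second file: print matching lines
        ' "$1" "$2"
}

# Tiny made-up sample, only to show the call.
tmp=$(mktemp -d)
printf '%s\n' '2gqt+FAD+A+601 2i0z+FAD+A+501' > "$tmp/pattern.txt"
printf '%s\n' \
    '2gqt+FAD+A+601 2i0z+FAD+A+501 0.874585 0.785412' \
    '2gqt+FAD+A+601 1n1e+NDE+A+400 0.111111 0.222222' > "$tmp/data"

match_pairs "$tmp/pattern.txt" "$tmp/data"
```

Each data line costs one array lookup, so the run time grows with the size of the data file rather than with the product of the two file sizes, as it does in the nested loops of the original script.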

Try this:
grep "`more pattern.txt`" data.txt | awk -F' ' '{ print $1 " " $2 " " $4}'

Related

How to grep multiple lines using a .txt vocab, matching only first word as variable?

I'm trying to reduce a .sm file (file1, around 10 GB) by filtering it using a fairly long set of words (around 180.108 items) listed in a text file, file2.
File1 is structured as follows:
word <http://internet.address.com> 1
i.e. one word followed by a blank space, an internet address, and a number.
File2 is a simple .txt file, a list of words, one on each line.
My aim is to create a third file File3 containing only those lines in file1 whose first word matches with the word-list of file2, and disregard the rest.
My attempt is the following:
grep -w -F -f file2.txt file1.sm > file3.sm
I've also attempted something along this line:
gawk 'FNR==NR {a[$1]; next } !($2 in a)' file2.txt file1.sm > file3.sm
but with no success. I understand /^ and \b might play a part here, but I don't know how to fit them in the syntax. I've looked around extensively but no solution seems to fit.
My problem is that here grep reads the entire file1's line, and it can happen that the matching word lies in the webpage address, which I'm not interested in finding out.
sed 's/^/^/' file2.txt | grep -f - file1.sm
join is the best tool for this, not grep/awk:
join -t' ' <(sort file1.sm) <(sort file2.txt) >file3.sm
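For comparison, here is a hedged awk sketch that tests only the first field, which is what the question asks for. The gawk attempt above checks $2 and negates the test, so it keeps the wrong lines. The function name filter_by_first_word and the sample data are invented for illustration; the sketch assumes the word list is non-empty, fits in memory, and that fields are separated by whitespace.

```shell
# Hypothetical helper: keep only the lines of the big file whose first
# whitespace-separated field is listed in the word file.
filter_by_first_word() {
    # $1 = word list (one word per line), $2 = file to filter
    awk 'NR == FNR { keep[$1]; next }   # first file: remember each word
         $1 in keep                     # second file: print line if field 1 is listed
        ' "$1" "$2"
}

# Made-up sample: "apple" also occurs inside the second URL, which must not match.
tmp=$(mktemp -d)
printf '%s\n' 'apple' > "$tmp/file2.txt"
printf '%s\n' \
    'apple <http://internet.address.com> 1' \
    'pear <http://apple.example.com> 2' > "$tmp/file1.sm"

filter_by_first_word "$tmp/file2.txt" "$tmp/file1.sm"
```

Unlike join, this needs neither file to be sorted and keeps the original line order of the large file.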

How can I find files that match a two-line pattern using grep?

I created a test file with the following:
<cert>
</cert>
I'm now trying to find this with grep and the following command, but it takes forever to run.
How can I search quickly for files that contain adjacent lines like these?
tr -d '\n' | grep '<cert></cert>' test.test
So, from the comments, you're trying to get the names of files that contain an empty <cert>..</cert> element. You're using several tools wrong. As @iiSeymour pointed out, tr only reads from standard input, so if you want to use it on lots of files, you'll need a loop. grep prints matching lines, not filenames, though you could use grep -l to see the filenames instead.
But you're only joining lines because grep works one line at a time; so let's use a better tool. Here's how to search with awk:
awk '/<cert>/ { started=1; }
/<\/cert>/ { if (started) { print FILENAME; nextfile;} }
!/<cert>/ { started = 0; }' file1 file2 *.txt
It checks each line and keeps track of whether the previous line matched <cert>. (!/pattern/ sets the flag back to zero on lines not matching /pattern/.) Call it with all your files (or with a wildcard like *.txt).
And a friendly suggestion: Next time, try each command separately (you've been stuck on this for hours and you still don't know what grep does?). And have a quick look at the manual for the tools you want to use. Unix tools are usually too complex for simple trial and error.

How to extract certain part of line that's between quotes

For example if I have file.txt with the following
object = {
'name' : 'namestring',
'type' : 'type',
'real' : 'yes',
'version' : '2.0',
}
and I want to extract just the version so the output is 2.0 how would I go about doing this?
I would suggest that grep is probably the wrong tool for this. Nevertheless, it is possible, using grep twice.
grep 'version' input.txt | grep -Eo '[0-9.]+'
The first grep isolates the line you're interested in, and the second one prints only the characters of the line that match the regex, in this case numbers and periods. For your input data, this should work.
However, this solution is weak in a few areas. It doesn't handle cases where multiple version lines exist, and it depends heavily on the layout of the file (I suspect your file would still be syntactically valid if all the lines were joined into a single long line). It also uses a pipe, and in general, if there's a way to achieve something with a pipe and a way without one, you choose the latter.
One compromise might be to use awk, assuming you're always going to have things split by line:
awk '/version/ { gsub(/[^0-9.]/,"",$NF); print $NF; }' input.txt
This is pretty much identical in functionality to the dual grep solution above.
If you wanted to process multiple variables within that section of file, you might do something like the following with awk:
BEGIN {
    FS=":";
}
/{/ {
    inside=1;
    next;
}
/}/ {
    inside=0;
    print a["version"];
    # do things with other variables too
    #for(i in a) { printf("i=%s / a=%s\n", i, a[i]); } # for example
    delete a;
}
inside {
    sub(/^ *'/,"",$1); sub(/' *$/,"",$1); # strip whitespace and quotes
    sub(/^ *'/,"",$2); sub(/',$/,"",$2); # strip whitespace and quotes
    a[$1]=$2;
}
A better solution would be to use a tool that actually understands the file format you're using.
A simple and clean solution using grep and cut
grep version file.txt | cut -d \' -f4
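If a single process without a pipe is preferred, a sed sketch along the following lines might do. It assumes the layout shown in the question (a 'version' key at the start of a line, with a single-quoted value); the function name extract_version and the temporary file are only illustrative.

```shell
# Hypothetical helper: print the quoted value of the 'version' key.
extract_version() {
    # -n suppresses sed's default output; the trailing p prints only the
    # lines on which the substitution succeeded, reduced to the captured value.
    sed -n "s/^[[:space:]]*'version'[[:space:]]*:[[:space:]]*'\([^']*\)'.*/\1/p" "$1"
}

# Recreate the sample from the question in a temporary file.
tmp=$(mktemp -d)
printf '%s\n' \
    'object = {' \
    "'name' : 'namestring'," \
    "'type' : 'type'," \
    "'real' : 'yes'," \
    "'version' : '2.0'," \
    '}' > "$tmp/file.txt"

extract_version "$tmp/file.txt"
```

Because the expression matches the key name as well as the value, a line that merely contains the word version somewhere else is ignored.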

grep from beginning of found word to end of word

I am trying to grep the output of a command that outputs unknown text and a directory per line. Below is an example of what I mean:
.MHuj.5.. /var/log/messages
The text and directory may be different from time to time or system to system. All I want to do though is be able to grep the directory out and send it to a variable.
I have looked around but cannot figure out how to grep to the end of a word. I know I can start the search phrase looking for a "/", but I don't know how to tell grep to stop at the end of the word, or whether it will consider the next "/" a new word. The directories listed could change, so I can't assume the same number of directories will be listed each time. In some cases, there will be multiple lines listed and each will have a directory listed in its output. Thanks for any help you can provide!
If your directory paths do not contain spaces, then you can do:
$ echo '.MHuj.5.. /var/log/messages' | awk '{print $NF}'
/var/log/messages
It's not clear from a single example whether we can generalize that e.g. the first occurrence of a slash marks the beginning of the data you want to extract. If that holds, try
grep -o '/.*' file
To fetch everything after the last space, try
grep -o '[^ ]*$' file
For more advanced pattern matching and extraction, maybe look at sed, or Awk or Perl or Python.
Your line can be described as:
^\S+\s+(\S+)$
That's assuming whitespace is your delimiter between the random text and the directory. It simply separates the whitespace from the non-whitespace and captures the second part.
Or you might want to look into the word boundary anchor: \b.
I know you said to use grep, but I can't help mentioning that this is trivially done using awk:
awk '{ print $NF }' input.txt
This assumes that whitespace is the delimiter and that the path does not contain any whitespace.
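Since the question also asks how to get the result into a variable, here is a small sketch using command substitution. The function sample_command is a hypothetical stand-in for the real command and simply prints the example line from the question; the same assumption about paths without whitespace applies.

```shell
# Hypothetical stand-in for the real command; it prints the sample line
# from the question.
sample_command() {
    printf '%s\n' '.MHuj.5.. /var/log/messages'
}

# Command substitution stores the last whitespace-separated field of the
# output in the variable.
path=$(sample_command | awk '{ print $NF }')
printf 'path=%s\n' "$path"
```

If the command prints several lines, the variable ends up holding one path per line, which can then be processed with a `while read` loop.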

Recursively grep results and pipe back

I need to find some matching conditions from a file and recursively find the next conditions in previously matched files. I have something like this:
input.txt
123
22
33
The terms above need to be found in a set of files. The challenge is that if 123 is found in, say, 10 files, then 22 should be searched for in those 10 files only, and so on.
The files are named like f1, f2, f3, f4, ..., f1200,
so it is as if I need to run grep -w "123" f* | grep -w "22" | .....
It's not possible to list them manually, so is there an easier way?
You can solve this with an awk script. I've encountered a similar problem, and an approach like this works as long as the file names contain no whitespace:
awk 'NR==1 { printf("grep -lw %s f*", $1); next } { printf(" | xargs grep -lw %s", $1) } END { print "" }' input.txt | sh
What does it do?
It reads input.txt line by line.
For the first record it prints grep -lw <term> f*, which lists the files that contain the first term.
For every later record it appends | xargs grep -lw <term> (note the pipe), so each term is searched for only in the files that survived the previous step.
The finished command line is then sent to the shell for execution.
Perhaps taking a meta-programming viewpoint would help. Have grep output a series of grep commands. Or write a little Perl program. Maybe Ruby, if the mood suits.
You can use grep -lw to print only the names of the files that matched (note that grep stops reading a file after its first match).
You capture the list of file names and use it as the file list for the next iteration of a loop.
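The loop described above could be sketched roughly as follows. The function name narrow_files and the sample files are invented for illustration, and the sketch assumes that file names contain no whitespace or glob characters (true for names like f1 ... f1200) and that each term is a fixed string, hence the -F.

```shell
# Hypothetical helper: for each term in the terms file, keep only those
# candidate files that contain the term as a whole word.
narrow_files() {
    terms_file=$1
    shift
    list=$*                              # remaining candidate files
    while IFS= read -r term; do
        [ -n "$term" ] || continue       # skip blank lines
        [ -n "$list" ] || break          # nothing left to search
        # $list is deliberately unquoted so that it splits into file names.
        list=$(grep -lwF -- "$term" $list)
    done < "$terms_file"
    if [ -n "$list" ]; then
        printf '%s\n' $list              # one surviving file name per line
    fi
}

# Made-up sample: only f1 contains all three terms.
tmp=$(mktemp -d)
printf '%s\n' 123 22 33 > "$tmp/input.txt"
printf '%s\n' 123 22 33 > "$tmp/f1"
printf '%s\n' 123 22    > "$tmp/f2"
printf '%s\n' 123       > "$tmp/f3"

(cd "$tmp" && narrow_files input.txt f*)
```

Each pass searches only the files that survived the previous pass, so the later terms are checked against a shrinking list.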

Resources