grep or awk: Find similar substrings and output in two fields

I have two files. I want to find the strings of file1 as substrings in file2,
and when a match occurs I want the output to contain both strings in two fields separated by ':',
so that each line of matching output reads "file1string:file2string".
Example
File 1
464697uifs4h44yy
48oo895i6iu8gg11
j4h5y7yu4g655h44
jyyuthvcxx22zerc
File 2
j4h5y7yu4g655h447ijj651cvpijgtkk
strxzdokui464697uifs4rdffgjfudjh
kjhbdfgfx1154m87gjgbgcsqubyu6u3k
gfhgysj4h5y7yu4g655h44jkhgfhhfhu
Desired Output
j4h5y7yu4g655h44:j4h5y7yu4g655h447ijj651cvpijgtkk
j4h5y7yu4g655h44:gfhgysj4h5y7yu4g655h44jkhgfhhfhu
I used:
fgrep -f file1 file2 > output
but this gives only the matching lines from file2.
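One way to get both fields is awk (a minimal sketch, assuming plain fixed-string substring matching, like fgrep does): read the file1 strings into an array, then test every file2 line with index():

awk 'NR==FNR { strings[$0]; next }    # first pass: collect file1 strings
     {                                # second pass: each file2 line
         for (s in strings)
             if (index($0, s))        # fixed-string substring test
                 print s ":" $0
     }' file1 file2

Because index() does no regex interpretation, metacharacters in file1 are matched literally.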

Related

Find matching words

I have a corpus file and a rules file. I am trying to find the matching words, i.e. the corpus words in which a word from the rules file appears.
# cat corpus.txt
this is a paragraph number one
second line
third line
# cat rule.txt
a
b
c
This returns 2 lines
# grep -F -f rule.txt corpus.txt
this is a paragraph number one
second line
But I am expecting 4 words like this...
a
paragraph
number
second
Trying to achieve these results using grep or awk.
Assuming words are separated by whitespace:
awk '{print "\\S*" $1 "\\S*"}' rule.txt | grep -m 4 -o -f - corpus.txt
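For comparison, a pure-awk sketch of the same idea (hypothetical, not from the original answer): load the rule words, then print every corpus word that contains one of them:

awk 'NR==FNR { rules[$0]; next }      # first pass: collect rule words
     {
         for (i = 1; i <= NF; i++)    # scan each whitespace-separated word
             for (r in rules)
                 if (index($i, r)) {  # plain substring test, no regex
                     print $i
                     break            # print each corpus word at most once
                 }
     }' rule.txt corpus.txt

On the sample input this prints a, paragraph, number and second, one word per line.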

Match Lines From Two Lists With Wildcards In One List

I have two lists, one of which contains wildcards (in this case represented by *). I would like to compare the two lists and create an output of those that match, with each wildcard * representing a single character.
For example:
File 1
123456|Jane|Johnson|Pharmacist|janejohnson#gmail.com
09876579|Frank|Roberts|Butcher|frankie1#hotmail.com
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
File 2
1***6|Jane|Johnson|Pharmacist|janejohnson#gmail.com
09876579|Frank|Roberts|Butcher|f**1#hotmail.com
092362936|Joe|Jordan|J*****|joe#joesjoinery.com
928|Bob|Horton|Farmer|b*****n#f*********.co.uk
Output
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
Explanation
The first two lines are not considered matches because the number of *s does not equal the number of characters they replace in the first file. The latter two do match, so they are added to the output.
I have tried to reason out ways to do this in awk and with join, but I don't know any way to even start trying to achieve this. Any help would be greatly appreciated.
$ cat tst.awk
NR==FNR {
    file1[$0]
    next
}
{
    # Make every non-* char literal (see https://stackoverflow.com/a/29613573/1745001):
    gsub(/[^^*]/,"[&]")   # Convert every char X to [X] except ^ and *
    gsub(/\^/,"\\^")      # Convert every ^ to \^
    # Convert every * to .:
    gsub(/\*/,".")
    # Add line start/end anchors
    $0 = "^" $0 "$"
    # See if the current file2 line matches any line from file1
    # and if so print that line from file1:
    for ( line in file1 ) {
        if ( line ~ $0 ) {
            print line
        }
    }
}
$ awk -f tst.awk file1 file2
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
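To see what those gsub() calls build, here is a hypothetical trace of the conversion applied to a small wildcard string (the input is made up for illustration):

$ echo 'J*****|x.y' | awk '{ gsub(/[^^*]/,"[&]"); gsub(/\^/,"\\^"); gsub(/\*/,"."); print "^" $0 "$" }'
^[J].....[|][x][.][y]$

Every literal character is bracketed, so | and . lose their regex meaning; each * becomes a single .; and the anchors force a whole-line match, which is why a run of *s must match exactly that many characters.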
sed 's/\./\\./g; s/\*/./g' file2 | xargs -I{} grep {} file1
Explanation:
I'd take advantage of regular expression matching. To do that, we need to turn every asterisk * into a dot ., which in a regular expression matches any single character. As a side effect of enabling regular expressions, we must escape all special characters, particularly the ., so that they are taken literally: in a regular expression, \. represents a literal dot (as opposed to any character).
The first step is to perform these substitutions with sed; the second is to pass every resulting line to grep as a search pattern and search file1 for it. The glue that allows us to do this is xargs, where {} is a placeholder representing a single line from the output of the sed command.
Note:
This is not a general, safe solution you can simply copy and paste: you should watch out for any characters in your asterisk-containing file that grep treats as special in regular expressions.
Update:
jhnc extends the escaping to each of the characters .\^$[ (thus accounting for almost all sorts of email addresses) and avoids xargs by using -f - to pass the output of sed to grep as its list of search expressions:
sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1
This solution is both more general and more efficient.
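As a quick illustration of what that sed produces (tracing one line from the example file2 above):

$ echo 'b*****n#f*********.co.uk' | sed 's/[.\\^$[]/\\&/g; s/[*]/./g'
b.....n#f.........\.co\.uk

The dots from the original address are escaped to \. while each * becomes an unescaped ., i.e. "match any one character".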

How to grep multiple lines using a .txt vocab, matching only first word as variable?

I'm trying to reduce a .sm file (file1, around 10 GB) by filtering it with a fairly long list of words (around 180,108 items) in a text file, file2.
File1 is structured as follows:
word <http://internet.address.com> 1
i.e. one word followed by a blank space, an internet address, and a number.
File2 is a simple .txt file, a list of words, one on each line.
My aim is to create a third file, file3, containing only those lines of file1 whose first word matches a word in file2's list, and to disregard the rest.
My attempt is the following:
grep -w -F -f file2.txt file1.sm > file3.sm
I've also attempted something along these lines:
gawk 'FNR==NR {a[$1]; next } !($2 in a)' file2.txt file1.sm > file3.sm
but with no success. I understand ^ and \b might play a part here, but I don't know how to fit them into the syntax. I've looked around extensively, but no solution seems to fit.
My problem is that grep reads each of file1's lines in full, and the matching word may well lie inside the web address, where I'm not interested in finding it.
Anchor each file2 word to the start of the line by prefixing it with ^, then pass the anchored patterns to grep:
sed 's/^/^/' file2.txt | grep -f - file1.sm
join is the best tool for this, not grep/awk:
join -t' ' <(sort file1.sm) <(sort file2.txt) >file3.sm
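For completeness, the gawk attempt from the question needs only two fixes to work (a sketch, assuming the file layout described above): test the first field $1 rather than $2, and keep the matching lines instead of negating the test:

gawk 'FNR==NR { a[$1]; next }   # first pass: store the file2 word list
      $1 in a                   # second pass: keep file1 lines whose first word is listed
' file2.txt file1.sm > file3.sm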

use grep to return list of matching words in a line per file

I have a list of files, and I want to look for some specific keywords in those files. The output should be one line per file with matches, showing each word we found just once. For example, if I have the following file test.txt
one,two,three
four,five,six,
seven,eight,nine
and I grep for the words five and eight, it should return something like this:
test.txt:five,eight
I'm not interested in the lines or the number of matches. I just want to know which words matched in each file. How can I do that?
GNU grep + awk solution:
Let's say we have file test1.txt with contents:
one,two,three
four,five,six,
seven,eight,nine
and test2.txt with contents:
one
two
three, four, five
Finding matches for words five and eight:
grep -Hwo '\(five\|eight\)' test* |
awk -F':' '{ a[$1]=(a[$1])? a[$1]","$2:$2 }END{ for(i in a) print i FS a[i] }'
The output:
test1.txt:five,eight
test2.txt:five
grep details:
-H - Print the file name for each match
-w - Select only those lines containing matches that form whole words
-o - Print only the matched (non-empty) parts of matching lines
awk details:
-F':' - field separator
a[$1]=(a[$1])? a[$1]","$2:$2 - using filename $1 as array key for accumulating all matched words
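The question asks for each matched word "just once"; the pipeline above would repeat a word that matches on several lines of the same file. A hypothetical variant that deduplicates with a seen[] array keyed on the whole file:word pair:

grep -Hwo '\(five\|eight\)' test* |
awk -F':' '!seen[$0]++ { a[$1]=(a[$1])? a[$1]","$2 : $2 }
           END { for (i in a) print i FS a[i] }'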

grep Top n Matches Across Files

I'm using grep to extract lines across a set of files:
grep somestring *.log
Is it possible to limit the maximum number of matches per file? Ideally I'd just like to print n lines from each of the *.log files.
To limit matching to 11 lines per file:
grep -m11 somestring *.log
Here is an alternate way of simulating it with awk:
awk 'FNR==1{f=0} f==10{nextfile} /regex/{++f; print FILENAME":"$0}' *.log
Explanation:
FNR==1 : Reset the match counter f at the start of each file; without this, a file
with fewer than 10 matches would carry its count over into the next file.
f==10 : Check whether the counter has reached 10. Configure this value depending on
the number of lines you wish to match.
nextfile : Moves processing to the next file. (On an awk without nextfile, exit would
have to do instead, but it stops processing entirely after the first file.)
/regex/ : Your search regex or pattern.
{++f; print FILENAME":"$0} : Increment the counter and print the filename and line.
