How to join 2 files using a pattern - join

Is it possible to join these files based on a first-column pattern using awk?
Thanks
file1
qwex-123d-947774-sm-shebha
qwex-123d-947774-sm-shebhb
qwex-123d-947774-sm-shebhd
qwex-23d-947774-sm-shebha
qwex-23d-947774-sm-shebhb
qwex-235d-947774-sm-shebhd
file2
qwex-235d none1
qwex-23d none2
output
qwex-23d none2 qwex-23d-947774-sm-shebha
qwex-23d none2 qwex-23d-947774-sm-shebhb
qwex-235d none1 qwex-235d-947774-sm-shebhd

This awk one-liner should do it:
awk 'NR==FNR{a[$1]=$0;next}{for(x in a)if($0~"^"x){print a[x], $0;break}}' file2 file1
Note that this line is risky if the keys in your file2 contain characters that have special meaning in a regex, like qwex$-23d.
If that is the case, ~ should not be used; instead, we should compare the strings literally.
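For example, a minimal sketch of that literal variant (the same structure, but comparing the prefix with substr instead of matching with ~):
awk 'NR==FNR{a[$1]=$0;next}{for(x in a)if(substr($0,1,length(x))==x){print a[x], $0;break}}' file2 file1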

Related

Match Lines From Two Lists With Wildcards In One List

I have two lists, one of which contains wildcards (in this case represented by *). I would like to compare the two lists and create an output of those that match, with each wildcard * representing a single character.
For example:
File 1
123456|Jane|Johnson|Pharmacist|janejohnson#gmail.com
09876579|Frank|Roberts|Butcher|frankie1#hotmail.com
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
File 2
1***6|Jane|Johnson|Pharmacist|janejohnson#gmail.com
09876579|Frank|Roberts|Butcher|f**1#hotmail.com
092362936|Joe|Jordan|J*****|joe#joesjoinery.com
928|Bob|Horton|Farmer|b*****n#f*********.co.uk
Output
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
Explanation
The first two lines are not considered matches because the number of *s is not equal to the number of characters shown in the first file. The latter two are, so they are added to output.
I have tried to reason out ways to do this in AWK and using Join, but I don't know any way to even start trying to achieve this. Any help would be greatly appreciated.
$ cat tst.awk
NR==FNR {
    file1[$0]
    next
}
{
    # Make every non-* char literal (see https://stackoverflow.com/a/29613573/1745001):
    gsub(/[^^*]/,"[&]")   # Convert every char X to [X] except ^ and *
    gsub(/\^/,"\\^")      # Convert every ^ to \^
    # Convert every * to .:
    gsub(/\*/,".")
    # Add line start/end anchors
    $0 = "^" $0 "$"
    # See if the current file2 line matches any line from file1
    # and if so print that line from file1:
    for ( line in file1 ) {
        if ( line ~ $0 ) {
            print line
        }
    }
}
$ awk -f tst.awk file1 file2
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
sed 's/\./\\./g; s/\*/./g' file2 | xargs -I{} grep {} file1
Explanation:
I'd take advantage of regular expression matching. To do that, we need to turn every asterisk * into a dot ., which represents any character in regular expressions. As a side effect of enabling regular expressions, we need to escape all special characters, particularly the ., in order for them to be taken literally. In a regular expression, we need to use \. to represent a dot (as opposed to any character).
The first step is to perform these substitutions with sed; the second is to pass every resulting line as a search pattern to grep and search file1 for that pattern. The glue that allows us to do this is xargs, where {} is a placeholder representing a single line from the results of the sed command.
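For example, the second line of file2 would be turned into the following search pattern (illustrated here with a bash here-string):
$ sed 's/\./\\./g; s/\*/./g' <<< '09876579|Frank|Roberts|Butcher|f**1#hotmail.com'
09876579|Frank|Roberts|Butcher|f..1#hotmail\.com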
Note:
This is not a general, safe solution you can simply copy and paste: you should watch out for any other characters in the file containing the asterisks that are considered special in grep regular expressions.
Update:
jhnc extends the escaping to all of the following characters: .\^$[], thus accounting for almost all sorts of email addresses. They also avoid the use of xargs by employing -f - to pass the results of sed as search expressions to grep:
sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1
This solution is both more general and more efficient.

How to grep multiple lines using a .txt vocab, matching only first word as variable?

I'm trying to reduce a .sm file (file1, around 10 GB) by filtering it using a fairly long set of words (around 180,108 items) listed in a text file, file2.
File1 is structured as follows:
word <http://internet.address.com> 1
i.e. one word followed by a blank space, an internet address, and a number.
File2 is a simple .txt file, a list of words, one on each line.
My aim is to create a third file File3 containing only those lines in file1 whose first word matches with the word-list of file2, and disregard the rest.
My attempt is the following:
grep -w -F -f file2.txt file1.sm > file3.sm
I've also attempted something along this line:
gawk 'FNR==NR {a[$1]; next } !($2 in a)' file2.txt file1.sm > file3.sm
but with no success. I understand ^ and \b might play a part here, but I don't know how to fit them into the syntax. I've looked around extensively, but no solution seems to fit.
My problem is that grep reads file1's entire lines, so it can happen that the matching word lies in the webpage address, which I'm not interested in searching.
sed 's/^/^/' file2.txt | grep -f - file1.sm
join is the best tool for this, not grep/awk:
join -t' ' <(sort file1.sm) <(sort file2.txt) >file3.sm
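For completeness, the original gawk attempt only needed two small changes: key on the first column of file2 and keep, rather than discard, the file1 lines whose first field is in that set. A minimal sketch:
awk 'FNR==NR { words[$1]; next } $1 in words' file2.txt file1.sm > file3.sm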

is there a sophisticated way to grep this file

I have one file. Written in BNF it could be
<line> ::= ((<ISBN10>|<ISBN13>)([a-Z/0-9]*)){1,4}
For example
123456789X/abscd/1234567890123/djfkldsfjj
How can I grep only one ISBN10 or ISBN13 per line, even when there are more ISBNs in the line? If there are multiple ISBNs in a line, it should take only the first one.
When I grep this way
grep -Po "[0-9]{9,13}X{0,1}" file
then I get more lines than the file originally has (as there can be up to 4 ISBNs per line).
I would also need the line count of the grep result to equal the line count of the file.
Any advice?
Well, if the other answer's assumption that the 'first' ISBN is always at the start of the line doesn't hold, you could always try Perl.
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
    chomp;
    my ( $first_isbn, @rest ) = m/(\d{9,13}X{0,1})/g;
    print $., ":", $first_isbn, "\n" if $first_isbn;
}
$. is the line number in Perl, so we print that and the match if there's a match. <> says to read and iterate over either filenames or STDIN, much like grep does. So you could invoke this in a similar way to grep:
perl myscript.pl <filename>
Or:
cat <filename> | ./myscript.pl
This would one-liner-ify as:
perl -lne 'my ( $first_isbn ) = m/(\d{9,13}X{0,1})/g; print $., ":", $first_isbn if $first_isbn;'
One trivial solution is to include the beginning of the line in your regex:
grep -Po "^[0-9]{9,13}X{0,1}" file
This ensures that matches after the first do not satisfy the regex. It does seem from your BNF that the ISBNs, if present, are guaranteed to be the first characters of the line.
Another way is to use sed:
sed -n "s/\([0-9]\{9,13\}X\{0,1\}\).*/\1/p" file
This matches your pattern along with the rest of the line, but only prints your pattern. You could then use another utility to add line numbers. E.g. pipe your output to nl -nrz -w9.
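Putting the two steps together, a sketch of the combined pipeline would be:
sed -n "s/\([0-9]\{9,13\}X\{0,1\}\).*/\1/p" file | nl -nrz -w9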

Filter a specific letter within a word using Grep

I've been trying to find a way to filter a specific letter within a word using a regular expression. For example, filtering the letter "a" in the word "latin". Filtering only a letter would be simple, using something like:
grep "\ba\b"
but I can't find a way to get the "a" only in a certain word.
Thanks for your help!
You can pipe to another grep, like this:
grep "\ba\b" /path/to/input/file | grep -o "a"
The latter part of the pipe uses the -o flag, which outputs only the matched part. Alternatively, grep -o "a" on its own should return all the a's.
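If the goal is specifically the "a" inside a particular word such as "latin", one way (just a sketch) is to isolate that word first and then pick out the letter:
grep -o '\blatin\b' /path/to/input/file | grep -o 'a'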

How can I know how many times a word is in a file using grep?

I have a file and I want to know how many times a word appears inside that file. (NOTE: a row can contain the same word more than once.)
You can use this command. Hope this will help you.
grep -o yourWord file | wc -l
Use the grep -c option to count the number of lines that contain the search pattern; note that it counts matching lines, not every occurrence within a line.
grep -c searchString file
awk solution:
awk '{s+=gsub(/word/,"&")}END{print s}' file
test:
kent$ cat f
word word word
word
word word word
kent$ awk '{s+=gsub(/word/,"&")}END{print s}' f
7
You may want to add word boundaries if you want to match an exact word.
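With GNU awk, for example, the \< and \> word-boundary operators can be used for that; a minimal sketch:
awk '{s+=gsub(/\<word\>/,"&")}END{print s}' file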
Yes, I know you want a grep solution, but my favorite Perl with the rolex operator can't be missing here... ;)
perl -0777 -nlE 'say $n=()=m/\bYourWord\b/g' filename
# ^^^^^^^^
If you want to match YourWord even when it is surrounded by other letters, like abcYourWordXYZ, use
perl -0777 -nlE 'say $n=()=m/YourWord/g' filename

Resources