Can grep print its matches to multiple lines, even if found on the same line?

For example, with the following string:
[:variable_one] == options[:variable_two]
and the following grep argument:
grep -Eo "\[\:.*?\]"
It will show the output of:
[:variable_one] == options[:variable_two]
but instead, I'm looking to get an output of:
[:variable_one]
[:variable_two]
Is there a way to "split" each match into a separate line, even if it finds multiple matches on a single line? Basically looking for the opposite answer of this: Print multiple regex matches using grep on the same line

The : char, and a ] that is not part of a bracket expression, are not special inside a regex pattern. POSIX ERE has no lazy quantifiers, so *? is treated as plain *; the match is greedy and extends to the rightmost occurrence of ].
A POSIX BRE compliant regex for use with grep can look like
#!/bin/bash
s='[:variable_one] == options[:variable_two]'
grep -o "\[:[^][]*]" <<< "$s"
Output:
[:variable_one]
[:variable_two]
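If your grep supports PCRE, the lazy quantifier behaves as intended, so the original idea also works; a minimal sketch, assuming GNU grep built with -P support:
grep -oP '\[:.*?\]' <<< '[:variable_one] == options[:variable_two]'
[:variable_one]
[:variable_two]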

Related

Match Lines From Two Lists With Wildcards In One List

I have two lists, one of which contains wildcards (in this case represented by *). I would like to compare the two lists and create an output of those that match, with each wildcard * representing a single character.
For example:
File 1
123456|Jane|Johnson|Pharmacist|janejohnson#gmail.com
09876579|Frank|Roberts|Butcher|frankie1#hotmail.com
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
File 2
1***6|Jane|Johnson|Pharmacist|janejohnson#gmail.com
09876579|Frank|Roberts|Butcher|f**1#hotmail.com
092362936|Joe|Jordan|J*****|joe#joesjoinery.com
928|Bob|Horton|Farmer|b*****n#f*********.co.uk
Output
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
Explanation
The first two lines are not considered matches because the number of *s is not equal to the number of characters shown in the first file; the latter two are, so they are added to the output.
I have tried to reason out ways to do this in AWK and using Join, but I don't know any way to even start trying to achieve this. Any help would be greatly appreciated.
$ cat tst.awk
NR==FNR {
    file1[$0]
    next
}
{
    # Make every non-* char literal (see https://stackoverflow.com/a/29613573/1745001):
    gsub(/[^^*]/,"[&]")   # Convert every char X to [X] except ^ and *
    gsub(/\^/,"\\^")      # Convert every ^ to \^
    # Convert every * to .:
    gsub(/\*/,".")
    # Add line start/end anchors
    $0 = "^" $0 "$"
    # See if the current file2 line matches any line from file1
    # and if so print that line from file1:
    for ( line in file1 ) {
        if ( line ~ $0 ) {
            print line
        }
    }
}
$ awk -f tst.awk file1 file2
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
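To see the regex the script builds from a file2 line, the same substitutions can be run on their own; a small sketch (the echoed line is just a shortened sample from the question):
echo '1***6|Jane' | awk '{
    gsub(/[^^*]/,"[&]")   # bracket every char except ^ and *
    gsub(/\^/,"\\^")      # escape any ^
    gsub(/\*/,".")        # each * becomes ., i.e. any single character
    print "^" $0 "$"
}'
^[1]...[6][|][J][a][n][e]$
That anchored pattern is what each file1 line is tested against with line ~ $0.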
sed 's/\./\\./g; s/\*/./g' file2 | xargs -I{} grep {} file1
Explanation:
I'd take advantage of regular expression matching. To do that, we need to turn every asterisk * into a dot ., which represents any character in regular expressions. As a side effect of enabling regular expressions, we need to escape all special characters, particularly the ., in order for them to be taken literally. In a regular expression, we need to use \. to represent a dot (as opposed to any character).
The first step is to perform these substitutions with sed; the second is to pass every resulting line as a search pattern to grep and search file1 for that pattern. The glue that allows us to do this is xargs, where {} is a placeholder representing a single line from the output of the sed command.
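For reference, the first sample file2 line comes out of the sed step as
1...6|Jane|Johnson|Pharmacist|janejohnson#gmail\.com
i.e. every * has become a . and every literal dot is escaped; that string is then used as the grep pattern.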
Note:
This is not a general, safe solution that you can simply copy and paste: you should watch out for any characters in the file containing the asterisks that are special in grep regular expressions.
Update:
jhnc extends the escaping to any of the following characters: .\^$[], thus accounting for almost all sorts of email addresses. They also avoid the use of xargs by employing -f - to pass the output of sed to grep as a list of search patterns:
sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1
This solution is both more general and more efficient than the xargs version.
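On the sample files this prints the same two lines as the awk approach:
sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk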

Getting only grep exact matches

I am trying to grep a file for the exact occurrence of a match, but I also get longer spurious matches:
grep CAT1717O99 myfile.txt -F -w
Output:
CAT1717O99
CAT1717O99.5
I would like to output only the first exactly matching line. Is there any way to get rid of the second line?
Thanks in advance.
Arturo
This is the file 'myfile.txt':
CAT1717O99
CAT1717O99.5
This will do the work for you.
grep -Fx "CAT1717O99" textfile
-F means fixed strings (the pattern is not treated as a regex)
-x means the match must cover the whole line
Use the power of Perl-compatible regular expressions (PCRE) and search for matches to the given pattern:
grep -Po "\bCAT1717O99(\s|$)" myfile.txt
(\s|$) - alternation group; it ensures the substring CAT1717O99 only matches when followed by whitespace or placed at the end of the line
-P option, enables Perl-compatible regular expressions
-o option, prints only the matched parts of matching lines
You'll need to explicitly request spaces (or line boundaries) around the token so that special chars such as the dot are ruled out.
grep -E '(^| )CAT1717O99( |$)' myFile.txt
From the grep manual:
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
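A quick way to see the difference on the two sample lines (-w accepts the dot as a word boundary, -x requires the whole line to match):
printf 'CAT1717O99\nCAT1717O99.5\n' | grep -Fw 'CAT1717O99'
CAT1717O99
CAT1717O99.5
printf 'CAT1717O99\nCAT1717O99.5\n' | grep -Fx 'CAT1717O99'
CAT1717O99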

is there a sophisticated way to grep this file

I have one file. Written in BNF, a line could be
<line>:== ((<ISBN10>|<ISBN13>)([a-Z/0-9]*)) {1,4})
For example
123456789X/abscd/1234567890123/djfkldsfjj
How can I grep the ISBN10 or ISBN13, only one per line, even when there are several ISBNs in the line? If there are several ISBNs in a line it should take only the first one.
When I grep that way
grep -Po "[0-9]{9,13}X{0,1}" file
then I get more lines than the file originally has (as there can be up to 4 ISBNs per line).
I would also need the line count of the grep result to equal the line count of the file.
Any advice?
Well, in case the other answer's assumption that the 'first' ISBN is always at the start of the line doesn't hold, you could always try Perl.
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
    chomp;
    my ( $first_isbn, @rest ) = m/(\d{9,13}X{0,1})/g;
    print $., ":", $first_isbn, "\n" if $first_isbn;
}
$. is the current line number in Perl, so we print that and the match if there is one. <> reads and iterates over either the filenames given as arguments or STDIN, much like grep does. So you could invoke this in a similar way to grep:
perl myscript.pl <filename>
Or:
cat <filename> | ./myscript.pl
This would one-liner-ify as:
perl -lne 'my ( $first_isbn ) = m/(\d{9,13}X{0,1})/g; print $., ":", $first_isbn if $first_isbn;'
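Against the sample line from the question this prints the line number and the first ISBN only:
printf '123456789X/abscd/1234567890123/djfkldsfjj\n' | perl -lne 'my ( $first_isbn ) = m/(\d{9,13}X{0,1})/g; print $., ":", $first_isbn if $first_isbn;'
1:123456789X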
One trivial solution is to include the beginning of the line in your regex:
grep -Po "^[0-9]{9,13}X{0,1}" file
This ensures that matches after the first do not satisfy the regex. It does seem from your BNF that the ISBNs, if present, are guaranteed to be the first characters of the line.
Another way is to use sed:
sed -n "s/\([0-9]\{9,13\}X\).*/\1/p" file
This matches your pattern along with the rest of the line, but only prints your pattern. You could then use another utility to add line numbers. E.g. pipe your output to nl -nrz -w9.
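For example (assuming GNU sed and nl; file is the input from the question):
sed -n "s/\([0-9]\{9,13\}X\{0,1\}\).*/\1/p" file | nl -nrz -w9
Note that sed -n ... p drops lines without a match, so the numbering here counts only matching lines.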

Grep: First word in line that begins with "as" and ends with "ng"?

I'm trying to do a grep command that finds all lines in a file whose first word begins with "as" and whose first word also ends with "ng".
How would I go about doing this using grep?
This should just about do it:
$ grep '^as\w*ng\b' file
Regexplanation:
^ # Matches start of the line
as # Matches literal string as
\w # Matches characters in word class
* # Quantifies \w to match zero or more times
ng # Matches literal string ng
\b # Matches word boundary
May have missed the odd corner case.
If you only want to print the words that match and not the whole lines then use the -o option:
$ grep -o '^as\w*ng\b' file
Read man grep for all information on the available options.
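For instance, with a few made-up sample lines (GNU grep, since \w and \b are GNU extensions):
printf 'asking questions here\nanswering later\nas is\n' | grep -o '^as\w*ng\b'
asking
answering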
I am pretty sure this should work:
grep "^as[a-zA-Z]*ng\b" <filename>
Hard to say without seeing samples from the actual input file.
sudo has already covered it well, but I wanted to throw out one more simple one:
grep -i '^as[^ ]*ng\b' <file>
-i to make grep case-insensitive
[^ ]* matches zero or more of any character, except a space
^ finds the 'first character in a line', so you can search for that with:
grep '^as' [file]
\w matches a word character, so \w* would match any number of word characters:
grep '^as\w*' [file]
\b means 'a boundary between a word and whitespace' which you can use to ensure that you're matching the 'ng' letters at the end of the word, instead of just somewhere in the middle:
grep '^as\w*ng\b' [file]
If you choose to omit the [file], simply pipe your files into it:
cat [file] | grep '^as\w*ng\b'
or
echo [some text here] | grep '^as\w*ng\b'
Is that what you're looking for?

Pattern matching using grep

Assuming we have one input string like
Nice
And we have the pattern
D*A*C*N*a*g*.h*ca*e
then "Nice" will match the pattern. (* means 0 or more occurrence, . means one char)
I think using grep is better than java in this case(maybe). How can I do it in grep?
Use the same regular expression:
grep 'D*A*C*N*a*g*.h*ca*e' <<EOF
Nice
EOF
If the input is "Nicely" it still prints it! How does it work?
The current regex looks for the pattern anywhere on the line. If it must match exactly (the whole line), then add anchors to start (^) and end ($) of line:
grep '^D*A*C*N*a*g*.h*ca*e$' <<EOF
Nice
Nicely
Darce
Darcy
Darcey
EOF
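With the anchors in place only whole-line matches survive, so for the sample input above the output should be:
Nice
Darce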
