grep - Print line with matching pattern, starting from the matched pattern

For example, if I have a file containing:
spam eggs ham
and I do grep <some-flag> "eggs" *
I should get the output as:
eggs ham
and not
spam eggs ham

$ echo "spam eggs ham" | grep -o 'eggs.*'
eggs ham
grep -o
This prints only the matched portion of each line.
eggs.*
This means eggs followed by anything (the dot matches any character, and the star means zero or more repetitions).
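As a small sketch of how the match behaves (the sample lines below are made up for illustration): because .* is greedy, -o prints from the first eggs through to the end of the line, and lines without eggs print nothing.

```shell
# Hypothetical sample input: one plain line, one with two "eggs", one with none.
sample='spam eggs ham
eggs spam eggs ham
no match here'

# -o prints only the matched text; 'eggs.*' starts at the first "eggs"
# on each line and runs to the end of that line.
from_match=$(printf '%s\n' "$sample" | grep -o 'eggs.*')
printf '%s\n' "$from_match"
```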

Related

Find matching words

I have a corpus file and a rules file. I am trying to find the words in the corpus that contain a string from the rules file.
# cat corpus.txt
this is a paragraph number one
second line
third line
# cat rule.txt
a
b
c
This returns 2 lines
# grep -F0 -f rule.txt corpus.txt
this is a paragraph number one
second line
But I am expecting 4 words like this...
a
paragraph
number
second
I am trying to achieve these results using grep or awk.
Assuming words are separated by whitespace:
awk '{print "\\S*" $1 "\\S*"}' rule.txt | grep -m 4 -o -f - corpus.txt
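To see what that pipeline does, here is a sketch that splits it into two steps. It assumes a grep that supports -o and reading patterns from standard input with -f -, and it uses the POSIX class [^[:space:]] in place of \S. The variable names and the temporary directory are only for illustration. Note that each rule is treated as a regular expression, so a rule containing characters such as . or * would need escaping.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'this is a paragraph number one' 'second line' 'third line' > "$workdir/corpus.txt"
printf '%s\n' a b c > "$workdir/rule.txt"

# Step 1: turn every rule into "any run of non-blanks containing the rule".
patterns=$(awk '{print "[^[:space:]]*" $1 "[^[:space:]]*"}' "$workdir/rule.txt")
printf '%s\n' "$patterns"

# Step 2: hand the generated patterns to grep; -o prints each matched word.
words=$(printf '%s\n' "$patterns" | grep -o -f - "$workdir/corpus.txt")
printf '%s\n' "$words"

rm -rf "$workdir"
```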

extract the adjacent character of selected letter

I have this text file:
# cat letter.txt
this
is
just
a
test
to
check
if
grep
works
The letter "e" appears in 3 words.
# grep e letter.txt
test
check
grep
Is there any way to return the letter printed to the left of the selected character?
expected.txt
t
h
r
With the shown samples, could you please try the following awk command:
awk '/e/{print substr($0,index($0,"e")-1,1)}' Input_file
Explanation: a detailed explanation of the above.
awk '                                 ##Start the awk program.
/e/{                                  ##If the current line contains e, do the following.
  print substr($0,index($0,"e")-1,1)  ##Print 1 character, starting one position before the first e.
}
' Input_file                          ##Name of the input file.
You can use positive lookahead to match a character that is followed by an e, without making the e part of the match.
cat letter.txt | grep -oP '.(?=e)'
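If your grep has -o but no -P, a rough equivalent is to match the character together with the e and then keep only the first character. This is just a sketch: unlike the lookahead, the e is consumed by each match, so the results can differ on input with consecutive e's (for example "eee"). The temporary file name is only for illustration.

```shell
letters_file=$(mktemp)
printf '%s\n' this is just a test to check if grep works > "$letters_file"

# '.e' matches any character followed by e; cut -c1 keeps the first character only.
left_of_e=$(grep -o '.e' "$letters_file" | cut -c1)
printf '%s\n' "$left_of_e"

rm -f "$letters_file"
```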
With sed:
sed -nE 's/.*(.)e.*/\1/p' letter.txt
Assuming you have this input file:
cat file
this
is
just
a
test
to
check
if
grep
works
egg
element
You may use this grep + sed solution to find the letter (or an empty string) before each e:
grep -oE '(^|.)e' file | sed 's/.$//'
t
h
r
l
m
Or alternatively this single awk command should also work:
awk -F 'e' 'NF > 1 {
for (i=1; i<NF; i++) print substr($i, length($i), 1)
}' file
This might work for you (GNU sed):
sed -nE '/(.)e/{s//\n\1\n/;s/^[^\n]*\n//;P;D}' file
Turn off implicit printing and enable extended regexp -nE.
Focus only on lines that meet the requirements i.e. contain a character before e.
Surround the required character by newlines.
Remove any characters before and including the first newline.
Print the first line (up to the second newline).
Delete the first line (including the newline).
Repeat.
N.B. The solution prints each such character on a separate line.
To print all such characters from one input line together on a single line, use:
sed -nE '/(.e)/{s//\n\1/g;s/^/e/;s/e[^\n]*\n?//g;s/\B/ /g;p}' file
N.B. Remove the s/\B/ /g if space separation is not needed.
With GNU awk you can use an empty string as FS to split the input into individual characters:
awk -v FS= '/[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file
t
h
r
Excluding "e" at the beginning in the for loop.
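If your awk does not split on an empty FS, the same loop can be sketched with substr instead of fields. This is an assumed portable variant, not the command above; the sample words are piped in directly for illustration.

```shell
# Walk each line character by character, starting at position 2 so that
# an "e" at the very beginning is skipped, as in the FS= version.
left_of_e_awk=$(printf '%s\n' test check grep element |
    awk '{
        n = length($0)
        for (i = 2; i <= n; i++)
            if (substr($0, i, 1) == "e") print substr($0, i - 1, 1)
    }')
printf '%s\n' "$left_of_e_awk"
```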
Edit: to print an empty string if e is the first character in the word:
For example, this input:
cat file2
grep
erroneously
egg
Wednesday
effectively
awk -v FS= '/^[e]/ {print ""} /[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file2
r
n
W
n
f
v

Is there a way to find an occurrence of a word delimited by space using a grep command?

I need to find specific phrases within a text file using the grep commands. The first word must have a maximum of three letters, and there must be the word "of" delimited before and after by a space.
This is the example of the file text:
sch of aock
sch of rock
sc of uok
Tre schoof ai rock
Bamam school of aiao
Bam school of ciao
The correct result should be
sch of aock
sch of rock
sc of uok
Bam school of ciao
My code works only partially
grep -E '^.{0,3} of *' es1.txt
sch of aock
sch of rock
sc of uok
grep -E '^.{0,3} .* of ' es1.txt
Bam school of ciao
$ grep -E '^\S{1,3} (.* )?of ' es1.txt
The first part makes sure you're looking at non-blanks for the first word, and that it has at least one character. Then, it's followed by a space.
Then you have optionally more words, however many, ending in a space. But that group is using a ?, because it's not present when the word "of" is following the first word directly.
Finally, you're matching the word "of" and one more space at the end.
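Here is a sketch of that pattern run against the sample from the question. It assumes the data is saved in a temporary file and uses the POSIX class [^[:space:]] instead of \S, which is a GNU extension.

```shell
es1=$(mktemp)
printf '%s\n' 'sch of aock' 'sch of rock' 'sc of uok' \
    'Tre schoof ai rock' 'Bamam school of aiao' 'Bam school of ciao' > "$es1"

# First word: 1 to 3 non-blanks, then a space.
# Optional group: any further words, ending in a space.
# Then the word "of" followed by a space.
matches=$(grep -E '^[^[:space:]]{1,3} (.* )?of ' "$es1")
printf '%s\n' "$matches"

rm -f "$es1"
```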
Why not use this simple regex solution (grep or awk):
grep -E '^.{1,3} .*of ' file
awk '/^.{1,3} .*of /' file
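sch of aock
sch of rock
sc of uok
Tre schoof ai rock
Bam school of ciao
^ start of line
.{1,3} 1 to 3 characters, then a space
.* any further characters
of the word of, then a space
Note that .*of does not require a space before of, so this also matches Tre schoof ai rock, which the question wants excluded.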

How to grep only if pattern1 and pattern2 matches in consecutive lines

I have a file like below:
city-italy
good food
bad climate
-
city-india
bad food
normal climate
-
city-brussel
normal dressing
stylish cookings
good food
-
Question - I want to grep the city and food lines for every city whose food is "bad".
For example, for the file above I need a grep command that gives this answer:
city-india
bad food
Please help me work out how to get pattern 1 and pattern 2 only when both match.
I mean both patterns should match, with the second on the line following the first.
You can do it with pipes - grep -A1 city <filename> | grep -B1 "bad food" or cat filename | grep -A1 city | grep -B1 "bad food" (or any other stream source for the pipe)
If the city name is guaranteed to come before the food quality (any other info in between is allowed):
sed -n -e '/^city/h' -e '/bad food/{x;G;p}' input
This keeps the name of each city in the hold buffer and prints the most recent city name when a line matches bad food.
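The same idea can be sketched in awk, for readers who find the hold buffer hard to follow: remember the most recent city line in a variable and print it when a bad food line turns up. The variable names and the temporary file are assumptions made for illustration.

```shell
cities=$(mktemp)
cat > "$cities" <<'EOF'
city-italy
good food
bad climate
-
city-india
bad food
normal climate
-
city-brussel
normal dressing
stylish cookings
good food
-
EOF

# "city" plays the role of sed's hold buffer: it always holds the last city seen.
bad_food=$(awk '/^city/ { city = $0 } /bad food/ { print city; print }' "$cities")
printf '%s\n' "$bad_food"

rm -f "$cities"
```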
I know this is an old question, but here's a "robust" alternative (cuz I'm into that):
grep -x -e'city-.*' -e'good food' -e'bad food' -e'-' | tr \\n \| | sed -e's/|-|/\n/g' | grep -xe'[^|]\+|[^|]\+' | grep -e'|bad food$' | tr \| \\n
Explanation
grep -x -e'city-.*' -e'good food' -e'bad food' -e'-': only keep the lines that contain a "city line", a "food line" (either good or bad), or a "separator line" (the food line expression could be better, I know), the -x argument to grep will make it return a line only if the whole line matches the given expression (incidentally, this first stage makes the whole pipe not choke on differently-sized "registers"),
tr \\n \|: turn newlines into pipes (you can use any character that does not appear in the original file, pipe works, so does a colon, you get the idea),
sed -e's/|-|/\n/g': replace the |-| string by a newline (these are the places we know a "register" ends; since we only kept the data we're interested in and the separators, we know that now we have each of our "registers" in a single line, with their fields separated by pipes),
grep -xe'[^|]\+|[^|]\+': only keep lines containing exactly two fields (ie. the city and food fields),
grep -e'|bad food$': keep only lines ending in |bad food,
tr \| \\n: turn pipes back into newlines (nb. this is just here so that the output conforms to the question's specification, it's not really needed, nor preferred in my opinion).
Partial outputs
After grep -x -e'city-.*' -e'good food' -e'bad food' -e'-':
city-italy
good food
-
city-india
bad food
-
city-brussel
good food
-
After tr \\n \|:
city-italy|good food|-|city-india|bad food|-|city-brussel|good food|-|
After sed -e's/|-|/\n/g':
city-italy|good food
city-india|bad food
city-brussel|good food
After grep -xe'[^|]\+|[^|]\+': idem, since we don't have a "city line" without a "food line" in the example given, nor a register containing two "city lines" and a "food line", nor a register containing a "city line" and two "food lines", nor... you get the picture,
After grep -e'|bad food$':
city-india|bad food
After tr \| \\n:
city-india
bad food
Why is this more "robust"?
The input file basically consists of different "registers", each containing a variable number of "fields", but instead of having them in a "horizontal" format, we find them in a "vertical" one, i.e. one field per line with a lone - separating whole registers.
The pipe above supports any amount of fields in each register, it only assumes that:
Registers are separated by a lone -,
The "city fields" are all of the form city-*,
The "food fields" are either good food or bad food,
If at all existent, "city" fields appear before "food" fields.
(this last one I find particularly hard to relax, at least in a "normal"-ish pipe like the one given).
It does not assume that:
Each register has a "city" and a "food" field,
Each register has only "city" and "food" fields.
Disclaimer
I'm not claiming this is in any way better than any of the other answers, it's just that I can't do sed or awk to save my own life, and often find pipes like this are helpful in understanding how the file gets filtered and transformed.
All in all, it's just a matter of taste.
If the order is guaranteed, you can use grep directly with multiple patterns (OR):
grep -e "city" -e "food" FILE_INPUT
Then, hopefully, each city line will be followed by its food line.
The result looks like:
city-italy
good food
city-india
bad food
city-brussel
good food
You can change your pattern to get a more filtered result.
To get the city with bad food using GNU awk (needed for the multi-character RS):
awk '/bad food/ {print RS $1}' RS="city" file
city-india
Another awk one-liner:
kent$ awk 'BEGIN{FS=OFS="\n";RS="-"FS}/bad food/{print $1,$2}' file
city-india
bad food

grep to find words with unique letters

How can I use grep to find words in a dictionary file that contain a given set of letters, with the restriction that each letter occurs once and only once?
E.g. if the letters are abc, then the expected output is:
cab
EDIT:
Given a dictionary file (that is a file containing one word per line such as /usr/share/dict/words on mac os x operating system) and a set of (unique) characters, I want to print out all of the dictionary file's words that contain each character of the input set once and only once. For example if the set of characters is {a,b,c} then print out all (3-letter) words that contain each character of the set.
I am looking, preferably, for a solution that uses just grep expressions.
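Given a series of letters, for example abc, you can convert each one to a lookahead, like this:
^(?=[^a]*a[^a]*$)(?=[^b]*b[^b]*$)(?=[^c]*c[^c]*$)[abc]+$
Each lookahead is anchored with $ so that it checks the whole word for exactly one occurrence of its letter; the trailing [abc]+$ then consumes the word.
Lookaheads are not part of extended regex (-E); with GNU grep, use the Perl-compatible flag -P instead.
To create this regex from a string, you could use sed (an exercise for the reader).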
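One possible take on that exercise, as a sketch only: it assumes the letters are plain, unique, lowercase characters with no regex metacharacters, that each lookahead is anchored with $ so it checks the whole word, and that grep supports -P (GNU grep built with PCRE). The variable names and the word list are made up for illustration.

```shell
letters='abc'

# Replace every letter with a lookahead demanding exactly one occurrence of it;
# "&" in the sed replacement stands for the matched letter.
lookaheads=$(printf '%s\n' "$letters" | sed 's/./(?=[^&]*&[^&]*$)/g')

# After the lookaheads, allow only the given letters up to the end of the word.
regex="^${lookaheads}[${letters}]+\$"
printf '%s\n' "$regex"

unique_words=$(printf '%s\n' cab abc abba abcd bac | grep -P "$regex")
printf '%s\n' "$unique_words"
```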
grep -E '^[abc]{3}.$' <Dictionary file> | grep -v -e 'a.*a' -e 'b.*b' -e 'c.*c'
I.e. find all three-letter strings made of the input letters, and pipe these through an inverted grep to remove strings with repeated letters.
I'm using the '.' after {3} because my dictionary file is Windows-based, so each line has an extra carriage return. It is probably not necessary otherwise.
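An alternative to the extra '.', sketched here under the assumption that the only difference in a Windows-style dictionary is a trailing carriage return on each line, is to strip the carriage returns first so the same pattern works on both kinds of file. The sample file is made up for illustration.

```shell
dict_crlf=$(mktemp)
printf 'cab\r\naab\r\nabc\r\nabcd\r\n' > "$dict_crlf"

# tr -d '\r' removes the carriage returns, so '$' can sit right after {3}.
no_repeats=$(tr -d '\r' < "$dict_crlf" |
    grep -E '^[abc]{3}$' |
    grep -v -e 'a.*a' -e 'b.*b' -e 'c.*c')
printf '%s\n' "$no_repeats"

rm -f "$dict_crlf"
```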
Below is a Perl solution. Note that you'll need to add more words to the dictionary and read input into the $input variable. An array of valid words will end up in @results.
#!/usr/bin/env perl
use Data::Dumper;
my $input = "abc";
my @dictionary = qw(aaa aac aad aal aam aap aar aas aat aaw aba abc abd abf abg
abh abm abn abo abr abs abv abw aca acc ace aci ack acl acp acs act acv ada adb
adc add adf adh adl adn ado adp adq adr ads adt adw aea aeb aec aed aef aes aev
afb afc afe aff afg afi afk afl afn afp aft afu afv agb agc agl agm agn ago agp
...
PUT A REAL DICTIONARY HERE!
...
zie zif zig zii zij zik zil zim zin zio zip zir zis zit ziu ziv zlm zlo zlx zma
zme zmi zmu zna zoa zob zoe zog zoi zol zom zon zoo zor zos zot zou zov zoy zrn
zsr zub zud zug zui zuk zul zum zun zuo zur zus zut zuz zva zwo zye zzz);
# Generate a lookahead expression for each character in the input word
my $regexp = join("", map { "(?=.*$_)" } split(//, $input));
my @results;
foreach my $word (@dictionary) {
    # If the size of the input doesn't match the dictionary word, skip to the
    # next word.
    if (length($input) != length($word)) {
        next;
    }
    if ($word =~ /$regexp/) {
        push(@results, $word);
    }
}
print Dumper @results;
The solution I found uses grep first to extract all n-letter words that contain only letters from the input set; some letters might appear more than once and some might not appear at all (again, I am assuming that the input letters are unique). It then runs a series of one-letter greps to make sure each letter occurs at least once. Because the words are of length n, this ensures each word contains every letter once and only once. For example, if the input character set is {a,b,c}, then the solution would be:
grep -E '^[abc]{3}$' /usr/share/dict/words | grep a | grep b | grep c
A simple bash script can create this grep pipeline and run it against the word file, using $1 as the input letter set. It might not be the most efficient way of generating the string, but as I am not familiar with sed or awk, it does solve my problem. The script I created is:
#!/bin/bash
# bash is needed for the ${1:offset:length} substring expansion below
slen=${#1}
g2="'^[$1]{$slen}\$'"
g3=""
ix1=0
while [ $ix1 -lt $slen ]
do
    g3="$g3 | grep ${1:$ix1:1}"
    ix1=$((ix1+1))
done
eval grep -E $g2 /usr/share/dict/words $g3
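For comparison, here is a sketch of the same two-step filter without eval, written as a function that should run in plain sh. The function name find_words, its arguments and the sample dictionary are hypothetical, and it assumes the letters are unique and contain nothing special inside a bracket expression.

```shell
# find_words LETTERS DICT: hypothetical helper, not part of the script above.
find_words() {
    letters=$1
    dict=$2
    n=${#letters}
    # Step 1: keep n-letter words built only from the given letters.
    candidates=$(grep -E "^[$letters]{$n}\$" "$dict")
    # Step 2: require each letter in turn, one grep per letter.
    rest=$letters
    while [ -n "$rest" ]; do
        first=${rest%"${rest#?}"}   # first character of what is left
        rest=${rest#?}              # drop that character
        candidates=$(printf '%s\n' "$candidates" | grep -F -e "$first")
    done
    if [ -n "$candidates" ]; then
        printf '%s\n' "$candidates"
    fi
}

sample_dict=$(mktemp)
printf '%s\n' cab abc aab abcd bca xyz > "$sample_dict"
found=$(find_words abc "$sample_dict")
printf '%s\n' "$found"
rm -f "$sample_dict"
```

Filtering a variable step by step is slower than one long pipeline on a large dictionary, but it avoids building a command string and passing it to eval.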

Resources