How to type AND in regex word matching - grep

I'm trying to do a word search with regex and wonder how to type AND for multiple criteria.
For example, how to type the following:
(Start with a) AND (Contains p) AND (Ends with e), such as the word apple?
Input
apple
pineapple
avocado
Code
grep -E "regex expression here" input.txt
Desired output
apple
What should the regex expression be?

In general you can't express "and" inside a single regexp (you can only express "then", using .*), but you can in a condition made of several regexps, using a tool that supports that.
A better example for "and" would have been: starts with a and includes p and includes l and ends with e, with input that includes alpine. That is not trivial to express in one regexp by just putting .*s between the characters, but it is trivial as a multi-regexp condition:
$ cat file
apple
pineapple
avocado
alpine
Using &&s will find both words regardless of the order of p and l as desired:
$ awk '/^a/ && /p/ && /l/ && /e$/' file
apple
alpine
but, as you can see, you can't just use .*s to implement and:
$ grep '^a.*p.*l.*e$' file
apple
If you had to use a single regexp then you'd have to do something like:
$ grep -E '^a.*(p.*l|l.*p).*e$' file
apple
alpine
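If awk is not available, the same AND can be sketched with grep alone by chaining one grep per condition in a pipeline, so that each stage keeps only the lines that also satisfy its own pattern. This is an alternative that is not taken from the answer above, and the temporary file created with mktemp is only there to make the sketch self-contained.

```shell
# Recreate the sample input from above in a temporary file.
file=$(mktemp)
printf '%s\n' apple pineapple avocado alpine > "$file"

# One grep per condition: starts with a AND has p AND has l AND ends with e.
result=$(grep '^a' "$file" | grep 'p' | grep 'l' | grep 'e$')
printf '%s\n' "$result"

rm -f "$file"
```

Like the awk version, this finds both apple and alpine, at the cost of one process per condition.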

There are two more ways you can do it.
A chain of "&&" is the same as negating a chain of "||" over the negated conditions (De Morgan's law), so you can write the reverse of what you want and negate the whole thing.
At the single-bit level, AND is the same as multiplication of the bits, which means that if you think the && chain is overly verbose, you can directly "multiply" the patterns together:
awk '/^a/ * /p/ * /e$/'
By multiplying them, you perform all the logical ANDs at once.
Only use this shorthand if the input isn't too large, or when the savings from early exit are known to be negligible: unlike &&, multiplication does not short-circuit, so every pattern is evaluated on every line.
Don't think of them as merely regex patterns. It's easier to think of anything outside an action block (what's typically referred to as the pattern) as any combination of expressions that evaluates to TRUE or FALSE in the end.
POSIX-compliant expressions that work in that position include:
sprintf()
field assignments, etc.
(even decrementing NR, if there's such a need)
but not:
statements like next, print, printf
delete array etc., or any of the loop structures
Surprisingly, though, getline can be used directly in the pattern position (with some wrapper workaround).
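As a rough sketch of the forms described above, the following runs the plain && chain, its De Morgan rewrite, and the multiplication shorthand on the three sample words from the question. The variable names are only for illustration. All three select the same line, apple.

```shell
input=$(printf '%s\n' apple pineapple avocado)

# Plain AND chain.
and_form=$(printf '%s\n' "$input" | awk '/^a/ && /p/ && /e$/')
# De Morgan: NOT (not-A OR not-B OR not-C).
or_form=$(printf '%s\n' "$input" | awk '!( !/^a/ || !/p/ || !/e$/ )')
# Each match evaluates to 1 or 0, so the product is nonzero only if all match.
mul_form=$(printf '%s\n' "$input" | awk '/^a/ * /p/ * /e$/')

printf '%s\n' "$and_form" "$or_form" "$mul_form"
```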

Related

Grep {n} The preceding item is matched exactly n times, is not clear to me in case of "Hair", "Haair" and "Haaair"

Suppose there are three strings, "Hair", "Haair" and "Haaair". When I use grep -E '^Ha{1}', it returns all three words. I was expecting only "Hair", since I asked for a line that starts with H and is followed by the letter 'a' exactly once.
grep does not check that its input matches the given search expression. Grep finds substrings of the input that match the search.
See:
grep test <<< 'This is a test.'
The input does not exactly match test. Only part of the input matches,
This is a test.
but that is enough for grep to output the whole line.
Similarly, when you say
grep -E '^Ha{1}' <<< Haaair
The input does not exactly match the search, but a part of it does,
Haaair
and that is enough. Note that {n,m} syntax is purely a convenience: Ha{1} is exactly equivalent to Ha, Ha{3,} is Haaa+, Ha{2,5} is Haa(a?){3}, which is Haaa?a?a?, etc. In other words, {1} does not mean "exactly once", it just means "once".
What you want to do is match a Ha that is not followed by another a. You have two options:
If your grep supports PCRE, you can use a negative lookahead:
grep -P '^Ha(?!a)'
(?!a) is a zero-length assertion, like ^. It doesn't match any characters; it simply causes the match to fail if there is an a after the first one.
Or, you can keep it simple and use a negative []:
grep -E '^Ha([^a]|$)'
Where [^a] matches any single character that is not a, and the alternation with $ handles the case of no character at all.
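To see the substring behaviour concretely, the sketch below uses -o (not in POSIX, but supported by GNU and BSD grep) to print only the matched part, then applies the [^a] alternative to the three words from the question. The variable names are illustrative.

```shell
words=$(printf '%s\n' Hair Haair Haaair)

# -o prints only the matched substring: every line contributes just "Ha".
matched=$(printf '%s\n' "$words" | grep -oE '^Ha{1}')
printf '%s\n' "$matched"

# Require that the single a is followed by a non-a character or end of line.
only_one_a=$(printf '%s\n' "$words" | grep -E '^Ha([^a]|$)')
printf '%s\n' "$only_one_a"
```

The first command prints Ha three times, once per input line, and the second prints only Hair.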

Match same character appear multiple times

I want a regex that matches when the same character appears 3 times with exactly one other character between each pair of occurrences (to simplify the answer, assume all chars are in [a-zA-Z]).
For example, popape and ccccAjAkA meet my requirement, but KKKccc and FFFsF (there is no 'other' char between two of the 'F's) do not qualify. How can I write this grep command?
Using Perl-compatible regular expressions (PCRE), which are experimental in grep:
grep -P '([a-zA-Z])(?!\1)(.)\1(?!\1)(.)\1'
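A quick check of that pattern against the four sample strings from the question, assuming a GNU grep built with PCRE support (-P is not available in every grep):

```shell
# (?!\1) stops the character right after an occurrence from being the
# same character again, so the character in between really is a different one.
result=$(printf '%s\n' popape ccccAjAkA KKKccc FFFsF |
  grep -P '([a-zA-Z])(?!\1)(.)\1(?!\1)(.)\1')
printf '%s\n' "$result"
```

Only popape and ccccAjAkA are printed.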

Only output values within a certain range

I run a command that produce lots of lines in my terminal - the lines are floats.
I only want certain numbers to be output as a line in my terminal.
I know that I can pipe the results to egrep:
| egrep "(369|433|375|368)"
if I want only certain values to appear. But is it possible to only have lines that have a value within ± 50 of 350 (for example) to appear?
grep matches against string tokens, so you have to either:
figure out the right string match for the number range you want (e.g., for 300-400, you might do something like grep -E '[34]..', with appropriate additional context added to the expression and a number of additional .s equal to your floating-point precision), or
convert the number strings to actual numbers in whatever programming language you prefer to use and filter them that way
I'd strongly encourage you to take the second option.
I would go with awk here:
./yourProgram | awk '$1>250 && $1<350'
e.g.
echo -e "12.3\n342.678\n287.99999" | awk '$1>250 && $1<350'
342.678
287.99999
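For the range in the question (within ± 50 of 350, that is 300 to 400), a minimal sketch could pass the centre and the tolerance in as awk variables. The names c and d, the inclusive bounds, and the extra sample value 401.5 are assumptions chosen for illustration.

```shell
# Keep lines whose first field lies in [c - d, c + d].
result=$(printf '%s\n' 12.3 342.678 287.99999 401.5 |
  awk -v c=350 -v d=50 '$1 >= c - d && $1 <= c + d')
printf '%s\n' "$result"
```

With these inputs only 342.678 is printed.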

Search for combinations of a phrase

What is the way to use 'grep' to search for combinations of a pattern in a text file?
Say, for instance I am looking for "by the way" and possible other combinations like "way by the" and "the way by"
Thanks.
Awk is the tool for this, not grep. On one line:
awk '/by/ && /the/ && /way/' file
Across the whole file:
gawk -v RS='\0' '/by/ && /the/ && /way/' file
Note that this is searching for the 3 words, not searching for combinations of those 3 words with spaces between them. Is that what you want?
Provide more details including sample input and expected output if you want more help.
The simplest approach is probably by using regexps. But this is also slightly wrong:
egrep '([ ]*(by|the|way)\>){3}'
What this does is match a group built from your three words, taking any spaces in front of each word with it and forcing each one to end at a word boundary (hence the \> at the end). The line matches if words from the group occur three times in a row, in any combination.
Example of running it:
$ echo -e "the the the\nby the\nby the way\nby the may\nthe way by\nby the thermo\nbypass the thermo" | egrep '([ ]*(by|the|way)\>){3}'
the the the
by the way
the way by
As already said, this produces a false positive for "the the the", but if you can live with that, I'd recommend doing it this way.
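If the false positive matters, one possible refinement is to pipe the result through an awk test that requires each of the three words to appear at least once. This is only a sketch: it uses grep -E in place of egrep, relies on the GNU \> extension just like the command above, and lines that repeat a word (e.g. "by by the way") still pass.

```shell
result=$(printf '%s\n' 'the the the' 'by the' 'by the way' 'by the may' \
    'the way by' 'by the thermo' 'bypass the thermo' |
  grep -E '([ ]*(by|the|way)\>){3}' |   # three consecutive words from the set
  awk '/by/ && /the/ && /way/')         # and each word present at least once
printf '%s\n' "$result"
```

This prints "by the way" and "the way by" but no longer "the the the".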

Pattern matching using grep

Assuming we have one input string like
Nice
And we have the pattern
D*A*C*N*a*g*.h*ca*e
then "Nice" will match the pattern (* means 0 or more occurrences, . means any one char).
I think grep may be a better fit than Java in this case. How can I do it in grep?
Use the same regular expression:
grep 'D*A*C*N*a*g*.h*ca*e' <<EOF
Nice
EOF
If the input is "Nicely" it still prints it! How does it work?
The current regex looks for the pattern anywhere on the line. If it must match exactly (the whole line), then add anchors to start (^) and end ($) of line:
grep '^D*A*C*N*a*g*.h*ca*e$' <<EOF
Nice
Nicely
Darce
Darcy
Darcey
EOF
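For reference, grep's -x option (part of POSIX) selects only lines that match the pattern in their entirety, which has the same effect as adding ^ and $. A small sketch with the sample lines above:

```shell
# -x: the whole line must match, as if the pattern were wrapped in ^ and $.
result=$(printf '%s\n' Nice Nicely Darce Darcy Darcey |
  grep -x 'D*A*C*N*a*g*.h*ca*e')
printf '%s\n' "$result"
```

Only Nice and Darce are printed. In Darce, D* takes the D, a* takes the a, and . takes the r before the literal c and e.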

Resources