GNU grep, backreferences and wildcards - grep

Using grep (GNU grep 3.3) to search for all words with three consecutive double-letters (resulting in "bookkeeper"):
grep -E "((.)\2){3}" /usr/share/dict/american-english
Changing this to search for words with three double-letters, each of them followed by the letter "i" (resulting in "Mississippi"):
grep -E "((.)\2i){3}" /usr/share/dict/american-english
Changing this to search for words with three double-letters, each of them followed by any single letter (with a couple of results):
grep -E "((.)\2.){3}" /usr/share/dict/american-english
Changing this to search for words with three double-letters separated by an optional, single letter (even more results):
grep -E "((.)\2.?){3}" /usr/share/dict/american-english
Now, finally, my original task: Search for all words containing three double-letters:
grep -E "((.)\2.*){3}" /usr/share/dict/american-english
But this results in an empty set. Why? How can .? match something .* does not?

The POSIX regex engine does not handle patterns with back-references well; the fact that matching with back-references is an NP-complete problem may provide some hints on why it is that difficult.
Since you are using GNU grep, the problem is easily solved with the PCRE engine,
grep -P '((.)\2.*){3}' file
since the PCRE engine can handle back-references in a more efficient way than the POSIX regex engine.
See the online demo.
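As a quick check of the PCRE approach, here is a minimal sketch run against a small, made-up word list instead of the dictionary file. It assumes a GNU grep built with PCRE support (the -P option); the word list and the function name three_doubles are illustrative only and not part of the original question.

```shell
# Hypothetical sample list; the question itself uses
# /usr/share/dict/american-english.
words='bookkeeper
Mississippi
committee
hello
balloon'

# -P selects the PCRE engine, which backtracks through .* until it
# finds three double-letters anywhere in the word.
three_doubles() {
  printf '%s\n' "$words" | grep -P '((.)\2.*){3}'
}

three_doubles
# prints:
# bookkeeper
# Mississippi
# committee
```

"hello" and "balloon" are skipped because they contain fewer than three double-letters.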

Related

GREP: Find words containing multiple specific characters

To clarify a bit, while I am aware how to grab words with a single specific character, I'm unsure how to approach looking for multiple of them. For example, what grep command would be used to retrieve only the words containing both "b" and "p" (in any order), not just one or the other?
Using the above example, if you're given words like "bear," "pear," "biography," and "printable," it would only return the last two words. These are some of my previous attempts.
grep -E "\b[bp]\b" input
grep -E "\b(b|p)\b" input
grep -E "\bb.*p\b" input
You can do it with a regular expression. For instance, here is a code snippet for your problem:
grep '\w*[b]\w*[p]\w*\|\w*[p]\w*[b]\w*' test.txt
Helpful links to read further:
https://www.cyberciti.biz/faq/grep-regular-expressions/
https://regexr.com/
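If the words share a line, a variant of the same pattern can print just the matching words. This is only a sketch: it assumes GNU grep (for -o and the \w extension), and the function name words_with_b_and_p is made up for illustration.

```shell
# Print each word on stdin that contains both "b" and "p", in either order.
# -o prints only the matched text; -E avoids escaping the alternation;
# \w is a GNU extension for word characters.
words_with_b_and_p() {
  grep -oE '\w*b\w*p\w*|\w*p\w*b\w*'
}

printf '%s\n' 'bear pear biography printable' | words_with_b_and_p
# prints:
# biography
# printable
```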

How to grep for files using 'and' operator, words might not be on the same line

I have a directory /dir which has several text files in it. These files may or may not contain the words 'rock' and 'stone': some files might contain only the word 'rock', some only the word 'stone', some may contain both, and some may contain neither.
How can I list all files in this directory that contain both 'rock' and 'stone'? These words might not be on the same line so I don't think piping through grep twice would work.
Appreciate any help, I was not able to find a stackoverflow post with this problem so I figured I'd ask.
To search files that match the given two (or more) words at any line anywhere in the file, you may want to try ugrep:
ugrep -F --files -e 'rock' --and -e 'stone' dir
This only matches files that have both rock and stone in them. Lines that contain rock or stone are output, or you can use option -l to just list the files. The -F option searches for fixed strings (like grep -F and fgrep), and --files applies the --and file-wide, which is what you want here instead of applying --and per line. Because there is more than one pattern in this case, each pattern should be given with option -e (grep requires this as well).
A shorter form with --bool:
ugrep -F --files --bool 'rock stone' dir
where --bool formulates a Boolean query with space as AND (or use AND).
If you want to search directory dir recursively in subdirectories, use option -r.
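If ugrep is not available, the same file-wide AND can be approximated with plain grep, because the -l option works on whole files rather than lines. The following is a sketch, not the answer's method; it assumes GNU grep and GNU xargs (for -Z, -0 and -r), and the function name files_with_both is hypothetical. Note that grep -r also descends into subdirectories.

```shell
# List files under a directory that contain both fixed strings,
# anywhere in the file (not necessarily on the same line).
files_with_both() {
  dir=$1 first=$2 second=$3
  # Step 1: NUL-separated names of files containing the first string.
  # Step 2: of those files, list the ones that also contain the second.
  grep -rlZF -e "$first" "$dir" | xargs -0 -r grep -lF -e "$second"
}

# Usage (dir is the directory from the question):
#   files_with_both dir rock stone
```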

Ag / Grep Exact Match Only Search

I am having an issue with using Ag (The Silver Searcher)...
In the docs it says to use -Q for exact match, but I don't understand why it does not work for my purposes. If I type something like ag -Q actions or ag -Q 'actions' into my terminal, it returns all instances of actions, including things like transactions and any other strings that actions is part of.
I have tried a couple other combinations of flags from the docs, including -s and -S, among others, but still I cannot get strictly strings matching just actions to return for me.
I can't get this to work with grep either. Does anyone know how I can get what I need with ag? (or even with grep)...?
Thank you in advance!
Because ag (and grep) find files that contain something. ag -Q means to interpret the search as an exact literal string, not a fuzzy string or a regex. Okay. But a file that has the word "transactions" in it contains exactly, literally the character sequence actions. Sure, it contains more than that too, but that's not surprising.
Probably you're looking for a word-boundary search, grep '\bactions\b' or ag -w -Q actions (maybe ag -w -Q -s actions). But that is not at all the same thing as "just actions", it's a specific requirement on the things surrounding "actions" (namely that they be the beginning or end of a line, or non-letter characters). You have to tell the computer what you actually mean.
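To make the difference concrete, here is a small sketch with grep on made-up input. The sample lines and the function name whole_word_lines are assumptions for illustration; -w and -F are standard GNU grep options.

```shell
# Made-up sample lines; only two contain "actions" as a whole word.
sample='actions
transactions
user actions log
reactions'

# -F: treat the pattern as a literal string; -w: match whole words only.
whole_word_lines() {
  printf '%s\n' "$sample" | grep -wF 'actions'
}

whole_word_lines
# prints:
# actions
# user actions log
```

Without -w, all four lines would be printed, since each of them contains the character sequence actions.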

Grep Search And Replace in Mass Files

Even though there are lots of grep questions and answers, none of them answer this, and I need help with it. I need to make
Title-BEX-override-8>"
expressions to become
Title-BEX>"
Any letters or words between Title-BEX and >" should be removed. I need an exact grep expression for this.
Optionally, answers can also cover this: I want to do this in multiple files, and I would prefer doing it on a Mac.
grep cannot do text replacement. Try sed:
sed 's/Title-BEX-override-8/Title-BEX/g' file
The -i option lets you do it "in place". On macOS, the BSD sed requires an argument (a backup suffix, which may be empty) after -i, for example sed -i '' 's/Title-BEX-override-8/Title-BEX/g' file.
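Since the question asks to drop anything between Title-BEX and >", a more general substitution may be closer to what is wanted. This is a sketch under that reading of the question; the function name strip_title_suffix and the file glob in the comment are hypothetical.

```shell
# Remove whatever sits between Title-BEX and >" (here: -override-8).
# [^>]* matches any run of characters other than ">".
strip_title_suffix() {
  sed 's/Title-BEX[^>]*>"/Title-BEX>"/g'
}

printf '%s\n' 'Title-BEX-override-8>"' | strip_title_suffix
# prints: Title-BEX>"

# In place over several files with BSD/macOS sed (file names are hypothetical):
#   sed -i '' 's/Title-BEX[^>]*>"/Title-BEX>"/g' *.html
```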

What flavour of regular expression is grep?

I'm guessing it's not a Perl Compatible Regular Expression, since there's a special kind of grep which is specifically PCRE. What's grep most similar to?
Are there any special quirks of grep that I need to know about? (I'm used to Perl and the preg functions in PHP)
Default GNU grep behavior is to use a slightly flavorful variant on POSIX basic regular expressions, with a similarly tweaked species of POSIX extended regular expressions for egrep (usually an alias for grep -E). POSIX ERE is what PHP ereg() uses.
GNU grep also claims to support grep -P for PCRE, by the way. So no terribly special kind of grep required.
POSIX BRE (Basic Regular Expressions)
You can compare the various flavors here.
There's a good write-up here. To quote the page, "grep is supposed to use BREs, except that grep -E uses EREs. (GNU grep fits some extensions in where POSIX leaves the behaviour unspecified)."
In other words, it's a long story. ;)
Grep is an implementation of POSIX regular expressions. There are two types of POSIX regular expressions -- basic regular expressions and extended regular expressions. In grep, generally you use the -E option to allow extended regular expressions.
The grep man pages do a pretty thorough job of explaining the flavor of regexp available in grep. man grep is pretty useful.
There is no regular grep function in PHP. If you are referring to the ereg family of PHP functions then those are POSIX regular expressions. If you are referring to the Linux grep commandline utility, those are POSIX regular expressions as well. It supports both basic as well as extended POSIX regular expressions.
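One quirk that tends to surprise people coming from Perl is how + is treated in the two POSIX flavors. The sketch below illustrates it on made-up input; the function names bre_plus and ere_plus are hypothetical, and the behaviour shown is that of GNU grep.

```shell
# BRE (the default): an unescaped + is an ordinary character,
# so the pattern matches a literal "a+".
bre_plus() { printf 'aa\na+\n' | grep 'a+'; }

# ERE (-E): + means "one or more of the preceding item",
# so any line containing an "a" matches.
ere_plus() { printf 'aa\na+\n' | grep -E 'a+'; }

bre_plus
# prints:
# a+

ere_plus
# prints:
# aa
# a+
```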

Resources