I'm guessing it's not a Perl Compatible Regular Expression, since there's a special kind of grep which is specifically PCRE. What's grep most similar to?
Are there any special quirks of grep that I need to know about? (I'm used to Perl and the preg functions in PHP)
Default GNU grep behavior is to use a slightly flavorful variant on POSIX basic regular expressions, with a similarly tweaked species of POSIX extended regular expressions for egrep (usually an alias for grep -E). POSIX ERE is what PHP ereg() uses.
GNU grep also claims to support grep -P for PCRE, by the way. So no terribly special kind of grep required.
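To make the BRE/ERE difference concrete, here is a minimal sketch. The two sample lines are made up for illustration; it only assumes a grep that follows the POSIX rules for `+` and intervals, which GNU grep does.

```shell
# Two sample lines: one made only of the letter a, one with a literal plus sign.
sample=$(printf 'aa\na+\n')

# BRE (plain grep): an unescaped + is an ordinary character,
# so the pattern only matches the literal text "a+".
bre_out=$(printf '%s\n' "$sample" | grep 'a+')

# ERE (grep -E): + means "one or more", so both lines contain a match.
ere_out=$(printf '%s\n' "$sample" | grep -E 'a+')

# Intervals need backslashes in BRE but not in ERE; both select only "aa".
bre_interval=$(printf '%s\n' "$sample" | grep 'a\{2\}')
ere_interval=$(printf '%s\n' "$sample" | grep -E 'a{2}')

printf 'BRE a+ matched:\n%s\n' "$bre_out"
printf 'ERE a+ matched:\n%s\n' "$ere_out"
```

Coming from Perl or preg, the main surprise is usually that plain grep wants `\{`, `\(` and friends escaped, while grep -E reads much closer to what you are used to.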
POSIX BRE (Basic Regular Expressions)
You can compare the various flavors here.
There's a good write-up here. To quote the page, "grep is supposed to use BREs, except that grep -E uses EREs. (GNU grep fits some extensions in where POSIX leaves the behaviour unspecified)."
In other words, it's a long story. ;)
Grep is an implementation of POSIX regular expressions. There are two types of POSIX regular expressions: basic regular expressions and extended regular expressions. In grep, you generally use the -E option to enable extended regular expressions.
The grep man pages do a pretty thorough job of explaining the flavor of regexp available in grep. man grep is pretty useful.
There is no regular grep function in PHP. If you are referring to the ereg family of PHP functions then those are POSIX regular expressions. If you are referring to the Linux grep commandline utility, those are POSIX regular expressions as well. It supports both basic as well as extended POSIX regular expressions.
Using grep (GNU grep 3.3) to search for all words with three consecutive double-letters (resulting in "bookkeeper"):
grep -E "((.)\2){3}" /usr/share/dict/american-english
Changing this to search for words with three double-letters, each of them followed by the letter "i" (resulting in "Mississippi"):
grep -E "((.)\2i){3}" /usr/share/dict/american-english
Changing this to search for words with three double-letters, each of them followed by any single letter (with a couple of results):
grep -E "((.)\2.){3}" /usr/share/dict/american-english
Changing this to search for words with three double-letters separated by an optional, single letter (even more results):
grep -E "((.)\2.?){3}" /usr/share/dict/american-english
Now, finally, my original task: Search for all words containing three double-letters:
grep -E "((.)\2.*){3}" /usr/share/dict/american-english
But this results in an empty set. Why? How can .? match something .* does not?
The POSIX regex engine does not handle patterns with back-references well. The fact that matching with back-references is an NP-complete problem might provide some hints on why it is that difficult.
Since you are using GNU grep, the problem is easily solved with the PCRE engine,
grep -P '((.)\2.*){3}' file
since the PCRE engine can handle back-references in a more efficient way than the POSIX regex engine.
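If grep -P is not available, one possible workaround (not from the answer above, just a sketch) is to unroll the repetition so that every double letter gets its own group and back-reference. The inline word list is a made-up stand-in for the dictionary file, and back-references in grep -E are a GNU extension, so this assumes GNU grep.

```shell
# Small inline word list standing in for /usr/share/dict/american-english.
words=$(printf '%s\n' bookkeeper Mississippi committee banana)

# Three consecutive double letters, as in the first command of the question.
consecutive=$(printf '%s\n' "$words" | grep -E '((.)\2){3}')

# Unrolled form: three separate groups, each followed by its own
# back-reference, with .* allowed between the double letters.
anywhere=$(printf '%s\n' "$words" | grep -E '(.)\1.*(.)\2.*(.)\3')

printf 'consecutive doubles:\n%s\n' "$consecutive"
printf 'three doubles anywhere:\n%s\n' "$anywhere"
```

The unrolled pattern avoids putting a back-reference inside a repeated group, which is the construct the original command trips over.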
See the online demo.
I am trying to understand and read the man page. Yet every day I find more inconsistent syntax, and I would like some clarification as to whether I am misunderstanding something.
Within the man page, it specifies the syntax for grep is grep [OPTIONS] [-e PATTERN]... [-f FILE]... [FILE...]
I got a working example that recursively searches all files within a directory for a keyword.
grep -rnw . -e 'memes'
Now this example works, but I find it very inconsistent with the man page. According to the synopsis, the directory (which the man page writes as [FILE...], while describing elsewhere what happens when a file is a directory) should come last. Yet in this example it comes after [OPTIONS] and before [-e PATTERN].... Why is this allowed? It does not follow the specified regex rule for using this command.
Why is this allowed, it does not follow the specified regex rule of using this command?
The lines in the SYNOPSIS section of a manpage are not to be understood as strict regular expressions, but as a brief description of the syntax of a utility's arguments.
Depending on the particular application, the parser might be more or less flexible in how it accepts its options. After all, each program can implement whatever grammar it likes for its arguments. Therefore, some might allow options at the beginning, at the end, or even in between files (typically with ways to handle any ambiguity that may arise, e.g. reading from the standard input with -, filenames starting with -...).
Now, of course, there are some ways to do it that are common. For instance, POSIX.1-2017 12.1 Utility Argument Syntax says:
This section describes the argument syntax of the standard utilities and introduces terminology used throughout POSIX.1-2017 for describing the arguments processed by the utilities.
In your particular case, your implementation of grep (probably GNU grep) allows options to be passed in between the files in the list, as you have discovered.
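A quick way to convince yourself is to run both orderings against a scratch directory and compare the output. This is only a sketch: the directory, file name and file contents are made up, and it assumes a grep whose option parser reorders arguments the way GNU getopt does by default.

```shell
# Throwaway directory with one file containing the keyword (names are made up).
tmp=$(mktemp -d)
printf 'dank memes here\nnothing to see\n' > "$tmp/a.txt"

# Order from the question: the directory comes before -e PATTERN.
out_mixed=$(cd "$tmp" && grep -rnw . -e 'memes')

# Order from the SYNOPSIS: options and pattern first, directory last.
out_synopsis=$(cd "$tmp" && grep -rnw -e 'memes' .)

# Clean up and show both results; they should be identical.
rm -rf "$tmp"
printf '%s\n%s\n' "$out_mixed" "$out_synopsis"
```

Both commands find the same line, which suggests the parser sorts options from operands itself rather than relying on their position.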
For more information, see:
https://unix.stackexchange.com/questions/17833/understand-synopsis-in-manpage
Are there standards for Linux command line switches and arguments?
https://www.gnu.org/software/libc/manual/html_node/Getopt-Long-Options.html
You can also leverage .
grep 'string' * -lR
Currently when I have to search for complex patterns in code, I typically use a combination of find and grep in the form:
find / \( -type f -regextype posix-extended -regex '.*python3.*py' \) -exec grep -EliI '\b__[[:alnum:]]*_\b' {} \; -exec cat {} \; > ~/python.py
While this looks like a lot to type, it's actually quite short if you use zsh. I just type f (the first character) and go directly to this command from my command history. Further, the regex syntax in find/grep is standardized and tested, so there are no surprises or missed matches.
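For comparison, a rough sketch of the same find-then-grep idea using GNU grep's own recursion with --include. The directory layout, file names and the simplified pattern are made up for illustration and are not the command above.

```shell
# Scratch tree with two Python files and one text file (all names made up).
tmp=$(mktemp -d)
mkdir -p "$tmp/python3/lib"
printf 'def __init__(self):\n    pass\n' > "$tmp/python3/lib/a.py"
printf 'x = 1\n' > "$tmp/python3/lib/b.py"
printf '__init__\n' > "$tmp/python3/lib/notes.txt"

# -r recurse, -l list file names only, -E extended regex, -I skip binary files;
# --include limits the search to files whose name matches the glob.
matches=$(cd "$tmp" && grep -rlEI --include='*.py' '__[[:alnum:]]+__' .)

rm -rf "$tmp"
printf '%s\n' "$matches"
```

This only filters on the file name, not the whole path, so it is not a drop-in replacement for find's -regex test.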
ripgrep, ag, etc. are new software, which might not be supported a few years down the line when the original maintainer loses interest.
Is there any plan to include the .gitignore rules or optimizations of ag/ack/rg in grep or other versions of grep? Is there any reason why these optimizations were not, or are not going to be, included in grep?
For those of you who switched over: Did you guys find it worthwhile to switch over to rg/ag/ack especially because there is going to be a learning curve for these tools as well?
Use ag.
The key part of your example: ag -G '.*python3.*py' '\b__[[:alnum:]]*_\b'
Ag is here to stay and uses Perl regex (PCRE), which is far more flexible than POSIX basic or extended regular expressions. grep -P uses the Perl regex engine too, so it is akin to using ag without some of the latter's more modern features. Likewise, ack is like ag but slower (though admittedly it has a few more bells and whistles). Ag's file regex filtering (the -G flag, as exemplified above) and built-in file type filters (e.g. --python) are very handy. The recently renamed .ignore file also provides finer tuning.
Since most modern scripting languages either use PCRE or handle regexes with features similar to PCRE's (Perl, Python, Ruby), and many other languages (Java, C++) have near-equivalent feature sets (e.g. java.util.regex, Boost.Regex), I consider this the main reason to switch. Moreover, it is satisfying to unify your programming with your command-line skill set.
From my point of view, ripgrep is ag's main contender because it is faster and has an easy way to add file types. That said, its regex engine is not as flexible: no backreferences or look-arounds. With this in mind, I recommend Ag.
Some command options are with one dash e.g. ruby -c (check syntax) and ruby --copyright (print copyright). Is there any pattern to this?
These are known as short and long options. Which name/format a developer uses for options of his program is totally up to him.
However, there are some widespread conventions. Like -v/--version for printing version number, -h/--help for printing usage instructions, etc.
Sadly, most commandline tools on OSX seem not to conform to -v/-h.
Good CLI (command-line interface) design dictates that options of a program that are most useful should have two formats, short and long. You use short format in your everyday life (because it's faster to type).
ps aux | grep ruby
Long ones are for scripts that you write and rarely touch (they're easier to read and understand).
mongod --logpath /path/to/logs --dbpath /path/to/db --fork --smallfiles
Many less-used options may have only the long version (because, you know, there are only 26 letters in the Latin alphabet).
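As a small illustration with a tool that offers both forms, here is the same search written with short and with long options. The input lines are made up; the long spellings are the ones GNU grep documents for -i and -n.

```shell
# Short options: quick to type at the prompt.
short_out=$(printf 'Ruby\nperl\n' | grep -i -n 'ruby')

# Long options: the same search, spelled out for readability in a script.
long_out=$(printf 'Ruby\nperl\n' | grep --ignore-case --line-number 'ruby')

printf '%s\n%s\n' "$short_out" "$long_out"
```

Both variables end up holding the same matching line with its line number.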
On many rails commands there is a pattern. One dash is an abbreviation for a two dash option, e.g. rspec -o FILE is a synonym for rspec --out FILE.
How can I find all lines in Delphi source code using GExperts grep search which contain a string literal instead of a resource string, except those lines which are marked as 'do not translate'?
Example:
this line should match
ShowMessage('Fatal error! Save all data and restart the application');
this line should not match
FieldByName('End Date').Clear; // do not translate
(Asking specifically about GExperts as it has a limited grep implementation afaik)
Regular Expressions cannot be negated in general.
Since you want to negate a portion of the search, this comes as close as I could get it within the RegEx boundaries that GExperts Grep Search understands:
\'.*\'.*[^n][^o][^t][^ ][^t][^r][^a][^n][^s][^l][^a][^t][^e]$
Edit: Forgot the end-of-line $ marker, as GExperts Grep Search cannot do without.
blokhead explains why you cannot negate in general.
This Visual Studio Quick Search uses the tilde for negation, but the GExperts Grep Search cannot.
The grep command-line search has the -v (reverse) option to negate a complete search (but not a partial search).
A perfect manual negation gets complicated very rapidly.
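Outside GExperts, if a command-line grep is available, the partial negation can be sketched as two searches chained together: one that keeps lines with a string literal, and a second with -v that drops the marked ones. This assumes you can run grep over the source files at all; the two input lines are the ones from the question.

```shell
# The two example lines from the question, one per line.
src=$(printf '%s\n' \
  "ShowMessage('Fatal error! Save all data and restart the application');" \
  "FieldByName('End Date').Clear; // do not translate")

# First grep keeps lines containing a quoted literal;
# second grep (-v) removes the lines marked as not translatable.
untranslated=$(printf '%s\n' "$src" | grep "'.*'" | grep -v 'do not translate')

printf '%s\n' "$untranslated"
```

Each grep on its own is a complete search, so -v can negate the second one without any hand-written character-class tricks.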
--jeroen