grep workaround with pattern files on MacOS 10.13


I am having a perplexing problem with grep that I can't debug. This is reproducible on Mac OS High Sierra, but the problem does not occur on a current Ubuntu (where it works as expected).
I have three files:
cat haystack
apple
aardvark
cow
cat pattern1
a
aardvark
animal
cat pattern2
c
b
apple
You can create these 3 files with:
perl -e 'print "a\naardvark\nanimal"' > pattern1;
perl -e 'print "c\nb\napple"' > pattern2;
perl -e 'print "apple\naardvark\ncow"' > haystack;
Here's the problem: This yields the expected response:
grep -iowFf pattern2 haystack
apple
To explain, the grep...
-i = case insensitive
-o = display the match
-w = word match <== this is the option which is breaking it
The expression is searched for as a word (as if surrounded by `[[:<:]]' and `[[:>:]]')
-F = fast grep (fixed strings)
-f = read pattern from file
This returns nothing:
grep -iowFf pattern1 haystack
But I would expect "pattern1" to return "aardvark".
I was experimenting with this small testbed, but my real project is much larger. And I found that when I change the sequence of the lines in the patternN files, the results change.
sort -r pattern1 > pattern1.reverse
grep -iowFf pattern1.reverse haystack
That returns "aardvark"
What am I missing? I've been banging my head on this. Is it a bug in MacOS 10.13? Is there a workaround? (yes, one workaround is to replace the -w parameter with \b....\b in my patterns and turn off -F, but I am working on very large files, and I want the performance.)

On MacOSX:
$ grep -V
grep (BSD grep) 2.5.1-FreeBSD
On CentOS 7, for example:
$ grep -V
grep (GNU grep) 2.20
The two versions behave differently (as you noticed). To work around this, you can install the GNU version of grep on macOS with brew install grep, which installs GNU grep with the prefix g. Now you can do:
$ ggrep -iowFf pattern1 haystack
aardvark
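If installing GNU grep is not an option, a single awk pass with a hash lookup is another possible workaround. This is only a sketch: the function name word_match is made up for illustration, and it assumes that "words" in the haystack are separated by whitespace, which is narrower than what grep -w treats as a word boundary.

```shell
# word_match PATTERN_FILE HAYSTACK_FILE   (hypothetical helper name)
# Print every whitespace-separated word of HAYSTACK_FILE that equals,
# ignoring case, one of the lines of PATTERN_FILE.
word_match() {
  awk '
    NR == FNR { pat[tolower($0)] = 1; next }   # first file: load the patterns
    {
      for (i = 1; i <= NF; i++)                # second file: test each word
        if (tolower($i) in pat) print $i
    }
  ' "$1" "$2"
}

# Usage with the files from the question:
#   word_match pattern1 haystack
```

Each pattern is looked up in an awk array instead of being tried as a regex, so the order of lines in the pattern file should not affect the result.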

Related

How to exclude from grep double colons?

I'm trying to find lines with words not preceded by double colons (::).
Example
void myClass::doMything() // I don't want this line
myObj->doMyThing() // I want this line
My goal is to get the lines where some methods are used, but not where the methods are defined.
I try with this command :
grep --color=always -rwna "methodName" --include=*.cpp | grep -v "::methodName"
but it doesn't work : it keeps extracting also lines containing
::methodName
I've also tried by writing
grep --color=always -rwna "methodName" --include=*.cpp | grep -v "\:\:methodName"
egrep --color=always -rwna "methodName" --include=*.cpp | egrep -v "\:\:methodName"
but neither works.
What should I do ?

grep is probably the most commonly used of all the Linux CLI tools, but that doesn't mean it's perfect. What you are trying to achieve is not directly achievable with grep's basic regex; you would need a Perl/Python-style regex (a negative lookbehind) here.
As a workaround (I assume you are trying to only find line where method is invoked) you can try:
grep -Eno "(::)?methodName" your_input_files | grep -v "::methodName"
-n prints the line number, which I believe you will find convenient
-o prints only the matched part; I use it here to split the output so that each match is on its own line (if methodName occurs 5 times in one line of code, you get 5 lines of grep output)
(::)? distinguishes a definition from an invocation of methodName; we need it when the second grep comes into play...
grep -v ...and here it comes, to get rid of what you don't want
I guess you will want to use this many times, so you could even put a function in your .bashrc:
find_invocations () {
# below example goes through current dir, but you can improve it :)
grep --color=yes -Eno "(::)?$1" * 2>/dev/null | grep -v "::$1"
}
In the function above you could take a risk and use $1.* instead of $1, but that goes wrong when a line contains both methodName and ::methodName. As far as I remember from my C++ lessons (ages ago, around 2010), methodName::methodName is a constructor...
...sorry for bad english

I've finally managed to make it work.
I've tried linux_beginner's suggestion:
grep -Eno '(::)?myMethodName' path/to/one/of/the/files.cpp | grep -v '::myMethodName'
with a single file, and this works. (I found I prefer not to use the -o option, because I also want to see how the method is used.)
In this search I need anyway to use multiple files. So I've also tried to include more files :
grep -Eno '(::)?myMethodName' --include=*.cpp | grep -v '::myMethodName'
but in this case it appears to hang. (Without -r or a file argument, grep waits for input on stdin, so no slow Perl or Python scripting is involved.)
I've checked RavinderSingh13's command. Taken in a single instance, it can capture the lines with double colon(and only them, correctly), both on single file or in multiple files :
grep -rna '::myMethodName' path/to/one/of/the/file.cpp
grep -rna '::myMethodName' --include=*.cpp
but the -w switch must not be present, so the following:
grep -rwna '::myMethodName' path/to/one/of/the/file.cpp
grep -rwna '::myMethodName' --include=*.cpp
don't return any results.
RavinderSingh13's suggestion put inside the pipelining doesn't manage to filter out the double colon lines (my original goal), either with single or multiple files :
grep -rwna 'myMethodName' path/to/one/of/the/files.cpp | grep -v '::[[:alpha:]]+'
-> extracts both myMethodName and ::myMethodName from the chosen file
grep -rwna 'myMethodName' --include=*.cpp | grep -v '::[[:alpha:]]+'
-> extracts both myMethodName and ::myMethodName from all the cpp files
Now, how I solved it:
When I chain grep commands, I usually add the --color=always switch to the first one, which preserves the coloring of the results across the pipe.
But that... was the culprit!
i.e., doing
grep --color=always -rwna 'myMethodName' --include=*.cpp | grep -v '::myMethodName'
preserves the color in results, but sadly fails to exclude lines containing ::myMethodName, while
grep -rwna 'myMethodName' --include=*.cpp | grep -v '::myMethodName'
gives colorless but correct results (it manages to filter out the double-colon lines).
The distribution on which I tried these commands is Ubuntu 20.04.1 LTS.
Grep version : grep (GNU grep) 3.4
Thanks everybody for the interest.
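The reason --color=always breaks the filter is that GNU grep wraps each match in terminal escape sequences, so in the piped text the "::" is no longer immediately followed by the method name. One possible way to keep both the filtering and the color is to filter first and colorize last. The sketch below assumes GNU grep, and the helper name find_calls is made up for illustration.

```shell
# find_calls NAME FILE...   (hypothetical helper name)
# List the lines that use NAME as a whole word, drop the ones where it is
# preceded by "::", and only then add color in a final grep.
find_calls() {
  name=$1
  shift
  grep -nw -- "$name" "$@" |
    grep -v -- "::$name" |
    grep --color=always -w -- "$name"
}

# Usage:
#   find_calls myMethodName path/to/one/of/the/files.cpp
```

The last grep matches every line that reaches it, so its only job is to add the highlighting after the filtering is done.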

Why does grep -o add blank lines? [duplicate]

The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more.
Based on the definition, I was wondering why the plus sign returns more matches than the asterisk sign.
echo "ABC ddd kkk DDD" | grep -Eo "[A-Z]+"
returns
ABC DDD
echo "ABC ddd kkk DDD" | grep -Eo "[A-Z]*"
returns
ABC

As far as I can tell, it doesn't. With GNU grep versions 2.5.3, 2.6.3, 2.10, and 2.12, I get:
$ echo "ABC ddd kkk DDD" | grep -Eo "[A-Z]+"
ABC
DDD
$ echo "ABC ddd kkk DDD" | grep -Eo "[A-Z]*"
ABC
DDD
Please double-check your second example. If you can confirm that you get only one line of output, it might be a bug in your grep. If you're using GNU grep, what's the output of grep --version? If not, what OS are you using, and (if you know) what grep implementation?
UPDATE :
I just built and installed GNU grep 2.5.1 (the version you're using) from source, and I confirm your output. It appears to be a bug in that version of grep, apparently corrected between 2.5.1a and 2.5.3. GNU grep 2.5.1 is about 12 years old; can you install a newer version? Looking through the ChangeLog for 2.5.3, I suspect this may have been the fix:
2005-08-24 Charles Levert <charles_levert@gna.org>
* src/grep.c (print_line_middle): In case of an empty match,
make minimal progress and continue instead of aborting process
of the remainder of the line, in case there's still an upcoming
non-empty match.
* tests/foad1.sh: Add two tests for this.
* doc/grep.texi, doc/grep.1: Document this behavior, since
--only-matching and --color are GNU extensions which are
otherwise unspecified by POSIX or other standards.
Even if you don't have full access on the machine you're using, you should still be able to download the source tarball from ftp://ftp.gnu.org/gnu/grep/ and install it under your home directory (assuming your system has a working compiler and associated tools).
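If upgrading is not possible, the question itself shows that the + quantifier gives the expected output on the affected version, since + can never produce an empty match. A minimal sketch follows; the function name upper_runs is made up, and LC_ALL=C is an added assumption to keep the [A-Z] range limited to ASCII uppercase letters.

```shell
# upper_runs   (hypothetical helper name)
# Print each run of uppercase ASCII letters from stdin on its own line.
# "+" requires at least one letter, so empty matches never occur.
upper_runs() {
  LC_ALL=C grep -Eo '[A-Z]+'
}

echo "ABC ddd kkk DDD" | upper_runs
```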

grep: repetition-operator operand invalid

I have this regular expression (?<=heads\/)(.*?)(?=\n) and you can see it working here
http://regexr.com?347dm
I need this regex to work in the grep command but I'm getting this error.
$ grep -Eio '(?<=heads\/)(.*?)(?=\n)' text.txt
grep: repetition-operator operand invalid
It works great in ack, but I don't have ack on the machine I need to run this on.
ack text.txt -o --match '(?<=heads\/)(.*?)(?=\n)'
text.txt
74f3649af36984e1b784e46502fe318e91d29570 HEAD
06d4463ab47a6246e6bd94dc3b9267d59fc16c2e refs/heads/ARC
0597e13c22b6397a1b260951f9d064f668b26f08 refs/heads/LocationAge
e7e1ed942d15efb387c878b9d0335b37560c8807 refs/heads/feature/311-312-breaking-banner-updates
d0b2632b465702d840a358d0b192198ae505011c refs/heads/gulf-news
509173eafc6792739787787de0d23b0c804d4593 refs/heads/jbb-new-applicationdidfinishlaunching
1e7b03ce75b1a7ba47ff4fb5128bc0bf43a7393b refs/heads/locationdebug
74f3649af36984e1b784e46502fe318e91d29570 refs/heads/master
5d2ede384325877c24db7ba1ba0338dc7b7f84fb refs/heads/mixed-media
3f3b6a81dd3baea8744aec6b95c2fe4aaeb20ea3 refs/heads/post-onezero
4198a43aab2dfe72d7ae9e9e53fbb401fc9dac1f refs/heads/whitelabel
76741013b3b2200de29f53800d51dfd6dc7bac5e refs/tags/r10
fc53b1a05dad3072614fb397a228819a67615b82 refs/tags/r10^{}
afdcfd970c9387f6fda0390ef781c2776aa666c3 refs/tags/r11

grep -E uses POSIX extended regular expressions, which do not support the lookbehind (?<=...), lazy *?, or lookahead (?=...) operators. See this table.

$ grep -Pio '(?<=heads\/)(.*?)(?=\n)' text.txt # P option instead of E
If you use GNU grep, you can use -P or --perl-regexp options.
In case you are using OS X, you need to install GNU grep.
$ brew install grep
Due to recent changes, to use GNU grep on macOS you either have to prepend the command with a 'g'
$ ggrep -Pio '(?<=heads\/)(.*?)(?=\n)' text.txt # P option instead of E
Or change the path name

Try this
grep -Eoh 'heads/.*' text.txt | grep -Eoh '/.*' | grep -Eoh '[a-zA-Z].*'
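Where neither -P nor ack is available, a sed substitution is another option that needs no lookbehind. This is a sketch that assumes input lines shaped like text.txt above, with at most one refs/heads/ per line. The function name branch_names is made up for illustration.

```shell
# branch_names   (hypothetical helper name)
# For each stdin line containing "refs/heads/", print what follows it.
# -n suppresses normal output; the p flag prints only lines where the
# substitution happened, so HEAD and refs/tags lines are dropped.
branch_names() {
  sed -n 's|^.*refs/heads/||p'
}

# Usage:
#   branch_names < text.txt
```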

Simple Grep Issue

I am trying to parse items out of a file I have. I can't figure out how to do this with grep.
here is the syntax
<FQDN>Compname.dom.domain.com</FQDN>
<FQDN>Compname1.dom.domain.com</FQDN>
<FQDN>Compname2.dom.domain.com</FQDN>
I want to spit out just the bits between the > and the <
can anyone assist?
Thanks

grep can do some text extraction. however not sure if this is what you want:
grep -Po "(?<=>)[^<]*"
test
kent$ echo "<FQDN>Compname.dom.domain.com</FQDN>
<FQDN>Compname1.dom.domain.com</FQDN>
<FQDN>Compname2.dom.domain.com</FQDN>" | grep -Po "(?<=>)[^<]*"
Compname.dom.domain.com
Compname1.dom.domain.com
Compname2.dom.domain.com

Grep isn't what you are looking for.
Try sed with a regular expression : http://unixhelp.ed.ac.uk/CGI/man-cgi?sed
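The sed suggestion could look roughly like the sketch below. It assumes one <FQDN> element per line, as in the sample, and the function name fqdn_values is made up for illustration.

```shell
# fqdn_values   (hypothetical helper name)
# Print the text between <FQDN> and </FQDN> for each stdin line that
# contains such an element; other lines produce no output.
fqdn_values() {
  sed -n 's|.*<FQDN>\(.*\)</FQDN>.*|\1|p'
}

# Usage:
#   fqdn_values < FILE
```

This uses only basic regular expressions, so it does not depend on grep -P being available.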

You can do it like you want with grep :
grep -oP '<FQDN>\K[^<]+' FILE
Output:
Compname.dom.domain.com
Compname1.dom.domain.com
Compname2.dom.domain.com

As others have said, grep is not the ideal tool for this. However:
$ echo '<FQDN>Compname.dom.domain.com</FQDN>' | egrep -io '[a-z]+\.[^<]+'
Compname.dom.domain.com
Remember that grep's purpose is to MATCH things. The -o option shows you what it matched. In order to make regex conditions that are not part of the expression that is returned, you'd need to use lookahead or lookbehind, which most command-line grep does not support because it's part of PCRE rather than ERE.
$ echo '<FQDN>Compname.dom.domain.com</FQDN>' | grep -Po '(?<=>)[^<]+'
Compname.dom.domain.com
The -P option will work in most Linux environments, but not in *BSD or OSX or Solaris, etc.

Can grep show only words that match search pattern?

Is there a way to make grep output "words" from files that match the search expression?
If I want to find all the instances of, say, "th" in a number of files, I can do:
grep "th" *
but the output will be something like:
some-text-file : the cat sat on the mat
some-other-text-file : the quick brown fox
yet-another-text-file : i hope this explains it thoroughly
What I want it to output, using the same search, is:
the
the
the
this
thoroughly
Is this possible using grep? Or using another combination of tools?

Try grep -o:
grep -oh "\w*th\w*" *
Edit: matching from Phil's comment.
From the docs:
-h, --no-filename
Suppress the prefixing of file names on output. This is the default
when there is only one file (or only standard input) to search.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.

Cross distribution safe answer (including windows minGW?)
grep -h "[[:alpha:]]*th[[:alpha:]]*" 'filename' | tr ' ' '\n' | grep -h "[[:alpha:]]*th[[:alpha:]]*"
If you're using an older version of grep (like 2.4.2) that does not include the -o option, use the above. Otherwise use the simpler, easier-to-maintain version below.
Linux cross distribution safe answer
grep -oh "[[:alpha:]]*th[[:alpha:]]*" 'filename'
To summarize: -oh prints only the text matched by the regular expression (without the filename), just as you would expect a regular expression to behave in vim and similar tools. Which word or regular expression you search for is up to you, as long as you stay with POSIX and not Perl syntax (see below).
More from the manual for grep
-o Print each match, but only the match, not the entire line.
-h Never print filename headers (i.e. filenames) with output lines.
-w The expression is searched for as a word (as if surrounded by
`[[:<:]]' and `[[:>:]]';
The reason why the original answer does not work for everyone
The usage of \w varies from platform to platform, as it is an extended "Perl" syntax. grep installations that are limited to POSIX character classes need [[:alpha:]] rather than the Perl-style \w. See the Wikipedia page on regular expressions for more.
Ultimately, the POSIX answer above will be a lot more reliable regardless of platform (being the original) for grep
As for support of grep without -o option, the first grep outputs the relevant lines, the tr splits the spaces to new lines, the final grep filters only for the respective lines.
(PS: I know most platforms by now would have been patched for \w.... but there are always those that lag behind)
Credit for the "-o" workaround goes to @AdamRosenfield's answer.

It's more simple than you think. Try this:
egrep -wo 'th.[a-z]*' filename.txt #### (Case Sensitive)
egrep -iwo 'th.[a-z]*' filename.txt ### (Case Insensitive)
Where,
egrep: grep with extended regular expressions.
w : Matches only whole words instead of substrings.
o : Displays only the matched pattern instead of the whole line.
i : Ignores case.

You could translate spaces to newlines and then grep, e.g.:
cat * | tr ' ' '\n' | grep th

Just awk; no need for a combination of tools.
# awk '{for(i=1;i<=NF;i++){if($i~/^th/){print $i}}}' file
the
the
the
this
thoroughly

grep command for only matching and perl
grep -o -P 'th.*? ' filename

I was unsatisfied with awk's hard to remember syntax but I liked the idea of using one utility to do this.
It seems like ack (or ack-grep if you use Ubuntu) can do this easily:
# ack-grep -ho "\bth.*?\b" *
the
the
the
this
thoroughly
If you omit the -h flag you get:
# ack-grep -o "\bth.*?\b" *
some-other-text-file
1:the
some-text-file
1:the
the
yet-another-text-file
1:this
thoroughly
As a bonus, you can use the --output flag to do this for more complex searches with just about the easiest syntax I've found:
# echo "bug: 1, id: 5, time: 12/27/2010" > test-file
# ack-grep -ho "bug: (\d*), id: (\d*), time: (.*)" --output '$1, $2, $3' test-file
1, 5, 12/27/2010

cat *-text-file | grep -Eio "th[a-z]+"

You can also try pcregrep. There is also a -w option in grep, but in some cases it doesn't work as expected.
From Wikipedia:
cat fruitlist.txt
apple
apples
pineapple
apple-
apple-fruit
fruit-apple
grep -w apple fruitlist.txt
apple
apple-
apple-fruit
fruit-apple

I had a similar problem, looking for grep/pattern regex and the "matched pattern found" as output.
At the end I used egrep (same regex on grep -e or -G didn't give me the same result of egrep) with the option -o
so, I think that could be something similar to (I'm NOT a regex Master) :
egrep -o "the*|this{1}|thoroughly{1}" filename

To find all the words that start with "icon-", the following command works perfectly. I am using ack here, which is similar to grep but with better options and nicer formatting.
ack -oh --type=html "\w*icon-\w*" | sort | uniq

You could pipe your grep output into Perl like this:
grep "th" * | perl -n -e'while(/(\w*th\w*)/g) {print "$1\n"}'

grep --color -o -E "Begin.{0,}?End" file.txt
? - Match as few as possible until the End
Tested on macos terminal

$ grep -w
Excerpt from grep man page:
-w: Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character.

ripgrep
Here is an example using ripgrep:
rg -o "(\w+)?th(\w+)?"
It matches every word containing th.
