grep using wildcard .* but ignore whitespaces - grep

I need to find all instances of &var[0] in C++ code. Using grep -r "&.*\[0\]" . found those but also int& aaa = bbb[0];. How can I make the .* wildcard in grep not match whitespace at all?

What about "&\w*\[0\]"? \w matches "word characters", i.e. [A-Za-z0-9_], so it cannot match the spaces in int& aaa = bbb[0]; and should do what you're looking for.
EDIT: Elevating this excellent point from the comments:
\w is nonstandard, so it'd be safer to replace it with something like [[:alnum:]_]
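For what it's worth, here is a small sketch of the bracket-expression variant in action. The two sample lines and the variable names sample and matches are made up for illustration; only the pattern comes from the answer above.

```shell
# Hypothetical sample input: one address-of expression and one reference declaration.
sample='foo(&var[0]);
int& aaa = bbb[0];'

# [[:alnum:]_]* cannot cross the space after "&", so only the first line matches.
matches=$(printf '%s\n' "$sample" | grep -o '&[[:alnum:]_]*\[0\]')
printf '%s\n' "$matches"
```

With -o only the matched text is printed, which makes it easy to see that the declaration line is no longer picked up.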

Related

correct locale setting for devnagari unicode text

The following output is wrong. There should be only 1 word returned instead of 2.
$ echo 'उद्योजकता' | grep -o -E '\w+'
उद
योजकता
I have been told this is due to locale setting. I have checked it on 2 different servers with 2 different O/S and the results are the same.
Ubuntu
$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
AWS EC2
# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
I am not sure which locale setting should be selected to get the Devnagari unicode text to break only at space.
Edited to add:
You can use something like this instead if the grep on your machine supports Perl-compatible regular expressions (PCRE). This should be the case e.g., on Amazon Linux.
echo 'उद्योजकता' |grep -o -P '[\w\pL\pM]+'
which will match "word" characters OR "Letter" code points OR "Mark" code points,
OR
echo 'उद्योजकता' |grep -o -P '(*UCP)[\w\pM]+'
which will enable unicode character property (UCP) matching so that \w matches all letters and numbers no matter the script, but you must still include \pM in the pattern because \w simply does not match "Mark" points in grep -- they are not alphanumeric code points.
Be careful with the above! I don't know Devanagari script so I don't know if it's appropriate or not to consider all such "Mark" characters as being part of a word for your purposes. It might be that a narrower set such as \p{Mn} (for non-spacing marks) is more appropriate for your needs, or perhaps there are only a few specific points to include, in which case you'd need to select them individually in your pattern.
In the old days, \w just meant [A-Za-z0-9_]. It was oriented around ASCII, and C code. Today, in typical interpretations, it still means "alphanumeric and underscore", which can vary depending on locale.
You say that the output is "wrong", but I'm afraid "wrong" is dependent on which regular expression engine you are using. So even though you are using "grep", the question is which grep, on which OS, etc., etc.
Your input contains 0x094d, as far as I can tell, which is not a "Letter", according to the unicode character definition (at least, not per the link above). It is a "Mark".
There is a Unicode "document" (recommendation) which includes how an engine might define "\w" to be Unicode-smart, and indeed it suggests to include Mark codepoints in the match. So your expectation is natural in that sense. However, you can see from the same link that there is no way to do this and also to be strictly POSIX-compliant at the same time, which lots of regex engines want to do.
Wikipedia indicates that there are some engines which support the Unicode property definitions, but in general, grep isn't going to do it. I'm not familiar enough with those engines (ruby, etc) to say exactly how you should attempt the same thing on command line as you are trying to do with grep.
The macOS (11.2.3) man page for grep has this note at the bottom:
BUGS
The grep utility does not normalize Unicode input, so a pattern containing composed characters will not match decomposed input, and vice versa.
If you are okay with a solution for Devnagari text alone, these would help. As per wikipedia, the Unicode range is U+0900 to U+097f. So, if your shell supports $'...' form, you can use:
$ echo 'उद्योजकता' | grep -oE $'[\u0900-\u097f]+'
उद्योजकता
If PCRE is available:
$ echo 'उद्योजकता' | grep -oP '[\x{900}-\x{97f}]+'
उद्योजकता
Use ripgrep for better Unicode support.
$ echo 'उद्योजकता' | rg -o '\w+'
उद्योजकता

Find a string between two characters with grep

I have found on this answer the regex to find a string between two characters. In my case I want to find every pattern between ‘ and ’. Here's the regex :
(?<=‘)(.*?)(?=’)
Indeed, it works when I try it on https://regex101.com/.
The thing is I want to use it with grep but it doesn't work :
grep -E '(?<=‘)(.*?)(?=’)' file
Is there anything missing ?
Those are positive lookbehind and lookahead assertions. You need to enable them using PCRE (Perl-compatible regular expressions) with -P, and it's probably better to print only the matching part using the -o option in GNU grep:
grep -oP '(?<=‘)(.*?)(?=’)' file
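As a rough illustration, assuming a GNU grep built with PCRE support, this is what the command extracts from a single line. The sample sentence and the variable names text and quoted are invented for the example.

```shell
# Hypothetical one-line input containing two quoted fragments.
text='He said ‘hello’ and then ‘goodbye’.'

# -P enables PCRE lookarounds; -o prints only the text between the quotes.
# The lazy .*? stops at the first closing quote instead of the last one.
quoted=$(printf '%s\n' "$text" | grep -oP '(?<=‘)(.*?)(?=’)')
printf '%s\n' "$quoted"
```

Each quoted fragment comes out on its own line, without the surrounding quote characters, because lookarounds are not part of the match.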

How to grep to find all instances of a Java method call using a reference?

I am trying the following query, but without success
grep -nr "[[:alnum:]]+\.[[:alnum:]]+\(\)" .
So, according to my logic, a method call would be one or more alphanumeric characters
[[:alnum:]]+
followed by a dot
\.
followed by one or more alphanumeric characters
[[:alnum:]]+
followed by parentheses (for void return type only)
\(\)
But this query isn't working. How to write such a query?
grep provides several types of regex syntax.
Your pattern is written in the extended syntax and works with -E.
extended-regexp has an easier/better syntax, and perl-regexp is, well, quite powerful.
-E, --extended-regexp
-F, --fixed-strings
-G, --basic-regexp (the default)
-P, --perl-regexp
grep -nrE "[[:alnum:]]+\.[[:alnum:]]+\(\)" .
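In the default basic syntax (without -E), you need to use "\+" instead of "+", otherwise it'll directly match the character "+". Note that basic syntax also swaps the meaning of the parentheses: "\(" and "\)" delimit a group there, so literal parentheses are written unescaped as "()".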
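To compare the two syntaxes side by side, here is a minimal sketch assuming GNU grep (the "\+" quantifier in basic syntax is a GNU extension). The Java-like input lines and the variable names src, ere and bre are hypothetical.

```shell
# Hypothetical Java-like lines; only the first two contain a no-argument call on a reference.
src='list.clear();
obj.run();
int x = a + b;'

# Extended syntax: + is a quantifier and \( \) are literal parentheses.
ere=$(printf '%s\n' "$src" | grep -oE '[[:alnum:]]+\.[[:alnum:]]+\(\)')

# Basic syntax (GNU grep): \+ is the quantifier and bare ( ) are literal.
bre=$(printf '%s\n' "$src" | grep -o '[[:alnum:]]\+\.[[:alnum:]]\+()')

printf '%s\n' "$ere"
```

Both commands should report the same two calls; which one to use is mostly a matter of which escaping you find easier to read.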

Confusion in Linux grep command

I have a very basic confusion about grep. Suppose I have a following file to grep in:
test.txt:
This is an article
from some newspaper
Article is good
newspaper is not.
Now if I grep with following expression
grep -P "is\s*g" test.txt
I get the line:
Article is good
However if I do this:
grep -P "is*g" test.txt
I don't get anything. My question is: since the asterisk (*) is a wildcard which represents 0 or more repetitions of the previous character, shouldn't the output of grep be the same? Why are zero or more repetitions of 's' not giving any output?
What am I missing here. Thanks for the help!
Because there's nothing in your input that matches i, then 0 or more repetitions of s, then g. "Article is good" can't match because it has a space after the s, not a g. The pattern is\s*g matches because \s is a special pattern that matches any sort of whitespace — so the overall pattern is is, then any amount of space, then g, which naturally matches "is g".
I see no ig, isg, issg, issssg in your input...
Since I don't know what you wanted to match, here is my best guess:
grep -P "is.*g" test.txt
You should read up on regular expressions before you use grep; you will also find them useful with other commands... http://www.regular-expressions.info/
It's 0 or more repetition of the previous regex atom, and that atom is \s. So \s* can match tab-space-tab-space-space.
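To make the difference concrete, here is a small sketch that counts matching lines. The test strings and the variable names hits and spaced are made up; [[:space:]] is used as the POSIX spelling of \s so that -P is not required.

```shell
# Strings tried against "is*g": i, then zero or more s, then g.
# "ig", "isg" and "issg" match; "is g" does not, because a space is not an s.
hits=$(printf '%s\n' 'ig' 'isg' 'issg' 'is g' | grep -cE 'is*g')

# Allowing whitespace between "is" and "g" is what makes the original line match.
spaced=$(printf '%s\n' 'Article is good' | grep -cE 'is[[:space:]]*g')
echo "$hits $spaced"
```

The first count covers only the three strings without a space, while the second pattern matches the line from test.txt.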

How to make grep [A-Z] independent of locale?

I was doing some everyday grepping and suddenly discovered that something seemingly trivial does not work:
$ echo T | grep [A-Z]
No match.
How come T is not within A-Z range?
I changed the regex a tiny bit:
$ echo T | grep [A-Y]
A match!
Whoa! How is T within A-Y but not within A-Z?
Apparently this is because my environment is set to Estonian locale where Y is at the end of the alphabet but Z is somewhere in the middle: ABCDEFGHIJKLMNOPQRSŠZŽTUVWÕÄÖÜXY
$ echo $LANG
et_EE.UTF-8
This all came as a bit of a shock to me. 99% of the time I grep computer code, not Estonian literature. Have I been using grep the wrong way all the time? What all kind of mistakes have I made because of this in the past?
After trying several things I arrived at the following solution:
$ echo T | LANG=C grep [A-Z]
Is this the recommended way to make grep locale-independent?
Furthermore... would it be safe to define an alias like this:
$ alias grep="LANG=C grep"
PS. I'm also wondering why character ranges like [A-Z] are locale dependent in the first place while \w seems to be unaffected by locale (although the manual says \w is equivalent to [_[:alnum:]], I found out that the latter depends on locale while \w does not).
POSIX regular expressions, which Linux and FreeBSD grep support naturally, and some others support on request, have a series of [:xxx:] patterns that honor locales. See the man page for details.
grep '[[:upper:]]'
As the []s are part of the pattern name you need the outer [] as well, regardless of how strange it looks.
With the advent of these : codes the classic \w, etc., remain strictly in the C locale. Thus your choice of patterns determines if grep uses the current locale or not.
[A-Z] should follow locale, but you may need to set LC_ALL rather than LANG, especially if the system sets LC_ALL to a different value for you, because LC_ALL overrides LANG and the individual LC_* variables.
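A minimal sketch of the per-command override, assuming nothing beyond a POSIX shell and grep. The variable name result is just for illustration.

```shell
# Quote the bracket expression so the shell cannot expand it as a filename glob,
# and set LC_ALL for this one command only; it takes precedence over LANG.
result=$(echo T | LC_ALL=C grep '[A-Z]')
echo "$result"
```

Setting the variable inline like this leaves the rest of the session in its normal locale, which is gentler than a global alias if you sometimes do want locale-aware matching.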
