How to make grep [A-Z] independent of locale? - grep

I was doing some everyday grepping and suddenly discovered that something seemingly trivial does not work:
$ echo T | grep [A-Z]
No match.
How come T is not within A-Z range?
I changed the regex a tiny bit:
$ echo T | grep [A-Y]
A match!
Whoa! How is T within A-Y but not within A-Z?
Apparently this is because my environment is set to Estonian locale where Y is at the end of the alphabet but Z is somewhere in the middle: ABCDEFGHIJKLMNOPQRSŠZŽTUVWÕÄÖÜXY
$ echo $LANG
et_EE.UTF-8
This all came as a bit of a shock to me. 99% of the time I grep computer code, not Estonian literature. Have I been using grep the wrong way all this time? What kinds of mistakes have I made because of this in the past?
After trying several things I arrived at the following solution:
$ echo T | LANG=C grep [A-Z]
Is this the recommended way to make grep locale-independent?
Furthermore, would it be safe to define an alias like this:
$ alias grep="LANG=C grep"
PS. I'm also wondering why character ranges like [A-Z] are locale-dependent in the first place, while \w seems to be unaffected by locale (the manual says \w is equivalent to [[:alnum:]], but I found that the latter depends on locale while \w does not).

POSIX regular expressions, which Linux and FreeBSD grep support natively and some others support on request, have a series of [:xxx:] character classes that honor locales. See the man page for details.
grep '[[:upper:]]'
Because the []s are part of the class name, you need the outer [] as well, however strange it looks.
With the advent of these [:xxx:] classes, the classic \w, etc., remain strictly in the C locale. Thus your choice of pattern determines whether grep uses the current locale or not.
[A-Z] should follow the locale, but you may need to set LC_ALL rather than LANG, because LC_ALL, when set, overrides LANG. This matters especially if the system sets LC_ALL to a different value for you.
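A minimal sketch of the LC_ALL approach, assuming a POSIX-style shell. The function name cgrep is hypothetical and chosen only for illustration; it is one possible alternative to aliasing grep itself, so that locale-aware matching stays available under the plain grep name.

```shell
# LC_ALL takes precedence over LANG and over the individual LC_* variables,
# so setting it for a single command forces the C locale for that command only.
result=$(echo T | LC_ALL=C grep '[A-Z]')
echo "$result"

# Hypothetical wrapper (the name "cgrep" is an assumption, not a standard tool).
# "command" makes sure the real grep is run rather than a function or alias.
cgrep() {
    LC_ALL=C command grep "$@"
}

wrapped=$(echo T | cgrep '[A-Z]')
echo "$wrapped"
```

Quoting the pattern also keeps the shell from expanding [A-Z] as a filename glob before grep ever sees it.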

Related

How to type AND in regex word matching

I'm trying to do a word search with regex and wonder how to type AND for multiple criteria.
For example, how to type the following:
(Start with a) AND (Contains p) AND (Ends with e), such as the word apple?
Input
apple
pineapple
avocado
Code
grep -E "regex expression here" input.txt
Desired output
apple
What should the regex expression be?
In general you can't implement "and" in a single regexp (though you can implement "then" with .*), but you can in a multi-regexp condition using a tool that supports it.
To show the "and" case properly, the example should have been "starts with a and includes p and includes l and ends with e", with input that includes alpine. Then it is no longer trivial to express in a single regexp by just putting .*s between the characters, but it is still trivial as a multi-regexp condition:
$ cat file
apple
pineapple
avocado
alpine
Using &&s will find both words regardless of the order of p and l as desired:
$ awk '/^a/ && /p/ && /l/ && /e$/' file
apple
alpine
but, as you can see, you can't just use .*s to implement and:
$ grep '^a.*p.*l.*e$' file
apple
If you had to use a single regexp then you'd have to do something like:
$ grep -E '^a.*(p.*l|l.*p).*e$' file
apple
alpine
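If awk is not at hand, the same AND condition can also be sketched as a pipeline of plain grep calls, since each stage passes on only the lines that matched. The temporary file and the variable names below are illustrative assumptions, not part of the answer above.

```shell
# Recreate the sample input from the answer in a temporary file.
words=$(mktemp)
printf '%s\n' apple pineapple avocado alpine > "$words"

# Each grep filters the survivors of the previous one, so the pipeline
# behaves like: starts with a AND contains p AND contains l AND ends with e.
and_matches=$(grep '^a' "$words" | grep 'p' | grep 'l' | grep 'e$')
echo "$and_matches"

rm -f "$words"
```

This prints apple and alpine, the same two words as the awk version, regardless of the order of p and l.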
There are two more ways you can do it.
A chain of "&&" is the same as negating a chain of "||" over the negated conditions (De Morgan's law), so you can write the reverse of what you want and negate the whole thing.
At the single-bit level, AND is the same as multiplying the bits. So if you find a chain of && overly verbose, you can "multiply" the patterns together directly:
awk '/^a/ * /p/ * /e$/'
By multiplying them, you perform multiple logical ANDs all at once.
(But only use the shorthand if the inputs aren't too gigantic, or when the savings from early exit are known to be negligible: unlike &&, multiplication does not short-circuit.)
Don't think of them as merely regex patterns. It's easier to think of anything outside an action block (what's typically referred to as the pattern) as any combination of items that can be evaluated to a boolean TRUE or FALSE in the end.
For example, POSIX-compliant expressions that work in the pattern position include:
sprintf()
field assignments, etc.
(even decrementing NR, if there's such a need)
but not:
statements like next, print, printf
delete array, etc., or any of the loop structures
Surprisingly, though, getline is directly usable in the pattern position (with some wrapper workaround).
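The equivalences described in this answer can be checked side by side. This is a sketch using the three-word input from the question; the variable names are illustrative, and the parentheses around each regex in the product form are a precaution, on the assumption that not every awk implementation parses a bare /re/ * /re/ the same way.

```shell
words=$(mktemp)
printf '%s\n' apple pineapple avocado > "$words"

# Plain logical AND (short-circuits on the first failing pattern).
with_and=$(awk '/^a/ && /p/ && /e$/' "$words")

# De Morgan: NOT (not-a OR not-p OR not-e) is the same condition.
with_demorgan=$(awk '!( !/^a/ || !/p/ || !/e$/ )' "$words")

# Product of the 0/1 match results; every pattern is always evaluated.
with_product=$(awk '(/^a/) * (/p/) * (/e$/)' "$words")

printf '%s\n' "$with_and" "$with_demorgan" "$with_product"
rm -f "$words"
```

All three conditions select only apple from this input.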

correct locale setting for devnagari unicode text

The following output is wrong. There should be only 1 word returned instead of 2.
$ echo 'उद्योजकता' | grep -o -E '\w+'
उद
योजकता
I have been told this is due to locale setting. I have checked it on 2 different servers with 2 different O/S and the results are the same.
Ubuntu
$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
AWS EC2
# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
I am not sure which locale setting should be selected to get Devanagari Unicode text to break only at spaces.
Edited to add:
You can use something like this instead if the grep on your machine supports Perl-compatible regular expressions (PCRE). This should be the case e.g., on Amazon Linux.
echo 'उद्योजकता' |grep -o -P '[\w\pL\pM]+'
which will match "word" characters OR "Letter" code points OR "Mark" code points,
OR
echo 'उद्योजकता' |grep -o -P '(*UCP)[\w\pM]+'
which will enable Unicode character property (UCP) matching so that \w matches all letters and numbers no matter the script. You must still include \pM in the pattern, because \w simply does not match "Mark" code points in grep: they are not alphanumeric code points.
Be careful with the above! I don't know Devanagari script, so I don't know whether it's appropriate to consider all such "Mark" characters as part of a word for your purposes. It might be that a narrower set such as \p{Mn} (non-spacing marks) is more appropriate for your needs, or perhaps there are only a few specific code points to include, in which case you'd need to list them individually in your pattern.
In the old days, \w just meant [A-Za-z0-9_]. It was oriented around ASCII, and C code. Today, in typical interpretations, it still means "alphanumeric and underscore", which can vary depending on locale.
You say that the output is "wrong", but I'm afraid "wrong" is dependent on which regular expression engine you are using. So even though you are using "grep", the question is which grep, on which OS, etc., etc.
Your input contains U+094D, as far as I can tell, which is not a "Letter" according to the Unicode character definition (at least, not per the link above). It is a "Mark".
There is a Unicode "document" (recommendation) which includes how an engine might define "\w" to be Unicode-smart, and indeed it suggests to include Mark codepoints in the match. So your expectation is natural in that sense. However, you can see from the same link that there is no way to do this and also to be strictly POSIX-compliant at the same time, which lots of regex engines want to do.
Wikipedia indicates that there are some engines which support the Unicode property definitions, but in general, grep isn't going to do it. I'm not familiar enough with those engines (ruby, etc) to say exactly how you should attempt the same thing on command line as you are trying to do with grep.
The macOS (11.2.3) man page for grep has this note at the bottom:
BUGS
The grep utility does not normalize Unicode input, so a pattern containing composed characters will not match decomposed input, and vice versa.
If you are okay with a solution for Devanagari text alone, these would help. As per Wikipedia, the Unicode range is U+0900 to U+097F. So, if your shell supports the $'...' form, you can use:
$ echo 'उद्योजकता' | grep -oE $'[\u0900-\u097f]+'
उद्योजकता
If PCRE is available:
$ echo 'उद्योजकता' | grep -oP '[\x{900}-\x{97f}]+'
उद्योजकता
Use ripgrep for better Unicode support.
$ echo 'उद्योजकता' | rg -o '\w+'
उद्योजकता

Grep Filenames from ls for specific part of them

I want to extract a specific part out of the filenames to work with them.
Example:
ls -1
REZ-Name1,Surname1-02-04-2012.png
REZ-Name2,Surname2-07-08-2013.png
....
So I want to get only the part with the name.
How can this be achieved ?
There are several ways to do this. Here's a loop:
for file in REZ-*-??-??-????.png
do
name=${file#*-}
name=${name%-??-??-????.png}
echo "($name)"
done
Given a variety of filenames with all sorts of edge cases from spacing, additional hyphens and line feeds:
REZ-Anna-Maria,de-la-Cruz-12-32-2015.png
REZ-Bjørn,Dæhlie-01-01-2015.png
REZ-First,Last-12-32-2015.png
REZ-John Quincy,Adams-11-12-2014.png
REZ-Ridiculous example # this is one filename
is ridiculous,but fun-22-11-2000.png # spanning two lines
it outputs:
(Anna-Maria,de-la-Cruz)
(Bjørn,Dæhlie)
(First,Last)
(John Quincy,Adams)
(Ridiculous example
is ridiculous,but fun)
If you're less concerned with correctness, you can simplify it further:
$ ls | grep -o '[^-]*,[^-]*'
Maria,de
Bjørn,Dæhlie
First,Last
John Quincy,Adams
is ridiculous,but fun
In this case, cut makes more sense than grep:
ls -1 | cut -f2 -d-
This cuts the second field from the input, using '-' as the field delimiter. (Note that it is ls -1, one name per line, not ls -l: the long listing contains hyphens of its own in the permission column.) That other guy's answer will correctly handle some cases mine will not, but for one-off uses, I generally find the semantics of cut much easier to remember.
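To make the trade-off between the two answers concrete, here is a small sketch that applies both techniques to one of the hyphenated sample names. It works on a string literal rather than real files, and the variable names are illustrative.

```shell
file='REZ-Anna-Maria,de-la-Cruz-12-32-2015.png'

# cut splits on every hyphen, so a hyphenated name is truncated at field 2.
with_cut=$(printf '%s\n' "$file" | cut -d- -f2)

# Parameter expansion removes the shortest prefix matching "*-" (the REZ- tag),
# then the shortest suffix shaped like -DD-MM-YYYY.png.
name=${file#*-}
name=${name%-??-??-????.png}

printf 'cut:       %s\nexpansion: %s\n' "$with_cut" "$name"
```

Here cut yields Anna, while the expansion keeps Anna-Maria,de-la-Cruz. For names without extra hyphens, both give the same result.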

GREP How do I search for words that contain specific letters (one or more times)?

I'm using the operating systems dictionary file to scan. I'm creating a java program to allow a user to enter any concoction of letters to find words that contain those letters. How would I do this using grep commands?
To find words that contain only the given letters:
grep -v '[^aeiou]' wordlist
The above prints only the lines in wordlist that contain no characters other than those listed. It's sort of using a double negative to get what you want. Another way to do this would be:
grep -E '^[aeiou]+$' wordlist
which matches lines that consist entirely of one or more of the selected letters. (The -E is needed because + is a literal character in grep's default basic syntax.)
To find words that contain all of the given letters is a bit more lengthy, because there may be other letters in between the ones we want:
cat wordlist | grep a | grep e | grep i | grep o | grep u
(Yes, there is a useless use of cat above, but the symmetry is better this way.)
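Both searches can be tried on a tiny stand-in for the dictionary file. The word list below is an assumption made up for illustration, as are the variable names.

```shell
wordlist=$(mktemp)
printf '%s\n' aeiou eau queue idea io education sequoia > "$wordlist"

# Words made up of ONLY the given letters.
# The + quantifier needs -E (or \+ in basic syntax).
only_neg=$(grep -v '[^aeiou]' "$wordlist")
only_pos=$(grep -E '^[aeiou]+$' "$wordlist")

# Words containing ALL of the given letters, in any order: one grep per letter.
all_letters=$(grep a "$wordlist" | grep e | grep i | grep o | grep u)

printf '%s\n' "$only_pos" "--" "$all_letters"
rm -f "$wordlist"
```

On this input both "only" forms return aeiou, eau and io, and the chained form returns aeiou, education and sequoia. One difference to keep in mind: the -v form would also print empty lines, while the anchored form would not.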
You can use a single grep to solve the last problem in Greg's answer, provided your grep supports PCRE. (Based on this excellent answer, boiled down a bit)
grep -P "(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)" wordlist
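Each positive lookahead asserts that its letter appears somewhere in the line without consuming any input, so a line matches only if it contains an "a" anywhere, and an "e" anywhere, and so on for the remaining letters.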

How to grep to find all instances of a Java method call using a reference?

I am trying the following query, but without success
grep -nr "[[:alnum:]]+\.[[:alnum:]]+\(\)" .
So, according to my logic, a method call would be one or more alphanumeric characters
[[:alnum:]]+
followed by a dot
\.
followed by one or more alphanumeric characters
[[:alnum:]]+
followed by parentheses (for calls without arguments only)
\(\)
But this query isn't working. How to write such a query?
grep provides several types of regex syntax.
Your pattern is written in the extended syntax and works with -E.
extended-regexp has an easier/better syntax, and perl-regexp is, well, quite powerful. The available options are:
-E, --extended-regexp
-F, --fixed-strings
-G, --basic-regexp (the default)
-P, --perl-regexp
grep -nrE "[[:alnum:]]+\.[[:alnum:]]+\(\)" .
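In the default basic syntax (without -E), you need to use "\+" instead of "+"; otherwise it matches a literal "+" character. Likewise, in basic syntax \( and \) delimit a group, so literal parentheses must be written unescaped.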
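The two syntaxes can be compared on a few sample lines. This is a sketch: the sample lines and variable names are made up for illustration, and \+ in basic syntax is assumed to be available as an extension (it is in GNU grep).

```shell
src=$(mktemp)
printf '%s\n' 'foo.bar();' 'int x = 1 + 2;' 'return list.size()' > "$src"

# Extended syntax: + is a quantifier, \( and \) are literal parentheses.
ere=$(grep -nE '[[:alnum:]]+\.[[:alnum:]]+\(\)' "$src")

# Basic syntax: \+ is the quantifier, ( and ) are literal parentheses.
bre=$(grep -n '[[:alnum:]]\+\.[[:alnum:]]\+()' "$src")

printf '%s\n' "$ere"
rm -f "$src"
```

Both commands report lines 1 and 3 and skip the arithmetic line in between.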
