correct locale setting for devnagari unicode text - grep

The following output is wrong. There should be only 1 word returned instead of 2:
$ echo 'उद्योजकता' | grep -o -E '\w+'
उद
योजकता
I have been told this is due to locale setting. I have checked it on 2 different servers with 2 different O/S and the results are the same.
Ubuntu
$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
AWS EC2
# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
I am not sure which locale setting should be selected so that Devanagari Unicode text breaks only at spaces.

Edited to add:
You can use something like this instead if the grep on your machine supports Perl-compatible regular expressions (PCRE). This should be the case e.g., on Amazon Linux.
echo 'उद्योजकता' |grep -o -P '[\w\pL\pM]+'
which will match "word" characters OR "Letter" code points OR "Mark" code points,
OR
echo 'उद्योजकता' |grep -o -P '(*UCP)[\w\pM]+'
which will enable unicode character property (UCP) matching so that \w matches all letters and numbers no matter the script, but you must still include \pM in the pattern because \w simply does not match "Mark" points in grep -- they are not alphanumeric code points.
Be careful with the above! I don't know Devanagari script, so I don't know whether it's appropriate to consider all such "Mark" characters as part of a word for your purposes. It might be that a narrower set such as \p{Mn} (non-spacing marks) is more appropriate for your needs, or perhaps there are only a few specific code points to include, in which case you'd need to select them individually in your pattern.
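To help decide between \pM and a narrower class, it can be useful to list which kinds of marks the sample word actually contains. This is only a sketch: it assumes GNU grep built with PCRE support and that the C.UTF-8 locale exists (as on the Ubuntu machine in the question). The variable names are illustrative.

```shell
word='उद्योजकता'

# Count non-spacing marks (Mn), e.g. the virama U+094D.
mn_count=$(printf '%s\n' "$word" | LC_ALL=C.UTF-8 grep -oP '\p{Mn}' | wc -l)

# Count spacing combining marks (Mc), e.g. dependent vowel signs.
mc_count=$(printf '%s\n' "$word" | LC_ALL=C.UTF-8 grep -oP '\p{Mc}' | wc -l)

# Match runs of letters plus any kind of mark.
whole=$(printf '%s\n' "$word" | LC_ALL=C.UTF-8 grep -oP '[\pL\pM]+')

echo "Mn marks: $mn_count, Mc marks: $mc_count"
echo "$whole"
```

If the Mc count is non-zero, as it should be for this word because the vowel signs are spacing marks, then \p{Mn} alone would still split the word at those signs, so the broader \pM looks like the safer choice for Devanagari.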
In the old days, \w just meant [A-Za-z0-9_]. It was oriented around ASCII, and C code. Today, in typical interpretations, it still means "alphanumeric and underscore", which can vary depending on locale.
You say that the output is "wrong", but I'm afraid "wrong" is dependent on which regular expression engine you are using. So even though you are using "grep", the question is which grep, on which OS, etc., etc.
Your input contains 0x094d, as far as I can tell, which is not a "Letter", according to the unicode character definition (at least, not per the link above). It is a "Mark".
There is a Unicode "document" (recommendation) which includes how an engine might define "\w" to be Unicode-smart, and indeed it suggests to include Mark codepoints in the match. So your expectation is natural in that sense. However, you can see from the same link that there is no way to do this and also to be strictly POSIX-compliant at the same time, which lots of regex engines want to do.
Wikipedia indicates that there are some engines which support the Unicode property definitions, but in general, grep isn't going to do it. I'm not familiar enough with those engines (ruby, etc) to say exactly how you should attempt the same thing on command line as you are trying to do with grep.

The macOS (11.2.3) man page for grep has this note at the bottom:
BUGS
The grep utility does not normalize Unicode input, so a pattern containing composed characters will not match decomposed input, and vice versa.

If you are okay with a solution for Devanagari text alone, these would help. As per Wikipedia, the Unicode range is U+0900 to U+097F. So, if your shell supports the $'...' form, you can use:
$ echo 'उद्योजकता' | grep -oE $'[\u0900-\u097f]+'
उद्योजकता
If PCRE is available:
$ echo 'उद्योजकता' | grep -oP '[\x{900}-\x{97f}]+'
उद्योजकता
Use ripgrep for better Unicode support.
$ echo 'उद्योजकता' | rg -o '\w+'
उद्योजकता

Related

grep file with a large array

Hi, I have a few archives of FW logs, and occasionally I'm required to compare them against a list of IP addresses (thousands of them) to get the date and time of any entries that match. My current script is as follows:
# read the list of IPs into an array, starting at index 1
mapfile -t -O 1 var < ip.txt
i=1
while true
do
    # stop once the array entry is empty
    if [[ -n "${var[i]}" ]]; then
        zcat /.../abc.log.gz | grep "${var[i]}"
        ((i++))
    else
        break
    fi
done
It does work, but it's way too slow, and I would think that grepping each line against multiple strings would be faster than running zcat once per IP. So my question is: is there a way to generate a 'long grep search string' from ip.txt? Or is there a better way to do this?
Sure. One thing is that using cat is usually slightly inefficient. I'd recommend using zgrep here instead. You could generate a regex as follows
IP=`paste -s -d ' ' ip.txt`
zgrep -E "(${IP// /|})" /.../abc.log.gz
The first line loads the IP addresses into IP as a single line. The second line builds up a regex that looks something like (127.0.0.1|8.8.8.8) by replacing spaces with |'s. It then uses zgrep to search through abc.log.gz once, with that -Extended regex.
However, I recommend that you do not do this. Firstly, you should escape strings put into a regex. Even if you know that ip.txt really contains IP addresses (e.g. not controlled by a malicious user), you should still escape the periods, because an unescaped . matches any character. But rather than building up a search string and then escaping it, just use grep's -F (fixed strings) and -f (patterns from file) options. Then you get the simple and fast one-liner:
zgrep -F -f ip.txt /.../abc.log.gz
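The difference between regex and fixed-string matching can be seen on a tiny sample. This is a sketch with made-up log lines and a temporary directory; the file names mirror the ones in the question, and the -w variant is an extra assumption about what you may want, not part of the answer above.

```shell
tmp=$(mktemp -d)
printf '%s\n' '10.0.0.1' > "$tmp/ip.txt"
printf '%s\n' \
  '2021-03-01 09:15:01 src=10.0.0.1 action=allow' \
  '2021-03-01 09:15:02 src=10.0.0.10 action=deny' \
  '2021-03-01 09:15:03 src=10x0y0z1 action=deny' | gzip > "$tmp/abc.log.gz"

# As a regex, each "." matches any character, so the third line matches too.
regex_hits=$(zgrep -c -E -f "$tmp/ip.txt" "$tmp/abc.log.gz")

# As fixed strings, only literal occurrences match (still a substring match).
fixed_hits=$(zgrep -c -F -f "$tmp/ip.txt" "$tmp/abc.log.gz")

# Adding -w also rejects 10.0.0.10, where the IP is only a prefix.
word_hits=$(zgrep -c -F -w -f "$tmp/ip.txt" "$tmp/abc.log.gz")

echo "regex=$regex_hits fixed=$fixed_hits fixed+word=$word_hits"
rm -rf "$tmp"
```

Whether -w is appropriate depends on how addresses are delimited in your logs, so treat it as something to verify against real data.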

Grep's word boundaries include spaces?

I tried to use grep to search for lines containing the word "bead" using "\b" but it doesn't find the lines containing the word "bead" separated by space. I tried this script:
cat in.txt | grep -i "\bbead\b" > out.txt
I get results like
BEAD-air.JPG
Bead, 3 sided MET DP110317.jpg
Bead. -2819 (FindID 10143).jpg
Bead(Gem), Artefacts of Phu Hoa site(Dong Nai province).jpg
Romano-British pendant amulet (bead) (FindID 241983).jpg
But I don't get the results like
Bead fun.jpg
Instead of getting some 2,000 lines, I'm only getting 92 lines
My OS is Windows 10 - 64 bit but I'm using grep 2.5.4 from the GnuWin32 package.
I've also tried the MSYS2, which includes grep 3.0 but it does the same thing.
And then, how can I search for words separated by space?
LATER EDIT:
It looks like grep has problems with big files. My input file is 2.4 GB in size. With smaller files, it works - I reported the bug here: https://sourceforge.net/p/getgnuwin32/discussion/554300/thread/03a84e6b/
Try this,
cat in.txt | grep -wi "bead"
-w provides you a whole word search
What you are doing normally should work but there are ways of setting what is and is not considered a word boundary. Rather than worry about it please try this instead:
cat in.txt | grep -iP "\bbead(\b|\s)" > out.txt
The P option adds in Perl regular expression power and the \s matches any sort of space character. The Or Bar | separates options within the parens ( )
While you are waiting for grep to be fixed you could use another tool if it is available to you. E.g.
perl -lane 'print if (m/\bbead\b/i);' in.txt > out.txt
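On a small input, \b does treat a space as a word boundary, which is consistent with the large-file explanation in the edit to the question. A minimal check with GNU grep, using a few invented file names:

```shell
sample=$(printf '%s\n' 'BEAD-air.JPG' 'Bead fun.jpg' 'Beads and pearls.jpg' 'forebead.jpg')

# \b on both sides: "bead" must not be glued to other word characters.
b_hits=$(printf '%s\n' "$sample" | grep -i '\bbead\b')

# -w expresses the same whole-word requirement.
w_hits=$(printf '%s\n' "$sample" | grep -wi 'bead')

printf '%s\n' "$b_hits"
```

Both forms should select the first two lines only, so if "Bead fun.jpg" is missing from your real output, the pattern is unlikely to be the cause.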

Find a string between two characters with grep

I have found on this answer the regex to find a string between two characters. In my case I want to find every pattern between ‘ and ’. Here's the regex :
(?<=‘)(.*?)(?=’)
Indeed, it works when I try it on https://regex101.com/.
The thing is I want to use it with grep but it doesn't work :
grep -E '(?<=‘)(.*?)(?=’)' file
Is there anything missing ?
Those are positive lookahead and lookbehind assertions. They are not supported by -E; you need to enable PCRE (Perl-Compatible Regular Expressions) with -P, and it's perhaps better to print only the matching part using the -o option in GNU grep:
grep -oP '(?<=‘)(.*?)(?=’)' file
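A quick way to try this is with a throwaway file. The sketch below assumes a GNU grep built with PCRE support; the sample sentence is invented.

```shell
f=$(mktemp)
printf '%s\n' 'He said ‘hello’ and then ‘good bye’.' > "$f"

# -o prints each match on its own line; the quote marks themselves are
# excluded because they sit inside the lookbehind/lookahead assertions.
matches=$(grep -oP '(?<=‘)(.*?)(?=’)' "$f")

printf '%s\n' "$matches"
rm -f "$f"
```

Because .*? is lazy, each match stops at the nearest closing quote, so a line with two quoted strings yields two separate matches.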

Why use single quotes and \ in patterns in the grep command?

In some book I have seen a grep command example as
$ grep '^no\(fork\|group\)' /etc/group
I need an explanation of why single quotes are used for the pattern and why there is a \ before the characters ( | ).
The advantage of using single quotes with grep, is that you do not need to escape double quotes when you need to grep for them. For example, if you wanted to search for "findthis" (including searching for the quotes) with grep, using single quotes, it would look like this:
grep '"findthis"' yourfile.txt
If you were using double quotes you would need to escape the quotes with a \, so it would look like this:
grep "\"findthis\"" yourfile.txt
The reason a backslash is needed for certain characters is that they have special meanings. For example, the shell uses " to mark the beginning and end of a quoted argument before grep ever sees it. That means you cannot pass a literal " inside double quotes unless there is some way around this. The solution is to place a \ before the " like so: \". If you do that, the shell knows that you actually want a literal " in the pattern rather than the end of the string.
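The two quoting styles can be compared directly. This is a small sketch using a temporary file and an illustrative variable named word; it shows that both forms hand grep the same pattern, and that only double quotes let the shell expand variables.

```shell
f=$(mktemp)
printf '%s\n' 'say "findthis" now' 'findthis without quotes' > "$f"

# Both commands pass the pattern "findthis" (with the quote marks) to grep.
single=$(grep '"findthis"' "$f")
double=$(grep "\"findthis\"" "$f")

# Single quotes keep $word literal; double quotes let the shell expand it.
word=findthis
single_arg=$(printf '%s' '$word')
double_arg=$(printf '%s' "$word")

echo "$single"
echo "$single_arg vs $double_arg"
rm -f "$f"
```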
Quoting arguments for a command is always recommended. Single quotes won't expand variables. In your example, it makes no difference whether you use single or double quotes.
Take an example:
kent$ cat f
foo
bar
ooo
without quote:
kent$ grep foo|bar f
zsh: correct 'bar' to 'bzr' [nyae]? n
zsh: command not found: bar
You see, my zsh thought you wanted to pipe the output to a command "bar".
Now, why escape |:
Assume your grep is not an alias. grep uses BRE by default; in BRE you need to escape some characters to give them special meaning, and | is one of them.
You can, however, let grep work in ERE or PCRE mode with the -E or -P option. Then you don't need to escape those characters any longer:
kent$ grep -E 'foo|bar' f
foo
bar
In ERE or PCRE, you escape a character to take its special meaning away.
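The three cases can be put side by side. A minimal sketch assuming GNU grep (which accepts \| in BRE), with one extra sample line containing a literal | to show the unescaped BRE case:

```shell
f=$(mktemp)
printf '%s\n' foo bar ooo 'foo|bar' > "$f"

# BRE: \| means alternation.
bre=$(grep 'foo\|bar' "$f")

# ERE: a bare | means alternation.
ere=$(grep -E 'foo|bar' "$f")

# BRE with a bare |: it is an ordinary character, so only the line that
# literally contains "foo|bar" matches.
literal=$(grep 'foo|bar' "$f")

printf 'BRE:\n%s\nliteral:\n%s\n' "$bre" "$literal"
rm -f "$f"
```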

How to make grep [A-Z] independent of locale?

I was doing some everyday grepping and suddenly discovered that something seemingly trivial does not work:
$ echo T | grep [A-Z]
No match.
How come T is not within A-Z range?
I changed the regex a tiny bit:
$ echo T | grep [A-Y]
A match!
Whoa! How is T within A-Y but not within A-Z?
Apparently this is because my environment is set to Estonian locale where Y is at the end of the alphabet but Z is somewhere in the middle: ABCDEFGHIJKLMNOPQRSŠZŽTUVWÕÄÖÜXY
$ echo $LANG
et_EE.UTF-8
This all came as a bit of a shock to me. 99% of the time I grep computer code, not Estonian literature. Have I been using grep the wrong way all the time? What all kind of mistakes have I made because of this in the past?
After trying several things I arrived at the following solution:
$ echo T | LANG=C grep [A-Z]
Is this the recommended way to make grep locale-independent?
Furthermore, would it be safe to define an alias like this:
$ alias grep="LANG=C grep"
PS. I'm also wondering of why are the character ranges like [A-Z] locale dependent in the first place while \w seems to be unaffected by locale (although the manual says \w is equivalent of [[:alnum:]] - but I found out the latter depends on locale while \w does not).
POSIX regular expressions, which Linux and FreeBSD grep support naturally, and some others support on request, have a series of [:xxx:] patterns that honor locales. See the man page for details.
grep '[[:upper:]]'
As the []s are part of the pattern name you need the outer [] as well, regardless of how strange it looks.
With the advent of these : codes the classic \w, etc., remain strictly in the C locale. Thus your choice of patterns determines if grep uses the current locale or not.
[A-Z] should follow the locale, but you may need to set LC_ALL rather than LANG, especially if the system sets LC_ALL to a different value for you, because LC_ALL overrides LANG.
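As a sketch of the per-command override, a small wrapper function can be used instead of an alias. The name cgrep is hypothetical, and the pattern is quoted so the shell cannot glob-expand the brackets.

```shell
# Hypothetical helper: run grep in the C locale for this invocation only.
cgrep() { LC_ALL=C grep "$@"; }

# In the C locale, [A-Z] is the plain ASCII range, so T matches.
range_hit=$(echo T | cgrep '[A-Z]')

# [[:upper:]] follows the current locale and also matches T.
class_hit=$(echo T | grep '[[:upper:]]')

echo "range: $range_hit, class: $class_hit"
```

One trade-off to keep in mind: in the C locale grep works on bytes, so multibyte UTF-8 characters are no longer treated as single characters. That may matter if the same wrapper is also used on non-ASCII text.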
