An encoding-savvy grep replacement?

I am frustrated that grep fails to find a word like "hello" in my UTF-16 documents.
Can anyone recommend a version of grep that attempts to guess the file encoding and then properly handle it?

ack as perl-based grep replacement?
You'll definitely want to check out ack.
It supports Unicode encodings, and is basically grep, but better.
try a matching Unicode locale with grep
If you are on Linux, Unix, etc., you may want to change your LANG environment variable to an encoding that matches your documents.
Check your locale first. Here is what mine is set to by default on my MacBook Pro:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
To override it for a single command, say under bash:
$ LANG="foo" grep 'gotta be found now' file.name
Or something a little more permanent (be careful with this):
$ export LANG="foo"
$ grep 'bar' mitz.vah

Perl's regex syntax is far more powerful than grep's, and it supports UTF-8 and UTF-16, but I'm not sure how good it is at guessing the encoding. If you tell it which encoding to use, though, it can read these files without any issues and run regexes over them. You'll have to write yourself a tiny Perl program for that (your own micro-grep implementation in Perl, so to speak), but that isn't too hard. Perl exists for all major operating systems.

ugrep, which is free and open source (BSD-3 license), supports all UTF encodings and claims to be a true drop-in replacement for grep by supporting the GNU/BSD grep command-line options. ripgrep, ack, and the silver searcher (ag) also support UTF encodings, but they are not drop-in replacements for grep since their behavior and options differ.
You could use the iconv filter utility in combination with grep to convert UTF-16 files to UTF-8, but you will have to explicitly specify the input and output encodings, something like:
iconv -f utf-16 -t utf-8 < file.txt | grep PATTERN
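The iconv pipeline can also be wrapped in a small function that picks the input encoding by peeking at the byte-order mark (BOM). This is only a sketch: bomgrep is a hypothetical name, it assumes od, iconv and grep are available, and it treats any file without a UTF-16 BOM as UTF-8, which is an assumption rather than real encoding detection (a UTF-32LE BOM, for instance, would be misread).

```shell
# bomgrep PATTERN FILE  (hypothetical helper, not a standard tool)
# Converts FILE to UTF-8 based on its BOM, then greps the result.
bomgrep() {
    pattern=$1
    file=$2
    # First three bytes as hex with whitespace removed, e.g. "fffe68".
    bom=$(od -An -tx1 -N3 "$file" | tr -d ' \n')
    case $bom in
        fffe*|feff*) enc=UTF-16 ;;  # iconv reads the BOM to pick the byte order
        *)           enc=UTF-8 ;;   # assumption: no UTF-16 BOM means UTF-8
    esac
    iconv -f "$enc" -t UTF-8 < "$file" | grep -e "$pattern"
}
```

Something like `bomgrep hello file.txt` would then work on both UTF-16 (with BOM) and UTF-8 files.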

Related

Strange behavior grep -rnw

I am using grep (BSD grep) 2.5.1-FreeBSD in MacOS and I have found the following behavior.
I have two *.tex files. Each one of these contains the following lines
$k$-th bit of
$(i-m)$-th bit of
respectively. When I ran
grep --color -rnw . -e '\$-th bit of' --include="*.tex"
I got only the second file, i.e., $(i-m)$-th bit of, while I expected both lines. Could you please help me understand this behavior?
Never use -r or --include or any other grep option to find files. The GNU guys really screwed up by adding those options to grep when there's a perfectly good tool named find for finding files and now they've turned grep into a convoluted mush of finding files and Globally matching a Regular Expression within a file and Printing the result (G/RE/P).
Keep it simple - find the files with find then g/re/p within them using grep:
find . -name '*.tex' -exec grep --color -n '\$-th bit of' {} +
As others pointed out your g/re/p problem was the -w arg so I've removed that above.
I have the same version of grep.
It is caused by your use of the -w option:
-w, --word-regexp
The expression is searched for as a word (as if surrounded by `[[:<:]]' and `[[:>:]]'; see re_format(7)).
The matched part of the string $k$-th bit of is bounded on the left-hand side by a word character (i.e. k) so the match is treated as being inside a "word" and it can't therefore satisfy the "searched for as a whole word" requirement.
Try without -w and it will work fine.
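To see the effect in isolation, here is a small sketch that feeds the two lines from the question to grep on stdin, with and without -w (sample is just a helper name for this example):

```shell
# The two lines from the question, one per line.
sample() {
    printf '%s\n' '$k$-th bit of' '$(i-m)$-th bit of'
}

# With -w the first line is dropped: there the match starts
# right after the word character "k".
sample | grep -w -e '\$-th bit of'

# Without -w both lines match.
sample | grep -e '\$-th bit of'
```

In the second line the match is preceded by ")", which is not a word character, so -w accepts it.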

Using globs in GNU grep's path argument

BSD (Mac) grep allows for this command:
grep -n "FIXME" **/*.rb
But GNU grep forces me to specify at least a folder to start from:
grep -n "FIXME" {lib,spec}/**/*.rb
Is there a way to get this to behave like it does in BSD grep?
Switch to ack. It searches recursively by default and comes with flags that select files by language type.
For instance, writing:
ack FIXME --ruby
Will search the current directory recursively for anything that may be a Ruby file. This will work the same on Mac and Linux.
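If you'd rather stay with grep: the ** pattern is expanded by the shell before grep ever runs, so another option is to let grep walk the tree itself. A minimal sketch, assuming a grep that has the -r and --include extensions (GNU and BSD grep do; they are not POSIX). fixme_rb is just an illustrative name:

```shell
# fixme_rb [DIR]  (hypothetical helper)
# Recursive search limited to *.rb files; grep walks the tree itself,
# so no shell glob is involved. DIR defaults to the current directory.
fixme_rb() {
    grep -rn --include='*.rb' -e 'FIXME' "${1:-.}"
}
```

Calling `fixme_rb lib` would search only under lib.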

Using iconv to convert Traditional Chinese to Simplified Chinese

How do I use the iconv in Ruby to convert a string from Simplified Chinese to Traditional Chinese (and vice-versa)?
I've tried
Iconv.conv("gb2312//IGNORE", "big5//IGNORE", '大家一起學中文')
I get an entirely different string. I've tried with the GBK and BIG5 encodings, I get an IllegalSequence Error.
Thanks.
https://rubygems.org/gems/tradsim
I just wrote a gem
To install the gem
gem install tradsim
To use the gem
# encoding: UTF-8
require 'tradsim'
puts Tradsim::to_sim("大家一起學中文")
it will yield
大家一起学中文
and you can use Tradsim::to_trad to do the reverse.
Are you trying to convert, say, 學 to 学? I could be wrong, but I don't think Iconv will perform that type of conversion.
OpenCC
https://github.com/BYVoid/OpenCC
As of 2021, this seems to be the most popular choice:
sudo apt install opencc
opencc -i input.txt -o output.txt -c t2s.json
With:
input.txt
大家一起學中文
we get:
output.txt
大家一起学中文
It also has APIs for several languages like Python and Node.js.
Tested on Ubuntu 21.04, opencc 1.1.1.

grep match with string1 OR string2

I want to grep 2 patterns in a file on Solaris UNIX.
That is grep 'pattern1 OR pattern2' filename.
The following command does NOT work:
grep 'pattern1\|pattern2' filename
What is wrong with this command?
NOTE: I am on Solaris
What operating system are you on?
It will work on systems with GNU grep, but on BSD, Solaris, etc., \| is not supported.
Try egrep or grep -E, e.g.
egrep 'pattern1|pattern2'
If you want POSIX functionality (i.e. Linux-like behavior) you can put the POSIX.2-compatible binaries at the beginning of your path:
$ echo $PATH
/usr/xpg4/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:[...]
There is also /usr/xpg6, which is POSIX.1-2001 compatible.
/usr/bin: SVID/XPG3
/usr/xpg4/bin: POSIX.2/POSIX.2a/SUS/SUSv2/XPG4
/usr/xpg6/bin: POSIX.1-2001/SUSv3
That command works fine for me. Please add additional information such as your platform and the exact regular expression and file contents you're using (minimized to the smallest example that still reproduces the issue). (I would add a comment to your post but don't have enough reputation.)
That should be correct. Make sure that you do or don't add the appropriate spaces i.e. "pattern1\|pattern2" vs "pattern1\| pattern2".
Are you sure you aren't just having problems with cases or something? try the -i switch.
That depends entirely on what pattern1 and pattern2 are. If they're just words, it should work, otherwise you'll need:
grep '\(pattern1\)\|\(pattern2\)'
An arcane method using fgrep (ie: fixed strings) that works on Solaris 10...
Provide a pattern-list, with each pattern separated by a NEWLINE, yet quoted so as to be interpreted by the shell as one word:-
fgrep 'pattern1
pattern2' filename
This method also works for grep, fgrep and egrep in /usr/xpg4/bin, although the pipe-delimited ERE in any egrep is sometimes the least fussy.
You can insert arbitrary newlines in a string if your shell allows history-editing, eg: in bash issue C-v C-j in either emacs mode or vi-command mode.
egrep -e "string1|string2" works for me in SunOS 5.9 (Solaris)
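To put the portable options side by side, here is a short sketch using made-up input on stdin. Both forms are specified by POSIX, so on Solaris they should be available from the /usr/xpg4/bin grep mentioned above; whether /usr/bin/grep accepts them is not something this sketch assumes.

```shell
# Sample input; pattern1 and pattern2 are the placeholders from the question.
lines() {
    printf '%s\n' 'has pattern1' 'has pattern2' 'has neither'
}

# Extended regular expression: | is alternation, no backslash needed.
lines | grep -E 'pattern1|pattern2'

# Repeated -e options: a line is selected if any one pattern matches.
lines | grep -e 'pattern1' -e 'pattern2'
```

Both commands select the first two lines and skip the third.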

How to make grep stop at first match on a line?

Well, I have a file test.txt
#test.txt
odsdsdoddf112 test1_for_grep
dad23392eeedJ test2 for grep
Hello World test
garbage
I want to extract strings which have got a space after them. I used following expression and it worked
grep -o [[:alnum:]]*.[[:blank:]] test.txt
Its output is
odsdsdoddf112
dad23392eeedJ
test2
for
Hello
World
But the problem is that grep prints all the strings that are followed by a space, whereas I want it to stop after the first match on a line and then proceed to the next line.
Which expression should I use here, in order to make it stop after first match and move to next line?
This problem may be solved with gawk or some other tool, but I will appreciate a solution which uses grep only.
Edit
I am using GNU grep 2.5.1 on a Linux system, if that is relevant.
Edit
With the help of the answers given below, I tried my luck with
grep -o ^[[:alnum:]]* test.txt
grep -Eo ^[[:alnum:]]+ test.txt
and both gave me correct answers.
Now what surprises me is that I tried using
grep -Eo "^[[:alnum:]]+[[:blank:]]" test.txt
as suggested here but didn't get the correct answer.
Here is the output on my terminal
odsdsdoddf112
dad23392eeedJ
test2
for
Hello
World
But comments from RichieHindle and Adrian Pronk show that they got the correct output on their systems. Does anyone have an idea why I am not getting the same result on my system? Any help will be appreciated.
Edit
Well, it seems that grep 2.5.1 has some bug because of which my output wasn't correct. I installed grep 2.5.4, now it is working correctly. Please see this link for details.
If you're sure you have no leading whitespace, add a ^ to match only at the start of a line, and change the * to a + to match only when you have one or more alphanumeric characters. (That means adding -E to use extended regular expressions).
grep -Eo "^[[:alnum:]]+[[:blank:]]" test.txt
(I also removed the . from the middle; I'm not sure what that was doing there?)
As the questioner discovered, this is a bug in versions of GNU grep prior to 2.5.3. The bug allows a caret to match after the end of a previous match, not just at beginning of line.
This bug is still present in other versions of grep, for instance in Mac OS X 10.9.4.
There isn't a universal workaround, but in some cases, like non-spaces followed by a space, you can often get the desired behavior by leaving off the delimiter. That is, search for '[^ ]*' rather than '[^ ]* '.
grep -oe "^[^ ]* " test.txt
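If you're stuck with a grep that has the caret bug and can step outside grep, sed sidesteps it, since a substitution anchored with ^ is applied once per line. This is a swapped-in tool, not the grep-only answer the question asked for, and first_token is a hypothetical name:

```shell
# first_token FILE  (hypothetical helper)
# Prints the leading alphanumeric run of each line, but only for
# lines where that run is immediately followed by a blank.
first_token() {
    sed -n 's/^\([[:alnum:]]*\)[[:blank:]].*/\1/p' "$1"
}
```

On the four data lines of test.txt from the question, this prints odsdsdoddf112, dad23392eeedJ and Hello, one per line, and skips garbage. A line that begins with a blank would come out as an empty line.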
If we want to extract all meaningful input before garbage and actually stop on first match then -B NUM, --before-context=NUM option may be useful to "print NUM lines of leading context before matching lines".
Example:
grep --before-context=999999 "Hello World test"
