fgrep not counting line with accented letter in it - grep

The chances that I found a bug in fgrep are rather small, so my bet is that I misunderstand something. I was counting the number of addresses in a VCF file with
fgrep FN: Contacts.vcf| wc -l
to quickly find the number of FN (full name) fields.
I noticed that I was one short compared to the count in my Nextcloud address book.
I tracked it down to the line of a friend called Jurriën.
If I keep his name fgrep doesn't count the line
FN:Jurriën Somelastname
If I remove the ë fgrep counts the line.
FN:Jurrin Somelastname
This is a simple DOS-style encoded text file, straight out of the Nextcloud server.
However, fgrep sees it as a binary file, so fgrep -a works. Is this the expected behaviour?
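One plausible explanation, which the question does not confirm: the file is not UTF-8 (for example ISO-8859-1), so in a UTF-8 locale the ë byte is an invalid sequence and GNU grep treats the file as binary. The sketch below assumes GNU grep and iconv are available; the sample file and the second contact are made up for illustration.

```shell
# Hypothetical sample: two FN lines with DOS line endings. The ë is the
# single ISO-8859-1 byte 0xEB (octal 353), which is not valid UTF-8.
tmp=$(mktemp -d)
printf 'FN:Jurri\353n Somelastname\r\nFN:Another Person\r\n' > "$tmp/sample.vcf"

# -a treats the file as text; -c counts matching lines, so wc is not needed.
count_text=$(grep -a -c -F 'FN:' "$tmp/sample.vcf")

# Alternative: convert to UTF-8 first (assumes the input really is ISO-8859-1).
count_utf8=$(iconv -f ISO-8859-1 -t UTF-8 "$tmp/sample.vcf" | grep -c -F 'FN:')

echo "$count_text $count_utf8"
```

Running file on the real Contacts.vcf should give a hint about its actual encoding before choosing the iconv source charset.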

Related

correct locale setting for devnagari unicode text

The following output is wrong. There should be only 1 word returned instead of 2
$ echo 'उद्योजकता' | grep -o -E '\w+'
उद
योजकता
I have been told this is due to locale setting. I have checked it on 2 different servers with 2 different O/S and the results are the same.
Ubuntu
$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
AWS EC2
# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
I am not sure which locale setting should be selected to get the Devanagari Unicode text to break only at spaces.
Edited to add:
You can use something like this instead if the grep on your machine supports Perl-compatible regular expressions (PCRE). This should be the case e.g., on Amazon Linux.
echo 'उद्योजकता' |grep -o -P '[\w\pL\pM]+'
which will match "word" characters OR "Letter" code points OR "Mark" code points,
OR
echo 'उद्योजकता' |grep -o -P '(*UCP)[\w\pM]+'
which will enable unicode character property (UCP) matching so that \w matches all letters and numbers no matter the script, but you must still include \pM in the pattern because \w simply does not match "Mark" points in grep -- they are not alphanumeric code points.
Be careful with the above! I don't know Devanagari script so I don't know if it's appropriate or not to consider all such "Mark" characters as being part of a word for your purposes. It might be that a narrower set such as \p{Mn} (for non-spacing marks) is more appropriate for your needs, or perhaps there are only a few specific points to include, in which case you'd need to select them individually in your pattern.
In the old days, \w just meant [A-Za-z0-9_]. It was oriented around ASCII, and C code. Today, in typical interpretations, it still means "alphanumeric and underscore", which can vary depending on locale.
You say that the output is "wrong", but I'm afraid "wrong" is dependent on which regular expression engine you are using. So even though you are using "grep", the question is which grep, on which OS, etc., etc.
Your input contains 0x094d, as far as I can tell, which is not a "Letter", according to the unicode character definition (at least, not per the link above). It is a "Mark".
There is a Unicode "document" (recommendation) which includes how an engine might define "\w" to be Unicode-smart, and indeed it suggests to include Mark codepoints in the match. So your expectation is natural in that sense. However, you can see from the same link that there is no way to do this and also to be strictly POSIX-compliant at the same time, which lots of regex engines want to do.
Wikipedia indicates that there are some engines which support the Unicode property definitions, but in general, grep isn't going to do it. I'm not familiar enough with those engines (ruby, etc) to say exactly how you should attempt the same thing on command line as you are trying to do with grep.
The macOS (11.2.3) man page for grep has this note at the bottom:
BUGS
The grep utility does not normalize Unicode input, so a pattern containing composed characters will not match decomposed input, and vice versa.
If you are okay with a solution for Devanagari text alone, these would help. As per Wikipedia, the Unicode range is U+0900 to U+097F. So, if your shell supports the $'...' form, you can use:
$ echo 'उद्योजकता' | grep -oE $'[\u0900-\u097f]+'
उद्योजकता
If PCRE is available:
$ echo 'उद्योजकता' | grep -oP '[\x{900}-\x{97f}]+'
उद्योजकता
Use ripgrep for better Unicode support.
$ echo 'उद्योजकता' | rg -o '\w+'
उद्योजकता

How to grep multiple lines using a .txt vocab, matching only first word as variable?

I'm trying to reduce a .sm file (file1, around 10 GB) by filtering it using a fairly long set of words (around 180,108 items) listed in a text file, file2.
File1 is structured as follows:
word <http://internet.address.com> 1
i.e. one word followed by a blank space, an internet address, and a number.
File2 is a simple .txt file, a list of words, one on each line.
My aim is to create a third file File3 containing only those lines in file1 whose first word matches with the word-list of file2, and disregard the rest.
My attempt is the following:
grep -w -F -f file2.txt file1.sm > file3.sm
I've also attempted something along this line:
gawk 'FNR==NR {a[$1]; next } !($2 in a)' file2.txt file1.sm > file3.sm
but with no success. I understand /^ and \b might play a part here, but I don't know how to fit them in the syntax. I've looked around extensively but no solution seems to fit.
My problem is that here grep reads the entire file1's line, and it can happen that the matching word lies in the webpage address, which I'm not interested in finding out.
sed 's/.*/^& /' file2.txt | grep -f - file1.sm
join is the best tool for this, not grep/awk:
join -t' ' <(sort file1.sm) <(sort file2.txt) >file3.sm
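For comparison, the asker's gawk attempt can probably be repaired by testing the first field instead of the second and dropping the negation. This is a minimal sketch with made-up sample data (the file contents and the temporary directory are assumptions); it does an exact lookup on the first word, so a word that only appears inside the address is ignored.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' \
    'apple <http://internet.address.com/a> 1' \
    'pear <http://internet.address.com/apple> 2' \
    'kiwi <http://internet.address.com/k> 3' > "$tmp/file1.sm"
printf '%s\n' 'apple' 'kiwi' > "$tmp/file2.txt"

# First pass (NR == FNR): remember every word from the word list.
# Second pass: print a line only if its first field is one of those words.
awk 'NR == FNR { words[$1]; next } $1 in words' \
    "$tmp/file2.txt" "$tmp/file1.sm" > "$tmp/file3.sm"

cat "$tmp/file3.sm"
```

Unlike join, this does not need either file to be sorted, at the cost of holding the word list in memory.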

grep file with a large array

Hi, I have a few archives of FW logs, and occasionally I'm required to compare them with a series of IP addresses (thousands of them) to get the date and time if the IP addresses match. My current script is as follows:
# Read the list of IPs into an array (indices start at 1)
mapfile -t -O 1 var < ip.txt
i=1
while true
do
    # Stop once the array entry is empty
    if [[ -n "${var[i]}" ]]; then
        zcat /.../abc.log.gz | grep "${var[i]}"
        ((i++))
    else
        break
    fi
done
It does work, but it's way too slow, and I would think that grepping a line for multiple strings would be faster than running zcat for every IP. So my question is: is there a way to generate a 'long grep search string' from ip.txt? Or is there a better way to do this?
Sure. One thing is that using cat is usually slightly inefficient. I'd recommend using zgrep here instead. You could generate a regex as follows
IP=`paste -s -d ' ' ip.txt`
zgrep -E "(${IP// /|})" /.../abc.log.gz
The first line loads the IP addresses into IP as a single line. The second line builds up a regex that looks something like (127.0.0.1|8.8.8.8) by replacing spaces with |'s. It then uses zgrep to search through abc.log.gz once, with that -Extended regex.
However, I recommend that you do not do this. Firstly, you should escape strings put into a regex. Even if you know that ip.txt really contains IP addresses (e.g. not controlled by a malicious user), you should still escape the periods. But rather than building up a search string and then escape it, just use the -Fixed strings and -file features of grep. Then you get the simple and fast one-liner:
zgrep -F -f ip.txt /.../abc.log.gz
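One caveat with fixed-string matching is that 1.2.3.4 also matches inside 11.2.3.4 or 1.2.3.45. A hedged variation adds -w so a match must not be directly preceded or followed by a letter, digit, or underscore. The sketch below uses a made-up log format and temporary files, and pipes gzip -dc into grep, which is roughly what zgrep does internally.

```shell
# Hypothetical IP list and log archive in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '1.2.3.4' '' > "$tmp/ip.txt"
printf '%s\n' \
    '2021-03-01 10:00:00 src=11.2.3.4 action=drop' \
    '2021-03-01 10:00:05 src=1.2.3.4 action=accept' \
    '2021-03-01 10:00:09 src=1.2.3.45 action=drop' | gzip > "$tmp/abc.log.gz"

# Drop blank lines from the pattern list: an empty pattern matches every line.
grep -v '^[[:space:]]*$' "$tmp/ip.txt" > "$tmp/ip.clean"

# One pass over the archive; -w rejects matches embedded in longer numbers.
matches=$(gzip -dc "$tmp/abc.log.gz" | grep -w -F -f "$tmp/ip.clean")
printf '%s\n' "$matches"
```

The blank-line cleanup matters in practice: a stray empty line in ip.txt would make every log line match.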

duplicate grep output when comparing two files

I have literally been at this for 5 hours, I have busybox on my device, and I unfortunately do not have -X in grep to make my life easier.
edit;
I have two lists, both of them containing MAC addresses. Essentially I just want offline MAC address lookup so I don't have to keep looking them up online.
list.txt has the vendor MAC prefixes. Of course this isn't the complete list, just an example:
00:13:46
00:15:E9
00:17:9A
00:19:5B
00:1B:11
00:1C:F0
scan holds a list of full-length MAC addresses whose vendors are unknown. Whenever there is a match, I want the line in scan to be output.
It pretty much does that, but it outputs everything from the scan file and then outputs the matching line at the end, causing duplicates. I tried sort -u, but it has no effect. It's as if there are two different outputs from two different methods; I say that because it instantly outputs the whole scan file, and a couple of seconds later it outputs the matching line.
From searching I came across this
#!/bin/bash
while read line; do
grep -F 'list' 'scan'
done < list.txt
which displays the duplicate result when/if found. The output is pretty much my scan file echoed back, followed by the matched pattern, thus creating duplicates.
It is frustrating that I have not found a solution after clicking on all the links in Google up to page 9.
Please, someone help me.
I don't know if the Busybox sed supports this out of the box, but it should be easy to do in Awk or Perl instead then.
Create a sed script to print lines from file2 which are covered by a prefix in file1 by transforming each line in file1 into a sed command to print a match for that regular expression, anchored to the start of the line:
sed 's%.*%/^&/p%' file1 | sed -n -f - file2
The same in Awk:
awk 'NR==FNR { a[++i]="^" $0; next }
{ for (j=1; j<=i; ++j) if ($0 ~ a[j]) print }' file1 file2
OK guys, I did a nested for loop (probably very inefficient), but I got it working, printing the matching MAC addresses, using this:
#!/usr/bin/bash
for scanlist in `cat scan | cut -d: -f1,2,3`
do
    for listt in `cat list`
    do
        if [[ $scanlist == $listt ]]; then
            grep $scanlist scan
        fi
    done
done
If anyone can make this more elegant, please do, but it works for me for now. I think the problem I had was that one list contained just 00:11:22 while my other list contained 00:11:22:33:44:55, which is why I cut the entries in my scan list to the same length as those in my other list. So this only outputs the matches instead of producing duplicate output.
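The nested loop can probably be replaced with a single awk pass that looks up the first 8 characters of each scanned address in a table of prefixes. This is a sketch under assumptions: each line of scan starts with the MAC address, the sample data is made up, and case is folded with toupper so 00:15:e9 and 00:15:E9 compare equal.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '00:13:46' '00:15:E9' '00:17:9A' > "$tmp/list.txt"
printf '%s\n' '00:15:e9:aa:bb:cc' 'AA:BB:CC:00:15:E9' '00:17:9A:01:02:03' > "$tmp/scan"

# First file: store each non-empty vendor prefix as an array key.
# Second file: print the line if its first 8 characters are a known prefix.
matches=$(awk 'NR == FNR { if ($0 != "") vendor[toupper($0)]; next }
               toupper(substr($0, 1, 8)) in vendor' "$tmp/list.txt" "$tmp/scan")
printf '%s\n' "$matches"
```

Each scan line is printed at most once, and an empty line in list.txt can no longer match everything.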

GREP How do I search for words that contain specific letters (one or more times)?

I'm using the operating system's dictionary file to scan. I'm creating a Java program to allow a user to enter any concoction of letters to find words that contain those letters. How would I do this using grep commands?
To find words that contain only the given letters:
grep -v '[^aeiou]' wordlist
The above drops every line in wordlist that contains any character other than those listed, so only lines made up entirely of the listed letters remain. It's sort of using a double negative to get what you want. Another way to do this would be:
grep -E '^[aeiou]+$' wordlist
which searches the whole line for a sequence of one or more of the selected letters.
To find words that contain all of the given letters is a bit more lengthy, because there may be other letters in between the ones we want:
cat wordlist | grep a | grep e | grep i | grep o | grep u
(Yes, there is a useless use of cat above, but the symmetry is better this way.)
You can use a single grep to solve the last problem in Greg's answer, provided your grep supports PCRE. (Based on this excellent answer, boiled down a bit)
grep -P "(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)" wordlist
The positive lookahead means it will match anything with an "a" anywhere, and an "e" anywhere, and.... etc etc.
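Since the letters come from user input, the chain of greps can be generated in a loop instead of being typed out. The following is a minimal POSIX shell sketch, not taken from either answer; the function name words_with_all and the sample word list are hypothetical. It uses grep -F so each letter is matched literally even if the user types a regex metacharacter.

```shell
# words_with_all LETTERS FILE
# Print the lines of FILE that contain every character in LETTERS.
words_with_all() {
    letters=$1
    out=$(cat "$2")
    while [ -n "$letters" ]; do
        rest=${letters#?}        # everything after the first character
        c=${letters%"$rest"}     # the first character itself
        letters=$rest
        # Keep only lines containing c; "|| true" tolerates an empty result.
        out=$(printf '%s\n' "$out" | grep -F -e "$c" || true)
    done
    [ -z "$out" ] || printf '%s\n' "$out"
}

# Hypothetical sample word list.
tmp=$(mktemp -d)
printf '%s\n' 'education' 'sequoia' 'audio' 'apple' > "$tmp/wordlist"
words_with_all aeiou "$tmp/wordlist"
```

On many systems the dictionary lives at /usr/share/dict/words, though that path is an assumption and varies by OS.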

Resources