Do not merge the context of contiguous matches with grep

If I run grep -C 1 match over the following file:
a
b
match1
c
d
e
match2
f
match3
g
I get the following output:
b
match1
c
--
e
match2
f
match3
g
As you can see, since the contexts around the contiguous matches "match2" and "match3" overlap, they are merged. However, I would prefer to get one context block per match, even if that means duplicating input lines in the reported context. In this case, what I would like is:
b
match1
c
--
e
match2
f
--
f
match3
g
What would be the best way to achieve this? I would prefer solutions that are general enough to adapt trivially to other grep options (different values for -A, -B, -C, or entirely different flags). Ideally, I was hoping there was a clever way to do this with grep alone...

I don't think it is possible to do that using plain grep.
The sed construct below works to some extent; now I only need to figure out how to add the "--" separator:
$ sed -n -e '/match/{x;1!p;g;$!N;p;D;}' -e h log
b
match1
c
e
match2
f
f
match3
g
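If sed gets unwieldy, the "--" separator and the per-match context can both be handled with a small awk sketch (an assumption here: the file is small enough to buffer in memory; C plays the role of grep's -C):

awk -v C=1 '
    { line[NR] = $0 }                  # buffer every input line
    /match/ { m[++n] = NR }            # remember each matching line number
    END {
        for (i = 1; i <= n; i++) {
            if (i > 1) print "--"      # separator between context blocks
            lo = m[i] - C; if (lo < 1)  lo = 1
            hi = m[i] + C; if (hi > NR) hi = NR
            for (j = lo; j <= hi; j++) print line[j]
        }
    }' log

On the sample file this prints one block per match, duplicating the shared "f" line, which is exactly the output asked for above.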

I don't think this is possible using plain grep.
Have you ever used Python? In my opinion it's a perfect language for such tasks (this code snippet will work for both Python 2.7 and 3.x):
with open("your_file_name") as f:
    lines = [line.rstrip() for line in f.readlines()]
for num, line in enumerate(lines):
    if "match" in line:
        if num > 0:
            print(lines[num - 1])
        print(line)
        if num < len(lines) - 1:
            print(lines[num + 1])
        if num < len(lines) - 2:
            print("--")
This gives me:
b
match1
c
--
e
match2
f
--
f
match3
g

I'd suggest patching grep instead of working around it. In GNU grep 2.9, in src/main.c:
933   /* We print the SEP_STR_GROUP separator only if our output is
934      discontiguous from the last output in the file. */
935   if ((out_before || out_after) && used && p != lastout && group_separator)
936     {
937       PR_SGR_START_IF(sep_color);
938       fputs (group_separator, stdout);
939       PR_SGR_END_IF(sep_color);
940       fputc('\n', stdout);
941     }
942
A simple additional flag would suffice here.
Edit: Well, d'oh, it is of course not THAT simple, since grep would not reproduce the context, just add a few more separators. Because grep processes its input in a single linear pass, the whole patch is probably not that easy. Nevertheless, if you have a good case for the patch, it could be worth it.

This does not appear to be possible with plain grep or GNU grep, but it is possible with standard POSIX tools and a good shell like bash as the glue to obtain the desired output.
Note: neither python nor perl should be necessary for the solution. Worst case, use awk or sed.
One solution I rapidly prototyped looks like this. It does involve the overhead of re-reading the file for every match, so whether that is acceptable depends on your input; the give-away is that the original question uses a fixed one-line context (-C 1), which allows a simple combination of head and tail:
$ OIFS="$IFS"; lines=`grep -n match greptext.txt | /bin/cut -f1 -d:`
for l in $lines; do
    IFS=""
    match=`/bin/tail -n +$(($l-1)) greptext.txt | /bin/head -3`
    echo "$match"; echo "--"
done
IFS="$OIFS"
This might have some corner cases (e.g. a match on the very first line), and it resets IFS when that may not be necessary, but it is a hint at how the power of the POSIX shell and standard tools can be used, rather than a high-level interpreter, to get the desired output.
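The same idea generalises to arbitrary before/after counts; here is a rough sketch along those lines (B and A stand in for -B and -A, greptext.txt and the match pattern come from the prototype above, and it prints a trailing separator after the last block):

B=1; A=1
grep -n match greptext.txt | cut -f1 -d: | while read -r l; do
    start=$(( l - B )); [ "$start" -lt 1 ] && start=1
    tail -n +"$start" greptext.txt | head -n $(( l - start + 1 + A ))
    echo "--"
done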
Opinion: All good operating systems have: grep, awk, sed, tr, cut, head, tail, more, less, vi as built-ins. On the best operating systems, these are in /bin.

Related

How to type AND in regex word matching

I'm trying to do a word search with regex and wonder how to express AND for multiple criteria.
For example, how would I express the following:
(Starts with a) AND (Contains p) AND (Ends with e), such as the word apple?
Input
apple
pineapple
avocado
Code
grep -E "regex expression here" input.txt
Desired output
apple
What should the regex expression be?
In general you can't implement "and" in a regexp (though you can often fake it with .*), but you can in a multi-regexp condition using a tool that supports one.
To really exercise the "and" case, your example should have been "starts with a AND contains p AND contains l AND ends with e", with alpine in the input, so that it isn't trivial to express in a single regexp just by putting .*s between the characters, yet remains trivial as a multi-regexp condition:
$ cat file
apple
pineapple
avocado
alpine
Using &&s will find both words regardless of the order of p and l as desired:
$ awk '/^a/ && /p/ && /l/ && /e$/' file
apple
alpine
but, as you can see, you can't just use .*s to implement and:
$ grep '^a.*p.*l.*e$' file
apple
If you had to use a single regexp then you'd have to do something like:
$ grep -E '^a.*(p.*l|l.*p).*e$' file
apple
alpine
There are two ways you can do it.
First, a chain of "&&"s is the same as negating a chain of "||"s (De Morgan's laws), so you can also write the negation of what you don't want; a sketch of this appears after the multiplication example below.
Second, at the single-bit level AND is the same as multiplying the bits, so instead of writing out all the &&s, if you find them overly verbose, you can directly "multiply" the patterns together:
awk '/^a/ * /p/ * /l/ * /e$/' file
Multiplying them performs all the logical ANDs at once, but only use this shorthand when the input isn't gigantic or when the savings from &&'s short-circuiting are known to be negligible, because the multiplication evaluates every pattern.
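Here is what the first approach looks like on the same file; by De Morgan's laws, the chain of &&s above is equivalent to negating a chain of ||s:

$ awk '!( !/^a/ || !/p/ || !/l/ || !/e$/ )' file
apple
alpine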
Don't think of these as merely regex patterns. It's easier to think of anything not inside an action block (what's typically referred to as the pattern) as any combination of expressions that is ultimately evaluated to a boolean TRUE or FALSE outcome.
POSIX-compliant expressions that work in that position include sprintf(), field assignments, and so on (even decrementing NR, if there's such a need), but not statements like next, print, printf(), delete array, or any of the loop structures. Surprisingly, though, getline can be used directly in the pattern position (with some wrapper workaround).
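As a small illustration of that point (not from the original answer), an ordinary boolean expression can sit where a regexp usually would, here combined with a regexp on the same sample file:

$ awk 'length($0) > 5 && /p/' file
pineapple
alpine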

gawk grep piping redirection to output file not working

I have this file.txt:
a b c
a f g
e h j
I wrote an awk script that does
BEGIN {...}
{...}
END {
    a = "a"
    b = "b"
    system("grep " a " file.txt | grep " b " > t")
}
I expect it to print a b c in file t. Running the same script from ConEmu on Windows 7 will produce an empty t file. On the other hand, executing grep a file.txt | grep b > t will produce the expected result.
Why am I doing this: I'm parsing obscure data from a complicated file with multiple field separators (or nested fields, if you prefer). Each line of that input file (which is not file.txt) is a command that will be executed, and their order matters to me; once I know its structure, I want awk to check whether something is set to a particular value and, if so, use that as the condition for introducing a new command into the input file. file.txt is the database of conditions.
Why is that? Am I doing something wrong? Am I blind somehow? This is on Windows 7, with gawk 4.2.0 and (GNU) grep 2.4.2.
I'm also unable to find similar questions that might help; if you know of any, do flag this as a duplicate.
Since my goal was to redirect grep's output to a file from within an awk script, and since the system() route did not work, I found this morning that:
BEGIN {...}
{...}
END {
    a = "a"
    b = "b"
    ask = "grep " a " file.txt | grep " b
    while (ask | getline _foo) {
        print _foo > "dump.txt"
    }
    close(ask)
}
works just fine. As a bonus, I can do stuff with each line returned by grep. Again, this is working in Windows 7 using gawk 4.2.0 and ConEmu.
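As an aside, if spawning a shell on Windows is the fragile part, the same filtering can be done without any external command at all. This is only a sketch, reusing the file.txt and dump.txt names from above:

END {
    a = "a"
    b = "b"
    # read file.txt line by line inside awk and keep lines matching both patterns
    while ((getline line < "file.txt") > 0)
        if (line ~ a && line ~ b)
            print line > "dump.txt"
    close("file.txt")
}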

Counting the number of times each pattern in a file appears in a separate file

I am trying to scan a file (test.txt), something like this:
make
bake
baker
makes
take
cook
sbake
for patterns listed in a separate file (ref.txt):
ake
make
bake
look
I have tried looping with grep like so:
while read seq; do grep -c "$seq" test.txt; done > out.txt < ref.txt
However, it counts only exact matches, not partial matches (or is inconsistent in counting partial matches), and I get this output:
4
1
2
0
instead of
6
2
3
0
Thanks for any help!
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some, but not all, of the reasons not to try to do this with a shell loop.
The standard UNIX tool for manipulating text is awk:
$ awk 'NR==FNR{cnt[$0]=0;next} {for (re in cnt) cnt[re]+=gsub(re,"&")} END{for (re in cnt) print re, cnt[re]}' ref.txt test.txt
ake 6
bake 3
look 0
make 2
The above assumes the text in your ref.txt file doesn't contain any regexp metacharacters, or, if it does, that a regexp match is what you want. If it can contain metacharacters but you need string matching rather than regexp matching, you'd need a slightly different solution.
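For the record, a sketch of that string-matching variant could use index() instead of gsub(), counting literal substring occurrences (this is an illustration, not part of the original answer):

$ awk '
    NR==FNR { cnt[$0] = 0; next }                 # first file: the strings to count
    {
        for (str in cnt) {
            s = $0
            while ( (pos = index(s, str)) > 0 ) { # literal substring search
                cnt[str]++
                s = substr(s, pos + length(str))
            }
        }
    }
    END { for (str in cnt) print str, cnt[str] }
' ref.txt test.txt

For this particular input it prints the same counts as the regexp version (though the order of the lines from a for-in loop is unspecified).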
$ while read -r line; do grep -c $line test.txt ; done < ref.txt
6
2
3
0

Pattern matching using grep

Assuming we have one input string like
Nice
And we have the pattern
D*A*C*N*a*g*.h*ca*e
then "Nice" will match the pattern. (* means 0 or more occurrence, . means one char)
I think using grep is better than java in this case(maybe). How can I do it in grep?
Use the same regular expression:
grep 'D*A*C*N*a*g*.h*ca*e' <<EOF
Nice
EOF
If the input is "Nicely" it still prints it! How does it work?
The current regex looks for the pattern anywhere on the line. If it must match exactly (the whole line), then add anchors to start (^) and end ($) of line:
grep '^D*A*C*N*a*g*.h*ca*e$' <<EOF
Nice
Nicely
Darce
Darcy
Darcey
EOF
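As an aside, grep's -x (line-regexp) option anchors the pattern to the whole line for you, so the explicit ^ and $ can be dropped:

grep -x 'D*A*C*N*a*g*.h*ca*e' <<EOF
Nice
Nicely
EOF

which prints only Nice.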

How to generate random numbers under OpenWRT?

With a "normal" (i mean "full") linux distro, it works just fine:
sleep $(echo "$[ ($RANDOM % 9 ) ]")
ok, it waits for about 0-9 sec
but under OpenWRT (which uses "ash" rather than bash):
$ sleep $(echo "$[ ($RANDOM % 9 ) ]")
sleep: invalid number '$['
$
and here is why:
$ echo "$[ ($RANDOM % 9 ) ]"
$[ ( % 9 ) ]
$
So, does anyone have a way to generate random numbers under OpenWRT that I can feed to sleep?
Thank you
You might try something like this:
sleep `head /dev/urandom | tr -dc "0123456789" | head -c1`
Which works on my WhiteRussian OpenWRT router.
I actually don't know if this will always return a number, but when it does, it will always return 0-9, and only 1 digit (you could make it go up to 99 if you made the second head -c2).
Good luck!
You could also use awk:
sleep $(awk 'BEGIN{srand();print int(rand()*9)}')
For some scenarios, this might not yield a sufficient diversity of answers. Another approach is to use /dev/urandom directly (see e.g. https://www.2uo.de/myths-about-urandom/):
echo $(hexdump -n 4 -e '"%u"' </dev/urandom)
When using awk, note that it uses the time of day as the seed (https://linux.die.net/man/1/awk). This may matter in scenarios where the time of day is reset (e.g. no battery-backed real-time clock) or is synchronised across a fleet (e.g. a group restart).
srand([expr])
Uses expr as a new seed for the random number generator. If no expr is provided, the time of day is used. The return value is the previous seed for the random number generator.
This is confirmed by looking at the source in busybox (https://github.com/mirror/busybox/blob/master/editors/awk.c):
seed = op1 ? (unsigned)L_d : (unsigned)time(NULL);
At least for some versions of OpenWRT, it seems an explicit call to srand() is required to avoid getting the same answers repeatedly:
# awk 'BEGIN{print rand(), rand()}'
0 0.345001
# awk 'BEGIN{print rand(), rand()}'
0 0.345001
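Putting those observations together, one possible workaround is to seed awk explicitly from /dev/urandom rather than relying on the time of day (a sketch; it assumes hexdump and /dev/urandom are available on the device):

# draw a 32-bit seed from the kernel's entropy pool, then let awk do the rest
seed=$(hexdump -n 4 -e '"%u"' </dev/urandom)
sleep $(awk -v seed="$seed" 'BEGIN{ srand(seed); print int(rand()*9) }')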
