Trying to understand the behavior of grep - line separator

I'm trying to mimic grep.
One thing I don't understand is: when should I print the line separator --?
For example, running the command:
cat test.txt | grep -b -v -A 1 -i on (I'll paste test.txt at the bottom)
I get the following output:
87:aaa
91:
92:from the start Her daughters were identical twins but the resemblance was purely physical
185-Madeline had been difficult baby from day one She was one who cried nonstop who
269:refused to nurse When Paige would finally get her settled and carefully place her down
358:she rarely got as far as the nursery door before Madelines blood curdling cries began
444-again often waking Erica as well Paige would want to sob on the floor she was so
529:absolutely depleted
550:Madeline was an infant who made Paige understand why Shaken Baby Syndrome was thing
It makes perfect sense in terms of matching lines and trailing context (because of -A),
but I thought a line separator should be printed between contiguous blocks of printed lines.
For example, for the same input, my program prints:
87:aaa
91:
92:from the start Her daughters were identical twins but the resemblance was purely physical
185-Madeline had been difficult baby from day one She was one who cried nonstop who
--
269:refused to nurse When Paige would finally get her settled and carefully place her down
358:she rarely got as far as the nursery door before Madelines blood curdling cries began
444-again often waking Erica as well Paige would want to sob on the floor she was so
--
529:absolutely depleted
550:Madeline was an infant who made Paige understand why Shaken Baby Syndrome was thing
My reasoning - on the output itself:
87:aaa -- no match, so print (because -v)
91: -- no match so print (because -v)
92:from the start Her daughters were identical twins but the resemblance was purely physical -- no match, so print (because -v)
185-Madeline had been difficult baby from day one She was one who cried nonstop who -- match, but printed because of -A 1 with -v -> trailing context line; it is the last context line, so a line separator must be printed if a new selected line is found
-- -> print a line separator: found a new line that has no match (selected because of -v)
269:refused to nurse When Paige would finally get her settled and carefully place her do
..
.
.
The test.txt:
Their mother Paige would have told you there was something wrong with Madeline right
aaa
from the start Her daughters were identical twins but the resemblance was purely physical
Madeline had been difficult baby from day one She was one who cried nonstop who
refused to nurse When Paige would finally get her settled and carefully place her down
she rarely got as far as the nursery door before Madelines blood curdling cries began
again often waking Erica as well Paige would want to sob on the floor she was so
absolutely depleted
Madeline was an infant who made Paige understand why Shaken Baby Syndrome was thing
So my question is: when exactly does grep print a line separator?

Note: I'm not answering about -v and context matching, since I don't think they are well defined or even supposed to work. For example, this output doesn't make sense:
$ seq 5 | grep -v -A1 '3'
1
2
3
4
5
The separator -- is not added when two or more groups of output lines overlap or sit next to each other in the input file. Consider this sample input:
$ cat context.txt
wheat
roti
bread
blue
toy
flower
sand stone
light blue
flower
sky
water
dark red
ruby
blood
evening sky
Case 1: groups are next to each other
$ grep -C1 'flower' context.txt
toy
flower
sand stone
light blue
flower
sky
Case 2: overlapping groups
$ grep -A4 'flower' context.txt
flower
sand stone
light blue
flower
sky
water
dark red
ruby
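For someone writing their own grep clone, the rule above can be sketched in Python as follows. This is a minimal sketch, not GNU grep's actual implementation: the function name grep_with_context and its parameters are made up for illustration, and it only models the case where a context option is in effect. The idea is to remember the index of the last input line that was printed and emit -- only when the next line to print is not the one immediately after it.

```python
def grep_with_context(lines, is_selected, before=0, after=0):
    """Return the lines a grep-like tool with context options would print.

    "--" is emitted only when at least one input line was skipped between
    the previously printed line and the next one (hypothetical helper).
    """
    out = []
    last_printed = None  # index of the last input line copied to out
    for i, line in enumerate(lines):
        if not is_selected(line):
            continue
        first = max(0, i - before)
        last = min(len(lines) - 1, i + after)
        for j in range(first, last + 1):
            if last_printed is not None and j <= last_printed:
                continue  # already printed as part of an earlier group
            if last_printed is not None and j > last_printed + 1:
                out.append("--")  # gap in the input: a new group starts
            out.append(lines[j])
            last_printed = j
    return out


def has_flower(line):
    return "flower" in line


# Same content as context.txt above.
context = [
    "wheat", "roti", "bread", "blue", "toy", "flower", "sand stone",
    "light blue", "flower", "sky", "water", "dark red", "ruby", "blood",
    "evening sky",
]

# Like grep -C1 'flower': the two groups are adjacent, so no separator.
print("\n".join(grep_with_context(context, has_flower, before=1, after=1)))
```

With a selector for 'blue' and after=1, the two groups are separated by skipped input lines, so the sketch puts -- between them. In the question's own example every printed line directly follows the previous one in the input, which is why grep prints no separator there.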

Related

How to type AND in regex word matching

I'm trying to do a word search with regex and wonder how to type AND for multiple criteria.
For example, how to type the following:
(Start with a) AND (Contains p) AND (Ends with e), such as the word apple?
Input
apple
pineapple
avocado
Code
grep -E "regex expression here" input.txt
Desired output
apple
What should the regex expression be?
In general you can't implement "and" in a single regexp (though you can implement "then" with .*), but you can in a multi-regexp condition using a tool that supports it.
To really exercise "and", your example should have been: starts with a and includes p and includes l and ends with e, with input that includes alpine. That isn't trivial to express in a single regexp by just putting .*s between the characters, but it is trivial in a multi-regexp condition:
$ cat file
apple
pineapple
avocado
alpine
Using &&s will find both words regardless of the order of p and l as desired:
$ awk '/^a/ && /p/ && /l/ && /e$/' file
apple
alpine
but, as you can see, you can't just use .*s to implement and:
$ grep '^a.*p.*l.*e$' file
apple
If you had to use a single regexp then you'd have to do something like:
$ grep -E '^a.*(p.*l|l.*p).*e$' file
apple
alpine
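The same comparison can be sketched in Python, assuming its re module is an acceptable stand-in for awk and grep here. The helper name matches_all is hypothetical. It treats "and" as "every regexp finds a match", and the last line checks that the single-regexp version with the alternation selects the same words.

```python
import re

words = ["apple", "pineapple", "avocado", "alpine"]


def matches_all(patterns, text):
    """True when every regexp in patterns matches somewhere in text."""
    return all(re.search(p, text) for p in patterns)


# Multi-regexp condition: the order of p and l does not matter.
multi = [w for w in words if matches_all([r"^a", r"p", r"l", r"e$"], w)]

# Naive single regexp: forces p to come before l.
naive = [w for w in words if re.search(r"^a.*p.*l.*e$", w)]

# Single regexp that spells out both orders.
both_orders = [w for w in words if re.search(r"^a.*(p.*l|l.*p).*e$", w)]

print(multi, naive, both_orders)
```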
There are two other ways you can do it.
A chain of "&&" is the same as negating an OR ("||") of the negated conditions, so you can write the reverse of what you want and negate it.
At the single-bit level, AND is the same as multiplying the bits, so if you think a chain of && is overly verbose, you can "multiply" the patterns together directly:
awk '/^a/ * /p/ * /e$/'
Multiplying them performs all the logical ANDs at once
(but only use this shorthand if the inputs aren't too large, or when the savings from early exit are known to be negligible, since multiplication evaluates every pattern for every line).
Don't think of them as merely regex patterns. It's easier to think of anything outside an action block, what's typically referred to as the pattern, as
any combination of items that can be evaluated to a boolean TRUE or FALSE in the end.
For example, POSIX-compliant expressions that work in that position include
sprintf(),
field assignments, etc.
(even decrementing NR, if there's such a need),
but not
statements like next, print, printf(),
delete array, etc., or any of the loop structures.
Surprisingly though, getline is directly doable
in the pattern area (with some wrapper workaround).

How to print the starting position of pattern in grep

In python's regex (re) library I can do re.search("<pattern>", string).start() to get the start of the pattern (if pattern exists).
How can I do the same in the unix command line tool grep?
E.g. If pattern= "th.n" and the string is "somethingwrong", I expect to see the number 5 (considering 1-based but 4 in a 0-based would be ok)
Thank you!
For example:
echo "abcdefghij" | grep -aob "e"
outputs :
4:e
Here:
-b prints the byte offset
-a tells grep to treat the input as text
-o prints only the matched parts
With your example:
echo "somethingwrong" | grep -aob "th.n"
4:thin
This works great on multiple matches:
echo "abcdefghiqsdqdqdfjjklqsdljkhqsdlf" | grep -aob "f"
5:f
16:f
32:f
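For comparison with the re.search(...).start() call from the question, here is a small Python sketch that lists every match the way grep -ob does. The helper name match_starts is made up for illustration. Note one assumed difference: Python reports character offsets in a str, while grep -b reports byte offsets, so the numbers only agree for single-byte text such as these examples.

```python
import re


def match_starts(pattern, text):
    """Return (0-based start, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, text)]


print(match_starts("th.n", "somethingwrong"))
print(match_starts("f", "abcdefghiqsdqdqdfjjklqsdljkhqsdlf"))
```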
Maybe a Perl one-liner would be a happy medium between having to write a Python program and the simplicity of a standard Unix tool.
Given this file:
$ cat foo.txt
This thing
that thing
Not here
another thing way over here that has another thing and a third thing
thank you.
You could run this Perl one-liner:
$ perl -lne'while(/th.n/g){print $.," ",$-[0]," ",$_;}' foo.txt
1 5 This thing
2 5 that thing
4 8 another thing way over here that has another thing and a third thing
4 45 another thing way over here that has another thing and a third thing
4 63 another thing way over here that has another thing and a third thing
5 0 thank you.
Also, the grep-like search tool ack (that I wrote) has a --column option to display the column:
$ ack th.n --column foo.txt /dev/null
foo.txt
1:6:This thing
2:6:that thing
4:9:another thing way over here that has another thing and a third thing
5:1:thank you.
Or with the --nogroup option so the filename appears on each line.
$ ack th.n --column --nogroup foo.txt /dev/null
foo.txt:1:6:This thing
foo.txt:2:6:that thing
foo.txt:4:9:another thing way over here that has another thing and a third thing
foo.txt:5:1:thank you.
I had to add the search of /dev/null because ack's output would be different if there was only one file being searched.
ripgrep has a --column option, too.
$ rg --column --line-number th.n foo.txt
1:6:This thing
2:6:that thing
4:9:another thing way over here that has another thing and a third thing
5:1:thank you.

Perl6 string coercion operator ~ doesn't like leading zeros

I'm toying with Rakudo Star 2015.09.
If I try to stringify an integer with a leading zero, the compiler issues a warning:
> say (~01234).WHAT
Potential difficulties:
Leading 0 does not indicate octal in Perl 6.
Please use 0o123 if you mean that.
at <unknown file>:1
------> say (~0123<HERE>).WHAT
(Str)
I thought maybe I could help the compiler by assigning the integer value to a variable, but obtained the same result:
> my $x = 01234; say (~$x).WHAT
Potential difficulties:
Leading 0 does not indicate octal in Perl 6.
Please use 0o1234 if you mean that.
at <unknown file>:1
------> my $x = 01234<HERE>; say (~$x).WHAT
(Str)
I know this is a silly example, but is this by design? If so, why?
And how can I suppress this kind of warning message?
Is there a reason you have data with leading zeroes? I tend to run into this problem when I have a column of postal codes.
When they were first thinking about Perl 6, one of the goals was to clean up some consistency issues. We had 0x and 0b (I think by that time), but Perl 5 still had to look for the leading 0 to guess it would be octal. See Radix Markers in Synopsis 2.
But Perl 6 also has to care about what Perl 5 programmers are going to try to do and what they expect. Most people are going to expect a leading 0 to mean octal, but it doesn't mean octal. The warning is about how you typed the literal, not about how you are using it. Perl 6 has lots of warnings about things that Perl 5 people would try to use, like foreach:
$ perl6 -e 'foreach @*ARGS -> $arg { say $arg }' 1 2 3
===SORRY!=== Error while compiling -e
Unsupported use of 'foreach'; in Perl 6 please use 'for' at -e:1
------> foreach⏏ @*ARGS -> $arg { say $arg }
To suppress that sort of warning, don't do what it's warning you about. The language doesn't want you to do that. If you need a string, start with a string '01234'. Or, if you want it to be octal, start with 0o. But, realize that stringifying a number will get you back the decimal representation:
$ perl6 -e 'say ~0o1234'
668
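The same distinction between a numeric literal and its string form can be illustrated in Python. This is only a loose analogy, not Perl 6 code, and the variable names are invented for the example: a value that must keep its leading zero starts life as a string, an octal literal is written with an explicit 0o prefix, and converting a number to a string yields its decimal representation.

```python
# Data such as a postal code is text, so keep it as a string.
postal_code = "01234"

# An explicit octal literal; its value is 1*512 + 2*64 + 3*8 + 4.
octal_value = 0o1234

# Stringifying the number gives the decimal digits, not the literal typed.
as_text = str(octal_value)

# If a zero-padded rendering of a number is wanted, ask for it explicitly.
padded = f"{1234:05d}"

print(postal_code, octal_value, as_text, padded)
```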

Search for combinations of a phrase

What is the way to use 'grep' to search for combinations of a pattern in a text file?
Say, for instance I am looking for "by the way" and possible other combinations like "way by the" and "the way by"
Thanks.
Awk is the tool for this, not grep. On one line:
awk '/by/ && /the/ && /way/' file
Across the whole file:
gawk -v RS='\0' '/by/ && /the/ && /way/' file
Note that this is searching for the 3 words, not searching for combinations of those 3 words with spaces between them. Is that what you want?
Provide more details including sample input and expected output if you want more help.
The simplest approach is probably a regexp, although this one is also slightly wrong:
egrep '([ ]*(by|the|way)\>){3}'
This matches the group of your three words, including any spaces in front of each word,
forces each word to end at a word boundary (hence the \> at the end), and matches the line if any of the words in the group occurs three times in a row.
Example of running it:
$ echo -e "the the the\nby the\nby the way\nby the may\nthe way by\nby the thermo\nbypass the thermo" | egrep '([ ]*(by|the|way)\>){3}'
the the the
by the way
the way by
As already said, this produces a false positive for "the the the", but if you can live with that, I'd recommend doing it this way.
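If the false positive matters, one option is to spell out every ordering of the words instead of repeating an alternation. The Python sketch below does that with itertools.permutations. It is an illustration under the assumption that the words are separated by whitespace only, and the helper name phrase_permutation_regex is hypothetical.

```python
import itertools
import re


def phrase_permutation_regex(words):
    """Build a regexp matching the words in any order, each used exactly once."""
    alternatives = (
        r"\s+".join(re.escape(w) for w in order)
        for order in itertools.permutations(words)
    )
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")


lines = [
    "the the the",
    "by the",
    "by the way",
    "by the may",
    "the way by",
    "by the thermo",
    "bypass the thermo",
]

regex = phrase_permutation_regex(["by", "the", "way"])
hits = [line for line in lines if regex.search(line)]
print(hits)
```

The pattern grows factorially with the number of words, so this is only practical for short phrases.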

parsing a text file where each record spans more than 1 line

I need to parse a text file that contains hundreds of records that span more than 1 line each. I'm new to Python and have been trying to do this with grep and awk in several complex ways but no luck yet.
The file contains records that look like this:
409547095517 911033 00:47:41 C44 00:47:46 D44 00:47:53 00:47:55
(555) 555-1212 00:47 10/31 100 Main Street - NW
Some_City TX 323 WRLS METRO PCS
P# 122-5217 ALT# 555-555-1212 LEC:MPCSI WIRELESS CALL Q
UERY CALLER FOR LOCATION QUERY CALLER FOR PHONE #*
Really I can do all I need to if I could just get these multi-line records condensed to 1 line per record. Each record will always begin with "40", or I could let 9110 indicate the start, as these will always be there and are unique provided 40 is at the beginning of the line. I used a hex editor and found that I could remove all the line breaks (hex 0D0A), but this is no better than manually editing the files, and programmatically I'd need to not remove the last one per record. Some records will be only 2 lines but most will be 5 like this one.
Is there a way, in Python or otherwise, to concatenate the lines that make up a record into one line, where 40 (or maybe the better choice, 9110) indicates the start of the record?
Any ideas or pointers will be much appreciated. I've got python and a good IDE and I'm good with grep and find but learning awk (don't laugh)...
awk will do it. You need to identify the line that starts a record. In this case it is the line beginning with 409547095517.
So, to be safe, let's assume that a line starting with 8 digits is the start of a record.
awk ' NR > 1 && /^[0-9]{8}/ { printf("\n") }  # a new record starts: end the previous output line
{ printf("%s", $0) }                          # append every input line without a newline
END { printf("\n") }' filename > newfilename
Change the {8} to any number that works for you.
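Since the question also asks about Python, the same idea can be sketched there. This is a minimal sketch under the same assumption as the awk version (a line starting with 8 digits begins a record). The function name join_records and the shortened sample lines are made up for illustration. It also strips the CR LF line endings mentioned in the question.

```python
import re


def join_records(lines, start_pattern=r"^[0-9]{8}"):
    """Concatenate continuation lines onto the record line that precedes them."""
    start = re.compile(start_pattern)
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop CR LF or LF line endings
        if start.match(line) or not records:
            records.append(line)   # a new record begins here
        else:
            records[-1] += line    # continuation: append with no separator
    return records


# Shortened, hypothetical sample in the shape of the record shown above.
sample = [
    "409547095517 911033 00:47:41\r\n",
    "(555) 555-1212 00:47 10/31\r\n",
    "WIRELESS CALL Q\r\n",
    "UERY CALLER FOR LOCATION\r\n",
    "409547095518 911033 00:48:02\r\n",
    "Some_City TX\r\n",
]

for record in join_records(sample):
    print(record)
```

To process a real file, the lines could be read with open(path, newline="") so the original line endings reach the function unchanged. Like the awk version, this joins lines without inserting a space, which keeps words that were wrapped mid-word (such as "Q" / "UERY") intact.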
