Prepend match number to output match lines of grep?

Say I have this file, test.log:
blabla test test
20 30 40
hello world
100 100
34 506 795
blabla test2
50 60 70
hello
10 10
200 200
blabla test BB
30 40 50
100 100
20 20 20 20
I would like to print all lines with blabla in them, plus the line after each match - with the match number prepended.
Without match number, it is easy:
$ grep -A1 "blabla" test.log
blabla test test
20 30 40
--
blabla test2
50 60 70
--
blabla test BB
30 40 50
With a prepended match number, it would look like this:
1: blabla test test
1: 20 30 40
--
2: blabla test2
2: 50 60 70
--
3: blabla test BB
3: 30 40 50
The tricky part is that I want to preserve the match number regardless of whether I grep for a single-line match or with context (X lines before or after the match).
Is there an easy way to do this? If I could do a format specifier for the number, as in %03d, even better - but just a usual number would be fine too...

Something like
grep -A1 blabla test.log | awk -v n=1 '$0 == "--" { n += 1; print; next }
{ printf("%03d: %s\n", n, $0) }'
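As a quick sanity check, here is that pipeline run end to end against the sample test.log from the question (the counter bumps on each "--" group separator that grep -A1 emits):

```shell
# Recreate the sample file from the question.
cat > test.log <<'EOF'
blabla test test
20 30 40
hello world
100 100
34 506 795
blabla test2
50 60 70
hello
10 10
200 200
blabla test BB
30 40 50
100 100
20 20 20 20
EOF

# Number each match group; awk increments n at every "--" separator.
grep -A1 "blabla" test.log |
  awk -v n=1 '$0 == "--" { n += 1; print; next }
              { printf("%03d: %s\n", n, $0) }'
```

The first group comes out as "001: blabla test test" / "001: 20 30 40", the second as "002: ...", and so on.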

Perl to the rescue!
perl -ne '/blabla/ and print ++$i, ":$_" and print "$i:", scalar <>' -- file
-n reads the input line by line
each line is read into the special variable $_
the diamond operator <> reads the next line from the input file
scalar makes it read just one line, not all the remaining ones
the variable $i is incremented each time blabla is encountered and is prepended to each output line.
Your specification doesn't handle the case when two blablas are present on adjacent lines.
To format the numbers, use sprintf:
perl -ne 'if (/blabla/) { $f = sprintf "%03d", ++$i; print $f, ":$_"; print "$f:", scalar <>}'

Related

Parsing Text File Using Awk

I would like to parse a text file that has the section of interest as follows:
mesh 0 400 12000
400 300 400
1 0 -1
300 500 600
0 0 1
etc....
12000
1300
1100
etc..
I would only like the rows that have 3 columns, starting with the row immediately after the one that begins with the string mesh and continuing with every other row after that. I would like this output to go to a separate text file with a modified name.
So desired output text file:
400 300 400
300 500 600
I tried to do this with Python and loops, but it literally took hours and never finished, as there are thousands to hundreds of thousands of lines in the original text file.
Is there a more efficient way to do this with a bash script using awk?
awk to the rescue!
$ awk '/^mesh/{n=NR;next} NF==3 && n && NR%2==(n+1)%2' file > filtered_file
400 300 400
300 500 600
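To see the one-liner at work, here it is run against the snippet from the question (with the etc. lines left in as literal text; they have one field, so the NF==3 test skips them):

```shell
# Reconstruct the sample input from the question.
cat > file <<'EOF'
mesh 0 400 12000
400 300 400
1 0 -1
300 500 600
0 0 1
etc....
12000
1300
1100
etc..
EOF

# n remembers the line number of the "mesh" row; a row is then kept when
# it has exactly 3 fields and shares parity with the first row after "mesh".
awk '/^mesh/{n=NR;next} NF==3 && n && NR%2==(n+1)%2' file > filtered_file
cat filtered_file
```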

How to remove a word that matches at the beginning of each line

I would like to ask how I could remove lines containing the pattern AAA at their beginning.
example:
contents of file.txt:
AAA/bb/cc/d/d/d/d/e
AAA/dd/r/t/e/q/e/tg
AAA/uu/y/t/r/e/w/q
123 234 456 AAA/f/f/f/f/g/g
555 999 000 AAA/y/g/h/u/j/k
I would like to remove the first three lines with this type of pattern but would like to keep the last two lines.
The output of the command should be:
123 234 456 AAA/f/f/f/f/g/g
555 999 000 AAA/y/g/h/u/j/k
How could I do it with a unix command?
Thank you.
sed '/^AAA/d' file.txt
The /^AAA/ is a regular expression which matches AAA at the beginning of a line (^). d deletes the selected lines.
man sed for more information on the sed stream editor.
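As a quick demonstration on the sample file (note that grep -v '^AAA' file.txt is an equivalent filter):

```shell
# Recreate the sample file.
cat > file.txt <<'EOF'
AAA/bb/cc/d/d/d/d/e
AAA/dd/r/t/e/q/e/tg
AAA/uu/y/t/r/e/w/q
123 234 456 AAA/f/f/f/f/g/g
555 999 000 AAA/y/g/h/u/j/k
EOF

# Delete every line that starts with AAA; lines where AAA appears
# later in the line are untouched.
sed '/^AAA/d' file.txt
```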

Display multiple lines using multiple patterns

Hope you can shed some light on one of my requirements. Let's say I have a file with the following entries:
ABC 123
XYZ 789
XYZ 456
ABC 234
XYZ 789
ABC 567
XYZ 789
XYZ 678
XYZ 123
Basically, I have ABC rows, each followed by some number of XYZ rows. The number of XYZ records under each ABC varies from one to many.
I need a shell script that will output the ABC and the corresponding XYZ based on the patterns in the 2nd column.
For example, display the ABC record with pattern 567 and the corresponding XYZ record with pattern 678.
The output should only be:
ABC 567
XYZ 678
To solve this, I use awk to massage the data into a single line, then grep on that output, then sed to revert matching entries to the original format.
awk '{ printf ($1 == "ABC" ? "\n" : " #¶# ") $0 }' file |grep 567 |sed 's/ #¶# /\n/g'
Code walk:
I used #¶# as a delimiter. Use something that won't have conflicts in your data (otherwise you'll have to deal with escaping it). Also note that your UTF8 support mileage may vary.
awk prints, without trailing line break, two things concatenated:
If we're on an ABC line, a line break (\n). Otherwise, the delimiter (#¶#).
Then the existing line ($0)
grep then runs for your query. This lets you use -f FILE_OF_PATTERNS or a collection of -e PATTERNs
sed then reverts the delimiters back to the original format
This has the advantage of going line by line. If you have tens of thousands of XYZs in a single ABC, it'll be a bit slower, but this doesn't keep anything in memory, so this should be pretty scalable.
Here is the output of the above awk command (yes, there is a leading blank line, which doesn't matter):
$ awk '{ printf ($1 == "ABC" ? "\n" : " #¶# ") $0 }' file
ABC 123 #¶# XYZ 789 #¶# XYZ 456
ABC 234 #¶# XYZ 789
ABC 567 #¶# XYZ 789 #¶# XYZ 678 #¶# XYZ 123
Try this and see if it works for you; I hope I understood your requirement right:
awk -v p1='ABC 567' -v p2='XYZ 678' '$0~p1{t=1;print;next}/^ABC/{t=0}$0~p2&&t' file
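Run against the sample data, this prints exactly the two requested lines: t is set while we are inside the matching ABC group (and cleared at the next ABC header), the group header prints immediately, and XYZ lines print only while t is set and they match p2.

```shell
# Reconstruct the sample input from the question.
cat > file <<'EOF'
ABC 123
XYZ 789
XYZ 456
ABC 234
XYZ 789
ABC 567
XYZ 789
XYZ 678
XYZ 123
EOF

# p1 selects the ABC group header, p2 the XYZ lines to keep within it.
awk -v p1='ABC 567' -v p2='XYZ 678' \
    '$0~p1{t=1;print;next}/^ABC/{t=0}$0~p2&&t' file
```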

Why doesn't cut work?

So basically I want to print out certain columns of the .data, .rodata and .bss sections of an ELF binary, and I use this command:
readelf -S hello | grep "data\|bss" | cut -f1,2,5,6
but to my surprise, the results are:
[15] .rodata PROGBITS 080484d8 0004d8 000020 00 A 0 0 8
[24] .data PROGBITS 0804a00c 00100c 000008 00 WA 0 0 4
[25] .bss NOBITS 0804a014 001014 000008 00 WA 0 0 4
which means the cut didn't work...
I don't know why, and after some searching online I still don't know how to make it right. Could anyone give me some help?
I would have used awk here, since you can do it all with one command.
readelf -S hello | awk '/data|bss/ {print $1,$2,$5,$6}'
awk treats any run of whitespace as a field separator: one space, multiple spaces, tabs, etc.
Your input is actually delimited by spaces, not tabs. By default, cut expects a tab delimiter. Specifying a space instead should work:
cut -d ' ' -f1,2,5,6
Note, however, that cut treats every single space as a delimiter, so runs of spaces produce empty fields; if the columns are padded with repeated spaces, squeeze them first with tr -s ' '.
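The space-as-delimiter caveat is easy to demonstrate on a toy input (not readelf output, just an illustration): cut counts every single space, so two adjacent spaces produce an empty field between them, and tr -s squeezes the runs down first.

```shell
# Two spaces between a and b: field 2 is the empty string between them.
printf 'a  b c\n' | cut -d ' ' -f2

# Squeeze repeated spaces first; now field 2 is "b", as expected.
printf 'a  b c\n' | tr -s ' ' | cut -d ' ' -f2
```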

Unexpected behavior of io:fread in Erlang

This is an Erlang question.
I have run into some unexpected behavior by io:fread.
I was wondering if someone could check whether there is something wrong with the way I use io:fread or whether there is a bug in io:fread.
I have a text file which contains a "triangle of numbers" as follows:
59
73 41
52 40 09
26 53 06 34
10 51 87 86 81
61 95 66 57 25 68
90 81 80 38 92 67 73
30 28 51 76 81 18 75 44
...
There is a single space between each pair of numbers and each line ends with a carriage-return new-line pair.
I use the following Erlang program to read this file into a list.
-module(euler67).
-author('Cayle Spandon').
-export([solve/0]).
solve() ->
    {ok, File} = file:open("triangle.txt", [read]),
    Data = read_file(File),
    ok = file:close(File),
    Data.

read_file(File) ->
    read_file(File, []).

read_file(File, Data) ->
    case io:fread(File, "", "~d") of
        {ok, [N]} ->
            read_file(File, [N | Data]);
        eof ->
            lists:reverse(Data)
    end.
The output of this program is:
(erlide#cayle-spandons-computer.local)30> euler67:solve().
[59,73,41,52,40,9,26,53,6,3410,51,87,86,8161,95,66,57,25,
6890,81,80,38,92,67,7330,28,51,76,81|...]
Note how the last number of the fourth line (34) and the first number of the fifth line (10) have been merged into a single number 3410.
When I dump the text file using "od" there is nothing special about those lines; they end with cr-nl just like any other line:
> od -t a triangle.txt
0000000 5 9 cr nl 7 3 sp 4 1 cr nl 5 2 sp 4 0
0000020 sp 0 9 cr nl 2 6 sp 5 3 sp 0 6 sp 3 4
0000040 cr nl 1 0 sp 5 1 sp 8 7 sp 8 6 sp 8 1
0000060 cr nl 6 1 sp 9 5 sp 6 6 sp 5 7 sp 2 5
0000100 sp 6 8 cr nl 9 0 sp 8 1 sp 8 0 sp 3 8
0000120 sp 9 2 sp 6 7 sp 7 3 cr nl 3 0 sp 2 8
0000140 sp 5 1 sp 7 6 sp 8 1 sp 1 8 sp 7 5 sp
0000160 4 4 cr nl 8 4 sp 1 4 sp 9 5 sp 8 7 sp
One interesting observation is that some of the numbers for which the problem occurs happen to be on 16-byte boundary in the text file (but not all, for example 6890).
I'm going to go with it being a bug in Erlang, too, and a weird one. Changing the format string to "~2s" gives equally weird results:
["59","73","4","15","2","40","0","92","6","53","0","6","34",
"10","5","1","87","8","6","81","61","9","5","66","5","7",
"25","6",
[...]|...]
So it appears that it is counting the newline as a regular character for the purposes of the field width, but not when it comes to producing the output. Loopy as all hell.
A week of Erlang programming, and I'm already delving into the source. That might be a new record for me...
EDIT
A bit more investigation has confirmed for me that this is a bug. Calling one of the internal functions used by fread:
> io_lib_fread:fread([], "12 13\n14 15 16\n17 18 19 20\n", "~d").
{done,{ok,"\f"}," 1314 15 16\n17 18 19 20\n"}
Basically, if there are multiple values to be read and then a newline, the first newline gets eaten in the "still to be read" part of the string. Other testing suggests that if you prepend a space it's OK, and if you lead the string with a newline it asks for more.
I'm going to get to the bottom of this, gosh-darn-it... (grin) There's not that much code to go through, and not much of it deals specifically with newlines, so it shouldn't take too long to narrow it down and fix it.
EDIT^2
HA HA! Got the little blighter.
Here's the patch to the stdlib that you want (remember to recompile and drop the new beam file over the top of the old one):
--- ../erlang/erlang-12.b.3-dfsg/lib/stdlib/src/io_lib_fread.erl
+++ ./io_lib_fread.erl
@@ -35,9 +35,9 @@
fread_collect(MoreChars, [], Rest, RestFormat, N, Inputs).
fread_collect([$\r|More], Stack, Rest, RestFormat, N, Inputs) ->
- fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+ fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\r|More]);
fread_collect([$\n|More], Stack, Rest, RestFormat, N, Inputs) ->
- fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+ fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\n|More]);
fread_collect([C|More], Stack, Rest, RestFormat, N, Inputs) ->
fread_collect(More, [C|Stack], Rest, RestFormat, N, Inputs);
fread_collect([], Stack, Rest, RestFormat, N, Inputs) ->
@@ -55,8 +55,8 @@
eof ->
fread(RestFormat,eof,N,Inputs,eof);
_ ->
- %% Don't forget to count the newline.
- {more,{More,RestFormat,N+1,Inputs}}
+ %% Don't forget to strip and count the newline.
+ {more,{tl(More),RestFormat,N+1,Inputs}}
end;
Other -> %An error has occurred
{done,Other,More}
Now to submit my patch to erlang-patches, and reap the resulting fame and glory...
Besides the fact that it seems to be a bug in one of the Erlang libs, I think you can (very) easily circumvent the problem.
Given that your file is line-oriented, I think the best practice is to process it line by line as well.
Consider the following construction. It works nicely on an unpatched Erlang, and because it uses lazy evaluation it can handle files of arbitrary length without having to read all of it into memory first. The module contains an example of a function to apply to each line - turning a line of text representations of integers into a list of integers.
-module(liner).
-author("Harro Verkouter").
-export([liner/2, integerize/0, lazyfile/1]).
% Applies a function to all lines of the file
% before reducing (foldl).
liner(File, Fun) ->
    lists:foldl(fun(X, Acc) -> Acc ++ Fun(X) end, [], lazyfile(File)).

% Reads the lines of a file in a lazy fashion
lazyfile(File) ->
    {ok, Fd} = file:open(File, [read]),
    lazylines(Fd).

% Actually, this one does the lazy read ;)
lazylines(Fd) ->
    case io:get_line(Fd, "") of
        eof ->
            file:close(Fd), [];
        {error, Reason} ->
            file:close(Fd), exit(Reason);
        L ->
            [L | lazylines(Fd)]
    end.

% Take a line of space-separated integers (string) and transform
% them into a list of integers
integerize() ->
    fun(X) ->
        lists:map(fun(Y) -> list_to_integer(Y) end,
                  string:tokens(X, " \n"))
    end.
Example usage:
Eshell V5.6.5 (abort with ^G)
1> c(liner).
{ok,liner}
2> liner:liner("triangle.txt", liner:integerize()).
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
68,90,81,80,38,92,67,73,30|...]
And as a bonus, you can easily fold over the lines of any (line-oriented) file without running out of memory :)
6> lists:foldl( fun(X, Acc) ->
6> io:format("~.2w: ~s", [Acc,X]), Acc+1
6> end,
6> 1,
6> liner:lazyfile("triangle.txt")).
1: 59
2: 73 41
3: 52 40 09
4: 26 53 06 34
5: 10 51 87 86 81
6: 61 95 66 57 25 68
7: 90 81 80 38 92 67 73
8: 30 28 51 76 81 18 75 44
Cheers,
h.
I noticed that there are multiple instances where two numbers are merged, and it always happens at a line boundary, from the fourth line onward.
I found that if you add a whitespace character to the beginning of every line starting at the fifth, that is:
59
73 41
52 40 09
26 53 06 34
 10 51 87 86 81
 61 95 66 57 25 68
 90 81 80 38 92 67 73
 30 28 51 76 81 18 75 44
...
The numbers get parsed properly:
39> euler67:solve().
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
68,90,81,80,38,92,67,73,30|...]
It also works if you add the whitespace to the beginning of the first four lines, as well.
It's more of a workaround than an actual solution, but it works. I'd like to figure out how to set up the format string for io:fread such that we wouldn't have to do this.
UPDATE
Here's a workaround that won't force you to change the file. This assumes that all numbers are two digits (< 100):
read_file(File, Data) ->
    case io:fread(File, "", "~d") of
        {ok, [N]} ->
            if
                N > 100 ->
                    First = N div 100,
                    Second = N - (First * 100),
                    read_file(File, [First, Second | Data]);
                true ->
                    read_file(File, [N | Data])
            end;
        eof ->
            lists:reverse(Data)
    end.
Basically, the code catches any of the numbers which are the concatenation of two across a newline and splits them into two.
Again, it's a kludge that implies a possible bug in io:fread, but that should do it.
UPDATE AGAIN
The above will only work for two-digit inputs, but since the example packs every number (even those < 10) into two digits, it works for this example.
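The split itself is plain integer arithmetic: dividing by 100 recovers the first two digits, and the remainder recovers the last two. In shell terms (just an illustration of the div/mod step, using the 3410 example from the question):

```shell
# 3410 is "34" and "10" glued together across the eaten newline.
n=3410
first=$((n / 100))
second=$((n % 100))
echo "$first $second"
```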
