Output data with grep after symbol match - grep

I have the following script:
grep "John" uk-500.html | cut -d "<" -f2 | grep -v "St"
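This searches for John in the file uk-500.html, prints the second "<"-delimited field of each matching line (the text between the first and second "<"), and drops lines containing "St".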
I get following result:
TD ALIGN="LEFT">Berry, John M Esq
TD ALIGN="LEFT">Cain, John M Esq
TD ALIGN="LEFT">Cavuto, John A
TD ALIGN="LEFT">Cheek, John D Esq
TD ALIGN="LEFT">Elliott, John W Esq
TD ALIGN="LEFT">Gallagher, John J Esq
TD ALIGN="LEFT">Graham, John A Esq
TD ALIGN="LEFT">Hancock, John J Esq
TD ALIGN="LEFT">Howard Johnson
TD ALIGN="LEFT">John Noda A Law Ofc Lawrence E
TD ALIGN="LEFT">Johnson, Matthew E Esq
The question is: which options should I use to get output in the form "Surname, Name", with no HTML tags in front?

There are many ways to do this. One way would be to add one more pipe stage to your sequence:
grep "John" uk-500.html | cut -d "<" -f2 | grep -v "St" | sed 's/^.*>//'
This call to sed removes, from each line, everything matching the regular expression ^.*>, that is, everything from the start of the line up to and including the last >. This effectively strips the remainder of the original HTML tag from the beginning of each line.
This may not be the ideal way to do it, but it does the job with your example data.
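As a minimal sketch of the same idea, assuming each line carries a single leftover tag fragment as in the sample output: the pattern ^[^>]*> stops at the first >, whereas ^.*> is greedy and removes up to the last > on the line. The helper name strip_tag is hypothetical.

```shell
#!/bin/sh
# strip_tag (hypothetical name): remove everything up to and including
# the first ">" on each line read from standard input.
strip_tag() {
  sed 's/^[^>]*>//'
}

# Two lines taken from the sample output in the question.
printf '%s\n' 'TD ALIGN="LEFT">Berry, John M Esq' 'TD ALIGN="LEFT">Howard Johnson' | strip_tag
# Berry, John M Esq
# Howard Johnson
```

On the sample data both patterns give the same result; they differ only when a line contains more than one >.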

Related

Parsing simple string with awk or sed in linux

original string :
A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/
Depth of directories will vary, but /trunk part will always remain the same.
And a single character in front of /trunk is the indicator of that line.
desired output :
A /trunk/apple
B /trunk/apple
Z /trunk/orange
Q /trunk/melon/juice/venti/straw
*** edit
I'm sorry I made a mistake by adding a slash at the end of each path in the original string which made the output confusing. Original string didn't have the slash in front of the capital letter, but I'll leave it be.
my attempt :
echo $str1 | sed 's/\(.\/trunk\)/\n\1/g'
I feel like it should work but it doesn't.
With GNU awk for multi-char RS and RT:
$ awk -v RS='([^/]+/){2}[^/\n]+' 'RT{sub("/",OFS,RT); print RT}' file
A trunk/apple
B trunk/apple
Z trunk/orange
I'm setting RS to a regexp describing each string you want to match, i.e. 2 repetitions of non-/s followed by / and then a final string of non-/s (and non-newline for the last string on the input line). RT is automatically set to each of the matching strings, so then I just change the first / to a blank and print the result.
If each path isn't always 3 levels deep but does always start with something/trunk/, e.g.:
$ cat file
A/trunk/apple/banana/B/trunk/apple/Z/trunk/orange
then:
$ awk -v RS='[^/]+/trunk/' 'RT{if (NR>1) print pfx $0; pfx=gensub("/"," ",1,RT)} END{printf "%s%s", pfx, $0}' file
A trunk/apple/banana/
B trunk/apple/
Z trunk/orange
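Regexp RS and RT are GNU awk extensions. Where only a POSIX awk is available, a similar result can be sketched with split(), assuming (as the question states) that every indicator is directly followed by /trunk and that no deeper directory is itself named trunk. The function name split_trunk is hypothetical, and unlike the output above this sketch keeps the slash before trunk, as in the desired output of the question.

```shell
#!/bin/sh
# split_trunk (hypothetical name): start a new output line at every path
# component that is directly followed by "trunk".
split_trunk() {
  awk '{
    line = $0
    sub(/\/$/, "", line)                  # drop one trailing slash, if present
    n = split(line, part, "/")
    out = ""
    for (i = 1; i <= n; i++) {
      if (i < n && part[i + 1] == "trunk") {
        if (out != "") print out          # flush the previous path
        out = part[i] " "                 # indicator, then a space
      } else {
        out = out "/" part[i]             # ordinary path component
      }
    }
    if (out != "") print out
  }'
}

printf '%s\n' 'A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/' | split_trunk
# A /trunk/apple
# B /trunk/apple
# Z /trunk/orange/citrus
# Q /trunk/melon/juice/venti/straw
```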
To deal with more complex input, where any number of /-separated values can follow trunk on a single line, please try the following.
awk '
{
gsub(/[^/]*\/trunk/,OFS"&")
sub(/^ /,"")
sub(/\//,OFS"&")
gsub(/ +[^/]*\/trunk\/[^[:space:]]+/,"\n&")
sub(/\n/,OFS)
gsub(/\n /,ORS)
gsub(/\/trunk/,OFS"&")
sub(/[[:space:]]+/,OFS)
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
gsub(/[^/]*\/trunk/,OFS"&") ##Globally substituting everything from / to till next / followed by trunk/ with space and matched value.
sub(/^ /,"") ##Substituting starting space with NULL here.
sub(/\//,OFS"&") ##Substituting first / with space / here.
gsub(/ +[^/]*\/trunk\/[^[:space:]]+/,"\n&") ##Globally substituting spaces followed by everything till / trunk till space comes with new line and matched values.
sub(/\n/,OFS) ##Substituting new line with space.
gsub(/\n /,ORS) ##Globally substituting new line space with ORS.
gsub(/\/trunk/,OFS"&") ##Globally substituting /trunk with OFS and matched value.
sub(/[[:space:]]+/,OFS) ##Substituting spaces with OFS here.
}
1 ##Printing edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.
With your shown samples, please try following awk code.
awk '{gsub(/\/trunk/,OFS "&");gsub(/trunk\/[^/]*\//,"&\n")} 1' Input_file
In awk you can try this solution. It deals with the special requirement of removing forward slashes when the next character is upper case. Will not win a design award but works.
$ echo "A/trunk/apple/B/trunk/apple/Z/trunk/orange" |
awk -F '' '{ x=""; for(i=1;i<=NF;i++){
if($(i+1)~/[A-Z]/&&$i=="/"){$i=""};
if($i~/[A-Z]/){ printf x""$i" "}
else{ x="\n"; printf $i } }; print "" }'
A /trunk/apple
B /trunk/apple
Z /trunk/orange
Also works for n words. Actually works with anything that follows the given pattern.
$ echo "A/fruits/apple/mango/B/anything/apple/pear/banana/Z/ball/orange/anything" |
awk -F '' '{ x=""; for(i=1;i<=NF;i++){
if($(i+1)~/[A-Z]/&&$i=="/"){$i=""};
if($i~/[A-Z]/){ printf x""$i" "}
else{ x="\n"; printf $i } }; print "" }'
A /fruits/apple/mango
B /anything/apple/pear/banana
Z /ball/orange/anything
This might work for you (GNU sed):
sed 's/[^/]*/& /;s/\//\n/3;P;D' file
Separate the first word from the first / by a space.
Replace the third / by a newline.
Print/delete the first line and repeat.
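The three steps above can be tried on a short sample. This sketch assumes GNU sed (for \n in the replacement) and paths that are always exactly three components long, which is what "the third /" relies on. The wrapper name split_paths is hypothetical.

```shell
#!/bin/sh
# split_paths (hypothetical name): wraps the GNU sed command from the answer.
#   s/[^/]*/& /   put a space after the first word
#   s/\//\n/3     turn the third slash into a newline
#   P;D           print the first line, delete it, and rerun on the rest
split_paths() {
  sed 's/[^/]*/& /;s/\//\n/3;P;D'
}

printf '%s\n' 'A/trunk/apple/B/trunk/apple/Z/trunk/orange' | split_paths
# A /trunk/apple
# B /trunk/apple
# Z /trunk/orange
```

Because D restarts the script on what is left in the pattern space without reading new input, the single input line is consumed one path at a time.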
If the first word has the property that it is only one character long:
sed 's/./& /;s#/\(./\)#\n\1#;P;D' file
Or if the first word has the property that it begins with an upper case character:
sed 's/[[:upper:]][^/]*/& /;s#/\([[:upper:]][^/]*/\)#\n\1#;P;D' file
Or if the first word has the property that it is followed by /trunk/:
sed -E 's#([^/]*)(/trunk/)#\n\1 \2#g;s/.//' file
With GNU sed:
$ str="A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/"
$ sed -E 's|/?(.)(/trunk/)|\n\1 \2|g;s|/$||' <<< "$str"
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw
Note that the output begins with an empty line. If this is undesirable, we can process the first output line separately:
$ sed -E 's|(.)|\1 |;s|/(.)(/trunk/)|\n\1 \2|g;s|/$||' <<< "$str"
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw
Using gnu awk you could use FPAT to set contents of each field using a pattern.
When looping over the fields, replace the first / with a space followed by /.
str1="A/trunk/apple/B/trunk/apple/Z/trunk/orange"
echo $str1 | awk -v FPAT='[^/]+/trunk/[^/]+' '{
for(i=1;i<=NF;i++) {
sub("/", " /", $i)
print $i
}
}'
The pattern matches
[^/]+ Match any char except /
/trunk/[^/]+ Match /trunk/ and any char except /
Output
A /trunk/apple
B /trunk/apple
Z /trunk/orange
Other patterns that can be used by FPAT after the updated question:
Matching a word boundary \\< and an uppercase char A-Z and after /trunk repeat / and lowercase chars
FPAT='\\<[A-Z]/trunk(/[a-z]+)*'
If the length of the strings for the directories after /trunk are at least 2 characters:
FPAT='\\<[A-Z]/trunk(/[^/]{2,})*'
If there can be no separate folders that consist of a single uppercase char A-Z
FPAT='\\<[A-Z]/trunk(/([^/A-Z][^/]*|[^/]{2,}))*'
Output
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw
Assuming your data will always be in the format provided as a single string, you can try this sed.
$ sed 's/$/\//;s|\([A-Z]\)\([a-z/]*\)/\([a-z]*\?\)|\1 \2\3\n|g' input_file
$ echo "A/trunk/apple/pine/skunk/B/trunk/runk/bunk/apple/Z/trunk/orange/T/fruits/apple/mango/P/anything/apple/pear/banana/L/ball/orange/anything/S/fruits/apple/mango/B/rupert/cream/travel/scout/H/tall/mountains/pottery/barnes" | sed 's/$/\//;s|\([A-Z]\)\([a-z/]*\)/\([a-z]*\?\)|\1 \2\3\n|g'
A /trunk/apple/pine/skunk
B /trunk/runk/bunk/apple
Z /trunk/orange
T /fruits/apple/mango
P /anything/apple/pear/banana
L /ball/orange/anything
S /fruits/apple/mango
B /rupert/cream/travel/scout
H /tall/mountains/pottery/barnes
Some fun with perl, where you can use a non-consuming regex to autosplit into the @F array, then just print however you want.
perl -lanF'/(?=.{1,2}trunk)/' -e 'print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2'
Step #1: Split
perl -lanF'/(?=.{1,2}trunk)/'
This will take the input stream, and split each line whenever the pattern .{1,2}trunk is encountered
Because we want to retain trunk and the preceding 1 or 2 chars, we wrap the split pattern in (?=) for a non-consuming lookahead
This splits things up this way:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e 'print join " ", @F'
A /trunk/apple/ B /trunk/apple/ Z /trunk/orange/citrus/ Q /trunk/melon/juice/venti/straw/
Step 2: Format output:
The @F array contains pairs that we want to print in order, so we'll iterate over half of the array indices, and print 2 at a time:
print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2 --> Double the iterator, and print pairs
using perl -l means each print has an implicit \n at the end
The results:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e 'print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2'
A /trunk/apple/
B /trunk/apple/
Z /trunk/orange/citrus/
Q /trunk/melon/juice/venti/straw/
Endnote: Perl obfuscation that didn't work.
Any array in perl can be cast as a hash, of the format (key,val,key,val....)
So %F=@F; print "$_ $F{$_}" for keys %F seems like it would be really slick
But you lose order:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e '%F=@F; print "$_ $F{$_}" for keys %F'
Z /trunk/orange/citrus/
A /trunk/apple/
Q /trunk/melon/juice/venti/straw/
B /trunk/apple/
Update
With your new data file:
$ cat file
A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/
This GNU awk solution:
awk '
{
sub(/[/]$/,"")
gsub(/[[:upper:]]{1}/,"& ")
print gensub(/([/])([[:upper:]])/,"\n\\2","g")
}' file
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw

extract the adjacent character of selected letter

I have this text file:
# cat letter.txt
this
is
just
a
test
to
check
if
grep
works
The letter "e" appears in 3 words.
# grep e letter.txt
test
check
grep
Is there any way to return the letter immediately to the left of the selected character?
expected.txt
t
h
r
With shown samples in awk, could you please try following.
awk '/e/{print substr($0,index($0,"e")-1,1)}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/e/{ ##Looking if current line has e in it then do following.
print substr($0,index($0,"e")-1,1)
##Printing sub string from starting value of index e-1 and print 1 character from there.
}
' Input_file ##Mentioning Input_file name here.
You can use positive lookahead to match a character that is followed by an e, without making the e part of the match.
cat letter.txt | grep -oP '.(?=e)'
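Not every grep is built with -P support. As a rough alternative, assuming only that grep supports -o (GNU, BSD and BusyBox grep do, although it is not in POSIX), one can match the two-character string and keep its first character. The function name left_of_e is hypothetical. One caveat: the matches consume the e, so for runs such as "eee" this sketch reports fewer characters than the lookahead version.

```shell
#!/bin/sh
# left_of_e (hypothetical name): print the character immediately before
# each "e", one per line. Reads the named files, or stdin if none is given.
left_of_e() {
  grep -o '.e' "$@" | cut -c1
}

# Same words as letter.txt in the question.
printf '%s\n' this is just a test to check if grep works | left_of_e
# t
# h
# r
```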
With sed:
sed -nE 's/.*(.)e.*/\1/p' letter.txt
Assuming you have this input file:
cat file
this
is
just
a
test
to
check
if
grep
works
egg
element
You may use this grep + sed solution to find letter or empty string before e:
grep -oE '(^|.)e' file | sed 's/.$//'
t
h
r
l
m
Or alternatively this single awk command should also work:
awk -F 'e' 'NF > 1 {
for (i=1; i<NF; i++) print substr($i, length($i), 1)
}' file
This might work for you (GNU sed):
sed -nE '/(.)e/{s//\n\1\n/;s/^[^\n]*\n//;P;D}' file
Turn off implicit printing and enable extended regexp -nE.
Focus only on lines that meet the requirements i.e. contain a character before e.
Surround the required character by newlines.
Remove any characters before and including the first newline.
Print the first line (up to the second newline).
Delete the first line (including the newline).
Repeat.
N.B. The solution will print each such character on a separate line.
To print all such characters from one input line together on a single output line, use:
sed -nE '/(.e)/{s//\n\1/g;s/^/e/;s/e[^\n]*\n?//g;s/\B/ /g;p}' file
N.B. Remove the s/\B/ /g if space separation is not needed.
With GNU awk you can use empty string as FS to split the input as individual characters:
awk -v FS= '/[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file
t
h
r
The for loop starts at field 2, so an "e" at the beginning of the line is excluded.
Edit: to print an empty string when e is the first character in the word, for example with this input:
cat file2
grep
erroneously
egg
Wednesday
effectively
awk -v FS= '/^[e]/ {print ""} /[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file2
r
n
W
n
f
v

How to grep only if pattern1 and pattern2 matches in consecutive lines

I have a file like below:
city-italy
good food
bad climate
-
city-india
bad food
normal climate
-
city-brussel
normal dressing
stylish cookings
good food
-
Question: I want to grep the city and the food for which "food" is "bad".
For example, for the file above I need a grep command that gives output like this:
city-india
bad food
Please help me get pattern 1 and pattern 2 grepped only if both match together.
I mean both patterns should match, with the second one on a line following the first.
You can do it with pipes - grep -A1 city <filename> | grep -B1 "bad food" or cat filename | grep -A1 city | grep -B1 "bad food" (or any other stream source for the pipe)
If the city name is guaranteed to come before the food quality (any other info in between is allowed):
sed -n -e '/^city/h' -e '/bad food/{x;G;p}' input
This keeps the name of each city in the hold space and prints the most recent city name when a line matches bad food.
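To see the hold-space approach on the data from the question, here is a small sketch. The wrapper name bad_food_city is hypothetical, and a ";" is added before the closing brace so that non-GNU seds also accept the command.

```shell
#!/bin/sh
# bad_food_city (hypothetical name):
#   /^city/h     copy each city line into the hold space
#   /bad food/   on a match: x swaps in the saved city, G appends the
#                "bad food" line after it, p prints both lines
bad_food_city() {
  sed -n -e '/^city/h' -e '/bad food/{x;G;p;}' "$@"
}

printf '%s\n' city-italy 'good food' 'bad climate' - \
              city-india 'bad food' 'normal climate' - \
              city-brussel 'normal dressing' 'stylish cookings' 'good food' - |
  bad_food_city
# city-india
# bad food
```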
I know this is an old question, but here's a "robust" alternative (cuz I'm into that):
grep -x -e'city-.*' -e'good food' -e'bad food' -e'-' | tr \\n \| | sed -e's/|-|/\n/g' | grep -xe'[^|]\+|[^|]\+' | grep -e'|bad food$' | tr \| \\n
Explanation
grep -x -e'city-.*' -e'good food' -e'bad food' -e'-': only keep the lines that contain a "city line", a "food line" (either good or bad), or a "separator line" (the food line expression could be better, I know), the -x argument to grep will make it return a line only if the whole line matches the given expression (incidentally, this first stage makes the whole pipe not choke on differently-sized "registers"),
tr \\n \|: turn newlines into pipes (you can use any character that does not appear in the original file, pipe works, so does a colon, you get the idea),
sed -e's/|-|/\n/g': replace the |-| string by a newline (these are the places where we know a "register" ends; since we only kept the data we're interested in and the separators, we now have each of our "registers" on a single line, with its fields separated by pipes),
grep -xe'[^|]\+|[^|]\+': only keep lines containing exactly two fields (ie. the city and food fields),
grep -e'|bad food$': keep only lines ending in |bad food,
tr \| \\n: turn pipes back into newlines (nb. this is just here so that the output conforms to the question's specification, it's not really needed, nor preferred in my opinion).
Partial outputs
After grep -x -e'city-.*' -e'good food' -e'bad food' -e'-':
city-italy
good food
-
city-india
bad food
-
city-brussel
good food
-
After tr \\n \|:
city-italy|good food|-|city-india|bad food|-|city-brussel|good food|-|
After sed -e's/|-|/\n/g':
city-italy|good food
city-india|bad food
city-brussel|good food
After grep -xe'[^|]\+|[^|]\+': idem, since we don't have a "city line" without a "food line" in the example given, nor a register containing two "city lines" and a "food line", nor a register containing a "city line" and two "food lines", nor... you get the picture,
After grep -e'|bad food$':
city-india|bad food
After tr \| \\n:
city-india
bad food
Why is this more "robust"?
The input file basically consists of different "registers", each containing a variable number of "fields", but instead of having them in a "horizontal" format, we find them in a "vertical" one, i.e. one field per line with a lone - separating whole registers.
The pipe above supports any amount of fields in each register, it only assumes that:
Registers are separated by a lone -,
The "city fields" are all of the form city-*,
The "food fields" are either good food or bad food,
If at all existent, "city" fields appear before "food" fields.
(this last one I find particularly hard to relax, at least in a "normal"-ish pipe like the one given).
It does not assume that:
Each register has a "city" and a "food" field,
Each register has only "city" and "food" fields.
Disclaimer
I'm not claiming this is in any way better than any of the other answers, it's just that I can't do sed or awk to save my own life, and often find pipes like this are helpful in understanding how the file gets filtered and transformed.
All in all, it's just a matter of taste.
If the order is ensured, you can use directly the command grep with OR:
grep -e "city" -e "food" FILE_INPUT
Each city line should then be immediately followed by its food line.
The result looks like:
city-italy
good food
city-india
bad food
city-brussel
good food
You can change your pattern to get a more filtered result.
To get city with bad food using gnu awk (due to RS)
awk '/bad food/ {print RS $1}' RS="city" file
city-india
another awk line:
kent$ awk 'BEGIN{FS=OFS="\n";RS="-"FS}/bad food/{print $1,$2}' file
city-india
bad food
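A multi-character RS is not supported by every awk. As a sketch that sticks to POSIX awk features, assuming the layout from the question (a city-* line, then its attributes, then a lone - closing the record), the city can be remembered per record instead. The function name bad_food_cities is hypothetical.

```shell
#!/bin/sh
# bad_food_cities (hypothetical name): print each city whose record
# contains a "bad food" line, followed by that line.
bad_food_cities() {
  awk '
    /^city-/                       { city = $0 }            # remember the city of this record
    $0 == "bad food" && city != "" { print city; print $0 } # report the pair
    $0 == "-"                      { city = "" }            # record separator: forget the city
  ' "$@"
}

printf '%s\n' city-italy 'good food' 'bad climate' - \
              city-india 'bad food' 'normal climate' - \
              city-brussel 'normal dressing' 'stylish cookings' 'good food' - |
  bad_food_cities
# city-india
# bad food
```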

How to pass spaces in table (specflow scenario)?

How to pass spaces in table ?
Background:
Given the following books
|Author |(here several spaces)Title(here several spaces)|
I would do this:
Given the following books
| Author | Title |
| "J. K. Rowling" | "Harry P " |
| " Isaac Asimov " | "Robots and Empire" |
Then your bindings can be made to strip the quotes if present, but retaining the spaces.
I think this is much preferable to the idea of adding spaces afterward, because that isn't very human readable - quotations will make the spaces visible to the human (stakeholder / coder) reading them.
You can work around it by adding an extra step. Something like:
Given the following books
|Author | Title |
And append <5> spaces to book title
Edit:
A complete feature can look something like:
Scenario: Adding books with spaces in the title
Given the following book
| price | title |
And <5> spaces appended to a title
When book is saved
Then the title should be equals to <title without spaces>
I just faced same situation, my solution was this, added spaces in the step as follows:
Scenario: Adding books with spaces in the title
Given the following book ' <title> '
When book is saved
Then the title should be equals to '<title>'
| price | title |
| 50.00 | Working hard |

"Grep-ing" from A to B in hexdump's output

Here is the situation: I have to find, in the output of a hexdump, the bytes between a string A and a string B. The structure of the hexdump is something like:
-random bytes
-A + useful bytes + B
-random bytes
-A + useful bytes + B
-random bytes
And now, the questions:
- Is it possible to grep "from A to B"? I haven't seen anything like that in the man page or on the internet. I know I can do it manually, but I need to script it.
- Is it possible to show the hexdump output without the line numbers? It seems very reasonable, but I haven't found the way to do it.
Thanks!
You can use Perl-like lookaround assertions to match everything between A and B, not including A and B:
$ echo 'TEST test A foo bar B test' | grep -oP '(?<=A).*(?=B)'
foo bar
However, taking Michael's answer into account, you'll have to convert the hexdump output to a single string to use grep. You can strip off the 'line numbers' on your way:
hexdump filename | sed -r 's/\S{5,}//g' | tr '\n' ' '
or better
hexdump filename | cut -d ' ' -f 2- | tr '\n' ' '
Now everything is on one line, so grep has to be lazy, not greedy:
$ echo 'TEST test A foo bar B test A bar foo B test' | grep -oP '(?<=A).*?(?=B)'
foo bar
bar foo
But Michael has a point, maybe you should use something more high-level, at least if you need to do it more than once.
P.S. If you are OK with including A and B in the match, just do
$ echo 'TEST test A foo bar B test A bar foo B test' | grep -oP 'A.*?B'
A foo bar B
A bar foo B
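Putting the two parts together, here is a rough sketch that avoids both the offset column and grep -P. It assumes od (which prints no offsets with -An) in place of hexdump, and markers given as lowercase, space-separated hex bytes, for example 41 for the letter A. The function name between_markers is hypothetical. Matching on whole " xx " tokens keeps every match aligned to byte boundaries.

```shell
#!/bin/sh
# between_markers FILE HEX_A HEX_B  (hypothetical helper)
# Prints, one per line, the hex bytes found between marker A and the next marker B.
between_markers() {
  od -An -v -tx1 "$1" | tr '\n' ' ' | tr -s ' ' |
  awk -v a=" $2 " -v b=" $3 " '{
    s = $0
    while ((i = index(s, a)) > 0) {
      s = substr(s, i + length(a) - 1)   # keep the space that ends marker A
      j = index(s, b)
      if (j == 0) break                  # no closing marker: stop
      print substr(s, 2, j - 2)          # bytes between the two markers
      s = substr(s, j + length(b) - 1)   # continue after marker B
    }
  }'
}

sample=$(mktemp)
printf 'xxAfooBxxAbarB' > "$sample"
between_markers "$sample" 41 42
rm -f "$sample"
# 66 6f 6f
# 62 61 72
```

Multi-byte markers work the same way, for example between_markers file "41 42" "43 44".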
grep the program only works on one line at a time; you won't be able to get it to work intelligently on a hex dump.
My suggestion: use the regex functionality in perl or ruby or your favorite scripting language to search the raw binary data for the string. This example is in ruby (the m flag lets . match newline bytes, which binary data may contain):
ARGF.read.force_encoding("BINARY").scan(/STR1(.*?)STR2/m)
This will produce an array containing all the binary strings between occurrences of STR1 and STR2. From there you could run each one through hexdump(1).
