Nullify fields in pipe delimited file - parsing

I am not able to get the desired output when a data field has a pipe in it.
If the input is (sample file is tst):
hdr1|"hdr2|tst"|"hdr3|tst|tst"|hdr4|"hdr5|tst|tst"
lbl1|"lbl2|tst"|"lbl3|tst|tst"|lbl4|"lbl5|tst|tst"
I tried this command but don't get the expected output: cut -f2,3 -d"|" tst
The expected output is
"hdr2|tst"|"hdr3|tst|tst"
"lbl2|tst"|"lbl3|tst|tst"
Is there an easy way to produce this output? I don't want to go with sed because the tool I am embedding this command in doesn't allow the backslash character (\).
Also, I am using an old version of gawk,
so this command doesn't give the desired output:
gawk -v FPAT='[^|]*|("[^"]*")+' '{print $2, $3}' OFS="|"
Output of gawk --version
GNU Awk 3.1.7
Output of cat -vet tst
hdr1|"hdr2|tst"|"hdr3|tst|tst"|hdr4|"hdr5|tst|tst"$
lbl1|"lbl2|tst"|"lbl3|tst|tst"|lbl4|"lbl5|tst|tst"$

Upgrading your gawk version is by far the best approach, as you're missing a few bug fixes and a ton of extremely useful functionality introduced since gawk 3.1.7 came out 10+ years ago (we're currently on gawk version 5.1!). But if you can't do that for some reason, here's what you can do without FPAT, using any awk in any shell on every UNIX box:
$ cat tst.awk
BEGIN { OFS="|" }
{
    orig = $0
    $0 = i = ""
    # Repeatedly grab the next field: either a run of non-pipe
    # characters or one or more "..."-quoted strings, then consume
    # it plus the delimiter that follows it.
    while ( (orig != "") && match(orig,/[^|]*|("[^"]*")+/) ) {
        $(++i) = substr(orig,RSTART,RLENGTH)
        orig = substr(orig,RSTART+RLENGTH+1)
    }
    print $2, $3
}
$ awk -f tst.awk file
"hdr2|tst"|"hdr3|tst|tst"
"lbl2|tst"|"lbl3|tst|tst"
Just to verify that it's identifying all of the fields correctly:
$ cat tst.awk
BEGIN { OFS="|" }
{
    orig = $0
    $0 = i = ""
    while ( (orig != "") && match(orig,/[^|]*|("[^"]*")+/) ) {
        $(++i) = substr(orig,RSTART,RLENGTH)
        orig = substr(orig,RSTART+RLENGTH+1)
    }
    print NF " <" $0 ">"
    for (i=1; i<=NF; i++) {
        print "\t" i " <" $i ">"
    }
}
$ awk -f tst.awk file
5 <hdr1|"hdr2|tst"|"hdr3|tst|tst"|hdr4|"hdr5|tst|tst">
    1 <hdr1>
    2 <"hdr2|tst">
    3 <"hdr3|tst|tst">
    4 <hdr4>
    5 <"hdr5|tst|tst">
5 <lbl1|"lbl2|tst"|"lbl3|tst|tst"|lbl4|"lbl5|tst|tst">
    1 <lbl1>
    2 <"lbl2|tst">
    3 <"lbl3|tst|tst">
    4 <lbl4>
    5 <"lbl5|tst|tst">

If you don't have embedded double quotes, you can replace the pipes inside the quoted fields with another, unused character (I used ~), extract the fields, and then switch back to the original delimiter. Obviously this requires that the substitute character never appears in the text.
$ awk 'BEGIN{OFS=FS="\""} {for(i=2;i<NF;i+=2) gsub("\\|","~",$i)}1' file |
awk 'BEGIN{OFS=FS="|"} {print $2,$3}' |
sed 's/~/|/g'
"hdr2|tst"|"hdr3|tst|tst"
"lbl2|tst"|"lbl3|tst|tst"
Not sure it's simpler than the single awk script, though.
The main problem here is the document's format design; it needs yet another patch if there are embedded double quotes, escaped pipes, etc.

Extract bin name from Cargo.toml using Bash

I am trying to extract bin names from Cargo.toml using Bash. I enabled Perl regular expressions like this.
First attempt
grep -Pzo '(?<=(^\[\[bin\]\]))\s*name\s*=\s*"(.*)"' ./Cargo.toml
The regular expression is tested at regex101
But I got nothing.
Usage of the Pzo options can be found here.
Second attempt
grep -P (?<=(^[[bin]]))\n*\sname\s=\s*"(.*)" ./Cargo.toml
Still nothing
Cargo.toml:
[[bin]]
name = "acme1"
path = "bin/acme1.rs"
[[bin]]
name = "acme2"
path = "src/acme1.rs"
grep:
grep -A1 '^\[\[bin\]\]$' Cargo.toml |
grep -Po '(?<=^name = ")[^"]*(?=".*)'
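With the Cargo.toml from the question, this should print:
acme1
acme2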
Or, if you can use awk, this is more robust:
awk '
$1 ~ /^\[\[?[[:alnum:]]*\]\]?$/ {
    if ($1=="[[bin]]" || $1=="[bin]") { bin=1 }
    else { bin=0 }
}
bin==1 &&
sub(/^[[:space:]]*name[[:space:]]*=[[:space:]]*/, "") {
    sub(/^"/, ""); sub(/".*$/, "")
    print
}' cargo.toml
Example:
$ cat cargo.toml
[[bin]]
name = "acme1"
path = "bin/acme1.rs"
[bin]
name="acme2"
[[foo]]
name = "nobin"
[bin]
not_name = "hello"
name="acme3"
path = "src/acme3.rs"
[[bin]]
path = "bin/acme4.rs"
name = "acme4" # a comment
$ sh solution
acme1
acme2
acme3
acme4
Obviously, these are no substitute for a real toml parser.
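If a Python 3.11+ interpreter happens to be available, its standard-library tomllib is one such real parser; a minimal sketch from the shell (my assumption, not part of the original answers):
$ python3 -c '
import tomllib                           # stdlib TOML parser, Python 3.11+
with open("Cargo.toml", "rb") as f:      # tomllib requires a binary-mode file
    data = tomllib.load(f)
for b in data.get("bin", []):            # each [[bin]] table is one list item
    print(b["name"])
'
acme1
acme2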
With your shown samples and attempts, please try the following tac + awk combination, which is easier to maintain and does a job that would be difficult in grep.
tac Input_file |
awk '
/^path[[:space:]]+=[[:space:]]+"bin\//{
    found=1
    next
}
found && /^name =/{
    gsub(/"/,"",$NF)
    print $NF
    found=0
}
' |
tac
Explanation: Adding a detailed explanation for the above code.
tac Input_file |      ##Using tac on Input_file to print it from bottom to top, so each path line comes before its name line.
awk '                 ##Passing tac output to awk as standard input.
/^path[[:space:]]+=[[:space:]]+"bin\//{    ##If the line starts with path, followed by = and a value beginning with "bin/, then do the following.
    found=1           ##Setting found to 1 to mark that a bin/ path was seen.
    next              ##next skips all further statements for this line.
}
found && /^name =/{   ##If a bin/ path was seen and this line starts with name =, then do the following.
    gsub(/"/,"",$NF)  ##Globally substituting " with NULL in the last field.
    print $NF         ##Printing the name value here.
    found=0           ##Resetting found here.
}
' |                   ##Passing the awk output as input to tac here.
tac                   ##Printing the names in their original order.
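With the Cargo.toml from the question this prints only:
acme1
since acme2's path points into src/ rather than bin/.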

awk to parse file and export as variable

I am parsing a text file
Lines File Name Gen LnkLN LINK Time
----- -------------------- ---- ----- ---- ------------------------
00090 TEST1_1519230912 0 00092 .X.X Wed Feb 21 16:35:14 2018
00091 TEST2_1619330534 0 00093 .X.X Wed Feb 21 16:35:14 2018
using this code:
awk '{if (($1 ~ /^[0-9A-Fa-f]+$/) && (length($1)==5)) {
    if (! c[$4]) TLN=TLN $4 ","
    c[$4]=$4;
    if (! d[$3]) TGN=TGN $3 ","
    d[$3]=$3
    if (! b[$2]) TLNK=TLNK $2 ","
    b[$2]=$2
  }
} END {print "TLines="TLN,"TGEN="TGN,"TLink="TLNK}' /var/tmp/slink.jnk
I get the output
TLines=00092,00093, TGEN=0,0, TLink=TEST1_1519230912,TEST2_1619330534,
I have two questions with this.
First, I don't understand why the value for TGN is printed twice in the output ("0,0,"). If the file has duplicate values for a field, I want only one value in the output.
Second, I redirect this output into another file and use the source filename.txt command to set these values as environment variables for use later in the script. Is there a better way to use them as variables inside the script, rather than creating another file and sourcing it?
Use in to check whether a key has already been seen, to avoid the case where the stored value itself evaluates to false. That is what's happening with your 0 value, and why it's repeated in your output.
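You can see the problem in isolation: a field that looks like the number 0 is numerically false in awk, so ! c[$4] stays true even after c[$4]=$4 has run. A minimal demonstration (my example, not from the original post):
$ echo '0' | awk '{ c[$1] = $1; print !c[$1], ($1 in c) }'
1 1
The negation still reports the key as unseen, while the in operator correctly reports that it exists.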
$ awk '{if (($1 ~ /^[0-9A-Fa-f]+$/) && (length($1)==5)) {
    if (!($4 in c)) TLN=TLN $4 ","
    c[$4]
    if (!($3 in d)) TGN=TGN $3 ","
    d[$3]
    if (!($2 in b)) TLNK=TLNK $2 ","
    b[$2]
  }
} END {print "TLines="TLN,"TGEN="TGN,"TLink="TLNK}' f
Output:
TLines=00092,00093, TGEN=0, TLink=TEST1_1519230912,TEST2_1619330534,
EDIT
Above I've kept things close to your original version, but as mentioned in the comments, a more idiomatic and nicer version would be:
$ awk '($1 ~ /^[0-9A-Fa-f]+$/) && (length($1)==5) {
    if (!c[$4]++) TLN=TLN $4 ","
    if (!d[$3]++) TGN=TGN $3 ","
    if (!b[$2]++) TLNK=TLNK $2 ","
} END {print "TLines="TLN,"TGEN="TGN,"TLink="TLNK}' f
END EDIT
For setting the variables, this worked for me (where a.awk contains the awk code, above):
$ eval "$(awk -f a.awk f)"
$ echo $TLines
00092,00093,
$ echo $TGEN
0,
$ echo $TLink
TEST1_1519230912,TEST2_1619330534,

Count the number of occurrences of a string in a large file

I have a large 900 MB XML file, and the entire file is just one line. There is no line break between tags. I need to count the occurrences of a particular tag in that file.
I tried
grep -o '<start tag>' filename | wc -l
I get a "grep: line too long" error.
How can I get around this?
Here's a bit of a hack:
perl -ne 'BEGIN { $/ = ">"; $c = 0 } $c++ if /<start tag>/; END { print "$c\n" }' filename
The idea is to loop over "lines" that are terminated by > instead of \n (newline). That should avoid "line too long" errors. Since every occurrence of the tag itself ends with >, each such record contains at most one complete tag, so counting the matching records counts the occurrences.
Just use awk, treating the tag as the field separator; a line containing N occurrences splits into N+1 fields:
awk -F'<start tag>' '{print NF-1}' file
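For example, on a tiny hypothetical sample with two <b> tags:
$ printf '<a><b><c><b></a>' | awk -F'<b>' '{print NF-1}'
2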
If that fails, you can do this with GNU awk (for multi-char RS):
awk -v RS='<start tag>' 'END{print NR-1}' file
Here the tag itself is the record separator, so the input splits into one more record than there are occurrences, hence NR-1.

joining 2 files on matching column values using awk

I know there have been similar questions posted, but I'm still having a bit of trouble getting the output I want using awk FNR==NR.
I have 2 files as such
File 1:
123|this|is|good
456|this|is|better
...
File 2:
aaa|123
bbb|456
...
So I want to join file 2, column 2 to file 1, column 1, and output file 1 (columns 2, 3, 4) followed by file 2 (column 1).
Thanks in advance.
With awk you could do something like
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } $1 in val { $(NF + 1) = val[$1]; print }' file2 file1
NF is the number of fields in a record (line by default), so $NF is the last field, and $(NF + 1) is the field after that. By assigning the saved value from the pass over file2 to it, a new field is appended to the record before it is printed.
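With the two sample files above, this prints the joined records with the key still in front:
123|this|is|good|aaa
456|this|is|better|bbb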
One thing to note: This behaves like an inner join, i.e., only records are printed whose key appears in both files. To make this a right join, you can use
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } { $(NF + 1) = val[$1]; print }' file2 file1
That is, you can drop the $1 in val condition on the append-and-print action. If $1 is not in val, val[$1] is empty, and an empty field will be appended to the record before printing.
But it's probably better to use join:
join -1 1 -2 2 -t \| file1 file2
If you don't want the key field to be part of the output, pipe the output of either of those commands through cut -d \| -f 2- to get rid of it, i.e.
join -1 1 -2 2 -t \| file1 file2 | cut -d \| -f 2-
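With the sample files, this yields exactly the requested columns:
this|is|good|aaa
this|is|better|bbb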
If the files have the same number of lines in the same order, then
paste -d '|' file1 file2 | cut -d '|' -f 2-5
this|is|good|aaa
this|is|better|bbb
I see in a comment to Wintermute's answer that the files aren't sorted. With bash, process substitutions are handy to sort on the fly:
paste -d '|' <(sort -t '|' -k 1,1 file1) <(sort -t '|' -k 2,2 file2) |
cut -d '|' -f 2-5
To reiterate: this solution requires a one-to-one correspondence between the files.

Inserting a matched string from previous line to the current line using sed or awk

I have a CSV file that shows statistics for links on a half-hour basis. The link name only appears on the 00:00 line.
link1,0:00,0,0,0,0
,00:30,0,0,0,0
,01:00,0,0,0,0
,01:30,0,0,0,0
,02:00,0,0,0,0
,02:30,0,0,0,0
,03:00,0,0,0,0
,03:30,0,0,0,0
,23:30,0,0,0,0
....
....
link2,00:00,0,0,0,0
How do I copy the link name to each following line until a different link name appears, using sed or awk?
With awk, just keep track of the last seen non-empty link name, and always use that.
awk -F, -v OFS=, '$1 != "" { link=$1 } { $1 = link; print $0 }'
Omitting the ellipses, this gives:
link1,0:00,0,0,0,0
link1,00:30,0,0,0,0
link1,01:00,0,0,0,0
link1,01:30,0,0,0,0
link1,02:00,0,0,0,0
link1,02:30,0,0,0,0
link1,03:00,0,0,0,0
link1,03:30,0,0,0,0
link1,23:30,0,0,0,0
link2,00:00,0,0,0,0
This is a simpler job with awk, but if you want to use sed:
sed -e '/^[^,]/{h;s/,.*//;x};/^,/{G;s/^\(.*\)\n\(.*\)/\2\1/}'
Below is a commented version in sed script file format that can be run with sed -f script:
# For lines not beginning with a ',', save what precedes the first ',' in the hold space and print the original line.
/^[^,]/{
    h
    s/,.*//
    x
}
# For lines beginning with a ',', prepend what was saved in the hold space to the pattern space and print.
/^,/{
    G
    s/^\(.*\)\n\(.*\)/\2\1/
}
You can do this in a pure bash shell without starting a new process; for short inputs that avoids process startup overhead, though for large files awk or sed will typically be faster:
IFS=","
while read -r v1 v2; do    # v1 gets the text before the first comma; v2 keeps the rest, commas included
    if [[ $v1 != "" ]]; then
        link=$v1
    fi
    printf "%s,%s\n" "$link" "$v2"
done < file
