Parse dig output and export to CSV

I'm using a dig command in a shell script and want to output the flags and the authority section in CSV format.
dig @ns1.hosangit.com djzah.com +noall +authority +comments
Output:
; <<>> DiG 9.8.3-P1 <<>> @ns1.hosangit.com djzah.com +noall +authority +comments
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64505
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; AUTHORITY SECTION:
djzah.com. 3600 IN NS ns3.eventguyz.com.
djzah.com. 3600 IN NS ns1.eventguyz.com.
djzah.com. 3600 IN NS ns2.eventguyz.com.
Expected CSV output (domain, then the flags (not always these three), then the authority section (could be 5 entries)):
djzah.com,qr,aa,rd,ns3.eventguyz.com,ns1.eventguyz.com,ns2.eventguyz.com
I was trying to use awk and/or sed but am having difficulty searching for a pattern like the flags section:
;; flags: (then use a space delimiter until you reach ;)
Then for the authority section, I assume you would search for
;; AUTHORITY SECTION:
then create an array and use only the last field of each line.
I don't know what I'm doing.
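For just the flags line, here is a minimal sketch of that idea (an illustration only, assuming the same dig invocation as above):
dig @ns1.hosangit.com djzah.com +noall +authority +comments |
awk '/^;; flags:/ { sub(/^;; flags: /, ""); sub(/;.*/, ""); gsub(/ +/, ","); print }'
For the sample output this should print qr,aa,rd; the full script below builds on the same substitutions.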

#!/usr/bin/awk -f
BEGIN { OFS = "," }
/^;; flags:/ {
    sub(/;; flags: /, "")
    sub(/;.*$/, "")
    $1 = $1
    flags = "," $0
    next
}
/^;/ || NF < 5 { next }
!($1 in a) {
    keys[++k] = $1
}
{
    t = $5
    sub(/[.][ \t\r]*$/, "", t)
    a[$1] = a[$1] "," t
}
END {
    for (i = 1; i <= k; ++i) {
        key = keys[i]
        t = key
        sub(/[.][ \t\r]*$/, "", t)
        print t flags a[key]
    }
}
Usage:
dig @ns1.hosangit.com djzah.com +noall +authority +comments | awk -f script.awk
Test:
awk -f script.awk sample
Output:
djzah.com,qr,aa,rd,ns3.eventguyz.com,ns1.eventguyz.com,ns2.eventguyz.com
BEGIN { OFS = "," }: A pattern-action block in awk normally runs for every record; a BEGIN block runs only once, at the start. This one sets OFS to ,.
/^;; flags:/ matches the ;; flags: line. The block it guards extracts the flags from the record (line): the sub() calls remove the unneeded parts, $1 = $1 forces awk to rebuild $0 using OFS, flags = "," $0 stores the now comma-separated flags in the flags variable, and next makes awk jump to the next record.
/^;/ || NF < 5 { next } makes awk skip comment lines and any line with fewer than five fields.
!($1 in a) { keys[++k] = $1 } adds $1 (e.g. djzah.com.) to the keys array the first time it is encountered.
{ t = $5; sub(/[.][ \t\r]*$/, "", t); a[$1] = a[$1] "," t } appends the value of the fifth column (e.g. ns3.eventguyz.com.) to that key's collection, with the trailing . removed.
When processing is finished, the END block executes: it iterates through the keys found and prints the data bound to each.
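If you need one CSV row each for several domains, a simple wrapper could look like this (a sketch; the second domain and the output file name are hypothetical):
#!/bin/sh
# Query each domain and append one CSV row per domain (example names only).
for d in djzah.com example.com; do
    dig @ns1.hosangit.com "$d" +noall +authority +comments | awk -f script.awk
done > domains.csv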


AWK take some data input from file and set as variable in output

I have some data in a file and need to print it in a specific format.
Example content to parse:
012231-33339411.sxz.ree.fg*-*
U2FsdGVkX1+1pfXeR/h4u6P/BrItX75L0wHVIka4yA6tqS9a5CFUWvLu1AB4x2m8NpmJ>fyoXdADqlWDiGWi6Pw1a8NgNDbdTOlMtGBz4FCi8n97UdVQX9f0a2u9d5l7lOCxVDDzd>wJXbi9x4O+Dmo/lm9DbWAjBGKwWu0tTQxsU2TIpqv
FhUZmGd3E6vN+puPXz4yXeVQhMfQ+K8OpSM2ZuTpKCtDgm0SdUDyFnalA4lxHaFZqh+E>3+9JgHK7/KiiZmIJshUmqrwnkX0yKihCcOXCzaFITiByxBM/7PGeJo0IBAjyKI/GflgQ>8GsIWWRkCJnz2OMiYKr8uOMOAfTHnW57Dq+orDG1p
012236-33349111.sxz.ree.fg*-*
bCRIVArOSClIWrZz6KciBFT2iPjqsS/qMRSBYinBzpDmESj8kZHoGQ46BMq+LgHJiY5P>7yygNxCkEv25GKGViKTX1X6KSSLZ+RVNEts4N7jzVLoufZ+X/TAv2Ib7pnnEj7h4rWDn>y7KP1XrTynItaas5z5fpFt2zUHFNElvNmyrjbFZVp
DUsnWWDuvemWUr5YwOLxeRCnwTvfw71gwGEVeBzIJq4TsZb2/G8j9vpb/L7KNybsyQNN>DlOTMW5CHzd5otyYaNBcYo9V/4ky63q2vZMzQDWtCwVPaTKREPUqPLRKea3VkQnnsUic>/iBe+6Sv5GYl+XPGbIjWbTJWLQmc1kv8LXPyvUmTm
cUVypKp9fDlyFUkOkEVAxW8dMxHJ0c83BPw37GkCvsR9itkzO0FpX0Zn+OvRQRkUCyzr>dgijhcH
I need some way, in awk, to extract the first variable from the beginning of the line up to the "-".
Example:
variable1=012231
and
variable1=012236
Variable 2 is the 4 digits after the - character.
Example:
Variable2=3333
and
variable2=3334
Variable 3 is the 2 digits after the 4 digits of variable2.
Example:
variable3=94
and
variable3=91
Variable 4 is the text that follows, up to the next header line.
Example:
variable4=U2FsdGVkX1+1pfXeR/h4u6P/BrItX75L0wHVIka4yA6tqS9a5CFUWvLu1AB4x2m8NpmJ>fyoXdADqlWDiGWi6Pw1a8NgNDbdTOlMtGBz4FCi8n97UdVQX9f0a2u9d5l7lOCxVDDzd>wJXbi9x4O+Dmo/lm9DbWAjBGKwWu0tTQxsU2TIpqv
FhUZmGd3E6vN+puPXz4yXeVQhMfQ+K8OpSM2ZuTpKCtDgm0SdUDyFnalA4lxHaFZqh+E>3+9JgHK7/KiiZmIJshUmqrwnkX0yKihCcOXCzaFITiByxBM/7PGeJo0IBAjyKI/GflgQ>8GsIWWRkCJnz2OMiYKr8uOMOAfTHnW57Dq+orDG1p
and
variable4=bCRIVArOSClIWrZz6KciBFT2iPjqsS/qMRSBYinBzpDmESj8kZHoGQ46BMq+LgHJiY5P>7yygNxCkEv25GKGViKTX1X6KSSLZ+RVNEts4N7jzVLoufZ+X/TAv2Ib7pnnEj7h4rWDn>y7KP1XrTynItaas5z5fpFt2zUHFNElvNmyrjbFZVp
DUsnWWDuvemWUr5YwOLxeRCnwTvfw71gwGEVeBzIJq4TsZb2/G8j9vpb/L7KNybsyQNN>DlOTMW5CHzd5otyYaNBcYo9V/4ky63q2vZMzQDWtCwVPaTKREPUqPLRKea3VkQnnsUic>/iBe+6Sv5GYl+XPGbIjWbTJWLQmc1kv8LXPyvUmTm
cUVypKp9fDlyFUkOkEVAxW8dMxHJ0c83BPw37GkCvsR9itkzO0FpX0Zn+OvRQRkUCyzr>dgijhcH
Example of the expected output:
'012231' '3333' '94' 'U2FsdGVkX1+1pfXeR/h4u6P/BrItX75L0wHVIka4yA6tqS9a5CFUWvLu1AB4x2m8NpmJ>fyoXdADqlWDiGWi6Pw1a8NgNDbdTOlMtGBz4FCi8n97UdVQX9f0a2u9d5l7lOCxVDDzd>wJXbi9x4O+Dmo/lm9DbWAjBGKwWu0tTQxsU2TIpqv
FhUZmGd3E6vN+puPXz4yXeVQhMfQ+K8OpSM2ZuTpKCtDgm0SdUDyFnalA4lxHaFZqh+E>3+9JgHK7/KiiZmIJshUmqrwnkX0yKihCcOXCzaFITiByxBM/7PGeJo0IBAjyKI/GflgQ>8GsIWWRkCJnz2OMiYKr8uOMOAfTHnW57Dq+orDG1p'
'012236' '3334' '91' 'bCRIVArOSClIWrZz6KciBFT2iPjqsS/qMRSBYinBzpDmESj8kZHoGQ46BMq+LgHJiY5P>7yygNxCkEv25GKGViKTX1X6KSSLZ+RVNEts4N7jzVLoufZ+X/TAv2Ib7pnnEj7h4rWDn>y7KP1XrTynItaas5z5fpFt2zUHFNElvNmyrjbFZVp
DUsnWWDuvemWUr5YwOLxeRCnwTvfw71gwGEVeBzIJq4TsZb2/G8j9vpb/L7KNybsyQNN>DlOTMW5CHzd5otyYaNBcYo9V/4ky63q2vZMzQDWtCwVPaTKREPUqPLRKea3VkQnnsUic>/iBe+6Sv5GYl+XPGbIjWbTJWLQmc1kv8LXPyvUmTm
cUVypKp9fDlyFUkOkEVAxW8dMxHJ0c83BPw37GkCvsR9itkzO0FpX0Zn+OvRQRkUCyzr>dgijhcH'
I have tested the following code, which prints by record number and by counting fixed field widths, without regard for the format or shape of the content:
awk -v FIELDWIDTHS="6 1 4 2 2 15" 'NR==1{print $1" "$3" "$4}NR==2{print}NR==3{print $1" "$3" "$4}NR==4{print}' file
But it is a large file with a variable number of records in each long string, so keying on fixed record numbers will not work; I need to capture the long string into a variable so I can print it later as a field in the output.
Could you help me with some code to parse the input and print output as close to this as possible, and please explain how the positions in the input are taken?
Thanks in advance.
Using any awk in any shell on every Unix box:
$ cat tst.awk
split($0,f,"-") > 1 {                 # header line: it contains a "-"
    if ( NR > 1 ) {                   # not the first header: print the previous record
        prt()
        delete var
    }
    var[1] = f[1]                     # digits before the "-"
    var[2] = substr(f[2],1,4)         # first 4 digits after the "-"
    var[3] = substr(f[2],5,2)         # next 2 digits after those
    next
}
{ var[4] = var[4] $0 }                # accumulate the text lines into var[4]
END { prt() }                         # print the last record
function prt(   i) {
    for ( i=1; i<=4; i++ ) {
        printf "\047%s\047%s", var[i], (i<4 ? OFS : ORS)   # \047 is a single quote
    }
}
$ awk -f tst.awk file
'012231' '3333' '94' 'U2FsdGVkX1+1pfXeR/h4u6P/BrItX75L0wHVIka4yA6tqS9a5CFUWvLu1AB4x2m8NpmJ>fyoXdADqlWDiGWi6Pw1a8NgNDbdTOlMtGBz4FCi8n97UdVQX9f0a2u9d5l7lOCxVDDzd>wJXbi9x4O+Dmo/lm9DbWAjBGKwWu0tTQxsU2TIpqvFhUZmGd3E6vN+puPXz4yXeVQhMfQ+K8OpSM2ZuTpKCtDgm0SdUDyFnalA4lxHaFZqh+E>3+9JgHK7/KiiZmIJshUmqrwnkX0yKihCcOXCzaFITiByxBM/7PGeJo0IBAjyKI/GflgQ>8GsIWWRkCJnz2OMiYKr8uOMOAfTHnW57Dq+orDG1p'
'012236' '3334' '91' 'bCRIVArOSClIWrZz6KciBFT2iPjqsS/qMRSBYinBzpDmESj8kZHoGQ46BMq+LgHJiY5P>7yygNxCkEv25GKGViKTX1X6KSSLZ+RVNEts4N7jzVLoufZ+X/TAv2Ib7pnnEj7h4rWDn>y7KP1XrTynItaas5z5fpFt2zUHFNElvNmyrjbFZVpDUsnWWDuvemWUr5YwOLxeRCnwTvfw71gwGEVeBzIJq4TsZb2/G8j9vpb/L7KNybsyQNN>DlOTMW5CHzd5otyYaNBcYo9V/4ky63q2vZMzQDWtCwVPaTKREPUqPLRKea3VkQnnsUic>/iBe+6Sv5GYl+XPGbIjWbTJWLQmc1kv8LXPyvUmTmcUVypKp9fDlyFUkOkEVAxW8dMxHJ0c83BPw37GkCvsR9itkzO0FpX0Zn+OvRQRkUCyzr>dgijhcH'
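To see how the positions are taken, here is the same split()/substr() logic applied to the first header line in isolation (any POSIX awk should print this):
$ echo '012231-33339411.sxz.ree.fg*-*' |
awk '{ split($0,f,"-"); print f[1], substr(f[2],1,4), substr(f[2],5,2) }'
012231 3333 94
split() cuts the line at each "-", so f[1] is everything before the first dash, and substr() then picks characters by position within f[2].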

Performant comparisons in awk?

I've got a Python script that runs through some logs, and figured it'd be instructive to do a few benchmarks against some other approaches before deploying it. When looking at awk, I'm hoping to minimize overhead to get a 'fair' shake at beating the somewhat optimized Python variant.
My log entries look like:
--------
SomeField=SomeValue
OptionallyAppearingField=WhoKnowsWhat
AnotherField=AnotherValue
ExtraStuff=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=3,...
--------
And I'm keen to get the value of AnotherField when one of our ThirdBonusKeys exists and has a certain value (actually just the number 1).
The 'stupid' way here is to set our RS to '--------' and then just apply a regex to $0 twice, first to see if ThirdBonusKey=1 is in the record, and then to extract AnotherField=(desired_value).
But that seems like an unfair comparison, given it's just throwing a regex at the problem (twice!). Without a guaranteed ordering of fields to leverage awk's cool FS skills, is there a quicker or more appropriate approach here? It's possible that the answer is just "this is not a job for awk", and that's okay too, I guess.
Cyrus has kindly pointed out that the sketch of code I gave above is not technically code, and he's technically correct, so here's a reasonably stupid implementation:
awk 'BEGIN{RS="--------"} { if ($0 ~ /ThirdBonusKey=1/) { for(i=1;i<=NF;i++) {if ($i ~ "AnotherField=") { print $i }}}}'
Given input
--------
SomeField=SomeValue
OptionallyAppearingField=WhoKnowsWhat
AnotherField=DesiredValue1
ExtraStuff=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=1,...
--------
SomeField=SomeValue
OptionallyAppearingField=WhoKnowsWhat
AnotherField=DesiredValue2
ExtraStuff=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=0,...
--------
SomeField=
ExtraStuff=
--------
we'd expect output
AnotherField=DesiredValue1
Most efficiently I expect:
$ awk '/^AnotherField=/{val=$0; next} /[=,]ThirdBonusKey=1(,|$)/{print val}' file
AnotherField=DesiredValue1
but more robustly and easier to enhance to do anything else you want later:
$ cat tst.awk
BEGIN { FS="[,=[:space:]]"; OFS="=" }
/^-+$/ {
    if ( f["ExtraStuff_ThirdBonusKey"] == 1 ) {
        print "AnotherField", f["AnotherField"]
    }
    delete f
    next
}
{
    if ( $1 == "ExtraStuff" ) {
        pfx = $1
        sub(/[^=]+=/,"")
        f[pfx] = $0
        pfx = pfx "_"
    }
    else {
        pfx = ""
    }
    for (i=1; i<NF; i+=2) {
        f[pfx $i] = $(i+1)
    }
}
$ awk -f tst.awk file
AnotherField=DesiredValue1
That second script first stores all of the values in an array f[] so you can access the values by their names. Here's what the contents of that array look like:
$ cat tst.awk
BEGIN { FS="[,=[:space:]]"; OFS="=" }
/^-+$/ {
    for (i in f) printf "> f[%s]=%s\n", i, f[i]
    if ( f["ExtraStuff_ThirdBonusKey"] == 1 ) {
        print "AnotherField", f["AnotherField"]
    }
    print "----"
    delete f
    next
}
{
    if ( $1 == "ExtraStuff" ) {
        pfx = $1
        sub(/[^=]+=/,"")
        f[pfx] = $0
        pfx = pfx "_"
    }
    else {
        pfx = ""
    }
    for (i=1; i<NF; i+=2) {
        f[pfx $i] = $(i+1)
    }
}
$ awk -f tst.awk file
----
> f[OptionallyAppearingField]=WhoKnowsWhat
> f[AnotherField]=DesiredValue1
> f[ExtraStuff_SecondBonusKey]=2
> f[ExtraStuff_ThirdBonusKey]=1
> f[ExtraStuff_OneBonusKey]=1
> f[SomeField]=SomeValue
> f[ExtraStuff]=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=1,...
AnotherField=DesiredValue1
----
> f[OptionallyAppearingField]=WhoKnowsWhat
> f[AnotherField]=DesiredValue2
> f[ExtraStuff_SecondBonusKey]=2
> f[ExtraStuff_ThirdBonusKey]=0
> f[ExtraStuff_OneBonusKey]=1
> f[SomeField]=SomeValue
> f[ExtraStuff]=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=0,...
----
> f[SomeField]=
> f[ExtraStuff]=
----
Given that, you can create whatever conditions and/or print whatever combinations of fields you want, in any input or output order.
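As for the timing comparison the question asks about, a rough sketch is simply to run each implementation over the same log under time and discard the output (the log and Python script names here are hypothetical):
$ time awk -f tst.awk log.txt > /dev/null
$ time python3 parse_logs.py log.txt > /dev/null
Redirecting to /dev/null keeps terminal I/O from dominating the measurement.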

Weird antlr grammar rule

I have found an old file that defines ANTLR grammar rules like this:
rule_name[ ParamType *param ] > [ReturnType *retval]:
    <<
    $retval = NULL;
    OtherType1 *new_var1 = NULL;
    OtherType2 *new_var2 = NULL;
    >>
    subrule1[ param ] > [ $retval ]
    | subrule2 > [new_var2]
    <<
    if( new_var2 == SOMETHING ){
        $retval = something_related_to_new_var2;
    }
    else{
        $retval = new_var2;
    }
    >>
    {
        somethingelse > [new_var_1]
        <<
        /* Do something with new_var_1 */
        $retval = new_var_1;
        >>
    }
    ;
I'm not an ANTLR expert, and this is the first time I have seen this kind of syntax for a rule definition.
Does anybody know where I can find documentation or information about this?
Even a keyword for a Google search is welcome.
Edit:
It should be ANTLR Version 1.33MR33.
OK, I found it! Here is the guide:
http://www.antlr2.org/book/pcctsbk.pdf
I quote the interesting parts of the PDF that answer my question.
1) Page 47:
poly > [float r]
: <<float f;>>
term>[$r] ( "\+" term>[f] <<$r += f;>> )*
;
Rule poly is defined to have a return value called $r via the "> [float r]" notation; this is similar to the output redirection character of UNIX shells. Setting the value of $r sets the return value of poly. The first action after the ":" is an init-action (because it is the first action of a rule or subrule). The init-action defines a local variable called f that will be used in the (...)* loop to hold the return value of term.
2) Page 85:
A rule looks like:
rule : alternative1
| alternative2
...
| alternativen
;
where each alternative production is composed of a list of elements that can be references to rules, references to tokens, actions, predicates, and subrules. Argument and return value definitions look like the following, where there are n arguments and m return values:
rule[arg1,...,argn] > [retval1,...,retvalm] : ... ;
The syntax for using a rule mirrors its definition:
a : ... rule[arg1,...,argn] > [v1,...,vm] ...
;
Here, the various vi receive the return values from the rule rule; each vi must be an l-value.
3) Page 87:
Actions are of the form <<...>> and contain user-supplied C or C++ code that must be executed during the parse.

LPeg Increment for Each Match

I'm making a serialization library for Lua, and I'm using LPeg to parse the string. I've got K/V pairs working (with the key explicitly named), but now I'm going to add auto-indexing.
It'll work like so:
#"value"
#"value2"
Will evaluate to
{
[1] = "value"
[2] = "value2"
}
I've already got the value matching working (strings, tables, numbers, and Booleans all work perfectly), so I don't need help with that; what I'm looking for is the indexing. For each match of #[value pattern], it should capture the number of #[value pattern]'s found so far; in other words, I can match a sequence of values (#"value1" #"value2"), but I don't know how to assign them indexes according to the number of matches. If that's not clear enough, just comment and I'll attempt to explain it better.
Here's something of what my current pattern looks like (using compressed notation):
local process = {} -- Process a captured value
process.number = tonumber
process.string = function(s) return s:sub(2, -2) end -- Strip off the opening and closing quotes
process.boolean = function(s) if s == "true" then return true else return false end end
number = [decimal number, scientific notation] / process.number
string = [double or single quoted string, supports escaped quotation characters] / process.string
boolean = (P("true") + "false") / process.boolean
table = [balanced brackets] / [parse the table]
type = number + string + boolean + table
at_notation = (P("#") * whitespace * type) / [creates a table that includes the key and value]
As you can see in the last line of code, I've got a function that does this:
k,v matched in the pattern
-- turns into --
{k, v}
-- which is then added into an "entry table" (I loop through it and add it into the return table)
Based on what you've described so far, you should be able to accomplish this using a simple capture and table capture.
Here's a simplified example I knocked up to illustrate:
lpeg = require 'lpeg'
l = lpeg.locale(lpeg)
whitesp = l.space ^ 0
bool_val = (l.P "true" + "false") / function (s) return s == "true" end
num_val = l.digit ^ 1 / tonumber
string_val = '"' * l.C(l.alnum ^ 1) * '"'
val = bool_val + num_val + string_val
at_notation = l.Ct( (l.P "#" * whitesp * val * whitesp) ^ 0 )
local testdata = [[
#"value1"
#42
# "value2"
#true
]]
local res = l.match(at_notation, testdata)
The match returns a table containing the contents:
{
[1] = "value1",
[2] = 42,
[3] = "value2",
[4] = true
}
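If you then want to walk the auto-indexed entries in order, plain Lua iteration over the res table from the example above works:
for i, v in ipairs(res) do
    print(i, v)  -- 1 value1, 2 42, 3 value2, 4 true
end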

parsing issue with comma separated csv file

I am trying to extract the 4th column from a csv file (comma separated, skipping the first 2 header lines) using this command:
awk 'NR <2 {next}{FS =","}{print $4}' filename.csv | more
However, it doesn't work because the first column contains commas, so the 4th field awk sees is not really the 4th column. Below is an example of a row:
"sdfsdfsd, sfsdf", 454,fgdfg, I_want_this_column,sdfgdg,34546, 456465, etc
Unless you have specific reasons for using awk, I would recommend using a CSV parsing library. Many scripting languages have one built-in (or at least available) and they'll save you from these headaches.
If your first column always has quotes:
$ awk 'BEGIN{ FS="\042[ ]*," } { m=split($2,a,","); print a[3] } ' file
I_want_this_column
If the column you want is always the second from last:
$ awk -F"," '{print $(NF-1)}' file
I_want_this_column
You can try this demo script to break down the columns:
awk 'BEGIN{ FS="," }
{
    for(i=1;i<=NF;i++){
        # save normal
        if($i !~ /^[ ]*\042|[ ]*\042[ ]*$/){
            a[++j]=$i
        }
        # if quotes at the end
        if(f==1 && $i ~ /[ ]*\042[ ]*$/){
            s=s","$i
            a[++j]=s
            # reset
            s="";f=0
        }
        # if quotes in front
        if($i ~ /^[ ]*\042/){
            s=s $i
            f=1
        }
        if(f==1 && ( $i !~/\042/ ) ){
            s=s","$i
        }
    }
}
END{
    # print columns
    for(p=1;p<=j;p++){
        print "Field "p,": "a[p]
    }
} ' file
Output:
$ cat file
"sdfsdfsd, sfsdf", "454,fgdfg blah , words ", I_want_this_column,sdfgdg
$ ./shell.sh
Field 1 : "sdfsdfsd, sfsdf"
Field 2 : fgdfg blah
Field 3 : "454,fgdfg blah , words "
Field 4 : I_want_this_column
Field 5 : sdfgdg
You shouldn't use awk here. Use Python's csv module, Perl's Text::CSV or Text::CSV_XS module, or another real CSV parser.
Related question -
parse csv file using gawk
If you can't avoid awk, this piece of code does the job you need:
BEGIN { FS=","; }
{
    f = 0;                           # inside an open quoted field?
    j = 0;                           # number of reassembled fields
    for (i = 1; i <= NF; ++i) {
        if (f) {
            a[j] = a[j] "," $(i);    # glue the split piece back on
            if ($(i) ~ "\"$") {
                f = 0;               # field closed its quote
            }
        }
        else {
            ++j;
            a[j] = $(i);
            if ((a[j] ~ "^\"[^\"]*$")) {
                f = 1;               # field opened a quote without closing it
            }
        }
    }
    for (i = 1; i <= j; ++i) {
        gsub("^\"","",a[i]);         # strip surrounding quotes
        gsub("\"$","",a[i]);
        gsub("\"\"","\"",a[i]);      # un-double embedded quotes
        print "i = \"" a[i] "\"";
    }
}
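Saved to a file (the name parse.awk here is just an example), it would run as:
$ awk -f parse.awk filename.csv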
Working with CSV files that have quoted fields with commas inside can be difficult with the standard UNIX text tools.
I wrote a program called csvquote to make the data easy for them to handle. In your case, you could use it like this:
csvquote filename.csv | awk -F, 'NR > 2 { print $4 }' | csvquote -u | more
or you could use cut and tail like this:
csvquote filename.csv | tail -n +3 | cut -d, -f4 | csvquote -u | more
The code and docs are here: https://github.com/dbro/csvquote
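
Finally, if you have GNU awk 4.0 or later available, its FPAT variable defines fields by their content rather than by a separator, which copes with simple quoted fields. A sketch, assuming no escaped quotes inside the quoted fields:
gawk -v FPAT='([^,]+)|("[^"]+")' 'NR > 2 { print $4 }' filename.csv
Any whitespace around the commas stays attached to the fields, so you may still want to trim it.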
