How to parse values with AWK when column number is inconsistent - parsing

Input file:
6 31236622 HLA_C*05:01:01:01 A T . PASS AF=0.07724;MAF=0.07724;R2=0.98466;IMPUTED GT:DS:HDS:GP 1|0:0.999:0.999,0.000:0.001,0.999,0.000 0|0:0:0,0:1,0,0 1|1:1.994:0.995,1.000:0.000,0.006,0.994
6 29910248 HLA_A*01:01 A T . PASS AF=0.15969;MAF=0.15969;R2=0.97333;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 1|0:1.000:1.000,0.000:0.000,1.000,0.000 0|0:0:0,0:1,0,0
6 31322134 HLA_B*55:01 A T . PASS AF=0.01091;MAF=0.01091;R2=0.94511;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0
6 31322132 HLA_B*55 A T . PASS AF=0.01091;MAF=0.01091;R2=0.94485;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0
6 31322006 HLA_B*44:02:01:01 A T . PASS AF=0.08074;MAF=0.08074;R2=0.97706;IMPUTED GT:DS:HDS:GP 1|0:0.999:0.999,0.000:0.001,0.999,0.000 0|0:0:0,0:1,0,0 1|1:1.997:0.998,0.999:0.000,0.003,0.997
I want to parse a specific number from each column after the "GT:DS:HDS:GP" column, specifically, the numbers after "x|x:". So desired output is:
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
To parse the desired values from (e.g.) line 4, I can use:
awk -F: '{for (i=5; i<=NF; i+=3) printf "%s%s", $i, (i+3 <= NF ? ", " : ORS)}'
Line 5 would require:
awk -F: '{for (i=8; i<=NF; i+=3) printf "%s%s", $i, (i+3 <= NF ? ", " : ORS)}'
So the problem with the input file is that column 3 (space-delimited) contains a variable number of colons, which makes the colon a poor delimiter for this particular file (even though the desired values are surrounded by colons!).
I thought about using "|" as the delimiter, with substr($i,3,?), but the desired values have an inconsistent number of digits (hence the "?").
Is there a flexible awk solution to get the desired output?

You may try this awk:
awk -v OFS=', ' '$9 == "GT:DS:HDS:GP" {for (i=10; i<=NF; ++i) if ($i ~ /^[0-9]+\|[0-9]+:/ && split($i, a, /:/)) printf "%s", (i == 10 ? "" : OFS) a[2]; print ""}' file
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
An expanded form:
awk -v OFS=', ' '
$9 == "GT:DS:HDS:GP" {                              # only rows whose 9th column is this FORMAT string
    for (i=10; i<=NF; ++i)                          # walk the per-sample columns
        if ($i ~ /^[0-9]+\|[0-9]+:/ && split($i, a, /:/))
            printf "%s", (i == 10 ? "" : OFS) a[2]  # a[2] is the value right after "x|x:"
    print ""
}' file

Why do you care about the space-delimited columns at all?
awk '{
    sub(/.* GT:DS:HDS:GP */, "")               # remove everything through the FORMAT column
    i = split($0, n, /[0-9]\|[0-9]:/)          # split the rest on the "x|x:" markers
    sep = ""
    for (x = 2; x <= i; x++) {                 # n[1] holds only the empty string before the first marker
        sub(/:.*/, "", n[x])                   # keep just the value before the next colon
        printf("%s%s", sep, n[x]); sep = ", "
    }
    printf "\n"
}' file
We successively pick apart each line: first remove everything through GT:DS:HDS:GP, then split the remaining string into n on the specified delimiter, then clean up the resulting fields by removing everything after the first colon in each, and print the result. (We skip the first field, which only contains the useless short or empty string before the first delimiter.)
Output for your sample:
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
I have no idea what these fields stand for, so I just picked single-letter variable names; you can probably improve readability by giving these variables more descriptive names.


Mix two variables with different lengths in a third variable in Powershell

How can I mix two variables with different lengths in a third variable?
Variable1 has 48 entries, variable2 has 16 entries. Variable3 should contain, after every three entries from variable1, the next entry from variable2 (so every fourth line comes from variable2).
The lengths of the two variables can differ, but variable1's length is always divisible by 3.
$i = 0 ; $var3 = $var1 | % { "$_ $($var2[$i])"; $i++ }
This doesn't work, because it assumes the two variables have the same length.
Example:
$Var1 (48 entries)
Name1
Location1
Country1
Name2
Location2
Country2
.
.
Name16
Location16
Country16
$Var2 (16 entries)
Date1
Date2
.
.
Date16
$Var3 (should have 64 entries)
Name1
Location1
Country1
Date1
.
.
Name16
Location16
Country16
Date16
I'm assuming $var1 and $var2 are arrays, and the "entries" are their elements.
If I had to use the method you're using to store variables, I'd do it like this:
$var1 = @('Name1','Location1','Country1','Name2','Location2','Country2','Name3','Location3',
'Country3','Name4','Location4','Country4','Name5','Location5','Country5','Name6','Location6',
'Country6','Name7','Location7','Country7','Name8','Location8','Country8','Name9','Location9',
'Country9','Name10','Location10','Country10','Name11','Location11','Country11','Name12',
'Location12','Country12','Name13','Location13','Country13','Name14','Location14','Country14',
'Name15','Location15','Country15','Name16','Location16','Country16');
$var2 = @('Date1','Date2','Date3','Date4','Date5','Date6','Date7','Date8','Date9','Date10',
'Date11','Date12','Date13','Date14','Date15','Date16');
$var3 = @();
for ($i = 0; $i -lt $var2.Count; $i++) {
$var3 += $var1[$i * 3];
$var3 += $var1[($i * 3) + 1];
$var3 += $var1[($i * 3) + 2];
$var3 += $var2[$i];
}
In reality, I'd probably store this as an array of hashtables, or in a PSObject/PSCustomObject as tuples. Hell, I might even prefer building a DataTable to a flat array.
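For instance, a minimal sketch of the PSCustomObject approach (the $records name and the property names are my own; [PSCustomObject]@{} assumes PowerShell 3.0 or later):
$records = for ($i = 0; $i -lt $var2.Count; $i++) {
    [PSCustomObject]@{
        Name     = $var1[$i * 3]        # first entry of each triple
        Location = $var1[$i * 3 + 1]
        Country  = $var1[$i * 3 + 2]
        Date     = $var2[$i]            # matching entry from the second array
    }
}
# Flatten back to the 64-entry list if you still need $var3:
$var3 = $records | ForEach-Object { $_.Name; $_.Location; $_.Country; $_.Date }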

Tcl return vs. last evaluated in proc - internals

When I write a proc in Tcl whose return value is actually the result of another proc, I can do either of the following (see implicit example):
proc foo args {
...
...
bar $var1
}
Or I could do (see explicit example):
proc foo args {
...
...
return [ bar $var1 ]
}
From an interface perspective, that is input vs. output, the two are identical.
Are they, internally? Or is there some added benefit to implicit vs. explicit return?
Thanks.
In Tcl 8.6 you can inspect the bytecode to see how such procedures compare.
If we define a pair of implementations of 'sum' and then examine them using tcl::unsupported::disassemble, we can see that using the return statement or not results in the same instruction sequence.
% proc sum_a {lhs rhs} {expr {$lhs + $rhs}}
% proc sum_b {lhs rhs} {return [expr {$lhs + $rhs}]}
% ::tcl::unsupported::disassemble proc sum_a
ByteCode 0x03C5E8E8, refCt 1, epoch 15, interp 0x01F68CE0 (epoch 15)
Source "expr {$lhs + $rhs}"
Cmds 1, src 18, inst 6, litObjs 0, aux 0, stkDepth 2, code/src 0.00
Proc 0x03CC33C0, refCt 1, args 2, compiled locals 2
slot 0, scalar, arg, "lhs"
slot 1, scalar, arg, "rhs"
Commands 1:
1: pc 0-4, src 0-17
Command 1: "expr {$lhs + $rhs}"
(0) loadScalar1 %v0 # var "lhs"
(2) loadScalar1 %v1 # var "rhs"
(4) add
(5) done
% ::tcl::unsupported::disassemble proc sum_b
ByteCode 0x03CAD140, refCt 1, epoch 15, interp 0x01F68CE0 (epoch 15)
Source "return [expr {$lhs + $rhs}]"
Cmds 2, src 27, inst 6, litObjs 0, aux 0, stkDepth 2, code/src 0.00
Proc 0x03CC4B80, refCt 1, args 2, compiled locals 2
slot 0, scalar, arg, "lhs"
slot 1, scalar, arg, "rhs"
Commands 2:
1: pc 0-5, src 0-26 2: pc 0-4, src 8-25
Command 1: "return [expr {$lhs + $rhs}]"
Command 2: "expr {$lhs + $rhs}"
(0) loadScalar1 %v0 # var "lhs"
(2) loadScalar1 %v1 # var "rhs"
(4) add
(5) done
The return statement really just documents that you intended to return this value and that it is not merely a side effect. Using return is not necessary here, but in my opinion it is to be recommended.
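Note that return does become essential as soon as you want to leave a proc before its last command; a small illustrative sketch (my own example, not from the question):
proc absdiff {a b} {
    if {$a > $b} {
        return [expr {$a - $b}]  ;# early exit: without return, evaluation would continue
    }
    expr {$b - $a}               ;# implicit result: the value of the proc's last command
}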

retrieve the grammar rules from the generated parsing tables

I have a quite old C corporate parser that was generated by an ancient Yacc and uses the yyact, yypact, yypgo, yyr1, yyr2, yytoks, yyexca, yychk and yydef tables (but no yyreds, though), and the original grammar source is lost. That legacy piece of code needs revamping, but I cannot afford to recode it from scratch.
Is it possible to mechanically retrieve / regenerate the parsing rules by deduction from the parsing tables, in order to reconstruct the grammar?
Example with a little expression parser that I can process with the same ancient Yacc:
yytabelem yyexca[] ={
-1, 1,
0, -1,
-2, 0,
-1, 21,
261, 0,
-2, 8,
};
yytabelem yyact[]={
13, 9, 10, 11, 12, 23, 8, 22, 13, 9,
10, 11, 12, 9, 10, 11, 12, 1, 2, 11,
12, 6, 7, 4, 3, 0, 16, 5, 0, 14,
15, 0, 0, 0, 17, 18, 19, 20, 21, 0,
0, 24 };
yytabelem yypact[]={
-248, -1000, -236, -261, -236, -236, -1000, -1000, -248, -236,
-236, -236, -236, -236, -253, -1000, -263, -245, -245, -1000,
-1000, -249, -1000, -248, -1000 };
yytabelem yypgo[]={
0, 17, 24 };
yytabelem yyr1[]={
0, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2 };
yytabelem yyr2[]={
0, 8, 12, 0, 6, 6, 6, 6, 6, 6,
4, 2, 2 };
yytabelem yychk[]={
-1000, -1, 266, -2, 259, 263, 257, 258, 267, 262,
263, 264, 265, 261, -2, -2, -1, -2, -2, -2,
-2, -2, 260, 268, -1 };
yytabelem yydef[]={
3, -2, 0, 0, 0, 0, 11, 12, 3, 0,
0, 0, 0, 0, 0, 10, 1, 4, 5, 6,
7, -2, 9, 3, 2 };
yytoktype yytoks[] =
{
"NAME", 257,
"NUMBER", 258,
"LPAREN", 259,
"RPAREN", 260,
"EQUAL", 261,
"PLUS", 262,
"MINUS", 263,
"TIMES", 264,
"DIVIDE", 265,
"IF", 266,
"THEN", 267,
"ELSE", 268,
"LOW", 269,
"UMINUS", 270,
"-unknown-", -1 /* ends search */
};
/* I am getting this table in my example,
but it is not present in the studied parser :^( */
char * yyreds[] =
{
"-no such reduction-",
"stmt : IF exp THEN stmt",
"stmt : IF exp THEN stmt ELSE stmt",
"stmt : /* empty */",
"exp : exp PLUS exp",
"exp : exp MINUS exp",
"exp : exp TIMES exp",
"exp : exp DIVIDE exp",
"exp : exp EQUAL exp",
"exp : LPAREN exp RPAREN",
"exp : MINUS exp",
"exp : NAME",
"exp : NUMBER",
};
I am looking to retrieve
stmt : IF exp THEN stmt
| IF exp THEN stmt ELSE stmt
| /*others*/
;
exp : exp PLUS exp
| exp MINUS exp
| exp TIMES exp
| exp DIVIDE exp
| exp EQUAL exp
| LPAREN exp RPAREN
| MINUS exp
| NAME
| NUMBER
;
Edit: I have stripped down the generated parser of my example for clarity, but to help analysis I have published the whole generated code as a gist. Please note that, for some unknown reason, there is no yyreds table in the parser I am trying to study / change. I suppose it would not have been fun :^S
An interesting problem. Just from matching the tables to the grammar, it seems that yyr1 and yyr2 give you the "outline" of the rules -- yyr1 is the symbol on the left side of each rule, while yyr2 is 2x the number of symbols on the right side. You also have the names of all the terminals in a convenient table, but the names of the non-terminals are lost.
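As a quick illustration, a minimal sketch (my own, assuming the yytabelem tables above are compiled into the same file) that dumps that outline:
#include <stdio.h>

#define ALEN(A) (sizeof(A)/sizeof(A[0]))

int main(void) {
    /* entry 0 is the "-no such reduction-" placeholder; user rules start at 1 */
    for (int r = 1; r < (int)ALEN(yyr1); r++)
        printf("rule %d: lhs = non-terminal %d, rhs has %d symbols\n",
               r, yyr1[r], yyr2[r] / 2);
    return 0;
}
For the example tables this reports, e.g., rule 1 with lhs symbol 1 and 4 rhs symbols -- matching stmt : IF exp THEN stmt.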
To figure out which symbols go on the rhs of each rule, you'll need to reconstruct the state machine from the tables, which likely involves reading and understanding the code in the y.tab.c file that actually does the parsing. Some of the tables (looks like yypact, yychk and yydef) are indexed by state number. It seems likely that yyact is indexed by yypact[state] + token. But those are only guesses. You need to look at the parsing code and understand how it's using the tables to encode possible shifts, reduces, and gotos.
Once you have the state machine, you can backtrack from the states containing reductions of specific rules through the states that have shifts and gotos of that rule. A shift into a reduction state means the last symbol on the rhs of that rule is the token shifted. A goto into a reduction state means the last symbol on the rhs is symbol for the goto. The second-to-last symbol comes from the shift/goto to the state that does the shift/goto to the reduction state, and so on.
edit
As I surmised, yypact is the 'primary action' for a state. If the value is YYFLAG (-1000), this is a reduce-only state (no shifts). Otherwise it is a potential shift state and yyact[yypact[state] + token] gives you the potential state to shift to. If yypact[state] + token is out of range for the yyact table, or the token doesn't match the entry symbol for that state, then there's no shift on that token.
yychk is the entry symbol for each state -- a positive number means you shift to that state on that token, while a negative means you goto that state on that non-terminal.
yydef is the reduction for that state -- a positive number means reduce that rule, 0 means no reduction, and -2 means two or more possible reductions. yyexca is the table of reductions for those states with more than one reduction. A pair '-1, state' means the following entries are for the given state; the following pairs 'token, rule' mean that for lookahead token it should reduce rule. A -2 for the token is a wildcard (end of the list), while a 0 for the rule means no rule to reduce (an error instead), and -1 means accept the input.
The yypgo table is the gotos for a symbol -- you go to state yyact[yypgo[sym] + state + 1] if that's in range for yyact and yyact[yypgo[sym]] otherwise.
So to reconstruct rules, look at the yydef and yyexca tables to see which states reduce each rule, and go backwards to see how the state is reached.
For example, rule #1. From the yyr1 and yyr2 tables, we know it's of the form S1: X X X X -- non-terminal #1 on the lhs and 4 symbols on the rhs. It's reduced in state 16 (from the yydef table), and the accessing symbol for state 16 (from yychk) is -1. So it's:
S1: ?? ?? ?? S1
You get into state 16 from yyact[26], and yypgo[1] == 17, so that means the goto is coming from state 8 (26 == yypgo[1] + 8 + 1). The accessing symbol of state 8 is 267 (THEN), so now we have:
S1: ?? ?? THEN S1
You get into state 8 from yyact[6], so the previous state has yypact[state] == -261 which is state 3. yychk[3] == -2, so we have:
S1: ?? S2 THEN S1
You get into state 3 from yyact[24], and yypgo[2] == 24 so any state might goto 3 here. So we're now kind of stuck for this rule; to figure out what the first symbol is, we need to work our way forward from state 0 (the start state) to reconstruct the state machine.
edit
This code will decode the state machine from the table format above and print out all the shift/reduce/goto actions in each state:
#define ALEN(A) (sizeof(A)/sizeof(A[0]))

/* paste after the yy tables, e.g. inside main() with <stdio.h> included */
for (int state = 0; state < ALEN(yypact); state++) {
    printf("state %d:\n", state);
    for (int i = 0; i < ALEN(yyact); i++) {
        int sym = yychk[yyact[i]];
        if (sym > 0 && i == yypact[state] + sym)           /* shift on a token */
            printf("\ttoken %d shift state %d\n", sym, yyact[i]);
        if (sym < 0 && -sym < ALEN(yypgo) &&
            (i == yypgo[-sym] || i == yypgo[-sym] + state + 1))  /* goto on a non-terminal */
            printf("\tsymbol %d goto state %d\n", -sym, yyact[i]);
    }
    if (yydef[state] > 0)                                  /* unconditional reduction */
        printf("\tdefault reduce rule %d\n", yydef[state]);
    if (yydef[state] < 0) {                                /* look the state up in yyexca */
        for (int i = 0; i < ALEN(yyexca); i += 2) {
            if (yyexca[i] == -1 && yyexca[i+1] == state) {
                for (int j = i+2; j < ALEN(yyexca) && yyexca[j] != -1; j += 2) {
                    if (yyexca[j] < 0) printf("\tdefault ");
                    else printf("\ttoken %d ", yyexca[j]);
                    if (yyexca[j+1] < 0) printf("accept\n");
                    else if (yyexca[j+1] == 0) printf("error\n");
                    else printf("reduce rule %d\n", yyexca[j+1]);
                }
            }
        }
    }
}
It will produce output like:
state 0:
symbol 1 goto state 1
token 266 shift state 2
symbol 2 goto state 3
default reduce rule 3
state 1:
symbol 1 goto state 1
symbol 2 goto state 3
token 0 accept
default error
state 2:
symbol 1 goto state 1
token 257 shift state 6
token 258 shift state 7
token 259 shift state 4
symbol 2 goto state 3
token 263 shift state 5
state 3:
token 261 shift state 13
token 262 shift state 9
token 263 shift state 10
token 264 shift state 11
token 265 shift state 12
token 267 shift state 8
symbol 1 goto state 1
symbol 2 goto state 3
..etc
which should be helpful for reconstructing the grammar.

How to use some text processing (awk etc.) to put some character in a text file at certain lines

I have a text file which has hex values, one value on each separate line. A file has many such values, one below another. I need to do some analysis of the values, for which I need to put some kind of delimiter/marker, say a '#', in this file before line numbers 32, 47, 62, 77... The difference between two line numbers in this pattern is always 15.
I am trying to do it using awk. I tried a few things but they didn't work.
What is the command in awk to do it?
Any other solution involving some other language/script/tool is also welcome.
Thank you.
-AD
This is how you can use AWK for it,
awk 'BEGIN{ i=0; } \
{if (FNR<32) {print $0} \
else {if (i%15) {print $0} else {printf "#%s\n",$0}; i++}\
}' inputfile.txt > outputfile.txt
How it works,
BEGIN sets a counter i for the records from line 32 onward
FNR<32 passes the first 31 lines through unchanged
input lines are called records and FNR is an AWK variable that counts them
From line 32 on, i%15 is 0 on every 15th record (lines 32, 47, 62, ...), and those get the # prefix
$0 prints the record (the line) as is
You can type the whole command on a single line, skipping the trailing '\' characters.
Or, you can use it as an AWK file,
# File: comment.awk
BEGIN { i = 0 }
{
    if (FNR < 32)
        print $0
    else {
        if (i % 15)
            print $0
        else
            printf "#%s\n", $0
        i++
    }
}
And run it as,
awk -f comment.awk inputfile.txt > outputfile.txt
Hope this will help you to use more AWK.
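If you would rather have the marker on a line of its own (as the answers below do), a one-line sketch using the question's 32/15 spacing could be:
awk 'FNR>=32 && (FNR-32)%15==0 { print "#" } 1' inputfile.txt > outputfile.txt
The trailing 1 is awk shorthand for "print every record unchanged".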
Python:
f_in = open("file.txt")
f_out = open("file_out.txt","w")
offset = 4 # 0 <= offset < 15 ; first marker after fourth line in this example
for num,line in enumerate(f_in):
if not (num-offset) % 15:
f_out.write("#\n")
f_out.write(line)
Haskell:
offset = 31;
chunk_size = 15;
main = do
{
(h, t) <- fmap (splitAt offset . lines) getContents;
mapM_ putStrLn h;
mapM_ ((putStrLn "#" >>) . mapM_ putStrLn) $
map (take chunk_size) $
takeWhile (not . null) $
iterate (drop chunk_size) t;
}
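If saved as mark.hs (the file name is mine), this reads standard input, so it could be run as, e.g.:
runghc mark.hs < inputfile.txt > outputfile.txt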

How to perform calculation over a log file

I have a log file that looks like this:
I, [2009-03-04T15:03:25.502546 #17925] INFO -- : [8541, 931, 0, 0]
I, [2009-03-04T15:03:26.094855 #17925] INFO -- : [8545, 6678, 0, 0]
I, [2009-03-04T15:03:26.353079 #17925] INFO -- : [5448, 1598, 185, 0]
I, [2009-03-04T15:03:26.360148 #17925] INFO -- : [8555, 1747, 0, 0]
I, [2009-03-04T15:03:26.367523 #17925] INFO -- : [7630, 278, 0, 0]
I, [2009-03-04T15:03:26.375845 #17925] INFO -- : [7640, 286, 0, 0]
I, [2009-03-04T15:03:26.562425 #17925] INFO -- : [5721, 896, 0, 0]
I, [2009-03-04T15:03:30.951336 #17925] INFO -- : [8551, 4752, 1587, 1]
I, [2009-03-04T15:03:30.960007 #17925] INFO -- : [5709, 5295, 0, 0]
I, [2009-03-04T15:03:30.966612 #17925] INFO -- : [7252, 4928, 0, 0]
I, [2009-03-04T15:03:30.974251 #17925] INFO -- : [8561, 4883, 1, 0]
I, [2009-03-04T15:03:31.230426 #17925] INFO -- : [8563, 3866, 250, 0]
I, [2009-03-04T15:03:31.236830 #17925] INFO -- : [8567, 4122, 0, 0]
I, [2009-03-04T15:03:32.056901 #17925] INFO -- : [5696, 5902, 526, 1]
I, [2009-03-04T15:03:32.086004 #17925] INFO -- : [5805, 793, 0, 0]
I, [2009-03-04T15:03:32.110039 #17925] INFO -- : [5786, 818, 0, 0]
I, [2009-03-04T15:03:32.131433 #17925] INFO -- : [5777, 840, 0, 0]
I'd like to create a shell script that calculates the average of the 2nd and 3rd fields in brackets (840 and 0 in the last example). An even tougher question: is it possible to get the average of the 3rd field only when the last one is not 0?
I know I could use Ruby or another language to create a script, but I'd like to do it in Bash. Any good suggestions on resources or hints in how to create such a script would help.
Use bash and awk:
cat file | sed -ne 's:^.*INFO.*\[\([0-9, ]*\)\][ \r]*$:\1:p' | awk -F ' *, *' '{ sum2 += $2 ; sum3 += $3 } END { if (NR>0) printf "avg2=%.2f, avg3=%.2f\n", sum2/NR, sum3/NR }'
Sample output (for your original data):
avg2=2859.59, avg3=149.94
Of course, you do not need to use cat; it is included there for legibility and to illustrate the fact that input data can come from any pipe. If you have to operate on an existing file, run sed -ne '...' file | ... directly.
EDIT
If you have access to gawk (GNU awk), you can eliminate the need for sed as follows:
cat file | gawk '{ if(match($0, /.*INFO.*\[([0-9, ]*)\][ \r]*$/, a)) { cnt++; split(a[1], b, / *, */); sum2+=b[2]; sum3+=b[3] } } END { if (cnt>0) printf "avg2=%.2f, avg3=%.2f\n", sum2/cnt, sum3/cnt }'
Same remarks re. cat apply.
A bit of explanation:
sed only prints out lines (the -n ... :p combination) that match the regular expression (lines containing INFO followed by any combination of digits, spaces and commas between square brackets at the end of the line, allowing for trailing spaces and CR); if a line matches, only what's between the square brackets (\1, corresponding to what's between \(...\) in the regular expression) is kept before printing (:p)
sed will output lines that look like: 8541, 931, 0, 0
awk uses a comma surrounded by 0 or more spaces (-F ' *, *') as the field delimiter; $1 corresponds to the first column (e.g. 8541), $2 to the second etc. Missing columns count as value 0
at the end, awk divides the accumulators sum2 etc. by the number of records processed, NR
gawk does everything in one shot; it first tests whether each line matches the same regular expression passed to sed in the previous example (except that, unlike sed, awk does not require a \ in front of the round brackets delimiting the areas of interest). If the line matches, what's between the round brackets ends up in a[1], which we then split using the same separator (a comma surrounded by any number of spaces) and use to accumulate. I introduced cnt instead of continuing to use NR because the number of records processed (NR) may be larger than the actual number of relevant records (cnt) if not all lines are of the form INFO ... [...comma-separated-numbers...]; this was not an issue with sed|awk, since sed guaranteed that all lines passed on to awk were relevant.
Posting the reply I pasted to you over IM here too, just because it makes me try StackOverflow out :)
# replace $2 with the column you want to avg;
awk '{ print $2 }' < log | perl -ne 'END{ printf "%.2f\n", $total/$n }; chomp; $total += $_; $n++'
Use nawk or /usr/xpg4/bin/awk on Solaris.
awk -F'[],]' 'END {
  print s/NR, t/ct
}
{
  s += $(NF-3)        # 2nd number inside the brackets
  if ($(NF-1)) {      # only when the last number is not 0
    t += $(NF-2)      # 3rd number inside the brackets
    ct++
  }
}' infile
Use Python
logfile= open( "somelogfile.log", "r" )
sum2, count2= 0, 0
sum3, count3= 0, 0
for line in logfile:
# find right-most brackets
_, bracket, fieldtext = line.rpartition('[')
datatext, bracket, _ = fieldtext.partition(']')
# split fields and convert to integers
data = map( int, datatext.split(',') )
# compute sums and counts
sum2 += data[1]
count2 += 1
if data[3] != 0:
sum3 += data[2]
count3 += 1
logfile.close()
print sum2, count2, float(sum2)/count2
print sum3, count3, float(sum3)/count3
