I have a quite old C corporate parser code that was generated from an ancient Yacc and uses the yyact, yypact, yypgo, yyr1, yyr2, yytoks, yyexca, yychk, yydef tables (but no yyreds though) and the original grammar source is lost. That legacy piece of code need revamping but I cannot afford to recode it from scratch.
Could it be possible to mechanically retrieve / regenerate the parsing rules by deduction of the parsing tables in order to reconstruct the grammar?
Example with a little expression parser that I can process with the same ancient Yacc:
yytabelem yyexca[] ={
-1, 1,
0, -1,
-2, 0,
-1, 21,
261, 0,
-2, 8,
};
yytabelem yyact[]={
13, 9, 10, 11, 12, 23, 8, 22, 13, 9,
10, 11, 12, 9, 10, 11, 12, 1, 2, 11,
12, 6, 7, 4, 3, 0, 16, 5, 0, 14,
15, 0, 0, 0, 17, 18, 19, 20, 21, 0,
0, 24 };
yytabelem yypact[]={
-248, -1000, -236, -261, -236, -236, -1000, -1000, -248, -236,
-236, -236, -236, -236, -253, -1000, -263, -245, -245, -1000,
-1000, -249, -1000, -248, -1000 };
yytabelem yypgo[]={
0, 17, 24 };
yytabelem yyr1[]={
0, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2 };
yytabelem yyr2[]={
0, 8, 12, 0, 6, 6, 6, 6, 6, 6,
4, 2, 2 };
yytabelem yychk[]={
-1000, -1, 266, -2, 259, 263, 257, 258, 267, 262,
263, 264, 265, 261, -2, -2, -1, -2, -2, -2,
-2, -2, 260, 268, -1 };
yytabelem yydef[]={
3, -2, 0, 0, 0, 0, 11, 12, 3, 0,
0, 0, 0, 0, 0, 10, 1, 4, 5, 6,
7, -2, 9, 3, 2 };
yytoktype yytoks[] =
{
"NAME", 257,
"NUMBER", 258,
"LPAREN", 259,
"RPAREN", 260,
"EQUAL", 261,
"PLUS", 262,
"MINUS", 263,
"TIMES", 264,
"DIVIDE", 265,
"IF", 266,
"THEN", 267,
"ELSE", 268,
"LOW", 269,
"UMINUS", 270,
"-unknown-", -1 /* ends search */
};
/* am getting this table in my example,
but it is not present in the studied parser :^( */
char * yyreds[] =
{
"-no such reduction-",
"stmt : IF exp THEN stmt",
"stmt : IF exp THEN stmt ELSE stmt",
"stmt : /* empty */",
"exp : exp PLUS exp",
"exp : exp MINUS exp",
"exp : exp TIMES exp",
"exp : exp DIVIDE exp",
"exp : exp EQUAL exp",
"exp : LPAREN exp RPAREN",
"exp : MINUS exp",
"exp : NAME",
"exp : NUMBER",
};
I am looking to retrieve
stmt : IF exp THEN stmt
| IF exp THEN stmt ELSE stmt
| /*others*/
;
exp : exp PLUS exp
| exp MINUS exp
| exp TIMES exp
| exp DIVIDE exp
| exp EQUAL exp
| LPAREN exp RPAREN
| MINUS exp
| NAME
| NUMBER
;
Edit: I have stripped down the generated parser of my example for clarity, but to help some analysis i have published the whole generated code as a gist. Please not that for some unknown reason there is no yyreds table in the parser I am trying to study / change. I suppose it would not have been fun :^S
An interesting problem. Just from matching the tables to the grammar, it seems that yyr1 and yyr2 give you the "outline" of the rules -- yyr1 is the symbol on the left side of each rule, while yyr2 is 2x the number of symbols on the right side. You also have the names of all the terminals in a convenient table. But the names of the non-terminals are lost.
To figure out which symbols go on the rhs of each rule, you'll need to reconstruct the state machine from the tables, which likely involves reading and understanding the code in the y.tab.c file that actually does the parsing. Some of the tables (looks like yypact, yychk and yydef) are indexed by state number. It seems likely that yyact is indexed by yypact[state] + token. But those are only guesses. You need to look at the parsing code and understand how its using the tables to encode possible shifts, reduces, and gotos.
Once you have the state machine, you can backtrack from the states containing reductions of specific rules through the states that have shifts and gotos of that rule. A shift into a reduction state means the last symbol on the rhs of that rule is the token shifted. A goto into a reduction state means the last symbol on the rhs is symbol for the goto. The second-to-last symbol comes from the shift/goto to the state that does the shift/goto to the reduction state, and so on.
edit
As I surmised, yypact is the 'primary action' for a state. If the value is YYFLAG (-1000), this is a reduce-only state (no shifts). Otherwise it is a potential shift state and yyact[yypact[state] + token] gives you the potential state to shift to. If yypact[state] + token is out of range for the yyact table, or the token doesn't match the entry symbol for that state, then there's no shift on that token.
yychk is the entry symbol for each state -- a positive number means you shift to that state on that token, while a negative means you goto that state on that non-terminal.
yydef is the reduction for that state -- a positive number means reduce that rule, while 0 means no reduction, and -2 means two or more possible reductions. yyexca is the table of reductions for those states with more than one reduction. The pair -1 state means the following entries are for the given state; following pairs of token rule mean that for lookahead token it should reduce rule. A -2 for token is a wildcard (end of the list), while a 0 for the rule means no rule to reduce (an error instead), and -1 means accept the input.
The yypgo table is the gotos for a symbol -- you go to state yyact[yypgo[sym] + state + 1] if that's in range for yyact and yyact[yypgo[sym]] otherwise.
So to reconstruct rules, look at the yydef and yyexca tables to see which states reduce each rule, and go backwards to see how the state is reached.
For example, rule #1. From the yyr1 and yyr2 tables, we know its of the form S1: X X X X -- non-terminal #1 on the lhs and 4 symbols on the rhs. Its reduced in state 16 (from the yydef table, and the accessing symbol for state 16 (from yychk) is -1. So its:
S1: ?? ?? ?? S1
You get into state 16 from yyact[26], and yypgo[1] == 17, so that means the goto is coming from state 8 (26 == yypgo[1] + 8 + 1. The accessing symbol of state 8 is 267 (THEN) so now we have:
S1: ?? ?? THEN S1
You get into state 8 from yyact[6], so the previous state has yypact[state] == -261 which is state 3. yychk[3] == -2, so we have:
S1: ?? S2 THEN S1
You get into state 3 from yyact[24], and yypgo[2] == 24 so any state might goto 3 here. So we're now kind of stuck for this rule; to figure out what the first symbol is, we need to work our way forward from state 0 (the start state) to reconstruct the state machine.
edit
This code will decode the state machine from the table format above and print out all the shift/reduce/goto actions in each state:
#define ALEN(A) (sizeof(A)/sizeof(A[0]))
for (int state = 0; state < ALEN(yypact); state++) {
printf("state %d:\n", state);
for (int i = 0; i < ALEN(yyact); i++) {
int sym = yychk[yyact[i]];
if (sym > 0 && i == yypact[state] + sym)
printf("\ttoken %d shift state %d\n", sym, yyact[i]);
if (sym < 0 && -sym < ALEN(yypgo) &&
(i == yypgo[-sym] || i == yypgo[-sym] + state + 1))
printf("\tsymbol %d goto state %d\n", -sym, yyact[i]); }
if (yydef[state] > 0)
printf("\tdefault reduce rule %d\n", yydef[state]);
if (yydef[state] < 0) {
for (int i = 0; i < ALEN(yyexca); i+= 2) {
if (yyexca[i] == -1 && yyexca[i+1] == state) {
for (int j = i+2; j < ALEN(yyexca) && yyexca[j] != -1; j += 2) {
if (yyexca[j] < 0) printf ("\tdefault ");
else printf("\ttoken %d ", yyexca[j]);
if (yyexca[j+1] < 0) printf("accept\n");
else if(yyexca[j+1] == 0) printf("error\n");
else printf("reduce rule %d\n", yyexca[j+1]); } } } } }
It will produce output like:
state 0:
symbol 1 goto state 1
token 266 shift state 2
symbol 2 goto state 3
default reduce rule 3
state 1:
symbol 1 goto state 1
symbol 2 goto state 3
token 0 accept
default error
state 2:
symbol 1 goto state 1
token 257 shift state 6
token 258 shift state 7
token 259 shift state 4
symbol 2 goto state 3
token 263 shift state 5
state 3:
token 261 shift state 13
token 262 shift state 9
token 263 shift state 10
token 264 shift state 11
token 265 shift state 12
token 267 shift state 8
symbol 1 goto state 1
symbol 2 goto state 3
..etc
which should be helpful for reconstructing the grammar.
Related
Input file:
6 31236622 HLA_C*05:01:01:01 A T . PASS AF=0.07724;MAF=0.07724;R2=0.98466;IMPUTED GT:DS:HDS:GP 1|0:0.999:0.999,0.000:0.001,0.999,0.000 0|0:0:0,0:1,0,0 1|1:1.994:0.995,1.000:0.000,0.006,0.994
6 29910248 HLA_A*01:01 A T . PASS AF=0.15969;MAF=0.15969;R2=0.97333;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 1|0:1.000:1.000,0.000:0.000,1.000,0.000 0|0:0:0,0:1,0,0
6 31322134 HLA_B*55:01 A T . PASS AF=0.01091;MAF=0.01091;R2=0.94511;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0
6 31322132 HLA_B*55 A T . PASS AF=0.01091;MAF=0.01091;R2=0.94485;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0
6 31322006 HLA_B*44:02:01:01 A T . PASS AF=0.08074;MAF=0.08074;R2=0.97706;IMPUTED GT:DS:HDS:GP 1|0:0.999:0.999,0.000:0.001,0.999,0.000 0|0:0:0,0:1,0,0 1|1:1.997:0.998,0.999:0.000,0.003,0.997
I want to parse a specific number from each column after the "GT:DS:HDS:GP" column, specifically, the numbers after "x|x:". So desired output is:
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
To parse the desired values from (e.g.) line 4, I can use:
awk -F: '{for (i=5; i<=NF; i+=3) printf "%s%s", $i, (i+3 <= NF ? ", " : ORS)}'
Line 5 would require:
awk -F: '{for (i=9; i<=NF; i+=3) printf "%s%s", $i, (i+3 <= NF ? ", " : ORS)}'
So the problem with the input file is that column 3 (space delimited) contains a variable number of colons, which makes colons a poor delimiter for this particular input file (but the desired values are surrounded by colons!)
I though about using "|" as delimiter, with substr($i,3,?), but the desired values have an inconsistent number of digits (hence the "?").
Is there a flexible awk code to get the desired output?
You may try this awk:
awk -v OFS=', ' '$9 == "GT:DS:HDS:GP" {for (i=10; i<=NF; ++i) if ($i ~ /^[0-9]+\|[0-9]+:/ && split($i, a, /:/)) printf "%s", (i == 10 ? "" : OFS) a[2]; print ""}' file
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
An expanded form:
awk -v OFS=', ' '
$9 == "GT:DS:HDS:GP" {
for (i=10; i<=NF; ++i)
if ($i ~ /^[0-9]+\|[0-9]+:/ && split($i, a, /:/))
printf "%s", (i == 10 ? "" : OFS) a[2]
print ""
}' file
Why do you care about the space-delimited columns at all?
awk '{ sub(/.* GT:DS:HDS:GP */, "");
i = split($0, n, /[0-9]\|[0-9]:/);
sep = "";
for(x=2; x<=i; x++) {
sub(/:.*/, "", n[x]); printf("%s%s", sep, n[x]); sep=", " }
printf "\n"; }' file
We successively pick apart each line, first by removing everything through GT:DS:HDS:GP from the line, then by splitting the remaining string into n on the specified delimiter, and then cleaning up the resulting fields by removing everything after the first colon in each, and printing the result. (We skip the first one, which only contains the useless short or empty string before the first delimiter.)
Output for your sample:
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
I have no idea what these fields stands for so I just picked single-letter variable names; you can probably improve the readability by giving these variables more descriptive names.
I'm trying to recreate the wiki's example procedure, available here:
https://en.wikipedia.org/wiki/NTRUEncrypt
I've run into an issue while attempting to invert the polynomials.
The SAGE code below seems to be working fine for the given p=3, which is a prime number.
However, the representation of the polynomial in the field generated by q=32 ends up wrong, because it behaves as if the modulus was 2.
Here's the code in play:
F = PolynomialRing(GF(32),'a')
a = F.gen()
Ring = F.quotient(a^11 - 1, 'x')
x = Ring.gen()
pollist = [-1, 1, 1, 0, -1, 0, 1, 0, 0, 1, -1]
fq = Ring(pollist)
print(fq)
print(fq^(-1))
The Ring is described as follows:
Univariate Quotient Polynomial Ring in x over Finite Field in z5 of size 2^5 with modulus a^11 + 1
And the result:
x^10 + x^9 + x^6 + x^4 + x^2 + x + 1
x^5 + x + 1
I've tried to replace the Finite Field with IntegerModRing(32), but the inversion ends up demanding a field, as implied by the message:
NotImplementedError: The base ring (=Ring of integers modulo 32) is not a field
Any suggestions as to how I could obtain the correct inverse of f (mod q) would be greatly appreciated.
GF(32) is the finite field with 32 elements, not the integers modulo 32. You must use Zmod(32) (or IntegerModRing(32), as you suggested) instead.
As you point out, Sage psychotically bans you from computing inverses in ℤ/32ℤ[a]/(a¹¹-1) because that is not a field, and not even a factorial ring. It can, however, compute those inverses when they exist, only you must ask more kindly:
sage: F.<a> = Zmod(32)[]
sage: fq = F([-1, 1, 1, 0, -1, 0, 1, 0, 0, 1, -1])
sage: print(fq)
31*a^10 + a^9 + a^6 + 31*a^4 + a^2 + a + 31
sage: print(fq.inverse_mod(a^11 - 1))
16*a^8 + 4*a^7 + 10*a^5 + 28*a^4 + 9*a^3 + 13*a^2 + 21*a + 1
Not ideal, admittedly.
When I write a proc in Tcl, which return value is actually the result of another proc I can do either of the following (see implicit example):
proc foo args {
...
...
bar $var1
}
Or I could do (see explicit example):
proc foo args {
...
...
return [ bar var1 ]
}
From an interface perspective, that is input vs. output, the two are identical.
Are they, internally?Or is there some added benefit to implicit vs. explicit return?
Thanks.
In Tcl 8.6 you can inspect the bytecode to see how such procedures compare.
If we define a pair of implementations of 'sum' and then examine them using tcl::unsupported::disassemble we can see that using the return statement or not results in the same bytecode.
% proc sum_a {lhs rhs} {expr {$lhs + $rhs}}
% proc sum_b {lhs rhs} {return [expr {$lhs + $rhs}]}
% ::tcl::unsupported::disassemble proc sum_a
ByteCode 0x03C5E8E8, refCt 1, epoch 15, interp 0x01F68CE0 (epoch 15)
Source "expr {$lhs + $rhs}"
Cmds 1, src 18, inst 6, litObjs 0, aux 0, stkDepth 2, code/src 0.00
Proc 0x03CC33C0, refCt 1, args 2, compiled locals 2
slot 0, scalar, arg, "lhs"
slot 1, scalar, arg, "rhs"
Commands 1:
1: pc 0-4, src 0-17
Command 1: "expr {$lhs + $rhs}"
(0) loadScalar1 %v0 # var "lhs"
(2) loadScalar1 %v1 # var "rhs"
(4) add
(5) done
% ::tcl::unsupported::disassemble proc sum_b
ByteCode 0x03CAD140, refCt 1, epoch 15, interp 0x01F68CE0 (epoch 15)
Source "return [expr {$lhs + $rhs}]"
Cmds 2, src 27, inst 6, litObjs 0, aux 0, stkDepth 2, code/src 0.00
Proc 0x03CC4B80, refCt 1, args 2, compiled locals 2
slot 0, scalar, arg, "lhs"
slot 1, scalar, arg, "rhs"
Commands 2:
1: pc 0-5, src 0-26 2: pc 0-4, src 8-25
Command 1: "return [expr {$lhs + $rhs}]"
Command 2: "expr {$lhs + $rhs}"
(0) loadScalar1 %v0 # var "lhs"
(2) loadScalar1 %v1 # var "rhs"
(4) add
(5) done
The return statement is really just documenting that you intended to return this value and it is not just a side-effect. Using return is not necessary but in my opinion it is to be recommended.
Referring to the original problem: Optimizing hand-evaluation algorithm for Poker-Monte-Carlo-Simulation
I have a list of 5 to 7 cards and want to store their value in a hashtable, which should be an array of 32-bit-integers and directly accessed by the hashfunctions value as index.
Regarding the large amount of possible combinations in a 52-card-deck, I don't want to waste too much memory.
Numbers:
7-card-combinations: 133784560
6-card-combinations: 20358520
5-card-combinations: 2598960
Total: 156.742.040 possible combinations
Storing 157 million 32-bit-integer values costs about 580MB. So I would like to avoid increasing this number by reserving memory in an array for values that aren't needed.
So the question is: How could a hashfunction look like, that maps each possible, non duplicated combination of cards to a consecutive value between 0 and 156.742.040 or at least comes close to it?
Paul Senzee has a great post on this for 7 cards (deleted link as it is broken and now points to a NSFW site).
His code is basically a bunch of pre-computed tables and then one function to look up the array index for a given 7-card hand (represented as a 64-bit number with the lowest 52 bits signifying cards):
inline unsigned index52c7(unsigned __int64 x)
{
const unsigned short *a = (const unsigned short *)&x;
unsigned A = a[3], B = a[2], C = a[1], D = a[0],
bcA = _bitcount[A], bcB = _bitcount[B], bcC = _bitcount[C], bcD = _bitcount[D],
mulA = _choose48x[7 - bcA], mulB = _choose32x[7 - (bcA + bcB)], mulC = _choose16x[bcD];
return _offsets52c[bcA] + _table4[A] * mulA +
_offsets48c[ (bcA << 4) + bcB] + _table [B] * mulB +
_offsets32c[((bcA + bcB) << 4) + bcC] + _table [C] * mulC +
_table [D];
}
In short, it's a bunch of lookups and bitwise operations powered by pre-computed lookup tables based on perfect hashing.
If you go back and look at this website, you can get the perfect hash code that Senzee used to create the 7-card hash and repeat the process for 5- and 6-card tables (essentially creating a new index52c7.h for each). You might be able to smash all 3 into one table, but I haven't tried that.
All told that should be ~628 MB (4 bytes * 157 M entries). Or, if you want to split it up, you can map it to 16-bit numbers (since I believe most poker hand evaluators only need 7,462 unique hand scores) and then have a separate map from those 7,462 hand scores to whatever hand categories you want. That would be 314 MB.
Here's a different answer based on the colex function concept. It works with bitsets that are sorted in descending order. Here's a Python implementation (both recursive so you can see the logic and iterative). The main concept is that, given a bitset, you can always calculate how many bitsets there are with the same number of set bits but less than (in either the lexicographical or mathematical sense) your given bitset. I got the idea from this paper on hand isomorphisms.
from math import factorial
def n_choose_k(n, k):
return 0 if n < k else factorial(n) // (factorial(k) * factorial(n - k))
def indexset_recursive(bitset, lowest_bit=0):
"""Return number of bitsets with same number of set bits but less than
given bitset.
Args:
bitset (sequence) - Sequence of set bits in descending order.
lowest_bit (int) - Name of the lowest bit. Default = 0.
>>> indexset_recursive([51, 50, 49, 48, 47, 46, 45])
133784559
>>> indexset_recursive([52, 51, 50, 49, 48, 47, 46], lowest_bit=1)
133784559
>>> indexset_recursive([6, 5, 4, 3, 2, 1, 0])
0
>>> indexset_recursive([7, 6, 5, 4, 3, 2, 1], lowest_bit=1)
0
"""
m = len(bitset)
first = bitset[0] - lowest_bit
if m == 1:
return first
else:
t = n_choose_k(first, m)
return t + indexset_recursive(bitset[1:], lowest_bit)
def indexset(bitset, lowest_bit=0):
"""Return number of bitsets with same number of set bits but less than
given bitset.
Args:
bitset (sequence) - Sequence of set bits in descending order.
lowest_bit (int) - Name of the lowest bit. Default = 0.
>>> indexset([51, 50, 49, 48, 47, 46, 45])
133784559
>>> indexset([52, 51, 50, 49, 48, 47, 46], lowest_bit=1)
133784559
>>> indexset([6, 5, 4, 3, 2, 1, 0])
0
>>> indexset([7, 6, 5, 4, 3, 2, 1], lowest_bit=1)
0
"""
m = len(bitset)
g = enumerate(bitset)
return sum(n_choose_k(bit - lowest_bit, m - i) for i, bit in g)
I need to write a program, which returns a new list from a given list with following criteria.
If list member is negative or 0 it should and that value 3 times to new list. If member is positive it should add value 2 times for that list.
For example :
goal: dt([-3,2,0],R).
R = [-3,-3,-3,2,2,0,0,0].
I have written following code and it works fine for me, but it returns true as result instead of R = [some_values]
My code :
dt([],R):- write(R). % end print new list
dt([X|Tail],R):- X =< 0, addNegavite(Tail,X,R). % add 3 negatives or 0
dt([X|Tail],R):- X > 0, addPositive(Tail,X,R). % add 2 positives
addNegavite(Tail,X,R):- append([X,X,X],R,Z), dt(Tail, Z).
addPositive(Tail,X,R):- append([X,X],R,Z), dt(Tail, Z).
Maybe someone know how to make it print R = [] instead of true.
Your code prepares the value of R as it goes down the recursing chain top-to-bottom, treating the value passed in as the initial list. Calling dt/2 with an empty list produces the desired output:
:- dt([-3,2,0],[]).
Demo #1 - Note the reversed order
This is, however, an unusual way of doing things in Prolog: typically, R is your return value, produced in the other way around, when the base case services the "empty list" situation, and the rest of the rules grow the result from that empty list:
dt([],[]). % Base case: empty list produces an empty list
dt([X|Like],R):- X =< 0, addNegavite(Like,X,R).
dt([X|Like],R):- X > 0, addPositive(Like,X,R).
% The two remaining rules do the tail first, then append:
addNegavite(Like,X,R):- dt(Like, Z), append([X,X,X], Z, R).
addPositive(Like,X,R):- dt(Like, Z), append([X,X], Z, R).
Demo #2
Why do you call write inside your clauses?
Better don't have side-effects in your clauses:
dt([], []).
dt([N|NS], [N,N,N|MS]) :-
N =< 0,
dt(NS, MS).
dt([N|NS], [N,N|MS]) :-
N > 0,
dt(NS, MS).
That will work:
?- dt([-3,2,0], R).
R = [-3, -3, -3, 2, 2, 0, 0, 0] .
A further advantage of not invoking functions with side-effects in clauses is that the reverse works, too:
?- dt(R, [-3, -3, -3, 2, 2, 0, 0, 0]).
R = [-3, 2, 0] .
Of cause you can invoke write outside of your clauses:
?- dt([-3,2,0], R), write(R).
[-3,-3,-3,2,2,0,0,0]
R = [-3, -3, -3, 2, 2, 0, 0, 0] .