Meaning of insert() semantic function in attribute grammar - parsing

I am currently trying to draw the attribute flow for this attribute grammar.
decl → ID decl tail
decl.t := decl tail.t
decl tail.in tab := insert (decl.in tab, ID.n, decl tail.t)
decl.out tab := decl tail.out tab
decl tail → , decl
decl tail.t := decl.t
decl.in tab := decl tail.in tab
decl tail.out tab := decl.out tab
decl tail → : ID ;
decl tail.t := ID.n
decl tail.out tab := decl tail.in tab
But I do not understand what insert (decl.in tab, ID.n, decl tail.t) means.
My first assumption was it would be something similar to the insert() function in Python.
But as far as I know, Python's insert() takes two parameters, but in this attribute grammar it takes three parameters decl.in tab, ID.n, decl tail.t so my original assumption is clearly wrong here.
I am quite new to compiler designs and I have a hard time figuring out the meanings of some semantic functions I've never seen before. (e.g ReduceTo())
What does this insert (decl.in tab, ID.n, decl tail.t) mean?
Is there a list of semantic functions like this that I need to know or memorize?

At a guess, the intent is to add the declared variable with the specified type into a symbol table ("tab").
For the purposes of analysing attribute flow, you don't really need to know what insert does, however. You only need to know that it requires certain values (its arguments), so those attributes must be available in order to compute the attribute assigned from the return value of the function.
There are no predefined semantic functions for attribute grammars, so you don't need to worry about memorizing a list of them. The particular semantic functions used in a particular attribute grammar application will be specified by that application itself.

Related

Faulty bison reduction using %glr-parser and %merge rules

Currently I'm trying to build a parser for VHDL which
has some of the problems C++-Parsers have to face.
The context-free grammar of VHDL produces a parse
forest rather than a single parse tree because of it's
ambiguity regarding function calls and array subscriptions
foo := fun(3) + arr(5);
This assignment can only be parsed unambiguous if the parser
would carry around a hirachically, type-aware symbol table
which it'd use to resolve the ambiguities somewhat on-the-fly.
I don't want to do this, because for statements like the
aforementioned, the parse forest would not grow exponentially, but
rather linear depending on the amount of function calls and
array subscriptions.
(Except, of course, one would torture the parser with statements like)
foo := fun(fun(fun(fun(fun(4)))));
Since bison forces the user to just create one single parse-tree,
I used %merge attributes to collect all subtrees recursively and
added those subtrees under so called AMBIG nodes in the singleton
AST.
The result looks like this.
In order to produce the above, I parsed the token stream "I=I(N);".
The substance of the grammar I used inside the parse.y file, is
collected below. It tries to resemble the ambiguous parts of VHDL:
start: prog
;
/* I cut out every semantic action to make this
text more readable */
prog: assignment ';'
| prog assignment ';'
;
assignment: 'I' '=' expression
;
expression: function_call %merge <stmtmerge2>
| array_indexing %merge <stmtmerge2>
| 'N'
;
function_call: 'I' '(' expression ')'
| 'I'
;
array_indexing: function_call '(' expression ')' %merge <stmtmerge>
| 'I' '(' expression ')' %merge <stmtmerge>
;
The whole sourcecode can be read at this github repository.
And now, let's get down to the actual Problem.
As you can see in the generated parse tree above,
the nodes FCALL1 and ARRIDX1 refer to the same
single node EXPR1 which in turn refers to N1 twice.
This, by all means, should not have happened and I don't
know why. Instead there should be the paths
FCALL1 -> EXPR2 -> N2
ARRIDX1 -> EXPR1 -> N1
Do you have any idea why bison reuses the aforementioned
nodes?
I also wrote a bugreport on the official gnu mailing
list for bison, without a reply to this point though.
Unfortunately, due to the restictions for new stackoverflow
users, I can't provide no link to this bug report...
That behaviour is expected.
expression can be unambiguously reduced, and that reduced value is used by both possible ambiguous reductions which include the value. Remember that GLR, like LR, is a left-to-right parser. When a reduction action is executed, all of the child reductions have already happened. The effect is not different from the use of a terminal in a right-hand side; the terminal will not be artificially copied in order to produce different instances in the ambiguous productions which use it.
For most people, this would be a feature rather than a bug, and I don't mean that as a joke. Without the graph-structured stack, GLR has exponential run-time. If you really want to do a deep copy of shared AST nodes when you merge parse trees, you will have to do it yourself, but I suggest that you find a way to make use of the fact that the parse forest is really an directed acyclic graph rather than a tree; you will probably be able to take advantage of the lack of duplication.

Coq: typeclasses vs dependent records

I can't understand the difference between typeclasses and dependent records in Coq. The reference manual gives the syntax of typeclasses, but says nothing about what they really are and how should you use them. A bit of thinking and searching reveals that typeclasses essentially are dependent records with a bit of syntactic sugar that allows Coq to automatically infer some implicit instances and parameters. It seems that the algorithm for typeclasses works better when there is more or a less only one possible instance of it in any given context, but that's not a big issue since we can always move all fields of typeclass to its parameters, removing ambiguity. Also the Instance declaration is automatically added to the Hints database which can often ease the proofs but will also sometimes break them, if the instances were too general and caused proof search loops or explosions. Are there any other issues I should be aware of? What is the heuristic for choosing between the two? E.g. would I lose anything if I use only records and set their instances as implicit parameters whenever possible?
You are right: type classes in Coq are just records with special plumbing and inference (there's also the special case of single-method type classes, but it doesn't really affect this answer in any way). Therefore, the only reason you would choose type classes over "pure" dependent records is to benefit from the special inference that you get with them: inference with plain dependent records is not very powerful and doesn't allow you to omit much information.
As an example, consider the following code, which defines a monoid type class, instantiating it with natural numbers:
Class monoid A := Monoid {
op : A -> A -> A;
id : A;
opA : forall x y z, op x (op y z) = op (op x y) z;
idL : forall x, op id x = x;
idR : forall x, op x id = x
}.
Require Import Arith.
Instance nat_plus_monoid : monoid nat := {|
op := plus;
id := 0;
opA := plus_assoc;
idL := plus_O_n;
idR := fun n => eq_sym (plus_n_O n)
|}.
Using type class inference, we can use any definitions that work for any monoid directly with nat, without supplying the type class argument, e.g.
Definition times_3 (n : nat) := op n (op n n).
However, if you make the above definition into a regular record by replacing Class and Instance by Record and Definition, the same definition fails:
Toplevel input, characters 38-39: Error: In environment n : nat The term "n" has type "nat" while it is expected to have type "monoid ?11".
The only caveat with type classes is that the instance inference engine gets a bit lost sometimes, causing hard-to-understand error messages to appear. That being said, it's not really a disadvantage over dependent records, given that this possibility isn't even available there.

Defining Function Signatures in a Simple Language Grammar

I am currently learning how to create a simple expression language using Irony. I'm having a little bit of trouble figuring out the best way to define function signatures, and determining whose responsibility it is to validate the input to those functions.
So far, I have a simple grammar that defines the basic elements of my language. This includes a handful of binary operators, parentheses, numbers, identifiers, and function calls. The BNF for my grammar looks something like this:
<expression> ::= <number> | <parenexp> | <binexp> | <fncall> | <identifier>
<parenexp> ::= ( <expression> )
<fncall> ::= <identifier> ( <argumentlist> )
<binexp> ::= <expression> <binop> <expression>
<binop> ::= + - * / %
... the rest of the grammar definition
Using the Irony parser, I am able to validate the syntax of various input strings to make sure they conform to this grammar:
x + y / z * AVG(a + b, p) -> Valid Syntax
x +/ AVG(x -> Invalid Syntax
All that is well and good, but now I want to go a step further and define the available functions, along with the number of parameters that each function requires. So for example, I want to have a function FOO that accepts one parameter and BAR that accepts two parameters:
FOO(a + b) * BAR(x + y, p + q) -> Valid
FOO(a + b, 13) -> Invalid
When the second statement is parsed, I'd like to be able to output an error message that is aware of the expected input for this function:
Too many arguments specified for function 'FOO'
I don't actually need to evaluate any of these statements, only validate the syntax of the statements and determine if they are valid expressions or not.
How exactly should I be doing this? I know that technically I could simply add the functions to the grammar like so:
<foofncall> ::= FOO( <expression> )
<barfncall> ::= BAR( <expression>, <expression> )
But something about this doesn't feel quite right. To me it seems like the grammar should only define a generic function call, and not every function available to the language.
How is this typically accomplished in other languages?
What are the components called that should handle the responsibilities of analyzing the basic syntax of the language grammar versus the more specific elements like function definitions? Should both responsibilities be handled by the same component?
While you can do typechecking in directly in the grammar so its enforced in the parser, its generally a bad idea to do so. Instead, the parser should just parse the basic syntax, and separate typechecking code should be used for typechecking.
In the normal case of a compiler, the parser just produces an abstract syntax tree or some equivalent representation of the program. Then, a typechecking pass is run over the AST that ensures all types match appropriately -- ensures that functions have the right number of arguments and those arguments have the right type, as well as ensuring that variables have the right type for what is assigned to them and how they are used.
Besides being generally simpler, this usually allows you to give better error messages -- instead of just 'Invalid', you can say 'too many arguments to FOO' or what have you.

Producing Expressions from This Grammar with Recursive Descent

I've got a simple grammar. Actually, the grammar I'm using is more complex, but this is the smallest subset that illustrates my question.
Expr ::= Value Suffix
| "(" Expr ")" Suffix
Suffix ::= "->" Expr
| "<-" Expr
| Expr
| epsilon
Value matches identifiers, strings, numbers, et cetera. The Suffix rule is there to eliminate left-recursion. This matches expressions such as:
a -> b (c -> (d) (e))
That is, a graph where a goes to both b and the result of (c -> (d) (e)), and c goes to d and e. I'm trying to produce an abstract syntax tree for these expressions, but I'm running into difficulty because all of the operators can accept any number of operands on each side. I'd rather keep the logic for producing the AST within the recursive descent parsing methods, since it avoids having to duplicate the logic of extracting an expression. My current strategy is as follows:
If a Value appears, push it to the output.
If a From or To appears:
Output a separator.
Get the next Expr.
Create a Link node.
Pop the first set of operands from output into the Link until a separator appears.
Erase the separator discovered.
Pop the second set of operands into the Link until a separator.
Push the Link to the output.
If I run this through without obeying steps 2.3–2.7, I get a list of values and separators. For the expression quoted above, a -> b (c -> (d) (e)), the output should be:
A sep_1 B sep_2 C sep_3 D E
Applying the To rule would then yield:
A sep_1 B sep_2 (link from C to {D, E})
And subsequently:
(link from A to {B, (link from C to {D, E})})
The important thing to note is that sep_2, crucial to delimit the left-hand operands of the second ->, does not appear, so the parser believes that the expression was actually written:
a -> (b c -> (d) (e))
In order to solve this with my current strategy, I would need a way to produce a separator between adjacent expressions, but only if the current expression is a From or To expression enclosed in parentheses. If that's possible, then I'm just not seeing it and the answer ought to be simple. If there's a better way to go about this, however, then please let me know!
I haven't tried to analyze it in detail, but: "From or To expression enclosed in parentheses" starts to sound a lot like "context dependent", which recursive descent can't handle directly. To avoid context dependence you'll probably need a separate production for a From or To in parentheses vs. a From or To without the parens.
Edit: Though it may be too late to do any good, if my understanding of what you want to match is correct, I think I'd write it more like this:
Graph :=
| List Sep Graph
;
Sep := "->"
| "<-"
;
List :=
| Value List
;
Value := Number
| Identifier
| String
| '(' Graph ')'
;
It's hard to be certain, but I think this should at least be close to matching (only) the inputs you want, and should make it reasonably easy to generate an AST that reflects the input correctly.

What are FIRST and FOLLOW sets used for in parsing?

What are FIRST and FOLLOW sets? What are they used for in parsing?
Are they used for top-down or bottom-up parsers?
Can anyone explain me FIRST and FOLLOW SETS for the following set of grammar rules:
E := E+T | T
T := T*V | T
V := <id>
They are typically used in LL (top-down) parsers to check if the running parser would encounter any situation where there is more than one way to continue parsing.
If you have the alternative A | B and also have FIRST(A) = {"a"} and FIRST(B) = {"b", "a"} then you would have a FIRST/FIRST conflict because when "a" comes next in the input you wouldn't know whether to expand A or B. (Assuming lookahead is 1).
On the other side if you have a Nonterminal that is nullable like AOpt: ("a")? then you have to make sure that FOLLOW(AOpt) doesn't contain "a" because otherwise you wouldn't know if to expand AOpt or not like here: S: AOpt "a" Either S or AOpt could consume "a" which gives us a FIRST/FOLLOW conflict.
FIRST sets can also be used during the parsing process for performance reasons. If you have a nullable nonterminal NullableNt you can expand it in order to see if it can consume anything, or it may be faster to check if FIRST(NullableNt) contains the next token and if not simply ignore it (backtracking vs predictive parsing). Another performance improvement would be to additionally provide the lexical scanner with the current FIRST set, so the scanner does not try all possible terminals but only those that are currently allowed by the context. This conflicts with reserved terminals but those are not always needed.
Bottom up parsers have different kinds of conflicts namely Reduce/Reduce and Shift/Reduce. They also use item sets to detect conflicts and not FIRST,FOLLOW.
Your grammar would't work with LL-parsers because it contains left recursion. But the FIRST sets for E, T and V would be {id} (assuming your T := T*V | T is meant to be T := T*V | V).
Answer :
E->E+T|T
left recursion
E->TE'
E'->+TE'|eipsilon
T->T*V|T
left recursion
T->VT'
T'->*VT'|epsilon
no left recursion in
V->(id)
Therefore the grammar is:
E->TE'
E'->+TE'|epsilon
T->VT'
T'->*VT'|epsilon
V-> (id)
FIRST(E)={(}
FIRST(E')={+,epsilon}
FIRST(T)={(}
FIRST(T')={*,epsilon}
FIRST(V)={(}
Starting Symbol=FOLLOW(E)={$}
E->TE',E'->TE'|epsilon:FOLLOW(E')=FOLLOW(E)={$}
E->TE',E'->+TE'|epsilon:FOLLOW(T)=FIRST(E')={+,$}
T->VT',T'->*VT'|epsilon:FOLLOW(T')=FOLLOW(T)={+,$}
T->VT',T->*VT'|epsilon:FOLLOW(V)=FIRST(T)={ *,epsilon}
Rules for First Sets
If X is a terminal then First(X) is just X!
If there is a Production X → ε then add ε to first(X)
If there is a Production X → Y1Y2..Yk then add first(Y1Y2..Yk) to first(X)
First(Y1Y2..Yk) is either
First(Y1) (if First(Y1) doesn't contain ε)
OR (if First(Y1) does contain ε) then First (Y1Y2..Yk) is everything in First(Y1) except for ε as well as everything in First(Y2..Yk)
If First(Y1) First(Y2)..First(Yk) all contain ε then add ε to First(Y1Y2..Yk) as well.
Rules for Follow Sets
First put $ (the end of input marker) in Follow(S) (S is the start symbol)
If there is a production A → aBb, (where a can be a whole string) then everything in FIRST(b) except for ε is placed in FOLLOW(B).
If there is a production A → aB, then everything in FOLLOW(A) is in FOLLOW(B)
If there is a production A → aBb, where FIRST(b) contains ε, then everything in FOLLOW(A) is in FOLLOW(B)
Wikipedia is your friend. See discussion of LL parsers and first/follow sets.
Fundamentally they are used as the basic for parser construction, e.g., as part of parser generators. You can also use them to reason about properties of grammars, but most people don't have much of a need to do this.

Resources