ANTLR templated rules - parsing

I'm using ANTLR 4 to parse an SQL subset and currently, I'm faced with a problem.
I need a rule for predicate with the following structure:
predicate
:
expr relation expr
| between_clause
| predicate OR predicate
| predicate AND predicate
| '(' predicate ')'
;
The issue here is the expr rule. There are different types of predicates where expr should be different, but the whole aforementioned structure is preserved.
I would like to parametrize the predicate rule in some way to automatically instantiate several rules for different predicate types (the rule above where specific expr types are substituted).
Is it possible in ANTLR 4?
P.S. I see two alternative options:
Copy-paste predicate rule and substitute different expr rules manually
Have all possible alternatives in the expr rule and verify that only one type of them is used for a specific predicate in the target language code
But both of them look pretty bad: the first case leads to a combinatorial explosion, and the second one leads to accepting a lot of incorrect predicates on the grammar level.

Is it possible in ANTLR 4?
No, ANTLR does not have such a feature.
Your two alternatives are exactly how these things are commonly done in practice. Which one is preferable depends on how exactly the different replacements for expr differ.
the second one leads to accepting a lot of incorrect predicates on the grammar level.
There will always be types of errors that you can't prevent in the grammar and that have to be handled in a separate phase. Type errors or undeclared-variable errors, for example. So having the grammar accept incorrect programs isn't unusual at all.
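To illustrate the second alternative (one permissive expr rule plus a later check), here is a minimal sketch of the kind of post-parse validation you could write in the target language. Everything in it is hypothetical: the tuple-based node shapes, the predicate and expression kind names, and the ALLOWED table are invented for illustration and are not tied to any actual grammar or to the ANTLR API.

# Hypothetical post-parse check: each predicate kind only tolerates certain expr kinds.
# Nodes are plain tuples: ("predicate", kind, children) or ("expr", kind, children).

ALLOWED = {
    "numeric_predicate": {"numeric_expr"},
    "string_predicate": {"string_expr"},
}

def collect_exprs(node):
    tag, kind, children = node
    if tag == "expr":
        return [node]
    exprs = []
    for child in children:
        if child[0] == "predicate":
            continue  # nested predicates are validated on their own
        exprs.extend(collect_exprs(child))
    return exprs

def check_predicate(node, errors):
    tag, kind, children = node
    if tag == "predicate":
        allowed = ALLOWED.get(kind, set())
        for expr in collect_exprs(node):
            if expr[1] not in allowed:
                errors.append(f"{expr[1]} is not allowed in a {kind}")
    for child in children:
        check_predicate(child, errors)

errors = []
tree = ("predicate", "numeric_predicate",
        [("expr", "numeric_expr", []), ("expr", "string_expr", [])])
check_predicate(tree, errors)
print(errors)   # ['string_expr is not allowed in a numeric_predicate']

The grammar stays small and permissive; the check above is where the "only one expr type per predicate type" rule actually lives.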

Handling in-ambiguous yet breaking syntax in expression parsing

Context
I've recently come up with an issue that I couldn't solve by myself in a parser I'm writing.
This parser is a component in a compiler I'm building and the question is in regards to the expression parsing necessary in programming language parsing.
My parser uses recursive descent to parse expressions.
The problem
I parse expressions using normal regular language parsing rules. I've eliminated left recursion in all my rules, but there is one syntactic "ambiguity" which my parser simply can't handle, and it involves generics.
comparison → addition ( ( ">" | ">=" | "<" | "<=" ) addition )* ;
is the rule I use for parsing comparison nodes in the expression
On the other hand I decided to parse generic expressions this way:
generic → primary ( "<" arguments ">" ) ;
where
arguments → expression ( "," expression )* ;
Now, because generic expressions have higher precedence (they are language constructs, not mathematical expressions), this causes a scenario where the generic parser attempts to parse expressions when it shouldn't.
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that, and fail because it can't find the closing tag.
What is the solution to such a scenario? Especially in languages like C++, where generics can also have expressions in them; if I'm not mistaken, arr<1<2> might be legal syntax.
Is this a special edge case, or does it require a modification to the syntax definition that I'm not aware of?
Thank you
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that, and fail because it can't find the closing tag
This particular case could be solved with backtracking or unbounded lookahead. As you said, the parser will eventually fail when interpreting this as a generic, so when that happens, you can go back and parse it as a relational operator instead. The lookahead variant would be to look ahead when seeing a < to check whether the < is followed by comma-separated type names and a > and only go into the generic rule if that is the case.
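As a concrete sketch of that backtracking idea, here is a tiny hand-written parser in Python. The token handling, node shapes, and simplified grammar (generic arguments are just primaries) are invented for illustration; the point is only the save-position, try-generic, restore-and-fall-back pattern.

# Minimal backtracking sketch (tokens are plain strings; shapes are invented).
class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok):
        if self.peek() != tok:
            raise ParseError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1
        return tok

    def parse_primary(self):
        tok = self.peek()
        if tok is None or tok in "<>,()":
            raise ParseError(f"unexpected {tok!r}")
        self.pos += 1
        base = ("primary", tok)
        if self.peek() == "<":
            saved = self.pos          # remember position before trying a generic
            try:
                self.eat("<")
                args = [self.parse_primary()]
                while self.peek() == ",":
                    self.eat(",")
                    args.append(self.parse_primary())
                self.eat(">")
                return ("generic", base, args)
            except ParseError:
                self.pos = saved      # backtrack: the '<' was a comparison after all
        return base

    def parse_comparison(self):
        left = self.parse_primary()
        while self.peek() in ("<", ">", "<=", ">="):
            op = self.eat(self.peek())
            left = ("cmp", op, left, self.parse_primary())
        return left

# "a < 2": the generic attempt fails, is undone, and we get a comparison.
print(Parser(["a", "<", "2"]).parse_comparison())
# ('cmp', '<', ('primary', 'a'), ('primary', '2'))
# "a < b , c >": the generic attempt succeeds.
print(Parser(["a", "<", "b", ",", "c", ">"]).parse_comparison())
# ('generic', ('primary', 'a'), [('primary', 'b'), ('primary', 'c')])

The lookahead variant would replace the try/except with a scan forward from the saved position that checks for a well-formed argument list before committing to the generic rule.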
However that approach no longer works if both interpretations are syntactically valid (meaning the syntax actually is ambiguous). One example of that would be x<y>z, which could either be a declaration of a variable z of type x<y> or two comparisons. This example is somewhat unproblematic since the latter meaning is almost never the intended one, so it's okay to always interpret it as the former (this happens in C# for example).
Now if we allow expressions, it becomes more complicated. For x<y>z it's easy enough to say that this should never be interpreted as two comparisons, as it makes no sense to compare the result of a comparison with something else (in many languages using relational operators on Booleans is a type error anyway). But for something like a<b<c>() there are two interpretations that might both be valid: either a is a generic function called with the generic argument b<c, or b is a generic function with the generic argument c (and a is compared to the result of calling that function). At this point it is no longer possible to resolve the ambiguity with syntactic rules alone:
In order to support this, you'll need to either check whether the given primary refers to a generic function and make different parsing decisions based on that or have your parser generate multiple trees in case of ambiguities and then select the correct one in a later phase. The former option means that your parser needs to keep track of which generic functions are currently defined (and in scope) and then only go into the generic rule if the given primary is the name of one of those functions. Note that this becomes a lot more complicated if you allow functions to be defined after they are used.
So, in summary, supporting expressions as generic arguments requires you to either keep track of which functions are in scope while parsing and use that information to make your parsing decisions (meaning your parser is context-sensitive) or generate multiple possible ASTs. Without expressions you can keep it context-free and unambiguous, but it will require backtracking or arbitrary lookahead (meaning it's LL(*)).
Since neither of those are ideal, some languages change the syntax for calling generic functions with explicit type parameters to make it LL(1). For example:
Java puts the generic argument list of a method before the method name, i.e. obj.<T>foo() instead of obj.foo<T>().
Rust requires :: before the generic argument list: foo::<T>() instead of foo<T>().
Scala uses square brackets for generics and for nothing else (array subscripts use parentheses): foo[T]() instead of foo<T>().

How do you deal with keywords in Lex?

Suppose you have a language which allows production like this: optional optional = 42, where first "optional" is a keyword, and the second "optional" is an identifier.
On one hand, I'd like to have a Lex rule like optional { return OPTIONAL; }, which would later be used in YACC like this, for example:
optional : OPTIONAL identifier '=' expression ;
If I then define identifier as, say:
identifier : OPTIONAL | FIXED32 | FIXED64 | ... /* couple dozens of keywords */
| IDENTIFIER ;
It just feels bad... besides, I would need two kinds of identifiers, one for when keywords are allowed as identifiers, and another one for when they aren't...
Is there an idiomatic way to solve this?
Is there an idiomatic way to solve this?
Other than the solution you have already found, no. Semi-reserved keywords are definitely not an expected use case for lex/yacc grammars.
The lemon parser generator has a fallback declaration designed for cases like this, but as far as I know, that useful feature has never been added to bison.
You can use a GLR grammar to avoid having to figure out all the different subsets of identifier. But of course there is a performance penalty.
You've already discovered the most common way of dealing with this in lex/yacc, and, while not pretty, it's not too bad. Normally you call your rule that matches an identifier or (set of) keywords whateverName, and you may have more than one of them -- as different contexts may have different sets of keywords they can accept as a name.
Another way that may work if you have keywords that are only recognized as such in easily identifiable places (such as at the start of a line) is to use a lex start state so as to only return a KEYWORD token if the keyword is in that context. In any other context, the keyword will just be returned as an identifier token. You can even use yacc actions to set the lexer state for somewhat complex contexts, but then you need to be aware of the possible one-token lexer lookahead done by the parser (rules might not run until after the token after the action is already read).
This is a case where the keywords are not reserved. A few programming languages allowed this: PL/I, FORTRAN. It's not a lexer problem, because the lexer should always know which IDENTIFIERs are keywords. It's a parser problem. It usually causes too much ambiguity in the language specification and parsing becomes a nightmare. The grammar would have this:
identifier : keyword | IDENTIFIER ;
keyword : OPTIONAL | FIXED32 | FIXED64 | ... ;
If you have no conflicts in the grammar, then you are OK. If you have conflicts, then you need a more powerful parser generator, such as LR(k) or GLR.
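For a sense of what that identifier : keyword | IDENTIFIER idea looks like outside of yacc, here is a small hand-written sketch in Python. The keyword set and token handling are invented for illustration; the point is that the lexer still classifies keywords as keyword tokens, but the parser's name rule simply accepts them wherever an identifier is expected.

# Sketch: keywords are lexed as KEYWORD, yet still usable as names.
KEYWORDS = {"optional", "fixed32", "fixed64"}

def lex(text):
    for word in text.replace("=", " = ").split():
        if word in KEYWORDS:
            yield ("KEYWORD", word)
        elif word.isidentifier():
            yield ("IDENTIFIER", word)
        else:
            yield (word, word)          # '=', numbers, etc.

def parse_identifier(tokens, pos):
    kind, text = tokens[pos]
    if kind in ("IDENTIFIER", "KEYWORD"):   # identifier : keyword | IDENTIFIER
        return text, pos + 1
    raise SyntaxError(f"expected a name, got {text!r}")

tokens = list(lex("optional optional = 42"))
# tokens[1] is lexed as a KEYWORD, but the identifier rule accepts it as a name:
name, _ = parse_identifier(tokens, 1)
print(name)   # optional

In a generated LALR parser the same idea shows up as the identifier/keyword rule above, and the conflicts the answer mentions appear wherever the grammar cannot tell from context which reading is meant.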

Grammar: Precedence of grammar alternatives

This is a very basic question about grammar alternatives. If you have the following alternatives:
Myalternative: 'a' | .;
Myalternative2: 'a' | 'b';
Would 'a' take priority over the '.' and over 'b'?
I understand that this may also depend on the behaviour of the parser generated by this syntax, but in purely theoretical grammar terms, could you imagine these rules being matched in parallel, i.e. tested against 'a' and '.' at the same time, selecting the one with the highest priority? Or are 'a' and . ambiguous due to the lack of precedence in grammars?
The answer depends primarily on the tool you are using, and what the semantics of that tool is. As written, this is not a context-free grammar in canonical form, and you'd need to produce that to get a theoretical answer, because only in that way can you clarify the intended semantics.
Since the question is tagged antlr, I'm going to guess that this is part of an Antlr lexical definition in which . is a wildcard character. In that case, 'a' | . means exactly the same thing as ..
Since MyAlternative matches everything that MyAlternative2 matches, and since MyAlternative comes first in the Antlr lexical definition, MyAlternative2 can never match anything. Any single character will be matched by MyAlternative (unless there is some other lexical rule which matches a longer sequence of input characters).
If you put the definition of MyAlternative2 first in the grammar file, then a or b would be matched as MyAlternative2, while any other character would be matched as MyAlternative.
The question of precedence within alternatives is meaningless. It doesn't matter whether MyAlternative considers the match of an a to be a match of a or a match of .. It is, in the end, a match of MyAlternative, and that symbol can only have one associated action.
Between lexical rules, there is a precedence relationship: The first one wins. (More accurately, as far as I know, Antlr obeys the usual convention that the longest match wins; between two rules which both match the same longest sequence, the first one in the grammar file wins.) That is not in any way influenced by alternative bars in the rules themselves.
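To make that disambiguation order concrete, here is a toy maximal-munch tokenizer in Python. It is not Antlr's actual implementation, just the convention the answer describes: among the rules that match at the current position, the longest match wins, and ties go to the rule listed first.

import re

# (name, regex) pairs in the order they appear in the "grammar"
RULES = [
    ("MyAlternative", re.compile(r"a|.")),
    ("MyAlternative2", re.compile(r"a|b")),
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        candidates = []
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m and m.group():
                candidates.append((len(m.group()), name, m.group()))
        # longest match wins; on a tie, max() keeps the earlier (first-listed) rule
        length, name, lexeme = max(candidates, key=lambda c: c[0])
        tokens.append((name, lexeme))
        pos += length
    return tokens

print(tokenize("ab"))   # [('MyAlternative', 'a'), ('MyAlternative', 'b')]
# Swap the two RULES entries and 'a' and 'b' come back as MyAlternative2 instead,
# while any other character still falls through to MyAlternative.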

YACC|BISON: How do I manipulate the parse tree?

The goal of my application is to validate SQL code and, at the same time, generate from that code a formatted version with some modifications. For example, this where clause:
where e.student_name= c.contact_name and ( c.address = " nefta"
or c.address=" tozeur ") and e.age <18
we will have as formatted output something like this:
where e.student_name= c.contact_name and (c.address=trim("nefta")
or c.address=trim("tozeur") ) and e.age <18
I hope I've explained my aim well
The problem is that grammars may contain recursive rules, which makes the rewrite task unreliable; for instance, in my SQL grammar I have this:
search_condition : search_condition OR search_condition{clbck_or}
| search_condition AND search_condition{clbck_and}
| NOT search_condition {clbck_not}
| '(' search_condition ')'{clbck__}
| predicate {clbck_pre}
;
Note that I specified precedence to solve shift/reduce conflicts:
%left OR
%left AND
%left NOT
So back to the last example; my where clause will be consumed this way:
c.address="nefta"or c.address="tozeur" -> search_condition
(c.address="nefta"or c.address="tozeur")->search_condition
e.student_name= c.contact_name and (c.address="nefta"or c.address="tozeur")-> search_condition
... and e.age<18-> search_condition
You can see, by the way, that it's tough to rebuild the input stream from the callbacks triggered by each reduction, because the order is not the same.
Any help for this problem ?
Your question is a bit vague, so I'm guessing that you actually print in your clbck_or(), etc. The "common" way to which wildplasser has alluded is to use "semantic values", i.e. (untested):
search_condition : search_condition OR search_condition{$$ = clbck_or($1, $3);}
| search_condition AND search_condition{$$ = clbck_and($1, $3);}
| NOT search_condition {$$ = clbck_not($2);}
| '(' search_condition ')'{$$ = clbck__($2);}
| predicate {$$ = clbck_pre($1);}
;
If you're using Bison, the manual has a fine example in the section "Infix Notation Calculator: `calc'". With strings and C, you will have to add some memory handling.
Bison is good at parsing, and with some manual help, good at building a custom syntax tree. After that, it's up to you to do what you want with the tree. The good news is you can do whatever you want. The bad news is you still have to build a lot of machinery to do what you want. Your basic problem of regenerating source code is called "prettyprinting"; see my SO answer on how to prettyprint to understand what it takes to do this, including all the peccadillos of lexical syntax (you don't want to lose the escapes in your literal strings, right?). You didn't at all address how to find the construct you wanted to change in the tree, or how you'd smash the tree to change it.
If you don't want to do all of that, then what you really want is a program transformation system, which is good at parsing, builds a syntax tree for you (so you don't have to think about it; SQL is a pretty big grammar), will let you find patterns in the tree in terms of the SQL syntax you are used to, make tree changes without knowing much about the shape of the tree, and can finally regenerate valid source text by prettyprinting, as I describe in my answer linked above. (A program transformation system essentially includes a parser as a subroutine.)
Our DMS Software Reengineering Toolkit is such a program transformation system. It has a set of predefined language definitions including SQL2011 and means for configuring for a particular dialect.
Using DMS source-to-source syntax rules, you could carry out the change in your example with the following rule:
domain SQL;
rule trim_c_members(f: identifier, s: string):condition->condition
= " c.\f = \s " -> " c.\f = trim(\s) ";
This is DMS Rule Language (meta) syntax describing a rewrite on ("domain") SQL code. The rule has a name (because in complex applications there are lots of rules) and it has syntactic placeholders "f" and "s"; it rewrites only conditions in the code. The quotes are RSL meta-quotes; the stuff inside them is SQL with RSL metavariables "\f" and "\s"; the stuff outside is RSL rule syntax. What the rule says is: "for any condition on a variable explicitly named 'c', with any field f, if that field is compared by equality to some literal string, then replace the literal string by 'trim' applied to the literal string".
I left out some code that basically says, "apply this rule to the entire tree, and don't apply it twice in the same place". That "strategy" is one of many built into DMS.
There's the question of how the rule works. That is accomplished by DMS applying the SQL parser to the meta-quoted strings to produce "pattern" syntax trees with placeholders where the metavariables are written. The left-hand-side pattern tree is then matched against the target tree, with the placeholders referring to subtrees; the right-hand tree is spliced in where the left tree matched, and the placeholder subtrees are transferred. So you, the programmer, see surface syntax that you know and love; the tool works with trees and so it isn't confused by text.
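To make the placeholders-bind-to-subtrees idea concrete outside of DMS, here is a toy sketch in Python. It is not DMS's actual machinery; the tuple node shapes and the backslash-prefixed metavariable convention are invented purely for illustration.

# Toy pattern matching: metavariables (strings starting with "\") bind to whole subtrees.
def match(pattern, subject, bindings):
    if isinstance(pattern, str) and pattern.startswith("\\"):
        bindings[pattern] = subject
        return True
    if isinstance(pattern, str) or isinstance(subject, str):
        return pattern == subject
    if len(pattern) != len(subject):
        return False
    return all(match(p, s, bindings) for p, s in zip(pattern, subject))

# pattern for:  c.f = s        (\f and \s are metavariables)
pattern = ("=", (".", "c", "\\f"), "\\s")
subject = ("=", (".", "c", "address"), '" nefta"')
bindings = {}
if match(pattern, subject, bindings):
    # splice the bindings into the right-hand-side shape: c.f = trim(s)
    rewritten = ("=", (".", "c", bindings["\\f"]), ("call", "trim", bindings["\\s"]))
    print(rewritten)
    # ('=', ('.', 'c', 'address'), ('call', 'trim', '" nefta"'))

A real system does this against full parse trees, handles associativity and commutativity, applies a traversal strategy, and regenerates text afterwards, but the core match-and-splice step looks like this.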
Now, I don't think my rule matches exactly your intent, but that's partly because I can't guess your actual intent. You can write other rules if this isn't what you wanted.
This rule is purely driven by syntax; one can add a semantic predicate (not shown) if you want more complicated conditions to apply to the rule (e.g., the variable has to be one in certain scopes you define), and that gets messier to say. But it is much simpler and far easier to read than C code that climbs over the AST (notice you didn't see the AST here?) and tries to figure all this out.
The parsing and prettyprinting happen before and after rule application; there's a lot of machinery required to implement all that, but that machinery is built into DMS (e.g., it has something like, but more powerful than, Bison built in), and for predefined domains such as SQL, all the prettyprinting work has been preconfigured, too.
If you want to get a better sense of what it takes to go full cycle with DMS (define your own language parser, define a pretty printer, define complicated rules), here's a nice and complete example of defining and symbolically simplifying calculus using DMS.

What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree?

I've been reading a bit about how interpreters/compilers work, and one area where I'm getting confused is the difference between an AST and a CST. My understanding is that the parser makes a CST, hands it to the semantic analyzer which turns it into an AST. However, my understanding is that the semantic analyzer simply ensures that rules are followed. I don't really understand why it would actually make any changes to make it abstract rather than concrete.
Is there something that I'm missing about the semantic analyzer, or is the difference between an AST and CST somewhat artificial?
A concrete syntax tree represents the source text exactly in parsed form. In general, it conforms to the context-free grammar defining the source language.
However, the concrete grammar and tree have a lot of things that are necessary to make source text unambiguously parseable, but do not contribute to actual meaning. For example, to implement operator precedence, your CFG usually has several levels of expression components (term, factor, etc.), with the operators connecting them at the different levels (you add terms to get expressions, terms are composed of factors that are optionally multiplied, etc.). To actually interpret or compile the language, however, you don't need this; you just need Expression nodes that have operators and operands. The abstract syntax tree is the result of simplifying the concrete syntax tree down to the things actually needed to represent the meaning of the program. This tree has a much simpler definition and is thus easier to process in the later stages of execution.
You usually don't need to actually build a concrete syntax tree. The action routines in your YACC (or Antlr, or Menhir, or whatever...) grammar can directly build the abstract syntax tree, so the concrete syntax tree only exists as a conceptual entity representing the parse structure of your source text.
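As a tiny sketch of that idea, here is a hand-written recursive-descent parser in Python (not tied to YACC or Antlr; the node shapes are invented for illustration). The expr/term/factor layering of the concrete grammar exists only in the parsing functions, while the tree that comes out contains nothing but operator and number nodes.

# The "CST" never materializes; the actions build AST nodes directly.
import re

def tokenize(src):
    return re.findall(r"\d+|[-+*/()]", src)

def parse_expr(toks):          # expr : term (('+' | '-') term)*
    node, toks = parse_term(toks)
    while toks and toks[0] in "+-":
        op, toks = toks[0], toks[1:]
        rhs, toks = parse_term(toks)
        node = ("binop", op, node, rhs)     # AST node, no "expr" wrapper
    return node, toks

def parse_term(toks):          # term : factor (('*' | '/') factor)*
    node, toks = parse_factor(toks)
    while toks and toks[0] in "*/":
        op, toks = toks[0], toks[1:]
        rhs, toks = parse_factor(toks)
        node = ("binop", op, node, rhs)
    return node, toks

def parse_factor(toks):        # factor : NUMBER | '(' expr ')'
    if toks[0] == "(":
        node, toks = parse_expr(toks[1:])
        return node, toks[1:]               # drop ')': no parenthesis node in the AST
    return ("num", int(toks[0])), toks[1:]

ast, _ = parse_expr(tokenize("1 + 2 * (3 + 4)"))
print(ast)
# ('binop', '+', ('num', 1), ('binop', '*', ('num', 2), ('binop', '+', ('num', 3), ('num', 4))))

Notice that neither the precedence levels nor the parentheses survive into the result; the precedence and grouping are encoded entirely in the tree's shape.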
A concrete syntax tree matches what the grammar rules say is the syntax. The purpose of the abstract syntax tree is to have a "simple" representation of what's essential in "the syntax tree".
A real value in the AST IMHO is that it is smaller than the CST, and therefore takes less time to process. (You might say, who cares? But I work with a tool where we have tens of millions of nodes live at once!)
Most parser generators that have any support for building syntax trees insist that you personally specify exactly how they get built under the assumption that your tree nodes will be "simpler" than the CST (and in that, they are generally right, as programmers are pretty lazy). Arguably it means you have to code fewer tree visitor functions, and that's valuable, too, in that it minimizes engineering energy. When you have 3500 rules (e.g., for COBOL) this matters. And this "simpler"ness leads to the good property of "smallness".
But having such ASTs creates a problem that wasn't there: it doesn't match the grammar, and now you have to mentally track both of them. And when there are 1500 AST nodes for a 3500 rule grammar, this matters a lot. And if the grammar evolves (they always do!), now you have two giant sets of things to keep in synch.
Another solution is to let the parser simply build CST nodes for you and just use those. This is a huge advantage when building the grammars: there's no need to invent 1500 special AST nodes to model 3500 grammar rules. Just think about the tree being isomorphic to the grammar. From the point of view of the grammar engineer this is completely brainless, which lets him focus on getting the grammar right and hacking at it to his heart's content. Arguably you have to write more node visitor rules, but that can be managed. More on this later.
What we do with the DMS Software Reengineering Toolkit is to automatically build a CST based on the results of a (GLR) parsing process. DMS then automatically constructs a "compressed" CST for space-efficiency reasons, by eliminating non-value-carrying terminals (keywords, punctuation), semantically useless unary productions, and forming directly-indexable lists for grammar rule pairs that are list-like:
L = e ;
L = L e ;
L2 = e2 ;
L2 = L2 ',' e2 ;
and a wide variety of variations of such forms. You think in terms of the grammar rules and the virtual CST; the tool operates on the compressed representation. Easy on your brain, faster/smaller at runtime.
Remarkably, the compressed CST built this way looks a lot like an AST that you might have designed by hand (see link at end to examples). In particular, the compressed CST doesn't carry any nodes that are just concrete syntax.
There are minor bits of awkwardness: for example, while the concrete nodes for '(' and ')' classically found in expression subgrammars are not in the tree, a "parentheses node" does appear in the compressed CST and has to be handled. A true AST would not have this. This seems like a pretty small price to pay for the convenience of never having to specify the AST construction. And the documentation for the tree is always available and correct: the grammar is the documentation.
How do we avoid "extra visitors"? We don't entirely, but DMS provides an AST library that walks the AST and handles the differences between the CST and the AST transparently. DMS also offers an "attribute grammar" evaluator (AGE), which is a method for passing values computed at nodes up and down the tree; the AGE handles all the tree representation issues, so the tool engineer only worries about writing computations effectively directly on the grammar rules themselves. Finally, DMS also provides "surface-syntax" patterns, which allow code fragments from the grammar to be used to find specific types of subtrees, without knowing most of the node types involved.
One of the other answers observes that if you want to build tools that can regenerate source, your AST will have to match the CST. That's not really right, but it is far easier to regenerate the source if you have CST nodes. DMS generates most of the prettyprinter automatically because it has access to both :-}
Bottom line: ASTs are good because they are small, both physically and conceptually. Automated AST construction from the CST provides both, and lets you avoid the problem of tracking two different sets.
EDIT March 2015: Link to examples of CST vs. "AST" built this way
This is based on the Expression Evaluator grammar by Terence Parr.
The grammar for this example:
grammar Expr002;
options
{
output=AST;
ASTLabelType=CommonTree; // type of $stat.tree ref etc...
}
prog : ( stat )+ ;
stat : expr NEWLINE -> expr
| ID '=' expr NEWLINE -> ^('=' ID expr)
| NEWLINE ->
;
expr : multExpr (( '+'^ | '-'^ ) multExpr)*
;
multExpr
: atom ('*'^ atom)*
;
atom : INT
| ID
| '('! expr ')'!
;
ID : ('a'..'z' | 'A'..'Z' )+ ;
INT : '0'..'9'+ ;
NEWLINE : '\r'? '\n' ;
WS : ( ' ' | '\t' )+ { skip(); } ;
Input
x=1
y=2
3*(x+y)
Parse Tree
The parse tree is a concrete representation of the input. The parse tree retains all of the information of the input. The empty boxes represent whitespace, i.e. end of line.
AST
The AST is an abstract representation of the input. Notice that parens are not present in the AST because the associations are derivable from the tree structure.
EDIT
For a more thorough explanation see Compilers and Compiler Generators pg. 23
This blog post may be helpful.
It seems to me that the AST "throws away" a lot of intermediate grammatical/structural information that wouldn't contribute to semantics. For example, you don't care that 3 is an atom is a term is a factor is a.... You just care that it's 3 when you're implementing the exponentiation expression or whatever.
The concrete syntax tree follows the rules of the grammar of the language. In the grammar, "expression lists" are typically defined with two rules
expression_list can be: expression
expression_list can be: expression, expression_list
Followed literally, these two rules give a comb shape to any expression list that appears in the program.
The abstract syntax tree is in the form that's convenient for further manipulation. It represents things in a way that makes sense to someone who understands the meaning of programs, and not just the way they are written. The expression list above, which may be the list of arguments of a function, may conveniently be represented as a vector of expressions, since it's better for static analysis to have the total number of expressions explicitly available and to be able to access each expression by its index.
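A small sketch of that flattening step (the tuple shapes are invented stand-ins for CST nodes): the recursive "comb" produced by the two rules above is collapsed into a plain list, so the AST can expose the argument count and indexed access directly.

# CST shape produced by the two rules for "a, b, c" (invented tuples):
#   ("expression_list", expr_a, ("expression_list", expr_b, ("expression_list", expr_c)))
def flatten_expression_list(node):
    exprs = []
    while True:
        if len(node) == 2:                    # expression_list : expression
            exprs.append(node[1])
            return exprs
        _, head, rest = node                  # expression_list : expression ',' expression_list
        exprs.append(head)
        node = rest

cst = ("expression_list", "a",
       ("expression_list", "b",
        ("expression_list", "c")))
print(flatten_expression_list(cst))   # ['a', 'b', 'c']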
Simply put, the AST only contains the semantics of the code, while the parse tree/CST also includes information on exactly how the code was written.
The concrete syntax tree contains all information like superfluous parenthesis and whitespace and comments, the abstract syntax tree abstracts away from this information.
NB: funny enough, when you implement a refactoring engine your AST will again contain all the concrete information, but you'll keep referring to it as an AST because that has become the standard term in the field (so one could say it has long ago lost its original meaning).
CST (Concrete Syntax Tree) is a tree representation of the grammar (the rules of how the program should be written).
Depending on the compiler architecture, it can be used by the parser to produce an AST.
AST (Abstract Syntax Tree) is a tree representation of the parsed source, produced by the parser part of the compiler. It stores information about both tokens and grammar.
Depending on the architecture of your compiler, the CST can be used to produce an AST. It is fair to say that the CST evolves into the AST. Or, the AST is a richer CST.
More explanations can be found on this link: http://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees#id6
It is a difference which doesn't make a difference.
An AST is usually explained as a way to approximate the semantics of a programming language expression by throwing away lexical content. For example in a context free grammar you might write the following EBNF rule
term: atom (('*' | '/') term )*
whereas in the AST case you only use mul_rule and div_rule, which express the proper arithmetic operations.
Can't those rules be introduced in the grammar in the first place? Of course. You can rewrite the above compact and abstract rule by breaking it into more concrete rules that mimic the mentioned AST nodes:
term: mul_rule | div_rule
mul_rule: atom ('*' term)*
div_rule: atom ('/' term)*
Now, when you think in terms of top-down parsing, the second version introduces a FIRST/FIRST conflict between mul_rule and div_rule, something an LL(1) parser cannot deal with. The first rule form was the left-factored version of the second one, which effectively eliminated structure. You have to pay some price for using LL(1) here.
So ASTs are an ad hoc supplement used to fix the deficiencies of grammars and parsers. The CST -> AST transformation is a refactoring move. No one has ever been bothered when an additional comma or colon is stored in the syntax tree. On the contrary, some authors retrofit them into ASTs because they like to use ASTs for doing refactorings, instead of maintaining various trees at the same time or writing an additional inference engine. Programmers are lazy for good reasons. Actually, they even store line and column information gathered by lexical analysis in ASTs, for error reporting. Very abstract indeed.
