Does ANTLR4 support adaptive grammars that allow the user to specify new rules, such as enforcing the number of arguments given in a function declaration?
Example:
Base language includes the following token definitions:
Token #1 is defined as [a-z][0-9]*
Token #2 is defined as [A-Z][0-9]*
Uppercase tokens are reserved for function names, and lowercase tokens are reserved for variables passed to a function.
The user can "declare" Fxy, and every following instance of F has to have two variables. I want the parser to enforce the "new rule".
Perhaps this is standard fare in compilers; I know the compilers I use for C, Python, etc. bitch when I don't pass the right number of arguments for a function I declared elsewhere. However, I don't know how to do this myself in my own grammar; the undergrad course I took on compilers was more than 15 years ago, and I don't recall it covering how to enforce the number of arguments required for user-declared functions. I've written some simple languages with five keywords and scoping (brackets), somewhat akin to the calculator examples you find in textbooks, but nothing complex.
So, I guess what I also want to know is whether the ANTLR books will teach me how to do this (don't want to spend the money if the books don't explain what I want to achieve).
An adaptive grammar would be a grammar for producing another grammar. But that is not what you are really asking for or how parsers are typically used for the purposes you describe.
In general, a grammar defines the allowed syntax of the language (or DSL), while the visitors over the tree generated from the grammar determine whether the language semantics are met. Whether a call to a named function contains the right number and type of parameters is a question of semantics, not syntax.
Consider the following grammar snippet:
decl   : fname AS FUNC LPAREN params? RPAREN body ;
func   : FUNC fname LPAREN params? RPAREN body ;
params : param ( COMMA param )* ;
param  : type pname ;
stmnt  : fname LPAREN ( pname ( COMMA pname )* )? RPAREN SEMI ;
It allows standard functions (methods) and it allows new functions to be declared. The stmnt rule allows a named function to be called.
Whether the type and number of pnames is correct is a question of semantics that can only be answered in an analysis implemented by walking the generated tree: is there a function with the given fname, do the number of pnames and params match, do the types match or are they convertible, etc.
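For illustration only, here is a minimal Python sketch of that kind of check over a hand-rolled AST; the FuncDecl and Call node classes are hypothetical stand-ins, not ANTLR-generated parse-tree types:

class FuncDecl:
    def __init__(self, name, params):
        self.name, self.params = name, params

class Call:
    def __init__(self, name, args):
        self.name, self.args = name, args

def check_calls(nodes):
    errors = []
    # Pass 1: record the arity of every declared function.
    arity = {n.name: len(n.params) for n in nodes if isinstance(n, FuncDecl)}
    # Pass 2: verify every call site against that symbol table.
    for n in nodes:
        if isinstance(n, Call):
            if n.name not in arity:
                errors.append(f"unknown function {n.name}")
            elif len(n.args) != arity[n.name]:
                errors.append(f"{n.name} expects {arity[n.name]} argument(s), got {len(n.args)}")
    return errors

print(check_calls([FuncDecl("F", ["x", "y"]), Call("F", ["a"])]))
# ['F expects 2 argument(s), got 1']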
The ANTLR books will help. You may wish to spend some time looking at the repository grammars to get a better feel for how different languages can be described by a grammar.
An adaptive grammar is essentially a grammar for a "self-extensible" parser that "learns" new grammar rules from its input. ANTLR does not appear to support adaptive grammars, but there are some other parser generators that do support them, such as dypgen, which is based on the GLR parsing algorithm.
Related
Context
I've recently come up with an issue that I couldn't solve by myself in a parser I'm writing.
This parser is a component in a compiler I'm building and the question is in regards to the expression parsing necessary in programming language parsing.
My parser uses recursive descent to parse expressions.
The problem
I parse expressions using normal regular-language parsing rules; I've eliminated left recursion in all my rules, but there is one syntactic "ambiguity" which my parser simply can't handle, and it involves generics.
comparison → addition ( ( ">" | ">=" | "<" | "<=" ) addition )* ;
is the rule I use for parsing comparison nodes in the expression
On the other hand I decided to parse generic expressions this way:
generic → primary ( "<" arguments ">" ) ;
where
arguments → expression ( "," expression )* ;
Now, because generic expressions have higher precedence (they are language constructs and not mathematical expressions), this causes a scenario where the generic rule will attempt to parse expressions when it shouldn't.
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that, and fail as it can't find the closing tag.
What is the solution to such a scenario? Especially in languages like C++, where generics can also contain expressions; if I'm not mistaken, arr<1<2> might be legal syntax.
Is this a special edge case, or does it require a modification to the syntax definition that I'm not aware of?
Thank you
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that and fail as it can't find the closing tag
This particular case could be solved with backtracking or unbounded lookahead. As you said, the parser will eventually fail when interpreting this as a generic, so when that happens, you can go back and parse it as a relational operator instead. The lookahead variant would be to look ahead when seeing a < to check whether the < is followed by comma-separated type names and a > and only go into the generic rule if that is the case.
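As a rough sketch of the lookahead variant (the token list and the helper name here are made up for illustration), in Python:

def looks_like_generic(tokens, pos):
    # tokens[pos] is "<"; return True only if it is followed by
    # comma-separated type names and a closing ">".
    i = pos + 1
    while True:
        if i >= len(tokens) or not tokens[i].isidentifier():
            return False                      # expected a type name
        i += 1
        if i < len(tokens) and tokens[i] == ">":
            return True                       # found ">", take the generic rule
        if i >= len(tokens) or tokens[i] != ",":
            return False                      # neither "," nor ">": parse "<" as a comparison
        i += 1

print(looks_like_generic(["a", "<", "T", ",", "U", ">"], 1))   # True
print(looks_like_generic(["a", "<", "2"], 1))                  # False, so a<2 stays a comparison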
However, that approach no longer works if both interpretations are syntactically valid (meaning the syntax actually is ambiguous). One example of that would be x<y>z, which could either be a declaration of a variable z of type x<y> or two comparisons. This example is somewhat unproblematic since the latter meaning is almost never the intended one, so it's okay to always interpret it as the former (this happens in C# for example).
Now if we allow expressions, it becomes more complicated. For x<y>z it's easy enough to say that this should never be interpreted as two comparisons, as it makes no sense to compare the result of a comparison with something else (in many languages using relational operators on Booleans is a type error anyway). But for something like a<b<c>() there are two interpretations that might both be valid: either a is a generic function called with the generic argument b<c, or b is a generic function with the generic argument c (and a is compared to the result of calling that function). At this point it is no longer possible to resolve that ambiguity with syntactic rules alone:
In order to support this, you'll need to either check whether the given primary refers to a generic function and make different parsing decisions based on that or have your parser generate multiple trees in case of ambiguities and then select the correct one in a later phase. The former option means that your parser needs to keep track of which generic functions are currently defined (and in scope) and then only go into the generic rule if the given primary is the name of one of those functions. Note that this becomes a lot more complicated if you allow functions to be defined after they are used.
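A sketch of the first option, assuming the parser maintains a set of generic function names currently in scope (all names here are illustrative):

generic_functions_in_scope = {"a"}            # filled in as declarations are parsed

def should_enter_generic_rule(primary_name, next_token):
    # Enter the generic rule only if the primary names a known generic function;
    # otherwise "<" is parsed as a relational operator.
    return next_token == "<" and primary_name in generic_functions_in_scope

print(should_enter_generic_rule("a", "<"))    # True: parse a generic argument list
print(should_enter_generic_rule("x", "<"))    # False: parse a comparison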
So, in summary: supporting expressions as generic arguments requires you either to keep track of which functions are in scope while parsing and use that information to make your parsing decisions (meaning your parser is context-sensitive), or to generate multiple possible ASTs. Without expressions you can keep it context-free and unambiguous, but it will require backtracking or arbitrary lookahead (meaning it's LL(*)).
Since neither of those are ideal, some languages change the syntax for calling generic functions with explicit type parameters to make it LL(1). For example:
Java puts the generic argument list of a method before the method name, i.e. obj.<T>foo() instead of obj.foo<T>().
Rust requires :: before the generic argument list: foo::<T>() instead of foo<T>().
Scala uses square brackets for generics and for nothing else (array subscripts use parentheses): foo[T]() instead of foo<T>().
I have a grammar, and I do not know what type of parser I need in order to parse it, other than that I do not believe the grammar is LL(1). I am thinking I need a parser with backtracking or LL(*) of some sort. The grammar I have come up with (which may need some rewriting) is:
S: Rules
Rules: Rule | Rule Rules
Rule: id '=' Ids
Ids: id | Ids id
The language I am trying to generate looks something like this:
abc = def g hi jk lm
xy = aaa bbb ccc ddd eee fff jjj kkk
foo = bar ha ha
Zero or more Rules, each containing a left identifier followed by an equals sign followed by one or more identifiers. The part that I think I will have a problem writing a parser for is that the grammar allows any number of ids in a Rule, and that the only way to tell when a new Rule starts is when it finds id =, which would require backtracking.
Does anyone know the classification of this grammar and the best method of parsing, for a hand written parser?
The grammar that generates an identifier followed by an equals sign followed by a finite sequence of identifiers is regular. This means that strings in the language can be parsed using a DFA or regular expression. No need for fancy nondeterministic or LL(*) parsers.
To see that the language is regular, let Id = ⋃{a : a ∈ Γ}, i.e. the alternation of all symbols in Γ, where Γ ⊂ Σ is the set of symbols that can occur in identifiers. Each rule in the language you are trying to generate is then denoted by the regular expression
Id+ =( Id+)* Id+
where the spaces stand for literal space characters.
Setting Γ = {a, b, ..., z}, examples of strings in the language of the regular expression are:
look = i am in a regular language
hey = that means i can be recognized by a dfa
cool = or even a regular expression
There is no need to parse your language using powerful parsing techniques. This is one case where parsing using regular expressions or DFA is both appropriate and optimal.
edit:
Call the above regular expression R. To parse R*, generate a DFA recognizing the language of R*. To do this, generate an NFA recognizing the language of R* using the algorithm obtainable from Kleene's theorem. Then convert the NFA into a DFA using the subset construction. The resultant DFA will recognize all strings in R*. Given a representation of the constructed DFA in your implementation language, the required actions - for instance,
Add the last identifier parsed to the right-hand side of the current declaration statement being parsed
Add the last declaration statement parsed to a list of parsed declarations, and use the last identifier parsed to begin parsing a new declaration statement
can be encoded into the states of the DFA. In reality, using Kleene's theorem and the subset construction is probably unnecessary for such a simple language. That is, you can probably just write a parser with the above two actions without implementing an automaton. Given a more complicated regular language (for instance, the lexical structure of a programming language), the conversion would be the best option.
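As a concrete illustration, here is a sketch using Python's re module in place of a hand-built DFA; the lookahead (?!\s*=) plays the role of the action that ends one declaration when the next identifier turns out to be the left-hand side of a new Rule:

import re

text = """abc = def g hi jk lm
xy = aaa bbb ccc ddd eee fff jjj kkk
foo = bar ha ha"""

# One Rule: an identifier, "=", then identifiers that are not themselves
# the left-hand side of the next Rule (i.e. not followed by "=").
rule_re = re.compile(r"([a-z]+)\s*=\s*((?:[a-z]+\b(?!\s*=)\s*)+)")

for lhs, rhs in rule_re.findall(text):
    print(lhs, "->", rhs.split())
# abc -> ['def', 'g', 'hi', 'jk', 'lm']
# xy -> ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'jjj', 'kkk']
# foo -> ['bar', 'ha', 'ha']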
I am writing an interpreter for a language where functions can be used as operators. However, a function's content will only be known at runtime.
For that I considered two solutions:
Parsing is done at runtime, using the runtime information on the function
All user-defined operators use default values for precedence and associativity.
I chose the latter, as I see a number of advantages in parsing separately from execution.
Now it comes to implementation, and I am interested to see what options there are. My initial thought is a shift-reduce parser, but I have little experience in constructing parsers.
Example:
LHS op RHS : LHS * RHS /* define a binary operator 'op' */
var : 3 /* define a variable */
print 5 op var /* should print 15 */
LHS op RHS : LHS / RHS /* Re-define op */
print var op var /* Should print 1 */
In the last case, the parser will get from the lexer: " id id id id ". Only at runtime do I know that the 'op' id is an operator.
(Posting the results of the comments, as requested.)
Solution #1 is definitely ugly, complex to implement, and unneeded, I agree. Solution #2 is by far easier to implement and comprehend. You can also allow custom associativity and precedence for operators, as long as those are known statically. The main thing is that these facts are known at parse time.
As for actual parsing, most parsers will work just fine, as any two expressions surrounding an id are an application of a custom infix operator (this is less true if you allow custom precedence and associativity; in that case you need an algorithm that can determine those on a per-operator basis at parse time). In either case, my personal favorite is a "Top Down Operator Precedence Parser", or Pratt parser. I found the following resources (ordered by usefulness to me, YMMV) describe it quite well:
Simple Top-Down Parsing in Python
Pratt Parsers: Expression Parsing Made Easy
Top Down Operator Precedence
Two properties of the algorithm make it suit this problem very well:
The lookup of binding power (precedence and associativity) happens dynamically for each token, allowing the user to define precedence for their own operators.
It's very simple to write by hand[*], and you'll probably have to do that, as such a degree of dynamism is beyond the scope of most (at least all I know of) parser generators.
[*] I've personally written a parser for a very large (lacking only case, multidimensional arrays and perhaps some obscure subtleties) subset of Pascal in 500 lines of Python and 2-3 days of work, the rest is only missing because other parts of the software it's used in were more interesting at the time and I didn't have a reason to implement the rest.
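To give a feel for the idea, here is a very condensed Pratt-style sketch in Python; the tokenizer, the AST tuples, and the default binding power of 15 for the user-defined 'op' are all made up for illustration:

import re

# Binding powers; a user-defined operator such as 'op' is registered here
# when its definition (LHS op RHS : ...) is parsed.
INFIX_BP = {"+": 10, "-": 10, "*": 20, "/": 20, "op": 15}

def tokenize(src):
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", src)

def parse_expr(tokens, pos=0, min_bp=0):
    tok = tokens[pos]
    pos += 1
    left = int(tok) if tok.isdigit() else ("var", tok)     # primary
    while pos < len(tokens):
        op = tokens[pos]
        bp = INFIX_BP.get(op)
        if bp is None or bp < min_bp:                       # not an operator, or it binds too weakly
            break
        pos += 1
        right, pos = parse_expr(tokens, pos, bp + 1)        # the +1 makes it left-associative
        left = (op, left, right)
    return left, pos

ast, _ = parse_expr(tokenize("5 op var + 2"))
print(ast)   # ('+', ('op', 5, ('var', 'var')), 2)

When the parser sees a definition line like LHS op RHS : LHS * RHS, it would add 'op' to INFIX_BP before continuing, so later uses of op are parsed as infix applications with whatever precedence and associativity the language assigns them.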
When you look at the EBNF description of a language, you often see a definition for integers and real numbers:
integer ::= digit digit* // Accepts numbers with a 0 prefix
real ::= integer "." integer (('e'|'E') integer)?
(These definitions were made up on the fly; I have probably made a mistake in them.)
Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition to make it more complete, and is it up to the implementer to realize that they should actually be handled in the scanner?
Many common parser generator tools -- such as ANTLR, Lex/YACC -- separate parsing into two phases: first, the input string is tokenized. Second, the tokens are combined into productions to create a concrete syntax tree.
However, there are alternative techniques that do not require tokenization: check out backtracking recursive-descent parsers. For such a parser, tokens are defined in a similar way to non-tokens. pyparsing is a parser generator for such parsers.
The advantage of the two-step technique is that it usually produces more efficient parsers -- with tokens, there's a lot less string manipulation, string searching, and backtracking.
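As a small illustration of the tokenizing phase, the integer/real definitions above translate directly into lexer rules; this Python sketch uses regular expressions in place of a generated lexer (error handling omitted):

import re

TOKEN_SPEC = [
    ("REAL", r"\d+\.\d+(?:[eE]\d+)?"),   # real ::= integer "." integer (('e'|'E') integer)?
    ("INT",  r"\d+"),                    # integer ::= digit digit*  (REAL must be tried first)
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text) if m.lastgroup != "SKIP"]

print(tokenize("3.14e2 042"))
# [('REAL', '3.14e2'), ('INT', '042')]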
According to "The Definitive ANTLR Reference" (Terence Parr),
The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.
The grammar syntax needs to be complete to be precise, so of course it includes details as to the precise format of identifiers and the spelling of operators.
Yes, the compiler engineer decides but generally it is pretty obvious. You want the lexer to handle all the character-level detail efficiently.
There's a longer answer at Is it a Lexer's Job to Parse Numbers and Strings?
The title is the question: Are the words "lexer" and "parser" synonyms, or are they different? It seems that Wikipedia uses the words interchangeably, but English is not my native language so I can't be sure.
A lexer is used to split the input up into tokens, whereas a parser is used to construct an abstract syntax tree from that sequence of tokens.
Now, you could just say that the tokens are simply characters and use a parser directly, but it is often convenient to have a parser which only needs to look ahead one token to determine what it's going to do next. Therefore, a lexer is usually used to divide up the input into tokens before the parser sees it.
A lexer is usually described using simple regular expression rules which are tested in order. There exist tools such as lex which can generate lexers automatically from such a description.
[0-9]+ Number
[A-Z]+ Identifier
+ Plus
A parser, on the other hand, is typically described by specifying a grammar. Again, there exist tools such as yacc which can generate parsers from such a description.
expr ::= expr Plus expr
| Number
| Identifier
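For illustration, here is a toy Python version of both stages: the lex-style rules rendered as regexes, and the left-recursive expr rule rewritten as iteration so a hand-written parser can build a left-associated tree (names and token shapes are made up):

import re

def lex(text):
    spec = [("Number", r"[0-9]+"), ("Identifier", r"[A-Z]+"), ("Plus", r"\+")]
    master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in spec))
    return [(m.lastgroup, m.group()) for m in master.finditer(text)]

def parse_expr(tokens):
    # expr ::= term (Plus term)*, where term is Number | Identifier
    tree = tokens[0]
    i = 1
    while i + 1 < len(tokens) and tokens[i][0] == "Plus":
        tree = ("Plus", tree, tokens[i + 1])
        i += 2
    return tree

tokens = lex("A+12+B")
print(tokens)              # [('Identifier', 'A'), ('Plus', '+'), ('Number', '12'), ('Plus', '+'), ('Identifier', 'B')]
print(parse_expr(tokens))  # ('Plus', ('Plus', ('Identifier', 'A'), ('Number', '12')), ('Identifier', 'B'))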
No. A lexer breaks up the input stream into "words"; a parser discovers the syntactic structure between such "words". For instance, given the input:
velocity = path / time;
lexer output is:
velocity (identifier)
= (assignment operator)
path (identifier)
/ (binary operator)
time (identifier)
; (statement separator)
and then the parser can establish the following structure:
= (assign)
  lvalue: velocity
  rvalue: result of
    / (division)
      dividend: contents of variable "path"
      divisor: contents of variable "time"
No. A lexer breaks down the source text into tokens, whereas a parser interprets the sequence of tokens appropriately.
They're different.
A lexer takes a stream of input characters as input, and produces tokens (aka "lexemes") as output.
A parser takes tokens (lexemes) as input, and produces (for example) an abstract syntax tree representing statements.
The two are enough alike, however, that quite a few people (especially those who've never written anything like a compiler or interpreter) treat them as the same, or (more often) use "parser" when what they really mean is "lexer".
As far as I know, "lexer" and "parser" are related in meaning but are not exact synonyms. Though many sources use them interchangeably, a lexer (an abbreviation of "lexical analyser") identifies the tokens relevant to the language in the input, while a parser determines whether a stream of tokens meets the grammar of the language under consideration.