What's the difference between this grammar:
...
if_statement : 'if' condition 'then' statement 'else' statement 'end_if';
...
and this:
...
if_statement : IF condition THEN statement ELSE statement END_IF;
...
IF : 'if';
THEN: 'then';
ELSE: 'else';
END_IF: 'end_if';
....
?
If there is any difference, does it affect performance?
Thanks
In addition to Will's answer, it's best to define your lexer tokens explicitly (in your lexer grammar). In case you're mixing them in your parser grammar, it's not always clear in what order the tokens are tokenized by the lexer. When defining them explicitly, they're always tokenized in the order they've been put in the lexer grammar (from top to bottom).
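To make the ordering point concrete, here is a minimal Python sketch of a rule-ordered lexer. It is not ANTLR's implementation; it assumes the usual convention that the longest match wins and that ties go to the rule listed first. The RULES table and the tokenize function are hypothetical names used only for illustration.

```python
import re

# Hypothetical rule table: (token name, regex), listed top to bottom
# the way rules would appear in a lexer grammar.
RULES = [
    ("IF", r"if"),
    ("THEN", r"then"),
    ("ELSE", r"else"),
    ("END_IF", r"end_if"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("WS", r"\s+"),
]


def tokenize(text):
    """Return (name, lexeme) pairs; longest match wins, ties go to the earliest rule."""
    compiled = [(name, re.compile(pattern)) for name, pattern in RULES]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two rules match the same length.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if best_name != "WS":  # whitespace is skipped, not emitted
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

If ID were moved above the keyword rules in this sketch, every keyword would come back as ID, which is the kind of ordering surprise that explicit lexer rules make easier to spot.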
The biggest difference is one that may not matter to you. If your lexer rules are in a lexer grammar, then you can use inheritance to have multiple lexers share a common set of lexical rules.
If you just use strings in your parser rules then you cannot do this. If you never plan to reuse your lexer grammar then this advantage doesn't matter.
Additionally I, and I'm guessing most ANTLR veterans, am more accustomed to finding the lexer rules in the actual lexer grammar rather than mixed in with the parser grammar, so one could argue that readability is improved by putting the rules in the lexer.
Neither approach has any runtime performance impact once the ANTLR parser has been generated.
The only difference is that in your first production rule, the keyword tokens are defined implicitly. There is no run-time performance implication for tokens defined implicitly vs. explicitly.
Yet another difference: when you explicitly define your lexer rules you can access them via the name you gave them (e.g. when you need to check for a specific token type). Otherwise ANTLR will use arbitrary numbers (with a prefix).
Suppose you have a language which allows production like this: optional optional = 42, where first "optional" is a keyword, and the second "optional" is an identifier.
On one hand, I'd like to have a Lex rule like optional { return OPTIONAL; }, which would later be used in YACC like this, for example:
optional : OPTIONAL identifier '=' expression ;
If I then define identifier as, say:
identifier : OPTIONAL | FIXED32 | FIXED64 | ... /* couple dozens of keywords */
| IDENTIFIER ;
It just feels bad... besides, I would need two kinds of identifiers, one for when keywords are allowed as identifiers, and another one for when they aren't...
Is there an idiomatic way to solve this?
Is there an idiomatic way to solve this?
Other than the solution you have already found, no. Semi-reserved keywords are definitely not an expected use case for lex/yacc grammars.
The lemon parser generator has a fallback declaration designed for cases like this, but as far as I know, that useful feature has never been added to bison.
You can use a GLR grammar to avoid having to figure out all the different subsets of identifier. But of course there is a performance penalty.
You've already discovered the most common way of dealing with this in lex/yacc, and, while not pretty, it's not too bad. Normally you call your rule that matches an identifier or (set of) keywords whateverName, and you may have more than one of them -- as different contexts may have different sets of keywords they can accept as a name.
Another way that may work if you have keywords that are only recognized as such in easily identifiable places (such as at the start of a line) is to use a lex start state so as to only return a KEYWORD token if the keyword is in that context. In any other context, the keyword will just be returned as an identifier token. You can even use yacc actions to set the lexer state for somewhat complex contexts, but then you need to be aware of the possible one-token lexer lookahead done by the parser (rules might not run until after the token after the action is already read).
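As a rough illustration of the start-state idea, the following Python sketch treats a word as a keyword only when it is the first token on a line. It is not lex code: the at_line_start flag stands in for a lex start condition, and KEYWORDS, TOKEN_RE and tokenize_line are hypothetical names chosen for this example.

```python
import re

KEYWORDS = {"optional"}  # hypothetical keyword set
TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|\d+|=)")


def tokenize_line(line):
    """Tokenize one line; a keyword counts as a keyword only as the first token."""
    tokens = []
    pos = 0
    at_line_start = True  # plays the role of a lex start condition
    line = line.rstrip()
    while pos < len(line):
        match = TOKEN_RE.match(line, pos)
        if not match:
            raise ValueError(f"unexpected input at column {pos}")
        lexeme = match.group(1)
        if lexeme.isdigit():
            kind = "NUMBER"
        elif lexeme == "=":
            kind = "EQUALS"
        elif at_line_start and lexeme in KEYWORDS:
            kind = "KEYWORD"
        else:
            kind = "IDENTIFIER"
        tokens.append((kind, lexeme))
        at_line_start = False  # leave the "keyword allowed" state
        pos = match.end()
    return tokens
```

With this approach the parser never sees a keyword token in a position where only a name makes sense, so the grammar needs no special identifier rule.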
This is a case where the keywords are not reserved. A few programming languages allowed this: PL/I, FORTRAN. It's not a lexer problem, because the lexer should always know which IDENTIFIERs are keywords. It's a parser problem. It usually causes too much ambiguity in the language specification and parsing becomes a nightmare. The grammar would have this:
identifier : keyword | IDENTIFIER ;
keyword : OPTIONAL | FIXED32 | FIXED64 | ... ;
If you have no conflicts in the grammar, then you are OK. If you have conflicts, then you need a more powerful parser generator, such as LR(k) or GLR.
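The identifier : keyword | IDENTIFIER idea can be sketched as a small hand-written parser in Python. This is only an illustration under simplifying assumptions: the expression is reduced to a single NUMBER, the keyword list is cut down to three entries, and lex, parse_identifier and parse_optional are hypothetical helper names.

```python
import re

# The lexer always reports keywords as keywords; the parser decides
# where a keyword may also serve as a name.
KEYWORDS = {"optional": "OPTIONAL", "fixed32": "FIXED32", "fixed64": "FIXED64"}
TOKEN_RE = re.compile(r"\s*(?:([A-Za-z_]\w*)|(\d+)|(=))")


def lex(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"bad input at column {pos}")
        word, number, equals = m.groups()
        if word is not None:
            tokens.append((KEYWORDS.get(word, "IDENTIFIER"), word))
        elif number is not None:
            tokens.append(("NUMBER", number))
        else:
            tokens.append(("=", equals))
        pos = m.end()
    return tokens


def parse_identifier(tokens, pos):
    """identifier : keyword | IDENTIFIER ;"""
    kind, lexeme = tokens[pos]
    if kind == "IDENTIFIER" or kind in KEYWORDS.values():
        return lexeme, pos + 1
    raise SyntaxError(f"expected a name, got {kind}")


def parse_optional(tokens):
    """optional : OPTIONAL identifier '=' NUMBER ;  (expression simplified to NUMBER)"""
    if len(tokens) != 4 or tokens[0][0] != "OPTIONAL":
        raise SyntaxError("expected: optional <name> = <number>")
    name, pos = parse_identifier(tokens, 1)
    if tokens[pos][0] != "=" or tokens[pos + 1][0] != "NUMBER":
        raise SyntaxError("expected '=' followed by a number")
    return {"name": name, "value": int(tokens[pos + 1][1])}
```

Here parse_optional(lex("optional optional = 42")) accepts the second "optional" as a name because the parser, not the lexer, decides what is allowed in that position.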
In Python, the word 'in' is an operator in an expression such as 1 in [1,2,3]. But in the statement for i in range(10), it is a keyword of the 'for' statement. I wrote a lexer based on regular expressions. I use the rule (\+|-|\*|/|is|in) to match operators and (for|in|if|elif|else) for keywords. I don't know whether I should put 'in' in the operator rule or the keyword rule; either way it loses one meaning. It seems that I should resolve this during parsing, but I still need to give 'in' a label when tokenizing. What should I do?
Call it "token_in" :) It's usually better not to categorize in your lexer; the parser is responsible for analyzing the syntactic purpose of a token.
In any case, I don't see the point of the lexer producing a single token type for different keywords. if and else are syntactically distinct tokens, and the parser wants to know that it is seeing an if; the fact that it is presented with a "keyword" is not particularly useful to it.
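A minimal Python sketch of that advice follows: each reserved word gets its own token type, so 'in' is simply IN in both contexts and the parser decides what it means. The WORD_TOKENS set, the NAME and NUMBER type names, and the tokenize function are assumptions made for this example, not part of the asker's lexer.

```python
import re

# Every reserved word is its own token type; there is no shared
# "operator" or "keyword" category.
WORD_TOKENS = {"for", "in", "if", "elif", "else", "is"}
TOKEN_RE = re.compile(r"\s*(?:([A-Za-z_]\w*)|(\d+)|(.))")


def tokenize(source):
    """Return (type, lexeme) pairs; punctuation uses its own text as the type."""
    tokens = []
    pos = 0
    source = source.rstrip()
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            raise ValueError(f"unexpected input at column {pos}")
        word, number, symbol = m.groups()
        if word is not None:
            kind = word.upper() if word in WORD_TOKENS else "NAME"
            tokens.append((kind, word))
        elif number is not None:
            tokens.append(("NUMBER", number))
        else:
            tokens.append((symbol, symbol))
        pos = m.end()
    return tokens
```

Both "1 in [1,2,3]" and "for i in range(10)" yield an IN token; only the grammar rule that consumes it differs.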
I wrote a parser that analyzes code and reduces it much as lex and yacc would.
I am wondering about one aspect of the matter. If I have a set of rules such as the following:
unary: IDENTIFIER
| IDENTIFIER '(' expr_list ')'
The first rule with just an IDENTIFIER can be reduced as soon as an identifier is found. However, the second rule can only be reduced if the input also includes a valid list of expressions written between parentheses.
How is a parser expected to work in a case like this one?
If I reduce the first identifier immediately, I can keep the result and throw it away if I realize that the second rule does match. If the second rule does not match, then I can return the result of the early reduction.
This also means that both reduction functions are going to be called if the second rule applies.
Are we instead expected to hold off on the early reduction and apply it only if the second, longer rule does not apply?
For those interested, I put a more complete version of my parser grammar in this answer: https://codereview.stackexchange.com/questions/41769/numeric-expression-parser-calculator/41786#41786
Bottom-up parsers (like bison and yacc) do not reduce until they reach the end of the production. They don't have to guess which reduction they will use until they need it. So it's really not an issue to have two productions with the same prefix. In this sense, the algorithm is radically different from a top-down algorithm, used in recursive descent parsing, for example.
In order for the fragment which you provide to be parseable by an LALR(1) parser-generator -- that is, a bottom-up parser with the ability to examine only one (1) token beyond the end of a production -- the grammar must be such that there is no place in which unary might be followed by (. As long as that is true, the fact that the parser can see a ( is sufficient to prevent the unit reduction unary: IDENTIFIER from happening in a context in which it should reduce with the other unary production.
(That's a bit of an oversimplification, but I don't think that it would be correct to reproduce a standard text on LALR parsing here on SO.)
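The lookahead decision can be illustrated with a tiny hand-written shift-reduce loop in Python. This is not the table-driven automaton that bison generates; it is a sketch for a simplified, hypothetical grammar, unary : ID | ID '(' unary ')', in which the argument list is reduced to a single unary.

```python
def parse(tokens):
    """Shift-reduce over tokens such as ["ID", "(", "ID", ")"].

    Returns the list of reductions performed, in order.
    """
    stack = []
    reductions = []
    pos = 0
    while True:
        lookahead = tokens[pos] if pos < len(tokens) else "$end"
        if stack[-4:] == ["ID", "(", "unary", ")"]:
            # The whole right-hand side is on the stack: reduce the long rule.
            stack[-4:] = ["unary"]
            reductions.append("unary: ID ( unary )")
        elif stack[-1:] == ["ID"] and lookahead != "(":
            # One token of lookahead rules out the call form: reduce the short rule.
            stack[-1:] = ["unary"]
            reductions.append("unary: ID")
        elif lookahead != "$end":
            stack.append(lookahead)  # shift
            pos += 1
        else:
            break
    if stack != ["unary"]:
        raise SyntaxError(f"could not reduce input, stack is {stack}")
    return reductions
```

For the input ID ( ID ), the outer identifier is never reduced by the short rule; it stays on the stack until the closing parenthesis arrives, so only one reduction action runs for it.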
Can I write a parser for Communicating Sequential Processes (CSP) in ANTLR? I think it uses left recursion, as in the statement
VMS = (coin → (choc → VMS))
complete language specification can be found at CSPM : A Reference Manua
So it is not an LL grammar. Am I right?
In general, even if you have a grammar with left recursion, you can refactor the grammar to remove it. So ANTLR is reasonably likely to be able to process your grammar. There's no a priori reason you can't write a CSP grammar for ANTLR.
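As a small illustration of that refactoring, take a hypothetical left-recursive rule that is not from the CSP grammar: expr : expr '-' NUM | NUM. It can be rewritten as expr : NUM ('-' NUM)*, and the repetition becomes a loop in a hand-written parser. The Python sketch below assumes the input is already a list of integers and '-' strings.

```python
def parse_expr(tokens):
    """expr : NUM ('-' NUM)* ;  replaces the left-recursive expr : expr '-' NUM | NUM ;"""
    if not tokens or not isinstance(tokens[0], int):
        raise SyntaxError("expected a number")
    value = tokens[0]
    pos = 1
    while pos < len(tokens):
        if (
            tokens[pos] != "-"
            or pos + 1 >= len(tokens)
            or not isinstance(tokens[pos + 1], int)
        ):
            raise SyntaxError(f"expected '-' NUM at token {pos}")
        # Folding as we go keeps the left associativity of the original rule.
        value = value - tokens[pos + 1]
        pos += 2
    return value
```

Evaluating [10, "-", 3, "-", 2] this way gives (10 - 3) - 2, the same grouping the left-recursive rule would have produced.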
Whether the one you have is suitable is another question.
If your quoted phrase is a grammar rule, it doesn't have left recursion. (If it is, I don't understand the syntax of your grammar rules, especially why the parentheses [terminals?] would be unbalanced; that's pretty untraditional.)
So ANTLR should be able to process it, modulo converting it to ANTLR grammar rule syntax.
You didn't show the rest of the grammar, so one can't have an opinion about the rest of it.
The case above does not have left recursion. It would look something like the following. Note that this is a simplified version; CSP is much more complicated. I'm just showing that it is possible.
assignment : PROCNAME EQ process
;
process : LPAREN EVENT ARROW process RPAREN
| PROCNAME
;
Besides, you can factor out left recursion with ANTLRWorks 'Remove Left Recursion' function.
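To show that the simplified grammar above is parseable top-down, here is a minimal recursive-descent sketch in Python rather than ANTLR. The token patterns are assumptions: PROCNAME is taken to be an upper-case name and EVENT a lower-case one, and both -> and → are accepted as ARROW. The helper names lex, expect, parse_process and parse_assignment are hypothetical.

```python
import re

TOKEN_SPEC = [
    ("ARROW", r"->|→"),
    ("EQ", r"="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("PROCNAME", r"[A-Z][A-Z0-9_]*"),  # assumed: upper-case names
    ("EVENT", r"[a-z][a-z0-9_]*"),     # assumed: lower-case names
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))


def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character {text[pos]!r}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def expect(tokens, pos, kind):
    if pos >= len(tokens) or tokens[pos][0] != kind:
        raise SyntaxError(f"expected {kind} at token {pos}")
    return tokens[pos][1], pos + 1


def parse_process(tokens, pos):
    """process : LPAREN EVENT ARROW process RPAREN | PROCNAME ;"""
    if pos < len(tokens) and tokens[pos][0] == "PROCNAME":
        return tokens[pos][1], pos + 1
    _, pos = expect(tokens, pos, "LPAREN")
    event, pos = expect(tokens, pos, "EVENT")
    _, pos = expect(tokens, pos, "ARROW")
    body, pos = parse_process(tokens, pos)  # recursion sits after consumed tokens
    _, pos = expect(tokens, pos, "RPAREN")
    return (event, body), pos


def parse_assignment(text):
    """assignment : PROCNAME EQ process ;"""
    tokens = lex(text)
    name, pos = expect(tokens, 0, "PROCNAME")
    _, pos = expect(tokens, pos, "EQ")
    body, pos = parse_process(tokens, pos)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return name, body
```

The recursive call in parse_process only happens after LPAREN, EVENT and ARROW have been consumed, which is why the rule is right-recursive rather than left-recursive.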
A CSP grammar is definitely possible in ANTLR; check http://www.carnap.info/ for an example.
I understand the theory behind separating parser rules and lexer rules, but what are the practical differences between these two statements in ANTLR:
my_rule: ... ;
MY_RULE: ... ;
Do they result in different AST trees? Different performance? Potential ambiguities?
... what are the practical differences between these two statements in ANTLR ...
MY_RULE will be used to tokenize your input source. It represents a fundamental building block of your language.
my_rule is invoked by the parser; it consists of zero or more other parser rules or tokens produced by the lexer.
That's the difference.
Do they result in different AST trees? Different performance? ...
The parser builds the AST using tokens produced by the lexer, so the questions make no sense (to me). A lexer merely "feeds" the parser a one-dimensional stream of tokens.
This post may be helpful:
The lexer is responsible for the first step, and its only job is to
create a "token stream" from text. It is not responsible for
understanding the semantics of your language, it is only interested in
understanding the syntax of your language.
For example, syntax is the rule that an identifier must only use
characters, numbers and underscores - as long as it doesn't start with
a number. The responsibility of the lexer is to understand this rule.
In this case, the lexer would accept the sequence of characters
"asd_123" but reject the characters "12dsadsa" (assuming that there
isn't another rule in which this text is valid). When seeing the valid
text example, it may emit a token into the token stream such as
IDENTIFIER(asd_123).
Note that I said "identifier" which is the general term for things
like variable names, function names, namespace names, etc. The parser
would be the thing that would understand the context in which that
identifier appears, so that it would then further specify that token
as being a certain thing's name.
(sidenote: the token is just a unique name given to an element of the
token stream. The lexeme is the text that the token was matched from.
I write the lexeme in parentheses next to the token. For example,
NUMBER(123). In this case, this is a NUMBER token with a lexeme of
'123'. However, with some tokens, such as operators, I omit the lexeme
since it's redundant. For example, I would write SEMICOLON for the
semicolon token, not SEMICOLON( ; )).
From ANTLR - When to use Parser Rules vs Lexer Rules?
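The identifier example in the quoted post can be sketched in a few lines of Python. This is only an illustration of the rule as described (letters, digits and underscores, not starting with a digit); the pattern and the lex_identifier name are assumptions, not code from the quoted post.

```python
import re

# Letters, digits and underscores, but the first character may not be a digit.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def lex_identifier(text):
    """Return an IDENTIFIER token for valid text, or None if the text is rejected."""
    if IDENTIFIER.fullmatch(text):
        return ("IDENTIFIER", text)
    return None
```

The token type is IDENTIFIER and the lexeme is the matched text, mirroring the IDENTIFIER(asd_123) notation used in the quote.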