PREDICT set for a LL(1) parser help me fix it - parsing

I am trying to build a compiler for a college projec. Currently I am having problems to create a part of my predict set for the LL(1) grammar. This is the part of my grammar that I am having problems with:
<DECLS> ::- id <DECL> <N_DECL>
<N_DECL> ::- op_, <DECLS> | null
<DECL> ::- <SOMETHING_MORE>
The first sets for this part of the grammar are:
<DECLS> = id
<N_DECL> = op_ , null
The follow sets for this part of the grammar are:
<DECLS> = op_; , op_,
<N_DECL> = op_; , op_,
The predict set for this part of the grammar is (see as a matrix):
id op_,
<DECLS> id <DECL> <N_DECL>
<N_DECL> op_, <DECLS> / null <--- here
As you can see the entry for N_DECL and op_ has two entries. As far as I know, a grammar is wrong if the predict set has two or more entries in the same box. So I was wondering if someone can help me changing my grammar to solve this problem.

If I understand your notation correctly, then
FIRST(<N_DECL>) = { op_, , null }
FOLLOW(<N_DECL>) = { op_; , op_, }
I don't see where op_, in <N_DECL>'s FOLLOW set comes from (perhaps it's coming from a different part of your grammar), but if it's really there, your grammar would be ambiguous due to a FIRST/FOLLOW conflict. These occur whenever the FIRST set of a non-terminal contain null (i.e. the non-terminal can produce the empty string), and the FIRST and FOLLOW set of that non-terminal overlap. In this case, they both contain op_,, which causes problems as you have noticed.
From the fact that you list op_, in <DECLS>'s FOLLOW set, I deduce that the actual ambiguity arises from a part of the grammar that you didn't include in your question. If you're parsing a <DECLS> with a lookahead of one token, encountering a comma may mean that you have another op_, id <DECL> <N_DECL> phrase, or that that comma belongs to a structure where the comma is "outside" the <DECLS>.
It might be that your language is LL(1), but without knowing why <DECLS> can be followed by op_,, I wouldn't know how to resolve this conflict. However, if it turns out that your language isn't LL(1) as it is tokenised now, you should consider tricking the tokeniser to do the lookahead for you:
Use substitution to arrive at <N_DECL> ::- op_, id <DECL> <N_DECL> | null.
Let the lexer parse op_, id as its own token comma_id.
Correct the production: <N_DECL> ::- comma_id <DECL> <N_DECL> | null.

Related

ANTLR Tries to Match an Expression That Wasn't Specified as an Option

I'm trying to understand how ANTLR grammars work and I've come across a situation where it behaves unexpectedly and I can't explain why or figure out how to fix it.
Here's the example:
root : title '\n' fields EOF;
title : STR;
fields : field_1 field_2;
field_1 : 'a' | 'b' | 'c';
field_2 : 'd' | 'e' | 'f';
STR : [a-z]+;
There are two parts:
A title that is a lowercase string with no special characters
A two character string representing a set of possible configurations
When I go to test the grammar, here's what happens: first I write the title and, on a new line, give the character for the first field. So far so good. The parse tree looks as I would expect up to this point.
When I add the next field is when the problem comes up. ANTLR decides to reinterpret the line as an instance of STR instead of a concatenation of the fields that I was expecting.
I do not understand why ANTLR tries to force an unrelated terminal expression when it wasn't specified as an option by the grammar. Shouldn't it know to only look for characters matching the field rules since it is descended from the fields node in the parse tree? What's going on here and how do I write my ANTLR grammars so they don't have this problem?
I've read that ANTLR tries to match the format greedily from the top of the grammar to the bottom, but this doesn't explain why this is happening because the STR terminal is the very last line in the file. If ANTLR gives special precedence to matching terminals, how do I format the grammar so that it interprets it properly? As far as I understand, regexes do not work for non-terminals so it seems that have to define it how it is now.
A note of clarification: this is just an example of a possible grammar that I'm trying to make work with the text format as is, so I'm not looking for answers like adding a space between the fields or changing the title to be uppercase.
What I didn't understand before is that there are two steps in generating a parser:
Scanning the input for a list of tokens using the lexer rules (uppercase statements) and then...
Constructing a parse tree using the parser rules (lowercase statements) and generated tokens
My problem was that there was no way for ANTLR to know I wanted that specific string to be interpreted differently when it was generating the tokens. To fix this problem, I wrote a new lexer rule for the fields string so that it would be identifiable as a token. The key was making the FIELDS rule appear before the STR rule because ANTLR checks them in the order they appear.
root : title FIELDS EOF;
title : STR;
FIELDS : [a-c] [d-f];
STR : [a-z]+;
Note: I had to bite the bullet and read the ANTLR Mega Tutorial to figure this out.

Make lexer consider parser before determining tokens?

I'm writing a lexer and parser in ocamllex and ocamlyacc as follows. function_name and table_name are same regular expression, i.e., a string containing only english alphabets. The only way to determine if a string is function_name or table_name is to check its surroundings. For example, if such a string is surrounded by [ and ], then we know that it is a table_name. Here is the current code:
In lexer.mll,
... ...
let function_name = ['a'-'z' 'A'-'Z']+
let table_name = ['a'-'z' 'A'-'Z']+
rule token = parse
| function_name as s { FUNCTIONNAME s }
| table_name as s { TABLENAME s }
... ...
In parser.mly:
... ...
main:
| LBRACKET TABLENAME RBRACKET { Table $2 }
... ...
As I wrote | function_name as s { FUNCTIONNAME s } before | table_name as s { TABLENAME s }, the above code failed to parse [haha]; it firstly considered haha as a function_name in the lexer, then it could not find any corresponding rule for it in the parser. If it could consider haha as a table_name in the lexer, it would match [haha] as a table in the parser.
One workaround for this is to be more precise in the lexer. For example, we define let table_name_with_brackets = '[' ['a'-'z' 'A'-'Z']+ ']' and | table_name_with_brackets as s { TABLENAMEWITHBRACKETS s } in the lexer. But, I would like to know if there is any other options. Is it not possible to make lexer and parser work together to determine the tokens and the reduction?
You should avoid trying to get the lexer to do the parser's work. The lexer should just identify lexemes; it should not try to figured out where a lexeme fits into the syntax. So in your (simplified) example, there should be only one lexical type, name. The parser will figure it out from there.
But it seems, from the comments, that in the unsimplified original, the two patterns are overlapping rather than identical. That's more annoying, although it's only slightly more complicated. Basically, you need to separate out the common pattern as one lexical type, and then add the additional matches as one or two other lexical types (depending on whether or not one pattern is a strict superset of the other).
That might not be too difficult, depending on the precise relationship between the two patterns. You might be able to find a very simple solution by writing the patterns in the correct order, for example, because of the longest match rule:
If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.
Most of the time, that's all it takes: first define the intersection of the two patterns as a based lexeme, and then add the full lexical patterns of each contextual type to provide additional matches. Your parser will then have to match name | function_name in one context and name | table_name in the other context. But that's not too bad.
Where it will fail is when an input stream cannot be unambiguously divided in lexemes. For example, suppose that in a function context, a name could include a ? character, but in a table context the ? is a valid postscript operator. In that case, you have to actively prevent foo? from being analysed as a single token in the table context, which means that the lexer does have to be aware of parser context.

Lexer rule optional suffix not matching, when it should match

Using ANTLR 3, my lexer has rule
SELECT_ASSIGN:
'SELECT' WS+ IDENTIFIER WS+ 'ASSIGN' WS+ (('TO'|'USING') WS+)?
using this these match correctly
SELECT VAR1 ASSIGN TO
SELECT VAR1 ASSIGN USING
and this also matches
SELECT VAR1 ASSIGN FOO
However this does not match
SELECT VAR1 ASSIGN TWO
Whereas I have marked TO|USING as optional in the rule.
From generated Java code I see...
When lexer notices T of TWO, it goes to match('TO')
but since does not find O after T
then generates failure.... and returns all the way from the rule -- hence not matching it.
How do I get my lexer rule to match, when input has word with chars starting with suffixed optional part of the rule
Basically I want my rule to match this also (beside what it already matches - as lised at the start):
SELECT VAR1 ASSIGN TWO
Kindly suggest how I approach/resolve this situation.
NOTE:
Such rules are recommended in the parser - But I have this in lexer - because I do not want to parse the entire input by the parser, and want to parse only content of interest. So using such rules in lexer, I locate sections which I really want to parse by the parser.
UPDATE 1
I could circumvent this problem by making 2 rules, like so:
SELECT_ASSIGN_USING_TO
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN' WS+ ('USING'|'TO')
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
But is it possible to do the desired in one lexer rule?
An approach to get this in one rule, suggested by my senior - use syntactic predicate
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
(
(WS+ ('TO'|'USING') WS+)=> (WS+ ('TO'|'USING') WS+)
| (WS+)
)
Tokens match a complete char sequence or none. It cannot match partially and the grammar rule determines which exactly. You cannot expect a rule for TO to match TWO. If you want TWO to match too you have to add it to your lexer rule.
A few notes here:
The solution your "senior" gave you makes no sense at all. A
syntactic predicate is a kinda lookahead to guide the parser in case
of ambiquities. There are no ambiquities involved here.
Writing
the entire SELECT_ASSIGN rule as a lexer rule is very uncommon and
not flexible. A lexer rule should not be used for entire sentences,
but only for a small set of characters to find tokens to assign them
a type (usually elementary structures of a language like string,
number, comment etc.).
ANTLR3 is totally outdated and I wonder why this is still used in your class. ANTLR4 is out since 5 years and should be the choice for any new project.

Removing hidden ambiguity in grammar using left factoring

I am trying to reduce the grammar to LL(1) for a hypothetical language we created. I have removed most of the left factoring issues in the grammar, using the general rule of introducing new non-terminal characters for the same.
For example,
<assignmentstats>------- <type><id>=<E>/ id=<E>/id'['<E>']'=<E>/
is transformed to
<assignmentstats>------- <type><id>=<E>/ id<LF3>
<LF3>------- =<E>/[<E>]=<E>
After applying such rules for all the required statements, I expected an unambiguous grammar.
But the following statement is somehow still ambiguous.
<body>------- <forrelatedstuff>/ <floors>/<stats>
Here are the related statements of the grammar(only those required has been shown)
<funcbody>----- {<stats>}
<stats>----- <stat> <stats>/ #
<floors>-------- <floor><floors>/ #
<floor>------- floo <id><arr>{stats}/id<arr>{<stats>}
<stat>----- <superstats>/<returnstats>
<superstats>----- <type>id<Zip>/id<LF3>
<building> ------- build <id> {<body>}
<returnstats>--------- return <E>
I looked more into it , and found that the ambiguity is between 'floors' and 'stats'. The FIRST('stats') and FIRST('floors'), i.e. the set of first terminal characters contain 'id' and '}' for both.
I can see why this would be a problem and how left factoring could solve this.But how can I remove such kind of ambiguity through left factoring?
Note: Here,
'id' denotes an identifier.
'#' denotes epsilon.

How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?

If someone would clear my mind from the confusion behind look-ahead relation to tokenizing involving greery/non-greedy matching i'd be more than glad. Be ware this is a slightly long post because it's following my thought process behind.
I'm trying to write antlr3 grammar that allows me to match input such as:
"identifierkeyword"
I came up with a grammar like so in Antlr 3.4:
KEYWORD: 'keyword' ;
IDENTIFIER
:
(options {greedy=false;}: (LOWCHAR|HIGHCHAR))+
;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
parse: IDENTIFIER KEYWORD EOF;
however it complains about it can never match IDENTIFIER this way, which i don't really understand. (The following alternatives can never be matched: 1)
Basically I was trying to specify for the lexer that try to match (LOWCHAR|HIGHCHAR) non-greedy way so it stops at KEYWORD lookahead. What i've read so far about ANTLR lexers that there supposed to be some kind of precedence of the lexer rules. If i specify KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after shouldn't be able to match the consumed characters.
After some searching I understand that problem here is that it can't tokenize the input the right way because for example for input: "identifierkeyword" the "identifier" part comes first so it decides to start matching the IDENTIFIER rule when there is no KEYWORD tokens matched yet.
Then I tried to write the same grammar in ANTLR 4, to test if the new run-ahead capabilities can match what i want, it looks like this:
KEYWORD: 'keyword' ;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
IDENTIFIER
:
(LOWCHAR|HIGHCHAR)+?
;
parse: IDENTIFIER KEYWORD EOF;
for the input: "identifierkeyword" it produces this error:
line 1:1 mismatched input 'd' expecting 'keyword'
it matches character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token which he doesn't get this way.
Isn't the non-greedy matching for the lexer supposed to match till any other possibility is available in the look ahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?
I'm really confused about this, I have watched the video where Terence Parr introduces the new capabilities of ANTLR4 where he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for Lexer rules too, where a possible right solution for tokenizing input "identifierkeyword" is matching IDENTIFIER: "identifier" and matching KEYWORD: "keyword"
I think I have lots of wrongs in my head about non-greedy/greedy matching. Could somebody please explain me how it works?
After all this I've found a similar question here: ANTLR trying to match token within longer token and made a grammar corresponding to that:
parse
:
identifier 'keyword'
;
identifier
:
(HIGHCHAR | LOWCHAR)+
;
/** lowercase letters */
LOWCHAR
: 'a'..'z';
/** uppercase letters */
HIGHCHAR
: 'A'..'Z';
This does what I want now, however I can't see why I can't change the identifier rule to a Lexer rule and LOWCHAR and HIGHCHAR to fragments.
A Lexer doesn't know that letters in "keyword" can be matched as an identifier? or vice versa? Or maybe it is that rules are only defined to have a lookahead inside themselves, not all possible matching syntaxes?
The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER to match a single input character, and then create a parser rule to handle sequences of these characters.
identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;
This would cause the lexer to skip the input identifier as 10 separate characters, and then read keyword as a single KEYWORD token.
The behavior you observed in ANTLR 4 using the non-greedy operator +? is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR) blocks as possible while still creating an IDENTIFIER token". Clearly the fewest number to create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER to match a single character. The reason the parse rule failed to handle this is it only allows a single IDENTIFIER token to appear before the KEYWORD token. By creating a parser rule identifier like I showed above, the parser would be able to treat sequences of IDENTIFIER tokens (which are each a single character), as a single identifier.
Edit: The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is the static analysis has determined that the positive closure in the rule IDENTIFIER will never match more than 1 character because the rule will always be successful with exactly 1 character.

Resources