I'm currently writing a small parser in Erlang, using yecc, and have encountered some problems. The problems occur when I'm parsing rules with 'lbrack' in them. The following rule
is an illustration of my problem:
program -> 'char' 'ident' 'lbrack' 'int_constant' 'rbrack' 'semi'
It compiles ok, but when I'm trying to parse the following tokens:
[{char,1},
{ident,1,1,t},
{lbrack,1},
{int_constant,1,10},
{rbrack,1},
{semi,1}]
the parser crashes with
{error,
{1,parser,["syntax error before: ","lbrack"]}}
I tried with the following yecc file, yt.yrl:
Nonterminals
program.
Terminals
char ident lbrack int_constant rbrack semi.
Rootsymbol
program.
program -> 'char' 'ident' 'lbrack' 'int_constant' 'rbrack' 'semi'.
with your input and it worked fine. It returned '$undefined' rather than a useful value, but that is as it should be, because the rule in my example has no action that returns anything. Note that none of your terminal symbols need to be quoted, as they are just normal atoms with "ordinary" names.
I'm a newbie to ANTLR4 and am trying to make a parser. However, I am stuck on the most basic step of parsing a single word.
My grammar looks like:
grammar test;
program : WORD EOF;
WORD : 'test';
And my input file looks like:
test
The input file is only one line, has no trailing spaces. If I open it in a hex-editor it shows as 5 bytes: test<EOF>.
From what I understand of ANTLR, this rule should match a WORD token, then an EOF token, and then stop parsing. However, I get "line 1:4 missing 'test' at '<EOF>'" when parsing the file.
I've worked with lex/yacc before and have not encountered an error like this. I understand that ANTLR works differently, so I am curious why I am encountering this error.
My apologies for the bad title, but I couldn't express it in better words.
I'm writing a parser using ANTLR to calculate complexities in Dart code.
Things seemed to work fine until I tried to parse a file with the following method signature:
Stream<SomeState> mapEventToState(SomeEvent event,) async* {
//someCode to map the State to Event
}
Here, mapEventToState(SomeEvent event,) causes a problem because of the trailing comma.
The parser reports two parameters because of the trailing comma (in reality there is just one) and pulls part of the following code into the parameter list, which makes the rest of the code unreadable for ANTLR.
It is normal in Flutter to end a list of parameters with a comma.
The grammar corresponding to it is:
initializedVariableDeclaration
: declaredIdentifier ('=' expression)? (','initializedIdentifier)*
;
initializedIdentifier
: identifier ('=' expression)?
;
initializedIdentifierList
: initializedIdentifier (',' initializedIdentifier)*
;
The full grammar can be checked at https://github.com/antlr/grammars-v4/blob/master/dart2/Dart2.g4
What should I change in the grammar so that I don't face this issue and the parser understands that functionName(Param param1, Param param2,) is the same as functionName(Param param1, Param param2)?
The Dart project maintains a reference ANTLR grammar for the Dart language (mostly as a tool for ourselves, to ensure new language features can be parsed).
It might be useful as a reference.
The "dart2" grammar you are linking to in the ANTLR repository is probably severely outdated. It was not created by a Dart team member, and if it doesn't handle trailing commas in argument lists, it was probably never complete for Dart 2.0. Use with caution.
I do not believe that the rule you mentioned (initializedVariableDeclaration) is the grammar corresponding to the problem. That's for an ordinary variable declaration (with an initializer).
I believe you actually want to change formalParameterList. The Dart grammar is provided by the language specification, and we can compare the grammar listed there to the grammar from the ANTLR repository.
The ANTLR file has:
formalParameterList
: '(' ')'
| '(' normalFormalParameters ')'
...
whereas the Dart 2.10 specification has, from section 9.2 (Formal Parameters):
<formalParameterList> ::= ‘(’ ‘)’
| ‘(’ <normalFormalParameters> ‘,’? ‘)’
...
You should file an issue against ANTLR or create a pull request to fix it.
That file also does not appear to have been substantially updated since May 2019 and seems to be missing some notable changes to the Dart language since that time (e.g. spread collections (spreadElement), collection-if (ifElement), and collection-for (forElement) from Dart 2.3, and the changes for null safety).
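The effect of the optional trailing comma in the specification's rule can be illustrated with a small hand-written parser. This is only a sketch in Python, not ANTLR output and not the Dart team's code: the function name parse_formal_parameter_list is hypothetical, and each parameter is assumed to arrive as a single pre-lexed token, which a real Dart grammar would not do.

```python
def parse_formal_parameter_list(tokens):
    """Parse '(' ')' or '(' param (',' param)* ','? ')' and return the params.

    Assumption for this sketch: every token other than '(', ')' and ','
    stands for one complete parameter.
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} but found {peek()!r}")
        pos += 1

    def parameter():
        nonlocal pos
        tok = peek()
        if tok is None or tok in ("(", ")", ","):
            raise SyntaxError(f"expected a parameter but found {tok!r}")
        pos += 1
        return tok

    expect("(")
    params = []
    if peek() != ")":
        params.append(parameter())
        while peek() == ",":
            pos += 1              # consume the comma
            if peek() == ")":     # trailing comma: the list ends here
                break
            params.append(parameter())
    expect(")")
    return params
```

With this shape, ["(", "SomeEvent event", ",", ")"] yields one parameter, while a comma that is followed by another parameter continues the list as before.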
I am currently trying to improve and fix bugs in an existing grammar that someone else created.
We have our own language, for which we have created an editor. We are using the Eclipse IDE.
Some examples from the grammar:
calc : choice INTEGER INTEGER
choice : add|sub|div|mul
INTEGER : ('0'..'9')+
In my editor, if I type
calc add 2 aaa
ANTLR's error handling recognizes this as an error, since it expects an integer and we typed a string, and it reports an error message such as
extraneous input 'aaa' expecting {'{', INTEGER}
(I have a class that extends BaseErrorListener, where I create markers for these errors.)
Similarly, I have such grammar defined for my editor.
Now the question is: in all these cases it identifies that something is wrong in the syntax and reports errors, but what about input that is not part of the grammar at all?
If I type any garbage value such as
abc add 2 3
or
just_type_junk_in_editor
it does not report any error, since 'abc' or 'just_type_junk_in_editor' is not in my grammar.
So is there a way to make ANTLR report words that are not part of the grammar as errors?
Without having seen the full grammar, I think your problem is the missing EOF token in your main rule. ANTLR4 consumes as much input as it can match, but once the main rule is satisfied it ignores whatever input remains, which explains why you don't see an error. By adding EOF you tell ANTLR4 that all input must be matched:
calc: choice INTEGER INTEGER EOF;
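The difference can be imitated with a small hand-written matcher. This is a Python analogy under stated assumptions, not ANTLR's actual algorithm: the start rule is assumed to be a hypothetical "program: calc*" (zero or more statements), each statement is assumed to begin with the word calc as in the editor input above, and parse_program and require_eof are names made up for the sketch.

```python
OPS = {"add", "sub", "div", "mul"}


def parse_program(tokens, require_eof):
    """Match zero or more 'calc <choice> INTEGER INTEGER' statements.

    Returns the number of tokens consumed. With require_eof=False the
    matcher simply stops at the first token that cannot start a statement,
    mimicking a start rule that does not end in EOF.
    """
    pos = 0
    while pos < len(tokens) and tokens[pos] == "calc":
        stmt = tokens[pos:pos + 4]
        ok = (len(stmt) == 4 and stmt[1] in OPS
              and stmt[2].isdigit() and stmt[3].isdigit())
        if not ok:
            raise SyntaxError(f"bad calc statement starting at token {pos}")
        pos += 4
    if require_eof and pos != len(tokens):
        # Leftover input is only an error when the rule demands end of input.
        raise SyntaxError(f"extraneous input {tokens[pos]!r}, expecting <EOF>")
    return pos
```

Here parse_program("abc add 2 3".split(), require_eof=False) consumes nothing and raises nothing, whereas the same call with require_eof=True reports the junk as an error.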
My Lexer is supposed to distinguish brackets and maintain a stack of opened brackets during lexing. For this I specified a helper function in my fsl file like this:
let updateBracketStack sign = // whenever a bracket is parsed, update the stack accordingly
match sign with
| '[' -> push sign
| '{' -> push sign
| ']' -> if top() = '[' then pop() else ()
| '}' -> if top() = '{' then pop() else ()
| _ -> ()
The stack of course is a ref of char list. And push, top, pop are implemented accordingly.
The problem is that everything worked up until I added the { character. Now FsLex simply dies with error: parse error
If I change the characters to strings, i.e. write "{" FsLex is fine again, so a workaround would be to change the implementation to a stack of strings instead of characters.
My question, however, is where this behaviour comes from. Is this a bug in FsLex?
FsLex's own parser is generated using FsLexYacc. The message "parse error" means that lexing of your .fsl file succeeded up to the error position but parsing failed at that position. To find the root cause you will need to post the full input text you gave to FsLex.
This is only a guess: FsLex could be confused by the '{' character, since it is also the opening token for an embedded code block. Or perhaps your input text contains some special Unicode characters that look like whitespace in the editor.
One possible workaround is to create another module in its own .fs file (a LexHelper module in LexHelper.fs), place your helper functions in it, and open it from the .fsl file.
EDIT
Looking at the source code of FsLexYacc, it doesn't handle a } character enclosed in single quotes in embedded F# code, but it does when the character is enclosed in double quotes.
https://github.com/fsprojects/FsLexYacc/blob/master/src/FsLex/fslexlex.fsl
I'm trying to model the EBNF expression
("declare" "namespace" ";")* ("declare" "variable" ";")*
I have built a yacc grammar (I'm using MPPG) that seems to represent this, but it fails to match my test expression.
The test case I'm trying to match is
declare variable;
The token stream from the lexer is
KW_Declare
KW_Variable
Separator
The parser generator reports a "Shift/Reduce conflict, state 6 on KW_Declare". I have attempted to resolve this with "%left PrologHeaderList PrologBodyList", but that does not work.
Program : Prolog;
Prolog : PrologHeaderList PrologBodyList;
PrologHeaderList : /*EMPTY*/
| PrologHeaderList PrologHeader;
PrologHeader : KW_Declare KW_Namespace Separator;
PrologBodyList : /*EMPTY*/
| PrologBodyList PrologBody;
PrologBody : KW_Declare KW_Variable Separator;
KW_Declare, KW_Namespace, KW_Variable and Separator are all tokens, with the values "declare", "namespace", "variable" and ";" respectively.
It's been a long time since I've used anything yacc-like, but here are a couple of suggestions that may or may not help.
It seems that you need a 2-token lookahead in this situation. The parser gets to the last PrologHeader, and it has to decide whether the next construct is a PrologHeader or a PrologBody, and it can't tell that from the KW_Declare. If there's a directive to increase lookahead in this situation, it will probably solve the problem.
You could also introduce context into your actions: rather than define PrologHeaderList and PrologBodyList, define PrologRuleList and have the actions throw an error if a header appears after a body. Ugly, but sometimes you have to do it: what appears simple in a grammar may not be simple in the generated parser.
A hackish approach might be to combine the tokens: rather than KW_Declare and KW_Variable, have your lexer recognize the space and use KW_Declare_Variable. Since both are keywords, you're not going to run into namespace collision problems.
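A minimal sketch of that combined-token idea, written in Python rather than in an MPPG/GPLEX lexer specification: the token names KW_Declare_Namespace and KW_Declare_Variable and the function tokenize are hypothetical, and only the three lexemes from the question are handled.

```python
import re

# Each keyword pair is recognized as ONE token, so the parser never has to
# decide what follows a bare "declare".
TOKEN_SPEC = [
    ("KW_Declare_Namespace", r"declare\s+namespace\b"),
    ("KW_Declare_Variable", r"declare\s+variable\b"),
    ("Separator", r";"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return the list of token names for `text`, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append(m.lastgroup)
        pos = m.end()
    return tokens
```

With tokens like these, the first token of each statement already says whether it is a header or a body, so one token of lookahead is enough.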
The grammar at the top is regular, so IIRC you can plot it out as a DFA (or as an NFA and convert it to a DFA) and then convert the DFA to a grammar. It's been a while, so I'll leave the work as an exercise for the reader.
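One way such a DFA might look, sketched in Python over the token names from the question. The state names (START, DECL_ANY, NS, VAR, BODY, DECL_VAR) and the function accepts are made up for this illustration; it recognizes the language only and builds no parse tree.

```python
# (state, token) -> next state. START still allows headers; BODY means at
# least one "declare variable ;" has been seen, so headers are no longer legal.
TRANSITIONS = {
    ("START", "KW_Declare"): "DECL_ANY",
    ("DECL_ANY", "KW_Namespace"): "NS",
    ("DECL_ANY", "KW_Variable"): "VAR",
    ("NS", "Separator"): "START",
    ("VAR", "Separator"): "BODY",
    ("BODY", "KW_Declare"): "DECL_VAR",
    ("DECL_VAR", "KW_Variable"): "VAR",
}
ACCEPTING = {"START", "BODY"}


def accepts(tokens):
    """Return True if the token sequence is in the language of the EBNF."""
    state = "START"
    for tok in tokens:
        state = TRANSITIONS.get((state, tok))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING
```

Each entry in the table can then be read as one right-linear production (for example, START -> KW_Declare DECL_ANY), which is the DFA-to-grammar step mentioned above.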