I am using Gold Parser v5.2.
I am attempting to slightly modify the Object Pascal Engine (by Rob van den Brink) so that it can parse .dpr and .dpk files as well as .pas files.
The grammar file (named D7Grammar.grm, in the file downloaded from the above link) passes Gold's analysis [Project | Analyze the Grammar] (with the modifications below) but fails with 'Project | Create LALR Parse tables'.
Modifications to 'D7Grammar.grm' file:
Find definition for 'FloatLiteral' and rewrite it as this:
FloatLiteral = {Digit} + '.' + {Digit} +
Find '<UsesClause>' and rewrite it as this:
<UsesClause> ::= USES <UnitList> ';'
| SynError
Add the following rules
<UnitRef> ::= <RefID> !see http://stackoverflow.com/questions/35871440/
| <RefID> IN 'StringLiteral'
| <RefID> IN 'StringLiteral' Comment Start <RefID> Comment End
<UnitList> ::= <UnitList> ',' <UnitRef>
| <UnitRef>
Having made these changes, when I issue 'Project | Create LALR Parse tables' in Gold Parser, I get the following error:
')' can follow more than one completed rule. A Reduce-Reduce error is
a caused when a grammar allows two or more rules to be reduced at the
same time, for the same token. The grammar is ambigious. Please see
the documentation for more information.
Further clicking around displays a table hinting that 'FieldDesignator' and 'EnumId' are the culprits, along with some further information whose meaning I do not understand.
I am guessing this ambiguity was overlooked by older versions of Gold (since D7Grammar.grm had no issues then) but is brought to the surface by the new version.
Trouble is, other than trial and error (mostly copy/paste from random ideas or from other people's suggestions), I am useless with grammar rules.
Hence, it goes without saying, help is badly needed to get past this problem.
My apologies for the bad title, but I couldn't express it in better words.
I'm writing a parser using ANTLR to calculate complexities in Dart code.
Things seemed to work fine until I tried to parse a file with the following method signature:
Stream<SomeState> mapEventToState(SomeEvent event,) async* {
//someCode to map the State to Event
}
Here mapEventToState(SomeEvent event,) causes a problem because of the trailing comma at the end of the parameter list.
The parser reports two parameters because of the trailing comma (whereas in reality there is just one) and pulls part of the following code into the parameter list, which makes the rest of the code unparseable for ANTLR.
It is normal in Flutter to end a list of parameters with a comma.
The grammar corresponding to it is:
initializedVariableDeclaration
: declaredIdentifier ('=' expression)? (',' initializedIdentifier)*
;
initializedIdentifier
: identifier ('=' expression)?
;
initializedIdentifierList
: initializedIdentifier (',' initializedIdentifier)*
;
The full grammar can be checked at https://github.com/antlr/grammars-v4/blob/master/dart2/Dart2.g4
What should I change in the grammar so that I don't face this issue and the parser understands that functionName(Param param1, Param param2,) is the same as functionName(Param param1, Param param2)?
The Dart project maintains a reference ANTLR grammar for the Dart language (mostly as a tool for ourselves, to ensure new language features can be parsed).
It might be useful as a reference.
The "dart2" grammar you are linking to in the ANTLR repository is probably severely outdated. It was not created by a Dart team member, and if it doesn't handle trailing commas in argument lists, it was probably never complete for Dart 2.0. Use with caution.
I do not believe that the rule you mentioned (initializedVariableDeclaration) is the grammar corresponding to the problem. That's for an ordinary variable declaration (with an initializer).
I believe you actually want to change formalParameterList. The Dart grammar is provided by the language specification, and we can compare the grammar listed there to the grammar from the ANTLR repository.
The ANTLR file has:
formalParameterList
: '(' ')'
| '(' normalFormalParameters ')'
...
whereas the Dart 2.10 specification has, from section 9.2 (Formal Parameters):
<formalParameterList> ::= ‘(’ ‘)’
| ‘(’ <normalFormalParameters> ‘,’? ‘)’
...
You should file an issue against ANTLR or create a pull request to fix it.
That file also does not appear to have been substantially updated since May 2019 and seems to be missing some notable changes to the Dart language since that time (e.g. spread collections (spreadElement), collection-if (ifElement), and collection-for (forElement) from Dart 2.3, and the changes for null safety).
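The effect of the optional trailing comma in the specification's rule can be illustrated outside ANTLR. The following is only a minimal Python sketch under simplifying assumptions: every parameter is a plain "Type name" pair, and the names tokenize and parse_formal_parameter_list are made up for this illustration. It is not the Dart grammar itself.

```python
import re

# Identifiers plus the three punctuation tokens this sketch cares about.
TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|[(),])")
PUNCTUATION = {"(", ")", ","}


def tokenize(text):
    """Split text into identifiers and the tokens '(', ')' and ','."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_formal_parameter_list(text):
    """Parse '(' (param (',' param)* ','?)? ')' into [(type, name), ...]."""
    tokens = tokenize(text)
    pos = 0

    def expect(value):
        nonlocal pos
        if tokens[pos:pos + 1] != [value]:
            raise SyntaxError(f"expected {value!r} at token {pos}")
        pos += 1

    def identifier():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] in PUNCTUATION:
            raise SyntaxError(f"expected identifier at token {pos}")
        pos += 1
        return tokens[pos - 1]

    params = []
    expect("(")
    if tokens[pos:pos + 1] != [")"]:
        params.append((identifier(), identifier()))
        while tokens[pos:pos + 1] == [","]:
            pos += 1
            # The ','? part of the rule: a comma directly before ')' ends the list.
            if tokens[pos:pos + 1] == [")"]:
                break
            params.append((identifier(), identifier()))
    expect(")")
    if pos != len(tokens):
        raise SyntaxError("unexpected input after ')'")
    return params
```

With this sketch, "(SomeEvent event,)" yields a single parameter, and a list written with a trailing comma parses to the same result as the list without it.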
The flex info manual says that whitespace is allowed in regular expressions when the "x" modifier is used in the (?r-s:pattern) form. It specifically offers a simple example (without whitespace)
(?:foo) same as (foo)
but the following program fails to compile with the error "unrecognized rule":
BAD (?:foo)
%%
{BAD} {}
I cannot find any form of (? that is acceptable as a rule pattern. Is the manual in error, or do I misunderstand?
The example in your question does not seem to reflect the question itself, since it shows neither the use of whitespace nor an x flag. So I'm going to assume that the pattern which is failing for you is something like
BAD (?x:two | lines |
of | words)
%%
{BAD} { }
And, indeed, that will not work. Although you can use extended format in a pattern, you can only use it in a definition if it doesn't contain a newline. The definition terminates at the last non-whitespace character on the definition line.
Anyway, definitions are overused. You could write the above as
%%
(?x:two | lines |
of | words ) { }
Which saves anyone reading your code from having to search for a definition.
I do understand that you might want to use a very long pattern in a rule, which is awkward, particularly if you want to use it twice. Regardless of the issue with newlines, this tends to run into problems with Flex's definition length limit (2047 characters). My approach has been to break the very long pattern into a series of definitions, and then define another symbol which concatenates the pieces.
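The idea of splitting a long pattern into named pieces and then combining them can be sketched in Python's re module, which stands in here for flex definitions. The piece names and the expand helper are hypothetical, and wrapping each expansion in a group is meant to mirror, as far as I understand it, how flex groups an expanded definition.

```python
import re

# Hypothetical short pieces of what would otherwise be one very long pattern.
PIECES = {
    "NUMBERS": r"two|three",
    "NOUNS": r"lines|words",
    "FILLER": r"of",
}


def expand(pattern, definitions):
    """Replace each {NAME} with its definition, wrapped in a non-capturing group."""
    return re.sub(
        r"\{([A-Z_]+)\}",
        lambda m: "(?:" + definitions[m.group(1)] + ")",
        pattern,
    )


# The combining "definition": short itself, because it only names the pieces.
COMBINED = expand("{NUMBERS}|{NOUNS}|{FILLER}", PIECES)
WORD = re.compile(COMBINED)
```

Each piece stays short, and only the combining line needs to change when a piece is added.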
Before v2.6, Flex did not chop whitespace off the end of the definition line, which also leads to mysterious "unrecognized rule" errors. The manual seems to still reflect the v2.5 behaviour:
The definition is taken to begin at the first non-whitespace character following the name and continuing to the end of the line.
I am working on an ANTLRv4 grammar for BUGS - my repo is here, the link points to a particular commit so shouldn't go out of date.
Minimum code example below.
I would like the input rule to go along the t route if the input is T(, but to go along the id route if the input is just T, for the grammar below.
grammar temp;
input: t | id;
t: T '(';
id: ID;
T: 'T' {_input.LA(1)==(}?;
ID: [a-zA-Z][a-zA-Z0-9._]*;
My ANTLRv4 specification of the BUGS grammar was heavily inspired by the FLEX+BISON lexing and parsing grammar incorporated in the JAGS 4.3.0 source code, in the files src/lib/compiler/parser.yy and src/lib/compiler/scanner.ll.
The way they accomplish it is by using the trailing context in the lexer, e.g. r/s. The way to do it in ANTLR is given here, but I cannot get it to work.
I need it to work this way because another part of the grammar depends on this mechanism - relevant code fragment here.
You can recreate my particular issue by cloning my repo and running make - this will give a list of the tokens lexed and an error in the parsing stage. In the token list the letter T is lexed as token 'T' rather than ID as I'd like it to be.
I feel there is a much more natural/correct way to do it in ANTLR, but I'm new to this and cannot figure out a way.
PS If you have an idea how to better name this question please edit it.
If I understand the problem correctly the following code will work fine:
grammar temp;
input: t | id;
t: T '(';
id: ID | T;
T: 'T';
LPAREN: '(';
ID: [a-zA-Z][a-zA-Z0-9._]*;
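For comparison, the trailing-context behaviour described in the question (flex's r/s, where T is only a keyword when '(' immediately follows) can be sketched as a small hand-written lexer. This is a Python illustration with made-up names, not ANTLR code and not the approach of the grammar above, which resolves the choice in the parser instead.

```python
import re

# Same shape as the ID rule in the grammar above.
ID_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9._]*")


def lex(text):
    """Return (type, text) pairs; 'T' is a keyword only when '(' follows it."""
    tokens, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
        elif ch == "(":
            tokens.append(("LPAREN", "("))
            pos += 1
        else:
            match = ID_RE.match(text, pos)
            if match is None:
                raise SyntaxError(f"unexpected character {ch!r} at offset {pos}")
            word = match.group()
            # Trailing context: peek at the next character without consuming it.
            next_char = text[match.end():match.end() + 1]
            if word == "T" and next_char == "(":
                tokens.append(("T", word))
            else:
                tokens.append(("ID", word))
            pos = match.end()
    return tokens
```

Here lex("T(") produces a T token followed by LPAREN, while lex("T") produces a single ID token.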
I started to work with ANTLR a few days ago. I'd like to use it to parse #include directives in C. Only the includes are of interest to me; all other parts are irrelevant. Here is the simple grammar file I wrote:
... parser part omitted...
INCLUDE : '#include';
INCLUDE_FILE_QUOTE: '"'FILE_NAME'"';
INCLUDE_FILE_ANGLE: '<'FILE_NAME'>';
fragment
FILE_NAME: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'.'|' ')+;
MACROS: '#'('if' | 'ifdef' | 'define' | 'endif' | 'undef' | 'elif' | 'else' );
//MACROS: '#'('a'..'z'|'A'..'Z')+;
OPERATORS: ('+'|'-'|'*'|'/'|'='|'=='|'!='|'>'|'>='|'<'|'<='|'>>'|'<<'|'<<<'|'|'|'&'|','|';'|'.'|'->'|'#');
... other supporting tokens like ID, WS and COMMENT ...
This grammar produces an ambiguity when statements such as this are encountered:
(;i<listLength;i++)
output: mismatched character ';' expecting '>'
It seems to be trying to match INCLUDE_FILE_ANGLE instead of treating the ";" as OPERATORS.
I heard there's an operator called a syntactic predicate, but I'm not sure how to use it properly in this case.
How can I solve this problem in an ANTLR-encouraged way?
It looks like there's not a lot of activity about ANTLR here.
Anyway, I figured this out.
INCLUDE_MACRO: ('#include')=>'#include';
VERSION_MACRO: ('#version')=>'#version';
OTHER_MACRO:
(
// longer directives come first so that '#if' cannot win on '#ifndef' or '#ifdef'
('#ifndef')=>'#ifndef'
|('#ifdef')=>'#ifdef'
|('#if')=>'#if'
|('#else')=>'#else'
|('#elif')=>'#elif'
|('#endif')=>'#endif'
);
This only solves the first half of the problem. Secondly, one cannot use INCLUDE_FILE_ANGLE to match the desired string in the #include directive.
The '<'FILE_NAME'>' rule creates an ambiguity and must either be broken down into basic tokens in the lexer or use more advanced context-aware checks. I'm not familiar with the latter technique, so I wrote this in the parser rule:
include_statement :
INCLUDE_MACRO include_file
-> ^(INCLUDE_MACRO include_file);
include_file
: STRING
| LEFT_ANGLE(INT|ID|OPERATORS)+RIGHT_ANGLE
;
This works, but it admittedly looks ugly.
I hope experienced users can comment with a much better solution.
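The "context-aware" idea, treating '<' as the start of a file name only directly after #include, can be sketched outside ANTLR. The following Python sketch uses a single anchored regular expression; the names INCLUDE_RE and find_includes are made up for this illustration, and it deliberately ignores comments, so a commented-out #include at the start of a line would still be reported.

```python
import re

# An #include directive must start its line; only then may '<' or '"' open a file name.
INCLUDE_RE = re.compile(
    r'^[ \t]*#[ \t]*include[ \t]*(?:"([^"\n]+)"|<([^>\n]+)>)',
    re.MULTILINE,
)


def find_includes(source):
    """Return (kind, file_name) pairs, where kind is 'quote' or 'angle'."""
    result = []
    for match in INCLUDE_RE.finditer(source):
        if match.group(1) is not None:
            result.append(("quote", match.group(1)))
        else:
            result.append(("angle", match.group(2)))
    return result


SAMPLE = '''#include <stdio.h>
#include "my list.h"
for (;i<listLength;i++) { x = a > b; }
'''
```

Because the pattern is anchored to the directive, the '<' in i<listLength is never considered as the start of a file name.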
Are there any common solutions for using incomplete grammars? In my case I just want to detect methods in Delphi (Pascal) files, that is, procedures and functions. The following first attempt works:
methods
: ( procedure | function | . )+
;
But is that a solution at all? Are there any better solutions? Is it possible to stop parsing with an action (e.g. after detecting implementation)? Does it make sense to use a preprocessor? And if so, how?
If you're only looking for names, then something as simple as this:
grammar PascalFuncProc;
parse
: (Procedure | Function)* EOF
;
Procedure
: 'procedure' Spaces Identifier
;
Function
: 'function' Spaces Identifier
;
Ignore
: (StrLiteral | Comment | .) {skip();}
;
fragment Spaces : (' ' | '\t' | '\r' | '\n')+;
fragment Identifier : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
fragment StrLiteral : '\'' ~'\''* '\'';
fragment Comment : '{' ~'}'* '}';
will do the trick. Note that I am not very familiar with Delphi/Pascal, so I am surely goofing up StrLiterals and/or Comments, but that'll be easily fixed.
The lexer generated from the grammar above will only produce two types of tokens (Procedures and Functions). The rest of the input (string literals, comments or, if nothing else matches, a single character: the .) is discarded by the lexer immediately (the skip() method).
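The same skip-everything-else idea can be sketched without a parser generator. The following Python sketch assumes, like the grammar above, that comments are { ... } and strings are '...' with no further Pascal subtleties such as (* *) or // comments; the names TOKEN_RE and find_methods are made up for this illustration.

```python
import re

# Comments and strings are matched first so that keywords inside them are consumed
# as part of the comment or string and never reported as methods.
TOKEN_RE = re.compile(
    r"""
      (?P<comment> \{ [^}]* \} )        # brace comment, skipped
    | (?P<string>  ' [^']* ' )          # quoted string literal, skipped
    | (?P<method>  \b (?P<kind>procedure|function) \s+ (?P<name>[A-Za-z_]\w*) )
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_methods(source):
    """Return (kind, name) for every procedure or function outside comments and strings."""
    return [
        (match.group("kind").lower(), match.group("name"))
        for match in TOKEN_RE.finditer(source)
        if match.group("method")
    ]


SAMPLE = """some valid source
{
function NotAFunction ...
}
procedure Proc
Begin
...
End;
procedure Func
Begin
s = 'function NotAFunction!!!'
End;
"""
```

Any character that starts none of the three alternatives is simply passed over by finditer, which plays the role of the catch-all . alternative in the Ignore rule.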
For input like this:
some valid source
{
function NotAFunction ...
}
procedure Proc
Begin
...
End;
procedure Func
Begin
s = 'function NotAFunction!!!'
End;
the following parse tree is created:
What you are asking about are called island grammars. The notion is that you define a parser for the part of the language you care about (the "island") with all the classic tokenization needed for that part, and that you define an extremely sloppy parser to skip the rest (the "ocean" in which the island is embedded). One common trick for doing this is to define correspondingly sloppy lexers that pick up vast amounts of stuff (to skip past HTML to embedded code, you can try to skip past anything that doesn't look like a script tag in the lexer, for example).
The ANTLR site even discusses some related issues but notably says there are examples included with ANTLR. I have no experience with ANTLR so I don't know how useful this specific information is.
Having built many tools that use parsers to analyze/transform code (check my bio), I'm a bit of a pessimist about the general utility of island grammars. Unless your goal is to do something pretty trivial with the parsed island, you will need to collect the meaning of all identifiers it uses directly or indirectly... and most of them are, unfortunately for you, defined in the ocean. So IMHO you pretty much have to parse the ocean too to get past trivial tasks. You'll have other troubles, too, making sure you really skip the ocean stuff; this pretty much means your ocean lexer has to know about whitespace, comments, and all the picky syntax of character strings (this is harder than it looks with modern languages) so that these get properly skipped over. YMMV.