Matching end-of-line or end-of-file using nom - parsing

I'm trying to parse a string using nom that will either be terminated by a newline or will reach end-of-input once consumed. I have the following code which seems like it should compile:
named!(am_implied <AddressingMode>,
    do_parse!(
        opt!(space) >>
        alt!(
            line_ending |
            eof!()
        ) >>
        (AddressingMode::Implied)
    )
);
This fails with the following message:
error[E0282]: unable to infer enough type information about `E`
   --> src/lib.rs:181:1
    |
181 | named!(am_implied <AddressingMode>,
    | ^ cannot infer type for `E`
    |
    = note: type annotations or generic parameter binding required
I'm led to believe that the above code should compile, since the following code compiles:
named!(am_implied <AddressingMode>,
    do_parse!(
        opt!(space) >>
        line_ending >>
        eof!() >>
        (AddressingMode::Implied)
    )
);
I'm confused as to why this works when the line_ending and eof! parsers aren't used within an alt! parser, but fails when they are. I'd like to know the correct solution to matching on line_ending or eof!.

This looks like this issue in nom, along with this WIP PR. Essentially, some of the nom macros don't provide enough type hints, so the inference fails.
The suggested workaround is to split some of the sub-parsers into separate parsers to help the type inference, but that didn't work for me when I tried it in this case.
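For reference, the behaviour being asked for is small enough to sketch by hand. The following is not nom code and does not fix the inference error; it is a plain std-only Rust function (the name eol_or_eof is made up here) that skips optional spaces or tabs and then accepts either a line ending or the end of the input. It assumes that "space" means spaces and tabs and that a line ending is "\n" or "\r\n".

```rust
/// Hand-rolled sketch (not nom): skip optional spaces/tabs, then accept
/// either a line ending ("\n" or "\r\n") or the end of the input.
/// Returns the unconsumed remainder on success, or None on a mismatch.
fn eol_or_eof(input: &str) -> Option<&str> {
    // Similar in spirit to `opt!(space)`: zero or more spaces or tabs.
    let rest = input.trim_start_matches(|c: char| c == ' ' || c == '\t');

    if rest.is_empty() {
        // Nothing left: this plays the role of the `eof!()` branch.
        Some(rest)
    } else if let Some(after) = rest.strip_prefix("\r\n") {
        // Windows-style line ending.
        Some(after)
    } else if let Some(after) = rest.strip_prefix('\n') {
        // Unix-style line ending.
        Some(after)
    } else {
        // Anything else means the line was not terminated here.
        None
    }
}
```

Calling eol_or_eof on the text that follows an implied-mode instruction would then succeed both in the middle of a file and on the last line without a trailing newline.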

Related

Parsing Dart | ANTLR | Handle a comma at the end of parameter list

My apologies for the bad title; I couldn't express it in better words.
I'm writing a parser using ANTLR to calculate complexities in Dart code.
Things worked fine until I tried to parse a file with the following method signature:
Stream<SomeState> mapEventToState(SomeEvent event,) async* {
//someCode to map the State to Event
}
Here mapEventToState(SomeEvent event,) creates an issue because of the trailing comma at the end.
Because of the trailing comma, the parser reports two parameters (in reality there is just one) and pulls part of the method body into the parameter list, which makes the rest of the code unparseable for ANTLR.
It is normal in Flutter to end a parameter list with a comma.
The grammar corresponding to it is:
initializedVariableDeclaration
: declaredIdentifier ('=' expression)? (','initializedIdentifier)*
;
initializedIdentifier
: identifier ('=' expression)?
;
initializedIdentifierList
: initializedIdentifier (',' initializedIdentifier)*
;
The full grammar can be checked at https://github.com/antlr/grammars-v4/blob/master/dart2/Dart2.g4
What should I change in the grammar so that the parser understands that functionName(Param param1, Param param2,) is the same as functionName(Param param1, Param param2)?
The Dart project maintains a reference ANTLR grammar for the Dart language (mostly as a tool for ourselves, to ensure new language features can be parsed).
It might be useful as a reference.
The "dart2" grammar you are linking to in the ANTLR repository is probably severely outdated. It was not created by a Dart team member, and if it doesn't handle trailing commas in argument lists, it was probably never complete for Dart 2.0. Use with caution.
I do not believe that the rule you mentioned (initializedVariableDeclaration) is the grammar corresponding to the problem. That's for an ordinary variable declaration (with an initializer).
I believe you actually want to change formalParameterList. The Dart grammar is provided by the language specification, and we can compare the grammar listed there to the grammar from the ANTLR repository.
The ANTLR file has:
formalParameterList
: '(' ')'
| '(' normalFormalParameters ')'
...
whereas the Dart 2.10 specification has, from section 9.2 (Formal Parameters):
<formalParameterList> ::= ‘(’ ‘)’
| ‘(’ <normalFormalParameters> ‘,’? ‘)’
...
You should file an issue against ANTLR or create a pull request to fix it.
That file also does not appear to have been substantially updated since May 2019 and seems to be missing some notable changes to the Dart language since that time (e.g. spread collections (spreadElement), collection-if (ifElement), and collection-for (forElement) from Dart 2.3, and the changes for null safety).
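To make the effect of the optional ‘,’ concrete, here is a rough, hand-written Rust sketch of a recognizer for '(' ')' | '(' param (',' param)* ','? ')'. It is not ANTLR output and not the Dart grammar: the function name is made up, the input is assumed to be already split into tokens, and a "param" is simplified to a single token, whereas real Dart parameters are much richer.

```rust
/// Toy recognizer for  '(' ')'  |  '(' param (',' param)* ','? ')'
/// over pre-split tokens. A "param" is simplified to any token that
/// is not "(", ")" or ",".
/// Returns the number of parameters, or None if the list is malformed.
fn formal_parameter_list(tokens: &[&str]) -> Option<usize> {
    let is_param = |t: &str| !matches!(t, "(" | ")" | ",");
    let mut it = tokens.iter().copied().peekable();

    if it.next()? != "(" {
        return None;
    }
    // Empty list: '(' ')'
    if it.peek() == Some(&")") {
        it.next();
        return if it.next().is_none() { Some(0) } else { None };
    }
    let mut count = 0;
    loop {
        // A parameter is required here.
        let t = it.next()?;
        if !is_param(t) {
            return None;
        }
        count += 1;
        match it.next()? {
            ")" => break,
            "," => {
                // The optional trailing comma: ',' directly followed by ')'.
                if it.peek() == Some(&")") {
                    it.next();
                    break;
                }
            }
            _ => return None,
        }
    }
    // Nothing may follow the closing parenthesis.
    if it.next().is_none() { Some(count) } else { None }
}
```

With the trailing-comma branch in place, ["(", "event", ",", ")"] counts one parameter instead of failing or reporting two, while a doubled comma is still rejected.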

How to report an error for input not defined in a grammar written using ANTLR

I am currently trying to improve and fix bugs in an existing grammar that someone else created.
We have our own language, for which we have created an editor using the Eclipse IDE.
Some example rules from the grammar:
calc : choice INTEGER INTEGER
choice : add|sub|div|mul
INTEGER : ('0'..'9')+
So in my editor, if I type
calc add 2 aaa
the ANTLR error handling recognizes it as an error, since it expects an integer and I typed a string, and reports an error message such as
extraneous input 'aaa' expecting {'{', INTEGER}
(I have a class that extends BaseErrorListener, where I create markers for these errors.)
The rest of my editor's grammar is defined in a similar way.
Now the question: in all of these cases ANTLR identifies that something is wrong in the syntax and reports an error, but what about input that is not part of the grammar at all?
If I type any garbage value such as
abc add 2 3
or
just_type_junk_in_editor
it does not report any error, since 'abc' or 'just_type_junk_in_editor' is not in my grammar.
Is there a way to make ANTLR report an error for words that are not part of the grammar?
Without having seen the full grammar, I think your problem is the missing EOF token in your main rule. ANTLR4 consumes as much input as it can, but if the remaining input doesn't match anything (at least in the main rule), it ignores the rest, which explains why you don't see an error. By adding EOF you tell ANTLR4 that all input must be matched:
calc: choice INTEGER INTEGER EOF;
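The difference can be illustrated with a small hand-written Rust sketch. This is not ANTLR and not generated code. Since the full grammar is not shown, it assumes a main rule of the form program : calc* and treats "calc" as a keyword token; the function names and the error text are made up for illustration.

```rust
/// True if the token consists only of ASCII digits (the INTEGER rule).
fn is_int(t: &str) -> bool {
    !t.is_empty() && t.bytes().all(|b| b.is_ascii_digit())
}

/// Toy recognizer for one statement:  "calc" choice INTEGER INTEGER
/// Returns how many tokens the statement consumed, if it matched.
fn calc(tokens: &[&str]) -> Option<usize> {
    match tokens {
        ["calc", op, a, b, ..]
            if matches!(*op, "add" | "sub" | "div" | "mul") && is_int(a) && is_int(b) =>
        {
            Some(4)
        }
        _ => None,
    }
}

/// Assumed main rule WITHOUT EOF:  program : calc* ;
/// Matches as many statements as it can, then silently stops.
fn program(tokens: &[&str]) -> usize {
    let mut pos = 0;
    while let Some(n) = calc(&tokens[pos..]) {
        pos += n;
    }
    pos
}

/// Assumed main rule WITH EOF:  program : calc* EOF ;
/// Fails unless every token was consumed.
fn program_eof(tokens: &[&str]) -> Result<(), String> {
    let pos = program(tokens);
    if pos == tokens.len() {
        Ok(())
    } else {
        Err(format!("unexpected input '{}' at token {}", tokens[pos], pos))
    }
}
```

In this sketch, program accepts abc add 2 3 by matching zero statements and leaving everything unread, which mirrors the silent behaviour described in the question; program_eof rejects the same input because tokens remain.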

Reduce-Reduce error: .. can follow more than one completed rule

I am using Gold Parser v5.2.
I am attempting to slightly modify the Object Pascal Engine (by Rob van den Brink) so that it can parse .dpr and .dpk files as well as .pas files.
The grammar file (named D7Grammar.grm, in the file downloaded from the link above) passes Gold's analysis [Project | Analyze the Grammar] (with the modifications below) but fails with 'Project | Create LALR Parse tables'.
Modifications to 'D7Grammar.grm' file:
Find definition for 'FloatLiteral' and rewrite it as this:
FloatLiteral = {Digit} + '.' + {Digit} +
Find '<UsesClause>' and rewrite it as this:
<UsesClause> ::= USES <UnitList> ';'
| SynError
Add the following rules
<UnitRef> ::= <RefID> !see http://stackoverflow.com/questions/35871440/
| <RefID> IN 'StringLiteral'
| <RefID> IN 'StringLiteral' Comment Start <RefID> Comment End
<UnitList> ::= <UnitList> ',' <UnitRef>
| <UnitRef>
Having done these, when I issue 'Project | Create LALR Parse tables' in Gold Parser, I get the following error:
')' can follow more than one completed rule. A Reduce-Reduce error is
a caused when a grammar allows two or more rules to be reduced at the
same time, for the same token. The grammar is ambigious. Please see
the documentation for more information.
Further clicking around displays a table hinting that 'FieldDesignator' and 'EnumId' are the culprits, along with some more information whose meaning I don't understand.
I am guessing this ambiguity was overlooked by older versions of Gold (since D7Grammar.grm had no issues then) but is brought to the surface by the new version.
The trouble is that, other than trial and error (mostly copy/paste from random ideas or from other people's suggestions), I am useless with grammar rules.
So, it goes without saying, help is badly needed to get past this problem.

How to remove ambiguity from this piece of Delphi grammar

I'm working on a Delphi Grammar in Rascal and I'm having some problems parsing its “record” type. The relevant section of Delphi code can look as follows:
record
private
a,b,c : Integer;
x : Cardinal;
end
Where the "private" can be optional, and the variable declaration lines can also be optional.
I tried to interpret this section using the rules below:
syntax FieldDecl = IdentList ":" Type
| IdentList ":" Type ";"
;
syntax FieldSection = FieldDecl
| "var" FieldDecl
| "class" "var" FieldDecl
;
syntax Visibility = "private" | "protected" | "public"| "published" ;
syntax VisibilitySectionContent = FieldSection
| MethodOrProperty
| ConstSection
| TypeSection
;
syntax VisibilitySection = Visibility? VisibilitySectionContent+
;
syntax RecordType = "record" "end"
| "record" VisibilitySection+ "end"
;
The problem is ambiguity. The entire text between “record” and “end” can be parsed as a single VisibilitySection, but every line on its own can also be a separate VisibilitySection.
I can change the rule VisibilitySection to
syntax VisibilitySection = Visibility
| VisibilitySectionContent
;
Then the grammar is no longer ambiguous, but VisibilitySection becomes flat: the variable lines are no longer nested under an optional 'private' node, which is what I would prefer.
Any suggestions on how to solve this problem? What I would like to do is demand a longest (greedy) match on the VisibilitySectionContent+ symbol of VisibilitySection.
But changing
syntax VisibilitySection = Visibility? VisibilitySectionContent+
to
syntax VisibilitySection = Visibility? VisibilitySectionContent+ !>> VisibilitySectionContent
does not seem to work for this.
I also ran the Ambiguity report tool on Rascal, but it does not provide me any insights.
Any thoughts?
Thanks
I can't check since you did not provide the full grammar, but I believe this should work to get your "longest match" behavior:
syntax VisibilitySection
= Visibility? VisibilitySectionContent+ ()
>> "public"
>> "private"
>> "published"
>> "protected"
>> "end"
;
In my mind this should remove the interpretation where your nested VisibilitySections are cut short. Now we only accept such sections if they are immediately followed by either the end of the record, or the next section. I'm curious to find out if it really works because it is always hard to predict the behavior of a grammar :-)
The () at the end of the rule (empty non-terminal) makes sure we can skip to the start of the next part before applying the restriction. This only works if you have a longest match rule on layout already somewhere in the grammar.
The VisibilitySectionContent+ in VisibilitySection should be VisibilitySectionContent (without the Kleene plus).
I’m guessing here, but your intention is probably to allow a number of sections/declarations within the record type, and any of those may or may not have a Visibility modifier. To avoid putting this optional Visibility in every section, you have created a VisibilitySectionContent nonterminal which basically models “things that can happen within the record type definition”, one thing per nonterminal, without worrying about visibility modifiers. In this case, you’re fine with one VisibilitySectionContent per VisibilitySection since there is explicit repetition when you refer to the VisibilitySection from the RecordType anyway.

Antlr mismatched '>' for include macro

I started working with ANTLR a few days ago. I'd like to use it to parse #include macros in C. Only the includes are of interest to me; all other parts are irrelevant. Here is a simple grammar file I wrote:
... parser part omitted...
INCLUDE : '#include';
INCLUDE_FILE_QUOTE: '"'FILE_NAME'"';
INCLUDE_FILE_ANGLE: '<'FILE_NAME'>';
fragment
FILE_NAME: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'.'|' ')+;
MACROS: '#'('if' | 'ifdef' | 'define' | 'endif' | 'undef' | 'elif' | 'else' );
//MACROS: '#'('a'..'z'|'A'..'Z')+;
OPERATORS: ('+'|'-'|'*'|'/'|'='|'=='|'!='|'>'|'>='|'<'|'<='|'>>'|'<<'|'<<<'|'|'|'&'|','|';'|'.'|'->'|'#');
... other supporting tokens like ID, WS and COMMENT ...
This grammar produces an ambiguity when statements such as the following are encountered:
(;i<listLength;i++)
output: mismatched character ';' expecting '>'
It seems to be trying to match INCLUDE_FILE_ANGLE instead of treating the ";" as OPERATORS.
I heard there's an operator called a syntactic predicate, but I'm not sure how to use it properly in this case.
How can I solve this problem in an ANTLR-encouraged way?
Looks like there's not a lot of ANTLR activity here.
Anyway, I figured this out.
INCLUDE_MACRO: ('#include')=>'#include';
VERSION_MACRO: ('#version')=>'#version';
OTHER_MACRO:
(
|('#if')=>'#if'
|('#ifndef')=>'#ifndef'
|('#ifdef')=>'#ifdef'
|('#else')=>'#else'
|('#elif')=>'#elif'
|('#endif')=>'#endif'
);
This only solves the first half of the problem. Second, one cannot use INCLUDE_FILE_ANGLE to match the desired string in the #include directive.
The '<'FILE_NAME'>' rule creates ambiguity and must be broken down into basic lexer tokens, or else handled with more advanced context-aware checks. I'm not familiar with the latter technique, so I wrote this in the parser rule:
include_statement :
INCLUDE_MACRO include_file
-> ^(INCLUDE_MACRO include_file);
include_file
: STRING
| LEFT_ANGLE(INT|ID|OPERATORS)+RIGHT_ANGLE
;
This works, but it admittedly looks ugly.
I hope experienced users can comment with a much better solution.
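As a rough illustration of what a context-aware check means here (outside ANTLR entirely), the Rust sketch below only treats '<' as the start of a file name when the line has already been identified as an #include. The function name is made up, and the sketch assumes one directive per line with no space between '#' and 'include'; it is not a replacement for the grammar above.

```rust
/// Sketch of a context-aware check: '<' or '"' is only interpreted as a
/// file-name delimiter after "#include" has been seen on the same line.
/// Returns the included file name, or None for any other line.
fn include_target(line: &str) -> Option<&str> {
    // Context: the line must start with the directive.
    let rest = line.trim_start().strip_prefix("#include")?.trim_start();

    // Pick the closing delimiter that matches the opening one.
    let (close, body) = if let Some(b) = rest.strip_prefix('<') {
        ('>', b)
    } else if let Some(b) = rest.strip_prefix('"') {
        ('"', b)
    } else {
        return None;
    };

    // The name runs up to the closing delimiter; fail if it is missing.
    let end = body.find(close)?;
    Some(&body[..end])
}
```

Because the angle bracket is never examined outside an #include line, a statement like (;i<listLength;i++) is simply ignored by this check rather than being mistaken for the start of a file name.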
