Parser rules confusion in ANTLR4 - parsing

I am confused as to whether the following is allowed:
(I am using declaration in the forloop rule however declaration also defines how to declare other things. Could this be error checked later in the compiler? Am I clear?)
declaration :
operand ASSIGNMENTOPERATOR variable var_type CONST?
|operations ASSIGNMENTOPERATOR variable var_type CONST?
|funcall ASSIGNMENTOPERATOR variable var_type CONST?
|(funcall|operand|NOINDEXARRAY) ASSIGNMENTOPERATOR variable var_type ARRAY CONST? ;
forloop :
block
(LPARENS ((number_operation ASSIGNMENTOPERATOR variable)|number_functions)
SEMICOLON bool_operation
SEMICOLON declaration
RPARENS
)
'for'
;
UPDATE: I know that it would work when I supply the right type of declaration inside the for loop. The question is what happens if I don't?

It seems what you have in mind is a semantic phase, which is very typical in parser setups. Parsing the input is only a small part of the work. Usually you have a step after that to validate your parse tree (e.g. look for duplicate variable names or unknown symbols and check other conditions). This is usually called the semantic phase (parsing is the syntactic phase).
You can use this semantic phase for all kind of error checking, including your declaration check (whatever you want to check there, that's not clear from your question).

Related

Parsing Dart | ANTLR | Handle a comma at the end of parameter list

My apologies for the bad title, but couldn't express it in better words.
I'm writing a parser using ANTLR to calculate complexities in dart code.
Things seem to work fine until I tried to parse a file with the following Method Signature
Stream<SomeState> mapEventToState(SomeEvent event,) async* {
//someCode to map the State to Event
}
Here the mapEventToState(SomeEvent event,) creates an issue because of the COMMA , at the end.
It presents 2 params to me because of the trailing COMMA (whereas in reality it's just one) and includes some part of the code in the params list thus making the rest of the code unreadable for ANTLR.
This is normal in flutter to end a list of parameters with a COMMA.
The grammar corresponding to it is:
initializedVariableDeclaration
: declaredIdentifier ('=' expression)? (','initializedIdentifier)*
;
initializedIdentifier
: identifier ('=' expression)?
;
initializedIdentifierList
: initializedIdentifier (',' initializedIdentifier)*
;
The full grammar can be checked at https://github.com/antlr/grammars-v4/blob/master/dart2/Dart2.g4
What should I change on the grammar so that I don't face this issue and the parser can understand that functionName(Param param1, Param param2,) is same as functionName(Param param1, Param param2)
The Dart project maintains a reference ANTLR grammar for the Dart language (mostly as a tool for ourselves, to ensure new language features can be parsed).
It might be useful as a reference.
The "dart2" grammar you are linking to in the ANTLR repository is probably severely outdated. It was not created by a Dart team member, and if it doesn't handle trailing commas in argument lists, it was probably never complete for Dart 2.0. Use with caution.
I do not believe that the rule you mentioned (initializedVariableDeclaration) is the grammar corresponding to the problem. That's for an ordinary variable declaration (with an initializer).
I believe you actually want to change formalParameterList. The Dart grammar is provided by the language specification, and we can compare the grammar listed there to the grammar from the ANTLR repository.
The ANTLR file has:
formalParameterList
: '(' ')'
| '(' normalFormalParameters ')'
...
whereas the Dart 2.10 specification has, from section 9.2 (Formal Parameters):
<formalParameterList> ::= ‘(’ ‘)’
| ‘(’ <normalFormalParameters> ‘,’? ‘)’
...
You should file an issue against ANTLR or create a pull request to fix it.
That file also does not appear to have been substantially updated since May 2019 and seems to be missing some notable changes to the Dart language since that time (e.g. spread collections (spreadElement), collection-if (ifElement), and collection-for (forElement) from Dart 2.3, and the changes for null safety).

Validating unique names for strings and optional reference

New to XText, I am struggling with two issues with the following MWE grammar.
Metamodel:
(classes += Type)*
;
Type:
Enumeration | Class
;
Enumeration:
'enumeration' name = ValidID '{' (literals += EnumLiteral ';')+ '}'
;
EnumLiteral:
ValidID
;
Class:
'class' name = ValidID '{'
(references += Reference)*
'}'
;
Reference:
'reference' name = ValidID ':' type = Class ('#' opposite = [Reference])?
;
So my questions are:
Since the enumeration literals list is ValidID, it seems to be represented by EStrings. The documentation does not seem to deal with the case of primitive types in ECore. How is it possible to check for non-duplicates in literals, and report it adequately in the editor (i.e., the error should be at the first occurence of a repeated literal)?
Despite my best efforts, I was unable to write a custom scope for the opposite reference. Since XText uses reflection for retrieving the scoping methods, I suspect I don't have the correct one: I tried def scope_Reference_opposite(Reference context, EReference r), is it correct? An example would be really appreciated, from which I am confident I can easily adapt to my "real" DSL.
Thanks a lot for the help, you will save me a lot of time looking again and again for a solution in documentation...
Errors can be attached to a certain index of a many-values feature. Write a validation for the type Enumeration and check the the list of literals for duplicates. Attach the error to the index in the list.
The signature is correct. Did you import the correct 'Reference' or did you use some other class with the same simple name by accident. Also please not that your grammar appears to be wrong for the type of the reference. This should be type=[Class] or more likely type=[Class|ValidID].
If you plan to use or do already use Xbase, things may look different. Xbase doesn't use the reflective scope provider.

Why are redundant parenthesis not allowed in syntax definitions?

This syntax module is syntactically valid:
module mod1
syntax Empty =
;
And so is this one, which should be an equivalent grammar to the previous one:
module mod2
syntax Empty =
( )
;
(The resulting parser accepts only empty strings.)
Which means that you can make grammars such as this one:
module mod3
syntax EmptyOrKitchen =
( ) | "kitchen"
;
But, the following is not allowed (nested parenthesis):
module mod4
syntax Empty =
(( ))
;
I would have guessed that redundant parenthesis are allowed, since they are allowed in things like expressions, e.g. ((2)) + 2.
This problem came up when working with the data types for internal representation of rascal syntax definitions. The following code will create the same module as in the last example, namely mod4 (modulo some whitespace):
import Grammar;
import lang::rascal::format::Grammar;
str sm1 = definition2rascal(\definition("unknown_main",("the-module":\module("unknown",{},{},grammar({sort("Empty")},(sort("Empty"):prod(sort("Empty"),[
alt({seq([])})
],{})))))));
The problematic part of the data is on its own line - alt({seq([])}). If this code is changed to seq([]), then you get the same syntax module as mod2. If you further delete this whole expression, i.e. so that you get this:
str sm3 =
definition2rascal(\definition("unknown_main",("the-module":\module("unknown",{},{},grammar({sort("Empty")},(sort("Empty"):prod(sort("Empty"),[
], {})))))));
Then you get mod1.
So should such redundant parenthesis by printed by the definition2rascal(...) function? And should it matter with regards to making the resulting module valid or not?
Why they are not allowed is basically we wanted to see if we could do without. There is currently no priority relation between the symbol kinds, so in general there is no need to have a bracket syntax (like you do need to + and * in expressions).
Already the brackets have two different semantics, one () being the epsilon symbol and two (Sym1 Sym2 ...) being a nested sequence. This nested sequence is defined (syntactically) to expect at least two symbols. Now we could without ambiguity introduce a third semantics for the brackets with a single symbol or relax the requirement for sequence... But we reckoned it would be confusing that in one case you would get an extra layer in the resulting parse tree (sequence), while in the other case you would not (ignored superfluous bracket).
More detailed wise, the problem of printing seq([]) is not so much a problem of the meta syntax but rather that the backing abstract notation is more relaxed than the concrete notation (i.e. it is a bigger language or an over-approximation). The parser generator will generate a working parser for seq([]). But, there is no Rascal notation for an empty sequence and I guess the pretty printer should throw an exception.

Can't use IF NOT with a set in delphi

I have a set of chars which I define in the TYPE section as:
TAmpls = set of '1'..'9'';
In my function I declare a new variable, in the var section, with type Tampls using:
myAmpls : Tampls;
I then un-assign everything in myAmpls using:
myAMpls := [];
I then find an integer (I'll call it n). If this number is not assigned in my set variable, I want to assign it, for this I have tried using:
if not chr(n) in myAmpls then include(myAmpls,chr(n));
But the compiler throws an error saying:
'Operator not applicable to this operand type'
If I remove the 'not', the code compiles fine, why is this?
I would have thought that whether or not n was already in myAmpls was boolean, so why can't I use 'not'?
Delphi operator precedence is detailed in the documentation. There you will find a table of the operators listing their precedence. I won't reproduce the table here, no least because it's hard to lay out in markdown!
You will also find this text:
An operator with higher precedence is evaluated before an operator with lower precedence, while operators of equal precedence associate to the left.
Your expression is:
not chr(n) in myAmpls
Now, not has higher precedence than in. Which means that not is evaluated first. So the expression is parsed as
(not chr(n)) in myAmpls
And that is a syntax error because not cannot be used with a character operand. You need to apply parens to give the desired meaning to your expression:
not (chr(n) in myAmpls)

how typedef-name - identifier issue is resolved in C?

I've been recently writing parser for language based on C. I'm using CUP (Yacc for Java).
I want to implement "The lexer hack" (http://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-c%E2%80%99s-grammar-revisited/ or https://en.wikipedia.org/wiki/The_lexer_hack), to distinguish typedef names and variable/function names etc. To enable declaring variables of the same name as type declared earlier (example from first link):
typedef int AA;
void foo() {
AA aa; /* OK - define variable aa of type AA */
float AA; /* OK - define variable AA of type float */
}
we have to introduce some new productions, where variable/function name could be either IDENTIFIER or TYPENAME. And this is the moment where difficulties occur - conflicts in grammar.
I was trying not to use this messy Yacc grammar for gcc 3.4 (http://yaxx.googlecode.com/svn-history/r2/trunk/gcc-3.4.0/gcc/c-parse.y), but this time I have no idea how to resolve conflicts on my own. I took a look at Yacc grammar:
declarator:
after_type_declarator
| notype_declarator
;
after_type_declarator:
...
| TYPENAME
;
notype_declarator:
...
| IDENTIFIER
;
fndef:
declspecs_ts setspecs declarator
// some action code
// the rest of production
...
setspecs: /* empty */
// some action code
declspecs_ts means declaration_specifiers where
"Whether a type specifier has been seen; after a type specifier, a typedef name is an identifier to redeclare (_ts or _nots)."
From declspecs_ts we can reach
typespec_nonreserved_nonattr:
TYPENAME
...
;
At the first glance I can't believe how shift/reduce conflicts does not appear!
setspecs is empty, so we have declspecs_ts followed by declarator, so that we can expect that parser should be confused whether TYPENAME is from declspecs_ts or from declarator.
Can anyone explain this briefly (or even precisely). Thanks in advance!
EDIT:
Useful link: http://www.gnu.org/software/bison/manual/bison.html#Semantic-Tokens
I can't speak for the specific code.
But the basic trick is that the C lexer inspects every IDENTIFIER, and decides if might be the name of a typedef. If so, then it changes the lexeme type to TYPEDEF and hands it to the parser.
How is the lexer to know what identifiers are typedefs? The parser must in effect tell it, by capturing typedef information as it runs. Somewhere in the grammar related to declarations, there must be an action to provide this information. I would have expected it to be attached to the grammar rules for, well, typedef declarations.
You didn't show what "setspec" did; maybe that's the place. A common trick used with LR parser generators is to introduce a grammar rule E with an empty right hand (your example "setspec"?), to be invoked in the middle of some other grammar rule (your example "fndef") just to enable access to a semantic action in the middle of processing that rule.
This whole trick is to get around parsing ambiguity if you can't tell typedefs from other identifiers. If your parser tolerates ambiguity, you don't need this hack at all; just parse, and built ASTs with both (sub)parses. After you acquire the AST, a tree walk can find type information and eliminate inconsistent subparses. We do this with GLR for both C and C++, and it beautifully separates parsing from name resolution.

Resources