I'm working on a Delphi Grammar in Rascal and I'm having some problems parsing its “record” type. The relevant section of Delphi code can look as follows:
record
private
a,b,c : Integer;
x : Cardinal;
end
Here the "private" is optional, and so are the variable declaration lines.
I tried to interpret this section using the rules below:
syntax FieldDecl = IdentList ":" Type
| IdentList ":" Type ";"
;
syntax FieldSection = FieldDecl
| "var" FieldDecl
| "class" "var" FieldDecl
;
syntax Visibility = "private" | "protected" | "public"| "published" ;
syntax VisibilitySectionContent = FieldSection
| MethodOrProperty
| ConstSection
| TypeSection
;
syntax VisibilitySection = Visibility? VisibilitySectionContent+
;
syntax RecordType = "record" "end"
| "record" VisibilitySection+ "end"
;
The problem is ambiguity. The entire text between “record” and “end” can be parsed as a single VisibilitySection, but every line on its own can also be a separate VisibilitySection.
I can change the rule VisibilitySection to
syntax VisibilitySection = Visibility
| VisibilitySectionContent
;
Then the grammar is no longer ambiguous, but the VisibilitySection becomes flat: the variable lines are no longer nested under an optional 'private' node, which is what I would prefer.
Any suggestions on how to solve this problem? What I would like to do is demand a longest/greedy match on the VisibilitySectionContent+ symbol of VisibilitySection.
But changing
syntax VisibilitySection = Visibility? VisibilitySectionContent+
to
syntax VisibilitySection = Visibility? VisibilitySectionContent+ !>> VisibilitySectionContent
does not seem to work for this.
I also ran the Ambiguity report tool on Rascal, but it does not provide me any insights.
Any thoughts?
Thanks
I can't check since you did not provide the full grammar, but I believe this should work to get your "longest match" behavior:
syntax VisibilitySection
= Visibility? VisibilitySectionContent+ ()
>> "public"
>> "private"
>> "published"
>> "protected"
>> "end"
;
In my mind this should remove the interpretation where your nested VisibilitySections are cut short. Now we only accept such sections if they are immediately followed by either the end of the record, or the next section. I'm curious to find out if it really works because it is always hard to predict the behavior of a grammar :-)
The () at the end of the rule (empty non-terminal) makes sure we can skip to the start of the next part before applying the restriction. This only works if you have a longest match rule on layout already somewhere in the grammar.
The VisibilitySectionContent+ in VisibilitySection should be VisibilitySectionContent (without the Kleene plus).
I’m guessing here, but your intention is probably to allow a number of sections/declarations within the record type, and any of those may or may not have a Visibility modifier. To avoid putting this optional Visibility in every section, you have created a VisibilitySectionContent nonterminal which basically models “things that can happen within the record type definition”, one thing per nonterminal, without worrying about visibility modifiers. In this case, you’re fine with one VisibilitySectionContent per VisibilitySection since there is explicit repetition when you refer to the VisibilitySection from the RecordType anyway.
Let's say I'm writing a parser that parses the following syntax:
foo.bar().baz = 5;
The grammar rules look something like this:
program: one or more statement
statement: expression followed by ";"
expression: one of:
- identifier (\w+)
- number (\d+)
- func call: expression "(" ")"
- dot operator: expression "." identifier
Two expressions have a problem: the func call and the dot operator. Both are left-recursive, meaning they look for another expression at their start, which causes a stack overflow. I will focus on the dot operator for this question.
We face a similar problem with the plus operator. However, rather than using an expression you would do something like this to solve it (look for a "term" instead):
add operation: term "+" term
term: one of:
- number (\d+)
- "(" expression ")"
The term then includes everything except the add operation itself. To ensure that multiple plus operators can be chained together without using parentheses, one would rather do:
add operation: term, one or more of ("+" followed by term)
I was thinking a similar solution could work for the dot operator or for function calls.
However, the dot operator works a little differently. We always evaluate from left to right and need to allow full expressions so that you can do function calls etc. in between. With parentheses, an example might be:
(foo.bar()).baz = 5;
Unfortunately, I do not want to require parentheses, which is what would happen if I followed the method used for the plus operator.
How could I go about implementing this?
Currently my parser never peeks ahead, but even if I do look ahead, it still seems tricky to accomplish.
The easy solution would be to use a bottom-up parser which doesn't drop into a bottomless pit on left recursion, but I suppose you have already rejected that solution.
I don't understand your objection to using a looping construct, though. Postfix modifiers like field lookup and function call are not really different from binary operators like addition (except, of course, for the fact that they will not need to claim an eventual right operand). Plus and minus intermingle freely, which you can parse with a repetition like:
additive: term ( '+' term | '-' term )*
Similarly, postfix modifiers can be easily parsed with something like:
postfixed: atom ( '.' ID | '(' opt-expr-list ')' )*
I'm using a form of extended BNF: parentheses group; | separates alternatives and binds less tightly than concatenation; and * means "zero or more repetitions" of the atom on its left.
Another postfix operator which falls into the same category is array/map subscripting ('[' expr ']'), although you might also have other postfix operators.
Note that like the additive syntax above, selecting the appropriate alternative does not require looking beyond the next token. It's hard to parse without being able to peek one token into the future. Fortunately, that's very little overhead.
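As a rough illustration of the postfixed rule above, here is a minimal recursive-descent sketch in Python. The tokenizer, the Parser class, and the tuple-shaped nodes are all hypothetical choices made for this example, and function calls take no arguments, as in the question's grammar. The loop only ever inspects the next token, and it builds the tree left to right, so no left recursion is needed.

```python
import re

# Hypothetical tokenizer: words/numbers and a few single-character symbols.
TOKEN = re.compile(r"\s*(\w+|[().;=])")

def tokenize(src):
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # One token of lookahead; None at end of input.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def atom(self):
        tok = self.peek()
        if tok is None or not re.fullmatch(r"\w+", tok):
            raise SyntaxError(f"expected identifier or number, got {tok!r}")
        self.pos += 1
        return ("num", tok) if tok.isdigit() else ("id", tok)

    def postfixed(self):
        # postfixed: atom ( '.' ID | '(' ')' )*
        node = self.atom()
        while self.peek() in (".", "("):
            if self.peek() == ".":
                self.pos += 1
                kind, name = self.atom()
                if kind != "id":
                    raise SyntaxError("expected identifier after '.'")
                node = ("dot", node, name)   # wrap what we have so far
            else:
                self.pos += 1
                self.expect(")")
                node = ("call", node)        # wrap what we have so far
        return node

def parse_postfixed(src):
    return Parser(tokenize(src)).postfixed()
```

Each iteration wraps the node built so far, which is what gives foo.bar().baz the same shape as (foo.bar()).baz without requiring the parentheses.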
One way could be for the dot operator to parse a non-dot expression, that is, a rule that is the same as expression but without the dot operator. This prevents recursion.
Then, when the non-dot expression has been parsed, check if a dot and an identifier follows. If this is not the case, we are done. If this is the case, wrap the current node up in a dot operation node. Then, keep track of the entire string text that has been parsed for this operation so far. Then revert everything back to before the operation was being parsed, and now re-parse a "custom expression", where the first directly-nested expression would really be trying to match the exact string that was parsed before rather than a real expression. Repeat until there are no more dot-identifier pairs (this should happen automatically by the new "custom expression").
This is messy, complicated and possibly slow, and I'm not entirely sure if it'll work but I'll try it out. I'd appreciate alternative solutions.
I have a syntax definition that looks like this
keyword LegendOperation = 'or' | 'and';
syntax LegendData
= legend_data: LegendKey '=' {ID LegendOperation}+ Newlines
;
I need to implode this in a way that allows me to retain the information on whether the separator for the ID is 'or' or 'and', but I didn't find anything in the docs on whether the separator is retained and if it can be used by implode. Initially, I did something like the below to try and keep that information.
syntax LegendData
= legend_data_or: LegendKey '=' {ID 'or'}+ Newlines
> legend_data_and: LegendKey '=' {ID 'and'}+ Newlines
;
The issue that I run into is that there are three forms of text that it needs to be able to parse
. = Background
# = Crate and Target
# = Crate or Wall
And when it tries to parse the first line it hits an ambiguity error when it should instead parse it as a legend_data_or with a single element (perhaps I misunderstood how to use the priorities). Honestly, I would prefer to be able to use that second format, but is there a way to disambiguate it?
Either a way to implode the first syntax while retaining the separator or a way to disambiguate the second format would help me with this issue.
I did not manage to come up with an elegant solution in the end. Discussing with others the best we could come up with was
syntax LegendOperation
= legend_or: 'or' ID
| legend_and: 'and' ID
;
syntax LegendData
= legend_data: LegendKey '=' ID LegendOperation* Newlines
;
Which works and allows us to retain the information on the separator but requires post-processing to turn into a usable datatype.
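To make the post-processing step concrete, here is a small sketch written in Python purely for illustration (the real step would operate on the imploded Rascal AST). The input shape, a first ID plus a list of (operator, ID) pairs, and the function name fold_legend are assumptions made for this example. It also assumes that mixing 'and' and 'or' within one legend entry should be rejected.

```python
def fold_legend(first_id, operations):
    """Fold a flat list of (operator, id) pairs into (operator, [ids]).

    `operations` mirrors the LegendOperation* part of the rule; the
    operator is None when the entry has a single ID and no separator.
    """
    operators = {op for op, _ in operations}
    if len(operators) > 1:
        # Assumption: one entry uses either 'and' or 'or', never both.
        raise ValueError("mixed 'and'/'or' in one legend entry")
    operator = operators.pop() if operators else None
    return operator, [first_id] + [ident for _, ident in operations]
```

For the three example lines this yields (None, ['Background']), ('and', ['Crate', 'Target']) and ('or', ['Crate', 'Wall']).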
I'm new to Lex and I'm confused on how to declare the following macro, keyword. I want keyword to consist of either "if", "then", "else", or "while."
I typed this in lex:
keyword "if" | "then" | "else" | "while"
but the compiler is giving me an "unrecognized rule error". When I instead do
keyword "if"
It compiles ok.
Is this just a limitation of Lex? I know in jflex you can do what I did above and it'll work fine. Or am I doing it incorrectly?
Thanks
I can't test this right now, but off the top of my head:
Try putting the values in parentheses (before the first %%)
keyword ("if"|"then"|"else"|"while")
And then use it in rules like this (between %% and %%):
{keyword} { /* action */ }
This is how you define a named pattern in lex, so in the rest of the code you can use {keyword} and it will be recognized as the regex you've assigned in the definition section (before the first %%).
Also, you can use a definition as part of other regexes:
{keyword}\{[^\}]*\} { /* action */ }
This recognizes a keyword immediately followed by a whole block of code, as long as the block contains no nested braces. (But it doesn't check the syntax inside the block, I leave that to you :) )
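The same two ideas, grouping the alternatives in parentheses and reusing the named pattern inside a larger one, can be tried out quickly with Python's re module. This is only an analogy to the lex definitions, not lex itself: the names KEYWORD and BLOCK are made up for this sketch, and the block pattern uses [^}]* so that the braces may contain any number of characters.

```python
import re

# Counterpart of the lex definition: the group keeps the alternation together
# when the pattern is embedded in a larger expression.
KEYWORD = r"(?:if|then|else|while)"

# Counterpart of {keyword}\{...\}: a keyword directly followed by a
# brace-delimited block with no nested braces.
BLOCK = re.compile(KEYWORD + r"\{[^}]*\}")

def is_keyword(text):
    return re.fullmatch(KEYWORD, text) is not None

def is_keyword_block(text):
    return BLOCK.fullmatch(text) is not None
```

Without the grouping, a pattern such as if|then|else|while\{[^}]*\} would attach the block only to the last alternative, which is one reason to parenthesize the definition.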
I started to work with ANTLR a few days ago. I'd like to use it to parse #include directives in C. Only the includes are of interest to me; all other parts are irrelevant. Here is the simple grammar file I wrote:
... parser part omitted...
INCLUDE : '#include';
INCLUDE_FILE_QUOTE: '"'FILE_NAME'"';
INCLUDE_FILE_ANGLE: '<'FILE_NAME'>';
fragment
FILE_NAME: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'.'|' ')+;
MACROS: '#'('if' | 'ifdef' | 'define' | 'endif' | 'undef' | 'elif' | 'else' );
//MACROS: '#'('a'..'z'|'A'..'Z')+;
OPERATORS: ('+'|'-'|'*'|'/'|'='|'=='|'!='|'>'|'>='|'<'|'<='|'>>'|'<<'|'<<<'|'|'|'&'|','|';'|'.'|'->'|'#');
... other supporting tokens like ID, WS and COMMENT ...
This grammar produces an ambiguity when a statement such as this is encountered:
(;i<listLength;i++)
output: mismatched character ';' expecting '>'
It seems to be trying to match INCLUDE_FILE_ANGLE instead of treating the ";" as OPERATORS.
I heard there's an operator called a syntactic predicate, but I'm not sure how to properly use it in this case.
How can I solve this problem in an ANTLR-encouraged way?
Looks like there's not a lot of activity about ANTLR here.
Anyway, I figured this out.
INCLUDE_MACRO: ('#include')=>'#include';
VERSION_MACRO: ('#version')=>'#version';
OTHER_MACRO:
(
('#if')=>'#if'
|('#ifndef')=>'#ifndef'
|('#ifdef')=>'#ifdef'
|('#else')=>'#else'
|('#elif')=>'#elif'
|('#endif')=>'#endif'
);
This only solves the first half of the problem. Secondly, one cannot use INCLUDE_FILE_ANGLE to match the desired string in the #include directive.
The '<'FILE_NAME'>' token creates an ambiguity, so it must either be broken down into basic lexer tokens or be handled with more advanced context-aware checks. I'm not familiar with the latter technique, so I wrote this in the parser rule:
include_statement :
INCLUDE_MACRO include_file
-> ^(INCLUDE_MACRO include_file);
include_file
: STRING
| LEFT_ANGLE(INT|ID|OPERATORS)+RIGHT_ANGLE
;
This works, but it admittedly looks ugly.
I hope experienced users can comment with a much better solution.
I'm learning Bison, and so far the only thing I have done is the rpcalc example. Now I want to implement a print function (like C's printf), with a syntax like print ("Something here");. I don't know how to build the print function, and I don't know how to make the ; act as the end of a line. Thanks for your help.
You first need to ask yourself:
What are the [sub-]parts of my 'print ("something");' syntax ?
Once you identify these parts, "simply" describe them in the form of grammar syntax rules, along with applicable production rules. And then let Bison generate the parser for you; that's about it.
To put you on your way:
The semicolon is probably an element you will use to separate statements (such as one "call" to print from another).
'print' itself is probably a keyword, or preferably a native function name of your language.
The print statement appears to take a literal string as [one of] its arguments. A literal string starts and ends with a double quote (and probably allows for escaped quotes within itself).
etc.
The entities named above (statement separator, keyword, function name, literal string, argument) are some of the 'symbols', in parser lingo, that you'll likely need to define in the syntax for your language. For that you'll use Bison grammar rules, such as
stmt : print_stmt ';' | input_stmt ';'| some_other_stmt ';' ;
print_stmt : print '(' args ')'
{ printf( "%s", $3 ); } /* assumes the value of args is a C string */
;
args : arg | arg ',' args;
...
Since the question asked about the semicolon, maybe some confusion came from its different uses; see for example above how the ';' (in quotes) belongs to your language's syntax, whereas the ; (no quotes) at the end of each grammar rule is part of Bison's own language.
Note: this is of course a simplistic implementation, aimed at showing the essentials. Also the Bison syntax may be a tad off (been there / done it, but a long while back ;-) I then "met" ANTLR, never to return to Bison, although I do see how its lightweight and fully self-contained nature can make it appropriate in some cases)