Antlr Indirect Left Recursion - parsing

I've seen this question asked multiple times, and also seen people "solve" it... but it either confused me or didn't solve my specific situation:
Here's approximately what's going on:
block: statement*;
statement: <bunch of stuff> | expression_statement;
expression_statement: <more stuff, for example> | method_invoke;
method_invoke: expression LEFT_PAREN <args...> RIGHT_PAREN block;
expression: <bunch of stuff> | expression_statement;
Everything inside expression_statement that starts with an expression is indirectly left-recursive, and I don't know how to fix that while still being able to use those constructs as statements, so that they can appear in blocks. It should be possible to write something like Print("hello world");
on its own (a statement), but also to write something like int c = a + b.getValue() where the call is part of an expression.
How would I handle it differently?
If you need more info, please let me know and I'll try my best to provide it.

I knew that to solve indirect left recursion I'd have to duplicate one or more of the rules. I hoped there would be a better way to handle it than what is written online and also said here, but there isn't. I ended up doing that and it worked, thank you.
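For readers who want to see what a left-recursion-free shape looks like, here is a minimal sketch as a hand-written Python parser (not ANTLR). It assumes a toy grammar in which a primary expression is followed by a loop of call and member suffixes, and a statement is an expression whose outermost node is a call. The token shapes and all names (tokenize, Parser, the node tags) are hypothetical.

```python
import re

# Numbers, names, or any single punctuation character.
TOKEN = re.compile(r'\s*(?:(\d+)|(\w+)|(.))')

def tokenize(src):
    tokens = []
    for num, name, op in TOKEN.findall(src):
        if num:
            tokens.append(('num', num))
        elif name:
            tokens.append(('name', name))
        elif op.strip():          # skip stray whitespace
            tokens.append(('op', op))
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ('eof', '')

    def take(self, value=None):
        text = self.peek()[1]
        if value is not None and text != value:
            raise SyntaxError(f'expected {value!r}, got {text!r}')
        self.pos += 1
        return text

    def primary(self):
        kind, text = self.peek()
        if kind not in ('num', 'name'):
            raise SyntaxError(f'unexpected {text!r}')
        self.pos += 1
        return (kind, text)

    def postfix(self):
        # primary ('.' NAME | '(' args ')')*  -- a loop instead of left recursion
        node = self.primary()
        while self.peek()[1] in ('.', '('):
            if self.take() == '.':
                node = ('member', node, self.take())
            else:
                args = []
                if self.peek()[1] != ')':
                    args.append(self.expression())
                    while self.peek()[1] == ',':
                        self.take(',')
                        args.append(self.expression())
                self.take(')')
                node = ('call', node, args)
        return node

    def expression(self):
        # postfix ('+' postfix)*
        node = self.postfix()
        while self.peek()[1] == '+':
            self.take('+')
            node = ('add', node, self.postfix())
        return node

    def statement(self):
        # A statement reuses the expression rule but only accepts a call.
        node = self.expression()
        if node[0] != 'call':
            raise SyntaxError('only a call may be used as a statement')
        self.take(';')
        return ('expr_stmt', node)
```

With this shape, Parser(tokenize('print(1);')).statement() and Parser(tokenize('a + b.getValue()')).expression() both go through the same postfix rule, so the call syntax is written once and used in both positions.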

Related

How to implode an embedded choice of alternative symbols?

I have this syntax definition
syntax RuleData
= rule_data: ID+ RulePart '-\>' (Command|RulePart)+ Message? Newlines
;
For the most part this doesn't cause me any issues. The only problem is that I'm not sure how to implode (Command|RulePart)+. I looked through the Rascal docs but didn't find anything on how to define "union" types.
This is what my ADT looks like currently
data RULEDATA
= rule_data(list[str] prefix, list[RULEPART] left, list[???] right, list[str] message, str)
;
The ??? is the bit where it could either be a RulePart (which, for the sake of simplicity, is a list[str]) or a Command (which is a str).
Turns out I was overcomplicating the whole thing. Instead of trying to have a union of types, I simply added an additional constructor to RULEPART that could accommodate COMMAND. I think I may also have been confused initially because I had bugs in other parts of the code that I misinterpreted as being caused by this problem.
data RULEDATA
= rule_data(list[str] prefix, list[RULEPART] left, list[RULEPART] right, list[str] message, str)
;
data RULEPART
= part(list[RULECONTENT] contents)
| command(str command)
;

whitespace in flex patterns leads to "unrecognized rule"

The flex info manual says that whitespace is allowed in regular expressions when the "x" modifier is used in the (?r-s:pattern) form. It specifically offers a simple example (without whitespace)
(?:foo) same as (foo)
but the following program fails to compile with the error "unrecognized rule":
BAD (?:foo)
%%
{BAD} {}
I cannot find any form of (? that is acceptable as a rule pattern. Is the manual in error, or do I misunderstand?
The example in your question does not seem to reflect the question itself, since it shows neither the use of whitespace nor an x flag. So I'm going to assume that the pattern which is failing for you is something like
BAD (?x:two | lines |
of | words)
%%
{BAD} { }
And, indeed, that will not work. Although you can use extended format in a pattern, you can only use it in a definition if it doesn't contain a newline. The definition terminates at the last non-whitespace character on the definition line.
Anyway, definitions are overused. You could write the above as
%%
(?x:two | lines |
of | words ) { }
Which saves anyone reading your code from having to search for a definition.
I do understand that you might want to use a very long pattern in a rule, which is awkward, particularly if you want to use it twice. Regardless of the issue with newlines, this tends to run into problems with Flex's definition length limit (2047 characters). My approach has been to break the very long pattern into a series of definitions, and then define another symbol which concatenates the pieces.
Before v2.6, Flex did not chop whitespace off the end of the definition line, which also leads to mysterious "unrecognized rule" errors. The manual seems to still reflect the v2.5 behaviour:
The definition is taken to begin at the first non-whitespace character following the name and continuing to the end of the line.

How to parse and translate DSL using Red or Rebol

I'm trying to see if I can use Red (or Rebol) to implement a simple DSL. I want to compile my DSL to source code for another language, perhaps Red or C# or both - rather than directly interpreting and executing it.
The DSL has only a couple of simple statements, plus an if/else statement.
Statements can be grouped into rules. A rule would get translated into a function definition, with each statement the equivalent statement in the target language.
The parse capability in Red/Rebol is great and lets me implement a parser very easily - in effect it's basically just the definition of the grammar itself.
However I haven't been able to find any examples of how to take the next steps, specifically handling an if statement and translating it to other source code.
Translating an if statement seems a good example of something minimal but still slightly tricky - because in Red having an else means you need to change the if to an either, rather than just an extra optional else.
Traditionally, during parsing I would build an abstract syntax tree, and then have functions to operate on the AST and generate the new source code. Should I be following this same approach or is there some other more idiomatic way in Red ?
I've experimented with using collect/keep in my parse rules to return a block of nested blocks, which in effect forms the AST. Another approach would be to save data into specific objects representing the different statements etc.
I'm still getting to grips with collect/keep, as to when a new block will be created and what will be kept. I'd also like to keep my parser rules as "clean" as possible, with as little other code intertwined in it. So I'm still not sure how best to add in Red code in round brackets in the parse rules. Adding code too early can cause the Red code to get executed, even if the rule eventually fails. Adding code too late means the code may not be executed in the order you expect, especially when dealing with multi-level statements like if, which can contain other statements.
So, specifically, any help on how to translate my example DSL to Red source code would be appreciated. Also any links to implementing DSLs like this in Red or Rebol would be great ! :)
Here are my parse rules :-
Red [
Purpose: example rules for parsing a simple language
]
SimpleLanguageParser: make object! [
Expr: [string! | integer! | block!]
Data: ['Person.AGE | 'Person.INCOME]
WriteMessageToLog: ['write 'message 'to 'log Expr]
SetData: ['set 'data Data '= Expr]
IfStatement: ['if Expr [any Statement] opt ['else [any Statement]] 'endif]
Statement: [WriteMessageToLog | SetData | IfStatement]
Rule: [
'rule word!
[any Statement]
'endrule
]
AnySimpLeLanguage: [Rule | [any Statement]]
]
SL: function [slInput] [
parse slInput SimpleLanguageParser/AnySimpleLanguage
]
An example of some source in the DSL :-
RULE TooYoung
IF [Person.Age < 15]
WRITE MESSAGE TO LOG "too young to earn an income"
SET DATA Person.Income = 0
ELSE
WRITE MESSAGE TO LOG "old enough"
ENDIF
ENDRULE
Translated to Red source code :-
TooYoung: function [] [
either Person.Age < 15 [
WriteMessageToLog "too young to earn an income"
Person.Income: 0
] [
WriteMessageToLog "old enough"
]
]
The data, ie Person.Age, Person.Income, and the function WriteMessageToLog are all things which would have been previously defined.
Note, for simplicity I've left Expr as block! etc, rather than defining Expr in any more detail in the DSL itself. Also, setting Person.Income in the function doesn't work as coded as it sets a local - but that's ok for now :)
Always nice to see someone digging language-oriented programming, keep it up and welcome to Red! ;)
Specifying correct grammar rules is the trickiest part of the job, and you've already nailed that. What's left is to intersperse your PEG (parsing expression grammar) with set, copy, collect/keep combo and paren! expressions in the right places, and then either create an AST from that or, in simpler cases, emit code directly.
Example
Here's a quickly baked (and definitely buggy!) example of how I'd tackle your task. Basically, it's a slightly reworked version of your code, where matched patterns are set, copied or collected, then bound to specific words, which are then just pasted into a "template" (the compose function inside emit-rule) to produce Red code.
It's not the only way, I believe. #rebolek might come up with a more industrial-strength solution, as he has experience with sophisticated parsers, which I'm lacking :P
Followup
As for the if/else dilemma, I followed the approach proposed above -- instead of using opt, I wrapped the rule for the else-branch in a block and added an alternative match, which just sets false-block to none.
What to use for the AST -- anything that allows you to express a hierarchical structure, which is either a block! (though for a performance gain you might want to use hash! or map!) or an object!. The advantage of object! is that it provides a context to be bound to, but here we're approaching the realm of so-called Bindology ("scoping" rules of the Red language), which is another beast :)
Emitting C# code would be harder, but doable -- you'll need to assemble a string instead of Red code. I think, however, that as a newcomer you should stick with parsing directly at block level (the way you did in your example), because it's a lot easier and much more expressive.
Another interesting (but much hairier) approach would be to redefine all the words used in your DSL block to work as you want.
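To illustrate the AST-then-emit route, and the string assembly that a non-Red target would need, here is a minimal sketch in Python rather than Red. The tuple-based AST shape and the function name emit are assumptions made for illustration. The only point is that the presence of an else branch decides between if and either.

```python
def emit(node, indent=0):
    """Return the Red source lines for one AST node (hypothetical AST shape)."""
    pad = "    " * indent
    kind = node[0]
    if kind == "rule":
        _, name, body = node
        lines = [f"{pad}{name}: function [] ["]
        for stmt in body:
            lines += emit(stmt, indent + 1)
        return lines + [f"{pad}]"]
    if kind == "log":
        return [f'{pad}WriteMessageToLog "{node[1]}"']
    if kind == "set":
        return [f"{pad}{node[1]}: {node[2]}"]
    if kind == "if":
        _, cond, then_body, else_body = node
        # An else branch turns Red's `if` into `either`.
        keyword = "either" if else_body is not None else "if"
        lines = [f"{pad}{keyword} {cond} ["]
        for stmt in then_body:
            lines += emit(stmt, indent + 1)
        if else_body is not None:
            lines.append(f"{pad}] [")
            for stmt in else_body:
                lines += emit(stmt, indent + 1)
        return lines + [f"{pad}]"]
    raise ValueError(f"unknown node {kind!r}")

# The TooYoung rule from the question, written out as an AST by hand.
ast = ("rule", "TooYoung", [
    ("if", "Person.Age < 15",
     [("log", "too young to earn an income"), ("set", "Person.Income", "0")],
     [("log", "old enough")]),
])
red_source = "\n".join(emit(ast))
print(red_source)
```

Because the decision is made on the finished node rather than during matching, nothing is emitted for a rule that later fails, which sidesteps the "code executed too early" concern from the question.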
Resources
We have a wiki entry about Red/Rebol dialects on GitHub, which you might find, if not useful, then at least interesting to read.
Also, two articles (this and this) in Red blog, I think you skimmed over first one already (if not, you should!).
Last, but not least, an exhaustive review of Parse principles and keywords (which has a couple of wrong parts in it though, so, caveat emptor). It's written for Rebol, but you should adapt examples to Red rather easily.
As a relative newcomer to the language, I do agree that there's a lack of examples and tutorials about DSL development, but we're working on that (at least in our heads) :)
Taking 9214's answer as a starting point, I've coded one possible solution. My approach has been :-
- try to keep the parse rules as "clean" as possible
- use collect and keep to return a block as the result, rather than trying to build a more complex AST
- do some minimal translation in the keeps
- the resulting block should be valid Red code which uses predefined functions, where any more complex processing needs to happen
Most simple statements are easily translated to functions eg WRITE MESSAGE TO LOG becomes SL_WriteMessageToLog which can then do whatever it needs to do.
More complicated statements with structure, eg If/Else become functions which take block parameters which can then process the blocks as required.
For the If/Else complication, I've made this into two separate functions, SL_If and SL_Else. SL_If stores the result of the condition in a sequence, and SL_Else checks the latest result and removes it. This allows for nested If/Elses.
The presence of the final endrule can be checked for to ensure the input was correctly parsed. Once this is removed, we should have a valid function definition.
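The SL_If / SL_Else bookkeeping described above can be sketched in Python to make the stack behaviour easier to follow. This is an illustration only: conditions and statement blocks are modelled as hypothetical zero-argument callables, and the names sl_if, sl_else and if_results are stand-ins for the Red functions.

```python
if_results = []   # condition results, most recent first
log = []

def sl_if(cond, stats):
    result = cond()
    if_results.insert(0, result)   # remember the result for a later else
    if result:
        stats()

def sl_else(stats):
    result = if_results.pop(0)     # consume the most recent if result
    if not result:
        stats()

def outer_else():
    # A nested if/else inside the outer else branch.
    sl_if(lambda: True, lambda: log.append("inner then"))
    sl_else(lambda: log.append("inner else"))

sl_if(lambda: False, lambda: log.append("outer then"))
sl_else(outer_else)
print(log)
```

One thing this sketch makes visible: an if with no matching else never removes its result, so a nested if without an else would leave a stale entry for the enclosing else to pick up. Popping the result at the end of every if statement that has no else would be one way to guard against that.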
Here's the code :-
Red [
Purpose: example rules for parsing and translating a simple language
]
; some data
Person.AGE: 0
Person.INCOME: 0
; functions to implement some simple SL statements
SL_WriteMessageToLog: function [value] [
print value
]
SL_SetData: function [parmblock] [
field: parmblock/1
value: parmblock/2
if word? value [ ; "type? value = word!" would evaluate as "type? (value = word!)"
value: do value
]
print ["old value" field "=" do field]
set field value
print ["new value" field "=" do field]
]
; hold the If condition results, to be used to determine whether or not to do Else
IfConditionResults: []
SL_If: function [cond stats] [
cond_result: do cond
head insert IfConditionResults cond_result
if cond_result stats
]
SL_Else: function [stats] [
cond_result: first IfConditionResults
remove IfConditionResults
if not cond_result stats
]
; parsing rules
SimpleLanguageParser: make object! [
Expr: [logic! | string! | integer! | block!]
Data: ['Person.AGE | 'Person.INCOME]
WriteMessageToLog: ['write 'message 'to 'log set x Expr keep ('SL_WriteMessageToLog) keep (x)]
SetData: ['set 'data set d Data '= set x Expr keep ('SL_SetData) keep (reduce [d x])]
IfStatement: ['if keep ('SL_If) keep Expr collect [any Statement] opt ['else keep ('SL_Else) collect [any Statement]] 'endif]
Statement: [WriteMessageToLog | SetData | IfStatement]
Rule: [collect [
'rule set fname word! keep (to set-word! fname) keep ('does)
collect [any Statement]
keep 'endrule
]
]
AnySimpLeLanguage: [Rule | [any Statement]]
]
SL: function [slInput] [
parse slInput SimpleLanguageParser/Rule
]
For the example in the original post, the output is :-
TooYoung: does [
SL_If [Person.Age < 15] [
SL_WriteMessageToLog "too young to earn an income"
SL_SetData [Person.Income 0]
]
SL_Else [
SL_WriteMessageToLog "old enough"
]
]
ENDRULE
Thanks for your help to get this far.
Feedback on this approach and solution would be appreciated :)

Help with Shift/Reduce conflict - Trying to model (X A)* (X B)*

I'm trying to model the EBNF expression
("declare" "namespace" ";")* ("declare" "variable" ";")*
I have built up the yacc grammar (I'm using MPPG), which seems to represent this, but it fails to match my test expression.
The test case I'm trying to match is
declare variable;
The Token stream from the lexer is
KW_Declare
KW_Variable
Separator
The grammar parse says there is a "Shift/Reduce conflict, state 6 on KW_Declare". I have attempted to solve this with "%left PrologHeaderList PrologBodyList", but neither solution works.
Program : Prolog;
Prolog : PrologHeaderList PrologBodyList;
PrologHeaderList : /*EMPTY*/
| PrologHeaderList PrologHeader;
PrologHeader : KW_Declare KW_Namespace Separator;
PrologBodyList : /*EMPTY*/
| PrologBodyList PrologBody;
PrologBody : KW_Declare KW_Variable Separator;
KW_Declare, KW_Namespace, KW_Variable and Separator are all tokens, with the values "declare", "namespace", "variable" and ";".
It's been a long time since I've used anything yacc-like, but here are a couple of suggestions that may or may not help.
It seems that you need a 2-token lookahead in this situation. The parser gets to the last PrologHeader, and it has to decide whether the next construct is a PrologHeader or a PrologBody, and it can't tell that from the KW_Declare. If there's a directive to increase lookahead in this situation, it will probably solve the problem.
You could also introduce context into your actions: rather than define PrologHeaderList and PrologBodyList, define PrologRuleList and have the actions throw an error if a header appears after a body. Ugly, but sometimes you have to do it: what appears simple in a grammar may not be simple in the generated parser.
A hackish approach might be to combine the tokens: rather than KW_Declare and KW_Variable, have your lexer recognize the space and use KW_Declare_Variable. Since both are keywords, you're not going to run into namespace collision problems.
The grammar at the top is regular, so IIRC you can plot it out as a DFA (or an NFA and convert it to a DFA) and then convert the DFA to a grammar. It's been a while, so I'll leave the work as an exercise for the reader.
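Both the single-list suggestion and the DFA view come down to the same two-state check: namespace declarations are allowed until the first variable declaration has been seen. A minimal Python sketch of that check over the token stream follows. The token names are taken from the question, while the function name check_prolog is hypothetical, and this is a hand-written recognizer, not MPPG output.

```python
def check_prolog(tokens):
    """Accept (declare namespace ;)* (declare variable ;)* with one loop."""
    seen_body = False                     # False: headers allowed, True: bodies only
    i = 0
    while i < len(tokens):
        # Every declaration is exactly three tokens: KW_Declare <kind> Separator.
        if tokens[i:i + 1] != ["KW_Declare"] or tokens[i + 2:i + 3] != ["Separator"]:
            raise SyntaxError(f"malformed declaration at token {i}")
        kind = tokens[i + 1]
        if kind == "KW_Variable":
            seen_body = True
        elif kind == "KW_Namespace":
            if seen_body:
                raise SyntaxError("namespace declaration after variable declaration")
        else:
            raise SyntaxError(f"unexpected token {kind!r} at {i + 1}")
        i += 3
    return True
```

In a yacc-style grammar the same flag would live in the semantic action of a single declaration-list rule, which removes the need to decide between the two lists on seeing KW_Declare alone.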

javacc parseException... lookahead problem?

I'm writing a parser for a very simple grammar in javacc. It's beginning to come together but at the moment I'm completely stuck on this error:
ParseException: Encountered "" at line 4, column 15.
Was expecting one of:
The line of input in question is z = y + z + 5
and the production that is giving me problems is my expression, which gets called from
varDecl():
<ID> <EQL> expression()
Expression looks like this:
<VAR> (<PLUS> expression())?| <NUM> (<PLUS> expression())?
| call() (<PLUS> expression())?
I'm at a loss as to why I'm getting this error - any insight would be greatly appreciated.
Hm, yes, that's not a very helpful error from JavaCC. What version of JavaCC are you using?
Also, it's difficult to troubleshoot these problems without seeing the full grammar, although I understand you might not be in a position to post that.

Resources