ANTLR doesn't find the defined start rule - parsing

I'm facing a strange ANTLR issue with a grammar that should just output an AST.
grammar ltxt.g;
options
{
language=CSharp3;
}
prog : start
;
start : '{Start 'loopname'}'statement'{Ende 'loopname'}'
| statement
;
loopname : (('a'..'z')|('A'..'Z')|('1'..'9'))*;
statement : '<%' table_ref '>'
| start;
table_ref : '{'format'}'ID;
format : FSTRING
| FSTRING OFSTRING{0,5}
;
FSTRING : '#F'
| '#D'
| '#U'
| '#K'
;
OFSTRING: 'F'
| 'D'
| 'U'
| 'K'
//| 1..65536
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
When I try to code-gen this I get
error(100):LTXT.g:1:13:syntax error: antlr: MismatchedTokenException(74!=52). I didn't declare any 74 or 52.
Also, I do not get a syntax diagram, since rule "start" cannot be found as a start state...
I know that this isn't pretty, but I thought it would work at least :)
Best,
wishi

There are four errors that I see.
A grammar name can't contain a period. That's the syntax error you're getting. The 74!=52 error message is a hint telling you that ANTLR found token id 74 when it was expecting token id 52, which in this case just translates to "it found one thing when it expected something else."
The grammar name ("ltxt") and the file name before the extension ("LTXT") need to match exactly.
The grammar won't produce an AST unless you specify output=AST; in the options section.
format's second alternative (FSTRING OFSTRING{0,5}) won't do what I think you think it's going to do. ANTLR doesn't support an arbitrary number of matches such as "match zero to five OFSTRINGs". You'll need to redefine the rule using semantic predicates that count occurrences for you. They aren't hard to use, but they're one of the trickier parts of ANTLR.
I hope that helps get you started.
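To make the counting idea in the fourth point concrete, here is a minimal sketch in plain Python rather than in ANTLR action syntax. The function name match_format and the token lists are hypothetical; the loop condition stands in for the predicate that stops the rule after five OFSTRING tokens.

```python
FSTRINGS = {"#F", "#D", "#U", "#K"}
OFSTRINGS = {"F", "D", "U", "K"}
MAX_OFSTRINGS = 5  # upper bound taken from the {0,5} in the question


def match_format(tokens):
    """Return how many leading tokens the `format` rule consumes.

    Raises ValueError when the input does not start with an FSTRING.
    """
    if not tokens or tokens[0] not in FSTRINGS:
        raise ValueError("expected FSTRING")
    count = 0
    pos = 1
    # The `count < MAX_OFSTRINGS` test plays the role of the counting
    # predicate: OFSTRING tokens are consumed only while the cap holds.
    while pos < len(tokens) and tokens[pos] in OFSTRINGS and count < MAX_OFSTRINGS:
        count += 1
        pos += 1
    return pos
```

With seven OFSTRING tokens after the FSTRING, only the first five are consumed and the rest are left for whatever rule comes next.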

Related

xtext not accepting string constant - expecting RULE_ID

I have tried to cut down my problem to the simplest problem I can in xtext - I would like to use the following grammar:
M: lines += T*;
T:
DT
| BDT
| N
;
BDT:
name = ('a' | 'b' | 'c')
;
DT:
'd' name=ID
('(' (ts += BDT (','ts += BDT)*) ')')?
;
N:
'n' name=ID ':' type=[T]
;
I am intending to parse expressions of the form d f(a,b,b) for example which works fine. I would also like to be able to parse n g:f which also works, but not n g:a - where a here is part of the BDT rule. The error given is "Missing RULE_ID at 'a'".
I'd like to allow the grammar to parse n g:a for example, and I'd be very grateful if anyone could point out where I'm going wrong here on this very simple grammar.
Lexing is done context free. A keyword can never be an ID. You can address this through parser rules.
You can introduce a datatype rule
MyID: ID | "a" | ... | "c";
And use it where you use ID
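The effect can be illustrated outside Xtext with a small Python sketch. This is not Xtext code; tokenize and is_my_id are hypothetical helpers. The lexer classifies 'a' as a keyword no matter where it appears, and only a parser-level rule like MyID can accept it where a name is expected.

```python
import re

KEYWORDS = {"a", "b", "c", "d", "n"}  # string literals used in the grammar
TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|[():,])")


def tokenize(text):
    """Context-free lexing: a word that is a keyword is never an ID."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        word = m.group(1)
        if word in KEYWORDS:
            kind = "KEYWORD"
        elif word[0].isalpha() or word[0] == "_":
            kind = "ID"
        else:
            kind = "PUNCT"
        tokens.append((kind, word))
        pos = m.end()
    return tokens


def is_my_id(token):
    """Parser-level rule: MyID accepts an ID or one of the keywords a, b, c."""
    kind, word = token
    return kind == "ID" or (kind == "KEYWORD" and word in {"a", "b", "c"})
```

For the input n g:a the last token comes out as a keyword, which is why a rule expecting a bare ID rejects it while is_my_id accepts it.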

ANTLR grammar for multi-level text segmentation

I want to create a grammar that will parse a text file and create a tree of levels according to configurable "segmentors". This is what I have created so far, it kind of works, but will halt when a "segmentor" appears in the beginning of a text. For example, text "and location" will fail to parse. Any ideas?
Also, I'm pretty certain that the grammar could be greatly improved, so any suggestions are welcome.
grammar DocSegmentor;
@header {
package segmentor.antlr;
}
// PARSER RULES
levelOne: (levelTwo LEVEL1_SEG*)+ ;
levelTwo: (levelThree+ LEVEL2_SEG?)+ ;
levelThree: (levelFour+ LEVEL3_SEG?)+ ;
levelFour: (levelFive+ LEVEL4_SEG?)+ ;
levelFive: tokens;
tokens: (DELIM | PAREN | TEXT | WS)+ ;
// LEXER RULES
LEVEL1_SEG : '\r'? '\n'| EOF ;
LEVEL2_SEG : '.' ;
LEVEL3_SEG : ',' ;
LEVEL4_SEG : 'and' | 'or' ;
DELIM : '`' | '"' | ';' | '/' | ':' | '’' | '‘' | '=' | '?' | '-' | '_';
PAREN : '(' | ')' | '[' | ']' | '{' | '}' ;
TEXT : (('a'..'z') | ('A'..'Z') | ('0'..'9'))+ ;
WS : [ \t]+ ;
I'd definitely go with a Scala parser combinator library.
https://lihaoyi.github.io/fastparse/
https://github.com/scala/scala-parser-combinators
Those are just two examples for a library you can write by hand with little effort and tune to whatever you need. I should mention that you should go with Scalaz (https://github.com/scalaz/scalaz) if you're writing a parser monad on your own.
I wouldn't use a parser at all for that task. All you need is keyword spotting.
It's much easier and more flexible if you just scan your text for the "segmentors" by walking over the input. This also lets you handle text of any size (e.g. by using memory-mapped files), while parsers usually (ANTLR for sure) load the entire text into memory and tokenize it fully before parsing starts.
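A minimal sketch of that keyword-spotting approach, assuming the four segmentor levels from the question's lexer rules. The function name segment and the nested-list result shape are my own choices, and for brevity it works on an in-memory string rather than a memory-mapped file.

```python
import re

# Segmentors from the question's lexer rules, outermost level first.
LEVEL_PATTERNS = [
    r"\r?\n",           # LEVEL1_SEG
    r"\.",              # LEVEL2_SEG
    r",",               # LEVEL3_SEG
    r"\b(?:and|or)\b",  # LEVEL4_SEG, matched as whole words only
]


def segment(text, level=0):
    """Split text on the segmentor of each level, returning nested lists."""
    if level == len(LEVEL_PATTERNS):
        return text.strip()
    parts = re.split(LEVEL_PATTERNS[level], text)
    # Empty pieces (e.g. before a leading segmentor) are simply dropped.
    return [segment(part, level + 1) for part in parts if part.strip()]
```

Because empty pieces are dropped, a segmentor at the very start of the text, as in "and location", is not a special case.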

ANTLR 4 Parser Grammar

How can I improve my parser grammar so that, instead of producing an AST that contains a couple of decFunc rules for my test code, it creates only one, with sum becoming the second root? I have tried several approaches, but I always get a left-recursion error.
This is my testing code :
f :: [Int] -> [Int] -> [Int]
f x y = zipWith (sum) x y
sum :: [Int] -> [Int]
sum a = foldr(+) a
This image shows the tree with two decFunc nodes:
http://postimg.org/image/w5goph9b7/
This is my grammar:
prog : stat+;
stat : decFunc | impFunc ;
decFunc : ID '::' formalType ( ARROW formalType )* NL impFunc
;
anotherFunc : ID+;
formalType : 'Int' | '[' formalType ']' ;
impFunc : ID+ '=' hr NL
;
hr : 'map' '(' ID* ')' ID*
| 'zipWith' '(' ('*' |'/' |'+' |'-') ')' ID+ | 'zipWith' '(' anotherFunc ')' ID+
| 'foldr' '(' ('*' |'/' |'+' |'-') ')' ID+
| hr op=('*'| '/' | '.&.' | 'xor' ) hr | DIGIT
| 'shiftL' hr hr | 'shiftR' hr hr
| hr op=('+'| '-') hr | DIGIT
| '(' hr ')'
| ID '(' ID* ')'
| ID
;
Your test input contains two instances of content that will match the decFunc rule. The generated parse-tree shows exactly that: two sub-trees, each having a decFunc as the root.
Antlr v4 will not produce a true AST where f and sum are the roots of separate sub-trees.
Is there anything I can do with the grammar to make both f and sum roots? – Jonny Magnam
Not directly in an Antlr v4 grammar. You could:
switch to Antlr v3, or another parser tool, and define the generated AST as you wish.
walk the Antlr v4 parse-tree and create a separate AST of your desired form.
just use the parse-tree directly with the realization that it is informationally equivalent to a classically defined AST and the implementation provides a number of practical benefits.
Specifically, the standard academic AST is mutable, meaning that every (or all but the first) visitor is custom, not generated, and that any change in the underlying grammar or an interim structure of the AST will require reconsideration and likely changes to every subsequent visitor and their implemented logic.
The Antlr v4 parse-tree is essentially immutable, allowing decorations to be accumulated against tree nodes without loss of relational integrity. Visitors all use a common base structure, greatly reducing brittleness due to grammar changes and effects of prior executed visitors. As a practical matter, tree-walks are easily constructed, fast, and mutually independent except where expressly desired. They can achieve a greater separation of concerns in design and easier code maintenance in practice.
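As a rough illustration of the second option (walking the parse-tree and building a separate AST), here is a sketch that uses nested tuples as a stand-in for the parse-tree. It does not use the ANTLR runtime API; the tuple layout, find_all and build_ast are all hypothetical, and the tree is heavily simplified compared with what the question's grammar would produce.

```python
# Hypothetical stand-in for a parse tree: (rule_name, child, child, ...),
# with plain strings as leaf tokens. Type signatures are omitted.
PARSE_TREE = (
    "prog",
    ("stat", ("decFunc", "f", ("impFunc", "f", "x", "y", "=", ("hr", "zipWith")))),
    ("stat", ("decFunc", "sum", ("impFunc", "sum", "a", "=", ("hr", "foldr")))),
)


def find_all(node, rule):
    """Yield every subtree whose rule name equals `rule`, depth first."""
    if isinstance(node, tuple):
        if node[0] == rule:
            yield node
        for child in node[1:]:
            yield from find_all(child, rule)


def build_ast(tree):
    """Build one (function_name, parameters) root per decFunc subtree."""
    ast = []
    for dec in find_all(tree, "decFunc"):
        name = dec[1]
        imp = next(find_all(dec, "impFunc"))
        equals_at = imp.index("=")
        params = list(imp[2:equals_at])  # tokens between the name and '='
        ast.append((name, params))
    return ast
```

The parse-tree itself is left untouched; the function names become the roots of the new structure.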
Choose the right tool for the whole job, in whatever way you define it.

Can I do something to avoid the need to backtrack in this grammar?

I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that
Here's the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state - I would clearly like the parser to backtrack, and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN and then things would work because the ; would match the var statement syntax all right
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is to avoid forcing the parser to make a premature decision; once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)
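The same "decide late" idea can be sketched as a hand-written recursive-descent parser in Python. This is not the LALR automaton that bison or PLY would build, only an illustration: the parser reads the postfix chain the same way with or without new, and only after a closing parenthesis looks at the next token to pick between the two productions. The class, the node tuples and parse_expression are hypothetical, and only names are supported as primaries.

```python
import re


def tokenize(text):
    # Each word is one token; every other non-space character is its own token.
    return re.findall(r"\w+|[^\s\w]", text)


class Parser:
    SUFFIX_STARTS = (".", "[", "(")

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def name(self):
        tok = self.take()
        if not (tok[0].isalpha() or tok[0] == "_"):
            raise SyntaxError(f"expected a name, got {tok!r}")
        return tok

    def call_args(self):
        args = []
        if self.peek() != ")":
            args.append(self.expression())
            while self.peek() == ",":
                self.take(",")
                args.append(self.expression())
        return args

    def postfixed(self, after_new=False):
        node = ("name", self.name())
        while self.peek() in self.SUFFIX_STARTS:
            tok = self.take()
            if tok == ".":
                node = ("member", node, self.name())
            elif tok == "[":
                index = self.expression()
                self.take("]")
                node = ("index", node, index)
            else:
                args = self.call_args()
                self.take(")")
                # Only now, with the next token visible, choose the production:
                # no further suffix after 'new ...( )' means a constructor call.
                if after_new and self.peek() not in self.SUFFIX_STARTS:
                    return ("new", node, args)
                node = ("call", node, args)
        if after_new:
            raise SyntaxError("'new' needs a constructor call")
        return node

    def expression(self):
        if self.peek() == "new":
            self.take("new")
            return self.postfixed(after_new=True)
        return self.postfixed()


def parse_expression(text):
    """Parse one expression; return (tree, remaining_tokens)."""
    parser = Parser(tokenize(text))
    node = parser.expression()
    return node, parser.tokens[parser.pos:]
```

For new SomeGuy.SomeOtherGuy(); the member access is built first and the trailing () becomes the constructor's argument list, leaving the semicolon for the enclosing statement.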

ANTLR Parsing Literals and Quoted IDs

I'm working on an SQL grammar in ANTLR which allows quoted identifiers (table names, field names, etc), as well as quoted literal strings.
The problem is that this grammar seems to always match quoted inputs as "QUOTED_LITERAL", and never as IDs wrapped in quotes.
Here are my results:
input: 'blahblah' result: string_literal as expected.
input: field1 result: column_name as expected
input: table.field1 result: column_spec as expected
input: 'table'.'field1' result: string_literal, MissingTokenException
Below is my simplified grammar for the expression portion of the SQL grammar, if anybody can help identify what is needed to match quoted rules other than the quoted literal, thanks.
grammar test;
expression
:
simpleExpression EOF!
;
simpleExpression
:
column_spec
| literal_value
;
column_spec
:
(table_name '.')? column_name
| ('\''table_name '\'''.')? '\'' column_name '\''
| ('\"'table_name '\"' '.')? '\"' column_name '\"'
;
string_literal: QUOTED_LITERAL ;
boolean_literal: 'TRUE' | 'FALSE' ;
literal_value :
(
string_literal
| boolean_literal
)
;
table_name :ID;
column_name :ID;
QUOTED_LITERAL:
( '\''
( ('\\' '\\') | ('\'' '\'') | ('\\' '\'') | ~('\'') )*
'\'' )
|
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
;
ID
:
( 'A'..'Z' | 'a'..'z' ) ( 'A'..'Z' | 'a'..'z' | '_' | '0'..'9'| '::' )*
;
WHITE_SPACE : ( ' '|'\r'|'\t'|'\n' ) {$channel=HIDDEN;} ;
In case anybody is interested, I removed a little bit of the flexibility from the quoted literal strings. Literal strings can only be quoted by single quotes, and identifiers can be optionally quoted by double quotes. As long as the literal quote and the identifier quote are well defined and they don't overlap, the grammar is trivial.
This policy makes the grammar much cleaner, and doesn't remove the ability to quote identifiers. I make use of the JDBC method getIdentifierQuote to report which quote can be used to wrap identifiers.
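A small Python sketch of why that policy makes tokenizing trivial: once the two quote characters never overlap, each token kind has its own unambiguous pattern. The token names and the regular expressions here are my own simplification, not the final grammar.

```python
import re

TOKEN_RE = re.compile(
    r"""
    (?P<STRING>'(?:[^']|'')*')      # literal: single quotes, '' escapes a quote
  | (?P<QUOTED_ID>"[A-Za-z]\w*")    # identifier wrapped in double quotes
  | (?P<ID>[A-Za-z]\w*)             # bare identifier
  | (?P<DOT>\.)
  | (?P<WS>\s+)
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return (kind, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

A quoted table and column such as "table"."field1" now comes out as two QUOTED_ID tokens around a DOT, while 'blahblah' is a single STRING.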
This is your classical shift/reduce conflict. (Except that ANTLR does not shift or reduce; since it is not a stack automaton.)
You have the following problem:
When you are in the simpleExpression state you need to decide what branch to take with one token of lookahead. In the case of ANTLR, since no distinction is made between lexer and parser, the one token is a single character. (You should see a warning from ANTLR about the conflict.)
It gets even better: what is the difference between "Bob Dillan" and "table1"? From the parser's point of view, none. So how do you expect to tell apart:
('\"'table_name '\"' '.')? '\"' column_name '\"'
and
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
I strongly suggest rewriting the simpleExpression rule to:
simpleExpression:
IDENTIFIER |
IDENTIFIER '.' IDENTIFIER |
QUOTED_LITERAL |
QUOTED_LITERAL '.' QUOTED_LITERAL |
boolean_literal;
And then decide in the action code of simpleExpression what to do. Especially since I am quite sure that you can reference a table with a quoted name; nevertheless, "users" and "Bob Dillan" are syntactically equal.
It also depends on the greater grammar; you may also be able to resolve the ambiguity at a higher level.
The antlr lexer is greedy, in that when there are two possible token matches, it will match the longest possible one.
When the lexer sees 'some_id', it can match the first quote as just a quote, or a quoted literal. The literal is longer, so that matches.
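The longest-match behaviour described above can be mimicked in a few lines of Python. The rule set is a hypothetical simplification of the question's lexer (no escape handling in QUOTED_LITERAL), and next_token simply tries every rule and keeps the longest lexeme.

```python
import re

# Hypothetical, simplified token rules in declaration order.
RULES = [
    ("QUOTE", re.compile(r"'")),
    ("QUOTED_LITERAL", re.compile(r"'[^']*'")),
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("DOT", re.compile(r"\.")),
]


def next_token(text, pos):
    """Try every rule at `pos` and keep the one with the longest match."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[1].end()):
            best = (name, m)
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best[0], best[1].group(), best[1].end()


def tokenize(text):
    pos = 0
    tokens = []
    while pos < len(text):
        name, lexeme, pos = next_token(text, pos)
        tokens.append((name, lexeme))
    return tokens
```

On 'table'.'field1' both QUOTE and QUOTED_LITERAL match at the first quote, but the literal is longer, so the parser never sees a QUOTE ID QUOTE sequence. A lone quote followed by a name, with no closing quote, is the only case where QUOTE wins.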
As a side note, you generally do not want lexer rules that can match nothing (like ID), and you generally do not want to use string constants in the parser rules; reference token names only.
What you want to do is something like this.
QUOTE: '\'';
ID: ('a'..'z' | 'A'..'Z')+; // Must have at least one character
QUOTED_LITERAL: QUOTE ( (ID QUOTE) => { $type=QUOTE; } ) | .* QUOTE;
id: ID | QUOTE ID QUOTE;
quoted_literal: QUOTED_LITERAL | QUOTE ID QUOTE;
If the lexer sees something that looks like a quoted id, it cannot tell which to use, so it breaks it up into smaller tokens. In your parser, you use id where you expect a possibly quoted ID, and quoted_literal where you expect a QUOTED_LITERAL.
The syntactical predicate in QUOTED_LITERAL prevents it from matching the full quote when the input is ambiguous.
Looking at this, it will fail to correctly parse lines like
'tag' text 'second'
as ' text ' will be parsed as a QUOTED_LITERAL. If that is a valid input, then you would need something like
fragment QUOTED_ID;
QUOTED_LITERAL: QUOTE ( ID {$type=QUOTED_ID} | .* ) QUOTE;
id: ID | QUOTED_ID;
quoted_literal: QUOTED_LITERAL | QUOTED_ID;
(My example does not cover all the cases in your input, but extending it should be obvious. You also probably need some actions to either generate the correct tokens in your AST or add/remove quotes from the text, depending on what you do after you parse.)

Resources