Antlr whitespace token error - parsing

I have the following grammar and I want to match the string "{name1, name2}". I just want lists of names/integers with at least one element. However, I get these errors:
line 1:6 no viable alternative at character ' '
line 1:11 no viable alternative at character '}'
line 1:7 mismatched input 'name' expecting SIMPLE_VAR_TYPE
I would expect whitespace to be ignored. Interestingly, the error does not occur with the input "{name1,name2}" (no space after ',').
Here's my grammar:
grammar NusmvInput;
options {
language = Java;
}
@header {
package secltlmc.grammar;
}
@lexer::header {
package secltlmc.grammar;
}
specification :
SIMPLE_VAR_TYPE EOF
;
INTEGER
: ('0'..'9')+
;
SIMPLE_VAR_TYPE
: ('{' (NAME | INTEGER) (',' (NAME | INTEGER))* '}' )
;
NAME
: ('A'..'Z' | 'a'..'z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$' | '#' | '-')*
;
WS
: (' ' | '\t' | '\n' | '\r')+ {$channel = HIDDEN;}
;
And this is my testing code
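package secltlmc;
import java.io.IOException;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import secltlmc.grammar.NusmvInputLexer;
import secltlmc.grammar.NusmvInputParser;
public class Main {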
public static void main(String[] args) throws
IOException, RecognitionException {
CharStream stream = new ANTLRStringStream("{name1, name2}");
NusmvInputLexer lexer = new NusmvInputLexer(stream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
NusmvInputParser parser = new NusmvInputParser(tokenStream);
parser.specification();
}
}
Thanks for your help.

The problem is that you are trying to match SIMPLE_VAR_TYPE in the lexer, i.e. you are making it a single token. Inside a lexer rule, references to NAME or INTEGER match characters directly, so the WS rule and its hidden channel never come into play between them. What you want is a multi-token production, so that whitespace between tokens is redirected to the hidden channel by WS.
You should change SIMPLE_VAR_TYPE from a lexer rule to a parser rule by changing its initial letter (or better yet, the entire name) to lower case:
specification :
simple_var_type EOF
;
simple_var_type
: ('{' (NAME | INTEGER) (',' (NAME | INTEGER))* '}' )
;
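To illustrate why the parser rule works, here is a minimal hand-written sketch of the two phases. It is not ANTLR-generated code and uses no ANTLR API; the class BraceListSketch and its methods are hypothetical names chosen for this example. The tokenizer drops whitespace before the parser ever sees it, which is the effect that sending WS to the hidden channel has on a parser rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the generated lexer and parser; not ANTLR API.
final class BraceListSketch {

    // Lexer phase: split the input into tokens and drop whitespace,
    // mirroring what the hidden channel does for the parser.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                i++; // WS: never reaches the parser
            } else if (c == '{' || c == '}' || c == ',') {
                tokens.add(String.valueOf(c));
                i++;
            } else if (isDigit(c)) { // INTEGER
                int start = i;
                while (i < input.length() && isDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else if (isLetter(c)) { // NAME
                int start = i;
                while (i < input.length() && isNamePart(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else {
                throw new IllegalArgumentException("no viable alternative at character '" + c + "'");
            }
        }
        return tokens;
    }

    // Parser phase: '{' element (',' element)* '}' EOF, working on tokens only.
    static List<String> parse(String input) {
        List<String> tokens = tokenize(input);
        List<String> elements = new ArrayList<>();
        int pos = expect(tokens, 0, "{");
        elements.add(element(tokens, pos++));
        while (pos < tokens.size() && tokens.get(pos).equals(",")) {
            pos++;
            elements.add(element(tokens, pos++));
        }
        pos = expect(tokens, pos, "}");
        if (pos != tokens.size()) {
            throw new IllegalArgumentException("extraneous input after '}'");
        }
        return elements;
    }

    private static int expect(List<String> tokens, int pos, String text) {
        if (pos >= tokens.size() || !tokens.get(pos).equals(text)) {
            throw new IllegalArgumentException("expected '" + text + "' at token " + pos);
        }
        return pos + 1;
    }

    private static String element(List<String> tokens, int pos) {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("expected NAME or INTEGER at end of input");
        }
        String t = tokens.get(pos);
        if (t.equals("{") || t.equals("}") || t.equals(",")) {
            throw new IllegalArgumentException("expected NAME or INTEGER but found '" + t + "'");
        }
        return t;
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isNamePart(char c) {
        return isLetter(c) || isDigit(c) || c == '_' || c == '$' || c == '#' || c == '-';
    }

    public static void main(String[] args) {
        System.out.println(parse("{name1, name2}"));
    }
}
```

Because the parser only ever looks at the token list, "{name1, name2}" and "{name1,name2}" produce the same result.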

The definition of SIMPLE_VAR_TYPE specifies the following expression:
an opening {
followed by one of NAME or INTEGER
followed by zero or more of:
a comma (,) followed by one of NAME or INTEGER
followed by a closing }
Nowhere does it allow whitespace in the input (neither NAME nor INTEGER allows it either), so you get an error when you supply some.
Try:
SIMPLE_VAR_TYPE
: ('{' (NAME | INTEGER) (WS* ',' WS* (NAME | INTEGER))* '}' )
;

Disable wrapping in Xtext formatter

I have an Xtext grammar which consists of one declaration per line. When I format the code, all the declarations end up on the same line because the line breaks are removed.
As I didn't manage to change the grammar to require line breaks, I would like to disable the removal of line breaks. How do I do that? Bonus points if someone can tell me how to require line breaks at the end of each declaration.
Part of the Grammar:
grammar com.example.Msg with org.eclipse.xtext.common.Terminals
hidden(WS, SL_COMMENT)
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate msg_idl "http://www.example.com/ex/ample/msg"
Model:
MsgDef
;
MsgDef:
(definitions+=definition)+
;
definition:
type=fieldType ' '+ name=ValidID (' '* '=' ' '* const=Value)?
;
fieldType:
value = ( builtinType | header)
;
builtinType:
BOOL = "bool"
| INT32 = "int32"
| CHAR = "char"
;
header:
value="Header"
;
Bool_l:
target=BOOL_E
;
String_l:
target = ('""'|STRING)
;
Number_l:
Double_l | Integer_l | NegInteger_l
;
NegInteger_l:
target=NEG_INT
;
Integer_l :
target=INT
;
Double_l:
target=DOUBLE
;
terminal NEG_INT returns ecore::EInt:
'-' INT
;
terminal DOUBLE returns ecore::EDouble :
('-')? ('0'..'9')* ('.' INT) |
('-')? INT ('.') |
('-')? INT ('.' ('0'..'9')*)? (('e'|'E')('-'|'+')? INT )|
'nan' | 'inf' | '-inf'
;
enum BOOL_E :
true | false
;
ValidID:
"bool"
| "string"
| "time"
| "duration"
| "char"
| ID ;
Value:
String_l | Number_l
;
terminal SL_COMMENT :
' '* '#' !('\n'|'\r')* ('\r'? '\n')?
;
Example data
string left
string top
string right
string bottom
I already tried:
class MsgFormatter extends AbstractDeclarativeFormatter {
extension MsgGrammarAccess msgGrammarAccess = grammarAccess as MsgGrammarAccess
override protected void configureFormatting(FormattingConfig c) {
c.setLinewrap(0, 1, 2).before(SL_COMMENTRule)
c.setLinewrap(0, 1, 2).before(ML_COMMENTRule)
c.setLinewrap(0, 1, 1).after(ML_COMMENTRule)
c.setLinewrap().before(definitionRule); // does not work
c.setLinewrap(1,1,2).before(definitionRule); // does not work
c.setLinewrap().before(fieldTypeRule); // does not work
}
}
In general it is a bad idea to encode whitespace into the language itself. Most of the time it is better to write the language in such a way that any kind of whitespace (blanks, tabs, newlines ...) can be used to separate tokens.
You should implement a custom formatter for your language that inserts a line break after each statement. Xtext comes with two formatter APIs (an old one, and a new one starting with Xtext 2.8). I suggest using the new one.
For that, you extend AbstractFormatter2 and implement the format methods.
You can find some information in the online manual: https://www.eclipse.org/Xtext/documentation/303_runtime_concepts.html#formatting
Some more explanation is in the following blog post: https://blogs.itemis.com/en/tabular-formatting-with-the-new-formatter-api
Some technical background: https://de.slideshare.net/meysholdt/xtexts-new-formatter-api

Illegal Argument: ParseTree error on small language

I'm stuck on this problem for a while now, hope you can help. I've got the following (shortened) language grammar:
lexical Id = [a-zA-Z][a-zA-Z]* !>> [a-zA-Z] \ MyKeywords;
lexical Natural = [1-9][0-9]* !>> [0-9];
lexical StringConst = "\"" ![\"]* "\"";
keyword MyKeywords = "value" | "Male" | "Female";
start syntax Program = program: Model* models;
syntax Model = Declaration;
syntax Declaration = decl: "value" Id name ':' Type t "=" Expression v ;
syntax Type = gender: "Gender";
syntax Expression = Terminal;
syntax Terminal = id: Id name
| constructor: Type t '(' {Expression ','}* ')'
| Gender;
syntax Gender = male: "Male"
| female: "Female";
alias ASLId = str;
data TYPE = gender();
public data PROGRAM = program(list[MODEL] models);
data MODEL = decl(ASLId name, TYPE t, EXPR v);
data EXPR = constructor(TYPE t, list[EXPR] args)
| id(ASLId name)
| male()
| female();
Now, I'm trying to parse:
value mannetje : Gender = Male
This parses fine, but fails on implode, unless I remove the id: Id name alternative and its constructor from the grammar. I expected that the \ MyKeywords would prevent this, but unfortunately it doesn't. Can you help me fix this, or point me in the right direction on how to debug it? I'm having some trouble debugging the concrete and abstract syntax.
Thanks!
It does not seem to be parsing at all (I get a ParseError if I try your example).
One of the problems is probably that you don't define Layout. This causes the ParseError with your given example. One of the easiest fixes is to extend the standard Layout in lang::std::Layout. This layout defines all the default whitespace (and comment) characters.
For more information on nonterminals see here.
I took the liberty of simplifying your example a bit further so that parsing and imploding work. I removed some unused nonterminals to keep the parse tree more concise. You probably want more than Declarations in your Program, but I leave that up to you.
extend lang::std::Layout;
lexical Id = ([a-z] !<< [a-z][a-zA-Z]* !>> [a-zA-Z]) \ MyKeywords;
keyword MyKeywords = "value" | "Male" | "Female" | "Gender";
start syntax Program = program: Declaration* decls;
syntax Declaration = decl: "value" Id name ':' Type t "=" Expression v ;
syntax Type = gender: "Gender";
syntax Expression
= id: Id name
| constructor: Type t '(' {Expression ','}* ')'
| Gender
;
syntax Gender
= male: "Male"
| female: "Female"
;
data PROGRAM = program(list[DECL] exprs);
data DECL = decl(str name, TYPE t, EXPR v);
data EXPR = constructor(TYPE t, list[EXPR] args)
| id(str name)
| male()
| female()
;
data TYPE = gender();
Two things:
The names of the ADTs should correspond to the nonterminal names (you have different cases, and EXPR is not Expression). That is the only way implode can know how to do its work. Put the data declarations in their own module and implode as follows: implode(#AST::Program, pt), where pt is the parse tree.
The grammar was ambiguous: the \ MyKeywords only applied to the tail of the identifier syntax. Use this fix: ([a-zA-Z][a-zA-Z]* !>> [a-zA-Z]) \ MyKeywords;.
Here's what worked for me (grammar unchanged except for the fix):
module AST
alias ASLId = str;
data Type = gender();
public data Program = program(list[Model] models);
data Model = decl(ASLId name, Type t, Expression v);
data Expression = constructor(Type t, list[Expression] args)
| id(ASLId name)
| male()
| female();

How to handle arithmetic operator < and > in antlr grammar that removes html tags

The following is my ANTLR 3 grammar. I want to strip out HTML tags together with their content.
The problem arises when I have an arithmetic operator (< or >) inside the tag.
How can this be handled?
grammar T;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: text+ ;
text
: (tag)=> tag !
| SPACE !
| outsidetag
;
SPACE
: (' ' | '\t' | '\r' | '\n')+ ;
tag
: OPEN INSIDETAG CLOSE ;
CLOSE : '>' ;
OPEN : '<' ;
INSIDETAG
: ~(CLOSE|OPEN)+ ;
outsidetag
: ~(SPACE) ;
First, you don't need to check for OPEN in your INSIDETAG rule, since there is no harm in skipping it there. In fact, you want it that way. Additionally, combine tag and INSIDETAG and make the rule greedy so that it tries to consume everything up to the last CLOSE token, skipping any intermediate ones:
tag: options { greedy = true; }: OPEN ~CLOSE* CLOSE;
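As a rough way to see what the shape OPEN ~CLOSE* CLOSE matches, here is a small Java sketch that uses the regular expression <[^>]*> as an analogue. This is an illustration on plain strings, not ANTLR output, and the class name TagStripSketch is hypothetical. It shows that a '<' inside the tag is swallowed, while a '>' inside the tag still ends the match early.

```java
// Hypothetical regex analogue of the rule OPEN ~CLOSE* CLOSE; not ANTLR code.
final class TagStripSketch {

    // "<[^>]*>" reads as: an opening '<', any run of characters other
    // than '>', then the first '>' that follows.
    static String stripTags(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '<' inside the tag is consumed along with the rest of the tag.
        System.out.println(stripTags("a <b x<y> c"));
        // A '>' inside the tag closes the match early, leaving "y>" behind.
        System.out.println(stripTags("a <b x>y> c"));
    }
}
```

With this shape, a '>' used as an operator inside a tag cannot be told apart from the closing bracket, so handling it would need extra context such as quoting or escaping rules.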

Matching a "text" in line by line file with XText

I am trying to write the Xtext grammar for configuration files (known by the .ini extension).
For instance, I'd like to successfully parse
[Section1]
a = Easy123
b = This *is* valid too
[Section_2]
c = Voilà # inline comments are ignored
My problem is matching the property value (what's on the right of the '=').
My current grammar works if the property matches the ID terminal (eg a = Easy123).
PropertyFile hidden(SL_COMMENT, WS):
sections+=Section*;
Section:
'[' name=ID ']'
(NEWLINE properties+=Property)+
NEWLINE+;
Property:
name=ID (':' | '=') value=ID ';'?;
terminal WS:
(' ' | '\t')+;
terminal NEWLINE:
// New line on DOS or Unix
'\r'? '\n';
terminal ID:
('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*;
terminal SL_COMMENT:
// Single line comment
'#' !('\n' | '\r')*;
I don't know how to generalize the grammar to match any text (eg c = Voilà).
I certainly need to introduce a new terminal
Property:
name=ID (':' | '=') value=TEXT ';'?;
Question is: how should I define this TEXT terminal?
I have tried
terminal TEXT: ANY_OTHER+;
This raises a warning
The following token definitions can never be matched because prior tokens match the same input: RULE_INT,RULE_STRING,RULE_ML_COMMENT,RULE_ANY_OTHER
(I think it doesn't matter).
Parsing Fails with
Required loop (...)+ did not match anything at input 'à'
terminal TEXT: !('\r'|'\n'|'#')+;
This raises a warning
The following token definitions can never be matched because prior tokens match the same input: RULE_INT
(I think it doesn't matter).
Parsing Fails with
Missing EOF at [Section1]
terminal TEXT: ('!'|'$'..'~'); (which covers most characters, except # and ")
No warning during the generation of the lexer/parser.
However Parsing Fails with
Mismatch input 'Easy123' expecting RULE_TEXT
Extraneous input 'This' expecting RULE_TEXT
Required loop (...)+ did not match anything at 'is'
Thanks for your help (and I hope this grammar can be useful for others too)
This grammar does the trick:
grammar org.xtext.example.mydsl.MyDsl hidden(SL_COMMENT, WS)
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
import "http://www.eclipse.org/emf/2002/Ecore"
PropertyFile:
sections+=Section*;
Section:
'[' name=ID ']'
(NEWLINE+ properties+=Property)+
NEWLINE+;
Property:
name=ID value=PROPERTY_VALUE;
terminal PROPERTY_VALUE: (':' | '=') !('\n' | '\r')*;
terminal WS:
(' ' | '\t')+;
terminal NEWLINE:
// New line on DOS or Unix
'\r'? '\n';
terminal ID:
('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*;
terminal SL_COMMENT:
// Single line comment
'#' !('\n' | '\r')*;
The key is that you do not try to cover the complete semantics in the grammar alone, but take other services into account, too. The terminal rule PROPERTY_VALUE consumes the complete value, including the leading assignment sign and the optional trailing semicolon.
Now just register a value converter service for the language and take care of the insignificant parts of the input there:
import org.eclipse.xtext.conversion.IValueConverter;
import org.eclipse.xtext.conversion.ValueConverter;
import org.eclipse.xtext.conversion.ValueConverterException;
import org.eclipse.xtext.conversion.impl.AbstractDeclarativeValueConverterService;
import org.eclipse.xtext.conversion.impl.AbstractIDValueConverter;
import org.eclipse.xtext.conversion.impl.AbstractLexerBasedConverter;
import org.eclipse.xtext.nodemodel.INode;
import org.eclipse.xtext.util.Strings;
import com.google.inject.Inject;
public class PropertyConverters extends AbstractDeclarativeValueConverterService {
@Inject
private AbstractIDValueConverter idValueConverter;
@ValueConverter(rule = "ID")
public IValueConverter<String> ID() {
return idValueConverter;
}
@Inject
private PropertyValueConverter propertyValueConverter;
@ValueConverter(rule = "PROPERTY_VALUE")
public IValueConverter<String> PropertyValue() {
return propertyValueConverter;
}
public static class PropertyValueConverter extends AbstractLexerBasedConverter<String> {
@Override
protected String toEscapedString(String value) {
return " = " + Strings.convertToJavaString(value, false);
}
public String toValue(String string, INode node) {
if (string == null)
return null;
try {
String value = string.substring(1).trim();
if (value.endsWith(";")) {
value = value.substring(0, value.length() - 1);
}
return value;
} catch (IllegalArgumentException e) {
throw new ValueConverterException(e.getMessage(), node, e);
}
}
}
}
The following test case will succeed once you have registered the service in the runtime module like this:
@Override
public Class<? extends IValueConverterService> bindIValueConverterService() {
return PropertyConverters.class;
}
Test case:
import org.junit.runner.RunWith
import org.eclipse.xtext.junit4.XtextRunner
import org.xtext.example.mydsl.MyDslInjectorProvider
import org.eclipse.xtext.junit4.InjectWith
import org.junit.Test
import org.eclipse.xtext.junit4.util.ParseHelper
import com.google.inject.Inject
import org.xtext.example.mydsl.myDsl.PropertyFile
import static org.junit.Assert.*
@RunWith(typeof(XtextRunner))
@InjectWith(typeof(MyDslInjectorProvider))
class ParserTest {
@Inject
ParseHelper<PropertyFile> helper
@Test
def void testSample() {
val file = helper.parse('''
[Section1]
a = Easy123
b : This *is* valid too;
[Section_2]
# comment
c = Voilà # inline comments are ignored
''')
assertEquals(2, file.sections.size)
val section1 = file.sections.head
assertEquals(2, section1.properties.size)
assertEquals("a", section1.properties.head.name)
assertEquals("Easy123", section1.properties.head.value)
assertEquals("b", section1.properties.last.name)
assertEquals("This *is* valid too", section1.properties.last.value)
val section2 = file.sections.last
assertEquals(1, section2.properties.size)
assertEquals("Voilà # inline comments are ignored", section2.properties.head.value)
}
}
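To see what the toValue logic above does to a raw PROPERTY_VALUE token without setting up an Xtext runtime, the same string handling can be tried in isolation. This is only a sketch: PropertyValueSketch is a hypothetical class that copies the three string operations and leaves out the Xtext types (INode, ValueConverterException).

```java
// Hypothetical standalone copy of the string handling in toValue; no Xtext types.
final class PropertyValueSketch {

    // raw is the full token text, which always starts with ':' or '=',
    // for example "= Easy123" or ": This *is* valid too;".
    static String toValue(String raw) {
        if (raw == null) {
            return null;
        }
        // Drop the leading assignment sign, then surrounding blanks.
        String value = raw.substring(1).trim();
        // Drop one optional trailing semicolon.
        if (value.endsWith(";")) {
            value = value.substring(0, value.length() - 1);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(toValue("= Easy123"));
        System.out.println(toValue(": This *is* valid too;"));
    }
}
```

Note that text after a '#' stays part of the value, which matches the last assertion in the test case above.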
The problem (or one problem anyway) with parsing a format like that is that, since the text part may contain = characters, a line like foo = bar will be interpreted as a single TEXT token, not an ID, followed by a '=', followed by a TEXT. I can see no way to avoid that without disallowing (or requiring escaping of) = characters in the text part.
If that is not an option, I think, the only solution would be to make a token type LINE that matches an entire line and then take that apart yourself. You'd do that by removing TEXT and ID from your grammar and replacing them with a token type LINE that matches everything up to the next line break or comment sign and must start with a valid ID. So something like this:
LINE :
('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*
WS* '=' WS*
!('\r' | '\n' | '#')+
;
This token would basically replace your Property rule.
Of course this is a rather unsatisfactory solution as it will give you the entire line as a string and you still have to pick it apart yourself to separate the ID from the text part. It also prevents you from highlighting the ID part or the = sign as the entire line is one token and you can't highlight part of a token (as far as I know). Overall this does not buy you all that much over not using XText at all, but I don't see a better way.
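The "pick it apart yourself" step could look roughly like the following Java sketch. It assumes the whole LINE token text is available as a string; the class LineSplitSketch and the method split are hypothetical names, and the regular expression simply restates the LINE rule above. Trimming the text part is an extra choice made here, not something the token rule does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that separates the ID from the text of a LINE token.
final class LineSplitSketch {

    // Same shape as the LINE rule: an ID, optional blanks, '=',
    // optional blanks, then everything up to a line break or '#'.
    private static final Pattern LINE =
            Pattern.compile("([A-Za-z][A-Za-z0-9_-]*)[ \\t]*=[ \\t]*([^\\r\\n#]+)");

    // Returns { id, text }; the text part is trimmed of surrounding blanks.
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        // lookingAt anchors at the start but allows trailing input,
        // such as an inline comment that begins with '#'.
        if (!m.lookingAt()) {
            throw new IllegalArgumentException("not a property line: " + line);
        }
        return new String[] { m.group(1), m.group(2).trim() };
    }

    public static void main(String[] args) {
        String[] parts = split("b = This *is* valid too");
        System.out.println(parts[0] + " -> " + parts[1]);
    }
}
```

Because only the first '=' is treated as the separator, a value such as a=b in foo=a=b stays intact.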
As a workaround, I have changed
Property:
name=ID ':' value=ID ';'?;
Now, of course, = is no longer in conflict, but this is certainly not a good solution, because properties are usually defined as name=value.
Edit: Actually, my input is a specific property file, and the properties are known in advance.
My code now looks like this:
Section:
'[' name=ID ']'
(NEWLINE (properties+=AbstractProperty)?)+;
AbstractProperty:
ADef
| BDef;
ADef:
'A' (':'|'=') ID;
BDef:
'B' (':'|'=') Float;
There is an extra benefit: the property names are known as keywords and colored as such. However, autocompletion only suggests '[' :(

how to setup flex/bison rules for parsing a comma-delimited argument list

I would like to be able to parse a non-empty (one or more elements), comma-delimited and optionally parenthesized list using flex/bison parse rules.
Some examples of parseable lists:
1
1,2
(1,2)
(3)
3,4,5
(3,4,5,6)
etc.
I am using the following rules to parse the list (the final result is the parse element topLevelList), but they do not seem to give the desired result: I get a syntax error when supplying a valid list. Any suggestions on how I might set this up?
cList : ELEMENT
{
...
}
| cList COMMA ELEMENT
{
...
}
;
topLevelList : LPAREN cList RPAREN
{
...
}
| cList
{
...
}
;
This sounds simple. Tell me if I missed something or if my example doesn't work.
RvalCommaList:
RvalCommaListLoop
| '(' RvalCommaListLoop ')'
RvalCommaListLoop:
Rval
| RvalCommaListLoop ',' Rval
Rval: INT_LITERAL | WHATEVER
However, if you accept rvals as well as this list, you'll have a conflict, because a regular rval can be confused with a single-item list. In that case you can use the rule below, which either requires the '(' ')' around the items or requires at least two items before it counts as a list:
RvalCommaList2:
Rval ',' RvalCommaListLoop
| '(' RvalCommaListLoop ')'
I too want to know how to do this. Thinking about it briefly, one way to achieve it would be to use a linked list of the form:
struct list;
struct list {
void *item;
struct list *next;
};
struct list *make_list(void *item, struct list *next);
and using the rule:
{ $$ = make_list( $1, $2); }
This solution is very similar in design to:
Using bison to parse list of elements
The hard bit is to figure out how to handle lists in the scheme of a (I presume) binary AST.
%token INTEGER COMMA
%start input
%%
input:
%empty
| integer_list
;
integer_list
: integer_loop
| '(' integer_loop ')'
;
integer_loop
: INTEGER
| integer_loop COMMA INTEGER
;
%%
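For comparison, here is a small hand-written Java check that accepts the same inputs as the integer_list rule above: one or more integers separated by commas, optionally wrapped in one pair of parentheses. It is a sketch outside of flex/bison, and the class IntListSketch is a hypothetical name. Unlike the input rule, it rejects empty input, in line with the non-empty requirement in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written equivalent of integer_list; not generated by bison.
final class IntListSketch {

    static List<Integer> parse(String input) {
        String body = input.trim();
        // Optional surrounding parentheses: '(' integer_loop ')'.
        if (body.startsWith("(")) {
            if (!body.endsWith(")")) {
                throw new IllegalArgumentException("syntax error: missing ')'");
            }
            body = body.substring(1, body.length() - 1);
        }
        // integer_loop: INTEGER (COMMA INTEGER)*.
        // The limit -1 keeps trailing empty pieces, so "1," is rejected below.
        List<Integer> items = new ArrayList<>();
        for (String piece : body.split(",", -1)) {
            String item = piece.trim();
            if (!item.matches("[0-9]+")) {
                throw new IllegalArgumentException("syntax error near '" + item + "'");
            }
            items.add(Integer.parseInt(item));
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(parse("(3,4,5,6)"));
    }
}
```

All of the example lists from the question are accepted, while inputs such as "()", "1," or "(1,2" are rejected with an exception.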
