ANTLR extraneous input at first position doesn't build tree - parsing

I have a problem with a extranous input in my sample file. I got the following lexer:
lexer grammar CtoLexer;
ENUM: 'enum';
NAMESPACE: 'namespace';
LBRACE: '{';
RBRACE: '}';
DOT: '.';
VAR: 'o ';
IDENTIFIER: LetterOrDigit+;
fragment LetterOrDigit
: [a-zA-Z$_] | [0-9];
WS: [ \t\r\n\u000C]+ -> skip;
... and parser:
parser grammar CtoParser;
options { tokenVocab=CtoLexer; }
modelUnit
: namespaceDeclaration enumDeclaration* EOF;
namespaceDeclaration
: NAMESPACE IDENTIFIER ('.' IDENTIFIER)*;
enumDeclaration
: ENUM IDENTIFIER '{' enumConstant* '}';
enumConstant
: VAR IDENTIFIER;
This is my sample cto file:
namespace org.basic.sample
enum FooType {
o FOO
}
enum BarType {
o BAR
}
enum BazType {
o BAZ
}
The tree for this sample file looks like this:
(modelUnit
(namespaceDeclaration namespace org . basic . sample)
(enumDeclaration enum FooType { (enumConstant o FOO) })
(enumDeclaration enum BarType { (enumConstant o BAR) })
(enumDeclaration enum BazType { (enumConstant o BAZ) })
<EOF>)
When I change the first enum in the sample to something other, let's say 'enum' to 'enumi', almost the whole tree is messed up. The parser recognizes only the namespace the rest seems to be an IDENTIFIER.
(modelUnit
(namespaceDeclaration namespace org . basic . sample)
enumi FooType { o FOO }
enum BarType { o BAR }
enum BazType { o BAZ })
However when I do the same with the second enum, the somehow only that invalid enum is not recognized and the rest is fine.
(modelUnit
(namespaceDeclaration namespace org . basic . sample)
(enumDeclaration enum FooType { (enumConstant o FOO) })
enumi BarType { o BAR }
(enumDeclaration enum BazType { (enumConstant o BAZ) }) <EOF>)
What can I do that also the first erroneous input is skipped and the rest is recognized? I tried with new line tokens, but that causes problems when I want to introduce a new declaration after the namespace.

This is a limitation of the prediction engine, from what I know. I also have seen that in cases like this MySQL query:
select * from sakila.actor where
which marks the select keyword as errorneous instead of the missing where expression. What happens here is that the ALL(*) prediction looks too far down the rule chain. If there's an error it doesn't allow to parse the correct input up to that error, but sometimes fails the entire input. I have not found a good solution for this problem. It all seems to depend on the grammar structure.

Related

Grammatically restricted JVMModelInferrer inheritence with Xtext and XBase

The following canonical XBase entities grammar (from "Implementing Domain-Specific Languages with Xtext and Xtend" Bettini) permits entities to extend any Java class. As the commented line indicates, I would like to grammatically force entities to only inherit from entities.
grammar org.example.xbase.entities.Entities with org.eclipse.xtext.xbase.Xbase
generate entities "http://www.example.org/xbase/entities/Entities"
Model:
importSection=XImportSection?
entities+=Entity*;
Entity:
'entity' name=ID ('extends' superType=JvmParameterizedTypeReference)? '{'
// 'entity' name=ID ('extends' superType=[Entity|QualifiedName])? '{'
attributes += Attribute*
constructors+=Constructor*
operations += Operation*
'}';
Attribute:
'attr' (type=JvmTypeReference)? name=ID ('=' initexpression=XExpression)? ';';
Operation:
'op' (type=JvmTypeReference)? name=ID
'(' (params+=FullJvmFormalParameter (',' params+=FullJvmFormalParameter)*)? ')'
body=XBlockExpression;
Constructor: 'new'
'(' (params+=FullJvmFormalParameter (',' params+=FullJvmFormalParameter)*)? ')'
body=XBlockExpression;
Here is a working JVMModelInferrer for the model above, again where the commented line (and extra method) reflect my intention.
package org.example.xbase.entities.jvmmodel
import com.google.inject.Inject
import org.eclipse.xtext.common.types.JvmTypeReference
import org.eclipse.xtext.naming.IQualifiedNameProvider
import org.eclipse.xtext.xbase.jvmmodel.AbstractModelInferrer
import org.eclipse.xtext.xbase.jvmmodel.IJvmDeclaredTypeAcceptor
import org.eclipse.xtext.xbase.jvmmodel.JvmTypesBuilder
import org.example.xbase.entities.entities.Entity
class EntitiesJvmModelInferrer extends AbstractModelInferrer {
#Inject extension JvmTypesBuilder
#Inject extension IQualifiedNameProvider
def dispatch void infer(Entity entity, IJvmDeclaredTypeAcceptor acceptor, boolean isPreIndexingPhase) {
acceptor.accept(entity.toClass("entities." + entity.name)) [
documentation = entity.documentation
if (entity.superType !== null) {
superTypes += entity.superType.cloneWithProxies
//superTypes += entity.superType.jvmTypeReference.cloneWithProxies
}
entity.attributes.forEach [ a |
val type = a.type ?: a.initexpression?.inferredType
members += a.toField(a.name, type) [
documentation = a.documentation
if (a.initexpression != null)
initializer = a.initexpression
]
members += a.toGetter(a.name, type)
members += a.toSetter(a.name, type)
]
entity.operations.forEach [ op |
members += op.toMethod(op.name, op.type ?: inferredType) [
documentation = op.documentation
for (p : op.params) {
parameters += p.toParameter(p.name, p.parameterType)
}
body = op.body
]
]
entity.constructors.forEach [ con |
members += entity.toConstructor [
for (p : con.params) {
parameters += p.toParameter(p.name, p.parameterType)
}
body = con.body
]
]
]
}
def JvmTypeReference getJvmTypeReference(Entity e) {
e.toClass(e.fullyQualifiedName).typeRef
}
}
The following simple instance parses and infers perfectly (with the comments in place).
entity A {
attr String y;
new(String y) {
this.y=y
}
}
entity B extends A {
new() {
super("Hello World!")
}
}
If, however, I uncomment (and comment in the corresponding line above) both the grammar and the inferrer (and regenerate), the above instance no longer parses. The message is "The method super(String) is undefined".
I understand how to leave the inheritance "loose" and restrict using validators, etc., but would far prefer to strongly type this into the model.
I am lost as to how to solve this, as I am not sure where things are breaking, given the role of XBase and the JvmModelInferrer. A pointer (or reference) would suffice.
[... I am able to implement all the scoping issues for a non-xbase version of this grammar ...]
this won't work. you either have to leave the grammar as is and customize proposal provider and validation. or you have to use "f.q.n.o.y.Entity".typeRef. You can use NodeModelUtils to read the FQN or try something like ("entities."+entity.superType.name).typeRef

How to parsing Velocity Variables using ANTLR4

The variables of Velocity has following notation. (see Velocity User Guide):
The shorthand notation of a variable consists of a leading "$" character followed by a VTL Identifier. A VTL Identifier must start with an alphabetic character (a .. z or A .. Z). The rest of the characters are limited to the following types of characters:
alphabetic (a .. z, A .. Z)
numeric (0 .. 9)
underscore ("_")
I want to use lexer mode to split the normal text and the variables, so I wrote something like this:
// default mode
DOLLAR : ‘$’ -> pushMode(VARIABLE);
TEXT : ~[$]+? -> skip;
mode VARIABLE:
ID : [a-zA-Z] [a-zA-Z0-9-_]*;
???? : XXX -> popMode; // how can I pop mode to default?
Because the notation of the variables has no explicit end character, so I don't know how to determine its end.
Maybe I got it wrong?
You would pop out of that scope like this:
mode VARIABLE;
ID : [a-zA-Z] [a-zA-Z0-9-_]* -> popMode;
Here's a quick demo:
lexer grammar VelocityLexer;
DOLLAR : '$' -> more, pushMode(VARIABLE);
TEXT : ~[$]+ -> skip;
mode VARIABLE;
// the `-` needs to be escaped!
ID : [a-zA-Z] [a-zA-Z0-9\-_]* -> popMode;
Note the more in the DOLLAR which will cause the $ to be included in the ID token. If you don't, you end up with two tokens ($ and foo for the input $foo)
Test the grammar with the following Java class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
VelocityLexer lexer = new VelocityLexer(CharStreams.fromString("<strong>$Mu</strong>$foo..."));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-10s '%s'\n", VelocityLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
ID '$Mu'
ID '$foo'
EOF '<EOF>'
However, I think a lexical mode is not a good choice in case of an ID. Why not simply do:
lexer grammar VelocityLexer;
DOLLAR : '$' [a-zA-Z] [a-zA-Z0-9\-_]*;
TEXT : ~[$]+ -> skip;
?

How to embed Scala code inside a specially defined syntax?

I don't know if this info is relevant to the question, but I am learning Scala parser combinators.
Using some examples (in this master thesis) I was able to write a simple functional (in the sense that it is non imperative) programming language.
Is there a way to improve my parser/evaluator such that it could allow/evaluate input like this:
<%
import scala.<some package / classes>
import weka.<some package / classes>
%>
some DSL code (lambda calculus)
<%
System.out.println("asdasd");
J48 j48 = new J48();
%>
as input written in the guest language (DSL)?
Should I use reflection or something similar* to evaluate such input?
Is there some source code recommendation to study (may be groovy sources?)?
Maybe this is something similar: runtime compilation, but I am not sure this is the best alternative.
EDIT
Complete answer given bellow with "{" and "}". Maybe "{{" would be better.
It is the question as to what the meaning of such import statements should be.
Perhaps you start first with allowing references to java methods in your language (the Lambda Calculus, I guess?).
For example:
java.lang.System.out.println "foo"
If you have that, you can then add resolution of unqualified names like
println "foo"
But here comes the first problem: println exists in System.out and System.err, or, to be more correct: it is a method of PrintStream, and both System.err and System.out are PrintStreams.
Hence you would need some notion of Objects, Classes, Types, and so on to do it right.
I managed how to run Scala code embedded in my interpreted DSL.
Insertion of DSL vars into Scala code and recovering returning value comes as a bonus. :)
Minimal relevant code from parsing and interpreting until performing embedded Scala code run-time execution (Main Parser AST and Interpreter):
object Main extends App {
val ast = Parser1 parse "some dsl code here"
Interpreter eval ast
}
object Parser1 extends RegexParsers with ImplicitConversions {
import AST._
val separator = ";"
def parse(input: String): Expr = parseAll(program, input).get
type P[+T] = Parser[T]
def program = rep1sep(expr, separator) <~ separator ^^ Sequence
def expr: Parser[Expr] = (assign /*more calls here*/)
def scalacode: P[Expr] = "{" ~> rep(scala_text) <~ "}" ^^ {case l => Scalacode(l.flatten)}
def scala_text = text_no_braces ~ "$" ~ ident ~ text_no_braces ^^ {case a ~ b ~ c ~ d => List(a, b + c, d)}
//more rules here
def assign = ident ~ ("=" ~> atomic_expr) ^^ Assign
//more rules here
def atomic_expr = (
ident ^^ Var
//more calls here
| "(" ~> expr <~ ")"
| scalacode
| failure("expression expected")
)
def text_no_braces = """[a-zA-Z0-9\"\'\+\-\_!##%\&\(\)\[\]\/\?\:;\.\>\<\,\|= \*\\\n]*""".r //| fail("Scala code expected")
def ident = """[a-zA-Z]+[a-zA-Z0-9]*""".r
}
object AST {
sealed abstract class Expr
// more classes here
case class Scalacode(items: List[String]) extends Expr
case class Literal(v: Any) extends Expr
case class Var(name: String) extends Expr
}
object Interpreter {
import AST._
val env = collection.immutable.Map[VarName, VarValue]()
def run(code: String) = {
val code2 = "val res_1 = (" + code + ")"
interpret.interpret(code2)
val res = interpret.valueOfTerm("res_1")
if (res == None) Literal() else Literal(res.get)
}
class Context(private var env: Environment = initEnv) {
def eval(e: Expr): Any = e match {
case Scalacode(l: List[String]) => {
val r = l map {
x =>
if (x.startsWith("$")) {
eval(Var(x.drop(1)))
} else {
x
}
}
eval(run(r.mkString))
}
case Assign(id, expr) => env += (id -> eval(expr))
//more pattern matching here
case Literal(v) => v
case Var(id) => {
env getOrElse(id, sys.error("Undefined " + id))
}
}
}
}

Matching a "text" in line by line file with XText

I try to write the Xtext BNF for Configuration files (known with the .ini extension)
For instance, I'd like to successfully parse
[Section1]
a = Easy123
b = This *is* valid too
[Section_2]
c = Voilà # inline comments are ignored
My problem is matching the property value (what's on the right of the '=').
My current grammar works if the property matches the ID terminal (eg a = Easy123).
PropertyFile hidden(SL_COMMENT, WS):
sections+=Section*;
Section:
'[' name=ID ']'
(NEWLINE properties+=Property)+
NEWLINE+;
Property:
name=ID (':' | '=') value=ID ';'?;
terminal WS:
(' ' | '\t')+;
terminal NEWLINE:
// New line on DOS or Unix
'\r'? '\n';
terminal ID:
('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*;
terminal SL_COMMENT:
// Single line comment
'#' !('\n' | '\r')*;
I don't know how to generalize the grammar to match any text (eg c = Voilà).
I certainly need to introduce a new terminal
Property:
name=ID (':' | '=') value=TEXT ';'?;
Question is: how should I define this TEXT terminal?
I have tried
terminal TEXT: ANY_OTHER+;
This raises a warning
The following token definitions can never be matched because prior tokens match the same input: RULE_INT,RULE_STRING,RULE_ML_COMMENT,RULE_ANY_OTHER
(I think it doesn't matter).
Parsing Fails with
Required loop (...)+ did not match anything at input 'à'
terminal TEXT: !('\r'|'\n'|'#')+;
This raises a warning
The following token definitions can never be matched because prior tokens match the same input: RULE_INT
(I think it doesn't matter).
Parsing Fails with
Missing EOF at [Section1]
terminal TEXT: ('!'|'$'..'~'); (which covers most characters, except # and ")
No warning during the generation of the lexer/parser.
However Parsing Fails with
Mismatch input 'Easy123' expecting RULE_TEXT
Extraneous input 'This' expecting RULE_TEXT
Required loop (...)+ did not match anything at 'is'
Thanks for your help (and I hope this grammar can be useful for others too)
This grammar does the trick:
grammar org.xtext.example.mydsl.MyDsl hidden(SL_COMMENT, WS)
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
import "http://www.eclipse.org/emf/2002/Ecore"
PropertyFile:
sections+=Section*;
Section:
'[' name=ID ']'
(NEWLINE+ properties+=Property)+
NEWLINE+;
Property:
name=ID value=PROPERTY_VALUE;
terminal PROPERTY_VALUE: (':' | '=') !('\n' | '\r')*;
terminal WS:
(' ' | '\t')+;
terminal NEWLINE:
// New line on DOS or Unix
'\r'? '\n';
terminal ID:
('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*;
terminal SL_COMMENT:
// Single line comment
'#' !('\n' | '\r')*;
Key is, that you do not try to cover the complete semantics only in the grammar but take other services into account, too. The terminal rule PROPERTY_VALUE consumes the complete value including leading assignment and optional trailing semicolon.
Now just register a value converter service for that language and take care of the insignificant parts of the input, there:
import org.eclipse.xtext.conversion.IValueConverter;
import org.eclipse.xtext.conversion.ValueConverter;
import org.eclipse.xtext.conversion.ValueConverterException;
import org.eclipse.xtext.conversion.impl.AbstractDeclarativeValueConverterService;
import org.eclipse.xtext.conversion.impl.AbstractIDValueConverter;
import org.eclipse.xtext.conversion.impl.AbstractLexerBasedConverter;
import org.eclipse.xtext.nodemodel.INode;
import org.eclipse.xtext.util.Strings;
import com.google.inject.Inject;
public class PropertyConverters extends AbstractDeclarativeValueConverterService {
#Inject
private AbstractIDValueConverter idValueConverter;
#ValueConverter(rule = "ID")
public IValueConverter<String> ID() {
return idValueConverter;
}
#Inject
private PropertyValueConverter propertyValueConverter;
#ValueConverter(rule = "PROPERTY_VALUE")
public IValueConverter<String> PropertyValue() {
return propertyValueConverter;
}
public static class PropertyValueConverter extends AbstractLexerBasedConverter<String> {
#Override
protected String toEscapedString(String value) {
return " = " + Strings.convertToJavaString(value, false);
}
public String toValue(String string, INode node) {
if (string == null)
return null;
try {
String value = string.substring(1).trim();
if (value.endsWith(";")) {
value = value.substring(0, value.length() - 1);
}
return value;
} catch (IllegalArgumentException e) {
throw new ValueConverterException(e.getMessage(), node, e);
}
}
}
}
The follow test case will succeed, after you registered the service in the runtime module like this:
#Override
public Class<? extends IValueConverterService> bindIValueConverterService() {
return PropertyConverters.class;
}
Test case:
import org.junit.runner.RunWith
import org.eclipse.xtext.junit4.XtextRunner
import org.xtext.example.mydsl.MyDslInjectorProvider
import org.eclipse.xtext.junit4.InjectWith
import org.junit.Test
import org.eclipse.xtext.junit4.util.ParseHelper
import com.google.inject.Inject
import org.xtext.example.mydsl.myDsl.PropertyFile
import static org.junit.Assert.*
#RunWith(typeof(XtextRunner))
#InjectWith(typeof(MyDslInjectorProvider))
class ParserTest {
#Inject
ParseHelper<PropertyFile> helper
#Test
def void testSample() {
val file = helper.parse('''
[Section1]
a = Easy123
b : This *is* valid too;
[Section_2]
# comment
c = Voilà # inline comments are ignored
''')
assertEquals(2, file.sections.size)
val section1 = file.sections.head
assertEquals(2, section1.properties.size)
assertEquals("a", section1.properties.head.name)
assertEquals("Easy123", section1.properties.head.value)
assertEquals("b", section1.properties.last.name)
assertEquals("This *is* valid too", section1.properties.last.value)
val section2 = file.sections.last
assertEquals(1, section2.properties.size)
assertEquals("Voilà # inline comments are ignored", section2.properties.head.value)
}
}
The problem (or one problem anyway) with parsing a format like that is that, since the text part may contain = characters, a line like foo = bar will be interpreted as a single TEXT token, not an ID, followed by a '=', followed by a TEXT. I can see no way to avoid that without disallowing (or requiring escaping of) = characters in the text part.
If that is not an option, I think, the only solution would be to make a token type LINE that matches an entire line and then take that apart yourself. You'd do that by removing TEXT and ID from your grammar and replacing them with a token type LINE that matches everything up to the next line break or comment sign and must start with a valid ID. So something like this:
LINE :
('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*
WS* '=' WS*
!('\r' | '\n' | '#')+
;
This token would basically replace your Property rule.
Of course this is a rather unsatisfactory solution as it will give you the entire line as a string and you still have to pick it apart yourself to separate the ID from the text part. It also prevents you from highlighting the ID part or the = sign as the entire line is one token and you can't highlight part of a token (as far as I know). Overall this does not buy you all that much over not using XText at all, but I don't see a better way.
As a workaround, I have changed
Property:
name=ID ':' value=ID ';'?;
Now, of course, = is not in conflict any more, but this is certainly not a good solution, because properties can usually defined with name=value
Edit: Actually, my input is a specific property file, and the properties are know in advance.
My code now looks like
Section:
'[' name=ID ']'
(NEWLINE (properties+=AbstractProperty)?)+;
AbstractProperty:
ADef
| BDef
ADef:
'A' (':'|'=') ID;
BDef:
'B' (':'|'=') Float;
There is an extra benefit, the property names are know as keywords, and colored as such. However, autocompletion only suggest '[' :(

How to parse subnodes that depended on parents' information?

If I write grammar file in Yacc/Bison like this:
Module
:ModuleName "=" Functions
{ $$ = Builder::concat($1, $2, ","); }
Functions
:Functions Function
{ $$ = Builder::concat($1, $2, ","); }
| Function
{ $$ = $1; }
Function
: DEF ID ARGS BODY
{
/** Lacks module name to do name mangling for the function **/
/** How can I obtain the "parent" node's module name here ?? **/
module_name = ; //????
$$ = Builder::def_function(module_name, $ID, $ARGS, $BODY);
}
And this parser should parse codes like this:
main_module:
def funA (a,b,c) { ... }
In my AST, the name "funA" should be renamed as main_module.funA. But I can't get the module's information while the parser is processing Function node !
Is there any Yacc/Bison facilities can help me to handle this problem, or should I change my parsing style to avoid such embarrassing situations ?
There is a bison feature, but as the manual says, use it with care:
$N with N zero or negative is allowed for reference to tokens and groupings on the stack before those that match the current rule. This is a very risky practice, and to use it reliably you must be certain of the context in which the rule is applied. Here is a case in which you can use this reliably:
foo: expr bar '+' expr { ... }
| expr bar '-' expr { ... }
;
bar: /* empty */
{ previous_expr = $0; }
;
As long as bar is used only in the fashion shown here, $0 always refers to the expr which precedes bar in the definition of foo.
More cleanly, you could use a mid-rule action (in Module) to push the module name on a name stack (which would have to be part of the parsing context). You would then pop the stack at the end of the rule.
For more information and examples of mid-rules actions, see the manual.

Resources