I want to extend the IDL.g4 grammar a bit so that I can distinguish the following two comments //#top-level false and //#top-level true, all other comments I just want to skip like before.
I have tried to add top_level, TOP_LEVEL_TRUEand TOP_LEVEL_FALSElike this, because I thought antr4 gave precedence to lexical rules comming first.
top_level
: TOP_LEVEL_TRUE
| TOP_LEVEL_FALSE
;
TOP_LEVEL_TRUE
: '//#top-level true'
;
TOP_LEVEL_FALSE
: '//#top-level false'
;
LINE_COMMENT
: '//' ~('\n'|'\r')* '\r'? '\' -> channel(HIDDEN)
;
But the listener enterTop_level(...) is never called,
all comments seems to be eaten by LINE_COMMENT. How shall I organize the lexer and parser rules?
And one more question, I also want to be notified when end of input-file is reached. How do I do that? I have tried a finalize() function i the listener class, but never get called.
Updated with a complete example:
I use this grammar file : IDL.g4 as I said above. Then I update it by putting the parser rule top_level just below the event_header rule. The Lexer rules is put just above the ID rule.
Here is my Listener.java file
class Listener extends IDLBaseListener {
#Override
public void enterTop_level(IDLParser.Top_levelContext ctx) {
System.out.println("Found top-level");
}
}
and here is a main program: IDLCheck.java
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTreeWalker;
import java.io.FileInputStream;
import java.io.InputStream;
public class IDLCheck {
public void process(String[] args) throws Exception {
InputStream is = new FileInputStream("sample.idl");
ANTLRInputStream input = new ANTLRInputStream(is);
IDLLexer lexer = new IDLLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
IDLParser parser = new IDLParser(tokens);
parser.setBuildParseTree(true);
RuleContext tree = parser.specification();
Listener listener = new Listener();
ParseTreeWalker walker = new ParseTreeWalker();
walker.walk(listener, tree);
}
public static void main(String[] args) throws Exception {
new IDLCheck().process(args);
}
}
and a input file: sample.idl
module CommonTypes {
struct WChannel {
int w;
float d;
}; //#top-level false
struct EPlanID {
int kind;
short index;
}; //#top-level TRUE
};
I expect to see the output "Found top-level" twice, but I see nothing
Finally I found a solution. I just added newline characters to the TOP_LEVEL_FALSE and TOP_LEVEL_TRUElexer rules an I also added the top_level parser rule to the definition rule because I only expected top_level to appear after a struct or union. this is a rti.com specific extension to the IDL-format, this modification seems to be good enough for me.
definition
: type_decl SEMICOLON top_level?
| const_decl SEMICOLON
...
TOP_LEVEL_TRUE
: '//#top-level true' '\r'? '\n'
;
TOP_LEVEL_FALSE
: '//#top-level false' '\r'? '\n'
;
Related
I am new to compiler design as well as ANTLR and trying out some basic stuffs from a book.I came across this problem.Can someone help out. I am expecting the grammar validate any numbers separated by commas in between 2 braces.
Below is my java application to trigger the generated parser but it says at this line parser.init() cannot resolve method init(). I assume the init is the parser rule defined in my grammar.
import org.antlr.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
import java.io.IOException;
public class Adamant{
public static void main(String[] args) throws Exception {
ANTLRInputStream input = new ANTLRInputStream(System.in);
testLexer lexer = new testLexer((CharStream) input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
testParser parser = new testParser(tokens);
ParseTree tree = parser.init();
System.out.println(tree.toStringTree(parser));
}
}
Below is my grammar
grammar test;
init: '{' value (',' value)* '}';
value:
init
| INT
;
INT : [0-9]+ ;
WS : [\t\n\r] -> skip;
I been using ANTLR for a month now and I'm still no expert. I wanted to know if ANTLR's BaseVisitor class that is generated, automatically visits a specific rule context once the visitRuleContext() is implemented and the file to be parsed is done so.
Yes, if you look into the generated visitor class, you'll see that all methods return visitChildren(ctx). So when you only override one visit...(...) method in your own visitor, your single method would be called.
A quick test shows this:
grammar T;
parse
: something+ EOF
;
something
: ANY+
| number
;
number
: DIGITS
;
DIGITS
: [0-9]+
;
ANY
: .
;
And a test class:
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(CharStreams.fromString("mu 123"));
TParser parser = new TParser(new CommonTokenStream(lexer));
ParseTree root = parser.parse();
new TestVisitor().visit(root);
}
}
class TestVisitor extends TBaseVisitor<Object> {
#Override
public Object visitSomething(TParser.SomethingContext ctx) {
System.out.println("visitSomething: " + ctx.getText());
return super.visitChildren(ctx);
}
}
will print:
visitSomething: mu
visitSomething: 123
It's quite strange, but DefaultErrorStrategy does not do anything for catching unrecognized characters from a stream. I tried a custom error strategy, a custom error listener and BailErrorStrategy - no luck here.
My grammar
grammar Polynomial;
parse : canonical EOF
;
canonical : polynomial+ #canonicalPolynom
| polynomial+ EQUAL polynomial+ #equality
;
polynomial : SIGN? '(' (polynomial)* ')' #parens
| monomial #monom
;
monomial : SIGN? coefficient? VAR ('^' INT)? #addend
| SIGN? coefficient #number
;
coefficient : INT | DEC;
INT : ('0'..'9')+;
DEC : INT '.' INT;
VAR : [a-z]+;
SIGN : '+' | '-';
EQUAL : '=';
WHITESPACE : (' '|'\t')+ -> skip;
and I'm giving an input 23*44=12 or #1234
I'm expecting that my parser throws mismatched token or any kind of exception for a character * or # that is not defined in my grammar.
Instead, my parser just skips * or # and traverse a tree like there are do not exist.
My handler function where I'm calling lexer, parser and that's kind of stuff.
private static (IParseTree tree, string parseErrorMessage) TryParseExpression(string expression)
{
ICharStream stream = CharStreams.fromstring(expression);
ITokenSource lexer = new PolynomialLexer(stream);
ITokenStream tokens = new CommonTokenStream(lexer);
PolynomialParser parser = new PolynomialParser(tokens);
//parser.ErrorHandler = new PolynomialErrorStrategy(); -> I tried custom error strategy
//parser.RemoveErrorListeners();
//parser.AddErrorListener(new PolynomialErrorListener()); -> I tried custom error listener
parser.BuildParseTree = true;
try
{
var tree = parser.canonical();
return (tree, string.Empty);
}
catch (RecognitionException re)
{
return (null, re.Message);
}
catch (ParseCanceledException pce)
{
return (null, pce.Message);
}
}
I tried to add a custom error listener.
public class PolynomialErrorListener : BaseErrorListener
{
private const string Eof = "EOF";
public override void SyntaxError(TextWriter output, IRecognizer recognizer, IToken offendingSymbol, int line, int charPositionInLine, string msg,
RecognitionException e)
{
if (msg.Contains(Eof))
{
throw new ParseCanceledException($"{GetSyntaxErrorHeader(charPositionInLine)}. Missing an expression after '=' sign");
}
if (e is NoViableAltException || e is InputMismatchException)
{
throw new ParseCanceledException($"{GetSyntaxErrorHeader(charPositionInLine)}. Probably, not closed operator");
}
throw new ParseCanceledException($"{GetSyntaxErrorHeader(charPositionInLine)}. {msg}");
}
private static string GetSyntaxErrorHeader(int errorPosition)
{
return $"Expression is invalid. Input is not valid at {--errorPosition} position";
}
}
After that, I tried to implement a custom error strategy.
public class PolynomialErrorStrategy : DefaultErrorStrategy
{
public override void ReportError(Parser recognizer, RecognitionException e)
{
throw e;
}
public override void Recover(Parser recognizer, RecognitionException e)
{
for (ParserRuleContext context = recognizer.Context; context != null; context = (ParserRuleContext) context.Parent) {
context.exception = e;
}
throw new ParseCanceledException(e);
}
public override IToken RecoverInline(Parser recognizer)
{
InputMismatchException e = new InputMismatchException(recognizer);
for (ParserRuleContext context = recognizer.Context; context != null; context = (ParserRuleContext) context.Parent) {
context.exception = e;
}
throw new ParseCanceledException(e);
}
protected override void ReportInputMismatch(Parser recognizer, InputMismatchException e)
{
string msg = "mismatched input " + GetTokenErrorDisplay(e.OffendingToken);
// msg += " expecting one of " + e.GetExpectedTokens().ToString(recognizer.());
RecognitionException ex = new RecognitionException(msg, recognizer, recognizer.InputStream, recognizer.Context);
throw ex;
}
protected override void ReportMissingToken(Parser recognizer)
{
BeginErrorCondition(recognizer);
IToken token = recognizer.CurrentToken;
IntervalSet expecting = GetExpectedTokens(recognizer);
string msg = "missing " + expecting.ToString() + " at " + GetTokenErrorDisplay(token);
throw new RecognitionException(msg, recognizer, recognizer.InputStream, recognizer.Context);
}
}
Is there any flag that I forgot to specify in a parser or I have incorrect grammar?
Funny thing that I'm using ANTLR plugin in my IDE and when I'm testing my grammar in here this plugin correctly responds with line 1:2 token recognition error at: '*'
Full source code: https://github.com/EvgeniyZ/PolynomialCanonicForm
I'm using ANTLR 4.8-complete.jar
Edit
I tried to add to a grammar rule
parse : canonical EOF
;
Still no luck here
What happens if you do this:
parse
: canonical EOF
;
and also invoke this rule:
var tree = parser.parse();
By adding the EOF token (end of input), you are forcing the parser to consume all tokens, which should result in an error when the parser cannot handle them properly.
Funny thing that I'm using ANTLR plugin in my IDE and when I'm testing my grammar in here this plugin correctly responds with line 1:2 token recognition error at: '*'
That is what the lexer emits on the std.err stream. The lexer just reports this warning and goes its merry way. So the lexer just ignores these chars and therefor never end up in the parser. If you add the following line at the end of your lexer:
// Fallback rule: matches any single character if not matched by another lexer rule
UNKNOWN : . ;
then the * and # chars will be sent to the parser as UNKNOWN tokens and should then cause recognition errors.
I am using Xtext 2.15 to generate a language that, among other things, processes asynchronous calls in a way they look synchronous.
For instance, the following code in my language:
int a = 1;
int b = 2;
boolean sleepSuccess = doSleep(2000); // sleep two seconds
int c = 3;
int d = 4;
would generate the following Java code:
int a = 1;
int b = 2;
doSleep(2000, new DoSleepCallback() {
public void onTrigger(boolean rc) {
boolean sleepSuccess = rc;
int c = 3;
int d = 4;
}
});
To achieve it, I defined the grammar this way:
grammar org.qedlang.qed.QED with jbase.Jbase // Jbase inherits Xbase
...
FunctionDeclaration return XExpression:
=>({FunctionDeclaration} type=JvmTypeReference name=ValidID '(')
(params+=FullJvmFormalParameter (',' params+=FullJvmFormalParameter)*)?
')' block=XBlockExpression
;
The FunctionDeclaration rule is used to define asynchronous calls. In my language library, I would have as system call:
boolean doSleep(int millis) {} // async FunctionDeclaration element stub
The underlying Java implementation would be:
public abstract class DoSleepCallback {
public abstract void onTrigger(boolean rc);
}
public void doSleep(int millis, DoSleepCallback callback) {
<perform sleep and call callback.onTrigger(<success>)>
}
So, using the inferrer, type computer and compiler, how to identify calls to FunctionDeclaration elements, add a callback parameter and process the rest of the body in an inner class?
I could, for instance, override appendFeatureCall in the language compiler, would it work? There is still a part I don't know how to do...
override appendFeatureCall(XAbstractFeatureCall call, ITreeAppendable b) {
...
val feature = call.feature
...
if (feature instanceof JvmExecutable) {
b.append('(')
val arguments = call.actualArguments
if (!arguments.isEmpty) {
...
arguments.appendArguments(b, shouldBreakFirstArgument)
// HERE IS THE PART I DON'T KNOW HOW TO DO
<IF feature IS A FunctionDeclaration>
<argument.appendArgument(NEW GENERATED CALLBACK PARAMETER)>
<INSERT REST OF XBlockExpression body INSIDE CALLBACK INSTANCE>
<ENDIF>
}
b.append(');')
}
}
So basically, how to tell if "feature" points to FunctionDeclaration? The rest, I may be able to do it...
Related to another StackOverflow entry, I had the idea of implementing FunctionDeclaration in the inferrer as a class instead of as a method:
def void inferExpressions(JvmDeclaredType it, FunctionDeclaration function) {
// now let's go over the features
for ( f : (function.block as XBlockExpression).expressions ) {
if (f instanceof FunctionDeclaration) {
members += f.toClass(f.fullyQualifiedName) [
inferVariables(f)
superTypes += typeRef(FunctionDeclarationObject)
// let's add a default constructor
members += f.toConstructor [
for (p : f.params)
parameters += p.toParameter(p.name, p.parameterType)
body = f.block
]
inferExpressions(f)
]
}
}
}
The generated class would extend FunctionDeclarationObject, so I thought there was a way to identify FunctionDeclaration as FunctionDeclarationObject subclasses. But then, I would need to extend the XFeatureCall default scoping to include classes in order to making it work...
I fully realize the question is not obvious, sorry...
Thanks,
Martin
EDIT: modified DoSleepCallback declaration from static to abstract (was erroneous)
I don't think you can generate what you need using the jvm model inferrer.
You should provide your own subclass of the XbaseCompiler (or JBaseCompiler, if any... and don't forget to register with guice in your runtime module), and override doInternalToJavaStatement(XExpression expr, ITreeAppendable it, boolean isReferenced) to manage how your FunctionDeclaration should be generated.
The DSL I'm working on allows users to define a 'complete text substitution' variable. When parsing the code, we then need to look up the value of the variable and start parsing again from that code.
The substitution can be very simple (single constants) or entire statements or code blocks.
This is a mock grammar which I hope illustrates my point.
grammar a;
entry
: (set_variable
| print_line)*
;
set_variable
: 'SET' ID '=' STRING_CONSTANT ';'
;
print_line
: 'PRINT' ID ';'
;
STRING_CONSTANT: '\'' ('\'\'' | ~('\''))* '\'' ;
ID: [a-z][a-zA-Z0-9_]* ;
VARIABLE: '&' ID;
BLANK: [ \t\n\r]+ -> channel(HIDDEN) ;
Then the following statements executed consecutively should be valid;
SET foo = 'Hello world!';
PRINT foo;
SET bar = 'foo;'
PRINT &bar // should be interpreted as 'PRINT foo;'
SET baz = 'PRINT foo; PRINT'; // one complete statement and one incomplete statement
&baz foo; // should be interpreted as 'PRINT foo; PRINT foo;'
Any time the & variable token is discovered, we immediately switch to interpreting the value of that variable instead. As above, this can mean that you set up the code in such a way that is is invalid, full of half-statements that are only completed when the value is just right. The variables can be redefined at any point in the text.
Strictly speaking the current language definition doesn't disallow nesting &vars inside each other, but the current parsing doesn't handle this and I would not be upset if it wasn't allowed.
Currently I'm building an interpreter using a visitor, but this one I'm stuck on.
How can I build a lexer/parser/interpreter which will allow me to do this? Thanks for any help!
So I have found one solution to the issue. I think it could be better - as it potentially does a lot of array copying - but at least it works for now.
EDIT: I was wrong before, and my solution would consume ANY & that it found, including those in valid locations such as inside string constants. This seems like a better solution:
First, I extended the InputStream so that it is able to rewrite the input steam when a & is encountered. This unfortunately involves copying the array, which I can maybe resolve in the future:
MacroInputStream.java
package preprocessor;
import org.antlr.v4.runtime.ANTLRInputStream;
public class MacroInputStream extends ANTLRInputStream {
private HashMap<String, String> map;
public MacroInputStream(String s, HashMap<String, String> map) {
super(s);
this.map = map;
}
public void rewrite(int startIndex, int stopIndex, String replaceText) {
int length = stopIndex-startIndex+1;
char[] replData = replaceText.toCharArray();
if (replData.length == length) {
for (int i = 0; i < length; i++) data[startIndex+i] = replData[i];
} else {
char[] newData = new char[data.length+replData.length-length];
System.arraycopy(data, 0, newData, 0, startIndex);
System.arraycopy(replData, 0, newData, startIndex, replData.length);
System.arraycopy(data, stopIndex+1, newData, startIndex+replData.length, data.length-(stopIndex+1));
data = newData;
n = data.length;
}
}
}
Secondly, I extended the Lexer so that when a VARIABLE token is encountered, the rewrite method above is called:
MacroGrammarLexer.java
package language;
import language.DSL_GrammarLexer;
import org.antlr.v4.runtime.Token;
import java.util.HashMap;
public class MacroGrammarLexer extends MacroGrammarLexer{
private HashMap<String, String> map;
public DSL_GrammarLexerPre(MacroInputStream input, HashMap<String, String> map) {
super(input);
this.map = map;
// TODO Auto-generated constructor stub
}
private MacroInputStream getInput() {
return (MacroInputStream) _input;
}
#Override
public Token nextToken() {
Token t = super.nextToken();
if (t.getType() == VARIABLE) {
System.out.println("Encountered token " + t.getText()+" ===> rewriting!!!");
getInput().rewrite(t.getStartIndex(), t.getStopIndex(),
map.get(t.getText().substring(1)));
getInput().seek(t.getStartIndex()); // reset input stream to previous
return super.nextToken();
}
return t;
}
}
Lastly, I modified the generated parser to set the variables at the time of parsing:
DSL_GrammarParser.java
...
...
HashMap<String, String> map; // same map as before, passed as a new argument.
...
...
public final SetContext set() throws RecognitionException {
SetContext _localctx = new SetContext(_ctx, getState());
enterRule(_localctx, 130, RULE_set);
try {
enterOuterAlt(_localctx, 1);
{
String vname = null; String vval = null; // set up variables
setState(1215); match(SET);
setState(1216); vname = variable_name().getText(); // set vname
setState(1217); match(EQUALS);
setState(1218); vval = string_constant().getText(); // set vval
System.out.println("Found SET " + vname +" = " + vval+";");
map.put(vname, vval);
}
}
catch (RecognitionException re) {
_localctx.exception = re;
_errHandler.reportError(this, re);
_errHandler.recover(this, re);
}
finally {
exitRule();
}
return _localctx;
}
...
...
Unfortunately this method is final so this will make maintenance a bit more difficult, but it works for now.
The standard pattern to handling your requirements is to implement a symbol table. The simplest form is as a key:value store. In your visitor, add var declarations as encountered, and read out the values as var references are encountered.
As described, your DSL does not define a scoping requirement on the variables declared. If you do require scoped variables, then use a stack of key:value stores, pushing and popping on scope entry and exit.
See this related StackOverflow answer.
Separately, since your strings may contain commands, you can simply parse the contents as part of your initial parse. That is, expand your grammar with a rule that includes the full set of valid contents:
set_variable
: 'SET' ID '=' stringLiteral ';'
;
stringLiteral:
Quote Quote? (
( set_variable
| print_line
| VARIABLE
| ID
)
| STRING_CONSTANT // redefine without the quotes
)
Quote
;