C# ANTLR4: DefaultErrorStrategy or custom error listener does not catch unrecognized characters

Strangely, DefaultErrorStrategy does nothing to catch unrecognized characters in the input stream. I tried a custom error strategy, a custom error listener, and BailErrorStrategy, all without success.
My grammar:
grammar Polynomial;
parse : canonical EOF
    ;
canonical : polynomial+ #canonicalPolynom
    | polynomial+ EQUAL polynomial+ #equality
    ;
polynomial : SIGN? '(' (polynomial)* ')' #parens
    | monomial #monom
    ;
monomial : SIGN? coefficient? VAR ('^' INT)? #addend
    | SIGN? coefficient #number
    ;
coefficient : INT | DEC;
INT : ('0'..'9')+;
DEC : INT '.' INT;
VAR : [a-z]+;
SIGN : '+' | '-';
EQUAL : '=';
WHITESPACE : (' '|'\t')+ -> skip;
and I'm giving it the input 23*44=12 or #1234.
I expect the parser to throw a mismatched-token error, or some other kind of exception, for the characters * and #, which are not defined in my grammar.
Instead, the parser just skips * and # and traverses the tree as if they did not exist.
Here is my handler function, where I set up the lexer and the parser:
private static (IParseTree tree, string parseErrorMessage) TryParseExpression(string expression)
{
    ICharStream stream = CharStreams.fromstring(expression);
    ITokenSource lexer = new PolynomialLexer(stream);
    ITokenStream tokens = new CommonTokenStream(lexer);
    PolynomialParser parser = new PolynomialParser(tokens);
    //parser.ErrorHandler = new PolynomialErrorStrategy(); -> I tried custom error strategy
    //parser.RemoveErrorListeners();
    //parser.AddErrorListener(new PolynomialErrorListener()); -> I tried custom error listener
    parser.BuildParseTree = true;
    try
    {
        var tree = parser.canonical();
        return (tree, string.Empty);
    }
    catch (RecognitionException re)
    {
        return (null, re.Message);
    }
    catch (ParseCanceledException pce)
    {
        return (null, pce.Message);
    }
}
I tried to add a custom error listener.
public class PolynomialErrorListener : BaseErrorListener
{
    private const string Eof = "EOF";

    public override void SyntaxError(TextWriter output, IRecognizer recognizer, IToken offendingSymbol, int line,
        int charPositionInLine, string msg, RecognitionException e)
    {
        if (msg.Contains(Eof))
        {
            throw new ParseCanceledException($"{GetSyntaxErrorHeader(charPositionInLine)}. Missing an expression after '=' sign");
        }
        if (e is NoViableAltException || e is InputMismatchException)
        {
            throw new ParseCanceledException($"{GetSyntaxErrorHeader(charPositionInLine)}. Probably, not closed operator");
        }
        throw new ParseCanceledException($"{GetSyntaxErrorHeader(charPositionInLine)}. {msg}");
    }

    private static string GetSyntaxErrorHeader(int errorPosition)
    {
        return $"Expression is invalid. Input is not valid at {--errorPosition} position";
    }
}
After that, I tried to implement a custom error strategy.
public class PolynomialErrorStrategy : DefaultErrorStrategy
{
    public override void ReportError(Parser recognizer, RecognitionException e)
    {
        throw e;
    }

    public override void Recover(Parser recognizer, RecognitionException e)
    {
        for (ParserRuleContext context = recognizer.Context; context != null; context = (ParserRuleContext) context.Parent) {
            context.exception = e;
        }
        throw new ParseCanceledException(e);
    }

    public override IToken RecoverInline(Parser recognizer)
    {
        InputMismatchException e = new InputMismatchException(recognizer);
        for (ParserRuleContext context = recognizer.Context; context != null; context = (ParserRuleContext) context.Parent) {
            context.exception = e;
        }
        throw new ParseCanceledException(e);
    }

    protected override void ReportInputMismatch(Parser recognizer, InputMismatchException e)
    {
        string msg = "mismatched input " + GetTokenErrorDisplay(e.OffendingToken);
        // msg += " expecting one of " + e.GetExpectedTokens().ToString(recognizer.());
        RecognitionException ex = new RecognitionException(msg, recognizer, recognizer.InputStream, recognizer.Context);
        throw ex;
    }

    protected override void ReportMissingToken(Parser recognizer)
    {
        BeginErrorCondition(recognizer);
        IToken token = recognizer.CurrentToken;
        IntervalSet expecting = GetExpectedTokens(recognizer);
        string msg = "missing " + expecting.ToString() + " at " + GetTokenErrorDisplay(token);
        throw new RecognitionException(msg, recognizer, recognizer.InputStream, recognizer.Context);
    }
}
Is there a flag that I forgot to set on the parser, or is my grammar incorrect?
The funny thing is that I'm using the ANTLR plugin in my IDE, and when I test my grammar there, the plugin correctly responds with: line 1:2 token recognition error at: '*'
Full source code: https://github.com/EvgeniyZ/PolynomialCanonicForm
I'm using ANTLR 4.8-complete.jar.
Edit
I tried adding this rule to the grammar:
parse : canonical EOF
    ;
Still no luck.

What happens if you do this:
parse
    : canonical EOF
    ;
and also invoke this rule:
var tree = parser.parse();
By adding the EOF token (end of input), you force the parser to consume all tokens, which should result in an error when the parser cannot handle them properly.
You wrote: "Funny thing that I'm using ANTLR plugin in my IDE and when I'm testing my grammar in here this plugin correctly responds with line 1:2 token recognition error at: '*'"
That is what the lexer emits on the standard error stream. The lexer just reports this warning and goes on its merry way, so these characters are dropped and never reach the parser. If you add the following rule at the end of your lexer rules:
// Fallback rule: matches any single character if not matched by another lexer rule
UNKNOWN : . ;
then the * and # characters will be sent to the parser as UNKNOWN tokens and should then cause recognition errors.
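To make the mechanism concrete, here is a minimal sketch in plain Java. It is a hand-written toy tokenizer, not ANTLR's generated lexer, and the class name FallbackTokenDemo and the emitUnknown flag are made up for illustration. It only models INT, EQUAL and whitespace, which is enough to show that dropping an unmatched character leaves a token sequence the parser accepts, while a catch-all token keeps the bad character visible.

```java
import java.util.ArrayList;
import java.util.List;

final class FallbackTokenDemo {
    /**
     * Toy tokenizer (an illustration only, not ANTLR-generated code).
     * With emitUnknown == false, unmatched characters are reported on stderr and dropped.
     * With emitUnknown == true, they become UNKNOWN tokens, like a trailing "UNKNOWN : . ;" rule.
     */
    static List<String> tokenize(String input, boolean emitUnknown) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c >= '0' && c <= '9') {
                int start = i;
                while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                    i++;
                }
                tokens.add("INT(" + input.substring(start, i) + ")");
            } else if (c == '=') {
                tokens.add("EQUAL");
                i++;
            } else if (c == ' ' || c == '\t') {
                i++; // whitespace is skipped and produces no token
            } else if (emitUnknown) {
                tokens.add("UNKNOWN(" + c + ")");
                i++;
            } else {
                System.err.println("token recognition error at: '" + c + "'");
                i++; // the character is dropped, so the parser never sees it
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("23*44=12", false)); // [INT(23), INT(44), EQUAL, INT(12)]
        System.out.println(tokenize("23*44=12", true));  // [INT(23), UNKNOWN(*), INT(44), EQUAL, INT(12)]
    }
}
```

In the first case the remaining tokens form a valid equality, which is why the parser raises nothing; in the second, no parser rule accepts UNKNOWN, so the error surfaces in the parser.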

Related

Can ANTLR's Visitor system automatically visit a rule context when a file is parsed?

I've been using ANTLR for a month now and I'm still no expert. I want to know whether the generated BaseVisitor class automatically visits a specific rule context once the corresponding visitRuleContext() method is implemented and the input file has been parsed.
Yes. If you look into the generated base visitor class, you'll see that every method returns visitChildren(ctx). So even when you override only one visit...(...) method in your own visitor, that method will be called for every matching node, provided you start the traversal by calling visit(...) on the root of the tree.
A quick test shows this:
grammar T;
parse
    : something+ EOF
    ;
something
    : ANY+
    | number
    ;
number
    : DIGITS
    ;
DIGITS
    : [0-9]+
    ;
ANY
    : .
    ;
And a test class:
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class Main {
    public static void main(String[] args) throws Exception {
        TLexer lexer = new TLexer(CharStreams.fromString("mu 123"));
        TParser parser = new TParser(new CommonTokenStream(lexer));
        ParseTree root = parser.parse();
        new TestVisitor().visit(root);
    }
}

class TestVisitor extends TBaseVisitor<Object> {
    @Override
    public Object visitSomething(TParser.SomethingContext ctx) {
        System.out.println("visitSomething: " + ctx.getText());
        return super.visitChildren(ctx);
    }
}
will print:
visitSomething: mu
visitSomething: 123
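The delegation that makes this work can be sketched without ANTLR at all. The following is a simplified stand-in, with hypothetical Node and BaseVisitor classes written for this illustration (they are not the classes ANTLR generates): every default visit method forwards to visitChildren, so a single override is still reached from the root.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VisitorDefaultsDemo {
    /** Minimal stand-in for a parse-tree node: a rule name, its text, and child nodes. */
    static final class Node {
        final String rule;
        final String text;
        final List<Node> children;

        Node(String rule, String text, Node... children) {
            this.rule = rule;
            this.text = text;
            this.children = Arrays.asList(children);
        }

        /** Dispatches to the visit method that corresponds to this node's rule. */
        <T> T accept(BaseVisitor<T> visitor) {
            switch (rule) {
                case "parse":     return visitor.visitParse(this);
                case "something": return visitor.visitSomething(this);
                default:          return visitor.visitNumber(this);
            }
        }
    }

    /** Stand-in for a generated base visitor: every method just visits the children. */
    static class BaseVisitor<T> {
        T visit(Node node) { return node.accept(this); }

        T visitChildren(Node node) {
            T result = null;
            for (Node child : node.children) {
                result = child.accept(this);
            }
            return result;
        }

        T visitParse(Node node) { return visitChildren(node); }
        T visitSomething(Node node) { return visitChildren(node); }
        T visitNumber(Node node) { return visitChildren(node); }
    }

    /** Overrides a single method; the defaults of the other methods lead the traversal here. */
    static final class LoggingVisitor extends BaseVisitor<Void> {
        final List<String> log = new ArrayList<>();

        @Override
        Void visitSomething(Node node) {
            log.add("visitSomething: " + node.text);
            return visitChildren(node);
        }
    }

    static List<String> run() {
        Node tree = new Node("parse", "mu123",
                new Node("something", "mu"),
                new Node("something", "123", new Node("number", "123")));
        LoggingVisitor visitor = new LoggingVisitor();
        visitor.visit(tree);
        return visitor.log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Note the flip side of this design: if an overriding method returns without calling visitChildren, the subtree below that node is not visited.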

How to walk the parse tree to check for syntax errors in ANTLR

I have written a fairly simple language in ANTLR. Before actually interpreting the code written by a user, I wish to parse the code and check for syntax errors. If any are found, I wish to output the cause of the error and exit. How can I check the code for syntax errors and output the corresponding error? Please note that, for my purposes, error statements similar to those generated by the ANTLR tool are more than sufficient. For example:
line 3:0 missing ';'
There is an ErrorListener that you can use to get more information.
For example:
...
FormulaParser parser = new FormulaParser(tokens);
parser.IsCompletion = options.IsForCompletion;
ErrorListener errListener = new ErrorListener();
parser.AddErrorListener(errListener);
IParseTree tree = parser.formula();
The only thing you need to do is attach the ErrorListener to the parser.
Here is the code of the ErrorListener:
/// <summary>
/// Error listener recording all errors that Antlr parser raises during parsing.
/// </summary>
internal class ErrorListener : BaseErrorListener
{
    private const string Eof = "the end of formula";

    public ErrorListener()
    {
        ErrorMessages = new List<ErrorInfo>();
    }

    public bool ErrorOccured { get; private set; }
    public List<ErrorInfo> ErrorMessages { get; private set; }

    public override void SyntaxError(IRecognizer recognizer, IToken offendingSymbol, int line, int charPositionInLine, string msg, RecognitionException e)
    {
        ErrorOccured = true;
        if (e == null || e.GetType() != typeof(NoViableAltException))
        {
            ErrorMessages.Add(new ErrorInfo()
            {
                Message = ConvertMessage(msg),
                StartIndex = offendingSymbol.StartIndex,
                Column = offendingSymbol.Column + 1,
                Line = offendingSymbol.Line,
                Length = offendingSymbol.Text.Length
            });
            return;
        }
        ErrorMessages.Add(new ErrorInfo()
        {
            Message = string.Format("{0}{1}", ConvertToken(offendingSymbol.Text), " unexpected"),
            StartIndex = offendingSymbol.StartIndex,
            Column = offendingSymbol.Column + 1,
            Line = offendingSymbol.Line,
            Length = offendingSymbol.Text.Length
        });
    }

    public override void ReportAmbiguity(Antlr4.Runtime.Parser recognizer, DFA dfa, int startIndex, int stopIndex, bool exact, BitSet ambigAlts, ATNConfigSet configs)
    {
        ErrorOccured = true;
        ErrorMessages.Add(new ErrorInfo()
        {
            Message = "Ambiguity", Column = startIndex, StartIndex = startIndex
        });
        base.ReportAmbiguity(recognizer, dfa, startIndex, stopIndex, exact, ambigAlts, configs);
    }

    private string ConvertToken(string token)
    {
        return string.Equals(token, "<EOF>", StringComparison.InvariantCultureIgnoreCase)
            ? Eof
            : token;
    }

    private string ConvertMessage(string message)
    {
        StringBuilder builder = new StringBuilder(message);
        builder.Replace("<EOF>", Eof);
        return builder.ToString();
    }
}
It is a simple listener, but it shows the idea: you can tell whether a reported problem is a syntax error or an ambiguity, and after parsing you can ask the listener directly whether any error occurred.

Want to skip all line comments except two in an ANTLR4 grammar

I want to extend the IDL.g4 grammar a bit so that I can distinguish the two comments //#top-level false and //#top-level true. All other comments I just want to skip, as before.
I have tried to add top_level, TOP_LEVEL_TRUE and TOP_LEVEL_FALSE like this, because I thought ANTLR4 gave precedence to the lexer rules that come first.
top_level
    : TOP_LEVEL_TRUE
    | TOP_LEVEL_FALSE
    ;
TOP_LEVEL_TRUE
    : '//#top-level true'
    ;
TOP_LEVEL_FALSE
    : '//#top-level false'
    ;
LINE_COMMENT
    : '//' ~('\n'|'\r')* '\r'? '\n' -> channel(HIDDEN)
    ;
But the listener method enterTop_level(...) is never called;
all comments seem to be eaten by LINE_COMMENT. How should I organize the lexer and parser rules?
One more question: I also want to be notified when the end of the input file is reached. How do I do that? I have tried a finalize() function in the listener class, but it never gets called.
Updated with a complete example:
I use the grammar file IDL.g4, as mentioned above. I updated it by putting the parser rule top_level just below the event_header rule; the lexer rules are placed just above the ID rule.
Here is my Listener.java file:
class Listener extends IDLBaseListener {
    @Override
    public void enterTop_level(IDLParser.Top_levelContext ctx) {
        System.out.println("Found top-level");
    }
}
and here is the main program, IDLCheck.java:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTreeWalker;
import java.io.FileInputStream;
import java.io.InputStream;

public class IDLCheck {
    public void process(String[] args) throws Exception {
        InputStream is = new FileInputStream("sample.idl");
        ANTLRInputStream input = new ANTLRInputStream(is);
        IDLLexer lexer = new IDLLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        IDLParser parser = new IDLParser(tokens);
        parser.setBuildParseTree(true);
        RuleContext tree = parser.specification();
        Listener listener = new Listener();
        ParseTreeWalker walker = new ParseTreeWalker();
        walker.walk(listener, tree);
    }

    public static void main(String[] args) throws Exception {
        new IDLCheck().process(args);
    }
}
and an input file, sample.idl:
module CommonTypes {
    struct WChannel {
        int w;
        float d;
    }; //#top-level false
    struct EPlanID {
        int kind;
        short index;
    }; //#top-level TRUE
};
I expect to see the output "Found top-level" twice, but I see nothing.
Finally I found a solution. I just added newline characters to the TOP_LEVEL_FALSE and TOP_LEVEL_TRUE lexer rules, and I also added the top_level parser rule to the definition rule, because I only expect top_level to appear after a struct or union. This is an rti.com-specific extension to the IDL format, and this modification seems to be good enough for me.
definition
    : type_decl SEMICOLON top_level?
    | const_decl SEMICOLON
    ...
TOP_LEVEL_TRUE
    : '//#top-level true' '\r'? '\n'
    ;
TOP_LEVEL_FALSE
    : '//#top-level false' '\r'? '\n'
    ;

Code substitution for DSL using ANTLR

The DSL I'm working on allows users to define a 'complete text substitution' variable. When parsing the code, we then need to look up the value of the variable and start parsing again from that code.
The substitution can be very simple (single constants) or entire statements or code blocks.
This is a mock grammar which I hope illustrates my point.
grammar a;
entry
    : (set_variable
    | print_line)*
    ;
set_variable
    : 'SET' ID '=' STRING_CONSTANT ';'
    ;
print_line
    : 'PRINT' ID ';'
    ;
STRING_CONSTANT: '\'' ('\'\'' | ~('\''))* '\'' ;
ID: [a-z][a-zA-Z0-9_]* ;
VARIABLE: '&' ID;
BLANK: [ \t\n\r]+ -> channel(HIDDEN) ;
Then the following statements, executed consecutively, should be valid:
SET foo = 'Hello world!';
PRINT foo;
SET bar = 'foo;'
PRINT &bar // should be interpreted as 'PRINT foo;'
SET baz = 'PRINT foo; PRINT'; // one complete statement and one incomplete statement
&baz foo; // should be interpreted as 'PRINT foo; PRINT foo;'
Any time the & variable token is discovered, we immediately switch to interpreting the value of that variable instead. As above, this means that you can set up the code in such a way that it is invalid on its own, full of half-statements that are only completed when the value is just right. The variables can be redefined at any point in the text.
Strictly speaking the current language definition doesn't disallow nesting &vars inside each other, but the current parsing doesn't handle this and I would not be upset if it wasn't allowed.
Currently I'm building an interpreter using a visitor, but this one I'm stuck on.
How can I build a lexer/parser/interpreter which will allow me to do this? Thanks for any help!
So I have found one solution to the issue. I think it could be better, as it potentially does a lot of array copying, but at least it works for now.
EDIT: I was wrong before, and my solution would consume ANY & that it found, including those in valid locations such as inside string constants. This seems like a better solution:
First, I extended the InputStream so that it is able to rewrite the input stream when a & is encountered. This unfortunately involves copying the array, which I can maybe resolve in the future:
MacroInputStream.java
package preprocessor;
import org.antlr.v4.runtime.ANTLRInputStream;
import java.util.HashMap;

public class MacroInputStream extends ANTLRInputStream {
    private HashMap<String, String> map;

    public MacroInputStream(String s, HashMap<String, String> map) {
        super(s);
        this.map = map;
    }

    // Replaces data[startIndex..stopIndex] (inclusive) with replaceText.
    public void rewrite(int startIndex, int stopIndex, String replaceText) {
        int length = stopIndex-startIndex+1;
        char[] replData = replaceText.toCharArray();
        if (replData.length == length) {
            // Same length: overwrite in place.
            for (int i = 0; i < length; i++) data[startIndex+i] = replData[i];
        } else {
            // Different length: build a new array from prefix + replacement + suffix.
            char[] newData = new char[data.length+replData.length-length];
            System.arraycopy(data, 0, newData, 0, startIndex);
            System.arraycopy(replData, 0, newData, startIndex, replData.length);
            System.arraycopy(data, stopIndex+1, newData, startIndex+replData.length, data.length-(stopIndex+1));
            data = newData;
            n = data.length;
        }
    }
}
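The index arithmetic in rewrite can be checked in isolation. This is a small sketch of the same splice on a plain char array, with no ANTLR classes involved; the class name SpliceDemo and the method splice are made up for the illustration, and the inputs are the two substitution cases from the question.

```java
final class SpliceDemo {
    /** Replaces data[start..stop] (both inclusive) with the replacement and returns a new array. */
    static char[] splice(char[] data, int start, int stop, String replacement) {
        char[] repl = replacement.toCharArray();
        int removed = stop - start + 1;
        char[] out = new char[data.length - removed + repl.length];
        System.arraycopy(data, 0, out, 0, start);                    // prefix before the token
        System.arraycopy(repl, 0, out, start, repl.length);          // replacement text
        System.arraycopy(data, stop + 1, out, start + repl.length,   // suffix after the token
                data.length - (stop + 1));
        return out;
    }

    public static void main(String[] args) {
        // "&bar" occupies indices 6..9 and is replaced by text of the same length.
        System.out.println(new String(splice("PRINT &bar".toCharArray(), 6, 9, "foo;")));
        // prints "PRINT foo;"

        // "&baz" occupies indices 0..3 and is replaced by longer text.
        System.out.println(new String(splice("&baz foo;".toCharArray(), 0, 3, "PRINT foo; PRINT")));
        // prints "PRINT foo; PRINT foo;"
    }
}
```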
Secondly, I extended the Lexer so that when a VARIABLE token is encountered, the rewrite method above is called:
MacroGrammarLexer.java
package language;
import language.DSL_GrammarLexer;
import preprocessor.MacroInputStream;
import org.antlr.v4.runtime.Token;
import java.util.HashMap;

public class MacroGrammarLexer extends DSL_GrammarLexer {
    private HashMap<String, String> map;

    public MacroGrammarLexer(MacroInputStream input, HashMap<String, String> map) {
        super(input);
        this.map = map;
    }

    private MacroInputStream getInput() {
        return (MacroInputStream) _input;
    }

    @Override
    public Token nextToken() {
        Token t = super.nextToken();
        if (t.getType() == VARIABLE) {
            System.out.println("Encountered token " + t.getText()+" ===> rewriting!!!");
            // Replace "&name" in the input with the stored value of "name".
            getInput().rewrite(t.getStartIndex(), t.getStopIndex(),
                    map.get(t.getText().substring(1)));
            getInput().seek(t.getStartIndex()); // reset input stream to previous
            return super.nextToken();
        }
        return t;
    }
}
Lastly, I modified the generated parser to set the variables at the time of parsing:
DSL_GrammarParser.java
...
...
HashMap<String, String> map; // same map as before, passed as a new argument.
...
...
public final SetContext set() throws RecognitionException {
    SetContext _localctx = new SetContext(_ctx, getState());
    enterRule(_localctx, 130, RULE_set);
    try {
        enterOuterAlt(_localctx, 1);
        {
            String vname = null; String vval = null; // set up variables
            setState(1215); match(SET);
            setState(1216); vname = variable_name().getText(); // set vname
            setState(1217); match(EQUALS);
            setState(1218); vval = string_constant().getText(); // set vval
            System.out.println("Found SET " + vname +" = " + vval+";");
            map.put(vname, vval);
        }
    }
    catch (RecognitionException re) {
        _localctx.exception = re;
        _errHandler.reportError(this, re);
        _errHandler.recover(this, re);
    }
    finally {
        exitRule();
    }
    return _localctx;
}
...
...
Unfortunately this method is final so this will make maintenance a bit more difficult, but it works for now.
The standard pattern for handling your requirements is to implement a symbol table. The simplest form is a key:value store. In your visitor, add variable declarations as they are encountered, and read out the values as variable references are encountered.
As described, your DSL does not define a scoping requirement on the variables declared. If you do require scoped variables, then use a stack of key:value stores, pushing and popping on scope entry and exit.
See this related StackOverflow answer.
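As a rough sketch of that idea, a scoped symbol table can be a stack of maps. The class below is an illustration only: the name ScopedSymbolTable and its methods are assumptions made for this example, not part of ANTLR or of the question's code. A visitor would call define when it visits a SET statement and resolve when it meets a variable reference.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class ScopedSymbolTable {
    // The innermost scope sits at the head of the deque; the global scope is at the tail.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    ScopedSymbolTable() {
        scopes.push(new HashMap<>()); // global scope
    }

    void enterScope() {
        scopes.push(new HashMap<>());
    }

    void exitScope() {
        if (scopes.size() == 1) {
            throw new IllegalStateException("cannot exit the global scope");
        }
        scopes.pop();
    }

    /** Declares or redefines a variable in the innermost scope. */
    void define(String name, String value) {
        scopes.peek().put(name, value);
    }

    /** Looks a variable up from the innermost scope outwards. */
    Optional<String> resolve(String name) {
        for (Map<String, String> scope : scopes) { // iterates head first, i.e. innermost first
            String value = scope.get(name);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ScopedSymbolTable table = new ScopedSymbolTable();
        table.define("foo", "Hello world!");
        table.enterScope();
        table.define("foo", "shadowed");
        System.out.println(table.resolve("foo").orElse("<undefined>")); // shadowed
        table.exitScope();
        System.out.println(table.resolve("foo").orElse("<undefined>")); // Hello world!
        System.out.println(table.resolve("bar").orElse("<undefined>")); // <undefined>
    }
}
```

If the DSL has no scoping, as described, the same class works with only the global scope, which reduces it to the plain key:value store.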
Separately, since your strings may contain commands, you can simply parse the contents as part of your initial parse. That is, expand your grammar with a rule that includes the full set of valid contents:
set_variable
    : 'SET' ID '=' stringLiteral ';'
    ;
stringLiteral:
    Quote Quote? (
        ( set_variable
        | print_line
        | VARIABLE
        | ID
        )
        | STRING_CONSTANT // redefine without the quotes
    )
    Quote
    ;

Handling Antlr Syntax Errors or how to give a better message on unexpected token

We have the following sub-part of an Antlr grammar:
signed_int
    : SIGN? INT
    ;
INT : '0'..'9'+
    ;
When someone enters a numeric value everything is fine, but if they mistakenly type something like 1O (one and capital o) we get a cryptic error message like:
error 1 : Missing token at offset 14
near [Index: 0 (Start: 0-Stop: 0) ='<missing COLON>' type<24> Line: 26 LinePos:14]
: syntax error...
What is a good way to handle this type of error? I thought of defining a catch-all SYMBOL token type, but this led to too many parser building errors. I will continue looking into Antlr error handling, but I thought I would post this here to look for some insights.
You should override the reportError methods in the lexer and the parser.
You can do it by adding this code to your lexer file:
@Override
public void reportError(RecognitionException e) {
    throw new RuntimeException(e);
}
And create a method matches in the parser that checks whether the input string matches the specified grammar:
public static boolean matches(String input) {
    try {
        regExLexer lexer = new regExLexer(new ANTLRStringStream(input));
        regExParser parser = new regExParser(new CommonTokenStream(lexer));
        parser.goal();
        return true;
    } catch (RuntimeException e) {
        return false;
    }
    catch (Exception e) {
        return false;
    }
    catch (OutOfMemoryError e) {
        return false;
    }
}

@Override
public void reportError(RecognitionException e) {
    throw new RuntimeException(e);
}
Then, in your file, use Parser.matches(input); to check whether the given input matches the grammar. The method returns true if it matches and false otherwise, so when it returns false you can give users any customized error message.
You could try to use an ANTLRErrorStrategy, by overriding some of the messages in DefaultErrorStrategy.

Resources