I need to parse Delphi directives in ANTLR4. For example,
{$MODE Delphi}{$ifdef net}
net,
{$else}
NewKernel,
{$ifndef st}
{$ifndef neter}
hyper,
{$endif}
{$endif}
{$endif}
But at the same time, curly braces { } in Delphi can also contain ordinary comments. So how can I create an ANTLR4 grammar that works something like: "file : don't parse text { parse } don't parse text ;"?
It's hard to be certain from how the question is phrased, but it sounds like you are looking for:
directive : LDrctv .... RBrace ; // replace .... with appropriate rule terms
comment : LBrace .*? RBrace ;
skip : . ; // aggregates all tokens not included in above rules
LDrctv : '{$' ;
LBrace : '{' ;
RBrace : '}' ;
// other token rules
Update: the skip rule catches all tokens that are not consumed by the other rules. If only the directive rule is of interest, then only the corresponding DirectiveContext nodes in the parse tree need be evaluated.
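To make that concrete, here is a minimal self-contained sketch along those lines. The grammar name, the Identifier token, and the other rule are mine, purely for illustration; replace the directive body with whatever terms your directives actually need:
grammar DelphiDirectives;
file      : (directive | comment | other)* EOF ;
directive : LDrctv Identifier* RBrace ; // e.g. {$MODE Delphi}, {$ifdef net}
comment   : LBrace ~RBrace* RBrace ;    // any tokens except '}' inside braces
other     : . ;                         // any token outside braces, left unparsed
LDrctv     : '{$' ;
LBrace     : '{' ;
RBrace     : '}' ;
Identifier : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS         : [ \t\r\n]+ -> skip ;
Any        : . ;                        // catch-all so the lexer never errors
Because the lexer prefers the longest match, '{$' is always tokenized as LDrctv rather than LBrace followed by '$', which is what separates directives from plain comments.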
I'm trying to parse 'for loop' according to this (partial) grammar:
grammar GaleugParserNew;
/*
* PARSER RULES
*/
relational
: '>'
| '<'
;
varChange
: '++'
| '--'
;
values
: ID
| DIGIT
;
for_stat
: FOR '(' ID '=' values ';' values relational values ';' ID varChange ')' '{' '}'
;
/*
* LEXER RULES
*/
FOR : 'for' ;
ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
DIGIT : [0-9]+ ;
SPACE : [ \t\r\n] -> skip ;
When I try to generate the GUI view of how it's parsed, it doesn't follow the grammar I provided above. This is what it produces:
I've encountered this problem before; back then I simply exited cmd, opened it again, recompiled everything, and somehow that worked. It's not working now, though.
I'm not very knowledgeable about ANTLR4, so I'm not sure where to look to solve this problem.
This must be a problem with the IDE you are using. The grammar is fine and produces this parse tree in Visual Studio Code:
I guess the IDE is using the wrong parser or lexer (maybe from a different work file?). Print the lexer tokens to see whether they are what you expect. Hint: avoid defining implicit lexer tokens (like '(', '}', etc.), which will allow you to give the tokens good names.
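For example, with the Python runtime, a quick token dump looks like this — a sketch in which the generated lexer class name and the sample input are assumptions based on the grammar above:
from antlr4 import InputStream
from GaleugParserNewLexer import GaleugParserNewLexer  # generated by ANTLR

# lex a sample for statement and print every token the parser would see
stream = InputStream("for (i = 0; i < 10; i++) { }")
lexer = GaleugParserNewLexer(stream)
for token in lexer.getAllTokens():
    print(token)  # shows token type, text, and position
If 'for' comes out as an ID instead of FOR, you are running a stale lexer.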
I am struggling a bit with trying to define integers in my grammar.
Let's say I have this small grammar:
grammar Hello;
r : 'hello' INTEGER;
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
If I then type in
hello 5
it parses correctly.
However, if I have an additional parser rule (even if it's unused) which defines a token '5',
then I can't parse the previous example anymore.
So this grammar:
grammar Hello;
r : 'hello' INTEGER;
unusedRule: 'hi' '5';
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
with
hello 5
won't parse anymore. It gives me the following error:
Hello::r:1:6: mismatched input '5' expecting INTEGER
How is that possible and how can I work around this?
When you define a parser rule like
unusedRule: 'hi' '5';
ANTLR creates implicit lexer tokens for the subterms. Since they are created automatically, you have no control over where they sit in the precedence evaluation of lexer rules. Here, the implicit '5' token takes precedence over INTEGER, so the lexer tokenizes the input 5 as '5', and rule r, which expects an INTEGER, fails with the "mismatched input" error above.
Consequently, the best policy is to never use literals in parser rules; always explicitly define your tokens.
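A minimal sketch of the grammar rewritten under that policy — HELLO and HI are names I picked for the former literals, and '5' is simply matched as an ordinary INTEGER, so no competing implicit token exists:
grammar Hello;
r : HELLO INTEGER ;
unusedRule : HI INTEGER ; // '5' is just an INTEGER here
HELLO : 'hello' ;
HI : 'hi' ;
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
Now the input hello 5 parses again, because the only lexer rule that can match 5 is INTEGER.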
I'm defining a grammar for a small language using ANTLR4. The idea is that in this language there's a keyword "function", which can be used either to define a function or as a type specifier when defining parameters. I would like to be able to do something like this:
function aFunctionHere(int a, function callback) ....
However, it seems Antlr doesn't like that I use "function" in two different places. As far as I can tell, the grammar isn't even ambiguous.
In the following grammar, if I remove LINE 1, the generated parser parses the sample input without a problem. Also, if I change the token string in either LINE 2 or LINE 3, so that they are not equal, the parser works.
The error I get with the grammar as-is:
line 1:0 mismatched input 'function' expecting <INVALID>
What does "expecting <INVALID>" mean?
The (stripped down) grammar:
grammar test;
begin : function ;
function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;
parameterlist: parameter (',' parameter)+ ;
parameter: BaseParamType IDENTIFIER ;
// Lexer stuff
BaseParamType:
INT_TYPE
| FUNCTION_TYPE // <---- LINE 1
;
FUNCTION : 'function'; // <---- LINE 2
INT_TYPE : 'int';
FUNCTION_TYPE : 'function'; // <---- LINE 3
IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;
WS : [ \t\r\n]+ -> skip ;
The input I'm using:
function abc(int c, int d, int a)
The program to test the generated parser:
from antlr4 import *
from testLexer import testLexer as Lexer
from testParser import testParser as Parser
from antlr4.tree.Trees import Trees
def main(argv):
    input = FileStream(argv[1] if len(argv) > 1 else "test.in")
    lexer = Lexer(input)
    tokens = CommonTokenStream(lexer)
    parser = Parser(tokens)
    tree = parser.begin()
    print(Trees.toStringTree(tree, None, parser))

if __name__ == '__main__':
    import sys
    main(sys.argv)
Just use one name for the token function.
A token is just a token. Looking at function in isolation, it is not possible to decide whether it is a FUNCTION or a FUNCTION_TYPE. Since FUNCTION comes first in the file, that's what the lexer uses. That makes it impossible to ever match FUNCTION_TYPE, so that token type becomes invalid.
The parser will figure out the syntactic role of the token function. So there would be no point using two different lexical descriptors for the same token, even if it would be possible.
In the grammar in the OP, BaseParamType is also a lexical type, which will absorb all uses of the token function, preventing FUNCTION from being recognized in the production for function. Changing its name to baseParamType, which effectively changes it to a parser non-terminal, will allow the parser to work, although I suppose it may alter the parse tree in undesirable ways.
I understand the objection that the parser "should know" which lexical tokens are possible in context, given the nature of Antlr's predictive parsing strategy. I'm far from an Antlr expert so I won't pretend to explain why it doesn't seem to work, but with the majority of parser generators -- and all the ones I commonly use -- lexical analysis is effectively performed as a prior pass to parsing, so the conversion of textual input into a stream of tokens is done prior to the parser establishing context. (Most lexical generators, including Antlr, have mechanisms with which the user can build lexical context, but IMHO these mechanisms reduce grammar readability and should only be used if strictly necessary.)
Here's the grammar file which I tested:
grammar test;
begin : function ;
function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;
parameterlist: parameter (',' parameter)+ ;
parameter: baseParamType IDENTIFIER ;
baseParamType:
INT_TYPE
| FUNCTION
;
// Lexer stuff
FUNCTION : 'function';
INT_TYPE : 'int';
IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;
WS : [ \t\r\n]+ -> skip ;
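With that change, the test program from the question should print a tree along these lines for the sample input (whitespace aside):
(begin (function function abc ( (parameterlist (parameter (baseParamType int) c) , (parameter (baseParamType int) d) , (parameter (baseParamType int) a)) )))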
I am not so familiar with ANTLR. I am using version 4, and I have a grammar where whitespace is not important in some parts (but it might be in others, so it's rather hit or miss).
So say we have the following grammar
grammar Foo;
program : A* ;
A : ID '#' ID '(' IDList ')' ';' ;
ID : [a-zA-Z]+ ;
IDList : ID (',' IDList)* ;
WS : [ \t\r\n]+ -> skip ;
and a test input
foo#bar(X,Y);
foo#baz ( z,Z) ;
The first line is parsed correctly whereas the second one is not.
I don't want to pollute my rules with the places where whitespace is not relevant, since my actual grammar is more complicated than this toy example. In case it's not clear: the ID '#' ID part must not contain any whitespace. Whitespace in any other position shouldn't matter at all.
Even though you are skipping WS, lexer rules are still sensitive to the existence of the whitespace characters. Skip simply means that no token is generated for consumption by the parser. Thus, the Addr lexer rule below explicitly does not permit any interior whitespace characters.
Conversely, the a and idList parser rules never see interior whitespace tokens, so those rules are insensitive to whitespace occurring between the generated tokens.
grammar Foo;
program : a* EOF ; // EOF will require parsing the entire input
a : Addr LParen idList RParen Semi ;
idList : ID (Comma ID)* ; // simpler equivalent construct
Addr : ID '#' ID ;
ID : [a-zA-Z]+ ;
LParen : '(' ;
RParen : ')' ;
Comma : ',' ;
Semi : ';' ;
WS : [ \t\r\n]+ -> skip ;
Alternatively, define ID '#' ID as a single lexer token rather than assembling it from separate tokens in a parser rule; a lexer rule matches contiguous characters, so interior whitespace is impossible by construction:
A : AID '(' IDList ')' ';' ;
AID : [a-zA-Z]+ '#' [a-zA-Z]+;
Other options:
enable/disable whitespace in your token stream (see the sketch after this list)
enable/disable whitespace with lexer modes (this may be a problem because lexer modes are triggered by context, which is not easy to determine in your case)
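For the token-stream option, the usual trick is to route whitespace to the hidden channel instead of skipping it: the parser still ignores it, but the tokens stay in the stream, where you can check for them afterwards (for instance with BufferedTokenStream.getHiddenTokensToLeft). Only the WS rule changes:
WS : [ \t\r\n]+ -> channel(HIDDEN) ;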
I have the following yacc parser (it doesn't have any conflicts) for int or float variable declarations:
%token ID INT FLOAT
%token SEMICOLON
%%
program : list_declaration { printf("program\n"); }
;
list_declaration : declaration { printf("list_declaration\n"); }
| declaration declaration { printf("list_declaration\n"); }
;
declaration : var_declaration { printf("declaration\n"); }
;
var_declaration : type ID SEMICOLON { printf("var_declaration\n"); }
;
type : INT { printf("type\n"); }
| FLOAT { printf("type\n"); }
;
%%
I've been knocking my head against this problem but haven't come up with a solution.
If there are two variable declarations as input, like:
int test;
float test2;
It's parsed normally; here is the output:
type
var_declaration
declaration
type
var_declaration
declaration
list_declaration
program
But if there's only one declaration, the parser never reduces it to program. For instance:
int test;
gives:
type
var_declaration
declaration
Shouldn't declaration be reduced to list_declaration, and then list_declaration to program? I'm planning later to extend list_declaration to any number of declarations, but I can't do that until I first understand why it only works properly when there are at least two declarations.
The problem is almost certainly that you are suppressing the EOF return from yylex. yylex must return 0 on EOF; otherwise, bison parsers cannot reliably recognize the start production.
Like most parser generators -- and as described in most parsing textbooks -- bison and yacc create an "augmented" start production whose right-hand side consists of the declared (or implicit) start non-terminal followed by an EOF pseudo-token. The parse will only succeed if that production is reduced, and that production cannot be reduced without the EOF.
Because bison will reduce without lookahead in states where lookahead is unnecessary, it is possible, with your grammar, for bison to reduce declaration declaration to list_declaration, and then to program, without any lookahead. But after a single declaration it cannot reduce without lookahead, because it must first see whether a second declaration follows; since the suppressed EOF never arrives, the reductions to list_declaration and program never happen. Note that even in the case with two declarations, despite the fact that program has been reduced, the parse has not actually succeeded and yyparse will not have returned.
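For illustration, here is a minimal hand-written yylex in C that returns 0 at end of input. This is a sketch under assumptions: it uses the token names from the grammar above and the classic y.tab.h header name, which may differ in your setup.
#include <stdio.h>
#include <ctype.h>
#include <string.h>
#include "y.tab.h"               /* token definitions generated by yacc/bison */

int yylex(void) {
    int c;

    while ((c = getchar()) != EOF && isspace(c))
        ;                        /* skip whitespace between tokens */
    if (c == EOF)
        return 0;                /* 0 is how yyparse learns the input is over */
    if (c == ';')
        return SEMICOLON;
    if (isalpha(c)) {            /* collect an identifier or keyword */
        char buf[64];
        size_t i = 0;
        do {
            if (i < sizeof buf - 1)
                buf[i++] = (char)c;
            c = getchar();
        } while (c != EOF && (isalnum(c) || c == '_'));
        if (c != EOF)
            ungetc(c, stdin);    /* push back the first non-identifier char */
        buf[i] = '\0';
        if (strcmp(buf, "int") == 0)   return INT;
        if (strcmp(buf, "float") == 0) return FLOAT;
        return ID;
    }
    return c;                    /* let the parser report anything unexpected */
}
And when you later generalize the list, the standard formulation is the left-recursive list_declaration : declaration | list_declaration declaration ;, which bison parses in constant stack space.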