How do underivable rules affect parsing?

When writing an Xtext grammar for a simple SQL dialect, I found that rules that cannot be derived from the start symbol apparently affect parsing.
For example, consider the following (very simplified) extract of my grammar, which should be able to parse expressions like FROM table1;:
Start:
    subquery ';';
subquery:
    /*select=select_clause */tables=from_clause;
from_clause:
    'FROM' tables;
tables:
    tables+=table (',' tables+=table)*;
table:
    name=table_name (alias=alias)?;
table_name:
    prefix=qualified_name_prefix? name=qualified_name;
qualified_name_prefix:
    ID'.';
qualified_name :
    =>qualified_name_prefix? ID;
alias returns EString:
    'AS'? alias=ID;
with_clause :
    'WITH' elements+=with_list_element (',' elements+=with_list_element)*;
with_list_element :
    name=ID (column_list_clause=column_list_clause)? 'AS' '(' subquery=subquery ')';
column_list_clause :
    '(' names+=ID+ ')';
When trying to parse the string FROM table1;, I get the following error:
'no viable alternative at input ';'' on EString
If I remove rule with_clause, the error is gone and the string is parsed properly. How is this possible even though with_clause cannot be derived from Start?

The problem is that the predicate (=>) hides an ambiguity. You could merge prefix and name into a single rule:
Table_name:
    name=Qualified_name;
Qualified_name :
    (ID '.' (ID '.')?)? ID;
Or you could try something like:
Table_name:
    ((prefix=ID ".")? =>name=Qualified_name);
Qualified_name :
    =>(ID '.' ID) | ID;

Related

ANTLR4 Grammar not detecting fields inside mode properly

I'm trying to write a grammar to parse a simple schema with a structure like the one the Prisma ORM uses. After trying a lot of different things, it still doesn't parse properly, and I am not sure where the issue lies.
This is the lexer I have written:
lexer grammar OrmLexer;
MODEL: 'model';
MODELNAME : LETTERS+;
OPEN : '{' NEWLINE?;
OPENMODEL : MODEL MODELNAME OPEN -> pushMode(MODELMODE) ;
WHITESPACE : ' ' -> skip ;
NEWLINE: '\r' '\n' | '\n' | '\r';
mode MODELMODE ;
FIELD: FIELDNAME FIELDTYPE NEWLINE;
FIELDNAME : LETTERS+;
FIELDTYPE : LETTERS+;
CLOSE: '}' -> popMode ;
fragment LETTERS : [a-zA-Z] ;
This is the parser I have written:
parser grammar OrmParser;
options { tokenVocab=OrmLexer; }
start: root | EOF ;
root: model*;
model : modelstart modelbody modelend;
modelbody: modelfield*;
modelstart : MODEL MODELNAME OPEN ;
modelfield : FIELDNAME FIELDTYPE NEWLINE ;
modelend: CLOSE ;
I am testing the grammar on an example schema which looks like:
model test {
id test
user test
}
In this schema the fields are not being parsed correctly. There should be two fields, namely
(FIELDNAME: id, FIELDTYPE: test) and (FIELDNAME: user, FIELDTYPE: test), but it doesn't parse those. This is the tree I get as a result:
[image: resulting parse tree]
Besides that, I am not really sure when to use fragments in the lexer and when not to.
I hope someone can help me find and solve the issue! Thanks in advance.
I have tried modifying my lexer so that it reads OPENMODEL : OPEN -> pushMode(MODELMODE) ; and tried many other small changes in both the parser and the lexer. The attempt in the code above is the one that got closest to the result I want to achieve, namely that the modelbody has all the modelfields parsed separately without any errors occurring.

What is the best way to handle overlapping lexer patterns that are sensitive to context?

I'm attempting to write an ANTLR grammar for parsing the C4 DSL. However, the DSL has a number of places where the grammar is very open-ended, resulting in overlapping lexer rules (in the sense that multiple token rules match).
For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:
workspace "Name" "Description" {
properties {
xyz "a string property"
nonstring nodoublequotes
}
}
The issue I'm running into is that the rules for the <name> and <value> have to be very broad, basically anything except whitespace. Also, property values that contain spaces are wrapped in double quotes and will match my STRING token.
My current solution is the grammar below, using property_value : BLOB | STRING; to match values and BLOB to match names. Is there a better way here? If I could make context-sensitive lexer tokens, I would make NAME and VALUE tokens instead. In the actual grammar I define case-insensitive name tokens for things like workspace and properties. This allows me to easily match the existing DSL semantics, but raises the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.
grammar c4mce;
workspace : 'workspace' (STRING (STRING)?)? '{' NL workspace_body '}';
workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';
property_element: BLOB property_value;
property_value : BLOB | STRING;
BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;
This tokenizes to
[#0,0:8='workspace',<'workspace'>,1:0]
[#1,10:15='"Name"',<STRING>,1:10]
[#2,17:29='"Description"',<STRING>,1:17]
[#3,31:31='{',<'{'>,1:31]
[#4,32:32='\n',<NL>,1:32]
[#5,37:46='properties',<'properties'>,2:4]
[#6,48:48='{',<'{'>,2:15]
[#7,49:49='\n',<NL>,2:16]
[#8,58:60='xyz',<BLOB>,3:8]
[#9,62:80='"a string property"',<STRING>,3:12]
[#10,81:81='\n',<NL>,3:31]
[#11,90:98='nonstring',<BLOB>,4:8]
[#12,100:113='nodoublequotes',<BLOB>,4:18]
[#13,114:114='\n',<NL>,4:32]
[#14,119:119='}',<'}'>,5:4]
[#15,120:120='\n',<NL>,5:5]
[#16,121:121='}',<'}'>,6:0]
[#17,122:122='\n',<NL>,6:1]
[#18,123:122='<EOF>',<EOF>,7:0]
This is all fine, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this?
As I expand the grammar, I expect to end up with a lot of BLOB tokens, since creating a narrower token in the lexer would be pointless: BLOB would match instead.

This is the classic keywords-as-identifiers problem. If you want a specific character sequence that is lexed as a keyword to also be usable as a normal identifier in certain places, you have to list that keyword as a possible alternative there. For example:
property_element: (BLOB | K_WORKSPACE) property_value;
property_value : BLOB | STRING | K_WORKSPACE;
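As a rough sketch of how this could fit into the grammar from the question: when two lexer rules match the same text with the same length, ANTLR picks the rule that is defined first, so the keyword rules go before BLOB, and the parser rules then list the keywords wherever a plain name or value is allowed. K_PROPERTIES, the grammar name C4Sketch, and the rule names propertyName/propertyValue are made up for illustration, and the case-insensitive keyword rules are only an assumption about how the real grammar defines them.

```antlr
grammar C4Sketch;

workspace     : K_WORKSPACE (STRING STRING?)? '{' NL workspaceBody '}' NL* EOF ;
workspaceBody : (properties NL)* ;
properties    : K_PROPERTIES '{' NL (property NL)* '}' ;
property      : propertyName propertyValue ;

// Keywords are listed explicitly wherever a plain name or value is allowed.
propertyName  : BLOB | K_WORKSPACE | K_PROPERTIES ;
propertyValue : BLOB | STRING | K_WORKSPACE | K_PROPERTIES ;

// Keyword rules come before BLOB, so they win when both match the same text.
K_WORKSPACE   : [wW][oO][rR][kK][sS][pP][aA][cC][eE] ;
K_PROPERTIES  : [pP][rR][oO][pP][eE][rR][tT][iI][eE][sS] ;
BLOB          : [\p{Alpha}]+ ;
STRING        : '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"' ;
NL            : '\r'? '\n' ;
WS            : [ \t]+ -> skip ;
```

With this layout a property line such as workspace workspace should tokenize to two K_WORKSPACE tokens and still be accepted by property, while longer words such as workspaces remain BLOB because the longest match wins.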

The following token definitions can never be matched because prior tokens match the same input: INT,STRING

I'm trying a simple grammar in ANTLR. It should parse inputs such as L=[1,2,hello].
However, ANTLR produces this error: The following token definitions can never be matched because prior tokens match the same input: INT,STRING. Any help?
grammar List;
decl: ID '=[' Inside1 ']'; // Declaration of a List. Example : L=[1,'hello']
Inside1: (INT|STRING) Inside2| ; // First element in the List. Could be nothing
Inside2:',' (INT|STRING) Inside2 | ; //
ID:('0'..'Z')+;
INT:('0'..'9')+;
STRING:('a'..'Z')+;
EDIT: Here is the updated grammar. The error now remains for INT only.
grammar List;
decl: STRING '=[' Inside1 ']'; // Declaration of a List. Example : L=[1,'hello']
Inside1: (INT|'"'STRING'"') Inside2| ; // First element in the List. Could be nothing
Inside2:',' (INT|'"'STRING'"') Inside2 | ; //
STRING:('A'..'Z')+;
INT:('0'..'9')+;

Your ID pattern matches everything that would be matched by INT or STRING, making them irrelevant. I don't think that's what you want.
ID shouldn't match tokens starting with a digit; 42 is not an identifier. And your comment implies that STRING is intended to be a string literal ('hello') but your lexical pattern makes no attempt to match '.
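One possible cleaned-up version, as a sketch rather than a drop-in fix. It assumes that bare words such as hello in L=[1,2,hello] should be accepted as identifiers and that string literals are single-quoted without escape sequences; the element rule and the WS rule are additions for illustration. Note also that Inside1 and Inside2 start with an uppercase letter, which makes them lexer rules in ANTLR, so the sketch expresses the list body with lowercase parser rules instead.

```antlr
grammar List;

// Parser rules (lowercase): a declaration such as L=[1,'hello',x]
decl    : ID '=' '[' (element (',' element)*)? ']' EOF ;
element : INT | STRING | ID ;

// Lexer rules (uppercase): the three patterns no longer overlap.
ID      : [a-zA-Z_] [a-zA-Z_0-9]* ;            // never starts with a digit
INT     : [0-9]+ ;
STRING  : '\'' ~('\'' | '\r' | '\n')* '\'' ;   // single-quoted, no escapes
WS      : [ \t\r\n]+ -> skip ;
```

Because ID can no longer start with a digit and STRING requires quotes, no rule is completely shadowed by an earlier one, which is what the original warning was about.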

Token with different interpretations (i.e. keyword and identifier)

I am writing a grammar with a lot of case-insensitive keywords in ANTLR4. I collected some example files for the format to use as parsing tests, and some of them use tokens that are keywords in one place as identifiers in other places. For example, there is a CORE keyword, which in other places is used as an ID for a structure from user input. Here are some parts of my grammar:
fragment A : [aA]; // match either an 'a' or 'A'
fragment B : [bB];
fragment C : [cC];
[...]
CORE: C O R E ;
[...]
IDSTRING: [a-zA-Z_] [a-zA-Z0-9_]*;
id: IDSTRING ;
The error thrown then is line 7982:8 mismatched input 'core' expecting IDSTRING, as the user input is intended as IDSTRING, but always eaten by the keyword rule. In the input it exists both as keyword and as id like this:
MACRO oa12f01
CLASS CORE ; #here it is a KEYWORD
[...]
SITE core ; #here it is a ID
Is there a way to let users use some keywords as identifiers by changing my grammar, for example by "casting" the token to IDSTRING in rules like this, or is that a false hope with parsers that are not handwritten?

You can simply list the keywords that are allowed as identifiers as alternatives in the id rule:
id: IDSTRING | CORE | ... ;
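A small self-contained sketch of this, assuming a cut-down statement syntax loosely based on the input in the question. The grammar name, the file and statement rules, the MACRO/CLASS/SITE tokens, and the comment rule are illustrative guesses and not part of the original grammar; the keywords use character sets instead of the per-letter fragments only to keep the sketch short.

```antlr
grammar KeywordIdSketch;

file      : statement* EOF ;
statement : MACRO id
          | CLASS CORE ';'     // CORE used as a keyword
          | SITE id ';'        // CORE accepted as a name through id
          ;

// Keywords that may double as identifiers are listed next to IDSTRING.
id        : IDSTRING | CORE ;

// Keyword rules precede IDSTRING, so 'core' is always lexed as CORE.
MACRO     : [mM][aA][cC][rR][oO] ;
CLASS     : [cC][lL][aA][sS][sS] ;
SITE      : [sS][iI][tT][eE] ;
CORE      : [cC][oO][rR][eE] ;
IDSTRING  : [a-zA-Z_] [a-zA-Z0-9_]* ;
COMMENT   : '#' ~[\r\n]* -> skip ;
WS        : [ \t\r\n]+ -> skip ;
```

The lexer still produces a CORE token for core in SITE core ;, but the parser now accepts it there because id lists CORE as an alternative.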

Need keywords to be recognized as such only in the correct places

I am new to ANTLR and parsing, so this is a learning exercise for me.
I am trying to parse a language that allows free-format text in some locations. The free-format text may therefore be ANY word or words, including the keywords in the language - their location in the language's sentences defines them as keywords or free text.
In the following example, the first instance of "JOB" is a keyword; the second "JOB" is free-form text:
JOB=(JOB)
I have tried the following grammar, which avoids defining the language's keywords in lexer rules.
grammar Test;
test1 : 'JOB' EQ OPAREN (utext) CPAREN ;
utext : UNQUOTEDTEXT ;
COMMA : ',' ;
OPAREN : '(' ;
CPAREN : ')' ;
EQ : '=' ;
UNQUOTEDTEXT : ~[a-z,()\'\" \r\n\t]*? ;
SPC : [ \t]+ -> skip ;
I was hoping that by defining the keywords as string literals in the parser rules, as above, they would apply only in the location in which they were defined. This appears not to be the case. On testing the "test1" rule (with the ANTLR4 plug-in in IDEA), using the example phrase above, "JOB=(JOB)" (without quotes), as input, I get the following error message:
line 1:5 mismatched input 'JOB' expecting UNQUOTEDTEXT
So after creating an implicit token for 'JOB', it looks like ANTLR uses that token at other points in the parser grammar too, i.e. whenever it sees the 'JOB' string. To test this, I added another parser rule:
test2 : 'DATA' EQ OPAREN (utext) CPAREN ;
and tested with "DATA=(JOB)"
I got the following error (similar to before):
line 1:6 mismatched input 'JOB' expecting UNQUOTEDTEXT
Is there any way to ask ANTLR to enforce token recognition only in the locations where the token is defined/introduced?
Thanks!

What you have is essentially a lake grammar, the opposite of an island grammar. A lake grammar is one in which you mostly have structured text, with lakes of stuff you don't care about. Generally the key is having some lexical sentinel that says "enter unstructured text area" and then "reenter structured text area". In your case it seems to be (...). ANTLR has the notion of a lexical mode, which is what you want in order to handle areas with different lexical structures. When you see a '(' you want to switch modes to some free-form area. When you see a ')' in that area you want to switch back to the default mode. Anyway, "mode" is your keyword here.
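A minimal sketch of the mode approach for the JOB=(JOB) example, assuming that everything between the parentheses counts as free text and that parentheses do not nest. Lexical modes are only available in a lexer grammar, so the sketch is meant to be split into two files; the grammar names JobLexer/JobParser and the mode name FREE_TEXT are invented for illustration.

```antlr
// ---- JobLexer.g4 ----
lexer grammar JobLexer;

JOB    : 'JOB' ;
DATA   : 'DATA' ;
EQ     : '=' ;
COMMA  : ',' ;
OPAREN : '(' -> pushMode(FREE_TEXT) ;   // sentinel: enter the free-text area
SPC    : [ \t\r\n]+ -> skip ;

mode FREE_TEXT;
// Keywords are not defined in this mode, so 'JOB' is just text here.
UNQUOTEDTEXT : ~[()]+ ;
CPAREN       : ')' -> popMode ;         // sentinel: back to the default mode

// ---- JobParser.g4 ----
parser grammar JobParser;
options { tokenVocab = JobLexer; }

test1 : JOB EQ OPAREN utext CPAREN EOF ;
test2 : DATA EQ OPAREN utext CPAREN EOF ;
utext : UNQUOTEDTEXT ;
```

With this split, the first JOB in JOB=(JOB) is lexed in the default mode as the JOB keyword, while the second one is lexed inside FREE_TEXT, where only UNQUOTEDTEXT and CPAREN exist.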

I had a similar problem with keywords that are sometimes only identifiers. I did it this way:
OnlySometimesAKeyword : 'value' ;
identifier
    : Identifier // defined as usual
    | maybeKeywords
    ;
maybeKeywords
    : OnlySometimesAKeyword
    // ...
    ;
In your parser rules simply use identifier instead of Identifier and you'll also be able to match the "maybe keywords". This will of course also match them in places where they will be keywords, but you could check this in the parser if necessary.
