ANTLR4 Grammar not detecting fields inside mode properly - parsing

I'm trying to write a grammar to parse a simple schema, structured like the ones the Prisma ORM uses. After trying many different things, it still doesn't parse properly and I am not sure where the issue lies.
This is the lexer I have written:
lexer grammar OrmLexer;
MODEL: 'model';
MODELNAME : LETTERS+;
OPEN : '{' NEWLINE?;
OPENMODEL : MODEL MODELNAME OPEN -> pushMode(MODELMODE) ;
WHITESPACE : ' ' -> skip ;
NEWLINE: '\r' '\n' | '\n' | '\r';
mode MODELMODE ;
FIELD: FIELDNAME FIELDTYPE NEWLINE;
FIELDNAME : LETTERS+;
FIELDTYPE : LETTERS+;
CLOSE: '}' -> popMode ;
fragment LETTERS : [a-zA-Z] ;
This is the parser I have written:
parser grammar OrmParser;
options { tokenVocab=OrmLexer; }
start: root | EOF ;
root: model*;
model : modelstart modelbody modelend;
modelbody: modelfield*;
modelstart : MODEL MODELNAME OPEN ;
modelfield : FIELDNAME FIELDTYPE NEWLINE ;
modelend: CLOSE ;
I am testing the grammar on an example schema which looks like:
model test {
id test
user test
}
In this schema the fields are not being parsed correctly. There should be two fields, namely:
(FIELDNAME: id, FIELDTYPE: test) and (FIELDNAME: user, FIELDTYPE: test). However, it doesn't parse those. This is the tree I get as a result:
Besides that I am not really sure when to use fragments and when not in the lexer.
I hope someone can help me find and solve the issue! Thanks in advance.
I have tried modifying my lexer so that: OPENMODEL : OPEN -> pushMode(MODELMODE) ; and tried many small changes in both the parser and lexer. The attempt in the code above is the one that came closest to the result I want to achieve, namely that the modelbody has all the modelfields parsed separately without any errors occurring.

Related

Drop the required surrounding quotes in the lexer/parser

In several projects I have run into a similar effect in my grammars.
I have the need to parse something like Key="Value"
So I create a grammar (simplest I could make to show the effect):
grammar test;
KEY : [a-zA-Z0-9]+ ;
VALUE : DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE ;
DOUBLEQUOTE : '"' ;
EQUALS : '=' ;
entry : key=KEY EQUALS value=VALUE;
I can now parse thing="One Two Three" and in my code I receive
key = thing
value = "One Two Three"
In all of my projects I end up with an extra step to strip those " from the value.
Usually something like this (I use Java)
String value = ctx.value.getText();
value = value.substring(1, value.length()-1);
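As a sketch of that extra step, a slightly more defensive variant could look like the following. The names QuoteStripper and stripQuotes are hypothetical and not part of ANTLR. The helper only strips when both surrounding quotes are actually present, so unquoted input passes through unchanged.

```java
/** Hypothetical helper (not part of ANTLR): removes one pair of surrounding double quotes. */
final class QuoteStripper {
    private QuoteStripper() {}

    static String stripQuotes(String text) {
        // Strip only when the text both starts and ends with a double quote;
        // null, unquoted, or single-character input is returned unchanged.
        if (text != null && text.length() >= 2
                && text.charAt(0) == '"'
                && text.charAt(text.length() - 1) == '"') {
            return text.substring(1, text.length() - 1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Prints: One Two Three
        System.out.println(stripQuotes("\"One Two Three\""));
    }
}
```

In a listener or visitor this would be called as stripQuotes(ctx.value.getText()), assuming the same labeled token as in the grammar above.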
In my real grammars I find it very hard to move the check of the surrounding " into the parser.
Is there a clean way to already drop the " by doing something in the lexer/parser?
Essentially I want ctx.value.getText() to return One Two Three instead of "One Two Three".
Update:
I have been playing with the excellent answer provided by Bart Kiers and found this variation which does exactly what I was looking for.
By putting the DOUBLEQUOTE on a hidden channel they are used by the lexer and hidden from the parser.
TestLexer.g4
lexer grammar TestLexer;
KEY : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> channel(HIDDEN), pushMode(STRING_MODE);
EQUALS : '=';
mode STRING_MODE;
STRING_DOUBLEQUOTE
: '"' -> channel(HIDDEN), type(DOUBLEQUOTE), popMode
;
STRING
: [ _a-zA-Z0-9.-]+
;
and
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
entry : key=KEY EQUALS value=STRING ;
Try this:
VALUE
: DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE
{setText(getText().substring(1, getText().length()-1));}
;
Needless to say: this ties your grammar to Java, and (depending on how much embedded Java code you have) your grammar will be hard to port to another target language.
EDIT
Once a token is created, there is no built-in way to separate it (other than doing so in embedded actions, as I demonstrated). What you're looking for can be done, but that means rewriting your grammar so that a string literal is not constructed as a single token. This can be done by using lexical modes so that the string can be constructed in the parser.
A quick demo:
TestLexer.g4
lexer grammar TestLexer;
KEY : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> pushMode(STRING_MODE);
EQUALS : '=';
mode STRING_MODE;
STRING_DOUBLEQUOTE
: '"' -> type(DOUBLEQUOTE), popMode
;
STRING_ATOM
: [ _a-zA-Z0-9.-]
;
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
entry : key=KEY EQUALS value;
value : DOUBLEQUOTE string_atoms DOUBLEQUOTE;
string_atoms : STRING_ATOM*;
If you now run the Java code:
Lexer lexer = new TestLexer(CharStreams.fromString("Key=\"One Two Three\""));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
TestParser.EntryContext entry = parser.entry();
System.out.println(entry.value().string_atoms().getText());
this will be printed:
One Two Three

What is the best way to handle overlapping lexer patterns that are sensitive to context?

I'm attempting to write an Antlr grammar for parsing the C4 DSL. However, the DSL has a number of places where the grammar is very open ended, resulting in overlapping lexer rules (in the sense that multiple token rules match).
For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:
workspace "Name" "Description" {
properties {
xyz "a string property"
nonstring nodoublequotes
}
}
The issue I'm running into is that the rules for the <name> and <value> have to be very broad, basically anything except whitespace. Also, properties with spaces with double quotes will match my STRING token.
My current solution is the grammar below, using property_value: BLOB | STRING; to match values and BLOB to match names. Is there a better way here? If I could make context-sensitive lexer tokens I would make NAME and VALUE tokens instead. In the actual grammar I define case-insensitive name tokens for things like workspace and properties. This allows me to easily match the existing DSL semantics, but raises the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.
grammar c4mce;
workspace : 'workspace' (STRING (STRING)?)? '{' NL workspace_body '}';
workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';
property_element: BLOB property_value;
property_value : BLOB | STRING;
BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;
This tokenizes to
[#0,0:8='workspace',<'workspace'>,1:0]
[#1,10:15='"Name"',<STRING>,1:10]
[#2,17:29='"Description"',<STRING>,1:17]
[#3,31:31='{',<'{'>,1:31]
[#4,32:32='\n',<NL>,1:32]
[#5,37:46='properties',<'properties'>,2:4]
[#6,48:48='{',<'{'>,2:15]
[#7,49:49='\n',<NL>,2:16]
[#8,58:60='xyz',<BLOB>,3:8]
[#9,62:80='"a string property"',<STRING>,3:12]
[#10,81:81='\n',<NL>,3:31]
[#11,90:98='nonstring',<BLOB>,4:8]
[#12,100:113='nodoublequotes',<BLOB>,4:18]
[#13,114:114='\n',<NL>,4:32]
[#14,119:119='}',<'}'>,5:4]
[#15,120:120='\n',<NL>,5:5]
[#16,121:121='}',<'}'>,6:0]
[#17,122:122='\n',<NL>,6:1]
[#18,123:122='<EOF>',<EOF>,7:0]
This is all fine, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this?
As I expand the grammar I expect to have a lot of BLOB tokens simply because creating a narrower token in the lexer would be pointless because BLOB would match instead.
This is the classic keywords-as-identifiers problem. If you want a specific character sequence that is lexed as a keyword to also be usable as a normal identifier in certain places, you have to list that keyword as a possible alternative. For example:
property_element: (BLOB | K_WORKSPACE) property_value;
property_value : BLOB | STRING | K_WORKSPACE;
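To illustrate why the extra alternative is needed, here is a minimal sketch in plain Java, without ANTLR. The class, enum, and method names are hypothetical. The "lexer" step has no context and always turns the spelling workspace into K_WORKSPACE, so the "parser" step has to accept that token type wherever a property name is allowed.

```java
import java.util.Locale;

/** Hypothetical illustration of the keywords-as-identifiers problem; not generated by ANTLR. */
final class KeywordAsIdentifierDemo {
    enum Type { K_WORKSPACE, K_PROPERTIES, BLOB }

    // Lexer side: context-free and case-insensitive, so a keyword spelling always wins over BLOB.
    static Type classify(String word) {
        switch (word.toLowerCase(Locale.ROOT)) {
            case "workspace":  return Type.K_WORKSPACE;
            case "properties": return Type.K_PROPERTIES;
            default:           return Type.BLOB;
        }
    }

    // Parser side: mirrors (BLOB | K_WORKSPACE) from the property_element rule above.
    static boolean isPropertyName(Type type) {
        return type == Type.BLOB || type == Type.K_WORKSPACE;
    }

    public static void main(String[] args) {
        Type type = classify("workspace");
        // Prints: K_WORKSPACE accepted as property name: true
        System.out.println(type + " accepted as property name: " + isPropertyName(type));
    }
}
```

Every keyword that should remain usable as a name needs to be added to the alternative in the same way; K_PROPERTIES is deliberately left out here to match the rule shown above.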

How to deal with string-like values between xml tags in an xtext grammar

I am attempting to create a strongly defined xml language, but have run into trouble on element values between element tags. I want them to be treated like a string except they are not wrapped in quotes. Here is a basic grammar I created to demonstrate the idea:
grammar org.xtext.example.myxml.MyXml hidden(WS)
generate myXml "http://www.xtext.org/example/myxml/MyXml"
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
Element:
{Element}
'<Element' attributes+=ElementAttribute* ('/>' | '>'
subElement+=SubElement*
'</Element' '>')
;
SubElement:
{SubElement}
'<SubElement' attributes+=SubElementAttribute* ('/>' | '>'
value=ElementValue
'</SubElement' '>')
;
ElementAttribute:
NameAttribute | TypeAttribute
;
SubElementAttribute:
NameAttribute
;
TypeAttribute:
'type' '=' type=STRING
;
NameAttribute:
'name' '=' name=STRING
;
ElementValue hidden():
value=ID
;
terminal STRING:
'"' ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|'"') )* '"' |
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'"
;
terminal WS: (' '|'\t'|'\r'|'\n')+;
terminal ID: '^'?('a'..'z'|'A'..'Z'|'_'|'0'..'9'|':'|'-'|'('|')')*;
Here is a test to demonstrate its usage:
@Test
def void parseXML() {
val result = parseHelper.parse('''
<Element type="myType" name="myName">
<SubElement>some string:like-stuff here </SubElement>
</Element>
''')
Assert.assertNotNull(result)
val errors = result.eResource.errors
for (error : errors) {
println(error.message)
}
}
The error I get from this exact code is mismatched input 'string:like-stuff' expecting '</SubElement'
Obviously this will not work because ID does not allow white space. Adding white space to ID fixes the above error but causes other parsing issues. So my question is: how can I parse the element value into a string-like representation without causing ambiguity for the parser in other areas? The only way I have been able to get this to work in any form in my full language is by turning ElementValue into a list of IDs separated by white space. (I could not get that to work in this minimal example, however; I am not sure what is different.)
I would not really recommend it because Xtext is usually not the best fit for XML parsing, but it would probably be possible by turning ElementValue into a datatype rule that allows everything that doesn't create an ambiguity.
Something along the lines of:
ElementValue returns ecore::EString hidden(): (ID|WS|STRING|UNMATCHED)+ ;
and at the end of the grammar:
terminal UNMATCHED: .;
You will probably want to make SubElement.value optional to allow for an empty element.
value=ElementValue?

How do underivable rules affect parsing?

When writing an Xtext grammar for a simple SQL dialect, I found out that rules which cannot be derived from the start symbol apparently affect parsing.
E.g. given the following (very simplified) extract of my grammar which should be able to parse expressions like FROM table1;:
Start:
subquery ';';
subquery:
/*select=select_clause */tables=from_clause;
from_clause:
'FROM' tables;
tables:
tables+=table (',' tables+=table)*;
table:
name=table_name (alias=alias)?;
table_name:
prefix=qualified_name_prefix? name=qualified_name;
qualified_name_prefix:
ID'.';
qualified_name :
=>qualified_name_prefix? ID;
alias returns EString:
'AS'? alias=ID;
with_clause :
'WITH' elements+=with_list_element (',' elements+=with_list_element)*;
with_list_element :
name=ID (column_list_clause=column_list_clause)? 'AS' '(' subquery=subquery ')';
column_list_clause :
'(' names+=ID+ ')';
When trying to parse the string FROM table1;, I get the following error:
'no viable alternative at input ';'' on EString
If I remove rule with_clause, the error is gone and the string is parsed properly. How is this possible even though with_clause cannot be derived from Start?
The problem is that the predicate (=>) covers an ambiguity.
Maybe you can pull prefix and name together:
Table_name:
name=Qualified_name;
Qualified_name :
(ID '.' (ID '.')?)? ID;
Or you can try something like:
Table_name:
((prefix=ID ".")? =>name=Qualified_name);
Qualified_name :
=>(ID '.' ID) | ID;

ANTLR Grammar for parsing a text file

I'm going crazy trying to write a parser grammar with ANTLR.
I've got plain text file like:
Diagram : VW 503 FSX 09/02/2015 12/02/2015 STP
Fleet : AAAA
OFF :
AAA 05+44 5R06
KKK 05+55 06.04 1R06 5530
ZZZ 06.24 06.30 1R06 5530
YYY 07.53 REVRSE
YYY 08.23 9G98 5070
WORKS :
MILES :(LD) 1288.35 (ETY) 3.18 (TOT) 1291.53
Each "Diagram" entity is contained between "Diagram :" and the "(TOT)" line before EOF.
Multiple "Diagram" entities can be present in the same plain text file.
I've done some tests with ANTLR:
`grammar Hello2;
xxxt : diagram+;
diagram : DIAGRAM_ini txt fleet LEGS+ DIAGRAM_end;
txt : TEXT;
fleet : FLEET_INI txt;
num : NUMBER;
// Lexer Rules
DIAGRAM_ini : 'Diagram :';
DIAGRAM_end : '(TOT)' ;
LEGS : ('AAA' | 'KKK' | 'ZZZ' | 'YYY') ;
FLEET_INI : 'Fleet :';
TEXT : ('a'..'z')+ ;
NUMBER: ('0'..'9') ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> skip ;`
My Goal is to be able to parse Diagrams recursively, and gather all LEGS text/number.
Any help/tips is much more than appreciated!
Many Thanks
Regs
S.
I suggest not parsing the file the way you did. This file does not define a language with words and grammar, but rather formatted text made of characters:
The formatting conventions are rather weak.
The labels before the colon cannot serve as tokens, since they may reappear in the body (AAA (=label) vs. AAAA (=body)).
The tokens must be very primitive to fit these requirements.
Solution with ANTLR
You need a weaker grammar to solve this problem, e.g.
grammar diagrams;
diagrams : diagram+ ;
diagram : section+ ;
section : WORD ':' body? ;
body : textline+;
textline : (WORD | NUMBER | SIGNS)* ('\r' | '\n')+;
WORD : LETTER+ ;
NUMBER : DIGIT+ ;
SIGNS : SIGN+ ;
WHITESPACE : ( '\t' | ' ' )+ -> skip ;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment SIGN : ('.'|'+'|'('|')'|'/') ;
fragment DIGIT : ('0'..'9') ;
Run a visitor on the Parsing result
to build up the normalized text of body
to filter out the LEGS lines out of the body
to parse a LEGS line with another parser (a regexp-parser would be sufficient here, but you could also define another ANTLR-Parser)
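The last step could be sketched with java.util.regex as follows. This is only an illustration under assumptions: the station codes are copied from the LEGS lexer rule in the question, the class and method names are hypothetical, and since the meaning of the remaining columns is not stated, they are simply returned as whitespace-separated fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regexp-based parser for a single LEGS line. */
final class LegLineParser {
    // Station codes taken from the question's LEGS rule; the column layout after them is an assumption.
    private static final Pattern LEG =
            Pattern.compile("^\\s*(AAA|KKK|ZZZ|YYY)\\s+(\\S.*?)\\s*$");

    /** Returns the station code followed by the remaining fields, or an empty list for non-LEGS lines. */
    static List<String> parseLeg(String line) {
        Matcher m = LEG.matcher(line);
        if (!m.matches()) {
            return Collections.emptyList();
        }
        List<String> fields = new ArrayList<>();
        fields.add(m.group(1));                                  // station code
        fields.addAll(Arrays.asList(m.group(2).split("\\s+")));  // remaining columns
        return fields;
    }

    public static void main(String[] args) {
        // Prints: [KKK, 05+55, 06.04, 1R06, 5530]
        System.out.println(parseLeg("KKK 05+55 06.04 1R06 5530"));
    }
}
```

Because the code must be followed by whitespace, a body value such as AAAA in the Fleet section is not mistaken for a LEGS line.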
Another alternative:
Try out Packrat parsing (e.g. parboiled)
it is more comprehensible (especially for people with little experience in compiler construction)
it fits your grammar design better
parboiled is pure Java (the grammar is specified in Java)
Disadvantages:
Whitespace handling must be done in Parser Rules
Debugging/Error Messages are a problem (with all packrat parsers)
