I'm trying to make a grammar to parse easy schema in a structure like the prisma orm uses. After trying a lot of different things it still doesn't parse it properly and I am not sure where the issue lies.
This is the lexer I have written:
lexer grammar OrmLexer;
MODEL: 'model';
MODELNAME : LETTERS+;
OPEN : '{' NEWLINE?;
OPENMODEL : MODEL MODELNAME OPEN -> pushMode(MODELMODE) ;
WHITESPACE : ' ' -> skip ;
NEWLINE: '\r' '\n' | '\n' | '\r';
mode MODELMODE ;
FIELD: FIELDNAME FIELDTYPE NEWLINE;
FIELDNAME : LETTERS+;
FIELDTYPE : LETTERS+;
CLOSE: '}' -> popMode ;
fragment LETTERS : [a-zA-Z] ;
This is the parser I have written:
parser grammar OrmParser;
options { tokenVocab=OrmLexer; }
start: root | EOF ;
root: model*;
model : modelstart modelbody modelend;
modelbody: modelfield*;
modelstart : MODEL MODELNAME OPEN ;
modelfield : FIELDNAME FIELDTYPE NEWLINE ;
modelend: CLOSE ;
I am testing the grammar on an example schema which looks like:
model test {
id test
user test
}
In this schema the fields are not being parsed correctly, there should be two fields namely:
(FIELDNAME: id, FIELDTYPE: test) and (FIELDNAME: user, FIELDTYPE: test) however it doesn't parse those, this is the tree I get in result:
Besides that I am not really sure when to use fragments and when not in the lexer.
I hope anyone can help me in finding/solving the issue! Thanks in advance
I have tried modifying my lexer so such that: OPENMODEL : OPEN -> pushMode(MODELMODE) ; and tried many different small changes in both the parser and lexer, the attempt in above code was the one where I got the closest to the result I want to achieve, namely that the modelbody has the modelfields all separately parsed without any errors occurring.
I several projects I have run into a similar effect in my grammars.
I have the need to parse something like Key="Value"
So I create a grammar (simplest I could make to show the effect):
grammar test;
KEY : [a-zA-Z0-9]+ ;
VALUE : DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE ;
DOUBLEQUOTE : '"' ;
EQUALS : '=' ;
entry : key=KEY EQUALS value=VALUE;
I can now parse thing="One Two Three" and in my code I receive
key = thing
value = "One Two Three"
In all of my projects I end up with an extra step to strip those " from the value.
Usually something like this (I use Java)
String value = ctx.value.getText();
value = value.substring(1, value.length()-1);
In my real grammars I find it very hard to move the check of the surrounding " into the parser.
Is there a clean way to already drop the " by doing something in the lexer/parser?
Essentially I want ctx.value.getText() to return One Two Three instead of "One Two Three".
Update:
I have been playing with the excellent answer provided by Bart Kiers and found this variation which does exactly what I was looking for.
By putting the DOUBLEQUOTE on a hidden channel they are used by the lexer and hidden from the parser.
TestLexer.g4
lexer grammar TestLexer;
KEY : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> channel(HIDDEN), pushMode(STRING_MODE);
EQUALS : '=';
mode STRING_MODE;
STRING_DOUBLEQUOTE
: '"' -> channel(HIDDEN), type(DOUBLEQUOTE), popMode
;
STRING
: [ _a-zA-Z0-9.-]+
;
and
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
entry : key=KEY EQUALS value=STRING ;
Try this:
VALUE
: DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE
{setText(getText().substring(1, getText().length()-1));}
;
Needless to say: this ties your grammar to Java, and (depending how many embedded Java code you have) your grammar will be hard to port to some other target language.
EDIT
Once a token is created, there is no built-in way to separate it (other than doing so in embedded actions, as I demonstrated). What you're looking for can be done, but that means rewriting your grammar so that a string literal is not constructed as a single token. This can be done by using lexical modes so that the string can be constructed in the parser.
A quick demo:
TestLexer.g4
lexer grammar TestLexer;
KEY : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> pushMode(STRING_MODE);
EQUALS : '=';
mode STRING_MODE;
STRING_DOUBLEQUOTE
: '"' -> type(DOUBLEQUOTE), popMode
;
STRING_ATOM
: [ _a-zA-Z0-9.-]
;
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
entry : key=KEY EQUALS value;
value : DOUBLEQUOTE string_atoms DOUBLEQUOTE;
string_atoms : STRING_ATOM*;
If you now run the Java code:
Lexer lexer = new TestLexer(CharStreams.fromString("Key=\"One Two Three\""));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
TestParser.EntryContext entry = parser.entry();
System.out.println(entry.value().string_atoms().getText());
this will be printed:
One Two Three
I'm attempting to write an Antlr grammar for parsing the C4 DSL. However, the DSL has a number of places where the grammar is very open ended, resulting in overlapping lexer rules (in the sense that multiple token rules match).
For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:
workspace "Name" "Description" {
properties {
xyz "a string property"
nonstring nodoublequotes
}
}
The issue I'm running into is that the rules for the <name> and <value> have to be very broad, basically anything except whitespace. Also, properties with spaces with double quotes will match my STRING token.
My current solution is the grammar below, using property_element: BLOB | STRING; to match values and BLOB to match names. Is there a better way here? If I could make context sensitive lexer tokens I would make NAME and VALUE tokens instead. In the actual grammar I define case insensitive name tokens for thinks like workspace and properties. This allows me to easily match the existing DSL semantics, but raises the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.
grammar c4mce;
workspace : 'workspace' (STRING (STRING)?)? '{' NL workspace_body '}';
workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';
property_element: BLOB property_value;
property_value : BLOB | STRING;
BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;
This tokenizes to
[#0,0:8='workspace',<'workspace'>,1:0]
[#1,10:15='"Name"',<STRING>,1:10]
[#2,17:29='"Description"',<STRING>,1:17]
[#3,31:31='{',<'{'>,1:31]
[#4,32:32='\n',<NL>,1:32]
[#5,37:46='properties',<'properties'>,2:4]
[#6,48:48='{',<'{'>,2:15]
[#7,49:49='\n',<NL>,2:16]
[#8,58:60='xyz',<BLOB>,3:8]
[#9,62:80='"a string property"',<STRING>,3:12]
[#10,81:81='\n',<NL>,3:31]
[#11,90:98='nonstring',<BLOB>,4:8]
[#12,100:113='nodoublequotes',<BLOB>,4:18]
[#13,114:114='\n',<NL>,4:32]
[#14,119:119='}',<'}'>,5:4]
[#15,120:120='\n',<NL>,5:5]
[#16,121:121='}',<'}'>,6:0]
[#17,122:122='\n',<NL>,6:1]
[#18,123:122='<EOF>',<EOF>,7:0]
This is all fine, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this?
As I expand the grammar I expect to have a lot of BLOB tokens simply because creating a narrower token in the lexer would be pointless because BLOB would match instead.
This is the classic keywords-as-identifier problem. If you want that a specific char combination, which is lexed as keyword, can also be used as a normal identifier in certain places, then you have to list this keyword as possible alternative. For example:
property_element: (BLOB | K_WORKSPACE) property_value;
property_value : BLOB | STRING | K_WORKSPACE;
I am parsing a language which has some difficult syntax that I need some help or suggestions to tackle it.
Heres an a typical line =>
IF CLCI((ZNTEM+CHRCNT),1,1H())) EQ 0
The difficult bit is nH(....any character within the character set.....) where n is 1 in this example and the single char in question is a ')' . My lexer has:
fragment Lp: '(';
fragment Rp: ')';
LP: Lp;
RP: Rp; etc....
My current non-working solution is to switch modes in the lexer because then I can then define all the special chars to consume
// Default mode rules
STRING_SUB: INT 'H' LP -> pushMode(ISLAND) ; // switch to ISLAND mode
and then to switch back
// Special mode of INT H ( ID )
// ID is the string substitute which can includes, spaces, backslash, etc, special chars
mode ISLAND;
ISLAND_CLOSE : RP -> popMode ; // back to nomal mode
ID : SpecialChars+; // match/send ID in tag to parser
fragment SpecialChars: '\u0020'..'\u0027' | '\u002A'..'\u0060' | '\u0061'..'\u007E' | '¦';
But obviously the pop mode trigger is the ')' which fails in the particular example case, because the payload is a RP. Any suggestions?
I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing myself in this work to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1/n", but cannot interpret an input like "R1 5 0 1/n", because it maps "1" to the token NODE, instead of mapping it to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone has an idea of how can I map the "1" to the correct token VALUE, or a suggestion of how can I alter the grammar so that I can correctly interpret the input?
The second problem is the presence of a comment at the end of a line. Because the NEWLINE token delimits: (1) the end of a comment; and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary to the parser correctly recognize the line of code, otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as much characters as possible. In case two tokens match the same amount of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
E.g., I removed VALUE, added value and removed fragment from REAL
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.