I'm trying to make a grammar to parse easy schema in a structure like the prisma orm uses. After trying a lot of different things it still doesn't parse it properly and I am not sure where the issue lies.
This is the lexer I have written:
lexer grammar OrmLexer;
MODEL: 'model';
MODELNAME : LETTERS+;
OPEN : '{' NEWLINE?;
OPENMODEL : MODEL MODELNAME OPEN -> pushMode(MODELMODE) ;
WHITESPACE : ' ' -> skip ;
NEWLINE: '\r' '\n' | '\n' | '\r';
mode MODELMODE ;
FIELD: FIELDNAME FIELDTYPE NEWLINE;
FIELDNAME : LETTERS+;
FIELDTYPE : LETTERS+;
CLOSE: '}' -> popMode ;
fragment LETTERS : [a-zA-Z] ;
This is the parser I have written:
parser grammar OrmParser;
options { tokenVocab=OrmLexer; }
start: root | EOF ;
root: model*;
model : modelstart modelbody modelend;
modelbody: modelfield*;
modelstart : MODEL MODELNAME OPEN ;
modelfield : FIELDNAME FIELDTYPE NEWLINE ;
modelend: CLOSE ;
I am testing the grammar on an example schema which looks like:
model test {
id test
user test
}
In this schema the fields are not being parsed correctly, there should be two fields namely:
(FIELDNAME: id, FIELDTYPE: test) and (FIELDNAME: user, FIELDTYPE: test) however it doesn't parse those, this is the tree I get in result:
Besides that I am not really sure when to use fragments and when not in the lexer.
I hope anyone can help me in finding/solving the issue! Thanks in advance
I have tried modifying my lexer so such that: OPENMODEL : OPEN -> pushMode(MODELMODE) ; and tried many different small changes in both the parser and lexer, the attempt in above code was the one where I got the closest to the result I want to achieve, namely that the modelbody has the modelfields all separately parsed without any errors occurring.
I'm attempting to write an Antlr grammar for parsing the C4 DSL. However, the DSL has a number of places where the grammar is very open ended, resulting in overlapping lexer rules (in the sense that multiple token rules match).
For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:
workspace "Name" "Description" {
properties {
xyz "a string property"
nonstring nodoublequotes
}
}
The issue I'm running into is that the rules for the <name> and <value> have to be very broad, basically anything except whitespace. Also, properties with spaces with double quotes will match my STRING token.
My current solution is the grammar below, using property_element: BLOB | STRING; to match values and BLOB to match names. Is there a better way here? If I could make context sensitive lexer tokens I would make NAME and VALUE tokens instead. In the actual grammar I define case insensitive name tokens for thinks like workspace and properties. This allows me to easily match the existing DSL semantics, but raises the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.
grammar c4mce;
workspace : 'workspace' (STRING (STRING)?)? '{' NL workspace_body '}';
workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';
property_element: BLOB property_value;
property_value : BLOB | STRING;
BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;
This tokenizes to
[#0,0:8='workspace',<'workspace'>,1:0]
[#1,10:15='"Name"',<STRING>,1:10]
[#2,17:29='"Description"',<STRING>,1:17]
[#3,31:31='{',<'{'>,1:31]
[#4,32:32='\n',<NL>,1:32]
[#5,37:46='properties',<'properties'>,2:4]
[#6,48:48='{',<'{'>,2:15]
[#7,49:49='\n',<NL>,2:16]
[#8,58:60='xyz',<BLOB>,3:8]
[#9,62:80='"a string property"',<STRING>,3:12]
[#10,81:81='\n',<NL>,3:31]
[#11,90:98='nonstring',<BLOB>,4:8]
[#12,100:113='nodoublequotes',<BLOB>,4:18]
[#13,114:114='\n',<NL>,4:32]
[#14,119:119='}',<'}'>,5:4]
[#15,120:120='\n',<NL>,5:5]
[#16,121:121='}',<'}'>,6:0]
[#17,122:122='\n',<NL>,6:1]
[#18,123:122='<EOF>',<EOF>,7:0]
This is all fine, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this?
As I expand the grammar I expect to have a lot of BLOB tokens simply because creating a narrower token in the lexer would be pointless because BLOB would match instead.
This is the classic keywords-as-identifier problem. If you want that a specific char combination, which is lexed as keyword, can also be used as a normal identifier in certain places, then you have to list this keyword as possible alternative. For example:
property_element: (BLOB | K_WORKSPACE) property_value;
property_value : BLOB | STRING | K_WORKSPACE;
Trying a simple Grammar on antlr. it should parse inputs such as L=[1,2,hello].
However, antlr is producing this error: The following token definitions can never be matched because prior tokens match the same input: INT,STRING.Any Help?
grammar List;
decl: ID '=[' Inside1 ']'; // Declaration of a List. Example : L=[1,'hello']
Inside1: (INT|STRING) Inside2| ; // First element in the List. Could be nothing
Inside2:',' (INT|STRING) Inside2 | ; //
ID:('0'..'Z')+;
INT:('0'..'9')+;
STRING:('a'..'Z')+;
EDIT: The updated Grammar. The error remains with INT Only.
grammar List;
decl: STRING '=[' Inside1 ']'; // Declaration of a List. Example : L=[1,'hello']
Inside1: (INT|'"'STRING'"') Inside2| ; // First element in the List. Could be nothing
Inside2:',' (INT|'"'STRING'"') Inside2 | ; //
STRING:('A'..'Z')+;
INT:('0'..'9')+;
Your ID pattern matches everything that would be matched by INT or STRING, making them irrelevant. I don't think that's what you want.
ID shouldn't match tokens starting with a digit; 42 is not an identifier. And your comment implies that STRING is intended to be a string literal ('hello') but your lexical pattern makes no attempt to match '.
I am writing a grammar with a lot of case-insensitive keywords in ANTLR4. I collected some example files for the format, that I try to test parse and some use the same tokens which exist as keywords as identifiers in other places. For example there is a CORE keyword, which in other places is used as a ID for a structure from user input. Here some parts of my grammar:
fragment A : [aA]; // match either an 'a' or 'A'
fragment B : [bB];
fragment C : [cC];
[...]
CORE: C O R E ;
[...]
IDSTRING: [a-zA-Z_] [a-zA-Z0-9_]*;
id: IDSTRING ;
The error thrown then is line 7982:8 mismatched input 'core' expecting IDSTRING, as the user input is intended as IDSTRING, but always eaten by the keyword rule. In the input it exists both as keyword and as id like this:
MACRO oa12f01
CLASS CORE ; #here it is a KEYWORD
[...]
SITE core ; #here it is a ID
Is there a way I can let users use some keywords as identifiers by changing my grammar somehow like "casting" the token to IDSTRING for conjunctive rules like this or is this a false hope in not hand written parsers?
You can simply list the keywords that are allowed as identifiers as alternatives in the id rule:
id: IDSTRING | CORE | ... ;
I am new to Antlr and parsing, so this is a learning exercise for me.
I am trying to parse a language that allows free-format text in some locations. The free-format text may therefore be ANY word or words, including the keywords in the language - their location in the language's sentences defines them as keywords or free text.
In the following example, the first instance of "JOB" is a keyword; the second "JOB" is free-form text:
JOB=(JOB)
I have tried the following grammar, which avoids defining the language's keywords in lexer rules.
grammar Test;
test1 : 'JOB' EQ OPAREN (utext) CPAREN ;
utext : UNQUOTEDTEXT ;
COMMA : ',' ;
OPAREN : '(' ;
CPAREN : ')' ;
EQ : '=' ;
UNQUOTEDTEXT : ~[a-z,()\'\" \r\n\t]*? ;
SPC : [ \t]+ -> skip ;
I was hoping that by defining the keywords a string literals in the parser rules, as above, that they would apply only in the location in which they were defined. This appears not to be the case. On testing the "test1" rule (with the Antlr4 plug-in in IDEA), and using the above example phrase shown above - "JOB=(JOB)" (without quotes) - as input, I get the following error message:
line 1:5 mismatched input 'JOB' expecting UNQUOTEDTEXT
So after creating an implicit token for 'JOB', it looks like Antlr uses that token in other points in the parser grammar, too, i.e. whenever it sees the 'JOB' string. To test this, I added another parser rule:
test2 : 'DATA' EQ OPAREN (utext) CPAREN ;
and tested with "DATA=(JOB)"
I got the following error (similar to before):
line 1:6 mismatched input 'JOB' expecting UNQUOTEDTEXT
Is there any way to ask Antlr to enforce the token recognition in the locations only where it is defined/introduced?
Thanks!
What you have is essentially a Lake grammar, the opposite of an island grammar. A lake grammar is one in which you mostly have structured text and then lakes of stuff you don't care about. Generally the key is having some lexical Sentinel that says "enter unstructured text area" and then " reenter structured text area". In your case it seems to be (...). ANTLR has the notion of a lexical mode, which is what you want to to handle areas with different lexical structures. When you see a '(' you want to switch modes to some free-form area. When you see a ')' in that area you want to switch back to the default mode. Anyway "mode" is your key word here.
I had a similar problem with keywords that are sometimes only identifiers. I did it this way:
OnlySometimesAKeyword : 'value' ;
identifier
: Identifier // defined as usual
| maybeKeywords
;
maybeKeywords
: OnlySometimesAKeyword
// ...
;
In your parser rules simply use identifier instead of Identifier and you'll also be able to match the "maybe keywords". This will of course also match them in places where they will be keywords, but you could check this in the parser if necessary.