How can I define an ANTLR4 indentation block based grammar?

I am trying to define a language using ANTLR4 to generate its parser. While the language is actually a bit more complex, this is a tiny valid example of a file I want the parser to read, which triggers the problem I am trying to fix:
features  \\ Keyword which initializes the "features" block
    Server
        mandatory  \\ Relation word
            FileSystem
            OperatingSystem
        optional  \\ Relation word
            Logging
The features word simply starts the block, while mandatory and optional are relation words. The remaining words are just plain words (called features in this context). What I want is to make Server a child of the features block; then mandatory and optional both children of Server; and finally FileSystem and OperatingSystem children of mandatory, and Logging a child of optional.
The following grammar is my attempt to achieve this structure:
grammar MyGrammar;
tokens {
INDENT,
DEDENT
}
@lexer::header {
from antlr_denter.DenterHelper import DenterHelper
from UVLParser import UVLParser
}
@lexer::members {
class UVLDenter(DenterHelper):
    def __init__(self, lexer, nl_token, indent_token, dedent_token, ignore_eof):
        super().__init__(nl_token, indent_token, dedent_token, ignore_eof)
        self.lexer: UVLLexer = lexer

    def pull_token(self):
        return super(UVLLexer, self.lexer).nextToken()

denter = None

def nextToken(self):
    if not self.denter:
        self.denter = self.UVLDenter(self, self.NL, UVLParser.INDENT, UVLParser.DEDENT, True)
    return self.denter.next_token()
}
// parser rules
feature_model: features?;
features: 'features' INDENT child;
child: feature_spec INDENT relation* DEDENT;
relation: relation_spec INDENT child* DEDENT;
feature_spec: WORD ('.' WORD)*;
relation_spec: RELATION_WORD;
//lexer rules
RELATION_WORD: ('alternative' | 'or' | 'optional' | 'mandatory');
WORD: [a-zA-Z][a-zA-Z0-9_]*;
WS: [ \n\r]+ -> skip;
NL: ('\r'? '\n' '\t');
I am using antlr-denter in order to manage indent and dedent.
Then, I am defining RELATION_WORD and WORD separately in the lexer.
Finally, the parser rules attempt to construct the structure I described before. I want the features word to be followed by a single child. Then, any child is a feature spec followed by any number of relations between an INDENT and a DEDENT. The same goes for relations: a relation spec followed by a similar set of children, with this nesting repeating indefinitely.
However, I can't manage to make the parser read this structure correctly. With the previous example as input, I am getting mandatory as a child of Server, but not optional. Changing the example to this one:
features
    Server
        mandatory
        optional
            Logging
Both mandatory and optional are interpreted as children of mandatory. It must have something to do with how INDENT and DEDENT are interpreted to delimit blocks, but I have been unable to find a solution so far.
Any ideas to fix this would be very welcome. Thanks in advance!

Try changing your child and relation rules as follows:
child: feature_spec (INDENT relation* DEDENT)?;
relation: relation_spec (INDENT child* DEDENT)?;
Explanation:
As @Kaby76 mentions, it's quite helpful to print out the token stream to understand how your parser sees the stream of tokens.
I've not used antlr-denter before, but given the way it plugs in, it would appear that you're not going to get a token stream just by using the grun tool.
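In the Python target, though, you can dump the token stream with a few lines of runtime code instead of grun. A minimal sketch, assuming the generated lexer is named UVLLexer as in the question and the input file is named input.uvl:

from antlr4 import FileStream, CommonTokenStream
from UVLLexer import UVLLexer  # generated lexer, with the denter wired in

lexer = UVLLexer(FileStream("input.uvl"))
stream = CommonTokenStream(lexer)
stream.fill()              # force the whole input through the lexer
for tok in stream.tokens:  # prints [@n,start:stop='text',<type>,line:col] per token
    print(tok)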
As a substitute, I tried just making up INDENT and OUTDENT Tokens (I used -> and <-, respectively)
revised grammar:
grammar MyGrammar;
// parser rules
feature_model: features?;
features: 'features' INDENT child;
child: feature_spec INDENT relation* DEDENT;
relation: relation_spec INDENT child* DEDENT;
feature_spec: WORD ('.' WORD)*;
relation_spec: RELATION_WORD;
//lexer rules
RELATION_WORD: ('alternative' | 'or' | 'optional' | 'mandatory');
WORD: [a-zA-Z][a-zA-Z0-9_]*;
WS: [ \n\r]+ -> skip;
// Temporary
//NL: ('\r'? '\n' '\t');
NL: ('\r'? '\n' '\t') -> skip;
INDENT: '->';
DEDENT: '<-';
And revised the input file to use the explicit tokens:
features
   ->Server
       ->mandatory
       optional
           ->Logging
Just making this change, you'll notice that there are no <- tokens in your sample.
But, now I can dump the token stream:
➜ grun MyGrammar tokens -tokens < MGIn.txt
[#0,0:7='features',<'features'>,1:0]
[#1,12:13='->',<'->'>,2:3]
[#2,14:19='Server',<WORD>,2:5]
[#3,28:29='->',<'->'>,3:7]
[#4,30:38='mandatory',<RELATION_WORD>,3:9]
[#5,47:48='->',<'->'>,4:7]
[#6,49:56='optional',<RELATION_WORD>,4:9]
[#7,69:70='->',<'->'>,5:11]
[#8,71:77='Logging',<WORD>,5:13]
[#9,78:77='<EOF>',<EOF>,5:20]
Now let's try parsing:
➜ grun MyGrammar feature_model -tree < MGIn.txt
line 4:9 mismatched input 'optional' expecting {WORD, '<-'}
line 5:20 mismatched input '<EOF>' expecting {'.', '->'}
(feature_model (features features -> (child (feature_spec Server) -> (relation (relation_spec mandatory) ->) (relation (relation_spec optional) -> (child (feature_spec Logging))) <missing '<-'>)))
So, your grammar calls for 'mandatory' (as a RELATION_WORD) to be followed by an INDENT as well as a DEDENT (which isn't present). This makes sense, as they don't have any children. So it seems that the INDENT/DEDENT pair needs to be tied to whether there are any children:
Let's change that:
child: feature_spec (INDENT relation* DEDENT)?;
relation: relation_spec (INDENT child* DEDENT)?;
Try again:
➜ grun MyGrammar feature_model -tree < MGIn.txt
line 5:20 extraneous input '<EOF>' expecting {WORD, '<-'}
(feature_model (features features -> (child (feature_spec Server) -> (relation (relation_spec mandatory)) (relation (relation_spec optional) -> (child (feature_spec Logging))) <missing '<-'>)))
Now we're missing a <- (OUTDENT) at EOF. The solution to this depends on whether antlr-denter closes all the open INDENTs at <EOF>.
Assuming it does, my fake input should look something like:
features
   ->Server
       ->mandatory
       optional
           ->Logging
<-
<-
<-
and we try again:
grun MyGrammar feature_model -gui < MGIn.txt
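Once the real denter is producing INDENT/DEDENT from actual indentation again, the same end-to-end check can be run from Python. A minimal driver sketch, assuming the generated UVLLexer/UVLParser classes from the question and an input file named input.uvl:

from antlr4 import FileStream, CommonTokenStream
from UVLLexer import UVLLexer
from UVLParser import UVLParser

lexer = UVLLexer(FileStream("input.uvl"))
parser = UVLParser(CommonTokenStream(lexer))
tree = parser.feature_model()           # start rule from the grammar
print(tree.toStringTree(recog=parser))  # Lisp-style tree, like grun -tree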

Related

What is the best way to handle overlapping lexer patterns that are sensitive to context?

I'm attempting to write an Antlr grammar for parsing the C4 DSL. However, the DSL has a number of places where the grammar is very open ended, resulting in overlapping lexer rules (in the sense that multiple token rules match).
For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:
workspace "Name" "Description" {
properties {
xyz "a string property"
nonstring nodoublequotes
}
}
The issue I'm running into is that the rules for the <name> and <value> have to be very broad: basically, anything except whitespace. Also, property values with spaces are double-quoted, so they will match my STRING token.
My current solution is the grammar below, using property_value: BLOB | STRING; to match values and BLOB to match names. Is there a better way here? If I could make context-sensitive lexer tokens, I would make NAME and VALUE tokens instead. In the actual grammar I define case-insensitive name tokens for things like workspace and properties. This allows me to easily match the existing DSL semantics, but raises the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.
grammar c4mce;
workspace : 'workspace' (STRING (STRING)?)? '{' NL workspace_body '}';
workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';
property_element: BLOB property_value;
property_value : BLOB | STRING;
BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;
This tokenizes to
[#0,0:8='workspace',<'workspace'>,1:0]
[#1,10:15='"Name"',<STRING>,1:10]
[#2,17:29='"Description"',<STRING>,1:17]
[#3,31:31='{',<'{'>,1:31]
[#4,32:32='\n',<NL>,1:32]
[#5,37:46='properties',<'properties'>,2:4]
[#6,48:48='{',<'{'>,2:15]
[#7,49:49='\n',<NL>,2:16]
[#8,58:60='xyz',<BLOB>,3:8]
[#9,62:80='"a string property"',<STRING>,3:12]
[#10,81:81='\n',<NL>,3:31]
[#11,90:98='nonstring',<BLOB>,4:8]
[#12,100:113='nodoublequotes',<BLOB>,4:18]
[#13,114:114='\n',<NL>,4:32]
[#14,119:119='}',<'}'>,5:4]
[#15,120:120='\n',<NL>,5:5]
[#16,121:121='}',<'}'>,6:0]
[#17,122:122='\n',<NL>,6:1]
[#18,123:122='<EOF>',<EOF>,7:0]
This is all fine, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this?
As I expand the grammar, I expect to end up with a lot of BLOB tokens, simply because creating a narrower token in the lexer would be pointless: BLOB would match instead.
This is the classic keywords-as-identifiers problem. If you want a specific character combination that is lexed as a keyword to also be usable as a normal identifier in certain places, then you have to list that keyword as a possible alternative. For example:
property_element: (BLOB | K_WORKSPACE) property_value;
property_value : BLOB | STRING | K_WORKSPACE;
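The underlying point is that the lexer assigns token types without any parser context, so the parser has to accept the keyword tokens wherever a plain name is legal. A toy illustration in plain Python (KEYWORDS, lex and is_valid_property_name are made-up names for this sketch, not part of any ANTLR API):

KEYWORDS = {"workspace", "properties"}

def lex(word):
    # Mimics the lexer: a keyword always wins over BLOB.
    return "K_" + word.upper() if word in KEYWORDS else "BLOB"

def is_valid_property_name(token_type):
    # Mimics the fixed parser rule: BLOB or any keyword token is accepted.
    return token_type == "BLOB" or token_type.startswith("K_")

print(lex("workspace"))                          # K_WORKSPACE
print(is_valid_property_name(lex("workspace")))  # True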

ANTLR4 Assembler Language Parser - issues - miscellaneous comments

I am trying to write a parser for the IBM Assembler Language; an example is below.
Comment lines start with a star (*) as the first character; however, there are two problems:
Beyond a set point in the line there can also be descriptive text, but no star is necessary there.
The descriptive text can (and does) contain lexer tokens, such as ENTRY or INPUT.
* TYPE.
ARG     DSECT
NXENT   DS    F        some comment text ENTRY NUMBER
NMADR   DS    F        some comment text INPUT NAME
NAADR   DS    F        some comment text
NATYP   DS    F        some comment text
NAENT   DS    F        some comment text
        ORG   NATYP    some comment text
In my lexer I have devised the following, which works absolutely fine:
fragment CommentLine: Star {getCharPositionInLine() == 1}? .*? Nl
;
fragment Star: '*';
fragment Nl: '\r'? '\n' ;
COMMENT_LINE
: CommentLine -> channel (COMMENT)
;
My question is how do I manage the line comments starting at a particular char position in the parser grammar? I.e. Parser -> NAME DS INT? LETTER ??????????
Sending comments to a COMMENT channel (or -> skipping them) is a technique used to avoid having to define all the places comments are valid in your parser rules.
(Old 360+ Assembler programmer here)
Since there aren't really ways to place arbitrarily positioned comments in Assembler source, you don't really need to deal with shunting them off to the side. Actually, because of the way these trailing comments are handled in assembler source, there's just NOT a way to identify them in a lexer rule.
Since it can be a parser rule, you could set up a rule like:
trailingComment: (ID | STRING | NUMBER)* EOL;
where ID, STRING, NUMBER, etc. are just the tokens in your lexer. (You'd need to include pretty much all of them... a good argument for not defining tokens for MVC, CLC, CLI, and all the other op codes; that way lies madness.) And of course EOL is your rule to match the end of a line (probably '\r'? '\n').
You would then end each of your rules for parsing a line that can contain a trailing comment (pretty much all of them) with the trailingComment rule.

Is there a way to insert phases between the lexer and parser in ANTLR

I am writing a lexer/parser for a language that allows abbreviations (and globs) for its keywords. And, I am trying to determine the best way to do it.
One thought that occurs to me is to insert a phase between the lexer and the parser, where the lexer recognizes the general class (e.g., is this a "command name" or is it an "option"?) and then passes those general tokens to a second phase, which does further analysis, recognizes which command name it is, and passes that on as the token type to the parser.
It will make the parser simple. I will only have to deal with well-formed command names. What every token means will be clear.
It will keep the lexer simple. It will only have to divide things into classes: this is a simple name; this is a glob; this is an option name (starts with a dash).
The phase in the middle will also be relatively simple. The simple name (and option) forms will only have to deal with strings. The glob form can use standard glob techniques to match the glob against the legal candidates, which are in the tables for the simple names and options.
The question is how to insert that phase into ANTLR, so that I call the lexer, it creates tokens, the intermediate phase massages them, and then the parser gets the tokens the intermediate phase has categorized.
Is there a known solution for this?
Something like:
lexer grammar simple
letter: [A-Z][a-z];
digit: [0-9];
glob-char: [*?];
name: letter (letter | digit)*;
option: '-'name;
glob: (glob-char|letter)(glob-char|letter|digit)*;
glob-option: '-'glob;
filter grammar name;
end: 'e' | 'end';
generate: 'ge' | 'generate';
goto: 'go' | 'goto';
help: 'h' | 'help';
if: 'i' | 'if';
then: 't' | 'then';
parser grammar simple;
The user (a programmer writing in the language I am parsing) needs to be able to write
g*te and have it match generate.
The phase between the lexer and the parser, when it sees a glob, needs to look at the glob (and the list of keywords) and see whether exactly one of them matches the glob; if so, it returns that keyword. The stuff I listed in the "filter grammar" is what builds the list of keywords that globs can match. I have found code on the web that matches globs against a list of names; that part isn't hard.
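For reference, that matching step is indeed small. A sketch using Python's standard fnmatch module, with the keyword list taken from the filter grammar above:

import fnmatch

KEYWORDS = ["end", "generate", "goto", "help", "if", "then"]

def resolve_glob(glob):
    # Return the keyword if the glob matches exactly one of them, else None.
    matches = fnmatch.filter(KEYWORDS, glob)
    return matches[0] if len(matches) == 1 else None

print(resolve_glob("g*te"))  # generate
print(resolve_glob("g*"))    # None: ambiguous (generate, goto)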
And, I've since found in the ANTLR doc how to run arbitrary code on matching a token and how to change the resulting token's type. (See my answer.)
It looks like you can use lexerCustomActions to achieve the desired effect. Something like the following.
in your lexer:
GLOB: [-A-Za-z0-9_.]* '*' [-A-Za-z0-9_.*]* { setType(lexGlob(getText())); };
in your Java (or whatever target language you are using) code:
int lexGlob(String origText) {
    return xyzzy; // some code that computes the right kind of token type
}
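An alternative to rewriting the type inside the lexer action is a thin filter between the lexer and the token stream, which is closer to the "phase in the middle" described in the question. A sketch in the Python target; ReclassifyingSource and the reclassify hook are made-up names, and only the nextToken() part of the token-source interface is shown:

from antlr4 import CommonTokenStream, InputStream, Token

class ReclassifyingSource:
    # Wraps a lexer; anything with a nextToken() method can feed a token stream.
    def __init__(self, lexer, reclassify):
        self.lexer = lexer
        self.reclassify = reclassify  # function: token -> new token type

    def nextToken(self):
        token = self.lexer.nextToken()
        if token.type != Token.EOF:
            token.type = self.reclassify(token)  # rewrite the type in place
        return token

# Usage sketch, assuming a generated MyLexer/MyParser pair:
# lexer = MyLexer(InputStream(text))
# tokens = CommonTokenStream(ReclassifyingSource(lexer, my_reclassify))
# parser = MyParser(tokens)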

Make lexer consider parser before determining tokens?

I'm writing a lexer and parser in ocamllex and ocamlyacc as follows. function_name and table_name are the same regular expression, i.e., a string containing only English alphabetic characters. The only way to determine whether a string is a function_name or a table_name is to check its surroundings. For example, if such a string is surrounded by [ and ], then we know that it is a table_name. Here is the current code:
In lexer.mll,
... ...
let function_name = ['a'-'z' 'A'-'Z']+
let table_name = ['a'-'z' 'A'-'Z']+
rule token = parse
| function_name as s { FUNCTIONNAME s }
| table_name as s { TABLENAME s }
... ...
In parser.mly:
... ...
main:
| LBRACKET TABLENAME RBRACKET { Table $2 }
... ...
As I wrote | function_name as s { FUNCTIONNAME s } before | table_name as s { TABLENAME s }, the above code fails to parse [haha]: it first considers haha to be a function_name in the lexer, and then cannot find any corresponding rule for it in the parser. If it considered haha to be a table_name in the lexer instead, it would match [haha] as a table in the parser.
One workaround for this is to be more precise in the lexer. For example, we could define let table_name_with_brackets = '[' ['a'-'z' 'A'-'Z']+ ']' and | table_name_with_brackets as s { TABLENAMEWITHBRACKETS s } in the lexer. But I would like to know whether there are other options. Is it not possible to make the lexer and parser work together to determine the tokens and the reduction?
You should avoid trying to get the lexer to do the parser's work. The lexer should just identify lexemes; it should not try to figure out where a lexeme fits into the syntax. So in your (simplified) example, there should be only one lexical type, name. The parser will figure it out from there.
But it seems, from the comments, that in the unsimplified original, the two patterns are overlapping rather than identical. That's more annoying, although it's only slightly more complicated. Basically, you need to separate out the common pattern as one lexical type, and then add the additional matches as one or two other lexical types (depending on whether or not one pattern is a strict superset of the other).
That might not be too difficult, depending on the precise relationship between the two patterns. You might be able to find a very simple solution by writing the patterns in the correct order, for example, because of the longest match rule:
If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.
Most of the time, that's all it takes: first define the intersection of the two patterns as a base lexeme, and then add the full lexical patterns of each contextual type to provide additional matches. Your parser will then have to match name | function_name in one context and name | table_name in the other context. But that's not too bad.
Where it will fail is when an input stream cannot be unambiguously divided in lexemes. For example, suppose that in a function context, a name could include a ? character, but in a table context the ? is a valid postscript operator. In that case, you have to actively prevent foo? from being analysed as a single token in the table context, which means that the lexer does have to be aware of parser context.

Grammar for a recognizer of a spice-like language

I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master's thesis in which the student carries out the syntactic analysis of the SPICE 2G6 language and builds a parser using the LEX and YACC compiler-writing tools (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02). In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends the LEX and YACC code files of the parser to the work.
I'm basing myself on this work to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve the two problems that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the set matched by the token NODE is a subset of the set matched by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1\n", but cannot interpret an input like "R1 5 0 1\n", because it maps "1" to the token NODE instead of to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone have an idea of how I can map the "1" to the correct token VALUE, or a suggestion for how I could alter the grammar so that it correctly interprets the input?
The second problem is the presence of a comment at the end of a line of code, because the NEWLINE token delimits both (1) the end of a comment and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary for the parser to correctly recognize the line of code; otherwise, just one newline character is needed. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as many characters as possible. When two token rules match the same number of characters, the rule defined first "wins". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
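To see why, here is a toy maximal-munch tokenizer in plain Python mirroring those two rules (longest match wins; ties go to the earlier rule); the names and patterns are made up for the illustration:

import re

# Earlier rules win ties, mirroring the order NODE, VALUE in the grammar.
RULES = [("NODE", r"[0-9]+"),
         ("VALUE", r"([0-9]+\.[0-9]*|\.[0-9]+|[0-9]+)([Ee][+-]?[0-9]+)?")]

def next_token(text):
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("1.1"))  # ('VALUE', '1.1'): VALUE matches more characters
print(next_token("1"))    # ('NODE', '1'): a tie, and NODE is defined first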
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
That is, I removed VALUE, added the parser rule value, and removed fragment from REAL.
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.
