ANTLR4 Assembler Language Parser - issues - miscellaneous comments - parsing

I am trying to write a parser for the IBM Assembler Language, Example below.
Comment lines start with a star* at the first character, however there are 2 problems
Beyond a set point in the line there can also be descriptive text, but there is no star* neccessary.
The descriptive can/does contain lexer tokens, such as ENTRY or INPUT.....
* TYPE.
ARG DSECT
NXENT DS F some comment text ENTRY NUMBER
NMADR DS F some comment text INPUT NAME
NAADR DS F some comment text
NATYP DS F some comment text
NAENT DS F some comment text
ORG NATYP some comment text
In my lexer I have devised the following, which works absolutley fine:
fragment CommentLine: Star {getCharPositionInLine() == 1}? .*? Nl
;
fragment Star: '*';
fragment Nl: '\r'? '\n' ;
COMMENT_LINE
: CommentLine -> channel (COMMENT)
;
My question is how do I manage the line comments starting at a particular char position in the parser grammer? I.e. Parser -> NAME DS INT? LETTER ??????????

Sending comments to a COMMENT channel (or -> skiping them) is a technique used to avoid having to define all the places comments are valid in your parser rules.
(Old 360+ Assembler programmer here)
Since there are not really ways to place arbitrarily positioned comments in Assembler source, you don't really need to deal with shunting them off to the side. Actually because of the way comments are handled in assembler source, there's just NOT a way to identify them in a Lexer rule.
Since it can be a parser rule, you could set up a rule like:
trailingComment: (ID | STRING | NUMBER)* EOL;
where ID, STRING, NUMBER, etc. are just the tokens in your lexer (You'd need to include pretty much all of them... a good argument, for not getting down to tokens for MVC, CLC, CLI, (all the op codes... the path to madness). And of course EOL is your rule to match end of line (probably '\r?\n')
You would then end each of your rules for parsing a line that can contain a trailing comment (pretty much all of them) with the trailingComment rule.

Related

Problem with parsing file ending on a newline

It seems a bit like a trivial question, but I am stuck on parsing the end of file EOF using my own island grammar. I am using the new VScode extension btw.
I've mostly been using the examples from the basic recipes and have a simple grammar with the following layout rules:
layout Whitespace = [\t-\n\r\ ]*;
lexical IntegerLiteral = [0-9]+ !>> [0-9];
lexical Comment = "%%" ![\n]* $;
Using this, and some rules it parses some simple files, but will give a parse error anytime a file ends in a newline. (newlines in between lines are no problem).
Am is missing something obvious?
Thanks!
It sounds a bit like your grammar is missing a start nonterminal. All grammar rules get whitespace in between their constituent symbols but not at the start or the end.
A start nonterminal is the exception:
start syntax Islands = Island+;
Islands parseIslands(loc input)
= parse(#start[Islands], input).top;
Passing the start nonterminal to parse will allow the file to start and end with whitespace, and using the .top field you can ignore that whitespace from the parse tree again by projecting out the middle Islands tree.
Island grammars tend to be a complex beast, so without sharing the full grammar and input string, it might be a bit hard to answer this question. But I'll share some generic feedback.
he layout production might be ambiguous, if any other part of your language has optional parts. Rascal's parsing is non-greedy. So if you have:
lexical A = "a";
lexical B = "b";
lexical C = "c";
syntax A = A? B? C;
After fusing in the layouts, this becomes:
A` = A? Whitespace? B? Whitespace? C;
Now since whitespace is not eating all characters, the grammar is ambigous, as the parser can "bind" a whitespace between the A and B, or between the B and C. So in most cases, you want to make sure it's a greedy match by adding a follow restriction:
layout Whitespace = [\t-\n \r \ ]* !>> [\t-\n \r \ ];
Also, I fixed a bug, the layout definition didn't include a space as valid whitespace. Rascal allows for spaces in the character class (for readability), so in case we need to add a space, you have to say \ .
For the rest, it looks okay, but like I started with, island grammars are a bit harder to debug without both the full syntax, and what you want to have as water and what as island.

ANTLR Tries to Match an Expression That Wasn't Specified as an Option

I'm trying to understand how ANTLR grammars work and I've come across a situation where it behaves unexpectedly and I can't explain why or figure out how to fix it.
Here's the example:
root : title '\n' fields EOF;
title : STR;
fields : field_1 field_2;
field_1 : 'a' | 'b' | 'c';
field_2 : 'd' | 'e' | 'f';
STR : [a-z]+;
There are two parts:
A title that is a lowercase string with no special characters
A two character string representing a set of possible configurations
When I go to test the grammar, here's what happens: first I write the title and, on a new line, give the character for the first field. So far so good. The parse tree looks as I would expect up to this point.
When I add the next field is when the problem comes up. ANTLR decides to reinterpret the line as an instance of STR instead of a concatenation of the fields that I was expecting.
I do not understand why ANTLR tries to force an unrelated terminal expression when it wasn't specified as an option by the grammar. Shouldn't it know to only look for characters matching the field rules since it is descended from the fields node in the parse tree? What's going on here and how do I write my ANTLR grammars so they don't have this problem?
I've read that ANTLR tries to match the format greedily from the top of the grammar to the bottom, but this doesn't explain why this is happening because the STR terminal is the very last line in the file. If ANTLR gives special precedence to matching terminals, how do I format the grammar so that it interprets it properly? As far as I understand, regexes do not work for non-terminals so it seems that have to define it how it is now.
A note of clarification: this is just an example of a possible grammar that I'm trying to make work with the text format as is, so I'm not looking for answers like adding a space between the fields or changing the title to be uppercase.
What I didn't understand before is that there are two steps in generating a parser:
Scanning the input for a list of tokens using the lexer rules (uppercase statements) and then...
Constructing a parse tree using the parser rules (lowercase statements) and generated tokens
My problem was that there was no way for ANTLR to know I wanted that specific string to be interpreted differently when it was generating the tokens. To fix this problem, I wrote a new lexer rule for the fields string so that it would be identifiable as a token. The key was making the FIELDS rule appear before the STR rule because ANTLR checks them in the order they appear.
root : title FIELDS EOF;
title : STR;
FIELDS : [a-c] [d-f];
STR : [a-z]+;
Note: I had to bite the bullet and read the ANTLR Mega Tutorial to figure this out.

Lexer/Parser design for data file

I am writing a small program which needs to preprocess some data files that are inputs to another program. Because of this I can't change the format of the input files and I have run into a problem.
I am working in a language that doesn't have libraries for this sort of thing and I wouldn't mind the exercise so I am planning on implementing the lexer and parser by hand. I would like to implement a Lexer based roughly on this which is a fairly simple design.
The input file I need to interpret has a section which contains chemical reactions. The different chemical species on each side of the reaction are separated by '+' signs, but the names of the species can also have + characters in them (symbolizing electric charge). For example:
N2+O2=>NO+NO
N2++O2-=>NO+NO
N2+ + O2 => NO + NO
are all valid and the tokens output by the lexer should be
'N2' '+' 'O2' '=>' 'NO' '+' 'NO'
'N2+' '+' 'O2-' '=>' 'NO' '+' 'NO'
'N2+' '+' 'O2-' '=>' 'NO' '+' 'NO'
(note that the last two are identical). I would like to avoid look ahead in the lexer for simplicity. The problem is that the lexer would start reading the any of the above inputs, but when it got to the 3rd character (the first '+'), it wouldn't have any way to know whether it was a part of the species name or if it was a separator between reactants.
To fix this I thought I would just split it off so the second and third examples above would output:
'N2' '+' '+' 'O2-' '=>' 'NO' '+' 'NO'
The parser then would simply use the context, realize that two '+' tokens in a row means the first is part of the previous species name, and would correctly handle all three of the above cases. The problem with this is that now imagine I try to lex/parse
N2 + + O2- => NO + NO
(note the space between 'N2' and the first '+'). This is invalid syntax, however the lexer I just described would output exactly the same token outputs as the second and third examples and my parser wouldn't be able to catch the error.
So possible solutions as I see it:
implement a lexer with atleast one character look ahead
include tokens for whitespace
include leading white space in the '+' token
create a "combined" token that includes both the species name and any trailing '+' without white space between, then letting the parser sort out whether the '+' is actually part of the name or not.
Since I am very new to this kind of programming I am hoping someone can comment on my proposed solutions (or suggest another). My main reservation about the first solution is I simply do not know how much more complicated implementing a lexer with look ahead is.
You don't mention your implementation language, but with an input syntax as relatively simple as the one you outline, I don't think having logic along the lines of the following pseudo-code would be unreasonable.
string GetToken()
{
string token = GetAlphaNumeric(); // assumed to ignore (eat) white-space
var ch = GetChar(); // assumed to ignore (eat) white-space
if (ch == '+')
{
var ch2 = GetChar();
if (ch2 == '+')
token += '+';
else
PutChar(ch2);
}
PutChar(ch);
return token;
}

Grammar for a recognizer of a spice-like language

I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing myself in this work to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1/n", but cannot interpret an input like "R1 5 0 1/n", because it maps "1" to the token NODE, instead of mapping it to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone has an idea of how can I map the "1" to the correct token VALUE, or a suggestion of how can I alter the grammar so that I can correctly interpret the input?
The second problem is the presence of a comment at the end of a line. Because the NEWLINE token delimits: (1) the end of a comment; and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary to the parser correctly recognize the line of code, otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as much characters as possible. In case two tokens match the same amount of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
E.g., I removed VALUE, added value and removed fragment from REAL
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.

How to match any symbol in ANTLR parser (not lexer)?

How to match any symbol in ANTLR parser (not lexer)? Where is the complete language description for ANTLR4 parsers?
UPDATE
Is the answer is "impossible"?
You first need to understand the roles of each part in parsing:
The lexer: this is the object that tokenizes your input string. Tokenizing means to convert a stream of input characters to an abstract token symbol (usually just a number).
The parser: this is the object that only works with tokens to determine the structure of a language. A language (written as one or more grammar files) defines the token combinations that are valid.
As you can see, the parser doesn't even know what a letter is. It only knows tokens. So your question is already wrong.
Having said that it would probably help to know why you want to skip individual input letters in your parser. Looks like your base concept needs adjustments.
It depends what you mean by "symbol". To match any token inside a parser rule, use the . (DOT) meta char. If you're trying to match any character inside a parser rule, then you're out of luck, there is a strict separation between parser- and lexer rules in ANTLR. It is not possible to match any character inside a parser rule.
It is possible, but only if you have such a basic grammar that the reason to use ANTlr is negated anyway.
If you had the grammar:
text : ANY_CHAR* ;
ANY_CHAR : . ;
it would do what you (seem to) want.
However, as many have pointed out, this would be a pretty strange thing to do. The purpose of the lexer is to identify different tokens that can be strung together in the parser to form a grammar, so your lexer can either identify the specific string "JSTL/EL" as a token, or [A-Z]'/EL', [A-Z]'/'[A-Z][A-Z], etc - depending on what you need.
The parser is then used to define the grammar, so:
phrase : CHAR* jstl CHAR* ;
jstl : JSTL SLASH QUALIFIER ;
JSTL : 'JSTL' ;
SLASH : '/'
QUALIFIER : [A-Z][A-Z] ;
CHAR : . ;
would accept "blah blah JSTL/EL..." as input, but not "blah blah EL/JSTL...".
I'd recommend looking at The Definitive ANTlr 4 Reference, in particular the section on "Islands in the stream" and the Grammar Reference (Ch 15) that specifically deals with Unicode.

Resources