I am writing a small program which needs to preprocess some data files that are inputs to another program. Because of this I can't change the format of the input files, and I have run into a problem.
I am working in a language that doesn't have libraries for this sort of thing, and I wouldn't mind the exercise, so I am planning on implementing the lexer and parser by hand. I would like to implement a lexer based roughly on this, which is a fairly simple design.
The input file I need to interpret has a section which contains chemical reactions. The different chemical species on each side of the reaction are separated by '+' signs, but the names of the species can also have + characters in them (symbolizing electric charge). For example:
N2+O2=>NO+NO
N2++O2-=>NO+NO
N2+ + O2 => NO + NO
are all valid and the tokens output by the lexer should be
'N2' '+' 'O2' '=>' 'NO' '+' 'NO'
'N2+' '+' 'O2-' '=>' 'NO' '+' 'NO'
'N2+' '+' 'O2-' '=>' 'NO' '+' 'NO'
(note that the last two are identical). I would like to avoid look ahead in the lexer for simplicity. The problem is that the lexer would start reading any of the above inputs, but when it got to the 3rd character (the first '+'), it wouldn't have any way to know whether it was part of the species name or a separator between reactants.
To fix this I thought I would just split it off so the second and third examples above would output:
'N2' '+' '+' 'O2-' '=>' 'NO' '+' 'NO'
The parser would then simply use the context, realize that two '+' tokens in a row mean the first is part of the previous species name, and correctly handle all three of the above cases. The problem with this shows up if I try to lex/parse
N2 + + O2- => NO + NO
(note the space between 'N2' and the first '+'). This is invalid syntax, however the lexer I just described would output exactly the same token outputs as the second and third examples and my parser wouldn't be able to catch the error.
So the possible solutions as I see them:
implement a lexer with at least one character of look ahead
include tokens for whitespace
include leading white space in the '+' token
create a "combined" token that includes both the species name and any trailing '+' without white space in between, then let the parser sort out whether the '+' is actually part of the name or not.
Since I am very new to this kind of programming, I am hoping someone can comment on my proposed solutions (or suggest another). My main reservation about the first solution is that I simply do not know how much more complicated implementing a lexer with look ahead is.
You don't mention your implementation language, but with an input syntax as relatively simple as the one you outline, I don't think having logic along the lines of the following pseudo-code would be unreasonable.
string GetToken()
{
    string token = GetAlphaNumeric(); // assumed to ignore (eat) white-space
    var ch = GetChar();               // assumed to ignore (eat) white-space
    if (ch == '+')
    {
        if (token == "")              // nothing precedes the '+', so it is the separator itself
            return "+";
        var ch2 = GetChar();
        if (ch2 == '+')
            token += '+';             // "++": the first '+' belongs to the species name
        else
            PutChar(ch2);
    }
    PutChar(ch);                      // the separator '+' (or other character) is re-read on the next call
    return token;
}
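For concreteness, here is a runnable Python sketch of the same idea; the push-back list stands in for PutChar, and I've assumed handling for '=>' and '-' so the sample reactions lex end to end (an illustration, not a definitive implementation):

class ReactionLexer:
    def __init__(self, text):
        self.text, self.pos = text, 0
        self.pushed = []                      # push-back buffer standing in for PutChar

    def get_char(self, skip_ws=True):
        while True:
            if self.pushed:
                ch = self.pushed.pop()
            elif self.pos < len(self.text):
                ch = self.text[self.pos]
                self.pos += 1
            else:
                return ''                     # end of input
            if not (skip_ws and ch.isspace()):
                return ch

    def put_char(self, ch):
        if ch:
            self.pushed.append(ch)

    def get_alnum(self):
        # collect a species name; '-' is allowed so 'O2-' stays one unit
        out, ch = [], self.get_char()
        while ch.isalnum() or ch == '-':
            out.append(ch)
            ch = self.get_char(skip_ws=False)
        self.put_char(ch)
        return ''.join(out)

    def get_token(self):
        token = self.get_alnum()
        if token:
            ch = self.get_char()
            if ch == '+':
                ch2 = self.get_char()
                if ch2 == '+':
                    token += '+'              # '++': the first '+' is part of the name
                else:
                    self.put_char(ch2)
            self.put_char(ch)                 # separator (or '=') is read on the next call
            return token
        ch = self.get_char()
        if ch == '+':
            return '+'                        # the separator itself
        if ch == '=':
            self.get_char()                   # consume the '>' of '=>'
            return '=>'
        return ch                             # '' at end of input

lexer = ReactionLexer('N2+ + O2- => NO + NO')
tokens = []
while (tok := lexer.get_token()):
    tokens.append(tok)
print(tokens)  # ['N2+', '+', 'O2-', '=>', 'NO', '+', 'NO']

Like the pseudo-code, this eats white-space around '+', so it still accepts the invalid 'N2 + + O2-' form from the question; catching that would require remembering whether white-space preceded the '+'.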
Let's say I'm writing a parser that parses the following syntax:
foo.bar().baz = 5;
The grammar rules look something like this:
program: one or more statement
statement: expression followed by ";"
expression: one of:
- identifier (\w+)
- number (\d+)
- func call: expression "(" ")"
- dot operator: expression "." identifier
Two expressions have a problem: the func call and the dot operator. This is because those expressions are recursive and look for another expression at the start, causing a stack overflow. I will focus on the dot operator for this question.
We face a similar problem with the plus operator. However, rather than using an expression you would do something like this to solve it (look for a "term" instead):
add operation: term "+" term
term: one of:
- number (\d+)
- "(" expression ")"
The term then includes everything except the add operation itself. To ensure that multiple plus operators can be chained together without using parentheses, one would rather do:
add operation: term, one or more of ("+" followed by term)
I was thinking a similar solution could work for the dot operator or for function calls.
However, the dot operator works a little differently. We always evaluate from left to right and need to allow full expressions so that you can do function calls etc. in between. With parentheses, an example might be:
(foo.bar()).baz = 5;
Unfortunately, I do not want to require parentheses, which is what I would end up with if I followed the method used for the plus operator.
How could I go about implementing this?
Currently my parser never peeks ahead, but even if I do look ahead, it still seems tricky to accomplish.
The easy solution would be to use a bottom-up parser which doesn't drop into a bottomless pit on left recursion, but I suppose you have already rejected that solution.
I don't understand your objection to using a looping construct, though. Postfix modifiers like field lookup and function call are not really different from binary operators like addition (except, of course, for the fact that they will not need to claim an eventual right operand). Plus and minus intermingle freely, which you can parse with a repetition like:
additive: term ( '+' term | '-' term )*
Similarly, postfix modifiers can be easily parsed with something like:
postfixed: atom ( '.' ID | '(' opt-expr-list ')' )*
I'm using a form of extended BNF: parentheses group; | separates alternatives and binds less tightly than concatenation; and * means "zero or more repetitions" of the atom on its left.
Another postfix operator which falls into the same category is array/map subscripting ('[' expr ']'), although you might also have other postfix operators.
Note that like the additive syntax above, selecting the appropriate alternative does not require looking beyond the next token. It's hard to parse without being able to peek one token into the future. Fortunately, that's very little overhead.
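To make the repetition concrete, here is a self-contained recursive-descent sketch of the postfixed rule in Python; the tokenizer, node shapes, and zero-argument calls are simplifications I am assuming, not part of the question's grammar:

import re

def tokenize(src):
    # crude tokenizer for the toy syntax: words and single punctuation marks
    return re.findall(r"\w+|[().;=]", src) + ["<eof>"]

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i]

    def advance(self):
        tok = self.tokens[self.i]
        self.i += 1
        return tok

    def parse_atom(self):
        # identifier or number; '(' expression ')' is omitted for brevity
        return ("atom", self.advance())

    def parse_postfixed(self):
        # postfixed: atom ( '.' ID | '(' ')' )*  -- a loop, not left recursion
        node = self.parse_atom()
        while True:
            if self.peek() == ".":
                self.advance()
                node = ("field", node, self.advance())
            elif self.peek() == "(":
                self.advance()
                assert self.advance() == ")"   # zero-argument calls only
                node = ("call", node)
            else:
                return node

print(Parser(tokenize("foo.bar().baz")).parse_postfixed())
# ('field', ('call', ('field', ('atom', 'foo'), 'bar')), 'baz')

Each pass through the loop wraps the node built so far, which gives exactly the left-to-right evaluation the question asks for, and only one token of peeking is ever needed.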
One way could be for the dot operator to parse a non-dot expression, that is, a rule that is the same as expression but without the dot operator. This prevents recursion.
Then, when the non-dot expression has been parsed, check if a dot and an identifier follow. If not, we are done. If they do: wrap the current node up in a dot operation node; keep track of the entire source text that has been parsed for this operation so far; then revert everything back to before the operation was being parsed, and re-parse a "custom expression", where the first directly-nested expression really just tries to match the exact text that was parsed before rather than a real expression. Repeat until there are no more dot-identifier pairs (this should happen automatically via the new "custom expression").
This is messy, complicated and possibly slow, and I'm not entirely sure if it'll work but I'll try it out. I'd appreciate alternative solutions.
I am trying to write a parser for the IBM Assembler Language, Example below.
Comment lines start with a star ('*') as the first character; however, there are two problems:
Beyond a set point in the line there can also be descriptive text, but no star is necessary there.
The descriptive text can (and does) contain lexer tokens, such as ENTRY or INPUT.
* TYPE.
ARG DSECT
NXENT DS F some comment text ENTRY NUMBER
NMADR DS F some comment text INPUT NAME
NAADR DS F some comment text
NATYP DS F some comment text
NAENT DS F some comment text
ORG NATYP some comment text
In my lexer I have devised the following, which works absolutely fine:
fragment CommentLine: Star {getCharPositionInLine() == 1}? .*? Nl
;
fragment Star: '*';
fragment Nl: '\r'? '\n' ;
COMMENT_LINE
: CommentLine -> channel (COMMENT)
;
My question is: how do I manage the line comments starting at a particular character position in the parser grammar? I.e. Parser -> NAME DS INT? LETTER ??????????
Sending comments to a COMMENT channel (or -> skipping them) is a technique used to avoid having to define all the places comments are valid in your parser rules.
(Old 360+ Assembler programmer here)
Since there is not really a way to place arbitrarily positioned comments in Assembler source, you don't really need to deal with shunting them off to the side. Actually, because of the way comments are handled in assembler source, there's just NOT a way to identify them in a Lexer rule.
Since it can be a parser rule, you could set up a rule like:
trailingComment: (ID | STRING | NUMBER)* EOL;
where ID, STRING, NUMBER, etc. are just the tokens in your lexer. (You'd need to include pretty much all of them... a good argument for not getting down to tokens for MVC, CLC, CLI, and all the other op codes: the path to madness.) And of course EOL is your rule to match end of line (probably '\r?\n').
You would then end each of your rules for parsing a line that can contain a trailing comment (pretty much all of them) with the trailingComment rule.
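For example, the DS and ORG lines in the sample might end up as rules along these lines (a sketch; the rule and token names are my assumptions, not from your grammar):

dsLine  : ID 'DS' ID trailingComment ;
orgLine : 'ORG' ID trailingComment ;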
I'm trying to understand how ANTLR grammars work and I've come across a situation where it behaves unexpectedly and I can't explain why or figure out how to fix it.
Here's the example:
root : title '\n' fields EOF;
title : STR;
fields : field_1 field_2;
field_1 : 'a' | 'b' | 'c';
field_2 : 'd' | 'e' | 'f';
STR : [a-z]+;
There are two parts:
A title that is a lowercase string with no special characters
A two character string representing a set of possible configurations
When I go to test the grammar, here's what happens: first I write the title and, on a new line, give the character for the first field. So far so good. The parse tree looks as I would expect up to this point.
The problem comes up when I add the next field: ANTLR reinterprets the line as an instance of STR instead of the concatenation of fields I was expecting.
I do not understand why ANTLR tries to force an unrelated terminal expression when it wasn't specified as an option by the grammar. Shouldn't it know to only look for characters matching the field rules since it is descended from the fields node in the parse tree? What's going on here and how do I write my ANTLR grammars so they don't have this problem?
I've read that ANTLR tries to match the format greedily from the top of the grammar to the bottom, but this doesn't explain why this is happening, because the STR terminal is the very last line in the file. If ANTLR gives special precedence to matching terminals, how do I format the grammar so that it interprets it properly? As far as I understand, regexes do not work for non-terminals, so it seems that I have to define it how it is now.
A note of clarification: this is just an example of a possible grammar that I'm trying to make work with the text format as is, so I'm not looking for answers like adding a space between the fields or changing the title to be uppercase.
What I didn't understand before is that there are two steps in generating a parser:
Scanning the input for a list of tokens using the lexer rules (uppercase statements) and then...
Constructing a parse tree using the parser rules (lowercase statements) and generated tokens
My problem was that there was no way for ANTLR to know I wanted that specific string to be interpreted differently when it was generating the tokens. To fix this problem, I wrote a new lexer rule for the fields string so that it would be identifiable as a token. The key was making the FIELDS rule appear before the STR rule: when two lexer rules match the same amount of input (here, both FIELDS and STR match a two-character line such as 'ad'), ANTLR picks whichever rule appears first.
root : title FIELDS EOF;
title : STR;
FIELDS : [a-c] [d-f];
STR : [a-z]+;
Note: I had to bite the bullet and read the ANTLR Mega Tutorial to figure this out.
I've searched the internet far and wide (for at least half a day now) and I can't seem to find the answers needed.
Currently I'm trying to create a .bnf file for an IntelliJ plugin with custom language support.
A few tutorials mention the existence of {pin=1}, {pin=2} and {recoverWhile=xyz}, but I didn't find any real explanation of their uses, or whether there are other things I should know about (maybe a {pin=3} also exists?).
So could somebody please tell me what exactly those flags, methods, or whatever they're called are, and how to use them in my .bnf?
Thank you for your help and best regards,
Fuchs
These attributes are explained here:
https://github.com/JetBrains/Grammar-Kit/blob/master/HOWTO.md#22-using-recoverwhile-attribute
https://github.com/JetBrains/Grammar-Kit/blob/master/TUTORIAL.md
But the usage is not trivial. A good idea is to use Live Preview to play around with it.
My understanding:
The pin and recoverWhile attributes are used to recover the parser from errors.
Pin specifies a part of the rule (by index or literally); once that part parses successfully, the whole rule is considered successful.
In the example:
expr ::= expr1 "+" expr2 {pin=1}
if expr1 is matched, the whole rule will be considered successful and the parser will try to match the rest.
if pin=2, the rule will be considered successful after matching "+", and will fail if expr1 or "+" is not matched.
The recoverWhile attribute specifies which tokens to skip after parsing the rule, regardless of its success.
For example
{recoverWhile=expr_recover}
expr_recover ::= !(";" | ".")
will skip all input up to the next ";" or ".". I.e., the parser will resume matching the next rule at the ";" or ".".
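Putting the two together, a rule in your .bnf might look like this (a sketch in Grammar-Kit BNF; the rule names and attribute values are illustrative):

property ::= id '=' expr ';' {pin=2 recoverWhile=property_recover}
private property_recover ::= !(';' | id)

With pin=2 the rule is committed once the '=' has been matched; if the rest fails, an error is reported, and property_recover then skips ahead to the next ';' or identifier so parsing can resume.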
I was writing a parser to parse C-like grammars.
At first, it could parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
where newlines are processed by the lexer, which I wrote myself (the code is simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
The instruction :PASS here is equivalent to returning nothing in LEX: it drops the currently matched text and skips to the next rule, just as is usually done with whitespace.
Because of this, I can't just simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
| '\n'
;
So I chose to have the parser change the lexer's state dynamically.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And add a lexer rule that can match new line with the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
This means that when the lexer is in the :STATEMENT_END state, it will first increase the line number as usual, then reset the state to the initial one, and then pretend to be a semicolon.
Strangely, it doesn't actually work with the following code:
a = 1
b = 2
I debugged it and found that it does not actually get a ';' as expected when it scans the newline after the number 1, and the state-specific rule is never executed.
The code that sets the new state runs only after the lexer has already scanned the newline and returned nothing; that is, the work happens in the following order:
scan a, = and 1
scan the newline and skip it, so the next token b is read
the inserted code ({ state = :STATEMENT_END }) is executed
raise an error -- unexpected b here
This is what I expect:
scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ';' according to the new state's matching rule
continue to scan & parse the following line
After some introspection I found the likely cause: YACC uses LALR(1), so the parser reads ahead by one token. When the lexer scans the newline, the state has not been set yet, so it cannot return the correct token.
My question is: how to make it work as expected? I have no idea on this.
Thanks.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky, as you DON'T want to set the state if you have unclosed parentheses, so you end up needing an additional state variable to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not), I would suggest either using the / operator (which does lookahead directly), or deferring returning the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer track the parenthesis level (+1 for '(', -1 for ')') and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (an ID, a constant, a ')', or a postfix-only operator), return the extra ';'.
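Here is a minimal sketch of that bookkeeping (in Python rather than the question's Ruby lexer DSL; the token kinds and names are illustrative):

ENDS_EXPR = {"ID", "NUM", "RPAREN"}          # kinds that can legally end an expression

class SemiInsertingLexer:
    def __init__(self, raw_tokens):
        self.raw = iter(raw_tokens)          # (kind, text) pairs, newlines included
        self.depth = 0                       # parenthesis nesting level
        self.last = None                     # kind of the last token returned

    def lex(self):
        for kind, text in self.raw:
            if kind == "LPAREN":
                self.depth += 1
            elif kind == "RPAREN":
                self.depth -= 1
            if kind == "NEWLINE":
                # a newline becomes ';' only outside parens and only after a
                # token that can end an expression; otherwise it is skipped
                if self.depth == 0 and self.last in ENDS_EXPR:
                    self.last = "SEMI"
                    yield ("SEMI", ";")
                continue
            self.last = kind
            yield (kind, text)

toks = [("ID", "a"), ("EQ", "="), ("NUM", "1"), ("NEWLINE", "\n"),
        ("ID", "b"), ("EQ", "="), ("NUM", "2"), ("NEWLINE", "\n")]
print(list(SemiInsertingLexer(toks).lex()))
# [('ID', 'a'), ('EQ', '='), ('NUM', '1'), ('SEMI', ';'),
#  ('ID', 'b'), ('EQ', '='), ('NUM', '2'), ('SEMI', ';')]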
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (it's now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error-prone), or by using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
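Sketched in bison terms (an illustrative fragment; token declarations and the rest of the grammar omitted):

%glr-parser
%%
stmt : expr ';'        /* explicitly terminated */
     | expr NEWLINE    /* newline-terminated */
     ;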
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.