Bison matching wrong tokens

Bison matching wrong tokens - parsing

While having a problem with PLY, I tried rephrasing the same grammar fragment in bison and encountered a similar problem. This suggests I might be doing something wrong.
A symbolic representation of the grammar fragment looks like this:
document -> fragment?
fragment -> { \n line* \n fragment? }
line -> [^\n]+ \n
The relevant lex lines:
[{}] return *yytext;
[^\n]+ return ANYTHING;
\n return EOL;
The relevant bison lines:
multiline: '{' EOL lines EOL multiline '}'
|
;
lines: lines ANYTHING EOL
|
;
The grammar is deterministic, for all I know should even be LALR(1) (haven't really tried to build the tableau, though). A document like "{\n\n}" parses OK, but a document where the multiline elements are nested (e.g. "{\n\n{\n\n}}") does not, the lexer sees the last "}}" as a token "ANYTHING" rather than two '}'s.
What am I doing wrong?

[{}] return *yytext;
[^{}\n]+ return ANYTHING;
\n return EOL;
Lex is greedy: if two patterns match the current input, the longest match wins. In the original lex fragment, the [^\n]+ pattern catches lines with { or } in them.

Related

Why are the newline and the white space treated differently in the lexer specification?

I am using F#'s FsLex to generate a lexer. I have difficulties to understand the following two lines from a textbook. Why is the newline (\n) treated differently from the white space? In particular, what does "lexbuf.EndPos <- lexbuf.EndPos.NextLine" do differently from "Tokenize lexbuf"?
rule Tokenize = parse
| [' ' '\t' '\r'] { Tokenize lexbuf }
| '\n' { lexbuf.EndPos <- lexbuf.EndPos.NextLine; Tokenize lexbuf }

A rule is essentially a function that takes a lexer buffer as an argument. Each case on the left side of your rule matches a given character (e.g., '\n') or class of characters ([' ' '\t' '\r']) in your input. The expression on the right size of the rule, inside the curly braces { ... }, defines an action. The purpose of the definition you pasted in appears to be a tokenizer.
The expression Tokenize lexbuf is a recursive call to the Tokenize rule. In essence, this rule ignores whitespace character. Why? Because tokenizers aim to simplify the input. Whitespace typically has no meaning in a programming language, so this rule filters it out. Tokenized input generally makes writing your parser simpler later on. You'll eventually want to add other cases to your Tokenize rule (e.g., for keywords, assignment statements, and other expressions) to produce a complete lexer definition.
The second rule, the one that matches \n, also ignores the whitespace, but as you correctly point out, it does something different. What it's doing is updating the position of the end of the line (lexbuf.EndPos) with the position of the next line's end (lexbuf.EndPos.NextLine) before recursively calling Tokenize again. Why? Presumably so that the end position is correct on the next recursive call.
Since you're only showing a lexer fragment here, I can only guess as to what lexbug.EndPos is used for, but it's pretty common to keep that information around for diagnostic purposes.

Antlr Matlab grammar lexing conflict

I've been using the Antlr Matlab grammar from Antlr grammars
I found out I need to implement the ' Matlab operator. It is the complex conjugate transpose operator, used as such
result = input'
I tried a straightforward solution of adding it to unary_expression as an option postfix_expression '\''
However, this failed to parse when multiple of these operators were used on a single line.
Here's a significantly simplified version of the grammar, still exhibiting the exact problem:
grammar Grammar;
unary_expression
: IDENTIFIER
| unary_expression '\''
;
translation_unit : unary_expression CR ;
STRING_LITERAL : '\'' [a-z]* '\'' ;
IDENTIFIER : [a-zA-Z] ;
CR : [\r\n] + ;
Test cases, being parsed as translation_unit:
"x''\n" //fails getNumberOfSyntaxErrors returns 1
"x'\n" //passes
The failure also prints the message line 1:1 extraneous input '''' expecting CR to stderr.
The failure goes away if I either remove STRING_LITERAL, or change the * to +. Neither is a proper solution of course, as removing it is entirely off the table, and mandating non-empty strings is not quite correct, though I might be able to live with it. Also, forcing non-empty string does nothing to help the real use case, when the input is something like x' + y' instead of using the operator twice.
For some reason removing CR from the grammar and \n from the tests also makes the parsing run without problems, but yet again is not a useable solution.
What can I do to the grammar to make it work correctly? I'm assuming it's a problem with lexing specifically because removing STRING_LITERAL or making it unable to match '' makes it go away.

The lexer can never be made that context aware I think, but I don't know Matlab well enough to be sure. How could you check during tokenisation that these single quotes are operators:
x' + y';
while these are strings:
x = 'x' + ' + y';
?
Maybe you can do something similar as how in ECMAScript a / can be a division operator or a regex delimiter. In this grammar that is handled by a predicate in the lexer that uses some target code to check this.
If something like the above is not possible, I see no other way than to "promote" the creation of strings to the parser. That would mean removing STRING_LITERAL and introducing a parser rule that matches something like this:
string_literal
: QUOTE ~(QUOTE | CR)* QUOTE
;
// Needed to match characters inside strings
OTHER
: .
;
However, that will fail when a string like 'hi there' is encountered: the space in between hi and there will now be skipped by the WS rule. So WS should also be removed (spaces will then get matched by the OTHER rule). But now (of course) all spaces will litter the token stream and you'll have to account for them in all parser rules (not really a viable solution).
All in all: I don't see ANTLR as a suitable tool in this case. You might look into parser generators where there is no separation between tokenisation and parsing. Google for "PEG" and/or "scannerless parsing".

Ordering lexer rules in a grammar using ANTLR4

I'm using ANTLR4 to generate a parser. I am new to parser grammars. I've read the very helpful ANTLR Mega Tutorial but I am still stuck on how to properly order (and/or write) my lexer and parser rules.
I want the parser to be able to handle something like this:
Hello << name >>, how are you?
At runtime I will replace "<< name >>" with the user's name.
So mostly I am parsing text words (and punctuation, symbols, etc), except with the occasional "<< something >>" tag, which I am calling a "func" in my lexer rules.
Here is my grammar:
doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;
WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;
Side note: I added "PUNCT?" at the end of the "item" rule because it is possible, such as in the example sentence I gave above, to have a comma appear right after a "func". But since you can also have a comma after a "WORD" then I decided to put the punctuation in "item" instead of in both of "func" and "WORD".
If I run this parser on the above sentence, I get a parse tree that looks like this:
Anything highlighted in red is a parse error.
So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that says "<< WORD >>", only a rule that says "<< ID >>", so I'm not clear on why that is happening.
If I swap the order of "ID" and "WORD" in my grammar, so now they are in this order:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
And run the parser, I get a parse tree like this:
So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.
How do I get past this conundrum?
I suppose one option might be to change the "func" rule to "<< WORD >>" and just treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).
Thanks for any help!

From The Definitive ANTLR 4 Reference :
ANTLR resolves lexical ambiguities by
matching the input string to the rule specified first in the grammar.
With your grammar (in Question.g4) and a t.text file containing
Hello << name >>, how are you at nine o'clock?
the execution of
$ grun Question doc -tokens -diagnostics t.text
gives
[#0,0:4='Hello',<WORD>,1:0]
[#1,6:7='<<',<'<<'>,1:6]
[#2,9:12='name',<WORD>,1:9]
[#3,14:15='>>',<'>>'>,1:14]
[#4,16:16=',',<PUNCT>,1:16]
[#5,18:20='how',<WORD>,1:18]
[#6,22:24='are',<WORD>,1:22]
[#7,26:28='you',<WORD>,1:26]
[#8,30:31='at',<WORD>,1:30]
[#9,33:36='nine',<WORD>,1:33]
[#10,38:44='o'clock',<WORD>,1:38]
[#11,45:45='?',<PUNCT>,1:45]
[#12,47:46='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'name' expecting ID
line 1:14 extraneous input '>>' expecting {<EOF>, '<<', WORD, PUNCT}
Now change WORD to word in the item rule, and add a word rule :
item: (func | word) PUNCT? ;
word: WORD | ID ;
and put ID before WORD :
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
The tokens are now
[#0,0:4='Hello',<ID>,1:0]
[#1,6:7='<<',<'<<'>,1:6]
[#2,9:12='name',<ID>,1:9]
[#3,14:15='>>',<'>>'>,1:14]
[#4,16:16=',',<PUNCT>,1:16]
[#5,18:20='how',<ID>,1:18]
[#6,22:24='are',<ID>,1:22]
[#7,26:28='you',<ID>,1:26]
[#8,30:31='at',<ID>,1:30]
[#9,33:36='nine',<ID>,1:33]
[#10,38:44='o'clock',<WORD>,1:38]
[#11,45:45='?',<PUNCT>,1:45]
[#12,47:46='<EOF>',<EOF>,2:0]
and there is no more error. As the -gui graphic shows, you have now branches identified as word or func.

As "500 - Internal Server Error" already mentioned in his comment ANTLR will match lexer rules in the order they are defined in the grammar (the topmost rule will be matched first) and if a certain input has been matched ANTLR won't try to match it differently.
In your case the WORD and ID rule can both match input like abc but as WORD is declared first abc will always be matched as a WORD and never as an ID. In fact ID will never be matched as there is no valid input as an ID that can not be matched by WORD.
However if your only goal is to replace whatever is in between << and >> you'd be better off using regular expressions. However if you still want to use ANTLR for it you should reduce your grammar to only care about the essentials. That is to distinguish between any input and input in between << and >>. Therefore your grammar should look something like this:
start: (INTERESTING | UNINTERESTING) ;
INTERESTING: '<<' .*? '>>' ;
UNINTERESTING: (~[<])+ | '<' ;
Or you could skip the UNINTERESTING completely.

Layout in Rascal

When I import the Lisra recipe,
import demo::lang::Lisra::Syntax;
This creates the syntax:
layout Whitespace = [\t-\n\r\ ]*;
lexical IntegerLiteral = [0-9]+ !>> [0-9];
lexical AtomExp = (![0-9()\t-\n\r\ ])+ !>> ![0-9()\t-\n\r\ ];
start syntax LispExp
= IntegerLiteral
| AtomExp
| "(" LispExp* ")"
;
Through the start syntax-definition, layout should be ignored around the input when it is parsed, as is stated in the documentation: http://tutor.rascal-mpl.org/Rascal/Declarations/SyntaxDefinition/SyntaxDefinition.html
However, when I type:
rascal>(LispExp)` (something)`
This gives me a concrete syntax fragment error (or a ParseError when using the parse-function), in contrast to:
rascal>(LispExp)`(something)`
Which succesfully parses. I tried this both with one of the latest versions of Rascal as well as the Eclipse plugin version. Am I doing something wrong here?
Thank you.
Ps. Lisra's parse-function:
public Lval parse(str txt) = build(parse(#LispExp, txt));
Also fails on the example:
rascal>parse(" (something)")
|project://rascal/src/org/rascalmpl/library/ParseTree.rsc|(10329,833,<253,0>,<279,60>): ParseError(|unknown:///|(0,1,<1,0>,<1,1>))
at *** somewhere ***(|project://rascal/src/org/rascalmpl/library/ParseTree.rsc|(10329,833,<253,0>,<279,60>))
at parse(|project://rascal/src/org/rascalmpl/library/demo/lang/Lisra/Parse.rsc|(163,3,<7,44>,<7,47>))
at $shell$(|stdin:///|(0,13,<1,0>,<1,13>))

When you define a start non-terminal Rascal defines two non-terminals in one go:
rascal>start syntax A = "a";
ok
One non-terminal is A, the other is start[A]. Given a layout non-terminal in scope, say L, the latter is automatically defined by (something like) this rule:
syntax start[A] = L before A top L after;
If you call a parser or wish to parse a concrete fragment, you can use either non-terminal:
parse(#start[A], " a ") // parse using the start non-terminal and extra layout
parse(A, "a") // parse only an A
(start[A]) ` a ` // concrete fragment for the start-non-terminal
(A) `a` // concrete fragment for only an A
[start[A]] " a "
[A] "a"

Jison Lex without white spaces

I have this Jison lexer and parser:
%lex
%%
\s+ /* skip whitespace */
'D01' return 'D01'
[xX][+-]?[0-9]+ return 'COORD'
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%start source
%%
source
: command EOF;
command
: D01 COORD;
It will tokenize and parse D01 X45 but not D01X45. What am I missing?

Unlike (f)lex -- or, indeed, the vast majority of scanner generators, jison scanners do not implement the longest-match rule. Instead, the first matching pattern wins.
In order to make this work for keywords, jison scanners also implement the restriction that simple literal strings -- like "D01" -- only match if they end on a word-boundary.
The workaround is to enclose the literal string pattern with redundant parentheses:
("D01") { return 'D01'; }
This is documented in the jison wiki

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart