Jison Lex without white spaces

I have this Jison lexer and parser:
%lex
%%
\s+ /* skip whitespace */
'D01' return 'D01'
[xX][+-]?[0-9]+ return 'COORD'
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%start source
%%
source
: command EOF;
command
: D01 COORD;
It will tokenize and parse D01 X45 but not D01X45. What am I missing?

Unlike (f)lex and, indeed, the vast majority of scanner generators, jison scanners do not implement the longest-match rule. Instead, the first matching pattern wins.
To make that workable for keywords, jison scanners add a further restriction: a simple literal string pattern, like "D01", only matches if the match ends on a word boundary.
The workaround is to enclose the literal string pattern in redundant parentheses:
("D01") { return 'D01'; }
This behaviour is documented in the jison wiki.
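Applied to the lexer in the question, only the D01 line changes. The parentheses make jison treat the pattern as a regular expression rather than a literal string, so the word-boundary restriction no longer applies and D01X45 tokenizes as D01 COORD:
%lex
%%
\s+ /* skip whitespace */
("D01") return 'D01'
[xX][+-]?[0-9]+ return 'COORD'
<<EOF>> return 'EOF'
. return 'INVALID'
/lex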

Related

Why are the newline and the white space treated differently in the lexer specification?

I am using F#'s FsLex to generate a lexer. I have difficulty understanding the following two lines from a textbook. Why is the newline (\n) treated differently from the other whitespace? In particular, what does "lexbuf.EndPos <- lexbuf.EndPos.NextLine" do that "Tokenize lexbuf" alone does not?
rule Tokenize = parse
| [' ' '\t' '\r'] { Tokenize lexbuf }
| '\n' { lexbuf.EndPos <- lexbuf.EndPos.NextLine; Tokenize lexbuf }
A rule is essentially a function that takes a lexer buffer as an argument. Each case on the left side of your rule matches a given character (e.g., '\n') or class of characters ([' ' '\t' '\r']) in your input. The expression on the right side of the rule, inside the curly braces { ... }, defines an action. The definition you pasted appears to be the beginning of a tokenizer.
The expression Tokenize lexbuf is a recursive call to the Tokenize rule. In essence, this rule ignores whitespace characters. Why? Because tokenizers aim to simplify the input. Whitespace typically has no meaning in a programming language, so this rule filters it out. Tokenized input generally makes writing your parser simpler later on. You'll eventually want to add other cases to your Tokenize rule (e.g., for keywords, assignment statements, and other expressions) to produce a complete lexer definition.
The second rule, the one that matches \n, also ignores the whitespace, but as you correctly point out, it does something more: it updates the lexer buffer's end position (lexbuf.EndPos) to the position of the next line (lexbuf.EndPos.NextLine) before recursively calling Tokenize again. Why? Presumably so that the position information, in particular the line number, is correct on the next recursive call.
Since you're only showing a lexer fragment here, I can only guess at what lexbuf.EndPos is used for, but it's pretty common to keep that kind of position information around for diagnostic purposes.

How to write a custom function and variable in jison?

My lex code is:
/* description: Parses end executes mathematical expressions. */
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[0-9]+("."[0-9]+)?\b return 'NUMBER'
[a-zA-Z] return 'FUNCTION'
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
/* operator associations and precedence */
%start expressions
%% /* language grammar */
expressions
: e EOF
{return $1;}
;
e
: FUNCTION '(' e ')'
{$$ = $3;}
| NUMBER
{$$ = Number(yytext);}
;
I got this error:
Parse error on line 1:
balaji()
-^
Expecting '(', got 'FUNCTION'
What I want is to be able to parse myfun(a,b,...) and also myfun(a) with this parser. Thank you for your valuable time.
[a-zA-Z] matches a single alphabetic character (in this case, the letter b), returning FUNCTION. When the next token is needed, it again matches a single alphabetic character (a), returning another FUNCTION token. But of course the grammar doesn't allow two consecutive FUNCTIONs; it's expecting a (, as it says.
You probably intended [a-zA-Z]+, although a better identifier pattern is [A-Za-z_][A-Za-z0-9_]*, which matches things like my_function_2.
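To accept calls like myfun(a) and myfun(a,b,...), the grammar also needs a rule for a comma-separated argument list. Here is a minimal sketch; the IDENT, call, and args names are illustrative, not part of the original grammar:
%lex
%%
\s+ /* skip whitespace */
[0-9]+("."[0-9]+)?\b return 'NUMBER'
[A-Za-z_][A-Za-z0-9_]* return 'IDENT'
"(" return '('
")" return ')'
"," return ','
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%start expressions
%%
expressions
: call EOF
{return $1;}
;
call
: IDENT '(' args ')'
{$$ = {name: $1, args: $3};}
;
args
: e
{$$ = [$1];}
| args ',' e
{$$ = $1.concat([$3]);}
;
e
: NUMBER
{$$ = Number(yytext);}
| IDENT
{$$ = yytext;}
;
With this, myfun(1,2) yields {name: 'myfun', args: [1, 2]}.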

ANTLR4 - Make space optional between tokens

I have the following grammar:
grammar Hello;
prog: stat+ EOF;
stat: DELIMITER_OPEN expr DELIMITER_CLOSE;
expr: NOTES COMMA value=VAR_VALUE #delim_body;
VAR_VALUE: ANBang*;
NOTES: WS* 'notes' WS*;
COMMA: ',';
DELIMITER_OPEN: '<<!';
DELIMITER_CLOSE: '!>>';
fragment ANBang: AlphaNum | Bang;
fragment AlphaNum: [a-zA-Z0-9];
fragment Bang: '!';
WS : [ \t\r\n]+ -> skip ;
Parsing the following works:
<<! notes, Test !>>
and the variable value is "Test". However, the parser fails when I remove the space between the DELIMITER_OPEN and NOTES:
<<!notes, Test !>>
line 1:3 mismatched input 'notes' expecting NOTES
This is yet another case of badly ordered lexer rules.
When the lexer scans for the next token, it first tries to find the rule which will match the longest token. If several rules match a token of that same length, it disambiguates by choosing the one defined first in the grammar.
<<! notes, Test !>> will be tokenized as such:
DELIMITER_OPEN NOTES COMMA VAR_VALUE WS DELIMITER_CLOSE
This is because the NOTES rule can match the following:
<<! notes, Test !>>
   \____/
Which includes the whitespace. If you remove it:
<<!notes, Test !>>
Then both the NOTES and VAR_VALUE rules can match the text notes, and since VAR_VALUE is defined first in the grammar, it takes precedence. The tokenization is:
DELIMITER_OPEN VAR_VALUE COMMA VAR_VALUE WS DELIMITER_CLOSE
and it doesn't match your expr rule.
Change your rules like this to fix the problem:
NOTES: 'notes';
VAR_VALUE: ANBang+;
Adding WS* to other rules doesn't make much sense, since WS is skipped anyway. And declaring a token that can match zero characters (with *) is also meaningless, so use + instead. Finally, reorder the rules so that the most specific ones match first.
This way, notes becomes a keyword in your grammar. If you don't want it to be a keyword, remove the NOTES rule altogether, and use the VAR_VALUE rule with a predicate. Alternatively, you could use lexer modes.
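With those changes applied, and the most specific rules moved to the front, the whole grammar might look like this sketch (only the lexer section differs from the original):
grammar Hello;
prog: stat+ EOF;
stat: DELIMITER_OPEN expr DELIMITER_CLOSE;
expr: NOTES COMMA value=VAR_VALUE #delim_body;
DELIMITER_OPEN: '<<!';
DELIMITER_CLOSE: '!>>';
NOTES: 'notes';
COMMA: ',';
VAR_VALUE: ANBang+;
fragment ANBang: AlphaNum | Bang;
fragment AlphaNum: [a-zA-Z0-9];
fragment Bang: '!';
WS : [ \t\r\n]+ -> skip ;
Now both <<! notes, Test !>> and <<!notes, Test !>> produce the token stream DELIMITER_OPEN NOTES COMMA VAR_VALUE DELIMITER_CLOSE, with the whitespace skipped.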

Jison: Distinguishing between digits and numbers

I have the following minimal example of a grammar I'd like to use with Jison.
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[0-9]+("."[0-9]+)?\b return 'NUMBER'
[0-9] return 'DIGIT'
[,-] return 'SEPARATOR'
// EOF means "end of file"
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%start expressions
%% /* language grammar */
expressions
: e SEPARATOR d EOF
{return $1;}
;
d
: DIGIT
{$$ = Number(yytext);}
;
e
: NUMBER
{$$ = Number(yytext);}
;
Here I have defined both NUMBER and DIGIT in order to allow for both digits and numbers, depending on the context. What I do not know is how to define that context. The above example always returns
Expecting 'DIGIT', got 'NUMBER'
when I try to run it in the Jison debugger. How can I define the grammar so that a digit is always expected after a separator? I tried the following, which does not work either:
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[,-] return 'SEPARATOR'
// EOF means "end of file"
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%start expressions
%% /* language grammar */
expressions
: e SEPARATOR d EOF
{return $1;}
;
d
: [0-9]
{$$ = Number(yytext);}
;
e
: [0-9]+("."[0-9]+)?\b
{$$ = Number(yytext);}
;
The classic scanner/parser model (originally from lex/yacc, and implemented by jison as well) puts the scanner before the parser. In other words, the scanner is expected to tokenize the input stream without regard to parsing context.
Most lexical scanner generators, including jison, provide a mechanism for the scanner to adapt to context (see "start conditions"), but the scanner is responsible for tracking context on its own, and that gets quite ugly.
The simplest solution in this case is to define only a NUMBER token, and have the parser check for validity in the semantic action of rules which actually require a DIGIT. That will work because the difference between DIGIT and NUMBER does not affect the parse other than to make some parses illegal. It would be different if the difference between NUMBER and DIGIT determined which production to use, but that would probably be ambiguous since all digits are actually numbers as well.
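A sketch of that approach, applied to the grammar in the question (the error message is illustrative; throwing from a semantic action is one way to abort a jison parse):
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[0-9]+("."[0-9]+)?\b return 'NUMBER'
[,-] return 'SEPARATOR'
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
%start expressions
%% /* language grammar */
expressions
: e SEPARATOR d EOF
{return $1;}
;
d
: NUMBER
{/* a "digit" is a NUMBER exactly one character long */
if (!/^[0-9]$/.test(yytext)) {
throw new Error('expected a single digit after the separator, got: ' + yytext);
}
$$ = Number(yytext);}
;
e
: NUMBER
{$$ = Number(yytext);}
;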
Another solution is to allow either NUMBER or DIGIT where a number is allowed. That would require changing e so that it accepts either NUMBER or DIGIT, and ensuring that DIGIT wins in the case where both are possible. Since jison takes the first matching pattern, that means putting the DIGIT rule earlier in the grammar file; it also needs a \b at the end, so that it cannot match just the first digit of a longer number:
[0-9]\b return 'DIGIT'
[0-9]+("."[0-9]+)?\b return 'NUMBER'

Bison matching wrong tokens

While troubleshooting a problem with PLY, I tried rephrasing the same grammar fragment in bison and encountered a similar problem. This suggests I might be doing something wrong.
A symbolic representation of the grammar fragment looks like this:
document -> fragment?
fragment -> { \n line* \n fragment? }
line -> [^\n]+ \n
The relevant lex lines:
[{}] return *yytext;
[^\n]+ return ANYTHING;
\n return EOL;
The relevant bison lines:
multiline: '{' EOL lines EOL multiline '}'
|
;
lines: lines ANYTHING EOL
|
;
The grammar is deterministic and, for all I know, should even be LALR(1) (I haven't actually tried to build the tables, though). A document like "{\n\n}" parses OK, but a document in which the multiline elements are nested (e.g. "{\n\n{\n\n}}") does not: the lexer sees the final "}}" as a single ANYTHING token rather than as two '}' tokens.
What am I doing wrong?
[{}] return *yytext;
[^{}\n]+ return ANYTHING;
\n return EOL;
Lex is greedy: if two patterns match at the current input position, the longest match wins. In the original lex fragment, the [^\n]+ pattern therefore swallows any { or } embedded in a line; excluding those characters from the class, as above, lets the [{}] rule see them.
