Flex/Bison: lexing ambiguous tokens - parsing

I'm dealing with a tricky problem in my flex/bison lexer/parser.
Here are some flex rules, for roman numerals and arbitrary identifiers:
"I"|"II"|"III"|"IV"|"V"|"VI"|"VII"|"i"|"ii"|"iii"|"iv"|"v"|"vi"|"vii" { return NUMERAL; }
"foobar" { return FOOBAR; }
[A-Za-z0-9_]+ { return IDENTIFIER; }
Now, consider this simple grammar:
%token <numeral> NUMERAL
%token <foobar> FOOBAR
%token <identifier> IDENTIFIER
: numeral foobar { }
Finally, here is an example input:
IVfoobar
I intend for this to lex as the numeral IV, followed by a FOOBAR. However, how can I prevent this from lexing as the numeral I followed by the identifier "Vfoobar", or just identifier "IVfoobar", which are both invalid?

If you really want to process this at lexer level, then you have to make sure the rule for IDENTIFIER doesn't match strings starting with a roman numeral (I,II,... vii ...).
That's because Lex selects the rule that matches the longest input.
Perhaps excluding the roman numeral letters from the first character of an IDENTIFIER would still leave a satisfactory set of valid identifiers:
(?i:[a-z0-9_]{-}[ivxlcdm])(?i:[a-z0-9_])* { return IDENTIFIER; }
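The effect of the longest-match rule can be checked outside flex with a small model. The Python sketch below is only an approximation under stated assumptions: `tokenize` is a hypothetical helper, the rule lists are hand-translated from the flex rules above, and a negative lookahead stands in for flex's `{-}` set difference.

```python
import re

# Longest alternatives first, because Python alternation is leftmost-first.
NUMERAL = r"(?i:VII|VI|V|IV|III|II|I)"

ORIGINAL = [
    ("NUMERAL", NUMERAL),
    ("FOOBAR", r"foobar"),
    ("IDENTIFIER", r"[A-Za-z0-9_]+"),
]
# Same rules, but an identifier may not start with a roman numeral letter.
RESTRICTED = [
    ("NUMERAL", NUMERAL),
    ("FOOBAR", r"foobar"),
    ("IDENTIFIER", r"(?![ivxlcdmIVXLCDM])[A-Za-z0-9_]+"),
]

def tokenize(rules, text):
    """Flex-style scan: the longest match wins; ties go to the earliest rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize(ORIGINAL, "IVfoobar"))    # one IDENTIFIER covering the whole input
print(tokenize(RESTRICTED, "IVfoobar"))  # NUMERAL "IV" followed by FOOBAR
```

With the restricted identifier rule, `foobar` is matched by both FOOBAR and IDENTIFIER at the same length, and the earlier rule wins, which is why FOOBAR must stay above IDENTIFIER.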


Unary minus messes up parsing

Here is the grammar of the language I'd like to parse:
expr ::= var | const | (expr) | unop expr | expr binop expr
var ::= letter
const ::= {digit}+
unop ::= -
binop ::= /*+-
I'm using an example from the Haskell wiki.
The semantics and token parser are not shown here.
exprparser = buildExpressionParser table term <?> "expression"
table = [ [Prefix (m_reservedOp "-" >> return (Uno Oppo))]
        , [Infix (m_reservedOp "/" >> return (Bino Quot)) AssocLeft
          ,Infix (m_reservedOp "*" >> return (Bino Prod)) AssocLeft]
        , [Infix (m_reservedOp "-" >> return (Bino Diff)) AssocLeft
          ,Infix (m_reservedOp "+" >> return (Bino Somm)) AssocLeft]
        ]
term = m_parens exprparser
       <|> fmap Var m_identifier
       <|> fmap Con m_natural
The minus char appears two times, once as unary, once as binary operator.
On input "1--2", the parser gives only
Con 1
instead of the expected
"Bino Diff (Con 1) (Uno Oppo (Con 2))"
Any help welcome.
The purpose of reservedOp is to create a parser (which you've named m_reservedOp) that parses the given string of operator symbols while ensuring that it is not the prefix of a longer string of operator symbols. You can see this from the definition of reservedOp in the source:
reservedOp name =
    lexeme $ try $
    do{ _ <- string name
      ; notFollowedBy (opLetter languageDef) <?> ("end of " ++ show name)
      }
Note that the supplied name is parsed only if it is not followed by any opLetter symbols.
In your case, the string "--2" can't be parsed by m_reservedOp "-": even though it starts with the valid operator "-", that "-" is immediately followed by another operator character, so it is treated as the prefix of a longer operator "--".
In a language with single-character operators, you probably don't want to use reservedOp at all, unless you want to disallow adjacent operators without intervening whitespace. Just use symbol "-", which will always parse "-", no matter what follows (and consume following whitespace, if any). Also, in a language with a fixed set of operators (i.e., no user-defined operators), you probably won't use the operator parser, so you won't need opStart, or reservedOpNames. Without reservedOp or operator, the opLetter parser isn't used, so you can drop it too.
This is probably pretty confusing, and the Parsec documentation does a terrible job of explaining how the "reserved" mechanism is supposed to work. Here's a primer:
Let's start with identifiers, instead of operators. In a typical language that allows user-defined identifiers (i.e., pretty much any language, since "variables" and "functions" have user-defined names) and may also have some reserved words that aren't allowed as identifiers, the relevant settings in the GenLanguageDef are:
identStart -- parser for first character of valid identifier
identLetter -- second and following characters of valid identifier
reservedNames -- list of reserved names not allowed as identifiers
The lexeme (whitespace-absorbing) parsers created using the GenTokenParser object are:
identifier - Parses an unknown, user-defined identifier. It parses a character from identStart followed by zero or more identLetters up to the first non-identLetter. (It never parses a partial identifier, so it'll never leave more identLetters on the table.) Additionally, it checks that the identifier is not in the list reservedNames.
symbol - Parses the given string. If the string is a reserved word, no check is made that it isn't part of a larger valid identifier. So, symbol "for" would match the beginning of foreground = "black", which is rarely what you want. Note that symbol makes no use of identStart, identLetter, or reservedNames.
reserved - Parses the given string, and then ensures that it's not followed by an identLetter. So, m_reserved "for" will parse for (i=1; ... but not parse foreground = "black". Usually, the supplied string will be a valid identifier, but no check is made for this, so you can write m_reserved "15" if you want -- in a language with the usual sorts of alphanumeric identifiers, this would parse "15" provided it wasn't followed by a letter or another digit. Also, maybe somewhat surprisingly, no check is made that the supplied string is in reservedNames.
If that makes sense to you, then the operator settings follow the exact same pattern. The relevant settings are:
opStart -- parser for first character of valid operator
opLetter -- valid second and following operator chars, for multichar operators
reservedOpNames -- list of reserved operator names not allowed as user-defined operators
and the relevant parsers are:
operator - Parses an unknown, user-defined operator starting with an opStart and followed by zero or more opLetters up to the first non-opLetter. So, operator applied to the string "--2" will always take the whole operator "--", never just the prefix "-". An additional check is made that the resulting operator is not in the reservedOpNames list.
symbol - Exactly as for identifiers. It parses a string with no checks or reference to opStart, opLetter, or reservedOpNames, so symbol "-" will parse the first character of the string "--" just fine, leaving the second "-" character for a later parser.
reservedOp - Parses the given string, ensuring it's not followed by opLetter. So, m_reservedOp "-" will parse the start of "-x" but not "--2", assuming - matches opLetter. As before, no check is made that the string is in reservedOpNames.
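To make the difference between symbol and the reserved parsers concrete, here is a minimal Python model. It is not Parsec's API: `symbol` and `reserved` are hypothetical stand-ins that return the end offset of a successful match (or None), they skip no whitespace, and the two letter sets are assumptions chosen for this example.

```python
def symbol(name, text, pos=0):
    """Model of `symbol`: match the literal string and ignore what follows."""
    return pos + len(name) if text.startswith(name, pos) else None

def reserved(name, text, letters, pos=0):
    """Model of `reserved` / `reservedOp`: match the literal string, then
    fail if the next character is one of `letters`."""
    end = symbol(name, text, pos)
    if end is None:
        return None
    if end < len(text) and text[end] in letters:
        return None
    return end

OP_LETTERS = "+-*/"  # assumed opLetter set
IDENT_LETTERS = "abcdefghijklmnopqrstuvwxyz0123456789_"  # assumed identLetter set

print(reserved("-", "--2", OP_LETTERS))                # None: "-" is followed by an opLetter
print(symbol("-", "--2"))                              # 1: takes the first "-" only
print(reserved("-", "-x", OP_LETTERS))                 # 1
print(reserved("for", "foreground", IDENT_LETTERS))    # None
print(symbol("for", "foreground"))                     # 3: usually not what you want
```

In this model, the failure on "1--2" corresponds to the first call: the binary minus is rejected because the unary minus right after it counts as an operator letter.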

how to write custom function and variable in jison?

My lex code is:
/* description: Parses and executes mathematical expressions. */
/* lexical grammar */
\s+ /* skip whitespace */
[0-9]+("."[0-9]+)?\b return 'NUMBER'
[a-zA-Z] return 'FUNCTION'
<<EOF>> return 'EOF'
. return 'INVALID'
/* operator associations and precedence */
%start expressions
%% /* language grammar */
: e EOF
{return $1;}
| FUNCTION '('e')'
{$$ = Number(yytext);}
I got this error:
Parse error on line 1:
Expecting '(', got 'FUNCTION'
What I want is to parse myfun(a,b,...) and also myfun(a) with this parser. Thank you for your time.
[a-zA-Z] matches a single alphabetic character (in this case, the letter b), returning FUNCTION. When the next token is needed, it again matches a single alphabetic character (a), returning another FUNCTION token. But of course the grammar doesn't allow two consecutive FUNCTIONs; it's expecting a (, as it says.
You probably intended [a-zA-Z]+, although a better identifier pattern is [A-Za-z_][A-Za-z0-9_]*, which matches things like my_function_2.
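The difference between the two patterns is easy to see with plain regular expressions. This Python snippet is only an illustration: `re.findall` silently skips characters that match nothing, which a real jison lexer would not do, and the input strings are made-up examples.

```python
import re

single_letter = r"[a-zA-Z]"
identifier = r"[A-Za-z_][A-Za-z0-9_]*"

# Each element of the result corresponds to one FUNCTION token.
print(re.findall(single_letter, "myfun(a)"))      # ['m', 'y', 'f', 'u', 'n', 'a']
print(re.findall(identifier, "myfun(a)"))         # ['myfun', 'a']
print(re.findall(identifier, "my_function_2(a)")) # ['my_function_2', 'a']
```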

Flex and Bison - Grammar that sometimes care about spaces

Currently I'm trying to implement a grammar which is very similar to ruby. To keep it simple, the lexer currently ignores space characters.
However, in some cases a space character makes a big difference:
def some_callback(arg=0)
  arg * 100
end
some_callback (1 + 1) + 1 # 300
some_callback(1 + 1) + 1 # 201
some_callback +1 # 100
some_callback+1 # 1
some_callback + 1 # 1
So currently all whitespace is ignored by the lexer:
And the grammar contains, for example, something like:
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
One way I can think of to solve this problem would be to explicitly add whitespace to the whole grammar, but that would greatly increase the complexity of the grammar:
// OLD:
| AdditiveExpression T_ADD MultiplicativeExpression
| AdditiveExpression T_SUB MultiplicativeExpression
// NEW:
/* empty */
| AdditiveExpression _ T_ADD _ MultiplicativeExpression
| AdditiveExpression _ T_SUB _ MultiplicativeExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
So I'd like to ask whether there is a best practice for handling this in the grammar.
Thank you in advance!
Without having a full specification of the syntax you are trying to parse, it's not easy to give a precise answer. In the following, I'm assuming that those are the only two places where the presence (or absence) of whitespace between two tokens affects the parse.
Differentiating between f(...) and f (...) occurs in a surprising number of languages. One common strategy is for the lexer to recognize an identifier which is immediately followed by an open parenthesis as a "FUNCTION_CALL" token.
You'll find that in most awk implementations, for example; in awk, the ambiguity between a function call and concatenation is resolved by requiring that the open parenthesis in a function call immediately follow the identifier. Similarly, the C pre-processor macro definition directive distinguishes between #define foo(A) A (the definition of a macro with arguments) and #define foo (A) (an ordinary macro whose expansion starts with a ( token).
If you're doing this with (f)lex, you can use the / trailing-context operator:
[[:alpha:]_][[:alnum:]_]*/"(" { yylval = strdup(yytext); return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]* { yylval = strdup(yytext); return IDENT; }
The grammar is now pretty straight-forward:
call: FUNC_CALL '(' expression_list ')' /* foo(1, 2) */
| IDENT expression_list /* foo (1, 2) */
| IDENT /* foo * 3 */
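The lexer side of this distinction can be sketched in Python, with a lookahead playing the role of flex's trailing context. `classify` is a hypothetical helper that only looks at the identifier at the start of its input. It is a model of the idea, not of flex's matching algorithm.

```python
import re

# Identifier immediately followed by "(" (the "(" itself is not consumed).
FUNC_CALL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?=\()")
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def classify(text):
    """Return (token, lexeme) for the identifier at the start of `text`."""
    m = FUNC_CALL.match(text)
    if m:
        return ("FUNC_CALL", m.group())
    m = IDENT.match(text)
    if m:
        return ("IDENT", m.group())
    return None

print(classify("foo(1, 2)"))   # ('FUNC_CALL', 'foo')
print(classify("foo (1, 2)"))  # ('IDENT', 'foo')
print(classify("foo * 3"))     # ('IDENT', 'foo')
```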
This distinction will not be useful in all syntactic contexts, so it will often prove useful to add a non-terminal which will match either identifier form:
name: IDENT
    | FUNC_CALL
But you will need to be careful with this non-terminal. In particular, using it as part of the expression grammar could lead to parser conflicts. But in other contexts, it will be fine:
func_defn: "def" name '(' parameters ')' block "end"
(I'm aware that this is not the precise syntax for Ruby function definitions. It's just for illustrative purposes.)
More troubling is the other ambiguity, in which it appears that the unary operators + and - should be treated as part of an integer literal in certain circumstances. The behaviour of the Ruby parser suggests that the lexer is combining the sign character with an immediately following number in the case where it might be the first argument to a function. (That is, in the context <identifier><whitespace><sign><digits> where <identifier> is not an already declared local variable.)
That sort of contextual rule could certainly be added to the lexical scanner using start conditions, although it's more than a little ugly. A not-fully-fleshed out implementation, building on the previous:
%x SIGNED_NUMBERS
%%
[[:alpha:]_][[:alnum:]_]*/"(" { yylval.id = strdup(yytext);
    return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]*/[[:blank:]] { yylval.id = strdup(yytext);
    /* A sign after a non-local identifier may belong to a number */
    if (!is_local(yylval.id))
        BEGIN(SIGNED_NUMBERS);
    return IDENT; }
[[:alpha:]_][[:alnum:]_]* { yylval.id = strdup(yytext);
    return IDENT; }
<SIGNED_NUMBERS>[[:blank:]]+ ;
    /* Numeric patterns, one version for each context */
<SIGNED_NUMBERS>[+-]?[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
    BEGIN(INITIAL);
    return INTEGER; }
[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
    return INTEGER; }
    /* ... */
    /* If the next character is not a digit or a sign, rescan in INITIAL state */
<SIGNED_NUMBERS>.|\n { yyless(0); BEGIN(INITIAL); }
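The same contextual rule can be modelled in a few lines of Python, which may make the intended token streams easier to see. This is a sketch under assumptions: `tokenize` and its `local_vars` parameter are hypothetical, a boolean flag stands in for the SIGNED_NUMBERS start condition, and every character that is not part of an identifier or number is returned as a one-character token.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
SIGNED = re.compile(r"[+-]?[0-9]+")
UNSIGNED = re.compile(r"[0-9]+")

def tokenize(text, local_vars=()):
    tokens, pos = [], 0
    signed_ok = False  # plays the role of the SIGNED_NUMBERS start condition
    while pos < len(text):
        ch = text[pos]
        if ch in " \t":
            pos += 1   # blanks are skipped and do not reset the context
            continue
        m = IDENT.match(text, pos)
        if m:
            name, pos = m.group(), m.end()
            if text.startswith("(", pos):
                tokens.append(("FUNC_CALL", name))
                signed_ok = False
            else:
                tokens.append(("IDENT", name))
                # Only a non-local identifier followed by a blank opens the context.
                signed_ok = name not in local_vars and text[pos:pos + 1] in (" ", "\t")
            continue
        m = (SIGNED if signed_ok else UNSIGNED).match(text, pos)
        signed_ok = False
        if m:
            tokens.append(("INTEGER", int(m.group())))
            pos = m.end()
            continue
        tokens.append((ch, ch))
        pos += 1
    return tokens

print(tokenize("f +1"))                    # IDENT, INTEGER 1
print(tokenize("f+1"))                     # IDENT, '+', INTEGER 1
print(tokenize("f + 1"))                   # IDENT, '+', INTEGER 1
print(tokenize("f +1", local_vars={"f"}))  # IDENT, '+', INTEGER 1
```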
Another possible solution would be for the lexer to distinguish sign characters which follow a space and are directly followed by a digit, and then let the parser try to figure out whether or not the sign should be combined with the following number. However, this will still depend on being able to distinguish between local variables and other identifiers, which will still require the lexical feedback through the symbol table.
It's worth noting that the end result of all this complication is a language whose semantics are not very obvious in some corner cases. The fact that f+3 and f +3 produce different results could easily lead to subtle bugs which might be very hard to detect. In many projects using languages with these kinds of ambiguities, the project style guide will prohibit legal constructs with unclear semantics. You might want to take this into account in your language design, if you have not already done so.

Jison: Distinguishing between digits and numbers

I have the following minimal example of a grammar I'd like to use with Jison.
/* lexical grammar */
\s+ /* skip whitespace */
[0-9]+("."[0-9]+)?\b return 'NUMBER'
[0-9] return 'DIGIT'
[,-] return 'SEPARATOR'
// EOF means "end of file"
<<EOF>> return 'EOF'
. return 'INVALID'
%start expressions
%% /* language grammar */
{return $1;}
{$$ = Number(yytext);}
{$$ = Number(yytext);}
Here I have defined both NUMBER and DIGIT in order to allow for both digits and numbers, depending on the context. What I do not know is how to define the context. The above example always returns
Expecting 'DIGIT', got 'NUMBER'
when I try to run it in the Jison debugger. How can I define the grammar so that it always expects a digit after a separator? I tried the following, which does not work either:
/* lexical grammar */
\s+ /* skip whitespace */
[,-] return 'SEPARATOR'
// EOF means "end of file"
<<EOF>> return 'EOF'
. return 'INVALID'
%start expressions
%% /* language grammar */
{return $1;}
: [0-9]
{$$ = Number(yytext);}
: [0-9]+("."[0-9]+)?\b
{$$ = Number(yytext);}
The classic scanner/parser model (originally from lex/yacc, and implemented by jison as well) puts the scanner before the parser. In other words, the scanner is expected to tokenize the input stream without regard to parsing context.
Most lexical scanner generators, including jison, provide a mechanism for the scanner to adapt to context (see "start conditions"), but the scanner is responsible for tracking context on its own, and that gets quite ugly.
The simplest solution in this case is to define only a NUMBER token, and have the parser check for validity in the semantic action of rules which actually require a DIGIT. That will work because the difference between DIGIT and NUMBER does not affect the parse other than to make some parses illegal. It would be different if the difference between NUMBER and DIGIT determined which production to use, but that would probably be ambiguous since all digits are actually numbers as well.
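As an illustration of that first approach, the check a semantic action would perform can be written as a small function. This Python sketch is not jison code: `digit_action` and `ParseError` are hypothetical names, and the function is assumed to receive the text of a NUMBER token in a position where the grammar requires a single digit.

```python
class ParseError(Exception):
    """Raised when a NUMBER token is used where a single digit is required."""

def digit_action(number_text):
    """Validate the text of a NUMBER token and return its value as an int."""
    if len(number_text) != 1 or number_text not in "0123456789":
        raise ParseError(f"expected a single digit, got {number_text!r}")
    return int(number_text)

print(digit_action("7"))  # 7
try:
    digit_action("42")
except ParseError as err:
    print("rejected:", err)
```

The lexer stays context-free, and the restriction lives in the one rule that needs it.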
Another solution is to allow either NUMBER or DIGIT where a number is allowed. That would require changing e so that it accepted either NUMBER or DIGIT, and ensuring that DIGIT wins out in the case that both NUMBER and DIGIT are possible. That requires putting its rule earlier in the grammar file, and adding the \b at the end:
[0-9]\b return 'DIGIT'
[0-9]+("."[0-9]+)?\b return 'NUMBER'
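The effect of this ordering can be tried with a first-match model in Python, whose `\b` behaves like the JavaScript one in these cases. `first_token` is a hypothetical helper that returns only the first token of its input; it is a sketch of the rule ordering, not of jison itself.

```python
import re

# Rules in file order: the first pattern that matches wins.
RULES = [
    ("DIGIT", re.compile(r"[0-9]\b")),
    ("NUMBER", re.compile(r"[0-9]+(\.[0-9]+)?\b")),
]

def first_token(text):
    for name, pattern in RULES:
        m = pattern.match(text)
        if m:
            return (name, m.group())
    return None

print(first_token("7"))    # ('DIGIT', '7')
print(first_token("42"))   # ('NUMBER', '42')
print(first_token("4.2"))  # ('DIGIT', '4')
```

The last case is worth checking against the real grammar: in this model `\b` also matches between a digit and the decimal point, so a decimal such as 4.2 starts with a DIGIT token.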

Jison Lex without white spaces

I have this Jison lexer and parser:
\s+ /* skip whitespace */
'D01' return 'D01'
[xX][+-]?[0-9]+ return 'COORD'
<<EOF>> return 'EOF'
. return 'INVALID'
%start source
: command EOF;
: D01 COORD;
It will tokenize and parse D01 X45 but not D01X45. What am I missing?
Unlike (f)lex (or, indeed, the vast majority of scanner generators), jison scanners do not implement the longest-match rule. Instead, the first matching pattern wins.
In order to make this work for keywords, jison scanners also implement the restriction that simple literal strings -- like "D01" -- only match if they end on a word-boundary.
The workaround is to enclose the literal string pattern with redundant parentheses:
("D01") { return 'D01'; }
This is documented in the jison wiki.
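The behaviour described above can be reproduced with a first-match scanner model in Python. The sketch rests on one assumption taken from the explanation: a bare literal such as 'D01' behaves as if `\b` were appended to it, while the parenthesised form does not. `tokenize` and `rules_with` are hypothetical helpers.

```python
import re

def tokenize(rules, text):
    """First-match scan: try the rules in order and take the first that matches."""
    pos, tokens = 0, []
    while pos < len(text):
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:
                if name != "SKIP":
                    tokens.append(name)
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
    return tokens

def rules_with(d01_pattern):
    return [("SKIP", r"\s+"), ("D01", d01_pattern),
            ("COORD", r"[xX][+-]?[0-9]+"), ("INVALID", r".")]

IMPLICIT_BOUNDARY = rules_with(r"D01\b")  # assumed effect of the bare literal 'D01'
PARENTHESISED = rules_with(r"(D01)")      # assumed effect of ("D01")

print(tokenize(IMPLICIT_BOUNDARY, "D01 X45"))  # ['D01', 'COORD']
print(tokenize(IMPLICIT_BOUNDARY, "D01X45"))   # ['INVALID', 'INVALID', 'INVALID', 'COORD']
print(tokenize(PARENTHESISED, "D01X45"))       # ['D01', 'COORD']
```

In the failing case there is no word boundary between "1" and "X", so the D01 rule never matches and each of its characters falls through to the catch-all rule.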
