Parse further an expression in a special case - parsing

At the moment my frontend can parse ordinary expressions such as 123, "abcd", "=123", "=TRUE+123"... The relevant code is:
(* in `syntax.ml`: *)
and expression =
| E_integer of int
| E_string of string
(* in `parser.mly`: *)
expression:
| INTEGER { E_integer $1 }
| STRING { E_string $1 }
Now I would like to refine the parser so that, when we meet a string starting with =, we try to evaluate it as a formula rather than as a literal string. So syntax.ml becomes:
(* in `syntax.ml`: *)
and expression =
| E_integer of int
| E_string of string
| E_formula of formula
and formula =
| F_integer of int
| F_boolean of bool
| F_Add of formula * formula
The question is that I am not sure how to change parser.mly. I tried the following, which did not work (This expression has type string but an expression was expected of type Syntax.formula):
(* in `parser.mly`: *)
expression:
| INTEGER { E_integer $1 }
| STRING {
if String.sub $1 1 1 <> "="
then E_string $1
else E_formula (String.sub $1 2 ((String.length $1) - 1)) }
I don't know how to tell the parser that a string beginning with = needs to be parsed further according to the rules for formula... Could anyone help?
Following the comment of gasche:
I agree that I need to have a parser for formula. Now the question is whether I need a separate lexer.mll for formula. I hope not, because it is logical to lex the whole program only once, no? Also, can I add the formula grammar directly to the existing parser.mly?
In the current lexer.mll, I have:
let STRING = double_quote ([^ '\x0D' '\x0A' '\x22'])* double_quote
rule token = parse
| STRING as s { STRING s }
I think I can do something directly here:
let STRING = double_quote ([^ '\x0D' '\x0A' '\x22'])* double_quote
let FORMULA_STRING = double_quote = ([^ '\x0D' '\x0A' '\x22'])* double_quote
rule token = parse
| FORMULA_STRING as fs { XXXXX }
| STRING as s { STRING s }
I am not sure what I should write in place of XXXXX. Should it be Parser_formula.formula token fs, in the case where I have a separate parser_formula.mly? What if I have only parser.mly, which contains all the grammars including the one for formula?

The problem is with your line
else E_formula (String.sub $1 2 ((String.length $1) - 1))
Instead of (String.sub ...), which has type string, you should return a value of type Syntax.formula. If you had a parse_formula : string -> Syntax.formula function you could here write
else E_formula (parse_formula (String.sub $1 2 ((String.length $1) - 1)))
I think you could define such a function by defining the formula grammar as a separate parser first.
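To make the idea of a parse_formula : string -> Syntax.formula function concrete, here is a minimal hand-written sketch. It is not generated by ocamlyacc or menhir; the grammar it accepts (integers, TRUE, FALSE, joined by +, no whitespace) is an assumption based on the examples in the question, and the formula type is copied locally so the snippet stands alone.

```ocaml
(* Mirrors the formula type from the question's syntax.ml. *)
type formula =
  | F_integer of int
  | F_boolean of bool
  | F_Add of formula * formula

(* Hypothetical helper: parses the text that follows '=' in a formula string.
   Assumed grammar: formula ::= atom ('+' atom)*
                    atom    ::= digits | TRUE | FALSE *)
let parse_formula (s : string) : formula =
  let n = String.length s in
  let pos = ref 0 in
  let is_digit c = c >= '0' && c <= '9' in
  let is_upper c = c >= 'A' && c <= 'Z' in
  (* Consume the longest run of characters satisfying [p] and return it. *)
  let take_while p =
    let start = !pos in
    while !pos < n && p s.[!pos] do incr pos done;
    String.sub s start (!pos - start)
  in
  let atom () =
    if !pos < n && is_digit s.[!pos] then
      F_integer (int_of_string (take_while is_digit))
    else
      match take_while is_upper with
      | "TRUE" -> F_boolean true
      | "FALSE" -> F_boolean false
      | _ -> failwith "parse_formula: expected an integer, TRUE or FALSE"
  in
  (* Fold further "+ atom" pieces into a left-nested F_Add chain. *)
  let rec rest left =
    if !pos < n && s.[!pos] = '+' then begin
      incr pos;
      let right = atom () in
      rest (F_Add (left, right))
    end else left
  in
  let result = rest (atom ()) in
  if !pos <> n then failwith "parse_formula: trailing characters";
  result
```

With this in scope, parse_formula "TRUE+123" gives F_Add (F_boolean true, F_integer 123). If the STRING token still carries both surrounding quotes, as it does with the question's lexer, the text to pass in would be String.sub s 2 (String.length s - 3), which skips the opening quote and the = and drops the closing quote.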
Edit: following your own edit:
if you go the route of calling a different parser for formulas, you don't need to define a different lexer
if you choose to handle the distinction between strings and formulas at the lexer level (are you sure that's correct? what about real strings that begin with '='?), then you don't need to have a separate parser for formulas; you can have them as rules in your current grammar. But to do that you need your lexer to behave in a more fine-grained way on formulas: instead of just recognizing "=.*" as a single token, you should recognize "= as a beginning-of-formula, and lex the rest of the formula until you encounter the closing ". To avoid conflicts you may want to handle simple strings with a lexing rule rather than a simple regexp as well.
If you can get the second approach to work, I think it is indeed a simpler idea.
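As a rough illustration of that second approach, here is a hand-written tokenizer sketch (plain OCaml rather than ocamllex) that treats "= as a beginning-of-formula marker and then lexes the formula body token by token. The token names FORMULA_START and FORMULA_END are hypothetical, and the sketch deliberately ignores the open question of real strings that begin with '='.

```ocaml
type token =
  | STRING of string   (* ordinary string literal, quotes removed *)
  | FORMULA_START      (* the two characters "= *)
  | FORMULA_END        (* the closing " of a formula *)
  | INTEGER of int
  | BOOL of bool
  | PLUS

let tokenize (s : string) : token list =
  let n = String.length s in
  let is_digit c = c >= '0' && c <= '9' in
  let is_upper c = c >= 'A' && c <= 'Z' in
  (* First index >= i at which [p] stops holding. *)
  let rec span p i = if i < n && p s.[i] then span p (i + 1) else i in
  (* [in_formula] is true between "= and the closing quote. *)
  let rec go i in_formula acc =
    if i >= n then begin
      if in_formula then failwith "unterminated formula";
      List.rev acc
    end
    else
      match in_formula, s.[i] with
      | true, '"' -> go (i + 1) false (FORMULA_END :: acc)
      | true, '+' -> go (i + 1) true (PLUS :: acc)
      | true, c when is_upper c ->
        let j = span is_upper i in
        let tok =
          match String.sub s i (j - i) with
          | "TRUE" -> BOOL true
          | "FALSE" -> BOOL false
          | w -> failwith ("unknown word in formula: " ^ w)
        in
        go j true (tok :: acc)
      | _, c when is_digit c ->
        (* Integers are lexed the same way inside and outside formulas. *)
        let j = span is_digit i in
        go j in_formula (INTEGER (int_of_string (String.sub s i (j - i))) :: acc)
      | false, ' ' -> go (i + 1) false acc
      | false, '"' when i + 1 < n && s.[i + 1] = '=' ->
        go (i + 2) true (FORMULA_START :: acc)
      | false, '"' ->
        let j = span (fun c -> c <> '"') (i + 1) in
        if j >= n then failwith "unterminated string";
        go (j + 1) false (STRING (String.sub s (i + 1) (j - i - 1)) :: acc)
      | _, c -> failwith (Printf.sprintf "unexpected character %C" c)
  in
  go 0 false []
```

The grammar can then contain a rule along the lines of FORMULA_START formula FORMULA_END next to the existing STRING rule, so a single parser handles both cases. In ocamllex the same state switch would usually be written as a second lexing rule entered on "=.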
PS: please use menhir variable naming facilities instead of $1 as soon as the variables are not consecutive (because of intermediary terminals) or you need to repeat it more than once.

Continuing on @gasche's answer.
You want to include new syntactic rules in your parser, which means that you need to change the grammar rules in parser.mly to accommodate these new rules.
The String.sub approach is somewhat in the right direction, but you are actually doing by hand what the mly file could let you automate.
Consider your formula type: the F_Add constructor there lets you encode a binary sum formula, which therefore contains two formulas.
In the mly file, you could describe it as:
formula:
INTEGER { F_integer $1 }
| BOOL { F_boolean $1 }
| formula PLUS formula { F_Add ($1, $3) }
;
Note how the grammar rule definition mirrors the formula type definition.
As you can see, the recursive property of formulas is nicely handled by the grammar rule for you.
Concerning lexer.mll, the regular expressions STRING and FORMULA_STRING are exactly the same. If you use them both in the same lexer rule (as in your code snippet), it will not work as you expect it to. The lexer has no knowledge of what is going on in the parser; it cannot choose to provide a STRING or a FORMULA_STRING when it's convenient for the parser to fill a specific rule in. With ocamlyacc (and with the tools it drew inspiration from), it works the other way round: the parser receives tokens which the lexer has recognized from the text stream, and tries to find the rule which corresponds to them, according to what it has already figured out before.
Note that the BOOL terminal must be recognized by lexer.mll (just like INTEGER), so you will need to amend it with the proper regular expression.
Also, you should ask yourself the following questions:
in the =5 formula, isn't there somewhere an expression waiting to be discovered?
If so, could you reformulate the definition of a formula in terms of expressions and new tokens?

Related

Set a rule based on the value of a global variable

In my lexer & parser by ocamllex and ocamlyacc, I have a .mly as follows:
%{
open Params
open Syntax
%}
main:
| expr EOF { $1 }
expr:
| INTEGER { EE_integer $1 }
| LBRACKET expr_separators RBRACKET { EE_brackets (List.rev $2) }
expr_separators:
/* empty */ { [] }
| expr { [$1] }
| expr_separators ...... expr_separators { $3 :: $1 }
In params.ml, a variable separator is defined. Its value is either ; or , and set by the upstream system.
In the .mly, I want the rule of expr_separators to be defined based on the value of Params.separator. For example, when Params.separator is ;, only [1;2;3] is considered as expr, whereas [1,2,3] is not. When Params.separator is ,, only [1,2,3] is considered as expr, whereas [1;2;3] is not.
Does anyone know how to amend the lexer and parser to realize this?
PS:
The value of Params.separator is set before the parsing, it will not change during the parsing.
At the moment, in the lexer, , returns a token COMMA and ; returns SEMICOLON. In the parser, there are other rules where COMMA or SEMICOLON are involved.
I just want to set a rule expr_separators such that it considers ; and ignores , (which may be parsed by other rules), when Params.separator is ;; and it considers , and ignores ; (which may be parsed by other rules), when Params.separator is ,.
In some ways, this request is essentially the same as asking a macro preprocessor to alter its substitution at runtime, or a compiler to alter the type of a variable. As with the program itself, once the grammar has been compiled (whether into executable code or a parsing table), it's not possible to go back and modify it. At least, that's the case for most LR(k) parser generators, which produce deterministic parsers.
Moreover, it seems unlikely that the only difference the configuration parameter makes is the selection of a single separator token. If the non-selected separator token "may be parsed by other rules", then it may be parsed by those other rules when it is the selected separator token, unless the configuration setting also causes those other rules to be suppressed. So at a minimum, it seems like you'd be looking at something like:
expr : general_expr
expr_list : expr
%if separator is comma
expr : expr_using_semicolon
expr_list : expr_list ',' expr
%else
expr : expr_using_comma
expr_list : expr_list ';' expr
%endif
Without a more specific idea of what you're trying to achieve, the best suggestion I can provide is that you write two grammars and select which one to use at runtime, based on the configuration setting. Presumably the two grammars will be mostly similar, so you can probably use your own custom-written preprocessor to generate both of them from the same input text, which might look a bit like the above example. (You can use m4, which is a general-purpose macro processor, but you might feel the learning curve is too steep for such a simple application.)
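For a grammar as small as the one in the question, another way to get a runtime-selected separator is to step outside the parser generator altogether. The following is a minimal hand-written recursive-descent sketch, not ocamlyacc output: the separator is passed as a labelled argument standing in for Params.separator, whitespace is not handled, and the expr type is copied from the constructors used in the question.

```ocaml
type expr =
  | EE_integer of int
  | EE_brackets of expr list

(* [separator] plays the role of Params.separator; it is fixed for one parse. *)
let parse_expr ~(separator : char) (s : string) : expr =
  let n = String.length s in
  let pos = ref 0 in
  let peek () = if !pos < n then Some s.[!pos] else None in
  let rec expr () =
    match peek () with
    | Some '[' ->
      incr pos;
      let items = if peek () = Some ']' then [] else item_list [] in
      if peek () <> Some ']' then failwith "expected ]";
      incr pos;
      EE_brackets items
    | Some c when c >= '0' && c <= '9' ->
      let start = !pos in
      while !pos < n && s.[!pos] >= '0' && s.[!pos] <= '9' do incr pos done;
      EE_integer (int_of_string (String.sub s start (!pos - start)))
    | _ -> failwith "expected an integer or ["
  (* One or more expressions, separated only by the configured separator. *)
  and item_list acc =
    let acc = expr () :: acc in
    if peek () = Some separator then begin incr pos; item_list acc end
    else List.rev acc
  in
  let e = expr () in
  if !pos <> n then failwith "trailing input";
  e
```

With separator set to ';', the input [1;2;3] parses and [1,2,3] is rejected, and the reverse holds when it is ','. This only stays simple while the grammar is small; for a larger grammar, generating two grammars from one source as described above is likely the more maintainable route.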
Parser generators which produce general parsers have an easier time with run-time dynamic modifications; many such parser generators have mechanisms which can do that (although they are not necessarily efficient mechanisms). For example, the Bison tool can produce GLR parsers, in which case you can select or deselect specific rules using a predicate action. The OCaml GLR generator Dypgen allows sets of rules to be dynamically added to the grammar during the parse. (I've never used dypgen, but I keep on meaning to try it; it looks interesting.) And there are many others.
Having played around with dynamic parsing features in some GLR parsers, I can only say that my personal experience has been a bit mixed. Modifying grammars at run-time is a brittle technique; grammars tend not to be very easy to split into independent pieces, so modifying a grammar rule can have unexpected consequences in places you don't expect to be affected. You don't always know exactly what language your parser accepts, because the dynamic modifications can be hard to predict. And so on. My suggestion, if you try this technique, is to start with the simplest modification possible and put a lot more effort into grammar tests (which is always a good idea, anyway).

Preferring one alternative

An excerpt of my ANTLR v4 grammar looks like this:
expression:
| expression BINARY_OPERATOR expression
| unaryExpression
| nularExpression
;
unaryExpression:
ID expression
;
nularExpression:
ID
| NUMBER
| STRING
;
My goal is to match the language without knowing all the necessary keywords and therefore I'm simply matching keywords as IDs.
However there are binary operators that take an argument on both sides of the keyword (e.g. <arg1> keyword <arg2>) and therefore they need "special treatment". As you can see I already included this "special treatment" in the expression rule.
The actual problem now consists of the fact that some of these binary operators can be used as unary operators (=normal keywords) as well meaning that the left argument does not have to be specified.
The above grammar can't handle this case, because every time I tried to implement it I ended up with every binary operator being consumed as a unary operator.
Example:
Let's assume count is a binary operator.
Possible syntaxes are <arg1> count <arg2> and count <arg>
All my attempts to implement the above mentioned case ended up grouping myArgument count otherArgument like (myArgument (count (otherArgument) ) ) instead of (myArgument) count (otherArgument)
My brain tells me that the solution to this problem is to tell the parser always to take two arguments for a binary operator and, if this fails, to try to consume the binary operator as a unary one.
Does anybody know how to accomplish this?
How about something like this:
lower_precedence_expression
: ID higher_precedence_expression
| higher_precedence_expression
;
higher_precedence_expression
: higher_precedence_expression ID lower_precedence_expression
| ID
| NUMBER
| STRING
;
?

ANTLR4 Context-sensitive rule: unexpected parsing/resynchronization when failing semantic predicate

I'm working on a grammar that is context-sensitive. Here is its description:
It describes the set of expressions.
Each expression contains one or more parts separated by logical operator.
Each part consists of optional field identifier followed by some comparison operator (that is also optional) and the list of values.
Values are separated by logical operator as well.
By default value is a sequence of characters. Sometimes (depending on context) set of possible characters for each value can be extended. It even can consume comparison operator (that is used for separating of field identifiers from list of values, according to 3rd rule) to treat it as value's character.
Here's the simplified version of a grammar:
grammar TestGrammar;
@members {
boolean isValue = false;
}
exprSet: (expr NL?)+;
expr: expr log_op expr
| part
| '(' expr ')'
;
part: (fieldId comp_op)? values;
fieldId: STRNG;
values: values log_op values
| value
| '(' values ')'
;
value: strng;
strng: ( STRNG
| {isValue}? comp_op
)+;
log_op: '&' '&';
comp_op: '=';
NL: '\r'? '\n';
WS: ' ' -> channel(HIDDEN);
STRNG: CHR+;
CHR: [A-Za-z];
I'm using a semantic predicate in the strng rule. It should extend the set of possible tokens depending on the isValue variable.
The problem occurs when the semantic predicate evaluates to false. I expect that 2 STRNG tokens with an '=' token between them will be treated as a part node. Instead, the parser treats each STRNG token as a value, and throws out the '=' token when re-synchronizing.
Here's the input string and the resulting expression tree that is incorrect:
a && b=c
To get the correct expression tree, it is enough to remove the alternative with the semantic predicate from the strng rule (which makes the rule static and is therefore inappropriate for my solution):
strng: ( STRNG
// | {isValue}? comp_op
)+;
Here's resulting expression tree:
BTW, when semantic predicate evaluates to true - the result is as expected: strng rule matches an extended set of tokens:
strng: ( STRNG
| {!isValue}? comp_op
)+;
Please explain why this happens in such way, and help to find out correct solution. Thanks!
What about removing one option from values? Otherwise the text a && b may be either a
expr -> expr log_op expr
or
expr -> part -> values log_op values
.
It seems ANTLR resolves it by using the second option!
values
: //values log_op values
value
| '(' values ')'
;
I believe your expr rule is written in the wrong order. Try moving the binary expression to be the last alternative instead of the first.
Ok, I've realized that the current approach is inappropriate for my task.
I've chosen another approach based on overriding the Lexer's nextToken() and emit() methods, as described in ANTLR4: How to inject tokens.
It has given me almost full control over the stream of tokens. I got the following advantages:
assigning required types to tokens;
postponing the sending of tokens with a yet undefined type to the parser (by sending fake tokens on the hidden channel);
the possibility to split and merge tokens;
the possibility to organize postponed tokens into queues.
Having all these possibilities, I'm able to resolve all the ambiguities in the parser.
P.S. Thanks to everyone who tried to help, I appreciate it!

Overloading multiplication using menhir and OCaml

I have written a lexer and parser to analyze linear algebra statements. Each statement consists of one or more expressions followed by one or more declarations. I am using menhir and OCaml to write the lexer and parser.
For example:
Ax = b, where A is invertible.
This should be read as A * x = b, (A, invertible)
In an expression all ids must be either an uppercase or lowercase symbol. I would like to overload the multiplication operator so that the user does not have to type in the '*' symbol.
However, since the lexer also needs to be able to read strings (such as "invertible" in this case), the "Ax" portion of the expression is sent over to the parser as a string. This causes a parser error since no strings should be encountered in the expression portion of the statement.
Here is the basic idea of the grammar
stmt :=
| expr "."
| decl "."
| expr "," decl "."
expr :=
| term
| unop expr
| expr binop expr
term :=
| <int> num
| <char> id
| "(" expr ")"
decl :=
| id "is" kinds
kinds :=
| <string> kind
| kind "and" kinds
Is there some way to separate the individual characters and tell the parser that they should be treated as multiplication? Is there a way to change the lexer so that it is smart enough to know that all character clusters before a comma are ids and all clusters after should be treated as strings?
It seems to me you have two problems:
You want your lexer to treat sequences of characters differently in different places.
You want multiplication to be indicated by adjacent expressions (no operator in between).
The first problem I would tackle in the lexer.
One question is why you say you need to use strings. This implies that there is a completely open-ended set of things you can say. It might be true, but if you can limit yourself to a smallish number, you can use keywords rather than strings. E.g., invertible would be a keyword.
If you really want to allow any string at all in such places, it's definitely still possible to hack a lexer so that it maintains a state describing what it has seen, and looks ahead to see what's coming. If you're not required to adhere to a pre-defined grammar, you could adjust your grammar to make this easier. (E.g., you could use commas for only one purpose.)
For the second problem, I'd say you need to add adjacency to your grammar. I.e., your grammar needs a rule that says something like term := term term. I suspect it's tricky to get this to work correctly, but it does work in OCaml (where adjacent expressions represent function application) and in awk (where adjacent expressions represent string concatenation).
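A very small sketch of what adjacency-as-multiplication can look like, written by hand in OCaml rather than as a menhir rule. It assumes the lexer problem is already solved, so that in the expression part every letter is its own single-character id; the expr type and the function name parse_product are made up for this illustration.

```ocaml
type expr =
  | Id of char
  | Num of int
  | Mul of expr * expr

(* Parse a run of single-letter ids and integer literals, treating adjacency
   as left-associative multiplication: "Ax" becomes Mul (Id 'A', Id 'x'). *)
let parse_product (s : string) : expr =
  let n = String.length s in
  let is_digit c = c >= '0' && c <= '9' in
  let is_letter c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') in
  (* Parse one factor starting at index i; return it with the next index. *)
  let factor i =
    if i < n && is_letter s.[i] then (Id s.[i], i + 1)
    else if i < n && is_digit s.[i] then begin
      let j = ref i in
      while !j < n && is_digit s.[!j] do incr j done;
      (Num (int_of_string (String.sub s i (!j - i))), !j)
    end
    else failwith "expected an id or a number"
  in
  (* Keep multiplying by the next adjacent factor until input runs out. *)
  let rec loop left i =
    if i >= n then left
    else
      let right, j = factor i in
      loop (Mul (left, right)) j
  in
  let first, i = factor 0 in
  loop first i
```

In a generated parser the corresponding rule would need a precedence decision so that adjacency binds at the same level as an explicit '*', and that is where the conflicts mentioned above tend to show up.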

Left recursion, associativity and AST evaluation

So I have been reading a bit on lexers, parser, interpreters and even compiling.
For a language I'm trying to implement I settled on a Recursive Descent Parser. Since the original grammar of the language had left-recursion, I had to slightly rewrite it.
Here's a simplified version of the grammar I had (note that it's not any standard format grammar, but somewhat pseudo, I guess, it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
And then go through my tokens using the following sub-routines (simplified pseudo-code-ish):
public string Expression()
{
string term = ExpressionTerm();
if (term != null)
{
while (PeekToken() == OperatorToken)
{
term += ReadToken() + Expression();
}
}
return term;
}
public string ExpressionTerm()
{
//PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: If I would create AST nodes rather than a string in these subroutines, and evaluate the AST using an infix evaluator (which also keeps in mind associativity and precedence of operators, etcetera), won't I get the same result?
And if I do, then why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve or even a non-problem as it seems? Or is it really the structure of the resulting AST people are concerned about (rather than what it evaluates to)? Could anyone shed a light, I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
Sum
|
+---+---+
| |
Var Prod
| |
a +---+---+
| |
Var Const
| |
b 3
If your language obeys the usual conventions of mathematical notation, the AST for a+b*3 will look the same.
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
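To see why the shape matters even without mixed precedence, consider subtraction. The sketch below builds the tree for 8 - 3 - 2 in two ways: nested to the right, which is what a "term, operator, then recurse into Expression" routine naturally produces, and nested to the left, which is the usual mathematical reading. The type and function names are invented for this illustration.

```ocaml
type ast =
  | Const of int
  | Sub of ast * ast

let rec eval = function
  | Const n -> n
  | Sub (a, b) -> eval a - eval b

(* Right-nested tree, as produced by "term op Expression()" style recursion. *)
let rec build_right = function
  | [] -> invalid_arg "build_right: empty operand list"
  | [ x ] -> Const x
  | x :: rest -> Sub (Const x, build_right rest)

(* Left-nested tree, as produced by an accumulating loop over the operands. *)
let build_left = function
  | [] -> invalid_arg "build_left: empty operand list"
  | x :: rest -> List.fold_left (fun acc y -> Sub (acc, Const y)) (Const x) rest

let () =
  (* 8 - (3 - 2) versus (8 - 3) - 2 *)
  Printf.printf "right-nested: %d\n" (eval (build_right [ 8; 3; 2 ]));
  Printf.printf "left-nested:  %d\n" (eval (build_left [ 8; 3; 2 ]))
```

The two trees contain the same tokens in the same order, so printing them back gives identical text, yet they evaluate to 7 and 3 respectively.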
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse trees or abstract trees be evaluated directly. You can expand the rule hierarchy to include power and bitwise operators, or make it part of the hierarchy for logical expressions with and, or, and comparisons.
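As a sketch of how such a hierarchy maps onto code, here is a hand-written recursive-descent evaluator in OCaml with one function per rule of the grammar above. It computes the value directly instead of building a tree, does not skip whitespace, and the function name evaluate is just a placeholder.

```ocaml
let evaluate (s : string) : int =
  let n = String.length s in
  let pos = ref 0 in
  let peek () = if !pos < n then Some s.[!pos] else None in
  let integer () =
    let start = !pos in
    while !pos < n && s.[!pos] >= '0' && s.[!pos] <= '9' do incr pos done;
    if !pos = start then failwith "expected an integer";
    int_of_string (String.sub s start (!pos - start))
  in
  (* sum = addend '+' sum | addend *)
  let rec sum () =
    let left = addend () in
    if peek () = Some '+' then begin incr pos; left + sum () end else left
  (* addend = term '*' addend | term *)
  and addend () =
    let left = term () in
    if peek () = Some '*' then begin incr pos; left * addend () end else left
  (* term = '(' expre ')' | '-' integer | '+' integer | integer *)
  and term () =
    match peek () with
    | Some '(' ->
      incr pos;
      let v = sum () in
      if peek () <> Some ')' then failwith "expected )";
      incr pos;
      v
    | Some '-' ->
      incr pos;
      let v = integer () in
      -v
    | Some '+' ->
      incr pos;
      integer ()
    | _ -> integer ()
  in
  let v = sum () in
  if !pos <> n then failwith "trailing input";
  v
```

Because '*' is handled one level below '+', evaluate "1+2*3" multiplies first, and parentheses override that in evaluate "(1+2)*3". Both rules are right-recursive, which is harmless for + and * but would give the wrong grouping for - or /, so those would need a loop or an accumulator instead.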

Resources