Parser combinators and left-recursion

It is well known that the top-down parsing paradigm cannot deal with left recursion: the grammar must either be refactored to remove the left recursion, or some other paradigm must be used. I've been working on a parser combinator library, and since I'm doing this in a language that allows global side effects, it struck me that I could use a global store to track which rules have fired and which have not. This scheme of guarding on certain conditions lets me handle very simple cases of left recursion at the cost of some extra annotation on the combinators. Here's an example grammar in TypeScript:
var expression = Parser.delay(_ => TestGrammar.expression);
var in_plus = false;

class TestGrammar {
    static terminal = Parser.m(x => 'a' === x);
    static op = Parser.m(x => '+' === x);
    static plus = expression.on_success(_ => {
        in_plus = false;
    }).then(TestGrammar.op).then(expression).on_enter(_ => {
        in_plus = true;
    }).guard(_ => !in_plus);
    static expression = TestGrammar.plus.or(TestGrammar.terminal);
}
console.log(TestGrammar.expression.parse_input('a+a+a+a'));
The idea is pretty simple: in cases where we might get stuck in a loop, the rules are amended with guards, as with plus in the above example. The rule fails if we hit a looping condition, and the guard is lifted as soon as we make progress.
What I'd like to know is whether this idea has been explored and analysed. I'd rather not go down this rabbit hole and try to figure it out myself if it's a dead end.

Have a look at GLL algorithms
(e.g. https://github.com/djspiewak/gll-combinators).
They can handle ambiguous and left-recursive grammars efficiently.
Rather than directly calling the parse function of a sub-parser, they keep a 'todo' list of (Parser, Position) tuples (called the trampoline).
This way the endless loop (a parser recursing into itself) is avoided, because no tuple is added twice.


Is there an option<> computation expression anywhere?

Am I being stupid?
Some things look nicer in a monad/computation expression. I have lots of code using the seq computation expression, but every time I hit an option<> I have to revert to Option.map etc.
Slightly jarring (when I was doing this in C# I wrote a LINQ operator for an IMaybe<> type and it all looked nice and consistent).
I can write one, but don't especially want to; there must be one (or more) out there. Which one do people use?
i.e.
so rather than
let postCodeMaybe =
    personMaybe
    |> Option.bind (fun p -> p.AddressMaybe())
    |> Option.map (fun a -> a.Postcode())
we can go
let postCodeMaybe =
    option {
        for p in personMaybe do
        for a in p.AddressMaybe() do
        a.Postcode()
    }
I have no problem with the former code, except that in my context it sits among lots of seq computation expressions that look like the latter (and some developers who will look at this code will come from a C#/LINQ background, which is basically the latter).
There is no option CE in FSharp.Core but it exists in some libraries.
I think one of the reasons why many CEs are not provided in FSharp.Core is that there are many different ways of doing them: if you google for an option computation expression builder you'll find some are strict, others lazy; some support side effects, others support blending multiple return values into one.
Regarding libraries, there is one in F#x called maybe. Then you have F#+, which provides generic computation expressions, so you get something like LINQ: as long as the type implements some methods, the CE is ready to use. For options you can use let x:option<_> = monad' { ... or let x:option<_> = monad.plus' { ... if you prefer the one that allows multiple return values.
Your code will look like:
let x: option<_> = option {
    let! p = personMaybe
    let! a = p.AddressMaybe()
    return a }
Where option is your option builder, could be one of the above mentioned.
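For comparison, here is the same bind/map pipeline sketched in TypeScript, which may be closer to what C#/LINQ developers are used to; the Option type and the person/address shapes are invented for illustration.

```typescript
// A hand-rolled Option type with bind/map, mirroring F#'s Option module.
type Option<T> = { kind: "some"; value: T } | { kind: "none" };

const some = <T>(value: T): Option<T> => ({ kind: "some", value });

const bind = <T, U>(o: Option<T>, f: (t: T) => Option<U>): Option<U> =>
  o.kind === "some" ? f(o.value) : { kind: "none" };

const map = <T, U>(o: Option<T>, f: (t: T) => U): Option<U> =>
  o.kind === "some" ? some(f(o.value)) : { kind: "none" };

// Illustrative stand-ins for personMaybe / AddressMaybe / Postcode.
type Address = { postcode: string };
type Person = { addressMaybe: () => Option<Address> };

const personMaybe: Option<Person> =
  some({ addressMaybe: () => some({ postcode: "AB1 2CD" }) });

// The same pipeline as the F# version: bind into the address, map to the postcode.
const postCodeMaybe = map(bind(personMaybe, p => p.addressMaybe()), a => a.postcode);
```

A computation expression (or C#'s LINQ query syntax) is essentially sugar over exactly these bind/map calls.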
We use FsToolkit.ErrorHandling. It is simple, actively maintained and works well.
It has CEs for Option, Result, ResultOption, Validation, AsyncResult, AsyncResultOption.
https://www.nuget.org/packages/FsToolkit.ErrorHandling/
https://github.com/demystifyfp/FsToolkit.ErrorHandling

How does the data structure for a lexical analysis look?

I know the lexical analyser tokenizes the input and stores the tokens in a stream, or at least that is what I understood. Unfortunately, nearly all articles I have read only talk about lexing simple expressions. What I am interested in is how to tokenize something like:
if (fooBar > 5) {
    for (var i = 0; i < alot.length; i++) {
        fooBar += 2 + i;
    }
}
Please note that this is pseudo code.
Question: I would like to know what the data structure looks like for tokens created by the lexer. I really have no idea for the example I gave above, where code is nested. An example would be nice.
First of all, tokens are not necessarily stored. Some compilers do store the tokens in a table or other data structure, but for a simple compiler (if there is such a thing) it's sufficient in most cases for the lexer to return the type of the next token to be parsed; in some cases the parser might then ask the lexer for the actual text that the token is made up of.
If we use your sample code,
if (fooBar > 5) {
    for (var i = 0; i < alot.length; i++) {
        fooBar += 2 + i;
    }
}
The type of the first token in this sample might be defined as TOK_IF corresponding to the "if" keyword. The next token might be TOK_LPAREN, then TOK_IDENT, then TOK_GREATER, then TOK_INT_LITERAL, and so on. What exactly the types should be is defined by you as the author of the lexer (or tokenizer) code. (Note that there are about a million different tools to help you avoid the somewhat tedious task of coming up with these details by hand.)
Except for TOK_IDENT and TOK_INT_LITERAL the tokens we've seen so far are defined entirely by their type. For these two, we would need to be able to ask the lexer for the underlying text so that we can evaluate the value of the token.
So a tiny excerpt of the parser dealing with an IF statement in pseudo-code might look something like:
...
switch (lexer.GetNextTokenType())
{
case TOK_IF:
    {
        // "if" statement
        if (lexer.GetNextTokenType() != TOK_LPAREN)
            throw SyntaxError('( expected');
        ParseRelationalExpression(lexer);
        if (lexer.GetNextTokenType() != TOK_RPAREN)
            throw SyntaxError(') expected');
        ...
and so on.
If the compiler does choose to actually store the tokens for later reference (and some compilers do, e.g. to allow for more efficient backtracking), one way would be to use a structure similar to the following:
struct {
    int TokenType;
    char* TokenStart;
    int TokenLength;
};
The container for these might be a linked list or std::vector (assuming C++).
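As a concrete (if simplified) illustration, here is what such a stored token stream might look like in TypeScript. Note that the output is flat: the nesting in the sample never shows up in the token list; that structure is recovered later by the parser. The TokType names and the tokenize function are invented for this sketch and only cover the characters appearing in the sample.

```typescript
// Each token records its type plus the slice of source it covers,
// mirroring the TokenType/TokenStart/TokenLength struct above.
enum TokType { If, LParen, RParen, LBrace, Ident, IntLiteral, Greater }

interface Token {
  type: TokType;
  start: number;  // offset into the source
  length: number; // so the underlying text can be recovered on demand
}

function tokenize(src: string): Token[] {
  const tokens: Token[] = [];
  const single: Record<string, TokType> = {
    "(": TokType.LParen, ")": TokType.RParen,
    "{": TokType.LBrace, ">": TokType.Greater,
  };
  let i = 0;
  while (i < src.length) {
    const c = src[i];
    if (/\s/.test(c)) { i++; continue; }
    if (c in single) { tokens.push({ type: single[c], start: i, length: 1 }); i++; continue; }
    if (/[0-9]/.test(c)) {
      let j = i; while (j < src.length && /[0-9]/.test(src[j])) j++;
      tokens.push({ type: TokType.IntLiteral, start: i, length: j - i }); i = j; continue;
    }
    if (/[A-Za-z_]/.test(c)) {
      let j = i; while (j < src.length && /[A-Za-z0-9_]/.test(src[j])) j++;
      const text = src.slice(i, j);
      tokens.push({ type: text === "if" ? TokType.If : TokType.Ident, start: i, length: j - i });
      i = j; continue;
    }
    throw new Error(`unexpected character '${c}' at ${i}`);
  }
  return tokens;
}

const src = "if (fooBar > 5) {";
// Flat sequence: If, LParen, Ident("fooBar"), Greater, IntLiteral("5"), RParen, LBrace
const toks = tokenize(src);
console.log(toks.map(tk => `${TokType[tk.type]}:${src.slice(tk.start, tk.start + tk.length)}`).join(" "));
```

Storing offsets rather than copied strings is the common choice: it keeps tokens small and lets the parser fetch the text only for the token types that need it (identifiers and literals).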

How to get a location of a number value?

Suppose I want to print all locations with a hard-coded value, in a context where d is an M3 Declaration:
top-down visit(d) {
    case \number(str numberValue) :
        println("hardcode value <numberValue> encountered at <d#\src>");
}
The problem is that the location <d#\src> is too generic: it yields the entire declaration (such as the whole procedure). The expression <numberValue#\src> seems more appropriate, but it is not allowed, probably because we are too low in the parse tree.
So, the question is how to get a parent E (the closest parent, like the expression itself) of <numberValue> such that <E#\src> is defined?
One option is to add an extra level in the top-down traversal:
top-down visit(d) {
    case \expressionStatement(Expression stmt): {
        case \number(str numberValue) :
            println("hardcode value <numberValue> encountered at <stmt#\src>");
    }
}
This works, but has some drawbacks:
It is ugly: in order to cover all cases, we have to add many more variants;
It is very inefficient.
So, what is the proper way to get the location of a numberValue (and similar low-level constructs like stringValue's, etc)?
You should be able to do the following:
top-down visit(d) {
    case n:\number(str numberValue) :
        println("hardcode value <numberValue> encountered at <n#\src>");
}
Saying n:\number(str numberValue) will bind the name n to the \number node that the case matched. You can then use this in your message to get the location for n. Just make sure that n isn't already in scope. You should be able to create similar patterns for the other scenarios you mentioned as well.

Pass values as arguments to rules

When implementing real world (TM) languages, I often encounter a situation like this:
(* language Foo *)
type A = ...  (* parsed by parse_A *)
type B = ...  (* parsed by parse_B *)
type collection = { as : A list ; bs : B list }

(* parser ParseFoo.mly *)
parseA : ... { A ( ... ) }
parseB : ... { B ( ... ) }
parseCollectionElement : parseA { .. } | parseB { .. }
parseCollection : nonempty_list (parseCollectionElement) { ... }
Obviously (in functional style), it would be best to pass the partially parsed collection record to each invocation of the semantic actions of parseA and parseB and update the list elements accordingly.
Is that even possible using menhir, or does one have to use the ugly hack of using a mutable global variable?
Well, you're quite limited in what you're allowed to do in menhir/ocamlyacc semantic actions. If you find this really frustrating, you can try parsec-like parsers, e.g. mparser, that allow you to use OCaml in your rules to its full extent.
My personal approach to this kind of problem is to stay at the most primitive level in the parser, without trying to define anything sophisticated, and to lift the parser output to a higher level later.
But your case looks simple enough to me. Here, instead of using a parametrized menhir rule, you can just write a list rule by hand and produce a collection in its semantic action. nonempty_list is syntactic sugar that, like any sugar, works in most cases but is less general.

Design alternatives to extending object with interface

While working through Expert F# again, I decided to implement the application for manipulating algebraic expressions. This went well, and now I've decided, as a next exercise, to expand on that by building a more advanced application.
My first idea was to have a setup that allows for a more extensible way of creating functions without having to recompile. To that end I have something like:
type IFunction =
    abstract member Name : string with get
    /// additional members omitted
type Expr =
    | Num of decimal
    | Var of string
    /// ... omitting some types here that don't matter
    | FunctionApplication of IFunction * Expr list
So that, say, Sin(x) could be represented as:
let sin = { new IFunction with member x.Name = "SIN" }
let sinExpr = FunctionApplication(sin, [Var("x")])
So far all good, but the next idea I would like to implement is having additional interfaces to represent properties of functions. E.g.:
type IDifferentiable =
    abstract member Derivative : int -> IFunction  // Get the derivative w.r.t. a variable index
One of the things I'm trying to achieve here is that I implement some functions and all the logic for them, and then move on to the next part of the logic I would like to implement. However, as it currently stands, that means that with every interface I add, I have to revisit all the IFunctions that I've implemented. Instead, I'd rather have a function:
let makeDifferentiable (f : IFunction) (deriv : int -> IFunction) =
    { f with
        interface IDifferentiable with
            member x.Derivative = deriv }
but as discussed in this question, that is not possible. The alternative that is possible doesn't meet my extensibility requirement. My question is: what alternatives would work well?
[EDIT] I was asked to expand on the "doesn't meet my extensibility requirement" comment. The way this function would work is by doing something like:
let makeDifferentiable (deriv : int -> IFunction) (f : IFunction) =
    { new IFunction with
        member x.Name = f.Name
      interface IDifferentiable with
        member x.Derivative = deriv }
However, ideally I would keep adding additional interfaces to an object as I go. So if I now wanted to add an interface that tells whether a function is even:
type IsEven =
    abstract member IsEven : bool with get
then I would like to be able (but not be obliged: if I don't make this change, everything should still compile) to change my definition of a sine from
let sin = { new IFunction with ... } >> (makeDifferentiable ...)
to
let sin = { new IFunction with ... } >> (makeDifferentiable ...) >> (makeEven false)
The result would be that I could create an object that implements the IFunction interface as well as, potentially but not necessarily, a number of other interfaces. The operations I'd then define on them would be able to optimize what they are doing based on whether or not a certain function implements an interface. This would also allow me to add additional features/interfaces/operations first, without having to change the functions I've already defined (though they wouldn't take advantage of the additional features, things wouldn't be broken either). [/EDIT]
The only thing I can think of right now is to create a dictionary for each feature that I'd like to implement, with function names as keys and the details to build an interface on the fly, e.g. along the lines of:
let derivative (f : IFunction) =
    match derivativeDictionary.TryGetValue(f.Name) with
    | false, _ -> None
    | true, d -> Some d.Derivative
This would require me to create one such function per feature, in addition to one dictionary per feature. Especially if implemented asynchronously with agents, this might not be that slow, but it still feels a little clunky.
I think the problem that you're trying to solve here is what is called the Expression Problem. You're essentially trying to write code that would be extensible in two directions. Discriminated unions and the object-oriented model give you one or the other:
Discriminated unions make it easy to add new operations (just write a function with pattern matching), but it is hard to add a new kind of expression (you have to extend the DU and modify all code that uses it).
Interfaces make it easy to add new kinds of expressions (just implement the interface), but it is hard to add new operations (you have to modify the interface and change all code that creates it).
In general, I don't think it is all that useful to try to come up with solutions that let you do both (they end up being terribly complicated), so my advice is to pick the one that you'll need more often.
Going back to your problem, I'd probably represent the function just as a function name together with the parameters:
type Expr =
    | Num of decimal
    | Var of string
    | Application of string * Expr list
Really - an expression is just this. The fact that you can take derivatives is another part of the problem you're solving. Now, to make the derivative extensible, you can just keep a dictionary of the derivatives:
let derivatives =
    dict [ "sin", (fun [arg] -> Application("cos", [arg]))
           ... ]
This way, you have an Expr type that really models just what an expression is and you can write differentiation function that will look for the derivatives in the dictionary.
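The dictionary-of-derivatives design translates to other languages too. Here is a rough TypeScript rendering for illustration (the type names, the `app` helper, and the chain-rule-free `derivativeOf` are all simplifications invented for this sketch):

```typescript
// Expressions carry only a function *name*; derivative rules live in a
// separate, extensible table, so adding a rule never touches Expr.
type Expr =
  | { kind: "num"; value: number }
  | { kind: "var"; name: string }
  | { kind: "app"; fn: string; args: Expr[] };

const app = (fn: string, args: Expr[]): Expr => ({ kind: "app", fn, args });

// New derivative rules can be registered here without changing any type.
const derivatives = new Map<string, (args: Expr[]) => Expr>([
  ["sin", ([arg]) => app("cos", [arg])],
]);

// Look up the rule by function name; undefined if no rule is registered.
// (A real differentiator would also apply the chain rule to the argument.)
function derivativeOf(e: Expr): Expr | undefined {
  if (e.kind !== "app") return undefined;
  return derivatives.get(e.fn)?.(e.args);
}

const sinX = app("sin", [{ kind: "var", name: "x" }]);
console.log(derivativeOf(sinX)); // the "cos" application wrapping the original argument
```

This is the "discriminated union" side of the trade-off the answer describes: adding a new operation (derivatives, evenness, ...) is just a new table plus a lookup function, while adding a new expression kind would require touching every such function.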
