Pass values as arguments to rules - parsing

When implementing real-world (TM) languages, I often encounter a situation like this:
(* language Foo *)
type A = ... (* parsed by parse_A *)
type B = ... (* parsed by parse_B *)
type collection = { as : A list ; bs : B list }

(* parser ParseFoo.mly *)
parseA : ... { A ( ... ) }
parseB : ... { B ( ... ) }
parseCollectionElement : parseA { ... } | parseB { ... }
parseCollection : nonempty_list(parseCollectionElement) { ... }
Obviously (in functional style), it would be best to pass the partially parsed collection record to each invocation of the semantic actions of parseA and parseB and update the list elements accordingly.
Is that even possible using Menhir, or does one have to resort to the ugly hack of a mutable global variable?

Well, you're quite limited in what you're allowed to do in menhir/ocamlyacc semantic actions. If you find this really frustrating, you can try parsec-like parsers, e.g. mparser, which allow you to use OCaml in your rules to its full extent.
My personal approach to this kind of problem is to stay at the most primitive level in the parser, without trying to define anything sophisticated, and to lift the parser output to a higher level later.
But your case looks simple enough to me. Here, instead of using a parameterized Menhir rule, you can just write the list rule by hand and produce a collection in its semantic action (see the sketch below). nonempty_list is syntactic sugar that, like any sugar, works in most cases but is less flexible in general.
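For instance, a hand-written version of the list rule could look like this (a sketch, not tested against the question's actual grammar: it assumes parseCollectionElement produces values of the two-constructor element type from the question, and renames the record field as to as_, since as is a reserved word in OCaml; the accumulated lists come out reversed, so apply List.rev at the end if order matters):

parseCollection:
| e = parseCollectionElement
    { add_element e { as_ = [] ; bs = [] } }
| c = parseCollection ; e = parseCollectionElement
    { add_element e c }

where add_element is an ordinary OCaml helper that threads the partial collection through, exactly as the question asks:

let add_element e c =
  match e with
  | A a -> { c with as_ = a :: c.as_ }
  | B b -> { c with bs = b :: c.bs }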

Related

FsCheck: Override generator for a type, but only in the context of a single parent generator

I seem to often run into cases where I want to generate some complex structure, but with a special variation where a member type is generated differently.
For example, consider this tree type:
type Tree<'LeafData,'INodeData> =
| LeafNode of 'LeafData
| InternalNode of 'INodeData * Tree<'LeafData,'INodeData> list
I want to generate cases like
No internal node is childless
There are no leaf-type nodes
Only a limited subset of leaf types are used
These are simple to do if I override all generation of a corresponding child type.
The problem is that it seems register is inherently a thread-level action, and there is no gen-local alternative.
For example, what I want could look like
let limitedLeafs =
    gen {
        let leafGen = Arb.generate<LeafType> |> Gen.filter isAllowedLeaf
        do! registerContextualArb (leafGen |> Arb.fromGen)
        return! Arb.generate<Tree<NodeType, LeafType>>
    }
This Tree example specifically can be worked around with some creative type shuffling, but that's not always possible.
It's also possible to use some sort of recursive map that enforces assumptions, but that seems relatively complex if the above is possible. I might be misunderstanding the nature of FsCheck generators though.
Does anyone know how to accomplish this kind of gen-local override?
There are a few options here - I'm assuming you're on FsCheck 2.x, but keep scrolling for an option in FsCheck 3.
The first is the most natural one but is more work: break down the generator explicitly to the level you need, and then put humpty dumpty together again. I.e. don't rely on the type-based generator derivation so much - if I understand your example correctly, that would mean implementing a recursive generator for the tree, relying on Arb.generate<LeafType> for the generic types.
Second option - Config has an Arbitrary field which you can use to override Arbitrary instances. These overrides take effect even when the overridden types are part of the automatically generated ones. So, as a sketch, you could try:
Check.One ({ Config.Quick with Arbitrary = [| typeof<MyLeafArbitrary> |] }) (fun safeTree -> ...)
There's a more extensive example which uses FsCheck.Xunit's PropertyAttribute, but the principle is the same: set it on the Config instead.
Final option! :) In FsCheck 3 (prerelease) you can configure this via a new (as yet undocumented) concept, ArbMap, which makes the map from type to Arbitrary instance explicit, instead of the static global nonsense in 2.x (my bad, of course; it seemed like a good idea at the time). The implementation is here, which may not tell you all that much - the idea is that you put together an ArbMap instance that contains your "safe" generators for the subparts, then you ArbMap.mergeWith that safe map with ArbMap.defaults (thus overriding the default generators with your safe ones in the resulting ArbMap), and then you use ArbMap.arbitrary or ArbMap.generate with the resulting map.
Sorry for the long-winded explanation - but all in all that should give you the best of both worlds: you can reuse the generic union type generator in FsCheck, while surgically overriding certain types in that context.
FsCheck guidance on this is:
To define a generator that generates a subset of the normal range of values for an existing type, say all the even ints, it makes properties more readable if you define a single-case union case, and register a generator for the new type:
As an example, they suggest you could define arbitrary even integers like this:
type EvenInt = EvenInt of int with
    static member op_Explicit(EvenInt i) = i

type ArbitraryModifiers =
    static member EvenInt() =
        Arb.from<int>
        |> Arb.filter (fun i -> i % 2 = 0)
        |> Arb.convert EvenInt int
Arb.register<ArbitraryModifiers>() |> ignore
You could then generate and test trees whose leaves are even integers like this:
let ``leaves are even`` (tree : Tree<EvenInt, string>) =
    let rec leaves = function
        | LeafNode leaf -> [leaf]
        | InternalNode (_, children) ->
            children |> List.collect leaves
    leaves tree
    |> Seq.forall (fun (EvenInt leaf) -> leaf % 2 = 0)
Check.Quick ``leaves are even`` // output: Ok, passed 100 tests.
To be honest, I like your idea of a "gen-local override" better, but I don't think FsCheck supports it.

Are there use cases for single-case variants in OCaml?

I've been reading F# articles and they use single-case variants to create distinct incompatible types. However, in OCaml I can use private module types or abstract types to create distinct types. Is it common in OCaml to use single-case variants like in F# or Haskell?
Another specialized use case for a single-constructor variant is to erase some type information with a GADT (and an existential quantification).
For instance, in
type showable = Show: 'a * ('a -> string) -> showable
let show (Show (x,f)) = f x
let showables = [ Show (0,string_of_int); Show("string", Fun.id) ]
The constructor Show pairs an element of a given type with a printing function, then forgets the concrete type of the element. This makes it possible to have a list of showable elements, even if each element has a different concrete type.
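For example, mapping show over that list recovers a plain list of strings:

List.map show showables
(* - : string list = ["0"; "string"] *)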
For what it's worth it seems to me this wasn't particularly common in OCaml in the past.
I've been reluctant to do this myself because it has always cost something: the representation of type t = T of int was always bigger than just the representation of an int.
However, for a few years now it has been possible to declare types as unboxed, which removes this obstacle:
type t = T of int [@@unboxed]
As a result I've personally been using single-constructor types much more frequently recently. There are many advantages. For me the main one is that I can have a distinct type that's independent of whether its representation happens to be the same as another type's.
You can of course use modules to get this effect, as you say. But that is a fairly heavy solution.
(All of this is just my opinion naturally.)
Yet another case for single-constructor types (although it does not quite match your initial question of creating distinct types): fancy records. (By contrast with other answers, this is more a syntactic convenience than a fundamental feature.)
Indeed, using a relatively recent feature (introduced with OCaml 4.03, in 2016) which allows writing constructor arguments with a record syntax (including mutable fields!), you can prefix regular records with a constructor name, Coq-style.
type t = MakeT of {
    mutable x : int ;
    mutable y : string ;
}
let some_t = MakeT { x = 4 ; y = "tea" }
(* val some_t : t = MakeT {x = 4; y = "tea"} *)
It does not change anything at runtime (just like Constr (a,b) has the same representation as (a,b), provided Constr is the only constructor of its type). The constructor makes the code a bit more explicit to the human eye, and it also provides the type information required to disambiguate field names, thus avoiding the need for type annotations. It is similar in function to the usual module trick, but more systematic.
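For instance (an illustrative sketch, not part of the original answer), two record types can share field names, and the constructor alone tells the type-checker which fields are meant:

type point = P of { x : int ; y : int }
type size = S of { x : int ; y : int }

let p = P { x = 1 ; y = 2 } (* the constructor P selects the right record type *)
let s = S { x = 3 ; y = 4 } (* no type annotation needed in either case *)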
Patterns work just the same:
let (MakeT { x ; y }) = some_t
(* val x : int = 4 *)
(* val y : string = "tea" *)
You can also access the “contained” record (at no runtime cost), and read and modify its fields. This contained record, however, is not a first-class value: you cannot store it, pass it to a function, or return it.
let (MakeT fields) = some_t in fields.x (* returns 4 *)
let (MakeT fields) = some_t in fields.x <- 42
(* some_t is now MakeT {x = 42; y = "tea"} *)
let (MakeT fields) = some_t in fields
(* ^^^^^^
Error: This form is not allowed as the type of the inlined record could escape. *)
Another use case of single-constructor (polymorphic) variants is documenting something to the caller of a function. For instance, perhaps there's a caveat with the value that your function returns:
val create : unit -> [ `Must_call_close of t ]
Using a variant forces the caller of your function to pattern-match on this variant in their code:
let (`Must_call_close t) = create () in (* ... *)
This makes it more likely that they'll pay attention to the message in the variant, as opposed to documentation in an .mli file that could get missed.
For this use case, polymorphic variants are a bit easier to work with as you don't need to define an intermediate type for the variant.

Parsing procedure calls for a toy language

I have a certain toy language that defines, amongst others, procedures and procedure calls, using EBNF syntax:
program = procedure, {procedure} ;
procedure = "procedure", NAME, bracedBlock ;
bracedBlock = "{" , statementlist , "}" ;
statementlist = statement, { statement } ;
statement = define | if | while | call ; // others omitted for brevity
define = NAME, "=", expression, ";" ;
if = "if", conditionalblock, "then", bracedBlock, "else", bracedBlock ;
call = "call" , NAME, ";" ;
// other definitions omitted for brevity
A tokeniser for a program in this language has been implemented, and returns a vector of tokens.
Now, parsing said program without the procedure calls is fairly straightforward: one can define a recursive descent parser using the above grammar directly, and simply parse through the tokens. Some further notes:
Each procedure may call any other procedure except itself, directly or indirectly (i.e. no recursion), and these need not necessarily be in the order of appearance in the source code (i.e. B may be defined after A, and A may call B, or vice versa).
Procedure names need to be unique, and 'reserved keywords' may be used as variable/procedure names.
Whitespace does not matter, at least amongst tokens of different type: similar to C/C++.
There is no scoping rule: all variables are global.
The concept of a 'line number' is important: each statement has one or more line numbers associated with it: define statements have only 1 line number each, for instance, whereas an if statement, which is itself a parent of two statement lists, has multiple line numbers. For instance:
LN     CODE
       procedure A {
1.       a = 5;
2.       b = 7;
3.       c = 3;
4. 5.    if (b < c) then { call C; } else {
6.         call B;
         }
       }
       procedure B {
7.       d = 5;
8.       while (d > 2) {
9.         d = d + 1; }
       }
       procedure C {
10.      e = 10;
11.      f = 8;
12.      call B;
       }
Line numbers are continuous throughout the program; only procedure definitions and the else keyword aren't assigned line numbers. The line numbers are defined by grammar, rather than their position in source code: for instance, consider 'lines' 4 and 5.
There are some relationships that need to be set in a database given each statement and its line number, variables used, variables set, and child containers. This is a key consideration.
My question is therefore this: how can I parse these function calls, maintain the integrity of the line numbers, and set the relationships?
I have considered the 'OS' way of doing things: upon encountering a procedure call, look ahead for a procedure that matches the called procedure, parse the callee, and unroll the call stack back to the caller. However, this ruins the line-number ordering: if the above program were parsed this way, C would get line numbers 6 to 8 inclusive, rather than 10 to 12 inclusive.
Another solution is to parse the entire program once in order, maintain a toposort of procedure calls, and then parse a second time by following said toposort. This is problematic because of implementation details.
Is there a possibly better way to do this?
It's always tempting to try to completely process a program text in a single on-line pass. Unfortunately, it is practically never the simplest solution. Trying to do everything at once in a linear progression results in a kind of spaghetti of intertwined computations, and making it all work almost always involves unnecessary restrictions on the language which will later prove to be unfortunate.
So I'd encourage you to reconsider some of your design decisions. If you use the parser just to build up some kind of structural representation of the program -- whether it's an abstract syntax tree or a vector of three-address code, or some other alternative -- and then do further processing in a series of single-purpose passes over that structural representation, you'll likely find that the code is:
much simpler, because computations don't have to be intermingled;
more general, because each pass can be done in the most convenient order rather than restricting inputs to fit a linear ordering;
more readable and more maintainable.
Persisting data structures over multiple passes might increase storage requirements slightly. But the structures are unlikely to occupy enough storage that this will be noticeable. And it probably will not increase the computation time; indeed, it might even reduce the time because the individual passes are simpler and easier to optimise.
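To make the suggested structure concrete, here is a minimal OCaml sketch (the question is language-agnostic; the AST and names below are illustrative, and each statement takes a single line number for brevity, whereas the real rule gives an if two):

type stmt =
  | Define of string * string             (* target, expression text *)
  | If of string * stmt list * stmt list  (* condition, then, else *)
  | While of string * stmt list
  | Call of string

type proc = { pname : string ; body : stmt list }

(* Pass 1: number statements in grammar order; procedure headers
   and "else" consume no line number. *)
let number_program procs =
  let counter = ref 0 in
  let fresh () = incr counter ; !counter in
  let rec go acc stmts =
    List.fold_left
      (fun acc s ->
        let acc = (fresh (), s) :: acc in
        match s with
        | If (_, t, e) -> go (go acc t) e
        | While (_, b) -> go acc b
        | Define _ | Call _ -> acc)
      acc stmts
  in
  List.rev (List.fold_left (fun acc p -> go acc p.body) [] procs)

(* Pass 2: collect caller -> callee edges, entirely independent of
   pass 1; a further pass could fill in the database relationships. *)
let call_edges procs =
  let rec calls s =
    match s with
    | Call callee -> [ callee ]
    | If (_, t, e) -> List.concat_map calls (t @ e)
    | While (_, b) -> List.concat_map calls b
    | Define _ -> []
  in
  List.concat_map
    (fun p -> List.map (fun c -> (p.pname, c)) (List.concat_map calls p.body))
    procs

Because each pass walks the same persisted structure, the call edges are computed from each procedure's own body, so C's statements keep line numbers 10 to 12 regardless of where the calls to it appear.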

Parser combinators and left-recursion

So it is well known that the top-down parsing paradigm cannot deal with left-recursion. The grammar must either be refactored to get rid of the left-recursion, or some other paradigm must be used. I've been working on a parser combinator library, and since I'm doing this in a language that allows global side effects, it struck me that I can use a global store to track which rules have fired and which have not. This scheme of guarding on certain conditions lets me deal with very simple cases of left-recursion, at the cost of some extra annotation on the combinators. Here's an example grammar in TypeScript:
var expression = Parser.delay(_ => TestGrammar.expression);
var in_plus = false;

class TestGrammar {
  static terminal = Parser.m(x => 'a' === x);
  static op = Parser.m(x => '+' === x);
  static plus = expression.on_success(_ => {
    in_plus = false;
  }).then(TestGrammar.op).then(expression).on_enter(_ => {
    in_plus = true;
  }).guard(_ => !in_plus);
  static expression = TestGrammar.plus.or(TestGrammar.terminal);
}

console.log(TestGrammar.expression.parse_input('a+a+a+a'));
The idea is pretty simple. In cases where we might get stuck in a loop, the rules are amended with guards, as with plus in the above example. The rule fails if we hit a looping condition, and the guard is lifted as soon as we make progress.
What I'd like to know is if this idea has been explored and analysed. I'd rather not go down this rabbit hole and try to figure stuff out if this is a dead end.
Have a look at GLL algorithms (e.g. https://github.com/djspiewak/gll-combinators).
They can handle ambiguous and left-recursive grammars efficiently.
They do not directly call the parser function of a sub-parser, but keep a 'todo' list of (Parser, Position) tuples (called the trampoline).
This way the endless loop (recursing into itself) is avoided: no tuple is added twice.
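To see why tabulating (Parser, Position) pairs tames left-recursion, here is a deliberately tiny OCaml sketch. It is not GLL (no combinators, no trampoline of parser closures), but it shows the underlying idea: record which spans a nonterminal can match and iterate to a fixpoint, instead of recursing into the nonterminal directly, so the left-recursive grammar E -> E '+' 'a' | 'a' terminates:

(* recognizes input iff E derives it, for E -> E '+' 'a' | 'a' *)
let recognizes input =
  let n = String.length input in
  (* can_e.(i).(j) = true iff E matches input positions i..j-1 *)
  let can_e = Array.make_matrix (n + 1) (n + 1) false in
  let changed = ref true in
  while !changed do
    changed := false;
    for i = 0 to n - 1 do
      (* E -> 'a' *)
      if input.[i] = 'a' && not can_e.(i).(i + 1) then begin
        can_e.(i).(i + 1) <- true;
        changed := true
      end;
      for j = i + 1 to n - 2 do
        (* E -> E '+' 'a' : extend an existing match of E over i..j *)
        if can_e.(i).(j) && input.[j] = '+' && input.[j + 1] = 'a'
           && not can_e.(i).(j + 2) then begin
          can_e.(i).(j + 2) <- true;
          changed := true
        end
      done
    done
  done;
  can_e.(0).(n)

let () = Printf.printf "%b\n" (recognizes "a+a+a+a") (* true *)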

Design alternatives to extending object with interface

While working through Expert F# again, I decided to implement the application for manipulating algebraic expressions. This went well and now I've decided as a next exercise to expand on that by building a more advanced application.
My first idea was to have a setup that allows for a more extensible way of creating functions without having to recompile. To that end I have something like:
type IFunction =
    abstract member Name : string with get
    /// additional members omitted
type Expr =
    | Num of decimal
    | Var of string
    /// ... omitting some types here that don't matter
    | FunctionApplication of IFunction * Expr list
So that, say, Sin(x) could be represented as:
let sin = { new IFunction with member x.Name = "SIN" }
let sinExpr = FunctionApplication(sin, [Var("x")])
So far all good, but the next idea that I would like to implement is having additional interfaces to represent properties of functions. E.g.:
type IDifferentiable =
    abstract member Derivative : int -> IFunction // Get the derivative w.r.t. a variable index
One of the things I'm trying to achieve here is that I implement some functions and all the logic for them, and then move on to the next part of the logic I would like to implement. However, as it currently stands, that means that with every interface I add, I have to revisit all the IFunctions that I've implemented. Instead, I'd rather have a function:
let makeDifferentiable (f : IFunction) (deriv : int -> IFunction) =
    { f with
        interface IDifferentiable with
            member x.Derivative = deriv }
but as discussed in this question, that is not possible. The alternative that is possible doesn't meet my extensibility requirement. My question is: what alternatives would work well?
[EDIT] I was asked to expand on the "doesn't meet my extensibility requirement" comment. The way this function would work is by doing something like:
let makeDifferentiable (deriv : int -> IFunction) (f : IFunction) =
    { new IFunction with
        member x.Name = f.Name
      interface IDifferentiable with
        member x.Derivative = deriv }
However, ideally I would keep on adding additional interfaces to an object as I add them. So if I now wanted to add an interface that tells whether a function is even:
type IsEven =
    abstract member IsEven : bool with get
then I would like to be able (but not obliged; if I don't make this change, everything should still compile) to change my definition of a sine from
let sin = { new IFunction with ... } >> (makeDifferentiable ...)
to
let sin = { new IFunction with ... } >> (makeDifferentiable ...) >> (makeEven false)
The result of which would be that I could create an object that implements the IFunction interface as well as, potentially but not necessarily, a number of other interfaces; the operations I'd then define on them would be able to optimize what they're doing based on whether or not a certain function implements an interface. This would also allow me to add additional features/interfaces/operations first, without having to change the functions I've already defined (though they wouldn't take advantage of the additional features, things wouldn't be broken either). [/EDIT]
The only thing I can think of right now is to create a dictionary for each feature that I'd like to implement, with function names as keys and, as values, the details to build an interface on the fly, e.g. along these lines:
let derivative (f : IFunction) =
    match derivativeDictionary.TryGetValue(f.Name) with
    | false, _ -> None
    | true, d -> Some d.Derivative
This would require me to create one such function per feature that I add in addition to one dictionary per feature. Especially if implemented asynchronously with agents, this might be not that slow, but it still feels a little clunky.
I think the problem that you're trying to solve here is what is called the Expression Problem. You're essentially trying to write code that is extensible in two directions. Discriminated unions and the object-oriented model each give you one or the other:
Discriminated unions make it easy to add new operations (just write a function with pattern matching), but it is hard to add a new kind of expression (you have to extend the DU and modify all code that uses it).
Interfaces make it easy to add new kinds of expressions (just implement the interface), but it is hard to add new operations (you have to modify the interface and change all code that creates it).
In general, I don't think it is all that useful to try to come up with solutions that let you do both (they end up being terribly complicated), so my advice is to pick the one that you'll need more often.
Going back to your problem, I'd probably represent the function just as a function name together with the parameters:
type Expr =
    | Num of decimal
    | Var of string
    | Application of string * Expr list
Really - an expression is just this. The fact that you can take derivatives is another part of the problem you're solving. Now, to make the derivative extensible, you can just keep a dictionary of the derivatives:
let derivatives =
    dict [ "sin", (fun [arg] -> Application("cos", [arg]))
           ... ]
This way, you have an Expr type that really models just what an expression is, and you can write a differentiation function that looks up the derivatives in the dictionary.
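For completeness, here is a minimal sketch of that lookup, written in OCaml to match the other answers in this collection (the F# version is analogous; all names here are illustrative):

type expr =
  | Num of float
  | Var of string
  | Application of string * expr list

(* name -> differentiation rule; extend this table without touching expr *)
let derivatives : (string, expr list -> expr) Hashtbl.t = Hashtbl.create 8
let () = Hashtbl.add derivatives "sin" (fun args -> Application ("cos", args))

let derivative name args =
  Option.map (fun rule -> rule args) (Hashtbl.find_opt derivatives name)

Adding support for a new function is then a single Hashtbl.add, with no change to existing code.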
