Good type design in Haskell for the AST of a simple language - parsing

I'm new to Haskell, and am working through the Haskell LLVM tutorial. In it, the author defines a simple algebraic data type to represent the AST.
type Name = String

data Expr
  = Float Double
  | BinOp Op Expr Expr
  | Var String
  | Call Name [Expr]
  | Function Name [Expr] Expr
  | Extern Name [Expr]
  deriving (Eq, Ord, Show)

data Op
  = Plus
  | Minus
  | Times
  | Divide
  deriving (Eq, Ord, Show)
However, this is not an ideal structure, because the parser actually expects that the list of Expr in an Extern will only ever contain expressions representing variables (i.e. parameters in this situation cannot be arbitrary expressions). I would like the types to reflect this constraint (which would also make it easier to generate random valid ASTs with QuickCheck); however, for the sake of consistency in the parser functions (which all have type Parser Expr), I don't just want to say | Extern Name [Name]. I would like to do something like this:
data Expr
  = ...
  | Var String
  ...
  | Function Name [Expr] Expr
  | Extern Name [Var] -- enforce constraint here
  deriving (Eq, Ord, Show)
But that is not possible in Haskell, since Var is a data constructor rather than a type.
To summarize, Extern and Var should both be Expr, and Extern should have a list of Vars representing parameters. Would the best way be to split all of these out and make them instances of an Expr typeclass (that wouldn't have any methods)? Or is there a more idiomatic method (or would it be better to scrap these types and do something totally different)?

Disclaimer: I'm the author of the LLVM tutorial you mentioned.
Just use Extern Name [Name]; everything from Chapter 3 onward in the tutorial uses that exact definition anyway. I think I just forgot to make the Chapter 2 Syntax.hs consistent with the others.
I wouldn't worry about making the parser definitions consistent; it's fine for them to return different types. Here's what the later parsers use. identifier is just the Parsec builtin for the alphanumeric identifier from the LanguageDef, which becomes the Name type in the AST.
extern :: Parser Expr
extern = do
  reserved "extern"
  name <- identifier
  args <- parens $ many identifier
  return $ Extern name args
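For reference, here is the question's Expr type with only the Extern constructor changed to match what this parser returns (a sketch for illustration, not copied verbatim from the tutorial's later Syntax.hs):

data Expr
  = Float Double
  | BinOp Op Expr Expr
  | Var String
  | Call Name [Expr]
  | Function Name [Expr] Expr
  | Extern Name [Name]  -- parameters are plain names, not arbitrary expressions
  deriving (Eq, Ord, Show)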

Related

Adding additional parameters to data constructors using infix operators

I have written a data type like
data Expr = IntL Integer | Expr :*: Expr
and would like to annotate it with extra constructor parameters (such as positional information) like this:
data Expr = IntL Integer Pos | Expr :*: Expr Pos
However, GHC does not like this:
Expected kind '* -> *' but 'Expr' has kind '*'
In the type 'Expr Position'
In the definition of data constructor ':*:'
In the data declaration for 'Expr'
I know I could use something like Mul Expr Expr Pos as a workaround, or even wrap Expr in another data constructor, but I'd really like to use the infix operator and cannot figure out a way to do so! Is this possible?
I've tried wrapping the constructor in brackets:
data Expr = IntL Integer Pos | (Expr :*: Expr) Pos
And also making :*: a prefix:
data Expr = IntL Integer Pos | (:*:) Expr Expr Pos
but this does not allow me to pattern match in the same way. I'm not sure this even makes sense as a constructor, but thought I'd ask just in case.
It might be better to do this with an extra constructor, so:
infixl 6 :*:
infixl 7 :#
data Expr = IntL Integer | PosExpr :*: PosExpr
data PosExpr = Expr :# Pos
Then you can construct items with:
(IntL 5 :# foo :*: IntL 6 :# bar) :# qux
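To see how pattern matching works with this layout, here is a small illustrative sketch (the Pos synonym and the eval function are mine, not part of the answer):

type Pos = (Int, Int)  -- hypothetical position type: (line, column)

infixl 6 :*:
infixl 7 :#

data Expr    = IntL Integer | PosExpr :*: PosExpr
data PosExpr = Expr :# Pos

-- Evaluation peels off the :# annotation, then matches the expression inside.
eval :: PosExpr -> Integer
eval (IntL n    :# _) = n
eval ((l :*: r) :# _) = eval l * eval r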

Obscure Antlr Error when Parsing Data Type

I am trying to parse a variable type for a toy language meant to teach Antlr fundamentals. I wish to parse it at the rule var, using the code below.
// Parser
var : TYPE ID;
// Lexer
TYPE : SIGNED PTR? DIMENSIONS?
     | UNSIGNED PTR? DIMENSIONS?
     | UNSIGNABLE PTR? DIMENSIONS?
     ;
fragment DIMENSIONS : '[' ((NAT | ':') ',')* (NAT | ':')? ']';
fragment SIGNED : 'I16' | 'I32' | 'I64' | 'F32' | 'CHAR';
fragment UNSIGNED : 'U_I16' | 'U_I32' | 'U_I64' | 'U_F32' | 'U_CHAR';
fragment UNSIGNABLE : 'VOID' | 'STR' | 'BOOL' | 'CPLX';
PTR : 'PTR';
NAT : [0-9]+;
ID : [A-Z][A-Z0-9_]*;
However, when I test my program with the example declaration I32 HELLO_9, I receive the following error.
line 1:0 missing TYPE at 'I32'
PTR and DIMENSIONS are marked as optional, so I am unsure why my lexer will not identify the I32 token for the SIGNED fragment. As a secondary question, I wonder how it is ever possible for professional programmers to create sophisticated projects with Antlr. I have experimented with Haskell parsing libraries in the past, and it appears (from my subjective view) that Antlr is more prone to producing obscure errors. My perception is probably just a consequence of my inexperience, and I would be thankful to hear the opinions of a more seasoned programmer.
Given your grammar, I can't reproduce this. If I add SPACE : [ \t\r\n] -> skip; to it, the following code:
TLexer lexer = new TLexer(CharStreams.fromString("I32 HELLO_9"));
TParser parser = new TParser(new CommonTokenStream(lexer));
ParseTree root = parser.var();
System.out.println(root.toStringTree(parser));
produces no warnings/errors and prints:
(var I32 HELLO_9)
which is the textual form of the parse tree.
The real problem is either something @rici mentioned, or it is hidden by the fact that you've minimized your real grammar and the minimized form does not produce the error your real grammar does.

Overloading multiplication using menhir and OCaml

I have written a lexer and parser to analyze linear algebra statements. Each statement consists of one or more expressions followed by one or more declarations. I am using menhir and OCaml to write the lexer and parser.
For example:
Ax = b, where A is invertible.
This should be read as A * x = b, (A, invertible)
In an expression, all ids are single uppercase or lowercase characters. I would like to overload the multiplication operator so that the user does not have to type in the '*' symbol.
However, since the lexer also needs to be able to read strings (such as "invertible" in this case), the "Ax" portion of the expression is sent over to the parser as a string. This causes a parser error since no strings should be encountered in the expression portion of the statement.
Here is the basic idea of the grammar
stmt :=
    | expr "."
    | decl "."
    | expr "," decl "."

expr :=
    | term
    | unop expr
    | expr binop expr

term :=
    | <int> num
    | <char> id
    | "(" expr ")"

decl :=
    | id "is" kinds

kinds :=
    | <string> kind
    | kind "and" kinds
Is there some way to separate the individual characters and tell the parser that they should be treated as multiplication? Is there a way to change the lexer so that it is smart enough to know that all character clusters before a comma are ids and all clusters after should be treated as strings?
It seems to me you have two problems:
You want your lexer to treat sequences of characters differently in different places.
You want multiplication to be indicated by adjacent expressions (no operator in between).
The first problem I would tackle in the lexer.
One question is why you say you need to use strings. This implies that there is a completely open-ended set of things you can say. It might be true, but if you can limit yourself to a smallish number, you can use keywords rather than strings. E.g., invertible would be a keyword.
If you really want to allow any string at all in such places, it's definitely still possible to hack a lexer so that it maintains a state describing what it has seen, and looks ahead to see what's coming. If you're not required to adhere to a pre-defined grammar, you could adjust your grammar to make this easier. (E.g., you could use commas for only one purpose.)
For the second problem, I'd say you need to add adjacency to your grammar. I.e., your grammar needs a rule that says something like term := term term. I suspect it's tricky to get this to work correctly, but it does work in OCaml (where adjacent expressions represent function application) and in awk (where adjacent expressions represent string concatenation).
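For example, in the question's own grammar notation, the adjacency rule (read as multiplication) could look like this; where it sits relative to the other operators is a precedence decision left open here:

term :=
    | term term
    | <int> num
    | <char> id
    | "(" expr ")"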

Left recursion, associativity and AST evaluation

So I have been reading a bit on lexers, parsers, interpreters and even compiling.
For a language I'm trying to implement, I settled on a recursive descent parser. Since the original grammar of the language had left recursion, I had to rewrite it slightly.
Here's a simplified version of the grammar I had (note that it's not in any standard grammar format, but somewhat pseudo, I guess; it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
And then I go through my tokens using the following subroutines (simplified, pseudo-code-ish):
public string Expression()
{
    string term = ExpressionTerm();
    if (term != null)
    {
        while (PeekToken() == OperatorToken)
        {
            term += ReadToken() + Expression();
        }
    }
    return term;
}

public string ExpressionTerm()
{
    // PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: if I created AST nodes rather than strings in these subroutines, and evaluated the AST using an infix evaluator (one that also keeps in mind associativity and precedence of operators, etcetera), wouldn't I get the same result?
And if so, why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve, or even a non-problem, as it seems? Or is it really the structure of the resulting AST that people are concerned about (rather than what it evaluates to)? Could anyone shed some light on this? I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
          Sum
           |
       +---+---+
       |       |
      Var     Prod
       |       |
       a   +---+---+
           |       |
          Var    Const
           |       |
           b       3
If your language obeys the usual mathematical notation conventions, so will the AST for a+b*3.
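As an illustration (a Haskell sketch, not part of the original answer), that tree shape written as a data type can be evaluated directly:

import qualified Data.Map as Map

data Ast
  = Sum   Ast Ast
  | Prod  Ast Ast
  | Var   String
  | Const Integer

-- Evaluation just walks the tree; precedence was already decided when the tree was built.
eval :: Map.Map String Integer -> Ast -> Integer
eval env (Sum  l r) = eval env l + eval env r
eval env (Prod l r) = eval env l * eval env r
eval env (Var  x)   = env Map.! x
eval _   (Const n)  = n

-- eval (Map.fromList [("a", 2), ("b", 4)]) (Sum (Var "a") (Prod (Var "b") (Const 3)))  ==>  14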
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse or abstract trees be directly evaluatable. You can expand the rule hierarchy to include power and bitwise operators, or make it part of a larger hierarchy for logical expressions with and, or, and comparisons.
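As a rough sketch of evaluating directly while parsing with such a hierarchy, here is a version using Haskell's parsec library (illustrative only; whitespace handling is omitted, and the try-based backtracking re-parses shared prefixes, which a real parser would factor out):

import Text.Parsec
import Text.Parsec.String (Parser)

-- One parser per precedence level, mirroring expre / sum / addend / term above.
expre, sum', addend, term :: Parser Integer
expre  = sum'
sum'   = try (do a <- addend; _ <- char '+'; b <- sum'; return (a + b)) <|> addend
addend = try (do a <- term;   _ <- char '*'; b <- addend; return (a * b)) <|> term
term   =  (char '(' *> expre <* char ')')
      <|> (negate . read <$> (char '-' *> many1 digit))
      <|> (read <$> (char '+' *> many1 digit))
      <|> (read <$> many1 digit)

-- parse expre "" "2*(3+4)"  ==>  Right 14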

Can an interpreter be implemented with a symbol table?

Often I hear that using a symbol table optimizes look-ups of symbols in a programming language. Currently, my language is implemented only as an interpreter, not as a compiler. I do not yet want to allocate the time to build a compiler, so I'm attempting to optimize the interpreter. The language is based on Scheme semantics and syntax for the most part, and is statically scoped. The AST is what I execute at run-time; in my interpreter it is implemented as discriminated unions, just like the AST in Write Yourself a Scheme in 48 Hours.
Unfortunately, symbol look-up in my interpreter is slow due to the use of an F# Map to contain and look up symbols by name. (Well, in truth, it uses a Trie, but the performance is similarly problematic.) I would instead like to use a symbol table to achieve faster symbol lookup. However, I don't know if or how one can implement symbol tables in an interpreter. I hear about them only in the context of a compiler.
Is this possible? If the implementation strategy or performance differs from a symbol table in a compiler, could you describe the differences? Finally, is there an existing reference implementation of a symbol table in an interpreter I might look at?
Thank you!
A symbol table associates some information with every symbol. In an interpreter, you would perhaps associate values with symbols. Map is one implementation particularly suitable for functional interpreters.
If you want to optimize your interpreter, get rid of the need for a symbol table at runtime. One way to go is De Bruijn indexing.
There is also nice literature on mechanically deriving optimized interpreters, VMs and compilers from a functional interpreter, for example:
http://www.brics.dk/RS/03/14/BRICS-RS-03-14.pdf
For a simple example, consider lambda calculus with constants encoded with De Bruijn indices. Notice that the evaluator gets by without a symbol table, because it can use integers for lookup.
type exp =
    | App of exp * exp
    | Const of int
    | Fn of exp
    | Var of int

type value =
    | Closure of exp * env
    | Number of int
and env = value []

// The environment is just an array; a variable is an index into it.
let lookup env i = Array.get env i
let extend value env = Array.append [| value |] env
let empty () : env = Array.empty

let eval exp =
    let rec eval env exp =
        match exp with
        | App (f, x) ->
            // Parenthesize the nested match so the outer cases are not swallowed by it.
            (match eval env f with
             | Closure (bodyF, envF) ->
                 let vx = eval env x
                 eval (extend vx envF) bodyF
             | _ -> failwith "?")
        | Const x -> Number x
        | Fn e -> Closure (e, env)
        | Var x -> lookup env x
    eval (empty ()) exp
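For instance, the term (fun x -> x) 42 is written App (Fn (Var 0), Const 42) in this encoding, and eval returns Number 42 without ever looking up a name.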
