Parsers and Compilers for Dummies. Where to start? [duplicate] - parsing

This is a good listing, but what is the best one for a complete newb in this area? I'm looking for one aimed at someone coming from a higher-level background (VB6, C#, Java, Python) who is not too familiar with C or C++. I'm much more interested in hand-written parsing than in Lex/Yacc at this stage.
If I had just majored in Computer Science instead of Psychology I might have taken a class on this in college. Oh well.

Please have a look at: learning to write a compiler
Also interesting:
how to write a programming language
parsing where can i learn about it
learning resources on parsers interpreters and compilers (ok, you already mentioned this one).
And there are more on the topic. But I can give a short introduction:
The first step is the lexical analysis. A stream of characters is translated into a stream of tokens. Tokens can be simple like == <= + - (etc) and they can be complex like identifiers and numbers. If you like I can elaborate on this.
The next step is to translate the token stream into a syntax tree or another representation. This is called the parsing step.
Before you can create a parser, you need to write the grammar. For example we create an expression parser:
Tokens
addOp = '+' | '-';
mulOp = '*' | '/';
parLeft = '(';
parRight = ')';
number = digit, {digit};
digit = '0'..'9';
Each token can have different representations: + and - are both addOp, and
23, 6643 and 223322 are all numbers.
The language
exp = term | exp, addOp, term;
// an expression is a series of terms separated by addOps.
term = factor | term, mulOp, factor;
// a term is a series of factors separated by mulOps
factor = addOp, factor | parLeft, exp, parRight | number;
// a factor can be an addOp followed by another factor,
// an expression enclosed in parentheses or a number.
The lexer
We create a state engine that walks through the char stream, creating a token
s00
    '+', '-' -> s01 // if a + or - is found, read it and go to state s01.
    '*', '/' -> s02
    '(' -> s03
    ')' -> s04
    '0'..'9' -> s05
    whitespace -> ignore and retry // if a whitespace is found ignore it
    else ERROR // sorry but we don't recognize this character in this state.
s01
    found TOKEN addOp // ok we have found an addOp, stop reading and return token
s02
    found TOKEN mulOp
s03
    found TOKEN parLeft
s04
    found TOKEN parRight
s05
    '0'..'9' -> s05 // as long as we find digits, keep collecting them
    else found number // last digit read, we have a number
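The state engine above could be sketched in Python roughly like this. It is only an illustration, not the code behind the answer: the function name `tokenize` and the `(kind, text)` tuples are assumptions made for the sketch.

```python
def tokenize(text):
    """Turn a string into a list of (kind, text) tokens."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                 # s00: whitespace -> ignore and retry
            i += 1
        elif ch in "+-":                 # s01: addOp
            tokens.append(("addOp", ch))
            i += 1
        elif ch in "*/":                 # s02: mulOp
            tokens.append(("mulOp", ch))
            i += 1
        elif ch == "(":                  # s03: parLeft
            tokens.append(("parLeft", ch))
            i += 1
        elif ch == ")":                  # s04: parRight
            tokens.append(("parRight", ch))
            i += 1
        elif ch in "0123456789":         # s05: keep collecting digits
            start = i
            while i < len(text) and text[i] in "0123456789":
                i += 1
            tokens.append(("number", text[start:i]))
        else:                            # s00: else ERROR
            raise ValueError("unexpected character %r at position %d" % (ch, i))
    return tokens
```

Calling `tokenize("12 + 3")` gives one number token, one addOp token and another number token; an unknown character raises an error instead of being skipped.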
Parser
It is now time to create a simple parser/evaluator. It is written out completely in code; generated parsers are normally driven by tables, but we keep it simple: read the tokens and calculate the result.
ParseExp
    temp = ParseTerm // start by reading a term
    while token = addOp do
        // as long as we read an addop keep reading terms
        if token('+') then temp = temp + ParseTerm // + so we add the term
        if token('-') then temp = temp - ParseTerm // - so we subtract the term
    od
    return temp // we are done with the expression
ParseTerm
    temp = ParseFactor
    while token = mulOp do
        if token('*') then temp = temp * ParseFactor
        if token('/') then temp = temp / ParseFactor
    od
    return temp
ParseFactor
    if token = addOp then
        if token('-') then return - ParseFactor // yes we can have a lot of these
        if token('+') then return ParseFactor
    else if token = parLeft then
        temp = ParseExp // parse the expression between the parentheses
        if not token = parRight then ERROR // the closing parenthesis is required
        return temp
    else if token = number then
        return EvaluateNumber // use magic to translate a string into a number
This was a simple example. In real examples you will see that error handling is a big part of the parser.
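For readers who prefer something they can run, here is a rough Python rendering of the same parser/evaluator. It is a sketch under assumptions: the class name `Parser`, the helper `evaluate` and the regex-based tokenizer are invented for illustration, and error handling is kept to the bare minimum.

```python
import re


class Parser:
    def __init__(self, text):
        # Crude tokenizer for the sketch: digit runs and single-character
        # operators; any other character is silently skipped.
        self.tokens = re.findall(r"\d+|[-+*/()]", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_exp(self):
        # exp: a series of terms separated by addOps
        value = self.parse_term()
        while self.peek() in ("+", "-"):
            if self.next() == "+":
                value += self.parse_term()
            else:
                value -= self.parse_term()
        return value

    def parse_term(self):
        # term: a series of factors separated by mulOps
        value = self.parse_factor()
        while self.peek() in ("*", "/"):
            if self.next() == "*":
                value *= self.parse_factor()
            else:
                value /= self.parse_factor()
        return value

    def parse_factor(self):
        # factor: a signed factor, a parenthesized expression or a number
        tok = self.next()
        if tok == "-":
            return -self.parse_factor()
        if tok == "+":
            return self.parse_factor()
        if tok == "(":
            value = self.parse_exp()
            if self.next() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError("unexpected token %r" % (tok,))


def evaluate(text):
    parser = Parser(text)
    value = parser.parse_exp()
    if parser.peek() is not None:
        raise SyntaxError("unexpected trailing token %r" % (parser.peek(),))
    return value
```

Because each loop consumes operators of one level only, `evaluate("2 + 3 * 4")` multiplies before it adds, and `evaluate("8 - 3 - 2")` groups from the left.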
I hope this clarified a bit ;-).

If you're a complete n00b, the most accessible resource (in both senses of the term) is probably Jack Crenshaw's tutorial. It's nowhere near comprehensive but for getting started, I can't think of anything close except for books that are long out of print.

I'd like to suggest an article that I wrote called Implementing Programming Languages using C# 4.0. I've tried to make it accessible for newcomers. It isn't comprehensive, but afterwards it should be easier to understand other more advanced texts.

Related

How to create grammar for applying De Morgan's theorem to an expression using yacc?

I would like to apply De Morgan's theorem to an input using yacc and lex.
The input could be any expression such as a+b, !(A+B) etc:
The expression a+b should result in !a∙!b
The expression !(a+b) should result in a+b
I think the lex part is done but I'm having difficulty with the yacc grammar needed to apply the laws to an expression.
What I'm trying to implement is the following algorithm. Consider the following equation as input: Y = A+B
After applying De Morgan's law it becomes: !Y = !(A+B)
Finally, expanding the parentheses should result in !Y = !A∙!B
Here is my lex code:
%{
#include <stdio.h>
#include "y.tab.h"
extern int yylval;
int yywrap (void);
%}
%%
[a-zA-Z]+ {yylval = *yytext; return ALPHABET;}
"&&" return AND;
"||" return OR;
"=" return ('=');
[\t] ;
\n return 0;
. return yytext[0];
"0exit" return 0;
%%
int yywrap (void)
{
return 1;
}
Here is my yacc code:
%{
#include <stdio.h>
int yylex (void);
void yyerror (char *);
extern FILE* yyin;
%}
%token ALPHABET
%left '+''*'
%right '=' '!' NOT
%left AND OR
%start check
%%
check : expr {printf("%s\n",$$);}
;
expr : plus
|plus '+' plus {$$ = $1 + $3;}
;
plus : times
|times '*' times {$$ = $1 * $3;}
;
times : and_op
|and_op AND and_op{$$ = $1 && $3;}
;
and_op : or_op
|or_op OR or_op {$$ = $1 || $3;}
;
or_op : not_op
|'!' not_op {$$ = !$2;}
;
not_op : paren
|'(' paren ')' {$$ = $2;}
;
paren :
|ALPHABET {$$ = $1;}
;
/*
E: E '+' E {$$ = $1 + $3;}
|E '*' E {$$ = $1 * $3;}
|E '=' E {$$ = $1 = $3;}
|E AND E {$$ = ($1 && $3);}
|E OR E {$$ = ($1 || $3);}
|'(' E ')' {$$ = $2;}
|'!' E %prec NOT {$$ = !$2;}
|ALPHABET {$$ = $1;}
;*/
%%
int main()
{
char filename[30];
char * line = NULL;
size_t len = 0;
printf("\nEnter filename\n");
scanf("%s",filename);
FILE *fp = fopen(filename, "r");
if(fp == NULL)
{
fprintf(stderr,"Can't read file %s\n",filename);
exit(EXIT_FAILURE);
}
yyin = fp;
// while (getline(&line, &len, fp) != -1)
// {
// printf("%s",line);
// }
// printf("Enter the expression:\n");
do
{
yyparse();
}while(!feof(yyin));
return 0;
}
You are trying to build a computer algebra system.
Your task is conceptually simple:
1. Define a lexer for the atoms of your "boolean" expressions
2. Define a parser for propositional logic in terms of the lexemes
3. Build a tree that stores the expressions
4. Define procedures that implement logical equivalences (DeMorgan's theorem is one), that find a place in the tree where it can be applied by matching tree structure, and then modifying the tree accordingly
5. Run those procedures to achieve the logic rewrites you want
6. Prettyprint the final AST as the answer
But conceptually simple doesn't necessarily mean easy to do and get it all right.
(f)lex and yacc are designed to help you do steps 1-3 in a relatively straightforward way; their documentation contains a pretty good guide.
They won't help with steps 4-6 at all, and this is where the real work happens. (Your grammar looks like a pretty good start for this part).
(You can do 1-3 without flex and yacc by building a recursive descent parser that also happens to build the AST as it goes.)
Step 4 can be messy, because you have to decide what logical theorems you wish to use, and then write a procedure for each one to do tree matching, and tree smashing, to achieve the desired result. You can do it; it's just procedural code that walks up and down the tree comparing node types and relations to children for a match, and then delinking nodes, deleting nodes, creating nodes, and relinking them to effect the tree modification. This is just a bunch of code.
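As a small illustration of step 4, here is a sketch of De Morgan's law applied to a tree. The tuple representation (`("var", name)`, `("not", x)`, `("or", x, y)`, `("and", x, y)`) and the function names are assumptions made for this example, and it builds new trees rather than delinking and relinking nodes in place.

```python
def negate(node):
    """Return a tree for !node with the negation pushed inward over + and *."""
    if node[0] == "or":      # !(x + y) => !x * !y
        return ("and", negate(node[1]), negate(node[2]))
    if node[0] == "and":     # !(x * y) => !x + !y
        return ("or", negate(node[1]), negate(node[2]))
    return ("not", node)     # variables and existing negations are just wrapped


def apply_de_morgan(node):
    """Walk the tree bottom-up and apply De Morgan wherever a 'not' sits on + or *."""
    kind = node[0]
    if kind == "var":
        return node
    if kind == "not":
        return negate(apply_de_morgan(node[1]))
    return (kind, apply_de_morgan(node[1]), apply_de_morgan(node[2]))
```

Double negation is deliberately left alone here; that would be a separate rule with its own procedure.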
A subtlety of algebraic rewrites is now going to bite you: (boolean) algebra has associative and commutative operators. What this means is that some algebra rules will apply to parts of the tree that are arbitrarily far apart. Consider this rule:
a*(b + !a) => a*(b)
What happens when the actual term being parsed looks like:
q*(a + b + c + ... !q ... + z)
"Simple" procedural code to look at the tree now has to walk arbitrarily far down one of the subtrees to find where the rule can apply. Suddenly coding the matching logic isn't so easy, nor is the tree-smash to implement the effect.
If we ignore associative and commutative issues, for complex matches and modifications the code might be a bit clumsy to write and hard to read; after you've done it once this will be obvious. If you only want to do DeMorgan-over-or, you can do it relatively easily by just coding it. If you want to implement lots of boolean algebra rules for simplification, this will start to be painful. What you'd ideally like to do is express the logic rules in the same notation as your boolean logic so they are easily expressed, but now you need something that can read and interpret the logic rules. That is a complex piece of code, but if done right, you can code the logic rules something like the following:
rule deMorgan_for_or(t1:boolexp, t2:boolexp):boolexp->boolexp
" ! (\t1 + \t2) " -> " !\t1 * !\t2 ";
A related problem (step 5) is, where do you want to apply the logic rules? Just because you can apply DeMorgan's law in 15 places in a very big logic term, doesn't mean you necessarily want to do that. So somewhere you need to have a control mechanism that decides which of your many rules should apply, and where they should apply. This gets you into metaprogramming, a whole new topic.
If your rules are "monotonic", that is, they in effect can only be applied once, you can simply run them all everywhere and get a terminating computation, if that monotonic answer is the one you want. If you have rules that are inverses (e.g., !(x+y) => !x * !y, and !a * !b => !(a+b)), then your rules may run forever repeatedly doing and undoing a rewrite. So you have to be careful to ensure you get termination.
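The termination concern can be made concrete with a sketch that runs a set of rules until nothing changes, with a cap on the number of passes. Everything here is hypothetical: the tuple trees (`("var", name)`, `("not", x)`, `("or", x, y)`, `("and", x, y)`), the function names and the `max_passes` limit are assumptions for illustration only.

```python
def rewrite_once(node, rules):
    """Rewrite the children first, then try each rule once at this node."""
    if node[0] != "var":
        node = (node[0],) + tuple(rewrite_once(child, rules) for child in node[1:])
    for rule in rules:
        replacement = rule(node)
        if replacement is not None:
            return replacement
    return node


def rewrite_to_fixed_point(node, rules, max_passes=100):
    """Repeat whole-tree passes until a pass changes nothing."""
    for _ in range(max_passes):
        rewritten = rewrite_once(node, rules)
        if rewritten == node:
            return node
        node = rewritten
    raise RuntimeError("no fixed point after %d passes; rules may undo each other" % max_passes)


def de_morgan_or(node):
    # !(x + y) => !x * !y
    if node[0] == "not" and node[1][0] == "or":
        return ("and", ("not", node[1][1]), ("not", node[1][2]))
    return None


def undo_de_morgan_or(node):
    # !x * !y => !(x + y), the inverse of the rule above
    if node[0] == "and" and node[1][0] == "not" and node[2][0] == "not":
        return ("not", ("or", node[1][1], node[2][1]))
    return None
```

With only `de_morgan_or` in the rule list the loop settles after one rewriting pass; with both rules the tree flips back and forth until the cap is hit, which is the non-termination described above.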
Finally, when you have the modified tree, you'll need to print it back out in readable form (Step 6). See my SO answer on how to build a prettyprinter.
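A minimal prettyprinter for step 6 might look like the sketch below. It assumes the same hypothetical tuple trees (`("var", name)`, `("not", x)`, `("or", ...)`, `("and", ...)`) and only adds parentheses where a child binds less tightly than its parent; the precedence numbers are an assumption, not something taken from the question's grammar.

```python
# Higher number = binds tighter. These values are illustrative assumptions.
PRECEDENCE = {"or": 1, "and": 2, "not": 3, "var": 4}
SYMBOL = {"or": " + ", "and": " * "}


def pretty(node, parent_prec=0):
    """Render a tree as text, parenthesizing only where precedence requires it."""
    kind = node[0]
    prec = PRECEDENCE[kind]
    if kind == "var":
        text = node[1]
    elif kind == "not":
        text = "!" + pretty(node[1], prec)
    else:
        text = SYMBOL[kind].join(pretty(child, prec) for child in node[1:])
    return "(" + text + ")" if prec < parent_prec else text
```

Since + and * are associative, nested nodes of the same operator are printed without extra parentheses.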
Doing all of this for one or two rules by yourself is a great learning exercise.
Doing it with the idea of producing a useful tool is a whole different animal. There, what you want is a set of infrastructure that makes this easy to express: a program transformation system. You can see a complete example of what this looks like for a system doing arithmetic rather than boolean computations using surface syntax rewrite rules, including handling of the associative and commutative rewrite issues. In another example, you can see what it looks like for boolean logic (see simplify_boolean near the end of the page), which shows a real example of rules like the one I wrote above.

Parsing expressions from Token Stream

I'm trying to parse expressions for a simple scripting language, but I'm confused. Right now, only numbers and string literals can be parsed as an expression:
int x = 5;
double y = 3.4;
str t = "this is a string";
However, I am confused on parsing more complex expressions:
int a = 5 + 9;
int x = (5 + a) - (a ^ 2);
I'm thinking I would implement it like:
do {
// no clue what I would do here?
if (current token is a semi colon) break;
}
while (true);
Any help would be great, I have no clue where to start. Thanks.
EDIT:
My parser is a Recursive Descent Parser
My expression "class" is as follows:
typedef struct s_Expression {
char type;
Token *value;
struct s_Expression *leftHand;
char operand;
struct s_Expression *rightHand;
} ExpressionNode;
Someone mentioned that a recursive descent parser is capable of parsing expressions without converting infix to postfix. Preferably, I would like an expression tree like the following.
For example, this:
int x = (5 + 5) - (a / b);
Would be parsed into this:
Note: this isn't valid C, this is just some pseudo ish code to get my point across simply :)
ExpressionNode lh;
lh.leftHand = new ExpressionNode(5);
lh.operand = '+';
lh.rightHand = new ExpressionNode(5);
ExpressionNode rh;
rh.leftHand = new ExpressionNode(a);
rh.operand = '/';
rh.rightHand = new ExpressionNode(b);
ExpressionNode res;
res.leftHand = lh;
res.operand = '-';
res.rightHand = rh;
I asked the question pretty late at night, so sorry if I wasn't clear and that I completely forgot what my original goal was.
There are multiple methods to do this. One I've used in the past is to read the input string (that is, the programming language code) and convert expressions from infix to reverse-polish notation. Here's a really good post about doing just that:
http://andreinc.net/2010/10/05/converting-infix-to-rpn-shunting-yard-algorithm/
Note that = is an operator also, so your parsing should really include the whole file of code, and not just certain expressions.
Once in reverse-polish, expressions are super-easy to evaluate. You read the tokens from left to right, pushing operands onto a stack as you go, until you hit an operator. Then pop as many operands from the stack as the operator requires, perform the operation, and push the result back.
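A rough sketch of that evaluation loop, assuming the expression has already been converted to a list of RPN tokens and that every operator is binary (the `OPS` table and the name `eval_rpn` are made up for the example):

```python
import operator

# Binary operators only; the table is an assumption for this sketch.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "^": operator.pow,
}


def eval_rpn(tokens):
    """Evaluate a list of reverse-polish tokens such as ["5", "9", "+"]."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()          # the most recent operand is the right one
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(float(tok))     # anything else is treated as a number
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]
```

For instance, `(5 + 14) - (14 ^ 2)` in RPN is `5 14 + 14 2 ^ -`, which this loop reduces to a single value on the stack.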
The method I ended up using is Operator precedence parsing.
parse_expression_1 (lhs, min_precedence)
    lookahead := peek next token
    while lookahead is a binary operator whose precedence is >= min_precedence
        op := lookahead
        advance to next token
        rhs := parse_primary ()
        lookahead := peek next token
        while lookahead is a binary operator whose precedence is greater
                 than op's, or a right-associative operator
                 whose precedence is equal to op's
            rhs := parse_expression_1 (rhs, lookahead's precedence)
            lookahead := peek next token
        lhs := the result of applying op with operands lhs and rhs
    return lhs
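The pseudocode translates fairly directly into Python. The sketch below builds a tree of `(op, left, right)` tuples instead of evaluating; the `TokenStream` class, the `BINARY` table (including the choice to make `^` right-associative) and the `parse` helper are assumptions added for the example, not part of the original C parser.

```python
import re

# operator -> (precedence, right_associative); the values are assumptions.
BINARY = {"+": (1, False), "-": (1, False),
          "*": (2, False), "/": (2, False),
          "^": (3, True)}


class TokenStream:
    def __init__(self, text):
        self.items = re.findall(r"\d+|[A-Za-z_]\w*|[-+*/^()]", text)
        self.pos = 0

    def peek(self):
        return self.items[self.pos] if self.pos < len(self.items) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok


def parse_primary(tokens):
    tok = tokens.advance()
    if tok == "(":
        node = parse_expression(tokens)
        if tokens.advance() != ")":
            raise SyntaxError("expected ')'")
        return node
    if tok is None or tok in BINARY or tok == ")":
        raise SyntaxError("expected an operand, got %r" % (tok,))
    return int(tok) if tok.isdigit() else tok


def parse_expression(tokens):
    return parse_expression_1(tokens, parse_primary(tokens), 0)


def parse_expression_1(tokens, lhs, min_precedence):
    lookahead = tokens.peek()
    while lookahead in BINARY and BINARY[lookahead][0] >= min_precedence:
        op = lookahead
        tokens.advance()
        rhs = parse_primary(tokens)
        lookahead = tokens.peek()
        # Let tighter-binding (or equal, right-associative) operators claim rhs.
        while lookahead in BINARY and (
                BINARY[lookahead][0] > BINARY[op][0]
                or (BINARY[lookahead][1] and BINARY[lookahead][0] == BINARY[op][0])):
            rhs = parse_expression_1(tokens, rhs, BINARY[lookahead][0])
            lookahead = tokens.peek()
        lhs = (op, lhs, rhs)
    return lhs


def parse(text):
    tokens = TokenStream(text)
    node = parse_expression(tokens)
    if tokens.peek() is not None:
        raise SyntaxError("unexpected trailing token %r" % (tokens.peek(),))
    return node
```

The nested tuples play the role of the `ExpressionNode` struct from the question: the operator, then the left-hand and right-hand subtrees.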
If you are building a recursive descent parser, implementing a shunting yard is a lot of unnecessary work because recursive descent is perfectly capable of evaluating expressions or outputting RPN by itself.
You haven't told us anything about your programming language, your Token structs, or even what you are trying to achieve, so it is really hard to give you sample code. Take a look at Eric White's implementation of a recursive descent parser in C#. If you provide more details we will be able to help more.

Left recursion, associativity and AST evaluation

So I have been reading a bit on lexers, parsers, interpreters and even compiling.
For a language I'm trying to implement I settled on a Recursive Descent Parser. Since the original grammar of the language had left-recursion, I had to slightly rewrite it.
Here's a simplified version of the grammar I had (note that it's not any standard format grammar, but somewhat pseudo, I guess, it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
And then go through my tokens using the following sub-routines (simplified pseudo-code-ish):
public string Expression()
{
string term = ExpressionTerm();
if (term != null)
{
while (PeekToken() == OperatorToken)
{
term += ReadToken() + Expression();
}
}
return term;
}
public string ExpressionTerm()
{
//PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: If I would create AST nodes rather than a string in these subroutines, and evaluate the AST using an infix evaluator (which also keeps in mind associativity and precedence of operators, etcetera), won't I get the same result?
And if I do, then why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve or even a non-problem as it seems? Or is it really the structure of the resulting AST people are concerned about (rather than what it evaluates to)? Could anyone shed a light, I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
Sum
|
+---+---+
| |
Var Prod
| |
a +---+---+
| |
Var Const
| |
b 3
If your language obeys usual mathematical notation conventions, so will the AST for a+b*3.
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse trees or abstract syntax trees be evaluated directly. You can expand the rule hierarchy to include power and bitwise operators, or make it part of the hierarchy for logical expressions with and, or, and comparisons.
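To show what "directly evaluatable" can mean, here is a sketch with one Python function per rule of the grammar above, each returning a value and the position of the next unread token. The function names and the regex tokenizer are assumptions for illustration.

```python
import re


def tokenize(text):
    # Integers plus the four symbols the grammar uses; other characters are skipped.
    return re.findall(r"\d+|[-+*()]", text)


def parse_sum(tokens, pos):
    # sum = addend '+' sum | addend
    value, pos = parse_addend(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "+":
        rest, pos = parse_sum(tokens, pos + 1)
        value += rest
    return value, pos


def parse_addend(tokens, pos):
    # addend = term '*' addend | term
    value, pos = parse_term(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "*":
        rest, pos = parse_addend(tokens, pos + 1)
        value *= rest
    return value, pos


def parse_term(tokens, pos):
    # term = '(' expre ')' | '-' integer | '+' integer | integer
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        value, pos = parse_sum(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return value, pos + 1
    if tok in ("+", "-"):
        if pos + 1 >= len(tokens) or not tokens[pos + 1].isdigit():
            raise SyntaxError("expected an integer after sign")
        number = int(tokens[pos + 1])
        return (-number if tok == "-" else number), pos + 2
    if tok.isdigit():
        return int(tok), pos + 1
    raise SyntaxError("unexpected token %r" % (tok,))


def evaluate(text):
    tokens = tokenize(text)
    value, pos = parse_sum(tokens, 0)   # expre = sum
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return value
```

The right-recursive rules group `a + b + c` as `a + (b + c)`, which is harmless for + and * because both are associative; the same shape would give the wrong grouping for - or /.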

Parsing, which method choose?

I'm working on a compiler (language close to C) and I've to implement it in C. My main question is how to choose the right parsing method in order to be efficient while coding my compiler.
Here's my current grammar:
http://img11.hostingpics.net/pics/273965Capturedcran20130417192526.png
I was thinking about making a top-down parser LL(1) as described here: http://dragonbook.stanford.edu/lecture-notes/Stanford-CS143/07-Top-Down-Parsing.pdf
Could it be an efficient choice for this grammar, given that I first have to remove the left-recursive rules? Do you have any other advice?
Thank you,
Mentinet
Lots of answers here, but they get things confused. Yes, there are LL and LR parsers, but that isn't really your choice.
You have a grammar. There are tools that automatically create a parser for you given a grammar. The venerable Yacc and Bison do this. They create an LR parser (LALR, actually). There are also tools that create an LL parser for you, like ANTLR. The downside of tools like this is that they are inflexible. Their automatically generated syntax error messages suck, error recovery is hard, and the older ones encourage you to factor your code in one way, which happens to be the wrong way. The right way is to have your parser spit out an Abstract Syntax Tree, and then have the compiler generate code from that. The tools want you to mix parser and compiler code.
When you are using automated tools like this, the differences in power between LL, LR and LALR really do matter. You can't "cheat" to extend their power. (Power in this case means being able to generate a parser for a valid context-free grammar. A valid context-free grammar is one that generates a unique, correct parse tree for every input, or correctly says it doesn't match the grammar.) We currently have no parser generator that can create a parser for every valid grammar. However, LR can handle more grammars than any other sort. Not being able to handle a grammar isn't a disaster, as you can rewrite the grammar in a form the parser generator can accept. However, it isn't always obvious how that should be done, and worse, it affects the Abstract Syntax Tree generated, which means weaknesses in the parser ripple down through the rest of your code, such as the compiler.
The reason there are LL, LALR and LR parsers is that, a long time ago, the job of generating an LR parser was taxing for the computers of the day, both in terms of time and memory. (Note this is the time it takes to generate the parser, which only happens when you write it. The generated parser runs very quickly.) But that was a looong time ago. Generating an LR(1) parser takes far less than 1GB of RAM for a moderately complex language and on a modern computer takes under a second. For that reason you are far better off with an LR automatic parser generator, like Hyacc.
The other option is to write your own parser. In this case there is only one choice: an LL parser. When people here say writing LR is hard, they understate the case. It is near impossible for a human to manually create an LR parser. You might think this means that if you write your own parser you are constrained to use LL(1) grammars. But that isn't quite true. Since you are writing the code, you can cheat. You can look ahead an arbitrary number of symbols, and because you don't have to output anything until you are good and ready, the Abstract Syntax Tree doesn't have to match the grammar you are using. This ability to cheat makes up for all of the lost power between LL and LR(1), and often then some.
Hand-written parsers have their downsides, of course. There is no guarantee that your parser actually matches your grammar, or for that matter no check that your grammar is valid (i.e. recognises the language you think it does). They are longer, and they are even worse at encouraging you to mix parsing code with compile code. They are also obviously implemented in only one language, whereas parser generators often spit out their results in several different languages. Even if they don't, an LR parse table can be represented in a data structure containing only constants (say in JSON), and the actual parser is only 100 lines of code or so. But there are also upsides to hand-written parsers. Because you wrote the code, you know what is going on, so it is easier to do error recovery and generate sane error messages.
In the end, the tradeoff often works like this:
For one-off jobs, you are far better off using an LR(1) parser generator. The generator will check your grammar, save you work, and modern ones spit out the Abstract Syntax Tree directly, which is exactly what you want.
For highly polished tools like mcc or gcc, use a hand written LL parser. You will be writing lots of unit tests to guard your back anyway, error recovery and error messages are much easier to get right, and they can recognise a larger class of languages.
The only other question I have is: why C? Compilers aren't generally time-critical code. There are very nice parsing packages out there that will allow you to get the job done in 1/2 the code if you are willing to have your compiler run a fair bit slower - my own Lrparsing for instance. Bear in mind a "fair bit slower" here means "hardly noticeable to a human". I guess the answer is "the assignment I am working on specifies C". To give you an idea, here is how simple getting from your grammar to parse tree becomes when you relax the requirement. This program:
#!/usr/bin/python
from lrparsing import *

class G(Grammar):
    Exp = Ref("Exp")
    int = Token(re='[0-9]+')
    id = Token(re='[a-zA-Z][a-zA-Z0-9_]*')
    ActArgs = List(Exp, ',', 1)
    FunCall = id + '(' + Opt(ActArgs) + ')'
    Exp = Prio(
        id | int | Tokens("[]", "False True") | Token('(') + List(THIS, ',', 1, 2) + ')' |
        Token("! -") + THIS,
        THIS << Tokens("* / %") << THIS,
        THIS << Tokens("+ -") << THIS,
        THIS << Tokens("== < > <= >= !=") << THIS,
        THIS << Tokens("&&") << THIS,
        THIS << Tokens("||") << THIS,
        THIS << Tokens(":") << THIS)
    Type = (
        Tokens("", "Int Bool") |
        Token('(') + THIS + ',' + THIS + ')' |
        Token('[') + THIS + ']')
    Stmt = (
        Token('{') + THIS * Many + '}' |
        Keyword("if") + '(' + Exp + ')' << THIS + Opt(Keyword('else') + THIS) |
        Keyword("while") + '(' + Exp + ')' + THIS |
        id + '=' + Exp + ';' |
        FunCall + ';' |
        Keyword('return') + Opt(Exp) + ';')
    FArgs = List(Type + id, ',', 1)
    RetType = Type | Keyword('void')
    VarDecl = Type + id + '=' + Exp + ';'
    FunDecl = (
        RetType + id + '(' + Opt(FArgs) + ')' +
        '{' + VarDecl * Many + Stmt * Some + '}')
    Decl = VarDecl | FunDecl
    Prog = Decl * Some
    COMMENTS = Token(re="/[*](?:[^*]|[*][^/])*[*]/") | Token(re="//[^\n\r]*")
    START = Prog

EXAMPLE = """\
Int factorial(Int n) {
    Int result = 1;
    while (n > 1) {
        result = result * n;
        n = n - 1;
    }
    return result;
}
"""

parse_tree = G.parse(EXAMPLE)
# print(...) with a single argument works under both Python 2 and Python 3
print(G.repr_parse_tree(parse_tree))
Produces this output:
(START (Prog (Decl (FunDecl
(RetType (Type 'Int'))
(id 'factorial') '('
(FArgs
(Type 'Int')
(id 'n')) ')' '{'
(VarDecl
(Type 'Int')
(id 'result') '='
(Exp (int '1')) ';')
(Stmt 'while' '('
(Exp
(Exp (id 'n')) '>'
(Exp (int '1'))) ')'
(Stmt '{'
(Stmt
(id 'result') '='
(Exp
(Exp (id 'result')) '*'
(Exp (id 'n'))) ';')
(Stmt
(id 'n') '='
(Exp
(Exp (id 'n')) '-'
(Exp (int '1'))) ';') '}'))
(Stmt 'return'
(Exp (id 'result')) ';') '}'))))
The most efficient way to build a parser is to use a tool whose sole purpose is to build parsers. They used to be called compiler compilers, but nowadays the focus has shifted (broadened) to language workbenches, which provide you with more aid in building your own language. For instance, almost any language workbench would provide you with IDE support and syntax highlighting for your language right off the bat, just by looking at a grammar. They also help immensely with debugging your grammar and your language (you didn’t expect left recursion to be the biggest of your problems, did you?).
Among the best currently supported and developing language workbenches one could name:
Rascal
Spoofax
MPS
MetaEdit+
Xtext
If you are really so inclined, or if you consider writing a parser yourself just for amusement and experience, the best modern algorithms are SGLR, GLL and Packrat. Each one of those is a quintessence of algorithmic research that lasted half a century, so do not expect to understand them fully in a flash, and do not expect any good to come out of the first couple of “fixes” you’ll come up with. If you do come up with a nice improvement, though, do not hesitate to share your findings with the authors or publish it otherwise!
Thank you for all the advice, but we finally decided to build our own recursive-descent parser using exactly the same method as here: http://www.cs.binghamton.edu/~zdu/parsdemo/recintro.html
Indeed, we changed the grammar in order to remove the left-recursive rules, and because the grammar I showed in my first message isn't LL(1), we used our token list (made by our scanner) to look ahead further than one token. It seems to work quite well.
Now we have to build an AST within those recursive functions. Would you have any suggestions or tips? Thank you very much.
The most efficient parsers are LR parsers, but LR parsers are a bit difficult to implement. You can go for the recursive descent parsing technique, as it is easier to implement in C.

Parsec and user defined state

I'm trying to implement a JS parser in Haskell, but I'm stuck on automatic semicolon insertion. I have created a test project to play around with the problem, but I cannot figure out how to solve it.
In my test project, a program is a list of expressions (unary or binary):
data Program = Program [Expression]
data Expression
= UnaryExpression Number
| PlusExpression Number Number
Input stream is a list of tokens:
data Token
= SemicolonToken
| NumberToken Number
| PlusToken
I want to parse inputs like these:
1; - Unary expression
1 + 2; - Binary expression
1; 2 + 3; - Two expressions (unary and binary)
1 2 + 3; - Same as the previous input, but the first semicolon is missing. The parser consumes token 1, but token 2 is not allowed by any production of the grammar (the next expected token is a semicolon or a plus). The rule of automatic semicolon insertion says that in this case a semicolon is automatically inserted before token 2.
So, what is the most elegant way to implement such parser behavior?
You have
expression = try unaryExpression <|> plusExpression
but that doesn't work, since a UnaryExpression is a prefix of a PlusExpression. So for
input2 = [NumberToken Number1, PlusToken, NumberToken Number1, SemicolonToken]
the parser happily parses the first NumberToken and automatically adds a semicolon, since the next token is a PlusToken and not a SemicolonToken. Then it tries to parse the next Expression, but the next is a PlusToken, no Expression can start with that.
Change the order in which the parsers are tried,
expression = try plusExpression <|> unaryExpression
and it will first try to parse a PlusExpression, and only when that fails resort to the shorter parse of a UnaryExpression.
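The effect of the ordering can be illustrated outside Parsec with a small Python sketch. This is an analogue, not the asker's Haskell code: tokens are hypothetical tuples such as `("num", 1)`, `("plus",)` and `("semi",)`, each parser function returns `(node, new_pos)` or `None` without consuming input, and a missing semicolon after a complete expression is simply treated as if one had been inserted.

```python
def number(tokens, pos):
    if pos < len(tokens) and tokens[pos][0] == "num":
        return tokens[pos][1], pos + 1
    return None


def plus_expression(tokens, pos):
    # number '+' number; fails as a whole if any part is missing
    left = number(tokens, pos)
    if left is None:
        return None
    value, pos = left
    if pos < len(tokens) and tokens[pos] == ("plus",):
        right = number(tokens, pos + 1)
        if right is not None:
            return ("plus", value, right[0]), right[1]
    return None


def unary_expression(tokens, pos):
    result = number(tokens, pos)
    if result is None:
        return None
    return ("unary", result[0]), result[1]


def expression(tokens, pos):
    # Longer alternative first, mirroring `try plusExpression <|> unaryExpression`.
    return plus_expression(tokens, pos) or unary_expression(tokens, pos)


def program(tokens):
    pos, result = 0, []
    while pos < len(tokens):
        parsed = expression(tokens, pos)
        if parsed is None:
            raise SyntaxError("unexpected token at position %d" % pos)
        node, pos = parsed
        result.append(node)
        if pos < len(tokens) and tokens[pos] == ("semi",):
            pos += 1      # explicit semicolon
        # otherwise carry on as if a semicolon had been inserted here
    return result
```

With the alternatives in this order, the token list for `1 + 1;` yields one plus expression, and the list for `1 2 + 3;` yields a unary expression followed by a plus expression.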

Resources