Proper Bison definition file for a multi-statement language with Jison

I am trying to get a handle on Jison, which is a JavaScript implementation of Bison. The specific language I am trying to parse looks like this:
foo(10)
bar()
foo(20)
baz()
I want to parse this into something like:
return [
  { func: 'foo', arg: 10 },
  { func: 'bar' },
  { func: 'foo', arg: 20 },
  { func: 'baz' }
]
So far, my Jison definition looks like this:
var grammar = {
  'lex': {
    'rules': [
      ['\\s+', '/* skip whitespace */'],
      ['[0-9]+\\b', 'return "INTEGER";'],
      ['\\(', 'return "OPEN_PAREN"'],
      ['\\)', 'return "CLOSE_PAREN"'],
      ['[\\w]+\\s*(?=\\()', 'return "FUNC_NAME"'],
      ['$', 'return "LINE_END"']
    ]
  },
  'bnf': {
    "expression": [["e LINE_END", "return $1;"]],
    "e": [
      ["FUNC_NAME OPEN_PAREN INTEGER CLOSE_PAREN", "$$ = { func: $1, arg: $3 };"],
      ["FUNC_NAME OPEN_PAREN CLOSE_PAREN", "$$ = { func: $1 };"]
    ]
  }
};
When I run this, I get this error:
Error: Parse error on line 1:
forward(10)turnLeft()forward(2
-----------^
Expecting 'LINE_END', got 'FUNC_NAME'
at Parser.parseError (eval at createParser (/Users/user/project/node_modules/jison/lib/jison.js:1327:13), <anonymous>:30:21)
at Parser.parse (eval at createParser (/Users/user/project/node_modules/jison/lib/jison.js:1327:13), <anonymous>:97:22)
at Object.<anonymous> (/Users/user/project/generate-parser.js:39:20)
at Module._compile (module.js:660:30)
at Object.Module._extensions..js (module.js:671:10)
at Module.load (module.js:573:32)
at tryModuleLoad (module.js:513:12)
at Function.Module._load (module.js:505:3)
at Function.Module.runMain (module.js:701:10)
at startup (bootstrap_node.js:193:16)
Clearly, I am not understanding something about how line endings separate individual statements. I've looked at the examples for Jison and read and re-read the Bison docs, but I still do not fully understand the "Semantic Actions" aspect.
What am I missing here? In my mind, I have defined e in two terminal forms, plus the nonterminal form e LINE_END.
Any help would be greatly appreciated, thanks!
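For context, here is one way the grammar can be restructured to accept multiple statements. This is a minimal, untested sketch: the NEWLINE and EOF tokens, the program/statements rules, and the Number($3) coercion are illustrative additions, not from the question. The idea is to stop '\s+' from swallowing the newline that terminates each statement, and to make the start symbol a recursive list:

var grammar = {
  'lex': {
    'rules': [
      // Newlines separate statements, so return them as a token
      // instead of skipping them with the rest of the whitespace.
      ['\\n+', 'return "NEWLINE";'],
      ['[^\\S\\n]+', '/* skip spaces and tabs */'],
      ['[0-9]+\\b', 'return "INTEGER";'],
      ['\\(', 'return "OPEN_PAREN";'],
      ['\\)', 'return "CLOSE_PAREN";'],
      // trim any whitespace the lookahead lets into the match
      ['[\\w]+\\s*(?=\\()', 'yytext = yytext.trim(); return "FUNC_NAME";'],
      ['$', 'return "EOF";']
    ]
  },
  'bnf': {
    'program': [
      ['statements EOF', 'return $1;'],
      ['statements NEWLINE EOF', 'return $1;'] // tolerate a trailing newline
    ],
    'statements': [
      ['statements NEWLINE e', '$$ = $1; $$.push($3);'],
      ['e', '$$ = [$1];']
    ],
    'e': [
      ['FUNC_NAME OPEN_PAREN INTEGER CLOSE_PAREN', '$$ = { func: $1, arg: Number($3) };'],
      ['FUNC_NAME OPEN_PAREN CLOSE_PAREN', '$$ = { func: $1 };']
    ]
  }
};

With the original rules, '\s+' consumed every newline, so the parser only ever saw a LINE_END at the very end of the input, and the single-expression start rule e LINE_END failed as soon as a second FUNC_NAME arrived, which is exactly the error shown above.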

Related

Jison parser generator shift reduce conflict with parenthesis, how to solve?

I'm trying to implement parentheses in my parser, but I get a conflict in my grammar:
"Conflict in grammar: multiple actions possible when lookahead token is )"
Here is a simplified version of it:
// grammar
{
  "Root": ["", "Body"],
  "Body": ["Line", "Body TERMINATOR Line"],
  "Line": ["Expression", "Statement"],
  "Statement": ["VariableDeclaration", "Call", "With", "Invocation"],
  "Expression": ["Value", "Parenthetical", "Operation", "Assign"],
  "Identifier": ["IDENTIFIER"],
  "Literal": ["String", "Number"],
  "Value": ["Literal", "ParenthesizedInvocation"],
  "Accessor": [". Property"],
  "ParenthesizedInvocation": ["Value ParenthesizedArgs"],
  "Invocation": ["Value ArgList"],
  "Call": ["CALL ParenthesizedInvocation"],
  "ParenthesizedArgs": ["( )", "( ArgList )"],
  "ArgList": ["Arg", "ArgList , Arg"],
  "Arg": ["Expression", "NamedArg"],
  "NamedArg": ["Identifier := Value"],
  "Parenthetical": ["( Expression )"],
  "Operation": ["Expression + Expression", "Expression - Expression"]
}
//precedence
[
  ['right', 'RETURN'],
  ['left', ':='],
  ['left', '='],
  ['left', 'IF'],
  ['left', 'ELSE', 'ELSE_IF'],
  ['left', 'LOGICAL'],
  ['left', 'COMPARE'],
  ['left', '&'],
  ['left', '-', '+'],
  ['left', 'MOD'],
  ['left', '\\'],
  ['left', '*', '/'],
  ['left', '^'],
  ['left', 'CALL'],
  ['left', '(', ')'],
  ['left', '.']
]
In my implementation I need function calls like this (parenthesized, with comma-separated arguments):
Foo(1, 2)
Foo 1, 2
I also need to be able to use regular parentheses for grouping operations, even in function calls (but only in parenthesized function calls):
Foo(1, (2 + 4) / 2)
Foo 1, 2
A function call without parentheses is treated as a statement; a function call with parentheses is treated as an expression.
How can I solve this conflict?
In VBA, function call statements (as opposed to expressions) have two forms (simplified):
CALL name '(' arglist ')'
name arglist
Note that the second one does not have parentheses around the argument list. That's precisely to avoid the ambiguity of:
Func (3)
which is the ambiguity you're running into.
The ambiguity is that it is not clear whether the parentheses surround an argument list or surround a parenthesized expression. That's not an essential ambiguity, since the result is effectively the same. But it's still important because of the possibility that the program continues like this:
Foo (3), (4)
in which case, it is essential that the parentheses be parsed as parentheses surrounding a parenthesized expression.
So one possibility is to modify your grammar to be similar to the grammar in the VBA reference:
call-statement = "Call" (simple-name-expression / member-access-expression / index-expression / with-expression)
call-statement =/ (simple-name-expression / member-access-expression / with-expression) argument-list
But I suppose that you really want to implement a language similar to VBA without being strictly conformant. That makes things slightly more complicated.
As a first approximation, you can require that the form name '(' [arglist] ')' have at least two arguments (unless it's empty):
# Not tested
"Invocation": ["Literal '(' ArgList2 ')' ", "Literal '(' ')' ", "Literal ArgList"],
"ArgList": ["Arg", "ArgList2"],
"ArgList2": ["Arg ',' Arg", "ArgList2 ',' Arg"],

Happy Parse Error

I'm currently using the Alex and Happy lexer/parser generators to implement a parser for the Ethereum smart contract language Solidity. For now I'm using a reduced grammar in order to simplify the initial development.
I'm running into an error parsing the 'contract' section of my test contract file.
The following is the code for the grammar:
ProgSource :: { ProgSource }
ProgSource : SourceUnit { ProgSource $1 }

SourceUnit : PragmaDirective { SourceUnit $1 }

PragmaDirective : "pragma" ident ";" { Pragma $2 }
                | {- empty -} { [] }

ImportDirective : "import" stringLiteral ";" { ImportDir $2 }

ContractDefinition : contract ident "{" ContractPart "}" { Contract $2 $3 }

ContractPart : StateVarDecl { ContractPart $1 }

StateVarDecl : TypeName "public" ident ";" { StateVar $1 $3 }
             | TypeName "public" ident "=" Expression ";" { StateV $1 $3 $5 }
The following file is my test 'contract':
pragma solidity;
contract identifier12 {
    public variable = 1;
}
The following is the result of passing my test contract into the main function of my parser:
$ cat test.txt | ./main
main: Parse error at TContract (AlexPn 17 2 1)2:1
CallStack (from HasCallStack):
error, called at ./Parser.hs:232:3 in main:Parser
The error suggests that the issue is the first letter of the 'contract' token, on line 2, column 1. But from my understanding, this should parse properly?
You defined ProgSource to be a single SourceUnit, so the parser fails when the second one is encountered. I guess you wanted it to be a list of SourceUnits.
The same applies to ContractPart.
Also, didn't you mean to quote "contract" in ContractDefinition? And in the same production, $3 should be $4.
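A rough sketch of the list-shaped rules that fix suggests (untested, and it assumes the ProgSource and Contract constructors are adjusted to hold lists):

ProgSource :: { ProgSource }
ProgSource : SourceUnits { ProgSource (reverse $1) }

-- left-recursive list, built in reverse
SourceUnits : SourceUnit { [$1] }
            | SourceUnits SourceUnit { $2 : $1 }

-- "contract" quoted, and $4 for the body, per the points above
ContractDefinition : "contract" ident "{" ContractParts "}" { Contract $2 (reverse $4) }

ContractParts : ContractPart { [$1] }
              | ContractParts ContractPart { $2 : $1 }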

How to parse type names defined during parse

I'm using pegjs to define a grammar that allows new types to be defined. How do I then recognize those types subsequent to their definition? I have a production that defines the built-in types, e.g.
BuiltInType
= "int"
/ "float"
/ "string"
/ TYPE_NAME
But what do I do for the last one? I don't know what possible strings will be type names until they are defined in the source code.
In the traditional way of parsing where there is both a lexer and a parser, the parser would add the type name to a table and the lexer would use this table to determine whether to return TYPE_NAME or IDENTIFIER for a particular token. But pegjs does not have this separation.
You're right: you cannot (easily) modify the parser pegjs generates on the fly without knowing a lot about its internals. But what you lose relative to a standard LALR parser, you gain in the ability to intersperse JavaScript code throughout the parser rules themselves.
To accomplish your goal, you'll need to recognize new types (in context) and keep them for use later, as in:
{
  // predefined types
  const types = {'int':true, 'float':true, 'string':true}
  // variable storage
  const vars = {}
}

start = statement statement* {
  console.log(JSON.stringify({types:types,vars:vars}, null, 2))
}

statement
  = WS* typedef EOL
  / WS* vardef EOL

typedef "new type definition" // eg. 'define myNewType'
  = 'define' SP+ type:symbol {
      if(types[type]) {
        throw `attempted redefinition of: "${type}"`
      }
      types[type]=true
    }

// And then, when you need to recognize a type, something like:
vardef "variable declaration" // eg: 'let foo:myNewType=10'
  = 'let' SP+ name:symbol COLON type:symbol SP* value:decl_assign? {
      if(!types[type]) {
        throw `unknown type encountered: ${type}`
      }
      vars[name] = { name: name, type:type, value: value }
    }

decl_assign "variable declaration assignment"
  = '=' SP* value:number {
      return value
    }

symbol = $( [a-zA-Z][a-zA-Z0-9]* )
number = $( ('+' / '-')? [1-9][0-9]* ( '.' [0-9]+ )? )

COLON = ':'
SP = [ \t]
WS = [ \t\n]
EOL = '\n'
which, when asked to parse:
define fooType
let bar:fooType = 1
will print:
{
  "types": {
    "int": true,
    "float": true,
    "string": true,
    "fooType": true
  },
  "vars": {
    "bar": {
      "name": "bar",
      "type": "fooType",
      "value": "1"
    }
  }
}
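To try it, the grammar can be compiled with the pegjs API. A sketch, where grammarSource stands for the grammar text above; depending on the pegjs version, the entry point is generate (or buildParser in older releases):

const peg = require('pegjs')

// grammarSource holds the grammar text shown above
const parser = peg.generate(grammarSource)

// prints the {types, vars} summary from the start rule's action
parser.parse('define fooType\nlet bar:fooType = 1\n')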

What causes Happy to throw a parse error?

I've written a lexer in Alex and I'm trying to hook it up to a parser written in Happy. I'll try my best to summarize my problem without pasting huge chunks of code.
I know from my unit tests of my lexer that the string "\x7" is lexed to:
[TokenNonPrint '\x7', TokenEOF]
My token type (spit out by the lexer) is Token. I've defined lexWrap and alexEOF as described here, which gives me the following header and token declarations:
%name parseTokens
%tokentype { Token }
%lexer { lexWrap } { alexEOF }
%monad { Alex }
%error { parseError }

%token
  NONPRINT { TokenNonPrint $$ }
  PLAIN    { TokenPlain $$ }
I invoke the parser+lexer combo with the following:
parseExpr :: String -> Either String [Expr]
parseExpr s = runAlex s parseTokens
And here are my first few productions:
exprs :: { [Expr] }
exprs
  : {- empty -} { trace "exprs 30" [] }
  | exprs expr  { trace "exprs 31" $ $2 : $1 }

nonprint :: { Cmd }
  : NONPRINT { NonPrint $ parseNonPrint $1 }

expr :: { Expr }
expr
  : nonprint { trace "expr 44" $ Cmd $ $1 }
  | PLAIN    { trace "expr 37" $ Plain $1 }
I'll leave out the datatype declarations of Expr and NonPrint since they're long and only the constructors Cmd and NonPrint matter here. The function parseNonPrint is defined at the bottom of Parse.y as:
parseNonPrint :: Char -> NonPrint
parseNonPrint '\x7' = Bell
Also, my error handling function looks like:
parseError :: Token -> Alex a
parseError tokens = error ("Error processing token: " ++ show tokens)
Written like this, I expect the following hspec test to pass:
parseExpr "\x7" `shouldBe` Right [Cmd (NonPrint Bell)]
But instead, I see "exprs 30" print once (even though I'm running 5 different unit tests) and all of my tests of parseExpr return Right []. I don't understand why that would be the case, but I changed the exprs production to prevent it:
exprs :: { [Expr] }
exprs
  : expr       { trace "exprs 30" [$1] }
  | exprs expr { trace "exprs 31" $ $2 : $1 }
Now all of my tests fail on the first token they hit --- parseExpr "\x7" fails with:
uncaught exception: ErrorCall (Error processing token: TokenNonPrint '\a')
And I'm thoroughly confused, since I would expect the parser to take the path exprs -> expr -> nonprint -> NONPRINT and succeed. I don't see why this input would put the parser in an error state. None of the trace statements are hit (optimized away?).
What am I doing wrong?
It turns out the cause of this error was the innocuous line
%lexer { lexWrap } { alexEOF }
which was recommended by the linked question about using Alex with Happy (unfortunately, one of the top Google results for queries like "using Alex as a monadic lexer with Happy"). The fix is to change it to the following:
%lexer { lexWrap } { TokenEOF }
I had to dig into the generated code to uncover the issue. It is caused by the code derived from the %token directive, which looks as follows (I commented out all of my token declarations except for TokenNonPrint while trying to track down the error):
happyNewToken action sts stk
  = lexWrap(\tk ->
      let cont i = happyDoAction i tk action sts stk in
      case tk of {
        alexEOF -> happyDoAction 2# tk action sts stk; -- !!!!
        TokenNonPrint happy_dollar_dollar -> cont 1#;
        _ -> happyError' tk
      })
Evidently, Happy transforms each line of the %token directive into one branch of a pattern match. It also inserts a branch for whatever was identified as the EOF token in the %lexer directive.
By inserting the name of a value, alexEOF, rather than a data constructor, TokenEOF, this branch of the case statement has the effect of re-binding the name alexEOF to whatever token was passed in to lexWrap, shadowing the original binding and short-circuiting the case statement so that it hits the EOF rule every time, which somehow results in Happy entering an error state.
The mistake isn't caught by the type system, since the identifier alexEOF (or TokenEOF) doesn't appear anywhere else in the generated code. Misusing the %lexer directive like this will cause GHC to emit a warning, but, since the warning appears in generated code, it's impossible to distinguish it from all of the other harmless warnings the code throws out.
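The underlying pitfall can be reproduced without Happy at all. In this small, self-contained example (mine, not taken from the generated code), the lowercase name in the first case branch is a fresh binding that matches anything, shadowing the top-level value of the same name, just as alexEOF did above:

alexEOF :: Int
alexEOF = 0

classify :: Int -> String
classify tk = case tk of
  alexEOF -> "eof"   -- binds tk to the name alexEOF; matches every input
  1       -> "one"   -- unreachable
  _       -> "other" -- unreachable

main :: IO ()
main = putStrLn (classify 1)  -- prints "eof"; GHC only emits warnings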

Errors and failures in Scala Parser Combinators

I would like to implement a parser for some defined language using Scala Parser Combinators. However, the software that will compile the language does not implement all of the language's features, so I would like to fail if those features are used. I tried to forge a small example below:
object TestFail extends JavaTokenParsers {
  def test: Parser[String] =
    "hello" ~ "world" ^^ { case _ => ??? } |
    "hello" ~ ident ^^ { case "hello" ~ id => s"hi, $id" }
}
I.e., the parser succeeds on "hello" + some identifier, but fails if the identifier is "world". I see that there exist fail() and err() parsers in the Parsers class, but I cannot figure out how to use them, as they return Parser[Nothing] instead of a String. The documentation does not seem to cover this use case…
In this case you want err, not failure, since if the first parser in a disjunction fails you'll just move on to the second, which isn't what you want.
The other issue is that ^^ is the equivalent of map, but you want flatMap, since err("whatever") is a Parser[Nothing], not a Nothing. You could use the flatMap method on Parser, but in this context it's more idiomatic to use the (completely equivalent) >> operator:
object TestFail extends JavaTokenParsers {
  def test: Parser[String] =
    "hello" ~> "world" >> (x => err(s"Can't say hello to the $x!")) |
    "hello" ~ ident ^^ { case "hello" ~ id => s"hi, $id" }
}
Or, a little more simply:
object TestFail extends JavaTokenParsers {
  def test: Parser[String] =
    "hello" ~ "world" ~> err(s"Can't say hello to the world!") |
    "hello" ~ ident ^^ { case "hello" ~ id => s"hi, $id" }
}
Either approach should do what you want.
You could use the ^? method:
object TestFail extends JavaTokenParsers {
  def test: Parser[String] =
    "hello" ~> ident ^? (
      { case id if id != "world" => s"hi, $id" },
      s => s"Should not use '$s' here."
    )
}
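For completeness, a sketch of a small driver for either version (my own code; the exact position and message formatting of the output depends on the scala-parser-combinators version):

object Main extends App {
  import TestFail._
  // succeeds: prints something like [1.11] parsed: hi, john
  println(parseAll(test, "hello john"))
  // hits the error case: prints the custom message from err / ^?
  println(parseAll(test, "hello world"))
}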
