I'm using the Alex and Happy lexer/parser generators to implement a parser for the Ethereum smart-contract language Solidity. For now I'm using a reduced grammar to simplify initial development.
I'm running into an error parsing the 'contract' section of my test contract file.
The following is the code for the grammar:
ProgSource :: { ProgSource }
ProgSource : SourceUnit { ProgSource $1 }
SourceUnit : PragmaDirective { SourceUnit $1}
PragmaDirective : "pragma" ident ";" {Pragma $2 }
| {- empty -} { [] }
ImportDirective :
"import" stringLiteral ";" { ImportDir $2 }
ContractDefinition : contract ident "{" ContractPart "}" { Contract $2 $3 }
ContractPart : StateVarDecl { ContractPart $1 }
StateVarDecl : TypeName "public" ident ";" { StateVar $1 $3 }
| TypeName "public" ident "=" Expression ";" { StateV $1 $3 $5 }
The following file is my test 'contract':
pragma solidity;
contract identifier12 {
public variable = 1;
}
The following is the result of passing my test contract to the main function of my parser.
$ cat test.txt | ./main
main: Parse error at TContract (AlexPn 17 2 1)2:1
CallStack (from HasCallStack):
error, called at ./Parser.hs:232:3 in main:Parser
The error suggests that the issue is the first letter of the 'contract' token, at line 2, column 1. But from my understanding this should parse properly?
You defined ProgSource to be a single SourceUnit, so the parser fails when the second one is encountered. I guess you wanted it to be a list of SourceUnits.
The same applies to ContractPart.
Also, didn't you mean to quote "contract" in ContractDefinition? And in the same production, $3 should be $4.
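A sketch of how the list-shaped rules might look (left-recursive, so the lists come out reversed; the constructor names are assumptions based on your grammar):

```
ProgSource : SourceUnits { ProgSource (reverse $1) }

SourceUnits : SourceUnit             { [$1] }
            | SourceUnits SourceUnit { $2 : $1 }

ContractDefinition : "contract" ident "{" ContractParts "}" { Contract $2 (reverse $4) }

ContractParts : ContractPart               { [$1] }
              | ContractParts ContractPart { $2 : $1 }
```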
I am writing a Python interpreter in OCaml using ocamllex, and in order to handle the indentation-based syntax, I want to
tokenize the input using ocamllex
iterate through the list of lexed tokens and insert INDENT and DEDENT tokens as needed for the parser
parse this list into an AST
However, in ocamllex, the lexing step produces a lexbuf stream which can't be easily iterated through to do the indentation checking. Is there a good way to extract a list of tokens from lexbuf, i.e.
let lexbuf = (Lexing.from_channel stdin) in
let token_list = tokenize lexbuf
where token_list has type Parser.token list? My hack was to define a trivial parser like
tokenize: /* used by the parser to read the input into the indentation function */
| token EOL { $1 @ [EOL] }
| EOL { SEP :: [EOL] }
token:
| COLON { [COLON] }
| TAB { [TAB] }
| RETURN { [RETURN] }
...
| token token %prec RECURSE { $1 @ $2 }
and to call this like
let lexbuf = (Lexing.from_channel stdin) in
let temp = (Parser.tokenize Scanner.token) lexbuf in (* char buffer to token list *)
but this has all sorts of issues with shift-reduce errors and unnecessary complexity. Is there a better way to write a lexbuf -> Parser.token list function in OCaml?
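One way to avoid the parser hack entirely is to drain the lexbuf by calling the lexing rule in a loop until it returns EOF. A sketch (the token type here is a hypothetical stand-in for Parser.token; in your code the lexer argument would be Scanner.token):

```ocaml
(* Hypothetical token type standing in for Parser.token. *)
type token = COLON | TAB | RETURN | EOL | EOF

(* Drain the lexbuf into a token list by calling the lexer until EOF.
   [lexer] is any function of type Lexing.lexbuf -> token. *)
let tokenize (lexer : Lexing.lexbuf -> token) (lexbuf : Lexing.lexbuf) : token list =
  let rec loop acc =
    match lexer lexbuf with
    | EOF -> List.rev (EOF :: acc)
    | tok -> loop (tok :: acc)
  in
  loop []

(* After inserting INDENT/DEDENT tokens into the list, it can be turned
   back into a lexer function that a yacc-style parser can consume. *)
let lexer_of_list toks =
  let rem = ref toks in
  fun (_ : Lexing.lexbuf) ->
    match !rem with
    | [] -> EOF
    | t :: ts -> rem := ts; t
```

With this, something like `Parser.program (lexer_of_list processed_tokens) lexbuf` would drive the parser from the processed list (`Parser.program` being a hypothetical entry point).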
I'm in the middle of learning how to parse simple programs.
This is my lexer.
{
open Parser
exception SyntaxError of string
}
let white = [' ' '\t']+
let blank = ' '
let identifier = ['a'-'z']
rule token = parse
| white {token lexbuf} (* skip whitespace *)
| '-' { HYPHEN }
| identifier {
let buf = Buffer.create 64 in
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf;
let content = (Buffer.contents buf) in
STRING(content)
}
| _ { raise (SyntaxError "Unknown stuff here") }
and scan_string buf = parse
| ['a'-'z']+ {
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf
}
| eof { () }
My "ast":
type t =
String of string
| Array of t list
My parser:
%token <string> STRING
%token HYPHEN
%start <Ast.t> yaml
%%
yaml:
| scalar { $1 }
| sequence {$1}
;
sequence:
| sequence_items {
Ast.Array (List.rev $1)
}
;
sequence_items:
(* empty *) { [] }
| sequence_items HYPHEN scalar {
$3::$1
};
scalar:
| STRING { Ast.String $1 }
;
I'm currently at a point where I want to parse either plain 'strings', i.e.

some text

or 'arrays' of 'strings', i.e.

- item1
- item2
When I compile the parser with Menhir I get:
Warning: production sequence -> sequence_items is never reduced.
Warning: in total, 1 productions are never reduced.
I'm pretty new to parsing. Why is this never reduced?
You declare that your entry point to the parser is called main
%start <Ast.t> main
But I can't see a main production in your code. Maybe the entry point is supposed to be yaml? If that is changed, does the error still persist?
Also, try adding an EOF token to your lexer and to the entry-level production, like this:
parse_yaml: yaml EOF { $1 }
See here for example: https://github.com/Virum/compiler/blob/28e807b842bab5dcf11460c8193dd5b16674951f/grammar.mly#L56
The link to Real World OCaml below also discusses how to use EOF; I think this will solve your problem.
By the way, it's really cool that you are writing a YAML parser in OCaml. If made open source, it will be really useful to the community. Note that YAML is indentation-sensitive, so to parse it with Menhir you will need your lexer to produce some kind of INDENT and DEDENT tokens. Also, YAML is a strict superset of JSON, which means it might (or might not) make sense to start with a JSON subset and then expand. Real World OCaml shows how to write a JSON parser using Menhir:
https://dev.realworldocaml.org/16-parsing-with-ocamllex-and-menhir.html
I'm implementing a parser for a language similar to Oberon.
I've successfully written the lexer using Alex since I can see that the list of tokens returned by the lexer is correct.
When I give the tokens list to the parser, it stops at the first token.
This is my parser:
...
%name myParse
%error { parseError }
%token
KW_PROCEDURE { KW_TokenProcedure }
KW_END { KW_TokenEnd }
';' { KW_TokenSemiColon }
identifier { TokenVariableIdentifier $$ }
%%
ProcedureDeclaration : ProcedureHeading ';' ProcedureBody identifier { putStrLn("C") }
ProcedureHeading : KW_PROCEDURE identifier { putStrLn("D") }
ProcedureBody : KW_END { putStrLn("E") }
| DeclarationSequence KW_END { putStrLn("F") }
DeclarationSequence : ProcedureDeclaration { putStrLn("G") }
{
parseError :: [Token] -> a
parseError _ = error "Parse error"
main = do
inStr <- getContents
print (alexScanTokens inStr)
myParse (alexScanTokens inStr)
putStrLn("DONE")
}
This is the test code I give to the parser:
PROCEDURE proc;
END proc
This is the token list returned by the lexer:
[KW_TokenProcedure,TokenVariableIdentifier "proc",KW_TokenSemiColon,KW_TokenEnd,TokenVariableIdentifier "proc"]
The parser doesn't give any error, but it stops at my ProcedureDeclaration rule, printing only C.
This is what the output looks like:
C
DONE
Any idea why?
UPDATE:
I've made a first step forward and was able to parse the test input given before. Now I've changed my parser to recognize declarations of multiple procedures at the same level. This is what my new parser looks like:
...
%name myParse
%error { parseError }
%token
KW_PROCEDURE { KW_TokenProcedure }
KW_END { KW_TokenEnd }
';' { KW_TokenSemiColon }
identifier { TokenVariableIdentifier $$ }
%%
ProcedureDeclarationList : ProcedureDeclaration { $1 }
| ProcedureDeclaration ';' ProcedureDeclarationList { $3:[$1] }
ProcedureDeclaration : ProcedureHeading ';' ProcedureBody identifier { addProcedureToProcedure $1 $3 }
ProcedureHeading : KW_PROCEDURE identifier { defaultProcedure { procedureName = $2 } }
ProcedureBody : KW_END { Nothing }
| DeclarationSequence KW_END { Just $1 }
DeclarationSequence : ProcedureDeclarationList { $1 }
{
parseError :: [Token] -> a
parseError _ = error "Parse error"
main = do
inStr <- getContents
let result = myParse (alexScanTokens inStr)
putStrLn ("result: " ++ show(result))
}
The thing is, it fails to compile giving me this error:
Occurs check: cannot construct the infinite type: t5 ~ [t5]
Expected type: HappyAbsSyn t5 t5 t6 t7 t8 t9
-> HappyAbsSyn t5 t5 t6 t7 t8 t9
-> HappyAbsSyn t5 t5 t6 t7 t8 t9
-> HappyAbsSyn t5 t5 t6 t7 t8 t9
Actual type: HappyAbsSyn t5 t5 t6 t7 t8 t9
-> HappyAbsSyn t5 t5 t6 t7 t8 t9
-> HappyAbsSyn t5 t5 t6 t7 t8 t9
-> HappyAbsSyn [t5] t5 t6 t7 t8 t9
...
I know for sure that it's caused by the second alternative of my ProcedureDeclarationList rule, but I don't understand why.
There are two things to note here.
happy uses the first production rule as the top-level production for myParse.
Your first production rule is ProcedureDeclaration, so that's all it's going to try to parse. You probably want to make DeclarationSequence the first rule.
The return types of your productions are IO actions, and in Haskell IO actions are values. They are not "executed" until they become part of main. That means you need to write your productions like this:
DeclarationSequence : ProcedureDeclaration
{ do $1; putStrLn("G") }
ProcedureDeclaration : ProcedureHeading ';' ProcedureBody identifier
{ do $1; $3; putStrLn("C") }
That is, the return value of the DeclarationSequence rule is the IO action returned by ProcedureDeclaration, followed by putStrLn "G".
And the return value of the ProcedureDeclaration rule is the action returned by ProcedureHeading, followed by the action returned by ProcedureBody, followed by putStrLn "C".
You could also write the RHS of the rules using the >> operator:
{ $1 >> putStrLn "G" }
{ $1 >> $3 >> putStrLn "C" }
Note that you have to decide the order in which to sequence the actions, i.e. pre-/post-/in-order.
Working example: http://lpaste.net/162432
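As for the update's infinite-type error: the two alternatives of ProcedureDeclarationList return different types, a single declaration in the first and a list in the second (and $3:[$1] conses the list $3 onto a list of declarations, which forces the element type to equal the list type). A sketch of consistent list-building, keeping the names from your grammar:

```
ProcedureDeclarationList : ProcedureDeclaration                              { [$1] }
                         | ProcedureDeclaration ';' ProcedureDeclarationList { $1 : $3 }
```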
It seems your expression has been parsed just fine. Check the return type of myParse; I guess it will be IO (), and the actual action will be putStrLn("C"), which is what you wrote in ProcedureDeclaration. Next, your call to myParse in the do block will be interpreted as print .. >> myParse (..) >> putStrLn .., i.e. chaining monadic actions. myParse will return an action that prints "C", so the output is exactly what one would expect.
You have other actions defined in ProcedureBody and DeclarationSequence, but you never use these actions in any way. It's as if you wrote:
do
let a = putStrLn "E"
putStrLn("C")
which will output "C"; a is not used in any way. The same goes for your parser. If you want to invoke these actions, try writing $1 >> putStrLn("C") >> $3 in the code associated with ProcedureDeclaration.
I would like to parse a set of expressions, for instance: X[3], X[-3], XY[-2], X[4]Y[2], etc.
In my parser.mly, index (which is inside []) is defined as follows:
index:
| INTEGER { $1 }
| MINUS INTEGER { 0 - $2 }
The tokens INTEGER, MINUS, etc. are defined in the lexer as usual.
When I try to parse an example, it fails. However, if I comment out | MINUS INTEGER { 0 - $2 }, it works well. So the problem is certainly related to that. To debug, I want more information; in other words, I want to know what is considered to be MINUS INTEGER. I tried to add a print:
index:
| INTEGER { $1 }
| MINUS INTEGER { Printf.printf "%n" $2; 0 - $2 }
But nothing is printed while parsing.
Could anyone tell me how to print information or debug that?
I tried coming up with an example of what you describe and was able to get output of 8 with what I show below. [This example is completely stripped down so that it only works for [1] and [- 1], but I believe it's logically equivalent to what you said you did.]
However, I also notice that the debug string in your example does not have an explicit flush with %! at the end, so the debugging output might not be flushed to the terminal until later than you expect.
Here's what I used:
Test.mll:
{
open Ytest
open Lexing
}
rule test =
parse
"-" { MINUS }
| "1" { ONE 1 }
| "[" { LB }
| "]" { RB }
| [ ' ' '\t' '\r' '\n' ] { test lexbuf }
| eof { EOFTOKEN }
Ytest.mly:
%{
%}
%token <int> ONE
%token MINUS LB RB EOFTOKEN
%start item
%type <int> index item
%%
index:
ONE { 2 }
| MINUS ONE { Printf.printf "%n" 8; $2 }
item : LB index RB EOFTOKEN { $2 }
Parse.ml
open Test;;
open Ytest;;
open Lexing;;
let lexbuf = Lexing.from_channel stdin in
ignore (Ytest.item Test.test lexbuf)
I'm trying to build up some skills in lexing/parsing grammars. I'm looking back on a simple parser I wrote for SQL, and I'm not altogether happy with it -- it seems like there should have been an easier way to write the parser.
SQL tripped me up because it has a lot of optional tokens and repetition. For example:
SELECT *
FROM t1
INNER JOIN t2
INNER JOIN t3
WHERE t1.ID = t2.ID and t1.ID = t3.ID
Is equivalent to:
SELECT *
FROM t1
INNER JOIN t2 ON t1.ID = t2.ID
INNER JOIN t3 on t1.ID = t3.ID
The ON and WHERE clauses are optional and can occur more than once. I handled these in my parser as follows:
%{
open AST
%}
// ...
%token <string> ID
%token AND OR COMMA
%token EQ LT LE GT GE
%token JOIN INNER LEFT RIGHT ON
// ...
%%
op: EQ { Eq } | LT { Lt } | LE { Le } | GT { Gt } | GE { Ge }
// WHERE clause is optional
whereClause:
| { None }
| WHERE whereList { Some($2) }
whereList:
| ID op ID { Cond($1, $2, $3) }
| ID op ID AND whereList { And(Cond($1, $2, $3), $5) }
| ID op ID OR whereList { Or(Cond($1, $2, $3), $5) }
// Join clause is optional
joinList:
| { [] }
| joinClause { [$1] }
| joinClause joinList { $1 :: $2 }
joinClause:
| INNER JOIN ID joinOnClause { $3, Inner, $4 }
| LEFT JOIN ID joinOnClause { $3, Left, $4 }
| RIGHT JOIN ID joinOnClause { $3, Right, $4 }
// "On" clause is optional
joinOnClause:
| { None }
| ON whereList { Some($2) }
// ...
%%
In other words, I handled optional syntax by breaking it into separate rules, and handled repetition using recursion. This works, but it breaks parsing into a bunch of little subroutines, and it's very hard to see what the grammar actually represents.
I think it would be much easier to write if I could specify optional syntax inside brackets and repetition with an * or +. This would reduce my whereClause and joinList rules to the following:
whereClause:
| { None }
// $1 $2, I envision $2 being return as an (ID * op * ID * cond) list
| WHERE [ ID op ID { (AND | OR) }]+
{ Some([for name1, op, name2, _ in $1 -> name1, op, name2]) }
joinClause:
| { None }
// $1, returned as (joinType
// * JOIN
// * ID
// * ((ID * op * ID) list) option) list
| [ (INNER | LEFT | RIGHT) JOIN ID [ON whereClause] ]*
{ let joinType, _, table, onClause = $1;
Some(table, joinType, onClause) }
I think this form is much easier to read and captures the intended grammar more intuitively. Unfortunately, I can't find anything in either the OCaml or F# documentation which supports this notation or anything similar.
Is there an easy way to represent grammars with optional or repetitive tokens in OcamlYacc or FsYacc?
When you compose all the little pieces, you should get something like what you want, though. Instead of:
(INNER | RIGHT | LEFT )
you just have
inner_right_left
and define that to be the union of those three keywords.
You can also define the union in the lexer, in the way you define the tokens, or using camlp4. I haven't done much in that area, so I cannot advise you to take those routes, and I don't think they'll work as well for you as just having little pieces everywhere.
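For instance, the (INNER | LEFT | RIGHT) alternation might become a small rule of its own, with the join clause sharing it (a sketch against the AST names in the question):

```
joinKind:
  | INNER { Inner }
  | LEFT  { Left }
  | RIGHT { Right }

joinClause:
  | joinKind JOIN ID joinOnClause { $3, $1, $4 }
```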
EDIT:
So, for camlp4 you can look at the Camlp4 Grammar module, a tutorial, and a better tutorial. This isn't exactly what you want, mind you, but it's pretty close. The documentation is pretty bad, as noted in a recent discussion in the OCaml groups, but for this specific area I don't think you'll have too many problems. I did a little with it and can field more questions.
Menhir allows you to parametrize nonterminal symbols by other symbols, and provides a library of standard shortcuts, like optionals and lists; you can also create your own. Example:
option(X): x = X { Some x }
         | { None }
There is also some syntax sugar, 'token?' is equivalent to 'option(token)', 'token+' to 'nonempty_list(token)'.
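Applied to the question's grammar, the joinList and joinOnClause rules could collapse to something like this (a sketch; list, option and preceded are combinators from Menhir's standard library):

```
joinList:
  | clauses = list(joinClause) { clauses }

joinOnClause:
  | cond = option(preceded(ON, whereList)) { cond }
```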
All of this really shortens grammar definitions. It is also supported by ocamlbuild and can be a drop-in replacement for ocamlyacc. Highly recommended!
Funny, I used it to parse SQL too :)