I'm implementing an OCaml-like language using rust-peg, and my parser has a bug.
I defined an if-expression grammar, but it doesn't work.
I'm guessing the test-case input is parsed as Apply(Apply(Apply(Apply(f, then), 2), else), 3). That is, "then" is parsed as an Ident, not as a keyword.
I have no idea how to fix this apply-expression grammar. Do you have any ideas?
#[derive(Clone, PartialEq, Eq, Debug)]
pub enum Expression {
    Number(i64),
    If {
        cond: Box<Expression>,
        conseq: Box<Expression>,
        alt: Box<Expression>,
    },
    Ident(String),
    Apply(Box<Expression>, Box<Expression>),
}
use peg::parser;
use toplevel::expression;
use Expression::*;

parser! {
pub grammar toplevel() for str {
    rule _() = [' ' | '\n']*

    pub rule expression() -> Expression
        = expr()

    rule expr() -> Expression
        = if_expr()
        / apply_expr()

    rule if_expr() -> Expression
        = "if" _ cond:expr() _ "then" _ conseq:expr() _ "else" _ alt:expr() {
            Expression::If {
                cond: Box::new(cond),
                conseq: Box::new(conseq),
                alt: Box::new(alt)
            }
        }

    rule apply_expr() -> Expression
        = e1:atom() _ e2:atom() { Apply(Box::new(e1), Box::new(e2)) }
        / atom()

    rule atom() -> Expression
        = number()
        / id:ident() { Ident(id) }

    rule number() -> Expression
        = n:$(['0'..='9']+) { Expression::Number(n.parse().unwrap()) }

    rule ident() -> String
        = id:$(['a'..='z' | 'A'..='Z']['a'..='z' | 'A'..='Z' | '0'..='9']*) { id.to_string() }
}}
fn main() {
    assert_eq!(expression("1"), Ok(Number(1)));
    assert_eq!(
        expression("myFunc 10"),
        Ok(Apply(
            Box::new(Ident("myFunc".to_string())),
            Box::new(Number(10))
        ))
    );
    // failed
    assert_eq!(
        expression("if f then 2 else 3"),
        Ok(If {
            cond: Box::new(Ident("f".to_string())),
            conseq: Box::new(Number(2)),
            alt: Box::new(Number(3))
        })
    );
}
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `Err(ParseError { location: LineCol { line: 1, column: 11, offset: 10 }, expected: ExpectedSet { expected: {"\"then\"", "\' \' | \'\\n\'"} } })`,
right: `Ok(If { cond: Ident("f"), conseq: Number(2), alt: Number(3) })`', src/main.rs:64:5
PEG uses ordered choice. This means that when you write R = A / B for some rule R, then wherever A successfully parses, the parser will never try B at that position, even if the choice of A leads to problems later on. This is the core difference from context-free grammars, and it is often overlooked.
In particular, when you write apply = atom atom / atom, then wherever it's possible to parse two atoms in a row, the parser will never try to parse just a single one, even if that means the rest of the input no longer makes sense.
Combine this with the fact that then and else are perfectly valid identifiers in your grammar, and you get the issue that you see.
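The usual fix is to stop keywords from matching as identifiers at all. Below is a minimal sketch using the peg crate's {? ... } fallible action blocks; the reserved-word list is an assumption based on the keywords appearing in your grammar:

rule ident() -> String
    = id:$(['a'..='z' | 'A'..='Z']['a'..='z' | 'A'..='Z' | '0'..='9']*) {?
        // Reject reserved words so "then" and "else" can never parse as Ident;
        // the &str is the "expected" label reported when the check fails.
        match id {
            "if" | "then" | "else" => Err("identifier"),
            _ => Ok(id.to_string()),
        }
    }

With this change, apply_expr fails to parse f then as two atoms, the ordered choice falls back to a single atom, and if_expr can then consume the then keyword as intended.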
I am using the PEST parser and testing a simple example to get familiar with the syntax. I am trying to match every instance of ++ throughout the string, but I am running into some issues. I think it may be an issue with the ANY keyword, but I am not sure. Can anyone help point me in the right direction as to what is going wrong?
Here is my grammar.pest file:
incrementing = {(prefix ~ ANY+ ~ "++" ~ suffix)}
prefix = {(NEWLINE | WHITESPACE)*}
suffix = {(NEWLINE | WHITESPACE)*}
WHITESPACE = _{ " " }
Here is my test case
// Parses the file contents for a matching rule and returns all instances of the rule
fn parse_file_contents_for_rule(rule: Rule, file_contents: &str) -> Option<Pairs<Rule>> {
    SolgaParser::parse(rule, file_contents).ok()
}

fn parse_incrementing(file_contents: &str) {
    // parse the file for the rule
    let targets = parse_file_contents_for_rule(Rule::incrementing, file_contents);
    // if there are matches
    if targets.is_some() {
        // iterate through all of the matches
        for target in targets.unwrap().into_iter() {
            println!("{}", target.as_str());
        }
    }
}
#[test]
fn test_parse_incrementing() {
    let file_contents = r#"
index++;
a_thing++;
another_thing++;
should_not_match;
should_match++;
"#;
    parse_incrementing(file_contents);
}
In your example, ANY+ is probably matching everything up to the end of the input, so the ++ pattern is never matched, and therefore the whole incrementing rule never matches.
Try changing it to (!"+" ~ ANY)+
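With that change, the rule from the question would read as follows (a sketch; the surrounding rules stay the same):

incrementing = { prefix ~ (!"+" ~ ANY)+ ~ "++" ~ suffix }

The !"+" negative lookahead stops the repetition just before the first +, leaving the "++" literal free to match.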
I'm writing parsers in Nom 5 using functions, not macros. My goal is to write a parser that recognizes a string composed entirely of uppercase characters. Ideally, it would have the same return signature as alpha1.
use nom::{
    bytes::complete::take_while, // needed by uppercase_char below
    character::complete::{alpha1, char, line_ending, not_line_ending},
    combinator::{cut, map, not, recognize},
    error::{context, ParseError, VerboseError},
    multi::{many0, many1},
    IResult,
};

// Matches 0 or more consecutive uppercase characters
fn uppercase_char<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, &'a str, E> {
    let chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    take_while(move |c| chars.contains(c))(i)
}

// Matches 1 or more consecutive uppercase characters
fn upper1<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, &'a str, E> {
    recognize(many1(uppercase_char))(i)
}
Although this compiles, the simple unit test I wrote fails:
#[test]
fn test_upper_string_ok() {
    let input_text = "ADAM";
    let output = upper1::<VerboseError<&str>>(input_text);
    dbg!(&output);
    let expected = Ok(("ADAM", ""));
    assert_eq!(output, expected);
}
The failure output is
---- parse::tests::test_upper_string_ok stdout ----
[src/parse.rs:110] &output = Err(
    Error(
        VerboseError {
            errors: [
                (
                    "",
                    Nom(
                        Many1,
                    ),
                ),
            ],
        },
    ),
)
thread 'parse::tests::test_upper_string_ok' panicked at 'assertion failed: `(left == right)`
left: `Err(Error(VerboseError { errors: [("", Nom(Many1))] }))`,
right: `Ok(("ADAM", ""))`', src/parse.rs:112:9
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
take_while will recognize 0 or more characters, so when used inside many1 as you did, it will first parse the entire "ADAM" string. Then, when many1 calls it again, it will succeed, since take_while can recognize an empty string; but many0 and many1 have a protection against that mistake: if the underlying parser did not consume any input, they return an error.
For what you need, the uppercase_char function should be enough; there is no need for recognize and many1, although you might want to replace take_while with take_while1.
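A minimal sketch of that suggestion; take_while1 already requires at least one matching character, so the recognize/many1 wrapper becomes unnecessary:

use nom::{bytes::complete::take_while1, error::ParseError, IResult};

// Matches 1 or more consecutive uppercase ASCII characters.
fn upper1<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, &'a str, E> {
    take_while1(|c: char| c.is_ascii_uppercase())(i)
}

Note that nom parsers return Ok((remaining_input, matched)), so a successful full match of "ADAM" is Ok(("", "ADAM")), not Ok(("ADAM", "")); the test's expected value needs flipping as well.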
I'm in the middle of learning how to parse simple programs.
This is my lexer.
{
open Parser
exception SyntaxError of string
}

let white = [' ' '\t']+
let blank = ' '
let identifier = ['a'-'z']

rule token = parse
  | white { token lexbuf } (* skip whitespace *)
  | '-' { HYPHEN }
  | identifier {
      let buf = Buffer.create 64 in
      Buffer.add_string buf (Lexing.lexeme lexbuf);
      scan_string buf lexbuf;
      let content = (Buffer.contents buf) in
      STRING(content)
    }
  | _ { raise (SyntaxError "Unknown stuff here") }

and scan_string buf = parse
  | ['a'-'z']+ {
      Buffer.add_string buf (Lexing.lexeme lexbuf);
      scan_string buf lexbuf
    }
  | eof { () }
My "ast":
type t =
  | String of string
  | Array of t list
My parser:
%token <string> STRING
%token HYPHEN
%start <Ast.t> yaml
%%
yaml:
| scalar { $1 }
| sequence {$1}
;
sequence:
| sequence_items {
Ast.Array (List.rev $1)
}
;
sequence_items:
(* empty *) { [] }
| sequence_items HYPHEN scalar {
$3::$1
};
scalar:
| STRING { Ast.String $1 }
;
I'm currently at a point where I want to parse either plain 'strings', e.g. some text, or 'arrays' of 'strings', e.g. - item1 - item2.
When I compile the parser with Menhir I get:
Warning: production sequence -> sequence_items is never reduced.
Warning: in total, 1 productions are never reduced.
I'm pretty new to parsing. Why is this never reduced?
You declare that your entry point to the parser is called main:
%start <Ast.t> main
But I can't see a main production in your code. Maybe the entry point is supposed to be yaml? If that is changed, does the error still persist?
Also, try adding an EOF token to your lexer and to the entry-level production, like this:
parse_yaml: yaml EOF { $1 }
See here for example: https://github.com/Virum/compiler/blob/28e807b842bab5dcf11460c8193dd5b16674951f/grammar.mly#L56
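A minimal sketch of that suggestion, assuming the token is named EOF and the entry point is renamed to parse_yaml (both names are illustrative):

%token EOF
%start <Ast.t> parse_yaml
%%

parse_yaml:
  | yaml EOF { $1 }
  ;

with a matching lexer rule, e.g. | eof { EOF } added at the end of the token rule.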
The Real World OCaml link below also discusses how to use EOF; I think this will solve your problem.
By the way, it's really cool that you are writing a YAML parser in OCaml; if made open source, it would be really useful to the community. Note that YAML is indentation-sensitive, so to parse it with Menhir your lexer will need to produce some kind of INDENT and DEDENT tokens. Also, YAML is a strict superset of JSON, which means it might (or might not) make sense to start with a JSON subset and then expand it. Real World OCaml shows how to write a JSON parser using Menhir:
https://dev.realworldocaml.org/16-parsing-with-ocamllex-and-menhir.html
This is some demo code:

label:
    var id
    let id = 10
    goto label

If keywords were allowed as identifiers, it would be:

let:
    var var
    let var = 10
    goto let

This is totally legal code, but it seems very hard to do in ANTLR.
AFAIK, once ANTLR matches a let token, it will never fall back to an id token, so ANTLR will see:

LET_TOKEN :
VAR_TOKEN <missing ID_TOKEN> VAR_TOKEN
LET_TOKEN <missing ID_TOKEN> VAR_TOKEN = 10

Although ANTLR allows predicates, I would have to control every token match, which is problematic. The grammar becomes this:
grammar Demo;

options {
    language = Go;
}

@parser::members {
    var _need = map[string]bool{}
    func skip(name string, v bool) {
        _need[name] = !v
        fmt.Println("SKIP", name, v)
    }
    func need(name string) bool {
        fmt.Println("NEED", name, _need[name])
        return _need[name]
    }
}

proj @init{skip("inst", false)}: (line? NL)* EOF;

line
    : VAR ID
    | LET ID EQ? Integer
    ;

NL: '\n';
VAR: {need("inst")}? 'var' {skip("inst", true)};
LET: {need("inst")}? 'let' {skip("inst", true)};
EQ: '=';
ID: ([a-zA-Z] [a-zA-Z0-9]*);
Integer: [0-9]+;
WS: [ \t] -> skip;
It looks terrible.
But this is easy in PEG; test this in pegjs:

Expression = (Line? _ '\n')*

Line
  = 'var' _ ID
  / 'let' _ ID _ "=" _ Integer

Integer "integer"
  = [0-9]+ { return parseInt(text(), 10); }

ID = [a-zA-Z] [a-zA-Z0-9]*

_ "whitespace"
  = [ \t]*

I have actually done this in peggo and JavaCC.
My question is how to handle these grammars in ANTLR 4.6. I was so excited about the ANTLR 4.6 Go target, but it seems I chose the wrong tool for my grammar?
The simplest way is to define a parser rule for identifiers:
id: ID | VAR | LET;
VAR: 'var';
LET: 'let';
ID: [a-zA-Z] [a-zA-Z0-9]*;
And then use id instead of ID in your parser rules.
A different way is to use ID for identifiers and keywords, and use predicates for disambiguation. But it's less readable, so I'd use the first way instead.
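As a sketch, the line rule from the question would then use id wherever an identifier is expected:

line
    : VAR id
    | LET id EQ? Integer
    ;

id: ID | VAR | LET;

The lexer still emits VAR and LET tokens (keyword rules come before ID, so they win), but the id parser rule accepts those tokens back in identifier position.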
I am trying to write an optimizing brainfuck compiler in Rust. Currently it stores tokens in a flat vector, which works, but I am having trouble changing it to use a syntax tree:
#[derive(Clone, PartialEq, Eq)]
pub enum Token {
    Output,
    Input,
    Loop(Vec<Token>),
    Move(i32),
    Add(i32, i32),
    LoadOut(i32, i32),
}

use Token::*;

pub fn parse(code: &str) -> Vec<Token> {
    let mut alltokens = Vec::new();
    let mut tokens = &mut alltokens;
    let mut tokvecs: Vec<&mut Vec<Token>> = Vec::new();
    for i in code.chars() {
        match i {
            '+' => tokens.push(Add(0, 1)),
            '-' => tokens.push(Add(0, -1)),
            '>' => tokens.push(Move(1)),
            '<' => tokens.push(Move(-1)),
            '[' => {
                tokens.push(Loop(Vec::new()));
                tokvecs.push(&mut tokens);
                if let &mut Loop(mut newtokens) = tokens.last_mut().unwrap() {
                    tokens = &mut newtokens;
                }
            },
            ']' => {
                tokens = tokvecs.pop().unwrap();
            },
            ',' => tokens.push(Input),
            '.' => {
                tokens.push(LoadOut(0, 0));
                tokens.push(Output);
            }
            _ => (),
        };
    }
    alltokens
}
What I am having trouble figuring out is how to handle the [ command. The current implementation in the code is one of several I have tried, all of which have failed. I think it may require use of Rust's Box, but I can't quite understand how that is used.
The branch handling the [ command is probably completely wrong, but I'm not sure how it should be done. It pushes a Loop (a variant of the Token enum) containing a vector to the tokens vector. The problem is to then get a mutable borrow of the vector in that Loop, which the if let statement is supposed to do.
The code fails to compile since newtokens does not outlive the end of the if let block. Is it possible to get a mutable reference to the vector inside Loop, and set tokens to it? If not, what could be done instead?
Ok, last time I was pretty close; it looks like I missed the ref keyword:
if let &mut Loop(ref mut newtokens) = (&mut tokens).last_mut().unwrap()
I missed it since there were other borrow checker errors everywhere. I decided to simplify your code to tackle them:
pub fn parse(code: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    for i in code.chars() {
        match i {
            '+' => tokens.push(Add(0, 1)),
            '-' => tokens.push(Add(0, -1)),
            '>' => tokens.push(Move(1)),
            '<' => tokens.push(Move(-1)),
            '[' => {
                tokens.push(Loop(Vec::new()));
                if let &mut Loop(ref mut newtokens) = (&mut tokens).last_mut().unwrap() {
                    let bracket_tokens: &mut Vec<Token> = newtokens;
                }
            },
            ']' => {
                ()
            },
            ',' => tokens.push(Input),
            '.' => {
                tokens.push(LoadOut(0, 0));
                tokens.push(Output);
            }
            _ => unreachable!(),
        };
    }
    tokens
}
I merged all of the token variables (you don't really need them) and changed tokens = &mut newtokens; to let bracket_tokens: &mut Vec<Token> = newtokens; (I think this was more or less your intention). This allows you to manipulate the vector inside the Loop.
However, this code still has issues and won't parse brainf*ck's loops; I wanted to make it work, but it required a significant change of approach. You are welcome to try to expand this variant further, but it might be a painful experience, especially if you are not yet too familiar with the borrow checker's rules.
I suggest looking at brainf*ck interpreter implementations by other people (e.g. this one) to get an idea of how this can be done (though not too old ones, as Rust's syntax changed before 1.0 went live).
I've gotten the code to work by making it a recursive function:
#[derive(Clone, PartialEq, Eq)]
pub enum Token {
    Output,
    Input,
    Loop(Vec<Token>),
    Move(i32),
    Add(i32, i32),
    LoadOut(i32, i32),
}

use Token::*;

pub fn parse(code: &str) -> Vec<Token> {
    _parse(&mut code.chars())
}

fn _parse(chars: &mut std::str::Chars) -> Vec<Token> {
    let mut tokens = Vec::new();
    while let Some(i) = chars.next() {
        match i {
            '+' => tokens.push(Add(0, 1)),
            '-' => tokens.push(Add(0, -1)),
            '>' => tokens.push(Move(1)),
            '<' => tokens.push(Move(-1)),
            '[' => tokens.push(Loop(_parse(chars))),
            ']' => { break; }
            ',' => tokens.push(Input),
            '.' => {
                tokens.push(LoadOut(0, 0));
                tokens.push(Output);
            }
            _ => (),
        };
    }
    tokens
}
It seems to work, and is reasonably simple and elegant (I'd still be interested to see a solution that doesn't use recursion).
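For completeness, here is a non-recursive sketch that keeps an explicit stack of unfinished loop bodies instead of holding mutable references into the tree; the names stack and current are illustrative, not from the original code:

pub fn parse_iterative(code: &str) -> Vec<Token> {
    // Bodies of loops that are still open; `current` is the innermost one.
    let mut stack: Vec<Vec<Token>> = Vec::new();
    let mut current: Vec<Token> = Vec::new();
    for c in code.chars() {
        match c {
            '+' => current.push(Add(0, 1)),
            '-' => current.push(Add(0, -1)),
            '>' => current.push(Move(1)),
            '<' => current.push(Move(-1)),
            // Open a loop: shelve the current body and start a fresh one.
            '[' => stack.push(std::mem::take(&mut current)),
            // Close a loop: the finished body becomes a Loop token in its parent.
            // (A stray ']' panics on unwrap; real code should report an error.)
            ']' => {
                let body = std::mem::replace(&mut current, stack.pop().unwrap());
                current.push(Loop(body));
            }
            ',' => current.push(Input),
            '.' => {
                current.push(LoadOut(0, 0));
                current.push(Output);
            }
            _ => (),
        }
    }
    current
}

Moving the vectors by value on '[' and ']' sidesteps the borrow checker entirely, since no long-lived mutable references into the tree are needed.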