Parsing brainf*ck code to a tree in Rust

I am trying to write an optimizing brainfuck compiler in Rust. Currently it stores tokens in a flat vector, which works, but I am having trouble changing it to use a syntax tree:
#[derive(Clone, PartialEq, Eq)]
pub enum Token {
    Output,
    Input,
    Loop(Vec<Token>),
    Move(i32),
    Add(i32, i32),
    LoadOut(i32, i32),
}
use Token::*;
pub fn parse(code: &str) -> Vec<Token> {
    let mut alltokens = Vec::new();
    let mut tokens = &mut alltokens;
    let mut tokvecs: Vec<&mut Vec<Token>> = Vec::new();
    for i in code.chars() {
        match i {
            '+' => tokens.push(Add(0, 1)),
            '-' => tokens.push(Add(0, -1)),
            '>' => tokens.push(Move(1)),
            '<' => tokens.push(Move(-1)),
            '[' => {
                tokens.push(Loop(Vec::new()));
                tokvecs.push(&mut tokens);
                if let &mut Loop(mut newtokens) = tokens.last_mut().unwrap() {
                    tokens = &mut newtokens;
                }
            },
            ']' => {
                tokens = tokvecs.pop().unwrap();
            },
            ',' => tokens.push(Input),
            '.' => {
                tokens.push(LoadOut(0, 0));
                tokens.push(Output);
            }
            _ => (),
        };
    }
    alltokens
}
What I am having trouble figuring out is how to handle the [ command. The current implementation in the code is one of several I have tried, all of which have failed. I think it may require use of Rust's Box, but I can't quite understand how that is used.
The branch handling the [ command is probably completely wrong, but I'm not sure how it should be done. It pushes a Loop (a variant of the Token enum) containing a vector onto the tokens vector. The problem is then getting a mutable borrow of the vector inside that Loop, which the if let statement is supposed to do.
The code fails to compile because newtokens does not outlive the end of the if let block. Is it possible to get a mutable reference to the vector inside the Loop and set tokens to it? If not, what could be done instead?

OK, you were pretty close; it looks like you missed the ref keyword:
if let &mut Loop(ref mut newtokens) = (&mut tokens).last_mut().unwrap()
It's easy to miss, since there were other borrow checker errors everywhere. I decided to simplify your code to tackle them:
pub fn parse(code: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    for i in code.chars() {
        match i {
            '+' => tokens.push(Add(0, 1)),
            '-' => tokens.push(Add(0, -1)),
            '>' => tokens.push(Move(1)),
            '<' => tokens.push(Move(-1)),
            '[' => {
                tokens.push(Loop(Vec::new()));
                if let &mut Loop(ref mut newtokens) = (&mut tokens).last_mut().unwrap() {
                    let bracket_tokens: &mut Vec<Token> = newtokens;
                }
            },
            ']' => {
                ()
            },
            ',' => tokens.push(Input),
            '.' => {
                tokens.push(LoadOut(0, 0));
                tokens.push(Output);
            }
            _ => unreachable!(),
        };
    }
    tokens
}
I merged all of the token variables (you don't really need them) and changed tokens = &mut newtokens; to let bracket_tokens: &mut Vec<Token> = newtokens; (I think this was more or less your intention). This allows you to manipulate the Vector inside the Loop.
However, this code still has issues and won't parse brainf*ck's loops; I wanted to make it work, but that required a significant change of approach. You are welcome to try to expand this variant further, but it might be a painful experience, especially if you are not too familiar with the borrow checker's rules yet.
I suggest looking at brainf*ck interpreter implementations by other people (e.g. this one) to get an idea of how this can be done (though not too old ones, as Rust's syntax changed before 1.0 went live).

I've gotten the code to work by making it a recursive function:
#[derive(Clone, PartialEq, Eq)]
pub enum Token {
    Output,
    Input,
    Loop(Vec<Token>),
    Move(i32),
    Add(i32, i32),
    LoadOut(i32, i32),
}
use Token::*;

pub fn parse(code: &str) -> Vec<Token> {
    _parse(&mut code.chars())
}

fn _parse(chars: &mut std::str::Chars) -> Vec<Token> {
    let mut tokens = Vec::new();
    while let Some(i) = chars.next() {
        match i {
            '+' => tokens.push(Add(0, 1)),
            '-' => tokens.push(Add(0, -1)),
            '>' => tokens.push(Move(1)),
            '<' => tokens.push(Move(-1)),
            '[' => tokens.push(Loop(_parse(chars))),
            ']' => { break; }
            ',' => tokens.push(Input),
            '.' => {
                tokens.push(LoadOut(0, 0));
                tokens.push(Output);
            }
            _ => (),
        };
    }
    tokens
}
It seems to work, and is reasonably simple and elegant (I'd still be interested to see a solution that doesn't use recursion).
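A non-recursive variant is also possible (this is my own sketch, not from the thread): instead of holding &mut references into the tree, keep an explicit stack of owned Vec<Token>s, shelving the current list on [ and restoring it on ]. Owning the vectors side-steps the borrow checker entirely.

```rust
#[derive(Clone, PartialEq, Eq, Debug)]
pub enum Token {
    Output,
    Input,
    Loop(Vec<Token>),
    Move(i32),
    Add(i32, i32),
    LoadOut(i32, i32),
}
use Token::*;

pub fn parse(code: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    // Stack of unfinished outer token lists (owned, not borrowed).
    let mut stack: Vec<Vec<Token>> = Vec::new();
    for c in code.chars() {
        match c {
            '+' => tokens.push(Add(0, 1)),
            '-' => tokens.push(Add(0, -1)),
            '>' => tokens.push(Move(1)),
            '<' => tokens.push(Move(-1)),
            // Shelve the current list and start collecting the loop body.
            '[' => stack.push(std::mem::take(&mut tokens)),
            // Close the body: restore the outer list and append the Loop.
            ']' => {
                let body = std::mem::replace(&mut tokens, stack.pop().unwrap());
                tokens.push(Loop(body));
            }
            ',' => tokens.push(Input),
            '.' => {
                tokens.push(LoadOut(0, 0));
                tokens.push(Output);
            }
            _ => (),
        }
    }
    tokens
}

fn main() {
    assert_eq!(
        parse("+[->+<]"),
        vec![Add(0, 1), Loop(vec![Add(0, -1), Move(1), Add(0, 1), Move(-1)])]
    );
}
```

Note that an unbalanced ] panics at the unwrap; real code would want to report an error there instead.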

Related

PEG: What is wrong with my grammar for if statement?

I'm implementing an OCaml-like language using rust-peg and my parser has a bug.
I defined if-statement grammar, but it doesn't work.
I'm guessing the test-case input is parsed as Apply(Apply(Apply(Apply(f, then), 2), else), 4); that is, "then" is parsed as an Ident, not as a keyword.
I have no idea how to fix this apply-expression grammar. Do you have any ideas?
#[derive(Clone, PartialEq, Eq, Debug)]
pub enum Expression {
    Number(i64),
    If {
        cond: Box<Expression>,
        conseq: Box<Expression>,
        alt: Box<Expression>,
    },
    Ident(String),
    Apply(Box<Expression>, Box<Expression>),
}
use peg::parser;
use toplevel::expression;
use Expression::*;
parser! {
pub grammar toplevel() for str {
    rule _() = [' ' | '\n']*

    pub rule expression() -> Expression
        = expr()

    rule expr() -> Expression
        = if_expr()
        / apply_expr()

    rule if_expr() -> Expression
        = "if" _ cond:expr() _ "then" _ conseq:expr() _ "else" _ alt:expr() {
            Expression::If {
                cond: Box::new(cond),
                conseq: Box::new(conseq),
                alt: Box::new(alt)
            }
        }

    rule apply_expr() -> Expression
        = e1:atom() _ e2:atom() { Apply(Box::new(e1), Box::new(e2)) }
        / atom()

    rule atom() -> Expression
        = number()
        / id:ident() { Ident(id) }

    rule number() -> Expression
        = n:$(['0'..='9']+) { Expression::Number(n.parse().unwrap()) }

    rule ident() -> String
        = id:$(['a'..='z' | 'A'..='Z']['a'..='z' | 'A'..='Z' | '0'..='9']*) { id.to_string() }
}}
fn main() {
    assert_eq!(expression("1"), Ok(Number(1)));
    assert_eq!(
        expression("myFunc 10"),
        Ok(Apply(
            Box::new(Ident("myFunc".to_string())),
            Box::new(Number(10))
        ))
    );
    // failed
    assert_eq!(
        expression("if f then 2 else 3"),
        Ok(If {
            cond: Box::new(Ident("f".to_string())),
            conseq: Box::new(Number(2)),
            alt: Box::new(Number(3))
        })
    );
}
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `Err(ParseError { location: LineCol { line: 1, column: 11, offset: 10 }, expected: ExpectedSet { expected: {"\"then\"", "\' \' | \'\\n\'"} } })`,
right: `Ok(If { cond: Ident("f"), conseq: Number(2), alt: Number(3) })`', src/main.rs:64:5
PEG uses ordered choice. This means that when you write R = A / B for some rule R, if A successfully parses at a position, B will never be tried, even if the choice of A leads to problems later on. This is the core difference from context-free grammars, and it is often overlooked.
In particular, when you write apply = atom atom / atom, if it is possible to parse two atoms in a row, the parser will never try to parse just a single one, even if that means the rest of the input doesn't make sense later.
Combine this with the fact that then and else are perfectly good identifiers in your grammar, and you get the issue that you see.
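One common fix keeps apply_expr as-is but stops ident from matching reserved words, using rust-peg's ! negative lookahead. A sketch (the rule names are mine, and the exact syntax may vary between rust-peg versions):

```rust
rule keyword() = ("if" / "then" / "else") !['a'..='z' | 'A'..='Z' | '0'..='9']

rule ident() -> String
    = !keyword() id:$(['a'..='z' | 'A'..='Z']['a'..='z' | 'A'..='Z' | '0'..='9']*) { id.to_string() }
```

The trailing character-class lookahead inside keyword() keeps identifiers such as thenX legal: the rule only fires when the reserved word is not followed by further identifier characters.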

With closures as parameter and return values, is Fn or FnMut more idiomatic?

Continuing from How do I write combinators for my own parsers in Rust?, I stumbled into this question concerning bounds of functions that consume and/or yield functions/closures.
From these slides, I learned that to be convenient for consumers, you should try to take functions as FnOnce and return as Fn where possible. This gives the caller most freedom what to pass and what to do with the returned function.
In my example, FnOnce is not possible because I need to call that function multiple times. While trying to make it compile I arrived at two possibilities:
pub enum Parsed<'a, T> {
    Some(T, &'a str),
    None(&'a str),
}

impl<'a, T> Parsed<'a, T> {
    pub fn unwrap(self) -> (T, &'a str) {
        match self {
            Parsed::Some(head, tail) => (head, tail),
            _ => panic!("Called unwrap on nothing."),
        }
    }

    pub fn is_none(&self) -> bool {
        match self {
            Parsed::None(_) => true,
            _ => false,
        }
    }
}

pub fn achar(character: char) -> impl Fn(&str) -> Parsed<char> {
    move |input| match input.chars().next() {
        Some(c) if c == character => Parsed::Some(c, &input[1..]),
        _ => Parsed::None(input),
    }
}

pub fn some_v1<T>(parser: impl Fn(&str) -> Parsed<T>) -> impl Fn(&str) -> Parsed<Vec<T>> {
    move |input| {
        let mut re = Vec::new();
        let mut pos = input;
        loop {
            match parser(pos) {
                Parsed::Some(head, tail) => {
                    re.push(head);
                    pos = tail;
                }
                Parsed::None(_) => break,
            }
        }
        Parsed::Some(re, pos)
    }
}

pub fn some_v2<T>(mut parser: impl FnMut(&str) -> Parsed<T>) -> impl FnMut(&str) -> Parsed<Vec<T>> {
    move |input| {
        let mut re = Vec::new();
        let mut pos = input;
        loop {
            match parser(pos) {
                Parsed::Some(head, tail) => {
                    re.push(head);
                    pos = tail;
                }
                Parsed::None(_) => break,
            }
        }
        Parsed::Some(re, pos)
    }
}

#[test]
fn try_it() {
    assert_eq!(some_v1(achar('#'))("##comment").unwrap(), (vec!['#', '#'], "comment"));
    assert_eq!(some_v2(achar('#'))("##comment").unwrap(), (vec!['#', '#'], "comment"));
}
playground
Now I don't know which version is to be preferred. Version 1 takes Fn which is less general, but version 2 needs its parameter mutable.
Which one is more idiomatic/should be used and what is the rationale behind?
Update: Thanks to jplatte for the suggestion on version one. I updated the code here; I find that case even more interesting.
Comparing some_v1 and some_v2 as you wrote them I would say version 2 should definitely be preferred because it is more general. I can't think of a good example for a parsing closure that would implement FnMut but not Fn, but there's really no disadvantage to parser being mut - as noted in the first comment on your question this doesn't constrain the caller in any way.
However, there is a way in which you can make version 1 more general (not strictly more general, just partially) than version 2, and that is by returning impl Fn(&str) -> … instead of impl FnMut(&str) -> …. By doing that, you get two functions that each are less constrained than the other in some way, so it might even make sense to keep both:
Version 1 with the return type change would be more restrictive in its argument (the callable can't mutate its associated data) but less restrictive in its return type (you guarantee that the returned callable doesn't mutate its associated data)
Version 2 would be less restrictive in its argument (the callable is allowed to mutate its associated data) but more restrictive in its return type (the returned callable might mutate its associated data)
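A concrete example of a parsing closure that is FnMut but not Fn may still help: any closure that mutates captured state qualifies. A contrived sketch of my own (apply_n is a hypothetical stand-in for the repeated calls some_v2 makes; it is not part of the question's code):

```rust
// Hypothetical helper: applies `parser` to each input in turn,
// collecting the results. Requires only FnMut, like some_v2.
fn apply_n<F: FnMut(&str) -> Option<String>>(mut parser: F, inputs: &[&str]) -> Vec<Option<String>> {
    inputs.iter().map(|&s| parser(s)).collect()
}

fn main() {
    let mut calls = 0;
    // This closure mutates the captured `calls`, so it implements FnMut
    // but not Fn: some_v1 (which takes impl Fn) would reject it,
    // while some_v2 accepts it.
    let parser = |input: &str| {
        calls += 1;
        input.strip_prefix('#').map(String::from)
    };
    let results = apply_n(parser, &["#a", "b"]);
    assert_eq!(results, vec![Some("a".to_string()), None]);
    assert_eq!(calls, 2);
}
```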

Peeking at stdin using match

I'm trying to port a translator/parser example from an old compiler textbook from C into Rust.
I have the following code:
use std::io::Read;

fn lexan() {
    let mut input = std::io::stdin().bytes().peekable();
    loop {
        match input.peek() {
            Some(&ch) => {
                match ch {
                    _ => println!("{:?}", input.next()),
                }
            }
            None => break,
        }
    }
}
At this point I'm not actively trying to parse the input, just get my head around how match works. The aim is to add parse branches to the inner match. Unfortunately this fails to compile because I appear to fail in understanding the semantics of match:
error[E0507]: cannot move out of borrowed content
--> src/main.rs:7:18
|
7 | Some(&ch) => {
| ^--
| ||
| |hint: to prevent move, use `ref ch` or `ref mut ch`
| cannot move out of borrowed content
From what I understand, this error is because I don't own the return value of the match. The thing is, I don't believe that I'm using the return value of either match. I thought perhaps input.next() may have been the issue, but the same error occurs with or without this part (or indeed, the entire println! call).
What am I missing here? It's been some time since I looked at Rust (and never in a serious level of effort), and most of the search results for things of this nature appear to be out of date.
It's got nothing to do with the return value of match, or even match itself:
use std::io::Read;

fn lexan() {
    let mut input = std::io::stdin().bytes().peekable();
    if let Some(&ch) = input.peek() {}
}
The issue is that you are attempting to bind the result of Peekable::peek while dereferencing it (that's what the & in &ch does). In this case, the return type is an Option<&Result<u8, std::io::Error>> because the Bytes iterator returns errors from the underlying stream. Since this type does not implement Copy, trying to dereference the type requires that you transfer ownership of the value. You cannot do so as you don't own the original value — thus the error message.
The piece that causes the inability to copy is the error type of the Result. Because of that, you can match one level deeper:
match input.peek() {
    Some(&Ok(ch)) => {
        match ch {
            _ => println!("{:?}", input.next()),
        }
    }
    Some(&Err(_)) => panic!(),
    None => break,
}
Be aware that this code is pretty close to being uncompilable though. The result of peek will be invalidated when next is called, so many small changes to this code will trigger the borrow checker to fail the code. I'm actually a bit surprised the above worked on the first go.
If you didn't care about errors at all, you could do
while let Some(&Ok(ch)) = input.peek() {
    match ch {
        _ => println!("{:?}", input.next()),
    }
}
Unfortunately, you can't split this into two steps, as doing so would cause the borrow of input to last through the call to next:
while let Some(x) = input.peek() {
    match *x {
        Ok(ch) => {
            match ch {
                _ => println!("{:?}", input.next()),
            }
        }
        Err(_) => {}
    }
    // Could still use `x` here; the compiler doesn't currently see that we don't
}

What is the sequence combinator in Chomp?

I'm attempting to parse a subset of JSON that only contains a single, non-nested object with string only values that may contain escape sequences. E.g.
{
    "A KEY": "SOME VALUE",
    "Another key": "Escape sequences \n \r \\ \/ \f \t \u263A"
}
Using the Chomp parser combinator in Rust. I have it parsing this structure ignoring escape sequences but am having trouble working out how to handle the escape sequences. Looking at other quoted string parsers that use combinators such as:
Arc JSON parser
PHP parser-combinator
Paka
They each use a sequence combinator, what is the equivalent in Chomp?
Chomp is based on Attoparsec and Parsec, so for parsing escaped strings I would use the scan parser to obtain the slice between the " characters while keeping any escaped " characters.
The sequence combinator is just the ParseResult::bind method, used to chain the match of the " character and the escaped string itself so that it will be able to parse "foo\\"bar" and not just foo\\"bar. You get this for free when you use the parse! macro, each ; is implicitly converted into a bind call to chain the parsers together.
The linked parsers use a many and or combinator and allocate a vector for the resulting characters. Paka does not seem to do any transformation on the resulting array, and PHP is using a regex with a callback to unescape the string.
This is code translated from Attoparsec's Aeson benchmark for parsing a JSON-string while not unescaping any escaped characters.
#[macro_use]
extern crate chomp;

use chomp::*;
use chomp::buffer::IntoStream;
use chomp::buffer::Stream;

pub fn json_string(i: Input<u8>) -> U8Result<&[u8]> {
    parse!{i;
        token(b'"');
        let escaped_str = scan(false, |s, c| if s { Some(false) }
                                      else if c == b'"' { None }
                                      else { Some(c == b'\\') });
        token(b'"');
        ret escaped_str
    }
}

#[test]
fn test_it() {
    let r = "\"foo\\\"bar\\tbaz\"".as_bytes().into_stream().parse(json_string);
    assert_eq!(r, Ok(&b"foo\\\"bar\\tbaz"[..]));
}
The parser above is not equivalent: it yields a slice of u8 borrowed from the source buffer/slice. If you want an owned Vec of the data, you should preferably use [T]::to_vec or String::from_utf8 instead of building a parser from many and or, since that will not be as fast as scan and the result is the same.
If you want to handle UTF-8 and escape sequences, you can filter the resulting slice and then call String::from_utf8 on the Vec (Rust strings are UTF-8; using a string containing invalid UTF-8 can result in undefined behaviour). If performance is an issue, you should most likely build that into the parser.
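To sketch that filtering step (my own illustration, not part of Chomp): once the parser yields the raw slice, a small pass can resolve the escapes before the UTF-8 conversion. This sketch only knows about \n, \t, and the identity escapes \" and \\:

```rust
// Resolve escape sequences in a raw byte slice, then convert to String.
fn unescape(raw: &[u8]) -> String {
    let mut out = Vec::with_capacity(raw.len());
    let mut iter = raw.iter().copied();
    while let Some(b) = iter.next() {
        if b == b'\\' {
            match iter.next() {
                Some(b'n') => out.push(b'\n'),
                Some(b't') => out.push(b'\t'),
                // \" and \\ (and anything unrecognised, here) map to themselves.
                Some(other) => out.push(other),
                None => break, // trailing backslash; a real parser would error
            }
        } else {
            out.push(b);
        }
    }
    // Rust strings must be valid UTF-8; fail loudly on bad input.
    String::from_utf8(out).expect("invalid UTF-8 in string literal")
}

fn main() {
    assert_eq!(unescape(br#"foo\"bar\tbaz"#), "foo\"bar\tbaz");
}
```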
The documentation states (emphasis mine):
Using parsers is almost entirely done using the parse! macro, which enables us to do three distinct things:
Sequence parsers over the remaining input
Store intermediate results into datatypes
Return a datatype at the end, which may be the result of any arbitrary computation over the intermediate results.
It then provides this example of parsing a sequence of two numbers followed by a constant string:
fn f(i: Input<u8>) -> U8Result<(u8, u8, u8)> {
    parse!{i;
        let a = digit();
        let b = digit();
        string(b"missiles");
        ret (a, b, a + b)
    }
}

fn digit(i: Input<u8>) -> U8Result<u8> {
    satisfy(i, |c| b'0' <= c && c <= b'9').map(|c| c - b'0')
}
There is also ParseResult::bind and ParseResult::then which are documented to sequentially compose a result with a second action.
Because I'm always interested in parsing, I went ahead and played with it a bit to see how it would look. I'm not happy with the deep indenting that would happen with the nested or calls, but there's probably something better that can be done. This is just one possible solution:
#[macro_use]
extern crate chomp;

use chomp::*;
use chomp::ascii::is_alpha;
use chomp::buffer::{Source, Stream, ParseError};
use std::str;
use std::iter::FromIterator;

#[derive(Debug)]
pub enum StringPart<'a> {
    String(&'a [u8]),
    Newline,
    Slash,
}

impl<'a> StringPart<'a> {
    fn from_bytes(s: &[u8]) -> StringPart {
        match s {
            br#"\\"# => StringPart::Slash,
            br#"\n"# => StringPart::Newline,
            s => StringPart::String(s),
        }
    }
}

impl<'a> FromIterator<StringPart<'a>> for String {
    fn from_iter<I>(iterator: I) -> Self
        where I: IntoIterator<Item = StringPart<'a>>
    {
        let mut s = String::new();
        for part in iterator {
            match part {
                StringPart::String(p) => s.push_str(str::from_utf8(p).unwrap()),
                StringPart::Newline => s.push('\n'),
                StringPart::Slash => s.push('\\'),
            }
        }
        s
    }
}

fn json_string_part(i: Input<u8>) -> U8Result<StringPart> {
    or(i,
       |i| parse!{i; take_while1(is_alpha)},
       |i| or(i,
              |i| parse!{i; string(br"\\")},
              |i| parse!{i; string(br"\n")}),
    ).map(StringPart::from_bytes)
}

fn json_string(i: Input<u8>) -> U8Result<String> {
    many1(i, json_string_part)
}

fn main() {
    let input = br#"\\stuff\n"#;
    let mut i = Source::new(input as &[u8]);

    println!("Input has {} bytes", input.len());

    loop {
        match i.parse(json_string) {
            Ok(x) => {
                println!("Result has {} bytes", x.len());
                println!("{:?}", x);
            },
            Err(ParseError::Retry) => {}, // Needed to refill the buffer when necessary
            Err(ParseError::EndOfInput) => break,
            Err(e) => { panic!("{:?}", e); }
        }
    }
}

Using parser-combinators to parse string with escaped characters?

I'm trying to use the combine library in Rust to parse a string. The real data that I'm trying to parse looks something like this:
A79,216,0,4,2,2,N,"US\"PS"
So at the end of that data is a string in quotes, but the string will contain escaped characters as well. I can't figure out how to parse those escaped characters in between the other quotes.
extern crate parser_combinators;
use self::parser_combinators::*;

fn main() {
    let s = r#""HE\"LLO""#;
    let data = many(satisfy(|c| c != '"')); // Fails on escaped " obviously
    let mut str_parser = between(satisfy(|c| c == '"'), satisfy(|c| c == '"'), data);
    let result: Result<(String, &str), ParseError> = str_parser.parse(s);
    match result {
        Ok((value, _)) => println!("{:?}", value),
        Err(err) => println!("{}", err),
    }
}
//=> "HE\\"
The code above will parse that string successfully but will obviously fail on the escaped character in the middle, printing out "HE\\" in the end.
I want to change the code above so that it prints "HE\\\"LLO".
How do I do that?
I have a mostly functional JSON parser as a benchmark for parser-combinators which parses this sort of escaped characters. I have included a link to it and a slightly simplified version of it below.
fn json_char(input: State<&str>) -> ParseResult<char, &str> {
    let (c, input) = try!(satisfy(|c| c != '"').parse_state(input));
    let mut back_slash_char = satisfy(|c| "\"\\nrt".chars().find(|x| *x == c).is_some()).map(|c| {
        match c {
            '"' => '"',
            '\\' => '\\',
            'n' => '\n',
            'r' => '\r',
            't' => '\t',
            c => c // Should never happen
        }
    });
    match c {
        '\\' => input.combine(|input| back_slash_char.parse_state(input)),
        _ => Ok((c, input))
    }
}
Since this parser may consume one or two characters, the primitive combinators are not enough, so we need to introduce a function that can branch on the character that was parsed.
I ran into the same problem and ended up with the following solution:
(
    char('"'),
    many1::<Vec<char>, _>(choice((
        escaped_character(),
        satisfy(|c| c != '"'),
    ))),
    char('"')
)
Or in other words, a string is delimited by " followed by many escaped_characters or anything that isn't a closing ", and is closed by a closing ".
Here's a full example of how I'm using this:
pub enum Operand {
    String { value: String },
}

fn escaped_character<I>() -> impl Parser<Input = I, Output = char>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        char('\\'),
        any(),
    ).and_then(|(_, x)| match x {
        '0' => Ok('\0'),
        'n' => Ok('\n'),
        '\\' => Ok('\\'),
        '"' => Ok('"'),
        _ => Err(StreamErrorFor::<I>::unexpected_message(format!("Invalid escape sequence \\{}", x))),
    })
}

#[test]
fn parse_escaped_character() {
    let expected = Ok(('\n', " foo"));
    assert_eq!(expected, escaped_character().easy_parse("\\n foo"))
}

fn string_operand<I>() -> impl Parser<Input = I, Output = Operand>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        char('"'),
        many1::<Vec<char>, _>(choice((
            escaped_character(),
            satisfy(|c| c != '"'),
        ))),
        char('"')
    )
        .map(|(_, value, _)| Operand::String { value: value.into_iter().collect() })
}

#[test]
fn parse_string_operand() {
    let expected = Ok((Operand::String { value: "foo \" bar \n baz \0".into() }, ""));
    assert_eq!(expected, string_operand().easy_parse(r#""foo \" bar \n baz \0""#))
}
