Using parser-combinators to parse string with escaped characters?

I'm trying to use the combine library in Rust to parse a string. The real data that I'm trying to parse looks something like this:
A79,216,0,4,2,2,N,"US\"PS"
So at the end of that data is a string in quotes, but the string will contain escaped characters as well. I can't figure out how to parse those escaped characters in between the other quotes.
extern crate parser_combinators;
use self::parser_combinators::*;
fn main() {
let s = r#""HE\"LLO""#;
let data = many(satisfy(|c| c != '"')); // Fails on escaped " obviously
let mut str_parser = between(satisfy(|c| c == '"'), satisfy(|c| c == '"'), data);
let result : Result<(String, &str), ParseError> = str_parser.parse(s);
match result {
Ok((value, _)) => println!("{:?}", value),
Err(err) => println!("{}", err),
}
}
//=> "HE\\"
The code above parses without returning an error, but it stops at the escaped quote in the middle of the string, printing "HE\\" in the end.
I want to change the code above so that it prints "HE\\\"LLO".
How do I do that?

I have a mostly functional JSON parser, written as a benchmark for parser-combinators, which parses this sort of escaped character. I have included a link to it and a slightly simplified version of it below.
fn json_char(input: State<&str>) -> ParseResult<char, &str> {
let (c, input) = try!(satisfy(|c| c != '"').parse_state(input));
let mut back_slash_char = satisfy(|c| "\"\\nrt".chars().find(|x| *x == c).is_some()).map(|c| {
match c {
'"' => '"',
'\\' => '\\',
'n' => '\n',
'r' => '\r',
't' => '\t',
c => c//Should never happen
}
});
match c {
'\\' => input.combine(|input| back_slash_char.parse_state(input)),
_ => Ok((c, input))
}
}
json_char
Since this parser may consume one or two characters, the primitive combinators are not enough, so we need to introduce a function that can branch on the character that was just parsed.
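The branching idea can also be shown without any parser library. The following is a minimal std-only sketch, not the parser-combinators API: parse_quoted and its error messages are hypothetical names chosen for illustration. It reads one character at a time and consumes a second character only when the first one is a backslash.

```rust
/// Parses a double-quoted string with backslash escapes from the start of `input`.
/// Returns the unescaped contents and the remaining input.
fn parse_quoted(input: &str) -> Result<(String, &str), String> {
    let mut chars = input.char_indices();
    match chars.next() {
        Some((_, '"')) => {}
        _ => return Err("expected opening quote".to_string()),
    }
    let mut out = String::new();
    while let Some((idx, c)) = chars.next() {
        match c {
            // Closing quote: everything after it is the remaining input.
            '"' => return Ok((out, &input[idx + 1..])),
            // Backslash: branch and consume a second character.
            '\\' => match chars.next() {
                Some((_, '"')) => out.push('"'),
                Some((_, '\\')) => out.push('\\'),
                Some((_, 'n')) => out.push('\n'),
                Some((_, 'r')) => out.push('\r'),
                Some((_, 't')) => out.push('\t'),
                Some((_, other)) => return Err(format!("invalid escape \\{}", other)),
                None => return Err("unterminated escape".to_string()),
            },
            // Any other character is copied through unchanged.
            other => out.push(other),
        }
    }
    Err("missing closing quote".to_string())
}
```

Like json_char, this sketch unescapes the contents, so the input "HE\"LLO" yields HE"LLO. If the backslash should be kept in the output, push it before pushing the escaped character.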

I ran into the same problem and ended up with the following solution:
(
char('"'),
many1::<Vec<char>, _>(choice((
escaped_character(),
satisfy(|c| c != '"'),
))),
char('"')
)
Or in other words, a string is delimited by " followed by many escaped_characters or anything that isn't a closing ", and is closed by a closing ".
Here's a full example of how I'm using this:
pub enum Operand {
String { value: String },
}
fn escaped_character<I>() -> impl Parser<Input = I, Output = char>
where
I: Stream<Item = char>,
I::Error: ParseError<I::Item, I::Range, I::Position>,
{
(
char('\\'),
any(),
).and_then(|(_, x)| match x {
'0' => Ok('\0'),
'n' => Ok('\n'),
'\\' => Ok('\\'),
'"' => Ok('"'),
_ => Err(StreamErrorFor::<I>::unexpected_message(format!("Invalid escape sequence \\{}", x)))
})
}
#[test]
fn parse_escaped_character() {
let expected = Ok(('\n', " foo"));
assert_eq!(expected, escaped_character().easy_parse("\\n foo"))
}
fn string_operand<I>() -> impl Parser<Input = I, Output = Operand>
where
I: Stream<Item = char>,
I::Error: ParseError<I::Item, I::Range, I::Position>,
{
(
char('"'),
many1::<Vec<char>, _>(choice((
escaped_character(),
satisfy(|c| c != '"'),
))),
char('"')
)
.map(|(_,value,_)| Operand::String { value: value.into_iter().collect() })
}
#[test]
fn parse_string_operand() {
let expected = Ok((Operand::String { value: "foo \" bar \n baz \0".into() }, ""));
assert_eq!(expected, string_operand().easy_parse(r#""foo \" bar \n baz \0""#))
}


How to build a numbered list parser in nom?

I'd like to parse a numbered list using nom in Rust.
For example, 1. Milk 2. Bread 3. Bacon.
I could use separated_list1 with an appropriate separator parser and element parser.
fn parser(input: &str) -> IResult<&str, Vec<&str>> {
preceded(
tag("1. "),
separated_list1(
tuple((tag(" "), digit1, tag(". "))),
take_while(is_alphabetic),
),
)(input)
}
However, this does not validate the increasing index numbers.
For example, it would happily parse invalid lists like 1. Milk 3. Bread 4. Bacon or 1. Milk 8. Bread 1. Bacon.
It seems there is no built-in nom parser that can do this. So I ventured to try to build my own first parser...
My idea was to implement a parser similar to separated_list1 but which keeps track of the index and passes it to the separator as argument. It could accept a closure as argument that can then create the separator parser based on the index argument.
fn parser(input: &str) -> IResult<&str, Vec<&str>> {
preceded(
tag("1. "),
separated_list1(
|index: i32| tuple((tag(" "), tag(&index.to_string()), tag(". "))),
take_while(is_alphabetic),
),
)(input)
}
I tried to take the implementation of separated_list1 and change the separator argument to G: FnOnce(i32) -> Parser<I, O2, E>, create an index variable (let mut index = 1;), pass it to sep(index) in the loop, and increase it at the end of the loop (index += 1;).
However, Rust's type system is not happy!
How can I make this work?
Here's the full code for reproduction
use nom::{
error::{ErrorKind, ParseError},
Err, IResult, InputLength, Parser,
};
pub fn separated_numbered_list1<I, O, O2, E, F, G>(
mut sep: G,
mut f: F,
) -> impl FnMut(I) -> IResult<I, Vec<O>, E>
where
I: Clone + InputLength,
F: Parser<I, O, E>,
G: FnOnce(i32) -> Parser<I, O2, E>,
E: ParseError<I>,
{
move |mut i: I| {
let mut res = Vec::new();
let mut index = 1;
// Parse the first element
match f.parse(i.clone()) {
Err(e) => return Err(e),
Ok((i1, o)) => {
res.push(o);
i = i1;
}
}
loop {
let len = i.input_len();
match sep(index).parse(i.clone()) {
Err(Err::Error(_)) => return Ok((i, res)),
Err(e) => return Err(e),
Ok((i1, _)) => {
// infinite loop check: the parser must always consume
if i1.input_len() == len {
return Err(Err::Error(E::from_error_kind(i1, ErrorKind::SeparatedList)));
}
match f.parse(i1.clone()) {
Err(Err::Error(_)) => return Ok((i, res)),
Err(e) => return Err(e),
Ok((i2, o)) => {
res.push(o);
i = i2;
}
}
}
}
index += 1;
}
}
}
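Try combining many1(), separated_pair(), and verify() manually, tracking the expected index in a Cell:
use std::cell::Cell;
use nom::{
bytes::complete::tag,
character::complete::{alpha1, digit1, multispace0},
combinator::{map_res, verify},
multi::many1,
sequence::{preceded, separated_pair},
IResult,
};
fn validated(input: &str) -> IResult<&str, Vec<(u32, &str)>> {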
let current_index = Cell::new(1u32);
let number = map_res(digit1, |s: &str| s.parse::<u32>());
let valid = verify(number, |digit| {
let i = current_index.get();
if digit == &i {
current_index.set(i + 1);
true
} else {
false
}
});
let pair = preceded(multispace0, separated_pair(valid, tag(". "), alpha1));
// Bind the result to a temporary so that the parser, which borrows current_index, is dropped before current_index goes out of scope. This will not compile without the temporary binding.
let tmp = many1(pair)(input);
tmp
}
#[test]
fn test_success() {
let input = "1. Milk 2. Bread 3. Bacon";
assert_eq!(validated(input), Ok(("", vec![(1, "Milk"), (2, "Bread"), (3, "Bacon")])));
}
#[test]
fn test_fail() {
let input = "2. Bread 3. Bacon 1. Milk";
validated(input).unwrap_err();
}
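The same check (each index must equal a running counter) can be sketched without nom. This is a std-only illustration under stated assumptions: parse_numbered_list is a hypothetical name, items are purely alphabetic words, and, unlike many1 in the answer above, it reports an error for any leftover input instead of returning it as the remainder.

```rust
/// Parses "1. Milk 2. Bread 3. Bacon" into [(1, "Milk"), (2, "Bread"), (3, "Bacon")],
/// requiring the indices to be exactly 1, 2, 3, ...
fn parse_numbered_list(input: &str) -> Result<Vec<(u32, &str)>, String> {
    let mut items = Vec::new();
    let mut expected = 1u32;
    let mut rest = input.trim_start();
    while !rest.is_empty() {
        // Split off the leading digits and parse them as the index.
        let digits_end = rest.find(|c: char| !c.is_ascii_digit()).unwrap_or(rest.len());
        let number: u32 = rest[..digits_end]
            .parse()
            .map_err(|_| format!("expected a number at {:?}", rest))?;
        // This is the validation step: the index must match the counter.
        if number != expected {
            return Err(format!("expected index {}, found {}", expected, number));
        }
        rest = rest[digits_end..]
            .strip_prefix(". ")
            .ok_or_else(|| format!("expected \". \" after index {}", number))?;
        // The item is a run of alphabetic characters.
        let word_end = rest.find(|c: char| !c.is_alphabetic()).unwrap_or(rest.len());
        if word_end == 0 {
            return Err(format!("expected an item after index {}", number));
        }
        items.push((number, &rest[..word_end]));
        rest = rest[word_end..].trim_start();
        expected += 1;
    }
    if items.is_empty() {
        Err("expected at least one item".to_string())
    } else {
        Ok(items)
    }
}
```

Keeping the counter as a plain local variable avoids the Cell and the temporary binding, at the cost of not composing with other nom parsers.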

How do I parse uppercase strings in Nom?

I'm writing parsers in Nom 5 using functions, not macros. My goal is to write a parser that recognizes a string composed entirely of uppercase characters. Ideally, it would have the same return signature as alpha1.
use nom::{
bytes::complete::take_while,
character::complete::{alpha1, char, line_ending, not_line_ending},
combinator::{cut, map, not, recognize},
error::{context, ParseError, VerboseError},
multi::{many0, many1},
IResult,
};
fn uppercase_char<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, &'a str, E> {
let chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
take_while(move |c| chars.contains(c))(i)
}
// Matches 1 or more consecutive uppercase characters
fn upper1<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, &'a str, E> {
recognize(many1(uppercase_char))(i)
}
Although this compiles, the simple unit test I wrote fails:
#[test]
fn test_upper_string_ok() {
let input_text = "ADAM";
let output = upper1::<VerboseError<&str>>(input_text);
dbg!(&output);
let expected = Ok(("ADAM", ""));
assert_eq!(output, expected);
}
The failure output is
---- parse::tests::test_upper_string_ok stdout ----
[src/parse.rs:110] &output = Err(
Error(
VerboseError {
errors: [
(
"",
Nom(
Many1,
),
),
],
},
),
)
thread 'parse::tests::test_upper_string_ok' panicked at 'assertion failed: `(left == right)`
left: `Err(Error(VerboseError { errors: [("", Nom(Many1))] }))`,
right: `Ok(("ADAM", ""))`', src/parse.rs:112:9
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
take_while will recognize 0 or more characters, so when used inside of many1 as you did, it will first parse the entire "ADAM" string. Then when many1 calls it again, since take_while can recognize an empty string, it will succeed, but many0 and many1 have a protection against that mistake: if the underlying parser did not consume any input, they will return an error.
For what you need, the uppercase_char function should be enough; there is no need for recognize and many1. You will probably want to replace take_while with take_while1, though, so that an empty match is rejected.
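The interaction described above can be reproduced with a few lines of plain Rust. This is a simplified model, not nom's actual implementation: take_while0, take_while1, many1_guarded and is_upper are hypothetical stand-ins, and the error strings are made up for illustration.

```rust
fn is_upper(c: char) -> bool {
    c.is_ascii_uppercase()
}

/// Zero or more matching characters: always succeeds, possibly with an empty match.
/// Returns (remaining input, matched prefix).
fn take_while0(input: &str, pred: impl Fn(char) -> bool) -> (&str, &str) {
    let end = input.find(|c: char| !pred(c)).unwrap_or(input.len());
    (&input[end..], &input[..end])
}

/// One or more matching characters: fails (None) on an empty match.
fn take_while1(input: &str, pred: impl Fn(char) -> bool) -> Option<(&str, &str)> {
    let (rest, matched) = take_while0(input, pred);
    if matched.is_empty() {
        None
    } else {
        Some((rest, matched))
    }
}

/// A many1-style loop with the guard described above: if the inner parser
/// succeeds without consuming input, report an error instead of looping forever.
fn many1_guarded<'a>(
    mut input: &'a str,
    parser: impl Fn(&'a str) -> Option<(&'a str, &'a str)>,
) -> Result<(&'a str, Vec<&'a str>), &'static str> {
    let mut out = Vec::new();
    loop {
        match parser(input) {
            None => break,
            Some((rest, matched)) => {
                if rest.len() == input.len() {
                    return Err("inner parser consumed no input");
                }
                out.push(matched);
                input = rest;
            }
        }
    }
    if out.is_empty() {
        Err("expected at least one match")
    } else {
        Ok((input, out))
    }
}
```

Calling many1_guarded("ADAM", |i| Some(take_while0(i, is_upper))) hits the guard on the second iteration, which mirrors the Many1 error in the question, while many1_guarded("ADAM", |i| take_while1(i, is_upper)) stops cleanly after one match.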

Parsing brainf*ck code to tree in Rust

I am trying to write an optimizing brainfuck compiler in Rust. Currently it stores tokens in a flat vector, which works, but I am having trouble changing it to use a syntax tree:
#[derive(Clone, PartialEq, Eq)]
pub enum Token {
Output,
Input,
Loop(Vec<Token>),
Move(i32),
Add(i32, i32),
LoadOut(i32, i32),
}
use Token::*;
pub fn parse(code: &str) -> Vec<Token> {
let mut alltokens = Vec::new();
let mut tokens = &mut alltokens;
let mut tokvecs: Vec<&mut Vec<Token>> = Vec::new();
for i in code.chars() {
match i {
'+' => tokens.push(Add(0, 1)),
'-' => tokens.push(Add(0, -1)),
'>' => tokens.push(Move(1)),
'<' => tokens.push(Move(-1)),
'[' => {
tokens.push(Loop(Vec::new()));
tokvecs.push(&mut tokens);
if let &mut Loop(mut newtokens) = tokens.last_mut().unwrap() {
tokens = &mut newtokens;
}
},
']' => {
tokens = tokvecs.pop().unwrap();
},
',' => tokens.push(Input),
'.' => {
tokens.push(LoadOut(0, 0));
tokens.push(Output);
}
_ => (),
};
}
alltokens
}
What I am having trouble figuring out is how to handle the [ command. The current implementation in the code is one of several I have tried, all of which have failed. I think it may require use of Rust's Box, but I can't quite understand how that is used.
The branch handling the [ command is probably completely wrong, but I'm not sure how it should be done. It pushes a Loop (a variant of the Token enum) containing a vector to the tokens vector. The problem is to then get a mutable borrow of the vector in that Loop, which the if let statement is supposed to do.
The code fails to compile since newtokens does not outlive the end of the if let block. Is it possible to get a mutable reference to the vector inside Loop, and set tokens to it? If not, what could be done instead?
Ok, last time I was pretty close; it looks like I missed the ref keyword:
if let &mut Loop(ref mut newtokens) = (&mut tokens).last_mut().unwrap()
I missed it since there were other borrow checker errors everywhere. I decided to simplify your code to tackle them:
pub fn parse(code: &str) -> Vec<Token> {
let mut tokens = Vec::new();
for i in code.chars() {
match i {
'+' => tokens.push(Add(0, 1)),
'-' => tokens.push(Add(0, -1)),
'>' => tokens.push(Move(1)),
'<' => tokens.push(Move(-1)),
'[' => {
tokens.push(Loop(Vec::new()));
if let &mut Loop(ref mut newtokens) = (&mut tokens).last_mut().unwrap() {
let bracket_tokens: &mut Vec<Token> = newtokens;
}
},
']' => {
()
},
',' => tokens.push(Input),
'.' => {
tokens.push(LoadOut(0, 0));
tokens.push(Output);
}
_ => (), // ignore any other character; brainf*ck treats them as comments
};
}
tokens
}
I merged all of the token variables (you don't really need them) and changed tokens = &mut newtokens; to let bracket_tokens: &mut Vec<Token> = newtokens; (I think this was more or less your intention). This allows you to manipulate the Vector inside the Loop.
However, this code still has issues and won't parse brainf*ck's loops; I wanted to make it work, but it required a significant change of approach. You are welcome to try to expand this variant further but it might be a painful experience, especially if you are not too familiar with the borrow checker's rules yet.
I suggest looking at brainf*ck interpreter implementations by other people (e.g. this one) to get an idea of how this can be done. Avoid ones that are too old, though, as Rust's syntax changed before 1.0 went live.
I've gotten the code to work by making it a recursive function:
#[derive(Clone, PartialEq, Eq)]
pub enum Token {
Output,
Input,
Loop(Vec<Token>),
Move(i32),
Add(i32, i32),
LoadOut(i32, i32),
}
use Token::*;
pub fn parse(code: &str) -> Vec<Token> {
_parse(&mut code.chars())
}
fn _parse(chars: &mut std::str::Chars) -> Vec<Token> {
let mut tokens = Vec::new();
while let Some(i) = chars.next() {
match i {
'+' => tokens.push(Add(0, 1)),
'-' => tokens.push(Add(0, -1)),
'>' => tokens.push(Move(1)),
'<' => tokens.push(Move(-1)),
'[' => tokens.push(Loop(_parse(chars))),
']' => { break; }
',' => tokens.push(Input),
'.' => {
tokens.push(LoadOut(0, 0));
tokens.push(Output);
}
_ => (),
};
}
tokens
}
It seems to work, and is reasonably simple and elegant (I'd still be interested to see a solution that doesn't use recursion).
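For a version without recursion, one option is to keep an explicit stack of token vectors: '[' pushes a fresh vector and ']' pops it and wraps it in a Loop on the parent. The sketch below assumes the same Token enum as above (with Debug added so results can be printed); parse_iterative and its error messages are hypothetical names, and reporting unbalanced brackets is an addition that the recursive version does not have.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Token {
    Output,
    Input,
    Loop(Vec<Token>),
    Move(i32),
    Add(i32, i32),
    LoadOut(i32, i32),
}

pub fn parse_iterative(code: &str) -> Result<Vec<Token>, String> {
    use Token::*;
    // stack[0] is the top-level program; each open '[' adds one more Vec.
    let mut stack: Vec<Vec<Token>> = vec![Vec::new()];
    for (pos, c) in code.chars().enumerate() {
        match c {
            '[' => stack.push(Vec::new()),
            ']' => {
                // The stack is never empty here: we return as soon as it would be.
                let body = stack.pop().expect("stack always holds the top level");
                match stack.last_mut() {
                    Some(parent) => parent.push(Loop(body)),
                    // We popped the top level itself, so this ']' has no partner.
                    None => return Err(format!("unmatched ']' at position {}", pos)),
                }
            }
            other => {
                let top = stack.last_mut().expect("stack always holds the top level");
                match other {
                    '+' => top.push(Add(0, 1)),
                    '-' => top.push(Add(0, -1)),
                    '>' => top.push(Move(1)),
                    '<' => top.push(Move(-1)),
                    ',' => top.push(Input),
                    '.' => {
                        top.push(LoadOut(0, 0));
                        top.push(Output);
                    }
                    _ => {} // any other character is a comment
                }
            }
        }
    }
    if stack.len() != 1 {
        return Err(format!("{} unclosed '['", stack.len() - 1));
    }
    Ok(stack.pop().expect("exactly one element remains"))
}
```

Because the stack owns each vector instead of borrowing into the parent, the borrow checker problems from the original attempt do not arise.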

Parsing values contained inside nested brackets

I'm just fooling about and strangely found it a bit tricky to parse nested brackets in a simple recursive function.
For example, if the program's purpose is to look up user details, it may go from {{name surname} age} to {Bob Builder age} and then to Bob Builder 20.
Here is a mini-program for summing totals in curly brackets that demonstrates the concept.
// Parses string recursively by eliminating brackets
def parse(s: String): String = {
if (!s.contains("{")) s
else {
parse(resolvePair(s))
}
}
// Sums one pair and returns the string, starting at deepest nested pair
// e.g.
// {2+10} lollies and {3+{4+5}} peanuts
// should return:
// {2+10} lollies and {3+9} peanuts
def resolvePair(s: String): String = {
??? // Replace the deepest nested pair with its sumString result
}
// Sums values in a string, returning the result as a string
// e.g. sumString("3+8") returns "11"
def sumString(s: String): String = {
val v = s.split("\\+")
v.foldLeft(0)(_.toInt + _.toInt).toString
}
// Should return "12 lollies and 12 peanuts"
parse("{2+10} lollies and {3+{4+5}} peanuts")
Any ideas for a clean bit of code that could replace the ??? would be great. It's mostly out of curiosity that I'm searching for an elegant solution to this problem.
Parser combinators can handle this kind of situation:
import scala.util.parsing.combinator.RegexParsers
object BraceParser extends RegexParsers {
override def skipWhitespace = false
def number = """\d+""".r ^^ { _.toInt }
def sum: Parser[Int] = "{" ~> (number | sum) ~ "+" ~ (number | sum) <~ "}" ^^ {
case x ~ "+" ~ y => x + y
}
def text = """[^{}]+""".r
def chunk = sum ^^ {_.toString } | text
def chunks = rep1(chunk) ^^ {_.mkString} | ""
def apply(input: String): String = parseAll(chunks, input) match {
case Success(result, _) => result
case failure: NoSuccess => scala.sys.error(failure.msg)
}
}
Then:
BraceParser("{2+10} lollies and {3+{4+5}} peanuts")
//> res0: String = 12 lollies and 12 peanuts
There is some investment before getting comfortable with parser combinators but I think it is really worth it.
To help you decipher the syntax above:
- Regular expressions and strings are implicitly converted into primitive parsers that produce strings; they have type Parser[String].
- The ^^ operator applies a function to the parsed result.
  - It can convert a Parser[String] into a Parser[Int] with ^^ {_.toInt}.
  - Parser is a monad, and Parser[T].^^(f) is equivalent to Parser[T].map(f).
- The ~, ~> and <~ operators require their inputs to appear in a certain sequence.
  - ~> and <~ drop one side of the input from the result.
  - case a ~ b pattern-matches on the sequenced results.
  - Parser is a monad, and (p ~ q) ^^ { case a ~ b => f(a, b) } is equivalent to for (a <- p; b <- q) yield (f(a, b)).
  - (p <~ q) ^^ f is equivalent to for (a <- p; _ <- q) yield f(a).
- rep1 repeats a parser one or more times.
- | tries the parser on its left and, if that fails, tries the parser on its right.
How about
def resolvePair(s: String): String = {
val open = s.lastIndexOf('{')
val close = s.indexOf('}', open)
if((open >= 0) && (close > open)) {
val (a,b) = s.splitAt(open+1)
val (c,d) = b.splitAt(close-open-1)
resolvePair(a.dropRight(1)+sumString(c).toString+d.drop(1))
} else
s
}
I know it's ugly but I think it works fine.
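For comparison with the Rust examples elsewhere on this page, the "last opening brace is always innermost" idea can be written as a loop in Rust. This is a sketch only: resolve_all and sum_string are hypothetical names mirroring the Scala functions, and returning an error for unbalanced or non-numeric input is an assumption rather than part of the question.

```rust
/// Sums the '+'-separated integers in a string, e.g. "3+8" gives 11.
fn sum_string(s: &str) -> Result<i64, String> {
    s.split('+')
        .map(|part| {
            part.trim()
                .parse::<i64>()
                .map_err(|e| format!("bad number {:?}: {}", part, e))
        })
        .sum()
}

/// Repeatedly replaces the innermost {..} pair with its sum until no '{' is left.
fn resolve_all(input: &str) -> Result<String, String> {
    let mut s = input.to_string();
    // The last '{' always opens an innermost pair, because no '{' follows it.
    while let Some(open) = s.rfind('{') {
        let close = s[open..]
            .find('}')
            .map(|offset| open + offset)
            .ok_or_else(|| "unbalanced '{'".to_string())?;
        let total = sum_string(&s[open + 1..close])?;
        // Replace the whole pair, braces included, with the computed sum.
        s.replace_range(open..=close, &total.to_string());
    }
    Ok(s)
}
```

Each iteration removes one '{', so the loop terminates after as many steps as there are opening braces.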

Scala parsing mutually recursive functions for SML

I'm trying to write a parser in Scala for SML with Tokens. It almost works the way I want it to work, except for the fact that this currently parses
let fun f x = r and fun g y in r end;
instead of
let fun f x = r and g y in r end;
How do I change my code so that it recognizes that it doesn't need a FunToken for the second function?
def parseDef:Def = {
currentToken match {
case ValToken => {
eat(ValToken);
val nme:String = currentToken match {
case IdToken(x) => {advance; x}
case _ => error("Expected a name after VAL.")
}
eat(EqualToken);
VAL(nme,parseExp)
}
case FunToken => {
eat(FunToken);
val fnme:String = currentToken match {
case IdToken(x) => {advance; x}
case _ => error("Expected a name after VAL.")
}
val xnme:String = currentToken match {
case IdToken(x) => {advance; x}
case _ => error("Expected a name after VAL.")
}
def parseAnd:Def = currentToken match {
case AndToken => {eat(AndToken); FUN(fnme,xnme,parseExp,parseAnd)}
case _ => NOFUN
}
FUN(fnme,xnme,parseExp,parseAnd)
}
case _ => error("Expected VAL or FUN.");
}
}
Just implement the right grammar. Instead of
def ::= "val" id "=" exp | fun
fun ::= "fun" id id "=" exp ["and" fun]
SML's grammar actually is
def ::= "val" id "=" exp | "fun" fun
fun ::= id id "=" exp ["and" fun]
Btw, I think there are other problems with your parsing of fun. AFAICS, you are not parsing any "=" in the fun case. Moreover, after an "and", you are not even parsing any identifiers, just the function body.
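The corrected grammar translates directly into a recursive-descent parser in which "fun" is consumed once by def and each binding after "and" starts with an identifier. The following is a rough sketch in Rust rather than Scala, under loud assumptions: Token, FunBinding, parse_def and the helper functions are hypothetical names, the "val" alternative is omitted, and exp is simplified to a single identifier.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Fun,
    And,
    Equal,
    Id(String),
}

#[derive(Debug, PartialEq)]
struct FunBinding {
    name: String,
    param: String,
    body: String, // stand-in for a real expression tree
}

/// Consumes `want` or reports which token position failed.
fn expect(tokens: &[Token], pos: &mut usize, want: &Token) -> Result<(), String> {
    if tokens.get(*pos) == Some(want) {
        *pos += 1;
        Ok(())
    } else {
        Err(format!("expected {:?} at token {}", want, *pos))
    }
}

/// Consumes an identifier token and returns its name.
fn expect_id(tokens: &[Token], pos: &mut usize) -> Result<String, String> {
    match tokens.get(*pos) {
        Some(Token::Id(name)) => {
            *pos += 1;
            Ok(name.clone())
        }
        other => Err(format!("expected an identifier at token {}, found {:?}", *pos, other)),
    }
}

/// fun ::= id id "=" exp ["and" fun]
fn parse_fun_bindings(tokens: &[Token], pos: &mut usize) -> Result<Vec<FunBinding>, String> {
    let mut bindings = Vec::new();
    loop {
        let name = expect_id(tokens, pos)?;
        let param = expect_id(tokens, pos)?;
        expect(tokens, pos, &Token::Equal)?;
        let body = expect_id(tokens, pos)?;
        bindings.push(FunBinding { name, param, body });
        // An "and" continues with another binding; no second "fun" is expected.
        if tokens.get(*pos) == Some(&Token::And) {
            *pos += 1;
        } else {
            return Ok(bindings);
        }
    }
}

/// def ::= "fun" fun
fn parse_def(tokens: &[Token], pos: &mut usize) -> Result<Vec<FunBinding>, String> {
    expect(tokens, pos, &Token::Fun)?;
    parse_fun_bindings(tokens, pos)
}
```

With this split, the token sequence for fun f x = r and g y = r parses into two bindings, while a second "fun" after "and" is rejected because an identifier is expected there.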
You could inject the FunToken back into your input stream with an "uneat" function. This is not the most elegant solution, but it's the one that requires the least modification of your current code.
def parseAnd:Def = currentToken match {
case AndToken => { eat(AndToken);
uneat(FunToken);
FUN(fnme,xnme,parseExp,parseAnd) }
case _ => NOFUN
}
