How to build a numbered list parser in nom?

I'd like to parse a numbered list using nom in Rust.
For example, 1. Milk 2. Bread 3. Bacon.
I could use separated_list1 with an appropriate separator parser and element parser.
fn parser(input: &str) -> IResult<&str, Vec<&str>> {
preceded(
tag("1. "),
separated_list1(
tuple((tag(" "), digit1, tag(". "))),
take_while(is_alphabetic),
),
)(input)
}
However, this does not validate the increasing index numbers.
For example, it would happily parse invalid lists like 1. Milk 3. Bread 4. Bacon or 1. Milk 8. Bread 1. Bacon.
It seems there is no built-in nom parser that can do this. So I ventured to try to build my own first parser...
My idea was to implement a parser similar to separated_list1 but which keeps track of the index and passes it to the separator as argument. It could accept a closure as argument that can then create the separator parser based on the index argument.
fn parser(input: &str) -> IResult<&str, Vec<&str>> {
preceded(
tag("1. "),
separated_list1(
|index: i32| tuple((tag(" "), tag(&index.to_string()), tag(". "))),
take_while(is_alphabetic),
),
)(input)
}
I tried to take the implementation of separated_list1 and change the separator argument to G: FnOnce(i32) -> Parser<I, O2, E>, create an index variable (let mut index = 1;), pass it to sep(index) in the loop, and increment it at the end of the loop (index += 1;).
However, Rust's type system is not happy!
How can I make this work?
Here's the full code for reproduction
use nom::{
error::{ErrorKind, ParseError},
Err, IResult, InputLength, Parser,
};
pub fn separated_numbered_list1<I, O, O2, E, F, G>(
mut sep: G,
mut f: F,
) -> impl FnMut(I) -> IResult<I, Vec<O>, E>
where
I: Clone + InputLength,
F: Parser<I, O, E>,
G: FnOnce(i32) -> Parser<I, O2, E>,
E: ParseError<I>,
{
move |mut i: I| {
let mut res = Vec::new();
let mut index = 1;
// Parse the first element
match f.parse(i.clone()) {
Err(e) => return Err(e),
Ok((i1, o)) => {
res.push(o);
i = i1;
}
}
loop {
let len = i.input_len();
match sep(index).parse(i.clone()) {
Err(Err::Error(_)) => return Ok((i, res)),
Err(e) => return Err(e),
Ok((i1, _)) => {
// infinite loop check: the parser must always consume
if i1.input_len() == len {
return Err(Err::Error(E::from_error_kind(i1, ErrorKind::SeparatedList)));
}
match f.parse(i1.clone()) {
Err(Err::Error(_)) => return Ok((i, res)),
Err(e) => return Err(e),
Ok((i2, o)) => {
res.push(o);
i = i2;
}
}
}
}
index += 1;
}
}
}
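Two things in that signature look like likely causes of the type error, as far as I can tell: Parser is a trait, so it cannot be used as a bare return type in FnOnce(i32) -> Parser<I, O2, E>, and an FnOnce can be called only once while the loop calls sep on every iteration. One possible shape is an extra type parameter P for the separator parser, with the factory bounded as FnMut(..) -> P. The sketch below illustrates only that shape and does not use nom: parsers are modelled as plain closures returning Option<(rest, output)>, and numbered_sep, word and parse_list are hypothetical helpers invented for the illustration. It also starts the counter at 2, on the assumption that the leading "1. " is consumed separately as in the question.

```rust
/// Like `separated_list1`, but `sep` is a factory that builds the separator
/// parser from the index of the element that follows it.
fn separated_numbered_list1<'a, O, O2, F, G, P>(
    mut sep: G,
    mut f: F,
) -> impl FnMut(&'a str) -> Option<(&'a str, Vec<O>)>
where
    F: FnMut(&'a str) -> Option<(&'a str, O)>,
    G: FnMut(u32) -> P, // called on every iteration, so FnMut rather than FnOnce
    P: FnMut(&'a str) -> Option<(&'a str, O2)>, // concrete parser type, inferred
{
    move |input: &'a str| {
        // Parse the first element.
        let (mut rest, first) = f(input)?;
        let mut res = vec![first];
        // The first separator precedes element number 2.
        let mut index = 2;
        loop {
            let mut sep_parser = sep(index);
            let after_sep = match sep_parser(rest) {
                Some((r, _)) => r,
                None => return Some((rest, res)), // no further separator: done
            };
            match f(after_sep) {
                Some((r, o)) => {
                    // Guard against parsers that consume nothing.
                    if r.len() == rest.len() {
                        return None;
                    }
                    res.push(o);
                    rest = r;
                }
                None => return Some((rest, res)),
            }
            index += 1;
        }
    }
}

/// Hypothetical separator factory: matches " <index>. ".
fn numbered_sep<'a>(index: u32) -> impl FnMut(&'a str) -> Option<(&'a str, ())> {
    let expected = format!(" {}. ", index);
    move |input: &'a str| input.strip_prefix(expected.as_str()).map(|rest| (rest, ()))
}

/// Hypothetical element parser: one or more alphabetic characters.
fn word(input: &str) -> Option<(&str, &str)> {
    let end = input.find(|c: char| !c.is_alphabetic()).unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some((&input[end..], &input[..end]))
    }
}

fn parse_list(input: &str) -> Option<(&str, Vec<&str>)> {
    let rest = input.strip_prefix("1. ")?;
    let mut parser = separated_numbered_list1(numbered_sep, word);
    let result = parser(rest);
    result
}
```

Like separated_list1, this stops at the first separator that does not match instead of failing, so for "1. Milk 3. Bread" it returns only Milk and leaves " 3. Bread" unparsed. A caller that wants a hard error would need to check that the remaining input is empty.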

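Try combining many1(), separated_pair(), and verify() manually:
use std::cell::Cell;
use nom::{
bytes::complete::tag,
character::complete::{alpha1, digit1, multispace0},
combinator::{map_res, verify},
multi::many1,
sequence::{preceded, separated_pair},
IResult,
};
fn validated(input: &str) -> IResult<&str, Vec<(u32, &str)>> {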
let current_index = Cell::new(1u32);
let number = map_res(digit1, |s: &str| s.parse::<u32>());
let valid = verify(number, |digit| {
let i = current_index.get();
if digit == &i {
current_index.set(i + 1);
true
} else {
false
}
});
let pair = preceded(multispace0, separated_pair(valid, tag(". "), alpha1));
// Bind the result to a temporary so that the parser, which borrows current_index, is dropped before current_index goes out of scope. This will not compile without the temporary binding.
let tmp = many1(pair)(input);
tmp
}
#[test]
fn test_success() {
let input = "1. Milk 2. Bread 3. Bacon";
assert_eq!(validated(input), Ok(("", vec![(1, "Milk"), (2, "Bread"), (3, "Bacon")])));
}
#[test]
fn test_fail() {
let input = "2. Bread 3. Bacon 1. Milk";
validated(input).unwrap_err();
}
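The reason for the Cell, as far as I understand it, is that verify accepts its predicate as an Fn(&O) -> bool, so the closure cannot mutate a captured counter directly. Cell provides interior mutability through a shared reference. The std-only sketch below isolates that one trick; all_accepted is a hypothetical stand-in for any combinator that accepts only an Fn predicate and is not part of nom.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a combinator that only accepts an `Fn` predicate.
fn all_accepted<P>(items: &[u32], predicate: P) -> bool
where
    P: Fn(&u32) -> bool,
{
    items.iter().all(|n| predicate(n))
}

/// Returns true when the items are exactly 1, 2, 3, ... in order.
fn is_numbered_from_one(items: &[u32]) -> bool {
    // A plain `let mut expected` would make the closure FnMut and be rejected;
    // Cell lets an Fn closure update the counter through a shared reference.
    let expected = Cell::new(1u32);
    all_accepted(items, |n| {
        let e = expected.get();
        if *n == e {
            expected.set(e + 1);
            true
        } else {
            false
        }
    })
}
```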

Related

Building expression parser with Dart petitparser, getting stuck on node visitor

I've got more of my expression parser working (Dart PetitParser to get at AST datastructure created with ExpressionBuilder). It appears to be generating accurate ASTs for floats, parens, power, multiply, divide, add, subtract, unary negative in front of both numbers and expressions. (The nodes are either literal strings, or an object that has a precedence with a List payload that gets walked and concatenated.)
I'm stuck now on visiting the nodes. I have clean access to the top node (thanks to Lukas), but I'm stuck on deciding whether or not to add a paren. For example, in 20+30*40, we don't need parens around 30*40, and the parse tree correctly has the node for this closer to the root so I'll hit it first during traversal. However, I don't seem to have enough data when looking at the 30*40 node to determine if it needs parens before going on to the 20+.. A very similar case would be (20+30)*40, which gets parsed correctly with 20+30 closer to the root, so once again, when visiting the 20+30 node I need to add parens before going on to *40.
This has to be a solved problem, but I never went to compiler school, so I know just enough about ASTs to be dangerous. What "a ha" am I missing?
// rip-common.dart:
import 'package:petitparser/petitparser.dart';
// import 'package:petitparser/debug.dart';
class Node {
int precedence;
List<dynamic> args;
Node([this.precedence = 0, this.args = const []]) {
// nodeList.add(this);
}
@override
String toString() => 'Node($precedence $args)';
String visit([int fromPrecedence = -1]) {
print('=== visiting $this ===');
var buf = StringBuffer();
var parens = (precedence > 0) &&
(fromPrecedence > 0) &&
(precedence < fromPrecedence);
print('<$fromPrecedence $precedence $parens>');
// for debugging:
var curlyOpen = '';
var curlyClose = '';
buf.write(parens ? '(' : curlyOpen);
for (var arg in args) {
if (arg is Node) {
buf.write(arg.visit(precedence));
} else if (arg is String) {
buf.write(arg);
} else {
print('not Node or String: $arg');
buf.write('$arg');
}
}
buf.write(parens ? ')' : curlyClose);
print('$buf for buf');
return '$buf';
}
}
class RIPParser {
Parser _make_parser() {
final builder = ExpressionBuilder();
var number = char('-').optional() &
digit().plus() &
(char('.') & digit().plus()).optional();
// precedence 5
builder.group()
..primitive(number.flatten().map((a) => Node(0, [a])))
..wrapper(char('('), char(')'), (l, a, r) => Node(0, [a]));
// negation is a prefix operator
// precedence 4
builder.group()..prefix(char('-').trim(), (op, a) => Node(4, [op, a]));
// power is right-associative
// precedence 3
builder.group()..right(char('^').trim(), (a, op, b) => Node(3, [a, op, b]));
// multiplication and addition are left-associative
// precedence 2
builder.group()
..left(char('*').trim(), (a, op, b) => Node(2, [a, op, b]))
..left(char('/').trim(), (a, op, b) => Node(2, [a, op, b]));
// precedence 1
builder.group()
..left(char('+').trim(), (a, op, b) => Node(1, [a, op, b]))
..left(char('-').trim(), (a, op, b) => Node(1, [a, op, b]));
final parser = builder.build().end();
return parser;
}
Result _result(String input) {
var parser = _make_parser(); // eventually cache
var result = parser.parse(input);
return result;
}
String parse(String input) {
var result = _result(input);
if (result.isFailure) {
return result.message;
} else {
print('result.value = ${result.value}');
return '$result';
}
}
String visit(String input) {
var result = _result(input);
var top_node = result.value; // result.isFailure ...
return top_node.visit();
}
}
// rip_cmd_example.dart
import 'dart:io';
import 'package:rip_common/rip_common.dart';
void main() {
print('start');
String input;
while (true) {
input = stdin.readLineSync();
if (input.isEmpty) {
break;
}
print(RIPParser().parse(input));
print(RIPParser().visit(input));
}
;
print('done');
}
As you've observed, the ExpressionBuilder already assembles the tree in the right precedence order based on the operator groups you've specified.
This also happens for the wrapping parens node created here: ..wrapper(char('('), char(')'), (l, a, r) => Node(0, [a])). If I test for this node, I get back the input string for your example expressions: var parens = precedence == 0 && args.length == 1 && args[0] is Node;.
Unless I am missing something, there should be no reason for you to track the precedence manually. I would also recommend that you create different node classes for the different operators: ValueNode, ParensNode, NegNode, PowNode, MulNode, ... A bit verbose, but much easier to understand what is going on, if each of them can just visit (print, evaluate, optimize, ...) itself.
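To illustrate the idea of one node type per construct, here is a small sketch in Rust rather than Dart, so it is an analogy and not PetitParser code. It assumes the parser keeps an explicit Parens node wherever the input had parentheses, which is what the wrapper callback above does. For brevity all binary operators share one hypothetical Binary variant instead of separate PowNode, MulNode and so on.

```rust
enum Expr {
    Value(f64),
    Parens(Box<Expr>),
    Neg(Box<Expr>),
    Binary(Box<Expr>, char, Box<Expr>),
}

impl Expr {
    /// Each variant prints itself; parentheses come only from Parens nodes,
    /// so no precedence has to be passed down during the walk.
    fn print(&self) -> String {
        match self {
            Expr::Value(v) => v.to_string(),
            Expr::Parens(inner) => format!("({})", inner.print()),
            Expr::Neg(inner) => format!("-{}", inner.print()),
            Expr::Binary(l, op, r) => format!("{}{}{}", l.print(), op, r.print()),
        }
    }

    /// Each variant also evaluates itself.
    fn eval(&self) -> f64 {
        match self {
            Expr::Value(v) => *v,
            Expr::Parens(inner) => inner.eval(),
            Expr::Neg(inner) => -inner.eval(),
            Expr::Binary(l, op, r) => {
                let (a, b) = (l.eval(), r.eval());
                match *op {
                    '+' => a + b,
                    '-' => a - b,
                    '*' => a * b,
                    '/' => a / b,
                    '^' => a.powf(b),
                    _ => f64::NAN, // unknown operator in this sketch
                }
            }
        }
    }
}

/// Hypothetical helper to keep tree construction readable.
fn bin(l: Expr, op: char, r: Expr) -> Expr {
    Expr::Binary(Box::new(l), op, Box::new(r))
}
```

With this layout, the tree for (20+30)*40 is a Binary whose left child is a Parens node, and printing it reproduces the parentheses without any precedence comparison.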

Match a slug with Nom

I've been trying for some time to find a decent solution for Nom to recognize the slug as an alpha1.
So I could parse something like this
fn parse<'a>(text: &'a str) -> IResult<&'a str, &'a str> {
delimited(char(':'), slug, char(':'))(text)
}
assert_eq!(
parse(":hello-world-i-only-accept-alpha-numeric-char-and-dashes:"),
Ok(("", "hello-world-i-only-accept-alpha-numeric-char-and-dashes"))
);
I tried something like this, but it doesn't seem to work.
fn slug<T, E: ParseError<T>>(input: T) -> IResult<T, T, E>
where
T: InputTakeAtPosition,
<T as InputTakeAtPosition>::Item: AsChar + Clone,
{
input.split_at_position1(
|item| {
let c = item.clone().as_char();
!(item.is_alpha() || c == '-')
},
ErrorKind::Char,
)
}
PS: Do you know how to tell Nom that the "-" in a slug must not be at the beginning nor the end?
There is nom::multi::separated_list for exactly this (in nom 6 and later it is named separated_list0, with separated_list1 as the non-empty variant). And since you want the result to be the string itself rather than a vector of segments, combining it with nom::combinator::recognize will do the trick:
use std::error::Error;
use nom::{
IResult,
character::complete::{alphanumeric1, char},
combinator::recognize,
multi::separated_list,
sequence::delimited,
};
fn slug_parse<'a>(text: &'a str) -> IResult<&'a str, &'a str> {
let slug = separated_list(char('-'), alphanumeric1);
delimited(char(':'), recognize(slug), char(':'))(text)
}
fn main() -> Result<(), Box<dyn Error>> {
let (_, res) = slug_parse(":hello-world-i-only-accept-alpha-numeric-char-and-dashes:")?;
assert_eq!(
res,
"hello-world-i-only-accept-alpha-numeric-char-and-dashes"
);
Ok(())
}
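To make the rule behind that combination explicit (alphanumeric segments joined by single dashes, hence no dash at the start or end), here is a hand-written sketch using only std. It is not nom code; alnum_run, slug and delimited_slug are hypothetical names, and unlike a possibly-empty separated list it assumes at least one segment is required.

```rust
/// Length in bytes of the leading run of ASCII alphanumeric characters.
fn alnum_run(input: &str) -> usize {
    input
        .find(|c: char| !c.is_ascii_alphanumeric())
        .unwrap_or(input.len())
}

/// Parses `segment ('-' segment)*` and returns (rest, slug).
fn slug(input: &str) -> Option<(&str, &str)> {
    let mut end = alnum_run(input);
    if end == 0 {
        return None; // a slug cannot start with '-' or be empty
    }
    // Consume "-segment" only when the dash is followed by a non-empty segment,
    // so a trailing or doubled dash is left unconsumed.
    while input[end..].starts_with('-') {
        let seg = alnum_run(&input[end + 1..]);
        if seg == 0 {
            break;
        }
        end += 1 + seg;
    }
    Some((&input[end..], &input[..end]))
}

/// Parses `:slug:` and returns the slug without the colons.
fn delimited_slug(input: &str) -> Option<&str> {
    let rest = input.strip_prefix(':')?;
    let (rest, s) = slug(rest)?;
    rest.strip_prefix(':').map(|_| s)
}
```

A leftover dash makes the closing colon check fail, which is how inputs such as ":hello-:" and ":a--b:" get rejected.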

Borrowed RefCell does not last long enough when iterating over a list

I'm trying to implement a linked list to understand smart pointers in Rust. I defined a Node:
use std::{cell::RefCell, rc::Rc};
struct Node {
val: i32,
next: Option<Rc<RefCell<Node>>>,
}
and iterate like
fn iterate(node: Option<&Rc<RefCell<Node>>>) -> Vec<i32> {
let mut p = node;
let mut result = vec![];
loop {
if p.is_none() {
break;
}
result.push(p.as_ref().unwrap().borrow().val);
p = p.as_ref().unwrap().borrow().next.as_ref();
}
result
}
the compiler reports an error:
error[E0716]: temporary value dropped while borrowed
--> src/main.rs:27:13
|
27 | p = p.as_ref().unwrap().borrow().next.as_ref();
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -
| | |
| | temporary value is freed at the end of this statement
| | ... and the borrow might be used here, when that temporary is dropped and runs the destructor for type `std::cell::Ref<'_, Node>`
| creates a temporary which is freed while still in use
| a temporary with access to the borrow is created here ...
|
= note: consider using a `let` binding to create a longer lived value
What happened? Can't we use a reference to iterate on a node defined this way?
Instead of assigning p the borrowed reference, you need to clone the Rc:
use std::cell::RefCell;
use std::rc::Rc;
struct Node {
val: i32,
next: Option<Rc<RefCell<Node>>>,
}
fn iterate(node: Option<Rc<RefCell<Node>>>) -> Vec<i32> {
let mut p = node;
let mut result = vec![];
loop {
let node = match p {
None => break,
Some(ref n) => Rc::clone(n), // Clone the Rc
};
result.push(node.as_ref().borrow().val); //works because val is Copy
p = match node.borrow().next {
None => None,
Some(ref next) => Some(Rc::clone(next)), //clone the Rc
};
}
result
}
fn main() {
let node = Some(Rc::new(RefCell::new(Node {
val: 0,
next: Some(Rc::new(RefCell::new(Node { val: 1, next: None }))),
})));
let result = iterate(node);
print!("{:?}", result)
}
This is necessary because you are trying to use a value with a shorter lifetime in a context that requires a longer one. The result of p.as_ref().unwrap().borrow() is a temporary Ref guard that is dropped at the end of the statement, but you are trying to keep a reference obtained through it for the next loop iteration. That reference would dangle, and preventing this kind of use after free is one of the design goals of Rust.
The issue is that borrows do not own the object. If you want to use the next as p in the next loop, then p will have to own the object. This can be achieved with Rc (i.e. 'reference counted') and allows for multiple owners in a single thread.
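The same ownership idea can be written a little more compactly. This is a sketch that assumes the Node definition from the question: the cursor owns an Rc to the current node, and cloning the Option<Rc<..>> in next yields an owned handle that outlives the Ref guard.

```rust
use std::cell::RefCell;
use std::rc::Rc;

struct Node {
    val: i32,
    next: Option<Rc<RefCell<Node>>>,
}

fn iterate(head: Option<Rc<RefCell<Node>>>) -> Vec<i32> {
    let mut result = Vec::new();
    let mut current = head;
    // `current` is moved into `node` on each pass and refilled below.
    while let Some(node) = current {
        let guard = node.borrow(); // Ref guard, lives until the end of this pass
        result.push(guard.val);
        // Cloning bumps the reference count, so nothing borrowed from
        // `guard` escapes this iteration.
        current = guard.next.clone();
    }
    result
}
```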
What if the definition of Node::next is Option<Box<RefCell<Node>>>, how to iterate over this list?
Yes, I'm also confused by RefCell: without RefCell we can iterate over the list using references only, but with RefCell that fails. I even tried adding a vector of Ref values to hold the borrows, but still could not get it to work.
If you drop the RefCell you can iterate it like this:
struct Node {
val: i32,
next: Option<Box<Node>>,
}
fn iterate(node: Option<Box<Node>>) -> Vec<i32> {
let mut result = vec![];
let mut next = node.as_ref().map(|n| &**n);
while let Some(n) = next.take() {
result.push(n.val);
let x = n.next.as_ref().map(|n| &**n);
next = x;
}
result
}
fn main() {
let node = Some(Box::new(Node {
val: 0,
next: Some(Box::new(Node { val: 1, next: None })),
}));
let result = iterate(node);
print!("{:?}", result)
}
Maybe it's possible with a RefCell as well, but I was not able to work around the lifetime issues.
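One way that does seem to work with Box<RefCell<Node>> is recursion: each call keeps its own Ref guard alive on the stack while the rest of the list is visited, so no borrow has to outlive its guard. This is a sketch under that assumption; collect_into is a hypothetical helper name, and very long lists could overflow the stack because the recursion depth equals the list length.

```rust
use std::cell::RefCell;

struct Node {
    val: i32,
    next: Option<Box<RefCell<Node>>>,
}

/// Visits `node` and everything after it, appending values to `out`.
fn collect_into(node: &RefCell<Node>, out: &mut Vec<i32>) {
    let guard = node.borrow(); // stays alive for the whole recursive call
    out.push(guard.val);
    if let Some(next) = &guard.next {
        collect_into(next, out);
    }
}

fn iterate(head: &Option<Box<RefCell<Node>>>) -> Vec<i32> {
    let mut out = Vec::new();
    if let Some(node) = head {
        collect_into(node, &mut out);
    }
    out
}
```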
Here is a slightly different version of the answer above, with only one match expression inside the loop.
fn iterate(node: Option<Rc<RefCell<ListNode>>>) -> Vec<i32>{
let mut result = vec![];
let mut p = match node{
Some(x) => Rc::clone(&x),
None => return result,
};
loop {
result.push(p.as_ref().borrow().val); //works because val is Copy
let node = match &p.borrow().next{
Some(x) => Rc::clone(&x),
None => break,
};
p = node;
}
result
}

Why does the closure for `take_while` take its argument by reference?

Here is an example from Rust by Example:
fn is_odd(n: u32) -> bool {
n % 2 == 1
}
fn main() {
println!("Find the sum of all the squared odd numbers under 1000");
let upper = 1000;
// Functional approach
let sum_of_squared_odd_numbers: u32 =
(0..).map(|n| n * n) // All natural numbers squared
.take_while(|&n| n < upper) // Below upper limit
.filter(|n| is_odd(*n)) // That are odd
.fold(0, |sum, i| sum + i); // Sum them
println!("functional style: {}", sum_of_squared_odd_numbers);
}
Why does the closure for take_while take its argument by reference, while all the others take by value?
The implementation of Iterator::take_while is quite illuminating:
fn next(&mut self) -> Option<I::Item> {
if self.flag {
None
} else {
self.iter.next().and_then(|x| {
if (self.predicate)(&x) {
Some(x)
} else {
self.flag = true;
None
}
})
}
}
If the value returned from the underlying iterator were directly passed to the predicate, then ownership of the value would also be transferred. After the predicate was called, there would no longer be a value to return from the TakeWhile adapter if the predicate were true!
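The effect is easiest to see with items that are not Copy. The sketch below re-implements a minimal adapter (MyTakeWhile and short_words are hypothetical names, not the standard library's types) and runs it over Strings: the predicate only borrows each item, so the adapter can still hand the owned String on afterwards. Note also that filter's closure receives a reference too; |n| is_odd(*n) dereferences it in the body, whereas |&n| destructures it in the pattern.

```rust
struct MyTakeWhile<I, P> {
    iter: I,
    done: bool,
    predicate: P,
}

impl<I, P> Iterator for MyTakeWhile<I, P>
where
    I: Iterator,
    P: FnMut(&I::Item) -> bool,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        if self.done {
            return None;
        }
        let x = self.iter.next()?;
        // The predicate borrows `x`; ownership stays here so that `x`
        // can still be returned when the predicate holds.
        if (self.predicate)(&x) {
            Some(x)
        } else {
            self.done = true;
            None
        }
    }
}

/// Takes words from the front while they are shorter than four bytes.
fn short_words(words: Vec<String>) -> Vec<String> {
    MyTakeWhile {
        iter: words.into_iter(),
        done: false,
        predicate: |w: &String| w.len() < 4,
    }
    .collect()
}
```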

Using parser-combinators to parse string with escaped characters?

I'm trying to use the combine library in Rust to parse a string. The real data that I'm trying to parse looks something like this:
A79,216,0,4,2,2,N,"US\"PS"
So at the end of that data is a string in quotes, but the string will contain escaped characters as well. I can't figure out how to parse those escaped characters in between the other quotes.
extern crate parser_combinators;
use self::parser_combinators::*;
fn main() {
let s = r#""HE\"LLO""#;
let data = many(satisfy(|c| c != '"')); // Fails on escaped " obviously
let mut str_parser = between(satisfy(|c| c == '"'), satisfy(|c| c == '"'), data);
let result : Result<(String, &str), ParseError> = str_parser.parse(s);
match result {
Ok((value, _)) => println!("{:?}", value),
Err(err) => println!("{}", err),
}
}
//=> "HE\\"
The code above will parse that string successfully but will obviously fail on the escaped character in the middle, printing out "HE\\" in the end.
I want to change the code above so that it prints "HE\\\"LLO".
How do I do that?
I have a mostly functional JSON parser as a benchmark for parser-combinators which parses this sort of escaped characters. I have included a link to it and a slightly simplified version of it below.
fn json_char(input: State<&str>) -> ParseResult<char, &str> {
let (c, input) = try!(satisfy(|c| c != '"').parse_state(input));
let mut back_slash_char = satisfy(|c| "\"\\nrt".chars().find(|x| *x == c).is_some()).map(|c| {
match c {
'"' => '"',
'\\' => '\\',
'n' => '\n',
'r' => '\r',
't' => '\t',
c => c//Should never happen
}
});
match c {
'\\' => input.combine(|input| back_slash_char.parse_state(input)),
_ => Ok((c, input))
}
}
json_char
Since this parser may consume 1 or 2 characters it is not enough to use the primitive combinators and so we need to introduce a function which can branch on the character which is parsed.
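The branching idea can be shown without any parser library. The following is a std-only sketch and not parser-combinators or combine code; parse_quoted is a hypothetical function, and the set of accepted escapes is an assumption copied from the match above. On a backslash it consumes one more character and maps it; on any other character it consumes just that one.

```rust
/// Parses a double-quoted string with backslash escapes from the start of
/// `input` and returns (unescaped contents, remaining input).
fn parse_quoted(input: &str) -> Option<(String, &str)> {
    let mut chars = input.char_indices();
    // The string must start with an opening quote.
    if chars.next()?.1 != '"' {
        return None;
    }
    let mut out = String::new();
    while let Some((i, c)) = chars.next() {
        match c {
            // Unescaped quote: the string ends here ('"' is one byte long).
            '"' => return Some((out, &input[i + 1..])),
            // Backslash: branch and consume a second character.
            '\\' => {
                let (_, esc) = chars.next()?;
                out.push(match esc {
                    '"' => '"',
                    '\\' => '\\',
                    'n' => '\n',
                    'r' => '\r',
                    't' => '\t',
                    _ => return None, // unknown escape sequence
                });
            }
            _ => out.push(c),
        }
    }
    None // input ended before the closing quote
}
```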
I ran into the same problem and ended up with the following solution:
(
char('"'),
many1::<Vec<char>, _>(choice((
escaped_character(),
satisfy(|c| c != '"'),
))),
char('"')
)
Or in other words, a string is delimited by " followed by many escaped_characters or anything that isn't a closing ", and is closed by a closing ".
Here's a full example of how I'm using this:
pub enum Operand {
String { value: String },
}
fn escaped_character<I>() -> impl Parser<Input = I, Output = char>
where
I: Stream<Item = char>,
I::Error: ParseError<I::Item, I::Range, I::Position>,
{
(
char('\\'),
any(),
).and_then(|(_, x)| match x {
'0' => Ok('\0'),
'n' => Ok('\n'),
'\\' => Ok('\\'),
'"' => Ok('"'),
_ => Err(StreamErrorFor::<I>::unexpected_message(format!("Invalid escape sequence \\{}", x)))
})
}
#[test]
fn parse_escaped_character() {
let expected = Ok(('\n', " foo"));
assert_eq!(expected, escaped_character().easy_parse("\\n foo"))
}
fn string_operand<I>() -> impl Parser<Input = I, Output = Operand>
where
I: Stream<Item = char>,
I::Error: ParseError<I::Item, I::Range, I::Position>,
{
(
char('"'),
many1::<Vec<char>, _>(choice((
escaped_character(),
satisfy(|c| c != '"'),
))),
char('"')
)
.map(|(_,value,_)| Operand::String { value: value.into_iter().collect() })
}
#[test]
fn parse_string_operand() {
let expected = Ok((Operand::String { value: "foo \" bar \n baz \0".into() }, ""));
assert_eq!(expected, string_operand().easy_parse(r#""foo \" bar \n baz \0""#))
}
