Match a slug with Nom

I've been trying for some time to find a decent way to get Nom to recognize a slug, the way alpha1 recognizes letters, so that I could parse something like this:
fn parse<'a>(text: &'a str) -> IResult<&'a str, &'a str> {
    delimited(char(':'), slug, char(':'))(text)
}

assert_eq!(
    parse(":hello-world-i-only-accept-alpha-numeric-char-and-dashes:"),
    Ok(("", "hello-world-i-only-accept-alpha-numeric-char-and-dashes"))
);
I tried something like this, but it doesn't seem to work:
fn slug<T, E: ParseError<T>>(input: T) -> IResult<T, T, E>
where
    T: InputTakeAtPosition,
    <T as InputTakeAtPosition>::Item: AsChar + Clone,
{
    input.split_at_position1(
        |item| {
            let c = item.clone().as_char();
            !(item.is_alpha() || c == '-')
        },
        ErrorKind::Char,
    )
}
PS: Do you know how to tell Nom that the "-" in a slug must not appear at the beginning or at the end?

There is nom::multi::separated_list for exactly this. And since you want the result to be the string itself rather than a vector of segments, combining it with nom::combinator::recognize will do the trick:
use std::error::Error;

use nom::{
    IResult,
    character::complete::{alphanumeric1, char},
    combinator::recognize,
    multi::separated_list,
    sequence::delimited,
};

fn slug_parse<'a>(text: &'a str) -> IResult<&'a str, &'a str> {
    let slug = separated_list(char('-'), alphanumeric1);
    delimited(char(':'), recognize(slug), char(':'))(text)
}

fn main() -> Result<(), Box<dyn Error>> {
    let (_, res) = slug_parse(":hello-world-i-only-accept-alpha-numeric-char-and-dashes:")?;
    assert_eq!(
        res,
        "hello-world-i-only-accept-alpha-numeric-char-and-dashes"
    );
    Ok(())
}
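Regarding the PS in the question: with this shape every dash has to sit between two alphanumeric1 matches, so a slug that starts or ends with "-" never reaches the closing ':' and the overall parse fails. A quick check of that property, as a sketch that assumes the backtracking behaviour of separated_list in the nom version used above (in current nom the combinator is called separated_list0, and separated_list1 would additionally reject an empty slug):

#[test]
fn dashes_only_between_segments() {
    // A leading or trailing dash leaves a '-' in front of the closing ':',
    // so the delimited(...) parser fails on these inputs.
    assert!(slug_parse(":-hello-world:").is_err());
    assert!(slug_parse(":hello-world-:").is_err());
    // A well-formed slug still parses and leaves no input behind.
    assert_eq!(slug_parse(":hello-world:"), Ok(("", "hello-world")));
}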


How to build a numbered list parser in nom?

I'd like to parse a numbered list using nom in Rust.
For example, 1. Milk 2. Bread 3. Bacon.
I could use separated_list1 with an appropriate separator parser and element parser.
fn parser(input: &str) -> IResult<&str, Vec<&str>> {
    preceded(
        tag("1. "),
        separated_list1(
            tuple((tag(" "), digit1, tag(". "))),
            take_while(is_alphabetic),
        ),
    )(input)
}
However, this does not validate the increasing index numbers.
For example, it would happily parse invalid lists like 1. Milk 3. Bread 4. Bacon or 1. Milk 8. Bread 1. Bacon.
It seems there is no built-in nom parser that can do this, so I ventured to try building my first parser of my own...
My idea was to implement a parser similar to separated_list1 but which keeps track of the index and passes it to the separator as argument. It could accept a closure as argument that can then create the separator parser based on the index argument.
fn parser(input: &str) -> IResult<&str, Vec<&str>> {
    preceded(
        tag("1. "),
        separated_list1(
            |index: i32| tuple((tag(" "), tag(&index.to_string()), tag(". "))),
            take_while(is_alphabetic),
        ),
    )(input)
}
I tried to use the implementation of separated_list1 and change the separator argument to G: FnOnce(i32) -> Parser<I, O2, E>, create an index variable with let mut index = 1;, pass it to sep(index) in the loop, and increase it at the end of the loop with index += 1;.
However, Rust's type system is not happy!
How can I make this work?
Here's the full code for reproduction
use nom::{
    error::{ErrorKind, ParseError},
    Err, IResult, InputLength, Parser,
};

pub fn separated_numbered_list1<I, O, O2, E, F, G>(
    mut sep: G,
    mut f: F,
) -> impl FnMut(I) -> IResult<I, Vec<O>, E>
where
    I: Clone + InputLength,
    F: Parser<I, O, E>,
    G: FnOnce(i32) -> Parser<I, O2, E>,
    E: ParseError<I>,
{
    move |mut i: I| {
        let mut res = Vec::new();
        let mut index = 1;

        // Parse the first element
        match f.parse(i.clone()) {
            Err(e) => return Err(e),
            Ok((i1, o)) => {
                res.push(o);
                i = i1;
            }
        }

        loop {
            let len = i.input_len();
            match sep(index).parse(i.clone()) {
                Err(Err::Error(_)) => return Ok((i, res)),
                Err(e) => return Err(e),
                Ok((i1, _)) => {
                    // infinite loop check: the parser must always consume
                    if i1.input_len() == len {
                        return Err(Err::Error(E::from_error_kind(i1, ErrorKind::SeparatedList)));
                    }

                    match f.parse(i1.clone()) {
                        Err(Err::Error(_)) => return Ok((i, res)),
                        Err(e) => return Err(e),
                        Ok((i2, o)) => {
                            res.push(o);
                            i = i2;
                        }
                    }
                }
            }
            index += 1;
        }
    }
}
You can do this manually with many1(), separated_pair(), and verify():
use std::cell::Cell;

use nom::{
    bytes::complete::tag,
    character::complete::{alpha1, digit1, multispace0},
    combinator::{map_res, verify},
    multi::many1,
    sequence::{preceded, separated_pair},
    IResult,
};

fn validated(input: &str) -> IResult<&str, Vec<(u32, &str)>> {
    let current_index = Cell::new(1u32);
    let number = map_res(digit1, |s: &str| s.parse::<u32>());
    let valid = verify(number, |digit| {
        let i = current_index.get();
        if digit == &i {
            current_index.set(i + 1);
            true
        } else {
            false
        }
    });
    let pair = preceded(multispace0, separated_pair(valid, tag(". "), alpha1));
    // Give `current_index` time to be used and dropped via a temporary
    // binding; this will not compile without the temporary binding.
    let tmp = many1(pair)(input);
    tmp
}
#[test]
fn test_success() {
    let input = "1. Milk 2. Bread 3. Bacon";
    assert_eq!(
        validated(input),
        Ok(("", vec![(1, "Milk"), (2, "Bread"), (3, "Bacon")]))
    );
}

#[test]
fn test_fail() {
    let input = "2. Bread 3. Bacon 1. Milk";
    validated(input).unwrap_err();
}
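If you do want to keep the question's approach, where the combinator rebuilds the separator from the current index, the bounds can be expressed by giving the separator factory its own generic parameter: G: FnMut(i32) -> H with H: Parser<I, O2, E>, instead of G: FnOnce(i32) -> Parser<I, O2, E> (a trait cannot be used as a bare return type there). Below is a sketch of that variant written against the nom 7 Parser trait; the numbered helper at the end is a hypothetical usage that checks the index with verify instead of building a tag from a temporary String. Treat it as an untested illustration rather than a drop-in replacement.

use nom::{
    bytes::complete::tag,
    character::complete::{alpha1, digit1},
    combinator::verify,
    error::{ErrorKind, ParseError},
    sequence::{preceded, tuple},
    Err, IResult, InputLength, Parser,
};

// Like separated_list1, but the separator parser is rebuilt on every
// iteration from the current index. The index starts at 2 because the
// caller is expected to consume the leading "1. " itself.
pub fn separated_numbered_list1<I, O, O2, E, F, G, H>(
    mut sep: G,
    mut f: F,
) -> impl FnMut(I) -> IResult<I, Vec<O>, E>
where
    I: Clone + InputLength,
    F: Parser<I, O, E>,
    G: FnMut(i32) -> H,
    H: Parser<I, O2, E>,
    E: ParseError<I>,
{
    move |mut i: I| {
        let mut res = Vec::new();
        let mut index = 2;

        // Parse the first element.
        match f.parse(i.clone()) {
            Err(e) => return Err(e),
            Ok((i1, o)) => {
                res.push(o);
                i = i1;
            }
        }

        loop {
            let len = i.input_len();
            // Build a fresh separator parser for the current index.
            match sep(index).parse(i.clone()) {
                Err(Err::Error(_)) => return Ok((i, res)),
                Err(e) => return Err(e),
                Ok((i1, _)) => {
                    // Infinite loop check: the separator must always consume.
                    if i1.input_len() == len {
                        return Err(Err::Error(E::from_error_kind(i1, ErrorKind::SeparatedList)));
                    }
                    match f.parse(i1.clone()) {
                        Err(Err::Error(_)) => return Ok((i, res)),
                        Err(e) => return Err(e),
                        Ok((i2, o)) => {
                            res.push(o);
                            i = i2;
                            index += 1;
                        }
                    }
                }
            }
        }
    }
}

// Hypothetical usage: the list stops (without an error) at the first
// out-of-sequence index, so wrap it in all_consuming if trailing input
// should be rejected.
fn numbered(input: &str) -> IResult<&str, Vec<&str>> {
    preceded(
        tag("1. "),
        separated_numbered_list1(
            |index: i32| {
                tuple((
                    tag(" "),
                    verify(digit1, move |d: &str| d.parse::<i32>().map_or(false, |n| n == index)),
                    tag(". "),
                ))
            },
            alpha1,
        ),
    )(input)
}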

How can I match an exact tag using the nom library in Rust

I'm working on a tiny duration-parsing library written in Rust, using the nom library. In this library, I define a seconds parser combinator function. Its responsibility is to parse the various acceptable textual formats for representing seconds.
pub fn duration(input: &str) -> IResult<&str, std::time::Duration> {
    // Some code combining the various time format combinators
    // to match the format "10 days, 8 hours, 7 minutes and 6 seconds"
}

pub fn seconds(input: &str) -> IResult<&str, u64> {
    terminated(unsigned_integer_64, preceded(multispace0, second))(input)
}

fn second(input: &str) -> IResult<&str, &str> {
    alt((
        tag("seconds"),
        tag("second"),
        tag("secs"),
        tag("sec"),
        tag("s"),
    ))(input)
}
So far, the tag combinator has behaved as I expected. However, I recently discovered that the following assertion fails, and by tag's definition it is bound to fail:
assert!(second("se").is_err())
Indeed, the documentation states that "The input data will be compared to the tag combinator’s argument and will return the part of the input that matches the argument".
However, as my example hopefully illustrates, what I would like to achieve is some flavor of tag that would fail if the whole input could not be parsed. I looked into explicitly checking whether there is a rest left after parsing the input and found that it would work. I also explored, unsuccessfully, using some flavors of the complete and take combinators to achieve this.
What would be an idiomatic way to parse an "exact match" of a word, and fail on a partial result (that would return a rest)?
You can use the all_consuming combinator, which succeeds if the whole input has been consumed by its child parser:
// nom 6.1.2
use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::combinator::all_consuming;
use nom::IResult;

fn main() {
    assert!(second("se").is_err());
}

fn second(input: &str) -> IResult<&str, &str> {
    all_consuming(alt((
        tag("seconds"),
        tag("second"),
        tag("secs"),
        tag("sec"),
        tag("s"),
    )))(input)
}
Update
I think I misunderstood your original question. Maybe this is closer to what you need. The key is that you should write smaller parsers, and then combine them:
use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::character::complete::digit1;
use nom::combinator::all_consuming;
use nom::sequence::{terminated, tuple};
use nom::IResult;

#[derive(Debug)]
struct Time {
    min: u32,
    sec: u32,
}

fn main() {
    // OK
    let parsed = time("10 minutes, 5 seconds");
    println!("{:?}", parsed);

    // OK
    let parsed = time("10 mins, 5 s");
    println!("{:?}", parsed);

    // Error -> although `min` is a valid tag, it would expect `, ` afterwards, instead of `ts`
    let parsed = time("10 mints, 5 s");
    println!("{:?}", parsed);

    // Error -> there must not be anything left after "5 s"
    let parsed = time("10 mins, 5 s, ");
    println!("{:?}", parsed);

    // Error -> although it starts with `sec`, which is a valid tag, it will fail, because it expects EOF
    let parsed = time("10 min, 5 sections");
    println!("{:?}", parsed);
}

fn time(input: &str) -> IResult<&str, Time> {
    // parse the minutes section and **expect** a delimiter, because there **must** be another section afterwards
    let (rem, min) = terminated(minutes_section, delimiter)(input)?;
    // parse the seconds section and **expect** EOF - i.e. there should not be any input left to parse
    let (rem, sec) = all_consuming(seconds_section)(rem)?;
    // rem should be an empty slice
    IResult::Ok((rem, Time { min, sec }))
}

// This function combines several parsers to parse the minutes section:
// NUMBER[sep]TAG-MINUTES
fn minutes_section(input: &str) -> IResult<&str, u32> {
    let (rem, (min, _sep, _tag)) = tuple((number, separator, minutes))(input)?;
    IResult::Ok((rem, min))
}

// This function combines several parsers to parse the seconds section:
// NUMBER[sep]TAG-SECONDS
fn seconds_section(input: &str) -> IResult<&str, u32> {
    let (rem, (sec, _sep, _tag)) = tuple((number, separator, seconds))(input)?;
    IResult::Ok((rem, sec))
}

fn number(input: &str) -> IResult<&str, u32> {
    digit1(input).map(|(remaining, number)| {
        // this can panic if the string represents a number
        // that does not fit into u32
        let n = number.parse().unwrap();
        (remaining, n)
    })
}

fn minutes(input: &str) -> IResult<&str, &str> {
    alt((
        tag("minutes"),
        tag("minute"),
        tag("mins"),
        tag("min"),
        tag("m"),
    ))(input)
}

fn seconds(input: &str) -> IResult<&str, &str> {
    alt((
        tag("seconds"),
        tag("second"),
        tag("secs"),
        tag("sec"),
        tag("s"),
    ))(input)
}

// This function parses the separator between the number and the tag:
// N<separator>tag -> 5[sep]minutes
fn separator(input: &str) -> IResult<&str, &str> {
    tag(" ")(input)
}

// This function parses the delimiter between the sections:
// X minutes<delimiter>Y seconds -> 1 min[delimiter]2 sec
fn delimiter(input: &str) -> IResult<&str, &str> {
    tag(", ")(input)
}
Here I have created a set of basic parsers for the building blocks, such as "number", "separator", "delimiter", and the various unit markers (min, sec, etc.). None of them is expected to be a "whole word" on its own. Instead, you use combinators such as terminated, tuple, and all_consuming to mark where the "exact word" ends.
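Another option, in case you would rather not require a specific delimiter or EOF after every unit, is to assert that a unit tag is not followed by another word character. This is not part of the answer above, just a sketch of a hypothetical second_word helper using nom's not combinator:

use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::character::complete::alphanumeric1;
use nom::combinator::not;
use nom::sequence::terminated;
use nom::IResult;

// Matches a seconds marker only as a whole word: the tag must not be
// immediately followed by another alphanumeric character. `not` consumes
// nothing, so the rest of the input is left untouched.
fn second_word(input: &str) -> IResult<&str, &str> {
    terminated(
        alt((tag("seconds"), tag("second"), tag("secs"), tag("sec"), tag("s"))),
        not(alphanumeric1),
    )(input)
}

#[test]
fn second_word_matches_whole_words_only() {
    assert_eq!(second_word("sec, 5 min"), Ok((", 5 min", "sec")));
    assert!(second_word("sections").is_err()); // "sec" is followed by "tions"
}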

With closures as parameter and return values, is Fn or FnMut more idiomatic?

Continuing from How do I write combinators for my own parsers in Rust?, I stumbled into this question concerning bounds of functions that consume and/or yield functions/closures.
From these slides, I learned that to be convenient for consumers, you should try to take functions as FnOnce and return as Fn where possible. This gives the caller most freedom what to pass and what to do with the returned function.
In my example, FnOnce is not possible because I need to call that function multiple times. While trying to make it compile I arrived at two possibilities:
pub enum Parsed<'a, T> {
    Some(T, &'a str),
    None(&'a str),
}

impl<'a, T> Parsed<'a, T> {
    pub fn unwrap(self) -> (T, &'a str) {
        match self {
            Parsed::Some(head, tail) => (head, &tail),
            _ => panic!("Called unwrap on nothing."),
        }
    }

    pub fn is_none(&self) -> bool {
        match self {
            Parsed::None(_) => true,
            _ => false,
        }
    }
}

pub fn achar(character: char) -> impl Fn(&str) -> Parsed<char> {
    move |input| match input.chars().next() {
        Some(c) if c == character => Parsed::Some(c, &input[1..]),
        _ => Parsed::None(input),
    }
}

pub fn some_v1<T>(parser: impl Fn(&str) -> Parsed<T>) -> impl Fn(&str) -> Parsed<Vec<T>> {
    move |input| {
        let mut re = Vec::new();
        let mut pos = input;
        loop {
            match parser(pos) {
                Parsed::Some(head, tail) => {
                    re.push(head);
                    pos = tail;
                }
                Parsed::None(_) => break,
            }
        }
        Parsed::Some(re, pos)
    }
}

pub fn some_v2<T>(mut parser: impl FnMut(&str) -> Parsed<T>) -> impl FnMut(&str) -> Parsed<Vec<T>> {
    move |input| {
        let mut re = Vec::new();
        let mut pos = input;
        loop {
            match parser(pos) {
                Parsed::Some(head, tail) => {
                    re.push(head);
                    pos = tail;
                }
                Parsed::None(_) => break,
            }
        }
        Parsed::Some(re, pos)
    }
}

#[test]
fn try_it() {
    assert_eq!(some_v1(achar('#'))("##comment").unwrap(), (vec!['#', '#'], "comment"));
    assert_eq!(some_v2(achar('#'))("##comment").unwrap(), (vec!['#', '#'], "comment"));
}
playground
Now I don't know which version is to be preferred. Version 1 takes Fn, which is less general, but version 2 needs its parameter to be mutable.
Which one is more idiomatic/should be used and what is the rationale behind?
Update: Thanks to jplatte for the suggestion on version one. I have updated the code here; I find that case even more interesting.
Comparing some_v1 and some_v2 as you wrote them I would say version 2 should definitely be preferred because it is more general. I can't think of a good example for a parsing closure that would implement FnMut but not Fn, but there's really no disadvantage to parser being mut - as noted in the first comment on your question this doesn't constrain the caller in any way.
However, there is a way in which you can make version 1 more general (not strictly more general, just partially) than version 2, and that is by returning impl Fn(&str) -> … instead of impl FnMut(&str) -> …. By doing that, you get two functions that each are less constrained than the other in some way, so it might even make sense to keep both:
Version 1 with the return type change would be more restrictive in its argument (the callable can't mutate its associated data) but less restrictive in its return type (you guarantee that the returned callable doesn't mutate its associated data)
Version 2 would be less restrictive in its argument (the callable is allowed to mutate its associated data) but more restrictive in its return type (the returned callable might mutate its associated data)
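For instance (a hypothetical counting_char, added here only for illustration): a parser that mutates a captured call counter implements FnMut but not Fn, so it can be passed to some_v2 while some_v1 would reject it:

// Like achar, but it also bumps a captured counter on every call, so the
// returned closure is FnMut and not Fn.
pub fn counting_char(character: char) -> impl FnMut(&str) -> Parsed<char> {
    let mut calls = 0u32;
    move |input| {
        calls += 1; // mutating captured state is what rules out `Fn`
        match input.chars().next() {
            Some(c) if c == character => Parsed::Some(c, &input[1..]),
            _ => Parsed::None(input),
        }
    }
}

#[test]
fn stateful_parser_needs_some_v2() {
    // some_v2 accepts the FnMut closure; some_v1's Fn bound would not.
    assert_eq!(
        some_v2(counting_char('#'))("##rest").unwrap(),
        (vec!['#', '#'], "rest")
    );
}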

Singly-Linked List in Rust

I've been trying to teach myself some Rust lately and wanted to practice a bit by implementing a simple linked list. I took some inspiration from the Rust standard library's linked list and tried to replicate the parts I already understood. Also, I decided to make it singly linked for now.
struct Node<T> {
    element: T,
    next: Option<Box<Node<T>>>,
}

impl<T> Node<T> {
    fn new(element: T) -> Self {
        Node {
            element: element,
            next: None,
        }
    }

    fn append(&mut self, element: Box<Node<T>>) {
        self.next = Some(element);
    }
}

pub struct LinkedList<T> {
    head: Option<Box<Node<T>>>,
    tail: Option<Box<Node<T>>>,
    len: u32,
}

impl<T> LinkedList<T> {
    pub fn new() -> Self {
        LinkedList {
            head: None,
            tail: None,
            len: 0,
        }
    }

    pub fn push(&mut self, element: T) {
        let node: Box<Node<T>> = Box::new(Node::new(element));
        match self.tail {
            None => self.head = Some(node),
            Some(ref mut tail) => tail.append(node),
        }
        self.tail = Some(node);
        self.len += 1;
    }

    pub fn pop(&mut self) -> Option<T> {
        //not implemented
    }

    pub fn get(&self, index: u32) -> Option<T> {
        //not implemented
    }
}
This is what I've got so far; from what I understand, the problem with this code is that the Box cannot have more than one reference to it, in order to preserve memory safety.
So when I set the list head to node in
None => self.head = Some(node),
I can't then go ahead and set
self.tail = Some(node);
later, am I correct so far in my understanding? What would be the correct way to do this? Do I have to use Shared like in the library or is there a way in which the Box or some other type can be utilized?
Your issue is that you are attempting to use a value (node) after having moved it; since Box<Node<T>> does not implement Copy, when you use it in the match expression:
match self.tail {
    None => self.head = Some(node),
    Some(ref mut tail) => tail.append(node),
}
node is moved either to self.head or to self.tail and can no longer be used later. Other than reading the obligatory Learning Rust With Entirely Too Many Linked Lists to see the different ways in which you can implement linked lists in Rust, I suggest that you first do some more research in the field of Rust's basic concepts, especially:
Ownership
References and Borrowing
What are move semantics?
You can go with something simpler than that, using only your nodes:
use std::fmt;

struct Payload {
    id: i32,
    value: i32,
}

impl fmt::Display for Payload {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "({}, {})", self.id, self.value)
    }
}

struct Node<T> {
    element: T,
    next: Option<Box<Node<T>>>,
}

impl<T> Node<T>
where
    T: std::fmt::Display,
{
    fn new(element: T) -> Self {
        Node {
            element: element,
            next: None,
        }
    }

    fn append(&mut self, element: T) {
        match &mut self.next {
            None => {
                let n = Node {
                    element: element,
                    next: None,
                };
                self.next = Some(Box::new(n));
            }
            Some(ref mut x) => x.append(element),
        }
    }

    fn list(&self) {
        println!("{}", self.element);
        match &self.next {
            None => {}
            Some(x) => x.list(),
        }
    }
}

fn main() {
    let mut h = Node::new(Payload { id: 1, value: 1 });
    h.append(Payload { id: 2, value: 2 });
    h.append(Payload { id: 3, value: 3 });
    h.append(Payload { id: 4, value: 4 });
    h.append(Payload { id: 5, value: 5 });
    h.list();
    h.append(Payload { id: 6, value: 6 });
    h.list();
}

Using parser-combinators to parse string with escaped characters?

I'm trying to use the combine library in Rust to parse a string. The real data that I'm trying to parse looks something like this:
A79,216,0,4,2,2,N,"US\"PS"
So at the end of that data is a string in quotes, but the string will contain escaped characters as well. I can't figure out how to parse those escaped characters in between the other quotes.
extern crate parser_combinators;
use self::parser_combinators::*;

fn main() {
    let s = r#""HE\"LLO""#;
    let data = many(satisfy(|c| c != '"')); // Fails on escaped " obviously
    let mut str_parser = between(satisfy(|c| c == '"'), satisfy(|c| c == '"'), data);
    let result: Result<(String, &str), ParseError> = str_parser.parse(s);
    match result {
        Ok((value, _)) => println!("{:?}", value),
        Err(err) => println!("{}", err),
    }
}
//=> "HE\\"
The code above will parse that string successfully but will obviously fail on the escaped character in the middle, printing out "HE\\" in the end.
I want to change the code above so that it prints "HE\\\"LLO".
How do I do that?
I have a mostly functional JSON parser as a benchmark for parser-combinators which parses this sort of escaped character. I have included a link to it and a slightly simplified version of it below.
fn json_char(input: State<&str>) -> ParseResult<char, &str> {
    let (c, input) = try!(satisfy(|c| c != '"').parse_state(input));
    let mut back_slash_char = satisfy(|c| "\"\\nrt".chars().find(|x| *x == c).is_some()).map(|c| {
        match c {
            '"' => '"',
            '\\' => '\\',
            'n' => '\n',
            'r' => '\r',
            't' => '\t',
            c => c, // Should never happen
        }
    });
    match c {
        '\\' => input.combine(|input| back_slash_char.parse_state(input)),
        _ => Ok((c, input)),
    }
}
json_char
Since this parser may consume one or two characters, it is not enough to use the primitive combinators, so we need to introduce a function that can branch on the character that was parsed.
I ran into the same problem and ended up with the following solution:
(
    char('"'),
    many1::<Vec<char>, _>(choice((
        escaped_character(),
        satisfy(|c| c != '"'),
    ))),
    char('"'),
)
Or in other words, a string is delimited by " followed by many escaped_characters or anything that isn't a closing ", and is closed by a closing ".
Here's a full example of how I'm using this:
// Debug and PartialEq are needed for the assert_eq! calls in the tests below.
#[derive(Debug, PartialEq)]
pub enum Operand {
    String { value: String },
}

fn escaped_character<I>() -> impl Parser<Input = I, Output = char>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        char('\\'),
        any(),
    ).and_then(|(_, x)| match x {
        '0' => Ok('\0'),
        'n' => Ok('\n'),
        '\\' => Ok('\\'),
        '"' => Ok('"'),
        _ => Err(StreamErrorFor::<I>::unexpected_message(format!(
            "Invalid escape sequence \\{}",
            x
        ))),
    })
}

#[test]
fn parse_escaped_character() {
    let expected = Ok(('\n', " foo"));
    assert_eq!(expected, escaped_character().easy_parse("\\n foo"))
}

fn string_operand<I>() -> impl Parser<Input = I, Output = Operand>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        char('"'),
        many1::<Vec<char>, _>(choice((
            escaped_character(),
            satisfy(|c| c != '"'),
        ))),
        char('"'),
    )
        .map(|(_, value, _)| Operand::String {
            value: value.into_iter().collect(),
        })
}

#[test]
fn parse_string_operand() {
    let expected = Ok((Operand::String { value: "foo \" bar \n baz \0".into() }, ""));
    assert_eq!(expected, string_operand().easy_parse(r#""foo \" bar \n baz \0""#))
}
