I'm attempting to parse a subset of JSON that only contains a single, non-nested object with string only values that may contain escape sequences. E.g.
{
"A KEY": "SOME VALUE",
"Another key": "Escape sequences \n \r \\ \/ \f \t \u263A"
}
Using the Chomp parser combinator in Rust. I have it parsing this structure ignoring escape sequences but am having trouble working out how to handle the escape sequences. Looking at other quoted string parsers that use combinators such as:
Arc JSON parser
PHP parser-combinator
Paka
They each use a sequence combinator, what is the equivalent in Chomp?
Chomp is based on Attoparsec and Parsec, so for parsing escaped strings I would use the scan parser to obtain the slice between the " characters while keeping any escaped " characters.
The sequence combinator is just the ParseResult::bind method, used to chain the match of the " character and the escaped string itself so that it will be able to parse "foo\\"bar" and not just foo\\"bar. You get this for free when you use the parse! macro, each ; is implicitly converted into a bind call to chain the parsers together.
The linked parsers use a many and or combinator and allocate a vector for the resulting characters. Paka does not seem to do any transformation on the resulting array, and PHP is using a regex with a callback to unescape the string.
This is code translated from Attoparsec's Aeson benchmark for parsing a JSON-string while not unescaping any escaped characters.
#[macro_use]
extern crate chomp;
use chomp::*;
use chomp::buffer::IntoStream;
use chomp::buffer::Stream;
pub fn json_string(i: Input<u8>) -> U8Result<&[u8]> {
parse!{i;
token(b'"');
let escaped_str = scan(false, |s, c| if s { Some(false) }
else if c == b'"' { None }
else { Some(c == b'\\') });
token(b'"');
ret escaped_str
}
}
#[test]
fn test_it() {
let r = "\"foo\\\"bar\\tbaz\"".as_bytes().into_stream().parse(json_string);
assert_eq!(r, Ok(&b"foo\\\"bar\\tbaz"[..]));
}
The parser above is not equivalent, it yields a slice of u8 borrowed from the source buffer/slice. If you want an owned Vec of the data you should preferably use [T]::to_vec or String::from_utf8 instead of building a parser using many and or since it will not be as fast as scan and the result is the same.
If you want to parse UTF-8 and escape-sequences you can filter the resulting slice and then calling String::from_utf8 on the Vec (Rust strings are UTF-8, to use a string containing invalid UTF-8 can result in undefined behaviour). If performance is an issue you should build that into the parser most likely.
The documentation states (emphasis mine):
Using parsers is almost entirely done using the parse! macro, which enables us to do three distinct things:
Sequence parsers over the remaining input
Store intermediate results into datatypes
Return a datatype at the end, which may be the result of any arbitrary computation over the intermediate results.
It then provides this example of parsing a sequence of two numbers followed by a constant string:
fn f(i: Input<u8>) -> U8Result<(u8, u8, u8)> {
parse!{i;
let a = digit();
let b = digit();
string(b"missiles");
ret (a, b, a + b)
}
}
fn digit(i: Input<u8>) -> U8Result<u8> {
satisfy(i, |c| b'0' <= c && c <= b'9').map(|c| c - b'0')
}
There is also ParseResult::bind and ParseResult::then which are documented to sequentially compose a result with a second action.
Because I'm always interested in parsing, I went ahead and played with it a bit to see how it would look. I'm not happy with the deep indenting that would happen with the nested or calls, but there's probably something better that can be done. This is just one possible solution:
#[macro_use]
extern crate chomp;
use chomp::*;
use chomp::ascii::is_alpha;
use chomp::buffer::{Source, Stream, ParseError};
use std::str;
use std::iter::FromIterator;
#[derive(Debug)]
pub enum StringPart<'a> {
String(&'a [u8]),
Newline,
Slash,
}
impl<'a> StringPart<'a> {
fn from_bytes(s: &[u8]) -> StringPart {
match s {
br#"\\"# => StringPart::Slash,
br#"\n"# => StringPart::Newline,
s => StringPart::String(s),
}
}
}
impl<'a> FromIterator<StringPart<'a>> for String {
fn from_iter<I>(iterator: I) -> Self
where I: IntoIterator<Item = StringPart<'a>>
{
let mut s = String::new();
for part in iterator {
match part {
StringPart::String(p) => s.push_str(str::from_utf8(p).unwrap()),
StringPart::Newline => s.push('\n'),
StringPart::Slash => s.push('\\'),
}
}
s
}
}
fn json_string_part(i: Input<u8>) -> U8Result<StringPart> {
or(i,
|i| parse!{i; take_while1(is_alpha)},
|i| or(i,
|i| parse!{i; string(br"\\")},
|i| parse!{i; string(br"\n")}),
).map(StringPart::from_bytes)
}
fn json_string(i: Input<u8>) -> U8Result<String> {
many1(i, json_string_part)
}
fn main() {
let input = br#"\\stuff\n"#;
let mut i = Source::new(input as &[u8]);
println!("Input has {} bytes", input.len());
loop {
match i.parse(json_string) {
Ok(x) => {
println!("Result has {} bytes", x.len());
println!("{:?}", x);
},
Err(ParseError::Retry) => {}, // Needed to refill buffer when necessary
Err(ParseError::EndOfInput) => break,
Err(e) => { panic!("{:?}", e); }
}
}
}
Related
I am using the PEST parser and I am testing a simple example to get familiar with the syntax. I am trying to get every instance of ++ throughout the string but I am running into some issues. I think it may be an issue with the ANY keyword but I am not sure. Can anyone help point me in the right direction as to what is going wrong?
Here is my grammar.pest file
incrementing = {(prefix ~ ANY+ ~ "++" ~ suffix)}
prefix = {(NEWLINE | WHITESPACE)*}
suffix = {(NEWLINE | WHITESPACE)*}
WHITESPACE = _{ " " }
Here is my test case
//parses a file a matching rule and returns all instances of the rule
fn parse_file_contents_for_rule(rule: Rule, file_contents: &str) -> Option<Pairs<Rule>> {
SolgaParser::parse(rule, file_contents).ok()
}
fn parse_incrementing(file_contents: &str) {
//parse the file for the rule
let targets = parse_file_contents_for_rule(Rule::incrementing, file_contents);
//if there are matches
if targets.is_some() {
//iterate through all of the matches
for target in targets.unwrap().into_iter() {
println!("{}", target.as_str());
}
}
}
#[test]
fn test_parse_incrementing() {
let file_contents = r#"
index++;
a_thing++;
another_thing++;
should_not_match;
should_match++;
"#;
parse_incrementing(file_contents);
}
In your example, ANY+ is probably matching till the end of the line, so the ++ pattern is never matched, and therefore the whole incrementing rule is never matched.
Try changing it to (!"+" ~ ANY)+
How can I consume a list of tokens that may or may not be separated by a space?
I'm trying to parse Chinese romanization (pinyin) in the cedict format with nom (6.1.2). For example "ni3 hao3 ma5" which is, due to human error in transcription, sometimes written as "ni3hao3ma5" or "ni3hao3 ma5" (note the variable spacing).
I have written a parser that will handle individual syllables e.g. ["ni3", "hao3", "ma5"], and I'm trying to use a nom::multi::separated_list0 to parse it like so:
nom::multi::separated_list0(
nom::character::complete::space0,
syllable,
)(i)?;
However, I get a Err(Error(Error { input: "", code: SeparatedList })) after all the tokens have been consumed.
The problem with using
nom::multi::separated_list0(
nom::character::complete::space0,
syllable,
)(i)?;
Is that the space0 delimiter matches empty string, so it will reach the end of the input string and the separated_list0 will continue to try to consume the empty string, hence the Err(Error(Error { input: "", code: SeparatedList })).
The solution in my case was to use nom::multi::many1 and handling the optional spaces in the inner parser instead of nom::multi::separated_list0 like so:
fn syllables(i: &str) -> IResult<&str, Vec<Syllable>> {
// many 👇 instead of separated_list0
multi::many1(syllable)(i)
}
fn syllable(i: &str) -> IResult<&str, Syllable> {
let (rest, (_, pronunciation, tone)) = sequence::tuple((
// and handle the optional space
// here 👇
character::complete::space0,
character::complete::alpha1,
character::complete::digit0,
))(i)?;
Ok((rest, Syllable::new(pronunciation, tone)))
}
Continuing from How do I write combinators for my own parsers in Rust?, I stumbled into this question concerning bounds of functions that consume and/or yield functions/closures.
From these slides, I learned that to be convenient for consumers, you should try to take functions as FnOnce and return as Fn where possible. This gives the caller most freedom what to pass and what to do with the returned function.
In my example, FnOnce is not possible because I need to call that function multiple times. While trying to make it compile I arrived at two possibilities:
pub enum Parsed<'a, T> {
Some(T, &'a str),
None(&'a str),
}
impl<'a, T> Parsed<'a, T> {
pub fn unwrap(self) -> (T, &'a str) {
match self {
Parsed::Some(head, tail) => (head, &tail),
_ => panic!("Called unwrap on nothing."),
}
}
pub fn is_none(&self) -> bool {
match self {
Parsed::None(_) => true,
_ => false,
}
}
}
pub fn achar(character: char) -> impl Fn(&str) -> Parsed<char> {
move |input|
match input.chars().next() {
Some(c) if c == character => Parsed::Some(c, &input[1..]),
_ => Parsed::None(input),
}
}
pub fn some_v1<T>(parser: impl Fn(&str) -> Parsed<T>) -> impl Fn(&str) -> Parsed<Vec<T>> {
move |input| {
let mut re = Vec::new();
let mut pos = input;
loop {
match parser(pos) {
Parsed::Some(head, tail) => {
re.push(head);
pos = tail;
}
Parsed::None(_) => break,
}
}
Parsed::Some(re, pos)
}
}
pub fn some_v2<T>(mut parser: impl FnMut(&str) -> Parsed<T>) -> impl FnMut(&str) -> Parsed<Vec<T>> {
move |input| {
let mut re = Vec::new();
let mut pos = input;
loop {
match parser(pos) {
Parsed::Some(head, tail) => {
re.push(head);
pos = tail;
}
Parsed::None(_) => break,
}
}
Parsed::Some(re, pos)
}
}
#[test]
fn try_it() {
assert_eq!(some_v1(achar('#'))("##comment").unwrap(), (vec!['#', '#'], "comment"));
assert_eq!(some_v2(achar('#'))("##comment").unwrap(), (vec!['#', '#'], "comment"));
}
playground
Now I don't know which version is to be preferred. Version 1 takes Fn which is less general, but version 2 needs its parameter mutable.
Which one is more idiomatic/should be used and what is the rationale behind?
Update: Thanks jplatte for the suggestion on version one. I updated the code here, that case I find even more interesting.
Comparing some_v1 and some_v2 as you wrote them I would say version 2 should definitely be preferred because it is more general. I can't think of a good example for a parsing closure that would implement FnMut but not Fn, but there's really no disadvantage to parser being mut - as noted in the first comment on your question this doesn't constrain the caller in any way.
However, there is a way in which you can make version 1 more general (not strictly more general, just partially) than version 2, and that is by returning impl Fn(&str) -> … instead of impl FnMut(&str) -> …. By doing that, you get two functions that each are less constrained than the other in some way, so it might even make sense to keep both:
Version 1 with the return type change would be more restrictive in its argument (the callable can't mutate its associated data) but less restrictive in its return type (you guarantee that the returned callable doesn't mutate its associated data)
Version 2 would be less restrictive in its argument (the callable is allowed to mutate its associated data) but more restrictive in its return type (the returned callable might mutate its associated data)
In Python, a function called os.path.join() allows concatenating multiple strings into one path using the path separator of the operating system. In Rust, there is only a function join() that appends a string or a path to an existing path. This problem can't be solved with a normal function as a normal function needs to have a fixed number of arguments.
I'm looking for a macro that takes an arbitrary number of strings and paths and returns the joined path.
There's a reasonably simple example in the documentation for PathBuf:
use std::path::PathBuf;
let path: PathBuf = [r"C:\", "windows", "system32.dll"].iter().collect();
Once you read past the macro syntax, it's not too bad. Basically, we take require at least two arguments, and the first one needs to be convertible to a PathBuf via Into. Each subsequent argument is pushed on the end, which accepts anything that can be turned into a reference to a Path.
macro_rules! build_from_paths {
($base:expr, $($segment:expr),+) => {{
let mut base: ::std::path::PathBuf = $base.into();
$(
base.push($segment);
)*
base
}}
}
fn main() {
use std::{
ffi::OsStr,
path::{Path, PathBuf},
};
let a = build_from_paths!("a", "b", "c");
println!("{:?}", a);
let b = build_from_paths!(PathBuf::from("z"), OsStr::new("x"), Path::new("y"));
println!("{:?}", b);
}
A normal function which takes an iterable (e.g. a slice) can solve the problem in many contexts:
use std::path::{Path, PathBuf};
fn join_all<P, Ps>(parts: Ps) -> PathBuf
where
Ps: IntoIterator<Item = P>,
P: AsRef<Path>,
{
parts.into_iter().fold(PathBuf::new(), |mut acc, p| {
acc.push(p);
acc
})
}
fn main() {
let parts = vec!["/usr", "bin", "man"];
println!("{:?}", join_all(&parts));
println!("{:?}", join_all(&["/etc", "passwd"]));
}
Playground
I am very, very new to Rust and trying to implement some simple things to get the feel for the language. Right now, I'm stumbling over the best way to implement a class-like struct that involves casting a string to an int. I'm using a global-namespaced function and it feels wrong to my Ruby-addled brain.
What's the Rustic way of doing this?
use std::io;
struct Person {
name: ~str,
age: int
}
impl Person {
fn new(input_name: ~str) -> Person {
Person {
name: input_name,
age: get_int_from_input(~"Please enter a number for age.")
}
}
fn print_info(&self) {
println(fmt!("%s is %i years old.", self.name, self.age));
}
}
fn get_int_from_input(prompt_message: ~str) -> int {
println(prompt_message);
let my_input = io::stdin().read_line();
let my_val =
match from_str::<int>(my_input) {
Some(number_string) => number_string,
_ => fail!("got to put in a number.")
};
return my_val;
}
fn main() {
let first_person = Person::new(~"Ohai");
first_person.print_info();
}
This compiles and has the desired behaviour, but I am at a loss for what to do here--it's obvious I don't understand the best practices or how to implement them.
Edit: this is 0.8
Here is my version of the code, which I have made more idiomatic:
use std::io;
struct Person {
name: ~str,
age: int
}
impl Person {
fn print_info(&self) {
println!("{} is {} years old.", self.name, self.age);
}
}
fn get_int_from_input(prompt_message: &str) -> int {
println(prompt_message);
let my_input = io::stdin().read_line();
from_str::<int>(my_input).expect("got to put in a number.")
}
fn main() {
let first_person = Person {
name: ~"Ohai",
age: get_int_from_input("Please enter a number for age.")
};
first_person.print_info();
}
fmt!/format!
First, Rust is deprecating the fmt! macro, with printf-based syntax, in favor of format!, which uses syntax similar to Python format strings. The new version, Rust 0.9, will complain about the use of fmt!. Therefore, you should replace fmt!("%s is %i years old.", self.name, self.age) with format!("{} is {} years old.", self.name, self.age). However, we have a convenience macro println!(...) that means exactly the same thing as println(format!(...)), so the most idiomatic way to write your code in Rust would be
println!("{} is {} years old.", self.name, self.age);
Initializing structs
For a simple type like Person, it is idiomatic in Rust to create instances of the type by using the struct literal syntax:
let first_person = Person {
name: ~"Ohai",
age: get_int_from_input("Please enter a number for age.")
};
In cases where you do want a constructor, Person::new is the idiomatic name for a 'default' constructor (by which I mean the most commonly used constructor) for a type Person. However, it would seem strange for the default constructor to require initialization from user input. Usually, I think you would have a person module, for example (with person::Person exported by the module). In this case, I think it would be most idiomatic to use a module-level function fn person::prompt_for_age(name: ~str) -> person::Person. Alternatively, you could use a static method on Person -- Person::prompt_for_age(name: ~str).
&str vs. ~str in function parameters
I've changed the signature of get_int_from_input to take a &str instead of ~str. ~str denotes a string allocated on the exchange heap -- in other words, the heap that malloc/free in C, or new/delete in C++ operate on. Unlike in C/C++, however, Rust enforces the requirement that values on the exchange heap can only be owned by one variable at a time. Therefore, taking a ~str as a function parameter means that the caller of the function can't reuse the ~str argument that it passed in -- it would have to make a copy of the ~str using the .clone method.
On the other hand, &str is a slice into the string, which is just a reference to a range of characters in the string, so it doesn't require a new copy of the string to be allocated when a function with a &str parameter is called.
The reason to use &str rather than ~str for prompt_message in get_int_from_input is that the function doesn't need to hold onto the message past the end of the function. It only uses the prompt message in order to print it (and println takes a &str, not a ~str). Once you change the function to take &str, you can call it like get_int_from_input("Prompt") instead of get_int_from_input(~"Prompt"), which avoids the unnecessary allocation of "Prompt" on the heap (and similarly, you can avoid having to clone s in the code below):
let s: ~str = ~"Prompt";
let i = get_int_from_input(s.clone());
println(s); // Would complain that `s` is no longer valid without cloning it above
// if `get_int_from_input` takes `~str`, but not if it takes `&str`.
Option<T>::expect
The Option<T>::expect method is the idiomatic shortcut for the match statement you have, where you want to either return x if you get Some(x) or fail with a message if you get None.
Returning without return
In Rust, it is idiomatic (following the example of functional languages like Haskell and OCaml) to return a value without explicitly writing a return statement. In fact, the return value of a function is the result of the last expression in the function, unless the expression is followed by a semicolon (in which case it returns (), a.k.a. unit, which is essentially an empty placeholder value -- () is also what is returned by functions without an explicit return type, such as main or print_info).
Conclusion
I'm not a great expert on Rust by any means. If you want help on anything related to Rust, you can try, in addition to Stack Overflow, the #rust IRC channel on irc.mozilla.org or the Rust subreddit.
This isn't really rust-specifc, but try to split functionality into discrete units. Don't mix the low-level tasks of putting strings on the terminal and getting strings from the terminal with the more directly relevant (and largely implementation dependent) tasks of requesting a value, and verify it. When you do that, the design decisions you should make start to arise on their own.
For instance, you could write something like this (I haven't compiled it, and I'm new to rust myself, so they're probably at LEAST one thing wrong with this :) ).
fn validated_input_prompt<T>(prompt: ~str) {
println(prompt);
let res = io::stdin().read_line();
loop {
match res.len() {
s if s == 0 => { continue; }
s if s > 0 {
match T::from_str(res) {
Some(t) -> {
return t
},
None -> {
println("ERROR. Please try again.");
println(prompt);
}
}
}
}
}
}
And then use it as:
validated_input_prompt<int>("Enter a number:")
or:
validated_input_prompt<char>("Enter a Character:")
BUT, to make the latter work, you'd need to implement FromStr for chars, because (sadly) rust doesn't seem to do it by default. Something LIKE this, but again, I'm not really sure of the rust syntax for this.
use std::from_str::*;
impl FromStr for char {
fn from_str(s: &str) -> Option<Self> {
match len(s) {
x if x >= 1 => {
Option<char>.None
},
x if x == 0 => {
None,
},
}
return s[0];
}
}
A variation of telotortium's input reading function that doesn't fail on bad input. The loop { ... } keyword is preferred over writing while true { ... }. In this case using return is fine since the function is returning early.
fn int_from_input(prompt: &str) -> int {
println(prompt);
loop {
match from_str::<int>(io::stdin().read_line()) {
Some(x) => return x,
None => println("Oops, that was invalid input. Try again.")
};
}
}