How do I parse uppercase strings in Nom?

I'm writing parsers in Nom 5 using functions, not macros. My goal is to write a parser that recognizes a string composed entirely of uppercase characters. Ideally, it would have the same return signature as alpha1.
use nom::{
    bytes::complete::take_while,
    character::complete::{alpha1, char, line_ending, not_line_ending},
    combinator::{cut, map, not, recognize},
    error::{context, ParseError, VerboseError},
    multi::{many0, many1},
    IResult,
};

fn uppercase_char<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, &'a str, E> {
    let chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    take_while(move |c| chars.contains(c))(i)
}

// Matches 1 or more consecutive uppercase characters
fn upper1<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, &'a str, E> {
    recognize(many1(uppercase_char))(i)
}
Although this compiles, the simple unit test I wrote fails:
#[test]
fn test_upper_string_ok() {
    let input_text = "ADAM";
    let output = upper1::<VerboseError<&str>>(input_text);
    dbg!(&output);
    let expected = Ok(("ADAM", ""));
    assert_eq!(output, expected);
}
The failure output is
---- parse::tests::test_upper_string_ok stdout ----
[src/parse.rs:110] &output = Err(
Error(
VerboseError {
errors: [
(
"",
Nom(
Many1,
),
),
],
},
),
)
thread 'parse::tests::test_upper_string_ok' panicked at 'assertion failed: `(left == right)`
left: `Err(Error(VerboseError { errors: [("", Nom(Many1))] }))`,
right: `Ok(("ADAM", ""))`', src/parse.rs:112:9
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

take_while matches zero or more characters, so when it is used inside many1 as you did, the first call consumes the entire "ADAM" string. When many1 then calls it again, take_while succeeds on the empty remainder without consuming anything. many0 and many1 guard against exactly this mistake: if the underlying parser succeeds without consuming any input, they return an error, which is the Many1 error you see.
For what you need, the uppercase_char function alone should be enough; there is no need for recognize and many1. You will probably want to replace take_while with take_while1, though, so that an empty match is rejected.
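To make the difference concrete, here is a minimal sketch of the "one or more" behaviour written against the standard library only, so it runs without nom. The `upper1` function below is a hand-rolled stand-in, not nom's API; it only mirrors the `(remaining, matched)` ordering of nom's `IResult`. With nom itself, the suggestion above amounts to swapping `take_while` for `take_while1` in uppercase_char.

```rust
/// Hand-rolled stand-in for a `take_while1`-style parser (not nom's API).
/// Returns `(remaining, matched)`, the same ordering nom's `IResult` uses.
fn upper1(input: &str) -> Result<(&str, &str), String> {
    // Byte index of the first character that is not an ASCII uppercase letter.
    let end = input
        .find(|c: char| !c.is_ascii_uppercase())
        .unwrap_or(input.len());
    if end == 0 {
        // Like `take_while1`, fail rather than succeed with an empty match.
        return Err(format!("expected an uppercase letter at {:?}", input));
    }
    Ok((&input[end..], &input[..end]))
}
```

Note that nom puts the remaining input first in its result tuple, so a passing version of the unit test in the question would expect `Ok(("", "ADAM"))` rather than `Ok(("ADAM", ""))`.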

Related

Parsing a variably space delimited list with nom

How can I consume a list of tokens that may or may not be separated by a space?
I'm trying to parse Chinese romanization (pinyin) in the cedict format with nom (6.1.2). For example "ni3 hao3 ma5" which is, due to human error in transcription, sometimes written as "ni3hao3ma5" or "ni3hao3 ma5" (note the variable spacing).
I have written a parser that will handle individual syllables e.g. ["ni3", "hao3", "ma5"], and I'm trying to use a nom::multi::separated_list0 to parse it like so:
nom::multi::separated_list0(
    nom::character::complete::space0,
    syllable,
)(i)?;
However, I get an Err(Error(Error { input: "", code: SeparatedList })) after all the tokens have been consumed.
The problem with using
nom::multi::separated_list0(
    nom::character::complete::space0,
    syllable,
)(i)?;
is that the space0 delimiter also matches the empty string, so once the end of the input string is reached, separated_list0 keeps trying to consume the empty string, hence the Err(Error(Error { input: "", code: SeparatedList })).
The solution in my case was to use nom::multi::many1 instead of nom::multi::separated_list0 and to handle the optional spaces in the inner parser, like so:
fn syllables(i: &str) -> IResult<&str, Vec<Syllable>> {
    // many 👇 instead of separated_list0
    multi::many1(syllable)(i)
}

fn syllable(i: &str) -> IResult<&str, Syllable> {
    let (rest, (_, pronunciation, tone)) = sequence::tuple((
        // and handle the optional space
        // here 👇
        character::complete::space0,
        character::complete::alpha1,
        character::complete::digit0,
    ))(i)?;
    Ok((rest, Syllable::new(pronunciation, tone)))
}
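The reason this version terminates is that every successful call to syllable consumes at least one letter, so the repetition can never spin on an empty match. A rough std-only sketch of the same shape follows; the function names mirror the ones above, but the bodies are hand-rolled stand-ins rather than nom combinators, and the syllable is returned as a plain `(pronunciation, tone)` pair instead of the `Syllable` type.

```rust
/// Hypothetical stand-in for the `syllable` parser: skips optional leading
/// spaces, then reads letters followed by optional digits.
/// Returns `(remaining, (pronunciation, tone))`, or `None` if no letters are found.
fn syllable(input: &str) -> Option<(&str, (&str, &str))> {
    let input = input.trim_start_matches(' ');
    let letters_end = input
        .find(|c: char| !c.is_ascii_alphabetic())
        .unwrap_or(input.len());
    if letters_end == 0 {
        return None; // like `alpha1`, require at least one letter
    }
    let (pronunciation, rest) = input.split_at(letters_end);
    let digits_end = rest
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(rest.len());
    let (tone, rest) = rest.split_at(digits_end);
    Some((rest, (pronunciation, tone)))
}

/// `many1`-style loop: apply `syllable` until it fails, requiring at least one match.
fn syllables(mut input: &str) -> Option<(&str, Vec<(&str, &str)>)> {
    let mut out = Vec::new();
    while let Some((rest, syl)) = syllable(input) {
        out.push(syl);
        input = rest;
    }
    if out.is_empty() {
        None
    } else {
        Some((input, out))
    }
}
```

Both "ni3hao3 ma5" and "ni3 hao3 ma5" yield the same three syllables with this approach, because the spacing is absorbed by the element parser rather than by a separator that may match nothing.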

PEG: What is wrong with my grammar for if statement?

I'm implementing an OCaml-like language using rust-peg and my parser has a bug.
I defined if-statement grammar, but it doesn't work.
I'm guessing the test-case input is parsed as Apply(Apply(Apply(Apply(f, then), 2), else), 3). That is, "then" is parsed as an Ident, not as a keyword.
I have no idea how to fix this apply-expression grammar. Do you have any ideas?
#[derive(Clone, PartialEq, Eq, Debug)]
pub enum Expression {
Number(i64),
If {
cond: Box<Expression>,
conseq: Box<Expression>,
alt: Box<Expression>,
},
Ident(String),
Apply(Box<Expression>, Box<Expression>),
}
use peg::parser;
use toplevel::expression;
use Expression::*;
parser! {
    pub grammar toplevel() for str {
        rule _() = [' ' | '\n']*

        pub rule expression() -> Expression
            = expr()

        rule expr() -> Expression
            = if_expr()
            / apply_expr()

        rule if_expr() -> Expression
            = "if" _ cond:expr() _ "then" _ conseq:expr() _ "else" _ alt:expr() {
                Expression::If {
                    cond: Box::new(cond),
                    conseq: Box::new(conseq),
                    alt: Box::new(alt)
                }
            }

        rule apply_expr() -> Expression
            = e1:atom() _ e2:atom() { Apply(Box::new(e1), Box::new(e2)) }
            / atom()

        rule atom() -> Expression
            = number()
            / id:ident() { Ident(id) }

        rule number() -> Expression
            = n:$(['0'..='9']+) { Expression::Number(n.parse().unwrap()) }

        rule ident() -> String
            = id:$(['a'..='z' | 'A'..='Z']['a'..='z' | 'A'..='Z' | '0'..='9']*) { id.to_string() }
    }
}
fn main() {
assert_eq!(expression("1"), Ok(Number(1)));
assert_eq!(
expression("myFunc 10"),
Ok(Apply(
Box::new(Ident("myFunc".to_string())),
Box::new(Number(10))
))
);
// failed
assert_eq!(
expression("if f then 2 else 3"),
Ok(If {
cond: Box::new(Ident("f".to_string())),
conseq: Box::new(Number(2)),
alt: Box::new(Number(3))
})
);
}
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `Err(ParseError { location: LineCol { line: 1, column: 11, offset: 10 }, expected: ExpectedSet { expected: {"\"then\"", "\' \' | \'\\n\'"} } })`,
right: `Ok(If { cond: Ident("f"), conseq: Number(2), alt: Number(3) })`', src/main.rs:64:5
PEG uses ordered choice. This means that when you write R = A / B for some rule R, if A successfully parses at a given position, the parser will never try B there, even if the choice of A leads to problems later on. This is the core difference from context-free grammars, and it is often overlooked.
In particular, when you write apply = atom atom / atom, if it's possible to parse two atoms in a row, it will never try to parse just a single one, even if that means the rest of the input no longer makes sense.
Combine this with the fact that then and else are perfectly good identifiers in your grammar, and you get the issue that you see.
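The answer above stops at the diagnosis. One common remedy, offered here as an assumption rather than as the answerer's fix, is to stop identifiers from matching reserved words (in a PEG grammar this is typically expressed with a negative lookahead in the ident rule). The std-only sketch below imitates the ordered choice `ident _ ident / ident` by hand; `KEYWORDS`, `ident`, `apply` and the `reject_keywords` flag are hypothetical names used only for illustration.

```rust
const KEYWORDS: [&str; 3] = ["if", "then", "else"];

/// Identifier: a letter followed by letters or digits. When `reject_keywords`
/// is set, reserved words do not count as identifiers.
/// Returns `(name, remaining)`.
fn ident(input: &str, reject_keywords: bool) -> Option<(&str, &str)> {
    if !input.starts_with(|c: char| c.is_ascii_alphabetic()) {
        return None;
    }
    let end = input
        .find(|c: char| !c.is_ascii_alphanumeric())
        .unwrap_or(input.len());
    let (name, rest) = input.split_at(end);
    if reject_keywords && KEYWORDS.iter().any(|k| *k == name) {
        return None;
    }
    Some((name, rest))
}

/// Ordered choice `ident _ ident / ident`: the first alternative wins whenever
/// it matches, and the second is tried only if the first fails.
/// Returns the matched identifiers and the remaining input.
fn apply(input: &str, reject_keywords: bool) -> Option<(Vec<&str>, &str)> {
    let (first, rest) = ident(input, reject_keywords)?;
    if let Some((second, rest2)) = ident(rest.trim_start(), reject_keywords) {
        return Some((vec![first, second], rest2));
    }
    Some((vec![first], rest))
}
```

Without keyword rejection, `apply("f then 2", false)` swallows `then` as a second identifier, which is the failure described above. With rejection enabled, the first alternative fails on `then`, the single-identifier alternative is used, and `then` is left in the input for the if rule to match.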

How to get the output for several sequential nom parsers when the input is a &str?

This question is almost identical to Capture the entire contiguous matched input with nom, but I have to parse UTF-8 text as input (&str) not just bytes (&[u8]). I am trying to get the whole match for several parsers:
named!(parse <&str, &str>,
recognize!(
chain!(
is_not_s!(".") ~
tag_s!(".") ~
is_not_s!( "./ \r\n\t" ),
|| {}
)
)
);
And it causes this error:
no method named "offset" found for type "&str" in the current scope
Is the only way to do this to switch to &[u8] as input and then do map_res!?
There's an Offset trait implementation for &str that will be available in the next version of nom. There is no planned release date yet for nom 2.0, so in the meantime you can copy the implementation into your code:
use nom::Offset;

impl Offset for str {
    fn offset(&self, second: &Self) -> usize {
        let fst = self.as_ptr();
        let snd = second.as_ptr();
        snd as usize - fst as usize
    }
}
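To show what that offset is used for, here is a small std-only sketch of the idea behind recognizing a whole match: subtract the start pointers of the original input and of the remaining input, then slice the original by that many bytes. The free functions `offset` and `consumed` are hypothetical helpers, not part of nom, and the sketch assumes the second slice really is a sub-slice of the first.

```rust
/// Byte offset of `second` relative to the start of `first`.
/// Assumes `second` is a sub-slice of `first` (checked with a debug assertion).
fn offset(first: &str, second: &str) -> usize {
    let fst = first.as_ptr() as usize;
    let snd = second.as_ptr() as usize;
    debug_assert!(snd >= fst && snd <= fst + first.len());
    snd - fst
}

/// `recognize`-style helper: given the original input and what remains after
/// some parsers ran, return the consumed prefix as one slice.
fn consumed<'a>(input: &'a str, remaining: &str) -> &'a str {
    &input[..offset(input, remaining)]
}
```

Because the offset is measured in bytes, this works for multi-byte UTF-8 text as long as the remaining slice starts on a character boundary, which is always the case for a slice produced from a &str.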

What is the sequence combinator in Chomp?

I'm attempting to parse a subset of JSON that only contains a single, non-nested object with string-only values that may contain escape sequences. E.g.
{
"A KEY": "SOME VALUE",
"Another key": "Escape sequences \n \r \\ \/ \f \t \u263A"
}
Using the Chomp parser combinator in Rust. I have it parsing this structure ignoring escape sequences but am having trouble working out how to handle the escape sequences. Looking at other quoted string parsers that use combinators such as:
Arc JSON parser
PHP parser-combinator
Paka
They each use a sequence combinator. What is the equivalent in Chomp?
Chomp is based on Attoparsec and Parsec, so for parsing escaped strings I would use the scan parser to obtain the slice between the " characters while keeping any escaped " characters.
The sequence combinator is just the ParseResult::bind method, used to chain the match of the " character and the escaped string itself so that it will be able to parse "foo\\"bar" and not just foo\\"bar. You get this for free when you use the parse! macro: each ; is implicitly converted into a bind call to chain the parsers together.
The linked parsers use a many and or combinator and allocate a vector for the resulting characters. Paka does not seem to do any transformation on the resulting array, and PHP is using a regex with a callback to unescape the string.
This is code translated from Attoparsec's Aeson benchmark for parsing a JSON-string while not unescaping any escaped characters.
#[macro_use]
extern crate chomp;
use chomp::*;
use chomp::buffer::IntoStream;
use chomp::buffer::Stream;
pub fn json_string(i: Input<u8>) -> U8Result<&[u8]> {
    parse!{i;
        token(b'"');
        let escaped_str = scan(false, |s, c| if s { Some(false) }
                                             else if c == b'"' { None }
                                             else { Some(c == b'\\') });
        token(b'"');
        ret escaped_str
    }
}

#[test]
fn test_it() {
    let r = "\"foo\\\"bar\\tbaz\"".as_bytes().into_stream().parse(json_string);
    assert_eq!(r, Ok(&b"foo\\\"bar\\tbaz"[..]));
}
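The closure passed to scan is a two-state machine: the state records whether the previous byte was a backslash that has not yet been paired with the byte it escapes. As a rough illustration of that logic outside Chomp, here is a std-only sketch; `escaped_len` and this plain-slice `json_string` are hypothetical stand-ins that return the still-escaped body and the remaining input, not Chomp's types.

```rust
/// Length of the longest prefix accepted by the same state machine as the
/// `scan` closure: the state is "the previous byte was an unpaired backslash".
fn escaped_len(bytes: &[u8]) -> usize {
    let mut after_backslash = false;
    for (i, &c) in bytes.iter().enumerate() {
        if after_backslash {
            after_backslash = false; // escaped byte: accept it, whatever it is
        } else if c == b'"' {
            return i; // unescaped quote ends the string body
        } else {
            after_backslash = c == b'\\';
        }
    }
    bytes.len()
}

/// Parse `"..."`, returning the still-escaped body and the remaining input.
fn json_string(input: &[u8]) -> Option<(&[u8], &[u8])> {
    let (&open, body) = input.split_first()?;
    if open != b'"' {
        return None;
    }
    let len = escaped_len(body);
    // The byte at `len` must be the closing quote; `get` fails at end of input.
    if body.get(len) != Some(&b'"') {
        return None;
    }
    Some((&body[..len], &body[len + 1..]))
}
```

As in the Chomp version, nothing is copied or unescaped here; the returned body is a borrowed slice of the input.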
The parser above is not equivalent: it yields a slice of u8 borrowed from the source buffer/slice. If you want an owned Vec of the data, you should preferably use [T]::to_vec or String::from_utf8 on the result instead of building a parser using many and or, since that will not be as fast as scan and the result is the same.
If you want to parse UTF-8 and escape sequences, you can filter the resulting slice and then call String::from_utf8 on the Vec (Rust strings must be valid UTF-8; using a string containing invalid UTF-8 can result in undefined behaviour). If performance is an issue, you should most likely build that into the parser.
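A minimal sketch of that post-processing step, using only the standard library: walk the still-escaped slice, replace a handful of escape sequences, and let String::from_utf8 validate the result. The `unescape` helper is a hypothetical name, and `\u` escapes such as `\u263A` are deliberately not decoded here.

```rust
use std::string::FromUtf8Error;

/// Replace a few escape sequences in a raw, still-escaped byte slice and
/// validate the result as UTF-8. Only `\n`, `\r` and `\t` are translated;
/// any other escaped byte (such as `"`, `\` or `/`) is kept as-is after
/// dropping the backslash. `\uXXXX` sequences are NOT decoded.
fn unescape(raw: &[u8]) -> Result<String, FromUtf8Error> {
    let mut out = Vec::with_capacity(raw.len());
    let mut bytes = raw.iter().copied();
    while let Some(b) = bytes.next() {
        if b != b'\\' {
            out.push(b);
            continue;
        }
        match bytes.next() {
            Some(b'n') => out.push(b'\n'),
            Some(b'r') => out.push(b'\r'),
            Some(b't') => out.push(b'\t'),
            Some(other) => out.push(other),
            None => out.push(b'\\'), // trailing lone backslash is kept
        }
    }
    String::from_utf8(out)
}
```

Using the checked String::from_utf8 means invalid input surfaces as an error value instead of producing an invalid string.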
The documentation states (emphasis mine):
Using parsers is almost entirely done using the parse! macro, which enables us to do three distinct things:
Sequence parsers over the remaining input
Store intermediate results into datatypes
Return a datatype at the end, which may be the result of any arbitrary computation over the intermediate results.
It then provides this example of parsing a sequence of two numbers followed by a constant string:
fn f(i: Input<u8>) -> U8Result<(u8, u8, u8)> {
    parse!{i;
        let a = digit();
        let b = digit();
        string(b"missiles");
        ret (a, b, a + b)
    }
}

fn digit(i: Input<u8>) -> U8Result<u8> {
    satisfy(i, |c| b'0' <= c && c <= b'9').map(|c| c - b'0')
}
There is also ParseResult::bind and ParseResult::then which are documented to sequentially compose a result with a second action.
Because I'm always interested in parsing, I went ahead and played with it a bit to see how it would look. I'm not happy with the deep indenting that would happen with the nested or calls, but there's probably something better that can be done. This is just one possible solution:
#[macro_use]
extern crate chomp;
use chomp::*;
use chomp::ascii::is_alpha;
use chomp::buffer::{Source, Stream, ParseError};
use std::str;
use std::iter::FromIterator;
#[derive(Debug)]
pub enum StringPart<'a> {
String(&'a [u8]),
Newline,
Slash,
}
impl<'a> StringPart<'a> {
fn from_bytes(s: &[u8]) -> StringPart {
match s {
br#"\\"# => StringPart::Slash,
br#"\n"# => StringPart::Newline,
s => StringPart::String(s),
}
}
}
impl<'a> FromIterator<StringPart<'a>> for String {
fn from_iter<I>(iterator: I) -> Self
where I: IntoIterator<Item = StringPart<'a>>
{
let mut s = String::new();
for part in iterator {
match part {
StringPart::String(p) => s.push_str(str::from_utf8(p).unwrap()),
StringPart::Newline => s.push('\n'),
StringPart::Slash => s.push('\\'),
}
}
s
}
}
fn json_string_part(i: Input<u8>) -> U8Result<StringPart> {
or(i,
|i| parse!{i; take_while1(is_alpha)},
|i| or(i,
|i| parse!{i; string(br"\\")},
|i| parse!{i; string(br"\n")}),
).map(StringPart::from_bytes)
}
fn json_string(i: Input<u8>) -> U8Result<String> {
many1(i, json_string_part)
}
fn main() {
let input = br#"\\stuff\n"#;
let mut i = Source::new(input as &[u8]);
println!("Input has {} bytes", input.len());
loop {
match i.parse(json_string) {
Ok(x) => {
println!("Result has {} bytes", x.len());
println!("{:?}", x);
},
Err(ParseError::Retry) => {}, // Needed to refill buffer when necessary
Err(ParseError::EndOfInput) => break,
Err(e) => { panic!("{:?}", e); }
}
}
}

Using parser-combinators to parse string with escaped characters?

I'm trying to use the combine library in Rust to parse a string. The real data that I'm trying to parse looks something like this:
A79,216,0,4,2,2,N,"US\"PS"
So at the end of that data is a string in quotes, but the string will contain escaped characters as well. I can't figure out how to parse those escaped characters in between the other quotes.
extern crate parser_combinators;
use self::parser_combinators::*;
fn main() {
let s = r#""HE\"LLO""#;
let data = many(satisfy(|c| c != '"')); // Fails on escaped " obviously
let mut str_parser = between(satisfy(|c| c == '"'), satisfy(|c| c == '"'), data);
let result : Result<(String, &str), ParseError> = str_parser.parse(s);
match result {
Ok((value, _)) => println!("{:?}", value),
Err(err) => println!("{}", err),
}
}
//=> "HE\\"
The code above will parse that string successfully but will obviously fail on the escaped character in the middle, printing out "HE\\" in the end.
I want to change the code above so that it prints "HE\\\"LLO".
How do I do that?
I have a mostly functional JSON parser as a benchmark for parser-combinators which parses this sort of escaped character. I have included a link to it and a slightly simplified version of it below.
fn json_char(input: State<&str>) -> ParseResult<char, &str> {
let (c, input) = try!(satisfy(|c| c != '"').parse_state(input));
let mut back_slash_char = satisfy(|c| "\"\\nrt".chars().find(|x| *x == c).is_some()).map(|c| {
match c {
'"' => '"',
'\\' => '\\',
'n' => '\n',
'r' => '\r',
't' => '\t',
c => c//Should never happen
}
});
match c {
'\\' => input.combine(|input| back_slash_char.parse_state(input)),
_ => Ok((c, input))
}
}
json_char
Since this parser may consume 1 or 2 characters, the primitive combinators are not enough, so we need to introduce a function that can branch on the character that was parsed.
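The branching idea can be sketched without any parser library. The std-only functions below are hand-written stand-ins, not the parser-combinators API: `json_char` reads one logical character, consuming two input characters when it sees a backslash, and the hypothetical `quoted` helper repeats it between two quotes.

```rust
/// Read one logical character from the body of a quoted string.
/// Consumes two input characters for an escape sequence, one otherwise.
/// Returns `None` at an unescaped `"`, at end of input, or on an unknown escape.
fn json_char(input: &str) -> Option<(char, &str)> {
    let mut chars = input.chars();
    match chars.next()? {
        '"' => None,
        '\\' => {
            let unescaped = match chars.next()? {
                '"' => '"',
                '\\' => '\\',
                'n' => '\n',
                'r' => '\r',
                't' => '\t',
                _ => return None,
            };
            Some((unescaped, chars.as_str()))
        }
        c => Some((c, chars.as_str())),
    }
}

/// Parse a whole quoted string by repeating `json_char` between two quotes.
/// Returns the unescaped contents and the remaining input.
fn quoted(input: &str) -> Option<(String, &str)> {
    let mut rest = input.strip_prefix('"')?;
    let mut out = String::new();
    while let Some((c, next)) = json_char(rest) {
        out.push(c);
        rest = next;
    }
    let rest = rest.strip_prefix('"')?;
    Some((out, rest))
}
```

Applied to the input from the question, `quoted(r#""HE\"LLO""#)` yields the contents `HE"LLO` with the escape resolved; keeping the backslash in the output instead would only require pushing both characters in the escape branch.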
I ran into the same problem and ended up with the following solution:
(
char('"'),
many1::<Vec<char>, _>(choice((
escaped_character(),
satisfy(|c| c != '"'),
))),
char('"')
)
Or in other words, a string is delimited by " followed by many escaped_characters or anything that isn't a closing ", and is closed by a closing ".
Here's a full example of how I'm using this:
pub enum Operand {
String { value: String },
}
fn escaped_character<I>() -> impl Parser<Input = I, Output = char>
where
I: Stream<Item = char>,
I::Error: ParseError<I::Item, I::Range, I::Position>,
{
(
char('\\'),
any(),
).and_then(|(_, x)| match x {
'0' => Ok('\0'),
'n' => Ok('\n'),
'\\' => Ok('\\'),
'"' => Ok('"'),
_ => Err(StreamErrorFor::<I>::unexpected_message(format!("Invalid escape sequence \\{}", x)))
})
}
#[test]
fn parse_escaped_character() {
let expected = Ok(('\n', " foo"));
assert_eq!(expected, escaped_character().easy_parse("\\n foo"))
}
fn string_operand<I>() -> impl Parser<Input = I, Output = Operand>
where
I: Stream<Item = char>,
I::Error: ParseError<I::Item, I::Range, I::Position>,
{
(
char('"'),
many1::<Vec<char>, _>(choice((
escaped_character(),
satisfy(|c| c != '"'),
))),
char('"')
)
.map(|(_,value,_)| Operand::String { value: value.into_iter().collect() })
}
#[test]
fn parse_string_operand() {
let expected = Ok((Operand::String { value: "foo \" bar \n baz \0".into() }, ""));
assert_eq!(expected, string_operand().easy_parse(r#""foo \" bar \n baz \0""#))
}
