PEG parser - support for escape characters

I'm working on a PEG parser. Among other structures, it needs to parse a tag directive. A tag can contain any character. If you want the tag to include a curly brace } you can escape it with a backslash. If you need a literal backslash, that should also be escaped. I tried to implement this inspired by the PEG grammar for JSON: https://github.com/pegjs/pegjs/blob/master/examples/json.pegjs
There are two problems:
1. An escaped backslash results in two backslash characters instead of one. Example input:
   { some characters but escape with a \\ }
2. The parser breaks on an escaped curly brace \}. Example input:
   { some characters but escape \} with a \\ }
The relevant grammar is:
Tag
  = "{" _ tagContent:$(TagChar+) _ "}" {
      return { type: "tag", content: tagContent }
    }

TagChar
  = [^\}\r\n]
  / Escape
    sequence:(
        "\\" { return {type: "char", char: "\\"}; }
      / "}"  { return {type: "char", char: "\x7d"}; }
    )
    { return sequence; }

_ "whitespace"
  = [ \t\n\r]*

Escape
  = "\\"
You can easily test the grammar and input with the online PegJS sandbox: https://pegjs.org/online
I hope somebody has an idea to resolve this.

These errors are both basically typos.
The first problem is the character class in your regular expression for tag characters. In a character class, \ continues to be an escape character, so [^\}\r\n] matches any character other than } (written with an unnecessary backslash escape), carriage return or newline. \ is such a character, so it's matched by the character class, and Escape is never tried.
Since your pattern for tag characters doesn't succeed in recognising \ as an Escape, the tag { \\ } is parsed as four characters (space, backslash, backslash, space) and the tag { \} } is parsed as terminating on the first }, creating a syntax error.
So you should fix the character class to [^}\\\r\n]. (I put the closing brace first to make the falling timber of backslashes easier to read; the order is irrelevant.)
Once you do that, you'll find that the parser still returns the string with the backslashes intact. That's because of the $ in your Tag pattern: "{" _ tagContent:$(TagChar+) _ "}". According to the documentation, the meaning of the $ operator is: (emphasis added)
$ expression
Try to match the expression. If the match succeeds, return the matched text instead of the match result.

Since you need the individual escape results rather than the raw matched text, drop the $ and assemble the string from the per-character results in the action. For reference, the corrected grammar is as follows:
Tag
  = "{" _ tagContent:TagChar+ _ "}" {
      return { type: "tag", content: tagContent.map(c => c.char || c).join('') }
    }

TagChar
  = [^}\\\r\n]
  / Escape
    sequence:(
        "\\" { return {type: "char", char: "\\"}; }
      / "}"  { return {type: "char", char: "\x7d"}; }
    )
    { return sequence; }

_ "whitespace"
  = [ \t\n\r]*

Escape
  = "\\"
When using the following input:
{ some characters but escape \} with a \\ }
it will return:
{
  "type": "tag",
  "content": "some characters but escape } with a \\ "
}
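If you'd rather test outside the online sandbox, here's a minimal Node.js sketch using the pegjs package (the file name tag.pegjs is just an assumption for where you saved the corrected grammar):

// Compile the corrected grammar and parse a sample tag.
// Assumes `npm install pegjs` and the grammar saved as tag.pegjs.
const fs = require("fs");
const pegjs = require("pegjs");

const parser = pegjs.generate(fs.readFileSync("tag.pegjs", "utf8"));

// In this JavaScript literal, \\} is an escaped brace and \\\\ an escaped
// backslash, so the parser sees: { some characters but escape \} with a \\ }
const result = parser.parse("{ some characters but escape \\} with a \\\\ }");
console.log(result.content); // some characters but escape } with a \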

Related

How to not progress in Rust pest parser

I am trying to build a basic Latex parser using pest library. For the moment, I only care about lines, bold format and plain text. I am struggling with the latter. To simplify the problem, I assume that it cannot contain these two chars: \, }.
lines = { line ~ (NEWLINE ~ line)* }
line = { token* }
token = { text_bold | text_plain }
text_bold = { "\\textbf{" ~ text_plain ~ "}" }
text_plain = ${ inner ~ ("\\" | "}" | NEWLINE) }
inner = @{ char* }
char = {
    !("\\" | "}" | NEWLINE) ~ ANY
}
main = {
    SOI ~
    lines ~
    EOI
}
Using this webapp, we can see that my grammar eats the char after the plain text.
Input:
Before \textbf{middle} after.
New line
Output:
- lines > line
- token > text_plain > inner: "Before "
- token > text_plain > inner: "textbf{middle"
- token > text_plain > inner: " after."
- token > text_plain > inner: "New line"
If I replace ${ inner ~ ("\\" | "}" | NEWLINE) } with ${ inner }, it fails. If I add a & in front of the suffix, it does not work either.
How can I change my grammar so that lines and bold tags are detected?
The rule
text_plain = ${ inner ~ ("\\" | "}" | NEWLINE) }
certainly matches the character following inner (which must be a backslash, close brace, or newline). That's not what you want: you want the following character to be part of the next token. But it certainly seems reasonable to ask what happened to that character, since the token corresponding to text_plain clearly doesn't show it.
The answer, apparently, is a subtlety in how tokens are formed. According to the Pest book:
When the rule starts being parsed, the starting part of the token is being produced, with the ending part being produced when the rule finishes parsing.
The key here, it turns out, is what is not being said. ("\\" | "}" | NEWLINE) is not a rule, and therefore it does not trigger any token pairs. So when you iterate over the tokens inside text_plain, you only see the token generated by inner.
None of that is really relevant, since text_plain should not attempt to match the following character in any event. I suppose you realised that, because you say you tried to change the rule to text_plain = { inner }, but that "failed". It would have been useful to know what "failure" meant here, but I suppose that it was because Pest complained about the attempt to use a repetition operator on a rule which can match the empty string.
Since inner is a *-repetition, it can match the empty string; defining text_plain as a copy of inner means that text_plain can also match the empty string; that means that token ({ text_bold | text_plain }) can match the empty string, and that makes token* illegal because Pest doesn't allow applying repetition operators to a nullable rule. The simplest solution is to change inner from char* to char+, which forces it to match at least one character.
In the following, I actually got rid of inner altogether, since it seems redundant:
main = { SOI ~ lines ~ EOI }
lines = { line ~ (NEWLINE ~ line)* ~ NEWLINE? }
line = { token* }
token = { text_bold | text_plain }
text_bold = { "\\textbf{" ~ text_plain ~ "}" }
text_plain = @{ char+ }
char = {
    !("\\" | "}" | NEWLINE) ~ ANY
}
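For completeness, here's a minimal sketch of driving that grammar from Rust with pest 2.x (assuming you saved it as src/grammar.pest; the parser struct name is arbitrary):

// Derive a parser from the grammar file and dump the parse tree.
// Assumes the pest and pest_derive crates, version 2.x.
use pest::Parser;
use pest_derive::Parser;

#[derive(Parser)]
#[grammar = "grammar.pest"] // path relative to src/
struct LatexParser;

fn main() {
    let input = "Before \\textbf{middle} after.\nNew line";
    match LatexParser::parse(Rule::main, input) {
        Ok(pairs) => println!("{:#?}", pairs),
        Err(e) => eprintln!("parse error: {}", e),
    }
}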

Flex find substring until character

This is my lexer.l file:
%{
#include "../h/Tokens.h"
%}
%option yylineno
%%
[+-]?([1-9]*\.[0-9]+)([eE][+-]?[0-9])? return FLOAT;
[+-]?[1-9]([eE][+-]?[0-9])? return INTEGER;
\"(\\\\|\\\"|[^\"])*\" return STRING;
(true|false) return BOOLEAN;
(func|val|if|else|while|for)* return KEYWORD;
[A-Za-z_][A-Za-z_0-9]* return IDENTIFIER;
"+" return PLUS;
"-" return MINUS;
"*" return MULTI;
"." return DOT;
"," return COMMA;
":" return COLON;
";" return SEMICOLON;
. printf("Unexpected or invalid token: '%s'\n", yytext);
%%
int yywrap(void)
{
    return 1;
}
Now, if my lexer finds an unexpected token, it sends an error for every character. I want it to send an error message for every substring until a whitespace or operator.
Example:
Input:
foo bar baz
~±`≥ hello
Output:
Identifier.
Identifier.
Identifier.
Unexpected or invalid token: '~±`≥'
Identifier.
Is there a way to do this with a regex pattern?
Thanks.
Certainly it is possible to do with a regex. But you can't do it with a regex independent of your other token rules. And it may not be trivial to find a correct regex.
In this fairly simple example, though, it's not hard, although there is a corner case. Since there are no multicharacter operators, a character cannot start a token unless it is alphabetic, numeric, one of the operators (-+*.,:;) or a double quote; any sequence of characters outside that set is therefore an invalid sequence. Also, I think you really want to ignore whitespace characters (based on the example output), even though your question doesn't show any rule which matches whitespace. So on the assumption that you just left out the whitespace rule, which would be something like
[[:space:]]+ { /* Ignore whitespace */ }
your regex to match a sequence of illegal characters would be
[^-+*.,:;[:alnum:][:space:]]+  { fprintf(stderr, "Invalid sequence %s\n", yytext); }
The corner-case is an unterminated string literal; that is, a token which starts with a " but does not include the matching closing quote. Such a token must necessarily extend to the end of the input, and it can easily be matched by using your string pattern, leaving out the final ". (That works because (f)lex always uses the longest matching pattern, so if there is a terminating " the correct string literal will be matched.)
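Concretely, that means pairing the string rule with an almost identical error rule, as in the full listing below:

\"(\\\\|\\\"|[^\\"])*\"   return STRING;
\"(\\\\|\\\"|[^\\"])*     { fprintf(stderr, "Unterminated string literal\n"); }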
There are a number of errors in your patterns:
It's almost always a bad idea to match +- at the start of a numeric literal. If you do that, then x+2 will not be correctly analysed; your lexer will return two tokens, an IDENTIFIER and an INTEGER, instead of the correct three tokens (IDENTIFIER, PLUS, INTEGER).
Your FLOAT pattern won't accept numbers which contain a 0 before the decimal point, so 0.5 and 10.3 will both fail. Also, you force the exponent to be a single digit, so 1.3E11 won't be matched either. And you force the user to put a digit after the decimal point; most languages accept 3. as equivalent to 3.0. (That last one is not necessarily an error, but it's unconventional.)
Your INTEGER pattern won't accept numbers containing a 0, such as 10. But it will accept scientific notation, which is a little odd; in most languages 3E10 is a floating point constant, not an integer.
Your KEYWORD pattern accepts keywords which are made up of a concatenated series of words, such as forwhilefuncif. You probably didn't intend to put a * at the end of the pattern.
Your string literal pattern allows any sequence of characters other than ", which means a backslash \ will be allowed to match as a single character, even if it is followed by a quote or a backslash. That will result in some string literals not being correctly terminated. For example, given the string literal
"\\"
(which is a string literal containing a single backslash), the regex will match the initial ", then the \ as a single character, and then the \" sequence, and then whatever follows the string literal until it encounters another quote.
This error is easy to make because flex treats \ as an escape character even inside bracket expressions, unlike POSIX regular expressions, where \ loses its special significance inside brackets. So to exclude the backslash, the bracket expression must be written [^\\"].
So that would leave you with something like this:
%{
#include "../h/Tokens.h"
%}
%option yylineno noyywrap
%%
[[:space:]]+                  { /* Ignore whitespace */ }
(\.[0-9]+|[0-9]+\.[0-9]*)([eE][+-]?[0-9]+)?  return FLOAT;
0|[1-9][0-9]*                 return INTEGER;
true|false                    return BOOLEAN;
func|val|if|else|while|for    return KEYWORD;
[A-Za-z_][A-Za-z_0-9]*        return IDENTIFIER;
"+"                           return PLUS;
"-"                           return MINUS;
"*"                           return MULTI;
"."                           return DOT;
","                           return COMMA;
":"                           return COLON;
";"                           return SEMICOLON;
\"(\\\\|\\\"|[^\\"])*\"       return STRING;
\"(\\\\|\\\"|[^\\"])*         { fprintf(stderr, "Unterminated string literal\n"); }
[^-+*.,:;[:alnum:][:space:]]+ { fprintf(stderr, "Invalid sequence %s\n", yytext); }
(If any of those patterns look mysterious, you might want to review the description of flex patterns in the flex manual.)
But I have a feeling that you were looking for something different: a way of magically adapting to any change in the token patterns without excess analysis.
That's possible, too, but I don't know how to do it without code repetition. The basic idea is simple enough: when we encounter an unmatchable character, we just append it to the end of an error token, and when we find a valid token, we emit the error message and clear the error token.
The problem is the "when we find a valid token" part, because that means that we need to insert an action at the beginning of every rule other than the error rule. The easiest way to do that is to use a macro, which at least avoids writing out the code for every action.
(F)lex does provide us with some useful tools we can build this on. We'll use one of (f)lex's special actions, yymore(), which causes the current match to be appended to the token being built, which is useful to build up the error token.
In order to know the length of the error token (and therefore to know if there is one), we need an additional variable. Fortunately, (f)lex allows us to define our own local variables inside the scanner. Then we define the macro E_ (whose name was chosen to be short, in order to avoid cluttering the rule actions), which prints the error message, moves yytext over the error token, and resets the error count.
Putting that together:
%{
#include "../h/Tokens.h"
%}
%option yylineno noyywrap
%%
    int nerrors = 0;   /* To keep track of the length of the error token */

    /* This macro must be inserted at the beginning of every rule,
     * except the fallback error rule.
     */
    #define E_ \
        if (nerrors > 0) { \
            fprintf(stderr, "Invalid sequence %.*s\n", nerrors, yytext); \
            yytext += nerrors; yyleng -= nerrors; nerrors = 0; \
        } else /* Absorb the following semicolon */

[[:space:]]+                  { E_; /* Ignore whitespace */ }
(\.[0-9]+|[0-9]+\.[0-9]*)([eE][+-]?[0-9]+)?  { E_; return FLOAT; }
0|[1-9][0-9]*                 { E_; return INTEGER; }
true|false                    { E_; return BOOLEAN; }
func|val|if|else|while|for    { E_; return KEYWORD; }
[A-Za-z_][A-Za-z_0-9]*        { E_; return IDENTIFIER; }
"+"                           { E_; return PLUS; }
"-"                           { E_; return MINUS; }
"*"                           { E_; return MULTI; }
"."                           { E_; return DOT; }
","                           { E_; return COMMA; }
":"                           { E_; return COLON; }
";"                           { E_; return SEMICOLON; }
\"(\\\\|\\\"|[^\\"])*\"       { E_; return STRING; }
\"(\\\\|\\\"|[^\\"])*         { E_; fprintf(stderr, "Unterminated string literal\n"); }
.                             { yymore(); ++nerrors; }
That all assumes that we're happy to just produce an error message inside the scanner, and otherwise ignore the erroneous characters. But it may be better to actually return an error indication and let the caller decide how to handle the error. That introduces an extra wrinkle because it requires us to return two tokens in a single action.
For a simple solution, we use another (f)lex feature, yyless(), which allows us to rescan part or all of the current token. We can use that to remove the error token from the current token, instead of adjusting yytext and yyleng. (yyless will do that adjustment for us.) That means that after an error, the next correct token is scanned twice. That may seem inefficient, but it's probably acceptable because:
Most tokens are short,
There's not really much point in optimising for errors. It's much more useful to optimise processing of correct inputs.
To accomplish that, we just need a small change to the E_ macro:
#define E_ \
    if (nerrors > 0) { \
        yyless(nerrors); \
        fprintf(stderr, "Invalid sequence %s\n", yytext); \
        nerrors = 0; \
        return BAD_INPUT; \
    } else /* Absorb the following semicolon */
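For illustration, a driver loop that consumes tokens, including the new error token, might look like this (a sketch which assumes Tokens.h defines the token constants, including BAD_INPUT, as nonzero integers, and that yylex() returns 0 at end of input):

/* Tokenize stdin, reporting recovery points after bad input. */
#include <stdio.h>
#include "../h/Tokens.h"

extern int yylex(void);
extern char *yytext;

int main(void) {
    int token;
    while ((token = yylex()) != 0) {
        if (token == BAD_INPUT)
            printf("recovering after invalid input\n");
        else
            printf("token %d: '%s'\n", token, yytext);
    }
    return 0;
}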

Parser expression grammar - how to match any string excluding a single character?

I'd like to write a PEG that matches filesystem paths. A path element is any character except / in POSIX Linux.
There is an expression in PEG to match any character, but I cannot figure out how to match any character except one.
The PEG parser I'm using is pest for Rust.
You can find the pest syntax at https://docs.rs/pest/0.4.1/pest/macro.grammar.html#syntax; in particular, there is a "negative lookahead":
!a — matches if a doesn't match without making progress
So you could write
!["/"] ~ any
Example:
// cargo-deps: pest
#[macro_use] extern crate pest;
use pest::*;

fn main() {
    impl_rdp! {
        grammar! {
            path = @{ soi ~ (["/"] ~ component)+ ~ eoi }
            component = @{ (!["/"] ~ any)+ }
        }
    }

    println!("should be true: {}", Rdp::new(StringInput::new("/bcc/cc/v")).path());
    println!("should be false: {}", Rdp::new(StringInput::new("/bcc/cc//v")).path());
}
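The example above uses the old pest 0.4 macro API. In current pest 2.x grammar syntax, the same idea would read roughly like this (a sketch, untested):

path      =  { SOI ~ ("/" ~ component)+ ~ EOI }
component = @{ (!"/" ~ ANY)+ }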

What is the sequence combinator in Chomp?

I'm attempting to parse a subset of JSON that only contains a single, non-nested object with string-only values that may contain escape sequences. E.g.
{
    "A KEY": "SOME VALUE",
    "Another key": "Escape sequences \n \r \\ \/ \f \t \u263A"
}
I'm using the Chomp parser combinator library in Rust. I have it parsing this structure while ignoring escape sequences, but I'm having trouble working out how to handle them. Looking at other quoted-string parsers that use combinators, such as:
Arc JSON parser
PHP parser-combinator
Paka
They each use a sequence combinator; what is the equivalent in Chomp?
Chomp is based on Attoparsec and Parsec, so for parsing escaped strings I would use the scan parser to obtain the slice between the " characters while keeping any escaped " characters.
The sequence combinator is just the ParseResult::bind method, used to chain the match of the " character and the escaped string itself so that it will be able to parse "foo\\"bar" and not just foo\\"bar. You get this for free when you use the parse! macro, each ; is implicitly converted into a bind call to chain the parsers together.
The linked parsers use a many and or combinator and allocate a vector for the resulting characters. Paka does not seem to do any transformation on the resulting array, and PHP is using a regex with a callback to unescape the string.
This is code translated from Attoparsec's Aeson benchmark for parsing a JSON-string while not unescaping any escaped characters.
#[macro_use]
extern crate chomp;

use chomp::*;
use chomp::buffer::IntoStream;
use chomp::buffer::Stream;

pub fn json_string(i: Input<u8>) -> U8Result<&[u8]> {
    parse!{i;
        token(b'"');
        let escaped_str = scan(false, |s, c| if s { Some(false) }
                                             else if c == b'"' { None }
                                             else { Some(c == b'\\') });
        token(b'"');
        ret escaped_str
    }
}

#[test]
fn test_it() {
    let r = "\"foo\\\"bar\\tbaz\"".as_bytes().into_stream().parse(json_string);
    assert_eq!(r, Ok(&b"foo\\\"bar\\tbaz"[..]));
}
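To make the earlier point about bind concrete, this is roughly what the parse! block desugars to; each ; becomes a bind call that feeds the remaining input to the next parser (a hand-written sketch, not the macro's literal expansion):

// Sequencing written with explicit bind calls instead of parse!.
pub fn json_string_bind(i: Input<u8>) -> U8Result<&[u8]> {
    token(i, b'"')
        .bind(|i, _| scan(i, false, |s, c| if s { Some(false) }
                                           else if c == b'"' { None }
                                           else { Some(c == b'\\') }))
        .bind(|i, escaped_str| token(i, b'"').map(|_| escaped_str))
}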
The parser above is not equivalent: it yields a slice of u8 borrowed from the source buffer/slice. If you want an owned Vec of the data, you should preferably use [T]::to_vec or String::from_utf8 instead of building a parser using many and or, since it will not be as fast as scan and the result is the same.
If you want to parse UTF-8 and escape sequences, you can filter the resulting slice and then call String::from_utf8 on the Vec (Rust strings are UTF-8; using a string containing invalid UTF-8 can result in undefined behaviour). If performance is an issue, you should most likely build that into the parser.
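A minimal sketch of that post-processing step, handling only a few of the escapes (a real JSON unescaper would also need to decode \uXXXX sequences):

// Resolve simple escape sequences in the raw slice, then validate UTF-8.
fn unescape(raw: &[u8]) -> Result<String, std::string::FromUtf8Error> {
    let mut out = Vec::with_capacity(raw.len());
    let mut bytes = raw.iter().copied();
    while let Some(b) = bytes.next() {
        if b == b'\\' {
            match bytes.next() {
                Some(b'n') => out.push(b'\n'),
                Some(b'r') => out.push(b'\r'),
                Some(b't') => out.push(b'\t'),
                Some(other) => out.push(other), // \\ \" \/ map to themselves
                None => out.push(b'\\'),        // lone trailing backslash
            }
        } else {
            out.push(b);
        }
    }
    String::from_utf8(out)
}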
The documentation states (emphasis mine):
Using parsers is almost entirely done using the parse! macro, which enables us to do three distinct things:
Sequence parsers over the remaining input
Store intermediate results into datatypes
Return a datatype at the end, which may be the result of any arbitrary computation over the intermediate results.
It then provides this example of parsing a sequence of two numbers followed by a constant string:
fn f(i: Input<u8>) -> U8Result<(u8, u8, u8)> {
    parse!{i;
        let a = digit();
        let b = digit();
        string(b"missiles");
        ret (a, b, a + b)
    }
}

fn digit(i: Input<u8>) -> U8Result<u8> {
    satisfy(i, |c| b'0' <= c && c <= b'9').map(|c| c - b'0')
}
There is also ParseResult::bind and ParseResult::then which are documented to sequentially compose a result with a second action.
Because I'm always interested in parsing, I went ahead and played with it a bit to see how it would look. I'm not happy with the deep indenting that would happen with the nested or calls, but there's probably something better that can be done. This is just one possible solution:
#[macro_use]
extern crate chomp;

use chomp::*;
use chomp::ascii::is_alpha;
use chomp::buffer::{Source, Stream, ParseError};

use std::str;
use std::iter::FromIterator;

#[derive(Debug)]
pub enum StringPart<'a> {
    String(&'a [u8]),
    Newline,
    Slash,
}

impl<'a> StringPart<'a> {
    fn from_bytes(s: &[u8]) -> StringPart {
        match s {
            br#"\\"# => StringPart::Slash,
            br#"\n"# => StringPart::Newline,
            s => StringPart::String(s),
        }
    }
}

impl<'a> FromIterator<StringPart<'a>> for String {
    fn from_iter<I>(iterator: I) -> Self
        where I: IntoIterator<Item = StringPart<'a>>
    {
        let mut s = String::new();
        for part in iterator {
            match part {
                StringPart::String(p) => s.push_str(str::from_utf8(p).unwrap()),
                StringPart::Newline => s.push('\n'),
                StringPart::Slash => s.push('\\'),
            }
        }
        s
    }
}

fn json_string_part(i: Input<u8>) -> U8Result<StringPart> {
    or(i,
       |i| parse!{i; take_while1(is_alpha)},
       |i| or(i,
              |i| parse!{i; string(br"\\")},
              |i| parse!{i; string(br"\n")}),
    ).map(StringPart::from_bytes)
}

fn json_string(i: Input<u8>) -> U8Result<String> {
    many1(i, json_string_part)
}

fn main() {
    let input = br#"\\stuff\n"#;
    let mut i = Source::new(input as &[u8]);

    println!("Input has {} bytes", input.len());

    loop {
        match i.parse(json_string) {
            Ok(x) => {
                println!("Result has {} bytes", x.len());
                println!("{:?}", x);
            },
            Err(ParseError::Retry) => {}, // Needed to refill buffer when necessary
            Err(ParseError::EndOfInput) => break,
            Err(e) => { panic!("{:?}", e); }
        }
    }
}

BNF to handle escape sequence

I use this BNF to parse my script:
{identset} = {ASCII} - {"\{\}};   // <-- all ASCII characters except '"', '{' and '}'
{strset} = {ASCII} - {"};
ident = {identset}*;
str = {strset}*;

node ::= ident "{" nodes "}" |    // <-- entry point
         "\"" str "\"" |
         ident;
nodes ::= node nodes |
          node;
It correctly parses the following text into a tree structure:
doc {
    title { "some title goes here" }
    refcode { "SDS-1" }
    rev { "1.0" }
    revdate { "04062010" }
    body {
        "this is the body of the document
         all text should go here"
        chapter { "some inline section" }
        "text again"
    }
}
My question is: how do I handle escape sequences inside a string literal such as:
"some text of \"quotation\" should escape"
Define str as an alternation of ordinary characters and escape sequences:
str = ( strset | strescape )*;
with
strescape = {\\} ( {\\} | {"} );
and remove the backslash from {strset} so that it can only appear as the start of an escape sequence:
{strset} = {ASCII} - {"} - {\\};
With these rules, each \" or \\ inside the literal is consumed by strescape, so only an unescaped " terminates the string.
