Parsing expression grammar - how to match any string excluding a single character?

I'd like to write a PEG that matches filesystem paths. A path element is any character except / on POSIX Linux.
There is an expression in PEG to match any character, but I cannot figure out how to match any character except one.
The PEG parser I'm using is pest for Rust.

You can find the pest syntax at https://docs.rs/pest/0.4.1/pest/macro.grammar.html#syntax; in particular, there is a "negative lookahead":
!a — matches if a doesn't match, without making progress
So you could write:
!["/"] ~ any
Example:
// cargo-deps: pest
#[macro_use] extern crate pest;
use pest::*;
fn main() {
    impl_rdp! {
        grammar! {
            path = #{ soi ~ (["/"] ~ component)+ ~ eoi }
            component = #{ (!["/"] ~ any)+ }
        }
    }
    println!("should be true: {}", Rdp::new(StringInput::new("/bcc/cc/v")).path());
    println!("should be false: {}", Rdp::new(StringInput::new("/bcc/cc//v")).path());
}

Related

Using `ANY` is not working when picking up patterns using the Rust PEST parser

I am using the PEST parser and I am testing a simple example to get familiar with the syntax. I am trying to get every instance of ++ throughout the string but I am running into some issues. I think it may be an issue with the ANY keyword but I am not sure. Can anyone help point me in the right direction as to what is going wrong?
Here is my grammar.pest file
incrementing = {(prefix ~ ANY+ ~ "++" ~ suffix)}
prefix = {(NEWLINE | WHITESPACE)*}
suffix = {(NEWLINE | WHITESPACE)*}
WHITESPACE = _{ " " }
Here is my test case
//parses a file for a matching rule and returns all instances of the rule
fn parse_file_contents_for_rule(rule: Rule, file_contents: &str) -> Option<Pairs<Rule>> {
    SolgaParser::parse(rule, file_contents).ok()
}

fn parse_incrementing(file_contents: &str) {
    //parse the file for the rule
    let targets = parse_file_contents_for_rule(Rule::incrementing, file_contents);
    //if there are matches
    if targets.is_some() {
        //iterate through all of the matches
        for target in targets.unwrap().into_iter() {
            println!("{}", target.as_str());
        }
    }
}
#[test]
fn test_parse_incrementing() {
    let file_contents = r#"
index++;
a_thing++;
another_thing++;
should_not_match;
should_match++;
"#;
    parse_incrementing(file_contents);
}
In your example, ANY+ is probably matching all the way to the end of the input, so the ++ pattern never gets a chance to match, and therefore the whole incrementing rule never matches.
Try changing it to (!"+" ~ ANY)+ so the repetition stops at the first +.
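Applied to the grammar above, only the incrementing rule needs to change; something like this should work (a sketch, keeping the other rules as they are):
incrementing = { prefix ~ (!"+" ~ ANY)+ ~ "++" ~ suffix }
With that change the repetition can no longer swallow the ++, so each match ends at the first + it encounters.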

Flex find substring until character

This is my lexer.l file:
%{
#include "../h/Tokens.h"
%}
%option yylineno
%%
[+-]?([1-9]*\.[0-9]+)([eE][+-]?[0-9])? return FLOAT;
[+-]?[1-9]([eE][+-]?[0-9])? return INTEGER;
\"(\\\\|\\\"|[^\"])*\" return STRING;
(true|false) return BOOLEAN;
(func|val|if|else|while|for)* return KEYWORD;
[A-Za-z_][A-Za-z_0-9]* return IDENTIFIER;
"+" return PLUS;
"-" return MINUS;
"*" return MULTI;
"." return DOT;
"," return COMMA;
":" return COLON;
";" return SEMICOLON;
. printf("Unexpected or invalid token: '%s'\n", yytext);
%%
int yywrap(void)
{
    return 1;
}
Now, if my lexer finds an unexpected token, it sends an error for every character. I want it to send an error message for every substring until a whitespace or operator.
Example:
Input:
foo bar baz
~±`≥ hello
Output:
Identifier.
Identifier.
Identifier.
Unexpected or invalid token: '~±`≥'
Identifier.
Is there a way to do this with a regex pattern?
Thanks.
Certainly it is possible to do with a regex. But you can't do it with a regex independent of your other token rules. And it may not be trivial to find a correct regex.
In this fairly simple example, though, it's not hard, although there is a corner case. Since there are no multicharacter operators, a character cannot start a token unless it is alphabetic, numeric, one of the operators (-+*.,:;) or a double quote. Therefore any sequence of characters outside that set is an invalid sequence. Also, I think that you really want to ignore whitespace characters (based on the example output), even though your question doesn't show any rule which matches whitespace. So, on the assumption that you just left out the whitespace rule, which would be something like
[[:space:]]+ { /* Ignore whitespace */ }
your regex to match a sequence of illegal characters would be
[^-+*.,:;[:alnum:][:space:]]+ { fprintf(stderr, "Invalid sequence %s\n", yytext); }
The corner-case is an unterminated string literal; that is, a token which starts with a " but does not include the matching closing quote. Such a token must necessarily extend to the end of the input, and it can easily be matched by using your string pattern, leaving out the final ". (That works because (f)lex always uses the longest matching pattern, so if there is a terminating " the correct string literal will be matched.)
There are a number of errors in your patterns:
It's almost always a bad idea to match +- at the start of a numeric literal. If you do that, then x+2 will not be correctly analysed; your lexer will return two tokens, an IDENTIFIER and an INTEGER, instead of the correct three tokens (IDENTIFIER, PLUS, INTEGER).
Your FLOAT pattern won't accept numbers which contain a 0 before the decimal point, so 0.5 and 10.3 will both fail. Also, you force the exponent to be a single digit, so 1.3E11 won't be matched either. And you force the user to put a digit after the decimal point; most languages accept 3. as equivalent to 3.0. (That last one is not necessarily an error, but it's unconventional.)
Your INTEGER pattern won't accept numbers containing a 0, such as 10. But it will accept scientific notation, which is a little odd; in most languages 3E10 is a floating point constant, not an integer.
Your KEYWORD pattern accepts keywords which are made up of a concatenated series of words, such as forwhilefuncif. You probably didn't intend to put a * at the end of the pattern.
Your string literal pattern allows any sequence of characters other than ", which means a backslash \ will be allowed to match as a single character, even if it is followed by a quote or a backslash. That will result in some string literals not being correctly terminated. For example, given the string literal
"\\"
(which is a string literal containing a single backslash), the regex will match the initial ", then the \ as a single character, and then the \" sequence, and then whatever follows the string literal until it encounters another quote.
That error is the result of flex requiring \ to be escaped inside bracket expressions, unlike POSIX regular expressions, where \ loses its special significance inside brackets.
So that would leave you with something like this:
%{
#include "../h/Tokens.h"
%}
%option yylineno noyywrap
%%
[[:space:]]+ /* Ignore whitespace */
(\.[0-9]+|[0-9]+\.[0-9]*)([eE][+-]?[0-9]+)? {
                              return FLOAT;
                          }
0|[1-9][0-9]* return INTEGER;
true|false return BOOLEAN;
func|val|if|else|while|for return KEYWORD;
[A-Za-z_][A-Za-z_0-9]* return IDENTIFIER;
"+" return PLUS;
"-" return MINUS;
"*" return MULTI;
"." return DOT;
"," return COMMA;
":" return COLON;
";" return SEMICOLON;
\"(\\\\|\\\"|[^\\"])*\" return STRING;
\"(\\\\|\\\"|[^\\"])* { fprintf(stderr,
"Unterminated string literal\n"); }
[^-+*.,:;[:alnum:][:space:]]+ { fprintf(stderr,
"Invalid sequence %s\m", yytext); }
(If any of those patterns look mysterious, you might want to review the description of flex patterns in the flex manual.)
But I have a feeling that you were looking for something different: a way of magically adapting to any change in the token patterns without excess analysis.
That's possible, too, but I don't know how to do it without code repetition. The basic idea is simple enough: when we encounter an unmatchable character, we just append it to the end of an error token and when we find a valid token, we emit the error message and clear the error token.
The problem is the "when we find a valid token" part, because that means that we need to insert an action at the beginning of every rule other than the error rule. The easiest way to do that is to use a macro, which at least avoids writing out the code for every action.
(F)lex does provide us with some useful tools we can build this on. We'll use one of (f)lex's special actions, yymore(), which causes the current match to be appended to the token being built, which is useful to build up the error token.
In order to know the length of the error token (and therefore to know if there is one), we need an additional variable. Fortunately, (f)lex allows us to define our own local variables inside the scanner. Then we define the macro E_ (whose name was chosen to be short, in order to avoid cluttering the rule actions), which prints the error message, moves yytext over the error token, and resets the error count.
Putting that together:
%{
#include "../h/Tokens.h"
%}
%option yylineno noyywrap
%%
    int nerrors = 0; /* To keep track of the length of the error token */

    /* This macro must be inserted at the beginning of every rule,
     * except the fallback error rule.
     */
    #define E_ \
        if (nerrors > 0) { \
            fprintf(stderr, "Invalid sequence %.*s\n", nerrors, yytext); \
            yytext += nerrors; yyleng -= nerrors; nerrors = 0; \
        } else /* Absorb the following semicolon */
[[:space:]]+ { E_; /* Ignore whitespace */ }
(\.[0-9]+|[0-9]+\.[0-9]*)([eE][+-]?[0-9]+)? { E_; return FLOAT; }
0|[1-9][0-9]* { E_; return INTEGER; }
true|false { E_; return BOOLEAN; }
func|val|if|else|while|for { E_; return KEYWORD; }
[A-Za-z_][A-Za-z_0-9]* { E_; return IDENTIFIER; }
"+" { E_; return PLUS; }
"-" { E_; return MINUS; }
"*" { E_; return MULTI; }
"." { E_; return DOT; }
"," { E_; return COMMA; }
":" { E_; return COLON; }
";" { E_; return SEMICOLON; }
\"(\\\\|\\\"|[^\\"])*\" { E_; return STRING; }
\"(\\\\|\\\"|[^\\"])* { E_;
fprintf(stderr,
"Unterminated string literal\n"); }
. { yymore(); ++nerror; }
That all assumes that we're happy to just produce an error message inside the scanner, and otherwise ignore the erroneous characters. But it may be better to actually return an error indication and let the caller decide how to handle the error. That introduces an extra wrinkle because it requires us to return two tokens in a single action.
For a simple solution, we use another (f)lex feature, yyless(), which allows us to rescan part or all of the current token. We can use that to remove the error token from the current token, instead of adjusting yytext and yyleng. (yyless will do that adjustment for us.) That means that after an error, the next correct token is scanned twice. That may seem inefficient, but it's probably acceptable because:
Most tokens are short,
There's not really much point in optimising for errors. It's much more useful to optimise processing of correct inputs.
To accomplish that, we just need a small change to the E_ macro:
#define E_ \
    if (nerrors > 0) { \
        yyless(nerrors); \
        fprintf(stderr, "Invalid sequence %s\n", yytext); \
        nerrors = 0; \
        return BAD_INPUT; \
    } else /* Absorb the following semicolon */

OCaml Menhir parser production is never reduced

I'm in the middle of learning how to parse simple programs.
This is my lexer.
{
  open Parser
  exception SyntaxError of string
}

let white = [' ' '\t']+
let blank = ' '
let identifier = ['a'-'z']

rule token = parse
  | white { token lexbuf } (* skip whitespace *)
  | '-' { HYPHEN }
  | identifier {
      let buf = Buffer.create 64 in
      Buffer.add_string buf (Lexing.lexeme lexbuf);
      scan_string buf lexbuf;
      let content = (Buffer.contents buf) in
      STRING(content)
    }
  | _ { raise (SyntaxError "Unknown stuff here") }

and scan_string buf = parse
  | ['a'-'z']+ {
      Buffer.add_string buf (Lexing.lexeme lexbuf);
      scan_string buf lexbuf
    }
  | eof { () }
My "ast":
type t =
    String of string
  | Array of t list
My parser:
%token <string> STRING
%token HYPHEN
%start <Ast.t> yaml
%%
yaml:
  | scalar { $1 }
  | sequence { $1 }
  ;

sequence:
  | sequence_items {
      Ast.Array (List.rev $1)
    }
  ;

sequence_items:
    (* empty *) { [] }
  | sequence_items HYPHEN scalar {
      $3 :: $1
    }
  ;

scalar:
  | STRING { Ast.String $1 }
  ;
I'm currently at a point where I want to parse either plain 'strings', i.e. some text, or 'arrays' of 'strings', i.e. - item1 - item2.
When I compile the parser with Menhir I get:
Warning: production sequence -> sequence_items is never reduced.
Warning: in total, 1 productions are never reduced.
I'm pretty new to parsing. Why is this never reduced?
You declare that your entry point to the parser is called main:
%start <Ast.t> main
But I can't see a main production in your code. Maybe the entry point is supposed to be yaml? If you change that, does the error still persist?
Also, try adding an EOF token to your lexer and to the entry-level production, like this:
parse_yaml: yaml EOF { $1 }
See here for example: https://github.com/Virum/compiler/blob/28e807b842bab5dcf11460c8193dd5b16674951f/grammar.mly#L56
The Real World OCaml link below also discusses how to use EOF; I think this will solve your problem.
By the way, it's really cool that you are writing a YAML parser in OCaml. If made open source it will be really useful to the community. Note that YAML is indentation-sensitive, so to parse it with Menhir you will need your lexer to produce some kind of INDENT and DEDENT tokens. Also, YAML is a strict superset of JSON, which means it might (or might not) make sense to start with a JSON subset and then expand it. Real World OCaml shows how to write a JSON parser using Menhir:
https://dev.realworldocaml.org/16-parsing-with-ocamllex-and-menhir.html

PEG grammar to accept late definition

I want to write a PEG parser with PackCC (but peg/leg or other libraries are also possible) which is able to calculate some fields with variables at arbitrary positions.
The first simplified approach is the following grammar:
%source {
    int vars[256];
}
statement <- e:term EOL { printf("answer=%d\n", e); }
term <- l:primary
          ( '+' r:primary { l += r; }
          / '-' r:primary { l -= r; }
          )* { $$ = l; }
      / i:var '=' s:term { $$ = vars[i] = s; }
      / e:primary { $$ = e; }
primary <- < [0-9]+ > { $$ = atoi($1); }
         / i:var !'=' { $$ = vars[i]; }
var <- < [a-z] > { $$ = $1[0]; }
EOL <- '\n' / ';'
%%
When testing with sequential order, it works fine:
a=42;a+1
answer=42
answer=43
But when the variable definition comes after its use, it fails:
a=42;a+b;b=1
answer=42
answer=42
answer=1
Even deeper chains of late definitions should work too, like:
a=42;a+b;b=c;c=1
answer=42
answer=42
answer=0
answer=1
Let's think about the input not as a sequential programming language, but more as an Excel-like spreadsheet, e.g.:
A1: 42
A2: =A1+A3
A3: 1
Is it possible to parse and handle such kind of text with a PEG grammar?
Is a two-pass or multi-pass approach an option here?
Or do I need to switch over to old-style lex/yacc or flex/bison?
I'm not familiar with PEG per se, but it looks like what you have is an attributed grammar where you perform the execution logic directly within the semantic action.
That won't work if you have use before definition.
You can use the same parser generator but you'll probably have to define some sort of abstract syntax tree to capture the semantics and postpone evaluation until you've parsed all input.
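To make that concrete, here is a minimal sketch of the deferred-evaluation idea in Rust rather than in PackCC's C actions (the Expr type, the eval function, and the hand-built statement list are hypothetical stand-ins for whatever the real parser would produce): assignments are collected in a first pass, and expressions are evaluated only after the whole input has been parsed.

use std::collections::HashMap;

// Hypothetical AST for the calculator. Evaluation is deferred until the whole
// input has been parsed, so a late definition like `b=1` is visible to `a+b`.
#[derive(Clone)]
enum Expr {
    Num(i64),
    Var(char),
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Assign(char, Box<Expr>),
}

// Resolve variables against the collected definitions. No cycle detection:
// `a=b;b=a` would recurse forever in this sketch.
fn eval(e: &Expr, vars: &HashMap<char, Expr>) -> i64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Var(v) => vars.get(v).map_or(0, |def| eval(def, vars)),
        Expr::Add(l, r) => eval(l, vars) + eval(r, vars),
        Expr::Sub(l, r) => eval(l, vars) - eval(r, vars),
        Expr::Assign(_, rhs) => eval(rhs, vars),
    }
}

fn main() {
    // What the parser might produce for "a=42;a+b;b=1".
    let stmts = vec![
        Expr::Assign('a', Box::new(Expr::Num(42))),
        Expr::Add(Box::new(Expr::Var('a')), Box::new(Expr::Var('b'))),
        Expr::Assign('b', Box::new(Expr::Num(1))),
    ];

    // First pass: collect every definition, regardless of where it appears.
    let mut vars = HashMap::new();
    for s in &stmts {
        if let Expr::Assign(name, def) = s {
            vars.insert(*name, (**def).clone());
        }
    }

    // Second pass: evaluate each statement; prints 42, 43, 1.
    for s in &stmts {
        println!("answer={}", eval(s, &vars));
    }
}

In the PackCC grammar this corresponds to having the actions build such nodes instead of computing vars[i] immediately, and running the evaluation once the whole input has been read.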
Yes, it is possible to parse this with a PEG grammar. PEG is effectively greedy LL(*) with infinite lookahead. Expressions like this are easy.
But the grammar you have written is left recursive, which is not PEG. Although some PEG parsers can handle left recursion, until you're an expert it's best to avoid it, and use only right recursion if needed.

What is the sequence combinator in Chomp?

I'm attempting to parse a subset of JSON that only contains a single, non-nested object with string-only values that may contain escape sequences, e.g.
{
    "A KEY": "SOME VALUE",
    "Another key": "Escape sequences \n \r \\ \/ \f \t \u263A"
}
I'm using the Chomp parser combinator library in Rust. I have it parsing this structure while ignoring escape sequences, but I'm having trouble working out how to handle the escape sequences. Looking at other quoted-string parsers that use combinators, such as:
Arc JSON parser
PHP parser-combinator
Paka
They each use a sequence combinator. What is the equivalent in Chomp?
Chomp is based on Attoparsec and Parsec, so for parsing escaped strings I would use the scan parser to obtain the slice between the " characters while keeping any escaped " characters.
The sequence combinator is just the ParseResult::bind method, used to chain the match of the " character and the escaped string itself so that it will be able to parse "foo\\"bar" and not just foo\\"bar. You get this for free when you use the parse! macro, each ; is implicitly converted into a bind call to chain the parsers together.
The linked parsers use a many and or combinator and allocate a vector for the resulting characters. Paka does not seem to do any transformation on the resulting array, and PHP is using a regex with a callback to unescape the string.
This is code translated from Attoparsec's Aeson benchmark for parsing a JSON-string while not unescaping any escaped characters.
#[macro_use]
extern crate chomp;

use chomp::*;
use chomp::buffer::IntoStream;
use chomp::buffer::Stream;

pub fn json_string(i: Input<u8>) -> U8Result<&[u8]> {
    parse!{i;
        token(b'"');
        let escaped_str = scan(false, |s, c| if s { Some(false) }
                                             else if c == b'"' { None }
                                             else { Some(c == b'\\') });
        token(b'"');
        ret escaped_str
    }
}

#[test]
fn test_it() {
    let r = "\"foo\\\"bar\\tbaz\"".as_bytes().into_stream().parse(json_string);
    assert_eq!(r, Ok(&b"foo\\\"bar\\tbaz"[..]));
}
The parser above is not exactly equivalent: it yields a slice of u8 borrowed from the source buffer/slice. If you want an owned Vec of the data, you should preferably use [T]::to_vec or String::from_utf8 instead of building a parser with many and or, since that will not be as fast as scan and the result is the same.
If you want to parse UTF-8 and escape sequences, you can filter the resulting slice and then call String::from_utf8 on the Vec (Rust strings are UTF-8; using a string containing invalid UTF-8 can result in undefined behaviour). If performance is an issue, you should most likely build that into the parser itself.
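For example, turning the borrowed slice returned by json_string into owned data might look roughly like this (a sketch; it assumes the bytes are valid UTF-8 and leaves any escape sequences untranslated):

// `escaped` stands in for the &[u8] that `json_string` yields on success.
let escaped: &[u8] = br#"foo\"bar\tbaz"#;
let owned_bytes: Vec<u8> = escaped.to_vec();               // owned copy of the raw bytes
let owned_string: String = String::from_utf8(owned_bytes)  // errors on invalid UTF-8
    .expect("string literal was not valid UTF-8");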
The documentation states (emphasis mine):
Using parsers is almost entirely done using the parse! macro, which enables us to do three distinct things:
Sequence parsers over the remaining input
Store intermediate results into datatypes
Return a datatype at the end, which may be the result of any arbitrary computation over the intermediate results.
It then provides this example of parsing a sequence of two numbers followed by a constant string:
fn f(i: Input<u8>) -> U8Result<(u8, u8, u8)> {
    parse!{i;
        let a = digit();
        let b = digit();
        string(b"missiles");
        ret (a, b, a + b)
    }
}

fn digit(i: Input<u8>) -> U8Result<u8> {
    satisfy(i, |c| b'0' <= c && c <= b'9').map(|c| c - b'0')
}
There is also ParseResult::bind and ParseResult::then which are documented to sequentially compose a result with a second action.
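As a rough illustration (not taken from the Chomp documentation, and the exact closure signatures may differ between Chomp versions), the sequencing in that example could be written with nested bind calls instead of the parse! macro, along these lines:

// Hypothetical desugaring of the parse! example above, reusing the digit
// parser defined there; each bind chains the next parser onto the previous result.
fn f_with_bind(i: Input<u8>) -> U8Result<(u8, u8, u8)> {
    digit(i).bind(|i, a|
        digit(i).bind(|i, b|
            string(i, b"missiles").bind(|i, _|
                i.ret((a, b, a + b)))))
}

This is essentially what the macro generates for each ; and let binding.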
Because I'm always interested in parsing, I went ahead and played with it a bit to see how it would look. I'm not happy with the deep indenting that would happen with the nested or calls, but there's probably something better that can be done. This is just one possible solution:
#[macro_use]
extern crate chomp;

use chomp::*;
use chomp::ascii::is_alpha;
use chomp::buffer::{Source, Stream, ParseError};

use std::str;
use std::iter::FromIterator;

#[derive(Debug)]
pub enum StringPart<'a> {
    String(&'a [u8]),
    Newline,
    Slash,
}

impl<'a> StringPart<'a> {
    fn from_bytes(s: &[u8]) -> StringPart {
        match s {
            br#"\\"# => StringPart::Slash,
            br#"\n"# => StringPart::Newline,
            s => StringPart::String(s),
        }
    }
}

impl<'a> FromIterator<StringPart<'a>> for String {
    fn from_iter<I>(iterator: I) -> Self
        where I: IntoIterator<Item = StringPart<'a>>
    {
        let mut s = String::new();
        for part in iterator {
            match part {
                StringPart::String(p) => s.push_str(str::from_utf8(p).unwrap()),
                StringPart::Newline => s.push('\n'),
                StringPart::Slash => s.push('\\'),
            }
        }
        s
    }
}

fn json_string_part(i: Input<u8>) -> U8Result<StringPart> {
    or(i,
       |i| parse!{i; take_while1(is_alpha)},
       |i| or(i,
              |i| parse!{i; string(br"\\")},
              |i| parse!{i; string(br"\n")}),
    ).map(StringPart::from_bytes)
}

fn json_string(i: Input<u8>) -> U8Result<String> {
    many1(i, json_string_part)
}

fn main() {
    let input = br#"\\stuff\n"#;
    let mut i = Source::new(input as &[u8]);

    println!("Input has {} bytes", input.len());

    loop {
        match i.parse(json_string) {
            Ok(x) => {
                println!("Result has {} bytes", x.len());
                println!("{:?}", x);
            },
            Err(ParseError::Retry) => {}, // Needed to refill buffer when necessary
            Err(ParseError::EndOfInput) => break,
            Err(e) => { panic!("{:?}", e); }
        }
    }
}
