How to stop a rule from eating the next character in a Rust pest parser

I am trying to build a basic LaTeX parser using the pest library. For the moment, I only care about lines, bold formatting, and plain text. I am struggling with the latter. To simplify the problem, I assume that plain text cannot contain these two characters: \ and }.
lines = { line ~ (NEWLINE ~ line)* }
line = { token* }
token = { text_bold | text_plain }
text_bold = { "\\textbf{" ~ text_plain ~ "}" }
text_plain = ${ inner ~ ("\\" | "}" | NEWLINE) }
inner = @{ char* }
char = {
    !("\\" | "}" | NEWLINE) ~ ANY
}
main = {
    SOI ~
    lines ~
    EOI
}
Using the online pest editor, we can see that my grammar eats the character after the plain text.
Input:
Before \textbf{middle} after.
New line
Output:
- lines > line
- token > text_plain > inner: "Before "
- token > text_plain > inner: "textbf{middle"
- token > text_plain > inner: " after."
- token > text_plain > inner: "New line"
If I replace ${ inner ~ ("\\" | "}" | NEWLINE) } with ${ inner }, it fails. If I add a & in front of the suffix, it does not work either.
How can I change my grammar so that lines and bold tags are detected?

The rule
text_plain = ${ inner ~ ("\\" | "}" | NEWLINE) }
certainly matches the character following inner (which must be a backslash, close brace, or newline). That's not what you want: you want the following character to be part of the next token. But it definitely seems reasonable to ask what happened to that character, since the token corresponding to text_plain clearly doesn't show it.
The answer, apparently, is a subtlety in how tokens are formed. According to the Pest book:
When the rule starts being parsed, the starting part of the token is being produced, with the ending part being produced when the rule finishes parsing.
The key here, it turns out, is what is not being said. ("\\" | "}" | NEWLINE) is not a rule, and therefore it does not trigger any token pairs. So when you iterate over the tokens inside text_plain, you only see the token generated by inner.
None of that is really relevant, since text_plain should not attempt to match the following character in any event. I suppose you realised that, because you say you tried to change the rule to text_plain = ${ inner }, but that "failed". It would have been useful to know what "failure" meant here, but I suppose it was because pest complained about the attempt to use a repetition operator on a rule which can match the empty string.
Since inner is a *-repetition, it can match the empty string; defining text_plain as a copy of inner means that text_plain can also match the empty string; that means that token ({ text_bold | text_plain }) can match the empty string, and that makes token* illegal because Pest doesn't allow applying repetition operators to a nullable rule. The simplest solution is to change inner from char* to char+, which forces it to match at least one character.
In the following, I actually got rid of inner altogether, since it seems redundant:
main = { SOI ~ lines ~ EOI }
lines = { line ~ (NEWLINE ~ line)* ~ NEWLINE? }
line = { token* }
token = { text_bold | text_plain }
text_bold = { "\\textbf{" ~ text_plain ~ "}" }
text_plain = @{ char+ }
char = {
    !("\\" | "}" | NEWLINE) ~ ANY
}
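For completeness, here is a minimal sketch of driving the corrected grammar from Rust; this is my own illustration, not part of the original answer, and it assumes the pest and pest_derive crates are declared in Cargo.toml and that the grammar above is saved as src/grammar.pest:
use pest::Parser;
use pest_derive::Parser;

// Hypothetical setup: the corrected grammar saved as src/grammar.pest.
#[derive(Parser)]
#[grammar = "grammar.pest"]
struct LatexParser;

fn main() {
    let input = "Before \\textbf{middle} after.\nNew line";
    let pairs = LatexParser::parse(Rule::main, input).expect("parse failed");

    // flatten() visits every pair in the parse tree. Only named rules
    // produce pairs, which is also why the bare ("\\" | "}" | NEWLINE)
    // in the original text_plain never showed up as a token.
    for pair in pairs.flatten() {
        match pair.as_rule() {
            Rule::text_plain => println!("plain: {:?}", pair.as_str()),
            Rule::text_bold => println!("bold: {:?}", pair.as_str()),
            _ => {}
        }
    }
}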

Related

PEG parser - support for escape characters

I'm working on a PEG parser. Among other structures, it needs to parse a tag directive. A tag can contain any character. If you want the tag to include a curly brace }, you can escape it with a backslash. If you need a literal backslash, that should also be escaped. I tried to implement this, inspired by the PEG grammar for JSON: https://github.com/pegjs/pegjs/blob/master/examples/json.pegjs
There are two problems:
1) An escaped backslash results in two backslash characters instead of one. Example input:
{ some characters but escape with a \\ }
2) The parser breaks on an escaped curly \}. Example input:
{ some characters but escape \} with a \\ }
The relevant grammar is:
Tag
  = "{" _ tagContent:$(TagChar+) _ "}" {
      return { type: "tag", content: tagContent }
    }

TagChar
  = [^\}\r\n]
  / Escape
    sequence:(
        "\\" { return {type: "char", char: "\\"}; }
      / "}" { return {type: "char", char: "\x7d"}; }
    )
    { return sequence; }

_ "whitespace"
  = [ \t\n\r]*

Escape
  = "\\"
You can easily test the grammar and test input with the online PEG.js sandbox: https://pegjs.org/online
I hope somebody has an idea to resolve this.
These errors are both basically typos.
The first problem is the character class in your regular expression for tag characters. In a character class, \ continues to be an escape character, so [^\}\r\n] matches any character other than } (written with an unnecessary backslash escape), carriage return or newline. \ is such a character, so it's matched by the character class, and Escape is never tried.
Since your pattern for tag characters doesn't succeed in recognising \ as an Escape, the tag { \\ } is parsed as four characters (space, backslash, backslash, space) and the tag { \} } is parsed as terminating on the first }, creating a syntax error.
So you should fix the character class to [^}\\\r\n]. (I put the closing brace first in order to make it easier to read the falling timber; the order is irrelevant.)
Once you do that, you'll find that the parser still returns the string with the backslashes intact. That's because of the $ in your Tag pattern: "{" _ tagContent:$(TagChar+) _ "}". According to the documentation, the meaning of the $ operator is: (emphasis added)
$ expression
Try to match the expression. If the match succeeds, return the matched text instead of the match result.
So although the actions inside TagChar build the unescaped characters, the $ throws those results away and returns the raw matched text, backslashes and all. The fix is to drop the $ and join the per-character results yourself. For reference, the correct grammar is as follows:
Tag
  = "{" _ tagContent:TagChar+ _ "}" {
      return { type: "tag", content: tagContent.map(c => c.char || c).join('') }
    }

TagChar
  = [^}\\\r\n]
  / Escape
    sequence:(
        "\\" { return {type: "char", char: "\\"}; }
      / "}" { return {type: "char", char: "\x7d"}; }
    )
    { return sequence; }

_ "whitespace"
  = [ \t\n\r]*

Escape
  = "\\"
When using the following input:
{ some characters but escape \} with a \\ }
it will return:
{
  "type": "tag",
  "content": "some characters but escape } with a \\ "
}
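For comparison, the unescaping that the grammar performs can also be written as a standalone function. Here is a sketch in Rust; this is my own illustration (the name unescape_tag is hypothetical), not part of the PEG.js answer:
/// Unescape a tag body: "\\" becomes a literal backslash, "\}" a literal
/// close brace; a bare '}', '\r' or '\n' is rejected, mirroring TagChar.
fn unescape_tag(raw: &str) -> Result<String, String> {
    let mut out = String::new();
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        match c {
            '\\' => match chars.next() {
                Some('\\') => out.push('\\'),
                Some('}') => out.push('}'),
                other => return Err(format!("invalid escape: {:?}", other)),
            },
            '}' | '\r' | '\n' => return Err(format!("unescaped {:?} in tag", c)),
            _ => out.push(c),
        }
    }
    Ok(out)
}

fn main() {
    // The tag body from the example input, with the outer braces removed.
    let raw = r" some characters but escape \} with a \\ ";
    // Prints: Ok(" some characters but escape } with a \\ ")
    println!("{:?}", unescape_tag(raw));
}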

OCaml Menhir parser production is never reduced

I'm in the middle of learning how to parse simple programs.
This is my lexer.
{
  open Parser
  exception SyntaxError of string
}

let white = [' ' '\t']+
let blank = ' '
let identifier = ['a'-'z']

rule token = parse
  | white { token lexbuf } (* skip whitespace *)
  | '-' { HYPHEN }
  | identifier {
      let buf = Buffer.create 64 in
      Buffer.add_string buf (Lexing.lexeme lexbuf);
      scan_string buf lexbuf;
      let content = (Buffer.contents buf) in
      STRING(content)
    }
  | _ { raise (SyntaxError "Unknown stuff here") }

and scan_string buf = parse
  | ['a'-'z']+ {
      Buffer.add_string buf (Lexing.lexeme lexbuf);
      scan_string buf lexbuf
    }
  | eof { () }
My "ast":
type t =
String of string
| Array of t list
My parser:
%token <string> STRING
%token HYPHEN
%start <Ast.t> yaml
%%
yaml:
  | scalar { $1 }
  | sequence { $1 }
  ;

sequence:
  | sequence_items { Ast.Array (List.rev $1) }
  ;

sequence_items:
  | (* empty *) { [] }
  | sequence_items HYPHEN scalar { $3 :: $1 }
  ;

scalar:
  | STRING { Ast.String $1 }
  ;
I'm currently at a point where I want to parse either plain 'strings', e.g. some text, or 'arrays' of 'strings', e.g. - item1 - item2.
When I compile the parser with Menhir I get:
Warning: production sequence -> sequence_items is never reduced.
Warning: in total, 1 productions are never reduced.
I'm pretty new to parsing. Why is this never reduced?
You declare that your entry point to the parser is called main
%start <Ast.t> main
But I can't see the main production in your code. Maybe the entry point is supposed to be yaml? If that is changed, does the error still persist?
Also, try adding an EOF token to your lexer and to the entry-level production, like this:
parse_yaml: yaml EOF { $1 }
See here for example: https://github.com/Virum/compiler/blob/28e807b842bab5dcf11460c8193dd5b16674951f/grammar.mly#L56
The Real World OCaml link below also discusses how to use EOF; I think this will solve your problem.
By the way, it is really cool that you are writing a YAML parser in OCaml. If you make it open source, it will be really useful to the community. Note that YAML is indentation-sensitive, so to parse it with Menhir you will need your lexer to produce some kind of INDENT and DEDENT tokens. Also, YAML is a strict superset of JSON, which means it might (or might not) make sense to start with a JSON subset and then expand it. Real World OCaml shows how to write a JSON parser using Menhir:
https://dev.realworldocaml.org/16-parsing-with-ocamllex-and-menhir.html

Debug parser by printing useful information

I would like to parse a set of expressions, for instance: X[3], X[-3], XY[-2], X[4]Y[2], etc.
In my parser.mly, index (which is inside []) is defined as follows:
index:
  | INTEGER { $1 }
  | MINUS INTEGER { 0 - $2 }
The tokens INTEGER, MINUS, etc. are defined in the lexer as usual.
When I try to parse an example, it fails. However, if I comment out | MINUS INTEGER { 0 - $2 }, it works well. So the problem is certainly related to that. To debug, I want to get more information; in other words, I want to know what is considered to be MINUS INTEGER. I tried to add a print:
index:
  | INTEGER { $1 }
  | MINUS INTEGER { Printf.printf "%n" $2; 0 - $2 }
But nothing is printed while parsing.
Could anyone tell me how to print information or debug that?
I tried coming up with an example of what you describe and was able to get output of 8 with what I show below. (This example is completely stripped down so that it only works for [1] and [- 1 ], but I believe it's logically equivalent to what you said you did.)
However, I also notice that the debug string in your example does not have an explicit flush with %! at the end, so the debugging output might not be flushed to the terminal until later than you expect.
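The same buffering pitfall is easy to reproduce in other languages. As an aside (my own illustration, not part of the original answer), here is a minimal Rust sketch of the equivalent explicit flush:
use std::io::{self, Write};

fn main() -> io::Result<()> {
    let n = 8;
    // Without flush(), this output can sit in the stdout buffer,
    // much like an OCaml printf that lacks the %! directive.
    print!("{}", n);
    io::stdout().flush()?;
    Ok(())
}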
Here's what I used:
Test.mll:
{
open Ytest
open Lexing
}
rule test =
  parse
    "-" { MINUS }
  | "1" { ONE 1 }
  | "[" { LB }
  | "]" { RB }
  | [ ' ' '\t' '\r' '\n' ] { test lexbuf }
  | eof { EOFTOKEN }
Ytest.mly:
%{
%}
%token <int> ONE
%token MINUS LB RB EOFTOKEN
%start item
%type <int> index item
%%
index:
    ONE { 2 }
  | MINUS ONE { Printf.printf "%n" 8; $2 }

item: LB index RB EOFTOKEN { $2 }
Parse.ml:
open Test;;
open Ytest;;
open Lexing;;
let lexbuf = Lexing.from_channel stdin in
ignore (Ytest.item Test.test lexbuf)

ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

I am trying to preprocess my C++ source files with ANTLR. I would like to output an input file, preserving all the whitespace formatting of the original source file, while inserting some new source code of my own at the appropriate locations.
I know preserving WS requires this lexer rule:
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
With this, my parser rules would have a $text attribute containing all the hidden WS. But the problem is that, for any parser rule, its $text attribute only includes the input text starting from the position that matches the first token of the rule. For example, if this is my input (note the formatting WS before and in between the tokens):
 line 1; line 2;
And if I have two separate parser rules matching "line 1;" and "line 2;" individually, but no rule matching the whole line " line 1; line 2;", then the leading WS and the WS in between "line 1;" and "line 2;" are lost (not accessible by any of my rules).
What should I do to preserve all the whitespace while allowing my parser rules to determine where to add new code at the appropriate locations?
EDIT
Let's say that whenever my code contains a call to function(1), with 1 as the parameter and not something else, the preprocessor adds a call to extraFunction() before it:
void myFunction() {
    function();
    function(1);
}
Becomes:
void myFunction() {
    function();
    extraFunction();
    function(1);
}
This preprocessed output should remain human-readable, as people would continue coding on it. For this simple example a text editor could handle it, but there are more complicated cases that justify the use of ANTLR.
Another solution, though maybe also not very practical: you can collect all the whitespace backwards, with something like this untested pseudocode:
grammar T;

@members {
    public void printWhitespaceBetweenRules(Token start) {
        int index = start.getTokenIndex() - 1;
        StringBuilder ws = new StringBuilder();
        while (index >= 0) {
            Token token = input.get(index);
            if (token.getChannel() != Token.HIDDEN_CHANNEL) break;
            ws.insert(0, token.getText()); // prepend to keep source order
            index--;
        }
        System.out.print(ws);
    }
}

line1: 'line' '1' {printWhitespaceBetweenRules($start);};
line2: 'line' '2' {printWhitespaceBetweenRules($start);};

WS: (' '|'\n'|'\r'|'\t'|'\f')+ {$channel=HIDDEN;};
But you would still need to change every rule.
I guess one solution is to keep the WS tokens in the default channel by removing the $channel = HIDDEN;. This will give you access to the WS tokens in your parser.
Here's another way to solve it (at least for the example you posted).
So you want to replace ...function(1) with ...extraFunction();\nfunction(1), where the dots are indents and \n is a line break.
What you could do is match:
Function1
  : Spaces 'function' Spaces '(' Spaces '1' Spaces ')'
  ;

fragment Spaces
  : (' ' | '\t')*
  ;
and replace that with the text it matches, prepended with your extra method. However, the lexer will now complain when it stumbles upon input like:
'function()'
(without the 1 as a parameter)
or:
' x...'
(indents not followed by the f from function)
So, you'll need to "branch out" in your Function1 rule and make sure you only replace the proper occurrence.
You must also take care of occurrences of function(1) inside string literals and comments, assuming you don't want those to be prepended with extraFunction();\n.
A little demo:
grammar T;

parse
  : (t=. {System.out.print($t.text);})* EOF
  ;

Function1
  : indent=Spaces
    ( 'function' Spaces '(' Spaces ( '1' Spaces ')' {setText($indent.text + "extraFunction();\n" + $text);}
                                   | ~'1' // do nothing if something other than `1` occurs
                                   )
    | '"' ~('"' | '\r' | '\n')* '"' // do nothing in case of a string literal
    | '/*' .* '*/' // do nothing in case of a multi-line comment
    | '//' ~('\r' | '\n')* // do nothing in case of a single-line comment
    | ~'f' // do nothing if a char other than 'f' is seen
    )
  ;

OtherChar
  : . // a "fall-through" rule: it will match anything if none of the above matched
  ;

fragment Spaces
  : (' ' | '\t')* // fragment rules are only used inside other lexer rules
  ;
You can test it with the following class:
import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source =
            "/* \n" +
            " function(1) \n" +
            "*/ \n" +
            "void myFunction() { \n" +
            " s = \"function(1)\"; \n" +
            " function(); \n" +
            " function(1); \n" +
            "} \n";
        System.out.println(source);
        System.out.println("---------------------------------");
        TLexer lexer = new TLexer(new ANTLRStringStream(source));
        TParser parser = new TParser(new CommonTokenStream(lexer));
        parser.parse();
    }
}
And if you run this Main class, you will see the following being printed to the console:
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart@hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar Main
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
function(1);
}
---------------------------------
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
extraFunction();
function(1);
}
I'm sure it's not fool-proof (I didn't account for char literals, for one), but this could be a start to solving this, IMO.

Having some simple problems with Scala combinator parsers

First, the code:
package com.digitaldoodles.markup
import scala.util.parsing.combinator.{Parsers, RegexParsers}
import com.digitaldoodles.rex._
class MarkupParser extends RegexParsers {
  val stopTokens = (Lit("{{") | "}}" | ";;" | ",,").lookahead
  val name: Parser[String] = """[##!$]?[a-zA-Z][a-zA-Z0-9]*""".r
  val content: Parser[String] = (patterns.CharAny ** 0 & stopTokens).regex
  val function: Parser[Any] = name ~ repsep(content, "::") <~ ";;"
  val block1: Parser[Any] = "{{" ~> function
  val block2: Parser[Any] = "{{" ~> function <~ "}}"
  val lst: Parser[Any] = repsep("[a-z]", ",")
}

object ParseExpr extends MarkupParser {
  def main(args: Array[String]) {
    println("Content regex is ", (patterns.CharAny ** 0 & stopTokens).regex)
    println(parseAll(block1, "{{#name 3:4:foo;;"))
    println(parseAll(block2, "{{#name 3:4:foo;; stuff}}"))
    println(parseAll(lst, "a,b,c"))
  }
}
Then, the run results:
[info] == run ==
[info] Running com.digitaldoodles.markup.ParseExpr
(Content regex is ,(?:[\s\S]{0,})(?=(?:(?:\{\{|\}\})|;;)|\,\,))
[1.18] parsed: (#name~List(3:4:foo))
[1.24] failure: `;;' expected but `}' found
{{#name 3:4:foo;; stuff}}
^
[1.1] failure: string matching regex `\z' expected but `a' found
a,b,c
^
I use a custom library to assemble some of my regexes, so I've printed out the "content" regex; it's supposed to be basically any text up to but not including certain token patterns, enforced using a positive lookahead assertion.
Finally, the problems:
1) The first run, on "block1", succeeds, but shouldn't, because the separator in the repsep function is "::", yet ":" seems to be parsed as a separator.
2) The run on "block2" fails, presumably because the lookahead clause isn't working, but I can't figure out why this should be. The lookahead clause was already exercised by the repsep in the run on "block1" and seemed to work there, so why should it fail on "block2"?
3) The simple repsep exercise on "lst" fails because, internally, the parser engine seems to be looking for a boundary. Is this something I need to work around somehow?
Thanks,
Ken
1) No, "::" are not parsed as separators. If it did, the output would be (#name~List(3, 4, foo)).
2) It happens because "}}" is also a delimiter, so the content takes the longest match it can, the one that includes the ";;" as well. If you make the preceding expression non-eager, it will then fail at the "s" in "stuff", which I presume is what you expected; see the sketch after this list.
3) You passed a literal, not a regex. Change "[a-z]" to "[a-z]".r and it will work.
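To make point 2 concrete, here is a small sketch of greedy versus non-greedy matching against the delimiters from the question. It is my own illustration, not part of the original answer, written in Rust with the regex crate and consuming the delimiters rather than using a lookahead:
use regex::Regex;

fn main() {
    let input = "3:4:foo;; stuff}}";

    // Greedy: .* grabs as much as it can while still leaving a delimiter,
    // so it swallows the ";;" and stops only before the final "}}".
    let greedy = Regex::new(r"^(.*)(?:\{\{|\}\}|;;|,,)").unwrap();
    println!("{}", &greedy.captures(input).unwrap()[1]); // 3:4:foo;; stuff

    // Non-greedy: .*? stops at the first delimiter, the ";;".
    let lazy = Regex::new(r"^(.*?)(?:\{\{|\}\}|;;|,,)").unwrap();
    println!("{}", &lazy.captures(input).unwrap()[1]); // 3:4:foo
}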
