BNF to handle escape sequence - parsing

I use this BNF to parser my script:
{identset} = {ASCII} - {"\{\}}; //<--all ascii charset except '\"' '{' and '}'
{strset} = {ASCII} - {"};
ident = {identset}*;
str = {strset}*;
node ::= ident "{" nodes "}" | //<--entry point
"\"" str "\"" |
ident;
nodes ::= node nodes |
node;
It can parse correctly the following text into tree structure
doc {
title { "some title goes here" }
refcode { "SDS-1" }
rev { "1.0" }
revdate { "04062010" }
body {
"this is the body of the document
all text should go here"
chapter { "some inline section" }
"text again"
}
}
my question is, how do I handle escape sequence inside string literal:
"some text of \"quotation\" should escape"

Define str as:
str = ( strset strescape ) *;
with
strescape = { \\ } {\" } ;

Related

Peg parser - support for escape characters

I'm working on a Peg parser. Among other structures, it needs to parse a tag directive. A tag can contain any character. If you want the tag to include a curly brace } you can escape it with a backslash. If you need a literal backslash, that should also be escaped. I tried to implement this inspired by the Peg grammer for JSON: https://github.com/pegjs/pegjs/blob/master/examples/json.pegjs
There are two problems:
an escaped backslash results in two backslash characters instead of one. Example input:
{ some characters but escape with a \\ }
the parser breaks on an escaped curly \}. Example input:
{ some characters but escape \} with a \\ }
The relevant grammer is:
Tag
= "{" _ tagContent:$(TagChar+) _ "}" {
return { type: "tag", content: tagContent }
}
TagChar
= [^\}\r\n]
/ Escape
sequence:(
"\\" { return {type: "char", char: "\\"}; }
/ "}" { return {type: "char", char: "\x7d"}; }
)
{ return sequence; }
_ "whitespace"
= [ \t\n\r]*
Escape
= "\\"
You can easily test grammar and test input with the online PegJS sandbox: https://pegjs.org/online
I hope somebody has an idea to resolve this.
These errors are both basically typos.
The first problem is the character class in your regular expression for tag characters. In a character class, \ continues to be an escape character, so [^\}\r\n] matches any character other than } (written with an unnecessary backslash escape), carriage return or newline. \ is such a character, so it's matched by the character class, and Escape is never tried.
Since your pattern for tag characters doesn't succeed in recognising \ as an Escape, the tag { \\ } is parsed as four characters (space, backslash, backslash, space) and the tag { \} } is parsed as terminating on the first }, creating a syntax error.
So you should fix the character class to [^}\\\r\n] (I put the closing brace first in order to make it easier to read the falling timber. The order is irrelevant.)
Once you do that, you'll find that the parser still returns the string with the backslashes intact. That's because of the $ in your Tag pattern: "{" _ tagContent:$(TagChar+) _ "}". According to the documentation, the meaning of the $ operator is: (emphasis added)
$ expression
Try to match the expression. If the match succeeds, return the matched text instead of the match result.
For reference, the correct grammer is as follows:
Tag
= "{" _ tagContent:TagChar+ _ "}" {
return { type: "tag", content: tagContent.map(c => c.char || c).join('') }
}
TagChar
= [^}\\\r\n]
/ Escape
sequence:(
"\\" { return {type: "char", char: "\\"}; }
/ "}" { return {type: "char", char: "\x7d"}; }
)
{ return sequence; }
_ "whitespace"
= [ \t\n\r]*
Escape
= "\\"
When using the following input:
{ some characters but escape \} with a \\ }
it will return:
{
"type": "tag",
"content": "some characters but escape } with a \ "
}

Rascal ambiguity not resolved by disambiguation rules

I'm trying to get a disambiguation working, one in the same vein as the question I asked a few days ago. In that previous question, there was an undocumented limitation in the language implementation; I'm wondering if there's something similar going on here.
Tests [tuvw]1 are all throwing ambiguity exceptions (BTW: How do you catch those? [Edit: answered]). All of them look like they ought to pass. Note that they have to be unambiguous in order to pass. Neither the priority rule Scheme nor the reserve rules UnknownScheme[23] seem to be removing the ambiguity. There might be some interaction with follow rules I'm not understanding; it might be another limitation or a defect. What's up?
I'm on the unstable branch. Version (from Eclipse): 0.10.0.201806220838
EDIT.
I modified the example code to more clearly highlight what's happening. I removed some redundant tests and the tests that were behaving correctly. I expanded some possibly-verbose diagnostics. I changed the exposition above to match. Newer results follow.
It looks like there are two different things at play here. "http" is being accepted (correctly) by both KnownScheme and UnknownScheme in tests s1[ab]. It seems to be behaving as if the priority declaration in Scheme just isn't functioning, as if > is being substituted with |.
In the other case, tests s1[cde] are failing, but s1f is passing. This looks even more like a defect. It's possible to reserve a single keyword, apparently, but not more than one. Since the various reservation declarations are failing, it's no surprise that there's an ambiguity when put into an alternative.
module ssce
import analysis::grammars::Ambiguity;
import IO;
lexical Scheme = AnyScheme ;
lexical AnyScheme = KnownScheme > UnknownScheme ;
lexical AnySchemeChar = [a-z*];
lexical KnownScheme = KnownSchemes !>> AnySchemeChar ;
lexical KnownSchemes = "http" | "https" | "http*" | "javascript" ;
lexical UnknownScheme = UnknownFixedScheme | UnknownWildScheme ;
lexical UnknownFixedScheme = [a-z]+ !>> AnySchemeChar ;
lexical UnknownWildScheme = [a-z]* '*' AnySchemeChar* !>> AnySchemeChar ;
lexical Scheme2 = UnknownScheme2 | KnownScheme ;
lexical UnknownScheme2 = UnknownScheme \ KnownSchemes ;
lexical Scheme3 = UnknownScheme3 | KnownScheme ;
lexical UnknownScheme3 = AnySchemeChar+ \ KnownSchemes ;
lexical Scheme4 = UnknownScheme4 | KnownScheme ;
lexical UnknownScheme4 = AnySchemeChar+ \ ("http"|"https") ;
lexical Scheme5 = UnknownScheme5 | KnownScheme ;
lexical UnknownScheme5 = AnySchemeChar+ \ "http" ;
test bool t1() { return parseAccept( #Scheme, "http" ); }
test bool u1() { return parseAccept( #Scheme2, "http" ); }
test bool v1() { return parseAccept( #Scheme3, "http" ); }
test bool w1() { return parseAccept( #Scheme4, "http" ); }
test bool x1() { return parseAccept( #Scheme5, "http" ); }
test bool s1a() { return parseAccept( #KnownScheme, "http" ); }
test bool s1b() { return parseAccept( #UnknownScheme, "http" ); }
test bool s1c() { return parseReject( #UnknownScheme2, "http" ); }
test bool s1d() { return parseReject( #UnknownScheme3, "http" ); }
test bool s1e() { return parseReject( #UnknownScheme4, "http" ); }
test bool s1f() { return parseReject( #UnknownScheme5, "http" ); }
bool verbose = false;
bool parseAccept( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return false;
}
catch Ambiguity(loc l, str a, str b):
{
if (verbose)
{
println("[Ambiguity] " + a + ", " + b);
Tree tt = parse(begin, input, allowAmbiguity=true) ;
iprintln(tt);
list[Message] m = diagnose(tt) ;
println( ToString(m) );
}
fail;
}
return true;
}
bool parseReject( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return true;
}
return false;
}
str ToString( list[Message] msgs ) =
( ToString( msgs[0] ) | it + "\n" + ToString(m) | m <- msgs[1..] );
str ToString( Message msg)
{
switch(msg)
{
case error(str s, loc _): return "error: " + s;
case warning(str s, loc _): return "warning: " + s;
case info(str s, loc _): return "info: " + s;
}
return "";
}
I've been making this ambiguity diagnostics tool, and here's what it came up with for your grammar. It seems you've discovered more things we need to document and write little checkers for.
Well-formedness of \ is murky.
The problem is that the \ operator only accepts literal strings, such as A \ "a" \ "b" or a keyword non-terminal defined like keyword Hello = "a" | "b";, used as A \ Hello, and nothing else. So also A \ ("a" | "b") is not allowed, and also indirect non-terminals like A \ Hello where lexical Hello = Bye; lexical Bye = "if" | "then"; also not allowed. Only the simplest of the simplest forms.
Well-formedness of follow-restrictions
Similar rules for !>> disallow any non-terminal to the right of the !>> operator.
So [a-z]+ !>> [a-z] or [a-z]+ !>> "*", but not [a-z]+ \ myCharClass where lexical myCharClass = [a-z];
Names for character-classes is on our todoy list; but they will not be like non-terminals. More like aliases which will be substituted at parser generator time.
Whole words
Keyword reservation only works if you subtract the sentence from the whole word. Sometimes you have to group non-terminals to get this right:
lexical Ex = ([a-z]+ "*") \ "https*" instead of lexical Ex = [a-z]+ "*" \ "https*")
The latter would try to subtract the "https*" language from the "*" language. The first works.
case-insensitivity
'if' is defined by lexical 'if' = [iI][fF];
"if" is defined by lexical "if" = [i][f];
'*' is defined by lexical '*' = [*];
"*" is defined by lexical "*" = [*];
New grammar
I used a random generator to generate all the ambiguities I could find, and resolved them step by step by adding keyword reservation:
lexical Scheme = AnyScheme ;
lexical AnyScheme = KnownScheme > UnknownScheme ;
lexical AnySchemeChar = [a-z*];
lexical KnownScheme = KnownSchemes !>> AnySchemeChar ;
keyword KnownSchemes = "http" | "https" | "http*" | "javascript" ;
lexical UnknownScheme = UnknownFixedScheme | UnknownWildScheme ;
lexical UnknownFixedScheme = [a-z]+ !>> AnySchemeChar \ KnownSchemes ;
lexical UnknownWildScheme = ([a-z]* '*' AnySchemeChar*) !>> AnySchemeChar \ KnownSchemes ;

In CUP: How to make something optional to parse?

PROC_DECL -> "proc" [ "ret" TYPE ] NAME
"(" [ PARAM_DECL { "," PARAM_DECL } ] ")"
"{" { DECL } { STMT } "}"
This is the grammar for a Procedure declaration.
How do you say that the "ret" TYPE is optional without making multiple cases?
Use another production, say ret_stmt, which can be either empty or contain a single return statement so in your .cup file you will have this productions:
ret_stmt ::= // empty
{: /*your action for empty return statement*/ :}
// Single return statement
| "ret":r TYPE:t
{: /*your action for single return statement*/ :}
PROC_DECL ::= "proc":p ret_stmt:r NAME:n
"(" param_list:pl ")"
"{" { DECL } { STMT } "}"
{: /*your action for procedure declaration statement*/ :}
You can use a similar approach with parameters declaration, adding the production param_list.

semicolon as delimiter in custom grammar parsed by flex/bison

I'm trying to write a simple parser for a meta programming language.
Everything works fine, but I want to use ';' as statement delimiter and not newline or ommit the semicolon entirely.
So this is the expected behaviour:
// good code
v1 = v2;
v3 = 23;
should parse without errors
But:
// bad code
v1 = v2
v3 = 23;
should fail
yet if I remove the 'empty' rule from separator both codes fail like this:
ID to ID
Error detected in parsing: syntax error, unexpected ID, expecting SEMICOLON
;
If I leave the 'empty' rule active, then both codes are accepted, which is not desired.
ID to ID // should raise error
ID to NUM;
Any help is welcome here, as most tutorials do not cover delimiters at all.
Here is a simplified version of my parser/lexxer:
parser.l:
%{
#include "parser.tab.h"
#include<stdio.h>
%}
num [0-9]
alpha [a-zA-Z_]
alphanum [a-zA-Z_0-9]
comment "//"[^\n]*"\n"
string \"[^\"]*\"
whitespace [ \t\n]
%x ML_COMMENT
%%
<INITIAL>"/*" {BEGIN(ML_COMMENT); printf("/*");}
<ML_COMMENT>"*/" {BEGIN(INITIAL); printf("*/");}
<ML_COMMENT>[.]+ { }
<ML_COMMENT>[\n]+ { printf("\n"); }
{comment}+ {printf("%s",yytext);}
{alpha}{alphanum}+ { yylval.str= strdup(yytext); return ID;}
{num}+ { yylval.str= strdup(yytext); return NUM;}
{string} { yylval.str= strdup(yytext); return STRING;}
';' {return SEMICOLON;}
"=" {return ASSIGNMENT;}
" "+ { }
<<EOF>> {exit(0); /* this is suboptimal */}
%%
parser.y:
%{
#include<stdio.h>
#include<string.h>
%}
%error-verbose
%union{
char *str;
}
%token <str> ID
%token <str> NUM
%token <str> STRING
%left SEMICOLON
%left ASSIGNMENT
%start input
%%
input: /* empty */
| expression separator input
;
expression: assign
| error {}
;
separator: SEMICOLON
| empty
;
empty:
;
assign: ID ASSIGNMENT ID { printf("ID to ID"); }
| ID ASSIGNMENT STRING { printf("ID to STRING"); }
| ID ASSIGNMENT NUM { printf("ID to NUM"); }
;
%%
yyerror(char* str)
{
printf("Error detected in parsing: %s\n", str);
}
main()
{
yyparse();
}
Compiled like this:
$>flex -t parser.l > parser.lex.yy.c
$>bison -v -d parser.y
$>cc parser.tab.c parser.lex.yy.c -lfl -o parser
Never mind... the problematic line was this one:
';' {return SEMICOLON;}
which required to be changed to
";" {return SEMICOLON;}
Now the behaviour is correct. :-)

Why is this fsyacc input producing F# that does not compile?

My fsyacc code is giving a compiler error saying a variable is not found, but I'm not sure why. I was hoping someone could point out the issue.
%{
open Ast
%}
// The start token becomes a parser function in the compiled code:
%start start
// These are the terminal tokens of the grammar along with the types of
// the data carried by each token:
%token NAME
%token ARROW TICK VOID
%token LPAREN RPAREN
%token EOF
// This is the type of the data produced by a successful reduction of the 'start'
// symbol:
%type < Query > start
%%
// These are the rules of the grammar along with the F# code of the
// actions executed as rules are reduced. In this case the actions
// produce data using F# data construction terms.
start: Query { Terms($1) }
Query:
| Term EOF { $1 }
Term:
| VOID { Void }
| NAME { Conc($1) }
| TICK NAME { Abst($2) }
| LPAREN Term RPAREN { Lmda($2) }
| Term ARROW Term { TermList($1, $3) }
The line | NAME {Conc($1)} and the following line both give this error:
error FS0039: The value or constructor '_1' is not defined
I understand the syntactic issue, but what's wrong with the yacc input?
If it helps, here is the Ast definition:
namespace Ast
open System
type Query =
| Terms of Term
and Term =
| Void
| Conc of String
| Abst of String
| Lmda of Term
| TermList of Term * Term
And the fslex input:
{
module Lexer
open System
open Parser
open Microsoft.FSharp.Text.Lexing
let lexeme lexbuf =
LexBuffer<char>.LexemeString lexbuf
}
// These are some regular expression definitions
let name = ['a'-'z' 'A'-'Z' '0'-'9']
let whitespace = [' ' '\t' ]
let newline = ('\n' | '\r' '\n')
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { tokenize lexbuf }
// Operators
| "->" { ARROW }
| "'" { TICK }
| "void" { VOID }
// Misc
| "(" { LPAREN }
| ")" { RPAREN }
// Numberic constants
| name+ { NAME }
// EOF
| eof { EOF }
This is not FsYacc's fault. NAME is a valueless token.
You'd want to do these fixes:
%token NAME
to
%token <string> NAME
and
| name+ { NAME }
to
| name+ { NAME (lexeme lexbuf) }
Everything should now compile.

Resources