How can grammar from RFC 2396 (about URI) be expressed in PEG?

I am trying to come up with a PEG grammar that would parse a hostname according to the following BNF from RFC 2396:
hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum
There is no problem with domainlabel and toplabel.
The rule for hostname, however, apparently cannot be expressed in PEG.
Here is why I think so:
If we take the grammar as written in BNF, the whole input is consumed by *(domainlabel "."), which doesn't know when to stop because toplabel [ "." ] is indistinguishable from it.
A simplified, self-contained illustration:
h = (d '.')* t '.'?
d = [dt]
t = [t]
This parses t and d.d.t and fails on d.d.d, which is expected, but it also fails to parse t. and d.d.t., which are both valid.
If we add a lookahead, it accepts t. and d.d.t. but fails on d.t.t.:
h = (!(t '.'?)d '.')* t '.'?
d = [dt]
t = [t]
So I am out of ideas, is there a way to express this BNF in PEG?

If you just need to check validity, you can do it like this:
/* Unchanged */
toplabel = alpha | alpha *( alphanum | "-" ) alphanum
/* Diff with above */
nontoplabel = digit | digit *( alphanum | "-" ) alphanum
/* Rephrase: (b* a)+ with "." as the separator between successive groups */
hostname = *( nontoplabel "." ) toplabel *( "." *( nontoplabel "." ) toplabel ) [ "." ]
Since nontoplabel and toplabel are distinguishable by their first character, there is no possible ambiguity in the last expression.
The transformation is one of the many possible regular expression identities:
(a | b)* a ==> (b* a)+
You can always replace b in a|b with b-a (using - as the set difference operator).
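As a rough check of this rewrite, here is a minimal Python sketch of a PEG-style (greedy, no backtracking into a finished loop) matcher for the toy grammar from the question. It assumes d = [dt] and t = [t], so the set difference d - t is just the letter d. The function names are made up for illustration.

```python
def match_group(s, i):
    """PEG-style match of (nt '.')* t starting at index i, where nt = 'd'.

    Returns the index after the group, or None if the group does not match.
    """
    while s.startswith("d.", i):      # (nt '.')* : greedy, never revisited
        i += 2
    return i + 1 if s.startswith("t", i) else None


def match_hostname(s):
    """PEG-style match of group ('.' group)* '.'? against the whole input."""
    i = match_group(s, 0)
    if i is None:
        return False
    while s.startswith(".", i):
        j = match_group(s, i + 1)
        if j is None:                 # a failed iteration restores the position
            break
        i = j
    if s.startswith(".", i):          # optional trailing dot
        i += 1
    return i == len(s)


for candidate in ["t", "d.d.t", "t.", "d.d.t.", "d.t.t.", "d.d.d"]:
    print(candidate, match_hostname(candidate))
```

Because a group can only end on t and nt can only be d, each step is decided by the next character, so the matcher never needs to undo a completed group.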

Related

Left and right recursive parser

This is an evolution of this question.
I need to parse with megaparsec a data structure like
data Foo
  = Simple String
  | Dotted Foo String
  | Paren String Foo
and I would like to parse to it strings like
foo ::= alphanum
| foo "." alphanum
| alphanum "(" foo ")"
For example, the string "a(b.c).d" should be parsed to Dotted (Paren "a" (Dotted (Simple "b") "c")) "d".
The problem I have is that this is at the same time left and right recursive.
I have no problems writing the parsers for the first and the third case:
parser :: Parser Foo
parser
  = try (do
      prefix <- alphanum
      constant "("
      content <- parser
      constant ")"
      pure $ Paren prefix content
    )
  <|> Simple <$> alphanum
but I'm not able to also put together the parser for the second case. I tried to approach it with sepBy1 and with makeExprParser, but I couldn't get it right.
To factor out the left recursion in this:
foo ::= alphanum
| foo "." alphanum
| alphanum "(" foo ")"
You can start by rewriting it to this:
foo ::= alphanum ("(" foo ")")?
| foo "." alphanum
Then you can factor out the left recursion using the standard trick of replacing:
x ::= x y | z
With:
x ::= z x'
x' ::= y x' | ∅
In other words:
x ::= z y*
With x = foo, y = "." alphanum, and z = alphanum ("(" foo ")")?, that becomes:
foo ::= alphanum ("(" foo ")")? ("." alphanum)*
Then I believe your parser can just be something like this, since ? ~ zero or one ~ Maybe ~ optional and * ~ zero or more ~ [] ~ many:
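-- foldl' comes from Data.List
parser :: Parser Foo
parser = do
  -- Keep the raw name: Paren takes a String, not a Foo
  prefix <- alphanum
  maybeParens <- optional (constant "(" *> parser <* constant ")")
  suffixes <- many (constant "." *> alphanum)
  let
    prefix' = case maybeParens of
      Just content -> Paren prefix content
      Nothing -> Simple prefix
  -- Left fold: a.b.c becomes Dotted (Dotted a "b") "c"
  pure $ foldl' Dotted prefix' suffixes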
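For readers who want to try the factored grammar outside megaparsec, here is a minimal hand-written sketch in Python of foo ::= alphanum ("(" foo ")")? ("." alphanum)*. The class and function names mirror the Haskell constructors but are otherwise made up, and alphanum is assumed to mean one or more alphanumeric characters.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Simple:
    name: str


@dataclass(frozen=True)
class Dotted:
    target: "Foo"
    name: str


@dataclass(frozen=True)
class Paren:
    name: str
    content: "Foo"


Foo = Union[Simple, Dotted, Paren]


def parse_alphanum(s, i):
    """Read one or more alphanumeric characters; return (text, next index)."""
    j = i
    while j < len(s) and s[j].isalnum():
        j += 1
    if j == i:
        raise ValueError(f"expected an alphanumeric character at offset {i}")
    return s[i:j], j


def parse_foo(s, i=0):
    """foo ::= alphanum ("(" foo ")")? ("." alphanum)*; return (tree, next index)."""
    name, i = parse_alphanum(s, i)
    node = Simple(name)
    if s.startswith("(", i):               # the optional part: zero or one
        content, i = parse_foo(s, i + 1)
        if not s.startswith(")", i):
            raise ValueError(f"expected ')' at offset {i}")
        i += 1
        node = Paren(name, content)
    while s.startswith(".", i):            # the repeated part: zero or more
        suffix, i = parse_alphanum(s, i + 1)
        node = Dotted(node, suffix)        # left fold, like foldl' Dotted
    return node, i


def parse(s):
    node, i = parse_foo(s)
    if i != len(s):
        raise ValueError(f"unexpected input at offset {i}")
    return node


print(parse("a(b.c).d"))
```

The while loop plays the role of many plus foldl': each suffix wraps the tree built so far, which is what makes the result left-nested even though the grammar is no longer left-recursive.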

Exponent operator does not work when no space added? Whats wrong with my grammar

I am trying to write an expression evaluator in which underscore _ is a reserved word that denotes a certain constant value.
Here is my grammar, it successfully parses 5 ^ _ but it fails to parse _^ 5 (without space). It only acts up that way for ^ operator.
COMPILER Formula
CHARACTERS
digit = '0'..'9'.
letter = 'A'..'z'.
TOKENS
number = digit {digit}.
identifier = letter {letter|digit}.
self = '_'.
IGNORE '\r' + '\n'
PRODUCTIONS
Formula = Term{ ( '+' | '-') Term}.
Term = Factor {( '*' | "/" |'%' | '^' ) Factor}.
Factor = number | Self.
Self = self.
END Formula.
What am I missing? I am using Coco/R compiler generator.
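Your current definition of the character set letter causes this issue: the range 'A'..'z' also covers the characters that sit between 'Z' and 'a' in ASCII, including _ and ^. Restricting the set to 'A'..'Z' + 'a'..'z' removes the overlap.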
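A quick way to see which characters such a range picks up, sketched in Python rather than Coco/R (the variable names are only for illustration):

```python
# Characters strictly between 'Z' and 'a' in ASCII
between = [chr(code) for code in range(ord("Z") + 1, ord("a"))]
print(between)

# Everything covered by a range written as 'A'..'z'
letter_range = {chr(code) for code in range(ord("A"), ord("z") + 1)}
print("^" in letter_range, "_" in letter_range)
```

Both the operator ^ and the reserved _ fall inside the range, so they can also be scanned as part of an identifier.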
You can rewrite the Formula and Term rules like this:
Formula = Formula ( '+' | '-') Term | Term
Term = Term ( '*' | "/" |'%' | '^' ) Factor | Factor
e.g. https://metacpan.org/pod/distribution/Marpa-R2/pod/Marpa_R2.pod#Synopsis

ANTLR4 grammar to specify parent child relationship

I've created a grammar to express a search through a Map using key and value pairs using an ANTLR4 grammar file:
START: 'SEARCH FOR';
VALUE_EXPRESSION: 'VALUE:'[a-zA-Z0-9]+;
MATCH: 'MATCHING';
COMMA: ',';
KEY_EXPRESSION: 'KEY:'[a-zA-Z0-9]*;
KEY_VALUE_PAIR: KEY_EXPRESSION MATCH VALUE_EXPRESSION;
r : START KEY_VALUE_PAIR (COMMA KEY_VALUE_PAIR)*;
WS: [ \n\t\r]+ -> skip;
The "Interpret Lexer" view in ANTLRWorks produces: [screenshot not included]
And the "Parse Tree" looks like this: [screenshot not included]
I'm not sure if this is the correct (or even typical) way to go about parsing an input string but what I'd like to do is have each of the key/value pairs split up and placed under a parent node like such:
[SEARCH FOR]    [PAIR]    ,    [PAIR]
                  |              |
                 / \            / \
                /   \          /   \
               /     \        /     \
          colour     red    size   small
I believe this will make life easier when I come to walk the tree.
I've searched around and tried to use the caret '^' character to specify the parent but ANTLRWorks always indicates that there is an error in my grammar.
Can anybody help with this, or possibly supply another solution (if this is an atypical approach)?
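You can probably simplify this even further. The ^ operator is ANTLR 3 tree-construction syntax, which ANTLR 4 no longer supports; in ANTLR 4 every parser rule becomes a node in the parse tree, so making the pair a parser rule (rather than a lexer rule like KEY_VALUE_PAIR) gives you the parent node you want. Below, I simply use a string as the key, but you could define a lexer rule for 'colour', 'size', etc. to keep track of your keys. I also did away with MATCHING; instead, I created a set of pairs.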
grammar GRAMMAR;
start: START set ;
set
: pair (',' pair)*
;
pair: STRING ':' value ;
value
: STRING
| NUMBER
;
START: 'SEARCH FOR: ' ;
STRING : '"' [a-zA-Z_0-9]* '"' ;
NUMBER
: '-'? INT '.' [0-9]+ EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -3, 45
;
fragment INT : '0' | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
WS : [ \t\n\r]+ -> skip ;

UnsupportedOperationException when I try translate a expression

I defined a syntax for a language of expressions (the real one is more complex; I simplified it for this post) and some functions that translate these expressions to Java. I'm not using the Java syntax, I just translate to strings. The syntax defines two constants, "MAXINT" and "MININT", and a set of translateExp functions turns each kind of expression into a string. Most expressions translate fine, but when "MAXINT" or "MININT" appears in the expression, for example "MAXINT - 1", the translation throws UnsupportedOperationException and I don't know why. The same exception is thrown for expressions with more than one plus (+) or minus (-), like "1+1+1". Does anybody know why?
My module with the syntax and the functions:
module ExpSyntax
import String;
import ParseTree;
layout Whitespaces = [\t\n\ \r\f]*;
lexical Ident = [a-z A-Z 0-9 _] !<< [a-z A-Z][a-z A-Z 0-9 _]* !>> [a-z A-Z 0-9 _] \ Keywords;
lexical Integer_literal = [0-9]+;
keyword Keywords = "MAXINT" | "MININT";
start syntax Expression
= Expressions_primary
| Expressions_arithmetical;
syntax Expressions_primary
= Data: Ident+ id
| bracket Expr_bracketed: "(" Expression e ")"
;
syntax Expressions_arithmetical
= Integer_lit
| left Addition: Expression e1 "+" Expression e2
| left Difference: Expression e1 "-" Expression e2
;
syntax Integer_lit
= Integer_literal il
| MAX_INT: "MAXINT"
| MIN_INT: "MININT"
;
public str translate(str txt) = translateExp(parse(#Expression, txt));
public str translateExp((Expression) `<Integer_literal i>`) = "<i>";
public str translateExp((Expression) `MAXINT`) = "java.lang.Integer.MAX_VALUE";
public str translateExp((Expression) `MININT`) = "java.lang.Integer.MIN_VALUE";
public str translateExp((Expression) `<Expression e1>+<Expression e2>`) = "<translateExp(e1)> + <translateExp(e2)>";
public str translateExp((Expression) `<Expression e1>-<Expression e2>`) = "<translateExp(e1)> - <translateExp(e2)>";
public str translateExp((Expression) `<Ident id>`) = "<id>";
public str translateExp((Expression) `(<Expression e>)`) = "(<translateExp(e)>)";
This is what happens:
rascal>import ExpSyntax;
ok
rascal>translate("(test + 1) - test2");
str: "(test + 1) - test2"
rascal>translate("MAXINT - 1");
java.lang.UnsupportedOperationException(internal error) at $shell$(|main://$shell$|)
java.lang.UnsupportedOperationException
at org.rascalmpl.ast.Expression.getKeywordArguments(Expression.java:214)
at org.rascalmpl.interpreter.matching.NodePattern.<init>(NodePattern.java:84)
at org.rascalmpl.semantics.dynamic.Tree$Amb.buildMatcher(Tree.java:351)
at org.rascalmpl.ast.AbstractAST.getMatcher(AbstractAST.java:173)
at org.rascalmpl.interpreter.result.RascalFunction.prepareFormals(RascalFunction.java:503)
at org.rascalmpl.interpreter.result.RascalFunction.call(RascalFunction.java:365)
at org.rascalmpl.interpreter.result.OverloadedFunction.callWith(OverloadedFunction.java:327)
at org.rascalmpl.interpreter.result.OverloadedFunction.call(OverloadedFunction.java:305)
at org.rascalmpl.semantics.dynamic.Expression$CallOrTree.interpret(Expression.java:486)
at org.rascalmpl.semantics.dynamic.Statement$Expression.interpret(Statement.java:355)
at org.rascalmpl.semantics.dynamic.Statement$Return.interpret(Statement.java:773)
at org.rascalmpl.interpreter.result.RascalFunction.runBody(RascalFunction.java:467)
at org.rascalmpl.interpreter.result.RascalFunction.call(RascalFunction.java:413)
at org.rascalmpl.interpreter.result.OverloadedFunction.callWith(OverloadedFunction.java:327)
at org.rascalmpl.interpreter.result.OverloadedFunction.call(OverloadedFunction.java:305)
at org.rascalmpl.semantics.dynamic.Expression$CallOrTree.interpret(Expression.java:486)
at org.rascalmpl.semantics.dynamic.Statement$Expression.interpret(Statement.java:355)
at org.rascalmpl.interpreter.Evaluator.eval(Evaluator.java:936)
at org.rascalmpl.semantics.dynamic.Command$Statement.interpret(Command.java:115)
at org.rascalmpl.interpreter.Evaluator.eval(Evaluator.java:1147)
at org.rascalmpl.interpreter.Evaluator.eval(Evaluator.java:1107)
at org.rascalmpl.eclipse.console.RascalScriptInterpreter.execCommand(RascalScriptInterpreter.java:446)
at org.rascalmpl.eclipse.console.RascalScriptInterpreter.run(RascalScriptInterpreter.java:239)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:53)
rascal>translate("1+1+1");
java.lang.UnsupportedOperationException(internal error) at $shell$(|main://$shell$|)
java.lang.UnsupportedOperationException
(stack trace identical to the one shown for "MAXINT - 1" above)
The error message is a bug, since it should report something clearer, but it appears there is still an ambiguity in your syntax definition. The run-time is trying to build a pattern matcher for Tree$Amb (org.rascalmpl.semantics.dynamic.Tree$Amb.buildMatcher), which we did not implement and will not implement.
From looking at the definition (I did not try it), it seems that the cause is this rule:
lexical Ident = [a-z A-Z 0-9 _] !<< [a-z A-Z][a-z A-Z 0-9 _]* !>> [a-z A-Z 0-9 _] \ Keywords;
Because !>> and \ bind more tightly than juxtaposition, the \ keyword reservation is applied only to the tail of an Ident and not to the whole. Please add brackets (a sequence operator) to disambiguate:
lexical Ident = [a-z A-Z 0-9 _] !<< ([a-z A-Z][a-z A-Z 0-9 _]*) !>> [a-z A-Z 0-9 _] \ Keywords;
This should get you a step further.
Then your expression grammar is still ambiguous and can be simplified to this:
start syntax Expression
= Data: Ident+ id
| Integer: Integer_lit
| bracket bracketed: "(" Expression e ")"
> left
( Addition: Expression e1 "+" Expression e2
| Difference: Expression e1 "-" Expression e2
)
;
syntax Integer_lit
= Integer_literal il
| MAX_INT: "MAXINT"
| MIN_INT: "MININT"
;
Short explanation: left only works on directly recursive non-terminals. Since you defined Expressions_arithmetical in terms of Expression, there was no direct recursion. A next version will support indirect recursion, but for this grammar that would be unnecessary.
Also, I made + and - mutually left associative by putting them in a group; otherwise 1+1-1 would have remained ambiguous.
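To get a feel for how quickly such an ambiguity grows, here is a small Python sketch (not Rascal) that counts the parse trees a rule of the shape E ::= E op E | literal admits when no associativity is declared. The function name count_trees is made up for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_trees(operands: int) -> int:
    """Number of distinct trees for a chain of `operands` literals joined by
    binary operators, when the grammar declares no associativity."""
    if operands == 1:
        return 1
    # Pick the top-level operator: k operands go left, the rest go right.
    return sum(
        count_trees(k) * count_trees(operands - k)
        for k in range(1, operands)
    )


for n in range(1, 6):
    print(n, count_trees(n))
```

An input like 1+1+1 has three operands and therefore two candidate trees; declaring the operators left associative leaves exactly one of them.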

Antlr parser for and/or logic - how to get expressions between logic operators?

I am using ANTLR to create an and/or parser+evaluator. Expressions will have the format like:
x eq 1 && y eq 10
(x lt 10 && x gt 1) OR x eq -1
I was reading this post on logic expressions in ANTLR, "Looking for advice on project. Parsing logical expression", and I found the grammar posted there a good start:
grammar Logic;
parse
: expression EOF
;
expression
: implication
;
implication
: or ('->' or)*
;
or
: and ('||' and)*
;
and
: not ('&&' not)*
;
not
: '~' atom
| atom
;
atom
: ID
| '(' expression ')'
;
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
However, while getting a tree from the parser works for expressions where the variables are just one character (e.g., "(A || B) AND C"), I am having a hard time adapting this to my case. In the example "x eq 1 && y eq 10" I'd expect one "AND" parent with two children, "x eq 1" and "y eq 10"; see the test case below.
@Test
public void simpleAndEvaluation() throws RecognitionException {
    String src = "1 eq 1 && B";
    LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
    LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree) parser.parse().getTree();
    // Expect `&&` as the root, with the two operands as its children.
    assertEquals("&&", tree.getText());
    assertEquals("1 eq 1", tree.getChild(0).getText());
    assertEquals("B", tree.getChild(1).getText());
}
I believe this is related with the "ID". What would the correct syntax be?
For those interested, I made some improvements to my grammar file (see below).
Current limitations:
only works with &&/||, not AND/OR (not very problematic)
you can't have spaces between the parentheses and the &&/|| (I work around that by replacing " (" with "(" and ") " with ")" in the source String before feeding the lexer)
grammar Logic;
options {
output = AST;
}
tokens {
AND = '&&';
OR = '||';
NOT = '~';
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| SYMBOL
)+
;
SYMBOL
:
('+'|'-'|'*'|'/'|'_')
;
ID : ('a'..'z' | 'A'..'Z')+;
states that an identifier is a sequence of one or more letters, but does not allow any digits. Try
ID : ('a'..'z' | 'A'..'Z' | '0'..'9')+;
which will allow e.g. abc, 123, 12ab, and ab12. If you don't want the latter types, you'll have to restructure the rule a little bit (left as a challenge...)
In order to accept arbitrarily many identifiers, you could define atom as ID+ instead of ID.
Also, you will likely need to specify AND, OR, -> and ~ as tokens so that, as @Bart Kiers says, the first two won't get classified as ID, and so that the latter two get recognized at all.
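The tree shape the question asks for can be sketched independently of ANTLR. The following Python sketch is not the generated parser: it assumes only &&, ||, ~ and parentheses are operators and treats everything between them as one opaque atom. The names tokenize and parse are made up for illustration.

```python
import re

OPERATORS = {"&&", "||", "~", "(", ")"}
# An operator, or a run of non-operator characters starting with a non-space.
TOKEN = re.compile(r"\s*(&&|\|\||~|\(|\)|[^\s&|~()][^&|~()]*)")


def tokenize(src):
    src = src.strip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1).strip())
        pos = m.end()
    return tokens


def parse(src):
    tokens = tokenize(src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def binary(operator, operand):
        # operand (operator operand)*, with the operator as the subtree root
        node = operand()
        while peek() == operator:
            advance()
            node = (operator, node, operand())
        return node

    def parse_or():
        return binary("||", parse_and)

    def parse_and():
        return binary("&&", parse_not)

    def parse_not():
        if peek() == "~":
            advance()
            return ("~", parse_atom())
        return parse_atom()

    def parse_atom():
        token = peek()
        if token == "(":
            advance()
            node = parse_or()
            if peek() != ")":
                raise ValueError("expected ')'")
            advance()
            return node
        if token is None or token in OPERATORS:
            raise ValueError(f"unexpected token {token!r}")
        return advance()

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("x eq 1 && y eq 10"))
```

Keeping a comparison such as x eq 1 as a single atom is the same shortcut the improved grammar takes by allowing spaces inside ID; a fuller grammar would give comparisons their own rule.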

Resources