I have the following 2 production rules in EBNF:
<CharLiteral> ::= ' " ' [ <Printable> ] ' " '
and
<StringLiteral> ::= ' " ' { <Printable> } ' " '
What is the difference between the two? Does [] imply 1 or more repetitions and {} imply 0 or more repetitions?
In EBNF, [X] means 0 or 1 X and {X} means 0 or more X.
In JavaCC, [X] means 0 or 1 X for grammar productions; in regular expression productions, you should use (X)? instead. To express 0 or more X in JavaCC use (X)*.
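To make the difference concrete, here is a rough sketch that mirrors the two productions with Python regular expressions, where [X] becomes X? and {X} becomes X*. The definition of <Printable> is not given in the question, so the character class below (printable ASCII except the double quote) is only an assumption, and the helper names are made up for illustration.

```python
import re

# Assumption: <Printable> is any printable ASCII character except '"'.
PRINTABLE = r'[ !#-~]'

# [X] in EBNF corresponds to X? (zero or one occurrence).
CHAR_LITERAL = re.compile(rf'"{PRINTABLE}?"')
# {X} in EBNF corresponds to X* (zero or more occurrences).
STRING_LITERAL = re.compile(rf'"{PRINTABLE}*"')


def is_char_literal(text: str) -> bool:
    """True if text matches the CharLiteral production as a whole."""
    return CHAR_LITERAL.fullmatch(text) is not None


def is_string_literal(text: str) -> bool:
    """True if text matches the StringLiteral production as a whole."""
    return STRING_LITERAL.fullmatch(text) is not None
```

With these definitions, "" and "a" are accepted by both productions, while "ab" is accepted only by the string literal.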
I have the following PEG grammar defined:
Program = _{ SOI ~ Expr ~ EOF }
Expr = { UnaryExpr | BinaryExpr }
Term = _{Int | "(" ~ Expr ~ ")" }
UnaryExpr = { Operator ~ Term }
BinaryExpr = { Term ~ (Operator ~ Term)* }
Operator = { "+" | "-" | "*" | "^" }
Int = @{ Operator? ~ ASCII_DIGIT+ }
WHITESPACE = _{ " " | "\t" }
EOF = _{ EOI | ";" }
And the following expressions are all parsed correctly:
1 + 2
1 - 2
1 + -2
1 - -2
1
+1
-1
But any expression that begins with a negative number fails. For example,
-1 + 2
errors with
 --> 1:4
  |
1 | -1 + 2
  |    ^---
  |
  = expected EOI
What I expect (what I would like) is for -1 + 2 to be treated the same as 1 + -2, that is a Binary expression that is made up of two Unary Expressions.
I have toyed around with a lot of variations with no success. And, I'm open to using an entirely different paradigm if I need to, but I'd really like to keep the UnaryExpression idea since I've already built my parser around it.
I'm new to PEG, so I'd appreciate any help.
For what it's worth, I'm using Rust v1.59 and https://pest.rs/ to both parse and test my expressions.
You have a small error in the Expr logic. The first part before the | takes precedence if both match.
And -1 is a valid UnaryExpr so the program as a whole is expected to match SOI ~ UnaryExpr ~ EOF in this case. But there is additional data (+ 2) which leads to this error.
If you reverse the alternatives of Expr so that Expr = { BinaryExpr | UnaryExpr }, the example works. BinaryExpr is then tried first, and UnaryExpr is tried only if BinaryExpr fails.
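The effect of ordered choice can be sketched outside of pest with a few hand-rolled combinators. This is not how pest is implemented; the names lit, seq, choice and matches_whole are invented for this illustration, and the two toy rules only stand in for UnaryExpr and BinaryExpr on whitespace-free input.

```python
from typing import Callable, Optional

# A parser takes (text, position) and returns the new position, or None on failure.
Parser = Callable[[str, int], Optional[int]]


def lit(s: str) -> Parser:
    """Match the literal string s at the current position."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def seq(*parsers: Parser) -> Parser:
    """Match all parsers one after another."""
    def parse(text: str, pos: int) -> Optional[int]:
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def choice(*parsers: Parser) -> Parser:
    """PEG ordered choice: commit to the first alternative that succeeds."""
    def parse(text: str, pos: int) -> Optional[int]:
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse


def matches_whole(p: Parser, text: str) -> bool:
    """Like SOI ~ p ~ EOI: the parser must consume the entire input."""
    return p(text, 0) == len(text)


# Toy stand-ins: everything 'unary' matches is a prefix of what 'binary' matches.
unary = seq(lit("-"), lit("1"))
binary = seq(lit("-"), lit("1"), lit("+"), lit("2"))

unary_first = choice(unary, binary)   # like Expr = { UnaryExpr | BinaryExpr }
binary_first = choice(binary, unary)  # like Expr = { BinaryExpr | UnaryExpr }
```

Here matches_whole(unary_first, "-1+2") is False because the choice commits to the shorter unary match and leaves "+2" unconsumed, whereas matches_whole(binary_first, "-1+2") is True. The reordered version still accepts "-1" because the binary alternative fails and the unary one is tried next.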
I want to create a parser combinator that collects all lines below the current position whose indentation level is greater than or equal to some i. I think the idea is simple:
Consume a line. If its indentation is:
ok -> do the same for the next lines
wrong -> fail
Consider the following code:
import qualified Text.ParserCombinators.UU as UU
import Text.ParserCombinators.UU hiding(parse)
import Text.ParserCombinators.UU.BasicInstances hiding (Parser)
-- end of line
pEOL = pSym '\n'
pSpace = pSym ' '
pTab = pSym '\t'
indentOf s = case s of
    ' '  -> 1
    '\t' -> 4
-- return the indentation level (number of spaces at the beginning of the line)
pIndent = (+) <$> (indentOf <$> (pSpace <|> pTab)) <*> pIndent `opt` 0
-- returns tuple of (indentation level, result of parsing the second argument)
pIndentLine p = (,) <$> pIndent <*> p <* pEOL
-- SHOULD collect all lines below with indentation greater than or equal to i
myParse p i = do
    (lind, expr) <- pIndentLine p
    if lind < i
        then pFail
        else do
            rest <- myParse p i `opt` []
            return $ expr:rest
-- sample inputs
s1 = " a\
\\n a\
\\n"
s2 = " a\
\\na\
\\n"
-- execution
pProgram = myParse (pSym 'a') 1
parse p s = UU.parse ( (,) <$> p <*> pEnd) (createStr (LineColPos 0 0 0) s)
main :: IO ()
main = do
    print $ parse pProgram s1
    print $ parse pProgram s2
    return ()
This gives the following output:
("aa",[])
Test.hs: no correcting alternative found
The result for s1 is correct. The result for s2 should consume the first "a" and then stop consuming. Where does this error come from?
The parsers you are constructing will always try to proceed; if necessary, input will be discarded or added. However, pFail is a dead end. It acts as a unit element for <|>.
In your parser, however, there is no other alternative present in case the input does not comply with the language recognised by the parser. In your specification you say you want the parser to fail on input s2. Now it fails with a message saying that it fails, and you are surprised.
Maybe you do not want it to fail, but want it to stop accepting further input? In that case, replace pFail by return [].
Note that the text:
do
    rest <- myParse p i `opt` []
    return $ expr:rest
can be replaced by (expr:) <$> (myParse p i `opt` [])
A natural way to solve your problem is probably something like
pIndented p = do i <- pGetIndent
                 (:) <$> p <* pEOL <*> pMany (pToken (take i (repeat ' ')) *> p <* pEOL)
pIndent = length <$> pMany (pSym ' ')
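The "stop accepting input instead of failing" behaviour can also be sketched independently of uu-parsinglib. The following is only an illustration on a pre-split list of lines, not a translation of the combinators above; indent_of and collect_indented are hypothetical helper names, and the tab width of 4 is taken from indentOf in the question.

```python
from typing import List, Tuple


def indent_of(line: str) -> int:
    """Indentation width: a space counts as 1 and a tab as 4, as in indentOf."""
    width = 0
    for ch in line:
        if ch == ' ':
            width += 1
        elif ch == '\t':
            width += 4
        else:
            break
    return width


def collect_indented(lines: List[str], min_indent: int) -> Tuple[List[str], List[str]]:
    """Collect leading lines indented by at least min_indent.

    Returns (collected, rest). An under-indented line is not an error:
    collection simply stops there and the remaining lines are handed back.
    """
    collected: List[str] = []
    for idx, line in enumerate(lines):
        if indent_of(line) < min_indent:
            return collected, lines[idx:]
        collected.append(line.strip())
    return collected, []
```

For the two sample inputs, collect_indented([" a", " a"], 1) collects both lines, while collect_indented([" a", "a"], 1) collects the first line and returns the second one as unconsumed input.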
I'm trying to make a rule that will rewrite into a nested tree (similar to a binary tree).
For example:
a + b + c + d;
Would parse to a tree like ( ( (a + b) + c) + d). Basically each root node would have three children (LHS '+' RHS) where LHS could be more nested nodes.
I attempted some things like:
rule: lhs '+' ID;
lhs: ID | rule;
and
rule
: rule '+' ID
| ID '+' ID;
(with some tree rewrites) but they all gave me an error about it being left-recursive. I'm not sure how to solve this without some type of recursion.
EDIT: My latest attempt recurses on the right side which gives the reverse of what I want:
rule:
ID (op='+' rule)?
-> {op == null}? ID
-> ^(BinaryExpression<node=MyBinaryExpression> ID $op rule)
Gives (a + (b + (c + d) ) )
The following grammar:
grammar T;
options {
output=AST;
}
tokens {
BinaryExpression;
}
parse
: expr ';' EOF -> expr
;
expr
: (atom -> atom) (ADD a=atom -> ^(BinaryExpression $expr ADD $a))*
;
atom
: ID
| NUM
| '(' expr ')'
;
ADD : '+';
NUM : '0'..'9'+;
ID : 'a'..'z'+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
parses your input "a + b + c + d;" into a left-nested tree:
(BinaryExpression (BinaryExpression (BinaryExpression a + b) + c) + d)
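The trick in the expr rule is to replace left recursion with a loop that keeps wrapping the tree built so far. A minimal hand-written sketch of the same idea, assuming a simplified token set and using plain tuples in place of BinaryExpression nodes (tokenize, parse_expr and parse_atom are names made up for this example):

```python
import re
from typing import List, Tuple, Union

Tree = Union[str, tuple]


def tokenize(src: str) -> List[str]:
    """Split the input into identifiers, numbers and single-character symbols."""
    return re.findall(r'[a-z]+|[0-9]+|[+();]', src)


def parse_atom(tokens: List[str], pos: int) -> Tuple[Tree, int]:
    """atom : ID | NUM | '(' expr ')'"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == '(':
        tree, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ')':
            raise SyntaxError("expected ')'")
        return tree, pos + 1
    if tok in ('+', ')', ';'):
        raise SyntaxError(f"unexpected token {tok!r}")
    return tok, pos + 1


def parse_expr(tokens: List[str], pos: int = 0) -> Tuple[Tree, int]:
    """expr : atom (ADD atom)*, folded to the left.

    Each iteration wraps the tree built so far as the left operand,
    which is what the $expr reference does in the rewrite rule.
    """
    tree, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] == '+':
        right, pos = parse_atom(tokens, pos + 1)
        tree = (tree, '+', right)
    return tree, pos
```

Calling parse_expr(tokenize("a + b + c + d;")) yields the tree ((('a', '+', 'b'), '+', 'c'), '+', 'd') and stops at the ';' token.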
Did you try
rule: ID '+' rule | ID;
?
I am using ANTLR to create an and/or parser+evaluator. Expressions will have the format like:
x eq 1 && y eq 10
(x lt 10 && x gt 1) OR x eq -1
I was reading this post on logic expressions in ANTLR ("Looking for advice on project. Parsing logical expression"), and I found the grammar posted there a good start:
grammar Logic;
parse
: expression EOF
;
expression
: implication
;
implication
: or ('->' or)*
;
or
: and ('||' and)*
;
and
: not ('&&' not)*
;
not
: '~' atom
| atom
;
atom
: ID
| '(' expression ')'
;
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
However, while getting a tree from the parser works for expressions where the variables are just one character (i.e., "(A || B) AND C"), I am having a hard time adapting this to my case (in the example "x eq 1 && y eq 10" I'd expect one "AND" parent and two children, "x eq 1" and "y eq 10"; see the test case below).
@Test
public void simpleAndEvaluation() throws RecognitionException {
    String src = "1 eq 1 && B";
    LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
    LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree) parser.parse().getTree();
    assertEquals("&&", tree.getText());
    assertEquals("1 eq 1", tree.getChild(0).getText());
    assertEquals("a neq a", tree.getChild(1).getText());
}
I believe this is related with the "ID". What would the correct syntax be?
For those interested, I made some improvements to my grammar file (see below).
Current limitations:
only works with &&/||, not AND/OR (not very problematic)
you can't have spaces between the parentheses and the &&/|| (I work around that by replacing " (" with "(" and ") " with ")" in the source String before feeding the lexer)
grammar Logic;
options {
output = AST;
}
tokens {
AND = '&&';
OR = '||';
NOT = '~';
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| SYMBOL
)+
;
SYMBOL
:
('+'|'-'|'*'|'/'|'_')
;
ID : ('a'..'z' | 'A'..'Z')+;
states that an identifier is a sequence of one or more letters, but does not allow any digits. Try
ID : ('a'..'z' | 'A'..'Z' | '0'..'9')+;
which will allow e.g. abc, 123, 12ab, and ab12. If you don't want the latter types, you'll have to restructure the rule a little bit (left as a challenge...)
In order to accept arbitrarily many identifiers, you could define atom as ID+ instead of ID.
Also, you will likely need to specify AND, OR, -> and ~ as tokens so that, as @Bart Kiers says, the first two won't get classified as ID, and so that the latter two will get recognized at all.
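To illustrate why operator tokens need to be declared ahead of the identifier rule, here is a small regex-based tokenizer sketch. It is not ANTLR and does not reproduce ANTLR's lexer semantics; the token names and the tokenize helper are assumptions for this example. The operator patterns are listed before ID, so the words AND and OR are not swallowed by the identifier pattern.

```python
import re
from typing import List, Tuple

# Order matters: operators come before ID so "AND"/"OR" are not lexed as identifiers.
TOKEN_SPEC = [
    ("AND", r"&&|AND\b"),
    ("OR", r"\|\||OR\b"),
    ("IMPLIES", r"->"),
    ("NOT", r"~"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("ID", r"[A-Za-z0-9]+"),  # letters and digits, as in the suggested ID rule
    ("SPACE", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(src: str) -> List[Tuple[str, str]]:
    """Return (token type, text) pairs, skipping whitespace."""
    tokens: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at {pos}")
        if m.lastgroup != "SPACE":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

With this ordering, tokenize("x1 AND y") produces an AND token between two ID tokens, while a longer word such as ANDROID is still lexed as a single ID because of the word-boundary check.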
Part of a Lua application of mine is a search bar, and I'm trying to make it understand boolean expressions. I'm using LPeg, but the current grammar gives a strange result:
> re, yajl = require're', require'yajl'
> querypattern = re.compile[=[
QUERY <- ( EXPR / TERM )? S? !. -> {}
EXPR <- S? TERM ( (S OPERATOR)? S TERM )+ -> {}
TERM <- KEYWORD / ( "(" S? EXPR S? ")" ) -> {}
KEYWORD <- ( WORD {":"} )? ( WORD / STRING )
WORD <- {[A-Za-z][A-Za-z0-9]*}
OPERATOR <- {("AND" / "XOR" / "NOR" / "OR")}
STRING <- ('"' {[^"]*} '"' / "'" {[^']*} "'") -> {}
S <- %s+
]=]
> = yajl.to_string(lpeg.match(querypattern, "bar foo"))
"bar"
> = yajl.to_string(lpeg.match(querypattern, "name:bar AND foo"))
"name"
> = yajl.to_string(lpeg.match(querypattern, "name:'bar' AND foo"))
"name"
> = yajl.to_string(lpeg.match(querypattern, "bar AND (name:foo OR place:here)"))
"bar"
It only parses the first token, and I cannot figure out why it does this. As far as I know, a partial match is impossible because of the !. at the end of the starting non-terminal. How can I fix this?
The match is getting the entire string, but the captures are wrong. Note that '->' has a higher precedence than concatenation, so you probably need parentheses around things like this:
EXPR <- S? ( TERM ( (S OPERATOR)? S TERM )+ ) -> {}