I'm using Instaparse to parse expressions like:
$(foo bar baz $(frob))
into something like:
[:expr "foo" "bar" "baz" [:expr "frob"]]
I've almost got it, but having trouble with ambiguity. Here's a simplified version of my grammar that repros, attempting to rely on negative lookahead.
(def simple
(insta/parser
"expr = <dollar> <lparen> word (<space> word)* <rparen>
<word> = !(dollar lparen) #'.+' !(rparen)
<space> = #'\\s+'
<dollar> = <'$'>
<lparen> = <'('>
<rparen> = <')'>"))
(simple "$(foo bar)")
which errors:
Parse error at line 1, column 11:
$(foo bar)
^
Expected one of:
")"
#"\s+"
Here I've said a word can be any char, in order to support expressions like:
$(foo () `bar` b-a-z)
etc. Note a word can contain () but it cannot contain $(). Not sure how to express this in the grammar. Seems the problem is <word> is too greedy, consuming the last ) instead of letting expr have it.
Update: removed whitespace from word:
(def simple2
(insta/parser
"expr = <dollar> <lparen> word (<space> word)* <rparen>
<word> = !(dollar lparen) #'[^ ]+' !(rparen)
<space> = #'\\s+'
<dollar> = <'$'>
<lparen> = <'('>
<rparen> = <')'>"))
(simple2 "$(foo bar)")
; Parse error at line 1, column 11:
; $(foo bar)
; ^
; Expected one of:
; ")"
; #"\s+"
(simple2 "$(foo () bar)")
; Parse error at line 1, column 14:
; $(foo () bar)
; ^
; Expected one of:
; ")"
; #"\s+"
Update 2: more test cases
(simple2 "$(foo bar ())")
(simple2 "$((foo bar baz))")
Update 3: full working parser
For anyone curious, here is the full working parser, which was outside the scope of this question:
(def parse
"expr - the top-level expression made up of cmds and sub-exprs. When multiple
cmds are present, it implies they should be successively piped.
cmd - a single command consisting of words.
sub-expr - a backticked or $(..)-style sub-expression to be evaluated inline.
parened - a grouping of words wrapped in parentheses, explicitly tokenized to
allow parentheses in cmds and disambiguate between sub-expression
syntax."
(insta/parser
"expr = cmd (<space> <pipe> <space> cmd)*
cmd = words
<sub-expr> = <backtick> expr <backtick> | nestable-sub-expr
<nestable-sub-expr> = <dollar> <lparen> expr <rparen>
words = word (<space>* word)*
<word> = sub-expr | parened | word-chars
<word-chars> = #'[^ `$()|]+'
parened = lparen words rparen
<space> = #'[ ]+'
<pipe> = #'[|]'
<dollar> = <'$'>
<lparen> = '('
<rparen> = ')'
<backtick> = <'`'>"))
Example usage:
(parse "foo bar (qux) $(clj (map (partial * $(js 45 * 2)) (range 10))) `frob`")
Parses to:
[:expr [:cmd [:words "foo" "bar" [:parened "(" [:words "qux"] ")"] [:expr [:cmd [:words "clj" [:parened "(" [:words "map" [:parened "(" [:words "partial" "*" [:expr [:cmd [:words "js" "45" "*" "2"]]]] ")"] [:parened "(" [:words "range" "10"] ")"]] ")"]]]] [:expr [:cmd [:words "frob"]]]]]]
This is a parser for a chatbot I wrote, yetibot. It replaces the previous mess of regex-based, by-hand parsing.
I don't really know Instaparse, so I just read enough documentation to give me a false sense of security. I also didn't test this, and I don't really know what your requirements are.
In particular, I don't know:
1) Whether $() can nest (your grammar makes that impossible, I think, but it seems odd to me)
2) Whether () can contain whitespace without being parsed as more than one word
3) Whether () can contain $()
You'll need to be clear on things like this in order to write the grammar (or, as it happens, to ask for advice).
Update: Revised the grammar based on comments. I removed the productions for $ ( and ) because they seemed unnecessary, and this way the angle-brackets feel easier to deal with.
The following is based on answering the above questions "yes, no, yes" and some random assumptions about regex format. (I'm not totally clear on how angle-brackets work, but I don't think it will be easy to make parentheses output the way you want; I settled for just outputting them as single elements. If I figure out something, I'll edit it.)
<sequence> = element (<space> element)*
<element> = expr | paren_sequence | word
expr = <'$'> <'('> sequence <')'>
<word> = !('$'? '(') #'([^ $()]|\$[^(])+'
<paren_sequence> = '(' sequence ')'
<space> = #'\\s+'
Hope that helps a bit.
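For readers who want to see the shape of that grammar outside Instaparse, here is a rough recursive-descent sketch in Python. It is only an illustration under my own assumptions: the function names are made up, the word pattern uses a lookahead instead of the `\$[^(]` alternative, and hidden rules are modelled by splicing their pieces into the parent list. Like the grammar, it does not accept an empty `()`.

```python
import re

# A word: any run of characters that are not whitespace, '$', '(' or ')',
# plus a '$' that is not immediately followed by '('.
WORD = re.compile(r'(?:[^\s$()]|\$(?!\())+')
SPACE = re.compile(r'\s+')


def expect(s, i, ch):
    """Check that s has ch at index i and return the index just after it."""
    if not s.startswith(ch, i):
        raise ValueError(f'expected {ch!r} at index {i}')
    return i + len(ch)


def parse_element(s, i):
    """element = expr | paren_sequence | word"""
    if s.startswith('$(', i):
        items, i = parse_sequence(s, i + 2)
        i = expect(s, i, ')')
        return ['expr'] + items, i
    if s.startswith('(', i):
        items, i = parse_sequence(s, i + 1)
        i = expect(s, i, ')')
        # A tuple marks a hidden rule: its pieces are spliced into the parent.
        return ('(', *items, ')'), i
    m = WORD.match(s, i)
    if not m:
        raise ValueError(f'expected a word at index {i}')
    return m.group(), m.end()


def parse_sequence(s, i):
    """sequence = element (space element)*"""
    items = []
    while True:
        node, i = parse_element(s, i)
        if isinstance(node, tuple):
            items.extend(node)
        else:
            items.append(node)
        m = SPACE.match(s, i)
        if not m:
            return items, i
        i = m.end()


def parse(s):
    """Parse a whole string as a sequence and insist that all of it is used."""
    items, i = parse_sequence(s, 0)
    if i != len(s):
        raise ValueError(f'unexpected input at index {i}')
    return items


print(parse('$(foo bar baz $(frob))'))
# [['expr', 'foo', 'bar', 'baz', ['expr', 'frob']]]
```

Checking for `$(` before `(` is what keeps sub-expressions and plain parentheses apart, which is the same job the negative lookahead does in the grammar.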
Well there are two changes you have to make in order to get both of your examples to work.
1) Add Negative Lookbehind
First, you will need a negative lookbehind in the regex for <word>. That way it will drop all the occurrences of ) as the last character:
<word> = !(dollar lparen) #'[^ ]+(?<!\\))'
So this will fix your first test case:
(simple2 "$(foo bar)")
=> [:expr "foo" "bar"]
2) Add grammar for the last word
Now if you run your second test case it will fail:
(simple2 "$(foo () bar)")
=> Parse error at line 1, column 8:
$(foo () bar)
^
Expected one of:
")" (followed by end-of-string)
#"\s+"
This fails because we have told our grammar to drop the last ) in all instances of <word>. We now have to tell our grammar how to differentiate between the last instance of <word> and other instances. We'll do this by adding a specific <lastword> grammar, and make all other instances of <word> optional. The full grammar would look like this:
(def simple2
(insta/parser
"expr = <dollar> <lparen> word* lastword <rparen>
<word> = !(dollar lparen) #'[^ ]+' <space>+
<lastword> = !(dollar lparen) #'[^ ]+(?<!\\))'
<space> = #'\\s+'
<dollar> = <'$'>
<lparen> = <'('>
<rparen> = <')'>"))
And your two test cases should work fine:
(simple2 "$(foo bar)")
=> [:expr "foo" "bar"]
(simple2 "$(foo () bar)")
=> [:expr "foo" "()" "bar"]
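If it helps to see the regex behaviour in isolation, here is a small sketch using Python's `re` module. It assumes Python's engine behaves like the Java regex engine behind Instaparse for these two patterns, and `first_match` is just a helper name I made up.

```python
import re

# Assumption: Python's `re` stands in for the Java regex engine Instaparse uses.
GREEDY = re.compile(r'[^ ]+')                    # the original <word> regex
NO_TRAILING_PAREN = re.compile(r'[^ ]+(?<!\))')  # with the negative lookbehind


def first_match(pattern, text):
    """Return what the pattern consumes from the start of text."""
    return pattern.match(text).group()


print(first_match(GREEDY, 'bar)'))             # bar)  -> swallows the closing paren
print(first_match(NO_TRAILING_PAREN, 'bar)'))  # bar   -> leaves ')' for expr
print(first_match(NO_TRAILING_PAREN, '()'))    # (     -> why "()" needs the plain regex
```

The last line shows why the lookbehind cannot be applied to every word: on `()` it backtracks to just `(`, so only the final word should use it.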
Hope this helps.
I am trying to implement a lambda calculus inside of Rascal but am having trouble getting the precedence and parsing to work the way I would like it to. Currently I have a grammar that looks something like:
keyword Keywords= "if" | "then" | "else" | "end" | "fun";
lexical Ident = [a-zA-Z] !>> [a-zA-Z]+ !>> [a-zA-Z0-9] \ Keywords;
lexical Natural = [0-9]+ !>> [0-9];
lexical LAYOUT = [\t-\n\r\ ];
layout LAYOUTLIST = LAYOUT* !>> [\t-\n\r\ ];
start syntax Prog = prog: Exp LAYOUTLIST;
syntax Exp =
var: Ident
| nat: Natural
| bracket "(" Exp ")"
> left app: Exp Exp
> right func: "fun" Ident "-\>" Exp
When I parse a program of the form:
(fun x -> fun y -> x) 1 2
The resulting tree is:
prog(app(
app(
func(
"x",
func(
"y",
var("x")
nat(1),
nat(2))))))
Where really I am looking for something like this (I think):
prog(app(
func(
"x",
app(
func(
"y",
var("x")),
nat(2))),
nat(1)))
I've tried a number of variations of the precedence in the grammar, I've tried wrapping the App rule in parenthesis, and a number of other variations. There seems to be something going on here I don't understand. Any help would be most appreciated. Thanks.
I've used the following grammar, which removes the extra LAYOUTLIST and the dead right, but this should not make a difference. It seems to work as you want when I use the generic implode function:
keyword Keywords= "if" | "then" | "else" | "end" | "fun";
lexical Ident = [a-zA-Z] !>> [a-zA-Z]+ !>> [a-zA-Z0-9] \ Keywords;
lexical Natural = [0-9]+ !>> [0-9];
lexical LAYOUT = [\t-\n\r\ ];
layout LAYOUTLIST = LAYOUT* !>> [\t-\n\r\ ];
start syntax Prog = prog: Exp;
syntax Exp =
var: Ident
| nat: Natural
| bracket "(" Exp ")"
> left app: Exp Exp
> func: "fun" Ident "-\>" Exp
;
Then calling the parser and imploding to an untyped AST (I've removed the location annotations for readability):
rascal>import ParseTree;
ok
rascal>implode(#node, parse(#start[Prog], "(fun x -\> fun y -\> x) 1 2"))
node: "prog"("app"(
"app"(
"func"(
"x",
"func"(
"y",
"var"("x"))),
"nat"("1")),
"nat"("2")))
So, I am guessing you got the grammar right for the shape of tree you want. How do you go from concrete parse tree to abstract AST? Perhaps there is something funny going on there.
Hi guys.
What do you think about these two rules to parse whitespace and to recognize the different lines of the file I must translate?
1.
line: NEW_LINE {$$ = System.lineSeparator();}
| line NEW_LINE {$$ = $1 + System.lineSeparator();}
where:
NEW_LINE = \r\n|\n|\r in Jflex
2.
whitespace: WHITESPACE {$$ = " ";}
| whitespace WHITESPACE {$$ = $1 + " ";}
where:
WHITESPACE = [ \t] in Jflex
Are they correct? Thanks to all.
line: NEW_LINE {$$ = System.lineSeparator();}
| line NEW_LINE {$$ = $1 + System.lineSeparator();}
where:
NEW_LINE = \r\n|\n|\r in Jflex
If you don't really care about multiple newlines, as this grammar suggests, collect them all in the lexer:
NEW_LINE = (\r\n|\n|\r)+ return NEW_LINE;
and not in the parser:
line : NEW_LINE { $$ = System.lineSeparator(); }
Whitespace normally includes line terminators (unless they are significant in your grammar, which they seem to be) and also formfeeds:
WHITESPACE [ \t\f]
and again it is much more efficient to collect it all in the lexer rather than the parser:
WHITESPACE [ \t\f]+
whitespace: WHITESPACE { $$ = strdup(yytext); }
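Note that strdup() and free() are C idioms: in a C parser the copy has to be free()-d whenever it reappears as $1, $2, etc, and isn't copied directly to $$. In a Java parser like yours, the token text is already a String and there is nothing to free.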
But then usually whitespace doesn't appear in the grammar at all, it is just ignored by the lexer:
WHITESPACE [ \t\f]+ ;
unless again you really really need it in the grammar. This is pretty unlikely. You should just be able to work with the non-whitespace tokens the lexer returns to you.
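To make the "do it in the lexer" advice concrete, here is a small Python sketch of the idea rather than JFlex itself. The token names mirror the ones above, while `WORD` and `tokenize` are placeholders I invented for the example.

```python
import re

# One alternative per token kind. NEW_LINE and WHITESPACE each swallow a whole
# run, so the parser never has to stitch them together one token at a time.
TOKEN = re.compile(r'''
    (?P<NEW_LINE>(?:\r\n|\n|\r)+)
  | (?P<WHITESPACE>[ \t\f]+)
  | (?P<WORD>[^ \t\f\r\n]+)
''', re.VERBOSE)


def tokenize(text):
    """Collapse each run of line terminators to one token; drop other whitespace."""
    tokens = []
    for m in TOKEN.finditer(text):
        kind = m.lastgroup
        if kind == 'WHITESPACE':
            continue  # ignored by the lexer, never reaches the parser
        if kind == 'NEW_LINE':
            tokens.append(('NEW_LINE', '\n'))
        else:
            tokens.append((kind, m.group()))
    return tokens


print(tokenize('a  b\r\n\n\nc\td'))
# [('WORD', 'a'), ('WORD', 'b'), ('NEW_LINE', '\n'), ('WORD', 'c'), ('WORD', 'd')]
```

With a lexer like this, the grammar only ever sees a single `NEW_LINE` between lines and no whitespace tokens at all.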
Trying to build a grammar that will parse simple bool expressions.
I am running into an issue when there are multiple expressions.
I need to be able to parse 1..n and/or'ed expressions.
Each example below is a complete expression:
(myitem.isavailable("1234") or myitem.ispresent("1234")) and
myitem.isready("1234")
myitem.value > 4 and myitem.value < 10
myitem.value = yes or myotheritem.value = no
Grammar:
@start = conditionalexpression* | expressiontypes;
conditionalexpression = condition expressiontypes;
expressiontypes = expression | functionexpression;
expression = itemname dot property condition value;
functionexpression = itemname dot functionproperty;
itemname = Word;
propertytypes = property | functionproperty;
property = Word;
functionproperty = Word '(' value ')';
value = Word | QuotedString | Number;
condition = textcondition;
dot = '.';
textcondition = 'or' | 'and' | '<' | '>' | '=';
Developer of ParseKit here.
Here is a ParseKit grammar that matches your example input:
@start = expr;
expr = orExpr;
orExpr = andExpr orTerm*;
orTerm = 'or' andExpr;
// 'and' should bind more tightly than 'or'
andExpr = relExpr andTerm*;
andTerm = 'and' relExpr;
// relational expressions should bind more tightly than 'and'/'or'
relExpr = callExpr relTerm*;
relTerm = relOp callExpr;
// func calls should bind more tightly than relational expressions
callExpr = primaryExpr ('(' argList ')')?;
argList = Empty | atom (',' atom)*;
primaryExpr = atom | '(' expr ')';
atom = obj | literal;
// member access should bind most tightly
obj = id member*;
member = ('.' id);
id = Word;
literal = Number | QuotedString | bool;
bool = 'yes' | 'no';
relOp = '<' | '>' | '=';
To give you an idea of how I arrived at this grammar:
I realized that your language is a simple, composable expression language.
I remembered that XPath 1.0 is also a relatively simple expression language with an easily available/readable grammar.
I visited the XPath 1.0 spec online and quickly scanned the XPath basic language grammar. That served to provide a quick jumping-off point for designing your language grammar. If you ignore the path expression part of XPath expressions, XPath is a very good template for a basic expression language.
My grammar above successfully parses all of your example inputs (see below). Hope this helps.
[foo, ., bar, (, "hello", ), or, (, bar, or, baz, >, bat, )]foo/./bar/(/"hello"/)/or/(/bar/or/baz/>/bat/)^
[myitem, ., value, >, 4, and, myitem, ., value, <, 10]myitem/./value/>/4/and/myitem/./value/</10^
[myitem, ., value, =, yes, or, myotheritem, ., value, =, no]myitem/./value/=/yes/or/myotheritem/./value/=/no^
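The layering in the grammar (or, then and, then relational operators, then calls, then member access) maps directly onto one function per precedence level. Below is a rough Python sketch of that idea, not ParseKit output: the class and method names are my own, and the tree is just nested tuples.

```python
import re

TOKEN = re.compile(r'\s*("[^"]*"|\w+|[().<>=,])')
RESERVED = {'(', ')', '.', ',', '<', '>', '=', 'or', 'and'}


def tokenize(src):
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f'unexpected character at index {pos}')
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f'expected {expected!r}, got {tok!r}')
        self.pos += 1
        return tok

    def binary(self, operators, operand):
        """operand (operator operand)*, folded into a left-associative tree."""
        node = operand()
        while self.peek() in operators:
            node = (self.take(), node, operand())
        return node

    # One method per precedence level, loosest binding first.
    def or_expr(self):
        return self.binary({'or'}, self.and_expr)

    def and_expr(self):
        return self.binary({'and'}, self.rel_expr)

    def rel_expr(self):
        return self.binary({'<', '>', '='}, self.call_expr)

    def call_expr(self):
        node = self.primary()
        if self.peek() == '(':
            self.take('(')
            args = []
            if self.peek() != ')':
                args.append(self.atom())
                while self.peek() == ',':
                    self.take(',')
                    args.append(self.atom())
            self.take(')')
            node = ('call', node, args)
        return node

    def primary(self):
        if self.peek() == '(':
            self.take('(')
            node = self.or_expr()
            self.take(')')
            return node
        return self.atom()

    def name(self):
        tok = self.take()
        if tok in RESERVED:
            raise SyntaxError(f'unexpected {tok!r}')
        return tok

    def atom(self):
        """A literal, or an id followed by any number of .member parts."""
        parts = [self.name()]
        while self.peek() == '.':
            self.take('.')
            parts.append(self.name())
        return '.'.join(parts)


def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.or_expr()
    if parser.peek() is not None:
        raise SyntaxError(f'unexpected {parser.peek()!r}')
    return tree


print(parse('myitem.value > 4 and myitem.value < 10'))
# ('and', ('>', 'myitem.value', '4'), ('<', 'myitem.value', '10'))
```

Because `and_expr` is called from inside `or_expr`, an input such as `a = 1 or b = 2 and c = 3` groups the `and` first, which is the binding order the comments in the grammar describe.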
In the code below I can correctly parse white spaces after each of the tokens using Parsec:
whitespace = skipMany (space <?> "")
number :: Parser Integer
number = result <?> "number"
where
result = do {
ds <- many1 digit;
whitespace;
return (read ds)
}
table = result
where
result = [
[Infix (genParser '*' (*)) AssocLeft,
Infix (genParser '/' div) AssocLeft],
[Infix (genParser '+' (+)) AssocLeft,
Infix (genParser '-' (-)) AssocLeft]]
genParser s f = char s >> whitespace >> return f
factor = parenExpr <|> number <?> "parens or number"
where
parenExpr = do {
char '(';
x <- expr;
char ')';
whitespace;
return x
}
expr :: Parser Integer
expr = buildExpressionParser table factor <?> "expression"
However I get a parse error when trying to only parse white spaces before, and after the operators:
whitespace = skipMany (space <?> "")
number :: Parser Integer
number = result <?> "number"
where
result = do {
ds <- many1 digit;
return (read ds)
}
table = result
where
result = [
[Infix (genParser '*' (*)) AssocLeft,
Infix (genParser '/' div) AssocLeft],
[Infix (genParser '+' (+)) AssocLeft,
Infix (genParser '-' (-)) AssocLeft]]
genParser s f = whitespace >> char s >> whitespace >> return f
factor = parenExpr <|> number <?> "parens or number"
where
parenExpr = do {
char '(';
x <- expr;
char ')';
return x
}
expr :: Parser Integer
expr = buildExpressionParser table factor <?> "expression"
The parse error is:
$ ./parsec_example < <(echo "2 * 2 * 3")
"(stdin)" (line 2, column 1):
unexpected end of input
expecting "*"
Why does this happen? Is there some other way to parse white space around just the operators?
When I test your code, 2 * 2 * 3 parses correctly, but 2 + 2 does not. Parsing fails because the parser for * consumes some input and backtracking isn't enabled at that position, so other parsers cannot be tried.
An expression parser created by buildExpressionParser tries to parse each operator in turn until one succeeds. When parsing 2 + 2, the following occurs:
The first 2 is matched by number. The rest of the input is + 2 (note the space at the beginning).
The parser genParser '*' (*) is applied to the input. It consumes the space, but does not match the + character.
The other infix operator parsers automatically fail because some input was consumed by genParser '*' (*).
You can fix this by wrapping the critical part of the parser in try. This saves the input until after char s succeeds. If char s fails, then buildExpressionParser can backtrack and try another infix operator.
genParser s f = try (whitespace >> char s) >> whitespace >> return f
The drawback of this parser is that, because it backtracks to before the leading whitespace before an infix operator, it repeatedly scans whitespace. It is usually better to parse whitespace after a successful match, like the OP's first parser example.
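For anyone who wants to poke at the "consumed input means no alternatives" rule outside Haskell, here is a toy model in Python. It is not Parsec: `attempt` and `choice` are my stand-ins for `try` and `<|>`, and a failure is treated as having consumed input when its position differs from where the parser started.

```python
class Failure(Exception):
    """Parse failure; pos records how far the parser got before failing."""
    def __init__(self, pos):
        super().__init__(f'parse failure at index {pos}')
        self.pos = pos


def char(c):
    def parser(s, i):
        if i < len(s) and s[i] == c:
            return c, i + 1
        raise Failure(i)
    return parser


def whitespace(s, i):
    while i < len(s) and s[i].isspace():
        i += 1
    return None, i


def seq(*parsers):
    """Run the parsers one after another and collect their results."""
    def parser(s, i):
        results = []
        for p in parsers:
            value, i = p(s, i)
            results.append(value)
        return results, i
    return parser


def attempt(p):
    """Stand-in for `try`: a failure inside p is reported as consuming nothing."""
    def parser(s, i):
        try:
            return p(s, i)
        except Failure:
            raise Failure(i) from None
    return parser


def choice(*parsers):
    """Stand-in for <|>: only move on if the failed branch consumed no input."""
    def parser(s, i):
        for p in parsers:
            try:
                return p(s, i)
            except Failure as failure:
                if failure.pos != i:
                    raise  # this branch consumed input, so it is committed
        raise Failure(i)
    return parser


def operator(c, backtrack):
    """whitespace, the operator character, whitespace (like genParser)."""
    head = seq(whitespace, char(c))
    if backtrack:
        head = attempt(head)
    return seq(head, whitespace)


with_try = choice(operator('*', True), operator('+', True))
print(with_try(' + 2', 0)[1])  # 3: '*' fails cleanly, then '+' matches

without_try = choice(operator('*', False), operator('+', False))
try:
    without_try(' + 2', 0)
except Failure as failure:
    print(failure.pos)  # 1: '*' ate the space, so '+' was never tried
```

The second call fails at index 1 for the same reason the original parser does: the leading whitespace was consumed before `char` failed.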
I am using ANTLR to create an and/or parser+evaluator. Expressions will have the format like:
x eq 1 && y eq 10
(x lt 10 && x gt 1) OR x eq -1
I was reading this post on logic expressions in ANTLR, "Looking for advice on project. Parsing logical expression", and I found the grammar posted there a good start:
grammar Logic;
parse
: expression EOF
;
expression
: implication
;
implication
: or ('->' or)*
;
or
: and ('||' and)*
;
and
: not ('&&' not)*
;
not
: '~' atom
| atom
;
atom
: ID
| '(' expression ')'
;
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
However, while getting a tree from the parser works for expressions where the variables are just one character (i.e., "(A || B) AND C"), I am having a hard time adapting this to my case (in the example "x eq 1 && y eq 10" I'd expect one "AND" parent and two children, "x eq 1" and "y eq 10", see the test case below).
@Test
public void simpleAndEvaluation() throws RecognitionException{
String src = "1 eq 1 && B";
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
assertEquals("&&",tree.getText());
assertEquals("1 eq 1",tree.getChild(0).getText());
assertEquals("a neq a",tree.getChild(1).getText());
}
I believe this is related with the "ID". What would the correct syntax be?
For those interested, I made some improvements in my grammar file (see below).
Current limitations:
only works with &&/||, not AND/OR (not very problematic)
you can't have spaces between the parentheses and the &&/|| (I solve that by replacing " (" with "(" and ") " with ")" in the source String before feeding the lexer)
grammar Logic;
options {
output = AST;
}
tokens {
AND = '&&';
OR = '||';
NOT = '~';
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| SYMBOL
)+
;
SYMBOL
:
('+'|'-'|'*'|'/'|'_')
;
ID : ('a'..'z' | 'A'..'Z')+;
states that an identifier is a sequence of one or more letters, but does not allow any digits. Try
ID : ('a'..'z' | 'A'..'Z' | '0'..'9')+;
which will allow e.g. abc, 123, 12ab, and ab12. If you don't want the latter types, you'll have to restructure the rule a little bit (left as a challenge...)
In order to accept arbitrarily many identifiers, you could define atom as ID+ instead of ID.
Also, you will likely need to specify AND, OR, -> and ~ as tokens so that, as @Bart Kiers says, the first two won't get classified as ID, and so that the latter two will get recognized at all.
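As a rough illustration of both points (operators tried before identifiers, and an atom made of several IDs), here is a Python sketch rather than ANTLR. The `OP` and `ID` group names and the `atoms` helper are my own inventions for the example.

```python
import re
from itertools import groupby

# Operators come first in the alternation, mirroring explicit operator tokens;
# ID is letters and digits, as in the widened rule.
TOKEN = re.compile(r'\s*(?:(?P<OP>&&|\|\||->|~|[()])|(?P<ID>[A-Za-z0-9]+))')


def tokenize(src):
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise ValueError(f'unexpected character at index {pos}')
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def atoms(tokens):
    """Merge each run of adjacent ID tokens into one atom (atom : ID+)."""
    out = []
    for kind, run in groupby(tokens, key=lambda token: token[0]):
        texts = [text for _, text in run]
        if kind == 'ID':
            out.append(' '.join(texts))
        else:
            out.extend(texts)
    return out


print(atoms(tokenize('x eq 1 && y eq 10')))
# ['x eq 1', '&&', 'y eq 10']
```

Grouping the IDs this way gives "x eq 1" and "y eq 10" as single operands on either side of `&&`, which is the tree shape the question asks for.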