JavaCC: treat white space like <OR> - parsing

I'm trying to build a simple grammar for a search engine query.
I've got this so far -
options {
STATIC=false;
MULTI=true;
VISITOR=true;
}
PARSER_BEGIN(SearchParser)
package com.syncplicity.searchservice.infrastructure.parser;
public class SearchParser {}
PARSER_END(SearchParser)
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}
<*> TOKEN : {
<#_TERM_CHAR: ~[ " ", "\t", "\n", "\r", "!", "(", ")", "\"", "\\", "/" ] >
| <#_QUOTED_CHAR: ~["\""] >
| <#_WHITESPACE: ( " " | "\t" | "\n" | "\r" | "\u3000") >
}
TOKEN :
{
<AND: "AND">
| <OR: "OR">
| <NOT: ("NOT" | "!")>
| <LBRACKET: "(">
| <RBRACKET: ")">
| <TERM: (<_TERM_CHAR>)+ >
| <QUOTED: "\"" (<_QUOTED_CHAR>)+ "\"">
}
/** Main production. */
ASTQuery query() #Query: {}
{
subQuery()
( <AND> subQuery() #LogicalAnd
| <OR> subQuery() #LogicalOr
| <NOT> subQuery() #LogicalNot
)*
{
return jjtThis;
}
}
void subQuery() #void: {}
{
<LBRACKET> query() <RBRACKET> | term() | quoted()
}
void term() #Term:
{
Token t;
}
{
(
t=<TERM>
)
{
jjtThis.value = t.image;
}
}
void quoted() #Quoted:
{
Token t;
}
{
(
t=<QUOTED>
)
{
jjtThis.value = t.image;
}
}
It looks like it works as I wanted, e.g. it can handle AND, OR, NOT/!, single terms and quoted text.
However, I can't make it treat whitespace between terms as an OR operator. E.g. hello world should be treated as hello OR world.
I've tried all the obvious solutions, like <OR: ("OR" | " ")>, removing " " from SKIP, etc., but it still doesn't work.

Perhaps you don't want whitespace treated as an OR, perhaps you want the OR keyword to be optional. In that case you can use a grammar like this
query --> subquery (<AND> subquery | (<OR>)? subquery | <NOT> subquery)*
However, this grammar treats NOT as an infix operator. It also doesn't reflect precedence: usually NOT has precedence over AND, and AND over OR. Also, your main production should look for an EOF. For that you can try
query --> query0 <EOF>
query0 --> query1 ((<OR>)? query1)*
query1 --> query2 (<AND> query2)*
query2 --> <NOT> query2 | subquery
subquery --> <LBRACKET> query0 <RBRACKET> | <TERM> | <QUOTED>
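For anyone who wants to experiment with the precedence grammar above outside JavaCC, here is a minimal hand-written recursive-descent sketch in Python. It is not JavaCC output: the tuple-based tree and the names tokenize, Parser and parse are made up for illustration, and the token regex only approximates the TOKEN definitions from the question.

```python
import re

# Approximation of the question's tokens; whitespace is skipped, as in SKIP.
TOKEN_RE = re.compile(
    r'\s*(?:(?P<LBRACKET>\()|(?P<RBRACKET>\))|(?P<NOT>!)'
    r'|(?P<QUOTED>"[^"]+")|(?P<TERM>[^\s!()"\\/]+))'
)
KEYWORDS = {"AND", "OR", "NOT"}
OPERAND_START = {"NOT", "LBRACKET", "TERM", "QUOTED"}


def tokenize(text):
    """Return a list of (kind, image) pairs ending with an EOF marker."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        kind, image = match.lastgroup, match.group(match.lastgroup)
        if kind == "TERM" and image in KEYWORDS:
            kind = image  # AND / OR / NOT are keywords, not terms
        tokens.append((kind, image))
        pos = match.end()
    return tokens + [("EOF", "")]


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos][0]

    def take(self, kind):
        actual, image = self.tokens[self.pos]
        if actual != kind:
            raise SyntaxError(f"expected {kind}, got {actual}")
        self.pos += 1
        return image

    def query(self):  # query --> query0 <EOF>
        node = self.query0()
        self.take("EOF")
        return node

    def query0(self):  # query0 --> query1 ((<OR>)? query1)*
        node = self.query1()
        while self.peek() == "OR" or self.peek() in OPERAND_START:
            if self.peek() == "OR":
                self.take("OR")
            node = ("OR", node, self.query1())
        return node

    def query1(self):  # query1 --> query2 (<AND> query2)*
        node = self.query2()
        while self.peek() == "AND":
            self.take("AND")
            node = ("AND", node, self.query2())
        return node

    def query2(self):  # query2 --> <NOT> query2 | subquery
        if self.peek() == "NOT":
            self.take("NOT")
            return ("NOT", self.query2())
        return self.subquery()

    def subquery(self):  # <LBRACKET> query0 <RBRACKET> | <TERM> | <QUOTED>
        kind = self.peek()
        if kind == "LBRACKET":
            self.take("LBRACKET")
            node = self.query0()
            self.take("RBRACKET")
            return node
        if kind in ("TERM", "QUOTED"):
            return (kind, self.take(kind))
        raise SyntaxError(f"unexpected {kind}")


def parse(text):
    return Parser(tokenize(text)).query()
```

With this sketch, parse("hello world") and parse("hello OR world") build the same tree, and AND binds tighter than the implicit OR.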

Ok. Suppose you actually do want to require that any missing ORs be replaced by at least one space. Or to put it another way, if there is one or more white spaces where an OR would be permitted, then that white space is considered to be an OR.
As in my other solution, I'll treat NOT as a unary operator and give NOT precedence over AND and AND precedence over either sort of OR.
Change
SKIP : { " " | "\t" | "\n" | "\r" }
to
TOKEN : {<WS : " " | "\t" | "\n" | "\r" > }
Now use a grammar like this
query() --> query0() ows() <EOF>
query0() --> query1()
( LOOKAHEAD( ows() <OR> | ws() (<NOT> | <LBRACKET> | <TERM> | <QUOTED>) )
ows() (<OR>)?
query1()
)*
query1() --> query2() (LOOKAHEAD(ows() <AND>) ows() <AND> query2())*
query2() --> ows() (<NOT> query2() | subquery())
subquery() --> <LBRACKET> query0() ows() <RBRACKET> | <TERM> | <QUOTED>
ows() --> (<WS>)*
ws() --> (<WS>)+
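The lookahead condition above boils down to: white space counts as an OR only when it sits between the end of one operand and the start of the next. A rough Python sketch of that idea as a token-level pass follows. It is only an illustration under assumptions: the names tokenize, whitespace_to_or and kinds are hypothetical, and the token regex approximates the question's TOKEN rules.

```python
import re

TOKEN_RE = re.compile(
    r'(?P<WS>\s+)|(?P<LBRACKET>\()|(?P<RBRACKET>\))|(?P<NOT>!)'
    r'|(?P<QUOTED>"[^"]+")|(?P<TERM>[^\s!()"\\/]+)'
)
KEYWORDS = {"AND", "OR", "NOT"}
ENDS_OPERAND = {"TERM", "QUOTED", "RBRACKET"}
STARTS_OPERAND = {"NOT", "LBRACKET", "TERM", "QUOTED"}


def tokenize(text):
    """Tokenize, keeping each run of white space as a single WS token."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        kind, image = match.lastgroup, match.group()
        if kind == "TERM" and image in KEYWORDS:
            kind = image
        tokens.append((kind, image))
        pos = match.end()
    return tokens


def whitespace_to_or(tokens):
    """Replace WS by OR where an OR is permitted; drop all other WS."""
    out = []
    for i, (kind, image) in enumerate(tokens):
        if kind != "WS":
            out.append((kind, image))
            continue
        prev_kind = out[-1][0] if out else None
        next_kind = tokens[i + 1][0] if i + 1 < len(tokens) else None
        if prev_kind in ENDS_OPERAND and next_kind in STARTS_OPERAND:
            out.append(("OR", image))
    return out


def kinds(text):
    return [kind for kind, _ in whitespace_to_or(tokenize(text))]
```

For example, kinds("hello world") and kinds("hello OR world") both yield TERM, OR, TERM, while the white space around an explicit AND is simply dropped.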

Related

ANTLR 'or' regular expression

I have a serious problem with the | (alternative) operator.
My grammar contains rules like this:
...ifelse : 'IF' condition 'THEN' dosomething+ 'ENDIF'
...dosomething : assign | print | input;
but dosomething seems to get fixed to a single alternative. For example:
IF a > 3 THEN
PRINT "HEllo"
b = a
ENDIF
Here the first dosomething is a print, and after that the grammar can't read an assign or an input.
If the statements look like this, it works correctly:
IF a > 3 THEN
PRINT "HEllo"
PRINT myName
ENDIF
So I mean that an 'or' expression such as ( | | )+ seems to stay fixed to whichever alternative occurred first.
grammar hellog;
prog : command+;
command : maincommand
| expressioncommand
| flowcommand
;
//main
maincommand : printcommand
| inputcommand
;
printcommand : 'PRINT' (IDINT | IDSTR | STRING) NL
| 'PRINT' (IDINT | IDSTR | STRING) (',' (IDINT | IDSTR | STRING))* NL
;
inputcommand : 'INPUT' (IDINT | IDSTR) NL
| 'INPUT' STRING? (IDINT | IDSTR) NL
;
//expression
expressioncommand : intexpression
| strexpression
;
intexpression : IDINT '=' (IDINT | INT) NL
| IDINT '=' (IDINT | INT) (OPERATORMATH (IDINT | INT))* NL
;
strexpression : IDSTR '=' (IDSTR | STRING) NL
| IDSTR '=' (IDSTR | STRING) ('+' (IDSTR | STRING))* NL
;
//flow
flowcommand : ifelseflow
| whileflow
;
ifelseflow : 'IF' conditionflow 'THEN' NL dosomething+ ('ELSEIF' conditionflow 'THEN' NL dosomething+)* ('ELSE' NL dosomething+)? 'ENDIF' NL;
whileflow : 'WHILE' conditionflow NL (dosomething)+ 'WEND' NL;
dosomething : command;
conditionflow : (INT | IDINT) OPERATORBOOL (INT | IDINT)
| (STRING | IDSTR) '=' (STRING | IDSTR)
;
INT : [0-9]+;
STRING : '"' .*? '"';
IDINT : [a-zA-Z]+;
IDSTR : [a-zA-Z]+'$';
NL : '\n';
WS : [ \t\r]+ -> skip;
OPERATORMATH : '+' | '-' | '*' | '/';
OPERATORBOOL : '=' | '>' | '<' | '>=' | '<=';
I just need a grammar to run these statements:
PRINT "Your name"
INPUT name
PRINT "HELLO" name
a = 6
IF a > 3 THEN
PRINT a
a = a -1
END IF
WHILE b = 3
PRINT b
a = b
WEND
My answer isn't exactly about the | alternatives, but please keep reading: like you, I found if..else constructs in a BASIC-like language a real challenge to implement. I found some good resources online, and once I got it right, many problems disappeared all at once and it just started to work. Please take a look at my grammar snippet:
ifstmt
: IF condition_block (ELSE IF condition_block)* (ELSE stmt_block)?
;
condition_block
: expr stmt_block
;
stmt_block
: OBRACE statement+ CBRACE
| statement
;
And my implementation (in C# visitor pattern):
public override MuValue VisitIfstmt(LISBASICParser.IfstmtContext context)
{
LISBASICParser.Condition_blockContext[] conditions = context.condition_block();
bool evaluatedBlock = false;
foreach (LISBASICParser.Condition_blockContext condition in conditions)
{
MuValue evaluated = Visit(condition.expr());
if (evaluated.AsBoolean())
{
evaluatedBlock = true;
Visit(condition.stmt_block());
break;
}
}
if (!evaluatedBlock && context.stmt_block() != null)
{
Visit(context.stmt_block());
}
return MuValue.Void;
}
Much borrowed from Bart Kiers's excellent implementation of his Mu demonstration language. Lots of great ideas in that project of his. It really showed me the light and this code I've shown handles if statements great, nested arbitrarily deep if you need that. This is production code running a critical domain-specific language.
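The control flow of that visitor is independent of ANTLR and C#. As a rough sketch under assumptions (Python instead of C#, zero-argument callables standing in for the expr and stmt_block contexts, and a made-up function name run_if), it amounts to this:

```python
def run_if(condition_blocks, else_block=None):
    """Run the first block whose condition holds, else the optional else block.

    condition_blocks is a list of (condition, block) pairs of zero-argument
    callables, one pair per IF / ELSE IF branch. Returns True if a
    conditional branch ran.
    """
    for condition, block in condition_blocks:
        if condition():
            block()
            return True  # like the `break` in the visitor: stop at the first hit
    if else_block is not None:
        else_block()
    return False


log = []
a = 2
run_if(
    [
        (lambda: a > 3, lambda: log.append("big")),
        (lambda: a > 1, lambda: log.append("medium")),
    ],
    else_block=lambda: log.append("small"),
)
print(log)  # ['medium']
```

Nesting works for free, because a block can itself call run_if.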

Removing direct left recursion in JavaCC

I have the following in a JavaCC file:
void condition() : {}
{
expression() comp_op() expression()
| condition() (<AND> | <OR>) condition()
}
where <AND> is "&&" and <OR> is "||". This is causing problems due to the fact that it is direct left-recursion. How can I fix this?
A condition is essentially 1 or more of expression comp_op expression separated by an AND or an OR. You could do the following
condition --> simpleCondition ( (<AND> | <OR>) simpleCondition )*
simpleCondition --> expression comp_op expression
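To see why the rewritten rule no longer recurses on the left, it may help to look at how a hand-written parser would implement it: the (...)* becomes a plain loop. The following Python sketch is only an illustration. It assumes the input is already split into string tokens, that an expression is a single token, and that the comparison operators are the ones listed in COMP_OPS; none of that comes from the original grammar.

```python
COMP_OPS = {"<", ">", "<=", ">=", "==", "!="}  # assumed comp_op set
LOGIC_OPS = {"&&", "||"}                       # <AND> and <OR>


def parse_simple_condition(tokens, pos):
    """simpleCondition --> expression comp_op expression"""
    if pos + 3 > len(tokens) or tokens[pos + 1] not in COMP_OPS:
        raise SyntaxError(f"expected a comparison at token {pos}")
    left, op, right = tokens[pos:pos + 3]
    return (op, left, right), pos + 3


def parse_condition(tokens):
    """condition --> simpleCondition ((<AND> | <OR>) simpleCondition)*"""
    node, pos = parse_simple_condition(tokens, 0)
    while pos < len(tokens) and tokens[pos] in LOGIC_OPS:
        op = tokens[pos]
        right, pos = parse_simple_condition(tokens, pos + 1)
        node = (op, node, right)  # fold to the left as we go
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return node


tree = parse_condition("a < b && c == d || e > f".split())
```

Note that, like the grammar above, this gives && and || the same precedence and groups them left to right.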

Flex and Yacc Grammar Issue

Edit #1: I think the problem is in my .l file. I don't think the rules are being treated as rules, and I'm not sure how to treat the terminals of the rules as strings.
My last project for a compilers class is to write a .l and a .y file for a simple SQL grammar. I have no experience with Flex or Yacc, so everything I have written I have pieced together. I only have a basic understanding of how these files work, so if you spot my problem can you also explain what that section of the file is supposed to do? I'm not even sure what the '%' symbols do.
Basically some rules just do not work when I try to parse something. Some rules hang and others reject when they should accept. I need to implement the following grammar:
start
::= expression
expression
::= one-relation-expression | two-relation-expression
one-relation-expression
::= renaming | restriction | projection
renaming
::= term RENAME attribute AS attribute
term
::= relation | ( expression )
restriction
::= term WHERE comparison
projection
::= term | term [ attribute-commalist ]
attribute-commalist
::= attribute | attribute , attribute-commalist
two-relation-expression
::= projection binary-operation expression
binary-operation
::= UNION | INTERSECT | MINUS | TIMES | JOIN | DIVIDEBY
comparison
::= attribute compare number
compare
::= < | > | <= | >= | = | <>
number
::= val | val number
val
::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
attribute
::= CNO | CITY | CNAME | SNO | PNO | TQTY |
SNAME | QUOTA | PNAME | COST | AVQTY |
S# | STATUS | P# | COLOR | WEIGHT | QTY
relation
::= S | P | SP | PRDCT | CUST | ORDERS
Here is my .l file:
%{
#include <stdio.h>
#include "p5.tab.h"
%}
binaryOperation UNION|INTERSECT|MINUS|TIMES|JOIN|DIVIDEBY
compare <|>|<=|>=|=|<>
attribute CNO|CITY|CNAME|SNO|PNO|TQTY|SNAME|QUOTA|PNAME|COST|AVQTY|S#|STATUS|P#|COLOR|WEIGHT|QTY
relation S|P|SP|PRDCT|CUST|ORDERS
%%
[ \t\n]+ ;
{binaryOperation} return binaryOperation;
{compare} return compare;
[0-9]+ return val;
{attribute} return attribute;
{relation} return relation;
"RENAME" return RENAME;
"AS" return AS;
"WHERE" return WHERE;
"(" return '(';
")" return ')';
"[" return '[';
"]" return ']';
"," return ',';
. {printf("REJECT\n");
exit(0);}
%%
Here is my .y file:
%{
#include <stdio.h>
#include <stdlib.h>
%}
%token RENAME attribute AS relation WHERE binaryOperation compare val
%%
start:
expression {printf("ACCEPT\n");}
;
expression:
oneRelationExpression
| twoRelationExpression
;
oneRelationExpression:
renaming
| restriction
| projection
;
renaming:
term RENAME attribute AS attribute
;
term:
relation
| '(' expression ')'
;
restriction:
term WHERE comparison
;
projection:
term
| term '[' attributeCommalist ']'
;
attributeCommalist:
attribute
| attribute ',' attributeCommalist
;
twoRelationExpression:
projection binaryOperation expression
;
comparison:
attribute compare number
;
number:
val
| val number
;
%%
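int yyerror(const char *msg) {
printf("REJECT\n");
exit(0);
}
int main(void) {
yyparse();
return 0;
}
/* Returning 1 tells the scanner there is no further input after EOF. */
int yywrap(void) { return 1; }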
Here is my makefile:
p5: p5.tab.c lex.yy.c
cc -o p5 p5.tab.c lex.yy.c
p5.tab.c: p5.y
bison -d p5.y
lex.yy.c: p5.l
flex p5.l
This works:
S RENAME CNO AS CITY
These do not:
S
S WHERE CNO = 5
I have not tested everything, but I think there is a common problem for these issues.
Your grammar is correct; the problem is that you are running interactively. When you call yyparse() it will attempt to read all input. Because the input
S
could be followed by either RENAME or WHERE, yyparse won't accept it yet. Similarly,
S WHERE CNO = 5
could be followed by one or more numbers, so yyparse won't accept until it gets an EOF or an unexpected token.
What you want to do is follow the advice here and change p5.l to have these lines:
[ \t]+ ;
\n if (yyin==stdin) return 0;
That way when you are running interactively it will take the ENTER key to be the end of input.
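Also, you want to use left recursion for number, because a right-recursive rule makes the parser keep every val on its stack until the last one has been read: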
number:
val
| number val
;
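To make the difference concrete, here is a toy Python model of the parser stack for the two versions of number. It is a deliberately simplified sketch, not what Bison actually generates: it only tracks grammar symbols, ignores parser states and lookahead, and the function name max_stack_depth is invented for this example.

```python
def max_stack_depth(num_vals, left_recursive):
    """Return the deepest the symbol stack gets while reading num_vals vals.

    Toy model only: requires num_vals >= 1.
    """
    stack, deepest = [], 0
    for _ in range(num_vals):
        stack.append("val")  # shift the next val
        deepest = max(deepest, len(stack))
        if left_recursive:
            # number: val | number val  -> reduce right after every shift
            if stack == ["val"]:
                stack = ["number"]
            else:
                stack[-2:] = ["number"]
    if not left_recursive:
        # number: val | val number  -> nothing can be reduced until the end
        stack[-1] = "number"
        while len(stack) > 1:
            stack[-2:] = ["number"]
    return deepest


print(max_stack_depth(100, left_recursive=True))   # 2
print(max_stack_depth(100, left_recursive=False))  # 100
```

In this model the left-recursive rule never needs more than two symbols on the stack, while the right-recursive one needs one slot per digit.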

How to consume the minimal input with fuzzy parsing with ANTLR 4.4+

I am trying to extract the condition between two keywords (IF and THEN in this example) without specifying the full grammar.
The input to the parser begins with the first keyword.
An input example could be: "IF A < 10 OR B> 5 THEN A = A + 1; B=6; ENDIF; IF A < 10 THEN A = 100 ENDIF"
From that input, I want to extract the condition: "A < 10 OR B> 5".
We did it with ANTLR 3.5 but are unable to make it work with ANTLR 4.4 and 4.5.
** 3.5 Grammar **
grammar FuzzyTest3;
options
{
output=AST;
language=Java;
}
@header
{package fuzzytest;}
@lexer::header
{package fuzzytest;}
ifrule: IF .* THEN;
IF : 'IF';
THEN : 'THEN';
IDENTIFIER : ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
SEPARATOR : ( '<' | '>' | ':' | '(' | ')' | '-' | '+' | '=' | ';' );
WS : ( ' ' | '\t' | '\r' | '\n' | '\u000C')+
{
{ $channel = HIDDEN; }
};
** 4.4 Grammar **
grammar FuzzyTest4;
ifrule: IF (.)*? THEN;
//ifrule: IF .* THEN; //same result
IF : 'IF';
THEN : 'THEN';
IDENTIFIER : ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
SEPARATOR : ( '<' | '>' | ':' | '(' | ')' | '-' | '+' | '=' | ';' );
WS : ( ' ' | '\t' | '\r' | '\n' | '\u000C') -> channel(HIDDEN);
With ANTLR 3.5:
ParserRuleReturnScope rulereturn = parser.ifrule();
result = parser.input.toString(rulereturn.start, rulereturn.stop);
System.out.println("TOKENS: "+result);
My output is :
"TOKENS: IF A < 10 OR B> 5 THEN"
With ANTLR 4.4:
ParserRuleContext rulereturn = parser.ifrule();
result = parser.getInputStream().getText(rulereturn.start, rulereturn.stop);
System.out.println("TOKENS: "+result);
My output is :
"line 2:76 no viable alternative at input '<EOF>'
TOKENS: IF A < 10 OR B> 5 THEN A = A + 1; B=6; ENDIF; IF A < 10 THEN A = 100 ENDIF"
Does anyone have an idea or a suggestion?
One way to do it (with that example) is:
ifrule: IF condition;
condition: ~(THEN|IF) condition | ~(THEN|IF);
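The condition rule above matches one or more tokens that are neither THEN nor IF, so it stops at the first THEN. The same idea can be sketched in plain Python; the regex tokenizer and the names tokenize and extract_condition are assumptions for illustration and are not part of ANTLR.

```python
import re


def tokenize(text):
    """Rough stand-in for the IDENTIFIER and SEPARATOR lexer rules."""
    return re.findall(r"[A-Za-z0-9_]+|[<>:()+=;-]", text)


def extract_condition(tokens):
    """Return the tokens after a leading IF, up to the first THEN or IF."""
    if not tokens or tokens[0] != "IF":
        raise ValueError("input must start with IF")
    condition = []
    for token in tokens[1:]:
        if token in ("THEN", "IF"):
            break
        condition.append(token)
    if not condition:
        raise ValueError("empty condition")  # the rule needs at least one token
    return condition


source = "IF A < 10 OR B> 5 THEN A = A + 1; B=6; ENDIF; IF A < 10 THEN A = 100 ENDIF"
print(" ".join(extract_condition(tokenize(source))))  # A < 10 OR B > 5
```

Because the tokens are re-joined with single spaces, the printed text is normalized rather than a verbatim slice of the input.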

Parsing string interpolation in ANTLR

I'm working on a simple string manipulation DSL for internal purposes, and I would like the language to support string interpolation as it is used in Ruby.
For example:
name = "Bob"
msg = "Hello ${name}!"
print(msg) # prints "Hello Bob!"
I'm attempting to implement my parser in ANTLRv3, but I'm pretty inexperienced with using ANTLR so I'm unsure how to implement this feature. So far, I've specified my string literals in the lexer, but in this case I'll obviously need to handle the interpolation content in the parser.
My current string literal grammar looks like this:
STRINGLITERAL : '"' ( StringEscapeSeq | ~( '\\' | '"' | '\r' | '\n' ) )* '"' ;
fragment StringEscapeSeq : '\\' ( 't' | 'n' | 'r' | '"' | '\\' | '$' | ('0'..'9')) ;
Moving the string literal handling into the parser seems to make everything else stop working as it should. Cursory web searches didn't yield any information. Any suggestions as to how to get started on this?
I'm no ANTLR expert, but here's a possible grammar:
grammar Str;
parse
: ((Space)* statement (Space)* ';')+ (Space)* EOF
;
statement
: print | assignment
;
print
: 'print' '(' (Identifier | stringLiteral) ')'
;
assignment
: Identifier (Space)* '=' (Space)* stringLiteral
;
stringLiteral
: '"' (Identifier | EscapeSequence | NormalChar | Space | Interpolation)* '"'
;
Interpolation
: '${' Identifier '}'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
EscapeSequence
: '\\' SpecialChar
;
SpecialChar
: '"' | '\\' | '$'
;
Space
: (' ' | '\t' | '\r' | '\n')
;
NormalChar
: ~SpecialChar
;
As you notice, there are a couple of (Space)*-es inside the example grammar. This is because stringLiteral is a parser rule instead of a lexer rule. Therefore, when tokenizing the source file, the lexer cannot know whether a white space is part of a string literal or is just a space inside the source file that can be ignored.
I tested the example with a little Java class and all worked as expected:
/* the same grammar, but now with a bit of Java code in it */
grammar Str;
@parser::header {
package antlrdemo;
import java.util.HashMap;
}
@lexer::header {
package antlrdemo;
}
@parser::members {
HashMap<String, String> vars = new HashMap<String, String>();
}
parse
: ((Space)* statement (Space)* ';')+ (Space)* EOF
;
statement
: print | assignment
;
print
: 'print' '('
( id=Identifier {System.out.println("> "+vars.get($id.text));}
| st=stringLiteral {System.out.println("> "+$st.value);}
)
')'
;
assignment
: id=Identifier (Space)* '=' (Space)* st=stringLiteral {vars.put($id.text, $st.value);}
;
stringLiteral returns [String value]
: '"'
{StringBuilder b = new StringBuilder();}
( id=Identifier {b.append($id.text);}
| es=EscapeSequence {b.append($es.text);}
| ch=(NormalChar | Space) {b.append($ch.text);}
| in=Interpolation {b.append(vars.get($in.text.substring(2, $in.text.length()-1)));}
)*
'"'
{$value = b.toString();}
;
Interpolation
: '${' i=Identifier '}'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
EscapeSequence
: '\\' SpecialChar
;
SpecialChar
: '"' | '\\' | '$'
;
Space
: (' ' | '\t' | '\r' | '\n')
;
NormalChar
: ~SpecialChar
;
And a class with a main method to test it all:
package antlrdemo;
import org.antlr.runtime.*;
public class ANTLRDemo {
public static void main(String[] args) throws RecognitionException {
String source = "name = \"Bob\"; \n"+
"msg = \"Hello ${name}\"; \n"+
"print(msg); \n"+
"print(\"Bye \\${for} now!\"); ";
ANTLRStringStream in = new ANTLRStringStream(source);
StrLexer lexer = new StrLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
StrParser parser = new StrParser(tokens);
parser.parse();
}
}
which produces the following output:
> Hello Bob
> Bye \${for} now!
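The substitution that stringLiteral performs can also be tried out without ANTLR. Below is a small Python sketch of the same behaviour on the body of a string literal: escape sequences are kept as written (as in the output above) and ${name} is replaced from a dictionary. The function name interpolate and the regex are my own assumptions, and unlike the Java version an unknown variable raises KeyError instead of printing null.

```python
import re

# Either an escape sequence (\" \\ \$) or an interpolation ${identifier}.
PART_RE = re.compile(r'\\(["\\$])|\$\{([A-Za-z_][A-Za-z_0-9]*)\}')


def interpolate(body, variables):
    """Expand ${name} in body using variables; leave escapes untouched."""
    def replace(match):
        if match.group(1) is not None:
            return match.group(0)  # escape sequence: copy it through unchanged
        return variables[match.group(2)]
    return PART_RE.sub(replace, body)


variables = {"name": "Bob"}
print(interpolate("Hello ${name}", variables))      # Hello Bob
print(interpolate(r"Bye \${for} now!", variables))  # Bye \${for} now!
```

Matching the escape alternative first is what keeps \${for} from being treated as an interpolation.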
Again, I am no expert, but this (at least) gives you a way to solve it.
HTH.
