I am writing grammar to recognize following input
Say Hello Boss
Hello friend
Here is my complete grammar
grammar org.xtext.example.second.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/second/MyDsl"
Example:
statements+=Statement*;
Statement:
(IDLABEL)? Directives;
Directives:
TAG1 | TAG2 | TAG3 | TAG4;
TAG1: tag=('Hi'|'Hello') IDLABEL;
TAG2: tag=('Tag2') IDLABEL;
TAG3: tag=('Tag3') IDLABEL;
TAG4: tag=('Tag4') IDLABEL;
STRING_OPERANDS hidden(WS):
("*"|UNQUOTED|QUOTED)+;
terminal QUOTED:
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'";
terminal UNQUOTED:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '-' | '*' | "/" | "\\" | '(' | ')' | '$' | '=' |'#' |'.' | '"' |'#'|'+'|"'"|'<'|'>')*;
terminal IDLABEL:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9'|'='|'#')*;
For the input, Say Hello Boss
I am getting an error "missing EOF at Say"
and for the input Hello Boss
I am getting an error "mismatched input 'Boss' expecting RULE_IDLABEL"
What is wrong with this grammar?
Boss matches both the rule IDLABEL and UNQUOTED. In cases where two rules can match the current input and both rules match the same prefix, the tokenizer uses the rule that comes first. So the input Boss produces an UNQUOTED token, not an IDLABEL token.
In fact all valid IDLABELs are also valid UNQUOTEDs, so you'll never get any IDLABEL tokens.
To fix this, you can change the order of UNQUOTED and IDLABEL, so that IDLABEL comes first.
Related
I am trying to create a simple compiler using flex and bison, however all the rules i've written down have shown to be useless "14 useless nonterminals and 66 useless rules". What makes a rule useless and is there a way to fix it?
File: Decl_Class'*' 'EOF'
Decl_Class: CLASS IDENT '(' EXTENDS IDENT ')' '?' {Field'*'}
Field: Variable
|Constructor
|Method
Variable: Modifier'*' Expr_type Decl_variables
Decl_variables: Decl_variable
|Decl_variable ',' Decl_variables
Decl_variable: IDENT
|IDENT '=' Expr
Constructor: Modifier'*' IDENT '(' Expr_type | VOID ')' IDENT '(' Params '?' ')' {Instructions'*'}
Method: Modifier'*' '(' Expr_type | VOID ')' IDENT '(' Params '?' ')' {Instructions'*'}
Modifier: IDENT
Expr: IDENT
Params: '(' Expr_type ')' IDENT | '(' Expr_type ')' IDENT ',' Params
Expr_type: BOOLEAN | INT | DOUBLE | IDENT | INTEGER | REAL | TRU | FALS
| THIS | NULLVAL
| '(' Expr ')'
| Access|Access '=' Expr|Access '(' L_Expr '?' ')'
| NEW IDENT '(' L_Expr '?' ')'
| '+''+'Expr | '-''-'Expr | Expr'+''+' | Expr'-''-'
| '!'Expr | '-'Expr | '+'Expr
| Expr Operator Expr
| '(' Expr_type ')' Expr_type
Operator: "==" | "!=" | "<" | "<=" | ">" | ">=" | "+" | "-" | "*" | "/" | "%" | "&&" | "||"
Access: IDENT | Expr '.' IDENT
L_Expr: Expr | Expr ',' L_Expr
Instruction: ';'
| Expr';'
| Expr_type Decl_variables';'
| IF '(' Expr ')' Instruction
| IF '(' Expr ')' Instruction ELSE Instruction
| WHILE '(' Expr ')' Instruction
| FOR '(' L_Expr '?' ';' Expr '?' ';' L_Expr '?'')' Instruction
| FOR '(' Expr ')' Decl_variables ';' Expr '?' ';' L_Expr '?'')' Instruction { Instructions'*' }
| RETURN Expr '?'';'
Decl_Class: CLASS IDENT '(' EXTENDS IDENT ')' '?' {Field'*'}
Here the part inside the {} is a code action (which will produce a syntax error when compiled), not a reference to the Field non-terminal. So Field is never actually used and neither are the non-terminal referenced by it. That's what makes them useless: they're never used.
PS: At various places in your grammar you're using '*' and '?' in a way that suggests the intention may be to match zero or more or zero or one items respectively. Be aware that all '*' and '?' do is to match a token with the given value. There is no syntactic shortcut to repeat something or make it optional in bison - you'll need to define separate non-terminals for that.
PPS: In most (all?) languages that have ++ and -- operators, those consist of a single token not two subsequent '+' or '-' tokens (so - -x would be double negation and only --x without the space between the -s would be a decrement). So your rules for the decrement and increment operators are unusual in that regard.
Currently I'm trying to create a DSL for the class diagrams of PlantUML. I'm new to Xtext and I can't get my head around several things. Before I list my problems I show you some parts of my current grammar:
ClassUml:
{ClassUml}
'#startuml' umlElements+=(ClassElement)* '#enduml';
ClassElement:
Class
| Association;
Class:
{Class}
'class' name=ClassName
(color=ColorTag)?
('{' (classContents+=ClassContent)* '}')?;
ClassContent:
Attribute | Method;
ClassName:
(ID | STRING);
Attribute:
{Attribute}
(visibility=Visibility)? name=ID (":" type=ID)?;
Method:
{Method}
(visibility=Visibility)? name=METHID
(":" type=ID)?;
Association:
{Association}
(classFrom=[Class]
associationType=Bidirectional
classTo=[Class])
|
(classTo=[Class]
associationType=UnidirectionalLeft
classFrom=[Class])
|
(classFrom=[Class]
associationType=UnidirectionalRight
classTo=[Class])
(':' text+=(ID)*)?;
Bidirectional:
{Bidrectional}
('-' ("[" color=ColorTag "]")? '-'?)
| ('.' ("[" color=ColorTag "]")? '.'?);
UnidirectionalLeft:
{UnidirectionalLeft}
('<-' ("[" color=ColorTag "]")? '-'?)
| ('<.' ("[" color=ColorTag "]")? '.'?);
UnidirectionalRight:
{UnidirectionalRight}
((('-[' color=ColorTag "]")|'-')? '->')
| ((('.[' color=ColorTag "]")|'.')? '.>');
ColorTag:
(COLOR | HEXCODE);
enum Visibility:
PROTECTED='#'
| PRIVATE='-'
| DEFAULT='~'
| PUBLIC='+';
terminal COLOR:
"#"
('red') | ('orange');
terminal HEXCODE:
"#"
('A' .. 'F'|'0' .. '9')('A' .. 'F'|'0' .. '9')('A' .. 'F'|'0' .. '9')
('A' .. 'F'|'0' .. '9')('A' .. 'F'|'0' .. '9')('A' .. 'F'|'0' .. '9');
terminal STRING:
'"' ('\\' . | !('\\' | '"'))* '"';
terminal ID:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '\"\"' | '//' | '\\')
('a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '\"\"' | '//' | '\\' | ':')*;
I left out the other association types (--*, --o, --|>) because I've defined them in the same way.
Problems
1. The visibility enum '#' isn't working without a separation from the method / attribute name. But all the other cases (+,-,~) are fine, with and without a blank space between.
2. The associations don't seem to work in most cases. I've listed a few examples:
' Working '
Alice -* Bob : Hello
Alice - Bob
Alice .o Bob
Alice <|-[#002211]- Bob
Alice *-[#red]- Bob
Alice -[#000000]-> Bob
Alice .[#red].> Bob
' Not Working '
Alice *-- Bob
Alice --* Bob
Alice .. Bob
Alice -[#ff0022]- Bob
Alice <-- Bob
Alice ..> Bob
Alice -- Bob
I don't know how I can use cross references for classes which were defined by STRING and not ID.
Also I'm guessing the additional terminal for the method name is a weird solution and should be handled differently.
1) Color should be a parser rule not a terminal rule.
Also remove the Hex rule and simply use your changed ID rule.
Color:
"#" ('red' | 'orange' | ID);
2) Make sure you to unify the differences, for instance there is a conflict between
Bidirectional:
...
('-' ("[" ...;
and
UnidirectionalRight:
((('-[' ...;
a sequence '-[' will always match the latter version. You should create one rule AssociationType and make that work for all cases. Something like this:
Association:
{Association}
(classFrom=[Class | ClassName]
associationType=AssociationType
classTo=[Class | ClassName])
(':' text+=(ID)*)?;
AssociationType:
{AssociationType}
left?='<'? ('-'|'.') ("[" color=Color "]")? ('-'|'.') right?='>'?;
3) You could allow a STRING in the cross references, as well, by using the following syntax for the crossrefs: classFrom=[Class|ClassName]
I want to create a grammar that will parse a text file and create a tree of levels according to configurable "segmentors". This is what I have created so far, it kind of works, but will halt when a "segmentor" appears in the beginning of a text. For example, text "and location" will fail to parse. Any ideas?
Also, I'm pretty certain that the grammar could be greatly improved, so any suggestions are welcome.
grammar DocSegmentor;
#header {
package segmentor.antlr;
}
// PARSER RULES
levelOne: (levelTwo LEVEL1_SEG*)+ ;
levelTwo: (levelThree+ LEVEL2_SEG?)+ ;
levelThree: (levelFour+ LEVEL3_SEG?)+ ;
levelFour: (levelFive+ LEVEL4_SEG?)+ ;
levelFive: tokens;
tokens: (DELIM | PAREN | TEXT | WS)+ ;
// LEXER RULES
LEVEL1_SEG : '\r'? '\n'| EOF ;
LEVEL2_SEG : '.' ;
LEVEL3_SEG : ',' ;
LEVEL4_SEG : 'and' | 'or' ;
DELIM : '`' | '"' | ';' | '/' | ':' | '’' | '‘' | '=' | '?' | '-' | '_';
PAREN : '(' | ')' | '[' | ']' | '{' | '}' ;
TEXT : (('a'..'z') | ('A'..'Z') | ('0'..'9'))+ ;
WS : [ \t]+ ;
I'd definitely go with a Scala parser combinator library.
https://lihaoyi.github.io/fastparse/
https://github.com/scala/scala-parser-combinators
Those are just two examples for a library you can write by hand with little effort and tune to whatever you need. I should mention that you should go with Scalaz (https://github.com/scalaz/scalaz) if you're writing a parser monad on your own.
I wouldn't use a parser at all for that task. All you need is keyword spotting.
It's much easier and more flexibel if you just scan your text for the "segmentators" by walking over the input. This also allows to handle text of any size (e.g. by using memory mapped files) while parsers usually (ANTLR for sure) load the entire text into memory and tokenize it fully, before it comes to parsing.
How can I recognize different tokens for the same symbol in ANTLR v4? For example, in selected = $("library[title='compiler'] isbn"); the first = is an assignment, whereas the second = is an operator.
Here are the relevant lexer rules:
EQUALS
:
'='
;
OP
:
'|='
| '*='
| '~='
| '$='
| '='
| '!='
| '^='
;
And here is the parser rule for that line:
assign
:
ID EQUALS DOLLAR OPEN_PARENTHESIS QUOTES ID selector ID QUOTES
CLOSE_PARENTHESIS SEMICOLON
;
selector
:
OPEN_BRACKET ID OP APOSTROPHE ID APOSTROPHE CLOSE_BRACKET
;
This correctly parses the line, as long as I use an OP different than =.
Here is the error log:
JjQueryParser::init:34:29: mismatched input '=' expecting OP
JjQueryParser::init:34:39: mismatched input ''' expecting '\"'
JjQueryParser::init:34:46: mismatched input '"' expecting '='
The problem cannot be solved in the lexer, since the lexer does always return one token type for the same string. But it would be quite easy to resolve it in the parser. Just rewrite the rules lower case:
equals
: '='
;
op
:'|='
| '*='
| '~='
| '$='
| '='
| '!='
| '^='
;
I had the same issue. Resolved in the lexer as follows:
EQUALS: '=';
OP : '|' EQUALS
| '*' EQUALS
| '~' EQUALS
| '$' EQUALS
| '!' EQUALS
| '^' EQUALS
;
This guarantees that the symbol '=' is represented by a single token all the way. Don't forget to update the relevant rule as follows:
selector
:
OPEN_BRACKET ID (OP|EQUALS) APOSTROPHE ID APOSTROPHE CLOSE_BRACKET
;
So i have a lexer with a token defined so that on a boolean property it is enabled/disabled
I create an input stream and parse a text. My token is called PHRASE_TEXT and should match anything falling within this pattern '"' ('\\' ~[] |~('\"'|'\\')) '"' {phraseEnabled}?
I tokenize "foo bar" and as expected I get a single token. After setting the property to false on the lexer and calling setInputStream on it with the same text I get "foo , bar" so 2 tokens instead of one. This is also expected behavior.
The problem comes when setting the property to true again. I would expect the same text to tokenize to the whole 1 token "foo bar" but instead is tokenized to the 2 tokens from before. Is this a bug on my part? What am I doing wrong here? I tried using new instances of the tokenizer and reusing the same instance but it doesn't seem to work either way. Thanks in advance.
Edit : Part of my grammar follows below
grammar LuceneQueryParser;
#header{package com.amazon.platformsearch.solr.queryparser.psclassicqueryparser;}
#lexer::members {
public boolean phrases = true;
}
#parser::members {
public boolean phraseQueries = true;
}
mainQ : LPAREN query RPAREN
| query
;
query : not ((AND|OR)? not)* ;
andClause : AND ;
orClause : OR ;
not : NOT? modifier? clause;
clause : qualified
| unqualified
;
unqualified : LBRACK range_in LBRACK
| LCURL range_out RCURL
| truncated
| {phraseQueries}? quoted
| LPAREN query RPAREN
| normal
;
truncated : TERM_TEXT_TRUNCATED;
range_in : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
range_out : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
qualified : TERM_TEXT COLON unqualified ;
normal : TERM_TEXT;
quoted : PHRASE_TEXT;
modifier : PLUS
| MINUS
;
PHRASE_TEXT : '"' (ESCAPE|~('\"'|'\\'))+ '"' {phrases}?;
TERM_TEXT : (TERM_CHAR|ESCAPE)+;
TERM_CHAR : ~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\\' | '\'' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '*' | '|' | '&' | '?' );
ESCAPE : '\\' ~[];
The problem seems to be that after i set the phrases to false, and then to true again, no more tokens seem to be recognized as PHRASE_TEXT. I know that as a guideline i should define my grammars to be unambiguous but this is basically the way it has to end up looking : tokenizing a string with quotes in 2 different modes, depending on the situation.
I'm gonna have to update this with an answer a colleague of mine helpfully pointed out. The lexer generated class has a static DFA[] array shared between all instances of the class. Once the property was set to false instead of the default true the decision tree was apparently changed for all object instances. A fix for this was to have to separate DFA[] arrays for both the true and false instances of the property i was modifying. I think making that array not static would be too expensive and i really can't think about another fix.