DSL with Xtext for PlantUML

I'm trying to create a DSL for PlantUML class diagrams. I'm new to Xtext and can't get my head around several things. Before I list my problems, here are some parts of my current grammar:
ClassUml:
{ClassUml}
'@startuml' umlElements+=(ClassElement)* '@enduml';
ClassElement:
Class
| Association;
Class:
{Class}
'class' name=ClassName
(color=ColorTag)?
('{' (classContents+=ClassContent)* '}')?;
ClassContent:
Attribute | Method;
ClassName:
(ID | STRING);
Attribute:
{Attribute}
(visibility=Visibility)? name=ID (":" type=ID)?;
Method:
{Method}
(visibility=Visibility)? name=METHID
(":" type=ID)?;
Association:
{Association}
(classFrom=[Class]
associationType=Bidirectional
classTo=[Class])
|
(classTo=[Class]
associationType=UnidirectionalLeft
classFrom=[Class])
|
(classFrom=[Class]
associationType=UnidirectionalRight
classTo=[Class])
(':' text+=(ID)*)?;
Bidirectional:
{Bidirectional}
('-' ("[" color=ColorTag "]")? '-'?)
| ('.' ("[" color=ColorTag "]")? '.'?);
UnidirectionalLeft:
{UnidirectionalLeft}
('<-' ("[" color=ColorTag "]")? '-'?)
| ('<.' ("[" color=ColorTag "]")? '.'?);
UnidirectionalRight:
{UnidirectionalRight}
((('-[' color=ColorTag "]")|'-')? '->')
| ((('.[' color=ColorTag "]")|'.')? '.>');
ColorTag:
(COLOR | HEXCODE);
enum Visibility:
PROTECTED='#'
| PRIVATE='-'
| DEFAULT='~'
| PUBLIC='+';
terminal COLOR:
"#"
('red') | ('orange');
terminal HEXCODE:
"#"
('A' .. 'F'|'0' .. '9')('A' .. 'F'|'0' .. '9')('A' .. 'F'|'0' .. '9')
('A' .. 'F'|'0' .. '9')('A' .. 'F'|'0' .. '9')('A' .. 'F'|'0' .. '9');
terminal STRING:
'"' ('\\' . | !('\\' | '"'))* '"';
terminal ID:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '\"\"' | '//' | '\\')
('a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '\"\"' | '//' | '\\' | ':')*;
I left out the other association types (--*, --o, --|>) because I've defined them in the same way.
Problems
1. The visibility enum '#' isn't working without a separation from the method / attribute name. But all the other cases (+,-,~) are fine, with and without a blank space between.
2. The associations don't seem to work in most cases. I've listed a few examples:
' Working '
Alice -* Bob : Hello
Alice - Bob
Alice .o Bob
Alice <|-[#002211]- Bob
Alice *-[#red]- Bob
Alice -[#000000]-> Bob
Alice .[#red].> Bob
' Not Working '
Alice *-- Bob
Alice --* Bob
Alice .. Bob
Alice -[#ff0022]- Bob
Alice <-- Bob
Alice ..> Bob
Alice -- Bob
3. I don't know how I can use cross references for classes which were defined by STRING and not ID.
4. I'm also guessing that the additional terminal for the method name is a weird solution and should be handled differently.

1) Color should be a parser rule not a terminal rule.
Also remove the Hex rule and simply use your changed ID rule.
Color:
"#" ('red' | 'orange' | ID);
2) Make sure to unify the different association rules; for instance, there is a conflict between
Bidirectional:
...
('-' ("[" ...;
and
UnidirectionalRight:
((('-[' ...;
The character sequence '-[' will always be lexed as the single keyword token used by the latter rule, so the first rule can never match it. You should create one rule AssociationType and make that work for all cases. Something like this:
Association:
{Association}
(classFrom=[Class | ClassName]
associationType=AssociationType
classTo=[Class | ClassName])
(':' text+=(ID)*)?;
AssociationType:
{AssociationType}
left?='<'? ('-'|'.') ("[" color=Color "]")? ('-'|'.') right?='>'?;
3) You could allow a STRING in the cross references, as well, by using the following syntax for the crossrefs: classFrom=[Class|ClassName]
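The '-[' conflict from point 2 can be reproduced outside Xtext. The following is only a toy model in Python, not Xtext's generated lexer: the function lex and both keyword lists are assumptions made for illustration, loosely lifted from the grammar fragments above. It takes the longest keyword at each position, which is roughly what happens to the keywords of a grammar.

```python
def lex(text, keywords):
    """Greedy toy lexer: at each position take the longest keyword that matches."""
    ordered = sorted(keywords, key=len, reverse=True)  # longest candidates first
    tokens, pos = [], 0
    while pos < len(text):
        for kw in ordered:
            if text.startswith(kw, pos):
                tokens.append(kw)
                pos += len(kw)
                break
        else:
            raise ValueError(f"no keyword matches at offset {pos}: {text[pos:]!r}")
    return tokens


# Assumed keyword set of the question's grammar: '-[' exists because
# UnidirectionalRight uses it, and '#red' stands in for the COLOR terminal.
ORIGINAL = ["-", ".", "[", "]", "-[", ".[", "->", ".>", "<-", "<.", "#red"]

# Assumed keyword set after unifying the rules: only single pieces remain.
UNIFIED = ["-", ".", "[", "]", "<", ">", "#", "red"]

ARROW = "-[#red]-"
original_tokens = lex(ARROW, ORIGINAL)  # starts with the combined '-[' token
unified_tokens = lex(ARROW, UNIFIED)    # starts with '-' followed by '['
```

With the original keyword set the input starts with one '-[' token, so a rule that expects '-' followed by "[" never gets a chance. With the unified set the same input splits into the pieces that AssociationType expects.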

Related

missing EOF at 'Say'

I am writing grammar to recognize following input
Say Hello Boss
Hello friend
Here is my complete grammar
grammar org.xtext.example.second.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/second/MyDsl"
Example:
statements+=Statement*;
Statement:
(IDLABEL)? Directives;
Directives:
TAG1 | TAG2 | TAG3 | TAG4;
TAG1: tag=('Hi'|'Hello') IDLABEL;
TAG2: tag=('Tag2') IDLABEL;
TAG3: tag=('Tag3') IDLABEL;
TAG4: tag=('Tag4') IDLABEL;
STRING_OPERANDS hidden(WS):
("*"|UNQUOTED|QUOTED)+;
terminal QUOTED:
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'";
terminal UNQUOTED:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '-' | '*' | "/" | "\\" | '(' | ')' | '$' | '=' |'#' |'.' | '"' |'#'|'+'|"'"|'<'|'>')*;
terminal IDLABEL:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9'|'='|'#')*;
For the input, Say Hello Boss
I am getting an error "missing EOF at Say"
and for the input Hello Boss
I am getting an error "mismatched input 'Boss' expecting RULE_IDLABEL"
What is wrong with this grammar?
Boss matches both the rule IDLABEL and UNQUOTED. In cases where two rules can match the current input and both rules match the same prefix, the tokenizer uses the rule that comes first. So the input Boss produces an UNQUOTED token, not an IDLABEL token.
In fact all valid IDLABELs are also valid UNQUOTEDs, so you'll never get any IDLABEL tokens.
To fix this, you can change the order of UNQUOTED and IDLABEL, so that IDLABEL comes first.
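The effect of rule order can be sketched with a small Python model. This is an illustration only, not the lexer Xtext generates: first_token is a hypothetical helper, the two regular expressions are simplified versions of the terminals above (quote characters are left out, and + replaces * so that neither pattern matches the empty string).

```python
import re


def first_token(text, terminals):
    """Return the name of the terminal that wins at the start of `text`.

    The longest match wins; on a tie, the terminal listed first wins.
    """
    best_name, best_len = None, 0
    for name, pattern in terminals:
        m = re.match(pattern, text)
        # Strict '>' keeps the earlier terminal when two matches have equal length.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name


# Simplified stand-ins for the terminals in the question (assumption).
UNQUOTED = r"[a-zA-Z_0-9*/()$=#.+<>\\-]+"
IDLABEL = r"[a-zA-Z_0-9=#]+"

AS_POSTED = [("UNQUOTED", UNQUOTED), ("IDLABEL", IDLABEL)]
REORDERED = [("IDLABEL", IDLABEL), ("UNQUOTED", UNQUOTED)]

posted_result = first_token("Boss", AS_POSTED)     # UNQUOTED shadows IDLABEL
reordered_result = first_token("Boss", REORDERED)  # IDLABEL now wins the tie
longer_result = first_token("a-b", REORDERED)      # longest match still applies
```

After the reordering, input such as a-b still becomes UNQUOTED, because only UNQUOTED can match all three characters.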

Calculate first, follow and predict from EBNF notation

Grammatical rules are defined as:
an integer literal is a sequence of digits;
a boolean literal is one of true or false;
a keyword is one of if, while, or the boolean literals;
a variable is a string that starts with a letter and is followed by letters or digits, and
is not a keyword;
an operator is one of <= >= == != && || = + - * < >
punctuation is one of the ( ) { } , ; characters.
Based on the description, I wrote out the grammar in EBNF notation as follows:
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
int literal = digit {digit} ;
bool = "true" | "false" ;
keyword = "if" | "while" | bool ;
letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
"Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" |
"h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
"y" | "z" ;
variable = (letter {digit | letter}) -keyword ;
operator = "<=" |">=" | "==" | "!=" | "&&" | "||" | "=" | "+" | "-" | "*" | "<" | ">" | "!" ;
punctuation = "(" | ")" | "{" | "}" | "," | ";" ;
Now I want to calculate the FIRST, FOLLOW and PREDICT sets, but I'm not sure how to do that from EBNF notation. Should I first change it to Chomsky normal form? If so, how? Would this be right?
DIGIT -> 0 1 2 3 ...
INT -> DIGIT | DIGIT DIGIT
BOOL -> true false
KEYWORD -> if while BOOL
LETTER -> A B C D ...
VARIABLE -> LETTER | LETTER DIGIT | LETTER LETTER
First and follow are pretty straightforward, even with EBNF. In this case, they are even easier, since you have no nullable non-terminals. You do need to watch out for repetition groups, since the repetition count can be 0. If you have:
... A { X ... } Y ...
then FOLLOW(A) must include both FIRST(X) and FIRST(Y). And if you have
C -> A { X }
then FOLLOW(A) must include FOLLOW(C).
None of this should be complicated if you're doing the computation by hand. For an automated solution, I would probably unroll the repetition operators into unextended BNF by creating new non-terminals, but you could do the computation directly on the EBNF as well.
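As a rough sketch of the automated route, the Python below unrolls C -> A { X } into plain BNF with a helper non-terminal and computes FIRST and FOLLOW by fixed-point iteration. The grammar is hypothetical: the names S, R, a, x and y, and the rule S -> C Y that gives C something to be followed by, are assumptions made for the example.

```python
EPS = ""  # marker for the empty string

# Hypothetical grammar: "C -> A { X }" unrolled via the helper non-terminal R.
GRAMMAR = {
    "S": [["C", "Y"]],
    "C": [["A", "R"]],
    "R": [["X", "R"], []],  # R -> X R | epsilon
    "A": [["a"]],
    "X": [["x"]],
    "Y": [["y"]],
}
START = "S"


def first_of_sequence(symbols, first):
    """FIRST of a symbol sequence; contains EPS if the whole sequence is nullable."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:  # repeat until no set grows any more
        changed = False
        for nt, alternatives in GRAMMAR.items():
            for alt in alternatives:
                new = first_of_sequence(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(first):
    follow = {nt: set() for nt in GRAMMAR}
    follow[START].add("$")  # end-of-input marker
    changed = True
    while changed:
        changed = False
        for nt, alternatives in GRAMMAR.items():
            for alt in alternatives:
                for i, sym in enumerate(alt):
                    if sym not in GRAMMAR:
                        continue  # terminals have no FOLLOW set
                    rest = first_of_sequence(alt[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:  # everything after sym can vanish
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
```

Here FOLLOW["A"] ends up containing both x (from FIRST(X), because the repetition may run at least once) and y (from FOLLOW(C), because the repetition may run zero times).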
The one wrinkle is your use of the set difference operator -, in
variable = (letter {digit | letter}) - keyword ;
In this particular case, it does not create any difficulties, but the general solution is tricky. In fact, since there is no guarantee that the difference between two context-free languages is context-free, it will not really be possible to find a truly general solution.
Predict sets are another story. Indeed, I'm not even 100% sure what a predict set would be for EBNF, since you need to be able to predict repetition of a subpattern, not just derivations. Again, expanding to BNF might help, but it can happen that the expansion creates a predict conflict which didn't exist in the original grammar.
The grammar you present is incomplete, so I don't know how useful computing LL(1) sets will be. I suppose that it is intended to be just the lexical part of the grammar, but really there is a reason why lexical analysis is usually done with regular expressions rather than context-free parsing.
Several reasons, really: aside from the fact that lexical analysis usually involves reasonably readable regular expressions, there is also the important fact that lexical analysis does not usually involve parsing the internal structure of a token. That lets you choose to simply recognize a repeated element rather than worrying about whether the parse tree for the repetition should be left- or right-leaning.
The key insight about computing FIRST and FOLLOW sets is that they mean just what their names indicate. The FIRST set of a non-terminal is precisely the set of tokens which can begin a complete derivation from the non-terminal; similarly, the FOLLOW set is precisely the set of tokens which might immediately follow the non-terminal during a derivation from the start symbol. In many simple grammars, these sets can be computed by inspection; that certainly should be the case for your grammar, at least for the FIRST sets.
The fact that you have no start symbol here is another indication that you are probably not solving the right problem; without a start symbol, there is no meaningful definition of FOLLOW.
If you are trying to do lexical analysis, you might be able to get away with:
start -> { token }
token -> int literal | keyword | identifier | ...
Although to be formally correct, you'd also need to handle "ignored tokens" such as comments and whitespace.
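For comparison, here is a minimal sketch of the same lexical rules written as regular expressions in Python. It follows the prose description in the question rather than the EBNF, so "!" on its own is not an operator. The token kind names and the choice to label true and false as "bool" rather than "keyword" are assumptions. The set difference variable - keyword is handled by matching a word first and classifying it afterwards.

```python
import re

KEYWORDS = {"if", "while"}
BOOLEANS = {"true", "false"}

# Alternatives are tried left to right, so two-character operators come first.
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<int>[0-9]+)
  | (?P<word>[A-Za-z][A-Za-z0-9]*)
  | (?P<op><=|>=|==|!=|&&|\|\||[=+\-*<>])
  | (?P<punct>[(){},;])
""", re.VERBOSE)


def tokenize(text):
    """Split `text` into (kind, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue  # whitespace is an ignored token
        if kind == "word":
            # Classify after matching: this implements "variable minus keyword".
            if lexeme in BOOLEANS:
                kind = "bool"
            elif lexeme in KEYWORDS:
                kind = "keyword"
            else:
                kind = "variable"
        tokens.append((kind, lexeme))
    return tokens


example = tokenize("if (x1 <= 10) { y = true; }")
```

Because a whole word is matched before it is classified, while1 is a variable and not the keyword while followed by an integer.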

Xtext : prevent for matching keywords from another rule

I'm looking for a way to prevent KEYWORDS matching at a place where those KEYWORDS are not expected.
Take a look at the following grammar. Both 'APPLY' and 'OUTPUT' are keywords.
'OUTPUT' has an argument that contains any characters.
Everything works fine but if this argument contains the word APPLY, an error is raised (extraneous input APPLY expecting RULE_END).
Is there a way to solve this issue?
Thanks.
Sample text
APPLY, 'an id' $
OUTPUT, A text $
OUTPUT, A text with the word APPLY $
DSL
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
statement+=Statement*;
Statement:
ApplyStatement | OutputStatement;
OutputStatement:
'OUTPUT' ',' out+=EXTENDLABEL* end=END;
ApplyStatement:
'APPLY' ',' id=LABELIDENTIFIER end=END;
terminal fragment LETTER:
'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T'
| 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' |
'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z';
terminal LABELIDENTIFIER:
"'"->"'";
terminal EXTENDLABEL:
(LETTER) (LETTER)*;
terminal END:
'$' !('\n' | '\r')*;
I see a few different ways your issue can be handled. First of all, you could escape keywords where they appear as plain text: the Xbase language, for example, uses '^' as an escape character, so a keyword prefixed with '^' is treated as an identifier. Similarly, it would help a lot if you put your string inside specific delimiters, e.g. apostrophes. Of course, these solutions require changing your language itself, which may or may not be an option.
You might also replace your EXTENDLABEL terminal with a datatype rule. This allows greater flexibility with regard to conflict resolution; in the worst case you could add the language keywords as alternatives. This route was suggested to me in a tangentially related case in the Eclipse forums.
Another solution is to change the type of the token before the parser uses it. Tokens are produced by the lexer, and the parser takes these tokens as input to build the AST. So the idea is to rewrite the tokens before passing them to the parser.
To do this, you need to bind your own parser:
@Override
public Class<? extends IParser> bindIParser() {
return ModelParser.class;
}
Note: your parser must extend the parser generated from your grammar.
Then override the following method to introduce your own token stream:
override protected XtextTokenStream createTokenStream(TokenSource tokenSource) {
return new TokenSource(tokenSource, getTokenDefProvider());
}
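Your own token stream class (named TokenSource in the snippet above, which collides with the ANTLR interface of the same name, so a different name is advisable) needs to extend XtextTokenStream.
Next, override the method LT as follows: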
override LT(int k) {
var Token token = super.LT(k)
if(token != null && token.text != null) token.tokenOverride(k);
token
}
Then you just need to change the token type:
def void tokenOverride(Token token, int index){
switch (token.text){
case "APPLY" : {
overrideType(token, InternalModelParser.RULE_ID);
}
}
}
def void overrideType(Token token, int i) {
token.type = i
}
Note: don't forget to add your own condition before changing the token type; in this example every 'APPLY' token becomes an ID.
And of course, inside the switch you can compare the token type of 'APPLY' instead of the token text.
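The "add your own condition" part is the important one. Below is a language-neutral sketch in Python of what such a condition could look like. It is not Xtext or ANTLR code: the Token class, the token type constants and retype_keywords are all hypothetical. The idea is to retype keyword tokens only while the stream is inside an OUTPUT statement.

```python
from dataclasses import dataclass

# Hypothetical token type constants; a generated parser defines its own.
KW_APPLY, KW_OUTPUT, COMMA, EXTENDLABEL, LABELIDENTIFIER, END = range(6)


@dataclass
class Token:
    type: int
    text: str


def retype_keywords(tokens):
    """Turn keyword tokens into EXTENDLABEL tokens inside an OUTPUT statement."""
    inside_output = False
    for token in tokens:
        if inside_output:
            if token.type == END:
                inside_output = False
            elif token.type in (KW_APPLY, KW_OUTPUT):
                token.type = EXTENDLABEL  # keyword is plain text here
        elif token.type == KW_OUTPUT:
            inside_output = True
    return tokens


# Token stream for:  APPLY, 'an id' $   OUTPUT, A text with the word APPLY $
stream = [
    Token(KW_APPLY, "APPLY"), Token(COMMA, ","),
    Token(LABELIDENTIFIER, "'an id'"), Token(END, "$"),
    Token(KW_OUTPUT, "OUTPUT"), Token(COMMA, ","),
    Token(EXTENDLABEL, "A"), Token(EXTENDLABEL, "text"),
    Token(EXTENDLABEL, "with"), Token(EXTENDLABEL, "the"),
    Token(EXTENDLABEL, "word"), Token(KW_APPLY, "APPLY"), Token(END, "$"),
]
retyped = retype_keywords(stream)
```

The first APPLY keeps its keyword type, so ApplyStatement still parses, while the APPLY inside the OUTPUT text is handed to the parser as an ordinary label.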

Can xtext grammar match all ID except some keywords?

Can the Xtext lexer emit whatever it can't recognize as a special token? Like
terminal USE: 'use';
terminal SELECT: 'select';
terminal OTHER_KEYWORDS: /* not 'use' nor 'select' */;
I wrote a grammar like this:
terminal fragment A: 'a' | 'A';
...
terminal fragment Z: 'z' | 'Z';
terminal fragment LETTER: 'a'..'z' | 'A'..'Z';
terminal fragment A_: 'b'..'z' | 'B'..'Z';
...
terminal fragment Z_: 'a'..'y' | 'A'..'Y';
terminal fragment SU_: 'a'..'r' | 't' | 'v'..'z' | 'A'..'R' | 'T' | 'V'..'Z';
terminal OTHER_KEYWORDS:
SU_ LETTER* |
U S_ LETTER* |
U S E_ LETTER* |
S E_ LETTER* |
S E L_ LETTER* |
S E L E_ LETTER* |
S E L E C_ LETTER* |
S E L E C T_ LETTER*
;
The reason I want to do this is that ANTLR fails on that kind of typo and then fails for everything it parses afterwards. If there is another way to avoid the parse failure, I don't need this error-prone and clumsy workaround.
I found out that simply using ID to consume the other garbage in the input stream works.
terminal USE: 'use';
terminal SELECT: 'select';
...
terminal TYPO: ID;
So if I have us e, us will be parsed as an ID; if I have use, use will be parsed as a USE. The order of terminal tokens is important.

ANTLR lexer disabling tokens then reenabling them not working as expected

I have a lexer with a token that is enabled or disabled by a boolean property.
I create an input stream and parse a text. My token is called PHRASE_TEXT and should match anything that fits this pattern: '"' ('\\' ~[] |~('\"'|'\\')) '"' {phraseEnabled}?
I tokenize "foo bar" and, as expected, I get a single token. After setting the property to false on the lexer and calling setInputStream on it with the same text, I get the two tokens "foo and bar" instead of one. This is also expected behavior.
The problem comes when setting the property to true again. I would expect the same text to tokenize to the single token "foo bar", but instead it is tokenized into the 2 tokens from before. Is this a bug on my part? What am I doing wrong here? I tried both using new instances of the tokenizer and reusing the same instance, but it doesn't work either way. Thanks in advance.
Edit : Part of my grammar follows below
grammar LuceneQueryParser;
@header{package com.amazon.platformsearch.solr.queryparser.psclassicqueryparser;}
@lexer::members {
public boolean phrases = true;
}
@parser::members {
public boolean phraseQueries = true;
}
mainQ : LPAREN query RPAREN
| query
;
query : not ((AND|OR)? not)* ;
andClause : AND ;
orClause : OR ;
not : NOT? modifier? clause;
clause : qualified
| unqualified
;
unqualified : LBRACK range_in RBRACK
| LCURL range_out RCURL
| truncated
| {phraseQueries}? quoted
| LPAREN query RPAREN
| normal
;
truncated : TERM_TEXT_TRUNCATED;
range_in : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
range_out : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
qualified : TERM_TEXT COLON unqualified ;
normal : TERM_TEXT;
quoted : PHRASE_TEXT;
modifier : PLUS
| MINUS
;
PHRASE_TEXT : '"' (ESCAPE|~('\"'|'\\'))+ '"' {phrases}?;
TERM_TEXT : (TERM_CHAR|ESCAPE)+;
TERM_CHAR : ~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\\' | '\'' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '*' | '|' | '&' | '?' );
ESCAPE : '\\' ~[];
The problem seems to be that after I set phrases to false, and then to true again, no more tokens are recognized as PHRASE_TEXT. I know that as a guideline I should define my grammars to be unambiguous, but this is basically the way it has to end up looking: tokenizing a string with quotes in 2 different modes, depending on the situation.
I'm going to update this with an answer that a colleague of mine helpfully pointed out. The generated lexer class has a static DFA[] array shared between all instances of the class. Once the property was set to false instead of the default true, the decision tree was apparently changed for all object instances. A fix for this was to have two separate DFA[] arrays, one for the true and one for the false value of the property I was modifying. I think making that array non-static would be too expensive, and I really can't think of another fix.
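To make the shared-state explanation concrete, here is a toy model in Python. It is not how ANTLR is implemented; it is a hypothetical sketch that merely reproduces the symptoms described above. Both classes and the caching policy (only the predicate-free fallback decision is cached) are assumptions.

```python
class SharedCacheLexer:
    """Toy lexer whose decision cache is a class attribute shared by all instances."""

    _cache = {}  # plays the role of the static DFA[] array

    def __init__(self):
        self.phrases = True

    def _key(self, text):
        return text  # the predicate value is NOT part of the key

    def tokenize(self, text):
        cache = type(self)._cache
        key = self._key(text)
        if key in cache:
            return list(cache[key])  # cached decision, predicate never consulted
        if self.phrases and len(text) >= 2 and text[0] == text[-1] == '"':
            return [text]  # one PHRASE_TEXT token (predicated path, not cached)
        terms = text.split()
        cache[key] = terms  # fallback decision is cached for every instance
        return list(terms)


class PerPredicateCacheLexer(SharedCacheLexer):
    """Same lexer, but cached decisions are kept separate per predicate value."""

    _cache = {}

    def _key(self, text):
        return (self.phrases, text)
```

In this model, once any SharedCacheLexer instance has tokenized '"foo bar"' with phrases set to false, every instance, including a newly created one, returns two tokens for that input. PerPredicateCacheLexer mirrors the fix of keeping separate cached state for the true and false cases.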
