How to create EBNF grammar for a list in xtext? - xtext

I'm trying to create a list in Xtext. If anybody can help me create a grammar for that, it would be really helpful. I tried writing this, but it's not in the Xtext format, so I am getting errors.
List:
'List' name=ID type = Nlist;
Nlist:
Array | Object
;
Array:
{Array} "[" values*=Value[','] "]"
;
Value:
STRING | FLOAT | BOOL | Object | Array | "null"
;
Object:
"{" members*=Member[','] "}"
;
Member:
key=STRING ':' value=Value
I'm new to this, so any help will be appreciated.
Thank you.

The usual Xtext syntax for a comma-separated list is, for example:
MyList: '#[' (elements+=Element (',' elements+=Element )*)? ']';
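To make the shape of that rule concrete, here is a minimal hand-written Python sketch of what `'#[' (Element (',' Element)*)? ']'` accepts. It is only an illustration, not anything Xtext generates: the function name `parse_my_list` is hypothetical, and Element is assumed to be a plain identifier.

```python
import re

# Tokens: the '#[' opener, ']', ',', and identifiers (assumed stand-in for Element).
TOKEN_RE = re.compile(r"\s*(#\[|\]|,|[A-Za-z_]\w*)")
NAME_RE = re.compile(r"[A-Za-z_]\w*")

def parse_my_list(text):
    """Hand-written analogue of: '#[' (Element (',' Element)*)? ']'."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()

    def is_element(index):
        return index < len(tokens) and NAME_RE.fullmatch(tokens[index]) is not None

    if not tokens or tokens[0] != "#[":
        raise SyntaxError("expected '#['")
    elements, i = [], 1
    if is_element(i):                      # optional group: ( ... )?
        elements.append(tokens[i])
        i += 1
        while i < len(tokens) and tokens[i] == ",":   # repeated group: (',' Element)*
            i += 1
            if not is_element(i):
                raise SyntaxError("expected an element after ','")
            elements.append(tokens[i])
            i += 1
    if i != len(tokens) - 1 or tokens[i] != "]":
        raise SyntaxError("expected a single closing ']'")
    return elements
```

The outer `?` is what allows the empty list `#[]`, and putting the comma inside the repeated group is what rejects a trailing comma such as `#[a,]`.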

Related

xtext not accepting string constant - expecting RULE_ID

I have tried to cut down my problem to the simplest problem I can in xtext - I would like to use the following grammar:
M: lines += T*;
T:
DT
| BDT
| N
;
BDT:
name = ('a' | 'b' | 'c')
;
DT:
'd' name=ID
('(' (ts += BDT (','ts += BDT)*) ')')?
;
N:
'n' name=ID ':' type=[T]
;
I am intending to parse expressions of the form d f(a,b,b) for example which works fine. I would also like to be able to parse n g:f which also works, but not n g:a - where a here is part of the BDT rule. The error given is "Missing RULE_ID at 'a'".
I'd like to allow the grammar to parse n g:a for example, and I'd be very grateful if anyone could point out where I'm going wrong here on this very simple grammar.
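Lexing is context-free, so a keyword can never be an ID: once 'a' is a keyword of the grammar, the lexer always emits it as that keyword and never as RULE_ID. You can address this through parser rules.
You can introduce a datatype rule
MyID: ID | "a" | ... | "c";
and use it wherever you currently use ID, including in the cross-reference (type=[T|MyID]).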
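The effect can be sketched in a few lines of Python. This is a toy model, not Xtext's generated lexer: the keyword set mirrors the literals in the question's grammar, and the names `lex`, `parse_id` and `parse_my_id` are made up for the illustration.

```python
import re

# Literals used as keywords in the question's grammar (assumed complete for this toy).
KEYWORDS = {"a", "b", "c", "d", "n"}
TOKEN_RE = re.compile(r"\s*(?:(?P<word>[A-Za-z_]\w*)|(?P<punct>[():,]))")

def lex(text):
    """Context-free lexing: a word that is a keyword is never reported as an ID."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        word = match.group("word")
        if word is not None:
            tokens.append(("KEYWORD" if word in KEYWORDS else "ID", word))
        else:
            tokens.append(("PUNCT", match.group("punct")))
        pos = match.end()
    return tokens

def parse_id(token):
    """Plain ID: rejects keywords, which is the 'Missing RULE_ID' situation."""
    kind, value = token
    if kind != "ID":
        raise SyntaxError(f"missing ID at {value!r}")
    return value

def parse_my_id(token):
    """Analogue of the datatype rule MyID: ID | 'a' | 'b' | 'c'."""
    kind, value = token
    if kind == "ID" or (kind == "KEYWORD" and value in {"a", "b", "c"}):
        return value
    raise SyntaxError(f"missing MyID at {value!r}")
```

For the input `n g:a`, the last token comes out as a keyword, so `parse_id` rejects it while `parse_my_id` accepts it. The decision moves from the lexer, which has no context, to a parser-level rule.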

ANTLR grammar for multi-level text segmentation

I want to create a grammar that will parse a text file and create a tree of levels according to configurable "segmentors". This is what I have created so far, it kind of works, but will halt when a "segmentor" appears in the beginning of a text. For example, text "and location" will fail to parse. Any ideas?
Also, I'm pretty certain that the grammar could be greatly improved, so any suggestions are welcome.
grammar DocSegmentor;
@header {
package segmentor.antlr;
}
// PARSER RULES
levelOne: (levelTwo LEVEL1_SEG*)+ ;
levelTwo: (levelThree+ LEVEL2_SEG?)+ ;
levelThree: (levelFour+ LEVEL3_SEG?)+ ;
levelFour: (levelFive+ LEVEL4_SEG?)+ ;
levelFive: tokens;
tokens: (DELIM | PAREN | TEXT | WS)+ ;
// LEXER RULES
LEVEL1_SEG : '\r'? '\n'| EOF ;
LEVEL2_SEG : '.' ;
LEVEL3_SEG : ',' ;
LEVEL4_SEG : 'and' | 'or' ;
DELIM : '`' | '"' | ';' | '/' | ':' | '’' | '‘' | '=' | '?' | '-' | '_';
PAREN : '(' | ')' | '[' | ']' | '{' | '}' ;
TEXT : (('a'..'z') | ('A'..'Z') | ('0'..'9'))+ ;
WS : [ \t]+ ;
I'd definitely go with a Scala parser combinator library.
https://lihaoyi.github.io/fastparse/
https://github.com/scala/scala-parser-combinators
Those are just two examples of the kind of library you could also write by hand with little effort and tune to whatever you need. I should mention that you should go with Scalaz (https://github.com/scalaz/scalaz) if you're writing a parser monad of your own.
I wouldn't use a parser at all for that task. All you need is keyword spotting.
It's much easier and more flexible to just scan your text for the "segmentors" by walking over the input. This also lets you handle text of any size (e.g. by using memory-mapped files), while parsers usually (ANTLR for sure) load the entire text into memory and tokenize it fully before parsing starts.
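As a rough illustration of the keyword-spotting approach, here is a minimal Python sketch. The level patterns are copied from the question's lexer rules; the function name `segment`, the nested-list output, and the word-boundary check around and/or are my own assumptions. For brevity it works on an in-memory string rather than a memory-mapped file.

```python
import re

# Segmentor patterns per level, outermost first (from the question's lexer rules).
LEVEL_SEGMENTORS = [
    r"\r?\n",            # LEVEL1_SEG
    r"\.",               # LEVEL2_SEG
    r",",                # LEVEL3_SEG
    r"\b(?:and|or)\b",   # LEVEL4_SEG, word boundaries added as an assumption
]

def segment(text, level=0):
    """Split text recursively: one level of list nesting per segmentor level."""
    if level == len(LEVEL_SEGMENTORS):
        return text.strip()
    parts = re.split(LEVEL_SEGMENTORS[level], text)
    # Empty pieces (e.g. before a leading segmentor) are simply dropped.
    return [segment(part, level + 1) for part in parts if part.strip()]
```

Because empty pieces are dropped, a segmentor at the very start of the text, as in "and location", is not a special case. Making the levels configurable amounts to changing the list of patterns.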

Shift/Reduce conflict in variable declaration

I am using bison to write a parser for a C-like grammar, but I'm having a problem with variable declarations.
My variables can be simple variables, arrays or structs, so I can have a variable like a.b[3].c.
I have the following rule:
var : NAME /* NAME is a string */
| var '[' expr ']'
| var '.' var
;
which gives me a shift/reduce conflict, but I can't think of a way to solve this.
How can I rewrite the grammar or use Bison precedence to resolve the problem?
I don't know why you want the . and [] operators to be associative; however, something like this might inspire you:
%union{}
%token NAME
%start var
%left '.'
%left '['
%%
var:
NAME
| var '[' expr ']'
| var '.' var
Also, as var probably appears in expr, there may be other problems with the grammar.
It's certainly possible to resolve this shift-reduce conflict with a precedence declaration, but in my opinion that just makes the grammar harder to read. It's better to make the grammar reflect the semantics of the language.
A var (what some might call an lvalue or a reference) is either the name of a simple scalar variable or an expression built up iteratively from a var naming a complex value and a selector, either a member name or an array index. In other words:
var : NAME
| var selector
;
selector: '.' NAME
| '[' expression ']'
;
This makes it clear what the meaning of a.b[3].c is, for example; the semantics are described by the parse tree:
   a.b[3].c
 +----+----+
 |         |
a.b[3]     .c
 +---+--+
 |      |
a.b    [3]
 +-+-+
 |   |
 a   .b
It's not necessary to create the selector rule; it would be trivial to wrap the two rules together. If you don't feel that it is clearer as written above, the following will work just as well:
var: NAME
| var '.' NAME
| var '[' expression ']'
;
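To see why the left-recursive rule gives that tree, here is a small Python sketch that mimics it with a loop: start from a NAME, then keep wrapping the result in one selector at a time. It is only an illustration of the grammar's shape, not Bison output; the tuple node format is invented, and `expression` is reduced to an integer literal to keep the example short.

```python
import re

TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|\d+|[.\[\]])")
NAME_RE = re.compile(r"[A-Za-z_]\w*")

def tokenize(text):
    """Split the input into names, integer literals and the symbols . [ ]"""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_var(tokens):
    """var: NAME | var '.' NAME | var '[' expression ']' (expression = integer here)."""
    pos = 0

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        return token

    def take_name():
        token = take()
        if not NAME_RE.fullmatch(token):
            raise SyntaxError(f"expected NAME, got {token!r}")
        return token

    node = take_name()                        # var: NAME
    while pos < len(tokens):
        selector = take()
        if selector == ".":                   # var: var '.' NAME
            node = ("member", node, take_name())
        elif selector == "[":                 # var: var '[' expression ']'
            index = take()
            if not index.isdigit():
                raise SyntaxError(f"expected integer index, got {index!r}")
            if take() != "]":
                raise SyntaxError("expected ']'")
            node = ("index", node, int(index))
        else:
            raise SyntaxError(f"unexpected token {selector!r}")
    return node
```

Each pass through the loop corresponds to one reduction of `var: var selector`, so the earlier part of the input always ends up deeper in the tree. That is the left-nested structure shown in the diagram.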

Xtext multiple cross references

I need an Xtext grammar rule (or several) that works like the following:
1: CollectionGetElement:
2: val=[VariableReference] '='
3: (ref=[List] | ref=[Bytefield] | ref=[Map])
4: '[' keys+=GetElementKeyType ']' ('[' keys+=GetElementKeyType ']')* ';';
5: GetElementKeyType:
6: key=[VariableReference] | INT | STRING;
Unfortunately, this doesn't work because of line 3.
I also tried three separate rules (for map, list and bytefield), but then it's difficult (impossible) for the parser to recognize the correct rule.
ListGetElement:
val=[VariableReference] '='
ref=[List]
'[' key+=GetElementKeyType ']' ('[' key+=GetElementKeyType ']')* ';';
... same for the others
Error then is:
Decision can match input such as "RULE_ID '=' RULE_ID '[' RULE_ID ']' '[' RULE_ID ']' ';'" using multiple alternatives: 5, 6
The following alternatives can never be matched: 6
What's the best way to achieve this?
There are two problems in your grammar:
assigning three different types to the attribute 'ref', and
producing three different types from parsing the same ID.
I am not sure what you want to do, but I can give you an example. I hope it helps.
e.g.
List:
'list' '(' elements += Element * ')';
Map:
'map' '(' pairs += Pair * ')';
GeneralDataType:
List | Map
;
CollectionGetElement:
val=[VariableReference] '='
ref = GeneralDataType
;

ANTLR lexer disabling tokens then reenabling them not working as expected

I have a lexer with a token that is enabled or disabled by a boolean property.
I create an input stream and parse a text. My token is called PHRASE_TEXT and should match anything that fits this pattern: '"' ('\\' ~[] |~('\"'|'\\')) '"' {phraseEnabled}?
I tokenize "foo bar" and, as expected, I get a single token. After setting the property to false on the lexer and calling setInputStream on it with the same text, I get "foo , bar", so two tokens instead of one. This is also expected behavior.
The problem comes when I set the property to true again. I would expect the same text to tokenize to the single token "foo bar", but instead it is tokenized into the two tokens from before. Is this a bug on my part? What am I doing wrong here? I tried both creating new instances of the tokenizer and reusing the same instance, but it doesn't work either way. Thanks in advance.
Edit: part of my grammar follows below.
grammar LuceneQueryParser;
@header{package com.amazon.platformsearch.solr.queryparser.psclassicqueryparser;}
@lexer::members {
public boolean phrases = true;
}
@parser::members {
public boolean phraseQueries = true;
}
mainQ : LPAREN query RPAREN
| query
;
query : not ((AND|OR)? not)* ;
andClause : AND ;
orClause : OR ;
not : NOT? modifier? clause;
clause : qualified
| unqualified
;
unqualified : LBRACK range_in LBRACK
| LCURL range_out RCURL
| truncated
| {phraseQueries}? quoted
| LPAREN query RPAREN
| normal
;
truncated : TERM_TEXT_TRUNCATED;
range_in : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
range_out : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
qualified : TERM_TEXT COLON unqualified ;
normal : TERM_TEXT;
quoted : PHRASE_TEXT;
modifier : PLUS
| MINUS
;
PHRASE_TEXT : '"' (ESCAPE|~('\"'|'\\'))+ '"' {phrases}?;
TERM_TEXT : (TERM_CHAR|ESCAPE)+;
TERM_CHAR : ~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\\' | '\'' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '*' | '|' | '&' | '?' );
ESCAPE : '\\' ~[];
The problem seems to be that after I set phrases to false and then to true again, no more tokens are recognized as PHRASE_TEXT. I know that as a guideline I should define my grammars to be unambiguous, but this is basically how it has to end up working: tokenizing a string with quotes in two different modes, depending on the situation.
I'm updating this with an answer that a colleague of mine helpfully pointed out. The generated lexer class has a static DFA[] array shared between all instances of the class. Once the property was set to false instead of the default true, the decision tree was apparently changed for all object instances. A fix for this was to have two separate DFA[] arrays, one for the true and one for the false value of the property I was modifying. I think making that array non-static would be too expensive, and I really can't think of another fix.
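The general pitfall can be shown with a toy in Python. To be clear, this is not ANTLR's code and does not model its DFA construction. `ToyLexer`, its caches and the `split_cache` switch are all invented to illustrate how state shared by every instance can remember a decision made under a different flag value, and how keeping one cache per flag value avoids that.

```python
class ToyLexer:
    # Class-level state, shared by every instance (the analogue of a static array).
    _shared_cache = {}
    # Alternative: one cache per flag value, in the spirit of the fix described above.
    _caches_by_flag = {True: {}, False: {}}

    def __init__(self, phrases=True, split_cache=False):
        self.phrases = phrases
        if split_cache:
            self._cache = ToyLexer._caches_by_flag[phrases]
        else:
            self._cache = ToyLexer._shared_cache

    def tokenize(self, text):
        if text not in self._cache:
            # The flag is consulted only on a cache miss, so the stored result
            # silently depends on whichever instance saw this input first.
            if self.phrases and text.startswith('"') and text.endswith('"'):
                self._cache[text] = [text]
            else:
                self._cache[text] = text.split()
        return self._cache[text]
```

With the shared cache, a lexer created with `phrases=False` that tokenizes `"foo bar"` first leaves a two-token result behind, and a later lexer with `phrases=True` gets the same two tokens back. With `split_cache=True` each flag value sees only its own cached decisions.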
