Misinterpreted grammar

Misinterpreted grammar - xtext

I have the following grammar piece:
SlotConstraint:
    lExpr = [Slot] pred = ('in' | 'inn' | 'from' | 'fromm' | 'is') rExpr = SetSexpr |
    lExpr = [Slot] pred = ('in' | 'inn' | 'from' | 'fromm' | 'is')? neg = ('not' | 'not in' | 'not from') rExpr = SetSexpr
;
When I write something like "a in b" or "a is not in b", it is fine. However, I am not able to write "a is not b". Why does it understand 'not in' and 'not from' but not plain 'not'?
Thanks

Do not use whitespace in keywords: split 'not in' and 'not from' into the separate keywords 'not', 'in' and 'from'.

Related

Shift/reduce conflicts using Sablecc

I'm supposed to write a .grammar file for MiniPython using Sablecc. I'm getting these shift/reduce conflicts:
shift/reduce conflict in state [stack: TIf PTopower *] on TMult in {
[ PMltp = * TMult PTopower Mltp ] (shift)
[ PMltp = * ] followed by TMult (reduce)
}
shift/reduce conflict in state [stack: TIf PTopower *] on TDiv in {
[ PMltp = * TDiv PTopower Mltp ] (shift)
[ PMltp = * ] followed by TDiv (reduce)
}
Some of the tokens are:
id = letter (letter | digit)*;
digit = ['0' .. '9'];
letter = ['a' .. 'z']|['A' .. 'Z'];
pow = '**';
mult = '*';
div = '/';
plus = '+';
minus = '-';
assert = 'assert';
l_par = '(';
r_par = ')';
l_bra = '[';
r_bra = ']';
Part of my .grammar file is this:
expression = multiplication exprsn;
exprsn = {addition} plus multiplication exprsn
| {subtraction} minus multiplication exprsn
| {empty};
topower = something tpwr;
tpwr = {topower} pow something tpwr
| {empty};
multiplication = topower mltp;
mltp = {multiplication} mult topower mltp
| {division} div topower mltp
| {empty};
something = {id} id
| {parexp} l_par expression r_par
| {fcall} functioncall
| {value} value
| {list} id l_bra expression r_bra
| {other} l_bra value comval* r_bra
| {assert} assert expression comexpr?;
comexpr = comma expression;
This is the grammar after I tried to eliminate left recursion. I noticed that if I remove the assert alternative from the something production, I get no conflicts. Removing the {empty} alternatives from the exprsn, tpwr and mltp rules also gives me no conflicts, but I don't think this is the correct way to resolve the problem.
Any tips would be really appreciated.
UPDATE: Here is the whole grammar, before removing left recursion, as requested:
Package minipython;
Helpers
digit = ['0' .. '9'];
letter = ['a' .. 'z']|['A' .. 'Z'];
cr = 13;
lf = 10;
all = [0..127];
eol = lf | cr | cr lf ;
not_eol = [all - [cr + lf]];
Tokens
tab = 9;
plus = '+';
dot = '.';
pow = '**';
minus = '-';
mult = '*';
div = '/';
eq = '=';
minuseq = '-=';
diveq = '/=';
exclam = '!';
def = 'def';
equal = '==';
nequal = '!=';
l_par = '(';
r_par = ')';
l_bra = '[';
r_bra = ']';
comma= ',';
qmark = '?';
gqmark = ';';
assert = 'assert';
if = 'if';
while = 'while';
for = 'for';
in = 'in';
print = 'print';
return = 'return';
importkn = 'import';
as = 'as';
from = 'from';
less = '<';
great = '>';
true = 'true';
semi = ':';
false = 'false';
quote = '"';
blank = (' ' | lf | cr);
line_comment = '#' not_eol* eol;
number = digit+ | (digit+ '.' digit+);
id = letter (letter | digit)*;
string = '"'not_eol* '"';
cstring = ''' letter ''';
Ignored Tokens
blank, line_comment;
Productions
program = commands*;
commands = {stmt} statement
| {func} function;
function = def id l_par argument? r_par semi statement;
argument = id eqval? ceidv*;
eqval = eq value;
ceidv = comma id eqval?;
statement = {if} tab* if comparison semi statement
| {while} tab* while comparison semi statement
| {for} tab* for [id1]:id in [id2]:id semi statement
| {return} tab* return expression
| {print} tab* print expression comexpr*
| {assign} tab* id eq expression
| {minassign} tab* id minuseq expression
| {divassign} tab* id diveq expression
| {list} tab* id l_bra [ex1]:expression r_bra eq [ex2]:expression
| {fcall} tab* functioncall
| {import} import;
comexpr = comma expression;
expression = {multiplication} multiplication
| {addition} expression plus multiplication
| {subtraction} expression minus multiplication;
topower = {smth} something
| {power} topower pow something;
something = {id} id
| {parexp} l_par expression r_par
| {fcall} functioncall
| {value} value
| {list} id l_bra expression r_bra
| {assert} assert expression comexpr?
| {other} l_bra value comval* r_bra;
comval = comma value;
multiplication = {power} topower
| {multiplication} multiplication mult topower
| {division} multiplication div topower;
import = {import} importkn module asid? comod*
| {from} from module importkn id asid? comid*;
asid = as id;
comod = comma module asid?;
comid = comma id asid?;
module = idot* id;
idot = id dot;
comparison = {true} true
| {false} false
| {greater} [ex1]:expression great [ex2]:expression
| {lesser} [ex1]:expression less [ex2]:expression
| {equals} [ex1]:expression equal [ex2]:expression
| {nequals} [ex1]:expression nequal [ex2]:expression;
functioncall = id l_par arglist? r_par;
arglist = expression comexpr*;
value = {fcall} id dot functioncall
| {numb} number
| {str} string
| {cstr} cstring;
The shift/reduce conflict now is:
shift/reduce conflict in state [stack: TIf PTopower *] on TPow in {
[ PMultiplication = PTopower * ] followed by TPow (reduce),
[ PTopower = PTopower * TPow PSomething ] (shift)
}

(Note: this answer has been drawn from the original grammar, not from the attempt to remove left-recursion, which has additional issues. There is no need to remove left-recursion from a grammar being provided to an LALR(1) parser generator like SableCC.)
Indeed, the basic problem is the production:
something = {assert} assert expression comexpr?
This production is curious, partly because the name of the non-terminal ("something") provides no hint whatsoever as to what it is, but mostly because one would normally expect assert expression to be a statement, not part of an expression. And something is clearly derived from expression:
expression = multiplication
multiplication = topower
topower = something
But the assert production ends with an expression. That leads to an ambiguity, since
assert 4 + 3
could be parsed as (some steps omitted for succinctness):
expression = expression       plus   multiplication
                  |            |           |
                  V            |           |
              something        |           |
                  |            |           |
                  V            |           |
      assert expression        |           |
        |         |            |           |
        |         V            V           V
      assert      4            +           3
Or, more naturally, as:
expression = something
                 |
                 V
      assert expression
        |        |
        |        V
        |    expression plus multiplication
        |        |       |         |
        |        V       V         V
      assert     4       +         3
The first parse seems unlikely because assert doesn't (as far as I would guess) actually return a value. (Although the second one would be more natural if the operator were a comparison rather than an addition.)
Without seeing the definition of the language you're trying to parse, I can't really provide a concrete suggestion for how to fix this, but my inclination would be to make assert a statement, and rename something to something more descriptive ("term" is common, although I usually use "atom").
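The ambiguity described above can be reproduced with a small brute-force parse enumerator. This is only a sketch in Python under simplifying assumptions: the question's grammar is collapsed to two hypothetical rules, expr -> expr '+' term | term and term -> NUMBER | 'assert' expr, and the optional comexpr is dropped. It is not how SableCC works internally; it just counts the parse trees the grammar allows.

```python
def parse_expr(tokens, i, j):
    """Return every parse tree for tokens[i:j] under expr -> expr '+' term | term."""
    trees = list(parse_term(tokens, i, j))
    # Try each '+' in the span as the top-level operator.
    for k in range(i + 1, j - 1):
        if tokens[k] == "+":
            for left in parse_expr(tokens, i, k):
                for right in parse_term(tokens, k + 1, j):
                    trees.append(("+", left, right))
    return trees


def parse_term(tokens, i, j):
    """Return every parse tree for tokens[i:j] under term -> NUMBER | 'assert' expr."""
    trees = []
    if j - i == 1 and tokens[i].isdigit():
        trees.append(tokens[i])
    if j - i >= 2 and tokens[i] == "assert":
        # The assert alternative ends with a full expression: the source of the ambiguity.
        for operand in parse_expr(tokens, i + 1, j):
            trees.append(("assert", operand))
    return trees


tokens = "assert 4 + 3".split()
parses = parse_expr(tokens, 0, len(tokens))
for tree in parses:
    print(tree)
```

Two trees come back for the same input, one per diagram above. If assert is moved out of term and into a statement-level rule, only one tree remains, which is the fix suggested in this answer.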

Top Down Recursive Parsing

I need to write the functions parseS, parseL, and parseE. These functions take as input an already tokenized 'program' represented as a list of tokens:
type token = TK_IF | TK_THEN | TK_ELSE | TK_BEGIN | TK_END | TK_PRINT | TK_SEMIC | TK_ID of string;;
let program = [ TK_IF; TK_ID("a"); TK_THEN; TK_BEGIN; TK_PRINT; TK_ID("b"); TK_SEMIC; TK_PRINT; TK_ID("c"); TK_END; TK_ELSE; TK_PRINT; TK_ID("d")];;
and output something like this:
parseS program;;
IF2(ID "a",STATLIST(PRINT (ID "b"),LISTCONT(PRINT (ID "c"),END)),PRINT (ID "d"))
The parseL should look something like this
let rec parseL = function
    | TK_END::xs -> (xs, END)
    | TK_SEMIC::xs ->
        let (rest, stat) = parseS xs
        let (rest, lst) = parseL rest
        (rest, LISTCONT(stat, lst))
    | l -> failwith ("error parsing L (expecting TK_END or TK_SEMIC):" + (List.map debug l).ToString())
Help?
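
One possible shape for the three functions is sketched below in Python rather than the question's language. The grammar is an assumption inferred from the sample output: S is "if E then S else S", "begin S L" or "print E"; L is "end" or "; S L"; E is a single identifier. Tokens are modelled as tuples such as ("TK_IF",) and ("TK_ID", "a"), and trees as nested tuples. Both representations, and the helper expect, are illustrative choices that do not appear in the question.

```python
def expect(tokens, kind):
    """Consume one token of the given kind, or fail with a readable message."""
    if not tokens or tokens[0][0] != kind:
        raise SyntaxError(f"expected {kind}, got {tokens[:1]}")
    return tokens[1:]


def parse_e(tokens):
    """E -> id"""
    if tokens and tokens[0][0] == "TK_ID":
        return tokens[1:], ("ID", tokens[0][1])
    raise SyntaxError(f"error parsing E (expecting TK_ID): {tokens[:1]}")


def parse_s(tokens):
    """S -> if E then S else S | begin S L | print E"""
    if not tokens:
        raise SyntaxError("error parsing S: unexpected end of input")
    kind, rest = tokens[0][0], tokens[1:]
    if kind == "TK_IF":
        rest, cond = parse_e(rest)
        rest, then_branch = parse_s(expect(rest, "TK_THEN"))
        rest, else_branch = parse_s(expect(rest, "TK_ELSE"))
        return rest, ("IF2", cond, then_branch, else_branch)
    if kind == "TK_BEGIN":
        rest, stat = parse_s(rest)
        rest, lst = parse_l(rest)
        return rest, ("STATLIST", stat, lst)
    if kind == "TK_PRINT":
        rest, expr = parse_e(rest)
        return rest, ("PRINT", expr)
    raise SyntaxError(f"error parsing S: {tokens[:1]}")


def parse_l(tokens):
    """L -> end | ; S L"""
    if tokens and tokens[0][0] == "TK_END":
        return tokens[1:], ("END",)
    if tokens and tokens[0][0] == "TK_SEMIC":
        rest, stat = parse_s(tokens[1:])
        rest, lst = parse_l(rest)
        return rest, ("LISTCONT", stat, lst)
    raise SyntaxError(f"error parsing L (expecting TK_END or TK_SEMIC): {tokens[:1]}")


program = [("TK_IF",), ("TK_ID", "a"), ("TK_THEN",), ("TK_BEGIN",),
           ("TK_PRINT",), ("TK_ID", "b"), ("TK_SEMIC",), ("TK_PRINT",),
           ("TK_ID", "c"), ("TK_END",), ("TK_ELSE",), ("TK_PRINT",), ("TK_ID", "d")]
rest, tree = parse_s(program)
print(tree)
```

Each function returns the remaining tokens together with the tree it built, mirroring the (rest, result) pairs in the parseL fragment above, so the caller always knows where to continue.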

Matching of tokens with Antlr4

I am an ANTLR4 newbie and have problems with a relatively simple grammar. The grammar is given at the end of this question. (It is a fragment of a grammar for parsing descriptions of biological sequence variants.)
I am trying to parse the string "p.A3L" in the following unit test.
@Test
public void testProteinSubtitutionWithoutRef() {
    ANTLRInputStream inputStream = new ANTLRInputStream("p.A3L");
    HGVSLexer l = new HGVSLexer(inputStream);
    HGVSParser p = new HGVSParser(new CommonTokenStream(l));
    p.setTrace(true);
    p.addErrorListener(new BaseErrorListener() {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
                int charPositionInLine, String msg, RecognitionException e) {
            throw new IllegalStateException("failed to parse at line " + line + " due to " + msg, e);
        }
    });
    p.hgvs();
}
The test fails with the message "line 1:2 mismatched input 'A3L' expecting AA". I assume that this is related to lexing, i.e. splitting "A3L" into the three tokens A, 3, and L, such that the parser can then generate the corresponding syntax subtree containing the three terminals from it.
What is going wrong here and where can I learn how to fix this?
The grammar
grammar HGVS;
hgvs: protein_var
;
// Basic lexemes
AA: AA1
| AA3
| 'X';
AA1: 'A'
| 'R'
| 'N'
| 'D'
| 'C'
| 'Q'
| 'E'
| 'G'
| 'H'
| 'I'
| 'L'
| 'K'
| 'M'
| 'F'
| 'P'
| 'S'
| 'T'
| 'W'
| 'Y'
| 'V';
AA3: 'Ala'
| 'Arg'
| 'Asn'
| 'Asp'
| 'Cys'
| 'Gln'
| 'Glu'
| 'Gly'
| 'His'
| 'Ile'
| 'Leu'
| 'Lys'
| 'Met'
| 'Phe'
| 'Pro'
| 'Ser'
| 'Thr'
| 'Trp'
| 'Tyr'
| 'Val';
NUMBER: [0-9]+;
NAME: [a-zA-Z0-9_]+;
// Top-level Rule
/** Variant in a protein. */
protein_var: 'p.' AA NUMBER AA
;

There are two problems:
1. Define the rule for protein_var ahead of the lexer rules. (It should work as it is now too, but it is hard to read because the other parser rule comes first.)
2. Remove the rule for NAME. A3L is not lexed as AA NUMBER AA (as you probably expected) but as a single NAME token, because ANTLR always prefers the lexer rule with the longest match.
The resulting grammar should look like:
grammar HGVS;
hgvs
: protein_var
;
protein_var
: 'p.' AA NUMBER AA
;
AA: ...;
AA3: ...;
AA1: ...;
NUMBER: [0-9]+;
If you need NAME for other purposes, you will have to disambiguate it in the lexer (by a prefix that NAMEs and AA do not have in common or by using lexer modes).
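
The longest-match behaviour can be imitated with a toy lexer. The following Python sketch is not ANTLR's implementation; it only applies the same rule (take the longest match at each position, and on a tie the rule listed first) to regular expressions that approximate the lexer rules from the question. The function name lex and the rule lists are made up for illustration.

```python
import re

# Approximation of the question's AA rule. Three-letter codes come first
# because Python's `|` picks the first alternative that matches, not the longest.
AA = (
    r"Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
    r"|[ARNDCQEGHILKMFPSTWYVX]"
)

RULES_WITH_NAME = [
    ("'p.'", r"p\."),
    ("AA", AA),
    ("NUMBER", r"[0-9]+"),
    ("NAME", r"[a-zA-Z0-9_]+"),
]
RULES_WITHOUT_NAME = RULES_WITH_NAME[:-1]


def lex(text, rules):
    """Split text into (rule, lexeme) pairs using the longest match at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (rule name, end offset) of the longest match so far
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


print(lex("p.A3L", RULES_WITH_NAME))     # A3L comes out as one NAME token
print(lex("p.A3L", RULES_WITHOUT_NAME))  # A3L comes out as AA, NUMBER, AA
```

With NAME present, the three characters A3L form a longer match than the single-letter AA, which is consistent with the "mismatched input 'A3L' expecting AA" error in the question.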

How can I make this match expression more concise?

Learning F# by writing blackjack. I have these types:
type Suit =
    | Heart = 0
    | Spade = 1
    | Diamond = 2
    | Club = 3
type Card =
    | Ace of Suit
    | King of Suit
    | Queen of Suit
    | Jack of Suit
    | ValueCard of int * Suit
I have this function (ignoring for now that aces can have 2 different values):
let NumericValue =
    function
    | Ace(Suit.Heart) | Ace(Suit.Spade) | Ace(Suit.Diamond) | Ace(Suit.Club) -> 11
    | King(Suit.Heart) | King(Suit.Spade) | King(Suit.Diamond) | King(Suit.Club)
    | Queen(Suit.Heart) | Queen(Suit.Spade) | Queen(Suit.Diamond) | Queen(Suit.Club)
    | Jack(Suit.Heart) | Jack(Suit.Spade) | Jack(Suit.Diamond) | Jack(Suit.Club) -> 10
    | ValueCard(num, x) -> num
Is there a way I can match a range, like [Ace(Suit.Heart) .. Ace(Suit.Club)]? Or, even better, something like Ace(*)?

You want a wildcard pattern. The spec (§7.4) says:
The pattern _ is a wildcard pattern and matches any input.
let numericValue = function
    | Ace _ -> 11
    | King _
    | Queen _
    | Jack _ -> 10
    | ValueCard(num, _) -> num

URL BNF search part does not make sense

While implementing a Java regular expression for URL based on the URL BNF published by W3C, I've failed to understand the search part. As quoted:
httpaddress   h t t p : / / hostport [ / path ] [ ? search ]
search        xalphas [ + search ]
xalphas       xalpha [ xalphas ]
xalpha        alpha | digit | safe | extra | escape
alpha         a | b | c | d | e | f | g | h | i | j | k |
              l | m | n | o | p | q | r | s | t | u | v |
              w | x | y | z | A | B | C | D | E | F | G |
              H | I | J | K | L | M | N | O | P | Q | R |
              S | T | U | V | W | X | Y | Z
digit         0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
safe          $ | - | _ | # | . | & | + | -
extra         ! | * | " | ' | ( | ) | ,
The search rule says that it is a sequence of xalphas separated by plus signs.
But xalphas can contain plus signs by itself, because + is listed under safe.
Thus, according to my understanding, it should be:
search xalphas
Where am I wrong here?

That's pretty clearly a mistake (+ is a reserved delimiter for URIs), but the BNF you're linking to seems to be out of date. It is probably best to use the one included at the end of RFC 3986, the current URI specification.
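
The redundancy the question points out can be checked mechanically. The Python sketch below translates the quoted rules into regular expressions, under two assumptions: the safe and extra sets are taken exactly as quoted in the question, and escape is left out because the excerpt does not define it. It then compares "search as written" with plain xalphas on a few sample strings.

```python
import re
import string

SAFE = "$-_#.&+"     # safe characters as quoted in the question (note the '+')
EXTRA = "!*\"'(),"   # extra characters as quoted in the question

# xalpha: one character from alpha, digit, safe or extra (escape omitted).
XALPHA = "[" + re.escape(string.ascii_letters + string.digits + SAFE + EXTRA) + "]"
XALPHAS = XALPHA + "+"

# search -> xalphas [ + search ] is right-recursive, i.e. xalphas ('+' xalphas)*.
SEARCH_AS_WRITTEN = re.compile(f"{XALPHAS}(?:\\+{XALPHAS})*")
SEARCH_SIMPLIFIED = re.compile(XALPHAS)

samples = ["a", "a+b", "a++b", "+", "a+", "", "a b"]
results = {
    s: (bool(SEARCH_AS_WRITTEN.fullmatch(s)), bool(SEARCH_SIMPLIFIED.fullmatch(s)))
    for s in samples
}
for s, (as_written, simplified) in results.items():
    print(repr(s), as_written, simplified)
```

Both patterns accept and reject the same samples, since every string built from xalphas and plus signs is itself a non-empty run of xalpha characters. As long as + stays in safe, the plus sign in the search rule therefore cannot act as a separator.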
