SML datatypes struggle - parsing

I have the following grammar that I need to translate into SML datatypes:
Integer ranges over SML integer constants.
Boolean ::= 'true' | 'false'
Operator ::= 'ADD' | 'IF' | 'LESS_THAN'
Arguments ::= ( ',' Expression ) *
Expression ::= 'INT' '(' Integer ')'
| 'BOOL' '(' Boolean ')'
| 'OPERATION' '(' Operator ',' '[' Expression ( ',' Expression ) * ']' ')'
I have managed the following:
datatype BOOL = true | false;
datatype OPERATOR = ADD | IF | LESS_THAN;
datatype INT = INT of int;
However I am struggling with datatypes Arguments and Expression. Any help would be appreciated.

for ARGUMENTS, you can use a sequence of EXPRESSIONs, so something like a list of EXPRESSION would work fine (the parentheses need to be parsed, but you don't need to store them in your type, since they are always there).
for EXPRESSION, you need to combine the approach you've used in OPERATOR (where you have alternatives) with what you've done in INT (where you have of ...). in otherwords it's going to be of the form A of B | C of D | ....
also, you don't really need INT of int - you could just use an simple int (ie an integer) for INT - and i suspect ML has a boolean type that could use instead of defining a datatype for BOOL (in other words, you probably don't need to define a datatype at all for either of those - just use what is already present in the language).
ps it's also normal to add the "homework" tag for homework.
[edit for OPERATOR you have multiple types, but that's ok - just stick them in a tuple, like (A,B) whose type is written a * b. for the sequence of expressions, use a list, like for ARGUMENTS.]

Related

ANTLR Making Negative Test Cases

I'm new to ANTLR and am trying to understand how to do some things with it. I need it to throw an error when a statement is missing things, like a semicolon or an end bracket. It's been called negative test cases by the problem set that I'm working through.
For example, the below code returns true, which is correct.
val program = """
1 + 2;
"""
recognize(program)
However, this code also returns true, despite it missing the semicolon at the end. It should return false ([PARSER error at line=1]: missing ';' at '').
val program = """
1 + 2
""".trimIndent()
recognize(program)
The grammar is as follows:
program: (expression ';')* | EOF;
expression: INT PLUS INT | OPENBRAC INT PLUS INT CLOSEBRAC | QUOTE IDENT QUOTE PLUS QUOTE IDENT QUOTE;
IDENT: [A-Za-z0-9]+;
INT: [-][0-9]+ | ('0'..'9')+;
PLUS: '+';
OPENBRAC: '(';
CLOSEBRAC: ')';
QUOTE: '"';
program: (expression ';')* | EOF;
This means a program can either be zero or more instances of expression ';' followed by whatever else is in the input stream or it can be empty. Since (expression ';')* can already match the empty input by itself, the | EOF is just redundant.
What you want is program: (expression ';')* EOF, which means that a program consists of zero or more instances of expression ';', followed by the end of input, meaning there must be nothing left in the input afterwards.

Grammar of calculator in a finite field

I have a working calculator apart from one thing: unary operator '-'.
It has to be evaluated and dealt with in 2 difference cases:
When there is some expression further like so -(3+3)
When there isn't: -3
For case 1, I want to get a postfix output 3 3 + -
For case 2, I want to get just correct value of this token in this field, so for example in Z10 it's 10-3 = 7.
My current idea:
E: ...
| '-' NUM %prec NEGATIVE { $$ = correct(-yylval); appendNumber($$); }
| '-' E %prec NEGATIVE { $$ = correct(P-$2); strcat(rpn, "-"); }
| NUM { appendNumber(yylval); $$ = correct(yylval); }
Where NUM is a token, but obviously compiler says there is a confict reduce/reduce as E can also be a NUM in some cases, altough it works I want to get rid of the compilator warning.. and I ran out of ideas.
It has to be evaluated and dealt with in 2 difference cases:
No it doesn't. The cases are not distinct.
Both - E and - NUM are incorrect. The correct grammar would be something like:
primary
: NUM
| '-' primary
| '+' primary /* for completeness */
| '(' expression ')'
;
Normally, this should be implemented as two rules (pseudocode, I don't know bison syntax):
This is the likely rule for the 'terminal' element of an expression. Naturally, a parenthesized expression leads to a recursion to the top rule:
Element => Number
| '(' Expression ')'
The unary minus (and also the unary plus!) are just on one level up in the stack of productions (grammar rules):
Term => '-' Element
| '+' Element
| Element
Naturally, this can unbundle into all possible combinations such as '-' Number, '-' '(' Expression ')', likewise with '+' and without any unary operator at all.
Suppose we want addition / subtraction, and multiplication / division. Then the rest of the grammar would look like this:
Expression => Expression '+' MultiplicationExpr
| Expression '-' MultiplicationExpr
| MultiplicationExpr
MultiplicationExpr => MultiplicationExpr '*' Term
| MultiplicationExpr '/' Term
| Term
For the sake of completeness:
Terminals:
Number
Non-terminals:
Expression
Element
Term
MultiplicationExpr
Number, which is a terminal, shall match a regexp like this [0-9]+. In other words, it does not parse a minus sign — it's always a positive integer (or zero). Negative integers are calculated by matching a '-' Number sequence of tokens.

ANTLR4 Context-sensitive rule: unexpected parsing/resynchronization when failing semantic predicate

I'm working on a grammar that is context-sensitive. Here is its description:
It describes the set of expressions.
Each expression contains one or more parts separated by logical operator.
Each part consists of optional field identifier followed by some comparison operator (that is also optional) and the list of values.
Values are separated by logical operator as well.
By default value is a sequence of characters. Sometimes (depending on context) set of possible characters for each value can be extended. It even can consume comparison operator (that is used for separating of field identifiers from list of values, according to 3rd rule) to treat it as value's character.
Here's the simplified version of a grammar:
grammar TestGrammar;
#members {
boolean isValue = false;
}
exprSet: (expr NL?)+;
expr: expr log_op expr
| part
| '(' expr ')'
;
part: (fieldId comp_op)? values;
fieldId: STRNG;
values: values log_op values
| value
| '(' values ')'
;
value: strng;
strng: ( STRNG
| {isValue}? comp_op
)+;
log_op: '&' '&';
comp_op: '=';
NL: '\r'? '\n';
WS: ' ' -> channel(HIDDEN);
STRNG: CHR+;
CHR: [A-Za-z];
I'm using semantic predicate in strng rule. It should extend the set of possible tokens depending on isValue variable;
The problem occurs when semantic predicate evaluates to false. I expect that 2 STRNG tokens with '=' token between them will be treated as part node. Instead of it, it parses each STRNG token as a value, and throws out '=' token when re-synchronizing.
Here's the input string and the resulting expression tree that is incorrect:
a && b=c
To look at correct expression tree it's enough to remove an alternative with semantic predicate from strng rule (that makes it static and so is inappropriate for my solution):
strng: ( STRNG
// | {isValue}? comp_op
)+;
Here's resulting expression tree:
BTW, when semantic predicate evaluates to true - the result is as expected: strng rule matches an extended set of tokens:
strng: ( STRNG
| {!isValue}? comp_op
)+;
Please explain why this happens in such way, and help to find out correct solution. Thanks!
What about removing one option from values? Otherwise the text a && b may be either a
expr -> expr log_op expr
or
expr -> part -> values log_op values
.
It seems Antlr resolves it by using the second option!
values
: //values log_op values
value
| '(' values ')'
;
I believe your expr rule is written in the wrong order. Try moving the binary expression to be the last alternative instead of the first.
Ok, I've realized that current approach is inappropriate for my task.
I've chosen another approach based on overriding of Lexer's nextToken() and emit() methods, as described in ANTLR4: How to inject tokens .
It has given me almost full control on the stream of tokens. I got following advantages:
assigning required types to tokens;
postpone sending tokens with yet undefined type to parser (by sending fake tokens on hidden channel);
possibility to split and merge tokens;
possibility to organize postponed tokens into queues.
Having all these possibilities I'm able to resolve all the ambiguities in the parser.
P.S. Thanks to everyone who tried to help, I appreciate it!

Can I do something to avoid the need to backtrack in this grammar?

I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that
Here's the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state - I would clearly like the parser to backtrack, and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN and then things would work because the ; would match the var statement syntax all right
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is that to avoid forcing the parser to make a premature decision; once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)

Creating a parser rule

I'm currently in a CSCI class, compiler at my college. I have to write a parser for the compiler and I've already done Adding subtracting multiplying dividing and the assignment statement. My question is we now have to do the less than equal (<=) and the greater than equal (>=) and I'm not sure how to write the rule for it...
I was thinking something like...
expr LESSTHAN expr { $1 <= $3 }
expr GREATERTHAN expr { $1 >= $3 }
any suggestions?
You should include a more precise question. Here are some general suggestions though.
The structure of the rule for relational operations should be the same as of the arithmetic operations. In both cases you have binary operators. The difference is that one returns a number, the other returns a boolean value. While 1 + 1 >= 3 usually is valid syntax, other combinations like 1 >= 2 => 5 is most likely invalid. Of course there are exceptions. Some languages allow it as syntactic sugar for multiple operations. Others simply define that boolean values are just integers (0 and 1). It's up to you (or your assignment) what you want the syntax to look like.
Anyway, you probably don't simply want append those rules to expr, but create a new rule. This way you distinguish between relational and arithmetical expressions.
expr :
expr PLUS expr |
expr MINUS expr |
... ;
relational_expr :
expr LESSTHAN expr |
expr GREATERTHAN expr ;
assignment :
identifier '=' relational_expr |
identifier '=' expr ;

Resources