Writing AST matcher to find all case statements having no break statement - clang

I want to find all case statements that have no break statement. I am using clang-query to build my matcher, but it fails on some of my test cases.
I wrote this simple matcher:
match caseStmt(unless(has(breakStmt())))
It works with the following test case:
#include<stdlib.h>
int main(){
int x;
switch(x){
case 1:
break;
case 2:
default:
x++;
}
return 0;
}
and
int main()
{
int x = 1, y = 2;
// Outer Switch
switch (x) {
// If x == 1
case 1:
// Nested Switch
switch (y) {
// If y == 2
case 2:
//break;
// If y == 3
case 3:
break;
}
break;
// If x == 4
case 4:
break;
// If x == 5
case 5:
break;
default:
break;
}
return 0;
}
but does not work well with the following:
#include <iostream>
using namespace std;
int main()
{
int x = 1, y = 2;
// Outer Switch
switch (x) {
// If x == 1
case 1:
// Nested Switch
switch (y) {
// If y == 2
case 2:
cout << "Choice is 2";
//break;
// If y == 3
case 3:
cout << "Choice is 3";
break;
}
//break;
// If x == 4
case 4:
cout << "Choice is 4";
break;
// If x == 5
case 5:
cout << "Choice is 5";
break;
default:
cout << "Choice is other than 1, 2 3, 4, or 5";
break;
}
return 0;
}
In the above case it reports case statements that do have a break statement along with the ones that have none.
What am I doing wrong? Please help :) I am following this tutorial:
http://releases.llvm.org/8.0.0/tools/clang/docs/LibASTMatchersTutorial.html

Unfortunately this is not going to work :-(
A case is technically a label, and a label has only one statement as its child. If you print out the AST you'll see that the case and break statements are at the same level:
| |-CaseStmt 0x5618732e1e30 <line:29:3, line:30:9>
| | |-IntegerLiteral 0x5618732e1e10 <line:29:8> 'int' 4
| | |-<<<NULL>>>
| | `-CallExpr 0x5618732e1f00 <line:30:5, col:9> 'void'
| | `-ImplicitCastExpr 0x5618732e1ee8 <col:5> 'void (*)()' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x5618732e1ec0 <col:5> 'void ()' lvalue Function 0x5618732e16d0 'foo' 'void ()'
| |-BreakStmt 0x5618732e1f28 <line:31:5>
| |-CaseStmt 0x5618732e1f50 <line:34:3, line:35:9>
| | |-IntegerLiteral 0x5618732e1f30 <line:34:8> 'int' 5
| | |-<<<NULL>>>
| | `-CallExpr 0x5618732e2020 <line:35:5, col:9> 'void'
| | `-ImplicitCastExpr 0x5618732e2008 <col:5> 'void (*)()' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x5618732e1fe0 <col:5> 'void ()' lvalue Function 0x5618732e16d0 'foo' 'void ()'
| |-BreakStmt 0x5618732e2048 <line:36:5>
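Here you can see that CallExpr is a child of CaseStmt while BreakStmt is not. This also explains why the matcher appeared to work on the simpler tests: has(breakStmt()) only matches when the break is the statement directly attached to the label, as in case 1: break;.
NOTE: to make the example a bit simpler I replaced std::cout << "..." with foo().
You'll have to write a much more complex matcher that looks for cases with no break statement between them and the following case.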
I hope this is still helpful.
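To make the sibling relationship concrete, here is a minimal Python sketch. It is not clang-query or LibTooling code: it assumes a made-up flat model of a switch body, a list of (kind, label) tuples in source order, and it reports every case that reaches the next case (or the end of the switch) without passing a break. The function name and the tuple layout are invented for this illustration, and a nested switch is treated as one opaque statement.

```python
def cases_without_break(body):
    """Return the labels of cases with no 'break' before the next case.

    `body` is a hypothetical flat model of a switch body: a list of
    (kind, label) tuples where kind is 'case', 'break' or 'stmt'.
    """
    missing = []
    current = None       # label of the case currently being scanned
    seen_break = False
    for kind, label in body:
        if kind == "case":
            # A new case starts: close the previous one first.
            if current is not None and not seen_break:
                missing.append(current)
            current, seen_break = label, False
        elif kind == "break":
            seen_break = True
    # Close the last case at the end of the switch body.
    if current is not None and not seen_break:
        missing.append(current)
    return missing


# Outer switch of the failing example: case 1 falls through, the rest break.
outer = [
    ("case", "1"), ("stmt", "switch (y) { ... }"),
    ("case", "4"), ("stmt", "cout"), ("break", None),
    ("case", "5"), ("stmt", "cout"), ("break", None),
    ("case", "default"), ("stmt", "cout"), ("break", None),
]
print(cases_without_break(outer))
```

A real tool would probably run the same scan over the children of the switch body, looking both at each CaseStmt's own sub-statement and at the siblings that follow it.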

Related

Clang AST Libtooling: How to print Array identifier on AST Matching

My code that I tried is below:
if(const ArraySubscriptExpr *array = Result.Nodes.getNodeAs<ArraySubscriptExpr>("array"))
{
llvm::outs() << array->getBase() <<'\n';
}
getBase() should print the array identifier, but it is printing the address, e.g. 0x559f7da7e838. How can I print the array name/identifier?
For example, in the case of arr[i] = 40;
I want to print arr
getBase returns a pointer to the base expression, so that is why the address is being printed. The AST for arr[i] is:
| |-ArraySubscriptExpr 0xc04c608 <col:3, col:8> 'double' lvalue
| | |-ImplicitCastExpr 0xc04c5d8 <col:3> 'double *' <LValueToRValue>
| | | `-DeclRefExpr 0xc04c598 <col:3> 'double *' lvalue Var 0xc04c480 'arr' 'double *'
| | `-ImplicitCastExpr 0xc04c5f0 <col:7> 'int' <LValueToRValue>
| | `-DeclRefExpr 0xc04c5b8 <col:7> 'int' lvalue Var 0xc04c518 'i' 'int'
As can be seen, the name of the array appears in the child of the ImplicitCastExpr node, which is itself a child of ArraySubscriptExpr. This worked for me:
if (auto *array = dyn_cast<ArraySubscriptExpr>(st)) {
if (auto *cast = dyn_cast<ImplicitCastExpr>(array->getBase())) {
if (auto *decl = dyn_cast<DeclRefExpr>(cast->getSubExpr())) {
cout << decl->getNameInfo().getAsString() << endl;
}
}
}
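The nested dyn_cast chain above peels off one wrapper node at a time. The same walk can be sketched on a toy tree in Python. The Node class and base_name function below are made up for this illustration and are not clang types; the loop simply skips any number of implicit casts before checking for a declaration reference.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an AST node (not a clang class)."""
    kind: str
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def base_name(subscript):
    """Return the identifier of the array base, or None if it is not a plain variable."""
    if subscript.kind != "ArraySubscriptExpr" or not subscript.children:
        return None
    node = subscript.children[0]          # the base expression
    while node.kind == "ImplicitCastExpr" and node.children:
        node = node.children[0]           # skip implicit casts
    return node.name if node.kind == "DeclRefExpr" else None


# Toy version of the AST dump for arr[i]
arr_i = Node("ArraySubscriptExpr", children=[
    Node("ImplicitCastExpr", children=[Node("DeclRefExpr", "arr")]),
    Node("ImplicitCastExpr", children=[Node("DeclRefExpr", "i")]),
])
print(base_name(arr_i))  # prints: arr
```

In clang itself, Expr::IgnoreParenImpCasts() performs a comparable unwrapping, so array->getBase()->IgnoreParenImpCasts() may be a shorter route to the DeclRefExpr.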

Conflicts in Parser for Propositional logic with IF-THEN-ELSE ternary operator

I want to implement a parser for propositional logic which has the following operators in decreasing order of precedence:
NOT p
p AND q
p OR q
IF p THEN q
p IFF q
IF p THEN q ELSE r
The main issue is with the IF-THEN-ELSE operator. Without it, I am able to write the grammar properly. Presently my yacc file looks like
%term
PARSEPROG | AND | NOT | OR | IF | THEN | ELSE | IFF | LPAREN | RPAREN | ATOM of string | SEMICOLON | EOF
%nonterm
start of Absyn.program | EXP of Absyn.declaration
%start start
%eop EOF SEMICOLON
%pos int
%verbose
%right ELSE
%right IFF
%right THEN
%left AND OR
%left NOT
%name Fol
%noshift EOF
%%
start : PARSEPROG EXP (Absyn.PROGRAM(EXP))
EXP: ATOM ( Absyn.LITERAL(ATOM) )
| LPAREN EXP RPAREN (EXP)
| EXP AND EXP ( Absyn.CONJ(EXP1, EXP2) )
| EXP OR EXP ( Absyn.DISJ(EXP1, EXP2) )
| IF EXP THEN EXP ELSE EXP ( Absyn.IFTHENELSE(EXP1, EXP2, EXP3) )
| IF EXP THEN EXP ( Absyn.IMPLI(EXP1, EXP2) )
| EXP IFF EXP ( Absyn.BIIMPLI(EXP1, EXP2) )
| NOT EXP ( Absyn.NEGATION(EXP) )
But I can't work out how to eliminate the shift-reduce conflicts. Some examples of correct parsing are:
IF a THEN IF b THEN c ==> a->(b->c)
IF a THEN IF b THEN c ELSE d IFF e OR f ==> IFTHENELSE(a,b->c,d<=>e\/f)
Any help/pointers will be really helpful. Thanks.
Making my Yacc sit up and beg
I'm more convinced than ever that the correct approach here is a GLR grammar, if at all possible. However, inspired by @Kaz, I produced the following yacc/bison parser with an LALR(1) grammar (not even using precedence declarations).
Of course, it cheats, since the problem cannot be solved with an LALR(1) grammar. At appropriate intervals, it walks the constructed tree of IF THEN and IF THEN ELSE expressions, and moves the ELSE clauses as required.
Nodes which need to be re-examined for possible motion are given the AST nodetype IFSEQ and the ELSE clauses are attached with the traditional tightest match grammar, using a classic matched-if/unmatched-if grammar. A fully-matched IF THEN ELSE clause does not need to be rearranged; the tree rewrite will apply to the expression associated with the first ELSE whose right-hand operand is unmatched (if there is one). Keeping the fully-matched prefix of an IF expression separate from the tail which needs to be rearranged required almost-duplicating some rules; the almost-duplicated rules differ in that their actions directly produce TERNARY nodes instead of IFSEQ nodes.
In order to correctly answer the question, it would also be necessary to rearrange some IFF nodes, since the IFF binds more weakly than the THEN clause and more tightly than the ELSE clause. I think this means:
IF p THEN q IFF IF r THEN s ==> ((p → q) ↔ (r → s))
IF p THEN q IFF r ELSE s IFF t ==> (p ? (q ↔ r) : (s ↔ t))
IF p THEN q IFF IF r THEN s ELSE t IFF u ==> (p ? (q ↔ (r → s)) : (t ↔ u))
although I'm not sure that is what is being asked for (particularly the last one) and I really don't think it's a good idea. In the grammar below, if you want IFF to apply to an IF p THEN q subexpression, you will have to use parentheses; IF p THEN q IFF r produces p → (q ↔ r) and p IFF IF q THEN r is a syntax error.
Frankly, I think this whole thing would be easier using arrows for conditionals and biconditionals (as in the glosses above), and using IF THEN ELSE only for ternary selector expressions (written above with C-style ? : syntax, which is another possibility). That will generate far fewer surprises. But it's not my language.
One solution for the biconditional operator with floating precedence would be to parse in two passes. The first pass would only identify the IF p THEN q operators without an attached ELSE, using a mechanism similar to the one proposed here, and change them to p -> q by deleting the IF and changing the spelling of THEN. Other operators would not be parsed and parentheses would be retained. It would then feed the resulting token stream into a second LALR parser with a more traditional grammar style. I might get around to coding that only because I think that two-pass bison parsers are occasionally useful and there are few examples floating around.
Here's the tree-rewriting parser. I apologise for the length:
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void yyerror(const char* msg);
int yylex(void);
typedef struct Node Node;
enum AstType { ATOM, NEG, CONJ, DISJ, IMPL, BICOND, TERNARY,
IFSEQ
};
struct Node {
enum AstType type;
union {
const char* atom;
Node* child[3];
};
};
Node* node(enum AstType type, Node* op1, Node* op2, Node* op3);
Node* atom(const char* name);
void node_free(Node*);
void node_print(Node*, FILE*);
typedef struct ElseStack ElseStack;
struct ElseStack {
Node* action;
ElseStack* next;
};
ElseStack* build_else_stack(Node*, ElseStack*);
ElseStack* shift_elses(Node*, ElseStack*);
%}
%union {
const char* name;
struct Node* node;
}
%token <name> T_ID
%token T_AND "and"
T_ELSE "else"
T_IF "if"
T_IFF "iff"
T_NOT "not"
T_OR "or"
T_THEN "then"
%type <node> term conj disj bicond cond mat unmat tail expr
%%
prog : %empty | prog stmt;
stmt : expr '\n' { node_print($1, stdout); putchar('\n'); node_free($1); }
| '\n'
| error '\n'
term : T_ID { $$ = atom($1); }
| "not" term { $$ = node(NEG, $2, NULL, NULL); }
| '(' expr ')' { $$ = $2; }
conj : term
| conj "and" term { $$ = node(CONJ, $1, $3, NULL); }
disj : conj
| disj "or" conj { $$ = node(DISJ, $1, $3, NULL); }
bicond: disj
| disj "iff" bicond { $$ = node(BICOND, $1, $3, NULL); }
mat : bicond
| "if" expr "then" mat "else" mat
{ $$ = node(IFSEQ, $2, $4, $6); }
unmat: "if" expr "then" mat
{ $$ = node(IFSEQ, $2, $4, NULL); }
| "if" expr "then" unmat
{ $$ = node(IFSEQ, $2, $4, NULL); }
| "if" expr "then" mat "else" unmat
{ $$ = node(IFSEQ, $2, $4, $6); }
tail : "if" expr "then" mat
{ $$ = node(IFSEQ, $2, $4, NULL); }
| "if" expr "then" unmat
{ $$ = node(IFSEQ, $2, $4, NULL); }
cond : bicond
| tail { shift_elses($$, build_else_stack($$, NULL)); }
| "if" expr "then" mat "else" cond
{ $$ = node(TERNARY, $2, $4, $6); }
expr : cond
%%
/* Walk the IFSEQ nodes in the tree, pushing any
* else clause found onto the else stack, which it
* returns.
*/
ElseStack* build_else_stack(Node* ifs, ElseStack* stack) {
if (ifs && ifs->type == IFSEQ) {
stack = build_else_stack(ifs->child[1], stack);
if (ifs->child[2]) {
ElseStack* top = malloc(sizeof *top);
*top = (ElseStack) { ifs->child[2], stack };
stack = build_else_stack(ifs->child[2], top);
}
}
return stack;
}
/* Walk the IFSEQ nodes in the tree, attaching elses from
* the else stack.
* Pops the else stack as it goes, freeing popped
* objects, and returns the new top of the stack.
*/
ElseStack* shift_elses(Node* n, ElseStack* stack) {
if (n && n->type == IFSEQ) {
if (stack) {
ElseStack* top = stack;
stack = shift_elses(n->child[2],
shift_elses(n->child[1], stack->next));
n->type = TERNARY;
n->child[2] = top->action;
free(top);
}
else {
shift_elses(n->child[2],
shift_elses(n->child[1], NULL));
n->type = IMPL;
n->child[2] = NULL;
}
}
return stack;
}
Node* node(enum AstType type, Node* op1, Node* op2, Node* op3) {
Node* rv = malloc(sizeof *rv);
*rv = (Node){type, .child = {op1, op2, op3}};
return rv;
}
Node* atom(const char* name) {
Node* rv = malloc(sizeof *rv);
*rv = (Node){ATOM, .atom = name};
return rv;
}
void node_free(Node* n) {
if (n) {
if (n->type == ATOM) free((char*)n->atom);
else for (int i = 0; i < 3; ++i) node_free(n->child[i]);
free(n);
}
}
const char* typename(enum AstType type) {
switch (type) {
case ATOM: return "ATOM";
case NEG: return "NOT" ;
case CONJ: return "CONJ";
case DISJ: return "DISJ";
case IMPL: return "IMPL";
case BICOND: return "BICOND";
case TERNARY: return "TERNARY" ;
case IFSEQ: return "IF_SEQ";
}
return "**BAD NODE TYPE**";
}
void node_print(Node* n, FILE* out) {
if (n) {
if (n->type == ATOM)
fputs(n->atom, out);
else {
fprintf(out, "(%s", typename(n->type));
for (int i = 0; i < 3 && n->child[i]; ++i) {
fputc(' ', out); node_print(n->child[i], out);
}
fputc(')', out);
}
}
}
void yyerror(const char* msg) {
fprintf(stderr, "%s\n", msg);
}
int main(int argc, char** argv) {
return yyparse();
}
The lexer is almost trivial. (This one uses lower-case keywords because my fingers prefer that, but it's trivial to change.)
%{
#include "ifelse.tab.h"
%}
%option noinput nounput noyywrap nodefault
%%
and { return T_AND; }
else { return T_ELSE; }
if { return T_IF; }
iff { return T_IFF; }
not { return T_NOT; }
or { return T_OR; }
then { return T_THEN; }
[[:alpha:]]+ { yylval.name = strdup(yytext);
return T_ID; }
([[:space:]]{-}[\n])+ ;
\n { return '\n'; }
. { return *yytext;}
As written, the parser/lexer reads a line at a time, and prints the AST for each line (so multiline expressions aren't allowed). I hope it's clear how to change it.
A relatively easy way to deal with this requirement is to create a grammar which over-generates, and then reject the syntax we don't want using semantics.
Concretely, we use a grammar like this:
expr : expr AND expr
| expr OR expr
| expr IFF expr
| IF expr THEN expr
| expr ELSE expr /* generates some sentences we don't want! */
| '(' expr ')'
| ATOM
;
Note that ELSE is just an ordinary low precedence operator: any expression can be followed by ELSE and another expression. But in the semantic rule, we implement a check that the left side of ELSE is an IF expression. If not, then we raise an error.
This approach is not only easy to implement, but easy to document for the end-users and consequently easy to understand and use. The end user can accept the simple theory that ELSE is just another binary operator with a very low precedence, along with a rule which rejects it when it's not combined with IF/THEN.
Here is a test run from a complete program I wrote (using classic Yacc, in C):
$ echo 'a AND b OR c' | ./ifelse
OR(AND(a, b), c)
$ echo 'a OR b AND c' | ./ifelse
OR(a, AND(b, c))
$ echo 'IF a THEN b' | ./ifelse
IF(a, b)
Ordinary single IF/ELSE does what we want:
$ echo 'IF a THEN b ELSE c' | ./ifelse
IFELSE(a, b, c)
The key thing that you're after:
$ echo 'IF a THEN IF x THEN y ELSE c' | ./ifelse
IFELSE(a, IF(x, y), c)
Correctly, the ELSE goes with the outer IF. Here is the error case with a bad ELSE:
$ echo 'a OR b ELSE c' | ./ifelse
error: ELSE must pair with IF
<invalid>
Here are parentheses used to force the usual "else with closest if" behavior:
$ echo 'IF a THEN (IF x THEN y ELSE c)' | ./ifelse
IF(a, IFELSE(x, y, c))
The program shows what parse it is using by building an AST and then walking it to print it in prefix F(X, Y) syntax. (For which as a Lisp programmer, I had to hold back the gagging reflex a little bit).
The AST structure is also what allows the ELSE rule to detect whether its left argument is an expression of the correct kind.
Note: You might want the following to be handled, but it isn't:
$ echo 'IF a THEN IF x THEN y ELSE z ELSE w' | ./ifelse
error: ELSE must pair with IF
<invalid>
The issue here is that the ELSE w is being paired with an IFELSE expression.
A more sophisticated approach is possible that might be interesting to explore. The parser can treat ELSE as an ordinary binary operator and generate the AST that way. Then a whole separate walk can check the tree for valid ELSE usage and transform it as necessary. Or perhaps we can play here with the associativity of ELSE and treat cascading ELSE in the parser action in some suitable way.
The complete source code, which I saved in a file called ifelse.y and built using:
$ yacc ifelse.y
$ gcc -o ifelse y.tab.c
is here:
%{
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
typedef struct astnode {
int op;
struct astnode *left, *right;
char *lexeme;
} astnode;
void yyerror(const char *s)
{
fprintf(stderr, "error: %s\n", s);
}
void *xmalloc(size_t size)
{
void *p = malloc(size);
if (p)
return p;
yyerror("out of memory");
abort();
}
char *xstrdup(char *in)
{
size_t sz = strlen(in) + 1;
char *out = xmalloc(sz);
return strcpy(out, in);
}
astnode *astnode_cons(int op, astnode *left, astnode *right, char *lexeme)
{
astnode *a = xmalloc(sizeof *a);
a->op = op;
a->left = left;
a->right = right;
a->lexeme = lexeme;
return a;
}
int yylex(void);
astnode *ast;
%}
%union {
astnode *node;
char *lexeme;
int none;
}
%token<none> '(' ')'
%token<lexeme> ATOM
%left<none> ELSE
%left<none> IF THEN
%right<none> IFF
%left<none> OR
%left<none> AND
%type<node> top expr
%%
top : expr { ast = $1; }
expr : expr AND expr
{ $$ = astnode_cons(AND, $1, $3, 0); }
| expr OR expr
{ $$ = astnode_cons(OR, $1, $3, 0); }
| expr IFF expr
{ $$ = astnode_cons(IFF, $1, $3, 0); }
| IF expr THEN expr
{ $$ = astnode_cons(IF, $2, $4, 0); }
| expr ELSE expr
{ if ($1->op != IF)
{ yyerror("ELSE must pair with IF");
$$ = 0; }
else
{ $$ = astnode_cons(ELSE, $1, $3, 0); } }
| '(' expr ')'
{ $$ = $2; }
| ATOM
{ $$ = astnode_cons(ATOM, 0, 0, $1); }
;
%%
int yylex(void)
{
int ch;
char tok[64], *te = tok + sizeof(tok), *tp = tok;
while ((ch = getchar()) != EOF) {
if (isalnum((unsigned char) ch)) {
if (tp >= te - 1) {
yyerror("token overflow");
exit(EXIT_FAILURE); /* stop instead of writing past the end of tok */
}
*tp++ = ch;
} else if (isspace(ch)) {
if (tp > tok)
break;
} else if (ch == '(' || ch == ')') {
if (tp == tok)
return ch;
ungetc(ch, stdin);
break;
} else {
yyerror("invalid character");
}
}
if (tp > tok) {
yylval.none = 0;
*tp++ = 0;
if (strcmp(tok, "AND") == 0)
return AND;
if (strcmp(tok, "OR") == 0)
return OR;
if (strcmp(tok, "IFF") == 0)
return IFF;
if (strcmp(tok, "IF") == 0)
return IF;
if (strcmp(tok, "THEN") == 0)
return THEN;
if (strcmp(tok, "ELSE") == 0)
return ELSE;
yylval.lexeme = xstrdup(tok);
return ATOM;
}
return 0;
}
void ast_print(astnode *a)
{
if (a == 0) {
fputs("<invalid>", stdout);
return;
}
switch (a->op) {
case ATOM:
fputs(a->lexeme, stdout);
break;
case AND:
case OR:
case IF:
case IFF:
switch (a->op) {
case AND:
fputs("AND(", stdout);
break;
case OR:
fputs("OR(", stdout);
break;
case IF:
fputs("IF(", stdout);
break;
case IFF:
fputs("IFF(", stdout);
break;
}
ast_print(a->left);
fputs(", ", stdout);
ast_print(a->right);
putc(')', stdout);
break;
case ELSE:
fputs("IFELSE(", stdout);
ast_print(a->left->left);
fputs(", ", stdout);
ast_print(a->left->right);
fputs(", ", stdout);
ast_print(a->right);
putc(')', stdout);
break;
}
}
int main(void)
{
yyparse();
ast_print(ast);
puts("");
return 0;
}
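The idea of treating ELSE as a low-precedence binary operator and rejecting it semantically can also be sketched without a parser generator. What follows is a rough Python illustration using precedence climbing, not a port of the Yacc program: the function names, the tuple-based tree and the numeric binding powers are assumptions made for this example, chosen to mirror the order ELSE < THEN < IFF < OR < AND from the precedence declarations above.

```python
import re

# name -> (binding power, right associative?), weakest first
BINARY = {"ELSE": (1, False), "IFF": (3, True), "OR": (4, False), "AND": (5, False)}
THEN_POWER = 2  # sits between ELSE and IFF


def tokenize(text):
    return re.findall(r"[A-Za-z0-9]+|[()]", text)


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {tok!r}")
        pos += 1
        return tok

    def primary():
        tok = take()
        if tok == "(":
            inner = expr(0)
            take(")")
            return inner
        if tok == "IF":
            cond = expr(0)
            take("THEN")
            # The body stops before ELSE, which binds more weakly than THEN.
            return ("IF", cond, expr(THEN_POWER + 1))
        if tok in BINARY or tok in ("THEN", ")"):
            raise SyntaxError(f"unexpected {tok!r}")
        return tok  # an atom

    def expr(min_power):
        left = primary()
        while peek() in BINARY and BINARY[peek()][0] >= min_power:
            op = take()
            power, right_assoc = BINARY[op]
            right = expr(power if right_assoc else power + 1)
            if op == "ELSE":
                # Semantic check: the left operand of ELSE must be an IF node.
                if not (isinstance(left, tuple) and left[0] == "IF"):
                    raise SyntaxError("ELSE must pair with IF")
                left = ("IFELSE", left[1], left[2], right)
            else:
                left = (op, left, right)
        return left

    tree = expr(0)
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree


print(parse("IF a THEN IF x THEN y ELSE c"))
```

Like the Yacc version, this sketch attaches the ELSE to the outer IF and rejects a second cascading ELSE, because by then the left operand is already an IFELSE node.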

Why do I need to rewrite a grammar?

I'm trying to study compiler construction on my own. I'm reading a book and this is one of the exercises (I want to stress that this is not homework, I'm doing this on my own).
The following grammar represents simple arithmetic expressions in
LISP-like prefix notation
lexp -> number | ( op lexp-seq )
op -> + | - | *
lexp-seq -> lexp-seq lexp | lexp
For example, the expression (* (-2) 3 4) has a value of -24. Write a
Yacc/Bison specification for a program that will compute and print
the value of expressions in this syntax. (Hint: this will require
rewriting the grammar, as well as the use of a mechanism for passing
the operator to an lexp-seq.)
I have solved it. The solution is provided below. However I have questions about my solution as well as the problem itself. Here they are:
I don't modify the grammar in my solution and it seems to be working perfectly. There are no conflicts when the Yacc/Bison spec is converted to a .c file. So why is the author saying that I need to rewrite the grammar?
My solution uses a stack as the mechanism for passing the operator to an lexp-seq. Can someone suggest a different method, one that does not use a stack?
Here is my solution to the problem (I'm not posting the code for stack manipulation, on the assumption that the reader is familiar with how stacks work):
%{
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#include "linkedstack.h"
int yylex();
int yyerror();
node *operatorStack;
%}
%token NUMBER
%%
command : lexp { printf("%d\n", $1); };
lexp : NUMBER { $$ = $1; }
| '(' op lexp_seq ')'
{
int operator;
operatorStack = pop(operatorStack, &operator);
switch(operator) {
default:
yyerror("Unknown operator");
exit(1);
break;
case '+':
case '*':
$$ = $3;
break;
case '-':
$$ = -$3;
break;
}
}
;
op : '+' { operatorStack = push(operatorStack, '+'); }
| '-' { operatorStack = push(operatorStack, '-'); }
| '*' { operatorStack = push(operatorStack, '*'); }
;
lexp_seq : lexp_seq lexp
{
switch(operatorStack->data) {
default:
yyerror("Unrecognized operator");
exit(1);
break;
case '+':
$$ = $1 + $2;
break;
case '-':
$$ = $1 - $2;
break;
case '*':
$$ = $1 * $2;
break;
}
}
| lexp { $$ = $1; }
;
%%
int main(int argc, char** argv) {
int retVal;
init(operatorStack);
if (2 == argc && (0 == strcmp("-g", argv[1])))
yydebug = 1;
retVal = yyparse();
destroy(operatorStack);
return retVal;
}
int yylex() {
int c;
/* eliminate blanks*/
while((c = getchar()) == ' ');
if (isdigit(c)) {
ungetc(c, stdin);
scanf("%d", &yylval);
return (NUMBER);
}
/* makes the parse stop */
if (c == '\n') return 0;
return (c);
}
int yyerror(char * s) {
fprintf(stderr, "%s\n", s);
return 0;
} /* allows for printing of an error message */
Using a stack here is unnecessary if you rewrite the grammar.
One way is to use a different non-terminal for each operator:
command : lexp '\n' { printf("%d\n", $1); }
lexp : NUMBER
| '(' op_exp ')' { $$ = $2; }
op_exp : plus_exp | times_exp | minus_exp
plus_exp: '+' lexp { $$ = $2; }
| plus_exp lexp { $$ = $1 + $2; }
times_exp: '*' lexp { $$ = $2; }
| times_exp lexp { $$ = $1 * $2; }
minus_exp: '-' lexp { $$ = -$2; }
| minus_exp lexp { $$ = $1 - $2; }
I don't know if that is what your book's author had in mind. There are certainly other possible implementations.
In a real lisp-like language, you would need to do this quite differently, because the first object in an lexp could be a higher-order value (i.e. a function), which might even be the result of a function call, so you can't encode the operations into the syntax (and you can't necessarily partially evaluate the expression as you parse new arguments, either).
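As a rough illustration of why the per-operator rules need no stack, here is a small hand-written evaluator in Python that follows the same shape as the rewritten grammar: the operator is read once, the first operand initialises the accumulator, and each further operand is folded in. It is only a sketch, not output of Yacc/Bison, and the names tokenize and evaluate are made up for this example.

```python
import re


def tokenize(text):
    """Split the input into numbers, parentheses and operator characters."""
    return re.findall(r"\d+|[()+\-*]", text)


def evaluate(text):
    tokens = tokenize(text)
    pos = 0

    def next_token():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        return tok

    def lexp():
        tok = next_token()
        if tok.isdigit():
            return int(tok)
        if tok != "(":
            raise SyntaxError(f"unexpected {tok!r}")
        op = next_token()
        if op not in ("+", "-", "*"):
            raise SyntaxError(f"unknown operator {op!r}")
        # First operand: the "'+' lexp", "'*' lexp" and "'-' lexp" rules.
        acc = lexp()
        if op == "-":
            acc = -acc
        # Further operands: the "plus_exp lexp" style rules.
        while pos < len(tokens) and tokens[pos] != ")":
            rhs = lexp()
            if op == "+":
                acc += rhs
            elif op == "*":
                acc *= rhs
            else:
                acc -= rhs
        next_token()  # consume ")"
        return acc

    return lexp()


print(evaluate("(* (-2) 3 4)"))  # prints: -24
```

The operator lives in a local variable of the call that handles its parenthesised expression, which is the recursive-descent counterpart of encoding it in the non-terminal. Following the grammar above literally, (- 5 3) evaluates to -8, because the first operand is negated before the rest are subtracted.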

Using record types in FSYACC

In FSYACC it is common to have terminals that result in tuples. However, for convenience I want to use a record type instead. For example, if I have the following in my Abstract Syntax Tree (AbstractSyntaxTree.fsl):
namespace FS
module AbstractSyntaxTree =
    type B = { x : int; y : int }
    type Either =
        | Record of B
        | Tuple of int * string
    type A =
        | Int of int
        | String of string
        | IntTuple of Either
I'm not clear on the correct syntax in FSYACC (parser.fsy), because if I use:
%start a
%token <string> STRING
%token <System.Int32> INT
%token ATOMTOKEN TUPLETOKEN EOF
%type < A > a
%%
a:
| atomS { $1 }
| atomI { $1 }
| either { $1 }
atomI:
| ATOMTOKEN INT { Int($2) }
atomS:
| ATOMTOKEN STRING { String($2) }
either:
| TUPLETOKEN INT INT { Record {x=$2;y=$3} } // !!!
| TUPLETOKEN TUPLETOKEN INT STRING { Tuple( $3, $4) } // !!!
I would expect the type B and the Tuple to be inferred. However, FSYACC gives the error for both of the lines marked with "!!!":
This expression was expected to have type A but here has type Either
What is the correct syntax for the "either" production on the last two lines?
Don't you mean IntTuple($2, $3) as opposed to B($2, $3)? I'd try IntTuple{x=$2; y=$3}
EDIT: this works:
module Ast
type B = { x : int; y : int }
type A =
| Int of int
| String of string
| IntTuple of B
and
%{
open Ast
%}
%start a
%token <string> STRING
%token <System.Int32> INT
%token ATOMTOKEN TUPLETOKEN
%type < Ast.A > a
%%
a:
| atom { $1 }
| tuple { $1 }
atom:
| ATOMTOKEN INT { Int($2) }
| ATOMTOKEN STRING { String($2) }
tuple:
| TUPLETOKEN INT INT { IntTuple {x = $2; y = $3} }
EDIT 2: Note that the line %type < Ast.A > a requires your non-terminal a to be of type Ast.A. Since a uses the non-terminal tuple directly, tuple must also be of type Ast.A. You therefore have to wrap the record in IntTuple, so the syntax is IntTuple {x = $2; y = $3} rather than just {x = $2; y = $3}.

How smart is pattern match?

My program spends most of its time on array pattern matching, and I am wondering whether I should rewrite the function and drop the built-in pattern matching.
E.g. a very simple case
let categorize array =
    match array with
    | [|(1|2);(1|2);(1|2)|] -> 3
    | [|(1|2);(1|2);_|] -> 2
    | [|(1|2);_;_|] -> 1
    | _ -> 0
categorize [|2;1;3|]
Would the compiler use the smallest number of comparisons in this case, for example by recognizing that the first case is the same as the second except for the third element?
My actual patterns are more complicated, so an unoptimized match could cost far more time than a fully optimized one.
Straight from Reflector:
public static int categorize(int[] array)
{
if ((array != null) && (array.Length == 3))
{
switch (array[0])
{
case 1:
switch (array[1])
{
case 1:
switch (array[2])
{
case 1:
case 2:
goto Label_005C;
}
goto Label_005A;
case 2:
switch (array[2])
{
case 1:
case 2:
goto Label_005C;
}
goto Label_005A;
}
goto Label_0042;
case 2:
switch (array[1])
{
case 1:
switch (array[2])
{
case 1:
case 2:
goto Label_005C;
}
goto Label_005A;
case 2:
switch (array[2])
{
case 1:
case 2:
goto Label_005C;
}
goto Label_005A;
}
goto Label_0042;
}
}
return 0;
Label_0042:
return 1;
Label_005A:
return 2;
Label_005C:
return 3;
}
I don't see anything inefficient.
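One way to read that decompiled output is as a decision tree in which each array element is inspected at most once. The Python sketch below is only an illustration of that reading, not something produced by the F# compiler: categorize_rules tries the four patterns top to bottom as written in the source, categorize_tree follows the shape of the nested switches, and the final check compares them on every 3-element array over the values 0 to 3. Both function names are made up for this example.

```python
from itertools import product


def categorize_rules(a):
    """Try the four patterns in source order, re-testing elements each time."""
    def ok(v):
        return v in (1, 2)
    if len(a) == 3 and ok(a[0]) and ok(a[1]) and ok(a[2]):
        return 3
    if len(a) == 3 and ok(a[0]) and ok(a[1]):
        return 2
    if len(a) == 3 and ok(a[0]):
        return 1
    return 0


def categorize_tree(a):
    """Decision tree shaped like the nested switches: one test per element."""
    if len(a) != 3:
        return 0
    if a[0] not in (1, 2):
        return 0
    if a[1] not in (1, 2):
        return 1   # corresponds to Label_0042
    if a[2] not in (1, 2):
        return 2   # corresponds to Label_005A
    return 3       # corresponds to Label_005C


# Exhaustive comparison on a small domain.
same = all(
    categorize_rules(list(t)) == categorize_tree(list(t))
    for t in product(range(4), repeat=3)
)
print(same)  # prints: True
```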
What is really missing in your question is the actual subject area. In other words, your question is quite generic (which is, generally, good for SO), while coding against your actual problem may solve the entire issue in an elegant manner.
If I extrapolate your question as it currently stands, you just need the index of the first element which is neither 1 nor 2, and the implementation is trivial:
let categorize arr =
    try
        Array.findIndex (fun x -> not(x = 1 || x = 2)) arr
    with
    | :? System.Collections.Generic.KeyNotFoundException -> Array.length arr
// Usage
let x1 = categorize [|2;1;3|] // returns 2
let x2 = categorize [|4;2;1;3|] // returns 0
let x3 = categorize [|1;2;1|] // returns 3
As a free bonus, you get code that is agnostic to array length and very readable.
Is this what you need?
You could write:
let f (xs: _ []) =
    if xs.Length=3 then
        let p n = n=1 || n=2
        if p xs.[0] then
            if p xs.[1] then
                if p xs.[2] then 3
                else 2
            else 1
        else 0
    else 0
Test 1
F#
let test1 x =
    match x with
    | [| 1; 2; 3 |] -> A
    | [| 1; 2; _ |] -> A
    | [| 1; _; _ |] -> A
Decompiled C#
if (x != null && x.Length == 3)
{
switch (x[0])
{
case 1:
switch (x[1])
{
case 2:
switch (x[2])
{
case 3:
return Program.MyType.A;
default:
return Program.MyType.A;
}
break;
default:
return Program.MyType.A;
}
break;
}
}
throw new MatchFailureException(...);
Decompiled IL
Code size 107
Conclusion
Pattern Match doesn't optimize based on the values after ->.
Within that limit, Pattern Match does find an optimized approach for array decomposition.
Incomplete pattern matches always throw exceptions, so there is no harm in adding a wildcard to catch the missing patterns and throw exceptions explicitly.
Test 2
F#
let test2 x =
    match x with
    | [| 1; 2; 3 |] -> A
    | [| _; 2; 3 |] -> B
    | [| _; _; 3 |] -> C
Decompiled C#
if (x != null && x.Length == 3)
{
switch (x[0])
{
case 1:
switch (x[1])
{
case 2:
switch (x[2])
{
case 3:
return Program.MyType.A;
default:
goto IL_49;
}
break;
default:
switch (x[2])
{
case 3:
break;
default:
goto IL_49;
}
break;
}
break;
default:
switch (x[1])
{
case 2:
switch (x[2])
{
case 3:
return Program.MyType.B;
default:
goto IL_49;
}
break;
default:
switch (x[2])
{
case 3:
goto IL_58;
}
goto IL_49;
}
break;
}
IL_58:
return Program.MyType.C;
}
IL_49:
throw new MatchFailureException(...);
Decompiled IL
Code size 185
Conclusion
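Pattern Match checks values from the beginning of the array to the end, so it fails to find the optimal approach here (every rule requires the last element to be 3, so testing x.[2] first would be cheaper).
The code size is about twice that of an optimal version.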
Test 3
F#
let test3 x =
    match x with
    | [| 1; 2; 3 |] -> A
    | [| 1; 2; a |] when a <> 3 -> B
    | [| 1; 2; _ |] -> C
Decompiled C#
if (x != null && x.Length == 3)
{
switch (x[0])
{
case 1:
switch (x[1])
{
case 2:
switch (x[2])
{
case 3:
return Program.MyType.A;
default:
if (x[2] != 3)
{
int a = x[2];
return Program.MyType.B;
}
break;
}
break;
}
break;
}
}
if (x != null && x.Length == 3)
{
switch (x[0])
{
case 1:
switch (x[1])
{
case 2:
return Program.MyType.C;
}
break;
}
}
throw new MatchFailureException(...);
Conclusion
The compiler isn't smart enough to see through a guard when checking completeness or redundancy.
A guard makes Pattern Match produce odd, unoptimized code: the array checks are repeated for the rule that follows the guard.
Test 4
F#
let (| Is3 | IsNot3 |) x =
    if x = 3 then Is3 else IsNot3
let test4 x =
    match x with
    | [| 1; 2; 3 |] -> A
    | [| 1; 2; Is3 |] -> B
    | [| 1; 2; IsNot3 |] -> C
    | [| 1; 2; _ |] -> D // This rule will never be matched.
Decompiled C#
if (x != null && x.Length == 3)
{
switch (x[0])
{
case 1:
switch (x[1])
{
case 2:
switch (x[2])
{
case 3:
return Program.MyType.A;
default:
{
FSharpChoice<Unit, Unit> fSharpChoice = Program.|Is3|IsNot3|(x[2]);
if (fSharpChoice is FSharpChoice<Unit, Unit>.Choice2Of2)
{
return Program.MyType.C;
}
return Program.MyType.B;
}
}
break;
}
break;
}
}
throw new MatchFailureException(...);
Conclusion
Multi-case active patterns compile to FSharpChoice.
The compiler is able to check completeness and redundancy among active patterns, but it cannot compare them with normal patterns.
Unreachable patterns are not compiled.
Test 5
F#
let (| Equal3 |) x =
    if x = 3 then Equal3 1 else Equal3 0 // Equivalent to "then 1 else 0"
let test5 x =
    match x with
    | [| 1; 2; 3 |] -> A
    | [| 1; 2; Equal3 0 |] -> B
    | [| 1; 2; Equal3 1 |] -> C
    | [| 1; 2; _ |] -> D
Decompiled C#
if (x != null && x.Length == 3)
{
switch (x[0])
{
case 1:
switch (x[1])
{
case 2:
switch (x[2])
{
case 3:
return Program.MyType.A;
default:
{
int num = x[2];
switch ((num != 3) ? 0 : 1)
{
case 0:
return Program.MyType.B;
case 1:
return Program.MyType.C;
default:
return Program.MyType.D;
}
break;
}
}
break;
}
break;
}
}
throw new MatchFailureException(...);
Conclusion
Single-case active patterns compile to their return type.
The compiler sometimes inlines the function automatically.
Test 6
F#
let (| Partial3 | _ |) x =
    if x = 3 then Some (Partial3 true) else None // Equivalent to "then Some true"
let test6 x =
    match x with
    | [| 1; 2; 3 |] -> A
    | [| 1; 2; Partial3 true |] -> B
    | [| 1; 2; Partial3 true |] -> C
Decompiled C#
if (x != null && x.Length == 3)
{
switch (x[0])
{
case 1:
switch (x[1])
{
case 2:
switch (x[2])
{
case 3:
return Program.MyType.A;
default:
{
FSharpOption<bool> fSharpOption = Program.|Partial3|_|(x[2]);
if (fSharpOption != null && fSharpOption.Value)
{
return Program.MyType.B;
}
break;
}
}
break;
}
break;
}
}
if (x != null && x.Length == 3)
{
switch (x[0])
{
case 1:
switch (x[1])
{
case 2:
{
FSharpOption<bool> fSharpOption = Program.|Partial3|_|(x[2]);
if (fSharpOption != null && fSharpOption.Value)
{
return Program.MyType.C;
}
break;
}
}
break;
}
}
throw new MatchFailureException(...);
Conclusion
Partial active patterns compile to FSharpOption.
The compiler is unable to check completeness or redundancy of partial active patterns.
Test 7
F#
type MyOne =
| AA
| BB of int
| CC
type MyAnother =
| AAA
| BBB of int
| CCC
| DDD
let test7a x =
    match x with
    | AA -> 2
let test7b x =
    match x with
    | AAA -> 2
Decompiled C#
public static int test7a(Program.MyOne x)
{
if (x is Program.MyOne._AA)
{
return 2;
}
throw new MatchFailureException(...);
}
public static int test7b(Program.MyAnother x)
{
if (x.Tag == 0)
{
return 2;
}
throw new MatchFailureException(...);
}
Conclusion
If there are more than 3 cases in the union, Pattern Match uses the Tag property instead of is. (This also applies to multi-case active patterns.)
Often a Pattern Match results in multiple is checks, which degrade performance greatly.

Resources