How do I write a grammar for a select sql statement? - parsing

I am writing a grammar for an SQL parser and I've been stuck on this for a while now-
F: FETCH fields FROM tables Conditions
;
fields: ALL
| ids
;
ids: ID ids_
;
ids_: ',' ID ids_
| { /*empty*/ }
;
tables: ID
;
Conditions: WHERE ConditionList
| { /*empty*/ }
;
ConditionList: Condition ConditionList_
;
ConditionList_: BoolOp Condition ConditionList_
| { /*empty*/ }
;
Condition: Operand RELOP Operand
| NOT Operand RELOP Operand
;
Operand: ID
| NUM
;
BoolOp: AND
| OR
;
For some reason when the lexer reads a FROM token, the parser terminates with an error. Here's the lex code-
FETCH { printf("fetch "); return FETCH; }
FROM { printf("from "); return UNIQUE; }
ALL { printf("all "); return ALL; }
WHERE { printf("where "); return WHERE; }
AND { printf("and "); return AND; }
OR { printf("or "); return OR; }
NOT { printf("not "); return NOT; }
RelOp { printf("%s", yytext); yylval.string = strdup(yytext); return RELOP; }
[0-9]* {printf("num "); return NUM; }
[_a-zA-Z][_a-zA-Z0-9]* { printf("id "); return ID; }
{symbol} { printf("%c ", yytext[0]); return yytext[0]; }
. { }
RelOp is a pattern- RelOp ("<"|"<="|">"|">="|"=")
and symbol is a pattern- symbol ("("|")"|",")

Your grammar starts with
F: FETCH fields FROM tables Conditions
However, your lexer rules include
FROM { printf("from "); return UNIQUE; }
Since the lexer returns UNIQUE where the grammar expects FROM, the grammar rule cannot match and the parser reports a syntax error as soon as that token arrives.
If those printf calls in your lexer are some kind of debugging attempt, they are not very useful since they won't tell you whether you are actually returning the correct token type (and value, in the cases where that is necessary). I strongly recommend using bison's trace feature to get an accurate view of what is going on. (Bison's trace will tell you which token type is being received by the parser, for example.)
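Bison's trace is typically enabled by generating the parser with -t (or %define parse.trace) and setting yydebug = 1 before calling yyparse(). Independently of that, the point about the printf calls can be illustrated with a minimal C sketch. The token codes and function names below are hypothetical stand-ins, not what bison generates; the idea is only that debug output should show the code being returned, not the text that was matched.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical token codes; a real build would take them from bison's generated header. */
enum token { TOK_NONE = 0, TOK_FETCH = 258, TOK_FROM, TOK_UNIQUE };

/* Translate a token code into a printable name for debug output. */
const char *token_name(enum token t)
{
    switch (t) {
    case TOK_FETCH:  return "FETCH";
    case TOK_FROM:   return "FROM";
    case TOK_UNIQUE: return "UNIQUE";
    default:         return "NONE";
    }
}

/* Keyword lookup with the same slip as the question: FROM yields UNIQUE. */
enum token buggy_keyword(const char *text)
{
    if (strcmp(text, "FETCH") == 0) return TOK_FETCH;
    if (strcmp(text, "FROM") == 0)  return TOK_UNIQUE; /* wrong code */
    return TOK_NONE;
}

/* Corrected lookup: the code returned matches the grammar's FROM. */
enum token fixed_keyword(const char *text)
{
    if (strcmp(text, "FETCH") == 0) return TOK_FETCH;
    if (strcmp(text, "FROM") == 0)  return TOK_FROM;
    return TOK_NONE;
}

/* Log the lexeme together with the code that is actually returned. */
enum token traced(const char *text, enum token t)
{
    fprintf(stderr, "lexeme \"%s\" -> token %s\n", text, token_name(t));
    return t;
}
```

Wrapping each return value in something like traced() would make the FROM/UNIQUE mismatch visible immediately, which printing the matched text alone cannot do.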

Related

Breaking head over how to get position of token with a rule - ANTLR4 / grammar

I'm writing a little grammar using ANTLR, and I have a rule like this:
operation : OPERATION (IDENT | EXPR) ',' (IDENT | EXPR);
...
OPERATION : 'ADD' | 'SUB' | 'MUL' | 'DIV' ;
IDENT : [a-z]+;
EXPR : INTEGER | FLOAT;
INTEGER : [0-9]+ | '-'[0-9]+ ;
FLOAT : [0-9]+'.'[0-9]+ | '-'[0-9]+'.'[0-9]+ ;
In the Java listener, when an operation contains both an IDENT and an EXPR, how do I determine the order in which they appear?
Obviously the rule can match both
ADD 10, d
or
ADD d, 10
But in the listener that ANTLR4 generates for the rule, when both IDENT() and EXPR() are present, how do I get their order? I want to assign the left and right operands correctly.
I've been racking my brain over this. Is there a simple way, or should I rewrite the rule itself? ctx.getTokens() requires a token type, which defeats the purpose: if I have to specify the type, I cannot recover the sequence of the tokens in the rule.
You can do it like this:
operation : OPERATION lhs=(IDENT | EXPR) ',' rhs=(IDENT | EXPR);
and then inside your listener, do this:
@Override
public void enterOperation(TParser.OperationContext ctx) {
if (ctx.lhs.getType() == TParser.IDENT) {
// left hand side is an identifier
} else {
// left hand side is an expression
}
// check `rhs` the same way
}
where TParser comes from the grammar file T.g4. Change this accordingly.
Another solution would be something like this:
operation
: OPERATION ident_or_expr ',' ident_or_expr
;
ident_or_expr
: IDENT
| EXPR
;
and then in your listener:
@Override
public void enterOperation(TParser.OperationContext ctx) {
Double lhs = findValueFor(ctx.ident_or_expr().get(0));
Double rhs = findValueFor(ctx.ident_or_expr().get(1));
...
}
private Double findValueFor(TParser.Ident_or_exprContext ctx) {
if (ctx.IDENT() != null) {
// it's an identifier
} else {
// it's an expression
}
}

Need Lex regular expression to match string up to newline

I want to parse strings of the type :
a=some value
b=some other value
There are no blanks around '=' and values extend up to newline. There may be leading spaces.
My lex specification (relevant part) is:
%%
a= { printf("Found attr %s\n", yytext); return aATTR; }
^[ \r\t]+ { printf("Found space at the start %s\n", yytext); }
([^a-z]=).*$ { printf("Found value %s\n", yytext); }
\n { return NEWLINE; }
%%
I tried .*$ [^\n]* and a few other regular expressions but to no avail.
This looks pretty simple. Any suggestions? I am also aware that lex returns the longest match so that complicates it further. I get the whole line matched for some regular expressions I tried.
You probably want to incorporate separate start states. These permit you to encode simple contexts. The simple example below captures your id, operator and value on each call to yylex().
%{
char id;
char op;
char *value;
%}
%x VAL OP
%%
<INITIAL>[a-z]+ {
id = yytext[0];
yyleng = 0;
BEGIN OP;
}
<INITIAL,OP>[ \t]*
<OP>=[ \t]* {
op = yytext[0];
yyleng = 0;
BEGIN VAL;
}
<VAL>.*\n {
value = yytext;
BEGIN INITIAL;
return 1;
}
%%
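The context switching that start states provide can also be sketched as a small hand-written state machine in plain C. This is only an illustration under assumptions: the function name, its signature, and the two-state layout are hypothetical, and it handles a single name=value line with optional leading blanks and no blanks around '='.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Scanner contexts, playing the role that start states play in lex. */
enum state { ST_NAME, ST_VALUE };

/*
 * Split one "name=value" line into its parts.
 * Returns 1 on success, 0 if the line is malformed or a buffer is too small.
 */
int parse_attr_line(const char *line, char *id, size_t id_size,
                    char *value, size_t value_size)
{
    enum state st = ST_NAME;
    size_t id_len = 0, value_len = 0;
    const char *p;

    if (id_size == 0 || value_size == 0)
        return 0;

    for (p = line; *p != '\0' && *p != '\n'; ++p) {
        unsigned char c = (unsigned char)*p;
        if (st == ST_NAME) {
            if ((c == ' ' || c == '\t' || c == '\r') && id_len == 0) {
                continue;                    /* skip leading blanks */
            } else if (islower(c)) {
                if (id_len + 1 >= id_size)
                    return 0;
                id[id_len++] = (char)c;      /* accumulate the name */
            } else if (c == '=' && id_len > 0) {
                st = ST_VALUE;               /* switch context, like BEGIN VAL */
            } else {
                return 0;                    /* unexpected character */
            }
        } else {
            if (value_len + 1 >= value_size)
                return 0;
            value[value_len++] = (char)c;    /* everything up to the newline */
        }
    }

    if (st != ST_VALUE)
        return 0;                            /* no '=' was seen */
    id[id_len] = '\0';
    value[value_len] = '\0';
    return 1;
}
```

Because the value is copied into a caller-supplied buffer, it stays valid after the scan, unlike a pointer into yytext, which lex reuses on the next match.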

Happy resolution of an error

In Regular Expressions, I can write:
a(.)*b
And this will match the entire string in, for example
acdabb
I try to simulate this with a token stream in Happy.
t : a wildcard b
wildcard : {- empty -} | wild wildcard
wild : a | b | c | d | whatever
However, the parser generated by Happy does not recognize
acdabb
Is there a way around this/am I doing it wrong?
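As you noted, Happy generates an LALR(1) parser, as its documentation states. You noted in the comments that changing the direction of the recursion resolves the problem, but for the novice it might not be clear how to do that. With the right-recursive rule, the parser has to decide at each b whether the wildcard has ended, and one token of lookahead is not enough for that; with left recursion it can shift the b first and decide afterwards. So the right-recursive alternative wild wildcard is rewritten as the left-recursive wildcard wild, which results in the following file: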
{
module ABCParser (parse) where
}
%tokentype { Char }
%token a { 'a' }
%token b { 'b' }
%token c { 'c' }
%token d { 'd' }
%token whatever { '\n' }
%name parse t
%%
t
: a wildcard b
{ }
wildcard
:
{ }
| wildcard wild
{ }
wild
: a
{ }
| b
{ }
| c
{ }
| d
{ }
| whatever
{ }
Which now generates a working parser.

semicolon as delimiter in custom grammar parsed by flex/bison

I'm trying to write a simple parser for a meta programming language.
Everything works fine, but I want ';' to be the statement delimiter, not a newline, and omitting the semicolon should not be allowed.
So this is the expected behaviour:
// good code
v1 = v2;
v3 = 23;
should parse without errors
But:
// bad code
v1 = v2
v3 = 23;
should fail
yet if I remove the 'empty' rule from separator both codes fail like this:
ID to ID
Error detected in parsing: syntax error, unexpected ID, expecting SEMICOLON
;
If I leave the 'empty' rule active, then both codes are accepted, which is not desired.
ID to ID // should raise error
ID to NUM;
Any help is welcome here, as most tutorials do not cover delimiters at all.
Here is a simplified version of my parser/lexer:
parser.l:
%{
#include "parser.tab.h"
#include<stdio.h>
%}
num [0-9]
alpha [a-zA-Z_]
alphanum [a-zA-Z_0-9]
comment "//"[^\n]*"\n"
string \"[^\"]*\"
whitespace [ \t\n]
%x ML_COMMENT
%%
<INITIAL>"/*" {BEGIN(ML_COMMENT); printf("/*");}
<ML_COMMENT>"*/" {BEGIN(INITIAL); printf("*/");}
<ML_COMMENT>[.]+ { }
<ML_COMMENT>[\n]+ { printf("\n"); }
{comment}+ {printf("%s",yytext);}
{alpha}{alphanum}+ { yylval.str= strdup(yytext); return ID;}
{num}+ { yylval.str= strdup(yytext); return NUM;}
{string} { yylval.str= strdup(yytext); return STRING;}
';' {return SEMICOLON;}
"=" {return ASSIGNMENT;}
" "+ { }
<<EOF>> {exit(0); /* this is suboptimal */}
%%
parser.y:
%{
#include<stdio.h>
#include<string.h>
%}
%error-verbose
%union{
char *str;
}
%token <str> ID
%token <str> NUM
%token <str> STRING
%left SEMICOLON
%left ASSIGNMENT
%start input
%%
input: /* empty */
| expression separator input
;
expression: assign
| error {}
;
separator: SEMICOLON
| empty
;
empty:
;
assign: ID ASSIGNMENT ID { printf("ID to ID"); }
| ID ASSIGNMENT STRING { printf("ID to STRING"); }
| ID ASSIGNMENT NUM { printf("ID to NUM"); }
;
%%
yyerror(char* str)
{
printf("Error detected in parsing: %s\n", str);
}
main()
{
yyparse();
}
Compiled like this:
$>flex -t parser.l > parser.lex.yy.c
$>bison -v -d parser.y
$>cc parser.tab.c parser.lex.yy.c -lfl -o parser
Never mind... the problematic line was this one:
';' {return SEMICOLON;}
which needed to be changed to
";" {return SEMICOLON;}
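In flex, single quotes are ordinary characters and only double quotes delimit a literal string, so the old pattern never matched a bare semicolon and SEMICOLON was never returned. Now the behaviour is correct. :-)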
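As a rough analogy in C, the two patterns behave like the two helpers below. The function names are hypothetical and this is not how flex is implemented; it only shows what text each pattern would accept at the start of the input.

```c
#include <string.h>

/* Mimics the flex pattern ';' : quote, semicolon, quote, three characters in total. */
int matches_single_quoted_pattern(const char *input)
{
    return strncmp(input, "';'", 3) == 0;
}

/* Mimics the flex pattern ";" : exactly one semicolon character. */
int matches_double_quoted_pattern(const char *input)
{
    return input[0] == ';';
}
```

For the input v1 = v2; only the second helper accepts the trailing semicolon, which is why the parser never saw a SEMICOLON token with the original rule.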

Why is this fsyacc input producing F# that does not compile?

My fsyacc code is giving a compiler error saying a variable is not found, but I'm not sure why. I was hoping someone could point out the issue.
%{
open Ast
%}
// The start token becomes a parser function in the compiled code:
%start start
// These are the terminal tokens of the grammar along with the types of
// the data carried by each token:
%token NAME
%token ARROW TICK VOID
%token LPAREN RPAREN
%token EOF
// This is the type of the data produced by a successful reduction of the 'start'
// symbol:
%type < Query > start
%%
// These are the rules of the grammar along with the F# code of the
// actions executed as rules are reduced. In this case the actions
// produce data using F# data construction terms.
start: Query { Terms($1) }
Query:
| Term EOF { $1 }
Term:
| VOID { Void }
| NAME { Conc($1) }
| TICK NAME { Abst($2) }
| LPAREN Term RPAREN { Lmda($2) }
| Term ARROW Term { TermList($1, $3) }
The line | NAME {Conc($1)} and the following line both give this error:
error FS0039: The value or constructor '_1' is not defined
I understand the syntactic issue, but what's wrong with the yacc input?
If it helps, here is the Ast definition:
namespace Ast
open System
type Query =
| Terms of Term
and Term =
| Void
| Conc of String
| Abst of String
| Lmda of Term
| TermList of Term * Term
And the fslex input:
{
module Lexer
open System
open Parser
open Microsoft.FSharp.Text.Lexing
let lexeme lexbuf =
LexBuffer<char>.LexemeString lexbuf
}
// These are some regular expression definitions
let name = ['a'-'z' 'A'-'Z' '0'-'9']
let whitespace = [' ' '\t' ]
let newline = ('\n' | '\r' '\n')
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { tokenize lexbuf }
// Operators
| "->" { ARROW }
| "'" { TICK }
| "void" { VOID }
// Misc
| "(" { LPAREN }
| ")" { RPAREN }
// Numeric constants
| name+ { NAME }
// EOF
| eof { EOF }
This is not FsYacc's fault. NAME is declared as a valueless token, so there is no $1 (or $2) value for the Conc and Abst actions to use.
You'd want to do these fixes:
%token NAME
to
%token <string> NAME
and
| name+ { NAME }
to
| name+ { NAME (lexeme lexbuf) }
Everything should now compile.
