I'm trying to write a simple parser for a metaprogramming language.
Everything works fine, but I want ';' to be the statement delimiter: a newline must not end a statement, and omitting the semicolon must be an error.
So this is the expected behaviour:
// good code
v1 = v2;
v3 = 23;
should parse without errors
But:
// bad code
v1 = v2
v3 = 23;
should fail
Yet if I remove the 'empty' alternative from separator, both inputs fail like this:
ID to ID
Error detected in parsing: syntax error, unexpected ID, expecting SEMICOLON
;
If I leave the 'empty' alternative active, both inputs are accepted, which is not what I want:
ID to ID // should raise error
ID to NUM;
Any help is welcome here, as most tutorials do not cover delimiters at all.
Here is a simplified version of my parser/lexer:
parser.l:
%{
#include "parser.tab.h"
#include<stdio.h>
%}
num [0-9]
alpha [a-zA-Z_]
alphanum [a-zA-Z_0-9]
comment "//"[^\n]*"\n"
string \"[^\"]*\"
whitespace [ \t\n]
%x ML_COMMENT
%%
<INITIAL>"/*" {BEGIN(ML_COMMENT); printf("/*");}
<ML_COMMENT>"*/" {BEGIN(INITIAL); printf("*/");}
<ML_COMMENT>[.]+ { }
<ML_COMMENT>[\n]+ { printf("\n"); }
{comment}+ {printf("%s",yytext);}
{alpha}{alphanum}+ { yylval.str= strdup(yytext); return ID;}
{num}+ { yylval.str= strdup(yytext); return NUM;}
{string} { yylval.str= strdup(yytext); return STRING;}
';' {return SEMICOLON;}
"=" {return ASSIGNMENT;}
" "+ { }
<<EOF>> {exit(0); /* this is suboptimal */}
%%
parser.y:
%{
#include<stdio.h>
#include<string.h>
%}
%error-verbose
%union{
char *str;
}
%token <str> ID
%token <str> NUM
%token <str> STRING
%left SEMICOLON
%left ASSIGNMENT
%start input
%%
input: /* empty */
| expression separator input
;
expression: assign
| error {}
;
separator: SEMICOLON
| empty
;
empty:
;
assign: ID ASSIGNMENT ID { printf("ID to ID"); }
| ID ASSIGNMENT STRING { printf("ID to STRING"); }
| ID ASSIGNMENT NUM { printf("ID to NUM"); }
;
%%
yyerror(char* str)
{
printf("Error detected in parsing: %s\n", str);
}
main()
{
yyparse();
}
Compiled like this:
$>flex -t parser.l > parser.lex.yy.c
$>bison -v -d parser.y
$>cc parser.tab.c parser.lex.yy.c -lfl -o parser
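Never mind... the problematic line was this one:
';' {return SEMICOLON;}
which had to be changed to
";" {return SEMICOLON;}
In flex, single quotes are ordinary characters, so ';' matches the three-character sequence quote, semicolon, quote; only double quotes delimit a literal string. A lone ; in the input therefore never produced a SEMICOLON token, and the stray ; in the output above is presumably flex's default rule echoing the unmatched character.
Now the behaviour is correct. :-)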
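To make the difference concrete, here is a toy model in plain C. It is not flex's actual matcher, and the helper name literal_match is made up for illustration. It treats a pattern as the literal text flex would look for: the pattern ';' is three characters long because flex does not treat single quotes as quoting, while ";" denotes the single character ;.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for matching a literal flex pattern at the start of input.
 * Returns the number of characters matched, or 0 if there is no match. */
size_t literal_match(const char *literal, const char *input)
{
    size_t n = strlen(literal);
    return strncmp(literal, input, n) == 0 ? n : 0;
}
```

Under this model, literal_match("';'", ";") returns 0, which mirrors why the parser never saw a SEMICOLON token, while literal_match(";", ";") returns 1.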
Related
I want to parse strings of the type :
a=some value
b=some other value
There are no blanks around '=' and values extend up to newline. There may be leading spaces.
My lex specification (relevant part) is:
%%
a= { printf("Found attr %s\n", yytext); return aATTR; }
^[ \r\t]+ { printf("Found space at the start %s\n", yytext); }
([^a-z]=).*$ { printf("Found value %s\n", yytext); }
\n { return NEWLINE; }
%%
I tried .*$, [^\n]* and a few other regular expressions, but to no avail.
This looks pretty simple. Any suggestions? I am also aware that lex returns the longest match, which complicates things further: for some of the regular expressions I tried, the whole line was matched.
You probably want to incorporate separate start states. These permit you to encode simple contexts. The simple example below captures your id, operator and value on each call to yylex().
%{
char id;
char op;
char *value;
%}
%x VAL OP
%%
<INITIAL>[a-z]+ {
id = yytext[0];
yyleng = 0;
BEGIN OP;
}
<INITIAL,OP>[ \t]*
<OP>=[ \t]* {
op = yytext[0];
yyleng = 0;
BEGIN VAL;
}
<VAL>.*\n {
value = yytext;
BEGIN INITIAL;
return 1;
}
%%
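The same idea can be sketched as a hand-written scanner in plain C, with the start states turned into an explicit state variable. This is only an illustration of the technique, not what flex generates; struct attr, scan_attr and the 128-byte value buffer are assumptions made for the example. Unlike the flex version above, it does not skip blanks after '=', since the question says there are none.

```c
#include <stddef.h>
#include <string.h>

enum state { ST_INITIAL, ST_OP, ST_VAL };

struct attr {
    char id;          /* first letter of the attribute name */
    char op;          /* the operator character, '=' here */
    char value[128];  /* text after '=' up to the newline (truncated if longer) */
};

/* Returns 1 if line has the form [blanks]name=value, else 0. */
int scan_attr(const char *line, struct attr *out)
{
    enum state st = ST_INITIAL;
    const char *p = line;

    /* Leading blanks are allowed and ignored. */
    while (*p == ' ' || *p == '\t' || *p == '\r') p++;

    for (;;) {
        switch (st) {
        case ST_INITIAL:            /* expect a lower-case name */
            if (*p < 'a' || *p > 'z') return 0;
            out->id = *p;
            while (*p >= 'a' && *p <= 'z') p++;
            st = ST_OP;
            break;
        case ST_OP:                 /* expect the operator */
            if (*p != '=') return 0;
            out->op = *p++;
            st = ST_VAL;
            break;
        case ST_VAL: {              /* everything up to the newline is the value */
            size_t n = strcspn(p, "\n");
            if (n >= sizeof out->value) n = sizeof out->value - 1;
            memcpy(out->value, p, n);
            out->value[n] = '\0';
            return 1;
        }
        }
    }
}
```

Because the value is only scanned once the state is ST_VAL, the "longest match" problem from the question does not arise: the name and the operator have already been consumed by then.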
When I'm trying to check the expression "boolean x;" I'm getting "syntax error" and I can't understand why.
When I'm checking the expression "x = 3;" or "2 = 1;", the abstract syntax tree is generated and no errors are presented.
(I'm not allowed to use anything besides Lex and Yacc in this project, and I'm using Ubuntu.)
Lex file:
%%
[\n\t ]+;
boolean {return BOOL;}
TRUE {return TRUE;}
FALSE {return FALSE;}
[0-9]+ {return NUM;}
[a-zA-Z][0-9a-zA-Z]* {return ID;}
. {return yytext[0];}
%%
Yacc file:
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct node{
struct node *left;
struct node *right;
char *token;
} node;
node *mknode(node *left, node *right, char *token);
void printtree(node *tree);
#define YYSTYPE struct node *
%}
%start code
%token ID,NUM,TRUE,FALSE,BOOL
%right '='
%%
code:lines{printtree($1); printf("\n");}
lines:calcExp';'|assignExp';'|boolExp ';'{$$ = $1;}
boolExp: boolST id{$$=$2;}
calcExp: number '+' number {$$ = mknode($1,$3,"+");}
assignExp: id '=' number{$$ = mknode($1,$3,"=");}
boolSt : BOOL;
id : ID {$$ = mknode(0,0,yytext);}
number : NUM{$$ = mknode(0,0,yytext);}
%%
#include "lex.yy.c"
int main (void) {return yyparse();}
node *mknode(node *left, node *right, char *token){
node *newnode = (node *)malloc(sizeof(node));
char *newstr = (char *)malloc(strlen(token)+1);
strcpy(newstr, token);
newnode->left = left;
newnode->right = right;
newnode->token = newstr;
return newnode;
}
void printtree(node *tree){
if (tree->left || tree->right)
printf("(");
printf(" %s ", tree->token);
if(tree->left)
printtree(tree->left);
if(tree->right)
printtree(tree->right);
if(tree->left || tree->right)
printf(")");
}
void yyerror (char *s) {
fprintf (stderr, "%s\n",s);}
The first step to debug syntax errors is to enable %error-verbose in the bison file. Now instead of just saying "syntax errors", it tells us there was an unexpected character after the boolean keyword when it expected an identifier.
So let's add a print statement to the . rule in the lexer that prints the matched character, so that we can see where it produces unexpected characters. Now we see that it prints a space, but spaces should have been ignored, right? So let's look at the rule that's supposed to do that:
[\n\t ]+;
If your editor has proper syntax highlighting for flex files, the problem should become apparent now: The ; is seen as part of the rule, not the action. That is, the rule matches white space, followed by a semicolon, instead of just matching white space.
So remove the semicolon and it should work.
Is it possible for a Bison rule to expand instead of reducing so that it turns into more tokens? Asked a different way: is it possible to insert extra tokens to be parsed before the next token in the parser input?
Here is an example where I might want this:
Suppose I want a parser that understands three token types: numbers (just positive integers for the sake of simplicity, INT), words (any number of letters, upper or lower case, STRING) and some other symbol (let's use an exclamation mark for no good reason, EXC).
Suppose I have a rule that reduces a word followed by a number followed by an exclamation mark. This rule results in an integer type, let's say for now that it simply doubles its input. This rule also allows itself to be the integer that it parses.
I also have a rule to accept any number of these in a row (the start rule).
The Bison parser looks like this: (quicktest.y)
%{
#include <stdio.h>
%}
%union {
int INT_VAL;
}
%token STRING EXC
%token <INT_VAL> INT
%type <INT_VAL> somenumber
%%
start: somenumber {printf ("Result: %d\n", $1);}
| start somenumber {printf ("Result: %d\n", $2);}
;
somenumber: STRING INT EXC {$$ = $2 *2;}
| STRING somenumber EXC {$$ = $2 *2;}
;
%%
main(int argc, char ** argv){
yyparse();
}
yyerror(char* s){
fprintf(stderr, "%s\n", s);
}
The tokens can be generated with a flex lexer like so: (quicktest.l)
%{
#include "quicktest.tab.h"
%}
%%
[A-Za-z]+ {return STRING;}
[1-9]+ {yylval.INT_VAL = atoi(yytext); return INT;}
"!" {return EXC;}
. {}
This can be built with the following commands:
bison -d quicktest.y
flex quicktest.l
gcc -o quicktest quicktest.tab.c lex.yy.c -lfl -ggdb
I can now input something like this:
double double 2 ! !
and get the result 8
Now if I want the user to be able to avoid having lots of exclamation marks on one line, like this:
a b c d e f 2 ! ! ! ! ! !
I'd like to be able to allow them to input something like this:
a b c d e f 2 !*6
So I can add a flex expression for such a token that simply extracts the number of exclamations needed:
!\*[1-9]+ {
char *number = malloc(sizeof(char) * (strlen(yytext)-1));
strcpy(number, yytext+2);
yylval.INT_VAL = atoi(number);
free(number);
printf("Multiple exclamations: %d\n", yylval.INT_VAL);
return REPEAT_EXC;
}
But how would I implement the bison side of things?
I can add the token type like so:
%token <INT_VAL> REPEAT_EXC
And then a rule of some kind perhaps?
repeat_exc: REPEAT_EXC {/*expand into n exclamation marks (EXC tokens)*/}
;
Does Bison support this in any way?
If not how should I implement this?
Should I somehow have the lexer return the EXC token n times when it receives the repeat exc expression? (I'd rather avoid this if possible as this requires the flex code to keep record of some kind of state, it could be in the repeat exclamation state or in a normal state. The lexer is then not as simple to maintain.)
That's really not possible in a context-free grammar.
It's not that difficult to do in a traditional lexer, but as you say it requires that the lexer maintain state. An easier approach is to use a push parser, where the parser is called from the lexer rather than the other way around. [Note 1]
The bison manual doesn't explain the API very well; if you declare a pure push parser, the interface you get is:
int yypush_parse(yypstate*, int, const YYSTYPE*);
or, if position-tracking is enabled:
int yypush_parse(yypstate*, int, const YYSTYPE*, YYLTYPE*);
I made fairly minimal changes to your example, in order to show the push_parser interface. First, the parser; the only differences are the %define directives to declare a push parser; the elimination of main (the lexer is now top-level), and the declaration of yyerror with an explicit void return type. [Note 2]
%{
#include <stdio.h>
void yyerror(char* msg);
%}
%define api.pure full
%define api.push-pull push
%union {
int INT_VAL;
}
%token STRING EXC
%token <INT_VAL> INT
%type <INT_VAL> somenumber
%%
start: somenumber {printf ("Result: %d\n", $1);}
| start somenumber {printf ("Result: %d\n", $2);}
;
somenumber: STRING INT EXC {$$ = $2 *2;}
| STRING somenumber EXC {$$ = $2 *2;}
;
%%
void yyerror(char* s){
fprintf(stderr, "%s\n", s);
}
The lexer has some more substantial changes, but I don't think the end result is any harder to read or maintain. It might even be easier.
The macro PARSE sends a token with a specified type tag and value to yypush_parse; the macro PARSE_TOKEN sends a token without a semantic value.
The %option line removes several warnings from the compile step.
The initialization of the parser state was added. (Indented lines after the %% and before any rule are inserted at the top of the lexer function, in this case yylex, so they can be used to declare and initialize local variables.)
The INT rule was changed to allow 10 to be a valid integer.
The !*<int> rule was added.
The <<EOF>> rule was added. (It's pretty well boiler-plate for lexer-driven push-parsing.)
A main function was added, which calls yylex.
(Oh, and I changed a rule to avoid echoing new lines.)
%{
#include "push.tab.h"
#define PARSE(tok,tag,val) do { \
YYSTYPE yylval = {.tag=val}; \
int status = yypush_parse(ps, tok, &yylval); \
if (status != YYPUSH_MORE) return status; \
} while(0)
#define PARSE_TOKEN(tok) do { \
int status = yypush_parse(ps, tok, 0); \
if (status != YYPUSH_MORE) return status; \
} while(0)
%}
%option noyywrap nounput noinput
%%
yypstate *ps = yypstate_new ();
[A-Za-z]+ {PARSE_TOKEN(STRING);}
[1-9][0-9]* {PARSE(INT,INT_VAL,atoi(yytext));}
"!*"[1-9][0-9]* {int r = atoi(yytext+2);
while (r--) PARSE_TOKEN(EXC);
}
"!" {PARSE_TOKEN(EXC);}
.|\n {}
<<EOF>> {int status = yypush_parse(ps, 0, 0);
yypstate_delete(ps);
return status;
}
%%
int main(int argc, char** argv) {
return yylex();
}
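For readers without flex and bison at hand, the control flow can be sketched in plain C. The sketch below is a hand-written stand-in, not bison's generated code: push_token plays the role of yypush_parse for this one grammar, and struct push_state, push_token and scan_and_parse are names invented for the example. The point it illustrates is that once the scanner drives the parser, expanding "!*N" is just a loop that pushes N EXC tokens.

```c
#include <ctype.h>
#include <stdlib.h>

enum token { TOK_STRING, TOK_INT, TOK_EXC };

/* Hand-written stand-in for the push parser's state (assumption). */
struct push_state {
    int depth;        /* STRING tokens still waiting for their EXC */
    int have_value;   /* nonzero once an INT or an inner result is available */
    int value;        /* current intermediate value */
    int last_result;  /* most recent completed top-level result */
};

/* Accept one token; returns 0 on success, -1 on a syntax error. */
int push_token(struct push_state *ps, enum token tok, int val)
{
    switch (tok) {
    case TOK_STRING:
        if (ps->have_value) return -1;       /* EXC expected instead */
        ps->depth++;
        return 0;
    case TOK_INT:
        if (ps->depth == 0 || ps->have_value) return -1;
        ps->value = val;
        ps->have_value = 1;
        return 0;
    case TOK_EXC:
        if (ps->depth == 0 || !ps->have_value) return -1;
        ps->value *= 2;                      /* same action as the grammar */
        if (--ps->depth == 0) {              /* a top-level somenumber is done */
            ps->last_result = ps->value;
            ps->have_value = 0;
        }
        return 0;
    }
    return -1;
}

/* Scanner that drives the parser; "!*N" is expanded into N EXC tokens. */
int scan_and_parse(const char *s, struct push_state *ps)
{
    while (*s) {
        if (isalpha((unsigned char)*s)) {
            while (isalpha((unsigned char)*s)) s++;
            if (push_token(ps, TOK_STRING, 0) != 0) return -1;
        } else if (isdigit((unsigned char)*s)) {
            char *end;
            long v = strtol(s, &end, 10);
            s = end;
            if (push_token(ps, TOK_INT, (int)v) != 0) return -1;
        } else if (s[0] == '!' && s[1] == '*' && isdigit((unsigned char)s[2])) {
            char *end;
            long n = strtol(s + 2, &end, 10);
            s = end;
            while (n-- > 0)
                if (push_token(ps, TOK_EXC, 0) != 0) return -1;
        } else if (*s == '!') {
            s++;
            if (push_token(ps, TOK_EXC, 0) != 0) return -1;
        } else {
            s++;                             /* ignore everything else */
        }
    }
    return ps->depth == 0 ? 0 : -1;          /* unfinished input is an error */
}
```

With a zero-initialized struct push_state, scanning "a b c d e f 2 !*6" leaves last_result at 128, the same as writing out six separate exclamation marks.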
Notes
1. This is the style of the lemon parser generator. lemon was originally written to create the sqlite SQL parser but is used in various projects precisely for the convenience of the "push" interface. bison's push-parser support is more recent, and very welcome.
2. I'm not crazy about INT_VAL; I prefer lower-case for union tags, but I was trying to minimize the diff.
I have the following grammar and I want to match the string "{name1, name2}". I just want lists of names/integers with at least one element. However, I get these errors:
line 1:6 no viable alternative at character ' '
line 1:11 no viable alternative at character '}'
line 1:7 mismatched input 'name' expecting SIMPLE_VAR_TYPE
I would expect whitespace and the like to be ignored... Interestingly, the error does not occur with the input "{name1,name2}" (no space after ',').
Here's my grammar:
grammar NusmvInput;
options {
language = Java;
}
@header {
package secltlmc.grammar;
}
@lexer::header {
package secltlmc.grammar;
}
specification :
SIMPLE_VAR_TYPE EOF
;
INTEGER
: ('0'..'9')+
;
SIMPLE_VAR_TYPE
: ('{' (NAME | INTEGER) (',' (NAME | INTEGER))* '}' )
;
NAME
: ('A'..'Z' | 'a'..'z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$' | '#' | '-')*
;
WS
: (' ' | '\t' | '\n' | '\r')+ {$channel = HIDDEN;}
;
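And this is my testing code:
package secltlmc;
import java.io.IOException;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import secltlmc.grammar.NusmvInputLexer;
import secltlmc.grammar.NusmvInputParser;
public class Main {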
public static void main(String[] args) throws
IOException, RecognitionException {
CharStream stream = new ANTLRStringStream("{name1, name2}");
NusmvInputLexer lexer = new NusmvInputLexer(stream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
NusmvInputParser parser = new NusmvInputParser(tokenStream);
parser.specification();
}
}
Thanks for your help.
The problem is that you are trying to parse SIMPLE_VAR_TYPE with the lexer, i.e. you are trying to make it a single token. What you actually want is a multi-token production, since you'd like whitespace between the tokens to be redirected to the hidden channel by WS.
You should change SIMPLE_VAR_TYPE from a lexer rule to a parser rule by changing its initial letter (or better yet, the entire name) to lower case.
specification :
simple_var_type EOF
;
simple_var_type
: ('{' (NAME | INTEGER) (',' (NAME | INTEGER))* '}' )
;
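The effect of moving the rule to the parser can be imitated in a few lines of C. This is a sketch of the idea only, not ANTLR output; skip_ws, item and is_simple_var_type are hypothetical helpers. Whitespace is dropped between tokens (as the WS rule does by sending it to the hidden channel), and the list structure is then checked on the remaining tokens.

```c
#include <ctype.h>
#include <stddef.h>

/* Skip the characters that the WS rule would hide from the parser. */
const char *skip_ws(const char *s)
{
    while (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r') s++;
    return s;
}

/* Consume one NAME or INTEGER token; returns NULL if none starts at s. */
const char *item(const char *s)
{
    if (isdigit((unsigned char)*s)) {                 /* INTEGER */
        while (isdigit((unsigned char)*s)) s++;
        return s;
    }
    if (isalpha((unsigned char)*s)) {                 /* NAME */
        s++;
        while (isalnum((unsigned char)*s) || *s == '_' || *s == '$' ||
               *s == '#' || *s == '-')
            s++;
        return s;
    }
    return NULL;
}

/* simple_var_type EOF, with whitespace allowed between any two tokens. */
int is_simple_var_type(const char *s)
{
    s = skip_ws(s);
    if (*s != '{') return 0;
    s = skip_ws(s + 1);
    if ((s = item(s)) == NULL) return 0;
    s = skip_ws(s);
    while (*s == ',') {
        s = skip_ws(s + 1);
        if ((s = item(s)) == NULL) return 0;
        s = skip_ws(s);
    }
    if (*s != '}') return 0;
    s = skip_ws(s + 1);
    return *s == '\0';
}
```

Both "{name1, name2}" and "{name1,name2}" are accepted here, whereas a single-token lexer rule only accepts the second form.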
The definition of SIMPLE_VAR_TYPE specifies the following expression:
Open {
followed by one of NAME or INTEGER
followed by zero or more of:
comma (,) followed by one of NAME or INTEGER
followed by closing }
Nowhere does it allow whitespace in the input (neither NAME nor INTEGER allows it either), so you get an error when you supply some.
Try:
SIMPLE_VAR_TYPE
: ('{' (NAME | INTEGER) (WS* ',' WS* (NAME | INTEGER))* '}' )
;
My fsyacc code is giving a compiler error saying a variable is not found, but I'm not sure why. I was hoping someone could point out the issue.
%{
open Ast
%}
// The start token becomes a parser function in the compiled code:
%start start
// These are the terminal tokens of the grammar along with the types of
// the data carried by each token:
%token NAME
%token ARROW TICK VOID
%token LPAREN RPAREN
%token EOF
// This is the type of the data produced by a successful reduction of the 'start'
// symbol:
%type < Query > start
%%
// These are the rules of the grammar along with the F# code of the
// actions executed as rules are reduced. In this case the actions
// produce data using F# data construction terms.
start: Query { Terms($1) }
Query:
| Term EOF { $1 }
Term:
| VOID { Void }
| NAME { Conc($1) }
| TICK NAME { Abst($2) }
| LPAREN Term RPAREN { Lmda($2) }
| Term ARROW Term { TermList($1, $3) }
The line | NAME {Conc($1)} and the following line both give this error:
error FS0039: The value or constructor '_1' is not defined
I understand the syntactic issue, but what's wrong with the yacc input?
If it helps, here is the Ast definition:
namespace Ast
open System
type Query =
| Terms of Term
and Term =
| Void
| Conc of String
| Abst of String
| Lmda of Term
| TermList of Term * Term
And the fslex input:
{
module Lexer
open System
open Parser
open Microsoft.FSharp.Text.Lexing
let lexeme lexbuf =
LexBuffer<char>.LexemeString lexbuf
}
// These are some regular expression definitions
let name = ['a'-'z' 'A'-'Z' '0'-'9']
let whitespace = [' ' '\t' ]
let newline = ('\n' | '\r' '\n')
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { tokenize lexbuf }
// Operators
| "->" { ARROW }
| "'" { TICK }
| "void" { VOID }
// Misc
| "(" { LPAREN }
| ")" { RPAREN }
// Names
| name+ { NAME }
// EOF
| eof { EOF }
This is not FsYacc's fault. NAME is declared without a type, so it is a valueless token and $1 has nothing to refer to in the rules that read it.
You'd want to do these fixes:
%token NAME
to
%token <string> NAME
and
| name+ { NAME }
to
| name+ { NAME (lexeme lexbuf) }
Everything should now compile.