In CUP: How to make something optional to parse?

PROC_DECL -> "proc" [ "ret" TYPE ] NAME
"(" [ PARAM_DECL { "," PARAM_DECL } ] ")"
"{" { DECL } { STMT } "}"
This is the grammar for a Procedure declaration.
How do you say that the "ret" TYPE is optional without making multiple cases?

Use another production, say ret_stmt, which is either empty or a single "ret" TYPE clause, so in your .cup file you will have these productions:
ret_stmt ::= // empty
{: /* your action for a missing return type */ :}
// single return type
| "ret":r TYPE:t
{: /* your action for a present return type */ :}
;
PROC_DECL ::= "proc":p ret_stmt:r NAME:n
"(" param_list:pl ")"
"{" decl_list:dl stmt_list:sl "}"
{: /* your action for the procedure declaration */ :}
;
Note that CUP productions are plain BNF with no { ... } repetition syntax, so the { DECL } and { STMT } parts of the original grammar must likewise become list productions (decl_list and stmt_list above).
You can use the same approach for the parameter declarations, adding the production param_list, as sketched below.
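A minimal sketch of param_list in the same style (the comma terminal and the actions are illustrative; adapt them to your token declarations):
param_list ::= // empty
{: /* no parameters */ :}
| PARAM_DECL:pd
{: /* single parameter */ :}
| param_list:pl "," PARAM_DECL:pd
{: /* append a parameter to the list */ :}
;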

Related

Breaking head over how to get position of token with a rule - ANTLR4

I'm writing a little grammar using ANTLR, and I have a rule like this:
operation : OPERATION (IDENT | EXPR) ',' (IDENT | EXPR);
...
OPERATION : 'ADD' | 'SUB' | 'MUL' | 'DIV' ;
IDENT : [a-z]+;
EXPR : INTEGER | FLOAT;
INTEGER : [0-9]+ | '-'[0-9]+ ;
FLOAT : [0-9]+'.'[0-9]+ | '-'[0-9]+'.'[0-9]+ ;
Now, in the listener in Java, how do I determine, when an operation consists of both an IDENT and an EXPR, the order in which they appear?
Obviously the rule can match both
ADD 10, d
or
ADD d, 10
But in the listener for the rule, generated by ANTLR4, if there are both IDENT() and EXPR(), how do I get their order, since I want to assign the left and right operands correctly?
I've been breaking my head over this: is there any simple way, or should I rewrite the rule itself? ctx.getTokens() requires me to give the token type, which kind of defeats the purpose, since I cannot get the sequence of the tokens in the rule if I specify their type.
You can do it like this:
operation : OPERATION lhs=(IDENT | EXPR) ',' rhs=(IDENT | EXPR);
and then inside your listener, do this:
@Override
public void enterOperation(TParser.OperationContext ctx) {
if (ctx.lhs.getType() == TParser.IDENT) {
// left hand side is an identifier
} else {
// left hand side is an expression
}
// check `rhs` the same way
}
where TParser comes from the grammar file T.g4. Change this accordingly.
Another solution would be something like this:
operation
: OPERATION ident_or_expr ',' ident_or_expr
;
ident_or_expr
: IDENT
| EXPR
;
and then in your listener:
@Override
public void enterOperation(TParser.OperationContext ctx) {
Double lhs = findValueFor(ctx.ident_or_expr().get(0));
Double rhs = findValueFor(ctx.ident_or_expr().get(1));
...
}
private Double findValueFor(TParser.Ident_or_exprContext ctx) {
if (ctx.IDENT() != null) {
// it's an identifier: resolve its value here
return null; // placeholder
} else {
// it's an expression: evaluate it here
return null; // placeholder
}
}

How to implement a BNF grammar tree for parsing input in GO?

The grammar for the type language is as follows:
TYPE ::= TYPEVAR | PRIMITIVE_TYPE | FUNCTYPE | LISTTYPE;
PRIMITIVE_TYPE ::= 'int' | 'float' | 'long' | 'string';
TYPEVAR ::= '`' VARNAME; // Note, the character is a backwards apostrophe!
VARNAME ::= [a-zA-Z][a-zA-Z0-9]*; // Initial letter, then can have numbers
FUNCTYPE ::= '(' ARGLIST ')' -> TYPE | '(' ')' -> TYPE;
ARGLIST ::= TYPE ',' ARGLIST | TYPE;
LISTTYPE ::= '[' TYPE ']';
My input is a TYPE.
For example, the input (int,int)->float is valid, while ( [int] , int) is not a valid type.
I need to parse input from the keyboard and decide whether it's valid under this grammar (for later type inference). However, I don't know how to build this grammar in Go or how to parse the input byte by byte. Is there any hint or a similar implementation? That would be really helpful.
For your purposes, the grammar of types looks simple enough that you should be able to write a recursive descent parser that roughly matches the shape of your grammar.
As a concrete example, let's say that we're recognizing a similar language.
TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'
ARGLIST ::= TYPE ARGLIST | TYPE
Not quite exactly the same as your original problem, but you should be able to see the similarities.
A recursive descent parser consists of functions for each production rule.
func ParseType(???) error {
???
}
func ParsePrimitiveType(???) error {
???
}
func ParseTupleType(???) error {
???
}
func ParseArgList(???) error {
???
}
where we'll write ??? for the parts we don't yet know how to fill in. For now, we'll at least say that we get an error if we can't parse.
The input into each of the functions is some stream of tokens. In our case, those tokens consist of sequences of:
"int"
"("
")"
and we can imagine a Stream might be something that satisfies:
type Stream interface {
Peek() string // peek at next token, stay where we are
Next() string // pick next token, move forward
}
to let us walk sequentially through the token stream.
A lexer is responsible for taking something like a string or io.Reader and producing this stream of string tokens. Lexers are fairly easy to write: you can imagine just using regexps or something similar to break a string into tokens.
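For instance, here is a minimal sketch of such a lexer for this toy token set (the Lex name and the silent skipping of unrecognized bytes are illustrative only; TokenSlice is the stream implementation from the complete program below, and it assumes strings is imported):
func Lex(input string) TokenSlice {
	var tokens TokenSlice
	for input != "" {
		switch {
		case input[0] == ' ':
			input = input[1:] // skip spaces
		case input[0] == '(' || input[0] == ')':
			tokens = append(tokens, string(input[0]))
			input = input[1:]
		case strings.HasPrefix(input, "int"):
			tokens = append(tokens, "int")
			input = input[len("int"):]
		default:
			input = input[1:] // a real lexer would report an error here
		}
	}
	return tokens
}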
Assuming we have a token stream, the parser just needs to deal with that stream and a very limited set of possibilities. As mentioned before, each production rule corresponds to a parsing function. Within a production rule, each alternative is a conditional branch. If the grammar is particularly simple (as yours is!), we can figure out which conditional branch to take.
For example, let's look at TYPE and its corresponding ParseType function:
TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'
How might this correspond to the definition of ParseType?
The production says that there are two possibilities: it can either be (1) primitive, or (2) tuple. We can peek at the token stream: if we see "int", then we know it's primitive. If we see a "(", then since the only possibility is that it's tuple type, we can call the tupletype parser function and let it do the dirty work.
It's important to note: if we see neither a "(" nor an "int", then something has gone horribly wrong! We know this just from looking at the grammar: every type must start (FIRST) with one of those two tokens.
Ok, let's write the code.
func ParseType(s Stream) error {
peeked := s.Peek()
if peeked == "int" {
return ParsePrimitiveType(s)
}
if peeked == "(" {
return ParseTupleType(s)
}
return fmt.Errorf("ParseType on %#v", peeked)
}
Parsing PRIMITIVETYPE and TUPLETYPE is equally direct.
func ParsePrimitiveType(s Stream) error {
next := s.Next()
if next == "int" {
return nil
}
return fmt.Errorf("ParsePrimitiveType on %#v", next)
}
func ParseTupleType(s Stream) error {
lparen := s.Next()
if lparen != "(" {
return fmt.Errorf("ParseTupleType on %#v", lparen)
}
err := ParseArgList(s)
if err != nil {
return err
}
rparen := s.Next()
if rparen != ")" {
return fmt.Errorf("ParseTupleType on %#v", rparen)
}
return nil
}
The only one that might cause some issues is the parser for argument lists. Let's look at the rule.
ARGLIST ::= TYPE ARGLIST | TYPE
If we try to write the function ParseArgList, we might get stuck because we don't yet know which choice to make. Do we go for the first, or the second choice?
Well, let's at least parse out the part that's common to both alternatives: the TYPE part.
func ParseArgList(s Stream) error {
err := ParseType(s)
if err != nil {
return err
}
/// ... FILL ME IN. Do we call ParseArgList() again, or stop?
}
So we've parsed the prefix. If it was the second case, we're done. But what if it were the first case? Then we'd still have to read additional lists of types.
Ah, but if we are continuing to read additional types, then the stream must start with another type. And we know that all types start (FIRST) with either "int" or "(". So we can peek at the stream. The decision between the first and the second choice hinges on just this!
func ParseArgList(s Stream) error {
err := ParseType(s)
if err != nil {
return err
}
peeked := s.Peek()
if peeked == "int" || peeked == "(" {
// alternative 1
return ParseArgList(s)
}
// alternative 2
return nil
}
Believe it or not, that's pretty much all we need. Here is working code.
package main
import "fmt"
type Stream interface {
Peek() string
Next() string
}
type TokenSlice []string
func (s *TokenSlice) Peek() string {
return (*s)[0]
}
func (s *TokenSlice) Next() string {
result := (*s)[0]
*s = (*s)[1:]
return result
}
func ParseType(s Stream) error {
peeked := s.Peek()
if peeked == "int" {
return ParsePrimitiveType(s)
}
if peeked == "(" {
return ParseTupleType(s)
}
return fmt.Errorf("ParseType on %#v", peeked)
}
func ParsePrimitiveType(s Stream) error {
next := s.Next()
if next == "int" {
return nil
}
return fmt.Errorf("ParsePrimitiveType on %#v", next)
}
func ParseTupleType(s Stream) error {
lparen := s.Next()
if lparen != "(" {
return fmt.Errorf("ParseTupleType on %#v", lparen)
}
err := ParseArgList(s)
if err != nil {
return err
}
rparen := s.Next()
if rparen != ")" {
return fmt.Errorf("ParseTupleType on %#v", rparen)
}
return nil
}
func ParseArgList(s Stream) error {
err := ParseType(s)
if err != nil {
return err
}
peeked := s.Peek()
if peeked == "int" || peeked == "(" {
// alternative 1
return ParseArgList(s)
}
// alternative 2
return nil
}
func main() {
fmt.Println(ParseType(&TokenSlice{"int"}))
fmt.Println(ParseType(&TokenSlice{"(", "int", ")"}))
fmt.Println(ParseType(&TokenSlice{"(", "int", "int", ")"}))
fmt.Println(ParseType(&TokenSlice{"(", "(", "int", ")", "(", "int", ")", ")"}))
// Should show error:
fmt.Println(ParseType(&TokenSlice{"(", ")"}))
}
This is a toy parser, of course: it does not handle certain kinds of errors very well (such as premature end of input), and tokens should carry not only their textual content but also their source location for good error reporting. For your own purposes, you'll also want to expand the parsers so that they return not just an error but also some kind of useful result from the parse.
This answer is just a sketch of how recursive descent parsers work, but you should really read a good compiler book for the details, because you need them. The Dragon Book, for example, spends at least a good chapter on how to write recursive descent parsers, with plenty of the technical details. In particular, you want to know about the concept of FIRST sets (which I hinted at), because you'll need to understand them to choose which alternative is appropriate when writing each of your parser functions.
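As a concrete illustration, the FIRST sets for the toy grammar above (the tokens a parse of each nonterminal can begin with) are:
FIRST(PRIMITIVETYPE) = { "int" }
FIRST(TUPLETYPE) = { "(" }
FIRST(TYPE) = { "int", "(" }
FIRST(ARGLIST) = { "int", "(" }
These are exactly the tokens that ParseType and ParseArgList peek for.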

semicolon as delimiter in custom grammar parsed by flex/bison

I'm trying to write a simple parser for a meta-programming language.
Everything works fine, but I want to use ';' as the statement delimiter, rather than using newline or omitting the semicolon entirely.
So this is the expected behaviour:
// good code
v1 = v2;
v3 = 23;
should parse without errors
But:
// bad code
v1 = v2
v3 = 23;
should fail
Yet if I remove the 'empty' rule from separator, both code samples fail like this:
ID to ID
Error detected in parsing: syntax error, unexpected ID, expecting SEMICOLON
;
If I leave the 'empty' rule active, then both codes are accepted, which is not desired.
ID to ID // should raise error
ID to NUM;
Any help is welcome here, as most tutorials do not cover delimiters at all.
Here is a simplified version of my parser/lexer:
parser.l:
%{
#include "parser.tab.h"
#include<stdio.h>
%}
num [0-9]
alpha [a-zA-Z_]
alphanum [a-zA-Z_0-9]
comment "//"[^\n]*"\n"
string \"[^\"]*\"
whitespace [ \t\n]
%x ML_COMMENT
%%
<INITIAL>"/*" {BEGIN(ML_COMMENT); printf("/*");}
<ML_COMMENT>"*/" {BEGIN(INITIAL); printf("*/");}
<ML_COMMENT>. { }
<ML_COMMENT>[\n]+ { printf("\n"); }
{comment}+ {printf("%s",yytext);}
{alpha}{alphanum}+ { yylval.str= strdup(yytext); return ID;}
{num}+ { yylval.str= strdup(yytext); return NUM;}
{string} { yylval.str= strdup(yytext); return STRING;}
';' {return SEMICOLON;}
"=" {return ASSIGNMENT;}
" "+ { }
<<EOF>> {exit(0); /* this is suboptimal */}
%%
parser.y:
%{
#include<stdio.h>
#include<string.h>
%}
%error-verbose
%union{
char *str;
}
%token <str> ID
%token <str> NUM
%token <str> STRING
%left SEMICOLON
%left ASSIGNMENT
%start input
%%
input: /* empty */
| expression separator input
;
expression: assign
| error {}
;
separator: SEMICOLON
| empty
;
empty:
;
assign: ID ASSIGNMENT ID { printf("ID to ID"); }
| ID ASSIGNMENT STRING { printf("ID to STRING"); }
| ID ASSIGNMENT NUM { printf("ID to NUM"); }
;
%%
void yyerror(char* str)
{
printf("Error detected in parsing: %s\n", str);
}
int main()
{
return yyparse();
}
Compiled like this:
$>flex -t parser.l > parser.lex.yy.c
$>bison -v -d parser.y
$>cc parser.tab.c parser.lex.yy.c -lfl -o parser
Never mind... the problematic line was this one:
';' {return SEMICOLON;}
which had to be changed to
";" {return SEMICOLON;}
Now the behaviour is correct. :-) In flex, only double quotes delimit a literal string; single quotes have no special meaning, so ';' is a pattern that matches the three-character sequence ';', and a bare ; in the input was never returned as SEMICOLON. (With the lexer fixed, dropping the empty alternative from separator then makes a missing semicolon a syntax error, as desired.)
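For illustration, the two variants side by side:
";" { return SEMICOLON; /* double quotes: a literal string, matches a single ; */ }
';' { return SEMICOLON; /* unquoted: a pattern matching the three-character sequence ';' */ }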

JavaCC grammar conflict

I have a grammar defined roughly like this.
TOKEN:{
<T_INT: "int"> |
<T_STRING: ["a"-"z"](["a"-"z"])*>
}
SKIP: { " " | "\t" | "\n" | "\r" }
/** Main production. */
SimpleNode Start() : {}
{
(LOOKAHEAD(Declaration()) Declaration() | Function())
{ return jjtThis; }
}
void Declaration() #Decl: {}
{
<T_INT> <T_STRING> ";"
}
void Function() #Func: {}
{
<T_STRING> "();"
}
This works fine for stuff like:
int a;
foo();
But when I try int();, which is legal for me and should be parsed by Function(), it goes for the Declaration instead. How do I fix this "conflict"? I tried various combinations.
The JavaCC FAQ's section on this is titled "How do I deal with keywords that aren't reserved?".
What I would do is allow the keyword as an alternative to the identifier, i.e.
(<T_STRING> | <T_INT>) "();"
When there are many keywords, it could be beneficial to create an Identifier production that allows them all, along with the general identifier token.
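For example, a minimal sketch (the Identifier production name and the #Ident node are illustrative):
void Identifier() #Ident: {}
{
<T_STRING> | <T_INT>
}
void Function() #Func: {}
{
Identifier() "();"
}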
By the way, you might want "(" ")" ";" instead of "();".

Why is this fsyacc input producing F# that does not compile?

My fsyacc code is giving a compiler error saying a variable is not found, but I'm not sure why. I was hoping someone could point out the issue.
%{
open Ast
%}
// The start token becomes a parser function in the compiled code:
%start start
// These are the terminal tokens of the grammar along with the types of
// the data carried by each token:
%token NAME
%token ARROW TICK VOID
%token LPAREN RPAREN
%token EOF
// This is the type of the data produced by a successful reduction of the 'start'
// symbol:
%type < Query > start
%%
// These are the rules of the grammar along with the F# code of the
// actions executed as rules are reduced. In this case the actions
// produce data using F# data construction terms.
start: Query { Terms($1) }
Query:
| Term EOF { $1 }
Term:
| VOID { Void }
| NAME { Conc($1) }
| TICK NAME { Abst($2) }
| LPAREN Term RPAREN { Lmda($2) }
| Term ARROW Term { TermList($1, $3) }
The line | NAME {Conc($1)} and the following line both give this error:
error FS0039: The value or constructor '_1' is not defined
I understand the syntactic issue, but what's wrong with the yacc input?
If it helps, here is the Ast definition:
namespace Ast
open System
type Query =
| Terms of Term
and Term =
| Void
| Conc of String
| Abst of String
| Lmda of Term
| TermList of Term * Term
And the fslex input:
{
module Lexer
open System
open Parser
open Microsoft.FSharp.Text.Lexing
let lexeme lexbuf =
LexBuffer<char>.LexemeString lexbuf
}
// These are some regular expression definitions
let name = ['a'-'z' 'A'-'Z' '0'-'9']
let whitespace = [' ' '\t' ]
let newline = ('\n' | '\r' '\n')
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { tokenize lexbuf }
// Operators
| "->" { ARROW }
| "'" { TICK }
| "void" { VOID }
// Misc
| "(" { LPAREN }
| ")" { RPAREN }
// Names
| name+ { NAME }
// EOF
| eof { EOF }
This is not FsYacc's fault: NAME is a valueless token. Because %token NAME declares no type, the token carries no semantic value, so the $1 in the action is translated into a binding (_1) that is never defined, which is exactly the FS0039 error above.
You'd want to do these fixes:
%token NAME
to
%token <string> NAME
and
| name+ { NAME }
to
| name+ { NAME (lexeme lexbuf) }
Everything should now compile: with %token <string> NAME, the $1 in Conc($1) is a string, and the lexer supplies that string via NAME (lexeme lexbuf).
