Happy resolution of an error - parsing

In Regular Expressions, I can write:
a(.)*b
And this will match the entire string in, for example
acdabb
I try to simulate this with a token stream in Happy.
t : a wildcard b
wildcard : {- empty -} | wild wildcard
wild : a | b | c | d | whatever
However, the parser generated by Happy does not recognize
acdabb
Is there a way around this/am I doing it wrong?

As you noted Happy uses an LALR(1) parser, which is noted in the documentation. You noted in the comments that changing to right recursion resolves the problem, but for the novice it might not be clear how that can be achieved. To change the recursion the wilcard wild is rewritten as wild wildcard, which results in the following file:
{
module ABCParser (parse) where
}
%tokentype { Char }
%token a { 'a' }
%token b { 'b' }
%token c { 'c' }
%token d { 'd' }
%token whatever { '\n' }
%name parse t
%%
t
: a wildcard b
{ }
wildcard
:
{ }
| wildcard wild
{ }
wild
: a
{ }
| b
{ }
| c
{ }
| d
{ }
| whatever
{ }
Which now generates a working parser.

Related

Breaking head over how to get position of token with a rule - ANTLR4 / grammar

I'm writing a little grammar using ANLTR, and I have a rule like this:
operation : OPERATION (IDENT | EXPR) ',' (IDENT | EXPR);
...
OPERATION : 'ADD' | 'SUB' | 'MUL' | 'DIV' ;
IDENT : [a-z]+;
EXPR : INTEGER | FLOAT;
INTEGER : [0-9]+ | '-'[0-9]+
FLOAT : [0-9]+'.'[0-9]+ | '-'[0-9]+'.'[0-9]+
Now in the listener inside Java, how do I determine in the case of such a scenario where an operation consist of both IDENT and EXPR the order in which they appear?
Obviously the rule can match both
ADD 10, d
or
ADD d, 10
But in the listener for the rule, generated by ANTLR4, if there is both IDENT() and EXPR() how to get their order, since I want to assign the left and right operands correctly.
Been breaking my head over this, is there any simple way or should I rewrite the rule itself? The ctx.getTokens () requires me to give the token type, which kind of defeats the purpose, since I cannot get the sequence of the tokens in the rule, if I specify their type.
You can do it like this:
operation : OPERATION lhs=(IDENT | EXPR) ',' rhs=(IDENT | EXPR);
and then inside your listener, do this:
#Override
public void enterOperation(TParser.OperationContext ctx) {
if (ctx.lhs.getType() == TParser.IDENT) {
// left hand side is an identifier
} else {
// left hand side is an expression
}
// check `rhs` the same way
}
where TParser comes from the grammar file T.g4. Change this accordingly.
Another solution would be something like this:
operation
: OPERATION ident_or_expr ',' ident_or_expr
;
ident_or_expr
: IDENT
| EXPR
;
and then in your listener:
#Override
public void enterOperation(TParser.OperationContext ctx) {
Double lhs = findValueFor(ctx.ident_or_expr().get(0));
Double rhs = findValueFor(ctx.ident_or_expr().get(1));
...
}
private Double findValueFor(TParser.Ident_or_exprContext ctx) {
if (ctx.IDENT() != null) {
// it's an identifier
} else {
// it's an expression
}
}

ocaml menhir parser production is never reduced

I'm in the middle of learning how to parse simple programs.
This is my lexer.
{
open Parser
exception SyntaxError of string
}
let white = [' ' '\t']+
let blank = ' '
let identifier = ['a'-'z']
rule token = parse
| white {token lexbuf} (* skip whitespace *)
| '-' { HYPHEN }
| identifier {
let buf = Buffer.create 64 in
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf;
let content = (Buffer.contents buf) in
STRING(content)
}
| _ { raise (SyntaxError "Unknown stuff here") }
and scan_string buf = parse
| ['a'-'z']+ {
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf
}
| eof { () }
My "ast":
type t =
String of string
| Array of t list
My parser:
%token <string> STRING
%token HYPHEN
%start <Ast.t> yaml
%%
yaml:
| scalar { $1 }
| sequence {$1}
;
sequence:
| sequence_items {
Ast.Array (List.rev $1)
}
;
sequence_items:
(* empty *) { [] }
| sequence_items HYPHEN scalar {
$3::$1
};
scalar:
| STRING { Ast.String $1 }
;
I'm currently at a point where I want to either parse plain 'strings', i.e.
some text or 'arrays' of 'strings', i.e. - item1 - item2.
When I compile the parser with Menhir I get:
Warning: production sequence -> sequence_items is never reduced.
Warning: in total, 1 productions are never reduced.
I'm pretty new to parsing. Why is this never reduced?
You declare that your entry point to the parser is called main
%start <Ast.t> main
But I can't see the main production in your code. Maybe the entry point is supposed to be yaml? If that is changed—does the error still persists?
Also, try adding EOF token to your lexer and to entry-level production, like this:
parse_yaml: yaml EOF { $1 }
See here for example: https://github.com/Virum/compiler/blob/28e807b842bab5dcf11460c8193dd5b16674951f/grammar.mly#L56
The link to Real World OCaml below also discusses how to use EOL—I think this will solve your problem.
By the way, really cool that you are writing a YAML parser in OCaml. If made open-source it will be really useful to the community. Note that YAML is indentation-sensitive, so to parse it with Menhir you will need to produce some kind of INDENT and DEDENT tokens by your lexer. Also, YAML is a strict superset of JSON, that means it might (or might not) make sense to start with a JSON subset and then expand it. Real World OCaml shows how to write a JSON parser using Menhir:
https://dev.realworldocaml.org/16-parsing-with-ocamllex-and-menhir.html

Debug parser by printing useful information

I would like to parse a set of expressions, for instance:X[3], X[-3], XY[-2], X[4]Y[2], etc.
In my parser.mly, index (which is inside []) is defined as follows:
index:
| INTEGER { $1 }
| MINUS INTEGER { 0 - $2 }
The token INTEGER, MINUS etc. are defined in lexer as normal.
I try to parse an example, it fails. However, if I comment | MINUS INTEGER { 0 - $2 }, it works well. So the problem is certainly related to that. To debug, I want to get more information, in other words I want to know what is considered to be MINUS INTEGER. I tried to add print:
index:
| INTEGER { $1 }
| MINUS INTEGER { Printf.printf "%n" $2; 0 - $2 }
But nothing is printed while parsing.
Could anyone tell me how to print information or debug that?
I tried coming up with an example of what you describe and was able to get output of 8 with what I show below. [This example is completely stripped down so that it only works for [1] and [- 1 ], but I believe it's equivalent logically to what you said you did.]
However, I also notice that your example's debug string in your example does not have an explicit flush with %! at the end, so that the debugging output might not be flushed to the terminal until later than you expect.
Here's what I used:
Test.mll:
{
open Ytest
open Lexing
}
rule test =
parse
"-" { MINUS }
| "1" { ONE 1 }
| "[" { LB }
| "]" { RB }
| [ ' ' '\t' '\r' '\n' ] { test lexbuf }
| eof { EOFTOKEN }
Ytest.mly:
%{
%}
%token <int> ONE
%token MINUS LB RB EOFTOKEN
%start item
%type <int> index item
%%
index:
ONE { 2 }
| MINUS ONE { Printf.printf "%n" 8; $2 }
item : LB index RB EOFTOKEN { $2 }
Parse.ml
open Test;;
open Ytest;;
open Lexing;;
let lexbuf = Lexing.from_channel stdin in
ignore (Ytest.item Test.test lexbuf)

JavaCC grammar conflict

I have a grammar defined roughly like this.
TOKEN:{
<T_INT: "int"> |
<T_STRING: ["a"-"z"](["a"-"z"])*>
}
SKIP: { " " | "\t" | "\n" | "\r" }
/** Main production. */
SimpleNode Start() : {}
{
(LOOKAHEAD(Declaration()) Declaration() | Function())
{ return jjtThis; }
}
void Declaration() #Decl: {}
{
<T_INT> <T_STRING> ";"
}
void Function() #Func: {}
{
<T_STRING> "();"
}
This works fine for stuff like:
int a;
foo();
But when I try int();, which is legal for me and should be parsed by the Function(), it goes for the Declaration instead. How do I fix this "conflict"? I tried various combinations.
The JavaCC FAQ's section on this is titled "How do I deal with keywords that aren't reserved?".
What I would do is allowing the keywords alternatively to the identifiers, i.e.
(<T_STRING> | <T_INT>) "();"
When there are many keywords, it could be beneficial to create an Identifier production that allows them all, along with the general identifier token.
By the way, you might want "(" ")" ";" instead of "();".

Why is this fsyacc input producing F# that does not compile?

My fsyacc code is giving a compiler error saying a variable is not found, but I'm not sure why. I was hoping someone could point out the issue.
%{
open Ast
%}
// The start token becomes a parser function in the compiled code:
%start start
// These are the terminal tokens of the grammar along with the types of
// the data carried by each token:
%token NAME
%token ARROW TICK VOID
%token LPAREN RPAREN
%token EOF
// This is the type of the data produced by a successful reduction of the 'start'
// symbol:
%type < Query > start
%%
// These are the rules of the grammar along with the F# code of the
// actions executed as rules are reduced. In this case the actions
// produce data using F# data construction terms.
start: Query { Terms($1) }
Query:
| Term EOF { $1 }
Term:
| VOID { Void }
| NAME { Conc($1) }
| TICK NAME { Abst($2) }
| LPAREN Term RPAREN { Lmda($2) }
| Term ARROW Term { TermList($1, $3) }
The line | NAME {Conc($1)} and the following line both give this error:
error FS0039: The value or constructor '_1' is not defined
I understand the syntactic issue, but what's wrong with the yacc input?
If it helps, here is the Ast definition:
namespace Ast
open System
type Query =
| Terms of Term
and Term =
| Void
| Conc of String
| Abst of String
| Lmda of Term
| TermList of Term * Term
And the fslex input:
{
module Lexer
open System
open Parser
open Microsoft.FSharp.Text.Lexing
let lexeme lexbuf =
LexBuffer<char>.LexemeString lexbuf
}
// These are some regular expression definitions
let name = ['a'-'z' 'A'-'Z' '0'-'9']
let whitespace = [' ' '\t' ]
let newline = ('\n' | '\r' '\n')
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { tokenize lexbuf }
// Operators
| "->" { ARROW }
| "'" { TICK }
| "void" { VOID }
// Misc
| "(" { LPAREN }
| ")" { RPAREN }
// Numberic constants
| name+ { NAME }
// EOF
| eof { EOF }
This is not FsYacc's fault. NAME is a valueless token.
You'd want to do these fixes:
%token NAME
to
%token <string> NAME
and
| name+ { NAME }
to
| name+ { NAME (lexeme lexbuf) }
Everything should now compile.

Resources