I'm working on a simple example parser/lexer for a tiny project, but I've run into a problem.
I'm parsing content along these lines:
Name SEP Gender SEP Birthday
Name SEP Gender SEP Birthday
… where SEP is any one (but not multiple!) of |, ,, or whitespace.
Now, I didn't want to lock the field-order in at the lexer order, so I'm trying to lex this with a very simple set of tokens:
%token <string> SEP
%token <string> VAL
%token NL
%token EOF
Now, it's dictated that I produce a parse-error if, for instance, the gender field doesn't contain a small set of per-determined values, say {male,female,neither,unspecified}. I can wrap the parser and deal with this, but I'd really like to encode this requirement into the automaton for future expansion.
My first attempt, looking something like this, failed horribly:
doc:
| EOF { [] }
| it = rev_records { it }
;
rev_records:
| (* base-case: empty *) { [] }
| rest = rev_records; record; NL { record :: rest }
| rest = rev_records; record; EOF { record :: rest }
;
record:
last_name = name_field; SEP; first_name = name_field; SEP;
gender = gender_field; SEP; favourite_colour = colour_field; SEP;
birthday = date_field
{ {last_name; first_name; gender; favourite_colour; birthday} }
name_field: str = VAL { str }
gender_field:
| VAL "male" { Person.Male }
| VAL "female" { Person.Female }
| VAL "neither" { Person.Neither }
| VAL "unspecified" { Person.Unspecified }
;
Yeah, no dice. Obviously, my attempt at an unstructured-lexing is already going poorly.
What's the idiomatic way to parse something like this?
Parsers, such as Menhir and OCamlYacc, operate on tokens, not on strings or characters. The transformation from characters to tokens is made on the lexer level. That's why you can't specify a string in the production rule.
You can, of course, perform any check in the semantic action and raise an exception, e.g.,
record:
last_name = name_field; SEP; first_name = name_field; SEP;
gender_val = VAL; SEP; favourite_colour = colour_field; SEP;
birthday = date_field
{
let gender = match gender_val with
| "male" -> Person.Male
| "female" -> Person.Female
| "neither" -> Person.Neither
| "unspecified" -> Person.Unspecified
| _ -> failwith "Parser error: invalid value in the gender field" in
{last_name; first_name; gender; favourite_colour; birthday}
}
You can also tokenize possible gender or you can use regular expressions on the lexer level to prevent invalid fields, e.g.,
rule token = parser
| "male" | "female" | "neither" | "unspecified" as -> {GENDER s}
...
However, this is not recommended, as it will, in fact, turn male, female, etc into keywords, so their occurrences in other places will break your grammar.
Related
I have a xtext grammar which consists of one declaration per line. When I format the code, all the declarations end up in the same line, the line breaks are removed.
As I didn't manage to change the grammar to require line breaks, I would like to disable the removal of line breaks. How do I do that? Bonus points if someone can tell me how to require line breaks at the end of each declaration.
Part of the Grammar:
grammar com.example.Msg with org.eclipse.xtext.common.Terminals
hidden(WS, SL_COMMENT)
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate msg_idl "http://www.example.com/ex/ample/msg"
Model:
MsgDef
;
MsgDef:
(definitions+=definition)+
;
definition:
type=fieldType ' '+ name=ValidID (' '* '=' ' '* const=Value)?
;
fieldType:
value = ( builtinType | header)
;
builtinType:
BOOL = "bool"
| INT32 = "int32"
| CHAR = "char"
;
header:
value="Header"
;
Bool_l:
target=BOOL_E
;
String_l:
target = ('""'|STRING)
;
Number_l:
Double_l | Integer_l | NegInteger_l
;
NegInteger_l:
target=NEG_INT
;
Integer_l :
target=INT
;
Double_l:
target=DOUBLE
;
terminal NEG_INT returns ecore::EInt:
'-' INT
;
terminal DOUBLE returns ecore::EDouble :
('-')? ('0'..'9')* ('.' INT) |
('-')? INT ('.') |
('-')? INT ('.' ('0'..'9')*)? (('e'|'E')('-'|'+')? INT )|
'nan' | 'inf' | '-inf'
;
enum BOOL_E :
true | false
;
ValidID:
"bool"
| "string"
| "time"
| "duration"
| "char"
| ID ;
Value:
String_l | Number_l
;
terminal SL_COMMENT :
' '* '#' !('\n'|'\r')* ('\r'? '\n')?
;
Example data
string left
string top
string right
string bottom
I already tried:
class MsgFormatter extends AbstractDeclarativeFormatter {
extension MsgGrammarAccess msgGrammarAccess = grammarAccess as MsgGrammarAccess
override protected void configureFormatting(FormattingConfig c) {
c.setLinewrap(0, 1, 2).before(SL_COMMENTRule)
c.setLinewrap(0, 1, 2).before(ML_COMMENTRule)
c.setLinewrap(0, 1, 1).after(ML_COMMENTRule)
c.setLinewrap().before(definitionRule); // does not work
c.setLinewrap(1,1,2).before(definitionRule); // does not work
c.setLinewrap().before(fieldTypeRule); // does not work
}
}
In general it is a bad idea to encode whitespace into the language itself. Most of the time it is better to write the language in a way that you can use all kinds of whitespaces (blanks, tabs, newlines ...) to separate tokens.
You should implement a custom formatter for your language that inserts the line breaks after each statement. Xtext comes with two formatter APIs (an old one and a new one starting with Xtext 2.8). I propose to use the new one.
Here you extend AbstractFormatter2 and implement the format methods.
You can find a bit information in the online manual: https://www.eclipse.org/Xtext/documentation/303_runtime_concepts.html#formatting
Some more explanation in the folowing blog post: https://blogs.itemis.com/en/tabular-formatting-with-the-new-formatter-api
Some technical background: https://de.slideshare.net/meysholdt/xtexts-new-formatter-api
I'm in the middle of learning how to parse simple programs.
This is my lexer.
{
open Parser
exception SyntaxError of string
}
let white = [' ' '\t']+
let blank = ' '
let identifier = ['a'-'z']
rule token = parse
| white {token lexbuf} (* skip whitespace *)
| '-' { HYPHEN }
| identifier {
let buf = Buffer.create 64 in
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf;
let content = (Buffer.contents buf) in
STRING(content)
}
| _ { raise (SyntaxError "Unknown stuff here") }
and scan_string buf = parse
| ['a'-'z']+ {
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf
}
| eof { () }
My "ast":
type t =
String of string
| Array of t list
My parser:
%token <string> STRING
%token HYPHEN
%start <Ast.t> yaml
%%
yaml:
| scalar { $1 }
| sequence {$1}
;
sequence:
| sequence_items {
Ast.Array (List.rev $1)
}
;
sequence_items:
(* empty *) { [] }
| sequence_items HYPHEN scalar {
$3::$1
};
scalar:
| STRING { Ast.String $1 }
;
I'm currently at a point where I want to either parse plain 'strings', i.e.
some text or 'arrays' of 'strings', i.e. - item1 - item2.
When I compile the parser with Menhir I get:
Warning: production sequence -> sequence_items is never reduced.
Warning: in total, 1 productions are never reduced.
I'm pretty new to parsing. Why is this never reduced?
You declare that your entry point to the parser is called main
%start <Ast.t> main
But I can't see the main production in your code. Maybe the entry point is supposed to be yaml? If that is changed—does the error still persists?
Also, try adding EOF token to your lexer and to entry-level production, like this:
parse_yaml: yaml EOF { $1 }
See here for example: https://github.com/Virum/compiler/blob/28e807b842bab5dcf11460c8193dd5b16674951f/grammar.mly#L56
The link to Real World OCaml below also discusses how to use EOL—I think this will solve your problem.
By the way, really cool that you are writing a YAML parser in OCaml. If made open-source it will be really useful to the community. Note that YAML is indentation-sensitive, so to parse it with Menhir you will need to produce some kind of INDENT and DEDENT tokens by your lexer. Also, YAML is a strict superset of JSON, that means it might (or might not) make sense to start with a JSON subset and then expand it. Real World OCaml shows how to write a JSON parser using Menhir:
https://dev.realworldocaml.org/16-parsing-with-ocamllex-and-menhir.html
I'm stuck on this problem for a while now, hope you can help. I've got the following (shortened) language grammar:
lexical Id = [a-zA-Z][a-zA-Z]* !>> [a-zA-Z] \ MyKeywords;
lexical Natural = [1-9][0-9]* !>> [0-9];
lexical StringConst = "\"" ![\"]* "\"";
keyword MyKeywords = "value" | "Male" | "Female";
start syntax Program = program: Model* models;
syntax Model = Declaration;
syntax Declaration = decl: "value" Id name ':' Type t "=" Expression v ;
syntax Type = gender: "Gender";
syntax Expression = Terminal;
syntax Terminal = id: Id name
| constructor: Type t '(' {Expression ','}* ')'
| Gender;
syntax Gender = male: "Male"
| female: "Female";
alias ASLId = str;
data TYPE = gender();
public data PROGRAM = program(list[MODEL] models);
data MODEL = decl(ASLId name, TYPE t, EXPR v);
data EXPR = constructor(TYPE t, list[EXPR] args)
| id(ASLId name)
| male()
| female();
Now, I'm trying to parse:
value mannetje : Gender = Male
This parses fine, but fails on implode, unless I remove the id: Id name and it's constructor from the grammar. I expected that the /MyKeywords would prevent this, but unfortunately it doesn't. Can you help me fix this, or point me in the right direction to how to debug? I'm having some trouble with debugging the Concrete and Abstract syntax.
Thanks!
It does not seem to be parsing at all (I get a ParseError if I try your example).
One of the problems is probably that you don't define Layout. This causes the ParseError with you given example. One of the easiest fixes is to extend the standard Layout in lang::std::Layout. This layout defines all the default white spaces (and comment) characters.
For more information on nonterminals see here.
I took the liberty in simplifying your example a bit further so that parsing and imploding works. I removed some unused nonterminals to keep the parse tree more concise. You probably want more that Declarations in your Program but I leave that up to you.
extend lang::std::Layout;
lexical Id = ([a-z] !<< [a-z][a-zA-Z]* !>> [a-zA-Z]) \ MyKeywords;
keyword MyKeywords = "value" | "Male" | "Female" | "Gender";
start syntax Program = program: Declaration* decls;
syntax Declaration = decl: "value" Id name ':' Type t "=" Expression v ;
syntax Type = gender: "Gender";
syntax Expression
= id: Id name
| constructor: Type t '(' {Expression ','}* ')'
| Gender
;
syntax Gender
= male: "Male"
| female: "Female"
;
data PROGRAM = program(list[DECL] exprs);
data DECL = decl(str name, TYPE t, EXPR v);
data EXPR = constructor(TYPE t, list[EXPR] args)
| id(str name)
| male()
| female()
;
data TYPE = gender();
Two things:
The names of the ADTs should correspond to the nonterminal names (you have difference cases and EXPR is not Expression). That is the only way implode can now how to do its work. Put the data decls in their own module and implode as follows: implode(#AST::Program, pt) where pt is the parse tree.
The grammar was ambiguous: the \ MyKeywords only applied to the tail of the identifier syntax. Use the fix: ([a-zA-Z][a-zA-Z]* !>> [a-zA-Z]) \ MyKeywords;.
Here's what worked for me (grammar unchanged except for the fix):
module AST
alias ASLId = str;
data Type = gender();
public data Program = program(list[Model] models);
data Model = decl(ASLId name, Type t, Expression v);
data Expression = constructor(Type t, list[Expression] args)
| id(ASLId name)
| male()
| female();
I would like to parse a set of expressions, for instance:X[3], X[-3], XY[-2], X[4]Y[2], etc.
In my parser.mly, index (which is inside []) is defined as follows:
index:
| INTEGER { $1 }
| MINUS INTEGER { 0 - $2 }
The token INTEGER, MINUS etc. are defined in lexer as normal.
I try to parse an example, it fails. However, if I comment | MINUS INTEGER { 0 - $2 }, it works well. So the problem is certainly related to that. To debug, I want to get more information, in other words I want to know what is considered to be MINUS INTEGER. I tried to add print:
index:
| INTEGER { $1 }
| MINUS INTEGER { Printf.printf "%n" $2; 0 - $2 }
But nothing is printed while parsing.
Could anyone tell me how to print information or debug that?
I tried coming up with an example of what you describe and was able to get output of 8 with what I show below. [This example is completely stripped down so that it only works for [1] and [- 1 ], but I believe it's equivalent logically to what you said you did.]
However, I also notice that your example's debug string in your example does not have an explicit flush with %! at the end, so that the debugging output might not be flushed to the terminal until later than you expect.
Here's what I used:
Test.mll:
{
open Ytest
open Lexing
}
rule test =
parse
"-" { MINUS }
| "1" { ONE 1 }
| "[" { LB }
| "]" { RB }
| [ ' ' '\t' '\r' '\n' ] { test lexbuf }
| eof { EOFTOKEN }
Ytest.mly:
%{
%}
%token <int> ONE
%token MINUS LB RB EOFTOKEN
%start item
%type <int> index item
%%
index:
ONE { 2 }
| MINUS ONE { Printf.printf "%n" 8; $2 }
item : LB index RB EOFTOKEN { $2 }
Parse.ml
open Test;;
open Ytest;;
open Lexing;;
let lexbuf = Lexing.from_channel stdin in
ignore (Ytest.item Test.test lexbuf)
My fsyacc code is giving a compiler error saying a variable is not found, but I'm not sure why. I was hoping someone could point out the issue.
%{
open Ast
%}
// The start token becomes a parser function in the compiled code:
%start start
// These are the terminal tokens of the grammar along with the types of
// the data carried by each token:
%token NAME
%token ARROW TICK VOID
%token LPAREN RPAREN
%token EOF
// This is the type of the data produced by a successful reduction of the 'start'
// symbol:
%type < Query > start
%%
// These are the rules of the grammar along with the F# code of the
// actions executed as rules are reduced. In this case the actions
// produce data using F# data construction terms.
start: Query { Terms($1) }
Query:
| Term EOF { $1 }
Term:
| VOID { Void }
| NAME { Conc($1) }
| TICK NAME { Abst($2) }
| LPAREN Term RPAREN { Lmda($2) }
| Term ARROW Term { TermList($1, $3) }
The line | NAME {Conc($1)} and the following line both give this error:
error FS0039: The value or constructor '_1' is not defined
I understand the syntactic issue, but what's wrong with the yacc input?
If it helps, here is the Ast definition:
namespace Ast
open System
type Query =
| Terms of Term
and Term =
| Void
| Conc of String
| Abst of String
| Lmda of Term
| TermList of Term * Term
And the fslex input:
{
module Lexer
open System
open Parser
open Microsoft.FSharp.Text.Lexing
let lexeme lexbuf =
LexBuffer<char>.LexemeString lexbuf
}
// These are some regular expression definitions
let name = ['a'-'z' 'A'-'Z' '0'-'9']
let whitespace = [' ' '\t' ]
let newline = ('\n' | '\r' '\n')
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { tokenize lexbuf }
// Operators
| "->" { ARROW }
| "'" { TICK }
| "void" { VOID }
// Misc
| "(" { LPAREN }
| ")" { RPAREN }
// Numberic constants
| name+ { NAME }
// EOF
| eof { EOF }
This is not FsYacc's fault. NAME is a valueless token.
You'd want to do these fixes:
%token NAME
to
%token <string> NAME
and
| name+ { NAME }
to
| name+ { NAME (lexeme lexbuf) }
Everything should now compile.