Here is some demo code:
label:
var id
let id = 10
goto label
If keywords were allowed as identifiers, it would be:
let:
var var
let var = 10
goto let
This is totally legal code, but it seems very hard to do in ANTLR.
AFAIK, once ANTLR matches a let token, it will never fall back to an id token, so ANTLR will see:
LET_TOKEN :
VAR_TOKEN <missing ID_TOKEN>VAR_TOKEN
LET_TOKEN <missing ID_TOKEN>VAR_TOKEN = 10
Although ANTLR allows predicates, I have to control every token match, which is problematic. The grammar becomes:
grammar Demo;
options {
language = Go;
}
@parser::members {
    var _need = map[string]bool{}

    func skip(name string, v bool) {
        _need[name] = !v
        fmt.Println("SKIP", name, v)
    }

    func need(name string) bool {
        fmt.Println("NEED", name, _need[name])
        return _need[name]
    }
}
proj @init {skip("inst", false)}: (line? NL)* EOF;
line
: VAR ID
| LET ID EQ? Integer
;
NL: '\n';
VAR: {need("inst")}? 'var' {skip("inst",true)};
LET: {need("inst")}? 'let' {skip("inst",true)};
EQ: '=';
ID: ([a-zA-Z] [a-zA-Z0-9]*);
Integer: [0-9]+;
WS: [ \t] -> skip;
This looks terrible.
But this is easy in a PEG; you can test this in PEG.js:
Expression = (Line? _ '\n')* ;
Line
= 'var' _ ID
/ 'let' _ ID _ "=" _ Integer
Integer "integer"
= [0-9]+ { return parseInt(text(), 10); }
ID = [a-zA-Z] [a-zA-Z0-9]*
_ "whitespace"
= [ \t]*
I have actually done this in peggo and JavaCC.
My question is how to handle these grammars in ANTLR 4.6. I was so excited about the ANTLR 4.6 Go target, but it seems I chose the wrong tool for my grammar?
The simplest way is to define a parser rule for identifiers:
id: ID | VAR | LET;
VAR: 'var';
LET: 'let';
ID: [a-zA-Z] [a-zA-Z0-9]*;
And then use id instead of ID in your parser rules.
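Applied to the grammar from the question, that first approach might look like this (a minimal sketch reusing the question's rule names; the stateful predicates are no longer needed):
grammar Demo;

proj: (line? NL)* EOF;

line
    : VAR id
    | LET id EQ? Integer
    ;

// keywords are accepted wherever an identifier is expected
id: ID | VAR | LET;

NL: '\n';
VAR: 'var';
LET: 'let';
EQ: '=';
ID: [a-zA-Z] [a-zA-Z0-9]*;
Integer: [0-9]+;
WS: [ \t] -> skip;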
A different way is to use ID for identifiers and keywords, and use predicates for disambiguation. But it's less readable, so I'd use the first way instead.
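For completeness, here is a rough, untested sketch of that second approach with the Java target (the predicate code would look different for the Go target): the keywords are never separate token types, and predicates in the parser inspect the text of the upcoming ID token:
grammar Demo;

proj: (line? NL)* EOF;

// both alternatives start with ID ID; the predicate checks whether the
// first ID is the keyword 'var' or 'let'
line
    : {_input.LT(1).getText().equals("var")}? ID ID
    | {_input.LT(1).getText().equals("let")}? ID ID EQ? Integer
    ;

NL: '\n';
EQ: '=';
ID: [a-zA-Z] [a-zA-Z0-9]*;
Integer: [0-9]+;
WS: [ \t] -> skip;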
Related
Problem
When I use do, a Lua keyword, as a table's key, it gives the following error:
> table.newKey = { do = 'test' }
stdin:1: unexpected symbol near 'do'
>
I need to use do as a key. What should I do?
sometable.somekey is syntactic sugar for sometable['somekey'];
similarly, { somekey = somevalue } is sugar for { ['somekey'] = somevalue }.
Information like this can be found in this very good resource:
For such needs, there is another, more general, format. In this format, we explicitly write the index to be initialized as an expression, between square brackets:
opnames = {["+"] = "add", ["-"] = "sub",
["*"] = "mul", ["/"] = "div"}
-- Programming in Lua: 3.6 – Table Constructors
Use this syntax:
t = { ['do'] = 'test' }
or t['do'] to get or set a value.
I need to use do as a key. What should I do?
Read the Lua 5.4 Reference Manual and understand that something like t = { a = b} or t.a = b only works if a is a valid Lua identifier.
3.4.9 - Table constructors
The general syntax for constructors is
tableconstructor ::= ‘{’ [fieldlist] ‘}’
fieldlist ::= field {fieldsep field} [fieldsep]
field ::= ‘[’ exp ‘]’ ‘=’ exp | Name ‘=’ exp | exp
fieldsep ::= ‘,’ | ‘;’
A field of the form name = exp is equivalent to ["name"] = exp.
So why does this not work for do?
3.1 - Lexical Conventions
Names (also called identifiers) in Lua can be any string of Latin
letters, Arabic-Indic digits, and underscores, not beginning with a
digit and not being a reserved word. Identifiers are used to name
variables, table fields, and labels.
The following keywords are reserved and cannot be used as names:
and break do else elseif end
false for function goto if in
local nil not or repeat return
then true until while
do is not a name so you need to use the syntax field ::= ‘[’ exp ‘]’ ‘=’ exp
which in your example is table.newKey = { ['do'] = 'test' }
The variables of Velocity have the following notation (see the Velocity User Guide):
The shorthand notation of a variable consists of a leading "$" character followed by a VTL Identifier. A VTL Identifier must start with an alphabetic character (a .. z or A .. Z). The rest of the characters are limited to the following types of characters:
alphabetic (a .. z, A .. Z)
numeric (0 .. 9)
underscore ("_")
I want to use a lexer mode to split the normal text from the variables, so I wrote something like this:
// default mode
DOLLAR : '$' -> pushMode(VARIABLE);
TEXT : ~[$]+? -> skip;
mode VARIABLE:
ID : [a-zA-Z] [a-zA-Z0-9-_]*;
???? : XXX -> popMode; // how can I pop mode to default?
Because the variable notation has no explicit end character, I don't know how to determine where it ends.
Maybe I got it wrong?
You would pop out of that scope like this:
mode VARIABLE;
ID : [a-zA-Z] [a-zA-Z0-9-_]* -> popMode;
Here's a quick demo:
lexer grammar VelocityLexer;
DOLLAR : '$' -> more, pushMode(VARIABLE);
TEXT : ~[$]+ -> skip;
mode VARIABLE;
// the `-` needs to be escaped!
ID : [a-zA-Z] [a-zA-Z0-9\-_]* -> popMode;
Note the more in the DOLLAR rule, which causes the $ to be included in the ID token. If you don't do this, you end up with two tokens ($ and foo) for the input $foo.
Test the grammar with the following Java class:
import org.antlr.v4.runtime.*;

public class Main {
    public static void main(String[] args) {
        VelocityLexer lexer = new VelocityLexer(CharStreams.fromString("<strong>$Mu</strong>$foo..."));
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        tokenStream.fill();
        for (Token t : tokenStream.getTokens()) {
            System.out.printf("%-10s '%s'\n", VelocityLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }
}
which will print:
ID '$Mu'
ID '$foo'
EOF '<EOF>'
However, I think a lexical mode is not a good choice in the case of an ID. Why not simply do:
lexer grammar VelocityLexer;
DOLLAR : '$' [a-zA-Z] [a-zA-Z0-9\-_]*;
TEXT : ~[$]+ -> skip;
?
I'm in the middle of learning how to parse simple programs.
This is my lexer.
{
open Parser
exception SyntaxError of string
}
let white = [' ' '\t']+
let blank = ' '
let identifier = ['a'-'z']
rule token = parse
| white {token lexbuf} (* skip whitespace *)
| '-' { HYPHEN }
| identifier {
let buf = Buffer.create 64 in
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf;
let content = (Buffer.contents buf) in
STRING(content)
}
| _ { raise (SyntaxError "Unknown stuff here") }
and scan_string buf = parse
| ['a'-'z']+ {
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf
}
| eof { () }
My "ast":
type t =
String of string
| Array of t list
My parser:
%token <string> STRING
%token HYPHEN
%start <Ast.t> yaml
%%
yaml:
| scalar { $1 }
| sequence {$1}
;
sequence:
| sequence_items {
Ast.Array (List.rev $1)
}
;
sequence_items:
(* empty *) { [] }
| sequence_items HYPHEN scalar {
$3::$1
};
scalar:
| STRING { Ast.String $1 }
;
I'm currently at a point where I want to parse either plain 'strings', i.e. some text, or 'arrays' of 'strings', i.e. - item1 - item2.
When I compile the parser with Menhir I get:
Warning: production sequence -> sequence_items is never reduced.
Warning: in total, 1 productions are never reduced.
I'm pretty new to parsing. Why is this never reduced?
You declare that your entry point to the parser is called main
%start <Ast.t> main
But I can't see the main production in your code. Maybe the entry point is supposed to be yaml? If you change that, does the warning persist?
Also, try adding an EOF token to your lexer and to the entry-level production, like this:
parse_yaml: yaml EOF { $1 }
See here for example: https://github.com/Virum/compiler/blob/28e807b842bab5dcf11460c8193dd5b16674951f/grammar.mly#L56
The link to Real World OCaml below also discusses how to use EOF; I think this will solve your problem.
By the way, it's really cool that you are writing a YAML parser in OCaml. If made open source, it will be really useful to the community. Note that YAML is indentation-sensitive, so to parse it with Menhir you will need your lexer to produce some kind of INDENT and DEDENT tokens. Also, YAML is a strict superset of JSON, which means it might (or might not) make sense to start with a JSON subset and then expand it. Real World OCaml shows how to write a JSON parser using Menhir:
https://dev.realworldocaml.org/16-parsing-with-ocamllex-and-menhir.html
I've been stuck on this problem for a while now; I hope you can help. I've got the following (shortened) language grammar:
lexical Id = [a-zA-Z][a-zA-Z]* !>> [a-zA-Z] \ MyKeywords;
lexical Natural = [1-9][0-9]* !>> [0-9];
lexical StringConst = "\"" ![\"]* "\"";
keyword MyKeywords = "value" | "Male" | "Female";
start syntax Program = program: Model* models;
syntax Model = Declaration;
syntax Declaration = decl: "value" Id name ':' Type t "=" Expression v ;
syntax Type = gender: "Gender";
syntax Expression = Terminal;
syntax Terminal = id: Id name
| constructor: Type t '(' {Expression ','}* ')'
| Gender;
syntax Gender = male: "Male"
| female: "Female";
alias ASLId = str;
data TYPE = gender();
public data PROGRAM = program(list[MODEL] models);
data MODEL = decl(ASLId name, TYPE t, EXPR v);
data EXPR = constructor(TYPE t, list[EXPR] args)
| id(ASLId name)
| male()
| female();
Now, I'm trying to parse:
value mannetje : Gender = Male
This parses fine, but fails on implode, unless I remove the id: Id name alternative and its constructor from the grammar. I expected that the \ MyKeywords would prevent this, but unfortunately it doesn't. Can you help me fix this, or point me in the right direction for how to debug it? I'm having some trouble debugging the concrete and abstract syntax.
Thanks!
It does not seem to be parsing at all (I get a ParseError if I try your example).
One of the problems is probably that you don't define any Layout. This causes the ParseError with your example. One of the easiest fixes is to extend the standard layout from lang::std::Layout, which defines all the default whitespace (and comment) characters.
For more information on nonterminals see here.
I took the liberty of simplifying your example a bit further so that parsing and imploding work. I removed some unused nonterminals to keep the parse tree more concise. You probably want more than Declarations in your Program, but I leave that up to you.
extend lang::std::Layout;
lexical Id = ([a-z] !<< [a-z][a-zA-Z]* !>> [a-zA-Z]) \ MyKeywords;
keyword MyKeywords = "value" | "Male" | "Female" | "Gender";
start syntax Program = program: Declaration* decls;
syntax Declaration = decl: "value" Id name ':' Type t "=" Expression v ;
syntax Type = gender: "Gender";
syntax Expression
= id: Id name
| constructor: Type t '(' {Expression ','}* ')'
| Gender
;
syntax Gender
= male: "Male"
| female: "Female"
;
data PROGRAM = program(list[DECL] exprs);
data DECL = decl(str name, TYPE t, EXPR v);
data EXPR = constructor(TYPE t, list[EXPR] args)
| id(str name)
| male()
| female()
;
data TYPE = gender();
Two things:
The names of the ADTs should correspond to the nonterminal names (you use different casing, and EXPR is not Expression). That is the only way implode can know how to do its work. Put the data declarations in their own module and implode as follows: implode(#AST::Program, pt), where pt is the parse tree.
The grammar was ambiguous: the \ MyKeywords only applied to the tail of the identifier syntax. Use the fix: ([a-zA-Z][a-zA-Z]* !>> [a-zA-Z]) \ MyKeywords;.
Here's what worked for me (grammar unchanged except for the fix):
module AST
alias ASLId = str;
data Type = gender();
public data Program = program(list[Model] models);
data Model = decl(ASLId name, Type t, Expression v);
data Expression = constructor(Type t, list[Expression] args)
| id(ASLId name)
| male()
| female();
Given the Lexer
fragment
FRAGID : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;
ID : FRAGID;
NAME: FRAGID ('.' FRAGID)*;
Given the grammar
var_def: type=ID vname=ID ASSIGN expr
-> ^(VARDEF $type $vname expr)
;
with options
options
{
language=CSharp3;
output=AST;
}
and given the code
int i = 0
everything works fine.
However, when I want to allow the use of NAME in assignment (referring to another object)
var_def
: type=(NAME|ID) vname=ID ASSIGN expr
-> ^(VARDEF $type $vname expr)
;
At run time I get a RewriteEmptyStreamException:
Antlr.Runtime.Tree.RewriteEmptyStreamException : token type
at Antlr.Runtime.Tree.RewriteRuleElementStream.NextCore() in c:\dev\stringtemplate_main\antlr\antlr3-main\runtime\CSharp3\Sources\Antlr3.Runtime\Tree\RewriteRuleElementStream.cs: line 200
at Antlr.Runtime.Tree.RewriteRuleTokenStream.NextNode() in c:\dev\stringtemplate_main\antlr\antlr3-main\runtime\CSharp3\Sources\Antlr3.Runtime\Tree\RewriteRuleTokenStream.cs: line 62
Doing some more investigation, with the grammar
var_def
: type=NAME vname=ID ASSIGN expr
-> ^(VARDEF $type $vname expr)
;
I get a
Antlr.Runtime.Tree.RewriteEarlyExitException : Exception of type 'Antlr.Runtime.Tree.RewriteEarlyExitException' was thrown.
The 'i' in:
int i = 0
will always become an ID token. Because both ID and NAME match a single FRAGID, and ID is defined before NAME, a single FRAGID will never be tokenized as NAME; it will always become an ID token.
That is why this won't work:
var_def
: type=NAME vname=ID ASSIGN expr
-> ^(VARDEF $type $vname expr)
;
You must realize that the lexer does not create tokens based on what the parser is trying to match at a given time. The lexer works independently of the parser.
Try to avoid assigning a label to a parenthesized group of tokens/rules. Instead of:
var_def
: type=(NAME|ID) vname=ID ASSIGN expr -> ^(VARDEF $type $vname expr)
;
do this:
var_def
: type ID ASSIGN expr -> ^(VARDEF type ID expr)
;
type
: NAME
| ID
;