I'm learning pest, a parser library written in Rust, and I need to parse a file containing:
{ISomething as Something, IOther as Other}
I wrote these rules:
LBrace = { "{" }
RBrace = { "}" }
Comma = { "," }
As = { "as" }
symbolAliases = { LBrace ~ importAliases ~ ( Comma ~ importAliases )* ~ RBrace }
importAliases = { Identifier ~ (As ~ Identifier)? }
Identifier = { IdentifierStart ~ IdentifierPart* }
IdentifierStart = _{ 'a'..'z' | 'A'..'Z' | "$" | "_" }
IdentifierPart = _{ 'a'..'z' | 'A'..'Z' | '0'..'9' | "$" | "_" }
But the parser throws an error:
thread 'main' panicked at 'unsuccessful parse: Error { variant: ParsingError { positives: [As, RBrace, Comma], negatives: [] }, location: Pos(11), line_col: Pos((1, 12)), path: None, line: "{ISomething as Something, IOther as Other}", continued_line: None }', src\main.rs:18:10
stack backtrace:
Help me figure out what the problem is.
Your problem is that the grammar doesn't account for whitespace. As written, it expects input like {ISomethingasSomething,IOtherasOther}. You can add a whitespace rule and use it in sensible places:
LBrace = { "{" }
RBrace = { "}" }
Comma = { "," }
As = { ws+ ~ "as" ~ ws+ }
// ^^^^^------^^^^^
// an "as" token should probably be surrounded by whitespace to make sense
symbolAliases = { LBrace ~ importAliases ~ ( Comma ~ importAliases )* ~ RBrace }
importAliases = { ws* ~ Identifier ~ (As ~ Identifier)? }
// ^^^^^
Identifier = { IdentifierStart ~ IdentifierPart* }
IdentifierStart = _{ 'a'..'z' | 'A'..'Z' | "$" | "_" }
IdentifierPart = _{ 'a'..'z' | 'A'..'Z' | '0'..'9' | "$" | "_" }
// whitespace rule defined here (probably want to add tabs and so on, too):
ws = _{ " " }
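That's the minimum change needed to make it parse, but you'll probably want to allow whitespace around the braces too and add some more fine-grained rules.
Alternatively, you can have pest insert whitespace implicitly by defining the special WHITESPACE rule, depending on your needs.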
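To see the effect of whitespace handling outside of pest, here is a minimal hand-written sketch in Python. It is not pest and not generated from the grammar above; the function name parse_symbol_aliases and the helper names are made up for illustration. It skips optional whitespace before each token and, like the As rule above, only treats "as" as a keyword when whitespace surrounds it.

```python
import re

IDENT = re.compile(r"[A-Za-z$_][A-Za-z0-9$_]*")
OPTIONAL_WS = re.compile(r"[ \t\r\n]*")
# Mirrors As = { ws+ ~ "as" ~ ws+ }: the keyword needs whitespace on both sides.
AS_KEYWORD = re.compile(r"[ \t\r\n]+as[ \t\r\n]+")


def parse_symbol_aliases(text):
    """Parse '{A as B, C}' into a list of (name, alias_or_None) pairs."""
    pos = 0

    def skip_ws():
        nonlocal pos
        pos = OPTIONAL_WS.match(text, pos).end()

    def expect(literal):
        nonlocal pos
        skip_ws()
        if not text.startswith(literal, pos):
            raise ValueError(f"expected {literal!r} at offset {pos}")
        pos += len(literal)

    def identifier():
        nonlocal pos
        skip_ws()
        m = IDENT.match(text, pos)
        if m is None:
            raise ValueError(f"expected identifier at offset {pos}")
        pos = m.end()
        return m.group()

    def import_alias():
        nonlocal pos
        name = identifier()
        m = AS_KEYWORD.match(text, pos)
        if m is None:
            return (name, None)
        pos = m.end()
        return (name, identifier())

    expect("{")
    aliases = [import_alias()]
    while True:
        skip_ws()
        if text.startswith(",", pos):
            pos += 1
            aliases.append(import_alias())
        else:
            break
    expect("}")
    skip_ws()
    if pos != len(text):
        raise ValueError(f"unexpected trailing input at offset {pos}")
    return aliases
```

Calling parse_symbol_aliases on the input from the question yields the two (name, alias) pairs; removing the skip_ws calls reproduces the same kind of failure at the first space.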
I'm working on a parser for Qt's qmake project files (an open source project).
I'm having trouble describing qmake's variant of the conditional statement, called a "scope" in the documentation.
EBNF (simplified):
ScopeStatement -> Condition ScopeBody
Condition -> Identifier | TestFunctionCall | NotExpr | OrExpr | AndExpr
NotExpr -> "!" Condition
OrExpr -> Condition "|" Condition
AndExpr -> Condition ":" Condition
ScopeBody -> COLON Statement | BLOCK_OPEN Statement:* BLOCK_CLOSE
Statement -> AssignmentStatement
AssignmentStatement -> Identifier EQ String
// There are many others built-in boolean functions
TestFunctionCall -> ("defined" | ...) ARG_LIST_OPEN (String COMMA:?):* ARG_LIST_CLOSE
Identifier -> Letter (Letter | Digit | UNDERSCP):+
String -> (Letter | Digit | UNDERSCP):+
EQ -> "="
COLON -> ":"
COMMA -> ","
ARG_LIST_OPEN -> "("
ARG_LIST_CLOSE -> ")"
BLOCK_OPEN -> "{"
BLOCK_CLOSE -> "}"
UNDERSCP -> "_"
First question: how can I distinguish the AND-operator colon from the colon that terminates the condition? Is it possible?
P.S. My grammar draft (without function call support) doesn't work even for a simple case like
win32:xml: x = y
PEG.JS Code:
Start
= ScopeStatement
// qmake scope statement
ScopeStatement
= BooleanExpression ws* ((":" ws* SingleLineStatement) / ("{" ws* MultiLineStatement ))
SingleLineStatement
= Identifier ws* "=" ws* Identifier lb*
MultiLineStatement
= (SingleLineStatement lb*)+
// qmake condition statement
BooleanExpression
= BooleanOrExpression
BooleanOrExpression
= left:BooleanAndExpression ws* "|" ws* right:BooleanOrExpression { return {type: "OR", left:left, right:right} }
/ BooleanAndExpression
BooleanAndExpression
= left:BooleanNotExpression ws* ":" ws* right:BooleanAndExpression { return {type: "AND", left:left, right:right} }
/ BooleanNotExpression
BooleanNotExpression
= "!" ws* operand:BooleanNotExpression { return {type: "NOT", operand: operand } }
/ BooleanComplexExpression
BooleanComplexExpression
= Identifier
/ "(" logical_or:BooleanOrExpression ")" { return logical_or; }
Identifier
= token:[a-zA-Z0-9_]+ { return token.join(""); }
ws
= [ \t]
lb
= [\r\n]
Thanks!
You need a negative lookahead in BooleanAndExpression, placed before the ":" operator, that rejects a colon followed by a statement. Otherwise the rule keeps greedily consuming additional "and" operands, including the identifier that starts the statement.
Start
= ScopeStatement
// qmake scope statement
ScopeStatement
= bool:BooleanExpression ws* state:Statement { return {bool:bool, state:state} }
Statement
= ":" ws* state:SingleLineStatement { return state }
SingleLineStatement
= left:Identifier ws* "=" ws* right:Identifier lb* { return {type: "ASSIGN", left:left, right:right} }
MultiLineStatement
= (SingleLineStatement lb*)+
// qmake condition statement
BooleanExpression
= BooleanOrExpression
BooleanOrExpression
= left:BooleanAndExpression ws* "|" ws* right:BooleanOrExpression { return {type: "OR", left:left, right:right} }
/ BooleanAndExpression
BooleanAndExpression
= left:BooleanNotExpression ws* !(":" ws* SingleLineStatement) ":" ws* right:BooleanAndExpression { return {type: "AND", left:left, right:right} }
/ BooleanNotExpression
BooleanNotExpression
= "!" ws* operand:BooleanNotExpression { return {type: "NOT", operand: operand } }
/ BooleanComplexExpression
BooleanComplexExpression
= Identifier
/ "(" logical_or:BooleanOrExpression ")" { return logical_or; }
Identifier
= token:[a-zA-Z0-9_]+ { return token.join(""); }
ws
= [ \t]
lb
= [\r\n]
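The effect of the lookahead can also be sketched by hand. The following Python snippet is only an illustration under simplifying assumptions (identifiers and ":" only, no "|", "!" or parentheses, a single-line statement); parse_scope and its helpers are hypothetical names, not part of PEG.js. Before treating ":" as AND, it checks that the colon is not followed by "ident = ident".

```python
import re

IDENT = re.compile(r"[A-Za-z0-9_]+")
# Counterpart of SingleLineStatement: ident = ident, with optional blanks.
ASSIGN = re.compile(r"[ \t]*([A-Za-z0-9_]+)[ \t]*=[ \t]*([A-Za-z0-9_]+)")


def parse_scope(text):
    """Parse 'a:b: x = y' into (condition_tree, (lhs, rhs))."""
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos] in " \t":
            pos += 1

    def parse_and():
        nonlocal pos
        m = IDENT.match(text, pos)
        if m is None:
            raise ValueError(f"expected identifier at offset {pos}")
        left = m.group()
        pos = m.end()
        saved = pos
        skip_ws()
        # Negative lookahead: a ':' that introduces 'ident = ident'
        # terminates the condition instead of acting as AND.
        if (pos < len(text) and text[pos] == ":"
                and ASSIGN.match(text, pos + 1) is None):
            pos += 1
            skip_ws()
            return {"type": "AND", "left": left, "right": parse_and()}
        pos = saved
        return left

    condition = parse_and()
    skip_ws()
    if pos >= len(text) or text[pos] != ":":
        raise ValueError(f"expected ':' at offset {pos}")
    m = ASSIGN.match(text, pos + 1)
    if m is None:
        raise ValueError(f"expected assignment at offset {pos + 1}")
    return condition, (m.group(1), m.group(2))
```

For win32:xml: x = y the first colon is taken as AND because "xml" is not followed by "=", while the second colon ends the condition because "x = y" follows it.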
I have a problem that I'm sure comes from how ANTLR works and from my doing it all wrong, but I have read a lot of docs and tutorials and still don't fully understand it.
My symptom is that my grammar stops working when I add (because I need it) a lexer rule that may match things it should not. It should only be applied in the right context.
I need the ATTR rule because description, and other rules that will follow, must capture a string from the keyword to the end of the line.
This is the conflicting rule:
ATTR
: (~('\r'| '\n'))*
;
It seems it matches anything so it 'eats' the text that should match different tokens. It makes sense, but I need it or I need another solution.
This is my current example input:
; This is a comment
; Comment 2
audit-template this is the id {
description a description may include any char but {line break}
}
For reference this is my current complete grammar:
grammar Grammar;
options {
superClass = AbstractTParser;
}
@header {
package antlrTest;
}
@lexer::header {
package antlrTest;
}
@lexer::members {
private void debug(String str) {
System.err.println("DEBUG(L) " + str);
}
}
@members {
private void debug(String str) {
System.err.println("DEBUG(P) " + str);
}
}
parse
: (template|'**TODO**') EOF { debug ("EOF"); }
;
template : 'audit-template' id=IDENTIFIER OB content=templateContent CB { debug("template id=" + $id.text); }
;
templateContent:
description?
;
description : 'description' ATTR
;
//
COMMENT
: ';' ~( '\r' | '\n' )* {$channel=HIDDEN; debug("COMMENT");}
;
SPACES : ( '\t' | '\f' | ' ' | '\n'| '\r' ) {$channel=HIDDEN;}
;
OB : '{' { debug("OB"); }
;
CB : '}' ('\r'| '\n')+ { debug("CB(lf)"); }
| '}' EOF { debug("CB(eof)"); }
;
IDENTIFIER
: ( 'a'..'z' | 'A'..'Z' | '_' )
( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '.' | ' ' | '\t')*
;
ATTR
: (~('\r'| '\n'))*
;
I am getting this error (the error handling is a bit tweaked):
Line 4 char 0
at antlrTest.AbstractTParser.reportError(AbstractTParser.java:45)
at antlrTest.GrammarParser.parse(GrammarParser.java:106)
at antlrTest.TemplateParser.parseInput(TemplateParser.java:15)
at antlrTest.Main.testFile(Main.java:32)
at antlrTest.Main.main(Main.java:11)
Caused by: NoViableAltException(7@[])
at antlrTest.GrammarParser.parse(GrammarParser.java:71)
... 3 more
Solution? Alternatives? Thank you.
I'm trying to parse some bits and pieces of Verilog - I'm primarily interested in extracting module definitions and instantiations.
In verilog a module is defined like:
module foo ( ... ) endmodule;
And a module is instantiated in one of two different possible ways:
foo fooinst ( ... );
foo #( ...list of params... ) fooinst ( .... );
At this point I'm only interested in finding the name of the defined or instantiated module; 'foo' in both cases above.
Given this menhir grammar (verParser.mly):
%{
type expr = Module of expr
| ModInst of expr
| Ident of string
| Int of int
| Lparen
| Rparen
| Junk
| ExprList of expr list
%}
%token <string> INT
%token <string> IDENT
%token LPAREN RPAREN MODULE TICK OTHER HASH EOF
%start expr2
%type <expr> mod_expr
%type <expr> expr1
%type <expr list> expr2
%%
mod_expr:
| MODULE IDENT LPAREN { Module ( Ident $2) }
| IDENT IDENT LPAREN { ModInst ( Ident $1) }
| IDENT HASH LPAREN { ModInst ( Ident $1) };
junk:
| LPAREN { }
| RPAREN { }
| HASH { }
| INT { };
expr1:
| junk* mod_expr junk* { $2 } ;
expr2:
| expr1* EOF { $1 };
When I try this out in the menhir interpreter, it works fine extracting the module definition:
MODULE IDENT LPAREN
ACCEPT
[expr2:
[list(expr1):
[expr1:
[list(junk):]
[mod_expr: MODULE IDENT LPAREN]
[list(junk):]
]
[list(expr1):]
]
EOF
]
It also works fine for a single module instantiation:
IDENT IDENT LPAREN
ACCEPT
[expr2:
[list(expr1):
[expr1:
[list(junk):]
[mod_expr: IDENT IDENT LPAREN]
[list(junk):]
]
[list(expr1):]
]
EOF
]
But of course, if there is an IDENT that appears prior to any of these it will REJECT:
IDENT MODULE IDENT LPAREN IDENT IDENT LPAREN
REJECT
... and of course there will be identifiers in an actual verilog file prior to these defs.
I'm trying not to have to fully specify a Verilog grammar, instead I want to build the grammar up slowly and incrementally to eventually parse more and more of the language.
If I add IDENT to the junk rule, that fixes the problem above, but then the module instantiation rule doesn't work because now the junk rule is capturing the IDENT.
Is it possible to create a very permissive rule that will bypass stuff I don't want to match, or is it generally required that you must create a complete grammar to actually do something like this?
Is it possible to create a rule that would let me match:
MODULE IDENT LPAREN stuff* RPAREN ENDMODULE
where "stuff*" initially matches everything but RPAREN?
Something like:
stuff:
| !RPAREN { } ;
I've used PEG parsers in the past which would allow constructs like that.
I've decided that PEG is a better fit for a permissive, non-exhaustive grammar. Took a look at peg/leg and was able to very quickly put together a leg grammar that does what I need to do:
start = ( comment | mod_match | char)
line = < (( '\n' '\r'* ) | ( '\r' '\n'* )) > { lines++; chars += yyleng; }
module_decl = module modnm:ident lparen ( !rparen . )* rparen { chars += yyleng; printf("Module decl: <%s>\n",yytext);}
module_inst = modinstname:ident ident lparen { chars += yyleng; printf("Module Inst: <%s>\n",yytext);}
|modinstname:ident hash lparen { chars += yyleng; printf("Module Inst: <%s>\n",yytext);}
mod_match = ( module_decl | module_inst )
module = 'module' ws { modules++; chars +=yyleng; printf("Module: <%s>\n", yytext); }
endmodule = 'endmodule' ws { endmodules++; chars +=yyleng; printf("EndModule: <%s>\n", yytext); }
kwd = (module|endmodule)
ident = !kwd<[a-zA-Z][a-zA-Z0-9_]+>- { words++; chars += yyleng; printf("Ident: <%s>\n", yytext); }
char = . { chars++; }
lparen = '(' -
rparen = ')' -
hash = '#'
- = ( space | comment )*
ws = space+
space = ' ' | '\t' | EOL
comment = '//' ( !EOL .)* EOL
| '/*' ( !'*/' .)* '*/'
EOF = !.
EOL = '\r\n' | '\n' | '\r'
Aurochs is possibly also an option, but I have concerns about speed and memory usage of an Aurochs generated parser. peg/leg produce a parser in C which should be quite speedy.
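For comparison, the same permissive idea (match what you care about, skip everything else) can be sketched without a parser generator. The Python snippet below is a rough approximation, not a translation of the leg grammar: find_modules and the regex names are made up for this example, it only handles plain whitespace between tokens, and, unlike the char rule above, it skips a whole word at a time so that it never starts matching in the middle of an identifier.

```python
import re

KEYWORDS = {"module", "endmodule"}
IDENT = r"[A-Za-z][A-Za-z0-9_]*"
COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
# module <name> (
MODULE_DECL = re.compile(r"module\s+(" + IDENT + r")\s*\(")
# <name> <instance> (    or    <name> # (
MODULE_INST = re.compile(
    r"(" + IDENT + r")(?:\s+" + IDENT + r"\s*\(|\s*#\s*\()"
)
WORD = re.compile(r"[A-Za-z0-9_]+")


def find_modules(source):
    """Return [('decl' | 'inst', name), ...] in order of appearance."""
    found = []
    pos = 0
    while pos < len(source):
        m = COMMENT.match(source, pos)
        if m:
            pos = m.end()
            continue
        m = MODULE_DECL.match(source, pos)
        if m:
            found.append(("decl", m.group(1)))
            # Rough equivalent of ( !rparen . )* rparen:
            # skip everything up to and including the next ')'.
            close = source.find(")", m.end())
            pos = len(source) if close == -1 else close + 1
            continue
        m = MODULE_INST.match(source, pos)
        if m and m.group(1) not in KEYWORDS:
            found.append(("inst", m.group(1)))
            pos = m.end()
            continue
        # Nothing interesting here: skip one word or one character.
        m = WORD.match(source, pos)
        pos = m.end() if m else pos + 1
    return found
```

Like the leg grammar, this will report false positives on real Verilog (any "ident ident (" sequence looks like an instantiation), so treat it as a starting point to refine incrementally.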
I am writing a parser for Delphi's DFM files. The lexer looks like this:
EXP ([Ee][-+]?[0-9]+)
%%
("#"([0-9]{1,5}|"$"[0-9a-fA-F]{1,6})|"'"([^']|'')*"'")+ {
return tkStringLiteral; }
"object" { return tkObjectBegin; }
"end" { return tkObjectEnd; }
"true" { /*yyval.boolean = true;*/ return tkBoolean; }
"false" { /*yyval.boolean = false;*/ return tkBoolean; }
"+" | "." | "(" | ")" | "[" | "]" | "{" | "}" | "<" | ">" | "=" | "," |
":" { return yytext[0]; }
[+-]?[0-9]{1,10} { /*yyval.integer = atoi(yytext);*/ return tkInteger; }
[0-9A-F]+ { return tkHexValue; }
[+-]?[0-9]+"."[0-9]+{EXP}? { /*yyval.real = atof(yytext);*/ return tkReal; }
[a-zA-Z_][0-9A-Z_]* { return tkIdentifier; }
"$"[0-9A-F]+ { /* yyval.integer = atoi(yytext);*/ return tkHexNumber; }
[ \t\r\n] { /* ignore whitespace */ }
. { std::cerr << boost::format("Mystery character %c\n") % *yytext; }
<<EOF>> { yyterminate(); }
%%
and the bison grammar looks like
%token tkInteger
%token tkReal
%token tkIdentifier
%token tkHexValue
%token tkHexNumber
%token tkObjectBegin
%token tkObjectEnd
%token tkBoolean
%token tkStringLiteral
%%
object:
tkObjectBegin tkIdentifier ':' tkIdentifier
property_assignment_list tkObjectEnd
;
property_assignment_list:
property_assignment
| property_assignment_list property_assignment
;
property_assignment:
property '=' value
| object
;
property:
tkIdentifier
| property '.' tkIdentifier
;
value:
atomic_value
| set
| binary_data
| strings
| collection
;
atomic_value:
tkInteger
| tkReal
| tkIdentifier
| tkBoolean
| tkHexNumber
| long_string
;
long_string:
tkStringLiteral
| long_string '+' tkStringLiteral
;
atomic_value_list:
atomic_value
| atomic_value_list ',' atomic_value
;
set:
'[' ']'
| '[' atomic_value_list ']'
;
binary_data:
'{' '}'
| '{' hexa_lines '}'
;
hexa_lines:
tkHexValue
| hexa_lines tkHexValue
;
strings:
'(' ')'
| '(' string_list ')'
;
string_list:
tkStringLiteral
| string_list tkStringLiteral
;
collection:
'<' '>'
| '<' collection_item_list '>'
;
collection_item_list:
collection_item
| collection_item_list collection_item
;
collection_item:
tkIdentifier property_assignment_list tkObjectEnd
;
%%
void yyerror(const char *s, ...) {...}
The problem with this grammar occurs while parsing the binary data. Binary data in the dfm's files is nothing
but a sequence of hexadecimal characters which never spans more than 80 characters per line. An example of
it is:
Picture.Data = {
055449636F6E0000010001002020000001000800A80800001600000028000000
2000000040000000010008000000000000000000000000000000000000000000
...
FF00000000000000000000000000000000000000000000000000000000000000
00000000FF000000FF000000FF00000000000000000000000000000000000000
00000000}
As you can see, this element lacks any markers, so its contents clash with other tokens. In the example
above, the first line returns the proper token, tkHexValue. The second, however, returns a tkInteger token
and the third a tkIdentifier token. So when parsing begins, it fails with a syntax error because
binary data is composed only of tkHexValue tokens.
My first workaround was to require integers to have a maximum length (which helped in all but the last line
of the binary data). The second was to move the tkHexValue rule above the tkIdentifier rule, but that means
I can no longer have identifiers like F0.
I was wondering if there is any way to fix this grammar?
Ok, I solved this one. I needed to define a state so tkHexValue is only returned while reading binary data. In the preamble part of the lexer I added
%x BINARY
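and modified the following rules (BINARY is an exclusive state, so the tkHexValue rule also needs the <BINARY> prefix to stay active inside it):
"{" {BEGIN BINARY; return yytext[0];}
<BINARY>"}" {BEGIN INITIAL; return yytext[0];}
<BINARY>[0-9A-F]+ { return tkHexValue; }
<BINARY>[ \t\r\n] { /* ignore whitespace */ }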
And that was all!
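For readers unfamiliar with lexer states, here is a small Python sketch of the same idea. It is not flex and covers only a handful of the token kinds above; the rule tables, token names and the tokenize function are assumptions made for this illustration. Each state has its own rule set, "{" switches to BINARY and "}" switches back, so hex runs are only recognized between braces.

```python
import re

# One rule table per lexer state, similar in spirit to flex start conditions.
RULES = {
    "INITIAL": [
        ("LBRACE", re.compile(r"\{")),
        ("INTEGER", re.compile(r"[+-]?[0-9]{1,10}")),
        ("IDENT", re.compile(r"[A-Za-z_][0-9A-Za-z_]*")),
        ("PUNCT", re.compile(r"[.=]")),
        ("WS", re.compile(r"[ \t\r\n]+")),
    ],
    "BINARY": [
        ("HEX", re.compile(r"[0-9A-F]+")),
        ("RBRACE", re.compile(r"\}")),
        ("WS", re.compile(r"[ \t\r\n]+")),
    ],
}


def tokenize(text):
    """Return a list of (token_name, lexeme) pairs, dropping whitespace."""
    state = "INITIAL"
    pos = 0
    tokens = []
    while pos < len(text):
        # Longest match wins; ties go to the rule listed first.
        best = None
        for name, pattern in RULES[state]:
            m = pattern.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(
                f"unexpected character {text[pos]!r} in state {state}"
            )
        name, m = best
        pos = m.end()
        if name == "WS":
            continue
        tokens.append((name, m.group()))
        if name == "LBRACE":
            state = "BINARY"      # like BEGIN BINARY
        elif name == "RBRACE":
            state = "INITIAL"     # like BEGIN INITIAL
    return tokens
```

With this split, a line such as 2000 inside braces is a HEX token, while F0 or 2000 outside braces still lex as an identifier and an integer.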
Type mismatch. Expecting a LexBuffer<char> but given a LexBuffer<byte> The type 'char' does not match the type 'byte'
This is the error message that I am getting while using fslex. I have tried manually checking every single occurrence of lexbuf and its type. It's LexBuffer<char> everywhere, but the compiler still gives me the above error. Can you please tell me why this error occurs and how to go about resolving it?
{
open System
open Microsoft.FSharp.Text.Lexing
open Microsoft.FSharp.Text.Parsing
let lexeme (lexbuf : LexBuffer<char>) = new System.String(lexbuf.Lexeme)
let newline (lexbuf:LexBuffer<char>) = lexbuf.EndPos <- lexbuf.EndPos.NextLine
let unexpected_char (lexbuf:LexBuffer<char>) = failwith ("Unexpected character '"+(lexeme lexbuf)+"'")
}
let char = ['a'-'z' 'A'-'Z']
let digit = ['0'-'9']
let float = '-'?digit+ '.' digit+
let ident = char+ (char | digit)*
let whitespace = [' ' '\t']
let newline = ('\n' | '\r' '\n')
rule tokenize = parse
| "maximize" { MAXIMIZE }
| "minimize" { MINIMIZE }
| "where" { WHERE }
| '+' { PLUS }
| '-' { MINUS }
| '*' { MULTIPLY }
| '=' { EQUALS }
| '>' { STRICTGREATERTHAN }
| '<' { STRICTLESSTHAN }
| ">=" { GREATERTHANEQUALS }
| "<=" { LESSTHANEQUALS }
| '[' { LSQUARE }
| ']' { RSQUARE }
| whitespace { tokenize lexbuf }
| newline { newline lexbuf; tokenize lexbuf }
| ident { ID (lexeme lexbuf) }
| float { FLOAT (Double.Parse(lexeme lexbuf)) }
| ';' { SEMICOLON }
| eof { EOF }
| _ { unexpected_char lexbuf }
Maybe you need to generate a unicode lexer. A unicode lexer works with a LexBuffer<char> rather than LexBuffer<byte>.
The "unicode" argument to FsLex is optional, but if enabled generates a unicode lexer.
http://blogs.msdn.com/dsyme/archive/2009/10/21/some-smaller-features-in-the-latest-release-of-f.aspx
Have you tried inserting an explicit cast?
I believe there was a mistake in my lexer file definition; it compiled once I changed the lexer definition to the following. Experts can shed more light on the reasons, but my understanding is that the type of the lexbuf used in the lexer should somehow be related to the definition that the parser generates:
{
open System
open LanguageParser
open Microsoft.FSharp.Text.Lexing
open Microsoft.FSharp.Text.Parsing
open System.Text
let newline (lexbuf:LexBuffer<_>) = lexbuf.EndPos <- lexbuf.EndPos.NextLine
}
let char = ['a'-'z' 'A'-'Z']
let digit = ['0'-'9']
let float = '-'?digit+ '.' digit+
let ident = char+ (char | digit)*
let whitespace = [' ' '\t']
let newline = ('\n' | '\r' '\n')
rule tokenize = parse
| "maximize" { MAXIMIZE }
| "minimize" { MINIMIZE }
| "where" { WHERE }
| '+' { PLUS }
| '-' { MINUS }
| '*' { MULTIPLY }
| '=' { EQUALS }
| '>' { STRICTGREATERTHAN }
| '<' { STRICTLESSTHAN }
| ">=" { GREATERTHANEQUALS }
| "<=" { LESSTHANEQUALS }
| '[' { LSQUARE }
| ']' { RSQUARE }
| whitespace { tokenize lexbuf }
| newline { newline lexbuf; tokenize lexbuf }
| ident { ID <| Encoding.UTF8.GetString(lexbuf.Lexeme) }
| float { FLOAT <| Double.Parse(Encoding.UTF8.GetString(lexbuf.Lexeme)) }
| ';' { SEMICOLON }
| eof { EOF }
| _ { failwith ("Unexpected Character") }