I'm trying to understand the diagnostic (trace) messages given by Bison:
Entering state 5
Return for a new token:
Reading a token: Next token is token END_OF_FILE (4.0: )
Shifting token END_OF_FILE (4.0: )
Entering state 43
Reducing stack by rule 143 (line 331):
$1 = nterm syntax (0.0-17: )
$2 = nterm top_levels (0.18-4.0: )
$3 = token END_OF_FILE (4.0: )
-> $$ = nterm s (0.0-4.0: )
Stack now 0
Entering state 3
Return for a new token:
Reading a token: Next token is token END_OF_FILE (4.0: )
4/0: syntax error
Error: popping nterm s (0.0-4.0: )
Stack now 0
Cleanup: discarding lookahead token END_OF_FILE (4.0: )
Stack now 0
I cannot understand why, or what it is trying to do with the EOF token. Below are the Flex rules:
<<EOF>> { return END_OF_FILE; }
And the Bison rules:
top_level : message
| enum
| service
| import { $$ = Py_None; }
| package { $$ = Py_None; }
| option_def { $$ = Py_None; }
| ';' { $$ = Py_None; } ;
top_levels : %empty { $$ = py_list(Py_None); }
| top_levels top_level { $$ = py_append($1, $2); } ;
s : syntax top_levels END_OF_FILE { $$ = $2; } ;
And the output file generated by Bison:
State 3
0 $accept: s . $end
$end shift, and go to state 6
State 5
142 top_levels: top_levels . top_level
143 s: syntax top_levels . END_OF_FILE
BOOL shift, and go to state 9
... bunch of similar rules
END_OF_FILE shift, and go to state 43
';' shift, and go to state 44
import go to state 45
... bunch of similar rules
top_level go to state 55
State 6
0 $accept: s $end .
$default accept
I have no idea what's going on. Why does it report reading the EOF token twice? What exactly was the problem with popping s? To me it seems like it actually accepted the whole thing, and then decided to reject it because it read the token a second time... but the whole report is very confusing.
1. The problem
Don't do this:
<<EOF>> { return END_OF_FILE; }
Yacc/Bison parsers augment grammars with an internal rule which produces the start symbol followed by an internal EOF token called $end, whose token number is 0. (You can see this rule in states 3 and 6.) That is the only accepting rule in the grammar.
By default, (f)lex scanners return 0 when EOF is detected. So that all Just Works.
When you try to send a different token on EOF, you are attempting to defeat this mechanism, but it won't work, because the start symbol is not an accepting rule. After the start symbol is reduced, the parser tries to reduce the $accept rule, so it asks the scanner for another token. But the scanner has already hit EOF. In most cases, the scanner will execute the <<EOF>> action again (although this is not guaranteed), but that's not going to produce the $end token the parser needs. So you get a syntax error.
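In short, the fix for this part is simply to drop the <<EOF>> rule and let the default behaviour stand. A minimal flex sketch (the whitespace rule is only illustrative, not taken from your scanner):
%%
[ \t\r\n]+      { /* skip whitespace (illustrative rule only) */ }
    /* No <<EOF>> rule at all: when the input ends, yylex() returns 0,
       which bison treats as its internal $end token, so the $accept
       rule can be reduced and the input accepted. */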
2. The underlying problem (maybe)
Normally, people try this in order to run a user action when the input is accepted, typically to return the result of the parse to yyparse's caller through an "out" parameter. Trying to explicitly recognize an EOF token (or even the $end token) in the start production cannot work, but there is a much simpler solution: an extra unit rule:
%start return
%%
return: s { *out = $1; }
s: syntax top_levels { $$ = $2; }
Note that you could also do this without top_levels:
%start return
%%
return: s { *out = $1; }
s: syntax { $$ = py_list(Py_None); }
| s top_level { $$ = py_append($1, $2); }
An alternative is to use the special YYACCEPT action macro in the action for the start rule. However, I believe the standard solution outlined above is simpler because it doesn't require anything from the scanner.
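If you do go the YYACCEPT route, a rough sketch (keeping your END_OF_FILE token and the "out" parameter mentioned above) would be:
s : syntax top_levels END_OF_FILE
        {
          *out = $2;   /* hand the parse result back through the "out" parameter */
          YYACCEPT;    /* accept immediately; the parser never asks the scanner for $end */
        } ;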
3. The trace output
Error: popping nterm s (0.0-4.0: )
Means:
1. A syntax error was detected.
2. As part of error recovery, the parser popped the non-terminal s from the stack.
3. That non-terminal's source location extends from 0.0 to 4.0 (line.column).
If s (or its semantic type) had had a registered destructor, it would have run at step 2. You will probably want to register a destructor for semantic types which reference Python values, in order to decrement their reference counts so that you don't leak memory on syntax errors. But perhaps I'm wrong about that.
Also, you could register a %printer for the semantic value, in which case it would have been printed after the colon.
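For example, the declarations could look roughly like this (a sketch only; <obj> is a hypothetical type tag standing in for whichever %union/%type tag your Python-valued symbols use):
/* In the declarations section of the .y file */
%destructor { Py_XDECREF($$); } <obj>                        /* release discarded Python references */
%printer    { fprintf(yyo, "<PyObject at %p>", (void *) $$); } <obj>   /* shown after the colon in traces */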
Related
I'm in the middle of learning how to parse simple programs.
This is my lexer.
{
open Parser
exception SyntaxError of string
}
let white = [' ' '\t']+
let blank = ' '
let identifier = ['a'-'z']
rule token = parse
| white {token lexbuf} (* skip whitespace *)
| '-' { HYPHEN }
| identifier {
    let buf = Buffer.create 64 in
    Buffer.add_string buf (Lexing.lexeme lexbuf);
    scan_string buf lexbuf;
    let content = (Buffer.contents buf) in
    STRING(content)
  }
| _ { raise (SyntaxError "Unknown stuff here") }
and scan_string buf = parse
| ['a'-'z']+ {
    Buffer.add_string buf (Lexing.lexeme lexbuf);
    scan_string buf lexbuf
  }
| eof { () }
My "ast":
type t =
    String of string
  | Array of t list
My parser:
%token <string> STRING
%token HYPHEN
%start <Ast.t> yaml
%%
yaml:
| scalar { $1 }
| sequence {$1}
;
sequence:
| sequence_items {
Ast.Array (List.rev $1)
}
;
sequence_items:
(* empty *) { [] }
| sequence_items HYPHEN scalar {
$3::$1
};
scalar:
| STRING { Ast.String $1 }
;
I'm currently at a point where I want to parse either plain 'strings', e.g. some text, or 'arrays' of 'strings', e.g. - item1 - item2.
When I compile the parser with Menhir I get:
Warning: production sequence -> sequence_items is never reduced.
Warning: in total, 1 productions are never reduced.
I'm pretty new to parsing. Why is this never reduced?
You declare that your entry point to the parser is called main
%start <Ast.t> main
But I can't see the main production in your code. Maybe the entry point is supposed to be yaml? If that is changed, does the error still persist?
Also, try adding an EOF token to your lexer and to the entry-level production, like this:
parse_yaml: yaml EOF { $1 }
See here for example: https://github.com/Virum/compiler/blob/28e807b842bab5dcf11460c8193dd5b16674951f/grammar.mly#L56
The Real World OCaml link below also discusses how to use EOF; I think this will solve your problem.
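Concretely, the two pieces would look something like this (a sketch only; it assumes parse_yaml becomes the %start symbol in place of yaml):
(* In the .mly file: declare the token and require it at the end of the input. *)
%token EOF
%start <Ast.t> parse_yaml
%%
parse_yaml: yaml EOF { $1 }

(* In the ocamllex rule "token": *)
| eof { EOF }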
By the way, it is really cool that you are writing a YAML parser in OCaml. If you make it open source, it will be really useful to the community. Note that YAML is indentation-sensitive, so to parse it with Menhir you will need your lexer to produce some kind of INDENT and DEDENT tokens. Also, YAML is a strict superset of JSON, which means it might (or might not) make sense to start with a JSON subset and then expand it. Real World OCaml shows how to write a JSON parser using Menhir:
https://dev.realworldocaml.org/16-parsing-with-ocamllex-and-menhir.html
I'm parsing a script language that defines two types of statements: control statements and non-control statements. Non-control statements always end with ';', while control statements may end with ';' or EOL ('\n'). A part of the grammar looks like this:
script
: statement* EOF
;
statement
: control_statement
| no_control_statement
;
control_statement
: if_then_control_statement
;
if_then_control_statement
: IF expression THEN end_control_statment
( statement ) *
( ELSEIF expression THEN end_control_statment ( statement )* )*
( ELSE end_control_statment ( statement )* )?
END IF end_control_statment
;
no_control_statement
: sleep_statement
;
sleep_statement
: SLEEP expression END_STATEMENT
;
end_control_statment
: END_STATEMENT
| EOL
;
END_STATEMENT
: ';'
;
ANY_SPACE
: ( LINE_SPACE | EOL ) -> channel(HIDDEN)
;
EOL
: [\n\r]+
;
LINE_SPACE
: [ \t]+
;
In all other aspects of the script language I never care about EOL, so I use the normal lexer rules to hide whitespace.
This works fine except in the cases where I need an EOL to terminate a control statement; with the grammar above, every EOL is hidden and never reaches the control statement rules.
Is there a way to change my grammar so that I can skip every EOL except the ones needed to terminate parts of my control statements?
Found one way to handle this.
The idea is to divert EOL into one hidden channel and the other stuff I don't want to see (like spaces and comments) into another hidden channel. Then I use some code to backtrack over the tokens when an EOL is supposed to show up and examine the previous tokens' channels (since they have already been consumed). If I find something on the EOL channel before I run into something from the ordinary channel, then it is OK.
It looks like this:
Changed the lexer rules:
@lexer::members {
public static int EOL_CHANNEL = 1;
public static int OTHER_CHANNEL = 2;
}
...
EOL
: '\r'? '\n' -> channel(EOL_CHANNEL)
;
LINE_SPACE
: [ \t]+ -> channel(OTHER_CHANNEL)
;
I also diverted all other HIDDEN channels (comments) to the OTHER_CHANNEL.
Then I changed the rule end_control_statment:
end_control_statment
: END_STATEMENT
| { isEOLPrevious() }?
;
and added
@parser::members {
public static int EOL_CHANNEL = 1;
public static int OTHER_CHANNEL = 2;
boolean isEOLPrevious()
{
int idx = getCurrentToken().getTokenIndex();
int ch;
do
{
ch = getTokenStream().get(--idx).getChannel();
}
while (ch == OTHER_CHANNEL);
// Channel 1 is only carrying EOL, no need to check token itself
return (ch == EOL_CHANNEL);
}
}
One could stick to the ordinary hidden channel, but then you would need to track both the channel and the token type while backtracking, so this is maybe a bit easier. Roughly, that variant could look like the sketch below.
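(This is an untested sketch, not what I actually use: it assumes EOL keeps its own token type, that everything skipped goes to ANTLR's built-in Token.HIDDEN_CHANNEL, and, like the code above, it does not guard against reaching the start of the token stream.)
boolean isEOLPreviousHidden()
{
    // Walk backwards over hidden tokens; succeed if a hidden EOL token
    // shows up before the first token from the default channel.
    int idx = getCurrentToken().getTokenIndex();
    Token t;
    do
    {
        t = getTokenStream().get(--idx);
        if (t.getType() == EOL)
        {
            return true;
        }
    }
    while (t.getChannel() == Token.HIDDEN_CHANNEL);
    return false;
}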
Hope this helps someone else dealing with this kind of issue...
I've written a lexer in Alex and I'm trying to hook it up to a parser written in Happy. I'll try my best to summarize my problem without pasting huge chunks of code.
I know from my unit tests of my lexer that the string "\x7" is lexed to:
[TokenNonPrint '\x7', TokenEOF]
My token type (spit out by the lexer) is Token. I've defined lexWrap and alexEOF as described here, which gives me the following header and token declarations:
%name parseTokens
%tokentype { Token }
%lexer { lexWrap } { alexEOF }
%monad { Alex }
%error { parseError }
%token
NONPRINT {TokenNonPrint $$}
PLAIN { TokenPlain $$ }
I invoke the parser+lexer combo with the following:
parseExpr :: String -> Either String [Expr]
parseExpr s = runAlex s parseTokens
And here are my first few productions:
exprs :: { [Expr] }
exprs
: {- empty -} { trace "exprs 30" [] }
| exprs expr { trace "exprs 31" $ $2 : $1 }
nonprint :: { Cmd }
: NONPRINT { NonPrint $ parseNonPrint $1}
expr :: { Expr }
expr
: nonprint {trace "expr 44" $ Cmd $ $1}
| PLAIN { trace "expr 37" $ Plain $1 }
I'll leave out the datatype declarations of Expr and NonPrint since they're long and only the constructors Cmd and NonPrint matter here. The function parseNonPrint is defined at the bottom of Parse.y as:
parseNonPrint :: Char -> NonPrint
parseNonPrint '\x7' = Bell
Also, my error handling function looks like:
parseError :: Token -> Alex a
parseError tokens = error ("Error processing token: " ++ show tokens)
Written like this, I expect the following hspec test to pass:
parseExpr "\x7" `shouldBe` Right [Cmd (NonPrint Bell)]
But instead, I see "exprs 30" print once (even though I'm running 5 different unit tests) and all of my tests of parseExpr return Right []. I don't understand why that would be the case, but I changed the exprs production to prevent it:
exprs :: { [Expr] }
exprs
: expr { trace "exprs 30" [$1] }
| exprs expr { trace "exprs 31" $ $2 : $1 }
Now all of my tests fail on the first token they hit --- parseExpr "\x7" fails with:
uncaught exception: ErrorCall (Error processing token: TokenNonPrint '\a')
And I'm thoroughly confused, since I would expect the parser to take the path exprs -> expr -> nonprint -> NONPRINT and succeed. I don't see why this input would put the parser in an error state. None of the trace statements are hit (optimized away?).
What am I doing wrong?
It turns out the cause of this error was the innocuous line
%lexer { lexWrap } { alexEOF }
which was recommended by the linked question about using Alex with Happy (unfortunately, one of the top Google results for queries like "using Alex as a monadic lexer with Happy"). The fix is to change it to the following:
%lexer { lexWrap } { TokenEOF }
I had to dig into the generated code to uncover the issue. It is caused by the code derived from the %token directive, which looks as follows (I commented out all of my token declarations except for TokenNonPrint while trying to track down the error):
happyNewToken action sts stk
= lexWrap(\tk ->
let cont i = happyDoAction i tk action sts stk in
case tk of {
alexEOF -> happyDoAction 2# tk action sts stk; -- !!!!
TokenNonPrint happy_dollar_dollar -> cont 1#;
_ -> happyError' tk
})
Evidently, Happy transforms each line of the %token directive into one branch of a pattern match. It also inserts a branch for whatever was identified to it as the EOF token in the %lexer directive.
By inserting the name of a value, alexEOF, rather than a data constructor, TokenEOF, this branch of the case statement re-binds the name alexEOF to whatever token was passed in to lexWrap. That shadows the original binding and short-circuits the case statement so that it hits the EOF rule every time, which somehow results in Happy entering an error state.
The mistake isn't caught by the type system, since the identifier alexEOF (or TokenEOF) doesn't appear anywhere else in the generated code. Misusing the %lexer directive like this will cause GHC to emit a warning, but, since the warning appears in generated code, it's impossible to distinguish it from all of the other harmless warnings the code throws out.
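For reference, the working combination is an ordinary nullary constructor in both the token type and the directive. A sketch (the field types here are guesses, not taken from the real lexer):
-- TokenEOF must be a real constructor of Token so the generated case
-- expression can pattern-match on it instead of shadowing a variable.
data Token
  = TokenNonPrint Char
  | TokenPlain String
  | TokenEOF
  deriving (Eq, Show)

-- and in the Happy grammar header:
-- %lexer { lexWrap } { TokenEOF }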
I am having a hard time trying to implement a grammar to parse jQuery blocks in between Java code.
I do not need to implement a Java grammar. This is going to be a translator. I just need to output the Java as it is and translate the jQuery to Java...
jQuery blocks are surrounded by the following tokens: /*#jQ ... */. There can be multiple blocks, but nesting is not allowed. Here is an example:
package test;
public class Test {
public static void main(String[] args) {
System.out.println("Hello world!");
/*#jQ
*/
System.out.println("Good bye world!");
}
}
The desired output of the translator, for this particular case, would be:
package test;
public class Test {
public static void main(String[] args) {
System.out.println("Hello world!");
System.out.println("Good bye world!");
}
}
The problem is that I am not able to read Java until a /*#jQ is found. Here is an excerpt of what I have so far:
main
:
java
(
jQueryBlock+ java
)*
;
java
:
.*?
;
jQueryBlock
:
JQUERYBLOCKSTART
(
jQueryStatement SINGLE_LINE_COMMENT?
)* JQUERYBLOCKEND
;
and...
JQUERYBLOCKSTART
:
'/*#jQ'
;
Although the generated parse tree is somewhat acceptable (see below), I get several token recognition errors...
JjQuery::main:3:22: token recognition error at: '{'
JjQuery::main:5:44: token recognition error at: '{'
JjQuery::main:6:12: token recognition error at: '.'
JjQuery::main:6:16: token recognition error at: '.'
JjQuery::main:6:37: token recognition error at: '!"'
JjQuery::main:12:12: token recognition error at: '.'
JjQuery::main:12:16: token recognition error at: '.'
JjQuery::main:12:40: token recognition error at: '!"'
JjQuery::main:13:5: token recognition error at: '}'
JjQuery::main:15:4: token recognition error at: '}'
Thanks in advance!
UPDATE
I have modified my grammar as suggested, but I'm still having some problems. Here is an example input, the generated parse tree, and below it the errors thrown.
warning(155): Lexer.g4:22:28: rule SINGLE_LINE_COMMENT contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Lexer.g4:28:25: rule WS contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
Parser::src:1:3: extraneous input '\n\n' expecting {<EOF>, '/*#jQ', JAVA}
Parser::src:3:5: token recognition error at: '\n'
Parser::src:4:0: token recognition error at: '\n'
Parser::src:5:2: token recognition error at: ' '
Parser::src:5:8: token recognition error at: '\n'
Parser::src:6:0: token recognition error at: '\n'
Parser::src:7:2: extraneous input '\n\n' expecting {<EOF>, '/*#jQ', JAVA}
Here is the current Lexer.g4:
lexer grammar Lexer;
@lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
// Default mode rules (the SEA)
JQBegin
:
'/*#jQ' -> pushMode ( JQUERY )
;
JAVA
:
.
;
WS
:
[ \t\r\n]+ -> channel ( WHITESPACE ) // channel(1)
;
SINGLE_LINE_COMMENT
:
'//' .*? '\n' -> channel ( COMMENTS ) // channel(2)
;
mode JQUERY;
JQEnd
:
'*/' -> popMode
;
IN
:
'in'
;
OUT
:
'out'
;
ID
:
[a-zA-Z_] [a-zA-Z0-9_]*
;
SEMICOLON
:
';'
;
And the Parser.g4:
parser grammar Parser;
options {
tokenVocab = Lexer;
} // use tokens from Lexer.g4
src
:
(
JAVA
| jQuery
)+ EOF
;
jQuery
:
JQBegin
(
in
| out
)* JQEnd
;
in
:
IN ID SEMICOLON
;
out
:
OUT ID SEMICOLON
;
Use lexical modes to separately handle JQuery and Java blocks (even though the Java blocks are trivial in your case). Note, lexer modes are only available in Lexer grammars and not in combined grammars.
Also, the Java catchall must match a single character at a time. Otherwise it can consume the JQuery begin sequence (this is likely the source of the errors you are seeing).
main: ( JAVA | jqBlock )+ EOF ;
jqBlock: JQBegin
( ... | ... | ... ) // your JQuery rules
JQEnd
;
JQBegin: '/*#jQ' -> pushMode(JQ) ;
JAVA : . ;
mode JQ;
... // your JQuery specific rules
BlockComment : '/*' .*? '*/' ; // handle any possibly ambiguous
// sequences that otherwise might
// cause early exits
JQEnd: '*/' -> popMode() ;
If I write a grammar file in Yacc/Bison like this:
Module
:ModuleName "=" Functions
{ $$ = Builder::concat($1, $2, ","); }
Functions
:Functions Function
{ $$ = Builder::concat($1, $2, ","); }
| Function
{ $$ = $1; }
Function
: DEF ID ARGS BODY
{
/** Lacks module name to do name mangling for the function **/
/** How can I obtain the "parent" node's module name here ?? **/
module_name = ; //????
$$ = Builder::def_function(module_name, $ID, $ARGS, $BODY);
}
And this parser should parse codes like this:
main_module:
def funA (a,b,c) { ... }
In my AST, the name "funA" should be renamed to main_module.funA. But I can't get the module's information while the parser is processing the Function node!
Are there any Yacc/Bison facilities that can help me handle this problem, or should I change my parsing style to avoid such embarrassing situations?
There is a bison feature, but as the manual says, use it with care:
$N with N zero or negative is allowed for reference to tokens and groupings on the stack before those that match the current rule. This is a very risky practice, and to use it reliably you must be certain of the context in which the rule is applied. Here is a case in which you can use this reliably:
foo: expr bar '+' expr { ... }
| expr bar '-' expr { ... }
;
bar: /* empty */
{ previous_expr = $0; }
;
As long as bar is used only in the fashion shown here, $0 always refers to the expr which precedes bar in the definition of foo.
More cleanly, you could use a mid-rule action (in Module) to push the module name onto a name stack (which would have to be part of the parsing context), and pop the stack at the end of the rule, as sketched below.
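A rough sketch of that approach (push_module_name, pop_module_name and current_module_name are hypothetical helpers on your parsing context, not Bison features):
Module
    : ModuleName "="
          { push_module_name($1); }              /* mid-rule action: runs before Functions is parsed */
      Functions
          {
            $$ = Builder::concat($1, $4, ",");   /* Functions is now $4: the mid-rule action counts as $3 */
            pop_module_name();
          }
    ;

Function
    : DEF ID ARGS BODY
          {
            $$ = Builder::def_function(current_module_name(), $ID, $ARGS, $BODY);
          }
    ;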
For more information and examples of mid-rule actions, see the manual.