I am trying to create a parser with Bison (GNU bison 2.4.1) and flex (2.5.35) on my Ubuntu OS. I have something like this:
sql.h:
typedef struct word
{
char *val;
int length;
} WORD;
struct yword
{
struct word v;
int o;
...
};
sql1.y:
%{
..
#include "sql.h"
..
%}
%union yystype
{
struct tree *t;
struct yword b;
...
}
%token <b> NAME
%%
...
table:
NAME { add_table(root, $1.v); }
;
...
The trouble is that whatever string I give it, when it comes to resolving this rule, v always has the values (NULL, 0), even when the input string contains a table name. (I chose to skip other details/snippets that seemed unnecessary, but I can provide more if it helps.)
I wrote the grammar, which is complete and correct, but I can't get it to build the parse tree due to this problem.
Any inputs would be quite appreciated.
Your trouble seems related to some missing or buggy code in the lexical analyzer, so check the lexical analyzer first.
If it does not return the tokens properly, the parser cannot handle the values correctly.
Write a basic test that prints the token values.
Don't mind the C style; what matters is the principle (the bison-generated header name below depends on how you run bison, so adjust it to your build):
#include <stdio.h>
#include "sql.h"
#include "sql1.tab.h"   /* token numbers and yylval from bison -d; adjust the header name to your build */

int yylex(void);

int main(void) {
    int token;
    while ((token = yylex()) != 0) {
        switch (token) {
        case NAME:
            printf("name '%s'\n", yylval.b.v.val);
            break;
        ...
        }
    }
    return 0;
}
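(To build such a test: run bison -d on the grammar so that the token definitions and yylval are exported to a header, run flex on the scanner, and compile the two generated C files together with this main.)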
If you run some input and it does not work: when the lexical analyzer does not set yylval as it returns NAME, it is normal that val stays empty.
If your flex file has a pattern such as:
[a-z]+ { return NAME; }
it is incorrect; you have to set the value, like this:
[a-z]+ {
    yylval.b.v.val = strdup(yytext);
    yylval.b.v.length = yyleng;
    return NAME;
}
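Note the b.v path: with the %union shown in the question, %token <b> NAME declares that NAME carries the b member (a struct yword), so the scanner must fill in yylval.b.v rather than a top-level yylval.val.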
Note: I initially posted an over-simplified version of my problem. A more
accurate description follows:
I have the following struct:
struct Thing(T) {
T[3] values;
int opApply(scope int delegate(size_t, ref T) dg) {
int res = 0;
foreach(idx, ref val; values) {
res = dg(idx, val);
if (res) break;
}
return res;
}
}
Foreach can be used like so:
unittest {
Thing!(size_t[]) thing;
foreach(i, ref val ; thing) val ~= i;
}
However, it is not @nogc friendly:
@nogc unittest {
Thing!size_t thing;
foreach(i, ref val ; thing) val = i;
}
If I change the signature to
int opApply(scope int delegate(size_t, ref T) @nogc dg) { ... }
It works for the @nogc case, but fails to compile for non-@nogc cases.
The solutions I have tried are:
1. Cast the delegate:
int opApply(scope int delegate(size_t, ref T) dg) {
auto callme = cast(int delegate(size_t, ref T) @nogc) dg;
// use callme instead of dg to support nogc
This seems wrong, as I am willfully casting a @nogc attribute onto functions that may not support it.
2. Use opSlice instead of opApply:
I'm not sure how to return an (index, ref value) tuple from my range. Even if
I could, I think it would have to contain a pointer to my static array, which
could have a shorter lifetime than the returned range.
3. Use a templated opApply:
All attempts to work with this have failed to automatically determine the
foreach argument types. For example, I needed to specify:
foreach(size_t idx, ref int value ; thing)
which I see as a significant hindrance to the API.
Sorry for underspecifying my problem before. For total transparency,
Enumap is the "real-world" example. It
currently uses opSlice, which does not support ref access to values. My
attempts to support 'foreach with ref' while maintaining @nogc support are what
prompted this question.
Instead of overloading the opApply operator you can implement an input range for your type. Input ranges work automatically as the aggregate argument in foreach statements:
struct Thing(K,V) {
import std.typecons;
@nogc bool empty(){return true;}
@nogc auto front(){return tuple(K.init, V.init);}
@nogc void popFront(){}
}
unittest {
Thing!(int, int) w;
foreach(val ; w) {
int[] i = [1,2,3]; // spurious allocation
}
}
@nogc unittest {
Thing!(int, int) w;
foreach(idx, val ; w) { assert(idx == val); }
}
This solves the problem caused by the allocation of the delegate used in foreach.
Note that the example is deliberately simplistic (the range doesn't actually work, and usually ranges are provided via opSlice, etc.), but you should get the idea.
I want to extend the IDL.g4 grammar a bit so that I can distinguish the two comments //#top-level false and //#top-level true; all other comments I just want to skip, like before.
I have tried to add top_level, TOP_LEVEL_TRUE and TOP_LEVEL_FALSE like this, because I thought ANTLR4 gave precedence to the lexer rules coming first.
top_level
: TOP_LEVEL_TRUE
| TOP_LEVEL_FALSE
;
TOP_LEVEL_TRUE
: '//#top-level true'
;
TOP_LEVEL_FALSE
: '//#top-level false'
;
LINE_COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' -> channel(HIDDEN)
;
But the listener enterTop_level(...) is never called;
all comments seem to be eaten by LINE_COMMENT. How should I organize the lexer and parser rules?
And one more question: I also want to be notified when the end of the input file is reached. How do I do that? I have tried a finalize() function in the listener class, but it never gets called.
Updated with a complete example:
I use this grammar file: IDL.g4, as I said above. Then I update it by putting the parser rule top_level just below the event_header rule. The lexer rules are put just above the ID rule.
Here is my Listener.java file
class Listener extends IDLBaseListener {
@Override
public void enterTop_level(IDLParser.Top_levelContext ctx) {
System.out.println("Found top-level");
}
}
and here is a main program: IDLCheck.java
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTreeWalker;
import java.io.FileInputStream;
import java.io.InputStream;
public class IDLCheck {
public void process(String[] args) throws Exception {
InputStream is = new FileInputStream("sample.idl");
ANTLRInputStream input = new ANTLRInputStream(is);
IDLLexer lexer = new IDLLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
IDLParser parser = new IDLParser(tokens);
parser.setBuildParseTree(true);
RuleContext tree = parser.specification();
Listener listener = new Listener();
ParseTreeWalker walker = new ParseTreeWalker();
walker.walk(listener, tree);
}
public static void main(String[] args) throws Exception {
new IDLCheck().process(args);
}
}
and an input file: sample.idl
module CommonTypes {
struct WChannel {
int w;
float d;
}; //#top-level false
struct EPlanID {
int kind;
short index;
}; //#top-level TRUE
};
I expect to see the output "Found top-level" twice, but I see nothing
Finally I found a solution. I just added newline characters to the TOP_LEVEL_FALSE and TOP_LEVEL_TRUE lexer rules (the ANTLR lexer prefers the longest match, and on a tie the rule defined first, so consuming the newline lets these rules win over LINE_COMMENT), and I also added the top_level parser rule to the definition rule, because I only expect top_level to appear after a struct or union. This is an rti.com-specific extension to the IDL format, and this modification seems to be good enough for me.
definition
: type_decl SEMICOLON top_level?
| const_decl SEMICOLON
...
TOP_LEVEL_TRUE
: '//#top-level true' '\r'? '\n'
;
TOP_LEVEL_FALSE
: '//#top-level false' '\r'? '\n'
;
I'm converting a working flex/bison parser to run re-entrantly. The parser has the ability to accept include command-file.txt directives, which was implemented on the flex side of things like this:
^include { BEGIN INCL; }
<INCL>{ws}+ { /* Ignore */ }
<INCL>[^ \t\n\r\f]+ { /* Swallow everything up to whitespace or an EOL character.
* When state returns to initial, the whitespace
* and/or EOL will be taken care of. */
yyin = fopen ( yytext, "r" );
if (! yyin) {
char filename[1024];
sprintf(filename,"/home/scripts/%s",yytext);
yyin = fopen( filename, "r");
if ( ! yyin) {
char buf[256];
sprintf(buf,"Couldn't open ""%s"".",yytext);
yyerror(buf);
}
}
yypush_buffer_state(yy_create_buffer(yyin, YY_BUF_SIZE));
BEGIN 0;
}
<<EOF>> {
yypop_buffer_state();
if (!YY_CURRENT_BUFFER) {
yyterminate();
}
}
This works nicely. Now that I've added %option reentrant and %option bison-bridge, I get these errors:
lexer.l:119: error: too few arguments to function `yy_create_buffer'
lexer.l:119: error: too few arguments to function `yypush_buffer_state'
lexer.l:123: error: too few arguments to function `yypop_buffer_state'
What are the proper ways to invoke these functions/macros in a re-entrant parser?
The reentrant interfaces are documented (briefly) in the flex manual.
All interfaces have one extra argument of type yyscan_t which comes at the end of the argument list. Examples (pulled from a flex-generated file):
YY_BUFFER_STATE yy_create_buffer (FILE *file,int size ,yyscan_t yyscanner );
void yy_delete_buffer (YY_BUFFER_STATE b ,yyscan_t yyscanner );
void yy_flush_buffer (YY_BUFFER_STATE b ,yyscan_t yyscanner );
void yypush_buffer_state (YY_BUFFER_STATE new_buffer ,yyscan_t yyscanner );
void yypop_buffer_state (yyscan_t yyscanner );
yylex follows the same pattern, so you can use yyscanner inside an action to refer to the argument provided to yylex.
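As a sketch, the buffer-switching actions from the question would then become (inside a reentrant scanner's actions, the extra argument is available as the variable yyscanner):
<INCL>[^ \t\n\r\f]+ {
    /* ... open yyin as before ... */
    yypush_buffer_state(yy_create_buffer(yyin, YY_BUF_SIZE, yyscanner), yyscanner);
    BEGIN 0;
}
<<EOF>> {
    yypop_buffer_state(yyscanner);
    if (!YY_CURRENT_BUFFER) {
        yyterminate();
    }
}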
The DSL I'm working on allows users to define a 'complete text substitution' variable. When parsing the code, we then need to look up the value of the variable and start parsing again from that code.
The substitution can be very simple (single constants) or entire statements or code blocks.
This is a mock grammar which I hope illustrates my point.
grammar a;
entry
: (set_variable
| print_line)*
;
set_variable
: 'SET' ID '=' STRING_CONSTANT ';'
;
print_line
: 'PRINT' ID ';'
;
STRING_CONSTANT: '\'' ('\'\'' | ~('\''))* '\'' ;
ID: [a-z][a-zA-Z0-9_]* ;
VARIABLE: '&' ID;
BLANK: [ \t\n\r]+ -> channel(HIDDEN) ;
Then the following statements, executed consecutively, should be valid:
SET foo = 'Hello world!';
PRINT foo;
SET bar = 'foo;'
PRINT &bar // should be interpreted as 'PRINT foo;'
SET baz = 'PRINT foo; PRINT'; // one complete statement and one incomplete statement
&baz foo; // should be interpreted as 'PRINT foo; PRINT foo;'
Any time the & variable token is discovered, we immediately switch to interpreting the value of that variable instead. As above, this can mean that you set up the code in such a way that is is invalid, full of half-statements that are only completed when the value is just right. The variables can be redefined at any point in the text.
Strictly speaking the current language definition doesn't disallow nesting &vars inside each other, but the current parsing doesn't handle this and I would not be upset if it wasn't allowed.
Currently I'm building an interpreter using a visitor, but this one I'm stuck on.
How can I build a lexer/parser/interpreter which will allow me to do this? Thanks for any help!
So I have found one solution to the issue. I think it could be better, as it potentially does a lot of array copying, but at least it works for now.
EDIT: I was wrong before; my earlier solution would consume ANY & that it found, including those in valid locations such as inside string constants. This seems like a better solution:
First, I extended the InputStream so that it is able to rewrite the input stream when a & is encountered. This unfortunately involves copying the array, which I can maybe resolve in the future:
MacroInputStream.java
package preprocessor;
import org.antlr.v4.runtime.ANTLRInputStream;
import java.util.HashMap;
public class MacroInputStream extends ANTLRInputStream {
private HashMap<String, String> map;
public MacroInputStream(String s, HashMap<String, String> map) {
super(s);
this.map = map;
}
public void rewrite(int startIndex, int stopIndex, String replaceText) {
int length = stopIndex-startIndex+1;
char[] replData = replaceText.toCharArray();
if (replData.length == length) {
for (int i = 0; i < length; i++) data[startIndex+i] = replData[i];
} else {
char[] newData = new char[data.length+replData.length-length];
System.arraycopy(data, 0, newData, 0, startIndex);
System.arraycopy(replData, 0, newData, startIndex, replData.length);
System.arraycopy(data, stopIndex+1, newData, startIndex+replData.length, data.length-(stopIndex+1));
data = newData;
n = data.length;
}
}
}
Secondly, I extended the Lexer so that when a VARIABLE token is encountered, the rewrite method above is called:
MacroGrammarLexer.java
package language;
import language.DSL_GrammarLexer;
import org.antlr.v4.runtime.Token;
import java.util.HashMap;
public class MacroGrammarLexer extends DSL_GrammarLexer {
private HashMap<String, String> map;
public MacroGrammarLexer(MacroInputStream input, HashMap<String, String> map) {
super(input);
this.map = map;
}
private MacroInputStream getInput() {
return (MacroInputStream) _input;
}
@Override
public Token nextToken() {
Token t = super.nextToken();
if (t.getType() == VARIABLE) {
System.out.println("Encountered token " + t.getText()+" ===> rewriting!!!");
getInput().rewrite(t.getStartIndex(), t.getStopIndex(),
map.get(t.getText().substring(1)));
getInput().seek(t.getStartIndex()); // reset input stream to previous
return super.nextToken();
}
return t;
}
}
Lastly, I modified the generated parser to set the variables at the time of parsing:
DSL_GrammarParser.java
...
...
HashMap<String, String> map; // same map as before, passed as a new argument.
...
...
public final SetContext set() throws RecognitionException {
SetContext _localctx = new SetContext(_ctx, getState());
enterRule(_localctx, 130, RULE_set);
try {
enterOuterAlt(_localctx, 1);
{
String vname = null; String vval = null; // set up variables
setState(1215); match(SET);
setState(1216); vname = variable_name().getText(); // set vname
setState(1217); match(EQUALS);
setState(1218); vval = string_constant().getText(); // set vval
System.out.println("Found SET " + vname +" = " + vval+";");
map.put(vname, vval);
}
}
catch (RecognitionException re) {
_localctx.exception = re;
_errHandler.reportError(this, re);
_errHandler.recover(this, re);
}
finally {
exitRule();
}
return _localctx;
}
...
...
Unfortunately this method is final, so this makes maintenance a bit more difficult, but it works for now.
The standard pattern for handling your requirements is to implement a symbol table. The simplest form is a key:value store. In your visitor, add variable declarations as they are encountered, and read out the values as variable references are encountered.
As described, your DSL does not define a scoping requirement on the variables declared. If you do require scoped variables, then use a stack of key:value stores, pushing and popping on scope entry and exit.
See this related StackOverflow answer.
Separately, since your strings may contain commands, you can simply parse the contents as part of your initial parse. That is, expand your grammar with a rule that includes the full set of valid contents:
set_variable
: 'SET' ID '=' stringLiteral ';'
;
stringLiteral:
Quote Quote? (
( set_variable
| print_line
| VARIABLE
| ID
)
| STRING_CONSTANT // redefine without the quotes
)
Quote
;
I'm trying to process a config file in flex. It looks like this:
[Tooldir]
BcLib=C:\APPS\BC\LIB
BcInclude=C:\APPS\BC\INCLUDE
[IDE]
DefaultDesktopDir=C:\APPS\BC\BIN
HelpDir=C:\APPS\BC\BIN
[Startup]
State=0
Left=21
Right=21
Width=946
Height=663
[Project]
Lastproj=c:\apps\bc\bin\proj0002.ide
So that the output would look like this:
[Tooldir]
[IDE]
[Startup]
[Project]
I'm currently trying with states, but I just don't seem to understand how they work.
%{
#include <stdio.h>
int yywrap(void);
int yylex(void);
%}
%s section
%%
/* == rules == */
<INITIAL>"[" BEGIN section;
<section>. printf("%s",yytext);
<section>"]\n" BEGIN INITIAL;
%%
int yywrap(void) { return 1; }
int main() { return yylex(); }
The code above is printing everything except the "[" and "]"... Some help, please?
EDIT:
Working code
%{
#include <stdio.h>
int yywrap(void);
int yylex(void);
%}
%s section
%%
/* == rules == */
<INITIAL>"[" BEGIN section; printf("[");
<section>. printf("%s",yytext);
<section>"]\n" BEGIN INITIAL; printf("]\n");
.|\n {;}
%%
int yywrap(void) { return 1; }
int main() { return yylex(); }
By default, anything that doesn't match any of the flex rules is printed. So your rules match the [whatever] lines and print whatever (removing the [ and ]), while the default rule matches everything else (printing it).
Add a rule like:
.|\n { /* ignoring all other unmatched text */ }
to the end of your rules if you want to ignore everything else, rather than printing it.
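As an aside, here is a minimal sketch of the same scanner written with an exclusive start condition (%x instead of %s), so that the catch-all rule cannot fire while inside a section header (a sketch only, assuming section names never span lines):
%{
#include <stdio.h>
int yywrap(void);
int yylex(void);
%}
%x section
%%
"["          { BEGIN section; printf("["); }
<section>"]" { BEGIN INITIAL; printf("]\n"); }
<section>.   { printf("%s", yytext); }
.|\n         { /* ignore everything outside the [...] headers */ }
%%
int yywrap(void) { return 1; }
int main(void) { return yylex(); }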