Custom location tracking in jison-gho - flex-lexer

I need to parse a "token-level" language, i.e. the input is already tokenized with a semicolon as a delimiter. Sample input: A;B;A;D0;ASSIGN;X;. Here's also my grammar file.
I'd like to track location columns per-token. For the previous example, here's how I'd like to have columns defined:
Input: A;B;A;D0;ASSIGN;X;\n
Col:   1122334445555555666
So basically I'd like to increment the column every time a semicolon is hit. I wrote a function that increments a column count whenever a semicolon is hit, and in every action I set the column in yylloc to that custom count. However, with this approach I have to copy-paste a function call into every action. Is there a cleaner way? Also, there will be no lexical errors in the input, since it's autogenerated.
Edit: Nevermind, my solution actually doesn't work. So I'll be happy for any suggestions :)
%lex
%{
    var delimit = (terminal) => { this.begin('delimit'); return terminal }
    var columnInc = () => {
        if (yy.lastLine === undefined) yy.lastLine = -1
        if (yylloc.first_line !== yy.lastLine) {
            yy.lastLine = yylloc.first_line
            yy.columnCount = 0
        }
        yy.columnCount++
    }
    var setColumn = () => {
        yylloc.first_column = yylloc.last_column = yy.columnCount
    }
%}
%x delimit
%%
"ASSIGN"                { return delimit('ASSIGN'); setColumn() }
"A"                     { return delimit('A'); setColumn() }
<delimit>{DELIMITER}    { columnInc(); this.popState(); setColumn() }
\n                      { setColumn() }
...

There are a few ways to accomplish this in jison-gho. As you're looking to implement a token counter which is tracked by the parser, this invariably means we need to find a way to 'hook' into the code path where the lexer passes tokens to the parser.
Before we go look at a few implementations, a few thoughts that may help others who are facing similar, yet slightly different problems:
completely custom lexer for prepared token streams: as your input is a set of tokens already, one might consider using a custom lexer which would then just take the input stream as-is and do as little as possible while passing the tokens to the parser. This is doable in jison-gho and a fairly minimal example of such is demonstrated here:
https://github.com/GerHobbelt/jison/blob/0.6.1-215/examples/documentation--custom-lexer-ULcase.jison
while another way to integrate that same custom lexer is demonstrated here:
https://github.com/GerHobbelt/jison/blob/0.6.1-215/examples/documentation--custom-lexer-ULcase-alt.jison
or you might want to include the code from an external file via a %include "documentation--custom-lexer-ULcase.js" statement. Anyway, I digress.
Given your problem, a lot depends on where that token stream comes from: who turns it into text, and is that outside your control? There's a sizeable overhead in generating and then parsing a very long stream of text, whereas a custom lexer plus some direct binary communication might reduce network or other costs there.
The bottom line is: if the token generator and everything up to this parse point is under your control, I personally would favor a custom lexer and no text conversion whatsoever for the intermediary channel. But in the end, that depends largely on your system requirements, context, etc. and is way outside the realm of this SO coding question.
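To give you a feel for it, here is a rough, hand-waving sketch of what such a minimal custom lexer for your pre-tokenized input could look like. Treat the member names (setInput, lex, yytext, yylloc) and the 'EOF' convention as my assumptions; the linked ULcase examples show the exact interface the generated parser expects:

// rough sketch only: a hand-rolled lexer for an already-tokenized "A;B;A;..." stream.
// The grammar is assumed to use the token names (A, B, ASSIGN, ...) as its terminal symbols.
var custom_lexer = {
    setInput: function (input, yy) {
        this.yy = yy || {};
        // split the pre-tokenized input on the ';' delimiter, dropping empty tail entries:
        this.tokens = String(input).split(';').map(function (s) {
            return s.trim();
        }).filter(function (s) {
            return s.length > 0;
        });
        this.pos = 0;
        this.yytext = '';
        this.yylloc = { first_line: 1, last_line: 1, first_column: 0, last_column: 0 };
        return this;
    },
    lex: function () {
        if (this.pos >= this.tokens.length) {
            return 'EOF';           // assumed end-of-input convention; verify against the examples
        }
        this.yytext = this.tokens[this.pos++];
        // use the token index as the 'column': exactly the numbering the question asks for
        this.yylloc.first_column = this.yylloc.last_column = this.pos;
        return this.yytext;         // the token name doubles as the parser symbol here
    }
};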
augmenting the jison lexer: of course another approach could be to augment all (or a limited set of) lexer rules' action code, where you modify yytext to pass this data to the parser. This is the classic approach from the days of yacc/bison. Indeed, yytext doesn't have to be a string, but can be anything, e.g.
[a-z]   %{
            yytext = new DataInstance(
                yytext,     // the token string
                yylloc,     // the token location info
                ...         // whatever you want/need...
            );
            return 'ID';    // the lexer token ID for this token
        %}
For this problem, this is a lot of code duplication and thus a maintenance horror.
hooking into the flow between parser and lexer: this is new and is facilitated by jison-gho via the pre_lex and post_lex callbacks. (The same mechanism is available around the parse() call, so that you can initialize and postprocess a parser run in any way you want: pre_parse and post_parse are for that.)
Here, since we want to count tokens, the simplest approach would be using the post_lex hook, which is only invoked when the lexer has completely parsed yet another token and passes it to the parser. In other words: post_lex is executed at the very end of the lex() call in the parser.
The documentation for these is included at the top of every generated parser/lexer JS source file, but then, of course, you need to know about that little nugget! ;-)
Here it is:
parser.lexer.options:

    pre_lex: function()
        optional: is invoked before the lexer is invoked to produce another token.
        this refers to the Lexer object.

    post_lex: function(token) { return token; }
        optional: is invoked when the lexer has produced a token token;
        this function can override the returned token value by returning another.
        When it does not return any (truthy) value, the lexer will return
        the original token.
        this refers to the Lexer object.
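To make those callback semantics concrete, here is a tiny illustration that merely logs what passes by; it uses the parser.lexer.options path quoted above and is not part of the counter solutions below:

// illustration only: trace every token the lexer hands to the parser
parser.lexer.options.pre_lex = function () {
    console.log('about to scan the next token...');
};
parser.lexer.options.post_lex = function (token) {
    console.log('lexer produced token:', token, 'with text:', this.yytext);
    return token;       // returning the token unchanged keeps the default behaviour
};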
Do note that options 1 and 3 are not available in vanilla jison, with one remark about option 1: jison does not accept a custom lexer as part of the jison parser/lexer spec file as demonstrated in the example links above. Of course, you can always go around and wrap the generated parser and thus inject a custom lexer and do other things.
Implementing the token counter using post_lex
Now how does it look in actual practice?
Solution 1: Let's do it nicely
We are going to 'abuse'/use (depending on your POV about riding on undocumented features) the yylloc info object and augment it with a counter member. We choose to do this so that we never risk interfering (or getting interference from) the default text/line-oriented yylloc position tracking system in the lexer and parser.
The undocumented bit here is the knowledge that all data members of any yylloc instance will be propagated by the default jison-gho location tracking&merging logic in the parser, hence when you tweak an yylloc instance in the lexer or parser action code, and if that yylloc instance is propagated to the output via merge or copy as the parser walks up the grammar tree, then your tweaks will be visible in the output.
Hooking into the lexer token output means we'll have to augment the lexer first, which we can easily do in the %% section before the /lex end-of-lexer-spec-marker:
// lexer extra code
var token_counter = 0;

lexer.post_lex = function (token) {
    // count the token that was just produced and tag its location info with that number:
    ++token_counter;
    this.yylloc.counter = token_counter;
    return token;
};

// extra helper so multiple parse() calls will restart counting tokens:
lexer.reset_token_counter = function () {
    token_counter = 0;
};
where the magic bit is this statement: this.yylloc.counter = token_counter.
We hook a post_lex callback into the flow by directly injecting it into the lexer definition via lexer.post_lex = function (){...}.
We could also have done this via the lexer options: lexer.options.post_lex = function ...
or via the parser-global yy instance: parser.yy.post_lex = function ... though those approaches would have meant we'd be doing this in the parser definition code chunk or from the runtime which invokes the parser. Those two slightly different approaches will not be demonstrated here.
Now all we have to do is complete this with a tiny bit of pre_parse code to ensure multiple parser.parse(input) invocations each will restart with the token counter reset to zero:
// extra helper: reset the token counter at the start of every parse() call:
parser.pre_parse = function (yy) {
    yy.lexer.reset_token_counter();
};
Of course, that bit has to be added to the parser's final code block, after the %% in the grammar spec part of the jison file.
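To recap where each snippet lives, the overall layout of the .jison file looks roughly like this (structure only; the ... parts stand for your actual rules):

%lex
/* lexer option/macro declarations */
%%
/* lexer rules: "ASSIGN", "A", <delimit>{DELIMITER}, ... */
%%
/* lexer extra module code:
   lexer.post_lex and lexer.reset_token_counter go here, just before the /lex marker */
/lex

/* parser declarations: %token, %start, ... */
%%
/* grammar rules */
%%
/* parser extra module code: parser.pre_parse goes here */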
Full jison source file is available as a gist here.
How to compile and test:
# compile
jison --main so-q-58891186-2.jison
# run test code in main()
node so-q-58891186-2.js
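Instead of relying on the generated main(), you can also use the generated module as a library. A small sketch, assuming the generated file exports a parse() function as jison-generated modules normally do:

// usage sketch: run the parser programmatically and inspect the augmented yylloc data
var parser = require('./so-q-58891186-2.js');

var ast = parser.parse('A;B;A;D0;ASSIGN;X;\n');

// every 'loc' object in the AST now carries the extra 'counter' member:
console.log(JSON.stringify(ast, null, 2));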
Notes: I have 'faked' the AST construction code in your original source file so that one can easily diff the initial file with the one provided here. All that hack-it-to-make-it-work stuff is at the bottom part of the file.
Solution 2: Be a little nasty and re-use the yylloc.column location info and tracking
Rather than the line info part of yylloc, I chose to use the column part, as to me that's about the same granularity level as a token sequence index. It doesn't matter which one you use, line or column, as long as you follow the same pattern.
When we do this right, we get the location tracking features of jison-gho for free: column and line ranges for a grammar rule are automatically calculated from the individual token yylloc info, in such a way that the first/last members of yylloc will show the first and last column (pardon, token index) of the token sequence matched by the given grammar rule. This is the classic,merge default behaviour of jison-gho, as described for the --default-action CLI option:
--default-action
    Specify the kind of default action that jison should include for every parser rule.
    You can specify a mode for value handling ($$) and one for location
    tracking (#$), separated by a comma, e.g.:

        --default-action=ast,none

    Supported value modes:

        classic : generate a parser which includes the default
                      $$ = $1;
                  action for every rule.
        ast     : generate a parser which produces a simple AST-like
                  tree-of-arrays structure: every rule produces an array of
                  its production terms' values. Otherwise it is identical to
                  classic mode.
        none    : JISON will produce a slightly faster parser but then you are
                  solely responsible for propagating rule action $$ results.
                  The default rule value is still deterministic though as it
                  is set to undefined: $$ = undefined;
        skip    : same as none mode, except JISON does NOT INJECT a default
                  value action ANYWHERE, hence rule results are not
                  deterministic when you do not properly manage the $$ value
                  yourself!

    Supported location modes:

        merge   : generate a parser which includes the default
                      #$ = merged(#1..#n);
                  location tracking action for every rule, i.e. the rule's
                  production 'location' is the range spanning its terms.
        classic : same as merge mode.
        ast     : ditto.
        none    : JISON will produce a slightly faster parser but then you are
                  solely responsible for propagating rule action #$ location
                  results. The default rule location is still deterministic
                  though, as it is set to undefined: #$ = undefined;
        skip    : same as "none" mode, except JISON does NOT INJECT a default
                  location action ANYWHERE, hence rule location results are
                  not deterministic when you do not properly manage the #$
                  value yourself!

    Notes:

    - when you do specify a value default mode, but DO NOT specify a location
      value mode, the latter is assumed to be the same as the former. Hence:

          --default-action=ast

      equals:

          --default-action=ast,ast

    - when you do not specify an explicit default mode or only a "true"/"1"
      value, the default is assumed: classic,merge.

    - when you specify "false"/"0" as an explicit default mode, none,none is
      assumed. This produces the fastest deterministic parser.

    Default setting: [classic,merge]
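Since Solution 2 leans on that default classic,merge behaviour, you may also spell it out explicitly when compiling; this is merely the documented option applied to the Solution 2 grammar file:

# identical to the default behaviour, but stated explicitly:
jison --main --default-action=classic,merge so-q-58891186-3.jison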
Now that we are going to 're-use' the first_column and last_column members of yylloc instead of adding a new counter member, the magic bits that do the work remain nearly the same as in Solution 1:
augmenting the lexer in its %% section:
// lexer extra code
var token_counter = 0;

lexer.post_lex = function (token) {
    ++token_counter;
    this.yylloc.first_column = token_counter;
    this.yylloc.last_column = token_counter;
    return token;
};

// extra helper so multiple parse() calls will restart counting tokens:
lexer.reset_token_counter = function () {
    token_counter = 0;
};
Side Note: we 'abuse' the column part for tracking the token number; meanwhile the range member will still be usable to debug the raw text input as that one will track the positions within the raw input string.
Make sure to tweak both first_column and last_column so that the default location tracking 'merging' code in the generated parser can still do its job: that way we'll get to see which range of tokens constitutes a particular grammar rule/element, just as if they were text columns.
Could've done the same with first_line/last_line, but I felt it more suitable to use the column part for this as it's at the same very low granularity level as 'token index'...
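For completeness, that hypothetical line-based variant would only differ in which yylloc members get overwritten:

// hypothetical variant: track the token index in the line members instead of the columns
lexer.post_lex = function (token) {
    ++token_counter;
    this.yylloc.first_line = token_counter;
    this.yylloc.last_line = token_counter;
    return token;
};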
We hook a post_lex callback into the flow by directly injecting it into the lexer definition via lexer.post_lex = function (){...}.
Same as Solution 1, now all we have to do is complete this with a tiny bit of pre_parse code to ensure multiple parser.parse(input) invocations each will restart with the token counter reset to zero:
// extra helper: reset the token counter at the start of every parse() call:
parser.pre_parse = function (yy) {
    yy.lexer.reset_token_counter();
};
Of course, that bit has to be added to the parser's final code block, after the %% in the grammar spec part of the jison file.
Full jison source file is available as a gist here.
How to compile and test:
# compile
jison --main so-q-58891186-3.jison
# run test code in main()
node so-q-58891186-3.js
Aftermath / Observations about the solutions provided
Observe the test verification data at the end of both those jison files provided for how the token index shows up in the parser output:
Solution 1 (stripped, partial) output:
"type": "ProgramStmt",
"a1": [
{
"type": "ExprStmt",
"a1": {
"type": "AssignmentValueExpr",
"target": {
"type": "VariableRefExpr",
"a1": "ABA0",
"loc": {
"range": [
0,
8
],
"counter": 1
}
},
"source": {
"type": "VariableRefExpr",
"a1": "X",
"loc": {
"counter": 6
}
},
"loc": {
"counter": 1
}
},
"loc": {
"counter": 1
}
}
],
"loc": {
"counter": 1
}
Note here that the counter index is not really accurate for compound elements, i.e. elements which were constructed from multiple tokens matching one or more grammar rules: only the first token index is kept.
Solution 2 fares much better in that regard:
Solution 2 (stripped, partial) output:
"type": "ExprStmt",
"a1": {
"type": "AssignmentValueExpr",
"target": {
"type": "VariableRefExpr",
"a1": "ABA0",
"loc": {
"first_column": 1,
"last_column": 4,
}
},
"source": {
"type": "VariableRefExpr",
"a1": "X",
"loc": {
"first_column": 6,
"last_column": 6,
}
},
"loc": {
"first_column": 1,
"last_column": 6,
}
},
"loc": {
"first_column": 1,
"last_column": 7,
}
}
As you can see the first_column plus last_column members nicely track the set of tokens which constitute each part.
(Note that the counter increment code means we start counting at ONE (1), not ZERO (0)!)
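If you would rather count from zero, a trivial reshuffle of the same post_lex hook does it (a sketch; everything else stays exactly as in Solution 2):

// zero-based variant: assign the current count first, increment afterwards
lexer.post_lex = function (token) {
    this.yylloc.first_column = token_counter;
    this.yylloc.last_column = token_counter;
    ++token_counter;
    return token;
};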
Parting thought
Given the input A;B;A;D0;ASSIGN;X;SEMICOLON; the current grammar parses this like ABA0 = X; and I wonder if this is what you really intend to get: constructing the identifier ABA0 like that seems a little odd to me.
Alas, that's not relevant to your question. It's just me encountering something quite out of the ordinary here, that's all. No matter.
Cheers and hope this long blurb is helpful to more of us. :-)
Source files:
original OP file as gist
solution 1 JISON file
solution 2 JISON file
current jison-gho release example grammars, including several which demo advanced features

Related

Preserving whitespace in Rascal when transforming Java code

I am trying to add instrumentation (e.g. logging some information) to methods in a Java file. I am using the following Rascal code which seems to work mostly:
import ParseTree;
import lang::java::\syntax::Java15;
// .. more imports

// project is a loc
M3 model = createM3FromEclipseProject(project);
set[loc] projectFiles = { file | file <- files(model) };

for (pFile <- projectFiles) {
    CompilationUnit cunit = parse(#CompilationUnit, pFile);
    cUnitNew = visit(cunit) {
        case (MethodBody) `{<BlockStm* post>}`
            => (MethodBody) `{
               'System.out.println(new Throwable().getStackTrace()[0]);
               '<BlockStm* post>
               '}`
    }
    writeFile(pFile, cUnitNew);
}
I am running into two issues regarding whitespace, which might be unrelated.
The line of code that I am inserting does not preserve whitespace that was there previously. If there was a tab character, it will now be removed. The same is true for the line directly following the line I am inserting and the closing brace. How can I 'capture' whitespace in my pattern?
Example before transforming (all lines start with a tab character, line 2 and 3 with two):
void beforeFirst() throws Exception {
rowIdx = -1;
rowSource.beforeFirst();
}
Example after transforming:
void beforeFirst() throws Exception {
System.out.println(new Throwable().getStackTrace()[0]);
rowIdx = -1;
rowSource.beforeFirst();
}
An additional issue regarding whitespace; if a file ends on a newline character, the parse function will throw a ParseError without further details. Removing this newline from the original source will fix the issue, but I'd rather not 'manually' have to fix code before parsing. How can I circumvent this issue?
Alas, capturing whitespace with a concrete pattern is not a feature of the current version of Rascal. We used to have it, but now it's back on the TODO list. I can point you to papers about the topic if you are interested. So for now you have to deal with this "damage" later.
You could write a Tree to Tree transformation on the generic level (see ParseTree.rsc), to fix indentation issues in a parse tree after your transformation, or to re-insert the comments that you lost. This is about matching the Tree data-type and appl constructors. The Tree format is a form of reflection on the parse trees of Rascal that allow any kind of transformation, including whitespace and comments.
The parse error you talked about is caused by not using the start non-terminal. If you use parse(#start[CompilationUnit], ...) then whitespace and comments before and after the CompilationUnit are accepted.

writing a function to do type cast

I'm trying to write a function that does type casting, which seems to be a frequently occurring activity in Rascal code. But I can't seem to get it right. The following and several variations on it fail.
public &T cast(type[&T] tp, value v) throws str {
    if (tp tv := v)
        return tv;
    else
        throw "cast failed";
}
Can someone help me out?
Some more info: I frequently use pattern matching against a pattern of the form "Type Var" (i.e. against a variable declaration) in order to tell Rascal that an expression has a certain type, e.g.
map[str,value] m := myexp
This is usually in cases where I know that myexp has type map[str,value], but omitting the matching would make Rascal's type checking mechanism complain.
In order to be a bit more defensive against mistakes, I usually wrap the matching construct in an if-then-else where an exception is raised if the match fails:
if (map[str,value] m := myexp) {
    // use m
} else {
    throw "cast failed";
}
I would like to shorten all such similar pieces of code using a single function that does the job generically, so that I can write instead
cast(#map[str,value], myexp)
PS. Also see How to cast a value type to Map in Rascal?
It seems that the best way to write this, if you truly need to do this, is the following:
public map[str,value] cast(map[str,value] v) = v;
public default map[str,value] cast(value v) { throw "cast failed!"; }
Then you could just say
m = cast(myexp);
and it would do what you want to do -- the actual pattern matching is moved into the function signature for cast, with a case specific to the type you are wanting to use and a case that handles everything that doesn't otherwise match.
However, I'm still not sure why you are using type value, either here (inside the map) or in the linked question. The "standard" Rascal way of handling cases where you could have one of multiple choices is to define these with a user-defined data type and constructors. You could then use pattern matching to match the constructors, or use the is and has keywords to interrogate a value to check to see if it was created using a specific constructor or if it has a specific field, respectively. The rule for fields is that all occurrences of a field in the constructor definitions for a given ADT have the same type. So, it may help to know more about your usage scenario to see if this definition of cast is the best option or if there is a better solution to your problem.
EDITED
If you are reading JSON, an alternate way to do it is to use the JSON grammar and AST that also live in that part of the library (I think the one you are using is more of a stream reader, like our current text readers and writers, but I would need to look at the code more to be sure). You can then do something like this (long output included to give an idea of the results):
rascal>import lang::json::\syntax::JSON;
ok
rascal>import lang::json::ast::JSON;
ok
rascal>import lang::json::ast::Implode;
ok
rascal>js = buildAST(parse(#JSONText, |project://rascal/src/org/rascalmpl/library/lang/json/examples/twitter01.json|));
Value: object((
"since_id":integer(0),
"refresh_url":string("?since_id=202744362520678400&q=amsterdam&lang=en"),
"page":integer(1),
"since_id_str":string("0"),
"completed_in":float(0.058),
"results_per_page":integer(25),
"next_page":string("?page=2&max_id=202744362520678400&q=amsterdam&lang=en&rpp=25"),
"max_id_str":string("202744362520678400"),
"query":string("amsterdam"),
"max_id":integer(202744362520678400),
"results":array([
object((
"from_user":string("adekamel"),
"profile_image_url_https":string("https:\\/\\/si0.twimg.com\\/profile_images\\/2206104506\\/339515338_normal.jpg"),
"in_reply_to_status_id_str":string("202730522013728768"),
"to_user_id":integer(215350297),
"from_user_id_str":string("366868475"),
"geo":null(),
"in_reply_to_status_id":integer(202730522013728768),
"profile_image_url":string("http:\\/\\/a0.twimg.com\\/profile_images\\/2206104506\\/339515338_normal.jpg"),
"to_user_id_str":string("215350297"),
"from_user_name":string("nurul amalya \\u1d54\\u1d25\\u1d54"),
"created_at":string("Wed, 16 May 2012 12:56:37 +0000"),
"id_str":string("202744362520678400"),
"text":string("#Donnalita122 #NaishahS #fatihahmS #oishiihotchoc #yummy_DDG #zaimar93 #syedames I\'m here at Amsterdam :O"),
"to_user":string("Donnalita122"),
"metadata":object(("result_type":string("recent"))),
"iso_language_code":string("en"),
"from_user_id":integer(366868475),
"source":string("<a href="http:\\/\\/blackberry.com\\/twitter" rel="nofollow">Twitter for BlackBerry\\u00ae<\\/a>"),
"id":integer(202744362520678400),
"to_user_name":string("Rahmadini Hairuddin")
)),
object((
"from_user":string("kelashby"),
"profile_image_url_https":string("https:\\/\\/si0.twimg.com\\/profile_images\\/1861086809\\/me_beach_normal.JPG"),
"to_user_id":integer(0),
"from_user_id_str":string("291446599"),
"geo":null(),
"profile_image_url":string("http:\\/\\/a0.twimg.com\\/profile_images\\/1861086809\\/me_beach_normal.JPG"),
"to_user_id_str":string("0"),
"from_user_name":string("Kelly Ashby"),
"created_at":string("Wed, 16 May 2012 12:56:25 +0000"),
"id_str":string("202744310872018945"),
"text":string("45 days til freedom! Cannot wait! After Paris: London, maybe Amsterdam, then southern France, then CANADA!!!!"),
"to_user":null(),
"metadata":object(("result_type":string("recent"))),
"iso_language_code":string("en"),
"from_user_id":integer(291446599),
"source":string("<a href="http:\\/\\/mobile.twitter.com" rel="nofollow">Mobile Web<\\/a>"),
"id":integer(202744310872018945),
"to_user_name":null()
)),
object((
"from_user":string("johantolsma"),
"profile_image_url_https":string("https:\\/\\/si0.twimg.com\\/profile_images\\/1961917557\\/image_normal.jpg"),
"to_user_id":integer(0),
"from_user_id_str":string("23632499"),
"geo":null(),
"profile_image_url":string("http:\\/\\/a0.twimg.com\\/profile_images\\/1961917557\\/image_normal.jpg"),
"to_user_id_str":string("0"),
"from_user_name":string("Johan Tolsma"),
"created_at":string("Wed, 16 May 2012 12:56:16 +0000"),
"id_str":string("202744274050236416"),
"text":string("RT #agerolemou: Office space for freelancers in Amsterdam http:\\/\\/t.co\\/6VfHuLeK"),
"to_user":null(),
"metadata":object(("result_type":string("recent"))),
"iso_language_code":string("en"),
"from_user_id":integer(23632499),
"source":string("<a href="http:\\/\\/itunes.apple.com\\/us\\/app\\/twitter\\/id409789998?mt=12" rel="nofollow">Twitter for Mac<\\/a>"),
"id":integer(202744274050236416),
"to_user_name":null()
)),
object((
"from_user":string("hellosophieg"),
"profile_image_url_https":string("https:\\/\\/si0.twimg.com\\/profile_images\\/2213055219\\/image_normal.jpg"),
"to_user_id":integer(0),
"from_user_id_str":string("41153106"),
"geo":null(),
"profile_image_url":string("http:\\/\\/a0.twimg.com\\/profile_images\\/2213055219\\/image_normal.jp...
rascal>js is object;
bool: true
rascal>js.members<0>;
set[str]: {"since_id","refresh_url","page","since_id_str","completed_in","results_per_page","next_page","max_id_str","query","max_id","results"}
rascal>js.members["results_per_page"];
Value: integer(25)
You can then use pattern matching, over the types defined in lang::json::ast::JSON, to extract the information you need.
The code has a bug. This is the fixed code:
public &T cast(type[&T] tp, value v) throws str {
    if (&T tv := v)
        return tv;
    else
        throw "cast failed";
}
Note that we do not wish to include this in the standard library. Rather, let's collect cases where we need it and find out how to fix it in another way.
If you find you need this casting often, then you might be avoiding the better parts of Rascal, such as pattern based dispatch. See also the answer by Mark Hills.

How to understand the "isCommitted" property of ParserResult?

I'm reading the source of polux's great parsers, and found there is a special isCommitted property which I can't understand:
class ParseResult<A> {
  final bool isSuccess;
  final bool isCommitted;
  /// [:null:] if [:!isSuccess:]
  final A value;
  final String text;
  final Position position;
  final Expectations expectations;
  // ...
}
You can see there is already an isSuccess flag to indicate whether the parse result is successful or not, so why do we need an isCommitted? I tried to read the related code, but still don't understand.
If you want to see the source, you can find it here.
The short answer is: don't worry about isCommitted, it's for internal purposes only.
The long answer is: you can call committed on a parser, which means that once it has succeeded, you know for sure that it's pointless to backtrack (very much like Prolog's cut). For instance, consider a grammar like this:
expr() => str('(') + rec(expr) + str(')') ^ ...
        | num()
Assume we parse the string "(...". Once we have recognized the parenthesis, we know for sure that if ... turns out not to be an expr, there is no need to rewind to the start of the string and try to parse a num, since a num will never start with a parenthesis anyway. We can fail early. This is done by marking ( as being a "commit point":
expr() => str('(').committed + rec(expr) + str(')') ^ ...
        | num()
This is an optimisation which should be used with great care because it breaks the modularity of parsers with respect to |. I personally never had to use it so far.
Whenever you call committed on a parser, it returns a new parser whose isCommitted property is true. It is then used by | to decide whether to backtrack or not. This is what isCommitted is used for. As an end user you should never have to care; I should probably make it private.
This feature is inspired by Polyparse's commit.

Make a table containing tokens visible for both .mly and .mll by menhir

I would like to define a keyword_table which maps some strings to some tokens, and I would like to make this table visible for both parser.mly and lexer.mll.
It seems that the table has to be defined in parser.mly,
%{
    open Utility (* where hash_table is defined to make a table from a list *)

    let keyword_table = hash_table [
        "Call", CALL; "Case", CASE; "Close", CLOSE; "Const", CONST;
        "Declare", DECLARE; "DefBool", DEFBOOL; "DefByte", DEFBYTE ]
%}
However, I could NOT use it in lexer.mll, for instance
{
    open Parser
    let x = keyword_table         (* doesn't work *)
    let x = Parser.keyword_table  (* doesn't work *)
    let x = Parsing.keyword_table (* doesn't work *)
}
As this comment suggests, menhir has a solution for this, could anyone tell me any details?
The first option is to define tokens in a separate .mly file. Executing menhir for this file with --only-tokens option will generate a module containing type token that you can use in your parser compiled with --external-tokens option.
If this solves the problem with tokens, you can specify all other functions that are used by both parser and lexer in a separate file as Thomash suggested.
There is an alternative solution as well. You can use a %parameter<module signature> declaration in the parser to parametrize the entire parser over the type and function annotations specified inside the given signature. The main advantage is that this signature is provided in the interface file for the parser, so the parser can share it with other modules (which can construct modules based on the signature).
I suggest referring to the menhir examples, namely calc-two to learn about external tokens and calc-param to learn how to create parametrized parsers.
I usually put the keyword_table in lexer.mll and I see no reason to put it in parser.mly.
If you need to access it from both lexer.mll and parser.mly (but why do you want to access it from parser.mly?), the easiest solution is to put it in a third file keyword.ml and use Keyword.keyword_table (or open Keyword and keyword_table).

Pushing/Poping Lexical States in JavaCC

I'm trying to refactor a JavaCC DSL parser that is written using mostly only one lexical state.
My goal is to introduce a new keyword that is context sensitive to not invalidate older configurations using the older DSL.
The idea was to change lexical state and introduce the new keyword so it is only valid during a very specific context. Making backward compatible with earlier releases.
Problem: Comments already change lexical state, to change back to DEFAULT after end of comment. Changing back to DEFAULT is "hardcoded", but now I would need Comments to instead change back to last active lexical state.
From what I understand, keeping states in a stack and pushing and popping them would help me achieve this (I think this is called a DPDA). Is this possible in JavaCC?
[Edit: after searching some more, this is how far I got:]
TOKEN_MGR_DECLS : {
    Stack lexicalStateStack = new Stack();
}

SKIP : {
    " "
  | "\t"
  | "\n"
  | "\r"
}

MORE :
{
    "/*" { lexicalStateStack.push(curLexState); } : IN_COMMENT
}

SPECIAL_TOKEN :
{
    <IN_COMMENT : "*/" > { SwitchTo((int)lexicalStateStack.pop()); }
}
Seems to be doing what I want, but is it correct? (Thinking of lookaheads here)
Absolutely. You can stack the lexical states. But, do the transitions from the token manager, not the parser.
See questions 3.17 How do I tokenize nested comments? and 3.12 Can the parser force a switch to a new lexical state? in the FAQ http://www.engr.mun.ca/~theo/JavaCC-FAQ/ .
