How to implement a parser for a grammar in Java - parsing

I have defined a grammar and now I'm implementing a parser for it.
The program should start with the keyword main followed by an opening curly bracket, followed in turn by a (possibly empty) sequence of statements, and terminated by a closing curly bracket. My question is how to define the program in the parser. I have tried several different ways, including this, but it doesn't seem to be correct when I test.
public void program() {
    // Program -> MAIN LCBR Statement* RCBR
    eat("MAIN");
    eat("LCBR");
    while (lex.token().type != "RCBR") {
        statement();
    }
}
Any suggestions would be appreciated!
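Two things in the snippet above look suspicious, and the sketch below shows one possible fix rather than a definitive answer. In Java, `!=` on strings compares object references, so the loop condition can be true even when the token text is "RCBR". The method also never consumes the closing bracket. The sketch assumes a hypothetical token list standing in for `lex` and a placeholder `statement()` that consumes a single `STMT` token; only `eat` and `program` mirror names from the question.

```java
import java.util.List;

public class MiniParser {
    private final List<String> tokens; // hypothetical stand-in for the lexer
    private int pos = 0;

    public MiniParser(List<String> tokens) {
        this.tokens = tokens;
    }

    // Type of the current token, or "EOF" once the input is exhausted.
    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "EOF";
    }

    // Consume the current token if it has the expected type, otherwise fail.
    private void eat(String expected) {
        if (!expected.equals(peek())) {
            throw new IllegalStateException(
                "expected " + expected + " but found " + peek() + " at token " + pos);
        }
        pos++;
    }

    // Placeholder: a real parser would dispatch on the statement kind here.
    private void statement() {
        eat("STMT");
    }

    // Program -> MAIN LCBR Statement* RCBR
    public void program() {
        eat("MAIN");
        eat("LCBR");
        // Compare with equals(), and stop at EOF so a missing bracket is reported below.
        while (!peek().equals("RCBR") && !peek().equals("EOF")) {
            statement();
        }
        eat("RCBR"); // the closing bracket has to be consumed as well
    }

    public boolean atEnd() {
        return pos == tokens.size();
    }

    public static void main(String[] args) {
        MiniParser ok = new MiniParser(List.of("MAIN", "LCBR", "STMT", "STMT", "RCBR"));
        ok.program();
        System.out.println("all input consumed: " + ok.atEnd());
    }
}
```

Checking `atEnd()` after `program()` is a cheap way to detect trailing tokens that the grammar does not allow.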

Preserving whitespace in Rascal when transforming Java code

I am trying to add instrumentation (e.g. logging some information) to methods in a Java file. I am using the following Rascal code which seems to work mostly:
import ParseTree;
import lang::java::\syntax::Java15;
// .. more imports
// project is a loc
M3 model = createM3FromEclipseProject(project);
set[loc] projectFiles = { file | file <- files(model)} ;
for (pFile <- projectFiles) {
    CompilationUnit cunit = parse(#CompilationUnit, pFile);
    cUnitNew = visit(cunit) {
        case (MethodBody) `{<BlockStm* post>}`
            => (MethodBody) `{
            'System.out.println(new Throwable().getStackTrace()[0]);
            '<BlockStm* post>
            '}`
    }
    writeFile(pFile, cUnitNew);
}
I am running into two issues regarding whitespace, which might be unrelated.
The line of code that I am inserting does not preserve whitespace that was there previously. If there was a tab character, it will now be removed. The same is true for the line directly following the line I am inserting and the closing brace. How can I 'capture' whitespace in my pattern?
Example before transforming (all lines start with a tab character, line 2 and 3 with two):
	void beforeFirst() throws Exception {
		rowIdx = -1;
		rowSource.beforeFirst();
	}
Example after transforming:
	void beforeFirst() throws Exception {
System.out.println(new Throwable().getStackTrace()[0]);
rowIdx = -1;
		rowSource.beforeFirst();
}
An additional issue regarding whitespace: if a file ends with a newline character, the parse function throws a ParseError without further details. Removing this newline from the original source fixes the issue, but I'd rather not have to fix code 'manually' before parsing. How can I circumvent this issue?
Alas, capturing whitespace with a concrete pattern is not a feature of the current version of Rascal. We used to have it, but now it's back on the TODO list. I can point you to papers about the topic if you are interested. So for now you have to deal with this "damage" later.
You could write a Tree-to-Tree transformation on the generic level (see ParseTree.rsc) to fix indentation issues in a parse tree after your transformation, or to re-insert the comments that you lost. This is about matching the Tree data type and its appl constructors. The Tree format is a form of reflection on the parse trees of Rascal that allows any kind of transformation, including of whitespace and comments.
The parse error you talked about is caused by not using the start non-terminal. If you use parse(#start[CompilationUnit], ...) then whitespace and comments before and after the CompilationUnit are accepted.

Rascal: TrafoFields Syntax error: concrete syntax fragment

I'm trying to re-create Tijs' CurryOn16 example "TrafoFields" scraping the code from the video, but using the Java18.rsc grammar instead of his Java15.rsc. I've parsed the Example.java successfully in the repl, like he did in the video, yielding a var pt. I then try to do the transformation with trafoFields(pt). The response I get is:
|project://Rascal-Test/src/TrafoFields.rsc|(235,142,<12,9>,<16,11>): Syntax error: concrete syntax fragment
My TrafoFields.rsc looks like this:
module TrafoFields
import lang::java::\syntax::Java18;
/**
* - Make public fields private
* - add getters and setters
*/
start[CompilationUnit] trafoFields(start[CompilationUnit] cu) {
  return innermost visit (cu) {
    case (ClassBody)`{
                    ' <ClassBodyDeclaration* cs1>
                    ' public <Type t> <ID f>;
                    ' <ClassBodyDeclaration* cs2>
                    '}`
      => (ClassBody)`{
                    ' <ClassBodyDeclaration* cs1>
                    ' private <Type t> <ID f>;
                    ' public void <ID setter>(<Type t> x) {
                    ' this.<ID f> = x;
                    ' }
                    ' public <Type t> <ID getter>() {
                    ' return this.<ID f>;
                    ' }
                    ' <ClassBodyDeclaration* cs2>
                    '}`
      when
        ID setter := [ID]"set<f>",
        ID getter := [ID]"get<f>"
  }
}
The only deviation from Tijs' code is that I've changed ClassBodyDec* to ClassBodyDeclaration*, as the grammar has this as a non-terminal. Any hint what else could be wrong?
UPDATE
More non-terminal re-writing adapting to Java18 grammar:
Id => ID
Ah yes, that is the Achilles' heel of concrete syntax usability: parse errors.
Note that a generalized parser (such as GLL which Rascal uses), simulates "unlimited lookahead" and so a parse error may be reported a few characters or even a few lines after the actual cause (but never before!). So shortening the example (delta debugging) will help localize the cause.
My way-of-life in this is:
First replace all pattern holes by concrete Java snippets. I know Java, so I should be able to write a correct fragment that would have matched the holes.
If there is still a parse error, check the top non-terminal. Is it the one you needed? Also make sure there is no extra whitespace before the start and after the end of the fragment inside the backquotes. Still a parse error? Write a shorter fragment for a sub-nonterminal first.
Parse error solved? This means one of the pattern holes was not syntactically correct. The type of the hole is leading here: it should be one of the non-terminals used in the grammar literally, and of course at the right spot in the fragment. Add the holes back in one by one until you hit the error again. Then you know the cause and probably also the fix.

Writing a parser for a Scheme EBNF Grammar

I'm currently writing my own compiler for a Scheme subset and have issues with my own recursive descent parser for the grammar. I'm using the Chez Scheme grammar found here: https://www.scheme.com/tspl2d/grammar.html. The issues come from the asterisks and plus signs. Since asterisks basically mean epsilon rules, I would either need to check if the following token is legal or allow the called non-terminal method to fail and return nothing. I took the second approach and use a vector to save tokens that possibly need to be restored.
Example:
Note: The code to build the AST is still missing
bool Parser::definition() {
    if (variableDefinition()) {
        consumeTT(); // consumeTemporaryTokens
        return true;
    }
    if (derivedDefinition()) {
        consumeTT();
        return true;
    }
    if (checkNext(Tag::LPAR)) {
        if (checkNext(Tag::BEGIN)) {
            while (definition())
                ;
            consumeTT();
            return true;
        }
    }
    restoreTT(); // restoreTemporaryTokens
    return false;
}
Is this the right approach? And how am I supposed to handle errors now, since basically any rule can fail?
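One common way to structure the "let the rule fail and restore" approach is to remember the input position on entry to a rule and reset it on failure, instead of keeping a separate vector of temporary tokens. For errors, a frequently used heuristic is to record the furthest position at which any match failed and report that once the top-level rule gives up. The following is a minimal sketch in Java rather than C++, over a toy grammar invented for illustration (`definition -> variableDefinition | "(" "begin" definition* ")"`); all class and method names are hypothetical, and tokens are plain strings.

```java
import java.util.List;

public class BacktrackingParser {
    private final List<String> tokens;
    private int pos = 0;
    private int errorPosition = 0;      // furthest position at which a match failed
    private String errorExpected = "";  // token that was expected there

    public BacktrackingParser(List<String> tokens) {
        this.tokens = tokens;
    }

    // Try to consume one token; a failure is only recorded, not reported.
    private boolean match(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equals(expected)) {
            pos++;
            return true;
        }
        if (pos >= errorPosition) {
            errorPosition = pos;
            errorExpected = expected;
        }
        return false;
    }

    // Toy rule: variableDefinition -> "(" "define" "x" ")"
    private boolean variableDefinition() {
        int mark = pos;                 // remember where this rule started
        if (match("(") && match("define") && match("x") && match(")")) {
            return true;
        }
        pos = mark;                     // backtrack: nothing was consumed
        return false;
    }

    // definition -> variableDefinition | "(" "begin" definition* ")"
    public boolean definition() {
        if (variableDefinition()) {
            return true;
        }
        int mark = pos;
        if (match("(") && match("begin")) {
            while (definition()) {
                // the star: repeat until the sub-rule fails without consuming input
            }
            if (match(")")) {
                return true;
            }
        }
        pos = mark;
        return false;
    }

    // Succeeds only if one definition covers the whole input.
    public boolean parse() {
        return definition() && pos == tokens.size();
    }

    public int errorPosition() { return errorPosition; }
    public String errorExpected() { return errorExpected; }
}
```

With this shape a failing rule is not an error by itself; only a failure of `parse()` is, and `errorPosition()` together with `errorExpected()` then gives a reasonable place to point the user at.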

Interpolation in Concrete Syntax Matching

I'm working with a Java 8 grammar and I want to find occurrences of a method invocation, more specifically it.hasNext(), when it is an Iterator.
This works:
visit(unit) {
    case (MethodInvocation)`it . <TypeArguments? ta> hasNext()`: {
        println("found");
    }
}
Ideally I would like to match with any identifier, not just it.
So I tried using String interpolation, which compiles but doesn't match:
str iteratorId = "it";
visit(unit) {
    case (MethodInvocation)`$iteratorId$ . <TypeArguments? ta> hasNext()`: {
        println("achei");
    }
}
I also tried several other ways, including pattern variable uses (as seen in the docs) but I can't get this to work.
Is this kind of matching possible in Rascal? If yes, how can it be done?
The answer specifically depends on the grammar you are using, which I did not look up, but in general in concrete syntax fragments this notation is used for placeholders: <NonTerminal variableName>
So your pattern should look something like the following:
str iteratorId = "it";
visit(unit) {
    case (MethodInvocation)`<MethodName name>.<TypeArguments? ta>hasNext()`:
        if (iteratorId == "<name>") println("bingo!");
}
That is assuming that MethodName is indeed a non-terminal in your Java8 grammar and part of the syntax rule for method invocations.

Aliasing frequently used patterns in Lex

I have one regexp that is used in several rules. Can I define an alias for it, to keep the regexp definition in one place and just use the alias across the code?
Example:
[A-Za-z0-9].[A-Za-z0-9_-]* (expression) NAME (alias)
...
%%
NAME[=]NAME {
//Do something.
}
%%
The definition goes in the definitions section of your lex input file (before the first %%), and you use it in a regular expression by putting the name inside curly braces ({…}). For example:
name [A-Za-z0-9][A-Za-z0-9_-]*
%%
{name}[=]{name} { /* Do something */ }
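For comparison outside lex, the same "define once, reuse by name" idea can be approximated in Java by keeping the sub-pattern in a single string constant and composing larger patterns from it. This is only an analogy to the lex definition above, not a lex feature; the class and constant names are made up for illustration, and wrapping the fragment in a non-capturing group is a choice made here so that it behaves as one unit wherever it is reused.

```java
import java.util.regex.Pattern;

public class NameAlias {
    // Single definition of the shared sub-pattern, playing the role of `name` above.
    static final String NAME = "(?:[A-Za-z0-9][A-Za-z0-9_-]*)";

    // Counterpart of the rule {name}[=]{name}
    static final Pattern ASSIGNMENT = Pattern.compile(NAME + "[=]" + NAME);

    // True if the whole string has the form name=name.
    static boolean isAssignment(String s) {
        return ASSIGNMENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAssignment("key=value"));  // true
        System.out.println(isAssignment("-key=value")); // false: a name cannot start with '-'
    }
}
```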
