How to implement a for loop in JavaCC - parsing

I am implementing a parser based on JavaCC which will be able to parse GW-BASIC programs.
I implemented the for loop like this:
void forloop(Token line):
{
    Token toV;
    Token toI;
    Token step;
    Token next;
    Token var;
}
{
    <FOR> var=<VARIABLE> "=" Expression() { instructions.add("STORE " + var.image); } <TO> toV=<INTEGER> <STEP> step=<INTEGER>
    {
        instructions.add("LABEL " + labelsCntr);
        instructions.add("LOAD " + var.image);
        instructions.add("CONST " + toV.image);
        instructions.add("SUB");
        instructions.add("CONST 0");
    }
    ( Line() )*
    next=<INTEGER> <NEXT> <VARIABLE>
    {
        instructions.add("LINE " + next.image);
        instructions.add("LOAD " + step.image);
        instructions.add("LOAD " + var.image);
        instructions.add("ADD");
        instructions.add("JMP LABEL " + labelsCntr);
        labelsCntr++;
    }
}
But it does not work.
How can I implement the for loop so that it works, or where am I going wrong?
Thanks in advance.

There are two things I don't see in your machine code that I would have expected. One is a conditional jump out of the loop: when the variable exceeds toV, control should jump to the first instruction after the loop. Second, I don't see where var is changed. At the end of the loop you add the value of the variable to the step value, but you don't store the result back into the variable.
There are also a few checks that I would expect to be done at compile time but that are missing: that the variable in the NEXT statement matches the one in the FOR statement, and that the step is positive. A sketch with those fixes follows.
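Here is a minimal sketch of how the production might look with those fixes. It assumes the target machine has a conditional jump, here called JGT (jump when the top of the stack is greater than zero); substitute whatever opcode your instruction set actually provides. It also emits the literal step with CONST rather than LOAD, and stores the incremented value back into the variable:
void forloop(Token line):
{
    Token toV;
    Token step;
    Token next;
    Token var;
    Token nextVar;
    int loopLabel;
    int exitLabel;
}
{
    <FOR> var=<VARIABLE> "=" Expression() { instructions.add("STORE " + var.image); }
    <TO> toV=<INTEGER> <STEP> step=<INTEGER>
    {
        // Compile-time check: a non-positive step would make this loop shape run forever.
        if (Integer.parseInt(step.image) <= 0)
            throw new ParseException("STEP must be positive");
        loopLabel = labelsCntr++;
        exitLabel = labelsCntr++;
        instructions.add("LABEL " + loopLabel);
        instructions.add("LOAD " + var.image);
        instructions.add("CONST " + toV.image);
        instructions.add("SUB");
        // Conditional jump out of the loop once var - toV > 0.
        instructions.add("JGT LABEL " + exitLabel);
    }
    ( Line() )*
    next=<INTEGER> <NEXT> nextVar=<VARIABLE>
    {
        // Compile-time check: NEXT must name the loop variable.
        if (!nextVar.image.equals(var.image))
            throw new ParseException("NEXT " + nextVar.image + " does not match FOR " + var.image);
        instructions.add("LINE " + next.image);
        instructions.add("LOAD " + var.image);
        instructions.add("CONST " + step.image);  // step is an integer literal, not a variable
        instructions.add("ADD");
        instructions.add("STORE " + var.image);   // write the incremented value back
        instructions.add("JMP LABEL " + loopLabel);
        instructions.add("LABEL " + exitLabel);
    }
}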

Related

Implement heredocs with trim indent using PEG.js

I'm working on a language similar to Ruby called gaiman, and I'm using PEG.js to generate the parser.
Do you know if there is a way to implement heredocs with proper indentation?
xxx = <<<END
hello
world
END
the output should be:
"hello
world"
I need this because this code doesn't look very nice:
def foo(arg) {
    if arg == "here" then
        return <<<END
xxx
xxx
END
    end
end
this is a function where the user wants to return:
"xxx
xxx"
I would prefer the code to look like this:
def foo(arg) {
    if arg == "here" then
        return <<<END
        xxx
        xxx
        END
    end
end
If I trim all the lines, the user will not be able to use a string with leading spaces when they want one. Does anyone know if PEG.js allows this?
I don't have any code for heredocs yet; I just want to be sure that what I want is possible.
EDIT:
So I've tried to implement heredocs, and the problem is that PEG doesn't allow back-references.
heredoc = "<<<" marker:[\w]+ "\n" text:[\s\S]+ marker {
    return text.join('');
}
It says that the marker is not defined. As for trimming, I think I can use the location() function.
I don't think that's a reasonable expectation for a parser generator; few if any would be equal to the challenge.
For a start, recognising the here-string syntax is inherently context-sensitive, since the end-delimiter must be a precise copy of the delimiter provided after the <<< token. So you would need a custom lexical analyser, and that means that you need a parser generator which allows you to use a custom lexical analyser. (So a parser generator which assumes you want a scannerless parser might not be the optimal choice.)
Recognising the end of the here-string token shouldn't be too difficult, although you can't do it with a single regular expression. My approach would be to use a custom scanning function which breaks the here-string into a series of lines, concatenating them as it goes until it reaches a line containing only the end-delimiter.
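For illustration, here is a minimal sketch of such a scanning function in plain JavaScript; the function name and calling convention are made up for the example, not part of any parser generator's API:
// Collect a here-string body line by line until a line containing only
// the end delimiter (possibly indented) is found. `src` is the whole
// input and `pos` is the offset just past "<<<MARKER\n".
function scanHeredoc(src, pos, marker) {
  const lines = [];
  while (pos < src.length) {
    let end = src.indexOf('\n', pos);
    if (end === -1) end = src.length;
    const line = src.slice(pos, end);
    // A line holding only the delimiter ends the literal.
    if (line.trim() === marker) {
      return { text: lines.join('\n'), next: end + 1 };
    }
    lines.push(line);
    pos = end + 1;
  }
  throw new Error('unterminated here-string: missing ' + marker);
}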
Once you've recognised the text of the literal, all you need to normalise the spaces in the way you want is the column number at which the <<< starts. With that, you can trim each line in the string literal. So you only need a lexical scanner which accurately reports token position. Trimming wouldn't normally be done inside the generated lexical scanner; rather, it would be the associated semantic action. (Equally, it could be a semantic action in the grammar. But it's always going to be code that you write.)
When you trim the literal, you'll need to deal with the cases in which it is impossible, because the user has not respected the indentation requirement. And you'll need to do something with tab characters; getting those right probably means that you'll want a lexical scanner which computes visible column positions rather than character offsets.
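As an illustration of that last point, here is a rough JavaScript sketch of column-aware trimming; the tab-stop width of 8 and the error behaviour are assumptions for the example:
// Remove up to `width` visual columns of leading whitespace from each
// line, expanding tabs to the next multiple of `tabstop` so the margin
// is measured in columns rather than characters.
function trimIndent(text, width, tabstop = 8) {
  return text.split('\n').map(line => {
    let col = 0, i = 0;
    while (col < width && i < line.length) {
      if (line[i] === ' ') col += 1;
      else if (line[i] === '\t') col += tabstop - (col % tabstop);
      else throw new Error('line not indented to the here-string margin');
      i += 1;
    }
    return line.slice(i);
  }).join('\n');
}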
I don't know if peg.js meets those requirements, since I don't use it. (I did look at the documentation and failed to see any indication of how you might incorporate a custom scanner function, but that doesn't mean there isn't a way to do it.) I hope the discussion above at least lets you check the detailed documentation for the parser generator you want to use, and otherwise find a different parser generator which will work for you in this use case.
Here is an implementation of heredocs in Peggy, the successor to PEG.js (which is no longer maintained). This code was based on the GitHub issue.
heredoc = "<<<" begin:marker "\n" text:($any_char+ "\n")+ _ end:marker (
&{ return begin === end; }
/ '' { error(`Expected matched marker "${begin}", but marker "${end}" was found`); }
) {
const loc = location();
const min = loc.start.column - 1;
const re = new RegExp(`\\s{${min}}`);
return text.map(line => {
return line[0].replace(re, '');
}).join('\n');
}
any_char = (!"\n" .)
marker_char = (!" " !"\n" .)
marker "Marker" = $marker_char+
_ "whitespace"
= [ \t\n\r]* { return []; }
EDIT: the above didn't work when another piece of code followed the heredoc; here is a better grammar:
{ let heredoc_begin = null; }

heredoc = "<<<" beginMarker "\n" text:content endMarker {
    const loc = location();
    const min = loc.start.column - 1;
    const re = new RegExp(`^\\s{${min}}`, 'mg');
    return {
        type: 'Literal',
        value: text.replace(re, '')
    };
}
__ = (!"\n" !" " .)
marker 'Marker' = $__+
beginMarker = m:marker { heredoc_begin = m; }
endMarker = "\n" " "* end:marker &{ return heredoc_begin === end; }
content = $(!endMarker .)*

(f)lex the difference between PRINTA$ and PRINT A$

I am parsing BASIC:
530 FOR I=1 TO 9:C(I,1)=0:C(I,2)=0:NEXT I
The patterns that are used in this case are:
FOR { return TOK_FOR; }
TO { return TOK_TO; }
NEXT { return TOK_NEXT; }
(many lines later...)
[A-Za-z_#][A-Za-z0-9_]*[\$%\!#]? {
    yylval.s = g_string_new(yytext);
    return IDENTIFIER;
}
(many lines later...)
[ \t\r\l] { /* eat non-string whitespace */ }
The problem occurs when the spaces are removed, which was common in the era of 8k RAM. So the line that is actually found in Super Star Trek is:
530 FORI=1TO9:C(I,1)=0:C(I,2)=0:NEXTI
Now I know why this is happening: "FORI" is longer than "FOR" and is a valid IDENTIFIER in my pattern, so it matches IDENTIFIER.
The original rule in MS BASIC was that variable names could be only two characters, so the pattern had no * and the longer match would fail. But this version also supports GW-BASIC and Atari BASIC, which allow variables with long names. So "FORI" is a legal variable name in my scanner, and it matches because it is the longest hit.
When I look at the manual, the only similar example deliberately returns an error. It seems what I need is "match the ID, but only if it's not the same as a defined %token"; is there such a thing?
It's easy to recognise keywords even if they have an identifier concatenated. What's tricky is deciding in what contexts you should apply that technique.
Here's a simple pattern for recognising keywords, using trailing context:
tail [[:alnum:]]*[$%!#]?
%%
FOR/{tail} { return TOK_FOR; }
TO/{tail} { return TOK_TO; }
NEXT/{tail} { return TOK_NEXT; }
/* etc. */
[[:alpha:]]{tail} { /* Handle an ID */ }
Effectively, that just extends the keyword match without extending the matched token.
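To make the effect concrete, here is how the compressed line from the question would tokenise under those rules, assuming the scanner also has a rule for numeric literals (called NUMBER here):
530 FORI=1TO9
=>  NUMBER(530) TOK_FOR IDENTIFIER(I) '=' NUMBER(1) TOK_TO NUMBER(9)
The trailing context counts toward flex's longest-match comparison, so FOR/{tail} and the identifier rule both match four characters of FORI; the keyword rule wins because it is listed first, yytext still contains only FOR, and the I is rescanned as an identifier.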
But I doubt the problem is so simple. How should FORFORTO be tokenised, for example?

ANTLR4 - parsing regex literals in JavaScript grammar

I'm using ANTLR4 to generate a Lexer for some JavaScript preprocessor (basically it tokenizes a javascript file and extracts every string literal).
I used a grammar originally made for Antlr3, and imported the relevant parts (only the lexer rules) for v4.
I have just one single issue remaining: I don't know how to handle corner cases for RegEx literals, like this:
log(Math.round(v * 100) / 100 + ' msec/sample');
The / 100 + ' msec/ is interpreted as a RegEx literal, because the lexer rule is always active.
What I would like is to incorporate this logic (C# code; I would need JavaScript, but I simply don't know how to adapt it):
/// <summary>
/// Indicates whether regular expression (yields true) or division expression recognition (false) in the lexer is enabled.
/// These are mutually exclusive, and the decision about which is active in the lexer is based on the previous on-channel token.
/// When the previous token can be identified as a possible left operand for a division this results in false, otherwise true.
/// </summary>
private bool AreRegularExpressionsEnabled
{
    get
    {
        if (Last == null)
        {
            return true;
        }
        switch (Last.Type)
        {
            // identifier
            case Identifier:
            // literals
            case NULL:
            case TRUE:
            case FALSE:
            case THIS:
            case OctalIntegerLiteral:
            case DecimalLiteral:
            case HexIntegerLiteral:
            case StringLiteral:
            // member access ending
            case RBRACK:
            // function call or nested expression ending
            case RPAREN:
                return false;
            // otherwise OK
            default:
                return true;
        }
    }
}
This rule was present in the old grammar as an inline predicate, like this:
RegularExpressionLiteral
    : { AreRegularExpressionsEnabled }?=> DIV RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;
But I don't know how to use this technique in ANTLR4.
In the ANTLR4 book, there are some suggestions about solving this kind of problem at the parser level (chapter 12.2, context-sensitive lexical problems), but I don't want to use a parser. I just want to extract all the tokens, leave everything untouched except for the string literals, and keep the parsing out of my way.
Any suggestion would be really appreciated, thanks!
I'm posting here the final solution, developed by adapting the existing one to the new syntax of ANTLR4 and addressing the differences in JavaScript syntax.
I'm posting just the relevant parts, to give a clue to someone else about a working strategy.
The rule was edited as follows:
RegularExpressionLiteral
    : DIV {this.isRegExEnabled()}? RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;
The function isRegExEnabled is defined in an @members section on top of the lexer grammar, as follows:
@members {
EcmaScriptLexer.prototype.nextToken = function() {
    var result = antlr4.Lexer.prototype.nextToken.call(this, arguments);
    if (result.channel !== antlr4.Lexer.HIDDEN) {
        this._Last = result;
    }
    return result;
}
EcmaScriptLexer.prototype.isRegExEnabled = function() {
    var la = this._Last ? this._Last.type : null;
    return la !== EcmaScriptLexer.Identifier &&
           la !== EcmaScriptLexer.NULL &&
           la !== EcmaScriptLexer.TRUE &&
           la !== EcmaScriptLexer.FALSE &&
           la !== EcmaScriptLexer.THIS &&
           la !== EcmaScriptLexer.OctalIntegerLiteral &&
           la !== EcmaScriptLexer.DecimalLiteral &&
           la !== EcmaScriptLexer.HexIntegerLiteral &&
           la !== EcmaScriptLexer.StringLiteral &&
           la !== EcmaScriptLexer.RBRACK &&
           la !== EcmaScriptLexer.RPAREN;
}}
As you can see, two functions are defined. One is an override of the lexer's nextToken method, which wraps the existing nextToken and saves the last non-comment, non-whitespace token for reference. Then the semantic predicate invokes isRegExEnabled, checking whether the last significant token is compatible with the presence of a RegEx literal. If it's not, it returns false.
Thanks to Lucas Trzesniewski for the comment: it pointed me in the right direction, and to Patrick Hulsmeijer for the original work on v3.

How to get such pattern matching of regular expression in lex

Hi, I want to match a specific pattern with a regular expression, but I'm failing to do that. The input should look like:
noun wordname:wordmeaning
I'm successfully getting the noun and the word name, but I couldn't design a pattern for the word meaning. My code is:
int state;
char *meaning;
char *word;

^verb { state = VERB; }
^adj  { state = ADJ; }
^adv  { state = ADV; }
^noun { state = NOUN; }
^prep { state = PREP; }
^pron { state = PRON; }
^conj { state = CONJ; }

/* my try, but it failed */
[:\a-z] {
    meaning = yytext;
    printf(" Meaning is getting detected %s", meaning);
}

[a-zA-Z]+ {
    word = yytext;
}
Example input:
noun john:This is a name
Now word should be equal to "john" and meaning should be equal to "This is a name".
Agreed that lex states (also known as start conditions) are the way to go (oddly, there are no useful tutorials).
Briefly:
your application can be organized as states, using one for the part of speech ("noun"), one for the word ("john"), and one for the definition (after the colon).
at the top of the lex file, declare the states, e.g.,
%s TYPE NAME VALUE
the capitals are not necessary, but since you are defining constants, that is a good convention.
next to the patterns, put those state names in < > brackets to tell lex that the patterns are used only in those states. You can list more than one state, comma-separated, when it matters. But your lex file probably does not need that.
one state is predefined: INITIAL.
your program switches states using the BEGIN() macro, in actions, e.g.,
{ BEGIN(TYPE); }
if your input is well-formed, it's simple: as each "type" is recognized, it begins the NAME state.
in the NAME state, your lexer looks for whatever you think a name should be, e.g.,
<NAME>[[:alpha:]][[:alnum:]]+ { my_name = strdup(yytext); }
the name ends with a colon, so
<NAME>":" { BEGIN(VALUE); }
the value is then everything until the end of the line, e.g.,
<VALUE>.* { my_value = strdup(yytext); BEGIN(INITIAL); }
whether you switch to INITIAL or TYPE depends on what other things you might add to your lexer (such as ignoring comment lines and whitespace). A complete minimal sketch assembling these pieces is shown below.
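Here is a minimal self-contained sketch along those lines; the variable names my_name and my_value are made up, the TYPE state is folded into INITIAL for brevity, and the program simply prints each record instead of feeding a parser:
%{
#include <stdio.h>
#include <string.h>

static char *my_name;
static char *my_value;
%}

%s NAME VALUE

%%
^noun                          { BEGIN(NAME); }  /* likewise verb, adj, adv, ... */
<NAME>[[:alpha:]][[:alnum:]]*  { my_name = strdup(yytext); }
<NAME>":"                      { BEGIN(VALUE); }
<VALUE>.*                      { my_value = strdup(yytext);
                                 printf("word=%s meaning=%s\n", my_name, my_value);
                                 BEGIN(INITIAL); }
[ \t\n]                        { /* skip whitespace between tokens */ }
%%

int main(void) { return yylex(); }
int yywrap(void) { return 1; }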
Further reading:
Start conditions (flex documentation)
Introduction to Flex

How can I get all characters until EOF with the input() function in a flex lexer?

I tried to use flex as below:
<MOD1>{INFBLK_START} {
    int c = input(pp->scaninfo);
    while (c != EOF) {
        // ...save the character.
        c = input(pp->scaninfo);
    }
    return BLOCK;
}
but the code gives a segmentation fault when I run it.
The crash is in the yy_get_next_buffer function, where the lex state is YY_END_OF_BUFFER.
How can I get all characters to EOF safely?
@rici, I have finished this myself by changing the flex rules as below.
<MOD1>{INFBLK_START}  {
    // malloc memory.
    BEGIN MOD2;
}
<MOD2>.|\n  {
    // return each char and record them in the bison code.
}
<MOD2><<EOF>>  {
    yyterminate();
}
This is one way to get all characters until EOF.
However, there is a considerable shortcoming: the lexer has to send every single character to the parser with a separate function call, which costs too much when the number of characters is very large.
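One common way around that cost is to accumulate the characters inside the lexer and hand the parser a single token. Here is a rough sketch of that idea; buf_reset, buf_append and buf_finish stand for a dynamic string buffer you would write yourself, and yylval.s is an assumed semantic-value field:
<MOD1>{INFBLK_START}  {
    buf_reset();                      /* start collecting */
    BEGIN MOD2;
}
<MOD2>[^\n]+   { buf_append(yytext, yyleng); }  /* whole runs, not single chars */
<MOD2>\n       { buf_append("\n", 1); }
<MOD2><<EOF>>  {
    yylval.s = buf_finish();          /* one BLOCK token carries everything */
    BEGIN INITIAL;
    return BLOCK;
}
Matching whole runs of non-newline characters instead of one character at a time also cuts down the number of rule invocations inside the lexer itself.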
