Lightweight syntax and nested records - f#

This program:
type A = { a : int }
type B = { b : A }
//34567890
let r = {
b = { // line 6
a = 2 // line 7
}
}
Produces under mono/fsharpc this warning twice:
/Users/debois/git/dcr/foo.fs(7,5): warning FS0058: Possible incorrect indentation: this token is offside of context started at position (6:7). Try indenting this token further or using standard formatting conventions.
Why does this warning occur at all? The f#-spec p. 228 makes me think the token 'a' following '{' sets a new offside line, in which case there should be no problem?
Why does it occur twice?
Thanks,
Søren
Full output:
dcr > fsharpc foo.fs
F# Compiler for F# 3.0 (Open Source Edition)
Freely distributed under the Apache 2.0 Open Source License
/Users/debois/git/dcr/foo.fs(7,5): warning FS0058: Possible incorrect indentation: this token is offside of context started at position (6:7). Try indenting this token further or using standard formatting conventions.
/Users/debois/git/dcr/foo.fs(7,5): warning FS0058: Possible incorrect indentation: this token is offside of context started at position (6:7). Try indenting this token further or using standard formatting conventions.
dcr >

reformat your code as
type A = { a : int }
type B = { b : A }
//34567890
let r = { b = { a = 2 } }
or
let r =
{
b = { a = 2 }
}
i.e. the { is the left-most token.
EDIT: One off-site line starts with the { therefore you need to indent at least as much as the { the line break after is not mandatory. And the second warning is because of the same reason.

I finally worked it out. From the F# spec pp. 229–230, here are the pertinent rule about when offside contexts (lines) are introduced (rule numbering is mine):
(i) The column of the first token of a (, { or begin token.
(ii) Immediately after an = token is encountered in a record expression when the subsequent token either (a) occurs on the next line or (b) is one of try, match, if, let, for, while or use.
Now recall the problem:
//34567890
let r = {
b = { // line 6
a = 2 // line 7
}
}
The b on line 6 follows a { and so pushes an offside-line on column 3 by (i). Then, the { on line 6 follows a record expression = and so pushes a new offside line on column 7 by (ii). The a on line 7 column 5 violates that offside line.
The tricky bit is that the { on line 5 makes the next token define an offside-line, whereas the { on line 6 is itself an offside line (because it follows a record-expression equal sign).

Spec says:
Other structured constructs also introduce offside lines at the following places:
The column of the first token of a (, { or begin token.
Why do you decide that offside line should be introduced by the token after the { when spec says that it should be the { itself?
NOTE: I agree that phrase "first token of ...token " sounds confusing. More likely it should be "first character of ...token" as this is applicable to begin\end case (offside line is introduced by the column of 'b' character)
begin
begin
end // possible incorrect indentation: this token is offside of the context started at (2,2)
end
Having two warnings for the same position looks like a bug.

Related

Implement heredocs with trim indent using PEG.js

I working on a language similar to ruby called gaiman and I'm using PEG.js to generate the parser.
Do you know if there is a way to implement heredocs with proper indentation?
xxx = <<<END
hello
world
END
the output should be:
"hello
world"
I need this because this code doesn't look very nice:
def foo(arg) {
if arg == "here" then
return <<<END
xxx
xxx
END
end
end
this is a function where the user wants to return:
"xxx
xxx"
I would prefer the code to look like this:
def foo(arg) {
if arg == "here" then
return <<<END
xxx
xxx
END
end
end
If I trim all the lines user will not be able to use a string with leading spaces when he wants. Does anyone know if PEG.js allows this?
I don't have any code yet for heredocs, just want to be sure if something that I want is possible.
EDIT:
So I've tried to implement heredocs and the problem is that PEG doesn't allow back-references.
heredoc = "<<<" marker:[\w]+ "\n" text:[\s\S]+ marker {
return text.join('');
}
It says that the marker is not defined. As for trimming I think I can use location() function
I don't think that's a reasonable expectation for a parser generator; few if any would be equal to the challenge.
For a start, recognising the here-string syntax is inherently context-sensitive, since the end-delimiter must be a precise copy of the delimiter provided after the <<< token. So you would need a custom lexical analyser, and that means that you need a parser generator which allows you to use a custom lexical analyser. (So a parser generator which assumes you want a scannerless parser might not be the optimal choice.)
Recognising the end of the here-string token shouldn't be too difficult, although you can't do it with a single regular expression. My approach would be to use a custom scanning function which breaks the here-string into a series of lines, concatenating them as it goes until it reaches a line containing only the end-delimiter.
Once you've recognised the text of the literal, all you need to normalise the spaces in the way you want is the column number at which the <<< starts. With that, you can trim each line in the string literal. So you only need a lexical scanner which accurately reports token position. Trimming wouldn't normally be done inside the generated lexical scanner; rather, it would be the associated semantic action. (Equally, it could be a semantic action in the grammar. But it's always going to be code that you write.)
When you trim the literal, you'll need to deal with the cases in which it is impossible, because the user has not respected the indentation requirement. And you'll need to do something with tab characters; getting those right probably means that you'll want a lexical scanner which computes visible column positions rather than character offsets.
I don't know if peg.js corresponds with those requirements, since I don't use it. (I did look at the documentation, and failed to see any indication as to how you might incorporate a custom scanner function. But that doesn't mean there isn't a way to do it.) I hope that the discussion above at least lets you check the detailed documentation for the parser generator you want to use, and otherwise find a different parser generator which will work for you in this use case.
Here is the implementation of heredocs in Peggy successor to PEG.js that is not maintained anymore. This code was based on the GitHub issue.
heredoc = "<<<" begin:marker "\n" text:($any_char+ "\n")+ _ end:marker (
&{ return begin === end; }
/ '' { error(`Expected matched marker "${begin}", but marker "${end}" was found`); }
) {
const loc = location();
const min = loc.start.column - 1;
const re = new RegExp(`\\s{${min}}`);
return text.map(line => {
return line[0].replace(re, '');
}).join('\n');
}
any_char = (!"\n" .)
marker_char = (!" " !"\n" .)
marker "Marker" = $marker_char+
_ "whitespace"
= [ \t\n\r]* { return []; }
EDIT: above didn't work with another piece of code after heredoc, here is better grammar:
{ let heredoc_begin = null; }
heredoc = "<<<" beginMarker "\n" text:content endMarker {
const loc = location();
const min = loc.start.column - 1;
const re = new RegExp(`^\\s{${min}}`, 'mg');
return {
type: 'Literal',
value: text.replace(re, '')
};
}
__ = (!"\n" !" " .)
marker 'Marker' = $__+
beginMarker = m:marker { heredoc_begin = m; }
endMarker = "\n" " "* end:marker &{ return heredoc_begin === end; }
content = $(!endMarker .)*

(f)lex the difference between PRINTA$ and PRINT A$

I am parsing BASIC:
530 FOR I=1 TO 9:C(I,1)=0:C(I,2)=0:NEXT I
The patterns that are used in this case are:
FOR { return TOK_FOR; }
TO { return TOK_TO; }
NEXT { return TOK_NEXT; }
(many lines later...)
[A-Za-z_#][A-Za-z0-9_]*[\$%\!#]? {
yylval.s = g_string_new(yytext);
return IDENTIFIER;
}
(many lines later...)
[ \t\r\l] { /* eat non-string whitespace */ }
The problem occurs when the spaces are removed, which was common in the era of 8k RAM. So the line that is actually found in Super Star Trek is:
530 FORI=1TO9:C(I,1)=0:C(I,2)=0:NEXTI
Now I know why this is happening: "FORI" is longer than "FOR", it's a valid IDENTIFIER in my pattern, so it matches IDENTIFIER.
The original rule in MS BASIC was that variable names could be only two characters, so there was no * so the match would fail. But this version is also supporting GW BASIC and Atari BASIC, which allow variables with long names. So "FORI" is a legal variable name in my scanner, so that matches as it is the longest hit.
Now when I look at the manual, and the only similar example deliberately returns an error. It seems what I need is "match the ID, but only if it's not the same as defined %token", is there such a thing?
It's easy to recognise keywords even if they have an identifier concatenated. What's tricky is deciding in what contexts you should apply that technique.
Here's a simple pattern for recognising keywords, using trailing context:
tail [[:alnum:]]*[$%!#]?
%%
FOR/{tail} { return TOK_FOR; }
TO/{tail} { return TOK_TO; }
NEXT/{tail} { return TOK_NEXT; }
/* etc. */
[[:alpha:]]{tail} { /* Handle an ID */ }
Effectively, that just extends the keyword match without extending the matched token.
But I doubt the problem is so simple. How should FORFORTO be tokenised, for example?

How to get such pattern matching of regular expression in lex

Hi I want to check a specific pattern in regular expression but I'm failed to do that. Input should be like
noun wordname:wordmeaning
I'm successful getting noun and wordname but couldn't design a pattern for word meaning. My code is :
int state;
char *meaning;
char *wordd;
^verb { state=VERB; }
^adj { state = ADJ; }
^adv { state = ADV; }
^noun { state = NOUN; }
^prep { state = PREP; }
^pron { state = PRON; }
^conj { state = CONJ; }
//my try but failed
[:\a-z] {
meaning=yytext;
printf(" Meaning is getting detected %s", meaning);
}
[a-zA-Z]+ {
word=yytext;
}
Example input:
noun john:This is a name
Now word should be equal to john and meaning should be equal to This is a name.
Agreeing that lex states (also known as start conditions) are the way to go (odd, but there are no useful tutorials).
Briefly:
your application can be organized as states, using one for "noun", one for "john" and one for the definition (after the colon).
at the top of the lex file, declare the states, e.g.,
%s TYPE NAME VALUE
the capitals are not necessary, but since you are defining constants, that is a good convention.
next to the patterns, put those state names in < > brackets to tell lex that the patterns are used only in those states. You can list more than one state, comma-separated, when it matters. But your lex file probably does not need that.
one state is predefined: INITIAL.
your program switches states using the BEGIN() macro, in actions, e.g.,
{ BEGIN(TYPE); }
if your input is well-formed, it's simple: as each "type" is recognized, it begins the NAME state.
in the NAME state, your lexer looks for whatever you think a name should be, e.g.,
<NAME>[[:alpha:]][[:alnum:]]+ { my_name = strdup(yytext); }
the name ends with a colon, so
<NAME>":" { BEGIN(VALUE); }
the value is then everything until the end of the line, e.g.,
<VALUE>.* { my_value = strdup(yytext); BEGIN(INITIAL); }
whether you switch to INITIAL or TYPE depends on what other things you might add to your lexer (such as ignoring comment lines and whitespace).
Further reading:
Start conditions (flex documentation)
Introduction to Flex

How to implement for loop in javacc

I am implementing a parser based on javacc which will be able to GW Basic programs.
I implemented for loop like this
void forloop(Token line):
{
Token toV;
Token toI;
Token step;
Token next;
Token var;
}
{
<FOR> var=<VARIABLE> "=" Expression() { instructions.add("STORE " + var.image); } <TO> toV=<INTEGER> <STEP> step=<INTEGER>
{
instructions.add("LABEL "+labelsCntr);
instructions.add("LOAD "+var.image);
instructions.add("CONST "+toV.image);
instructions.add("SUB");
instructions.add("CONST 0");
}
( Line() )*
next = <INTEGER> <NEXT> <VARIABLE>
{
instructions.add("LINE "+next.image);
instructions.add("LOAD "+step.image);
instructions.add("LOAD "+var.image);
instructions.add("ADD");
instructions.add("JMP LABEL "+(labelsCntr));
labelsCntr++;
}
}
But it does not work.
How can I implement for loop so that it works.
Or where I am doing wrong.
Thanks in advance.
There are two things I don't see in your machine code that I would have expected. One is a conditional jump out of the loop. When the variable exceeds toV, control should jump to the first instruction after the loop. Second I don't see where the var is changed. At the end of the loop you add the value of the variable to the step value, but you don't store the result back into the variable.
There are also a few checks I expect need to be done at compile time that are missing: That the variable in the NEXT statement matches that in the VAR and that the step is positive.

"Expected token" using lemon parser generator

Is there a known way to generate an "Expected token" list when a syntax error happens ? I'm using Lemon as parser generator.
This seems to work:
%syntax_error {
int n = sizeof(yyTokenName) / sizeof(yyTokenName[0]);
for (int i = 0; i < n; ++i) {
int a = yy_find_shift_action(yypParser, (YYCODETYPE)i);
if (a < YYNSTATE + YYNRULE) {
printf("possible token: %s\n", yyTokenName[i]);
}
}
}
It tries all possible tokens and prints those that are applicable in the current parser state.
Note that when an incorrect token comes, the parser doesn't immediately call syntax_error, but it tries to reduce what's on stack hoping the token can be shifted afterwards. Only when nothing else can be reduced and the current token cannot be shifted, the parser calls syntax_error. The reductions will change parser state, which means that you may see less tokens than would have been applicable before the reductions. It should be sufficient for error reporting though.
There is no direct method to generate such list in Lemon. But you can try do this using debug output of Lemon tool and debug trace of generated parser. After call to ParseTrace function generated parser prints list of Shifts and Reduces it applies to the input stream. The last Shift before syntax error contains number of current state before the error. Find this state in *.out file for your parser and see list of expected tokens for it.
The modern versions of Lemon use something like the following:
%syntax_error {
for (int32_t i = 1, a = 0; i < YYNTOKEN; ++i) {
a = yy_find_shift_action((YYCODETYPE)i, yypParser->yytos->stateno);
if (a != YY_ERROR_ACTION) {
// 'a' is a valid token.
}
}
}

Resources