High-precedence application expressions as arguments - f#

A high-precedence application expression is one in which an identifier is immediately followed by a left parenthesis without intervening whitespace, e.g., f(g). Parentheses are required when passing these as function arguments: func (f(g)).
Section 15.2 of the spec states the grammar and precedence rules allow the unparenthesized form -- func f(g) -- but an additional check prevents this.
Why is this intentionally prohibited? It would obviate the need for excessive parentheses and piping, and generally make the code much cleaner.
A common example is
raise <| IndexOutOfRangeException()
raise (IndexOutOfRangeException())
could become simply
raise IndexOutOfRangeException()

I agree that having to write the additional parentheses is a bit annoying. I think the main reason they cannot be omitted is that adding whitespace would then change the meaning of your code quite significantly:
// Call 'foo' with the result of 'bar()' as an argument
foo bar()
// Call 'foo' with 'bar' as the first argument and '()' as the second
foo bar ()
There are still some rough edges where adding parens changes the evaluation (see this forum post), but that "just" changes the evaluation order. This would change the meaning of your code!


whitespace in flex patterns leads to "unrecognized rule"

The flex info manual allows whitespace in regular expressions via the "x" modifier in the (?r-s:pattern) form. It specifically offers a simple example (without whitespace):
(?:foo) same as (foo)
but the following program fails to compile with the error "unrecognized rule":
BAD (?:foo)
{BAD} {}
I cannot find any form of (? that is acceptable as a rule pattern. Is the manual in error, or do I misunderstand?
The example in your question does not seem to reflect the question itself, since it shows neither the use of whitespace nor an x flag. So I'm going to assume that the pattern which is failing for you is something like
BAD (?x:two | lines |
of | words)
{BAD} { }
And, indeed, that will not work. Although you can use extended format in a pattern, you can only use it in a definition if it doesn't contain a newline. The definition terminates at the last non-whitespace character on the definition line.
Anyway, definitions are overused. You could write the above as
(?x:two | lines |
of | words ) { }
Which saves anyone reading your code from having to search for a definition.
I do understand that you might want to use a very long pattern in a rule, which is awkward, particularly if you want to use it twice. Regardless of the issue with newlines, this tends to run into problems with Flex's definition length limit (2047 characters). My approach has been to break the very long pattern into a series of definitions, and then define another symbol which concatenates the pieces.
Before v2.6, Flex did not chop whitespace off the end of the definition line, which also leads to mysterious "unrecognized rule" errors. The manual seems to still reflect the v2.5 behaviour:
The definition is taken to begin at the first non-whitespace character following the name and continuing to the end of the line.

How does groovy distinguish division from strings?

Groovy supports / as a division operator:
groovy> 1 / 2
===> 0.5
It supports / as a string delimiter, which can even be multiline:
groovy> x = /foo/
===> foo
groovy:000> x = /foo
groovy:001> bar/
===> foo
Given this, why can't I evaluate a slashy-string literal in groovysh?
groovy:000> /foo/
Clearly groovysh thinks this is unterminated for some reason.
How does groovy avoid getting confused between division and strings? What does this code mean:
groovy> f / 2
Is this a function call f(/2 .../) where / is beginning a multiline slashy-string, or f divided by 2?
How does Groovy distinguish division from strings?
I'm not entirely sure how Groovy does it, but I'll describe how I'd do it, and I'd be very surprised if Groovy didn't work in a similar way.
Most parsing algorithms I've heard of (Shunting-yard, Pratt, etc) recognize two distinct kinds of tokens:
Those that expect to be preceded by an expression (infix operators, postfix operators, closing parentheses, etc). If one of these is not preceded by an expression, it's a syntax error.
Those that do not expect to be preceded by an expression (prefix operators, opening parentheses, identifiers, literals, etc). If one of these is preceded by an expression, it's a syntax error.
To make things easier, from this point onward I'm going to refer to the former kind of token as an operator and the latter as a non-operator.
Now, the interesting thing about this distinction is that it's made not based on what the token actually is, but rather on the immediate context, particularly the preceding tokens. Because of this, the same token can be interpreted very differently depending on its position in the code, and whether the parser classifies it as an operator or a non-operator. For example, the '-' token, if in an operator position, denotes a subtraction, but the same token in a non-operator position is a negation. There is no issue deciding whether a '-' is a subtraction operator or not, because you can tell based on its context.
The same is, in general, true for the '/' character in Groovy. If preceded by an expression, it's interpreted as an operator, which means it's a division. Otherwise, it's a non-operator, which makes it a string literal. So, you can generally tell if a '/' is a division or not, by looking at the token that immediately precedes it:
The '/' is a division if it follows an identifier, literal, postfix operator, closing parenthesis, or other token that denotes the end of an expression.
The '/' begins a string if it follows a prefix operator, infix operator, opening parenthesis, or other such token, or if it begins a line.
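The two rules above can be sketched as a tiny tokenizer that remembers only the kind of the previous significant token. This is a minimal illustration in Python, not Groovy's actual lexer: the token set, the `classify_slashes` helper, and the `ENDS_EXPRESSION` table are all hypothetical, and it deliberately ignores calls with optional parentheses.

```python
import re

# Token kinds that complete an expression; a '/' right after one is a division.
ENDS_EXPRESSION = {"ident", "number", "string", "rparen"}

# Simplified, assumed token set (not Groovy's real grammar).
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<lparen>\()
  | (?P<rparen>\))
  | (?P<slash>/)
  | (?P<op>[-+*=,])
""", re.VERBOSE)


def classify_slashes(source):
    """Label each leading '/' as 'division' or 'string' from the previous token."""
    results = []
    prev_kind = None  # None means start of input, i.e. no preceding expression
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind = match.lastgroup
        pos = match.end()
        if kind == "ws":
            continue  # whitespace never changes the operator/non-operator position
        if kind == "slash":
            if prev_kind in ENDS_EXPRESSION:
                # Operator position: an expression precedes the slash.
                results.append("division")
                prev_kind = "op"
            else:
                # Non-operator position: consume up to the closing slash.
                results.append("string")
                end = source.find("/", pos)
                pos = len(source) if end == -1 else end + 1
                prev_kind = "string"
            continue
        prev_kind = kind
    return results
```

For instance, `classify_slashes("f / 2")` yields a division because an identifier precedes the slash, while `classify_slashes("x = /foo/")` yields a string because an infix operator precedes it.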
Of course, it isn't quite so simple in practice. Groovy is designed to be flexible in the face of various styles and uses, and therefore things like semicolons or parentheses are often optional. This can make parsing somewhat ambiguous at times. For example, say our parser comes across the following line:
println / foo
This is most likely an attempt to print a multiline string: foo is the beginning of a string being passed to println as an argument, and the optional parentheses around the argument list are left out. Of course, to a simple parser it looks like a division. I expect the Groovy parser can tell the difference by reading ahead to the following lines to see which interpretation does not give an error, but for something like groovysh that is literally impossible (since, as a repl, it doesn't yet have access to more lines), so it's forced to just guess.
Why can't I evaluate a slashy-string literal in groovysh?
As before, I don't know the exact reason, but I do know that because groovysh is a repl, it's bound to have more trouble with the more ambiguous rules. Even so, a simple single-line slashy-string is pretty unambiguous, so I believe something else may be going on here. Here is the result of me playing with various forms in groovysh:
> /foo - unexpected char: '/' # line 2, column 1.
> /foo/ - awaits further input
> /foo/bar - unexpected char: '/' # line 2, column 1.
> /foo/bar/ - awaits further input
> /foo/ + 'bar' - unexpected char: '/' # line 2, column 1.
> 'foo' + /bar/ - evaluates to 'foobar'
> /foo/ - evaluates to 'foo'
> /foo - awaits further input
> /foo/bar - Unknown property: bar
It appears that something strange happens when a '/' character is the first character in a line. The pattern it appears to follow (as far as I can tell) is this:
A slash as the first character of a line begins a strange parsing mode.
In this mode, every line that ends with a slash followed by nothing but whitespace causes the repl to await further lines.
On the first line that ends with something other than a slash (or whitespace following a slash), the error unexpected char: '/' # line 2, column 1. is printed.
I've also noticed a couple of interesting points regarding this:
Both forward slashes (/) and backslashes (\) appear to count, and seem to be completely interchangeable, in this special mode.
This does not appear to happen at all in groovyConsole or in actual Groovy files.
Putting any whitespace before the opening slash character causes groovysh to interpret it correctly, but only if the opening slash is a forward slash, not a backslash.
So, I personally expect that this is just a quirk of groovysh, either a bug or some under-documented feature I haven't heard about.

Why is "do" allowed inside a function?

I noticed that the following code compiles and works in VS 2013:
let f() =
    do Console.WriteLine(41)
But when looking at the F# 3.0 specification I can't find any mention of do being used this way. As far as I can tell, do can have the following uses:
As part of a loop (e.g. while expr do expr done); that's not the case here.
Inside computation expressions, e.g.:
seq {
    for i in 1..2 do
        do Console.WriteLine(i)
        yield i * 2
}
That's not the case here either, f doesn't contain any computation expressions.
Though what confuses me here is that according to the specification, do should be followed by in. That in should be optional due to lightweight syntax, but adding it here causes a compile error (“Unexpected token 'in' or incomplete expression”).
Statement inside a module or class. This is also not the case here, the do is inside a function, not inside a module or a class.
I also noticed that with #light "off", the code doesn't compile (“Unexpected keyword 'do' in binding”), but I didn't find anything that would explain this in the section on lightweight syntax either.
Based on all this, I would assume that using do inside a function this way should not compile, but it does. Did I miss something in the specification? Or is this actually a bug in the compiler or in the specification?
From the documentation on MSDN:
A do binding is used to execute code without defining a function or value.
Even though the spec doesn't contain a comprehensive list of the places it is allowed, it is merely an expression asserted to be of type unit. Some examples:
if ((do ()); true) then ()
let x: unit = do ()
It is generally omitted. Each of the preceding examples is valid without do. Therefore, do serves only to assert that an expression is of type unit.
Going through the F# 3.0 specification, the expression syntax has do expr as a choice of class-function-or-value-defn (types) [Ch 8, A.2.5] and module-function-or-value-defn (modules) [Ch 10, A.2.1.1].
I don't actually see in the spec where function-defn can have more than one expression, as long as all but the last one evaluate to unit -- or that all but the last expression are ignored in determining the function's return value.
So, it seems this is an oversight in the documentation.

Parsing optional semicolon at statement end

I am writing a parser for a C-like grammar.
So far, it can parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
The newline is handled by a lexer that I wrote myself (the code is simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
The :PASS instruction here is equivalent to returning nothing in LEX: it drops the currently matched text and moves on to the next token, just like what is usually done with whitespace.
Because of this, I can't simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
end_of_stmt: ';'
| '\n'
So I chose to have the parser change the lexer's state dynamically.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And add a lexer rule that can match new line with the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
This means that when the lexer is in the :STATEMENT_END state, it first increases the line number as usual, then resets the state to the initial one, and then pretends the newline is a semicolon.
Strangely, it doesn't actually work with the following code:
a = 1
b = 2
I debugged it and found that the lexer does not actually return a ';' as expected when it scans the newline after the number 1; the state-specific rule is never executed.
The code that sets the new state is executed only after the lexer has already scanned the newline and returned nothing. In other words, things happen in the following order:
scan a, = and 1
scan newline and skip, so get the next value b
the inserted code({ state = :STATEMENT_END }) is executed
raising error -- unexpected b here
This is what I expect:
scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ; according to the rule matching the new state
continue to scan & parse the following line
After some investigation I found that this is probably because YACC uses LALR(1): the parser reads one token ahead first. When the lexer scans the newline, the state has not been set yet, so it cannot return the correct token.
My question is: how can I make it work as expected? I have no idea how to approach this.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parentheses, so you end up needing an additional state var to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or deferring the return of the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis depth (+1 for (, -1 for )), and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (ID or constant or ) or postfix-only operator), return the extra ;
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (it's now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
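The lexer-only approach described here can be sketched as follows. This is a minimal illustration in Python rather than the asker's own lexer DSL; the token set, the `tokenize` function, and the `can_end_expression` helper are assumptions made for the example, and postfix-only operators are left out.

```python
import re

# Simplified, assumed token set for a C-like language.
TOKEN_RE = re.compile(r"""
    (?P<newline>\r\n|\r|\n)
  | (?P<ws>[ \t]+)
  | (?P<number>\d+)
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<punct>[-+*/=();,])
""", re.VERBOSE)


def can_end_expression(token):
    """True for tokens that may legally be the last token of an expression."""
    kind, text = token
    return kind in ("number", "ident") or text == ")"


def tokenize(source):
    """Return (kind, text) tokens, inserting ';' at statement-ending newlines."""
    tokens = []
    depth = 0    # current parenthesis nesting level
    last = None  # last token handed to the parser
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "ws":
            continue
        if kind == "newline":
            # Only a newline outside parentheses, after a complete
            # expression, is turned into a semicolon; others are skipped.
            if depth == 0 and last is not None and can_end_expression(last):
                last = ("punct", ";")
                tokens.append(last)
            continue
        if text == "(":
            depth += 1
        elif text == ")":
            depth -= 1
        last = (kind, text)
        tokens.append(last)
    return tokens
```

Because the decision depends only on what the lexer has already emitted, it is unaffected by whether the parser has read its lookahead token yet. Note that this sketch resolves a line ending in an infix operator as a continuation, and it only inserts a semicolon at a newline, not at end of input.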
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.
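The guessed technique can be sketched at the level of whole lines. This is only an illustration of the guess above, not how Ruby's lexer actually works: the `split_statements` function and the operator list are hypothetical, and the sketch ignores string literals and comments.

```python
# Assumed set of tokens that cannot end a statement when they end a line.
BINARY_OPERATORS = ("+", "-", "*", "/", "=", ",")
OPENERS = "([{"
CLOSERS = ")]}"


def split_statements(source):
    """Join physical lines into statements, treating some newlines as spaces."""
    statements = []
    pending = []  # physical lines of the statement being assembled
    depth = 0     # bracket nesting level across the pending lines
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        pending.append(stripped)
        depth += sum(stripped.count(c) for c in OPENERS)
        depth -= sum(stripped.count(c) for c in CLOSERS)
        # The statement is "obviously not finished": eat the newline.
        if depth > 0 or stripped.endswith(BINARY_OPERATORS):
            continue
        statements.append(" ".join(pending).rstrip(";").rstrip())
        pending = []
    if pending:
        # Input ended mid-statement; return what was collected.
        statements.append(" ".join(pending))
    return statements
```

With this sketch, the three-statement input "a = 1", "b = (2 +" / "3)", "c = 4 +" / "5" comes back as three joined statements, since the second and third only end once their brackets balance or the trailing operator has an operand.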

"do" in "do Application.Run(form)" sentence

What is the difference between
do Application.Run(form)
and, simply:
Application.Run(form) ?
What is the role of do keyword in the first sentence?
Whereas 'do' was a required keyword in many places in the language in some of the earlier releases, nowadays you rarely need 'do'. The remaining exceptions that I can think of are that 'do' is still part of loop syntax (e.g. "while e1 do e2") and that, if you want to put an assembly-level attribute or an attribute on the startup method, you can put the attribute before the explicit 'do' of a final code block in a module. Oftentimes in F# samples you'll see
do Application.Run(form)
as the last two lines of a file, and I think the 'do' is still required there in order to be able to attach the attribute on the line above it.
I think it's just a holdover - like how you can still CALL a sub or SET a variable instead of just doing those things directly, as in:
SET varname = 5
CALL mysub()
Versus just:
varname = 5
In other words, I don't think it matters, and the compiler just discards it.
