Error: Unexpected infix operator in expression, about a successfully compiled prefix operator - f#

Playing around a little bit with infix operators, I was surprised about the following:
let (>~~~) = function null -> String.Empty | s -> s // compiles fine, see screenshot
match >~~~ input with .... // error: Unexpected infix operator in expression
Changing the first characters of the prefix operator (to !~~~ for instance) fixes it. That I get an error that the infix operator is unexpected is rather weird. Hovering shows the definition to be string -> string.
I'm not too surprised about the error: F# requires (iirc) that the first character of a prefix operator be one of the predefined prefix operators. But why does the definition compile just fine, while the compiler complains only when I use it?
Update: the F# compiler seems to know in other cases just fine when I use an invalid character in my operator definition, it says "Invalid operator definition. Prefix operator definitions must use a valid prefix operator name."

The rules for custom operators in F# are quite tight, so even though you can define custom operators, there are a lot of rules about how they will behave and you cannot change those. In particular:
Only some operators (mainly those with ! and ~) can be used as prefix operators. With ~ you can also overload unary operators +, -, ~ and ~~, so if you define an operator named ~+., you can then use it as e.g. +. 42.
Other operators (including those starting with >) can only be used as infix. You can turn any operator into an ordinary function using parentheses, which is why e.g. (+) 1 2 is valid.
The ? symbol is special (it is used for dynamic invocation) and cannot appear as the first symbol of a custom operator.
I think the most intuitive way of thinking about this is that custom operators will behave like standard F# operators, but you can add additional symbols after the standard operator name.


How is the conditional operator parsed?

So, the cppreference claims:
The expression in the middle of the conditional operator (between ? and :) is parsed as if parenthesized: its precedence relative to ?: is ignored.
However, it appears to me that the part of the expression after the ':' operator is also parsed as if it were between parentheses. I've tried to implement the ternary operator in my programming language (and you can see the results of parsing expressions here), and my parser pretends that the part of the expression after ':' is also parenthesized. For example, for the expression (1?1:0?2:0)-1, the interpreter for my programming language outputs 0, and this appears to be compatible with C. For instance, the C program:
#include <stdio.h>
int main() {
    printf("%d\n", (1?1:0?2:0)-1);
}
Outputs 0.
Had I programmed the parser of my programming language so that, when parsing the ternary operator, it simply took the first already-parsed node after ':' as the third operand to '?:', it would output the same as ((1?1:0)?2:0)-1, that is 1.
My question is whether this approach (pretending that the expression after the ':' is parenthesized) would always be compatible with C.
"Pretends that it is parenthesised" is some kind of description of operator precedence. But of course that has to be interpreted relative to precedence relations (including associativity). So in a-b*c and a*b-c, the subtraction effectively acts as though its arguments are parenthesised; only the left-hand argument is treated that way in a-b-c; and it is the comparison operator which causes grouping in a<b-c and a-b<c.
I'm sure you know all that since your parser seems to work for all these cases, but I say that because the ternary operator is right-associative and of lower precedence than any other operator [Note 1]. That means that the pseudo-parentheses imposed by operator precedence surround the right-hand argument (regardless of its dominating operator, since all operators have higher precedence), and also the left-hand argument unless its dominating operator is another conditional operator. But that wouldn't be the case in C, where the comma operator has lower precedence and would not be enclosed by the imaginary parentheses following the :.
It's important to understand what is meant by the precedence of a complex operator. In effect, to compute the precedence relations we first collapse the operator to a simple ?: which includes the enclosed (second) argument. This is not "as if the expression were parenthesized", because it is parenthesized. It is parenthesized between ? and :, which in this context are syntactically parenthetic.
In this sense, it is very similar to the usual analysis of the subscript operator as a postfix operator, although the brackets of the subscript operator enclose a second argument. The precedence of the subscript operator is logically what would result from considering it to be a single [], abstracting away the expression contained inside. This is also the same as the function call operator. That happens to be written with parentheses, but the precise symbols are not important: it is possible to imagine an alternative language in which function calls are written with different symbols, perhaps { and }. That wouldn't affect the grammar at all.
It might seem odd to think of ? and : as "parenthetic", since they don't look parenthetic. But a parser doesn't see the shapes of the symbols. It is satisfied by being told that a ( is closed by a ) and, in this case, that a ? is closed by a :. [Note 2]
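To make the grouping concrete, here is a minimal recursive-descent sketch in Python. It is not the asker's parser and not a C parser: it assumes a toy language with only integer literals, parentheses, +, -, * and ?:, and the names tokenize, Parser, parse and evaluate are made up for illustration. The point is only that the operand after ':' is parsed by calling the conditional rule again, which is what makes the operator right-associative.

```python
import re


def tokenize(text):
    # Integers and the single-character symbols of the toy language.
    return re.findall(r"\d+|[-+*?:()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def conditional(self):
        condition = self.additive()
        if self.peek() != "?":
            return condition
        self.take("?")
        middle = self.conditional()  # enclosed between ? and :
        self.take(":")
        last = self.conditional()    # recursion here gives right-associativity
        return ("?:", condition, middle, last)

    def additive(self):
        node = self.multiplicative()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.multiplicative())
        return node

    def multiplicative(self):
        node = self.primary()
        while self.peek() == "*":
            op = self.take()
            node = (op, node, self.primary())
        return node

    def primary(self):
        if self.peek() == "(":
            self.take("(")
            node = self.conditional()
            self.take(")")
            return node
        token = self.take()
        if not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)


def parse(text):
    parser = Parser(tokenize(text))
    node = parser.conditional()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return node


def evaluate(node):
    if isinstance(node, int):
        return node
    if node[0] == "?:":
        _, condition, middle, last = node
        return evaluate(middle) if evaluate(condition) else evaluate(last)
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a - b if op == "-" else a * b


print(evaluate(parse("(1?1:0?2:0)-1")))    # 0
print(evaluate(parse("((1?1:0)?2:0)-1")))  # 1
```

The two printed values correspond to the two groupings discussed in the question. The comma and assignment operators are deliberately left out, so this says nothing about the C-specific interactions mentioned in Note 1.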
Having said all that, I tried your compiler on the conditional expression
d = 0 ? 0 : n / d
It parses this expression correctly, but the compiled code computes n / d before verifying whether d = 0 is true. That's not the way the conditional operator should work; in this case, it will lead to an unexpected divide by 0 exception. The conditional operator must first evaluate its left-hand argument, and then evaluate exactly one of the other two expressions.
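The evaluation-order requirement can be sketched with a tiny tree-walking evaluator in Python. This is an illustration under assumptions, not the asker's compiler: the tuple-based tree shape, the evaluate function and the env dictionary are all hypothetical, and = is treated as a comparison as in the example above.

```python
def evaluate(node, env):
    """Evaluate a tiny expression tree; env maps variable names to integers."""
    if isinstance(node, int):
        return node
    if isinstance(node, str):
        return env[node]
    op = node[0]
    if op == "?:":
        _, condition, if_true, if_false = node
        # Evaluate the condition first, then exactly one of the two branches.
        chosen = if_true if evaluate(condition, env) else if_false
        return evaluate(chosen, env)
    left = evaluate(node[1], env)
    right = evaluate(node[2], env)
    if op == "=":
        return int(left == right)
    if op == "/":
        return left // right
    raise ValueError(f"unknown operator {op!r}")


# d = 0 ? 0 : n / d
safe_divide = ("?:", ("=", "d", 0), 0, ("/", "n", "d"))

print(evaluate(safe_divide, {"n": 10, "d": 0}))  # 0, the division is never attempted
print(evaluate(safe_divide, {"n": 10, "d": 2}))  # 5
```

An evaluator that computed both branches before looking at the condition would raise ZeroDivisionError on the first call.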
Notes:
1. In C, this is not quite correct. The comma operator has lower precedence, and there is a more complex interaction with assignment operators, which logically have the same precedence and are also right-associative.
2. In C-like languages those symbols are not used for any other purpose, so it's OK to just regard them as strange-looking parentheses and leave it at that. But as the case of the function-call operator shows (or, for that matter, the unary - operator), it is sometimes possible to reuse operator symbols for more than one purpose.
As a curiosity, it is not strictly necessary that open and close parentheses be different symbols, as long as they are not used for any other purpose. So, for example, if | is not used as an operator symbol (as it is in C), then you could use | a | to mean the absolute value of a without creating any ambiguities.
A precise analysis of the circumstances in which symbol reuse leads to actual ambiguities is beyond the scope of this answer.

How can I check whether the given expression is an infix expression, postfix expression or prefix expression?

I need algorithms that will check whether given expression is infix, postfix or prefix expression.
I have tried checking the first and last characters of the string, e.g.:
+AB: if there is an operator at the very first index of the string, then it's prefix
AB+: if there is an operator at the very last index of the string, then it's postfix
otherwise it is infix.
But this doesn't feel adequate, so kindly suggest a better algorithm.
1. If it starts with a valid infix operator it's prefix, unless you're going to allow unary operators.
2. If it ends with a valid postfix operator it's postfix.
3. Otherwise it is either infix or invalid.
Note that (3) includes the case you mentioned in comments of an expression in parentheses. There are no parentheses in prefix or postfix. That's why they exist. (3) also includes the degenerate case of a single term, e.g. 1, but in that case it doesn't matter how you parse it.
You can only detect an invalid expression by parsing it fully.
If you're going to allow unary operators in infix notation I can only suggest that you try all three parses and stop when you get a success. Very possibly this is the strategy you should follow anyway.
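The "try all three parses" strategy can be sketched in Python as follows. This is only an illustration under simplifying assumptions: every token is a single character, operands are single letters or digits, the only operators are the binary + - * /, and parentheses are accepted only in infix. The function names are made up for this example.

```python
OPERATORS = set("+-*/")


def tokenize(text):
    # Assumption: every non-space character is one token.
    return [ch for ch in text if not ch.isspace()]


def is_operand(token):
    return token.isalnum()


def is_postfix(tokens):
    # Simulate the evaluation stack, tracking only its depth.
    depth = 0
    for token in tokens:
        if is_operand(token):
            depth += 1
        elif token in OPERATORS:
            if depth < 2:
                return False
            depth -= 1
        else:
            return False
    return depth == 1


def is_prefix(tokens):
    # With binary operators only, a string is valid prefix exactly when
    # its reversal is valid postfix.
    return is_postfix(tokens[::-1])


def is_infix(tokens):
    pos = 0

    def term():
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1
            if not expression():
                return False
            if pos < len(tokens) and tokens[pos] == ")":
                pos += 1
                return True
            return False
        if pos < len(tokens) and is_operand(tokens[pos]):
            pos += 1
            return True
        return False

    def expression():
        nonlocal pos
        if not term():
            return False
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            pos += 1
            if not term():
                return False
        return True

    return expression() and pos == len(tokens)


def classify(text):
    tokens = tokenize(text)
    checks = (("prefix", is_prefix), ("postfix", is_postfix), ("infix", is_infix))
    return [name for name, check in checks if check(tokens)]


print(classify("+AB"))      # ['prefix']
print(classify("AB+"))      # ['postfix']
print(classify("(A+B)*C"))  # ['infix']
print(classify("A"))        # ['prefix', 'postfix', 'infix']
print(classify("+A+"))      # []
```

Returning a list rather than a single answer makes the degenerate single-term case and the invalid case visible. Supporting unary operators or multi-character tokens would need a real tokenizer and changes to all three checks.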
Check the first elements of the string:
1- If the first element is an operator, then it is a prefix expression.
2- Else, check the second element; if it is an operator, then it is an infix expression.
3- Else, it is a postfix expression.

How to resolve ambiguity in the definition of an LR(1) grammar?

I am writing a Golang compiler in OCaml, and argument lists are causing me a bit of a headache. In Go, you can group consecutive parameter names of the same type in the following way:
func f(a, b, c int) === func f(a int, b int, c int)
You can also have a list of types, without parameter names:
func g(int, string, int)
The two styles cannot be mix-and-matched; either all parameters are named or none are.
My issue is that when the parser sees a comma, it doesn't know what to do. In the first example, is a the name of a type or the name of a variable with more variables coming up? The comma has a dual role and I am not sure how to fix this.
I am using the Menhir parser generator tool for OCaml.
Edit: at the moment, my Menhir grammar follows exactly the rules as specified at http://golang.org/ref/spec#Function_types
As written, the Go grammar is not LALR(1). In fact, it is not LR(k) for any k. It is, however, unambiguous, so you could successfully parse it with a GLR parser, if you can find one (I'm pretty sure that there are several GLR parser generators for OCaml, but I don't know enough about any of them to recommend one).
If you don't want to (or can't) use a GLR parser, you can do it the same way Russ Cox did in the gccgo compiler, which uses bison. (bison can generate GLR parsers, but Cox doesn't use that feature.) His technique does not rely on the scanner distinguishing between type-names and non-type-names.
Rather, it just accepts parameter lists whose elements are either name_or_type or name name_or_type (actually, there are more possibilities than that, because of the ... syntax, but it doesn't change the general principle.) That's simple, unambiguous and LALR(1), but it is overly-accepting -- it will accept func foo(a, b int, c), for example -- and it does not produce the correct abstract syntax tree because it doesn't attach the type to the list of parameters being declared.
What that means is that once the argument list is fully parsed and is about to be inserted into the AST attached to some function declaration (for example), a semantic scan is performed to fix it up and, if necessary, produce an error message. That scan is done right-to-left over the list of declaration elements, so that the specified type can be propagated to the left.
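As a rough illustration of that right-to-left fix-up (not the actual gccgo code), here is a Python sketch. It assumes the parser has already produced one (first, second) pair per list element, where second is None for a bare name_or_type; the function resolve_params and this representation are hypothetical, and the ... syntax and qualified type names are ignored.

```python
def resolve_params(entries):
    """Turn loosely parsed parameter entries into (name, type) pairs.

    Each entry is (first, second): ("int", None) for a bare name_or_type,
    or ("b", "int") for a name followed by a type.
    """
    # No entry carries both a name and a type: it is a list of bare types.
    if all(second is None for _, second in entries):
        return [(None, first) for first, _ in entries]

    resolved = []
    current_type = None
    # Scan right to left so each type propagates to the names on its left.
    for first, second in reversed(entries):
        if second is not None:
            current_type = second
        elif current_type is None:
            raise SyntaxError(f"parameter '{first}' has no type")
        resolved.append((first, current_type))
    resolved.reverse()
    return resolved


# func f(a, b, c int)
print(resolve_params([("a", None), ("b", None), ("c", "int")]))
# func g(int, string, int)
print(resolve_params([("int", None), ("string", None), ("int", None)]))
# func foo(a, b int, c) is accepted by the loose grammar but rejected here
try:
    resolve_params([("a", None), ("b", "int"), ("c", None)])
except SyntaxError as error:
    print("error:", error)
```

The same pass is the natural place to report the "either all parameters are named or none are" violation, since a trailing bare element in a named list has no type to inherit.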
It's worth noting that the grammar in the reference manual is also overly-accepting, because it does not express the constraint that "either all parameters are named or none are". That constraint could be expressed in an LR(1) grammar -- I'll leave that as an exercise for readers -- but the resulting grammar would be a lot more difficult to understand.
You don't have ambiguity. The fact that the standard Go parser is LALR(1) proves that.
"is a the name of a type or the name of a variable with more variables coming up?"
So basically your grammar and the parser as a whole should be completely disconnected from the symbol table; don't be C – your grammar is not ambiguous therefore you can check the type name later in the AST.
These are the relevant rules (from http://golang.org/ref/spec); they are already correct.
Parameters = "(" [ ParameterList [ "," ] ] ")" .
ParameterList = ParameterDecl { "," ParameterDecl } .
ParameterDecl = [ IdentifierList ] [ "..." ] Type .
IdentifierList = identifier { "," identifier } .
I'll explain them to you:
IdentifierList = identifier { "," identifier } .
The curly braces represent the Kleene closure (in POSIX regular expression notation it's the asterisk). This rule says "an identifier name, optionally followed by a literal comma and an identifier, optionally followed by a literal comma and an identifier, etc… ad infinitum".
ParameterDecl = [ IdentifierList ] [ "..." ] Type .
The square brackets denote optionality: the enclosed part may or may not be present (in POSIX regular expression notation it's the question mark). So you have "maybe an IdentifierList, followed by maybe an ellipsis, followed by a type".
ParameterList = ParameterDecl { "," ParameterDecl } .
You can have several ParameterDecl in a list like e.g. func x(a, b int, c, d string).
Parameters = "(" [ ParameterList [ "," ] ] ")" .
This rule defines that a ParameterList is optional, must be surrounded by parentheses, and may include an optional final comma literal, useful when you write something like:
func x(
    a, b int,
    c, d string, // <- note the final comma
)
The Go grammar is portable and can be parsed by any bottom-up parser with one token of lookahead.
Edit regarding "don't be C": I said this because C is context-sensitive and the way they solve this problem in many (all?) compilers is by wiring the symbol table to the lexer and lexing tokens differently depending on if they are defined as type names or variables. This is a hack and should not be done for unambiguous grammars!

Why does f# dot operator have such a low precedence

The precedence of F#'s member selection dot (.) operator as used in
System.Console.WriteLine("test")
has a lower precedence than [space] such that the following
ignore System.Console.WriteLine("test")
must be written explicitly as
ignore (System.Console.WriteLine("test"))
though this would be the intuition from the notion of juxtaposed symbols. Having used CoffeeScript, I can appreciate how intuitive precedence can serve to de-clutter code.
Are there any efforts being made to rationalize this kerfuffle, perhaps something along the lines that incorporated the "lightweight" syntax of the early years?
==============
Upon review, the culprit is not the "." operator but the invocation operator "()", as in "f()". So, given:
type C() = class end
then the following intuitive syntax fails:
printfn "%A" C() <-- syntax error FS0597
and must be written thus (as prescribed by the documentation):
printfn "%A" (C()) <-- OK
It seems intuitive that a string of symbols unbroken by white space should implicitly represent a block. In fact, the utility of juxtaposing is to create such a block.
a b.c is parsed as a (b.c), not (a b).c. So there are no efforts to rationalize this - it simply is not true.
Thanks to all those who responded.
My particular perplexity stemmed from treating () as an invocation operator. As an eager evaluation language, F# does not have or need such a thing. Instead, this is an expression boundary, as in (expression). In particular, () bounds the empty expression, which is the only value of the type unit. Consequently, () is the stipulation of a value and not a direction to resolve the associated function (though that is the practical consequence when parameters are provided to functions, due to F#'s eager evaluation).
As a result, the following expression
ignore System.Console.WriteLine("test")
actually surfaces three distinct values,
ignore System.Console.WriteLine ("test")
which are interpreted according to the left-to-right evaluation order of F# (which then permits partial function application and perhaps other things)
( ignore System.Console.WriteLine ) ("test")
...but the result of (ignore expr) will be unit, which does not expect a parameter. Hence, syntax error (strong typing, yea!). So, an expression boundary is required. In particular,
ignore ( System.Console.WriteLine ("test") )
or
ignore (System.Console.WriteLine "test")
or
ignore <| System.Console.WriteLine "test"
or
System.Console.WriteLine "test" |> ignore

Defining Function Signatures in a Simple Language Grammar

I am currently learning how to create a simple expression language using Irony. I'm having a little bit of trouble figuring out the best way to define function signatures, and determining whose responsibility it is to validate the input to those functions.
So far, I have a simple grammar that defines the basic elements of my language. This includes a handful of binary operators, parentheses, numbers, identifiers, and function calls. The BNF for my grammar looks something like this:
<expression> ::= <number> | <parenexp> | <binexp> | <fncall> | <identifier>
<parenexp> ::= ( <expression> )
<fncall> ::= <identifier> ( <argumentlist> )
<binexp> ::= <expression> <binop> <expression>
<binop> ::= + | - | * | / | %
... the rest of the grammar definition
Using the Irony parser, I am able to validate the syntax of various input strings to make sure they conform to this grammar:
x + y / z * AVG(a + b, p) -> Valid Syntax
x +/ AVG(x -> Invalid Syntax
All that is well and good, but now I want to go a step further and define the available functions, along with the number of parameters that each function requires. So for example, I want to have a function FOO that accepts one parameter and BAR that accepts two parameters:
FOO(a + b) * BAR(x + y, p + q) -> Valid
FOO(a + b, 13) -> Invalid
When the second statement is parsed, I'd like to be able to output an error message that is aware of the expected input for this function:
Too many arguments specified for function 'FOO'
I don't actually need to evaluate any of these statements, only validate the syntax of the statements and determine if they are valid expressions or not.
How exactly should I be doing this? I know that technically I could simply add the functions to the grammar like so:
<foofncall> ::= FOO( <expression> )
<barfncall> ::= BAR( <expression>, <expression> )
But something about this doesn't feel quite right. To me it seems like the grammar should only define a generic function call, and not every function available to the language.
How is this typically accomplished in other languages?
What are the components called that should handle the responsibilities of analyzing the basic syntax of the language grammar versus the more specific elements like function definitions? Should both responsibilities be handled by the same component?
While you can do typechecking directly in the grammar so that it's enforced in the parser, it's generally a bad idea to do so. Instead, the parser should just parse the basic syntax, and separate typechecking code should be used for typechecking.
In the normal case of a compiler, the parser just produces an abstract syntax tree or some equivalent representation of the program. Then, a typechecking pass is run over the AST that ensures all types match appropriately -- ensures that functions have the right number of arguments and those arguments have the right type, as well as ensuring that variables have the right type for what is assigned to them and how they are used.
Besides being generally simpler, this usually allows you to give better error messages -- instead of just 'Invalid', you can say 'too many arguments to FOO' or what have you.
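A separate checking pass of that kind might look roughly like the following Python sketch. It does not use Irony's API: the tree is modelled with plain tuples, and FUNCTION_ARITY, check_calls and the node shapes are assumptions made up for this illustration.

```python
# Hypothetical table of known functions and their parameter counts.
FUNCTION_ARITY = {"FOO": 1, "BAR": 2}


def check_calls(node, errors=None):
    """Walk the tree and collect arity errors for function calls.

    Assumed node shapes:
      ("num", 13), ("id", "a"),
      ("binop", "+", left, right),
      ("call", "FOO", [argument, ...])
    """
    if errors is None:
        errors = []
    kind = node[0]
    if kind == "call":
        _, name, args = node
        expected = FUNCTION_ARITY.get(name)
        if expected is None:
            errors.append(f"Unknown function '{name}'")
        elif len(args) > expected:
            errors.append(f"Too many arguments specified for function '{name}'")
        elif len(args) < expected:
            errors.append(f"Too few arguments specified for function '{name}'")
        for arg in args:
            check_calls(arg, errors)
    elif kind == "binop":
        _, _, left, right = node
        check_calls(left, errors)
        check_calls(right, errors)
    return errors


# FOO(a + b, 13)
bad_call = ("call", "FOO", [("binop", "+", ("id", "a"), ("id", "b")), ("num", 13)])
print(check_calls(bad_call))
# ["Too many arguments specified for function 'FOO'"]
```

The grammar stays generic (any identifier followed by an argument list is a call), and adding a function only means adding an entry to the table.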
