In the PowerShell Language Specification, the grammar has a term called "Primary Expression".
primary-expression:
value
member-access
element-access
invocation-expression
post-increment-expression
post-decrement-expression
Semantically, what is a primary expression intended to describe?
As I understand it there are two parts to this:
Formal grammars break up things like expressions so operator precedence is implicit.
Eg. if the grammar had
expression:
value
expression * expression
expression + expression
…
There would need to be a separate mechanism to define * as having higher precedence than +. This
becomes significant when using tools to directly transform the grammar into a tokeniser/parser.1
There are so many different specific rules (consider the number of operators in PowerShell) that using fewer rules would be harder to understand because each rule would be so long.
1 I suspect this is not the case with PowerShell because it is highly context sensitive (express vs command mode, and then consider calling non-inbuilt executables). But such grammars across languages tend to have a lot in common, so style can also be carried over (don't re-invent the wheel).
My understaning of it is that an expression is an arrangement of commands/arguments and/or operators/operands that, taken together will execute and effect an action or produce a result (even if it's $null) than can be assigned to a variable or send down the pipeline. The term "primary expression" is used to differentiate the whole expression from any sub-expressions $() that may be contained within it.
My very limited experience says that primary refers to expressions (or sub-expressions) that can't be parsed in more than one way, regardless of precedence or associativity. They have enough syntactic anchors that they are really non-negotiable--ID '(' ID ')' for example, can only match that. The parens make it guaranteed, and there are no operators to allow for any decision on precedence.
In matching an expression tree, it's common to have a set of these sub-expressions under a "primary_expr" rule. Any sub-expressions can be matched any way you like because they're all absolutely determined by syntax. Everything after that will have to be ordered carefully in the grammar to guarantee correct precedence and associativity.
There's some support for this interpretation here.
think $( lots of statements ) would be like a complex expression
and in powershell most things can be an expression like
$a = if ($y) { 1 } else { 2}
so primary expression is likely the lowest common denominator of classic expressions
a single thing that returns a value
whether an explicit value, calling something.. getting a variable, or a property from a variable $x.y or the result of increment operation $z++
but even a simple math "expression" wouldn't match this 4 +2 + (3 / 4 ) , that would be more of a complex expression.
So I was thinking of the WHY behind it, and at first it was mentioned that it could be to help determine command/expression mode, but with investigation that wasn't it. Then i thought maybe it was what could be passed in command mode as an argument explicitly, but thats not it, because if you pass $x++ as a parameter to a cmdlet you get a string.
I think its probably just the lowest common denominator of expressions, a building block, so the next question is what other expressions does the grammar contain, and where is this one used?
Related
Suppose that there was a programming language Mod(C) just like C++ except that it was white-space sensitive.
That is, parsers and compilers written for Mod(C) did not ignore line-feeds, spaces, etc...
Also suppose that someone had already written down the production rules for describing a formal grammar for this modified version of C++
My question is, how would you modify the production rules so that semi-colons were optional in the event that the semi-colon was followed by a line-break?
Actually, the semi-colon would optional if some <optional_semi_colon> token is followed by:
one or more spaces and tabs
zero or one line-comments
a line-break (\n\r or \r\n or \n or \r)
The following piece of code would compile just fine:
#include <iostream>
using namespace std;
int main() {
for (int i = 1; i <= 5; ++i) {
cout << i << " "; // there is a semi-colon here. that's okay
}
return 0 // no semi-colon on this line
}
It's not at all clear what you mean by "white-space sensitive" and your example doesn't show any white-space sensitivity other than the possible optionality of semicolons at the end of a line. That is, you don't seem to be looking for an implementation of the off-side rule, as with Python or Haskell, where indentation indicates block structure. [Note 1]
Presumably, you are not just asking how to turn any newline into a semicolon, since that would be trivial. So I assume that you want something like JavaScript's automatic semicolon insertion (ASI), which automatically inserts a semicolon at the end of a line if:
the line doesn't already end with a newline, and
the current parse cannot be extended with the first token of the next line, either because
2.1. there is no item in the current state's itemset which would allow the next token to be shifted, or
2.2. the items which might allow a shift have been marked as not allowing a newline.
[Note 2]
Provision 2.1 prevents the incorrect semicolon insertion in:
let x = a
+ b
On the other hand, there are cases when you really want the newline to end the statement, in order to avoid silent bugs, such as:
yield
/* The above yield always produces undefined. */
console.log("We've been resumed");
yield -- and other statements with optional operands -- are annotated in the grammar with [No LineTerminator here], which triggers provision 2.2 and thus allows ASI even though the next token (console in the above example) could otherwise have extended the parse. Another context where newlines are banned is between a value and the postfix ++ operator, so that
let v = 2 * a
++v
is accepted with the (presumably) intended meaning. (Without ASI, there would be a syntax error after the ++, where ASI is not allowed because there's no newline character.)
JavaScript is not the only language with optional command terminators, but it's probably the language with the most elaborate set of rules controlling the parse. Other languages include:
Python, which in addition to the off-side rule, ignores semicolons inside parentheses, braces and brackets;
Awk, which treats newlines as statement terminators unless the newline follows one of the tokens ,, {, ?, :, ||, &&, do, or else. [Note 3]
Bash, in which a newline is a command terminator except in specific contexts, such as after a |, || or && operator or a keyword like do, then and else, and inside an array literal or an arithmetic expansion. [Note 4]
Kotlin, whose rules I don't know. And undoubtedly other languages as well.
JavaScript (and, I believe, Kotlin) suffer from an ambiguity with function calls, because
let a = b
((x)=>console.log(x))(42)
is parsed as calling b with the argument (x)=>console.log(x) and then calling the result with the argument 42. (Which is a runtime error, because the result of console.log is undefined, which cannot be called.)
There are also languages like Lua, in which semicolons are always optional, even if you run statements together on a single line (a = 3 b = 4), and which therefore also suffers from the misinterpretation of function call expressions. However, unlike JavaScript, Lua requires that the ( be on the same line as the function expression, and therefore flags the equivalent of the above example as a syntax error. (The check is not part of the grammar, for what it's worth: After the function call is parsed, a semantic check is performed to verify that the line number of the ( token is the same as the line number of the last token in the function expression.)
I went to the trouble of enumerating all of the above examples by way of illustrating the fact that optional semicolons are not a simple grammatical transformation, and that there is no simple rule which can determine the precise circumstances in which a newline is an implicit statement terminator. Realistic implementations of the feature are non-trivial, and differ in their details; the algorithm chosen needs to be tested against a variety of realistic code samples, and it needs to be carefully documented so that programmers using the language don't find themselves surprised by the results. If you get it wrong, but your language nonetheless becomes popular, you'll find projects with style guides which require semicolons even in context in which they were optional. None of that is intended to imply that you shouldn't pursue the idea; only that it is perhaps more complicated than it looks at first glance.
Having said all that, I don't believe that any of the above examples require a context-sensitive grammar (unless you want to implement the off-side rule). Even in the case of JavaScript, possibly with some minor exceptions, a parser can be created by starting with an LALR automaton and then adjusting the transition rules, state by state, in order to either ignore a newline token or reduce it to a statement terminator (as well as implementing the lookahead restrictions in certain rules). Most of these modifications will effectively be simply the deletion of one conflicting parser actions, similar to the operator-precedence-based resolution of ambiguous expression grammars. (And it's worth noting that most parser generators make no attempt to rewrite the original grammar after processing of the precedence declarations.)
However, while the existence of a PDA demonstrates the existence of a context-free grammar (at least for a context-free superset of the target language [Note 5]), it does not demonstrate the existence of a simple or elegant grammar. It seems to me likely that recreating a grammar from the modified PDA will produce a bloated monster without much value as a discursive tool. The modified PDA itself is sufficient to perform the parse, so reconstructing a grammar is not of much practical value.
Notes
That's perhaps just as well, because the off-side rule is not context-free, and thus cannot be implemented with a context free grammar. Although there are well-known techniques for implementing in a lexical scanner.
That's a slight oversimplification of the ASI rules. In some cases, JavaScript also allows a semicolon to be inserted before a }, even if there is no newline at that point. But that's not a significant complication.
? and : are a Gnu AWK extension.
That's not a complete list, by any means. See the Posix shell grammar and the bash manual for more details.
While the published grammars for the languages mentioned above, like the grammars for C and C++, are nominally context-free grammars, they do not actually encompass the entirety of the well-formedness rules in the respective language standards and/or manuals, which include constraints (or "early errors", in the terms of ECMA-262), "that can be detected and reported prior to the evaluation of any construct". Many of these rules are clearly context-sensitive (such as prohibitions on multiple definitions of the same name in a lexical scope).
Context-sensitive parsing is not necessarily a bad thing. Sometimes it's a lot simpler than trying to achieve the same result with a context-free grammar (as in the case of Lua function calls mentioned above). But it's certainly convenient to parse as much as possible using a generated parser, since such a parser can more easily be repurposed for other applications, such as linters, code browsers, syntax highlighters, and so on, not all of which need to be as precise as a compiler.
I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to keep this as accurate to as I can), there are situations where spaces cannot be inserted, for example "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace not being important everywhere else? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A simple validation function would simply check that the starting position of each token (available as p.lexpos(i) where p is the action function's parameter and i is the index of the token the the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a big of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator from its use as an absolute value expression (uncommon but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from their use in a function call. (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.
Context
I've recently come up with an issue that I couldn't solve by myself in a parser I'm writing.
This parser is a component in a compiler I'm building and the question is in regards to the expression parsing necessary in programming language parsing.
My parser uses recursive descent to parse expressions.
The problem
I parse expressions using normal regular language parsing rules, I've eliminated left recursion in all my rules but there is one syntactic "ambiguity" which my parser simply can't handle and it involves generics.
comparison → addition ( ( ">" | ">=" | "<" | "<=" ) addition )* ;
is the rule I use for parsing comparison nodes in the expression
On the other hand I decided to parse generic expressions this way:
generic → primary ( "<" arguments ">" ) ;
where
arguments → expression ( "," expression )* ;
Now because generic expressions have higher precedence as they are language constructs and not mathematical expressions, it causes a scenario where the generic parser will attempt to parse expressions when it shouldn't.
For example in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that and fail as it can't find the closing tag.
What is the solution to such a scenario? Especially in languages like C++ where generics can also have expressions in them if I'm not mistaken arr<1<2> might be legal syntax.
Is this a special edge case or does it require a modification to the syntax definition that im not aware of?
Thank you
for example in a<2 it will parse "a" as a primary element of the identifier type, immideatly afterwards find the syntax for a generic type, parse that and fail as it cant find the closing tag
This particular case could be solved with backtracking or unbounded lookahead. As you said, the parser will eventually fail when interpreting this as a generic, so when that happens, you can go back and parse it as a relational operator instead. The lookahead variant would be to look ahead when seeing a < to check whether the < is followed by comma-separated type names and a > and only go into the generic rule if that is the case.
However that approach no longer works if both interpretations are syntactically valid (meaning the syntax actually is ambiguous). One example of that would be x<y>z, which could either be a declaration of a variable z of type x<y> or two comparisons. This example is somewhat unproblematic since the latter meaning is almost never the intended one, so it's okay to always interpret it as the former (this happens in C# for example).
Now if we allow expressions, it becomes more complicated. For x<y>z it's easy enough to say that this should never be interpreted as two comparison as it makes no sense to compare the result of a comparison with something else (in many languages using relational operators on Booleans is a type error anyway). But for something like a<b<c>() there are two interpretations that might both be valid: Either a is a generic function called with the generic argument b<c or b is a generic function with the generic argument c (and a is compared to the result of calling that function). At this point it is no longer possible to resolve that ambiguity with syntactic rules alone:
In order to support this, you'll need to either check whether the given primary refers to a generic function and make different parsing decisions based on that or have your parser generate multiple trees in case of ambiguities and then select the correct one in a later phase. The former option means that your parser needs to keep track of which generic functions are currently defined (and in scope) and then only go into the generic rule if the given primary is the name of one of those functions. Note that this becomes a lot more complicated if you allow functions to be defined after they are used.
So in summary supporting expressions as generic arguments requires you to keep track of which functions are in scope while parsing and use that information to make your parsing decisions (meaning your parser is context sensitive) or generate multiple possible ASTs. Without expressions you can keep it context free and unambiguous, but will require backtracking or arbitrary lookahead (meaning it's LL(*)).
Since neither of those are ideal, some languages change the syntax for calling generic functions with explicit type parameters to make it LL(1). For example:
Java puts the generic argument list of a method before the method name, i.e. obj.<T>foo() instead of obj.foo<T>().
Rust requires :: before the generic argument list: foo::<T>() instead of foo<T>().
Scala uses square brackets for generics and for nothing else (array subscripts use parentheses): foo[T]() instead of foo<T>().
This page says "Prefix operators are usually right-associative, and postfix operators left-associative" (emphasis mine).
Are there real examples of left-associative prefix operators, or right-associative postfix operators? If not, what would a hypothetical one look like, and how would it be parsed?
It's not particularly easy to make the concepts of "left-associative" and "right-associative" precise, since they don't directly correspond to any clear grammatical feature. Still, I'll try.
Despite the lack of math layout, I tried to insert an explanation of precedence relations here, and it's the best I can do, so I won't repeat it. The basic idea is that given an operator grammar (i.e., a grammar in which no production has two non-terminals without an intervening terminal), it is possible to define precedence relations ⋖, ≐, and ⋗ between grammar symbols, and then this relation can be extended to terminals.
Put simply, if a and b are two terminals, a ⋖ b holds if there is some production in which a is followed by a non-terminal which has a derivation (possibly not immediate) in which the first terminal is b. a ⋗ b holds if there is some production in which b follows a non-terminal which has a derivation in which the last terminal is a. And a ≐ b holds if there is some production in which a and b are either consecutive or are separated by a single non-terminal. The use of symbols which look like arithmetic comparisons is unfortunate, because none of the usual arithmetic laws apply. It is not necessary (in fact, it is rare) for a ≐ a to be true; a ≐ b does not imply b ≐ a and it may be the case that both (or neither) of a ⋖ b and a ⋗ b are true.
An operator grammar is an operator precedence grammar iff given any two terminals a and b, at most one of a ⋖ b, a ≐ b and a ⋗ b hold.
If a grammar is an operator-precedence grammar, it may be possible to find an assignment of integers to terminals which make the precedence relationships more or less correspond to integer comparisons. Precise correspondence is rarely possible, because of the rarity of a ≐ a. However, it is often possible to find two functions, f(t) and g(t) such that a ⋖ b is true if f(a) < g(b) and a ⋗ b is true if f(a) > g(b). (We don't worry about only if, because it may be the case that no relation holds between a and b, and often a ≐ b is handled with a different mechanism: indeed, it means something radically different.)
%left and %right (the yacc/bison/lemon/... declarations) construct functions f and g. They way they do it is pretty simple. If OP (an operator) is "left-associative", that means that expr1 OP expr2 OP expr3 must be parsed as <expr1 OP expr2> OP expr3, in which case OP ⋗ OP (which you can see from the derivation). Similarly, if ROP were "right-associative", then expr1 ROP expr2 ROP expr3 must be parsed as expr1 ROP <expr2 ROP expr3>, in which case ROP ⋖ ROP.
Since f and g are separate functions, this is fine: a left-associative operator will have f(OP) > g(OP) while a right-associative operator will have f(ROP) < g(ROP). This can easily be implemented by using two consecutive integers for each precedence level and assigning them to f and g in turn if the operator is right-associative, and to g and f in turn if it's left-associative. (This procedure will guarantee that f(T) is never equal to g(T). In the usual expression grammar, the only ≐ relationships are between open and close bracket-type-symbols, and these are not usually ambiguous, so in a yacc-derivative grammar it's not necessary to assign them precedence values at all. In a Floyd parser, they would be marked as ≐.)
Now, what about prefix and postfix operators? Prefix operators are always found in a production of the form [1]:
non-terminal-1: PREFIX non-terminal-2;
There is no non-terminal preceding PREFIX so it is not possible for anything to be ⋗ PREFIX (because the definition of a ⋗ b requires that there be a non-terminal preceding b). So if PREFIX is associative at all, it must be right-associative. Similarly, postfix operators correspond to:
non-terminal-3: non-terminal-4 POSTFIX;
and thus POSTFIX, if it is associative at all, must be left-associative.
Operators may be either semantically or syntactically non-associative (in the sense that applying the operator to the result of an application of the same operator is undefined or ill-formed). For example, in C++, ++ ++ a is semantically incorrect (unless operator++() has been redefined for a in some way), but it is accepted by the grammar (in case operator++() has been redefined). On the other hand, new new T is not syntactically correct. So new is syntactically non-associative.
[1] In Floyd grammars, all non-terminals are coalesced into a single non-terminal type, usually expression. However, the definition of precedence-relations doesn't require this, so I've used different place-holders for the different non-terminal types.
There could be in principle. Consider for example the prefix unary plus and minus operators: suppose + is the identity operation and - negates a numeric value.
They are "usually" right-associative, meaning that +-1 is equivalent to +(-1), the result is minus one.
Suppose they were left-associative, then the expression +-1 would be equivalent to (+-)1.
The language would therefore have to give a meaning to the sub-expression +-. Languages "usually" don't need this to have a meaning and don't give it one, but you can probably imagine a functional language in which the result of applying the identity operator to the negation operator is an operator/function that has exactly the same effect as the negation operator. Then the result of the full expression would again be -1 for this example.
Indeed, if the result of juxtaposing functions/operators is defined to be a function/operator with the same effect as applying both in right-to-left order, then it always makes no difference to the result of the expression which way you associate them. Those are just two different ways of defining that (f g)(x) == f(g(x)). If your language defines +- to mean something other than -, though, then the direction of associativity would matter (and I suspect the language would be very difficult to read for someone used to the "usual" languages...)
On the other hand, if the language doesn't allow juxtaposing operators/functions then prefix operators must be right-associative to allow the expression +-1. Disallowing juxtaposition is another way of saying that (+-) has no meaning.
I'm not aware of such a thing in a real language (e.g., one that's been used by at least a dozen people). I suspect the "usually" was merely because proving a negative is next to impossible, so it's easier to avoid arguments over trivia by not making an absolute statement.
As to how you'd theoretically do such a thing, there seem to be two possibilities. Given two prefix operators # and # that you were going to treat as left associative, you could parse ##a as equivalent to #(#(a)). At least to me, this seems like a truly dreadful idea--theoretically possible, but a language nobody should wish on even their worst enemy.
The other possibility is that ##a would be parsed as (##)a. In this case, we'd basically compose # and # into a single operator, which would then be applied to a.
In most typical languages, this probably wouldn't be terribly interesting (would have essentially the same meaning as if they were right associative). On the other hand, I can imagine a language oriented to multi-threaded programming that decreed that application of a single operator is always atomic--and when you compose two operators into a single one with the left-associative parse, the resulting fused operator is still a single, atomic operation, whereas just applying them successively wouldn't (necessarily) be.
Honestly, even that's kind of a stretch, but I can at least imagine it as a possibility.
I hate to shoot down a question that I myself asked, but having looked at the two other answers, would it be wrong to suggest that I've inadvertently asked a subjective question, and that in fact that the interpretation of left-associative prefixes and right-associative postfixes is simply undefined?
Remembering that even notation as pervasive as expressions is built upon a handful of conventions, if there's an edge case that the conventions never took into account, then maybe, until some standards committee decides on a definition, it's better to simply pretend it doesn't exist.
I do not remember any left-associated prefix operators or right-associated postfix ones. But I can imagine that both can easily exist. They are not common because the natural way of how people are looking to operators is: the one which is closer to the body - is applying first.
Easy example from C#/C++ languages:
~-3 is equal 2, but
-~3 is equal 4
This is because those prefix operators are right associative, for ~-3 it means that at first - operator applied and then ~ operator applied to the result of previous. It will lead to value of the whole expression will be equal to 2
Hypothetically you can imagine that if those operators are left-associative, than for ~-3 at first left-most operator ~ is applied, and after that - to the result of previous. It will lead to value of the whole expression will be equal to 4
[EDIT] Answering to Steve Jessop:
Steve said that: the meaning of "left-associativity" is that +-1 is equivalent to (+-)1
I do not agree with this, and think it is totally wrong. To better understand left-associativity consider the following example:
Suppose I have hypothetical programming language with left-associative prefix operators:
# - multiplies operand by 3
# - adds 7 to operand
Than following construction ##5 in my language will be equal to (5*3)+7 == 22
If my language was right-associative (as most usual languages) than I will have (5+7)*3 == 36
Please let me know if you have any questions.
Hypothetical example. A language has prefix operator # and postfix operator # with the same precedence. An expression #x# would be equal to (#x)# if both operators are left-associative and to #(x#) if both operators are right-associative.
I've been mulling over creating a language that would be extremely well suited to creation of DSLs, by allowing definitions of functions that are infix, postfix, prefix, or even consist of multiple words. For example, you could define an infix multiplication operator as follows (where multiply(X,Y) is already defined):
a * b => multiply(a,b)
Or a postfix "squared" operator:
a squared => a * a
Or a C or Java-style ternary operator, which involves two keywords interspersed with variables:
a ? b : c => if a==true then b else c
Clearly there is plenty of scope for ambiguities in such a language, but if it is statically typed (with type inference), then most ambiguities could be eliminated, and those that remain could be considered a syntax error (to be corrected by adding brackets where appropriate).
Is there some reason I'm not seeing that would make this extremely difficult, impossible, or just a plain bad idea?
Edit: A number of people have pointed me to languages that may do this or something like this, but I'm actually interested in pointers to how I could implement my own parser for it, or problems I might encounter if doing so.
This is not too hard to do. You'll want to assign each operator a fixity (infix, prefix, or postfix) and a precedence. Make the precedence a real number; you'll thank me later. Operators of higher precedence bind more tightly than operators of lower precedence; at equal levels of precedence, you can require disambiguation with parentheses, but you'll probably prefer to permit some operators to be associative so you can write
x + y + z
without parentheses. Once you have a fixity, a precedence, and an associativity for each operator, you'll want to write an operator-precedence parser. This kind of parser is fairly simply to write; it scans tokens from left to right and uses one auxiliary stack. There is an explanation in the dragon book but I have never found it very clear, in part because the dragon book describes a very general case of operator-precedence parsing. But I don't think you'll find it difficult.
Another case you'll want to be careful of is when you have
prefix (e) postfix
where prefix and postfix have the same precedence. This case also requires parentheses for disambiguation.
My paper Unparsing Expressions with Prefix and Postfix Operators has an example parser in the back, and you can download the code, but it's written in ML, so its workings may not be obvious to the amateur. But the whole business of fixity and so on is explained in great detail.
What are you going to do about order of operations?
a * b squared
You might want to check out Scala which has a kind of unique approach to operators and methods.
Haskell has just what you're looking for.