How to use trailing context in readable patterns with Flex? - flex-lexer

In Flex, I can use a trailing-context pattern in a name definition like this:
NAME foo$|bar
and this passes flex.
But I don't like to write regular expressions like this, without whitespace, as they are hard to read. So I would like to write it properly:
NAME (?x: foo$ | bar )
but now flex rejects this because, according to the manual, '$' "cannot be grouped inside parentheses".
IMHO this is silly: to allow a construct, but not to allow it to be written readably.
How can I use trailing context with a readable pattern in Flex?

First, to answer your question: "How can I use trailing context with a readable pattern in Flex?". If you insist that patterns are only readable if they are sprinkled with whitespace, then the answer is "You cannot." Sorry, but that's the way it is. The (?x: flag was hacked into flex at some point, and there are still a lot of rough edges.
In a way, it doesn't matter since you cannot use the $ operator as part of one alternative in an r|s regular expression. So even if you could have used the "readable syntax", it wouldn't have meant what you intended. You can certainly use the following "readable syntax" (at least, I think it's readable). It means something different, but it's the only use of the $ operator which flex supports:
NAME (?x: foo | bar )$
Below are a few notes.
In Flex, I can use a trailing-context pattern in a name definition like this:
NAME foo$|bar
No, you can't. Or, better said, you can write that but it doesn't involve trailing context because:
…a '$' which does not occur at the end of a rule loses its special properties and is treated as a normal character.
(From the Flex manual; it's the last phrase in the point which says that you can't put trailing context operators inside parentheses.)
It is true (and slightly curious) that flex will reject:
NAME (?x: foo$ | bar )
although it will accept:
NAME (?x: foo$| bar )
I would go out on a limb and say that it is a bug. A $ is recognized as a trailing context operator only if it is at the end of the pattern. However, the code which checks that simply checks to see if the next character is whitespace, because patterns terminate at the first whitespace character. (The pattern isn't parsed in the definition; it is parsed when it is actually included in some rule pattern.) The test does not check whether the $ is within a (?x: block, so in
(?x: foo$ | bar )
the $ is a trailing context operator, which is a syntax error (the operator must appear at the very end of the pattern), while in
(?x: foo$| bar )
the $ is just an ordinary character, which is legal but perhaps unexpected.
Finally, a little note: the following is completely legal and the $ will be treated as a trailing context operator, provided that the definition is used at the very end of a pattern:
NAME bar|foo$
However, it probably doesn't mean what you think it means, either. The trailing context operator has lower precedence than the alternation operator, so as long as the expansion is at the end of a pattern, it is parsed as though it were written
NAME (bar|foo)$
I would strongly recommend against using such a definition. (In fact, I generally discourage the use of definitions, partly because of all these quirks.) A definition which ends with a $ is inserted into the referencing pattern without being surrounded with parentheses (so that the $ could be treated as an operator). This leads to all sorts of unexpected behaviour. For example, if you write:
NAME bar|foo$
and then use it:
x{NAME}y /* Some action */
The end result will be as though you had written
xbar|foo"$"y /* Some action */
(No parentheses, but the $ is a regular character.)
On the other hand, if you use it like this:
x{NAME} /* Some action */
That's as though you had written
xbar|foo$ /* Some action */
in which $ is the trailing context operator, but because of the low precedence of that operator it ends up being equivalent to
(xbar|foo)$ /* Some action */
It's unlikely that any of those expansions were what you wanted, and even less likely that anyone reading your code will expect those results.
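If readability is the main concern, one workaround consistent with the notes above (a sketch, not something the flex manual recommends) is to keep the whitespace-friendly (?x: group in the definition and apply the $ where flex does support it, at the very end of the rule's pattern:

NAME (?x: foo | bar )
%%
{NAME}$    { /* foo or bar, but only at the end of a line */ }

Because flex wraps the expansion of {NAME} in parentheses, the $ here unambiguously acts as the end-of-line trailing context operator.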

Related

How to force no whitespace in dot notation

I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to stay as faithful to as I can), there are situations where spaces cannot be inserted: for example, "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace everywhere else, where it isn't important? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A validation function would simply check that the starting position of each token (available as p.lexpos(i), where p is the action function's parameter and i is the index of the token in the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
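A minimal Ply sketch of that validation, assuming a production whose right-hand side consists entirely of tokens (the rule and token names are illustrative):

def p_member_access(p):
    """member_access : NAME DOT NAME"""
    # Every RHS symbol is a token here, so p[i] is the matched string
    # and p.lexpos(i) is its starting offset in the input.
    for i in range(2, len(p)):
        if p.lexpos(i) != p.lexpos(i - 1) + len(p[i - 1]):
            raise SyntaxError  # tokens not adjacent: whitespace crept in
    p[0] = ('member', p[1], p[3])

Raising SyntaxError inside a Ply action triggers the parser's normal error recovery, as though the input had not matched.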
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a bit of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
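Here is what the single-token SELECTOR approach sketched above might look like in Ply (the token names and the reserved-word table are my own, illustrative choices):

import ply.lex as lex

reserved = {'if': 'IF', 'while': 'WHILE'}
tokens = ['IDENTIFIER', 'SELECTOR'] + list(reserved.values())

def t_SELECTOR(t):
    r'\.[a-zA-Z_][a-zA-Z0-9_]*'
    # ".name" is a single token, so keyword recognition never applies
    # to member names, and no space can separate '.' from the name.
    return t

def t_IDENTIFIER(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value, 'IDENTIFIER')  # keywords handled here
    return t

t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()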
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator from its use as an absolute value expression (uncommon, but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from their use in a function call (awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
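As a sketch, here is a lexer action which reclassifies a '.' that has no whitespace on either side (the TIGHT_DOT name is my invention, and it would also have to appear in the tokens list):

def t_DOT(t):
    r'\.'
    data = t.lexer.lexdata
    end = t.lexer.lexpos                  # index just past this token
    space_before = t.lexpos == 0 or data[t.lexpos - 1].isspace()
    space_after = end == len(data) or data[end].isspace()
    if not space_before and not space_after:
        t.type = 'TIGHT_DOT'              # glued to both neighbours
    return t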
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.
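Continuing the same sketch, a covering non-terminal lets most of the grammar ignore the distinction:

def p_any_dot(p):
    """any_dot : DOT
               | TIGHT_DOT"""
    p[0] = p[1]

Productions where spacing is irrelevant reference any_dot; the few that require adjacency reference TIGHT_DOT directly.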

What is the purpose of double colons in Lua?

I have been aware that version 5.3 of Lua came out not too long ago, but I hadn't had a reason to visit the documentation online until now. I may be wrong, but I do not remember the double colons :: being used as abundantly as they are there.
I see that it is considered a "special token" like others are (greater than, less than, asterisks, etc) but I know what those are for.
What is the purpose of using them in Lua?
:: is only used for one thing in Lua *:
Declaring labels for jumping with goto.
goto label    -- transfer control to ::label:: below
-- (any statements here are skipped)
::label::
The goto statement transfers the program control to a label. For syntactical reasons, labels in Lua are considered statements too:
stat ::= goto Name
stat ::= label
label ::= ‘::’ Name ‘::’
A label is visible in the entire block where it is defined, except inside nested blocks where a label with the same name is defined and inside nested functions. A goto may jump to any visible label as long as it does not enter into the scope of a local variable.
Labels and empty statements are called void statements, as they perform no actions.
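As a concrete example, the usual idiom for emulating a continue statement (which Lua lacks) relies on goto and a label at the end of the loop body:

for i = 1, 5 do
  if i % 2 == 0 then goto continue end
  print(i)        -- only odd numbers reach this line
  ::continue::
end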
* I don't count the extensive use of :: in the extended BNF notation of the documentation as use in Lua itself.

Syntax Highlighting when using special characters

I'm currently finishing up a mathematical DSL based on LaTeX code in Rascal. This means that I have a lot of special characters (\, {, }). For instance, in the syntax shown below, the sum doesn't get highlighted unless I remove the \ and _{ from the syntax.
syntax Expression = left sum: '\\sum_{' Assignment a '}^{' Expression until '}' Expression e
I've noticed that keywords that contain either \ or { and } do not get highlighted. Is there a way to overcome this?
Edit: I accidentally used data instead of syntax in this example
There are at least two solutions, one is based on changing the grammar, one is based on a post-parse tree traversal. Pick your poison :-)
The cause of the behavior is the default highlighting rules, which heuristically detect what a "keyword" to be highlighted is by matching every literal against the regular expression [A-Za-z][A-Za-z0-9\-]*. Next to these heuristic defaults, the highlighting is fully programmable via @category tags in the grammar and @category annotations in the parse tree.
If you change the grammar like so, you can influence highlighting via tags:
syntax Expression = left sum: SumKw Assignment a '}^{' Expression until '}' Expression e;
syntax SumKw = @category="MetaKeyword" '\\sum_{';
Or, another grammar-based solution is to split the definition up (which is not a language preserving grammar refactoring since it adds possibility for spaces):
syntax Expression = left sum: "\\" 'sum' "_{" Assignment a '}^{' Expression until '}' Expression e;
(The latter solution will trigger the heuristic for keywords again)
If you don't want to hack the grammar to accommodate highlighting, the other way is to add an annotation via a tree traversal, like so:
visit(yourTree) {
  case t:appl(prod(cilit("\\sum_{"),_,_),_) => t[@category="MetaKeyword"]
}
The code is somewhat hairy because you have to match on and replace a tree which can usually be ignored while thinking of your own language: the syntax rule generated for each (case-insensitive) literal and its application to the individual characters it consists of. See ParseTree.rsc from the standard library for a detailed and formal definition of what parse trees look like under-the-hood.
To make the latter solution have effect, when you instantiate the IDE using the registerLanguage function from util::IDE, make sure to wrap the call to the parser with some function which executes this visit.

What is a "primary expression"?

In the PowerShell Language Specification, the grammar has a term called "Primary Expression".
primary-expression:
    value
    member-access
    element-access
    invocation-expression
    post-increment-expression
    post-decrement-expression
Semantically, what is a primary expression intended to describe?
As I understand it there are two parts to this:
Formal grammars break up things like expressions so operator precedence is implicit.
E.g., if the grammar had
expression:
    value
    expression * expression
    expression + expression
    …
There would need to be a separate mechanism to define * as having higher precedence than +. This becomes significant when using tools to directly transform the grammar into a tokeniser/parser.1
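For example, precedence can be encoded by stratifying the rules, in the same style as the specification (a sketch, not PowerShell's actual grammar):

additive-expression:
    multiplicative-expression
    additive-expression + multiplicative-expression

multiplicative-expression:
    primary-expression
    multiplicative-expression * primary-expression

Here a * can never appear directly under an additive-expression without first being wrapped in a multiplicative-expression, which is exactly the "higher precedence" relationship.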
There are so many different specific rules (consider the number of operators in PowerShell) that using fewer rules would be harder to understand because each rule would be so long.
1 I suspect this is not the case with PowerShell because it is highly context sensitive (expression vs command mode, and then consider calling non-built-in executables). But such grammars across languages tend to have a lot in common, so style can also be carried over (don't re-invent the wheel).
My understanding of it is that an expression is an arrangement of commands/arguments and/or operators/operands that, taken together, will execute and effect an action or produce a result (even if it's $null) that can be assigned to a variable or sent down the pipeline. The term "primary expression" is used to differentiate the whole expression from any sub-expressions $() that may be contained within it.
My very limited experience says that primary refers to expressions (or sub-expressions) that can't be parsed in more than one way, regardless of precedence or associativity. They have enough syntactic anchors that they are really non-negotiable: ID '(' ID ')', for example, can only match that. The parens make it guaranteed, and there are no operators to allow for any decision on precedence.
In matching an expression tree, it's common to have a set of these sub-expressions under a "primary_expr" rule. Any sub-expressions can be matched any way you like because they're all absolutely determined by syntax. Everything after that will have to be ordered carefully in the grammar to guarantee correct precedence and associativity.
There's some support for this interpretation here.
think $( lots of statements ) would be like a complex expression
and in powershell most things can be an expression like
$a = if ($y) { 1 } else { 2 }
so primary expression is likely the lowest common denominator of classic expressions
a single thing that returns a value
whether an explicit value, calling something.. getting a variable, or a property from a variable $x.y or the result of increment operation $z++
but even a simple math "expression" like 4 + 2 + (3 / 4) wouldn't match this, that would be more of a complex expression.
So I was thinking of the WHY behind it. At first it was mentioned that it could be to help determine command/expression mode, but with investigation that wasn't it. Then I thought maybe it was what could be passed in command mode as an argument explicitly, but that's not it either, because if you pass $x++ as a parameter to a cmdlet you get a string.
I think it's probably just the lowest common denominator of expressions, a building block; so the next question is what other expressions the grammar contains, and where this one is used.

Matching trailing context in flex

In the flex manual it mentions a "trailing context" pattern (r/s), which means r, but only if followed by s. However, the following code doesn't compile (instead it gives an "unrecognized rule" error). Why?
LITERAL a/b
%%
{LITERAL} { }
The simple answer is that unless you use the -l option, which is not recommended, you cannot put trailing context into a name definition. That's because flex:
doesn't allow trailing context inside parentheses; and
automatically surrounds expansions of definitions with parentheses, except in a few situations (see below).
The reason flex surrounds expansions with parentheses is that otherwise weird things happen. For example:
prefix milli|centi
%%
{prefix}pede return BUG;
Without the automatic parentheses, the pattern would expand to:
milli|centipede
which would not match millipede. (There's a similar problem with the various postfix operators. Consider {prefix}?pede, for example.)
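To see why, compare the two expansions (annotations mine):

milli|centi?pede      without parentheses: matches "milli", "centpede" or "centipede"
(milli|centi)?pede    with parentheses: matches "pede", "millipede" or "centipede"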
Flex doesn't allow trailing context inside parentheses because many such expressions are harder to compile. In effect, you can end up writing patterns which are the intersection of two regular expressions. (For example, ({base}/{a}){b} matches {base} followed by a {b} which is either a prefix or a projection of an {a}.) These are still regular expressions, but they aren't contemplated by the Thompson algorithm for turning regular expressions into finite state machines. Since the feature is rarely if ever needed, no attempt was ever made to implement it.
Unfortunately, banning trailing context inside parentheses also bans redundant parentheses around patterns which include trailing context, and this includes definition expansions because definitions are expanded with possibly redundant parentheses.
The original AT&T lex did not add the parentheses, which is why forcing lex-compatibility with -l allows your flex file to compile. However, it may result in all sorts of other problems, as indicated above, so I wouldn't recommend it.
Also, "trailing context" here means either a full pattern of the form r/s or of the form r$. Putting r/s inside parentheses (whether explicitly or implicitly) produces an error message, but putting r$ inside parentheses just makes the $ match a $ character, instead of forcing the pattern to match at the end of a line. No error or warning is emitted in this case.
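For example, with a default (non -l) build:

(foo$)     the $ matches a literal '$' character; no warning is given
(foo/bar)  rejected: trailing context inside parentheses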
That would make it impossible to use $ (or ^) inside a name definition. However, at some point prior to version 2.3.53, a hack was inserted which suppresses the parentheses if the definition starts with ^ or ends with $. And, for reasons I don't fully understand, it also suppresses the parentheses if the expansion occurs at the end of trailing context. This might be a bug, and indeed there is a bug report relating to it.
I found the answer to your problem in the FAQ of the info pages of flex: "Your problem is that some of the definitions in the scanner use the '/' trailing context operator, and have it enclosed in ()'s. Flex does not allow this operator to be enclosed in ()'s because doing so allows undefined regular expressions such as "(a/b)+". So the solution is to remove the parentheses. Note that you must also be building the scanner with the -l option for AT&T lex compatibility. Without this option, flex automatically encloses the definitions in parentheses." (quote from Vern Paxson). See also FAQ trailing context
The use of trailing context is better avoided when possible. As described above, it is not allowed in nested expressions. Your example does work with the -l option.
