I was aware that Lua 5.3 had come out not too long ago, but I hadn't had a reason to visit the online documentation until now. I may be wrong, but I don't remember the double colon :: being used as abundantly as it is there.
I see that it is considered a "special token" like the others (greater than, less than, asterisk, etc.), but I know what those are for.
What is the purpose of using them in Lua?
:: is used for only one thing in Lua*:
Declaring labels for jumping with goto.
goto label
::label::
The goto statement transfers the program control to a label. For syntactical reasons, labels in Lua are considered statements too:
stat ::= goto Name
stat ::= label
label ::= ‘::’ Name ‘::’
A label is visible in the entire block where it is defined, except inside nested blocks where a label with the same name is defined and inside nested functions. A goto may jump to any visible label as long as it does not enter into the scope of a local variable.
Labels and empty statements are called void statements, as they perform no actions.
* I don't count the extensive use of ::= in the documentation's extended BNF, since that isn't a use within Lua itself.
I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list: "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to stay as faithful to as I can), there are situations where spaces cannot be inserted: for example, "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace explicitly everywhere else, where it isn't important? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case (where spaces simply produce a syntax error) is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A simple validation function would just check that the starting position of each token (available as p.lexpos(i), where p is the action function's parameter and i is the index of the token in the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
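For concreteness, a rough sketch of that validation in Ply might look like this; the rule, token names and helper are invented for illustration, and it assumes every checked symbol is a terminal whose value is exactly the text it matched:

# Illustrative sketch: reject whitespace between adjacent tokens of a production.
# Assumes the symbols in positions first..last are terminals whose values are
# exactly the text they matched.
def tokens_are_adjacent(p, first, last):
    for i in range(first, last):
        if p.lexpos(i) + len(p[i]) != p.lexpos(i + 1):
            return False
    return True

def p_field_access(p):
    'expr : NAME DOT NAME'
    # All three symbols are terminals, so p.lexpos(i) and p[i] give each
    # token's start offset and matched text.
    if not tokens_are_adjacent(p, 1, 3):
        print("whitespace not allowed around '.'")
        raise SyntaxError        # puts the parser into error recovery
    p[0] = ('field', p[1], p[3])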
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a bit of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the . itself. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
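A minimal Ply sketch of that single-token approach might look like the following; the reserved-word table and token names are invented for illustration, not taken from the original language:

import ply.lex as lex

# Invented reserved words; the keyword check only applies to IDENTIFIER.
reserved = {'if': 'IF', 'while': 'WHILE'}
tokens = ['IDENTIFIER', 'SELECTOR'] + list(reserved.values())

def t_SELECTOR(t):
    r'\.[a-zA-Z_][a-zA-Z0-9_]*'
    t.value = t.value[1:]          # keep just the member name; keywords are fine here
    return t

def t_IDENTIFIER(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value, 'IDENTIFIER')
    return t

t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('obj.if')
# Produces IDENTIFIER 'obj' followed by SELECTOR 'if'.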
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing the use of | as a boolean operator from its use to delimit an absolute-value expression (uncommon, but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) for grouping expressions from its use in a function call (awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
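As a sketch, a lexer rule using that test to split the dot into two token types could look like this (TIGHT_DOT and SPACED_DOT are invented names):

def t_DOT(t):
    r'\.'
    # Classify the dot by whether whitespace precedes it in the raw input.
    # 'DOT', 'TIGHT_DOT' and 'SPACED_DOT' all need to appear in the tokens list.
    if t.lexpos == 0 or t.lexer.lexdata[t.lexpos - 1].isspace():
        t.type = 'SPACED_DOT'
    else:
        t.type = 'TIGHT_DOT'
    return t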
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.
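For example, the grammar might contain something along these lines (again, the names are only illustrative):

def p_dot(p):
    '''dot : TIGHT_DOT
           | SPACED_DOT'''
    # Wrapper non-terminal, used everywhere the grammar doesn't care
    # about surrounding whitespace.
    p[0] = p[1]

def p_field_access(p):
    'expr : expr TIGHT_DOT NAME'
    # Member access insists on the no-whitespace variant.
    p[0] = ('field', p[1], p[3])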
Normally, when you want to reuse a regular expression, you can define it in the definitions section of the flex file. The expansion gets enclosed in parentheses by default. E.g.:
num_seq [0-9]+
%%
{num_seq} return INT; // will become ([0-9]+)
{num_seq}\.{num_seq} return FLOAT; // will become ([0-9]+)\.([0-9]+)
But I wanted to reuse some character classes. Can I define custom classes like [:alpha:], [:alnum:], etc.? A toy example:
chars [a-zA-Z]
%%
// will become (([a-zA-Z]){-}[aeiouAEIOU])+ // ill-formed
// desired ([a-zA-Z]{-}[aeiouAEIOU])+ // correct
({chars}{-}[aeiouAEIOU])+ return ONLY_CONS;
({chars}{-}[a-z])+ return ONLY_UPPER;
({chars}{-}[A-Z])+ return ONLY_LOWER;
But currently this fails to compile because of the parentheses added around the definitions. Is there a proper way, or at least a workaround, to achieve this?
This might be useful from time to time, but unfortunately it has never been implemented in flex. You could suppress the automatic parentheses around macro substitution by running flex in lex compatibility mode, but that has other probably undesirable effects.
Posix requires that regular expression bracket syntax includes, in addition to the predefined character classes,
…character class expressions of the form: [:name:] … in those locales where the name keyword has been given a charclass definition in the LC_CTYPE category.
Unfortunately, flex does not implement this requirement. It is not too difficult to patch flex to do this, but since there is no portable mechanism to allow the user to add charclasses to their locale --and, indeed, many standard C library implementations lack proper locale support-- there is little incentive to make this change.
Having looked at all these options, I eventually convinced myself that the simplest portable solution is to preprocess the flex input file to replace [:name:] with a set of characters based on name. Since that sequence of characters is unlikely to be present in a flex input file, a simple-minded search and replace using sed or python is adequate; correctly parsing the flex input file seems to me to be more trouble than it is worth.
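A simple-minded version of that preprocessing step might look something like this in Python; the class table is illustrative and ASCII-only, and unknown names are left untouched:

import re
import sys

# Illustrative, ASCII-only expansions; extend as needed. Writing the
# replacement without brackets means [[:alpha:]] in the input becomes
# [a-zA-Z], which flex's {-} operator is happy with.
CLASSES = {
    'alpha': 'a-zA-Z',
    'digit': '0-9',
    'alnum': 'a-zA-Z0-9',
    'space': ' \\t\\n\\r\\f\\v',
}

def expand(text):
    return re.sub(r'\[:(\w+):\]',
                  lambda m: CLASSES.get(m.group(1), m.group(0)),
                  text)

if __name__ == '__main__':
    sys.stdout.write(expand(sys.stdin.read()))

It would then be run as a filter over the flex input, e.g. python expand_classes.py < scanner.l.in > scanner.l (the file names are, of course, hypothetical).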
In Flex, I can use a trailing-context pattern in a name definition like this:
NAME foo$|bar
and this passes flex.
But I don't like writing regular expressions like this, without whitespace, as they are hard to read. So I would like to do it properly:
NAME (?x: foo$ | bar )
but now this fails flex because, according to the manual, "‘$’, cannot be grouped inside parentheses".
IMHO, this is silly: to allow a construct, but not to allow it to be written readably.
How can I use trailing context with a readable pattern in Flex?
First, to answer your question: "How can I use trailing context with a readable pattern in Flex?". If you insist that patterns are only readable if they are sprinkled with whitespace, then the answer is "You cannot." Sorry, but that's the way it is. The (?x: flag was hacked into flex at some point, and there are still a lot of rough edges.
In a way, it doesn't matter since you cannot use the $ operator as part of one alternative in an r|s regular expression. So even if you could have used the "readable syntax", it wouldn't have meant what you intended. You can certainly use the following "readable syntax" (at least, I think it's readable). It means something different, but it's the only use of the $ operator which flex supports:
NAME (?x: foo | bar )$
Below are a few notes.
In Flex, I can use a trailing-context pattern in a name definition like this:
NAME foo$|bar
No, you can't. Or, better said, you can write that but it doesn't involve trailing context because:
…a '$' which does not occur at the end of a rule loses its special properties and is treated as a normal character.
(From the Flex manual; it's the last phrase in the point which says that you can't put trailing context operators inside parentheses.)
It is true (and slightly curious) that flex will reject:
NAME (?x: foo$ | bar )
although it will accept:
NAME (?x: foo$| bar )
I would go out on a limb and say that it is a bug. A $ is recognized as a trailing context operator only if it is at the end of the pattern. However, the code which checks that simply checks to see if the next character is whitespace, because patterns terminate at the first whitespace character. (The pattern isn't parsed in the definition; it is parsed when it is actually included in some rule pattern.) The test does not check whether the $ is within a (?x: block, so in
(?x: foo$ | bar )
the $ is a trailing context operator, which is a syntax error (the operator must appear at the very end of the pattern), while in
(?x: foo$| bar )
the $ is just an ordinary character, which is legal but perhaps unexpected.
Finally, a little note: the following is completely legal and the $ will be treated as a trailing context operator, provided that the definition is used at the very end of a pattern:
NAME bar|foo$
However, it probably doesn't mean what you think it means, either. The trailing context operator has lower precedence than the alternation operator, so as long as the expansion is at the end of a pattern, that is parsed as though it were written
NAME (bar|foo)$
I would strongly recommend against using such a definition. (In fact, I generally discourage the use of definitions, partly because of all these quirks.) A definition which ends with a $ is inserted into the referencing pattern without being surrounded with parentheses (so that the $ could be treated as an operator). This leads to all sorts of unexpected behaviour. For example, if you write:
NAME bar|foo$
and then use it:
x{NAME}y /* Some action */
The end result will be as though you had written
xbar|foo"$"y /* Some action */
(No parentheses, but the $ is a regular character.)
On the other hand, if you use it like this:
x{NAME} /* Some action */
That's as though you had written
xbar|foo$ /* Some action */
in which $ is the trailing context operator, but because of the low precedence of that operator it ends up being equivalent to
(xbar|foo)$ /* Some action */
It's unlikely that any of those expansions were what you wanted, and even less likely that anyone reading your code will expect those results.
I'm re-building a Lua to ES3 transpiler (a tool for converting Lua to cross-browser JavaScript). Before I spend more of my ideas on this transpiler, I want to ask whether it's possible to convert Lua labels to ECMAScript 3. For example:
goto label;
:: label ::
print "skipped";
My first idea was to separate each body of statements into parts, e.g., when there's a label, the statements that follow it must be stored as an entire new part:
some body
label (& statements)
other label (& statements)
and so on. Every statement that has a body (or the program chunk) gets a list of parts like this. Each part belonging to a label should have the label's name stored somewhere (e.g., in a property of its own part object).
Each part would be a function or would store a function on itself to be executed sequentially in relation to the others.
A goto statement would look up its specific label to run that label's statements, and would invoke an ES return statement to stop execution of the current statements.
The limitation of separating the body statements in this way is accessing the variables and functions defined in different parts... So, is there an idea or answer for this? Is it impossible to have stable labels when converting them to ECMAScript?
I can't quite follow your idea, but it seems someone already solved the problem: JavaScript allows labelled continues, which, combined with dummy while loops, permit emulating goto within a function. (And unless I forgot something, that should be all you need for Lua.)
Compare pages 72-74 of the ECMAScript spec ed. #3 of 2000-03-24 to see that it should work in ES3, or just look at e.g. this answer to a question about goto in JS. As usual on the 'net, the URLs referenced there are dead but you can get summerofgoto.com [archived] at the awesome Internet Archive. (Outgoing GitHub link is also dead, but the scripts are also archived: parseScripts.js, goto.min.js or goto.js.)
I hope that's enough to get things running, good luck!
The flex manual mentions a "trailing context" pattern (r/s), which means r, but only if followed by s. However, the following code doesn't compile (instead it gives an "unrecognized rule" error). Why?
LITERAL a/b
%%
{LITERAL} { }
The simple answer is that unless you use the -l option, which is not recommended, you cannot put trailing context into a name definition. That's because flex:
doesn't allow trailing context inside parentheses; and
automatically surrounds expansions of definitions with parentheses, except in a few situations (see below).
The reason flex surrounds expansions with parentheses is that otherwise weird things happen. For example:
prefix milli|centi
%%
{prefix}pede return BUG;
Without the automatic parentheses, the pattern would expand to:
milli|centipede
which would not match millipede. (There's a similar problem with the various postfix operators. Consider {prefix}?pede, for example.)
Flex doesn't allow trailing context inside parentheses because many such expressions are harder to compile. In effect, you can end up writing patterns which are the intersection of two regular expressions. (For example, ({base}/{a}){b} matches {base} followed by a {b} which is either a prefix or a projection of an {a}.) These are still regular expressions, but they aren't contemplated by the Thompson algorithm for turning regular expressions into finite state machines. Since the feature is rarely if ever needed, no attempt was ever made to implement it.
Unfortunately, banning trailing context inside parentheses also bans redundant parentheses around patterns which include trailing context, and this includes definition expansions because definitions are expanded with possibly redundant parentheses.
The original AT&T lex did not add the parentheses, which is why forcing lex-compatibility with -l allows your flex file to compile. However, it may result in all sorts of other problems, as indicated above, so I wouldn't recommend it.
Also, "trailing context" here means either a full pattern of the form r/s or of the form r$. Putting r/s inside parentheses (whether explicitly or implicitly) produces an error message, but putting r$ inside parentheses just makes the $ match a $ character, instead of forcing the pattern to match at the end of a line. No error or warning is emitted in this case.
That would make it impossible to use $ (or ^) inside a name definition. However, at some point prior to version 2.3.53, a hack was inserted which suppresses the parentheses if the definition starts with ^ or ends with $. And, for reasons I don't fully understand, it also suppresses the parentheses if the expansion occurs at the end of trailing context. This might be a bug, and indeed there is a bug report relating to it.
I found the answer to your problem in the FAQ in flex's info pages: "Your problem is that some of the definitions in the scanner use the '/' trailing context operator, and have it enclosed in ()'s. Flex does not allow this operator to be enclosed in ()'s because doing so allows undefined regular expressions such as "(a/b)+". So the solution is to remove the parentheses. Note that you must also be building the scanner with the -l option for AT&T lex compatibility. Without this option, flex automatically encloses the definitions in parentheses." (Quote from Vern Paxson.) See also the FAQ entry on trailing context.
The use of trailing context is better avoided when possible. As described above, it is not allowed in nested expressions. Your example does work with the -l option.