I am currently looking through the first page of the documentation for Lua and noticed that every assignment appears as var ::= Name, but I could not find any reference to the syntax of ::= itself. The documentation goes over the structure of an assignment but glosses over these symbols. What I want to know is whether every assignment requires the :: before the actual assignment operator, and, if so, why it is structured this way and not just as a plain =.
What you're seeing is not Lua code, but a fragment of the grammar of the Lua language, as defined in Backus-Naur Form (BNF). The ::= symbol is part of BNF, not of Lua: it is the definition operator used in formal grammars, read as "is defined as". Lua assignments themselves use a plain =.
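For instance, the production var ::= Name from the manual is grammar notation saying "a var may consist of a single Name"; the Lua code that this production describes is ordinary source like x = 10, written with a plain =.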
Related
Most interpreters let you type the following at their console:
>> a = 2
>> a+3
5
>>
My question is: what mechanisms are usually used to handle this syntax? Somehow the parser is able to distinguish between an assignment and an expression, even though both can start with a letter or digit. It's only when you retrieve the second token that you know whether you have an assignment. In the past, I've looked ahead two tokens, and if the second token isn't an equals sign I push the tokens back into the lexical stream and assume it's an expression. I suppose one could treat the assignment as an expression, which I think some languages do. I thought of using left-factoring, but I can't see it working.
e.g.
assignment = variable A
A = '=' expression | empty
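Here's roughly how my lookahead approach works in practice, as a minimal Python sketch (parse_expression is just a stub standing in for a real expression parser):

def parse_expression(tokens):
    # stub: a real implementation would build an expression tree here
    return tokens

def parse_line(tokens):
    # Peek at the second token without consuming anything; only an
    # '=' in that position makes this an assignment.
    if len(tokens) >= 2 and tokens[1] == "=":
        return ("assign", tokens[0], parse_expression(tokens[2:]))
    return ("expr", parse_expression(tokens))

print(parse_line(["a", "=", "2"]))   # ('assign', 'a', ['2'])
print(parse_line(["a", "+", "3"]))   # ('expr', ['a', '+', '3'])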
Update: I found this question on Stack Overflow which addresses the same question: How to modify parsing grammar to allow assignment and non-assignment statements?
From how you're describing your approach - doing a few tokens of lookahead to decide how to handle things - it sounds like you're trying to write some sort of top-down parser along the lines of an LL(1) or an LL(2) parser, and you're trying to immediately decide whether the expression you're parsing is a variable assignment or an arithmetical expression. There are several ways that you could parse expressions like these quite naturally, and they essentially involve weakening one of those two assumptions.
The first way we could do this would be to switch from using a top-down parser like an LL(1) or LL(2) parser to something else like an LR(0) or SLR(1) parser. Those parsers work bottom-up by reading larger prefixes of the input string before deciding what they're looking at. In your case, a bottom-up parser might work by seeing the variable and thinking "okay, I'm either going to be reading an expression to print or an assignment statement, but with what I've seen so far I can't commit to either," then scanning more tokens to see what comes next. If it sees an equals sign, great! It's an assignment statement. If it sees something else, great! It's not. The nice part about this is that if you're using a standard bottom-up parsing algorithm like LR(0), SLR(1), LALR(1), or LR(1), you should find that the parser handles these sorts of issues cleanly, with no special-casing logic necessary.
The other option would be to parse the entire expression assuming that = is a legitimate binary operator like any other, and then check afterwards whether what you parsed is a legal assignment statement. For example, if you use Dijkstra's shunting-yard algorithm to do the parsing, you can recover a parse tree for the overall expression, regardless of whether it's an arithmetical expression or an assignment. You could then walk the parse tree to ask questions like
if the top-level operation is an assignment, is the left-hand side a single variable?
if the top-level operation isn't an assignment, are there nested assignment statements buried in here that we need to get rid of?
In other words, you'd parse a broader class of statements than just the ones that are legal, and then do a postprocessing step to toss out anything that isn't valid.
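To make that second approach concrete, here is a minimal sketch in Python. Everything in it is illustrative (my own names; no parentheses or unary operators): = is treated as the lowest-precedence, right-associative operator during a shunting-yard parse, and a validation pass afterwards rejects anything that isn't a legal assignment.

import re

# Illustrative precedence table; '=' binds loosest and is right-associative.
PREC = {"=": 1, "+": 2, "-": 2, "*": 3, "/": 3}
RIGHT_ASSOC = {"="}

def pop_op(out, ops):
    op = ops.pop()
    right, left = out.pop(), out.pop()
    out.append((op, left, right))

def parse(text):
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[=+\-*/]", text)
    out, ops = [], []
    for tok in tokens:
        if tok in PREC:
            while ops and (PREC[ops[-1]] > PREC[tok] or
                           (PREC[ops[-1]] == PREC[tok] and tok not in RIGHT_ASSOC)):
                pop_op(out, ops)
            ops.append(tok)
        else:
            out.append(tok)
    while ops:
        pop_op(out, ops)
    return out[0]

def validate(tree):
    if isinstance(tree, tuple):
        op, left, right = tree
        if op == "=" and (not isinstance(left, str) or left.isdigit()):
            raise SyntaxError("left-hand side of '=' must be a variable")
        for child in (left, right):
            if isinstance(child, tuple) and child[0] == "=":
                raise SyntaxError("nested assignment is not allowed")
            validate(child)

tree = parse("a = 2 + 3")
validate(tree)   # accepted: ('=', 'a', ('+', '2', '3'))
# parse("2 = a") and parse("a = b = c") are both rejected by validate.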
Context
I've recently run into an issue in a parser I'm writing that I couldn't solve by myself.
This parser is a component of a compiler I'm building, and the question concerns the expression parsing needed when parsing a programming language.
My parser uses recursive descent to parse expressions.
The problem
I parse expressions using the usual parsing rules; I've eliminated left recursion in all my rules, but there is one syntactic "ambiguity" which my parser simply can't handle, and it involves generics.
comparison → addition ( ( ">" | ">=" | "<" | "<=" ) addition )* ;
is the rule I use for parsing comparison nodes in the expression
On the other hand I decided to parse generic expressions this way:
generic → primary ( "<" arguments ">" ) ;
where
arguments → expression ( "," expression )* ;
Now, because generic expressions have higher precedence (they are language constructs, not mathematical expressions), this creates a scenario where the generic rule will attempt to parse expressions when it shouldn't.
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, try to parse that, and fail because it can't find the closing tag.
What is the solution to such a scenario, especially in languages like C++, where generics can also have expressions in them? If I'm not mistaken, arr<1<2> might be legal syntax there.
Is this a special edge case, or does it require a modification to the syntax definition that I'm not aware of?
Thank you
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that, and fail as it can't find the closing tag.
This particular case could be solved with backtracking or unbounded lookahead. As you said, the parser will eventually fail when interpreting this as a generic, so when that happens, you can go back and parse it as a relational operator instead. The lookahead variant would be to look ahead when seeing a <, check whether the < is followed by comma-separated type names and a >, and only go into the generic rule if that is the case.
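Here is a minimal backtracking sketch in Python; all the names are mine and the grammar is heavily simplified, but it shows the save-position/restore-position mechanic:

class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise ParseError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def after_identifier(self, name):
        if self.peek() == "<":
            mark = self.pos              # remember the cursor
            try:
                return self.generic(name)
            except ParseError:
                self.pos = mark          # backtrack: '<' was a comparison
        return ("identifier", name)      # caller parses '<' as relational

    def generic(self, name):
        self.expect("<")
        args = [self.argument()]
        while self.peek() == ",":
            self.pos += 1
            args.append(self.argument())
        self.expect(">")                 # fails for input like a<2
        return ("generic", name, args)

    def argument(self):
        tok = self.peek()
        if tok is None or tok in ("<", ">", ","):
            raise ParseError(f"unexpected {tok!r}")
        self.pos += 1
        return tok

# For ["a", "<", "2"] the generic rule fails at the missing ">",
# the position is restored, and '<' is parsed as a comparison instead.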
However that approach no longer works if both interpretations are syntactically valid (meaning the syntax actually is ambiguous). One example of that would be x<y>z, which could either be a declaration of a variable z of type x<y> or two comparisons. This example is somewhat unproblematic since the latter meaning is almost never the intended one, so it's okay to always interpret it as the former (this happens in C# for example).
Now if we allow expressions, it becomes more complicated. For x<y>z it's easy enough to say that this should never be interpreted as two comparisons, as it makes no sense to compare the result of a comparison with something else (in many languages, using relational operators on Booleans is a type error anyway). But for something like a<b<c>() there are two interpretations that might both be valid: either a is a generic function called with the generic argument b<c, or b is a generic function with the generic argument c (and a is compared to the result of calling that function). At this point it is no longer possible to resolve the ambiguity with syntactic rules alone:
In order to support this, you'll need to either check whether the given primary refers to a generic function and make different parsing decisions based on that or have your parser generate multiple trees in case of ambiguities and then select the correct one in a later phase. The former option means that your parser needs to keep track of which generic functions are currently defined (and in scope) and then only go into the generic rule if the given primary is the name of one of those functions. Note that this becomes a lot more complicated if you allow functions to be defined after they are used.
So, in summary, supporting expressions as generic arguments requires you either to keep track of which functions are in scope while parsing and use that information to make your parsing decisions (meaning your parser is context-sensitive), or to generate multiple possible ASTs. Without expressions you can keep the grammar context-free and unambiguous, but it will require backtracking or arbitrary lookahead (meaning it's LL(*)).
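As a minimal illustration of the context-sensitive option (the names here are mine, not from any real compiler), the decision at < consults a set of generic names that the parser maintains as it processes declarations:

generic_in_scope = {"max_of"}   # filled in as declarations are parsed

def classify_less_than(name):
    """Decide how to parse 'name <' based on what 'name' refers to."""
    if name in generic_in_scope:
        return "generic-argument-list"
    return "comparison"

print(classify_less_than("max_of"))  # generic-argument-list
print(classify_less_than("a"))       # comparison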
Since neither of those options is ideal, some languages change the syntax for calling generic functions with explicit type parameters to make it LL(1). For example:
Java puts the generic argument list of a method before the method name, i.e. obj.<T>foo() instead of obj.foo<T>().
Rust requires :: before the generic argument list: foo::<T>() instead of foo<T>().
Scala uses square brackets for generics and for nothing else (array subscripts use parentheses): foo[T]() instead of foo<T>().
I have made some searches, including taking a second look through the red Dragon Book in front of me, but I haven't found a clear answer to this. Most people are talking about whitespace-sensitivity in terms of indentation, but that's not my case.
I want to implement a transpiler for a simple language. This language has a concept of a "command", which is a reserved keyword followed by some arguments. To give you an idea of what I'm talking about, a sequence of commands may look something like this:
print "hello, world!";
set running 1;
while running #
read progname;
launch progname;
print "continue? 1 = yes, 0 = no";
readint running;
#
Informally, you can view the grammar as something along the lines of
<program> ::= <statement> <program>
            | <statement>
<statement> ::= while <expression> <sequence>
| <command> ;
<sequence> ::= # <program> #
| <statement>
<command> ::= print <expression>
| set <variable> <expression>
| read <variable>
| readint <variable>
| launch <expression>
<expression> ::= <variable>
| <string>
| <int>
For simplicity, we can define the remaining pieces as follows:
<string> is an arbitrary sequence of characters surrounded by quotes
<int> is a sequence of characters '0'..'9'
<variable> is a sequence of characters 'a'..'z'
Now this would ordinarily not be any problem. In fact, given just this specification I have a working implementation, where the lexer silently eats all whitespace. However, here's the catch:
Arguments to commands must be separated by whitespace!
In other words, it should be illegal to write
while running#print"hello";#
even though this clearly isn't ambiguous as far as the grammar is concerned. I have had two ideas on how to solve this.
Output a token whenever some whitespace is consumed, and include whitespace in the grammar. I suspect this will make the grammar a lot more complicated.
Rewrite the grammar so instead of "hard-coding" the arguments of each command, I have a production rule for "arguments" taking care of whitespace. It may look something like
<command> ::= <cmdtype> <arguments>
<arguments> ::= <argument> <arguments>
              | <argument>
<argument> ::= <expression>
<cmdtype> ::= print | set | read | readint | launch
Then we can make sure the lexer somehow (?) takes care of leading whitespace whenever it encounters an <argument> token. However, this moves the complexity of dealing with the arity (among other things?) of built-in commands into the parser.
How is this normally solved? When the grammar of a language requires whitespace in particular places but leaves it optional almost everywhere else, does it make sense to deal with it in the lexer or in the parser?
I wish I could fudge the specification of the language just a teeny tiny bit because that would make it much simpler to implement, but unfortunately this is a backward-compatibility issue and not possible.
Backwards compatibility is usually taken to apply only to correct programs; accepting a program which previously would have been rejected as a syntax error cannot alter the behaviour of any valid program and thus does not violate backwards compatibility.
That might not be relevant in this case, but since it would, as you note, simplify the problem considerably, it seemed worth mentioning.
One solution is to pass whitespace on to the parser, and then incorporate it into the grammar; normally, you would define a terminal, WS, and from that a non-terminal for optional whitespace:
<ows> ::= WS | empty
If you are careful to ensure that only one of the terminal and the non-terminal is valid in any context, this does not affect parsability, and the resulting grammar, while a bit cluttered, is still readable. The advantage is that it makes the whitespace rules explicit.
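For example, the command rules from the question might be rewritten along these lines (a sketch, using WS wherever whitespace is mandatory):

<command> ::= print WS <expression>
            | set WS <variable> WS <expression>
            | read WS <variable>
            | readint WS <variable>
            | launch WS <expression>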
Another option is to handle the issue in the lexer; that might be simple but it depends on the precise nature of the language.
From your description, it appears that the goal is to produce a syntax error if two tokens are not separated by whitespace, unless one of the tokens is "self-delimiting"; in the example shown, I believe the only such token is the semicolon, since you seem to indicate that # must be whitespace-delimited. (It could be that your complete language has more self-delimiting tokens, but that does not substantially alter the problem.)
That can be handled with a single start condition in the lexer (assuming you are using a lexer generator which allows explicit states); reading whitespace puts you in a state in which any token is valid (which is the initial state, INITIAL if you are using a lex-derivative). In the other state, only self-delimiting tokens are valid. The state after reading a token will be the restricted state unless the token is self-delimiting.
This requires pretty much every lexer action to include a state-transition action, but it leaves the grammar unaltered. The effect is to move the clutter from the parser to the scanner, at the cost of obscuring the whitespace rules. But it might be less clutter overall, and it will certainly simplify a future transition to a whitespace-agnostic dialect, if that is in your plans.
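Here is a minimal sketch of that state machine in Python rather than lex (the token set is cut down to the example language, and all names are illustrative): after any non-self-delimiting token the lexer enters a restricted state, and only whitespace or a self-delimiting token is accepted until whitespace resets it.

import re

# ';' is the only self-delimiting token here; '#' deliberately is not.
TOKEN = re.compile(r'(?P<ws>\s+)|(?P<semi>;)|(?P<word>"[^"]*"|[a-z]+|[0-9]+|#)')

def tokenize(text):
    restricted = False      # True right after a non-self-delimiting token
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"bad character at position {pos}")
        pos = m.end()
        if m.lastgroup == "ws":
            restricted = False          # whitespace re-enables every token
        elif m.lastgroup == "semi":
            restricted = False          # self-delimiting: always accepted
            yield ";"
        elif restricted:
            raise SyntaxError(f"missing whitespace before {m.group()!r}")
        else:
            restricted = True           # next token must be delimited
            yield m.group()

print(list(tokenize('print "hello" ;')))   # ['print', '"hello"', ';']
# list(tokenize('print"hello";')) raises: missing whitespace before '"hello"'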
There is a different scenario, which is a POSIX-like shell, in which identifiers (called "words" in the shell grammar) are not limited to alphabetic characters but might include any non-self-delimiting character. In a POSIX shell, print"hello, world" is a single word, distinct from the two-token sequence print "hello, world". (The first one will eventually be dequoted into the single token printhello, world.)
That scenario can really only be handled lexically, although it is not necessarily complicated. It might be a guide to your problem as well; for example, you could add a lexical rule which accepts any string of characters other than whitespace and self-delimiting characters; the maximal munch rule will ensure that action is only taken if the token cannot be recognised as an identifier or a string (or other valid tokens), so you can just throw an error in the action.
That is even simpler than the state-based lexer, but it is somewhat less flexible.
Does ANTLR4 support adaptive grammars that allow the user to specify new rules, such as enforcing the number of arguments specified in a function declaration?
Example:
Base language includes the following token definitions:
Token #1 is defined as [a-z][0-9]*
Token #2 is defined as [A-Z][0-9]*
Uppercase tokens are reserved for function names, and lowercase tokens for variables passed to the function.
The user can "declare" Fxy, and every following instance of F has to have two variables. I want the parser to enforce the "new rule".
Perhaps this is standard fare in compilers; I know the compilers I use for C, Python, etc. complain when I don't pass the right number of arguments for a function I declared elsewhere. However, I don't know how to do this myself in my own grammar; the undergrad course I took on compilers was more than 15 years ago, and I don't recall it covering how to enforce the number of arguments required by user-declared functions. I've written some simple languages with five keywords and scoping (brackets), somewhat akin to the calculator examples you find in textbooks, but nothing complex.
So I guess what I also want to know is whether the ANTLR books will teach me how to do this (I don't want to spend the money if the books don't explain what I want to achieve).
An adaptive grammar would be a grammar that can acquire new rules from its input. But that is not what you are really asking for, nor how parsers are typically used for the purposes you describe.
In general, a grammar defines the allowed syntax of the language (or DSL) while the visitors to the tree generated from the grammar determine if the language semantics are met. Whether a call to a named function contains the right number and type of parameters is a question of semantics, not syntax.
Consider the following grammar snippet:
decl : fname AS FUNC LPAREN params? RPAREN body ;
func : FUNC fname LPAREN params? RPAREN body ;
params : param ( COMMA param )* ;
param : type pname ;
stmnt : fname LPAREN ( pname ( COMMA pname )* )? RPAREN SEMI ;
It allows standard functions (methods) and it allows new functions to be declared. The stmnt rule allows a named function to be called.
Whether the type and number of pnames is correct is a question of semantics that can only be answered by an analysis implemented by walking the generated tree: is there a function with the given fname, do the number of pnames and params match, do the types match or are they convertible, etc.
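A minimal sketch of such an analysis in Python (the tuple-based tree shape and all names are my assumptions, not ANTLR's generated API; with ANTLR you would implement the same logic in a visitor or listener):

def check_calls(tree):
    declared = {}            # fname -> number of declared parameters
    errors = []

    def walk(node):
        if node[0] == "func":            # ("func", fname, [params], body)
            declared[node[1]] = len(node[2])
        elif node[0] == "call":          # ("call", fname, [args])
            name, args = node[1], node[2]
            if name not in declared:
                errors.append(f"unknown function {name!r}")
            elif len(args) != declared[name]:
                errors.append(f"{name!r} expects {declared[name]} "
                              f"argument(s), got {len(args)}")
        for child in node[1:]:
            if isinstance(child, tuple):
                walk(child)

    walk(tree)
    return errors

# Single forward pass, so declarations must precede uses.
program = ("prog",
           ("func", "F", ["x", "y"], ("body",)),
           ("call", "F", ["a"]))
print(check_calls(program))   # ["'F' expects 2 argument(s), got 1"]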
The ANTLR books will help. You may also wish to spend some time looking at the grammars repository to get a better feel for how different languages can be described by a grammar.
An adaptive grammar is essentially a grammar for a "self-extensible" parser that "learns" new grammar rules from its input. ANTLR does not appear to support adaptive grammars, but there are some other parser generators that do support them, such as dypgen, which is based on the GLR parsing algorithm.
OK, so here's a question: Given that Haskell allows you to define new operators with arbitrary operator precedence... how is it possible to actually parse Haskell source code?
You cannot know what operator precedences are set until you parse the source. But you cannot parse the source until you know the correct operator precedences. So... um, how?
Consider, for example, the expression
x *** y +++ z
Until we finish parsing the module, we don't know what other modules are imported, and hence what operators (and other identifiers) might be in scope. We certainly don't know their precedences yet. But the parser has to return something. Should it return
(x *** y) +++ z
Or should it return
x *** (y +++ z)
The poor parser has no way to know. This can only be determined once you hunt down the import that brings (+++) and (***) into scope, load that file off disk, and discover what the operator precedences are. Clearly the parser itself isn't going to do all that I/O; a parser just turns a stream of characters into an AST.
Clearly somebody somewhere has figured out how to do this. But I can't work it out... Any hints?
Quoting the GHC Trac page for the parser:
Infix operators are parsed as if they were all left-associative. The
renamer uses the fixity declarations to re-associate the syntax tree.
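In other words, the parser produces a uniformly left-associated tree, and a later pass rotates it once fixities are known. Here is a small Python sketch of that re-association step (the fixities and names are made up, and real GHC also has to report an error when same-precedence operators have conflicting associativity):

# Made-up fixity environment: operator -> (precedence, associativity).
FIXITY = {"+++": (7, "infixl"), "***": (5, "infixl")}

def reassoc(node):
    if not isinstance(node, tuple):
        return node                       # a leaf (variable name)
    op, left, right = node
    left, right = reassoc(left), reassoc(right)
    if isinstance(left, tuple):
        lop = left[0]
        lprec, lassoc = FIXITY[lop]
        prec, assoc = FIXITY[op]
        # The operator below us binds more loosely (or ties with a
        # right-associative pair): rotate it up above us.
        if lprec < prec or (lprec == prec and lassoc == assoc == "infixr"):
            return (lop, left[1], reassoc((op, left[2], right)))
    return (op, left, right)

# 'x *** y +++ z' parsed as if everything were left-associative:
tree = ("+++", ("***", "x", "y"), "z")
print(reassoc(tree))   # ('***', 'x', ('+++', 'y', 'z')), since +++ binds tighter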
András Kovács's answer describes what's really done in GHC, but there's some history to this.
There was actually a somewhat hypothetical change from the Haskell 98 to the Haskell 2010 standard. In the former's BNF grammar, operator fixity and parsing were intertwined in such a way that you could in theory have some very strange interactions between the rules for fixity and the rules for when expressions and indentation blocks end. (For the latter two, the rules are essentially, "keep on going until you have to stop".)
In particular, you could redefine a local operator and its fixity such that a use of it belonged in the redefining inner block exactly when it didn't. So you got a parser paradox. I cannot find any of the old examples, but this may be one:
let (+) = (Prelude.+)
    infix 9 +  -- make the inner + high precedence and non-associative
in 2 + 3 + 4
-- ^ this + cannot parse here as the inner operator, which means
-- the let ... in ... expression should end automatically first,
-- but then it's the standard +, and its fixity says it should parse
-- as part of the inner expression...
In Haskell 2010 they officially changed that so that operator fixities are determined in a separate stage after the parsing proper.
So why was this a hypothetical change? Because all the compiler writers already did it the Haskell 2010 way, and always had, for their own sanity.
Summarising the comments so far, it seems the possibilities are these:
Return a parse tree where any infix operators are left as some kind of "list" structure, and then rearrange once precedences become known.
Pretend you know the operator precedences, and then rearrange the parse tree after the fact.
Do a first parse that only reads imports and fixity declarations, load the imports, and then do a full parse with known precedences.