Difference between terminal symbol and nonterminal symbol? - parsing

What's the difference between terminal and non-terminal symbols in system programming, with examples?

A terminal symbol is an atomic element of the language itself, while a non-terminal symbol is a placeholder that the production rules expand into sequences of terminal symbols.
Terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. Terminal symbols are the elementary symbols of the language defined by a formal grammar. Nonterminal symbols (or syntactic variables) are replaced by groups of terminal symbols according to the production rules.
Source: http://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols
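For example, consider a toy BNF grammar (illustrative only, not from any particular language):
<expr> ::= <expr> "+" <digit> | <digit>
<digit> ::= "0" | "1"
Here <expr> and <digit> are the nonterminal symbols, while "+", "0" and "1" are the terminal symbols: a derived sentence such as 1+0+1 consists only of terminals, obtained by repeatedly replacing nonterminals according to the production rules.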

Related

Does context-sensitive tokenisation require multiple goal symbols in the lexical grammar?

According to the ECMAScript spec:
There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar.
Two such symbols are InputElementDiv and InputElementRegExp.
In ECMAScript, the meaning of / depends on the context in which it appears: a / can be a division operator, the start of a regex literal, or a comment delimiter. The lexer cannot distinguish between a division operator and a regex literal on its own, so it must rely on context information from the parser.
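For example (plain ECMAScript):
const a = 6, b = 2;
const q = a / b;   // "/" is the division operator
const r = /ab/g;   // "/" delimits a regular-expression literal
Character by character, the two slashes look identical to the lexer; only the surrounding context disambiguates them.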
I'd like to understand why this requires the use of multiple goal symbols in the lexical grammar. I don't know much about language design, so I'm not sure whether this stems from some formal requirement of a grammar or is just convention.
Questions
Why not just use a single goal symbol like so:
InputElement ::
[...]
DivPunctuator
RegularExpressionLiteral
[...]
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
What are some other languages that use multiple goal symbols in their lexical grammar?
How would we classify the ECMAScript lexical grammar? It's not context-sensitive in the sense of the formal definition of a CSG (i.e., the left-hand sides of its productions are not surrounded by a context of terminal and nonterminal symbols).
Saying that the lexical production is "sensitive to the syntactic grammar context that is consuming the input elements" does not make the grammar context-sensitive, in the formal-languages definition of that term. Indeed, there are productions which are "sensitive to the syntactic grammar context" in just about every non-trivial grammar. It's the essence of parsing: the syntactic context effectively provides the set of potentially expandable non-terminals, and those will differ in different syntactic contexts, meaning that, for example, in most languages a statement cannot be entered where an expression is expected (although it's often the case that an expression is one of the manifestations of a statement).
However, the difference does not involve different expansions for the same non-terminal. What's required in a "context-free" language is that the set of possible derivations of a non-terminal is the same set regardless of where that non-terminal appears. So the context can provide a different selection of non-terminals, but every non-terminal can be expanded without regard to its context. That is the sense in which the grammar is free of context.
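For instance, in a toy grammar (not from any particular language) with
Stmt → if Expr then Stmt
Stmt → Expr ;
Expr → Expr + Expr
Expr → id
the context determines whether a Stmt or an Expr is expected next, but wherever Expr does appear --as a condition after "if" or at the start of a statement-- its possible expansions are exactly the same two productions. That is what "free of context" means.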
As you note, context-sensitivity is usually modelled in a grammar by allowing a pattern on the left-hand side of a production, rather than a single non-terminal. In the original definition, the context --everything other than the non-terminal to be expanded-- needed to be passed through the production untouched; only a single non-terminal could be expanded, but the possible expansions depend on the context, as indicated by the productions. Implicit in the above is that there are grammars which can be written in BNF which don't even conform to that rule for context-sensitivity (or some other equivalent rule). So it's not a binary division, either context-free or context-sensitive. It's possible for a grammar to be neither (and, since the empty context is still a context, any context-free grammar is also context-sensitive). The bottom line is that when mathematicians talk, the way they use words is sometimes unexpected. But it always has a clear underlying definition.
In formal language theory, there are not lexical and syntactic productions; just productions. If both the lexical productions and the syntactic productions are free of context, then the total grammar is free of context. From a practical viewpoint, though, combined grammars are harder to parse, for a variety of reasons which I'm not going to go into here. It turns out that it is somewhat easier to write the grammars for a language, and to parse them, with a division between lexical and syntactic parsers.
In the classic model, the lexical analysis is done first, so that the parser doesn't see individual characters. Rather, the syntactic analysis is done with an "alphabet" (in a very expanded sense) of "lexical tokens". This is very convenient -- it means, for example, that the lexical analysis can simply drop whitespace and comments, which greatly simplifies writing a syntactic grammar. But it also reduces generality, precisely because the syntactic parser cannot "direct" the lexical analyser to do anything. The lexical analyser has already done what it is going to do before the syntactic parser is aware of its needs.
If the parser were able to direct the lexical analyser, it would do so in the same way as it directs itself: in some productions the acceptable token non-terminal would be InputElementDiv, while in other productions it would be InputElementRegExp. As I noted, that's not context-sensitivity --it's just the normal functioning of a context-free grammar-- but it does require a modification to the organization of the program to allow the parser's goals to be taken into account by the lexical analyser. This is often referred to (by practitioners, not theorists) as "lexical feedback", and sometimes by terms which are rather less value-neutral; it's sometimes considered a weakness in the design of the language, because the neatly segregated lexer/parser architecture is violated. C++ is a pretty intense example, and indeed there are C++ programs which are hard for humans to parse as well, which is some kind of indication. But ECMAScript does not really suffer from that problem; human beings usually distinguish between the division operator and the regexp delimiter without exerting any noticeable intellectual effort. And, while the lexical feedback required to implement an ECMAScript parser does make the architecture a little less tidy, it's really not a difficult task, either.
Anyway, a "goal symbol" in the lexical grammar is just a phrase which the authors of the ECMAScript reference decided to use. Those "goal symbols" are just ordinary lexical non-terminals, like any other production, so there's no difference between saying that there are "multiple goal symbols" and saying that the "parser directs the lexer to use a different production", which I hope addresses the question you asked.
Notes
The lexical difference in the two contexts is not just that / has a different meaning. If that were all that it was, there would be no need for lexical feedback at all. The problem is that the tokenization itself changes. If an operator is possible, then the /= in
a /=4/gi;
is a single token (a compound assignment operator), and gi is a single identifier token. But if a regexp literal were possible at that point (and it's not, because regexp literals cannot follow identifiers), then /=4/gi would be lexed as one RegularExpressionLiteral token (body =4, flags gi) rather than as the four tokens /=, 4, /, gi.
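Spelled out, the two candidate tokenizations of that line are:
under InputElementDiv:     a, /=, 4, /, gi, ;
under InputElementRegExp:  a, /=4/gi, ;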
Parsers which are built from a single set of productions are preferred by some programmers (but not the one who is writing this :-) ); they are usually called "scannerless parsers". In a scannerless parser for ECMAScript there would be no lexical feedback because there is no separate lexical analysis.
There really is a breach between the theoretical purity of formal language theory and the practical details of writing a working parser of a real-life programming language. The theoretical models are really useful, and it would be hard to write a parser without knowing something about them. But very few parsers rigidly conform to the model, and that's OK. Similarly, the things which are popularly called "regular expressions" aren't regular at all, in a formal language sense; some "regular expression" operators aren't even context-free (back-references). So it would be a huge mistake to assume that some theoretical result ("regular expressions can be identified in linear time and constant space") is actually true of a "regular expression" library. I don't think parsing theory is the only branch of computer science which exhibits this dichotomy.
Why not just use a single goal symbol like so:
InputElement ::
...
DivPunctuator
RegularExpressionLiteral
...
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
Note that DivPunctuator and RegExLiteral aren't productions per se; rather, they're nonterminals. And in this context, they're right-hand sides (alternatives) in your proposed production for InputElement. So I'd rephrase your question as: why not have the syntactic parser tell the lexical parser which of those two alternatives to use? (Or equivalently, which of those two to suppress.)
In the ECMAScript spec, there's a mechanism to accomplish this: grammatical parameters (explained in section 5.1.5).
E.g., you could define the parameter Div, where:
+Div means "a slash should be recognized as a DivPunctuator", and
~Div means "a slash should be recognized as the start of a RegExLiteral".
So then your production would become
InputElement[Div] ::
...
[+Div] DivPunctuator
[~Div] RegularExpressionLiteral
...
But notice that the syntactic parser still has to tell the lexical parser to use either InputElement[+Div] or InputElement[~Div] as the goal symbol, so you arrive back at the spec's current solution, modulo renaming.
What are some other languages that use multiple goal symbols in their lexical grammar?
I think most languages don't try to define a single symbol that derives all tokens (or input elements), let alone divide it into variants like ECMAScript's InputElementFoo, so it might be difficult to find another language with something similar in its specification.
Instead, it's pretty common to simply define rules for the syntax of different kinds of tokens (e.g. Identifier, NumericLiteral) and then reference them from the syntactic productions. So that's kind of like having multiple lexical goal symbols, but not (I would say) in the sense you were asking about.
How would we classify the ECMAScript lexical grammar?
It's basically context-free, plus some extensions.

Is there any formal explanation why a lexer rule defined first is not visible to a parser rule defined later?

The initial title question was: Why does my lexer rule not work until I change it to a parser rule? The contents below relate to this question. I then found new information and changed the title question. Please see my comment!
My ANTLR grammar (only the Spaces rule and its use is important).
Because my input comes from an OCR source, there can be multiple whitespace characters; on the other hand, I need to recognize the spaces, because they carry meaning for the text structure.
For this reason in my grammar I defined
Spaces: Space (Space Space?)?;
but this throws the error above - the whitespace is not recognized.
So when I replace it with a parser rule (lowercase!) in my grammar
spaces: Space (Space Space?)?;
the error seems to be solved (subsequent errors appear - not part of this question).
So why, in this concrete case, is the error solved by using a parser rule instead of a lexer rule?
And in general - when to use a lexer rule and when a parser rule?
Thank you, guys!
A single space is being recognized as a Space and not as a Spaces, since it matches both lexical rules and Space comes first in the grammar file. (You can see that token type 1 is being recognized; Spaces would be type 9 by my count.)
ANTLR uses the common "maximum munch" lexical strategy, in which the token recognized corresponds to the longest possible match, with ties between patterns matching the same longest string broken by their order in the file. If you put Spaces first in the file, it would win that tie instead. If you make it a parser rule rather than a lexer rule, then it gets applied after the unambiguous lexical rule for Space.
Do you really want to allow at most three spaces? Otherwise, you could just ditch Space and define Spaces as ' '+ (it must be + rather than *, because a lexer rule must not be able to match the empty string).
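For reference, a minimal sketch of that lexer-side alternative (a hypothetical fragment; only the relevant rule is shown):
// One greedy lexer rule replaces the Space/Spaces pair; maximum munch
// makes it consume an entire run of spaces as a single token.
// Note '+' rather than '*': an ANTLR lexer rule must not be able to
// match the empty string.
Spaces : ' '+ ;
If the exact number of spaces still matters, the other route is to keep Space as the only lexer rule and count repetitions in a parser rule, as in the spaces rule above.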

Name for the `::=` symbol in BNF?

Backus-Naur Form uses ::= between the left and right sides of the production rules of a grammar. Wikipedia tells me that notation evolved from :≡. Do either of those symbols have a name?
Based on rici's tip that Unicode simply calls it DOUBLE COLON EQUAL (U+2A74), it doesn't seem there's another official name.

Does a parser or a lexer generate a symbol table?

I'm taking a compilers course and I'm recapping the introduction. It's a general overview of how the compiler process works.
I'm a bit confused however.
In my course it states: "in addition a lexical analyzer will typically access the symbol table to store/get information on certain source language concepts". This leads me to believe that a lexer will actually build a symbol table. The way I see it, it creates tokens and stores them in a table, stating what type of symbol each is, like "x -> VARIABLE", for example.
Then again, reading through Google hits, I can only seem to find vague information saying that the parser generates this. But the parsing phase comes after the lexer phase, so I'm a bit confused.
Symbol Table Population after parsing; Compiler building
(States that the parser populates the table)
http://www.cs.dartmouth.edu/~mckeeman/cs48/mxcom/doc/Symbols.html
Says "The symbol table is built by walking the syntax tree.". The syntax tree is generated by the parser, right? (Parse tree). So how can the lexer, which runs before the parser use this symbol table?
I understand that a lexer cannot know the scope of a variable and other information that is contained within a symbol table, and therefore that the parser will add this information to the table. However, a lexer does know whether a word is a variable, a declaration keyword, etc. Thus it should be able to build up a partial (?) symbol table. So could it perhaps be that they each build part of the symbol table?
I think part of the confusion stems from the fact that "symbol table" means different things to different people, and potentially at different stages in the compilation process.
It is generally agreed that the lexer splits the input stream into tokens (sometimes referred to as lexemes or terminals). These, as you say, can be categorized into different types: numbers, keywords, identifiers, punctuation symbols, and so on.
The lexer may store the recognized identifier tokens in a symbol table, but since the lexer typically does not know what an identifier represents, and since the same identifier can potentially mean different things in different compilation scopes, it is often the parser - which has more contextual knowledge - that is responsible for building the symbol table.
However, in some compiler designs the lexer simply builds a list of tokens, which is passed on to the parser (or the parser requests tokens from the input stream on demand); the parser in turn generates a parse tree (or sometimes an abstract syntax tree) as its output, and the symbol table is built only after parsing has completed for a given compilation unit, by traversing the parse tree.
Many different designs are possible.
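To illustrate the parser-built variant, here is a minimal scoped symbol table sketched in TypeScript (hypothetical names and a deliberately tiny interface -- not any particular compiler's design):
interface SymbolInfo {
  kind: "variable" | "function";
  declaredAtLine: number;
}

class SymbolTable {
  // A stack of scopes; the innermost scope is last.
  private scopes: Map<string, SymbolInfo>[] = [new Map()];

  enterScope(): void { this.scopes.push(new Map()); }
  exitScope(): void { this.scopes.pop(); }

  // Called by the parser when it recognizes a declaration.
  declare(name: string, info: SymbolInfo): void {
    this.scopes[this.scopes.length - 1].set(name, info);
  }

  // Search enclosing scopes, innermost first.
  lookup(name: string): SymbolInfo | undefined {
    for (let i = this.scopes.length - 1; i >= 0; i--) {
      const info = this.scopes[i].get(name);
      if (info !== undefined) return info;
    }
    return undefined;
  }
}
The lexer would emit an identifier token such as { type: "Identifier", text: "x" } without touching this table; only the parser, on seeing a declaration like var x = 1, knows enough to call declare("x", { kind: "variable", declaredAtLine: 3 }).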

Negative lookahead in LR parsing algorithm

Consider such a rule in grammar for an LR-family parsing generator (e.g YACC, BISON, etc.):
Nonterminal : [ lookahead not in {Terminal1, ..., TerminalN} ] Rule ;
It's an ordinary rule, except that it has a restriction: a phrase produced with this rule cannot begin with Terminal1, ..., TerminalN. (Surely, this rule can be replaced with a set of ordinary rules, but that results in a bigger grammar.) This can be useful for resolving conflicts.
The question is, is there a modification of LR table construction algorithm that accepts such restrictions? It seems to me that such a modification is possible (like precedence relations).
Surely, it can be checked at runtime, but I mean a compile-time check (a check performed while building the parsing table, like the %prec, %left, %right and %nonassoc directives in yacc-compatible generators).
I don't see why this shouldn't be possible, but I also don't see any obvious reason why it would be useful. Do you have an example in mind?
The easiest way to do this would be to do the grammar transform you mention in parentheses. This would make a larger grammar, but it won't artificially increase the number of LR states.
The basic transformation, with only a bit of hand-waving (a worked example follows the list):
For any production with terminal restrictions:
If the production starts with a non-nullable non-terminal, replace the non-terminal with a terminal-restricted version.
If the production starts with a terminal in the terminal restriction list, remove the production.
If the production starts with a terminal not in the terminal restriction list, no change is necessary.
If a production starts with a nullable non-terminal, you have to create two versions of the nullable non-terminal, one of which is always null, and the other of which is non-nullable; and then create two versions of the production, one starting with each of the new non-terminals. Then apply the above transforms, but interpreting "starts with" to mean "starts with after any always-null non-terminals."
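For concreteness, here is the transform applied to a hypothetical fragment (invented names, yacc-style notation):
A : [ lookahead not in { '-' } ] B rest ;
B : '-' b_tail
  | '+' b_tail
  ;
B is non-nullable and begins A's restricted production, so it is replaced by a terminal-restricted copy from which the offending alternative has been removed:
A : B_no_minus rest ;
B_no_minus : '+' b_tail ;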
You don't actually need to modify the grammar, since the above transformations can be done on the fly during the construction of the underlying SLR machine, at least for LR(0) and LALR(1) constructions.
