Tokenising whitespace in a half-whitespace-sensitive language?

I have made some searches, including taking a second look through the red Dragon Book in front of me, but I haven't found a clear answer to this. Most people are talking about whitespace-sensitivity in terms of indentation, but that's not my case.
I want to implement a transpiler for a simple language. This language has a concept of a "command", which is a reserved keyword followed by some arguments. To give you an idea of what I'm talking about, a sequence of commands may look something like this:
print "hello, world!";
set running 1;
while running #
read progname;
launch progname;
print "continue? 1 = yes, 0 = no";
readint running;
#
Informally, you can view the grammar as something along the lines of
<program> ::= <statement> <program>
| <statement>
<statement> ::= while <expression> <sequence>
| <command> ;
<sequence> ::= # <program> #
| <statement>
<command> ::= print <expression>
| set <variable> <expression>
| read <variable>
| readint <variable>
| launch <expression>
<expression> ::= <variable>
| <string>
| <int>
For simplicity, we can define the following:
<string> is an arbitrary sequence of characters surrounded by quotes
<int> is a sequence of characters '0'..'9'
<variable> is a sequence of characters 'a'..'z'
Now this would ordinarily not be any problem. In fact, given just this specification I have a working implementation, where the lexer silently eats all whitespace. However, here's the catch:
Arguments to commands must be separated by whitespace!
In other words, it should be illegal to write
while running#print"hello";#
even though this clearly isn't ambiguous as far as the grammar is concerned. I have had two ideas on how to solve this.
1. Output a token whenever some whitespace is consumed, and include whitespace in the grammar. I suspect this will make the grammar a lot more complicated.
2. Rewrite the grammar so that instead of "hard-coding" the arguments of each command, I have a production rule for "arguments" which takes care of whitespace. It may look something like
<command> ::= <cmdtype> <arguments>
<arguments> ::= <argument> <arguments>
<argument> ::= <expression>
<cmdtype> ::= print | set | read | readint | launch
Then we can make sure the lexer somehow (?) takes care of leading whitespace whenever it encounters an <argument> token. However, this moves the complexity of dealing with the arity (among other things?) of built-in commands into the parser.
How is this normally solved? When the grammar of a language requires whitespace in particular places but leaves it optional almost everywhere else, does it make sense to deal with it in the lexer or in the parser?
I wish I could fudge the specification of the language just a teeny tiny bit because that would make it much simpler to implement, but unfortunately this is a backward-compatibility issue and not possible.

Backwards compatibility is usually taken to apply only to correct programs; accepting a program which previously would have been rejected as a syntax error cannot alter the behaviour of any valid program and thus does not violate backwards compatibility.
That might not be relevant in this case, but since it would, as you note, simplify the problem considerably, it seemed worth mentioning.
One solution is to pass whitespace on to the parser, and then incorporate it into the grammar; normally, you would define a terminal, WS, and from that a non-terminal for optional whitespace:
<ows> ::= WS |
If you are careful to ensure that only one of the terminal and the non-terminal are valid in any context, this does not affect parsability, and the resulting grammar, while a bit cluttered, is still readable. The advantage is that it makes the whitespace rules explicit.
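For instance, the command rule from the question might be rewritten along these lines (just a sketch, with WS marking the places where whitespace is mandatory and <ows> used anywhere it is merely permitted):
<command> ::= print WS <expression>
| set WS <variable> WS <expression>
| read WS <variable>
| readint WS <variable>
| launch WS <expression>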
Another option is to handle the issue in the lexer; that might be simple but it depends on the precise nature of the language.
From your description, it appears that the goal is to produce a syntax error if two tokens are not separated by whitespace, unless one of the tokens is "self-delimiting"; in the example shown, I believe the only such token is the semicolon, since you seem to indicate that # must be whitespace-delimited. (It could be that your complete language has more self-delimiting tokens, but that does not substantially alter the problem.)
That can be handled with a single start condition in the lexer (assuming you are using a lexer generator which allows explicit states); reading whitespace puts you in a state in which any token is valid (which is the initial state, INITIAL if you are using a lex-derivative). In the other state, only self-delimiting tokens are valid. The state after reading a token will be the restricted state unless the token is self-delimiting.
This requires pretty well every lexer action to include a state transition action, but leaves the grammar unaltered. The effect is to move the clutter from the parser to the scanner, at the cost of obscuring the whitespace rules. But it might be less clutter and it will certainly simplify a future transition to a whitespace-agnostic dialect, if that is in your plans.
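For concreteness, here is a minimal hand-written sketch of that two-state scheme in Python rather than a lex specification; the token names, and the choice of ';' as the only self-delimiting token, are assumptions taken from the example above:
import re

# Token patterns, tried in order at the current position.
TOKEN_SPEC = [
    ("SEMI",   re.compile(r";")),        # self-delimiting
    ("HASH",   re.compile(r"#")),        # must be whitespace-delimited
    ("STRING", re.compile(r'"[^"]*"')),
    ("INT",    re.compile(r"[0-9]+")),
    ("WORD",   re.compile(r"[a-z]+")),   # keywords and variables alike
]
SELF_DELIMITING = {"SEMI"}
WS = re.compile(r"\s+")

def tokenize(text):
    pos = 0
    delimited = True            # start of input counts as "after whitespace"
    while pos < len(text):
        ws = WS.match(text, pos)
        if ws:
            pos = ws.end()
            delimited = True    # whitespace returns us to the unrestricted state
            continue
        for name, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            if m:
                if not delimited and name not in SELF_DELIMITING:
                    raise SyntaxError(f"missing whitespace before {name} at offset {pos}")
                yield name, m.group()
                # after any token, only self-delimiting tokens may follow directly
                delimited = name in SELF_DELIMITING
                pos = m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")

print(list(tokenize('print "hello, world!";')))   # accepted
# list(tokenize('while running#')) raises SyntaxError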
There is a different scenario, which is a POSIX-like shell in which identifiers (called "words" in the shell grammar) are not limited to alphabetic characters, but might include any non-self-delimiting character. In a POSIX shell, print"hello, world" is a single word, distinct from the two-token sequence print "hello, world". (The first one will eventually be dequoted into the single token printhello, world.)
That scenario can really only be handled lexically, although it is not necessarily complicated. It might be a guide to your problem as well; for example, you could add a lexical rule which accepts any string of characters other than whitespace and self-delimiting characters; the maximal munch rule will ensure that action is only taken if the token cannot be recognised as an identifier or a string (or other valid tokens), so you can just throw an error in the action.
That is even simpler than the state-based lexer, but it is somewhat less flexible.
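As a sketch of that catch-all approach (again in Python rather than lex, with the same assumed token set): one lowest-priority rule matches any run of characters that are neither whitespace nor self-delimiting; taking the longest match at each position emulates maximal munch, so the catch-all can only win when no real token covers the whole run:
import re

RULES = [
    ("SEMI",   re.compile(r";")),
    ("HASH",   re.compile(r"#")),
    ("STRING", re.compile(r'"[^"]*"')),
    ("INT",    re.compile(r"[0-9]+")),
    ("WORD",   re.compile(r"[a-z]+")),
    ("JUNK",   re.compile(r"[^\s;]+")),  # catch-all: not whitespace, not ';'
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Maximal munch: the longest match wins; ties go to the earlier rule,
        # because max() returns the first maximal element.
        name, m = max(
            ((name, rx.match(text, pos)) for name, rx in RULES),
            key=lambda r: r[1].end() if r[1] else -1,
        )
        if name == "JUNK":
            raise SyntaxError(f"badly delimited token at offset {pos}")
        yield name, m.group()
        pos = m.end()

print(list(tokenize('while running ;')))          # accepted
# list(tokenize('while running#print"hello";#')) raises SyntaxError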

Related

I have a problem replacing EBNF rules with regex in a TatSu grammar

I have developed a syntax checker for the Gerber format, using Tatsu. It works fine, my thanks to the Tatsu developers. However, it is not overly fast, and I am now optimizing the grammar.
The Gerber format is a stream of commands, and this is handled by the main loop of the grammar, which is as follows:
start =
{
| ['X' integer] ['Y' integer] ((['I' integer 'J' integer] 'D01*')|'D02*'|'D03*')
| ('G01*'|'G02*'|'G03*'|'G1*'|'G2*'|'G3*')
... about 25 rules
}*
M02
$;
with
integer = /[+-]?[0-9]+/;
In big files, where the performance is important, the vast majority of the statements are covered by the first rule in the choice. (It is actually three commands. Putting them first, and merging them to eliminate common elements, made the checker 2-3 times faster.)
Now I am trying to replace the first rule with a regex, assuming a regex is faster because it is implemented in C.
In the first step I inlined the integer:
| ['X' /[+-]?[0-9]+/] ['Y' /[+-]?[0-9]+/] ((['I' /[+-]?[0-9]+/ 'J' /[+-]?[0-9]+/] 'D01*')|'D02*'|'D03*')
This worked fine and gave a modest speedup.
Then I tried to regex the whole rule. Failure. As a test I only modified the first rule in the sequence:
| /(X[+-]?[0-9]+)?/ ['Y' /[+-]?[0-9]+/] ((['I' /[+-]?[0-9]+/ 'J' /[+-]?[0-9]+/] 'D01*')|'D02*'|'D03*')
This fails to recognize the following command:
X81479571Y-38450761D01*
I cannot see the difference between ['X' /[+-]?[0-9]+/] and /(X[+-]?[0-9]+)?/
What am I missing?
The difference is that an optional expression with [] will advance over whitespace and comments, while a pattern expression with // will not; it's in the documentation. A trick for this case is to place the pattern in its own, initial-lower-case rule, so that whitespace and comments are tokenized away before the pattern is applied, though I don't think adding that indirection will help with performance.
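In TatSu that rewrite might look roughly like this (a hypothetical, cut-down grammar, not the asker's real one; 'xopt' is an invented rule name):
import tatsu  # pip install tatsu

# The pattern now lives in its own lower-case rule, so TatSu's usual
# whitespace/comment skipping runs before the pattern is applied.
GRAMMAR = r'''
    start   = xopt ['Y' integer] 'D01*' $ ;
    xopt    = /(X[+-]?[0-9]+)?/ ;
    integer = /[+-]?[0-9]+/ ;
'''

parser = tatsu.compile(GRAMMAR)
print(parser.parse('X81479571Y-38450761D01*'))  # parses as X, Y, D01*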
As to optimization, a trick I've used in the "...25 more rules" case is to group rules with similar prefixes under a &lookahead, for example &/G0/ in your case.
TatSu is designed to be friendly to grammar writers rather than to be performant. If you need blazing speed through the generation of parsers in C, you may want to take a look at pegen, the predecessor of the new PEG parser in CPython.

Circular Dependency in Parser Grammar

I am trying to build my first parser. Unfortunately I am not familiar with the theory of grammars and now I wonder whether it is
plainly forbidden
just a bad idea or
kind of OK
to have circular dependencies in my grammar. My intuition raises a yellow flag, but since I am not familiar with the theory of parsers, I am not sure.
Assume my lexer is well-defined and its tokens are the ones one would expect from their names; then I have the following grammar:
list_content : value
| list_content COMMA list_content
list : LBRACE list_content RBRACE
value : INT
| list
In there, value depends on list, list depends on list_content and list_content depends on value.
I have seen recursive definitions in grammars before, such as:
sum : NUMBER + NUMBER
| NUMBER + sum
| LBRACE sum RBRACE
However, I think my circular definition is different (that is, dirtier), because it is harder to take in at a glance and the defining cycle spans multiple grammar rules. I am not sure whether my circular definition creates an ambiguity in my grammar. I also fear it might make my code hard to debug.
So, I have two questions:
A) Should I restructure my grammar (and my lexer) or is it OK to live with this circular definition?
B) If I should restructure, how would I best do so?
A circular dependence like this is fine -- it's a recursive definition and is analogous to using recursion in a program. As such, the important thing to look at is how the base case is realized, since that is how the recursion terminates. If you don't have a base case (or it can't be reached without also triggering additional recursion), you have a problem -- an infinite loop that can never match any finite input.
In your case, the base case is the INT production -- since value can reduce to a single INT and list_content to a single value, everything is fine.
You do have one issue in your grammar in that the rule
list_content: list_content COMMA list_content
is ambiguous. What this means is that for any list with three or more elements (two or more COMMAs), there are multiple ways to parse it -- either matching the left comma (left recursion) or the right comma (right recursion) first. This will cause problems with most parser tools, which cannot deal with ambiguity, although in your case the ambiguity is probably harmless (you don't really care which way it is parsed, since you'll likely just concatenate the lists).
The fix for this is to rewrite the rule as a simple left- or right-recursive rule (but not both). Which one you want depends on the parser style you are using -- for an LL (top-down or recursive descent) parser, you want a right-recursive rule; for an LR (bottom-up or shift/reduce) parser, you (generally) want a left-recursive rule.
left recursive: list_content : value | list_content COMMA value ;
right recursive: list_content : value | value COMMA list_content ;
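To see why the recursion terminates, here is a toy recursive-descent rendering of the right-recursive version in Python (the tokenizer is a stand-in; token names follow the grammar above). Note that list_content is realized with a loop, the usual way right recursion is implemented in a hand-written parser:
import re

def tokenize(text):
    spec = [("INT", r"[0-9]+"), ("COMMA", r","),
            ("LBRACE", r"\{"), ("RBRACE", r"\}")]
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pat in spec:
            m = re.match(pat, text[pos:])
            if m:
                yield name, m.group()
                pos += m.end()
                break
        else:
            raise SyntaxError(f"bad character {text[pos]!r}")

class Parser:
    def __init__(self, tokens):
        self.toks = list(tokens) + [("EOF", "")]
        self.i = 0

    def eat(self, kind):
        name, text = self.toks[self.i]
        if name != kind:
            raise SyntaxError(f"expected {kind}, got {name}")
        self.i += 1
        return text

    def value(self):            # value : INT | list
        if self.toks[self.i][0] == "INT":
            return int(self.eat("INT"))
        return self.list_()     # the recursion bottoms out at INT

    def list_(self):            # list : LBRACE list_content RBRACE
        self.eat("LBRACE")
        items = self.list_content()
        self.eat("RBRACE")
        return items

    def list_content(self):     # list_content : value | value COMMA list_content
        items = [self.value()]
        while self.toks[self.i][0] == "COMMA":
            self.eat("COMMA")
            items.append(self.value())
        return items

print(Parser(tokenize("{1, {2, 3}, 4}")).value())   # [1, [2, 3], 4]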

Grammar: Precedence of grammar alternatives

This is a very basic question about grammar alternatives. If you have the following alternative:
Myalternative: 'a' | .;
Myalternative2: 'a' | 'b';
Would 'a' have higher priority than '.' and than 'b'?
I understand that this may also depend on the behaviour of the parser generated from this syntax, but in purely theoretical grammar terms: could you imagine these rules being matched in parallel, i.e. tested against 'a' and '.' at the same time, with the highest-priority match selected? Or are 'a' and '.' ambiguous due to the lack of precedence in grammars?
The answer depends primarily on the tool you are using, and what the semantics of that tool is. As written, this is not a context-free grammar in canonical form, and you'd need to produce that to get a theoretical answer, because only in that way can you clarify the intended semantics.
Since the question is tagged antlr, I'm going to guess that this is part of an Antlr lexical definition in which . is a wildcard character. In that case, 'a' | . means exactly the same thing as ..
Since MyAlternative matches everything that MyAlternative2 matches, and since MyAlternative comes first in the Antlr lexical definition, MyAlternative2 can never match anything. Any single character will be matched by MyAlternative (unless there is some other lexical rule which matches a longer sequence of input characters).
If you put the definition of MyAlternative2 first in the grammar file, then a or b would be matched as MyAlternative2, while any other character would be matched as MyAlternative.
The question of precedence within alternatives is meaningless. It doesn't matter whether MyAlternative considers the match of an a to be a match of a or a match of .. It is, in the end, a match of MyAlternative, and that symbol can only have one associated action.
Between lexical rules, there is a precedence relationship: The first one wins. (More accurately, as far as I know, Antlr obeys the usual convention that the longest match wins; between two rules which both match the same longest sequence, the first one in the grammar file wins.) That is not in any way influenced by alternative bars in the rules themselves.
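That convention is easy to demonstrate outside of any particular tool; here is a hypothetical Python sketch of the tie-breaking logic, with the two rules in file order:
import re

RULES = [("MyAlternative", r"a|."), ("MyAlternative2", r"a|b")]

def next_token(text, pos=0):
    matches = [(name, re.compile(pat, re.DOTALL).match(text, pos))
               for name, pat in RULES]
    matches = [(name, m) for name, m in matches if m]
    longest = max(m.end() for _, m in matches)
    # Longest match wins; among equally long matches, the first rule wins.
    for name, m in matches:
        if m.end() == longest:
            return name, m.group()

print(next_token("a"))   # ('MyAlternative', 'a') -- rule 2 never wins
print(next_token("b"))   # ('MyAlternative', 'b')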

bison/yacc - limits of precedence settings

So I've been trying to parse a Haskell-like language grammar with bison. I'll omit the standard problems with grammars and unary minus (like whether (-5) means the constant -5 or the section \x->x-5, or whether a-b is a-(b) or apply a (-b), which itself could still be apply a \x->x-b, haha) and go straight to the thing that surprised me.
To simplify the whole thing to the point where it matters, consider the following situation:
expression: '(' expression ')'
| expression expression /* lambda application */
| '\\' IDENTIFIER "->" expression /* lambda abstraction */
| expression '+' expression /* some operators to play with */
| expression '*' expression
| IDENTIFIER | CONSTANT /* | ..... */
;
I solved all shift/reduce conflicts with '+' and '*' using the %left and %right precedence macros, but I somehow failed to find any good solution for setting the precedence of the lambda application rule (expression expression). I tried %precedence, %left, and the %prec marker, as shown for example at http://www.gnu.org/software/bison/manual/html_node/Non-Operators.html#Non-Operators, but it looks like bison completely ignores any precedence setting on this rule; at least, all the combinations I tried failed. Documentation on exactly this topic is pretty sparse, and the whole mechanism looks suited only to handling the "classic" expr OPER expr case.
Question: Am I doing something wrong, or is this impossible in Bison? If it is the latter, is it just unsupported, or is there some theoretical justification why not?
Remark: Of course there's an easy workaround to force left-folding and precedence, which would look schematically like
expression: expression_without_lambda_application
| expression expression_without_lambda_application
;
expression_without_lambda_application: /* ..operators.. */
| '(' expression ')'
;
...but that's not as neat as it could be, right? :]
Thanks!
It's easiest to understand how bison precedence works if you understand how LR parsing works, since it's based on a simple modification of the LR algorithm. (Here, I'm just combining SLR, LALR and LR grammars, because the basic algorithm is the same.)
An LR(1) machine has two possible classes of action:
Reduce the right-hand side of the production which ends just before the lookahead token (and consequently is at the top of the stack).
Shift the lookahead token.
In an LR(1) grammar, the decision can always be made on the basis of the machine state and the lookahead token. But certain common constructs -- notably infix expressions -- apparently require grammars which appear more complicated than they need to be, and which require more unit reductions than should be necessary.
In an era in which LR parsing was new, and most practitioners were used to some sort of operator precedence grammar (see below for definition), and in which cycles were a lot more expensive than they are now so that the extra unit reductions seemed annoying, the modification of the LR algorithm to use standard precedence techniques was attractive.
The modification -- which is based on a classic algorithm for parsing operator precedence grammars -- involves assigning a precedence value (an integer) to every right-hand side (i.e. every production) and to every terminal. Then, when constructing the LR machine, if a given state and lookahead can trigger either a shift or a reduce action, the conflict is resolved by comparing the precedence of the possible reduction with the precedence of the lookahead token. If the reduction has a higher precedence, it wins; otherwise the machine shifts.
Note that reduction precedences are never compared with each other, and neither are token precedences. They can actually come from different domains. Furthermore, for a simple expression grammar, intuitively the comparison is with the operator "at the top of the stack"; this is actually accomplished by using the right-most terminal in a production to assign the precedence of the production. To handle left vs. right associativity, we don't actually use the same precedence value for a production as for a terminal. Left-associative productions are given a precedence slightly higher than the terminal's precedence, and right-associative productions are given a precedence slightly lower. This could be done by making the terminal precedences multiples of 3 and the reduction precedences a value one greater or less than the terminal. (Actually in practice the comparison is > rather than ≥ so it's possible to use even numbers for terminals, but that's an implementation detail.)
As it turns out, languages are not always quite so simple. So sometimes -- the case of unary operators is a classic example -- it's useful to explicitly provide a reduction precedence which is different from the default. (Another case is where the precedence is more related to the first terminal than the last, in the case where there are more than one.)
Editorial note:
Really, this is all a hack. It's a good hack, and it can be useful. But like all hacks, it can be pushed too far. Intricate tricks with precedence which require a full understanding of the algorithm and a detailed analysis of the grammar are not, IMHO, elegant. They are confusing. The whole point of using a context-free-grammar formalism and a parser generator is to simplify the presentation of the grammar and make it easier to verify. /Editorial note.
An operator precedence grammar is an operator grammar which can be bottom-up parsed using only precedence relations (using an algorithm such as the classic "shunting-yard" algorithm). An operator grammar is a grammar in which no right-hand side has two consecutive non-terminals. And the production:
expression: expression expression
cannot be expressed in an operator grammar.
In that production, the shift-reduce conflict comes in the middle, just before the place where the operator would be if there were an operator. In that case, one would want to compare the precedence of whichever reduction gave rise to the first expression with that of the invisible operator which separates the expressions.
In some circumstances (and this requires careful grammar analysis, and is consequently very fragile), it's possible to distinguish between terminals which could start an expression and terminals which could be operators. In that case, it would be possible to use the precedence of the terminals in the FIRST set of expression as the comparators in the precedence comparison. Since those terminals will never be used as the comparators in an operator production, no additional ambiguity is created.
Of course, that fails as soon as it is possible for a terminal to be either an infix or a prefix operator, such as unary minus. So it's probably only of theoretical interest in most languages.
In summary, I personally think that the solution of explicitly defining non-application expressions is clear, elegant and consistent with the theory of LR parsing, while any attempt to use precedence relations will turn out to be far less easy to understand and verify.
But, if you insist, here is a grammar which will work in this particular case (without unary operators), based on assigning precedence values to the tokens which might start an expression:
%token IDENTIFIER CONSTANT APPLY
%left '(' ')' '\\' IDENTIFIER CONSTANT APPLY
%left '+'
%left '*'
%%
expression: '(' expression ')'
| expression expression %prec APPLY
| '\\' IDENTIFIER "->" expression
| expression '+' expression
| expression '*' expression
| IDENTIFIER | CONSTANT
;

Is the word "lexer" a synonym for the word "parser"?

The title is the question: Are the words "lexer" and "parser" synonyms, or are they different? It seems that Wikipedia uses the words interchangeably, but English is not my native language so I can't be sure.
A lexer is used to split the input up into tokens, whereas a parser is used to construct an abstract syntax tree from that sequence of tokens.
Now, you could just say that the tokens are simply characters and use a parser directly, but it is often convenient to have a parser which only needs to look ahead one token to determine what it's going to do next. Therefore, a lexer is usually used to divide up the input into tokens before the parser sees it.
A lexer is usually described using simple regular expression rules which are tested in order. There exist tools such as lex which can generate lexers automatically from such a description.
[0-9]+ Number
[A-Z]+ Identifier
"+"    Plus
A parser, on the other hand, is typically described by specifying a grammar. Again, there exist tools such as yacc which can generate parsers from such a description.
expr ::= expr Plus expr
| Number
| Identifier
No. A lexer breaks the input stream up into "words"; a parser discovers the syntactic structure among those "words". For instance, given the input:
velocity = path / time;
lexer output is:
velocity (identifier)
= (assignment operator)
path (identifier)
/ (binary operator)
time (identifier)
; (statement separator)
and then the parser can establish the following structure:
= (assign)
lvalue: velocity
rvalue: result of
/ (division)
dividend: contents of variable "path"
divisor: contents of variable "time"
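As a toy Python rendering of that two-stage pipeline (token names taken from the listing above; the parser handles just this one assignment shape):
import re

def lex(source):
    # Stage 1: split the character stream into "words" (tokens).
    for m in re.finditer(r"[a-z]+|[=/;]", source):
        text = m.group()
        kind = {"=": "assign", "/": "binary_op", ";": "separator"}.get(text, "identifier")
        yield kind, text

def parse(tokens):
    # Stage 2: discover the structure among the tokens.
    (_, lvalue), (k1, _), (_, dividend), (k2, _), (_, divisor), (k3, _) = list(tokens)
    assert (k1, k2, k3) == ("assign", "binary_op", "separator")
    return ("=", lvalue, ("/", dividend, divisor))

print(parse(lex("velocity = path / time;")))
# ('=', 'velocity', ('/', 'path', 'time'))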
No. A lexer breaks down the source text into tokens, whereas a parser interprets the sequence of tokens appropriately.
They're different.
A lexer takes a stream of input characters as input, and produces tokens (aka "lexemes") as output.
A parser takes tokens (lexemes) as input, and produces (for example) an abstract syntax tree representing statements.
The two are enough alike, however, that quite a few people (especially those who've never written anything like a compiler or interpreter) treat them as the same, or (more often) use "parser" when what they really mean is "lexer".
As far as I know, "lexer" and "parser" are allied in meaning but are not exact synonyms. Though many sources use them interchangeably, a lexer (an abbreviation of "lexical analyser") identifies the tokens of the language in the input, while a parser determines whether a stream of tokens conforms to the grammar of the language under consideration.
