basic questions in BISON for parsing - parsing

I am new to bison, I have some basic questions if you could help me with them:
Which one is right from the following:
%left ’*’ ’/’
or
%left '*' '/'
that means instead of getting the token I use it to define it in the parser file
Can I define a rule like this:
EXP -> EXP "and" EXP
instead of
EXP -> EXP AND EXP //AND here is a token
If I have LEX and BISON files for building a parser which one should include the other and if
I have used a common header file in which one of them should the file be defined?
If the BISON algorithm found a match according to one of the rules what happens first it makes reduce then it does the action defined for the rule that matched or first it does the action and after that makes reduce to the stack?

Its tough to tell what you are asking due to your formatting, but think the answer is no. %left just defines a a token (exactly like %token does) and in addition sets a precedence level for that token. You still have to "get" the token by recognizing it in your lexer and returning the appropriate token value.
While you can use "and", you don't want to because its almost impossible to get right. Its much better to use AND or and (no quotes). The difference is that using quotes creates a token that is not output as a #define in the .tab.h file, so there's no easy way to generate that token in your lexer.
There a a number of ways to do it. The easiest is to have neither include the other and have the lex file include the header generated by bison's -d flag -- this is what most examples do. It is also possible to directly include the lex.yy.c file in the 3rd section of the .y file OR include the .tab.c in the top section of the .l file (but not both!) in which case you'll only compile the one file.
Its executes the action for the rule first (so the values for the items on the RHS are available while the action is executing), and then does the stack reduction, replacing the RHS items with the value the action put int $$.

I somewhat disagree with Chris on point 2. It's better to use "and" because then in the error messages the parser will report things about "and" rather then about TOK_AND or t_AND which certainly do not make sense to the user.
And it's not that hard to get it right: provided you inserted
%token TOK_AND "and"
somewhere, you can use either "and" or TOK_AND in the grammar file. But, IMHO, the former is much clearer.

Related

How to force no whitespace in dot notation

I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to keep this as accurate to as I can), there are situations where spaces cannot be inserted, for example "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace not being important everywhere else? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A simple validation function would simply check that the starting position of each token (available as p.lexpos(i) where p is the action function's parameter and i is the index of the token the the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a big of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator from its use as an absolute value expression (uncommon but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from their use in a function call. (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.

Advantages and disadvantages of creating named tokens for single characters

Let's say I have a simple syntax where you can assign a number to an identifier using = sign.
I can write the parser in two ways.
I can include the character token = directly in the rule or I can create a named token for it and use the lexer to recognize it.
Option #1:
// lexer
[A-Za-z_][A-Za-z_0-9]* { return IDENTIFIER; }
[0-9]+ { return NUMBER; }
// parser
%token IDENTIFIER NUMBER
%%
assignment : IDENTIFIER '=' NUMBER ;
Option #2:
// lexer
[A-Za-z_][A-Za-z_0-9]* { return IDENTIFIER; }
[0-9]+ { return NUMBER; }
= { return EQUAL_SIGN; }
// parser
%token IDENTIFIER NUMBER EQUAL_SIGN
%%
assignment : IDENTIFIER EQUAL_SIGN NUMBER ;
Both ways of writing the parser work and I cannot quite find an information about good practices concerning such situation.
The first snippet seems to be more readable but this is not my highest concern.
Is any of these two options faster or beneficial in other way? Are they technical reasons (other than readability) to prefer one over another?
Is there maybe a third, better way?
I'm asking about problems concerning huge parsers, where optimization may be a real issue, not just such toy examples as is shown here.
Aside from readability, it basically makes no difference. There really is no optimisation issue, no matter how big your grammar is. Once the grammar has been compiled, tokens are small integers, and one small integer is pretty well the same as any other small integer.
But I wouldn't underrate the importance of readability. For one thing, it's harder to see bugs in unreadable code. It's surprisingly common for a grammar bug to be the result of simply typing the wrong name for some punctuation character. It's much easier to see that '{' expr '{' is wrong than if it were written T_LBRC expr T_LBRC, and furthermore the named symbols are much harder to interpret for someone whose first language isn't English.
Bison's parse table compression requires token numbers to be consecutive integers, which is done by passing incoming token codes through a lookup table, hardly a major overhead. Not using character codes doesn't avoid this lookup, though, because the token numbers 1 through 255 are reserved anyway.
However, Bison's C++ API using named token constructors requires token names and single-character token codes cannot be used as token names (although they're not forbidden, since you're not required to use named constructors).
Given that use case, Bison recently introduced an option which generates consecutively numbered token codes in order to avoid the recoding; this option is not compatible with single-character tokens being represented as themselves. It's possible that not having to recode the token is a marginal speed-up, but it's hard to believe that it's significant, but if you're not going to use single-quoted tokens, you might as well go for it.
Personally, I don't think the extra complexity is justified, at least for the C API. If you do choose to go with token names, perhaps in order to use the C++ API's named constructors, I'd strongly recommend using Bison aliases in order to write your grammar with double-quoted tokens (also recommended for multi-character operator and keyword tokens).

How do you deal with keywords in Lex?

Suppose you have a language which allows production like this: optional optional = 42, where first "optional" is a keyword, and the second "optional" is an identifier.
On one hand, I'd like to have a Lex rule like optional { return OPTIONAL; }, which would later be used in YACC like this, for example:
optional : OPTIONAL identifier '=' expression ;
If I then define identifier as, say:
identifier : OPTIONAL | FIXED32 | FIXED64 | ... /* couple dozens of keywords */
| IDENTIFIER ;
It just feels bad... besides, I would need two kinds of identifiers, one for when keywords are allowed as identifiers, and another one for when they aren't...
Is there an idiomatic way to solve this?
Is there an idiomatic way to solve this?
Other than the solution you have already found, no. Semi-reserved keywords are definitely not an expected use case for lex/yacc grammars.
The lemon parser generator has a fallback declaration designed for cases like this, but as far as I know, that useful feature has never been added to bison.
You can use a GLR grammar to avoid having to figure out all the different subsets of identifier. But of course there is a performance penalty.
You've already discovered the most common way of dealing with this in lex/yacc, and, while not pretty, its not too bad. Normally you call your rule that matches an identifier or (set of) keywords whateverName, and you may have more than one of them -- as different contexts may have different sets of keywords they can accept as a name.
Another way that may work if you have keywords that are only recognized as such in easily identifiable places (such as at the start of a line) is to use a lex start state so as to only return a KEYWORD token if the keyword is in that context. In any other context, the keyword will just be returned as an identifier token. You can even use yacc actions to set the lexer state for somewhat complex contexts, but then you need to be aware of the possible one-token lexer lookahead done by the parser (rules might not run until after the token after the action is already read).
This is a case where the keywords are not reserved. A few programming languages allowed this: PL/I, FORTRAN. It's not a lexer problem, because the lexer should always know which IDENTIFIERs are keywords. It's a parser problem. It usually causes too much ambiguity in the language specification and parsing becomes a nightmare. The grammar would have this:
identifier : keyword | IDENTIFIER ;
keyword : OPTIONAL | FIXED32 | FIXED64 | ... ;
If you have no conflicts in the grammar, then you are OK. If you have conflicts, then you need a more powerful parser generator, such as LR(k) or GLR.

Tokenising whitespace in a half-whitespace sensitive language?

I have made some searches, including taking a second look through the red Dragon Book in front of me, but I haven't found a clear answer to this. Most people are talking about whitespace-sensitivity in terms of indentation, but that's not my case.
I want to implement a transpiler for a simple language. This language has a concept of a "command", which is a reserved keyword followed by some arguments. To give you an idea of what I'm talking about, a sequence of commands may look something like this:
print "hello, world!";
set running 1;
while running #
read progname;
launch progname;
print "continue? 1 = yes, 0 = no";
readint running;
#
Informally, you can view the grammar as something along the lines of
<program> ::= <statement> <program>
<statement> ::= while <expression> <sequence>
| <command> ;
<sequence> ::= # <program> #
| <statement>
<command> ::= print <expression>
| set <variable> <expression>
| read <variable>
| readint <variable>
| launch <expression>
<expression> ::= <variable>
| <string>
| <int>
for simplicity, we can define the following as such
<string> is an arbitrary sequence of characters surrounded by quotes
<int> is a sequence of characters '0'..'9'
<variable> is a sequence of characters 'a'..'z'
Now this would ordinarily not be any problem. In fact, given just this specification I have a working implementation, where the lexer silently eats all whitespace. However, here's the catch:
Arguments to commands must be separated by whitespace!
In other words, it should be illegal to write
while running#print"hello";#
even though this clearly isn't ambiguous as far as the grammar is concerned. I have had two ideas on how to solve this.
Output a token whenever some whitespace is consumed, and include whitespace in the grammar. I suspect this will make the grammar a lot more complicated.
Rewrite the grammar so instead of "hard-coding" the arguments of each command, I have a production rule for "arguments" taking care of whitespace. It may look something like
<command> ::= <cmdtype> <arguments>
<arguments> ::= <argument> <arguments>
<argument> ::= <expression>
<cmdtype> ::= print | set | read | readint | launch
Then we can make sure the lexer somehow (?) takes care of leading whitespace whenever it encounters an <argument> token. However, this moves the complexity of dealing with the arity (among other things?) of built-in commands into the parser.
How is this normally solved? When the grammar of a language requires whitespace in particular places but leaves it optional almost everywhere else, does it make sense to deal with it in the lexer or in the parser?
I wish I could fudge the specification of the language just a teeny tiny bit because that would make it much simpler to implement, but unfortunately this is a backward-compatibility issue and not possible.
Backwards compatibility is usually taken to apply only to correct programs; accepting a program which previously would have benn rejected as a syntax error cannot alter the behaviour of any valid program and thus does not violate backwards compatibility.
That might not be relevant in this case, but since it would, as you note, simplify the problem considerably, it seemed worth mentioning.
One solution is to pass whitespace on to the parser, and then incorporate it into the grammar; normally, you would define a terminal, WS, and from that a non-terminal for optional whitespace:
<ows> ::= WS |
If you are careful to ensure that only one of the terminal and the non-terminal are valid in any context, this does not affect parsability, and the resulting grammar, while a bit cluttered, is still readable. The advantage is that it makes the whitespace rules explicit.
Another option is to handle the issue in the lexer; that might be simple but it depends on the precise nature of the language.
From your description, it appears that the goal is to produce a syntax error if two tokens are not separated by whitespace, unless one of the tokens is "self-delimiting"; in the example shown, I believe the only such token is the semicolon, since you seem to indicate that # must be whitespace-delimited. (It could be that your complete language has more self-delimiting tokens, but that does not substantially alter the problem.)
That can be handled with a single start condition in the lexer (assuming you are using a lexer generator which allows explicit states); reading whitespace puts you in a state in which any token is valid (which is the initial state, INITIAL if you are using a lex-derivative). In the other state, only self-delimiting tokens are valid. The state after reading a token will be the restricted state unless the token is self-delimiting.
This requires pretty well every lexer action to include a state transition action, but leaves the grammar unaltered. The effect is to move the clutter from the parser to the scanner, at the cost of obscuring the whitespace rules. But it might be less clutter and it will certainly simplify a future transition to a whitespace-agnostic dialect, if that is in your plans.
There is a different scenario, which is a posix-like shell in which identifiers (called "words" in the shell grammar) are not limited to alphabetic characters, but might include any non-self-delimiting character. In a posix shell, print"hello, world" is a single word, distinct from the two token sequence print "hello, world". (The first one will eventually be dequoted into the single token printhello, world.)
That scenario can really only be handled lexically, although it is not necessarily complicated. It might be a guide to your problem as well; for example, you could add a lexical rule which accepts any string of characters other than whitespace and self-delimiting characters; the maximal munch rule will ensure that action is only taken if the token cannot be recognised as an identifier or a string (or other valid tokens), so you can just throw an error in the action.
That is even simpler than the state-based lexer, but it is somewhat less flexible.

lex/yacc: why do lexers have to include a parser's header file?

I'm trying to learn a little more about compiler construction, so I've been toying with flexc++ and bisonc++; for this question I'll refer to flex/bison however.
In bison, one uses the %token declaration to define token names, for example
%token INTEGER
%token VARIABLE
and so forth. When bison is used on the grammar specification file, a header file y.tab.h is generated which has some define directives for each token:
#define INTEGER 258
#define VARIABLE 259
Finally, the lexer includes the header file y.tab.h by returning the right code for each token:
%{
#include "y.tab.h"
%}
%%
[a-z] {
// some code
return VARIABLE;
}
[1-9][0-9]* {
// some code
return INTEGER;
}
So the parser defines the tokens, then the lexer has to use that information to know which integer codes to return for each token.
Is this not totally bizarre? Normally, the compiler pipeline goes lexer -> parser -> code generator. Why on earth should the lexer have to include information from the parser? The lexer should define the tokens, then flex creates a header file with all the integer codes. The parser then includes that header file. These dependencies would reflect the usual order of the compiler pipeline. What am I missing?
As with many things, it's just a historical accident. It certainly would have been possible for the token declarations to have been produced by lex (but see below). Or it would have been possible to force the user to write their own declarations.
It is more convenient for yacc/bison to produce the token numberings, though, because:
The terminals need to be parsed by yacc because they are explicit elements in the grammar productions. In lex, on the other hand, they are part of the unparsed actions and lex can generate code without any explicit knowledge about token values; and
yacc (and bison) produce parse tables which are indexed by terminal and non-terminal numbers; the logic of the tables require that terminals and non-terminals have distinct codes. lex has no way of knowing what the non-terminals are, so it can't generate appropriate codes.
The second argument is a bit weak, because in practice bison-generated parsers renumber token ids to fit them into the id-numbering scheme. Even so, this is only possible if bison is in charge of the actual numbers. (The reason for the renumbering is to make the id value contiguous; by another historical accident, it's normal to reserve codes 0 through 255 for single-character tokens, and 0 for EOF; however, not all the 8-bit codes are actually used by most scanners.)
In the lexer, the tokens are only present in the return value: they are part of the target language (ie. C++), and lex itself knows nothing about them.
In the parser, on the other hand, tokens are part of the definition language: you write them in the actual parser definition, and not just in the target language. So yacc has to know about these tokens.
The ordering of the phases is not necessarily reflected in the architecture of the compiler. The scanner is the first phase and the parser the second, so in a sense data flows from the scanner to the parser, but in a typical Bison/Flex-generated compiler it is the parser that controls everything, and it is the parser that calls the lexer as a helper subroutine when it needs a new token as input in the parsing process.

Resources