I was hoping that Bison would keep the last value generated by the last action executed, but that's not the case. I have no idea where it comes up with what it stores in YYSTYPE, but in my case it ends up being 49, which doesn't correspond to anything in my code... (it's not the ASCII code of the last character, nor the number of characters read so far, nor the number of the final state, nor the number of the final rule (although it's close; there are 48 rules in total)). It may be the value of the one-before-last token... that's something I'm trying to verify right now.
Anyway, how do I access the value produced by the action of the %start rule after the parser has finished?
It seems that the value stored in YYSTYPE is the one-before-last token processed...
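For reference, the usual workaround is to copy the start symbol's semantic value into a variable the caller can see, in the action of the top-level rule. A minimal sketch (assuming an int-valued grammar; `result` and the rule names here are hypothetical, not anything from the original grammar):

```yacc
%{
int result;   /* holds the value of the start symbol after yyparse() returns */
%}
%start input
%%
input: expr      { result = $1; }   /* copy the final value out */
     ;
```

After `yyparse()` returns 0, `result` holds whatever the last reduction of `input` assigned, instead of relying on whatever happens to be left in YYSTYPE.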
Within the scripting language I am implementing, valid IDs can consist of a sequence of digits, which means I have an ambiguous situation where "345" could be an integer or could be an ID, and which one isn't known until runtime. Up until now, I've been handling every case as an ID and planning to check at runtime whether a variable has been declared under that name. But while improving my implementation of a particular bit of code, I found a situation where an integer is valid but any other sort of ID would not be. It seems like it would make sense to handle this particular case as a parsing error, so that, e.g., the following bit of code, which activates all picks with a spell level tag greater than 5, would be considered valid:
foreach pick in hero where spell.level? > 5
pick.activate[]
nexteach
but the following which instead compares against an ID that can't be mistaken for an integer constant would be flagged as an error during parsing:
foreach pick in hero where spell.level? > threshold
pick.activate[]
nexteach
I've considered separate tokens, ID and ID_OR_INTEGER, but that means having to handle that ambiguity everywhere I'm currently using an ID, which is a lot of places, including variable declarations, expressions, looping structures, and procedure calls.
Is there a better way to indicate a parsing error than to just print to the error log, and maybe set a flag?
I would think about it differently. If an ID is "just a number" and plain numbers are also needed, I would say any string of digits is a number, and a number might designate an ID in some circumstances.
For bare integer literals (like 345), I would have the tokenizer return a NUMBER token, indicating it found an integer. In the parser, wherever you currently accept ID, change it to NUMBER, and call a lookup function to verify that the NUMBER is a valid ID.
I might have misunderstood your question. You start by talking about "345", but your second example has no integer strings.
I ran this experiment with Flex to see whether, when I enter ABC, it will see all of A, AB, and ABC, or only ABC, or only the first match in the list of expressions.
%{
#include <stdio.h>
%}
%%
A puts("got A");
AB puts("got AB");
ABC puts("got ABC");
%%
int main(int argc, char **argv)
{
yylex();
return 0;
}
When I enter ABC after compiling and running the program, it responds with "got ABC", which really surprised me: I thought lex doesn't keep track of visited text and only finds the first match, but actually it seems to find the longest match.
What strategy does Flex use to respond to A if and only if there is no longer match?
The fact that (F)lex uses the maximal-munch principle should hardly be surprising, since it is well documented in the Flex manual:
When the generated scanner is run, it analyzes its input looking for strings which match any of its patterns. If it finds more than one match, it takes the one matching the most text…. If it finds two or more matches of the same length, the rule listed first in the flex input file is chosen.
(First paragraph of the section "How the input is matched")
The precise algorithm is exceedingly simple: every time a token is requested, flex scans the text, moving through the DFA. Every time it hits an accepting state, it records the current text position. When no more transitions are possible, it returns to the last recorded accept position, and that becomes the end of the token.
The consequence is that (F)lex can scan the same text multiple times, although it only scans once for each token.
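The accept-position bookkeeping can be sketched in plain C (the patterns are the three from the experiment above; real flex walks a DFA once rather than re-testing every prefix, but the record-the-last-accept idea is the same):

```c
#include <string.h>

/* Patterns from the flex experiment, in file order. */
static const char *patterns[] = { "A", "AB", "ABC" };

/* Maximal munch, sketched: try each prefix length of the input and
 * remember the end position of the last (i.e. longest) prefix that
 * matches some pattern.  The token ends at that position. */
size_t longest_match(const char *input) {
    size_t last_accept = 0;              /* end of longest match so far */
    for (size_t len = 1; len <= strlen(input); len++)
        for (size_t p = 0; p < 3; p++)
            if (strlen(patterns[p]) == len &&
                strncmp(patterns[p], input, len) == 0)
                last_accept = len;       /* record accepting position */
    return last_accept;
}
```

On "ABC" all three patterns accept along the way, so the recorded position keeps advancing and the whole input becomes one token, which is exactly the behavior observed above.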
A set of lexical rules which require excessive back-tracking will slow down the lexical scan. This is discussed in the Flex manual section Performance Considerations, along with some strategies to avoid the issue. However, except in pathological cases, the overhead from back-tracking is not noticeable.
I want to match something like:
var i=1;
So I want to know whether var starts at a word boundary.
When it matches this line I want to know the last character of the previous yytext,
just to be sure that the character before var is really a non-identifier character (i.e., "\b" in regex).
One crude way is to maintain old_yytext in each rule and also to have a default rule ".".
How can I get it?
The only way is to save a copy of the previous token, or at least the last character. Flex's buffer management strategy does not guarantee that the previous token still exists in memory. It is possible that the current token starts at the beginning of flex's buffer.
But doing the work of saving the previous token in every rule would be really silly. You should trust flex to work as advertised, and write appropriate rules. For example, if your identifier pattern looks like this:
[[:alpha:]][[:alnum:]]*
then it is impossible for var to immediately follow an identifier, because it would have been included in the identifier.
There is one common case in a "normal" flex scanner definition where a keyword or identifier might immediately follow an alphanumeric character, which is when the keyword immediately follows a number (123var). This is not usually a problem, because in almost all languages, it will trigger a syntax error (and if it isn't a syntax error, maybe it is ok :-) )
If you really want to trigger a lexical error, you can add a pattern which recognizes a number followed by a letter.
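A sketch of such a rule (the action shown is just a placeholder; flex's longest-match rule means it wins over a plain number rule whenever a letter actually follows the digits):

```lex
[[:digit:]]+[[:alpha:]]   { fprintf(stderr, "lexical error: letter after number: %s\n", yytext); }
```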
A lot of programming languages have statements terminated by line-endings. Usually, though, line endings are allowed in the middle of a statement if the parser can't make sense of the line; for example,
a = 3 +
4
...will be parsed in Ruby and Python* as the statement a = 3 + 4, since a = 3 + doesn't make any sense on its own. In other words, the newline is ignored because treating it as a statement end would lead to a parsing error.
My question is: how can I simply/elegantly accomplish that same behavior with a tokenizer and parser? I'm using Lemon as a parser generator, if it makes any difference (though I'm also tagging this question as yacc since I'm sure the solution applies equally to both programs).
Here's how I'm doing it now: allow a statement terminator to occur optionally in any case where there wouldn't be syntactic ambiguity. In other words, something like
expression ::= identifier PLUS identifier statement_terminator.
expression ::= identifier PLUS statement_terminator identifier statement_terminator.
... in other words, it's ok to use a newline after the plus because that won't have any effect on the ambiguity of the grammar. My worry is that this would balloon the size of the grammar, and that I would have a lot of opportunities to miss cases or introduce subtle bugs. Is there an easier way to do this?
EDIT*: Actually, that code example won't work for Python. Python does in fact ignore the newline if you pass in something like this, though:
print (1, 2,
3)
You could probably make a parser generator get this right, but it would probably require modifying the parser generator's skeleton.
There are three plausible algorithms I know of; none is perfect.
Insert an explicit statement terminator at the end of the line if:
a. the previous token wasn't a statement terminator, and
b. it would be possible to shift the statement terminator.
Insert an explicit statement terminator prior to an unshiftable token (the "offending token", in Ecmascript speak) if:
a. the offending token is at the beginning of a line, is a }, or is the end-of-input token, and
b. shifting a statement terminator will not cause a reduction by the empty-statement production. [1]
Make an inventory of all token pairs. For every token pair, decide whether it is appropriate to replace a line-end with a statement terminator. You might be able to compute this table by using one of the above algorithms.
Algorithm 3 is the easiest to implement, but the hardest to work out. And you may need to adjust the table every time you modify the grammar, which will considerably increase the difficulty of modifying the grammar. If you can compute the table of token pairs, then inserting statement terminators can be handled by the lexer. (If your grammar is an operator precedence grammar, then you can insert a statement terminator between any pair of tokens which do not have a precedence relationship. However, even then you may wish to make some adjustments for restricted contexts.)
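A sketch of what algorithm 3's table might look like (the token set and the entries here are hypothetical; a real table would have to cover every token pair in the grammar):

```c
enum tok { T_IDENT, T_PLUS, T_NUMBER, T_LPAREN };

/* Pair table: entry [prev][next] is 1 if a line end falling between the
 * two tokens should be turned into a statement terminator.  For example,
 * IDENT <newline> IDENT ends a statement, but PLUS <newline> NUMBER does
 * not (the expression is clearly unfinished).  IDENT <newline> LPAREN is
 * left as 0 here, reproducing the Ecmascript call-across-lines behavior
 * discussed below. */
static const int ends_statement[4][4] = {
    /*            IDENT PLUS NUMBER LPAREN */
    /* IDENT  */ { 1,    0,   1,     0 },
    /* PLUS   */ { 0,    0,   0,     0 },
    /* NUMBER */ { 1,    0,   1,     0 },
    /* LPAREN */ { 0,    0,   0,     0 },
};

/* Called by the lexer when it sees a newline between prev and next. */
int newline_is_terminator(enum tok prev, enum tok next) {
    return ends_statement[prev][next];
}
```

The lexer consults this before emitting the newline, so the parser itself never has to know about line endings at all.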
Algorithms 1 and 2 can be implemented in the parser if you can query the parser about the shiftability of a token without destroying the context. Recent versions of bison allow you to specify what they call "LAC" (LookAhead Correction), which involves doing just that. Conceptually, the parser stack is copied and the parser attempts to handle a token; if the token is eventually shifted, possibly after some number of reductions, without triggering an error production, then the token is part of the valid lookahead. I haven't looked at the implementation, but it's clear that it's not actually necessary to copy the stack to compute shiftability. Regardless, you'd have to reverse-engineer the facility into Lemon if you wanted to use it, which would be an interesting exercise, probably not too difficult. (You'd also need to modify the bison skeleton to do this, but it might be easier starting with the LAC implementation. LAC is currently only used by bison to generate better error messages, but it does involve testing shiftability of every token.)
One thing to watch out for, in all of the above algorithms, is statements which may start with parenthesized expressions. Ecmascript, in particular, gets this wrong (IMHO). The Ecmascript example, straight out of the report:
a = b + c
(d + e).print()
Ecmascript will parse this as a single statement, because c(d + e) is a syntactically valid function call. Consequently, ( is not an offending token, because it can be shifted. It's pretty unlikely that the programmer intended that, though, and no error will be produced until the code is executed, if it is executed.
Note that Algorithm 1 would have inserted a statement terminator at the end of the first line, but similarly would not flag the ambiguity. That's more likely to be what the programmer intended, but the unflagged ambiguity is still annoying.
Lua 5.1 would treat the above example as an error, because it does not allow new lines in between the function object and the ( in a call expression. However, Lua 5.2 behaves like Ecmascript.
Another classical ambiguity is return (and possibly other statements) which have an optional expression. In Ecmascript, return <expr> is a restricted production; a newline is not permitted between the keyword and the expression, so a return at the end of a line has a semicolon automatically inserted. In Lua, it's not ambiguous because a return statement cannot be followed by another statement.
Notes:
Ecmascript also requires that the statement terminator token be parsed as a statement terminator, although it doesn't quite say that; it does not allow the semicolons in the iterator clause of a for statement to be inserted automatically. Its algorithm also includes mandatory semicolon insertion in two contexts: after a return/throw/continue/break token which appears at the end of a line, and before a ++/-- token which appears at the beginning of a line.
I'm using pipes to communicate between two Prolog processes, and every time I reached a read/2 predicate to read a message from my pipe, the program blocked and stayed that way. I couldn't understand why (I tried with extremely simple programs), and in the end I realized three things:
Every time I use write/2 to send a message, the sender process must end that message with .\n. If the message does not end like this, the receiver process will get stuck at the read/2 predicate.
If the sender does not flush the output, the message therefore is not left in the pipe buffer. It may seem obvious but it wasn't for me at the beginning.
Although read/2 blocks when the message is not flushed, wait_for_input/3 does not block at all, so there is no need for flush_output/1 in that case.
Examples:
This does not work:
example1 :-
pipe(R,W),
write(W,hello),
read(R,S). % The program is blocked here.
That one won't work either:
example2 :-
pipe(R,W),
write(W,'hello.\n'),
read(R,S). % The program is blocked here.
While these two do work:
example3 :-
pipe(R,W),
write(W,'hello.\n'),
flush_output(W),
read(R,S).
example4 :-
pipe(R,W),
write(W,'hello.\n'),
wait_for_input([W],L,infinite).
Now my question is why? Is there a reason why Prolog only "accepts" full lines ended with a period when reading from a pipe (actually reading from any stream you may want to read)? And why does read block while wait_for_input/3 doesn't (assuming the message is not flushed)?
Thanks!
A valid Prolog read-term always ends with a period, called end char (* 6.4.8 *). And in 6.4.8 Other tokens, the standard reads:
An end char shall be followed by a layout character or a %.
So this is what the standard demands.
A newline after the period is one possibility to end a read-term, besides space, tab and other layout characters as well as %. However, due to the prevalence of ttys and related buffering, it seems a good convention to just stick with a newline.
The reason why the end char is needed is that Prolog syntax permits infix and postfix operators. Consider as input
f(1) + g(2).
when reading f(1) you might believe that this is already the entire term, but you still must wait for the period to be sure that there is no infix or postfix operator thereafter.
Also note that you must use writeq/1 or write_canonical/1 to produce output that can be read back. You cannot use write/1.
As an example, consider write([(.)+ .]). First, this is valid syntax: the dots are immediately followed by some other character. Note that the . at the end is commonly called a period, whereas within Prolog text it is called a dot.
write/1 will write this as [. + .]. Note that the first . is now followed by a space, so when this text is read back, only [. will be read.
There are many other ugly examples such as this one, usually they do not hit you. But once you are hit, you are hit...