I've seen a few language implementations that lex one token, parse it, and then request the next token from the lexer whenever the parser needs to look at it.
So for if (x == 3) you lex a token and check what it is (here, an if), lex again and make sure it's a (, parse an expression (which keeps requesting tokens, up to the 3, until the expression is complete), and then lex once more and expect a closing parenthesis.
The alternative is to lex the whole input stream first (keyword, symbol, identifier, equality, number, symbol) and then hand that token list to the parser, which turns it into an AST.
What are the pros/cons of these two techniques?
For most grammars, it doesn't really matter whether you lex the entire input into a token list as a first pass and then take tokens from the list during a parse pass, or lex on demand. Lexing on demand avoids the need for an in-memory token list; lexing up front means you can parse the same tokens several times a bit faster, which you might want to do in an interpreter.
However, if the grammar requires more than one token of lookahead, or cannot be parsed strictly left to right, then you might need to lex further ahead than the token you are currently parsing. Whilst natural languages have some odd parse rules ("time flies like an arrow, fruit flies like bananas"), computer languages are usually designed to be parseable with a simple recursive descent parser with one token of lookahead.
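To make the comparison concrete, here is a minimal Python sketch (my own illustration, not taken from any particular implementation). The names lex, Parser and parse_if_header are hypothetical. The lexer is a generator, so it only scans when the parser asks for a token; wrapping it in list() gives the up-front variant, and the parser does not care which one it receives.

```python
import re
from typing import Iterable, Iterator, Tuple

Token = Tuple[str, str]  # (kind, text)

TOKEN_RE = re.compile(
    r"\s*(?:(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>==|[()+\-*/=]))"
)
KEYWORDS = {"if"}


def lex(source: str) -> Iterator[Token]:
    """Yield tokens one at a time; nothing is scanned until the consumer asks."""
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        kind, text = m.lastgroup, m.group(m.lastgroup)
        if kind == "NAME" and text in KEYWORDS:
            kind = "KEYWORD"
        yield (kind, text)
        pos = m.end()
    yield ("EOF", "")


class Parser:
    """Recursive-descent fragment with one token of lookahead (self.current)."""

    def __init__(self, tokens: Iterable[Token]):
        self._tokens = iter(tokens)  # works for a generator or a list
        self.current = next(self._tokens)

    def advance(self) -> Token:
        tok = self.current
        self.current = next(self._tokens, ("EOF", ""))
        return tok

    def expect(self, kind: str, text: str = None) -> Token:
        tok = self.advance()
        if tok[0] != kind or (text is not None and tok[1] != text):
            raise SyntaxError(f"expected {kind} {text or ''}, got {tok}")
        return tok

    def parse_if_header(self):
        # Hypothetical toy rule: 'if' '(' NAME '==' NUMBER ')'
        self.expect("KEYWORD", "if")
        self.expect("OP", "(")
        left = self.expect("NAME")[1]
        op = self.expect("OP", "==")[1]
        right = self.expect("NUMBER")[1]
        self.expect("OP", ")")
        return ("if", (op, left, right))


source = "if (x == 3)"
on_demand = Parser(lex(source)).parse_if_header()       # lexes as it parses
up_front = Parser(list(lex(source))).parse_if_header()  # lexes everything first
```

Both calls build the same tree; the only difference is whether the full token list ever exists in memory.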
Most interpreters let you type the following at their console:
>> a = 2
>> a+3
5
>>
My question is: what mechanisms are usually used to handle this syntax? Somehow the parser is able to distinguish between an assignment and an expression even though both could start with a digit or letter. It's only when you retrieve the second token that you know whether you have an assignment or not. In the past, I've looked ahead two tokens, and if the second token isn't an equals sign I push the tokens back into the lexical stream and assume it's an expression. I suppose one could treat the assignment as an expression, which I think some languages do. I thought of using left-factoring but I can't see it working.
e.g.
assignment = variable A
A = '=' expression | empty
Update: I found this question on Stack Overflow, which addresses the same problem: How to modify parsing grammar to allow assignment and non-assignment statements?
From how you're describing your approach - doing a few tokens of lookahead to decide how to handle things - it sounds like you're trying to write some sort of top-down parser along the lines of an LL(1) or an LL(2) parser, and you're trying to immediately decide whether the expression you're parsing is a variable assignment or an arithmetical expression. There are several ways that you could parse expressions like these quite naturally, and they essentially involve weakening one of those two assumptions.
The first way we could do this would be to switch from using a top-down parser like an LL(1) or LL(2) parser to something else like an LR(0) or SLR(1) parser. Those parsers work bottom-up by reading larger prefixes of the input string before deciding what they're looking at. In your case, a bottom-up parser might work by seeing the variable and thinking "okay, I'm either going to be reading an expression to print or an assignment statement, but with what I've seen so far I can't commit to either," then scanning more tokens to see what comes next. If it sees an equals sign, great! It's an assignment statement. If it sees something else, great! It's not. The nice part about this is that if you're using a standard bottom-up parsing algorithm like LR(0), SLR(1), LALR(1), or LR(1), you should probably find that the parser generally handles these sorts of issues quite well and no special-casing logic is necessary.
The other option would be to parse the entire expression assuming that = is a legitimate binary operator like any other operation, and then check afterwards whether what you parsed is a legal assignment statement or not. For example, if you use Dijkstra's shunting-yard algorithm to do the parsing, you can recover a parse tree for the overall expression, regardless of whether it's an arithmetical expression or an assignment. You could then walk the parse tree to ask questions like
if the top-level operation is an assignment, is the left-hand side a single variable?
if the top-level operation isn't an assignment, are there nested assignment statements buried in here that we need to get rid of?
In other words, you'd parse a broader class of statements than just the ones that are legal, and then do a postprocessing step to toss out anything that isn't valid.
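Here is a rough Python sketch of that second option. To keep it short I've used precedence climbing rather than shunting-yard (the resulting tree is the same kind of thing), and the names tokenize, parse and classify are made up for this example. It treats = as the loosest-binding, right-associative binary operator, builds a tree, and only then decides whether the tree is a legal assignment or a legal expression. Rejecting assignments nested inside the right-hand side is my own assumption; a language with chained assignment would relax that check.

```python
import re

# Binding power and associativity; '=' binds loosest and is right-associative.
BINARY = {
    "=": (1, "right"),
    "+": (2, "left"), "-": (2, "left"),
    "*": (3, "left"), "/": (3, "left"),
}


def tokenize(text):
    # Simplification: characters that match nothing here are silently skipped.
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/=()]", text)


def parse(tokens):
    """Precedence-climbing parser; '=' is parsed like any other binary operator."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def primary():
        if peek() is None:
            raise SyntaxError("unexpected end of input")
        tok = take()
        if tok == "(":
            node = expression(1)
            if peek() != ")":
                raise SyntaxError("expected ')'")
            take()
            return node
        if tok.isdigit():
            return ("num", int(tok))
        if tok[0].isalpha() or tok[0] == "_":
            return ("var", tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    def expression(min_prec):
        left = primary()
        while peek() in BINARY and BINARY[peek()][0] >= min_prec:
            op = take()
            prec, assoc = BINARY[op]
            right = expression(prec + 1 if assoc == "left" else prec)
            left = (op, left, right)
        return left

    tree = expression(1)
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


def contains_assignment(node):
    return node[0] == "=" or (
        node[0] in BINARY and any(contains_assignment(c) for c in node[1:])
    )


def classify(tree):
    """Post-processing step: throw out trees that parsed but are not legal."""
    if tree[0] == "=":
        _, target, value = tree
        if target[0] != "var":
            raise SyntaxError("left-hand side must be a single variable")
        if contains_assignment(value):  # assumption: no chained assignment
            raise SyntaxError("nested assignment is not allowed")
        return "assignment"
    if contains_assignment(tree):
        raise SyntaxError("assignment buried inside an expression")
    return "expression"
```

With this, classify(parse(tokenize("a = 2"))) reports an assignment and classify(parse(tokenize("a+3"))) reports an expression, while inputs such as a + 1 = 2 parse fine but are rejected by the post-processing step.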
I am moving from a separate lexer and parser to a combined parser that operates on an array of characters. One problem is how to properly handle whitespace.
Problem
Take a language consisting of a sequence of characters 'a' and 'b'. Whitespace is allowed in the input but does not affect the meaning of the program.
My current approach to parse such a language is:
var token = function(p) {
    return attempt(next(
        parse.many(whitespace),
        p));
};
var a = token(char('a'));
var b = token(char('b'));
var prog = many(either(a, b));
This works, but requires unnecessary backtracking. For a program such as '___ba_alb' (Whitespace was getting stripped in the post so let _ be a space), in matching 'b', the whitespace is parsed twice, first for a and when a fails, again for b. Simply removing attempt does not work as either will never reach b if any whitespace is consumed.
Attempts
My first thought was to move the token call outside of the either:
var a = char('a');
var b = char('b');
var prog = many(token(either(a, b)));
This works, but now prog cannot be reused easily. In building a parser library, this seems to require defining parsers twice: one version that actually consumes 'a' or 'b' and can be used in other parsers, and one version that correctly handles whitespace. It also clutters up parser definitions by requiring them to have explicit knowledge of how each parser operates and how it handles whitespace.
Question
The intended behavior is that an arbitrary amount of whitespace can be consumed before a token. If parsing the token fails, it backtracks to the start of the token instead of the start of the whitespace.
How can this be expressed without preprocessing the input to produce a token stream? Are there any good, real world code examples of this? The closest I found was Higher-Order Functions for Parsing but this does not seem to address my specific concern.
I solved this problem in a JSON parser that I built. Here are the key parts of my solution, in which I followed the 'write the parsers twice' approach somewhat:
1. Define the basic parsers for each token: number, string, etc.
2. Define a token combinator that takes in a basic token parser and outputs a whitespace-munching parser. The munching should occur after the token so that whitespace is only parsed once, as you noted:
function token(parser) {
    // run the parser, then munch whitespace
}
3. Use the token combinator to produce the munching token parsers.
4. Use the parsers from 3. to build the parsers for the rest of the language.
I didn't have a problem with having two similar versions of the parsers, since each version was in fact distinct. It was slightly annoying, but a tiny cost. You can check out the full code here.
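For illustration, here is a small self-contained Python sketch of the same idea. It is not the code from my JSON parser, and it does not use the question's library; char, token, either, many and program are hypothetical hand-rolled combinators where a parser is a function (text, pos) -> (value, new_pos) or None. Because token munches whitespace after the match, a failed alternative never consumes anything, so either needs no backtracking wrapper and whitespace is scanned once.

```python
def char(c):
    """Basic parser: match exactly one character, no whitespace handling."""
    def parse(text, pos):
        if pos < len(text) and text[pos] == c:
            return c, pos + 1
        return None
    return parse


def skip_whitespace(text, pos):
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def token(parser):
    """Run the parser, then munch trailing whitespace."""
    def parse(text, pos):
        result = parser(text, pos)
        if result is None:
            return None  # nothing was consumed, so there is nothing to undo
        value, end = result
        return value, skip_whitespace(text, end)
    return parse


def either(*parsers):
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse


def many(parser):
    """Zero or more repetitions; assumes parser consumes input when it succeeds."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None:
                return values, pos
            value, pos = result
            values.append(value)
    return parse


def program(parser):
    """Skip leading whitespace once, then require the whole input to be consumed."""
    def run(text):
        result = parser(text, skip_whitespace(text, 0))
        if result is None or result[1] != len(text):
            raise SyntaxError("could not parse the whole input")
        return result[0]
    return run


a = token(char("a"))
b = token(char("b"))
prog = program(many(either(a, b)))
```

The only place leading whitespace has to be handled is once, at the very start of the input, which is what program does here. The raw char parsers stay reusable, and the token versions are the ones the rest of the grammar is built from.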
I am trying to determine whether suggested changes to the EcmaScript grammar introduce ambiguities.
The grammar is odd in a few ways
There is no regular or context-free lexical grammar, meaning there is no way to break the input into a series of tokens that can be fed to a tree builder, though at any given parser state there is a context-free grammar that can be used to fetch the next token.
Some tokens are implicit. Specifically, semicolons are inserted in some places when not present in the source text. This only requires one non-ignorable token of lookahead, but since ignorable tokens can be of arbitrary length, no fixed finite amount of lookahead is sufficient.
There is no translation simpler than a full parse that allows removal or collapsing of ignorable tokens.
Line terminator tokens (and multiline comments that are equivalent to line terminators) are ignorable in most contexts but are significant in some.
I know that proving no ambiguity is not doable in general, but I'd like to be able to achieve a simpler goal:
A test that is true if and only if there is no string such that two different paths through the candidate grammar might produce two different trees, where each path involves breaking the string into fewer than k tokens.
I would be very happy if I could prove such a thing for a candidate grammar to k of 50.
Is there any literature on detecting ambiguity within such limits?
I understand the theory behind separating parser rules and lexer rules, but what are the practical differences between these two statements in ANTLR:
my_rule: ... ;
MY_RULE: ... ;
Do they result in different AST trees? Different performance? Potential ambiguities?
... what are the practical differences between these two statements in ANTLR ...
MY_RULE will be used to tokenize your input source. It represents a fundamental building block of your language.
my_rule is called from the parser, it consists of zero or more other parser rules or tokens produced by the lexer.
That's the difference.
Do they result in different AST trees? Different performance? ...
The parser builds the AST using tokens produced by the lexer, so the questions make no sense (to me). A lexer merely "feeds" the parser a 1 dimensional stream of tokens.
This post may be helpful:
The lexer is responsible for the first step, and its only job is to
create a "token stream" from text. It is not responsible for
understanding the semantics of your language, it is only interested in
understanding the syntax of your language.
For example, syntax is the rule that an identifier must only use
characters, numbers and underscores - as long as it doesn't start with
a number. The responsibility of the lexer is to understand this rule.
In this case, the lexer would accept the sequence of characters
"asd_123" but reject the characters "12dsadsa" (assuming that there
isn't another rule in which this text is valid). When seeing the valid
text example, it may emit a token into the token stream such as
IDENTIFIER(asd_123).
Note that I said "identifier" which is the general term for things
like variable names, function names, namespace names, etc. The parser
would be the thing that would understand the context in which that
identifier appears, so that it would then further specify that token
as being a certain thing's name.
(sidenote: the token is just a unique name given to an element of the
token stream. The lexeme is the text that the token was matched from.
I write the lexeme in parentheses next to the token. For example,
NUMBER(123). In this case, this is a NUMBER token with a lexeme of
'123'. However, with some tokens, such as operators, I omit the lexeme
since it's redundant. For example, I would write SEMICOLON for the
semicolon token, not SEMICOLON( ; )).
From ANTLR - When to use Parser Rules vs Lexer Rules?
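As a rough analogy only (this is plain Python, not ANTLR's API or generated code), the split can be sketched like this. TOKEN_SPEC, tokenize and parse_statement are names I made up: the regex table plays the role of the upper-case lexer rules, and the function that matches on token kinds plays the role of a lower-case parser rule.

```python
import re

# Lexer-rule analogue: each entry turns raw characters into one kind of token.
TOKEN_SPEC = [
    ("NUMBER", r"\d+\b"),                    # digits not followed by a letter
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),                        # whitespace is dropped
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Return a flat list of (token, lexeme) pairs, or fail on bad characters."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"cannot tokenize input at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse_statement(tokens):
    """Parser-rule analogue: statement : (IDENTIFIER | NUMBER) SEMICOLON ;

    It never looks at characters, only at the kinds of tokens it was fed.
    """
    kinds = [kind for kind, _ in tokens]
    if len(kinds) == 2 and kinds[0] in ("IDENTIFIER", "NUMBER") and kinds[1] == "SEMICOLON":
        return ("statement", tokens[0])
    raise SyntaxError(f"unexpected token sequence: {kinds}")
```

Here tokenize("asd_123;") gives IDENTIFIER(asd_123) followed by SEMICOLON, while tokenize("12dsadsa") fails in the lexer because no token rule accepts that text. The parser rule can only fail on the order of tokens, never on their spelling.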
I have implemented combinatorial GLR parsers. Among them there are:
char(·) parser which consumes specified character or range of characters.
many(·) combinator which repeats specified parser from zero to infinite times.
Example: "char('a').many()" will match a string with any number of "a"-s.
But many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.
I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?
Edit:
Maybe I confused everyone. In my program a parser is a function (or "functor" in C++ terms) which accepts a "step" and returns a forest of "steps". A "step" may be of OK type (meaning the parser has consumed part of the input successfully) or FAIL type (meaning the parser has encountered an error). There are more types of steps but they are auxiliary.
Parser = f(Step) -> Collection of TreeNodes of Steps.
So when I parse input, I:
Compose simple predefined Parser functions to get complex Parser function representing required grammar.
Form initial Step from the input.
Give the initial Step to the complex Parser function.
Filter TreeNodes with Steps, leaving only OK ones (or with minimum FAIL-s if there were errors in input).
Gather information from Steps which were left.
I have implemented and have been using GLR parsers for 15 years as language front ends for a program transformation system.
I don't know what a "combinatorial GLR parser" is, and I'm unfamiliar with your notation, so I'm not quite sure how to interpret it. I assume this is some kind of curried function notation? I'm imagining your combinator rules are equivalent to defining a grammar in terms of terminal characters, where "char('a').many" corresponds to grammar rules:
char = "a" ;
char = char "a" ;
GLR parsers, indeed, produce all possible parses. The key insight to GLR parsing is its pseudo-parallel processing of all possible parses. If your "combinators" can propose multiple parses (that is, they produce grammar rules sort of equivalent to the above), and you indeed have them connected to a GLR parser, they will all get tried, and only those sequences of productions that tile the text will survive (meaning all valid parses, including ambiguous ones, will survive).
If you have indeed implemented a GLR parser, this collection of all possible parses should have been extremely clear to you. The fact that it is not hints that what you have implemented is not a GLR parser.
Error recovery with a GLR parser is possible, just as with any other parsing technology. What we do is keep the set of live parses before the point of the error; when an error is found, we try (in pseudo-parallel; the GLR parsing machinery makes this easy if it is bent properly) all of the following: a) deleting the offending token, b) inserting all tokens that essentially are in FOLLOW(x), where x is a live parse. In essence, delete the token, or insert one expected by a live parse. We then turn the GLR parser loose again. Only the valid parses (e.g., repairs) will survive. If the current token cannot be processed, the parser processing the stream with the token deleted survives. In the worst case, the GLR parser error recovery ends up throwing away all tokens to EOF. A serious downside to this is that the GLR parser's running time grows pretty radically while parsing errors; if there are many in one place, the error recovery time can go through the roof.
Won't a GLR parser produce all possible parses of the input? Then resolving the ambiguity is a matter of picking the parse you prefer. To do that, I suppose the elements of the parse forest need to be labeled according to what kind of combinator produced them, eager or lazy. (You can't resolve the ambiguity incrementally before you've seen all the input, in general.)
(This answer is based on my dim memory and possibly vague misunderstanding of GLR parsing. Hopefully someone expert will come by.)
Consider the regular expression <.*?> and the input <a>bc<d>ef. This should find <a>, and no other matches, right?
Now consider the regular expression <.*?>e with the same input. This should find <a>bc<d>e, right?
This poses a dilemma. For the user's sake, we want the behavior of the combinator >> to be understood in terms of its two operands. Yet there is no way to produce the second parser's behavior in terms of what the first one finds.
One answer is for each parser to produce a sequence of all parses, ordered by preference, rather than the unordered set of all parses. Greedy matching would return matches sorted longest to shortest; non-greedy, shortest to longest.
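A minimal Python sketch of that ordering idea, under my own assumptions (this is not your GLR machinery, and char, any_char, many_lazy, many_greedy, seq and first_match are hypothetical names): each parser is a generator that yields every end position it can reach, most preferred first, and sequencing simply tries the first parser's results in that order.

```python
def char(c):
    def parse(text, pos):
        if pos < len(text) and text[pos] == c:
            yield pos + 1
    return parse


def any_char(text, pos):
    if pos < len(text):
        yield pos + 1


def many_lazy(p):
    """Zero or more repetitions, shortest match first."""
    def parse(text, pos):
        yield pos
        for mid in p(text, pos):
            if mid > pos:  # ignore non-consuming matches to avoid looping
                yield from parse(text, mid)
    return parse


def many_greedy(p):
    """Zero or more repetitions, longest match first."""
    def parse(text, pos):
        for mid in p(text, pos):
            if mid > pos:
                yield from parse(text, mid)
        yield pos
    return parse


def seq(*parsers):
    """Sequential chaining: results keep the preference order of each operand."""
    def parse(text, pos):
        if not parsers:
            yield pos
            return
        first, rest = parsers[0], seq(*parsers[1:])
        for mid in first(text, pos):
            yield from rest(text, mid)
    return parse


def first_match(parser, text):
    """Take the most preferred parse starting at position 0, if any."""
    end = next(parser(text, 0), None)
    return None if end is None else text[:end]


text = "<a>bc<d>ef"
lazy = seq(char("<"), many_lazy(any_char), char(">"))
lazy_then_e = seq(char("<"), many_lazy(any_char), char(">"), char("e"))
greedy = seq(char("<"), many_greedy(any_char), char(">"))
```

On the input above, first_match(lazy, text) gives "<a>", first_match(lazy_then_e, text) gives "<a>bc<d>e", and first_match(greedy, text) gives "<a>bc<d>". The lazy parser's shorter candidates are simply rejected by whatever follows it, which is how the behavior of >> stays explainable in terms of its two operands.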
Non-greedy functionality is nothing more than a disambiguation mechanism. If you truly have a generalized parser (which does not require disambiguation to produce its results), then "non-greedy" is meaningless; the same results will be returned whether or not an operator is "non-greedy".
Non-greedy disambiguation behavior could be applied to the complete set of results provided by a generalized parser. Working left-to-right, filter the ambiguous sub-groups corresponding to a non-greedy operator to use the shortest match which still led to a successful parse of the remaining input.