Difference in token types - parsing

What is the terminology used to distinguish between the accepting token and the tokens that are matched in the stream? As an example, here is what I mean:
Accepting Token
----------------
OPEN_PAREN: '(';
CLOSE_PAREN: ')';
PLUS: '+';
NUMBER: \d+;
Parsed Tokens
-------------------
# (2+(3))
[<OPEN_PAREN: '('>, <NUMBER: '2'>, <PLUS: '+'>,
<OPEN_PAREN: '('>, <NUMBER: '3'>, <CLOSE_PAREN: ')'>, <CLOSE_PAREN: ')'>]
How are these two different items categorized? Currently I'm calling one TOKENS and the other tokens, which is most confusing.

You ask for "the terminology", but I don't think there's a generally agreed answer to this question. (As such, the question may be inappropriate for Stack Overflow.)
Personally, I'd say that (e.g.)
NUMBER is a lexical symbol,
NUMBER: \d+; is a lexical production or lexical rule, and
<NUMBER: '2'> at line 1, column 2 is a token, or a lexical node, or an instance of the lexical production/rule.
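If it helps to see the distinction in code, here is a minimal sketch of how a hand-written lexer might model the two. The names TokenType and Token are my own choices, not established terminology:
from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
    # The lexical symbols: one member per lexical rule in the grammar.
    OPEN_PAREN = auto()
    CLOSE_PAREN = auto()
    PLUS = auto()
    NUMBER = auto()

@dataclass
class Token:
    # A token: one concrete occurrence of a lexical rule in the input.
    type: TokenType
    text: str
    line: int
    column: int

# <NUMBER: '2'> at line 1, column 2 becomes:
t = Token(TokenType.NUMBER, "2", 1, 2)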


ANTLR4: implicit or explicit token definition

What are the benefits and drawbacks of using explicit token definitions in ANTLR4? I find the text in single quotes more descriptive and easier to use than creating a separate token and using that in place of the text.
E.g.:
grammar SimpleTest;
top: library | module ;
library: 'library' library_name ';' ;
library_name: IDENTIFIER;
module: MODULE module_name ';' ;
module_name: IDENTIFIER;
MODULE: 'module' ;
IDENTIFIER: [a-zA-Z0-9]+;
The generated tokens are:
T__0=1
T__1=2
MODULE=3
IDENTIFIER=4
'library'=1
';'=2
'module'=3
If I am not interested in the 'library' "token", since the rule already establishes what I am matching against, and I will just skip over it anyway, does it make any sense to replace it with LIBRARY and a token declaration? (The number of tokens then will grow.) Why is this a warning in ANTLRWorks?
Actually, there is a difference between implicit and explicit tokens:
From "The Definitive ANTLR4 Reference", page 76:
ANTLR collects and separates all of the string literals and lexer
rules from the parser rules. Literals such as 'enum' become lexical
rules and go immediately after the parser rules but before the
explicit lexical rules.
ANTLR lexers resolve ambiguities between
lexical rules by favoring the rule specified first.
Emphasis mine.
Antlr (like most compiler-compilers) uses the concept of a separate lexer and parser, mostly for performance reasons. In this model, the lexer is responsible for reading the actual characters in the input string and returning a list of the tokens found, in a more concise representation, such as an enum or int code for each token. The parser then works on these tokens instead of the original input, for ease of implementation and for performance.
There are two ways to "declare" the usage of a token in Antlr: one is explicit and has a regular-expression-like pattern; the other is implicit and is always a fixed string.
ExplicitRegExp: [A-Z][a-z]+; // lexer rule starts with uppercase letter
ExplicitFixed: 'fixed';
parserRule: 'implicit' ExplicitRegExp; // parser rules start with a lowercase letter
When you declare a token explicitly, it is assigned an int code to be used in the parsing state machine. Let's say ExplicitRegExp becomes 1 and ExplicitFixed becomes 2. But the parser also needs the implicit tokens to be able to parse the grammar correctly, so 'implicit' is assigned the code 3 implicitly.
How is that bad? You may have typos in different parts of the grammar:
a : 'implicit' c;
b : 'implcit' d; // typo here
And your grammar will not work as expected, because implcit will be a valid token, assigned the int code 4. It also makes your grammar/lexer harder to debug, because Antlr auto-generates names for the implicit rules, like T__0. Another thing is that you lose the ordering of the lexer rules, which could make a difference (usually it does not, because implicit tokens are all fixed content).
The Antlr compiler could choose to give you an error message and require you to declare your tokens explicitly, but it chooses to let it go and just warn you that you should not do that, probably for prototyping/testing reasons.
To keep Antlr happy, do it the verbose way and declare all of your tokens:
grammar SimpleTest;
top: library | module ;
library: 'library' library_name=IDENTIFIER ';' ; // I'm using aliasing instead of different parser rule here, just a preference
module: 'module' module_name=IDENTIFIER ';' ;
MODULE: 'module' ;
LIBRARY: 'library' ;
IDENTIFIER: [a-zA-Z0-9]+;
Then it makes no difference whether you reference a fixed token by its explicit name (like MODULE) or by its content (like 'module').

Responsibilities of the Lexer and the Parser

I'm currently implementing a lexer for a simple programming language. So far, I can tokenize identifiers, assignment symbols, and integer literals correctly; in general, whitespace is insignificant.
For the input foo = 42, three tokens are recognized:
foo (identifier)
= (symbol)
42 (integer literal)
So far, so good. However, consider the input foo = 42bar, which is invalid due to the (significant) missing space between 42 and bar. My lexer incorrectly recognizes the following tokens:
foo (identifier)
= (symbol)
42 (integer literal)
bar (identifier)
Once the lexer sees the digit 4, it keeps reading until it encounters a non-digit. It therefore consumes the 2 and stores 42 as an integer literal token. Because whitespace is insignificant, the lexer discards any whitespace (if there is any) and starts reading the next token: It finds the identifier bar.
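That longest-match behaviour ("maximal munch") is easy to reproduce with a few lines of regex-based lexing. The sketch below is illustrative only; the token names and patterns are my guesses, not the asker's actual definitions:
import re

# Maximal munch: each pattern is greedy, so '42bar' splits into '42' + 'bar'.
TOKEN_SPEC = [
    ("INT",    r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("SYMBOL", r"="),
    ("WS",     r"\s+"),  # insignificant whitespace, discarded below
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    for m in MASTER.finditer(text):
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group())

print(list(tokenize("foo = 42bar")))
# [('IDENT', 'foo'), ('SYMBOL', '='), ('INT', '42'), ('IDENT', 'bar')]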
Now, here's my question: Is it still the lexer's responsibility to recognize that an identifier is not allowed at that position? Or does that check belong to the responsibilities of the parser?
I don't think there is any consensus on the question of whether 42foo should be recognised as an invalid number or as two tokens. It's a question of style, and both usages are common in well-known languages.
For example:
$ python -c 'print 42and False'
False
$ lua -e 'print(42and false)'
lua: (command line):1: malformed number near '42a'
$ perl -le 'print 42and 0'
42
# Not an idiosyncrasy of tcc; it's defined by the standard
$ tcc -D"and=&&" -run - <<<"main(){return 42and 0;}"
stdin:1: error: invalid number
# gcc has better error messages
$ gcc -D"and=&&" -x c - <<<"main(){return 42and 0;}" && ./a.out
<stdin>: In function ‘main’:
<stdin>:1:15: error: invalid suffix "and" on integer constant
<stdin>:1:21: error: expected ‘;’ before numeric constant
$ ruby -le 'print 42and 1'
42
# And now for something completely different (explained below)
$ awk 'BEGIN{print 42foo + 3}'
423
So, both possibilities are in common use.
If you're going to reject it because you think a number and a word should be separated by whitespace, you should reject it in the lexer. The parser cannot (or should not) know whether whitespace separates two tokens. Independent of the validity of 42and, the fragments 42 + 1, 42+1, and 42+ 1 should all be parsed identically. (Except, perhaps, in Fortress. But that was an anomaly.) If you don't mind shoving numbers and words together, then let the parser reject it if (and only if) it is a syntax error.
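If you do decide to reject it in the lexer, one simple approach (sketched here with names of my own choosing) is to peek at the character that follows a numeric literal before emitting the token:
import re

NUMBER = re.compile(r"\d+")

def lex_number(text, pos):
    # Match a number, but reject it if a letter follows immediately.
    m = NUMBER.match(text, pos)
    if m and m.end() < len(text) and text[m.end()].isalpha():
        raise SyntaxError(f"invalid number at column {pos}: {text[pos:m.end() + 1]!r}")
    return m

lex_number("foo = 42 + 1", 6)  # fine: returns a match for '42'
lex_number("foo = 42bar", 6)   # raises SyntaxError: invalid number ... '42b'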
As a side note, in C and C++, 42and is initially lexed as a "preprocessor number". After preprocessing, it needs to be relexed and it is at that point that the error message is produced. The reason for this odd behaviour is that it is completely legitimate to paste together two fragments to produce a valid number:
$ gcc -D"c_(x,y)=x##y" -D"c(x,y)=c_(x,y)" -x c - <<<"int main(){return c(12E,1F);}"
$ ./a.out; echo $?
120
Both 12E and 1F would be invalid integers, but pasted together with the ## operator, they form a perfectly legitimate float. The ## operator only works on single tokens, so 12E and 1F both need to be lexed as single tokens. c(12E+,1F) wouldn't work, but c(12E0,1F) is also fine.
This is also why you should always put spaces around the + operator in C. A classic trick C question: "What is the value of 0x1E+2?" (It is not 32: 0x1E+2 is lexed as a single preprocessor number, which then fails to convert to a valid constant, whereas 0x1E + 2 is 32.)
Finally, the explanation for the awk line:
$ awk 'BEGIN{print 42foo + 3}'
423
That's lexed by awk as BEGIN{print 42 foo + 3} which is then parsed as though it had been written BEGIN{print (42)(foo + 3);}. In awk, string concatenation is written without an operator, but it binds less tightly than any arithmetic operator. Consequently, the usual advice is to use explicit parentheses in expressions which involve concatenation, unless they are really simple. (Also, undefined variables are assumed to have the value 0 if used arithmetically and "" if used as strings.)
I disagree with other answers here. It should be done by the lexer. If the character following the digits isn't whitespace or a special character, you're in the middle of an illegal token, specifically an identifier that doesn't start with a letter.
Or else just return the 42 and the 'bar' separately and let the parser handle it as a syntax error.
Yes, contextual checks like this belong in the parser.
Also, you say that foo = 42bar is invalid. From the lexer's perspective, it is not, though. The 4 tokens recognized by your lexer are (probably) correct (you don't post your token definitions).
foo = 42bar may or may not be a valid statement in your language.
Edit: I just realized that that's actually an invalid token for your language. So yes, it would fail the lexer at that point, because you have no rule matching it. Otherwise, what would it be, InvalidTokenToken?
But let's say that it was a valid token. Say you write a lexer rule saying that id = <number> is OK... what do you do about id = <number> + <number> - <number>, and all of the various combinations that follow from it? How is the lexer going to give you an AST for any of those? This is where the parser comes in.
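To make that division of labour concrete, here is a minimal sketch of a recursive-descent parser working on a token list a lexer has already produced. The toy grammar and token kinds are my own, not the asker's language:
# Toy grammar: assignment -> IDENT '=' expr ; expr -> INT (('+'|'-') INT)*
def parse_assignment(tokens):
    pos = 0
    def expect(kind):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        tok = tokens[pos]
        pos += 1
        return tok
    target = expect("IDENT")
    expect("SYMBOL")  # the '='
    node = expect("INT")
    while pos < len(tokens) and tokens[pos][0] == "OP":
        op = expect("OP")
        node = (op[1], node, expect("INT"))  # build the AST bottom-up
    return ("assign", target[1], node)

# foo = 1 + 2 - 3
print(parse_assignment([("IDENT", "foo"), ("SYMBOL", "="), ("INT", "1"),
                        ("OP", "+"), ("INT", "2"), ("OP", "-"), ("INT", "3")]))
# ('assign', 'foo', ('-', ('+', ('INT', '1'), ('INT', '2')), ('INT', '3')))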
Are you using a parser combinator framework? I ask because sometimes with those the distinction between parser and lexer rules starts to seem arbitrary, especially since you may not have an explicit grammar in front of you. But the language you're parsing still has a grammar, and what counts as a parser rule is each production of the grammar. At the very "bottom" you have rules which describe a single terminal, like "a number is one or more digits", and this, and this alone, is what the lexer gets used for; the reason being that it can speed up the parser and simplify its implementation.

How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?

I'd be more than glad if someone could clear up my confusion about how lookahead relates to tokenizing with greedy/non-greedy matching. Be aware that this is a slightly long post, because it follows my thought process.
I'm trying to write an ANTLR 3 grammar that allows me to match input such as:
"identifierkeyword"
I came up with a grammar like so in Antlr 3.4:
KEYWORD: 'keyword' ;
IDENTIFIER
:
(options {greedy=false;}: (LOWCHAR|HIGHCHAR))+
;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
parse: IDENTIFIER KEYWORD EOF;
however, it complains that IDENTIFIER can never be matched this way, which I don't really understand. (The following alternatives can never be matched: 1)
Basically, I was trying to tell the lexer to match (LOWCHAR|HIGHCHAR) in a non-greedy way so that it stops when KEYWORD appears in the lookahead. What I've read so far about ANTLR lexers is that there is supposed to be some kind of precedence among the lexer rules: if I specify the KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after it shouldn't be able to match the consumed characters.
After some searching, I understand that the problem here is that the input can't be tokenized the right way, because for an input such as "identifierkeyword" the "identifier" part comes first, so the lexer decides to start matching the IDENTIFIER rule while no KEYWORD token has been matched yet.
Then I tried to write the same grammar in ANTLR 4, to test whether the new lookahead capabilities can match what I want. It looks like this:
KEYWORD: 'keyword' ;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
IDENTIFIER
:
(LOWCHAR|HIGHCHAR)+?
;
parse: IDENTIFIER KEYWORD EOF;
for the input: "identifierkeyword" it produces this error:
line 1:1 mismatched input 'd' expecting 'keyword'
It matches the character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token, which it doesn't get this way.
Isn't non-greedy matching in the lexer supposed to match until another possibility becomes available in the lookahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?
I'm really confused about this. I have watched the video in which Terence Parr introduces the new capabilities of ANTLR 4, where he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for lexer rules too, where a possible right solution for tokenizing the input "identifierkeyword" is matching IDENTIFIER: "identifier" and KEYWORD: "keyword".
I think I have a lot of wrong ideas in my head about non-greedy/greedy matching. Could somebody please explain to me how it works?
After all this, I found a similar question here: ANTLR trying to match token within longer token, and made a grammar corresponding to it:
parse
:
identifier 'keyword'
;
identifier
:
(HIGHCHAR | LOWCHAR)+
;
/** lowercase letters */
LOWCHAR
: 'a'..'z';
/** uppercase letters */
HIGHCHAR
: 'A'..'Z';
This does what I want now; however, I can't see why I can't change the identifier rule to a lexer rule and make LOWCHAR and HIGHCHAR fragments.
Doesn't the lexer know that the letters in "keyword" can also be matched as an identifier, and vice versa? Or is it that rules only have lookahead within themselves, not across all possible tokenizations?
The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER to match a single input character, and then create a parser rule to handle sequences of these characters.
identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;
This would cause the lexer to tokenize the input identifier as 10 separate IDENTIFIER tokens, and then read keyword as a single KEYWORD token.
The behavior you observed in ANTLR 4 using the non-greedy operator +? is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR) blocks as possible while still creating an IDENTIFIER token". Clearly the fewest that can create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER to match a single character. The reason the parse rule failed to handle this is that it only allows a single IDENTIFIER token to appear before the KEYWORD token. By creating a parser rule identifier as I showed above, the parser is able to treat a sequence of IDENTIFIER tokens (each a single character) as a single identifier.
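This "as few as possible" behaviour is the same thing you see with non-greedy regex quantifiers. A quick illustration (regular expressions here, not ANTLR, purely as an analogy):
import re

# A non-greedy quantifier with nothing after it matches as little as it can:
print(re.match(r"[a-z]+?", "identifierkeyword").group())         # 'i'
# A greedy quantifier matches as much as it can:
print(re.match(r"[a-z]+", "identifierkeyword").group())          # 'identifierkeyword'
# Non-greedy only grows when the rest of the pattern forces it to:
print(re.match(r"[a-z]+?keyword", "identifierkeyword").group())  # 'identifierkeyword'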
Edit: The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is that the static analysis has determined that the positive closure in the rule IDENTIFIER will never match more than 1 character, because the rule will always succeed with exactly 1 character.

In the PowerShell grammar, what is the `lvalueExpression` rule saying?

I was reviewing the PowerShell grammar posted here: http://www.manning.com/payette/AppCexcerpt.pdf
(I don't think it has been updated since PowerShell v1, and there are some typos. So, it's clearly not the true PowerShell Grammar, but a human-oriented document.)
In section C.2.1, it says:
<lvalueExpression> = <lvalue> [? |? <lvalue>]*
What is the meaning of the question marks? I can't tell if it means "match any character" or "match a question mark" or it's a typo.
I'm not sure what inputs this is intended to match, but maybe it's this:
$a,$b = 1, 2
in which case maybe the question mark is supposed to be a comma?
Based on its use in the preceding rule (<assignmentStatementRule> = <lvalueExpression> <AssignmentOperatorToken> <pipelineRule>), it appears that lvalueExpression in Appendix C of Windows PowerShell in Action corresponds to expression in section B.2.3 of The PowerShell Language Specification that Joey linked to. Matching it further than this is difficult, but I'll add some speculation anyway :)
The ? characters in [? |? <lvalue>]* are very likely erroneous. If ? had been used to represent "the previous token is optional", then:
the [ and | tokens it was applied to should have been quoted
only [ makes sense as part of a value expression, but indexing is already covered later by the propertyOrArrayReferenceOperator rule
? is not used anywhere else in the grammar, but {0|1} is used multiple times to indicate "can appear zero or one times"
Given its similarity to [ '|' <cmdletCall> ]* at the end of the first rule in the section, it may have been a copy-and-paste error, compounded by a ‘smart quote’ round-trip encoding error. Assuming this was copied with the intent of editing later, then ?|? may have become '.' to represent multiple property accesses (but again, this is covered by the propertyOrArrayReferenceOperator rule).
Though based on the statement at the end of section C.2.1 that "[the pipeline rule] also handles parsing assignment expressions", lvalueExpression was probably intended to list all the assignable expressions besides simpleLvalue (e.g. cast-expression for [int]$x = 1, array-literal-expression for $a,$b,$c = 1,2,3, etc.).

Lexer antlr3 token problem

Can I construct a token
ENDPLUS: '+' (options (greedy = false;):.) * '+'
;
that is considered by the lexer only if it is preceded by a token PRE, without including PRE in ENDPLUS?
PRE: '<<'
;
Thanks.
No, AFAIK, this is not possible "out of the box". One only has lookahead control over the token stream in the lexer or parser by using the attribute input and calling LA(int) (lookahead) on it. For example, the following lexer rule:
Token
: {input.LA(2) == 'b'}? .
;
matches any single character as long as that single character is followed by a b. Unfortunately, there's no input.LA(-1) feature to look behind in the token stream. The {...}? part is called a "semantic predicate", in case you're wondering, or want to Google it.
A discussion, and some pointers on how to go about solving it, are given here: http://www.antlr.org/pipermail/antlr-interest/2004-July/008673.html
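A common workaround is to track the previously emitted token in the lexer yourself and only let the context-sensitive rule fire after the right predecessor. Sketched below as a hand-rolled Python lexer, since ANTLR 3 doesn't offer this directly; the details are mine, not from the mailing-list thread:
import re

# ENDPLUS ('+...+', non-greedy) is only recognized right after PRE ('<<').
PATTERNS = [
    ("PRE",     re.compile(r"<<")),
    ("ENDPLUS", re.compile(r"\+.*?\+")),  # non-greedy: stop at the next '+'
    ("OTHER",   re.compile(r".")),
]

def tokenize(text):
    tokens, pos, prev = [], 0, None
    while pos < len(text):
        for name, pattern in PATTERNS:
            if name == "ENDPLUS" and prev != "PRE":
                continue  # the look-behind check ANTLR 3 can't express here
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos, prev = m.end(), name
                break
    return tokens

print(tokenize("<<+abc+"))  # [('PRE', '<<'), ('ENDPLUS', '+abc+')]
print(tokenize("+abc+"))    # no PRE before it: five one-character OTHER tokens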
Note that it's {greedy=false;}, not (greedy=false;).
