I am currently implementing part of the grammar of Decaf (the programming language). Here is the relevant snippet of bison code:
type:
INT
| ID
| type LS RS
;
local_var_decl:
type ID SEMICOLON
;
name:
THIS
| ID
| name DOT ID
| name LS expression RS
;
However, as soon as I started working on the name production rule, my parser began giving a reduce-reduce warning.
Here is what is inside the .output file (generated by bison):
State 84
23 type: ID .
61 name: ID .
ID reduce using rule 23 (type)
LS reduce using rule 23 (type)
LS [reduce using rule 61 (name)]
$default reduce using rule 61 (name)
So, if we give the following input { abc[1] = abc; }, the parser reports syntax error, unexpected NUMBER, expecting RS. The NUMBER comes from the expression rule (which is how the input should have been parsed), but the parser instead tries to parse it through the local_var_decl rule.
What do you think should be changed in order to solve this problem? I spent around two hours trying different things, but nothing worked.
Thank you!!
PS. Here is the link to the full .y source code
This is a specific instance of a common problem where the parser is being forced to make a decision before it has enough information. In some cases, such as this one, the information needed is not far away, and it would be sufficient to increase the lookahead, if that were possible. (Unfortunately, few parser generators produce LR(k) parsers with k > 1, and bison is no exception.) The usual solution is to simply allow the parse to continue without having to decide. Another solution with bison (but only in C mode) is to ask for a %glr-parser, which is much more flexible about when reductions need to be resolved, at the cost of additional processing time.
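For reference, the GLR route is just a couple of declarations in the prologue of the .y file; the %expect-rr count below assumes the ID ambiguity discussed here is the grammar's only reduce-reduce conflict:
%glr-parser    /* split the parse instead of deciding immediately */
%expect-rr 1   /* silence the warning for the one expected reduce-reduce conflict */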
In this case, the context allows either a type or a name, both of which can start with an ID followed by a [ (LS). In the case of a name, the [ must be followed by an expression; in the case of a type, the [ must be followed by a ]. So if we could see the second token after the ID, we could immediately decide.
But we can only see one token ahead, which is the [. And the grammar insists that we be able to make an immediate decision, because in one case we must reduce the ID to a name, and in the other case, to a type. So we have a reduce-reduce conflict, which bison resolves by always using whichever reduction comes first in the grammar file.
One solution is to avoid forcing this choice, at the cost of duplicating productions. For example:
type_other:
INT
| ID LS RS
| type_other LS RS
;
type: ID
| type_other
;
name_other:
THIS
| ID DOT ID
| ID LS expression RS
| name_other DOT ID
| name_other LS expression RS
;
name: ID
| name_other
;
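Now an ID followed by [ no longer has to be reduced immediately: the parser shifts the [ and the next token decides (a ] completes ID LS RS, making it a type_other; anything that can start an expression leads to name_other). The remaining reductions of a bare ID are also distinguished by a single token of lookahead: an ID follows for a type, and a token such as = or . follows for a name.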
Given a positional language like the old IBM RPG, we can have a line such as
CCCCCDIDENTIFIER E S 10
Where characters
1-5: comment
6: specification type
7-21: identifier name
...And so on
Now, given that JFlex is based on RegExp, we would have a RegExp such as:
[a-zA-Z][a-zA-Z0-9]{0,14}[ ]{0,14}
for the identifier name token.
This RegExp however can match tokens longer than the 15 characters possible for identifier name, requiring yypushbacks.
Thus, is there a way to limit how many characters JFlex reads for a particular token?
Regular expression based lexical analysis is really not the right tool to parse fixed-field inputs. You can just split the input into fields at the known character positions, which is way easier and a lot faster. And it doesn't require fussing with regular expressions.
Anyway, [a-zA-Z][a-zA-Z0-9]{0,14}[ ]{0,14} wouldn't be the right expression even if it did properly handle the token length, since the token is really the word at the beginning, without space characters.
In the case of fixed-length fields which contain something more complicated than a single identifier, you might want to feed the field into a lexer, using a StringReader or some such.
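For illustration, here is a rough sketch of the position-based split in Java (the column indexes come from the layout above; padding and line-length validation are left out):
public class RpgFields {
    public static void main(String[] args) {
        String line = "CCCCCDIDENTIFIER     E S 10";
        String comment  = line.substring(0, 5);          // columns 1-5: comment
        char   specType = line.charAt(5);                // column 6: specification type
        String name     = line.substring(6, 21).trim();  // columns 7-21: identifier name
        String rest     = line.substring(21);            // ...and so on
        System.out.println(specType + " | " + name + " | " + rest);
    }
}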
Although I'm sure it's not useful, here's a regular expression which matches 15 characters which start with a word and are completed with spaces:
[a-zA-Z][ ]{14} |
[a-zA-Z][a-zA-Z0-9][ ]{13} |
[a-zA-Z][a-zA-Z0-9]{2}[ ]{12} |
[a-zA-Z][a-zA-Z0-9]{3}[ ]{11} |
[a-zA-Z][a-zA-Z0-9]{4}[ ]{10} |
[a-zA-Z][a-zA-Z0-9]{5}[ ]{9} |
[a-zA-Z][a-zA-Z0-9]{6}[ ]{8} |
[a-zA-Z][a-zA-Z0-9]{7}[ ]{7} |
[a-zA-Z][a-zA-Z0-9]{8}[ ]{6} |
[a-zA-Z][a-zA-Z0-9]{9}[ ]{5} |
[a-zA-Z][a-zA-Z0-9]{10}[ ]{4} |
[a-zA-Z][a-zA-Z0-9]{11}[ ]{3} |
[a-zA-Z][a-zA-Z0-9]{12}[ ]{2} |
[a-zA-Z][a-zA-Z0-9]{13}[ ] |
[a-zA-Z][a-zA-Z0-9]{14}
(That might have to be put on one very long line.)
I'm testing a simple grammar (shown below) with simple input strings and get the following error message from the ANTLRWorks interpreter: MismatchedTokenException(80!=21).
My input (abc45{r24}) means "repeat the keys a, b, c, 4 and 5, 24 times."
ANTLRWorks 1.5.2 Grammar:
expr : '(' (key)+ repcount ')' EOF;
key : KEY | digit ;
repcount : '{' 'r' count '}';
count : (digit)+;
digit : DIGIT;
DIGIT : '0'..'9';
KEY : ('a'..'z'|'A'..'Z') ;
Inputs:
(abc4{r4}) - ok
(abc44{r4}) - fails NoViableAltException
(abc4 4{r4}) - ok
(abc4{r45}) - fails MismatchedTokenException(80!=21)
(abc4{r4 5}) - ok
The parse errors go away if I put a space between 44 or 45 to separate the individual digits.
Q1. What does NoViableAltException mean? How can I interpret it to look for a problem in the grammar/input pair?
Q2. What does the expression 80!=21 mean? Can I do anything useful with the information to look for a problem in the grammar/input pair?
I don't understand why the grammar has a problem reading successive digits. I thought my expressions (key)+ and (digit)+ specify that successive digits are allowed and would be read as successive individual digits.
If someone could explain what I'm doing wrong, I would be grateful. This seems like a simple problem, but hours later, I still don't understand why and how to solve it. Thank you.
UPDATE:
Further down in my simple grammar file I had a lexer rule for FLOAT copied from another grammar. I did not think to include it above (or check it as a source of the errors) because it was not used by any parser rule and would never match my input characters. Here is the FLOAT grammar rule (which contains sequences of DIGITs):
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
If I delete the whole rule, all my test cases above parse successfully. If I leave any one of the three FLOAT clauses in the grammar/lexer file, the parses fail as shown above.
Q3. Why does the FLOAT rule cause failures in the parse? The DIGIT lexer rule appears first, and so should "win" and be used in preference to the FLOAT rule. Besides, the FLOAT rule doesn't match the input stream.
I hazard a guess that the lexer is skipping the DIGIT rule and getting stuck in the FLOAT rule, even though FLOAT comes after DIGIT in the grammar file.
SCREENSHOTS
I took these two screenshots after Bart's comment below to show the parse failures that I am experiencing. Not that it matters, but ANTLRWorks 1.5.2 will not accept the SPACE : [ \t\r\n]+; character-class syntax used in Bart's kind replies. Maybe the screenshots will help; they show all the rules in my grammar file.
The only difference between the two screenshots is that one input string has two sets of multiple digits and the other has only one set of multiple digits. Maybe this extra info will help somehow.
If I remember correctly, ANTLR's v3 lexer is less powerful than v4's. When the lexer gets the input 123x, the first 3 chars (123) are consumed by the lexer rule FLOAT, but when the lexer then encounters the x, it knows it cannot complete the FLOAT rule. However, the v3 lexer does not give up on its partial match and tries to find another rule, below it, that matches these 3 chars (123). Since there is no such rule, the lexer throws an exception. Again, not 100% sure, but this is how I remember it.
ANTLR v4's lexer will give up on the partial 123 match, return 23 to the char stream, and create a single DIGIT token for the input 1.
I highly suggest you move away from v3 and opt for the more powerful v4 version.
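For what it's worth, here is roughly the same grammar in v4 syntax (an untested sketch; the grammar name is made up, and EXPONENT is omitted since it was not shown). With the v4 lexer, the FLOAT rule should no longer break the digit inputs, because the lexer falls back to DIGIT when FLOAT cannot complete:
grammar Keys;

expr     : '(' key+ repcount ')' EOF ;
key      : KEY | digit ;
repcount : '{' 'r' count '}' ;
count    : digit+ ;
digit    : DIGIT ;

FLOAT    : [0-9]+ '.' [0-9]* | '.' [0-9]+ ;
DIGIT    : [0-9] ;
KEY      : [a-zA-Z] ;
WS       : [ \t\r\n]+ -> skip ;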
First-time poster, so my greatest apologies if I break the rules.
I'm using Antlr4 to create a log parser and I'm running into some issues that I don't understand.
I'm trying to parse the following input log sequence:
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
With the following grammar:
grammar Juniper;
WS : (' '|'\t')+ -> skip ;
NL : '\r'? '\n' -> skip ;
fragment DIGIT : '0'..'9' ;
NUMBER : DIGIT+ ;
IPADDRESS : NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ;
SLASH : '/' -> skip ;
RIGHTARROW : '->' -> skip ;
CREATED: 'created' -> skip ;
HOSTNAME : [a-zA-Z0-9\-]+ ;
/* Input sample for rule: USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443 */
testcase : HOSTNAME WS CREATED WS IPADDRESS SLASH NUMBER RIGHTARROW IPADDRESS SLASH NUMBER NL;
It's failing, and I can't for the life of me figure out why. I know that the token recognition error has something to do with the dash in the character class of the HOSTNAME token, but I'm not sure how to fix it.
$ antlr4 Juniper.g4 && javac Juniper*.java && grun Juniper testcase -tree
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
line 1:48 token recognition error at: '>'
line 1:30 mismatched input '10.20.30.40' expecting WS
(testcase SA1-RR-SRX240-EDGE-01 10.20.30.40 50985- 11.12.13.14 443)
Please note that the second line of the above output is the data I paste into grun; I then hit Enter followed by Ctrl+D.
Any assistance on this would be highly appreciated; I've been banging my head against the keyboard on this for a while now.
The problem with recognizing -> is that HOSTNAME matches any sequence of letters, numbers and dashes, and that includes 50985-. Since that match is longer than what NUMBER would match (50985), HOSTNAME wins. That's evidently not what you want.
Parsing log lines generally requires a context-sensitive scanner, and standard parser generators -- which are more oriented towards parsing programming languages -- are not always the ideal tool. In this case, for example, HOSTNAME cannot appear in the context in which it is being recognized, so it shouldn't even be in the list of possible tokens.
Of course, you could define a token which consisted of an ip number and port separated by a slash, which would solve the ambiguity, but (in my opinion) that would be suboptimal because you'll end up rescanning that token to parse it.
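That said, if you want to stay within this grammar, one possible fix (an untested sketch) is to define HOSTNAME so that it cannot end with a dash. Maximal munch then stops before the -> arrow, and 50985 is matched by NUMBER, which wins the equal-length tie because it is listed first. Note that tokens marked -> skip never reach the parser, so the tokens the testcase rule references are no longer skipped here:
grammar Juniper;

WS : (' '|'\t')+ -> skip ;
NL : '\r'? '\n' -> skip ;
fragment DIGIT : '0'..'9' ;
NUMBER : DIGIT+ ;
IPADDRESS : NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ;
SLASH : '/' ;
RIGHTARROW : '->' ;
CREATED : 'created' ;
// Alphanumeric runs separated by single dashes, so a match can never end in '-'.
HOSTNAME : [a-zA-Z0-9]+ ('-' [a-zA-Z0-9]+)* ;

testcase : HOSTNAME CREATED IPADDRESS SLASH NUMBER RIGHTARROW IPADDRESS SLASH NUMBER ;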
I am using this ANTLR 3 grammar and ANTLRWorks for testing that grammar.
But I can't figure out why some parts of my input text are omitted.
I would like to rewrite this grammar so that every element (lparen, keywords, semicolons, ...) of the source file (input) is displayed in the AST / CST.
I've tried everything, but without success. Can someone who is experienced with ANTLR help me?
Parse tree:
I've managed to narrow it down to the semic rule:
/*
This rule handles semicolons reported by the lexer and situations where the ECMA 3 specification states there should be semicolons automatically inserted.
The auto semicolons are not actually inserted but this rule behaves as if they were.
In the following situations an ECMA 3 parser should auto-insert absent but grammatically required semicolons:
- the current token is a right brace
- the current token is the end of file (EOF) token
- there is at least one end of line (EOL) token between the current token and the previous token.
The RBRACE is handled by matching it but not consuming it.
The EOF needs no further handling because it is not consumed by default.
The EOL situation is handled by promoting the EOL or MultiLineComment with an EOL present from off channel to on channel
and thus making it parseable instead of handling it as white space. This promoting is done in the action promoteEOL.
*/
semic
@init
{
// Mark current position so we can unconsume a RBRACE.
int marker = input.mark();
// Promote EOL if appropriate
promoteEOL(retval);
}
: SEMIC
| EOF
| RBRACE { input.rewind(marker); }
| EOL | MultiLineComment // (with EOL in it)
;
So, the EVIL semicolon insertion strikes again!
I'm not really sure, but I think these mark/rewind calls are getting out of sync. The @init block is executed both when the rule is entered for branch selection and for actual matching, so it creates a lot of marks without cleaning them up. But I don't know why it messes up the parse tree like that.
Anyway, here's a working version of the same rule:
semic
@init
{
// Promote EOL if appropriate
promoteEOL(retval);
}
: SEMIC
| EOF
| { int pos = input.index(); } RBRACE { input.seek(pos); }
| EOL | MultiLineComment // (with EOL in it)
;
It's much simpler and doesn't use the mark/rewind mechanism.
But there's a catch: the semic rule in the parse tree will have a child node } in the case of a semicolon insertion before a closing brace. Try to remove the semicolon after i-- and see the result. You'll have to detect this and handle it in your code. semic should either contain a ; token, or contain EOL (which means a semicolon got silently inserted at this point).
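The detection itself can be as simple as comparing token types; a hedged sketch (the class name and the semicTokenType parameter are made up, since the generated token-type constants depend on your grammar):
import org.antlr.runtime.tree.CommonTree;

public class SemicCheck {
    // True if the semic node does not hold a real ';', i.e. a semicolon
    // was inserted automatically (EOL, MultiLineComment, or '}').
    public static boolean isInsertedSemicolon(CommonTree semicNode, int semicTokenType) {
        return semicNode.getType() != semicTokenType;
    }
}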
I'm currently implementing a lexer for a simple programming language. So far, I can tokenize identifiers, assignment symbols, and integer literals correctly; in general, whitespace is insignificant.
For the input foo = 42, three tokens are recognized:
foo (identifier)
= (symbol)
42 (integer literal)
So far, so good. However, consider the input foo = 42bar, which is invalid due to the (significant) missing space between 42 and bar. My lexer incorrectly recognizes the following tokens:
foo (identifier)
= (symbol)
42 (integer literal)
bar (identifier)
Once the lexer sees the digit 4, it keeps reading until it encounters a non-digit. It therefore consumes the 2 and stores 42 as an integer literal token. Because whitespace is insignificant, the lexer discards any whitespace (if there is any) and starts reading the next token: It finds the identifier bar.
Now, here's my question: Is it still the lexer's responsibility to recognize that an identifier is not allowed at that position? Or does that check belong to the responsibilities of the parser?
I don't think there is any consensus on the question of whether 42foo should be recognised as an invalid number or as two tokens. It's a question of style, and both usages are common in well-known languages.
For example:
$ python -c 'print 42and False'
False
$ lua -e 'print(42and false)'
lua: (command line):1: malformed number near '42a'
$ perl -le 'print 42and 0'
42
# Not an idiosyncrasy of tcc; it's defined by the standard
$ tcc -D"and=&&" -run - <<<"main(){return 42and 0;}"
stdin:1: error: invalid number
# gcc has better error messages
$ gcc -D"and=&&" -x c - <<<"main(){return 42and 0;}" && ./a.out
<stdin>: In function ‘main’:
<stdin>:1:15: error: invalid suffix "and" on integer constant
<stdin>:1:21: error: expected ‘;’ before numeric constant
$ ruby -le 'print 42and 1'
42
# And now for something completely different (explained below)
$ awk 'BEGIN{print 42foo + 3}'
423
So, both possibilities are in common use.
If you're going to reject it because you think a number and a word should be separated by whitespace, you should reject it in the lexer. The parser cannot (or should not) know whether whitespace separates two tokens. Independent of the validity of 42and, the fragments 42 + 1, 42+1, and 42+ 1 should all be parsed identically. (Except, perhaps, in Fortress. But that was an anomaly.) If you don't mind shoving numbers and words together, then let the parser reject it if (and only if) it is a syntax error.
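If you do reject it in the lexer, the check is cheap: after the maximal-munch scan of the digits, peek at the next character. A minimal sketch in Java (the class and method names are made up for illustration):
public class NumberScanner {
    // Scan an integer literal starting at pos; reject a digit run that
    // flows directly into an identifier character.
    public static String scanNumber(String src, int pos) {
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            pos++;  // maximal munch over the digits
        }
        if (pos < src.length()
                && (Character.isLetter(src.charAt(pos)) || src.charAt(pos) == '_')) {
            throw new IllegalArgumentException(
                "invalid number at offset " + start + ": digits run into a word");
        }
        return src.substring(start, pos);
    }

    public static void main(String[] args) {
        System.out.println(scanNumber("42 and false", 0));  // prints 42
        System.out.println(scanNumber("42and false", 0));   // throws
    }
}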
As a side note, in C and C++, 42and is initially lexed as a "preprocessor number". After preprocessing, it needs to be relexed and it is at that point that the error message is produced. The reason for this odd behaviour is that it is completely legitimate to paste together two fragments to produce a valid number:
$ gcc -D"c_(x,y)=x##y" -D"c(x,y)=c_(x,y)" -x c - <<<"int main(){return c(12E,1F);}"
$ ./a.out; echo $?
120
Both 12E and 1F would be invalid numbers, but pasted together with the ## operator, they form a perfectly legitimate float. The ## operator only works on single tokens, so 12E and 1F both need to be lexed as single tokens. c(12E+,1F) wouldn't work, but c(12E0,1F) is also fine.
This is also why you should always put spaces around the + operator in C. Classic trick C question: "What is the value of 0x1E+2?" (Answer: it's an error, because 0x1E+2 is lexed as a single preprocessor number which cannot be converted into a valid token; 0x1E + 2, with spaces, is 32.)
Finally, the explanation for the awk line:
$ awk 'BEGIN{print 42foo + 3}'
423
That's lexed by awk as BEGIN{print 42 foo + 3} which is then parsed as though it had been written BEGIN{print (42)(foo + 3);}. In awk, string concatenation is written without an operator, but it binds less tightly than any arithmetic operator. Consequently, the usual advice is to use explicit parentheses in expressions which involve concatenation, unless they are really simple. (Also, undefined variables are assumed to have the value 0 if used arithmetically and "" if used as strings.)
I disagree with other answers here. It should be done by the lexer. If the character following the digits isn't whitespace or a special character, you're in the middle of an illegal token, specifically an identifier that doesn't start with a letter.
Or else just return the 42 and the bar separately and let the parser handle it as a syntax error.
Yes, contextual checks like this belong in the parser.
Also, you say that foo = 42bar is invalid. From the lexer's perspective, it is not, though. The 4 tokens recognized by your lexer are (probably) correct (you don't post your token definitions).
foo = 42bar may or may not be a valid statement in your language.
Edit: I just realized that that's actually an invalid token for your language. So yes, it would fail the lexer at that point, because you have no rule matching it. Otherwise, what would it be, InvalidTokenToken?
But let's say that it was a valid token. Say you write a lexer rule saying that id = <number> is ok... what do you do about id = <number> + <number> - <number>, and all of the various combinations that that leads to? How is the lexer going to give you an AST for any of those? This is where the parser comes in.
Are you using a parser combinator framework? I ask because with those the distinction between parser and lexer rules can start to seem arbitrary, especially since you may not have an explicit grammar in front of you. But the language you're parsing still has a grammar, and what counts as a parser rule is each production of the grammar. At the very "bottom" you have rules which describe a single terminal, like "a number is one or more digits", and this, and this alone, is what the lexer gets used for -- the reason being that it can speed up the parser and simplify its implementation.