ANTLR4 lexer rules not matching correct block of text - parsing

I am trying to understand how ANTLR4 works based on lexer and parser rules but I am missing something in the following example:
I am trying to parse a file and match all mathematic additions (eg 1+2+3 etc.). My file contains the following text:
start
4 + 5 + 22 + 1
other text other text test test
test test other text
55 other text
another text 2 + 4 + 255
number 44
end
and I would like to match
4 + 5 + 22 + 1
and
2 + 4 + 255
My grammar is as follows:
grammar Hello;
hi : expr+ EOF;
expr : NUM (PLUS NUM)+;
PLUS : '+' ;
NUM : [0-9]+ ;
SPACE : [\n\r\t ]+ ->skip;
OTHER : [a-z]+ ;
My abstract Syntax Tree is visualized as
Why does rule 'expr' matches the text 'start'? I also get an error "extraneous input 'start' expecting NUM"
If i make the following change in my grammar
OTHER : [a-z]+ ->skip;
the error is gone. In addition in the image above text '55 other text
another text' matches the expression as a node in the AST. Why is this happening?
All the above have to do with the way lexer matches an input? I know that lexer looks for the first longest matching rule but how can I change my grammar so as to match only the additions?

Why does rule 'expr' matches the text 'start'?
It doesn't. When a token shows up red in the tree, that indicates an error. The token did not match any of the possible alternatives, so an error was produced and the parser continued with the next token.
In addition in the image above text '55 other text another text' matches the expression as a node in the AST. Why is this happening?
After you skipped the OTHER tokens, your input basically looks like this:
4 + 5 + 22 + 1 55 2 + 4 + 255 44
4 + 5 + 22 + 1 can be parsed as an expression, no problem. After that the parser either expects a + (continuing the expression) or a number (starting a new expression). So when it sees 55, that indicates the start of a new expression. Now it expects a + (because the grammar says that PLUS NUM must appear at least once after the first number in an expression). What it actually gets is the number 2. So it produces an error and ignores that token. Then it sees a +, which is what it expected. And then it continues that way until the 44, which again starts a new expression. Since that isn't followed by a +, that's another error.
All the above have to do with the way lexer matches an input?
Not really. The token sequence for "start 4 + 5" is OTHER NUM PLUS NUM, or just NUM PLUS NUM if you skip the OTHERs. The token sequence for "55 skippedtext 2 + 4" is NUM NUM PLUS NUM. I assume that's exactly what you'd expect.
Instead what seems to be confusing you is how ANTLR recovers from errors (or maybe that it recovers from errors).

Related

How to understand ANTLRWorks 1.5.2 MismatchedTokenException(80!=21)

I'm testing a simple grammar (shown below) with simple input strings and get the following error message from the Antlrworks interpreter: MismatchedTokenException(80!=21).
My input (abc45{r24}) means "repeat the keys a, b, c, 4 and 5, 24 times."
ANTLRWorks 1.5.2 Grammar:
expr : '(' (key)+ repcount ')' EOF;
key : KEY | digit ;
repcount : '{' 'r' count '}';
count : (digit)+;
digit : DIGIT;
DIGIT : '0'..'9';
KEY : ('a'..'z'|'A'..'Z') ;
Inputs:
(abc4{r4}) - ok
(abc44{r4}) - fails NoViableAltException
(abc4 4{r4}) - ok
(abc4{r45}) - fails MismatchedTokenException(80!=21)
(abc4{r4 5}) - ok
The parse succeeds with input (abc4{r4}) (single digits only).
The parse fails with input (abc44{r4}) (NoViableAltException).
The parse fails with input (abc4{r45}) (MismatchedTokenException(80!=21)).
The parse errors go away if I put a space between 44 or 45 to separate the individual digits.
Q1. What does NoViableAltException mean? How can I interpret it to look for a problem in the grammar/input pair?
Q2. What does the expression 80!=21 mean? Can I do anything useful with the information to look for a problem in the grammar/input pair?
I don't understand why the grammar has a problem reading successive digits. I thought my expressions (key)+ and (digit)+ specify that successive digits are allowed and would be read as successive individual digits.
If someone could explain what I'm doing wrong, I would be grateful. This seems like a simple problem, but hours later, I still don't understand why and how to solve it. Thank you.
UPDATE:
Further down in my simple grammar file I had a lexer rule for FLOAT copied from another grammar. I did not think to include it above (or check it as a source of the errors) because it was not used by any parser rule and would never match my input characters. Here is the FLOAT grammar rule (which contains sequences of DIGITs):
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
If I delete the whole rule, all my test cases above parse successfully. If I leave any one of the three FLOAT clauses in the grammar/lexer file, the parses fail as shown above.
Q3. Why does the FLOAT rule cause failures in the parse? The DIGIT lexer rule appears first, and so should "win" and be used in preference to the FLOAT rule. Besides, the FLOAT rule doesn't match the input stream.
I hazard a guess that the lexer is skipping the DIGIT rule getting stuck in the FLOAT rule, even though FLOAT comes after DIGIT in the input file.
SCREENSHOTS
I took these two screenshots after Bart's comment below to show the parse failures that I am experiencing. Not that it matters, but ANTLRWorks 1.5.2 will not accept the syntax SPACE : [ \t\r\n]+; regular expression syntax in Bart's kind replies. Maybe the screenshots will help. They show all the rules in my grammar file.
The only difference in the two screenshots is that one input has two sets of multiple digits and the other input string has only set of multiple digits. Maybe this extra info will help somehow.
If I remember correctly, ANTLR's v3 lexer is less powerful than v4's version. When the lexer gets the input "123x", this first 3 chars (123) are consumed by the lexer rule FLOAT, but after that, when the lexer encounters the x, it knows it cannot complete the FLOAT rule. However, the v3 lexer does not give up on its partial match and tries to find another rule, below it, that matches these 3 chars (123). Since there is no such rule, the lexer throws an exception. Again, not 100% sure, this is how I remember it.
ANTLRv4's lexer will give up on the partial 123 match and will return 23 to the char stream to create a single KEY token for the input 1.
I highly suggest you move away from v3 and opt for the more powerful v4 version.

How can I extract some data out of the middle of a noisy file using Perl 6?

I would like to do this using idiomatic Perl 6.
I found a wonderful contiguous chunk of data buried in a noisy output file.
I would like to simply print out the header line starting with Cluster Unique and all of the lines following it, up to, but not including, the first occurrence of an empty line. Here's what the file looks like:
</path/to/projects/projectname/ParameterSweep/1000.1.7.dir> was used as the working directory.
....
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
Input file: "../../filename.count.fa"
...
Here's what I want parsed out:
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
One-liner version
.say if /Cluster \s+ Unique/ ff^ /^\s*$/ for lines;
In English
Print every line from the input file starting with the once containing the phrase Cluster Unique and ending just before the next empty line.
Same code with comments
.say # print the default variable $_
if # do the previous action (.say) "if" the following term is true
/Cluster \s+ Unique/ # Match $_ if it contains "Cluster Unique"
ff^ # Flip-flop operator: true until preceding term becomes true
# false once the term after it becomes true
/^\s*$/ # Match $_ if it contains an empty line
for # Create a loop placing each element of the following list into $_
lines # Create a list of all of the lines in the file
; # End of statement
Expanded version
for lines() {
.say if (
$_ ~~ /Cluster \s+ Unique/ ff^ $_ ~~ /^\s*$/
)
}
lines() is like <> in perl5. Each line from each file listed on the command line is read in one at a time. Since this is in a for loop, each line is placed in the default variable $_.
say is like print except that it also appends a newline. When written with a starting ., it acts directly on the default variable $_.
$_ is the default variable, which in this case contains one line from the file.
~~ is the match operator that is comparing $_ with a regular expression.
// Create a regular expression between the two forward slashes
\s+ matches one or more spaces
ff is the flip-flop operator. It is false as long as the expression to its left is false. It becomes true when the expression to its left is evaluated as true. It becomes false when the expression to its right becomes true and is never evaluated as true again. In this case, if we used ^ff^ instead of ff^, then the header would not be included in the output.
When ^ comes before (or after) ff, it modifies ff so that it is also false the iteration that the expression to its left (or right) becomes true.
/^\*$/ matches an empty line
^ matches the beginning of a string
\s* matches zero or more spaces
$ matches the end of a string
By the way, the flip-flop operator in Perl 5 is .. when it is in a scalar context (it's the range operator in list context). But its features are not quite as rich as in Perl 6, of course.
I would like to do this using idiomatic Perl 6.
In Perl, the idiomatic way to locate a chunk in a file is to read the file in paragraph mode, then stop reading the file when you find the chunk you are interested in. If you are reading a 10GB file, and the chunk is found at the top of the file, it's inefficient to continue reading the rest of the file--much less perform an if test on every line in the file.
In Perl 6, you can read a paragraph at a time like this:
my $fname = 'data.txt';
my $infile = open(
$fname,
nl => "\n\n", #Set what perl considers the end of a line.
); #Removed die() per Brad Gilbert's comment.
for $infile.lines() -> $para {
if $para ~~ /^ 'Cluster Unique'/ {
say $para.chomp;
last; #Quit reading the file.
}
}
$infile.close;
# ^ Match start of string.
# 'Cluster Unique' By default, whitespace is insignificant in a perl6 regex. Quotes are one way to make whitespace significant.
However, in perl6 rakudo/moarVM the open() function does not read the nl argument correctly, so you currently can't set paragraph mode.
Also, there are certain idioms that are considered by some to be bad practice, like:
Postfix if statements, e.g. say 'hello' if $y == 0.
Relying on the implicit $_ variable in your code, e.g. .say
So, depending on what side of the fence you live on, that would be considered a bad practice in Perl.

Jflex Unexpected Character error

I started studying jflex. When i try to generate output using jflex for the following code I keep getting an error
Error in file "\abc.flex" (line 29):
Unexpected character
[ \t\n]+ ;
^
1 error, 0 warnings.
Generation aborted.
Code trying to run
letter [a-zA-Z]
digit [0-9]
intlit [0-9]+
%{
#include <stdio.h>
# define BASTYPTOK 257 /*following are output from yacc*/
# define IDTOK 258 /*yacc assigns token numbers */
# define LITTOK 259
# define CINTOK 260
# define INSTREAMTOK 261
# define COUTTOK 262
# define OUTSTREAMTOK 263
# define WHILETOK 264
# define IFTOK 265
# define ADDOPTOK 266
# define MULOPTOK 267
# define RELOPTOK 268
# define NOTTOK 269
# define STRLITTOK 270
main() /*this replaces the main in the lex library*\
{ int p;
while (p= yylex())
printf("%d is \"%s\"\n", p, yytext);
/*yytext is where lex stores the lexeme*/}
%}
%%
[ \t\n]+ ;
"//".*"\n" ;
{intlit} {return(LITTOK);}
cin {return(CINTOK);}
"<<" {return(INSTREAMTOK);}
\<|"==" {return(RELOPTOK);}
\+|\-|"||" {return(ADDOPTOK);}
"=" {return(yytext[0]);}
"(" {return(yytext[0]);}
")" {return(yytext[0]);}
. {return (yytext[0]); /*default action*/}
%%
Can someone please help me figure out, what is causing the issue.
The pattern is also unindented properly.
thanks for your help.
That's valid flex input but it's not valid jflex. Since the included code is in C rather than Java, it's not clear to me why you would want to use jflex, but if your intent is to port the scanner to Java you might want to read the JFlex manual section on porting.
In particular, the sections in JFlex input are quite different from flex:
flex JFlex
definitions and declarations user code
%% %%
rules declarations
%% %%
user code definitions and rules
So your definitions and rules are in the correct section for a flex file, but not for a JFlex file. (JFlex just copies the first section to the output, so it doesn't recognize the various syntax errors resulting from putting flex declarations where JFlex expects valid user code.)
Also, JFlex definitions are of the form name = pattern rather than name pattern, so once you get the order of the file sorted out, you'll also need to add the equals signs. And. of course, rewrite the C code in Java.

Match newlines only under specific conditions

I am writing parser of language similar to javascript with its semicolon insertion ex:
var x = 1 + 2;
x;
and
var x = 1 + 2
x
and even
var x = 1 +
2
x
are the same.
For now my lexer matches newline (\n) only when it occurs after token different that semicolon. That plays nice with basic situations like 1 and 2 but how i can deal with third situation? i.e. new line happening in the middle of expression. I can't match new line every time because it would pollute my parser (inserting alternatives with newlines token everywhere) and I also cannot match them at all because it is statement terminator. Basically I would be the best to somehow check during parsing end of the statement if there was a new line character or semicolon there.
This has gone unanswered for a while. I cannot see why you cannot make a statement separator a newline **or* a semicolon. A bit like this:
whitespace [ \t]+
%%
{whitespace} /* Skip */
;[\n]* return(SEMICOLON);
[\n]+ return(SEMICOLON);
Then you're grammar is not messed up at all, as you only get the semicolon in the grammar.

Grammar for a recognizer of a spice-like language

I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing myself in this work to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1/n", but cannot interpret an input like "R1 5 0 1/n", because it maps "1" to the token NODE, instead of mapping it to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone has an idea of how can I map the "1" to the correct token VALUE, or a suggestion of how can I alter the grammar so that I can correctly interpret the input?
The second problem is the presence of a comment at the end of a line. Because the NEWLINE token delimits: (1) the end of a comment; and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary to the parser correctly recognize the line of code, otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as much characters as possible. In case two tokens match the same amount of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
E.g., I removed VALUE, added value and removed fragment from REAL
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.

Resources