With the Flex -d command line flag, why am I getting --(end of buffer or a NUL)?

I am using the -d flag when I run Flex, so that the scanner will generate debug messages. It works fine but the output contains some odd (unexpected) things. Below I show the output for a lexer that tokenizes names, numbers and newlines. Notice the first line of the output:
--(end of buffer or a NUL)
Huh? What is that?
Then there are a couple more towards the end. What are they all about?
--(end of buffer or a NUL)
--accepting rule at line 5 ("John")
--accepting rule at line 7 ("
")
--accepting rule at line 6 ("24")
--accepting rule at line 7 ("
")
--accepting rule at line 5 ("Sally")
--accepting rule at line 7 ("
")
--accepting rule at line 6 ("30")
--accepting rule at line 7 ("
")
--accepting rule at line 5 ("Bill")
--accepting rule at line 7 ("
")
--(end of buffer or a NUL)
--accepting rule at line 6 ("36")
--(end of buffer or a NUL)
--EOF (start condition 0)

This message refers to the scanner's internal buffer. When you first call yylex(), the buffer is empty, since no input has been read. So the scanner reports that, and then fills its buffer by reading from yyin. (That's assuming that you haven't pre-established an input buffer using one of the yy_scan_*() functions.)
I suppose that your input ends with a line containing 36 which is not terminated by a newline. So the scanner reads the characters 3 and 6, and then attempts to read another character, because the token might be longer. But there is no more data in the buffer. As before, the scanner reports that it reached the end of the buffer and then attempts to refill it from yyin. But since there's nothing more to read, the scanner gets an EOF indication. That means that the token 36 is complete, and needs to be handled.
Note that at this point, the buffer is empty, since nothing was read. So when yylex() is called for the next token, it immediately encounters the end of the buffer, which is reported. This time, though, it can't refill the buffer, because yyin has no more data. So the scanner executes the <<EOF>> action.
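For reference, a minimal lexer along these lines (an assumed reconstruction of the question's scanner, so your rule line numbers will differ) produces the same --(end of buffer or a NUL) messages when built with flex -d and run on the sample input:

```lex
%{
#include <stdio.h>
%}
%option noyywrap
%%
[[:alpha:]]+   { printf("name: %s\n", yytext); }
[[:digit:]]+   { printf("number: %s\n", yytext); }
\n             { printf("newline\n"); }
%%
int main(void) { return yylex(); }
```

Build with something like flex -d lexer.l && cc lex.yy.c -o lexer, then feed it input whose last line has no trailing newline to see the extra end-of-buffer message just before the EOF message.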

Related

ANTLR4 lexer rules not matching correct block of text

I am trying to understand how ANTLR4 works based on lexer and parser rules but I am missing something in the following example:
I am trying to parse a file and match all arithmetic additions (e.g. 1+2+3). My file contains the following text:
start
4 + 5 + 22 + 1
other text other text test test
test test other text
55 other text
another text 2 + 4 + 255
number 44
end
and I would like to match
4 + 5 + 22 + 1
and
2 + 4 + 255
My grammar is as follows:
grammar Hello;
hi : expr+ EOF;
expr : NUM (PLUS NUM)+;
PLUS : '+' ;
NUM : [0-9]+ ;
SPACE : [\n\r\t ]+ ->skip;
OTHER : [a-z]+ ;
My abstract syntax tree is visualized as follows:
Why does rule 'expr' match the text 'start'? I also get an error "extraneous input 'start' expecting NUM".
If I make the following change in my grammar
OTHER : [a-z]+ ->skip;
the error is gone. In addition, in the image above, the text '55 other text another text' matches the expression as a node in the AST. Why is this happening?
Does all of the above have to do with the way the lexer matches input? I know that the lexer looks for the first longest matching rule, but how can I change my grammar so as to match only the additions?
Why does rule 'expr' match the text 'start'?
It doesn't. When a token shows up red in the tree, that indicates an error. The token did not match any of the possible alternatives, so an error was produced and the parser continued with the next token.
In addition, in the image above, the text '55 other text another text' matches the expression as a node in the AST. Why is this happening?
After you skipped the OTHER tokens, your input basically looks like this:
4 + 5 + 22 + 1 55 2 + 4 + 255 44
4 + 5 + 22 + 1 can be parsed as an expression, no problem. After that the parser either expects a + (continuing the expression) or a number (starting a new expression). So when it sees 55, that indicates the start of a new expression. Now it expects a + (because the grammar says that PLUS NUM must appear at least once after the first number in an expression). What it actually gets is the number 2. So it produces an error and ignores that token. Then it sees a +, which is what it expected. And then it continues that way until the 44, which again starts a new expression. Since that isn't followed by a +, that's another error.
Does all of the above have to do with the way the lexer matches input?
Not really. The token sequence for "start 4 + 5" is OTHER NUM PLUS NUM, or just NUM PLUS NUM if you skip the OTHERs. The token sequence for "55 skippedtext 2 + 4" is NUM NUM PLUS NUM. I assume that's exactly what you'd expect.
Instead what seems to be confusing you is how ANTLR recovers from errors (or maybe that it recovers from errors).
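To answer the final question: one way to stop the recovery errors (a sketch, not the only option) is to let the top-level rule also accept lone numbers, so tokens like 55 and 44 are consumed instead of triggering recovery:

```antlr
grammar Hello;
hi    : ( expr | NUM )* EOF ;   // stray numbers are allowed and ignored
expr  : NUM (PLUS NUM)+ ;       // a real addition needs at least one '+'
PLUS  : '+' ;
NUM   : [0-9]+ ;
SPACE : [\n\r\t ]+ -> skip ;
OTHER : [a-z]+ -> skip ;
```

With OTHER skipped and lone NUMs tolerated, only the two addition chains end up as expr nodes in the tree.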

Bison terminating instead of shifting error

I have a grammar that works well except that it doesn't tolerate syntax errors. I'm trying to work in error tokens so that it can gracefully recover. I've read through the Bison manual on error recovery but something is not adding up.
Here's a snippet from the grammar:
%start start
%token WORD WORDB SP CRLF
%%
start : A B C
      | error CRLF start
      ;
A : WORD SP WORD CRLF
  ;
...
Here's a snippet of the output file that bison produces describing the grammar
State 0
0 $accept: . start $end
error shift, and go to state 1
WORD shift, and go to state 2
start go to state 3
A go to state 4
State 1
2 start: error . CRLF start
CRLF shift, and go to state 5
State 5
2 start: error CRLF . start
error shift, and go to state 1
WORD shift, and go to state 2
start go to state 25
A go to state 4
Given the input tokens WORDB CRLF WORD SP WORD CRLF ..... I would expect the state transitions to be 0 -> 1 -> 5 -> 2 -> ..., but when I run the parser it actually produces the following:
--(end of buffer or a NUL)
--accepting rule at line 49 ("WORDB")
Starting parse
Entering state 0
Reading a token: Next token is token WORDB ()
syntax error, unexpected WORDB, expecting WORD
As best I can tell, if the parser is in State 0 and it sees a token other than WORD it should interpret the token as if it was error and should go to State 1. In practice it is just hard failing.
The error transition does not suppress the call to yyerror(), so if your yyerror implementation does something like call exit(), error recovery will not be able to proceed.

How can I extract some data out of the middle of a noisy file using Perl 6?

I would like to do this using idiomatic Perl 6.
I found a wonderful contiguous chunk of data buried in a noisy output file.
I would like to simply print out the header line starting with Cluster Unique and all of the lines following it, up to, but not including, the first occurrence of an empty line. Here's what the file looks like:
</path/to/projects/projectname/ParameterSweep/1000.1.7.dir> was used as the working directory.
....
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
Input file: "../../filename.count.fa"
...
Here's what I want parsed out:
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
One-liner version
.say if /Cluster \s+ Unique/ ff^ /^\s*$/ for lines;
In English
Print every line from the input file, starting with the one containing the phrase Cluster Unique and ending just before the next empty line.
Same code with comments
.say # print the default variable $_
if # do the previous action (.say) "if" the following term is true
/Cluster \s+ Unique/ # Match $_ if it contains "Cluster Unique"
ff^ # Flip-flop operator: false until the term before it matches,
    # then true until the term after it matches; the trailing ^
    # excludes the line that ends the range
/^\s*$/ # Match $_ if it is an empty (or whitespace-only) line
for # Create a loop placing each element of the following list into $_
lines # Create a list of all of the lines in the file
; # End of statement
Expanded version
for lines() {
    .say if (
        $_ ~~ /Cluster \s+ Unique/ ff^ $_ ~~ /^\s*$/
    )
}
lines() is like <> in Perl 5. Each line from each file listed on the command line is read in one at a time. Since this is in a for loop, each line is placed in the default variable $_.
say is like print except that it also appends a newline. When written with a starting ., it acts directly on the default variable $_.
$_ is the default variable, which in this case contains one line from the file.
~~ is the match operator that is comparing $_ with a regular expression.
// Create a regular expression between the two forward slashes
\s+ matches one or more whitespace characters
ff is the flip-flop operator. It is false as long as the expression to its left is false. It becomes true when the expression to its left is evaluated as true. It becomes false when the expression to its right becomes true and is never evaluated as true again. In this case, if we used ^ff^ instead of ff^, then the header would not be included in the output.
When ^ comes before (or after) ff, it modifies ff so that it is also false the iteration that the expression to its left (or right) becomes true.
/^\s*$/ matches an empty (or whitespace-only) line
^ matches the beginning of a string
\s* matches zero or more whitespace characters
$ matches the end of a string
By the way, the flip-flop operator in Perl 5 is .. when it is in a scalar context (it's the range operator in list context). But its features are not quite as rich as in Perl 6, of course.
I would like to do this using idiomatic Perl 6.
In Perl, the idiomatic way to locate a chunk in a file is to read the file in paragraph mode, then stop reading the file when you find the chunk you are interested in. If you are reading a 10GB file, and the chunk is found at the top of the file, it's inefficient to continue reading the rest of the file--much less perform an if test on every line in the file.
In Perl 6, you can read a paragraph at a time like this:
my $fname = 'data.txt';
my $infile = open(
    $fname,
    nl => "\n\n", # Set what Perl considers the end of a line.
);                # Removed die() per Brad Gilbert's comment.
for $infile.lines() -> $para {
    if $para ~~ /^ 'Cluster Unique'/ {
        say $para.chomp;
        last; # Quit reading the file.
    }
}
$infile.close;
# ^ Match start of string.
# 'Cluster Unique' By default, whitespace is insignificant in a perl6 regex. Quotes are one way to make whitespace significant.
However, in Rakudo/MoarVM the open() function does not currently honor the nl argument, so you can't yet set paragraph mode.
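Until that is fixed, a workaround (a sketch; data.txt is the assumed input file) is to slurp the file and split on blank lines yourself:

```raku
my $fname = 'data.txt';
for $fname.IO.slurp.split(/\n\n+/) -> $para {
    if $para ~~ /^ 'Cluster Unique'/ {
        say $para.trim-trailing;
        last;  # Quit once the chunk is found.
    }
}
```

Unlike true paragraph mode, this reads the whole file into memory first, so it loses the early-exit efficiency for very large files.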
Also, there are certain idioms that are considered by some to be bad practice, like:
Postfix if statements, e.g. say 'hello' if $y == 0.
Relying on the implicit $_ variable in your code, e.g. .say
So, depending on what side of the fence you live on, that would be considered a bad practice in Perl.

How many bytes are stored at a memory address?

I am very confused with the following gdb output. I am debugging a program that processes a text file. The first word in the file is "The" and the gdb output looks as follows:
"The":
(gdb) p *(char*)0x7fffffff9d30
$12 = 84 'T'
(gdb) p *(char*)0x7fffffff9d34
$13 = 104 'h'
(gdb) p *(char*)0x7fffffff9d38
$14 = 101 'e'
A character is one byte, so when I increase the address of 'T' by one I should find 'h' there. But the address of 'h' is four bytes farther. What am I missing here?
I didn't realize that these are wchar_t (wide characters).
FWIW, in situations like this you might like to use the "x" command to dump memory. This avoids any possible confusion caused by types and operators.

How to make lex/flex recognize tokens not separated by whitespace?

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.
A simplified version of the code I have:
%{
#include <stdio.h>
%}
%option main warn debug
%%
if |
then |
else printf("keyword: %s\n", yytext);
[[:digit:]]+ printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]* printf("identifier: %s\n", yytext);
[[:space:]]+ // skip whitespace
[[:^space:]]+ { printf("ERROR: %s\n", yytext); exit(1); }
%%
When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:
number: 39
keyword: if
(I.e. the same as if I entered 39 if as the input.)
Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, so flex prefers it. That said, I have no idea how to resolve this situation. It seems infeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.
UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".
You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the parsing is at is to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways -- look in the manual for the ways to specify your own i/o functions, for example, -- but the all around simplest IMHO is to use a start condition:
%option yylineno
%x LEXING_ERROR
%%
  /* all your rules; the following *must* be at the end */
.                 { BEGIN(LEXING_ERROR); yyless(1); }
<LEXING_ERROR>.+  { fprintf(stderr,
                            "Invalid character '%c' found at line %d,"
                            " just before '%s'\n",
                            *yytext, yylineno, yytext+1);
                    exit(1);
                  }
Note: Make sure that you've ignored whitespace in your rules. The pattern .+ matches one or more non-newline characters, i.e. everything up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) backs up the read pointer by n characters, so after the . rule matches, the offending character is rescanned, producing (hopefully) a semi-reasonable error message. (It won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)
Look up start conditions in the flex manual for more info about %x and BEGIN
