Jflex Unexpected Character error - flex-lexer

I started studying jflex. When i try to generate output using jflex for the following code I keep getting an error
Error in file "\abc.flex" (line 29):
Unexpected character
[ \t\n]+ ;
^
1 error, 0 warnings.
Generation aborted.
Code trying to run
letter [a-zA-Z]
digit [0-9]
intlit [0-9]+
%{
#include <stdio.h>
# define BASTYPTOK 257 /*following are output from yacc*/
# define IDTOK 258 /*yacc assigns token numbers */
# define LITTOK 259
# define CINTOK 260
# define INSTREAMTOK 261
# define COUTTOK 262
# define OUTSTREAMTOK 263
# define WHILETOK 264
# define IFTOK 265
# define ADDOPTOK 266
# define MULOPTOK 267
# define RELOPTOK 268
# define NOTTOK 269
# define STRLITTOK 270
main() /*this replaces the main in the lex library*\
{ int p;
while (p= yylex())
printf("%d is \"%s\"\n", p, yytext);
/*yytext is where lex stores the lexeme*/}
%}
%%
[ \t\n]+ ;
"//".*"\n" ;
{intlit} {return(LITTOK);}
cin {return(CINTOK);}
"<<" {return(INSTREAMTOK);}
\<|"==" {return(RELOPTOK);}
\+|\-|"||" {return(ADDOPTOK);}
"=" {return(yytext[0]);}
"(" {return(yytext[0]);}
")" {return(yytext[0]);}
. {return (yytext[0]); /*default action*/}
%%
Can someone please help me figure out, what is causing the issue.
The pattern is also unindented properly.
thanks for your help.

That's valid flex input but it's not valid jflex. Since the included code is in C rather than Java, it's not clear to me why you would want to use jflex, but if your intent is to port the scanner to Java you might want to read the JFlex manual section on porting.
In particular, the sections in JFlex input are quite different from flex:
flex JFlex
definitions and declarations user code
%% %%
rules declarations
%% %%
user code definitions and rules
So your definitions and rules are in the correct section for a flex file, but not for a JFlex file. (JFlex just copies the first section to the output, so it doesn't recognize the various syntax errors resulting from putting flex declarations where JFlex expects valid user code.)
Also, JFlex definitions are of the form name = pattern rather than name pattern, so once you get the order of the file sorted out, you'll also need to add the equals signs. And. of course, rewrite the C code in Java.

Related

Why is my parser showing errors for valid programs?

I can't figure out what's wrong with my parser. Here are the associated files:
parse.y
declarations: INTEGER_SIZE IDENTIFIER TERMINATOR {declare($1,$2);}
void yyerror(char *err){
printf("\n\nYYError on line %d: Error = %s\n", yylineno, err);
}
scan.l
[Xx]+ {yylval.size = strlen(yytext);
When running it against the valid program below it shows an error at line 3; when running any of the lines individually it shows an error on line 1 via the yyerror() function.
BEGINING.
XXX XY-1.
XXXX Y.
XXXX Z.
BODY.
PRINT “Please enter a number? ”.
INPUT Y.
MOVE 15 TO Z.
ADD Y TO Z.
PRINT XY-1;” + “;Y;”=”;Z.
END.
To run the files run the following commands:
yacc -d parser.y
lex lexer.l
gcc -o parser lex.yy.c y.tab.c -ll
This non-terminal is called declarations, from which one might think that it matches one or more declarations, or perhaps zero or more declarations:
declarations: INTEGER_SIZE IDENTIFIER TERMINATOR {declare($1,$2);}
But the rule matches exactly three tokens, which is to say one declaration. So when you give it an input with two declarations, it fails on the second one.
Similarly, your non-terminal called statements only matches a single statement, not several as might be expected from its name.
Grammars need to be explicit. If you want to match several declarations, you have to write that:
declarations: declaration
| declarations declaration
By the way, I have seen before grammars written with the belief that you have to write {;} at the end of a production. I'm curious where this idea comes from. Yacc and bison do not require that productions have an action, and anyway an empty action is {}, just as it is in C.

Parse a block where each line starts with a specific symbol

I need to parse a block of code which looks like this:
* Block
| Line 1
| Line 2
| ...
It is easy to do:
block : head lines;
head : '*' line;
lines : lines '|' line
| '|' line
;
Now I wonder, how can I add nested blocks, e.g.:
* Block
| Line 1
| * Subblock
| | Line 1.1
| | ...
| Line 2
| ...
Can this be expressed as a LALR grammar?
I can, of course, parse the top-level blocks and than run my parser again to deal with each of these top-level blocks. However, I'm just learning this topic so it's interesting for me to avoid such approach.
The nested-block language is not context-free [Note 2], so it cannot be parsed with an LALR(k) parser.
However, nested parenthesis languages are, of course, context-free and it is relatively easy to transform the input into a parenthetic form by replacing the initial | sequences in the lexical scanner. The transformation is simple:
when the initial sequence of |s is longer than the previous line, insert an BEGIN_BLOCK. (The initial sequence must be exactly one | longer; otherwise it is presumably a syntax error.)
when the initial sequence of |s, is shorter then the previous line, enough END_BLOCKs are inserted to bring the expected length to the correct value.
The |s themselves are not passed through to the parser.
This is very similar to the INDENT/DEDENT strategy used to parse layout-aware languages like Python an Haskell. The main difference is that here we don't need a stack of indent levels.
Once that transformation is finished, the grammar will look something like:
content: /* empty */
| content line
| content block
block : head BEGIN_BLOCK content END_BLOCK
| head
head : '*' line
A rough outline of a flex implementation would be something like this: (see Note 1, below).
%x INDENT CONTENT
%%
static int depth = 0, new_depth = 0;
/* Handle pending END_BLOCKs */
send_end:
if (new_depth < depth) {
--depth;
return END_BLOCK;
}
^"|"[[:blank:]]* { new_depth = 1; BEGIN(INDENT); }
^. { new_depth = 0; yyless(0); BEGIN(CONTENT);
goto send_end; }
^\n /* Ignore blank lines */
<INDENT>{
"|"[[:blank:]]* ++new_depth;
. { yyless(0); BEGIN(CONTENT);
if (new_depth > depth) {
++depth;
if (new_depth > depth) { /* Report syntax error */ }
return BEGIN_BLOCK;
} else goto send_end;
}
\n BEGIN(INITIAL); /* Maybe you care about this blank line? */
}
/* Put whatever you use here to lexically scan the lines */
<CONTENT>{
\n BEGIN(INITIAL);
}
Notes:
Not everyone will be happy with the goto but it saves some code-duplication. The fact that the state variable (depth and new_depth) are local static variables makes the lexer non-reentrant and non-restartable (after an error). That's only useful for toy code; for anything real, you should make the lexical scanner re-entrant and put the state variables into the extra data structure.
The terms "context-free" and "context-sensitive" are technical descriptions of grammars, and are therefore a bit misleading. Intuitions based on what the words seem to mean are often wrong. One very common source of context-sensitivity is a language where validity depends on two different derivations of the same non-terminal producing the same token sequence. (Assuming the non-terminal could derive more than one token sequence; otherwise, the non-terminal could be eliminated.)
There are lots of examples of such context-sensitivity in normal programming languages; usually, the grammar will allow these constructs and the check will be performed later in some semantic analysis phase. These include the requirement that an identifier be declared (two derivations of IDENTIFIER produce the same string) or the requirement that a function be called with the correct number of parameters (here, it is only necessary that the length of the derivations of the non-terminals match, but that is sufficient to trigger context-sensitivity).
In this case, the requirement is that two instances of what might be called bar-prefix in consecutive lines produce the same string of |s. In this case, since the effect is really syntactic, deferring to a later semantic analysis defeats the point of parsing. Whether the other examples of context-sensitivity are "syntactic" or "semantic" is a debate which produces a surprising amount of heat without casting much light on the discussion.
If you write an explicit end-of-block token, things become clearer:
*Block{
|Line 1
*SubBlock{
| line 1.1
| line 1.2
}
|Line 2
|...
}
and grammar becomes:
block : '*' ID '{' lines '}'
lines : lines '|' line
| lines block
|

How can I extract some data out of the middle of a noisy file using Perl 6?

I would like to do this using idiomatic Perl 6.
I found a wonderful contiguous chunk of data buried in a noisy output file.
I would like to simply print out the header line starting with Cluster Unique and all of the lines following it, up to, but not including, the first occurrence of an empty line. Here's what the file looks like:
</path/to/projects/projectname/ParameterSweep/1000.1.7.dir> was used as the working directory.
....
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
Input file: "../../filename.count.fa"
...
Here's what I want parsed out:
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
One-liner version
.say if /Cluster \s+ Unique/ ff^ /^\s*$/ for lines;
In English
Print every line from the input file starting with the once containing the phrase Cluster Unique and ending just before the next empty line.
Same code with comments
.say # print the default variable $_
if # do the previous action (.say) "if" the following term is true
/Cluster \s+ Unique/ # Match $_ if it contains "Cluster Unique"
ff^ # Flip-flop operator: true until preceding term becomes true
# false once the term after it becomes true
/^\s*$/ # Match $_ if it contains an empty line
for # Create a loop placing each element of the following list into $_
lines # Create a list of all of the lines in the file
; # End of statement
Expanded version
for lines() {
.say if (
$_ ~~ /Cluster \s+ Unique/ ff^ $_ ~~ /^\s*$/
)
}
lines() is like <> in perl5. Each line from each file listed on the command line is read in one at a time. Since this is in a for loop, each line is placed in the default variable $_.
say is like print except that it also appends a newline. When written with a starting ., it acts directly on the default variable $_.
$_ is the default variable, which in this case contains one line from the file.
~~ is the match operator that is comparing $_ with a regular expression.
// Create a regular expression between the two forward slashes
\s+ matches one or more spaces
ff is the flip-flop operator. It is false as long as the expression to its left is false. It becomes true when the expression to its left is evaluated as true. It becomes false when the expression to its right becomes true and is never evaluated as true again. In this case, if we used ^ff^ instead of ff^, then the header would not be included in the output.
When ^ comes before (or after) ff, it modifies ff so that it is also false the iteration that the expression to its left (or right) becomes true.
/^\*$/ matches an empty line
^ matches the beginning of a string
\s* matches zero or more spaces
$ matches the end of a string
By the way, the flip-flop operator in Perl 5 is .. when it is in a scalar context (it's the range operator in list context). But its features are not quite as rich as in Perl 6, of course.
I would like to do this using idiomatic Perl 6.
In Perl, the idiomatic way to locate a chunk in a file is to read the file in paragraph mode, then stop reading the file when you find the chunk you are interested in. If you are reading a 10GB file, and the chunk is found at the top of the file, it's inefficient to continue reading the rest of the file--much less perform an if test on every line in the file.
In Perl 6, you can read a paragraph at a time like this:
my $fname = 'data.txt';
my $infile = open(
$fname,
nl => "\n\n", #Set what perl considers the end of a line.
); #Removed die() per Brad Gilbert's comment.
for $infile.lines() -> $para {
if $para ~~ /^ 'Cluster Unique'/ {
say $para.chomp;
last; #Quit reading the file.
}
}
$infile.close;
# ^ Match start of string.
# 'Cluster Unique' By default, whitespace is insignificant in a perl6 regex. Quotes are one way to make whitespace significant.
However, in perl6 rakudo/moarVM the open() function does not read the nl argument correctly, so you currently can't set paragraph mode.
Also, there are certain idioms that are considered by some to be bad practice, like:
Postfix if statements, e.g. say 'hello' if $y == 0.
Relying on the implicit $_ variable in your code, e.g. .say
So, depending on what side of the fence you live on, that would be considered a bad practice in Perl.

How to make lex/flex recognize tokens not separated by whitespace?

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.
A simplified version of the code I have:
%{
#include <stdio.h>
%}
%option main warn debug
%%
if |
then |
else printf("keyword: %s\n", yytext);
[[:digit:]]+ printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]* printf("identifier: %s\n", yytext);
[[:space:]]+ // skip whitespace
[[:^space:]]+ { printf("ERROR: %s\n", yytext); exit(1); }
%%
When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:
number: 39
keyword: if
(I.e. the same as if I entered 39 if as the input.)
Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.
UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".
You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the parsing is at is to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways -- look in the manual for the ways to specify your own i/o functions, for example, -- but the all around simplest IMHO is to use a start condition:
%x LEXING_ERROR
%%
// all your rules; the following *must* be at the end
. { BEGIN(LEXING_ERROR); yyless(1); }
<LEXING_ERROR>.+ { fprintf(stderr,
"Invalid character '%c' found at line %d,"
" just before '%s'\n",
*yytext, yylineno, yytext+1);
exit(1);
}
Note: Make sure that you've ignored whitespace in your rules. The pattern .+ matches any number but at least one non-newline character, or in other words up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) backs up the read pointer by n characters, so after the . rule matches, it will rescan that character producing (hopefully) a semi-reasonable error message. (It won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)
Look up start conditions in the flex manual for more info about %x and BEGIN

Bison grammar warnings

I am writing a parser with Bison and I am getting the following warnings.
fol.y:42 parser name defined to default :"parse"
fol.y:61: warning: type clash ('' 'pred') on default action
I have been using Google to search for a way to get rid of them, but have pretty much come up empty handed on what they mean (much less how to fix them) since every post I found with them has a compilation error and the warnings them selves aren't addressed. Could someone tell me what they mean and how to fix them? The relevant code is below. Line 61 is the last semicolon. I cut out the rest of the grammar since it is incredibly verbose.
%union {
char* var;
char* name;
char* pred;
}
%token <var> VARIABLE
%token <name> NAME
%token <pred> PRED
%%
fol:
declines clauses {cout << "Done parsing with file" << endl;}
;
declines:
declines decline
|decline
;
decline:
PRED decs
;
The first message is likely just a warning that you didn't include %start parse in the grammar specification.
The second means that somewhere you have rule that is supposed to return a value but you haven't properly specified which type of value it is to return. The PRED returns the pred element of your union; the problem might be that you've not created %type entries for decline and declines. If you have a union, you have to specify the type for most, if not all, rules — or maybe just rules that don't have an explicit action (so as to override the default $$ = $1; action).
I'm not convinced that the problem is in the line you specify, and because we don't have a complete, minimal reproduction of your problem, we can't investigate for you to validate it. The specification for decs may be relevant (I'm not convinced it is, but it might be).
You may get more information from the output of bison -v, which is the y.output file (or something similar).
Finally found it.
To fix this:
fol.y:42 parser name defined to default :"parse"
Add %name parse before %token
Eg:
%name parse
%token NUM
(From: https://bdhacker.wordpress.com/2012/05/05/flex-bison-in-ubuntu/#comment-2669)

Resources