Are there formatting rules to follow when using flex?

I don't get why, of two functionally identical source files, only one passes the compilation phase with flex while the other generates errors about the use of an undeclared identifier.
This one is OK (I don't usually use tabs in my editor; the indentation is all spaces):
        int num_lines = 0, num_chars = 0;
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main()
{
        yylex();
        printf( "# of lines = %d, # of chars = %d\n",
                num_lines, num_chars );
}
This one is not accepted by flex and doesn't generate anything but errors:
int num_lines = 0, num_chars = 0;
%%
\n ++num_lines; ++num_chars;
. ++num_chars;
%%
int main()
{
yylex();
printf( "# of lines = %d, # of chars = %d\n", num_lines, num_chars );
}
Do I have to follow some specific convention if I want to compile my scanner with flex?

Yes, there are formatting rules in lex/flex and you are violating them.
I'll summarise. There are three main sections in a lex/flex input program, separated by the %% delimiter in column one (at the start of a line). The last section is optional. The first section is for lexical declarations; in this section regular expressions can be named. The second section specifies the actions to be performed when patterns match, and the third (optional) section holds (C) code that is transcribed to the output file. It is used to define functions used in the action section.
The standard format for the first (lex declaration) section is:
name pattern
Where the name must start in column one (start of line) and the pattern is separated on the same line by white space.
The format for the second (action) section is similar:
pattern action
Where the pattern must start in column one (start of line) and the action is separated on the same line by white space. The pattern can be continued on more than one line, but must be indented by white space otherwise it will be interpreted as a new pattern.
The third section has no layout restrictions as the code is just skipped.
There is one final syntactic feature that is useful. In the first section, C code that should be copied to the output, rather than treated as a lexical definition, can be enclosed between %{ and %}, each at the start of a line. Further, in the first two sections any indented line is copied verbatim to the output; in the rules section such lines should come before the first rule.
Starting your file with an unindented declaration of variables in C violates these rules. If it starts at the left margin, it is treated as a lexical definition.
If you want to declare some variables in C which should be copied to the output, you can do it in the following manner:
%{
int num_lines = 0, num_chars = 0;
%}
%%
\n ++num_lines; ++num_chars;
. ++num_chars;
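Or alternatively, indented at the start of the rules section (note that declarations placed here become local to yylex(), so main() cannot see them):
%%
        int num_lines = 0, num_chars = 0;
\n ++num_lines; ++num_chars;
. ++num_chars;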
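To make the column-one rule concrete, here is a rough sketch in plain C of how a line in the first (definitions) section gets interpreted. This is a simplified model for illustration only, not flex's actual implementation; the enum and the function name classify_definition_line are made up for this example, and the sketch does not track whether it is currently inside a %{ ... %} block.

```c
#include <string.h>

/* Hypothetical categories for a line in the definitions section. */
enum line_kind {
    LINE_BLANK,
    LINE_SECTION_DELIMITER,  /* %%                                   */
    LINE_CODE_BLOCK_MARKER,  /* %{ or %}                             */
    LINE_COPIED_CODE,        /* indented: copied verbatim to output  */
    LINE_DIRECTIVE,          /* %option, %x, %s, ...                 */
    LINE_NAME_DEFINITION     /* "name pattern" starting in column 1  */
};

/* Simplified model: only the first characters of the line matter. */
enum line_kind classify_definition_line(const char *line)
{
    if (line[0] == '\0' || line[0] == '\n')
        return LINE_BLANK;
    if (strncmp(line, "%%", 2) == 0)
        return LINE_SECTION_DELIMITER;
    if (strncmp(line, "%{", 2) == 0 || strncmp(line, "%}", 2) == 0)
        return LINE_CODE_BLOCK_MARKER;
    if (line[0] == ' ' || line[0] == '\t')
        return LINE_COPIED_CODE;
    if (line[0] == '%')
        return LINE_DIRECTIVE;
    /* Anything else in column one is read as a name definition. */
    return LINE_NAME_DEFINITION;
}
```

Under this model the unindented line int num_lines = 0, num_chars = 0; is classified as a name definition (with int as the name), while the same line with leading spaces is classified as code to be copied, which matches the behaviour you observed.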

Related

Flex confusing to transform string character by character

I want to use flex to transform a string based on simple rules. I have rules like the first character stays the same and the second and third characters might change. Like if the second character was a letter, it becomes the number listed in the rules below. If the third is a digit, it becomes a certain letter.
%%
/*^[a-z] {char *yycopy = strdup( yytext ); unput(yycopy[0]);}*/
[ajs] {putchar('1');}
[bkt] {putchar('2');}
[clu] {putchar('3');}
[dmv] {putchar('4');}
[1] {putchar('j');}
[2] {putchar('k');}
[3] {putchar('l');}
[4] {putchar('m');} /*more number rules till 9*/
%%
int yywrap(void){return 1;}
int main( int argc, char **argv )
{
++argv, --argc; /* skip over program name */
if ( argc > 0 )
yyin = fopen( argv[0], "r" );
else
yyin = stdin;
while (yylex());
}
If there are different rules for characters in different positions within the string, how can I use start conditions to change a particular character (i.e. the rules for the second and third character are different).
You switch start condition by using the BEGIN action. Flex never automatically changes start condition, so when you need to return to the initial start condition (called INITIAL), you have to do so explicitly (BEGIN(INITIAL)).
You need to declare start condition names in the (f)lex prologue, usually with the %x command. (%s is also possible but with different semantics. See the Flex manual for details.)
You indicate that a start condition applies to a rule by starting the rule with a start condition name in angle brackets. You can put more than one start condition inside the angle brackets; separate them with commas and don't use spaces. Don't put a space after the angle brackets either; they are part of the pattern and (f)lex patterns cannot include unquoted space characters.
BEGIN is a macro and it does not require parentheses around the start condition name, but I suggest always using them anyway, so you don't have to worry about what the macro expands to. Start condition names are small integers (either enum constants or preprocessor macros) but nothing guarantees their value, so don't make assumptions.
That's about it. So you could implement your astro numerological codifier with:
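%x SECOND THIRD REST
%%
[a-z] ECHO; BEGIN(SECOND);
<SECOND>[ajs] putchar('1'); BEGIN(THIRD);
 /* More SECOND rules */
<THIRD>1 putchar('j'); BEGIN(REST);
 /* More THIRD rules */
<INITIAL,SECOND,THIRD>. ECHO; BEGIN(REST);
<REST>.+ ECHO;
<*>\n ECHO; BEGIN(INITIAL);
(The fallback rule for the first three positions matches a single character on purpose. Flex always prefers the longest match, so a fallback such as <*>.*\n? would match the whole line in every start condition and the position-specific rules would never fire. With equal-length matches, the earlier rule wins, which is why the fallback comes after the specific rules. I also deliberately left out the anchor in the first rule because these rules guarantee that the INITIAL start condition is only in force at the beginning of a line: the newline rule is the only one that returns to it. A final line without a newline is handled too, since the scanner simply reaches end-of-file in whatever start condition it is in.)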
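If it helps to see the position-dependent behaviour without flex in the way, here is a minimal plain-C model of what the start conditions are meant to achieve. It is only a sketch under assumptions: the names transform, map_letter and map_digit are invented for this example, the tables cover just the four letter groups and the digits 1 to 4 shown in the question, and out is assumed to be at least as large as in.

```c
#include <stddef.h>
#include <string.h>

/* One state per character position, mirroring the start conditions. */
enum state { FIRST, SECOND, THIRD, REST };

/* Second position: letters map to digits ([ajs] -> '1', [bkt] -> '2', ...). */
static char map_letter(char c)
{
    static const char *groups[] = { "ajs", "bkt", "clu", "dmv" };
    for (size_t i = 0; i < 4; ++i)
        if (c != '\0' && strchr(groups[i], c) != NULL)
            return (char)('1' + i);
    return c;                       /* unmapped: leave unchanged */
}

/* Third position: digits map to letters ('1' -> 'j', ..., '4' -> 'm'). */
static char map_digit(char c)
{
    if (c >= '1' && c <= '4')
        return (char)('j' + (c - '1'));
    return c;
}

/* Transforms every line of `in` into `out` (same length as `in`). */
void transform(const char *in, char *out)
{
    enum state st = FIRST;
    for (; *in != '\0'; ++in) {
        char c = *in;
        if (c == '\n') {            /* newline: back to the initial state */
            *out++ = c;
            st = FIRST;
            continue;
        }
        switch (st) {
        case FIRST:
            *out++ = c;
            st = (c >= 'a' && c <= 'z') ? SECOND : REST;
            break;
        case SECOND: {
            char m = map_letter(c);
            *out++ = m;
            st = (m != c) ? THIRD : REST;
            break;
        }
        case THIRD:
            *out++ = map_digit(c);
            st = REST;
            break;
        case REST:
            *out++ = c;
            break;
        }
    }
    *out = '\0';
}
```

Each case of the switch plays the role of one start condition, and each assignment to st plays the role of a BEGIN action. Nothing changes state implicitly, which is the same discipline flex imposes.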

Lex parsing to determine

This is an extension to a previous question. I'm trying to parse a .txt file and determine if each line is valid or invalid depending on my rules. The text files will contain an assortment of random strings, hex, integers and decimals separated by a single space, such as:
5 -0xA98F 0XA98H text hello 2.3 -12 0xabc
I'm trying to identify valid hex, integers and decimals and get an output like so.
5 valid
-0xA98F valid
0xA98H invalid
text invalid
hello invalid
2.3 valid
-12 valid
0xabc invalid
My current code however displays like so:
5 valid
-0xa98f valid
0xA98 valid <--- issue 1: just removes the H
2.3 valid <--- ignores text and hello
-12 valid
0xabc invalid
Here is the code I currently have:
%{
#include <iostream>
using namespace std;
%}
Decimal [+-]?[0-9]+\.[0-9]+?
Integers [+-]?[0-9]+
Hex [-]?[0][xX][0-9A-F]+
%%
[ \t\n] ;
{Decimal} {cout << yytext << "Valid" << endl; }
{Integers} { cout << yytext << "Valid" << endl; }
{Hex} {cout << yytext << " Valid" << endl;}
. ;
%%
main() {
FILE *myfile= fopen("something.txt", "r");
if (!myfile) {
cout << "Error" << endl;
return -1;
}
yyin = myfile;
yylex();
fclose(yyin);
}
The key to using flex for problems like this is understanding the "maximal-munch" rule. The rule is simple: Flex always picks the action corresponding to the pattern which matches the longest string (starting with the current input point; flex never "searches" for a match.) If more than one pattern matches the same longest substring, then the first pattern in the flex description is chosen. That means that the order of rules is important.
This is described at more length in the Flex manual section on How the Input is Matched.
So let's suppose that you are interested in matching complete words, where "words" are non-empty sequences of arbitrary non-whitespace characters separated by whitespace. (So, for example, in the line 3, 4 and 5. the word 3, is not a valid number, because the comma is part of the word.)
It's easy to identify the four possibilities:
Decimal integers
Decimal floating point
Hexadecimal integers
Any other word.
We also need to ignore whitespace, other than recognizing it as a word separator.
If we put the rules in that order, we can be confident that the correct rule will be chosen for each word, because of the maximal munch rule.
So here's the entire flex file (except for the definition of main):
%option noinput nounput noyywrap nodefault
%%
[[:space:]]+ { /* Ignore whitespace */ }
[+-]?[[:digit:]]+ { printf("%s valid\n", yytext); /* Decimal integer */ }
[+-]?[[:digit:]]+"."[[:digit:]]* { printf("%s valid\n", yytext); /* Decimal floating point */ }
[+-]?"."[[:digit:]]+ { printf("%s valid\n", yytext); /* Decimal floating point */ }
[+-]?0[xX][[:xdigit:]]+ { printf("%s valid\n", yytext); /* Hexadecimal integer */ }
[^[:space:]]+ { printf("%s invalid\n", yytext); /* Any word not matched by above rules */ }
Notes
I've used ordinary printf statements here. You're free to use C++ streams, of course, but I prefer to use either stdio.h or iostreams, but not both. It might be considered cleaner to #include <stdio.h>, but in fact Flex already does that because it needs it for its own purposes.
The %option statement tells flex that you don't need yywrap (which means you don't need to provide one or link with -lfl), that you don't use input or unput (which means you can compile with -Wall without getting unused function warnings) and that you don't expect flex to need to insert a default rule (which saves you from embarrassing errors, because flex will warn you if there is anything which might not match any rule.)
I used [[:xdigit:]]+ in the hexadecimal pattern, which allows both upper and lower-case hex digits. If that's not desired, you could replace it with [0-9A-F] as in your original code, but your examples seem to indicate that your original code was not correct. Of course, you could write out the posix character classes, but I find them more readable. See the Flex manual section on Patterns for a complete list.
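For comparison, here is a hand-written C sketch of the same whole-word check. It is not what flex generates; it only illustrates why the catch-all rule forces each number pattern to match the entire word. The names span and is_valid_word are made up for this example, and the sketch assumes the sign is optional for all three number forms and that lower-case hex digits are accepted (as with [[:xdigit:]]).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Length of the leading run of characters in `s` accepted by `pred`. */
static size_t span(const char *s, int (*pred)(int))
{
    size_t n = 0;
    while (s[n] != '\0' && pred((unsigned char)s[n]))
        ++n;
    return n;
}

/* True if the whole word is a decimal integer, a decimal floating
 * point number, or a hexadecimal integer. A partial match is not enough,
 * just as the longer catch-all match beats a shorter number match. */
bool is_valid_word(const char *w)
{
    if (*w == '+' || *w == '-')
        ++w;                                    /* optional sign */

    if (w[0] == '0' && (w[1] == 'x' || w[1] == 'X')) {
        size_t n = span(w + 2, isxdigit);       /* hex digits after 0x */
        if (n > 0 && w[2 + n] == '\0')
            return true;
    }

    size_t int_digits = span(w, isdigit);
    const char *p = w + int_digits;
    if (*p == '\0')
        return int_digits > 0;                  /* plain integer */
    if (*p != '.')
        return false;                           /* trailing junk */

    size_t frac_digits = span(p + 1, isdigit);
    if (p[1 + frac_digits] != '\0')
        return false;                           /* junk after fraction */
    return int_digits > 0 || frac_digits > 0;   /* "5.", ".5", "2.3" */
}
```

With this check 0XA98H is rejected as a whole rather than being split into a valid 0XA98 followed by a stray H, which is the effect the catch-all rule has in the flex version.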

Flex: How to define a term to be the first one at the beginning of a line(exclusively)

I need some help regarding a problem I face in my flex code.
My task: To write a flex code which recognizes the declaration part of a programming language, described below.
Let a programming language PL. Its variable definition part is described as follows:
At the beginning we have to start with the keyword "var". After writing this keyword we have to write the variable names(one or more) separated by commas ",". Then a colon ":" is inserted and after that we must write the variable type(say real, boolean, integer or char in my example) followed by a semicolon ";". After doing the previous steps there is the potentiality to declare into a new line new variables(variable names separated by commas "," followed by colon ":" followed by variable type followed by a semicolon ";"), but we must not use the "var" keyword again at the beginning of the new line( the "var" keyword is written once!!!)
E.g.
var number_of_attendants, sum: integer;
ticket_price: real;
symbols: char;
Concretely, I do not know how to make it possible to define that each and every declaration part must start only with the 'var' keyword. Until now, if I would begin a declaration part directly declaring a variable, say x (without having written "var" at the beginning of the line), then no error would occur(unwanted state).
My current flex code below:
%{
#include <stdio.h>
%}
VAR_DEFINER "var"
VAR_NAME [a-zA-Z][a-zA-Z0-9_]*
VAR_TYPE "real"|"boolean"|"integer"|"char"
SUBEXPRESSION [{VAR_NAME}[","{VAR_NAME}]*":"[ \t\n]*{VAR_TYPE}";"]+
EXPRESSION {VAR_DEFINER}{SUBEXPRESSION}
%%
^{EXPRESSION} {
printf("This is not a well-syntaxed expression!\n");
return 0;
}
{EXPRESSION} printf("This is a well-syntaxed expression!\n");
";"[ \t\n]*{VAR_DEFINER} {
printf("The keyword 'var' is defined once at the beginning of a new line. You can not use it again\n");
return 0;
}
{VAR_DEFINER} printf("A keyword: %s\n", yytext);
^{VAR_DEFINER} printf("Each and every declaration part must start with the 'var' keyword.\n");
{VAR_TYPE}";" printf("The variable type is: %s\n", yytext);
{VAR_NAME} printf("A variable name: %s\n", yytext);
","/[ \t\n]*{VAR_NAME} /* eat up commas */
":"/[ \t\n]*{VAR_TYPE}";" /* eat up single colon */
[ \t\n]+ /* eat up whitespace */
. {
printf("Unrecognized character: %s\n", yytext);
return 0;
}
%%
main(argc, argv)
int argc;
char** argv;
{
++argv, --argc;
if (argc > 0)
yyin = fopen(argv[0],"r");
else
yyin = stdin;
yylex();
}
I hope to have made it as much as possible clear.
I am looking forward to reading your answers!
You seem to be trying to do too much in the scanner. Do you really have to do everything in Flex? In other words, is this an exercise to learn advanced use of Flex, or is it a problem that may be solved using more appropriate tools?
I've read that the first Fortran compiler took 18 staff-years to create, back in the 1950s. Today, "a substantial compiler can be implemented even as a student project in a one-semester compiler design course", as the Dragon Book from 1986 says. One of the main reasons for this increased efficiency is that we have learned how to divide the compiler into modules that can be constructed separately. The first two such parts, or phases, of a typical compiler are the scanner and the parser.
The scanner, or lexical analyzer, can be generated by Flex from a specification file, or constructed otherwise. Its job is to read the input, which consists of a sequence of characters, and split it into a sequence of tokens. A token is the smallest meaningful part of the input language, such as a semicolon, the keyword var, the identifier number_of_attendants, or the operator <=. You should not use the scanner to do more than that.
Here is how I would write a simplified Flex specification for your tokens:
[ \t\n] { /* Ignore all whitespace */ }
var { return VAR; }
real { return REAL; }
boolean { return BOOLEAN; }
integer { return INTEGER; }
char { return CHAR; }
[a-zA-Z][a-zA-Z0-9_]* { return VAR_NAME; }
. { return yytext[0]; }
The sequence of tokens is then passed on to the parser, or syntactical analyzer. The parser compares the token sequence with the grammar for the language. For example, the input var number_of_attendants, sum : integer; consists of the keyword var, a comma-separated list of variables, a colon, a data type keyword, and a semicolon. If I understand what your input is supposed to look like, perhaps this grammar would be correct:
program : VAR typedecls ;
typedecls : typedecl | typedecls typedecl ;
typedecl : varlist ':' var_type ';' ;
varlist : VAR_NAME | varlist ',' VAR_NAME ;
var_type : REAL | BOOLEAN | INTEGER | CHAR ;
This grammar happens to be written in a format that Bison, a parser-generator that often is used together with Flex, can understand.
If you separate your solution into a lexical part, using Flex, and a grammar part, using Bison, your life is likely to be much simpler and happier.
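To show how little work each half has to do once they are separated, here is a small hand-written sketch in C of the same split: a tokenizer that only produces tokens, and a recursive-descent checker that follows the grammar. It stands in for what Flex and Bison would generate and is not their output; all names (next_token, accept, typedecl, is_valid_declaration_part, the T_ constants) are invented for this example.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

enum token { T_END, T_VAR, T_TYPE, T_NAME, T_COMMA, T_COLON, T_SEMI, T_BAD };

struct scanner { const char *p; enum token tok; };

/* Scanner: skips whitespace and classifies the next token, nothing more. */
static void next_token(struct scanner *s)
{
    static const char *types[] = { "real", "boolean", "integer", "char" };
    while (isspace((unsigned char)*s->p))
        ++s->p;
    const char *start = s->p;
    if (*start == '\0') { s->tok = T_END; return; }
    if (isalpha((unsigned char)*start)) {
        while (isalnum((unsigned char)*s->p) || *s->p == '_')
            ++s->p;
        size_t len = (size_t)(s->p - start);
        s->tok = T_NAME;
        if (len == 3 && strncmp(start, "var", 3) == 0)
            s->tok = T_VAR;
        for (size_t i = 0; i < 4; ++i)
            if (strlen(types[i]) == len && strncmp(start, types[i], len) == 0)
                s->tok = T_TYPE;
        return;
    }
    ++s->p;
    s->tok = *start == ',' ? T_COMMA
           : *start == ':' ? T_COLON
           : *start == ';' ? T_SEMI : T_BAD;
}

/* Consumes the current token if it is `t`. */
static bool accept(struct scanner *s, enum token t)
{
    if (s->tok != t)
        return false;
    next_token(s);
    return true;
}

/* typedecl : varlist ':' var_type ';' */
static bool typedecl(struct scanner *s)
{
    if (!accept(s, T_NAME))
        return false;
    while (accept(s, T_COMMA))
        if (!accept(s, T_NAME))
            return false;
    return accept(s, T_COLON) && accept(s, T_TYPE) && accept(s, T_SEMI);
}

/* program : VAR typedecls */
bool is_valid_declaration_part(const char *src)
{
    struct scanner s = { src, T_END };
    next_token(&s);
    if (!accept(&s, T_VAR) || !typedecl(&s))
        return false;
    while (s.tok != T_END)
        if (!typedecl(&s))
            return false;
    return true;
}
```

The rule that var must appear exactly once, at the start, falls out of the grammar: a missing var fails the first accept, and a second var fails inside typedecl because it is not a variable name. The scanner never needs to know about either case.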

Flex function unput(int c): is there a similar function in JFlex?

We know that in C Flex there is a function unput(int c) which can put the character c back onto the input stream, I wonder if there is a similar function in JFlex. Thx!
If we look at the specification of unput from a flex manual we can note its functionality:
unput(c) puts the character c back onto the input stream. It will be
the next character scanned. The following action will take the current
token and cause it to be rescanned enclosed in parentheses.
{
int i;
/* Copy yytext because unput() trashes yytext */
char *yycopy = strdup( yytext );
unput( ')' );
for ( i = yyleng - 1; i >= 0; --i )
unput( yycopy[i] );
unput( '(' );
free( yycopy );
}
Note that since each unput() puts the given character back at the beginning of the input stream, pushing back strings must be done
back-to-front.
According to the JFlex manual, there is no unput, but there is yypushback:
• void yypushback(int number)
pushes number characters of the matched text back into the input
stream. They will be read again in the next call of the scanning
method. The number of characters to be read again must not be greater
than the length of the matched text. The pushed back characters will
not be included in yylength() and yytext(). Note that in Java
strings are unchangeable, i.e. an action code like
String matched = yytext();
yypushback(1);
return matched;
will return the whole matched text, while
yypushback(1);
return yytext();
will return the matched text minus the last character.
Although they are not the same, many of the uses of unput can be achieved by using yypushback; however, you cannot put different characters into the input stream, which you could with unput. Note that flex has yyless(n), which serves a similar purpose to yypushback but takes the number of characters to keep rather than the number to push back.
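The back-to-front requirement is easier to see with a toy model. The following C sketch treats the pushed-back input as a stack: the last character pushed is the first one read again. It is not flex's buffer implementation; struct pushback, pb_unput, pb_next and push_parenthesized are hypothetical names for this illustration, and the buffer size of 64 is arbitrary.

```c
#include <stddef.h>
#include <string.h>

/* Toy pushback buffer: the most recently pushed character is read first. */
struct pushback {
    char   buf[64];
    size_t len;
};

/* Pushes one character back; returns -1 if the buffer is full. */
static int pb_unput(struct pushback *pb, char c)
{
    if (pb->len == sizeof pb->buf)
        return -1;
    pb->buf[pb->len++] = c;
    return 0;
}

/* Reads the next pushed-back character, or -1 if none are left. */
static int pb_next(struct pushback *pb)
{
    return pb->len > 0 ? (unsigned char)pb->buf[--pb->len] : -1;
}

/* Pushes "(" text ")" so that it is read back in normal order:
 * closing parenthesis first, then the text from its last character
 * to its first, then the opening parenthesis. */
void push_parenthesized(struct pushback *pb, const char *text)
{
    pb_unput(pb, ')');
    for (size_t i = strlen(text); i > 0; --i)
        pb_unput(pb, text[i - 1]);
    pb_unput(pb, '(');
}
```

Because yypushback only takes a count, it can model the "re-read some of what I just matched" part of this, but not the part where new characters such as the parentheses are injected.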

rule exclusion in flex

I am trying to write a flex file which recognizes (-! comment !-) as one token called comment. The following is my file:
%{
#include <stdio.h>
void showToken(char* name);
void error();
void enter();
int lineNum=1;
%}
%option yylineno
%option noyywrap
whitespace ([\t ])
enter ([\n])
startcomment (\(\-\!)
endcomment (\!\-\))
comment (^\!\-\))
%%
{startcomment}{comment}*{endcomment} showToken("COMMENT");
{enter} enter();
{whitespace}
. error();
%%
void showToken(char* name){
printf("%d %s %s\n", lineNum, name, yytext);
}
void enter(){
lineNum++;
}
void error(){
printf("%d error %s \n",lineNum,yytext);
}
but i fail for a simple (-! comment !-) input, this file does recognize the (-! and !-) but fails to recognize my comment rule. I did try replacing it with comment (^{endcomment}) but it did not work, any suggestions?
You seem to think that ^ means the following pattern should not match, but at the start of a pattern it means to match the start of a line. As the first character inside a character class, ^ does negate the class (it matches any character not listed), but outside a character class its meaning is totally different.
In answer to your question for an alternative. Your problem is similar to C-comment /* comment */. The following expression matches C-comment:
"/*"([^*]|"*"+[^/*])*"*"+"/"
Alternatively and more intuitive (if you like) you can use a sub-automaton:
%x comment
%%
"/*" { BEGIN(comment); }
<comment>(.|"\n") { /* Skip */ }
<comment>"*/" { BEGIN(INITIAL); }
%%
I'll leave it as an exercise to apply this to your comment style. Having !-) as the closing of your comment, makes the first solution a bit more complicated.
Note that in general the second solution is preferred because it does not require a big buffer. The first solution will create a buffer containing the complete comment (which can be big), whereas the second solution never needs to hold a token longer than two characters.
The easiest way to maintain line-numbers is using the %option yylineno as flex will then keep track of line-numbers in the variable int yylineno. Alternatively you can count the number of new-lines in yytext. In the second solution you can split the second rule and make a separate case for "\n" and count line-numbers there.
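As a hint for the exercise, here is the sub-automaton idea written out by hand in C for the (-! ... !-) comment style, including the newline counting. It is a sketch of the technique rather than a translation of any flex output; the function name strip_comments is made up, and out is assumed to be at least as large as in.

```c
#include <string.h>

/* Copies `in` to `out` with (-! ... !-) comments removed and returns the
 * number of newlines seen, including those inside comments. */
int strip_comments(const char *in, char *out)
{
    enum { NORMAL, COMMENT } state = NORMAL;    /* the two "start conditions" */
    int lines = 0;

    while (*in != '\0') {
        if (state == NORMAL && strncmp(in, "(-!", 3) == 0) {
            state = COMMENT;                    /* like BEGIN(comment) */
            in += 3;
        } else if (state == COMMENT && strncmp(in, "!-)", 3) == 0) {
            state = NORMAL;                     /* like BEGIN(INITIAL) */
            in += 3;
        } else {
            if (*in == '\n')
                ++lines;                        /* count lines in both states */
            if (state == NORMAL)
                *out++ = *in;                   /* comment text is skipped */
            ++in;
        }
    }
    *out = '\0';
    return lines;
}
```

Only three characters are ever inspected at a time, no matter how long the comment is, which is the buffer advantage of the second solution.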

Resources