Flex: How to define a term to be the first one at the beginning of a line (exclusively)

I need some help regarding a problem I face in my flex code.
My task is to write flex code that recognizes the declaration part of a programming language, described below.
Consider a programming language PL. Its variable definition part is described as follows:
A declaration part starts with the keyword "var". After the keyword come one or more variable names separated by commas ",", then a colon ":", then the variable type (real, boolean, integer or char in my example), followed by a semicolon ";". Further variables may then be declared on new lines (variable names separated by commas ",", followed by a colon ":", the variable type and a semicolon ";"), but the "var" keyword must not be repeated at the beginning of a new line (the "var" keyword is written only once).
E.g.
var number_of_attendants, sum: integer;
ticket_price: real;
symbols: char;
Concretely, I do not know how to require that every declaration part starts with the 'var' keyword. At the moment, if I begin a declaration part by directly declaring a variable, say x (without "var" at the beginning of the line), no error is reported, which is not what I want.
My current flex code below:
%{
#include <stdio.h>
%}
VAR_DEFINER "var"
VAR_NAME [a-zA-Z][a-zA-Z0-9_]*
VAR_TYPE "real"|"boolean"|"integer"|"char"
SUBEXPRESSION [{VAR_NAME}[","{VAR_NAME}]*":"[ \t\n]*{VAR_TYPE}";"]+
EXPRESSION {VAR_DEFINER}{SUBEXPRESSION}
%%
^{EXPRESSION} {
printf("This is not a well-syntaxed expression!\n");
return 0;
}
{EXPRESSION} printf("This is a well-syntaxed expression!\n");
";"[ \t\n]*{VAR_DEFINER} {
printf("The keyword 'var' is defined once at the beginning of a new line. You can not use it again\n");
return 0;
}
{VAR_DEFINER} printf("A keyword: %s\n", yytext);
^{VAR_DEFINER} printf("Each and every declaration part must start with the 'var' keyword.\n");
{VAR_TYPE}";" printf("The variable type is: %s\n", yytext);
{VAR_NAME} printf("A variable name: %s\n", yytext);
","/[ \t\n]*{VAR_NAME} /* eat up commas */
":"/[ \t\n]*{VAR_TYPE}";" /* eat up single colon */
[ \t\n]+ /* eat up whitespace */
. {
printf("Unrecognized character: %s\n", yytext);
return 0;
}
%%
main(argc, argv)
int argc;
char** argv;
{
++argv, --argc;
if (argc > 0)
yyin = fopen(argv[0],"r");
else
yyin = stdin;
yylex();
}
I hope I have made this as clear as possible.
I am looking forward to reading your answers!

You seem to be trying to do too much in the scanner. Do you really have to do everything in Flex? In other words, is this an exercise to learn advanced use of Flex, or is it a problem that may be solved using more appropriate tools?
I've read that the first Fortran compiler took 18 staff-years to create, back in the 1950s. Today, "a substantial compiler can be implemented even as a student project in a one-semester compiler design course", as the Dragon Book from 1986 says. One of the main reasons for this increased efficiency is that we have learned how to divide the compiler into modules that can be constructed separately. The first two such parts, or phases, of a typical compiler are the scanner and the parser.
The scanner, or lexical analyzer, can be generated by Flex from a specification file, or constructed otherwise. Its job is to read the input, which consists of a sequence of characters, and split it into a sequence of tokens. A token is the smallest meaningful part of the input language, such as a semicolon, the keyword var, the identifier number_of_attendants, or the operator <=. You should not use the scanner to do more than that.
Here is how I would write a simplified Flex specification for your tokens:
[ \t\n] { /* Ignore all whitespace */ }
var { return VAR; }
real { return REAL; }
boolean { return BOOLEAN; }
integer { return INTEGER; }
char { return CHAR; }
[a-zA-Z][a-zA-Z0-9_]* { return VAR_NAME; }
. { return yytext[0]; }
The sequence of tokens is then passed on to the parser, or syntactical analyzer. The parser compares the token sequence with the grammar for the language. For example, the input var number_of_attendants, sum : integer; consists of the keyword var, a comma-separated list of variables, a colon, a data type keyword, and a semicolon. If I understand what your input is supposed to look like, perhaps this grammar would be correct:
program : VAR typedecls ;
typedecls : typedecl | typedecls typedecl ;
typedecl : varlist ':' var_type ';' ;
varlist : VAR_NAME | varlist ',' VAR_NAME ;
var_type : REAL | BOOLEAN | INTEGER | CHAR ;
This grammar happens to be written in a format that Bison, a parser-generator that often is used together with Flex, can understand.
If you separate your solution into a lexical part, using Flex, and a grammar part, using Bison, your life is likely to be much simpler and happier.
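To make the division of labour concrete without generating anything, here is a minimal hand-written sketch in C of the same idea: a tiny scanner that only produces tokens, and a parser that only checks the grammar above. The names next_token, accept and check_declarations are made up for this illustration; they are not what Flex or Bison would generate, and error reporting is reduced to a 0/1 result.

```c
#include <ctype.h>
#include <string.h>

/* Token kinds; single punctuation characters stand for themselves,
 * mirroring the `. { return yytext[0]; }` rule above. */
enum { T_EOF = 0, T_VAR = 256, T_TYPE, T_NAME };

static const char *src;   /* cursor into the input being checked */
static int tok;           /* current lookahead token */

/* Hand-written stand-in for the Flex scanner: returns the next token. */
static int next_token(void)
{
    static const char *types[] = { "real", "boolean", "integer", "char" };

    while (isspace((unsigned char)*src)) src++;
    if (*src == '\0') return T_EOF;
    if (isalpha((unsigned char)*src)) {
        const char *start = src;
        while (isalnum((unsigned char)*src) || *src == '_') src++;
        size_t n = (size_t)(src - start);
        if (n == 3 && strncmp(start, "var", 3) == 0) return T_VAR;
        for (size_t i = 0; i < 4; i++)
            if (strlen(types[i]) == n && strncmp(start, types[i], n) == 0)
                return T_TYPE;
        return T_NAME;
    }
    return (unsigned char)*src++;
}

/* Consume the lookahead if it has the expected kind. */
static int accept(int kind)
{
    if (tok != kind) return 0;
    tok = next_token();
    return 1;
}

/* varlist : VAR_NAME | varlist ',' VAR_NAME ; */
static int varlist(void)
{
    if (!accept(T_NAME)) return 0;
    while (accept(','))
        if (!accept(T_NAME)) return 0;
    return 1;
}

/* typedecl : varlist ':' var_type ';' ; */
static int typedecl(void)
{
    return varlist() && accept(':') && accept(T_TYPE) && accept(';');
}

/* program : VAR typedecls ;  Returns 1 for a valid declaration part. */
int check_declarations(const char *text)
{
    src = text;
    tok = next_token();
    if (!accept(T_VAR)) return 0;   /* "var" must come first, exactly once */
    do {
        if (!typedecl()) return 0;
    } while (tok != T_EOF);
    return 1;
}
```

With this sketch, the three-line example from the question is accepted, while input that starts directly with x: integer; or that repeats var on a second line is rejected. Neither check needs the ^ anchor or any lookahead tricks in the scanner.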

Related

why my lexical analyzer can not recognize numbers and ids and operators

My lexical analyzer in flex cannot recognize numbers, ids and operators; only keywords are recognized. Where is my mistake? This is my code:
%{
#include<stdio.h>
%}
Nums [0-9]
LowerCase [a-z]
UpperCase [A-Z]
Letters LowerCase|UpperCase|[_]
Id {Letters}({Letters}|{Nums})*
operators +|-|\|*
%%
"if" {printf("if keyword founded \n");}
"then" {printf("then keyword founded \n");}
"else" {printf("else keyword founded \n");}
Operators {printf(" operator founded \n");}
Id {printf(" id founded ");}
%%
int main (void)
{ yylex(); return(0);}
int yywrap(void)
{ return 1;}
The pattern Operators is equivalent to "Operators", so it only matches that single word. If you meant to expand the macro by that name, the syntax is {Operators}. (Actually, {operators} since you seem to have inconsistently spelled the macro name in all lower-case.)
If you do that, flex will complain because of the syntax error in that macro. (Syntax errors in macros aren't detected unless the macro is expanded. That's just one of the problems with using macros.)
You have different problems with your other macros. For example, Nums doesn't appear in any rule at all.
My suggestion would be to use fewer (or no) macros and more character classes. Eg.:
[[:alpha:]_][[:alnum:]_]* { /* Action for identifier. */ }
[[:digit:]]+ { /* Action for number. */ }
[-+*/] { /* Action for operator. */ }
Please read the Patterns section in the flex manual for a full description of the pattern syntax, including the named character class expressions used in the first two patterns above.
To use a named definition, it must be enclosed in {}. So your Letters rule should be
Letters {LowerCase}|{UpperCase}|[_]
... as it is, it matches the literal inputs LowerCase and UpperCase. Similarly in your rules, you want
{Operators} ...
{Id} ...
as what you have will match the literal input strings Operators and Id.

How to use "literal string tokens" in Bison

I am learning Flex/Bison. The manual of Bison says:
A literal string token is written like a C string constant; for
example, "<=" is a literal string token. A literal string token
doesn’t need to be declared unless you need to specify its semantic
value data type
But I cannot figure out how to use it, and I cannot find an example.
I have the following code for testing:
example.l
%option noyywrap nodefault
%{
#include "example.tab.h"
%}
%%
[ \t\n] {;}
[0-9] { return NUMBER; }
. { return yytext[0]; }
%%
example.y
%{
#include <stdio.h>
#define YYSTYPE char *
%}
%token NUMBER
%%
start: %empty | start tokens
tokens:
NUMBER "<=" NUMBER { printf("<="); }
| NUMBER "=>" NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
%%
main(int argc, char **argv) {
yyparse();
}
yyerror(char *s) {
fprintf(stderr, "error: %s\n", s);
}
Makefile
#!/usr/bin/make
# by RAM
all: example
example.tab.c example.tab.h: example.y
bison -d $<
lex.yy.c: example.l example.tab.h
flex $<
example: lex.yy.c example.tab.c
cc -o $@ example.tab.c lex.yy.c -lfl
clean:
rm -fr example.tab.c example.tab.h lex.yy.c example
And when I run it:
$ ./example
3<4
<
6>9
>
6=>9
error: syntax error
Any idea?
UPDATE: I want to clarify that I know alternative ways to solve it, but I want to use literal string tokens.
One Alternative: using multiple "literal character tokens":
tokens:
NUMBER '<' '=' NUMBER { printf("<="); }
| NUMBER '=' '>' NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
When I run it:
$ ./example
3<=9
<=
Other alternative:
In example.l:
"<=" { return LE; }
"=>" { return GE; }
In example.y:
...
%token NUMBER
%token LE "<="
%token GE "=>"
%%
start: %empty | start tokens
tokens:
NUMBER "<=" NUMBER { printf("<="); }
| NUMBER "=>" NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
...
When I run it:
$ ./example
3<=4
<=
But the manual says:
A literal string token doesn’t need to be declared unless you need to
specify its semantic value data type
The quoted manual paragraph is correct, but you need to read the next paragraph, too:
You can associate the literal string token with a symbolic name as an alias, using the %token declaration (see Token Declarations). If you don’t do that, the lexical analyzer has to retrieve the token number for the literal string token from the yytname table.
So you don't need to declare the literal string token, but you still need to arrange for the lexer to send the correct token number, and if you don't declare an associated token name, the only way to find the correct value is to search for the code in the yytname table.
In short, your last example where you define LE and GE as aliases, is by far the most common approach. Separating the tokens into individual characters is not a good idea; it might create shift-reduce conflicts and it will definitely allow invalid inputs, such as putting whitespace between the characters.
If you want to try the yytname solution, there is sample code in the bison manual. But please be aware that this code discovers bison's internal token number, which is not the number which needs to be returned from the scanner. There is no way to get the external token number which is easy, portable and documented; the easy and undocumented way is to look the token number up in yytoknum but since that array is not documented and conditional on a preprocessor macro, there is no guarantee that it will work. Note also that these tables are declared static so the function(s) which rely on them must be included in the bison input file. (Of course, these functions can have external linkage so that they can be called from the lexer. But you can't just use yytname directly in the lexer.)
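For illustration only, here is a self-contained sketch of what such a table search looks like. The table demo_tname and the function find_literal_token are stand-ins invented for this example; in a real parser the table is bison's static yytname, whose exact contents depend on the grammar. The sketch relies on literal string tokens being stored with their double quotes, so the search compares against the quoted spelling.

```c
#include <string.h>

/* Stand-in for bison's generated yytname table (an assumption made for
 * this sketch; the real table is static inside the generated parser). */
static const char *const demo_tname[] = {
    "$end", "error", "$undefined", "NUMBER", "\"<=\"", "\"=>\"", "'>'", "'<'"
};
#define DEMO_NTOKENS ((int)(sizeof demo_tname / sizeof demo_tname[0]))

/* Return the table index of the literal string token spelled `text`
 * (given without quotes), or -1 if there is no such entry. */
int find_literal_token(const char *text)
{
    size_t len = strlen(text);
    for (int i = 0; i < DEMO_NTOKENS; i++) {
        const char *name = demo_tname[i];
        if (name[0] == '"'
            && strncmp(name + 1, text, len) == 0
            && name[len + 1] == '"'
            && name[len + 2] == '\0')
            return i;
    }
    return -1;
}
```

With this stand-in table, find_literal_token("<=") yields 4 and find_literal_token("<") yields -1, because '<' is a character token rather than a string token. As noted above, such an index is the internal symbol number, so it still has to be translated before the scanner can return it.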
I haven't used flex/bison for a while but two things:
. as far as I remember only matches a single character. yytext is a pointer to a null terminated string char* so yytext[0] is a char which means that you can't match strings this way. You probably need to change it to return yytext. Otherwise . will probably create a token PER character and you'd probably have to write NUMBER '<' '=' NUMBER.

Is it legitimate for a tokenizer to have a stack?

I have designed a new language for which I want to write a reasonable lexer and parser.
For the sake of brevity, I have reduced this language to a minimum so that my questions are still open.
The language has implicit and explicit strings, arrays and objects. An implicit string is just a sequence of characters that does not contain <, {, [ or ]. An explicit string looks like <salt<text>salt> where salt is an arbitrary identifier (i.e. [a-zA-Z][a-zA-Z0-9]*) and text is an arbitrary sequence of characters that does not contain the salt.
An array starts with [, followed by objects and/or strings and ends with ].
All characters within an array that don't belong to an array, object or explicit string do belong to an implicit string and the length of each implicit string is maximal and greater than 0.
An object starts with { and ends with } and consists of properties. A property starts with an identifier, followed by a colon, then optional whitespaces and then either an explicit string, array or object.
So the string [ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ] represents an array with 6 items: " name:test ", "<html>[]</html>", " ", { name: "test" }, "bla" and " " (the object is notated in json).
As one can see, this language is not context free due to the explicit string (that I don't want to miss). However, the syntax tree is nonambiguous.
So my question is: Is a property a token that may be returned by a tokenizer? Or should the tokenizer return T_identifier, T_colon when it reads an object property?
The real language allows even prefixes in the identifier of a property, e.g. ns/name:<a<test>a> where ns is the prefix for a namespace.
Should the tokenizer return T_property_prefix("ns"), T_property_prefix_separator, T_property_name("name"), T_property_colon or just T_property("ns/name") or even T_identifier("ns"), T_slash, T_identifier("name"), T_colon?
If the tokenizer should recognize properties (which would be useful for syntax highlighters), it must have a stack, because name: is not a property if it is in an array. To decide whether bla: in [{foo:{bar:[test:baz]} bla:{}}] is a property or just an implicit string, the tokenizer must track when it enters and leaves an object or array.
Thus, the tokenizer would not be a finite state machine any more.
Or does it make sense to have two tokenizers - the first, which separates whitespaces from alpha-numerical character sequences and special characters like : or [, the second, which uses the first to build more semantical tokens? The parser could then operate on top of the second tokenizer.
Anyways, the tokenizer must have an infinite lookahead to see when an explicit string ends. Or should the detection of the end of an explicit string happen inside the parser?
Or should I use a parser generator for my undertaking? Since my language is not context free, I don't think there is an appropriate parser generator.
Thanks in advance for your answers!
flex can be requested to provide a context stack, and many flex scanners use this feature. So, while it may not fit with a purist view of how a scanner scans, it is a perfectly acceptable and supported feature. See this chapter of the flex manual for details on how to have different lexical contexts (called "start conditions"); at the very end is a brief description of the context stack. (Don't miss the sentence which notes that you need %option stack to enable the stack.) [See Note 1]
Slightly trickier is the requirement to match strings with variable end markers. flex does not have any variable match feature, but it does allow you to read one character at a time from the scanner input, using the function input(). That's sufficient for your language (at least as described).
Here's a rough outline of a possible scanner:
%option stack
%x SC_OBJECT
%%
/* initial/array context only */
[^][{<]+ yylval = strdup(yytext); return STRING;
/* object context only */
<SC_OBJECT>{
[}] yy_pop_state(); return '}';
[[:alpha:]][[:alnum:]]* yylval = strdup(yytext); return ID;
[:/] return yytext[0];
[[:space:]]+ /* Ignore whitespace */
}
/* either context */
<*>{
[][] return yytext[0]; /* char class with [] */
[{] yy_push_state(SC_OBJECT); return '{';
"<"[[:alpha:]][[:alnum:]]*"<" {
/* We need to save a copy of the salt because yytext could
* be invalidated by input().
*/
char* salt = strdup(yytext);
char* saltend = salt + yyleng;
char* match = salt;
/* The string accumulator code is *not* intended
* to be a model for how to write string accumulators.
*/
yylval = NULL;
size_t length = 0;
/* change salt to what we're looking for */
*salt = *(saltend - 1) = '>';
while (match != saltend) {
int ch = input();
if (ch == EOF) {
yyerror("Unexpected EOF");
/* free the temps and do something */
}
if (ch == *match) ++match;
else if (ch == '>') match = salt + 1;
else match = salt;
/* Don't do this in real code */
yylval = realloc(yylval, ++length);
yylval[length - 1] = ch;
}
/* Get rid of the terminator */
yylval[length - yyleng] = 0;
free(salt);
return STRING;
}
. yyerror("Invalid character in object");
}
I didn't test that thoroughly, but here is what it looks like with your example input:
[ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ]
Token: [
Token: STRING: -- name:test --
Token: STRING: --<html>[]</html>--
Token: STRING: -- --
Token: {
Token: ID: --name--
Token: :
Token: STRING: --test--
Token: }
Token: STRING: --bla--
Token: STRING: -- --
Token: ]
Notes
In your case, unless you wanted to avoid having a parser, you don't actually need a stack since the only thing that needs to be pushed onto the stack is an object context, and a stack with only one possible value can be replaced with a counter.
Consequently, you could just remove the %option stack and define a counter at the top of the scan. Instead of pushing the start condition, you increment the counter and set the start condition; instead of popping, you decrement the counter and reset the start condition if it drops to 0.
%%
/* Indented text before the first rule is inserted at the top of yylex */
int object_count = 0;
<*>[{] ++object_count; BEGIN(SC_OBJECT); return '{';
<SC_OBJECT>[}] if (!--object_count) BEGIN(INITIAL); return '}';
Reading the input one character at a time is not the most efficient approach. Since in your case a string terminator must start with >, it would probably be better to define a separate "explicit string" context, in which you recognize [^>]+ and [>]. The second of these would do the character-at-a-time match, as with the above code, but would terminate instead of looping if it found a non-matching character other than >. However, the simple code presented may turn out to be fast enough, and anyway it was just intended to be good enough to do a test run.
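If the whole input is already in memory, the salt matching can also be sketched outside flex with plain string functions. The following is only an illustration under that assumption: read_explicit_string is a made-up helper, it uses strstr to find the first ">salt>" terminator instead of reading with input(), and it is not a drop-in replacement for the scanner action above.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse an explicit string of the form <salt<text>salt> starting at `s`.
 * On success, returns a malloc'd copy of text and stores the position
 * just past the closing '>' in *end. Returns NULL on malformed input. */
char *read_explicit_string(const char *s, const char **end)
{
    if (s[0] != '<' || !isalpha((unsigned char)s[1])) return NULL;
    const char *p = s + 1;
    while (isalnum((unsigned char)*p)) p++;
    if (*p != '<') return NULL;
    size_t salt_len = (size_t)(p - (s + 1));

    /* Build the terminator ">salt>". */
    char *term = malloc(salt_len + 3);
    if (!term) return NULL;
    term[0] = '>';
    memcpy(term + 1, s + 1, salt_len);
    term[salt_len + 1] = '>';
    term[salt_len + 2] = '\0';

    const char *body = p + 1;
    const char *stop = strstr(body, term);  /* first occurrence ends the string */
    free(term);
    if (!stop) return NULL;

    size_t n = (size_t)(stop - body);
    char *text = malloc(n + 1);
    if (!text) return NULL;
    memcpy(text, body, n);
    text[n] = '\0';
    *end = stop + salt_len + 2;             /* skip over ">salt>" */
    return text;
}
```

Called on <xml<<html>[]</html>>xml> followed by more input, this returns <html>[]</html> and leaves *end pointing at the first character after the terminator. The caller owns the returned buffer and must free it.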
I think the traditional way to parse your language would be to have the tokenizer return T_identifier("ns"), T_slash, T_identifier("name"), T_colon for ns/name:
Anyway, I can see three reasonable ways you could implement support for your language:
1. Use lex/flex and yacc/bison. The tokenizers generated by lex/flex do not use a stack by default, so you should be using T_identifier and not T_context_specific_type. I haven't tried this approach, so I can't say definitively whether your language can be parsed by lex/flex and yacc/bison; try it and see whether it works. You may find information about the lexer hack useful: http://en.wikipedia.org/wiki/The_lexer_hack
2. Implement a hand-built recursive descent parser. Note that this can easily be built without separate lexer/parser stages. So, if the lexemes depend on context, that is easy to handle with this approach.
3. Implement your own parser generator which turns lexemes on and off based on the context of the parser. So, the lexer and the parser would be integrated together using this approach.
I once worked for a major network security vendor where deep packet inspection was performed by using approach (3), i.e. we had a custom parser generator. The reason for this is that approach (1) doesn't work for two reasons: firstly, data can't be fed to lex/flex and yacc/bison incrementally, and secondly, HTTP can't be parsed by using lex/flex and yacc/bison because the meaning of the string "HTTP" depends on its location, i.e. it could be a header value or the protocol specifier. The approach (2) didn't work because data can't be fed incrementally to recursive descent parsers.
I should add that if you want to have meaningful error messages, a recursive descent parser approach is heavily recommended. My understanding is that the current version of gcc uses a hand-built recursive descent parser.

Can nested parentheticals be parsed in chemical formulae?

I am trying to create a parser for simple chemical formulae. Meaning, they have no states of matter, charge, or anything like that. The formulae only have strings representing compounds, quantities, and parentheses.
Following this answer to a similar question, and some rudimentary knowledge of discrete math, I hoped that I could write a simple Recursive Descent Parser to generate the number of each atom inside of the formula. I already have a really simple answer for this that involves single parentheses, but not nested parentheses.
Here are the productions of the grammar without parentheses:
Compound: Component { Component };
Component: Atom [Quantity]
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
[...] is read as optional, and will be an if test in the program (either it is there or missing)
| is alternatives, and so is an if .. else if .. else or switch 'test', it is saying the input must match one of these
{ ... } is read as repetition of 0 or more, and will be a while loop in the program
Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up, and handle the input.
With nested parentheses, I have no idea what to do. By nested parentheses I mean something like (Fe2(OH)2(H2O)8)2, or something fictitious and complicated like (Ab(CD2(Ef(G2H)3)(IJ2)4)3)2
Because now there is a production that I don't really understand how to articulate, but here is my best attempt:
Parenthetical: Compound { Parenthetical } [Quantity]
So the basic rules parse any simple sequence of chemical symbols and quantities without parenthesis.
I assume the Quantity is defining the quantity of the whole chunk of stuff between '(' ... ')'
So, '(' ... ')' [Quantity] needs to be parsed as exactly the same thing as the Component, i.e. as an alternative to: Atom [Quantity]
So the only thing to change is the Component rule; it becomes:
Component: Atom [Quantity] | '(' Compound ')' [Quantity]
In the code function (or procedure) which is parsing Component, it will have a look at the next character (token), and if it is an '(', it will consume it, then call the function (or procedure) responsible for parsing Compound, and after that, check the next character (token) is a ')' (if not, it's a syntax error), then handle the optional Quantity, and then it is finished.
I am assuming you are using a programming language which supports recursive function (or procedure) calls. That housekeeping, done by code behind the scenes for your program, will make this 'just work' (TM).
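As a rough sketch of how the modified Component rule turns into code, here is a small recursive descent parser in C that counts atoms. Several things in it are assumptions made for the illustration: an Atom is taken to be an upper-case letter optionally followed by one lower-case letter (rather than a lookup in the periodic table), the struct counts container with its fixed size is invented for the example, and syntax errors are reported only as a 0 return value.

```c
#include <ctype.h>
#include <string.h>

#define MAX_ELEMENTS 32

struct counts {
    char symbol[MAX_ELEMENTS][4];
    long amount[MAX_ELEMENTS];
    int used;
};

static const char *cur;   /* cursor into the formula being parsed */

/* Add n atoms of sym to the tally, creating the entry if needed. */
static void add(struct counts *c, const char *sym, long n)
{
    for (int i = 0; i < c->used; i++)
        if (strcmp(c->symbol[i], sym) == 0) { c->amount[i] += n; return; }
    if (c->used < MAX_ELEMENTS) {
        strcpy(c->symbol[c->used], sym);
        c->amount[c->used++] = n;
    }
}

/* Quantity: Digit { Digit }; an absent Quantity counts as 1. */
static long quantity(void)
{
    if (!isdigit((unsigned char)*cur)) return 1;
    long n = 0;
    while (isdigit((unsigned char)*cur)) n = n * 10 + (*cur++ - '0');
    return n;
}

static int compound(struct counts *out);

/* Component: Atom [Quantity] | '(' Compound ')' [Quantity] */
static int component(struct counts *out)
{
    if (*cur == '(') {
        struct counts inner = { .used = 0 };
        cur++;
        if (!compound(&inner) || *cur != ')') return 0;
        cur++;
        long q = quantity();            /* multiplies the whole group */
        for (int i = 0; i < inner.used; i++)
            add(out, inner.symbol[i], inner.amount[i] * q);
        return 1;
    }
    if (!isupper((unsigned char)*cur)) return 0;
    char sym[4] = { *cur++, '\0', '\0', '\0' };
    if (islower((unsigned char)*cur)) sym[1] = *cur++;
    add(out, sym, quantity());
    return 1;
}

/* Compound: Component { Component } */
static int compound(struct counts *out)
{
    if (!component(out)) return 0;
    while (*cur == '(' || isupper((unsigned char)*cur))
        if (!component(out)) return 0;
    return 1;
}

/* Returns 1 and fills `out` if the whole formula parses, else 0. */
int parse_formula(const char *formula, struct counts *out)
{
    out->used = 0;
    cur = formula;
    return compound(out) && *cur == '\0';
}

/* Look up the count for one element; 0 if it does not occur. */
long count_of(const struct counts *c, const char *sym)
{
    for (int i = 0; i < c->used; i++)
        if (strcmp(c->symbol[i], sym) == 0) return c->amount[i];
    return 0;
}
```

For (Fe2(OH)2(H2O)8)2 this gives Fe 4, O 20 and H 36. The nesting is handled entirely by component calling compound and compound calling component, so no explicit stack is needed.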
Alternatively, you could solve the problem in a different way. Add a new rule, which says:
Stuff: Atom | '(' Compound ')'
Then modify the rule:
Component: Stuff [Quantity]
Then write a new function (or procedure) for Stuff, and change the Component code to simply call Stuff, then handle the optional Quantity.
There are good technical reasons for doing this to support some parsing technology. However you're using recursive descent where it won't really matter.
Edit:
The type of grammar which works very well for a recursive descent parser is called LL(1), which means parsing the input from left to right and creating the left-most derivation, using one token of lookahead. That is a 'natural' way to parse when the code and function calls are the control flow. To find the theory of how to check that grammars are LL(1), search the web for "parsing LL(1)" or "grammar follow sets".
It is pretty uncommon to see nested brackets in chemical formulae. But maybe, for instance, ammonium carbonate and barium nitrate in a 2:3 ratio could be written as "( (NH4)2 CO3)2 ( Ba(NO3)2 )3"
I found a right-to-left parser that pushes the multiplier onto a multiplier stack worked really well for me:
double multiplier[8];
double num = 1.0;
int multdepth = 0;
multiplier[0] = 1;
char molecule[1024]; // contains molecular formula
//parse the molecular formula right-to-left whilst keeping track of multiplier
for (int i = strlen(molecule) - 1; i >= 0; i--)
{
if (isdigit(molecule[i]) || molecule[i] == '.')
i = readnum(i, &num);
if (isalpha(molecule[i]))
{
i = parseatom(i, num * multiplier[multdepth]);
num = 1.0; // need to reset the multiplier here
}
if (molecule[i] == ')')
{
multdepth++;
multiplier[multdepth] = num * multiplier[multdepth - 1];
num = 1.0;
}
if (molecule[i] == '(')
{
multdepth--;
if (multdepth < 0)
error("Opening bracket not terminated");
}
}

Replace number expression with flex

I use Flex to replace numeric expressions in source code:
For instance:
Input string: ... echo "test"; if ($isReady) $variable = 2 * 5; ...
Desired result string: ... echo "test"; if ($isReady) $variable = 10; ...
My code:
%{
#include <stdio.h>
#include <stdlib.h>
%}
MYEXP [0-9]+[ \t\n\r]*\+[ \t\n\r]*[0-9]+
%%
{MYEXP} {
printf("multiplication ");
// code for processing
}
%%
void main()
{
yylex();
}
How can I process the multiplication with Flex? Or do I have to process it in C?
Some of the answers are in the comments, but the question has not yet been closed with an answer in two years. I thought some notes, for the purposes of completion, would be useful for people who are thinking of things like this in the future.
Simple arithmetic expressions of the form exemplified in the question can be recognised by a tool like flex, which matches regular expressions using an FSA (Finite State Automaton, also called an FSM, Finite State Machine). This works when the syntax is a simple id + id, but fails when the expressions become more complex. Handling the operator precedence in id + id * id and the nested parentheses in something like ((id + id) * (id + id)) means that a regular grammar can no longer work; a context-free grammar is required. (Computer Science students should know this from the Chomsky hierarchy in formal language theory.) So the operations can only be performed in flex for the simplest forms of expression.
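To show what the context-free part involves, here is a minimal sketch in C of a recursive descent evaluator for + and * with parentheses. The grammar in the comments and the name fold_constant are choices made for this example, not something taken from the question, and the sketch does no error reporting (malformed input simply evaluates to whatever was parsed so far).

```c
#include <ctype.h>

static const char *pos;   /* cursor into the expression text */

static void skip_ws(void) { while (isspace((unsigned char)*pos)) pos++; }

static long expr(void);

/* factor : NUMBER | '(' expr ')' */
static long factor(void)
{
    skip_ws();
    if (*pos == '(') {
        pos++;
        long v = expr();
        skip_ws();
        if (*pos == ')') pos++;   /* a real parser would report a missing ')' */
        return v;
    }
    long v = 0;
    while (isdigit((unsigned char)*pos)) v = v * 10 + (*pos++ - '0');
    return v;
}

/* term : factor { '*' factor }   ('*' binds tighter than '+') */
static long term(void)
{
    long v = factor();
    for (skip_ws(); *pos == '*'; skip_ws()) {
        pos++;
        v *= factor();
    }
    return v;
}

/* expr : term { '+' term } */
static long expr(void)
{
    long v = term();
    for (skip_ws(); *pos == '+'; skip_ws()) {
        pos++;
        v += term();
    }
    return v;
}

/* Evaluate a constant expression such as "2 + 3 * 4". */
long fold_constant(const char *text)
{
    pos = text;
    return expr();
}
```

Here fold_constant("2 + 3 * 4") evaluates to 14 and fold_constant("(1 + 2) * (3 + 4)") to 21. The recursion from factor back into expr is exactly the part that a single regular expression cannot express.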
The replacement of simple expressions, which only contain constants, is an optimisation called constant folding and is performed by most compilers as standard. Performing this as a form of pre-processing on most code will not produce any improvement. So when proposing to write tools to do a job like this you have to reflect on whether it is essential or not!
Now down to the actual details of the question, which have been picked up in the comments; yes, a rule will be needed for each operator, addition and multiplication; and when matched a substring will be needed to pick up the operands. It will look something like this:
MYplusEXP [0-9]+[ \t\n\r]*\+[ \t\n\r]*[0-9]+
MYmultEXP [0-9]+[ \t\n\r]*\*[ \t\n\r]*[0-9]+
%%
    char *right;   /* indented code before the first rule is local to yylex; needs <string.h> and <stdlib.h> */
{MYplusEXP} { right = strchr(yytext, '+'); /* yytext is already terminated with \0 */
    /* atoi(yytext) stops at the first non-digit; right + 1 skips the operator */
    printf("%d", atoi(yytext) + atoi(right + 1));
}
{MYmultEXP} { right = strchr(yytext, '*');
    printf("%d", atoi(yytext) * atoi(right + 1));
}
However, I feel a bit dirty after doing that pointer arithmetic.
In summary, it might be better done with other tools or not at all!

Resources