Need Lex regular expression to match string upto newline - flex-lexer

I want to parse strings of the type :
a=some value
b=some other value
There are no blanks around '=' and values extend up to newline. There may be leading spaces.
My lex specification (relevant part) is:
%%
a= { printf("Found attr %s\n", yytext); return aATTR; }
^[ \r\t]+ { printf("Found space at the start %s\n", yytext); }
([^a-z]=).*$ { printf("Found value %s\n", yytext); }
\n { return NEWLINE; }
%%
I tried .*$ [^\n]* and a few other regular expressions but to no avail.
This looks pretty simple. Any suggestions? I am also aware that lex returns the longest match so that complicates it further. I get the whole line matched for some regular expressions I tried.

You probably want to incorporate separate start states. These permit you to encode simple contexts. The simple example below captures your id, operator and value on each call to yylex().
%{
char id;
char op;
char *value;
%}
%x VAL OP
%%
<INITIAL>[a-z]+ {
id = yytext[0];
yyleng = 0;
BEGIN OP;
}
<INITIAL,OP>[ \t]*
<OP>=[ \t]* {
op = yytext[0];
yyleng = 0;
BEGIN VAL;
}
<VAL>.*\n {
value = yytext;
BEGIN INITIAL;
return 1;
}
%%

Related

Flex find substring until character

This is my lexer.l file:
%{
#include "../h/Tokens.h"
%}
%option yylineno
%%
[+-]?([1-9]*\.[0-9]+)([eE][+-]?[0-9])? return FLOAT;
[+-]?[1-9]([eE][+-]?[0-9])? return INTEGER;
\"(\\\\|\\\"|[^\"])*\" return STRING;
(true|false) return BOOLEAN;
(func|val|if|else|while|for)* return KEYWORD;
[A-Za-z_][A-Za-z_0-9]* return IDENTIFIER;
"+" return PLUS;
"-" return MINUS;
"*" return MULTI;
"." return DOT;
"," return COMMA;
":" return COLON;
";" return SEMICOLON;
. printf("Unexpected or invalid token: '%s'\n", yytext);
%%
int yywrap(void)
{
return 1;
}
Now, if my lexer finds an unexpected token, it sends an error for every character. I want it to send an error message for every substring until a whitespace or operator.
Example:
Input:
foo bar baz
~±`≥ hello
Output:
Identifier.
Identifier.
Identifier.
Unexpected or invalid token: '~±`≥'
Identifier.
Is there a way to do this with a regex pattern?
Thanks.
Certainly it is possible to do with a regex. But you can't do it with a regex independent of your other token rules. And it may not be trivial to find a correct regex.
In this fairly simple example, though, it's reasonably simple, although there is a corner case. Since there are no multicharacter operators, a character cannot start a token unless it is alphabetic, numeric, one of the operators (-+*.,:;) or a double-quote. And therefore any sequence of such characters is an invalid sequence. Also, I think that you really want ignore whitespace characters (based on the example output), even though your question doesn't show any rule which matches whitespace. So on the assumption that you just left out the whitespace rule, which would be something like
[[:space:]]+ { /* Ignore whitespace */ }
your regex to match a sequence of illegal characters would be
[^-+*.,:;[:alnum:][:space:]]+ { fprintf(stderr, "Invalid sequence %s\m", yytext); }
The corner-case is an unterminated string literal; that is, a token which starts with a " but does not include the matching closing quote. Such a token must necessarily extend to the end of the input, and it can easily be matched by using your string pattern, leaving out the final ". (That works because (f)lex always uses the longest matching pattern, so if there is a terminating " the correct string literal will be matched.)
There are a number of errors in your patterns:
It's almost always a bad idea to match +- at the start of a numeric literal. If you do that, then x+2 will not be correctly analysed; your lexer will return two tokens, an IDENTIFIER and an INTEGER, instead of the correct three tokens (IDENTIFIER, PLUS, INTEGER).
Your FLOAT pattern won't accept numbers starting which contain a 0 before the decimal point, so 0.5 and 10.3 will both fail. Also, you force the exponent to be a single digit, so 1.3E11 won't be matched either. And you force the user to put a digit after the decimal point; most languages accept 3. as equivalent to 3.0. (That last one is not necessarily an error, but it's unconventional.)
Your INTEGER pattern won't accept numbers containing a 0, such as 10. But it will accept scientific notation, which is a little odd; in most languages 3E10 is a floating point constant, not an integer.
Your KEYWORD pattern accepts keywords which are made up of a concatenated series of words, such as forwhilefuncif. You probably didn't intend to put a * at the end of the pattern.
Your string literal pattern allows any sequence of characters other than ", which means a backslash \ will be allowed to match as a single character, even if it is followed by a quote or a backslash. That will result in some string literals not being correctly terminated. For example, given the string literal
"\\"
(which is a string literal containing a single backslash), the regex will match the initial ", then the \ as a single character, and then the \" sequence, and then whatever follows the string literal until it encounters another quote.
The error is the result of flex requiring \ to be escaped inside bracket expressions, unlike Posix regular expressions where \ loses special significance inside brackets.
So that would leave you with something like this:
%{
#include "../h/Tokens.h"
%}
%option yylineno noyywrap
%%
[[:space:]]+ /* Ignore whitespace */
(\.[0-9]+|[0-9]+\.[0-9]*)([eE][+-]?[0-9]+)? {
return FLOAT;
}
0|[1-9][0-9]* return INTEGER;
true|false return BOOLEAN;
func|val|if|else|while|for return KEYWORD;
[A-Za-z_][A-Za-z_0-9]* return IDENTIFIER;
"+" return PLUS;
"-" return MINUS;
"*" return MULTI;
"." return DOT;
"," return COMMA;
":" return COLON;
";" return SEMICOLON;
\"(\\\\|\\\"|[^\\"])*\" return STRING;
\"(\\\\|\\\"|[^\\"])* { fprintf(stderr,
"Unterminated string literal\n"); }
[^-+*.,:;[:alnum:][:space:]]+ { fprintf(stderr,
"Invalid sequence %s\m", yytext); }
(If any of those patterns look mysterious, you might want to review the description of flex patterns in the flex manual.)
But I have a feeling that you were looking for something different: a way of magically adapting to any change in the token patterns without excess analysis.
That's possible, too, but I don't know how to do it without code repetition. The basic idea is simple enough: when we encounter an unmatchable character, we just append it to the end of an error token and when we find a valid token, we emit the error message and clear the error token.
The problem is the "when we find a valid token" part, because that means that we need to insert an action at the beginning of every rule other than the error rule. The easiest way to do that is to use a macro, which at least avoids writing out the code for every action.
(F)lex does provide us with some useful tools we can build this on. We'll use one of (f)lex's special actions, yymore(), which causes the current match to be appended to the token being built, which is useful to build up the error token.
In order to know the length of the error token (and therefore to know if there is one), we need an additional variable. Fortunately, (f)lex allows us to define our own local variables inside the scanner. Then we define the macro E_ (whose name was chosen to be short, in order to avoid cluttering the rule actions), which prints the error message, moves yytext over the error token, and resets the error count.
Putting that together:
%{
#include "../h/Tokens.h"
%}
%option yylineno noyywrap
%%
int nerrors = 0; /* To keep track of the length of the error token */
/* This macro must be inserted at the beginning of every rule,
* except the fallback error rule.
*/
#define E_ \
if (nerrors > 0) { \
fprintf(stderr, "Invalid sequence %.*s\n", nerrors, yytext); \
yytext += nerrors; yyleng -= nerrors; nerrors = 0; \
} else /* Absorb the following semicolon */
[[:space:]]+ { E_; /* Ignore whitespace */ }
(\.[0-9]+|[0-9]+\.[0-9]*)([eE][+-]?[0-9]+)? { E_; return FLOAT; }
0|[1-9][0-9]* { E_; return INTEGER; }
true|false { E_; return BOOLEAN; }
func|val|if|else|while|for { E_; return KEYWORD; }
[A-Za-z_][A-Za-z_0-9]* { E_; return IDENTIFIER; }
"+" { E_; return PLUS; }
"-" { E_; return MINUS; }
"*" { E_; return MULTI; }
"." { E_; return DOT; }
"," { E_; return COMMA; }
":" { E_; return COLON; }
";" { E_; return SEMICOLON; }
\"(\\\\|\\\"|[^\\"])*\" { E_; return STRING; }
\"(\\\\|\\\"|[^\\"])* { E_;
fprintf(stderr,
"Unterminated string literal\n"); }
. { yymore(); ++nerror; }
That all assumes that we're happy to just produce an error message inside the scanner, and otherwise ignore the erroneous characters. But it may be better to actually return an error indication and let the caller decide how to handle the error. That introduces an extra wrinkle because it requires us to return two tokens in a single action.
For a simple solution, we use another (f)lex feature, yyless(), which allows us to rescan part or all of the current token. We can use that to remove the error token from the current token, instead of adjusting yytext and yyleng. (yyless will do that adjustment for us.) That means that after an error, the next correct token is scanned twice. That may seem inefficient, but it's probably acceptable because:
Most tokens are short,
There's not really much point in optimising for errors. It's much more useful to optimise processing of correct inputs.
To accomplish that, we just need a small change to the E_ macro:
#define E_ \
if (nerrors > 0) { \
yyless(nerrors); \
fprintf(stderr, "Invalid sequence %s\n", yytext); \
nerrors = 0; \
return BAD_INPUT; \
} else /* Absorb the following semicolon */

What is the difference between stack<int> and stack<string> ? could someone please tell the difference

/*
Evaluation Of postfix Expression in C++
Input Postfix expression must be in a desired format.
Operands must be integers and there should be space in between two operands.
Only '+' , '-' , '*' and '/' operators are expected.
*/
#include<iostream>
#include<stack>
#include<string>
using namespace std;
// Function to evaluate Postfix expression and return output
int EvaluatePostfix(string expression);
// Function to perform an operation and return output.
int PerformOperation(char operation, int operand1, int operand2);
// Function to verify whether a character is operator symbol or not.
bool IsOperator(char C);
// Function to verify whether a character is numeric digit.
bool IsNumericDigit(char C);
int main()
{
string expression;
cout<<"Enter Postfix Expression \n";
getline(cin,expression);
int result = EvaluatePostfix(expression);
cout<<"Output = "<<result<<"\n";
}
// Function to evaluate Postfix expression and return output
int EvaluatePostfix(string expression)
{
// Declaring a Stack from Standard template library in C++. whats the difference bet stack<int> and stack<string>
stack<int> S;
for(int i = 0;i< expression.length();i++) {
// Scanning each character from left.
// If character is a delimiter, move on.
if(expression[i] == ' ' || expression[i] == ',') continue;
// If character is operator, pop two elements from stack, perform operation and push the result back.
else if(IsOperator(expression[i])) {
// Pop two operands.
int operand2 = S.top(); S.pop();
int operand1 = S.top(); S.pop();
// Perform operation
int result = PerformOperation(expression[i], operand1, operand2);
//Push back result of operation on stack.
S.push(result);
}
else if(IsNumericDigit(expression[i])){
// Extract the numeric operand from the string
// Keep incrementing i as long as you are getting a numeric digit.
int operand = 0;
while(i<expression.length() && IsNumericDigit(expression[i])) {
// For a number with more than one digits, as we are scanning from left to right.
// Everytime , we get a digit towards right, we can multiply current total in operand by 10
// and add the new digit.
operand = (operand*10) + (expression[i] - '0');
i++;
}
// Finally, you will come out of while loop with i set to a non-numeric character or end of string
// decrement i because it will be incremented in increment section of loop once again.
// We do not want to skip the non-numeric character by incrementing i twice.
i--;
// Push operand on stack.
S.push(operand);
}
}
// If expression is in correct format, Stack will finally have one element. This will be the output.
return S.top();
}
// Function to verify whether a character is numeric digit.
bool IsNumericDigit(char C)
{
if(C >= '0' && C <= '9') return true;
return false;
}
// Function to verify whether a character is operator symbol or not.
bool IsOperator(char C)
{
if(C == '+' || C == '-' || C == '*' || C == '/')
return true;
return false;
}
// Function to perform an operation and return output.
int PerformOperation(char operation, int operand1, int operand2)
{
if(operation == '+') return operand1 +operand2;
else if(operation == '-') return operand1 - operand2;
else if(operation == '*') return operand1 * operand2;
else if(operation == '/') return operand1 / operand2;
else cout<<"Unexpected Error \n";
return -1;
}

Why the following LEX program is not printing "No. of tokens"

My code is printing the identifiers,separators and all other things except it is not printing the number of tokens.Can't point out the problem.
%{
int n=0;
%}
%%
"while"|"if"|"else"|"printf" {
n++;
printf("\t keywords : %s", yytext);}
"int"|"float" {
n++;printf("\t identifier : %s", yytext);
}
"<="|"=="|"="|"++"|"-"|"*"|"+" {
n++;printf("\t operator : %s", yytext);
}
[(){}|, ;] {n++;printf("\t seperator : %s", yytext);}
[0-9]*"."[0-9]+ {
n++;printf("\t float : %s", yytext);
}
[0-9]+ {
n++;printf("\t integer : %s", yytext);
}
.;
%%
int main(void)
{
yylex();
printf("\n total no. of tokens = %d\n",n);
}
int yywrap()
{
return 0;
}
If yywrap() returns 0, the lexer assumes that yywrap() has somehow arranged for yyin to have more data, and the lexer will continue to read input. So your lexer will never terminate.
If you want to signal that there is no more data, you need to return 1 from yywrap().
It's probably better to avoid the need for yywrap by placing
%option noyywrap
in the flex prologue.
I usually use %option noinput nounput noyywrap, which eliminates some compiler warnings assuming you ask for compiler warnings, which you should always do. Also %option nodefault can help you find lex specification bugs, since it will complain if some input does not have a matching rule. (The default (f)lex action on unrecognised input is to simply write the unmatched character to standard output. That's not usually very helpful, and unlike an error message, it is very easy to miss.) Finally, %option 8bit is only necessary if you request a lexer optimised for speed rather than table-size. But it doesn't hurt to add it, and it might save you from an embarrassing bug if you (or someone) someday decides to try the faster scanner skeleton. (Not recommended, except in very special circumstances.)

Bison parser expanding rule instead of reducing rule

Is it possible for a Bison rule to expand instead of reducing so that it turns into more tokens? Asked a different way: is it possible to insert extra tokens to be parsed before the next token in the parser input?
Here is an example where I might want this:
Suppose I want a parser that understands three token types. Numbers (just positive integers for the sake of simplicity - INT), words (any number of letters, upper or lower case STRING) and some kind of other symbol (lets use an exclamation mark for no good reason - EXC)
Suppose I have a rule that reduces a word followed by a number followed by an exclamation mark. This rule results in an integer type, let's say for now that it simply doubles its input. This rule also allows itself to be the integer that it parses.
I also have a rule to accept any number of these in a row (the start rule).
The Bison parser look like this: (quicktest.y)
%{
#include <stdio.h>
%}
%union {
int INT_VAL;
}
%token STRING EXC
%token <INT_VAL> INT
%type <INT_VAL> somenumber
%%
start: somenumber {printf ("Result: %d\n", $1);}
| start somenumber {printf ("Result: %d\n", $2);}
;
somenumber: STRING INT EXC {$$ = $2 *2;}
| STRING somenumber EXC {$$ = $2 *2;}
;
%%
main(int argc, char ** argv){
yyparse();
}
yyerror(char* s){
fprintf(stderr, "%s\n", s);
}
The tokens can be generated with a flex lexer like so: (quicktest.l)
%{
#include "quicktest.tab.h"
%}
%%
[A-Za-z]+ {return STRING;}
[1-9]+ {yylval.INT_VAL = atoi(yytext); return INT;}
"!" {return EXC;}
. {}
This can be built with the following commands:
bison -d quicktest.y
flex quicktest.l
gcc -o quicktest quicktest.tab.c lex.yy.c -lfl -ggdb
I can now input something like this:
double double 2 ! !
and get the result 8
Now if I want the user to be able to avoid having lots of exclamation marks on one line, like this:
a b c d e f 2 ! ! ! ! ! !
I'd like to be able to allow them to input something like this:
a b c d e f 2 !*6
So I can add a flex expression for such a token that simply extracts the number of exclamations needed:
!\*[1-9]+ {
char *number = malloc(sizeof(char) * (strlen(yytext)-1));
strcpy(number, yytext+2);
yylval.INT_VAL = atoi(number);
free(number);
printf("Multiple exclamations: %d\n", yylval.INT_VAL);
return REPEAT_EXC;
}
But how would I implement the bison side of things?
I can add the token type like so:
%token <INT_VAL> REPEAT_EXC
And then a rule of some kind perhaps?
repeat_exc: REPEAT_EXC {/*expand into n exclamation marks (EXC tokens)*/}
;
Does Bison support this in any way?
If not how should I implement this?
Should I somehow have the lexer return the EXC token n times when it receives the repeat exc expression? (I'd rather avoid this if possible as this requires the flex code to keep record of some kind of state, it could be in the repeat exclamation state or in a normal state. The lexer is then not as simple to maintain.)
That's really not possible in a context-free grammar.
It's not that difficult to do in a traditional lexer, but as you say it requires that the lexer maintain state. An easier approach is to use a push parser, where the parser is called from the lexer rather than the other way around. [Note 1]
The bison manual doesn't explain the API very well; if you declare a pure push parser, the interface you get is:
int yypush_parse(yypstate*, int, const YYSTYPE*);
or, if position-tracking is enabled:
int yypush_parse(yypstate*, int, const YYSTYPE*, YYLTYPE*);
I made fairly minimal changes to your example, in order to show the push_parser interface. First, the parser; the only differences are the %define directives to declare a push parser; the elimination of main (the lexer is now top-level), and the declaration of yyerror with an explicit void return type. [Note 2]
%{
#include <stdio.h>
void yyerror(char* msg);
%}
%define api.pure full
%define api.push-pull push
%union {
int INT_VAL;
}
%token STRING EXC
%token <INT_VAL> INT
%type <INT_VAL> somenumber
%%
start: somenumber {printf ("Result: %d\n", $1);}
| start somenumber {printf ("Result: %d\n", $2);}
;
somenumber: STRING INT EXC {$$ = $2 *2;}
| STRING somenumber EXC {$$ = $2 *2;}
;
%%
void yyerror(char* s){
fprintf(stderr, "%s\n", s);
}
The lexer has some more substantial changes, but I don't think the end result is any harder to read or maintain. It might even be easier.
The macro PARSE sends a token with a specified type tag and value to yyparse; the macro PARSE_TOKEN sends a token without a semantic value.
The %options line removes several warnings from the compile step
The initialization of the parser state was added. (Indented lines after the %% and before any rule are inserted at the top of the lexer function, in this case yypush_parse, so they can be used to declare and initialize local variables.)
The INT rule was changed to allow 10 to be a valid integer.
The !*<int> rule was added.
The <<EOF>> rule was added. (It's pretty well boiler-plate for lexer-driven push-parsing.)
A main function was added, which calls yylex.
(Oh, and I changed a rule to avoid echoing new lines.)
%{
#include "push.tab.h"
#define PARSE(tok,tag,val) do { \
YYSTYPE yylval = {.tag=val}; \
int status = yypush_parse(ps, tok, &yylval); \
if (status != YYPUSH_MORE) return status; \
} while(0)
#define PARSE_TOKEN(tok) do { \
int status = yypush_parse(ps, tok, 0); \
if (status != YYPUSH_MORE) return status; \
} while(0)
%}
%option noyywrap nounput noinput
%%
yypstate *ps = yypstate_new ();
[A-Za-z]+ {PARSE_TOKEN(STRING);}
[1-9][0-9]* {PARSE(INT,INT_VAL,atoi(yytext));}
"!*"[1-9][0-9]* {int r = atoi(yytext+2);
while (r--) PARSE_TOKEN(EXC);
}
"!" {PARSE_TOKEN(EXC);}
.|\n {}
<<EOF>> {int status = yypush_parse(ps, 0, 0);
yypstate_delete(ps);
return status;
}
%%
int main(int argc, char** argv) {
return yylex();
}
Notes
This is the style of the lemon parser generator. lemon was originally written to create the sqlite SQL parser but is used in various projects precisely for the convenience of the "push" interface. bison's push-parser support is more recent, and very welcome.
I'm not crazy about INT_VAL; I prefer lower-case for union tags, but I was trying to minimize the diff.

semicolon as delimiter in custom grammar parsed by flex/bison

I'm trying to write a simple parser for a meta programming language.
Everything works fine, but I want to use ';' as statement delimiter and not newline or ommit the semicolon entirely.
So this is the expected behaviour:
// good code
v1 = v2;
v3 = 23;
should parse without errors
But:
// bad code
v1 = v2
v3 = 23;
should fail
yet if I remove the 'empty' rule from separator both codes fail like this:
ID to ID
Error detected in parsing: syntax error, unexpected ID, expecting SEMICOLON
;
If I leave the 'empty' rule active, then both codes are accepted, which is not desired.
ID to ID // should raise error
ID to NUM;
Any help is welcome here, as most tutorials do not cover delimiters at all.
Here is a simplified version of my parser/lexxer:
parser.l:
%{
#include "parser.tab.h"
#include<stdio.h>
%}
num [0-9]
alpha [a-zA-Z_]
alphanum [a-zA-Z_0-9]
comment "//"[^\n]*"\n"
string \"[^\"]*\"
whitespace [ \t\n]
%x ML_COMMENT
%%
<INITIAL>"/*" {BEGIN(ML_COMMENT); printf("/*");}
<ML_COMMENT>"*/" {BEGIN(INITIAL); printf("*/");}
<ML_COMMENT>[.]+ { }
<ML_COMMENT>[\n]+ { printf("\n"); }
{comment}+ {printf("%s",yytext);}
{alpha}{alphanum}+ { yylval.str= strdup(yytext); return ID;}
{num}+ { yylval.str= strdup(yytext); return NUM;}
{string} { yylval.str= strdup(yytext); return STRING;}
';' {return SEMICOLON;}
"=" {return ASSIGNMENT;}
" "+ { }
<<EOF>> {exit(0); /* this is suboptimal */}
%%
parser.y:
%{
#include<stdio.h>
#include<string.h>
%}
%error-verbose
%union{
char *str;
}
%token <str> ID
%token <str> NUM
%token <str> STRING
%left SEMICOLON
%left ASSIGNMENT
%start input
%%
input: /* empty */
| expression separator input
;
expression: assign
| error {}
;
separator: SEMICOLON
| empty
;
empty:
;
assign: ID ASSIGNMENT ID { printf("ID to ID"); }
| ID ASSIGNMENT STRING { printf("ID to STRING"); }
| ID ASSIGNMENT NUM { printf("ID to NUM"); }
;
%%
yyerror(char* str)
{
printf("Error detected in parsing: %s\n", str);
}
main()
{
yyparse();
}
Compiled like this:
$>flex -t parser.l > parser.lex.yy.c
$>bison -v -d parser.y
$>cc parser.tab.c parser.lex.yy.c -lfl -o parser
Never mind... the problematic line was this one:
';' {return SEMICOLON;}
which required to be changed to
";" {return SEMICOLON;}
Now the behaviour is correct. :-)

Resources