Flex find substring until character - flex-lexer

This is my lexer.l file:
%{
#include "../h/Tokens.h"
%}
%option yylineno
%%
[+-]?([1-9]*\.[0-9]+)([eE][+-]?[0-9])? return FLOAT;
[+-]?[1-9]([eE][+-]?[0-9])? return INTEGER;
\"(\\\\|\\\"|[^\"])*\" return STRING;
(true|false) return BOOLEAN;
(func|val|if|else|while|for)* return KEYWORD;
[A-Za-z_][A-Za-z_0-9]* return IDENTIFIER;
"+" return PLUS;
"-" return MINUS;
"*" return MULTI;
"." return DOT;
"," return COMMA;
":" return COLON;
";" return SEMICOLON;
. printf("Unexpected or invalid token: '%s'\n", yytext);
%%
int yywrap(void)
{
return 1;
}
Now, if my lexer finds an unexpected token, it sends an error for every character. I want it to send an error message for every substring until a whitespace or operator.
Example:
Input:
foo bar baz
~±`≥ hello
Output:
Identifier.
Identifier.
Identifier.
Unexpected or invalid token: '~±`≥'
Identifier.
Is there a way to do this with a regex pattern?
Thanks.

Certainly it is possible to do with a regex. But you can't do it with a regex independent of your other token rules. And it may not be trivial to find a correct regex.
In this fairly simple example, though, it's reasonably simple, although there is a corner case. Since there are no multicharacter operators, a character cannot start a token unless it is alphabetic, numeric, an underscore, one of the operators (-+*.,:;) or a double-quote. Therefore any sequence of other characters is an invalid sequence. Also, I think that you really want to ignore whitespace characters (based on the example output), even though your question doesn't show any rule which matches whitespace. So on the assumption that you just left out the whitespace rule, which would be something like
[[:space:]]+ { /* Ignore whitespace */ }
your regex to match a sequence of illegal characters would be
[^-+*.,:;_\"[:alnum:][:space:]]+ { fprintf(stderr, "Invalid sequence %s\n", yytext); }
The corner-case is an unterminated string literal; that is, a token which starts with a " but does not include the matching closing quote. Such a token must necessarily extend to the end of the input, and it can easily be matched by using your string pattern, leaving out the final ". (That works because (f)lex always uses the longest matching pattern, so if there is a terminating " the correct string literal will be matched.)
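To make the character-class reasoning concrete, here is a minimal sketch in plain C of what the "invalid sequence" pattern computes. The function names are hypothetical, and the set of token-start characters is read off the rules in the question (letters, digits, underscore, double quote, the operators, and whitespace), so treat it as an illustration rather than anything flex generates.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if c may begin a token (or is skippable whitespace).
 * The set is an assumption taken from the rules in the question. */
static int can_start_token(unsigned char c)
{
    return isalnum(c) || isspace(c) || c == '_' || c == '"' ||
           (c != '\0' && strchr("-+*.,:;", c) != NULL);
}

/* Length of the run of bytes at the start of s that cannot start any
 * token; 0 if s is empty or begins with a valid token start. */
size_t invalid_run_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0' && !can_start_token((unsigned char)s[n]))
        ++n;
    return n;
}
```

In the default C locale, bytes above 127 are not alphanumeric, so the bytes of a multibyte character such as ± are swallowed into the same invalid run, which matches the behaviour asked for in the question.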
There are a number of errors in your patterns:
It's almost always a bad idea to match +- at the start of a numeric literal. If you do that, then x+2 will not be correctly analysed; your lexer will return two tokens, an IDENTIFIER and an INTEGER, instead of the correct three tokens (IDENTIFIER, PLUS, INTEGER).
Your FLOAT pattern won't accept numbers which contain a 0 before the decimal point, so 0.5 and 10.3 will both fail. Also, you force the exponent to be a single digit, so 1.3E11 won't be matched either. And you force the user to put a digit after the decimal point; most languages accept 3. as equivalent to 3.0. (That last one is not necessarily an error, but it's unconventional.)
Your INTEGER pattern won't accept numbers containing a 0, such as 10. But it will accept scientific notation, which is a little odd; in most languages 3E10 is a floating point constant, not an integer.
Your KEYWORD pattern accepts keywords which are made up of a concatenated series of words, such as forwhilefuncif. You probably didn't intend to put a * at the end of the pattern.
Your string literal pattern allows any sequence of characters other than ", which means a backslash \ will be allowed to match as a single character, even if it is followed by a quote or a backslash. That will result in some string literals not being correctly terminated. For example, given the string literal
"\\"
(which is a string literal containing a single backslash), the regex will match the initial ", then the \ as a single character, and then the \" sequence, and then whatever follows the string literal until it encounters another quote.
The error is the result of flex requiring \ to be escaped inside bracket expressions, unlike Posix regular expressions where \ loses special significance inside brackets.
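The difference is easier to see in a hand-written scanner. The following is a rough C sketch (the function name is made up) that mirrors the corrected pattern \"(\\\\|\\\"|[^\\"])*\": a backslash is only accepted as part of a two-character escape, so it can never be consumed on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length of the string literal at the start of s, quotes
 * included, or 0 if s does not start with a properly terminated literal.
 * Only the escapes \\ and \" are accepted, as in the pattern above. */
size_t string_literal_length(const char *s)
{
    assert(s != NULL);
    if (s[0] != '"')
        return 0;
    size_t i = 1;
    while (s[i] != '\0') {
        if (s[i] == '\\') {
            /* A backslash must be followed by a backslash or a quote. */
            if (s[i + 1] != '\\' && s[i + 1] != '"')
                return 0;
            i += 2;
        } else if (s[i] == '"') {
            return i + 1;          /* closing quote found */
        } else {
            ++i;                   /* ordinary character */
        }
    }
    return 0;                      /* ran off the end: unterminated */
}
```

With this logic the literal "\\" is four characters long and scanning stops at its second quote, instead of running on to the next quote in the input.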
So that would leave you with something like this:
%{
#include "../h/Tokens.h"
%}
%option yylineno noyywrap
%%
[[:space:]]+ /* Ignore whitespace */
(\.[0-9]+|[0-9]+\.[0-9]*)([eE][+-]?[0-9]+)? {
return FLOAT;
}
0|[1-9][0-9]* return INTEGER;
true|false return BOOLEAN;
func|val|if|else|while|for return KEYWORD;
[A-Za-z_][A-Za-z_0-9]* return IDENTIFIER;
"+" return PLUS;
"-" return MINUS;
"*" return MULTI;
"." return DOT;
"," return COMMA;
":" return COLON;
";" return SEMICOLON;
\"(\\\\|\\\"|[^\\"])*\" return STRING;
\"(\\\\|\\\"|[^\\"])* { fprintf(stderr,
"Unterminated string literal\n"); }
[^-+*.,:;_\"[:alnum:][:space:]]+ { fprintf(stderr,
"Invalid sequence %s\n", yytext); }
(If any of those patterns look mysterious, you might want to review the description of flex patterns in the flex manual.)
But I have a feeling that you were looking for something different: a way of magically adapting to any change in the token patterns without excess analysis.
That's possible, too, but I don't know how to do it without code repetition. The basic idea is simple enough: when we encounter an unmatchable character, we just append it to the end of an error token and when we find a valid token, we emit the error message and clear the error token.
The problem is the "when we find a valid token" part, because that means that we need to insert an action at the beginning of every rule other than the error rule. The easiest way to do that is to use a macro, which at least avoids writing out the code for every action.
(F)lex does provide us with some useful tools we can build this on. We'll use one of (f)lex's special actions, yymore(), which causes the current match to be appended to the token being built, which is useful to build up the error token.
In order to know the length of the error token (and therefore to know if there is one), we need an additional variable. Fortunately, (f)lex allows us to define our own local variables inside the scanner. Then we define the macro E_ (whose name was chosen to be short, in order to avoid cluttering the rule actions), which prints the error message, moves yytext over the error token, and resets the error count.
Putting that together:
%{
#include "../h/Tokens.h"
%}
%option yylineno noyywrap
%%
    int nerrors = 0; /* To keep track of the length of the error token */
    /* This macro must be inserted at the beginning of every rule,
     * except the fallback error rule.
     */
    #define E_ \
      if (nerrors > 0) { \
        fprintf(stderr, "Invalid sequence %.*s\n", nerrors, yytext); \
        yytext += nerrors; yyleng -= nerrors; nerrors = 0; \
      } else /* Absorb the following semicolon */
[[:space:]]+ { E_; /* Ignore whitespace */ }
(\.[0-9]+|[0-9]+\.[0-9]*)([eE][+-]?[0-9]+)? { E_; return FLOAT; }
0|[1-9][0-9]* { E_; return INTEGER; }
true|false { E_; return BOOLEAN; }
func|val|if|else|while|for { E_; return KEYWORD; }
[A-Za-z_][A-Za-z_0-9]* { E_; return IDENTIFIER; }
"+" { E_; return PLUS; }
"-" { E_; return MINUS; }
"*" { E_; return MULTI; }
"." { E_; return DOT; }
"," { E_; return COMMA; }
":" { E_; return COLON; }
";" { E_; return SEMICOLON; }
\"(\\\\|\\\"|[^\\"])*\" { E_; return STRING; }
\"(\\\\|\\\"|[^\\"])* { E_;
fprintf(stderr,
"Unterminated string literal\n"); }
. { yymore(); ++nerrors; }
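The accumulate-and-flush logic can also be sketched outside flex. The C model below is only an illustration under stated assumptions: match_fn is a hypothetical stand-in for "all the other rules", match_word_or_space is a toy matcher invented for the demo, and the flush at end of input is an addition that the scanner above does not perform.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the token rules: length of the valid token
 * at the start of s, or 0 if no rule matches there. */
typedef size_t (*match_fn)(const char *s);

/* Reports each maximal run of unmatched characters once, when the next
 * valid token (or the end of input) is reached. Returns the run count. */
int report_invalid_runs(const char *input, match_fn match, FILE *err)
{
    assert(input != NULL && match != NULL && err != NULL);
    int runs = 0;
    size_t nerrors = 0;            /* length of the pending error token */
    const char *p = input;
    while (*p != '\0') {
        size_t len = match(p);
        if (len == 0) {            /* fallback rule: extend the error token */
            ++nerrors;
            ++p;
            continue;
        }
        if (nerrors > 0) {         /* the job E_ does before a valid token */
            fprintf(err, "Invalid sequence %.*s\n", (int)nerrors, p - nerrors);
            nerrors = 0;
            ++runs;
        }
        p += len;
    }
    if (nerrors > 0) {             /* added here: flush a run at end of input */
        fprintf(err, "Invalid sequence %.*s\n", (int)nerrors, p - nerrors);
        ++runs;
    }
    return runs;
}

/* Toy matcher for the demo: identifiers and whitespace only. */
size_t match_word_or_space(const char *s)
{
    size_t n = 0;
    if (isspace((unsigned char)s[0])) {
        while (isspace((unsigned char)s[n])) ++n;
    } else if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        while (isalnum((unsigned char)s[n]) || s[n] == '_') ++n;
    }
    return n;
}
```

Because the matcher is passed in, the error handling does not need to know which characters are legal, which is the property the macro-based scanner is after.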
That all assumes that we're happy to just produce an error message inside the scanner, and otherwise ignore the erroneous characters. But it may be better to actually return an error indication and let the caller decide how to handle the error. That introduces an extra wrinkle because it requires us to return two tokens in a single action.
For a simple solution, we use another (f)lex feature, yyless(), which allows us to rescan part or all of the current token. Calling yyless(nerrors) keeps only the first nerrors characters (the error token) in yytext and pushes the rest of the match back onto the input, so we no longer need to adjust yytext and yyleng ourselves. That means that after an error, the next correct token is scanned twice. That may seem inefficient, but it's probably acceptable because:
Most tokens are short,
There's not really much point in optimising for errors. It's much more useful to optimise processing of correct inputs.
To accomplish that, we just need a small change to the E_ macro:
#define E_ \
if (nerrors > 0) { \
yyless(nerrors); \
fprintf(stderr, "Invalid sequence %s\n", yytext); \
nerrors = 0; \
return BAD_INPUT; \
} else /* Absorb the following semicolon */

Related

Why the following LEX program is not printing "No. of tokens"

My code prints the identifiers, separators and everything else, but it does not print the number of tokens. I can't pinpoint the problem.
%{
int n=0;
%}
%%
"while"|"if"|"else"|"printf" {
n++;
printf("\t keywords : %s", yytext);}
"int"|"float" {
n++;printf("\t identifier : %s", yytext);
}
"<="|"=="|"="|"++"|"-"|"*"|"+" {
n++;printf("\t operator : %s", yytext);
}
[(){}|, ;] {n++;printf("\t seperator : %s", yytext);}
[0-9]*"."[0-9]+ {
n++;printf("\t float : %s", yytext);
}
[0-9]+ {
n++;printf("\t integer : %s", yytext);
}
.;
%%
int main(void)
{
yylex();
printf("\n total no. of tokens = %d\n",n);
}
int yywrap()
{
return 0;
}
If yywrap() returns 0, the lexer assumes that yywrap() has somehow arranged for yyin to have more data, and the lexer will continue to read input. So your lexer will never terminate.
If you want to signal that there is no more data, you need to return 1 from yywrap().
It's probably better to avoid the need for yywrap by placing
%option noyywrap
in the flex prologue.
I usually use %option noinput nounput noyywrap, which eliminates some compiler warnings assuming you ask for compiler warnings, which you should always do. Also %option nodefault can help you find lex specification bugs, since it will complain if some input does not have a matching rule. (The default (f)lex action on unrecognised input is to simply write the unmatched character to standard output. That's not usually very helpful, and unlike an error message, it is very easy to miss.) Finally, %option 8bit is only necessary if you request a lexer optimised for speed rather than table-size. But it doesn't hurt to add it, and it might save you from an embarrassing bug if you (or someone) someday decides to try the faster scanner skeleton. (Not recommended, except in very special circumstances.)

PEG grammar to suppress execution in if command

I want to create a grammar for parsing some commands. Most of it works flawlessly, but "if(condition,then-value,else-value)" does not work together with the "out" command, which shows a value.
It works fine in case the output-command is outside the if-command:
out(if(1,42,43))
→ output and return 42 as expected OK
But as soon as the output-command is inside the then- and else-parts (which is required to be more intuitive), it fails:
if(1,out(42),out(43))
→ still return only 42 as expected OK, but the output function is called twice with 42 and 43
I'm working under C with the peg/leg parser generator here
The problem is also reproducible with PEG.js online parser generator here when using the following very much simplified grammar:
Expression
= Int
/ "if(" cond:Expression "," ok:Expression "," nok:Expression ")" { return cond?ok:nok; }
/ "out(" num:Expression ")" { window.alert(num); return num;}
Int = [0-9]+ { return parseInt(text(), 10); }
The "window.alert()" is only a placeholder for the needed output function, but for this problem it acts the same.
It looks like the scanner has to match the full if-command, with its then- and else-values, up to the closing bracket ")". So it matches both out-commands and both execute the defined function, which is not what I expect.
Is there a way in peg/leg to match some characters but suppress execution of the according function under some circumstances?
(I've already experimented with "&" predicate element without success)
(Maybe left-recursion vs. right-recursion could help here, but the peg/leg generator I use seems to support only right-recursion)
Is there a way in peg/leg to match some characters but suppress execution of the according function under some circumstances?
I'm not familiar with the tools in question, but it would surprise me if this were possible. And even if it were, you'd run into a similar problem when implementing loops: now you'd need to execute the action multiple times.
What you need is for your actions to not directly execute the code, but return something that can be used to execute it.
The usual way that interpreters work is that the parser produces some sort of representation of the source code (such as bytecode or an AST), which is then executed as a separate step.
The simplest (but perhaps not cleanest) way to make your parser work without changing too much would be to just wrap all your actions in 0-argument functions. You could then call the functions returned by the sub-expressions if and only if you want them to be executed. And to implement loops, you could then simply call the functions multiple times.
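Since the question mentions working in C, here is a minimal sketch of the "build a representation first, execute later" approach. The node layout, the eval function and the out_calls counter are all hypothetical; a real peg/leg grammar would construct such nodes in its actions instead of calling the output function directly.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NODE_INT, NODE_OUT, NODE_IF } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;                     /* NODE_INT only */
    const struct Node *a, *b, *c;  /* OUT: a; IF: a = cond, b = then, c = else */
} Node;

int out_calls = 0;                 /* counts side effects for the demo */
int last_output = 0;

static void output(int v) { ++out_calls; last_output = v; }

/* Evaluates a tree; only the selected branch of an IF is ever visited. */
int eval(const Node *n)
{
    assert(n != NULL);
    switch (n->kind) {
    case NODE_INT:
        return n->value;
    case NODE_OUT: {
        int v = eval(n->a);
        output(v);                 /* the side effect happens at eval time */
        return v;
    }
    case NODE_IF:
        return eval(n->a) ? eval(n->b) : eval(n->c);
    }
    return 0;
}

/* Builds and evaluates the tree for if(1,out(42),out(43)). */
int demo_if_out(void)
{
    const Node one   = { NODE_INT, 1,  NULL, NULL, NULL };
    const Node n42   = { NODE_INT, 42, NULL, NULL, NULL };
    const Node n43   = { NODE_INT, 43, NULL, NULL, NULL };
    const Node out42 = { NODE_OUT, 0, &n42, NULL, NULL };
    const Node out43 = { NODE_OUT, 0, &n43, NULL, NULL };
    const Node cond  = { NODE_IF,  0, &one, &out42, &out43 };
    return eval(&cond);
}
```

Parsing both branches only creates nodes; the output function runs once, for the branch that eval actually selects. A loop construct could reuse the same idea by evaluating its body node repeatedly.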
A solution could be to use a predicate expression "& {expression}" (not to be confused with the predicate element "& element"):
Expression
  = Function

Function
  = Int
  / "if(" IfCond "," ok:Function "," nok:FunctionDisabled ")" { return ok; }
  / "if(" FunctionDisabled "," ok:FunctionDisabled "," nok:Function ")" { return nok; }
  / "out(" num:Function ")" { window.alert("Out:"+num); return num; }

FunctionDisabled
  = Int
  / "if(" IfCond "," ok:FunctionDisabled "," nok:FunctionDisabled ")" { return ok; }
  / "if(" FunctionDisabled "," ok:FunctionDisabled "," nok:FunctionDisabled ")" { return nok; }
  / "out(" num:FunctionDisabled ")" { return num; }

IfCond
  = cond:FunctionDisabled &{ return cond; }

Int = [0-9]+ { return parseInt(text(), 10); }
The idea is to define out() twice: once really doing something, and a second time disabled, without output.
The condition of the if-command is evaluated using the code inside {}, so if the condition is false, the whole expression match fails.
The visible drawback is the redundant definition of the if-command: once for then, once for else, and again recursively in the disabled rule.

Reading new line giving syntax error in LEX YACC

I am trying to parse some code, and for that I have written the LEX and YACC files given below. The first line is read correctly, but after that I get a syntax error and the next line is not read. Should I modify the input and unput functions? I am reading from a file and writing my output to a file. I have just started using LEX and YACC and need some ideas.
input file
b_7 = _6 + b_3;
a_8 = b_7 - c_5;
lex file
%{
/*
parser for ssa;
*/
#include<stdio.h>
#include<stdlib.h>
#include"y.tab.h"
%}
%%
[\t]+ ;
\n ;
[if]+ printf("first input\n");
[else]+ return(op);
[=]+ return(equal);
[+]+ return(op);
[*]+ return(op);
[-]+ return(op);
[\<][b][b][ ]+[1-9][\>] {return(bblock);}
([[_][a-z]])|([a-z][_][0-9]+)|([0-9]+) {return(var);}
. ;
%%
yacc file
%{
/* lexer for ssa gramer to use for recognizing operations*/
#include<stdio.h>
char add_graph(char,char,...);
%}
%token opif opelse equal op bblock var
%%
sentence: var equal var op var { add_graph($1,$2,$3,$4,$5);}
;
%%
extern FILE *yyin;
main(argc,argv)
int argc;
char **argv;
{
if(argc > 1) {
FILE *file;
file=fopen(argv[1],"r");
if(file==NULL) {
fprintf(stderr,"couldnot open%s\n",argv[0]);
exit(1);
}
yyin=file;
}
do
{
yyparse();
}while (!feof(yyin));
fclose(yyin);
}
char add_graph(something)
{
.....
.....
}
yyerror(s)
char *s;
{
fprintf(stderr,"%s there is error\n",s);
}
yywrap()
{
printf("the output");
}
Lots of problems here:
your grammar is expecting the token op, but your lexer will never produce it, instead producing opadd opmul etc
your example has ; at the end of lines, but neither your lexer nor parser deal with them. The default lexer action of copying to stdout is almost never what you want.
your yacc file tries to use \\ as some sort of comment marker, but yacc doesn't understand that. Some versions of yacc understand C++-style // as a comment, but not all
your grammar only allows for one sentence in the input
your sentence has a spurious op at the end (on the next line), which is not a separate sentence rule -- you need | to separate rules.
you attempt to loop if you haven't reached the eof when yyparse returns, but if there's an error, it's likely that the input will still have some cruft that will cause an immediate error, resulting in an error storm -- probably not what you want.
Your grammar only permits one sentence. So if there is any input after the first sentence, an error will be raised. You want to permit one or more sentences. Try this in your .y file:
%%
sentences : sentences sentence
| sentence
;
sentence : var equal var op var { add_graph($1,$2,$3,$4,$5);}
;
%%
David is correct, but one more modification needs to be made. Add this rule to the lex file:
";" ;
See if this helps, and let me know if I am wrong.

Flex use constants Container

I have to write a compiler with flex.
But I don't like the given code and want to write my own.
lexfile.l:
%{
typedef enum { EQ=0, NE, PLUS, MINUS, SEMICOLON } punctuationType;
typedef enum { PRINT=100, WHILE, IDENT } keywordType;
%}
%%
"!=" { return NE; }
"=" { return EQ; }
"+" { return PLUS; }
"-" { return MINUS; }
";" { return SEMICOLON; }
%%
Is there a better solution?
I have searched for alternatives, but the only other solution I found is to define the constants:
#define EQ 0
#define NE 1
...
Output Example:
Operator EQ
Operator NE
The question is only whether there is a better type than the enum.
Whatever you return has to be understood by the compiler. If you're using yacc, you don't get the choice: you have to abide by whatever %token generates. Those values are defined for you in y.tab.h, so you don't have to do anything at all.
On the other hand there's no need to have either names or flex rules for the single-char special characters: you can just return yytext[0] for all of them and use the corresponding literals in the .y file.
You don't really give enough details for further comment.
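The yytext[0] suggestion above works because a single character can serve as its own token code, while named tokens get codes above the single-byte range. A small C sketch of that convention follows; the enum values and the function name are made up for the illustration, since in a yacc build the real values come from y.tab.h.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical named-token codes; in a yacc build these would come
 * from y.tab.h rather than being written by hand. */
enum { TOK_NE = 258, TOK_IDENT };

/* Maps a lexeme to a token code. Single-character punctuation is
 * returned as itself, the equivalent of "return yytext[0];". */
int token_code(const char *text)
{
    assert(text != NULL && text[0] != '\0');
    if (strcmp(text, "!=") == 0)
        return TOK_NE;
    if (text[1] == '\0' && strchr("=+-;", text[0]) != NULL)
        return (unsigned char)text[0];
    return TOK_IDENT;
}
```

In the grammar file such tokens are then written as character literals like ';' or '+', so no names need to be invented for them.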

ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

I am trying to preprocess my C++ source files with ANTLR. I would like to output an input file preserving all the whitespace formatting of the original source file while inserting some new source code of my own at the appropriate locations.
I know preserving WS requires this lexer rule:
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
With this my parser rules would have a $text attribute containing all the hidden WS. But the problem is that, for any parser rule, its $text attribute only includes the input text starting from the position that matches the first token of the rule. For example, if this is my input (note the formatting WS before and in between the tokens):
line 1; line 2;
And, if I have 2 separate parser rules matching
"line 1;"
and
"line 2;"
above separately but not the whole line:
" line 1; line 2;"
, then the leading WS and those WS in between "line 1" and "line 2" are lost (not accessible by any of my rules).
What should I do to preserve ALL THE WHITESPACE while allowing my parser rules to determine when to add new code at the appropriate locations?
EDIT
Let's say whenever my code contains a call to function(1) using 1 as the parameter but not something else, it adds an extraFunction() before it:
void myFunction() {
function();
function(1);
}
Becomes:
void myFunction() {
function();
extraFunction();
function(1);
}
This preprocessed output should remain human readable as people would continue coding on it. For this simple example, text editor can handle it. But there are more complicated cases that justify the use of ANTLR.
Another solution, but maybe also not very practical (?): You can collect all Whitespaces backwards, something like this untested pseudocode:
grammar T;
@members {
public void printWhitespaceBetweenRules(Token start) {
int index = start.getTokenIndex() - 1;
while(index >= 0) {
Token token = input.get(index);
if(token.getChannel() != Token.HIDDEN_CHANNEL) break;
System.out.print(token.getText());
index--;
}
}
}
line1: 'line' '1' {printWhitespaceBetweenRules($start); };
line2: 'line' '2' {printWhitespaceBetweenRules($start); };
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
But you would still need to change every rule.
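The backwards walk can be tried out independently of ANTLR. The C sketch below uses a made-up Tok struct in place of ANTLR's token objects and, unlike the pseudocode above, emits the collected whitespace in source order rather than in reverse.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a lexer token. */
typedef struct {
    const char *text;
    int hidden;        /* nonzero for whitespace on the hidden channel */
} Tok;

/* Copies into buf the hidden tokens immediately preceding tokens[start],
 * in source order. Returns the number of bytes written, excluding the
 * terminating NUL. Output is truncated if buf is too small. */
size_t leading_hidden_text(const Tok *tokens, size_t start,
                           char *buf, size_t cap)
{
    assert(tokens != NULL && buf != NULL && cap > 0);
    size_t first = start;
    while (first > 0 && tokens[first - 1].hidden)
        --first;                          /* walk backwards over hidden tokens */
    size_t used = 0;
    buf[0] = '\0';
    for (size_t i = first; i < start; ++i) {   /* then emit them forwards */
        size_t len = strlen(tokens[i].text);
        if (used + len >= cap)
            break;                        /* truncate rather than overflow */
        memcpy(buf + used, tokens[i].text, len + 1);
        used += len;
    }
    return used;
}
```

A rule action would call this with the index of its first token and write the result before its own text, which recovers the whitespace that $text leaves out.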
I guess one solution is to keep the WS tokens in the same channel by removing the $channel = HIDDEN;. This will allow you to get access to the information of a WS token in your parser.
Here's another way to solve it (at least the example you posted).
So you want to replace ...function(1) with ...extraFunction();\nfunction(1), where the dots are indents, and \n a line break.
What you could do is match:
Function1
: Spaces 'function' Spaces '(' Spaces '1' Spaces ')'
;
fragment Spaces
: (' ' | '\t')*
;
and replace that with the text it matches, but pre-pended with your extra method. However, the lexer will now complain when it stumbles upon input like:
'function()'
(without the 1 as a parameter)
or:
' x...'
(indents not followed by the f from function)
So, you'll need to "branch out" in your Function1 rule and make sure you only replace the proper occurrence.
You also must take care of occurrences of function(1) inside string literals and comments, assuming you don't want them to be pre-pended with extraFunction();\n.
A little demo:
grammar T;
parse
: (t=. {System.out.print($t.text);})* EOF
;
Function1
: indent=Spaces
( 'function' Spaces '(' Spaces ( '1' Spaces ')' {setText($indent.text + "extraFunction();\n" + $text);}
| ~'1' // do nothing if something other than `1` occurs
)
| '"' ~('"' | '\r' | '\n')* '"' // do nothing in case of a string literal
| '/*' .* '*/' // do nothing in case of a multi-line comment
| '//' ~('\r' | '\n')* // do nothing in case of a single-line comment
| ~'f' // do nothing in case of a char other than 'f' is seen
)
;
OtherChar
: . // a "fall-through" rule: it will match anything if none of the above matched
;
fragment Spaces
: (' ' | '\t')* // fragment rules are only used inside other lexer rules
;
You can test it with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"/* \n" +
" function(1) \n" +
"*/ \n" +
"void myFunction() { \n" +
" s = \"function(1)\"; \n" +
" function(); \n" +
" function(1); \n" +
"} \n";
System.out.println(source);
System.out.println("---------------------------------");
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
And if you run this Main class, you will see the following being printed to the console:
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart@hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar Main
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
function(1);
}
---------------------------------
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
extraFunction();
function(1);
}
I'm sure it's not fool-proof (I didn't account for char-literals, for one), but this could be a start to solve this, IMO.

Resources