Lex: expected expression before ‘[’ token when writing regular expressions - flex-lexer

I'm new to lex/yacc and following this tutorial: https://www.youtube.com/watch?v=54bo1qaHAfk
Here's my lex file:
%{
#include "main.h"
#include <stdio.h>
%}
%%
[a-zA-Z][_a-zA-Z0-9]* {return IDENTIFIER;}
"&" {return RUN_DAEMON;}
"|" {return SYM_PIPE;}
">" {return RED_STDOUT;}
"<" {return RED_STDIN;}
">>" {return APP_STDOUT;}
[ \t\n]+ {;}
. {printf("unexpected character\n");}
%%
int yywrap(){
return 1;
}
However, after running the lex command, when I try to compile lex.yy.c with gcc, it spams me with these errors:
sbash.l: In function ‘yylex’:
sbash.l:7:5: error: expected expression before ‘[’ token
[a-zA-Z][_a-zA-Z0-9]* {return IDENTIFIER;}
^
sbash.l:7:6: error: ‘a’ undeclared (first use in this function)
[a-zA-Z][_a-zA-Z0-9]* {return IDENTIFIER;}
^
sbash.l:7:6: note: each undeclared identifier is reported only once for each function it appears in
sbash.l:7:14: error: ‘_a’ undeclared (first use in this function)
[a-zA-Z][_a-zA-Z0-9]* {return IDENTIFIER;}
^~
sbash.l:7:17: error: ‘zA’ undeclared (first use in this function)
[a-zA-Z][_a-zA-Z0-9]* {return IDENTIFIER;}
^~
sbash.l:7:20: error: ‘Z0’ undeclared (first use in this function)
[a-zA-Z][_a-zA-Z0-9]* {return IDENTIFIER;}
^~
sbash.l:7:29: error: expected expression before ‘{’ token
[a-zA-Z][_a-zA-Z0-9]* {return IDENTIFIER;}
^
sbash.l:13:7: error: stray ‘\’ in program
[ \t\n]+ {;}
^
sbash.l:13:9: error: stray ‘\’ in program
[ \t\n]+ {;}
Unfortunately I cannot find what's going wrong, even after googling (many examples write their expressions exactly the same way as my code).
My lex version is 2.6.1, on CentOS 8.

As explained in the Flex manual chapter on flex input file format, pattern rules must start at the left margin:
The rules section of the flex input contains a series of rules of the form:
pattern action
where the pattern must be unindented and the action must begin on the same line. (Some emphasis added)
Indented lines in the rules section are just passed through as-is. In particular, indented lines prior to the first rule are inserted at the top of the yylex function, which is frequently useful. But flex makes no attempt to verify that code included in this way is valid; errors will only be detected when the generated scanner is compiled.
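Judging from the compiler errors, the rules in your actual .l file are indented, so flex copied those lines verbatim into yylex and gcc choked on the regular expressions. Taking your first rule as an example, this is passed through as (invalid) C code:

%%
    [a-zA-Z][_a-zA-Z0-9]* {return IDENTIFIER;}

while this is a rule, because the pattern starts at the left margin:

%%
[a-zA-Z][_a-zA-Z0-9]* {return IDENTIFIER;}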

Related

Why is my parser showing errors for valid programs?

I can't figure out what's wrong with my parser. Here are the associated files:
parse.y
declarations: INTEGER_SIZE IDENTIFIER TERMINATOR {declare($1,$2);}
void yyerror(char *err){
printf("\n\nYYError on line %d: Error = %s\n", yylineno, err);
}
scan.l
[Xx]+ {yylval.size = strlen(yytext);
When running it against the valid program below it shows an error at line 3; when running any of the lines individually it shows an error on line 1 via the yyerror() function.
BEGINING.
XXX XY-1.
XXXX Y.
XXXX Z.
BODY.
PRINT “Please enter a number? ”.
INPUT Y.
MOVE 15 TO Z.
ADD Y TO Z.
PRINT XY-1;” + “;Y;”=”;Z.
END.
To run the files run the following commands:
yacc -d parser.y
lex lexer.l
gcc -o parser lex.yy.c y.tab.c -ll
This non-terminal is called declarations, from which one might think that it matches one or more declarations, or perhaps zero or more declarations:
declarations: INTEGER_SIZE IDENTIFIER TERMINATOR {declare($1,$2);}
But the rule matches exactly three tokens, which is to say one declaration. So when you give it an input with two declarations, it fails on the second one.
Similarly, your non-terminal called statements only matches a single statement, not several as might be expected from its name.
Grammars need to be explicit. If you want to match several declarations, you have to write that:
declarations: declaration
| declarations declaration
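For concreteness, a sketch of the repaired fragment; your original three-token rule becomes the body of a single declaration, and statements gets the same treatment (its individual statement rules stay as they are in your grammar):

declarations: declaration
            | declarations declaration
            ;
declaration: INTEGER_SIZE IDENTIFIER TERMINATOR { declare($1, $2); }
            ;
statements: statement
          | statements statement
          ;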
By the way, I have seen grammars written before with the belief that you have to write {;} at the end of a production. I'm curious where this idea comes from. Yacc and bison do not require that productions have an action, and anyway an empty action is {}, just as it is in C.

Lex doesn't seem to follow precedence order

I am using ply (a popular python implementation of Lex and Yacc) to create a simple compiler for a custom language.
Currently my lexer looks as follows:
reserved = {
    'begin': 'BEGIN',
    'end': 'END',
    'DECLARE': 'DECL',
    'IMPORT': 'IMP',
    'Dow': 'DOW',
    'Enddo': 'ENDW',
    'For': 'FOR',
    'FEnd': 'ENDF',
    'CASE': 'CASE',
    'WHEN': 'WHN',
    'Call': 'CALL',
    'THEN': 'THN',
    'ENDC': 'ENDC',
    'Object': 'OBJ',
    'Move': 'MOV',
    'INCLUDE': 'INC',
    'Dec': 'DEC',
    'Vibration': 'VIB',
    'Inclination': 'INCLI',
    'Temperature': 'TEMP',
    'Brightness': 'BRI',
    'Sound': 'SOU',
    'Time': 'TIM',
    'Procedure': 'PROC'
}
tokens = ["INT", "COM", "SEMI", "PARO", "PARC", "EQ", "NAME"] + list(reserved.values())
t_COM = r'//'
t_SEMI = r";"
t_PARO = r'\('
t_PARC = r'\)'
t_EQ = r'='
t_NAME = r'[a-z][a-zA-Z_&!0-9]{0,9}'
def t_INT(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Syntax error: Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)
Per the documentation, I am creating a dictionary for reserved keywords and then adding them to the tokens list, rather than adding individual rules for them. The documentation also states that precedence is decided following these 2 rules:
All tokens defined by functions are added in the same order as they appear in the lexer file.
Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).
The problem I'm having is that when I test the lexer using this test string
testInput = "// ; begin end DECLARE IMPORT Dow Enddo For FEnd CASE WHEN Call THEN ENDC (asdf) = Object Move INCLUDE Dec Vibration Inclination Temperature Brightness Sound Time Procedure 985568asdfLYBasdf ; Alol"
The lexer returns the following error:
LexToken(COM,'//',1,0)
LexToken(SEMI,';',1,2)
LexToken(NAME,'begin',1,3)
Syntax error: Illegal character ' '
LexToken(NAME,'end',1,9)
Syntax error: Illegal character ' '
Syntax error: Illegal character 'D'
Syntax error: Illegal character 'E'
Syntax error: Illegal character 'C'
Syntax error: Illegal character 'L'
Syntax error: Illegal character 'A'
Syntax error: Illegal character 'R'
Syntax error: Illegal character 'E'
(That's not the whole output, but it's enough to see what's happening.)
For some reason, Lex is parsing NAME tokens before the keywords. Even after it's done parsing NAME tokens, it doesn't recognize the DECLARE reserved keyword. I have also tried adding the reserved keywords with the rest of the tokens, using regular expressions, but I get the same result (the documentation also advises against doing that).
Does anyone know how to fix this problem? I want the Lexer to identify reserved keywords first and then to attempt to tokenize the rest of the input.
Thanks!
EDIT:
I get the same result even when using the t_ID function exemplified in the documentation:
def t_NAME(t):
    r'[a-z][a-zA-Z_&!0-9]{0,9}'
    t.type = reserved.get(t.value,'NAME')
    return t
The main problem here is that you are not ignoring whitespace; all of the errors are a consequence of that. Adding a t_ignore definition to your lexer will eliminate those errors.
But the grammar won't work as expected even if you fix the whitespace issue, because you seem to be missing an important aspect of the documentation, which tells you how to actually use the dictionary reserved:
To handle reserved words, you should write a single rule to match an identifier and do a special name lookup in a function like this:
reserved = {
    'if' : 'IF',
    'then' : 'THEN',
    'else' : 'ELSE',
    'while' : 'WHILE',
    ...
}

tokens = ['LPAREN','RPAREN',...,'ID'] + list(reserved.values())

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t
(In your case, it would be NAME and not ID.)
Ply knows nothing about the dictionary reserved and it also has no idea how you produce the token names enumerated in tokens. The only point of tokens is to let Ply know which symbols in the grammar represent tokens and which ones represent non-terminals. The mere fact that some word is in tokens does not serve to define the pattern for that token.
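Putting both fixes together, a minimal sketch of the corrected lexer. Note that the NAME pattern here is widened to the documentation's [a-zA-Z_]... form so that it can also match your uppercase and capitalized keywords; the reserved dictionary is abbreviated:

import ply.lex as lex

reserved = {
    'begin': 'BEGIN',
    'end': 'END',
    'DECLARE': 'DECL',
    # ... the rest of your reserved dictionary ...
}

tokens = ['INT', 'COM', 'SEMI', 'PARO', 'PARC', 'EQ', 'NAME'] + list(reserved.values())

t_ignore = ' \t'     # skip spaces and tabs instead of reporting them as errors
t_COM = r'//'
t_SEMI = r';'
t_PARO = r'\('
t_PARC = r'\)'
t_EQ = r'='

def t_NAME(t):
    r'[a-zA-Z_][a-zA-Z_&!0-9]*'
    t.type = reserved.get(t.value, 'NAME')   # reserved words take precedence over NAME
    return t

def t_INT(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Syntax error: Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()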

clang ASTMatcher without expanded from macro

I am writing a clang ASTMatcher to find the locations where isnan is used in my source code. I am trying to understand why there are three matches even though I have restricted matching to the main file. Please find a sample source file below:
#include <math.h>
int main()
{
if(isnan(0.0)){
}
}
When I run a clang-query match, I get the output below:
clang-query> match declRefExpr(isExpansionInMainFile())
Match #1:
/home/clang-llvm/code/test.cpp:6:5: note: "root" binds here
if(isnan(0.0)){
^~~~~~~~~~
/usr/include/math.h:299:9: note: expanded from macro 'isnan'
? __isnanf (x) \
^~~~~~~~
Match #2:
/home/clang-llvm/code/test.cpp:6:5: note: "root" binds here
if(isnan(0.0)){
^~~~~~~~~~
/usr/include/math.h:301:9: note: expanded from macro 'isnan'
? __isnan (x) : __isnanl (x))
^~~~~~~
Match #3:
/home/clang-llvm/code/test.cpp:6:5: note: "root" binds here
if(isnan(0.0)){
^~~~~~~~~~
/usr/include/math.h:301:23: note: expanded from macro 'isnan'
? __isnan (x) : __isnanl (x))
^~~~~~~~
3 matches.
Is there any way to restrict the match to the source code only, and not the macro?
I would appreciate any help.
The macro is treated as pure text replacement during preprocessing, which happens before any of your matching starts. A quick grep through math.h gives me this:
# define isnan(x) \
(sizeof (x) == sizeof (float) \
? __isnanf (x) \
: sizeof (x) == sizeof (double) \
? __isnan (x) : __isnanl (x))
This explains why you get three matching results. They are already in your main function before you run the AST Matcher.
Whether you can get a single location depends on your source code. In this particular case, you can achieve it by changing your node matcher to a conditional operator:
clang-query> match conditionalOperator(hasFalseExpression(conditionalOperator()), isExpansionInMainFile())
Match #1:
~/test.cpp:4:8: note: "root" binds here
if(isnan(0.0)){
^~~~~~~~~~
/usr/include/math.h:254:7: note: expanded from macro 'isnan'
(sizeof (x) == sizeof (float)
\
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 match.
So here I am matching the expression that results after the macro is replaced.
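If you are writing a C++ tool rather than using clang-query, a sketch of the equivalent matcher registration (the callback class and binding name are illustrative; getBeginLoc requires clang 8 or later, older releases use getLocStart):

#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "llvm/Support/raw_ostream.h"

using namespace clang;
using namespace clang::ast_matchers;

// Prints where isnan(...) was written, mapping the macro-expansion
// location back to the spelling in the main file.
class IsnanCallback : public MatchFinder::MatchCallback {
public:
  void run(const MatchFinder::MatchResult &Result) override {
    const auto *E = Result.Nodes.getNodeAs<ConditionalOperator>("isnanUse");
    if (!E)
      return;
    const SourceManager &SM = *Result.SourceManager;
    SourceLocation Loc = SM.getExpansionLoc(E->getBeginLoc());
    Loc.print(llvm::outs(), SM);
    llvm::outs() << "\n";
  }
};

// Registration, inside the tool's setup code:
//   MatchFinder Finder;
//   IsnanCallback CB;
//   Finder.addMatcher(
//       conditionalOperator(hasFalseExpression(conditionalOperator()),
//                           isExpansionInMainFile()).bind("isnanUse"),
//       &CB);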
Hope it helps.

How to use "literal string tokens" in Bison

I am learning Flex/Bison. The manual of Bison says:
A literal string token is written like a C string constant; for
example, "<=" is a literal string token. A literal string token
doesn’t need to be declared unless you need to specify its semantic
value data type
But I cannot figure out how to use it, and I cannot find an example.
I have the following code for testing:
example.l
%option noyywrap nodefault
%{
#include "example.tab.h"
%}
%%
[ \t\n] {;}
[0-9] { return NUMBER; }
. { return yytext[0]; }
%%
example.y
%{
#include <stdio.h>
#define YYSTYPE char *
%}
%token NUMBER
%%
start: %empty | start tokens
tokens:
NUMBER "<=" NUMBER { printf("<="); }
| NUMBER "=>" NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
%%
main(int argc, char **argv) {
yyparse();
}
yyerror(char *s) {
fprintf(stderr, "error: %s\n", s);
}
Makefile
#!/usr/bin/make
# by RAM
all: example
example.tab.c example.tab.h: example.y
	bison -d $<
lex.yy.c: example.l example.tab.h
	flex $<
example: lex.yy.c example.tab.c
	cc -o $@ example.tab.c lex.yy.c -lfl
clean:
	rm -fr example.tab.c example.tab.h lex.yy.c example
And when I run it:
$ ./example
3<4
<
6>9
>
6=>9
error: syntax error
Any idea?
UPDATE: I want to clarify that I know alternative ways to solve it, but I want to use literal string tokens.
One alternative: using multiple "literal character tokens":
tokens:
NUMBER '<' '=' NUMBER { printf("<="); }
| NUMBER '=' '>' NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
When I run it:
$ ./example
3<=9
<=
Another alternative:
In example.l:
"<=" { return LE; }
"=>" { return GE; }
In example.y:
...
%token NUMBER
%token LE "<="
%token GE "=>"
%%
start: %empty | start tokens
tokens:
NUMBER "<=" NUMBER { printf("<="); }
| NUMBER "=>" NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
...
When I run it:
$ ./example
3<=4
<=
But the manual says:
A literal string token doesn’t need to be declared unless you need to
specify its semantic value data type
The quoted manual paragraph is correct, but you need to read the next paragraph, too:
You can associate the literal string token with a symbolic name as an alias, using the %token declaration (see Token Declarations). If you don’t do that, the lexical analyzer has to retrieve the token number for the literal string token from the yytname table.
So you don't need to declare the literal string token, but you still need to arrange for the lexer to send the correct token number, and if you don't declare an associated token name, the only way to find the correct value is to search for the code in the yytname table.
In short, your last example, where you define LE and GE as aliases, is by far the most common approach. Separating the tokens into individual characters is not a good idea: it might create shift-reduce conflicts, and it will definitely allow invalid inputs, such as putting whitespace between the characters.
If you want to try the yytname solution, there is sample code in the Bison manual. But please be aware that this code discovers bison's internal token number, which is not the number that needs to be returned from the scanner. There is no way to get the external token number that is easy, portable and documented; the easy but undocumented way is to look the token number up in yytoknum, but since that array is undocumented and conditional on a preprocessor macro, there is no guarantee that it will work. Note also that these tables are declared static, so the functions which rely on them must be included in the bison input file. (Of course, these functions can have external linkage so that they can be called from the lexer. But you can't just use yytname directly in the lexer.)
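For reference, a sketch of that yytname lookup, along the lines of the manual's sample. Assumptions: the grammar declares %token-table so that bison emits yytname, and <string.h> is included; the function must live in the third section of example.y because the tables are static, and, per the caveat above, the index it finds is bison's internal number, not the value yylex should return:

static int find_literal_token(const char *text)
{
    int i;
    for (i = 0; i < YYNTOKENS; i++) {
        /* literal string tokens appear in yytname with quotes, e.g. "\"<=\"" */
        if (yytname[i] != 0
            && yytname[i][0] == '"'
            && strncmp(yytname[i] + 1, text, strlen(text)) == 0
            && yytname[i][strlen(text) + 1] == '"'
            && yytname[i][strlen(text) + 2] == 0)
            return i;
    }
    return -1;   /* not found */
}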
I haven't used flex/bison for a while, but two things:
As far as I remember, . only matches a single character. yytext is a pointer to a null-terminated string (char *), so yytext[0] is a char, which means that you can't match strings this way. You probably need to change it to return yytext. Otherwise . will probably create a token per character and you'd probably have to write NUMBER '<' '=' NUMBER.

Jflex Unexpected Character error

I started studying JFlex. When I try to generate output using JFlex for the following code, I keep getting an error:
Error in file "\abc.flex" (line 29):
Unexpected character
[ \t\n]+ ;
^
1 error, 0 warnings.
Generation aborted.
The code I'm trying to run:
letter [a-zA-Z]
digit [0-9]
intlit [0-9]+
%{
#include <stdio.h>
# define BASTYPTOK 257 /*following are output from yacc*/
# define IDTOK 258 /*yacc assigns token numbers */
# define LITTOK 259
# define CINTOK 260
# define INSTREAMTOK 261
# define COUTTOK 262
# define OUTSTREAMTOK 263
# define WHILETOK 264
# define IFTOK 265
# define ADDOPTOK 266
# define MULOPTOK 267
# define RELOPTOK 268
# define NOTTOK 269
# define STRLITTOK 270
main() /*this replaces the main in the lex library*/
{ int p;
while (p= yylex())
printf("%d is \"%s\"\n", p, yytext);
/*yytext is where lex stores the lexeme*/}
%}
%%
[ \t\n]+ ;
"//".*"\n" ;
{intlit} {return(LITTOK);}
cin {return(CINTOK);}
"<<" {return(INSTREAMTOK);}
\<|"==" {return(RELOPTOK);}
\+|\-|"||" {return(ADDOPTOK);}
"=" {return(yytext[0]);}
"(" {return(yytext[0]);}
")" {return(yytext[0]);}
. {return (yytext[0]); /*default action*/}
%%
Can someone please help me figure out what is causing the issue? The pattern is also properly unindented.
Thanks for your help.
That's valid flex input, but it's not valid JFlex. Since the included code is in C rather than Java, it's not clear to me why you would want to use JFlex, but if your intent is to port the scanner to Java you might want to read the JFlex manual section on porting.
In particular, the sections in JFlex input are quite different from flex:
flex                            JFlex
definitions and declarations    user code
%%                              %%
rules                           declarations and definitions
%%                              %%
user code                       rules
So your definitions and rules are in the correct section for a flex file, but not for a JFlex file. (JFlex just copies the first section to the output, so it doesn't recognize the various syntax errors resulting from putting flex declarations where JFlex expects valid user code.)
Also, JFlex definitions are of the form name = pattern rather than name pattern, so once you get the order of the file sorted out, you'll also need to add the equals signs. And, of course, rewrite the C code in Java.
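To make that concrete, a minimal sketch of the reordered skeleton. This is my illustration, not a full port: the class name and the token constant are placeholders, and the remaining rules and Java action code are left out:

/* section 1: user code, copied verbatim into the generated file */
%%
/* section 2: options, declarations and macro definitions */
%class Scanner
%int
%{
  // Java helper code goes between %{ and %}; e.g. a token-number constant:
  static final int LITTOK = 259;
%}
letter = [a-zA-Z]
digit  = [0-9]
intlit = [0-9]+
%%
/* section 3: lexical rules */
[ \t\n]+     { /* skip whitespace */ }
{intlit}     { return LITTOK; }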
