How to use "literal string tokens" in Bison

I am learning Flex/Bison. The manual of Bison says:
A literal string token is written like a C string constant; for
example, "<=" is a literal string token. A literal string token
doesn’t need to be declared unless you need to specify its semantic
value data type
But I cannot figure out how to use it, and I cannot find an example.
I have the following code for testing:
example.l
%option noyywrap nodefault
%{
#include "example.tab.h"
%}
%%
[ \t\n] {;}
[0-9] { return NUMBER; }
. { return yytext[0]; }
%%
example.y
%{
#include <stdio.h>
#define YYSTYPE char *
%}
%token NUMBER
%%
start: %empty | start tokens
tokens:
NUMBER "<=" NUMBER { printf("<=\n"); }
| NUMBER "=>" NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
%%
main(int argc, char **argv) {
yyparse();
}
yyerror(char *s) {
fprintf(stderr, "error: %s\n", s);
}
Makefile
#!/usr/bin/make
# by RAM
all: example
example.tab.c example.tab.h: example.y
bison -d $<
lex.yy.c: example.l example.tab.h
flex $<
example: lex.yy.c example.tab.c
cc -o $@ example.tab.c lex.yy.c -lfl
clean:
rm -fr example.tab.c example.tab.h lex.yy.c example
And when I run it:
$ ./example
3<4
<
6>9
>
6=>9
error: syntax error
Any idea?
UPDATE: I want to clarify that I know alternative ways to solve it, but I want to use literal string tokens.
One Alternative: using multiple "literal character tokens":
tokens:
NUMBER '<' '=' NUMBER { printf("<=\n"); }
| NUMBER '=' '>' NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
When I run it:
$ ./example
3<=9
<=
Other alternative:
In example.l:
"<=" { return LE; }
"=>" { return GE; }
In example.y:
...
%token NUMBER
%token LE "<="
%token GE "=>"
%%
start: %empty | start tokens
tokens:
NUMBER "<=" NUMBER { printf("<=\n"); }
| NUMBER "=>" NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
...
When I run it:
$ ./example
3<=4
<=
But the manual says:
A literal string token doesn’t need to be declared unless you need to
specify its semantic value data type

The quoted manual paragraph is correct, but you need to read the next paragraph, too:
You can associate the literal string token with a symbolic name as an alias, using the %token declaration (see Token Declarations). If you don’t do that, the lexical analyzer has to retrieve the token number for the literal string token from the yytname table.
So you don't need to declare the literal string token, but you still need to arrange for the lexer to send the correct token number, and if you don't declare an associated token name, the only way to find the correct value is to search for the code in the yytname table.
In short, your last example, where you define LE and GE as aliases, is by far the most common approach. Splitting the tokens into individual characters is not a good idea: it might create shift-reduce conflicts, and it will definitely accept invalid input, such as whitespace between the characters.
If you want to try the yytname solution, there is sample code in the bison manual. But please be aware that this code discovers bison's internal token number, which is not the number which needs to be returned from the scanner. There is no way to get the external token number which is easy, portable and documented; the easy and undocumented way is to look the token number up in yytoknum but since that array is not documented and conditional on a preprocessor macro, there is no guarantee that it will work. Note also that these tables are declared static so the function(s) which rely on them must be included in the bison input file. (Of course, these functions can have external linkage so that they can be called from the lexer. But you can't just use yytname directly in the lexer.)
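To make the yytname lookup more concrete, here is a minimal C sketch of the kind of search the manual describes. The table demo_tname and the function find_literal_token are hypothetical stand-ins: a real parser would search Bison's own static yytname array from inside the grammar file, and the index found this way is the internal symbol number, not the code the scanner has to return.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for Bison's yytname table: literal string
   tokens are stored together with their surrounding double quotes. */
static const char *const demo_tname[] = {
    "$end", "error", "$undefined", "NUMBER", "\"<=\"", "\"=>\"", "'>'", "'<'"
};
static const int demo_ntokens = (int)(sizeof demo_tname / sizeof demo_tname[0]);

/* Return the table index whose name is the quoted form of `text`,
   or -1 if no literal string token matches. */
int find_literal_token(const char *text)
{
    size_t len = strlen(text);
    for (int i = 0; i < demo_ntokens; i++) {
        const char *name = demo_tname[i];
        if (name[0] == '"'
            && strlen(name) == len + 2
            && strncmp(name + 1, text, len) == 0
            && name[len + 1] == '"')
            return i;
    }
    return -1;
}
```

With this made-up table, looking up <= yields index 4, while a character token such as < is not found, because only names that start with a double quote are considered.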

I haven't used flex/bison for a while, but one thing stands out:
. matches only a single character, and yytext is a pointer to a null-terminated string (char *), so yytext[0] is a single char; you can't match multi-character strings this way. Returning yytext itself is not an option either, because yylex returns an int token code. You need a separate lexer rule for each multi-character operator; otherwise . produces one token PER character and you have to write NUMBER '<' '=' NUMBER in the grammar.
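A scanner has to recognise <= as one token before it falls back to single characters. The following is a minimal hand-written C sketch of that ordering, not flex output; the token codes TOK_NUMBER, TOK_LE and TOK_GE and the function next_token are hypothetical names chosen for illustration (Bison would normally generate the codes in example.tab.h).

```c
#include <ctype.h>

/* Hypothetical token codes; Bison would normally generate these
   in example.tab.h for %token NUMBER, LE and GE. */
enum { TOK_EOF = 0, TOK_NUMBER = 258, TOK_LE = 259, TOK_GE = 260 };

/* Scan one token starting at *cursor and advance past it. */
int next_token(const char **cursor)
{
    const char *p = *cursor;
    int tok;

    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;                          /* skip whitespace */

    if (*p == '\0') {
        tok = TOK_EOF;
    } else if (isdigit((unsigned char)*p)) {
        tok = TOK_NUMBER;             /* one digit, like [0-9] */
        p++;
    } else if (p[0] == '<' && p[1] == '=') {
        tok = TOK_LE;                 /* try two-character operators first */
        p += 2;
    } else if (p[0] == '=' && p[1] == '>') {
        tok = TOK_GE;
        p += 2;
    } else {
        tok = (unsigned char)*p;      /* fallback, like the "." rule */
        p++;
    }
    *cursor = p;
    return tok;
}
```

Called repeatedly on the input 6=>9, this sketch returns TOK_NUMBER, TOK_GE, TOK_NUMBER and then TOK_EOF, which is the token sequence a rule such as NUMBER "=>" NUMBER expects once "=>" is tied to GE.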

Related

Why can't my lexical analyzer recognize numbers, ids, and operators?

My lexical analyzer in flex cannot recognize numbers, ids, and operators; only keywords are recognized. Where is my mistake? This is my code:
%{
#include<stdio.h>
%}
Nums [0-9]
LowerCase [a-z]
UpperCase [A-Z]
Letters LowerCase|UpperCase|[_]
Id {Letters}({Letters}|{Nums})*
operators +|-|\|*
%%
"if" {printf("if keyword founded \n");}
"then" {printf("then keyword founded \n");}
"else" {printf("else keyword founded \n");}
Operators {printf(" operator founded \n");}
Id {printf(" id founded ");}
%%
int main (void)
{ yylex(); return(0);}
int yywrap(void)
{ return 1;}
The pattern Operators is equivalent to "Operators", so it only matches that single word. If you meant to expand the macro by that name, the syntax is {Operators}. (Actually, {operators} since you seem to have inconsistently spelled the macro name in all lower-case.)
If you do that, flex will complain because of the syntax error in that macro. (Syntax errors in macros aren't detected unless the macro is expanded. That's just one of the problems with using macros.)
You have different problems with your other macros. For example, Nums doesn't appear in any rule at all.
My suggestion would be to use fewer (or no) macros and more character classes. Eg.:
[[:alpha:]_][[:alnum:]_]* { /* Action for identifier. */ }
[[:digit:]]+ { /* Action for number. */ }
[-+*/] { /* Action for operator. */ }
Please read the Patterns section in the flex manual for a full description of the pattern syntax, including the named character class expressions used in the first two patterns above.
To use a named definition, it must be enclosed in {}. So your Letters rule should be
Letters {LowerCase}|{UpperCase}|[_]
... as it is, it matches the literal inputs LowerCase and UpperCase. Similarly in your rules, you want
{Operators} ...
{Id} ...
as what you have will match the literal input strings Operators and Id.

How do I fix my ANTLR Parser to separate comments from multiplication?

I'm using ANTLR4 to try to parse code that has asterisk-leading comments, like:
* This is a comment
I was initially having issues with multiplication expressions getting mistaken for these comments, so decided to make my lexer rule:
LINE_COMMENT : '\r\n' '*' ~[\r\n]* ;
This forces there to be a newline so it doesn't see 2 * 3, with '* 3' being a comment.
This worked just fine until I had code that starts with a comment on the first line, which does not have a newline to begin with. For example:
* This is the first line of the code's file\r\n
* This is the second line of the code's file\r\n
I have also tried the {getCharPositionInLine==x}? to make sure that it only recognizes a comment if there is an asterisk or spaces/tabs coming first in the current line. This works when using
antlr4 *.g4
, but will not work with my JavaScript parser generated using
antlr4 -Dlanguage=JavaScript *.g4
Is there a way to get the same results of {getCharPositionInLine==x}? with my JavaScript parser or some way to prevent multiplication from being recognized as a comment? I should also mention that this coding language doesn't use semicolons at the end of lines.
I've tried playing around with this simple grammar, but I haven't had any luck.
grammar wow;
program : expression | Comment ;
expression : expression '*' expression
| NUMBER ;
Comment : '*' ~[\r\n]*;
NUMBER : [0-9]+ ;
Asterisk : '*' ;
Space : ' ' -> skip;
and using a test file: test.txt
5 * 5
Make the comment rule match at least one more non-whitespace character, otherwise it could match the same content as the Asterisk rule, like so:
Comment: '*' ' '* ~[\r\n]+;
Do comments have to be at the beginning of line?
If so you can check it with this._tokenStartCharPositionInLine == 0 and have lexer rule like this
Comment : '*' ~[\r\n]* {this._tokenStartCharPositionInLine == 0}?;
If not, you need to track the previous tokens to decide whether a multiplication is possible at this point (for example, right after a token from your NUMBER rule), so you should write something like this (Java code):
@lexer::members {
private static final Set<Integer> MULTIPLIABLE_TOKENS = new HashSet<>();
static {
MULTIPLIABLE_TOKENS.add(NUMBER);
}
private boolean canBeMultiplied = false;
@Override
public void emit(final Token token) {
final int type = token.getType();
if (type != Whitespace && type != Newline) { // skip ws tokens from consideration
canBeMultiplied = MULTIPLIABLE_TOKENS.contains(type);
}
super.emit(token);
}
}
Comment : {!canBeMultiplied}? '*' ~[\r\n]*;
UPDATE
If you need function analogs for JavaScript, take a look into the sources -> Lexer.js
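The line-position test behind both predicates can be written down independently of the ANTLR target. Below is a small C sketch of that check, not ANTLR code; the function star_starts_comment is a hypothetical helper, and it assumes (as the question describes) that a comment asterisk may be preceded on its line only by spaces or tabs.

```c
#include <stddef.h>

/* Return 1 if the '*' at src[pos] should start a line comment:
   it may be preceded on its line only by spaces or tabs. */
int star_starts_comment(const char *src, size_t pos)
{
    if (src[pos] != '*')
        return 0;
    while (pos > 0) {
        char prev = src[pos - 1];
        if (prev == '\n' || prev == '\r')
            return 1;              /* reached the start of the line */
        if (prev != ' ' && prev != '\t')
            return 0;              /* something else precedes the '*' */
        pos--;
    }
    return 1;                      /* reached the start of the input */
}
```

For the input 5 * 5 the asterisk at index 2 is rejected because a digit precedes it on the same line, while an asterisk at the very start of the file is accepted without needing a leading newline.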

Flex: How to define a term to be the first one at the beginning of a line (exclusively)

I need some help regarding a problem I face in my flex code.
My task: to write flex code that recognizes the declaration part of a programming language, described below.
Consider a programming language PL. Its variable definition part is described as follows:
It must start with the keyword "var". After this keyword come one or more variable names separated by commas ",". Then comes a colon ":", followed by the variable type (real, boolean, integer or char in my example) and a semicolon ";". After that, further variables may be declared on new lines (variable names separated by commas ",", followed by a colon ":", the variable type and a semicolon ";"), but the "var" keyword must not be used again at the beginning of a new line (the "var" keyword is written only once!)
E.g.
var number_of_attendants, sum: integer;
ticket_price: real;
symbols: char;
Concretely, I do not know how to require that every declaration part starts with the 'var' keyword. At the moment, if a declaration part begins directly with a variable, say x (without "var" at the beginning of the line), no error is reported, which is not what I want.
My current flex code below:
%{
#include <stdio.h>
%}
VAR_DEFINER "var"
VAR_NAME [a-zA-Z][a-zA-Z0-9_]*
VAR_TYPE "real"|"boolean"|"integer"|"char"
SUBEXPRESSION [{VAR_NAME}[","{VAR_NAME}]*":"[ \t\n]*{VAR_TYPE}";"]+
EXPRESSION {VAR_DEFINER}{SUBEXPRESSION}
%%
^{EXPRESSION} {
printf("This is not a well-syntaxed expression!\n");
return 0;
}
{EXPRESSION} printf("This is a well-syntaxed expression!\n");
";"[ \t\n]*{VAR_DEFINER} {
printf("The keyword 'var' is defined once at the beginning of a new line. You can not use it again\n");
return 0;
}
{VAR_DEFINER} printf("A keyword: %s\n", yytext);
^{VAR_DEFINER} printf("Each and every declaration part must start with the 'var' keyword.\n");
{VAR_TYPE}";" printf("The variable type is: %s\n", yytext);
{VAR_NAME} printf("A variable name: %s\n", yytext);
","/[ \t\n]*{VAR_NAME} /* eat up commas */
":"/[ \t\n]*{VAR_TYPE}";" /* eat up single colon */
[ \t\n]+ /* eat up whitespace */
. {
printf("Unrecognized character: %s\n", yytext);
return 0;
}
%%
main(argc, argv)
int argc;
char** argv;
{
++argv, --argc;
if (argc > 0)
yyin = fopen(argv[0],"r");
else
yyin = stdin;
yylex();
}
I hope I have made this as clear as possible.
I am looking forward to reading your answers!
You seem to be trying to do too much in the scanner. Do you really have to do everything in Flex? In other words, is this an exercise to learn advanced use of Flex, or is it a problem that may be solved using more appropriate tools?
I've read that the first Fortran compiler took 18 staff-years to create, back in the 1950s. Today, "a substantial compiler can be implemented even as a student project in a one-semester compiler design course", as the Dragon Book from 1986 says. One of the main reasons for this increased efficiency is that we have learned how to divide the compiler into modules that can be constructed separately. The first two such parts, or phases, of a typical compiler are the scanner and the parser.
The scanner, or lexical analyzer, can be generated by Flex from a specification file, or constructed otherwise. Its job is to read the input, which consists of a sequence of characters, and split it into a sequence of tokens. A token is the smallest meaningful part of the input language, such as a semicolon, the keyword var, the identifier number_of_attendants, or the operator <=. You should not use the scanner to do more than that.
Here is how I would write a simplified Flex specification for your tokens:
[ \t\n] { /* Ignore all whitespace */ }
var { return VAR; }
real { return REAL; }
boolean { return BOOLEAN; }
integer { return INTEGER; }
char { return CHAR; }
[a-zA-Z][a-zA-Z0-9_]* { return VAR_NAME; }
. { return yytext[0]; }
The sequence of tokens is then passed on to the parser, or syntactical analyzer. The parser compares the token sequence with the grammar for the language. For example, the input var number_of_attendants, sum : integer; consists of the keyword var, a comma-separated list of variables, a colon, a data type keyword, and a semicolon. If I understand what your input is supposed to look like, perhaps this grammar would be correct:
program : VAR typedecls ;
typedecls : typedecl | typedecls typedecl ;
typedecl : varlist ':' var_type ';' ;
varlist : VAR_NAME | varlist ',' VAR_NAME ;
var_type : REAL | BOOLEAN | INTEGER | CHAR ;
This grammar happens to be written in a format that Bison, a parser-generator that often is used together with Flex, can understand.
If you separate your solution into a lexical part, using Flex, and a grammar part, using Bison, your life is likely to be much simpler and happier.
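To show how the grammar constrains the token sequence, here is a hand-written recursive-descent sketch in C. It is not what Bison generates (Bison builds a table-driven LALR parser); the token codes, the struct and the function names are all hypothetical, and the input is assumed to be an array of token codes terminated by T_END.

```c
#include <stddef.h>

/* Hypothetical token codes; Bison would generate the named ones. */
enum { T_END = 0, T_VAR = 258, T_REAL, T_BOOLEAN, T_INTEGER, T_CHAR, T_VAR_NAME };

/* Parser state: a cursor into a token array terminated by T_END. */
struct parser { const int *tok; };

/* Consume the next token if it equals t. */
static int eat(struct parser *p, int t)
{
    if (*p->tok != t)
        return 0;
    p->tok++;
    return 1;
}

/* varlist : VAR_NAME | varlist ',' VAR_NAME ; */
static int parse_varlist(struct parser *p)
{
    if (!eat(p, T_VAR_NAME))
        return 0;
    while (eat(p, ','))
        if (!eat(p, T_VAR_NAME))
            return 0;
    return 1;
}

/* var_type : REAL | BOOLEAN | INTEGER | CHAR ; */
static int parse_var_type(struct parser *p)
{
    return eat(p, T_REAL) || eat(p, T_BOOLEAN)
        || eat(p, T_INTEGER) || eat(p, T_CHAR);
}

/* typedecl : varlist ':' var_type ';' ; */
static int parse_typedecl(struct parser *p)
{
    return parse_varlist(p) && eat(p, ':')
        && parse_var_type(p) && eat(p, ';');
}

/* program : VAR typedecls ;  Returns 1 if the whole input matches. */
int parse_program(const int *tokens)
{
    struct parser p = { tokens };

    if (!eat(&p, T_VAR) || !parse_typedecl(&p))
        return 0;
    while (*p.tok != T_END)
        if (!parse_typedecl(&p))
            return 0;
    return 1;
}
```

Because VAR appears only in the program rule, a token stream that starts without it, or repeats it before a later declaration, is rejected without any special handling in the scanner.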

Valid regular expression for identifier using flex

I'm trying to make a regular expression that will only work when a valid identifier name is given, using flex (the name cannot start with a number). I'm using this code :
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
%}
%%
"if" { printf("IF "); }
[a-zA-Z_][a-zA-Z_0-9]* { printf("%s ", yytext); }
%%
int main() {
yylex();
}
but it is not working. How can I make sure that flex accepts only a valid identifier?
When I provide the input:
if
abc
9abc
I see the following output:
IF
abc
9abc
but I expected:
IF
abc
(nothing)
Your patterns do not match all possible inputs.
In such cases, (f)lex adds a default catch-all rule, of the form
.|\n { ECHO; }
In other words, any character not recognized by your patterns will simply be printed on stdout. That will be the case with the newline characters in your input, as well as with the digit 9. After the 9 is recognized by the default rule, the remaining input will again be recognized by your identifier rule.
So you probably wanted something like this:
%option warn nodefault
%%
[[:space:]]+ ; /* Ignore whitespace */
"if" { /* TODO: Handle an "if" token */ }
[[:alpha:]_][[:alnum:]_]* { /* TODO: Handle an identifier token */ }
. { /* TODO: Handle an error */ }
Instead of printing information to stdout in an action as a debugging or learning aid, I strongly suggest you use the -d (or --debug) option when you are building your scanner. That will automatically output debugging information in a consistent and complete manner; it would have told you that the default rule was being matched, for example.
Notes:
%option nodefault tells flex not to insert a default rule. I recommend always using it, because it will keep you out of trouble. The warn option ensures that a warning is issued in this case; I think that warn is default flex behaviour but the manual suggests using it and it cannot hurt.
It's good style to use standard character class expressions. Inside a character class ([…]), [:xxx:] matches anything for which the standard library function isxxx would return true. So [[:space:]]+ matches one or more whitespace characters, including space, tab, and newline (and some others), [[:alpha:]_] matches any letter or an underscore, and [[:alnum:]_]* matches any number (including 0) of letters, digits, or underscores. See the Patterns section of the manual.
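The effect of the default rule on the question's input can be reproduced with a short C simulation. This is only a sketch of the behaviour described above, not flex-generated code; the function scan_with_default_echo is a hypothetical name, and it writes to a caller-supplied buffer instead of stdout so the result is easy to inspect.

```c
#include <ctype.h>
#include <string.h>

/* Mimic the question's scanner: "if" prints "IF ", an identifier is
   printed followed by a space, and anything else falls through to an
   ECHO-style default rule that copies one character unchanged.
   `out` must hold at least 2 * strlen(in) + 1 bytes. */
void scan_with_default_echo(const char *in, char *out)
{
    size_t o = 0;

    while (*in != '\0') {
        if (isalpha((unsigned char)*in) || *in == '_') {
            /* longest match for [[:alpha:]_][[:alnum:]_]* */
            size_t len = 1;
            while (isalnum((unsigned char)in[len]) || in[len] == '_')
                len++;
            if (len == 2 && strncmp(in, "if", 2) == 0) {
                memcpy(out + o, "IF ", 3);   /* keyword rule comes first */
                o += 3;
            } else {
                memcpy(out + o, in, len);
                o += len;
                out[o++] = ' ';
            }
            in += len;
        } else {
            out[o++] = *in++;   /* default rule: echo the character */
        }
    }
    out[o] = '\0';
}
```

Feeding it the three input lines if, abc and 9abc produces IF, abc and 9abc again: the 9 and the newlines are echoed by the default branch, and the identifier rule then picks up abc.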

Bison: How to ignore a token if it doesn't fit into a rule

I'm writing a program that handles comments as well as a few other things. If a comment is in a specific place, then my program does something.
Flex passes a token upon finding a comment, and Bison then looks to see if that token fits into a particular rule. If it does, then it takes an action associated with that rule.
Here's the thing: the input I'm receiving might actually have comments in the wrong places. In this case, I just want to ignore the comment rather than flagging an error.
My question:
How can I use a token if it fits into a rule, but ignore it if it doesn't? Can I make a token "optional"?
(Note: The only way I can think of doing this right now is scattering the comment token in every possible place in every possible rule. There MUST be a better solution than this. Maybe some rule involving the root?)
One solution may be to use bison's error recovery (see the Bison manual).
To summarize, bison defines the terminal token error to represent an error (say, a comment token returned in the wrong place). That way, you can (for example) close parentheses or braces after the wayward comment is found. However, this method will probably discard a certain amount of parsing, because I don't think bison can "undo" reductions. ("Flagging" the error, as with printing a message to stderr, is not related to this: you can have an error without printing an error--it depends on how you define yyerror.)
You may instead want to wrap each terminal in a special nonterminal:
term_wrap: comment TERM
This effectively does what you're scared to do (put in a comment in every single rule), but it does it in fewer places.
To force myself to eat my own dog food, I made up a silly language for myself. The only syntax is print <number> please, but if there's (at least) one comment (##) between the number and the please, it prints the number in hexadecimal, instead.
Like this:
print 1 please
1
## print 2 please
2
print ## 3 please
3
print 4 ## please
0x4
print 5 ## ## please
0x5
print 6 please ##
6
My lexer:
%{
#include <stdio.h>
#include <stdlib.h>
#include "y.tab.h"
%}
%%
print return PRINT;
[[:digit:]]+ yylval = atoi(yytext); return NUMBER;
please return PLEASE;
## return COMMENT;
[[:space:]]+ /* ignore */
. /* ditto */
and the parser:
%debug
%error-verbose
%verbose
%locations
%{
#include <stdio.h>
#include <string.h>
void yyerror(const char *str) {
fprintf(stderr, "error: %s\n", str);
}
int yywrap() {
return 1;
}
extern int yydebug;
int main(void) {
yydebug = 0;
yyparse();
}
%}
%token PRINT NUMBER COMMENT PLEASE
%%
commands: /* empty */
|
commands command
;
command: print number comment please {
if ($3) {
printf("%#x", $2);
} else {
printf("%d", $2);
}
printf("\n");
}
;
print: comment PRINT
;
number: comment NUMBER {
$$ = $2;
}
;
please: comment PLEASE
;
comment: /* empty */ {
$$ = 0;
}
|
comment COMMENT {
$$ = 1;
}
;
So, as you can see, not exactly rocket science, but it does the trick. There's a shift/reduce conflict in there, because of the empty string matching comment in multiple places. Also, there's no rule to fit comments in between the final please and EOF. But overall, I think it's a good example.
Treat comments as whitespace at the lexer level.
But keep two separate rules, one for whitespace and one for comments, both returning the same token ID.
The rule for comments (+ optional whitespace) keeps track of the comment in a dedicated structure.
The rule for whitespace resets the structure.
When you enter that “specific place”, look if the last whitespace was a comment or trigger an error.
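One way to picture this approach is a scanner that skips comments together with whitespace and merely remembers whether the gap it just skipped contained one. The C sketch below applies that idea to the print <number> please toy language from the previous answer. It is a hand-written illustration rather than flex/bison code, and the helpers skip_gap, eat_word and run_command are hypothetical names.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Skip whitespace and "##" comment markers; return 1 if the gap
   contained at least one comment, 0 otherwise. */
static int skip_gap(const char **p)
{
    int saw_comment = 0;

    for (;;) {
        if (isspace((unsigned char)**p)) {
            (*p)++;
        } else if ((*p)[0] == '#' && (*p)[1] == '#') {
            saw_comment = 1;
            *p += 2;
        } else {
            return saw_comment;
        }
    }
}

/* Consume `word` if it is next in the input. */
static int eat_word(const char **p, const char *word)
{
    size_t n = strlen(word);

    if (strncmp(*p, word, n) != 0)
        return 0;
    *p += n;
    return 1;
}

/* Format one "print <number> please" command into out (size n).
   Returns 0 on success, -1 on a syntax error. */
int run_command(const char *line, char *out, size_t n)
{
    const char *p = line;
    char *end;
    long value;
    int hex;

    skip_gap(&p);                 /* comments here are simply ignored */
    if (!eat_word(&p, "print"))
        return -1;
    skip_gap(&p);                 /* ignored as well */
    if (!isdigit((unsigned char)*p))
        return -1;
    value = strtol(p, &end, 10);
    p = end;
    hex = skip_gap(&p);           /* the one place where a comment matters */
    if (!eat_word(&p, "please"))
        return -1;
    if (hex)
        snprintf(out, n, "%#lx", (unsigned long)value);
    else
        snprintf(out, n, "%ld", value);
    return 0;
}
```

Only the call that sits at the "specific place" looks at the return value of skip_gap; everywhere else a comment is discarded like ordinary whitespace, so the grammar itself never mentions comments.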

Resources