Is there a way to localize error messages from bison/flex? - localization

Do bison and flex allow user to natively localize error messages?
For example, I would like to translate following message: syntax error, unexpected NUMBER, expecting $end to other language and replace NUMBER/$end with something more human-readable.

Use yyerror and YY_USER_ACTION for additional data.
void yyerror(const char *s) {
sprintf(dummmy, "%s line %d col %d word '%s'\n", s, myline, mycolumn, yytext);
print_error(dummmy);
in the lex file
#define YY_USER_ACTION \
addme(yy_start, yytext); \
mycolumn += yyleng;\
if(*yytext == '\n') { myline++; mycolumn = 0; } else 0; \

Related

Match rule only at the begining of a file

Problem
I'm writing a sort of a script language interpreter.
I would like it to be able to handle (ignore) things like shebang, utf-bom, or other such thing that can appear on the beginning of a file.
The problem is that I cannot be sure that my growing grammar won't at some point have a rule that could match one of those things. (It's unlikely but you don't get reliable programs by just ignoring unlikely problems.)
Therefore, I would like to do it properly and ignore those things only if they are at the beginning of a file.
Let's focus on a shebang in the example.
I've written some simple grammar that illustrates the problems I'm facing.
Lexer:
%%
#!.+ { printf("shebang: \"%s\"\n", yytext + 2); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Input file:
#!my-program
# some multiline
thingy #
aaa bbb
ccc#!not a shebang#ddd
eee
Expected output:
shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
Actual output:
thingy: "#!my-program
#"
id: some
id: multiline
id: thingy
thingy: "#
aaa bbb
ccc#"
error: '!'
id: not
id: a
id: shebang
error: '#'
id: ddd
id: eee
My (bad?) solution
I figured that this is a good case to use start conditions.
I managed to use them to write a lexer that does work, however, it's rather ugly:
%s MAIN
%%
<INITIAL>#!.+ { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<INITIAL>""/(?s:.) { BEGIN(MAIN); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
<MAIN>#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Notice that I had to specify the start condition MAIN before the rule #[^#]*#.
It's because it would otherwise collide with the shebang rule #!.+.
Unfortunately, the INITIAL start condition is inclusive, which means I had to specifically exclude from it any rule that would cause problems. I have to remember about it every time I write a new rule (AKA I'll forget about it).
Is there some way to make the INITIAL exclusive or choose a different start condition to be the default?
Here's a simpler solution, assuming you're using Flex (as per your flex-lexer tag):
%option noinput nounput noyywrap nodefault yylineno
%{
#define YY_USER_INIT BEGIN(STARTUP);
%}
%x STARTUP
%%
<STARTUP>#!.* { BEGIN(INITIAL); printf("Shebang: \"%s\"\n", yytext+2); }
<STARTUP>.|\n { BEGIN(INITIAL); yyless(0); }
/* Rest is INITIAL */
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
Test:
rici$ flex -o shebang.c shebang.l
rici$ gcc -Wall -o shebang shebang.c -lfl
rici$ ./shebang <<"EOF"
> #!my-program
> # some multiline
> thingy #
> aaa bbb
> ccc#!not a shebang#ddd
> eee
> EOF
Shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
Notes:
The %option line:
prevents "Unused function" warnings;
removes the need for yywrap;
shows an error if there's some possible input which doesn't match any pattern;
counts input lines in the global yylineno
The macro YY_USER_INIT is executed precisely once, when the scanner starts up. It executes before any of Flex's initialization code; fortunately, Flex's initialization code does not change the start condition if it's already been set.
yyless(0) causes the current token to be rescanned. (The argument doesn't have to be 0; it truncates the current token to that length and efficiently puts the rest back into the input stream.)
The library -lfl includes yywrap() (although in this case, it's not used), and a simple main() definition rather similar to the one in your example.
(1) and (2) are Flex extensions. (3) and (4) should be available in any lex which conforms to Posix, with the exception that the Posix lex libary is linked with -ll.
There is an indirect way to select a different start condition.
The start condition is an integer variable (e.g. INITIAL is 0).
You can get its current value using a macro YY_START and if it equals INITIAL, change it to another value, effectively replacing it.
%x BEGINNING
%s MAIN
%%
%{
if (YY_START == INITIAL)
BEGIN(BEGINNING);
%}
<BEGINNING>#!.+ { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<BEGINNING>""/(?s:.) { BEGIN(MAIN); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
The disadvantages of this solution are:
The code block will execute every time you call yylex (not only the first time, when it's actually needed). The overhead is small enough to be ignored, though.
Flex will still generate the whole state machine as if INITIAL was used. I have no idea if this creates a lot of code or if it would matter in a big parser.

Match this text but only if it comes at the end of the file

Page 78 of the Flex user's manual says:
There is no way to write a rule which is "match this text, but only if
it comes at the end of the file". You can fake it, though, if you
happen to have a character lying around that you don't allow in your
input. Then you can redefine YY_INPUT to call your own routine which,
if it sees an EOF, returns the magic character first (and remembers to
return a real EOF next time it's called.
I am trying to implement that approach. In fact I managed to get it working (see below). For this input:
Hello, world
How are you?#
I get this (correct) output:
Here's some text Hello, world
Saw this string at EOF How are you?
But I had to do two things in my implementation to get it to work; two things that I shouldn't have to do:
I had to call yyterminate(). If I don't call yyterminate() then the output is this:
Here's some text Hello, world
Saw this string at EOF How are you?
Saw this string at EOF
I shouldn't be getting that last line. Why am I getting that last line?
I don't understand why I had to do this: tmp[yyleng-1] = '\0'; (subtract 1). I should be able to do this: tmp[yyleng] = '\0'; (not subtract 1) Why do I need to subtract 1?
%option noyywrap
%{
int sawEOF = 0;
#define YY_INPUT(buf,result,max_size) \
{ \
if (sawEOF == 1) \
result = YY_NULL; \
else { \
int c = fgetc(yyin); \
if (c == EOF) { \
sawEOF = 1; \
buf[0] = '#'; \
result = 1; \
} \
else { \
buf[0] = c; \
result = 1; \
} \
} \
}
%}
EOF_CHAR #
%%
[^\n#]*{EOF_CHAR} { char *tmp = strdup(yytext);
tmp[yyleng-1] = '\0';
printf("Saw this string at EOF %s\n", tmp);
yyterminate();
}
[^\n#]+ { printf("Here's some text %s\n", yytext); }
\n { }
%%
int main(int argc, char *argv[])
{
yyin = fopen(argv[1], "r");
yylex();
fclose(yyin);
return 0;
}

Processing character returned by yyless within a start condition in yacc

For the code snippet below, the "ASSN: =" block for {EQ} is not triggered for an input of "CC=gcc\n" - I don't understand why this is, the equals character is being passed, as it is being processed by the next rule for {CHAR}.
How can I ensure that the {EQ} rule for is processed when the equals character is 'pushed' back by yyless?
The byacc code is pretty much empty with a single dummy rule, but with the relevant %token lines.
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include "y.tab.h"
extern YYSTYPE yylval;
%}
%x ASSIGNMENT
%option noyywrap
DIGIT [0-9]
ALPHA [A-Za-z]
SPACE [ ]
TAB [\t]
WS [ \t]+
NEWLINE (\n|\r|\r\n)
IDENT [A-Za-z_][A-Za-z_0-9]+
EQ =
CHAR [^\r\n]+
%%
<*>"#"{CHAR}{NEWLINE}
({IDENT}{EQ})|({IDENT}{WS}{EQ}) {
yylval.strval = strndup(yytext,
strlen(yytext)-1);
printf("NORM: %s\n", yylval.strval);
yyless(strlen(yytext)-1);
BEGIN(ASSIGNMENT);
return TOK_IDENT;
}
<ASSIGNMENT>{
{EQ} {
printf("ASSN: =\n");
return TOK_ASSIGN;
}
{CHAR} {
printf("ASSN: %s\n", yytext);
return TOK_STRING;
}
{NEWLINE} {
BEGIN(INITIAL);
}
}
{WS}
{NEWLINE}
. {
printf("DOT : %s\n", yytext);
}
<*><<EOF>> {
printf("EOF\n");
return 0;
}
%%
int main()
{
printf("Start\n\n");
int ret;
while( (ret = yylex()) ) {
printf("LEX : %u\n", ret);
}
printf("\nEnd\n");
}
Example output:
Start
NORM: CC
LEX : 257
ASSN: =gcc
LEX : 259
EOF
End
My issue was that flex matches the longest rule first, so {CHAR} was always winning over {EQ}. I solved this by introducing another Start Condition to consume the {EQ}{WS}? before passing to

Does each call to `yylex()` generate a token or all the tokens for the input?

I am trying to understand how flex works under the hood.
In the following first example, it seems that main() calls yylex() only once, and yylex() generates all the tokens for the entire input.
In the second example, it seems that main() calls yylex() once per token generated, and yylex() generates a token per call.
Does each call to yylex() generate a token or all the tokens for the input?
Why is yylex() called different number of times in the two examples?
I heard that yylex() is like a coroutine, and each call to it will resume with the rest of the input left from last call and generate a token. In that sense, how does the first example calls yylex() just once and generate all the tokens in the input?
/* just like Unix wc */
%{
int chars = 0;
int words = 0;
int lines = 0;
%}
%%
[a-zA-Z]+ { words++; chars += strlen(yytext); }
\n { chars++; lines++; }
. { chars++; }
%%
main(int argc, char **argv)
{
yylex();
printf("%8d%8d%8d\n", lines, words, chars);
}
$ ./a.out
The boy stood on the burning deck
shelling peanuts by the peck
^D
2 12 63
$
and
/* recognize tokens for the calculator and print them out */
%{
enum yytokentype {
NUMBER = 258,
ADD = 259,
SUB = 260,
MUL = 261,
DIV = 262,
ABS = 263,
EOL = 264
};
int yylval;
%}
%%
"+" { return ADD; }
"-" { return SUB; }
"*" { return MUL; }
"/" { return DIV; }
"|" { return ABS; }
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
\n { return EOL; }
[ \t] { /* ignore whitespace */ }
. { printf("Mystery character %c\n", *yytext); }
%%
main(int argc, char **argv)
{
int tok;
while(tok = yylex()) {
printf("%d", tok);
if(tok == NUMBER) printf(" = %d\n", yylval);
else printf("\n");
}
}
$ ./a.out
a / 34 + |45
Mystery character a
262
258 = 34
259
263
258 = 45
264
^D
$
Flex doesn't decide when the scanner will return (except for the default EOF rule). The scanner which it builds performs lexical actions in a loop until some action returns. So it is entirely up to you how you want to structure your scanner.
However, the classic yyparse/yylex processing model consists of the parser calling yylex() every time it needs a new token. So it expects yylex() to return immediately once it finds a token.
In your first code example, there is no parser and the scanner action is limited to printing out the token. While the example is perfectly correct, relying on the scanner loop to repeatedly execute actions, I'd prefer the second model even if you don't (yet) intend to add a parser, because it will make it easier to decouple token handling from token generation.
That doesn't mean that every lexical action will contain a return statement, though. Some lexical patterns correspond to non-tokens (comments and whitespace, for example), and the corresponding action will most likely do nothing (other than possibly recording input position) so that the scanner will continue to search fir a token to return.
(F)lex scanners are not easy to make into coroutines, so if a coroutine is really required (for example, to incrementally parse a asynchronous input), then another tool might be preferred.
Bison does offer the possiblity to generate a "push parser" in which the scanner calls the parser every time it finds a token, rather than returning to the parser. But neither the "push" nor the traditional "pull" model have anything to do with coroutines, IMHO, and the use of the word to describe parser/scanner interaction strikes me as imprecise and unuseful (although I have a lot of respect for the author you might be quoting.)

Flex yylineno set to 1

I'm writing a simple parser for tcpdump logs, could you please tell me why I can't get proper line number?
%{
char str[80];
%}
%option yylineno
...
%%
^{HOURS}:{MINUTES}:{MINUTES} if(input()=='.') { strcpy(str, yytext); BEGIN(A); } else {printf("Wrong hour %d", yylineno); }
<A>({NDPS}|{DPS})\.({NDPS}|{DPS})\.({NDPS}|{DPS})|\.{NDPS} printf("Wrong IP!, %d", yylineno);
<A>[ ]{DPS}\.{DPS}\.{DPS}\.{DPS} strcat(str, " from "); strcat(str, yytext+1); BEGIN(B);
...
When I tried this, it turned out that I had to have a rule that actually matches newline for yylineno to be updated. With the following rule it worked, and without it yylineno never changed:
\n { }

Resources