I'm writing a simple parser for tcpdump logs. Could you please tell me why I can't get the correct line number?
%{
char str[80];
%}
%option yylineno
...
%%
^{HOURS}:{MINUTES}:{MINUTES} if(input()=='.') { strcpy(str, yytext); BEGIN(A); } else {printf("Wrong hour %d", yylineno); }
<A>({NDPS}|{DPS})\.({NDPS}|{DPS})\.({NDPS}|{DPS})|\.{NDPS} printf("Wrong IP!, %d", yylineno);
<A>[ ]{DPS}\.{DPS}\.{DPS}\.{DPS} strcat(str, " from "); strcat(str, yytext+1); BEGIN(B);
...
When I tried this, it turned out that I had to have a rule that actually matches newline for yylineno to be updated. With the following rule it worked, and without it yylineno never changed:
\n { }
Problem
I'm writing a sort of a script language interpreter.
I would like it to be able to handle (ignore) things like shebang, utf-bom, or other such thing that can appear on the beginning of a file.
The problem is that I cannot be sure that my growing grammar won't at some point have a rule that could match one of those things. (It's unlikely but you don't get reliable programs by just ignoring unlikely problems.)
Therefore, I would like to do it properly and ignore those things only if they are at the beginning of a file.
Let's focus on a shebang in the example.
I've written some simple grammar that illustrates the problems I'm facing.
Lexer:
%%
#!.+ { printf("shebang: \"%s\"\n", yytext + 2); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Input file:
#!my-program
# some multiline
thingy #
aaa bbb
ccc#!not a shebang#ddd
eee
Expected output:
shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
Actual output:
thingy: "#!my-program
#"
id: some
id: multiline
id: thingy
thingy: "#
aaa bbb
ccc#"
error: '!'
id: not
id: a
id: shebang
error: '#'
id: ddd
id: eee
My (bad?) solution
I figured that this is a good case to use start conditions.
I managed to use them to write a lexer that does work, however, it's rather ugly:
%s MAIN
%%
<INITIAL>#!.+ { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<INITIAL>""/(?s:.) { BEGIN(MAIN); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
<MAIN>#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Notice that I had to specify the start condition MAIN before the rule #[^#]*#.
It's because it would otherwise collide with the shebang rule #!.+.
Unfortunately, the INITIAL start condition is inclusive, which means I had to specifically exclude from it any rule that would cause problems. I have to remember about it every time I write a new rule (AKA I'll forget about it).
Is there some way to make the INITIAL exclusive or choose a different start condition to be the default?
Here's a simpler solution, assuming you're using Flex (as per your flex-lexer tag):
%option noinput nounput noyywrap nodefault yylineno
%{
#define YY_USER_INIT BEGIN(STARTUP);
%}
%x STARTUP
%%
<STARTUP>#!.* { BEGIN(INITIAL); printf("Shebang: \"%s\"\n", yytext+2); }
<STARTUP>.|\n { BEGIN(INITIAL); yyless(0); }
/* Rest is INITIAL */
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
Test:
rici$ flex -o shebang.c shebang.l
rici$ gcc -Wall -o shebang shebang.c -lfl
rici$ ./shebang <<"EOF"
> #!my-program
> # some multiline
> thingy #
> aaa bbb
> ccc#!not a shebang#ddd
> eee
> EOF
Shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
Notes:
1. The %option line:
   - prevents "Unused function" warnings;
   - removes the need for yywrap;
   - shows an error if there's some possible input which doesn't match any pattern;
   - counts input lines in the global yylineno.
2. The macro YY_USER_INIT is executed precisely once, when the scanner starts up. It executes before any of Flex's initialization code; fortunately, Flex's initialization code does not change the start condition if it's already been set.
3. yyless(0) causes the current token to be rescanned. (The argument doesn't have to be 0; it truncates the current token to that length and efficiently puts the rest back into the input stream.)
4. The library -lfl includes yywrap() (although in this case, it's not used), and a simple main() definition rather similar to the one in your example.
(1) and (2) are Flex extensions. (3) and (4) should be available in any lex which conforms to POSIX, with the exception that the POSIX lex library is linked with -ll.
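To make the effect of yyless(n) more concrete, here is a rough plain-C model of what it does to the scanner's position. This is only a sketch: struct mini_scanner, mini_match and mini_yyless are made-up names for illustration and are not how Flex's buffers are actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a scanner's state; NOT Flex's real internals. */
struct mini_scanner {
    const char *input;  /* the whole input */
    size_t pos;         /* index of the next unread character */
    const char *text;   /* start of the current token (role of yytext) */
    size_t leng;        /* length of the current token (role of yyleng) */
};

/* Pretend some rule just matched the next n characters. */
static void mini_match(struct mini_scanner *s, size_t n)
{
    assert(s->pos + n <= strlen(s->input));
    s->text = s->input + s->pos;
    s->leng = n;
    s->pos += n;
}

/* Analogue of yyless(n): keep the first n characters of the token
 * and hand the remaining ones back to the input. */
static void mini_yyless(struct mini_scanner *s, size_t n)
{
    assert(n <= s->leng);
    s->pos -= s->leng - n;  /* un-read the tail of the token */
    s->leng = n;
}
```

With n equal to 0 the whole token is handed back, so the next match starts at the same character again, which is what the STARTUP fallback rule relies on.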
There is an indirect way to select a different start condition.
The start condition is an integer variable (e.g. INITIAL is 0).
You can get its current value using a macro YY_START and if it equals INITIAL, change it to another value, effectively replacing it.
%x BEGINNING
%s MAIN
%%
%{
if (YY_START == INITIAL)
BEGIN(BEGINNING);
%}
<BEGINNING>#!.+ { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<BEGINNING>""/(?s:.) { BEGIN(MAIN); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
The disadvantages of this solution are:
The code block will execute every time you call yylex (not only the first time, when it's actually needed). The overhead is small enough to be ignored, though.
Flex will still generate the whole state machine as if INITIAL was used. I have no idea if this creates a lot of code or if it would matter in a big parser.
For the code snippet below, the "ASSN: =" block for {EQ} is not triggered for an input of "CC=gcc\n" - I don't understand why this is, the equals character is being passed, as it is being processed by the next rule for {CHAR}.
How can I ensure that the {EQ} rule is processed when the equals character is 'pushed' back by yyless?
The byacc code is pretty much empty with a single dummy rule, but with the relevant %token lines.
%{
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include "y.tab.h"
extern YYSTYPE yylval;
%}
%x ASSIGNMENT
%option noyywrap
DIGIT [0-9]
ALPHA [A-Za-z]
SPACE [ ]
TAB [\t]
WS [ \t]+
NEWLINE (\n|\r|\r\n)
IDENT [A-Za-z_][A-Za-z_0-9]+
EQ =
CHAR [^\r\n]+
%%
<*>"#"{CHAR}{NEWLINE}
({IDENT}{EQ})|({IDENT}{WS}{EQ}) {
yylval.strval = strndup(yytext,
strlen(yytext)-1);
printf("NORM: %s\n", yylval.strval);
yyless(strlen(yytext)-1);
BEGIN(ASSIGNMENT);
return TOK_IDENT;
}
<ASSIGNMENT>{
{EQ} {
printf("ASSN: =\n");
return TOK_ASSIGN;
}
{CHAR} {
printf("ASSN: %s\n", yytext);
return TOK_STRING;
}
{NEWLINE} {
BEGIN(INITIAL);
}
}
{WS}
{NEWLINE}
. {
printf("DOT : %s\n", yytext);
}
<*><<EOF>> {
printf("EOF\n");
return 0;
}
%%
int main()
{
printf("Start\n\n");
int ret;
while( (ret = yylex()) ) {
printf("LEX : %u\n", ret);
}
printf("\nEnd\n");
}
Example output:
Start
NORM: CC
LEX : 257
ASSN: =gcc
LEX : 259
EOF
End
My issue was that flex matches the longest rule first, so {CHAR} was always winning over {EQ}. I solved this by introducing another Start Condition to consume the {EQ}{WS}? before passing to
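
The longest-match behaviour can be illustrated outside of flex. The following is a hand-written sketch in plain C, not generated scanner code: match_eq, match_char and pick_rule are hypothetical helpers that mimic how the {EQ} and {CHAR} rules compete inside the ASSIGNMENT start condition. The longest match wins, and a tie goes to the rule listed first.

```c
#include <assert.h>
#include <stddef.h>

/* Length matched by {EQ} at s: a single '=' or nothing. */
static size_t match_eq(const char *s)
{
    return s[0] == '=' ? 1 : 0;
}

/* Length matched by {CHAR}, i.e. [^\r\n]+, at s. */
static size_t match_char(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && s[n] != '\r' && s[n] != '\n')
        n++;
    return n;
}

/* Returns the index of the winning rule (0 = {EQ}, 1 = {CHAR}),
 * or -1 if neither matches. The match length is stored in *len. */
static int pick_rule(const char *s, size_t *len)
{
    size_t (*rules[])(const char *) = { match_eq, match_char };
    int best = -1;

    assert(s != NULL && len != NULL);
    *len = 0;
    for (int i = 0; i < 2; i++) {
        size_t n = rules[i](s);
        if (n > *len) {  /* strictly longer wins; a tie keeps the earlier rule */
            best = i;
            *len = n;
        }
    }
    return best;
}
```

For the remaining input "=gcc\n", {EQ} can match one character but {CHAR} can match four, so pick_rule reports rule 1 with length 4, which corresponds to the "ASSN: =gcc" line in the output above.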
I am trying to understand how flex works under the hood.
In the following first example, it seems that main() calls yylex() only once, and yylex() generates all the tokens for the entire input.
In the second example, it seems that main() calls yylex() once per token generated, and yylex() generates a token per call.
Does each call to yylex() generate a token or all the tokens for the input?
Why is yylex() called a different number of times in the two examples?
I heard that yylex() is like a coroutine, and each call to it will resume with the rest of the input left from the last call and generate a token. In that sense, how does the first example call yylex() just once and generate all the tokens in the input?
/* just like Unix wc */
%{
int chars = 0;
int words = 0;
int lines = 0;
%}
%%
[a-zA-Z]+ { words++; chars += strlen(yytext); }
\n { chars++; lines++; }
. { chars++; }
%%
main(int argc, char **argv)
{
yylex();
printf("%8d%8d%8d\n", lines, words, chars);
}
$ ./a.out
The boy stood on the burning deck
shelling peanuts by the peck
^D
2 12 63
$
and
/* recognize tokens for the calculator and print them out */
%{
enum yytokentype {
NUMBER = 258,
ADD = 259,
SUB = 260,
MUL = 261,
DIV = 262,
ABS = 263,
EOL = 264
};
int yylval;
%}
%%
"+" { return ADD; }
"-" { return SUB; }
"*" { return MUL; }
"/" { return DIV; }
"|" { return ABS; }
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
\n { return EOL; }
[ \t] { /* ignore whitespace */ }
. { printf("Mystery character %c\n", *yytext); }
%%
main(int argc, char **argv)
{
int tok;
while(tok = yylex()) {
printf("%d", tok);
if(tok == NUMBER) printf(" = %d\n", yylval);
else printf("\n");
}
}
$ ./a.out
a / 34 + |45
Mystery character a
262
258 = 34
259
263
258 = 45
264
^D
$
Flex doesn't decide when the scanner will return (except for the default EOF rule). The scanner which it builds performs lexical actions in a loop until some action returns. So it is entirely up to you how you want to structure your scanner.
However, the classic yyparse/yylex processing model consists of the parser calling yylex() every time it needs a new token. So it expects yylex() to return immediately once it finds a token.
In your first code example, there is no parser and the scanner action is limited to printing out the token. While the example is perfectly correct, relying on the scanner loop to repeatedly execute actions, I'd prefer the second model even if you don't (yet) intend to add a parser, because it will make it easier to decouple token handling from token generation.
That doesn't mean that every lexical action will contain a return statement, though. Some lexical patterns correspond to non-tokens (comments and whitespace, for example), and the corresponding action will most likely do nothing (other than possibly recording input position) so that the scanner will continue to search for a token to return.
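As a rough illustration of that loop, here is a hand-written stand-in for a generated scanner. It is a sketch, not Flex output: next_token, scan_string and token_value are names invented for this example, and the token codes simply reuse the values from the calculator example in the question.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token codes as in the calculator example; 0 means end of input. */
enum { TOK_EOF = 0, TOK_NUMBER = 258, TOK_ADD = 259 };

static const char *cursor;  /* next unread character */
static int token_value;     /* plays the role of yylval */

/* Point the scanner at a NUL-terminated string. */
static void scan_string(const char *text)
{
    assert(text != NULL);
    cursor = text;
}

/* Like yylex(): keeps matching and running actions in a loop,
 * and only hands control back when an action returns. */
static int next_token(void)
{
    for (;;) {
        unsigned char c = (unsigned char)*cursor;

        if (c == '\0')
            return TOK_EOF;            /* end of input */
        if (isdigit(c)) {
            token_value = 0;
            while (isdigit((unsigned char)*cursor))
                token_value = token_value * 10 + (*cursor++ - '0');
            return TOK_NUMBER;         /* action with a return */
        }
        cursor++;
        if (c == '+')
            return TOK_ADD;            /* action with a return */
        /* whitespace and anything else: action without a return,
         * so the loop simply continues with the next character */
    }
}
```

A caller that wants every token calls next_token() in a while loop, as in the second example. If none of the actions returned, a single call would consume the whole input, as in the first example.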
(F)lex scanners are not easy to make into coroutines, so if a coroutine is really required (for example, to incrementally parse an asynchronous input), then another tool might be preferred.
Bison does offer the possibility to generate a "push parser" in which the scanner calls the parser every time it finds a token, rather than returning to the parser. But neither the "push" nor the traditional "pull" model has anything to do with coroutines, IMHO, and the use of the word to describe parser/scanner interaction strikes me as imprecise and unuseful (although I have a lot of respect for the author you might be quoting).
I'm trying to modify a flex+bison generator to allow the inclusion of code snippets denoted by surrounding '{{' and '}}'. Unlike the multi-line comment case, I must capture all of the content.
My attempts either fail in the case where the '{{' and the '}}' are on the same line or they are painfully slow.
My first attempt was something like this:
%{
#include <stdio.h>
// sscce implementation of a growing string buffer
char codeBlock[4096];
int codeOffset;
const char* curFilename = "file.l";
extern int yylineno;
void add_code_line(const char* yytext)
{
codeOffset += sprintf(codeBlock + codeOffset, "#line %u \"%s\"\n\t%s\n", yylineno, curFilename, yytext);
}
%}
%option stack
%option yylineno
%x CODE_FRAG
%%
"{{"[ \n]* { codeOffset = 0; yy_push_state(CODE_FRAG); }
<CODE_FRAG>"}}" { codeBlock[codeOffset] = 0; printf("// code\n%s\n", codeBlock); yy_pop_state(); }
<CODE_FRAG>[^\n]* { add_code_line(yytext); }
<CODE_FRAG>\n
\n
.
Note: the "codeBlock" implementation is a contrivance for the purpose of an SSCCE only. It's not what I'm actually using.
This works for a simple test case:
{{ from line 1
from line 2
}}
{{
from line 7
}}
Outputs
// code
#line 1 "file.l"
from line 1
#line 2 "file.l"
from line 2
// code
#line 7 "file.l"
from line 7
But it can't handle
{{ hello }}
The two solutions I can think of are:
/* capture character-by-character */
<CODE_FRAG>. { add_code_character(yytext[0]); }
And
<INITIAL>"{{".*?"}}" { int n = strlen(yytext); yytext[n - 2] = 0; add_code(yytext + 2); }
The former seems likely to be slow, and the latter just feels wrong.
Any ideas?
--- EDIT ---
The following appears to achieve the result desired, but I'm not sure if it's a "good" Flex way to do this:
"{{"[ \n]* { codeOffset = 0; yy_push_state(CODE_FRAG); }
<CODE_FRAG>"}}" { codeBlock[codeOffset] = 0; printf("// code\n%s\n", codeBlock); yy_pop_state(); }
<CODE_FRAG>.*?/"}}" { add_code_line(yytext); }
<CODE_FRAG>.*? { add_code_line(yytext); }
<CODE_FRAG>\n
Flex doesn't implement non-greedy matches. So .*? won't work the way you expect it to in flex. (It will be an optional .*, which is indistinguishable from .*)
Here's a regular expression which will match from {{ as far as possible without a }}:
"{{"([}]?[^}])*
That might not be what you want, since it won't allow nested {{...}} within your code blocks. However, you didn't mention that as a requirement and none of your examples functions that way.
The above regular expression does not match the closing }}, which appears to be what you want since it lets you call add_code(yytext+2) without modifying the temporary buffer. However, you do need to deal with the }} in your action. See below.
The regular expression above will match to the end of the file if there is no matching }}. You probably want to deal with that as an error; the simplest way is to check whether EOF is encountered while you are trying to ignore the }}:
"{{"([}]?[^}])* { add_code(yytext+2);
if (input() == EOF || input() == EOF) {
/* Produce an error, unclosed {{ */
}
}
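To see what the ([}]?[^}])* part of the pattern consumes, here is a small plain-C sketch of the same matching logic. The function name code_body_length is made up for this illustration, and it works on a NUL-terminated string positioned just after the opening {{ rather than on Flex's input buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the longest prefix of s matched by ([}]?[^}])*,
 * where s points just past the opening "{{". */
static size_t code_body_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (;;) {
        if (s[n] != '\0' && s[n] != '}') {
            n += 1;    /* [^}] on its own */
        } else if (s[n] == '}' && s[n + 1] != '\0' && s[n + 1] != '}') {
            n += 2;    /* a single } followed by [^}] */
        } else {
            return n;  /* stopped at }}, at a trailing }, or at end of input */
        }
    }
}
```

For the input " hello }} rest" this returns 7, the length of " hello ", so the scan stops right before the closing }}. A lone } inside the body (as in "a}b}}") is consumed together with the character that follows it.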
Do bison and flex allow user to natively localize error messages?
For example, I would like to translate following message: syntax error, unexpected NUMBER, expecting $end to other language and replace NUMBER/$end with something more human-readable.
Use yyerror and YY_USER_ACTION for additional data.
void yyerror(const char *s) {
sprintf(dummmy, "%s line %d col %d word '%s'\n", s, myline, mycolumn, yytext);
print_error(dummmy);
}
In the lex file:
#define YY_USER_ACTION \
addme(yy_start, yytext); \
mycolumn += yyleng;\
if(*yytext == '\n') { myline++; mycolumn = 0; } else 0; \