Parsing bibtex with bison

I am a novice. I want to parse a bibtex file using flex/bison. A sample
bibtex file is:
#Book{a1,
author="amook",
Title="ASR",
Publisher="oxf",
Year="2010",
Add="UK",
Edition="1",
}
#Article{a2,
Author="Rudra Banerjee",
Title={FeNiMo},
Publisher={P{\"R}B},
Issue="12",
Page="36690",
Year="2011",
Add="UK",
Edition="1",
}
and for parsing this I have written the following code:
%{
#include <stdio.h>
#include <stdlib.h>
%}
%{
char yylval;
int YEAR,i;
//char array_author[1000];
%}
%x author
%x title
%x pub
%x year
%%
#                               printf("\nNEWENTRY\n");
[a-zA-Z][a-zA-Z0-9]*            {printf("%s",yytext);
                                        BEGIN(INITIAL);}
author=                         {BEGIN(author);}
<author>\"[a-zA-Z\/.]+\"        {printf("%s",yytext);
                                        BEGIN(INITIAL);}
year=                           {BEGIN(year);}
<year>\"[0-9]+\"                {printf("%s",yytext);
                                        BEGIN(INITIAL);}
title=                          {BEGIN(title);}
<title>\"[a-zA-Z\/.]+\"         {printf("%s",yytext);
                                        BEGIN(INITIAL);}
publisher=                      {BEGIN(pub);}
<pub>\"[a-zA-Z\/.]+\"           {printf("%s",yytext);
                                        BEGIN(INITIAL);}
[a-zA-Z0-9\/.-]+=        printf("ENTRY TYPE ");
\"                      printf("QUOTE ");
\{                      printf("LCB ");
\}                      printf(" RCB");
;                       printf("SEMICOLON ");
\n                      printf("\n");
%%
int main(){
  yylex();
//char array_author[1000];
//printf("%d%s",&i,array_author[i]);
i++;
return 0;
}
The problem is that I want to separate each key and val into different
variables and store them somewhere (maybe in an array).
Can I have some insight?

If I'd seen this question a year ago I would have made a contemporaneous comment so the question could be improved. The code supplied is not a parser, but regular expressions coded for flex only. Scanning an input file for tokens using regular expressions is but a part of building a parser. No grammar or structure for the bibtex file has been defined for bison.
If separating the key and val is all that is required, that could be done much more easily with tools like awk and sed than with flex. One thing I'd point out is that the vals always follow an equals sign, which makes them easy to identify without any special syntactic jiggery-pokery.
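As a rough illustration of that point, here is a minimal C sketch (plain C, not flex) that splits one field line at the first equals sign and strips one surrounding pair of quotes or braces. The name split_field and the buffer handling are assumptions made for this example only, and it copes only with single-line fields like those in the sample.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split a line such as  Title={FeNiMo},  into key and value.
 * Returns 1 on success, 0 if the line has no '=' or a buffer is too small. */
int split_field(const char *line, char *key, size_t keysz,
                char *val, size_t valsz)
{
    const char *eq = strchr(line, '=');
    if (eq == NULL)
        return 0;

    /* Key: text before '=', without surrounding whitespace. */
    const char *kb = line;
    const char *ke = eq;
    while (kb < ke && isspace((unsigned char)*kb)) kb++;
    while (ke > kb && isspace((unsigned char)ke[-1])) ke--;
    size_t klen = (size_t)(ke - kb);
    if (klen == 0 || klen >= keysz)
        return 0;
    memcpy(key, kb, klen);
    key[klen] = '\0';

    /* Value: text after '=', minus trailing whitespace and one trailing comma. */
    const char *vb = eq + 1;
    const char *ve = vb + strlen(vb);
    while (vb < ve && isspace((unsigned char)*vb)) vb++;
    while (ve > vb && isspace((unsigned char)ve[-1])) ve--;
    if (ve > vb && ve[-1] == ',') ve--;
    while (ve > vb && isspace((unsigned char)ve[-1])) ve--;

    /* Strip one pair of delimiters: "..." or {...}. */
    if (ve - vb >= 2 &&
        ((vb[0] == '"' && ve[-1] == '"') || (vb[0] == '{' && ve[-1] == '}'))) {
        vb++;
        ve--;
    }
    size_t vlen = (size_t)(ve - vb);
    if (vlen >= valsz)
        return 0;
    memcpy(val, vb, vlen);
    val[vlen] = '\0';
    return 1;
}
```

Calling split_field once per input line and appending each pair to an array of structs would give the "store it in some place" part of the question; lines without an equals sign (the entry header and the closing brace) are simply rejected.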
As we have no information about why the bibtex file needs to be parsed, or about the ultimate goal of the exercise, it's hard to see what the best approach would be.
Edit: This question is a duplicate, as the OP asked it again and it was answered: parse bibtex with flex+bison: revisited

Related

Eliminate characters between numbers in Lex code

How can I eliminate characters between two or more integer numbers in lex code?
Example input: 12bd35
Desired output: 12 35
Lex builds lexical analyzers, which are intended to split the input into separate tokens. Once you recognize a token, you can ignore it, which is somewhat similar to "eliminating characters". But you always need to recognise them.
So you might start with the following minimalist scanner:
%option noinput nounput noyywrap
%%
[[:digit:]]+ { ECHO; fputc(' ', yyout); } /* print numbers. */
[^[:digit:]]+ ; /* ignore everything else. */
And then modify it to fit your actual need.
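For readers who want to see the same behaviour without running flex, the two rules can be mimicked in plain C. This is only a sketch: the name keep_numbers and the output-buffer interface are assumptions for the example, not part of the answer above.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper mirroring the two scanner rules: copy each run of
 * digits to out followed by a space, and drop every other character.
 * Output that does not fit in outsz bytes is silently truncated.
 * Returns the number of bytes written (excluding the terminating NUL). */
size_t keep_numbers(const char *in, char *out, size_t outsz)
{
    size_t n = 0;
    if (outsz == 0)
        return 0;
    while (*in != '\0') {
        if (isdigit((unsigned char)*in)) {
            /* Rule 1: echo the whole digit run, then a space. */
            while (isdigit((unsigned char)*in)) {
                if (n + 1 < outsz) out[n++] = *in;
                in++;
            }
            if (n + 1 < outsz) out[n++] = ' ';
        } else {
            /* Rule 2: ignore everything else. */
            in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Like the scanner, this leaves a trailing space after the last number, so the input 12bd35 yields "12 35 ".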

Valid regular expression for identifier using flex

I'm trying to make a regular expression that will only work when a valid identifier name is given, using flex (the name cannot start with a number). I'm using this code :
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
%}
%%
"if" { printf("IF "); }
[a-zA-Z_][a-zA-Z_0-9]* { printf("%s ", yytext); }
%%
int main() {
yylex();
}
but it is not working. How can I make sure that flex accepts only a valid identifier?
When I provide the input:
if
abc
9abc
I see the following output:
IF
abc
9abc
but I expected:
IF
abc
(nothing)
Your patterns do not match all possible inputs.
In such cases, (f)lex adds a default catch-all rule, of the form
.|\n { ECHO; }
In other words, any character not recognized by your patterns will simply be printed on stdout. That will be the case with the newline characters in your input, as well as with the digit 9. After the 9 is recognized by the default rule, the remaining input will again be recognized by your identifier rule.
So you probably wanted something like this:
%option warn nodefault
%%
[[:space:]]+ ; /* Ignore whitespace */
"if" { /* TODO: Handle an "if" token */ }
[[:alpha:]_][[:alnum:]_]* { /* TODO: Handle an identifier token */ }
. { /* TODO: Handle an error */ }
Instead of printing information to stdout in an action as a debugging or learning aid, I strongly suggest you use the -d (or --debug) option when you are building your scanner. That will automatically output debugging information in a consistent and complete manner; it would have told you that the default rule was being matched, for example.
Notes:
%option nodefault tells flex not to insert a default rule. I recommend always using it, because it will keep you out of trouble. The warn option ensures that a warning is issued in this case; I think that warn is default flex behaviour but the manual suggests using it and it cannot hurt.
It's good style to use standard character class expressions. Inside a character class ([…]), [:xxx:] matches anything for which the standard library function isxxx would return true. So [[:space:]]+ matches one or more whitespace characters, including space, tab, and newline (and some others), [[:alpha:]_] matches any letter or an underscore, and [[:alnum:]_]* matches any number (including 0) of letters, digits, or underscores. See the Patterns section of the manual.
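To make the correspondence between the character class expressions and the isxxx functions concrete, here is a small C sketch of a whole-string check equivalent to [[:alpha:]_][[:alnum:]_]*. The function name is_identifier is an assumption for this example; a scanner would of course match inside a longer input rather than test a complete string.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if s matches [[:alpha:]_][[:alnum:]_]*
 * from its first character to its last, and 0 otherwise. */
int is_identifier(const char *s)
{
    /* First character: [[:alpha:]_] (this also rejects the empty string). */
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;
    /* Remaining characters: [[:alnum:]_]* */
    for (s++; *s != '\0'; s++) {
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;
    }
    return 1;
}
```

With this check, abc and _x1 are accepted while 9abc is rejected as a whole, which is the behaviour the question was hoping for. Note that if also passes the identifier pattern; keywords still need their own rule, as in the scanner above.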

rule exclusion in flex

I am trying to write a flex file which recognizes (-! comment !-) as one token called comment. The following is my file:
%{
#include <stdio.h>
void showToken(char* name);
void error();
void enter();
int lineNum=1;
%}
%option yylineno
%option noyywrap
whitespace ([\t ])
enter ([\n])
startcomment (\(\-\!)
endcomment (\!\-\))
comment (^\!\-\))
%%
{startcomment}{comment}*{endcomment} showToken("COMMENT");
{enter} enter();
{whitespace}
. error();
%%
void showToken(char* name){
printf("%d %s %s %d% \n",lineNum,name, yytext);
}
void enter(){
lineNum++;
}
void error(){
printf("%d error %s \n",lineNum,yytext);
}
But it fails for a simple (-! comment !-) input: the scanner does recognize (-! and !-) but fails to match my comment rule. I tried replacing it with comment (^{endcomment}), but that did not work. Any suggestions?
You seem to think that ^ means the following pattern should not match, but it means to match the start of a line. As the first character inside a character class ([^…]), ^ does negate the class, so that it matches everything except the listed characters, but outside a character class its meaning is totally different.
As for an alternative: your problem is similar to matching a C comment, /* comment */. The following expression matches a C comment:
"/*"([^*]|"*"+[^/*])*"*"+"/"
Alternatively, and more intuitively (if you like), you can use a sub-automaton:
%x comment
%%
"/*" { BEGIN(comment); }
<comment>(.|"\n") { /* Skip */ }
<comment>"*/" { BEGIN(INITIAL); }
%%
I'll leave it as an exercise to apply this to your comment style. Having !-) as the closing of your comment, makes the first solution a bit more complicated.
Note that in general the second solution is preferred because it does not require a big buffer. The first solution will create a buffer containing the complete comment (which can be big), whereas the second solution needs to buffer at most two characters at a time.
The easiest way to maintain line-numbers is using the %option yylineno as flex will then keep track of line-numbers in the variable int yylineno. Alternatively you can count the number of new-lines in yytext. In the second solution you can split the second rule and make a separate case for "\n" and count line-numbers there.
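The sub-automaton idea carries over to the (-! ... !-) comment style fairly directly. The following is a hand-written C sketch of the same two-state logic, not a flex solution; the name strip_comments and its interface are assumptions for the example. The in_comment flag plays the role of the comment start condition, and newlines are counted in both states.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src to dst with every (-! ... !-) comment removed.
 * *lines receives the number of '\n' characters seen, inside comments or not.
 * Returns 0 on success, -1 if a comment is still open at end of input.
 * dst must have room for strlen(src) + 1 bytes. */
int strip_comments(const char *src, char *dst, int *lines)
{
    int in_comment = 0;      /* plays the role of the <comment> start condition */
    size_t n = 0;
    *lines = 0;
    while (*src != '\0') {
        if (*src == '\n')
            (*lines)++;
        if (!in_comment && strncmp(src, "(-!", 3) == 0) {
            in_comment = 1;  /* like BEGIN(comment) */
            src += 3;
        } else if (in_comment && strncmp(src, "!-)", 3) == 0) {
            in_comment = 0;  /* like BEGIN(INITIAL) */
            src += 3;
        } else {
            if (!in_comment)
                dst[n++] = *src;   /* ordinary text is kept */
            src++;                 /* comment text is skipped */
        }
    }
    dst[n] = '\0';
    return in_comment ? -1 : 0;
}
```

Because the closing delimiter is checked before a single character is skipped, only three characters of lookahead are ever needed, however long the comment is.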

Replace number expression with flex

I use Flex to replace numeric expressions in source code.
For instance:
Input string: ... echo "test"; if ($isReady) $variable = 2 * 5; ...
Desired result string: ... echo "test"; if ($isReady) $variable = 10; ...
My code:
%{
#include <stdio.h>
#include <stdlib.h>
%}
MYEXP [0-9]+[ \t\n\r]*\+[ \t\n\r]*[0-9]+
%%
{MYEXP} {
printf("multiplication ");
// code for processing
}
%%
void main()
{
yylex();
}
How can I process multiplication with Flex? Or I have to process with C language?
Some of the answers are in the comments, but the question has not yet been closed with an answer in two years. I thought some notes, for the purposes of completion, would be useful for people who are thinking of things like this in the future.
A simple arithmetic expression of the form shown in the question can be recognised by a tool like flex, which matches regular expressions using an FSA (Finite State Automaton, also called an FSM or Finite State Machine). This works when the syntax is as simple as id + id, but fails when the expressions become more complex. Handling operator precedence in id + id * id and nested parentheses in something like ((id + id) * (id + id)) means that a regular grammar can no longer work; a context-free grammar is required. (Computer Science students should know this from the Chomsky hierarchy of formal languages.) So the operations can only be performed in flex for the simplest forms of expression.
The replacement of simple expressions, which only contain constants, is an optimisation called constant folding and is performed by most compilers as standard. Performing this as a form of pre-processing on most code will not produce any improvement. So when proposing to write tools to do a job like this you have to reflect on whether it is essential or not!
Now down to the actual details of the question, which have been picked up in the comments; yes, a rule will be needed for each operator, addition and multiplication; and when matched a substring will be needed to pick up the operands. It will look something like this:
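%{
#include <stdio.h>
#include <stdlib.h> /* atoi */
#include <string.h> /* strchr */
%}
MYplusEXP [0-9]+[ \t\n\r]*\+[ \t\n\r]*[0-9]+
MYmultEXP [0-9]+[ \t\n\r]*\*[ \t\n\r]*[0-9]+
%%
    char left[20]; char *right; /* indented so flex copies this into yylex() */
{MYplusEXP} {right = strchr(yytext,'+'); /* yytext is already terminated with \0 */
    /* copy the left operand (truncated if too long) and terminate it */
    snprintf(left,sizeof left,"%.*s",(int)(right-yytext),yytext);
    printf("%d",atoi(left)+atoi(right+1)); /* right+1 skips the operator itself */
}
{MYmultEXP} {right = strchr(yytext,'*');
    snprintf(left,sizeof left,"%.*s",(int)(right-yytext),yytext);
    printf("%d",atoi(left)*atoi(right+1));
}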
However, I feel a bit dirty after doing that pointer arithmetic.
In summary, it might be better done with other tools or not at all!
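If the folding is done in C after flex has matched the text, strtol can pick out both operands without any copying. The following is a minimal sketch under that assumption; the name fold_constant is hypothetical, only + and * are handled, overflow is not checked, and anything after the right operand is ignored.

```c
#include <stdlib.h>

/* Hypothetical helper: evaluate "<int> <op> <int>" where op is + or *.
 * Returns 1 and stores the result on success, 0 if the text has another shape.
 * Overflow is not checked in this sketch. */
int fold_constant(const char *expr, long *result)
{
    char *end;
    long lhs = strtol(expr, &end, 10);
    if (end == expr)
        return 0;                       /* no left operand */
    /* Skip the whitespace allowed between the operand and the operator. */
    while (*end == ' ' || *end == '\t' || *end == '\n' || *end == '\r')
        end++;
    char op = *end;
    if (op != '+' && op != '*')
        return 0;
    const char *rhs_text = end + 1;
    long rhs = strtol(rhs_text, &end, 10);  /* strtol skips leading whitespace */
    if (end == rhs_text)
        return 0;                       /* no right operand */
    *result = (op == '+') ? lhs + rhs : lhs * rhs;
    return 1;
}
```

Inside a flex action this could be called as fold_constant(yytext, &value), printing value on success and echoing yytext unchanged otherwise.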

Bison: How to ignore a token if it doesn't fit into a rule

I'm writing a program that handles comments as well as a few other things. If a comment is in a specific place, then my program does something.
Flex passes a token upon finding a comment, and Bison then looks to see if that token fits into a particular rule. If it does, then it takes an action associated with that rule.
Here's the thing: the input I'm receiving might actually have comments in the wrong places. In this case, I just want to ignore the comment rather than flagging an error.
My question:
How can I use a token if it fits into a rule, but ignore it if it doesn't? Can I make a token "optional"?
(Note: The only way I can think of doing this right now is scattering the comment token in every possible place in every possible rule. There MUST be a better solution than this. Maybe some rule involving the root?)
One solution may be to use bison's error recovery (see the Bison manual).
To summarize, bison defines the terminal token error to represent an error (say, a comment token returned in the wrong place). That way, you can (for example) close parentheses or braces after the wayward comment is found. However, this method will probably discard a certain amount of parsing, because I don't think bison can "undo" reductions. ("Flagging" the error, as with printing a message to stderr, is not related to this: you can have an error without printing an error--it depends on how you define yyerror.)
You may instead want to wrap each terminal in a special nonterminal:
term_wrap: comment TERM
This effectively does what you're scared to do (put in a comment in every single rule), but it does it in fewer places.
To force myself to eat my own dog food, I made up a silly language for myself. The only syntax is print <number> please, but if there's (at least) one comment (##) between the number and the please, it prints the number in hexadecimal, instead.
Like this:
print 1 please
1
## print 2 please
2
print ## 3 please
3
print 4 ## please
0x4
print 5 ## ## please
0x5
print 6 please ##
6
My lexer:
%{
#include <stdio.h>
#include <stdlib.h>
#include "y.tab.h"
%}
%%
print return PRINT;
[[:digit:]]+ yylval = atoi(yytext); return NUMBER;
please return PLEASE;
## return COMMENT;
[[:space:]]+ /* ignore */
. /* ditto */
and the parser:
%debug
%error-verbose
%verbose
%locations
%{
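#include <stdio.h>
#include <string.h>
int yylex(void); /* provided by the flex-generated scanner */
int yyparse(void); /* defined later in the generated parser */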
void yyerror(const char *str) {
fprintf(stderr, "error: %s\n", str);
}
int yywrap() {
return 1;
}
extern int yydebug;
int main(void) {
yydebug = 0;
yyparse();
}
%}
%token PRINT NUMBER COMMENT PLEASE
%%
commands: /* empty */
|
commands command
;
command: print number comment please {
if ($3) {
printf("%#x", $2);
} else {
printf("%d", $2);
}
printf("\n");
}
;
print: comment PRINT
;
number: comment NUMBER {
$$ = $2;
}
;
please: comment PLEASE
;
comment: /* empty */ {
$$ = 0;
}
|
comment COMMENT {
$$ = 1;
}
;
So, as you can see, not exactly rocket science, but it does the trick. There's a shift/reduce conflict in there, because of the empty string matching comment in multiple places. Also, there's no rule to fit comments in between the final please and EOF. But overall, I think it's a good example.
Treat comments as whitespace at the lexer level.
But keep two separate rules, one for whitespace and one for comments, both returning the same token ID.
The rule for comments (+ optional whitespace) keeps track of the comment in a dedicated structure.
The rule for whitespace resets the structure.
When you enter that “specific place”, look if the last whitespace was a comment or trigger an error.
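The lexer-level approach can be sketched in plain C using the print <number> please toy language from the previous answer. This is a hand-written scanner rather than flex output, and the names scanner, gap_was_comment, next_word and wants_hex are all assumptions for the example. The whitespace branch resets the record, the comment branch (which also swallows trailing whitespace) sets it, and the "specific place" is the moment the word please is read.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct scanner {
    const char *p;        /* current position in the input */
    int gap_was_comment;  /* the dedicated record: was the last gap a comment? */
};

/* Skip whitespace and ## comments, then copy the next alphanumeric word.
 * Returns 1 if a word was read, 0 at end of input or on a stray character. */
static int next_word(struct scanner *s, char *buf, size_t bufsz)
{
    for (;;) {
        if (isspace((unsigned char)*s->p)) {
            /* Whitespace rule: consume the run and reset the record. */
            while (isspace((unsigned char)*s->p)) s->p++;
            s->gap_was_comment = 0;
        } else if (s->p[0] == '#' && s->p[1] == '#') {
            /* Comment rule: consume ## plus trailing whitespace, and record it. */
            s->p += 2;
            while (isspace((unsigned char)*s->p)) s->p++;
            s->gap_was_comment = 1;
        } else {
            break;
        }
    }
    size_t n = 0;
    while (isalnum((unsigned char)*s->p)) {
        if (n + 1 < bufsz) buf[n++] = *s->p;
        s->p++;
    }
    buf[n] = '\0';
    return n > 0;
}

/* Returns 1 if a comment directly precedes "please", 0 if not,
 * and -1 if the line is not of the form: print <number> please */
int wants_hex(const char *line)
{
    struct scanner s = { line, 0 };
    char word[32];
    if (!next_word(&s, word, sizeof word) || strcmp(word, "print") != 0)
        return -1;
    if (!next_word(&s, word, sizeof word) || !isdigit((unsigned char)word[0]))
        return -1;
    if (!next_word(&s, word, sizeof word) || strcmp(word, "please") != 0)
        return -1;
    return s.gap_was_comment;  /* the "specific place": just before please */
}
```

The parser never sees a comment token in this design, so comments in the "wrong" places need no grammar rules at all; they only leave a trace in the record, which is overwritten by the next gap.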

Resources