Character position from the start of a line - flex-lexer

In flex, how can I get a token's character position relative to the start of its line?
I found a post about the position from the start of the file, but I want it from the start of the line.
It should also handle a case like this:
/** this is
a comment
*/int x,y;
output:
Here the position of "int" is 3.
Please give me some hints on how to implement this.

I presume that the "post regarding position from start of a file" is this one, or something similar. The following is based on that answer.
To track the current column offset, you only need to add the code to reset the offset value when you hit a newline character. If you follow the model presented in the linked answer, you can use a YY_USER_ACTION macro like this to do the adjustment:
#define YY_USER_ACTION \
yylloc.first_line = yylloc.last_line; \
yylloc.first_column = yylloc.last_column; \
if (yylloc.first_line == yylineno) \
yylloc.last_column += yyleng; \
else { \
int col; \
for (col = 1; yytext[yyleng - col] != '\n'; ++col) {} \
yylloc.last_column = col; \
yylloc.last_line = yylineno; \
}
The above code assumes that the starting line/column of the current token is the end line/column of the previous token, which means that yylloc needs to be correctly initialized. Normally you don't need to worry about this, because bison will automatically declare and initialize yylloc (to {1,1,1,1}), as long as it knows that you are using location information.
The test in the third line of the macro optimizes the common case where there was no newline in the token, in which case yylineno will not have changed since the beginning of the token. In the else clause, we know that a newline will be found in the token, which means we don't have to check for buffer underflow. (If you call input() or otherwise manipulate yylineno yourself, then you'll need to fix the for loop.)
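As a rough illustration, the same update can be written as an ordinary C function and tried outside of flex. This is only a sketch: the `Loc` struct and the helper name `advance_loc` are made up for the example, and the `text`, `len` and `lineno` parameters stand in for yytext, yyleng and yylineno.

```c
#include <assert.h>
#include <string.h>

/* Same four fields as Bison's default YYLTYPE (name is hypothetical). */
typedef struct {
    int first_line, first_column, last_line, last_column;
} Loc;

/* Hypothetical helper: advance `loc` over one token.
 * `text` and `len` play the role of yytext and yyleng; `lineno` is the
 * line number at the END of the token, as yylineno would report it. */
void advance_loc(Loc *loc, const char *text, int len, int lineno)
{
    assert(loc && text);
    loc->first_line = loc->last_line;
    loc->first_column = loc->last_column;
    if (loc->first_line == lineno) {
        loc->last_column += len;      /* no newline inside the token */
    } else {
        int col = 1;
        /* The line number changed, so a newline exists in the token
         * and this backwards scan cannot run off the front. */
        while (text[len - col] != '\n')
            ++col;
        loc->last_column = col;
        loc->last_line = lineno;
    }
}
```

Starting from {1,1,1,1} and feeding it the three-line comment from the question followed by "int", the second call leaves first_column at 3, which is the position the question asks for.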
Note that the code will not work properly if you use yyless or yymore, or if you call input.
With yymore, yylloc will report the range of the last token segment, rather than the entire token; to fix that, you'll need to save the real token beginning.
To correctly track the token range with yyless, you'll probably need to rescan the token after the call to yyless (although the rescan can be avoided if there is no newline in the token).
And after calling input, you'll need to manually update yylloc.last_column for each character read. Don't adjust yylineno; flex will deal with that correctly. But you do need to update yylloc.last_line if yylineno has changed.
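The per-character bookkeeping after input() might look something like the following. It is a sketch under assumptions: `Loc` and `note_input_char` are hypothetical names, and `lineno_now` stands for the value of yylineno right after the character was read.

```c
#include <assert.h>

/* Same four fields as Bison's default YYLTYPE (name is hypothetical). */
typedef struct {
    int first_line, first_column, last_line, last_column;
} Loc;

/* Hypothetical helper: call once for every character obtained from
 * input(). `lineno_now` is yylineno after the read; flex maintains it,
 * so this function only mirrors it into the location. */
void note_input_char(Loc *loc, int lineno_now)
{
    assert(loc);
    if (lineno_now != loc->last_line) {
        /* The character was a newline: the next column is 1. */
        loc->last_line = lineno_now;
        loc->last_column = 1;
    } else {
        loc->last_column += 1;
    }
}
```

The column convention matches the macro above: last_column is the column just past the last character consumed.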

Related

JFlex - lex structured document

The following image represents code that I need to lex.
The document has the following format (in columns term):
1 - 5: comment
6 - 79: actual code
80 - ..: comment
If I only had to lex the middle part, there would be no issues at all.
Unfortunately, the initial and terminating comment are always present in the document.
Any ideas on how this could be implemented?
I was thinking about implementing a two-phases lexer, but my thoughts are a bit confused still.
The question
If I only had to lex the middle part, there would be no issues at all. Unfortunately,
the initial and terminating comment are always present in the document. Any ideas on
how this could be implemented?
First solution
I would add a catch-all rule that matches any single character as the last rule of the lexer, and I would modify the symbol helper functions to return a space symbol whenever the match lies in columns 1 to 5 or beyond column 79, like so (assuming the token type for space is 20):
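%{
  /* yycolumn is 0-based, so document columns 1-5 are yycolumn 0-4
     and document columns 80 and up are yycolumn >= 79. */
  private boolean inCommentColumns() {
    return yycolumn < 5 || yycolumn >= 79;
  }
  private Symbol symbol(int type) {
    if (inCommentColumns())
      type = 20;  /* space token */
    return new Symbol(type, yyline, yycolumn);
  }
  private Symbol symbol(int type, Object value) {
    if (inCommentColumns())
      type = 20;  /* space token */
    return new Symbol(type, yyline, yycolumn, value);
  }
%}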
The solution preserves column information. If you need to preserve the comments, then create a comment-character token and return it instead of the space token.
Second solution
Alternatively, I would add two rules to the lexer. The first matches the first comment in each line and returns a whitespace token of length 5:
^.....
The second matches the second comment in each line and returns a whitespace token with the length of the comment:
^(?<=...............................................................................).*
I have never used the non-capturing 'only if preceded by' construct with JFlex, so I don't know if it is supported. Sorry.
The solution preserves column information. Again, if you need to preserve the comments, then return a comment token, otherwise return a whitespace token.
Third solution
Or I would write two lexers. The first one replaces the first 5 characters in every line with white space (to preserve column information for the second lexer) and removes the characters after column 79.
The first lexer can be written in any language OR you can use the command line tool sed (or a similar tool) to do it. Here is an example using sed:
The input to sed named input.txt:
ABCDE67890123456789012345678901234567890123456789012345678901234567890123456789FGHJKL
ABCDEThis is the text we want, not the start and not the end of the line. FGHJKL
The sed command:
sed 's/^.....\(..........................................................................\).*$/\1/' input.txt > output.txt
The output from sed named output.txt:
67890123456789012345678901234567890123456789012345678901234567890123456789
This is the text we want, not the start and not the end of the line.
You can modify the script to preserve column positions by inserting 5 spaces in the replacement part of the command, but it is not suited for returning the comments.
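Since the first pass can be written in any language, here is a minimal C sketch of a variant that keeps column positions by blanking the leading comment instead of deleting it. The function name `mask_line` is made up for the example, and it assumes the line has no tabs and that its trailing newline has already been stripped.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical pre-pass for one line (without its trailing newline):
 * overwrite columns 1-5 with spaces and cut off everything after
 * column 79, so the remaining code keeps its original columns. */
void mask_line(char *line)
{
    assert(line);
    size_t len = strlen(line);
    for (size_t i = 0; i < len && i < 5; ++i)
        line[i] = ' ';
    if (len > 79)
        line[79] = '\0';
}
```

Calling it on each line before handing the text to the real lexer gives the second lexer only the code columns, at unchanged positions.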

Handling new lines in Flex/Bison

I am trying to make a C-like language using Flex/Bison. My problem is that I can't find a proper way to handle newlines. I have to ignore all newlines, so I don't return them as tokens to Bison, because that would make the grammar rules very difficult to write. However, some rules require a mandatory line break. For example:
Program "identifier" -> mandatory change of line
Function "identifier"("parameters") -> mandatory change of line
If I return \n as a token to Bison, then I have to put newlines in all of my grammar rules, which is surely not practical. I tried using a variable as a switch, but it didn't quite work.
Any help or suggestions?
If the required newline is simply aesthetic -- that is, if it isn't required in order to avoid an ambiguity -- then the easiest way to enforce it is often just to track token locations (which is something that bison and flex can help you with) so that you can check in your reduction action that two consecutive tokens were not on the same line:
func_defn: "function" IDENT '(' opt_arg_list ')' body "end" {
    if (@5.last_line == @6.first_line) {
      yyerror("Body of function must start on a new line");
      /* YYABORT; */ /* If you want to kill the parse at this point. */
    }
    // ...
  }
Bison doesn't require any declarations or options in order to use locations; it will insert location support if it notices that you use @N in any action (which is how you refer to the location of a token). However, it is sometimes useful to insert a %locations declaration to force location support. Normally no other change is necessary to your grammar.
You do have to insert a little bit of code in your lexer in order to report the location values to the parser. Locations are communicated through a global variable called yylloc, whose value is of type YYLTYPE. By default, YYLTYPE is a struct with four int members: first_line, first_column, last_line, last_column. (See the Bison manual for more details.) These fields need to be set in your lexer for every token. Fortunately, flex allows you to define the macro YY_USER_ACTION, which contains code executed just before every action (even empty actions), which you can use to populate yylloc. Here's one which will work for many simple lexical analysers; you can put it in the code block at the top of your flex file.
#include <string.h>  /* for strrchr */
/* Simple YY_USER_ACTION. Requires %option yylineno. Will not work if
 * any action includes yyless(), yymore(), input() or REJECT.
 */
#define YY_USER_ACTION \
  yylloc.first_line = yylloc.last_line; \
  yylloc.first_column = yylloc.last_column; \
  if (yylloc.last_line == yylineno) \
    yylloc.last_column += yyleng; \
  else { \
    yylloc.last_line = yylineno; \
    yylloc.last_column = yytext + yyleng - strrchr(yytext, '\n'); \
  }
If the simple location check described above isn't sufficient for your use case, then you can do it through what's called "lexical feedback": a mechanism where the parser not only collects information from the lexical scanner, but also communicates back to the lexer when some kind of lexical change is needed.
Lexical feedback is usually discouraged because it can be fragile. It's always important to remember that the parser and the scanner are not necessarily synchronised. The parser often (but not always) needs to know the next token after the current production, so the lexical state when a production's action is being executed might be the lexical state after the next token, rather than the state after the last token in the production. But it might not; many parser generators, including Bison, try to execute an action immediately if they can figure out that the same action will be executed regardless of the next token. Unfortunately, that's not always predictable. In the case of Bison, for example, changing the parsing algorithm from the default LALR(1) to Canonical LR(1) or to GLR can also change a particular reduction action from immediate to deferred.
So if you're going to try to communicate with the scanner, you should try to do so in a way that will work whether or not the scanner has already been asked for the lookahead token. One way to do this is to put the code which communicates with the scanner in a Mid-Rule Action one token earlier than the token which you want to influence. [Note 1]
In order to make newlines "mostly optional", we need to tell the lexer when it should return a newline instead of ignoring it. One way to do this is to export a function which the lexer can call. We put the definition of that function into the generated parser and its declaration into the generated header file:
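/* Anything in code requires and code provides sections is also
 * copied into the generated header. So we can use it to declare
 * exported functions.
 */
%code requires {
  #include <stdbool.h>
  bool need_nl(void);
}
/* The flag is defined after the grammar, but the actions below use it,
 * so it must be declared before the generated parser code.
 */
%code {
  static bool need_nl_flag;
}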
%%
// ...
/* See [Note 2], below. */
/* Program directive. */
prog_decl: "program" { need_nl_flag = true; } IDENT '\n'
/* Function definition */
func_defn: "function" IDENT
'(' opt_arg_list { need_nl_flag = true; } ')' '\n'
body
"end"
// ...
%%
static bool need_nl_flag = false;
/* The scanner should call this function when it sees a newline.
* If the function returns true, the newline should be returned as a token.
* The function resets the value of the flag, so it must not be called for any
* other purpose. (This interface allows us to set the flag in a parser action
* without having to worry about clearing it later.)
*/
bool need_nl(void) {
bool temp = need_nl_flag;
need_nl_flag = false;
return temp;
}
// ...
Then we just need a small adjustment to the scanner in order to call that function. This uses Flex's set difference operator {-} to make a character class containing all whitespace other than a newline. Because we put that rule first, the second rule will only be used for whitespace including at least one newline character. Note that we only return one newline token for any sequence of blank lines.
([[:space:]]{-}[\n])+ { /* ignore whitespace */ }
[[:space:]]+ { if (need_nl()) return '\n'; }
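To see how the one-shot flag and the two whitespace rules interact, the logic can be imitated in plain C without flex or bison. This is only a sketch: `request_nl` and `skip_space` are hypothetical names, with `request_nl` standing in for the mid-rule action and `skip_space` for the pair of scanner rules.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

static bool need_nl_flag = false;

/* Parser side (hypothetical name): what the mid-rule action does. */
void request_nl(void) { need_nl_flag = true; }

/* Scanner side, as in the answer: read the flag and clear it. */
bool need_nl(void)
{
    bool temp = need_nl_flag;
    need_nl_flag = false;
    return temp;
}

/* Hypothetical stand-in for the two whitespace rules: consume a maximal
 * run of whitespace at *p and report whether a '\n' token would be
 * returned for it. need_nl() is only consulted when the run contains a
 * newline, just as only the second rule calls it. */
bool skip_space(const char **p)
{
    assert(p && *p);
    bool saw_newline = false;
    while (**p && isspace((unsigned char)**p)) {
        if (**p == '\n')
            saw_newline = true;
        ++*p;
    }
    return saw_newline && need_nl();
}
```

A run of blank lines yields at most one newline token, and a run of spaces without a newline leaves a pending request untouched for the next run.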
Notes
Note 1: That's not something you can do without thought, either: it might also be an error to change the scanner configuration too soon. In the action, you can check whether or not the lookahead token has already been read by looking at the value of yychar. If yychar is YYEMPTY, then no lookahead token has been read. If it is YYEOF, then an attempt was made to read a lookahead token but the end of input was encountered. Otherwise, the lookahead token has already been read.
It might seem tempting to use two actions, one before the token prior to the one you want to affect, and one just before that token. The first action could execute only if yychar is not YYEMPTY, indicating that the lookahead token has already been read and the scanner is about to read the token you want to change, while the second action will only execute if yychar at that point is YYEMPTY. But it's entirely possible that for a particular parse both of those conditions are true, or that neither is true.
Bison does have one configuration which you can use to make the lookahead decision completely predictable. If you set %define lr.default-reduction accepting, then Bison will always attempt to read a lookahead symbol, and you can be sure that placing the action one token early will work. Unless you are using the parser interactively, there is no real cost for enabling this option. But it won't work with old Bison versions or with other parser generators such as byacc.
Note 2: For this grammar, we could have put the mid-rule actions just before the '\n' tokens rather than one token earlier (as long as the parser is never converted to a GLR or Canonical-LR parser). That's because in both rules, the MRA will go in between two tokens, and (presumably) there are no other rules which might apply up to the first of these tokens. Under those circumstances Bison can certainly know that the MRA can be reduced without examining the lookahead token to see if it is \n: either the next token is a newline and the reduction was required, or the next token is not a newline, which will be a syntax error. Since Bison does not guarantee to detect syntax errors before reduction actions are run, it can reduce the MRA action before knowing whether the parse will succeed.
There is a pattern called trailing context that you can try: https://people.cs.aau.dk/~marius/sw/flex/Flex-Regular-Expressions.html
"identifier"/[\n]
"function-identifier"/[\n]

Is it possible for yylval to be a struct instead of a union?

In Bison, is it possible for yylval to be a struct instead of a union? I know that I can define yylval as a union with %union{}, but is there a way to define yylval as a struct, for example to return both the line and the string of an identifier and access this information in the action of a grammar rule?
Yes, you can #define YYSTYPE to be any type you want instead of using %union. However, it is rarely useful to do so [1]; if you want source position info, you're much better off using %locations in combination with %union.
It's also possible (and common) to use structs within the %union declaration. This makes it easy for some rules to (effectively) return multiple values.
[1] The main problem is that if you use %type to specify the use of one struct field, it's painful to use other fields in the same action. You need to do everything manually, thus losing the benefit of bison's union type checking.
If you want to keep location information (line number and column number) for your tokens, you can use Bison's location facility, which keeps a location object for each token and non-terminal separately from the semantic value. In an action, you refer to the location of the nth right-hand-side symbol as @n.
The location stack is created and maintained automatically by bison if it sees that you have referred to a location anywhere in a rule.
By default, the location datatype is:
typedef struct YYLTYPE {
int first_line;
int first_column;
int last_line;
int last_column;
} YYLTYPE;
The location information for tokens must be set by the lexer. If you are using the default API, it is stored in the global variable yylloc. The parser will create location information for non-terminals by using the range from the beginning of the first item of a production up to the end of the last item. (For empty productions, a zero-length location object is generated, located at the end of the symbol that precedes the production.)
Both of these defaults can be overridden if necessary. See the Bison manual for details.
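For a non-empty production, the default range computation described above amounts to something like the following C sketch. The names `Loc` and `span` are hypothetical and this is not Bison's actual macro; it only illustrates the "first item's start to last item's end" rule.

```c
#include <assert.h>

/* Same four fields as the default YYLTYPE (name is hypothetical). */
typedef struct {
    int first_line, first_column, last_line, last_column;
} Loc;

/* rhs[0..n-1] hold the locations of the n right-hand-side symbols of a
 * non-empty production (n >= 1). The result covers all of them. */
Loc span(const Loc *rhs, int n)
{
    assert(rhs && n >= 1);
    Loc r;
    r.first_line = rhs[0].first_line;
    r.first_column = rhs[0].first_column;
    r.last_line = rhs[n - 1].last_line;
    r.last_column = rhs[n - 1].last_column;
    return r;
}
```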
Flex will track line numbers if asked to with %option yylineno, but it does not track column positions, which is a bit annoying. Also, yylloc requires both a starting and an ending line number; yylineno in a flex action will be the line number at the end of the token. Most commonly, you will use the YY_USER_ACTION macro to maintain the value of yylloc; an example implementation (taken from this answer, which you should read if you use this code) is:
%option yylineno
%{
#define YY_USER_ACTION \
yylloc.first_line = yylloc.last_line; \
yylloc.first_column = yylloc.last_column; \
if (yylloc.first_line == yylineno) \
yylloc.last_column += yyleng; \
else { \
int col; \
for (col = 1; yytext[yyleng - col] != '\n'; ++col) {} \
yylloc.last_column = col; \
yylloc.last_line = yylineno; \
}
%}

GNU Flex: How does yyunput work

I have a problem understanding the behavior of flex's yyunput.
I want to put back some characters.
For example:
My scanner found CALL{space}{cc}
cc N?Z|N?C|P[OE]?|M
%%
CALL{blank}{cc} {BEGIN CON; return yy::ez80asm_parser::make_CALL(loc);}
CALL{mmode}{blank}{cc} {BEGIN CON; return yy::ez80asm_parser::make_CALL(loc);}
CALL {BEGIN ARG; return yy::ez80asm_parser::make_CALL(loc);}
and I want to give back the {cc} so that it will be scanned next time.
What do the two arguments of yyunput have to be? I couldn't find any helpful information about that function.
Any hints are welcome.
Jürgen
You can't "give back the {cc}" because the regular expression doesn't have pieces. (Flex does not do captures, either, so it wouldn't help to put parentheses around it.)
If you just want to rescan part of a token, it is much better to use yyless than unput, since yyless mostly just changes a pointer. With a single call to yyless you can return as many characters as you like, so you only need to know how many characters to return. (More precisely, you tell it how many characters you want to keep in yytext; the remainder are returned and yytext is truncated accordingly.)
For reference, unput is a macro whose single argument is a single character which will be pushed onto the beginning of the unconsumed input, overwriting yytext as it goes. (In the C++ API, it calls the internal member function ::yyunput, supplying it an additional necessary argument. Don't call this function directly.)
If you need to push several characters onto the input, you need to unput them one at a time, starting with the last one. Since unput destroys the value of yytext, you need to make sure that you've already copied it if you need it before calling unput.
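The reason for pushing the last character first is easiest to see with a toy pushback stack. The sketch below does not use flex at all; `Stream`, `toy_unput`, `toy_input` and `toy_unput_string` are hypothetical names that only imitate the last-in, first-out behaviour of unput and input.

```c
#include <assert.h>
#include <string.h>

/* Toy input stream with a pushback stack, standing in for flex's buffer. */
typedef struct {
    const char *rest;   /* unread source text */
    char pushed[64];    /* pushed-back characters, top at pushed[npushed-1] */
    int npushed;
} Stream;

/* Push one character back, like a single unput(c). */
void toy_unput(Stream *s, char c)
{
    assert(s->npushed < 64);
    s->pushed[s->npushed++] = c;
}

/* Read the next character: pushed-back ones first, then the source. */
int toy_input(Stream *s)
{
    if (s->npushed > 0)
        return (unsigned char)s->pushed[--s->npushed];
    if (*s->rest)
        return (unsigned char)*s->rest++;
    return -1;          /* end of input */
}

/* Push a whole string back so that it is re-read in its original order:
 * the last character goes first. */
void toy_unput_string(Stream *s, const char *text)
{
    for (size_t i = strlen(text); i > 0; --i)
        toy_unput(s, text[i - 1]);
}
```

With real unput, remember to copy yytext before the loop, since each call may overwrite it.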
In your case, I think neither of these is appropriate. What you probably want to do is to not include the {cc} pattern in the match in the first place, which you can do with flex's trailing context operator /. (That assumes that you don't need to include the characters matched by {cc} in the semantic value you will be returning; in the example provided, yytext does not appear to be part of the semantic value, so the assumption should be safe.) To do so, you might write something like:
CALL{mmode}?{blank}/{cc} {BEGIN CON; return yy::ez80asm_parser::make_CALL(loc);}
CALL {BEGIN ARG; return yy::ez80asm_parser::make_CALL(loc);}
(Note: I combined your first two patterns into a single one since they seem to have the same action, but if you actually need the characters matched by {mmode} you might not want to do that.)
If that doesn't work, for whatever reason, use yyless. You'll need to know how many characters you want to return to the input, so I imagine you would end up with something like:
CALL{mmode}?{blank}{cc} { BEGIN CON;
    /* Work out how long the {cc} suffix is from its last character. */
    int to_keep = yyleng - 1;
    switch (yytext[to_keep]) {
      case 'C': case 'Z':            /* C, Z, NC or NZ */
        if (yytext[to_keep - 1] == 'N') --to_keep;
        break;
      case 'E': case 'O': --to_keep; break;   /* PE or PO */
      case 'P': case 'M': break;     /* single-character codes */
      default: assert(false); /* internal error */
    }
    yyless(to_keep);
    return yy::ez80asm_parser::make_CALL(loc);
  }
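The length computation can be checked on its own, outside the scanner. The following sketch repeats it as a plain C function; the name `chars_before_cc` is made up, and it assumes the text really was matched by CALL{mmode}?{blank}{cc} with cc defined as N?Z|N?C|P[OE]?|M.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given text matched by CALL{mmode}?{blank}{cc},
 * return how many leading characters come before the {cc} part, which
 * is the value one would pass to yyless(). */
size_t chars_before_cc(const char *text, size_t len)
{
    assert(text && len >= 2);
    size_t keep = len - 1;                  /* cc has at least one char */
    switch (text[keep]) {
    case 'C': case 'Z':
        if (text[keep - 1] == 'N') --keep;  /* NC or NZ */
        break;
    case 'E': case 'O':
        --keep;                             /* PE or PO */
        break;
    case 'P': case 'M':
        break;                              /* single-character codes */
    default:
        assert(0 && "not a condition code");
    }
    return keep;
}
```

For "CALL NZ" (7 characters) it returns 5, so yyless(5) would keep "CALL " and return "NZ" to the input.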
For details on the trailing context operator, see the Flex manual section on patterns (search for the word "trailing"; there is an important note towards the end as well) as well as the first paragraph of the following chapter on matching. yyless and unput are both documented in the chapter on actions, which includes examples of their usage.

Question in Flex (parser)

I want to ask you a question about Flex, the program for parsing code.
Supposing I have an instruction like this one, in the rules part:
"=" BEGIN(attribution);
<attribution>{var_name} { fprintf(yyout, "="); ECHO; }
<attribution>";" BEGIN(INITIAL);
{var_name} is a regular expression that matches a variable's name, and all I want to do is to copy at the output all the attribution instructions, such as
a = 3;
or
b = a;
My rule though cannot write with fprintf the left member of the attribution, but only
= 3;
or
=a;
One solution for that might be that, after I make the match "=" and I am in the attribution state, to go 2 positions back as to get the left operand as well.
How can I do that in Flex?
Why are you using flex for syntactic analysis?
What you are doing sounds like a job for bison, not flex.
In bison you would be able to keep track of the previous token.
If you still want to use flex, you can use the / (trailing context) pattern.
Using it may make the scanner less efficient, and whether the lexer behaves correctly depends on the whole rule set.
{var_name}/"=" { ECHO; BEGIN(attribution); }
See the flex manual.
