How to unlex using Flex (the Fast Lexical Analyzer)?

Is there any way to put a token back into the input stream using Flex? I imagine some function like yyunlex().

There is the macro REJECT, which puts the matched text back on the stream and continues matching the other rules as though the first match hadn't happened. If you just want to put some characters back on the stream, #Kizaru's answer will suffice.
Example snippet (given input abcd, each REJECT backs the scanner up to try the next-longest match, so abcd, abc, ab, and a are all echoed before the individual characters are handled by the catch-all rule):
%%
a |
ab |
abc |
abcd ECHO; REJECT;
.|\n printf("xx%c", *yytext);
%%

You have a few options.
You can put each character of the token back onto the input stream using unput(ch), where ch is the character. This call makes ch the next character on the input stream (the next character to be considered in scanning). So you can do this if you save the matched string during the token match.
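For example, a minimal sketch (not from the original answer; the RESCAN start condition and the "foo" pattern are made up for illustration) that pushes a whole match back with unput() and rescans it:
%option main
%{
#include <string.h>
%}
%x RESCAN
%%
"foo"          { /* save the match, then push it back and rescan it */
                 char saved[16];
                 int i, len = yyleng;
                 strcpy(saved, yytext);   /* unput() clobbers yytext, so copy first */
                 BEGIN(RESCAN);           /* otherwise this rule would match again forever */
                 for (i = len - 1; i >= 0; --i)
                     unput(saved[i]);     /* push back in reverse: unput() prepends */
               }
<RESCAN>"foo"  { printf("rescanned: %s\n", yytext); BEGIN(INITIAL); }
.|\n           ECHO;
%%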
You might want to look into yyless(0), which will put all of the characters from the token back onto the input stream too. I never used this one, though, so I'm not sure if there are any gotchas. You can specify an integer n, which will put all but the first n characters back on the input stream.
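A short example adapted from the flex manual: given input foobar, this prints foobarbar, because yyless(3) returns everything after the first three characters to the input stream for rescanning:
%%
foobar     ECHO; yyless(3);   /* keep "foo" consumed, rescan "bar" */
[a-z]+     ECHO;
%%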
Now, if you're going to do this often during scanning/parsing, you might want to use lex just to build tokens, placing them into your own data structure and parsing from that, akin to how a bison- or yacc-generated yyparse() pulls tokens from yylex(). A sketch of that token-buffer approach follows.
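A minimal sketch of the buffering idea (all names here are hypothetical, not a fixed API): scan once, store the tokens, and let the parser move a cursor over them, so "unlexing" is just moving the cursor back.
#include <stdio.h>

enum { TOK_EOF = 0, TOK_WORD, TOK_NUM };      /* assumed token kinds */

typedef struct { int kind; char text[64]; } Token;

static Token tokens[1024];
static int ntokens, cursor;

extern int yylex(void);       /* generated by flex */
extern char *yytext;

/* Run the scanner once and buffer every token. */
void fill_tokens(void) {
    int kind;
    while ((kind = yylex()) != TOK_EOF && ntokens < 1024) {
        tokens[ntokens].kind = kind;
        snprintf(tokens[ntokens].text, sizeof tokens[ntokens].text, "%s", yytext);
        ntokens++;
    }
}

Token *next_token(void)   { return &tokens[cursor++]; }   /* read */
void   unread_token(void) { if (cursor > 0) cursor--; }   /* "unlex" */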

Related

How to replace some characters of an input file before it gets lexed in flex?

How to replace all occurrences of some character or character sequence with some other character or character sequence before flex lexes it? For example, I want B\65R to match the identifier rule, as it is equivalent to BAR in my grammar. So, essentially, I want to turn a sequence of \dd into its equivalent ASCII character and then lex it (\65 -> A, \66 -> B, …).
I know I can first search the entire file for sequences of \dd, replace them with the equivalent characters, and then feed the result to flex. But I wonder if there is a better way, something like writing a rule that matches \dd and then replaces it with the corresponding character in the input stream, so that I don't have to scan the entire file twice.
Several options... You could run something along the lines of
yyin = popen("perl -pe 's/\\\\(\\d\\d)/chr($1)/e'", "r");
yylex();
Then flex reads from a filter that substitutes "\dd" by "chr(dd)" (untested).
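If you would rather stay inside the lexer, one alternative (a sketch; the decode() helper and the identifier pattern are illustrative, not from the answer above) is to let the identifier pattern itself accept \dd sequences and decode them in the action. Note that a lone substitution rule that merely unput()s the decoded character would not help here, because flex would already have split B\65R into separate tokens before the substitution took effect.
%option main
%{
#include <ctype.h>
#include <stdio.h>

/* Decode \dd escapes in src into dst (hypothetical helper). */
static void decode(char *dst, const char *src) {
    while (*src) {
        if (src[0] == '\\' && isdigit((unsigned char)src[1])
                           && isdigit((unsigned char)src[2])) {
            *dst++ = (char)((src[1] - '0') * 10 + (src[2] - '0'));
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
%}
IDCHAR    [A-Za-z]|\\[0-9][0-9]
%%
{IDCHAR}+    { char clean[256];
               decode(clean, yytext);        /* B\65R becomes BAR */
               printf("identifier: %s\n", clean); }
.|\n         ECHO;
%%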

Does -> skip change the behavior of the lexer rule precedence?

I am writing a grammar to parse a configuration export file from a closed system. When a parameter identified in the export file has a particularly long string value assigned to it, the export file inserts "\r\n\t" (double quotes included) every so often in the value. In the file I'll see something like:
"stuff""morestuff""maybesomemorestuff"\r\n\t"morestuff""morestuff"...etc."
In that line, "" is the way the export file escapes a " that is part of the actual string value - vs. a single " which indicates the end of the string value.
My current approach for getting this string value to the parser is to grab "stuff" as a token and \r\n\t as a token. So I have rules like:
quoted_value : (QUOTED_PART | QUOTE_SEPARATOR)+ ;
QUOTED_PART : '"' .*? '"';
QUOTE_SEPARATOR : '\r\n\t';
WS : [ \t\r\n] -> skip; //note - just one char at a time
I get no errors when I lex or parse a sample string. However, no QUOTE_SEPARATOR tokens show up in the token stream, and there is literally nothing in the stream where they should have been.
I had expected that since QUOTE_SEPARATOR is longer than WS and comes first in the grammar, it would be selected, but it behaves as if WS matched and the characters were skipped, not sent to the token stream.
Does the -> skip do something to change how rule precedence works?
I am also open to a different approach to the lexing that completely removes the "\r\n\t" (all five characters) - this way just seemed easier, and it should be easy enough for the program that will process the parse tree to deal with, as other manipulations to the data will be done there anyway (my first grammar - teach me ;) ).
No, skip does not affect rule precedence.
Change the QUOTE_SEPARATOR rule to
QUOTE_SEPARATOR : '\\r\\n\\t' ;
in order to match the actual textual content of the source string. In an ANTLR lexer rule, '\r\n\t' denotes the real carriage-return, newline, and tab control characters, not the backslash sequences that appear literally in the export file; escaping the backslashes makes the rule match that literal text.

How would I create a parser which consumes a character that is also at the beginning and end

How would I create a parser that allows a character which also happens to be the same as the begin/end character? Using the following example:
'Isn't it hot'
The second single-quote should be accepted as part of the content that is between the beginning and ending single-quote. I created a parser like this:
char("'").seq((word()|char("'")|whitespace()).plus()).seq(char("'"))
but it fails with:
Failure[1:15]: "'" expected
If I use "any()|char("'") then it greedily consumes the ending single-quote causing an error as well.
Would I need to create an actual Grammar class? I have attempted to create one but can't figure out how to make a Parser that doesn't try to consume the end marker greedily.
The problem is that plus() is greedy and blind. This means the repetition consumes as much input as possible, but does not consider what comes afterwards. In your example, everything up to the end of the input is consumed, but then the last quote in the sequence cannot be matched anymore.
You can solve the problem by using the non-blind variation plusGreedy(Parser) instead:
char("'")
.seq((word() | char("'") | whitespace()).plusGreedy(char("'")))
.seq(char("'"));
This consumes the input as long as there is still a char("'") left that can be consumed afterwards. With the example input, the repetition now stops before the final single-quote, so the trailing seq(char("'")) can match it.

How to detect a partial, unfinished token and join its pieces obtained from two consecutive portions of input?

I am writing a toy terminal, where I use Flex to parse normal text and the control sequences that I get from the tty. One detail of the Cocoa machinery is that it reads from the tty in chunks of 1024 bytes, so any token described in my .lex file can at any time become broken into two parts: some bytes of the token are the last bytes of the first 1024-byte chunk, and the remaining bytes are the very first bytes of the next 1024-byte chunk.
So I need to somehow:
First of all, detect this situation: a token split between two 1024-byte chunks.
Remember the first part of the token.
When the second 1024-byte chunk arrives, restore that first part by putting it in front of the second chunk somehow.
I am completely new to Flex, so I am looking for the right way to accomplish this.
I have created a dumb simple lexer to assist this discussion.
My question about this demo is:
How can I detect that the last "FO" (an unfinished "FOO") token is actually an unfinished token, i.e. not a violation of my grammar, but a token that just needs its "O" from the next chunk of input?
You should let flex do the reading. It is designed to work that way; it will do all the buffering necessary, including the case where a token is split between two (or more) input buffers.
If you cannot simply read from stdin using the standard fread function, then you can redefine the way the flex-generated scanner gets input by redefining the macro YY_INPUT. See the "Generated Scanner" chapter of the flex manual for a description of this macro.
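For instance, a minimal sketch of such a redefinition, placed in the definitions section of the .lex file (ttyfd is a hypothetical file descriptor opened elsewhere; read errors are treated as end of input for brevity):
%{
#include <unistd.h>
extern int ttyfd;   /* the tty file descriptor, opened elsewhere */
#define YY_INPUT(buf, result, max_size)               \
    do {                                              \
        ssize_t n = read(ttyfd, (buf), (max_size));   \
        (result) = (n <= 0) ? YY_NULL : (size_t)n;    \
    } while (0)
%}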
I have accepted #rici's answer as the correct one, as it gave me the important hint about redefining the macro YY_INPUT.
In this answer I just want to share some details for newbies like me.
I used How to make YY_INPUT point to a string rather than stdin in Lex & Yacc (Solaris) as an example of a custom YY_INPUT, and this made my artificial example work correctly with partial tokens.
To make Flex work correctly with partial tokens, the input should not contain '\0' characters, i.e. the scanning process should be "endless". Here is how YY_INPUT is redefined:
/* wired up in the definitions section as:
   #define YY_INPUT(buf, result, max_size) readInputForLexer(buf, &result, max_size)
*/
int readInputForLexer(char *buffer, int *numBytesRead, int maxBytesToRead) {
    static int Flip = 0;
    if ((Flip++ % 2) == 0) {
        strcpy(buffer, "FOO F");    /* first chunk ends mid-token */
        *numBytesRead = 5;          /* IMPORTANT: 5, not 6, to cut off the '\0' */
    } else {
        strcpy(buffer, "OO FOO");   /* next chunk starts with the rest of "FOO" */
        *numBytesRead = 6;          /* IMPORTANT: 6, not 7, to cut off the '\0' */
    }
    return 0;
}
In this example, the partial token split into "F" and "OO" is glued by Flex into the correct token FOO.
As #rici pointed out in his comment, the correct way to stop scanning is to set *numBytesRead = 0.
See also another answer by #rici on a similar SO question: Flex, continuous scanning stream (from socket). Did I miss something using yywrap()?.
See my example for further details.

Getting tokens based on length and position inside input

On my input I have a stream of characters which is not separated by any delimiter, like this:
input = "150001"
I want to make a parser (using JISON) which tokenizes based on position and length. These should be my tokens:
15 - system id (first 2 digits)
0001 - order num (next 4 digits)
Can you give me some advice on how I can accomplish this?
I tried to add my tokens like this:
%lex
%%
[0-9]{2} return "SYSTEM_ID"
[0-9]{4} return "ORDER_NUM"
/lex
%%
But, as expected, this is not working :)
Is there some way to parse this kind of input, where you tokenize by length and position of characters?
You can make a simple parser using state declarations, assigning a state to each of those rules. Referring to JISON's documentation, it would change to something like this (noting that your lexer is still incomplete, because it does nothing for the identifier or the "="):
%lex
%s system_id order_num
%%
/* some more logic is needed to accept identifier, then "=", each
with its own state, and beginning "system_id" state.
*/
<system_id>[0-9]{2} this.begin("order_num"); return "SYSTEM_ID";
<order_num>[0-9]{4} this.begin("INITIAL"); return "ORDER_NUM";
/lex
%%
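For comparison, here is the same start-condition idea in plain flex, the tool used elsewhere on this page (a sketch that assumes the input literally looks like input = "150001"; JISON's %s states and this.begin() correspond to flex's start conditions and BEGIN):
%option main
%x SYSID ORDNUM
%%
"input"[ \t]*"="[ \t]*\"    BEGIN(SYSID);    /* skip up to the opening quote */
<SYSID>[0-9]{2}             { printf("SYSTEM_ID %s\n", yytext); BEGIN(ORDNUM); }
<ORDNUM>[0-9]{4}            { printf("ORDER_NUM %s\n", yytext); BEGIN(INITIAL); }
.|\n                        ;                /* ignore everything else in this sketch */
%%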
