How to detect a partial, unfinished token and join its pieces obtained from two consecutive portions of input? - flex-lexer

I am writing a toy terminal in which I use Flex to parse the normal text and control sequences that I get from the tty. One detail of the Cocoa machinery is that it reads from the tty in chunks of 1024 bytes, so any token described in my .lex file can end up broken into two parts: some bytes of the token are the last bytes of one 1024-byte chunk, and the remaining bytes are the very first bytes of the next 1024-byte chunk.
So I need to somehow:
Detect this situation, i.e. that a token is split between two 1024-byte chunks.
Remember the first part of the token.
When the second 1024-byte chunk arrives, restore that first part by putting it in front of the second chunk.
I am completely new to Flex, so I am looking for the right way to accomplish this.
I have created a dumb, simple lexer to assist this discussion.
My question about this demo is:
How can I detect that the last "FO" (an unfinished "FOO") is actually an unfinished token, that is, that it is not a violation of my grammar but just needs its "O" from the next chunk of input?

You should let flex do the reading. It is designed to work that way; it will do all the buffering necessary, including the case where a token is split between two (or more) input buffers.
If you cannot simply read from stdin using the standard fread function, then you can change the way the flex-generated scanner gets its input by redefining the macro YY_INPUT. See the "Generated Scanner" chapter of the flex manual for a description of this macro.
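As a rough sketch of what such a redefinition could call, the C function below hands out input in fixed-size chunks, the way the tty does in the question. The names chunk_source, read_chunk and tty_source are made up for this illustration; only the YY_INPUT(buf, result, max_size) macro itself comes from flex.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the tty: a byte array handed out in fixed-size chunks. */
struct chunk_source {
    const char *data;   /* all bytes that will ever arrive */
    size_t len;         /* total number of bytes in data */
    size_t pos;         /* how many bytes have been handed out so far */
    size_t chunk;       /* largest number of bytes delivered per read */
};

/* Copy the next chunk into buf and return its size; 0 means end of input. */
size_t read_chunk(struct chunk_source *src, char *buf, size_t max_size)
{
    assert(src != NULL && buf != NULL);
    size_t n = src->len - src->pos;
    if (n > src->chunk) n = src->chunk;
    if (n > max_size) n = max_size;     /* never overrun the scanner's buffer */
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return n;
}

/* In the definitions section of the .l file, the hookup would look roughly like
 * this (tty_source is a hypothetical global of type struct chunk_source):
 *
 *   #define YY_INPUT(buf, result, max_size) (result) = (int)read_chunk(&tty_source, (buf), (size_t)(max_size))
 */
```

With a source of "FOO FOO FOO" and a chunk size of 5, successive calls deliver "FOO F", "OO FO" and "O", and then 0 to signal end of input. Flex keeps the unfinished bytes in its own buffer between calls, so no token reassembly is needed in the reader.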

I have accepted @rici's answer as the correct one because it gave me the important hint about redefining the macro YY_INPUT.
In this answer I just want to share some details for newbies like me.
I used "How to make YY_INPUT point to a string rather than stdin in Lex & Yacc (Solaris)" as an example of a custom YY_INPUT, and this made my artificial example work correctly with partial tokens.
To make Flex work correctly with partial tokens, the input should not contain '\0' characters, i.e. the scanning process should be "endless". Here is the function that YY_INPUT is redefined to call:
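int readInputForLexer(char *buffer, int *numBytesRead, int maxBytesToRead) {
    /* Alternates between two chunks so that the middle FOO is split as F | OO. */
    static int Flip = 0;
    (void)maxBytesToRead; /* both chunks are far smaller than Flex's buffer */
    if ((Flip++ % 2) == 0) {
        memcpy(buffer, "FOO F", 5); /* requires <string.h>; copies no \0 */
        *numBytesRead = 5; // IMPORTANT: this is 5, not 6, to cut off \0
    } else {
        memcpy(buffer, "OO FOO", 6);
        *numBytesRead = 6; // IMPORTANT: this is 6, not 7, to cut off \0
    }
    return 0;
}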
In this example the partial token F-OO is glued by Flex into a correct one: FOO.
As @rici pointed out in his comment, the correct way to stop scanning is to set *numBytesRead = 0.
See also another answer by @rici on a similar SO question: "Flex, continuous scanning stream (from socket). Did I miss something using yywrap()?".
See my example for further details.
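To make the idea of a token surviving a chunk boundary concrete, here is a hand-written sketch in plain C. It is not what Flex does internally; it only shows that a scanner which keeps its match state between reads recognizes F | OO as one FOO. The names foo_counter and foo_feed are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scanner state that survives between chunks. */
struct foo_counter {
    int matched;   /* how many leading characters of "FOO" have been seen (0..2) */
    int count;     /* complete FOO tokens found so far */
};

/* Feed one chunk; a token cut off at the end of the chunk stays pending in `matched`. */
void foo_feed(struct foo_counter *fc, const char *chunk, size_t n)
{
    static const char token[] = "FOO";
    assert(fc != NULL && chunk != NULL);
    for (size_t i = 0; i < n; i++) {
        if (chunk[i] == token[fc->matched]) {
            if (++fc->matched == 3) {   /* whole token seen */
                fc->count++;
                fc->matched = 0;
            }
        } else {
            /* Mismatch: an 'F' can still start a new token. */
            fc->matched = (chunk[i] == 'F') ? 1 : 0;
        }
    }
}
```

Feeding "FOO F" leaves count at 1 with one character pending; feeding "OO FOO" afterwards brings count to 3 with nothing pending.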

Related

Why do we need both a lookahead symbol and a read-ahead symbol in a compiler?

Well, I was reading about some common parsing concepts in compilers and came across the lookahead symbol and the read-ahead symbol. I searched and read about them, but I am stuck on why we need both of them. I would be grateful for any suggestions.
Lookahead symbol: when the node being considered in the parse tree is for a terminal, and the terminal matches the lookahead symbol, then we advance in both the parse and the input.
Read-ahead symbol: the lexical analyzer may need to read some characters before it can decide on the token to be returned.
One of these is about parsing and refers to the next token to be produced by the lexical scanner. The other one, which is less formal, is about lexical analysis and refers to the next character in the input stream. It should be clear which is which.
Note that while most parsers only require a single lookahead token, it is not uncommon for lexical analysis to have to backtrack, which is equivalent to examining several unconsumed input characters.
I hope I got your question right.
Consider C.
It has several punctuators that begin the same way:
+, ++, +=
-, --, -=, ->
<, <=, <<, <<=
...
In order to figure out which one it is when you see the first + or - or <, you need to look ahead one character in the input (and then maybe one more for <<=).
A similar thing can happen at a higher level:
{
ident1 ident2;
ident3;
ident4:;
}
Here ident1, ident3 and ident4 can begin a declaration, an expression or a label. You can't tell which one immediately. You can consult your existing declarations to see if ident1 or ident3 is already known (as a type or variable/function/enumeration), but it's still ambiguous because a colon may follow and if it does, it's a label because it's permitted to use the same identifier for both a label and a type/variable/function/enumeration (those two name spaces do not intersect), e.g.:
{
typedef int ident1;
ident1 ident2; // same as int ident2
int ident3 = 0;
ident3; // unused expression of value 0
ident1:; // unused label
ident2:; // unused label
ident3:; // unused label
}
So, you may very well need to look ahead a character or a token (or "unread" one) to deal with situations like these.
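The punctuator case can be sketched in C as a longest-match lookup: the scanner cannot commit to + or < until it has looked at the characters that follow. The function name punct_len is made up, and the table holds only the punctuators listed above, not the full set in C.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the length of the longest punctuator at the start of s, or 0 if none.
 * Only the punctuators listed in the text are known to this sketch. */
size_t punct_len(const char *s)
{
    static const char *const puncts[] = {
        "+", "++", "+=", "-", "--", "-=", "->", "<", "<=", "<<", "<<="
    };
    assert(s != NULL);
    size_t best = 0;
    for (size_t i = 0; i < sizeof puncts / sizeof puncts[0]; i++) {
        size_t n = strlen(puncts[i]);
        /* strncmp stops at s's terminating NUL, so it never reads past the input. */
        if (n > best && strncmp(s, puncts[i], n) == 0)
            best = n;
    }
    return best;
}
```

For the input "<<=x" this returns 3, for "<a" it returns 1, and the scanner resumes right after the characters that were consumed.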

Flex scanning, differentiating between string (with single spaces) and padding (more than one space)

I am having trouble getting flex to scan lines that look something like this
DESCRIPTION This is the device description
I would like the line to be scanned such that DESCRIPTION is one token and "This is the device description" is the other.
I have been playing endlessly with my rules but cannot seem to get it to work.
From the documentation I think I want to implement a rule using
`r/s'
an r but only if it is followed by an s
where spaces are only accepted if they are followed by something that is not whitespace. I have no idea how to write this rule with flex's syntax. In my mind the rule should be something like
[a-zA-Z](" "/[a-zA-Z0-9]|[a-zA-Z0-9])* return IDENTIFIER;
But this is invalid.
I can get the rules to chop the line up into individual words, but I cannot get them to differentiate between a single space and more than one space. Halp.
This is not really a good match for flex, since the recognition of tokens is context-dependent. You can achieve context-dependent scanning using start conditions but excessive use of start conditions is often an indication that some other scanning mechanism would be better.
Regardless of how you do it, the key is figuring out exactly how to decide on the token division. Consider the following four lines, for example:
DEVICE This is the device
MODE This is the mode
DESCRIPTION This is the device description
UNDOCUMENTED FIELD
Of course, it is possible that the corner cases represented by the third and fourth lines never show up in any of your inputs.
If the first token cannot include whitespace, then the problem is relatively simple, although you still need a start condition (and I'm going to assume you read the documentation linked above):
%x WHITE WORDS
%%
  /* Possibly should be [[:alpha:]] instead of [[:upper:]] */
[[:upper:]]+   { /* copy yytext */; BEGIN(WHITE); return KEYWORD; }
  /* Handle other possible line beginnings */
<WHITE>\n      { /* Blank descriptive text */; BEGIN(INITIAL); }
<WHITE>[ \t]+  { BEGIN(WORDS); }
<WHITE>.       { /* Something not correct in this line */; ... }
<WORDS>.+      { /* copy yytext */; BEGIN(INITIAL); return DESCRIPTION; }
<WORDS>\n      { BEGIN(INITIAL); }
If there might be whitespace in the first token but never two spaces in a row, you could replace the first pattern above with:
[[:alpha:]]+( [[:alpha:]]+)*
which will match any sequence of words (consisting only of letters) where there is exactly one space between successive words. Like the original pattern above, this will end on the first non-alphabetic character found. That error will be detected by the rules in <WHITE>, because any non-whitespace character encountered when that start condition becomes active will be handled by the start condition's default rule (the <WHITE>. rule).
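To see what that pattern accepts, here is a hand-written C equivalent of [[:alpha:]]+( [[:alpha:]]+)* that returns the length of the longest matching prefix. The function name match_spaced_words is invented for this sketch, and it assumes the default C locale for isalpha.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the longest prefix of s matching [[:alpha:]]+( [[:alpha:]]+)*, or 0. */
size_t match_spaced_words(const char *s)
{
    assert(s != NULL);
    size_t i = 0;
    while (isalpha((unsigned char)s[i])) i++;
    if (i == 0) return 0;               /* must start with a word */
    /* Accept " word" repeatedly: exactly one space, then at least one letter. */
    while (s[i] == ' ' && isalpha((unsigned char)s[i + 1])) {
        i++;
        while (isalpha((unsigned char)s[i])) i++;
    }
    return i;
}
```

The match stops in front of a double space, so a keyword followed by padding yields just the keyword, while "UNDOCUMENTED FIELD" is matched as a whole.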
My opinion is that you are using the wrong tool here. lex (flex) should only be used for lexical analysis, and yacc (or bison) for syntactic analysis. Deciding that a single space is not a separator but several are is not an appropriate job for a lexer.
My opinion is that lex should only report words and padding, and that yacc should later recombine words that are not separated by padding elements.
The lex part would be as simple as:
[[:alnum:]_]+ {
// printf("WORD: >%s<\n", yytext); // for debugging
return WORD;
}
[[:blank:]]{2,} {
// printf("PADDING: >%s<\n", yytext);
return PADDING;
}
and the yacc part would contain:
elt: PADDING
| ident
ident: WORD
| ident WORD
Actions are omitted here because they depend too much on your actual processing.
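Outside of lex and yacc, the same split can be sketched in a few lines of C: look for the first run of two or more blanks, which plays the role of the PADDING token above. The function name find_padding is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find the first run of two or more blanks (space or tab) in line.
 * On success store where the run starts and where it ends, and return 1;
 * return 0 if the line contains no padding. */
int find_padding(const char *line, size_t *start, size_t *end)
{
    assert(line != NULL && start != NULL && end != NULL);
    size_t n = strlen(line);
    for (size_t i = 0; i < n; i++) {
        if (line[i] != ' ' && line[i] != '\t') continue;
        size_t j = i;
        while (line[j] == ' ' || line[j] == '\t') j++;
        if (j - i >= 2) { *start = i; *end = j; return 1; }
        i = j - 1;  /* single blank: part of a phrase, keep scanning */
    }
    return 0;
}
```

Everything before start is the first field and everything from end onwards is the description; a line with only single spaces has no padding and stays in one piece.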

(F)lex, how do I match negation?

Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
Which means match anything that is not one of the rules inside the parentheses. Now, I know that in flex I can negate character rules (e.g. [^ab]), but some of the rules I want to negate could be more complicated than a single character, so I don't think I could use character rules for that. For example, I may need to negate the sequence '"""' for multiline strings, but I'm not sure how to do that in flex.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that:
(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The real state diagram we're looking for is
and one way of writing that as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))*
Again, I produced that by tracing all the ways to end up in state 0:
[^E] stays in state 0
E in state 1:
(E|NE)*: stay in state 1
[^EN]: back to state 0
N[^ED]: back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
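The three-state machine for the """ case is small enough to write out by hand. The sketch below is a direct C rendering of ([^"]|\"([^"]|\"[^"]))*, tracking how many consecutive quotes have just been seen and remembering the last position at which it was in the accepting state. The function name body_len is an invention for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the longest prefix of s that contains no """ and does not end in ". */
size_t body_len(const char *s)
{
    assert(s != NULL);
    int quotes = 0;      /* state: number of consecutive " just seen (0, 1 or 2) */
    size_t accepted = 0; /* length of the longest prefix that ended in state 0 */
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == '"') {
            if (++quotes == 3) break;   /* a third quote completes """: stop */
        } else {
            quotes = 0;
            accepted = i + 1;           /* back in the accepting state */
        }
    }
    return accepted;
}
```

Applied to the text after the opening """ in the example above, it stops right after the final full stop, which is the string body that was wanted.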
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize Python triple-quoted strings:
%x TRIPLEQ
start \"\"\"
end \"\"\"
%%
{start}        { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }
<TRIPLEQ>.|\n  { /* Append the next token to yytext instead of
                  * replacing yytext with the next token
                  */
                 yymore();
                 /* No return yet, flex continues */
               }
<TRIPLEQ>{end} { /* We've found the end of the string, but
                  * we need to get rid of the terminating """
                  */
                 yylval.str = malloc(yyleng - 2);
                 memcpy(yylval.str, yytext, yyleng - 3);
                 yylval.str[yyleng - 3] = 0;
                 BEGIN( INITIAL ); /* Leave the start condition before the next token */
                 return STRING;
               }
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)
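For comparison, the same extraction can be sketched in plain C without flex: skip the opening delimiter, find the first terminator, and copy what lies between, much as the malloc/memcpy action does. The function name delimited_body is made up, and the delimiters are ordinary parameters so that the CDATA variant needs no code change.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'ed copy of the text between `open` at the start of s and the
 * first following `close`, or NULL if s is not delimited that way.
 * The caller frees the result. */
char *delimited_body(const char *s, const char *open, const char *close)
{
    assert(s != NULL && open != NULL && close != NULL);
    size_t open_len = strlen(open);
    if (strncmp(s, open, open_len) != 0) return NULL;
    const char *body = s + open_len;
    const char *end = strstr(body, close);   /* first terminator wins */
    if (end == NULL) return NULL;            /* unterminated */
    size_t n = (size_t)(end - body);
    char *copy = malloc(n + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, body, n);
    copy[n] = '\0';
    return copy;
}
```

Lone or doubled quotes inside a triple-quoted string are kept in the body, and a "]]" that is not followed by ">" is kept in a CDATA body.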

How to unlex using Flex (The Fast Lexical Analyzer)?

Is there any way to put a token back into the input stream using Flex? I imagine some function like yyunlex().
There is the macro REJECT, which puts the token back into the input stream and continues matching against the other rules as though the first match had not happened. If you just want to put some characters back into the stream, @Kizaru's answer will suffice.
Example snippet:
%%
a |
ab |
abc |
abcd ECHO; REJECT;
.|\n printf("xx%c", *yytext);
%%
You have a few options.
You can put each character for the token back onto the input stream using unput(ch) where ch is the character. This call puts ch as the next character on the input stream (next character to be considered in scanning). So you could do this if you save the string during the token match.
You might want to look into yyless(0), which will put all of the characters from the token back onto the input stream too. I have never used this one though, so I'm not sure if there are any gotchas. You can specify an integer n, which will put all but the first n characters back on the input stream.
Now, if you're going to do this often during scanning/parsing, you might want to use lex just to build tokens and place the tokens onto your own data structure for parsing. This is akin to what bison and yacc's generated yyparse() function does.
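A minimal sketch of such a data structure, assuming tokens are plain ints and 0 marks end of input: a small pushback stack sits in front of the tokens the scanner has produced. The names token_stream, next_token and unlex_token are invented here and are not part of flex.

```c
#include <assert.h>
#include <stddef.h>

#define PUSHBACK_MAX 8   /* arbitrary capacity chosen for this sketch */

/* Hypothetical token stream: tokens are plain ints, 0 marks end of input. */
struct token_stream {
    const int *tokens;            /* tokens already produced by the scanner */
    size_t count, pos;
    int pushed[PUSHBACK_MAX];     /* tokens that were put back, newest last */
    size_t npushed;
};

/* Return the next token, preferring ones that were put back. */
int next_token(struct token_stream *ts)
{
    assert(ts != NULL);
    if (ts->npushed > 0) return ts->pushed[--ts->npushed];
    return ts->pos < ts->count ? ts->tokens[ts->pos++] : 0;
}

/* Put a token back so the next call to next_token() returns it again. */
int unlex_token(struct token_stream *ts, int tok)
{
    assert(ts != NULL);
    if (ts->npushed == PUSHBACK_MAX) return -1;   /* pushback buffer full */
    ts->pushed[ts->npushed++] = tok;
    return 0;
}
```

In a real program next_token would call yylex() when the pushback stack is empty instead of reading from an array.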

Reading EDI Formatted Files

I'm new to EDI, and I have a question.
I have read that you can get most of what you need about an EDI format by looking at the last 3 characters of the ISA line. This is fine if every EDI used line breaks to separate entities, but I have found that many are single line files with any number of characters used as breaks. I have noticed that the VERY last character in every EDI I've parsed is the break character. I've looked at a few hundred, and have found no exceptions to this. If I first grab that character, and use that to obtain the last 3 of the ISA line, should I reasonably expect that I will be able to parse data from an EDI?
I don't know if this helps, but the EDI 'types' in question tend to be 850, 875. I'm not sure if that is a standard or not, but it may be worth mentioning.
the transaction type of edi doesn't really matter (850 = order, 875 = grocery po). having written a few edi parsers, here are a few things i've found:
you should be able to count on the ISA (and the ISA only) being fixed width (105 characters if memory serves).
strip off the first 105 characters. everything after that and before the first occurrence of "GS" is your line terminator (this can be anything, including a 0x07 - the beep - so watch out if you're outputting to stdout for debugging or you may have a bunch of beeps coming out of the speaker). normally this is 1 or 2 characters, sometimes it can be more (if the person sending you the data adds an extra terminator for some reason). once you have the line terminator, you can get the segment (field) delimiter. i normally pull the 3rd character of the GS line and use that, though the 4th character of the ISA line should work as well.
also be aware that you can get a file with multiple ISA's in it. in that case you cannot count on the line or field separators being the same within each ISA.
another thing .. it is also possible (again, not sure if it's spec) for an edi file to have a variable length ISA. this is very rare, but i had to accommodate it. if that happens you have to parse the line into its fields. the last field in the ISA is only a character long, so you can determine the real length of the ISA from it. if it were me, i wouldn't worry about this unless you see a file like it. it is a rare occurrence.
what i've said above may not be to the letter of the "spec" ... that is, i'm not sure it's legal to have different line separators in the same file, but in different ISAs, but it is technically possible and I accommodate it because i have to process files that come through in that manner. the edi processor i use processes upwards of 5000 files a day with over 3000 possible sources of data (so i see a lot of weird stuff).
best regards,
don
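The delimiter detection described in this answer can be sketched in C as follows. It takes the 105-character ISA width from the answer as given, handles only the first ISA in the data, and does not cover the variable-length ISA case; the function name edi_delimiters is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ISA_FIXED_LEN 105   /* fixed ISA width as recalled in the answer above */

/* Extract the element separator (4th character of the ISA) and the segment
 * terminator (everything between the fixed-width ISA and the first "GS").
 * Returns 0 on success, -1 if the input does not look like ISA...GS. */
int edi_delimiters(const char *edi, char *elem_sep, char *term, size_t term_cap)
{
    assert(edi != NULL && elem_sep != NULL && term != NULL);
    if (strlen(edi) <= ISA_FIXED_LEN || strncmp(edi, "ISA", 3) != 0) return -1;
    const char *after_isa = edi + ISA_FIXED_LEN;
    const char *gs = strstr(after_isa, "GS");
    if (gs == NULL || gs == after_isa) return -1;     /* no terminator found */
    size_t n = (size_t)(gs - after_isa);
    if (n + 1 > term_cap) return -1;                  /* caller's buffer too small */
    memcpy(term, after_isa, n);
    term[n] = '\0';
    *elem_sep = edi[3];
    return 0;
}
```

The terminator is returned as a string rather than a single character because, as noted above, it can be more than one byte long (for example "~" followed by a newline).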
EDI content is composed of segments and elements.
To parse it, you will need to break it up into segments first, and then elements like so (in PHP):
<?php
$edi = "YOUR EDI STRING!";
$segment_delimiter = "~";
$element_delimiter = "*";
//First break it into segments
$segments = explode($segment_delimiter, $edi);
//Now break each segment into elements
$segs_and_elems = array();
foreach($segments as $segment){
$segs_and_elems[] = explode($element_delimiter, $segment);
}
//To echo out what type of EDI this is for example:
foreach($segs_and_elems as $seg){
if($seg[0] == "GS"){ echo($seg[1]); }
}
?>
Hope this helps get you started.
For header information, the following Java will let you get the basic info pretty easily.
C# has Split as well, and the code looks very similar.
try {
String sCurrentLine;
fileContent = new BufferedReader(new FileReader(filePathName));
sCurrentLine = fileContent.readLine();
// get the delimiter after ISA, if you know your field delimiter just force it.
// we look at lots of different senders messages so never sure what it will be.
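delimiterElement = sCurrentLine.substring(3, 4); // Grab the delimiter they are using (4th character of the ISA line)
// split() takes a regex, so quote the delimiter ('*' is a metacharacter); a limit of 17 keeps fields 0-15 intact and leaves the rest of the line in [16]
String[] splitMessage = sCurrentLine.split(java.util.regex.Pattern.quote(delimiterElement), 17); // to get the messages if everything is on one line of course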
senderQualifier = splitMessage[5]; //who sent something we need fixed qualifier
senderID = splitMessage[6]; //who sent something we need fixed alias
ISA = splitMessage[13]; // Control number
testIndicator = splitMessage[15];
dateStamp = splitMessage[9];
timeStamp = splitMessage[10];
... do stuff with the pieces of info ...

Resources