Flex scanning, differentiating between string (with single spaces) and padding (more than one space) - parsing

I am having trouble with flex to scan lines that looks something like this
DESCRIPTION This is the device description
I would like the line to be scanned such that DESCRIPTION is one token and "This is the device description" is the other.
I have been playing endlessly with my rules but cannot seem to get it to work.
From the documentation I think I want to implement a rule using
`r/s'
an r but only if it is followed by an s
where spaces are only accepted is they are followed by something that is not a while space. I have no idea how to write this rule with flex's syntax. In my mind the rule should be something like
[a-zA-Z](" "/[a-zA-Z0-9]|[a-zA-Z0-9])* return IDENTIFIER;
But this is invalid.
I can get the lines to chop up each word but I cannot get the rules to differentiate between 1 space and 1 < spaces. Halp.

This is not really a good match for flex, since the recognition of tokens is context-dependent. You can achieve context-dependent scanning using start conditions but excessive use of start conditions is often an indication that some other scanning mechanism would be better.
Regardless of how you do it, the key is figuring out exactly how to decide on the token division. Consider the following four lines, for example:
DEVICE This is the device
MODE This is the mode
DESCRIPTION This is the device description
UNDOCUMENTED FIELD
Of course, it is possible that the corner cases represented by the third and fourth lines never show up in any of your inputs.
If the first token cannot include whitespace, then the problem is relatively simple, although you still need a start condition (and I'm going to assume you read the documentation linked above):
%x WHITE WORDS
%%
/* Possibly should be [[:alpha:]] instead of [[:upper:]] */
[[:upper:]]+ { /* copy yytext */; BEGIN(WHITE); return KEYWORD; }
/* Handle other possible line beginnings */
<WHITE>\n { /* Blank descriptive text */; BEGIN(INITIAL); }
<WHITE>[ \t]+ { BEGIN(WORDS); }
<WHITE>. { /* Something not correct in this line */; ... }
<WORDS>.+ { /* copy yytext */; BEGIN(INITIAL); return DESCRIPTION; }
<WORDS>\n { BEGIN(INITIAL); }
If there might be whitespace in the first token but never two spaces in a row, you could replace the first pattern above with:
[[:alpha:]]+( [[:alpha:]]+)*
which will match any sequence of words (consisting only of letters) where there is exactly one space between successive words. Like the original pattern above, this will end on the first non-alphabetic character found. That error will be detected by the rules in <WHITE>, because any non-whitespace character encountered when that start condition becomes active will be handled by the start condition's default rule (the <WHITE>. rule).

My opinion is that you are using the wrong horse here. lex (flex) should be only used for lexical analysis and yacc (or bison) for syntactic one. Saying that one single character is not a separator but multiple are is not appropriate for a lexer.
My opinion is that lex should only reports words and padding and that yacc should later re-combine words that are not separated by padding elements.
The lex part would be as simple as:
[[:alnum:]_]+ {
// printf("WORD: >%s<\n", yytext); // for debugging
return WORD;
}
[[:blank:]]{2,} {
// printf("PADDING: >%s<\n", yytext);
return PADDING;
}
and the yacc part would contain:
elt: PADDING
| ident
ident: WORD
| ident WORD
action are omitted here because they depend too much on your actual processing.

Related

Flex confusing to transform string character by character

I want to use flex to transform a string based on simple rules. I have rules like the first character stays the same and the second and third characters might change. Like if the second character was a letter, it becomes the number listed in the rules below. If the third is a digit, it becomes a certain letter.
%%
/*^[a-z] {char *yycopy = strdup( yytext ); unput(yycopy[0]);}*/
[ajs] {putchar('1');}
[bkt] {putchar('2');}
[clu] {putchar('3');}
[dmv] {putchar('4');}
[1] {putchar('j');}
[2] {putchar('k');}
[3] {putchar('l');}
[4] {putchar('m');} /*more number rules till 9*/
%%
int yywrap(void){return 1;}
int main( int argc, char **argv )
{
++argv, --argc; /* skip over program name */
if ( argc > 0 )
yyin = fopen( argv[0], "r" );
else
yyin = stdin;
while (yylex());
}
If there are different rules for characters in different positions within the string, how can I use start conditions to change a particular character (i.e. the rules for the second and third character are different).
You switch start condition by using the BEGIN action. Flex never automatically changes start condition, so you when you need to return to the initial start condition (called INITIAL), you have to do so explicitly (BEGIN(INITIAL)).
You need to declare start condition names in the (f)lex prologue, usually with the %x command. (%s is also possible but with different semantics. See the Flex manual for details.)
You indicate that a start condition applies to a rule by starting the rule with a start condition name in angle brackets. You can put more than one start condition inside the angle brackets; separate them with commas and don't use spaces. Don't put a space after the angle brackets either; they are part of the pattern and (f)lex patterns cannot include unquoted space characters.
BEGIN is a macro and it does not require parentheses around the start condition name, but I suggest always using them anyway, so you don't have to worry about what the macro expands to. Start condition names are small integers (either enum constants or preprocessor macros) but nothing guarantees their value, so don't make assumptions.
That's about it. So you could implement your astro numerological codifier with:
%x SECOND THIRD REST
%%
[a-z] ECHO; BEGIN(SECOND);
<SECOND>[ajs] putchar('1'); BEGIN(THIRD);
/* More SECOND rules */
<THIRD>1 putchar('j'); BEGIN(REST);
/* More THIRD rules */
<*>.*\n? ECHO; BEGIN(INITIAL);
(I deliberately did not add any <REST> rules beacause the fallback at the end covers it. I also deliberately left out the anchor in the first rule because my rules guarantee that the INITIAL start condition is 9nly in force at the beginning of a line. See the last rule. The last rule specifies an optional newline in case the file does not end with a newline, which occasionally happens although it's technically invalid.)

Is it legitimate for a tokenizer to have a stack?

I have designed a new language for that I want to write a reasonable lexer and parser.
For the sake of brevity, I have reduced this language to a minimum so that my questions are still open.
The language has implicit and explicit strings, arrays and objects. An implicit string is just a sequence of characters that does not contain <, {, [ or ]. An explicit string looks like <salt<text>salt> where salt is an arbitrary identifier (i.e. [a-zA-Z][a-zA-Z0-9]*) and text is an arbitrary sequence of characters that does not contain the salt.
An array starts with [, followed by objects and/or strings and ends with ].
All characters within an array that don't belong to an array, object or explicit string do belong to an implicit string and the length of each implicit string is maximal and greater than 0.
An object starts with { and ends with } and consists of properties. A property starts with an identifier, followed by a colon, then optional whitespaces and then either an explicit string, array or object.
So the string [ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ] represents an array with 6 items: " name:test ", "<html>[]</html>", " ", { name: "test" }, "bla" and " " (the object is notated in json).
As one can see, this language is not context free due to the explicit string (that I don't want to miss). However, the syntax tree is nonambiguous.
So my question is: Is a property a token that may be returned by a tokenizer? Or should the tokenizer return T_identifier, T_colon when he reads an object property?
The real language allows even prefixes in the identifier of a property, e.g. ns/name:<a<test>a> where ns is the prefix for a namespace.
Should the tokenizer return T_property_prefix("ns"), T_property_prefix_separator, T_property_name("name"), T_property_colon or just T_property("ns/name") or even T_identifier("ns"), T_slash, T_identifier("name"), T_colon?
If the tokenizer should recognize properties (which would be useful for syntax highlighters), he must have a stack, because name: is not a property if it is in an array. To decide whether bla: in [{foo:{bar:[test:baz]} bla:{}}] is a property or just an implicit string, the tokenizer must track when he enters and leave an object or array.
Thus, the tokenizer would not be a finite state machine any more.
Or does it make sense to have two tokenizers - the first, which separates whitespaces from alpha-numerical character sequences and special characters like : or [, the second, which uses the first to build more semantical tokens? The parser could then operate on top of the second tokenizer.
Anyways, the tokenizer must have an infinite lookahead to see when an explicit string ends. Or should the detection of the end of an explicit string happen inside the parser?
Or should I use a parser generator for my undertaking? Since my language is not context free, I don't think there is an appropriate parser generator.
Thanks in advance for your answers!
flex can be requested to provide a context stack, and many flex scanners use this feature. So, while it may not fit with a purist view of how a scanner scans, it is a perfectly acceptable and supported feature. See this chapter of the flex manual for details on how to have different lexical contexts (called "start conditions"); at the very end is a brief description of the context stack. (Don't miss the sentence which notes that you need %option stack to enable the stack.) [See Note 1]
Slightly trickier is the requirement to match strings with variable end markers. flex does not have any variable match feature, but it does allow you to read one character at a time from the scanner input, using the function input(). That's sufficient for your language (at least as described).
Here's a rough outline of a possible scanner:
%option stack
%x SC_OBJECT
%%
/* initial/array context only */
[^][{<]+ yylval = strdup(yytext); return STRING;
/* object context only */
<SC_OBJECT>{
[}] yy_pop_state(); return '}';
[[:alpha:]][[:alnum:]]* yylval = strdup(yytext); return ID;
[:/] return yytext[0];
[[:space:]]+ /* Ignore whitespace */
}
/* either context */
<*>{
[][] return yytext[0]; /* char class with [] */
[{] yy_push_state(SC_OBJECT); return '{';
"<"[[:alpha:]][[:alnum:]]*"<" {
/* We need to save a copy of the salt because yytext could
* be invalidated by input().
*/
char* salt = strdup(yytext);
char* saltend = salt + yyleng;
char* match = salt;
/* The string accumulator code is *not* intended
* to be a model for how to write string accumulators.
*/
yylval = NULL;
size_t length = 0;
/* change salt to what we're looking for */
*salt = *(saltend - 1) = '>';
while (match != saltend) {
int ch = input();
if (ch == EOF) {
yyerror("Unexpected EOF");
/* free the temps and do something */
}
if (ch == *match) ++match;
else if (ch == '>') match = salt + 1;
else match = salt;
/* Don't do this in real code */
yylval = realloc(yylval, ++length);
yylval[length - 1] = ch;
}
/* Get rid of the terminator */
yylval[length - yyleng] = 0;
free(salt);
return STRING;
}
. yyerror("Invalid character in object");
}
I didn't test that thoroughly, but here is what it looks like with your example input:
[ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ]
Token: [
Token: STRING: -- name:test --
Token: STRING: --<html>[]</html>--
Token: STRING: -- --
Token: {
Token: ID: --name--
Token: :
Token: STRING: --test--
Token: }
Token: STRING: --bla--
Token: STRING: -- --
Token: ]
Notes
In your case, unless you wanted to avoid having a parser, you don't actually need a stack since the only thing that needs to be pushed onto the stack is an object context, and a stack with only one possible value can be replaced with a counter.
Consequently, you could just remove the %option stack and define a counter at the top of the scan. Instead of pushing the start condition, you increment the counter and set the start condition; instead of popping, you decrement the counter and reset the start condition if it drops to 0.
%%
/* Indented text before the first rule is inserted at the top of yylex */
int object_count = 0;
<*>[{] ++object_count; BEGIN(SC_OBJECT); return '{';
<SC_OBJECT[}] if (!--object_count) BEGIN(INITIAL); return '}'
Reading the input one character at a time is not the most efficient. Since in your case, a string terminate must start with >, it would probably be better to define a separate "explicit string" context, in which you recognized [^>]+ and [>]. The second of these would do the character-at-a-time match, as with the above code, but would terminate instead of looping if it found a non-matching character other than >. However, the simple code presented may turn out to be fast enough, and anyway it was just intended to be good enough to do a test run.
I think the traditional way to parse your language would be to have the tokenizer return T_identifier("ns"), T_slash, T_identifier("name"), T_colon for ns/name:
Anyway, I can see three reasonable ways you could implement support for your language:
Use lex/flex and yacc/bison. The tokenizers generated by lex/flex do not have stack so you should be using T_identifier and not T_context_specific_type. I didn't try the approach so I can't give a definite comment on whether your language could be parsed by lex/flex and yacc/bison. So, my comment is try it to see if it works. You may find information about the lexer hack useful: http://en.wikipedia.org/wiki/The_lexer_hack
Implement a hand-built recursive descent parser. Note that this can be easily built without separate lexer/parser stages. So, if the lexemes depend on context it is easy to handle when using this approach.
Implement your own parser generator which turns lexemes on and off based on the context of the parser. So, the lexer and the parser would be integrated together using this approach.
I once worked for a major network security vendor where deep packet inspection was performed by using approach (3), i.e. we had a custom parser generator. The reason for this is that approach (1) doesn't work for two reasons: firstly, data can't be fed to lex/flex and yacc/bison incrementally, and secondly, HTTP can't be parsed by using lex/flex and yacc/bison because the meaning of the string "HTTP" depends on its location, i.e. it could be a header value or the protocol specifier. The approach (2) didn't work because data can't be fed incrementally to recursive descent parsers.
I should add that if you want to have meaningful error messages, a recursive descent parser approach is heavily recommended. My understanding is that the current version of gcc uses a hand-built recursive descent parser.

(F) Lex, how do I match negation?

Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
Which means match anything that is not one of the rules inside the parenthesis. Now, I know in flex I can negate character rules (ex: [^ab] , but some of the rules I want to negate could be more complicated than a single character so I don't think I could use character rules for that. For example I may need to negate the sequence '"""' for multiline strings but I'm not sure what the way to do it in flex would be.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that:
(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The real state diagram we're looking for is
and one way of writing that as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))
Again, I produced that by tracing all the ways to end up in state 0:
[^E] stays in state 0
E in state 1:
(E|NE)*: stay in state 1
[^EN]: back to state 0
N[^ED]:back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:
%x TRIPLEQ
start \"\"\"
end \"\"\"
%%
{start} { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }
<TRIPLEQ>.|\n { /* Append the next token to yytext instead of
* replacing yytext with the next token
*/
yymore();
/* No return yet, flex continues */
}
<TRIPLEQ>{end} { /* We've found the end of the string, but
* we need to get rid of the terminating """
*/
yylval.str = malloc(yyleng - 2);
memcpy(yylval.str, yytext, yyleng - 3);
yylval.str[yyleng - 3] = 0;
return STRING;
}
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)

Flex regular expression for comments

I'm trying to learn flex and having trouble with a regular expression to catch comments.
Assuming a comment begins with // and runs to the end of the line, I would like the program to recognize the entire comment and set yytext equal to it.
So far ["//".*$] is not cutting the mustard.
Thank you
Putting your text in square brackets creates a character class matching any one character from among those between the brackets. Also, quotation marks are not special in Flex's regex syntax. You want something along these lines:
/* definitions (for more readable rules) */
/* The \134 are octal escapes for the '/' character, for clarity: */
CMNT_START \134\134
%%
/* rules */
{CMNT_START}.*$ /* yytext automatically contains the matched text*/;

How to make lex/flex recognize tokens not separated by whitespace?

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.
A simplified version of the code I have:
%{
#include <stdio.h>
%}
%option main warn debug
%%
if |
then |
else printf("keyword: %s\n", yytext);
[[:digit:]]+ printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]* printf("identifier: %s\n", yytext);
[[:space:]]+ // skip whitespace
[[:^space:]]+ { printf("ERROR: %s\n", yytext); exit(1); }
%%
When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:
number: 39
keyword: if
(I.e. the same as if I entered 39 if as the input.)
Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.
UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".
You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the parsing is at is to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways -- look in the manual for the ways to specify your own i/o functions, for example, -- but the all around simplest IMHO is to use a start condition:
%x LEXING_ERROR
%%
// all your rules; the following *must* be at the end
. { BEGIN(LEXING_ERROR); yyless(1); }
<LEXING_ERROR>.+ { fprintf(stderr,
"Invalid character '%c' found at line %d,"
" just before '%s'\n",
*yytext, yylineno, yytext+1);
exit(1);
}
Note: Make sure that you've ignored whitespace in your rules. The pattern .+ matches any number but at least one non-newline character, or in other words up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) backs up the read pointer by n characters, so after the . rule matches, it will rescan that character producing (hopefully) a semi-reasonable error message. (It won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)
Look up start conditions in the flex manual for more info about %x and BEGIN

Resources