Start states in Lex / Flex - parsing

I'm using Flex and Bison for a parser generator, but having problems with the start states in my scanner.
I'm using exclusive rules to deal with commenting, but this grammar doesn't seem to match quoted tokens:
%x COMMENT
// { BEGIN(COMMENT); }
<COMMENT>[^\n] ;
<COMMENT>\n { BEGIN(INITIAL); }
"==" { return EQUALEQUAL; }
. ;
In this simple example the line:
// a == b
isn't matched entirely as a comment, unless I include this rule:
<COMMENT>"==" ;
How do I get round this without having to add all these tokens into my exclusive rules?

Matching C-style comments in Lex/Flex or whatever is well documented:
in the documentation, as well as various variations around the Internet.
Here is a variation on that found in the Flex documentation:
<INITIAL>{
"//" BEGIN(IN_COMMENT);
}
<IN_COMMENT>{
\n BEGIN(INITIAL);
[^\n]+ // eat comment
"/" // eat the lone /
}

Try adding a "+" after the [^n] rule. I don't know why the exclusive state is still picking up '==' even in an exclusive state, but apparently it is. Flex will normally match the rule that matches the most text, and adding the "+" will at least make the two rules tie in length. Putting the COMMENT rule first will cause it to be used in case of a tie.

The clue is:
The problem is this 'eat comment'
rule doesn't seem to match tokens with
more than one character
so add a * to match zero or more non-newlines. You want Zero otherwise a empty comment will not match.
%x COMMENT
// { BEGIN(COMMENT); }
<COMMENT>[^\n]* ;
<COMMENT>\n { BEGIN(INITIAL); }
"==" { return EQUALEQUAL; }
. ;

Related

Can't identify the invalid identifier in my FLEX code

I am trying to build my own compiler which outputs the type of input the user gives, for example, abcd is an identifier, and 1242 is an integer. I have implemented it as below:
textProg.l
%{
#define IDENTIFIER 10
#define INTEGER 11
%}
IDENTIFIER [a-zA-Z_][a-zA-Z0-9_]*
INTEGER [1-9][0-9]*|"0"
%%
{IDENTIFIER} { return IDENTIFIER; }
{INTEGER} { return INTEGER; }
%%
int main() {
int token;
while(token = yylex()) {
if(token == IDENTIFIER) { printf("IDENTIFIER"); }
else if(token == INTEGER) { printf("INTEGER"); }
else { printf("INVALID"); }
}
}
This works perfectly when I run the following commands:
flex testProg.l
cc lex.yy.c -lfl
./a.out
Sample working input
sample
IDENTIFIER
1993
INTEGER
The problem arises when I try to input an invalid token, for example 12abc. This is neither an integer nor an identifier and should output "INVALID" but it outputs:
12abc
INTEGER
IDENTIFIER
What happened is that 12 and abc are taken as separate tokens instead of one. How can I avoid this?
Many languages use lexical analysers which are perfectly happy to let 12abc be an integer followed by a identifier. Why not? If that means something in the language, then that's probably what the user meant. If it doesn't mean anything, it will trigger a syntax error, so the user will be informed.
But, OK, you want to recognise that as an error. In that case you need to recognise the erroneous input as an error, and the first step is to recognise it as a token. That's easy if you remember flex's match precedences:
[[:alpha:]_][[:alnum:]_]* { return IDENTIFIER; }
[1-9][[:digit:]]*|0 { return NUMBER; }
[[:alnum:]_]+ { return BADTOKEN; }
Note that I replaced your macros with actual patterns, using named character classes for readability, and removed the redundant quotes on "0".
Flex parses 12abc as two separate tokens because you didn't tell it it shouldn't.
Lex derivatives, like Flex, works by one very simple but effective algorithm:
They start at the position when the last token ended (or at beginning of the text) and try to find a rule that matches the most characters from this point. (If there are multiple rules that match the same number of characters, the one defined in the "*.l" file first is chosen.)
That's it. Notice there is nothing about it having to match a whole word.
That's actually a good thing. It is why in most programming languages you don't need to explicitly separate tokens. You can write things like (2+30L)/2 and the lexer for that language will figure out where each token ends, without additional hints like whitespaces. (The tokens would be (, 2, +, 30, L, ), / and 2.)
If you want to disable this fancy mechanism for the specific case of putting numbers and identifiers together, you will need to create a rule that explicitly forbids it, e.g:
{IDENTIFIER} { return IDENTIFIER; }
{INTEGER} { return INTEGER; }
[0-9A-Za-z_]+ { return ERROR; }
Notice that this new rule also matches valid identifiers and integers. However, it won't be used for them because it is under them on the rules list.

Capture names containing --but not ending-- in dashes

I am trying to capture names (not starting with a number) which could contain dashes, such as hello-world. My problem is that I also have rules for single dashes and symbols which conflict with it:
[A-Za-z][A-Za-z0-9-]+ { /* capture "hello-world" */ }
"-" { return '-'; }
">" { return '>'; }
When the lexer reads hello-world-> the previous rules yield hello-world- and >, whereas I expected hello-world, - and > to be captured individually. To solve it I fixed it this way:
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+ { /* ensure final dash is never included at the end */ }
That works, except for single-letter words, so finally I implemented this:
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+ { /* ensure final dash is never included at the end */ }
[A-Za-z][A-Za-z0-9]* { /* capture possible single letter words */ }
Question: Is there a more elegant way to do it?
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+
[A-Za-z][A-Za-z0-9]*
Note that, as you said, the first rule already covers everything that's not a single letter. So the second rule only has to match single letters and can be shortened to just [A-Za-z]:
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+
[A-Za-z]
Now the second rule is a mere prefix of the first, so we can combine this into a single rule by making the part after the first letter optional:
[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9]+)?
The + on the last bit is unnecessary because everything except the last character can as well be matched by the middle part, so the simplest version is:
[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?

Match the input with string using lex

I'm trying to match the prefix of the string Something. For example, If input So,SOM,SomeTH,some,S, it is all accepted because they are all prefixes of Something.
My code
Ss[oO]|Ss[omOMOmoM] {
printf("Accept Something": %s\n", yytext);
}
Input
Som
Output
Accept Something: So
Invalid Character
It's suppose to read Som because it is a prefix of Something. I don't get why my code doesn't work. Can anyone correct me on what I am doing wrong?
I don't know what you think the meaning of
Ss[oO]|Ss[omOMOmoM]
is, but what it matches is either:
an S followed by an s followed by exactly one of the letters o or O, or
an S followed by an s followed by exactly one of the letters o, O, m or M. Putting a symbol more than once inside a bracket expression has no effect.
Also, I don't see how that could produce the output you report. Perhaps there was a copy-and-paste error, or perhsps you have other pattern rules.
If you want to match prefixes, use nested optional matches:
s(o(m(e(t(h(i(ng?)?)?)?)?)?)?)?
If you want case-insensitive matcges, you could write out all the character classes, but that gets tiriesome; simpler is to use a case-insensitve flag:
(?i:s(o(m(e(t(h(i(ng?)?)?)?)?)?)?)?)
(?i: turns on the insensitive flag, until the matching close parenthesis.
In practice, this is probably not what you want. Normally, you will want to recognise a complete word as a token. You could then check to see if the word is a prefix in the rule action:
[[:alpha:]]+ { if (yyleng <= strlen("something") && 0 == strncasemp(yytext, "something", yyleng) {
/* do something */
}
}
There is lots of information in the Flex manual.
Right now your code (as shown) should only match "Sso" or "SsO" or "Ssm" or "SsM".
You have two alternatives that each start with Ss (without square brackets) so those will be matched literally. That's followed by either [oO] or [omOMomoM], but the characters in square brackets represent alternatives, so that's equivalent to [oOmM] --i.e., any one character of of o, O, m or M.
I'd start with: %option caseless to make it a case-insensitive scanner, so you don't have to list the upper- and lower-case equivalents of every letter.
Then it's probably easiest to just list the alternatives literally:
s|so|som|some|somet|someth|somethi|somethin|something { printf("found prefix"); }
I guess you can make the pattern a bit shorter (at least in the source code) by doing something on this order:
s(o(m(e(t(h(i(n(n(g)?)?)?)?)?)?)?)?)? { printf("found prefix"); }
Doesn't seem like a huge improvement to me, but some might find it more attractive than I do.
If you don't want to use %option caseless the basic idea helps more:
[sS]([oO]([mM]([eE]([tT]([hH]([iI]([nN]([gG])?)?)?)?)?)?)?)? { printf("found prefix"); }
Listing every possible combination of upper and lower case would get tedious.

Flex scanning, differentiating between string (with single spaces) and padding (more than one space)

I am having trouble with flex to scan lines that looks something like this
DESCRIPTION This is the device description
I would like the line to be scanned such that DESCRIPTION is one token and "This is the device description" is the other.
I have been playing endlessly with my rules but cannot seem to get it to work.
From the documentation I think I want to implement a rule using
`r/s'
an r but only if it is followed by an s
where spaces are only accepted is they are followed by something that is not a while space. I have no idea how to write this rule with flex's syntax. In my mind the rule should be something like
[a-zA-Z](" "/[a-zA-Z0-9]|[a-zA-Z0-9])* return IDENTIFIER;
But this is invalid.
I can get the lines to chop up each word but I cannot get the rules to differentiate between 1 space and 1 < spaces. Halp.
This is not really a good match for flex, since the recognition of tokens is context-dependent. You can achieve context-dependent scanning using start conditions but excessive use of start conditions is often an indication that some other scanning mechanism would be better.
Regardless of how you do it, the key is figuring out exactly how to decide on the token division. Consider the following four lines, for example:
DEVICE This is the device
MODE This is the mode
DESCRIPTION This is the device description
UNDOCUMENTED FIELD
Of course, it is possible that the corner cases represented by the third and fourth lines never show up in any of your inputs.
If the first token cannot include whitespace, then the problem is relatively simple, although you still need a start condition (and I'm going to assume you read the documentation linked above):
%x WHITE WORDS
%%
/* Possibly should be [[:alpha:]] instead of [[:upper:]] */
[[:upper:]]+ { /* copy yytext */; BEGIN(WHITE); return KEYWORD; }
/* Handle other possible line beginnings */
<WHITE>\n { /* Blank descriptive text */; BEGIN(INITIAL); }
<WHITE>[ \t]+ { BEGIN(WORDS); }
<WHITE>. { /* Something not correct in this line */; ... }
<WORDS>.+ { /* copy yytext */; BEGIN(INITIAL); return DESCRIPTION; }
<WORDS>\n { BEGIN(INITIAL); }
If there might be whitespace in the first token but never two spaces in a row, you could replace the first pattern above with:
[[:alpha:]]+( [[:alpha:]]+)*
which will match any sequence of words (consisting only of letters) where there is exactly one space between successive words. Like the original pattern above, this will end on the first non-alphabetic character found. That error will be detected by the rules in <WHITE>, because any non-whitespace character encountered when that start condition becomes active will be handled by the start condition's default rule (the <WHITE>. rule).
My opinion is that you are using the wrong horse here. lex (flex) should be only used for lexical analysis and yacc (or bison) for syntactic one. Saying that one single character is not a separator but multiple are is not appropriate for a lexer.
My opinion is that lex should only reports words and padding and that yacc should later re-combine words that are not separated by padding elements.
The lex part would be as simple as:
[[:alnum:]_]+ {
// printf("WORD: >%s<\n", yytext); // for debugging
return WORD;
}
[[:blank:]]{2,} {
// printf("PADDING: >%s<\n", yytext);
return PADDING;
}
and the yacc part would contain:
elt: PADDING
| ident
ident: WORD
| ident WORD
action are omitted here because they depend too much on your actual processing.

Jison Lexer - Detect Certain Keyword as an Identifier at Certain Times

"end" { return 'END'; }
...
0[xX][0-9a-fA-F]+ { return 'NUMBER'; }
[A-Za-z_$][A-Za-z0-9_$]* { return 'IDENT'; }
...
Call
: IDENT ArgumentList
{{ $$ = ['CallExpr', $1, $2]; }}
| IDENT
{{ $$ = ['CallExprNoArgs', $1]; }}
;
CallArray
: CallElement
{{ $$ = ['CallArray', $1]; }}
;
CallElement
: CallElement "." Call
{{ $$ = ['CallElement', $1, $3]; }}
| Call
;
Hello! So, in my grammar I want "res.end();" to not detect end as a keyword, but as an ident. I've been thinking for a while about this one but couldn't solve it. Does anyone have any ideas? Thank you!
edit: It's a C-like programming language.
There's not quite enough information in the question to justify the assumptions I'm making here, so this answer may be inexact.
Let's suppose we have a somewhat Lua-like language in which a.b is syntactic sugar for a["b"]. Furthermore, since the . must be followed by a lexical identifier -- in other words, it is never followed by a syntactic keyword -- we'd like to inhibit keyword recognition in this context.
That's a pretty simple rule. It's simple enough that the lexer could implement it without any semantic information at all; all that it says is that the token which follows a . must be an identifier. In this context, keywords should be treated as identifiers, and anything else other than an identifier is an error.
We can do this with start conditions. Specifically, we define a start condition which is only used after a . token:
%x selector
%%
/* White space and comment rules need to explicitly include
* the selector condition
*/
<INITIAL,selector>\s+ ;
/* Other rules, including keywords, are unmodified */
"end" return "END";
/* The dot rule triggers a new start condition */
"." this.begin("selector"); return ".";
/* Outside of the start condition, identifiers don't change state. */
[A-Za-z_]\w* yylval = yytext; return "ID";
/* Only identifiers are valid in this start condition, and if found
* the start condition is changed back. Anything else is an error.
*/
<selector>[A-Za-z_]\w* yylval = yytext; this.popState(); return "ID";
<selector>. parse_error("Expecting identifier");
Modify your parser, so it always knows what it is expecting to read next (that will be some set of tokens, you can compute this using the notion of First(x) for x being any nonterminal).
When lexing, have the lexer ask the parser what set of tokens it expects next.
Your keywork reconizer for 'end' asks the parser, and it either ways "expecting 'end'" at which pointer the lexer simply hands on the 'end' lexeme, or it says "expecting ID" at which point it hands the parser an ID with name text "end".
This may or may not be convenient to get your parser to do. But you need something like this.
We use a GLR parser; our parser accepts multiple tokens in the same place. Our solution is to generate both the 'end' keyword and and the identifier with text "end" and shove them both into the GLR parser. It can handle local ambiguity; the multiple parses caused by this proceed until the parser with the wrong assumption encounters a syntax error, and then it just vanishes, by fiat. The last standing parser is the one with the right set of assumptions. This scheme is somewhat like the first one, just that we hand the parser the choices and it decides rather than making the lexer decide.
You might be able to send your parser a "two-interpretation" lexeme, e.g., a keyword-in-context lexeme, which in essence claims it it both a keyword and/or an identifier. With a single token lookahead internally, the parser can likely decide easily and restamp the lexeme. Not as general as the GLR solution, but probably works in a lot of cases.

Resources