Match the input with string using lex - flex-lexer

I'm trying to match any prefix of the string Something. For example, the inputs So, SOM, SomeTH, some and S should all be accepted, because they are all prefixes of Something (ignoring case).
My code
Ss[oO]|Ss[omOMOmoM] {
printf("Accept Something: %s\n", yytext);
}
Input
Som
Output
Accept Something: So
Invalid Character
It's supposed to read Som, because that is a prefix of Something. I don't understand why my code doesn't work. Can anyone tell me what I am doing wrong?

I don't know what you think the meaning of
Ss[oO]|Ss[omOMOmoM]
is, but what it matches is either:
an S followed by an s followed by exactly one of the letters o or O, or
an S followed by an s followed by exactly one of the letters o, O, m or M. Putting a symbol more than once inside a bracket expression has no effect.
Also, I don't see how that could produce the output you report. Perhaps there was a copy-and-paste error, or perhaps you have other pattern rules.
If you want to match prefixes, use nested optional matches:
s(o(m(e(t(h(i(ng?)?)?)?)?)?)?)?
If you want case-insensitive matches, you could write out all the character classes, but that gets tiresome; it is simpler to use a case-insensitive flag:
(?i:s(o(m(e(t(h(i(ng?)?)?)?)?)?)?)?)
(?i: turns on the insensitive flag, until the matching close parenthesis.
In practice, this is probably not what you want. Normally, you will want to recognise a complete word as a token. You could then check to see if the word is a prefix in the rule action:
[[:alpha:]]+ { if (yyleng <= strlen("something") && 0 == strncasecmp(yytext, "something", yyleng)) {
/* do something */
}
}
There is lots of information in the Flex manual.

Right now your code (as shown) should only match "Sso" or "SsO" or "Ssm" or "SsM".
You have two alternatives that each start with Ss (without square brackets), so those will be matched literally. That's followed by either [oO] or [omOMOmoM], but the characters in square brackets represent alternatives, so the second one is equivalent to [oOmM], i.e., any one of the characters o, O, m or M.
I'd start with: %option caseless to make it a case-insensitive scanner, so you don't have to list the upper- and lower-case equivalents of every letter.
Then it's probably easiest to just list the alternatives literally:
s|so|som|some|somet|someth|somethi|somethin|something { printf("found prefix"); }
I guess you can make the pattern a bit shorter (at least in the source code) by doing something on this order:
s(o(m(e(t(h(i(n(g)?)?)?)?)?)?)?)? { printf("found prefix"); }
Doesn't seem like a huge improvement to me, but some might find it more attractive than I do.
If you don't want to use %option caseless the basic idea helps more:
[sS]([oO]([mM]([eE]([tT]([hH]([iI]([nN]([gG])?)?)?)?)?)?)?)? { printf("found prefix"); }
Listing every possible combination of upper and lower case would get tedious.

Related

Flex lexical analyzer not behaving as expected

I'm trying to use Flex to match basic patterns and print something.
%%
^[^qA-Z]*q[a-pr-z0-9]*4\n {printf("userid1, userid2 \n"); return 1;}
%%
int yywrap(void){return 1;}
int main( int argc, char **argv )
{
++argv, --argc; /* skip over program name */
if ( argc > 0 )
yyin = fopen( argv[0], "r" );
else
yyin = stdin;
while (yylex());
}
Resolved dumb question
I don't know what you are trying to do, so I'll focus on the immediate issue, which is your last pattern:
^[^qA-Z]*q[a-pr-z0-9]*4[a-pr-z0-9]*4[a-pr-z0-9]*\n
That pattern starts by matching [^qA-Z]*, which is any number of anything which is not a q nor a capital letter (A-Z). Then it matches a q.
Here it's worth considering all the things which are not a q nor a capital letter (A-Z). Obviously, that includes lower-case letters such as s (other than q). It also includes digits. And it includes any other character: punctuation, whitespace, even control characters. In particular, it includes a newline character.
So when you type
10s10<newline>
That certainly could be the start of the last pattern. The scanner hasn't yet seen a q so it doesn't know whether the pattern will eventually match, but it hasn't yet failed. So it keeps on reading more characters, including more newlines.
When you eventually type a q, the scanner can continue with the rest of the pattern. Depending on what you type next, it might or might not be able to continue. If, as seems likely, your input eventually fails to match the pattern, the lexer will fall back to the longest successful match, which is the first pattern. At that point, it will perform the first action.
Negative character classes need to be used with a bit of caution. It's easy to fall into the trap of thinking that "not ..." only includes "reasonable" input. But it includes everything. Often, as in this case, you'll want to at least exclude newlines.

Flex expression required for validating certain expression based upon the first three characters only

For my parser, for the purpose of this question, any line starting with a single lowercase letter among a set of lowercase letters, followed by the character '=' followed by any other character is a valid line. So, the following are valid lines (all starting from first column):
a=20
b=50 70
q=20 Hello There
z=-
Any other line is not valid. My need is to match the complement. How do I write a flex expression to match the invalid lines? My confusion arises from the ^, which means start of line as well as complement of the expression.
I thought ^[abq][=].+ would match the acceptable lines, so merely complementing it with ^ would do. But ^ at the start of the expression always means "match at the start of the line". I made a few other attempts, but they did not work either. Though not relevant, the expression is used as the first step to discard invalid SDP lines. See here for details from the relevant SDP RFC, if it matters.
The simplest approach is to always match entire lines (or use different start conditions to lexically analyse the rest of valid lines). Although flex does not have a negation operator (the [^…] negative character class is not an operator), in this case the expressions are pretty simple and can be expressed easily enough. Note that it doesn't matter that the various "invalid line" patterns are not disjoint, since it doesn't matter which one matches a particular invalid line. So here are three patterns which I believe collectively match all invalid lines:
[^abqz\n].* { /* Starts with the wrong letter */ }
.[^=\n].* { /* Second character not = */ }
.$ { /* Only one character in line */ }

(F) Lex, how do I match negation?

Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
Which means match anything that is not one of the rules inside the parentheses. Now, I know in flex I can negate character rules (e.g. [^ab]), but some of the rules I want to negate could be more complicated than a single character, so I don't think I could use character rules for that. For example, I may need to negate the sequence '"""' for multiline strings, but I'm not sure how to do that in flex.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that:
(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The real state diagram we're looking for is
and one way of writing that as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))*
Again, I produced that by tracing all the ways to end up in state 0:
[^E]: stay in state 0
E: go to state 1, and from there:
  (E|NE)*: stay in state 1
  [^EN]: back to state 0
  N[^ED]: back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize Python triple-quoted strings:
%x TRIPLEQ
start \"\"\"
end \"\"\"
%%
{start}        { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }
<TRIPLEQ>.|\n  { /* Append the next token to yytext instead of
                  * replacing yytext with the next token
                  */
                 yymore();
                 /* No return yet, flex continues */
               }
<TRIPLEQ>{end} { /* We've found the end of the string, but
                  * we need to get rid of the terminating """
                  */
                 yylval.str = malloc(yyleng - 2);
                 memcpy(yylval.str, yytext, yyleng - 3);
                 yylval.str[yyleng - 3] = 0;
                 BEGIN( INITIAL ); /* Leave the start condition */
                 return STRING;
               }
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)

What do these two regexes match?

I can't figure out what these regexes match:
A: "\\/\\/c\\/(\\d*)"
B: "\\/\\/(\\d*)"
I suppose they are matching some kind of number sequence since \d matches any digit but I'd like to know an example of a string that would be a match for this regex.
The pattern syntax is that specified by ICU. Expressions are created with NSRegularExpression in an iOS app and are correct.
The first matches //c/ + 0 or more digits. The second matches // + 0 or more digits. In both the digits are captured.
An example of a match for A) is //c/123
An example of a match for B) is //12345
When I use Cygwin which emulates Bash on Windows, I sometimes run into situations where I have to escape my escape characters which is what I think is making this expression look so weird. For instance, when I use sed to look for a single '\' I sometimes have to write it as '\\\\'. (Funny, StackOverflow proved my point. If you write 4 backslashes in the comment, it only shows two. So if you process it again, they might all disappear depending on your situation).
Considering this, it might be helpful to think of pairs of backslashes as representing only one if you're coming from a similar situation. My guess would be you are. Because of this I would say Erik Duymelinck is probably spot on. This will capture a sequence of digits that may or may not follow a couple slashes and a c:
//c/000
//00000
This regex matches an odd sequence of characters, which, at first glance, almost seem like a regex, since \d is a digit, and followed by an asterisk (\d*) would mean zero-or-more digits. But it's not a digit, because the escape-slash is escaped.
\\/\\/c\\/(\\d*)
So, for instance, this one matches the following text:
\/\/c\/\
\/\/c\/\d
\/\/c\/\dd
\/\/c\/\ddd
\/\/c\/\dddd
\/\/c\/\ddddd
\/\/c\/\dddddd
...
This one is almost the same
\\/\\/(\\d*)
except you just delete the c\/ from the above results:
\/\/\
\/\/\d
\/\/\dd
\/\/\ddd
\/\/\dddd
\/\/\ddddd
\/\/\dddddd
...
In both cases, the final \ and optional d's are capture group one.
My first impression was that these regexes were intended for escaping in Java strings, meaning they would be completely invalid. If they were escaped for Java strings, such as
Pattern p = Pattern.compile("\\/\\/c\\/(\\d*)");
It would be invalid, because after un-escaping, it would result in this invalid regex:
\/\/c\/(\d*)
The single escape-slashes (\) are invalid. But the \d is valid, as it would mean any digit.
But again, I don't think they're invalid, and they're not escaped for a Java string. They're just odd.

flex usage of (?r-s:pattern)

I am trying to use the regular expression (?r-s:pattern) as mentioned in the Flex manual.
The following code works only when I input lowercase 'a', not uppercase 'A':
%%
[(?i:a)] { printf("color"); }
\n { printf("NEWLINE\n"); return EOL;}
. { printf("Mystery character %s\n", yytext); }
%%
OUTPUT
a
colorNEWLINE
A
Mystery character A
NEWLINE
The reverse is also true: if I change (?i:a) to (?i:A), only 'A' is accepted and not 'a'.
If I remove the square brackets, flex reports this error:
"ex1.lex", line 2: unrecognized rule
If I enclose the pattern in double quotes, as "(?i:a)", it compiles, but at run time every input falls through to the last rule, "Mystery character...".
Please let me know how to use it properly.
I guess I am late. :) Anyway, which flex version are you using? I have version 2.5.35 installed, and it correctly recognizes the pattern above. Perhaps you're using an old version.
As for enclosing it in [] brackets: that compiles because a bracket expression matches any one of the individual characters it lists, here (, ?, i, :, a or ). That's why a is recognized and A is not (it is not in the list).
The way I read the manual, the rule without the square brackets should perform the case-insensitive matching you're looking for--I can't explain why you get an error at compile time. But you can achieve the same behavior in one of two ways. One, you can enumerate the upper and lower case characters in the character class:
%%
[Aa] { printf("color"); }
%%
Two, you can specify the case-insensitive scanner option, either on the command line as -i or --case-insensitive, or in the definitions section of your .l file (before the first %%):
%option case-insensitive
%%
[a] { printf("color"); }
%%
