How does Flex distinguish between A, AB, and ABC?

I ran this experiment with Flex to see whether, when I enter ABC, it would report all of A, AB and ABC, only ABC, or only the first match in the list of expressions.
%{
#include <stdio.h>
%}
%%
A puts("got A");
AB puts("got AB");
ABC puts("got ABC");
%%
int main(int argc, char **argv)
{
    yylex();
    return 0;
}
When I enter ABC after compiling and running the program, it responds with "got ABC". This really surprised me, since I thought lex doesn't keep track of visited text and only finds the first match; instead, it seems to find the longest match.
What strategy does Flex use to respond to A if and only if there is no longer match?

The fact that (F)lex uses the maximal-munch principle should hardly be surprising, since it is well documented in the Flex manual:
When the generated scanner is run, it analyzes its input looking for strings which match any of its patterns. If it finds more than one match, it takes the one matching the most text…. If it finds two or more matches of the same length, the rule listed first in the flex input file is chosen.
(First paragraph of the section "How the input is matched")
The precise algorithm is exceedingly simple: every time a token is requested, flex scans the text, moving through the DFA. Every time it hits an accepting state, it records the current text position. When no more transitions are possible, it returns to the last recorded accept position, and that becomes the end of the token.
The consequence is that (F)lex can scan the same text multiple times, although it only scans once for each token.
A set of lexical rules which require excessive back-tracking will slow down the lexical scan. This is discussed in the Flex manual section Performance Considerations, along with some strategies to avoid the issue. However, except in pathological cases, the overhead from back-tracking is not noticeable.
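To make the record-and-back-up loop described above concrete, here is a minimal Python sketch; it is not what flex actually generates. The hand-written transition table, the names `next_token` and `scan`, and the "echo" label for unmatched characters are all assumptions made for illustration, and real flex tables are compressed and deal with buffering.

```python
# Hand-built DFA for the rules A, AB and ABC (an illustrative assumption,
# not flex's real table layout). State 0 is the start state.
TRANSITIONS = {(0, "A"): 1, (1, "B"): 2, (2, "C"): 3}
ACCEPTING = {1: "got A", 2: "got AB", 3: "got ABC"}


def next_token(text, start, accepting=ACCEPTING):
    """Walk the DFA from `start`; return (action, end) of the last accept, or None."""
    state, pos, last_accept = 0, start, None
    while pos < len(text) and (state, text[pos]) in TRANSITIONS:
        state = TRANSITIONS[(state, text[pos])]
        pos += 1
        if state in accepting:
            # Remember the most recent accepting position.
            last_accept = (accepting[state], pos)
    # No more transitions: fall back to the last recorded accept.
    return last_accept


def scan(text, accepting=ACCEPTING):
    """Split `text` into tokens; unmatched characters are labelled "echo"."""
    results, pos = [], 0
    while pos < len(text):
        hit = next_token(text, pos, accepting)
        if hit is None:
            results.append(("echo", text[pos]))
            pos += 1
        else:
            action, end = hit
            results.append((action, text[pos:end]))
            pos = end
    return results


if __name__ == "__main__":
    print(scan("ABC"))
    print(scan("ABD", {1: "got A", 3: "got ABC"}))
```

With the three rules from the question every state on the path is accepting, so `scan("ABC")` yields a single ABC token. Passing a table that accepts only A and ABC shows the back-up: on "ABD" the walk gets as far as B, finds no transition on D, and falls back to the A it recorded earlier.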

Related

Conditionals for flex

Is it possible to place conditional statements for rules in flex? I need this in order to match a specific rule only when some condition is true. Something like this:
%option c++
%option noyywrap
%%
/*if (condition) {*/
[0-9] {}
/*}*/
[0-9]{2} {}
. /* eat up any unmatched character */
%%
void yylex(void);
int main()
{
    FlexLexer *lexer = new yyFlexLexer();
    lexer->yylex();
    delete lexer;
}
Or is it possible to modify the final c++ generated code, in order to match only some specific regex rules?
UPDATE:
Using start conditions doesn't seem to help. What I want is, depending on some external variable (like isNthRegexActive), to be able to match or not a specific regex.
For example, if I have 4 regex rules and the first and second are not active, the program should only check the other two, and it should always check all of the active rules rather than stopping at the first match (maybe using REJECT).
Example for 4 rules:
/* Declared at the top */
isActive[0] = false;
isActive[1] = false;
isActive[2] = true;
isActive[3] = true;
%%
[0-9]{4} { printf("1st: %s\n", yytext); REJECT;}
[0-3]{2}[0-3]{2} { printf("2nd: %s\n", yytext); REJECT; }
[1-3]{2}[0-3]{2} { printf("3rd: %s\n", yytext); REJECT; }
[1-2]{2}[1-2]{2} { printf("4th: %s\n", yytext); REJECT; }
.
%%
For the input: 1212 the result should be:
3rd: 1212
4th: 1212
Don't use REJECT unless you absolutely have no alternative. It massively slows down the lexical scan, artificially limits the size of the token buffer, and makes your lexical analysis very hard to reason about.
You might be thinking that a scanner generated by (f)lex tests each regular expression one at a time in order to select the best one for a given match. It does not work that way; that would be much too slow. What it does is, effectively, check all the regular expressions in parallel by using a precompiled deterministic state machine represented as a lookup table.
The state machine does exactly one transition for each input byte, using a simple O(1) lookup into the transition table. (There are different table compression techniques which trade off table size against the constant in the O(1) lookup, but that doesn't change the basic logic.) What that all means is that you can use as many regular expressions as you wish; the time to do the lexical analysis does not depend on the number or complexity of the regular expressions. (Except for caching effects: if your transition table is really big, you might start running into cache misses during the transition lookups. In such cases, you might prefer a compression algorithm which compresses more.)
In most cases, you can use start conditions to achieve conditional matching, although you might need a lot of start conditions if there are more than a few interacting conditions. Since the scanner can only have one active start condition, you'll need to generate a different start condition for each legal combination of the conditions you want to consider. That's usually most easily achieved through automatic generation of your scanner rules, but it can certainly be done by hand if there aren't too many.
It's hard to provide a more concrete suggestion without knowing what kinds of conditions you need to check.
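As a rough sketch of what automatic generation could look like, the Python script below emits one exclusive start condition per on/off combination of the rules and prefixes each rule with the conditions in which it is enabled. The `ACTIVE_<bits>` names, the `generate_rules` helper and the sample actions are hypothetical. It only covers enabling and disabling rules; it does not make several rules fire on the same text.

```python
from itertools import product


def generate_rules(patterns):
    """Build flex rule text with one exclusive start condition per on/off combination.

    `patterns` is a list of (regex, action) pairs. The ACTIVE_<bits> names are
    hypothetical: bit i is "1" when rule i should be enabled.
    """
    combos = ["".join(bits) for bits in product("01", repeat=len(patterns))]
    # Definitions section: declare every combination as an exclusive condition.
    lines = ["%x " + " ".join("ACTIVE_" + c for c in combos), "%%"]
    for i, (regex, action) in enumerate(patterns):
        # Rule i is listed under every combination whose i-th bit is set.
        enabled = ",".join("ACTIVE_" + c for c in combos if c[i] == "1")
        lines.append(f"<{enabled}>{regex} {{ {action} }}")
    # Catch-all that applies in every start condition.
    lines.append("<*>.|\\n { /* ignore anything else */ }")
    return "\n".join(lines)


if __name__ == "__main__":
    print(generate_rules([("[0-9]", 'puts("one digit");'),
                          ("[0-9]{2}", 'puts("two digits");')]))
```

The scanner would then pick a combination with something like `BEGIN(ACTIVE_10)` before scanning. Note that n independent flags produce 2^n start conditions, which is why this only stays manageable for a small number of flags or when many combinations are not legal.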

How to prevent Flex from ignoring previous analysis?

I recently started using Lex. As a simple way to explain the problem I encountered, suppose I'm trying to write a lexical analyser with Flex that prints all the letters and also all the bigrams in a given text. That seems very easy, but once I implemented it, I realised that it shows bigrams first and only shows letters when they are single. For example, for the following text
QQQZ ,JQR
The result is
Bigram QQ
Bigram QZ
Bigram JQ
Letter R
Done
This is my lex code
%{
#include <stdio.h>
%}
letter [A-Za-z]
Separ  [ \t\n]
%%
{letter} {
    printf(" Letter %c\n", yytext[0]);
}
{letter}{2} {
    printf(" Bigram %s\n", yytext);
}
%%
int main(void)
{
    yylex();
    printf("Done");
    return 0;
}
My question is: how can I carry out the two analyses separately, knowing that my actual problem isn't as simple as this example?
Lexical analysers divide the source text into separate tokens. If your problem looks like that, then (f)lex is an appropriate tool. If your problem does not look like that, then (f)lex is probably not the correct tool.
Doing two simultaneous analyses of text is not really a use case for (f)lex. One possibility would be to use two separate reentrant lexical analysers, arranging to feed them the same inputs. However, that will be a lot of work for a problem which could easily be solved in a few lines of C.
Since you say that your problem is different from the simple problem in your question, I did not bother to either write the simple C code or the rather more complicated code to generate and run two independent lexical analysers, since it is impossible to know whether either of those solutions is at all relevant.
If your problem really is matching two (or more) different lexemes from the same starting position, you could use one of two strategies, both quite ugly (IMHO).
For both, I'm assuming the existence of handler functions:
void handle_letter(char ch);
void handle_bigram(char* s); /* Expects NUL-terminated string */
void handle_trigram(char* s); /* Expects NUL-terminated string */
For historical reasons, lex implements the REJECT action, which causes the current match to be discarded. The idea was to let you process a match, and then reject it in order to process a shorter (or alternate) match. With flex, the use of REJECT is highly discouraged because it is extremely inefficient and also prevents the lexer from resizing the input buffer, which arbitrarily limits the length of a recognisable token. However, in this particular use case it is quite simple:
[[:alpha:]][[:alpha:]][[:alpha:]] handle_trigram(yytext); REJECT;
[[:alpha:]][[:alpha:]] handle_bigram(yytext); REJECT;
[[:alpha:]] handle_letter(*yytext);
If you want to try this solution, I recommend using flex's debug facility (flex -d ...) in order to see what is going on.
See debugging options and REJECT documentation.
The solution I would actually recommend, although the code is a bit clunkier, is to use yyless() to reprocess part of the recognised token. This is quite a bit more efficient than REJECT; yyless() just changes a single pointer, so it has no impact on speed. Without REJECT, we have to know all the lexeme handlers which will be needed, but that's not very difficult. A complication is the interface for handle_bigram, which requires a NUL-terminated string. If your handler didn't impose this requirement, the code would be simpler.
[[:alpha:]][[:alpha:]][[:alpha:]] {
    handle_trigram(yytext);
    char tmp = yytext[2];
    yytext[2] = 0;
    handle_bigram(yytext);
    yytext[2] = tmp;
    handle_letter(yytext[0]);
    yyless(1);
}
[[:alpha:]][[:alpha:]] {
    handle_bigram(yytext);
    handle_letter(yytext[0]);
    yyless(1);
}
[[:alpha:]] handle_letter(*yytext);
See yyless() documentation
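To see which calls the yyless(1) rules above would make, here is a small Python simulation; it is an illustration, not flex output. The function name `report_ngrams` and the choice to return the lexemes as a list are assumptions, and characters that are not ASCII letters are simply skipped where flex's default rule would echo them.

```python
def is_letter(ch):
    """Approximate [[:alpha:]] in the C locale: ASCII letters only."""
    return ch.isascii() and ch.isalpha()


def report_ngrams(text, max_n=3):
    """Simulate the yyless(1) strategy and return the lexemes handed to the handlers.

    At each position the longest run of letters (up to `max_n`) is matched,
    the trigram, bigram and letter are reported in that order, and the scan
    restarts one character further on, as yyless(1) would arrange.
    """
    reported = []
    pos = 0
    while pos < len(text):
        run = 0
        while run < max_n and pos + run < len(text) and is_letter(text[pos + run]):
            run += 1
        # Report the longest lexeme first, then each shorter prefix.
        for n in range(run, 0, -1):
            reported.append(text[pos:pos + n])
        pos += 1  # keep one character, rescan the rest
    return reported


if __name__ == "__main__":
    for lexeme in report_ngrams("QQQZ ,JQR"):
        print(len(lexeme), lexeme)
```

Every letter position is visited exactly once as a starting point, which is why each letter, bigram and trigram in the input is reported exactly once.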

NSRegularExpression not matching number sign (#)

I'm working on a Guitar Chord transposer, and so from a given text file, I want to identify guitar chords. e.g. G#, Ab, F#m, etc.
I'm almost there! I have already run into a few problems due to the number sign (hash tag), #.
For example, you can't include the number sign in your regex pattern. The NSRegularExpression will not initialize with this:
let fail: String = "\\b[ABCDEFG](b|#)?\\b"
let success: String = "\\b[CDEFGAB](b|\\u0023)?\\b"
I had to specifically provide the unicode character. I can live with that.
However, now that I have an NSRegularExpression object, it won't match these (sharps = number sign) when I have a line of text such as:
Am Bb G# C Dm F E
When it starts processing the G#, the sharp associated with that second capture group is not matched. (i.e. the NSTextCheckingResult's second range has a location of NSNotFound) Note, it works for Bb... it matches the 'b'
I'm wondering what I need to do here. It would seem the documentation doesn't cover this case of '#', which IS in fact sometimes used in regex patterns (I think related to comments or something).
One thing that would be great would be to not have to look up the unicode identifier for a #, but just use it as a String "#" then convert that so it plays nicely with the pattern. There exists the chance that \u0023 is in fact not the code associated with # ...
The \b word boundary is a context-dependent construct. It matches in 4 contexts: 1) between the start of the string and a word char, 2) between a word char and the end of the string, 3) between a word char and a non-word char, and 4) between a non-word char and a word char.
Your regex is written in such a way that ultimately the regex engine sees a \b after # and that means a # will only match if there is a word char after it.
If you replace \b with (?!\w), a negative lookahead that fails the match if there is a word char immediately to the right of the current location, it will work.
So, you may use
\\b[CDEFGAB](b|\\u0023)?(?!\\w)
See the regex demo.
Details
\b - a word boundary
[CDEFGAB] - a char from the set
(b|\\u0023)? - an optional sequence of b or #
(?!\\w) - a negative lookahead failing the match (and causing backtracking into the preceding pattern! To avoid that, add + after ? to prevent backtracking into that pattern) if there is a word char immediately to the right of the current position.
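The difference between the two patterns can be checked quickly outside of iOS. The sketch below uses Python's `re` module rather than NSRegularExpression, so it only illustrates the word-boundary logic and makes no claim about how the iOS engine treats the # character in a pattern; the helper name `matches` is made up for the example.

```python
import re

LINE = "Am Bb G# C Dm F E"

# Original idea: trailing \b, which cannot match between "#" and a space.
WITH_BOUNDARY = re.compile(r"\b[CDEFGAB](b|#)?\b")
# Suggested fix: only require that no word char follows.
WITH_LOOKAHEAD = re.compile(r"\b[CDEFGAB](b|#)?(?!\w)")


def matches(regex, text):
    """Return the full text of every match, in order."""
    return [m.group(0) for m in regex.finditer(text)]


if __name__ == "__main__":
    print(matches(WITH_BOUNDARY, LINE))
    print(matches(WITH_LOOKAHEAD, LINE))
```

With the trailing \b, the engine backtracks out of the optional group for G#, so only G is matched and the capture group is empty, which is the behaviour described in the question. With the lookahead, G# is matched in full. Am and Dm are rejected by both patterns because a word char follows the note letter.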
(I'd like to first say #WiktorStribiżew has been a tremendous help and what I am writing now would not have been possible without him! I'm not concerned about StackOverflow points and rep, so if you like this answer, please upvote his answer.)
This issue took many turns and had a few issues going on. Ultimately this question should be called How do I use Regex on iOS to detect Musical Chords in a text file?
The answer is (so far), not simply.
CRASH COURSE IN MUSIC THEORY
In music you have notes. They are made up of a letter between A and G and an optional symbol called an accidental. (A note relates to the acoustic frequency of the sound you hear when that note is played.) An accidental can be a flat (represented as a ♭ or simply a b), or a sharp (represented as a ♯ or simply a #, as these are easier to type on a keyboard). An accidental serves to make a note a semitone higher (#) or lower (b). As such, an F# is the same acoustic frequency as a Gb. On a piano, the white keys are notes with no accidentals, and the black keys represent notes with an accidental. Depending on some factors of the piece of music, that piece won't mix accidental types. It will either be flats throughout the piece or sharps. (Depending on the musical key of the composition, but this is not that relevant here.)
In terms of regex, you have something like [ABCDEFG](b|#)? to determine the note. In reality it's more complicated.
Then, a Musical Chord is comprised of the root note and its chord type. There are over 50 types of chords. They have a 'text signature' that is unique. Also, a 'major' chord has an empty signature. So in terms of pseudo-regex you have for a Chord:
[ABCDEFG](b|#)?(...|...|...)?
where the first part you recognize as the note (as before), and the last optional part determines the chord type. The different types were omitted, but can be as simple as an m (for a minor chord), or maj7#5 (for a major 7th chord with an augmented 5th; don't worry about it, just know there are many string constants to represent a chord type).
Then finally, with guitar you often have a corresponding bass note that changes the chord's tonality somewhat. You denote this by adding a slash and then the note, giving the general pseudoform:
[ABCDEFG](b|#)?(...|...|...)?(/[ABCDEFG](b|#)?)? // NOT real Regex
real examples: C/F or C#m/G# and so on
where the last part has a slash then the same pattern to recognize a note.
So putting these all together, in general we want to find chords that could take on many forms, such as:
F Gm C#maj7/G# F/C Am A7 A7/F# Bmaj13#11
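As a rough illustration of the pseudo-pattern above, here is a Python sketch that parses one whitespace-separated token at a time. It is not the Swift code referred to in this answer and says nothing about NSRegularExpression. The `CHORD_TYPES` list is an assumption: it contains only the handful of type signatures that appear in the examples here, not the 50-plus real ones, and the name `parse_chord` is invented for the example.

```python
import re

# Illustrative subset only (an assumption); longer signatures are listed first.
CHORD_TYPES = ["maj13#11", "maj7#5", "maj7", "m", "7"]

NOTE = r"[A-G](?:b|#)?"
CHORD = re.compile(
    r"(?P<root>" + NOTE + r")"
    r"(?P<kind>" + "|".join(re.escape(t) for t in CHORD_TYPES) + r")?"
    r"(?:/(?P<bass>" + NOTE + r"))?"
)


def parse_chord(token):
    """Return (root, chord type, bass note) for a chord token, or None.

    Missing parts come back as empty strings; a major chord has type "".
    """
    m = CHORD.fullmatch(token)
    if m is None:
        return None
    return (m.group("root"), m.group("kind") or "", m.group("bass") or "")


if __name__ == "__main__":
    for token in "F Gm C#maj7/G# F/C Am A7 A7/F# Bmaj13#11".split():
        print(token, parse_chord(token))
```

Matching whole tokens with `fullmatch` sidesteps the word-boundary problems around #, at the cost of assuming chords are separated by whitespace.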
I was hoping to find one Regex to rule them all. I ended up writing code that works, though it seems like I kind of hacked around a bit to get the results I desired.
You can see this code here, written in Swift. It is not complete for my purposes, but it will parse a string, return a list of Chord Results and their text range within the original string. From there you would have to finish the implementation to suit your needs.
There have been a few issues on iOS:
iOS does not handle the number sign (#) well at all. When providing regex patterns or match text, I either had to replace the # with its unicode \u0023, or, what ultimately worked, replace all occurrences of # with another character (such as 'S') and then convert it back once regex did its thing. So this code I wrote often has to 'sanitize' the pattern or the input text before doing anything.
I couldn't get a Regex Pattern to perfectly parse a chord structure. It would match a Chord with a bass note but not parse it fully, so I had to split those 2 components, parse them separately, and then recombine them.
Regex is really a bit of voodoo, and I think it sucks that for something so confusing to many people, there are also different platform-dependent implementations of it. For example, Wiktor referred me to Regex patterns he wrote to help me solve the problem on www.regex101.com that would WORK on that website, but these would not work on iOS, and NSRegularExpression would throw an error (often it had something to do with this # character).
My solution pays absolutely no regard to performance. I just wanted it to work.

How to get the last matched text in Flex parser

I want to match something like:
var i=1;
So I want to know whether var starts at a word boundary.
When it matches this line, I want to know the last character of the previous yytext,
just to be sure that the char before var is really a non-variable character (aka "\b" in regex).
One crude way is to maintain old_yytext in each rule and also have a default rule ".".
How to get it?
The only way is to save a copy of the previous token, or at least the last character. Flex's buffer management strategy does not guarantee that the previous token still exists in memory. It is possible that the current token starts at the beginning of flex's buffer.
But doing the work of saving the previous token in every rule would be really silly. You should trust flex to work as advertised, and write appropriate rules. For example, if your identifier pattern looks like this:
[[:alpha:]][[:alnum:]]*
then it is impossible for var to immediately follow an identifier because it would have been included in the identifier.
There is one common case in a "normal" flex scanner definition where a keyword or identifier might immediately follow an alphanumeric character, which is when the keyword immediately follows a number (123var). This is not usually a problem, because in almost all languages, it will trigger a syntax error (and if it isn't a syntax error, maybe it is ok :-) )
If you really want to trigger a lexical error, you can add a pattern which recognizes a number followed by a letter.
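The reasoning above can be tried out with a small longest-match tokenizer in Python. It is only a sketch of flex's selection rule (longest match wins, ties go to the rule listed first), not a flex scanner. The rule names, the exact shape of the `BAD_NUMBER` pattern and the decision to skip unmatched characters are assumptions.

```python
import re

# Rules in priority order, mirroring the order of rules in a flex file.
RULES = [
    ("BAD_NUMBER", re.compile(r"[0-9]+[A-Za-z][A-Za-z0-9]*")),  # e.g. 123var
    ("KEYWORD_VAR", re.compile(r"var")),
    ("IDENTIFIER", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("NUMBER", re.compile(r"[0-9]+")),
]


def tokenize(text):
    """Return (rule name, lexeme) pairs using longest match, first rule on ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, regex in RULES:
            m = regex.match(text, pos)
            # Strictly longer wins, so an earlier rule keeps a tie.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            pos += 1  # skip characters no rule matches (spaces, punctuation)
            continue
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


if __name__ == "__main__":
    for sample in ("var i=1;", "myvar", "variable", "123var"):
        print(sample, tokenize(sample))
```

Because the identifier rule always consumes the whole run of letters and digits, `var` is only reported as a keyword when it stands alone, and `123var` is caught by the extra pattern instead of being split into a number and a keyword. No record of the previous token is needed.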

Checking input grammar and deciding a result

Say I have a string "abacabacabadcdcdcd" and I want to apply a simple set of rules:
abaca->a
dcd->d
From left to right, such that the string ends up being "abad". This output will be used to make a decision. After the rules are applied, if the output string does not match preset strings such as "abad", the original string would be discarded. For example, every string should distill down to "abad"; kick it if it doesn't.
I have this hard-coded right now as regex, but there are many instances of these small rule sets. I am looking for something that will take a set of simple rules and compile (or just a function?) into something I can feed the string to and retrieve a result. The rule sets are independent of each other.
The input is tightly controlled, and the rules in use will be simple. Speed is the most important aspect.
I've looked at Bison and ANTLR, but I don't think I need anything nearly that powerful...
What am I looking for?
Edit: Should mention that the strings are made up of a couple letters. Usually 5, i.e. "abcde". There are no spaces, etc. Just letters.
If it needs to be fast, you can start out with a map that contains your rules as key-value pairs of strings. You can then compile this map to a sort of state machine: a tree with char keys, where the associated value is either a replacement string or another tree.
You then go char by char through your string. Look up the current char in the tree. If you find another tree, look up the next character in that tree, etc.
At some point, either:
the lookup will fail, and then you know that the string you've seen so far is not the prefix of any rule. You can skip the current character and continue with the next.
or you get a replacement string. In that case, you can replace the characters between the current char and the last one you looked up inclusive by the replacement string.
The only difficulty is if the replacement can itself be part of a pattern to replace. Example:
ab -> e
cd -> b
The input:
acd -> ab (by rule 2)
ab -> e (by rule 1) ????
Now the question is if you want to reconsider ab to give e?
If this is so, you must start over from the beginning after each replacement. In addition, it will be hard to tell whether the replacement ever ends, except if all the rules you have are such that the right hand side is shorter than the left hand side. For, in that case, a finite string will get reduced in a finite amount of time.
But if we don't need to reconsider, the algorithm above will go straight through the string.
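Both variants can be sketched in a few lines of Python. This is one possible reading of the description above, not a tuned implementation: the function names are invented, the tree is a plain nested dict, and it assumes that rules have non-empty left-hand sides and that no left-hand side is a prefix of another (so each tree value is either a replacement string or a subtree).

```python
def build_trie(rules):
    """Compile {pattern: replacement} into nested dicts keyed by character."""
    root = {}
    for lhs, rhs in rules.items():
        node = root
        for ch in lhs[:-1]:
            node = node.setdefault(ch, {})
        node[lhs[-1]] = rhs  # a leaf holds the replacement string
    return root


def match_at(text, start, trie):
    """Return (replacement, end) if a rule matches at `start`, else None."""
    node, pos = trie, start
    while pos < len(text) and isinstance(node, dict) and text[pos] in node:
        node = node[text[pos]]
        pos += 1
    return (node, pos) if isinstance(node, str) else None


def rewrite_once(text, trie):
    """Single left-to-right pass; replacements are not reconsidered."""
    out, pos = [], 0
    while pos < len(text):
        hit = match_at(text, pos, trie)
        if hit is None:
            out.append(text[pos])  # not the start of any rule: copy and move on
            pos += 1
        else:
            out.append(hit[0])
            pos = hit[1]
    return "".join(out)


def rewrite_restarting(text, trie):
    """Start over after every replacement; only safe if rules shrink the text."""
    while True:
        for pos in range(len(text)):
            hit = match_at(text, pos, trie)
            if hit is not None:
                text = text[:pos] + hit[0] + text[hit[1]:]
                break
        else:
            return text  # no rule applies anywhere


if __name__ == "__main__":
    trie = build_trie({"abaca": "a", "dcd": "d"})
    print(rewrite_once("abacabacabadcdcdcd", trie))
    print(rewrite_restarting("abacabacabadcdcdcd", trie))
```

For the string in the question, the single pass stops at "abacabadcd", while restarting after each replacement reaches "abad". So the example in the question does appear to need replacements to be reconsidered, and the termination caveat about right-hand sides being shorter than left-hand sides applies.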
