I have a regular expression which looks something like this:
(\bee[0-9]{9}in\b)|(\bee[0-9]{9}[a-zA-Z]{2}\b)
Now if the input string is ee123456789ab, the second part of | matches the string. But if the input string is ee123456789in, the first part of | consumes the whole string and the second part doesn't get a chance to match. I want both parts of | to have their chance to match, so that I know whether both parts were able to match the string. Is it even possible to do that with a regular expression?
You can use lookahead assertions:
^(?=(ee[0-9]{9}in$)?)(?=(ee[0-9]{9}[a-zA-Z]{2}$)?)
This captures a match in \1 and \2 respectively; if either group is empty (unset, which shows up as None in Python), the corresponding part of the regex did not match.
I've changed the word boundary anchors to start/end of string anchors since you're testing against the entire string, not just substrings.
In Python:
>>> import re
>>> r = re.compile(r"^(?=(ee[0-9]{9}in$)?)(?=(ee[0-9]{9}[a-zA-Z]{2}$)?)")
>>> m = r.match("ee123456789ab")
>>> m.group(1)
>>> m.group(2)
'ee123456789ab'
>>> m = r.match("ee123456789in")
>>> m.group(1)
'ee123456789in'
>>> m.group(2)
'ee123456789in'
Explanation:
^ # Start of string
(?= # Look ahead to see if it's possible to match...
( # and capture...
ee[0-9]{9}in # regex 1
$ # (end of string)
)? # (make the match optional)
) # End of lookahead
(?= # Second lookahead, same idea...
(
ee[0-9]{9}[a-zA-Z]{2}
$
)?
)
It's not possible with regular expressions. If any part of it matches, it's considered a match. You would have to do it with two different expressions and see if both succeeded.
An OR is an OR no matter what; you can't get around that.
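A minimal Python sketch of the two-expression approach, using the two alternatives from the question. The function name `which_match` is made up for illustration, and `fullmatch` is used on the assumption that the whole input is being tested:

```python
import re

# The two alternatives from the question, compiled separately.
FIRST = re.compile(r"ee[0-9]{9}in")
SECOND = re.compile(r"ee[0-9]{9}[a-zA-Z]{2}")

def which_match(text):
    """Return a pair of booleans: (first matched, second matched)."""
    return (FIRST.fullmatch(text) is not None,
            SECOND.fullmatch(text) is not None)

print(which_match("ee123456789ab"))  # (False, True)
print(which_match("ee123456789in"))  # (True, True)
```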
As @Tim mentioned, it can be done with lookahead(s):
You can stand still and look at the same text more than once.
So, one way is to test each expression without moving,
making each expression optional -
(?= ( ee [0-9]{9} in )? )
(?= ( ee [0-9]{9} [a-zA-Z]{3} )? )
This is bad because, although the position will advance after the last
expression, it will only advance 1 inter-character position. It also
allows overlap when searching in a global context.
Searches can be sped up by consuming a character -
(?= ( ee [0-9]{9} in )? )
(?= ( ee [0-9]{9} [a-zA-Z]{3} )? )
.
The engine does an optimization when something is consumed,
advances in chunks (unknown how it decides).
If you have other expressions included with these, the position must be
advanced past here or nothing will match. This could also eliminate
overlapped matching of text (if that's a goal).
It's actually hard to avoid overlap unless you know for sure one expression will
be longer than the other. If that's the case, you could always use a conditional
(if available) to consume the larger text -
(?= ( ee [0-9]{9} in )? )
(?= ( ee [0-9]{9} [a-zA-Z]{3} )? )
(?(2) \2 | \1 )
And, if you know one is a subset of the other, you could just do this -
(?= ( ee [0-9]{9} in )? ) ( ee [0-9]{9} [a-zA-Z]{3} )
Either way, depending on the expressions, much thought has to go into designing
consumption into the regex to avoid overlap.
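For what it's worth, Python's re module supports both lookahead captures and the (?(2)...|...) conditional, so the conditional variant above can be tried directly. This is only a sketch: the sample text is invented, and re.VERBOSE is used so that the spaced-out layout can be kept as written.

```python
import re

# The conditional form from above; re.VERBOSE ignores the layout whitespace.
pattern = re.compile(r"""
    (?= ( ee [0-9]{9} in )? )
    (?= ( ee [0-9]{9} [a-zA-Z]{3} )? )
    (?(2) \2 | \1 )
""", re.VERBOSE)

# Invented sample input: the first item fits both expressions,
# the second fits only the shorter one.
text = "ee123456789inx ee123456789in"

consumed = [m.group() for m in pattern.finditer(text)]
print(consumed)
```

Because each match consumes the longer of the two captures, the search resumes after it and the matches do not overlap.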
I am writing a simple scanner in flex. I want my scanner to print out "integer type seen" when it sees the keyword "int". Is there any difference between the following two ways?
1st way:
%%
int printf("integer type seen");
%%
2nd way:
%%
"int" printf("integer type seen");
%%
So, is there a difference between writing if or "if"? Also, for example when we see a == operator, we print something. Is there a difference between writing == or "==" in the flex file?
There's no difference in these specific cases -- the quotes (") just tell lex NOT to interpret any special characters (e.g., regular expression operators) in the quoted string, but if there are no special characters involved, they don't matter:
[a-z] printf("matched a single letter\n");
"[a-z]" printf("matched the 5-character string '[a-z]'\n");
0* printf("matched zero or more zero characters\n");
"0*" printf("matched a zero followed by an asterisk\n");
Characters that are special and mean something different outside of quotes include . * + ? | ^ $ < > [ ] ( ) { } /. Some of those only have special meaning if they appear at certain places, but it's generally clearer to quote them regardless of where they appear if you want to match the literal characters.
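As a rough analogy only (this is Python's re module, not lex), the same distinction between an interpreted pattern and a literal string can be seen by comparing a raw pattern with its re.escape'd form, which plays a role similar to quoting here:

```python
import re

# Unquoted: * is an operator, so the pattern means "zero or more zeros".
unquoted = re.compile("0*")
# "Quoted": re.escape makes every character literal, like "0*" in lex.
quoted = re.compile(re.escape("0*"))

print(bool(unquoted.fullmatch("000")))  # True
print(bool(quoted.fullmatch("000")))    # False
print(bool(quoted.fullmatch("0*")))     # True

# With no special characters involved, escaping changes nothing.
print(re.escape("int") == "int")        # True
```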
I want to capture some strings, but my pattern is not working. I noticed that [] only matches one individual character; is it possible to match sequences of several characters?
I want to capture these combinations, but the result is wrong
A ||
Z <<
O ~~~
O..
Current Code:
C = [[
A
B|
C<
Z<<
O~~~
O.
O..
]]
C = C:gsub("(\n%a[(||)(<<)(~~~)(%.%.%.)])",function(a)
print(a)
end)
Output:
B|
C<
Z<
O~
O.
O.
Your pattern should be something like (\n%a[|<~%.]+).
Placing a ( inside a Lua pattern set just adds ( to the list of characters that can be matched; it does not create a "sub-set" or force a required match length.
A set matches a single character, even if a character is repeated inside it. To match multiple characters you need a quantifier such as + or *, or multiple instances of the set, like this: (\n%a[|<~%.][|<~%.][|<~%.]).
The catch is that multiple instances of the set must all match, while + lets the matched length vary, so it could match one . rather than three.
You cannot enforce specific sequences of different lengths in one pattern. By this, I mean you cannot match specifically O<< and O~~~ in the same pattern while not matching O<<<, O~~ or O<<~.
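To illustrate the behaviour of "one set plus a quantifier", here is a sketch in Python's re module rather than Lua. The syntax differs (. is escaped differently, and [A-Za-z] stands in for %a), but a bracket set also matches a single character there. The input string is an assumption that mirrors the lines from the question.

```python
import re

# Assumed input, mirroring the lines in the question.
text = "A\nB|\nC<\nZ<<\nO~~~\nO.\nO..\n"

# A newline, one letter, then one or more characters from the set | < ~ .
matches = re.findall(r"\n[A-Za-z][|<~.]+", text)
for m in matches:
    print(repr(m))
```

Note that the single-character cases (B|, C<, O.) are matched as well, which is the variability in length described above.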
Resources to learn more about Lua patterns:
FHUG - Understanding Lua Patterns
I need to check whether matching parentheses are present in a string that might contain emoticons (like :) or :(). For example, "(:)())()", "(abcd)()ghijk)((mnop)qert)"
I have used the patterns "^[:\\(|:\\)]" to check for emoticons and "\\([^()]*\\)" to check for matching parenthesis present, but they are not detected. How can I do this?
The really simple solution to this problem is to count the parentheses. Solving it with regular expressions is hard, though extended regular expressions can handle it. Here is a sketch of the simple algorithm:
Set openParenthesisCount to 0
Iterate over the string:
If current character is ( increment openParenthesisCount
If current character is ) decrement openParenthesisCount, if count goes negative then fail (too many closing)
If current character is : lookahead and skip next character if it is a parenthesis (skip smilies)
If openParenthesisCount is zero => succeed
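The steps above could be written in Python roughly as follows. The function name is made up, and the smiley check is done before the parenthesis checks so that the parenthesis after a : is never counted:

```python
def balanced_ignoring_smileys(text: str) -> bool:
    """Count parentheses, skipping the ":)" and ":(" emoticons."""
    open_count = 0
    i = 0
    while i < len(text):
        ch = text[i]
        # A colon followed by a parenthesis is a smiley: skip both characters.
        if ch == ":" and i + 1 < len(text) and text[i + 1] in "()":
            i += 2
            continue
        if ch == "(":
            open_count += 1
        elif ch == ")":
            open_count -= 1
            if open_count < 0:
                return False  # too many closing parentheses
        i += 1
    return open_count == 0

print(balanced_ignoring_smileys("(:)())()"))                    # True
print(balanced_ignoring_smileys("(abcd)()ghijk)((mnop)qert)"))  # False
```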
HTH
As far as I can tell, you want to match a string if and only if it contains matching parentheses, after ignoring every occurrence of ":)" and ":(" in the string, if any.
So, try this:
^((?!:).)*\(.*(?<!:)\).*
It will match the following strings:
()
(abd)
(())(
(:))
(:(:))
(:)())()
(abcd)()ghijk)((mnop)qert)
(abc):
(:abc)
But will NOT match the following:
)(
(:)
(:(
:(:)
:()
(:)
:()(:)
(
)
(abc
abc)
(abc:)
:(abc)
I want to run two different lua string find on the same string " (55)"
Pattern 1 "[^%w_](%d+)", should match any number
Pattern 2 "[%(|%)|%%|%+|%=|%-|%{%|%}|%,|%:|%*|%^]", should match any of these ( ) % + = - { } , : * ^ characters.
Both of these patterns return 2, why? Also, if I run a string match, they return ( and 55 respectively (as expected).
It seems you are using the patterns with string.find, which finds the first occurrence of the pattern in the string passed. If an instance of the pattern is found, a pair of values representing the start and end of the match is returned. If the pattern cannot be found, nil is returned.
Both patterns find a match at Position 2: [^%w_](%d+) finds ( because it is matched with [^%w_] (a char other than letter, digit or _), and [%(|%)|%%|%+|%=|%-|%{%|%}|%,|%:|%*|%^] matches the ( because it is part of the character set.
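The same effect can be reproduced with Python's re module, used here purely as an analogy: the patterns are approximate translations of the Lua ones, and Python positions are 0-based, hence the + 1. The start of the whole match and the captured text are two different things:

```python
import re

text = " (55)"

# Approximate translations of the two Lua patterns (an assumption).
m1 = re.search(r"[^\w](\d+)", text)        # non-word character, then digits
m2 = re.search(r"[()%+={},:*^-]", text)    # one character from the set

# The whole match starts at the "(" in both cases.
print(m1.start() + 1, m2.start() + 1)  # 2 2
# What is captured or matched differs.
print(m1.group(1), m2.group(0))        # 55 (
```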
However, the first pattern can be re-written using a frontier pattern, %f[%w_]%d+, that will match 1+ digits if not preceded with letters, digits or underscore, and the second pattern does not require such heavy escaping, [()%%+={},:*^-] is enough (only % needs escaping here, as the - is placed at the end of the character set and is thus treated as a literal hyphen).
See this Lua demo:
a = " (55)"
for word in string.gmatch(a, "%f[%w_]%d+") do print(word) end
-- 55
for word in string.gmatch(a, "[()%%+={},:*^-]+") do print(word) end
-- (, )
I am writing a parser for a script language.
I need to recognize strings, integers and floats.
I successfully recognize strings with the rule:
[a-zA-Z0-9_]+ {return STRING;}
But I have a problem recognizing integers and floats. These are the (wrong) rules I wrote:
["+"|"-"][1-9]{DIGIT}* { return INTEGER;}
["+"|"-"]["0." | [1-9]{DIGIT}*"."]{DIGIT}+ {return FLOAT;}
How can I fix them?
Furthermore, since a "abc123" is a valid string, how can I make sure that it is recognized as a string and not as the concatenation of a string ("abc") and an Integer ("123") ?
First problem: There's a difference between (...) and [...]. Your regular expressions don't do what you think they do because you're using the wrong punctuation.
Beyond that:
No numeric rule recognizes 0.
Both numeric rules require an explicit sign.
Your STRING rule recognizes integers.
So, to start:
[...] encloses a set of individual characters or character ranges. It matches a single character which is a member of the set.
(...) encloses a regular expression. The parentheses are used for grouping, as in mathematics.
"..." encloses a sequence of individual characters, and matches exactly those characters.
With that in mind, let's look at
["+"|"-"][1-9]{DIGIT}*
The first bracket expression ["+"|"-"] is a set of individual characters or ranges. In this case, the set contains: ", +, " (again, which has no effect because a set contains zero or one instances of each member), |, and "-", which is a range whose endpoints are both the character " and which consequently only includes that character, already in the set. In short, it is equivalent to ["+|]. It will match one of those three characters. It requires one of those three characters, in fact.
The second bracket expression [1-9] matches one character in the range 1-9, so it probably does what you expected. Again, it matches exactly one character.
Finally, {DIGIT} matches the expansion of the name DIGIT. I'll assume that you have the definition:
DIGIT [0-9]
somewhere in your definitions section. (In passing, I note that you could have just used the character class [:digit:], which would have been unambiguous, and you would not have needed to define it.) It's followed by a *, which means that it will match zero or more repetitions of the {DIGIT} definition.
Now, an example of a string which matches that pattern:
|42
And some examples of strings which don't match that pattern:
-7 # The pattern must start with |, + or "
42 # Again, the pattern must start with |, + or "
+0 # The character following the + must be in the range [1-9]
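These examples can be checked outside of flex. Python's re module happens to parse this particular bracket expression the same way (a set containing ", + and |), so the following sketch reproduces them, with [0-9] written out in place of {DIGIT} as an assumption:

```python
import re

# The INTEGER rule from the question, with {DIGIT} expanded to [0-9].
wrong_integer = re.compile(r'["+"|"-"][1-9][0-9]*')

for candidate in ["|42", "+42", '"7', "-7", "42", "+0"]:
    matched = wrong_integer.fullmatch(candidate) is not None
    print(repr(candidate), matched)
```

Only the first three candidates match, since each starts with one of the three characters in the set.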
Similarly, your float pattern, once the [...] expressions are simplified, becomes (writing out the individual pieces one per line, to make it more obvious):
["+|] # i.e. the set " + |
["0.|[1-9] # i.e. the set " 0 . | [ 1 2 3 4 5 6 7 8 9
{DIGIT}* # Any number of digits
"." # A single period
] # A single ]
{DIGIT}+ # one or more digits
So here's a possible match:
"..]3
I'll skip over writing out the solution because I think you'll benefit more from doing it yourself.
Now, the other issues:
Some rule should match 0. If you don't want to allow leading zeros, you'll need to just add it as a separate rule.
Use the optional operator (?) to indicate that the preceding object is optional. eg. "foo"? matches either the three characters f, o, o (in order) or matches the empty string. You can use that to make the sign optional.
The problem is not the matching of abc123, as in your question. (F)lex always gives you the longest possible match, and the only rule which could match the starting character a is the string rule, so it will allow the string rule to continue as long as it can. It will always match all of abc123. However, it will also match 123, which you would probably prefer to be matched by your numeric rule. Here, the other (f)lex matching criterion comes into play: when there are two or more rules which could match exactly the same string, and none of the rules can match a longer string, (f)lex chooses the first rule in the file. So if you want to give numbers priority over strings, you have to put the number rule earlier in your (f)lex file than the string rule.
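The two matching criteria (longest match first, then earliest rule on a tie) can be imitated with a small Python sketch. This is not how (f)lex is implemented, and the two rules are deliberately simplified placeholders rather than a solution to the numeric patterns:

```python
import re

# Rules in priority order, like the order of rules in a (f)lex file.
# Both patterns are simplified placeholders for illustration only.
RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("STRING", re.compile(r"[a-zA-Z0-9_]+")),
]

def next_token(text, pos=0):
    """Return (rule name, lexeme) for the longest match; earlier rule wins ties."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so the earlier rule is kept when lengths are equal.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("abc123"))  # ('STRING', 'abc123')
print(next_token("123"))     # ('NUMBER', '123')
```

Swapping the order of the two entries in RULES would make 123 come out as a STRING, which is the situation described above.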
I hope that gives you some ideas about how to fix things.