Ignoring emoticons when checking for balanced parentheses in an NSString - ios

I need to check whether a string that might contain emoticons (like :) or :() has matching parentheses. For example, "(:)())()", "(abcd)()ghijk)((mnop)qert)"
I have used the patterns "^[:\\(|:\\)]" to check for emoticons and "\\([^()]*\\)" to check for matching parenthesis present, but they are not detected. How can I do this?

The really simple solution to this problem is to count the parentheses. Trying to solve it with regular expressions is hard, though extended regular expressions can handle it. Here is a sketch of the simple algorithm:
Set openParenthesisCount to 0
Iterate over the string:
If current character is ( increment openParenthesisCount
If current character is ) decrement openParenthesisCount, if count goes negative then fail (too many closing)
If current character is : lookahead and skip next character if it is a parenthesis (skip smilies)
If openParenthesisCount is zero => succeed
HTH
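The counting steps above can be sketched as follows. This is a minimal illustration in Python rather than Objective-C, and the function name is_balanced is made up for the example; the same loop translates directly to iterating over an NSString's characters.

```python
def is_balanced(text):
    """Return True if parentheses balance once ':)' and ':(' are ignored."""
    open_count = 0
    i = 0
    while i < len(text):
        ch = text[i]
        # A colon followed by a parenthesis is a smiley: skip both characters.
        if ch == ":" and i + 1 < len(text) and text[i + 1] in "()":
            i += 2
            continue
        if ch == "(":
            open_count += 1
        elif ch == ")":
            open_count -= 1
            if open_count < 0:
                return False  # too many closing parentheses
        i += 1
    return open_count == 0


print(is_balanced("(:)())()"))
print(is_balanced("(abcd)()ghijk)((mnop)qert)"))
```

Checking for the smiley before counting means the parenthesis inside an emoticon never reaches the counter.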

As far as I can tell, you want to match a string if and only if it contains matching parentheses, after ignoring every occurrence of ":)" and ":(" in the string, if any.
So, try this:
^((?!:).)*\(.*(?<!:)\).*
It will match the following strings:
()
(abd)
(())(
(:))
(:(:))
(:)())()
(abcd)()ghijk)((mnop)qert)
(abc):
(:abc)
But will NOT match the following:
)(
(:)
(:(
:(:)
:()
(:)
:()(:)
(
)
(abc
abc)
(abc:)
:(abc)
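To check the lists above, the pattern can be run against a few of the samples. This sketch uses Python's re module as a stand-in for NSRegularExpression (the lookahead and lookbehind syntax used here is the same in both); the helper name matches is made up for the example. Note that, as the first list shows, the pattern accepts strings such as (())(, so it tests a weaker condition than the counting approach.

```python
import re

# The pattern from the answer, written as a raw string.
PATTERN = re.compile(r"^((?!:).)*\(.*(?<!:)\).*")


def matches(text):
    """Return True if the pattern matches starting at the beginning of text."""
    return PATTERN.match(text) is not None


for sample in ["()", "(:))", "(())(", "(:)", ":(abc)", "(abc:)", ")("]:
    print(repr(sample), matches(sample))
```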

Related

Match the input with string using lex

I'm trying to match any prefix of the string Something. For example, the inputs So, SOM, SomeTH, some and S should all be accepted because they are all prefixes of Something.
My code
Ss[oO]|Ss[omOMOmoM] {
printf("Accept Something: %s\n", yytext);
}
Input
Som
Output
Accept Something: So
Invalid Character
It's supposed to read Som because it is a prefix of Something. I don't get why my code doesn't work. Can anyone correct me on what I am doing wrong?
I don't know what you think the meaning of
Ss[oO]|Ss[omOMOmoM]
is, but what it matches is either:
an S followed by an s followed by exactly one of the letters o or O, or
an S followed by an s followed by exactly one of the letters o, O, m or M. Putting a symbol more than once inside a bracket expression has no effect.
Also, I don't see how that could produce the output you report. Perhaps there was a copy-and-paste error, or perhaps you have other pattern rules.
If you want to match prefixes, use nested optional matches:
s(o(m(e(t(h(i(ng?)?)?)?)?)?)?)?
If you want case-insensitive matches, you could write out all the character classes, but that gets tiresome; simpler is to use a case-insensitive flag:
(?i:s(o(m(e(t(h(i(ng?)?)?)?)?)?)?)?)
(?i: turns on the insensitive flag, until the matching close parenthesis.
In practice, this is probably not what you want. Normally, you will want to recognise a complete word as a token. You could then check to see if the word is a prefix in the rule action:
[[:alpha:]]+ { if (yyleng <= strlen("something") && 0 == strncasecmp(yytext, "something", yyleng)) {
/* do something */
}
}
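The check performed in that rule action can be sketched outside of flex as well. The following Python version is only an illustration of the same comparison (length test plus case-insensitive comparison of the leading characters); the function name is_prefix_of is made up for the example.

```python
def is_prefix_of(word, target="something"):
    """Case-insensitive test that word is a non-empty prefix of target."""
    if not word or len(word) > len(target):
        return False
    # Compare word against the first len(word) characters of target.
    return target[:len(word)].lower() == word.lower()


for token in ["So", "SOM", "SomeTH", "some", "S", "Somx"]:
    print(token, is_prefix_of(token))
```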
There is lots of information in the Flex manual.
Right now your code (as shown) should only match "Sso" or "SsO" or "Ssm" or "SsM".
You have two alternatives that each start with Ss (without square brackets) so those will be matched literally. That's followed by either [oO] or [omOMOmoM], but the characters in square brackets represent alternatives, so that's equivalent to [oOmM], i.e., any one character of o, O, m or M.
I'd start with: %option caseless to make it a case-insensitive scanner, so you don't have to list the upper- and lower-case equivalents of every letter.
Then it's probably easiest to just list the alternatives literally:
s|so|som|some|somet|someth|somethi|somethin|something { printf("found prefix"); }
I guess you can make the pattern a bit shorter (at least in the source code) by doing something on this order:
s(o(m(e(t(h(i(n(g)?)?)?)?)?)?)?)? { printf("found prefix"); }
Doesn't seem like a huge improvement to me, but some might find it more attractive than I do.
If you don't want to use %option caseless the basic idea helps more:
[sS]([oO]([mM]([eE]([tT]([hH]([iI]([nN]([gG])?)?)?)?)?)?)?)? { printf("found prefix"); }
Listing every possible combination of upper and lower case would get tedious.
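Writing the nested optional groups by hand is error-prone, so one option is to generate the pattern. The sketch below does that in Python and tests it with the re module; it is an illustration only, the function name prefix_pattern is made up, and it emits Python's non-capturing (?:...) groups where a flex pattern would use plain parentheses.

```python
import re


def prefix_pattern(word):
    """Build a regex that matches any non-empty prefix of word."""
    pattern = ""
    # Wrap from the last character inwards: g -> (?:g)? -> (?:n(?:g)?)? ...
    for ch in reversed(word[1:]):
        pattern = "(?:" + re.escape(ch) + pattern + ")?"
    return re.escape(word[0]) + pattern


something = re.compile(prefix_pattern("something"), re.IGNORECASE)

for token in ["S", "Som", "SomeTH", "something", "somex"]:
    print(token, something.fullmatch(token) is not None)
```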

What do these two regexes match?

I can't figure out what these regexes match:
A: "\\/\\/c\\/(\\d*)"
B: "\\/\\/(\\d*)"
I suppose they are matching some kind of number sequence since \d matches any digit but I'd like to know an example of a string that would be a match for this regex.
The pattern syntax is that specified by ICU. Expressions are created with NSRegularExpression in an iOS app and are correct.
The first matches //c/ + 0 or more digits. The second matches // + 0 or more digits. In both the digits are captured.
An example of a match for A) is //c/123
An example of a match for B) is //12345
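These two examples can be reproduced with a short script. The sketch below uses Python's re module instead of ICU, on the assumption that an escaped slash, \d* and a capture group behave the same way in both engines; the helper name captured_digits is made up for the example. The patterns are written as ordinary string literals so that the doubled backslashes are collapsed by the string parser, exactly as in the question.

```python
import re

# Same literals as in the question; the string parser turns \\ into \.
PATTERN_A = "\\/\\/c\\/(\\d*)"
PATTERN_B = "\\/\\/(\\d*)"


def captured_digits(pattern, text):
    """Return the digits captured by group 1, or None if there is no match."""
    match = re.search(pattern, text)
    return None if match is None else match.group(1)


print(PATTERN_A)  # the regex the engine actually sees
print(captured_digits(PATTERN_A, "//c/123"))
print(captured_digits(PATTERN_B, "//12345"))
```

Because \d* allows zero digits, pattern B also matches the leading // of //c/123 and captures an empty string.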
When I use Cygwin which emulates Bash on Windows, I sometimes run into situations where I have to escape my escape characters which is what I think is making this expression look so weird. For instance, when I use sed to look for a single '\' I sometimes have to write it as '\\\\'. (Funny, StackOverflow proved my point. If you write 4 backslashes in the comment, it only shows two. So if you process it again, they might all disappear depending on your situation).
Considering this, it might be helpful to think of pairs of backslashes as representing only one if you're coming from a similar situation. My guess would be you are. Because of this I would say Erik Duymelinck is probably spot on. This will capture a sequence of digits that may or may not follow a couple slashes and a c:
//c/000
//00000
This regex matches an odd sequence of characters, which, at first glance, almost seem like a regex, since \d is a digit, and followed by an asterisk (\d*) would mean zero-or-more digits. But it's not a digit, because the escape-slash is escaped.
\\/\\/c\\/(\\d*)
So, for instance, this one matches the following text:
\/\/c\/\
\/\/c\/\d
\/\/c\/\dd
\/\/c\/\ddd
\/\/c\/\dddd
\/\/c\/\ddddd
\/\/c\/\dddddd
...
This one is almost the same
\\/\\/(\\d*)
except you just delete the c\/ from the above results:
\/\/\
\/\/\d
\/\/\dd
\/\/\ddd
\/\/\dddd
\/\/\ddddd
\/\/\dddddd
...
In both cases, the final \ and the optional d characters form capture group one.
My first impression was that these regexes were intended for escaping in Java strings, meaning they would be completely invalid. If they were escaped for Java strings, such as
Pattern p = Pattern.compile("\\/\\/c\\/(\\d*)");
It would be invalid, because after un-escaping, it would result in this invalid regex:
\/\/c\/(\d*)
The single escape-slashes (\) are invalid. But the \d is valid, as it would mean any digit.
But again, I don't think they're invalid, and they're not escaped for a Java string. They're just odd.

ActiveSupport::Inflector::camelize - help in understanding regex

Short version:
I am having a rather hard time understanding two rather complex regular expressions in the ActiveSupport::Inflector::camelize method.
This is the definition of the camelize method:
def camelize(term, uppercase_first_letter = true)
string = term.to_s
if uppercase_first_letter
string = string.sub(/^[a-z\d]*/) { inflections.acronyms[$&] || $&.capitalize }
else
string = string.sub(/^(?:#{inflections.acronym_regex}(?=\b|[A-Z_])|\w)/) { $&.downcase }
end
string.gsub(/(?:_|(\/))([a-z\d]*)/i) { "#{$1}#{inflections.acronyms[$2] || $2.capitalize}" }.gsub('/', '::')
end
I have some difficulty understanding:
string = string.sub(/^(?:#{inflections.acronym_regex}(?=\b|[A-Z_])|\w)/) { $&.downcase }
and:
string.gsub(/(?:_|(\/))([a-z\d]*)/i) { "#{$1}#{inflections.acronyms[$2] || $2.capitalize}" }.gsub('/', '::')
Please explain to me what they mean. Thank you.
Long version
This shows me trying to understand the regex and how I interpret them to mean. It would be very helpful if you could go through this and correct my mistakes.
For the first regex
string = string.sub(/^(?:#{inflections.acronym_regex}(?=\b|[A-Z_])|\w)/) { $&.downcase }
Based on what I am seeing, inflections.acronym_regex is from the Inflections class in the ActiveSupport::Inflector module, and in the initialize method of the Inflections class,
def initialize
@plurals, @singulars, @uncountables, @humans, @acronyms, @acronym_regex = [], [], [], [], {}, /(?=a)b/
end
acronym_regex is assigned /(?=a)b/. From what I understand from http://www.ruby-doc.org/core-2.0.0/Regexp.html#class-Regexp-label-Anchors ,
(?=pat) - Positive lookahead assertion: ensures that the following characters match pat, but doesn't include those characters in the matched text
So /(?=a)b/ ensures that character a is inside the text, but we don't include character a inside the matched text, and what immediately follows character a must be character b. In other words, "abc" would match this regex, but "bbc" would not match this regex, and the matched text for "abc" would be "b" (instead of "ab").
So combining the value of inflections.acronym_regex into this regex /^(?:#{inflections.acronym_regex}(?=\b|[A-Z_])|\w)/, I do not know which of the following two regex results:
A. /^(?:/(?=a)b/(?=\b|[A-Z_])|\w)/
B. /^(?:(?=a)b(?=\b|[A-Z_])|\w)/
although I am thinking it is B. From what I understand, (?: provides grouping without capturing, (?= means positive lookahead assertion, \b matches word boundaries when outside brackets and matches backspace when inside brackets. So in English terms, regex B, when matching against a text, will find a string that begins with an a character, followed by a b character, and one of (1. backspace [whatever that may mean] 2. any uppercase character or underscore 3. any English alphabetic character, digit, or underscore).
However, I find it strange that passing upper_case_first_letter = false to the camelize function should cause it to match a string starting with the characters ab, given that that does not seem to be how the camelize function behaves.
For the second regex
string.gsub(/(?:_|(\/))([a-z\d]*)/i) { "#{$1}#{inflections.acronyms[$2] || $2.capitalize}" }.gsub('/', '::')
The regex is:
/(?:_|(\/))([a-z\d]*)/i
I am guessing that this regex will match a substring that starts with either an _ or /, followed by 0 or more (upper or lowercase English alphabetic characters or digits). Furthermore, for the first group (?:_|(\/)), whether we match the _ or /, the ([a-z\d]*) capturing group will always be regarded as the second group. I do understand the part where the block tries to look up inflections.acronyms[$2] and on failure, does $2.capitalize.
Since (?: means grouping without capturing, what is the value of $1 when we match _ ? Is it still _ ? And for the .gsub('/', '::') portion, I am guessing that it gets applied for each match in the initial gsub, instead of being applied to the overall string after the outer gsub call is done?
Apologies for the really long post. Please point out my errors in understanding the 2 regular expressions, or explain them in a better way if you can do it.
Thank you.
However, I find it strange that passing upper_case_first_letter =
false to the camelize function should cause it to match a string
starting with the characters ab, given that that does not seem to be
how the camelize function behaves.
The (?: ... ) group only groups the alternatives without capturing. The \w alternative is the one that matches here, and it matches a single character; since nothing is captured, the match is available in $&.
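The question's reading of /(?=a)b/ can be checked directly. The sketch below uses Python's re module rather than Ruby (the lookahead syntax is the same in both), and the names default_acronym_regex and lower_first are made up for the example. The lookahead requires the next character to be a while the pattern then requires that same character to be b, so the default pattern can never match, which leaves only the \w alternative.

```python
import re

# Default value assigned in Inflections#initialize, as quoted in the question.
default_acronym_regex = r"(?=a)b"

# Interpolated form of the first regex (variant B from the question).
first_char = re.compile(
    r"^(?:" + default_acronym_regex + r"(?=\b|[A-Z_])|\w)"
)


def lower_first(term):
    """Lower-case whatever the pattern matches at the start of term."""
    return first_char.sub(lambda m: m.group(0).lower(), term, count=1)


print(re.search(default_acronym_regex, "abc"))  # no match is found
print(lower_first("ActiveModel"))
```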
Since (?: means grouping without capturing, what is the value of $1
when we match _ ? Is it still _ ?
It's nil since there is no capturing. The value is in $2
And for the .gsub('/', '::') portion, I am guessing that it gets
applied for each match in the initial gsub, instead of being applied
to the overall string after the outer gsub call is done?
It's applied to the overall result as gsub with block returns a string and the gsub('/', '::') is outside of a block.
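Both points (the empty first group when _ is matched, and the final replacement running once over the whole result) can be seen in a small re-creation of the second step. This is a Python sketch under stated assumptions, not the Rails implementation: the acronym lookup is left out, and the function name camelize_tail is made up for the example.

```python
import re

SEPARATOR = re.compile(r"(?:_|(/))([a-z\d]*)", re.IGNORECASE)


def camelize_tail(string):
    """Mimic the gsub(...) { ... }.gsub('/', '::') chain without acronyms."""
    def replace(match):
        # Group 1 is None when "_" matched, so it contributes nothing,
        # just as "#{nil}" interpolates to an empty string in Ruby.
        slash = match.group(1) or ""
        return slash + match.group(2).capitalize()

    substituted = SEPARATOR.sub(replace, string)
    # Applied once, to the full string returned by the first substitution.
    return substituted.replace("/", "::")


print(camelize_tail("active_model/errors"))
```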

Write a Lex rule to parse Integer and Float

I am writing a parser for a script language.
I need to recognize strings, integers and floats.
I successfully recognize strings with the rule:
[a-zA-Z0-9_]+ {return STRING;}
But I have a problem recognizing Integers and Floats. These are the (wrong) rules I wrote:
["+"|"-"][1-9]{DIGIT}* { return INTEGER;}
["+"|"-"]["0." | [1-9]{DIGIT}*"."]{DIGIT}+ {return FLOAT;}
How can I fix them?
Furthermore, since a "abc123" is a valid string, how can I make sure that it is recognized as a string and not as the concatenation of a string ("abc") and an Integer ("123") ?
First problem: There's a difference between (...) and [...]. Your regular expressions don't do what you think they do because you're using the wrong punctuation.
Beyond that:
No numeric rule recognizes 0.
Both numeric rules require an explicit sign.
Your STRING rule recognizes integers.
So, to start:
[...] encloses a set of individual characters or character ranges. It matches a single character which is a member of the set.
(...) encloses a regular expression. The parentheses are used for grouping, as in mathematics.
"..." encloses a sequence of individual characters, and matches exactly those characters.
With that in mind, let's look at
["+"|"-"][1-9]{DIGIT}*
The first bracket expression ["+"|"-"] is a set of individual characters or ranges. In this case, the set contains: ", +, " (again, which has no effect because a set contains zero or one instances of each member), |, and the range "-", which is a range whose endpoints are the same character, and consequently only includes that character, ", which is already in the set. In short, that was equivalent to ["+|]. It will match one of those three characters. It requires one of those three characters, in fact.
The second bracket expression [1-9] matches one character in the range 1-9, so it probably does what you expected. Again, it matches exactly one character.
Finally, {DIGIT} matches the expansion of the name DIGIT. I'll assume that you have the definition:
DIGIT [0-9]
somewhere in your definitions section. (In passing, I note that you could have just used the character class [:digit:], which would have been unambiguous, and you would not have needed to define it.) It's followed by a *, which means that it will match zero or more repetitions of the {DIGIT} definition.
Now, an example of a string which matches that pattern:
|42
And some examples of strings which don't match that pattern:
-7 # The pattern must start with |, + or "
42 # Again, the pattern must start with |, + or "
+0 # The character following the + must be in the range [1-9]
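The effect of the misplaced brackets can be tried out quickly. The sketch below uses Python's re module rather than flex, on the assumption that both treat quotes and | inside a bracket expression as ordinary set members, and it writes [0-9] in place of {DIGIT}; the helper name accepted is made up for the example.

```python
import re

# The integer pattern from the question, with {DIGIT} expanded to [0-9].
# Inside the brackets, the quotes and | are just characters of the set.
as_written = re.compile(r'["+"|"-"][1-9][0-9]*')


def accepted(text):
    """Return True if the whole of text matches the pattern as written."""
    return as_written.fullmatch(text) is not None


for sample in ["|42", "+42", '"42', "-7", "42", "+0"]:
    print(repr(sample), accepted(sample))
```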
Similarly, your float pattern, once the [...] expressions are simplified, becomes (writing out the individual pieces one per line, to make it more obvious):
["+|] # i.e. the set " + |
["0.|[1-9] # i.e. the set " 0 . | [ 1 2 3 4 5 6 7 8 9
{DIGIT}* # Any number of digits
"." # A single period
] # A single ]
{DIGIT}+ # one or more digits
So here's a possible match:
"..]3
I'll skip over writing out the solution because I think you'll benefit more from doing it yourself.
Now, the other issues:
Some rule should match 0. If you don't want to allow leading zeros, you'll need to just add it as a separate rule.
Use the optional operator (?) to indicate that the preceding object is optional. eg. "foo"? matches either the three characters f, o, o (in order) or matches the empty string. You can use that to make the sign optional.
The problem is not the matching of abc123, as in your question. (F)lex always gives you the longest possible match, and the only rule which could match the starting character a is the string rule, so it will allow the string rule to continue as long as it can. It will always match all of abc123. However, it will also match 123, which you would probably prefer to be matched by your numeric rule. Here, the other (f)lex matching criterion comes into play: when there are two or more rules which could match exactly the same string, and none of the rules can match a longer string, (f)lex chooses the first rule in the file. So if you want to give numbers priority over strings, you have to put the number rule earlier in your (f)lex file than the string rule.
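The two matching criteria described above (longest match first, then earliest rule on a tie) can be imitated in a few lines. This is only a sketch of the selection logic in Python, not of how (f)lex is implemented; the rule list and the function name next_token are made up for the example, and the integer rule is deliberately simplified to unsigned digits.

```python
import re

# Order matters: on a tie, the rule listed first wins.
RULES = [
    ("INTEGER", re.compile(r"[0-9]+")),
    ("STRING", re.compile(r"[a-zA-Z0-9_]+")),
]


def next_token(text, pos=0):
    """Pick the rule with the longest match at pos; the first rule wins ties."""
    best = None
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        # Strictly longer only, so an earlier rule keeps an equal-length match.
        if match and (best is None or match.end() > best[1].end()):
            best = (name, match)
    return None if best is None else (best[0], best[1].group())


print(next_token("abc123"))
print(next_token("123"))
```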
I hope that gives you some ideas about how to fix things.

searching strings for keywords: questions about the "failure function"

I've got a question on failure function description from "Compilers: Principles, Techniques, and Tools" aka DragonBook
Firstly, the quote:
In order to process text strings rapidly and search those strings for a keyword,
it is useful to define, for keyword b1b2...bn, and position s in that keyword, a failure function, f(s) ...
The objective is that b1b2...bf(s) is the longest proper prefix of
b1...bs, that is also a suffix of b1...bs. The reason f(s) is important is that
if we are trying to match a text string for b1b2...bn, and we have matched the
first s positions, but we then fail (i.e., the next position of the text string does
not hold bs+1), then f(s) is the longest prefix of b1...bn that could possibly
match the text string up to the point we are at. Of course, the next character of
the text string must be bf(s)+1 or else we still have problems and must consider
a yet shorter prefix, which will be bf(f(s)).
So, the questions:
1. If we've matched s positions with the text, why is f(s) the longest prefix of b1...bn that matches the string? I think s is the longest prefix.
2. Next character of the text string must be bf(s)+1, why? We have a mismatch at this position, does it matter at all what the char is?
f(s) is the longest prefix at that position that might match the entire keyword. The idea is not to try to match the keyword with the text from the start, but to find a position where the keyword appears.
Consider a search for the word 'aaaba' in the text 'aaaabaa'. The match fails after the first three a's, but it's not necessary to retry from the second 'a', since we know that if the next letter is a 'b' (which it is), we may have a match there.
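The 'aaaba' example can be traced with a small implementation. The sketch below is the usual Knuth-Morris-Pratt formulation of the failure function in Python, written to follow the book's 1-based f(s); it is not copied from the book's figure, the function names are made up for the example, and a non-empty keyword is assumed.

```python
def failure_function(keyword):
    """Return f where f[s] is the length of the longest proper prefix of
    keyword[:s] that is also a suffix of it (f[0] is unused)."""
    n = len(keyword)
    f = [0] * (n + 1)
    t = 0
    for s in range(1, n):
        # Fall back to shorter and shorter prefixes until one can be extended.
        while t > 0 and keyword[s] != keyword[t]:
            t = f[t]
        if keyword[s] == keyword[t]:
            t += 1
        f[s + 1] = t
    return f


def find(keyword, text):
    """Return the index of the first occurrence of keyword in text, or -1."""
    f = failure_function(keyword)
    s = 0  # number of keyword characters matched so far
    for i, ch in enumerate(text):
        while s > 0 and ch != keyword[s]:
            s = f[s]  # keep the longest prefix that can still match
        if ch == keyword[s]:
            s += 1
        if s == len(keyword):
            return i - s + 1
    return -1


print(failure_function("aaaba")[1:])
print(find("aaaba", "aaaabaa"))
```

When the fourth text character fails to match, the search falls back from s = 3 to f(3) = 2 instead of restarting, which is the shortcut the answer describes.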

Resources