Is there a way to compile an NSRegularExpression to match multiple strings?

ICU and Java's regular expression support (and probably other platforms) separate compilation of a regular expression from matching it to a specific string. This improves performance when a common regex pattern is matched with multiple strings, since it only needs to be compiled once.
Is there any way to do this with NSRegularExpression? Its design appears to combine these two steps, if I'm reading the documentation correctly.

They are two steps. First, you create a regular expression:
NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"<h1>(.*?)</h1>"
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:&error];
And then, second, you use it (obviously use whatever method you want):
[regex enumerateMatchesInString:htmlString
                        options:0
                          range:NSMakeRange(0, [htmlString length])
                     usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
    // do whatever you want
}];
Am I misunderstanding the question?
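To make the "compile once, match many" point concrete, here is a minimal sketch. The helper name CountStringsWithHeading and the sample inputs are made up for illustration; the only real API used is NSRegularExpression itself. The regex object is created once before the loop and reused for every string.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts how many of the given strings contain an <h1> element.
// The pattern is compiled once, outside the loop, and the same object is reused.
static NSUInteger CountStringsWithHeading(NSArray<NSString *> *strings) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<h1>(.*?)</h1>"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return 0;
    }

    NSUInteger count = 0;
    for (NSString *s in strings) {
        // Matching step: no recompilation happens here.
        NSRange whole = NSMakeRange(0, s.length);
        if ([regex firstMatchInString:s options:0 range:whole] != nil) {
            count++;
        }
    }
    return count;
}
```

If the same pattern is needed across many calls, you could also keep the regex in a static variable or a property instead of recreating it each time.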

Related

Checking if a given character is an emoji in Objective-C

I have an NSString and would like to check if a given character at a certain index is an emoji.
However, there doesn't seem to be any reliable way to create an NSCharacterSet of emoji characters, since they change from iOS update to update. And a lot of the available solutions rely on Swift features such as UnicodeScalar. All solutions seem to involve hardcoding the codepoint values for emojis.
As such, is it possible to check for emojis at all?
It's a bit of a complicated question because Unicode is complicated, but you can use NSRegularExpression to do this:
NSString *s = @"where's the emoji 😎 ?";
NSRegularExpression *r = [NSRegularExpression regularExpressionWithPattern:@"\\p{Emoji_Presentation}" options:0 error:NULL];
NSRange range = [r rangeOfFirstMatchInString:s options:0 range:NSMakeRange(0, s.length)];
NSLog(@"location %lu length %lu", (unsigned long)range.location, (unsigned long)range.length);
produces:
2019-01-16 18:07:42.629 emoji[50405:6837084] location 18 length 2
I'm using the \p{Unicode property name} pattern to match characters that have the specified Unicode property. I'm using the property Emoji_Presentation to get those characters that are presented as emoji by default. You should review Unicode® Technical Standard #51, Annex A: Emoji Properties and Data Files, and the data files linked in A.1, to decide which property you actually care about.
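Since the question asks about a character at a given index, here is a rough sketch of how the same pattern could be applied to just that position. IsEmojiAtIndex is a hypothetical helper name, and I'm assuming Emoji_Presentation is the property you want. It first widens the index to the full composed character sequence, because an emoji like 😎 occupies two UTF-16 units in an NSString.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if the character containing `index` matches
// \p{Emoji_Presentation}. The regex is rebuilt on every call for brevity;
// in real code you would probably cache it.
static BOOL IsEmojiAtIndex(NSString *s, NSUInteger index) {
    if (index >= s.length) {
        return NO;
    }
    NSRegularExpression *r =
        [NSRegularExpression regularExpressionWithPattern:@"\\p{Emoji_Presentation}"
                                                  options:0
                                                    error:NULL];
    // Expand the index to the whole character (surrogate pairs, combining marks).
    NSRange charRange = [s rangeOfComposedCharacterSequenceAtIndex:index];
    NSRange match = [r rangeOfFirstMatchInString:s options:0 range:charRange];
    return match.location != NSNotFound;
}
```

With the string from the answer, index 18 is where the emoji starts, so that is the position to pass in.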

Using capture groups within an NSRegularExpression pattern

Is a regex of the following form legit in Obj C?
"<(img|a|div).*?>.*?</$1>"
I know it's valid in JS with a \1 instead of $1, but I'm having little luck in Obj C.
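NSRegularExpression uses ICU regular expressions, which use \n syntax for backreferences, where n is the number of the capture group (so \1 refers to the first group). Inside an Objective-C string literal the backslash itself has to be escaped: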
<(img|a|div).*?>.*?</\\1>
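As a quick sketch of the backreference in use (the helper name FirstMatchingTagName and the sample strings are mine, not from the question), the following returns the tag name only when the closing tag matches the opening one:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the name of the first img/a/div element whose
// closing tag matches its opening tag, or nil if there is none.
static NSString *FirstMatchingTagName(NSString *html) {
    // \\1 in the literal becomes \1 in the pattern: "whatever group 1 matched".
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<(img|a|div).*?>.*?</\\1>"
                                                  options:0
                                                    error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:html options:0 range:NSMakeRange(0, html.length)];
    if (match == nil) {
        return nil;
    }
    // Range 0 is the whole match; range 1 is the first capture group.
    return [html substringWithRange:[match rangeAtIndex:1]];
}
```

Note that $1 is still what you use in replacement templates (for example with stringByReplacingMatchesInString:options:range:withTemplate:); \1 is only for referring back to a group inside the pattern itself.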
Yes, I do believe you can work with capture groups. I had to work with them a bit a little while ago and I have an example in:
- (NSString *)extractMediaLink:(NSString *)link withRegex:(NSString *)regex {
    NSString *utf8Link = [link stringByRemovingPercentEncoding];
    NSError *regexError = nil;
    NSRegularExpression *regexParser = [NSRegularExpression regularExpressionWithPattern:regex
                                                                                 options:NSRegularExpressionCaseInsensitive | NSRegularExpressionUseUnixLineSeparators
                                                                                   error:&regexError];
    NSTextCheckingResult *regexResults = [regexParser firstMatchInString:utf8Link
                                                                 options:0
                                                                   range:NSMakeRange(0, [utf8Link length])];
    // Range 0 is the whole match; range 1 is the first capture group, which always has the ID.
    if (regexResults == nil || regexResults.numberOfRanges < 2 || [regexResults rangeAtIndex:1].location == NSNotFound) {
        return @""; // no match, or the group did not participate in the match
    }
    return [utf8Link substringWithRange:[regexResults rangeAtIndex:1]];
}
When you use an instance of NSRegularExpression to generate an NSTextCheckingResult, the NSTextCheckingResult has a property of numberOfRanges which is documented with:
A result must have at least one range, but may optionally have more (for example, to represent regular expression capture groups).
In my example above (note: I happen to be parsing HTML, but I also use an additional pod, TFHpple, that traverses HTML by XPath queries; a lifesaver if you absolutely have to parse HTML), I use -[NSRegularExpression firstMatchInString:options:range:] to find the first instance of the tag that matches my regex pattern. From that NSTextCheckingResult I pull out the range of the capture group I'm interested in (in this case, [regexResults rangeAtIndex:1]).
Getting to this point was a huge pain in the ass, though. To make sure you're getting the right expressions, I would highly recommend using Regex101 with the Python setting, and then passing the refined regex into Patterns (Mac App Store).
If you want the full look, I have a fairly detailed project here, but keep in mind it's still a WIP.

NSRegularExpression search within matched string for the same pattern

I use NSRegularExpression to find matches for a certain pattern, which is visible in the snippet:
- (NSString *)functionPattern
{
    return @"[A-Za-z]{1,}\\([A-Za-z0-9,\\(\\)]{1,}\\)";
}

- (void)test
{
    NSString *formula = @"AVERAGE(G17,G18,AVERAGE(G20,G21,MIN(G30,G31)))";
    NSError *error = nil;
    NSRegularExpression *functionRegex = [NSRegularExpression regularExpressionWithPattern:[self functionPattern]
                                                                                   options:NSRegularExpressionCaseInsensitive
                                                                                     error:&error];
    if (functionRegex == nil) {
        NSLog(@"error: %@", error);
        return;
    }
    NSArray *matches = [functionRegex matchesInString:formula options:0 range:NSMakeRange(0, formula.length)];
    for (NSUInteger i = 0; i < matches.count; i++) {
        NSTextCheckingResult *result = matches[i];
        NSString *match = [formula substringWithRange:result.range];
        NSLog(@"%@", match);
    }
}
My natural expectation was to get 3 matches: AVERAGE(...), AVERAGE(...) and MIN(...). Surprisingly for me, I only get one match: AVERAGE(G17,G18,AVERAGE(G20,G21,MIN(G30,G31))).
If the formula is AVERAGE(G17,G18)+AVERAGE(G20,G21,MIN(G30,G31)), I'll get 2 matches: AVERAGE(G17,G18) and AVERAGE(G20,G21,MIN(G30,G31)). In other words, after a match is found, a search for the pattern is not performed in the range of the matched string.
Please advise how to overcome this and find all possible matches. Am I missing something simple here?
What I'm doing is parsing and evaluating math expressions. All works fine, except for the cases of nested functions. If I know all possible function names in advance, can that be utilised somehow?
I'm hoping to find a more or less elegant approach; if I can, I'd like to avoid things like "stripping off the function name and parentheses".
Help is much appreciated.
You cannot do what you want, at least the way you want to.
Regular expressions are technically a type 3 grammar and cannot describe recursive languages; your math expressions can contain other math expressions.
You could do something along the lines of what you say you don't want to do. For example, you could match an expression containing just a single pair of balanced parentheses, so in AVERAGE(G20,G21,MIN(G30,G31)) you could match MIN(G30,G31). If you then replaced the match with a marker and matched again, you could match the next level, and so on. But this is not a good way to do it.
General math expressions can be described by a type 2 grammar, and can be easily parsed using a recursive descent parser. Such parsers are very easy to write. Essentially you write down the grammar you wish to parse and then write a function for each production. Google will get you started down this route.
If you don't want to write the parser yourself you can use a parser generator, in which case you still need to write the grammar, or search for one of the math expression libraries.
HTH
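To give an idea of what the recursive descent route looks like, here is a small sketch that collects every function call in a formula, nested ones included. The grammar, the function names ParseExpr and FunctionCallsInFormula, and the outermost-first ordering are all my own assumptions for illustration. It only handles NAME(arg,arg,...) with alphanumeric tokens, so operators like + and empty argument lists would need more productions.

```objc
#import <Foundation/Foundation.h>

// Grammar assumed for this sketch:
//   expr := NAME '(' expr (',' expr)* ')'    function call
//         | TOKEN                            cell reference or number, e.g. G17
// Parses one expr starting at *pos and appends the text of every function
// call it sees to `calls` (outer calls before inner ones).
// Returns NO on a syntax error.
static BOOL ParseExpr(NSString *s, NSUInteger *pos, NSMutableArray<NSString *> *calls) {
    NSCharacterSet *alnum = [NSCharacterSet alphanumericCharacterSet];
    NSUInteger start = *pos;
    while (*pos < s.length && [alnum characterIsMember:[s characterAtIndex:*pos]]) {
        (*pos)++;
    }
    if (*pos == start) {
        return NO; // expected a name or token here
    }
    if (*pos >= s.length || [s characterAtIndex:*pos] != '(') {
        return YES; // plain token, not a call
    }
    NSUInteger insertAt = calls.count; // slot for this call, ahead of its children
    do {
        (*pos)++; // skip '(' or ','
        if (!ParseExpr(s, pos, calls)) {
            return NO;
        }
    } while (*pos < s.length && [s characterAtIndex:*pos] == ',');
    if (*pos >= s.length || [s characterAtIndex:*pos] != ')') {
        return NO; // missing closing parenthesis
    }
    (*pos)++; // skip ')'
    NSString *callText = [s substringWithRange:NSMakeRange(start, *pos - start)];
    [calls insertObject:callText atIndex:insertAt];
    return YES;
}

// Returns all function calls in the formula, or nil if it does not parse.
static NSArray<NSString *> *FunctionCallsInFormula(NSString *formula) {
    NSMutableArray<NSString *> *calls = [NSMutableArray array];
    NSUInteger pos = 0;
    if (!ParseExpr(formula, &pos, calls) || pos != formula.length) {
        return nil;
    }
    return calls;
}
```

For the formula in the question this yields the three calls the asker expected: the outer AVERAGE(...), the inner AVERAGE(...), and MIN(G30,G31). An evaluator would have the same shape, with each function returning a value instead of collecting text.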

NSRegularExpression acts weird (and the regex is correct)

I've got this regex
([0-9]+)\(([0-9]+),([0-9]+)\)
that I'm using to construct an NSRegularExpression with no options (0). That expression should match strings like
1(135,252)
and yield three matches: 1, 135, 252. Now, I've confirmed with debuggex.com that the expression is correct and does what I want. However, iOS refuses to acknowledge my efforts and the following code
NSString *nodeString = @"1(135,252)";
NSArray *r = [nodeRegex matchesInString:nodeString options:0 range:NSMakeRange(0, nodeString.length)];
NSLog(@"--- %@", nodeString);
for (NSTextCheckingResult *t in r) {
    for (int i = 0; i < t.numberOfRanges; i++) {
        NSLog(@"%@", [nodeString substringWithRange:[t rangeAtIndex:i]]);
    }
}
insists on printing
--- 1(135,252)
135,252
13
5,252
5
252
which is clearly wrong.
Thoughts?
Your regex should look like this
[NSRegularExpression regularExpressionWithPattern:@"([0-9]+)\\(([0-9]+),([0-9]+)\\)"
                                          options:0
                                            error:NULL];
Note the double backslashes in the pattern. They are needed because the backslash is also the escape character in C string literals (for example \" for a quote), and Objective-C is a superset of C, so the compiler consumes one backslash before the regex engine ever sees the pattern.
If you are looking for a handy tool for working with regular expressions, I can recommend Patterns. It's very cheap and can export straight to NSRegularExpression.
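Here is a short sketch of the corrected pattern pulling out the three numbers. ParseNode is just a name I picked for the example. Keep in mind that range 0 of an NSTextCheckingResult is the whole match, so the capture groups start at index 1.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the three captured numbers from a string like
// "1(135,252)", or nil if the string does not match.
static NSArray<NSString *> *ParseNode(NSString *nodeString) {
    NSRegularExpression *nodeRegex =
        [NSRegularExpression regularExpressionWithPattern:@"([0-9]+)\\(([0-9]+),([0-9]+)\\)"
                                                  options:0
                                                    error:NULL];
    NSTextCheckingResult *match =
        [nodeRegex firstMatchInString:nodeString
                              options:0
                                range:NSMakeRange(0, nodeString.length)];
    if (match == nil) {
        return nil;
    }
    NSMutableArray<NSString *> *groups = [NSMutableArray array];
    // Start at 1 to skip range 0, which covers the entire match.
    for (NSUInteger i = 1; i < match.numberOfRanges; i++) {
        [groups addObject:[nodeString substringWithRange:[match rangeAtIndex:i]]];
    }
    return groups;
}
```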
Debuggex currently supports only raw regexes. This means that if you are using the regex in a string, you need to escape your backslashes, like this:
([0-9]+)\\(([0-9]+),([0-9]+)\\)
Also note that Debuggex does not (yet) support Objective C. This probably won't matter for simple regexes, but for more complicated ones, different engines do different things. This can result in some unexpected behavior.

Validating IP address by regular expression - Unknown escape sequence

I am working on an iOS project that requires using a regular expression to validate IPv4 addresses.
I use the following code
// only support ip4 currently
NSRegularExpression *regex = [NSRegularExpression
    regularExpressionWithPattern:@"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
                         options:0
                           error:nil];
NSUInteger numberOfMatches = [regex numberOfMatchesInString:IpString options:0 range:NSMakeRange(0, [IpString length])];
return (numberOfMatches==1?TRUE:FALSE);
Xcode keeps warning me "unknown escape sequence '\.'". It also returns true when I type "1.3.6.-6" or "2.3.33".
How can I use a dot (.) in a regex? Thanks
You need to double the backslash before the ., because the first backslash is interpreted by the compiler as the start of an escape sequence in the string literal, and \. is not a valid escape sequence. With a double backslash (\\.), the first backslash escapes the second, so the string that reaches NSRegularExpression contains a literal \. and the regex engine treats the dot as a plain dot.
So for example, your regex will be:
@"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
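Wrapped in a function, the corrected pattern could look something like this (IsValidIPv4 is a hypothetical name; the body mirrors the code from the question with the escaped dot):

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if `ip` is a dotted-quad IPv4 address with each
// octet in the range 0-255.
static BOOL IsValidIPv4(NSString *ip) {
    // "\\." in the literal reaches the regex engine as "\.", i.e. a literal dot.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:
            @"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
                                                  options:0
                                                    error:NULL];
    NSUInteger numberOfMatches =
        [regex numberOfMatchesInString:ip options:0 range:NSMakeRange(0, ip.length)];
    return numberOfMatches == 1;
}
```

With the escaped dot, the two inputs from the question ("1.3.6.-6" and "2.3.33") are rejected.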
