iOS - regex to match word boundary, including underscore

I have a regex that I'm trying to run to match a variety of search terms. For example:
the search "old" should match:
-> age_old
-> old_age
but not
-> bold - as it's not at the start of the word
To do this, I was using a word boundary. However, a word boundary doesn't take underscores into account: the underscore counts as a word character, so there is no boundary between "_" and "old". As mentioned here, there are workarounds available in other languages. Unfortunately, with NSRegularExpression, this doesn't look possible. Is there any other way to get a word boundary to work? Or are there other options?

TLDR: Use one of the following:
let rx = "(?<=_|\\b)old(?=_|\\b)"
let rx = "(?<![^\\W_])old(?![^\\W_])"
let rx = "(?<![\\p{L}\\d])old(?![\\p{L}\\d])"
See a regex demo #1, regex demo #2 and regex demo #3.
Swift and Objective-C use the ICU regex flavor (through NSRegularExpression). This flavor supports look-behinds of fixed or bounded width.
(?= ... )    Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
(?! ... )    Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
(?<= ... )    Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.)
(?<! ... )    Negative Look-behind assertion.
So, you can use
let regex = "(?<![\\p{L}\\d])old(?![\\p{L}\\d])";
See regex demo
Here is a Swift code snippet extracting all "old"s:
import Foundation

func matchesForRegexInText(regex: String, text: String) -> [String] {
    do {
        let regex = try NSRegularExpression(pattern: regex, options: [])
        let nsString = text as NSString
        // NSRange counts UTF-16 code units, so use the NSString length
        let results = regex.matches(in: text,
            options: [], range: NSRange(location: 0, length: nsString.length))
        return results.map { nsString.substring(with: $0.range) }
    } catch {
        print("invalid regex: \(error.localizedDescription)")
        return []
    }
}
let s = "age_old -> old_age but not -> bold"
let rx = "(?<![\\p{L}\\d])old(?![\\p{L}\\d])"
let matches = matchesForRegexInText(regex: rx, text: s)
print(matches) // => ["old", "old"]
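To see why the plain word boundary is not enough here, the two patterns can be compared side by side. This is a minimal sketch; `matchCount(of:in:)` and the sample string are hypothetical names introduced only for illustration.

```swift
import Foundation

// Hypothetical helper: counts how many times `pattern` matches in `text`.
func matchCount(of pattern: String, in text: String) -> Int {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return 0 }
    let range = NSRange(text.startIndex..., in: text)
    return regex.numberOfMatches(in: text, options: [], range: range)
}

let sample = "age_old old_age bold"
// "_" is a word character, so \b finds no boundary next to it
print(matchCount(of: #"\bold\b"#, in: sample))                          // 0
// The look-arounds only reject letters and digits, so "_" acts as a separator
print(matchCount(of: #"(?<![\p{L}\d])old(?![\p{L}\d])"#, in: sample))   // 2
```

The raw string literals (`#"..."#`) avoid the doubled backslashes used in the snippets above; the patterns themselves are the same.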

Related

iOS Swift: looking for ranges of matching word in a string

I need to make a function that returns me ranges of matching words in a given string, for example, given the sentence below:
Hey, bro! Your brother is also her brother.
I want to find an array of Range in the sentence that matches the word "bro", it should match the exact word (case insensitive), so "bro" should only match "bro" but not "brother".
I thought about:
split the sentence, e.g. "hey", "bro", "your", "brother", "is", "also", "her", "brother"
map each word to a word with range, e.g. "hey" would become ["hey", 0...2]
filter and map the word and range array, matching "bro"
Step 2 needs some treatment to make sure the range for each word (in the sentence) can be mapped to the right word, e.g. the first "brother" and second "brother" should have different ranges depending on where they are located.
Is there any smarter way of doing this?
Edit:
Sorry, I forgot to mention, the reason for not using Regex was that sometimes the word has a dot in it, for example:
there is orange in the basket.
from the above sentence, finding the string "or.ge" using regex would match "orange" as well.
I tested this in a Playground. You can use this extension to get the ranges matching the regex.
extension String {
    func ranges(of substring: String, options: CompareOptions = [], locale: Locale? = nil) -> [Range<Index>] {
        var ranges: [Range<Index>] = []
        while ranges.last.map({ $0.upperBound < self.endIndex }) ?? true,
            let range = self.range(of: substring, options: options, range: (ranges.last?.upperBound ?? self.startIndex)..<self.endIndex, locale: locale)
        {
            ranges.append(range)
        }
        return ranges
    }
}
let searchString = "bro"
var str = "Hey, bro! Your brother is also her brother."
var reg = str.ranges(of: "(?<![\\p{L}\\d])\(searchString)(?![\\p{L}\\d])", options: [.regularExpression, .caseInsensitive])
str.removeSubrange(reg.first!)
print(str)
Credit to: iOS - regex to match word boundary, including underscore
One simple solution is to use regular expressions with \b to match “word boundaries”, e.g.
let searchString = "bro"
let sentence = "Hey, Bro! Your brother is also her brother."
let regex = try! NSRegularExpression(pattern: #"\b\#(searchString)\b"#, options: .caseInsensitive)
regex.enumerateMatches(in: sentence, range: NSRange(sentence.startIndex..., in: sentence)) { match, _, _ in
    guard let match = match else { return }
    print(match.range)
    // or, if you want a String.Range
    if let range = Range(match.range, in: sentence) {
        print(sentence[range])
    }
}
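The question's edit mentions search terms that contain a dot, such as "or.ge", which a regex would otherwise treat as "any character". One way to handle that, sketched here under the assumption that the term should always be matched literally, is to escape it with `NSRegularExpression.escapedPattern(for:)` before building the pattern. The helper name `wordRanges(of:in:)` is hypothetical.

```swift
import Foundation

// Hypothetical helper: ranges of `term` as a whole word in `sentence`, case-insensitively.
func wordRanges(of term: String, in sentence: String) -> [Range<String.Index>] {
    // escapedPattern(for:) backslash-escapes metacharacters such as "." so they match literally
    let escaped = NSRegularExpression.escapedPattern(for: term)
    let pattern = #"(?<![\p{L}\d])"# + escaped + #"(?![\p{L}\d])"#
    guard let regex = try? NSRegularExpression(pattern: pattern, options: .caseInsensitive) else {
        return []
    }
    let whole = NSRange(sentence.startIndex..., in: sentence)
    return regex.matches(in: sentence, range: whole).compactMap { Range($0.range, in: sentence) }
}

let basket = "there is orange in the basket."
print(wordRanges(of: "or.ge", in: basket).count)   // 0, "orange" is no longer matched

let greeting = "Hey, Bro! Your brother is also her brother."
print(wordRanges(of: "bro", in: greeting).map { String(greeting[$0]) })   // ["Bro"]
```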
There are other, richer APIs (e.g. the Natural Language framework) which, while not perfect, provide more sophisticated parsing of natural language text. For example, the code below differentiates between the verb “saw” and the noun “saw”:
import NaturalLanguage
let text = "I saw the hammer. I did not see a saw."
let tagger = NLTagger(tagSchemes: [.lexicalClass])
tagger.string = text
let options: NLTagger.Options = [.omitWhitespace, .joinContractions]
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { tag, range in
    guard let tag = tag else { return true }
    print(tag, String(text[range]))
    return true
}
Producing:
NLTag(_rawValue: Pronoun) I
NLTag(_rawValue: Verb) saw
NLTag(_rawValue: Determiner) the
NLTag(_rawValue: Noun) hammer
NLTag(_rawValue: SentenceTerminator) .
NLTag(_rawValue: Pronoun) I
NLTag(_rawValue: Verb) did
NLTag(_rawValue: Adverb) not
NLTag(_rawValue: Verb) see
NLTag(_rawValue: Determiner) a
NLTag(_rawValue: Noun) saw
NLTag(_rawValue: SentenceTerminator) .

Convert placeholders such as %1$s to {x} in Swift

I'm parsing an XML doc (using XMLParser) and some of the values have php-like placeholders, e.g. %1$s, and I would like to convert those to {x-1}.
Examples:
%1$s ---> {0}
%2$s ---> {1}
I'm doing this in a seemingly hacky way, using regex:
But there must be a better implementation of this regex.
Consider a string:
let str = "lala fawesfgeksgjesk 3rf3f %1$s rk32mrk3mfa %2$s fafafczcxz %3$s czcz $#$##%## %4$s qqq %5$s"
Now we're going to extract the integer strings between strings % and $s:
let regex = try! NSRegularExpression(pattern: "(?<=%)[^$s]+")
let range = NSRange(location: 0, length: str.utf16.count)
let matches = regex.matches(in: str, options: [], range: range)
matches.map {
print(String(str[Range($0.range, in: str)!]))
}
Works quite fine. The issue is that the "4" value got mixed up because of the preceding random strings before the %4$s.
Prints:
1
2
3
## %4
5
Is there any better way to do this?
This might not be a very efficient (or swifty :)) way, but it gets the job done. It searches for the given regex, extracts the numeric value from the matched substring, decrements it, and then replaces the substring with a newly constructed placeholder. This runs in a loop until no more matches are found.
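// Note: `str` must be declared with `var`, since it is reassigned in the loop
let pattern = #"%(\d+)\$s"# // \d+ rather than \d*, so "%$s" without a number is never matched
while let range = str.range(of: pattern, options: .regularExpression) {
    let placeholder = str[range]
    // Keep only the digits of the matched placeholder, e.g. "%1$s" -> "1"
    let number = placeholder.trimmingCharacters(in: CharacterSet(charactersIn: "0123456789").inverted)
    // Stop instead of looping forever if the number cannot be parsed
    guard let value = Int(number) else { break }
    str = str.replacingOccurrences(of: placeholder, with: "{\(value - 1)}")
}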
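As an alternative to re-scanning the string on every iteration, the matches can be collected once and replaced from the end of the string backwards, so that earlier match ranges stay valid. This is only a sketch of that idea; `convertPlaceholders(in:)` is a hypothetical name, and it assumes every placeholder has the form `%N$s`.

```swift
import Foundation

// Hypothetical helper: converts "%1$s"-style placeholders to "{0}"-style ones in one pass.
func convertPlaceholders(in input: String) -> String {
    let regex = try! NSRegularExpression(pattern: #"%(\d+)\$s"#)
    let source = input as NSString
    let result = NSMutableString(string: input)
    let matches = regex.matches(in: input, range: NSRange(location: 0, length: source.length))
    // Replace from the last match to the first so earlier ranges are not shifted
    for match in matches.reversed() {
        let digits = source.substring(with: match.range(at: 1))
        guard let value = Int(digits) else { continue }
        result.replaceCharacters(in: match.range, with: "{\(value - 1)}")
    }
    return result as String
}

print(convertPlaceholders(in: "a %1$s b %2$s $#$##%## %4$s"))
// a {0} b {1} $#$##%## {3}
```

Because the digits come from capture group 1, the stray "%##" in the input is never considered part of a placeholder.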

Which NSRegularExpression was found using the | operator

I'm currently implementing NSRegularExpressions to check for patterns inside a UITextView string in my project.
The pattern checks and operations are working as expected; for example, I'm trying to find the regular **bold** markdown pattern, and if I find it I apply some text attributes to the range, and it works as expected.
I have, though, come across a problem. I don't know how to run multiple patterns at once and apply different operations for each pattern found.
In my UITextView delegate textViewDidChange or shouldChangeTextIn range: NSRange I am running the bold pattern check \\*{2}([\\w ]+)\\*{2}, but then I am also running the italic pattern check \\_{1}([\\w ]+)\\_{1}, looping again through the UITextView text.
I have implemented the following custom function, which applies the passed-in regex to the string, but I have to call this function multiple times to check for each pattern. That's why I'd love to put the pattern checks into one single pass, then "parse" each match.
fileprivate func regularExpression(regex: NSRegularExpression, type: TypeAttributes) {
    let str = inputTextView.attributedText.string
    let results = regex.matches(in: str, range: NSRange(str.startIndex..., in: str))
    _ = results.map { self.applyAttributes(range: $0.range, type: type) }
}
Thanks.
EDIT
I can "merge" both patterns with the | operator like the following:
private let combinedPattern = "\\*{2}([\\w ]+)\\*{2}|\\_{1}([\\w ]+)\\_{1}"
but my problem is knowing which pattern was found: the \\*{2}([\\w ]+)\\*{2} one or the \\_{1}([\\w ]+)\\_{1} one.
If you use the combined pattern, each alternative's capture group has its own index in the match result.
To access the first capture group (the bold pattern), ask for the range at index 1. When the match came from the second alternative, group 1 did not participate, and its range is invalid (its location is NSNotFound), so you need to check whether the range is valid before using it:
results.forEach {
    // A capture group that did not participate in the match has location NSNotFound
    var range = $0.range(at: 1)
    if range.location != NSNotFound {
        self.applyAttributes(range: range, type: .bold)
    }
    range = $0.range(at: 2)
    if range.location != NSNotFound {
        self.applyAttributes(range: range, type: .italic)
    }
}
After that, you can list your TypeAttributes cases in capture-group order and loop over them, with a small NSRange helper for the validity check:
extension NSRange {
    // A capture group that did not participate in the match has location NSNotFound
    var isValid: Bool {
        return location != NSNotFound
    }
}
let attributes: [TypeAttributes] = [.bold, .italic]
results.forEach { match in
    attributes.enumerated().forEach { index, attribute in
        let range = match.range(at: index + 1)
        if range.isValid {
            self.applyAttributes(range: range, type: attribute)
        }
    }
}
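Put together as a self-contained sketch, the idea looks roughly like this. `MarkdownStyle` and `styledRanges(in:)` are hypothetical stand-ins for the question's `TypeAttributes` and `applyAttributes`, and the combined pattern is the one from the question with `\_{1}` simplified to `_`.

```swift
import Foundation

// Hypothetical stand-in for the question's TypeAttributes enum
enum MarkdownStyle { case bold, italic }

// Returns which style matched and the range of its capture group (the text between the markers).
func styledRanges(in text: String) -> [(MarkdownStyle, NSRange)] {
    let pattern = #"\*{2}([\w ]+)\*{2}|_([\w ]+)_"#
    let regex = try! NSRegularExpression(pattern: pattern)
    // Styles listed in capture-group order: group 1 is bold, group 2 is italic
    let styles: [MarkdownStyle] = [.bold, .italic]
    var found: [(MarkdownStyle, NSRange)] = []
    let fullRange = NSRange(text.startIndex..., in: text)
    for match in regex.matches(in: text, range: fullRange) {
        for (index, style) in styles.enumerated() {
            let range = match.range(at: index + 1)
            // Groups from the alternative that did not match report NSNotFound
            if range.location != NSNotFound {
                found.append((style, range))
            }
        }
    }
    return found
}

for (style, range) in styledRanges(in: "**bold** and _italic_") {
    print(style, range)
}
```

If the attributes should cover the markers as well, `match.range` (the whole match) can be used in place of the capture-group range once the style is known.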

How to get range of specific substring even if a duplicate

I want to detect the words that begin with a #, and return their specific ranges. Initially I tried using the following code:
for word in words {
    if word.hasPrefix("#") {
        let matchRange = theSentence.range(of: word)
        //Do stuff with this word
    }
}
This works fine, except that if you have a duplicate hashtag it will return the range of the first occurrence of the hashtag. This is because range(of:) always returns the first match in the string.
Say I have the following string:
"The range of #hashtag should be different to this #hashtag"
This will return (13, 8) for both hashtags, when really it should return (13, 8) as well as (50, 8). How can this be fixed? Please note that emojis should be able to be detected in the hashtag too.
EDIT
If you want to know how to do this with emojis too, go here
Create a regex for the hashtag and use it with NSRegularExpression to find the range of each match.
import Foundation

let str = "The range of #hashtag should be different to this #hashtag"
let regex = try NSRegularExpression(pattern: "(#[A-Za-z0-9]*)", options: [])
// NSRange is measured in UTF-16 code units, so build it from the String itself
let matches = regex.matches(in: str, options: [], range: NSRange(str.startIndex..., in: str))
for match in matches {
    print("match = \(match.range)")
}
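The question also asks for emoji to be allowed inside a hashtag. A minimal sketch of that, assuming a hashtag is "#" followed by any run of characters that are neither whitespace nor "#" (this definition and the name `hashtagRanges(in:)` are assumptions, not part of the original answer):

```swift
import Foundation

// Hypothetical helper: Swift string ranges of every hashtag in `text`.
func hashtagRanges(in text: String) -> [Range<String.Index>] {
    // Assumption: a hashtag is "#" followed by non-whitespace, non-"#" characters
    guard let regex = try? NSRegularExpression(pattern: #"#[^\s#]+"#) else { return [] }
    let whole = NSRange(text.startIndex..., in: text)
    // Convert each NSRange (UTF-16 based) back into a Range<String.Index>
    return regex.matches(in: text, range: whole).compactMap { Range($0.range, in: text) }
}

let sentence = "The range of #hashtag should be different to this #hashtag🎉"
for range in hashtagRanges(in: sentence) {
    print(sentence[range], NSRange(range, in: sentence))
}
```

Converting through `Range(_:in:)` keeps the result usable with Swift string APIs even when the text contains characters that occupy more than one UTF-16 code unit.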
Why don't you split your sentence into chunks, where each chunk starts with #? Then you know how many times a word with # appears in the sentence.
Edit: I think the regex answer is the best way to do this, but here is another approach to the same problem.
// Collect all words which begin with # in an array
var hashtagWords: [String] = []
for word in words {
    if word.hasPrefix("#") {
        hashtagWords.append(word)
    }
}
// Search for each hashtag starting after the previous match,
// so that duplicates get distinct ranges in the original sentence
var searchStart = theSentence.startIndex
var hashtagRanges: [Range<String.Index>] = []
for hashtagWord in hashtagWords {
    let range = theSentence.range(of: hashtagWord, range: searchStart..<theSentence.endIndex)
    if let aRange = range {
        // If the range is OK, record it and continue after it
        hashtagRanges.append(aRange)
        searchStart = aRange.upperBound
    }
}

Return range with first and last character in string

I have a string: "Hey #username that's funny". For a given string, how can I search the string to return all ranges of string with first character # and last character to get the username?
I suppose I can get all indexes of # and for each, get the substringToIndex of the next space character, but wondering if there's an easier way.
If your username can contain only letters and numbers, you can use regular expression for that:
let s = "Hey #username123 that's funny"
if let r = s.range(of: "#\\w+", options: .regularExpression) {
    let name = String(s[r]) // "#username123"
}
@Vladimir's answer is correct, but if you're trying to find multiple occurrences of "username", this should also work:
let s = "Hey #username123 that's funny"
let ranges: [NSRange]
do {
    // Create the regular expression.
    let regex = try NSRegularExpression(pattern: "#\\w+", options: [])
    // Use the regular expression to get an array of NSTextCheckingResult.
    // Use map to extract the range from each result.
    ranges = regex.matches(in: s, options: [], range: NSRange(s.startIndex..., in: s)).map { $0.range }
}
catch {
    // There was a problem creating the regular expression
    ranges = []
}
for range in ranges {
    print((s as NSString).substring(with: range))
}
