Swift: Split String into sentences - ios

I'm wondering how I can split a string containing several sentences into an array of the sentences.
I know about the split function, but splitting on "." doesn't suit all cases.
Is there something like what is mentioned in this answer?

You can use NSLinguisticTagger to identify SentenceTerminator tokens and then split the text into an array of strings from there.
I used this code and it worked great.
https://stackoverflow.com/a/57985302/10736184
import Foundation
let text = "My paragraph with weird punctuation like Nov. 17th."
var r = [Range<String.Index>]()
// Tag every token with its lexical class; r receives the matching token ranges.
let t = text.linguisticTags(
    in: text.startIndex..<text.endIndex,
    scheme: NSLinguisticTagScheme.lexicalClass.rawValue,
    tokenRanges: &r)
var result = [String]()
// Start index of every token tagged as a sentence terminator.
let ixs = t.enumerated().filter {
    $0.1 == "SentenceTerminator"
}.map { r[$0.0].lowerBound }
var prev = text.startIndex
for ix in ixs {
    // Each sentence runs from the previous boundary through its terminator.
    let r = prev...ix
    result.append(
        text[r].trimmingCharacters(
            in: .whitespaces))
    prev = text.index(after: ix)
}
After this, result is an array of sentence strings. Note that a sentence has to be terminated with '?', '!', '.', etc. to count. If you want to split on newlines as well, or on other lexical classes, you can add
|| $0.1 == "ParagraphBreak"
after
$0.1 == "SentenceTerminator"
to do that.

If you can use Apple's Foundation, the solution can be quite straightforward.
import Foundation
var text = """
Let's split some text into sentences.
The text might include dates like Jan.13, 2020, words like S.A.E and numbers like 2.2 or $9,999.99 as well as emojis like 👨‍👩‍👧‍👦! How do I split this?
"""
var sentences: [String] = []
text.enumerateSubstrings(in: text.startIndex..., options: [.localized, .bySentences]) { (tag, _, _, _) in
    sentences.append(tag ?? "")
}
There are, of course, ways to do it in pure Swift. Here is a quick and dirty split:
let simpleText = """
This is a very simple text.
It doesn't include dates, abbreviations, and numbers, but it includes emojis like 👨‍👩‍👧‍👦! How do I split this?
"""
let sentencesPureSwift = simpleText.split(omittingEmptySubsequences:true) { $0.isPunctuation && !Set("',").contains($0)}
It could be refined with reduce().
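One possible refinement along those lines, as a rough sketch rather than the answerer's own code: fold over the characters with reduce(into:), so that each terminator stays attached to its sentence and whitespace between sentences is skipped. The function name splitSentences and the terminator set are assumptions made for illustration. Like the quick split, it still breaks on abbreviations such as "p.m." and on decimals like 2.2.

```swift
// Hypothetical helper (not from the original answer): a naive, pure-Swift splitter.
func splitSentences(_ text: String) -> [String] {
    // Assumed terminator set; extend as needed.
    let terminators: Set<Character> = [".", "!", "?"]
    var state = text.reduce(into: (done: [String](), current: "")) { acc, ch in
        if acc.current.isEmpty {
            // Skip whitespace between sentences.
            if ch.isWhitespace { return }
            // Repeated terminators ("!!!") stay with the previous sentence.
            if terminators.contains(ch), !acc.done.isEmpty {
                acc.done[acc.done.count - 1].append(ch)
                return
            }
        }
        acc.current.append(ch)
        if terminators.contains(ch) {
            acc.done.append(acc.current)
            acc.current = ""
        }
    }
    // Keep any trailing text that has no terminator.
    if !state.current.isEmpty { state.done.append(state.current) }
    return state.done
}

let parts = splitSentences("This is simple. It has emojis like 👨‍👩‍👧‍👦! How do I split this?")
print(parts)
```

Because the emoji family is a single Character in Swift, it passes through the fold untouched.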

Take a look at this question:
How to create String split extension with regex in Swift?
It shows how to combine regex and componentsSeparatedByString.

Try this:
import Foundation
let myString = "This is a test"
let myWords = myString.components(separatedBy: " ")
// myWords is now: ["This", "is", "a", "test"]

Related

How to split unicode string into characters

I have strings like
"\U0aac\U0ab9\U0ac1\U0ab5\U0a9a\U0aa8",
"\U0a97\U0ac1\U0ab8\U0acd\U0ab8\U0acb",
"\U0aa6\U0abe\U0ab5\U0acb",
"\U0a96\U0a82\U0aa1"
But I want to split these strings by Unicode character.
I don't know how to do it. I know the components(separatedBy:) function, but it's no use here.
Any help would be appreciated.
If the strings you're getting really contain \U characters, you need to parse them manually and extract the unicode scalar values. Something like this:
import Foundation
let strings = [
    "\\U0aac\\U0ab9\\U0ac1\\U0ab5\\U0a9a\\U0aa8",
    "\\U0a97\\U0ac1\\U0ab8\\U0acd\\U0ab8\\U0acb",
    "\\U0aa6\\U0abe\\U0ab5\\U0acb",
    "\\U0a96\\U0a82\\U0aa1"
]
for str in strings {
    // Each piece after a "\U" marker is one hexadecimal scalar value.
    let chars = str.components(separatedBy: "\\U")
    var string = ""
    for ch in chars {
        if let val = Int(ch, radix: 16), let uni = Unicode.Scalar(val) {
            string.unicodeScalars.append(uni)
        }
    }
    print(string)
}
You can map your array, split each element at non-hexadecimal-digit characters, compact-map the pieces into UInt32 values, initialize Unicode scalars from them, and then map each resulting array into a UnicodeScalarView and initialize a new string with it:
let arr = [
    #"\U0aac\U0ab9\U0ac1\U0ab5\U0a9a\U0aa8"#,
    #"\U0a97\U0ac1\U0ab8\U0acd\U0ab8\U0acb"#,
    #"\U0aa6\U0abe\U0ab5\U0acb"#,
    #"\U0a96\U0a82\U0aa1"#]
let strings = arr.map {
    $0.split { !$0.isHexDigit }
        .compactMap { UInt32($0, radix: 16) }
        .compactMap(Unicode.Scalar.init)
}.map { String(String.UnicodeScalarView($0)) }
print(strings)
This will print
["બહુવચન", "ગુસ્સો", "દાવો", "ખંડ"]
The strings that come back already contain the "\" character. To reproduce them in a literal and use components(separatedBy:), each "\" needs an additional escaping "\", so that you can do:
var listofCodes = ["\\U0aac\\U0ab9\\U0ac1\\U0ab5\\U0a9a\\U0aa8", "\\U0aac\\U0ab9\\U0ac1\\U0ab5\\U0a9a\\U0aa8"]
var unicodeArray: [String] = []
listofCodes.forEach { string in
    unicodeArray
        .append(contentsOf: string.components(separatedBy: "\\"))
    unicodeArray.removeAll(where: { value in value == "" })
}
print(unicodeArray)
I will revise this answer once you specify how you are obtaining these strings; as it stands, I get an invalid-string error from the start.

Getting specific range of characters from a string swift 3

I would like to get specific characters from my string, for example:
var str = "hey steve #steve123, you would love this!"
// Without doing this
var theCharactersIWantFromTheString = "#steve123"
How would I retrieve just #steve123 from str?
You can use regular expression "\\B\\#\\w+".
let pattern = "\\B\\#\\w+"
let sentence = "hey steve #steve123, you would love this!"
if let range = sentence.range(of: pattern, options: .regularExpression) {
    // Subscripting with the range avoids the deprecated substring(with:).
    print(sentence[range]) // "#steve123\n"
}
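If the string may contain more than one such token, the same pattern can be applied with NSRegularExpression to collect every match. This is only a sketch under that assumption: the function name hashtags(in:) and the second tag in the sample string are made up for illustration.

```swift
import Foundation

// Hypothetical helper: returns every "#word" token in `text`, in order.
func hashtags(in text: String) -> [String] {
    // Same pattern as in the answer above.
    guard let regex = try? NSRegularExpression(pattern: "\\B\\#\\w+") else { return [] }
    // NSRange counts UTF-16 code units, so build it from the string itself.
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, range: whole).compactMap { match in
        // Convert the NSRange back to String indices before slicing.
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

print(hashtags(in: "hey steve #steve123, you would love this! #swift"))
// ["#steve123", "#swift"]
```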

Refactored Solution In Swift

I've been studying for a coding exam by doing the HackerRank test cases. For the most part I've been doing well, but I get hung up on some easy cases, and you all help me when I can't see the solution. I'm working on this problem:
https://www.hackerrank.com/challenges/ctci-ransom-note
A kidnapper wrote a ransom note but is worried it will be traced back to him. He found a magazine and wants to know if he can cut out whole words from it and use them to create an untraceable replica of his ransom note. The words in his note are case-sensitive and he must use whole words available in the magazine, meaning he cannot use substrings or concatenation to create the words he needs.
Given the words in the magazine and the words in the ransom note, print Yes if he can replicate his ransom note exactly using whole words from the magazine; otherwise, print No.
Input Format
The first line contains two space-separated integers describing, respectively, the number of words in the magazine and the number of words in the ransom note.
The second line contains space-separated strings denoting the words present in the magazine.
The third line contains space-separated strings denoting the words present in the ransom note.
Each word consists of English alphabetic letters (i.e., a to z and A to Z).
The words in the note and magazine are case-sensitive.
Output Format
Print Yes if he can use the magazine to create an untraceable replica of his ransom note; otherwise, print No.
Sample Input
6 4
give me one grand today night
give one grand today
Sample Output
Yes
Explanation
All four words needed to write an untraceable replica of the ransom note are present in the magazine, so we print Yes as our answer.
And here is my solution:
import Foundation
func main() -> String {
    let v = readLine()!.components(separatedBy: " ").map{Int($0)!}
    var a = [String](); var b = [String]()
    if v[0] < v[1] { return "No"}
    for i in 0 ..< 2 {
        if i == 0 {
            a = (readLine()!).components(separatedBy: " ")
        } else { b = (readLine()!).components(separatedBy: " ") }
    }
    // Get list of elements that intersect in each array
    let filtered = Set(a).intersection(Set(b))
    // Map set to set of Boolean where true means set a has enough words to satisfy set b's needs
    let checkB = filtered.map{ word in reduceSet(b, word: word) <= reduceSet(a, word: word) }
    // If mapped set does not contain false, answer is Yes, else No
    return !checkB.contains(false) ? "Yes" : "No"
}
func reduceSet(_ a: [String], word: String) -> Int {
    return (a.reduce(0){ $0 + ($1 == word ? 1 : 0)})
}
print(main())
I always time out on three of the 20 test-cases with this solution. So the solution seems to solve all the test cases, but not within their required time constraints. These are great practice, but it's so extremely frustrating when you get stuck like this.
I should note that I use Sets and the Set(a).intersection(Set(b)) because when I tried mapping an array of Strings, half the test-cases timed out.
Any cleaner, or more efficient solutions will be greatly appreciated! Thank you!
Thanks to @Alexander, I was able to solve this issue using NSCountedSet instead of my custom reduce method. It's much cleaner and more efficient. Here is the solution:
import Foundation
func main() -> String {
    let v = readLine()!.components(separatedBy: " ").map{Int($0)!}
    var a = [String](); var b = [String]()
    if v[0] < v[1] { return "No"}
    for i in 0 ..< 2 {
        if i == 0 {
            a = (readLine()!).components(separatedBy: " ")
        } else { b = (readLine()!).components(separatedBy: " ") }
    }
    let countA = NSCountedSet(array: a)
    let countB = NSCountedSet(array: b)
    let intersect = Set(a).intersection(Set(b))
    let check = intersect.map{ countB.count(for: $0) <= countA.count(for: $0) }
    return !check.contains(false) ? "Yes" : "No"
}
print(main())
Many thanks!
I took the liberty of making some improvements to your code. I added comments to explain the changes:
import Foundation
func main() -> String {
    // Give more meaningful variable names
    let firstLine = readLine()!.components(separatedBy: " ").map{Int($0)!}
    let (magazineWordCount, ransomNoteWordCount) = (firstLine[0], firstLine[1])
    // a guard reads more like an assertion, stating the affirmative, as opposed to denying the negation.
    // it also
    // Use >= so that a magazine with exactly as many words as the note is still accepted.
    guard magazineWordCount >= ransomNoteWordCount else { return "No" }
    // Don't use a for loop if it only does 2 iterations, which are themselves hardcoded in.
    // Just write the statements in order.
    let magazineWords = readLine()!.components(separatedBy: " ")
    let ransomNoteWords = readLine()!.components(separatedBy: " ") //You don't need ( ) around readLine()!
    let magazineWordCounts = NSCountedSet(array: magazineWords)
    let ransomNoteWordCounts = NSCountedSet(array: ransomNoteWords)
    // intersect is a verb. you're looking for the noun, "intersection"
    // let intersection = Set(a).intersection(Set(b))
    // let check = intersect.map{ countB.count(for: $0) <= countA.count(for: $0) }
    // You don't actually care for the intersection of the two sets.
    // You only need to worry about exactly the set of words that
    // exists in the ransom note. Just check them directly.
    let hasWordWithShortage = ransomNoteWordCounts.contains(where: { word in
        magazineWordCounts.count(for: word) < ransomNoteWordCounts.count(for: word)
    })
    // Don't negate the condition of a conditional expression. Just flip the order of the last 2 operands.
    return hasWordWithShortage ? "No" : "Yes"
}
print(main())
With the comments removed:
import Foundation
func main() -> String {
    let firstLine = readLine()!.components(separatedBy: " ").map{Int($0)!}
    let (magazineWordCount, ransomNoteWordCount) = (firstLine[0], firstLine[1])
    guard magazineWordCount >= ransomNoteWordCount else { return "No" }
    let magazineWords = readLine()!.components(separatedBy: " ")
    let ransomNoteWords = readLine()!.components(separatedBy: " ")
    let magazineWordCounts = NSCountedSet(array: magazineWords)
    let ransomNoteWordCounts = NSCountedSet(array: ransomNoteWords)
    let hasWordWithShortage = ransomNoteWordCounts.contains{ word in
        magazineWordCounts.count(for: word) < ransomNoteWordCounts.count(for: word)
    }
    return hasWordWithShortage ? "No" : "Yes"
}
print(main())
It's simpler, and much easier to follow. :)
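For comparison, the same counting idea can be sketched without NSCountedSet, using a plain Swift dictionary. This is an illustration, not part of either answer: the function name canWriteNote and its parameters are made up here, and reading the input with readLine() is left out.

```swift
// Hypothetical helper: true if every word of `note` can be cut from `magazine`.
func canWriteNote(magazine: [String], note: [String]) -> Bool {
    // A note longer than the magazine can never be assembled.
    guard magazine.count >= note.count else { return false }
    // Count how many copies of each word the magazine offers.
    var available: [String: Int] = [:]
    for word in magazine {
        available[word, default: 0] += 1
    }
    // Consume one copy per note word; fail on the first shortage.
    for word in note {
        guard let count = available[word], count > 0 else { return false }
        available[word] = count - 1
    }
    return true
}

let magazine = "give me one grand today night".split(separator: " ").map(String.init)
let note = "give one grand today".split(separator: " ").map(String.init)
print(canWriteNote(magazine: magazine, note: note) ? "Yes" : "No")
// Yes
```

Both loops are linear in the number of words, and the check stops at the first missing word.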

componentsSeparatedByString by multiple separators in Swift

So here is the string s:
"Hi! How are you? I'm fine. It is 6 p.m. Thank you! That's it."
I want them to be separated into an array as:
["Hi", "How are you", "I'm fine", "It is 6 p.m", "Thank you", "That's it"]
Which means the separators should be ". " + "? " + "! "
I've tried:
let charSet = NSCharacterSet(charactersInString: ".?!")
let array = s.componentsSeparatedByCharactersInSet(charSet)
But it will separate p.m. to two elements too. Result:
["Hi", " How are you", " I'm fine", " It is 6 p", "m", " Thank you", " That's it"]
I've also tried
let array = s.componentsSeparatedByString(". ")
It works well for separating on ". ", but if I also want to separate on "? " and "! ", it becomes messy.
So is there any way I can do it? Thanks!
There is a method provided that lets you enumerate a string. You can do so by words or sentences or other options. No need for regular expressions.
let s = "Hi! How are you? I'm fine. It is 6 p.m. Thank you! That's it."
var sentences = [String]()
s.enumerateSubstringsInRange(s.startIndex..<s.endIndex, options: .BySentences) {
    substring, substringRange, enclosingRange, stop in
    sentences.append(substring!)
}
print(sentences)
The result is:
["Hi! ", "How are you? ", "I\'m fine. ", "It is 6 p.m. ", "Thank you! ", "That\'s it."]
rmaddy's answer is correct (+1). A Swift 3 implementation is:
var sentences = [String]()
string.enumerateSubstrings(in: string.startIndex ..< string.endIndex, options: .bySentences) { substring, substringRange, enclosingRange, stop in
    sentences.append(substring!)
}
You can also use a regular expression via NSRegularExpression, though it's much hairier than rmaddy's .bySentences solution. In Swift 3:
var sentences = [String]()
let regex = try! NSRegularExpression(pattern: "(^|\\s+)(\\w.*?[.!?]+)(?=(\\s+|$))")
// NSRange counts UTF-16 code units, so use utf16.count rather than the character count.
regex.enumerateMatches(in: string, range: NSMakeRange(0, string.utf16.count)) { match, flags, stop in
    sentences.append((string as NSString).substring(with: match!.rangeAt(2)))
}
Or Swift 2:
let regex = try! NSRegularExpression(pattern: "(^|\\s+)(\\w.*?[.!?]+)(?=(\\s+|$))", options: [])
var sentences = [String]()
// NSRange counts UTF-16 code units, so use utf16.count rather than the character count.
regex.enumerateMatchesInString(string, options: [], range: NSMakeRange(0, string.utf16.count)) { match, flags, stop in
    sentences.append((string as NSString).substringWithRange(match!.rangeAtIndex(2)))
}
The [.!?] syntax matches any of those three characters. The | means "or". The ^ matches the start of the string. The $ matches the end of the string. The \\s matches a whitespace character. The \\w matches a "word" character. The * matches zero or more of the preceding character. The + matches one or more of the preceding character. The (?=) is a look-ahead assertion (e.g. see if there's something there, but don't advance through that match).
I've tried to simplify this a bit, and it's still pretty complicated. Regular expressions offer rich text pattern matching but are, admittedly, a little dense when you first use them. This rendition also handles (a) repeated punctuation (e.g. "Thank you!!!"), (b) leading spaces, and (c) trailing spaces.
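For readers on a current toolchain, the same pattern can be wrapped as follows. This is a sketch assuming Swift 5, not the answerer's code: the function name sentences(in:) is made up, and the NSRange is built from the string itself so that the UTF-16 offsets line up.

```swift
import Foundation

// Hypothetical helper wrapping the pattern discussed above.
func sentences(in string: String) -> [String] {
    let pattern = "(^|\\s+)(\\w.*?[.!?]+)(?=(\\s+|$))"
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    // NSRange counts UTF-16 code units, so derive it from the string.
    let whole = NSRange(string.startIndex..<string.endIndex, in: string)
    return regex.matches(in: string, range: whole).compactMap { match in
        // Capture group 2 holds the sentence without the leading whitespace.
        Range(match.range(at: 2), in: string).map { String(string[$0]) }
    }
}

let found = sentences(in: "Hi! How are you? I'm fine. It is 6 p.m. Thank you! That's it.")
print(found)
// ["Hi!", "How are you?", "I'm fine.", "It is 6 p.m.", "Thank you!", "That's it."]
```

The look-ahead is what keeps "p.m." together: the first "." is followed by "m" rather than whitespace, so the lazy match keeps extending.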
If the splitting basis is something a little more esoteric than sentences, this extension could work.
extension String {
    public func components(separatedBy separators: [String]) -> [String] {
        var output: [String] = [self]
        // Split every piece again on each separator in turn.
        for separator in separators {
            output = output.flatMap { $0.components(separatedBy: separator) }
        }
        return output.map { $0.trimmingCharacters(in: .whitespaces)}
    }
}
let artists = "Rihanna, featuring Calvin Harris".components(separatedBy: [", with", ", featuring"])
I tried to find a regex to solve this too: (([^.!?]+\s)*\S+(\.|!|\?))
Here is the explanation from regexper, and an example.
Well I've found a regex too from here
var pattern = "(?<=[.?!;…])\\s+(?=[\\p{Lu}\\p{N}])"
let s = "Hi! How are you? I'm fine. It is 6 p.m. Thank you! That's it."
let sReplaced = s.stringByReplacingOccurrencesOfString(pattern, withString:"[*-SENTENCE-*]" as String, options:NSStringCompareOptions.RegularExpressionSearch, range:nil)
let array = sReplaced.componentsSeparatedByString("[*-SENTENCE-*]")
Perhaps it's not a good way, as it has to first replace and then separate the string. :)
UPDATE:
For the regex part, if you also want to match Chinese/Japanese punctuation (where a space after each punctuation mark is not necessary), you can use the following one:
((?<=[.?!;…])\\s+|(?<=[。！？；…])\\s*)(?=[\\p{L}\\p{N}])
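The replace-then-split detour above can be avoided by walking the separator matches directly and slicing the text between them. The following is a sketch assuming Swift 5 and the first (Latin-punctuation) pattern; the function name pieces(of:separatedByPattern:) is hypothetical.

```swift
import Foundation

// Hypothetical helper: splits `text` wherever `pattern` matches, dropping the matches.
func pieces(of text: String, separatedByPattern pattern: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [text] }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    var result: [String] = []
    var cursor = text.startIndex
    for match in regex.matches(in: text, range: whole) {
        guard let separator = Range(match.range, in: text) else { continue }
        // Everything between the previous separator and this one is one piece.
        result.append(String(text[cursor..<separator.lowerBound]))
        cursor = separator.upperBound
    }
    // The tail after the last separator is the final piece.
    result.append(String(text[cursor...]))
    return result
}

let s = "Hi! How are you? I'm fine. It is 6 p.m. Thank you! That's it."
let split = pieces(of: s, separatedByPattern: "(?<=[.?!;…])\\s+(?=[\\p{Lu}\\p{N}])")
print(split)
// ["Hi!", "How are you?", "I'm fine.", "It is 6 p.m.", "Thank you!", "That's it."]
```

Unlike the desired output in the question, the terminators are kept, because the pattern only consumes the whitespace between sentences.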

Swift sort NSArray

I have a problem with my sorting algorithm.
My NSArray (here vcd.signals.key) contains values of Strings for example:
"x [0]", "x [18]", "x [15]", "x [1]"...
When I try to sort this the result ends up in
"x [0]", "x [15]", "x [18]", "x [1]"
instead of:
"x [0]", "x [1]", "x [15]", "x [18]"
This is my code:
let sortedKeys = sorted(vcd.signals.keys) {
    var val1 = $0 as! String
    var val2 = $1 as! String
    return val1 < val2
}
Any idea how I can fix this issue?
Your problem comes from the comparison itself. For example, see what happens when you compare the following two strings:
println("x [15]" < "x [1]") // true
This is because the default lexicographic comparison goes character by character, and at the first position where the two strings differ, "5" is less than "]":
println("5" < "]") // true
Because of this, you need to create your own comparator that compares only the numbers inside the brackets. To achieve this, I use a regular expression to match the numbers, in the following way:
func matchesForRegexInText(regex: String!, text: String!) -> [String] {
    let regex = NSRegularExpression(pattern: regex,
        options: nil, error: nil)!
    let nsString = text as NSString
    let results = regex.matchesInString(text,
        options: nil, range: NSMakeRange(0, nsString.length))
        as! [NSTextCheckingResult]
    return map(results) { nsString.substringWithRange($0.range)}
}
var keysSorted = keys.sorted() {
    var key1 = $0
    var key2 = $1
    var pattern = "([0-9]+)"
    var m1 = self.matchesForRegexInText(pattern, text: key1)
    var m2 = self.matchesForRegexInText(pattern, text: key2)
    // Compare as integers; comparing the matched strings would put "15" before "2".
    return m1[0].toInt()! < m2[0].toInt()!
}
In the above regular expression I assume that numbers only appear inside the brackets, so it simply matches any number in the String; feel free to change the regular expression if you need something more specific. You then get the following:
println(keysSorted) // [x [0], x [1], x [15], x [18]]
I hope this helps.
The issue you are running into is that the closing bracket character ']' sorts after the digits, so "18" is less than "1]". As long as all your strings share the form "[digits]", you can remove the closing bracket, sort the strings, and add the closing bracket back in the final array. Note that the comparison is still character by character, so "x [2]" would still land after "x [18]". The code below works for Swift 2:
let arr = ["x [0]", "x [18]", "x [15]", "x [1]"]
let sorted = arr.map { $0.substringToIndex($0.endIndex.predecessor()) }.sort().map { $0 + "]" }
print(sorted)
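On a current toolchain, one way to sidestep the character-by-character comparison entirely is to parse the number between the brackets and sort on that. This is a sketch assuming Swift 5 and keys of the form shown in the question; the helper name bracketNumber and the extra "x [2]" key are made up for illustration.

```swift
// Hypothetical helper: the integer between "[" and "]", or nil if there is none.
func bracketNumber(_ key: String) -> Int? {
    guard let open = key.firstIndex(of: "["),
          let close = key.lastIndex(of: "]"),
          open < close else { return nil }
    return Int(key[key.index(after: open)..<close])
}

let keys = ["x [0]", "x [18]", "x [15]", "x [1]", "x [2]"]
// Keys without a number sort last (an arbitrary choice for this sketch).
let sortedKeys = keys.sorted {
    (bracketNumber($0) ?? Int.max) < (bracketNumber($1) ?? Int.max)
}
print(sortedKeys)
// ["x [0]", "x [1]", "x [2]", "x [15]", "x [18]"]
```

Foundation's compare(_:options:) with the .numeric option is another route to the same ordering when the prefixes before the brackets also matter.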

Resources