Extremely slow parsing text with Swift - ios

I'm trying to parse a text document of about 46k characters and it takes forever. Here's what I do:
for i in 0..<html.length() - SEARCH_START.length() {
if html.substring(i, end: i+SEARCH_START.length()) == SEARCH_START {
start = i + SEARCH_START.length();
break;
}
if i % 1000 == 0 {
NSLog("i = \(i)")
}
}
extension String {
public func length () -> Int {
return self.characters.count
}
public func substring(_ start : Int, end : Int) -> String {
if self.characters.count <= 0 {
return ""
}
let realEnd = end>0 ? end : 0
return self.substring(with: self.index(self.startIndex, offsetBy: start)..<self.index(self.startIndex, offsetBy: realEnd))
}
}
Sorry, I had to extend the String class to reduce the amount of rewriting from Android.
The log is triggered every 6.5 seconds for each additional thousand characters, which means almost 5 minutes to get to the end. The whole process should take milliseconds. What's the deal? Is there any way to speed it up?

Your Int indexing extension is the problem. To get a substring at position n, it needs to go through all the characters 0..n. Therefore your algorithm has O(n^2) (quadratic) complexity instead of the expected O(n) (linear) complexity.
Don't use that extension.
To search for a substring, there is a native method
if let range = html.range(of: SEARCH_START) {
let integerIndex = html.distance(from: html.startIndex, to: range.upperBound)
print(integerIndex)
}
If you really want to work with integers, you should convert your string to an array of characters first:
let chars = Array(html.characters)
and work with subarrays instead of substrings.
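As a rough sketch of the array approach, assuming Swift 4 or later (where Array(html) replaces Array(html.characters)): the function name offsetAfterFirstMatch is made up for illustration and is not part of any library.

```swift
// Hypothetical helper: returns the Int offset just past the first occurrence
// of `needle` in `haystack`, or nil if there is no match.
func offsetAfterFirstMatch(of needle: String, in haystack: String) -> Int? {
    let text = Array(haystack)      // one O(n) conversion to [Character]
    let pattern = Array(needle)
    guard !pattern.isEmpty, pattern.count <= text.count else { return nil }

    for i in 0...(text.count - pattern.count) {
        // Array subscripting is O(1), so each comparison costs at most
        // pattern.count steps, independent of i.
        if text[i..<(i + pattern.count)].elementsEqual(pattern) {
            return i + pattern.count
        }
    }
    return nil
}
```

The conversion is paid once up front, after which integer indexing behaves the way it does on Android.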
Edit:
To better understand what happens in your extension:
self.substring(with: self.index(self.startIndex, offsetBy: start)..<self.index(self.startIndex, offsetBy: realEnd))
In Java, where String is an array and supports random indexing this would be a constant (fast) operation. However, in Swift this is composed from 3 steps:
self.index(self.startIndex, offsetBy: start) iterates from the first character until it finds the character at index start.
self.index(self.startIndex, offsetBy: realEnd) iterates from the first character until it finds the character at index realEnd.
Gets the substring (fast)
In short, for every substring at start position n, the algorithm has to iterate over 2n characters. To get a single substring at index 20000, you need 40000 operations!
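If you want to keep a manual loop, one way to avoid the repeated walk from startIndex is to carry a String.Index along and advance it by one step per iteration. This is only a sketch, assuming Swift 4 or later; startOffset(after:in:) is a hypothetical name.

```swift
// Hypothetical helper: scans `text` once and returns the Int offset just past
// the first occurrence of `marker`, or nil if the marker does not occur.
func startOffset(after marker: String, in text: String) -> Int? {
    var index = text.startIndex
    var offset = 0
    while index < text.endIndex {
        // Compare against the remaining text without re-deriving the index
        // from an integer offset.
        if text[index...].hasPrefix(marker) {
            return offset + marker.count
        }
        index = text.index(after: index)   // a single step, not a walk from the start
        offset += 1
    }
    return nil
}
```

Each iteration now does work proportional to the marker length instead of to the current position.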

Related

How Can I Construct an Efficient CoreData Search, Including Allowing For Preceding and Trailing Characters Here?

Based on straight SQL searches in a previous app, I am adding CoreData searching to a new app. These searches are in a custom dictionary db that the app contains; this function does the work:
public func wordMatcher (pad: Int, word: Array<String>, substitutes : Set<String> ) {
let context = CoreDataManager.shared.persistentContainer.viewContext
var query: Array<String>
var foundPositions : Set<Int> = []
var searchTerms : Array<String> = []
if word.count >= 4 {
for i in 0..<word.count {
for letter in substitutes {
query = word
query[i] = letter
searchTerms.append(query.joined())
let rq: NSFetchRequest<Word> = Word.fetchRequest()
rq.predicate = NSPredicate(format: "name LIKE %@", query.joined())
rq.fetchLimit = 1
do {
if try context.fetch(rq).count != 0 {
foundPositions.insert(i)
break
}
} catch {
}
}
// do aggregated searchTerms search here instead of individual searches?
}
}
}
The NSFetchRequest focuses on one permutation at a time. But I'm accumulating the search string fragments in the array searchTerms because I don't know if it would be more efficient to construct a single query connected with ORs, and I also don't know how to do that in CoreData.
The focus is on the positions in the original term word: I need to indicate if any given location has at least one of the substitutes as a valid fit. So to implement the aggregate searchTerms approach, a FetchRequest would have to happen for each location in the base term.
A second complication is the one referred to in the title of the question. I am using LIKE because the search term in the FetchRequest could be a substring in a longer word. However, the maximum number of letters is 11, and pad is the starting point of the original term in that field of 11 spaces.
So if pad is 3, then I would need to allow for 0..<pad preceding characters. And because there may be trailing characters, I would also want results with 0..<(11 - (pad + word.count)) alphabetic characters after the last letter in the search term.
Regex seems like one way to do this, but I haven't found a clear example of how to do this in this case, and especially with the multiple search terms (if that's the way to go). The limits of SQLite in the previous version forced constructing multiple queries with increasing numbers of "_" underscores to indicate the padding characters; that tended to really explode the number of queries.
BTW, substitutes is limited to an absolute maximum of 9 values, and in practice is usually below 5, so things are a little more manageable.
I would like to get a grip on this, and so if anyone can provide direction or examples that can make this a reasonably efficient function, the help is appreciated greatly.
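For the "single query connected with ORs" idea, Foundation's NSCompoundPredicate can combine one sub-predicate per search term. The sketch below is an assumption about how that could look, not a tested Core Data solution: anyTermPredicate is a hypothetical helper, and the key "name" is taken from the fetch request above.

```swift
import Foundation

// Hypothetical helper: builds one predicate that is true when the attribute
// `key` is LIKE any of `terms`.
func anyTermPredicate(key: String, terms: [String]) -> NSPredicate {
    let parts = terms.map { NSPredicate(format: "%K LIKE %@", key, $0) }
    return NSCompoundPredicate(orPredicateWithSubpredicates: parts)
}
```

Assigning the result to rq.predicate once per position (with the terms for that position) would replace the inner loop of fetches by a single fetch. Whether that is actually faster depends on the store and the data, so it would need measuring.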
EDIT:
I've realized that I need a result for each position in the target string, with cases where the leading and trailing spaces also may need to contain a substitute as well.
So I'm moving to this:
public func wordMatcher (pad: Int, word: Array<String>, substitutes : Set<String> ) {
let context = CoreDataManager.shared.persistentContainer.viewContext
var pad_ = pad
var query: Array<String>
var foundPositions : Set<Int> = []
let rq: NSFetchRequest<Word> = Word.fetchRequest()
rq.fetchLimit = 1
let subs = "[\(substitutes.joined())]"
// if word.count >= 4 { // because those locations will be blocked off anyway otherwise
let start = pad > 0 ? -1 : 0
let finish = 11 - (pad + word.count) > 0 ? word.count + 1 : word.count
for i in start..<finish {
query = word
var _pad = 11 - (pad + word.count)
if i == -1 {
query = Array(arrayLiteral: subs) + query
pad_ -= 1
} else if i >= word.count {
query.append(subs)
_pad -= 1
} else {
pad_ = pad
query[i] = subs
}
let endPad = _pad > 0 ? "{0,\(_pad)}" : ""
let predMatch = ".\(query.joined())\(endPad)"
print(predMatch)
rq.predicate = NSPredicate(format:"position <= %@ AND word MATCHES %@", pad_, predMatch)
do {
if try context.fetch(rq).count != 0 {
foundPositions.insert(i)
}
} catch {
}
// }
}
lFreq = foundPositions
}
This relies on a regex substitution, inserted into the original target string. What I'll have to find out is if this is fast enough at the edge cases, but it may not be critical even in the worst case.
predMatch will end up looking something like "ab[xyx]d{0,3}", and I think I can get rid of the position section by changing it to be "{0,2}ab[xyx]d{0,3}". But I guess I'm going to have to try to find out.
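One way to build such a pattern with explicit leading and trailing padding is sketched below. It rests on several assumptions that are not in the code above: the padding characters are lowercase ASCII letters ([a-z]), up to pad leading letters are allowed, and the helper name paddedPattern and its fieldWidth parameter are made up for illustration.

```swift
// Hypothetical helper: replaces the letter at `position` with a character
// class of the substitutes and allows optional padding letters on both sides.
func paddedPattern(pad: Int, word: [String], substitutes: Set<String>,
                   position: Int, fieldWidth: Int = 11) -> String {
    precondition(word.indices.contains(position), "position must be inside word")
    var query = word
    // Sorted only so that the generated pattern is deterministic.
    query[position] = "[\(substitutes.sorted().joined())]"

    let trailing = max(0, fieldWidth - (pad + word.count))
    let lead = pad > 0 ? "[a-z]{0,\(pad)}" : ""
    let tail = trailing > 0 ? "[a-z]{0,\(trailing)}" : ""
    return lead + query.joined() + tail
}
```

For pad 2, the word a-b-c-d, substitutes x and y, and position 2, this produces "[a-z]{0,2}ab[xy]d[a-z]{0,5}", which could then be passed to a MATCHES predicate.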

Swift3 Random Extension Method

I was using this extension method to generate a random number:
func Rand(_ range: Range<UInt32>) -> Int {
return Int(range.lowerBound + arc4random_uniform(range.upperBound - range.lowerBound + 1))
}
I liked it b/c it was no nonsense, you just called it like this:
let test = Rand(1...5) //generates a random number between 1 and 5
I honestly don't know why things need to be so complicated in Swift but I digress..
So I'm receiving an error now in Swift 3:
No '...' candidates produce the expected contextual result type 'Range<UInt32>'
Would anyone know what this means or how I could get my awesome Rand function working again? I guess x...y no longer creates Ranges or x..y must be explicitly defined as UInt32? Any advice for me to make things a tad easier?
Thanks so much, appreciate your time!
In Swift 3 there are four Range structures:
"x" ..< "y" ⇒ Range<T>
"x" ... "y" ⇒ ClosedRange<T>
1 ..< 5 ⇒ CountableRange<T>
1 ... 5 ⇒ CountableClosedRange<T>
(The operators ..< and ... are overloaded so that if the elements are stridable (random-access iterators e.g. numbers and pointers), a Countable Range will be returned. But these operators can still return plain Ranges to satisfy the type checker.)
Since Range and ClosedRange are different structures, you cannot implicitly convert between them, and thus the error.
If you want Rand to accept a ClosedRange as well as Range, you must overload it:
// accepts Rand(0 ..< 5)
func Rand(_ range: Range<UInt32>) -> Int {
return Int(range.lowerBound + arc4random_uniform(range.upperBound - range.lowerBound))
}
// accepts Rand(1 ... 5)
func Rand(_ range: ClosedRange<UInt32>) -> Int {
return Int(range.lowerBound + arc4random_uniform(range.upperBound + 1 - range.lowerBound))
}
A nice solution is presented in Generic Range Algorithms
(based on How to be DRY on ranges and closed ranges? in the swift-users mailing list).
It uses the fact that both CountableRange and CountableClosedRange
are collections, and in fact a RandomAccessCollection.
So you can define a single (generic) function which accepts both open and closed
integer ranges:
func rand<C: RandomAccessCollection>(_ coll: C) -> C.Iterator.Element {
precondition(coll.count > 0, "Cannot select random element from empty collection")
let offset = arc4random_uniform(numericCast(coll.count))
let idx = coll.index(coll.startIndex, offsetBy: numericCast(offset))
return coll[idx]
}
rand(1...5) // random number between 1 and 5
rand(2..<10) // random number between 2 and 9
but also:
rand(["a", "b", "c", "d"]) // random element from the array
Alternatively as a protocol extension method:
extension RandomAccessCollection {
func rand() -> Iterator.Element {
precondition(count > 0, "Cannot select random element from empty collection")
let offset = arc4random_uniform(numericCast(count))
let idx = index(startIndex, offsetBy: numericCast(offset))
return self[idx]
}
}
(1...5).rand()
(2..<10).rand()
["a", "b", "c", "d"].rand()
You could rewrite Rand() to use Int if that is your primary use case:
func Rand(_ range: ClosedRange<Int>) -> Int {
    // ClosedRange so that Rand(1...5) type-checks; +1 because both bounds are valid results
    let distance = UInt32(range.upperBound - range.lowerBound)
    return range.lowerBound + Int(arc4random_uniform(distance + 1))
}
Or as kennytm points out, use Rand(1..<6)
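For readers on Swift 4.2 or later, the standard library has its own random API, so arc4random_uniform is no longer needed for this. A minimal sketch, with randomValue(in:) as a hypothetical wrapper name:

```swift
// Assumes Swift 4.2+, where Int.random(in:) exists for both range kinds.
func randomValue(in range: ClosedRange<Int>) -> Int {
    return Int.random(in: range)
}

func randomValue(in range: Range<Int>) -> Int {
    precondition(!range.isEmpty, "Cannot pick a random value from an empty range")
    return Int.random(in: range)
}
```

Both randomValue(in: 1...5) and randomValue(in: 1..<6) then return a value between 1 and 5, and calling Int.random(in: 1...5) directly works just as well.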

fastest indexOf function for strings

I am currently using following extension for a string to get the index for a specific string in a big string:
func indexOf(target: String, startIndex: Int) -> Int
{
var startRange = advance(self.startIndex, startIndex)
var range = self.rangeOfString(target, options: NSStringCompareOptions.LiteralSearch, range: Range<String.Index>(start: startRange, end: self.endIndex))
if let range = range {
return distance(self.startIndex, range.startIndex)
} else {
return -1
}
}
I am calling this many times and I have a performance issue.
Does anyone have an idea how to do the indexOf() faster ?
Currently I am doing this in swift. Will doing this in Objective-C and bridging give a better performance ? Or probably if possible include any C Code ? Any ideas ?
UPDATE more about the Background
I have a long text, say with 5000 characters.
The text contains several metadata tags besides normal text. These tags look like {{blabl{{ sdasdg }} abla}} ; [[bla bla|blabla]] ; {|bla|}.
I would like to remove them or format them in a specific way.
I can't use regular expressions for this, because regular expressions do not support nested expressions ({{ {{ {{ {{dsgasdg}} }}}} }} )
So I wrote my own functions, which work, but are very slow.
What I am actually doing is going through the text and simply searching for these tags. For this I need a base function like the following, to determine which tag comes first and at which position. When I have found a tag I go on to the next, and so on. I noticed that this is the most time-consuming part of all. Of course I am also calling it many times.
func getStart(sText:String, alSearchPatterns:[String], ifrom:Int) -> (Pattern:String, index:Int) {
var bweiter:Bool=true;
var actualcharacter:Character;
var returnPattern="";
var returnIndex = -1;
println("ifrom : " + String(ifrom));
var indexfound:Int = -1;
// find the first character of the patterns
var bsuchepattern=true;
for(var i=0;i<alSearchPatterns.count && bsuchepattern;i++){
let sPattern=alSearchPatterns[i];
let pattern_first_char=sPattern[0];
//let pattern_first_char=String(sPattern[0]);
let characterIndex = sText.indexOfCharacter(pattern_first_char, fromIndex: ifrom); // find(sText, pattern_first_char);
//let characterIndex = sText.indexOf(pattern_first_char, startIndex: ivon);
if(characterIndex != -1){
if((indexfound == -1) || characterIndex < indexfound){
// found something that is first of all actually.
let patternlength=sPattern.length;
let substring_in_text=sText.substring(characterIndex, endIndex: characterIndex + patternlength);
if(substring_in_text.equals(sPattern)){
returnPattern=sPattern;
returnIndex=characterIndex;
}
}
}
}
return (returnPattern,returnIndex);
}
Any hints on how to make this more performant, or on how to do this better in general?
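One possible direction, sketched here only for the {{ ... }} tags and assuming Swift 4 or later, is to walk the text once and track the nesting depth instead of searching repeatedly from an integer offset. The function name strippingDoubleBraceTags is hypothetical, and the other tag kinds ([[ ]] and {| |}) would need the same treatment.

```swift
// Hypothetical helper: removes every {{ ... }} block, including nested ones,
// in a single pass over the characters.
func strippingDoubleBraceTags(from text: String) -> String {
    let chars = Array(text)        // O(1) integer indexing from here on
    var result = ""
    var depth = 0                  // how many {{ are currently open
    var i = 0
    while i < chars.count {
        let hasNext = i + 1 < chars.count
        if hasNext && chars[i] == "{" && chars[i + 1] == "{" {
            depth += 1
            i += 2
        } else if hasNext && depth > 0 && chars[i] == "}" && chars[i + 1] == "}" {
            depth -= 1
            i += 2
        } else {
            // Keep characters only while outside of every tag.
            if depth == 0 { result.append(chars[i]) }
            i += 1
        }
    }
    return result
}
```

Because every character is visited once, the cost grows linearly with the text length rather than with the number of searches.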

How to add a character at a particular index in string in Swift

I have a string like this in Swift:
var stringts:String = "3022513240"
If I want to change it to something like this: "(302)-251-3240", I want to add the parentheses at index 0. How do I do it?
In Objective-C, it is done this way:
NSMutableString *stringts = [NSMutableString stringWithString:@"3022513240"];
[stringts insertString:@"(" atIndex:0];
How to do it in Swift?
Swift 3
Use the native Swift approach:
// Each comment shows the result of applying that insert to a fresh "hello"
var welcome = "hello"
welcome.insert("!", at: welcome.endIndex) // hello!
welcome.insert("!", at: welcome.startIndex) // !hello
welcome.insert("!", at: welcome.index(before: welcome.endIndex)) // hell!o
welcome.insert("!", at: welcome.index(after: welcome.startIndex)) // h!ello
welcome.insert("!", at: welcome.index(welcome.startIndex, offsetBy: 3)) // hel!lo
If you are interested in learning more about Strings and performance, take a look at @Thomas Deniau's answer down below.
If you are declaring it as NSMutableString then it is possible and you can do it this way:
let str: NSMutableString = "3022513240)"
str.insert("(", at: 0)
print(str)
The output is :
(3022513240)
EDIT:
If you want to add at starting:
var str = "3022513240)"
str.insert("(", at: str.startIndex)
If you want to add character at last index:
str.insert("(", at: str.endIndex)
And if you want to add at specific index:
str.insert("(", at: str.index(str.startIndex, offsetBy: 2))
var myString = "hell"
let index = 4
let character = "o" as Character
myString.insert(
character, at:
myString.index(myString.startIndex, offsetBy: index)
)
print(myString) // "hello"
Careful: make sure that index is smaller than or equal to the size of the string, otherwise you'll get a crash.
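To guard against that crash, the standard library's index(_:offsetBy:limitedBy:) returns nil instead of trapping when the offset runs past the limit. A small sketch, assuming Swift 4 or later; insertSafely is a hypothetical name, not an existing String method.

```swift
extension String {
    // Hypothetical helper: inserts `character` at `offset` only when the offset
    // lies within 0...count, and reports whether the insert happened.
    @discardableResult
    mutating func insertSafely(_ character: Character, atOffset offset: Int) -> Bool {
        guard offset >= 0,
              let idx = index(startIndex, offsetBy: offset, limitedBy: endIndex) else {
            return false
        }
        insert(character, at: idx)
        return true
    }
}
```

With var s = "hell", calling s.insertSafely("o", atOffset: 4) appends the character, while an offset of 99 leaves the string untouched and returns false.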
Maybe this extension for Swift 4 will help:
extension String {
mutating func insert(string:String,ind:Int) {
self.insert(contentsOf: string, at:self.index(self.startIndex, offsetBy: ind) )
}
}
var phone = "+9945555555"
var indx = phone.index(phone.startIndex, offsetBy: 4)
phone.insert("-", at: indx)
// recompute the index after the first insert instead of reusing the old one
indx = phone.index(phone.startIndex, offsetBy: 7)
phone.insert("-", at: indx)
// +994-55-55555
To display a 10-digit phone number in US format (###) ###-#### (Swift 3):
func arrangeUSFormat(strPhone : String)-> String {
var strUpdated = strPhone
if strPhone.characters.count == 10 {
strUpdated.insert("(", at: strUpdated.startIndex)
strUpdated.insert(")", at: strUpdated.index(strUpdated.startIndex, offsetBy: 4))
strUpdated.insert(" ", at: strUpdated.index(strUpdated.startIndex, offsetBy: 5))
strUpdated.insert("-", at: strUpdated.index(strUpdated.startIndex, offsetBy: 9))
}
return strUpdated
}
You can't, because in Swift string indices (String.Index) are defined in terms of Unicode grapheme clusters, so that they handle all the Unicode stuff nicely. So you cannot construct a String.Index from an integer directly. You can use advance(theString.startIndex, 3) to walk over the clusters making up the string and compute the index at an offset of three clusters, but be careful: this is an O(N) operation.
In your case, it's probably easier to use a string replacement operation.
Check out this blog post for more details.
Swift 4.2 version of Dilmurat's answer (with code fixes)
extension String {
mutating func insert(string:String,ind:Int) {
self.insert(contentsOf: string, at:self.index(self.startIndex, offsetBy: ind) )
}
}
Note that the index must be computed against the string you are inserting into (self), not the string you are providing.
This did not work the same way in Swift 2, where String stopped being a collection. In Swift 3 / 4 a workaround is no longer necessary (String is a Collection again as of Swift 4), so use the native String and Collection methods:
var stringts:String = "3022513240"
let indexItem = stringts.index(stringts.endIndex, offsetBy: 0)
stringts.insert("0", at: indexItem)
print(stringts) // 30225132400
A simple way is to convert the String to an Array to get the benefit of integer indexing:
let input = Array(str)
If you would rather index into the String directly, without any conversion, you can add a subscript extension.
Here is the full code of the extension:
extension String {
subscript (_ index: Int) -> String {
get {
String(self[self.index(startIndex, offsetBy: index)])
}
set {
if index >= count {
insert(Character(newValue), at: self.index(self.startIndex, offsetBy: count))
} else {
insert(Character(newValue), at: self.index(self.startIndex, offsetBy: index))
}
}
}
}
Now you can read a single character from a string by its integer index, and assigning through the subscript inserts a character at that index:
var str = "car"
str[3] = "d"
print(str)
It’s a simple and useful way to work with Swift’s String access model.
You can then loop through the string as it is, without casting it to an Array.
Try it out and see if it helps!
Here is my answer: how to insert one string into another at a given position, using prefix and suffix.
let a = "manchester"
let b = "hello"
var c = ""
c = "\(a.prefix(a.description.count/2)) \(b) \(a.suffix(a.description.count/2))"
print(c)
output -
manch hello ester

NSRange in Strings having dialects

I am working on an app that takes input in the Tamil language. To find the range of a particular character in the string, I use the code below.
var range = originalWord.rangeOfString("\(character)")
println("\(range.location)")
This works fine except for some cases.
There are some characters like this -> í , ó . // just as an example.
Like this combination, other languages have several vowel diacritics.
Say I have this word: "alv`in"
// which is alvin, but with a diacritic on the "v".
If I print the Unicode values of these characters in Xcode, I get each code point. But for "v`" there are two Unicode values, even though it is considered a single character.
So if I look up this character with the code above, I get the following result, which causes errors in my program.
range.location // 2147483647, not a single digit. Why?
For other characters it just prints the correct Int value. // a single digit like "3"
Does anybody have any idea how to get this done? How can I achieve this if I use characters with diacritics?
Code given below:
// userInput = "இல்லம்"
var originalWord : NSString = ("இல்லம்")
var originalArray = Array("இல்லம்")
var userInputWord = Array(String(userInput))
// -------------------------------------------
for character in String(userInput)
{
switch character
{
case originalArray[0] :
// here matches first character of the userinput to the original word first character
// the character exists at the 0th index
var range = originalWord.rangeOfString("\(character)")
if range.location == 0
{
// same character in the same index
// correctValue increase by one (cow Value)
cowValue += 1
}
else
{
// same character but in the different index
// Wrong value increase by one (bull Value)
bullValue += 1
}
case originalArray[1] :
// here matches first character of the userinput to the original word first character
// the character exists at the 1th index
var range = originalWord.rangeOfString("\(character)")
println("\(range.location)") // here i get he long Int Value instead of single digit
if range.location == 1
{
// same character in the same index
// correctValue increase by one (cow Value)
cowValue += 1
}
else
{
// same character but in the different index
// Wrong value increase by one (bull Value)
bullValue += 1
}
You should use Swift strings instead of NSString, because Swift strings have
full Unicode support including composed character sequences, (extended) grapheme clusters etc.
For Swift strings, rangeOfString() returns an optional Range<String.Index>
which is a bit more complicated to handle. You can also use find() instead to
find the position of a character. This might help as a starting point:
var cowValue = 0
var bullValue = 0
let userInput = "இல்லம்"
let originalWord = "இல்லம்"
let originalArray = Array("இல்லம்")
for character in userInput {
switch character {
case originalArray[0] :
if let pos = find(originalWord, character) {
// Character found in string
println(pos)
if pos == originalWord.startIndex {
// At position 0
cowValue += 1
} else {
// At a different position
bullValue += 1
}
} else {
// Character not found in string
}
case originalArray[1] :
if let pos = find(originalWord, character) {
// Character found in string
println(pos)
if pos == advance(originalWord.startIndex, 1) {
// At position 1
cowValue += 1
} else {
// At a different position
bullValue += 1
}
} else {
// Character not found in string
}
default:
println("What ?")
}
}
Check out the documentation for NSString's rangeOfComposedCharacterSequenceAtIndex: and rangeOfComposedCharacterSequencesForRange:
You want to look for Composed Character Sequences, not individual characters.
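In current Swift (assuming 4.2 or later, where find() and advance() have been replaced), the position of a whole grapheme cluster can be looked up with firstIndex(of:) and converted to an integer offset with distance(from:to:). The helper name characterOffset is made up for this sketch.

```swift
// Hypothetical helper: returns the offset of `character` in `word`, counted in
// grapheme clusters (Swift Characters), or nil if it does not occur.
func characterOffset(of character: Character, in word: String) -> Int? {
    guard let idx = word.firstIndex(of: character) else { return nil }
    return word.distance(from: word.startIndex, to: idx)
}

// "v" followed by a combining grave accent forms one Character.
let sample = "alv\u{300}in"
let accentedV: Character = "v\u{300}"
```

Here characterOffset(of: accentedV, in: sample) is 2 and the offset of "i" is 3, because the base letter and its combining mark count as a single position. Looking up a plain "v" returns nil, since the string contains only the accented cluster.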
