Using Regular Expressions to Extract a Value in iOS Swift

I want to extract the Vimeo ID from a Vimeo URL. I have tried many solutions but have not found exactly what I want for Swift. I looked through many questions and found one solution in Java. I want the same behaviour in iOS Swift, so that I can extract the ID from the array of matched groups.
Using Regular Expressions to Extract a Value in Java
I use the following Vimeo URL regex, and I want group 3 if the string matches the regex:
"[http|https]+:\/\/(?:www\.|player\.)?vimeo\.com\/(?:channels\/(?:\w+\/)?|groups\/([^\/]*)\/videos\/|album\/(\d+)\/video\/|video\/|)([a-zA-Z0-9_\-]+)(&.+)?"
Test Vimeo URL: https://vimeo.com/62092214?query=foo

import Foundation
let strToTest = "https://vimeo.com/62092214?query=foo"
// Every backslash of the regex is doubled because the pattern lives in a Swift string literal
let pattern = "[http|https]+:\\/\\/(?:www\\.|player\\.)?vimeo\\.com\\/(?:channels\\/(?:\\w+\\/)?|groups\\/([^\\/]*)\\/videos\\/|album\\/(\\d+)\\/video\\/|video\\/|)([a-zA-Z0-9_\\-]+)(&.+)?"
let regex = try! NSRegularExpression(pattern: pattern, options: [])
// NSRange counts UTF-16 code units, so use utf16.count rather than count
let match = regex.firstMatch(in: strToTest, options: [], range: NSRange(location: 0, length: strToTest.utf16.count))
let group3Range = match?.range(at: 3)
let substring = (strToTest as NSString).substring(with: group3Range!)
print("substring: \(substring)")
That should work.
You need to escape every \ in the pattern, because it is written inside a Swift string literal.
You need to call range(at:) to get the range of the group you want according to your pattern (currently group 3), and then take the substring.
What should be improved?
I force-unwrapped everywhere (every ! I wrote) to keep the logic visible and avoid adding do/catch, if let, etc. I strongly suggest you handle those cases carefully.
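As a minimal sketch of what the safer version could look like, the force unwraps above can be replaced with guard statements so that a bad pattern or a non-matching URL yields nil instead of a crash. The function name vimeoID(from:) is hypothetical and not part of the original answer. One deliberate change, stated plainly: the scheme is written as https? here, because [http|https]+ is a character class that accepts any run of the characters h, t, p, s and |.

```swift
import Foundation

// Hypothetical helper (not from the original answer): returns the Vimeo ID
// captured by group 3, or nil when the URL does not match.
func vimeoID(from urlString: String) -> String? {
    // Same pattern as in the answer, except the scheme is written as https?
    let pattern = "https?:\\/\\/(?:www\\.|player\\.)?vimeo\\.com\\/(?:channels\\/(?:\\w+\\/)?|groups\\/([^\\/]*)\\/videos\\/|album\\/(\\d+)\\/video\\/|video\\/|)([a-zA-Z0-9_\\-]+)(&.+)?"
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return nil
    }
    // Build the NSRange from String indices so the UTF-16 length is always right
    let fullRange = NSRange(urlString.startIndex..<urlString.endIndex, in: urlString)
    guard let match = regex.firstMatch(in: urlString, options: [], range: fullRange),
          let idRange = Range(match.range(at: 3), in: urlString) else {
        return nil
    }
    return String(urlString[idRange])
}
```

A caller can then write if let id = vimeoID(from: someURL) { ... } and treat nil as "not a Vimeo URL".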

Here is yet another version. It uses a named capturing group, which is a bit different from the answer provided by Larme.
let regex = "[http|https]+:\\/\\/(?:www\\.|player\\.)?vimeo\\.com\\/(?:channels\\/(?:\\w+\\/)?|groups\\/(?:[^\\/]*)\\/videos\\/|album\\/(?:\\d+)\\/video\\/|video\\/|)(?<vimeoId>[a-zA-Z0-9_\\-]+)(?:&.+)?"
let vimeoURL = "https://vimeo.com/62092214?query=fooiosiphoneswift"
let regularExpression = try! NSRegularExpression(pattern: regex,
                                                 options: [])
let match = regularExpression.firstMatch(in: vimeoURL,
                                         options: [],
                                         range: NSRange(vimeoURL.startIndex ..< vimeoURL.endIndex,
                                                        in: vimeoURL))
if let range = match?.range(withName: "vimeoId"),
   let stringRange = Range(range, in: vimeoURL) {
    let vimeoId = vimeoURL[stringRange]
}
Note that I have modified your regex slightly, so that every group except vimeoId is non-capturing.

NLTagger: enumerating tags of multiple types in one pass

Using the NLTagger class, I'm wondering if anyone can recommend the most straightforward way to enumerate through the tagged tokens in a given text, but pulling out multiple tag types per token. For example, to enumerate the words in a given text, pulling out (lemma, lexical category) for each.
It seems that the enumerateTags() method and associated NLTag class have the limitation of only reporting one particular tag type per enumeration. So I can achieve what I want by making multiple passes over the text, e.g. pulling out the string ranges that match given criteria on the first pass and then matching things up on later passes. For example, I could lemmatise all of the nouns and verbs like this:
import NaturalLanguage
let tagger = NLTagger(tagSchemes: [.lemma, .nameTypeOrLexicalClass])
tagger.string = text // `text` is the string to analyse
let keyWordCategories: [NLTag] = [.noun, .verb]
let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace, .joinNames]
// Collects the lemmas found in the second pass
var lemmas = Set<String>()
// In the first pass, we're going to record which ranges are of categories we're interested in
var keywordRanges = Set<Range<String.Index>>(minimumCapacity: 200)
// First pass: which are the nouns and verbs?
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .nameTypeOrLexicalClass, options: options) { tag, tokenRange in
    if let tag = tag {
        if keyWordCategories.contains(tag) {
            keywordRanges.insert(tokenRange)
        }
    }
    return true
}
// Second pass: lemmatise, filtering on just the nouns and verbs
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lemma, options: options) { tag, tokenRange in
    if let tag = tag {
        if keywordRanges.contains(tokenRange) {
            lemmas.insert(tag.rawValue)
        }
    }
    return true
}
This mechanism achieves the desired functionality, but strikes me as a somewhat clumsy and potentially inefficient way to have to go about things. I would have expected to be able to enumerate (lemma, lexical category) in a single pass. I'm assuming that the NLTagger instance caches things behind the scenes so that it's not as terrible as it looks in terms of efficiency. But it's still far from ideal in terms of simplicity of the code. Can anyone more familiar with this API advise on whether this is really the intended pattern?
You could call tags(in:unit:scheme:options:) inside the first enumeration to obtain the lemma for each token range of interest, instead of enumerating every lemma in a second pass:
let tagger = NLTagger(tagSchemes: [.lemma, .nameTypeOrLexicalClass])
tagger.string = text
let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace, .joinNames]
let keyWordCategories: Set<NLTag> = [.noun, .verb]
var lemmas = Set<String>()
let unit: NLTokenUnit = .word
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: unit, scheme: .nameTypeOrLexicalClass, options: options) { tag, tokenRange in
    if tag.map(keyWordCategories.contains) == true {
        if let lemma = tagger.tags(in: tokenRange, unit: unit, scheme: .lemma, options: options).first?.0?.rawValue {
            lemmas.insert(lemma)
        }
    }
    return true
}

Find a word after and before a string

I have a string like so...
ab-0-myCoolApp.theAppAB.in
How can I get the word myCoolApp from this string...? Also there are many strings in the same format i.e myCoolApp can be myCoolAppABX or myCoolAppABCD etc.
This is one brief solution (of many) to your problem, but the core concept would be much the same in every case.
The input has some sample values:
let inputs = ["ab-0-myCoolApp.theAppAB.in", "ab-0-myCoolAppABX.theAppAB.in", "ab-0-myCoolAppABXC.theAppAB.in"]
and having a regular expression to find matches:
let regExp = try? NSRegularExpression(pattern: "-([^-]*?)\\.", options: .caseInsensitive)
then Release the Kraken:
inputs.forEach { string in
    // NSRange counts UTF-16 code units, so the length must be utf16.count, not the UTF-8 byte count
    regExp?.matches(in: string, options: [], range: NSRange(location: 0, length: string.utf16.count)).forEach({
        let match = (string as NSString).substring(with: $0.range(at: 1))
        debugPrint(match)
    })
}
finally it prints out the following list:
"myCoolApp"
"myCoolAppABX"
"myCoolAppABXC"
NOTE: you may need to implement further failsafes during getting the matches or you can refactor the entire idea at your convenience.
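As one possible refactoring along the lines of the note above, here is a sketch that avoids the regex altogether and returns nil rather than failing when the input does not have the expected shape. It assumes the wanted word always sits between the last "-" before the first "." and that "."; the function name appName(in:) is hypothetical.

```swift
// Hypothetical helper (not from the original answer): takes the text between
// the last "-" that precedes the first "." and that ".".
func appName(in host: String) -> String? {
    // Everything before the first dot, e.g. "ab-0-myCoolApp"
    guard let dot = host.firstIndex(of: ".") else { return nil }
    let head = host[..<dot]
    // The wanted word starts right after the last dash of that prefix
    guard let dash = head.lastIndex(of: "-") else { return nil }
    let name = head[head.index(after: dash)...]
    return name.isEmpty ? nil : String(name)
}
```

For the sample inputs this gives the same words as the regex version, and an input with no dash or no dot simply produces nil.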

I receive an improperly formatted unicode in a String

I am working with a web API that gives me strings like the following:
"Eat pok\u00e9."
Xcode complains that
Expected Hexadecimal code in braces after unicode escape
My understanding is that it should be converted to pok\u{00e9}, but I do not know how to achieve this.
Can anybody point me in the right direction to develop a way of converting these, as there are many in this API?
Bonus:
I also need to remove \n from the strings.
You may want to give us more context regarding what the raw server payload looked like, and show us how you're displaying the string. Some ways of examining strings in the debugger (or if you're looking at raw JSON) will show you escape strings, but if you use the string in the app, you'll see the actual Unicode character.
I wonder if you're just looking at raw JSON.
For example, I passed the JSON, {"foo": "Eat pok\u00e9."} to the following code:
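import Foundation
// The raw payload as it arrives from the server (backslash-u is part of the JSON text)
let data = "{\"foo\": \"Eat pok\\u00e9.\"}".data(using: .utf8)!
let jsonString = String(data: data, encoding: .utf8)!
print(jsonString)
let dictionary = try! JSONSerialization.jsonObject(with: data, options: []) as! [String: String]
print(dictionary["foo"]!)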
And it output:
{"foo": "Eat pok\u00e9."}
Eat poké.
By the way, this standard JSON escape syntax should not be confused with Swift's string literal escape syntax, in which the hex sequence must be wrapped in braces:
print("Eat pok\u{00e9}.")
Swift uses a different escape syntax in their string literals, and it should not be confused with that employed by formats like JSON.
@Rob has an excellent solution for the case where the server passes invalid Swift String literals.
If you need to convert "Eat pok\u00e9.\n" into readable text yourself, it can be done with a regex as follows.
import Foundation
var input = "Eat pok\\u00e9.\n"
// replaces the newline with a space
input = String(input.map {
    $0 == "\n" ? " " : $0
})
// regex helper function for sanity's sake: returns the first match, or "" when there is none
func regexGroup(for pattern: String, in text: String) -> String {
    do {
        let regex = try NSRegularExpression(pattern: pattern, options: [])
        let nsString = NSString(string: text)
        let results = regex.matches(in: text, options: [], range: NSRange(location: 0, length: nsString.length))
        guard let first = results.first else { return "" }
        return nsString.substring(with: first.range)
    } catch {
        print("invalid regex: \(error.localizedDescription)")
        return ""
    }
}
// finds the hex digits that follow \u (here "00e9")
let unicodeHexStr = regexGroup(for: "0\\w*", in: input)
let unicodeHex = UInt32(unicodeHexStr, radix: 16)!
let char = Character(UnicodeScalar(unicodeHex)!)
let replaced = input.replacingOccurrences(of: "\\u" + unicodeHexStr, with: String(char))
// prints "Eat poké. " (the newline became a trailing space)
print(replaced)
\u{00e9} is a formatting that's specific to Swift String literals. When the code is compiled, this notation is parsed and converted into the actual Unicode Scalar it represents.
What you've received is a String that escapes Unicode scalars in a particular way. You need to transform those escaped Unicode scalars into the Unicode scalars they represent; see this answer.
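Since the escapes in question follow JSON's syntax, one way to transform all of them at once is to let a JSON decoder do it. The sketch below is an illustration, not the linked answer's method: it wraps the text in a one-element JSON array and decodes it. The function name decodeJSONEscapes(in:) is hypothetical, and the approach assumes the input contains no unescaped double quotes or raw control characters.

```swift
import Foundation

// Hypothetical helper: treats `escaped` as the body of a JSON string literal
// and returns the decoded text, or nil if it is not valid JSON string content.
func decodeJSONEscapes(in escaped: String) -> String? {
    // Wrap in an array so the payload is a valid top-level JSON document
    let json = "[\"\(escaped)\"]"
    guard let data = json.data(using: .utf8),
          let decoded = try? JSONDecoder().decode([String].self, from: data) else {
        return nil
    }
    return decoded.first
}

// Backslash-u and backslash-n are literal characters in this string
let raw = "Eat pok\\u00e9.\\n"
// Decoding turns the escaped \n into a real newline, which is then trimmed
let cleaned = decodeJSONEscapes(in: raw)?.trimmingCharacters(in: .newlines)
```

Trimming only removes newlines at the ends; newlines in the middle of a string would need replacingOccurrences(of:with:) instead.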

swift ios alpha numeric regex that allows underscores and dashes

I am using this lib for validation and am trying to add my own regex.
What I want is a regex that allows alphanumeric characters (A-Z, 0-9) together with dashes and underscores (- and _).
I have tried let regex = "[a-zA-Z0-9_-]" but I can't get it to work.
I also want the regex to allow letters from all languages, not only English ones.
The lib itself works, because I have made another regex that only allows the digits 0-9, and that one works:
let intRegex = "^[0-9]*$"
Your regex looks good, but it will only match a single character. Use "^[a-zA-Z0-9_-]*$" instead to match more than one character. To accept letters from every language, replace a-zA-Z with \pL, as in the breakdown below.
Breakdown:
^ -- start of string
[\pL0-9_-] -- characters you want to allow (\pL matches any Unicode letter)
* -- any number of characters (the crucial bit you were missing)
$ -- end of string
Building on @charsi's answer:
extension String {
    var isAlphanumericDashUnderscore: Bool {
        let regex = try! NSRegularExpression(pattern: "^[a-zA-Z0-9_-]*$", options: .caseInsensitive)
        // NSRange counts UTF-16 code units, so use utf16.count rather than count
        return regex.firstMatch(in: self, options: [], range: NSRange(location: 0, length: utf16.count)) != nil
    }
}
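The question also asked for letters from all languages, which the a-zA-Z version does not cover. A minimal sketch of a Unicode-aware variant follows, using the \p{L} (any letter) and \p{N} (any number) properties; the property name isUnicodeAlphanumericDashUnderscore is hypothetical. Note the assumption that non-ASCII digits are acceptable too; keep 0-9 in place of \p{N} if they are not.

```swift
import Foundation

extension String {
    // Hypothetical variant: true when the string consists only of Unicode
    // letters, Unicode numbers, underscores and dashes.
    var isUnicodeAlphanumericDashUnderscore: Bool {
        guard let regex = try? NSRegularExpression(pattern: "^[\\p{L}\\p{N}_-]*$", options: []) else {
            return false
        }
        // NSRange is measured in UTF-16 code units
        let fullRange = NSRange(location: 0, length: utf16.count)
        return regex.firstMatch(in: self, options: [], range: fullRange) != nil
    }
}
```

Anything outside those classes, such as a space or a plus sign, makes the check fail.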

how do I properly express this 'rangeOfCharacter' statement using swift 3?

I have some logic that lets me check the edits to a text field for invalid characters from a character set I created. Due to the syntax changes in Swift 3, I get the following error:
Cannot invoke initializer for type 'Range<Index>' with an argument list of type '(DefaultBidirectionalIndices<String.CharacterView>)'
on this line of code:
if let _ = string.rangeOfCharacter(from: invalidCharacters, options: [], range:Range<String.Index>(string.characters.indices))
I've looked into the new API docs but can't seem to find the correct form for this line in Swift 3. Any suggestions?
You just need to use ..< operator to create your range. Note: If you want to check the whole string you can omit the options and range parameters.
if let _ = string.rangeOfCharacter(from: invalidCharacters, options: [], range: string.startIndex..<string.endIndex) {
}
Or simply:
if let _ = string.rangeOfCharacter(from: invalidCharacters) {
}
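For context, here is a small self-contained sketch of the short form in use. The question does not show how invalidCharacters was built, so the set below (everything that is not alphanumeric) is only an assumption, and the function name is hypothetical.

```swift
import Foundation

// Assumption: "invalid" means anything that is not a letter or a digit
let invalidCharacters = CharacterSet.alphanumerics.inverted

// Hypothetical helper: true when `string` contains at least one invalid character
func containsInvalidCharacters(_ string: String) -> Bool {
    // With no options or range, the whole string is searched
    return string.rangeOfCharacter(from: invalidCharacters) != nil
}
```

A text field delegate could call this on the replacement string and reject the edit when it returns true.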
