String search with Turkish dotless i - ios

When searching the text Çınaraltı Café for the text Ci using the code
NSStringCompareOptions options =
NSCaseInsensitiveSearch |
NSDiacriticInsensitiveSearch |
NSWidthInsensitiveSearch;
NSLocale *locale = [NSLocale localeWithLocaleIdentifier:@"tr"];
NSRange range = [haystack rangeOfString:needle
options:options
range:NSMakeRange(0, haystack.length)
locale:locale];
I get range.location equals NSNotFound.
It's not to do with the diacritic on the initial Ç because I get the same result searching for alti where the only odd character is the ı. I also get a valid match searching for Cafe which contains a diacritic (the é).
The Apple docs mention this situation in the notes on the locale parameter, and I think I'm following them. Though I guess I'm not, because it's not working.
How can I get a search for 'i' to match both 'i' and 'ı'?

I don't know whether this helps as an answer, but perhaps explains why it's happening.
I should point out I'm not an expert in this matter, but I've been looking into this for my own purposes and been doing some research.
Looking at the Unicode collation chart for Latin, the equivalent characters to ASCII "i" (\u0069) do not include "ı" (\u0131), whereas all the other letters in your example string are as you expect, i.e.:
"c" (\u0063) does include "Ç" (\u00c7)
"e" (\u0065) does include "é" (\u00e9)
The ı character is listed separately as being of primary difference to i. That might not make sense to a Turkish speaker (I'm not one) but it's what Unicode have to say about it, and it does fit the logic of the problem you describe.
In Chrome you can see this in action with an in-page search. Searching in the page for ASCII i highlights all the characters in its block and does not match ı. Searching for ı does the opposite.
By contrast, MySQL's utf8_general_ci collation table maps uppercase ASCII I to ı as you want.
So, without knowing anything about iOS, I'm assuming it's using the Unicode standard and normalising all characters to Latin by this table.
As to how you match Çınaraltı with Ci - if you can't override the collation table then perhaps you can just replace i in your search strings with a regular expression, so you search on Ç[iı] instead.

I wrote a simple extension in Swift 3 for Turkish string search.
let turkishSentence = "Türkçe ya da Türk dili, batıda Balkanlar’dan başlayıp doğuda Hazar Denizi sahasına kadar konuşulan Altay dillerinden biridir."
let turkishWannabe = "basLayip"
let shouldBeTrue = turkishSentence.contains(turkishString: turkishWannabe, caseSensitive: false)
let shouldBeFalse = turkishSentence.contains(turkishString: turkishWannabe, caseSensitive: true)
You can check it out from https://github.com/alpkeser/swift_turkish_string_search/blob/master/TurkishTextSearch.playground/Contents.swift

I did this and it seems to work well for me. Hope it helps!
NSString *cleanedHaystack = [haystack stringByReplacingOccurrencesOfString:@"ı"
withString:@"i"];
cleanedHaystack = [cleanedHaystack stringByReplacingOccurrencesOfString:@"İ"
withString:@"I"];
NSString *cleanedNeedle = [needle stringByReplacingOccurrencesOfString:@"ı"
withString:@"i"];
cleanedNeedle = [cleanedNeedle stringByReplacingOccurrencesOfString:@"İ"
withString:@"I"];
NSUInteger options = (NSDiacriticInsensitiveSearch |
NSCaseInsensitiveSearch |
NSWidthInsensitiveSearch);
NSRange range = [cleanedHaystack rangeOfString:cleanedNeedle
options:options];
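The same idea can be wrapped in a small helper. This is only a sketch: the names ZSTFoldTurkishI and ZSTFoldedTurkishRange are made up for illustration, and it assumes both strings use precomposed characters. Under that assumption each replacement swaps one UTF-16 unit for another, so the returned range should also be usable on the original haystack.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): maps the Turkish dotless ı
// and dotted İ to plain i / I so the ordinary insensitive search can match them.
static NSString *ZSTFoldTurkishI(NSString *s) {
    s = [s stringByReplacingOccurrencesOfString:@"ı" withString:@"i"];
    return [s stringByReplacingOccurrencesOfString:@"İ" withString:@"I"];
}

// Returns the range of needle in haystack, or {NSNotFound, 0} if there is none.
// Assumption: precomposed input, so offsets in the folded copy match the original.
static NSRange ZSTFoldedTurkishRange(NSString *haystack, NSString *needle) {
    if (haystack == nil || needle.length == 0) {
        return NSMakeRange(NSNotFound, 0);
    }
    NSStringCompareOptions options = NSCaseInsensitiveSearch |
                                     NSDiacriticInsensitiveSearch |
                                     NSWidthInsensitiveSearch;
    return [ZSTFoldTurkishI(haystack) rangeOfString:ZSTFoldTurkishI(needle)
                                            options:options];
}
```

Called as ZSTFoldedTurkishRange(haystack, needle), it can replace the rangeOfString:options:range:locale: call from the question without changing the stored data.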

As Tim mentions, we can use a regular expression to match text containing i or ı. I also didn't want to add a new field or change the source data, as the search looks through a huge number of strings. So I ended up with a solution using regular expressions and NSPredicate.
Create an NSString category and copy this method into it. It returns a basic matching pattern that you can use with any method that accepts a regular expression pattern.
- (NSString *)zst_regexForTurkishLettersWithCaseSensitive:(BOOL)caseSensitive
{
NSMutableString *filterWordRegex = [NSMutableString string];
for (NSUInteger i = 0; i < self.length; i++) {
NSString *letter = [self substringWithRange:NSMakeRange(i, 1)];
if (caseSensitive) {
if ([letter isEqualToString:@"ı"] || [letter isEqualToString:@"i"]) {
letter = @"[ıi]";
} else if ([letter isEqualToString:@"I"] || [letter isEqualToString:@"İ"]) {
letter = @"[Iİ]";
}
} else {
if ([letter isEqualToString:@"ı"] || [letter isEqualToString:@"i"] ||
[letter isEqualToString:@"I"] || [letter isEqualToString:@"İ"]) {
letter = @"[ıiIİ]";
}
}
[filterWordRegex appendString:letter];
}
return filterWordRegex;
}
So if the search word is Şırnak, it creates Ş[ıi]rnak for case sensitive and Ş[ıiIİ]rnak for case insensitive search.
And here are the possible usages.
NSString *testString = @"Şırnak";
// First create your search regular expression.
NSString *searchWord = @"şır";
NSString *searchPattern = [searchWord zst_regexForTurkishLettersWithCaseSensitive:NO];
// Then create your matching pattern.
NSString *pattern = searchPattern; // Direct match
// NSString *pattern = [NSString stringWithFormat:@".*%@.*", searchPattern]; // Contains
// NSString *pattern = [NSString stringWithFormat:@"\\b%@.*", searchPattern]; // Begins with
// NSPredicate
// c for case insensitive, d for diacritic insensitive
NSPredicate *predicate = [NSPredicate predicateWithFormat:@"self matches[cd] %@", pattern];
if ([predicate evaluateWithObject:testString]) {
// Matches
}
// If you want to filter an array of objects
NSArray *matchedCities = [allAirports filteredArrayUsingPredicate:
[NSPredicate predicateWithFormat:@"city matches[cd] %@", pattern]];
You can also use NSRegularExpression, but I think a case- and diacritic-insensitive search with NSPredicate is much simpler.
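When the matched range itself is needed, as in the original question, NSRegularExpression is one way to get it. The following is a rough sketch, not a tested drop-in: ZSTTurkishRegexRange is a hypothetical name, and it assumes precomposed (NFC) input so that stripping diacritics leaves UTF-16 offsets unchanged. Since NSRegularExpression has no diacritic-insensitive option, both strings are folded first.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: finds needle in haystack, treating i, ı, I and İ as
// interchangeable and ignoring case and diacritics.
// Assumption: precomposed input, so the range found in the folded copy can be
// applied to the original haystack.
static NSRange ZSTTurkishRegexRange(NSString *haystack, NSString *needle) {
    if (haystack == nil || needle.length == 0) {
        return NSMakeRange(NSNotFound, 0);
    }
    // Strip diacritics from both strings (for example Ç -> C, é -> e).
    NSString *text = [haystack stringByFoldingWithOptions:NSDiacriticInsensitiveSearch
                                                   locale:nil];
    NSString *word = [needle stringByFoldingWithOptions:NSDiacriticInsensitiveSearch
                                                 locale:nil];

    // Build the pattern letter by letter, escaping regex metacharacters.
    NSMutableString *pattern = [NSMutableString string];
    [word enumerateSubstringsInRange:NSMakeRange(0, word.length)
                             options:NSStringEnumerationByComposedCharacterSequences
                          usingBlock:^(NSString *letter, NSRange range,
                                       NSRange enclosingRange, BOOL *stop) {
        if ([@"ıiIİ" rangeOfString:letter options:NSLiteralSearch].location != NSNotFound) {
            [pattern appendString:@"[ıiIİ]"];
        } else {
            [pattern appendString:[NSRegularExpression escapedPatternForString:letter]];
        }
    }];

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        return NSMakeRange(NSNotFound, 0);
    }
    // Returns {NSNotFound, 0} when nothing matches.
    return [regex rangeOfFirstMatchInString:text
                                    options:0
                                      range:NSMakeRange(0, text.length)];
}
```

Unlike the category method above, this version escapes the other letters of the search word, so input such as "a.b" is searched literally.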

Related

Validate a string using regex

I want to validate a string to check if it is alphanumeric and contains "-" and "." with the alphanumeric characters. So I have done something like this to form the regex pattern
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"[a-zA-Z0-9\\.\\-]"
options:NSRegularExpressionCaseInsensitive
error:&error];
NSPredicate *regexTest = [NSPredicate predicateWithFormat:@"SELF MATCHES %@", regex];
BOOL valid = [regexTest evaluateWithObject:URL_Query];
The app crashes, stating that the regex pattern cannot be formed. Can anyone give me a quick fix for what I am doing wrong? Thanks in advance.
You must pass a variable of type NSString to the NSPredicate SELF MATCHES:
NSString * URL_Query = @"PAS.S.1-23-";
NSString * regex = @"[a-zA-Z0-9.-]+";
NSPredicate *regexTest = [NSPredicate predicateWithFormat:@"SELF MATCHES %@", regex];
BOOL valid = [regexTest evaluateWithObject:URL_Query];
See the Objective C demo
Note that you need no anchors with SELF MATCHES (the regex is anchored by default), and you need to add + to match one or more allowed symbols, or * to match zero or more (to also allow an empty string).
You do not need to escape the hyphen at the start/end of the character class, and the dot inside a character class is treated as a literal dot char.
Also, since both the lower- and uppercase ASCII letter ranges are present in the pattern, you need not pass any case insensitive flags to the regex.
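If you would rather stay with NSRegularExpression, as in the question, the pattern has to be anchored explicitly, because NSRegularExpression searches for matches anywhere in the string. A minimal sketch, with ZSTIsValidQuery as a made-up name:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when string is non-empty and consists only of
// ASCII letters, digits, "." and "-".
static BOOL ZSTIsValidQuery(NSString *string) {
    if (string == nil) {
        return NO;
    }
    NSError *error = nil;
    // \A and \z pin the match to the very start and end of the input.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"\\A[a-zA-Z0-9.-]+\\z"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        return NO;
    }
    NSRange whole = NSMakeRange(0, string.length);
    return [regex numberOfMatchesInString:string options:0 range:whole] == 1;
}
```

\z is used here instead of $ because $ also matches just before a trailing line break.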

iOS - NSString regex match

I have a string for example:
NSString *str = @"Strängnäs"
Then I use a method to replace Scandinavian letters with *, so it would be:
NSString *strReplaced = @"Str*ngn*s"
I need a function to match str with strReplaced. In other words, the * should be treated as any character ( * should match with any character).
How can I achieve this?
Strängnäs should be equal to Str*ngn*s
EDIT:
Maybe I wasn't clear enough. I want * to be treated as any character. So when doing [@"Strängnäs" isEqualToString:@"Str*ngn*s"] it should return YES
I think the following regex pattern will match all non-ASCII text considering that Scandinavian letters are not ASCII:
[^ -~]
Treat each line separately to avoid matching the newline character and replace the matches with *.
Demo: https://regex101.com/r/dI6zN5/1
Edit:
Here's an optimized pattern based on the above one:
[^\000-~]
Demo: https://regex101.com/r/lO0bE9/1
Edit 1: As per your comment, you need a UDF (User defined function) that:
takes in the Scandinavian string
converts all of its Scandinavian letters to *
takes in the string with the asterisks
compares the two strings
return True if the two strings match, else false.
You can then use the UDF like CompareString(ScanStr,AsteriskStr).
I have created a code example using the regex posted by JLILI Amen
Code
NSString *string = @"Strängnäs";
NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"[^ -~]" options:NSRegularExpressionCaseInsensitive error:&error];
NSString *modifiedString = [regex stringByReplacingMatchesInString:string options:0 range:NSMakeRange(0, [string length]) withTemplate:@"*"];
NSLog(@"%@", modifiedString);
Output
Str*ngn*s
Not sure exactly what you are after, but maybe this will help.
The regular expression pattern which matches any character is . (dot), so you can create a pattern from your strReplaced by replacing the *'s with .'s:
NSString *pattern = [strReplaced stringByReplacingOccurrencesOfString:@"*" withString:@"."];
Now using NSRegularExpression you can construct a regular expression from pattern and then see if str matches it - see the documentation for the required methods.
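One way this could look in full is sketched below. ZSTMatchesStarTemplate is a hypothetical name, and the sketch assumes that each * stands for exactly one character and that the strings are precomposed (a decomposed ä, written as a plus a combining mark, counts as two characters for the regex).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when string matches starTemplate, where every "*"
// in the template stands for exactly one character.
static BOOL ZSTMatchesStarTemplate(NSString *string, NSString *starTemplate) {
    if (string == nil || starTemplate == nil) {
        return NO;
    }
    // Escape the literal pieces so characters such as "." or "(" in the
    // template are not treated as regex syntax, then join them with ".".
    NSMutableArray<NSString *> *pieces = [NSMutableArray array];
    for (NSString *piece in [starTemplate componentsSeparatedByString:@"*"]) {
        [pieces addObject:[NSRegularExpression escapedPatternForString:piece]];
    }
    NSString *pattern = [pieces componentsJoinedByString:@"."];

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionDotMatchesLineSeparators
                                                    error:&error];
    if (regex == nil) {
        return NO;
    }
    // Require the match to cover the whole string, not just a part of it.
    NSRange whole = NSMakeRange(0, string.length);
    NSRange match = [regex rangeOfFirstMatchInString:string
                                             options:NSMatchingAnchored
                                               range:whole];
    return NSEqualRanges(match, whole);
}
```

With this, ZSTMatchesStarTemplate(@"Strängnäs", @"Str*ngn*s") plays the role that isEqualToString: could not.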

Checking for a valid Hebrew regex return always YES

I've a certain regex pattern to check against.
A valid result contains only Hebrew-language letters, marks, etc.
//////////Regex//////////
static NSString *const HEBREW_NUMBERS_NON_NUMERIC_CHAR = @"([\u0590-\u05FF]*|[0-9]*|[\\s]*|[.-:;,?!/&*()+=_'\"]*)+";
+ (BOOL)hasValidOpenLine:(NSString *)openLine
{
if (openLine.length >= MIN_NUMBER_OF_CHARACTERS_IN_OPEN_LINE || openLine.length <= MAX_NUMBER_OF_CHARACTERS_IN_OPEN_LINE) {
NSError *errorRegex;
NSRegularExpression *regexOpenLine = [[NSRegularExpression alloc] initWithPattern:HEBREW_NUMBERS_NON_NUMERIC_CHAR
options:0
error:&errorRegex];
NSRange range = NSMakeRange(0, openLine.length);
if ([regexOpenLine numberOfMatchesInString:openLine options:0 range:range] > 0) {
return YES;
}
}
return NO;
}
But no matter what I type, it always returns YES, even for an English-only string.
There may be two things going wrong here, depending on your test string. First off, the stars in your regex allow for empty matches against strings which would otherwise not match, which is why your regex might match English strings: matching your regex on @"Hello, world!" returns {0, 0}, a range whose location is not NSNotFound, but whose length is zero.
The other issue is that you're not anchoring your search. This will allow the regex to match against single characters in strings that would otherwise not match (e.g. the , in @"Hello, world!"). What you need to do is anchor the regex so that the whole string has to match, or else the regex rejects it.
Your modified code can look something like this:
static NSString *const HEBREW_NUMBERS_NON_NUMERIC_CHAR = @"([\u0590-\u05FF]|[0-9]|[\\s]|[.-:;,?!/&*()+=_'\"])+";
+ (BOOL)hasValidOpenLine:(NSString *)openLine
{
if (openLine.length >= MIN_NUMBER_OF_CHARACTERS_IN_OPEN_LINE && openLine.length <= MAX_NUMBER_OF_CHARACTERS_IN_OPEN_LINE) { // both bounds must hold; with || the check is always true
NSError *errorRegex;
NSRegularExpression *regexOpenLine = [[NSRegularExpression alloc] initWithPattern:HEBREW_NUMBERS_NON_NUMERIC_CHAR
options:0
error:&errorRegex];
if ([regexOpenLine numberOfMatchesInString:openLine options:NSMatchingAnchored range:NSMakeRange(0, openLine.length)] > 0) {
return YES;
}
}
return NO;
}
This will now match against strings like @"שלום!", and not strings like @"Hello, world!" or @"Hello: היי", which is what I assume you're going for.
In the future, if you're looking to debug regexes, use -[NSRegularExpression rangeOfFirstMatchInString:options:range:] or -[NSRegularExpression enumerateMatchesInString:options:range:usingBlock:]; they can help you find matches that may cause your regex to accept unnecessarily.
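One caveat worth checking: NSMatchingAnchored only pins a match to the start of the search range, so a string that begins with Hebrew and continues in English would presumably still produce a match. A stricter variant is sketched below under some assumptions: ZSTIsHebrewOnly is a made-up name, the length limits are left out, and the hyphen is escaped because the unescaped ".-:" in the original character class reads as a range from "." to ":".

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES only when every character of line is in the Hebrew
// block (U+0590 to U+05FF), an ASCII digit, whitespace, or listed punctuation.
static BOOL ZSTIsHebrewOnly(NSString *line) {
    if (line.length == 0) {
        return NO;
    }
    // \A and \z pin the match to the start and end of the input, so a single
    // stray character anywhere makes the whole string fail.
    NSString *pattern = @"\\A[\\u0590-\\u05FF0-9\\s.\\-:;,?!/\\&*()+=_'\"]+\\z";
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        return NO;
    }
    NSRange whole = NSMakeRange(0, line.length);
    return [regex numberOfMatchesInString:line options:0 range:whole] == 1;
}
```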

Check Objective-C String for specific characters

For an app I'm working on, I need to check if a text field contains only the letters A, T, C, or G. Furthermore, I would like to make specialized error messages for any other entered characters, e.g. "Don't put in spaces." or "The letter b isn't an accepted value." I have read a couple of other posts like this, but they are about alphanumeric input; I only want specified characters.
One approach for you, far from unique:
NSString has methods to find substrings, represented as an NSRange of location and length, made up from characters in a given NSCharacterSet.
The set of what should be in the string:
NSCharacterSet *ATCG = [NSCharacterSet characterSetWithCharactersInString:@"ATCG"];
And the set of what shouldn't:
NSCharacterSet *invalidChars = [ATCG invertedSet];
You can now search for the first character that belongs to invalidChars:
NSString *target; // the string you wish to check
NSRange searchRange = NSMakeRange(0, target.length); // search the whole string
NSRange foundRange = [target rangeOfCharacterFromSet:invalidChars
options:0 // look in docs for other possible values
range:searchRange];
If there are no invalid characters then foundRange.location will be equal to NSNotFound; otherwise you can examine the character in foundRange and produce your specialised error messages.
You repeat the process, updating searchRange based on foundRange, to find all the invalid characters.
You could accumulate the found invalid characters into a set (maybe NSMutableSet) and produce the error messages at the end.
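The repeat-and-accumulate loop described above might be sketched as follows. ZSTInvalidBases is a hypothetical name, and an array is used instead of a set so the characters stay in order of appearance.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns every character of target that is not A, T, C
// or G, in order of appearance (an empty array means the input is valid).
static NSArray<NSString *> *ZSTInvalidBases(NSString *target) {
    NSMutableArray<NSString *> *found = [NSMutableArray array];
    if (target == nil) {
        return found;
    }
    NSCharacterSet *invalidChars =
        [[NSCharacterSet characterSetWithCharactersInString:@"ATCG"] invertedSet];
    NSRange searchRange = NSMakeRange(0, target.length);
    while (searchRange.length > 0) {
        NSRange foundRange = [target rangeOfCharacterFromSet:invalidChars
                                                     options:0
                                                       range:searchRange];
        if (foundRange.location == NSNotFound) {
            break; // no more invalid characters
        }
        [found addObject:[target substringWithRange:foundRange]];
        // Continue searching just past the character that was found.
        searchRange.location = NSMaxRange(foundRange);
        searchRange.length = target.length - searchRange.location;
    }
    return found;
}
```

Each entry can then be turned into its own message, for example "Don't put in spaces." for @" " and a generic "isn't an accepted value" message for anything else.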
You can also use regular expressions, see NSRegularExpression.
Etc. HTH
Addendum
There is a really simple way to address this, but I did not give it as the letters you give suggest to me you may be dealing with very long strings and using provided methods as above may be a worthwhile win. However on second thoughts after your comment maybe I should include it:
NSString *target; // the string you wish to check
NSUInteger length = target.length; // number of characters
BOOL foundInvalidCharacter = NO; // set in the loop if there is an invalid char
for(NSUInteger ix = 0; ix < length; ix++)
{
unichar nextChar = [target characterAtIndex:ix]; // get the next character
switch (nextChar)
{
case 'A':
case 'C':
case 'G':
case 'T':
// character is valid - skip
break;
default:
// character is invalid
// produce error message, the character 'nextChar' at index 'ix' is invalid
// record you've found an error
foundInvalidCharacter = YES;
}
}
// test foundInvalidCharacter and proceed based on it
HTH
Use NSRegularExpression like this.
NSString *str = @"your input string";
NSRegularExpression *regEx = [NSRegularExpression regularExpressionWithPattern:@"A|T|C|G" options:0 error:nil];
NSArray *matches = [regEx matchesInString:str options:0 range:NSMakeRange(0, str.length)];
for (NSTextCheckingResult *result in matches) {
NSLog(@"%@", [str substringWithRange:result.range]);
}
Also for the options parameter you have to look in the documentation to pick one that fits.
Look at the NSRegularExpression class reference.
Visit: https://developer.apple.com/library/mac/documentation/Foundation/Reference/NSRegularExpression_Class/Reference/Reference.html

How to validate a phone number with + symbol in objective c?

I am so confused about the regex methods. My requirement is to validate a phone number that may contain a + symbol as its prefix. All the other characters should be numerals only. How can I create a regular expression for this in Objective-C?
I'm late answering, but I found an interesting solution when I recently had the same problem. It uses the built-in Cocoa methods instead of a custom regex.
- (BOOL)validatePhoneNumberWithString:(NSString *)string {
if (nil == string || ([string length] < 2 ) )
return NO;
NSError *error;
NSDataDetector *detector = [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypePhoneNumber error:&error];
NSArray *matches = [detector matchesInString:string options:0 range:NSMakeRange(0, [string length])];
for (NSTextCheckingResult *match in matches) {
if ([match resultType] == NSTextCheckingTypePhoneNumber) {
NSString *phoneNumber = [match phoneNumber];
if ([string isEqualToString:phoneNumber]) {
return YES;
}
}
}
return NO;
}
I wouldn't say this is a definitive answer but it should give you a start.
^\x2b[0-9]+
Will match any string that starts with a '+' and then any amount of numbers greater than 0.
For instance:
+441312002000 - Full phone number matched.
+4413120c2000 - +4413120 is matched.
++441312002000 - No match
441312002000 - No Match
If there are further constraints on length etc., then specify them and I can update the regex. I agree with the other poster about using RegexKitLite.
Use RegexKitLite, check the following http://regexkit.sourceforge.net/RegexKitLite/
^\+?[0-9]*$
should do:
^ # start of string
\+? # match zero or one + characters
[0-9]* # match any number of digits
$ # end of string
To use the regex in a string, you'll need to double the backslashes: @"^\\+?[0-9]*$" should work according to other regex examples I've seen, but I don't know Objective-C and may be wrong about this.
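For what it's worth, here is a sketch of how that pattern could be used from Objective-C with NSPredicate. ZSTIsPlusPrefixedNumber is a made-up name. The sketch uses + rather than * after the digit class, so an empty string or a lone "+" is rejected, and it drops ^ and $ because MATCHES already has to cover the whole string.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES for an optional leading "+" followed by one or
// more ASCII digits, and nothing else.
static BOOL ZSTIsPlusPrefixedNumber(NSString *string) {
    if (string == nil) {
        return NO;
    }
    // In an Objective-C string literal the regex \+ is written as \\+.
    NSString *pattern = @"\\+?[0-9]+";
    NSPredicate *test = [NSPredicate predicateWithFormat:@"SELF MATCHES %@", pattern];
    return [test evaluateWithObject:string];
}
```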
This post nicely explains the regex -- http://blog.stevenlevithan.com/archives/validate-phone-number. You have to use "\\" instead of "\" to prevent the Objective-C compiler from interpreting regex escape codes as character string escape codes.
Here is the NSString you would use for the requested match
NSString *northAmRegexWithOptionalLeadingOne = @"^(?:\\+?1[-. ]?)?\\(?([2-9][0-8][0-9])\\)?[-. ]?([2-9][0-9]{2})[-. ]?([0-9]{4})$";
\+?[0-9]{length of phone} should work.

Resources