How to take non-english characters from UITextField and consider them as normal characters - ios

I have a database that contains non-English words (Turkish letters, for those who wonder), and an algorithm that compares user input against the database.
My problem is this: all the strings in my database are written with Turkish characters. Say I have the element heyyö to compare. When the user enters heyyo, it won't be found, since the two are considered different words.
My first thought was to add special cases: whenever a non-English character is found, treat it as matching its English counterpart (g with ğ, or i with ı). But that means a lot of brute force.
How can I do this elegantly?
Oh, and the user enters this input through a text field, if that wasn't implied.

The removal of diacritics is called "folding." You can compare strings without regard to diacritics using the option NSDiacriticInsensitiveSearch.
[string compare:otherString options:NSDiacriticInsensitiveSearch] == NSOrderedSame
You can similarly generate a folded string using stringByFoldingWithOptions:locale:.
Note that this only removes diacritics. There are many ways that characters can "seem" the same without being the same. Turkish is somewhat notorious for this because the lowercase version of "I" is "ı" (LATIN SMALL LETTER DOTLESS I), not "i". If you're dealing with Turkish in particular, you may have to account for this.
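The two approaches above (comparing with an option, and generating a folded string) could be wrapped roughly as follows. This is only a sketch: the helper names LooselyEqual and FoldedKey are made up for illustration, and combining the diacritic option with NSCaseInsensitiveSearch is an assumption about what the search should tolerate.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when the strings differ only by diacritics or case.
static BOOL LooselyEqual(NSString *a, NSString *b) {
    NSStringCompareOptions options = NSDiacriticInsensitiveSearch | NSCaseInsensitiveSearch;
    return [a compare:b options:options] == NSOrderedSame;
}

// Hypothetical helper: a folded copy that could be stored next to the
// original word and used as the lookup key.
static NSString *FoldedKey(NSString *s) {
    NSStringCompareOptions options = NSDiacriticInsensitiveSearch | NSCaseInsensitiveSearch;
    // Passing nil for the locale uses the system locale.
    return [s stringByFoldingWithOptions:options locale:nil];
}
```

With these, LooselyEqual(@"heyyö", @"heyyo") is YES, and FoldedKey can be applied once to every database entry so that the user's folded input is compared against precomputed keys instead of folding on every lookup.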

What you can do is something like this:
NSString *input = @"heyyö";
NSData *intermediaryDataForm = [input dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *output = [[NSString alloc] initWithData:intermediaryDataForm encoding:NSASCIIStringEncoding];
Because the Turkish letters are not part of ASCII and you are allowing a lossy conversion, 'ö' is automatically changed to 'o' when the string is converted to NSData. Converting the data back to an NSString then gives you the plain version.

Related

NSAttributedString & decomposedStringWithCanonicalMapping ranges

I'm running into problems with international (in this case Korean) NSString values.
The same input string is used in two different parts of the program. The first part finds a substring that needs highlighting, stores the NSString and the range for the highlighting into a database.
The second part of the program retrieves the string and displays the highlighting.
The marking part is done using an NSString that has been normalized in Unicode Normalization Form C using the precomposedStringWithCanonicalMapping method on NSString. An NSRange and an NSString are then stored into the Core Data database.
The graphical highlighting is performed by retrieving the NSRange and NSString from the database, putting the NSString into the same Form C using the same method, using this to initialize an NSMutableAttributedString and using the NSRange to set its text attributes.
At this stage, the program crashes because the NSMutableAttributedString is 80 characters long, whereas the NSString was 81 characters long.
NSAttributedString does not have a precomposedStringWithCanonicalMapping method and I assume it changes the representation internally resulting in a different encoding and thus length.
What can I do?
Is there a way of forcing NSAttributedString to keep an underlying encoding?
Is there a way of converting an NSRange from one encoding to another?
Or is there anything else I can do?
OK, I did eventually find out what had happened. In one particular place in the program I mistakenly used decomposedStringWithCanonicalMapping rather than precomposedStringWithCanonicalMapping, and that's where the "wrong" mapping came from.
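To see why mixing the two mappings shifts lengths and ranges, it can help to compare the lengths directly. The sketch below uses made-up helper names (ComposedLength, DecomposedLength, RangeFits); the range check is one possible guard before applying attributes, not something the original program is known to have used.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helpers: length in UTF-16 code units under each normalization form.
static NSUInteger ComposedLength(NSString *s) {
    return [[s precomposedStringWithCanonicalMapping] length];   // Form C
}

static NSUInteger DecomposedLength(NSString *s) {
    return [[s decomposedStringWithCanonicalMapping] length];    // Form D
}

// Hypothetical guard: YES when a stored range still lies inside the string
// it is about to be applied to.
static BOOL RangeFits(NSRange range, NSString *s) {
    return NSMaxRange(range) <= [s length];
}
```

For the single Hangul syllable 한 (U+D55C), ComposedLength is 1 while DecomposedLength is 3, because Form D splits the syllable into its three jamo. A range computed on one form is therefore not valid on the other, and checking RangeFits before setting attributes turns a crash into a detectable error.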

How to decode \U201a\U00c4\U00f2\U201a\U00c4\U00f4 this? [duplicate]

I am getting a long text from the server, and that text contains characters such as \U201a\U00c4\U00f2He-Must-Not-Be-Named\U201a\U00c4\U00f4.
When I display the text in a text view, I get different characters.
How do I get normal text in Objective-C?
Please help me out with this.
When I receive the data from the server, I use
infoDictionary = [NSJSONSerialization JSONObjectWithData:data options:0 error:nil];
and from that infoDictionary I get text like
locks his cousin Dudley in the snake\U201a\U00c4\U00f4s captivity just in the blink of an eye. Each wand has a magical core such as phoenix\U201a\U00c4\U00f4s hair or dragon heartstring, that performs all the magic.
\n
And I assign this value to the text view like this:
detailsTextView.text = [infoDictionary objectForKey:@"DESCRIPTION"];
But in the text view I get different characters.
There are two possibilities, one more likely, one less likely.
The less likely one is that your server sends rubbish when it tries to translate its data into JSON.
The more likely one is that you are just frightening yourself, and there is nothing wrong. Something like \U201a\U00c4\U00f2He-Must-Not-Be-Named\U201a\U00c4\U00f4 is how non-ASCII characters are shown when a string is printed as part of a dictionary description, for example with NSLog. It is an escaped form of the Unicode code points, not UTF-8 and not the literal contents of the string. For example, \U201a is the Unicode character "Single Low-9 Quotation Mark". Use the Character Viewer in macOS to find out what the characters are if you are curious. If you NSLog the dictionary, you will see the same escapes. The characters themselves should be displayed in your text view perfectly fine.
However, in your case, the sequences \U201a\U00c4\U00f2 and \U201a\U00c4\U00f4 are highly unusual as text. They are what you get when the UTF-8 bytes of the curly quotes ‘ (E2 80 98) and ’ (E2 80 99) are decoded as MacRoman, so the text was most likely mis-decoded by the server code or before it was stored. If the stored data is already damaged, the proper fix is on the server side, so contact whoever is supplying this data.
In the meantime, there is something you can do on the client. You can use stringByReplacingOccurrencesOfString:withString: to replace the nonsense sequences with something sensible. I wouldn't expect the sequence \U201a\U00c4\U00f4 (as in snake\U201a\U00c4\U00f4s, where it stands for ’) to ever appear in a string that I display. So figure out which character belongs there (say, a quotation mark), get the description into an NSString, call stringByReplacingOccurrencesOfString:withString:, and use the result. There may be more strange combinations than just this one.
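As an alternative to replacing each sequence by hand, the damage can sometimes be reversed in one step. The code points U+201A U+00C4 U+00F4 are what the UTF-8 bytes E2 80 99 (the character ’) look like when decoded as MacRoman, so re-encoding the string as MacRoman and decoding the bytes as UTF-8 may recover the original text. This is a sketch under that assumption; the helper name RepairMacRomanMojibake is made up, and it only helps when the whole string was damaged in this particular way.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: undo "UTF-8 bytes decoded as MacRoman" damage.
// Returns the input unchanged when the round trip is not possible.
static NSString *RepairMacRomanMojibake(NSString *garbled) {
    // Recover the original bytes by encoding back to MacRoman (not lossy).
    NSData *bytes = [garbled dataUsingEncoding:NSMacOSRomanStringEncoding];
    if (bytes == nil) {
        return garbled; // contains characters MacRoman cannot represent
    }
    // Reinterpret those bytes as the UTF-8 they presumably were.
    NSString *repaired = [[NSString alloc] initWithData:bytes
                                               encoding:NSUTF8StringEncoding];
    return repaired ?: garbled; // the bytes were not valid UTF-8
}
```

Applied to the description from the question, the call would look like detailsTextView.text = RepairMacRomanMojibake(description);. Plain ASCII text passes through unchanged, and a string that cannot be round-tripped is returned as it was.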
stringWithUTF8String: takes a const char * as its argument, so there is no "@" symbol in front of it.
NSString *description = [infoDictionary objectForKey:@"DESCRIPTION"];
NSString *str = [NSString stringWithUTF8String:description.UTF8String];
detailsTextView.text = str;
Show this str in your text view.

Why we should use uppercaseStringWithLocale of NSString to get correct uppercase string?

I have a very simple task: I get a UTF-8 string from the server as a byte array, and I need to show all symbols from this string in upper case. The string can contain any symbol from the Unicode table.
I know how to do it in one line of code:
NSString* upperStr = [[NSString stringWithCString:utf8str encoding:NSUTF8StringEncoding] uppercaseString];
This seems to work with all the symbols I have checked. But I don't understand why we need the method uppercaseStringWithLocale:. When we work with Unicode, each symbol has a unique place in the Unicode table, and we can easily find out whether it has an upper- or lower-case representation. What trouble might I run into if I use uppercaseString instead of uppercaseStringWithLocale:?
Uppercasing is locale-dependent. For example, the uppercase version of "i" is "I" in English, but "İ" in Turkish. If you always apply English rules, uppercasing of Turkish words will end up wrong.
The docs say:
The following methods perform localized case mappings based on the
locale specified. Passing nil indicates the canonical mapping. For
the user preference locale setting, specify +[NSLocale currentLocale].
Presumably, in some locales the mapping from lowercase to uppercase changes even within a character set. I'm not an expert in every language around the globe, but the people who wrote these methods are, so I use them.
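The Turkish case mentioned above is easy to reproduce. The following is a minimal sketch; the helper name UppercaseForLocale and the locale identifiers "tr_TR" and "en_US" are chosen here for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: uppercase `s` using the rules of the given locale identifier.
static NSString *UppercaseForLocale(NSString *s, NSString *localeIdentifier) {
    NSLocale *locale = [NSLocale localeWithLocaleIdentifier:localeIdentifier];
    return [s uppercaseStringWithLocale:locale];
}
```

UppercaseForLocale(@"i", @"en_US") gives "I", while UppercaseForLocale(@"i", @"tr_TR") gives "İ" (U+0130, capital I with dot above). Text shown to the user would normally be uppercased with [NSLocale currentLocale] so that it follows the user's own language rules.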

How do I extract a list of email/mailbox strings within text or test if a string is a proper email address?

Given some arbitrary text, I'd like to extract all email addresses and 'mailbox specifiers' (e.g. "Fred Smith" <fred@me.com>). I looked at NSDataDetector, but it does not handle email addresses.
The way to approach this is to get a really good algorithm that can detect as many valid addresses as possible, and reject improper ones. Probably the best solution would be a parser constructed using lex and yacc, but reasonable solutions exist using regular expressions.
See this site for both a list of tested regular expressions as well as a more in-depth discussion of the problem and possible solutions.
The regular expressions shown on the above site are formatted for PHP, and have leading and trailing '/' markers, as well as 'flags' indicating case-insensitive etc (see this site for more info), so these need to be stripped off before using the expression in an Objective-C project. Also, any anchors need stripping too, since we want multiple addresses not just one (i.e., '^' and '$').
NSRegularExpression is the class to use here. What I've found helpful is to store the regular expression in a file in my project, so that you don't need to worry about escaping all the backslashes and quotes. The code then reads the expression into a string, and creates the object as follows:
NSString *fullPath = [[NSBundle mainBundle] pathForResource:self.regex ofType:@"txt"];
NSString *pattern = [NSString stringWithContentsOfFile:fullPath encoding:NSUTF8StringEncoding error:NULL];
__autoreleasing NSError *error = nil;
reg = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:&error]; // some patterns may not need NSRegularExpressionCaseInsensitive
assert(reg && !error);
Once you have an initialized expression, you use it to return a list of ranges, each range being an address:
NSArray *ret = [reg matchesInString:str options:0 range:NSMakeRange(0, [str length])];
However, we know that all email addresses contain one '@', so it's probably worthwhile to verify that the string has at least one before processing it. Also, since the text might have line feeds and/or carriage returns in it, you might want to strip those first. It's probably better to strip them completely, as some mail program might have split a line at some interior point of the address.
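Putting the pieces above together, a self-contained version might look like the sketch below. The helper name ExtractEmailAddresses is made up, and the inline pattern is a deliberately simple stand-in (an assumption for illustration, far less rigorous than the tested expressions mentioned earlier); a real project would load its pattern from a file as described.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns every substring of `text` that matches a
// simple address pattern. This is not an RFC 5322 parser.
static NSArray<NSString *> *ExtractEmailAddresses(NSString *text) {
    // Cheap pre-check: without an '@' there cannot be an address.
    if ([text rangeOfString:@"@"].location == NSNotFound) {
        return @[];
    }
    // Assumed pattern: local part, '@', domain labels, and a final label of 2+ letters.
    NSString *pattern = @"[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}";
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        return @[]; // the pattern failed to compile; `error` says why
    }
    NSMutableArray<NSString *> *found = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        // Each match range covers one address.
        [found addObject:[text substringWithRange:match.range]];
    }
    return found;
}
```

The match ranges are kept available inside the loop on purpose: they are what you would inspect next to decide whether an address is wrapped in '<' and '>' and has a name in front of it.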
Once you have a list of the address ranges, then for the most part the job is done, if all you wanted was the address. However, addresses are often presented in 'mailbox specifier' format, where a name is prepended to the address and the address is wrapped with '<' and '>'. This format is covered in RFC 5322, section 3.4.
To recover the name from a 'mailbox specifier', check to see if the address is wrapped with '<' and '>', and if so then find the string preceding the '<', ignoring white space (until you find the first character). Most names will be wrapped in double quotes (common practice), but actually can be naked alphanumeric strings using a backslash escape to include white space or other special characters (like '"').
This same technique can be used for real time verification - say to enable a submit button when a text string becomes a valid email address. In this case you evaluate the string on each user change, and enable/disable the submit button.
If all this seems like a lot of work to code, you can grab an open source project on github.
EDIT1: for a faster, but less rigorous, method see the comment by CodaFi.
EDIT2: it appears the content of a "mailto:" URL can be quite complex. The github project only handles the simplest form and does not un-encode the address. This will be addressed in a future update.
EDIT3: the project was updated to fully handle "mailto:" objects, and returns to, cc, bcc, subject, and body, all URLdecoded.

Grails UrlEncoding non latin characters like åäö

I have some link resources with non-Latin characters like åäö.
These are usually user-uploaded files.
The problem is that I am not successful in encoding them:
filename.encodeAsURL does not seem to encode them the right way.
For example, the character ö is turned into o%CC%88.
Typing the same thing in Firefox and copying the contents gives %C3%B6.
What is the difference between these encodings, and which should I use to get the correct encoding?
Both encodings are correct. You are actually seeing the encoding of two different strings.
The key here is noticing the o at the beginning of the string:
o%CC%88 is the letter o followed by Unicode Character Combining Diaeresis, which combines with the previous character when rendered.
%C3%B6 is Unicode Character Latin Small O With Diaeresis.
What you are seeing is that in the first case, the string entered is something like these two characters: o ¨, which are actually rendered as ö.
In the second case, it's the actual character ö.
My guess is you are seeing the difference between two different inputs.
Update based on below discussion: If you are dynamically processing Unicode characters, and you do not have control over the input methods, you can try to normalize the Unicode, using java.text.Normalizer (Java 1.6 or newer).
Normalizing attempts to ensure that all characters are consistently represented, so that accented characters are always represented by a combined character or always by the character+combining mark.
Rough example:
String.metaClass.normalizeUnicode = {
return java.text.Normalizer.normalize(delegate, java.text.Normalizer.Form.NFC)
}
input = input.normalizeUnicode()
There are four forms of normalization. I picked the one that seems to be best for your case based on the description of how they work, but you may prefer to try the other ones and see what works most consistently.
All that being said, if you are trying to represent Unicode characters in a URL, and they are not being loaded and processed by the code directly, it's probably best to avoid using non-Latin characters altogether. Not only does this have the benefit of consistency, it also gives significantly shorter and more legible URLs. boo.pdf is a lot easier to read than bo%CC%88o.pdf.
