NSString with Cyrillic to UTF8/Latin encoding - ios

I have a string coming in from a web service; it's a mixture of Cyrillic and Latin/English characters. When I build an array by separating the words of the sentence, NSLog shows Unicode escapes in place of the letters. I want to know how to convert any of the Cyrillic/Unicode characters into a proper readable Latin/English word. For example:
NSString *sentence = @"The Tobе Elіte"; // the e in Tobe and the i in Elite are Cyrillic
After putting each word of the string into an array, printing it gives this:
(
The,
"Tob\U0435",
"El\U0456te"
)
I need this transliterated to Latin "Tobe" and Latin "Elite". If I compare what I have now:
if (![@"Tobe" isEqualToString:[array objectAtIndex:1]])
    // "Tobe" is not equal to "Tob\U0435"
I apologize if I explained this poorly; if you have any questions that would help you better understand my problem, feel free to ask. I have tried several things to get this encoded as proper UTF-8. For example, this does not work:
NSMutableString *buffer = [string mutableCopy];
CFMutableStringRef bufferRef = (__bridge CFMutableStringRef)buffer;
CFStringTransform(bufferRef, NULL, kCFStringTransformToLatin, false);
Ultimately I need to search the array for matching words using NSPredicate, but with the Cyrillic characters in the array it does not find them. Any help is appreciated.

This works for me:
NSString *sentence = @"The Tobе Elіte";
NSMutableString *buffer = [sentence mutableCopy];
CFMutableStringRef bufferRef = (__bridge CFMutableStringRef)buffer;
CFStringTransform(bufferRef, NULL, kCFStringTransformToLatin, false);
CFStringTransform(bufferRef, NULL, kCFStringTransformStripDiacritics, false);
NSArray *arr = [buffer componentsSeparatedByString:@" "];
NSLog(@"%@", arr);
and you can find some more info here:
http://nshipster.com/cfstringtransform/
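Since the question mentions ultimately searching the array with NSPredicate, here is a minimal sketch of how that might look once the transliteration above has run (the predicate format and the search term @"Tobe" are just illustrative):
NSMutableString *buffer = [sentence mutableCopy];
CFStringTransform((__bridge CFMutableStringRef)buffer, NULL, kCFStringTransformToLatin, false);
CFStringTransform((__bridge CFMutableStringRef)buffer, NULL, kCFStringTransformStripDiacritics, false);
NSArray *words = [buffer componentsSeparatedByString:@" "];

// [cd] makes the comparison case- and diacritic-insensitive.
NSPredicate *predicate = [NSPredicate predicateWithFormat:@"SELF ==[cd] %@", @"Tobe"];
NSArray *matches = [words filteredArrayUsingPredicate:predicate];
NSLog(@"%@", matches); // ( Tobe )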

Related

Substring char * in Objective C

I need to take a substring of a char * up to some length and convert it to an NSString. Given char *val and a length, I tried:
NSString *tempString = [NSString stringWithCString:val encoding:NSASCIIStringEncoding];
NSRange range = NSMakeRange (0, length);
NSString *finalValue = [tempString substringWithRange: range];
This works, but not for languages with special characters, such as Chinese.
If I convert to UTF-8 encoding, the substring length no longer matches.
Is there another way to take the substring of the char* and then convert it with UTF-8 encoding?
You have to use the encoding that the string is actually encoded in.
In your case, you are telling it to interpret the string as ASCII. ASCII does not have Chinese characters, so this cannot work with Chinese characters: they are simply not there.
Most likely you have a UTF-8 encoded string. But simply switching to UTF-8 does not help on its own. NSString on OS X/iOS stores text as 16-bit UTF-16 code units, while Unicode code points can lie beyond the 16-bit range, so Chinese characters outside the Basic Multilingual Plane need two code units (a surrogate pair). This has some effects; for example, -length returns the number of code units, not the number of Chinese characters. However, with -rangeOfComposedCharacterSequencesForRange: you can adjust the range.
For example, 𠀖 (CJK unified ideograph U+20016):
NSString *str = @"𠀖"; // one Chinese character
NSLog(@"%lu", (unsigned long)[str length]); // these are "2" code units
NSRange range = {0, 1}; // range for the "first" character
NSLog(@"%lu %lu", (unsigned long)range.location, (unsigned long)range.length); // 0 1
range = [str rangeOfComposedCharacterSequencesForRange:range];
NSLog(@"%lu %lu", (unsigned long)range.location, (unsigned long)range.length); // 0 2
You can get a better answer if you add information about the encoding of the incoming string and the encoding required for output.
Strings are not "UTF-8 strings" or the like. Strings are strings. Their storage, their representation in computer memory, has an encoding, but the strings themselves don't have one.
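To make that distinction concrete, here is a small sketch using the character from the example above: the same string has one count in NSString's internal UTF-16 storage and a different byte count in a UTF-8 representation.
NSString *s = @"𠀖"; // U+20016, outside the Basic Multilingual Plane
NSLog(@"%lu", (unsigned long)s.length); // 2: UTF-16 code units in NSString's storage
NSLog(@"%lu", (unsigned long)[s lengthOfBytesUsingEncoding:NSUTF8StringEncoding]); // 4: bytes as UTF-8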
I found the solution to my question:
char subString[length+1];
strncpy(subString, val, length);
subString[length] = '\0'; // place the null terminator
NSString *finalString = [NSString stringWithCString: subString encoding:NSUTF8StringEncoding];
This does both the char * substring and the UTF-8 conversion.
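One caveat with this solution (not raised in the original answer): if length cuts through the middle of a multi-byte UTF-8 sequence, stringWithCString:encoding: returns nil. A hedged sketch of a guard for that case, assuming length is a mutable integer variable:
char subString[length + 1];
strncpy(subString, val, length);
subString[length] = '\0';

NSString *finalString = [NSString stringWithCString:subString
                                           encoding:NSUTF8StringEncoding];
// If the cut landed inside a multi-byte sequence, back up one byte
// at a time until the buffer ends on a valid UTF-8 boundary.
while (finalString == nil && length > 0) {
    subString[--length] = '\0';
    finalString = [NSString stringWithCString:subString
                                     encoding:NSUTF8StringEncoding];
}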

Copyright/Registered symbol encoding not working

I've developed an iOS app in which we can send emojis from iOS to a web portal and vice versa. All emojis sent from iOS to the web portal display perfectly except © and ®.
Here is the emoji encoding piece of code.
NSData *data = [messageBody dataUsingEncoding:NSNonLossyASCIIStringEncoding];
NSString *encodedString = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
// This code returns \251 and \256 as the codes for the copyright and registered symbols; since these are not standard \uXXXX escapes, they don't display on the web portal.
So what should I do to convert them to standard Unicode escapes?
Test run:
messageBody = @"Copy right symbol : © AND Registered Mark symbol : ®";
// The encoded string I get from the above is:
Copy right symbol : \\251 AND Registered Mark symbol : \\256
whereas it should look like this (with standard Unicode escapes):
Copy right symbol : \\u00A9 AND Registered Mark symbol : \\u00AE
First, I will try to provide the solution. Then I will try to explain why.
Escaping non-ASCII chars
To escape Unicode characters in a string, you shouldn't rely on NSNonLossyASCIIStringEncoding. Below is the code I use to escape Unicode and other non-ASCII characters in a string:
// NSMutableString category
- (void)appendChar:(unichar)charToAppend {
    [self appendFormat:@"%C", charToAppend];
}

// NSString category
- (NSString *)UEscapedString {
    char const hexChar[] = "0123456789ABCDEF";
    NSMutableString *outputString = [NSMutableString string];
    for (NSInteger i = 0; i < self.length; i++) {
        unichar character = [self characterAtIndex:i];
        if ((character >> 7) > 0) {
            [outputString appendString:@"\\u"];
            [outputString appendChar:(hexChar[(character >> 12) & 0xF])]; // hex for the left-most 4 bits
            [outputString appendChar:(hexChar[(character >> 8) & 0xF])];  // hex for the second group of 4 bits
            [outputString appendChar:(hexChar[(character >> 4) & 0xF])];  // hex for the third group
            [outputString appendChar:(hexChar[character & 0xF])];         // hex for the right-most 4 bits
        } else {
            [outputString appendChar:character];
        }
    }
    return [outputString copy];
}
(NOTE: I guess Jon Rose's method does the same, but I didn't want to share a method that I hadn't tested.)
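For illustration, the category method above might be used like this (with the messageBody from the question):
NSString *messageBody = @"Copy right symbol : © AND Registered Mark symbol : ®";
NSString *escapedString = [messageBody UEscapedString];
NSLog(@"%@", escapedString);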
Now you have the following string: Copy right symbol : \u00A9 AND Registered Mark symbol : \u00AE
Escaping unicode is done. Now let's convert it back to display the emojis.
Converting back
This is going to be confusing at first, but this is what it is:
NSData *data = [escapedString dataUsingEncoding:NSUTF8StringEncoding];
NSString *converted = [[NSString alloc] initWithData:data encoding:NSNonLossyASCIIStringEncoding];
Now you have your emojis (and other non-ASCIIs) back.
What is happening?
The problem
In your case, you are trying to create a common language between your server side and your app. However, NSNonLossyASCIIStringEncoding is a poor choice for this purpose, because it is a black box created by Apple and we don't really know exactly what it does inside. As we can see, it converts some characters into \uXXXX while converting other non-ASCII characters into octal \XXX escapes. That is why you shouldn't rely on it to build a multi-platform system: there is no equivalent of it on backend platforms or Android.
Mysteriously, NSNonLossyASCIIStringEncoding can still convert ® back from \u00AE even though it converts it to \256 in the first place. In any case, there are tools on other platforms to convert \uXXXX into Unicode characters, so that shouldn't be a problem for you.
messageBody is a string; there is no reason to convert it to data only to convert it back to a string. Replace your code with:
NSString *encodedString = messageBody;
If the messageBody object is incorrect, then the way to fix it is to change the way it was created. The server sends data, not strings. The data that the server sends is encoded in some agreed-upon way. Generally this encoding is UTF-8. If you know the encoding, you can convert the data to a string; if you don't, the data is gibberish that cannot be read. If messageBody is incorrect, the problem occurred when it was converted from the data the server sent. It seems likely that you are parsing it with the incorrect encoding.
The code you posted is just plain wrong. It converts a string to data using one encoding (ASCII) and then reads that data with a different encoding (UTF-8). That is like translating a book to Spanish and then having a Portuguese speaker translate it back: it might work for some words, but it is still wrong.
If you are still having trouble then you should share the code of where messageBody is created.
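For reference, a minimal sketch of the decoding step this answer describes, assuming the agreed-upon encoding is UTF-8 (receivedData is a hypothetical NSData from the server):
NSString *messageBody = [[NSString alloc] initWithData:receivedData
                                              encoding:NSUTF8StringEncoding];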
If your server expects an ASCII string with all Unicode characters changed to \u00XX, then you should first yell at your server guy, because he is an idiot. But if that doesn't work, you can use the following code:
NSString *messageBody = @"Copy right symbol : © AND Registered Mark symbol : ®";
NSData *utf32Data = [messageBody dataUsingEncoding:NSUTF32StringEncoding];
const uint32_t *bytes = (const uint32_t *)[utf32Data bytes];
NSMutableString *escapedString = [[NSMutableString alloc] init];
// Start at 1 because the first 4 bytes are the byte-order mark
for (NSUInteger index = 1; index < utf32Data.length / 4; index++) {
    uint32_t charValue = bytes[index];
    if (charValue <= 127) {
        [escapedString appendFormat:@"%C", (unichar)charValue];
    } else {
        [escapedString appendFormat:@"\\u%04X", charValue];
    }
}
I really do not understand your problem.
You can simply convert ANY character into NSData and back into a string.
You can simply pass a UTF-8 string, including both emoji and other symbols, in a POST request.
NSString* newStr = [[NSString alloc] initWithData:theData encoding:NSUTF8StringEncoding];
NSData* data = [newStr dataUsingEncoding:NSUTF8StringEncoding];
This works for both the server and the client side.
But, of course, there is the other problem that some fonts do not support all UTF-8 characters. That's why, e.g., in a terminal you might not see some of them. But this is beyond the scope of this question.
NSNonLossyASCIIStringEncoding is used only when you really want to convert a symbol into a chain of symbols. But it is not needed here.

Way to detect character that takes up more than one index spot in an NSString?

I'm wondering, is there a way to detect a character that takes up more than one index spot in an NSString (like an emoji)? I'm trying to implement a custom text view, and when the user presses delete, I need to know whether I should delete only the previous index spot or more.
Actually, NSString uses UTF-16, so it is quite difficult to work with characters that take two UTF-16 code units (unichars) or more. But you can use rangeOfComposedCharacterSequenceAtIndex: to get the range and then delete.
First, find the index of the last unichar in the string:
NSUInteger lastCharIndex = [str length] - 1;
Then get the range of the last composed character:
NSRange lastCharRange = [str rangeOfComposedCharacterSequenceAtIndex:lastCharIndex];
Then delete using that range (if the character consists of two UTF-16 code units, both are removed):
NSString *deletedLastCharString = [str substringToIndex:lastCharRange.location];
You can use this method with any type of character, no matter how many unichars it takes.
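An alternative sketch (not from the original answer): NSString can also enumerate composed character sequences directly, which is handy if you need to walk every user-perceived character rather than just the last one:
__block NSRange lastRange = NSMakeRange(NSNotFound, 0);
[str enumerateSubstringsInRange:NSMakeRange(0, str.length)
                        options:NSStringEnumerationByComposedCharacterSequences
                     usingBlock:^(NSString *substring, NSRange substringRange,
                                  NSRange enclosingRange, BOOL *stop) {
    // Each callback covers one user-perceived character,
    // however many unichars it spans.
    lastRange = substringRange;
}];
if (lastRange.location != NSNotFound) {
    NSString *deletedLastCharString = [str substringToIndex:lastRange.location];
    NSLog(@"%@", deletedLastCharString);
}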
For one, you could transform the string to a sequence of bytes using [myString UTF8String] and then check whether each byte has its high bit set. If it does, the byte is part of a multi-byte UTF-8 character, and you can determine how many bytes that character occupies. Details about UTF-8 can be found on Wikipedia - UTF8. Here is a simple example:
NSString *string = @"ČTest";
const char *str = [string UTF8String];
NSMutableString *ASCIIStr = [NSMutableString string];
for (size_t i = 0; i < strlen(str); ++i)
    if (!(str[i] & 128))
        [ASCIIStr appendFormat:@"%c", str[i]];
NSLog(@"%@", ASCIIStr); // should contain only ASCII characters

Converting NSString to unichar in iOS

I have seen questions on Stack Overflow that convert a unichar to an NSString, but now I would like to do the reverse.
How do I do it?
I need some guidance. Thanks.
For example, I have an array of strings: @[@"o", @"p", @"q"].
How do I convert these back to unichar?
The following will work as long as the first character isn't composed of more than one UTF-16 code unit (in other words, as long as the character doesn't have a Unicode value greater than U+FFFF):
unichar ch = [someString characterAtIndex:0];
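Applied to the array from the question, that looks like this (each string is assumed to hold a single BMP character):
NSArray *strings = @[@"o", @"p", @"q"];
for (NSString *s in strings) {
    unichar ch = [s characterAtIndex:0];
    NSLog(@"%C = 0x%04X", ch, ch); // e.g. "o = 0x006F"
}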
You could convert it to a buffer in NSData:
if ([string canBeConvertedToEncoding:NSUnicodeStringEncoding]) {
    NSData *data = [string dataUsingEncoding:NSUnicodeStringEncoding];
    const unichar *ptr = (const unichar *)data.bytes;
    ...
}

Extract text from a NSString using regular expressions

I have an NSString in this format:
"Key1-Value1,Key2-Value2,Key3-Value3,..."
I need only keys (with a space after every comma):
Key1, Key2, Key3, etc.
I thought to create an array of components from the string using the comma as a separator, and then, for every component, extract the characters up to the "-"; then I'd rejoin the array elements. But I fear this could be quite heavy performance-wise.
Do you know a way to do this using regular expressions?
The regex will greatly depend on the data you are using. For example, if a key or value is allowed to be all numbers, or to contain spaces and punctuation, you would need to modify the regex. For your current example, however, this will work:
NSString *example = @"Key1-Value1,Key2-Value2,Key3-Value3,...";
NSString *result = [example stringByReplacingOccurrencesOfString:@"(\\w+)-(\\w+),?"
                                                      withString:@"$1, "
                                                         options:NSRegularExpressionSearch
                                                           range:NSMakeRange(0, [example length])];
result = [result stringByTrimmingCharactersInSet:[NSCharacterSet characterSetWithCharactersInString:@", "]];
NSLog(@"%@", result);
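If you'd rather end up with the keys in an array instead of one joined string, here is a sketch with NSRegularExpression (same pattern assumptions as above, reusing the example string):
NSError *error = nil;
NSRegularExpression *regex =
    [NSRegularExpression regularExpressionWithPattern:@"(\\w+)-\\w+"
                                              options:0
                                                error:&error];
NSMutableArray *keys = [NSMutableArray array];
[regex enumerateMatchesInString:example
                        options:0
                          range:NSMakeRange(0, example.length)
                     usingBlock:^(NSTextCheckingResult *match,
                                  NSMatchingFlags flags, BOOL *stop) {
    // Capture group 1 holds the key.
    [keys addObject:[example substringWithRange:[match rangeAtIndex:1]]];
}];
NSLog(@"%@", keys); // ( Key1, Key2, Key3 )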
