Objective-C: some special characters changing uncontrollably - iOS

I have a string that includes some special characters (like é, â, î, ı, etc.). When I use substring on this string, I encounter inconsistent results: some special characters change uncontrollably.

You are assuming that these are all characters:
[newword substringWithRange:NSMakeRange(0,1)];
[newword substringWithRange:NSMakeRange(1,1)];
[newword substringWithRange:NSMakeRange(2,1)];
[newword substringWithRange:NSMakeRange(3,1)];
// and so on...
In other words, you believe that:
A location always falls at the start of a character.
A character always has length 1.
Both assumptions are wrong. Please read the Characters and Grapheme Clusters chapter of Apple's String Programming Guide (here).
Your é happens to have length 2, because it is a base letter e followed by a combining diacritical accent. If you want it to have length 1, you need to normalize the string before you use it. Call precomposedStringWithCanonicalMapping and use the resulting string.
Example and proof (in Swift, but it won't matter, as I use NSString throughout):
let s = "é,â,î,ı" as NSString
let c = s.substring(with: NSRange(location: 0, length: 1)) // e
let s2 = s.precomposedStringWithCanonicalMapping as NSString
let c2 = s2.substring(with: NSRange(location: 0, length: 1)) // é
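As an aside (not part of the original answer): if you need to walk the string one user-perceived character at a time without normalizing it first, you can enumerate composed character sequences instead of slicing fixed-length ranges. A minimal Swift sketch:

import Foundation

// Force the decomposed form so the effect is visible: some "characters" occupy two UTF-16 units.
let newword = ("é,â,î,ı" as NSString).decomposedStringWithCanonicalMapping as NSString

newword.enumerateSubstrings(in: NSRange(location: 0, length: newword.length),
                            options: .byComposedCharacterSequences) { substring, range, _, _ in
    print(substring ?? "", range)   // one grapheme cluster per callback, with its actual NSRange
}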

You're treating a Unicode string like a sequence of bytes. Unicode code points outside the low ASCII range can be multi-byte in UTF-8, so you are changing the text by stripping out the parts responsible for the accent above the letter, such as this one: https://www.compart.com/en/unicode/U+0301
UTF-8 is variable-width, so by treating it as raw bytes you may get weird results. I would suggest using something that is more Unicode-aware, like ICU (International Components for Unicode).
Now imagine you have a byte sequence like this (this may not be 100% accurate but it illustrates my point):
0x65 0x00
e    NUL
Now you have a UTF-8 string with one code point and a null terminator. Now say you want to add an accent to that e. How would you do that? You could use a combining Unicode code point to modify the e, so now the string is:
0x65 0xCC 0x81 0x00
e    U+0301    NUL
where U+0301 (Combining Acute Accent) is a two-byte sequence in UTF-8 that makes the e accented.
Edit: The answer assumes UTF-8 encoding, which is likely a bad assumption, but I think the answer, whether for UTF-8, UTF-16, or any other encoding with combining characters, illustrates why you may get mysteriously disappearing accents. While this may actually be UTF-16, for the sake of simplicity let's pretend we live in a world where life is just slightly better because everyone only uses UTF-8 and UTF-16 doesn't exist.
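To make the byte-level picture concrete, here is a small Swift check (assuming UTF-8) showing that the decomposed and precomposed forms of é really are different byte sequences:

import Foundation

let decomposed = "e\u{301}"   // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
let precomposed = "\u{e9}"    // LATIN SMALL LETTER E WITH ACUTE

print(Array(decomposed.utf8).map { String(format: "0x%02X", $0) })
// ["0x65", "0xCC", "0x81"] -> 'e', then the two-byte UTF-8 encoding of U+0301
print(Array(precomposed.utf8).map { String(format: "0x%02X", $0) })
// ["0xC3", "0xA9"]         -> the single precomposed code point U+00E9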
To address the comment (this is less to do with the question, but it is some fun trivia) and for some fun details about the NS/CF/Swift runtimes, bridging, constant CF strings, and other fun stuff like that: the representation of the actual string in memory is implementation defined and can vary (even for constant strings; trust me, I know, I fixed the ELF implementation of them in Clang for CoreFoundation a few days ago). Anyway, here's some code:
CF_INLINE CFStringEncoding __CFStringGetSystemEncoding(void) {
    if (__CFDefaultSystemEncoding == kCFStringEncodingInvalidId) (void)CFStringGetSystemEncoding();
    return __CFDefaultSystemEncoding;
}

CFStringEncoding CFStringFileSystemEncoding(void) {
    if (__CFDefaultFileSystemEncoding == kCFStringEncodingInvalidId) {
#if DEPLOYMENT_TARGET_MACOSX || DEPLOYMENT_TARGET_EMBEDDED || DEPLOYMENT_TARGET_EMBEDDED_MINI || DEPLOYMENT_TARGET_WINDOWS
        __CFDefaultFileSystemEncoding = kCFStringEncodingUTF8;
#else
        __CFDefaultFileSystemEncoding = CFStringGetSystemEncoding();
#endif
    }
    return __CFDefaultFileSystemEncoding;
}
This sort of thing appears throughout CoreFoundation/Foundation/SwiftFoundation. (Yes, you never know which sort of NSString you're actually holding; they usually pretend to be the same thing, but under the hood, depending on how you got the object, you may be holding one of the three variations of it.)
This is why code like this exists, because NS/CF(Constant)/Swift strings have implementation defined internal representation.
if (((encoding & 0x0FFF) == kCFStringEncodingUnicode) && ((encoding == kCFStringEncodingUnicode) || ((encoding > kCFStringEncodingUTF8) && (encoding <= kCFStringEncodingUTF32LE)))) {
If you want consistent behavior you have to encode the string using a specific fixed encoding instead of relying on the internal representation.
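For instance, a minimal Swift sketch of that idea: normalize first (optional, but it removes one source of variation), then encode with an explicit, fixed encoding so the bytes never depend on the string's internal representation:

import Foundation

let input = "re\u{301}sume\u{301}"                        // decomposed "résumé"
let normalized = input.precomposedStringWithCanonicalMapping
let bytes = normalized.data(using: .utf8)!                // always the same byte sequence

print(bytes.map { String(format: "%02x", $0) }.joined(separator: " "))
// 72 c3 a9 73 75 6d c3 a9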

Related

Firebase or Swift not detecting umlauts

I found something very strange with Firebase Database/Storage: I can't tell whether it is Firebase or Swift that is not handling umlauts, e.g. (ä, ö, ü).
I did some simple things with Firebase, like uploading images to Firebase Storage and then downloading them into a table view. Some of my .png files had umlauts in the title, for example (Röda.png).
The problem occurs when I download them: the only time my download URL is nil is when the file name contains the umlauts I was talking about.
So I tried some alternatives, like the HTML entity for ö (&ouml;), but this is not working. Can you suggest something? I can't just replace ö with o, ü with u, etc.
This is the code when url is nil when trying to set some values into Firebase:
FIRStorage.storage().reference()
    .child("\(productImageref!).png")
    .downloadURLWithCompletion({ (url, error) in
        FIRDatabase.database().reference()
            .child("Snuses").child(productImageref!).child("productUrl")
            .setValue(url!.absoluteString)
        let resource = Resource(downloadURL: url!, cacheKey: productImageref)
        // ...
    })
After spending a fair bit of time researching your problem, the difference boils down to how the character ö is encoded, and I traced it down to Unicode normalization forms.
The letter ö can be written in two ways, and String / NSString considers them equal:
let str1 = "o\u{308}" // decomposed : latin small letter o + combining diaeresis
let str2 = "\u{f6}" // precomposed: latin small letter o with diaeresis
print(str1, str2, str1 == str2) // ö ö true
But when you percent-encode them, they produce different results:
print(str1.stringByAddingPercentEncodingWithAllowedCharacters(.URLPathAllowedCharacterSet())!)
print(str2.stringByAddingPercentEncodingWithAllowedCharacters(.URLPathAllowedCharacterSet())!)
// o%CC%88
// %C3%B6
My guess is that Google / Firebase chooses the decomposed form while Apple prefers the other in its text input system. You can convert the file name to its decomposed form to match Firebase:
let str3 = str2.decomposedStringWithCanonicalMapping
print(str3.stringByAddingPercentEncodingWithAllowedCharacters(.URLPathAllowedCharacterSet()))
// o%CC%88
This is irrelevant for ASCII-ranged characters. Unicode can be very confusing.
References:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (highly recommended)
Strings in Swift 2
NSString and Unicode
Horray for Unicode!
The short answer is that no, we're actually not doing anything special here. Basically all we do under the hood is:
// This is the list at https://cloud.google.com/storage/docs/json_api/ without the & because query parameters
NSString *const kGCSObjectAllowedCharacterSet =
    @"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~!$'()*+,;=:@";
- (nullable NSString *)GCSEscapedString:(NSString *)string {
    NSCharacterSet *allowedCharacters =
        [NSCharacterSet characterSetWithCharactersInString:kGCSObjectAllowedCharacterSet];
    return [string stringByAddingPercentEncodingWithAllowedCharacters:allowedCharacters];
}
What blows my mind is that:
let str1 = "o\u{308}" // decomposed : latin small letter o + combining diaeresis
let str2 = "\u{f6}" // precomposed: latin small letter o with diaeresis
print(str1, str2, str1 == str2) // ö ö true
returns true. In Objective-C (which the Firebase Storage client is built in), it totally shouldn't, as they're two totally different characters (in actuality, the length of str1 is 2 while the length of str2 is 1 in Obj-C, while in Swift I assume the answer is 1 for both).
Apple must be normalizing strings before comparison in Swift (probably a reasonable thing to do, since otherwise it leads to bugs like this where strings are "the same" but compare differently). Turns out, this is exactly what they do (see the "Extended Grapheme Clusters" section of their docs).
So, when you provide two different characters in Swift, they're being propagated to Obj-C as different characters and thus are encoded differently. Not a bug, just one of the many differences between Swift's String type and Obj-C's NSString type. When in doubt, choose a canonical representation you expect and stick with it, but as a library developer, it's very hard for us to choose that representation for you.
Thus, when naming files that contain Unicode characters, make sure to pick a standard normalization form (C, D, KC, or KD) and always use it when creating references.
let imageName = "smorgasbörd.jpg"
let path = "images/\(imageName)"
let decomposedPath = path.decomposedStringWithCanonicalMapping // Unicode Form D
let ref = FIRStorage.storage().reference().child(decomposedPath)
// use this ref and you'll always get the same objects

Regular Expressions in iOS [duplicate]

I'm creating a regexp for password validation to be used in a Java application as a configuration parameter.
The regexp is:
^.*(?=.{8,})(?=..*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=]).*$
The password policy is:
At least 8 chars
Contains at least one digit
Contains at least one lower alpha char and one upper alpha char
Contains at least one char within a set of special chars (@#%$^ etc.)
Does not contain space, tab, etc.
I’m missing just point 5. I'm not able to have the regexp check for space, tab, carriage return, etc.
Could anyone help me?
Try this:
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\S+$).{8,}$
Explanation:
^ # start-of-string
(?=.*[0-9]) # a digit must occur at least once
(?=.*[a-z]) # a lower case letter must occur at least once
(?=.*[A-Z]) # an upper case letter must occur at least once
(?=.*[@#$%^&+=]) # a special character must occur at least once
(?=\S+$) # no whitespace allowed in the entire string
.{8,} # anything, at least eight places though
$ # end-of-string
It's easy to add, modify or remove individual rules, since every rule is an independent "module".
The (?=.*[xyz]) construct eats the entire string (.*) and backtracks to the first occurrence where [xyz] can match. It succeeds if [xyz] is found, it fails otherwise.
The alternative would be using a reluctant qualifier: (?=.*?[xyz]). For a password check, this will hardly make any difference, for much longer strings it could be the more efficient variant.
The most efficient variant (but hardest to read and maintain, and therefore the most error-prone) would be (?=[^xyz]*[xyz]), of course. For a regex of this length and for this purpose, I would advise against doing it that way, as it has no real benefits.
simple example using regex
public class passwordvalidation {
    public static void main(String[] args) {
        String passwd = "aaZZa44@";
        String pattern = "(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\\S+$).{8,}";
        System.out.println(passwd.matches(pattern));
    }
}
Explanations:
(?=.*[0-9]) a digit must occur at least once
(?=.*[a-z]) a lower case letter must occur at least once
(?=.*[A-Z]) an upper case letter must occur at least once
(?=.*[@#$%^&+=]) a special character must occur at least once
(?=\\S+$) no whitespace allowed in the entire string
.{8,} at least 8 characters
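Since the question is tagged iOS, here is a hedged sketch of how the same pattern can be applied from Swift (ICU's regex engine supports the lookaheads used above; the pattern string itself is unchanged):

import Foundation

let pattern = "^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\\S+$).{8,}$"
// NSPredicate's MATCHES requires the whole string to match the pattern.
let isValid = NSPredicate(format: "SELF MATCHES %@", pattern)

print(isValid.evaluate(with: "aaZZa44@"))   // true
print(isValid.evaluate(with: "aa ZZa44@"))  // false (contains whitespace)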
All the previously given answers use the same (correct) technique to use a separate lookahead for each requirement. But they contain a couple of inefficiencies and a potentially massive bug, depending on the back end that will actually use the password.
I'll start with the regex from the accepted answer:
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\S+$).{8,}$
First of all, since Java supports \A and \z I prefer to use those to make sure the entire string is validated, independently of Pattern.MULTILINE. This doesn't affect performance, but avoids mistakes when regexes are recycled.
\A(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\S+$).{8,}\z
Checking that the password does not contain whitespace and checking its minimum length can be done in a single pass by putting the variable quantifier {8,} on the shorthand \S that limits the allowed characters:
\A(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])\S{8,}\z
If the provided password does contain a space, all the checks will be done, only to have the final check fail on the space. This can be avoided by replacing all the dots with \S:
\A(?=\S*[0-9])(?=\S*[a-z])(?=\S*[A-Z])(?=\S*[@#$%^&+=])\S{8,}\z
The dot should only be used if you really want to allow any character. Otherwise, use a (negated) character class to limit your regex to only those characters that are really permitted. Though it makes little difference in this case, not using the dot when something else is more appropriate is a very good habit. I see far too many cases of catastrophic backtracking because the developer was too lazy to use something more appropriate than the dot.
Since there's a good chance the initial tests will find an appropriate character in the first half of the password, a lazy quantifier can be more efficient:
\A(?=\S*?[0-9])(?=\S*?[a-z])(?=\S*?[A-Z])(?=\S*?[@#$%^&+=])\S{8,}\z
But now for the really important issue: none of the answers mentions the fact that the original question seems to be written by somebody who thinks in ASCII. But in Java, strings are Unicode. Are non-ASCII characters allowed in passwords? If they are, are only ASCII spaces disallowed, or should all Unicode whitespace be excluded?
By default \s matches only ASCII whitespace, so its inverse \S matches all Unicode characters (whitespace or not) and all non-whitespace ASCII characters. If Unicode characters are allowed but Unicode spaces are not, the UNICODE_CHARACTER_CLASS flag can be specified to make \S exclude Unicode whitespace. If Unicode characters are not allowed, then [\x21-\x7E] can be used instead of \S to match all ASCII characters that are not a space or a control character.
Which brings us to the next potential issue: do we want to allow control characters? The first step in writing a proper regex is to exactly specify what you want to match and what you don't. The only 100% technically correct answer is that the password specification in the question is ambiguous because it does not state whether certain ranges of characters like control characters or non-ASCII characters are permitted or not.
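As an illustration of the [\x21-\x7E] option mentioned above, written in Swift against ICU's regex engine (the surrounding question is tagged iOS): substituting the explicit printable-ASCII class for \S behaves like this:

import Foundation

let asciiOnlyPattern =
    "\\A(?=[\\x21-\\x7E]*[0-9])(?=[\\x21-\\x7E]*[a-z])(?=[\\x21-\\x7E]*[A-Z])" +
    "(?=[\\x21-\\x7E]*[@#$%^&+=])[\\x21-\\x7E]{8,}\\z"
let asciiOnly = NSPredicate(format: "SELF MATCHES %@", asciiOnlyPattern)

print(asciiOnly.evaluate(with: "aaZZa44@"))    // true
print(asciiOnly.evaluate(with: "aaZZa44@é"))   // false: é falls outside \x21-\x7E
print(asciiOnly.evaluate(with: "aa ZZa44@"))   // false: the space falls outside \x21-\x7E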
You should not use overly complex Regex (if you can avoid them) because they are
hard to read (at least for everyone but yourself)
hard to extend
hard to debug
Although there might be a small performance overhead in using many small regular expressions, the points above outweigh it easily.
I would implement it like this (a Swift version of the same idea follows below):
bool matchesPolicy(pwd) {
    if (pwd.length < 8) return false;
    if (not pwd =~ /[0-9]/) return false;
    if (not pwd =~ /[a-z]/) return false;
    if (not pwd =~ /[A-Z]/) return false;
    if (not pwd =~ /[%#$^]/) return false;
    if (pwd =~ /\s/) return false;
    return true;
}
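A minimal Swift translation of the sketch above, using one small check per rule (the special-character set is just the illustrative one used in the pseudocode):

import Foundation

func matchesPolicy(_ pwd: String) -> Bool {
    // Helper: does any part of the password match this small regex?
    func contains(_ pattern: String) -> Bool {
        return pwd.range(of: pattern, options: .regularExpression) != nil
    }
    if pwd.count < 8 { return false }
    if !contains("[0-9]") { return false }
    if !contains("[a-z]") { return false }
    if !contains("[A-Z]") { return false }
    if !contains("[%#$^]") { return false }
    if contains("\\s") { return false }
    return true
}

print(matchesPolicy("aaZZa44#"))   // true
print(matchesPolicy("aaZZa44"))    // false: too short and no special character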
Thanks for all the answers. This is based on all of them, but extends the special characters:
@SuppressWarnings({"regexp", "RegExpUnexpectedAnchor", "RegExpRedundantEscape"})
String PASSWORD_SPECIAL_CHARS = "@#$%^`<>&+=\"!ºª·#~%&'¿¡€,:;*/+-.=_\\[\\]\\(\\)\\|\\_\\?\\\\";
int PASSWORD_MIN_SIZE = 8;
String PASSWORD_REGEXP = "^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[" + PASSWORD_SPECIAL_CHARS + "])(?=\\S+$).{"+PASSWORD_MIN_SIZE+",}$";
Unit tested:
Password Requirement :
Password should be at least eight (8) characters in length where the system can support it.
Passwords must include characters from at least two (2) of these groupings: alpha, numeric, and special characters.
^.*(?=.{8,})(?=.*\d)(?=.*[a-zA-Z])|(?=.{8,})(?=.*\d)(?=.*[!@#$%^&])|(?=.{8,})(?=.*[a-zA-Z])(?=.*[!@#$%^&]).*$
I tested it and it works
For anyone interested in minimum requirements for each type of character, I would suggest making the following extension over Tomalak's accepted answer:
^(?=(.*[0-9]){%d,})(?=(.*[a-z]){%d,})(?=(.*[A-Z]){%d,})(?=(.*[^0-9a-zA-Z]){%d,})(?=\S+$).{%d,}$
Notice that this is a formatting string and not the final regex pattern. Just substitute %d with the minimum required occurrences for: digits, lowercase, uppercase, non-digit/character, and entire password (respectively). Maximum occurrences are unlikely (unless you want a max of 0, effectively rejecting any such characters) but those could be easily added as well. Notice the extra grouping around each type so that the min/max constraints allow for non-consecutive matches. This worked wonders for a system where we could centrally configure how many of each type of character we required and then have the website as well as two different mobile platforms fetch that information in order to construct the regex pattern based on the above formatting string.
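As a hedged Swift illustration of that idea (the counts below are arbitrary examples, and String(format:) simply substitutes the %d placeholders in the order listed above):

import Foundation

// %d slots: digits, lowercase, uppercase, special (non-alphanumeric), total length.
let template = "^(?=(.*[0-9]){%d,})(?=(.*[a-z]){%d,})(?=(.*[A-Z]){%d,})" +
               "(?=(.*[^0-9a-zA-Z]){%d,})(?=\\S+$).{%d,}$"
let pattern = String(format: template, 2, 1, 1, 1, 10)   // e.g. at least 2 digits, 10 chars total

let isValid = NSPredicate(format: "SELF MATCHES %@", pattern)
print(isValid.evaluate(with: "Passw0rd9!"))   // true
print(isValid.evaluate(with: "Passw0rd!"))    // false: only one digit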
This one checks for every special character:
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=\S+$).*[A-Za-z0-9].{8,}$
A Java method ready for you, with parameters
Just copy and paste and set your desired parameters.
If you don't want a module, just comment it out or add an "if", as I did for the special character check.
//______________________________________________________________________________
/**
 * Password validation
 */
//______________________________________________________________________________
private static boolean validation_Password(final String PASSWORD_Arg) {
    boolean result = false;
    try {
        if (PASSWORD_Arg != null) {
            //_________________________
            // Parameters
            final String MIN_LENGTH = "8";
            final String MAX_LENGTH = "20";
            final boolean SPECIAL_CHAR_NEEDED = true;
            //_________________________
            // Modules
            final String ONE_DIGIT = "(?=.*[0-9])";    // a digit must occur at least once
            final String LOWER_CASE = "(?=.*[a-z])";   // a lower case letter must occur at least once
            final String UPPER_CASE = "(?=.*[A-Z])";   // an upper case letter must occur at least once
            final String NO_SPACE = "(?=\\S+$)";       // no whitespace allowed in the entire string
            //final String MIN_CHAR = ".{" + MIN_LENGTH + ",}"; // at least MIN_LENGTH characters
            final String MIN_MAX_CHAR = ".{" + MIN_LENGTH + "," + MAX_LENGTH + "}"; // between MIN_LENGTH and MAX_LENGTH characters
            final String SPECIAL_CHAR;
            if (SPECIAL_CHAR_NEEDED) SPECIAL_CHAR = "(?=.*[@#$%^&+=])"; // a special character must occur at least once
            else SPECIAL_CHAR = "";
            //_________________________
            // Pattern
            //String pattern = "(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\\S+$).{8,}";
            final String PATTERN = ONE_DIGIT + LOWER_CASE + UPPER_CASE + SPECIAL_CHAR + NO_SPACE + MIN_MAX_CHAR;
            //_________________________
            result = PASSWORD_Arg.matches(PATTERN);
            //_________________________
        }
    } catch (Exception ex) {
        result = false;
    }
    return result;
}
You can also do it like this.
public boolean isPasswordValid(String password) {
    String regExpn = "^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\\S+$).{8,}$";

    // Note: do not compile with Pattern.CASE_INSENSITIVE here, or the separate
    // lower-case and upper-case rules would effectively be ignored.
    Pattern pattern = Pattern.compile(regExpn);
    Matcher matcher = pattern.matcher(password);

    return matcher.matches();
}
Use the Passay library, which is a powerful API for this.
I think this can do it too (as a simpler form):
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])[^\s]{8,}$
[Regex Demo]
An easy one:
("^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[\\W_])[\\S]{8,10}$")
(?= anything ) means a positive lookahead: it looks ahead over the whole input string and makes sure the condition is satisfied. For example, (?=.*[0-9]) ensures at least one digit is present somewhere in the string; if not, the match fails.
(?! anything ) is the opposite: a negative lookahead that fails if the condition is satisfied.
Roughly, the pattern reads as ^(condition)(condition)(condition)(condition)[\S]{8,10}$
String s = pwd;
int n = 0;

for (int i = 0; i < s.length(); i++) {
    if (Character.isDigit(s.charAt(i))) {
        n = 5;
        break;
    }
}

for (int i = 0; i < s.length(); i++) {
    if (Character.isLetter(s.charAt(i))) {
        n += 5;
        break;
    }
}

if (n == 10) {
    out.print("Password format correct <b>Accepted</b><br>");
} else {
    out.print("Password must be alphanumeric <b>Declined</b><br>");
}
Explanation:
First, take the password as a string and create an integer n set to 0.
Then check each character with a for loop.
If it finds a digit in the string, it adds 5 to n and jumps to the next for loop: Character.isDigit(s.charAt(i))
The second loop checks whether any letter appears in the string. If it finds one, it adds another 5 to n: Character.isLetter(s.charAt(i))
Now check the integer n with an if condition. If n == 10 is true, the given string is alphanumeric; otherwise it is not.
Sample code block for strong password:
(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[^a-zA-Z0-9])(?=\\S+$).{6,18}
at least 6 characters
up to 18 characters
one number
one lowercase
one uppercase
can contain all special characters
RegEx is -
^(?:(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=]).*)[^\s]{8,}$
at least 8 characters {8,}
at least one number (?=.*\d)
at least one lowercase (?=.*[a-z])
at least one uppercase (?=.*[A-Z])
at least one special character (?=.*[@#$%^&+=])
No space [^\s]
A more general answer which accepts all the special characters including _ would be slightly different:
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[\W|\_])(?=\S+$).{8,}$
The difference (?=.*[\W|\_]) translates to "at least one of all the special characters including the underscore".

Determining ascii character in NSString

In one of my UIView classes, I have the UIKeyInput protocol attached to gather input from a UIKeyboard. I'm trying to figure out what ASCII character is being used when the space button is pushed (it's not simply ' '; it appears to be something else). Does anyone know what this ASCII character is, or how I can figure out what ASCII code is being used?
To look at the value for each character you can do something like this:
NSString *text = ... // the text to examine
for (NSUInteger c = 0; c < text.length; c++) {
    unichar ch = [text characterAtIndex:c];
    NSLog(@"char = %x", (int)ch); // log the hex value of the Unicode character
}
Please note that this code doesn't properly handle any Unicode characters in the range U+10000 and up. This includes many (all?) of the emoji characters.
If you really need to know what character (or code point) it actually is, use the CFMutableString function CFStringTransform().
That enables you to use the transformation argument kCFStringTransformToUnicodeName to get the human-readable Unicode name, for example, or Hex-Any to get the escaped Unicode code point.
Otherwise you can use the unichar approach to simply get the code point.
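For example, here is a hedged Swift sketch of that CFStringTransform() approach; the input character is just an assumption for illustration (substitute whatever your UIKeyInput actually delivered):

import Foundation

// Suppose, purely for the demo, that the keyboard delivered U+00A0 instead of a plain space.
let mystery = NSMutableString(string: "\u{00A0}")
CFStringTransform(mystery as CFMutableString, nil, kCFStringTransformToUnicodeName, false)

print(mystery)   // \N{NO-BREAK SPACE}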

Showing wrong character for a Unicode value in iOS

I am now working on an iOS app that handles Unicode characters, but it seems there is some problem with translating a Unicode hex value (and int value too) to a character.
For example, I want to get the character 'đ', which has the Unicode value c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓉' (a Korean character) instead.
I also used:
int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of the two pieces of code above are the same.
I can't understand what is problem here, please help!
The short answer
To specify đ, you can specify it in the following ways (untested):
#"đ"
#"\u0111"
#"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last 2 lines use C string literals instead of the Objective-C string object literal construct @"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into proper characters.
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond Basic Multilingual Plane (Plane 0) in Unicode.
Unicode escape sequences (Universal character names in C99)
According to this blog1:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of the C99 TC2 draft shows that the forms are \unnnn or \Unnnnnnnn, where nnnn or nnnnnnnn is a "short identifier as defined by ISO/IEC 10646", which roughly means a hexadecimal code point. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confused between the code point U+0111 and the UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly.
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encodings, such as UTF-16 and UTF-32, which may give a different byte representation of the character on disk, but since UTF-8, UTF-16 and UTF-32 are all encodings of the Unicode character set, the code point for the same character is the same in all three.
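A small Swift check that makes this distinction concrete for đ (the values follow from the explanation above):

import Foundation

// U+0111 is the code point for đ; C4 91 is its UTF-8 encoded form.
let dStroke = "\u{0111}"
print(Array(dStroke.utf8).map { String(format: "%02X", $0) })   // ["C4", "91"]
print(dStroke.unicodeScalars.first!.value == 0x0111)            // true

// Interpreting the UTF-8 bytes C4 91 as if they were a single code point yields U+C491,
// the Hangul syllable the question saw instead of đ.
print(String(Character(UnicodeScalar(UInt32(0xC491))!)))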
Footnote
1: I think the blog is correct, but if anyone can find official documentation from Apple on this point, it would be better.

CFStringTokenizer's token range in a UTF8 C string

I'm using CFStringTokenizer to break a load of text into words, but I'm having difficulty bridging between whatever encoding CFString uses and UTF-8. Consider this:
NSString *theString = @"Lorem ipsum dolor sit amet!";
const char *theCString = [theString cStringUsingEncoding:NSUTF8StringEncoding];

tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
                                    (__bridge CFStringRef)theString,
                                    CFRangeMake(0, [theString length]),
                                    kCFStringTokenizerUnitWordBoundary,
                                    locale);

while ((tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer)) != kCFStringTokenizerTokenNone) {
    tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer);
    memcpy(resultPtr, theCString + tokenRange.location, tokenRange.length);
}
Unfortunately the range reported by the tokenizer is incorrect when trying to read from the C string if any non-ascii characters have been encountered. How can I go about getting the correct range from the tokenizer to be able to pull the correct chars from my C string?
To clarify, the memcpy stuff is a tad more complex than above, and is necessary for performance on my target device, the iPhone. So I can't even do anything like create a CFString substring and convert that, I need the range in the C string. Is there any way to do that without reimplementing various word boundary libraries to get it working for the various different locales I need it to work with? (which is as many as possible, so I can't just iterate through looking for ' ' unfortunately..)
Alec
NSStrings and CFStrings deal in UTF-16, not UTF-8, but that isn't the real problem.
Your code has two problems:
You're assuming that the C string's indexes correspond to the source string's indexes.
You're copying and converting the entire string to a UTF-8 C string at once.
#1 is the cause of the range mismatches, and #2 causes potentially high memory usage, depending on the length and content of the string. (UTF-8 can take as many as four bytes per character in some alphabets—and then add one for the C string terminator.)
You can solve both of these problems in a single change.
Create an NSMutableData to hold the output. For each token, set the data's length to the range's length; then, tell the string to get bytes within the desired range in the desired encoding and store them in the data's mutableBytes buffer. NSString has a method with a very long selector (briefly, getBytes:::::::) that you will want to use for this.
Since you use the range that is relative to the string exclusively with the string, there is no index/range mismatch, and each token will be output correctly.
If you really need a C string, you can set the data's length to the range's length + 1, then set the last byte to '\0' with a separate assignment after getting the token bytes. (Without the separate assignment, the byte may hold a previous value.)
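Here is a hedged Swift sketch of that approach (names and the sample text are illustrative); the key point is that the tokenizer's ranges are only ever applied to the NSString itself, and only each token is converted to UTF-8 bytes:

import Foundation

let theString = "Lorem ipsum dolor señor amet!" as NSString
let locale = CFLocaleCopyCurrent()
let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
                                        theString as CFString,
                                        CFRangeMake(0, theString.length),
                                        CFOptionFlags(kCFStringTokenizerUnitWordBoundary),
                                        locale)

while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
    let cfRange = CFStringTokenizerGetCurrentTokenRange(tokenizer)
    let nsRange = NSRange(location: cfRange.location, length: cfRange.length)

    // Worst case for the UTF-8 output plus a trailing NUL; a generous overestimate is fine here.
    let buffer = NSMutableData(length: nsRange.length * 4 + 1)!
    var usedLength = 0
    _ = theString.getBytes(buffer.mutableBytes,
                           maxLength: buffer.length - 1,
                           usedLength: &usedLength,
                           encoding: String.Encoding.utf8.rawValue,
                           options: [],
                           range: nsRange,
                           remaining: nil)
    buffer.mutableBytes.storeBytes(of: 0, toByteOffset: usedLength, as: UInt8.self)

    // buffer now holds a NUL-terminated UTF-8 C string for exactly this token.
    print(String(cString: buffer.mutableBytes.assumingMemoryBound(to: CChar.self)))
}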
