I'm using CFStringTokenizer to break a load of text into words, but I'm having difficulty bridging between whatever encoding CFString uses internally and UTF-8. Consider this:
NSString *theString = @"Lorem ipsum dolor sit amet!";
const char *theCString = [theString cStringUsingEncoding:NSUTF8StringEncoding];
tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
                                    (__bridge CFStringRef)theString,
                                    CFRangeMake(0, [theString length]),
                                    kCFStringTokenizerUnitWordBoundary,
                                    locale);

while ((tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer)) != kCFStringTokenizerTokenNone) {
    tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer);
    memcpy(resultPtr, theCString + tokenRange.location, tokenRange.length);
}
Unfortunately, the range reported by the tokenizer is incorrect for indexing into the C string once any non-ASCII characters have been encountered. How can I get the correct range from the tokenizer so that I can pull the correct chars out of my C string?
To clarify, the memcpy stuff is a tad more complex than the above, and is necessary for performance on my target device, the iPhone. So I can't even do something like create a CFString substring and convert that; I need the range in the C string. Is there any way to do that without reimplementing various word-boundary libraries to get it working for all the locales I need to support? (Which is as many as possible, so unfortunately I can't just iterate through looking for ' '.)
Alec
NSStrings and CFStrings deal in UTF-16, not UTF-8, but that isn't the real problem.
Your code has two problems:
You're assuming that the C string's indexes correspond to the source string's indexes.
You're copying and converting the entire string to a UTF-8 C string at once.
#1 is the cause of the range mismatches, and #2 causes potentially high memory usage, depending on the length and content of the string. (UTF-8 can take as many as four bytes per character in some alphabets—and then add one for the C string terminator.)
You can solve both of these problems in a single change.
Create an NSMutableData to hold the output. For each token, set the data's length to something large enough to hold the token's bytes in the output encoding (the token range's length is in UTF-16 units, so UTF-8 output may need more bytes than that); then tell the string to get the bytes within the desired range in the desired encoding and store them in the data's mutableBytes buffer. The NSString method you want has a very long selector: getBytes:maxLength:usedLength:encoding:options:range:remainingRange:.
Since the range is relative to the string and you use it exclusively with that string, there is no index/range mismatch, and each token will be output correctly.
If you really need a C string, reserve one extra byte, then set the byte just after the converted output (the method reports how many bytes it actually wrote via usedLength) to '\0' with a separate assignment after getting the token bytes. (Without the separate assignment, that byte may hold a previous value.)
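A minimal sketch of that approach (my own illustration rather than the answerer's exact code; it reuses theString, tokenizer, and tokenType from the question, and the generous 4-bytes-per-UTF-16-unit buffer sizing is an assumption on my part):

NSMutableData *tokenData = [NSMutableData data];

while ((tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer)) != kCFStringTokenizerTokenNone) {
    CFRange tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer);
    // The tokenizer's range is in UTF-16 units, so only ever use it against theString itself.
    NSRange range = NSMakeRange(tokenRange.location, tokenRange.length);

    // Generous upper bound for the UTF-8 output, plus one byte for a trailing NUL.
    [tokenData setLength:range.length * 4 + 1];

    NSUInteger usedLength = 0;
    [theString getBytes:tokenData.mutableBytes
              maxLength:tokenData.length - 1
             usedLength:&usedLength
               encoding:NSUTF8StringEncoding
                options:0
                  range:range
         remainingRange:NULL];

    ((char *)tokenData.mutableBytes)[usedLength] = '\0';  // now a valid UTF-8 C string
    // ... use tokenData.mutableBytes here ...
}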
Related
I have a string that includes some special characters (like é, â, î, ı, etc.). When I use substring on this string, I encounter inconsistent results: some special characters change uncontrollably.
You are assuming that these are all characters:
[newword substringWithRange:NSMakeRange(0,1)];
[newword substringWithRange:NSMakeRange(1,1)];
[newword substringWithRange:NSMakeRange(2,1)];
[newword substringWithRange:NSMakeRange(3,1)];
// and so on...
In other words, you believe that:
A location always falls at the start of a character.
A character always has length 1.
Both assumptions are wrong. Please read the Characters and Grapheme Clusters chapter of Apple's String Programming Guide.
Your é happens to have length 2, because it is a base letter e followed by a combining diacritical accent. If you want it to have length 1, you need to normalize the string before you use it. Call precomposedStringWithCanonicalMapping and use the resulting string.
Example and proof (in Swift, but it won't matter, as I use NSString throughout):
let s = "é,â,î,ı" as NSString
let c = s.substring(with: NSRange(location: 0, length: 1)) // e
let s2 = s.precomposedStringWithCanonicalMapping as NSString
let c2 = s2.substring(with: NSRange(location: 0, length: 1)) // é
You're treating a Unicode string like a sequence of bytes. Outside of the low (ASCII) range, Unicode code points can occupy multiple bytes in UTF-8, so slicing by byte can strip out the part responsible for the accent above the letter, such as this one: https://www.compart.com/en/unicode/U+0301
UTF-8 is variable-width, so treating it as raw bytes can give weird results. I would suggest using something that is more Unicode-aware, like ICU (International Components for Unicode).
Now imagine you have a byte sequence like this:
0x65 0x00
e    NUL
Now you have a UTF-8 string with one code point and a null terminator. Now say you want to add an accent to that e. How would you do that? You could use a special Unicode code point to modify the e, so now the string is:
0x65 0xCC 0x81 0x00
e    U+0301    NUL
where U+0301 is a combining character (COMBINING ACUTE ACCENT) that takes two bytes in UTF-8 and makes the e accented.
Edit: this answer assumes UTF-8 encoding, which is likely a bad assumption, but whether the string is UTF-8, UTF-16, or any other encoding in which a character can span multiple code units, it illustrates why you may see mysteriously disappearing accents. While this may well be UTF-16, for the sake of simplicity let's pretend we live in a world where life is just slightly better because everyone only uses UTF-8 and UTF-16 doesn't exist.
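You can verify those bytes directly from Objective-C (a small sketch of my own, not part of the original answer):

// Dump the UTF-8 bytes of a decomposed "é": 'e' (0x65) followed by
// U+0301 COMBINING ACUTE ACCENT, which UTF-8 encodes as 0xCC 0x81.
NSString *decomposed = @"e\u0301";
NSData *utf8 = [decomposed dataUsingEncoding:NSUTF8StringEncoding];
NSLog(@"%@", utf8);  // <65cc81>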
To address the comment (this has less to do with the question, but it's some fun trivia), and for some fun details about the NS/CF/Swift runtimes, bridging, constant CF strings, and other fun stuff like that: the representation of the actual string in memory is implementation-defined and can vary (even for constant strings; trust me, I know, I fixed the ELF implementation of them in Clang for CoreFoundation a few days ago). Anyway, here's some code:
CF_INLINE CFStringEncoding __CFStringGetSystemEncoding(void) {
    if (__CFDefaultSystemEncoding == kCFStringEncodingInvalidId) (void)CFStringGetSystemEncoding();
    return __CFDefaultSystemEncoding;
}

CFStringEncoding CFStringFileSystemEncoding(void) {
    if (__CFDefaultFileSystemEncoding == kCFStringEncodingInvalidId) {
#if DEPLOYMENT_TARGET_MACOSX || DEPLOYMENT_TARGET_EMBEDDED || DEPLOYMENT_TARGET_EMBEDDED_MINI || DEPLOYMENT_TARGET_WINDOWS
        __CFDefaultFileSystemEncoding = kCFStringEncodingUTF8;
#else
        __CFDefaultFileSystemEncoding = CFStringGetSystemEncoding();
#endif
    }
    return __CFDefaultFileSystemEncoding;
}
This sort of thing appears throughout CoreFoundation/Foundation/SwiftFoundation. (Yes, you never know which sort of NSString you're actually holding; they usually pretend to be the same thing, but under the hood, depending on how you got the object, you may be holding one of three variations of it.)
This is why code like the following exists: NS/CF(Constant)/Swift strings have an implementation-defined internal representation.
if (((encoding & 0x0FFF) == kCFStringEncodingUnicode) && ((encoding == kCFStringEncodingUnicode) || ((encoding > kCFStringEncodingUTF8) && (encoding <= kCFStringEncodingUTF32LE)))) {
If you want consistent behavior you have to encode the string using a specific fixed encoding instead of relying on the internal representation.
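For example (a sketch of my own, not from the answer above), asking for bytes in an explicit, fixed encoding gives the same result no matter which concrete string variant you happen to be holding:

// Encode with a fixed, explicit encoding instead of assuming anything
// about the string's internal storage.
NSString *word = @"é";
NSData *utf8Bytes = [word dataUsingEncoding:NSUTF8StringEncoding];
NSString *roundTripped = [[NSString alloc] initWithData:utf8Bytes
                                               encoding:NSUTF8StringEncoding];
NSLog(@"%@ -> %@ -> %@", word, utf8Bytes, roundTripped);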
I found the weirdest thing in Firebase Database/Storage. The thing is, I don't know whether it is Firebase or Swift that is not detecting umlauts, e.g. ä, ö, ü.
I did some easy things with Firebase, like uploading images to Firebase Storage and then downloading them into a table view. Some of my .png files had umlauts in the title, for example Röda.png.
The problem occurs when I download them: the only time my download URL is nil is when the file name contains the umlauts I was talking about.
I tried some alternatives, like the HTML entity for ö, but that is not working. Can you suggest something? I can't just replace ö with o, ü with u, etc.
This is the code where the url is nil when trying to set some values in Firebase:
FIRStorage.storage().reference()
    .child("\(productImageref!).png")
    .downloadURLWithCompletion({ (url, error) in
        FIRDatabase.database().reference()
            .child("Snuses").child(productImageref!).child("productUrl")
            .setValue(url!.absoluteString)
        let resource = Resource(downloadURL: url!, cacheKey: productImageref)
After spending a fair bit of time researching your problem, the difference boils down to how the character ö is encoded, and I traced it down to Unicode normalization forms.
The letter ö can be written in two ways, and String / NSString considers them equal:
let str1 = "o\u{308}" // decomposed : latin small letter o + combining diaeresis
let str2 = "\u{f6}" // precomposed: latin small letter o with diaeresis
print(str1, str2, str1 == str2) // ö ö true
But when you percent-encode them, they produce different results:
print(str1.stringByAddingPercentEncodingWithAllowedCharacters(.URLPathAllowedCharacterSet())!)
print(str2.stringByAddingPercentEncodingWithAllowedCharacters(.URLPathAllowedCharacterSet())!)
// o%CC%88
// %C3%B6
My guess is that Google / Firebase chooses the decomposed form while Apple prefers the other in its text input system. You can convert the file name to its decomposed form to match Firebase:
let str3 = str2.decomposedStringWithCanonicalMapping
print(str3.stringByAddingPercentEncodingWithAllowedCharacters(.URLPathAllowedCharacterSet())!)
// o%CC%88
This is irrelevant for ASCII-ranged characters. Unicode can be very confusing.
References:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (highly recommended)
Strings in Swift 2
NSString and Unicode
Horray for Unicode!
The short answer is that no, we're actually not doing anything special here. Basically all we do under the hood is:
// This is the list at https://cloud.google.com/storage/docs/json_api/ without the & because query parameters
NSString *const kGCSObjectAllowedCharacterSet =
    @"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~!$'()*+,;=:@";

- (nullable NSString *)GCSEscapedString:(NSString *)string {
  NSCharacterSet *allowedCharacters =
      [NSCharacterSet characterSetWithCharactersInString:kGCSObjectAllowedCharacterSet];
  return [string stringByAddingPercentEncodingWithAllowedCharacters:allowedCharacters];
}
What blows my mind is that:
let str1 = "o\u{308}" // decomposed : latin small letter o + combining diaeresis
let str2 = "\u{f6}" // precomposed: latin small letter o with diaeresis
print(str1, str2, str1 == str2) // ö ö true
returns true. In Objective-C (which the Firebase Storage client is built in), it totally shouldn't, as they're two totally different characters (in actuality, the length of str1 is 2 while the length of str2 is 1 in Obj-C, while in Swift I assume the answer is 1 for both).
Apple must be normalizing strings before comparison in Swift (probably a reasonable thing to do, since otherwise it leads to bugs like this where strings are "the same" but compare differently). Turns out, this is exactly what they do (see the "Extended Grapheme Clusters" section of their docs).
So, when you provide two different characters in Swift, they're being propagated to Obj-C as different characters and thus are encoded differently. Not a bug, just one of the many differences between Swift's String type and Obj-C's NSString type. When in doubt, choose a canonical representation you expect and stick with it, but as a library developer, it's very hard for us to choose that representation for you.
Thus, when naming files that contain Unicode characters, make sure to pick a standard representation (C, D, KC, or KD) and always use it when creating references.
let imageName = "smorgasbörd.jpg"
let path = "images/\(imageName)"
let decomposedPath = path.decomposedStringWithCanonicalMapping // Unicode Form D
let ref = FIRStorage.storage().reference().child(decomposedPath)
// use this ref and you'll always get the same objects
I'm having issues working with iOS Swift 2.0 to perform an XOR on a [UInt8] and convert the XORd result to a String. I'm having to interface with a crude server that wants to do simple XOR encryption with a predefined array of UInt8 values and return that result as a String.
Using iOS Swift 2.0 Playground, create the following array:
let xorResult : [UInt8] = [24, 48, 160, 212] // XORd result
let result = NSString(bytes: xorResult, length: xorResult.count, encoding: NSUTF8StringEncoding)
The result is always nil. If you remove the 160 and 212 values from the array, the NSString is not nil. If I switch to NSUTF16StringEncoding then I do not receive nil; however, the server does not support UTF-16. I have tried converting the values to a hex string, then converting the hex string to NSData, then converting that with NSUTF8StringEncoding, but it is still nil until I remove the 160 and 212. I know this algorithm works in Java; however, in Java we're using a combination of char and StringBuilder and everything is happy. Is there a way around this in iOS Swift?
To store an arbitrary chunk of binary data as a string, you need a string encoding which maps each single byte (0 ... 255) to some character. UTF-8 does not have this property; for example, 160 is the start of a multi-byte UTF-8 sequence and is not valid on its own.
The simplest encoding with this property is ISO Latin 1, a.k.a. ISO 8859-1 (the ISO/IEC 8859-1 encoding supplemented with the C0 and C1 control codes). It maps the Unicode code points U+0000 .. U+00FF to the bytes 0x00 .. 0xFF (compare 8859-1.TXT). This encoding is available for (NS)String as NSISOLatin1StringEncoding.
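For example (a sketch of my own, using the bytes from the question; shown in Objective-C, but the same constants exist in Swift):

// Every byte value 0...255 is a valid ISO Latin 1 character, so this
// conversion never fails and round-trips the original bytes exactly.
uint8_t xorResult[] = {24, 48, 160, 212};
NSData *data = [NSData dataWithBytes:xorResult length:sizeof(xorResult)];
NSString *asString = [[NSString alloc] initWithData:data
                                           encoding:NSISOLatin1StringEncoding];
NSData *roundTripped = [asString dataUsingEncoding:NSISOLatin1StringEncoding];  // same 4 bytes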
Please note: the result of converting an arbitrary binary chunk to a (NS)String with NSISOLatin1StringEncoding will contain embedded NUL and control characters. Some functions behave unexpectedly when used with such a string. For example, NSLog() terminates the output at the first embedded NUL character. This conversion is meant to solve the OP's concrete problem (creating a QR code which is recognized by a 3rd-party application). It is not meant as a universal mechanism to convert arbitrary data to a string which may be printed or presented in any way to the user.
I am working on an iOS app that handles Unicode characters, but it seems there is some problem with translating a Unicode hex value (and int value too) to a character.
For example, I want to get the character 'đ', which has a Unicode value of c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓉' (a Korean character) instead.
I also used:
int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of the two pieces of code above are the same.
I can't understand what the problem is here, please help!
The short answer
To specify đ, you can specify it in the following ways (untested):
#"đ"
#"\u0111"
#"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last two lines use a C string literal instead of the Objective-C string object literal construct @"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into proper characters.
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond Basic Multilingual Plane (Plane 0) in Unicode.
Unicode escape sequences (Universal character names in C99)
According to this blog [1]:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of the C99 TC2 draft shows \unnnn and \Unnnnnnnn, where nnnn or nnnnnnnn is a "short identifier as defined by ISO/IEC 10646"; roughly speaking, that means the hexadecimal code point. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confused between the code point U+0111 and the UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly.
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encodings, such as UTF-16 and UTF-32, which may give a different byte representation of the character on disk, but since UTF-8, UTF-16, and UTF-32 are all encodings of the Unicode character set, the code point for the same character is the same in all three.
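To make that concrete (a sketch of my own, not part of the original answer): the single code point U+0111 turns into different bytes under each encoding.

NSString *dStroke = @"\u0111";  // the code point U+0111 (đ)
NSData *utf8  = [dStroke dataUsingEncoding:NSUTF8StringEncoding];           // C4 91
NSData *utf16 = [dStroke dataUsingEncoding:NSUTF16BigEndianStringEncoding]; // 01 11
NSData *utf32 = [dStroke dataUsingEncoding:NSUTF32BigEndianStringEncoding]; // 00 00 01 11
NSLog(@"%@ %@ %@", utf8, utf16, utf32);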
Footnote
[1]: I think the blog is correct, but if anyone can find official documentation from Apple on this point, it would be better.
I'm currently struggling with percent escaping special characters on iOS, for instance "é" when contained in a query parameter value.
I'm using AFNetworking, but the issue isn't specific to it.
The "é" character should be percent escaped to "%E9", yet the result is "%C3%A9". The reason is because "é" is represented as those 2 bytes in UTF8.
The actual percent escaping method is the well known one and I'm passing UTF8 as string encoding. The string itself is #"é".
static NSString * AFPercentEscapedQueryStringPairMemberFromStringWithEncoding(NSString *string, NSStringEncoding encoding)
{
    static NSString * const kAFCharactersToBeEscaped = @":/?&=;+!@#$()~";
    static NSString * const kAFCharactersToLeaveUnescaped = @"[].";

    return (__bridge_transfer NSString *)CFURLCreateStringByAddingPercentEscapes(kCFAllocatorDefault, (__bridge CFStringRef)string, (__bridge CFStringRef)kAFCharactersToLeaveUnescaped, (__bridge CFStringRef)kAFCharactersToBeEscaped, CFStringConvertNSStringEncodingToEncoding(encoding));
}
I had hoped that passing in UTF-16 string encoding would solve it, but it doesn't. The result is "%FF%FE%E9%00" in this case; it contains "%E9", but I must be missing something obvious.
Somehow I can't get my head around it.
Any pointers would be awesome.
RFC 3986 explains that, unless the characters you're encoding fall into the unreserved US-ASCII range, the convention is to convert the character to its (in this case, UTF-8-encoded) byte values, and use those values as the basis of the percent encoding.
The behavior you're seeing is correct.
The disparity between the encoded values given for UTF-8 vs. UTF-16 is due to a couple of factors.
Encoding Differences
First, there's the difference in the way that the respective encodings are actually defined. UTF-16 always uses two bytes to represent a character in the Basic Multilingual Plane (characters outside it need a surrogate pair, i.e. four bytes), and essentially concatenates the higher-order byte with the lower-order byte to define the code. (The ordering of these bytes depends on whether the code is encoded as little endian or big endian.) UTF-8, on the other hand, uses a dynamic number of bytes, depending on where in the Unicode code space the character sits. The way UTF-8 signals how many bytes it's going to use is with the bits that are set in the first byte itself.
So if we look at C3 A9, that translates into the following bits:
1100 0011 1010 1001
Looking at RFC 2279, we see that the leading run of '1's terminated by a '0' denotes how many bytes will be used: in this case, 2. Stripping off the initial 110 metadata, we're left with 00011 from the first byte; that represents the leftmost bits of the actual value.
For the next byte (1010 1001), again from the RFC we see that, for every subsequent byte, 10 will be "prefix" metadata for the actual value. Stripping that off, we're left with 101001.
Concatenating the actual value bits, we end up with 00011 101001, which is 233 in base-10, or E9 in base-16.
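You can check that arithmetic from code (a sketch of my own, not part of the original answer): the two bytes C3 A9, interpreted as UTF-8, decode to the single code point U+00E9.

const uint8_t bytes[] = {0xC3, 0xA9};
NSString *decoded = [[NSString alloc] initWithBytes:bytes
                                             length:sizeof(bytes)
                                           encoding:NSUTF8StringEncoding];
NSLog(@"%@ U+%04X", decoded, [decoded characterAtIndex:0]);  // é U+00E9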
Encoding Identification
The other thing to consider, specifically for the UTF-16 value (%FF%FE%E9%00), comes from the original RFC, which mentions that there's no explicit indication of the encoding used in the encoded value itself. So in this case, iOS is "cheating" by giving you a hint about which encoding is used: FF FE is the well-known byte-order mark used in UTF-16-encoded data to indicate little-endian UTF-16. As for E9 00, as mentioned, UTF-16 uses two bytes per (BMP) character; in this case, since all of the character's data fits in one byte, the other is simply null.