I found something weird in Firebase Database/Storage: I can't tell whether Firebase or Swift is failing to handle umlauts (e.g. ä, ö, ü).
I did some simple things with Firebase, like uploading images to Firebase Storage and then downloading them into a table view. Some of my .png files had umlauts in the name, for example Röda.png.
The problem occurs when I download them. The only time my download URL is nil is when the file name contains one of those umlauts.
So I tried some alternatives, like the HTML entity &ouml; for ö, but this is not working. Can you suggest something? I can't just replace ö with o, ü with u, etc.
This is the code where url is nil when I try to set some values in Firebase:
FIRStorage.storage().reference()
.child("\(productImageref!).png")
.downloadURLWithCompletion({(url, error)in
FIRDatabase.database().reference()
.child("Snuses").child(productImageref!).child("productUrl")
.setValue(url!.absoluteString)
let resource = Resource(downloadURL: url!, cacheKey: productImageref)
After spending a fair bit of time researching your problem, the difference boils down to how the character ö is encoded, and I traced it down to Unicode normalization forms.
The letter ö can be written in two ways, and String / NSString considers them equal:
let str1 = "o\u{308}" // decomposed : latin small letter o + combining diaeresis
let str2 = "\u{f6}" // precomposed: latin small letter o with diaeresis
print(str1, str2, str1 == str2) // ö ö true
But when you percent-encode them, they produce different results:
print(str1.stringByAddingPercentEncodingWithAllowedCharacters(.URLPathAllowedCharacterSet())!)
print(str2.stringByAddingPercentEncodingWithAllowedCharacters(.URLPathAllowedCharacterSet())!)
// o%CC%88
// %C3%B6
My guess is that Google / Firebase chooses the decomposed form while Apple prefers the other in its text input system. You can convert the file name to its decomposed form to match Firebase:
let str3 = str2.decomposedStringWithCanonicalMapping
print(str3.stringByAddingPercentEncodingWithAllowedCharacters(.URLPathAllowedCharacterSet())!)
// o%CC%88
This is irrelevant for ASCII-ranged characters. Unicode can be very confusing.
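For readers on current Swift, here is a minimal sketch of the same experiment using the renamed Foundation APIs (addingPercentEncoding(withAllowedCharacters:) and .urlPathAllowed). The variable names are only illustrative, and it makes no claim about what Firebase does internally.

```swift
import Foundation

let decomposed = "o\u{308}"   // o + combining diaeresis (form D)
let precomposed = "\u{f6}"    // single code point ö (form C)

// Swift's == uses canonical equivalence, so these compare equal.
let looksEqual = (decomposed == precomposed)

// Percent-encoding works on the UTF-8 bytes, so the results differ.
let encodedDecomposed = decomposed.addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)
let encodedPrecomposed = precomposed.addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)

// Normalizing to one form before encoding makes the results agree.
let encodedAfterNFD = precomposed.decomposedStringWithCanonicalMapping
    .addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)

print(looksEqual)
print(encodedDecomposed ?? "nil")    // o%CC%88
print(encodedPrecomposed ?? "nil")   // %C3%B6
print(encodedAfterNFD ?? "nil")      // o%CC%88
```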
References:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (highly recommended)
Strings in Swift 2
NSString and Unicode
Hooray for Unicode!
The short answer is that no, we're actually not doing anything special here. Basically all we do under the hood is:
// This is the list at https://cloud.google.com/storage/docs/json_api/ without the & because query parameters
NSString *const kGCSObjectAllowedCharacterSet =
@"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~!$'()*+,;=:@";
- (nullable NSString *)GCSEscapedString:(NSString *)string {
NSCharacterSet *allowedCharacters =
[NSCharacterSet characterSetWithCharactersInString:kGCSObjectAllowedCharacterSet];
return [string stringByAddingPercentEncodingWithAllowedCharacters:allowedCharacters];
}
What blows my mind is that:
let str1 = "o\u{308}" // decomposed : latin small letter o + combining diaeresis
let str2 = "\u{f6}" // precomposed: latin small letter o with diaeresis
print(str1, str2, str1 == str2) // ö ö true
returns true. In Objective-C (which the Firebase Storage client is built in), it totally shouldn't, as they're two totally different characters (in actuality, the length of str1 is 2 while the length of str2 is 1 in Obj-C, while in Swift I assume the answer is 1 for both).
Apple must be normalizing strings before comparison in Swift (probably a reasonable thing to do, since otherwise it leads to bugs like this where strings are "the same" but compare differently). Turns out, this is exactly what they do (see the "Extended Grapheme Clusters" section of their docs).
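To make the difference concrete, here is a small sketch using only the Swift standard library. The utf16 view is used as a stand-in for what NSString's length would report, since NSString counts UTF-16 code units.

```swift
let str1 = "o\u{308}"   // decomposed
let str2 = "\u{f6}"     // precomposed

// Swift compares by canonical equivalence and counts grapheme clusters.
print(str1 == str2)                         // true
print(str1.count, str2.count)               // 1 1

// The UTF-16 view counts code units, like NSString's length.
print(str1.utf16.count, str2.utf16.count)   // 2 1

// A scalar-by-scalar comparison sees two different sequences.
let sameScalars = str1.unicodeScalars.elementsEqual(str2.unicodeScalars)
print(sameScalars)                          // false
```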
So, when you provide two different characters in Swift, they're being propagated to Obj-C as different characters and thus are encoded differently. Not a bug, just one of the many differences between Swift's String type and Obj-C's NSString type. When in doubt, choose a canonical representation you expect and stick with it, but as a library developer, it's very hard for us to choose that representation for you.
Thus, when naming files that contain Unicode characters, make sure to pick a standard representation (C, D, KC, or KD) and always use it when creating references.
let imageName = "smorgasbörd.jpg"
let path = "images/\(imageName)"
let decomposedPath = path.decomposedStringWithCanonicalMapping // Unicode Form D
let ref = FIRStorage.storage().reference().child(decomposedPath)
// use this ref and you'll always get the same objects
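One way to "pick a form and stick with it" is to route every path through a single helper. The sketch below is an illustration only: NormalizationForm and storagePath(folder:fileName:form:) are hypothetical names, not part of Firebase or Foundation. The four Foundation properties they wrap do exist.

```swift
import Foundation

/// The four Unicode normalization forms, mapped to Foundation's string properties.
enum NormalizationForm {
    case c, d, kc, kd

    func apply(to string: String) -> String {
        switch self {
        case .c:  return string.precomposedStringWithCanonicalMapping
        case .d:  return string.decomposedStringWithCanonicalMapping
        case .kc: return string.precomposedStringWithCompatibilityMapping
        case .kd: return string.decomposedStringWithCompatibilityMapping
        }
    }
}

/// Hypothetical helper: every storage path is built through this one function.
func storagePath(folder: String, fileName: String, form: NormalizationForm = .d) -> String {
    return form.apply(to: "\(folder)/\(fileName)")
}

// The same file name typed two different ways ends up as the same scalars.
let fromPrecomposed = storagePath(folder: "images", fileName: "R\u{f6}da.png")
let fromDecomposed = storagePath(folder: "images", fileName: "Ro\u{308}da.png")
let identical = fromPrecomposed.unicodeScalars.elementsEqual(fromDecomposed.unicodeScalars)
print(identical)   // true
```

The resulting string would then be passed to child(...) when creating the reference, as in the snippet above.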
I have a string that include some special char (like é,â,î,ı etc.), When I use substring on this string. I encounter inconsistent results. Some special char change uncontrollably
You are assuming that these are all characters:
[newword substringWithRange:NSMakeRange(0,1)];
[newword substringWithRange:NSMakeRange(1,1)];
[newword substringWithRange:NSMakeRange(2,1)];
[newword substringWithRange:NSMakeRange(3,1)];
// and so on...
In other words, you believe that:
A location always falls at the start of a character.
A character always has length 1.
Both assumptions are wrong. Please read the Characters and Grapheme Clusters chapter of Apple's String Programming Guide.
Your é happens to have length 2, because it is a base letter e followed by a combining diacritical accent. If you want it to have length 1, you need to normalize the string before you use it. Call precomposedStringWithCanonicalMapping and use the resulting string.
Example and proof (in Swift, but it won't matter, as I use NSString throughout):
let s = "é,â,î,ı" as NSString
let c = s.substring(with: NSRange(location: 0, length: 1)) // e
let s2 = s.precomposedStringWithCanonicalMapping as NSString
let c2 = s2.substring(with: NSRange(location: 0, length: 1)) // é
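If you would rather not normalize, another option is to ask NSString where each composed character sequence begins and ends. This is a sketch with an explicitly decomposed sample string (the escapes are there so the decomposed form is unambiguous):

```swift
import Foundation

let word = "e\u{301},a\u{302}" as NSString   // decomposed é and â

// A fixed length of 1 cuts the accent off the base letter.
let naive = word.substring(with: NSRange(location: 0, length: 1))

// Asking for the composed character sequence keeps letter and accent together.
let range = word.rangeOfComposedCharacterSequence(at: 0)
let whole = word.substring(with: range)

// A Swift String iterates grapheme clusters directly.
let characters = Array(word as String)

print(naive, whole, range.length, characters.count)
```

Here range.length is 2 and characters.count is 3, so the accent stays attached to its letter either way.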
You're treating a Unicode string like a sequence of fixed-size units. Apart from the ASCII range, a Unicode code point takes more than one byte in UTF-8, and a visible character can consist of several code points, so you are changing the text by stripping out the part responsible for the accent above the letter, such as this one: https://www.compart.com/en/unicode/U+0301
UTF-8 is variable width, so by treating it as raw bytes you may get weird results. I would suggest using something that is more aware of Unicode, like ICU (International Components for Unicode).
Now imagine you have a two byte sequence like this (this may not be 100% accurate but it illustrates my point):
0x65 0x00
e    NUL
Now you have a UTF8 string with 1 codepoint and a null terminator. Now say you want to add an accent to that e. How would you do that? You could use a special unicode codepoint to modify the e so now the string is:
0x65 0xCC 0x81 0x00
e    U+0301    NUL
Where U+0301 is a 2-byte combining character (Combining Acute Accent) that makes the e accented.
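The byte layout described above can be checked directly. This is a small standard-library sketch; it uses UTF-8 purely for illustration, as the answer does.

```swift
let plain = "e"
let accented = "e\u{301}"   // e followed by U+0301 COMBINING ACUTE ACCENT

let plainBytes = Array(plain.utf8)
let accentedBytes = Array(accented.utf8)

print(plainBytes)      // [101]            i.e. 0x65
print(accentedBytes)   // [101, 204, 129]  i.e. 0x65 0xCC 0x81

// Dropping the last two bytes removes the accent and leaves a bare "e".
let truncated = String(decoding: accentedBytes.dropLast(2), as: UTF8.self)
print(truncated == plain)   // true
```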
Edit: The answer assumes UTF8 encoding, which is likely a bad assumption, but I think the answer, whether UTF8 or UTF16 or any other encoding with combining characters, illustrates why you may have mysterious disappearing accents. While this may be UTF16, for the sake of simplicity let's pretend we live in a world where life is just slightly better because everyone only uses UTF8 and UTF16 doesn't exist.
To address the comment (this is less to do with the question but is some fun trivia), and for some fun details about NS/CF/Swift runtimes and bridging and constant CF strings and other fun stuff like that: the representation of the actual string in memory is implementation defined and can vary (even for constant strings, trust me, I know, I fixed the ELF implementation of them in Clang for CoreFoundation a few days ago). Anyway, here's some code:
CF_INLINE CFStringEncoding __CFStringGetSystemEncoding(void) {
if (__CFDefaultSystemEncoding == kCFStringEncodingInvalidId) (void)CFStringGetSystemEncoding();
return __CFDefaultSystemEncoding;
}
CFStringEncoding CFStringFileSystemEncoding(void) {
if (__CFDefaultFileSystemEncoding == kCFStringEncodingInvalidId) {
#if DEPLOYMENT_TARGET_MACOSX || DEPLOYMENT_TARGET_EMBEDDED || DEPLOYMENT_TARGET_EMBEDDED_MINI || DEPLOYMENT_TARGET_WINDOWS
__CFDefaultFileSystemEncoding = kCFStringEncodingUTF8;
#else
__CFDefaultFileSystemEncoding = CFStringGetSystemEncoding();
#endif
}
return __CFDefaultFileSystemEncoding;
}
Checks like these appear throughout CoreFoundation/Foundation/SwiftFoundation. (Yes, you never know which sort of NSString you're actually holding; they usually pretend to be the same thing, but under the hood, depending on how you got the object, you may be holding one of the three variations.)
This is why code like this exists, because NS/CF(Constant)/Swift strings have implementation defined internal representation.
if (((encoding & 0x0FFF) == kCFStringEncodingUnicode) && ((encoding == kCFStringEncodingUnicode) || ((encoding > kCFStringEncodingUTF8) && (encoding <= kCFStringEncodingUTF32LE)))) {
If you want consistent behavior you have to encode the string using a specific fixed encoding instead of relying on the internal representation.
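As a rough illustration of "pick a fixed encoding", the sketch below asks for the bytes explicitly instead of relying on whatever the string uses internally. The sample text is arbitrary.

```swift
import Foundation

let text = "\u{e9}"   // precomposed é

// Explicit encodings give predictable bytes regardless of internal storage.
let utf8Bytes = Array(text.utf8)
let utf16Bytes = Array(text.data(using: .utf16LittleEndian) ?? Data())

print(utf8Bytes)    // [195, 169]  i.e. 0xC3 0xA9
print(utf16Bytes)   // [233, 0]    i.e. 0xE9 0x00
```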
Using iOS + Swift, what's the best method to allow special characters .$#[]/ in my Firebase database keys (node names)?
Add percent encoding & decoding! Remember to allow alphanumeric characters (see example below).
var str = "this.is/a#crazy[string]right$here.$[]#/"
if let strEncoded = str.addingPercentEncoding(withAllowedCharacters: .alphanumerics) {
print(strEncoded)
if let strDecoded = strEncoded.removingPercentEncoding {
print(strDecoded)
}
}
The question is
How Do I Allow Special Characters in My Firebase Realtime Database?
The actual answer is there is nothing required to allow Special Characters in Firebase Realtime Database.
For example: given the following code
//self.ref is a reference to the Firebase Database
let str = "this.is/a#crazy[string]right$here.$[]#/"
let ref = self.ref.childByAutoId()
ref.setValue(str)
When the code is run, the following is written to firebase
{
"-KlZovTc2uhQXNzDodW_" : "this.is/a#crazy[string]right$here.$[]#/"
}
As you can see the string is identical to the given string, including the special characters.
It's important to note the question asks about allowing special characters in strings. Everything in Firebase is stored as key: value pairs and the Values can be strings so that's what this answer addresses.
Keys are different
If you create your own keys, they must be UTF-8 encoded, can be a maximum of 768 bytes, and cannot contain ., $, #, [, ], /, or ASCII control characters 0-31 or 127.
The bigger question is whether a structure that requires those characters in a key should be rethought, as there are generally better solutions.
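If you do build your own keys, the rules quoted above are easy to check before writing. The function below is a hypothetical client-side check based only on those stated rules; it is not a Firebase API, and the server remains the authority.

```swift
/// Hypothetical validator for the key rules quoted above:
/// at most 768 UTF-8 bytes, none of . $ # [ ] /, no ASCII control characters.
func isValidFirebaseKey(_ key: String) -> Bool {
    let forbidden: Set<Unicode.Scalar> = [".", "$", "#", "[", "]", "/"]
    if key.utf8.count > 768 { return false }
    for scalar in key.unicodeScalars {
        // ASCII control characters 0-31 and 127 are not allowed.
        if scalar.value < 32 || scalar.value == 127 { return false }
        if forbidden.contains(scalar) { return false }
    }
    return true
}

print(isValidFirebaseKey("Röda"))           // true
print(isValidFirebaseKey("this.is/a#key"))  // false
```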
There are some Unicode arrangements that I want to use in my app. I am having trouble properly escaping them for use.
For instance this Unicode sequence: 🅰
If I escape it using an online tool I get: \ud83c\udd70
But of course this is an invalid sequence per the compiler:
var str = NSString.stringWithUTF8String("\ud83c\udd70")
Also if I do this:
var str = NSString.stringWithUTF8String("\ud83c")
I get an error "Invalid Unicode Scalar"
I'm trying to use these Unicode "fonts":
http://www.panix.com/~eli/unicode/convert.cgi?text=abcdefghijklmnopqrstuvwxyz
If I view the source of this website I see sequences like this:
&#120146; (which renders as 𝕒)
Struggling to wrap my head around what is the "proper" way to work with/escape unicode.
I simply need to figure out a way to get them working on iOS.
Any thoughts?
\ud83c\udd70 is a UTF-16 surrogate pair which encodes the Unicode character 🅰 (U+1F170). Swift string literals do not use UTF-16, so that escape sequence doesn't make sense. And since 1F170 has five digits, you can't use a \uXXXX escape sequence either (it only accepts four hexadecimal digits). Instead, use a \UXXXXXXXX sequence (note the capital U), which accepts eight:
var str = "\U0001F170" // returns "🅰"
You can also just paste the character itself into your string:
var str = "🅰" // returns "🅰"
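Note that the \U syntax above dates from the early Swift betas. Released versions of Swift write the scalar in braces instead, as in this short sketch, which also shows where the surrogate pair from the online tool comes from:

```swift
// Released Swift versions use \u{...} with 1 to 8 hex digits.
let negativeSquaredA = "\u{1F170}"

// The UTF-16 view exposes the surrogate pair that the online tool produced.
let surrogates = Array(negativeSquaredA.utf16)

// Going the other way: decode a surrogate pair back into a String.
let decoded = String(decoding: [0xD83C, 0xDD70] as [UInt16], as: UTF16.self)

print(negativeSquaredA)              // 🅰
print(surrogates)                    // [55356, 56688]  i.e. 0xD83C 0xDD70
print(decoded == negativeSquaredA)   // true
```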
Swift is an early Beta; it is broken in many ways. This issue is a Swift bug.
let ringAboveA: String = "\u0041\u030A" is Å and is accepted
let negativeSquaredA: String = "\uD83C\uDD70" is 🅰 and produces an error
Both are decomposed UTF16 characters that are accepted by Objective-C. The difference is that the composed character 🅰 is in plane 1.
Note: to get the UTF32 code point either use the OSX Character Viewer or a code snippet:
NSLog(@"utf32: %@", [@"🅰" dataUsingEncoding:NSUTF32BigEndianStringEncoding]);
utf32: <0001f170>
To get the Character Viewer in the menu bar, go to "System Preferences", "Keyboard", "Keyboard" tab and select the checkbox "Show Keyboard & Character Viewers in menu bar". The "Character Viewer" item will be in the menu bar just to the left of the date.
After entering the character, right-click (or control-click) the character in Favorites to copy the search results.
Copied information:
🅰
NEGATIVE SQUARED LATIN CAPITAL LETTER A
Unicode: U+1F170 (U+D83C U+DD70), UTF-8: F0 9F 85 B0
Better yet: Add unicode in the list on the left and select it.
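The same information can also be produced in code. The helper below is a hypothetical sketch using only the Swift standard library; it formats the scalar value, UTF-16 code units and UTF-8 bytes in the style of the line copied from the Character Viewer.

```swift
/// Uppercase hex digits, left-padded with zeros to `width`.
func hex<T: BinaryInteger>(_ value: T, width: Int) -> String {
    let digits = String(value, radix: 16, uppercase: true)
    return String(repeating: "0", count: max(0, width - digits.count)) + digits
}

/// Hypothetical helper that mimics the Character Viewer's info line.
func characterInfo(_ character: Character) -> String {
    let text = String(character)
    let scalars = text.unicodeScalars.map { "U+" + hex($0.value, width: 4) }
    let utf16 = text.utf16.map { "U+" + hex($0, width: 4) }
    let utf8 = text.utf8.map { hex($0, width: 2) }
    return "Unicode: \(scalars.joined(separator: " ")) "
        + "(\(utf16.joined(separator: " "))), "
        + "UTF-8: \(utf8.joined(separator: " "))"
}

print(characterInfo("🅰"))
// Unicode: U+1F170 (U+D83C U+DD70), UTF-8: F0 9F 85 B0
```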
I have an NSString like:
test = @"this is %25test%25 string";
I am trying to replace test with some Arabic text, but it is not replaced exactly as written:
[test stringByReplacingOccurrencesOfString:@"test" withString:@"اختبار"];
and the result is :
this is %25 اختبار %25 string
I read somewhere that there could be a problem with encoding or text alignment. Is any extra adjustment needed for Arabic string operations?
EDIT: I have used NSMutableString's insert method, but still get the same result.
EDIT 2:
One other thing that occurs to me that is causing most of your trouble in this specific example. You have a partially percent-encoded string above. You have spaces, but you also have %25. You should avoid doing that. Either percent-encode a string or don't. Convert it all at once when required (using stringByAddingPercentEscapesUsingEncoding:). Don't try to "hard-code" percent-encoding. If you just used "this is a %اختبار% string" (and then percent-encoded the entire thing at the end), all your directional problems would go away (see how that renders just fine?). The rest of these answers address the more general question when you really need to deal with directionality.
EDIT:
The original answer after the line relates to human-readable strings, and is correct for human-readable strings, but your actual question (based on your followups) is about URLs. URLs are not human-readable strings, even if they occasionally look like them. They are a sequence of bytes that are independent of how they are rendered to humans. "اختبار" cannot be in the path or fragment parts of an URL. These characters are not part of the legal set of characters for those sections (اختبار is allowed to be part of the host, but you have to follow the IDN rules for that).
The correct URL encoding for "this is a %25<arabic>%25 string" is:
this%20is%20a%20%2525%D8%A7%D8%AE%D8%AA%D8%A8%D8%A7%D8%B1%2525%20string
If you decode and render this string to the screen, it will appear like this:
this is a %25اختبار%25 string
But it is in fact exactly the string you mean (and it is the string you should pass to the browser). Follow the bytes (like the computer will):
this - this (ALPHA)
%20 - <space> (encoded)
is - is (ALPHA)
%20 - <space> (encoded)
a - a (ALPHA)
%20 - <space> (encoded)
%25 - % (encoded)
25 - 25 (DIGIT)
%D8%A7 - ا (encoded)
%D8%AE - خ (encoded)
%D8%AA - ت (encoded)
%D8%A8 - ب (encoded)
%D8%A7 - ا (encoded)
%D8%B1 - ر (encoded)
%25 - % (encoded)
25 - 25 (DIGIT)
%20 - <space> (encoded)
string - string (ALPHA)
The Unicode BIDI display algorithm is doing what it means to do; it just isn't what you expect. But those are the bytes and they're in the correct order. If you add any additional bytes (such as LRO) to this string, then you are modifying the URL and it means something different.
So the question you need to answer is, are you making an URL, or are you making a human-readable string? If you're making an URL, it should be URL-encoded, in which case you will not have this display problem (unless this is part of the host, which is a different set of rules, but I don't believe that's your problem). If this is a human-readable string, see below about how to provide hints and overrides to the BIDI algorithm.
It's possible that you really need both (a human-friendly string, and a correct URL that can be pasted). That's fine, you just need to handle the clipboard yourself. Show the string, but when the user goes to copy it, replace it with the fully encoded URL using UIPasteboard or by overriding copy:. See Copy, Cut, and Paste Operations. This is fairly common (note how Safari displays just "stackoverflow.com" in the address bar, but if you copy and paste it, it pastes "https://stackoverflow.com/"). Same thing.
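For reference, here is a sketch of "encode it all at once" in Swift. It assumes the text goes into the path part of a URL, which is why .urlPathAllowed is used; a different URL component would need a different allowed set.

```swift
import Foundation

let arabic = "اختبار"
let readable = "this is a %25\(arabic)%25 string"

// Encode everything in one pass. "%" itself is escaped, so "%25" becomes "%2525".
let encoded = readable.addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)

// Decoding restores the original characters in their logical order.
let roundTripped = encoded?.removingPercentEncoding

print(encoded ?? "nil")
print(roundTripped == readable)   // true
```

The encoded result is pure ASCII, so the BIDI algorithm has nothing to reorder when it is displayed.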
Original answer related to human-readable strings.
Believe it or not, stringByReplacingOccurrencesOfString: is doing the right thing. It's just not displaying the way you expect. If you walk through characterAtIndex:, you'll find that it's:
% 2 5 ا ...
The problem is that the layout engine gets very confused around all the "neutral direction" characters. The engine doesn't understand whether you meant "%25" to be attached to the left to right part or right to left part. You have to help it out here by giving it some explicit directional characters to work with.
There are a few ways to go about this. First, you can do it the Unicode 6.3 tr9-29 way with Explicit Directional Isolates. This is exactly the kind of problem that Isolates are meant to solve. You have some piece of text whose direction you want to be considered completely independently of all other text. Unicode 6.3 isn't actually supported by iOS or OS X as best I can tell, but for many (though not all) uses, it "works."
You want to surround your Arabic with FSI (FIRST STRONG ISOLATE U+2068) and PDI (POP DIRECTIONAL ISOLATE U+2069). You could also use RLI (RIGHT-TO-LEFT ISOLATE) to be explicit. FSI means "treat this text as being in the direction of the first strong character you find."
So you could ideally do this:
NSString *test = @"this is a %25\u2068test\u2069%25 string";
NSString *arabic = @"اختبار";
NSString *result = [test stringByReplacingOccurrencesOfString:@"test" withString:arabic];
That works if you know what you're going to substitute before hand (so you know where to put the FSI and PDI). If you don't, you can do it the other way and make it part of the substitution:
NSString * const FSI = @"\u2068";
NSString * const PDI = @"\u2069";
NSString *test = @"this is %25test%25 string";
NSString *arabic = @"اختبار";
NSString *replaceString = [@[FSI, arabic, PDI] componentsJoinedByString:@""];
NSString *result = [test stringByReplacingOccurrencesOfString:@"test" withString:replaceString];
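The same substitution in Swift, as a sketch. The isolated(_:) helper is a hypothetical name introduced here for illustration. How the isolates render still depends on the text system, as discussed below.

```swift
import Foundation

let fsi = "\u{2068}"   // FIRST STRONG ISOLATE
let pdi = "\u{2069}"   // POP DIRECTIONAL ISOLATE

/// Hypothetical helper: wrap text so its direction is resolved independently.
func isolated(_ text: String) -> String {
    return fsi + text + pdi
}

let template = "this is %25test%25 string"
let arabic = "اختبار"
let result = template.replacingOccurrences(of: "test", with: isolated(arabic))

print(result)
```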
I said this "mostly" works. It's fine for UILabel, and it probably is fine for anything using Core Text. But in NSLog output, you'll get these extra "placeholder" characters:
You might get this other places, too. I haven't checked UIWebView for instance.
So there are some other options. You can use directional marks. It's a little awkward, though. LRM and RLM are zero-width strongly directional characters. So you can bracket the arabic with LRM (left to right mark) so that the arabic doesn't disturb the surrounding text. This is a little ugly since it means the substitution has to be aware of what it's substituting into (which is why isolates were invented).
NSString * const LRM = @"\u200e";
NSString *test = @"this is a %25test%25 string";
NSString *arabic = @"اختبار";
NSString *replaceString = [@[LRM, arabic, LRM] componentsJoinedByString:@""];
NSString *result = [test stringByReplacingOccurrencesOfString:@"test" withString:replaceString];
BTW, Directional Marks are usually the right answer. They should always be the first thing you try. This particular problem is just a little too tricky.
One more way is to use Explicit Directional Overrides. These are the giant "do what I tell you to do" hammer of the Unicode world. You should avoid them whenever possible. There are some security concerns with them that make them forbidden in certain places (<RLO>elgoog<PDF>.com would display as google.com for instance). But they will work here.
You bracket the whole string with LRO/PDF to force it to be left-to-right. You then bracket the substitution with RLO/PDF to force it to the right-to-left. Again, this is a last resort, but it lets you take complete control over the layout:
NSString * const LRO = @"\u202d";
NSString * const RLO = @"\u202e";
NSString * const PDF = @"\u202c";
NSString *test = [@[LRO, @"this is a %25test%25 string", PDF] componentsJoinedByString:@""];
NSString *arabic = @"اختبار";
NSString *replaceString = [@[RLO, arabic, PDF] componentsJoinedByString:@""];
NSString *result = [test stringByReplacingOccurrencesOfString:@"test" withString:replaceString];
I would think you could solve this problem with the Explicit Directional Embedding characters, but I haven't really found a way to do it without at least one override (for instance, you could use RLE instead of RLO above, but you still need the LRO).
Those should give you the tools you need to figure all of this out. See the Unicode TR9 for the gory details. And if you want a deeper introduction to the problem and solutions, see Cal Henderson's excellent Understanding Bidirectional (BIDI) Text in Unicode.
You should try like this:
NSString *test = @"this is %25test%25 string";
NSString *test2 = [[[test stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding] componentsSeparatedByString:@"test"] componentsJoinedByString:@"اختبار"];
I have an issue in an application I'm writing where I need to compare one NSURL that points to a file and an NSString, which is an incoming string representation of the same file path.
I can't get them to compare – the output I'm given when NSLogging is confusing; perhaps it is an encoding issue?
I can make them look the same with this code: [urlString stringByRemovingPercentEncoding];
The raw output for the NSURL is:
file:///var/mobile/Applications/F14AFBD8-FF60-4094-8BBD-7AC2477E0B20/Documents/1.%20AKTIV%20SA%CC%88LJFOLDER/Sa%CC%88ljfolder2014-SP1.pdf
And for the NSString:
/var/mobile/Applications/F14AFBD8-FF60-4094-8BBD-7AC2477E0B20/Documents/1. AKTIV SÄLJFOLDER/Säljfolder2014-SP1.pdf
If I run stringByRemovingPercentEncoding on the NSURL it looks the same, but they don't compare.
If I run stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding to the NSString I get file:///var/mobile/Applications/F14AFBD8-FF60-4094-8BBD-7AC2477E0B20/Documents/nestle/1.%20AKTIV%20S%C3%84LJFOLDER/S%C3%A4ljfolder2014-SP1.pdf
Note that the percent-escapes are not the same in the two URLs. I have tried so many things, changing encodings etc., but can't find a way to solve this.
Edit
So, I tried the precomposedStringWithCanonicalMapping as follows:
NSLog(@"EQUAL? :%hhd", [[strippedUrlString precomposedStringWithCanonicalMapping] isEqualToString:[filePath precomposedStringWithCanonicalMapping]]); – returns 0
I logged the strings and got
/Users/xxxxxx/Library/Application Support/iPhone Simulator/7.0/Applications/C05E0885-7B58-4B2F-A6B4-D9388E60462C/Documents/1. AKTIV SÄLJFOLDER/Säljfolder2014-SP1.pdf
with NSLog(@"Precompose url 1: %@", [strippedUrlString precomposedStringWithCanonicalMapping]);
for the first string and
/Users/xxxxxx/Library/Application%20Support/iPhone%20Simulator/7.0/Applications/C05E0885-7B58-4B2F-A6B4-D9388E60462C/Documents/1.%20AKTIV%20SA%CC%88LJFOLDER/Sa%CC%88ljfolder2014-SP1.pdf
with NSLog(@"Precompose file 1: %@", [filePath precomposedStringWithCanonicalMapping]);
for the second.
Tried same code, but with precomposedStringWithCompatibilityMapping and got exactly the same result :(
You have probably run into the problem that Unicode-equivalent strings are not always binary equal.
http://en.wikipedia.org/wiki/Unicode_equivalence
You have
…SA%CC%88…:
This is the problem.
It means: we have an "A" followed by a combining diaeresis -> Ä. The bytes 0xCC 0x88 are the UTF-8 encoding of U+0308 (COMBINING DIAERESIS). So the Ä is encoded as an A with a combining diaeresis.
…S%C3%84…:
This is easy. 0xC3 0x84 is the UTF-8 encoding of U+00C4, which is A-umlaut -> Ä
First of all: What is the source of the first string?
Addition: You can use precomposedStringWith…Mapping (NSString).
BTW: You can compare strings without diacritic marks using -compare:options: et al. with the option NSDiacriticInsensitiveSearch. In this case, I assume, string 1 equals string 2. But it would equal an "A", too, which is probably not what you want.
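Putting the pieces together, a comparison along these lines should work, sketched here in Swift with shortened sample paths. The important detail is that both sides must have their percent-escapes removed before normalizing; in the log output in the question, the second string is still percent-encoded. samePath(percentEncoded:plain:) is a hypothetical helper name.

```swift
import Foundation

/// Hypothetical helper: true when a percent-encoded path and a plain path
/// name the same file once both are decoded and normalized to form C.
func samePath(percentEncoded: String, plain: String) -> Bool {
    guard let decoded = percentEncoded.removingPercentEncoding else { return false }
    let lhs = decoded.precomposedStringWithCanonicalMapping
    let rhs = plain.precomposedStringWithCanonicalMapping
    // Literal scalar-by-scalar comparison, with no hidden normalization.
    return lhs.unicodeScalars.elementsEqual(rhs.unicodeScalars)
}

let fromURL = "/Documents/1.%20AKTIV%20SA%CC%88LJFOLDER/Sa%CC%88ljfolder2014-SP1.pdf"
let fromString = "/Documents/1. AKTIV S\u{c4}LJFOLDER/S\u{e4}ljfolder2014-SP1.pdf"

print(samePath(percentEncoded: fromURL, plain: fromString))   // true
```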