In one of my UIView classes, I adopt the UIKeyInput protocol to gather input from the keyboard. I'm trying to figure out which ASCII character is sent when the space key is pressed (it appears to be something other than a plain ' '). Does anyone know what this character is, or how I can find out which character code is being used?
To look at the value for each character you can do something like this:
NSString *text = ... // the text to examine
for (NSUInteger c = 0; c < text.length; c++) {
    unichar ch = [text characterAtIndex:c]; // "char" is a reserved word in C, so use another name
    NSLog(@"char = %x", (int)ch); // Log the hex value of the UTF-16 code unit
}
Please note that this code doesn't properly handle Unicode characters in the range U+10000 and up, which NSString stores as two unichar values (a surrogate pair). This includes many of the emoji characters.
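To make the U+10000 caveat concrete, here is a minimal sketch of how UTF-16 code units (what characterAtIndex: returns) combine into full code points. It is written in Python purely as a language-neutral illustration, not as the Objective-C API; the helper names utf16_units and code_points_from_utf16 are made up for this example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text, as a unichar-based loop would see them."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_points_from_utf16(units):
    """Combine UTF-16 code units into full Unicode code points."""
    points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A high surrogate followed by a low surrogate encodes one code point.
            low = units[i + 1]
            points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            points.append(unit)
            i += 1
    return points


units = utf16_units("a\U0001F604")  # "a" followed by an emoji
print([hex(u) for u in units])                           # three code units
print([hex(p) for p in code_points_from_utf16(units)])   # two code points
```

The emoji occupies two code units, so a loop that logs one unichar at a time reports the two surrogate halves rather than the single code point U+1F604.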
If you really need to know what character (or code point) it actually is, use the CFMutableString function CFStringTransform().
It lets you pass the transform kCFStringTransformToUnicodeName to get the human-readable Unicode name, for example, or the ICU transform Any-Hex to get the escaped Unicode code point (Hex-Any goes the other way, from escapes back to characters).
Otherwise you can use the unichar approach to simply get the code point.
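As a rough analogue of the Unicode-name transform, the sketch below looks up the code point and official name of each character using Python's unicodedata module. It is an illustration in Python rather than the CoreFoundation call, and the helper name describe is hypothetical. U+00A0 is used only as an example of a space look-alike; this is not a claim about what the keyboard actually sends.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# An ordinary space followed by a no-break space: they render alike
# but have different code points and names.
for code_point, name in describe(" \u00a0"):
    print(code_point, name)
```

Printing the name alongside the hex value makes it obvious when two characters that look identical on screen are in fact different.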
I have a string that includes some special characters (like é, â, î, ı, etc.). When I use substring on this string, I get inconsistent results: some of the special characters change unexpectedly.
You are assuming that these are all characters:
[newword substringWithRange:NSMakeRange(0,1)];
[newword substringWithRange:NSMakeRange(1,1)];
[newword substringWithRange:NSMakeRange(2,1)];
[newword substringWithRange:NSMakeRange(3,1)];
// and so on...
In other words, you believe that:
A location always falls at the start of a character.
A character always has length 1.
Both assumptions are wrong. Please read the "Characters and Grapheme Clusters" chapter of Apple's String Programming Guide.
Your é happens to have length 2, because it is a base letter e followed by a combining diacritical accent. If you want it to have length 1, you need to normalize the string before you use it. Call precomposedStringWithCanonicalMapping and use the resulting string.
Example and proof (in Swift, but it won't matter, as I use NSString throughout):
let s = "é,â,î,ı" as NSString
let c = s.substring(with: NSRange(location: 0, length: 1)) // e
let s2 = s.precomposedStringWithCanonicalMapping as NSString
let c2 = s2.substring(with: NSRange(location: 0, length: 1)) // é
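The same effect can be reproduced outside Foundation. The sketch below uses Python's unicodedata.normalize with form NFC, which is the normalization that precomposedStringWithCanonicalMapping applies; it is an illustration in Python, not the NSString API.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: looks like one letter, stored as two.
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(decomposed[:1] == "e", composed[:1] == "\u00e9")
```

Taking the first element of the decomposed string yields a bare e, while the first element of the normalized string is the accented letter.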
You're treating a Unicode string like a sequence of bytes. Apart from the ASCII range, Unicode code points take more than one byte in UTF-8, so by cutting the string at arbitrary offsets you change the text by stripping out the parts responsible for the accent above the letter, like this one: https://www.compart.com/en/unicode/U+0301
UTF-8 is variable width, so by treating it as raw bytes you may get weird results. I would suggest using something that is more aware of Unicode, like ICU (International Components for Unicode).
Now imagine you have a two-byte sequence like this:
0x65 0x00
e    NUL
Now you have a UTF-8 string with 1 code point and a null terminator. Now say you want to add an accent to that e. How would you do that? You could use a special Unicode code point to modify the e, so now the string is:
0x65 0xCC 0x81 0x00
e    U+0301    NUL
Where U+0301 (Combining Acute Accent) is a 2-byte combining character that makes the e accented.
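For reference, the actual UTF-8 bytes can be checked with a few lines of Python (used here only as a neutral illustration; hex_bytes is a hypothetical helper). The sketch also shows what happens when the sequence is cut after the first byte.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


plain = "e".encode("utf-8")
accented = "e\u0301".encode("utf-8")  # e + COMBINING ACUTE ACCENT

print(hex_bytes(plain))      # one byte for the base letter
print(hex_bytes(accented))   # base letter plus two bytes for U+0301

# Cutting after the first byte keeps the base letter and drops the accent.
truncated = accented[:1].decode("utf-8")
print(truncated)
```

The base letter is the single byte 65 and the combining accent adds the two bytes cc 81, so a cut between them silently yields an unaccented e.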
Edit: The answer assumes UTF-8 encoding, which is likely a bad assumption, but I think the answer, whether UTF-8 or UTF-16 or any other encoding with combining characters, illustrates why you may have mysteriously disappearing accents. While this may be UTF-16, for the sake of simplicity let's pretend we live in a world where life is just slightly better because everyone only uses UTF-8 and UTF-16 doesn't exist.
To address the comment (this has less to do with the question, but it is some fun trivia), here are some details about the NS/CF/Swift runtimes, bridging, constant CF strings, and so on: the representation of the actual string in memory is implementation defined and can vary (even for constant strings; trust me, I know, I fixed the ELF implementation of them in Clang for CoreFoundation a few days ago). Anyway, here's some code:
CF_INLINE CFStringEncoding __CFStringGetSystemEncoding(void) {
    if (__CFDefaultSystemEncoding == kCFStringEncodingInvalidId) (void)CFStringGetSystemEncoding();
    return __CFDefaultSystemEncoding;
}
CFStringEncoding CFStringFileSystemEncoding(void) {
    if (__CFDefaultFileSystemEncoding == kCFStringEncodingInvalidId) {
#if DEPLOYMENT_TARGET_MACOSX || DEPLOYMENT_TARGET_EMBEDDED || DEPLOYMENT_TARGET_EMBEDDED_MINI || DEPLOYMENT_TARGET_WINDOWS
        __CFDefaultFileSystemEncoding = kCFStringEncodingUTF8;
#else
        __CFDefaultFileSystemEncoding = CFStringGetSystemEncoding();
#endif
    }
    return __CFDefaultFileSystemEncoding;
}
You'll find checks like these throughout CoreFoundation/Foundation/SwiftFoundation. (Yes, you never know which sort of NSString you're actually holding. They usually pretend to be the same thing, but under the hood, depending on how you got the object, you may be holding one of the three variations.)
This is why code like the following exists: NS/CF(Constant)/Swift strings have an implementation-defined internal representation.
if (((encoding & 0x0FFF) == kCFStringEncodingUnicode) && ((encoding == kCFStringEncodingUnicode) || ((encoding > kCFStringEncodingUTF8) && (encoding <= kCFStringEncodingUTF32LE)))) {
If you want consistent behavior you have to encode the string using a specific fixed encoding instead of relying on the internal representation.
I've run into an issue displaying the trademark "TM" character in my UILabel.
The "TM" character that fails to show up is \U0099 instead of the usual \U2122.
I dug a little deeper and found out that the "TM" character \U0099 belongs to only a very few Chinese fonts.
So I'm guessing iOS doesn't have a font to show it in labels, or does not recognize it at all.
I've tried to scan my data for "\U0099" and replace it with \U2122, but it seems that NSString functions escape Unicode characters automatically, so this "TM" character won't even be there.
Has anyone encountered this issue before, or can anyone suggest how to deal with this \U0099 character?
Thanks in advance
It is unclear to me how you've obtained your NSString or what you have actually tried to solve your problem. So this suggestion might be completely unsuitable, but let's see if it helps...
U+0099 is an unassigned Unicode control character, it is not a TM symbol. It is fairly hard to get this character into an NSString as Clang at least objects if you place the escape into a literal, and Cocoa fails to translate a sequence of bytes in UTF-8 into an NSString if it contains it. This problem might be what is behind your comment that you could not string replace it.
However starting with UTF-16, I did manage to create a string with U+0099 in it:
unichar b[] = { 0x61, 0x62, 0x63, 0x99, 0x64, 0x65, 0x66 };
NSString *s = [[NSString alloc] initWithBytes:b length:14 encoding:NSUTF16LittleEndianStringEncoding];
That is the string "abc\U0099def" (calling characterAtIndex:3 will show you this).
Using the same approach an NSString with just U+0099 in it can be generated:
unichar notTMChar = 0x99;
NSString *notTMStr = [[NSString alloc] initWithBytes:&notTMChar length:2 encoding:NSUTF16LittleEndianStringEncoding];
and that can be used in a string replace call:
NSString *t = [s stringByReplacingOccurrencesOfString:notTMStr withString:@"™"];
giving t the value "abc™def" as required.
Warning: We are dealing with an unassigned Unicode control character here. Clang/Cocoa rejected it in UTF-8, it is probably unintentional that it accepted it in UTF-16. Using C library functions to do this is probably more reliable. Xcode 5.1.1 with Clang 5.1 was used for the tests.
HTH
Thanks for the suggestions.
I've talked to my clients and they agreed that \u0099 shouldn't be there.
I have also implemented rmaddy's suggestion to replace each instance of \u0099 with \u2122.
NSString *problemString = dictionaryWithU099AsValue.description;
if ([problemString rangeOfString:@"U0099"].location != NSNotFound) {
    NSString *fixedDescriptionString = [problemString stringByReplacingOccurrencesOfString:@"U0099" withString:@"U2122"];
    // Then I reconstruct the NSString back to a new NSDictionary
}
Note that the trademark symbol ™ appears as hex 99 in Code Page 1252 (a common Windows character set).
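The Code Page 1252 connection is easy to verify. The sketch below, in Python as a neutral illustration, decodes the byte 0x99 two ways and then applies the same replacement the thread describes. The sample bytes "Acme" plus 0x99 are made up for this example.

```python
# Hypothetical sample data: the text "Acme" followed by the single byte 0x99.
raw = b"Acme\x99"

# Decoded as Windows-1252, byte 0x99 is the trademark sign U+2122.
as_cp1252 = raw.decode("cp1252")

# Decoded as Latin-1, the same byte becomes the control character U+0099.
as_latin1 = raw.decode("latin-1")

# The repair used in the thread: swap U+0099 for U+2122.
fixed = as_latin1.replace("\u0099", "\u2122")

print(as_cp1252 == fixed)
print([f"U+{ord(ch):04X}" for ch in as_latin1[-1:] + fixed[-1:]])
```

This suggests the stray U+0099 most likely came from Windows-1252 data that was decoded with the wrong character set somewhere upstream, although the thread does not confirm where the data originated.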
I have an NSString like:
test = @"this is %25test%25 string";
I am trying to replace test with some Arabic text, but it is not replaced exactly as written:
[test stringByReplacingOccurrencesOfString:@"test" withString:@"اختبار"];
and the result is:
this is %25 اختبار %25 string
I read somewhere that there could be a problem with encoding or text alignment. Is any extra adjustment needed for Arabic string operations?
EDIT: I have used NSMutableString's insert method, but I still get the same result.
EDIT 2:
One other thing that occurs to me that is causing most of your trouble in this specific example. You have a partially percent-encoded string above. You have spaces, but you also have %25. You should avoid doing that. Either percent-encode a string or don't. Convert it all at once when required (using stringByAddingPercentEscapesUsingEncoding:). Don't try to "hard-code" percent-encoding. If you just used "this is a %اختبار% string" (and then percent-encoded the entire thing at the end), all your directional problems would go away (see how that renders just fine?). The rest of these answers address the more general question when you really need to deal with directionality.
EDIT:
The original answer after the line relates to human-readable strings, and is correct for human-readable strings, but your actual question (based on your followups) is about URLs. URLs are not human-readable strings, even if they occasionally look like them. They are a sequence of bytes that are independent of how they are rendered to humans. "اختبار" cannot be in the path or fragment parts of an URL. These characters are not part of the legal set of characters for those sections (اختبار is allowed to be part of the host, but you have to follow the IDN rules for that).
The correct URL encoding for "this is a %25<arabic>%25 string" is:
this%20is%20a%20%2525%D8%A7%D8%AE%D8%AA%D8%A8%D8%A7%D8%B1%2525%20string
If you decode and render this string to the screen, it will appear like this:
this is a %25اختبار%25 string
But it is in fact exactly the string you mean (and it is the string you should pass to the browser). Follow the bytes (like the computer will):
this - this (ALPHA)
%20 - <space> (encoded)
is - is (ALPHA)
%20 - <space> (encoded)
a - a (ALPHA)
%20 - <space> (encoded)
%25 - % (encoded)
25 - 25 (DIGIT)
%D8%A7 - ا (encoded)
%D8%AE - خ (encoded)
%D8%AA - ت (encoded)
%D8%A8 - ب (encoded)
%D8%A7 - ا (encoded)
%D8%B1 - ر (encoded)
%25 - % (encoded)
25 - 25 (DIGIT)
%20 - <space> (encoded)
string - string (ALPHA)
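The byte-by-byte walk above can be reproduced with a standard percent-encoder. This sketch uses Python's urllib.parse.quote as a neutral illustration (it is not the NSString method); passing safe="" is a choice made here so that every reserved character is encoded.

```python
from urllib.parse import quote, unquote

human_readable = "this is a %25اختبار%25 string"

# Encode everything that is not an unreserved character, as UTF-8 bytes.
encoded = quote(human_readable, safe="")
print(encoded)

# Decoding restores the original logical character order exactly.
print(unquote(encoded) == human_readable)
```

Note that the literal "%25" in the input becomes "%2525", because the percent sign itself has to be encoded, and each Arabic letter becomes two percent-encoded bytes.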
The Unicode BIDI display algorithm is doing what it is meant to do; it just isn't what you expect. But those are the bytes and they're in the correct order. If you add any additional bytes (such as LRO) to this string, then you are modifying the URL and it means something different.
So the question you need to answer is, are you making an URL, or are you making a human-readable string? If you're making an URL, it should be URL-encoded, in which case you will not have this display problem (unless this is part of the host, which is a different set of rules, but I don't believe that's your problem). If this is a human-readable string, see below about how to provide hints and overrides to the BIDI algorithm.
It's possible that you really need both (a human-friendly string, and a correct URL that can be pasted). That's fine, you just need to handle the clipboard yourself. Show the string, but when the user goes to copy it, replace it with the fully encoded URL using UIPasteboard or by overriding copy:. See Copy, Cut, and Paste Operations. This is fairly common (note how Safari displays just "stackoverflow.com" in the address bar, but if you copy and paste it, it pastes "https://stackoverflow.com/"). Same thing.
Original answer related to human-readable strings.
Believe it or not, stringByReplacingOccurrencesOfString: is doing the right thing. It's just not displaying the way you expect. If you walk through characterAtIndex:, you'll find that it's:
% 2 5 ا ...
The problem is that the layout engine gets very confused around all the "neutral direction" characters. The engine doesn't understand whether you meant "%25" to be attached to the left to right part or right to left part. You have to help it out here by giving it some explicit directional characters to work with.
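To see why the layout engine has so little to go on, it helps to look at the bidirectional class of each character involved. The sketch below reads those classes from Python's unicodedata module (a neutral illustration, not the layout engine itself). In UAX #9 terms, the percent sign and the digits are weak types rather than strongly directional ones.

```python
import unicodedata

# Percent sign, two digits, an Arabic letter, a Latin letter, and a space.
sample = "%25\u0627t "

classes = [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in sample]
for code_point, bidi_class in classes:
    print(code_point, bidi_class)
```

Only the Arabic letter (class AL) and the Latin letter (class L) carry a strong direction; the percent sign (ET), the digits (EN), and the space (WS) take their direction from context, which is why "%25" can end up attached to either side.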
There are a few ways to go about this. First, you can do it the Unicode 6.3 tr9-29 way with Explicit Directional Isolates. This is exactly the kind of problem that Isolates are meant to solve. You have some piece of text whose direction you want to be considered completely independently of all other text. Unicode 6.3 isn't actually supported by iOS or OS X as best I can tell, but for many (though not all) uses, it "works."
You want to surround your Arabic with FSI (FIRST STRONG ISOLATE U+2068) and PDI (POP DIRECTIONAL ISOLATE U+2069). You could also use RLI (RIGHT-TO-LEFT ISOLATE) to be explicit. FSI means "treat this text as being in the direction of the first strong character you find."
So you could ideally do this:
NSString *test = @"this is a %25\u2068test\u2069%25 string";
NSString *arabic = @"اختبار";
NSString *result = [test stringByReplacingOccurrencesOfString:@"test" withString:arabic];
That works if you know what you're going to substitute before hand (so you know where to put the FSI and PDI). If you don't, you can do it the other way and make it part of the substitution:
NSString * const FSI = @"\u2068";
NSString * const PDI = @"\u2069";
NSString *test = @"this is %25test%25 string";
NSString *arabic = @"اختبار";
NSString *replaceString = [@[FSI, arabic, PDI] componentsJoinedByString:@""];
NSString *result = [test stringByReplacingOccurrencesOfString:@"test" withString:replaceString];
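The substitution-side approach can be sketched independently of Foundation. The Python below is only an illustration of the string manipulation; the helper name isolate is hypothetical, and how the result renders depends entirely on whether the display layer supports isolates.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text):
    """Wrap text so its direction is resolved independently of its surroundings."""
    return FSI + text + PDI


template = "this is %25test%25 string"
arabic = "اختبار"

result = template.replace("test", isolate(arabic))

# The logical order is unchanged; only two invisible formatting characters are added.
print([f"U+{ord(ch):04X}" for ch in result if ch in (FSI, PDI)])
print(result.replace(FSI, "").replace(PDI, "") == template.replace("test", arabic))
```

Because the wrapper travels with the substituted text, the template does not need to know in advance where right-to-left content will land.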
I said this "mostly" works. It's fine for UILabel, and it probably is fine for anything using Core Text. But in NSLog output, you'll get extra "placeholder" characters where the isolates are.
You might get this other places, too. I haven't checked UIWebView for instance.
So there are some other options. You can use directional marks. It's a little awkward, though. LRM and RLM are zero-width strongly directional characters. So you can bracket the arabic with LRM (left to right mark) so that the arabic doesn't disturb the surrounding text. This is a little ugly since it means the substitution has to be aware of what it's substituting into (which is why isolates were invented).
NSString * const LRM = @"\u200e";
NSString *test = @"this is a %25test%25 string";
NSString *arabic = @"اختبار";
NSString *replaceString = [@[LRM, arabic, LRM] componentsJoinedByString:@""];
NSString *result = [test stringByReplacingOccurrencesOfString:@"test" withString:replaceString];
BTW, Directional Marks are usually the right answer. They should always be the first thing you try. This particular problem is just a little too tricky.
One more way is to use Explicit Directional Overrides. These are the giant "do what I tell you to do" hammer of the Unicode world. You should avoid them whenever possible. There are some security concerns with them that make them forbidden in certain places (<RLO>elgoog<PDF>.com would display as google.com for instance). But they will work here.
You bracket the whole string with LRO/PDF to force it to be left-to-right. You then bracket the substitution with RLO/PDF to force it to the right-to-left. Again, this is a last resort, but it lets you take complete control over the layout:
NSString * const LRO = @"\u202d";
NSString * const RLO = @"\u202e";
NSString * const PDF = @"\u202c";
NSString *test = [@[LRO, @"this is a %25test%25 string", PDF] componentsJoinedByString:@""];
NSString *arabic = @"اختبار";
NSString *replaceString = [@[RLO, arabic, PDF] componentsJoinedByString:@""];
NSString *result = [test stringByReplacingOccurrencesOfString:@"test" withString:replaceString];
I would think you could solve this problem with the Explicit Directional Embedding characters, but I haven't really found a way to do it without at least one override (for instance, you could use RLE instead of RLO above, but you still need the LRO).
Those should give you the tools you need to figure all of this out. See the Unicode TR9 for the gory details. And if you want a deeper introduction to the problem and solutions, see Cal Henderson's excellent Understanding Bidirectional (BIDI) Text in Unicode.
You should try like this:
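NSString *test = @"this is %25test%25 string";
// NSStringEncodingConversionAllowLossy is a conversion option, not an encoding; pass a real encoding such as NSUTF8StringEncoding
NSString *test2 = [[[test stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding] componentsSeparatedByString:@"test"] componentsJoinedByString:@"اختبار"];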
So as I work my way through understanding string methods, I came across this useful class
NSCharacterSet
which is described quite well in this post as being similar to a string, except that it holds the characters in an unordered set
What is differnce between NSString and NSCharacterset?
So then I came across the useful method invertedSet, and it became a little less clear what exactly was happening. I read page after page on it, and they all glossed over the basics and jumped into advanced explanations. So if you want to know, SIMPLY put, what this is and why we use it, it is not so easy; instead you get statements like this from the Apple documentation: "A character set containing only characters that don't exist in the receiver." And how do I use this exactly?
So here is what I understand the use to be. PLEASE correct me in simple terms if I have explained this incorrectly.
Example Use:
Create a list of characters in an NSCharacterSet that you want to limit a string to.
NSString *validNumberChars = @"0123456789"; // Only these are valid.
// Now assign to an NSCharacterSet object to use for searching and comparing later
NSCharacterSet *validCharSet = [NSCharacterSet characterSetWithCharactersInString:validNumberChars];
// Now create an inverted set of the validCharSet.
NSCharacterSet *invertedValidCharSet = [validCharSet invertedSet];
// Now scrub your input string of bad characters, those characters not in the validCharSet
NSString *scrubbedString = [inputString stringByTrimmingCharactersInSet:invertedValidCharSet];
// By passing in invertedValidCharSet as the characters to trim out, you are left with only characters that are in the original set, captured here in scrubbedString.
So is this how to use this feature properly, or did I miss anything?
Thanks
Steve
A character set is just that - a set of characters. When you invert a character set you get a new set that has every character except those from the original set.
In your example you start with a character set containing the 10 standard digits. When you invert the set you get a set that has every character except the 10 digits.
validCharSet = [NSCharacterSet characterSetWithCharactersInString:validNumberChars];
This creates a character set containing the 10 characters 0, 1, ..., 9.
invertedValidCharSet = [validCharSet invertedSet];
This creates the inverted character set, i.e. the set of all Unicode characters without
the 10 characters from above.
scrubbedString = [inputString stringByTrimmingCharactersInSet:invertedValidCharSet];
This removes from the start and end of inputString all characters that are in
the invertedValidCharSet. For example, if
inputString = @"abc123d€f567ghj😄"
then
scrubbedString = @"123d€f567"
It does not, as you perhaps expect, remove all characters from the given set.
One way to achieve that is (copied from NSString - replacing characters from NSCharacterSet):
scrubbedString = [[inputString componentsSeparatedByCharactersInSet:invertedValidCharSet] componentsJoinedByString:@""];
This is probably not the most efficient method, but as your question was about understanding
NSCharacterSet I hope that it helps.
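The difference between trimming and filtering is easy to see side by side. The sketch below models both operations in Python as a neutral illustration (it does not call NSCharacterSet); the function names trim_invalid and keep_valid are made up for this example.

```python
VALID = set("0123456789")


def trim_invalid(text):
    """Strip characters outside VALID from both ends only, like trimming with the inverted set."""
    start, end = 0, len(text)
    while start < end and text[start] not in VALID:
        start += 1
    while end > start and text[end - 1] not in VALID:
        end -= 1
    return text[start:end]


def keep_valid(text):
    """Drop every character outside VALID, wherever it occurs."""
    return "".join(ch for ch in text if ch in VALID)


input_string = "abc123d\u20acf567ghj\U0001F604"
print(trim_invalid(input_string))
print(keep_valid(input_string))
```

Trimming stops at the first valid character from each end and leaves the interior untouched, while filtering examines every character.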
Is it possible to detect if an ascii character belongs to Asian double byte or Cyrillic character sets? Perhaps specific code ranges? I've googled, but not finding anything at first glance.
There's an RSS feed I'm tapping into that has the locale set as 'en-gb'. But there are some Asian double byte characters in the feed itself - which I need to handle differently. Just not sure how to detect it since the meta locale data is incorrect. I do not have access to correct the public feed.
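If your RSS feed uses UTF-8, which it probably does, just check whether the character's code is greater than 255. That tells you the character is outside Latin-1; it does not by itself tell you whether it is Asian or Cyrillic.
A quick Google search suggests that you might want to look at String.charCodeAt
I don't know ActionScript, but I would expect a code snippet to look something like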
var stringToTest:String;
for (var i:int = 0; i < stringToTest.length; i++) {
    if (stringToTest.charCodeAt(i) > 255) {
        // Do something to your double-byte character here
    } else {
        // You have a single-byte (Latin-1) character here
    }
}
I hope this helps!
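If you need to tell Cyrillic from East Asian text rather than just "above 255", checking code-point ranges is one option. The sketch below is in Python as a neutral illustration, not ActionScript. The blocks listed are a small, deliberately incomplete selection (basic Cyrillic, Hiragana, Katakana, CJK Unified Ideographs, Hangul Syllables), and the names RANGES and classify are hypothetical.

```python
# A partial table of Unicode blocks; extend it for extension blocks as needed.
RANGES = {
    "Cyrillic": [(0x0400, 0x04FF)],
    "CJK": [
        (0x3040, 0x309F),  # Hiragana
        (0x30A0, 0x30FF),  # Katakana
        (0x4E00, 0x9FFF),  # CJK Unified Ideographs
        (0xAC00, 0xD7AF),  # Hangul Syllables
    ],
}


def classify(ch):
    """Return the label of the first range containing ch, or "other"."""
    code = ord(ch)
    for label, spans in RANGES.items():
        if any(low <= code <= high for low, high in spans):
            return label
    return "other"


for ch in "a\u0416\u6f22\u20ac":
    print(f"U+{ord(ch):04X}", classify(ch))
```

The euro sign is above 255 but is neither Cyrillic nor CJK, which is the case a plain greater-than-255 test cannot distinguish.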