I want to escape a Unicode word so it can be used in a URL for an HTTP request. For example, I want to convert "محمود" to "%D9%85%D8%AD%D9%85%D9%88%D8%AF". I noticed that each character gets converted to two hex-encoded bytes.
Thanks a lot
Convert to UTF-8, then url-encode chars not in [[:alnum:]].
URL-encoding converts a byte into the form %<HIGHNIBBLE><LOWNIBBLE>, where HIGHNIBBLE = (ch >> 4) & 0x0F and LOWNIBBLE = ch & 0x0F, each written as a hex digit.
See RFC 1738 § 2.2 for more details.
Because it looks like you're using Java, you'll have to work with byte[] instead of String or char[].
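A minimal Java sketch of both routes, since Java was mentioned (URLEncoder and StandardCharsets are standard JDK classes; the manual loop is only there to illustrate the nibble arithmetic above):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PercentEncode {
    public static void main(String[] args) throws Exception {
        String word = "محمود";

        // The library route: UTF-8 bytes, then %XX per byte.
        System.out.println(URLEncoder.encode(word, "UTF-8"));
        // -> %D9%85%D8%AD%D9%85%D9%88%D8%AF

        // The manual route, showing the high/low nibble arithmetic.
        StringBuilder sb = new StringBuilder();
        for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
            int ch = b & 0xFF;
            sb.append('%');
            sb.append(Character.toUpperCase(Character.forDigit((ch >> 4) & 0x0F, 16)));
            sb.append(Character.toUpperCase(Character.forDigit(ch & 0x0F, 16)));
        }
        System.out.println(sb); // same output as above
    }
}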
If we percent-encode the character "€", we get %E2%82%AC as the result. OK!
My problem:
a = %61
I already know it.
Is it possible to encode "a" to something like %XX%XX or %XX%XX%XX?
If yes, will browsers and servers understand the result as the char "a"?
If we percent-encode the character "€", we get %E2%82%AC as the result.
€ is Unicode codepoint U+20AC EURO SIGN. The byte sequence 0xE2 0x82 0xAC is how U+20AC is encoded in UTF-8. %E2%82%AC is the URL encoding of those bytes.
a = %61
I already know it.
For ASCII character a, aka Unicode codepoint U+0061 LATIN SMALL LETTER A, that is correct. It is encoded as byte 0x61 in UTF-8 (and most other charsets), and thus can be encoded as %61 in URLs.
Is it possible to encode "a" to something like %XX%XX or %XX%XX%XX?
Yes. Any character can be encoded using percent-encoding in a URL: simply encode the character in the appropriate charset, then percent-encode the resulting bytes. However, most non-reserved ASCII characters do not require such encoding; just use them as-is.
If yes, will browsers and servers understand the result as the char "a"?
In URLs and URL-like content encodings (like application/x-www-form-urlencoded), yes.
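A small Java check of both claims, assuming a UTF-8 context (which is what browsers and current servers default to):

import java.net.URLDecoder;
import java.net.URLEncoder;

public class PercentRoundTrip {
    public static void main(String[] args) throws Exception {
        // Multi-byte UTF-8 sequence for the euro sign.
        System.out.println(URLEncoder.encode("€", "UTF-8"));         // %E2%82%AC
        System.out.println(URLDecoder.decode("%E2%82%AC", "UTF-8")); // €

        // A percent-encoded ASCII letter decodes exactly like the bare letter.
        System.out.println(URLDecoder.decode("%61", "UTF-8"));       // a
    }
}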
I am working on an iOS app that handles Unicode characters, but it seems there is some problem with translating a Unicode hex value (and its int value) to a character.
For example, I want to get the character 'đ', which has a Unicode value of c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓉' (a Korean syllable) instead.
I also used:
int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of two above pieces of code are the same.
I can't understand what is problem here, please help!
The short answer
You can specify đ in any of the following ways (untested):
#"đ"
#"\u0111"
#"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last two lines use C string literals instead of the Objective-C string object literal construct @"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into proper characters.
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond Basic Multilingual Plane (Plane 0) in Unicode.
Unicode escape sequences (Universal character names in C99)
According to this blog [1]:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of the C99 TC2 draft shows the forms \unnnn and \Unnnnnnnn, where nnnn or nnnnnnnn is a "short identifier as defined by the ISO/IEC 10646 standard", which roughly means the hexadecimal code point. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confusing the code point U+0111 with the UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly.
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encodings, such as UTF-16 and UTF-32, which may give different byte representations of the character on disk, but since UTF-8, UTF-16, and UTF-32 are all encodings of the Unicode character set, the code point for the same character is the same across all three encodings.
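To make the distinction concrete, here is a small sketch (in Java purely because the byte inspection is short there; the code point vs. encoded-bytes distinction is the same in Objective-C):

import java.nio.charset.StandardCharsets;

public class CodePointVsBytes {
    public static void main(String[] args) {
        String d = "\u0111"; // đ, code point U+0111

        // The code point is a single number, independent of any encoding.
        System.out.printf("code point: U+%04X%n", d.codePointAt(0));   // U+0111

        // UTF-8 stores this character as two bytes: C4 91.
        for (byte b : d.getBytes(StandardCharsets.UTF_8))
            System.out.printf("UTF-8 byte:  %02X%n", b & 0xFF);

        // UTF-16 (big-endian) also uses two bytes, but different ones: 01 11.
        for (byte b : d.getBytes(StandardCharsets.UTF_16BE))
            System.out.printf("UTF-16 byte: %02X%n", b & 0xFF);
    }
}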
Footnote
[1]: I think the blog is correct, but if anyone can find official documentation from Apple on this point, it would be better.
I can't understand what encoding approach Thunderbird uses when searching on an IMAP server with the IMAP SEARCH CHARSET command.
I tried to search for the Russian word "привет", and it was mapped to "?@825B", i.e.
A001 SEARCH CHARSET ISO-8859-1 BODY "?@825B"
How did that happen? I'm sure this is correct, as I used a sniffer to capture it, and the Dovecot server correctly found the mail containing the word "привет". The ISO-8859-1 encoding has no Russian glyphs at all! So how was it converted?
For example, "привет" (written as Unicode characters) gives "??????" in ISO-8859-1 encoding on my machine, or here: http://www.motobit.com/util/charset-codepage-conversion.asp
The way that Thunderbird is getting this value is by downcasting each (16-bit?) Unicode character to a byte.
For example, in C# (which uses UTF-16 internally for its char and string types), this would get the result you are seeing:
const string text = "привет";
var buffer = new char[text.Length];
for (int i = 0; i < text.Length; i++)
    buffer[i] = (char) ((byte) text[i]); // keep only the low 8 bits of each UTF-16 code unit
var result = new string(buffer);         // "?@825B"
How Thunderbird handles surrogate pairs is anyone's guess based on what is known from the question. It might treat the surrogate pair as 2 separate characters (like my above code would) or it might combine them into a 32-bit unicode character and downcast that to a byte.
I have a load testing tool (Borland's SilkPerformer) that is encoding the / character as \x252f. Can anyone please tell me what encoding system the tool might be using?
Two different escape sequences have been combined:
a C string hexadecimal escape sequence.
a URL-encoding scheme (percent-encoding).
See this little diagram:
+-----------> C escape in hexadecimal notation
:  +--------> hexadecimal number, ASCII for '%'
:  :  +-----> hexadecimal number, ASCII for '/'
\x 25 2f
Explained step by step:
\ starts a C string escape sequence.
x is an escape in hexadecimal notation. Two hex digits are expected to follow.
25 is % in ASCII, see ASCII Table.
% starts a URL-encoded sequence, also called percent-encoding. Two hex digits are expected to follow.
2f is the slash character (/) in ASCII.
The slash is the result.
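As a tiny Java sketch of undoing the two layers in order (the string literal below just reproduces what the C escape expands to):

import java.net.URLDecoder;

public class DoubleDecode {
    public static void main(String[] args) throws Exception {
        // Layer 1: the C escape \x25 expands to '%', so "\x25" "2f" is the text "%2f".
        String afterCEscape = "%2f";

        // Layer 2: URL-decoding "%2f" yields the slash.
        System.out.println(URLDecoder.decode(afterCEscape, "UTF-8")); // prints: /
    }
}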
Now I don't know why your software chooses to encode the slash character in such a weird way. Slash characters in URLs need to be URL-encoded if they don't denote directory separators (the same role the backslash plays on Windows). So you will often find the slash character encoded as %2f; that's normal. But I find it weird and a bit suspicious that the percent character is additionally encoded as a hexadecimal escape sequence for C strings.
I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and the dash character and extract the numbers.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode: it's because the dash in your input isn't an ASCII hyphen (hex 2D) but a Unicode en-dash (hex 2013). Your code is receiving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E2 80 93 in UTF-8 encoding.
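A quick Java check of that byte sequence (purely illustrative; the Erlang shell session above already shows the split working):

import java.nio.charset.StandardCharsets;

public class EnDashBytes {
    public static void main(String[] args) {
        String enDash = "\u2013"; // the en-dash from the input string
        for (byte b : enDash.getBytes(StandardCharsets.UTF_8))
            System.out.printf("%02X ", b & 0xFF); // prints: E2 80 93
        System.out.println();
    }
}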
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.