How to create UTF-16 animation in Twitter?

I use a UTF-16 character picker to create ASCII art in a textbox in HTML, where UTF-16 characters are supported and displayed "as is". Now I need to process such ASCII art, save it into an Array as UTF-16 characters, and process it with JavaScript as Strings to build ASCII art animations for Twitter.
You don't have to be sorry.
Twitter accepts UTF-16 as ASCII art.
For the UTF-16 definition, go to Wikipedia:
http://en.wikipedia.org/wiki/UTF-16
UTF-16 (16-bit Unicode Transformation Format) is a character encoding for Unicode capable of encoding 1,112,064 numbers (called code points) in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.
I have already built a 2-byte Unicode (UTF-16) picker and can generate UTF-16 input for Twitter.
==
re:
Removed the link as it's pointing to a Twitter account which doesn't show the mentioned content anymore (w/o scrolling). May appear like spam then. – david Nov 20 at 4:09
That way it may take much longer to get the right answer.

UTF-16 is a character encoding. Twitter only accepts UTF-8 as input. You can convert UTF-16 to UTF-8 without any data loss, so just do that and then send it to Twitter.
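The conversion mentioned above can be sketched in a few lines. This is a minimal illustration in Python rather than JavaScript; the function name `utf16_to_utf8` and the sample frames are assumptions made for the example, and nothing here talks to Twitter itself.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes as UTF-8 without losing any characters."""
    # The "utf-16" codec honours a leading byte order mark (BOM) if one is
    # present; without a BOM it assumes the platform's native byte order.
    text = data.decode("utf-16")
    return text.encode("utf-8")


# Hypothetical animation frames, stored as UTF-16 bytes (with BOM).
frames_utf16 = [frame.encode("utf-16") for frame in ("☃ ♥", "♥ ☃")]
frames_utf8 = [utf16_to_utf8(frame) for frame in frames_utf16]
```

Decoding to a string first and then encoding again is what makes the conversion lossless: both encodings cover the same set of code points.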

Related

Delphi decoded base64 to something

I am stuck a bit in decoding. I got a base64-encoded .rtf file.
A little part of this looks like this: Bek\u252\''fcld\u337\''3f
Which represents: Beküldő
But my output data after decoding is: Bekuld?
If I manually replace the characters it works.
StringReplace(Result, 'U337\''3F', '''F5', [rfReplaceAll, rfIgnoreCase]);
Does anyone know a general solution for this? Some conversion or something?
For instance, \u242 means Unicode character #242.
So you could search for \u in the RTF content (ignoring any \\ escaped sequence), then retrieve the following number, and use it as a character.
But RTF is a very complex beast.
Check what the RTF 1.5 specification says about encoding:
\uN This keyword represents a single Unicode character which has no
equivalent ANSI representation based on the current ANSI code page. N
represents the Unicode character value expressed as a decimal number.
This keyword is followed immediately by equivalent character(s) in
ANSI representation. In this way, old readers will ignore the \uN
keyword and pick up the ANSI representation properly. When this
keyword is encountered, the reader should ignore the next N
characters, where N corresponds to the last \ucN value encountered.
Perhaps the easiest is to use a hidden RichEdit for decoding, under Windows/VCL.
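The search-and-replace idea for \uN can be sketched as follows. This is a rough illustration in Python rather than Delphi, and it rests on several assumptions: the default \uc1 is in effect (exactly one fallback character follows each \uN), the fallback is either a \'hh escape or a single plain character, and escaped backslashes are not handled. The name `decode_rtf_unicode` is made up for the example; a real RTF reader has to deal with far more than this.

```python
import re

# \uN, an optional delimiting space, then one ANSI fallback to skip:
# either a \'hh hex escape or a single ordinary character.
_U_ESCAPE = re.compile(r"\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])?")


def decode_rtf_unicode(fragment: str) -> str:
    """Replace each \\uN escape (plus its fallback) with the character N."""

    def replace(match):
        value = int(match.group(1))
        # RTF writes values above 32767 as negative 16-bit numbers.
        if value < 0:
            value += 65536
        return chr(value)

    return _U_ESCAPE.sub(replace, fragment)


print(decode_rtf_unicode(r"Bek\u252\'fcld\u337\'3f"))
```

Characters outside the Basic Multilingual Plane are written in RTF as two \uN values (a surrogate pair); this sketch does not recombine them.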

Why/how does the browser decide ☃.net goes to xn--n3h.net

If we type this into Firefox or Chrome:
http://☃.net/
It takes us to
http://xn--n3h.net/
Which is a mirror of unicodesnowmanforyou.com
What I don't understand is by what rules the Unicode snowman can be converted to xn--n3h; it doesn't look anything like UTF-8 or URL encoding.
I think I found a hint while mucking around in Python 3, because:
>>> '☃'.encode('punycode')
b'n3h'
But I still don't understand the xn-- part. How are domain names internationalised, what is the standard and where is this stuff documented?
It uses an encoding scheme called Punycode (as you've already discovered from the Python testing you've done), capable of representing Unicode characters in ASCII-only format.
Each label (delimited by dots, so get.me.a.coffee.com has five labels) that contains Unicode characters is encoded in Punycode and prefixed with the string xn--.
The label encoding first copies all the ASCII characters, then appends the encoded Unicode characters. The Unicode characters are always after the final - in the label, so one is added after the ASCII characters if needed.
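More detail can be found in this page over at the w3 site, and in RFC 3987 (which covers internationalized resource identifiers). Punycode itself is specified in RFC 3492, and its use in domain names (IDNA) in RFC 3490 and its successors, RFC 5890 through RFC 5894. For details on how Punycode actually encodes labels, see the Wikipedia page.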
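The per-label rule described above can be tried out with Python's built-in `punycode` codec. This is only a sketch: the helper names `to_ascii_label`, `from_ascii_label` and `to_ascii_domain` are made up for the example, and the normalization and validity checks that real IDNA processing performs before encoding are deliberately skipped.

```python
def to_ascii_label(label: str) -> str:
    """Encode one label: pure-ASCII labels pass through unchanged."""
    try:
        label.encode("ascii")
        return label
    except UnicodeEncodeError:
        # Non-ASCII label: Punycode-encode it and add the xn-- prefix.
        return "xn--" + label.encode("punycode").decode("ascii")


def from_ascii_label(label: str) -> str:
    """Reverse of to_ascii_label for labels carrying the xn-- prefix."""
    if label.startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def to_ascii_domain(domain: str) -> str:
    """Apply the label rule to every dot-separated label."""
    return ".".join(to_ascii_label(part) for part in domain.split("."))


print(to_ascii_domain("☃.net"))    # xn--n3h.net
print(to_ascii_label("bücher"))    # xn--bcher-kva
```

The second call shows the layout inside a label: the ASCII characters (`bcher`) are copied first, then a `-`, then the encoded non-ASCII part (`kva`).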

What is the relationship between unicode/utf-8/utf-16 and my local encode GBK?

I've noticed that a text file created on Windows (Chinese version) turns into garbled text when moved to Ubuntu.
After more research, I learned that the default encoding on the Chinese version of Windows is GBK, while on Ubuntu it is UTF-8, and that iconv can convert between encodings, for example from GBK to UTF-8:
iconv -f gbk -t utf-8 input.txt > output.txt
But I am still confused about the relationship between these encodings. What are they? What are the similarities and differences between them?
First, it is not about the OS, but about the program you are using to read the file.
With a bare .txt file, the program has to guess the encoding, which is not always possible, but might work. In an HTML file, the encoding is declared as metadata, so browsers don't need to guess.
Second, do you know ASCII? Do you see how it represents symbols via numbers? If not, this is the first thing you should learn now.
Next, do you see the difference between Unicode and UTF-XXX? It must be clear to you that Unicode is just a map from numbers (code points) to characters (symbols, including Chinese characters, ASCII characters, Egyptian characters, etc.).
UTF-XXX, on the other hand, says which Unicode numbers (code points) a given string of bytes represents. Therefore, UTF-8 and UTF-16 are different ways to represent Unicode as bytes.
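To make the distinction concrete, here is a small Python sketch that takes one character, prints its code point, and then shows the different byte sequences that three encodings produce for it. The helper name `encoded_forms` is made up for this example.

```python
def encoded_forms(ch: str) -> dict:
    """Return the bytes (as hex) that each encoding uses for one character."""
    return {name: ch.encode(name).hex() for name in ("utf-8", "utf-16-be", "gbk")}


ch = "中"
print(hex(ord(ch)))       # the code point: 0x4e2d
print(encoded_forms(ch))  # {'utf-8': 'e4b8ad', 'utf-16-be': '4e2d', 'gbk': 'd6d0'}
```

The code point is the same in every case; only the bytes differ. Reading the GBK bytes as if they were UTF-8 (or the other way round) is what produces garbled text.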
As you may imagine, unlike ASCII, both UTF and GBK must allow more than one byte per character, since there are many more than 256 of them.
In GBK all characters are encoded as 1 or 2 bytes.
Since GBK is specialized for Chinese, it represents a given Chinese text in fewer bytes than UTF-8 and in no more bytes than UTF-16, but it cannot represent most other scripts at all.
In UTF-8 and UTF-16, the number of bytes per character is variable, so you have to look at how many bytes are used for the Chinese code points.
In Unicode, Chinese glyphs are on the following ranges. Then you have to look at how efficiently UTF-8 and UTF-16 represent those ranges.
According to the Wikipedia articles on UTF-8 and UTF-16, the first and most common range for Chinese characters, 4E00-9FFF, is represented in UTF-8 as 3 bytes per character, while in UTF-16 it is represented as 2 bytes. Therefore, if your text is mostly Chinese, UTF-16 might be more compact. You also have to look into the other ranges to see how many bytes per character are used.
For portability, the best choice is UTF, since UTF can represent practically any character from any character set, so it is more likely that viewers will have been programmed to decode it correctly. The size gain of GBK is not that large.
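The size comparison is easy to check directly. The sketch below is in Python; the helper names `encoded_sizes` and `transcode` are made up for this example, and `utf-16-le` is used so that no byte order mark is counted. `transcode` does the same job as the iconv command in the question.

```python
def encoded_sizes(text: str) -> dict:
    """Number of bytes needed for the same text in each encoding."""
    return {name: len(text.encode(name)) for name in ("gbk", "utf-8", "utf-16-le")}


def transcode(data: bytes, src: str = "gbk", dst: str = "utf-8") -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(src).encode(dst)


print(encoded_sizes("我的上网主页"))  # {'gbk': 12, 'utf-8': 18, 'utf-16-le': 12}
print(encoded_sizes("hello"))         # {'gbk': 5, 'utf-8': 5, 'utf-16-le': 10}
```

For pure Chinese text GBK and UTF-16 tie at 2 bytes per character and UTF-8 needs 3; for ASCII text UTF-16 is the one that doubles in size.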

Can anyone tell me how to convert UTF-8 value to UCS-2 value in Objective-C?

I am trying to convert UTF-8 string into UCS-2 string.
I need to get string like "\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875".
I have googled for about a month by now, but still there is no reference about converting UTF-8 to UCS-2.
Please someone help me.
Thx in advance.
EDIT: okay, maybe my explanation was not good enough. Here is what I am trying to do.
I live in Korea, and I am trying to send an SMS message using CTMessageCenter. I tried to send Simplified Chinese characters through my app, and I get ???? instead of the proper characters. So I tried UTF-8 and UTF-16, both BE and LE, as well. But they all return ??. Finally I found out that SMS uses UCS-2 and EUC-KR encoding in Korea. Weird, isn't it?
Anyway, I tried to send a string like \u4E3B\u9875 and it worked.
So I need to convert the string into UCS-2 encoding first and then get the string literal from it.
Wikipedia:
The older UCS-2 (2-byte Universal Character Set) is a similar
character encoding that was superseded by UTF-16 in version 2.0 of the
Unicode standard in July 1996. It produces a fixed-length format
by simply using the code point as the 16-bit code unit and produces
exactly the same result as UTF-16 for 96.9% of all the code points in
the range 0-0xFFFF, including all characters that had been assigned a
value at that time.
IBM:
Since the UCS-2 standard is limited to 65,535 characters, and the data
processing industry needs over 94,000 characters, the UCS-2 standard
is in the process of being superseded by the Unicode UTF-16 standard.
However, because UTF-16 is a superset of the existing UCS-2 standard,
you can develop your applications using the systems existing UCS-2
support as long as your applications treat the UCS-2 as if it were
UTF-16.
unicode.org:
UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2
are identical for purposes of data exchange. Both are 16-bit, and have
exactly the same code unit representation.
So, using the "UTF8toUnicode" transformation in most language libraries will produce UTF-16, which is essentially UCS-2. And simply extracting the 16-bit characters from an Objective-C string will accomplish the same thing.
In other words, the solution has been staring you in the face all along.
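As an illustration of "extracting the 16-bit characters", here is a sketch in Python rather than Objective-C. It encodes the text as UTF-16 and prints each 16-bit code unit as a \uXXXX escape, which is the string form the question asks for. The function name `to_u_escapes` is made up for this example.

```python
import struct


def to_u_escapes(text: str) -> str:
    """Spell out every UTF-16 code unit of text as a \\uXXXX escape."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    units = struct.unpack(">%dH" % (len(data) // 2), data)
    return "".join("\\u%04X" % unit for unit in units)


print(to_u_escapes("－－我的上网主页"))
# \uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875
```

For characters in the range 0 to 0xFFFF the output is identical to what UCS-2 would give. A character above that range comes out as two escapes (a surrogate pair), which UCS-2 has no way to express.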
UCS-2 is not a valid Unicode encoding. UTF-8 is.
It is therefore impossible to convert UTF-8 into UCS-2 — and indeed, also the reverse.
UCS-2 is dead, ancient history. Let it rot in peace.

Percent Encoded UTF-8 to Ascii(8-bit) conversion

I'm reading in URLs, and they often have percent-encoded characters.
Example: %C3%A9 is actually é
According to http://www.microsystools.com/products/sitemap-generator/faq/character-percentage-url-encoding/ , characters in the upper half of 8-bit ASCII (128-255) are encoded as UTF-8, and the resulting bytes are then written as hex. Now, when I get my URL, the %HEX sequences have been re-encoded as 8-bit ASCII, and I need to convert them back to their true 8-bit ASCII values. Is there a function or library I can use, or otherwise, how would I go about the conversion?
I'm using C/C++.
First you need to URLDecode. Not a function available in cross-platform C++, but, luckily for you, not a hard problem. Copy bytes from source to target. Non-% bytes just get copied. When you hit %xx, convert XX from hex chars to binary, and you have your byte.
This gives you a buffer of text in UTF-8. You say you want 'ASCII' -- ISO-646. Then you can't have an accented e. I can think of several possibilities for what you really want:
ISO-8859-1. You can use ICU to convert UTF-8 to ISO-8859-1.
ISO-646. You can also use ICU, and I believe it will make accented chars into their ISO-646 equivalents.
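The byte-copying loop and the two target encodings can be sketched as follows. This is in Python rather than C/C++, purely to show the logic; `percent_decode` is a made-up name, and stripping accents via Unicode decomposition is only one assumed reading of "ISO-646 equivalents", not what ICU necessarily does.

```python
import unicodedata

HEX_DIGITS = "0123456789abcdefABCDEF"


def percent_decode(url: str) -> bytes:
    """Copy characters through; turn each %XX into the byte it names."""
    out = bytearray()
    i = 0
    while i < len(url):
        pair = url[i + 1:i + 3]
        if url[i] == "%" and len(pair) == 2 and all(c in HEX_DIGITS for c in pair):
            out.append(int(pair, 16))
            i += 3
        else:
            out.extend(url[i].encode("utf-8"))
            i += 1
    return bytes(out)


raw = percent_decode("caf%C3%A9")      # UTF-8 bytes: b'caf\xc3\xa9'
text = raw.decode("utf-8")             # 'café'
latin1 = text.encode("iso-8859-1")     # b'caf\xe9', one byte per character
# Assumed "ASCII equivalent": decompose, then drop the combining accents.
ascii_only = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
```

In Python the standard library function `urllib.parse.unquote_to_bytes` performs the same percent-decoding step.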
