What are the more-than-four-byte characters in UTF-16, and what is the range of codes in UTF-16? - character-encoding

I need to know which characters take more than four bytes in UTF-16, and what the range of codes in UTF-16 is.
I have looked on the internet and here without success; does anyone have some material to share?
Thank you very much in advance.
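For reference, UTF-16 represents code points U+0000 through U+FFFF as one 16-bit code unit (two bytes) and code points U+10000 through U+10FFFF as a surrogate pair (four bytes), so no character ever takes more than four bytes. A minimal sketch of the encoding rule (illustrative Python, not from the thread):

```python
# Sketch of how UTF-16 maps a code point to 16-bit code units,
# using the standard surrogate-pair arithmetic.
def utf16_units(cp):
    """Return the list of 16-bit code units UTF-16 uses for a code point."""
    if cp <= 0xFFFF:
        return [cp]                       # BMP: one unit, 2 bytes
    v = cp - 0x10000                      # supplementary plane: 4 bytes total
    return [0xD800 + (v >> 10),           # high (lead) surrogate
            0xDC00 + (v & 0x3FF)]         # low (trail) surrogate

print([hex(u) for u in utf16_units(0x0041)])   # 'A'
print([hex(u) for u in utf16_units(0x1F600)])  # an emoji outside the BMP
```

Even the highest code point, U+10FFFF, still fits in a single surrogate pair, which is why four bytes is the upper bound.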

Related

How do I get a number from bytes?

I am currently trying to work with Lua 5.1 bytecode. I've gotten pretty far and understand a lot. However, I am stuck on a question about instructions and numbers. I understand that the sizes of the instruction and number types are defined in the header, but I am not sure how to get the actual number from the 4 bytes (or whatever size is specified in the header).
I've looked at output from ChunkSpy and I don't really understand how it went from those bytes to the number. I'd look in the source, but I don't want to just copy it; I want to understand it. If anyone could tell me a bit about it or even point me in the right direction I'd be very grateful.
Thank you!
From A No-Frills Introduction to Lua 5.1 VM Instructions, numbers are stored in the constants pool.
The first byte is 3=LUA_TNUMBER.
The next bytes are the number, with the length as given in the header. Interpretation is based on the length, byte order and the integral flag as given in the header.
Typically, non-integral with 8 bytes means IEEE 754 64-bit double.
Deserializing bytes to double involves extracting the bits for the mantissa and exponent, and combining them with arithmetic operations. Perhaps you want that as a challenge and to start from a description of the standard: What Every Computer Scientist Should Know About Floating-Point Arithmetic, "Formats and Operations" section.
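As a cross-check of the manual route described above, a standard-library unpack gives the same result; a sketch assuming the chunk header declares little-endian, 8-byte, non-integral numbers (Python used purely for illustration):

```python
import struct

# Stand-in for the raw bytes of a number constant read from the chunk;
# assumes little-endian IEEE 754 64-bit doubles, as stated in the header.
raw = struct.pack('<d', 3.14)
value, = struct.unpack('<d', raw)   # '<d' = little-endian double
print(value)

# The manual route: extract sign, exponent, and mantissa bits, then
# combine them arithmetically (valid for normal, non-zero doubles).
bits = int.from_bytes(raw, 'little')
sign = bits >> 63
exponent = (bits >> 52) & 0x7FF
mantissa = bits & ((1 << 52) - 1)
manual = (-1) ** sign * (1 + mantissa / 2**52) * 2 ** (exponent - 1023)
print(manual)
```

Both routes print the same number, which is a useful sanity check when you implement the bit extraction yourself.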

Could someone please explain how integers in the form of 0x0 etc work?

As the title says, could someone please explain how integers work, like those seen in memory addresses? Any answers would be greatly appreciated; for context, I want to be able to view, edit, and delete data stored in a process's allocated memory.
Thanks :D
The 0x prefix means that the number is in hexadecimal format. These numbers are mostly used for memory location addresses or read/write data packet formats. This link might help.
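Hexadecimal is only a notation for the same integers, base 16 instead of base 10; a quick illustration (Python, with hypothetical address values):

```python
# 0x-prefixed literals are ordinary integers written in base 16.
addr = 0x7FFF
print(addr)                  # the same value in decimal: 32767
print(hex(32767))            # back to hex notation: '0x7fff'
print(0x00 == 0)             # True: 0x00 is simply zero

# Parsing an address string such as one copied from a debugger:
print(int('0x400000', 16))   # base-16 string to integer
```

Once parsed, the value can be compared, added to, or printed in any base; the "0x" form only affects how it is written, not what it is.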

difference between characters online and on localhost

The question is, I hope, simple for someone who knows about character encoding.
This is my site.
http://www.football-tennis-stats.com/index.php/stats/display/tennis
Online, the character set seems wrong: I get this weird Â, while on localhost everything is all right.
I know there is a lot of good reading to be done on this subject, but I don't even know where to start.
There does not seem to be any character encoding issue, just spurious data, namely bytes 0xC3 0x82, which represent the character Â when interpreted in UTF-8, which is the declared encoding. Otherwise, the content seems to be all ASCII, because the names are in “internationalized”, i.e. anglicized form, e.g. Djokovic instead of Đoković, Soderling instead of Söderling etc. With this data, it does not matter much how you declare its encoding, since ASCII characters mostly have the same representation anyway.
I have no idea where the bytes come from, but they seem to appear systematically between a comma and a space, so it’s apparently something in the code that generates the table.

Can anyone tell me how to convert UTF-8 value to UCS-2 value in Objective-c?

I am trying to convert UTF-8 string into UCS-2 string.
I need to get string like "\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875".
I have googled for about a month now, but there is still no reference on converting UTF-8 to UCS-2.
Please someone help me.
Thx in advance.
EDIT: okay, maybe my explanation was not good enough. Here is what I am trying to do.
I live in Korea, and I am trying to send an SMS message using CTMessageCenter. I tried to send simplified Chinese characters through my app, and I get ???? instead of the proper characters. So I tried UTF-8, UTF-16, BE and LE as well, but they all return ??. Finally I found out that SMS in Korea uses UCS-2 and EUC-KR encoding. Weird, isn't it?
Anyway I tried to send string like \u4E3B\u9875 and it worked.
So I need to convert string into UCS-2 encoding first and get the string literal from those strings.
Wikipedia:
The older UCS-2 (2-byte Universal Character Set) is a similar
character encoding that was superseded by UTF-16 in version 2.0 of the
Unicode standard in July 1996. It produces a fixed-length format
by simply using the code point as the 16-bit code unit and produces
exactly the same result as UTF-16 for 96.9% of all the code points in
the range 0-0xFFFF, including all characters that had been assigned a
value at that time.
IBM:
Since the UCS-2 standard is limited to 65,535 characters, and the data
processing industry needs over 94,000 characters, the UCS-2 standard
is in the process of being superseded by the Unicode UTF-16 standard.
However, because UTF-16 is a superset of the existing UCS-2 standard,
you can develop your applications using the system's existing UCS-2
support as long as your applications treat the UCS-2 as if it were
UTF-16.
unicode.org:
UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2
are identical for purposes of data exchange. Both are 16-bit, and have
exactly the same code unit representation.
So, using the "UTF8toUnicode" transformation in most language libraries will produce UTF-16, which is essentially UCS-2. And simply extracting the 16-bit characters from an Objective-C string will accomplish the same thing.
In other words, the solution has been staring you in the face all along.
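Extracting the 16-bit code units is language-neutral; a sketch in Python using the asker's own string, where `utf-16-be` output for BMP-only text is byte-identical to UCS-2:

```python
# The asker's target string, all BMP characters.
s = '\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875'

# Big-endian UTF-16 without BOM; for BMP-only text this equals UCS-2.
units = s.encode('utf-16-be')

# Re-express each 16-bit code unit as a \uXXXX escape literal.
escapes = ''.join('\\u%04X' % int.from_bytes(units[i:i + 2], 'big')
                  for i in range(0, len(units), 2))
print(escapes)   # \uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875
```

The equivalent step in Objective-C is walking the string with `characterAtIndex:`, which already yields 16-bit `unichar` code units.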
UCS-2 cannot represent code points above U+FFFF, while UTF-8 can encode all of Unicode.
A general conversion from UTF-8 to UCS-2 is therefore lossy for supplementary characters, although BMP-only text converts cleanly.
UCS-2 is dead, ancient history. Let it rot in peace.

What does 'lew' stand for in 'lew2' or 'lew4'?

I'm seeing the term 'lew2' and 'lew4' being used in reference to character size in certain files. I know that the number represents how many bytes are used to store certain types of characters (maybe wide chars?), but I'm not sure what the 'lew' part stands for. My best guess is 'length of wide'. Can anyone enlighten me?
My guess would be Little Endian Word 2 Bytes (or 4 Bytes), as opposed to Big Endian.
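Under that reading, the difference is just the byte order of a fixed-size word; a quick illustration of the two layouts (Python, with hypothetical values, and the expansion of "lew" remains a guess):

```python
import struct

# The value 0x0102 laid out as a 2-byte word in each byte order.
print(struct.pack('<H', 0x0102))      # little-endian: low byte first
print(struct.pack('>H', 0x0102))      # big-endian: high byte first

# A 4-byte little-endian word, the 'lew4' case under this reading.
print(struct.pack('<I', 0x01020304))
```

If the files in question store character codes this way, "lew2" vs. "lew4" would then distinguish 16-bit from 32-bit little-endian units.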
