UTF-32 encoding in Erlang - erlang

I want to create an application with wxErlang in which I need to use UTF-32 strings. I can load source code from a file with UTF-8 encoding, but I get errors when the file is converted to UTF-32. I need to use Cyrillic characters in my application, which is why I want to solve this problem with UTF-32 encoding.

If you look at the Unicode usage in Erlang page, you'll see that the current release of Erlang, R16A, supports UTF-8 source files. UTF-32 is not supported. However, if you want Cyrillic, UTF-8 can encode every character that UTF-32 can.
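As a quick illustration of that last point (in Python rather than Erlang, and with a made-up sample word), the same Cyrillic text survives a round trip through either encoding; only the byte counts differ:

```python
# Illustration only: the sample word is an assumption, not taken from the question.
text = "Привет"

utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32-be")  # big-endian, no BOM

# Both encodings decode back to exactly the same code points.
same_text = utf8_bytes.decode("utf-8") == utf32_bytes.decode("utf-32-be") == text

# Basic Cyrillic letters take 2 bytes each in UTF-8 and 4 bytes each in UTF-32.
print(same_text, len(utf8_bytes), len(utf32_bytes))
```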

Related

Japanese encoding JIS_X_0208 codepage in python and C++

I am trying to encode and decode Japanese characters that are encoded in JIS_X_0208.
In Python I use this command to encode my string from UTF-8 to Japanese characters
string.decode('utf8').encode('iso2022_jp')
to encode the kanji properly
I decode it in C++ with this line to UTF-16
MultiByteToWideChar(932, 0, &s[0], s.size(), &unicodeBuffer[0], s.size());
All the kanji are properly encoded/decoded.
But the problem is that it is not compliant with JIS_X_0208. I should point out that the use of JIS_X_0208 is mandatory and I can't change it.
For instance, Roman characters are supposed to be encoded in two bytes, the first of which is 0x23; for example, the letter T should be encoded as 0x23 0x54 (according to both the JIS_X_0208 Wikipedia page and the sample I was given as an example).
I guess the only issue I have is to find the correct codepage for the encoding, but I can't find the one I need.
Does anyone know what the correct codepage is, or at least where I can find the available codepage for C++ and python on Windows?
Thank you in advance.
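The two-byte form described in the question can be reproduced with a small Python 3 sketch. It rests on an assumption that the question does not confirm: that the two-byte Roman letters are the full-width Unicode forms (for example U+FF34 for T), which Python's `iso2022_jp` codec places in JIS X 0208. The helper `to_fullwidth` is hypothetical and only handles ASCII letters and digits.

```python
import string

ASCII_ALNUM = set(string.ascii_letters + string.digits)

def to_fullwidth(text):
    """Map ASCII letters and digits to their full-width Unicode forms (offset 0xFEE0)."""
    return "".join(
        chr(ord(ch) + 0xFEE0) if ch in ASCII_ALNUM else ch
        for ch in text
    )

plain = "T".encode("iso2022_jp")                # stays single-byte ASCII
wide = to_fullwidth("T").encode("iso2022_jp")   # two JIS X 0208 bytes inside escape sequences

print(plain, wide)
```

The encoded result still carries the ISO-2022-JP escape sequences around the 0x23 0x54 pair, so whether that matches the required format depends on what the consumer of the data expects.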

iOS write to CSV file: which encoding to use

In my iOS app, I have a feature that writes data to a CSV file. This works fine in most cases with the following:
[csvString writeToFile: filePath atomically:YES encoding: NSUTF8StringEncoding error:&error];
I recently got an email from a Japanese user that the CSV file exported has weird symbols instead of Japanese characters. So I switched to using NSUTF16StringEncoding and it seems to work fine for Japanese characters as well.
So the question is: is it better to use NSUTF16StringEncoding, or are there any drawbacks to doing this? It seems that other examples I've seen for writing to CSV files (including CHCSVParser) use
NSUTF8StringEncoding, so I'm not sure which one to prefer.
Thanks.
There's no "better" encoding.
UTF-8 uses a variable number of bytes per character, from 1 to 4. UTF-16 uses 2 bytes for most characters and 4 bytes (a surrogate pair) for characters outside the Basic Multilingual Plane. Which is best is really up to you and your business. In theory, if your users are mostly based in Asia and primarily use non-ASCII characters, files encoded in UTF-16 are smaller, because most CJK characters take 3 bytes in UTF-8 but 2 in UTF-16. If your users primarily live in the Western world and use Latin-based alphabets, UTF-8 makes files up to 50% smaller.
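The size trade-off can be checked directly. This is a small Python sketch with made-up sample strings, not a measurement of real CSV exports:

```python
ascii_text = "name,price"         # hypothetical ASCII-only sample
japanese_text = "日本語のテキスト"  # hypothetical Japanese sample

sizes = {}
for label, sample in (("ascii", ascii_text), ("japanese", japanese_text)):
    utf8_size = len(sample.encode("utf-8"))
    utf16_size = len(sample.encode("utf-16-le"))  # "-le" variant writes no BOM
    sizes[label] = (utf8_size, utf16_size)
    print(label, utf8_size, utf16_size)
```

For the ASCII sample the UTF-8 form is half the size of the UTF-16 form; for the Japanese sample the relationship flips.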
I believe your problem is not with the choice of the encoding, but rather with the presentation. Text editors cannot guess the encoding of a file, so it's possible that your Japanese user was using a text editor that defaults to UTF-16, and thus was unable to represent UTF-8 character sequences correctly.
The solution to this problem is to use a BOM sequence, as per this SO answer: https://stackoverflow.com/a/2585194/192024 (in short: just add the 3 bytes EF BB BF at the beginning of the file to tell editors that it is UTF-8)
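To show what the BOM looks like on disk, here is a minimal Python sketch (not the Objective-C code from the question; the CSV content and file name are made up). Python's `utf-8-sig` codec writes the three BOM bytes before the text:

```python
import os
import tempfile

csv_text = "名前,価格\nりんご,100\n"  # hypothetical CSV content

path = os.path.join(tempfile.mkdtemp(), "export.csv")

# "utf-8-sig" prepends the UTF-8 BOM when writing.
# newline="" keeps the line endings exactly as given.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write(csv_text)

with open(path, "rb") as f:
    raw = f.read()

print(raw[:3])  # the BOM bytes
```

Everything after the first three bytes is ordinary UTF-8, so a reader that does not know about the BOM only sees one extra invisible character at the start.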

Converting special characters in TStringList

I am using Delphi 7 and have a routine which takes a csv file with a series of records and imports them. This is done by loading it into a TStringList with MyStringList.LoadFromFile(csvfile) and then getting each line with line = MyStringList[i].
This has always worked fine but I have now discovered that special characters are not picked up correctly. For example, Rue François Coppée comes out as Rue FranÃ§ois CoppÃ©e - the accented French characters are the problem.
Is there a simple way to solve this?
Your file is encoded as UTF-8. For instance consider the ç. As you can see from the link, this is encoded in UTF-8 as 0xC3 0xA7. And in Windows-1252, 0xC3 encodes Ã and 0xA7 encodes §.
Whether or not you can handle this easily using your ANSI Delphi depends on the prevailing code page under which your program runs.
If you are using Windows 1252 then you will be fine. You just need to decode the UTF-8 encoded text with a call to UTF8Decode.
If you are using a different locale then life gets more difficult. Those characters may not be present in your locale's character set and in that case you cannot represent them in a Delphi string variable which is encoded using the prevailing ANSI charset. If this is the case then you need to use Unicode.
If you care about handling international text then you need to either:
Upgrade to a modern Delphi which has Unicode support, or
Stick to Delphi 7 and use WideString and the TNT Unicode components.
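The misreading described above is easy to reproduce. This Python sketch (not Delphi) encodes the street name as UTF-8, decodes the bytes as Windows-1252 the way an ANSI program would, and then reverses the mistake:

```python
original = "Rue François Coppée"

utf8_bytes = original.encode("utf-8")

# What a program sees if it treats the UTF-8 bytes as Windows-1252 text.
misread = utf8_bytes.decode("cp1252")

# Undo the damage: recover the raw bytes, then decode them as UTF-8.
repaired = misread.encode("cp1252").decode("utf-8")

print(misread)
print(repaired)
```

The repair step only works while every misread character exists in the code page being used, which is the limitation the answer points out for other locales.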
Probably it's not in UTF8 encoding. Try to convert it:
Text := UTF8Encode(Text);
Regards,

Can anyone tell me how to convert UTF-8 value to UCS-2 value in Objective-c?

I am trying to convert UTF-8 string into UCS-2 string.
I need to get string like "\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875".
I have googled for about a month by now, but still there is no reference about converting UTF-8 to UCS-2.
Please someone help me.
Thx in advance.
EDIT: okay, maybe my explanation was not good enough. Here is what I am trying to do.
I live in Korea, and I am trying to send an SMS message using CTMessageCenter. I tried to send Simplified Chinese characters through my app, and I get ???? instead of the proper characters. So I tried UTF-8 and UTF-16, both BE and LE, but they all return ??. Finally I found out that SMS uses UCS-2 and EUC-KR encoding in Korea. Weird, isn't it?
Anyway I tried to send string like \u4E3B\u9875 and it worked.
So I need to convert string into UCS-2 encoding first and get the string literal from those strings.
Wikipedia:
The older UCS-2 (2-byte Universal Character Set) is a similar
character encoding that was superseded by UTF-16 in version 2.0 of the
Unicode standard in July 1996. It produces a fixed-length format
by simply using the code point as the 16-bit code unit and produces
exactly the same result as UTF-16 for 96.9% of all the code points in
the range 0-0xFFFF, including all characters that had been assigned a
value at that time.
IBM:
Since the UCS-2 standard is limited to 65,535 characters, and the data
processing industry needs over 94,000 characters, the UCS-2 standard
is in the process of being superseded by the Unicode UTF-16 standard.
However, because UTF-16 is a superset of the existing UCS-2 standard,
you can develop your applications using the systems existing UCS-2
support as long as your applications treat the UCS-2 as if it were
UTF-16.
unicode.org:
UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2
are identical for purposes of data exchange. Both are 16-bit, and have
exactly the same code unit representation.
So, using the "UTF8toUnicode" transformation in most language libraries will produce UTF-16, which is essentially UCS-2. And simply extracting the 16-bit characters from an Objective-C string will accomplish the same thing.
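As a rough sketch of "extracting the 16-bit characters", here is the idea in Python rather than Objective-C. The function name `to_ucs2_escapes` is hypothetical; it encodes the text as UTF-16 (big-endian, no BOM) and prints one \uXXXX escape per 16-bit code unit, rejecting anything outside the Basic Multilingual Plane since UCS-2 has no surrogate pairs:

```python
def to_ucs2_escapes(text):
    r"""Return text as \uXXXX escapes, one per 16-bit code unit."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("UCS-2 cannot represent characters outside the BMP")
    data = text.encode("utf-16-be")
    # Each pair of bytes is one 16-bit code unit.
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join("\\u%04X" % unit for unit in units)

# The last two characters of the string in the question.
print(to_ucs2_escapes("\u4E3B\u9875"))
```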
In other words, the solution has been staring you in the face all along.
UCS-2 is not a valid Unicode encoding. UTF-8 is.
It is therefore impossible to convert UTF-8 into UCS-2 — and indeed, also the reverse.
UCS-2 is dead, ancient history. Let it rot in peace.

ReadLn working with WideString (utf-8 files)

I use delphi 7.
I need to read a utf-8 file line by line, each line contain a word and its weight (a number)
So I need to read every next line, then divide a line by a separator (tab char) and save this in memory.
So,
1) is there a library to work with utf-8 files in Delphi (3-rd party maybe)
2) will functions operate ok with widestring? I use PosEx. So, if they won't, can you also give a link to 3-rd party library to work with widestrings?
If it is really UTF-8 that you are dealing with, then you should not need anything special as far as reading and processing them. You should be able to treat them as pchar or even as a normal Delphi 7 string. If you try to show the contents in some kind of message box, then you may need to do some conversions. For example, I don't believe the Delphi 7 message box method would display UTF-8 strings correctly if the string contained any byte values over 127 (0x7f). For something like that, you would need to convert to UTF-16 and call the Windows API MessageBoxW or something similar. Otherwise, though, UTF-8 strings can be treated in many situations the same as single byte ANSI strings.
I don't think UTF-8 is typically referred to as "widestring". I might be wrong, but I think that typically means UTF-16.
If your file is encoded as UTF-8, and the characters you're looking for are ASCII, then there's no need to use WideString at all. ASCII is a subset of UTF-8, and any ASCII character is guaranteed not to interfere with the special encoding used for other characters in UTF-8. The number characters 0 through 9 and the tab character are all ASCII.
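A minimal Python sketch of that point (not Delphi; the sample line and the `parse_line` helper are made up): the tab can be located in the raw UTF-8 bytes without decoding the word first, because no byte of a multi-byte sequence can equal an ASCII byte.

```python
def parse_line(raw_line):
    """Split one UTF-8 encoded 'word<TAB>weight' line at the byte level."""
    word_bytes, weight_bytes = raw_line.rstrip(b"\r\n").split(b"\t", 1)
    # Only now decode the pieces; the weight is plain ASCII digits.
    return word_bytes.decode("utf-8"), int(weight_bytes.decode("ascii"))

sample = "слово\t42\n".encode("utf-8")  # hypothetical input line
print(parse_line(sample))
```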
The JCL comes with various functions and classes for dealing with Unicode, if you find you really need to use them.
If most of your input is UTF-8, it might be worthwhile to change your codepage on startup from the "default" to utf8 (codepage 65001). This will make all ansistring->widestring conversions effectively become a lossless utf-8->utf-16.
With D7, you will need a set of so called "unicode" components, components that base themselves on the winapi -W functions. Delphi's own components only do this with the watershed D2009 release that switches the default string type to UTF-16.
If you want to heavily invest in Unicode support, upgrading might be a smart thing to do
WideString is a UTF-16 implementation (a COM BSTR compatible one); it can't store UTF-8 strings. If you assign an 8-bit string, it will be converted to UTF-16, but unless you explicitly use the proper conversion function, Delphi will interpret the 8-bit string using the current codepage.
A UTF-8 string can be stored in a Delphi AnsiString (the default string type in Delphi 7), but string manipulation functions are designed for ANSI codepages, not UTF-8. The difference is that UTF-8 is a multi-byte character set: except for the first 128 (ASCII) characters, more than one byte is needed to encode a given "character", while many ANSI codepages (especially those for European languages) only require one byte per character and so encode at most 256 "characters" (while UTF-8 can encode the whole Unicode set).
If you're just looking for the tab character AFAIK you could use simply an AnsiString, but you have to ensure that any byte above $80 you may need to look for is not part of a multibyte sequence. If you have more complex processing needs, it may be easier to find libraries working on UTF-16 strings than UTF-8. As Rob Kennedy said, JCL is a good starting point as a free library implementing UTF string manipulation.
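The difference between byte length and character count, and the role of bytes at or above $80, can be seen in a short Python sketch (the sample word is made up, and Windows-1252 stands in for "an ANSI codepage"):

```python
word = "Coppée"  # hypothetical sample

as_cp1252 = word.encode("cp1252")  # one byte per character
as_utf8 = word.encode("utf-8")     # é takes two bytes

print(len(word), len(as_cp1252), len(as_utf8))

# Every byte of a multi-byte UTF-8 sequence is >= 0x80,
# so these bytes can never be mistaken for ASCII such as tab or digits.
high_bytes = [b for b in as_utf8 if b >= 0x80]
print([hex(b) for b in high_bytes])
```

This is why length and indexing functions written for single-byte codepages report byte positions, not character positions, when handed UTF-8.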
You could simply read the file as-is into a normal TStringList via its LoadFrom...() methods, then loop through the list as needed. If loading the entire file into memory at one time is not an option, then you can open the file using a TFileStream and then use the TStreamReader.ReadLine() method to read the stream line-by-line.
If you need to decode a given UTF-8 sequence to UTF-16 for processing, then I would suggest using the Win32 API MultiByteToWideChar() function directly, only because the RTL's UTF8Decode() function has a broken UTF-8 implementation in older Delphi versions (not sure about D7, but it definitely does in D6).
The nice thing about either loading approach is that they are both encoding-aware in D2009 and later, which means that if you ever upgrade, you can make a couple of very small code changes to tell the RTL that the data is UTF-8, and it will decode it to UTF-16 for you automatically, and then the rest of your processing code can remain the same (assuming you are not doing anything that is Ansi-specific).
