iOS write to CSV file: which encoding to use

In my iOS app, I have a feature that writes data to a CSV file. This works fine in most cases with the following:
[csvString writeToFile: filePath atomically:YES encoding: NSUTF8StringEncoding error:&error];
I recently got an email from a Japanese user saying that the exported CSV file shows weird symbols instead of Japanese characters. So I switched to NSUTF16StringEncoding, and it seems to work fine for Japanese characters as well.
So the question is: is it better to use NSUTF16StringEncoding, or are there any drawbacks to doing this? It seems that other examples I've seen for writing to CSV files (including CHCSVParser) use
NSUTF8StringEncoding, so I'm not sure which one to prefer.
Thanks.

There's no single "better" encoding.
UTF-8 uses a variable number of bytes per character, from 1 to 4. UTF-16 uses 2 bytes for most characters (4 for those outside the Basic Multilingual Plane). Which is best is really up to you and your business. In theory, if your users are mostly based in Asia and write primarily non-ASCII text, files encoded in UTF-16 are smaller. If your users mostly write in Latin-based alphabets, UTF-8 makes every file roughly 50% smaller.
I believe your problem is not with the choice of the encoding, but rather with the presentation. Text editors cannot reliably guess the encoding of a file, so it's possible that your Japanese user was opening the file in an application that assumed a different encoding, and thus decoded the UTF-8 byte sequences incorrectly.
The solution to this problem is to use a BOM, as per this SO answer: https://stackoverflow.com/a/2585194/192024 (in short: just add the three-byte UTF-8 BOM, EF BB BF, at the beginning of the file to tell editors which encoding to use).
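As a minimal sketch of the idea (shown in Python for brevity; the file name and sample data are made up, and in the Objective-C code above the equivalent is to prepend the character \uFEFF to csvString before writing it with NSUTF8StringEncoding):
csv_text = "名前,価格\nコーヒー,300\n"    # sample rows containing Japanese text
with open("export.csv", "wb") as f:
    f.write(b"\xef\xbb\xbf")              # the UTF-8 BOM: EF BB BF
    f.write(csv_text.encode("utf-8"))     # the CSV data itself, as UTF-8
With the BOM in place, editors (and Excel) detect the file as UTF-8 instead of falling back to a legacy code page.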

Related

How to detect if a user-selected .txt file is Unicode/UTF-8 format and convert it to ANSI

My non-Unicode Delphi 7 application allows users to open .txt files.
Sometimes users try to open UTF-8/Unicode .txt files, which causes problems.
I need a function that detects whether the user is opening a .txt file with UTF-8 or Unicode encoding, and converts it to the system's default code page (ANSI) encoding automatically when possible so that it can be used by the app.
In cases when converting is not possible, the function should return an error.
The ReturnAsAnsiText(filename) function should open the .txt file and perform detection and conversion in steps like this:
If the byte stream has no byte values over x7F, it's ANSI; return as is
If the byte stream has byte values over x7F, convert from UTF-8
If the stream has a BOM, try Unicode conversion
If conversion to the system's current code page is not possible, return NULL to indicate an error.
It will be an OK limit for this function, that the user can open only those files that match their region/codepage (Control Panel Regional Region Settings for non-Unicode apps).
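For reference, here is a rough sketch of those steps (in Python rather than Delphi, purely to make the logic concrete; the function name is taken from the question, and Python's Windows-only "mbcs" codec stands in for the conversion to the system ANSI code page):
def return_as_ansi_text(filename):
    # Sketch of the steps above: detect the encoding and return ANSI bytes,
    # or None when conversion is not possible.
    with open(filename, "rb") as f:
        data = f.read()
    try:
        if data.startswith(b"\xef\xbb\xbf"):
            text = data[3:].decode("utf-8")        # UTF-8 BOM
        elif data.startswith(b"\xff\xfe"):
            text = data[2:].decode("utf-16-le")    # UTF-16 LE BOM
        elif data.startswith(b"\xfe\xff"):
            text = data[2:].decode("utf-16-be")    # UTF-16 BE BOM
        elif all(b <= 0x7F for b in data):
            return data                            # pure ASCII: return as is
        else:
            text = data.decode("utf-8")            # bytes over x7F: try UTF-8
        return text.encode("mbcs")                 # system ANSI code page
    except (UnicodeDecodeError, UnicodeEncodeError, LookupError):
        return None                                # not convertible
As the answer below explains, this naive approach misclassifies BOM-less UTF-16 and various single-byte code pages.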
The conversion function ReturnAsAnsiText, as you designed it, will have a number of issues:
The Delphi 7 application may not be able to open files whose names contain UTF-8 or UTF-16 characters.
UTF-8 (and other Unicode) usage has increased significantly since 2019. Current web pages are between 98% and 100% UTF-8, depending on the language.
Your design will incorrectly translate some text that a standards-compliant converter would handle.
Creating ReturnAsAnsiText is beyond the scope of an answer, but you should look at locating a library you can use instead of creating a new function. I haven't used Delphi since Delphi 2005 (I believe yours is 7), but I found this MIT-licensed library that may get you there. It has a number of caveats:
It doesn't support all forms of BOM.
It doesn't support all encodings.
There is no universal "best-fit" behavior for single-byte character sets.
There are other issues that are tangentially described in this question. You wouldn't use an external command, but I used one here to demonstrate the point:
% iconv -f utf-8 -t ascii//TRANSLIT < hello.utf8
^h'elloe
iconv: (stdin):1:6: cannot convert
% iconv -f utf-8 -t ascii < hello.utf8
iconv: (stdin):1:0: cannot convert
Enabling TRANSLIT in standards-based libraries supports converting characters like é to an ASCII e, but it still fails on characters like π, since there is no ASCII character similar in form.
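A rough Python analogue of that best-fit behavior, using Unicode decomposition (for illustration only; this is not the table iconv actually uses):
import unicodedata
for s in ("héllo", "π"):
    # Decompose, then drop whatever (accents, π itself) ASCII cannot carry.
    print(s, "->", unicodedata.normalize("NFKD", s).encode("ascii", "ignore"))
# héllo -> b'hello', but π -> b'' because nothing survives the conversion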
Your required answer would need massive UTF-8 and UTF-16 translation tables covering every supported code page and the BMP, and it would still be unable to reliably detect the source encoding.
Notepad has trouble with this issue.
The solution as requested would probably entail more effort than you put into the original program.
Possible solutions
Add a text editor into your program. If you write it, you will be able to read it.
The following solution pushes the translation to established tables provided by Windows.
Use native Win32 API calls to translate strings, with functions like WideCharToMultiByte, but even this has its drawbacks (from the referenced page: the note is more relevant to the topic, but the caution is important for security):
Caution  Using the WideCharToMultiByte function incorrectly can compromise the security of your application. Calling this function can easily cause a buffer overrun because the size of the input buffer indicated by lpWideCharStr equals the number of characters in the Unicode string, while the size of the output buffer indicated by lpMultiByteStr equals the number of bytes. To avoid a buffer overrun, your application must specify a buffer size appropriate for the data type the buffer receives.
Data converted from UTF-16 to non-Unicode encodings is subject to data loss, because a code page might not be able to represent every character used in the specific Unicode data. For more information, see Security Considerations: International Features.
Note  The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not.
This solution still has the guess the encoding problem, but if a BOM is present, this is one of the best translators possible.
Simply require the text file to be saved in the local code page.
Other thoughts:
ANSI, ASCII, and UTF-8 agree only up to 127; above that they are all separate encodings, and the control characters are handled differently.
In UTF-16, every other byte of ASCII-encoded text is 0 (the zero byte comes first in big-endian order). This is not covered by your "rules" (see the sketch after this list).
You simply have to search for the Turkish i to understand the complexities of Unicode translations and comparisons.
Leverage any expectations of the file contents to establish a coherent baseline comparison to make an educated guess.
For example, if it is a .csv file, find a comma in the various formats...
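To illustrate the UTF-16 point above with a quick Python check (a toy example, not a robust detector):
data = "hello".encode("utf-16-le")
print(data)                              # b'h\x00e\x00l\x00l\x00o\x00'
print(all(b == 0 for b in data[1::2]))   # True: every other byte is zero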
Bottom Line
There is no perfect general solution, only specific solutions tailored to your specific needs, which were extremely broad in the question.

How to read a text file in ancient encoding?

There is a public project called Moby containing several word lists. Some files contain symbols from European alphabets and were created in pre-Unicode times. The readme, dated 1993, reads:
"Foreign words commonly used in English usually include their
diacritical marks, for example, the acute accent e is denoted by ASCII
142."
Wikipedia says that the last ASCII symbol has number 127.
For example, this file: http://www.gutenberg.org/files/3203/files/mobypos.txt contains symbols that I couldn't read in any of the various Latin encodings. (There are plenty of such symbols at the very end of the section of words beginning with B, just before the letter C.)
Could someone advise please what encoding should be used for reading this file or how can it be converted to some readable modern encoding?
A little research suggests that the encoding for this page is Mac OS Roman, which has é at position 142. Viewing the page you linked and changing the encoding (in Chrome, View → Encoding → Western (Macintosh)) seems to display all the words correctly (it is incorrectly reporting ISO-8859-1).
How you deal with this depends on the language / tools you are using. Here’s an example of how you could convert into UTF-8 with Ruby:
require 'open-uri'
# Fetch the raw bytes, relabel them as Mac OS Roman, then transcode to UTF-8.
# (On Ruby 3+ use URI.open; plain open no longer accepts URLs.)
s = URI.open('http://www.gutenberg.org/files/3203/files/mobypos.txt').read
s.force_encoding('macRoman')   # reinterpret the bytes, no conversion yet
s.encode!('utf-8')             # now actually transcode to UTF-8
You are right in that ASCII only goes up to position 127 (it’s a 7-bit encoding), but there are a large number of 8 bit encodings that are supersets of ASCII and people sometimes refer to those as “Extended ASCII”. It appears that whoever wrote the readme you refer to didn’t know about the variety of encodings and thought the one he happened to be using at the time was universal.
There isn’t a general solution to problems like this, as there is no guaranteed way to determine the encoding of some text from the text itself. In this case I just used Wikipedia to look through a few until I found one that matched. Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a good place to start reading about character sets and encodings if you want to learn more.

Squeak Monticello character-encoding

For a work project I am using headless Squeak on a (displayless, remote) Linux server, and also using Squeak on a Windows developer machine.
Code on the developer machine is managed using Monticello. Unfortunately, I have to copy the .mcz to the server using SFTP (e.g. having a push repository on the server is not possible for security reasons). The code is then merged by e.g.:
MczInstaller installFileNamed: 'name-b.18.mcz'.
Which generally works.
Unfortunately, our code base contains strings with umlauts and other non-ASCII characters. During the Monticello reimport, some of them get replaced with other characters and some get replaced with nothing.
I also tried e.g.
MczInstaller installStream: (FileStream readOnlyFileNamed: '...') binary
(note: .mcz files are actually .zip files, so binary should be appropriate; I guess it is the default anyway)
Finding out how to make Monticello's transfer preserve Squeak's internal encoding of non-ASCII characters is the main goal of my question. Changing all the source code to use only ASCII strings is (at least in this code base) much less desirable because manual labor is involved. If you are interested in why it is not a simple grep-replace in this case, read this side note:
(Side note (a simplified/special case): the code base uses Seaside's #text: method to render strings that contain characters that have to be HTML-escaped. This works fine with our non-ASCII characters, e.g. it converts ä into &auml;. If we were to grep-replace the literal ä's by &auml; explicitly, then we would have to use the #html: method instead (else they would be double-escaped), but that would then require replacing all other characters that have to be HTML-escaped as well (e.g. &), and then again the source code itself contains such characters. And there are other cases, like some #text:'s that take third-party strings; those may not be replaced by #html's...)
Squeak does use unicode (ISO 10646) internally for encoding characters in a String.
It might use an extension like CP1252 for characters in the range 16r80 to: 16r9F, but I'm not really sure anymore.
The character codes are written as-is to the stream source.st, and these codes take a single byte each for a ByteString, when all characters are <= 16rFF. In this case, the file looks like it is encoded in ISO-8859-1 or CP1252.
If you ever have character codes > 16rFF, then a WideString is used in Squeak. Once again, the codes are written as-is to the stream source.st, but this time they are 32-bit codes (written in big-endian order). Technically, the encoding is thus UTF-32BE.
Now what does MczInstaller do? It uses the snapshot/source.st file, and uses setConverterForCode for reading this file, which is either UTF-8 or MacRoman... So non-ASCII characters might get changed, and this is even worse in the case of WideString, which will be re-interpreted as ByteString.
MC itself doesn't use the snapshot/source.st member in the archive.
It rather uses the snapshot.bin (see code in MCMczReader, MCMczWriter).
This is a binary file whose format is governed by DataStream.
The snippet that you should use is rather:
MCMczReader loadVersionFile: 'YourPackage-b.18.mcz'
Monticello isn't really aware of character encoding. I don't know the present situation in Squeak, but the last time I looked into it there was an assumed character encoding of Latin-1. But that would mean it should work flawlessly in your situation.
It should work somehow anyway if you are writing and reading from the same kind of image. If the proper character encoding fails, usually the internal byte representation is written from memory to disk as-is. While this prevents any cross-dialect exchange of packages, it should work when using the same kind of image.
Anyway, there are things that should or could work but often go wrong. So most projects try to avoid using non-7-bit characters in their code.
You don't need to convert non-7-bit characters to HTML entities. You can use
Character value: 228
to produce an ä in your code without using non-7-bit characters. For every character you'd like to convert this way, you can find the code with
$ä asciiValue => 228
I know this is not the kind of answer some would want to get, but Monticello is one of those things that still needs to be adjusted for proper character encoding.

What should I use? UTF8 or UTF16?

I have to distribute my app internationally.
Let's say I have a control (like a memo) where the user enters some text. The user can be Japanese, Russian, Canadian, etc.
I want to save the string to disk as TXT file for later use. I will use MY OWN function to write the text and not something like TMemo.SaveToFile().
How do I want to save the string to disk? In UTF8 or UTF16 format?
The main difference between them is that UTF8 is backwards compatible with ASCII. As long as you only use the first 128 characters, an application that is not Unicode-aware can still process the data (which may be an advantage or disadvantage, depending on your scenario). In particular, when switching to UTF16 every API function needs to be adjusted for 16-bit strings, while with UTF8 you can often leave old API functions untouched if they don't do any string processing.
Also, UTF8 does not depend on endianness, while UTF16 does, which may complicate string I/O.
A common misconception is that UTF16 is easier to process because each character always occupies exactly two bytes. That is, unfortunately, not true. UTF16 is a variable-length encoding where a character may either take up 2 or 4 bytes. So any difficulties associated with UTF8 regarding variable-length issues apply to UTF16 just as well.
Finally, storage sizes: Another common myth about UTF16 is that it is more storage-efficient than UTF8 for most foreign languages. UTF8 takes less storage for all European languages, which can be encoded with one or two bytes per character. Non-BMP characters take up 4 bytes in both UTF8 and UTF16. The only case in which UTF16 takes less storage is if your text mainly consists of characters from the range U+0800 through U+FFFF, where the characters for Chinese, Japanese and Hindi are stored.
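You can check these storage claims yourself; a small Python illustration (byte counts, without BOMs):
samples = {
    "English":  "Hello, world",   # 1 byte per character in UTF-8
    "Greek":    "Καλημέρα",       # 2 bytes per character in both encodings
    "Japanese": "こんにちは",      # 3 bytes in UTF-8, 2 bytes in UTF-16
}
for name, text in samples.items():
    print(name, len(text.encode("utf-8")), len(text.encode("utf-16-le")))
# English 12 24, Greek 16 16, Japanese 15 10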
James McNellis gave an excellent talk at BoostCon 2014, discussing the various trade-offs between different encodings in great detail. Even though the talk is titled Unicode in C++, the entire first half is actually language-agnostic. A video recording of the full talk is available on BoostCon's YouTube channel, while the slides can be found on GitHub.
Depends on the language of your data.
If your data is mostly in western languages and you want to reduce the amount of storage needed, go with UTF-8 as for those languages it will take about half the storage of UTF-16. You will pay a penalty when reading the data as it will be / needs to be converted to UTF-16 which is the Windows default and used by Delphi's (Unicode) string.
If your data is mostly in non-western languages, UTF-8 can take more storage than UTF-16, as it may take up to 4 bytes per character for some (see comment by @KennyTM).
Basically: do some tests with representative samples of your users' data and see which performs better, both in storage requirements and load times. We have had some surprises with UTF-16 being slower than we thought. The performance gain of not having to transform from UTF-8 to UTF-16 was lost because of disk access as the data volume in UTF-16 is greater.
First of all, be aware that the standard encoding under Windows is UCS-2 (until Windows 2000) or UTF-16 (since XP), and that Delphi's native string type has used the same native format since Delphi 2009 (string = UnicodeString, char = WideChar).
In all cases, it is unsafe to assume 1 WideChar == 1 Unicode character - this is the surrogate problem.
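A small illustration of the surrogate problem (in Python, but the same holds for WideChar in Delphi): a character outside the BMP occupies two UTF-16 code units.
ch = "\U0001D11E"                 # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
utf16 = ch.encode("utf-16-le")
print(len(utf16) // 2)            # 2 code units for a single character
print(utf16.hex(" "))             # 34 d8 1e dd: the surrogate pair D834 DD1E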
About UTF-8 or UTF-16 choice, it depends on the storage itself:
If your file is a plain text file (including XML), you may use either UTF-8 or UTF-16 - but you will have to put a BOM at the beginning of the file, otherwise applications (like Notepad) may be confused when opening it - for XML this is handled by your library (if it is not, change to another library);
If you are sure that your content is mostly 7 bit ASCII, use UTF-8 and the associated BOM;
If your file is some kind of database or a custom binary format, certainly the best format is UTF-16/UCS2, i.e. the default Delphi 2009+ string layout, and certainly the default database API layout;
Some file formats require or prefer UTF-8 (like JSON or even SQLite3), even if UTF-8 files can be bigger than UTF-16 for Asiatic characters.
For instance, we used UTF-8 for our client-server framework, since we use JSON as the exchange format (which requires UTF-8), and since SQLite3 likes UTF-8. Of course, we had to write some dedicated functions and classes to avoid conversion to/from string (which is slow for the string = UnicodeString type since Delphi 2009, and may lose some data when used with the string = AnsiString type before Delphi 2009; see this post and this unit). The easiest is to rely on the string = UnicodeString type, use the RTL functions which handle UTF-16 directly, and avoid conversions. And do not forget about your previous question.
If disk space and read/write speed are a problem, consider using compression instead of changing the encoding. There are real-time compression libraries around (faster than ZIP), like LZO or our SynLZ.

How to read unicode characters accurately

I have a text file containing what I am told are unicode characters, for example:
\320\222\320\21015-25'ish per main or \320\222\320\21020-40'ish per starter
Which should read:
£15-25'ish per main or £20-40'ish per starter
However, when viewing this text in Firefox, the output is mangled with various unwanted characters.
So, are these really unicode characters? And if so, how can I convert them to a form which is displayable correctly?
You need to:
know the encoding of the text file
read the data without losing information (either by reading it as binary or by reading it as text with the right encoding)
write the data with the right encoding (either by writing it out in binary and specifying the original encoding, or writing it out as text in an encoding which you also specify in the headers)
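A minimal sketch of that round trip in Python, assuming the source turns out to be UTF-8 (swap in the real encoding once you know it; the file names are placeholders):
with open("input.txt", "rb") as f:
    raw = f.read()                 # read as binary: no information is lost
text = raw.decode("utf-8")         # decode with the known source encoding
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)                  # write with an explicit encoding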
Try to separate out the problem into "reading" and/or "writing". Do you know the encoding of the file? What do you have to do with the file? When you've written it with backslashes, is that actually what's in the file (i.e. an escaped form) or is it actually just a "normal" text encoding such as UTF-8?
