Safely save a dictionary to file - character-encoding

What's the best way to save a dictionary to a file, so that I can load it later in Tcl on a different computer/system?
I don't think fconfigure $stream -translation binary; puts -nonewline $stream $dict works if there are keys/values with Unicode characters > \u00ff. Is "utf-8" encoding OK (to save disk space), or should I always use full "unicode"?
Dictionaries are new to me, but since their textual representation is that of a list with alternating keys/values, and a list is just a string with some extra syntax characters, maybe the question could've been "Safely save a string to file"?

Your question can be simplified:
What encoding should I use?
The answer is: it depends.
If you have only binary data, then binary is the way to go.
If you have mostly ASCII characters and a few exceptions (like umlauts etc.), then utf-8 is the best encoding for you (characters outside the \x00-\x7f range are encoded with 2 or more bytes, ASCII with 1 byte).
If you have many CJK characters, then unicode or some other encoding suited to them is probably better (2 bytes instead of 3 bytes per character).
And you are right: (de)serialization of lists and dicts is much easier in Tcl than in many other languages: save the value as a string.

Related

Delphi decoded base64 to something

I am a bit stuck with decoding. I have a base64-encoded .rtf file.
A little part of this looks like this: Bek\u252\''fcld\u337\''3f
Which represents: Beküldő
But my output data after decoding is: Bekuld?
If I manually replace the characters it works.
StringReplace(Result, 'U337\''3F', '''F5', [rfReplaceAll, rfIgnoreCase]);
Does anyone know a general solution for this? Some conversion or something?
For instance, \u242 means Unicode character #242.
So you could search for \u in the RTF content (ignoring any \\ escaped sequence), then retrieve the following number, and use it as a character.
But RTF is a very complex beast.
Check what the RTF 1.5 specification says about encoding:
\uN This keyword represents a single Unicode character which has no
equivalent ANSI representation based on the current ANSI code page. N
represents the Unicode character value expressed as a decimal number.
This keyword is followed immediately by equivalent character(s) in
ANSI representation. In this way, old readers will ignore the \uN
keyword and pick up the ANSI representation properly. When this
keyword is encountered, the reader should ignore the next N
characters, where N corresponds to the last \ucN value encountered.
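As a rough illustration of the manual approach suggested above (not a real RTF parser), here is a minimal C++ sketch that scans for \uN keywords, emits the corresponding code point, and skips the ANSI fallback that follows; the function name and the simplifying assumption of the default \uc1 state are mine.

#include <string>
#include <cctype>
#include <cstdlib>

// Minimal sketch: collect the text of an RTF fragment, replacing each \uN
// keyword with its Unicode code point. Assumes the default \uc1 state (one
// ANSI fallback character follows each \uN); a real reader must track \ucN
// and handle groups, other control words, and charset changes.
std::wstring DecodeRtfUnicodeEscapes(const std::string& rtf)
{
    std::wstring out;
    size_t i = 0;
    while (i < rtf.size()) {
        if (rtf[i] == '\\' && i + 2 < rtf.size() && rtf[i + 1] == 'u'
            && (isdigit((unsigned char)rtf[i + 2]) || rtf[i + 2] == '-')) {
            long n = strtol(rtf.c_str() + i + 2, nullptr, 10);
            if (n < 0) n += 65536;                 // RTF writes values > 32767 as negatives
            out += (wchar_t)n;
            size_t j = i + 2;
            while (j < rtf.size() && (isdigit((unsigned char)rtf[j]) || rtf[j] == '-')) ++j;
            if (j + 1 < rtf.size() && rtf[j] == '\\' && rtf[j + 1] == '\'')
                j += 4;                            // fallback written as \'xx
            else if (j < rtf.size() && rtf[j] != '\\')
                j += 1;                            // plain one-character fallback
            i = j;
        } else {
            out += (wchar_t)(unsigned char)rtf[i];
            ++i;
        }
    }
    return out;
}

Applied to the fragment Bek\u252\'fcld\u337\'3f, this yields "Beküldő", since \u252 is ü (U+00FC) and \u337 is ő (U+0151).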
Perhaps the easiest is to use a hidden RichEdit for decoding, under Windows/VCL.

How to identify if a TBytes array may safely convert to AnsiString, string or UTF8String?

Given a TBytes array, can we identify if the array may convert to AnsiString, String or UTF8String without losing any characters?
What you appear to be asking to do is impossible. You seem to have a byte array of unknown provenance, that may be encoded as ANSI, UTF-8 or UTF-16. You are hoping to be able to determine which encoding is correct.
This is impossible because there exist byte arrays that are valid in all three of those encodings, and that represent different strings in each encoding. Raymond Chen shows a nice clean example here: The Notepad file encoding problem, redux.
You can use heuristic algorithms to attempt to guess the encoding, an example of which is IsTextUnicode. But any such approach is by necessity not robust.
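To illustrate the heuristic idea, here is a hand-rolled C++ sketch (not IsTextUnicode, and easy to port to Delphi) that checks whether a byte buffer is at least structurally valid UTF-8; note that pure ASCII passes too, so a positive result still does not prove the data was meant as UTF-8.

#include <cstddef>
#include <cstdint>

// Heuristic sketch: does the buffer consist of well-formed UTF-8 sequences?
// It does not reject overlong forms or surrogate code points, and plain
// ASCII (or many short ANSI strings) will also pass, so treat a "true"
// result only as a hint, never as proof of the original encoding.
bool LooksLikeUtf8(const uint8_t* p, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t b = p[i];
        size_t extra;
        if (b < 0x80)                extra = 0;   // single-byte ASCII
        else if ((b & 0xE0) == 0xC0) extra = 1;   // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2;   // 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3;   // 4-byte sequence
        else return false;                        // invalid lead byte
        if (i + extra >= len) return false;       // truncated sequence
        for (size_t k = 1; k <= extra; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return false;  // bad continuation byte
        i += extra + 1;
    }
    return true;
}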

What is the relationship between unicode/utf-8/utf-16 and my local encode GBK?

I've noticed that my text file from Windows (Chinese version) turns garbled when ported to Ubuntu.
After more research, I learned that the default encoding on the Chinese version of Windows is GBK, while on Ubuntu it is UTF-8, and iconv can do the encoding conversion, for example, from GBK to UTF-8:
iconv -f gbk -t utf-8 input.txt > output.txt
But I am still confused by the relationship between these encodings. What are they? What are the similarities and differences between them?
First, it is not about the OS, but about the program you are using to read the file.
For a bare .txt file, the program has to guess the encoding, which is not always possible but might work. In an HTML file, the encoding is given as metadata, so browsers don't need to guess.
Second, do you know ASCII? Do you see how it represents symbols via numbers? If not, this is the first thing you should learn now.
Next, do you see the difference between Unicode and UTF-XXX? It must be clear to you that Unicode is just a map of numbers (code points) to glyphs (symbols, including Chinese characters, ASCII characters, Egyptian characters, etc.)
UTF-XXX on the other hand says, given a string of bytes, which Unicode numbers (code points) do they represent. Therefore, UTF-8 and UTF-16 are different efficient ways to represent Unicode.
As you may imagine, unlike ASCII, both UTF and GBK must allow more than one byte per character, since there are many more than 256 characters.
In GBK all characters are encoded as 1 or 2 bytes.
Since GBK is specialized for Chinese, it uses fewer bytes on average than UTF-XXX to represent a given Chinese text, and more for other languages.
In UTF-8 and 16, the number of bytes per glyph is variable, so you have to look at how many bytes are used for the Chinese code points.
In Unicode, Chinese glyphs occupy several ranges, the main one being CJK Unified Ideographs (U+4E00-U+9FFF). You then have to look at how efficiently UTF-8 and UTF-16 represent those ranges.
According to the Wikipedia articles on UTF-8 and UTF-16, the first and most common range for Chinese glyphs, 4E00-9FFF, is represented in UTF-8 with 3 bytes per character, while in UTF-16 it takes 2 bytes. Therefore, if you are going to use lots of Chinese, UTF-16 might be more efficient. You also have to look into the other ranges to see how many bytes per character are used.
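If you want to verify those byte counts yourself, a tiny C++ check along these lines will do (the choice of U+4E2D as the sample character and the hand-encoded UTF-8 literal are mine; it assumes a C++11 or later compiler):

#include <cstdio>
#include <string>

// Byte counts for one CJK character, U+4E2D:
// 3 bytes in UTF-8, 2 bytes (one 16-bit code unit) in UTF-16.
int main()
{
    std::string    utf8  = "\xE4\xB8\xAD";  // U+4E2D hand-encoded as UTF-8
    std::u16string utf16 = u"\u4E2D";       // U+4E2D as a UTF-16 literal
    std::printf("UTF-8 : %zu bytes\n", utf8.size());                      // 3
    std::printf("UTF-16: %zu bytes\n", utf16.size() * sizeof(char16_t));  // 2
    return 0;
}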
For portability, the best choice is a UTF encoding, since Unicode can represent almost any possible character, so it is more likely that viewers will have been programmed to decode it correctly. The size gain of GBK is not that large.

Translating memory contents into a string via ASCII encoding

I have to translate some memory contents into a string, using ASCII encoding. For example:
0x6A636162
But I am not sure how to break that up to translate it into ASCII. I think it has something to do with how many bits are in a char/digit, but I am not sure how to go about it (and of course, I would like to know more of the reasoning behind it, not just "how to do it").
ASCII uses 7 bits to encode a character (http://en.wikipedia.org/wiki/ASCII). However, it's common to encode characters using 8 bits instead (note that technically this isn't ASCII). Thus, you'd need to split your data into 8-bit chunks and match that to an ASCII table.
If you're using a specific programming language, it may have a way to translate an ASCII code to a character. For instance, Ruby has the .chr method, Python has the chr() built-in function, and in C you can printf("%c", number).
Note that each nibble (4 bits) can be represented as one hexadecimal digit, so for the sample string you show, each 8-bit "chunk" would be:
0x6A
0x63
0x61
0x62
So the string reads "jcab" :)
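As a quick check, here is a small C++ sketch of that splitting (it takes the bytes most-significant-first, i.e. in the order the hex digits are written; the order the bytes actually sit in memory depends on endianness):

#include <cstdio>
#include <cstdint>

// Split the 32-bit value into four 8-bit chunks, most significant byte
// first, and interpret each chunk as an ASCII character.
int main()
{
    uint32_t word = 0x6A636162;
    char text[5];
    for (int i = 0; i < 4; ++i)
        text[i] = (char)((word >> (8 * (3 - i))) & 0xFF);  // 0x6A, 0x63, 0x61, 0x62
    text[4] = '\0';
    std::printf("%s\n", text);  // prints "jcab"
    return 0;
}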

Percent Encoded UTF-8 to Ascii(8-bit) conversion

I'm reading in URLs and they often have percent-encoded characters.
Example: %C3%A9 is actually é
According to http://www.microsystools.com/products/sitemap-generator/faq/character-percentage-url-encoding/ , characters in the upper half of 8-bit ASCII (128-255) are encoded as UTF-8, and their bytes are then saved as hex. Now, when I get my URL, the %HEX sequences have been re-encoded as 8-bit ASCII, and I need to convert those back to their true 8-bit ASCII. Is there any function/library I can use, or else, how would I go about the conversion?
I'm using C/C++.
First you need to URL-decode. That is not a function available in cross-platform C++, but, luckily for you, not a hard problem. Copy bytes from source to target. Non-% bytes just get copied. When you hit %XX, convert XX from hex characters to binary, and you have your byte.
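A minimal sketch of that copy loop (the function name is mine, and validation of malformed input is omitted; a real decoder may also want to map '+' to a space in query strings):

#include <string>
#include <cstdlib>

// Copy bytes through unchanged; turn each %XX triple into a single byte
// by converting the two hex digits. Assumes well-formed input.
std::string UrlDecode(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            char hex[3] = { in[i + 1], in[i + 2], '\0' };
            out += (char)strtol(hex, nullptr, 16);
            i += 2;
        } else {
            out += in[i];
        }
    }
    return out;
}

For example, UrlDecode("%C3%A9") yields the two bytes 0xC3 0xA9, which is "é" in UTF-8.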
This gives you a buffer of text in UTF-8. You say you want 'ASCII' -- ISO-646. Then you can't have an accented e. I can think of several possibilities for what you really want:
ISO-8859-1. You can use ICU to convert UTF-8 to ISO-8859-1.
ISO-646. You can also use ICU, and I believe it will make accented chars into their ISO-646 equivalents.
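If you do go the ICU route, a sketch of the conversion step could use ucnv_convert, roughly as below (check the exact signature against your ICU version; the buffer sizing and error handling here are deliberately simplistic):

#include <string>
#include <vector>
#include <unicode/ucnv.h>

// Sketch: convert a UTF-8 buffer to ISO-8859-1 with ICU's pivot-based
// ucnv_convert. Characters with no Latin-1 mapping fall back to the
// converter's default substitution behaviour.
std::string Utf8ToLatin1(const std::string& utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    std::vector<char> out(utf8.size() + 1);        // Latin-1 output is never longer than the UTF-8 input
    int32_t written = ucnv_convert("ISO-8859-1", "UTF-8",
                                   out.data(), (int32_t)out.size(),
                                   utf8.data(), (int32_t)utf8.size(),
                                   &status);
    if (U_FAILURE(status))
        return std::string();                      // real code should report the error
    return std::string(out.data(), written);
}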
