I'm writing my own BASE64 encoder/decoder for some constrained environments.
And I found that Base64.Encoder#encodeString saying that it uses ISO-8859-1 for construct a String from those encoded bytes.
I perfectly presuming that ISO-8859-1 charset also covers all base64 alphabets.
Is there any possible reason not to use US-ASCII?
I suspect it's more efficient: converting from ISO-8859-1 back to text is just a matter of promoting each byte straight to a char, whereas for ASCII you'd need to check that the byte is valid ASCII. The result for base64 will always be the same, of course.
(That's only a guess, but an educated one. You could always run benchmarks if you want to validate it...)
Related
Given a TBytes array, can we identify if the array may convert to AnsiString, String or UTF8String without losing any characters?
What you appear to be asking to do is impossible. You seem to have a byte array of unknown provenance, that may be encoded as ANSI, UTF-8 or UTF-16. You are hoping to be able to determine which encoding is correct.
This is impossible because there exist byte arrays that are valid in all three of those encodings, and that represent different strings in each encoding. Raymond Chen shows a nice clean example here: The Notepad file encoding problem, redux.
You can use heuristic algorithms to attempt to guess the encoding, an example of which is IsTextUnicode. But any such approach is by necessity not robust.
this question will have a very simple answers which is yes or no I guess ?
If I encode from x64 bit unicode delphi app my stringlist like this
StringList.SaveToFile(FileName, TEncoding.ASCII);
is there any other limitation , difference in file layout while writing this file with the statement
StringList.SaveToFile(FileName);
or
StringList.SaveToFile(FileName, TEncoding.UTF8);
I'm afraid on line length and control char issues between both versions....Answer NO will make me happy.
UTF-8 and the Windows 'Ansi' codepages are all superset of ASCII. As such, if the string list only contains characters in the ASCII range, the three statements you listed will be equivalent if you prepend the last with this:
StringList.WriteBOM := False;
This is because by default, TStrings will write out a small marker (a BOM) to denote UTF-8 text.
The difference is simply in the encoding used. This in turn, of course, leads to differences in size. So ASCII files will be smaller than UTF-16 (what you get with TEncoding.Unicode. And UTF-8 files could be the same size as ASCII, or larger than UTF-16.
I guess you are asking if using ASCII or UTF-8 in any way damages the text that is written. Well, using ASCII will if the text contains non-ASCII characters. ASCII can only encode 127 characters.
On the other hand, UTF-8 is a full encoding of Unicode. Which means that
StringList.SaveToFile(FileName, TEncoding.UTF8);
StringList.LoadFromFile(FileName, TEncoding.UTF8);
results in the list having exactly the same content as it did before the save.
You ask if lines can be truncated by SaveToFile. They cannot.
Another point to make is that 32/64 bit is not relevant here. The code behaves in exactly the same way under 32 and 64 bit. The issues are always to do with encoding.
I would also note that the title of your question is somewhat mis-leading. When you encode with TEncoding.UTF8 you not do not have an ASCII file.
I have a file. I don't know how it was processed. It's probably a double encoding. I've found this link about double encoding that solved almost my problem:
http://www.spamusers.com/encoding.htm
It has all the double encodings substitutions to do like:
À àÁ
 Â
Unfortnately I still others weird characters like:
ú
ç
ö
Do you have an idea on how to clean these weird characters? For the ones I know I've just made a bash script and I've just replaced them. But I don't know how to recognize the others. I'm running on linux so if you have some magic commands I would like that.
The “double encodings substitutions” page that you link to seems to contain mappings meant to fix character data that has been doubly UTF-8 encoded. Thus, the proper fixing routine would be to reverse such mappings and see if the result makes sense.
For example, if you take A with grave accent, À, U+00C0, and UTF-8 encode it, you get the bytes C3 A0. If these are then mistakenly understood as single-byte encodings according to windows-1252 for example, you get the characters U+00C3 U+00A0 (letter à and no-break space). If these are then UTF-8 encoded, you get C3 83 for the former and C2 80 for the latter. If these bytes in turn are interpreted according to windows-1252, you get À as on the page.
But you don’t actually have “À”, do you? You have some digital data, bytes, which display that way if interpreted according to windows-1252. But that would be a wrong interpretation.
You should first read the data as UTF-8 encoded, decode it to characters, checking that all codes are less than 100 hexadecimal (if not, there’s yet another error involved somewhere), then UTF-9 decode again.
I am trying to convert UTF-8 string into UCS-2 string.
I need to get string like "\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875".
I have googled for about a month by now, but still there is no reference about converting UTF-8 to UCS-2.
Please someone help me.
Thx in advance.
EDIT: okay, maybe my explanation was not good enough. Here is what I am trying to do.
I live in Korea, and I am trying to send a sms message using CTMessageCenter. I tried to send chinese simplified character through my app. And I get ???? Instead of proper characters. So I tried UTF-8, UTF-16, BE and LE as well. But they all return ??. Finally I found out that SMS uses UCS-2 and EUC-KR encoding in Korea. Weird, isn't it?
Anyway I tried to send string like \u4E3B\u9875 and it worked.
So I need to convert string into UCS-2 encoding first and get the string literal from those strings.
Wikipedia:
The older UCS-2 (2-byte Universal Character Set) is a similar
character encoding that was superseded by UTF-16 in version 2.0 of the
Unicode standard in July 1996.2 It produces a fixed-length format
by simply using the code point as the 16-bit code unit and produces
exactly the same result as UTF-16 for 96.9% of all the code points in
the range 0-0xFFFF, including all characters that had been assigned a
value at that time.
IBM:
Since the UCS-2 standard is limited to 65,535 characters, and the data
processing industry needs over 94,000 characters, the UCS-2 standard
is in the process of being superseded by the Unicode UTF-16 standard.
However, because UTF-16 is a superset of the existing UCS-2 standard,
you can develop your applications using the systems existing UCS-2
support as long as your applications treat the UCS-2 as if it were
UTF-16.
uincode.org:
UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2
are identical for purposes of data exchange. Both are 16-bit, and have
exactly the same code unit representation.
So, using the "UTF8toUnicode" transformation in most language libraries will produce UTF-16, which is essentially UCS-2. And simply extracting the 16-bit characters from an Objective-C string will accomplish the same thing.
In other words, the solution has been staring you in the face all along.
UCS-2 is not a valid Unicode encoding. UTF-8 is.
It is therefore impossible to convert UTF-8 into UCS-2 — and indeed, also the reverse.
UCS-2 is dead, ancient history. Let it rot in peace.
I use delphi 7.
I need to read a utf-8 file line by line, each line contain a word and its weight (a number)
So I need to read every next line, then divide a line by a separator (tab char) and save this in memory.
So,
1) is there a library to work with utf-8 files in Delphi (3-rd party maybe)
2) will functions operate ok with widestring? I use PosEx. So, if they won't, can you also give a link to 3-rd party library to work with widestrings?
If it is really UTF-8 that you are dealing with, then you should not need anything special as far as reading and processing them. You should be able to treat them as pchar or even as a normal Delphi 7 string. If you try to show the contents in some kind of message box, then you may need to do some conversions. For example, I don't believe the Delphi 7 message box method would display UTF-8 strings correctly if the string contained any byte values over 127 (0x7f). For something like that, you would need to convert to UTF-16 and call the Windows API MessageBoxW or something similar. Otherwise, though, UTF-8 strings can be treated in many situations the same as single byte ANSI strings.
I don't think UTF-8 is typically referred to as "widestring". I might be wrong, but I think that typically means UTF-16.
If your file is encoded as UTF-8, and the characters you're looking for are ASCII, then there's no need to use WideString at all. ASCII is a subset of UTF-8, and any ASCII character is guaranteed not to interfere with the special encoding used for other characters in UTF-8. The number characters 0 through 9 and the tab character are all ASCII.
The JCL comes with various functions and classes for dealing with Unicode, if you find you really need to use them.
If most of your input is UTF-8, it might be worthwhile to change your codepage on startup from the "default" to utf8 (codepage 65001). This will make all ansistring->widestring conversions effectively become a lossless utf-8->utf-16.
With D7, you will need a set of so called "unicode" components, components that base themselves on the winapi -W functions. Delphi's own components only do this with the watershed D2009 release that switches the default string type to UTF-16.
If you want to heavily invest in Unicode support, upgrading might be a smart thing to do
WideString is an UTF-16 implementation (a COM BSTR compatible one), it can't store UTF-8 strings, if you assign an 8 bit string it will be converted to UTF-16. But unless you use explicitly the proper conversion function, Delphi will interpret the 8 bit string using the current codepage.
An UTF-8 string can be stored in a Delphi AnsiString (the default string type in Delphi 7), but string manipulation functions are designed for ANSI codepages, not UTF-8. The difference is that UTF-8 is a multi byte character set. But the first 127 ANSI characters, more than one byte is needed to encode a given "character", while many ANSI codepages (especially those for European languages) only require one byte, encoding only 255 "characters" (while UTF-8 can encode the whole Unicode set).
If you're just looking for the tab character AFAIK you could use simply an AnsiString, but you have to ensure that any byte above $80 you may need to look for is not part of a multibyte sequence. If you have more complex processing needs, it may be easier to find libraries working on UTF-16 strings than UTF-8. As Rob Kennedy said, JCL is a good starting point as a free library implementing UTF string manipulation.
You could simply read the file as-is into a normal TStringList via its LoadFrom...() methods, then loop through the list as needed. If loading the entire file into memory at one time is not an option, then you can open the file using a TFileStream and then use the TStreamReader.ReadLine() method to read the stream line-by-line.
If you need to decode a given UTF-8 sequence to UTF-16 for processing, then I would suggest using the Win32 API MultiByteToWideChar() function directly, only because the RTL's UTF8Decode() function has a broken UTF-8 implementation in older Delphi versions (not sure about D7, but it definately does in D6).
The nice thing about either loading approach is that they are both encoding-aware in D2009 and later, which means that if you ever upgrade, you can make a couple of very small code changes to tell the RTL that the data is UTF-8, and it will decode it to UTF-16 for you automatically, and then the rest of your processing code can remain the same (assuming you are not doing anything that is Ansi-specific).