Squeak Monticello character-encoding - character-encoding

For a work project I am using headless Squeak on a (displayless, remote) Linuxserver and also using Squeak on a Windows developer-machine.
Code on the developer machine is managed using Monticello. I have to copy the mcz to the server using SFTP unfortunately (e.g. having a push-repository on the server is not possible for security reasons). The code is then merged by eg:
MczInstaller installFileNamed: 'name-b.18.mcz'.
Which generally works.
Unfortunately our code-base contains strings that contain Umlauts and other non-ascii characters. During the Monticello-reimport some of them get replaced with other characters and some get replaced with nothing.
I also tried e.g.
MczInstaller installStream: (FileStream readOnlyFileNamed: '...') binary
(note .mcz's are actually .zip's, so binary should be appropriate, i guess it is the default anyway)
Finding out how to make Monticello's transfer preserve the Squeak internal-encoding of non-ascii's is the main Goal of my question. Changing all the source code to only use ascii-strings is (at least in this codebase) much less desirable because manual labor is involved. If you are interested in why it is not a simple grep-replace in this case read this side note:
(Side note: (A simplified/special case) The codebase uses Seaside's #text: method to render strings that contain chars that have to be html-escaped. This works fine with our non-ascii's e.g. it converts ä into ä, if we were to grep-replace the literal ä's by ä explicitly, then we would have to use the #html: method instead (else double-escape), however that would then require that we replace all other characters that have to be html-escaped as well (e.g. &), but then again the source-code itself contains such characters. And there are other cases, like some #text:'s that take third-party strings, they may not be replaced by #html's...)

Squeak does use unicode (ISO 10646) internally for encoding characters in a String.
It might use extension like CP1252 for characters in range 16r80 to: 16r9F, but I'm not really sure anymore.
The characters codes are written as is on the stream source.st, and these codes are made of a single byte for a ByteString when all characters are <= 16rFF. In this case, the file should look like encoded in ISO-8859-L1 or CP1252.
If ever you have character codes > 16rFF, then a WideString is used in Squeak. Once again the codes are written as is on the stream source.st, but this time these are 32 bits codes (written in big-endian order). Technically, the encoding is thus UTF-32BE.
Now what does MczInstaller does? It uses the snapshot/source.st file, and uses setConverterForCode for reading this file, which is either UTF-8 or MacRoman... So non ASCII characters might get changed, and this is even worse in case of WideString which will be re-interpreted as ByteString.
MC itself doesn't use the snapshot/source.st member in the archive.
It rather uses the snapshot.bin (see code in MCMczReader, MCMczWriter).
This is a binary file whose format is governed by DataStream.
The snippet that you should use is rather:
MCMczReader loadVersionFile: 'YourPackage-b.18.mcz'

Monticello isn't really aware of character encoding. I don't know the present situation in squeak but the last time I've looked into it there was an assumed character encoding of latin1. But that would mean it should work flawlessly in your situation.
It should work somehow anyway if you are writing and reading from the same kind of image. If the proper character encoding fails usually the internal byte representation is written from memory to disk. While this prevents any cross dialect exchange of packages it should work if using the same image kind.
Anyway there are things that should or could work but they often go wrong. So most projects try to avoid using non 7bit characters in their code.
You don't need to convert non 7bit characters to HTML entities. You can use
Character value: 228
for producing an ä in your code without using non 7bit characters. On every character you like to add a conversion you can do
$ä asciiValue => 228
I know this is not the kind of answer some would want to get. But monticello is one of these things that still need to be adjusted for proper character encoding.

Related

How to detect if user selected .txt file is Unicode/UTF-8 format and Convert to ANSI

My non-Unicode Delphi 7 application allows users to open .txt files.
Sometimes UTF-8/UNICODE .txt files are tried to be opened causing a problem.
I need a function that detects if the user is opening a txt file with UTF-8 or Unicode encoding and Converts it to the system's default code page (ANSI) encoding automatically when possible so that it can be used by the app.
In cases when converting is not possible, the function should return an error.
The ReturnAsAnsiText(filename) function should open the txt file, make detection and conversion in steps like this;
If the byte stream has no bytes values over x7F, its ANSI, return as is
If the byte stream has bytes values over x7F, convert from UTF-8
If the stream has BOM; try Unicode conversion
If conversion to the system's current code page is not possible, return NULL to indicate an error.
It will be an OK limit for this function, that the user can open only those files that match their region/codepage (Control Panel Regional Region Settings for non-Unicode apps).
The conversion function ReturnAsAnsiText, as you designed, will have a number of issues:
The Delphi 7 application may not be able to open files where the filename using UTF-8 or UTF-16.
UTF-8 (and other Unicode) usage has increased significantly from 2019. Current web pages are between 98% and 100% UTF-8 depending on the language.
You design will incorrectly translate some text that a standards compliant would handle.
Creating the ReturnAsAnsiText is beyond the scope of an answer, but you should look at locating a library you can use instead of creating a new function. I haven't used Delphi 2005 (I believe that is 7), but I found this MIT licensed library that may get you there. It has a number of caveats:
It doesn't support all forms of BOM.
It doesn't support all encodings.
There is no universal "best-fit" behavior for single-byte character sets.
There are other issues that are tangentially described in this question. You wouldn't use an external command, but I used one here to demonstrate the point:
% iconv -f utf-8 -t ascii//TRANSLIT < hello.utf8
^h'elloe
iconv: (stdin):1:6: cannot convert
% iconv -f utf-8 -t ascii < hello.utf8
iconv: (stdin):1:0: cannot convert
Enabling TRANSLIT in standards based libraries supports converting characters like é to ASCII e. But still fails on characters like π, since there are no similar in form ASCII characters.
Your required answer would need massive UTF-8 and UTF-16 translation tables for every supported code page and BMP, and would still be unable to reliably detect the source encoding.
Notepad has trouble with this issue.
The solution as requested, would probably entail more effort than you put into the original program.
Possible solutions
Add a text editor into your program. If you write it, you will be able to read it.
The following solution pushes the translation to established tables provided by Windows.
Use the Win32 API native calls translate strings using functions like WideCharToMultiByte, but even this has its drawbacks(from the referenced page, the note is more relevant to the topic, but the caution is important for security):
Caution  Using the WideCharToMultiByte function incorrectly can compromise the security of your application. Calling this function can easily cause a buffer overrun because the size of the input buffer indicated by lpWideCharStr equals the number of characters in the Unicode string, while the size of the output buffer indicated by lpMultiByteStr equals the number of bytes. To avoid a buffer overrun, your application must specify a buffer size appropriate for the data type the buffer receives.
Data converted from UTF-16 to non-Unicode encodings is subject to data loss, because a code page might not be able to represent every character used in the specific Unicode data. For more information, see Security Considerations: International Features.
Note  The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not.
This solution still has the guess the encoding problem, but if a BOM is present, this is one of the best translators possible.
Simply require the text file to be saved in the local code page.
Other thoughts:
ANSI, ASCII, and UTF-8 are all separate encodings above 127 and the control characters are handled differently.
In UTF-16 every other byte(zero first) of ASCII encoded text is 0. This is not covered in your "rules".
You simply have to search for the Turkish i to understand the complexities of Unicode translations and comparisons.
Leverage any expectations of the file contents to establish a coherent baseline comparison to make an educated guess.
For example, if it is a .csv file, find a comma in the various formats...
Bottom Line
There is no perfect general solution, only specific solutions tailored to your specific needs, which were extremely broad in the question.

How to decode unexpected strings from users?

I've published an app, and I find some of the comments to be like this: &ETH;&nbsp;&ETH;&micro;&ETH;&ordm;&ETH;&deg;&ETH;&frac14;&ETH;&micro;&ETH;&acute;&Ntilde;
I have googled a lot and I cannot decode it so that the comment will not be shown this way. This is the way it is stored in database; it can be in Cyrillic, but I could not decode it as well. Any clue on how to understand this kind of comments?
These appear to be doubly encoded HTML entities. So for example, & was turned to & and that was then again turned to &amp;
When decoding the data twice using this online tool (there are many others) the result is
РекамедÑ
That could be Unicode data, e.g. UTF-8 in a non-western character set like Cyrillic or Arabic, that
was misinterpreted as single-byte input
was garbled by a misguided "sanitation" method, possibly a call or two to PHP's htmlentities() (which incidentally assumes the single-byte ISO-8859-1 encoding by default in older versions, so a call to this function could be the whole source of the problem).
The fix will likely need to be on server side.
If you are using PHP, see UTF-8 all the way through for a handy guide.

Handling UTF-8 Character with Latin1 db encoding

I keep getting an exception that ActiveRecord::StatementInvalid: PG::UntranslatableCharacter: ERROR: character with byte sequence 0xe2 0x80 0x99 in encoding "UTF8" has no equivalent in encoding "LATIN1". I did some checking and it looks like it is the backtick or apostrophe. What is the best way to handle this? Just strip out the character or convert the whole db to UTF-8? If it is converting to UTF-8 how can I do that permanently as it always seems to revert if you do it in the shell?
I don't understand what you mean by "revert, if done in the shell", but: You seem to have an application where some parts (at least the database) using encoding LATIN1, and one part (your Rails App) is using UTF-8. IMO, it is best if you have every in Unicode, but to what extend a conversion makes sense, can not be said in general. For example, if your database is also being processed by other tools, and those expect Latin1, a conversion is not sensible.
In any case, you need to define a clear borderline between where you use which encoding, and handle conversion at this border. This applies not only to the database, but also - for example - to the HTML pages you are generating (hopefully UTF-8), to files uploaded by the users and processes by your application, and so on.
If you convert to an encoding, where certain characters can not be represented - as this is in your case -, you have only three choices:
Reject the data (they must have been generated somewhere, perhaps as user input in a web form),
Simply remove the offending characters
Replace the offending characters by a placeholder (for instance, a question mark)
None of these options is very pleasant, but if converting your database to UTF-8 is no option, you should deal with this problem at the point where the problem string is generated, and not when it is written into the database.

CGI::unescape can't handle unescaping "wymiana+teflon%F3w"?

I am working on data imported from legacy database into sqlite for development, legacy database has a lot of url encoded strings with Polish characters. I can get most of these strings readable by using
CGI::unescape_html( CGI::unescape "string" )
except for one case (that I noticed yet, there may be more as I didn't do any testing yet), the letter "ó". For instance, using unescapeHTML on string "wymiana+teflon%F3w" throws an invalid byte sequence exception.
Question now is either my string is properly escaped, as other Polish characters are using sequences of "&#nnn;" like "b%26%23322%3Bad+zapisu+%2D+powinno+by%26%23263%3B+brak", which seems to follow standard for numeric character referencing. BTW, this string is properly unescaped into
"bład zapisu - powinno być brak"
But, on the other hand, there are also strings with similar character encoding, e.g. "odpowietrzanie+weza%5C" which is properly handled by CGI::unescapeHTML. However, %5C represents a backslash not a letter with code point lower than U+0256. Can it be the reason? I tried to research on this but haven't found any explanation. I also updated my Ruby to 2.1.0 as CGI::Util has changed in new version, but still no luck.
ó is 0xF3 in ISO-8859-2 (and ISO-8859-1) but '\xF3' is not a valid UTF-8 string, that ó should be %C3%B3 in the URL if you're expecting UTF-8. Someone somewhere probably used the deprecated escape JavaScript function to encode the string instead of modern encodeURIComponent; you can see the difference with a simple test in your browser's JavaScript console:
> escape('ó')
"%F3"
> encodeURIComponent('ó')
"%C3%B3"
There's the %F3 you're seeing and the %C3%B3 that you want to see. One thing that should work is to fix the encoding by hand:
irb> CGI::unescape('wymiana+teflon%F3w').force_encoding('ISO-8859-2').encode('UTF-8')
=> "wymiana teflonów"
This assumes that you know what should be ISO-8859-1 and what should be UTF-8. You might have a mix of both ISO-8859-2 (or -1, -3, ..., Windows CP-1258, ...) in your data; unfortunately, there's no reliable way to tell the difference as the encodings overlap and there's no way to be sure what result makes sense without eye-balling it and knowing the various languages involved.
Probably the best you can do is:
Send everything through through your CGI::unescape_html(CGI::unescape(...)) converter.
Wrap that in an exception handler to trap the inevitable problems.
Stash the problem strings off to the side somewhere.
Try the ISO-8859-2 to UTF-8 conversion on the strings from (3) and eye-ball them to see if they makes sense.
Repeat with other common encodings until there's nothing left that you care about.
Note that I'm using ISO-8859-2 instead of the more common ISO-8859-1 as Latin-2 is for Eastern European languages (such as Polish) whereas Latin-1 is for Western European languages. They overlap on ó but there is no ł in Latin-1. With tasks like this you usually try the encodings that are probably there first, then fall back on other common encodings, then fall back to whatever other encodings you can think of, and then fall back on hard liquor.
Good luck, modernizing legacy data is not the funnest job in the world.
I've chosen another way to solve my problem, simply substituting all occurrences of '%F3' with '%26%23xF3%3B' before unescaping. BTW, capital letter Ó also needs similar substitution. The actual code I used:
def unescape_ó(s)
s = s.gsub(/%D3|%F3/, {'%D3' =>'%26%23xD3%3B', '%F3' => '%26%23xF3%3B'})
end
With this approach I don't have to handle invalid byte sequence exception as properly escaped string is used in CGI::unescapeHTML

What strategies are there for escaping character entities?

We are doing Natural Language Processing on a range of English language documents (mainly scientific) and run into problems in carrying non-ANSI characters through the various components. The documents may be "ASCII", UNICODE, PDF, or HTML. We cannot predict at this stage what tools will be in our chain or whether they will allow character encodings other than ANSI. Even ISO-Latin characters expressed in UNICODE will give problems (e.g. displaying incorrectly in browsers). We are likely to encounter a range of symbols including mathematical and Greek. We would like to "flatten" these into a text string which will survive multistep processing (including XML and regex tools) and then possibly reconstitute it in the last step (although it is the semantics rather than the typography we are concerned with so this is a minor concern).
I appreciate that there is no absolute answer - any escaping can clash in some cases - but I am looking for something allong the lines of XML's <![CDATA[ ...]]> which will survive most non-recursive XML operations. Characters such as [ are bad as they are common in regexes. So I'm wondering if there is a generally adopted approach rather than inventing our own.
A typical example is the "degrees" symbol:
HTML Entity (decimal) °
HTML Entity (hex) °
HTML Entity (named) °
How to type in Microsoft Windows Alt +00B0
Alt 0176
Alt 248
UTF-8 (hex) 0xC2 0xB0 (c2b0)
UTF-8 (binary) 11000010:10110000
UTF-16 (hex) 0x00B0 (00b0)
UTF-16 (decimal) 176
UTF-32 (hex) 0x000000B0 (00b0)
UTF-32 (decimal) 176
C/C++/Java source code "\u00B0"
Python source code u"\u00B0"
We are also likely to encounter TeX
$10\,^{\circ}{\rm C}$
or
\degree
so backslashes, curlies and dollars are a poor idea.
We could for example use markup like:
__deg__
__#176__
and this will probably work but I'd appreciate advice from those who have similar problems.
update I accept #MichaelB's insistence that we use UTF-8 throughout. I am worried that some of our tools may not conform and if so I'll revisit this. Note that my original question is not well worded - read his answer and the link in it.
Get someone to do this who really understands character encodings. It looks like you don't, because you're not using the terminology correctly. Alternatively, read this.
Do not brew up your own escape scheme - it will cause you more problems than it will solve. Instead, normalize the various source encodings to UTF-8 (which is really just one such escape scheme, except efficient and standardized) and handle character encodings correctly. Perhaps use UTF-7 if you're really that scared of high bits.
In this day and age, not handling character encodings correctly is not acceptable. If a tool doesn't, abandon it - it is most likely very bad quality code in many other ways as well and not worth the hassle using.
Maybe I don't get the problem correctly, but I would create a very unique escape marker which is unlikely to be touched, and then use it to enclose the entity encoded as a base32 string.
Eventually, you can transmit the unique markers and their number along the chain through a separate channel, and check their presence and number at the end.
Example, something like
the value of the temperature was 18 cd48d8c50d7f40aeb6a164181b17feee EZSGKZY= cd48d8c50d7f40aeb6a164181b17feee
your marker is a uuid, and the entity is &deg encoded in base32. You then pass along the marker cd48d8c50d7f40aeb6a164181b17feee. It cannot be corrupted (if it gets corrupted, your filters will probably corrupt anything made of letters and numbers anyway, but at least you can exclude them because they are fixed length) and you can always recover the content by looking inside the two markers.
Of course, if you have uuids in your documents, this could represent a problem, but since you are not transmitting them as authorized markers along the lateral channel, they won't be recognized as such (and in any case, what's inbetween won't validate as a base32 string anyway).
If you need to search for them, then you can keep the uuid subdivision, and then use a proper regexp to spot these occurrences. Example:
>>> re.search("(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})(.*?)(\\1)", s)
<_sre.SRE_Match object at 0x1003d31f8>
>>> _.groups()
('6d378205-1265-44e4-80b8-a47d1ceaad51', ' EZSGKZY= ', '6d378205-1265-44e4-80b8-a47d1ceaad51')
>>>
If you really need a specific "token" to test, you can use a uuid1, with a very defined specification of a node:
>>> uuid.uuid1(node=0x1234567890)
UUID('bdcce554-e95d-11de-bd0f-001234567890')
>>> uuid.uuid1(node=0x1234567890)
UUID('c4c57a91-e95d-11de-90ca-001234567890')
>>>
You can use anything you prefer as a node, the uuid will be unique, but you can still test for presence (although you can get false positives).

Resources