Why can't Dart decode base64 from Crystal?

Why does Dart produce the error "invalid character at position 61" with base64 from Crystal Lang?

The default Crystal Base64 encoding won't work in Dart or Flutter.
This is because Crystal does not use strict encoding by default: it inserts a newline every 60 characters.
To Dart's decoder, these newlines are invalid characters, which is why the error points just past the first 60 characters of the string.
So, in short, you have to use Crystal's Base64.strict_encode method, which encodes without inserting any such characters.
Dart has no way to ignore those characters when decoding, so this is necessary to make it work.
https://crystal-lang.org/api/0.35.1/Base64.html#strict_encode(data,io:IO)-instance-method
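
For illustration, a minimal sketch of both sides (the sample data and names are placeholders, not from the original question).

In Crystal:

require "base64"

data = "a" * 100
Base64.encode(data)        # inserts a newline after every 60 characters
Base64.strict_encode(data) # single line; safe for Dart's decoder

And on the Dart side:

import 'dart:convert';

void main() {
  // 'aGVsbG8=' is strict base64 for "hello"; output from
  // Crystal's strict_encode decodes the same way.
  final bytes = base64Decode('aGVsbG8=');
  print(utf8.decode(bytes)); // hello
}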

Related

Japanese characters interpreted as control character

I have several files which include various strings in different written languages. The files I am working with are in the .inf format which is somewhat similar to .ini files.
I am inputting the text from these files into a parser which considers the [ symbol as the beginning of a 'category'. Therefore, it is important that this character does not accidentally appear in string sequences or parsing will fail because it interprets these as "control characters".
For example, this string contains some Japanese writings:
iANSProtocol_HELP="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X Protocol �̓`�[���������щ��z LAN �Ȃǂ̍��x�#�\�Ɏg�����܂��B"
DISKNAME ="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X CD-ROM �܂��̓t���b�s�[�f�B�X�N"
In my text editor's (Atom) default UTF-8 encoding this gives me garbage text, which would not be an issue by itself; however, the 0x5B byte is interpreted as [, which causes the parser to fail because it assumes this signals the beginning of a new category.
If I change the encoding to Japanese (CP 932), these characters are interpreted correctly as:
iANSProtocol_HELP="インテル(R) アドバンスト・ネットワーク・サービス Protocol はチーム化および仮想 LAN などの高度機能に使われます。"
DISKNAME ="インテル(R) アドバンスト・ネットワーク・サービス CD-ROM またはフロッピーディスク"
Of course I cannot encode every file to be Japanese because they may contain Chinese or other languages which will be written incorrectly.
What is the best course of action for this situation? Should I edit the code of the parser to escape characters inside string literals? Are there any special types of encoding that would allow me to see all special characters and languages?
Thanks
If the source file is in Shift-JIS, then you should use a parser that supports it, or convert the file to UTF-8 before you parse it.
I believe this character set uses ASCII as its base but uses two bytes for certain characters, so 0x5B probably doesn't appear as the first byte of a multi-byte character, only as a trailing byte. (Note: this is conjecture based on how I think Shift-JIS works.)
So yes, you need to modify your parser to understand Shift-JIS, or convert the file to UTF-8 before parsing. I imagine converting is the easier option, as sketched below.
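
For example, a minimal Python sketch of the conversion (the file names are placeholders, and it assumes the source really is CP932/Shift-JIS):

# Convert a Shift-JIS (CP932) encoded .inf file to UTF-8 before parsing.
with open("driver.inf", "r", encoding="cp932") as src:
    text = src.read()
with open("driver.utf8.inf", "w", encoding="utf-8") as dst:
    dst.write(text)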

Converting special characters in TStringList

I am using Delphi 7 and have a routine which takes a csv file with a series of records and imports them. This is done by loading it into a TStringList with MyStringList.LoadFromFile(csvfile) and then getting each line with line = MyStringList[i].
This has always worked fine but I have now discovered that special characters are not picked up correctly. For example, Rue François Coppée comes out as Rue FranÃ§ois CoppÃ©e - the accented French characters are the problem.
Is there a simple way to solve this?
Your file is encoded as UTF-8. For instance, consider the ç: it is encoded in UTF-8 as the two bytes 0xC3 0xA7, and in Windows-1252, 0xC3 encodes Ã and 0xA7 encodes §. That is exactly why ç displays as Ã§.
Whether or not you can handle this easily using your ANSI Delphi depends on the prevailing code page under which your program runs.
If you are using Windows 1252 then you will be fine. You just need to decode the UTF-8 encoded text with a call to UTF8Decode.
If you are using a different locale then life gets more difficult. Those characters may not be present in your locale's character set and in that case you cannot represent them in a Delphi string variable which is encoded using the prevailing ANSI charset. If this is the case then you need to use Unicode.
If you care about handling international text then you need to either:
Upgrade to a modern Delphi which has Unicode support, or
Stick to Delphi 7 and use WideString and the TNT Unicode components.
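
If you stay on Delphi 7 under Windows-1252, a minimal sketch of the UTF8Decode route might look like this (assumes Classes in the uses clause; the file name is a placeholder):

procedure ImportCsv;
var
  Lines: TStringList;
  I: Integer;
  Line: WideString;
begin
  Lines := TStringList.Create;
  try
    // Each element now holds the raw UTF-8 bytes of one line.
    Lines.LoadFromFile('addresses.csv');
    for I := 0 to Lines.Count - 1 do
    begin
      Line := UTF8Decode(Lines[I]); // UTF-8 -> WideString
      // ... process Line ...
    end;
  finally
    Lines.Free;
  end;
end;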
The file probably is in UTF-8 encoding. Try to convert it:
Text := UTF8Decode(Text);
Regards,

Trouble making a heart symbol in Lua?

I was wondering how to make the heart sign, or "♥", in Lua. I have tried \003 because that is the ASCII code for it, but it does not print it out.
This has little to do with Lua.
You need to find out which character set and encoding is used in your environment and select a font that supports ♥ in that encoding.
Then you need to use an editor for your Lua script that saves in that encoding. If that is not possible, you can determine the byte sequence required, code it as numeric escapes in a literal string, and save the file in any ASCII-compatible encoding (such as CP437). For example, if you are outputting to a UTF-8 processor: "\xE2\x99\xA5".
Keep in mind that a Lua string is a counted sequence of bytes. It's up to you and your editor to put the right bytes in the file, up to your environment (e.g., the console) to interpret those bytes in a particular character encoding, and up to the font to display the glyph.
In a Windows console, you can select the Lucida Console font and chcp 65001 to use UTF-8, then use Lua 5.1 like this: lua -e "print('\226\153\165')". As a comparison, chcp 437 to use IBM437, then: lua -e "print('\003')".
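
Putting that together, a minimal sketch for a UTF-8 environment (assuming the console is set to UTF-8, e.g. via chcp 65001 on Windows):

-- U+2665 BLACK HEART SUIT as UTF-8 byte escapes, so the script
-- itself stays pure ASCII regardless of the editor's encoding.
local heart = "\226\153\165"    -- decimal escapes work in Lua 5.1+
-- local heart = "\xE2\x99\xA5" -- hex escapes need Lua 5.2 or later
print(heart)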
In ASCII, only the range 0x20 to 0x7E is printable. Other codes, including 0x03, are not printable; what gets displayed for them is up to the implementation.
If the environment supports Unicode, you can simply call:
print("♥")
For instance, Lua Demo outputs ♥, and the same happens on ideone.

Squeak Monticello character-encoding

For a work project I am using headless Squeak on a (displayless, remote) Linux server and also using Squeak on a Windows developer machine.
Code on the developer machine is managed using Monticello. Unfortunately I have to copy the mcz to the server using SFTP (e.g. having a push repository on the server is not possible for security reasons). The code is then merged by e.g.:
MczInstaller installFileNamed: 'name-b.18.mcz'.
Which generally works.
Unfortunately our code base contains strings with umlauts and other non-ASCII characters. During the Monticello reimport some of them get replaced with other characters and some get removed entirely.
I also tried e.g.
MczInstaller installStream: (FileStream readOnlyFileNamed: '...') binary
(Note: .mcz's are actually .zip's, so binary should be appropriate; I guess it is the default anyway.)
Finding out how to make Monticello's transfer preserve the Squeak-internal encoding of non-ASCII characters is the main goal of my question. Changing all the source code to use only ASCII strings is (at least in this codebase) much less desirable because manual labor is involved. If you are interested in why it is not a simple grep-replace in this case, read this side note:
(Side note, a simplified/special case: The codebase uses Seaside's #text: method to render strings that contain chars that have to be HTML-escaped. This works fine with our non-ASCII characters, e.g. it converts ä into &auml;. If we were to grep-replace the literal ä's by &auml; explicitly, then we would have to use the #html: method instead (else they get double-escaped), but that would in turn require replacing all the other characters that have to be HTML-escaped as well (e.g. &), and then again the source code itself contains such characters. And there are other cases, like some #text:'s that take third-party strings; those may not be replaced by #html:'s...)
Squeak does use Unicode (ISO 10646) internally for encoding the characters in a String.
It might use an extension like CP1252 for characters in the range 16r80 to 16r9F, but I'm not really sure anymore.
The character codes are written as is on the stream source.st, and these codes are made of a single byte for a ByteString, when all characters are <= 16rFF. In this case, the file looks as if it were encoded in ISO-8859-1 (Latin-1) or CP1252.
If you ever have character codes > 16rFF, then a WideString is used in Squeak. Once again the codes are written as is on the stream source.st, but this time as 32-bit codes (written in big-endian order). Technically, the encoding is thus UTF-32BE.
Now what does MczInstaller do? It uses the snapshot/source.st file, and uses setConverterForCode for reading this file, which assumes either UTF-8 or MacRoman... So non-ASCII characters might get changed, and this is even worse in the case of a WideString, which will be re-interpreted as a ByteString.
MC itself doesn't use the snapshot/source.st member in the archive.
It rather uses the snapshot.bin (see code in MCMczReader, MCMczWriter).
This is a binary file whose format is governed by DataStream.
The snippet that you should use is rather:
MCMczReader loadVersionFile: 'YourPackage-b.18.mcz'
Monticello isn't really aware of character encoding. I don't know the present situation in Squeak, but the last time I looked into it there was an assumed character encoding of Latin-1. That would mean it should work flawlessly in your situation.
It should work somehow anyway if you are writing and reading from the same kind of image: if proper character encoding fails, usually the internal byte representation is written from memory to disk as is. While this prevents any cross-dialect exchange of packages, it should work when both sides use the same kind of image.
Anyway, there are things that should or could work but often go wrong, so most projects try to avoid non-7-bit characters in their code.
You don't need to convert non-7-bit characters to HTML entities, though. You can use
Character value: 228
to produce an ä in your code without using any non-7-bit characters. For every character you want to convert this way, you can look up the code with:
$ä asciiValue => 228
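
For example, a string literal containing ä can be rebuilt from pure ASCII like this (a sketch; 'Bär' is just an example word):

| umlautA |
umlautA := Character value: 228.  "the code point of ä"
^ 'B', umlautA asString, 'r'      "=> 'Bär'"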
I know this is not the kind of answer some would want to get, but Monticello is one of those things that still needs to be adjusted for proper character encoding.

Eliminating non-convertible characters on encoding change from UTF-8 to Shift_JIS with Ruby 1.9

I need to write a CSV export program that internally uses UTF-8 text originating from user input via the web (so you can expect any characters). It's a Japanese system, so I need to encode to Shift_JIS.
Now, when I change UTF-8 into Shift_JIS, I get errors like:
Encoding::UndefinedConversionError (U+7E6B from UTF-8 to Shift_JIS):
I want to either a) eliminate the character, or b) map the character to some other character
(or simply, to string '(U+7E6B)')
It seems I could catch the exception and eliminate the offending character at the byte-string level, but there must be an easier way to do this.
What is the best way to do this conversion?
[Converting my follow-up comments on the question into an answer]
I found that encode takes options, and calling it with
:undef => :replace, :replace => "?"  # for UndefinedConversionError
has the desired effect. You can also specify:
:invalid => :replace  # for InvalidByteSequenceError
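
A minimal sketch of the full call (the sample string is a placeholder; the :fallback variant is one way to get the '(U+7E6B)'-style output asked about above):

utf8 = "tsunagu: \u7E6B"  # U+7E6B is valid UTF-8 but has no Shift_JIS mapping

# a) Replace unconvertible (and malformed) characters with a fixed string:
utf8.encode("Shift_JIS", invalid: :replace, undef: :replace, replace: "?")

# b) Or map each unconvertible character to its code point:
utf8.encode("Shift_JIS", fallback: ->(c) { "(U+%04X)" % c.ord })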
