I have a very strange Delphi compiler issue relating to Unicode characters.
I have a unit with this const definition:
const SLANG_SPANISH_ESP = 'Español';
When I compile this on my PC, the ñ gets converted to a single-byte ANSI encoding. I've used a hex viewer to examine the relevant files:
Within the pas source file, the ñ is encoded in UTF-8 as C3 B1.
Within the generated DCU file, the ñ is encoded as the single byte F1 (ANSI/Windows-1252).
All the other Delphi PCs within our group compile the DCU differently, generating the DCU file with the ñ encoded in UTF-8 as C3 B1.
This is just one example, but many of the non-ASCII characters suffer the same fate.
I have tried hard over the last couple of days to identify the cause, without success. I have ruled out the project files and source code as the cause, since we all work from the same SVN repository. I double-checked by manually copying the project folder from a colleague's PC.
I have looked through the Delphi settings for something that might affect this, also without success.
It's very frustrating and worrying to imagine that the same source code on different PCs compiles to different results. My only hope now is that someone from the community will be able to give me a clue.
I finally got to the bottom of this issue. It turns out that the .pas file in question, despite what the IDE was telling me, was saved with UTF-8 characters but without the BOM, a known issue/quirk with Delphi.
You can refer to Marco Cantu's blog on this issue: The Delphi Compiler and UTF-8 Encoded Source Code Files With no BOM
The reason the file did not have a BOM was that it was generated by an in-house tool. This tool has since been updated to output the BOM as well.
Finally, I discovered that, on a given machine, building the project with the IDE or externally via MsBuild.exe would yield different results. The IDE correctly interprets the unit as UTF-8, whereas MsBuild.exe interprets the unit as Ansi.
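For reference, here is a minimal sketch (plain C++; the helper names are hypothetical) of the kind of check such a generator can do so that it always emits the UTF-8 BOM:

#include <cstdio>
#include <string>

// Returns true if the file already starts with the UTF-8 BOM (EF BB BF).
bool HasUtf8Bom(const std::string& path)
{
    unsigned char b[3] = {0};
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return false;
    std::size_t n = std::fread(b, 1, 3, f);
    std::fclose(f);
    return n == 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF;
}

// When generating a unit, write the BOM first so the IDE and
// command-line builds agree that the file is UTF-8.
void WriteUtf8Bom(std::FILE* f)
{
    const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};
    std::fwrite(bom, 1, 3, f);
}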
I'm trying to do localization in a WiX installer. How can I fix the garbled words that appear in the installer properties? The language I defined is Japanese.
Windows Installer doesn't officially support codepage 65001 for UTF-8 -- mostly because of UI problems like this. Try using codepage 932 for ja-JP strings. Also, make sure you're setting the Package/#SummaryCodepage attribute (the .wxl file's code page sets Product/#Codepage).
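For illustration, a minimal sketch of where those attributes live (WiX 3 schema; the string ID and values are placeholders):

<!-- Product.wxs: SummaryCodepage covers the summary information stream -->
<Package InstallerVersion="300" Compressed="yes"
         Languages="1041" SummaryCodepage="932" />

<!-- ja-jp.wxl: Codepage here sets the database codepage (Product/@Codepage) -->
<WixLocalization Culture="ja-jp" Codepage="932"
                 xmlns="http://schemas.microsoft.com/wix/2006/localization">
  <String Id="WelcomeDlgTitle">(translated text)</String>
</WixLocalization>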
I am trying to update from TeeChart ActiveX 7 to TeeChart 2012/2013. My old TeeChart files were written with a version that saved Unicode strings coded with a "#U#" prefix. I wanted to use the DecodeUTF8String in the new "real-Unicode" version to convert these saved strings to ordinary Unicode strings. However, both the code and decode functions appear to have no effect. Am I missing something? How can I deal with this problem?
Would be great if I could get some help on this!
Since the new versions are truly Unicode, these functions no longer convert the text. However, I understand that in cases like yours it could still be necessary to perform that decoding.
We'll study whether we can offer a built-in way to decode strings encoded by older versions. In the meantime, the only way I can think of is to manually replace the encoded strings in your TeeChart .tee files (save them as text, not as binary) with decoded strings, so you'll have Unicode .tee files that should import correctly in newer TeeChart ActiveX versions.
My company's main application is mostly written in C++ (with some Delphi code and components). We are upgrading from RAD Studio 2007 to 2010 for the next release, starting in about a week. What do I need to know to ensure this upgrade goes smoothly?
Points I have thought of so far are:
Unicode. This one looks really complicated. Our app contains a horrible mix of std::string-s and AnsiString-s with casts to and from them. I have lots of questions about this, such as "is wstring capable of holding everything a UnicodeString can, and should we just do a search/replace", or "should we avoid all C++ string types altogether and use UnicodeString", "can we change all event handlers to use String though the existing .HPPs event handler method prototypes were compiler-translated to AnsiString", right down to basics such as "should we prefix all strings with L, or is the compiler smart enough with Unicode enabled to use Unicode strings", etc. Any insight on this would be really appreciated.
We also need backwards compatibility. Our app uses its own binary tuple format that currently stores strings as an array of bytes. I need to upgrade this to read old files and, presumably, write new Unicode strings as well. How do I handle Unicode strings embedded in a binary format? Is there any generic way where I can point a UnicodeString at an array of bytes, that may be originally written as either ANSI bytes or Unicode, and it will figure out what they are?
Third-party components. We use SpTBX mainly, and it appears to be compatible.
Project upgrades. The standard advice in the Codegear forums seems to be to manually recreate all project files when upgrading. This is an awful lot of work (7 projects (mostly libs) in our main app, plus half a dozen DLLs, a lot of files.) Is there any way to automate this?
How's the linker look? We traditionally have a lot of trouble with the linker randomly crashing or running out of resources, though it got a lot better in 2007. This is one reason our main application is split into several libs - the linker cannot (hopefully, "could not, but now can"?) handle it otherwise.
I know there's a new type library editor and format (it stores the IDL, ie text, and generates the TLB dynamically?) How well does this handle upgrading existing COM projects with a TLB? We have Delphi code and TLB that are built into the C++ application.
Is there anything else I should be considering or be aware of?
I have found:
2007 and 2010 co-existing. I'm not sure I trust this answer since I have had issues with 2006 and 2007 on the same machine before.
several answers about Unicode: writing strings with 2009 and a generic transition to Unicode text, but none address the concerns above or the C++Builder-specific parts at all.
This question about guidelines for upgrading to 2009; the answers are helpful, but they don't cover all the Unicode-related issues above.
[Edit: added] Codegear documents for Unicode in RAD Studio and things to look for when converting to Unicode
Project upgrades. The standard advice in the Codegear forums seems to be to manually recreate all project files when upgrading. This is an awful lot of work (7 projects (mostly libs) in our main app, plus half a dozen DLLs, a lot of files.) Is there any way to automate this?
There is: just use the IDE's project importer :)
Seriously, I would just try importing the projects, and then go investigate if it doesn't seem to work.
How's the linker look? We traditionally have a lot of trouble with the linker randomly crashing or running out of resources, though it got a lot better in 2007. This is one reason our main application is split into several libs - the linker cannot (hopefully, "could not, but now can"?) handle it otherwise.
I've had almost no trouble with ILINK since C++Builder 2009. I've occasionally read that others experienced out-of-memory errors, but someone in the newsgroups discovered a workaround:
https://forums.embarcadero.com/thread.jspa?messageID=140012&tstart=0#140012
Also, as you can read here, the compiler got a new option (-Cx) to control the maximum amount of memory it allocates.
I know there's a new type library editor and format (it stores the IDL, ie text, and generates the TLB dynamically?) How well does this handle upgrading existing COM projects with a TLB?
Should work without a hitch.
I have lots of questions about this, such as "is wstring capable of holding everything a UnicodeString can, and should we just do a search/replace"
Yes: on Windows, wchar_t is 16 bits wide, so std::wstring holds the same UTF-16 code units as UnicodeString.
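Conversion between the two is straightforward, since both expose a const wchar_t* view (a minimal sketch, assuming the VCL headers are available):

#include <string>
#include <System.hpp>   // System::UnicodeString

std::wstring w = L"Español";
System::UnicodeString u = w.c_str();   // UnicodeString constructs from const wchar_t*
std::wstring back(u.c_str());          // c_str() hands the buffer back as const wchar_t*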
or "should we avoid all C++ string types altogether and use UnicodeString"
Depends on how portable your code needs to be. In any case, whenever you just need a string type, use "String", not "UnicodeString".
"can we change all event handlers to use String though the existing .HPPs were compiler-translated to AnsiString"
First, you should NEVER re-use .hpp files generated by older versions of DCC!
For event handlers that use the String type in Delphi, you must use UnicodeString. As above, simply use "String", and your code will work for both the ANSI and Unicode versions of C++Builder.
right down to basics such as "should we prefix all strings with L, or is the compiler smart enough with Unicode enabled to use Unicode strings"
The compiler doesn't convert your string literals (that would conflict with the language standard), but both AnsiString and UnicodeString have constructor overloads for both char* and wchar_t* string literals. I.e., the following will work:
AnsiString as = L"foo";
UnicodeString us = "bar";
What will not work this way, though, is the whole family of printf()/scanf()-style functions: AnsiString::sprintf() takes const char*, while UnicodeString::sprintf() takes const wchar_t*.
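A small sketch of the difference (the wrapper function is just a container for the two calls):

#include <System.hpp>

void FormatExamples()
{
    System::AnsiString a;
    a.sprintf("%d items", 3);     // narrow (const char*) format string
    System::UnicodeString u;
    u.sprintf(L"%d items", 3);    // wide (const wchar_t*) format string
}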
If you are using sprintf() a lot, you may find my CbdeFormat library useful; just read my article on the subject.
You do not say what the data strings in your binary tuple format are for: is it necessary for them to store Unicode? When I transitioned from D2007 to D2009 I was able to keep some parts of the system ANSI-string only.
If storing Unicode is required, then you need to check whether your existing data is compatible with a format such as UTF-8. If the range of values stored in existing data files presents a problem, I would make your next upgrade do a one-time conversion of any old data files, reading in the old AnsiString data and writing it back as UTF-8 to a different file name or extension, or modifying the appropriate file header data. I have been versioning data files for a long time, precisely to allow this sort of processing change.
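A rough sketch of that one-time pass (the helper name is hypothetical; it assumes the old field held raw Ansi bytes):

#include <System.hpp>   // AnsiString, UnicodeString, UTF8String

// Re-encode one legacy Ansi field as UTF-8 for the new file version.
System::UTF8String ConvertLegacyField(const System::AnsiString& legacy)
{
    System::UnicodeString u = legacy;   // Ansi -> UTF-16, via the OS default codepage
    System::UTF8String u8 = u;          // UTF-16 -> UTF-8
    return u8;
}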
I am only just starting a BCB2010 project, so cannot comment on your other questions, but I certainly had difficulty upgrading a Delphi project from D2007 to D2009 - though I was able to fix this by editing the project file, which is just XML.
Good luck with the conversion ;-)
Unicode. This one looks really complicated. Our app contains a horrible mix of std::string-s and AnsiString-s with casts to and from them. I have lots of questions about this, such as "is wstring capable of holding everything a UnicodeString can, and should we just do a search/replace"
std::wstring holds wchar_t-based string data, just like System::UnicodeString does.
should we avoid all C++ string types altogether and use UnicodeString
That is up to you to decide. char* strings are still supported. You are not forced to migrate everything to Unicode.
can we change all event handlers to use String though the existing .HPPs were compiler-translated to AnsiString
No, you cannot change auto-managed event handlers to use the System::String alias. All IDE versions will complain about that. You will have to manually update your event handler declarations and implementations to use UnicodeString parameters instead of AnsiString parameters when appropriate. That also means you cannot share DFMs and Unit .h files across multiple IDE versions, either (which you should not be doing anyway).
should we prefix all strings with L, or is the compiler smart enough with Unicode enabled to use Unicode strings
No. If you declare a string constant or character constant without an L prefix, the data will still be interpreted as Ansi. That has not changed. You can, however, pass Ansi data to System::UnicodeString (but not to std::wstring), and it will convert to Unicode automatically. But be careful, because it will use the OS's default Ansi codepage to interpret the data. As long as your Ansi data uses only ASCII characters, you will be OK. Otherwise, if you are using non-ASCII characters, you are better off putting the data into a System::AnsiStringT or System::RawByteString (both were introduced in CB2009) that has been assigned the correct codepage, and then assigning that to your System::UnicodeString variable. The associated codepage will be used instead of the OS default codepage for the conversion.
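For example (a sketch; codepage 1251 and the sample bytes are purely illustrative):

#include <System.hpp>

// Bytes known to be Windows-1251 (Cyrillic). The template parameter tags them
// with that codepage, so the conversion to UTF-16 uses 1251, not the OS default.
System::AnsiStringT<1251> ru("\xCF\xF0\xE8\xE2\xE5\xF2");   // "Привет" in CP1251
System::UnicodeString u = ru;   // converted using codepage 1251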
We also need backwards compatibility. Our app uses its own binary tuple format that currently stores strings as an array of bytes. I need to upgrade this to read old files and, presumably, write new Unicode strings as well. How do I handle Unicode strings embedded in a binary format?
If your tuple is expecting 8-bit characters, then you will have to make sure that any struct declarations and such are using char and not wchar_t characters. If you need to store Unicode strings, but need to maintain the 8-bit compatibility, then you should encode your Unicode strings to UTF-8 first (you can use the System::UTF8String string type to help you - starting in CB2009, it is a true UTF-8 string now). As long as you do not use non-ASCII characters, then your old apps will not know the difference, as ASCII characters are encoded as-is in UTF-8. If you want to store raw Unicode data, however, then your tuple would need a flag somewhere (if it does not already have one) indicating whether the string data is stored as Ansi or Unicode, and your apps would have to look for that flag.
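A sketch of that UTF-8 round trip (function names hypothetical):

#include <System.hpp>   // UnicodeString, UTF8String

// Encode for storage: UTF-16 -> UTF-8 bytes. ASCII text comes out byte-for-byte
// unchanged, so old readers keep working for ASCII-only strings.
System::UTF8String ToTupleBytes(const System::UnicodeString& s)
{
    System::UTF8String u8 = s;
    return u8;
}

// Decode the raw bytes read back from the tuple: UTF-8 -> UTF-16.
System::UnicodeString FromTupleBytes(const char* bytes, int len)
{
    System::UTF8String u8(bytes, len);
    System::UnicodeString s = u8;
    return s;
}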
Is there any generic way where I can point a UnicodeString at an array of bytes, that may be originally written as either ANSI bytes or Unicode, and it will figure out what they are?
No. You have to know the actual encoding of the bytes beforehand. If you pass a memory address to System::AnsiString or std::string, it is going to assume Ansi characters. If you pass the same memory address to System::UnicodeString or std::wstring, it is going to assume Unicode characters instead.
Third-party components. We use SpTBX mainly, and it appears to be compatible.
Just like with all prior versions (except for the migration from 2006 to 2007), any third-party components you have will need to be re-compiled for 2010, either manually (if you have the source code for them) or by their respective vendors.
Project upgrades. The standard advice in the Codegear forums seems to be to manually recreate all project files when upgrading.
Yes. That still applies.
I know there's a new type library editor and format (it stores the IDL, ie text, and generates the TLB dynamically?)
.TLB files are not used at all anymore. The new system operates on .ridl (Reduced IDL) files. During compilation, the .ridl file produces the correct type library information directly in the executable's binary resources. No .tlb files are generated.
How well does this handle upgrading existing COM projects with a TLB? We have Delphi code and TLB that are built into the C++ application.
I do not remember whether CB2010 (or CB2009, for that matter) can consume pre-existing .tlb files directly. I don't think they can. You can, however, run the .tlb file through tlibimp.exe and it will export a .ridl file. Or you can copy the IDL text from the TLB editor in a previous version and paste it into a new .ridl file manually. Either way, you can then add that .ridl file to your CB2010 project.
2007 and 2010 co-existing. I'm not sure I trust this answer since I have had issues with 2006 and 2007 on the same machine before.
That is why I use virtual machines when installing multiple IDE versions on the same physical machine.
Is the cost of upgrading in line with the benefits?
Why not start a gradual upgrade in which new components are developed on the new platform? Integrate the new components into the old version via interop helpers.
This approach was suggested to VB6 developers who were thinking about upgrading to VB.NET.
I have developed about 300 applications which I would like to provide with multi-language capabilities, independent of the operating system. I have written a just-in-time translator, but it is too slow in applications with many components. What would you suggest I do?
We are using TsiLang and are very happy with it.
One of the best points is that you can pre-translate the project with a dictionary (which you can fill from existing translations).
I've heard that the TsiLang components are nice, but you're looking at an in-place solution...
I've used GNU gettext for Delphi, which does exactly what you want: it loads the translations from a text file and replaces the text in your components. It even has a pas/dfm scanner to automatically generate the English translation file.
It's also possible to automatically change your Pascal source code to inject the gettext call in place of your static strings. If I'm not mistaken, it just wraps them in an underscore function, as below.
ShowMessage('Hello'); // before
ShowMessage(_('Hello')); // after
I must say it has been 2 years since I last used this method.
One thing will remain problematic: the Delphi components are not Unicode-enabled (D2009 fixes this), so if you don't change the components you'll still have limited support for other languages.
A good free solution would be GNU gettext for Delphi. It has some capabilities not present in TsiLang - for example, you can put plural-form rules (different endings for one, two, four, a hundred and two, many ...) into the translation file, so that you don't have to teach each program this stuff.
The license for the Delphi part is very permissive, but I'm not sure how much the included GNU code will affect your application.
Get Multilizer. It is made in Delphi and it can handle Delphi programs like no other, with special support for the VCL. You can even redo your screens easily for every language. With Multilizer you can use different techniques to translate and run your program.
Delphi 2009 added an Integrated Translation Environment (ITE) and an External Translation Manager (ETM).
ITE and ETM are now available for both Delphi and C++Builder.
In Codegear's article: What's New in Delphi and C++Builder 2009, they state:
The Integrated Translation Environment (ITE) is a part of the IDE that simplifies localizing your projects. ITE can create a new localized project from an existing project. ITE does not automatically translate text, but provides a dialog listing all text that needs to be localized and fields in which to enter the corresponding translated text. Once you have entered the translated text and built the localized project, you can set another language active and display a form in the localized text; you don't have to switch locales and reboot your system. This allows you to perform localization without requiring a localized system.
The External Translation Manager (ETM) is a standalone application that works with DFM files and text strings in the source code. Although ETM does not allow you to create a new localized project, it does provide a dialog listing localized text and the translated text, similarly to ITE.
This is what I plan to try first once I am at the point that I want to Internationalize my product.
However, to me the easy part is translating the program. The hard part is translating the help file.
I would say GNU gettext for Delphi in combination with the TMS Unicode Component Pack (previously free as TntWare) to get Unicode support in the components.
To work with, or have translators work with, the gettext files, I recommend looking at the free cross-platform Poedit, which can edit the .po files.
Just to mention cxLocalizer, if you own DevExpress components.