Stop Firebird modifying strings based upon Windows charset - character-encoding

I have an application (written in Delphi) using the Firebird 1.5.5 embedded engine. I use this engine because the application works with currently deployed Firebird databases, and newer embedded engines won't open the database files correctly (ODS 10.1). All strings in the database are defined as VARCHAR(N), where N varies. The application used to be an ANSI application, so the data contains ISO Latin-1 characters. Now the application has been upgraded to a Unicode app. In order to store Unicode characters in existing databases (around 10k instances), I write a UTF-8 BOM (if you can call it that) at the start of the string; the remainder of the string is then considered to be UTF-8 and is decoded as such by the database layer. This way we can use all the existing databases and still use all Unicode characters.
This works well for all machines in western Europe. But when the application is run in Romania (a Windows PC with Romanian language settings), the database engine alters the characters. For example: the UTF-8 character string starts with the octet EF (ï). The database engine returns it as the octet 69 (i).
How can this problem be solved for existing databases?
NB: I tried to specify a character set OCTETS when opening the database (using UIB library) but this fails as the charset is unknown.
I found out that the problem lies within UIB (the database layer used in this case). UIB handles csNONE in such a way that if you give it a bytewise string (datatype AnsiString), it converts it to a UnicodeString by simply expanding the bytes to words, and later reduces it again using the current thread's codepage. Since Romania does not use ISO Latin-1 as its codepage, the data is corrupted there.
For now I have changed the following routine in UIBLib (i.e. when an AnsiString is given, the charset is NONE, and an AnsiString parameter is requested, no conversion is done at all):
procedure TSQLDA.EncodeStringA(Code: Smallint; Index: Word; const str: AnsiString);
begin
{$IFDEF UNICODE}
  if FCharacterSet = csNONE then begin // new
    EncodeStringB(Code, Index, str); // new
  end else begin // new
    EncodeStringB(Code, Index, MBUEncode(UniCodeString(str), CharacterSetCP[FCharacterSet]));
  end; // new
{$ELSE}
  EncodeStringB(Code, Index, str);
{$ENDIF}
end;
Now I need to check if this behavior is correct for the library and give the maintainer a patch.
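The conversion path described above can be tried outside UIB. This is only a minimal sketch, not UIB's actual code: the program name and the helper RoundTripByte are hypothetical, and it assumes that the narrowing step behaves like a plain WideCharToMultiByte call with the given code page (Windows only). It widens the byte $EF to a UTF-16 code unit and narrows it again, once through code page 1252 (Western European) and once through code page 1250 (the ANSI code page of a Romanian Windows).

```pascal
program BestFitSketch;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper: widen one byte to a UTF-16 code unit (as described
// for the csNONE path), then narrow it again through the given code page.
function RoundTripByte(B: Byte; CodePage: Cardinal): Byte;
var
  W: WideChar;
  A: AnsiChar;
begin
  W := WideChar(B);      // byte expanded to a word, e.g. $EF -> U+00EF
  A := AnsiChar($3F);    // '?' as a fallback if the call fails
  WideCharToMultiByte(CodePage, 0, @W, 1, @A, 1, nil, nil);
  Result := Ord(A);
end;

begin
  // First value: round trip through 1252; second value: through 1250.
  Writeln(Format('%x %x', [RoundTripByte($EF, 1252), RoundTripByte($EF, 1250)]));
end.
```

If the second value differs from EF, the byte did not survive the round trip, which is the kind of change reported in the question.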

Related

Delphi 10: What can I do about String[25] changing some char values to "?"

I have ported a Delphi 7.1 application to Delphi 10.3.
I have some simple encrypting/decryption functions.
And if I encrypt string values and decrypt them, everything is fine:
var
  test, encrypted, decrypted : string;
begin
  test := 'XXXXXXXX'; // hidden message
  encrypted := _common.encrypt(test);
  decrypted := _common.decrypt(encrypted);
end;
in this scenario, everything works as expected, even with special characters, encrypted would be: 'y'#$0080'vn'
but if the value is of string[25], it handles special characters differently:
var
  test, decrypted : string;
  encrypted : string[25];
begin
  test := 'XXXXXXXX'; // hidden message
  encrypted := _common.encrypt(test);
  decrypted := _common.decrypt(encrypted);
end;
in this scenario, everything works as expected unless the encrypted string contains special characters; in this example the encrypted value would be: 'y?vn'
I'm using string[] in records, when writing/reading data to/from disk
How can I fix this?
Can I use a different string type for the record type, or ?
/Flemming
Since Delphi 7, the string type has changed from one-byte ANSI characters to two-byte Unicode characters. However, the fixed-length string[n] still is a one-byte ANSI string. Therefore, you are mixing different string types. The easiest fix might be to switch those variables which you declare as string to a declaration as AnsiString instead.
The reason you don't get the same results in both code examples is that the first example relies entirely on the default string type, which in Delphi 10.3 is UnicodeString (two bytes per character).
But in your second code example you declare your result as string[25], which is a short string type. Unlike the regular string type, ShortString can only contain single-byte characters; in other words, it only holds ANSI data, like AnsiString, which was the default string type in Delphi 7.
So you don't get the same results because you are mixing two different string types.
Anyway, the general rule when dealing with encryption and decryption is not to work with strings at all, but to work with raw binary data instead. Why?
In the days of Delphi 7, strings were affected by the string encoding currently in use. So if you encrypted a string on a computer that used one encoding and decrypted it on a computer that used another, you would get the wrong result.
On modern Delphi versions, which use UnicodeString with Unicode encoding, this is no longer such a problem, but there is another potential problem: on Windows strings are 1-based (the index of the first character is 1), while on mobile platforms strings are 0-based (the index of the first character is 0).
So I strongly recommend you redesign your encryption/decryption routines to work on raw binary data instead.
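As a rough illustration of working on raw bytes, here is a minimal sketch for Delphi 2009 or later (it relies on TBytes and TEncoding from SysUtils). XorBytes is a hypothetical stand-in for the real cipher and is not secure; the point is only that the text is turned into bytes once, the cipher sees nothing but bytes, and no string type conversion can touch the encrypted data.

```pascal
program RawBytesCipherSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical placeholder cipher: XOR every byte with one key byte.
// Applying it twice with the same key restores the original bytes.
function XorBytes(const Data: TBytes; Key: Byte): TBytes;
var
  I: Integer;
begin
  SetLength(Result, Length(Data));
  for I := 0 to High(Data) do
    Result[I] := Data[I] xor Key;
end;

var
  Plain, Decoded: string;
  Cipher: TBytes;
begin
  Plain := 'h'#$00E9'llo';   // contains a non-ASCII character
  // Text -> UTF-8 bytes -> "encrypted" bytes
  Cipher := XorBytes(TEncoding.UTF8.GetBytes(Plain), $5A);
  // "Decrypted" bytes -> text
  Decoded := TEncoding.UTF8.GetString(XorBytes(Cipher, $5A));
  Writeln(Decoded = Plain);  // prints TRUE when the round trip is lossless
end.
```

In a record that is written to disk, the encrypted value could then be kept in a fixed byte array (for example array[0..24] of Byte plus a length field) instead of string[25], so that it is never treated as text.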

Lazarus. Equivalent to Chr() for Unicode symbols

Is there any function in freepascal to show the Unicode symbol by its code (e.g. U+1D15E)? Unfortunately Chr() works only with ANSI symbols (with codes less than 127).
I want to use symbols from custom symbolic font and it is very inconvenient to put them into sourcecode directly (they are shown in Lazarus as ? or something else because they are absent in system fonts).
Take a look at this page. I assume that Freepascal either uses UTF-16, in which it becomes a surrogate pair of two WideChars (see table) or UTF-8, in which it becomes a sequence of byte values (see table again).
UTF-8:
const
  HalfNoteString = UTF8String(#$F0#$9D#$85#$9E);
UTF-16:
const
  HalfNoteString = UnicodeString(#$D834#$DD5E);
The names of the string types may differ, as I don't know FreePascal very well. Perhaps AnsiString and WideString.
I have never used Free Pascal, but if I were you, I'd try
var
  s: char;
begin
  s := char($222b); // Just cast a word
or, if the compiler is really stubborn,
var
  s: char;
begin
  PWord(@s)^ := $222b; // Forcibly write a word
Current unicode status of FPC to my best knowledge
The codepage of literals can be set with $codepage http://www.freepascal.org/docs-html/prog/progsu81.html
FPC 2.4.x+ does have unicodestring (since it is +/- Kylix widestring), but only basic routine support (pos and copy, not routines like format), and the "record" misses the codepage field.
Lazarus widgets expect UTF8 in normal ansistrings (D7..D2007 ansistrings without codepage data), and programmers must manually insert conversions if necessary. So on Windows the widgets ARE mostly using unicode (-W) calls, but take ansistrings with UTF8 in them.
FPC doesn't follow the UTF8-in-ansistring scheme, so for some string-accepting routines in sysutils there are special routines in Lazarus that assume UTF8 (and that call the -W variants).
FPC ansistring is the system default 1-byte encoding. ansi on Windows, utf8 on most other platforms.
Trunk, 2.7.1, provides support for the new D2009+ ansistring (with codepages).
There has been no discussion yet how to deal with the default stringtype (e.g. will "string" be utf8string on *nix and unicodestring on Windows, or unicodestring or utf8string everywhere?)
Other unicodestring-related enhancements (like encoding parameters to tstringlist.savetofile) are not implemented. Likewise for the pseudo objects (like TCharacter, which are afaik mostly static).
Update: 2.7.1 has a variable encoding ansistring type, and lazarus has been fixed to keep working. Nothing is really taking advantage from it yet though, e.g. most of the RTL still uses -A calls, and prototypes of sysutils and system procedures that takes strings haven't changed to rawbytestring yet.
I assume the problem is to convert from UCS4 encoding (which is actually a Unicode codepoint number) to UTF16.
In Delphi, you can use UCS4StringToUnicodeString function.
Warning: Be careful with UCS4String type. It is actually a zero-terminated dynamic array, not a string (that means it is zero-based).
var
  S1: UCS4String;
  S: string;
begin
  SetLength(S1, 2);
  S1[0] := UCS4Char($1D15E);
  S1[1] := UCS4Char(0);
  S := UCS4StringToUnicodeString(S1);
  ShowMessage(Format('%d, %x, %x', [Length(S), Ord(S[1]), Ord(S[2])]));
end;
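If no ready-made conversion routine is available, the UTF-16 form of a code point can also be computed by hand. The following is a minimal sketch; CodePointToUtf16 is a hypothetical helper, not an RTL function, and it assumes a compiler with WideString (Delphi, or Free Pascal in Delphi mode). It does no range checking beyond splitting code points above $FFFF into a surrogate pair.

```pascal
program CodePointToUtf16Sketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: encode one Unicode code point as UTF-16.
// Values above $10FFFF or inside the surrogate range are not rejected here.
function CodePointToUtf16(CodePoint: Cardinal): WideString;
var
  V: Cardinal;
begin
  if CodePoint < $10000 then
  begin
    // Basic Multilingual Plane: one code unit
    SetLength(Result, 1);
    Result[1] := WideChar(CodePoint);
  end
  else
  begin
    // Supplementary plane: subtract $10000, then split the 20 remaining bits
    V := CodePoint - $10000;
    SetLength(Result, 2);
    Result[1] := WideChar($D800 + (V shr 10));    // high surrogate
    Result[2] := WideChar($DC00 + (V and $3FF));  // low surrogate
  end;
end;

var
  S: WideString;
begin
  S := CodePointToUtf16($1D15E);
  // For U+1D15E the two code units are $D834 and $DD5E
  Writeln(Format('%d %x %x', [Length(S), Ord(S[1]), Ord(S[2])]));
end.
```

The two code units computed for U+1D15E are the same surrogate pair as in the UTF-16 constant shown earlier in this thread.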

FormatDateTime with chinese location - wrong characters... Delphi 2007

Output: Period: from 11-Ê®¶þÔÂ-10 to 13-Ê®¶þÔÂ-10
The above output is from a line like this:
FormatDateTime('dd-mmm-yy', dateValue)
The IDE is Delphi 2007 and we are trying to gear up our app to the Chinese market.
How can I display the correct characters?
With the setting turn to Hindi (India), instead of the funny characters I have the "?".
I'm trying to display the date on a report, using ReportBuilder 11.
Any help will be much appreciated.
The characters seem to be correct, only IMO they have been rendered wrong.
Here's what I've done:
copied the string as presented by the OP ("11-Ê®¶þÔÂ-10 to 13-Ê®¶þÔÂ-10");
pasted it into a blank plain-text editor window with CP 1252 (Windows Latin-1) and saved;
opened the text file in a browser;
the text showed up the same, as the browser chose the same codepage, so I turned on the automatic detection of character encoding, hinting to it that the content was Chinese;
the text changed to "11-十二月-10 to 13-十二月-10" (I hope your browser displays the correct Chinese characters here; mine does anyway) and the codepage changed to GB18030 (I then tried GB2312, but the text wouldn't change);
well, I was curious and searched for "十二月", and it turned out to stand for "December", quite suitable for the context unless the month names had been mixed up.
So, this is why I think it's a text rendering (or whatever you call it, I'm not really sure about the term) problem.
EDIT: Of course, it must have had something to do with the data type chosen for storing the string. If the function result is AnsiString and the variable is WideString, then maybe the characters get converted as WideChars and so they are no longer one-byte compounds of multi-byte characters but are multi-byte characters on their own? At least that's what happened when the OP posted them here.
I don't know actually, but if it is so then I doubt if they can be rendered correctly unless converted back and rendered as part of an AnsiString.
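The experiment above can be repeated in code. This is a minimal, Windows-only sketch under a few assumptions: DecodeBytes and DumpCodeUnits are hypothetical helpers, the six bytes are the Latin-1 values of the characters "Ê®¶þÔÂ" from the question, and code page 936 (Simplified Chinese) must be installed for the second conversion to succeed. It decodes the same bytes once as code page 1252 and once as code page 936.

```pascal
program MojibakeSketch;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

const
  // The bytes behind "Ê®¶þÔÂ" when read as Windows-1252
  Raw: array[0..5] of Byte = ($CA, $AE, $B6, $FE, $D4, $C2);

// Hypothetical helper: interpret a byte buffer in the given code page.
function DecodeBytes(Buf: PAnsiChar; Len: Integer; CodePage: Cardinal): WideString;
var
  WideLen: Integer;
begin
  Result := '';
  // First call only asks how many UTF-16 code units are needed
  WideLen := MultiByteToWideChar(CodePage, 0, Buf, Len, nil, 0);
  if WideLen <= 0 then
    Exit; // conversion failed, e.g. code page not installed
  SetLength(Result, WideLen);
  MultiByteToWideChar(CodePage, 0, Buf, Len, PWideChar(Result), WideLen);
end;

// Hypothetical helper: print each UTF-16 code unit as four hex digits.
procedure DumpCodeUnits(const Caption: string; const S: WideString);
var
  I: Integer;
begin
  Write(Caption);
  for I := 1 to Length(S) do
    Write(' ', IntToHex(Ord(S[I]), 4));
  Writeln;
end;

begin
  // Six Latin-1 characters: 00CA 00AE 00B6 00FE 00D4 00C2
  DumpCodeUnits('1252:', DecodeBytes(PAnsiChar(@Raw[0]), Length(Raw), 1252));
  // Three Chinese characters: 5341 4E8C 6708 (十二月)
  DumpCodeUnits(' 936:', DecodeBytes(PAnsiChar(@Raw[0]), Length(Raw), 936));
end.
```

So the bytes themselves are fine; only the code page used to interpret them decides whether they show up as "十二月" or as "Ê®¶þÔÂ".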
Another solution is to use TntControls. They're a set of standard Delphi controls enhanced to support Unicode. You'll have to go through all your form files and replace
Button1: TButton
Label1: TLabel
with TTntButton, TTntLabel et cetera.
Please note, that as things stand, it's not only Chinese which will not work. Try any language using symbols other than standard European set (latin + stress marks etc), for instance Russian.
But
By replacing the controls, you'll solve one part of the problem. Another part is that everywhere where you use "string" or "AnsiString" and "char/pchar" or "AnsiChar/PAnsiChar", you can store only strings in default system encoding.
For instance, if your system encoding ("Language for non-unicode programs") is EN/US, Russian characters will be replaced with question marks when you assign them to "string" variable:
a: WideString;
b: string;
...
a := 'ЯУЭФЫЦ'; //WideString can store international characters
b := a; //string cannot, so the data is lost - you cannot restore it from just "b"
To store string data which is independent of system encoding, use WideString/WideChar/PWideChar and appropriate functions. If you have
a, b: WideString;
...
a := UpperCase(b);
then unicode information will still be lost because UpperCase() accepts "string":
function UpperCase(const S: string): string;
Your WideString will be converted to "string" (losing all international characters), given to UpperCase, then the result will be converted back to WideString but it's already too late.
Therefore you have to replace all string functions with Wide versions:
a := WideUpperCase(b);
(for some functions, their wide versions are unavailable or called differently, TntControls also contain a bunch of wide function versions)
The Chinese Market requires support for multi-byte character sets (either WideChar or Unicode).
The Delphi 2007 RTL/VCL only supports single-byte character sets (there is very limited support for WideChar in the RTL and VCL).
The easiest option for you is to upgrade to a Delphi version that supports Unicode (Delphi 2009 was the first version that supports Unicode; the current Delphi version is Delphi XE).
Or you will need to update all your components to support WideChar, and rewrite the portions of RTL/VCL for which you need WideChar support.
--jeroen
Did you install Far East charset support in Windows? In Windows versions before 7 (or Vista) those charsets are not installed by default in Western versions; you have to add them in Control Panel -> Regional Settings, IIRC.
Unfortunately, with a non-Unicode version of Delphi, which characters can be displayed depends on the current codepage. If it is not one of the Chinese ones, for example, it cannot display the characters you need. Which characters are actually displayed depends on how the codes you're using are mapped in the current codepage. You could use a multi-lingual version of Windows to switch fully to the locale you need, or you have to use a Unicode version of Delphi (from 2009 onwards).

Is there some advantage in use resourcestring instead of a const string?

Would you tell me if there is some advantage (less storage space, increased speed, etc.) in using:
resourcestring
  MsgErrInvalidInputRange = 'Invalid Message Here!';
instead of
const
  MsgErrInvalidInputRange : String = 'Invalid Message Here!';
The const option will be faster than resourcestring, because the latter will call the Windows API to get the resource text.
You can make it faster by using some caching mechanism. This is what we do in our Enhanced Delphi RTL.
And if you have to access the content of a resourcestring many times, it's a good idea to load it into a string variable first.
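A minimal sketch of that last suggestion follows. The names SInvalidRange and Cached are only illustrative; the idea is simply that the resource lookup happens once, on the assignment, and the loop then works with an ordinary string variable.

```pascal
program ResStringCacheSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

resourcestring
  SInvalidRange = 'Invalid Message Here!';

var
  Cached: string;
  I: Integer;
begin
  // The resource text is fetched here, once
  Cached := SInvalidRange;
  // Later uses read the local copy instead of the resource
  for I := 1 to 3 do
    Writeln(I, ': ', Cached);
end.
```

Whether this makes a measurable difference depends on how often the string is used; for a message shown once, it will not matter.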
The main point of resourcestring is to allow i18n (internationalization) of your program.
You've got the Translation Manager with some editions of the Delphi IDE. But it relies on external DLL.
You can use the gettext system, coming from the Linux world, from http://dxgettext.po.dk which relies on external .po files.
We included our own i18n mechanism in our framework, which translates and caches the resourcestring text, and relies on external .txt files (you can use UTF-8 or Unicode text files, from Delphi 6 up to XE). The caching make it quite as fast as the const usage. See http://synopse.info/fossil/finfo?name=SQLite3/SQLite3i18n.pas
There are other open source or commercial solutions around.
As for storage size, resourcestrings are stored as UCS-2 buffers. So before Delphi 2009 a resourcestring uses more memory than a string. Since Delphi 2009, all strings are UnicodeString, i.e. UTF-16, so the difference in storage size is small. In any case, the storage size of text is not the biggest size factor for an application (bitmaps and code size have a much bigger effect on the final exe).
Resource strings are stored as STRINGTABLE entries in your exe resource, consts are stored as part of the fixed data segment. Since they're part of the resource section you can extract them and the DFMs, translate them, and store them in a resource module (data-only DLL). When a Delphi app starts, it looks for that DLL and will use the strings from it instead of the ones included in your EXE to load translations.
The Embarcadero docwiki covers using the Translation Manager, but a lot of other Delphi translation tools use resource strings too.
As others have mentioned, resourcestring strings will be included in a separate resource within your exe, and as such have advantages when you need to cater for multiple languages in the UI of your app.
As some have mentioned as well, const strings are included in the data section of your app.
Up to D2007
In Delphi versions up to D2007, const strings were stored as Ansi strings, requiring a single byte per character, whereas resource strings would be stored in UTF-16: the windows default encoding (though perhaps not for Win9x). IIRC D2007 and prior versions didn't support UTF-8 encoded unit files. So any strings coded in your sources would have to be supported by the ANSI code pages, and as such probably didn't go beyond the Unicode Basic Multilingual Plane. Which means that only the UCS-2 part of UTF-16 would be used and all strings could be stored in two bytes per character.
In short: up to D2007 const strings take a single byte per character, resource strings take two bytes per character.
D2009 and up
Delphi was unicode enabled in version D2009. Since then things are a little different. Resourcestring strings are still stored as UTF-16. No other option here as they are "managed" by Windows.
Consts strings however are a completely different story. Since D2009 Delphi stores multiple versions of each const string in your exe. Each version in a different encoding. Const can be stored as Ansi strings, UTF-8 strings and UTF-16 strings.
Which of these encodings is stored depends on the use of the const. By default UTf-16 will be used, as that is the default internal Delphi encoding. Assign the same const to a "normal" (UTF-16) string, as well as to an AnsiString variable, and the const will be stored in the exe both UTF-16 and Ansi encoded...
De-duping
By the looks of it (experimenting with D5 and D2009), Delphi "de-dupes" const strings, whereas it doesn't do this for resourcestring strings.
With resourcestring, the compiler places those strings as a stringtable resource in the executable, allowing anyone (say your translation team) to edit them with a resource editor without needing to recompile the application, or have access to the source code.
There's also a third option, which is:
const
  MsgErrInvalidInputRange = 'Invalid Message Here!';
The latter should be the most performant one, because it tells the compiler not to allocate space in the data segment; it could put the string in the code segment. Also remember that what can be done with typed constants depends on the $WRITEABLECONST directive, although I do not know exactly what the compiler does when it is on or off.

read unicode output of console application

I have a console app written in Delphi 2010. Its output supports Unicode (I used UTF8Encode and SetConsoleOutputCP(CP_UTF8) for this). When I run the program from the command prompt, it works fine.
Now I want to read the output from another program, which was created in Delphi 5. I use this method. But I have problems with Unicode characters.
Does anyone have a recommendation for reading the Unicode output of a console app from Delphi 5?
Delphi 5 does have unicode support, but only through WideStrings which are UTF-16(-LE) encoded. Natively, D5 does not have UTF-8 support.
You can read the output of your D2010 console app in the way you already do, although I would take out the OemToAnsi conversion. OEMToAnsi was superseded (even in D5 days) by OEMToChar which can be used to convert OEM characters to Ansi (single byte characters using various code pages) or WideString (UTF-16-LE Unicode), but it won't do a thing to interpret the UTF-8 bytes coming in and might just mess things up.
What you need is a set of functions that can take all the "raw" utf-8 bytes you have read from the pipe and convert them to (UTF-16-LE encoded) WideStrings which you can then feed to a control that can take in and show WideStrings. Alternatively you could look for a control that does the "raw" byte interpretation and conversion all itself, but I must admit I haven't seen any let alone one that still supports D5.
A library that can convert many different encodings and still supports D5 is DIUnicode: http://www.wikitaxi.org/delphi/doku.php/products/unicode/index
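Without an extra library, the Windows API can also do this conversion. The following is a minimal sketch, assuming a Windows version that knows code page 65001 (UTF-8); Utf8BufferToWide and the constant Utf8CodePage are hypothetical names (the constant is declared locally in case the Windows unit of an old Delphi version does not define CP_UTF8), and the five sample bytes stand in for data read from the pipe.

```pascal
program Utf8PipeSketch;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

const
  Utf8CodePage = 65001; // numeric value of CP_UTF8
  // Sample data: 'caf' followed by the two-byte UTF-8 sequence for U+00E9
  Sample: array[0..4] of Byte = ($63, $61, $66, $C3, $A9);

// Hypothetical helper: convert a buffer of raw UTF-8 bytes to a WideString.
function Utf8BufferToWide(Buf: PAnsiChar; Len: Integer): WideString;
var
  WideLen: Integer;
begin
  Result := '';
  if Len <= 0 then
    Exit;
  // First call only asks how many UTF-16 code units are needed
  WideLen := MultiByteToWideChar(Utf8CodePage, 0, Buf, Len, nil, 0);
  if WideLen <= 0 then
    Exit; // conversion not possible
  SetLength(Result, WideLen);
  MultiByteToWideChar(Utf8CodePage, 0, Buf, Len, PWideChar(Result), WideLen);
end;

var
  W: WideString;
begin
  W := Utf8BufferToWide(PAnsiChar(@Sample[0]), Length(Sample));
  // Five bytes become four UTF-16 code units; the last one is U+00E9
  Writeln(Length(W), ' ', IntToHex(Ord(W[Length(W)]), 4));
end.
```

When reading from a pipe, it seems safer to collect all bytes first and convert once at the end, because a multi-byte UTF-8 sequence can be split across two ReadFile calls. The resulting WideString still needs a control that can display it.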
You have two problems using Delphi 5 with unicode output.
The first is that TMemo does not support Unicode characters; you will need to find another control, such as the ones in the TMS Unicode Component Pack. However, this component pack does not support Delphi 5.
The second problem is with this part of the code:
repeat
  BytesRead := 0;
  ReadFile(ReadPipe, Buffer[0], ReadBuffer, BytesRead, nil);
  Buffer[BytesRead] := #0;
  OemToAnsi(Buffer, Buffer);
  AMemo.Text := AMemo.Text + String(Buffer);
until (BytesRead < ReadBuffer);
It reads the characters and places them into Buffer, which is a PChar (a single byte per character in D5), then typecasts this to a String, which is an AnsiString in D5.
Although I have not used D5 for years, the only type that I can remember that can handle unicode data in D5 is WideString.
I've changed some things as follows and it works fine:
In the console application, I no longer use SetConsoleOutputCP(CP_UTF8); I only use plain string output...
And in the other program (Delphi 5), I use this function without calling OemToChar(Buffer,Buffer).

Resources