Delphi XE, Firebird and UTF8 - delphi

I'm upgrading a D7 program to XE. Under Delphi 7 I had code like this:
ParamByName('Somefield').AsString := someutf8rawbytestring;
Under XE, if someutf8rawbytestring contains Unicode characters such as Cyrillic script, they appear as ???? in the DB.
I can see that someutf8rawbytestring is 8 characters long for my 4-character string, which is correct, but in the DB there are just four characters.
I'm using Firebird 2 through TIBQuery with XE, updating a VARCHAR field with character set NONE.
So it looks like the UTF-8 is being detected and converted back to Unicode code points, and then the string conversion for the DB is failing. I've tried setting the VARCHAR field to UTF8 encoding, but with the same result.
So how should this be handled?
EDIT: Using a database tool, I can edit my DB field to contain some non-ASCII data, and when I read it back it comes out as a UTF-8 encoded string that I can run UTF8Decode on, and it's correct. But writing data back to this field seems impossible without getting a bunch of ???? in the DB. I've tried ParamByName('Somefield').AsString := somewidestring; and ParamByName('Somefield').AsWideString := somewidestring; and I just get rubbish in the DB.
EDIT2: Here's the code (in one iteration) ...
procedure TFormnameEdit.savename(id: integer);
begin
  with DataModule.UpdateNameQuery do
  begin
    ParamByName('Name').AsString := UTF8Encode(NameEdit.Text);
    ParamByName('ID').AsInteger := id;
    ExecSQL;
    Transaction.Commit;
  end;
end;

As @Lightbulb recommended, adding lc_ctype=UTF8 to the TIBDatabase params solved the problem.
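For reference, a minimal sketch of that fix (assuming an IBX TIBDatabase component named Database; lc_ctype is the Firebird client character set parameter):

```pascal
// Tell the Firebird client library to talk to the server in UTF8,
// so IBX can transliterate strings correctly on the way in and out.
Database.Params.Add('lc_ctype=UTF8');
Database.Connected := True;
```

With the connection character set declared, the driver handles transliteration between Delphi's native strings and the database encoding, rather than writing raw bytes.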

Related

How to write txt-file in blob

There is a Firebird table with two blob fields: a blob_binary field (subtype = 0) and a blob_Text field (subtype = 1, UTF-8).
The DB has UTF-8 encoding. The connection has UTF-8 encoding.
The Delphi version is 10.2.3. I use FireDAC components for data access. The server is Firebird 3.
The app must write data from a text file (UTF-8) to both blob fields of the "Content" table.
The text file that I must write into the blobs contains text in English, Russian and Georgian (see image).
Project and DB files, with editing permission
The code below writes text into the binary blob field, but the characters are strange (not ??? symbols; maybe ANSI characters?).
Code for saving the text file into the Blob_Binary field:
ID := Query1.FieldByName('Content_id').AsInteger;
OpenDialog1.Execute;
Query1.Close;
Query1.SQL.Text := 'SELECT * FROM content WHERE Content_id = :id';
Query1.Params[0].AsInteger := ID;
Query1.Open;
Query1.Edit;
(Query1.FieldByName('BLOB_BINARY') as TBlobField).LoadFromFile(OpenDialog1.FileName);
Query1.Post;
When I save the text file into the binary blob field:
1) if the text file was saved in UTF-8-with-BOM encoding, I get normal text in the binary blob, and
2) I get strange characters if the text file was saved as UTF-8 without BOM.
But when I use the same code to write data into the text blob field, the data appears as strange, Chinese-looking characters (see image).
What am I doing wrong? How do I correct this code so that UTF-8 characters are written to both fields?
I tried other solutions, but the result is the same. For example:
ID := Query1.FieldByName('Content_id').AsInteger;
OpenDialog1.Execute;
Query1.Close;
Query1.SQL.Text := 'UPDATE content SET Blob_Text = :Blob_Text ' +
  'WHERE Content_id = :id';
Query1.Params[0].DataType := ftBlob;
Query1.Params[0].AsStream := TFileStream.Create(OpenDialog1.FileName, fmOpenRead);
Query1.Params[1].AsInteger := ID;
Query1.ExecSQL;
Update 1: As I realised, if I save the txt file as "Unicode" in Notepad (or "UCS-2 LE BOM" in Notepad++), it is saved fine in the text blob and the Chinese characters disappear. Similarly, the txt file is saved fine in the binary blob if it is in UTF-8-with-BOM encoding. Still, it's very inconvenient not to be able to save the file in plain UTF-8.
What you're seeing is known as mojibake, caused by interpreting text in a different encoding than the one it was originally written in. When you get random CJK (Chinese/Japanese/Korean) characters, it usually comes from incorrectly interpreting 8-bit (ASCII, ANSI, UTF-8, etc) encoded text as UTF-16. Have a look at your string types and the string types going into and coming out of the database, and check for compiler warnings about ANSI and Unicode string type mismatches, and you should be able to get to the bottom of this fairly quickly.
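As a hypothetical illustration of that failure mode: reinterpreting UTF-8 bytes as UTF-16 in Delphi produces exactly this kind of CJK-looking garbage.

```pascal
uses
  System.SysUtils;

var
  Utf8Bytes: TBytes;
  Garbled: string;
begin
  // The UTF-8 encoding of a 4-letter Cyrillic word is 8 bytes.
  Utf8Bytes := TEncoding.UTF8.GetBytes('тест');
  // Reinterpreting those same 8 bytes as UTF-16LE pairs them up into
  // 4 code units that land in the CJK/Hangul ranges - mojibake.
  Garbled := TEncoding.Unicode.GetString(Utf8Bytes);
end;
```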
I had the same bug with ADOQuery and a Firebird 2.5 blob field, sub_type 1 (text).
String fields were converted fine; blobs were not.
If I changed the connection to IBX, everything worked fine.
Solved by:
SettingsTEXT.AsString := UTF8Decode(SettingsTEXT.AsString);
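Going the other way (writing a UTF-8 text file into the text blob with FireDAC), a hedged sketch is to decode the file into a native Delphi string first and let the driver transliterate it; TFile.ReadAllText and TFDParam.AsWideMemo are assumptions here, so check them against your FireDAC version:

```pascal
uses
  System.SysUtils, System.IOUtils;

var
  Text: string;
begin
  // Decode the file explicitly as UTF-8 into a native UTF-16 string...
  Text := TFile.ReadAllText(OpenDialog1.FileName, TEncoding.UTF8);
  // ...then bind it as a wide memo so the driver converts it to the
  // connection character set instead of writing raw file bytes.
  Query1.ParamByName('Blob_Text').AsWideMemo := Text;
  Query1.ExecSQL;
end;
```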

Anydac TADTable component collation issue

I'm having an issue with sorting strings that have special characters like ^ and ! in a Firebird database.
When using the TADTable component with the following settings and a table that uses collation unicode_ci_ai
CachedUpdates := false;
FetchOptions.Unidirectional := false;
FetchOptions.CursorKind := ckAutomatic;
FetchOptions.Mode := fmOnDemand;
FormatOptions.SortOptions := [soNoCase];
The server will put strings that start with ^ before strings that start with !, but TADTable does the opposite. This results in duplicates when bringing down the records.
I'm looking for best practice when sorting strings with special characters. I have to use TADTable (legacy system) and Live Data Window mode for speed.
Thank you.
This most likely has to do with the database connections having different default character encodings. See Firebird Character Sets and Collations.
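A hedged sketch of aligning the client character set with the database (AnyDAC/FireDAC use a CharacterSet connection parameter; the component name here is an assumption):

```pascal
// Make the client connection use the same character set as the
// database, so client-side and server-side collation ordering agree.
ADConnection1.Params.Add('CharacterSet=UTF8');
ADConnection1.Connected := True;
```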

Why does ReadLn mis-interpret UTF8 text when non-unicode page is Korean (949)?

In Delphi XE2, I can only read and display Unicode characters (from a UTF-8 encoded file) correctly when the system locale is English, using the AssignFile and ReadLn() routines.
Where it fails
If I set the system locale for non-Unicode applications to Korean (codepage 949, I think) and repeat the same read, some of my UTF-8 multi-byte pairs get replaced with $3F. This only happens with ReadLn, not with TFile.ReadAllText(aFilename, TEncoding.UTF8) or TFileStream.Read().
The test
1. I create a text file, UTF-8 without BOM (Notepad++), with the following characters (hex equivalent shown on the second line):
테스트
ed 85 8c ec 8a a4 ed 8a b8
2. Write a Delphi XE2 Windows forms application with a TMemo control:
procedure TForm1.ReadFile(aFilename: string);
var
  gFile: TextFile;
  gLine: RawByteString;
  gWideLine: string;
begin
  AssignFile(gFile, aFilename);
  try
    Reset(gFile);
    Memo1.Clear;
    while not EOF(gFile) do
    begin
      ReadLn(gFile, gLine);
      gWideLine := UTF8ToWideString(gLine);
      Memo1.Lines.Add(gWideLine);
    end;
  finally
    CloseFile(gFile);
  end;
end;
I inspect the contents of gLine before performing the UTF8ToWideString conversion, and under English/US-locale Windows it is:
$ED $85 $8C $EC $8A $A4 $ED $8A $B8
As an aside, if I read the same file with a BOM, I get the correct 3-byte preamble, and the output after the UTF-8 decode is performed is the same. All OK so far!
3. Switch Windows 7 (x64) to use Korean as the codepage for applications without Unicode support (Region and Language --> Administrative tab --> Change system locale --> Korean (Korea)). Restart the computer.
4. Read the same file (UTF-8 without BOM) with the above application; gLine now has the hex value:
$3F $8C $EC $8A $A4 $3F $3F
Output in TMemo: ?�스??
My hypothesis is that ReadLn() (and Read(), for that matter) attempts to map the UTF-8 sequences as Korean multi-byte sequences (i.e. it tries to interpret $ED $85, can't, and so substitutes a question mark, $3F).
5. Use TFileStream to read in exactly the expected number of bytes (9 without BOM), and the hex in memory is now exactly:
$ED $85 $8C $EC $8A $A4 $ED $8A $B8
Output in TMemo: 테스트 (perfect!)
Problem: laziness. I have a lot of legacy routines that parse potentially large files line by line, and I wanted to be sure I didn't need to write a routine that manually reads up to each newline for every one of these files.
Question(s):
Why is Read() not returning the exact byte string found in the file? Is it because I'm using a TextFile type, so Delphi performs a degree of interpretation using the non-Unicode codepage?
Is there a built-in way to read a UTF-8 encoded file line by line?
Update:
Update: I just came across Rob Kennedy's solution to this post, which reintroduced me to TStreamReader and answers the question about gracefully reading UTF-8 files line by line.
Is there a built-in way to read a UTF-8 encoded file line by line?
Use TStreamReader. It has a ReadLine() method.
procedure TForm1.ReadFile(aFilename: string);
var
  gFile: TStreamReader;
  gLine: string;
begin
  Memo1.Clear;
  gFile := TStreamReader.Create(aFilename, TEncoding.UTF8, True);
  try
    while not gFile.EndOfStream do
    begin
      gLine := gFile.ReadLine;
      Memo1.Lines.Add(gLine);
    end;
  finally
    gFile.Free;
  end;
end;
With that said, this particular example can be greatly simplified:
procedure TForm1.ReadFile(aFilename: string);
begin
  Memo1.Lines.LoadFromFile(aFilename, TEncoding.UTF8);
end;

Getting a unicode, hidden symbol, as data in Delphi

I'm writing a delimiter for some Excel spreadsheet data, and I need to read the rightward-arrow symbol and the pilcrow symbol in a large string.
The pilcrow symbol, for row ends, was fairly simple, using the Chr function and the AnsiChar code 182.
The rightward arrow has been trickier to figure out. There isn't an AnsiChar code for it; its Unicode value is U+2192. I can't, however, figure out how to make this into a string or char type that I can use in my function.
Any easy ways to do this?
You can't use the U+2192 character directly in Delphi 7. But since a STRING variable can't contain this value either (and thus your TStringList can't either), that doesn't matter.
What character(s) is U+2192 represented as in your StringList AFTER you have read it in? Probably by these three separate bytes: 0xE2 0x86 0x92 (its UTF-8 encoding). The simple solution, therefore, is to start by replacing these three bytes with a single, unique character that you can then assign to the Delimiter field of the TStringList.
Like this:
...
<Read file into a STRING variable, say S>
S := ReplaceStr(S, #$E2#$86#$92, '|');
SL := TStringList.Create;
SL.Text := S;
SL.Delimiter := '|';
...
You'll have to select a single-character representation of your 3-byte UTF-8 Unicode character that doesn't occur elsewhere in your data.
You need to represent that character as a UTF-16 character. In a Unicode version of Delphi you would do it like this:
Chr($2192)
which is of type WideChar.
However, you are using Delphi 7, which is a pre-Unicode Delphi, so you have to do it like this:
var
  wc: WideChar;
....
wc := WideChar($2192);
Now, this might all be to no avail for you, since it sounds a little like your code is working with 8-bit ANSI text, in which case that character cannot be encoded in any 8-bit ANSI character set. If you really must use that character, you'll need to use Unicode text.

Error because of quote char after converting file to string with Delphi XE?

I get an incorrect result when converting a file to a string in Delphi XE. There are several ' characters that make the result incorrect. I've used UnicodeFileToWideString and FileToString from http://www.delphidabbler.com/codesnip, and my code:
function LoadFile(const FileName: TFileName): AnsiString;
begin
  with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
  begin
    try
      SetLength(Result, Size);
      Read(Pointer(Result)^, Size);
      // ReadBuffer(Result[1], Size);
    except
      Result := '';
      Free;
    end;
    Free;
  end;
end;
The results differ between Delphi XE and Delphi 6; the result from D6 is correct. I've compared them with the output of a hex editor program.
Your output is being produced in the style of the Delphi debugger, which displays string variables using Delphi's own string-literal format. Whatever function you're using to produce that output from your own program has actually been fixed for Delphi XE. It's really your Delphi 6 output that's incorrect.
Delphi string literals consist of a series of printable characters between apostrophes and a series of non-printable characters designated by number signs and the numeric values of each character. To represent an apostrophe, write two of them next to each other. The printable and non-printable series of characters can be written right next to each other; there's no need to concatenate them with the + operator.
Here's an excerpt from the output you say is correct:
#$12'O)=ù'dlû'#6't
There are four lone apostrophes in that string, so each one either opens or closes a series of printable characters. We don't necessarily know which is which when we start reading the string at the left, because the #, $, 1, and 2 characters are all printable on their own. But if they represent printable characters, then the O, ), =, and ù characters are in the non-printable region, and that can't be. Therefore, the first apostrophe above opens a printable series, and the #$12 part represents the character at code 18 (12 in hexadecimal). After the ù is another apostrophe. Since the previous one opened a printable string, this one must close it. But the next character after that is d, which is not #, and therefore cannot be the start of a non-printable character code. Therefore, this string from your Delphi 6 code is malformed.
The correct version of that excerpt is this:
#$12'O)=ù''dlû'#6't
Now there are three lone apostrophes and one set of doubled apostrophes. The problematic apostrophe from the previous string has been doubled, indicating that it is a literal apostrophe instead of a printable-string-closing one. The printable series continues with dlû. Then it's closed to insert character No. 6, and then opened again for t. The apostrophe that opens the entire string, at the beginning of the file, is implicit.
You haven't indicated what code you're using to produce the output you've shown, but that's where the problem was. It's not there anymore, and the code that loads the file is correct, so the only place that needs your debugging attention is any code that depended on the old, incorrect format. You'd still do well to replace your code with that of Robmil since it does better at handling (or not handling) exceptions and empty files.
Actually, looking at the real data, your problem is that the file stores binary data, not string data, so interpreting this as a string is not valid at all. The only reason it works at all in Delphi 6 is that non-Unicode Delphi allows you to treat binary data and strings the same way. You cannot do this in Unicode Delphi, nor should you.
The solution to get the actual text from within the file is to read the file as binary data, and then copy any values from this binary data, one byte at a time, to a string if it is a "valid" Ansi character (printable).
I will suggest the code:
function LoadFile(const FileName: TFileName): AnsiString;
begin
  with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
  try
    SetLength(Result, Size);
    if Size > 0 then
      Read(Result[1], Size);
  finally
    Free;
  end;
end;
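The byte-filtering idea mentioned above could be sketched like this (a hypothetical helper; "printable" here means the ASCII range $20..$7E, which you may want to widen for your ANSI codepage):

```pascal
// Extract only the printable ASCII bytes from binary data,
// discarding everything else.
function PrintableText(const Data: AnsiString): AnsiString;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(Data) do
    if Data[i] in [#$20..#$7E] then
      Result := Result + Data[i];
end;
```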
