There is a Firebird table with two blob fields: a blob_binary field (subtype = 0) and a blob_Text field (subtype = 1, UTF-8).
The DB has UTF-8 encoding. The connection has UTF-8 encoding.
The Delphi version is 10.2.3. I use FireDAC components for data access. The server is Firebird 3.
The app must write data from a text file (UTF-8) to both blob fields of the "Content" table.
The text file that I must write into the blobs contains text in English, Russian and Georgian (see image).
Project and DB files, with editing permission
The code below writes text into the binary blob field, but the characters are strange (not "???" symbols; maybe ANSI characters?).
Code for saving the text file into the Blob_Binary field:
ID := Query1.FieldByName('Content_id').AsInteger;
OpenDialog1.Execute;
Query1.Close;
Query1.SQL.Text := 'SELECT * FROM content WHERE Content_id = :id';
Query1.Params[0].AsInteger := ID;
Query1.Open;
Query1.Edit;
(Query1.FieldByName('BLOB_BINARY') as TBlobField).LoadFromFile(OpenDialog1.FileName);
Query1.Post;
When I save the text file into the binary blob field:
1) if the text file was saved in UTF-8-with-BOM encoding, I get normal text in the binary blob, and
2) I get strange characters if the text file was saved as plain UTF-8.
But when I use the same code to write the data into the text blob field, the data appears strange, like Chinese characters (see image).
What am I doing wrong? How do I correct this code so that UTF-8 characters are written to both fields?
I tried other solutions, but the result is the same. For example:
ID := Query1.FieldByName('Content_id').AsInteger;
OpenDialog1.Execute;
Query1.Close;
Query1.SQL.Text := 'UPDATE content SET Blob_Text = :Blob_Text WHERE Content_id = :id';
Query1.Params[0].DataType := ftBlob;
Query1.Params[0].AsStream := TFileStream.Create(OpenDialog1.FileName, fmOpenRead);
Query1.Params[1].AsInteger := ID;
Query1.ExecSQL;
Update 1: As I realised, if I save the txt file as "Unicode" in Notepad (or "UCS-2 LE BOM" in Notepad++), it is saved fine in the text blob and the Chinese characters disappear. Similarly, the txt file is saved fine in the binary blob if it is in UTF-8-with-BOM encoding. Still, it is very inconvenient not to be able to save the file as plain UTF-8.
What you're seeing is known as mojibake, caused by interpreting text in a different encoding than the one it was originally written in. Random CJK (Chinese/Japanese/Korean) characters usually come from incorrectly interpreting 8-bit (ASCII, ANSI, UTF-8, etc.) encoded text as UTF-16. Have a look at the string types going into and coming out of the database, and check for compiler warnings about ANSI and Unicode string type mismatches; you should be able to get to the bottom of this fairly quickly.
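For the FireDAC case above, a minimal sketch of that idea, reusing the query and field names from the question: decode the file explicitly as UTF-8, so the value handed to the driver is already native UTF-16 and no guessing is involved:

var
  SL: TStringList;
begin
  if not OpenDialog1.Execute then
    Exit;
  SL := TStringList.Create;
  try
    // Decode the file as UTF-8 explicitly; SL.Text is native UTF-16 afterwards.
    SL.LoadFromFile(OpenDialog1.FileName, TEncoding.UTF8);
    Query1.Edit;
    // Assigning UTF-16 lets the driver transliterate to the column's charset.
    Query1.FieldByName('BLOB_TEXT').AsWideString := SL.Text;
    Query1.Post;
  finally
    SL.Free;
  end;
end;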
I had the same bug with an ADOQuery and a Firebird 2.5 blob field of sub_type 1 (text).
String fields are converted fine; blobs are not.
If I change the connection to IBX, all works fine.
Solved by:
SettingsTEXT.AsString := UTF8Decode(SettingsTEXT.AsString)
Related
Delphi Tokyo - I have a text file (specifically, a CSV file). I am reading the file line by line using TextFile operations. The first three bytes of the file are some kind of header data which I am not interested in. While I think this will be the case in all files, I want to verify that before I delete it. In short, I want to read the line, compare the first three bytes to three hex values, and if they match, delete the 3 bytes.
When I look at the file in a hex editor, I see
EF BB BF ...
For whatever reason, my comparison is NOT working.
Here is a code fragment.
var
  LeadingBadBytes: String;
begin
  // Open file, and read first line into variable TriggerHeader
  ...
  LeadingBadBytes := '$EFBBBF';
  if AnsiPos(LeadingBadBytes, TriggerHeader) = 1 then
    Delete(TriggerHeader, 1, 3);
The Delete call by itself works fine, but I cannot get the AnsiPos comparison to work. What should I be doing differently?
The bytes EF BB BF are a UTF-8 BOM, which identifies the file as Unicode text encoded in UTF-8. They only appear at the beginning of the file, not on every line.
Your comparison does not work because you are comparing the read string to the literal string '$EFBBBF', not to the byte sequence EF BB BF.
Change this:
LeadingBadBytes := '$EFBBBF';
...
Delete(TriggerHeader, 1, 3);
To this:
LeadingBadBytes := #$FEFF; // EF BB BF is the UTF-8 encoded form of Unicode codepoint U+FEFF...
...
Delete(TriggerHeader, 1, 1); // or Delete(..., Length(LeadingBadBytes))
Also, consider using StrUtils.StartsText(...) instead of AnsiPos(...) = 1.
That being said, modern versions of Delphi handle the BOM for you, so you shouldn't be receiving it in the read data at all. But since you are using a TextFile, which is not BOM-aware (AFAIK), you do. You should not be using outdated Pascal-style file I/O to begin with. Try the more modern RTL I/O classes instead, like TStringList or TStreamReader, which are BOM-aware.
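For instance, a minimal sketch with TStreamReader (the file name is hypothetical; with DetectBOM set to True the reader recognizes and skips the EF BB BF bytes itself):

uses
  System.Classes, System.SysUtils;

var
  Reader: TStreamReader;
  Line: string;
begin
  // True = detect the BOM and consume it before the first read.
  Reader := TStreamReader.Create('trigger.csv', TEncoding.UTF8, True);
  try
    while not Reader.EndOfStream do
    begin
      Line := Reader.ReadLine; // no BOM ever appears in Line
      // process Line here...
    end;
  finally
    Reader.Free;
  end;
end;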
I am trying to encode the 'subject' field of an email, written in Hebrew, into Base64 so that the subject can be read correctly in all clients. At the moment I am using the encoding Windows-1255, which works in some clients but not all, so I want to use UTF-8 with Base64 instead.
My reading on the subject (no pun intended) shows that the text has to be in the form
=?<charset>?<encoding>?<encoded text>?=
eg
=?windows-1255?Q?=E0=E1?=
I have taken encoded subject lines from letters which were sent to me in Hebrew with UTF-8B encoding and decoded them successfully on this website, www.webatic.com/run/convert/base64.php. I have also used this website to encode simple letters and have noticed that the return encoding is not the same as the result which I get from a Delphi algorithm.
So - I am looking for an algorithm which successfully encodes letters such as aleph (ord=224), bet (ord=225), etc. According to the website, the string composed of the two letters aleph and bet returns the code 15DXkQ==, but the basic Delphi algorithm returns Ue4, and the TIdEncoderQuotedPrintable component returns =E0=E1 (which is the ISO-8859 encoding).
Edit (after several comments):
I asked a friend to send me an email from her Mac computer, which unsurprisingly uses UTF-8 encoding (as opposed to Windows-1255). The subject was one letter, aleph, ord 224. The encoded subject appeared in the email's header as follows
=?UTF-8?B?15A=?=
This can be separated into three parts: the 'prefix' (=?UTF-8?B?), which means that UTF-8 with Base64 encoding is being used; the 'payload' (15A=), which the web site I quoted translates correctly as the letter aleph; and the suffix (?=).
I need an algorithm to translate an arbitrary string of letters, most of which will be in Hebrew (and thus with ord >= 224), into Base64/UTF-8; a correct solution is one that decodes correctly on the web site quoted.
I'm sorry to have wasted all your time. I spent several hours again on the subject today and discovered that the base64 code which I was using has a huge bug.
The steps necessary to send a Base64-encoded UTF-8 subject line are:
1. Convert 'normal' text (ie, local ANSI code page) to UTF-8 via the AnsiToUTF8 function
2. Encode this into Base64
3. Create a string with the prefix '=?UTF-8?B?', the result from stage 2, and the suffix '?='
4. Send!
Here is the complete code for creating and sending the email (obviously simplified)
with IdSMTP1 do
begin
  Host := ....;
  Username := ....;
  Password := ....;
end;
with email do
begin
  From.Address := ....;
  Recipients.EMailAddresses := ....;
  CCList.Add.Address := ....;
  Subject := '=?UTF-8?B?' + encode64(AnsiToUTF8(Edit1.Text)) + '?=';
  Body.Text := ....;
end;
try
  IdSMTP1.Connect(1000);
  IdSMTP1.Send(email);
finally
  if IdSMTP1.Connected then
    IdSMTP1.Disconnect;
end;
Using the code on this page, which is the same as this page, the 'codes64' string begins with the digits, then the capital letters, then the lower-case letters, and then the punctuation. But this page shows that the capital letters should come first, followed by the lower-case letters, followed by the digits, followed by the punctuation.
Once I had made this correction, the strings began to be encoded 'correctly' - I could read them properly in my email client, which I am taking to be the definition of 'correct'.
It would be interesting to read whether anybody else has had problems with the base64 encoding code which I found.
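For what it's worth, Indy ships its own Base64 encoder with the standard alphabet, so a hand-rolled table can be avoided entirely. A hedged sketch, assuming a reasonably recent Indy 10 (TIdEncoderMIME is in IdCoderMIME, IndyTextEncoding_UTF8 in IdGlobal; the function name is mine):

uses
  IdCoderMIME, IdGlobal;

function EncodeSubjectUTF8(const S: string): string;
begin
  // EncodeString converts S to UTF-8 octets and Base64-encodes them
  // with the standard alphabet (A-Z, a-z, 0-9, +, /), padding included.
  Result := '=?UTF-8?B?' + TIdEncoderMIME.EncodeString(S, IndyTextEncoding_UTF8) + '?=';
end;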
You do not need to encode the Subject property manually at all. TIdMessage encodes it automatically for you. Simply assign the Edit1.Text value as-is to the Subject and let TIdMessage encode it as needed.
If you want to customize how TIdMessage encodes headers, use the TIdMessage.OnInitializeISO event to provide the desired charset and encoding values. In Delphi 2009+, it defaults to UTF-8 and Base64. In earlier versions, TIdMessage reads the RTL's current OS language and chooses some default values for known languages. However, Hebrew is not one of them, and so ISO-8859-1 and QuotedPrintable would end up being used. You can override those values, eg:
email.Subject := Edit1.Text;

procedure TForm1.emailInitializeISO(var VHeaderEncoding: Char; var VCharSet: string);
begin
  VHeaderEncoding := 'B';
  VCharSet := 'UTF-8';
end;
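The handler must also be attached to the event, either in the Object Inspector or in code; the in-code assignment would presumably look like this:

email.OnInitializeISO := emailInitializeISO;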
I would like to read a UTF-8 text file byte by byte and get the ordinal value of each byte in the file. Can this be done? If so, what is the best method?
My goal is then to replace 2-byte combinations that I find with one byte (these are set conditions that I have prepared).
For example, if I find a 197 followed by a 158 (decimal representations), I will replace them with the single byte 17.
I don't want to use the standard Delphi I/O operations:
AssignFile
ReSet
ReWrite(OutFile);
ReadLn
WriteLn
CloseFile
Is there a better method? Can this be done using TStream (Reader & Writer)?
Here is an example test I am using. I know there is a character (codepoint 350, two bytes in UTF-8) starting at column 84. When viewed in a hex editor, the character consists of 197 + 158, so I am trying to find the 197 using my Delphi code and can't seem to find it.
var
  FS1: TFileStream;
  B: Byte;
begin
  FS1 := TFileStream.Create(ParamStr(1), fmOpenRead);
  try
    FS1.Seek(0, soBeginning);
    FS1.Position := FS1.Position + 84;
    FS1.Read(B, SizeOf(B));
    if Ord(B) = 197 then ShowMessage('True') else ShowMessage('False');
  finally
    FS1.Free;
  end;
end;
You can use TFileStream to read all the data from the file into, for instance, an array of bytes, and then check for the UTF-8 sequences, as sketched below.
Also, please note that a UTF-8 sequence can contain more than 2 bytes.
And in Delphi there is a function Utf8ToUnicode, which will convert UTF-8 data into a usable Unicode string.
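A minimal sketch of that approach, assuming the file names are placeholders and that only the single pair 197,158 from the question needs collapsing into byte 17:

uses
  System.SysUtils, System.IOUtils;

procedure CollapsePairs(const InFile, OutFile: string);
var
  Src, Dst: TBytes;
  I, J: Integer;
begin
  Src := TFile.ReadAllBytes(InFile);
  SetLength(Dst, Length(Src));
  I := 0;
  J := 0;
  while I < Length(Src) do
  begin
    if (I + 1 < Length(Src)) and (Src[I] = 197) and (Src[I + 1] = 158) then
    begin
      Dst[J] := 17; // collapse the two-byte UTF-8 sequence into one byte
      Inc(I, 2);
    end
    else
    begin
      Dst[J] := Src[I];
      Inc(I);
    end;
    Inc(J);
  end;
  SetLength(Dst, J); // trim to the actual output size
  TFile.WriteAllBytes(OutFile, Dst);
end;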
My understanding is that you want to convert a text file from UTF-8 to ASCII. That's quite simple:
StringList.LoadFromFile(UTF8FileName, TEncoding.UTF8);
StringList.SaveToFile(ASCIIFileName, TEncoding.ASCII);
The runtime library comes with all sorts of functionality to convert between different text encodings. Surely you don't want to attempt to replicate this functionality yourself?
I trust you realise that this conversion is liable to lose data. Characters with ordinal greater than 127 cannot be represented in ASCII. In fact every code point that requires more than 1 octet in UTF-8 cannot be represented in ASCII.
You asked the same question 5 hours later in another topic, the answer to which better addresses your specific question:
Replacing a unicode character in UTF-8 file using delphi 2010
I'm upgrading a D7 program to XE, and under Delphi 7 I had code like this...
ParamByName ('Somefield').AsString:=someutf8rawbytestring;
Under XE, if someutf8rawbytestring contains Unicode characters such as Cyrillic script, they appear as ???? in the DB.
I see that someutf8rawbytestring is 8 characters long for my 4-character string, which is correct. But in the DB there are just four characters.
I'm using Firebird 2 through TIBQuery with XE, updating a VARCHAR field with character set NONE.
So it looks like the UTF-8 is being detected and somehow converted back to Unicode code points, and then that is failing a string conversion for the DB. I've tried setting the VARCHAR field to UTF8 encoding, but with the same result.
So how should this be handled?
EDIT: I can use a database tool to put some non-ASCII data into my DB field, and when I read it back it comes out as a UTF-8 encoded string that I can use UTF8Decode on, and it's correct. But writing data back to this field seems impossible without getting a bunch of ???? in the DB. I've tried ParamByName ('Somefield').AsString:=somewidestring; and ParamByName ('Somefield').AsWideString:=somewidestring; and I just get rubbish in the DB...
EDIT2: Here's the code (in one iteration) ...
procedure TFormnameEdit.savename(id: integer);
begin
  with DataModule.UpdateNameQuery do
  begin
    ParamByName('Name').AsString := UTF8Encode(NameEdit.Text);
    ParamByName('ID').AsInteger := id;
    ExecSQL;
    Transaction.Commit;
  end;
end;
As #Lightbulb recommended, adding lc_ctype=UTF8 to the TIBDatabase params solved the problem.
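For reference, a minimal sketch of that fix, assuming the database component is named IBDatabase1 (the name is hypothetical):

IBDatabase1.Connected := False;
IBDatabase1.Params.Add('lc_ctype=UTF8'); // client character set for the connection
IBDatabase1.Connected := True;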
I get an incorrect result when converting a file to a string in Delphi XE. There are several ' characters that make the result incorrect. I've used UnicodeFileToWideString and FileToString from http://www.delphidabbler.com/codesnip, and my code:
function LoadFile(const FileName: TFileName): AnsiString;
begin
  with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
  begin
    try
      SetLength(Result, Size);
      Read(Pointer(Result)^, Size);
      // ReadBuffer(Result[1], Size);
    except
      Result := '';
      Free;
    end;
    Free;
  end;
end;
The result between Delphi XE and Delphi 6 is different. The result from D6 is correct. I've compared with result of a hex editor program.
Your output is being produced in the style of the Delphi debugger, which displays string variables using Delphi's own string-literal format. Whatever function you're using to produce that output from your own program has actually been fixed for Delphi XE. It's really your Delphi 6 output that's incorrect.
Delphi string literals consist of a series of printable characters between apostrophes and a series of non-printable characters designated by number signs and the numeric value of each character. To represent an apostrophe, write two of them next to each other. The printable and non-printable series of characters can be written right next to each other; there's no need to concatenate them with the + operator.
Here's an excerpt from the output you say is correct:
#$12'O)=ù'dlû'#6't
There are four lone apostrophes in that string, so each one either opens or closes a series of printable characters. We don't necessarily know which is which when we start reading the string at the left, because the #, $, 1, and 2 characters are all printable on their own. But if they represented printable characters, then the O, ), =, and ù characters would be in the non-printable region, and that can't be. Therefore, the first apostrophe above opens a printable series, and the #$12 part represents the character at code 18 (12 in hexadecimal). After the ù is another apostrophe. Since the previous one opened a printable string, this one must close it. But the next character after that is d, which is not #, and therefore cannot be the start of a non-printable character code. Therefore, this string from your Delphi 6 code is malformed.
The correct version of that excerpt is this:
#$12'O)=ù''dlû'#6't
Now there are three lone apostrophes and one set of doubled apostrophes. The problematic apostrophe from the previous string has been doubled, indicating that it is a literal apostrophe instead of a printable-string-closing one. The printable series continues with dlû. Then it's closed to insert character No. 6, and then opened again for t. The apostrophe that opens the entire string, at the beginning of the file, is implicit.
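Written out as a complete Delphi literal, with the implicit opening and closing apostrophes made explicit, that excerpt would presumably look like this (the variable name is arbitrary):

S := #$12'O)=ù''dlû'#6't';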
You haven't indicated what code you're using to produce the output you've shown, but that's where the problem was. It's not there anymore, and the code that loads the file is correct, so the only place that needs your debugging attention is any code that depended on the old, incorrect format. You'd still do well to replace your code with that of Robmil since it does better at handling (or not handling) exceptions and empty files.
Actually, looking at the real data, your problem is that the file stores binary data, not string data, so interpreting this as a string is not valid at all. The only reason it works at all in Delphi 6 is that non-Unicode Delphi allows you to treat binary data and strings the same way. You cannot do this in Unicode Delphi, nor should you.
The solution to get the actual text from within the file is to read the file as binary data, and then copy any values from this binary data, one byte at a time, to a string if it is a "valid" Ansi character (printable).
I suggest this code for loading the raw data (a sketch of the printable-character filter follows it):
function LoadFile(const FileName: TFileName): AnsiString;
begin
  with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
    try
      SetLength(Result, Size);
      if Size > 0 then
        Read(Result[1], Size);
    finally
      Free;
    end;
end;
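And a hedged sketch of the "copy only printable Ansi characters" idea described above (the helper name and the exact definition of "printable" are my assumptions):

function ExtractPrintable(const Raw: AnsiString): AnsiString;
var
  I: Integer;
  C: AnsiChar;
begin
  Result := '';
  for I := 1 to Length(Raw) do
  begin
    C := Raw[I];
    // Keep visible ASCII plus tab/CR/LF; skip everything else.
    if (C in [#9, #10, #13]) or ((C >= #32) and (C <= #126)) then
      Result := Result + C;
  end;
end;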