Is a PChar UTF-8 coded? - delphi

I'm writing a tool that uses a C DLL. The DLL's functions expect a char* containing UTF-8 encoded text.
My question: can I pass a PChar, or do I have to use UTF8Encode(string)?

Consider a string variable named s. In an ANSI Delphi, PChar(s) is ANSI encoded. In a Unicode Delphi, it is UTF-16 encoded.
Therefore, either way, you need to convert s to UTF-8. You can then use PAnsiChar(...) to get a pointer to a null-terminated C string.
So, the code you need looks like this:
PAnsiChar(UTF8Encode(s))

Please edit the question and add the tag with your target Delphi version.
Pass it as PAnsiChar. PChar is a wildcard that may mean different data types depending on the compiler. When you work with a DLL-style API, you bypass the compiler's safety net, so you have to build your own. That means using concrete types rather than wildcards: types that do not change regardless of compiler settings or version.
But before passing the pointer, you should make sure that the source data is actually encoded in UTF-8.
var
  data: string;
  buffer: UTF8String;
  buffer_ptr: PAnsiChar;
begin
  buffer := data + #0;
  // transcoding to UTF-8 from whatever charset it was, done transparently by the Delphi RTL
  // the trailing zero ensures that even an empty string yields a valid pointer below
  buffer_ptr := Pointer(@buffer[1]); // untyped pointer, so no code page is bound to the data type
  C_DLL_CALL(buffer_ptr);
end;
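To make the two answers above concrete, here is a minimal console sketch, assuming a Unicode Delphi (2009 or later). FakeCDllCall is a hypothetical stand-in for the real DLL import, which is not shown in the question; it simply reports the bytes a C callee would see before the terminating zero.

```pascal
program Utf8ToCDll;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical stand-in for the C function. A real import might look like:
//   procedure c_dll_call(text: PAnsiChar); cdecl; external 'some.dll';
// It returns the bytes it sees, up to the terminating zero, as hex.
function FakeCDllCall(Text: PAnsiChar): string;
var
  I: Integer;
begin
  Result := '';
  I := 0;
  while Text[I] <> #0 do
  begin
    Result := Result + IntToHex(Ord(Text[I]), 2) + ' ';
    Inc(I);
  end;
end;

var
  S: string;
  Utf8: UTF8String;
begin
  S := 'ma' + #$00F1 + 'ana';   // "mañana", spelled with a char code
  Utf8 := UTF8Encode(S);        // explicit UTF-16 to UTF-8 conversion
  // Holding the UTF-8 data in a variable keeps the pointer valid during the call
  Writeln(FakeCDllCall(PAnsiChar(Utf8)));   // 6D 61 C3 B1 61 6E 61
end.
```

The single character ñ arrives as the two bytes C3 B1, which is what a UTF-8 aware C function expects. Casting with PAnsiChar(...) also gives a valid pointer to a zero byte when the string is empty.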

Related

How to convert ANSI string filename to windows filename [duplicate]

This line:
TFileStream.Create(fileName, fmOpenRead or fmShareDenyNone);
raises an exception if the filename contains a character like ñ.
You are ultimately calling CreateFileA, the ANSI API, and the characters you use have no encoding in your ANSI code page. The only way around this is to open the file with CreateFileW, the Unicode API.
You might not realise that you call CreateFileA, but that's how the Delphi 7 file stream is implemented.
One easy way to solve your problems is to upgrade to the latest Delphi which has good support for the native Windows Unicode API.
If you are stuck with ANSI Delphi then you still need to call CreateFileW. You can do this to create a file handle. You'll need to pass a UTF-16 string to that API; use WideString to store it. You'll also need to get the filename from the user in UTF-16 form, which means a call to GetOpenFileNameW or IFileDialog. Then create a stream by passing the file handle to THandleStream.
To make all this possible you could use the TNT Unicode libraries. They work well, but adopting them is a big porting job.
Frankly, the right way forward is to use modern tools that support Unicode.
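For anyone staying on ANSI Delphi, the CreateFileW plus THandleStream route described above might look roughly like this. It is a sketch, not tested against every Delphi version; OpenFileWide is a hypothetical helper name, and the sharing flags are chosen to mirror fmShareDenyNone.

```pascal
program WideOpenDemo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils, Classes;

// Hypothetical helper: opens an existing file for reading through CreateFileW.
// THandleStream does not close the handle, so the caller must call CloseHandle.
function OpenFileWide(const FileName: WideString; out Handle: THandle): THandleStream;
begin
  Handle := CreateFileW(PWideChar(FileName), GENERIC_READ,
    FILE_SHARE_READ or FILE_SHARE_WRITE, nil, OPEN_EXISTING,
    FILE_ATTRIBUTE_NORMAL, 0);
  if Handle = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  Result := THandleStream.Create(Handle);
end;

var
  FileNameW: WideString;
  H: THandle;
  Stream: THandleStream;
begin
  // "fuññy.txt", built from char codes so the source file encoding does not matter
  FileNameW := WideString('fu') + WideChar($00F1) + WideChar($00F1) + WideString('y.txt');
  try
    Stream := OpenFileWide(FileNameW, H);
    try
      Writeln('Size in bytes: ', Stream.Size);
    finally
      Stream.Free;
      CloseHandle(H);
    end;
  except
    on E: Exception do
      Writeln('Could not open file: ', E.Message);
  end;
end.
```

The filename never passes through an AnsiString, so it does not matter whether the system code page can represent it.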
You can use the TntUnicode units to get Unicode support under Delphi 7.
Add TntClasses to your Uses and make the call like this:
TTntFileStream.Create(fileName, fmOpenRead or fmShareDenyNone);
Make sure that fileName is a WideString.
Here you can get a copy of TntUnicode:
https://github.com/rofl0r/TntUnicode
UTF16 can be thought of as a codepage, just like all of the possible ANSI codepages.
As Remy mentions in his comment, assuming your ANSI codepage supports the required characters in your Unicode string you simply have to convert that Unicode version of that string to the equivalent ANSI codepage version.
The Delphi compiler can take care of a simple conversion for you automatically, which you use simply by casting a WIDEString (UTF16) to an (ANSI)String:
const
  WIDE_FILENAME : WIDEString = 'fuññy.txt';
var
  sFilename: String;
  strm: TFileStream;
begin
  sFilename := String(WIDE_FILENAME);
  strm := TFileStream.Create(sFilename, fmOpenRead);
  // etc
end;
This works perfectly well even on (e.g.) Delphi 7. The only caveat is that the codepage involved (the system default) must support the extended characters in the Unicode string.
NOTE: The above code uses the String type rather than ANSIString explicitly. On Delphi versions where String is ANSIString, this has the required effect but also is portable to versions where String is UnicodeString (should you upgrade your version later).
If you use ANSIString explicitly in this case, the result will be a double conversion if/when you upgrade:
// Unicode compiler using ANSIString type....
var
  sFilename: ANSIString;
begin
  sFilename := ANSIString(WIDE_FILENAME); // Codepage conversion from UTF16 to ANSI
  strm := TFileStream.Create(sFilename, fmOpenRead); // Will implicitly convert *back* from ANSI to WIDE
versus
// Unicode compiler using String type....
var
  sFilename: String;
begin
  sFilename := String(WIDE_FILENAME); // String type conversion from WideString to UnicodeString
  strm := TFileStream.Create(sFilename, fmOpenRead); // No further conversion necessary
Best solution is to go Unicode, but if that is not an option, you can still solve the problem.
In Windows you can set what codepage to use for non-Unicode programs. Just change it to support the correct language (Spanish?). Then the code should work.
Windows 7: Control Panel > Region and Language > Administrative > Language for non-Unicode programs
Windows XP: Control Panel > Regional and Language > Advanced > Language for non-Unicode programs

Problems with unicode text

I use Delphi XE3 and I have a small problem that I don't know how to fix.
The problem is with the letter "è", which appears inside a file path: "C:\lène.mp4".
I save this path into a TStringList. When I save the TStringList to a file, the path is shown correctly inside the text file.
But when I load the file back with a TStringList, the letter is shown as "Ã¨" (whether displayed in a memo or stored in a variable), which makes the path invalid.
Adding the path string directly to the TStringList and then passing it to the path variable works fine.
Loading it from the file and passing it to the path variable does not work (I get "Ã¨" instead of "è").
Normally I work with a lot of Unicode strings, but I'm struggling with this letter.
This does not work:
var
  resp : widestring;
  xfiles : tstringlist;
begin
  xfiles := tstringlist.Create;
  try
    xfiles.LoadFromFile('C:\Demo6-out.txt'); // this file contains only "C:\lène.mp4"
    resp := (xfiles.Strings[0]);
    // if I save xfiles to a file, the path string is saved correctly
  finally
    xfiles.Free;
  end;
end;
But like this it works:
var
  resp : widestring;
  xfiles : tstringlist;
begin
  xfiles := tstringlist.Create;
  try
    xfiles.Add('C:\lène.mp4');
    resp := (xfiles.Strings[0]);
  finally
    xfiles.Free;
  end;
end;
I'm really confused.
First, you should be using UnicodeString instead of WideString. UnicodeString was introduced in Delphi 2009, and is much more efficient than WideString. The RTL uses UnicodeString (almost) everywhere it previously used AnsiString prior to 2009.
Second, something else introduced in Delphi 2009 is SysUtils.TEncoding, which is used for Byte<->Character conversions. Several existing RTL classes, including TStrings/TStringList, were updated to support TEncoding when converting bytes to/from strings.
What happens when you load a file into TStringList is that an internal TEncoding object is assigned to help convert the file's raw bytes to UnicodeString values. Which implementation of TEncoding it uses depends on the character encoding that LoadFromFile() thinks the file is using, if not explicitly stated (LoadFromFile() has an optional AEncoding parameter). If the file has a UTF BOM, a matching TEncoding is used, whether that be TEncoding.UTF8 or TEncoding.(BigEndian)Unicode. If no BOM is present, and the AEncoding parameter is not used, then TEncoding.Default is used, which represents the OS's default charset locale (and thus provides backwards compatibility with existing pre-2009 code).
When saving a TStringList to file, if the list was previously loaded from a file then the same TEncoding used for loading is used for saving, otherwise TEncoding.Default is used (again, for backwards compatibility), unless overridden by the optional AEncoding parameter of SaveToFile().
In your first example, the input file is most likely encoded in UTF-8 without a BOM. So LoadFromFile() would use TEncoding.Default to interpret the file's bytes. Ã¨ is the result of the UTF-8 encoded form of è (byte octets 0xC3 0xA8) being misinterpreted as Windows-1252 instead of UTF-8. So, you would have to load the file like this instead:
xfiles.LoadFromFile('C:\Demo6-out.txt', TEncoding.UTF8);
In your second example, you are not loading a file or saving a file. You are simply assigning a string literal (which is Unicode-aware in D2009+) to a UnicodeString variable (inside of the TStringList) and then assigning that to a WideString variable (WideString and UnicodeString use the same UTF-16 character encoding, they just use different memory managers). So there are no data conversions being performed.
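The misinterpretation described in this answer can be reproduced without any file, as a rough illustration. The sketch below assumes Delphi 2009 or later on Windows; DecodeWith is a hypothetical helper, and code page 1252 stands in for whatever TEncoding.Default happens to be on the asker's machine.

```pascal
program MojibakeDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: decodes raw bytes using the given Windows code page
function DecodeWith(CodePage: Integer; const Bytes: TBytes): string;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(CodePage);   // caller owns this instance
  try
    Result := Enc.GetString(Bytes);
  finally
    Enc.Free;
  end;
end;

var
  Original, Wrong, Right: string;
  Bytes: TBytes;
begin
  Original := 'C:\l' + #$00E8 + 'ne.mp4';       // "C:\lène.mp4"
  Bytes := TEncoding.UTF8.GetBytes(Original);  // è becomes the two bytes C3 A8

  Wrong := DecodeWith(1252, Bytes);            // each byte read as one character
  Right := TEncoding.UTF8.GetString(Bytes);    // decoded with the matching encoding

  Writeln(Length(Original), ' ', Length(Wrong), ' ', Length(Right));       // 11 12 11
  Writeln(Right = Original);                                               // TRUE
  Writeln(IntToHex(Ord(Wrong[5]), 4), ' ', IntToHex(Ord(Wrong[6]), 4));    // 00C3 00A8
end.
```

U+00C3 and U+00A8 are "Ã" and "¨", which is exactly the pair the asker sees in place of "è".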
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

How to use an arbitrary string encoding?

I'm trying to get some code working against an API published by a Chinese company. I have a spec and some sample code (in Java), enough to understand most of what's going on, but I ran across one thing I don't know how to do.
String ecodeform = "GBK";
String sm = new String(Hex.encodeHex("Insert message here".getBytes(ecodeform))); //test message
It's creating a string from the char array result of the hex representation of the original string, encoded in GBK format (the standard Chinese character encoding, equivalent to ASCII for English text). I can work out how to do most of that in Delphi, but I don't know how to encode a string to GBK, which is specifically required by this API.
In SysUtils, there's a TEncoding class that comes with a few built-in encodings, such as UTF8, UTF16, and "Default" (the system's default code page), but I don't know how to set up a TEncoding for an arbitrary encoding such as GBK.
Does anyone know how to set this up?
You can use the TEncoding.GetEncoding() method to get a TEncoding object for a specific codepage/charset, eg:
var
  Enc: TEncoding;
  Bytes: TBytes;
begin
  Enc := TEncoding.GetEncoding(936); // or TEncoding.GetEncoding('gb2312')
  try
    Bytes := Enc.GetBytes('Insert message here');
  finally
    Enc.Free;
  end;
  // encode Bytes to hex string as needed...
end;
TEncoding has a GetEncoding method for that. Give it the encoding name or number, and it will return a TEncoding instance.
For GBK, the number I think you want is 936. See Microsoft's list of code pages for more.
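The remaining hex step from the Java sample could be done along these lines. This is only a sketch: BytesToHex and GbkHex are hypothetical helper names, lowercase output is chosen on the assumption that the Java Hex class in the sample emits lowercase digits, and code page 936 must be installed on the machine.

```pascal
program GbkHexDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: lowercase hex, two digits per byte
function BytesToHex(const Bytes: TBytes): string;
var
  I: Integer;
begin
  Result := '';
  for I := 0 to High(Bytes) do
    Result := Result + LowerCase(IntToHex(Bytes[I], 2));
end;

// Hypothetical helper: encodes S as GBK (code page 936) and returns the hex text
function GbkHex(const S: string): string;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(936);
  try
    Result := BytesToHex(Enc.GetBytes(S));
  finally
    Enc.Free;
  end;
end;

var
  Chinese: string;
begin
  Writeln(GbkHex('Hi'));      // 4869 (ASCII text is unchanged in GBK)
  Chinese := #$4E2D;          // the character U+4E2D
  Writeln(GbkHex(Chinese));   // d6d0
end.
```

Plain English text produces the same bytes as ASCII, so the difference only shows up once the message contains Chinese characters.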

Does a cast from RawByteString to string automatically invoke UTF8Decode?

I want to store arbitrary binary data as a BLOB in an SQLite database.
The data is added as a value with this function:
procedure TSQLiteDatabase.AddParamText(name: string; value: string);
Now I want to convert a WideString into its UTF-8 representation so that it can be stored in the database. After calling UTF8Encode and storing the result in the database, I noticed that the data inside the database is not UTF-8 encoded. Instead, it is encoded as an AnsiString in my computer's locale.
I ran the following test to check what happened:
type
{$IFDEF Unicode}
  TBinary = RawByteString;
{$ELSE}
  TBinary = AnsiString;
{$ENDIF}
procedure TForm1.Button1Click(Sender: TObject);
var
  original: WideString;
  blob: TBinary;
begin
  original := 'ä';
  blob := UTF8Encode(original);
  // Delphi 6: Ã¤ (as expected)
  // Delphi XE4: ä (unexpected! How did it do an automatic UTF8Decode???)
  ShowMessage(blob);
end;
After the character "ä" has been converted to UTF-8, the data is correct in memory ("Ã¤"). However, as soon as I pass the TBinary value to a function (as string or AnsiString), Delphi XE4 does a "magic typecast" that invokes UTF8Decode, for a reason I don't know.
I have already found a workaround to avoid this:
function RealUTF8Encode(AInput: WideString): TBinary;
var
  tmp: TBinary;
begin
  tmp := UTF8Encode(AInput);
  SetLength(result, Length(tmp));
  CopyMemory(@result[1], @tmp[1], Length(tmp));
end;
procedure TForm1.Button2Click(Sender: TObject);
var
  original: WideString;
  blob: TBinary;
begin
  original := 'ä';
  blob := RealUTF8Encode(original);
  // Delphi 6: Ã¤ (as expected)
  // Delphi XE4: Ã¤ (as expected)
  ShowMessage(blob);
end;
However, this workaround with RealUTF8Encode looks dirty to me and I would like to understand why a simple call of UTF8Encode did not work and if there is a better solution.
In Ansi versions of Delphi (prior to D2009), UTF8Encode() returns a UTF-8 encoded AnsiString. In Unicode versions (D2009 and later), it returns a UTF-8 encoded RawByteString with a code page of CP_UTF8 (65001) assigned to it.
In Ansi versions, ShowMessage() takes an AnsiString as input, and the UTF-8 string is an AnsiString, so it gets displayed as-is. In Unicode versions, ShowMessage() takes a UTF-16 encoded UnicodeString as input, so the UTF-8 encoded RawByteString gets converted to UTF-16 using its assigned CP_UTF8 code page.
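The code page tag that drives this conversion can be inspected directly. The following is a small sketch assuming Delphi 2009 or later; it uses a UTF8String variable rather than the asker's TBinary alias, so that the tag is fixed by the type.

```pascal
program CodePageDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Original: string;
  Utf8: UTF8String;
  Text: string;
begin
  Original := #$00E4;                 // the letter ä
  Utf8 := UTF8Encode(Original);       // ä as the two UTF-8 bytes C3 A4
  Writeln(Length(Utf8));              // 2
  Writeln(StringCodePage(Utf8));      // 65001, the CP_UTF8 tag carried by the string
  Text := string(Utf8);               // conversion to UTF-16 driven by that tag
  Writeln(Length(Text));              // 1
  Writeln(Text = Original);           // TRUE
end.
```

The bytes in memory stay UTF-8 the whole time. It is the assignment to a UTF-16 string, such as a ShowMessage() argument, that decodes them.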
If you actually wrote the blob data directly to the database you would find that it may or may not be UTF-8 encoded, depending on how you are writing it. But your approach is wrong; the use of RawByteString is incorrect in this situation. RawByteString is meant to be used as a procedure parameter only. Do not use it as a local variable. That is the source of your problem. From the documentation:
The purpose of RawByteString is to reduce the need for multiple
overloads of procedures that read string data. This means that
parameters of routines that process strings without regard for the
string's code page should typically be of type RawByteString.
RawByteString should only be used as a parameter type, and only in
routines which otherwise would need multiple overloads for AnsiStrings
with different codepages. Such routines need to be written with care
for the actual codepage of the string at run time.
For Unicode versions of Delphi, instead of RawByteString, I would suggest that you use TBytes to hold your UTF-8 data, and encode it with TEncoding:
var
  utf8: TBytes;
  str: string;
...
  str := ...;
  utf8 := TEncoding.UTF8.GetBytes(str);
You are looking for a data type that does not perform implicit text encodings when passed around, and TBytes is that type.
For Ansi versions of Delphi, you can use AnsiString, WideString and UTF8Encode exactly as you do.
Personally however, I would recommend using TBytes consistently for your UTF-8 data. So if you need a single code base that supports Ansi and Unicode compilers (ugh!) then you should create some helpers:
{$IFDEF Unicode}
function GetUTF8Bytes(const Value: string): TBytes;
begin
  Result := TEncoding.UTF8.GetBytes(Value);
end;
{$ELSE}
function GetUTF8Bytes(const Value: WideString): TBytes;
var
  utf8str: UTF8String;
begin
  utf8str := UTF8Encode(Value);
  SetLength(Result, Length(utf8str));
  Move(Pointer(utf8str)^, Pointer(Result)^, Length(utf8str));
end;
{$ENDIF}
The Ansi version incurs more heap allocations than are necessary. You might well choose to write a more efficient helper that calls WideCharToMultiByte() directly.
In Unicode versions of Delphi, if for some reason you don't want to use TBytes for UTF-8 data, you can use UTF8String instead. This is a special AnsiString that always uses the CP_UTF8 code page. You can then write:
var
  utf8: UTF8String;
  str: string;
....
  utf8 := str;
and the compiler will convert from UTF-16 to UTF-8 behind the scenes for you. I would not recommend this though, because it is not supported on mobile platforms, or in Ansi versions of Delphi (UTF8String has existed since Delphi 6, but it was not a true UTF-8 string until Delphi 2009). That is, amongst other reasons, why I suggest that you use TBytes. My philosophy is, at least in the Unicode age, that there is the native string type, and any other encoding should be held in TBytes.
