UTF8 version of WIDESTRING - delphi

I have some text that I need to store in a WideString variable. But my text is UTF-8, and WideString doesn't support UTF-8; it converts the text to what look like Chinese characters.
So is there a UTF-8 version of WideString?
I always use UTF8String, but in this case I have to use WideString.

When you assign a UTF8String variable to a WideString variable, the compiler automatically inserts instructions to decode the string (in Delphi 2009 and later). It converts UTF-8 to UTF-16, which is what WideString holds. If your WideString variable holds Chinese characters, then that's because your UTF-8-encoded string holds UTF-8-encoded Chinese characters.
If you want your string ws to hold 16-bit versions of the bytes in your UTF8String s, then you can by-pass the automatic conversion with some type-casting:
var
  s: UTF8String;
  ws: WideString;
  i: Integer;
  c: AnsiChar;
begin
  // Copy each UTF-8 byte into its own 16-bit character, without decoding.
  SetLength(ws, Length(s));
  for i := 1 to Length(s) do begin
    c := s[i];
    ws[i] := WideChar(Ord(c));
  end;
end;
If you're using Delphi 2009 or later (which includes the XE series), then you should consider using UnicodeString instead of WideString. The former is a native Delphi type, whereas the latter is more of a wrapper for the Windows BSTR type. Both types exhibit the automatic conversion behavior when assigning to and from AnsiString derivatives like UTF8String, though, so the type you use doesn't affect this answer.
In earlier Delphi versions, the compiler would attempt to decode the string using the system code page (which is never UTF-8). To make it decode the string properly, call Utf8Decode:
ws := Utf8Decode(s);
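As a quick check of the decoding described above, here is a minimal sketch that builds a UTF-8 string byte by byte and decodes it. The helper name Utf8FromBytes is made up for illustration; Utf8Decode is the System routine used above (more recent Delphi versions may flag it as deprecated).

```pascal
// Hypothetical helper: builds a UTF8String from raw byte values.
// Setting the bytes one at a time avoids any string-literal conversion.
function Utf8FromBytes(const Bytes: array of Byte): UTF8String;
var
  i: Integer;
begin
  SetLength(Result, Length(Bytes));
  for i := 0 to High(Bytes) do
    Result[i + 1] := AnsiChar(Bytes[i]);
end;
```

With this helper, Utf8Decode(Utf8FromBytes([$C2, $A9])) gives a WideString of length 1 whose only character has the code $A9 (the copyright sign), whereas the byte-widening loop shown earlier would give two characters, $C2 and $A9.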

Related

Can a Unicode or UTF-8 character be stripped from an AnsiString?

If a Unicode or UTF-8 character exists in an AnsiString, is it possible to strip it from the string? In this particular case the AnsiString contains EXIF parameters.
Edit
When the string is read it is visible as: Copyright © 2013 The States of Guernsey (Guernsey Museums & Galleries)
In one case, the copyright symbol © is encoded as UTF-8 sequence (that is 0xc2 and 0xa9).
Delphi 7 and Delphi 2010 show it as two separate characters, displaying an "Â" (C2) and a "©" (A9), ignoring that it is a UTF-8 sequence. EXIF tags, including the Copyright tag (33432), should be simple ASCII, not UTF-8 or Unicode.
So if an AnsiString contains one or more of these characters, can they be stripped from the string, or do they have to be edited manually?
Edit2
Attempting to recover the UTF8 I tried:
// remove the null terminator from a string (part of the ImageEn unit)
function RemoveNull(sValue: string): string;
begin
  result := trim(svalue);
  if (result <> '') and
    (result[length(result)] = #0) then
    SetLength(result, length(result) - 1);
  result := trim(result);
end;
EXIF_Copyright: is defined by ImageEn as AnsiString;
utf8: UTF8String;
// EXIF_Copyright
// Shows copyright information
SetLength(utf8, Length(EXIF_Copyright)); // [DCC Error] iexEXIFRoutines.pas(911): E2026 Constant expression expected
Move(Pointer(EXIF_Copyright)^, Pointer(utf8)^, Length(EXIF_Copyright)));
_EXIF_Copyright: result := RemoveNull(EXIF_Copyright);
Here EXIF_Copyright is an AnsiString, but this will not compile.
Unfortunately I have little experience dealing with UTF-8.
The simplest approach is to read your UTF-8 string into a variable of type UTF8String and then assign to another string variable.
You can assign to an AnsiString if you want, but I don't understand why you would do that. If you do convert to ANSI, any characters that cannot be represented will be converted to question marks. If you are desperate to strip non-ASCII characters, read into UTF8String, convert to string, and strip characters > 127.
As I understand it, the standard mandates ASCII but it's common now for EXIF text to be encoded with UTF-8.
I suggest you simply read the text into a UTF8String and leave it at that.
Your library gives you an AnsiString that actually contains UTF-8 text. So you can simply convert to UTF8String like this:
function ReinterpUTF8storedInAnsiString(const ansi: AnsiString): string;
var
utf8: UTF8String;
begin
SetLength(utf8, Length(ansi));
Move(Pointer(ansi)^, Pointer(utf8)^, Length(ansi));
Result := utf8;
end;
Now you will have the text that the file creator intended you to see.
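If stripping is still wanted after all, the "strip characters > 127" idea mentioned above could look roughly like the sketch below. StripNonAscii is a hypothetical name, not part of ImageEn or the RTL. It drops every character whose code is 128 or higher, so applied to the decoded string it removes the copyright sign, and applied to undecoded UTF-8 bytes it removes both bytes of the sequence.

```pascal
// Hypothetical helper: keeps only characters in the 7-bit ASCII range.
function StripNonAscii(const s: string): string;
var
  i, n: Integer;
begin
  SetLength(Result, Length(s));
  n := 0;
  for i := 1 to Length(s) do
    if Ord(s[i]) < 128 then
    begin
      // Copy the ASCII character to the next free position.
      Inc(n);
      Result[n] := s[i];
    end;
  // Trim the result to the number of characters actually kept.
  SetLength(Result, n);
end;
```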

Replace string that contain #0?

I use this function to read a file into a string:
function LoadFile(const FileName: TFileName): string;
begin
with TFileStream.Create(FileName,
fmOpenRead or fmShareDenyWrite) do begin
try
SetLength(Result, Size);
Read(Pointer(Result)^, Size);
except
Result := '';
Free;
raise;
end;
Free;
end;
end;
Here's the text of file :
version
Here's the return value of LoadFile :
'ÿþv'#0'e'#0'r'#0's'#0'i'#0'o'#0'n'#0
I want to create a new file that contains "verabc". The problem is that I can't manage to replace "sion" with "abc". I am using D2007. If I remove all the #0 characters, the result becomes Chinese characters.
What you think is the text of the file isn't really the text of the file. What you've read into your string variable is accurate. You have a Unicode text file encoded as little-endian UTF-16. The first two bytes represent the byte-order mark, and each pair of bytes after that are another character of the string.
If you're reading a Unicode file, you should use a Unicode data type, such as WideString. You'll want to divide the file size by two when setting the length of the string, and you'll want to discard the first two bytes.
If you don't know what kind of file you're reading, then you need to read the first two or three bytes first. If the first two bytes are $ff $fe, as above, then you might have a little-endian UTF-16 file; read the rest of the file into a WideString, or UnicodeString if you have that type. If they're $fe $ff, then it might be big-endian; read the remainder of the file into a WideString and then swap the order of each pair of bytes. If the first two bytes are $ef $bb, then check the third byte. If it's $bf, then they are probably the UTF-8 byte-order mark. Discard all three and read the rest of the file into an AnsiString or an array of bytes, and then use a function like UTF8Decode to convert it into a WideString.
Once you have your data in a WideString, the debugger will show that it contains version, and you should have no trouble using a Unicode-enabled version of StringReplace to do your replacement.
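A minimal sketch of the little-endian case, assuming Delphi 2007 or earlier, where the question's LoadFile returns the raw file bytes in an AnsiString. The function name Utf16LEBytesToWide is made up for illustration, and it needs SysUtils for the Exception class. It handles only the $ff $fe case; the big-endian and UTF-8 cases described above are not covered.

```pascal
// Hypothetical helper: converts raw file bytes (little-endian UTF-16 with a
// leading FF FE byte-order mark) into a WideString. Requires SysUtils.
function Utf16LEBytesToWide(const Raw: AnsiString): WideString;
var
  CharCount: Integer;
begin
  Result := '';
  // Require the two BOM bytes before treating the data as UTF-16LE.
  if (Length(Raw) < 2) or (Ord(Raw[1]) <> $FF) or (Ord(Raw[2]) <> $FE) then
    raise Exception.Create('Expected a little-endian UTF-16 byte-order mark');
  // Skip the BOM; each remaining pair of bytes is one 16-bit character.
  CharCount := (Length(Raw) - 2) div SizeOf(WideChar);
  SetLength(Result, CharCount);
  if CharCount > 0 then
    Move(Raw[3], Pointer(Result)^, CharCount * SizeOf(WideChar));
end;
```

Called as Utf16LEBytesToWide(LoadFile(FileName)), this should return the seven characters of version for the file in the question.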
It seems that you loaded a Unicode text file. Each #0 is the high byte of a two-byte character in the Latin range.
If you don't want to deal with Unicode text, choose ANSI encoding in your editor when you save the file.
If you need Unicode encoding, use WideCharToString to convert it to an ANSI string, or just remove the #0 bytes yourself, though the latter isn't the best solution. Also remove the two leading characters, ÿþ.
The editor writes those bytes to mark the file as Unicode.

Convert Delphi 7 code to work with Delphi 2009

I have a String that I needed access to the first character of, so I used stringname[1]. With the unicode support this no longer works. I get an error: [DCC Error] sndkey32.pas(420): E2010 Incompatible types: 'Char' and 'AnsiChar'
Example code:
//vkKeyScan from the windows unit
var
KeyString : String[20];
MKey : Word;
mkey:=vkKeyScan(KeyString[1])
How would I write this in modern versions of Delphi?
The type String[20] is a ShortString with a maximum length of 20, i.e. a ShortString that can hold up to 20 characters. But ShortStrings behave like AnsiStrings, i.e. they are not Unicode: one character is one byte. Thus KeyString[1] is an AnsiChar, whereas the vkKeyScan function expects a WideChar (=Char) as argument. I really have no idea whatsoever why you want to use the type String[20] instead of String (=UnicodeString), but you could convert the AnsiChar KeyString[1] to a WideChar:
mkey := vkKeyScan(WideChar(KeyString[1]))
Off the top of my head: do you really need a string, which is a UnicodeString in Delphi 2009?
One option is to have the definition
var KeyString: AnsiString;
Then KeyString[1] would be an AnsiChar rather than a Char, which you could pass to VkKeyScanA, the ANSI variant of the function.

convert string character to ascii in delphi

How can I convert the characters of a string (123-jhk25) to ASCII in Delphi 7?
If you mean the ASCII code for each character, you need to use the Ord() function, which returns the ordinal value of any ordinal type.
In this case it works on character values, returning a byte:
var
Asc : Byte;
i : Integer;
begin
for i := 1 to Length(s) do
begin
Asc := Ord(s[i]);
// do something with Asc
end;
end;
It depends on your Delphi version. In Delphi 2007 and before, strings are automatically in ANSI string format, and anything below 128 is an ASCII character.
In D2009 and later, things become more complicated since the default string type is UnicodeString. You'll have to cast the character to AnsiChar. It'll perform a codepage conversion, and then whatever you end up with may or may not work depending on which language the character in question came from. But if it was originally an ASCII character, it should convert without trouble.
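As an alternative to casting to AnsiChar, a version-independent sketch is to test the character code directly and treat anything at or above 128 as "not ASCII". Note that this swaps the codepage conversion described above for a plain range check. AsciiCodeOf is a hypothetical name.

```pascal
// Hypothetical helper: returns the ASCII code of ch, or -1 if ch is not ASCII.
// Works whether Char is AnsiChar (Delphi 2007 and earlier) or WideChar (2009+).
function AsciiCodeOf(ch: Char): Integer;
begin
  if Ord(ch) < 128 then
    Result := Ord(ch)
  else
    Result := -1;
end;
```

For the string in the question, AsciiCodeOf('1') is 49, AsciiCodeOf('-') is 45 and AsciiCodeOf('j') is 106.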

How to copy a RTF string to the clipboard in delphi 2009?

Here is my code, which worked in Delphi versions before 2009. In Delphi 2009 it just ends up throwing a heap error on SetAsHandle.
If I change it to use AnsiString as per original, i.e.
procedure RTFtoClipboard(txt: string; rtf: AnsiString);
and
Data := GlobalAlloc(GHND or GMEM_SHARE, Length(rtf)*SizeOf(AnsiChar) + 1);
then there is no error but the clipboard is empty.
Full code:
unit uClipbrd;
interface
procedure RTFtoClipboard(txt: string; rtf: string);
implementation
uses
Clipbrd, Windows, SysUtils, uStdDialogs;
VAR
CF_RTF : Word = 0;
//------------------------------------------------------------------------------
procedure RTFtoClipboard(txt: string; rtf: string);
var
Data: Cardinal;
begin
with Clipboard do
begin
Data := GlobalAlloc(GHND or GMEM_SHARE, Length(rtf)*SizeOf(Char) + 1);
if Data <> 0 then
try
StrPCopy(GlobalLock(Data), rtf);
GlobalUnlock(Data);
Open;
try
AsText := txt;
SetAsHandle(CF_RTF, Data);
finally
Close;
end;
except
GlobalFree(Data);
ErrorDlg('Unable to copy the selected RTF text');
end
else
ErrorDlg('Global Alloc failed during Copy to Clipboard!');
end;
end;
initialization
CF_RTF := RegisterClipboardFormat('Rich Text Format');
if CF_RTF = 0 then
raise Exception.Create('Unable to register the Rich Text clipboard format!');
end.
To quote Wikipedia:
RTF is an 8-bit format. That would limit it to ASCII, but RTF can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and Unicode escapes. In a code page escape, two hexadecimal digits following an apostrophe are used for denoting a character taken from a Windows code page. For example, if control codes specifying Windows-1256 are present, the sequence \'c8 will encode the Arabic letter beh (ب).
If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter beh, specifying that older programs which do not have Unicode support should render it as a question mark instead.
So your idea of using AnsiString is good, but you would also need to replace all characters that are not ASCII and are not part of the current ANSI Windows codepage with Unicode escapes. This should ideally be a separate function. Your code to write the data to the clipboard could remain the same, the only change being the use of the AnsiString type.
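The separate escaping function could be sketched as follows. RtfEscapeNonAscii is a hypothetical name, and it requires SysUtils for IntToStr. For simplicity it escapes every character outside ASCII, including those the current codepage could represent, and it always uses a question mark as the fallback. It assumes that backslashes and braces in the text have already been escaped by whatever produced the RTF.

```pascal
// Hypothetical helper: turns every non-ASCII character into an RTF \u escape.
// Requires SysUtils (IntToStr).
function RtfEscapeNonAscii(const s: WideString): AnsiString;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    if Ord(s[i]) < 128 then
      // ASCII characters are copied through unchanged.
      Result := Result + AnsiChar(Ord(s[i]))
    else
      // \u takes a signed 16-bit decimal number; '?' is the fallback
      // shown by readers without Unicode support.
      Result := Result + AnsiString('\u' + IntToStr(SmallInt(Ord(s[i]))) + '?');
end;
```

For the Arabic letter beh from the quotation (code 1576), this produces \u1576?, which matches the example above.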

Resources