How to copy an RTF string to the clipboard in Delphi 2009?

Here is my code, which worked in Delphi versions before 2009. In Delphi 2009 it throws a heap error on SetAsHandle.
If I change it to use AnsiString as per original, i.e.
procedure RTFtoClipboard(txt: string; rtf: AnsiString);
and
Data := GlobalAlloc(GHND or GMEM_SHARE, Length(rtf)*SizeOf(AnsiChar) + 1);
then there is no error but the clipboard is empty.
Full code:
unit uClipbrd;
interface
procedure RTFtoClipboard(txt: string; rtf: string);
implementation
uses
  Clipbrd, Windows, SysUtils, uStdDialogs;
var
  CF_RTF: Word = 0;
//------------------------------------------------------------------------------
procedure RTFtoClipboard(txt: string; rtf: string);
var
  Data: Cardinal;
begin
  with Clipboard do
  begin
    Data := GlobalAlloc(GHND or GMEM_SHARE, Length(rtf)*SizeOf(Char) + 1);
    if Data <> 0 then
      try
        StrPCopy(GlobalLock(Data), rtf);
        GlobalUnlock(Data);
        Open;
        try
          AsText := txt;
          SetAsHandle(CF_RTF, Data);
        finally
          Close;
        end;
      except
        GlobalFree(Data);
        ErrorDlg('Unable to copy the selected RTF text');
      end
    else
      ErrorDlg('Global Alloc failed during Copy to Clipboard!');
  end;
end;
initialization
  CF_RTF := RegisterClipboardFormat('Rich Text Format');
  if CF_RTF = 0 then
    raise Exception.Create('Unable to register the Rich Text clipboard format!');
end.

To quote Wikipedia:
RTF is an 8-bit format. That would limit it to ASCII, but RTF can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and Unicode escapes. In a code page escape, two hexadecimal digits following an apostrophe are used for denoting a character taken from a Windows code page. For example, if control codes specifying Windows-1256 are present, the sequence \'c8 will encode the Arabic letter beh (ب).
If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter beh, specifying that older programs which do not have Unicode support should render it as a question mark instead.
So your idea of using AnsiString is good, but you would also need to replace all characters that are not ASCII and are not part of the current Ansi Windows codepage with the Unicode escapes. This should ideally be another function. Your code to write the data to the clipboard could remain the same, with the only change to use the Ansi string type.
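The separate escaping function suggested above might look like the following. This is only a minimal sketch under stated assumptions: the name EscapeRtfUnicode is hypothetical, and for simplicity it escapes every character above 127 rather than only those missing from the current ANSI code page. Because RTF control words are plain ASCII, it can be run over an already-assembled RTF string; it does not escape backslashes or braces in the text itself.

```pascal
program RtfEscapeDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: turns a UnicodeString into 7-bit RTF text by
// replacing every UTF-16 code unit above 127 with a \uN? escape.
function EscapeRtfUnicode(const S: string): AnsiString;
var
  I, Code: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    Code := Ord(S[I]);
    if Code < 128 then
      Result := Result + AnsiChar(Code)
    else
    begin
      // \u takes a signed 16-bit decimal, so values above 32767 wrap negative
      if Code > 32767 then
        Dec(Code, 65536);
      // '?' is the fallback shown by readers without Unicode support
      Result := Result + AnsiString('\u' + IntToStr(Code) + '?');
    end;
  end;
end;

begin
  // Arabic letter beh (U+0628) between two ASCII letters
  Writeln(EscapeRtfUnicode('a' + #$0628 + 'b'));  // a\u1576?b
end.
```

The result is an AnsiString, so it can be passed to a clipboard routine that takes the rtf parameter as AnsiString.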

Related

Why does ReadLn mis-interpret UTF8 text when non-unicode page is Korean (949)?

In Delphi XE2 I can only read and display unicode characters (from a UTF8 encoded file) when the system locale is English using the AssignFile and ReadLn() routines.
Where it fails
If I set the system locale for non-unicode applications to Korean (codepage 949, I think) and repeat the same read, some of my UTF8 multi-byte pairs get replaced with $3F. This only applies to using ReadLn and not when using TFile.ReadAllText(aFilename, TEncoding.UTF8) or TFileStream.Read().
The test
1. I create a text file, UTF8 w/o BOM (Notepad++) with following characters (hex equivalent shown on second line):
테스트
ed 85 8c ec 8a a4 ed 8a b8
Write a Delphi XE 2 Windows form application with TMemo control:
procedure TForm1.ReadFile(aFilename:string);
var
  gFile : TextFile;
  gLine : RawByteString;
  gWideLine : string;
begin
  AssignFile(gFile, aFilename);
  try
    Reset(gFile);
    Memo1.Clear;
    while not EOF(gFile) do
    begin
      ReadLn(gFile, gLine);
      gWideLine := UTF8ToWideString(gLine);
      Memo1.Lines.Add(gWideLine);
    end;
  finally
    CloseFile(gFile);
  end;
end;
I inspect the contents of gLine before performing the UTF8ToWideString conversion, and under English / US locale Windows it is:
$ED $85 $8C $EC $8A $A4 $ED $8A $B8
As an aside, if I read the same file with a BOM I get the correct 3 byte preamble and the output when the UTF8 decode is performed is the same. All OK so far!
Switch Windows 7 (x64) to use Korean as the codepage for applications without Unicode support (Region and Language --> Administrative tab --> Change system locale --> Korean (Korea). Restart computer.
Read same file (UTF8 w/o BOM) with above application and gLine now has hex value:
$3F $8C $EC $8A $A4 $3F $3F
Output in TMemo: ?�스??
Hypothesis: ReadLn() (and Read(), for that matter) attempts to interpret the UTF-8 sequences as Korean multibyte sequences (i.e. it tries to interpret $ED $85, can't, and so substitutes a question mark, $3F).
Use TFileStream to read in exactly the expected number of bytes (9 w/o BOM) and the hex in memory is now exactly:
$ED $85 $8C $EC $8A $A4 $ED $8A $B8
Output in TMemo: 테스트 (perfect!)
Problem: Laziness - I've a lot of legacy routines that parse potentially large files line by line and I wanted to be sure I didn't need to write a routine to manually read until new lines for each of these files.
Question(s):
Why is Read() not returning me the exact byte string as found in the file? Is it because I'm using a TextFile type and so Delphi is doing a degree of interpretation using the non-unicode codepage?
Is there a built in way to read a UTF8 encoded file line by line?
Update:
Just came across Rob Kennedy's solution to this post which reintroduces me to TStreamReader, which answers the question about graceful reading of UTF8 files line by line.
Is there a built in way to read a UTF8 encoded file line by line?
Use TStreamReader. It has a ReadLine() method.
procedure TForm1.ReadFile(aFilename:string);
var
  gFile : TStreamReader;
  gLine : string;
begin
  Memo1.Clear;
  gFile := TStreamReader.Create(aFilename, TEncoding.UTF8, True);
  try
    while not gFile.EndOfStream do
    begin
      gLine := gFile.ReadLine;
      Memo1.Lines.Add(gLine);
    end;
  finally
    gFile.Free;
  end;
end;
With that said, this particular example can be greatly simplified:
procedure TForm1.ReadFile(aFilename:string);
begin
  Memo1.Lines.LoadFromFile(aFilename, TEncoding.UTF8);
end;

Can a Unicode or UTF-8 character be stripped from an AnsiString?

When a Unicode or UTF-8 character exists in an AnsiString, is it possible to strip it from the string? In this particular case the AnsiString contains EXIF parameters.
Edit
When the string is read it is visible as: Copyright © 2013 The States of Guernsey (Guernsey Museums & Galleries)
In one case, the copyright symbol © is encoded as UTF-8 sequence (that is 0xc2 and 0xa9).
Delphi 7 and Delphi 2010 show the two bytes as separate characters, displaying an "Â" (C2) and a "©" (A9) and ignoring that they form a UTF-8 sequence. EXIF tags, including the Copyright tag (33432), should be simple ASCII, not UTF-8 or Unicode.
So if an AnsiString contains one or more of these characters, can they be stripped from the string, or do they have to be edited manually?
Edit2
Attempting to recover the UTF8 I tried:
// remove the null terminator from a string (part of imageen unit)
function RemoveNull(sValue: string): string;
begin
  result := trim(svalue);
  if (result <> '') and
    (result[length(result)] = #0) then
    SetLength(result, length(result) - 1);
  result := trim(result);
end;
EXIF_Copyright: is defined by ImageEn as AnsiString;
utf8: UTF8String;
// EXIF_Copyright
// Shows copyright information
SetLength(utf8, Length(EXIF_Copyright)); // [DCC Error] iexEXIFRoutines.pas(911): E2026 Constant expression expected
Move(Pointer(EXIF_Copyright)^, Pointer(utf8)^, Length(EXIF_Copyright)));
_EXIF_Copyright: result := RemoveNull(EXIF_Copyright);
Unfortunately I have little experience dealing with UTF8.
where EXIF_Copyright is an ansistring;
but this will not compile...
The simplest approach is to read your UTF-8 string into a variable of type UTF8String and then assign to another string variable.
You can assign to an AnsiString if you want, but I don't understand why you would do that. If you do convert to ANSI, any characters that cannot be represented will be converted to question marks. If you are desperate to strip non-ASCII characters, read into UTF8String, convert to string, and strip characters > 127.
As I understand it, the standard mandates ASCII but it's common now for EXIF text to be encoded with UTF-8.
I suggest you simply read the text into a UTF8String and leave it at that.
Your library gives you an AnsiString that actually contains UTF-8 text. So you can simply convert to UTF8String like this:
function ReinterpUTF8storedInAnsiString(const ansi: AnsiString): string;
var
  utf8: UTF8String;
begin
  SetLength(utf8, Length(ansi));
  Move(Pointer(ansi)^, Pointer(utf8)^, Length(ansi));
  Result := utf8;
end;
Now you will have the text that the file creator intended you to see.
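If stripping really is required, the "strip characters > 127" step mentioned earlier could be sketched as below. The function name StripNonAscii is hypothetical, and the sketch assumes a Unicode Delphi (2009 or later) where string is UnicodeString and the text has already been converted from UTF-8.

```pascal
program StripNonAsciiDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: returns S without any character above code 127.
function StripNonAscii(const S: string): string;
var
  I, N: Integer;
begin
  SetLength(Result, Length(S));
  N := 0;
  for I := 1 to Length(S) do
    if Ord(S[I]) <= 127 then
    begin
      Inc(N);
      Result[N] := S[I];
    end;
  // Shrink to the number of characters actually kept
  SetLength(Result, N);
end;

begin
  // #$00A9 is the copyright sign; only that character is dropped,
  // so the spaces on both sides of it remain.
  Writeln(StripNonAscii('Copyright ' + #$00A9 + ' 2013'));
end.
```

Note that this discards information the file's creator intended to be there, which is why keeping the decoded text is usually the better choice.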

How to convert UTF-8 string to PChar in Delphi 2009

I receive a string, which is displayed as '{'#0'S'#0'a'#0'm'#0'p'#0'l'#0'e'#0'-'#0'M'#0'e'#0's'#0's'#0'a'#0'g'#0'e'#0'}'#0 in the debugger.
I need to print it out in the debug output (OutputDebugString).
When I run OutputDebugString(PChar(mymsg)), only the first character of the received string is displayed (probably because of the #0 end-of-string marker).
How can I convert that string into something OutputDebugString can work with?
Update 1: Here's the code. I want to print the contents of the variable RxBufStr.
procedure ReceivingThread.OnExecute(AContext : TIdContext);
var
  RxBufStr: String;
begin
  with AContext.Connection.IOHandler do
  begin
    CheckForDataOnSource(10);
    if not InputBufferIsEmpty then
    begin
      RxBufStr := InputBuffer.Extract();
    end;
  end;
end;
The data you have shown in the question looks like UTF-16 encoded data rather than UTF-8. However, since you are using a Unicode aware Delphi, and a string data type, clearly there has been an encoding mismatch. Your string variable appears to be double UTF-16 encoded if you can see what I mean!
It would appear therefore that InputBuffer.Extract is assuming that the data is transmitted using ANSI or UTF-8. In other words, an 8-bit encoding. But in fact the data is transmitted as UTF-16.
To solve the problem you need to align the reading of the buffer with the transmission of the buffer. You need to make sure that both sides use the same encoding. UTF-8 would be a good choice.
If the data in the buffer is UTF-16, then you can extract it with
RxBufStr := InputBuffer.Extract(-1, TIdTextEncoding.Unicode);
If you switch to UTF-8 then extract it with
RxBufStr := InputBuffer.Extract(-1, TIdTextEncoding.UTF8);
With
RxBufStr := InputBuffer.Extract();
the code does not specify a terminator or a data size, so the client may receive only part of the sent data.
You can read the data with a given (known) length into a TIdBytes array and then convert it to a string using the correct encoding.
One way to do it is
TEncoding.Unicode.GetString( MyByteArray );
(found here)
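To illustrate the decoding step in isolation, here is a small sketch that builds the UTF-16LE byte pattern seen in the debugger by hand and decodes it with TEncoding.Unicode. How the bytes get from Indy into a TBytes array is deliberately left out; the Raw buffer below is made-up sample data.

```pascal
program DecodeUtf16Demo;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  Raw: TBytes;
  Msg: string;
begin
  // The characters '{', 'S', '}' as little-endian UTF-16:
  // each character is followed by a zero high byte.
  SetLength(Raw, 6);
  Raw[0] := Ord('{'); Raw[1] := 0;
  Raw[2] := Ord('S'); Raw[3] := 0;
  Raw[4] := Ord('}'); Raw[5] := 0;

  // TEncoding.Unicode is UTF-16LE
  Msg := TEncoding.Unicode.GetString(Raw);
  Writeln(Msg);  // {S}
end.
```

Once the text is in a proper string like Msg, PChar(Msg) no longer stops at an embedded #0 and can be passed to OutputDebugString.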

Error because of quote char after converting file to string with Delphi XE?

I get an incorrect result when converting a file to a string in Delphi XE. Several ' characters make the result incorrect. I've used UnicodeFileToWideString and FileToString from http://www.delphidabbler.com/codesnip and my own code:
function LoadFile(const FileName: TFileName): ansistring;
begin
  with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
  begin
    try
      SetLength(Result, Size);
      Read(Pointer(Result)^, Size);
      // ReadBuffer(Result[1], Size);
    except
      Result := '';
      Free;
    end;
    Free;
  end;
end;
The result between Delphi XE and Delphi 6 is different. The result from D6 is correct. I've compared with result of a hex editor program.
Your output is being produced in the style of the Delphi debugger, which displays string variables using Delphi's own string-literal format. Whatever function you're using to produce that output from your own program has actually been fixed for Delphi XE. It's really your Delphi 6 output that's incorrect.
Delphi string literals consist of a series of printable characters between apostrophes and a series of non-printable characters designated by number signs and the numeric values of each character. To represent an apostrophe, write two of them next to each other. The printable and non-printable series of characters can be written right next to each other; there's no need to concatenate them with the + operator.
Here's an excerpt from the output you say is correct:
#$12'O)=ù'dlû'#6't
There are four lone apostrophes in that string, so each one either opens or closes a series of printable characters. We don't necessarily know which is which when we start reading the string at the left because the #, $, 1, and 2 characters are all printable on their own. But if they represent printable characters, then the O, ), =, and ù characters are in the non-printable region, and that can't be. Therefore, the first apostrophe above opens a printable series, and the #$12 part represents the character at code 18 (12 in hexadecimal). After the ù is another apostrophe. Since the previous one opened a printable string, this one must close it. But the next character after that is d, which is not #, and therefore cannot be the start of a non-printable character code. Therefore, this string from your Delphi 6 code is malformed.
The correct version of that excerpt is this:
#$12'O)=ù''dlû'#6't
Now there are three lone apostrophes and one set of doubled apostrophes. The problematic apostrophe from the previous string has been doubled, indicating that it is a literal apostrophe instead of a printable-string-closing one. The printable series continues with dlû. Then it's closed to insert character No. 6, and then opened again for t. The apostrophe that opens the entire string, at the beginning of the file, is implicit.
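The quoting rules described above can be sketched as a small formatter. This is an illustration only, not the code the debugger or any library actually uses: the name ToDelphiLiteral is hypothetical, it treats only characters below space as non-printable, and it writes character codes in decimal rather than hexadecimal.

```pascal
program LiteralDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: renders S in Delphi string-literal notation.
function ToDelphiLiteral(const S: AnsiString): AnsiString;
var
  I: Integer;
  InQuotes: Boolean;
begin
  Result := '';
  InQuotes := False;
  for I := 1 to Length(S) do
    if S[I] < ' ' then
    begin
      // Non-printable: close any open quoted run, then emit #code
      if InQuotes then
      begin
        Result := Result + '''';
        InQuotes := False;
      end;
      Result := Result + '#' + AnsiString(IntToStr(Ord(S[I])));
    end
    else
    begin
      // Printable: open a quoted run if needed
      if not InQuotes then
      begin
        Result := Result + '''';
        InQuotes := True;
      end;
      // A literal apostrophe must be doubled inside a quoted run
      if S[I] = '''' then
        Result := Result + ''''''
      else
        Result := Result + S[I];
    end;
  if InQuotes then
    Result := Result + '''';
end;

begin
  Writeln(ToDelphiLiteral('it''s'#6));  // 'it''s'#6
end.
```

Leaving out the doubling step is exactly what produces the ambiguous output discussed above.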
You haven't indicated what code you're using to produce the output you've shown, but that's where the problem was. It's not there anymore, and the code that loads the file is correct, so the only place that needs your debugging attention is any code that depended on the old, incorrect format. You'd still do well to replace your code with that of Robmil since it does better at handling (or not handling) exceptions and empty files.
Actually, looking at the real data, your problem is that the file stores binary data, not string data, so interpreting this as a string is not valid at all. The only reason it works at all in Delphi 6 is that non-Unicode Delphi allows you to treat binary data and strings the same way. You cannot do this in Unicode Delphi, nor should you.
The solution to get the actual text from within the file is to read the file as binary data, and then copy any values from this binary data, one byte at a time, to a string if it is a "valid" Ansi character (printable).
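The byte-by-byte copy just described could look roughly like this. It is a sketch under assumptions: ExtractPrintable is a hypothetical name, and "printable" is taken to mean the ASCII range 32..126, which may be narrower than what a particular file needs.

```pascal
program ExtractPrintableDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: keeps only bytes in the printable ASCII range.
function ExtractPrintable(const Data: array of Byte): AnsiString;
var
  I, N: Integer;
begin
  SetLength(Result, Length(Data));
  N := 0;
  for I := 0 to High(Data) do
    if Data[I] in [32..126] then
    begin
      Inc(N);
      Result[N] := AnsiChar(Data[I]);
    end;
  // Trim to the number of bytes actually copied
  SetLength(Result, N);
end;

begin
  // $12 and $06 are control bytes and are skipped
  Writeln(ExtractPrintable([$12, Ord('O'), Ord(')'), $06, Ord('t')]));  // O)t
end.
```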
I will suggest the code:
function LoadFile(const FileName: TFileName): AnsiString;
begin
  with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
    try
      SetLength(Result, Size);
      if Size > 0 then
        Read(Result[1], Size);
    finally
      Free;
    end;
end;

Replace a string that contains #0?

I use this function to read a file into a string:
function LoadFile(const FileName: TFileName): string;
begin
  with TFileStream.Create(FileName,
    fmOpenRead or fmShareDenyWrite) do begin
    try
      SetLength(Result, Size);
      Read(Pointer(Result)^, Size);
    except
      Result := '';
      Free;
      raise;
    end;
    Free;
  end;
end;
Here's the text of file :
version
Here's the return value of LoadFile :
'ÿþv'#0'e'#0'r'#0's'#0'i'#0'o'#0'n'#0
I want to create a new file that contains "verabc", but I can't get the replacement of "sion" with "abc" to work. I am using D2007. If I remove all the #0 characters, the result becomes Chinese characters.
What you think is the text of the file isn't really the text of the file. What you've read into your string variable is accurate. You have a Unicode text file encoded as little-endian UTF-16. The first two bytes represent the byte-order mark, and each pair of bytes after that are another character of the string.
If you're reading a Unicode file, you should use a Unicode data type, such as WideString. You'll want to divide the file size by two when setting the length of the string, and you'll want to discard the first two bytes.
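A minimal sketch of that approach follows: skip the two-byte mark, halve the remaining byte count, and read into a WideString. The function name ReadUtf16LE is hypothetical, and a TMemoryStream with hand-written sample bytes stands in for the file so the example runs without one; a TFileStream could be passed instead.

```pascal
program ReadUtf16Demo;
{$APPTYPE CONSOLE}
uses
  SysUtils, Classes;

// Hypothetical helper: reads little-endian UTF-16 text that starts
// with a byte-order mark from the current stream position.
function ReadUtf16LE(Stream: TStream): WideString;
var
  Bom: Word;
  CharCount: Integer;
begin
  Result := '';
  if Stream.Size - Stream.Position < SizeOf(Bom) then
    Exit;
  Stream.ReadBuffer(Bom, SizeOf(Bom));
  // Bytes $FF $FE read as a little-endian Word give $FEFF
  if Bom <> $FEFF then
    raise Exception.Create('No UTF-16LE byte-order mark found');
  // Two bytes per character
  CharCount := (Stream.Size - Stream.Position) div SizeOf(WideChar);
  SetLength(Result, CharCount);
  if CharCount > 0 then
    Stream.ReadBuffer(Result[1], CharCount * SizeOf(WideChar));
end;

var
  MS: TMemoryStream;
  // BOM followed by 'h' and 'i' in UTF-16LE (sample data)
  Raw: array[0..5] of Byte = ($FF, $FE, $68, 0, $69, 0);
begin
  MS := TMemoryStream.Create;
  try
    MS.WriteBuffer(Raw, SizeOf(Raw));
    MS.Position := 0;
    Writeln(Length(ReadUtf16LE(MS)));  // 2
  finally
    MS.Free;
  end;
end.
```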
If you don't know what kind of file you're reading, then you need to read the first two or three bytes first. If the first two bytes are $ff $fe, as above, then you might have a little-endian UTF-16 file; read the rest of the file into a WideString, or UnicodeString if you have that type. If they're $fe $ff, then it might be big-endian; read the remainder of the file into a WideString and then swap the order of each pair of bytes. If the first two bytes are $ef $bb, then check the third byte. If it's $bf, then they are probably the UTF-8 byte-order mark. Discard all three and read the rest of the file into an AnsiString or an array of bytes, and then use a function like UTF8Decode to convert it into a WideString.
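The byte-order-mark check described above can be sketched as a small classifier over the first bytes of the file. The type TBomKind and the function DetectBom are hypothetical names introduced for illustration; as the text says, a match only suggests an encoding and does not prove it.

```pascal
program DetectBomDemo;
{$APPTYPE CONSOLE}

type
  TBomKind = (bkNone, bkUtf16LE, bkUtf16BE, bkUtf8);

// Hypothetical helper: inspects the first two or three bytes of Buf.
function DetectBom(const Buf: array of Byte): TBomKind;
begin
  Result := bkNone;
  if (Length(Buf) >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
    Result := bkUtf8
  else if Length(Buf) >= 2 then
  begin
    if (Buf[0] = $FF) and (Buf[1] = $FE) then
      Result := bkUtf16LE
    else if (Buf[0] = $FE) and (Buf[1] = $FF) then
      Result := bkUtf16BE;
  end;
end;

begin
  // The first bytes of the file from the question: ÿþ followed by 'v', #0
  if DetectBom([$FF, $FE, Ord('v'), 0]) = bkUtf16LE then
    Writeln('little-endian UTF-16');
end.
```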
Once you have your data in a WideString, the debugger will show that it contains version, and you should have no trouble using a Unicode-enabled version of StringReplace to do your replacement.
It seems that you are loading a Unicode-encoded text file. Each #0 is the high byte of a Latin character in UTF-16.
If you don't want to deal with Unicode text, choose ANSI encoding in your editor when you save the file.
If you need Unicode encoding, use WideCharToString to convert it to an ANSI string, or remove the #0s yourself, though the latter isn't the best solution. Also remove the 2 leading characters, ÿþ.
The editor put those bytes to mark the file as unicode.
