How to convert UTF-8 string to PChar in Delphi 2009

I receive a string, which is displayed as '{'#0'S'#0'a'#0'm'#0'p'#0'l'#0'e'#0'-'#0'M'#0'e'#0's'#0's'#0'a'#0'g'#0'e'#0'}'#0 in the debugger.
I need to print it out in the debug output (OutputDebugString).
When I run OutputDebugString(PChar(mymsg)), only the first character of the received string is displayed (probably because of the #0 end-of-string marker).
How can I convert that string into something OutputDebugString can work with?
Update 1: Here's the code. I want to print the contents of the variable RxBufStr.
procedure ReceivingThread.OnExecute(AContext : TIdContext);
var
  RxBufStr: String;
begin
  with AContext.Connection.IOHandler do
  begin
    CheckForDataOnSource(10);
    if not InputBufferIsEmpty then
    begin
      RxBufStr := InputBuffer.Extract();
    end;
  end;
end;

The data you have shown in the question looks like UTF-16 encoded data rather than UTF-8. However, since you are using a Unicode-aware Delphi and a string data type, there has clearly been an encoding mismatch. In effect, your string variable is double UTF-16 encoded: each byte of the UTF-16 data, including the zero high bytes, has been widened into a UTF-16 character of its own.
It would appear therefore that InputBuffer.Extract is assuming that the data is transmitted using ANSI or UTF-8. In other words, an 8-bit encoding. But in fact the data is transmitted as UTF-16.
To solve the problem you need to align the reading of the buffer with the transmission of the buffer. You need to make sure that both sides use the same encoding. UTF-8 would be a good choice.
If the data in the buffer is UTF-16, then you can extract it with
RxBufStr := InputBuffer.Extract(-1, TIdTextEncoding.Unicode);
If you switch to UTF-8 then extract it with
RxBufStr := InputBuffer.Extract(-1, TIdTextEncoding.UTF8);
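If a string has already been extracted with the wrong encoding, one possible repair is to narrow each character back to the byte it came from and decode those bytes as UTF-16. The following is only a minimal sketch: NarrowAndDecodeUtf16 is a hypothetical helper name, and it assumes that every character of the garbled string lies in the range #0..#255, i.e. that the 8-bit decoding mapped each byte to the code point of the same value. Fixing the Extract call as shown above remains the cleaner solution.

```delphi
program RepairWidenedString;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: treats each Char of S as one raw byte and decodes
// the resulting byte sequence as little-endian UTF-16.
// Assumption: every Char in S is <= #255.
function NarrowAndDecodeUtf16(const S: string): string;
var
  Raw: TBytes;
  I: Integer;
begin
  SetLength(Raw, Length(S));
  for I := 1 to Length(S) do
    Raw[I - 1] := Byte(Ord(S[I]));
  Result := TEncoding.Unicode.GetString(Raw);
end;

var
  Garbled: string;
begin
  // Same pattern as in the question: every real character is followed by #0.
  Garbled := '{'#0'S'#0'}'#0;
  Writeln(NarrowAndDecodeUtf16(Garbled));
end.
```

The repaired string can then be passed to OutputDebugString(PChar(...)) as usual, because it no longer contains embedded #0 characters.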

With
RxBufStr := InputBuffer.Extract();
the code does not specify a terminator or a data size, so it may happen that the client receives only a part of the sent data.
You can read the data with a given (known) length into a TIdBytes array and then convert it to a string using the correct encoding.
One way to do it is
TEncoding.Unicode.GetString( MyByteArray );

Related

Receiving a HEX number and turning into an INT or STRING

I'm sending data from an ATmega in the form of 16 bit (2 bytes). I have a serial component in Delphi which receives the data.
If I send a String (e.g. 'FF'), I get the data added to my Memo component. All fine.
However, if I send the raw hex $FF, I get a receive data blink saying "data received" but nothing is added to the Memo component's lines. I'm not sure how to convert this data into an Integer or String, something I can use.
A solution would be good but an explanation on how Delphi sees String, Char, etc. would be nice. Thanks.

When you receive data, you can cast it to bytes (if needed) and transform it into a hex representation.
For example, if you get AnsiString:
AnsiS := Comport.ReadAnsiString; //your reading here
for i := 1 to Length(AnsiS) do
  Memo1.Lines.Add(IntToHex(Ord(AnsiS[i]), 2));

When your ATMega sends the string "FF", it sends two characters ("F" and "F"), each encoded to their ASCII code decimal 70. When your Delphi program receives these two bytes (d70 and d70) it converts those ASCII codes to characters "F" and "F" and adds them to the memo.
When your ATMega sends the hex value FF ($FF as they are represented in Delphi code), it sends one byte with decimal value 255. When your Delphi program receives this one byte (d255) it attempts to convert it to a character but doesn't find a printable character representation for this code. Therefore nothing is added to the memo. Or, maybe your receiving code is filtering out this and possibly other values too.
It's not clear exactly what kind of solution you are looking for, but you can convert the byte value (d255) to hex or decimal representation with function IntToHex(Value: Integer; Digits: Integer): string; or System.SysUtils.Format(const Format: string; const Args: array of const): string; or use it as a byte value in your code.
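The difference between the text 'FF' and the byte $FF, and the conversions mentioned above, can be illustrated with a short console sketch. The variable names are made up for illustration, and the serial component is left out; the values are simply assigned in code.

```delphi
program HexVsText;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  TextData: AnsiString;
  RawByte: Byte;
begin
  // The text 'FF' is two bytes, each holding the ASCII code of 'F'.
  TextData := 'FF';
  Writeln(Ord(TextData[1]), ' ', Ord(TextData[2])); // 70 70

  // The value $FF is a single byte with the decimal value 255.
  RawByte := $FF;
  Writeln(RawByte);                  // 255
  Writeln(IntToHex(RawByte, 2));     // FF
  Writeln(Format('%d', [RawByte]));  // 255

  // Going the other way: hex text to an integer ('$' marks hex for StrToInt).
  Writeln(StrToInt('$' + string(TextData))); // 255
end.
```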

extra spaces with string to buffer void type conversion implicit in Filestream.WriteBuffer method

Haven't needed to post here for a while, but I have a problem implementing filestreams.
When writing a string to a filestream, the resulting text file has extra spaces inserted between each character.
So when running this method:
Function TDBImportStructures.SaveIVDataToFile(const AMeasurementType: integer;
  IVDataRecordList: TIV; ExportFileName, LogFileName: String;
  var ProgressInfo: TProgressInfo): Boolean; // AM
var
  TempString: unicodestring;
  ExportLogfile, OutputFile: TFileStream;
begin
  ExportLogfile := TFileStream.Create(LogFileName, fmCreate);
  TempString :=
    'FileUploadTimestamp, Filename, MeasurementTimestamp, SerialNumber, DeviceID, PVInstallID,'
    + #13#10;
  ExportLogfile.WriteBuffer(TempString[1], Length(TempString) * SizeOf(Char));
  ExportLogfile.Free;
  OutputFile := TFileStream.Create(ExportFileName, fmCreate);
  TempString :=
    'measurementdatetime,closestfiveseconddatetime,closesttenminutedatetime,deviceid,'
    + 'measuredmoduletemperature,moduletemperature,isc,voc,ff,impp,vmpp,iscslope,vocslope,'
    + 'pvinstallid,numivpoints,errorcode' + #13#10;
  OutputFile.WriteBuffer(TempString[1], Length(TempString) * SizeOf(Char));
  OutputFile.Free;
end;
(which is a stripped down test method, writing headers only). The resulting csv file for the 'OutPutFile' reads
'm e a s u r e d m o d u l e t e m p e r a t u r e, etcetera when viewed in wordpad, but not in excel, notepad, etc.
I'm guessing it's the SizeOf(Char) statement which is wrong in a Unicode context, but I'm not sure what would be the correct thing to insert here.
The 'ExportLogfile' seems to work ok but not the 'OutPutFile'
From what I've read elsewhere it is the writing in unicode which is the problem & not WordPad, see http://social.msdn.microsoft.com/Forums/en-US/7e040fd1-f399-4fb1-b700-9e7cc6117cc4/unicode-to-files-and-console-vs-notepad-wordpad-word-etc?forum=vcgeneral
Any suggestions folks?
many thanks, Brian

You are writing 16 bit UTF-16 encoded characters. And then viewing the text as if it were ANSI encoded text. This mismatch explains the behaviour. In fact you don't have extra spaces, those are zero bytes, interpreted as null characters.
You need to decide which encoding you wish to use. Which programs will read the file? Which text encoding are they expecting? Few programs that read csv files understand UTF-16.
A quick fix would be to switch to using AnsiString which would result in 8 bit text. But would not support international text. Do you need to support international text? Then perhaps you need UTF-8. Again you could perform a quick fix using Utf8String, but I think you should look deeper.
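As a rough illustration of such a quick fix, the sketch below converts the text to UTF-8 bytes first and writes those bytes to the stream. It uses a TMemoryStream and a shortened, made-up header line so that it runs on its own; with a TFileStream the WriteBuffer call would look the same.

```delphi
program WriteUtf8Header;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  Header: string;
  Utf8Bytes: TBytes;
  Stream: TMemoryStream;
begin
  // Shortened sample header, not the full one from the question.
  Header := 'deviceid,isc,voc' + #13#10;

  // Convert the UTF-16 string to UTF-8 before it touches the stream.
  Utf8Bytes := TEncoding.UTF8.GetBytes(Header);

  Stream := TMemoryStream.Create;
  try
    if Length(Utf8Bytes) > 0 then
      Stream.WriteBuffer(Utf8Bytes[0], Length(Utf8Bytes));

    // Bytes the original code would have written (two per character) ...
    Writeln(Length(Header) * SizeOf(Char));
    // ... versus bytes actually written as UTF-8 (one per ASCII character).
    Writeln(Stream.Size);
  finally
    Stream.Free;
  end;
end.
```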
It's odd that you handle the text to binary conversion. It would be much simpler to use TStringList, calling Add to add lines, and then specify an encoding when saving the file.
List.Add(...);
List.Add(...);
// etc.
List.SaveToFile(FileName, TEncoding.UTF8);
A perhaps more elegant approach would be to use the TStreamWriter class. Supply an output stream (or filename) and an encoding when creating the object. And then call Write or WriteLine to add text.
Writer := TStreamWriter.Create(FileName, TEncoding.UTF8);
try
  Writer.WriteLine(...);
  // etc.
finally
  Writer.Free;
end;
I've assumed UTF-8 here but you can easily specify a different encoding.

Error because of quote char after converting file to string with Delphi XE?

I get an incorrect result when converting a file to a string in Delphi XE. There are several ' characters that make the result incorrect. I've used UnicodeFileToWideString and FileToString from http://www.delphidabbler.com/codesnip and my code :
function LoadFile(const FileName: TFileName): ansistring;
begin
  with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
  begin
    try
      SetLength(Result, Size);
      Read(Pointer(Result)^, Size);
      // ReadBuffer(Result[1], Size);
    except
      Result := '';
      Free;
    end;
    Free;
  end;
end;
The result between Delphi XE and Delphi 6 is different. The result from D6 is correct. I've compared with result of a hex editor program.

Your output is being produced in the style of the Delphi debugger, which displays string variables using Delphi's own string-literal format. Whatever function you're using to produce that output from your own program has actually been fixed for Delphi XE. It's really your Delphi 6 output that's incorrect.
Delphi string literals consist of a series of printable characters between apostrophes and a series of non-printable characters designated by number signs and the numeric values of each character. To represent an apostrophe, write two of them next to each other. The printable and non-printable series of characters can be written right next to each other; there's no need to concatenate them with the + operator.
Here's an excerpt from the output you say is correct:
#$12'O)=ù'dlû'#6't
There are four lone apostrophes in that string, so each one either opens or closes a series of printable characters. We don't necessarily know which is which when we start reading the string at the left because the #, $, 1, and 2 characters are all printable on their own. But if they represent printable characters, then the O, ), =, and ù characters are in the non-printable region, and that can't be. Therefore, the first apostrophe above opens a printable series, and the #$12 part represents the character at code 18 (12 in hexadecimal). After the ù is another apostrophe. Since the previous one opened a printable string, this one must close it. But the next character after that is d, which is not #, and therefore cannot be the start of a non-printable character code. Therefore, this string from your Delphi 6 code is malformed.
The correct version of that excerpt is this:
#$12'O)=ù''dlû'#6't
Now there are three lone apostrophes and one set of doubled apostrophes. The problematic apostrophe from the previous string has been doubled, indicating that it is a literal apostrophe instead of a printable-string-closing one. The printable series continues with dlû. Then it's closed to insert character No. 6, and then opened again for t. The apostrophe that opens the entire string, at the beginning of the file, is implicit.
You haven't indicated what code you're using to produce the output you've shown, but that's where the problem was. It's not there anymore, and the code that loads the file is correct, so the only place that needs your debugging attention is any code that depended on the old, incorrect format. You'd still do well to replace your code with that of Robmil since it does better at handling (or not handling) exceptions and empty files.
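The literal format described above can be approximated with a small routine. This is only a sketch: ToDelphiLiteral is a hypothetical name, not the function the debugger or the questioner's code uses, and it assumes that only characters below #32 count as non-printable. It shows the two rules that matter here: apostrophes inside a printable run are doubled, and runs are closed before a # code is emitted.

```delphi
program LiteralFormat;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: renders S in Delphi string-literal style.
// Assumption: characters below #32 are treated as non-printable.
function ToDelphiLiteral(const S: string): string;
var
  I: Integer;
  InQuotes: Boolean;
begin
  Result := '';
  InQuotes := False;
  for I := 1 to Length(S) do
  begin
    if S[I] < #32 then
    begin
      // Close the printable run before emitting a character code.
      if InQuotes then
        Result := Result + '''';
      InQuotes := False;
      Result := Result + '#' + IntToStr(Ord(S[I]));
    end
    else
    begin
      // Open a printable run if we are not already inside one.
      if not InQuotes then
        Result := Result + '''';
      InQuotes := True;
      if S[I] = '''' then
        Result := Result + ''''''  // a literal apostrophe is doubled
      else
        Result := Result + S[I];
    end;
  end;
  if InQuotes then
    Result := Result + '''';
end;

begin
  // Input: #18, 'i', 't', apostrophe, 's', #6, 't'
  Writeln(ToDelphiLiteral(#$12'it''s'#6't')); // #18'it''s'#6't'
end.
```

Note that this sketch prints character codes in decimal (#18), whereas the output in the question uses hexadecimal (#$12); both forms are valid in Delphi source.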

Actually, looking at the real data, your problem is that the file stores binary data, not string data, so interpreting this as a string is not valid at all. The only reason it works at all in Delphi 6 is that non-Unicode Delphi allows you to treat binary data and strings the same way. You cannot do this in Unicode Delphi, nor should you.
The solution to get the actual text from within the file is to read the file as binary data, and then copy any values from this binary data, one byte at a time, to a string if it is a "valid" Ansi character (printable).

I would suggest this code:
function LoadFile(const FileName: TFileName): AnsiString;
begin
  with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
  try
    SetLength(Result, Size);
    if Size > 0 then
      Read(Result[1], Size);
  finally
    Free;
  end;
end;

Replace string that contain #0?

I use this function to read file to string
function LoadFile(const FileName: TFileName): string;
begin
  with TFileStream.Create(FileName,
    fmOpenRead or fmShareDenyWrite) do begin
    try
      SetLength(Result, Size);
      Read(Pointer(Result)^, Size);
    except
      Result := '';
      Free;
      raise;
    end;
    Free;
  end;
end;
Here's the text of file :
version
Here's the return value of LoadFile :
'ÿþv'#0'e'#0'r'#0's'#0'i'#0'o'#0'n'#0
I want to create a new file containing "verabc", but I can't manage to replace "sion" with "abc". I am using D2007. If I remove all the #0 characters, the result turns into Chinese characters.

What you think is the text of the file isn't really the text of the file. What you've read into your string variable is accurate. You have a Unicode text file encoded as little-endian UTF-16. The first two bytes represent the byte-order mark, and each pair of bytes after that are another character of the string.
If you're reading a Unicode file, you should use a Unicode data type, such as WideString. You'll want to divide the file size by two when setting the length of the string, and you'll want to discard the first two bytes.
If you don't know what kind of file you're reading, then you need to read the first two or three bytes first. If the first two bytes are $ff $fe, as above, then you might have a little-endian UTF-16 file; read the rest of the file into a WideString, or UnicodeString if you have that type. If they're $fe $ff, then it might be big-endian; read the remainder of the file into a WideString and then swap the order of each pair of bytes. If the first two bytes are $ef $bb, then check the third byte. If it's $bf, then they are probably the UTF-8 byte-order mark. Discard all three and read the rest of the file into an AnsiString or an array of bytes, and then use a function like UTF8Decode to convert it into a WideString.
Once you have your data in a WideString, the debugger will show that it contains version, and you should have no trouble using a Unicode-enabled version of StringReplace to do your replacement.
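The byte-order-mark check and the little-endian UTF-16 case described above might look roughly like this. It is a sketch under assumptions: TRawBytes, TBomKind, DetectBom and DecodeUtf16LE are names invented for this example, the bytes are built in code instead of being read from a file, and the big-endian and UTF-8 branches are only detected, not decoded.

```delphi
program BomSniff;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TRawBytes = array of Byte;  // hypothetical local alias
  TBomKind = (bkNone, bkUtf16LE, bkUtf16BE, bkUtf8);

// Looks at the first two or three bytes to guess the encoding.
function DetectBom(const Data: TRawBytes): TBomKind;
begin
  Result := bkNone;
  if (Length(Data) >= 3) and (Data[0] = $EF) and (Data[1] = $BB)
    and (Data[2] = $BF) then
    Result := bkUtf8
  else if (Length(Data) >= 2) and (Data[0] = $FF) and (Data[1] = $FE) then
    Result := bkUtf16LE
  else if (Length(Data) >= 2) and (Data[0] = $FE) and (Data[1] = $FF) then
    Result := bkUtf16BE;
end;

// Copies the bytes after Offset into a WideString, two bytes per character.
function DecodeUtf16LE(const Data: TRawBytes; Offset: Integer): WideString;
var
  CharCount: Integer;
begin
  Result := '';
  CharCount := (Length(Data) - Offset) div 2;
  if CharCount <= 0 then
    Exit;
  SetLength(Result, CharCount);
  Move(Data[Offset], Result[1], CharCount * SizeOf(WideChar));
end;

var
  Data: TRawBytes;
begin
  // FF FE 'v' 00 'e' 00 'r' 00: a BOM followed by "ver" in UTF-16LE.
  SetLength(Data, 8);
  Data[0] := $FF;      Data[1] := $FE;
  Data[2] := Ord('v'); Data[3] := 0;
  Data[4] := Ord('e'); Data[5] := 0;
  Data[6] := Ord('r'); Data[7] := 0;

  if DetectBom(Data) = bkUtf16LE then
    Writeln(DecodeUtf16LE(Data, 2));  // skips the two BOM bytes
end.
```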

It seems that you loaded a Unicode (UTF-16) encoded text file. The #0 after each character is the high byte of its UTF-16 code unit, which is zero for Latin characters.
If you don't want to deal with Unicode text, choose ANSI encoding in your editor when you save the file.
If you need Unicode encoding, use WideCharToString to convert it to an ANSI string, or just remove the #0s yourself, though the latter isn't the best solution. Also remove the two leading characters, ÿþ.
The editor wrote those bytes (the byte-order mark) to mark the file as Unicode.

Delphi 2009 RawByteString vagaries

Suppose that for some perverse reason you want to display the raw byte contents of a UTF8String.
var
  utf8Str : UTF8String;
begin
  utf8Str := '€ąćęłńóśźż';
end;
(1) This doesn't do, it displays the readable form:
memo1.Lines.Add( RawByteString( utf8Str ));
// output: '€ąćęłńóśźż'
(2) This, however, does "work" - note the concatenation:
memo1.Lines.Add( 'x' + RawByteString( utf8Str ));
// output: 'xâ‚¬Ä…Ä‡Ä™Ĺ‚Ĺ„ĂłĹ›ĹşĹĽ'
I understand (1), though the compiler's forced coercion to UnicodeString seems to prevent ever displaying a RawByteString var as-is. However, why does the behavior change in (2)?
(3) Stranger still - let's reverse the concatenation:
memo1.Lines.Add( RawByteString( utf8Str ) + 'x' );
// output: '€ąćęłńóśźżx'
I've been reading up on the newfangled string types in Delphi and thought I understood how they work, but this is a puzzle.

RawByteString only exists to minimize the number of overloads required for functions that work with various flavours of AnsiStrings with different codepage affinities.
In general, don't declare variables of type RawByteString. Don't typecast values to that type. Don't do concatenations on variables of that type. About the only things you can do are:
Declaring a parameter of this type (the original intent)
Indexing on such a parameter
Searching in such a parameter
Intelligent operations that check the actual code page of the string, using the StringCodePage function.
For example, you'll note that the StringCodePage function itself uses RawByteString as its argument type. This way, it will work with any AnsiString, rather than doing a codepage translation before passing it as an argument.
For your case, things like concatenations are largely undefined. The behaviour changed between RTM and Update 2, but when the RTL string concatenation functions receive multiple strings with different code pages, there's no easy way for it to figure out what code page should be used for the final string. That's just one reason why you shouldn't concatenate them like you do here.
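To look at the raw bytes of a UTF8String without typecasts or concatenation of mixed code pages, one option consistent with the uses listed above is to pass the string to a RawByteString parameter and index it byte by byte. The sketch below assumes this is acceptable for display purposes; BytesAsHex is a hypothetical helper, and the euro sign is built from its code point so the example does not depend on the source file's encoding.

```delphi
program RawBytesOfUtf8;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: a RawByteString parameter accepts any AnsiString
// flavour without a code page conversion, so indexing sees the stored bytes.
function BytesAsHex(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[I]), 2);
  end;
end;

var
  Wide: string;
  Utf8Text: UTF8String;
begin
  Wide := #$20AC;                    // the euro sign as a UTF-16 string
  Utf8Text := UTF8String(Wide);      // converts UTF-16 to UTF-8
  Writeln(BytesAsHex(Utf8Text));     // E2 82 AC
  Writeln(StringCodePage(Utf8Text)); // 65001, the UTF-8 code page
end.
```

The hex text is an ordinary UnicodeString, so it can be added to a TMemo without any further conversion surprises.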

You cannot add a string to a TMemo "as is". You always need to do some kind of conversion to Unicode, because that's all TMemo knows about in Delphi 2009.
If you want to pretend that your UTF8String uses code page 1252, do this:
var
  utf8Str : UTF8String;
  Raw: RawByteString;
begin
  utf8Str := '€ąćęłńóśźż';
  Raw := utf8Str;
  SetCodePage(Raw, 1252, False);
  Memo.Lines.Add(Raw);
end;
For more details, see my article Using RawByteString Effectively
