Convert encoding of TMemoryStream to utf8 - delphi

I have a file opened in TMemoryStream. Its current encoding can be ANSI or UTF8 with BOM. I have to convert the encoding of TMemoryStream to UTF8. How do I do that?

If you are able to change the TMemoryStream to its descendant TBytesStream you can just use the Convert function from TEncoding.
var
  stream: TBytesStream;
  bytes: TBytes;
  curEncoding: TEncoding;
...
// nil asks GetBufferEncoding to detect the encoding from the BOM
curEncoding := nil;
TEncoding.GetBufferEncoding(stream.Bytes, curEncoding);
if curEncoding <> TEncoding.UTF8 then begin
  bytes := TEncoding.Convert(curEncoding, TEncoding.UTF8, stream.Bytes);
  stream.Free;
  stream := TBytesStream.Create(bytes);
end;
Not sure if it is the most efficient way, but at least it is one way and it only needs a couple of lines, which in turn is also some sort of efficiency.
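If switching to TBytesStream is not an option, the same idea can be sketched against a plain TMemoryStream. ConvertStreamToUtf8 is a hypothetical helper name, not something from the RTL. It assumes that input without a BOM is in the default (ANSI on Windows) encoding, which is what GetBufferEncoding falls back to, and it does not add a UTF-8 BOM to converted output. Input that already carries a UTF-8 BOM is left untouched.

```delphi
uses
  SysUtils, Classes;

// Hypothetical helper: re-encodes the content of a TMemoryStream as UTF-8.
procedure ConvertStreamToUtf8(Stream: TMemoryStream);
var
  Raw, Converted: TBytes;
  SourceEncoding: TEncoding;
  BomLen: Integer;
begin
  // Copy exactly Size bytes out of the stream.
  SetLength(Raw, Integer(Stream.Size));
  if Length(Raw) = 0 then
    Exit;
  Stream.Position := 0;
  Stream.ReadBuffer(Raw[0], Length(Raw));

  // nil asks GetBufferEncoding to detect the encoding from the BOM;
  // without a BOM it falls back to the default encoding.
  SourceEncoding := nil;
  BomLen := TEncoding.GetBufferEncoding(Raw, SourceEncoding);

  if SourceEncoding <> TEncoding.UTF8 then
  begin
    // Skip the source BOM (if any) so it is not converted as text.
    Converted := TEncoding.Convert(SourceEncoding, TEncoding.UTF8, Raw,
      BomLen, Length(Raw) - BomLen);
    Stream.Clear;
    if Length(Converted) > 0 then
      Stream.WriteBuffer(Converted[0], Length(Converted));
  end;
  Stream.Position := 0;
end;
```

Call it right after loading, for example after Stream.LoadFromFile(FileName). If the consumer expects a BOM, TEncoding.UTF8.GetPreamble can be written to the stream before the converted bytes.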

Related

Copy TByteDynArray (array of byte) to string

How can I copy contents of a TByteDynArray variable to a string variable or even better, to a TMemoryStream?
Remy, thanks for your answer.
Well, I can't get it to work.
I'm doing this:
obtReferenciaPagamentoResponse.pdf is a TByteDynArray (array of byte) that comes through a web service call and is declared in the XSD as xsd:base64Binary.
procedure saveFile;
var
LInput, LOutput: TMemoryStream;
Id: Integer;
Buff: AnsiString;
//Buff: String;
begin
LInput := TMemoryStream.Create;
LOutput := TMemoryStream.Create;
// Tried like this also
//SetLength(Buff, Length(obtReferenciaPagamentoResponse.pdf));
//Move(obtReferenciaPagamentoResponse.pdf[0], Buff[1], Length(obtReferenciaPagamentoResponse.pdf));
// Tried other charsets
Buff := TEncoding.Ansi.GetString(obtReferenciaPagamentoResponse.pdf);
LInput.Write(Buff[1], Length(Buff) * SizeOf(Buff[1]));
LInput.Position := 0;
TNetEncoding.Base64.Decode(LInput, LOutput);
LOutput.Position := 0;
LOutput.SaveToFile(SaveDialog2.FileName);
LInput.Free;
LOutput.Free;
end;
But the saved PDF file seems to be incomplete, because it is always reported as corrupted when I open it.
What am I doing wrong?
String is an alias for UnicodeString since 2009. As UnicodeString characters are now encoded in UTF-16, it does not make sense to copy raw bytes into a (Unicode)String unless the bytes are also encoded in UTF-16. In that case, you can simply use SetLength() to allocate the String's length to the appropriate number of Chars and then Move() the raw bytes into the String's allocated memory. Otherwise, use TEncoding.GetString() instead to decode the bytes into a UTF-16 String using the appropriate charset.
As for TMemoryStream, it has a Write() method for writing raw bytes into the stream. Simply set its Position property to the desired offset and then write the bytes.
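The Write() approach can be sketched as follows. SaveBytesToFile is a hypothetical helper name. The sketch assumes the SOAP layer has already decoded the xsd:base64Binary payload, so the TByteDynArray holds the raw PDF bytes and no further Base64 or string conversion is needed. That assumption is worth checking against the actual response.

```delphi
uses
  SysUtils, Classes, Types;

// Hypothetical helper: writes the raw bytes to a stream and saves them,
// without going through a string or a Base64 decoder.
procedure SaveBytesToFile(const Data: TByteDynArray; const FileName: string);
var
  Stream: TMemoryStream;
begin
  Stream := TMemoryStream.Create;
  try
    // Data[0] is only valid for a non-empty array.
    if Length(Data) > 0 then
      Stream.WriteBuffer(Data[0], Length(Data));
    Stream.SaveToFile(FileName);
  finally
    Stream.Free;
  end;
end;
```

With the names from the question, the call would be SaveBytesToFile(obtReferenciaPagamentoResponse.pdf, SaveDialog2.FileName).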

Compress Base64 string with zlib

I need to send a big Base64 string from Windows to mobile devices (iOS and Android) over TCP.
I have no problem sending and receiving, but the strings are too big, about 24000 characters, and I'm looking for a method to compress and decompress them.
From what I have read, the best way is to use zlib, and I found this link, Delphi XE and ZLib Problems (II), which explains how to do it.
The functions work with normal text strings, but compressing Base64 strings makes them bigger.
An example of a very small string that I would send is this:
cEJNYkpCSThLVEh6QjNFWC9wSGhXQ3lHWUlBcGNURS83TFdDNVUwUURxRnJvZlRVUWd4WEFWcFJBNUZSSE9JRXlsaWgzcEJvTGo5anQwTlEyd1pBTEtVQVlPbXdkKzJ6N3J5ZUd4SmU2bDNBWjFEd3lVZmZTR1FwNXRqWTVFOFd2SHRwakhDOU9JUEZRM00wMWhnU0p3MWxxNFRVdmdEU2pwekhwV2thS0JFNG9WYXRDUHhTdnp4blU5Vis2ZzJQYnRIdllubzhKSFhZeUlpckNtTGtUZHVHOTFncHVUWC9FSTdOK3JEUDBOVzlaTngrcEdxcXhpRWJ1ZXNUMmdxOXpJa0ZEak1ORHBFenFVSTlCdytHTy==
I don't know if it is possible to compress this type of string. I need help.
The functions that I use are these:
uses
SysUtils, Classes, ZLib, EncdDecd;
function CompressAndEncodeString(const Str: string): string;
var
Utf8Stream: TStringStream;
Compressed: TMemoryStream;
Base64Stream: TStringStream;
begin
Utf8Stream := TStringStream.Create(Str, TEncoding.UTF8);
try
Compressed := TMemoryStream.Create;
try
ZCompressStream(Utf8Stream, Compressed);
Compressed.Position := 0;
Base64Stream := TStringStream.Create('', TEncoding.ASCII);
try
EncodeStream(Compressed, Base64Stream);
Result := Base64Stream.DataString;
finally
Base64Stream.Free;
end;
finally
Compressed.Free;
end;
finally
Utf8Stream.Free;
end;
end;
function DecodeAndDecompressString(const Str: string): string;
var
Utf8Stream: TStringStream;
Compressed: TMemoryStream;
Base64Stream: TStringStream;
begin
Base64Stream := TStringStream.Create(Str, TEncoding.ASCII);
try
Compressed := TMemoryStream.Create;
try
DecodeStream(Base64Stream, Compressed);
Compressed.Position := 0;
Utf8Stream := TStringStream.Create('', TEncoding.UTF8);
try
ZDecompressStream(Compressed, Utf8Stream);
Result := Utf8Stream.DataString;
finally
Utf8Stream.Free;
end;
finally
Compressed.Free;
end;
finally
Base64Stream.Free;
end;
end;
As I understand the question, you have done the following:
1. Encoded a string as UTF-8 bytes.
2. Compressed those bytes using zlib.
3. Base64 encoded the compressed bytes.
You then attempt to compress the output of step 3 and find that the result is no smaller. That is to be expected. You have already compressed the data, and further attempts to compress it cannot be expected to reduce the size significantly, especially not if you have base64 encoded in the meantime. If you could repeatedly compress data and have it get smaller each time, then eventually there would be nothing left. That is obviously not possible.
I think you are already doing a good job. You convert to UTF-8 which for most text is the most space effective of the Unicode encodings. If you worked with Chinese text then you'd be better off with UTF-16. You then compress the UTF-8 which is also reasonable. And finally for transmission you encode with base64, also reasonable.
The most obvious way for you to reduce the size of data to be transmitted is for you to omit the base64 step. If you can transmit the compressed bytes that are produced in step 2 then you will be transmitting less. Base64 uses 4 bytes to encode 3 bytes so the size of base64 encoded data is a third larger than the input data.
Another way could be to use a better compression algorithm than zlib, but again there are limits to what can be achieved. And usually better compression is achieved at the cost of increased computational time.
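Omitting the Base64 step could look like the sketch below. CompressStringToBytes is a hypothetical name. It reuses ZCompressStream from the question's code and returns the compressed bytes directly, on the assumption that the TCP layer can send a TBytes buffer, for example prefixed with its length.

```delphi
uses
  SysUtils, Classes, ZLib;

// Hypothetical variant of CompressAndEncodeString without the Base64 step.
function CompressStringToBytes(const Str: string): TBytes;
var
  Utf8Stream: TStringStream;
  Compressed: TMemoryStream;
  Len: Integer;
begin
  Utf8Stream := TStringStream.Create(Str, TEncoding.UTF8);
  try
    Compressed := TMemoryStream.Create;
    try
      ZCompressStream(Utf8Stream, Compressed);
      // Copy the compressed bytes out of the stream's buffer.
      Len := Integer(Compressed.Size);
      SetLength(Result, Len);
      if Len > 0 then
        Move(Compressed.Memory^, Result[0], Len);
    finally
      Compressed.Free;
    end;
  finally
    Utf8Stream.Free;
  end;
end;
```

For every 3 bytes returned here, the Base64 version would put 4 characters on the wire, so the saving is about a quarter of the transmitted size.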

Assign [array of byte] to a Variant with no Unicode conversion

Consider the following code snippet (in Delphi XE2):
function PrepData(StrVal: string; Base64Val: AnsiString): OleVariant;
begin
Result := VarArrayCreate([0, 1], varVariant);
Result[0] := StrVal;
Result[1] := Base64Val;
end;
Base64Val is a binary value encoded as Base64 (so no null bytes). The (OleVariant) Result is automatically marshalled and sent between a client app and a DataSnap server.
When I capture the traffic with Wireshark, I see that both StrVal and Base64Val are transferred as Unicode strings. If I can, I would like to avoid the Unicode conversion for Base64Val. I've looked at all the Variant types and don't see anything other than varString that can transfer an array of characters.
I found this question that shows how to create a variant array of bytes. I'm thinking that I could use this technique instead of using an AnsiString. I'm curious though, is there another way to assign an array of non-Unicode character data to a Variant without a conversion to a Unicode string?
Delphi's implementation supports storing AnsiString and UnicodeString in a Variant, using custom variant type codes. These codes are varString and varUString.
But interop will typically use standard OLE variants and the OLE string, varOleStr, is 16 bit encoded. That would seem to be the reason for your observation.
You'll need to put the data in as an array of bytes if you do wish to avoid a conversion to 16 bit text. Doing so renders base64 encoding pointless. Stop base64 encoding the payload and send the binary in a byte array.
Keeping with the example in the question, this is how I made it work (using code and comments from David's answer to another question as referenced in my question):
function PrepData(StrVal: string; Data: TBytes): OleVariant;
var
SafeArray: PVarArray;
begin
Result := VarArrayCreate([0, 1], varVariant);
Result[0] := StrVal;
Result[1] := VarArrayCreate([1, Length(Data)], varByte);
SafeArray := VarArrayAsPSafeArray(Result[1]);
Move(Pointer(Data)^, SafeArray.Data^, Length(Data));
end;
Then on the DataSnap server, I can extract the binary data from the OleVariant like this, assuming Value is Result[1] from the Variant Array in the OleVariant:
procedure GetBinaryData(Value: Variant; Result: TMemoryStream);
var
SafeArray: PVarArray;
begin
SafeArray := VarArrayAsPSafeArray(Value);
Assert(SafeArray.ElementSize=1);
Result.Clear;
Result.WriteBuffer(SafeArray.Data^, SafeArray.Bounds[0].ElementCount);
end;

How to convert AnsiChar to UnicodeChar with specific CodePage?

I'm generating texture atlases for rendering Unicode texts in my app. Source texts are stored in ANSI codepages (1250, 1251, 1254, 1257, etc). I want to be able to generate all the symbols from each ANSI codepage.
Here is the outline of the code I would expect to have:
for I := 0 to 255 do
begin
anChar := AnsiChar(I); //obtain AnsiChar
//Apply codepage without converting the chars
//<<--- this part does not work, showing:
//"E2033 Types of actual and formal var parameters must be identical"
SetCodePage(anChar, aCodepages[K], False);
//Assign AnsiChar to UnicodeChar (automatic conversion)
uniChar := anChar;
//Here we get Unicode character index
uniCode := Ord(uniChar);
end;
The code above does not work (E2033) and I'm not sure it is a proper solution at all. Perhaps there's a much shorter version.
What is the proper way of converting AnsiChar into Unicode with specific codepage in mind?
I would do it like this:
function AnsiCharToWideChar(ac: AnsiChar; CodePage: UINT): WideChar;
begin
if MultiByteToWideChar(CodePage, 0, @ac, 1, @Result, 1) <> 1 then
RaiseLastOSError;
end;
I think you should avoid using strings for what is in essence a character operation. If you know up front which code pages you need to support then you can hard code the conversions into a lookup table expressed as an array constant.
Note that all the characters that are defined in the ANSI code pages map to Unicode characters from the Basic Multilingual Plane and so are represented by a single UTF-16 character. Hence the size assumptions of the code above.
However, the assumption that you are making, and that this answer persists, is that a single byte represents a character in an ANSI character set. That's a valid assumption for many character sets, for example the single byte western character sets like 1252. But there are character sets like 932 (Japanese), 949 (Korean) etc. that are double byte character sets. Your entire approach breaks down for those code pages. My guess is that you only wish to support single byte character sets.
If you are writing cross-platform code then you can replace MultiByteToWideChar with UnicodeFromLocaleChars.
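The lookup table mentioned above does not have to be typed in by hand. As a sketch, it can be filled once per code page with the same MultiByteToWideChar call. TCodePageTable and BuildCodePageTable are hypothetical names, and the code is Windows only and assumes a single byte code page.

```delphi
uses
  Windows, SysUtils;

type
  // Hypothetical table type: one UTF-16 code unit per ANSI byte value.
  TCodePageTable = array[Byte] of WideChar;

// Fills Table with the Unicode character for each byte of CodePage.
// Raises an OS error if a byte cannot be converted on its own,
// as happens with lead bytes of double byte code pages.
procedure BuildCodePageTable(CodePage: UINT; out Table: TCodePageTable);
var
  I: Integer;
  ac: AnsiChar;
begin
  for I := 0 to 255 do
  begin
    ac := AnsiChar(I);
    if MultiByteToWideChar(CodePage, 0, @ac, 1, @Table[I], 1) <> 1 then
      RaiseLastOSError;
  end;
end;
```

After BuildCodePageTable(1251, Table), Ord(Table[$C0]) is $0410 (Cyrillic capital A), and the atlas generator can index the table instead of converting each character again.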
You can also do it in one step for all characters. Here is an example for codepage 1250:
var
encoding: TEncoding;
bytes: TBytes;
unicode: TArray<Word>;
I: Integer;
S: string;
begin
SetLength(bytes, 256);
for I := 0 to 255 do
bytes[I] := I;
SetLength(unicode, 256);
encoding := TEncoding.GetEncoding(1250); // change codepage as needed
try
S := encoding.GetString(bytes);
for I := 0 to 255 do
unicode[I] := Word(S[I+1]); // as long as strings are 1-based
finally
encoding.Free;
end;
end;
Here is the code I have found to be working well:
var
I: Byte;
anChar: AnsiString;
Tmp: RawByteString;
uniChar: Char;
uniCode: Word;
begin
for I := 0 to 255 do
begin
anChar := AnsiChar(I);
Tmp := anChar;
SetCodePage(Tmp, aCodepages[K], False);
uniChar := UnicodeString(Tmp)[1];
uniCode := Word(uniChar);
<...snip...>
end;

unicode text file output differs between XE2 and Delphi 2009?

When I try the code below, the output in XE2 seems to differ from the output in D2009.
procedure TForm1.Button1Click(Sender: TObject);
var Outfile:textfile;
myByte: Byte;
begin
assignfile(Outfile,'test_chinese.txt');
Rewrite(Outfile);
for myByte in TEncoding.UTF8.GetPreamble do write(Outfile, AnsiChar(myByte));
//This is the UTF-8 BOM
Writeln(Outfile,utf8string('总结'));
Writeln(Outfile,'°C');
Closefile(Outfile);
end;
Compiling with XE2 on a Windows 8 PC gives in WordPad
??
C
txt hex code: EF BB BF 3F 3F 0D 0A B0 43 0D 0A
Compiling with D2009 on a Windows XP PC gives in Wordpad
总结
°C
txt hex code: EF BB BF E6 80 BB E7 BB 93 0D 0A B0 43 0D 0A
My question is why it differs, and how can I save Chinese characters to a text file using the old text file I/O?
Thanks!
In XE2 onwards, AssignFile() has an optional CodePage parameter that sets the codepage of the output file:
function AssignFile(var F: File; FileName: String; [CodePage: Word]): Integer; overload;
Write() and Writeln() both have overloads that support UnicodeString and WideChar inputs.
So, you can create a file that has its codepage set to CP_UTF8, and then Write/ln() will automatically convert Unicode strings to UTF-8 when writing them to the file.
The downside is that you will not be able to write the UTF-8 BOM using AnsiChar values anymore, because the individual bytes will get converted to UTF-8 and thus not be written correctly. You can get around that by writing the BOM as a single Unicode character (which is what it really is: U+FEFF) instead of as individual bytes.
This works in XE2:
procedure TForm1.Button1Click(Sender: TObject);
var
Outfile: TextFile;
begin
AssignFile(Outfile, 'test_chinese.txt', CP_UTF8);
Rewrite(Outfile);
//This is the UTF-8 BOM
Write(Outfile, #$FEFF);
Writeln(Outfile, '总结');
Writeln(Outfile, '°C');
CloseFile(Outfile);
end;
With that said, if you want something that is more compatible and reliable between D2009 and XE2, use TStreamWriter instead:
procedure TForm1.Button1Click(Sender: TObject);
var
Outfile: TStreamWriter;
begin
Outfile := TStreamWriter.Create('test_chinese.txt', False, TEncoding.UTF8);
try
Outfile.WriteLine('总结');
Outfile.WriteLine('°C');
finally
Outfile.Free;
end;
end;
Or do the file I/O manually:
procedure TForm1.Button1Click(Sender: TObject);
var
Outfile: TFileStream;
BOM: TBytes;
procedure WriteBytes(const B: TBytes);
begin
if Length(B) > 0 then Outfile.WriteBuffer(B[0], Length(B));
end;
procedure WriteStr(const S: UTF8String);
begin
if S <> '' then Outfile.WriteBuffer(S[1], Length(S));
end;
procedure WriteLine(const S: UTF8String);
begin
WriteStr(S);
WriteStr(sLineBreak);
end;
begin
Outfile := TFileStream.Create('test_chinese.txt', fmCreate);
try
WriteBytes(TEncoding.UTF8.GetPreamble);
WriteLine('总结');
WriteLine('°C');
finally
Outfile.Free;
end;
end;
You really shouldn't use the old text I/O anymore.
Anyway, you can use TEncoding to get the UTF-8 TBytes like this:
procedure TForm1.Button1Click(Sender: TObject);
var Outfile:textfile;
Bytes: TBytes;
myByte: Byte;
begin
assignfile(Outfile,'test_chinese.txt');
Rewrite(Outfile);
for myByte in TEncoding.UTF8.GetPreamble do write(Outfile, AnsiChar(myByte));
//This is the UTF-8 BOM
Bytes := TEncoding.UTF8.GetBytes('总结');
for myByte in Bytes do begin
Write(Outfile, AnsiChar(myByte));
end;
Writeln(Outfile,'°C');
Closefile(Outfile);
end;
I'm not sure if there is an easier way to write TBytes to a Textfile, maybe somebody else has a better idea.
Edit:
For a pure binary file (File instead of TextFile type) you can use BlockWrite.
There are a couple of tell-tale signs that may tell you what went wrong when dealing with Unicode. In your case you're seeing "?" in the resulting output file: you get question marks when you try to convert something from Unicode to a code page and the target code page can't represent the requested characters.
Looking at the hex dump, it's obvious (counting line terminators) that the question marks are the result of saving the two Chinese characters to the file: the two characters were converted to exactly two question marks. This tells you that Writeln() decided to lend you a helping hand and converted the text from UTF-8 (a Unicode representation) to your local code page. The Delphi team probably decided to do this since the old I/O routines are not supposed to be Unicode compatible; since you're writing a UTF-8 string using the old I/O routines, they're helping you by converting it to your code page. You might not welcome that helping hand, but it doesn't mean it was wrong to do so: it's undocumented territory.
Since you now know why that's happening, you know what to do to stop it: let Writeln() know you're sending something that doesn't need converting. You'll discover that's not particularly easy, since Delphi XE2 apparently "helps you out" whatever you do. For example, stuff like this doesn't just change the string type, it converts to AnsiString, going through the code-page conversion routine that gets you question marks:
AnsiString(UTF8String('Whatever Unicode'));
Because of this, and if you need one-liner solutions, you could try a conversion routine, something like this:
function FakeConvert(const InStr: UTF8String): AnsiString;
var N: Integer;
begin
N := Length(InStr);
SetLength(Result, N);
if N > 0 then Move(InStr[1], Result[1], N); // guard: indexing an empty string is invalid
end;
You'll then be able to do:
Writeln(Outfile,FakeConvert('总结'));
And it'll do what you expect (I did actually try it before posting!)
Of course the only TRUE answer to this question is, since you upgraded all the way to Delphi XE2:
Stop using the deprecated I/O routines and move to TStream-based I/O.
