Compress Base64 string with zlib - delphi

I need to send from Windows to mobile devices, iOS and Android, by TCP protocol, a big Base64 string.
I have no problem to send and receive, but the strings size are too big, about 24000 characters, and I'm looking at method to compress an decompress these strings.
Looking I see, that the best way is using the Zlib, and I found these link Delphi XE and ZLib Problems (II) in which explains how to do it.
The functions work with normal text string, but compressing base64 strings make they more big.
An example of a very small string that i would send, would be this:
cEJNYkpCSThLVEh6QjNFWC9wSGhXQ3lHWUlBcGNURS83TFdDNVUwUURxRnJvZlRVUWd4WEFWcFJBNUZSSE9JRXlsaWgzcEJvTGo5anQwTlEyd1pBTEtVQVlPbXdkKzJ6N3J5ZUd4SmU2bDNBWjFEd3lVZmZTR1FwNXRqWTVFOFd2SHRwakhDOU9JUEZRM00wMWhnU0p3MWxxNFRVdmdEU2pwekhwV2thS0JFNG9WYXRDUHhTdnp4blU5Vis2ZzJQYnRIdllubzhKSFhZeUlpckNtTGtUZHVHOTFncHVUWC9FSTdOK3JEUDBOVzlaTngrcEdxcXhpRWJ1ZXNUMmdxOXpJa0ZEak1ORHBFenFVSTlCdytHTy==
I don't know if is posible to compress this types of strings. I need help.
The functions that I use are this:
uses
SysUtils, Classes, ZLib, EncdDecd;
function CompressAndEncodeString(const Str: string): string;
var
Utf8Stream: TStringStream;
Compressed: TMemoryStream;
Base64Stream: TStringStream;
begin
Utf8Stream := TStringStream.Create(Str, TEncoding.UTF8);
try
Compressed := TMemoryStream.Create;
try
ZCompressStream(Utf8Stream, Compressed);
Compressed.Position := 0;
Base64Stream := TStringStream.Create('', TEncoding.ASCII);
try
EncodeStream(Compressed, Base64Stream);
Result := Base64Stream.DataString;
finally
Base64Stream.Free;
end;
finally
Compressed.Free;
end;
finally
Utf8Stream.Free;
end;
end;
function DecodeAndDecompressString(const Str: string): string;
var
Utf8Stream: TStringStream;
Compressed: TMemoryStream;
Base64Stream: TStringStream;
begin
Base64Stream := TStringStream.Create(Str, TEncoding.ASCII);
try
Compressed := TMemoryStream.Create;
try
DecodeStream(Base64Stream, Compressed);
Compressed.Position := 0;
Utf8Stream := TStringStream.Create('', TEncoding.UTF8);
try
ZDecompressStream(Compressed, Utf8Stream);
Result := Utf8Stream.DataString;
finally
Utf8Stream.Free;
end;
finally
Compressed.Free;
end;
finally
Base64Stream.Free;
end;
end;

As I understand the question you have done the following:
Encoding a string as UTF-8 bytes.
Compressed those bytes using zlib.
Base64 encoded the compressed bytes.
You then attempt to compress the output of step 3 and find that the result is no smaller. That is to be expected. You have already compressed the data, and further attempts to compress it cannot be expected to reduce the size significantly, especially not if you have base64 encoded in the meantime. If you could repeatedly compress data and have it get smaller each time, then eventually there would be nothing left. That is obviously not possible.
I think you are already doing a good job. You convert to UTF-8 which for most text is the most space effective of the Unicode encodings. If you worked with Chinese text then you'd be better off with UTF-16. You then compress the UTF-8 which is also reasonable. And finally for transmission you encode with base64, also reasonable.
The most obvious way for you to reduce the size of data to be transmitted is for you to omit the base64 step. If you can transmit the compressed bytes that are produced in step 2 then you will be transmitting less. Base64 uses 4 bytes to encode 3 bytes so the size of base64 encoded data is a third larger than the input data.
Another way could be to use a better compression algorithm than zlib, but again there are limits to what can be achieved. And usually better compression is achieved at the cost of increased computational time.

Related

Convert encoding of TMemoryStream to utf8

I have a file opened in TMemoryStream. Its current encoding can be ANSI or UTF8 with BOM. I have to convert the encoding of TMemoryStream to UTF8. How do I do that?
If you are able to change the TMemoryStream to its descendant TBytesStream you can just use the Convert function from TEncoding.
var
stream: TBytesStream;
bytes: TBytesStream;
...
TEncoding.GetBufferEncoding(stream.Bytes, curEncoding);
if curEncoding <> TEncoding.UTF8 then begin
bytes := TEncoding.Convert(curEncoding, TEncoding.UTF8, stream.Bytes);
stream.Free;
stream := TBytesStream.Create(bytes);
end;
Not sure if it is the most efficient way, but at least it is one way and it only needs a couple of lines, which in turn is also some sort of efficiency.

File upload fails, when posting with Indy and filename contains Greek characters

I am trying to implement a POST to a web service. I need to send a file whose type is variable (.docx, .pdf, .txt) along with a JSON formatted string.
I have manage to post files successfully with code similar to the following:
procedure DoRequest;
var
Http: TIdHTTP;
Params: TIdMultipartFormDataStream;
RequestStream, ResponseStream: TStringStream;
JRequest, JResponse: TJSONObject;
url: string;
begin
url := 'some_custom_service'
JRequest := TJSONObject.Create;
JResponse := TJSONObject.Create;
try
JRequest.AddPair('Pair1', 'Value1');
JRequest.AddPair('Pair2', 'Value2');
JRequest.AddPair('Pair3', 'Value3');
Http := TIdHTTP.Create(nil);
ResponseStream := TStringStream.Create;
RequestStream := TStringStream.Create(UTF8Encode(JRequest.ToString));
try
Params := TIdMultipartFormDataStream.Create;
Params.AddFile('File', ceFileName.Text, '').ContentTransfer := '';
Params.AddFormField('Json', 'application/json', '', RequestStream);
Http.Post(url, Params, ResponseStream);
JResponse := TJSONObject.ParseJSONValue(ResponseStream.DataString) as TJSONObject;
finally
RequestStream.Free;
ResponseStream.Free;
Params.Free;
Http.Free;
end;
finally
JRequest.Free;
JResponse.Free;
end;
end;
The problem appears when I try to send a file that contains Greek characters and spaces in the filename. Sometimes it fails and sometimes it succeeds.
After a lot of research, I notice that the POST header is encoded by Indy's TIdFormDataField class using the EncodeHeader() function. When the post fails, the encoded filename in the header is split, compared to the successful post where is not split.
For example :
Επιστολή εκπαιδευτικο.docx is encoded as =?UTF-8?B?zpXPgM65z4PPhM6/zrvOriDOtc66z4DOsc65zrTOtc+Fz4TOuc66zr8uZG9j?='#$D#$A' =?UTF-8?B?eA==?=, which fails.
Επιστολή εκπαιδευτικ.docx is encoded as
=?UTF-8?B?zpXPgM65z4PPhM6/zrvOriDOtc66z4DOsc65zrTOtc+Fz4TOuc66LmRvY3g=?=, which succeeds.
Επιστολή εκπαιδευτικ .docx is encoded as
=?UTF-8?B?zpXPgM65z4PPhM6/zrvOriDOtc66z4DOsc65zrTOtc+Fz4TOuc66?= .docx, which fails.
I have tried to change the encoding of the filename, the AContentType of the AddFile() procedure, and the ContentTransfer, but none of those change the behavior, and I still get errors when the encoded filename is split.
Is this some kind of bug, or am I missing something?
My code works for every case except those I described above.
I am using Delphi XE3 with Indy10.
EncodeHeader() does have some known issues with Unicode strings:
EncodeHeader() needs to take codeunits into account when splitting data between adjacent encoded-words
Basically, an MIME-encoded word cannot be more than 75 characters in length, so long text gets split up. But when encoding a long Unicode string, any given Unicode character may be charset-encoded using 1 or more bytes, and EncodeHeader() does not yet avoid erroneously splitting a multi-byte character between two individual bytes into separate encoded words (which is illegal and explicitly forbidden by RFC 2047 of the MIME spec).
However, that is not what is happening in your examples.
In your first example, 'Επιστολή εκπαιδευτικο.docx' is too long to be encoded as a single MIME word, so it gets split into 'Επιστολή εκπαιδευτικο.doc' 'x' substrings, which are then encoded separately. This is legal in MIME for long text (though you might have expected Indy to split the text into 'Επιστολή' ' εκπαιδευτικο.doc' instead, or even 'Επιστολή' ' εκπαιδευτικο' '.doc'. That might be a possibility in a future release). Adjacent MIME words that are separated by only whitespace are meant to be concatenated together without separating whitespace when decoded, thus producing 'Επιστολή εκπαιδευτικο.docx' again. If the server is not doing that, it has a flaw in its decoder (maybe it is decoding as 'Επιστολή εκπαιδευτικο.doc x' instead?).
In your second example, 'Επιστολή εκπαιδευτικ.docx' is short enough to be encoded as a single MIME word.
In your third example, 'Επιστολή εκπαιδευτικ .docx' gets split on the second whitespace (not the first) into 'Επιστολή εκπαιδευτικ' ' .docx' substrings, and only the first substring needs to be encoded. This is legal in MIME. When decoded, the decoded text is meant to be concatenated with the following unencoded text, preserving whitespace between them, thus producing 'Επιστολή εκπαιδευτικ .docx' again. If the server is not doing that, it has a flaw in its decoder (maybe it is decoding as 'Επιστολή εκπαιδευτικ.docx' instead?).
If you run these example filenames through Indy's MIME header encoder/decoder, they do decode properly:
var
s: String;
begin
s := EncodeHeader('Επιστολή εκπαιδευτικο.docx', '', 'B', 'UTF-8');
ShowMessage(s); // '=?UTF-8?B?zpXPgM65z4PPhM6/zrvOriDOtc66z4DOsc65zrTOtc+Fz4TOuc66zr8uZG9j?='#13#10' =?UTF-8?B?eA==?='
s := DecodeHeader(s);
ShowMessage(s); // 'Επιστολή εκπαιδευτικο.docx'
s := EncodeHeader('Επιστολή εκπαιδευτικ.docx', '', 'B', 'UTF-8');
ShowMessage(s); // '=?UTF-8?B?zpXPgM65z4PPhM6/zrvOriDOtc66z4DOsc65zrTOtc+Fz4TOuc66LmRvY3g=?='
s := DecodeHeader(s);
ShowMessage(s); // 'Επιστολή εκπαιδευτικ.docx'
s := EncodeHeader('Επιστολή εκπαιδευτικ .docx', '', 'B', 'UTF-8');
ShowMessage(s); // '=?UTF-8?B?zpXPgM65z4PPhM6/zrvOriDOtc66z4DOsc65zrTOtc+Fz4TOuc66?= .docx'
s := DecodeHeader(s);
ShowMessage(s); // 'Επιστολή εκπαιδευτικ .docx'
end;
So the problem seems to be on the server side decoding, not on Indy's client side encoding.
That being said, if you are using a fairly recent version of Indy 10 (Nov 2011 or later), TIdFormDataField has a HeaderEncoding property, which defaults to 'B' (base64) in Unicode environments. However, the splitting logic also affects 'Q' (quoted-printable) as well, so that may or may not work for you, either (but you can try it):
with Params.AddFile('File', ceFileName.Text, '') do
begin
ContentTransfer := '';
HeaderEncoding := 'Q'; // <--- here
HeaderCharSet := 'utf-8';
end;
Otherwise, a workaround might be to change the value to '8' (8-bit) instead, which effectively disables MIME encoding (but not charset encoding):
with Params.AddFile('File', ceFileName.Text, '') do
begin
ContentTransfer := '';
HeaderEncoding := '8'; // <--- here
HeaderCharSet := 'utf-8';
end;
Just note that if the server is not expecting raw UTF-8 bytes for the filename, you might still run into problems (ie, 'Επιστολή εκπαιδευτικο.docx' being interpreted as 'Επιστολή εκπαιδευτικο.docx', for instance).

Delphi Lockbox Hashing

I need to hash a string, preferably as SHA512, although it could be SHA256, SHA1, MD5 or CRC32.
I have downloaded Lockbox 3, put a TCryptographicLibrary and a THash component on a form, set the Hash property to SHA-512 and used the following code to produce a test result:
procedure TForm1.Button1Click(Sender: TObject);
begin
Hash1.HashString('myhashtest');
Edit1.Text := Stream_To_AnsiString(Hash1.HashOutputValue);
end;
To best illustrate the problem, I have gone on to an online hash calculator and the MD5 hash of 'myhashtest' is ff91e22313f0a41b46719e7ee6f99451 but setting the hash property in my test program to MD5 results in ÿ‘â#ð¤Fqž~æù”Q which is clearly wrong. I have tried the same test using other Hash properties, including the SHA512 which i want, and they all return rubbish.
Where am I going wrong?
THash.HashOutputValue is a stream of the raw hashed bytes. It appears that Stream_To_AnsiString() merely copies those raw bytes as-is into an AnsiString, it does not encode the bytes in any way. What you are looking for is the hex encoded version of the raw bytes instead. I do know that LockBox has a Stream_To_Base64() function (as shown in this example), but I do not know if it has a Stream_To_Hex() type of function. If it does not, you can easily create your own, eg:
function Stream_To_Hex(Stream: TStream): AnsiString;
var
NumBytes, I: Integer;
B: Byte;
begin
NumBytes := Stream.Size - Stream.Position;
SetLength(Result, NumBytes * 2);
for I := 0 to NumBytes-1 do
begin
Stream.ReadBuffer(B, 1);
BinToHex(#B, #Result[(I*2)+1], 1);
end;
end;
procedure TForm1.Button1Click(Sender: TObject);
begin
Hash1.HashString('myhashtest');
Edit1.Text := Stream_To_Hex(Hash1.HashOutputValue);
end;
Many cryptographic functions 'silently' (i.e. without stating so in the docs) output and require Base64- or hex-encoded strings (and also often AnsiStrings). This is because encrypted text can contain any data, and as soon as you start treating that as 'strings', string handling functions can easily choke on that (e.g. null-terminated strings containing a null). By Base-64/hex encoding the cryptotext you make sure it will be plain old ASCII characters that evene old code can read/write.
If you dig around a little in the cryptocode or its method parameters you usually can determine that, and convert your strings accordingly.
I figured out where stream_to_hex, it is inside uTPLB_StreamUtils (pas or hpp) depending if you are using c builder or delphi.

Assign [array of byte] to a Variant with no Unicode conversion

Consider the following code snippet (in Delphi XE2):
function PrepData(StrVal: string; Base64Val: AnsiString): OleVariant;
begin
Result := VarArrayCreate([0, 1], varVariant);
Result[0] := StrVal;
Result[1] := Base64Val;
end;
Base64Val is a binary value encoded as Base64 (so no null bytes). The (OleVariant) Result is automatically marshalled and sent between a client app and a DataSnap server.
When I capture the traffic with Wireshark, I see that both StrVal and Base64Val are transferred as Unicode strings. If I can, I would like to avoid the Unicode conversion for Base64Val. I've looked at all the Variant types and don't see anything other than varString that can transfer an array of characters.
I found this question that shows how to create a variant array of bytes. I'm thinking that I could use this technique instead of using an AnsiString. I'm curious though, is there another way to assign an array of non-Unicode character data to a Variant without a conversion to a Unicode string?
Delphi's implementation supports storing AnsiString and UnicodeString in a Variant, using custom variant type codes. These codes are varString and varUString.
But interop will typically use standard OLE variants and the OLE string, varOleStr, is 16 bit encoded. That would seem to be the reason for your observation.
You'll need to put the data in as an array of bytes if you do wish to avoid a conversion to 16 bit text. Doing so renders base64 encoding pointless. Stop base64 encoding the payload and send the binary in a byte array.
Keeping with the example in the question, this is how I made it work (using code and comments from David's answer to another question as referenced in my question):
function PrepData(StrVal: string; Data: TBytes): OleVariant;
var
SafeArray: PVarArray;
begin
Result := VarArrayCreate([0, 1], varVariant);
Result[0] := StrVal;
Result[1] := VarArrayCreate([1, Length(Data)], varByte);
SafeArray := VarArrayAsPSafeArray(Result[1]);
Move(Pointer(Data)^, SafeArray.Data^, Length(Data));
end;
Then on the DataSnap server, I can extract the binary data from the OleVariant like this, assuming Value is Result[1] from the Variant Array in the OleVariant:
procedure GetBinaryData(Value: Variant; Result: TMemoryStream);
var
SafeArray: PVarArray;
begin
SafeArray := VarArrayAsPSafeArray(Value);
Assert(SafeArray.ElementSize=1);
Result.Clear;
Result.WriteBuffer(SafeArray.Data^, SafeArray.Bounds[0].ElementCount);
end;

How can I use a large file in Delphi?

When I use a large file in memorystream or filestream I see an error which is "out of memory"
How can I solve this problem?
Example:
procedure button1.clıck(click);
var
mem:TMemoryStream;
str:string;
begin
mem:=Tmemorystream.create;
mem.loadfromfile('test.txt');----------> there test.txt size 1 gb..
compressstream(mem);
end;
Your implementation is very messy. I don't know exactly what CompressStream does, but if you want to deal with a large file as a stream, you can save memory by simply using a TFileStream instead of trying to read the whole thing into a TMemoryStream all at once.
Also, you're never freeing the TMemoryStream when you're done with it, which means that you're going to leak a whole lot of memory. (Unless CompressStream takes care of that, but that's not clear from the code and it's really not a good idea to write it that way.)
You can't fit the entire file into a single contiguous block of 32 bit address space. Hence the out of memory error.
Read the file in smaller pieces and process it piece by piece.
Answering the question in the title, you need to process the file piece by piece, byte by byte if that's needed: you definitively do not load the file all at once into memory! How you do that obviously depends on what you need to do with the file; But since we know you're trying to implement an Huffman encoder, I'll give you some specific tips.
An Huffman encoder is a stream encoder: Bytes go in and bits go out. Each unit of incoming data is replaced with it's corresponding bit pattern. The encoder doesn't need to see the whole file at once, because it is in fact only working on one byte each time.
Here's how you'd huffman-compress a file without loading it all into memory; Of course, the actual Huffman encoder is not shown, because the question is about working with big files, not about building the actual encoder. This piece of code includes buffered input and output and shows how you'd link an actual encoder procedure to it.
(beware, code written in browser; if it doesn't compile you're expected to fix it!)
type THuffmanBuffer = array[0..1023] of Byte; // Because I need to pass the array as parameter
procedure DoActualHuffmanEncoding(const EncodeByte:Byte; var BitBuffer: THuffmanBuffer; var AtBit: Integer);
begin
// This is where the actual Huffman encoding would happen. This procedure will
// copy the correct encoding for EncodeByte in BitBuffer starting at AtBit bit index
// The procedure is expected to advance the AtBit counter with the number of bits
// that were actually written (that's why AtBit is a var parameter).
end;
procedure HuffmanEncoder(const FileNameIn, FileNameOut: string);
var InFile, OutFile: TFileStream;
InBuffer, OutBuffer: THuffmanBuffer;
InBytesCount: Integer;
OutBitPos: Integer;
i: Integer;
begin
// First open the InFile
InFile := TFileStream.Create(FileNameIn, fmOpenRead or fmShareDenyWrite);
try
// Now prepare the OutFile
OutFile := TFileStream.Create(FileNameOut, fmCreate);
try
// Start the out bit counter
OutBitPos := 0;
// Read from the input file, one buffer at a time (for efficiency)
InBytesCount := InFile.Read(InBuffer, SizeOf(InBuffer));
while InBytesCount <> 0 do
begin
// Process the input buffer byte-by-byte
for i:=0 to InBytesCount-1 do
begin
DoActualHuffmanEncoding(InBuffer[i], OutBuffer, OutBitPos);
// The function writes bits to the outer buffer, not full bytes, and the
// encoding for a rare byte might be significantly longer then 1 byte.
// Whenever the output buffer approaches it's capacity we'll flush it
// out to the OutFile
if (OutBitPos > ((SizeOf(OutBuffer)-10)*8) then
begin
// Ok, we've got less then 10 bytes available in the OutBuffer, time to
// flush!
OutFile.Write(OutBuffer, OutBitPos div 8);
// We're now possibly left with one incomplete byte in the buffer.
// We'll copy that byte to the start of the buffer and continue.
OutBuffer[0] := OutBuffer[OutBitPos div 8];
OutBitPos := OutBitPos mod 8;
end;
end;
// Read next chunk
InBytesCount := InFile.Read(InBuffer, SizeOf(InBuffer));
end;
// Flush the remaining of the output buffer. This time we want to flush
// the final (potentially incomplete) byte as well, because we've got no
// more input, there'll be no more output.
OutFile.Write(OutBuffer, (OutBitPos + 7) div 8);
finally OutFile.Free;
end;
finally InFile.Free;
end;
end;
The Huffman encoder is not a difficult encoder to implement, but doing it both correctly and fast might be a challenge. I suggest you start with a correct encoder, once you've got both encoding and decoding working figure out how to do a fast encoder.
try something like http://www.explainth.at/en/delphi/mapstream.shtml

Resources