I'm trying to read a .MEM file using Delphi. It's a FoxPro Memory Variable Files. I've tried to read using TFileStream and load into TStringList. But, it only returns the first word.
F := TFileStream.Create(sFile, fmOpenRead);
L := TStringList.Create;
try
F.Position := 0;
L.LoadFromStream(F);
ShowMessage(L.Text);
finally
F.Free;
L.Free;
end;
The reason is because I want to migrate some useful .MEM values from an old program to my new program. Thanks for any help.
If it's a one-time affair and you have access to a VFP installation - i.e. the IDE, not only the runtime - then David Heffernan's suggestion is certainly the most sensible way. In this case you can load and inspect the .MEM via
release all extended && to clear away all existing memvars
restore from foo && assuming the file in question is named FOO.MEM
activate window Locals && inspect the variables...
list memory to foo && or list them to FOO.TXT
modify file foo.txt
However, LIST MEMORY (and DISPLAY MEMORY) also include all the system variables - the things that start with an underscore - which would need to be parsed off.
If it's an ongoing affair - repeated imports necessary - and you know which variables you need then there are two fairly clean and easy ways.
The first is only valid if a VFP IDE is installed on the computer on which the Delphi program is to be run. In this case you can instantiate VFP from Delphi (leave it invisible), have it read the .MEM and then query individual variables:
procedure fetch_variables_from_MEM (mem_filename: string; var_list: CFoos);
var
fox: Variant;
foo: CFoo;
begin
fox := CreateOleObject('VisualFoxpro.Application.9');
try
fox.DoCmd('release all extended');
fox.DoCmd('restore from ' + mem_filename);
for foo in var_list do
foo.Value := fox.Eval('m.' + foo.Name);
finally
fox.Quit; // AutoQuit not supported
end;
end;
I glossed over some details, like that CoInitialize() needs to be called on the thread somewhere before calling this, and I assumed suitable definitions for the variable list (a list/collection of hypothetical CFoo objects), but the sketched outline works - even in 64-bit Delphi.
The advantage is that things like datetime values arrive as TDateTime by virtue of the COM infrastructure and the use of variants.
The second easy way is applicable if an IDE is not available on the machine where the Delphi program is to be used but you have access to an IDE somewhere, so that you can build a small COM server:
define class FoxWrapper as custom olepublic
function Eval (cExpression as string) as variant
return evaluate(m.cExpression)
procedure DoCmd (cCommand as string)
&cCommand
enddefine
This can then be used instead of "VisualFoxPro.Application.9" in the example above. Note: for 64-bit Delphi you need to build this as an out-of-process server (i.e. an EXE). Also, this may run afoul of the VFP licence conditions.
For accessing the data directly, here's some quick & dirty Delphi code that I modelled after some FoxPro stuff that I coded eons ago and updated for VFP9. This is proof-of-principle code with simplified array handling and other compromises for the sake of exposition; it lacks all the production-quality noise necessitated by Delphi's half-assed language definition and its quarter-assed runtime.
type
TMEMVarHeader = packed record
var_name: array [0..10] of AnsiChar;
mem_type: AnsiChar; // 0ACDHLNOQYacdhlnoqy
big_size: UInt32; // only if mem_type == 'H'
width : Byte; // special meaning if mem_type == 'H'
decimals: Byte;
padding : array [0..13] of Byte; // 0 0 0 0 0 0 0 3 0 0 0 0 0 0
end;
SizeOf_TMEMVarHeader_eq_32 = true .. SizeOf(TMEMVarHeader) = 32;
TMEMVarInfo = record
header: TMEMVarHeader;
null_t: AnsiChar;
name : AnsiString;
value : Variant;
function ReadFromStream (stream: TStream): Boolean; // false if EOF
end;
function TMEMVarInfo.ReadFromStream (stream: TStream): Boolean;
const
DELPHI_EPOCH = 2415019.0;
var
header_bytes_read: Integer;
name_length: UInt16;
text_length: UInt32;
array_dim_1: UInt16;
array_dim_2: UInt16;
d: TDate; // 64-bit double
l: Boolean;
n: Double; // 64-bit double
q: array of Byte;
c: AnsiString;
t: TDateTime; // 64-bit double
y: Int64;
binary: Boolean;
i: Cardinal;
a: array of Variant;
v: TMEMVarInfo;
begin
name := ''; value := Unassigned;
header_bytes_read := stream.Read(header, SizeOf(header));
if header_bytes_read <> Sizeof(header) then begin
if not ((header_bytes_read = 1) and (header.var_name[0] = #26)) then
raise Exception.Create('unexpected MEM file format (problem reading header)');
result := false; // EOF
EXIT;
end;
result := true;
// variable name
if header.var_name[0] = #0 then begin // long variable name
assert(header.mem_type = LoCase(header.mem_type));
stream.ReadBuffer(name_length, Sizeof(name_length));
SetLength(name, name_length);
stream.ReadBuffer(name[1], name_length);
end else begin
assert(header.mem_type = UpCase(header.mem_type));
name := header.var_name;
end;
// variable value
case UpCase(header.mem_type) of
'A':
begin
stream.ReadBuffer(array_dim_1, SizeOf(array_dim_1));
stream.ReadBuffer(array_dim_2, SizeOf(array_dim_2));
if array_dim_2 = 0 then // it's a vector, not an array
array_dim_2 := 1;
SetLength(a, array_dim_1 * array_dim_2);
for i := 0 to array_dim_1 * array_dim_2 - 1 do begin
if not v.ReadFromStream(stream) then
raise Exception.Create('error reading array element');
a[i] := v.value;
end;
value := a;
end;
'0': begin stream.ReadBuffer(null_t, 1); value := Null; end;
'C', 'H', 'Q':
begin
if UpCase(header.mem_type) = 'H' then begin // length > 254
binary := header.width <> 0;
text_length := header.big_size;
end else begin
binary := UpCase(header.mem_type) = 'Q';
text_length := header.width;
end;
if binary then begin
SetLength(q, text_length); stream.ReadBuffer(q[0], text_length); value := q;
end else begin
SetLength(c, text_length); stream.ReadBuffer(c[1], text_length); value := c;
end;
end;
'D': begin stream.ReadBuffer(d, Sizeof(d)); if d > 0 then d := d - DELPHI_EPOCH; VarCast(value, d, varDate); end;
'L': begin stream.ReadBuffer(l, Sizeof(l)); value := l; end;
'N': begin stream.ReadBuffer(n, Sizeof(n)); value := n; end;
'T': begin stream.ReadBuffer(t, Sizeof(t)); if t > 0 then t := t - DELPHI_EPOCH; value := t; end;
'Y': begin stream.ReadBuffer(y, Sizeof(y)); VarCast(value, y / 10000.0, varCurrency); end;
else
raise Exception.Create('unexpected type ''' + header.mem_type + ''' in MEM file');
end;
end;
For reading a .MEM, create a TFileStream and a TMEMVarInfo variable, then read variables one by one until var_info.ReadFromStream(stream) returns false.
Note: the byte at offset 19h (shown as 3 in the structure comment) is a code page identifier. The values are the same as those found in .DBF headers, i.e. 1 for DOS 437, 3 for Windows 1252 and so on. However, even though VFP stores these identifiers when writing a .MEM, all the newer versions of VFP that I tested completely ignore these code page marks when loading a .MEM. A self-written importer could put the code page marks to good use, though.
Reading the binary .mem files is not the correct way to proceed. The correct solution is to get VFP to export the data. It knows how to read it. Get VFP to export to a known format, and read that. This is the standard approach to data migration.
How can I insert a string at a specified index in TMemoryStream? If you added a string
"a" to an existing string "b" at Index 0 it would move it forward "ab" etc. For example this is what TStringBuilder.Insert does.
Expand the stream so that there is room for the text to be inserted. So, if the text to be inserted has length N, then you need to make the stream N bytes larger.
Copy all the existing content, starting from the insertion point, to the right to make room for the insertion. A call to Move will get this done. You'll be moving this text N bytes to the right.
Write the inserted string at the insertion point.
I'm assuming an 8 bit encoding. If you use a 16 bit encoding, then the stream needs to be grown by 2N bytes, and so on.
You will find that this is a potentially expensive operation. If you care about performance you will do whatever you can to avoid ever having to do this.
P.S. I'm sorry if I have offended any right-to-left language readers with my Anglo-centric assumption that strings run from left to right!
You asked for some code. Here it is:
procedure TMyStringBuilder.Insert(Index: Integer; const Str: string);
var
N: Integer;
P: Char;
begin
N := Length(Str);
if N=0 then
exit;
FStream.Size := FStream.Size + N*SizeOf(Char);
P := PChar(FStream.Memory);
Move((P + Index)^, (P + Index + N)^, (FStream.Size - N - Index)*SizeOf(Char));
Move(Pointer(Str)^, (P + Index)^, N*SizeOf(Char));
end;
Note that I wrote this code, and then looked at the code in TStringBuilder. It's pretty much identical to that!
The fact that the code you end up writing for this operation is identical to that in TStringBuilder should cause you to stop and contemplate. It's very likely that this new string builder replacement class that you are building will end up with the same implementation as the original. It's highly likely that your replacement will perform no better than the original, and quite plausible that the performance of the replacement will be worse.
It looks a little to me as though you are optimising prematurely. According to your comments below you have not yet timed your code to prove that time spent in TStringBuilder methods is your bottleneck. That's really the first thing that you need to do.
Assuming that you do this timing, and prove that TStringBuilder methods are your bottleneck, you then need to identify why that code is performing below par. And then you need to work out how the code could be improved. Simply repeating the implementation of the original class is not going to yield any benefits.
To move the existing data to make some room for the new string data, you can use pointer operation and Move procedure for faster operation. But it only has to be done if the insertion index is lower then the size of the original stream. If the index is larger than stream size then you could: (1) expand the stream size to accomodate the index number and fill the extra room with zero values or spaces, or (2) reduce the index value to the stream size, so the string will be inserted or appended in the end of the stream.
Depend on your needs, you could: (1) make a class derived from TMemoryStream, or (2) make a function to process an instance of TMemoryStream. Here's the first case:
type
TExtMemoryStream = class(TMemoryStream)
public
procedure InsertString(Index: Integer; const S: string);
end;
procedure TExtMemoryStream.InsertString(Index: Integer; const S: string);
var
SLength, OldSize: Integer;
Src, Dst, PointerToS: ^Char;
begin
if Index > Size then Index := Size;
SLength := Length(S);
OldSize := Size;
SetSize(Size + SLength);
Src := Memory; Inc(Src, Index);
Dst := Src; Inc(Dst, SLength);
Move(Src^, Dst^, OldSize - Index);
PointerToS := #S[1];
Move(PointerToS^, Src^, SLength);
end;
or the second case:
procedure InsertStringToMemoryStream(MS: TMemoryStream;
Index: Integer; const S: string);
var
SLength, OldSize: Integer;
Src, Dst, PointerToS: ^Char;
begin
if Index > MS.Size then Index := MS.Size;
SLength := Length(S);
OldSize := MS.Size;
MS.SetSize(MS.Size + SLength);
Src := MS.Memory; Inc(Src, Index);
Dst := Src; Inc(Dst, SLength);
Move(Src^, Dst^, OldSize - Index);
PointerToS := #S[1];
Move(PointerToS^, Src^, SLength);
end;
There, hope it helps :)
I am testing some enhanced string related functions with which I am trying to use move as a way to copy strings around for faster, more efficient use without delving into pointers.
While testing a function for making a delimited string from a TStringList, I encountered a strange issue. The compiler referenced the bytes contained through the index when it was empty and when a string was added to it through move, index referenced the characters contained.
Here is a small downsized barebone code sample:-
unit UI;
interface
uses
System.SysUtils, System.Types, System.UITypes, System.Rtti, System.Classes,
System.Variants, FMX.Types, FMX.Controls, FMX.Forms, FMX.Dialogs, FMX.Layouts,
FMX.Memo;
type
TForm1 = class(TForm)
Results: TMemo;
procedure FormCreate(Sender: TObject);
end;
var
Form1: TForm1;
implementation
{$R *.fmx}
function StringListToDelimitedString
( const AStringList: TStringList; const ADelimiter: String ): String;
var
Str : String;
Temp1 : NativeInt;
Temp2 : NativeInt;
DelimiterSize : Byte;
begin
Result := ' ';
Temp1 := 0;
DelimiterSize := Length ( ADelimiter ) * 2;
for Str in AStringList do
Temp1 := Temp1 + Length ( Str );
SetLength ( Result, Temp1 );
Temp1 := 1;
for Str in AStringList do
begin
Temp2 := Length ( Str ) * 2;
// Here Index references bytes in Result
Move ( Str [1], Result [Temp1], Temp2 );
// From here the index seems to address characters instead of bytes in Result
Temp1 := Temp1 + Temp2;
Move ( ADelimiter [1], Result [Temp1], DelimiterSize );
Temp1 := Temp1 + DelimiterSize;
end;
end;
procedure TForm1.FormCreate(Sender: TObject);
var
StrList : TStringList;
Str : String;
begin
// Test 1 : StringListToDelimitedString
StrList := TStringList.Create;
Str := '';
StrList.Add ( 'Hello1' );
StrList.Add ( 'Hello2' );
StrList.Add ( 'Hello3' );
StrList.Add ( 'Hello4' );
Str := StringListToDelimitedString ( StrList, ';' );
Results.Lines.Add ( Str );
StrList.Free;
end;
end.
Please devise a solution and if possible, some explanation. Alternatives are welcome too.
Let's look at the crucial bit of code:
// Here Index references bytes in Result
Move ( Str [1], Result [Temp1], Temp2 );
// From here the index seems to address characters instead of bytes in Result
Temp1 := Temp1 + Temp2;
Move ( ADelimiter [1], Result [Temp1], DelimiterSize );
Now, some explanations. When you index a string, you are always indexing characters. You are never indexing bytes. It looks to me as though you wish to index bytes. In which case using the string index operator makes life hard. So I suggest that you index bytes as follows.
First of all initialise Temp1 to 0 rather than 1 since we will be using zero-based indexing.
When you need to index Result using a zero-based byte index, do so like this:
PByte(Result)[Temp1]
So your code becomes:
Temp1 := 0;
for Str in AStringList do
begin
Temp2 := Length(Str)*2;
Move(Str[1], PByte(Result)[Temp1], Temp2);
Temp1 := Temp1 + Temp2;
Move(ADelimiter[1], PByte(Result)[Temp1], DelimiterSize);
Temp1 := Temp1 + DelimiterSize;
end;
In fact I think I'd write it like this, avoiding all string indexing:
Temp1 := 0;
for Str in AStringList do
begin
Temp2 := Length(Str)*2;
Move(Pointer(Str)^, PByte(Result)[Temp1], Temp2);
Temp1 := Temp1 + Temp2;
Move(Pointer(ADelimiter)^, PByte(Result)[Temp1], DelimiterSize);
Temp1 := Temp1 + DelimiterSize;
end;
I'd suggest better names than Temp1 and Temp2. I also question the use of NativeInt here. I'd normally expect to see Integer. Not least because a Delphi string is indexed by signed 32 bit values. You cannot have a string with length greater than 2GB.
Note also that you are not allocating enough memory. You forgot to account for the length of the delimiter. Fix that and your function looks like this:
function StringListToDelimitedString(const AStringList: TStringList;
const ADelimiter: String): String;
var
Str: String;
Temp1: Integer;
Temp2: Integer;
DelimiterSize: Integer;
begin
Temp1 := 0;
DelimiterSize := Length(ADelimiter) * SizeOf(Char);
for Str in AStringList do
inc(Temp1, Length(Str) + DelimiterSize);
SetLength(Result, Temp1);
Temp1 := 0;
for Str in AStringList do
begin
Temp2 := Length(Str) * SizeOf(Char);
Move(Pointer(Str)^, PByte(Result)[Temp1], Temp2);
inc(Temp1, Temp2);
Move(Pointer(ADelimiter)^, PByte(Result)[Temp1], DelimiterSize);
inc(Temp1, DelimiterSize);
end;
end;
If you want to avoid pointers, then write it like this:
function StringListToDelimitedString(const AStringList: TStringList;
const ADelimiter: String): String;
var
Str: String;
StrLen: Integer;
ResultLen: Integer;
DelimiterLen: Integer;
ResultIndex: Integer;
begin
DelimiterLen := Length(ADelimiter);
ResultLen := 0;
for Str in AStringList do
inc(ResultLen, Length(Str) + DelimiterLen);
SetLength(Result, ResultLen);
ResultIndex := 1;
for Str in AStringList do
begin
StrLen := Length(Str);
Move(Pointer(Str)^, Result[ResultIndex], StrLen*SizeOf(Char));
inc(ResultIndex, StrLen);
Move(Pointer(ADelimiter)^, Result[ResultIndex], DelimiterLen*SizeOf(Char));
inc(ResultIndex, DelimiterLen);
end;
end;
System.Move works with untyped pointers and counter of bytes. System.Copy and SysUtils.StrLCopy work with strings (Pascal strings and C strings respectively) and counter of chars. But char and byte are different types, so when you move from string/char context to pointers/bytes context - you should re-calculate length in chars to length in bytes. By the way, same is about indices, Result [Temp1] calculates in characters, not in bytes. And always did.
Correct solution is not mixing citizens of different planets. If you want pointers - use pointers. If you want characters and strings - use characters and strings. But do not mix them! divide and conquer and always separate and make clear when you're using raw piinters and where you use typed strings! Otherwise you're misleading yourself;
function StringListToDelimitedString
( const AStringList: TStrings; const ADelimiter: String ): String;
var
Str : array of String;
Lengths : array of Integer;
Temp1 : NativeInt;
Count, TotalChars : Integer;
PtrDestination: PByte;
PCurStr: ^String;
CurLen: Integer;
Procedure Add1(const Source: string);
var count: integer; // all context is in bytes, not chars here!
Ptr1, Ptr2: PByte;
begin
if Source = '' then exit;
Ptr1 := #Source[ 1 ];
Ptr2 := #Source[ Length(Source)+1 ];
count := ptr2 - ptr1;
Move( Source[1], PtrDestination^, count);
Inc(PtrDestination, count);
end;
begin // here all context is in chars and typed strings, not bytes
Count := AStringList.Count;
if Count <= 0 then exit('');
SetLength(Str, Count); SetLength(Lengths, Count);
TotalChars := 0;
for Temp1 := 0 to Count - 1 do begin
PCurStr := #Str[ Temp1 ];
PCurStr^ := AStringList[ Temp1 ]; // caching content, avoiding extra .Get(I) calls
CurLen := Length ( PCurStr^ ); // caching length, avoind extra function calls
Lengths[ Temp1 ] := CurLen;
Inc(TotalChars, CurLen);
end;
SetLength ( Result, TotalChars + ( Count-1 )*Length( ADelimiter ) );
PtrDestination := Pointer(Result[1]);
// Calls UniqueString to get a safe pointer - but only once
for Temp1 := Low(Str) to High(Str) do
begin
Add1( Str[ Temp1 ] );
Dec( Count );
if Count > 0 // not last string yet
then Add1( Delimeter );
end;
end;
Now, the correct solution i believe would be stopping inventing bicycles and using ready-made and tested libraryes, for example.
Str := JclStringList().Add(['Hello1','Hello2','Hello3','Hello4']).Join(';');
Or, if you really need to add a delimiter PAST THE LAST string (which is usually carefully avoided) then
Str := JclStringList().Add(['Hello1','Hello2','Hello3','Hello4', '']).Join(';');
The original claim of squeezing single percents of CPU power just does not hold with the original code. The illusion of fast pointer operations is just shadowed by a suboptimal code caring not about performance at all.
function StringListToDelimitedString
( const AStringList: TStringList; const ADelimiter: String ): String;
TStringList is a class. Class instance creation and deletion are expensive (slow) operations. Delphi made a flexible framework of those classes - but the speed suffers. So if you want to get few extra percents of speed and pay with sacrificing reliability and flexibility - the don't use classes.
DelimiterSize : Byte;
It should be NativeInt instead just as the rest of muneric variables there. You think you just saved few bytes - but you forced CPU to use non-native datatype and insert typecasts every now and then. It is nothing but an explicitly introduced delay. Ironically, you even did not saved those bytes, cause Delphi would just pad three bytes more to allocate next variable on 32-bits boundary. That is a typical "memory alignment" optimization.
Result := ' ';
This value would never be used. So it is just a loss of time.
for Str in AStringList do
This construction, requiring instantiating TInterfacedObject and calling its virtual methods and then reference-counting it with global locking - is expensive (slow) operation. And twice slow in multithreading taskloads. If you need to squeeze few percetns of speed - you should avoid loosing tens of percents on for-in loops. Those hi-level loops are handy and reliable and flexible - but they pay with speed for that.
for Str in AStringList do
Then You do it twice. But you DON'T KNOW how it that stringlist implemented. How efficiently does it gets the string ? It may even pass messages to another process like TMemo.Lines do! So you should minimize all accesses to that class and its multitude of internal virtual members. Cache all the strings ONCE in some local variable, do not fetch TWICE every of those!
Move ( Str [1], Result [Temp1], Temp2 );
Now we come to a really interesting question - is there even hypothetical place to gain any speed advantage by usiong pointers and bytes? Open CPU window and look how that line is actually implemented!
Strings are reference-counted! When you do Str2 := Str1; no data is copied but only pointers do. But as you start accessing the real memory buffer inside the string - that Str[1] expression - the compiler can not more count references, so Delphi is forced here to decrease reference coutner to ONE SINGLE. That is, Delphi is forced here to call UniqueString over Str and over Result; the System.UniqueString checks the refcounter and if it is >1 makes a special local copy of string (copying all the data into a newly allocated special buffer). Then you do a Move - just like Delphi RTL does itself. I cannto get where any advantage of speed may come from?
Move ( ADelimiter [1], Result [Temp1], DelimiterSize )
And here the same operations are done yet again. And they are costly operations! At least an extra procedure is called, at worst the new buffer is allocated and all the content being copied.
Resume:
The boundary between reference-counted strings and raw pointers is a costly one and every time you cross it - you force Delphi to pay a price.
Mixing those boundaries in the same code make the price paid again and again and again. It also confuses yourself where your counters and indices refer to bytes and where they refer to chars.
Delphi optimized casual string operations for years. And did a pretty good job there. Outperforming Delphi is possible - but you would need to understand in very very fine details - up to each CPU assembler instruction - what goes under the curtains of Pascal sources in your program. That is a dirty and tedious work. There will be no mopre that luxury of using those reliable and flexible things as for-in loop and TStrings classes.
In the end you would most probably get a few percents of speed gain, that no one would ever notice. But you will pay for that with a code much harder to understand, write, read and test. Will those few percents of speed worth unmaintainable code ? I doubt it.
So unless you are forced to do so, my advice is to skip wasting your time and just do a usual Str := JclStringList().Add(['Hello1','Hello2','Hello3','Hello4']).Join(';');
Reliability and flexibility is almost always more preferable than sheer speed.
And sorry to tell that, while i do not know a lot about speed optimizations, i easily saw a speed-damaging issues in your code, that you intended to be faster than Delphi itself.
My experience is miles and miles away of even trying to outperform Delphi in strings field. And i do not think you have any chances other but wasting a lot of time to finally get performance worse than stock one.
I am having trouble with some Delphi code that uses TFileStream to read chunks of data from a file to a dynamic array. The original objective in writing the code is to compare the contents of two files that have the same size but potentially different date and time stamps to see if the contents are the same. This is done by reading the data from each file of the pair into separate dynamic arrays and comparing each byte of one array with the corresponding byte of the other.
The code makes multiple calls to TFileStream.Read. After about 75 calls, the program crashes with an 'Out of Memory' Error message.
It does not seem to matter how large the blocks of data that are read, it seems to be the number of calls that results in the error message.
The code is a function that I have written that is called elsewhere whenever the program encounters two files that it needs to compare (which, for reasons that I won't go into, could be forty or fifty different file pairs). The 'Out of Memory' error occurs whether it is a single file that is being read in small blocks, or multiple files that are being read in their entirety. It seems to be the number of calls that is the determinant of the error.
While I realize that there might be more elegant ways of achieving the comparison of the files than what I have shown below, what I would really like to know is what is wrong with the use of the TFileStream and/or SetLength calls that are causing the memory problems. I have tried freeing the memory after every call (as shown in the code) and it seems to make no difference.
I would be grateful if someone could explain what is going wrong.
function Compare_file_contents(SPN,TPN : String; SourceFileSize : int64) : boolean;
var
SF : TFileStream; //First file of pair for comparison
TF : TFileStream; //Second file of pair
SourceArray : TBytes; // Buffer array to receive first file data
TargetArray : TBytes; //Buffer array to receive second file data
ArrayLength : int64; //Length of dynamic array
Position : int64; //Position within files to start each block of data read
TestPosition : int64; //Position within dynamic arrays to compare each byte
MaxArrayLength : integer; //Maximum size for the buffer arrays
LastRun : Boolean; //End first repeat loop
begin
{ The comparison has an arbitrary upper boundary of 100 MB to avoid slowing the
the overall program. The main files bigger than this will be *.pst files that
will most likely have new dates every time the program is run, so it will take
about the same time to copy the files as it does to read and compare them, and
it will have to be done every time.
The function terminates when it is confirmed that the files are not the same.
If the source file is bigger than 100 MB, it is simply assumed that they are
not identical, thus Result = False. Also, LongInt integers (=integers) have
a range of -2147483648..2147483647, so files bigger than 2 GB will have
overflowed to a negative number. Hence the check to see if the file size is
less than zero.
The outer repeat ... until loop terminates on LastRun, but LastRun should only
be set if SecondLastRun is True, because it will skip the final comparisons in
the inner repeat ... until loop otherwise. }
Result := True;
LastRun := False;
MaxArrayLength := 1024*1024;
if (SourceFileSize > 100*1024*1024) or (SourceFileSize < 0) then Result := False
else
begin
{ The comparison is done by using TFileStream to open and read the data from
the source and target files as bytes to dynamic arrays (TBytes). Then a repeat
loop is used to compare individual bytes until a difference is found or all
of the information has been compared. If a difference is found, Result is
set to False. }
if SourceFileSize > MaxArrayLength then ArrayLength := MaxArrayLength
else ArrayLength := SourceFileSize;
SF := TFileStream.Create(SPN,fmOpenRead);
TF := TFileStream.Create(TPN,fmOpenRead);
Position := 0;
SetLength(SourceArray,ArrayLength);
SetLength(TargetArray,ArrayLength);
try
SF.Read(SourceArray,ArrayLength);
TF.Read(TargetArray,ArrayLength);
Position := SF.Position;
finally
SF.Free;
TF.Free;
end;
repeat
TestPosition := 0;
repeat
if SourceArray[TestPosition] <> TargetArray[TestPosition] then
Result := False;
Inc(TestPosition);
until (Result = False) or (TestPosition = ArrayLength);
if SourceFileSize > Position then
begin
if SourceFileSize - Position - MaxArrayLength > 0 then
ArrayLength := MaxArrayLength
else ArrayLength := SourceFileSize - Position;
SF := TFileStream.Create(SPN,fmOpenRead);
TF := TFileStream.Create(TPN,fmOpenRead);
SF.Position := Position;
TF.Position := Position;
try
SF.Read(SourceArray,ArrayLength);
TF.Read(TargetArray,ArrayLength);
Position := SF.Position;
finally
SF.Free;
TF.Free;
end;
end else LastRun := True;
until (Result = False) or LastRun;
Finalize(SourceArray);
Finalize(TargetArray);
end;
end; { Compare_file_contents }
This routine seems to be far more complicated than it needs to be. Rather than trying to debug it, I offer you my routine that compares streams.
function StreamsEqual(Stream1, Stream2: TStream): Boolean;
const
OneKB = 1024;
var
Buffer1, Buffer2: array [0..4*OneKB-1] of Byte;
SavePos1, SavePos2: Int64;
Count: Int64;
N: Integer;
begin
if Stream1.Size<>Stream2.Size then begin
Result := False;
exit;
end;
SavePos1 := Stream1.Position;
SavePos2 := Stream2.Position;
Try
Stream1.Position := 0;
Stream2.Position := 0;
Count := Stream1.Size;
while Count <> 0 do begin
N := Min(SizeOf(Buffer1), Count);
Stream1.ReadBuffer(Buffer1, N);
Stream2.ReadBuffer(Buffer2, N);
if not CompareMem(#Buffer1, #Buffer2, N) then begin
Result := False;
exit;
end;
dec(Count, N);
end;
Result := True;
Finally
Stream1.Position := SavePos1;
Stream2.Position := SavePos2;
End;
end;
If you wish to add your 100MB size check to this function, it's obvious where and how to do it.
The routine above uses a stack allocated buffer. In contrast your version allocates on the heap. Perhaps your version leads to heap fragmentation.
I realise that this does not answer the direct question that you asked. However, it does solve your problem. I hope this proves useful.
I'm a Delphi programmer.
I have made a program who uses dictionaries with words and expressions (loaded in program as "array of string").
It uses a search algorithm based on their "checksum" (I hope this is the correct word).
A string is transformed in integer based on this:
var
FHashSize: Integer; //stores the value of GetHashSize
HashTable, HashTableNoCase: array[Byte] of Longword;
HashTableInit: Boolean = False;
const
AnsiLowCaseLookup: array[AnsiChar] of AnsiChar = (
#$00, #$01, #$02, #$03, #$04, #$05, #$06, #$07,
#$08, #$09, #$0A, #$0B, #$0C, #$0D, #$0E, #$0F,
#$10, #$11, #$12, #$13, #$14, #$15, #$16, #$17,
#$18, #$19, #$1A, #$1B, #$1C, #$1D, #$1E, #$1F,
#$20, #$21, #$22, #$23, #$24, #$25, #$26, #$27,
#$28, #$29, #$2A, #$2B, #$2C, #$2D, #$2E, #$2F,
#$30, #$31, #$32, #$33, #$34, #$35, #$36, #$37,
#$38, #$39, #$3A, #$3B, #$3C, #$3D, #$3E, #$3F,
#$40, #$61, #$62, #$63, #$64, #$65, #$66, #$67,
#$68, #$69, #$6A, #$6B, #$6C, #$6D, #$6E, #$6F,
#$70, #$71, #$72, #$73, #$74, #$75, #$76, #$77,
#$78, #$79, #$7A, #$5B, #$5C, #$5D, #$5E, #$5F,
#$60, #$61, #$62, #$63, #$64, #$65, #$66, #$67,
#$68, #$69, #$6A, #$6B, #$6C, #$6D, #$6E, #$6F,
#$70, #$71, #$72, #$73, #$74, #$75, #$76, #$77,
#$78, #$79, #$7A, #$7B, #$7C, #$7D, #$7E, #$7F,
#$80, #$81, #$82, #$83, #$84, #$85, #$86, #$87,
#$88, #$89, #$8A, #$8B, #$8C, #$8D, #$8E, #$8F,
#$90, #$91, #$92, #$93, #$94, #$95, #$96, #$97,
#$98, #$99, #$9A, #$9B, #$9C, #$9D, #$9E, #$9F,
#$A0, #$A1, #$A2, #$A3, #$A4, #$A5, #$A6, #$A7,
#$A8, #$A9, #$AA, #$AB, #$AC, #$AD, #$AE, #$AF,
#$B0, #$B1, #$B2, #$B3, #$B4, #$B5, #$B6, #$B7,
#$B8, #$B9, #$BA, #$BB, #$BC, #$BD, #$BE, #$BF,
#$C0, #$C1, #$C2, #$C3, #$C4, #$C5, #$C6, #$C7,
#$C8, #$C9, #$CA, #$CB, #$CC, #$CD, #$CE, #$CF,
#$D0, #$D1, #$D2, #$D3, #$D4, #$D5, #$D6, #$D7,
#$D8, #$D9, #$DA, #$DB, #$DC, #$DD, #$DE, #$DF,
#$E0, #$E1, #$E2, #$E3, #$E4, #$E5, #$E6, #$E7,
#$E8, #$E9, #$EA, #$EB, #$EC, #$ED, #$EE, #$EF,
#$F0, #$F1, #$F2, #$F3, #$F4, #$F5, #$F6, #$F7,
#$F8, #$F9, #$FA, #$FB, #$FC, #$FD, #$FE, #$FF);
implementation
function GetHashSize(const Count: Integer): Integer;
begin
if Count < 65 then
Result := 256
else
Result := Round(IntPower(16, Ceil(Log10(Count div 4) / Log10(16))));
end;
function Hash(const Hash: LongWord; const Buf; const BufSize: Integer): LongWord;
var P: PByte;
I: Integer;
begin
P := #Buf;
Result := Hash;
for I := 1 to BufSize do
begin
Result := HashTable[Byte(Result) xor P^] xor (Result shr 8);
Inc(P);
end;
end;
function HashStrBuf(const StrBuf: Pointer; const StrLength: Integer; const Slots: LongWord): LongWord;
var P: PChar;
I, J: Integer;
begin
if not HashTableInit then
InitHashTable;
P := StrBuf;
if StrLength <= 48 then // Hash all characters for short strings
Result := Hash($FFFFFFFF, P^, StrLength)
else
begin
// Hash first 16 bytes
Result := Hash($FFFFFFFF, P^, 16);
// Hash last 16 bytes
Inc(P, StrLength - 16);
Result := Hash(Result, P^, 16);
// Hash 16 bytes sampled from rest of string
I := (StrLength - 48) div 16;
P := StrBuf;
Inc(P, 16);
for J := 1 to 16 do
begin
Result := HashTable[Byte(Result) xor Byte(P^)] xor (Result shr 8);
Inc(P, I + 1);
end;
end;
// Mod into slots
if Slots <> 0 then
Result := Result mod Slots;
end;
procedure InitHashTable;
var I, J: Byte;
R: LongWord;
begin
for I := $00 to $FF do
begin
R := I;
for J := 8 downto 1 do
if R and 1 <> 0 then
R := (R shr 1) xor $EDB88320
else
R := R shr 1;
HashTable[I] := R;
end;
Move(HashTable, HashTableNoCase, Sizeof(HashTable));
for I := Ord('A') to Ord('Z') do
HashTableNoCase[I] := HashTableNoCase[I or 32];
HashTableInit := True;
end;
The result of the HashStrBuf is "and (FHashSize - 1)" and is used as index in an "array of array of Integer" (of FHashSize size) to store the index of the string from that "array of string".
This way, when searches for a string, it's transformed in "checksum" and then the code searches in the "branch" with this index comparing this string with the strings from dictionary who have the same "checksum".
Ideally each string from dictionary should have unique checksum. But in the "real world" about 2/3 share the same "checksum" with other words. Because of that the search is not that fast.
In these dictionaries strings are composed of this characters: ['a'..'z',#224..#246,#248..#254,#154,#156..#159,#179,#186,#191,#190,#185,'0'..'9', '''']
Is there any way to improve the "hashing" so the strings would have more unique "checksums"?
Oh, one way is to increase the size of that "array of array of Integer" (FHashSize) but it cannot be increased too much because it takes a lot of Ram.
Another thing: these dictionaries are stored on HDD only as words/expressions (not the "checksums"). Their "checksum" is generated at program startup. But it takes a lot of seconds to do that...
Is there any way to speed up the startup of the program? Maybe by improving the "hashing" function, maybe by storing the "checksums" on HDD and loading them from there...
Any input would be appreciated...
PS: here is the code to search:
function TDictionary.LocateKey(const Key: AnsiString): Integer;
var i, j, l, H: Integer;
P, Q: PChar;
begin
Result := -1;
l := Length(Key);
H := HashStrBuf(#Key[1], l, 0) and (FHashSize - 1);
P := #Key[1];
for i := 0 to High(FHash[H]) do //FHash is that "array of array of integer"
begin
if l <> FKeys.ItemSize[FHash[H][i]] then //FKeys.ItemSize is an byte array with the lengths of strings from dictionary
Continue;
Q := FKeys.Pointer(FHash[H][i]); //pointer to string in dictionary
for j := 0 to l - 1 do
if (P + j)^ <> (Q + j)^ then
Break;
if j = l then
begin
Result := FHash[H][i];
Exit;
end;
end;
end;
Don't reinvent the wheel!
IMHO your hashing is far from efficient, and your collision algorithm can be improved.
Take a look for instance at the IniFiles unit, and the THashedStringList.
It's a bit old, but a good start for a string list using hashes.
There are a lot of good Delphi implementation of such, like in SuperObject and a lot of other code...
Take a look at our SynBigTable unit, which can handle arrays of data in memory or in file very fast, with full indexed searches. Or our latest TDynArray wrapper around any dynamic array of data, to implement TList-like methods to it, including fast binary search. I'm quite sure it could be faster than your hand-tuned code using hashing, if you use an ordered index then fast binary search.
Post-Scriptum:
About pure hashing speed of a string content, take a look at this function - rename RawByteString into AnsiString, PPtrInt into PPointer, and PtrInt into Integer for Delphi 7:
function Hash32(const Text: RawByteString): cardinal;
function SubHash(P: PCardinalArray): cardinal;
{$ifdef HASINLINE}inline;{$endif}
var s1,s2: cardinal;
i, L: PtrInt;
const Mask: array[0..3] of cardinal = (0,$ff,$ffff,$ffffff);
begin
if P<>nil then begin
L := PPtrInt(PtrInt(P)-4)^; // fast lenght(Text)
s1 := 0;
s2 := 0;
for i := 1 to L shr 4 do begin // 16 bytes (4 DWORD) by loop - aligned read
inc(s1,P^[0]);
inc(s2,s1);
inc(s1,P^[1]);
inc(s2,s1);
inc(s1,P^[2]);
inc(s2,s1);
inc(s1,P^[3]);
inc(s2,s1);
inc(PtrUInt(P),16);
end;
for i := 1 to (L shr 2)and 3 do begin // 4 bytes (DWORD) by loop
inc(s1,P^[0]);
inc(s2,s1);
inc(PtrUInt(P),4);
end;
inc(s1,P^[0] and Mask[L and 3]); // remaining 0..3 bytes
inc(s2,s1);
result := s1 xor (s2 shl 16);
end else
result := 0;
end;
begin // use a sub function for better code generation under Delphi
result := SubHash(pointer(Text));
end;
There is even a pure asm version, even faster, in our SynCommons.pas unit. I don't know any faster hashing function around (it's faster than crc32/adler32/IniFiles.hash...). It's based on adler32, but use DWORD aligned reading and summing for even better speed. This could be improved with SSE asm, of course, but here is a fast pure Delphi hash function.
Then don't forget to use "multiplication"/"binary and operation" for hash resolution, just like in IniFiles. It will reduce the number of iteration to your list of hashs.
But since you didn't provide the search source code, we are not able to know what could be improved here.
If you are using Delphi 7, consider using Julian Bucknall's lovely Delphi data types code, EzDsl (Easy Data Structures Library).
Now you don't have to reinvent the wheel as another wise person has also said.
You can download ezdsl, a version that I have made work with both Delphi 7, and recent unicode delphi versions, here.
In particular the unit name EHash contains a hash table implementation, which has various hashing algorithms plug-inable, or you can write your own plugin function that just does the hashing function of your choice.
As a word to the wise, if you are using a Unicode Delphi version; I would be careful about hashing your unicode strings with a code library like this, without checking how its hashing algorithms perform on your system. The OP here is using Delphi 7, so Unicode is not a factor for the original question.
I think you'll find a database (without checksums) a lot quicker. Maybe try sqlite which will give you a single file database. There are many Delphi Libraries available.