Unicode string and TStringStream - delphi

Delphi 2009 and above use Unicode strings as their default string type. To my understanding a Unicode char is actually a 16-bit value, or 2 bytes (note: I understand there is a possibility of 3- or 4-byte chars, but let's consider the most usual case). However, I found that TStringStream is not very reliable for manipulating these strings. For example, the TStringStream.Size property returns the length of the string, while I think it should return the byte count of the contained string. Okay, you can adjust for that on your own, but the thing that confused me the most is that TStringStream does not read from or write to a buffer reliably.
Please check the following code (it's a DUnit test and it always fails). Please let me know where the problem is (I was using D2010 when testing the code).
procedure TestTCPackage.TestStringStream;
const
cCount = 10;
cOrdMaxChar = Ord(High(Char));
var
B: Pointer;
SW, SR: TStringStream;
T: string;
i, j, k : Integer;
vStrings: array [0..cCount-1] of string;
begin
RandSeed := GetTickCount;
for i := 0 to cCount - 1 do
begin
j := Random(100) + 1;
SetLength(vStrings[i], j);
for k := 1 to j do
// fill string with random char (but no #0)
vStrings[i][k] := Char(Random(cOrdMaxChar-1) + 1);
end;
for i := 0 to cCount - 1 do
begin
SW := TStringStream.Create(vStrings[i]);
try
GetMem(B, SW.Size * SizeOf(Char));
try
SW.Read(B^, SW.Size * SizeOf(Char));
SR := TStringStream.Create;
try
SR.Write(B^, SW.Size * SizeOf(Char));
SR.Position := 0;
// check the string in the TStringStream with original value
Check(SR.DataString = vStrings[i]);
finally
SR.Free;
end;
finally
FreeMem(B);
end;
finally
SW.Free;
end;
end;
end;
Note: I already tried using an instance of TMemoryStream as an intermediary for reading/writing the buffer, and then used CopyFrom of the TStringStream to read the content of that TMemoryStream, with the same failing result.

Unicode strings aren't for data storage; use TBytes for that. TStringStream uses its associated encoding (the Encoding property) for encoding strings passed in with WriteString, and decoding strings read out with ReadString or the DataString property.
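A minimal sketch of the TBytes approach mentioned above, assuming Delphi 2009 or later. The two helper names are hypothetical and only wrap the RTL's TEncoding.Unicode.GetBytes and GetString calls:

```pascal
uses
  SysUtils;

// Hypothetical helper: encode a string to raw UTF-16LE bytes for storage.
function StringToUtf16Bytes(const S: string): TBytes;
begin
  Result := TEncoding.Unicode.GetBytes(S);
end;

// Hypothetical helper: decode raw UTF-16LE bytes back into a string.
function Utf16BytesToString(const Data: TBytes): string;
begin
  Result := TEncoding.Unicode.GetString(Data);
end;
```

With this, Length(StringToUtf16Bytes('abc')) is 6, the byte count rather than the character count, and passing those bytes to Utf16BytesToString gives back 'abc'.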

After reading this post (and thanks to Serg, who provided the answer to that question) and Barry Kelly's answer, I have found the problem. TStringStream actually uses the system ANSI encoding (TEncoding.Default) by default. So even if your default string type is Unicode, it won't use a Unicode encoding unless you specifically tell it to. Personally I think that's strange. Maybe it is meant to make converting old code easier.
So you have to specifically set the encoding of the TStringStream to TEncoding.Unicode to manipulate Unicode strings properly.
Here is my modified code, which passes the DUnit test:
procedure TestTCPackage.TestStringStream;
const
cCount = 10;
cOrdMaxChar = Ord(High(Char));
var
B: Pointer;
SW, SR: TStringStream;
i, j, k : Integer;
vStrings: array [0..cCount-1] of string;
begin
RandSeed := GetTickCount;
for i := 0 to cCount - 1 do
begin
j := Random(100) + 1;
SetLength(vStrings[i], j);
for k := 1 to j do
// fill string with random char (but no #0)
vStrings[i][k] := Char(Random(cOrdMaxChar-1) + 1);
end;
for i := 0 to cCount - 1 do
begin
SW := TStringStream.Create(vStrings[i], TEncoding.Unicode);
try
GetMem(B, SW.Size);
try
SW.ReadBuffer(B^, SW.Size);
SR := TStringStream.Create('', TEncoding.Unicode);
try
SR.WriteBuffer(B^, SW.Size);
SR.Position := 0;
// check the string in the TStringStream with original value
Check(SR.DataString = vStrings[i]);
finally
SR.Free;
end;
finally
FreeMem(B);
end;
finally
SW.Free;
end;
end;
end;
Last note: Unicode does bite! :D

Related

How to read last line in a text file using Delphi

I need to read the last line in some very large text files (to get the timestamp from the data). TStringList would be a simple approach, but it raises an out-of-memory error. I'm trying to use Seek and BlockRead, but the characters in the buffer are all nonsense. Does this have something to do with Unicode?
Function TForm1.ReadLastLine2(FileName: String): String;
var
FileHandle: File;
s,line: string;
ok: 0..1;
Buf: array[1..8] of Char;
k: longword;
i,ReadCount: integer;
begin
AssignFile (FileHandle,FileName);
Reset (FileHandle); // or for binary files: Reset (FileHandle,1);
ok := 0;
k := FileSize (FileHandle);
Seek (FileHandle, k-1);
s := '';
while ok<>1 do begin
BlockRead (FileHandle, buf, SizeOf(Buf)-1, ReadCount); //BlockRead ( var FileHandle : File; var Buffer; RecordCount : Integer {; var RecordsRead : Integer} ) ;
if ord (buf[1]) <>13 then //Arg to integer
s := s + buf[1]
else
ok := ok + 1;
k := k-1;
seek (FileHandle,k);
end;
CloseFile (FileHandle);
// Reverse the order in the line read
setlength (line,length(s));
for i:=1 to length(s) do
line[length(s) - i+1 ] := s[i];
Result := Line;
end;
Based on www.delphipages.com/forum/showthread.php?t=102965
The test file is a simple CSV I created in Excel (this is not the 100 MB file I ultimately need to read).
a,b,c,d,e,f,g,h,i,j,blank
A,B,C,D,E,F,G,H,I,J,blank
1,2,3,4,5,6,7,8,9,0,blank
Mary,had,a,little,lamb,His,fleece,was,white,as,snow
And,everywhere,that,Mary,went,The,lamb,was,sure,to,go
You really have to read the file in LARGE chunks, from the tail towards the head.
Since the file is too large to fit in memory, reading it line by line from start to end would be very slow, and with ReadLn twice as slow.
You also have to be ready for the last line to end with an EOL, or not.
Personally I would also account for three possible EOL sequences:
CR/LF aka #13#10=^M^J - DOS/Windows style
CR without LF - just #13=^M - Classic MacOS file
LF without CR - just #10=^J - UNIX style, including MacOS version 10
If you are sure your CSV files will only ever be generated by native Windows programs, it is safe to assume that full CR/LF is used. But if Java programs, non-Windows platforms, or mobile programs can be involved, I would be less sure. Of course, pure CR without LF is the least probable case of them all.
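As a small illustration of the three EOL styles listed above, here is a sketch of a helper that drops one trailing line terminator of any of those styles. The function name is hypothetical and it is separate from the answer's own code; it assumes byte-sized characters (AnsiString):

```pascal
// Hypothetical helper, not part of the answer's code: removes one trailing
// end-of-line marker, whichever of the three styles it is.
function StripTrailingEol(const S: AnsiString): AnsiString;
var
  N: Integer;
begin
  N := Length(S);
  if (N > 0) and (S[N] = #10) then
    Dec(N); // LF, alone or as the second half of CR/LF
  if (N > 0) and (S[N] = #13) then
    Dec(N); // CR, alone or as the first half of CR/LF
  Result := Copy(S, 1, N);
end;
```

A line ending in #13#10, in #13 alone, or in #10 alone comes back without its terminator, and a line with no terminator is returned unchanged.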
uses System.IOUtils, System.Math, System.Classes;
type FileChar = AnsiChar; FileString = AnsiString; // for non-Unicode files
// type FileChar = WideChar; FileString = UnicodeString;// for UTF16 and UCS-2 files
const FileCharSize = SizeOf(FileChar);
// somewhere later in the code add: Assert(FileCharSize = SizeOf(FileString[1]));
function ReadLastLine(const FileName: String): FileString; overload; forward;
const PageSize = 4*1024;
// the minimal read atom of most modern HDD and the memory allocation atom of Win32
// since the chances your file would have lines longer than 4Kb are very small - I would not increase it to several atoms.
function ReadLastLine(const Lines: TStringDynArray): FileString; overload;
var i: integer;
begin
Result := '';
i := High(Lines);
if i < Low(Lines) then exit; // empty array - empty file
Result := Lines[i];
if Result > '' then exit; // we got the line
Dec(i); // skip the empty ghost line, in case last line was CRLF-terminated
if i < Low(Lines) then exit; // that ghost was the only line in the empty file
Result := Lines[i];
end;
// scan for EOLs in not-yet-scanned part
function FindLastLine(buffer: TArray<FileChar>; const OldRead : Integer;
const LastChunk: Boolean; out Line: FileString): boolean;
var i, tailCRLF: integer; c: FileChar;
begin
Result := False;
if Length(Buffer) = 0 then exit;
i := High(Buffer);
tailCRLF := 0; // test for trailing CR/LF
if Buffer[i] = ^J then begin // LF - single, or after CR
Dec(i);
Inc(tailCRLF);
end;
if (i >= Low(Buffer)) and (Buffer[i] = ^M) then begin // CR, alone or before LF
Inc(tailCRLF);
end;
i := High(Buffer) - Max(OldRead, tailCRLF);
if i - Low(Buffer) < 0 then exit; // no new data to read - results would be like before
if OldRead > 0 then Inc(i); // the CR/LF pair could be sliced between new and previous buffer - so need to start a bit earlier
for i := i downto Low(Buffer) do begin
c := Buffer[i];
if (c=^J) or (c=^M) then begin // found EOL
SetString( Line, @Buffer[i+1], High(Buffer) - tailCRLF - i);
exit(True);
end;
end;
// we did not find non-terminating EOL in the buffer (except maybe trailing),
// now we should ask for more file content, if there is still left any
// or take the entire file (without trailing EOL if any)
if LastChunk then begin
SetString( Line, @Buffer[ Low(Buffer) ], Length(Buffer) - tailCRLF);
Result := true;
end;
end;
function ReadLastLine(const FileName: String): FileString; overload;
var Buffer, tmp: TArray<FileChar>;
// dynamic arrays - eases memory management and protect from stack corruption
FS: TFileStream; FSize, NewPos: Int64;
OldRead, NewLen : Integer; EndOfFile: boolean;
begin
Result := '';
FS := TFile.OpenRead(FileName);
try
FSize := FS.Size;
if FSize <= PageSize then begin // small file, we can be lazy!
FreeAndNil(FS); // free the handle and avoid double-free in finally
Result := ReadLastLine( TFile.ReadAllLines( FileName, TEncoding.ANSI ));
// or TEncoding.Unicode for UTF-16 files
// warning - TFile is not share-aware, if the file is being written to by another app
exit;
end;
SetLength( Buffer, PageSize div FileCharSize);
OldRead := 0;
repeat
NewPos := FSize - Length(Buffer)*FileCharSize;
EndOfFile := NewPos <= 0;
if NewPos < 0 then NewPos := 0;
FS.Position := NewPos;
FS.ReadBuffer( Buffer[Low(Buffer)], (Length(Buffer) - OldRead)*FileCharSize);
if FindLastLine(Buffer, OldRead, EndOfFile, Result) then
exit; // done !
tmp := Buffer; Buffer := nil; // flip-flop: preparing to broaden our mouth
OldRead := Length(tmp); // need not to re-scan the tail again and again when expanding our scanning range
NewLen := Min( 2*Length(tmp), FSize div FileCharSize );
SetLength(Buffer, NewLen); // this may trigger EOutOfMemory...
Move( tmp[Low(tmp)], Buffer[High(Buffer)-OldRead+1], OldRead*FileCharSize);
tmp := nil; // free old buffer
until EndOfFile;
finally
FS.Free;
end;
end;
PS. Note one extra special case: if you use Unicode chars (two-byte ones) and are given an odd-length file (3 bytes, 5 bytes, etc.), you will never be able to scan the leading single byte (half a WideChar). Maybe you should add an extra guard there, like Assert( 0 = FS.Size mod FileCharSize)
PPS. As a rule of thumb, you had better keep those functions out of the form class, because why mix them? In general you should separate concerns into small blocks. Reading a file has nothing to do with user interaction, so it is better offloaded to a separate unit. Then you can use the functions from that unit in one form or ten forms, in the main thread or in a multi-threaded application. Like LEGO parts, they give you flexibility by being small and separate.
PPPS. Another approach here would be to use memory-mapped files. Google for MMF implementations for Delphi and for articles about the benefits and problems of the MMF approach. Personally I think rewriting the code above to use MMF would greatly simplify it, removing several "special cases" and the troublesome memory-copying flip-flop. OTOH it would require you to be very strict with pointer arithmetic.
https://en.wikipedia.org/wiki/Memory-mapped_file
https://msdn.microsoft.com/en-us/library/ms810613.aspx
http://torry.net/quicksearchd.php?String=memory+map&Title=No
Your Char type is two bytes, so that buffer is 16 bytes. Then with BlockRead you read SizeOf(Buf)-1 bytes into it, and check whether the first 2-byte char is equal to #13.
The SizeOf(Buf)-1 is dodgy (where does that -1 come from?); the rest is valid, but only if your input file is UTF-16.
Also, you read 8 (or 16) characters each time, but compare only one and then seek again. That is not very logical either.
If your encoding is not UTF-16, I suggest you change the type of the buffer element to AnsiChar and remove the -1.
In response to kopiks suggestion, I figured out how to do it with TFileStream. It works OK with the simple test file, though some further tweaks may be needed when I use it on a variety of CSV files. Also, I don't make any claims that this is the most efficient method.
procedure TForm1.Button6Click(Sender: TObject);
Var
StreamSize, ApproxNumRows : Integer;
TempStr : String;
begin
if OpenDialog1.Execute then begin
TempStr := ReadLastLineOfTextFile(OpenDialog1.FileName,StreamSize, ApproxNumRows);
// TempStr := ReadFileStream('c:\temp\CSVTestFile.csv');
ShowMessage ('approximately '+ IntToStr(ApproxNumRows)+' Rows');
ListBox1.Items.Add(TempStr);
end;
end;
Function TForm1.ReadLastLineOfTextFile(const FileName: String; var StreamSize, ApproxNumRows : Integer): String;
const
MAXLINELENGTH = 256;
var
Stream: TFileStream;
BlockSize,CharCount : integer;
Hash13Found : Boolean;
Buffer : array [0..MAXLINELENGTH] of AnsiChar;
begin
Hash13Found := False;
Result :='';
Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
try
StreamSize := Stream.size;
if StreamSize < MAXLINELENGTH then
BlockSize := StreamSize
Else
BlockSize := MAXLINELENGTH;
// for CharCount := 0 to Length(Buffer)-1 do begin
// Buffer[CharCount] := #0; // zeroing the buffer can aid diagnostics
// end;
CharCount := 0;
Repeat
Stream.Seek(-(CharCount+3), 2); //+3 misses out the #0,#10,#13 at the end of the file
Stream.Read( Buffer[CharCount], 1);
Result := String(Buffer[CharCount]) + result;
if Buffer[CharCount] =#13 then
Hash13Found := True;
Inc(CharCount);
Until Hash13Found OR (CharCount = BlockSize);
finally
Stream.Free; // release the file handle even if reading fails
end;
ShowMessage(Result);
ApproxNumRows := Round(StreamSize / CharCount);
end;
I just thought of a new solution.
Again, there could be better ones, but this is the best I could think of.
function GetLastLine(textFilePath: string): string;
var
list: tstringlist;
begin
list := tstringlist.Create;
try
list.LoadFromFile(textFilePath);
result := list[list.Count-1];
finally
list.free;
end;
end;

Convert TMemoryStream to WideString in Delphi 7

When I use this code and function in Delphi 7, an error message is displayed.
This code converts the MemoryStream content to a WideString:
function ReadWideString(stream: TStream): WideString;
var
nChars: LongInt;
begin
stream.Position := 0;
stream.ReadBuffer(nChars, SizeOf(nChars));
SetLength(Result, nChars);
if nChars > 0 then
stream.ReadBuffer(Result[1], nChars * SizeOf(Result[1]));
end;
procedure TForm1.Button2Click(Sender: TObject);
var
mem: TMemoryStream;
begin
mem := TMemoryStream.Create;
mem.LoadFromFile('C:\Users\User1\Desktop\wide.txt');
Memo1.Lines.Add(ReadWideString(mem));
end;
Any help would be greatly appreciated.
Your code works fine as is. The problem is that the input that you pass to your function is not in the expected format.
The function that you are using expects a 4 byte integer containing the length, followed by the UTF-16 payload.
It looks like you actually have straight UTF-16 text, without the length prepended. Read that like this:
stream.Position := 0;
nChars := stream.Size div SizeOf(Result[1]);
SetLength(Result, nChars);
if nChars > 0 then
stream.ReadBuffer(Result[1], nChars * SizeOf(Result[1]));
Now, your input may contain a UTF-16 BOM. If so you'll need to decide how to handle that.
The bottom line here is that you need your code to match the input you provide.
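One possible way to handle the BOM case mentioned above, as a sketch only: the function name is hypothetical, it assumes the stream holds little-endian UTF-16 text with no length prefix, and it does not handle a big-endian BOM.

```pascal
uses
  Classes;

// Hypothetical helper (not from the answer): reads a whole stream as
// UTF-16LE text and skips a leading byte order mark if one is present.
function ReadUtf16Text(Stream: TStream): WideString;
const
  Utf16LeBom = $FEFF; // bytes FF FE on disk, read as a little-endian Word
var
  Bom: Word;
  nChars: Integer;
begin
  Result := '';
  Stream.Position := 0;
  if Stream.Size >= SizeOf(Bom) then
  begin
    Stream.ReadBuffer(Bom, SizeOf(Bom));
    if Bom <> Utf16LeBom then
      Stream.Position := 0; // no BOM: the first two bytes are text
  end;
  // Whatever follows the current position is the UTF-16 payload.
  nChars := Integer((Stream.Size - Stream.Position) div SizeOf(WideChar));
  SetLength(Result, nChars);
  if nChars > 0 then
    Stream.ReadBuffer(Result[1], nChars * SizeOf(WideChar));
end;
```

In the question's Button2Click this would replace the call to ReadWideString, assuming wide.txt was saved as little-endian UTF-16.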

Is it possible to write/read a file using a string data type structure?

To write something to a file I use, for example, this code:
procedure MyProc (... );
const
BufSize = 65535;
var
FileSrc, FileDst: TFileStream;
StreamRead: Cardinal;
InBuf, OutBuf: Array [0..bufsize] of byte;
begin
.....
FileSrc := TFileStream.Create (uFileSrc, fmOpenRead Or fmShareDenyWrite);
try
FileDst := TFileStream.Create (uFileTmp, fmCreate);
try
StreamRead := 0;
while ((iCounter < iFileSize) or (StreamRead = Cardinal(BufSize))) do
begin
StreamRead := FileSrc.Read (InBuf, BufSize);
Inc (iCounter, StreamRead);
end;
finally
FileDst.Free;
end;
finally
FileSrc.Free;
end;
end;
For the file I/O I use an array of bytes, and that all works fine. But when I use a string, for example declaring:
InBuf, OutBuf: string // in delphi xe2 = unicode string
then it does not work, in the sense that nothing is written to the file. I think I have understood why.
I think the problem may be that a string holds just a pointer to memory rather than a static structure; correct?
If so, is there a way to solve it? That is, can I do something so that I can write a file using a string instead of an array, or do I have to use an array?
If it is possible, how can I do it?
Thanks very much.
There are two issues with using strings. First of all you want to use RawByteString so that you ensure the use of byte sized character elements – a Unicode string has elements that are two bytes wide. And secondly you need to dereference the string which is really just a pointer.
But I wonder why you would prefer strings to the stack allocated byte array.
procedure MyProc (... );
const
BufSize = 65536;
var
FileSrc, FileDst: TFileStream;
StreamRead: Cardinal;
InBuf: RawByteString;
begin
.....
FileSrc := TFileStream.Create (uFileSrc, fmOpenRead Or fmShareDenyWrite);
try
FileDst := TFileStream.Create (uFileTmp, fmCreate);
try
SetLength(InBuf, BufSize);
StreamRead := 0;
while ((iCounter < iFileSize) or (StreamRead = Cardinal(BufSize))) do
begin
StreamRead := FileSrc.Read (InBuf[1], BufSize);
Inc (iCounter, StreamRead);
end;
finally
FileDst.Free;
end;
finally
FileSrc.Free;
end;
end;
Note: Your previous code declared a buffer of 65536 bytes, but you only ever used 65535 of them. Probably not what you intended.
To use a string as a buffer (which I would not recommend), you'll have to use SetLength to allocate the internal buffer, and you'll have to pass InBuf[1] and OutBuf[1] as the data to read or write.
var
InBuf, OutBuf: AnsiString; // or TBytes
begin
SetLength(InBuf, BufSize);
SetLength(OutBuf, BufSize);
...
StreamRead := FileSrc.Read(InBuf[1], BufSize); // if TBytes, use InBuf[0]
// etc...
You can also use a TBytes, instead of an AnsiString. The usage remains the same.
But I actually see no advantage in dynamically allocating TBytes, AnsiStrings or RawByteStrings here. I'd rather do what you already do: use a stack based buffer. I would perhaps make it a little smaller in a multi-threaded environment.
Yes, you can save / load strings to / from stream, see the following example
var Len: Integer;
buf: string;
FData: TStream;
// save string to stream
// save the length of the string
Len := Length(buf);
FData.Write(Len, SizeOf(Len));
// save string itself
if(Len > 0)then FData.Write(buf[1], Len * sizeof(buf[1]));
// read string from stream
// read the length of the string
FData.Read(Len, SizeOf(Len));
if(Len > 0)then begin
// get memory for the string
SetLength(buf, Len);
// read string content
FData.Read(buf[1], Len * sizeof(buf[1]));
end else buf := '';
On a related note, to copy the contents from one TStream to another TStream, you could just use the TStream.CopyFrom() method instead:
procedure MyProc (... );
var
FileSrc, FileDst: TFileStream;
begin
...
FileSrc := TFileStream.Create (uFileSrc, fmOpenRead Or fmShareDenyWrite);
try
FileDst := TFileStream.Create (uFileTmp, fmCreate);
try
FileDst.CopyFrom(FileSrc, 0); // or FileDst.CopyFrom(FileSrc, iFileSize)
finally
FileDst.Free;
end;
finally
FileSrc.Free;
end;
...
end;
Which can be simplified by calling CopyFile() instead:
procedure MyProc (... );
begin
...
CopyFile(PChar(uFileSrc), PChar(uFileTmp), False);
...
end;
Either way, you don't have to worry about read/writing the file data manually at all!

how to improve the code (Delphi) for loading and searching in a dictionary?

I'm a Delphi programmer.
I have made a program that uses dictionaries of words and expressions (loaded into the program as "array of string").
It uses a search algorithm based on their "checksum" (I hope this is the correct word).
A string is transformed into an integer like this:
var
FHashSize: Integer; //stores the value of GetHashSize
HashTable, HashTableNoCase: array[Byte] of Longword;
HashTableInit: Boolean = False;
const
AnsiLowCaseLookup: array[AnsiChar] of AnsiChar = (
#$00, #$01, #$02, #$03, #$04, #$05, #$06, #$07,
#$08, #$09, #$0A, #$0B, #$0C, #$0D, #$0E, #$0F,
#$10, #$11, #$12, #$13, #$14, #$15, #$16, #$17,
#$18, #$19, #$1A, #$1B, #$1C, #$1D, #$1E, #$1F,
#$20, #$21, #$22, #$23, #$24, #$25, #$26, #$27,
#$28, #$29, #$2A, #$2B, #$2C, #$2D, #$2E, #$2F,
#$30, #$31, #$32, #$33, #$34, #$35, #$36, #$37,
#$38, #$39, #$3A, #$3B, #$3C, #$3D, #$3E, #$3F,
#$40, #$61, #$62, #$63, #$64, #$65, #$66, #$67,
#$68, #$69, #$6A, #$6B, #$6C, #$6D, #$6E, #$6F,
#$70, #$71, #$72, #$73, #$74, #$75, #$76, #$77,
#$78, #$79, #$7A, #$5B, #$5C, #$5D, #$5E, #$5F,
#$60, #$61, #$62, #$63, #$64, #$65, #$66, #$67,
#$68, #$69, #$6A, #$6B, #$6C, #$6D, #$6E, #$6F,
#$70, #$71, #$72, #$73, #$74, #$75, #$76, #$77,
#$78, #$79, #$7A, #$7B, #$7C, #$7D, #$7E, #$7F,
#$80, #$81, #$82, #$83, #$84, #$85, #$86, #$87,
#$88, #$89, #$8A, #$8B, #$8C, #$8D, #$8E, #$8F,
#$90, #$91, #$92, #$93, #$94, #$95, #$96, #$97,
#$98, #$99, #$9A, #$9B, #$9C, #$9D, #$9E, #$9F,
#$A0, #$A1, #$A2, #$A3, #$A4, #$A5, #$A6, #$A7,
#$A8, #$A9, #$AA, #$AB, #$AC, #$AD, #$AE, #$AF,
#$B0, #$B1, #$B2, #$B3, #$B4, #$B5, #$B6, #$B7,
#$B8, #$B9, #$BA, #$BB, #$BC, #$BD, #$BE, #$BF,
#$C0, #$C1, #$C2, #$C3, #$C4, #$C5, #$C6, #$C7,
#$C8, #$C9, #$CA, #$CB, #$CC, #$CD, #$CE, #$CF,
#$D0, #$D1, #$D2, #$D3, #$D4, #$D5, #$D6, #$D7,
#$D8, #$D9, #$DA, #$DB, #$DC, #$DD, #$DE, #$DF,
#$E0, #$E1, #$E2, #$E3, #$E4, #$E5, #$E6, #$E7,
#$E8, #$E9, #$EA, #$EB, #$EC, #$ED, #$EE, #$EF,
#$F0, #$F1, #$F2, #$F3, #$F4, #$F5, #$F6, #$F7,
#$F8, #$F9, #$FA, #$FB, #$FC, #$FD, #$FE, #$FF);
implementation
function GetHashSize(const Count: Integer): Integer;
begin
if Count < 65 then
Result := 256
else
Result := Round(IntPower(16, Ceil(Log10(Count div 4) / Log10(16))));
end;
function Hash(const Hash: LongWord; const Buf; const BufSize: Integer): LongWord;
var P: PByte;
I: Integer;
begin
P := @Buf;
Result := Hash;
for I := 1 to BufSize do
begin
Result := HashTable[Byte(Result) xor P^] xor (Result shr 8);
Inc(P);
end;
end;
function HashStrBuf(const StrBuf: Pointer; const StrLength: Integer; const Slots: LongWord): LongWord;
var P: PChar;
I, J: Integer;
begin
if not HashTableInit then
InitHashTable;
P := StrBuf;
if StrLength <= 48 then // Hash all characters for short strings
Result := Hash($FFFFFFFF, P^, StrLength)
else
begin
// Hash first 16 bytes
Result := Hash($FFFFFFFF, P^, 16);
// Hash last 16 bytes
Inc(P, StrLength - 16);
Result := Hash(Result, P^, 16);
// Hash 16 bytes sampled from rest of string
I := (StrLength - 48) div 16;
P := StrBuf;
Inc(P, 16);
for J := 1 to 16 do
begin
Result := HashTable[Byte(Result) xor Byte(P^)] xor (Result shr 8);
Inc(P, I + 1);
end;
end;
// Mod into slots
if Slots <> 0 then
Result := Result mod Slots;
end;
procedure InitHashTable;
var I, J: Byte;
R: LongWord;
begin
for I := $00 to $FF do
begin
R := I;
for J := 8 downto 1 do
if R and 1 <> 0 then
R := (R shr 1) xor $EDB88320
else
R := R shr 1;
HashTable[I] := R;
end;
Move(HashTable, HashTableNoCase, Sizeof(HashTable));
for I := Ord('A') to Ord('Z') do
HashTableNoCase[I] := HashTableNoCase[I or 32];
HashTableInit := True;
end;
The result of HashStrBuf is masked with "and (FHashSize - 1)" and used as an index into an "array of array of Integer" (of size FHashSize), which stores the index of each string in that "array of string".
This way, when the program searches for a string, the string is transformed into its "checksum", and the code then searches the "branch" with this index, comparing the string with the dictionary strings that have the same "checksum".
Ideally each string in the dictionary would have a unique checksum. But in the "real world" about 2/3 of them share a "checksum" with other words. Because of that the search is not that fast.
In these dictionaries the strings are composed of these characters: ['a'..'z',#224..#246,#248..#254,#154,#156..#159,#179,#186,#191,#190,#185,'0'..'9', '''']
Is there any way to improve the "hashing" so the strings would have more unique "checksums"?
Oh, one way is to increase the size of that "array of array of Integer" (FHashSize), but it cannot be increased too much because it takes a lot of RAM.
Another thing: these dictionaries are stored on the HDD only as words/expressions (not the "checksums"). Their "checksums" are generated at program startup. But it takes many seconds to do that...
Is there any way to speed up the startup of the program? Maybe by improving the "hashing" function, maybe by storing the "checksums" on HDD and loading them from there...
Any input would be appreciated...
PS: here is the code to search:
function TDictionary.LocateKey(const Key: AnsiString): Integer;
var i, j, l, H: Integer;
P, Q: PChar;
begin
Result := -1;
l := Length(Key);
H := HashStrBuf(@Key[1], l, 0) and (FHashSize - 1);
P := @Key[1];
for i := 0 to High(FHash[H]) do //FHash is that "array of array of integer"
begin
if l <> FKeys.ItemSize[FHash[H][i]] then //FKeys.ItemSize is an byte array with the lengths of strings from dictionary
Continue;
Q := FKeys.Pointer(FHash[H][i]); //pointer to string in dictionary
for j := 0 to l - 1 do
if (P + j)^ <> (Q + j)^ then
Break;
if j = l then
begin
Result := FHash[H][i];
Exit;
end;
end;
end;
Don't reinvent the wheel!
IMHO your hashing is far from efficient, and your collision algorithm can be improved.
Take a look for instance at the IniFiles unit, and the THashedStringList.
It's a bit old, but a good start for a string list using hashes.
There are a lot of good Delphi implementations of this, such as in SuperObject and a lot of other code...
Take a look at our SynBigTable unit, which can handle arrays of data in memory or in file very fast, with full indexed searches. Or our latest TDynArray wrapper around any dynamic array of data, to implement TList-like methods to it, including fast binary search. I'm quite sure it could be faster than your hand-tuned code using hashing, if you use an ordered index then fast binary search.
Post-Scriptum:
About pure hashing speed of a string content, take a look at this function - rename RawByteString into AnsiString, PPtrInt into PPointer, and PtrInt into Integer for Delphi 7:
function Hash32(const Text: RawByteString): cardinal;
function SubHash(P: PCardinalArray): cardinal;
{$ifdef HASINLINE}inline;{$endif}
var s1,s2: cardinal;
i, L: PtrInt;
const Mask: array[0..3] of cardinal = (0,$ff,$ffff,$ffffff);
begin
if P<>nil then begin
L := PPtrInt(PtrInt(P)-4)^; // fast length(Text)
s1 := 0;
s2 := 0;
for i := 1 to L shr 4 do begin // 16 bytes (4 DWORD) by loop - aligned read
inc(s1,P^[0]);
inc(s2,s1);
inc(s1,P^[1]);
inc(s2,s1);
inc(s1,P^[2]);
inc(s2,s1);
inc(s1,P^[3]);
inc(s2,s1);
inc(PtrUInt(P),16);
end;
for i := 1 to (L shr 2)and 3 do begin // 4 bytes (DWORD) by loop
inc(s1,P^[0]);
inc(s2,s1);
inc(PtrUInt(P),4);
end;
inc(s1,P^[0] and Mask[L and 3]); // remaining 0..3 bytes
inc(s2,s1);
result := s1 xor (s2 shl 16);
end else
result := 0;
end;
begin // use a sub function for better code generation under Delphi
result := SubHash(pointer(Text));
end;
There is even a pure asm version, even faster, in our SynCommons.pas unit. I don't know any faster hashing function around (it's faster than crc32/adler32/IniFiles.hash...). It's based on adler32, but use DWORD aligned reading and summing for even better speed. This could be improved with SSE asm, of course, but here is a fast pure Delphi hash function.
Then don't forget to use "multiplication"/"binary and operation" for hash resolution, just like in IniFiles. It will reduce the number of iterations over your list of hashes.
But since you didn't provide the search source code, we are not able to know what could be improved here.
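To make the "binary and" reduction concrete, here is a minimal sketch. The helper name is hypothetical, it shows only the "binary and" variant, and it is valid only under the assumption that the bucket count is a power of two (which holds for the sizes the question's GetHashSize returns, since they are powers of 16):

```pascal
// Hypothetical helper: maps a 32-bit hash to a bucket index.
// Assumes BucketCount is a power of two (256, 4096, 65536, ...).
function BucketIndex(Hash, BucketCount: Cardinal): Cardinal;
begin
  // For a power-of-two count this equals Hash mod BucketCount,
  // but needs no division.
  Result := Hash and (BucketCount - 1);
end;
```

For a bucket count that is not a power of two the mask would skip some buckets, so "mod" would have to be used there instead.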
If you are using Delphi 7, consider using Julian Bucknall's lovely Delphi data types code, EzDsl (Easy Data Structures Library).
Now you don't have to reinvent the wheel as another wise person has also said.
You can download ezdsl, a version that I have made work with both Delphi 7, and recent unicode delphi versions, here.
In particular the unit name EHash contains a hash table implementation, which has various hashing algorithms plug-inable, or you can write your own plugin function that just does the hashing function of your choice.
As a word to the wise, if you are using a Unicode Delphi version; I would be careful about hashing your unicode strings with a code library like this, without checking how its hashing algorithms perform on your system. The OP here is using Delphi 7, so Unicode is not a factor for the original question.
I think you'll find a database (without checksums) a lot quicker. Maybe try sqlite which will give you a single file database. There are many Delphi Libraries available.

Delphi: Encoding Strings as Python does

I want to encode strings the way Python does.
The Python code is this:
def EncodeToUTF(inputstr):
uns = inputstr.decode('iso-8859-2')
utfs = uns.encode('utf-8')
return utfs
This is very simple.
But in Delphi I don't understand how to encode, that is, how to first force the right character set (no matter which computer we are on).
I tried this test code to see the conversion:
procedure TForm1.Button1Click(Sender: TObject);
var
w : WideString;
buf : array[0..2048] of WideChar;
i : integer;
lc : Cardinal;
begin
lc := GetThreadLocale;
Caption := IntToStr(lc);
StringToWideChar(Edit1.Text, buf, SizeOF(buf));
w := buf;
lc := MakeLCID(
MakeLangID( LANG_ENGLISH, SUBLANG_ENGLISH_US),
0);
Win32Check(SetThreadLocale(lc));
Edit2.Text := WideCharToString(PWideChar(w));
Caption := IntToStr(AnsiCompareText(Edit1.Text, Edit2.Text));
end;
The input is: "árvíztűrő tükörfúrógép", the hungarian accent tester phrase.
The local lc is 1038 (hun), the new lc is 1033.
But this always gives a result of 0 (same strings), and the accents are the same; I don't lose ŐŰ, which are not in the English character set.
What am I doing wrong? How do I do the same thing Python does?
Thanks for any help, links, etc.:
dd
Windows uses codepage 28592 for ISO-8859-2. If you have a buffer containing ISO-8859-2 encoded bytes, then you have to decode the bytes to UTF-16 first, and then encode the result to UTF-8. Depending on which version of Delphi you are using, you can either:
1) on pre-D2009, use MultiByteToWideChar() and WideCharToMultiByte():
function EncodeToUTF(const inputstr: AnsiString): UTF8String;
var
ret: Integer;
uns: WideString;
begin
Result := '';
if inputstr = '' then Exit;
ret := MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), nil, 0);
if ret < 1 then Exit;
SetLength(uns, ret);
MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), PWideChar(uns), Length(uns));
ret := WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), nil, 0, nil, nil);
if ret < 1 then Exit;
SetLength(Result, ret);
WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), PAnsiChar(Result), Length(Result), nil, nil);
end;
2a) on D2009+, use SysUtils.TEncoding.Convert():
function EncodeToUTF(const inputstr: RawByteString): UTF8String;
var
enc: TEncoding;
buf: TBytes;
begin
Result := '';
if inputstr = '' then Exit;
enc := TEncoding.GetEncoding(28592);
try
buf := TEncoding.Convert(enc, TEncoding.UTF8, BytesOf(inputstr));
if Length(buf) > 0 then
SetString(Result, PAnsiChar(@buf[0]), Length(buf));
finally
enc.Free;
end;
end;
2b) on D2009+, alternatively define a new string typedef, put your data into it, and assign it to a UTF8String variable. No manual encoding/decoding needed, the RTL will handle everything for you:
type
Latin2String = type AnsiString(28592);
var
inputstr: Latin2String;
outputstr: UTF8String;
begin
// put the ISO-8859-2 encoded bytes into inputstr, then...
outputstr := inputstr;
end;
If you're using Delphi 2009 or newer every input from the default VCL controls will be UTF-16, so no need to do any conversions on your input.
If you're using Delphi 2007 or older (as it seems), you are at the mercy of Windows, because the VCL is ANSI and Windows has a fixed code page that determines which characters can be used in, e.g., a TEdit.
You can change the system-wide default ANSI code page in the Control Panel, but that requires a reboot each time you do.
In Delphi 2007 you have some chance to use TNTUnicode controls or some similar solution to get the Text from the UI to your code.
In Delphi 2009 and newer there are also plenty of Unicode and character set handling routines in the RTL.
The conversion between character sets can be done with SysUtils.TEncoding:
http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/SysUtils_TEncoding.html
The Python code in your question returns a string in UTF-8 encoding. To do this with pre-2009 Delphi versions you can use code similar to:
procedure TForm1.Button1Click(Sender: TObject);
var
Src, Dest: string;
Len: integer;
buf : array[0..2048] of WideChar;
begin
Src := Edit1.Text;
Len := MultiByteToWideChar(CP_ACP, 0, PChar(Src), Length(Src), @buf[0], 2048);
buf[Len] := #0;
SetLength(Dest, 2048);
SetLength(Dest, WideCharToMultiByte(CP_UTF8, 0, @buf[0], Len, PChar(Dest),
2048, nil, nil));
Edit2.Text := Dest;
end;
Note that this doesn't change the current thread locale, it simply passes the correct code page parameters to the API.
There are encoding tools in Open XML library. There is cUnicodeCodecsWin32 unit with functions like: EncodingToUTF16().
My code that converts between ISO Latin2 and UTF-8 looks like:
s2 := EncodingToUTF16('ISO-8859-2', s);
s2utf8 := UTF16ToEncoding('UTF-8', s2);

Resources