I have come up with this function to return the number of occurrences of a string in a Delphi Stream. However, I suspect there is a more efficient way to achieve this, since I am using "higher level" constructs (char), and not working at the lower byte/pointer level (which I am not that familiar with)
function ReadStream(const S: AnsiString; Stream: TMemoryStream): Integer;
var
Arr: Array of AnsiChar;
Buf: AnsiChar;
ReadCount: Integer;
procedure AddChar(const C: AnsiChar);
var
I: Integer;
begin
for I := 1 to Length(S) - 1 do
Arr[I] := Arr[I+1];
Arr[Length(S)] := C;
end;
function IsEqual: Boolean;
var
I: Integer;
begin
Result := True;
for I := 1 to Length(S) do
if S[I] <> Arr[I] then
begin
Result := False;
Break;;
end;
end;
begin
Stream.Position := 0;
SetLength(Arr, Length(S));
Result := 0;
repeat
ReadCount := Stream.Read(Buf, 1);
AddChar(Buf);
if IsEqual then
Inc(Result);
until ReadCount = 0;
end;
Can someone supply a procedure that is more efficient?
Stream has a method that will let you get into the internal buffer.
You can get a pointer to the internal buffer using the Memory property.
If you are working in 32 bit and you are willing to let go of the deprecated TMemoryStream and use TBytesStream instead you can use abuse the fact that a dynamic array and an AnsiString share the same structure in 32 bit.
Unfortunately Emba broke that compatibility in X64, Which means that for no good reason whatsoever you cannot have strings > 2GB in X64.
Note that this trick will break in 64 bit! (See fix below)
You can use Boyer-Moore string searching.
This allows you to write code like this:
function CountOccurrances(const Needle: AnsiString; const Haystack: TBytesStream): integer;
var
Start: cardinal;
Count: integer;
begin
Start:= 1;
Count:= 0;
repeat
{$ifdef CPUx86}
Start:= _FindStringBoyerAnsiString(string(HayStack.Memory), Needle, false, Start);
{$else}
Start:= __FindStringBoyerAnsiStringIn64BitTArrayByte(TArray<Byte>(HaySAtack.Memory), Needle, false, Start);
{$endif}
if Start >= 1 then begin
Inc(Start, Length(Needle));
Inc(Count);
end;
until Start <= 0;
Result:= Count;
end;
For 32 bit you'll have to rewrite the BoyerMoore code to use AnsiString; a trivial rewrite.
For 64 bit you'll have to rewrite the BoyerMoore code to use a TArray<byte> as a first parameter; a relatively simple task.
If you are looking for efficiency, please try and avoid WinAPI calls that use pchars. c-style strings are a horrible idea, because they do not have a length prefix.
Johan has given you a good answer about Boyer-Moore searching. BM is fine if
your are content to use it as a "black box", but if you want to understand what's going on,
there is a bit of a gulf between the complexity of your own code and a BM implementation.
You might find it helpful to explore searching that's more efficient than your own code
but not so complex as BM. There is one ultra-simple way to do what you want without
getting invoved with pointers, PChars, etc.
Let's leave aside for a moment the fact that you want to work with a TMemoryStream, and
consider finding the number of occurrences of a string SubStr in another string Target.
For efficiency, things you want to avoid are a) repeatedly scanning the same characters
over and over and b) copying one or both strings.
Since D7, Delphi has included a PosEx function:
function PosEx(const SubStr, S: string; Offset: Cardinal = 1): Integer;
Description
PosEx returns the index of SubStr in S, beginning the search at Offset. If Offset is 1 (default), PosEx is equivalent to Pos.
PosEx returns 0 if SubStr is not found, if Offset is greater than the length of S, or if Offset is less than 1.
So what you can do is repeatedly call PosEx, starting with Offset = 1, and each time it
finds SubStr in Target you increment Offset to skip over it, like this (in a console application):
function ContainsCount(const SubStr, Target : String) : Integer;
var
i : Integer;
begin
Result := 0;
i := 1;
repeat
i := PosEx(SubStr, Target, i);
if i > 0 then begin
Inc(Result);
i := i + Length(SubStr);
end;
until i <= 0;
end;
var
Count : Integer;
Target : String;
begin
Target := 'aa b ca';
Count := ContainsCount('a', Target);
writeln(Count);
readln;
end.
The fact that PosEx and ContainsCount both pass SubStr and Target as
consts meants that no string copying is involved, and it should be obvious
that ContainsCount never scans the same characters more that once.
Once you've satisfied yourself that this works, you might care to trace
into PosEx to see how it does its stuff.
You can do something which works in a similar way on PChars using the RTL functions StrPos/AnsiStrPos
To convert your memory stream to a string, you could use this code from
Rob Kennedy's answer to this q Converting TMemoryStream to 'String' in Delphi 2009
function MemoryStreamToString(M: TMemoryStream): string;
begin
SetString(Result, PChar(M.Memory), M.Size div SizeOf(Char));
end;
(Note what he says about the alternative version later in his answer)
Btw, if you look through the VCL + RTL code, you'll see that quite a lot of the string-parsing and processing code (e.g. in TParser, TStringList, TExpressionParser) all does its work with PChars. There's a reason for that of course, because it minimizes character copying and means that most scanning operations can be done by changing pointer (PChar) values.
I am taking VB6 code and translating it to Delphi.
VB6 code that opens a file using sequential order:
Dim bytNumDataPoints As Byte
Dim bytcount as Byte
Dim lintData(0 To 23) As Long
Input #intFileNumber, bytNumDataPoints
For bytcount = 0 To (bytNumDataPoints - 1)
Input #intFileNumber, lintData(bytcount) '
lintData(bytcount) = (lintData(bytcount) + 65536) Mod 65536
Next bytcount
File data:
24 <<<<<<<<<<<< Number Data Points
200 300 400 450 500 600 750 1000 1250 1500 1750 2000 2500 3000 3500 3750 4000 4500 5000 5250 5500 5750 6000 6250 <<<< data
This is some neat stuff. You keep calling Input and fill out the array.
As far I know there is no equivalent for this phenomena in Delphi. You cannot use ReadLn like that, right? To me in Delphi you would have to
ReadLn(F, S); //S is a string
z.Delimiter := ' '; //z is a stringlist
z.DelimitedText := S; //and then breakdown the array
Any thought? Thanks.
Use Read instead of Readln;
Something like this:
var
ArrLng, Index: Integer;
Arr: array of Integer;
F: Text;
begin
Assign(F, 'your-fie-name');
OpenFile(F);
try
Readln(F, ArrLng);
SetLength(Arr, ArrLng);
Index := 0;
while (not Eof(F)) and (Index < ArrLng) do
begin
Read(F, Arr[Index]);
Inc(Index);
end;
finally
CloseFile(F);
end;
end;
Splitting string would be generally better approach, with StringList or without.
How to split a string of only ten characters e.g."12345*45688" into an array
Split a string into an array of strings based on a delimiter
How to split the string in delphi
But i think you can also use old Pascal approach there. Just forget about end-of-line, you probably don't need it.
var F: TextFile; I, J, K: integer;
begin
...
ReadLN(F, J);
for i := 1 to J do
Read(F, K);
...
end
But i think it would only be nice for niche approaches (like out of memory) and overall SplitString approach would be faster.
And if you have multi-line files, then two chained stringlist would be IMHO most easy approach, providing those files are not GB-sized.
https://stackoverflow.com/a/14454614/976391
or utilizing more SL features - https://stackoverflow.com/a/14649862/976391
The TStringList class and it's LoadFromFile method. It's got delimiter properties as well.
When you are doing Delphi, treat it like .net. If there's something that should be there, it is, you just haven't found it yet.
You can chuck the number of values thing away as well.
How would I read data from a text file into two arrays? One being string and the other integer?
The text file has a layout like this:
Hello
1
Test
2
Bye
3
Each number corresponds to the text above it. Can anyone perhaps help me? Would greatly appreciate it
var
Items: TStringList;
Strings: array of string;
Integers: array of Integer;
i, Count: Integer;
begin
Items := TStringList.Create;
try
Items.LoadFromFile('c:\YourFileName.txt');
// Production code should check that Items.Count is even at this point.
// Actual arrays here. Set their size once, because we know already.
// growing your arrays inside the iteration will cause many reallocations
// and memory fragmentation.
Count := Items.Count div 2;
SetLength(Strings, Count);
SetLength(Integers, Count);
for i := 0 to Count - 1 do
begin
Strings[i] := Items[i*2];
Integers[i] := StrToInt(Items[i*2+1]);
end;
finally
Items.Free;
end;
end
I would read the file into a string list and then process it item by item. The even ones are put into the list of strings, and the odd ones go into the numbers.
var
file, strings, numbers: TStringList;
...
//create the lists
file.LoadFromFile(filename);
Assert(file.Count mod 2=0);
for i := 0 to file.Count-1 do
if i mod 2=0 then
strings.Add(file[i])
else
numbers.Add(file[i]);
I'd probably use some helper functions called odd and even in my own code.
If you wanted the numbers in a list of integers, rather than a string list, then you would use TList<Integer> and add StrToInt(file[i]) on the odd iterations.
I've used lists rather than dynamic arrays for the ease of writing this code, but GolezTrol shows you how to do it with dynamic arrays if that's what you prefer.
That said, since your state that the number is associated with the string, you may actually be better off with something like this:
type
TNameAndID = record
Name: string;
ID: Integer;
end;
var
List: TList<TNameAndID>;
Item: TNameAndID;
...
List := TList<TNameAndID>.Create;
file.LoadFromFile(filename);
Assert(file.Count mod 2=0);
for i := 0 to file.Count-1 do begin
if i mod 2=0 then begin
Item.Name := file[i];
end else begin
Item.ID := StrToInt(file[i]);
List.Add(Item);
end;
end;
end;
The advantage of this approach is that you now have assurance that the association between name and ID will be maintained. Should you ever wish to sort, insert or remove items then you will find the above structure much more convenient than two parallel arrays.
I want to terminate file that user selected from my program. I wrote this sample code:
var
aFile: TFileStream;
Const
FileAddr: String = 'H:\Akon.mp3';
Buf: Byte = 0;
begin
if FileExists(FileAddr) then
begin
// Open given file in file stream & rewrite it
aFile:= TFileStream.Create(FileAddr, fmOpenReadWrite);
try
aFile.Seek(0, soFromBeginning);
while aFile.Position <> aFile.Size do
aFile.Write(Buf, 1);
finally
aFile.Free;
ShowMessage('Finish');
end;
end;
end;
As you can see, I overwrite given file with 0 (null) value. This code works correctly, but the speed is very low in large files. I would like do this process in multithreaded code, but I tried some test some code and can't do it. For example, I create 4 threads that do this work to speed up this process.
Is there any way to speed up this process?
I don't know if it could help you, but I think you could do better (than multithreading) writing to file a larger buffer.
For example you could initialize a buffer 16k wide and write directly to FileStream; you have only to check the last part of file, for which you write only a part of the full buffer.
Believe me, it will be really faster...
OK, I'll bite:
const
FileAddr: String = 'H:\Akon.mp3';
var
aFile: TFileStream;
Buf: array[0..1023] of Byte;
Remaining, NumBytes: Integer;
begin
if FileExists(FileAddr) then
begin
// Open given file in file stream & rewrite it
aFile:= TFileStream.Create(FileAddr, fmOpenReadWrite);
try
FillChar(Buf, SizeOf(Buf), 0);
Remaining := aFile.Size;
while Remaining > 0 do begin
NumBytes := SizeOf(Buf);
if NumBytes < Remaining then
NumBytes := Remaining;
aFile.WriteBuffer(Buf, NumBytes);
Dec(Remaining, NumBytes);
end;
finally
aFile.Free;
ShowMessage('Finish');
end;
end;
end;
Multiple threads won't help you here. Your constraint is disk access, primarily because you're writing only 1 byte at a time.
Declare Buf as an array of bytes, and initialise it with FillChar or ZeroMemory. Then change your while loop as follows:
while ((aFile.Position + SizeOf(Buf)) < aFile.Size) do
begin
aFile.Write(Buf, SizeOf(Buf));
end;
if (aFile.Position < aFile.Size) then
begin
aFile.Write(Buf, aFile.Size - aFile.Position);
end;
You should learn from the slowness of the above code that:
Writing one byte at a time is the slowest and worst way to do it, you incur a huge overhead, and reduce your overall performance.
Even writing 1k bytes (1024 bytes) at a time, would be a vast improvement, but writing a larger amount will of course be even faster, until you reach a point of diminishing returns, which I would guess is going to be somewhere between 200k and 500k write buffer size. The only way to find out when it stops mattering for your application is to test, test, and test.
Checking position against size so often is completely superfluous. If you read the size once, and write the correct number of bytes, using a local variable you will save yourself more overhead, and improve performance. ie, Inc(LPosition,BufSize) to increment LPosition:Integer logical variable, by the buffer size amount BufSize.
Not sure if this meets your requirments but it works and it's fast.
var
aFile: TFileStream;
const
FileAddr: String = 'H:\Akon.mp3';
begin
if FileExists(FileAddr) then
begin
aFile:= TFileStream.Create(FileAddr, fmOpenReadWrite);
try
afile.Size := 0;
finally
aFile.Free;
ShowMessage('Finish');
end;
end;
end;
So will something along these lines (declaring b outside the function will improve your performance in the loop, especially when dealing with a large file ). I assume that in the app filename would be a var:
const
b: byte=0;
procedure MyProc;
var
aFile: TFileStream;
Buf: array of byte;
len: integer;
FileAddr: String;
begin
FileAddr := 'C:\testURL.txt';
if FileExists(FileAddr) then
begin
aFile := TFileStream.Create(FileAddr, fmcreate);
try
len := afile.Size;
setlength(buf, len);
fillchar(buf[0], len, b);
aFile.Position := 0;
aFile.Write(buf, len);
finally
aFile.Free;
ShowMessage('Finish');
end;
end;
end;
In my program, I process millions of strings that have a special character, e.g. "|" to separate tokens within each string. I have a function to return the n'th token, and this is it:
function GetTok(const Line: string; const Delim: string; const TokenNum: Byte): string;
{ LK Feb 12, 2007 - This function has been optimized as best as possible }
var
I, P, P2: integer;
begin
P2 := Pos(Delim, Line);
if TokenNum = 1 then begin
if P2 = 0 then
Result := Line
else
Result := copy(Line, 1, P2-1);
end
else begin
P := 0; { To prevent warnings }
for I := 2 to TokenNum do begin
P := P2;
if P = 0 then break;
P2 := PosEx(Delim, Line, P+1);
end;
if P = 0 then
Result := ''
else if P2 = 0 then
Result := copy(Line, P+1, MaxInt)
else
Result := copy(Line, P+1, P2-P-1);
end;
end; { GetTok }
I developed this function back when I was using Delphi 4. It calls the very efficient PosEx routine that was originally developed by Fastcode and is now included in the StrUtils library of Delphi.
I recently upgraded to Delphi 2009 and my strings are all Unicode. This GetTok function still works and still works well.
I have gone through the new libraries in Delphi 2009 and there are many new functions and additions to it.
But I have not seen a GetToken function like I need in any of the new Delphi libraries, in the various fastcode projects, and I can't find anything with a Google search other than Zarko Gajic's: Delphi Split / Tokenizer Functions, which is not as optimized as what I already have.
Any improvement, even 10% would be noticeable in my program. I know an alternative is StringLists and to always keep the tokens separate, but this has a big overhead memory-wise and I'm not sure if I did all that work to convert whether it would be any faster.
Whew. So after all this long winded talk, my question really is:
Do you know of any very fast implementations of a GetToken routine? An assembler optimized version would be ideal?
If not, are there any optimizations that you can see to my code above that might make an improvement?
Followup: Barry Kelly mentioned a question I asked a year ago about optimizing the parsing of the lines in a file. At that time I hadn't even thought of my GetTok routine which was not used for the that read or parsing. It is only now that I saw the overhead of my GetTok routine which led me to ask this question. Until Carl Smotricz and Barry's answers, I had never thought of connecting the two. So obvious, but it just didn't register. Thanks for pointing that out.
Yes, my Delim is a single character, so obviously I have some major optimization I can do. My use of Pos and PosEx in the GetTok routine (above) blinded me to the idea that I can do it faster with a character by character search instead, with bits of code like:
while (cp^ > #0) and (cp^ <= Delim) do
Inc(cp);
I'm going to go through everyone's answers and try the various suggestions and compare them. Then I'll post the results.
Confusion: Okay, now I'm really perplexed.
I took Carl and Barry's recommendation to go with PChars, and here is my implementation:
function GetTok(const Line: string; const Delim: string; const TokenNum: Byte): string;
{ LK Feb 12, 2007 - This function has been optimized as best as possible }
{ LK Nov 7, 2009 - Reoptimized using PChars instead of calls to Pos and PosEx }
{ See; https://stackoverflow.com/questions/1694001/is-there-a-fast-gettoken-routine-for-delphi }
var
I: integer;
PLine, PStart: PChar;
begin
PLine := PChar(Line);
PStart := PLine;
inc(PLine);
for I := 1 to TokenNum do begin
while (PLine^ <> #0) and (PLine^ <> Delim) do
inc(PLine);
if I = TokenNum then begin
SetString(Result, PStart, PLine - PStart);
break;
end;
if PLine^ = #0 then begin
Result := '';
break;
end;
inc(PLine);
PStart := PLine;
end;
end; { GetTok }
On paper, I don't think you can do much better than this.
So I put both routines to the task and used AQTime to see what's happening. The run I had included 1,108,514 calls to GetTok.
AQTime timed the original routine at 0.40 seconds. The million calls to Pos took 0.10 seconds. A half a million of the TokenNum = 1 copies took 0.10 seconds. The 600,000 PosEx calls only took 0.03 seconds.
Then I timed my new routine with AQTime for the same run and exactly the same calls. AQTime reports that my new "fast" routine took 3.65 seconds, which is 9 times as long. The culprit according to AQTime was the first loop:
while (PLine^ <> #0) and (PLine^ <> Delim) do
inc(PLine);
The while line, which was executed 18 million times, was reported at 2.66 seconds. The inc line, executed 16 million times, was said to take 0.47 seconds.
Now I thought I knew what was happening here. I had a similar problem with AQTime in a question I posed last year: Why is CharInSet faster than Case statement?
Again it was Barry Kelly who clued me in. Basically, an instrumenting profiler like AQTime does not necessarily do the job for microoptimization. It adds an overhead to each line which may swamp the results which is shown clearly in these numbers. The 34 million lines executed in my new "optimized code" overwhelm the several million lines of my original code, with apparently little or no overhead from the Pos and PosEx routines.
Barry gave me a sample of code using QueryPerformanceCounter to check that he was correct, and in that case he was.
Okay, so let's do the same now with QueryPerformanceCounter to prove that my new routine is faster and not 9 times slower as AQTime says it is. So here I go:
function TimeIt(const Title: string): double;
var i: Integer;
start, finish, freq: Int64;
Seconds: double;
begin
QueryPerformanceCounter(start);
for i := 1 to 250000 do
GetTokOld('This is a string|that needs|parsing', '|', 1);
for i := 1 to 250000 do
GetTokOld('This is a string|that needs|parsing', '|', 2);
for i := 1 to 250000 do
GetTokOld('This is a string|that needs|parsing', '|', 3);
for i := 1 to 250000 do
GetTokOld('This is a string|that needs|parsing', '|', 4);
QueryPerformanceCounter(finish);
QueryPerformanceFrequency(freq);
Seconds := (finish - start) / freq;
Result := Seconds;
end;
So this will test 1,000,000 calls to GetTok.
My old procedure with the Pos and PosEx calls took 0.29 seconds.
The new one with PChars took 2.07 seconds.
Now I am completely befuddled! Can anyone tell me why the PChar procedure is not only slower, but is 8 to 9 times slower!?
Mystery solved! Andreas said in his answer to change the Delim parameter from a string to a Char. I'll always be using just a Char, so at least for my implementation this is very possible. I was amazed at what happened.
The time for the 1 million calls went down from 1.88 seconds to .22 seconds.
And surprisingly, the time for my original Pos/PosEx routine went UP from .29 to .44 seconds when I changed it's Delim parameter to a Char.
Frankly, I'm disappointed by Delphi's optimizer. That Delim is a constant parameter. The optimizer should have noticed that the same conversion is happening within the loop and should have moved it out so that it would only be done once.
Double checking my Code generation parameters, yes I do have Optimization True and String format checking Off.
Bottom line is that the new PChar routine with Andrea's fix is about 25% faster than my original (.22 versus .29).
I still want to follow up on the other comments here and test them out.
Turning off optimization and turning on String format checking only increases the time from .22 to .30. It adds about the same to the original.
The advantage to using assembler code, or calling routines written in assembler like Pos or PosEx is that they are NOT subject to what code generation options you have set. They will always run the same way, a pre-optimized and non-bloated way.
I have reaffirmed in the last couple of days, that the best way to compare code for microoptimization is to look at and compare the Assembler code in the CPU window. It would be nice if Embarcadero could make that window a bit more convenient, and allow us to copy portions to the clipboard or to print sections of it.
Also, I unfairly slammed AQTime earlier in this post, thinking that the extra time added for my new routine was solely because of the instrumentation it added. Now that I go back and check with the Char parameter instead of String, the while loop is down to .30 seconds (from 2.66) and the inc line is down to .14 seconds (from .47). Strange that the inc line would go down as well. But I'm getting worn out from all this testing already.
I took Carl's idea of looping by characters, and rewrote that code with that idea. It makes another improvement, down to .19 seconds from .22. So here is now the best so far:
function GetTok(const Line: string; const Delim: Char; const TokenNum: Byte): string;
{ LK Nov 8, 2009 - Reoptimized using PChars instead of calls to Pos and PosEx }
{ See; https://stackoverflow.com/questions/1694001/is-there-a-fast-gettoken-routine-for-delphi }
var
I, CurToken: Integer;
PLine, PStart: PChar;
begin
CurToken := 1;
PLine := PChar(Line);
PStart := PLine;
for I := 1 to length(Line) do begin
if PLine^ = Delim then begin
if CurToken = TokenNum then
break
else begin
CurToken := CurToken + 1;
inc(PLine);
PStart := PLine;
end;
end
else
inc(PLine);
end;
if CurToken = TokenNum then
SetString(Result, PStart, PLine - PStart)
else
Result := '';
end;
There still may be some minor optimizations to this, such as the CurToken = Tokennum comparison, which should be the same type, Integer or Byte, whichever is faster.
But let's say, I'm happy now.
Thanks again to the StackOverflow Delphi community.
It makes a big difference what "Delim" is expected to be. If it's expected to be a single character, you're far better off stepping through the string character by character, ideally through a PChar, and testing specifically.
If it's a long string, Boyer-Moore and similar searches have a set-up phase for skip tables, and the best way would be to build the tables once, and reuse them for each subsequent find. That means you need state between calls, and this function would be better off as a method on an object instead.
You might be interested in this answer I gave to a question some time before, about the fastest way to parse a line in Delphi. (But I see that it is you that asked the question! Nevertheless, in solving your problem, I would hew to how I described parsing, not using PosEx like you are using, depending on what Delim normally looks like.)
UPDATE: OK, I spent about 40 minutes looking at this. If you know the delimiter is going to be a character, you're pretty much always better off with the second version (i.e. PChar scanning), but you have to pass Delim as a character. At the time of writing, you're converting the PLine^ expression - of type Char - to a string for comparison with Delim. That will be very slow; even indexing into the string, with Delim[1] will also be somewhat slow.
However, depending on how large your lines are, and how many delimited pieces you want to pull out, you may be better off with a resumable approach, rather than skipping unwanted delimited pieces inside the tokenizing routine. If you call GetTok with successively increasing indexes, like you are currently doing in your mini benchmark, you'll end up with O(n*n) performance, where n is the number of delimited sections. That can be turned into O(n) if you save the state of the scan and restore it for the next iteration, or pack all extracted items into an array.
Here's a version that does all tokenization once, and returns an array. It needs to tokenize twice though, in order to know how large to make the array. On the other hand, only the second tokenization needs to extract the strings:
// Do all tokenization up front.
function GetTok4(const Line: string; const Delim: Char): TArray<string>;
var
cp, start: PChar;
count: Integer;
begin
// Count sections
count := 1;
cp := PChar(Line);
start := cp;
while True do
begin
if cp^ <> #0 then
begin
if cp^ <> Delim then
Inc(cp)
else
begin
Inc(cp);
Inc(count);
end;
end
else
begin
Inc(count);
Break;
end;
end;
SetLength(Result, count);
cp := start;
count := 0;
while True do
begin
if cp^ <> #0 then
begin
if cp^ <> Delim then
Inc(cp)
else
begin
SetString(Result[count], start, cp - start);
Inc(cp);
Inc(count);
end;
end
else
begin
SetString(Result[count], start, cp - start);
Break;
end;
end;
end;
Here's the resumable approach. The loads and stores of the current position and delimiter character do have a cost, though:
type
TTokenizer = record
private
FSource: string;
FCurrPos: PChar;
FDelim: Char;
public
procedure Reset(const ASource: string; ADelim: Char); inline;
function GetToken(out AResult: string): Boolean; inline;
end;
procedure TTokenizer.Reset(const ASource: string; ADelim: Char);
begin
FSource := ASource; // keep reference alive
FCurrPos := PChar(FSource);
FDelim := ADelim;
end;
function TTokenizer.GetToken(out AResult: string): Boolean;
var
cp, start: PChar;
delim: Char;
begin
// copy members to locals for better optimization
cp := FCurrPos;
delim := FDelim;
if cp^ = #0 then
begin
AResult := '';
Exit(False);
end;
start := cp;
while (cp^ <> #0) and (cp^ <> Delim) do
Inc(cp);
SetString(AResult, start, cp - start);
if cp^ = Delim then
Inc(cp);
FCurrPos := cp;
Result := True;
end;
Here's the full program I used for benchmarking.
Here are the results:
*** count=3, Length(src)=200
GetTok1: 595 ms
GetTok2: 547 ms
GetTok3: 2366 ms
GetTok4: 407 ms
GetTokBK: 226 ms
*** count=6, Length(src)=350
GetTok1: 1587 ms
GetTok2: 1502 ms
GetTok3: 6890 ms
GetTok4: 679 ms
GetTokBK: 334 ms
*** count=9, Length(src)=500
GetTok1: 3055 ms
GetTok2: 2912 ms
GetTok3: 13766 ms
GetTok4: 947 ms
GetTokBK: 446 ms
*** count=12, Length(src)=650
GetTok1: 4997 ms
GetTok2: 4803 ms
GetTok3: 23021 ms
GetTok4: 1213 ms
GetTokBK: 543 ms
*** count=15, Length(src)=800
GetTok1: 7417 ms
GetTok2: 7173 ms
GetTok3: 34644 ms
GetTok4: 1480 ms
GetTokBK: 653 ms
Depending on the characteristics of your data, whether the delimiter is likely to be a character or not, and how you work with it, different approaches may be faster.
(I made a mistake in my earlier program, I wasn't measuring the same operations for each style of routine. I updated the pastebin link and benchmark results.)
Your new function (the one with PChar) should declare "Delim" as Char and not as String. In your current implementation the compiler has to convert the PLine^ char into a string to compare it with "Delim". And that happens in a tight loop resulting is an enormous performance hit.
function GetTok(const Line: string; const Delim: Char{<<==}; const TokenNum: Byte): string;
{ LK Feb 12, 2007 - This function has been optimized as best as possible }
{ LK Nov 7, 2009 - Reoptimized using PChars instead of calls to Pos and PosEx }
{ See; http://stackoverflow.com/questions/1694001/is-there-a-fast-gettoken-routine-for-delphi }
var
I: integer;
PLine, PStart: PChar;
begin
PLine := PChar(Line);
PStart := PLine;
inc(PLine);
for I := 1 to TokenNum do begin
while (PLine^ <> #0) and (PLine^ <> Delim) do
inc(PLine);
if I = TokenNum then begin
SetString(Result, PStart, PLine - PStart);
break;
end;
if PLine^ = #0 then begin
Result := '';
break;
end;
inc(PLine);
PStart := PLine;
end;
end; { GetTok }
Delphi compiles to VERY efficient code; in my experience, it was very difficult to do better in assembler.
I think you should just point a PChar (they still exist, don't they? I parted ways with Delphi around 4.0) at the beginning of the string and increment it while counting "|"s, until you've found n-1 of them. I suspect that will be faster than calling PosEx repeatedly.
Take note of that position, then increment the pointer some more until you hit the next pipe. Pull out your substring. Done.
I'm only guessing, but I wouldn't be surprised if this was close to the quickest this problem can be solved.
EDIT: Here's what I had in mind. This code is, alas, uncompiled and untested, but it should demonstrate what I meant.
In particular, Delim is treated as a single char, which I believe makes a world of difference if that will fulfill the requirements, and the character at PLine is tested only once. Finally, there's no more comparison against TokenNum; I believe it's faster to decrement a counter to 0 for counting delimiters.
function GetTok(const Line: string; const Delim: string; const TokenNum: Byte): string;
var
Del: Char;
PLine, PStart: PChar;
Nth, I, P0, P9: Integer;
begin
Del := Delim[1];
Nth := TokenNum + 1;
P0 := 1;
P9 := Line.length + 1;
PLine := PChar(line);
for I := 1 to P9 do begin
if PLine^ = Del then begin
if Nth = 0 then begin
P9 := I;
break;
end;
Dec(Nth);
if Nth = 0 then P0 := I + 1
end;
Inc(PLine);
end;
if (Nth <= 1) or (TokenNum = 1) then
Result := Copy(Line, P0, P9 - P0);
else
Result := ''
end;
Using assembler would be a micro-optimization. There are much greater gains to be had by optimizing the algorithm. Not doing work beats doing work in the fastest possible way, every time.
One example would be if you have places in your program where you need several tokens of the same line. Another procedure that returns an array of tokens which you can then index into should be faster than calling your function more than once, especially if you let the procedure not return all tokens, but only as many as you need.
But in general I agree with Carl's answer (+1), using a PChar for scanning would probably be faster than your current code.
This is a function that I've had in my personal library for quite some time that I use extensively. I believe this is the most current version of it. I've had multiple versions in the past being optimized for a variety of different reasons. This one tries to take into account Quoted strings, but if that code is removed it makes the function a slight bit faster.
I actually have a number of other routines, CountSections and ParseSectionPOS being a couple of examples.
Unfortuately this routine is ansi/pchar based only. Although I don't think it would be difficult to move it to unicode. Maybe I've already done that...I'll have to check on that.
Note: This routine is 1 based in the ParseNum indexing.
function ParseSection(ParseLine: string; ParseNum: Integer; ParseSep: Char; QuotedStrChar:char = #0) : string;
var
wStart, wEnd : integer;
wIndex : integer;
wLen : integer;
wQuotedString : boolean;
begin
result := '';
wQuotedString := false;
if not (ParseLine = '') then
begin
wIndex := 1;
wStart := 1;
wEnd := 1;
wLen := Length(ParseLine);
while wEnd <= wLen do
begin
if (QuotedStrChar <> #0) and (ParseLine[wEnd] = QuotedStrChar) then
wQuotedString := not wQuotedString;
if not wQuotedString and (ParseLine[wEnd] = ParseSep) then
begin
if wIndex=ParseNum then
break
else
begin
inc(wIndex);
wStart := wEnd+1;
end;
end;
inc(wEnd);
end;
result := copy(ParseLine, wStart, wEnd-wStart);
if (length(result) > 0) and (QuotedStrChar <> #0) and (result[1] = QuotedStrChar) then
result := AnsiDequotedStr(result, QuotedStrChar);
end;
end; { ParseSection }
In your code, I think this is the only line that can be optimized:
Result := copy(Line, P+1, MaxInt)
If you calculate the new Length there, it might get a bit faster, but not the 10% you are looking for.
Your tokenizing algorithm seems pretty OK.
For optimizing it, I would run it through a profiler (like AQTime from AutomatedQA) with a representative subset of your production data. That will point you to the weakest spot.
The only RTL function that comes close is this one in the Classes unit:
procedure TStrings.SetDelimitedText(const Value: string);
It tokenizes, but uses both QuoteChar and Delimiter, but you only use a Delimiter.
It uses the SetString function in the System unit which is a pretty fast way to set the content of a string based on a PChar/PAnsiChar/PUnicodeChar and a length.
That might get you some improvement as well; on the other hand, Copy is really fast too.
I'm not the person always blaming the algorithm, but if I look at the first piece of source,
the problem is that for string N, you do the POS/posexes for string 1..n-1 again too.
This means for N items, you do sum (n, n-1,n-2...1) POSes (=+/- 0.5*N^2) , while only N are needed.
If you simply cache the position of the last found result, e.g. in a record that is passed by VAR parameter, you can gain a lot.
type
TLastPosition = record
elementnr : integer; // last tokennumber
elementpos: integer; // character index of last match
end;
and then something
if tokennum=(lastposition.elementnr+1) then
begin
newpos:=posex(delim,line,lastposition.elementpos);
end;
Unfortunately, I don't have the time now to write it out, but I hope you get the idea