Faster CompareText implementation for D2009 - delphi

I'm extensively using hash map data structures in my program. I'm using a hash map implementation by Barry Kelly posted on the Codegear forums. That implementation internally uses RTL's CompareText function. Profiling made me realize that A LOT of time is spent in SysUtils CompareText function.
I had a look at the Fastcode site and found some faster implementations of CompareText. Unfortunately, they do not seem to work with D2009 and its Unicode strings.
Now for the question: is there a similar, faster version that supports D2009 strings? The CompareText function seems to be called a lot when using hash maps (at least in the implementation I'm currently using), so even small performance improvements could really make a difference. Or should the implementations presented there also work for Unicode strings?

Many of the FastCode functions will probably compile and appear to work just fine in Delphi 2009, but they won't be right for all input. The ones implemented in assembler will fail because they assume characters are just one byte each. The ones implemented in Delphi will fare a little better, but they'll still return incorrect results sometimes because the old CompareText's notion of "case-insensitive" is based on ASCII, whereas the new one should be based on Unicode. The rules for which characters are considered equal apart from case are quite different between Unicode and ASCII.
Andreas says in a comment below that Unicode CompareText still uses the ASCII case-comparison rules, so a number of the FastCode functions should work fine. Just look them over before using them to make sure they're not making any character-size assumptions. I seem to recall that some FastCode functions were incorporated into the Delphi RTL already. I have no idea whether CompareText was one of them.
If you're calling CompareText a lot in a hash table, then that suggests your hash table isn't doing a very good job. CompareText should only get called when the hash of the thing you're searching for designated a non-empty bucket in the hash table. From there, a hash table will often use a linear search to find the right item in the bucket, and it will call CompareText for every item during that search. I don't know whether that's how the one you're using works.
You might solve this by using a different hash function that distributes its results more evenly over the available buckets. If your buckets are already evenly filled, then you may need more buckets (and then make sure the hash function still distributes evenly over that number as well).
If the hash-map class you're using is based on TBucketList, then there is room for improvement in the bucket storage. That class doesn't calculate a hash on the entire input. It uses the input only to determine the bucket to use. If the class would also keep track of the full hash computed for a string, then comparisons during the linear search could go much faster. Just compare the hashes, and only compare the strings when the hashes match completely. (For a 256-bucket bucket-list, the largest supported size, only one byte of the input determines the bucket, and the rest of the bytes are ignored.) I've written about TBucketList here before.
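As an illustration of that idea, here is a minimal sketch (not Barry Kelly's actual code; THashEntry, THashBucket and FindInBucket are made-up names) of caching the full hash with each entry so the bucket's linear search can skip most CompareText calls:

uses
  SysUtils; // for CompareText

type
  THashEntry = record
    FullHash: Cardinal;  // full hash of the key, cached when the entry is added
    Key: string;
    Value: Pointer;
  end;
  THashBucket = array of THashEntry;

// Linear search within one bucket: compare the cached hashes first and only
// fall back to the expensive case-insensitive string comparison on a match.
// For this to be consistent with CompareText, the hash itself must be computed
// case-insensitively (e.g. over an upper-cased copy of the key).
function FindInBucket(const Bucket: THashBucket; FullHash: Cardinal;
  const Key: string): Integer;
var
  I: Integer;
begin
  Result := -1;
  for I := 0 to High(Bucket) do
    if (Bucket[I].FullHash = FullHash) and
       (CompareText(Bucket[I].Key, Key) = 0) then
    begin
      Result := I;
      Exit;
    end;
end;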

Related

There seem to be a lot of Ruby methods that are very similar; how do I pick which one to use?

I'm relatively new to Ruby, so this is a pretty general question. I have found through the Ruby Docs page a lot of methods that seem to do the exact same thing or very similar. For example chars vs split(' ') and each vs map vs collect. Sometimes there are small differences and other times I see no difference at all.
My question here is how do I know which is best practice, or is it just personal preference? I'm sure this varies from instance to instance, so if I can learn some of the more important ones to be cognizant of I would really appreciate that because I would like to develop good habits early.
I am a bit confused by your specific examples:
map and collect are aliases. They don't "do the exact same thing", they are the exact same thing. They are just two names for the same method. You can use whatever name you wish, or what reads best in context, or what your team has decided as a Coding Standard. The Community seems to have settled on map.
each and map/collect are completely different; there is no similarity there, apart from the general fact that they both operate on collections. map transforms a collection by mapping every element to a new element using a transformation operation. It returns a new collection (an Array, actually) with the transformed elements. each performs a side effect for every element of the collection. Since it is only used for its side effect, the return value is irrelevant (it might just as well return nil, like Kernel#puts does; in languages like C, C++, Java, or C♯, it would return void), but it is specified to always return its receiver.
split splits a String into an Array of Strings based on a delimiter, which can be either a Regexp (in which case you can also influence whether or not the delimiter itself gets captured in the output or ignored) or a String, or nil (in which case the global default separator gets used). chars returns an Array with the individual characters (represented as Strings of length 1, since Ruby doesn't have a specific Character type). chars belongs together in a family with bytes and codepoints, which do the same thing for bytes and codepoints, respectively. split can only be used as a replacement for one member of this family (chars), and split is much more general than that.
So, in the examples you gave, there really isn't much similarity at all, and I cannot imagine any situation where it would be unclear which one to choose.
In general, you have a problem and you look for the method (or combination of methods) that solve it. You don't look at a bunch of methods and look for the problem they solve.
There'll typically be only one method that fits a specific problem. Larger problems can be broken down into different subproblems in different ways, so it is indeed possible that you may end up with different combinations of methods to solve the same larger problem, but for each individual subproblem, there will generally be only one applicable method.
When documentation states that two methods do the same thing, it's just a matter of preference. To learn the details, you should always start with the Ruby API documentation.

What is the difference between Delphi string comparison functions?

There's a bunch of ways you can compare strings in modern Delphi (say 2010-XE3):
'<=' operator which resolves to UStrCmp / LStrCmp
CompareStr
AnsiCompareStr
Can someone give (or point to) a description of what those methods do, in principle?
So far I've figured that AnsiCompareStr calls CompareString on Windows, which is a "textual" comparison (i.e. takes into account unicode combined characters etc). Simple CompareStr does not do that and seems to do a binary comparison instead.
But what is the difference between CompareStr and UStrCmp? Between UStrCmp and LStrCmp? Do they all produce identical results? Do those results change between versions of Delphi?
I'm asking because I need a comparison which will always produce the same results, so that indexes in an app built with one version of Delphi remain consistent with code built with another.
AnsiCompareStr is specified as taking locale into account, and should return identical results regardless of Delphi version, but may return different results based on Windows version and/or settings. CompareStr is a pure binary comparison: "The comparison operation is based on the 16-bit ordinal value of each character and is not affected by the current locale" (for the CompareStr(const S1, S2: string) overload). UStrCmp also uses a pure binary comparison: "Strings are compared according to the ordinal values that make up the characters that make up the string." So there should not be a difference between the latter two. The way they return the result is different, so two implementations are needed (although it would be possible to make one rely on the other).
As for the differences between LStrCmp and UStrCmp, LStrCmp takes AnsiStrings, UStrCmp takes UnicodeStrings. It's entirely possible that two characters (let's say A and B) are ordered in the misnamed "ANSI" code page as A < B, but are ordered in Unicode as A > B. You should almost always just use the comparison appropriate for the data you have.
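A minimal console demo of the difference (the exact AnsiCompareStr result depends on the active locale):

program CompareDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  A, B: string;
begin
  A := 'apple';
  B := 'Banana';
  // CompareStr: pure ordinal comparison of the UTF-16 code units.
  // 'a' (#97) > 'B' (#66), so the result is positive.
  Writeln(CompareStr(A, B));
  // AnsiCompareStr: locale-aware comparison via the OS.
  // Linguistically 'apple' sorts before 'Banana' in most locales, so negative.
  Writeln(AnsiCompareStr(A, B));
end.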

String indexing vs. dynamic array indexing in Delphi

In Delphi, why are AnsiStrings indexed from one and dynamic arrays indexed from zero? Is this a historical accident, to make AnsiStrings work more like ShortStrings, or is there some deeper logic at work?
One of the contributing factors that led to "Pascal" strings being 1 indexed instead of 0 indexed was that the length of the string was stored in the zeroth byte. Yes, that could have been hidden from the programmer's view by having the compiler internally add a constant offset to the string index expression (as was done in Delphi's long strings later) but in the beginning things were much simpler. Allocate a block of memory, store the length in byte zero, index the char data from byte 1. End of story.
As I recall UCSD Pascal was using this length-in-zero-byte convention long before Turbo Pascal came along.
As for why dynamic arrays are zero based, I don't recall any specific reason but I would guess it reflects the dynamic array's kinship to dynamically allocating a buffer and indexing off the buffer pointer. The array types that you would use to create array pointer types were zero based arrays. The first byte is found at buffer pointer + 0 offset. This is the C rationalization for zero based everything. There was no compelling reason to carry string's 1 based indexing pattern over to compiler managed arrays when string's 1 based indexing was already (and had always been) the exception rather than the norm.
It may well be that because the string type was the first array-like data type that everyone first encountered and possibly the most used data type across the board, there may be a perception of a bias towards 1 based indexing in the language. However, if you look closely I think you'll find arrays in Pascal (distinct from string) have never been inherently 1 based, especially when dynamically allocated.
The reason for the Delphi string tradition of 1-based strings is quite simple. The tradition comes from the implementation of old style Turbo Pascal strings. That data type stored the length of the string in the first byte of the variable, index 0. The string data began in the next byte, index 1.
You can still use that data type today. It's now called ShortString. As is immediately obvious from its implementation, there is a 255 character limit. This limit led to the introduction of huge strings, if I recall correctly, in Delphi 2. When huge strings were introduced, the language designers chose to retain 1-based indexing to make it easier for developers to switch from short strings to huge strings.
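A quick illustration of that layout, which still holds for ShortString today:

program ShortStringDemo;
{$APPTYPE CONSOLE}
var
  S: ShortString;
begin
  S := 'Hello';
  Writeln(Ord(S[0])); // byte 0 holds the length: prints 5
  Writeln(S[1]);      // character data starts at index 1: prints 'H'
end.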
I guess Turbo Pascal didn't invent the idea of using element 0 for length. It's just that I'm too young to remember what came before then!
Dynamic arrays weren't bound by the past in the same way and had a free choice. I don't know why zero-based was chosen. Perhaps because it fits more easily with the prevailing fashion on the platform on which Delphi existed at that time, namely Windows. That's just a guess though. Danny Thorpe worked on the Delphi compiler at that time, and even he can't remember the rationale!
The Delphi language designers are currently moving towards zero based string indexing for huge strings. The initial steps in this direction can be seen in XE3 in the TStringHelper class which uses 0-based indexing. And also in the ZEROBASEDSTRINGS conditional which allows you to opt in to 0-based indexing. Expect the next generation Delphi compiler to use 0-based indexing only. The times they are changin'.
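For example, in XE3 with the desktop compiler's default settings (ZEROBASEDSTRINGS off), the two indexing schemes coexist:

program IndexingDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils; // brings in the TStringHelper record helper
var
  S: string;
begin
  S := 'Delphi';
  Writeln(S[1]);       // classic 1-based indexing: 'D'
  Writeln(S.Chars[0]); // TStringHelper.Chars is always 0-based: 'D'
end.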
Historical accident.
Pascal strings and arrays traditionally start at 1.
C - and perhaps consequently AnsiStrings - start at 0.
I don't know the rationale for "breaking with Pascal tradition" for Dynamic arrays, which also start at zero. But it makes sense, and I agree with it ...
IMHO...

How do I compare Unicode strings containing non-English characters to sort alphabetically?

I am trying to sort arrays/lists/whatever of data based upon the Unicode string values in them, which contain non-English characters, and I want them sorted correctly alphabetically.
I have written a lot of code (D2010, Win XP) which I thought was pretty solid for future internationalisation, but it is not. It all uses the UnicodeString (string) data type, and up until now I have just been putting English characters into those Unicode strings.
It seems I have to own up to making a very serious Unicode mistake. I talked to my German friend and tried out some German ß's (ß is 'ss' and should come after S and before T in the alphabet) and some ö's etc. (note the umlaut), and none of my sorting algorithms work anymore. The results are very mixed up. Garbage.
Since then I have been reading up extensively and have learnt a lot of unpleasant things with regard to Unicode collation. Things are looking grim, much grimmer than I ever expected; I have seriously messed this up. I hope I am missing something and things are not actually quite as grim as they appear at present. I have been tinkering around with Windows API calls (RtlCompareUnicodeString) with no success (protection faults); I could not get it to work. The problem with API calls, I learnt, is that they change on the various newer Windows platforms, and with Delphi going cross-platform soon, with Linux later, and my app being client-server, I need to be concerned about this. But tbh, with the situation being what it is (bad), I would be grateful for any forward progress, i.e. Win API specific.
Is using the Win API function RtlCompareUnicodeString the obvious solution? If so I should really try again with that, but tbh I have been taken aback by all of the issues involved with Unicode collation, and I'm not clear at all what I should be doing to compare these strings this way anyway.
I learnt of the IBM ICU C++ open-source project; there is a Delphi wrapper for it, albeit for an older version of ICU. It seems a very comprehensive solution which is platform-independent. Surely I cannot be looking at creating a Delphi wrapper for this (or updating the existing one) to get a good solution for Unicode collation?
I would be extremely glad to hear advice at two levels :-
A) A Windows-specific, non-portable solution. I would be glad of that at the moment, forget the client-server ramifications!
B) A more portable solution which is immune to the various XP/Vista/Win7 variations of the Unicode API functions, therefore putting me in good stead for XE2 Mac support and future Linux support, not to mention the client-server complications.
Btw I don't really want to be doing 'make-do' solutions, scanning strings prior to comparison and replacing certain tricky characters etc., which I have read about. I gave the German example above, but that's just an example; I want to get it working for all (or at least most: Far East, Russian) languages. I don't want to do workarounds for a specific language or two. I also do not need any advice on the sorting algorithms; they are fine. It's just the string comparison bit that's wrong.
I hope I am missing/doing something stupid, this all looks to be a headache.
Thank you.
EDIT, Rudy, here is how I was trying to call RtlCompareUnicodeString. Sorry for the delay I have been having a horrible time with this.
program Project26;
{$APPTYPE CONSOLE}
uses
SysUtils;
var
a,b:ansistring;
k,l:string;
x,y:widestring;
r:integer;
procedure RtlInitUnicodeString(
  DestinationString: pstring;
  SourceString: pwidechar); stdcall; external 'NTDLL';
function RtlCompareUnicodeString(
  String1: pstring;
  String2: pstring;
  CaseInSensitive: boolean
): integer; stdcall; external 'NTDLL';
begin
x:='wef';
y:='fsd';
RtlInitUnicodeString(@k, pwidechar(x));
RtlInitUnicodeString(@l, pwidechar(y));
r:=RtlCompareUnicodeString(@k,@l,false);
writeln(r);
readln;
end.
I realise this is most likely wrong; I am not used to calling API functions directly, this is my best guess.
About your CompareStringEx API function: that looked really good, but it's available on Vista+ only, and I'm using XP. CompareString is on XP, but that's not Unicode!
To recap, the basic task afoot is to compare two strings, and to do so based on the character sort order specified in the current Windows locale.
Can anyone say for sure whether AnsiCompareText should do this or not? It doesn't work for me, but others have said it should, and other things I have read suggest it should.
This is what I get with 31 test strings when using AnsiCompareText in a German locale (space delimited; no strings contain spaces):
arß Asß asß aßs no nö ö ön oo öö oöo öoö öp pö ss SS ßaß ßbß sß Sßa
Sßb ßß ssss SSSS ßßß ssßß SSßß ßz ßzß z zzz
EDIT 2.
I am still keen to hear whether I should expect AnsiCompareText to work using the locale info, as lkessler has said it should, and lkessler has also posted about these subjects before and seems to have been through this before.
However, following on from Rudy's advice I have also been checking out CompareStringW - which shares the same documentation with CompareString, so it is NOT non-Unicode as I stated earlier.
Even if AnsiCompareText is not going to work (although I think it should), the Win32 API function CompareStringW should indeed work. Now I have defined my API function, and I can call it, and I get a result and no error... but I get the same result every time regardless of the input strings! It returns 1 every time, which means 'less than'. Here's my code:
var
  k, l: string;
  r: integer;
function CompareStringW(
  Locale: integer;
  dwCmpFlags: longword;
  lpString1: pstring;
  cchCount1: integer;
  lpString2: pstring;
  cchCount2: integer
): integer; stdcall; external 'Kernel32.dll';
begin;
k:='zzz';
l:='xxx';
writeln(length(k));
r:=comparestringw(LOCALE_USER_DEFAULT,0,@k,3,@l,3);
writeln(r); // result is 1=less than, 2=equal, 3=greater than
readln;
end;
I feel I am getting somewhere now after much pain. Would be glad to know about AnsiCompareText, and what I am doing wrong with the above CompareStringW api call. Thank you.
EDIT 3
Firstly, I fixed the API call to CompareStringW myself; I was passing in @mystring when I should have been passing PString(mystring). Now it all works correctly.
r:=comparestringw(LOCALE_USER_DEFAULT,0,pstring(k),-1,pstring(l),-1);
Now, you can imagine my dismay when I still got the same sort result as I did right at the beginning...
arß asß aßs Asß no nö ö ön oo öö oöo öoö öp pö ss SS ßaß ßbß sß Sßa
Sßb ßß ssss SSSS ßßß ssßß SSßß ßz ßzß z zzz
You may also imagine my EXTREME dismay, not to mention simultaneous joy, when I realised the sort order IS CORRECT, and IT WAS CORRECT RIGHT BACK IN THE BEGINNING! It makes me sick to say it, but there was never any problem in the first place - this is all down to my lack of German knowledge. I believed the sort was wrong, since you can see above that strings start with S, then later they start with ß, then s again, and back to ß and so on. Well, I can't speak German, but I could still 'clearly' see that they were not sorted correctly - my German friend told me ß comes after S and before T... I WAS WRONG! What is happening is that the string functions (both AnsiCompareText and the WinAPI CompareStringW) are treating every 'ß' as 'ss', and every 'ö' as a plain 'o'... so if I take those results above and do a search and replace as described, I get...
arss asss asss Asss no no o on oo oo ooo ooo op po ss SS ssass ssbss
sss Sssa Sssb ssss ssss SSSS ssssss ssssss SSssss ssz sszss z zzz
Looks pretty correct to me! And it always was.
I am extremely grateful for all the advice given, and extremely sorry to have wasted your time like this. Those German ß's got me all confused; there was never anything wrong with the built-in Delphi function or anything else. It just looked like there was. I made the mistake of combining them with a normal 's' in my test data; any other letter would not have created this illusion of un-sortedness! The squiggly ß's have made me look a fool! ßs!
Rudy and lkessler were both especially helpful, ty both. I have to accept lkessler's answer as most correct; sorry Rudy.
You said you had problems making Windows API calls yourself. Could you post the code, so people here can see why it failed? It is not as hard as it may seem, but it does require some care. ISTM that RtlCompareUnicodeString() is too low-level.
I found a few solutions:
Non-portable
You could use the Windows API function CompareStringEx. This will compare using Unicode specific collation types. You can specify how you want this done (see link). It does require wide strings, i.e. PWideChar pointers to them. If you have problems calling it, give a holler and I'll try to add some demo code.
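For what it's worth, here is a rough, untested sketch of how CompareStringEx could be declared and called from Delphi; the import declaration and the LocaleCompare wrapper below are my own names (CompareStringEx is not declared in the D2010 Windows unit, and it needs Vista or later):

uses
  Windows;

function CompareStringEx(lpLocaleName: PWideChar; dwCmpFlags: DWORD;
  lpString1: PWideChar; cchCount1: Integer;
  lpString2: PWideChar; cchCount2: Integer;
  lpVersionInformation, lpReserved: Pointer;
  lParam: LPARAM): Integer; stdcall; external kernel32 name 'CompareStringEx';

// Locale-aware comparison with the usual negative/zero/positive convention.
function LocaleCompare(const S1, S2: string): Integer;
begin
  // nil locale name means LOCALE_NAME_USER_DEFAULT; the API returns
  // CSTR_LESS_THAN (1), CSTR_EQUAL (2) or CSTR_GREATER_THAN (3).
  Result := CompareStringEx(nil, 0, PWideChar(S1), Length(S1),
    PWideChar(S2), Length(S2), nil, nil, 0) - CSTR_EQUAL;
end;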
More or less portable
To make this more or less portable, you could write a function that compares two strings and use conditional defines to choose the different comparison APIs for the platform.
Try using CompareStr for case sensitive, or CompareText for case insensitive if you want your sorts exactly the same in any locale.
And use AnsiCompareStr for case sensitive, or AnsiCompareText for case insensitive if you want your sorts to be specific to the locale of the user.
See: How can I get TStringList to sort differently in Delphi for a lot more information on this.
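As an illustration (a minimal sketch, not taken from the linked answer), either behaviour plugs straight into TStringList.CustomSort; LocaleSortCompare is a made-up name:

uses
  Classes, SysUtils;

// Locale-specific, case-insensitive ordering.
function LocaleSortCompare(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := AnsiCompareText(List[Index1], List[Index2]);
end;

// Usage:
//   MyList.CustomSort(LocaleSortCompare);
// Substitute CompareText (or CompareStr) in the body if you need results
// that never vary with the user's locale.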
In Unicode, the numeric order of the characters is certainly not the sorting sequence. AnsiCompareText, as mentioned by HeartWare, does take locale specifics into consideration when comparing characters but, as you found out, does nothing wrt the sorting order. What you are looking for is called the collation sequence of a language, which specifies the alphabetic sorting order for a language, taking diacritics etc. into consideration. These were sort of implied in the old Ansi code pages, though those didn't account for sorting differences between languages using the same character set either.
I checked the D2010 docs. Apart from some TIB* components I didn't find any links. C++Builder does seem to have a compare function that takes collation into account, but that's not much use in Delphi. There you will probably have to use some Windows API functions directly.
Docs:
Sorting collate all out: http://www.siao2.com/2008/12/06/9181413.aspx
Collation terminology: http://msdn.microsoft.com/en-us/library/ms143726(SQL.90).aspx (though that pertains to MS SQL 2005, it may be helpful)
The 'Sorting "Collate" all out' article is by Michael Kaplan, someone who has great in-depth knowledge of all things Unicode and all intricacies of various languages. His blog has been invaluable to me when porting from D2006 to D2009.
Have you tried AnsiCompareText? Even though it is called "Ansi", I believe it calls on to an OS-specific Unicode-capable comparison routine...
It should also make you safe from cross-platform dependencies (provided that Embarcadero supplies a compatible version in the various OS's they target).
I do not know how well the comparison works with the various strange ways Unicode can encode strings, but try it out and let us know the result...

Delphi TStringList wrapper to implement on-the-fly compression

I have an application that stores many strings in a TStringList. The strings will be largely similar to one another, and it occurs to me that one could compress them on the fly - i.e. store a given string as a mixture of unique text fragments plus references to previously stored fragments. String lists such as lists of fully qualified paths and filenames should compress very well.
Does anyone know of a TStringList descendant that implements this - i.e. provides read and write access to the uncompressed strings but stores them internally compressed, so that a TStringList.SaveToFile produces a compressed file?
While you could implement this by uncompressing the entire stringlist before each access and re-compressing it afterwards, it would be unnecessarily slow. I'm after something that is efficient for incremental operations and random "seeks" and reads.
TIA
Ross
I don't think there's any freely available implementation around for this (not that I know of anyway, although I've written at least 3 similar constructs in commercial code), so you'd have to roll your own.
The remark Marcelo made about adding items in order is very relevant, as I suppose you'll probably want to compress the data at addition time - having quick access to entries already similar to the one being added gives much better performance than having to look up a 'best fit' entry (needed for similarity compression) over the entire set.
Another thing you might want to read up about, are 'ropes' - a conceptually different type than strings, which I already suggested to Marco Cantu a while back. At the cost of a next-pointer per 'twine' (for lack of a better word) you can concatenate parts of a string without keeping any duplicate data around. The main problem is how to retrieve the parts that can be combined into a new 'rope', representing your original string. Once that problem is solved, you can reconstruct the data as a string at any time, while still having compact storage.
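To make the idea concrete, a rope in this sense is just a chain of shared fragments; TTwine and RopeToString below are made-up names for a minimal sketch:

type
  PTwine = ^TTwine;
  TTwine = record
    Fragment: string; // a piece of text that may be shared by many ropes
    Next: PTwine;     // next piece of this rope, nil at the end
  end;

// Reconstruct the original string by walking the chain of twines.
function RopeToString(Rope: PTwine): string;
begin
  Result := '';
  while Rope <> nil do
  begin
    Result := Result + Rope^.Fragment;
    Rope := Rope^.Next;
  end;
end;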
If you don't want to go the 'rope' route, you could also try something called 'prefix reduction', which is a simple form of compression - just start each string with an index of a previous string and the number of characters that should be treated as a prefix for the new string. Be aware that you should not recurse this too far back, or access speed will suffer greatly. In one simple implementation, I did a mod 16 on the index to establish the entry at which prefix reduction started, which gave me on average about 40% memory savings (this number is completely data-dependent, of course).
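A minimal sketch of such a prefix-reduced entry (made-up names; the mod-16 restart trick is omitted for brevity):

type
  TPrefixEntry = record
    BaseIndex: Integer; // index of an earlier entry, or -1 if stored in full
    PrefixLen: Integer; // number of leading characters shared with BaseIndex
    Tail: string;       // the part that differs
  end;

// Expand an entry back into the full string by following the prefix chain.
function Expand(const Entries: array of TPrefixEntry; Index: Integer): string;
begin
  if Entries[Index].BaseIndex < 0 then
    Result := Entries[Index].Tail
  else
    Result := Copy(Expand(Entries, Entries[Index].BaseIndex), 1,
      Entries[Index].PrefixLen) + Entries[Index].Tail;
end;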
You could try to wrap a Delphi or COM API around Judy arrays. The JudySL type would do the trick, and has a fairly simple interface.
EDIT: I assume you are storing unique strings and want to (or are happy to) store them in lexicographical order. If these constraints aren't acceptable, then Judy arrays are not for you. Mind you, any compression system will suffer if you don't sort your strings.
I suppose you expect general flexibility from the list (including the delete operation); in that case I don't know of any out-of-the-box solution, but I'd suggest one of two approaches:
You split your strings into words and keep a separate, growing dictionary to reference the words, storing each string internally as a list of indexes.
You implement something based on the zlib stream support available in Delphi, but operating on blocks that contain, for example, 10-100 strings. In this case you still have to decompress/recompress the complete block, but the "price" you pay is lower.
I don't think you really want to compress TStrings items in memory, because it's terribly inefficient. I suggest you look at the TStream implementations in the ZLib unit. Just wrap a regular stream in a TDecompressionStream on load and a TCompressionStream on save (you can even emit a gzip header there).
Hint: you will want to override LoadFromStream/SaveToStream instead of LoadFromFile/SaveToFile.
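A minimal sketch of that approach, assuming the standard ZLib unit (TCompressedStringList is a made-up name, and the overrides are illustrative rather than production code):

uses
  Classes, ZLib;

type
  // Strings stay uncompressed in memory; only SaveToFile/SaveToStream and
  // LoadFromFile/LoadFromStream read and write zlib-compressed data.
  TCompressedStringList = class(TStringList)
  public
    procedure LoadFromStream(Stream: TStream); override;
    procedure SaveToStream(Stream: TStream); override;
  end;

procedure TCompressedStringList.SaveToStream(Stream: TStream);
var
  Comp: TCompressionStream;
begin
  Comp := TCompressionStream.Create(clDefault, Stream);
  try
    inherited SaveToStream(Comp);
  finally
    Comp.Free; // freeing flushes the last compressed block to Stream
  end;
end;

procedure TCompressedStringList.LoadFromStream(Stream: TStream);
var
  Decomp: TDecompressionStream;
  Plain: TMemoryStream;
  Buffer: array[0..8191] of Byte;
  BytesRead: Integer;
begin
  // TStrings.LoadFromStream wants to know the stream size, which a
  // TDecompressionStream cannot report, so inflate into memory first.
  Decomp := TDecompressionStream.Create(Stream);
  try
    Plain := TMemoryStream.Create;
    try
      repeat
        BytesRead := Decomp.Read(Buffer, SizeOf(Buffer));
        if BytesRead > 0 then
          Plain.WriteBuffer(Buffer, BytesRead);
      until BytesRead < SizeOf(Buffer);
      Plain.Position := 0;
      inherited LoadFromStream(Plain);
    finally
      Plain.Free;
    end;
  finally
    Decomp.Free;
  end;
end;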
