What is the difference between Delphi string comparison functions? - delphi

There's a bunch of ways you can compare strings in modern Delphi (say 2010-XE3):
'<=' operator which resolves to UStrCmp / LStrCmp
CompareStr
AnsiCompareStr
Can someone give (or point to) a description of what those methods do, in principle?
So far I've figured out that AnsiCompareStr calls CompareString on Windows, which is a "textual" comparison (i.e. it takes into account Unicode combining characters etc.). Plain CompareStr does not do that and appears to do a binary comparison instead.
But what is the difference between CompareStr and UStrCmp? Between UStrCmp and LStrCmp? Do they all produce identical results? Do those results change between versions of Delphi?
I'm asking because I need a comparison which will always produce the same results, so that indexes in an app built with one version of Delphi remain consistent with code built with another.

AnsiCompareStr is specified as taking locale into account, and should return identical results regardless of Delphi version, but may return different results depending on Windows version and/or settings. CompareStr is a pure binary comparison: "The comparison operation is based on the 16-bit ordinal value of each character and is not affected by the current locale" (for the CompareStr(const S1, S2: string) overload). UStrCmp also uses a pure binary comparison: "Strings are compared according to the ordinal values that make up the characters that make up the string." So there should be no difference between the latter two. They return their results in different ways, which is why two implementations are needed (although it would be possible to make one rely on the other).
As for the differences between LStrCmp and UStrCmp, LStrCmp takes AnsiStrings, UStrCmp takes UnicodeStrings. It's entirely possible that two characters (let's say A and B) are ordered in the misnamed "ANSI" code page as A < B, but are ordered in Unicode as A > B. You should almost always just use the comparison appropriate for the data you have.
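A minimal console sketch of the difference (the Ansi* result depends on the active Windows locale, so treat the second value as typical rather than guaranteed):

    program CompareDemo;
    {$APPTYPE CONSOLE}
    uses
      SysUtils;
    var
      S1, S2: string;
    begin
      S1 := 'a';
      S2 := 'A';
      // Binary comparison: Ord('a') = 97 > Ord('A') = 65, so the result is positive.
      Writeln(CompareStr(S1, S2));
      // Locale-aware comparison: most locales order 'a' before 'A', so the result
      // is typically negative (it can vary with Windows version and settings).
      Writeln(AnsiCompareStr(S1, S2));
    end.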

Related

String indexing vs. dynamic array indexing in Delphi

In Delphi, why are AnsiStrings indexed from one and dynamic arrays indexed from zero? Is this a historical accident, to make AnsiStrings work more like ShortStrings, or is there some deeper logic at work?
One of the contributing factors that led to "Pascal" strings being 1 indexed instead of 0 indexed was that the length of the string was stored in the zeroth byte. Yes, that could have been hidden from the programmer's view by having the compiler internally add a constant offset to the string index expression (as was done in Delphi's long strings later) but in the beginning things were much simpler. Allocate a block of memory, store the length in byte zero, index the char data from byte 1. End of story.
As I recall UCSD Pascal was using this length-in-zero-byte convention long before Turbo Pascal came along.
As for why dynamic arrays are zero based, I don't recall any specific reason but I would guess it reflects the dynamic array's kinship to dynamically allocating a buffer and indexing off the buffer pointer. The array types that you would use to create array pointer types were zero based arrays. The first byte is found at buffer pointer + 0 offset. This is the C rationalization for zero based everything. There was no compelling reason to carry string's 1 based indexing pattern over to compiler managed arrays when string's 1 based indexing was already (and had always been) the exception rather than the norm.
It may well be that because the string type was the first array-like data type that everyone first encountered and possibly the most used data type across the board, there may be a perception of a bias towards 1 based indexing in the language. However, if you look closely I think you'll find arrays in Pascal (distinct from string) have never been inherently 1 based, especially when dynamically allocated.
The reason for the Delphi string tradition of 1-based strings is quite simple. The tradition comes from the implementation of old style Turbo Pascal strings. That data type stored the length of the string in the first byte of the variable, index 0. The string data began in the next byte, index 1.
You can still use that data type today. It's now called ShortString. As is immediately obvious from its implementation, there is a 255 character limit. This limit led to the introduction of huge strings, if I recall correctly, in Delphi 2. When huge strings were introduced the language designers chose to retain 1-based indexing to make it easier for developers to switch from short strings to huge strings.
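A short illustration of that layout (the length byte is addressable as element 0):

    program ShortStringDemo;
    {$APPTYPE CONSOLE}
    var
      S: ShortString;
    begin
      S := 'Hello';
      Writeln(Ord(S[0]));  // 5: the length byte lives at index 0
      Writeln(S[1]);       // 'H': character data starts at index 1
      Writeln(Length(S));  // 5: Length reads that same byte
    end.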
I guess Turbo Pascal didn't invent the idea of using element 0 for length. It's just that I'm too young to remember what came before then!
Dynamic arrays weren't bound by the past in the same way and had a free choice. I don't know why zero-based was chosen. Perhaps because it fits more easily with the prevailing fashion on the platform on which Delphi existed at that time, namely Windows. That's just a guess though. Danny Thorpe worked on the Delphi compiler at that time, and even he can't remember the rationale!
The Delphi language designers are currently moving towards zero based string indexing for huge strings. The initial steps in this direction can be seen in XE3 in the TStringHelper class which uses 0-based indexing. And also in the ZEROBASEDSTRINGS conditional which allows you to opt in to 0-based indexing. Expect the next generation Delphi compiler to use 0-based indexing only. The times they are changin'.
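In XE3 the two indexing schemes are visible side by side; a minimal sketch:

    program IndexingDemo;
    {$APPTYPE CONSOLE}
    uses
      SysUtils;  // TStringHelper lives in System.SysUtils in XE3
    var
      S: string;
    begin
      S := 'Delphi';
      Writeln(S[1]);        // 'D': classic 1-based indexing (the default)
      Writeln(S.Chars[0]);  // 'D': TStringHelper.Chars is 0-based
    end.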
Historical accident.
Pascal strings and arrays traditionally start at 1.
C arrays start at 0 - and Delphi's dynamic arrays perhaps follow from that.
I don't know the rationale for "breaking with Pascal tradition" for dynamic arrays, but it makes sense, and I agree with it ...
IMHO...

StrLComp vs AnsiStrLComp when called with Unicode strings

I'm having a bit of confusion regarding the "Ansi" vs "regular" RTL string functions when called with Unicode strings. I understand that under older versions of Delphi (when AnsiString was the default) the "Ansi" versions handled multibyte characters. Does this mean anything when dealing with Unicode strings? Assuming that I need to handle Korean characters and that my code does not have to be compatible with older Delphi versions, which RTL functions should be used?
The 'Ansi' prefix of the string compare functions really never signified anything other than that the locale was taken into account when comparing strings instead of doing "just" a simple binary comparison. In the Unicode world this is still the case. The Ansi* family of functions also take (Unicode) strings as their parameters and take the locale into account when doing the comparison.
From the AnsiCompareStr doc (D2009):
Most locales consider lowercase characters to be less than the corresponding uppercase characters. This is in contrast to ASCII order, in which lowercase characters are greater than uppercase characters. Thus, setting S1 to 'a' and S2 to 'A' causes AnsiCompareStr to return a value less than zero, while CompareStr, with the same arguments, returns a value greater than zero.
What "taking the locale into account" actually does differs per locale. It may or may not involve accented characters. In Unicode versions it may also take into account how the characters are composed. For example, an accented e (é) may be encoded as a single character, but may also be encoded as two separate items: the accent and the e.
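A sketch of that composed/decomposed case; whether the locale-aware call reports equality depends on how the underlying Windows CompareString treats canonical equivalence, so the second result is an assumption rather than a guarantee:

    program ComposedDemo;
    {$APPTYPE CONSOLE}
    uses
      SysUtils;
    var
      Precomposed, Decomposed: string;
    begin
      Precomposed := #$00E9;        // 'é' as a single precomposed code point (U+00E9)
      Decomposed := 'e' + #$0301;   // 'e' followed by a combining acute accent (U+0301)
      Writeln(CompareStr(Precomposed, Decomposed));     // nonzero: different ordinal values
      Writeln(AnsiCompareStr(Precomposed, Decomposed)); // may be 0: the two forms can compare as equivalent
    end.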
Both the Ansi* and the "normal" string compare functions are included in the SysUtils unit. They all take strings as their parameters and in Unicode Delphi that does indeed mean UnicodeStrings.
If you need to work with AnsiStrings then you need to use the AnsiStrings unit. It has the same set of string compare functions, but in this unit they all take AnsiStrings as their parameters.
Now, if you don't need compatibility with older versions: use the standard functions from SysUtils. Use the normal ones if a plain binary comparison is enough. Use the Ansi ones if you need to take locale considerations into account.
Not sure what exactly you want to do, but...
if you want to compare two strings by your current user locale rules, use AnsiStrLComp for case sensitive comparison or AnsiStrLIComp for case insensitive comparison. Internally these functions use the CompareString function with the LOCALE_USER_DEFAULT locale set
if you want to compare two strings using the Delphi internal comparison mechanism, use the StrLComp function for case sensitive comparison or StrLIComp for case insensitive comparison
So if you compare the same two strings with AnsiStrLComp or AnsiStrLIComp on machines with different user locale settings, you may get different results, but on the other hand you get natural sorting for the user's language settings in your application.
StrLComp and StrLIComp will work the same way on all machines, independent of locale.
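A minimal sketch contrasting the two families (the Ansi* result depends on the active user locale):

    program LCompDemo;
    {$APPTYPE CONSOLE}
    uses
      SysUtils;
    var
      A, B: string;
    begin
      A := 'apple';
      B := 'Apple';
      // Ordinal comparison: 'a' (97) > 'A' (65), so the result is positive.
      Writeln(StrLComp(PChar(A), PChar(B), 5));
      // Locale-aware comparison: most locales order 'a' before 'A', so the
      // result is typically negative.
      Writeln(AnsiStrLComp(PChar(A), PChar(B), 5));
    end.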
The simple answer is that when it comes to Delphi string routines you should use the ANSI...() functions for Unicode strings.
However, if you are comparing strings (among other things) then you may also need to consider normalising those strings first, depending on the nature, needs, and sources of the strings in your application, to deal with Unicode equivalence.

AnsiStrIComp fails comparing strings in Delphi 2010

I'm slightly confused and hoping for enlightenment.
I'm using Delphi 2010 for this project and I'm trying to compare 2 strings.
Using the code below fails
if AnsiStrIComp(PAnsiChar(sCatName), PAnsiChar(CatNode.CatName)) = 0 then...
because according to the debugger only the first character of each string is being compared (i.e. if sCatName is "Automobiles", PAnsiChar(sCatName) is "A").
I want to be able to compare strings that may be in different languages, for example English vs Japanese.
In this case I am looking for a match, but I have other functions used for sorting, etc. where I need to know how the strings compare (less than, equal, greater than).
I assume that sCatName and CatNode.CatName are defined as strings (= UnicodeStrings)? They should be.
There is no need to convert the strings to null-terminated strings! You (mostly) only need to do that when working with the Windows API.
If you want to test equality of two strings, use SameStr(S1, S2) (case sensitive matching) or SameText(S1, S2) (case insensitive matching), or simply S1 = S2 in the first case. All three options return True or False, depending on whether the strings are equal.
If you want to get a numerical value based on the ordinal values of the characters (as in sorting), then use CompareStr(S1, S2) or CompareText(S1, S2). These return a negative integer, zero, or a positive integer.
(You might want to use the Ansi- functions: AnsiSameStr, AnsiSameText, AnsiCompareStr, and AnsiCompareText; these functions will use the current locale. The non Ansi- functions will accept a third, optional parameter, explicitly specifying the locale to use.)
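A quick illustration using the category-name strings from the question (the values are chosen just for this example):

    program SameDemo;
    {$APPTYPE CONSOLE}
    uses
      SysUtils;
    var
      sCatName, sOther: string;
    begin
      sCatName := 'Automobiles';
      sOther := 'AUTOMOBILES';
      Writeln(SameStr(sCatName, sOther));     // FALSE: case sensitive
      Writeln(SameText(sCatName, sOther));    // TRUE: case insensitive
      Writeln(CompareText(sCatName, sOther)); // 0: equal when case is ignored
    end.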
Update
Please read Remy Lebeau's comments regarding the cause of the problem.
What about a simple sCatName = CatNode.CatName? If they are strings, it should work.

Faster CompareText implementation for D2009

I'm extensively using hash map data structures in my program. I'm using a hash map implementation by Barry Kelly posted on the CodeGear forums. That implementation internally uses the RTL's CompareText function. Profiling made me realize that A LOT of time is spent in the SysUtils CompareText function.
I had a look at the Fastcode site and found some faster implementations of CompareText. Unfortunately they seem not to work for D2009 and its Unicode strings.
Now for the question: is there a similar faster version that supports D2009 strings? The CompareText function seems to be called a lot when using hash maps (at least in the implementation I'm currently using), so even small performance improvements could really make a difference. Or should the implementations presented there also work for Unicode strings?
Many of the FastCode functions will probably compile and appear to work just fine in Delphi 2009, but they won't be right for all input. The ones that are implemented in assembler will fail because they assume characters are just one byte each. The ones implemented in Delphi will fare a little better, but they'll still return incorrect results sometimes because the old CompareText's notion of "case-insensitive" is based on ASCII whereas the new one should be based on Unicode. The rules for which characters are considered the same save for case are much different for Unicode from how they are for ASCII.
Andreas says in a comment below that Unicode CompareText still uses the ASCII case-comparison rules, so a number of the FastCode functions should work fine. Just look them over before using them to make sure they're not making any character-size assumptions. I seem to recall that some FastCode functions were incorporated into the Delphi RTL already. I have no idea whether CompareText was one of them.
If you're calling CompareText a lot in a hash table, then that suggests your hash table isn't doing a very good job. CompareText should only get called when the hash of the thing you're searching for designated a non-empty bucket in the hash table. From there, a hash table will often use a linear search to find the right item in the bucket, and it will call CompareText for every item during that search. I don't know whether that's how the one you're using works.
You might solve this by using a different hash function that distributes its results more evenly over the available buckets. If your buckets are already evenly filled, then you may need more buckets (and then make sure the hash function still distributes evenly over that number as well).
If the hash-map class you're using is based on TBucketList, then there is room for improvement in the bucket storage. That class doesn't calculate a hash on the entire input. It uses the input only to determine the bucket to use. If the class would also keep track of the full hash computed for a string, then comparisons during the linear search could go much faster. Just compare the hashes, and only compare the strings when the hashes match completely. (For a 256-bucket bucket-list, the largest supported size, only one byte of the input determines the bucket, and the rest of the bytes are ignored.) I've written about TBucketList here before.
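A minimal sketch of that hash-caching idea; TEntry and SameKey are hypothetical names used for illustration, not part of the TBucketList API:

    uses
      SysUtils;  // for CompareText

    type
      TEntry = record
        Hash: Cardinal;  // full hash of the key, computed once when it is stored
        Key: string;
        Value: Pointer;
      end;

    // During the linear search of a bucket, compare the cheap integer hashes
    // first and pay for CompareText only when the hashes actually match.
    function SameKey(const Entry: TEntry; Hash: Cardinal; const Key: string): Boolean;
    begin
      Result := (Entry.Hash = Hash) and (CompareText(Entry.Key, Key) = 0);
    end;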

How to also prepare for 64-bits when migrating to Delphi 2010 and Unicode

As 64-bit support is not expected in the next version, it is no longer an option to wait and migrate our existing code base to Unicode and 64-bit in one go.
However, it would be nice if we could already prepare our code for 64-bit while doing our Unicode translation. This will minimize the impact in the event it finally appears in version 2020.
Any suggestions on how to approach this without introducing too much clutter if it doesn't arrive until 2020?
There's another similar question, but I'll repeat my reply here too, to make sure as many people see this info:
First up, a disclaimer: although I work for Embarcadero, I can't speak for my employer. What I'm about to write is based on my own opinion of how a hypothetical 64-bit Delphi should work, but there may or may not be competing opinions and other foreseen or unforeseen incompatibilities and events that cause alternative design decisions to be made.
That said:
There are two integer types, NativeInt and NativeUInt, whose size will float between 32-bit and 64-bit depending on platform. They've been around for quite a few releases. No other integer types will change size depending on the bitness of the target.
Make sure that any place that relies on casting a pointer value to an integer or vice versa is using NativeInt or NativeUInt for the integer type. TComponent.Tag should be NativeInt in later versions of Delphi.
I'd suggest don't use NativeInt or NativeUInt for non-pointer-based values. Try to keep your code semantically the same between 32-bit and 64-bit. If you need 32 bits of range, use Integer; if you need 64 bits, use Int64. That way your code should run the same on both bitnesses. Only if you're casting to and from a Pointer value of some kind, such as a reference or a THandle, should you use NativeInt.
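A small sketch of the pointer-casting rule described above:

    program CastDemo;
    {$APPTYPE CONSOLE}
    var
      P: Pointer;
      N: NativeInt;
    begin
      N := 42;
      P := @N;
      N := NativeInt(P);  // safe on both targets: NativeInt matches the pointer size
      P := Pointer(N);    // round-trips without loss
      // By contrast, Integer(P) would truncate the upper 32 bits of a 64-bit pointer.
      Writeln(N);
    end.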
Pointer-like things should follow similar rules to pointers: object references (obviously), but also things like HWND, THandle, etc.
Don't rely on internal details of strings and dynamic arrays, like their header data.
Our general policy on API changes for 64-bit should be to keep the same API between 32-bit and 64-bit where possible, even if it means that the 64-bit API does not necessarily take advantage of the machine. For example, TList will probably only handle MaxInt div SizeOf(Pointer) elements, in order to keep Count, indexes etc. as Integer. Because the Integer type won't float (i.e. change size depending on bitness), we don't want to have ripple effects on customer code: any indexes that round-tripped through an Integer-typed variable, or for-loop index, would be truncated and potentially cause subtle bugs.
Where APIs are extended for 64-bit, it will most likely be done with an extra function / method / property to access the extra data, and this API will also be supported in 32-bit. For example, the Length() standard routine will probably return values of type Integer for arguments of type string or dynamic array; if one wants to deal with very large dynamic arrays, there may be a LongLength() routine as well, whose implementation in 32-bit is the same as Length(). Length() would throw an exception in 64-bit if applied to a dynamic array with more than 2^32 elements.
Related to this, there will probably be improved error checking for narrowing operations in the language, especially narrowing 64-bit values to 32-bit locations. This would hurt the usability of assigning the return value of Length to locations of type Integer if Length() returned Int64. On the other hand, specifically for compiler-magic functions like Length(), there may be some advantage to be taken of the magic, to e.g. switch the return type based on context. But advantage can't be similarly taken in non-magic APIs.
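A small illustration of that narrowing hazard; with range checking off, the value would be silently truncated instead:

    program NarrowDemo;
    {$APPTYPE CONSOLE}
    {$RANGECHECKS ON}
    var
      Big: Int64;
      Small: Integer;
    begin
      Big := Int64(High(Integer)) + 1;  // 2147483648: one past what Integer can hold
      Small := Big;                     // compiles, but raises ERangeError at run time
      Writeln(Small);
    end.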
Dynamic arrays will probably support 64-bit indexing. Note that Java arrays are limited to 32-bit indexing, even on 64-bit platforms.
Strings will probably be limited to 32-bit indexing. We have a hard time coming up with realistic reasons for people wanting 4GB+ strings that really are strings, and not just managed blobs of data, for which dynamic arrays may serve just as well.
Perhaps a built-in assembler, but with restrictions, like not being able to freely mix with Delphi code; there are also rules around exceptions and stack frame layout that need to be followed on x64.
First, look at the places where you interact with non-Delphi libraries and API calls; they might differ. On Win32, libraries with the stdcall calling convention export names like _SomeFunction@4 (the @4 indicating the size of the parameters in bytes). On Win64, there is only one calling convention, and the functions in a DLL are no longer decorated. If you import functions from DLL files, you might need to adjust your declarations.
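A hedged sketch of an import that handles both cases (mylib.dll and SomeFunction are placeholder names):

    // Win32 stdcall exports may carry the _Name@N decoration; Win64 exports do not.
    function SomeFunction(Param: Integer): Integer; stdcall;
      external 'mylib.dll' name {$IFDEF WIN64}'SomeFunction'{$ELSE}'_SomeFunction@4'{$ENDIF};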
Keep in mind that a 64-bit exe cannot load a 32-bit DLL, so if you depend on third-party DLL files, you should check for 64-bit versions of those files as well.
Also, look at Integers: if you depend on their maximum value, for example by letting them overflow and waiting for the moment that happens, you will run into trouble if the size of an Integer changes.
The same applies when working with streams: if you serialize data that includes an Integer and the size of Integer changes, your stream will be out of sync.
So, in places where you depend on the size of an integer or pointer, you will need to make adjustments. Keep this size issue in mind when serializing such data as well, as it might cause data incompatibilities between 32-bit and 64-bit versions.
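A minimal sketch of keeping a stream format stable across bitness by serializing an explicitly sized type (SaveCount is a hypothetical helper):

    uses
      Classes;

    // Int32 is 4 bytes on every target, so the on-disk layout does not depend
    // on whether the writer was a 32-bit or a 64-bit build.
    procedure SaveCount(Stream: TStream; Count: Int32);
    begin
      Stream.WriteBuffer(Count, SizeOf(Count));
    end;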
Also, the FreePascal compiler with the Lazarus IDE already supports 64-bit. This alternative Object Pascal compiler is not 100% compatible with the Borland/Codegear/Embarcadero dialect of Pascal, so just recompiling with it for 64-bit might not be that simple, but it might help point out problems with 64-bit.
The conversion to 64-bit should not be very painful. Start by being intentional about the size of an integer where it matters. Don't use "Integer"; instead use Int32 for integers sized at 32 bits, and Int64 for integers sized at 64 bits. In the last bitness conversion the definition of Integer went from Int16 to Int32, so you're playing it safe by specifying the exact bit depth.
If you have any inline assembly, create a Pascal equivalent and create some unit tests to ensure that they operate the same way. Perform some timing tests of both and see if the assembly still runs fast enough to keep. If it does, then you will want to make changes to both as they are needed.
Use NativeInt for integers that can contain cast pointer values.
