String indexing vs. dynamic array indexing in Delphi

String indexing vs. dynamic array indexing in Delphi - delphi

In Delphi, why are AnsiStrings indexed from one and dynamic arrays indexed from zero? Is this a historical accident, to make AnsiStrings work more like ShortStrings, or is there some deeper logic at work?

One of the contributing factors that led to "Pascal" strings being 1 indexed instead of 0 indexed was that the length of the string was stored in the zeroth byte. Yes, that could have been hidden from the programmer's view by having the compiler internally add a constant offset to the string index expression (as was done in Delphi's long strings later) but in the beginning things were much simpler. Allocate a block of memory, store the length in byte zero, index the char data from byte 1. End of story.
As I recall UCSD Pascal was using this length-in-zero-byte convention long before Turbo Pascal came along.
As for why dynamic arrays are zero based, I don't recall any specific reason but I would guess it reflects the dynamic array's kinship to dynamically allocating a buffer and indexing off the buffer pointer. The array types that you would use to create array pointer types were zero based arrays. The first byte is found at buffer pointer + 0 offset. This is the C rationalization for zero based everything. There was no compelling reason to carry string's 1 based indexing pattern over to compiler managed arrays when string's 1 based indexing was already (and had always been) the exception rather than the norm.
It may well be that because the string type was the first array-like data type that everyone first encountered and possibly the most used data type across the board, there may be a perception of a bias towards 1 based indexing in the language. However, if you look closely I think you'll find arrays in Pascal (distinct from string) have never been inherently 1 based, especially when dynamically allocated.

The reason for the Delphi string tradition of 1-based strings is quite simple. The tradition comes from the implementation of old style Turbo Pascal strings. That data type stored the length of the string in the first byte of the variable, index 0. The string data began in the next byte, index 1.
You can still use that data type today. It's now called ShortString. As is immediately obvious from it's implementation, there is a 255 character limit. This limit led to the introduction of huge strings, if I recall correctly, in Delphi 2. When huge strings were introduced the language designers chose to retain 1-based indexing to make it easier for developers to switch from short strings to huge strings.
I guess Turbo Pascal didn't invent the idea of using element 0 for length. It's just that I'm too young to remember what came before then!
Dynamic arrays weren't bound by the past in the same way and had a free choice. I don't know why zero based was chosen. Perhaps because it fits more easily with the prevailing fashion on platform on which Delphi existed at that time, namely Windows. That's just a guess though. Danny Thorpe worked on the Delphi compiler at that time, and even he can't remember the rationale!
The Delphi language designers are currently moving towards zero based string indexing for huge strings. The initial steps in this direction can be seen in XE3 in the TStringHelper class which uses 0-based indexing. And also in the ZEROBASEDSTRINGS conditional which allows you to opt in to 0-based indexing. Expect the next generation Delphi compiler to use 0-based indexing only. The times they are changin'.

Historical accident.
Pascal strings and arrays traditionally start at 1.
C - and perhaps consequently AnsiStrings - start at 0.
I don't know the rationale for "breaking with Pascal tradition" for Dynamic arrays, which also start at zero. But it makes sense, and I agree with it ...
IMHO...

Related

How is data written to memory

When we store data in memory.
How does it get stored, so it can recognize what type of data it is when loaded.
What I want to ask is how the data types like Natural numbers, integers, characters, etc are stored in memory. So they can be recognized easily later when extracted from memory.
When we see at memory, what we see are hex numbers.
How can we relate these hex numbers for ASCII value or Integer Value or any other etc.

Since all of your data is written in binary, there isn't much difference between how the char a is written and how the int 97 is written, since they represent the same binary string (at least the last 8 bits of those strings). That being said, when you read from memory, you read a data type, by that type, you know how you should interpret the data

Memory does not operate in terms of "character" or "integer", these are high-level concepts that assume an abstract machine.
Typically, but not necessarily, a character is just an integer with a smaller size, often 8 bits (but a character could as well be 32 bits!) which represents one symbol or letter, rather than a discrete number. In some cases, a character may even be encoded using a variable length.
Memory operates in terms of bits that are organized in bytes (smallest directly addressable unit) or words. These are -- unbeknownst to you -- organized in banks. The hardware typically allows access in units called "cache lines", but this is something that happens secretly behind your back.
In assembler language, you can typically access bytes and power-of-two multiples of these, sometimes with special alignment requirements (there's usually also bit operations, but while they only change one bit, they still work on whole bytes/words).
All of that is, however, not very interesting, and also widely irrelevant for you. It is first and foremost the compiler's (or interpreter's) job to make sure that when you speak of an integer or a character, that whatever you want comes out at the other end. It is also the tool's responsibility to convert one into another if possible, and produce an error if not possible.
You do not even know for certain whether the value of an integer or a character has a memory location at all (it may very well be stored in a register) unless you explicitly enforce that.
You cannot distinguish a byte at some memory location that came from a "character" from a byte that belongs to an "integer". They look just the same.
And while it is possible to read the raw bytes of one type as another type in most languages, this is not something you normally need to do (or should do).

Delphi 64 bit internal data format / dynamic arrays

I am trying to figure out if there could be situations where using "packed array of [type]" could be advantageous if trying to save memory.
You can read that on Win32 bit, it is not the case:
http://docwiki.embarcadero.com/RADStudio/XE4/en/Internal_Data_Formats#Dynamic_Array_Types
Generally speaking, it would seem to me runtime code for dynamic arrays would be fasterif never using padding between items. But since I believe default alignment in Delphi/64bit is 8byte, it would seem to be there would be a a chance that you some day could not count on dynamic arrays containg e.g. booleans being "packed" by default

First of all I think you need to be careful how you read the documentation. It has not been updated consistently and the fact that some parts appear specific to Win32 can in no way be used to infer that Win64 differs. In fact, for the issues raised in your question, there are no substantive differences between 32 and 64 bit.
For Delphi, using packed on an array does not change the padding between successive elements. The distance between two adjacent elements of an array is equal to SizeOf(T) where T is the element type. This holds true irrespective of the use of the packed modifier. And using packed on an array does not even affect the alignment of the array.
So, to be completely clear and explicit, packed has no impact at all when used on arrays. The compiler treats a packed array in exactly the same way as it treats a non-packed array.
This statement seems to be at odds with the documentation. The documentation states:
Alignment of Structured Types
By default, the values in a structured type are aligned on word- or
double-word boundaries for faster access.
You can, however, specify byte alignment by including the reserved
word packed when you declare a structured type. The packed word
specifies compressed data storage. Here is an example declaration:
type TNumbers = packed array [1..100] of Real;
But consider this program:
{$APPTYPE CONSOLE}
type
TNumbers = packed array [1..100] of Real;
type
TRec = record
a: Byte;
b: TNumbers;
end;
begin
Writeln(Integer(#TRec(nil^).a));
Writeln(Integer(#TRec(nil^).b));
Readln;
end.
The output of which, using Delphi XE2, is:
0
8
So, using packed with arrays, in contradiction of the documentation, does not modify alignment of the type.
Note that the above program, when compiled by Delphi 6, has output
0
1
So it would appear that the compiler has changed, but the documentation has not caught up.
A related question can be found here: Are there any difference between array and packed array in Delphi? But note that the answer of Barry Kelly (an Embarcadero compiler engineer at the time he wrote the answer) does not mention the change in compiler behaviour.
Generally speaking, it would seem to me runtime code would be faster if never using padding between items.
As discussed above, for arrays there never is padding between array elements.
I believe default alignment in Delphi/64bit is 8 byte.
The alignment is determined by the data type and is not related to the target architecture. So a Boolean has alignment of 1. A Word has alignment of 2. An Integer has alignment of 4. A Double has alignment of 8. These alignment values are the same on 32 and 64 bit.
Alignment has an enormous impact on performance. If you care about performance you should, as a general rule, never pack data structures. If you wish to minimise the size of a structure (class or record), put the larger elements first to minimise padding. This could improve performance, but it could also worsen it.
But there is no single hard and fast rule. I can imagine that packing could improve performance if the thing that was packed was hardly ever accessed. And I can, more readily in fact, imagine that changing the order of elements in a structure to group related elements could improve performance.

Why do Delphi and Free Pascal usually prefer a signed-integer data type to unsigned one?

I'm not a Pascal newbie, but I still don't know until now why Delphi and Free Pascal usually declares parameters and returned values as signed integers whereas I see them should always be positive. For example:
Pos() returns type of Integer. Is it possible to be a negative?
SetLength() declares the NewLength parameter as a type of Integer. Is there a negative length for string?
System.THandle declared as Longint. Is there a negative number for handles?
There are many decisions like those in Delphi and Free Pascal. What considerations were behind this?

In Pascal, Integer (signed) is the base type. All other integer number types are a subranges of integer. (this is not entirely true in Borland dialects, given longint in TP and int64 in Delphi, but close enough).
An important reason for that if the intermediate result of calculations gets negative, and you calculate with unsigned integers, range check errors will trigger, and since most older programming languages DON'T assume 2-complement integers, the result (with range checks off) might even be corrupt.
The THandle case is much simpler. Delphi didn't have a proper 32-bit unsigned till D4, but only a 31-bit cardinal. (since 32-bit unsigned integer is not a subrange of integer, the later unsigned ints are a subset of int64, which moved the problem to uint64 which was only added in D2010 or so)
So in many places in the headers signed types are used where the winapi uses unsigned types, probably to avoid the 32th bit getting accidentally corrupt in those versions, and the custom stuck.
But the winapi case is different from the general case.
Added later Some Pascal (and Modula2/3) implementations circumvent this trap by setting the integer at a size larger than the wordsize, and require all numeric types to declare a proper subrange, like in the below program.
The first holds the primary assumption that everything is a subset of integer, and the second allows the compiler to scale nearly everything down again to fit in registers, specially if the CPU has some operations for larger than word operations. (like x86 where 32-bit * 32-bit mul gives a 64-bit result, or can detect wordsize overflows using status bits (e.g. to generate range exceptions for adds without doing a full 2*wordsize add)
var x : 0..20;
y : -10..10;
begin
// any expression of x and y has a range -10..20
Turbo Pascal and Delphi emulate an integer type twice the wordsize for their 16-bit and 32-bit offerings. The handling of the highest unsigned type is hacky at best.

Well, for a start THandle is declared incorrectly. It's unsigned in the Windows headers and should be so in Delphi. In fact I think this was corrected in a recent release of Delphi.
I'd imagine that the preference for signed over unsigned is largely historical and not particularly significant. However, I can think of one example where it is important. Consider the for loop:
for i := 0 to Count-1 do
If i is unsigned and Count is 0 then this loop runs from 0 to $FFFFFFFF which is not what you want. Using a signed integer loop variable avoids that problem.
Pascal is a victim of its syntax here. The equivalent C or C++ loop has no such trouble
for (unsigned int i=0; i<Count; i++)
due to the syntactic difference and use of a comparison operator as stopping condition.
This could also be the reason why Length() on a string or dynamic array returns a signed value. And so for consistency, SetLength() should accept signed values. And given that the return value of Pos() is used to index strings, it should be signed also.
Here's another Stack Overflow discussion of the topic: Should I use unsigned integers for counting members?
Of course, I'm speculating wildly here. Perhaps there was no design and just out of habit the precedent of using signed values was set and became enshrined.

Some string related search functions return -1 when nothing is found.
I believe the reasoning behind this is that MaxInt is 2GB which is the maximum size for strings in 32 bit Delphi. This because a single process can have up to 2GB memory

There are many reasons for using signed integers, even some that might apply when you do not intend to return a negative value.
Imagine I write code that calls Pos, and I want to do math with the results. Would you rather have a negative result (Pos('x',s)-5) raise a range-check exception, underflow and become a very large unsigned number around 4 billion, or go negative, if Pos('x',s) returns 1? Either one is a source of problems for new users who seldom think about these cases, but the long-established tradition is that by using Integer results, it's your job to check for negative and zero results and not use them as string offsets. There is an advantage for beginning and for advanced programmers, in using Integer, and not having "negative" values roll under and become large unsigned values or raise range exceptions.
Secondly, remember that in beginning programming, one usually introduces Integer (signed) types long before one introduces unsigned types like Cardinal. Beginners often work with functions like Pos, and it makes sense to use the type that will create the least-unfriendly set of side effects. There are no negative side effects to having a range larger than the one you absolutely need (the range you probably need for Pos is 1 to maximum-string-length-in-delphi). There is zero benefit in 32-bit Delphi to using the Cardinal type for Pos, and there definitely ARE downsides to choosing it.
Once you get to 64-bit delphi, however, you could theoretically have strings LARGER than an Integer can hold, and moving to Cardinal wouldn't fix all your potential problems. However, the chance of anyone having a 2+ GB string is probably nil, and Delphi 64-bit compiler doesn't allow a >2 GB string, anyway. In my testing, I can achieve an almost 1 GB String in 64 bit Delphi. So the practical length limit for a Win64 string is about a billion (1073741814) characters, which is using nearly 2 GB of actual RAM. At that limit, I either get EIntOverflow or EAccessViolation, and it seems I am hitting Delphi run time library (RTL) bugs, not properly defined limits, so your mileage may vary.

Why does Delphi warn when assigning ShortString to string?

I'm converting some legacy code to Delphi 2010.
There are a fair number of old ShortStrings, like string[25]
Why does the assignment below:
type
S: String;
ShortS: String[25];
...
S := ShortS;
cause the compiler to generate this warning:
W1057 Implicit string cast from 'ShortString' to 'string'.
There's no data loss that is occurring here. In what circumstances would this warning be helpful information to me?
Thanks!
Tomw

It's because your code is implicitly converting a single-byte character string to a UnicodeString. It's warning you in case you might have overlooked it, since that can cause problems if you do it by mistake.
To make it go away, use an explicit conversion:
S := string(ShortS);

The ShortString type has not changed. It continues to be, in effect, an array of AnsiChar.
By assigning it to a string type, you are taking what is a group of AnsiChars (one byte) and putting it into a group of WideChars (two bytes). The compiler can do that just fine, and is smart enough not to lose data, but the warning is there to let you know that such a conversion has taken place.

The warning is very important because you may lose data. The conversion is done using the current Windows 8-bit character set, and some character sets do not define all values between 0 and 255, or are multi-byte character sets, and thus cannot convert all byte values.
The data loss can occur on a standard computer in a country with specific standard character sets, or on a computer in USA that has been set up for a different locale, because the user communicates a lot with people in other languages.
For instance, if the local code page is 932, the byte values 129 and 130 will both convert to the same value in the Unicode string.
In addition to this, the conversion involves a Windows API call, which is an expensive operation. If you do a lot of these, it can slow down your application.

It's safe ( as long as you're using the ShortString for its intended purpose: to hold a string of characters and not a collection of bytes, some of which may be 0 ), but may have performance implications if you do it a lot. As far as I know, Delphi has to allocate memory for the new unicode string, extract the characters from the ShortString into a null-terminated string (that's why it's important that it's a properly-formed string) and then call something like the Windows API MultiByteToWideChar() function. Not rocket science, but not a trivial operation either.

ShortStrings don't have a code page associated with them, AnsiStrings do (since D2009).
The conversion from ShortString to UnicodeString can only be done on the assumption that ShortStrings are encoded in the default ANSI encoding which is not a safe assumption.

I don't really know Delphi, but if I remember correctly, the Shortstrings are essentially a sequence of characters on the stack, whereas a regular string (AnsiString) is actually a reference to a location on the heap. This may have different implications.
Here's a good article on the different string types:
http://www.codexterity.com/delphistrings.htm
I think there might also be a difference in terms of encoding but I'm not 100% sure.

Should I use unsigned integers for counting members?

Should I use unsigned integers for my count class members?
Answer
For example, assume a class
TList <T> = class
private
FCount : Cardinal;
public
property Count : Cardinal read FCount;
end;
That does make sense, doesn't it? The number of items stored in a list can't be negative, so why not use an unsigned integer type for it? I think it's in general a good principle to always use the least general (ergo the most special) type possible.
Now, iterating over a list looks like this:
for I := 0 to List.Count - 1 do
Writeln (List [I]);
When the number of items stored in the list is zero, the compiler tries to evaluate
List.Count - 1
which results in a nice Integer overflow (underflow to be exact). Combined with the fact that the debugger does not show the appropriate location where the exception occured, this was very hard to find for me.
Let me add that if you have overflow checking turned off, the resulting errors will be even harder to track, because then you will often access memory that doesn't belong to you - and that results in undefined behaviour.
I will be using plain Integers for all my count members from now on to avoid situations like this.
If that's complete nonsense, please point it out to me :)
(I just spent an hour tracking an integer overflow in my code, so I decided to share that - most people on here will know that of course, but perhaps I can save someone some time.)

No, definitely not. Delphi idiom is to use integers here. Don't fight the language.
In a 32 bit environment you'll not have more elements in the list, except if you try to build a bitmap.
Let's be clear: every programmer who is going to have to use your code is going to hate you for using a Cardinal instead of an integer.

Unsigned integers are almost always more trouble than they're worth because you usually end up mixing signed and unsigned integers in expressions at some point. That means that the type will need to be widened (and probably have a performance hit) to get correct semantics (ideally the compiler does this as per language definition), or else you'll need to be very careful in your range checking.
Take C/C++ for example: size_t is the type of the integer for memory sizes and allocation, and is unsigned, but ptrdiff_t is the type for the offset you get when you subtract one pointer from another, and necessarily is signed. Want to know how many elements you've allocated in an array? Perhaps you subtract the first element address from the last+1 element address and divide by sizeof(element-type)? Well, now you've just mixed signed and unsigned integers.

Regarding your statement that "I think it's in general a good principle to always use the least general (ergo the most special) type possible." - actually I think it's a good principle to use the data type that will cause you least angst and trouble.
Generally for me that's a signed int since:
I don't usually have lists with 231 or more elements in them.
You shouldn't have lists that big either :-)
I don't like the hassle of having special edge cases in my code.
But it's really a style issue. If 'purity' of code is more important to you than brevity of code, your method is best (with modifications to catch the edge cases). Myself, I prefer brevity since edge cases tend to clutter the code and reduce understanding.

Don't.
It's not just going against a programming idiom, it's an explicit request to the compiler to use unsigned arithmetic, that is either prone to anomalous behaviors (if you don't guard against overflows) or to irrelevant runtime exceptions (if you guard against overflows, a temporary overflow will be fatal, f.i when you subtract before adding, even if the final result is positive, and I'm referring to the CPU opcode-level ordering of operations, which may not bear a trivial relationship to what you have in your code).
Keep in mind "unsigned" does not translate to "positive", it translates to "doesn't have a sign", which is different. The term "unsigned" was picked for good reason (and naming it "Cardinal" in Delphi was a poor choice IMO).
Unsigned types are for raw storage specifications, bitwise operations, ASM code, embedded controllers and other specialty uses. When you're doing high-level programming, you should forget you ever heard about unsigned types.

Moral: use iterators and foreach when you can, because it avoids this question altogether.

Boundary conditions frequently present problems. Allowing for a type that can go negative may just shift the issue. Perhaps it shifts it in a way that's easier to debug, perhaps not. I started off using integers for counting loops like that, but later on switched to cardinals to help me catch errors.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart