Delphi 64 bit internal data format / dynamic arrays

Delphi 64 bit internal data format / dynamic arrays - delphi

I am trying to figure out if there could be situations where using "packed array of [type]" could be advantageous if trying to save memory.
You can read that on Win32 bit, it is not the case:
http://docwiki.embarcadero.com/RADStudio/XE4/en/Internal_Data_Formats#Dynamic_Array_Types
Generally speaking, it would seem to me runtime code for dynamic arrays would be fasterif never using padding between items. But since I believe default alignment in Delphi/64bit is 8byte, it would seem to be there would be a a chance that you some day could not count on dynamic arrays containg e.g. booleans being "packed" by default

First of all I think you need to be careful how you read the documentation. It has not been updated consistently and the fact that some parts appear specific to Win32 can in no way be used to infer that Win64 differs. In fact, for the issues raised in your question, there are no substantive differences between 32 and 64 bit.
For Delphi, using packed on an array does not change the padding between successive elements. The distance between two adjacent elements of an array is equal to SizeOf(T) where T is the element type. This holds true irrespective of the use of the packed modifier. And using packed on an array does not even affect the alignment of the array.
So, to be completely clear and explicit, packed has no impact at all when used on arrays. The compiler treats a packed array in exactly the same way as it treats a non-packed array.
This statement seems to be at odds with the documentation. The documentation states:
Alignment of Structured Types
By default, the values in a structured type are aligned on word- or
double-word boundaries for faster access.
You can, however, specify byte alignment by including the reserved
word packed when you declare a structured type. The packed word
specifies compressed data storage. Here is an example declaration:
type TNumbers = packed array [1..100] of Real;
But consider this program:
{$APPTYPE CONSOLE}
type
TNumbers = packed array [1..100] of Real;
type
TRec = record
a: Byte;
b: TNumbers;
end;
begin
Writeln(Integer(#TRec(nil^).a));
Writeln(Integer(#TRec(nil^).b));
Readln;
end.
The output of which, using Delphi XE2, is:
0
8
So, using packed with arrays, in contradiction of the documentation, does not modify alignment of the type.
Note that the above program, when compiled by Delphi 6, has output
0
1
So it would appear that the compiler has changed, but the documentation has not caught up.
A related question can be found here: Are there any difference between array and packed array in Delphi? But note that the answer of Barry Kelly (an Embarcadero compiler engineer at the time he wrote the answer) does not mention the change in compiler behaviour.
Generally speaking, it would seem to me runtime code would be faster if never using padding between items.
As discussed above, for arrays there never is padding between array elements.
I believe default alignment in Delphi/64bit is 8 byte.
The alignment is determined by the data type and is not related to the target architecture. So a Boolean has alignment of 1. A Word has alignment of 2. An Integer has alignment of 4. A Double has alignment of 8. These alignment values are the same on 32 and 64 bit.
Alignment has an enormous impact on performance. If you care about performance you should, as a general rule, never pack data structures. If you wish to minimise the size of a structure (class or record), put the larger elements first to minimise padding. This could improve performance, but it could also worsen it.
But there is no single hard and fast rule. I can imagine that packing could improve performance if the thing that was packed was hardly ever accessed. And I can, more readily in fact, imagine that changing the order of elements in a structure to group related elements could improve performance.

Related

How is data written to memory

When we store data in memory.
How does it get stored, so it can recognize what type of data it is when loaded.
What I want to ask is how the data types like Natural numbers, integers, characters, etc are stored in memory. So they can be recognized easily later when extracted from memory.
When we see at memory, what we see are hex numbers.
How can we relate these hex numbers for ASCII value or Integer Value or any other etc.

Since all of your data is written in binary, there isn't much difference between how the char a is written and how the int 97 is written, since they represent the same binary string (at least the last 8 bits of those strings). That being said, when you read from memory, you read a data type, by that type, you know how you should interpret the data

Memory does not operate in terms of "character" or "integer", these are high-level concepts that assume an abstract machine.
Typically, but not necessarily, a character is just an integer with a smaller size, often 8 bits (but a character could as well be 32 bits!) which represents one symbol or letter, rather than a discrete number. In some cases, a character may even be encoded using a variable length.
Memory operates in terms of bits that are organized in bytes (smallest directly addressable unit) or words. These are -- unbeknownst to you -- organized in banks. The hardware typically allows access in units called "cache lines", but this is something that happens secretly behind your back.
In assembler language, you can typically access bytes and power-of-two multiples of these, sometimes with special alignment requirements (there's usually also bit operations, but while they only change one bit, they still work on whole bytes/words).
All of that is, however, not very interesting, and also widely irrelevant for you. It is first and foremost the compiler's (or interpreter's) job to make sure that when you speak of an integer or a character, that whatever you want comes out at the other end. It is also the tool's responsibility to convert one into another if possible, and produce an error if not possible.
You do not even know for certain whether the value of an integer or a character has a memory location at all (it may very well be stored in a register) unless you explicitly enforce that.
You cannot distinguish a byte at some memory location that came from a "character" from a byte that belongs to an "integer". They look just the same.
And while it is possible to read the raw bytes of one type as another type in most languages, this is not something you normally need to do (or should do).

String indexing vs. dynamic array indexing in Delphi

In Delphi, why are AnsiStrings indexed from one and dynamic arrays indexed from zero? Is this a historical accident, to make AnsiStrings work more like ShortStrings, or is there some deeper logic at work?

One of the contributing factors that led to "Pascal" strings being 1 indexed instead of 0 indexed was that the length of the string was stored in the zeroth byte. Yes, that could have been hidden from the programmer's view by having the compiler internally add a constant offset to the string index expression (as was done in Delphi's long strings later) but in the beginning things were much simpler. Allocate a block of memory, store the length in byte zero, index the char data from byte 1. End of story.
As I recall UCSD Pascal was using this length-in-zero-byte convention long before Turbo Pascal came along.
As for why dynamic arrays are zero based, I don't recall any specific reason but I would guess it reflects the dynamic array's kinship to dynamically allocating a buffer and indexing off the buffer pointer. The array types that you would use to create array pointer types were zero based arrays. The first byte is found at buffer pointer + 0 offset. This is the C rationalization for zero based everything. There was no compelling reason to carry string's 1 based indexing pattern over to compiler managed arrays when string's 1 based indexing was already (and had always been) the exception rather than the norm.
It may well be that because the string type was the first array-like data type that everyone first encountered and possibly the most used data type across the board, there may be a perception of a bias towards 1 based indexing in the language. However, if you look closely I think you'll find arrays in Pascal (distinct from string) have never been inherently 1 based, especially when dynamically allocated.

The reason for the Delphi string tradition of 1-based strings is quite simple. The tradition comes from the implementation of old style Turbo Pascal strings. That data type stored the length of the string in the first byte of the variable, index 0. The string data began in the next byte, index 1.
You can still use that data type today. It's now called ShortString. As is immediately obvious from it's implementation, there is a 255 character limit. This limit led to the introduction of huge strings, if I recall correctly, in Delphi 2. When huge strings were introduced the language designers chose to retain 1-based indexing to make it easier for developers to switch from short strings to huge strings.
I guess Turbo Pascal didn't invent the idea of using element 0 for length. It's just that I'm too young to remember what came before then!
Dynamic arrays weren't bound by the past in the same way and had a free choice. I don't know why zero based was chosen. Perhaps because it fits more easily with the prevailing fashion on platform on which Delphi existed at that time, namely Windows. That's just a guess though. Danny Thorpe worked on the Delphi compiler at that time, and even he can't remember the rationale!
The Delphi language designers are currently moving towards zero based string indexing for huge strings. The initial steps in this direction can be seen in XE3 in the TStringHelper class which uses 0-based indexing. And also in the ZEROBASEDSTRINGS conditional which allows you to opt in to 0-based indexing. Expect the next generation Delphi compiler to use 0-based indexing only. The times they are changin'.

Historical accident.
Pascal strings and arrays traditionally start at 1.
C - and perhaps consequently AnsiStrings - start at 0.
I don't know the rationale for "breaking with Pascal tradition" for Dynamic arrays, which also start at zero. But it makes sense, and I agree with it ...
IMHO...

Why do Delphi and Free Pascal usually prefer a signed-integer data type to unsigned one?

I'm not a Pascal newbie, but I still don't know until now why Delphi and Free Pascal usually declares parameters and returned values as signed integers whereas I see them should always be positive. For example:
Pos() returns type of Integer. Is it possible to be a negative?
SetLength() declares the NewLength parameter as a type of Integer. Is there a negative length for string?
System.THandle declared as Longint. Is there a negative number for handles?
There are many decisions like those in Delphi and Free Pascal. What considerations were behind this?

In Pascal, Integer (signed) is the base type. All other integer number types are a subranges of integer. (this is not entirely true in Borland dialects, given longint in TP and int64 in Delphi, but close enough).
An important reason for that if the intermediate result of calculations gets negative, and you calculate with unsigned integers, range check errors will trigger, and since most older programming languages DON'T assume 2-complement integers, the result (with range checks off) might even be corrupt.
The THandle case is much simpler. Delphi didn't have a proper 32-bit unsigned till D4, but only a 31-bit cardinal. (since 32-bit unsigned integer is not a subrange of integer, the later unsigned ints are a subset of int64, which moved the problem to uint64 which was only added in D2010 or so)
So in many places in the headers signed types are used where the winapi uses unsigned types, probably to avoid the 32th bit getting accidentally corrupt in those versions, and the custom stuck.
But the winapi case is different from the general case.
Added later Some Pascal (and Modula2/3) implementations circumvent this trap by setting the integer at a size larger than the wordsize, and require all numeric types to declare a proper subrange, like in the below program.
The first holds the primary assumption that everything is a subset of integer, and the second allows the compiler to scale nearly everything down again to fit in registers, specially if the CPU has some operations for larger than word operations. (like x86 where 32-bit * 32-bit mul gives a 64-bit result, or can detect wordsize overflows using status bits (e.g. to generate range exceptions for adds without doing a full 2*wordsize add)
var x : 0..20;
y : -10..10;
begin
// any expression of x and y has a range -10..20
Turbo Pascal and Delphi emulate an integer type twice the wordsize for their 16-bit and 32-bit offerings. The handling of the highest unsigned type is hacky at best.

Well, for a start THandle is declared incorrectly. It's unsigned in the Windows headers and should be so in Delphi. In fact I think this was corrected in a recent release of Delphi.
I'd imagine that the preference for signed over unsigned is largely historical and not particularly significant. However, I can think of one example where it is important. Consider the for loop:
for i := 0 to Count-1 do
If i is unsigned and Count is 0 then this loop runs from 0 to $FFFFFFFF which is not what you want. Using a signed integer loop variable avoids that problem.
Pascal is a victim of its syntax here. The equivalent C or C++ loop has no such trouble
for (unsigned int i=0; i<Count; i++)
due to the syntactic difference and use of a comparison operator as stopping condition.
This could also be the reason why Length() on a string or dynamic array returns a signed value. And so for consistency, SetLength() should accept signed values. And given that the return value of Pos() is used to index strings, it should be signed also.
Here's another Stack Overflow discussion of the topic: Should I use unsigned integers for counting members?
Of course, I'm speculating wildly here. Perhaps there was no design and just out of habit the precedent of using signed values was set and became enshrined.

Some string related search functions return -1 when nothing is found.
I believe the reasoning behind this is that MaxInt is 2GB which is the maximum size for strings in 32 bit Delphi. This because a single process can have up to 2GB memory

There are many reasons for using signed integers, even some that might apply when you do not intend to return a negative value.
Imagine I write code that calls Pos, and I want to do math with the results. Would you rather have a negative result (Pos('x',s)-5) raise a range-check exception, underflow and become a very large unsigned number around 4 billion, or go negative, if Pos('x',s) returns 1? Either one is a source of problems for new users who seldom think about these cases, but the long-established tradition is that by using Integer results, it's your job to check for negative and zero results and not use them as string offsets. There is an advantage for beginning and for advanced programmers, in using Integer, and not having "negative" values roll under and become large unsigned values or raise range exceptions.
Secondly, remember that in beginning programming, one usually introduces Integer (signed) types long before one introduces unsigned types like Cardinal. Beginners often work with functions like Pos, and it makes sense to use the type that will create the least-unfriendly set of side effects. There are no negative side effects to having a range larger than the one you absolutely need (the range you probably need for Pos is 1 to maximum-string-length-in-delphi). There is zero benefit in 32-bit Delphi to using the Cardinal type for Pos, and there definitely ARE downsides to choosing it.
Once you get to 64-bit delphi, however, you could theoretically have strings LARGER than an Integer can hold, and moving to Cardinal wouldn't fix all your potential problems. However, the chance of anyone having a 2+ GB string is probably nil, and Delphi 64-bit compiler doesn't allow a >2 GB string, anyway. In my testing, I can achieve an almost 1 GB String in 64 bit Delphi. So the practical length limit for a Win64 string is about a billion (1073741814) characters, which is using nearly 2 GB of actual RAM. At that limit, I either get EIntOverflow or EAccessViolation, and it seems I am hitting Delphi run time library (RTL) bugs, not properly defined limits, so your mileage may vary.

Memory layout of a set

How is a set organized in memory in Delphi?
What I try to do is casting a simple type to a set type like
var
MyNumber : Word;
ShiftState : TShiftState;
begin
MyNumber:=42;
ShiftState:=TShiftState(MyNumber);
end;
Delphi (2009) won't allow this and I don't understand why. It would make my life a lot easier in cases where I get a number where single bits encode different enum values and I just could cast it like this. Can this be done?
One approach I was going to go for is:
var
ShiftState : TShiftState;
MyNumber : Word absolute ShiftState;
begin
MyNumber:=42;
end;
But before doing this I thought I'd ask for the memory layout. It's more a feeling than knowlege what I am having right now about this.

A Delphi set is a bit field who's bits correspond to the associated values of the elements in your set. For a set of normal enumerated types the bit layout is straight-forward:
bit 0 corresponds to set element with ordinal value 0
bit 1 corresponds to set element with ordinal value 1
and so on.
Things get a bit interesting when you're dealing with a non-contiguous set or with a set that doesn't start at 0. You can do that using Delphi's subrange type (example: set of 3..7) or using enumerated types that specify the actual ordinal value for the elements:
type enum=(seven=7, eight=8, eleven=11);
EnumSet = set of enum;
In such cases Delphi will allocate the minimum required amount of bytes that will include all required bits, but will not "shift" bit values to use less space. In the EnumSet example Delphi will use two bytes:
The first byte will have it's 7th bit associated with seven
The second byte will have bit 0 associated with eight
The second byte will have bit 3 associated with eleven
You can see some tests I did over here: Delphi 2009 - Bug? Adding supposedly invalid values to a set
Tests were done using Delphi 2010, didn't repeat them for Delphi XE.

Unfortunately I just now stumbled upon the following question: Delphi 2009 - Bug? Adding supposedly invalid values to a set
The accepted answer of Cosmin contains a very detailed description of what is going on with sets in Delphi. And why I better not used my approach with absolute. Apparently a set variable can take from 1 to 32 byte of memory, depending on the enum values.

You have to choose a correctly sized ordinal type. For me (D2007) your code compiles with MyNumber: Byte:
procedure Test;
var
MyNumber: Byte;
ShiftState: TShiftState;
begin
MyNumber := 42;
ShiftState := TShiftState(MyNumber);
end;
I have used this technique in some situations and didn't encounter problems.
UPDATE
The TShiftState type has been extended since Delphi 2010 to include two new states, ssTouch and ssPen, based on the corresponding doc page (current doc page). The Delphi 2009 doc still has TShiftState defined as a set of 7 states.
So, your attempt to convert Word to TShiftState would work in Delphi 2010+, but Byte is the right size for Delphi 2009.

I use this:
For <= 8 elements, PByte(#MyNumber)^, for <= 16 elements, PWord(#MyNumber)^, etc.
If the enum takes up more space (through the min enum size compiler option) this will still work.

How to also prepare for 64-bits when migrating to Delphi 2010 and Unicode

As 64 bits support is not expected in the next version it is no longer an option to wait for the possibility to migrate our existing code base to unicode and 64-bit in one go.
However it would be nice if we could already prepare our code for 64-bit when doing our unicode translation. This will minimize impact in the event it will finally appear in version 2020.
Any suggestions how to approach this without introducing to much clutter if it doesn't arrive until 2020?

There's another similar question, but I'll repeat my reply here too, to make sure as many people see this info:
First up, a disclaimer: although I work for Embarcadero. I can't speak for my employer. What I'm about to write is based on my own opinion of how a hypothetical 64-bit Delphi should work, but there may or may not be competing opinions and other foreseen or unforeseen incompatibilities and events that cause alternative design decisions to be made.
That said:
There are two integer types, NativeInt and NativeUInt, whose size will
float between 32-bit and 64-bit depending on platform. They've been
around for quite a few releases. No other integer types will change size
depending on bitness of the target.
Make sure that any place that relies on casting a pointer value to an
integer or vice versa is using NativeInt or NativeUInt for the integer
type. TComponent.Tag should be NativeInt in later versions of Delphi.
I'd suggest don't use NativeInt or NativeUInt for non-pointer-based values. Try to keep your code semantically the same between 32-bit and 64-bit. If you need 32 bits of range, use Integer; if you need 64 bits, use Int64. That way your code should run the same on both bitnesses. Only if you're casting to and from a Pointer value of some kind, such as a reference or a THandle, should you use NativeInt.
Pointer-like things should follow similar rules to pointers: object
references (obviously), but also things like HWND, THandle, etc.
Don't rely on internal details of strings and dynamic arrays, like
their header data.
Our general policy on API changes for 64-bit should be to keep the
same API between 32-bit and 64-bit where possible, even if it means that
the 64-bit API does not necessarily take advantage of the machine. For
example, TList will probably only handle MaxInt div SizeOf(Pointer)
elements, in order to keep Count, indexes etc. as Integer. Because the
Integer type won't float (i.e. change size depending on bitness), we
don't want to have ripple effects on customer code: any indexes that
round-tripped through an Integer-typed variable, or for-loop index,
would be truncated and potentially cause subtle bugs.
Where APIs are extended for 64-bit, they will most likely be done with
an extra function / method / property to access the extra data, and this
API will also be supported in 32-bit. For example, the Length() standard
routine will probably return values of type Integer for arguments of
type string or dynamic array; if one wants to deal with very large
dynamic arrays, there may be a LongLength() routine as well, whose
implementation in 32-bit is the same as Length(). Length() would throw
an exception in 64-bit if applied to a dynamic array with more than 232
elements.
Related to this, there will probably be improved error checking for
narrowing operations in the language, especially narrowing 64-bit values
to 32-bit locations. This would hit the usability of assigning the
return value of Length to locations of type Integer if Length(),
returned Int64. On the other hand, specifically for compiler-magic
functions like Length(), there may be some advantage of the magic taken,
to e.g. switch the return type based on context. But advantage can't be
similarly taken in non-magic APIs.
Dynamic arrays will probably support 64-bit indexing. Note that Java
arrays are limited to 32-bit indexing, even on 64-bit platforms.
Strings probably will be limited to 32-bit indexing. We have a hard
time coming up with realistic reasons for people wanting 4GB+ strings
that really are strings, and not just managed blobs of data, for which
dynamic arrays may serve just as well.
Perhaps a built-in assembler, but with restrictions, like not being able to freely mix with Delphi code; there are also rules around exceptions and stack frame layout that need to be followed on x64.

First, look at the places where you interact with non-delphi libraries and api-calls,
they might differ. On Win32, libraries with the stdcall calling convenstion are named like _SomeFunction#4 (#4 indicating the size of the parameters, etc). On Win64, there is only one calling convention, and the functions in a dll are no more decorated. If you import functions from dll files, you might need to adjust them.
Keep in mind, in a 64 bit exe you cannot load a 32-bit dll, so, if you depend on 3rd party dll files, you should check for a 64-bit version of those files as well.
Also, look at Integers, if you depend on their max value, for example when you let them overflow and wait for the moment that happens, it will cause trouble if the size of an integer is changed.
Also, when working with streams, and you want to serialize different data, with includes an integer, it will cause trouble, since the size of the integer changed, and your stream will be out of sync.
So, on places where you depend on the size of an integer or pointer, you will need to make adjustments. When serializing sush data, you need to keep in mind this size issue as well, as it might cause data incompatibilities between 32 and 64 bit versions.
Also, the FreePascal compiler with the Lazarus IDE already supports 64-bit. This alternative Object Pascal compiler is not 100% compatible with the Borland/Codegear/Embarcadero dialect of Pascal, so just recompiling with it for 64-bit might not be that simple, but it might help point out problems with 64-bit.

The conversion to 64bit should not be very painful. Start with being intentional about the size of an integer where it matters. Don't use "integer" instead use Int32 for integers sized at 32bits, and Int64 for integers sized at 64bits. In the last bit conversion the definition of Integer went from Int16 to Int32, so your playing it safe by specifying the exact bit depth.
If you have any inline assembly, create a pascal equivalent and create some unit tests to insure that they operate the same way. Perform some timing tests of both and see if the assembly still runs faster enough to keep. If it does, then you will want to make changes to both as they are needed.

Use NativeInt for integers that can contain casted pointers.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart