Does the order or syntax of allocate statement affect performance? (Fortran) - memory

Because of having performance issues when passing a code from static to dynamic allocation, I started to wander about how memory allocation is managed in a Fortran code.
Specifically, in this question, I wander if the order or syntax used for the allocate statement makes any difference. That is, does it make any difference to allocate vectors like:
allocate(x(DIM),y(DIM))
versus
allocate(x(DIM))
allocate(y(DIM))
The syntax suggests that in the first case the program would allocate all the space for the vectors at once, possibly improving the performance, while in the second case it must allocate the space for one vector at a time, in such a way that they could end up far from each other. If not, that is, if the syntax does not make any difference, I wander if there is a way to control that allocation (for instance, allocating a vector for all space and using pointers to address the space allocated as multiple variables).
Finally, I notice now that I don't even know one thing: an allocate statement guarantees that at least a single vector occupies a contiguous space in memory (or the best it can?).

From the language standard point of view both ways how to write them are possible. The compiler is free to allocate the arrays where it wants. It normally calls malloc() to allocate some piece of memory and makes the allocatable arrays from that piece.
Whether it might allocate a single piece of memory for two different arrays in a single allocate statement is up to the compiler, but I haven't heard about any compiler doing that.
I just verified that my gfortran just calls __builtin_malloc two times in this case.
Another issue is already pointed out by High Performance Mark. Even when malloc() successfully returns, the actual memory pages might still not be assigned. On Linux that happens when you first access the array.
I don't think it is too important if those arrays are close to each other in memory or not anyway. The CPU can cache arrays from different regions of address space if it needs them.
Is there a way how to control the allocation? Yes, you can overload the malloc by your own allocator which does some clever things. It may be used to have always memory aligned to 32-bytes or similar purposes (example). Whether you will improve performance of your code by allocating things somehow close to each other is questionable, but you can have a try. (Of course this is completely compiler-dependent thing, a compiler doesn't have to use malloc() at all, but mostly they do.) Unfortunately, this will only works when the calls to malloc are not inlined.

There are (at least) two issues here, firstly the time taken to allocate the memory and secondly the locality of memory in the arrays and the impact of this on performance. I don't know much about the actual allocation process, although the links suggested by High Performance Mark and the answer by Vadimir F cover this.
From your question, it seems you are more interested in cache hits and memory locality given by arrays being next to each other. I would guess there is no guarantee either allocate statement ensures both arrays next to each other in memory. This is based on allocating arrays in a type, which in the fortran 2003 MAY 2004 WORKING DRAFT J3/04-007 standard
NOTE 4.20
Unless the structure includes a SEQUENCE statement, the use of this terminology in no way implies that these components are stored in this, or any other, order. Nor is there any requirement that contiguous storage be used.
From the discussion with Vadimir F, if you put allocatable arrays in a type and use the sequence keyword, e.g.
type botharrays
SEQUENCE
double precision, dimension(:), allocatable :: x, y
end type
this DOES NOT ensure they are allocated as adjacent in memory. For static arrays or lots of variables, a sequential type sounds like it may work like your idea of "allocating a vector for all space and using pointers to address the space allocated as multiple variables". I think common blocks (Fortran 77) allowed you to specify the relationship between memory location of arrays and variables in memory, but don't work with allocatable arrays either.
In short, I think this means you cannot ensure two allocated arrays are adjacent in memory. Even if you could, I don't see how this will result in a reduction in cache misses or improved performance. Even if you typically use the two together, unless the arrays are small enough that the cache will include multiple arrays in one read (assuming reads are allowed to go beyond array bounds) you won't benefit from the memory locality.

Related

Better to allocate too much memory on the stack, or the correct amount on the heap?

I have just started learning rust, and it is my first proper look into a low level language (I do python, usually). Part of the tutorial explains that a string literal is stored on the stack, because it has a fixed (and known) size; it also explains that a non-initialised string is stored on the heap, so that its size can grow as necessary.
My understanding is that the stack is much faster than the heap. In the case of a string whose size is unknown, but I know it will not ever require more than n bytes, does it make sense to allocate space on the stack for the maximum size, instead of sticking it on the heap?
The purpose of this question is not to solve a problem, but to help me understand, so I would appreciate verbose and detailed answers!
The difference in performance between the stack and the heap comes due to the fact that objects in the heap may change size at run time, and they must then be reallocated somewhere else in the heap.
Now for the verbose part. Imagine you have an integer i32. This number will always be the same size, so any modifications made to it will occur in place. When it goes out of scope (it stops being needed in the program) it will either be deleted, or, a more efficient solution, it will be deleted along with the whole stack it belongs to.
Now you want to create a String. So you create it in the heap and give it a value. And then you modify it and add some characters to it. Now two things can happen.
There is free memory after the string, so the allocator uses this memory to write the new part.
There already is an object allocated in the memory right after the string, and of course, you don't want to overwrite it. So the allocator looks for the next free memory space with enough size to hold the new string and copies into it that string. Then deletes the old one, freeing that memory.
As you can see in the heap the number of operations to be made is incredibly higher than in the stack, so its performance will be lower.
Now, in your case, there are some methods specifically for memory reservation. String::reserve() and String::reserve_exact(). I would recommend you to look at the documentation for Rust always. Usually there already is a std method for what you want.

Delphi dynamic array efficiency

I am not a Delphi expert and I was reading online about dynamic arrays and static arrays. In this article I have found a chapter called "Dynamic v. Static Arrays" with a code snippet and below the author says:
[...] access to a dynamic array can be faster than a static array!
I have understood that dynamic arrays are located on the heap (they are implemented with references/pointers).
So far I know that the access time is better on dynamic arrays. But is that the same thing with the allocation? Like if I called SetLength(MyDynArray, 5) is that slower than creating a MyArray = array[0..4] of XXX?
So far I know that the access time is better on dynamic arrays.
That is not correct. The statement in that article is simply false.
But is that the same thing with the allocation? Like if I called SetLength(MyDynArray, 5) is that slower than creating a MyArray = array[0..4] of XXX?
A common fallacy is that static arrays are allocated on the heap. They could be global variables, and so allocated automatically when the module is loaded. They could be local variables and allocated on the stack. They could be dynamically allocated with calls to New or GetMem. Or they could be contained in a compound type (e.g. a record or a class) and so allocated in whatever way the owning object is allocated.
Having got that clear, let's consider a couple of common cases.
Local variable, static array type
As mentioned, static arrays declared as local variables are allocated on the stack. Allocation is automatic and essentially free. Think of the allocation as being performed by the compiler (when it generates code to reserve a stack frame). As such there is no runtime cost to the allocation. There may be a runtime cost to access because this might generate a page fault. That's all perfectly normal though, and if you want to use a small fixed size array as a local variable then there is no faster way to do it.
Member variable of a class, static array type
Again, as described above, the allocation is performed by the containing object. The static array is part of the space reserved for the object and when the object is instantiated sufficient memory is allocated on the heap. The cost for heap allocation does not typically depend significantly on the size of the block to be allocated. An exception to that statement might be really huge blocks but I'm assuming your array is relatively small in size, tens or hundreds of bytes. Armed with that knowledge we can see again that the cost for allocation is essentially zero, given that we are already allocating the memory for the containing object.
Local variable, dynamic array type
A dynamic array is represented by a pointer. So your local variable is a pointer allocated on the stack. The same argument applies as for any other local variable, for instance the local variable of static array type discussed above. The allocation is essentially free. Before you can do anything with this variable though, you need to allocate it with a call to SetLength. That incurs a heap allocation which is expensive. Likewise when you are done you have to deallocate.
Member variable of a class, dynamic array type
Again, allocation of the dynamic array pointer is free, but you must call SetLength to allocate. That's a heap allocation. There needs to be a deallocation too when the object is destroyed.
Conclusion
For small arrays, whose lengths are known at compile time, use of static arrays results in more efficient allocation and deallocation.
Note that I am only considering allocation here. If allocation is a relatively insignificant portion of the time spent working with the object then this performance characteristic may not matter. For instance, suppose the array is allocated at program startup, and then used repeatedly for the duration of the program. In such a scenario the access times dominate the allocation times and the difference between allocation times becomes insignificant.
On the flip side, imagine a short function called repeatedly during the programs lifetime, let's suppose this function is the performance bottleneck. If it operates on a small array, then it is possible that the allocation cost of using a dynamic array could be significant.
Very seldom can you draw hard and fast rules with performance. You need to understand how the tools work, and understand how your program uses these tools. You can then form opinions on which coding strategies might perform best, opinions that you should then test by profiling. You will be surprised more often than you might expect that your intuition is not a good predictor of performance.

boost lockfree spsc_queue cache memory access

I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where the pointer of a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to access the element, does it not need to access the pointer itself by finding and loading the element either from either the L3 cache or across the interconnect (if the other thread is on a different cpu socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ principally a pay-for-what-you-need eco-system.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
Enter Ringbuffers
Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, one would just have to maintain the index of the buffer front and end positions atomically. The container could, at leisure update in internal 'dirty front index' while it copies ahead of the external front. (The copy can use relaxed memory ordering). Only as soon as the whole copy is known to have completed, the external front index is updated. This update needs to be in acq_rel/cst memory order[1].
As long as the container is able to guard the invariant that the front never fully wraps around and reaches back, this is a sweet deal. I think this idea was popularized in the Disruptor Library (of LMAX fame). You get mechanical resonance from
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple) physical cache lines
all the data is local unless the POD contains raw references outside that record
How Does Boost's spsc_queue Actually Do This?
Yes, spqc_queue stores the raw element values in a contiguous aligned block of memory: (e.g. from compile_time_sized_ringbuffer which underlies spsc_queue with statically supplied maximum capacity:)
typedef typename boost::aligned_storage<max_size * sizeof(T),
boost::alignment_of<T>::value
>::type storage_type;
storage_type storage_;
T * data()
{
return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there are only the "internal" index on either read or write side. This is possible because there's only one writing thread and also only one reading thread, which means that there could only be more space at the end of write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::unitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the unitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)
All in all, we see a best-in-class possible idea for a ringbuffer
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing frequently used data right there in the value, and referencing the less-of-used data using a pointer (the cost of the pointer will be low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can achieve memory access patterns there too.
Let your profiler guide you.
[1] I suppose say for spsc acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend anyone else to follow my example :)

Linked List in OpenCL

I have 1000 float datas in an array. I want to separate into different classes, lets say 4 classes. Their sizes are unpredictable. I could easily hold them in a linked list in a CPU implementation, but in OpenCL kernel, is there an opportunity like that? In my mind there are 3 solution to this problem.
First, arrays with length 1000 constructed in number of classes, which is memory costly.
Second, I allocate an array with length 1000 and separate them into parts. However, I may transport the values from and index into different index, becuase I don't know the size of each classes and they may exceed the size which I provided for each.
Third, and better in my opinion, I get two different array with same length. One of them stores data, the other one stores pointers. For example, in i-th index of data array, the value is stored which belongs to 2nd class. Additionally in i-th index of pointer to the next data which belongs to 2nd class. But this is good for just atomic type (like int, float, char etc) linked lists.
I am new in OpenCL. I haven't known lots of features of it yet. If there is a better way, please don't share with me and others.
Using pointers on GPU is usually very bad idea. Major amount of data resides in global memory, and to fetch it quickly the access should be coalesced. Using pointers breaks the access pattern totally, making it essentially random. It's not very good on CPUs too since it cause a lot of cache misses, but CPUs have larger caches and "smarter" internal logic, so it's usually not so important, but sometimes cache-aware memory access pattern can increase CPU application's speed by nearly order of magnitude. On GPUs coalesced global memory access is one of most important optimizations, and pointers can't provide it.
If you are not extremely short on memory, I'd suggest to use first way and preallocate arrays large enough to hold all data. If you are really short on memory, you could use textures to store your data and pointer arrays, but it depends on the algorithm whether it would provide any benefits or not.

Working with large arrays - OutOfRam

I have an algorithm where I create two bi-dimensional arrays like this:
TYPE
TPtrMatrixLine = array of byte;
TCurMatrixLine = array of integer;
TPtrMatrix = array of TPtrMatrixLine;
TCurMatrix = array of TCurMatrixLine;
function x
var
PtrsMX: TPtrMatrix;
CurMx : TCurMatrix;
begin
{ Try to allocate RAM }
SetLength(PtrsMX, RowNr+1, ColNr+1);
SetLength(CurMx , RowNr+1, ColNr+1);
for all rows do
for all cols do
FillMatrixWithData; <------- CPU intensive task. It could take up to 10-20 min
end;
The two matrices have always the same dimension.
Usually there are only 2000 lines and 2000 columns in the matrix but sometimes it can go as high as 25000x6000 so for both matrices I need something like 146.5 + 586.2 = 732.8MB of RAM.
The problem is that the two blocks need to be contiguous so in most cases, even if 500-600MB of free RAM doesn't seem much on a modern computer, I run out of RAM.
The algorithm fills the cells of the array with data based on the neighbors of that cell. The operations are just additions and subtractions.
The TCurMatrixLine is the one that takes a lot or RAM since it uses integers to store data. Unfortunately, values stored may have sign so I cannot use Word instead of integers. SmallInt is too small (my values are bigger than SmallInt, but smaller than Word). I hope that if there is any other way to implement this, it needs not to add a lot of overhead, since processing a matrix with so many lines/column already takes a lot of time. In other words I hope that decreasing memory requirements will not increase processing time.
Any idea how to decrease the memory requirements?
[I use Delphi 7]
Update
Somebody suggested that each row of my array should be an independent uni-dimensional array.
I create as many rows (arrays) as I need and store them in TList. Sound very good. Obviously there will be no problem allocation such small memory blocks. But I am afraid it will have a gigantic impact on speed. I use now
TCurMatrixLine = array of integer;
TCurMatrix = array of TCurMatrixLine;
because it is faster than TCurMatrix= array of array of integer (because of the way data is placed in memory). So, breaking the array in independent lines may affect the speed.
The suggestion of using a signed 2 byte integer will greatly aid you.
Another useful tactic is to mark your exe as being LARGE_ADDRESS_AWARE by adding {$SetPEFlags IMAGE_FILE_LARGE_ADDRESS_AWARE} to your .dpr file. This will only help if you are running on 64 bit Windows and will increase your address space from 2GB to 4GB.
It may not work on Delphi 7 (I seem to recall you are using D7) and you must be using FastMM since the old Borland memory manager isn't compatible with large address space. If $SetPEFlags isn't available you can still mark the exe with EDITBIN.
If you still encounter difficulties then yet another trick is to do allocate smaller sub-blocks of memory and use a wrapper class to handle mapping indices to the appropriate sub-block and offset within. You can use a default index property to make this transparent to the calling code.
Naturally a block allocated approach like this does incur some processing overhead but it's your best bet if you are having troubles with getting contiguous blocks.
If the absolute values of elements of CurMx fits word then you can store it in word and use another array of boolean for its sign. It reduces 1 byte for each element.
Have you considered to manually allocate the data structure on the heap?
...and measured how this will affect the memory usage and the performance?
Using the heap might actually increase speed and reduce the memory usage, because you can avoid the whole array to be copied from one memory segment to another memory segment. (Eg. if your FillMatrixWithData are declared with a non-const open array parameter).

Resources