Fortran program segfaults when compiled with armflang at -Ofast level

I have a Fortran program that works with armflang at the -O3 level but segfaults at the -Ofast optimization level. What could be going wrong?
Update: The problem is not specific to a single workload. It happens for certain NCAR workloads and for WRF 3.9.1.

This could be because of the -fstack-arrays option, which is enabled by default at -Ofast. According to the Arm Fortran Compiler (armflang) documentation, -fstack-arrays is not enabled at -O3.
The -fstack-arrays option places automatic arrays of all sizes on the local stack. This usually leads to better performance because it avoids indirect calls to malloc() and free() for local and temporary arrays. On typical Linux systems, the stack size per process is 8192 kB by default (which can usually be increased, because the hard limit is "unlimited"). This creates a problem for programs with huge arrays, leading to non-obvious segfaults.
There are two ways to work around this problem:
Use -Ofast -fno-stack-arrays instead, which disables placing automatic arrays on the local stack but keeps all other -Ofast optimizations.
Call ulimit -s unlimited before running the program, if the system permits it. This raises the stack size limit above the default.
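To make the failure mode concrete, here is a minimal sketch in C rather than Fortran (the names and sizes are mine, purely illustrative): a single large automatic array placed on the stack blows through the default 8192 kB limit, in the same way a big Fortran temporary does under -fstack-arrays.

#include <stdio.h>

/* Illustrative only: a 64 MB automatic (stack) array, far larger than the
 * typical 8192 kB default stack limit, so calling this function will normally
 * segfault unless the limit is raised (e.g. with `ulimit -s unlimited`). */
static void use_big_stack_array(void)
{
    double big[8 * 1024 * 1024];                   /* 8 Mi doubles = 64 MB on the stack */
    for (size_t i = 0; i < sizeof big / sizeof big[0]; ++i)
        big[i] = (double)i;                        /* touch every page to force the fault */
    printf("big[0] = %f\n", big[0]);               /* keep the array from being optimized out */
}

int main(void)
{
    use_big_stack_array();
    return 0;
}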

There is no MWE, and you have the answer before asking the question!
I am surprised by your choice of -fstack-arrays with -Ofast. I have been reviewing the stack vs. heap performance of local and automatic arrays, for both single-threaded and multi-threaded code.
My experience is that in the situations where stack overflow occurs, i.e. for larger arrays, there is no significant advantage to putting large arrays on the stack. Small arrays do show some performance benefit, especially if they can reside in the cache.
There needs to be some recognition of the sizes of the arrays being placed on the stack, as a more robust solution is to keep the larger arrays on the heap.
For multi-threading, where each thread has its own stack, I am seeing some benefit from increasing the thread stack size and placing all arrays on each thread's separate stack. In my testing, this appears to reduce memory coherence problems, although conclusive proof can be elusive.
I would be interested in your reasoning for the choice of -Ofast and -fstack-arrays. Have you tested a size limit for local and automatic arrays being placed on the stack?

Related

Do desktop processors support weakly ordered memory?

Are there any Intel/AMD desktop processors that support weakly ordered memory, or is this a feature of a server setup with multiple processors?
I'm not sure if this is what you're after, but IIRC there have been some architectures that had weakly-ordered memory accesses in that they could be ordered arbitrarily, and you had to insert memory barriers to ensure a particular ordering.
Modern processors use what's called a "load-store queue" that hides memory reordering, making it look almost as though it's happening in program order. Reads are often reordered (but with some care), writes can be made out of order but are committed in-order (although multiple writes to the same location are consolidated), and reads and writes are reordered wrt each other only carefully and speculatively. The latter is called "hoisting", where a read is performed speculatively ahead of a write (that appears earlier in the instruction sequence) and may be canceled (like a mispredicted branch) if it turns out that preceding write would have affected it.
Also, if memory is marked as uncached, then CPUs generally infer that to mean it's I/O space and perform no access reordering. x86 and SPARC are like this. However, PowerPC will still reorder reads to I/O memory space, and we have to use the EIEIO (ensure in order execution of I/O) instruction to force a particular ordering. IIRC, we also had to use memory barriers on PA-RISC and Alpha. Moreover, there are memory barriers on x86, but I'm not familiar with their use (possibly to ensure ordering of accesses to cached memory space).
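For what it's worth, the portable way to express such barriers in C these days is C11 atomics. Below is a hedged sketch (the variable names are mine, not from any real codebase) of a release/acquire pair that enforces the kind of ordering discussed above:

#include <stdatomic.h>

int payload;                 /* ordinary data written by the producer */
atomic_int ready;            /* flag used to publish the payload      */

void producer(void)
{
    payload = 42;            /* plain store */
    /* Release barrier: the payload store cannot be reordered after this flag store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

void consumer(void)
{
    /* Acquire barrier: once ready == 1 is observed, the payload store is visible too. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                    /* spin until the producer publishes */
    /* payload is guaranteed to read as 42 here */
}

On x86 this acquire/release pair compiles to plain loads and stores because the hardware is already strongly ordered; on weaker architectures the compiler emits the appropriate fence instructions.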
You mention multi-core systems. In general, elaborate cache coherency protocols are employed to make all memory access appear to conform to certain interleaving rules, such that accesses hit the last-level caches and main memory in an order that would be possible if there were no caching.
Many modern processors now use out-of-order execution to improve performance by hiding memory latencies. This is not related to multiple processors/cores, it can be done with a single-cored processor acting alone. For this reason, you should not rely on memory ordering.

Why is there a stack and a heap?

Why do assembly languages use both a stack and a heap? They seem redundant.
They're not redundant. Each of them has strengths and weaknesses: A stack is faster if used right, because memory allocation is trivial (push / pop). The downside is that you can only add and remove items at the top (hence the name, stack). Also, total stack space is limited, and when you run out, you have a... well, stack overflow. The heap, by contrast, allows random allocation and deallocation, and you can store large amounts of data there, but the downside is that allocation carries more overhead - for each allocated block of memory, a suitable free portion must be found, and in the long run, fragmentation of the free space needs to be avoided, and the system must track where the free blocks are.
You use the stack to pass around small short-lived values, e.g. local counter variables, function arguments, return values, etc.; these lend themselves to push/pop allocation style. For larger or long-lived data structures, you use the heap.
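As a small, hypothetical illustration of that convention: short-lived scalars get automatic (stack) storage for free, while a buffer that must outlive the function goes on the heap.

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: the length lives on the stack, the copy lives on the heap. */
char *make_copy(const char *text)
{
    size_t len = strlen(text);      /* small, short-lived: automatic (stack) storage */
    char *copy = malloc(len + 1);   /* must survive the return: heap storage         */
    if (copy != NULL)
        memcpy(copy, text, len + 1);
    return copy;                    /* caller frees it when the data is no longer needed */
}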
You could certainly construct a computing system that utilised either one of them as its only memory model. However, they both have rather different properties each with its own good and bad points. Most systems utilise both so as to get the benefits from each of them.
Stacks
A stack can be thought of as a pile of plates: you write a value on a plate and put it on top of the pile; this is called a push operation and stores a value on the stack. You can also remove the top plate from the pile; this is called a pop operation. But new allocations must always be made at the top of the stack.
The stack tends to be used for local variables and for passing values between functions. Generally, stacks have the following awesome properties:
Requires only a handful of pointers to manage
Very easy to implement in hardware; most processors have built-in hardware support for a stack, making it even faster.
Very quick to allocate memory
The problem with the stack comes from the fact that items can only be added to and removed from the top of the stack. This makes great sense when traversing up and down through function calls: pop the function's inputs from the stack, allocate space for local variables on the stack, run the function, clear the local variables from the top of the stack, and push the return value onto the stack. If, on the other hand, I want to allocate some memory and, say, pass it to another thread, or in general free it far away from where it was allocated, all of a sudden I have a problem: the stack is not in the correct position when I want to free the memory.
You could say the stack facilitates fast sequential memory allocation.
Heap
Now the heap is different: each allocation is generally tracked separately. This causes a lot of overhead for allocations and deallocations, but each one can be handled independently of other memory allocations (well, until you run out of memory).
There are numerous algorithms for accomplishing this and it is probably a bit unwise to twitter on about them here but here is a link that talks about a few good simple heap allocation algorithms: Alternatives to malloc and new
So the heap facilitates random memory allocation, but this comes with a runtime penalty; however, that penalty is often smaller than the one that would be incurred if you had to handle the same situation using just the stack.
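A tiny sketch of that independence (nothing library-specific, just malloc/free): heap blocks can be released in any order, unrelated to the order in which they were allocated, which a pure stack discipline cannot offer.

#include <stdlib.h>

int main(void)
{
    int *a = malloc(100 * sizeof *a);   /* three independent heap allocations */
    int *b = malloc(200 * sizeof *b);
    int *c = malloc(300 * sizeof *c);

    free(b);    /* released out of allocation order: perfectly fine on the heap */
    free(a);
    free(c);
    return 0;
}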
It is about memory handling and management.
There are different types of registers in the x86 architecture.
There are also possibilities for hardware-supported memory management on the x86 architecture, and so on.
The stack is managed through the stack pointer (and holds return addresses for the instruction pointer), while the heap lives in the data segment in some applications.
To read more I advise you to read the following links:
http://en.wikipedia.org/wiki/Data_segment
http://en.wikipedia.org/wiki/X86_memory_segmentation
"A memory model allows a compiler to perform many important optimizations" - Wikipedia

Multithreaded Heap Management

In C/C++ I can allocate memory in one thread and delete it in another thread. Yet whenever one requests memory from the heap, the heap allocator needs to walk the heap to find a suitably sized free area. How can two threads access the same heap efficiently without corrupting the heap? (Is this done by locking the heap?)
In general, you do not need to worry about the thread-safety of your memory allocator. All standard memory allocators -- that is, those shipped with MacOS, Windows, Linux, etc. -- are thread-safe. Locks are a standard way of providing thread-safety, though it is possible to write a memory allocator that only uses atomic operations rather than locks.
Now it is an entirely different question whether those memory allocators scale; that is, is their performance independent of the number of threads performing memory operations? In most cases, the answer is no; they either slow down or can consume a lot more memory. The first scalable allocator in both dimensions (speed and space) is Hoard (which I wrote); the Mac OS X allocator is inspired by it -- and cites it in the documentation -- but Hoard is faster. There are others, including Google's tcmalloc.
Yes an "ordinary" heap implementation supporting multithreaded code will necessarily include some sort of locking to ensure correct operation. Under fairly extreme conditions (a lot of heap activity) this can become a bottleneck; more specialized heaps (generally providing some sort of thread-local heap) are available which can help in this situation. I've used Intel TBB's "scalable allocator" to good effect. tcmalloc and jemalloc are other examples of mallocs implemented with multithreaded scaling in mind.
Some timing comparisons comparisons between single threaded and multithread-aware mallocs here.
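As a hedged sketch of what that thread-safety means in practice (plain malloc/free and POSIX threads, nothing allocator-specific): one thread can allocate a block and a different thread can free it, and the allocator's internal locking keeps the heap consistent.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static void *worker(void *arg)
{
    char *msg = arg;     /* block was malloc'ed by the main thread            */
    /* ... use msg ... */
    free(msg);           /* freed in a different thread: safe with any
                            thread-safe allocator (glibc, tcmalloc, jemalloc) */
    return NULL;
}

int main(void)
{
    char *msg = malloc(64);
    if (msg == NULL)
        return 1;
    strcpy(msg, "allocated in main, freed in the worker");

    pthread_t t;
    pthread_create(&t, NULL, worker, msg);
    pthread_join(t, NULL);
    return 0;
}

(Compile with -pthread.)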
This is an Operating Systems question, so the answer is going to depend on the OS.
On Windows, each process gets its own heap. That means multiple threads in the same process are (by default) sharing a heap. Thus the OS has to thread-synchronize its allocation and deallocation calls to prevent heap corruption. If you don't like the idea of the possible contention that may ensue, you can get around it by using the Heap* routines. You can even overload malloc (in C) and new (in C++) to call them.
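For instance, a private heap created with the Heap* API can be flagged HEAP_NO_SERIALIZE when you can guarantee that only one thread touches it, which avoids the locking altogether. A rough sketch, with error handling trimmed:

#include <windows.h>

int main(void)
{
    /* A private, growable heap used by a single thread, so the internal lock is skipped. */
    HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);
    if (heap == NULL)
        return 1;

    double *data = HeapAlloc(heap, 0, 1000 * sizeof *data);
    /* ... use data ... */
    HeapFree(heap, 0, data);

    HeapDestroy(heap);   /* releases anything still allocated from this heap */
    return 0;
}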
I found this link.
Basically, the heap can be divided into arenas. When memory is requested, each arena is checked in turn to see whether it is locked. This means that different threads can safely access different parts of the heap at the same time. Frees are a bit more complicated, because each block must be freed back to the arena it was allocated from. I imagine a good implementation will get different threads to default to different arenas to try to minimize contention.
Yes, normally access to the heap has to be locked. Any time you have a shared resource, that resource needs to be protected; memory is a resource.
This will depend heavily on your platform/OS, but I believe this is generally OK on major systems. C/C++ do not define threads, so by default I believe the answer is "the heap is not protected" and that you must have some sort of multithreaded protection for your heap access.
However, at least with linux and gcc, I believe that enabling -pthread will give you this protection automatically...
Additionally, here is another related question:
C++ new operator thread safety in linux and gcc 4

When does the stack really overflow?

Is infinite recursion the only case or can it happen for other reasons?
Doesn't the stack size grow as needed, the same as the heap?
Sorry if this question has been asked before, would appreciate links to them if that is the case.
I can't speak for all platforms, but as it happens, I've just spent some time working with Windows .exe files (I mean, actually studying the binary format of them - I know in a sense all of us here work with executable files ;) ). I'm betting that most other platforms have similar capabilities, but I'm not immediately familiar with them.
Part of the file format itself includes two values relevant to the current discussion:
typedef struct _IMAGE_OPTIONAL_HEADER {
...
DWORD SizeOfStackReserve;
DWORD SizeOfStackCommit;
...
} IMAGE_OPTIONAL_HEADER32, *PIMAGE_OPTIONAL_HEADER32;
From MSDN:
SizeOfStackReserve
The number of bytes to reserve for the stack. Only the memory specified by the SizeOfStackCommit member is committed at load time; the rest is made available one page at a time until this reserve size is reached.
SizeOfStackCommit
The number of bytes to commit for the stack.
In other words, the linker specifies a maximum size for the program's stack. If you hit the maximum size, you overflow - no matter how you hit the maximum size. You could write a simple program to do it in one line of code just by allocating a single stack variable (say, an array) that's bigger than the maximum stack size. Or you could do it via infinite (or finite, but very deep) recursion, or just by allocating too many stack variables.
The Microsoft linker sets this value to 1MB by default on X86 platforms (4MB on Itanium systems). This seems small on the face of it, for a modern system. However, more modern versions of Windows interpret these values slightly differently. Instead of completely limiting the stack, it limits the physical memory the stack will use. If your stack grows beyond this, virtual memory will get involved, so you should still be good... assuming you have enough virtual memory.
Remember, it is possible to run out of memory, even on modern systems with huge amounts of RAM and plenty of virtual memory on disk. You just need to allocate really big amounts of data.
So, long story short: is it possible to overflow the stack without infinite recursion? Definitely. Is it likely? Not really, unless you're allocating really huge objects.
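For instance, here is the "single oversized stack variable" case from above as a short, purely illustrative C program (the size is chosen to exceed the default 1 MB stack reserve used by the Microsoft linker):

#include <string.h>

int main(void)
{
    char buffer[2 * 1024 * 1024];      /* 2 MB local array: bigger than the default
                                          1 MB stack reserve on x86 Windows          */
    memset(buffer, 0, sizeof buffer);  /* touching it is what actually faults        */
    return (int)buffer[0];
}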
The stack overflows when the stack pointer is pushed out of the memory block the operating system has allocated for the stack. Some operating systems will resize the stack as it grows (IIRC Linux does this) while in others the stack size is fixed at the start of the process or thread (IIRC Windows does this).
Possible reasons for overflowing the stack:
An unbounded number of stack frames (e.g. from unbounded recursion)
Attempting to allocate large blocks from the stack
Buffer overflows for buffers allocated on the stack
There are probably other reasons as well that I can't think of off the top of my head.
This question doesn't specify which stack is "the" stack. So, here are a few answers:
Call Stack
The call stack gets overflowed whenever the number of calls on the stack overruns the amount of memory it has. The most common way is infinite recursion, but it's quite possible to have recursion that's excessive but not infinite. For example, computing the Ackermann function naively will tax any computer.
Languages
Stack-based languages
Some languages, like Postscript and Forth, and some virtual machines, like the Java virtual machine, are stack-based. In these languages, it may be possible to make expressions so complex that they overflow the stack.
Context-free languages
Context-free languages are often implemented using a stack. If the strings of code in these languages get too complex, it's possible to overflow the stack.
On a laptop or desktop machine it may be unusual to overflow the stack without infinite (or very deeply nested) recursion when running from the main thread... however, stack overflows are not uncommon for:
Threaded code in which the thread has been allotted a small, fixed-sized stack.
Signal handling code in which the signal handling context has a small, fixed-sized stack.
Code executing on embedded devices, where memory is generally scarce.
As an example, if you register a signal handler using sigaction, and the signal handler does anything complex (i.e. deeply nested operations), it is very easy to get a stack overflow on a number of operating systems, since signal handlers are usually allotted a small, fixed-sized stack. Similarly, if you spawn a thread with pthread_create but specify a small stack size with pthread_attr_setstacksize, then it is very easy to get a stack overflow. On very memory-limited devices, such as wireless sensors, it is an art to avoid stack overflows.
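Here is a hedged sketch of the pthread case (the sizes and recursion depth are made up for illustration): a thread given the minimum legal stack via pthread_attr_setstacksize will overflow on recursion that the main thread's default stack would survive.

#include <pthread.h>
#include <limits.h>   /* PTHREAD_STACK_MIN */
#include <string.h>
#include <stdio.h>

static int deep(int n)
{
    char pad[1024];
    memset(pad, n, sizeof pad);                     /* burn about 1 KB of stack per frame */
    return n == 0 ? pad[0] : deep(n - 1) + pad[0];  /* non-tail call, so frames pile up    */
}

static void *worker(void *arg)
{
    (void)arg;
    printf("%d\n", deep(2000));     /* ~2 MB of frames: fine on the main thread's default
                                       stack, fatal on a minimal thread stack              */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, PTHREAD_STACK_MIN);   /* smallest stack allowed */

    pthread_t t;
    pthread_create(&t, &attr, worker, NULL);   /* this thread will very likely segfault */
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}

(Compile with -pthread; the overflow typically shows up as SIGSEGV in the worker thread.)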
My day job involves a lot of work with LotusScript in Lotus Notes, which has fixed stack limits for various scopes. E.g. most variables in a procedure/function must fit in a 32kB stack, except that the content of class variables is stored on the heap.
If fixed-size variables exceed the stack size, code won't compile.
Run-time stack overflows can occur with recursion. This is easy to achieve in LotusScript as it limits recursion of any single function to a 32kB stack. I gave up on using a recursive QuickSort years ago because of this.
If your program exceeds its alloted stack space without any infinite recursion going on, then you're doing something wrong.
Though it can happen if you leave off some asterisks and try to pass some huge buffers by value.
The memory allocated for the stack does generally grow as needed within reasonable boundaries - I'm not sure what the upper limit is on various systems.

Coping with, and minimizing, memory usage in Common Lisp (SBCL)

I have a VPS with not very much memory (256 MB) which I am trying to use for Common Lisp development with SBCL+Hunchentoot to write some simple web apps. A large amount of memory appears to be getting used without doing anything particularly complex, and after a while of serving pages it runs out of memory and either goes crazy using all the swap or (if there is no swap) just dies.
So I need help to:
Find out what is using all the memory (if it's libraries or me, especially)
Limit the amount of memory which SBCL is allowed to use, to avoid massive quantities of swapping
Handle things cleanly when memory runs out, rather than crashing (since it's a web-app I want it to carry on and try to clean up).
I assume the first two are reasonably straightforward, but is the third even possible?
How do people handle out-of-memory or constrained memory conditions in Lisp?
(Also, I note that a 64-bit SBCL appears to use literally twice as much memory as 32-bit. Is this expected? I can run a 32-bit version if it will save a lot of memory)
To limit the memory usage of SBCL, use the --dynamic-space-size option (e.g., sbcl --dynamic-space-size 128 will limit memory usage to 128 MB).
To find out who is using memory, you can call (room) (the function that reports how much memory is being used) at different times: at startup, after all libraries are loaded, and then during work (of course, call (sb-ext:gc :full t) before calling room so that you don't measure garbage that has not yet been collected).
Also, it is possible to use the SBCL profiler to measure memory allocation.
Find out what is using all the memory (if it's libraries or me, especially)
Attila Lendvai has some SBCL-specific code to find out where an allocated object comes from. Refer to http://article.gmane.org/gmane.lisp.steel-bank.devel/12903 and write him a private mail if needed.
Be sure to try another implementation, preferably with a precise GC (like Clozure CL) to ensure it's not an implementation-specific leak.
Limit the amount of memory which SBCL is allowed to use, to avoid massive quantities of swapping
Already answered by others.
Handle things cleanly when memory runs out, rather than crashing (since it's a web-app I want it to carry on and try to clean up).
256 MB is tight, but anyway: schedule a recurring (maybe once per second) timed thread that checks the remaining free space. If the free space is less than X, then use exec() to replace the current SBCL process image with a new one.
If you don't have any type declarations, I would expect 64-bit Lisp to take twice the space of a 32-bit one. Even a plain (small) int will use a 64-bit chunk of memory. I don't think it'll use less than a machine word, unless you declare it.
I can't help with #2 and #3, but if you figure out #1, I suspect it won't be a problem. I've seen SBCL/Hunchentoot instances running for ages. If I'm using an outrageous amount of memory, it's usually my own fault. :-)
I would not be surprised by a 64-bit SBCL using twice the memory, as it will probably use a 64-bit cell rather than a 32-bit one, but I couldn't say for sure without actually checking.
Typical things that keep memory hanging around for longer than expected are no-longer-useful references that still have a path to the root allocation set (hash tables are, I find, a good way of letting these things linger). You could try interspersing explicit calls to GC in your code and make sure to (as far as possible) not store things in global variables.
