I'm trying to understand what is stored in the stack in OptiX.
As I understand it, we set the stack size per context, and one stack is attached to each thread in the ray generation program.
When a ray is launched, the thread carries with it the stack, which stores the ray's payload.
I thought that, in a recursive ray tracer for example, a stack overflow would occur because there would be too many payloads to keep in memory. But right now I have a program with a radiance ray whose payload is a float + 3 uint, and a shadow ray with only a float, and there is only one bounce. However, my stack needs to be bigger than 1024 to avoid a stack overflow. Surely this is far more than just my two payloads.
So I wonder, what else is in the stack?
(I mean in general, not in my particular case. What is stored in the stack except the ray(s) payload(s) (if they are)? For example, do we also store information about the hits? about the scene tree? Do we keep track of which program called the current ray?)
Thanks for your help!
Answered on the NVIDIA board here
Detlef Roettger wrote
"The stack is also used to save and restore live variables around
function calls (e.g. rtTrace or callable programs). That's the
background for one of the performance advice in the OptiX Programming
Guide which starts with Try to minimize live state across calls to
rtTrace in programs."
More info on this at §3.1.3 - Global State in the OptiX Programming guide.
Remember that OptiX programs are full blown CUDA kernels combined together. Stack memory is therefore also used for ordinary execution needs (the amount is likely to vary even between CUDA versions).
Related
I am currently working on a project to develop an application on an STM32 microcontroller using an RTOS (Micrium).
Are there any tools to calculate the stack usage of a particular thread in an RTOS application?
No tools I know of. However, two simple methods to estimate stack usage have always worked for me.
1. Fill all RAM with a value like 0x55 or 0xAA. Let the program run long enough, exercising all of the device's options, to get the most code execution coverage. Stop (under some debugger) and examine RAM to see where those values have been overwritten. That should give you a good approximation. This works with or without an OS.
2. Modify the OS slightly so that, on each task switch, it records in a global array the lowest stack pointer seen so far for each task (by comparing against the previously recorded value for that task). After running the app long enough, as in method 1, examine the recorded values. There is no guarantee that a task will be at its maximum stack usage at the moment a task switch happens, but statistically, after a long enough time and assuming preemptive switching, you will have recorded an accurate enough value.
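The first method (often called stack painting) can be sketched in a few lines of C. This is only a minimal illustration, assuming a stack that grows downwards and that you know the start and size of the stack region; `stack_paint`, `stack_high_water` and `STACK_FILL` are names made up for this sketch, not part of any RTOS API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xAAu  /* fill pattern, one of the values suggested above */

/* Paint a not-yet-used stack region with the fill pattern. */
void stack_paint(uint8_t *region, size_t size)
{
    assert(region != NULL);
    memset(region, STACK_FILL, size);
}

/*
 * Return how many bytes of the region appear to have been used.
 * Assumption: the stack grows downwards, so usage starts at
 * region[size - 1] and moves towards region[0]; the untouched
 * pattern is therefore found at the low addresses.
 */
size_t stack_high_water(const uint8_t *region, size_t size)
{
    size_t untouched = 0;

    assert(region != NULL);
    while (untouched < size && region[untouched] == STACK_FILL) {
        untouched++;
    }
    return size - untouched;
}
```

Note that this can underestimate usage if the deepest bytes written to the stack happen to equal the fill pattern, which is one reason the result is only an approximation.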
If you are using GCC or clang, the -fstack-usage compiler switch generates a stack frame size for each function. You need to combine that information with call-graph information generated by the linker to find the deepest stack usage starting from a specific function. Starting at main(), a task entry point, or an ISR will then give you the worst-case usage for that thread.
Helpfully the work to create such a tool has been done for you as discussed here, using a Perl script from here.
ARM's armcc compiler v5 and earlier (v6 is clang/llvm) has this functionality built-in and can include detailed stack analysis in the link map, including the worst-case call path and warnings of non-deterministic stack usage (due to recursion or call-backs through function pointers for example). You may be using armcc if you are using Keil ARM MDK for example. Again for multi-threaded systems (tasks/ISRs) you need to look at the stack usage for the thread entry point.
Note also that on ARM Cortex-M, the "system stack" is shared by the main() thread and all ISRs, and if you use ISR preemption priorities, multiple interrupts may be active simultaneously. So in theory the worst-case stack usage is the sum of the stack usage of main() and of all ISRs that may be active concurrently. Whilst it is good practice to keep ISRs short and simple, beware of third-party code. ST's USB library, for example, runs the entire USB device stack in the ISR context!
I stumbled upon PF_RING while reading about PACKET_MMAP kernel documentation (https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt)
Can someone explain the technical differences (implementation details) between PF_RING and PACKET_RX_RING/PACKET_TX_RING in PACKET_MMAP?
PF_RING has two very different modes of operation.
The one called "vanilla" operates "above" driver level, so it should be mostly similar to PACKET_MMAP. They both simply share a buffer between the user application and the network stack. I think PF_RING also discards the packets, so you could say it is exclusive. PACKET_MMAP, on the contrary, lets the kernel stack process the packets after the copy to userspace.
The "DNA" or "zero-copy" mode implements kernel bypassing. Instead of copying data to a shared ring buffer, the driver's buffers themselves are shared. This, obviously, requires custom drivers and means no other processes will be able to receive traffic from the affected interfaces. Many commonplace cards are supported. Because this reduces copying, context switches, and interrupts (you can poll if you want to), you can squeeze out quite a lot more performance. The upstream technology that comes the closest is AF_XDP.
I may have gotten some things wrong (I just Googled for a bit out of curiosity and am by no means an expert in PF_RING), so watch out for other answers. I do think most of what I wrote is accurate.
I'm trying to understand the basics of concurrent programming in Go. Almost all articles use the term "address space", for example: "All goroutines share the same address space". What does it mean?
I've tried to understand the following topics from wiki, but it wasn't successful:
http://en.wikipedia.org/wiki/Virtual_memory
http://en.wikipedia.org/wiki/Memory_segmentation
http://en.wikipedia.org/wiki/Page_(computer_memory)
...
However, at the moment it's difficult for me to understand, because my knowledge of areas like memory management and concurrent programming is really poor. There are many unknown words like segments, pages, relative/absolute addresses, VAS, etc.
Could anybody explain to me the basics of the problem? Maybe there are some useful articles that I can't find.
Golang spec:
A "go" statement starts the execution of a function call as an independent concurrent thread of control, or goroutine, within the same address space.
Could anybody explain to me the basics of the problem?
"Address space" is a generic term which can apply to many contexts:
Address spaces are created by combining enough uniquely identified qualifiers to make an address unambiguous (within a particular address space)
Dave Cheney's presentation "Five things that make Go fast" illustrates the main issue addressed by having goroutine within the same process address space: stack management.
Dave qualifies the "address space", speaking first of threads:
Because a process switch can occur at any point in a process’ execution, the operating system needs to store the contents of all of these registers because it does not know which are currently in use.
This lead to the development of threads, which are conceptually the same as processes, but share the same memory space.
(so this is about memory)
Then Dave illustrates the stack within a process address space (the addresses managed by a process):
Traditionally inside the address space of a process,
the heap is at the bottom of memory, just above the program (text) and grows upwards.
The stack is located at the top of the virtual address space, and grows downwards.
See also "What and where are the stack and heap?".
The issue:
Because the heap and stack overwriting each other would be catastrophic, the operating system usually arranges to place an area of unwritable memory between the stack and the heap to ensure that if they did collide, the program will abort.
With threads, that can end up restricting the heap size of a process:
as the number of threads in your program increases, the amount of available address space is reduced.
Goroutines use a different approach, while still sharing the same process address space:
what about the stack requirements of those goroutines ?
Instead of using guard pages, the Go compiler inserts a check as part of every function call to check if there is sufficient stack for the function to run. If there is not, the runtime can allocate more stack space.
Because of this check, a goroutines initial stack can be made much smaller, which in turn permits Go programmers to treat goroutines as cheap resources.
Go 1.3 introduces a new way of managing those stacks:
Instead of adding and removing additional stack segments, if the stack of a goroutine is too small, a new, larger, stack will be allocated.
The old stack’s contents are copied to the new stack, then the goroutine continues with its new larger stack.
After the first call to H the stack will be large enough that the check for available stack space will always succeed.
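To make the "check, then copy to a larger stack" idea concrete, here is a rough sketch in C. It is not how the Go runtime is implemented; it only mimics the strategy described above with a heap-allocated block. The type `gstack` and the functions `gstack_init`, `gstack_push_frame` and `gstack_free` are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable stack; NOT the Go runtime's data structure. */
typedef struct {
    unsigned char *base;  /* start of the current block */
    size_t used;          /* bytes currently in use */
    size_t cap;           /* size of the current block */
    unsigned grows;       /* how many times the stack was copied */
} gstack;

/* Start with a deliberately small block. Returns 0 on success. */
int gstack_init(gstack *s, size_t initial_cap)
{
    assert(s != NULL && initial_cap > 0);
    s->base = malloc(initial_cap);
    s->used = 0;
    s->cap = initial_cap;
    s->grows = 0;
    return s->base != NULL ? 0 : -1;
}

/*
 * Reserve `need` bytes for a new frame. This plays the role of the
 * check inserted at every function call: if there is not enough room,
 * allocate a larger block, copy the old contents, and continue there.
 * Pointers returned earlier become invalid after a copy; a real
 * runtime has to fix up such pointers, which this sketch ignores.
 */
void *gstack_push_frame(gstack *s, size_t need)
{
    assert(s != NULL && s->base != NULL);
    if (s->cap - s->used < need) {
        size_t new_cap = s->cap;
        while (new_cap - s->used < need) {
            new_cap *= 2;  /* overflow of new_cap is not handled here */
        }
        unsigned char *bigger = malloc(new_cap);
        if (bigger == NULL) {
            return NULL;
        }
        memcpy(bigger, s->base, s->used);
        free(s->base);
        s->base = bigger;
        s->cap = new_cap;
        s->grows++;
    }
    void *frame = s->base + s->used;
    s->used += need;
    return frame;
}

void gstack_free(gstack *s)
{
    free(s->base);
    s->base = NULL;
    s->used = s->cap = 0;
}
```

Once the block has grown, later frames of the same size fit without another copy, which matches the observation that after the first growth the check keeps succeeding.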
When your application runs, it is loaded into RAM, and the memory manager allocates addresses to it. That range of addresses is referred to as its address space.
Concept:
The processor (CPU) executes instructions in a fetch-decode-execute
cycle. The instructions of an application are first loaded into
RAM (random-access memory), because fetching them all the way from
disk every time would be very inefficient. Someone needs to keep
track of memory usage, so the operating system implements a memory
manager. Your application consists of some program, in your case
written in the Go programming language. When you run your program,
the OS executes its instructions in the above-mentioned fashion.
Reading your post, I can empathize. The terms you mentioned will become familiar to you as you program more and more.
I first encountered these terms from the operating systems book, a.k.a the dinosaur book.
Hope this helps you.
So I understand what a stack overflow is: memory collides (hence the title of this website). What I do not understand is why new entries on the stack are placed at decreasing memory addresses. Why are they not placed at random memory addresses? Would that not make more sense, so that memory collision is not an issue? I am guessing there is some sort of optimization reason behind it?
** EDIT **
What I did not realize is a stack is given x amount of address space. Makes sense now but brings me to a follow-up question. Can I explicitly state how much memory I want to allocate to a stack?
"Memory collides" would better suit the term of "buffer overflow", where you write outside of the predestined space, but where it is likely to be within a different allocated memory block.
A stack overflow is not about writing outside of one's memory allocation into another memory allocation. It's just about writing outside of one's stack memory allocation. Most likely outside of the stack there's a guard memory page, that is not allocated for anything and which causes a fault on a read or write attempt.
And assigning a random address for each value pushed on the stack makes it hard to find data on the stack (and it's not a stack anymore). When the compiler or programmer knows that subsequent elements occupy subsequent addresses, then it's easy to compute those addresses just from the base pointer of the stack frame.
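As a small illustration of "base pointer plus known offset", here is a sketch in C. The `struct frame` layout and the helper `local_address` are invented for this example; real compilers choose their own frame layout, but the principle of a fixed offset from a base address is the same.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout of one function's locals inside its stack frame. */
struct frame {
    int    counter;
    double total;
    char   name[16];
};

/*
 * Address of a local = frame base + an offset that is fixed at
 * compile time. No lookup or bookkeeping is needed at run time.
 */
void *local_address(void *frame_base, size_t offset)
{
    assert(frame_base != NULL);
    return (char *)frame_base + offset;
}
```

For instance, `local_address(&f, offsetof(struct frame, total))` yields the same address as `&f.total` for any `struct frame f`. With randomly placed values there would be no such fixed offset to compute from.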
The answer to this question is probably complex, but basically stack operations are considered to be very primitive functions that the processor does as part of normal execution of code. (Saving return addresses and other stuff.)
So where do you put the memory management code? Where do you track the allocated addresses or add code to allocate new addresses? There really isn't anywhere to do this as these are basic operations performed by the processor itself.
Similar to the memory that holds the code itself, the stack is assumed to be set up before the code runs (and pointed to by the stack register). There really isn't any place to add complex memory management to stack memory. And so, yes, if not enough memory was provided, the stack will overflow.
Stack overflow is when you have used up all available stack space. The space available for the stack is, in most cases just an arbitrary limit chosen by the system designers. It is possible to alter this, but on modern systems, it's not really an issue - code that needs several megabytes of stack, unless the system is REALLY huge, is probably not correctly designed.
The stack grows towards zero by convention: it has to go in a defined direction or it would be very hard to follow what is going on, and a lower address is just as good as a higher address. It used to be that the stack and heap grew towards each other, which allowed code that uses a lot of stack and not much heap to work in the same amount of memory as code that uses less stack and more heap. But these days there is typically enough memory (address space) that the heap can be placed somewhere completely separate from the stack. Instead, stack overflow is detected by having a region of "reserved" memory just beyond the end of the stack that is not usable, so the OS gets a "trap" when that memory is accessed, and the application can be killed.
Is infinite recursion the only case or can it happen for other reasons?
Doesn't the stack size grow as needed same as heap?
Sorry if this question has been asked before, would appreciate links to them if that is the case.
I can't speak for all platforms, but as it happens, I've just spent some time working with Windows .exe files (I mean, actually studying their binary format - I know in a sense all of us here work with executable files ;) ). I'm betting that most other platforms have similar capabilities, but I'm not immediately familiar with them.
Part of the file format itself includes two values relevant to the current discussion:
typedef struct _IMAGE_OPTIONAL_HEADER {
    ...
    DWORD SizeOfStackReserve;
    DWORD SizeOfStackCommit;
    ...
} IMAGE_OPTIONAL_HEADER32, *PIMAGE_OPTIONAL_HEADER32;
From MSDN:
SizeOfStackReserve
The number of bytes to reserve for the
stack. Only the memory specified by
the SizeOfStackCommit member is
committed at load time; the rest is
made available one page at a time
until this reserve size is reached.
SizeOfStackCommit
The number of bytes to commit for the
stack.
In other words, the linker specifies a maximum size for the program's stack. If you hit the maximum size, you overflow - no matter how you hit the maximum size. You could write a simple program to do it in one line of code just by allocating a single stack variable (say, an array) that's bigger than the maximum stack size. Or you could do it via infinite (or finite, but very deep) recursion, or just by allocating too many stack variables.
The Microsoft linker sets this value to 1MB by default on X86 platforms (4MB on Itanium systems). This seems small on the face of it, for a modern system. However, more modern versions of Windows interpret these values slightly differently. Instead of completely limiting the stack, it limits the physical memory the stack will use. If your stack grows beyond this, virtual memory will get involved, so you should still be good... assuming you have enough virtual memory.
Remember, it is possible to run out of memory, even on modern systems with huge amounts of RAM and plenty of virtual memory on disk. You just need to allocate really big amounts of data.
So, long story short: is it possible to overflow the stack without infinite recursion? Definitely. Is it likely? Not really, unless you're allocating really huge objects.
The stack overflows when the stack pointer is pushed out of the memory block the operating system has allocated for the stack. Some operating systems will resize the stack as it grows (IIRC Linux does this) while in others the stack size is fixed at the start of the process or thread (IIRC Windows does this).
Possible reasons for overflowing the stack:
An unbounded number of stack frames (e.g. from unbounded recursion)
Attempting to allocate large blocks from the stack
Buffer overflows for buffers allocated on the stack
There are probably other reasons as well that I can't think of off the top of my head.
This question doesn't specify which stack is "the" stack. So, here are a few answers:
Call Stack
The call stack gets overflowed whenever the number of calls on the stack overruns the amount of memory it has. The most common way is infinite recursion, but it's quite possible to have recursion that's excessive but not infinite. For example, computing the Ackermann function naively will tax any computer.
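For illustration, here is a naive Ackermann function in C that also records the deepest recursion level it reached. This is just a sketch: `ack_max_depth` and the `depth` parameter are bookkeeping added for this example, and the `m <= 3` guard is an arbitrary limit chosen to keep the demonstration small.

```c
#include <assert.h>

/* Deepest recursion level reached by the last ackermann() call. */
unsigned long ack_max_depth = 0;

static unsigned long ack_rec(unsigned long m, unsigned long n,
                             unsigned long depth)
{
    if (depth > ack_max_depth) {
        ack_max_depth = depth;
    }
    if (m == 0) {
        return n + 1;
    }
    if (n == 0) {
        return ack_rec(m - 1, 1, depth + 1);
    }
    /* Finite, but very deeply nested, recursion. */
    return ack_rec(m - 1, ack_rec(m, n - 1, depth + 1), depth + 1);
}

unsigned long ackermann(unsigned long m, unsigned long n)
{
    /* Guard for this sketch only: larger m grows explosively. */
    assert(m <= 3);
    ack_max_depth = 0;
    return ack_rec(m, n, 1);
}
```

Calling `ackermann(2, 3)` returns 9 and `ackermann(3, 3)` returns 61; inspecting `ack_max_depth` afterwards shows how quickly the call depth grows even for tiny inputs, with no infinite recursion involved.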
Languages
Stack-based languages
Some languages, like PostScript and Forth, and some virtual machines, like the Java virtual machine, are stack-based. In these languages, it may be possible to make expressions so complex that they overflow the stack.
Context-free languages
Parsers for context-free languages are often implemented using a stack. If the code being parsed gets too complex (too deeply nested), it's possible to overflow that stack.
On a laptop or desktop machine it may be unusual to overflow the stack without infinite (or very deeply nested) recursion when running from the main thread... however, stack overflows are not uncommon for:
Threaded code in which the thread has been allotted a small, fixed-sized stack.
Signal handling code in which the signal handling context has a small, fixed-sized stack.
Code executing on embedded devices, where memory is generally scarce.
As an example, if you register a signal handler using sigaction and the handler does anything complex (i.e. deeply nested operations), it is very easy to get a stack overflow on a number of operating systems, since signal handlers are usually allotted a small, fixed-size stack. Similarly, if you spawn a thread with pthread_create but specify a small stack size with pthread_attr_setstacksize, it is very easy to cause a stack overflow. On very memory-limited devices such as wireless sensors, it is an art to avoid stack overflows.
My day job involves a lot of work with LotusScript in Lotus Notes, which has fixed stack limits for various scopes. E.g. most variables in a procedure/function must fit in a 32kB stack, except that the content of class variables is stored on the heap.
If fixed-size variables exceed the stack size, code won't compile.
Run-time stack overflows can occur with recursion. This is easy to achieve in LotusScript as it limits recursion of any single function to a 32kB stack. I gave up on using a recursive QuickSort years ago because of this.
If your program exceeds its allotted stack space without any infinite recursion going on, then you're doing something wrong.
Though it can happen if you leave off some asterisks and try to pass some huge buffers by value.
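A small C sketch of the "missing asterisk" case: the two functions below do the same work, but the first receives a full copy of the struct for every call (typically placed on the stack), while the second receives only an address. The names `struct big`, `BIG_BYTES`, `sum_by_value` and `sum_by_pointer` are made up for this example, and 64 KiB is an arbitrary size.

```c
#include <assert.h>
#include <stddef.h>

#define BIG_BYTES (64u * 1024u)  /* illustrative size only */

struct big {
    unsigned char data[BIG_BYTES];
};

/* By value: the whole 64 KiB struct is copied for each call. */
size_t sum_by_value(struct big b)
{
    size_t sum = 0;
    for (size_t i = 0; i < sizeof b.data; i++) {
        sum += b.data[i];
    }
    return sum;
}

/* By pointer: only an address is passed, whatever the struct's size. */
size_t sum_by_pointer(const struct big *b)
{
    size_t sum = 0;

    assert(b != NULL);
    for (size_t i = 0; i < sizeof b->data; i++) {
        sum += b->data[i];
    }
    return sum;
}
```

On a thread with a small fixed stack, a few nested calls like `sum_by_value` could be enough to run out of stack space, whereas the pointer version uses the same small amount of stack regardless of how big the buffer is.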
The memory allocated for the stack does generally grow as needed within reasonable boundaries - I'm not sure what the upper limit is on various systems.