What does it mean when WebAssembly has an "implicit" stack?

The WebAssembly spec states here that it has an implicit operand and call stack.
What exactly does that mean in terms of WebAssembly, and how would an explicit stack differ from an implicit one?

The implicit stack is managed by the VM and is not directly accessible. It is implicitly pushed to, and popped from, by various instructions.
An example of an explicit stack might be a region of linear memory that you have direct access to via load and store instructions. Indeed, this is exactly what LLVM does for address-taken stack variables, i.e. it allocates a specific region of linear memory for them.
The control flow stack (e.g. the return addresses of each of the functions on the stack) is also part of the implicit stack and cannot be explicitly read from or written to.
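To make the distinction concrete, here is a minimal C sketch. It assumes the code is compiled for a Wasm target with an LLVM-based compiler, and the function names are hypothetical. A local whose address is never taken can typically stay in a Wasm local or on the implicit operand stack, whereas a local whose address escapes typically ends up on the explicit stack in linear memory:

```c
#include <assert.h>

/* 'total' never has its address taken, so a Wasm compiler can typically
   keep it in a Wasm local or on the implicit operand stack. */
int sum_plain(int a, int b) {
    int total = a + b;
    return total;
}

/* Writes through a pointer, so the pointee needs a real address. */
void store_answer(int *out) {
    assert(out != 0);
    *out = 42;
}

/* 'slot' has its address taken. Unless the call is inlined and optimized
   away, it has to live in the explicit stack region of linear memory. */
int sum_via_memory(void) {
    int slot = 0;
    store_answer(&slot);
    return slot;
}
```

Both functions behave identically from the caller's point of view; only where the compiler is able to place the local differs.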

Related

Checkpointing with LD_PRELOAD -- how to manipulate the instruction pointer and call stack?

The LD_PRELOAD technique allows us to supply our own custom standard library functions to an existing binary, overriding the standard ones or manipulating their behaviour, giving a fun way to experiment with a binary and understand its behaviour.
I've read that LD_PRELOAD can be used to "checkpoint" a program --- that is, to produce a record of the full memory state, call stack and instruction pointer at any given time --- allowing us to "reset" the program back to that previous state at will.
It's clear to me how we can record the state of the heap. Since we can provide our own version of malloc and related functions, our preloaded library can obviously gain perfect knowledge of the memory state.
What I can't work out is how our preloaded functions can determine the call stack and instruction pointer; and then reset them at a later time to the previously recorded value. Clearly this is necessary for checkpointing. Are there standard library functions that can do this? Or is a different technique required?
I've read that LD_PRELOAD can be used to "checkpoint" a program ... allowing us to "reset" the program back to that previous state at will.
That is a gross simplification. This "checkpoint" mechanism can not possibly restore any open file descriptors, or any mutexes, since the state of these is partially inside the kernel.
It's clear to me how we can record the memory state. ...
What I can't work out is how our preloaded functions can determine the call stack and instruction pointer;
The instruction pointer is inside the preloaded function, and is trivially available as e.g. register void *rip __asm__("rip") on x86_64. But you (likely) don't care about that address -- you probably care about the caller of your function. That is also trivially available as __builtin_return_address(0) (at least when using GCC).
And the rest of the call stack is saved in memory (in the stack region to be more precise), so if you know the contents of memory, you know the call stack.
Indeed, when you use e.g. the GDB where command with a core dump, that's exactly what GDB does -- it reads the contents of memory from the core and recovers the call stack from it.
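As a small illustration of the builtin mentioned above, here is a sketch that assumes GCC or Clang; the function names are hypothetical. The argument 0 selects the return address of the current function, and __builtin_frame_address(0) gives the address of its stack frame:

```c
#include <assert.h>
#include <stddef.h>

/* noinline keeps this a real call, so there is a return address to read. */
__attribute__((noinline)) void *caller_address(void) {
    /* Level 0 means: the address this function will return to. */
    void *ret = __builtin_return_address(0);
    assert(ret != NULL);
    return ret;
}

/* Address of this function's own stack frame. */
__attribute__((noinline)) void *current_frame(void) {
    return __builtin_frame_address(0);
}
```

Levels other than 0 walk further up the call stack, but they are not guaranteed to work on every target or at every optimization level.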
Update:
I wrote in my original post that I know how to inspect the memory, but in fact I only know how to inspect the heap. How can I view the full contents of all stack frames?
Inspecting memory works the same regardless of whether that memory "belongs" to heap, stack, or code. You simply dereference a pointer and voilà -- you get the contents of memory at that location.
What you probably mean is:
how to find location of stack and
how to decode it
The answer to the first question is OS-specific, and you didn't tag your question with any OS.
Assuming you are on Linux, one way to locate the stack is to parse the entries in /proc/self/maps, looking for an entry (a contiguous address range) which "covers" the current stack (i.e. "covers" the address of any local variable).
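A minimal sketch of that lookup, assuming Linux and assuming each line of /proc/self/maps fits in the buffer; find_mapping is a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Finds the mapping in /proc/self/maps that contains addr.
   Returns 0 and fills *start and *end on success, -1 otherwise. */
int find_mapping(const void *addr, uintptr_t *start, uintptr_t *end) {
    assert(start != NULL && end != NULL);
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL) return -1;

    char line[512]; /* assumption: no line is longer than this */
    uintptr_t target = (uintptr_t)addr;
    int found = -1;
    while (fgets(line, sizeof line, maps) != NULL) {
        unsigned long lo, hi;
        /* Each line begins with "start-end" in hexadecimal. */
        if (sscanf(line, "%lx-%lx", &lo, &hi) != 2) continue;
        if (target >= lo && target < hi) {
            *start = lo;
            *end = hi;
            found = 0;
            break;
        }
    }
    fclose(maps);
    return found;
}
```

Passing the address of any local variable of the calling thread yields the range of that thread's stack mapping.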
For the second question, the answer is:
it's complicated [1] and
you don't actually need to decode it in order to save/restore its state.
[1] To figure out how to decode the stack, you could look at the sources of debuggers (such as GDB and LLDB).
This is also very OS and processor specific.
You would need to know calling conventions. On x86_64 you would need to know about unwind descriptors. To find local variables, you would need to know about DWARF debugging format.
Did I mention it's complicated?
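On the question of standard library functions: for the much narrower task of recording and later restoring the stack pointer and instruction pointer within one running process, setjmp and longjmp from the C standard library can serve as an illustration. This is only a sketch, not a checkpointing mechanism. It restores registers, not the contents of the stack or heap, and it is only valid while the function that called setjmp is still active. The function name is hypothetical:

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf checkpoint;

/* Jumps back to the recorded point 'target' times, then returns the
   number of times execution resumed there. */
int count_resumes(int target) {
    /* volatile: the value must survive the longjmp. */
    volatile int resumes = 0;
    assert(target >= 0);

    if (setjmp(checkpoint) != 0) {
        /* We got here via longjmp, not via the initial call. */
        resumes++;
    }
    if (resumes < target) {
        longjmp(checkpoint, 1); /* "reset" to the recorded point */
    }
    return resumes;
}
```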

Wasm Hot Reloading Experiment: Debunking Assumptions, and How to specify where the data section is?

First, to avoid making this seem like an XY problem, I'd like to give some context (note that I am not using Emscripten):
I am trying to see if I can implement a form of hot reloading for Wasm programs written in C++, hosted on the web. To do this, I want to have a section of memory that I call my "world state" (to anyone who has watched Handmade Hero ( https://handmadehero.org/ ), this will be familiar):
struct State {
// put everything here
} state;
Typically for a full C++ program with a platform layer, you'd allocate this struct on the platform side and feed a pointer to that memory through a function pointer in the reloadable/dll/dylib part of the code. The reloadable code puts EVERYTHING into this persistent memory so if the code needs to be recompiled and reloaded, all the state will continue to exist since the memory was allocated in the part of the program that wasn't reloaded. As far as I can tell, this is impossible in Wasm though.
Firstly, is my assumption correct that I have to use WebAssembly.Memory? Or can I allocate a Uint8Array in JS and use that for my persistent state, separate from the program memory? If so, is that slower?
So this will work as long as I don't use a dynamic allocator like WASI, and instead use a push allocator I can control. (I think this because, suppose I use malloc to get memory addresses and reload--malloc's internal state will reload and think all the heap memory is available when it's not, so future allocations might clobber previous ones.)
Upon reload, I can first copy the struct into a temporary buffer on the js side, reload, get the memory location of the struct from Wasm (I will require that it exists), and copy the saved memory from js back into position.
However this falls apart if I use pointers because if I change the program (which is the point) __data_end might change, which would offset all of the addresses! I checked the linker flags here https://lld.llvm.org/WebAssembly.html to see what I could control. I can specify that the stack comes before the data segment, but the heap would still come after that, which results in the same problem. I can also specify where the global data are located, but that's not the data segment I believe, so the variable-size data segment could still offset all of my addresses.
Here's a nice page that can help us visualize the Wasm memory: https://dassur.ma/things/c-to-webassembly/
Would anyone have any thoughts on how to achieve what I'd like? The only options I can think of involve somehow using memory outside the Wasm memory (possibly slower or impossible), using only stack memory and no pointers (unrealistic unless I can auto-recalculate all pointer offsets after a recompile, which would be painful and bug-prone), or finding a way to make the data segment come after the stack and heap at a fixed address, which would then guarantee that the stack and heap segments wouldn't get offset if the data segment needs to grow. Another option, if possible, would be to fix the max size of the data segment. The Wasm spec/documentation aren't really great when it comes to memory manipulation like this, so I'd appreciate some clarification about what's possible too. Lastly, maybe I could use two Wasm modules (but wouldn't that sort of indirection be slow)? I might be missing something crucial related to the memory layout.
Please let me know if you need more details. I've done something like this before in C, as I mentioned, and it's a common rapid iteration game-dev technique. Basically I'm trying to recreate it in Wasm.
EDIT: Apparently you can call Wasm functions from another module directly. Firstly, how do you do it, and secondly, what would the performance characteristics be for accessing the memory of another module?
EDIT2: Maybe some form of dynamic linking if that's supported? https://webassembly.org/docs/dynamic-linking/
WebAssembly modules hold variable state in three distinct places:
Linear memory
Local variables associated with the execution stack
Global variables
Of these, only global variables and linear memory are accessible to the host environment, and potentially serialisable in order to cache them as you hot-reload your module. There is of course no way to directly access and store the current call-stack.
If I were looking to achieve this, I'd create my own state machine within WebAssembly, storing this within a known location within linear memory.
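One way to keep such state usable when addresses shift between builds is to store offsets from a base instead of absolute pointers. The following is a sketch under the assumption that all persistent data lives in one arena; Arena, arena_push and arena_at are hypothetical names, not part of any Wasm toolchain:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned char *base; /* may differ after a reload */
    uint32_t capacity;
    uint32_t used;
} Arena;

/* Push (bump) allocation. Returns the offset of the new block from the
   arena base, or UINT32_MAX if the arena is full. */
uint32_t arena_push(Arena *arena, uint32_t size) {
    assert(arena != NULL);
    uint32_t aligned = (arena->used + 7u) & ~7u; /* 8-byte alignment */
    if (aligned > arena->capacity || size > arena->capacity - aligned) {
        return UINT32_MAX;
    }
    arena->used = aligned + size;
    return aligned;
}

/* Turns a stored offset back into a pointer for the current base. */
void *arena_at(const Arena *arena, uint32_t offset) {
    return arena->base + offset;
}
```

Because only offsets are stored inside the arena, its bytes can be copied out, the module reloaded, and the bytes copied back to a different address without fixing up any pointers.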
Wasm is organised into modules, and modules define four relevant kinds of entities: functions, memories, tables, globals. The code is in the functions, while the other three represent a module's state.
Now, the interesting thing is that all four of these entity kinds can be imported and exported. Moreover, all of them can be created externally to the module, e.g., by the JS API.
Consequently, a way to emulate code swapping is to set up your module such that all three pieces of state are created externally and imported into the module. That way, you can keep them alive externally and pass them to the upgraded module once available. (You also need to make sure that the upgraded module doesn't use data/element segments or start functions in a way that paves over preexisting state.)
Of course, this only works if the shape of the module's state does not change between upgrades. E.g., no new globals, no new data layout in memory, otherwise the new code won't understand the old state. That is actually the hard part of the problem, but it's independent from Wasm specifics.
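The "shape must not change" caveat can at least be detected at runtime. A minimal sketch, assuming the persistent state begins with a version field that the programmer bumps by hand whenever the layout changes; State's fields, STATE_LAYOUT_VERSION and state_attach are all hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: bump this whenever State changes shape. */
#define STATE_LAYOUT_VERSION 3u

typedef struct {
    uint32_t layout_version;
    uint32_t frame_count;
    float player_x;
} State;

/* Called by freshly loaded code on the preserved memory.
   Returns 1 if the old state was kept, 0 if it had to be reset. */
int state_attach(State *state) {
    assert(state != NULL);
    if (state->layout_version == STATE_LAYOUT_VERSION) {
        return 1; /* the new code understands the old state */
    }
    memset(state, 0, sizeof *state);
    state->layout_version = STATE_LAYOUT_VERSION;
    return 0;
}
```

Resetting on mismatch is the simplest policy; migrating old state to the new layout is the hard part described above and is not attempted here.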

What is in the stack?

I'm trying to understand what is stored in the stack in optix.
As I understand it, we set the stack size per context, and one stack is attached to each thread in the ray generation program.
When a ray is launched, the thread carries with it the stack, which stores the ray's payload.
I thought that, when we do a recursive ray-tracer for example, the stack overflow would occur because there would be too many payloads to keep in memory. But right now, I have a program with a radiance ray that has a payload of float + 3 uint, and a shadow ray with only a float, and there is only one bounce. However, my stack needs to be bigger than 1024 to avoid a stack overflow. Surely, this is way more than just my two payloads.
So I wonder, what else is in the stack?
(I mean in general, not in my particular case. What is stored in the stack except the ray(s) payload(s) (if they are)? For example, do we also store information about the hits? about the scene tree? Do we keep track of which program called the current ray?)
Thanks for your help!
Answered on the NVIDIA board here
Detlef Roettger wrote
"The stack is also used to save and restore live variables around
function calls (e.g. rtTrace or callable programs). That's the
background for one of the performance advice in the OptiX Programming
Guide which starts with Try to minimize live state across calls to
rtTrace in programs."
More info on this at §3.1.3 - Global State in the OptiX Programming guide.
Remember that OptiX programs are full blown CUDA kernels combined together. Stack memory is therefore also used for ordinary execution needs (the amount is likely to vary even between CUDA versions).

Why is there a stack and a heap?

Why do assembly languages use both a stack and a heap? They seem redundant.
They're not redundant. Each of them has strengths and weaknesses: A stack is faster if used right, because memory allocation is trivial (push / pop). The downside is that you can only add and remove items at the top (hence the name, stack). Also, total stack space is limited, and when you run out, you have a... well, stack overflow. The heap, by contrast, allows random allocation and deallocation, and you can store large amounts of data there, but the downside is that allocation carries more overhead - for each allocated block of memory, a suitable free portion must be found, and in the long run, fragmentation of the free space needs to be avoided, and the system must track where the free blocks are.
You use the stack to pass around small short-lived values, e.g. local counter variables, function arguments, return values, etc.; these lend themselves to push/pop allocation style. For larger or long-lived data structures, you use the heap.
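A short C sketch of that division of labour (the function names are hypothetical): the first function uses a short-lived stack array, the second returns heap memory that outlives the call and must be freed by the caller.

```c
#include <assert.h>
#include <stdlib.h>

/* Stack: 'scratch' is released automatically when the function returns. */
int sum_of_squares(int n) {
    int scratch[16];
    assert(n >= 0 && n <= 16);
    int sum = 0;
    for (int i = 0; i < n; i++) {
        scratch[i] = i * i;
        sum += scratch[i];
    }
    return sum;
}

/* Heap: the array outlives this function; the caller must free() it. */
int *make_squares(int n) {
    assert(n > 0);
    int *out = malloc((size_t)n * sizeof *out);
    if (out == NULL) return NULL;
    for (int i = 0; i < n; i++) {
        out[i] = i * i;
    }
    return out;
}
```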
You could certainly construct a computing system that utilised either one of them as its only memory model. However, they both have rather different properties each with its own good and bad points. Most systems utilise both so as to get the benefits from each of them.
Stacks
A stack can be thought of as a pile of plates. You write a value on a plate and put it on the top of the stack; this is called a push operation and stores a value on the stack. You can obviously also remove the top plate from the stack; this is called a pop operation. But new allocations must always be at the top of the stack.
The stack tends to be used for local variables and passing values between functions. Generally stacks have the following awesome properties:
Requires only a handful of pointers to manage
Very easy to implement in hardware, most processors have built in hardware support for a stack making it even faster.
Very quick to allocate memory
The problem with the stack comes from the fact that items can only be added to or removed from the top. This makes great sense when traversing up and down through function calls: pop the function's inputs from the stack, allocate space for local variables on the stack, run the function, clear the local variables from the top of the stack and push the return value onto the stack. If, on the other hand, I want to allocate some memory and, say, pass it to another thread, or in general free it far away from where it was allocated, all of a sudden I have a problem: the stack is not in the correct position when I want to free the memory.
You could say the stack facilitates fast sequential memory allocation.
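As an illustration of how little bookkeeping this needs, here is a sketch of a software stack allocator over a fixed buffer. It is not how any particular CPU or compiler implements its stack, and the names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define STACK_CAPACITY 1024

static unsigned char stack_memory[STACK_CAPACITY];
static size_t stack_top = 0; /* the only bookkeeping needed */

/* Allocates 'size' bytes at the top, or returns NULL on "stack overflow". */
void *stack_push(size_t size) {
    if (size > STACK_CAPACITY - stack_top) return NULL;
    void *block = &stack_memory[stack_top];
    stack_top += size;
    return block;
}

/* Releases the most recently pushed 'size' bytes (LIFO order only). */
void stack_pop(size_t size) {
    assert(size <= stack_top);
    stack_top -= size;
}

size_t stack_used(void) {
    return stack_top;
}
```

Allocation and release are each a single addition or subtraction, but there is no way to release a block in the middle without also releasing everything above it.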
Heap
Now the heap is different: each allocation is generally tracked separately. This causes a lot of overhead for allocations and deallocations, but each one can be handled independently of other memory allocations, at least until you run out of memory.
There are numerous algorithms for accomplishing this and it is probably a bit unwise to twitter on about them here, but here is a link that talks about a few good simple heap allocation algorithms: Alternatives to malloc and new
So the heap facilitates random memory allocation, but this comes with a runtime penalty. However, that penalty is often smaller than what would be incurred if you had to handle the situation using just the stack.
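For contrast, here is a sketch of one of the simplest heap-style schemes, a fixed-size block pool with a free list. It is an assumption chosen for brevity, not what malloc does, and the names are hypothetical. Each block is tracked individually, so blocks can be freed in any order:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCKS 8

typedef union Block {
    union Block *next_free;    /* meaningful only while the block is free */
    unsigned char payload[32]; /* meaningful only while it is allocated */
} Block;

static Block pool[POOL_BLOCKS];
static Block *free_list = NULL;
static int pool_ready = 0;

/* Threads every block onto the free list. */
static void pool_init(void) {
    for (int i = 0; i < POOL_BLOCKS; i++) {
        pool[i].next_free = free_list;
        free_list = &pool[i];
    }
    pool_ready = 1;
}

/* Takes any free block, or returns NULL if none is left. */
void *pool_alloc(void) {
    if (!pool_ready) pool_init();
    if (free_list == NULL) return NULL;
    Block *block = free_list;
    free_list = block->next_free;
    return block;
}

/* Returns a block to the pool, independent of allocation order. */
void pool_free(void *ptr) {
    assert(ptr != NULL);
    Block *block = ptr;
    block->next_free = free_list;
    free_list = block;
}
```

A general-purpose heap additionally has to handle variable sizes and fragmentation, which is where most of the overhead mentioned above comes from.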
It is about the memory handling and managing.
The x86 architecture has several different types of registers.
It also offers hardware-supported memory management, among other things.
The stack is used to save the instruction pointer (return addresses) across calls; in some applications the heap is part of the data segment.
To read more, I advise you to read the following links:
http://en.wikipedia.org/wiki/Data_segment
http://en.wikipedia.org/wiki/X86_memory_segmentation
"A memory model allows a compiler to perform many important
optimizations" - Wikipedia

Direction of stack higher memory address to lower memory address

The direction of stack growth (from higher to lower memory addresses, or from lower to higher) depends on the machine architecture.
Example Intel : higher memory address to lower memory address
SPARC : lower memory address to higher memory address
Is there any way to change the direction of stack memory allocation using code?
Thanks.
In general, management of the stack is performed by the compiler (assuming we're talking about something like C or C++ here). However, the ISA may offer assistance, for instance push and pop instructions on x86.
There is no way to do this from C or C++, unless your compiler offers a non-portable language extension or a command-line option to control this (I can't see why it would, because changing this would make your program/library incompatible with all other programs/libraries!)
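Although the direction cannot be changed from C, it can usually be observed. The following is a heuristic sketch, assuming GCC or Clang for the noinline attribute and assuming no tooling (such as certain sanitizer modes) relocates locals off the real stack; the function names are hypothetical. It compares the address of a local in a caller with the address of a local in its callee:

```c
#include <assert.h>
#include <stdint.h>

/* noinline ensures the callee gets its own stack frame. */
__attribute__((noinline))
static int compare_with_caller(volatile char *caller_local) {
    volatile char callee_local = 0;
    assert(caller_local != 0);
    /* Compare as integers; comparing unrelated pointers directly is
       not defined by the C standard. */
    uintptr_t caller = (uintptr_t)caller_local;
    uintptr_t callee = (uintptr_t)&callee_local;
    return callee < caller ? -1 : 1;
}

/* Returns -1 if the stack appears to grow toward lower addresses,
   +1 if it appears to grow toward higher addresses. */
int stack_direction(void) {
    volatile char caller_local = 0;
    return compare_with_caller(&caller_local);
}
```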
The stack is used at the machine-instruction level. You cannot change the processor's behavior with code. The only thing you can do is add an emulation layer in your program.
Some processors include explicit circuitry which pushes things onto a stack and pops them in various circumstances. Other processors do not include any such circuitry for a 'big' stack, but just provide a limited number of hardware registers or circuits that are used to store things like return addresses, and possibly a means by which software can copy the addresses stored in those registers or circuits to other parts of memory.
On processors whose hardware doesn't explicitly manipulate a stack in memory, one could use whatever pattern one wanted if one had control over all the code the processor would execute. Generally, however, there will be a pattern that the processor manufacturer recommends for implementing a stack, and code generated by compilers or by other people will most likely expect to use a stack implemented in that fashion.

Resources