in Pharo, how can I find the currently evaluating stack?
Well, in fact, the issue aint that simple: thisContext can be a quite expensive operation, compared to like a message send.
In Visualworks Smalltalk, stack access is extermly expensive because it uses the native C-stack and thus any access to thisContext must reify the entire C-Stack into causally connected Smalltalk objects. That is, for each C stack frame a Smalltalk object is to be created (including possible JIT deoptimization) and furthermore all changes to these objects must be reflected back to the C stack.
In Pharo (and Squeak, for that matter) it is less awkward, since it uses Smalltalk objects for the stack. But still the object pool which caches stack frames is flushed upon each call. (Yes, other than eg in Java, pooling objects does improve performance in Squeak ... welcome back to the 90ies :)
You evaluate
thisContext contextStack
Here, thisContext is really a special variable that points to the currently active stack frame. Then, contextStack returns an array with the entire stack.
Related
I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where the pointer of a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to access the element, does it not need to access the pointer itself by finding and loading the element either from either the L3 cache or across the interconnect (if the other thread is on a different cpu socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ principally a pay-for-what-you-need eco-system.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
Enter Ringbuffers
Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, one would just have to maintain the index of the buffer front and end positions atomically. The container could, at leisure update in internal 'dirty front index' while it copies ahead of the external front. (The copy can use relaxed memory ordering). Only as soon as the whole copy is known to have completed, the external front index is updated. This update needs to be in acq_rel/cst memory order[1].
As long as the container is able to guard the invariant that the front never fully wraps around and reaches back, this is a sweet deal. I think this idea was popularized in the Disruptor Library (of LMAX fame). You get mechanical resonance from
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple) physical cache lines
all the data is local unless the POD contains raw references outside that record
How Does Boost's spsc_queue Actually Do This?
Yes, spqc_queue stores the raw element values in a contiguous aligned block of memory: (e.g. from compile_time_sized_ringbuffer which underlies spsc_queue with statically supplied maximum capacity:)
typedef typename boost::aligned_storage<max_size * sizeof(T),
boost::alignment_of<T>::value
>::type storage_type;
storage_type storage_;
T * data()
{
return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there are only the "internal" index on either read or write side. This is possible because there's only one writing thread and also only one reading thread, which means that there could only be more space at the end of write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::unitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the unitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)
All in all, we see a best-in-class possible idea for a ringbuffer
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing frequently used data right there in the value, and referencing the less-of-used data using a pointer (the cost of the pointer will be low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can achieve memory access patterns there too.
Let your profiler guide you.
[1] I suppose say for spsc acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend anyone else to follow my example :)
Objects can be put on and removed only from the top of a stack. But what about reading and writing their values? Please correct me if I'm wrong, but I think process must be able to read from any part of the stack, since if only reading from the top was possible it would have to remove (and store somewhere) whole content of the stack above a variable it wants to examine. But in that case, how does the process know where exactly in the stack is a particular variable? I suspect it just holds a pointer to it, but where is that pointer stored?
Another thing - reading about stacks I often find phrases like "All memory allocated on the stack is known at compile time." Well, I probably misunderstand this, so please tell me where's the flaw in my logic:
Suppose a local variable is created when an if() statement is true, and isn't when it's false. Whether it's true will turn out at run time. So at compile time there's no way to know if it should be created, hence I wouldn't think memory for it is allocated at all, as it would be wasteful. Consequently, it isn't created/known at compile time.
At compile time, it's known how much space each type needs: An Integer, for instance, is 4 Bytes wide on 32 bit platforms, and a class with 2 Integers consumes 8 Bytes. Whether this space is allocated for a specific variable is not necessarily known (may depend on an if, as you stated).
When you invoke a method, all parameters and the return address are pushed onto the stack. To get one parameter, you walk up the stack up to its position, which is computed by the base pointer and the size of each parameter.
So it is not entirely true for this stack that you can access the top element only. It is, however, for the Stack data structure.
Ok, I asked the difference between Stackoverflow and bufferoverflow yesterday and almost getting voted down to oblivion and no new information.
So it got me thinking and I decided to rephrase my question in the hopes that I get reply which actually solves my issue.
So here goes nothing.
I am aware of four memory segments(correct me if I am wrong). The code, data, stack and heap. Now AFAIK the the code segment stores the code, while the data segment stores the data related to the program. What seriously confuses me is the purpose of the stack and the heap!
From what I have understood, when you run a function, all the related data to the function gets stored in the stack and when you recursively call a function inside a function, inside of a function... While the function is waiting on the output of the previous function, the function and its necessary data don't pop out of the stack. So you end up with a stack overflow. (Again please correct me if I am wrong)
Also I know what the heap is for. As I have read someplace, its for dynamically allocating data when a program is executing. But this raises more questions that solves my problems. What happens when I initially initialize my variables in the code.. Are they in the code segment or in the data segment or in the heap? Where do arrays get stored? Is it that after my code executes all that was in my heap gets erased? All in all, please tell me about heap in a more simplified manner than just, its for malloc and alloc because I am not sure I completely understand what those terms are!
I hope people when answering don't get lost in the technicalities and can keep the terms simple for a layman to understand (even if the concept to be described is't laymanish) and keep educating us with the technical terms as we go along. I also hope this is not too big a question, because I seriously think they could not be asked separately!
What is the stack for?
Every program is made up of functions / subroutines / whatever your language of choice calls them. Almost always, those functions have some local state. Even in a simple for loop, you need somewhere to keep track of the loop counter, right? That has to be stored in memory somewhere.
The thing about functions is that the other thing they almost always do is call other functions. Those other functions have their own local state - their local variables. You don't want your local variables to interfere with the locals in your caller. The other thing that has to happen is, when FunctionA calls FunctionB and then has to do something else, you want the local variables in FunctionA to still be there, and have their same values, when FunctionB is done.
Keeping track of these local variables is what the stack is for. Each function call is done by setting up what's called a stack frame. The stack frame typically includes the return address of the caller (for when the function is finished), the values for any method parameters, and storage for any local variables.
When a second function is called, then a new stack frame is created, pushed onto the top of the stack, and the call happens. The new function can happily work away on its stack frame. When that second function returns, its stack frame is popped (removed from the stack) and the caller's frame is back in place just like it was before.
So that's the stack. So what's the heap? It's got a similar use - a place to store data. However, there's often a need for data that lives longer than a single stack frame. It can't go on the stack, because when the function call returns, it's stack frame is cleaned up and boom - there goes your data. So you put it on the heap instead. The heap is a basically unstructured chunk of memory. You ask for x number of bytes, and you get it, and can then party on it. In C / C++, heap memory stays allocated until you explicitly deallocate. In garbage collected languages (Java/C#/Python/etc.) heap memory will be freed when the objects on it aren't used anymore.
To tackle your specific questions from above:
What's the different between a stack overflow and a buffer overflow?
They're both cases of running over a memory limit. A stack overflow is specific to the stack; you've written your code (recursion is a common, but not the only, cause) so that it has too many nested function calls, or you're storing a lot of large stuff on the stack, and it runs out of room. Most OS's put a limit on the maximum size the stack can reach, and when you hit that limit you get the stack overflow. Modern hardware can detect stack overflows and it's usually doom for your process.
A buffer overflow is a little different. So first question - what's a buffer? Well, it's a bounded chunk of memory. That memory could be on the heap, or it could be on the stack. But the important thing is you have X bytes that you know you have access to. You then write some code that writes X + more bytes into that space. The compiler has probably already used the space beyond your buffer for other things, and by writing too much, you've overwritten those other things. Buffer overruns are often not seen immediately, as you don't notice them until you try to do something with the other memory that's been trashed.
Also, remember how I mentioned that return addresses are stored on the stack too? This is the source of many security issues due to buffer overruns. You have code that uses a buffer on the stack and has an overflow vulnerability. A clever hacker can structure the data that overflows the buffer to overwrite that return address, to point to code in the buffer itself, and that's how they get code to execute. It's nasty.
What happens when I initially initialize my variables in the code.. Are they in the code segment or in the data segment or in the heap?
I'm going to talk from a C / C++ perspective here. Assuming you've got a variable declaration:
int i;
That reserves (typically) four bytes on the stack. If instead you have:
char *buffer = malloc(100);
That actually reserves two chunks of memory. The call to malloc allocates 100 bytes on the heap. But you also need storage for the pointer, buffer. That storage is, again, on the stack, and on a 32-bit machine will be 4 bytes (64-bit machine will use 8 bytes).
Where do arrays get stored...???
It depends on how you declare them. If you do a simple array:
char str[128];
for example, that'll reserve 128 bytes on the stack. C never hits the heap unless you explicitly ask it to by calling an allocation method like malloc.
If instead you declare a pointer (like buffer above) the storage for the pointer is on the stack, the actual data for the array is on the heap.
Is it that after my code executes all that was in my heap gets erased...???
Basically, yes. The OS will clean up the memory used by a process after it exits. The heap is a chunk of memory in your process, so the OS will clean it up. Although it depends on what you mean by "clean it up." The OS marks those chunks of RAM as now free, and will reuse it later. If you had explicit cleanup code (like C++ destructors) you'll need to make sure those get called, the OS won't call them for you.
All in all, please tell me about heap in a more simplified manner than just, its for malloc and alloc?
The heap is, much like it's name, a bunch of free bytes that you can grab a piece at a time, do whatever you want with, then throw back to use for something else. You grab a chunk of bytes by calling malloc, and you throw it back by calling free.
Why would you do this? Well, there's a couple of common reasons:
You don't know how many of a thing
you need until run time (based on
user input, for example). So you
dynamically allocate on the heap as
you need them.
You need large data structures. On
Windows, for example, a thread's
stack is limited by default to 1
meg. If you're working with large
bitmaps, for example, that'll be a
fast way to blow your stack and get
a stack overflow. So you grab that
space of the heap, which is usually
much, much larger than the stack.
The code, data, stack and heap?
Not really a question, but I wanted to clarify. The "code" segment contains the executable bytes for your application. Typically code segments are read only in memory to help prevent tampering. The data segment contains constants that are compiled into the code - things like strings in your code or array initializers need to be stored somewhere, the data segment is where they go. Again, the data segment is typically read only.
The stack is a writable section of memory, and usually has a limited size. The OS will initialize the stack and the C startup code calls your main() function for you. The heap is also a writable section of memory. It's reserved by the OS, and functions like malloc and free manage getting chunks out of it and putting them back.
So, that's the overview. I hope this helps.
With respect to stack... This is precicely where the parameters and local variables of the functions / procedures are stored. To be more precise, the params and local variables of the currently executing function is only accessible from the stack... Other variables that belong to chain of functions that were executed before it will be in stack but will not be accessible until the current function completed its operations.
With respect global variables, I believe these are stored in data segment and is always accessible from any function within the created program.
With respect to Heap... These are additional memories that can be made allotted to your program whenever you need them (malloc or new)... You need to know where the allocated memory is in heap (address / pointer) so that you can access it when you need. Incase you loose the address, the memory becomes in-accessible, but the data still remains there. Depending on the platform and language this has to be either manually freed by your program (or a memory leak occurs) or needs to be garbage collected. Heap is comparitively huge to stack and hence can be used to store large volumes of data (like files, streams etc)... Thats why Objects / Files are created in Heap and a pointer to the object / file is stored in stack.
In terms of C/C++ programs, the data segment stores static (global) variables, the stack stores local variables, and the heap stores dynamically allocated variables (anything you malloc or new to get a pointer to). The code segment only stores the machine code (the part of your program that gets executed by the CPU).
I've searched a while but no conclusive answer is present on why value types have to be allotted on the stack while the reference types i.e. dynamic memory or the objects have to reside on the heap.
why cannot the same be alloted on the stack?
They can be. In practice they're not because stack is a typically scarcer resource than heap and allocating reference types on the stack may exhaust it quickly. Further, if a function returns data allocated on its stack, it will require copying semantics on the caller's part or risk returning something that will be overwritten by the next function call.
Value types, typically local variables, can be brought in and out of scope quickly and easily with native machine instructions. Copy semantics for value types on return is trivial as most fit into machine registers. This happens often and should be as cheap as possible.
It is not correct that value types always live on the stack. Read Jon Skeet's article on the topic:
Memory in .NET - what goes where
I understand that the stack paradigm (nested allocations/deallocations) cannot handle certain algorithms which need non-nested object lifetimes.
just as the static allocation paradigm cannot handle recursive procedure calls. (e.g. naive calculation of fibonacci(n) as f(n-1) + f(n-2))
I'm not aware of a simple algorithm that would illustrate this fact though. any suggestions would be appreciated :-)
Local variables are allocated in the stack. If that was not the case, you wouldn't be able to have variables pointing to the heap when allocating variable's memory. You CAN allocate things in the stack if you want, just create a buffer big enough locally and manage it yourself.
Anything a method puts on the stack will vanish when the method exits. In .net and Java, it would be perfectly acceptable (in fact desirable) if a class object vanished as soon as the last reference to it vanished, but it would be fatal for an object to vanish while references to it still exist. It is not in the general case possible for the compiler to know, when a method creates an object, whether any references to that object will continue to exist after the method exits. Absent such assurance, the only safe way to allocate class objects is to store them on the heap.
Incidentally, in .net, one major advantage of mutable value types is that they can be passed by reference without surrendering perpetual control over them. If class 'foo', or a method thereof, has a structure 'boz' which one of foo's methods passes by reference to method 'bar', it is possible for bar, or the methods it calls, to do whatever they want to 'boz' until they return, but once 'bar' returns any references it held to 'boz' will be gone. This often leads to much safer and cleaner semantics than the promiscuously-sharable references used for class objects.
I've heard of stackless languages. However I don't have any idea how such a language would be implemented. Can someone explain?
The modern operating systems we have (Windows, Linux) operate with what I call the "big stack model". And that model is wrong, sometimes, and motivates the need for "stackless" languages.
The "big stack model" assumes that a compiled program will allocate "stack frames" for function calls in a contiguous region of memory, using machine instructions to adjust registers containing the stack pointer (and optional stack frame pointer) very rapidly. This leads to fast function call/return, at the price of having a large, contiguous region for the stack. Because 99.99% of all programs run under these modern OSes work well with the big stack model, the compilers, loaders, and even the OS "know" about this stack area.
One common problem all such applications have is, "how big should my stack be?". With memory being dirt cheap, mostly what happens is that a large chunk is set aside for the stack (MS defaults to 1Mb), and typical application call structure never gets anywhere near to using it up. But if an application does use it all up, it dies with an illegal memory reference ("I'm sorry Dave, I can't do that"), by virtue of reaching off the end of its stack.
Most so-called called "stackless" languages aren't really stackless. They just don't use the contiguous stack provided by these systems. What they do instead is allocate a stack frame from the heap on each function call. The cost per function call goes up somewhat; if functions are typically complex, or the language is interpretive, this additional cost is insignificant. (One can also determine call DAGs in the program call graph and allocate a heap segment to cover the entire DAG; this way you get both heap allocation and the speed of classic big-stack function calls for all calls inside the call DAG).
There are several reasons for using heap allocation for stack frames:
If the program does deep recursion dependent on the specific problem it is solving,
it is very hard to preallocate a "big stack" area in advance because the needed size isn't known. One can awkwardly arrange function calls to check to see if there's enough stack left, and if not, reallocate a bigger chunk, copy the old stack and readjust all the pointers into the stack; that's so awkward that I don't know of any implementations.
Allocating stack frames means the application never has to say its sorry until there's
literally no allocatable memory left.
The program forks subtasks. Each subtask requires its own stack, and therefore can't use the one "big stack" provided. So, one needs to allocate stacks for each subtask. If you have thousands of possible subtasks, you might now need thousands of "big stacks", and the memory demand suddenly gets ridiculous. Allocating stack frames solves this problem. Often the subtask "stacks" refer back to the parent tasks to implement lexical scoping; as subtasks fork, a tree of "substacks" is created called a "cactus stack".
Your language has continuations. These require that the data in lexical scope visible to the current function somehow be preserved for later reuse. This can be implemented by copying parent stack frames, climbing up the cactus stack, and proceeding.
The PARLANSE programming language I implemented does 1) and 2). I'm working on 3). It is amusing to note that PARLANSE allocates stack frames from a very fast-access heap-per-thread; it costs typically 4 machine instructions. The current implementation is x86 based, and the allocated frame is placed in the x86 EBP/ESP register much like other conventional x86 based language implementations. So it does use the hardware "contiguous stack" (including pushing and poppping) just in chunks. It also generates "frame local" subroutine calls the don't switch stacks for lots of generated utility code where the stack demand is known in advance.
Stackless Python still has a Python stack (though it may have tail call optimization and other call frame merging tricks), but it is completely divorced from the C stack of the interpreter.
Haskell (as commonly implemented) does not have a call stack; evaluation is based on graph reduction.
There is a nice article about the language framework Parrot. Parrot does not use the stack for calling and this article explains the technique a bit.
In the stackless environments I'm more or less familiar with (Turing machine, assembly, and Brainfuck), it's common to implement your own stack. There is nothing fundamental about having a stack built into the language.
In the most practical of these, assembly, you just choose a region of memory available to you, set the stack register to point to the bottom, then increment or decrement to implement your pushes and pops.
EDIT: I know some architectures have dedicated stacks, but they aren't necessary.
Call me ancient, but I can remember when the FORTRAN standards and COBOL did not support recursive calls, and therefore didn't require a stack. Indeed, I recall the implementations for CDC 6000 series machines where there wasn't a stack, and FORTRAN would do strange things if you tried to call a subroutine recursively.
For the record, instead of a call-stack, the CDC 6000 series instruction set used the RJ instruction to call a subroutine. This saved the current PC value at the call target location and then branches to the location following it. At the end, a subroutine would perform an indirect jump to the call target location. That reloaded saved PC, effectively returning to the caller.
Obviously, that does not work with recursive calls. (And my recollection is that the CDC FORTRAN IV compiler would generate broken code if you did attempt recursion ...)
There is an easy to understand description of continuations on this article: http://www.defmacro.org/ramblings/fp.html
Continuations are something you can pass into a function in a stack-based language, but which can also be used by a language's own semantics to make it "stackless". Of course the stack is still there, but as Ira Baxter described, it's not one big contiguous segment.
Say you wanted to implement stackless C. The first thing to realize is that this doesn't need a stack:
a == b
But, does this?
isequal(a, b) { return a == b; }
No. Because a smart compiler will inline calls to isequal, turning them into a == b. So, why not just inline everything? Sure, you will generate more code but if getting rid of the stack is worth it to you then this is easy with a small tradeoff.
What about recursion? No problem. A tail-recursive function like:
bang(x) { return x == 1 ? 1 : x * bang(x-1); }
Can still be inlined, because really it's just a for loop in disguise:
bang(x) {
for(int i = x; i >=1; i--) x *= x-1;
return x;
}
In theory a really smart compiler could figure that out for you. But a less-smart one could still flatten it as a goto:
ax = x;
NOTDONE:
if(ax > 1) {
x = x*(--ax);
goto NOTDONE;
}
There is one case where you have to make a small trade off. This can't be inlined:
fib(n) { return n <= 2 ? n : fib(n-1) + fib(n-2); }
Stackless C simply cannot do this. Are you giving up a lot? Not really. This is something normal C can't do well very either. If you don't believe me just call fib(1000) and see what happens to your precious computer.
Please feel free to correct me if I'm wrong, but I would think that allocating memory on the heap for each function call frame would cause extreme memory thrashing. The operating system does after all have to manage this memory. I would think that the way to avoid this memory thrashing would be a cache for call frames. So if you need a cache anyway, we might as well make it contigous in memory and call it a stack.