I'm failing to understand how the stack works

I'm building an emulator for the MOS6502 processor, and at the moment I'm trying to simulate the stack in code, but I'm really failing to understand how the stack works in the context of the 6502.
One of the features of the 6502's stack structure is that when the stack pointer reaches the end of the stack it will wrap around, but I don't get how this feature even works.
Let's say we have a stack with a maximum of 64 values. If we push the values x, y and z onto the stack, we now have the structure below, with the stack pointer pointing at address 0x62, because that was the last value pushed onto the stack.
+-------+
|   x   | 0x64
+-------+
|   y   | 0x63
+-------+
|   z   | 0x62  <- SP
+-------+
|       | ...
+-------+
All well and good. But if we now pop those three values off the stack, we have an empty stack, with the stack pointer pointing at address 0x64.
+-------+
|       | 0x64  <- SP
+-------+
|       | 0x63
+-------+
|       | 0x62
+-------+
|       | ...
+-------+
If we pop the stack a fourth time, the stack pointer wraps around to point at address 0x00, but what's even the point of doing this when there isn't a value at 0x00? There's nothing in the stack, so what's the point in wrapping the stack pointer around?
I can understand this process when pushing values, if the stack is full and a value needs to be pushed to the stack it'll overwrite the oldest value present on the stack. This doesn't work for popping.
Can someone please explain this because it makes no sense.

If we pop the stack a fourth time, the stack pointer wraps around to point at address 0x00, but what's even the point of doing this when there isn't a value at 0x00? There's nothing in the stack, so what's the point in wrapping the stack pointer around?
It is not done for a functional reason. The 6502 architecture was designed so that pushing and popping could be done by decrementing or incrementing an 8-bit SP register without any additional checking. Checks for overflow or underflow of the SP register would involve more silicon to implement them, more silicon to implement the stack overflow / underflow handling ... and extra gate delays in a critical path.
The 6502 was designed to be cheap and simple using 1975-era chip technology [1]. Not fast. Not sophisticated. Not easy to program [2].
[1] According to Wikipedia, the original design had ~3200 or ~3500 transistors. One of the selling points of the 6502 was that it was cheaper than its competitors. Fewer transistors meant smaller dies, better yields and lower production costs.
[2] Of course, this is relative. Compared to some ISAs, the 6502 is easy because it is simple and orthogonal, and you have so few options to choose from. But compared to others, the limitations that make it simple actually make it difficult. For example, the fact that there are at most 256 bytes in the stack page that have to be shared by everything. It gets awkward if you are implementing threads or coroutines. Compare this with an ISA where the SP is a 16-bit register or the stack can be anywhere.
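In an emulator, the wrap-around falls out naturally if you model the stack the way the hardware does: the 6502's stack lives in page one (0x0100-0x01FF), SP is an 8-bit value, and push/pull simply decrement/increment it with no range checks. A minimal C sketch (the memory layout and helper names are illustrative, not taken from the question's code):

#include <stdint.h>

static uint8_t memory[0x10000];   /* 64 KiB address space */
static uint8_t sp = 0xFF;         /* 8-bit stack pointer, stack empty */

static void push(uint8_t value)
{
    memory[0x0100 + sp] = value;  /* the stack always lives in page one */
    sp--;                         /* uint8_t arithmetic wraps 0x00 -> 0xFF */
}

static uint8_t pull(void)
{
    sp++;                         /* wraps 0xFF -> 0x00 with no check */
    return memory[0x0100 + sp];
}

The "wrap" the question asks about is not a feature the CPU implements deliberately; it is just what an unchecked 8-bit decrement or increment does.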

Related

In memory, should stack bottom and heap bottom have the same address?

I'm using a tm4c123gh6pm MCU with this linker script. Going to the bottom, I see:
...
...
.bss (NOLOAD):
{
_bss = .;
*(.bss*)
*(COMMON)
_ebss = .;
} > SRAM
_heap_bottom = ALIGN(8);
_heap_top = ORIGIN(SRAM) + LENGTH(SRAM) - _stack_size;
_stack_bottom = ALIGN(8);
_stack_top = ORIGIN(SRAM) + LENGTH(SRAM);
It seems that heap and stack bottoms are the same. I have double checked it:
> arm-none-eabi-objdump -t mcu.axf | grep -E "(heap|stack)"
20008000 g .bss 00000000 _stack_top
20007000 g .bss 00000000 _heap_top
00001000 g *ABS* 00000000 _stack_size
20000558 g .bss 00000000 _heap_bottom
20000558 g .bss 00000000 _stack_bottom
Is this correct? As far as I can see, the stack could overwrite the heap, is this the case?
If I flash this FW it 'works' (at least for now), but I expect it to fail if the stack gets big enough and I use dynamic memory. I have observed, though, that nothing in my code or the startup script uses the heap and stack bottom symbols, so maybe even if I use the stack and heap everything keeps working. (Unless they are special symbols used by something I can't see; is that the case?)
I want to change the last part to:
_heap_bottom = ALIGN(8);
_heap_top = ORIGIN(SRAM) + LENGTH(SRAM) - _stack_size;
_stack_bottom = ORIGIN(SRAM) + LENGTH(SRAM) - _stack_size + 4; // or _heap_top + 4
_stack_top = ORIGIN(SRAM) + LENGTH(SRAM);
Is the above correct?
If you write your own linker script then it is up to you how stack and heap are arranged.
One common approach is to have stack and heap in the same block, with stack growing downwards from the highest address towards the lowest, and heap growing upwards from a lower address towards the highest.
The advantage of this approach is that you don't need to calculate how much heap or stack you need separately. As long as the total of stack and heap used at any one instant is less than the total memory available, then everything will be ok.
The disadvantage of this approach is that when you allocate more memory than you have, your stack will overflow into your heap or vice-versa, and your program will fail in a variety of ways which are very difficult to predict or to identify when they happen.
The linker script in your question uses this approach, but appears to have a mistake detailed below.
Note that using the names top and bottom when talking about stacks on ARM is very unhelpful because when you push something onto the stack the numerical value of the stack pointer decreases. When the stack is empty the stack pointer has its highest value, and when it is full the stack pointer has its lowest value. It is ambiguous whether "top" refers to the highest address or the location of the current pointer, and whether bottom refers to the lowest address or the address where the first item is pushed.
In the CMSIS example linker scripts the lower and upper bounds of the heap are called __heap_base and __heap_limit, and the lower and upper bounds of the stack are called __stack_limit and __initial_sp respectively.
In this script the symbols have the following meanings:
_heap_bottom is the lowest address of the heap.
_heap_top is the upper address that the heap must not grow beyond if you want to leave at least _stack_size bytes for the stack.
For _stack_bottom, it appears that the script author probably mistakenly thought that ALIGN(8) would align the most recently assigned value, and so they wanted _stack_bottom to be an aligned version of _heap_top, which would make it the value of the stack pointer when _stack_size bytes are pushed to it. In fact ALIGN(8) aligns the value of ., which still has the same value as _heap_bottom as you have observed.
Finally _stack_top is the highest address in memory, it is the value the stack pointer will start with when the stack is empty.
Having an incorrect value for the stack limit almost certainly does absolutely nothing at all, because this symbol is probably never used in the code. On this ARMv7M processor the push and pop instructions and other accesses to the stack by hardware assume that the stack is an infinite resource. Compilers using all the normal ABIs also generate code which does not check before growing the stack either. The reason for this is that it is one of the most common operations performed, and so adding extra instructions would cripple performance. The next generation ARMv8M does have hardware support for a stack limit register, though.
My advice is to just delete the line. If anything were using it, you would basically lose the whole benefit of sharing your stack and heap space. If you do want to calculate and check it, then your suggestion is correct except that you don't need to add + 4. That would create a 4-byte gap which is not usable as either heap or stack.
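If you do check it in software, a common place is the heap-growing end rather than the stack. A minimal newlib-style sketch, assuming the _heap_bottom symbol from the script above and the usual GCC/ARM idiom for reading the stack pointer (treat this as illustrative, not a drop-in implementation):

#include <errno.h>
#include <stddef.h>

extern char _heap_bottom;               /* provided by the linker script */

void *_sbrk(ptrdiff_t incr)
{
    static char *heap_end = &_heap_bottom;
    register char *stack_ptr asm("sp"); /* current stack pointer */

    if (heap_end + incr > stack_ptr) {  /* heap would run into the stack */
        errno = ENOMEM;
        return (void *)-1;
    }

    char *prev = heap_end;
    heap_end += incr;
    return prev;
}

Note that this only catches the heap growing into the stack, not the stack growing into the heap; the aside below about putting the stack at the bottom of memory addresses that case instead.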
As an aside, I personally prefer to put the stack at the bottom of memory and the heap at the top, growing away from each other. That way, if either of them gets bigger than it should, it goes into an unallocated address space which can be configured to cause a bus fault straight away, without any software checking the values all the time.

Memory Model in Ruby

How is memory managed in Ruby? For example, for a C program during execution, the memory model is the following. Similarly, how is memory handled in Ruby?
C:
__________________
|                |
|     stack      |
|                |
------------------
|                |
|  unallocated   |
|     space      |
------------------
|                |
|                |
|      heap      |
|                |
|                |
__________________
|                |
|      data      |
__________________
|      text      |
__________________
Ruby:
?
There is no such thing as "memory" in Ruby.
Class#allocate allocates an object and returns that object. And that is the entire extent of interaction that a programmer can have with the object space subsystem.
Where that object is allocated, how it is allocated, if it stays at the same place in memory or moves around, none of that is specified or observable. For example, on MagLev, an object may actually not be allocated in memory at all, but on disk, or in another computer's memory. JRuby, IronRuby, Opal, Cardinal, MacRuby, etc. "outsource" their memory management to a third party, they literally don't even know what's happening to their memory.
A Ruby implementation may use a separate stack and heap, it may use a heap-allocated stack, it may not even use a stack at all (e.g. Cardinal).
Note: the ObjectSpace module allows a limited amount of introspection and reflection of the object space. In general, when I say something is "impossible" in Ruby, there's always an implicit caveat "unless you use reflection". However, even ObjectSpace does not leak any information about the organization of memory.
In YARV, there is also the objspace library and the GC module, which provide internal implementation details about YARV. However, they are private internal implementation details of YARV, they are not even guaranteed to exist in other implementations, and they may change at any time without notice even within YARV.
You may note that I didn't write anything about garbage collection! Well, actually, Ruby only specifies when objects are referenced and when they aren't. What to do with un-referenced objects, it doesn't say. It makes sense for an implementation to reclaim the space used by those unreferenced objects, and all of them do to some degree (e.g. older versions of YARV would not reclaim unreferenced Symbols), but it is not required nor specified. And all implementations use very different approaches. Again, JRuby, IronRuby, Opal, Cardinal, MacRuby, Topaz, MagLev, etc. "outsource" that problem to the underlying platform; Rubinius uses a generational, copying, moving, tracing collector based on the Immix collector; and YARV uses a simple mark-and-sweep tracing collector.

Rationalizing what is going on in my simple OpenCL kernel in regards to global memory

const char programSource[] =
"__kernel void vecAdd(__global int *a, __global int *b, __global int *c)"
"{"
" int gid = get_global_id(0);"
"for(int i=0; i<10; i++){"
" a[gid] = b[gid] + c[gid];}"
"}";
The kernel above is a vector addition done ten times in a loop. I have used the programming guide and Stack Overflow to figure out how global memory works, but I still can't figure out by looking at my code whether I am accessing global memory in a good way. I am accessing it in a contiguous fashion and, I am guessing, in an aligned way. Does the card load 128-byte chunks of global memory for arrays a, b, and c? Does it then load the 128-byte chunks for each array once for every 32 gid indexes processed? (4*32=128) It seems like I am not wasting any global memory bandwidth then, right?
BTW, the compute profiler shows a gld and gst efficiency of 1.00003, which seems weird; I thought it would just be 1.0 if all my stores and loads were coalesced. How is it above 1.0?
Yes, your memory access pattern is pretty much optimal. Each halfwarp is accessing 16 consecutive 32-bit words. Furthermore, the access is 64-byte aligned, since the buffers themselves are aligned and the start index for each halfwarp is a multiple of 16. So each halfwarp will generate one 64-byte transaction, and you shouldn't waste memory bandwidth through uncoalesced accesses.
Since you asked for examples in your last question, let's modify this code to show other, less optimal, access patterns (since the loop doesn't really do anything, I will ignore it):
kernel void vecAdd(global int* a, global int* b, global int* c)
{
    int gid = get_global_id(0);
    a[gid + 1] = b[gid * 2] + c[gid * 32];
}
First, let's see how this works on compute 1.3 (GT200) hardware.
For the writes to a this will generate a slightly suboptimal pattern (following are the halfwarps, identified by their gid range, and the corresponding access pattern):
gid | addr. offset | accesses | reasoning
0- 15 | 4- 67 | 1x128B | in aligned 128byte block
16- 31 | 68-131 | 1x64B, 1x32B | crosses 128B boundary, so no 128B access
32- 47 | 132-195 | 1x128B | in aligned 128byte block
48- 63 | 196-259 | 1x64B, 1x32B | crosses 128B boundary, so no 128B access
So basically we are wasting about half our bandwidth (the less than doubled access width for the odd halfwarps doesn't help much, because it generates more accesses, which isn't faster than wasting more bytes, so to speak).
For the reads from b, the threads access only even elements of the array, so for each halfwarp all accesses lie in a 128-byte aligned block (the first element is at a 128B boundary, since for that element the gid is a multiple of 16, so the index is a multiple of 32, which for 4-byte elements means the address offset is a multiple of 128B). The access pattern stretches over the whole 128B block, so this will do a 128B transfer for every halfwarp, again wasting half the bandwidth.
The reads from c generate one of the worst-case scenarios, where each thread indexes into its own 128B block, so each thread needs its own transfer, which on the one hand is a bit of a serialization scenario (although not quite as bad as normal, since the hardware should be able to overlap the transfers). What's worse is that this will transfer a 32B block for each thread, wasting 7/8 of the bandwidth (we access 4B/thread, and 32B/4B=8, so only 1/8 of the bandwidth is utilized). Since this is the access pattern of naive matrix transposes, it is highly advisable to do those using local memory (speaking from experience).
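If you want to double-check the offset columns in tables like the one above, a small host-side C sketch can reproduce them (purely illustrative, not part of the kernel; map_a is a made-up helper for the a[gid+1] indexing):

#include <stdio.h>

static int map_a(int gid) { return gid + 1; }   /* index expression used for a[gid+1] */

int main(void)
{
    const int elem_size = 4;                /* 32-bit ints */
    for (int hw = 0; hw < 4; hw++) {        /* first four halfwarps of 16 threads */
        int first = map_a(hw * 16) * elem_size;
        int last  = map_a(hw * 16 + 15) * elem_size + elem_size - 1;
        printf("gid %3d-%3d : offsets %3d-%3d\n",
               hw * 16, hw * 16 + 15, first, last);
    }
    return 0;
}

Running it prints the ranges 4-67, 68-131, 132-195 and 196-259, matching the table.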
Compute 1.0 (G80)
Here the only pattern which will create good accesses is the original; all the patterns in the example will create completely uncoalesced accesses, wasting 7/8 of the bandwidth (32B transfer/thread, see above). For G80 hardware, every access where the nth thread in a halfwarp doesn't access the nth element creates such uncoalesced accesses.
Compute 2.0 (Fermi)
Here every access to memory creates 128B transactions (as many as necessary to gather all data, so 16x128B in the worst case); however, those are cached, making it less obvious where data will be transferred. For the moment let's assume the cache is big enough to hold all data and there are no conflicts, so every 128B cache line will be transferred at most once. Let's furthermore assume serialized execution of the halfwarps, so we have a deterministic cache occupation.
Accesses to b will still always transfer 128B blocks (no other thread indexes into the corresponding memory area). Accesses to c will generate 128B transfers per thread (worst access pattern possible).
For accesses to a it looks like the following (treating them as reads for the moment):
gid | offset | accesses | reasoning
0- 15 | 4- 67 | 1x128B | bringing 128B block to cache
16- 31 | 68-131 | 1x128B | offsets 68-127 already in cache, bring 128B for 128-131 to cache
32- 47 | 132-195 | - | block already in cache from last halfwarp
48- 63 | 196-259 | 1x128B | offsets 196-255 already in cache, bringing in 256-383
So for large arrays the accesses to a will waste almost no bandwidth theoretically.
For this example the reality is of course not quite as good, since the accesses to c will trash the cache pretty thoroughly.
For the profiler, I would assume that the efficiencies over 1.0 are simply the result of floating-point inaccuracies.
Hope that helps

What is a stack pointer used for in microprocessors?

I am preparing for a microprocessor exam. If the use of a program counter is to hold the address of the next instruction, what is use of stack pointer?
A stack is a LIFO data structure (last in, first out, meaning last entry you push on to the stack is the first one you get back when you pop). It is typically used to hold stack frames (bits of the stack that belong to the current function).
This may include, but is not limited to:
the return address.
a place for a return value.
passed parameters.
local variables.
You push items onto the stack and pop them off. In a microprocessor, the stack can be used for both user data (such as local variables and passed parameters) and CPU data (such as return addresses when calling subroutines).
The actual implementation of a stack depends on the microprocessor architecture. It can grow up or down in memory and can move either before or after the push/pop operations.
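To make the "adjust before or after the push/pop" distinction concrete, here is a small illustrative C sketch for a descending stack (the names and convention labels are made up for illustration; a "full" stack adjusts SP before the store, an "empty" stack stores first and adjusts afterwards):

#include <stdint.h>

static uint32_t memory[1024];

/* Full descending: SP points at the last value pushed (ARM-style). */
static void push_full_desc(uint32_t *sp, uint32_t v)  { *sp -= 1; memory[*sp] = v; }
static uint32_t pop_full_desc(uint32_t *sp)           { return memory[(*sp)++]; }

/* Empty descending: SP points at the next free slot (6502-style). */
static void push_empty_desc(uint32_t *sp, uint32_t v) { memory[*sp] = v; *sp -= 1; }
static uint32_t pop_empty_desc(uint32_t *sp)          { *sp += 1; return memory[*sp]; }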
Operations which typically affect the stack are:
subroutine calls and returns.
interrupt calls and returns.
code explicitly pushing and popping entries.
direct manipulation of the stack pointer register, sp.
Consider the following program in my (fictional) assembly language:
Addr  Opcodes   Instructions    ; Comments
----  --------  --------------  ----------
                                ; 1: pc<-0000, sp<-8000
0000  01 00 07  load r0,7       ; 2: pc<-0003, r0<-7
0003  02 00     push r0         ; 3: pc<-0005, sp<-7ffe, (sp:7ffe)<-0007
0005  03 00 00  call 000b       ; 4: pc<-000b, sp<-7ffc, (sp:7ffc)<-0008
0008  04 00     pop r0          ; 7: pc<-000a, r0<-(sp:7ffe[0007]), sp<-8000
000a  05        halt            ; 8: pc<-000a
000b  06 01 02  load r1,[sp+2]  ; 5: pc<-000e, r1<-(sp+2:7ffe[0007])
000e  07        ret             ; 6: pc<-(sp:7ffc[0008]), sp<-7ffe
Now let's follow the execution, describing the steps shown in the comments above:
This is the starting condition where pc (the program counter) is 0 and sp is 8000 (all these numbers are hexadecimal).
This simply loads register r0 with the immediate value 7 and moves pc to the next instruction (I'll assume that you understand the default behavior will be to move to the next instruction unless otherwise specified).
This pushes r0 onto the stack by reducing sp by two then storing the value of the register to that location.
This calls a subroutine. What would have been pc in the next step is pushed on to the stack in a similar fashion to r0 in the previous step, then pc is set to its new value. This is no different to a user-level push other than the fact it's done more as a system-level thing.
This loads r1 from a memory location calculated from the stack pointer - it shows a way to pass parameters to functions.
The return statement extracts the value from where sp points and loads it into pc, adjusting sp up at the same time. This is like a system-level pop instruction (see next step).
Popping r0 off the stack involves extracting the value from where sp currently points, then adjusting sp up.
The halt instruction simply leaves pc where it is, an infinite loop of sorts.
Hopefully from that description, it will become clear. Bottom line is: a stack is useful for storing state in a LIFO way and this is generally ideal for the way most microprocessors do subroutine calls.
Unless you're a SPARC of course, in which case you use a circular buffer for your stack :-)
Update: Just to clarify the steps taken when pushing and popping values in the above example (whether explicitly or by call/return), see the following examples:
LOAD R0,7
PUSH R0
                      Adjust sp           Store val
sp->  +--------+      +--------+          +--------+
      |  xxxx  |  sp->|  xxxx  |      sp->|  0007  |
      |        |      |        |          |        |
      |        |      |        |          |        |
      |        |      |        |          |        |
      +--------+      +--------+          +--------+
POP R0
                      Get value           Adjust sp
      +--------+      +--------+      sp->+--------+
sp->  |  0007  |  sp->|  0007  |          |  0007  |
      |        |      |        |          |        |
      |        |      |        |          |        |
      |        |      |        |          |        |
      +--------+      +--------+          +--------+
The stack pointer stores the address of the most recent entry that was pushed onto the stack.
To push a value onto the stack, the stack pointer is moved to the next free memory address (incremented on an ascending stack such as the one described here; decremented on descending stacks such as the 6502's), and the new value is copied to that address in memory.
To pop a value from the stack, the value is copied from the address held in the stack pointer, and the stack pointer is then moved back, pointing it to the next remaining item in the stack.
The most typical use of a hardware stack is to store the return address of a subroutine call. When the subroutine is finished executing, the return address is popped off the top of the stack and placed in the Program Counter register, causing the processor to resume execution at the next instruction following the call to the subroutine.
http://en.wikipedia.org/wiki/Stack_%28data_structure%29#Hardware_stacks
You got more preparing [for the exam] to do ;-)
The Stack Pointer is a register which holds the address of the next available spot on the stack.
The stack is an area of memory reserved to hold a stack, that is, a LIFO (Last In, First Out) container, where we store local variables and return addresses, allowing simple management of the nesting of function calls in a typical program.
See this Wikipedia article for a basic explanation of the stack management.
For the 8085: the stack pointer is a special-purpose 16-bit register in the microprocessor which holds the address of the top of the stack.
The stack pointer register in a computer is made available for general purpose use by programs executing at lower privilege levels than interrupt handlers. A set of instructions in such programs, excluding stack operations, stores data other than the stack pointer, such as operands, and the like, in the stack pointer register. When switching execution to an interrupt handler on an interrupt, return address data for the currently executing program is pushed onto a stack at the interrupt handler's privilege level. Thus, storing other data in the stack pointer register does not result in stack corruption. Also, these instructions can store data in a scratch portion of a stack segment beyond the current stack pointer.
Read this one for more info.
General purpose use of a stack pointer register
The Stack is an area of memory for keeping temporary data. The stack is used by the CALL instruction to keep the return address for procedures; the RET instruction gets this value from the stack and returns to that offset. The same thing happens when an INT instruction calls an interrupt: it stores the flag register, code segment and offset on the stack, and the IRET instruction is used to return from the interrupt call.
The Stack is a Last In First Out (LIFO) memory. Data is placed onto the stack with a PUSH instruction and removed with a POP instruction. The stack memory is maintained by two registers: the Stack Pointer (SP) and the Stack Segment (SS) register. When a word of data is pushed onto the stack, the high-order byte is placed in location SP-1 and the low-order byte in location SP-2, and SP is then decremented by 2. SP is added to SS x 10H to form the physical stack memory address. The reverse happens when data is popped from the stack: the low-order byte is read from the location SP points to and the high-order byte from SP+1, and SP is then incremented by 2.
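As a rough illustration of that push/pop byte order, here is a hedged C sketch (not real 8086 code; segment arithmetic with SS is omitted and the memory array and names are made up):

#include <stdint.h>
#include <stdio.h>

static uint8_t  mem[0x10000];   /* 64 KiB of stack-segment memory */
static uint16_t sp = 0xFFFE;    /* stack pointer */

static void push_word(uint16_t value)
{
    mem[(uint16_t)(sp - 1)] = value >> 8;    /* high-order byte at SP-1 */
    mem[(uint16_t)(sp - 2)] = value & 0xFF;  /* low-order byte at SP-2  */
    sp -= 2;
}

static uint16_t pop_word(void)
{
    uint16_t value = mem[sp] | (mem[(uint16_t)(sp + 1)] << 8);  /* low at SP, high at SP+1 */
    sp += 2;
    return value;
}

int main(void)
{
    push_word(0x1234);
    printf("popped %04x\n", pop_word());     /* prints "popped 1234" */
    return 0;
}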
The stack pointer holds the address of the top of the stack. A stack allows functions to pass arguments stored on the stack to each other, and to create scoped variables. Scope in this context means that the variable is popped off the stack when the stack frame is gone and/or when the function returns. Without a stack, you would need to use explicit memory addresses for everything. That would make it impossible (or at least severely difficult) to design high-level programming languages for the architecture.
Also, each CPU mode usually has its own banked stack pointer. So when exceptions occur (interrupts, for example), the exception handler routine can use its own stack without corrupting the user process.
Should you ever crave deeper understanding, I heartily recommend Patterson and Hennessy as an intro and Hennessy and Patterson as an intermediate to advanced text. They're pricey, but truly non-pareil; I just wish either or both were available when I got my Masters' degree and entered the workforce designing chips, systems, and parts of system software for them (but, alas!, that was WAY too long ago;-). Stack pointers are so crucial (and the distinction between a microprocessor and any other kind of CPU so utterly meaningful in this context... or, for that matter, in ANY other context, in the last few decades...!-) that I doubt anything but a couple of thorough from-the-ground-up refreshers can help!-)
On some CPUs, there is a dedicated set of registers for the stack. When a call instruction is executed, one register is loaded with the program counter at the same time as a second register is loaded with the contents of the first, a third register is loaded with the second, a fourth with the third, and so on. When a return instruction is executed, the program counter is latched with the contents of the first stack register at the same time as that register is latched from the second; that second register is loaded from a third, and so on. Note that such hardware stacks tend to be rather small (many of the smaller PIC-series micros, for example, have a two-level stack).
While a hardware stack does have some advantages (push and pop don't add any time to a call/return, for example) having registers which can be loaded with two sources adds cost. If the stack gets very big, it will be cheaper to replace the push-pull registers with an addressable memory. Even if a small dedicated memory is used for this, it's cheaper to have 32 addressable registers and a 5-bit pointer register with increment/decrement logic, than it is to have 32 registers each with two inputs. If an application might need more stack than would easily fit on the CPU, it's possible to use a stack pointer along with logic to store/fetch stack data from main RAM.
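A tiny illustrative C sketch of such a shift-register style hardware stack (the two-level depth and the names are made up): on a call everything shifts down one level and the PC is latched into the first register; on a return it shifts back up.

#include <stdint.h>

#define STACK_LEVELS 2

static uint16_t hw_stack[STACK_LEVELS];

static void hw_call(uint16_t *pc, uint16_t target)
{
    for (int i = STACK_LEVELS - 1; i > 0; i--)   /* shift older entries down */
        hw_stack[i] = hw_stack[i - 1];
    hw_stack[0] = *pc;                           /* latch the return address */
    *pc = target;
}

static void hw_return(uint16_t *pc)
{
    *pc = hw_stack[0];                           /* restore the PC */
    for (int i = 0; i < STACK_LEVELS - 1; i++)   /* shift entries back up */
        hw_stack[i] = hw_stack[i + 1];
}

With only two levels, a third nested call in this sketch overwrites the oldest return address, which illustrates why such stacks stay shallow.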
A stack pointer is a small register that stores the address of the top of the stack; its purpose is simply to point at the most recently used location on the stack.

Stacks - why PUSH and POP?

I was wondering why we use the terms "push" and "pop" for adding/removing items from stacks? Is there some physical metaphor that caused those terms to be common?
The only suggestion I have is something like a spring-loaded magazine for a handgun, where rounds are "pushed" into it and can be "popped" out, but that seems a little unlikely.
A second stack trivia question: Why do most CPUs implement the call stack as growing downwards in memory, rather than upwards?
For your second question, Wikipedia has an article about the CS philosophy that controls the stack:
http://en.wikipedia.org/wiki/LIFO
And for the first, also on wikipedia:
A frequently used metaphor is the idea of a stack of plates in a spring loaded cafeteria stack. In such a stack, only the top plate is visible and accessible to the user, all other plates remain hidden. As new plates are added, each new plate becomes the top of the stack, hiding each plate below, pushing the stack of plates down. As the top plate is removed from the stack, they can be used, the plates pop back up, and the second plate becomes the top of the stack. Two important principles are illustrated by this metaphor: the Last In First Out principle is one; the second is that the contents of the stack are hidden. Only the top plate is visible, so to see what is on the third plate, the first and second plates will have to be removed. This can also be written as FILO - First In Last Out, i.e. the record inserted first will be popped out at last.
I believe the spring loaded stack of plates is correct, as the source for the term PUSH and POP.
In particular, the East Campus Commons Cafeteria at MIT had spring loaded stacks of plates in the 1957-1967 time frame. The terms PUSH and POP would have been in use by the Tech Model Railroad Club. I think this is the origin.
The Tech Model Railroad Club definitely influenced the design of the Digital Equipment Corporation's (DEC) PDP-6. The PDP-6 was one of the first machines to have stack oriented instructions in the hardware. The instructions were PUSH, POP, PUSHJ, POPJ.
http://ed-thelen.org/comp-hist/pdp-6.html#Special%20Features
For the second question: assembler programmers on small systems tend to write code that begins at low addresses in memory and grows to higher addresses as more code is added.
Because of this, making a stack grow downward allows you to start the stack at the top of physical memory and allow the two memory zones to grow towards each other. This simplifies memory management in this sort of trivial environment.
Even in a system with segregated ROM/RAM, fixed data allocations are easiest to build from the bottom up, and thus replace the code portion of the above explanation.
While such trivial memory schemes are now very rare, the hardware practice continues as established.
Think of it like a pez dispenser. You can push a new one on top. And then pop it off the top.
That is always what I think of when I think push and pop. (Probably not very historical though)
Re your "second trivial question": I've seen considerable inconsistency in defining what "up" and "down" mean! From early days, some manufacturers and authors drew memory diagrams with low addresses at the top of the page (presumably mimicking the order in which a page is read), while others put high addresses at the top of the page (presumably mimicking graph paper coordinates or the floors in a building).
Of course the concept of a stack (and the concept of addressable memory as well) is independent of such visual metaphors. One can implement a stack which "grows" in either direction. In fact, I've often seen the trick below (in bare-metal level implementations) used to share a region of memory between two stacks:
+---+---+---------   ---------+---+---+
|   |   |   ->     ...    <-  |   |   |
+---+---+---------   ---------+---+---+
  ^                               ^
Stack 1        both stacks     Stack 2
 base         "grow" toward      base
               the middle
So my answer is that stacks conceptually never grow either "downward" or "upward" but simply grow from their base. An individual stack may be implemented in either direction (or in neither direction, if it's using a linked representation with garbage collection, in which case the elements may be anywhere in nodespace).
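A minimal C sketch of that shared-region trick (names and sizes are made up for illustration): two stacks live in one buffer, one growing up from the low end and one growing down from the high end, and the region is full when the two indices meet.

#include <stdbool.h>
#include <stddef.h>

#define REGION_SIZE 64

static int    region[REGION_SIZE];
static size_t top1 = 0;            /* next free slot for stack 1 (grows upward) */
static size_t top2 = REGION_SIZE;  /* one past the last used slot for stack 2 (grows downward) */

static bool push1(int v) { if (top1 >= top2) return false; region[top1++] = v; return true; }
static bool push2(int v) { if (top2 <= top1) return false; region[--top2] = v; return true; }

static bool pop1(int *v) { if (top1 == 0)           return false; *v = region[--top1]; return true; }
static bool pop2(int *v) { if (top2 == REGION_SIZE) return false; *v = region[top2++]; return true; }

Either stack can use any amount of the region as long as their combined depth fits, which is the whole point of the trick.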
Alliteration is always attractive (see what I did there?), and these words are short, alliterative, and suggestive. The same goes for the old BASIC commands peek and poke, which have the extra advantage of the parallel k's.
A common physical metaphor is a cafeteria plate dispenser, where a spring-loaded stack of plates makes it so that you can take a plate off the top, but the next plate rises to be in the same position.
The answers on this page pretty much answer the stack direction question. If I had to sum it up, I would say it is done downwards to remain consistent with ancient computers.
I think the original story came about because of some developers seeing the plate stack (like you often see in buffet restaurants). You pushed a new plate on to the top of the stack, and you popped one off the top as well.
As to stacks growing downwards in memory, remember that when dealing with hierarchical data structures (trees), most programmers are happy to draw one on a page with the base (or trunk) at the top of the page...
I know this thread is really old, but I have a thought about the second question:
In my mind, the stack grows up, even though the memory addresses decrease. If you were to write a whole bunch of numbers on a piece of paper, you would start at the top left, with 0. Then you would increase the numbers going left to right, then top to bottom. So say the stack is like this:
000 001 002 003 004     000 001 002 003 004     000 001 002 003 004
005 006 007 008 009     005 006 007 008 009     005 006 007 008 009
010 011 012 013 014     010 011 012 013 014     010 011 012 013 014
015 016 017 018 019     015 016 017 018 019    [015 016 017 018 019]
020 021 022 023 024    [020 021 022 023 024]   [020 021 022 023 024]
[025 026 027 028 029]  [025 026 027 028 029]   [025 026 027 028 029]
where the bracketed numbers represent stack memory, and the unbracketed numbers represent memory addresses which the stack is not using. Each block of the same numbers represents a stage of the program where the call stack has grown.
Even though the memory addresses are moving downward, the stack is growing upwards.
Similarly, with the spring loaded plate stack,
if you took a plate off the top of the stack, you would call that the first plate (smallest number), right? Even though it is the highest up. A programmer might even call it the zeroth plate.
For the question of why do stacks grow down, I would imagine it is used to save memory.
If you start from the top of the stack memory (the highest-value addresses) and work down towards zero, I assume it's easier to check whether you have reached address $0x00000000 than to allocate a variable giving the maximum height of the stack and check whether or not you have reached that address.
I assume this makes it easier to check whether you're reaching the end of your addressable space, as no matter how much memory is available, the limit of the stack is always going to be $0x00000000.
