Stacks - why PUSH and POP?

I was wondering why we use the terms "push" and "pop" for adding/removing items from stacks? Is there some physical metaphor that caused those terms to be common?
The only suggestion I have is something like a spring-loaded magazine for a handgun, where rounds are "pushed" into it and can be "popped" out, but that seems a little unlikely.
A second stack trivia question: Why do most CPUs implement the call stack as growing downwards in memory, rather than upwards?

For your second question, Wikipedia has an article on the LIFO principle that underlies stacks:
http://en.wikipedia.org/wiki/LIFO
And for the first, also from Wikipedia:
A frequently used metaphor is the idea of a stack of plates in a spring-loaded cafeteria stack. In such a stack, only the top plate is visible and accessible to the user; all other plates remain hidden. As new plates are added, each new plate becomes the top of the stack, hiding each plate below, pushing the stack of plates down. As the top plate is removed from the stack, it can be used, the plates pop back up, and the second plate becomes the top of the stack. Two important principles are illustrated by this metaphor: the Last In First Out principle is one; the second is that the contents of the stack are hidden. Only the top plate is visible, so to see what is on the third plate, the first and second plates will have to be removed. This can also be written as FILO - First In Last Out, i.e. the record inserted first will be popped out last.

I believe the spring-loaded stack of plates is correct as the source for the terms PUSH and POP.
In particular, the East Campus Commons Cafeteria at MIT had spring-loaded stacks of plates in the 1957-1967 time frame. The terms PUSH and POP would have been in use by the Tech Model Railroad Club. I think this is the origin.
The Tech Model Railroad Club definitely influenced the design of Digital Equipment Corporation's (DEC) PDP-6. The PDP-6 was one of the first machines to have stack-oriented instructions in the hardware. The instructions were PUSH, POP, PUSHJ, POPJ.
http://ed-thelen.org/comp-hist/pdp-6.html#Special%20Features

For the second question: Assembler programmers on small systems tend to write code that begins at low addresses in memory, and grow to higher addresses as more code is added.
Because of this, making a stack grow downward allows you to start the stack at the top of physical memory and allow the two memory zones to grow towards each other. This simplifies memory management in this sort of trivial environment.
Even in a system with segregated ROM/RAM, fixed data allocations are easiest to build from the bottom up, so they take the place of the code region in the explanation above.
While such trivial memory schemes are rare nowadays, the hardware practice continues as established.
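To make that layout concrete, here is a toy C sketch of the scheme described above (all names such as grow_data, push and MEM_SIZE are made up for illustration; no real ABI is being modelled): code/data is allocated upward from address 0 while the stack grows downward from the top, and both operations simply fail when the two regions would meet.

#define MEM_SIZE 64

static unsigned char mem[MEM_SIZE];
static int alloc_top = 0;        /* code/data grows upward from address 0 */
static int sp        = MEM_SIZE; /* stack grows downward from the top     */

int grow_data(int nbytes)        /* reserve space for code/data; -1 if it would hit the stack */
{
    if (alloc_top + nbytes > sp)
        return -1;
    int start = alloc_top;
    alloc_top += nbytes;
    return start;
}

int push(unsigned char value)    /* push one byte; -1 if the regions would collide */
{
    if (sp <= alloc_top)
        return -1;
    mem[--sp] = value;
    return 0;
}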

Think of it like a PEZ dispenser. You can push a new one on top. And then pop it off the top.
That is always what I think of when I think push and pop. (Probably not very historical though.)
Are you asking yourself what the heck PEZ is? It's a brand of candy that comes in a spring-loaded dispenser.

Re your "second trivial question": I've seen considerable inconsistency in defining what "up" and "down" mean! From early days, some manufacturers and authors drew memory diagrams with low addresses at the top of the page (presumably mimicking the order in which a page is read), while others put high addresses at the top of the page (presumably mimicking graph paper coordinates or the floors in a building).
Of course the concept of a stack (and the concept of addressable memory as well) is independent of such visual metaphors. One can implement a stack which "grows" in either direction. In fact, I've often seen the trick below (in bare-metal level implementations) used to share a region of memory between two stacks:
+---+---+-------- -------+--+--+--+
|   |   |  -> ... <-     |  |  |  |
+---+---+-------- -------+--+--+--+
 ^                               ^
Stack 1      both stacks     Stack 2
base        "grow" toward       base
             the middle
So my answer is that stacks conceptually never grow either "downward" or "upward" but simply grow from their base. An individual stack may be implemented in either direction (or in neither direction, if it's using a linked representation with garbage collection, in which case the elements may be anywhere in nodespace).
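For what it's worth, a bare-bones C sketch of that shared-region trick might look like the following (cells, sp1, sp2 and the function names are invented for the example; a real bare-metal implementation would use raw pointers and add underflow checks):

#define REGION_SIZE 16

static int cells[REGION_SIZE];
static int sp1 = 0;               /* next free cell for stack 1 (grows up)   */
static int sp2 = REGION_SIZE - 1; /* next free cell for stack 2 (grows down) */

int push1(int v) { if (sp1 > sp2) return -1; cells[sp1++] = v; return 0; } /* -1 = region full */
int push2(int v) { if (sp2 < sp1) return -1; cells[sp2--] = v; return 0; }
int pop1(void)   { return cells[--sp1]; }  /* callers must check for empty stacks */
int pop2(void)   { return cells[++sp2]; }

Neither stack can overwrite the other: a push only succeeds while there is still a free cell left in the shared region.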

Alliteration is always attractive (see what I did there?), and these words are short, alliterative, and suggestive. The same goes for the old BASIC commands peek and poke, which have the extra advantage of the parallel k's.
A common physical metaphor is a cafeteria plate dispenser, where a spring-loaded stack of plates makes it so that you can take a plate off the top, but the next plate rises to be in the same position.

The answers on this page pretty much answer the stack direction question. If I had to sum it up, I would say it is done downwards to remain consistent with ancient computers.

I think the original story came about because of some developers seeing the plate stack (like you often see in buffet restaurants). You pushed a new plate on to the top of the stack, and you popped one off the top as well.

As to stacks growing downwards in memory, remember that when dealing with hierarchical data structures (trees), most programmers are happy to draw one on a page with the base (or trunk) at the top of the page...

I know this thread is really old, but I have a thought about the second question:
In my mind, the stack grows up, even though the memory addresses decrease. If you were to write a whole bunch of numbers on a piece of paper, you would start at the top left, with 0. Then you would increase the numbers going left to right, then top to bottom. So say the stack is like this:
(the bracketed numbers represent stack memory; the plain numbers are memory addresses which the stack is not using)

Stage 1:
 000   001   002   003   004
 005   006   007   008   009
 010   011   012   013   014
 015   016   017   018   019
 020   021   022   023   024
[025] [026] [027] [028] [029]

Stage 2:
 000   001   002   003   004
 005   006   007   008   009
 010   011   012   013   014
 015   016   017   018   019
[020] [021] [022] [023] [024]
[025] [026] [027] [028] [029]

Stage 3:
 000   001   002   003   004
 005   006   007   008   009
 010   011   012   013   014
[015] [016] [017] [018] [019]
[020] [021] [022] [023] [024]
[025] [026] [027] [028] [029]

Each stage shows the same memory later in the program, after the call stack has grown.
Even though the memory addresses are moving downward, the stack is growing upwards.
Similarly, with the spring-loaded plate stack:
if you took a plate off the top of the stack, you would call that the first plate (the smallest number), right? Even though it is the highest up. A programmer might even call it the zeroth plate.

For the question of why stacks grow down, I would imagine it is done to save memory.
If you start from the top of the stack memory (the highest addresses) and work down towards zero, I assume it's easier to check whether you have reached address 0x00000000 than to set aside a variable holding the maximum extent of the stack and check whether you have reached that address.
I assume this makes it easier to check whether you're reaching the end of your addressable space: no matter how much memory is available, the limit of the stack is always going to be 0x00000000.

Related

Using HERE in Forth for temporary space

I'm writing a game in Forth (for learning purposes).
The game is played on a "10 cell board". I'm trying new stuff so I did
here 10 [char] - fill
to set up the space for the board.
Then, to play 'X' in position 3
[char] X here 3 + c!
This has been working fine, but raises the question
Is this OK?
What if the board was a million cells wide?
Thanks
The described approach has certain environmental dependencies, so your program just has to match the environmental restrictions on programs of the Forth system that you use.
1. Size of the data space
The word UNUSED returns "the amount of space remaining in the region addressed by HERE". So, a program can check the available space.
Also, according to the subsection 4.1.3 Other system documentation of the Forth Standard:
A system shall provide the following information:
[...]
program data space available, in address units;
So, you just have to check whether your Forth system provides enough data space for your program, and how the available data space can be configured (if at all).
2. Transient regions
In the general case, it is not safe for a portable program to use the data space without reserving it.
According to the section 3.3.3.6 Other transient regions of the Forth Standard, the contents of the data space regions identified by PAD, WORD, and #> may become invalid after the data space is allocated. Consequently, the contents of the region identified by HERE may become invalid after the contents of the regions identified by PAD, WORD, and #> are changed.
See also A.3.3.3.6 Other transient regions:
In many existing Forth systems, these areas are at HERE or just beyond it, hence the many restrictions.
Moreover, some Forth systems may use the region identified by HERE for internal purposes during translation. E.g. Gforth 0.7.9 uses this region when decoding escaped strings.
The phrase:
s\" test\-passed" cr here over type cr type cr
outputs:
test-passed
test-passed
So, you have to check the restrictions of your Forth system to see whether you may use the region identified by HERE without reserving the space (and under what conditions).

nand2tetris. Memory implementation

I realized Data memory implementation in nand2tetris course. But I really don't understand some parts of my implementation:
CHIP Memory {
    IN in[16], load, address[15];
    OUT out[16];

    PARTS:
    DMux4Way(in=load, sel=address[13..14], a=RAM1, b=RAM2, c=scr, d=kbr);
    Or(a=RAM1, b=RAM2, out=RAM);
    RAM16K(in=in, load=RAM, address=address[0..13], out=RAMout);
    Screen(in=in, load=scr, address=address[0..12], out=ScreenOut);
    Keyboard(out=KeyboardOut);
    Mux4Way16(a=RAMout, b=RAMout, c=ScreenOut, d=KeyboardOut, sel=address[13..14], out=out);
}
What is load responsible for here? I understand that if load is 0, the outputs of DMux4Way will be 0 0 0 0 in any case. But I don't understand how it works after that, namely how it prevents data from being loaded into Memory.
It is also unclear to me why we feed Screen address[0..12] instead of address[0..14], the full address. In my opinion we should use the latter, because the Screen memory map comes after the RAM memory map, so a request aimed at the Screen memory map should use the range 16384-24575 in decimal, or 100000000000000-101111111111111 in binary. But how can we represent that range using just a 13-bit bus (address[0..12])? It seems impossible.
Therefore, to address the Screen memory map we should need the range presented above, which is 15 bits wide (address[0..14]), not 13 bits (address[0..12]). So why does address[0..12] work while address[0..14] (the full address) does not?
DMux4Way(in=load, sel=address[13..14], a=RAM1, b=RAM2, c=scr, d=kbr);
I'm sorry to criticize you right at the start, but the questions you ask suggest that you didn't do this exercise yourself or didn't start the whole course from the beginning.
To answer your questions:
Ad.1.
You demultiplex a single bit (the load bit) to the correct memory part, while feeding the input data to all memory parts at the same time.
It's easier and neater than doing it the other way around, namely, to direct 16-bit input to the correct part (RAM16K, screen, or keyboard) while having a load bit that is connected and active at every register in all the parts.
To clarify: you have 2 possible destinations when writing data, RAM and Screen. The smallest demultiplexer you have is a 4-way demultiplexer, and that's what you're using. When you write into memory, you need to provide 2 pieces of information at the same time: the data and the destination.
You might demultiplex the input data with DMux4Way16 and, separately, the single load bit with DMux4Way, but that would take 2 demultiplexers, and we can do better than that. That's what's done here: you direct the data input to both RAM and Screen and then use only one demultiplexer, DMux4Way, to select one of the 2 possible destinations. Only the selected one will be loaded with new data; on the other, the data input will be ignored. Knowing that, you need to study the A-instruction format: when bits 14 and 13 of the A-instruction (or of the data residing in the A-register) have the binary value 00 or 01, the destination is RAM. When bits 14 and 13 have the binary value 10, the screen is the destination.
Once you notice that, you choose these 2 bits as sel for your demultiplexer. Selections 0 and 1 have the same meaning, so you can OR them and feed the result as load to RAM. Selection 2 means Screen will be loaded with a new value, so the load bit goes there. Selection 3 is never used, so we don't care about it - output d of the demultiplexer is not connected anywhere. We make use of the demultiplexer's feature that only the selected output carries the input (load) value and all other outputs yield 0. That means only 1 memory destination will be loaded.
Ad.2.
Screen is a separate device; it has nothing to do with the RAM, ROM, or Keyboard devices here. You, and only you, give meaning to what the bits mean for this specific device. To answer your question: when you address some register in Screen, you address it in its own internal address space. In that internal address space the first address is 0, even though from the point of view of the whole Memory it is 16384. It's your job to make this translation. Given the size of the Screen device, it is not necessary to use the full 15-bit address; 13 bits is all you need to address its 8K registers. What would the extra bits mean in this case? They wouldn't add any value. Also, you are a user, not the designer, of Screen; you only look at and follow its interface description.
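If it helps, here is the same address split written out as a plain C model (this is just an illustration of the mapping, not HDL; the constants come from the Hack memory map, where RAM occupies 0x0000-0x3FFF, Screen 0x4000-0x5FFF and the Keyboard register sits at 0x6000):

#include <stdint.h>

enum device { DEV_RAM, DEV_SCREEN, DEV_KEYBOARD };

enum device select_device(uint16_t address)       /* address is 15 bits wide */
{
    switch ((address >> 13) & 0x3) {              /* bits 14..13 - the DMux sel */
    case 0:
    case 1:  return DEV_RAM;                      /* 00 or 01 -> RAM16K   */
    case 2:  return DEV_SCREEN;                   /* 10       -> Screen   */
    default: return DEV_KEYBOARD;                 /* 11       -> Keyboard */
    }
}

uint16_t internal_address(uint16_t address)
{
    if (select_device(address) == DEV_SCREEN)
        return address & 0x1FFF;  /* low 13 bits: 0x4000 maps to Screen cell 0 */
    return address & 0x3FFF;      /* low 14 bits select the cell inside RAM16K */
}

Feeding Screen only address[0..12] in the chip is exactly the `address & 0x1FFF` step: the device sees its own 0-8191 range, and the chip around it is what decides that this range means 16384-24575 of the whole memory.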
Hope this answers your questions; if not, I urge you to go back and study the previous hardware-related chapters of the course more carefully.

How is RAM able to access any place in memory at O(1) speed

We are taught that the abstraction of RAM is a long array of bytes, and that for the CPU it takes the same amount of time to access any part of it. What is the device that has the ability to access any byte out of the 4 gigabytes (on my computer) in the same amount of time? This does not seem like a trivial task to me.
I have asked colleagues and my professors, but nobody can pinpoint how this can be achieved with simple logic gates, and if it isn't just a tricky combination of logic gates, then what is it?
My personal guess is that you could achieve access to any memory location in O(log(n)) time, where n is the size of memory, because each gate would split the memory in two and send your memory access instruction to the next split-the-memory-in-two gate. But that requires a lot of gates. I can't come up with any other educated guess, and I don't even know the name of the device that I should look up on Google.
Please help my anguished curiosity, and thanks in advance.
EDIT:
This is what I learned!
To quote from your answer: "the RAM can send the value from cell addressed X to some output pins" - this is exactly the step that everyone (again) skips and that is not trivial for me. The way I see it, in order to build the gates that decide, from 64 pins, which byte out of 2^64 to fetch, each pin needs to split the overall range of memory in two: if the bit at index 0 is 0, the address lies in the lower half 0 to 2^64/2, otherwise it lies in the upper half 2^64/2 to 2^64, and so on. So the number of gate levels (let's call them that) a memory fetch goes through is 64 - a constant. However, the number of gates needed is N, where N is the number of memory bytes.
Just because there are 64 pins, it doesn't mean you can decode them into a single fetch over a range of 2^64 for free. Does 4 gigabytes of memory come with on the order of 4 billion gates in the memory controller???
Now, this can be improved. As I furiously read more and more about how this memory is architected: if you place the memory into a matrix with sqrt(N) rows and sqrt(N) columns, the number of gate levels a fetch needs to go through is O(2*log(sqrt(N))) and the number of gates required is 2*sqrt(N), which is much better - and I think the exact details are probably a trade secret.
END EDIT
What the heck, I might as well make this an answer.
Yes, in the physical world, memory access cannot be constant time.
But it cannot even be logarithmic time. The O(log n) circuit you have in mind ultimately involves some sort of binary (or whatever) tree, and you cannot make a binary tree with constant-length wires in a 3D universe.
Whatever the "bits per unit volume" capacity of your technology is, storing n bits requires a sphere with radius O(n^(1/3)). Since information can only travel at the speed of light, accessing a bit at the other end of the sphere requires time O(n^(1/3)).
But even this is wrong. If you want to talk about actual limitations of our universe, our physics friends say the absolute maximum number of bits you can store in any sphere is proportional to the sphere's surface area, not its volume. So the actual radius of a minimal sphere containing n bits of information is O(sqrt(n)).
As I mentioned in my comment, all of this is pretty much moot. The models of computation we generally use to analyze algorithms assume constant-access-time RAM, which is close enough to the truth in practice and allows a fair comparison of competing algorithms. (Although good engineers working on high-performance code are very concerned about locality and the memory hierarchy...)
Let's say your RAM has 2^64 cells (places where it is possible to store a single value, say 8 bits each). Then it needs 64 pins to address every cell with a different number. When a binary number X 'appears' at the input pins of your RAM, the RAM can send the value from the cell addressed X to some output pins, and your CPU can get the value from there. In hardware the addressing can be done quite easily, for example by using multiple NAND gates (such an 'addressing device' built from logic gates is called a decoder).
So it is all happening at the hardware level; this is just direct addressing. If the CPU is able to provide 64 bits to the 64 pins of your RAM, it can address every single memory cell (as 64 bits are enough to represent any number up to 2^64 - 1). The only reason why you do not get the value immediately is a kind of 'propagation time', i.e. the time it takes for the signal to go through all the logic gates in the circuit.
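As a rough software analogue (real decoders are parallel combinational gates, not a loop; the function below is purely illustrative), a decoder takes an n-bit address and raises exactly one of its 2^n word lines:

#include <stdint.h>

/* One-hot decode: exactly one word line goes high for a given address.
   In hardware each line has its own small group of gates working in
   parallel, so the delay stays small even though the gate count grows
   with the number of cells. */
void decode(unsigned address, unsigned n_bits, uint8_t *word_lines)
{
    unsigned lines = 1u << n_bits;
    for (unsigned line = 0; line < lines; line++)
        word_lines[line] = (line == address);
}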
The component responsible for dealing with memory accesses is the memory controller. It is used by the CPU to read from and write to memory.
The access time is constant because memory words are truly laid out in matrix form (thus, the "byte array" abstraction is very realistic), where you have rows and columns. To fetch a given memory position, the desired memory address is passed on to the controller, which then activates the right row and column.
From http://computer.howstuffworks.com/ram1.htm:
Memory cells are etched onto a silicon wafer in an array of columns (bitlines) and rows (wordlines). The intersection of a bitline and wordline constitutes the address of the memory cell.
So, basically, the answer to your question is: the memory controller figures it out. Of course, given a memory address, the mapping to row and column must be easy to calculate in constant time.
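For a concrete (hypothetical) example of how trivial that mapping can be, imagine a chip organised as 2^13 rows by 2^13 columns; the controller just slices the address bits:

#include <stdint.h>

#define COL_BITS 13   /* hypothetical geometry: 2^13 columns per row */

void split_address(uint32_t address, uint32_t *row, uint32_t *col)
{
    *row = address >> COL_BITS;               /* upper bits drive the row (wordline) decoder  */
    *col = address & ((1u << COL_BITS) - 1);  /* lower bits drive the column (bitline) select */
}

Nothing beyond shifting and masking is needed, which is why the mapping itself adds no meaningful delay.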
To fully understand this topic, I recommend reading this guide on how memory works: http://computer.howstuffworks.com/ram.htm
There are so many concepts to master that it is difficult to explain it all in one answer.
I'd been reading your comments and questions up until I answered. I think you are on the right track, but there is some confusion here. The random access you are imagining doesn't exist in the same way you think it does.
Reading, writing, and refreshing are done in a continuous cycle. A particular cell in memory is only read or written in a certain interval if a signal is detected to do so in that cycle. There is going to be support circuitry that includes "sense amplifiers to amplify the signal or charge detected on a memory cell."
Unless I am misunderstanding what you are implying, your confusion is about how easy it is to read/write a cell. It differs depending on chip design, but there IS a minimum number of cycles it takes to read or write data to a cell.
These are my sources:
http://www.doc.ic.ac.uk/~dfg/hardware/HardwareLecture16.pdf
http://www.electronics.dit.ie/staff/tscarff/memory/dram_cycles.htm
http://www.ece.cmu.edu/~ece548/localcpy/dramop.pdf
To avoid a humungous answer, I left most of the detail out but all three of these will describe the process you are looking for.

How is a context diagram different than a level 0 diagram?

What is the difference, if any, between a context diagram and a level 0 diagram?
There are some conflicts in the literature about these two terms.
Refer to page 54 of this book, for example. It is highly rated on Google Books and is a standard text in many schools. It says that a context diagram is the same as a Level 0 DFD. This one disagrees, on page 210.
I'll first address the notion of "levels".
As we know, initially the whole of the system is represented by one big block, and interactions with the system are clearly depicted. At this point, we are seeing the system with the naked eye.
Now, think of yourself holding something like a microscope. You place the lens above the system block and zoom in. This "zooming in" takes you to the next level in the hierarchy. So now, you see that the system is made up of a number of blocks.
You pick up any of the sub-blocks, and then zoom in again, thus going to the next level and so on.
So we see that there is a hierarchy of diagrams, with each level taking us to the next level of detail. The only bone of contention that remains is the name of the first level (the view with the naked eye).
As you can see, the question is not very objective, hence the ambiguity.
We can have :
Context Diagram ->
  Level 0 DFD ->
    ... -> Level n DFD

OR

Context Diagram/Level 0 DFD ->
  Level 1 DFD ->
    ... -> Level n DFD
It boils down to which one looks better. In my personal opinion, the first hierarchy is more apt. This is because initially, all we see is the system and the context within which it operates. I feel that anyone who understands the explanation shouldn't worry much about the nomenclature.
Refer to this for more.
A very difficult discussion.
My thoughts:
A context diagram only has 1 process, while a DFD level 0 can have more.
The context diagram establishes the context of the system to be developed; that is, it represents the interactions of the system with various external entities.
A data flow diagram, on the other hand, is a simple graphical notation that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output generated by the system. It is simple to understand and use.

What is a stack pointer used for in microprocessors?

I am preparing for a microprocessor exam. If the use of a program counter is to hold the address of the next instruction, what is the use of the stack pointer?
A stack is a LIFO data structure (last in, first out, meaning the last entry you push onto the stack is the first one you get back when you pop). It is typically used to hold stack frames (the bits of the stack that belong to the current function).
This may include, but is not limited to:
the return address.
a place for a return value.
passed parameters.
local variables.
You push items onto the stack and pop them off. In a microprocessor, the stack can be used for both user data (such as local variables and passed parameters) and CPU data (such as return addresses when calling subroutines).
The actual implementation of a stack depends on the microprocessor architecture. The stack can grow up or down in memory, and the stack pointer can be adjusted either before or after the value is stored or retrieved.
Operations which typically affect the stack are:
subroutine calls and returns.
interrupt calls and returns.
code explicitly pushing and popping entries.
direct manipulation of the stack pointer register, sp.
Consider the following program in my (fictional) assembly language:
Addr Opcodes Instructions ; Comments
---- -------- -------------- ----------
; 1: pc<-0000, sp<-8000
0000 01 00 07 load r0,7 ; 2: pc<-0003, r0<-7
0003 02 00 push r0 ; 3: pc<-0005, sp<-7ffe, (sp:7ffe)<-0007
0005 03 00 00 call 000b ; 4: pc<-000b, sp<-7ffc, (sp:7ffc)<-0008
0008 04 00 pop r0 ; 7: pc<-000a, r0<-(sp:7ffe[0007]), sp<-8000
000a 05 halt ; 8: pc<-000a
000b 06 01 02 load r1,[sp+2] ; 5: pc<-000e, r1<-(sp+2:7ffe[0007])
000e 07 ret ; 6: pc<-(sp:7ffc[0008]), sp<-7ffe
Now let's follow the execution, describing the steps shown in the comments above:
This is the starting condition where pc (the program counter) is 0 and sp is 8000 (all these numbers are hexadecimal).
This simply loads register r0 with the immediate value 7 and moves pc to the next instruction (I'll assume that you understand the default behavior will be to move to the next instruction unless otherwise specified).
This pushes r0 onto the stack by reducing sp by two then storing the value of the register to that location.
This calls a subroutine. What would have been pc in the next step is pushed on to the stack in a similar fashion to r0 in the previous step, then pc is set to its new value. This is no different to a user-level push other than the fact it's done more as a system-level thing.
This loads r1 from a memory location calculated from the stack pointer - it shows a way to pass parameters to functions.
The return statement extracts the value from where sp points and loads it into pc, adjusting sp up at the same time. This is like a system-level pop instruction (see next step).
Popping r0 off the stack involves extracting the value from where sp currently points, then adjusting sp up.
The halt instruction simply leaves pc where it is, an infinite loop of sorts.
Hopefully from that description, it will become clear. Bottom line is: a stack is useful for storing state in a LIFO way and this is generally ideal for the way most microprocessors do subroutine calls.
Unless you're a SPARC of course, in which case you use a circular buffer for your stack :-)
Update: Just to clarify the steps taken when pushing and popping values in the above example (whether explicitly or by call/return), see the following examples:
LOAD R0,7
PUSH R0
                       Adjust sp          Store val
sp -> +--------+        +--------+         +--------+
      | xxxx   |  sp -> | xxxx   |   sp -> | 0007   |
      |        |        |        |         |        |
      |        |        |        |         |        |
      |        |        |        |         |        |
      +--------+        +--------+         +--------+

POP R0
                       Get value          Adjust sp
      +--------+        +--------+   sp -> +--------+
sp -> | 0007   |  sp -> | 0007   |         | 0007   |
      |        |        |        |         |        |
      |        |        |        |         |        |
      |        |        |        |         |        |
      +--------+        +--------+         +--------+
The stack pointer stores the address of the most recent entry that was pushed onto the stack.
To push a value onto the stack, the stack pointer is incremented to point to the next physical memory address, and the new value is copied to that address in memory.
To pop a value from the stack, the value is copied from the address of the stack pointer, and the stack pointer is decremented, pointing it to the next available item in the stack.
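Following the convention in this answer (the stack pointer holds the most recent entry and is incremented on push - note that many real CPUs grow the stack the other way, decrementing on push), a minimal C sketch of push and pop could look like this:

#include <stdint.h>
#include <stdio.h>

#define STACK_CELLS 16

typedef struct {
    uint16_t cells[STACK_CELLS];
    int sp;                     /* index of the most recent entry; -1 when empty */
} Stack;                        /* no overflow/underflow checks - just a sketch  */

void push(Stack *s, uint16_t value) {
    s->sp += 1;                 /* advance to the next free cell... */
    s->cells[s->sp] = value;    /* ...and store the new value there */
}

uint16_t pop(Stack *s) {
    uint16_t value = s->cells[s->sp];  /* read the most recent entry */
    s->sp -= 1;                        /* then move the pointer back */
    return value;
}

int main(void) {
    Stack s = { .sp = -1 };
    push(&s, 0x0007);                            /* like PUSH R0 above    */
    printf("popped %04x\n", (unsigned)pop(&s));  /* prints "popped 0007"  */
    return 0;
}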
The most typical use of a hardware stack is to store the return address of a subroutine call. When the subroutine is finished executing, the return address is popped off the top of the stack and placed in the Program Counter register, causing the processor to resume execution at the next instruction following the call to the subroutine.
http://en.wikipedia.org/wiki/Stack_%28data_structure%29#Hardware_stacks
You got more preparing [for the exam] to do ;-)
The Stack Pointer is a register which holds the address of the next available spot on the stack.
The stack is an area of memory which is reserved to hold a LIFO (Last In First Out) type of container, where we store the local variables and return addresses, allowing simple management of the nesting of function calls in a typical program.
See this Wikipedia article for a basic explanation of the stack management.
For the 8085: the stack pointer is a special-purpose 16-bit register in the microprocessor which holds the address of the top of the stack.
The stack pointer register in a computer is made available for general purpose use by programs executing at lower privilege levels than interrupt handlers. A set of instructions in such programs, excluding stack operations, stores data other than the stack pointer, such as operands, and the like, in the stack pointer register. When switching execution to an interrupt handler on an interrupt, return address data for the currently executing program is pushed onto a stack at the interrupt handler's privilege level. Thus, storing other data in the stack pointer register does not result in stack corruption. Also, these instructions can store data in a scratch portion of a stack segment beyond the current stack pointer.
Read this one for more info.
General purpose use of a stack pointer register
The Stack is an area of memory for keeping temporary data. The Stack is used by the CALL instruction to keep the return address for procedures; the RET instruction gets this value from the Stack and returns to that offset. The same thing happens when an INT instruction calls an interrupt: it stores on the Stack the flag register, code segment and offset. The IRET instruction is used to return from the interrupt call.
The Stack is a Last In First Out (LIFO) memory. Data is placed onto the Stack with a PUSH instruction and removed with a POP instruction. The Stack memory is maintained by two registers: the Stack Pointer (SP) and the Stack Segment (SS) register. When a word of data is PUSHed onto the Stack, the high-order 8-bit byte is placed in location SP-1 and the low-order 8-bit byte is placed in location SP-2; the SP is then decremented by 2. The SP is added to (SS x 10H) to form the physical stack memory address. The reverse sequence occurs when data is POPped from the Stack: the high-order 8-bit byte is obtained from location SP-1 and the low-order 8-bit byte from location SP-2, and the SP is then incremented by 2.
The stack pointer holds the address of the top of the stack. A stack allows functions to pass arguments stored on the stack to each other, and to create scoped variables. Scope in this context means that the variable is popped off the stack when the stack frame is gone and/or when the function returns. Without a stack, you would need to use explicit memory addresses for everything. That would make it impossible (or at least severely difficult) to design high-level programming languages for the architecture.
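A tiny C illustration of that point (with the usual caveat that an optimising compiler may keep some of these values in registers rather than on the stack): each active call of the recursive function below gets its own copy of n and result in its own stack frame, which is exactly what fixed, explicit addresses could not provide.

#include <stdio.h>

static unsigned factorial(unsigned n)
{
    unsigned result;                   /* lives in this call's stack frame  */
    if (n <= 1)
        result = 1;
    else
        result = n * factorial(n - 1); /* the inner call gets a fresh frame */
    return result;
}

int main(void)
{
    printf("%u\n", factorial(5));      /* prints 120 */
    return 0;
}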
Also, each CPU mode usually has its own banked stack pointer, so when exceptions occur (interrupts, for example), the exception handler routine can use its own stack without corrupting the user process.
Should you ever crave deeper understanding, I heartily recommend Patterson and Hennessy as an intro and Hennessy and Patterson as an intermediate to advanced text. They're pricey, but truly non-pareil; I just wish either or both were available when I got my Masters' degree and entered the workforce designing chips, systems, and parts of system software for them (but, alas!, that was WAY too long ago;-). Stack pointers are so crucial (and the distinction between a microprocessor and any other kind of CPU so utterly meaningful in this context... or, for that matter, in ANY other context, in the last few decades...!-) that I doubt anything but a couple of thorough from-the-ground-up refreshers can help!-)
On some CPUs, there is a dedicated set of registers for the stack. When a call instruction is executed, one register is loaded with the program counter at the same time as a second register is loaded with the contents of the first, a third register is loaded with the second, and a fourth with the third, etc. When a return instruction is executed, the program counter is latched with the contents of the first stack register at the same time as that register is latched from the second; that second register is loaded from a third, etc. Note that such hardware stacks tend to be rather small (many of the smaller PIC-series micros, for example, have a two-level stack).
While a hardware stack does have some advantages (push and pop don't add any time to a call/return, for example) having registers which can be loaded with two sources adds cost. If the stack gets very big, it will be cheaper to replace the push-pull registers with an addressable memory. Even if a small dedicated memory is used for this, it's cheaper to have 32 addressable registers and a 5-bit pointer register with increment/decrement logic, than it is to have 32 registers each with two inputs. If an application might need more stack than would easily fit on the CPU, it's possible to use a stack pointer along with logic to store/fetch stack data from main RAM.
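A sketch of that second arrangement (illustrative only; real parts differ, for example in what they do on overflow - the model below silently wraps): 32 addressable entries plus a 5-bit pointer with increment/decrement logic.

#include <stdint.h>

typedef struct {
    uint16_t entry[32];   /* small dedicated stack memory             */
    uint8_t  ptr;         /* 5-bit pointer; only the low 5 bits count */
} HwStack;

void hw_push(HwStack *s, uint16_t return_addr)
{
    s->ptr = (s->ptr + 1) & 0x1F;   /* 5-bit increment (wraps on overflow) */
    s->entry[s->ptr] = return_addr;
}

uint16_t hw_pop(HwStack *s)
{
    uint16_t return_addr = s->entry[s->ptr];
    s->ptr = (s->ptr - 1) & 0x1F;   /* 5-bit decrement (wraps on underflow) */
    return return_addr;
}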
A stack pointer is a small register that stores the address of the top of the stack; its purpose is simply to point at the top of the stack.
