Is there any way to inspect kernel space in GDB? - memory

I may have a more fundamental misunderstanding here, so I will outline everything:
I wanted to gain a better understanding of how programs are laid out in memory. Starting from here I went and made some simple programs and opened them up in GDB to see where things were laid in a more practical sense:
0x0 - 0x08048000 = ??
0x08048000 = Start .text section
0x08048000 = PLT
0x08048300 = _start
0x08048400 = main
0x08048480 = other functions
0x0804a000 = GOT
0x0804a020 = Start .data section
0x0804a028 = Start .bss section
(random offset)
0x0804b008 = Start heap
...
0xf7?????? = Start memory mapping section
0xf7e50000 = #included library function definitions
0xf7ff0000 = Linux dynamic loader
(random offset)
0xffffd010 = Top of stack (grows negatively)
(random offset)
I understand that a lot of these addresses are subject to change, but it helped me visualize it by assigning numbers to things.
Anyway, in the following picture presented in the source above, there's a block dedicated to the Kernel space at the top of the program address space:
But a whole gigabyte is allowed for it! The top of the stack in the program I examined was at 0xffffd010, leaving very little space for kernel-related things afterwards. Is it really all there? Does it ever grow, pushing the rest of the program segments closer together in the virtual address space? More importantly, how can I examine it and play with it?

The top of the stack in the program I examined was at 0xffffd010, leaving very little space for kernel-related things afterwards. Is it really all there?
Your stack is at the top of memory — there's no kernel mapping. That suggests that one of the following is the case:
You're running a 32-bit binary on a 64-bit system, so the kernel is way off in 64-bit space where you can't see it.
You're running a weird kernel with the 4GB/4GB patch applied, so the kernel is (again) in a totally separate address space
You're on a non-x86 architecture that always has separate address spaces for user and system processes (like PowerPC, I believe?)
To get a look at what your address space actually looks like, take a look at /proc/$pid/maps for your process while it's running.
Does it ever grow, pushing the rest of the program segments closer together in the virtual address space?
No. The size of the kernel mapping is compiled into the kernel, and never changes at runtime. (It can be configured to be 2GB/2GB instead of 3GB/1GB, but that's very uncommon.)
More importantly, how can I examine it and play with it?
You can't — at least, not from user space. That's where the kernel lives.

Related

forth implementation with JIT write protection?

I believe Apple has disabled being able to write and execute memory at the same time on the ARM64 architecture, see:
See mmap() RWX page on MacOS (ARM64 architecture)?
This makes it difficult to port implementations like jonesforth, which keeps generated code and the code to generate it (like the built-in assembler in jonesforth.f) in the same segment.
I thought I could do something like map the user space from start to HERE as 'r-x', and from here to the end as 'rw-'. Then I'd have to constantly remap memory as I compile new words, and I couldn't go and fix up previous words (I believe SCODE would make use of it).
Do you have any advice on how to handle such limitations ?
I guess I should look into other forth implementations that are running on M1 Macs.
A Forth implementation can have a problem with write-protected segments of code only when it generates machine code that should be executable at once. There is no such a problem if it uses threaded code. So it's supposed bellow that the Forth system have to generate machine code.
Data space and code space
Obviously, you have to separate code space from data space. Data space (at least mutable regions of, including regions for variables and data fields), as well as internal mutable memory regions and probably headers, should be mapped to 'rw-' segments. Code space should be mapped to 'r-x' segments.
The word here ( -- addr ) returns the address of the first cell available for reservation, which is writable for a program, and it should be always in an 'rw-' segment. You can have an internal word code::here ( -- addr ) that returns address in code space, if you need.
A decision for execution tokens is a compromise between speed and simplicity of implementation (an 'r-x' segment vs 'rw-'). The simplest case is that an execution token is represented by an address in an 'rw-' segment, and then execute does an additional dereferencing to get the corresponding address of code.
Code generation
In the given conditions we should generate machine code into an 'rw-' segment, but before this code is executed, this segment should be made 'r-x'.
Probably, the simplest solution is to allocate a memory block for every new definition, resize (minimize) the block on completion and make it 'r-x'. Possible disadvantages — losses due to page size (e.g. 4 KiB), and maybe memory fragmentation.
Changing protection of the main code segment starting from code::here also implies losses due to page size granularity.
Another variant is to break creating of a definition into two stages:
generate intermediate representation (IR) in a separate 'rw-' segment during compilation of a definition;
when the definition is completed, generate machine code in the main code segment from IR, and discard IR code.
Actually, it could be machine code on the first stage too, and then it's just relocated into another place on the second stage.
Before write to the main code segment you change it (or its part) to 'rw-', and after that revert it to 'r-x'.
A subroutine that translates IR code should be resided in another 'r-x' segment that you don't change.
Forth is agnostic to the format of generated code, and in a straightforward system only a handful of definitions "know" what format is generated. So only these definitions should be changed to generate IR code. If you relocate machine code, you probably don't need to change even these definitions.

In memory, should stack bottom and heap bottom have the same address?

I'm using a tm4c123gh6pm MCU with this linker script. Going to the bottom, I see:
...
...
.bss (NOLOAD):
{
_bss = .;
*(.bss*)
*(COMMON)
_ebss = .;
} > SRAM
_heap_bottom = ALIGN(8);
_heap_top = ORIGIN(SRAM) + LENGTH(SRAM) - _stack_size;
_stack_bottom = ALIGN(8);
_stack_top = ORIGIN(SRAM) + LENGTH(SRAM);
It seems that heap and stack bottoms are the same. I have double checked it:
> arm-none-eabi-objdump -t mcu.axf | grep -E "(heap|stack)"
20008000 g .bss 00000000 _stack_top
20007000 g .bss 00000000 _heap_top
00001000 g *ABS* 00000000 _stack_size
20000558 g .bss 00000000 _heap_bottom
20000558 g .bss 00000000 _stack_bottom
Is this correct? As far as I can see, the stack could overwrite the heap, is this the case?
If I flash this FW it 'works' (at least for now), but I'm expecting it to fail if the stack gets big enough and I use dynamic memory. I have observed though that no one in my code or the startup script uses the stack and bottom symbols, so maybe even if I use the stack and heap everything keeps working. (Unless the stack and heap are special symbols used by someone I can't see, is this the case?)
I want to change the last part by:
_heap_bottom = ALIGN(8);
_heap_top = ORIGIN(SRAM) + LENGTH(SRAM) - _stack_size;
_stack_bottom = ORIGIN(SRAM) + LENGTH(SRAM) - _stack_size + 4; // or _heap_top + 4
_stack_top = ORIGIN(SRAM) + LENGTH(SRAM);
Is the above correct?
If you write your own linker script then it is up to you how stack and heap are arranged.
One common approach is to have stack and heap in the same block, with stack growing downwards from the highest address towards the lowest, and heap growing upwards from a lower address towards the highest.
The advantage of this approach is that you don't need to calculate how much heap or stack you need separately. As long as the total of stack and heap used at any one instant is less than the total memory available, then everything will be ok.
The disadvantage of this approach is that when you allocate more memory than you have, your stack will overflow into your heap or vice-versa, and your program will fail in a variety of ways which are very difficult to predict or to identify when they happen.
The linker script in your question uses this approach, but appears to have a mistake detailed below.
Note that using the names top and bottom when talking about stacks on ARM is very unhelpful because when you push something onto the stack the numerical value of the stack pointer decreases. When the stack is empty the stack pointer has its highest value, and when it is full the stack pointer has its lowest value. It is ambiguous whether "top" refers to the highest address or the location of the current pointer, and whether bottom refers to the lowest address or the address where the first item is pushed.
In the CMSIS example linker scripts the lower and upper bounds of the heap are called __heap_base and __heap_limit, and the lower and upper bounds of the stack are called __stack_limit and __initial_sp respectively.
In this script the symbols have the following meanings:
_heap_bottom is the lowest address of the heap.
_heap_top is the upper address that the heap must not grow beyond if you want to leave at least _stack_size bytes for the stack.
For _stack_bottom, it appears that the script author probably mistakenly thought that ALIGN(8) would align the most recently assigned value, and so they wanted _stack_bottom to be an aligned version of _heap_top, which would make it the value of the stack pointer when _stack_size bytes are pushed to it. In fact ALIGN(8) aligns the value of ., which still has the same value as _heap_bottom as you have observed.
Finally _stack_top is the highest address in memory, it is the value the stack pointer will start with when the stack is empty.
Having an incorrect value for the stack limit almost certainly does absolutely nothing at all, because this symbol is probably never used in the code. On this ARMv7M processor the push and pop instructions and other accesses to the stack by hardware assume that the stack is an infinite resource. Compilers using all the normal ABIs also generate code which does not check before growing the stack either. The reason for this is that it is one of the most common operations performed, and so adding extra instructions would cripple performance. The next generation ARMv8M does have hardware support for a stack limit register, though.
My advice to you is just delete the line. If anything is using it, then you are basically losing the whole benefit of sharing your stack and heap space. If you do want to calculate and check for it, then your suggestion is correct except that you don't need to add + 4. This would create a 4 byte gap which is not usable as either heap or stack.
As an aside, I personally prefer to put the stack at the bottom of memory and the heap at the top, growing way from each other. That way, if either of them get bigger than they should they go into an unallocated address space which can be configured to cause a bus fault straight away, without any software checking the values all the time.

STM32 Current Flash Vector Address

I'm working on a dual OS system with STM32F103, I have two separate program that programmed on different FLASH locations. if both of the programs are the same, the only way to know which of them running is just by its start vector address.
But How I Can Read The Current Program Start Vector Address in STM32 ???
After reading the comments, it sounds like what you have/want is a bootloader. If your goal here is to have two different applications, one to do your main processing and real time handling and the other to just program new firmware, then you want to make a bootloader in your default boot flash space.
Bootloaders fundamentally do a few things, everything else is extra.
Check itself using some type of data integrity check like a CRC.
Checks the application
Jumps to the application.
Bootloaders will also program applications in the app space and verify they are programmed correctly before jumping as well. Colin gave some good advice about appending a CRC to the hex file before it is programmed in flash space to verify the applications.
There are a few things to look out for. The first would be the linker script and this is extremely important. A linker script will be used to map input objects to output objects and then determine based upon that script, what memory space they go into. For both of your applications, you need to create a memory map of how you want both programs to sit inside of the flash space. From this point, you can then make linker scripts for both programs so that a hex file can be generated within the parameters of what you deem acceptable flash space for the program. Each project you have will have its own linker script. An example would look something like this:
LR_IROM1 0x08000000 0x00010000 { ; load region size_region
ER_IROM1 0x08000000 0x00010000 { ; load address = execution address
*.o (RESET, +First)
*(InRoot$$Sections)
.ANY (+RO)
}
RW_IRAM1 0x20000000 0x00018000 { ; RW data
.ANY (+RW +ZI)
}
}
This will give RAM for the application to use as well as a starting point for the application.
After that, you can start the bootloader and give it information about where the application space lies for jumping and programming. Once again this is all determined by you from your memory map and both applications' linker scripts. You are going to need to add a separate entry inside of the linker for your CRC and length for a comparison of the calculated versus stored as well. Whatever tool you use to append the CRC to the hex file and have it programmed to flash space, remember to note the location and make it known to the linker script so you can reference those addresses to check integrity later.
After you check everything and it is determined that it is okay to go to the application, you can use some ARM assembly to jump to the starting application address. Before jumping, make sure to disable all peripherals and interrupts that were enabled in the bootloader. As Colin mentioned, these will share RAM, so it is important you de-initialize all used, otherwise, you'll end up with a hard fault.
At this point, the program used another hex file laid out by a linker script, so it should begin executing as planned, as long as you have the correct vector table offset, which gets into your question fully.
As far as your question on the "Flash vector address", I think what your really mean is your interrupt vector table address. An interrupt vector table is a data structure in memory that maps interrupt requests to the addresses of interrupt handlers. This is where the PC register grabs the next available instruction address upon hardware interrupt triggers, for example. You can see this by keeping track of the ARM pipeline in a few lines of assembly code. Each entry of this table is a handler's address. This offset must be aligned with your application, otherwise you will never go into the main function and the program will sit in the application space, but have nothing to do since all handlers addresses are unknown. This is what the SCB->VTOR is for. It is a vector interrupt table offset address register. In this case, there are a few things you can do. Luckily, these are hard-coded inside of STM generated files inside of the file "system_stm32(xx)xx.c" (xx is your microcontroller variant). There is a define for something called VECT_TAB_OFFSET which is the offset in the memory map of the vector table and is assigned to the SCB->VTOR register with the value that is chosen. Your interrupt vector table will always lie at the starting address of your main application, so for the bootloader it can be 0x00, but for the application, it will be the subtraction of the starting address of the application space, and the first addressable flash address of the microcontroller.
/************************* Miscellaneous Configuration ************************/
/*!< Uncomment the following line if you need to relocate your vector Table in
Internal SRAM. */
/* #define VECT_TAB_SRAM */
#define VECT_TAB_OFFSET 0x00 /*!< Vector Table base offset field.
This value must be a multiple of 0x200. */
/******************************************************************************/
Make sure you understand what is expected from the micro side using STM documentation before programming things. Vector tables in this chip can only be in multiples of 0x200. But to answer your question, this address can be determined by a few things. Your memory map, and eventually, you will have a hard-coded reference to it as a define. You can figure it out from there.
Hope this helps and good luck to you on your application.

Do modern OS's use paging and segmentation?

I was reading about memory architecture and I got a bit confused with the paging and segmentation. I read that modern OS systems use only paging to manage memory access but looking at a disassembled codes I can see segments like "ds" and "fs". Does it means that the OS (saw that on Windows and Linux) is using both segmentation and paging or is it just mapping all the segments into the same pages (making segments irrelevant) ?
Ok, based on the book Modern Operating Systems 3rd Edition by Andrew S. Tanenbaum and the materials form Open Security Training (opensecuritytraining.info), i manage to understand the segmentation and paging and the answer for my question is:
Concepts:
1.1.Segmentation:
Segmentation is the division of the memory into pieces (sections) called segments. These segments are independents from each other, have variable sizes and may grow as needed.
1.2. Virtual Memory:
A virtual memory is an abstraction of real memory. This means that it maps a virtual address (used by programs) into a physical address (used by the hardware). If a program wants to access the memory 2000 (mov eax, 2000), the virtual address (2000) will be converted into the real physical address ( for example 1422) and then the program will access the memory 1422 thinking that he’s accessing the memory 2000.
So, if virtual memory is being used by the system, programs no longer access real memory directly, instead, they used the correspondent virtual memory.
1.3. Paging:
Paging is a virtual memory management scheme. As explained before, it maps a virtual address into a physical address. Paging divides the virtual memory into pieces called “pages” and also divides the physical memory into pieces called “frame pages”. A page can be bound to one or more frame page ( keep in mind that a page can map different frame pages, but just one at the time)
Advantages and Disadvantages
2.1. Segmentation:
The main advantage of using segmentation is to divide the memory into different linear address spaces. With this programs can have an address space for his code, other for the stack, another for dynamic variables (heap) etc. The disadvantage for segmentation is the fragmentation of the memory.
Consider this example – One portion of the memory is divided into 4 segments sequentially allocated -> seg1(in use) --- seg2(in use)--- seg3(in use)---seg4(in use). Now if we free the memory from seg2, we will have -> seg1(in use) --- seg2(FREE)--- seg3(in use)---seg4(in use). If we want to allocate some data, we can do it by using seg2, but if the size of data is bigger than the size of the seg2, we won’t be able to do so and the space will be wasted and the memory fragmented. Another problem is that some segments could have a larger size and since this segment can’t be “broken” into smaller pieces, it must be fully allocated in memory.
2.1. Paging:
The main advantages of using paging is to abstract the physical memory, hence, the programs (and programmers) don’t need to bother with memory addresses. You can have two programs that access the (virtual) memory 2000 because it will be mapped into two different physical memories. Also, it’s possible to use the hard disk to ensure that only the necessary pages are allocated into memory. The main disadvantage is that paging uses only one linear address space. Paging maps virtual memories from 0 to the max addressable value for all programs. Now consider this example - two chunks of memory, “chk1” and “chk2” are next to each other on virtual memory ( chk1 is allocated from address 1000 to 2000 and chk2 uses from 2001 to 3000), if chk1 needs to grow, the OS needs to make room in the memory for the new size of chk1. If that happens again, the OS will have to do the same thing or, if no space can be found, an exception will pop. The routines for managing this kind of operations are very bad for performance, because it this growth and shrink of the memory happens a lot of time.
Segmentation and Paging
3.1. Why combine both?
An operational systems can combine both segmentation and paging. Why? Because combining then it’s possible to have the best from both without their disadvantages. So, the memory is divided in many segments and each one of those have one or more pages. When a program tries to access a memory address, first the OS goes to the correspondent segment, there he’ll find the correspondent page, so he goes to the page and there he’ll find the frame page that the program wanted to access. Using this approach, the OS divides the memory into various segments, including segments for the kernel and user programs. These segments have variable sizes and access protection features. To solve the fragmentation and the “big segment” problems, paging is used. With paging a larger segment is broken into several pages and only the necessary pages remains in memory (and more pages could be allocated in memory if needed). So, basically, the Modern OSs have two memory abstractions, where segmentation is more used for “handling the programs” and Paging for “managing the physical memory”.
3.2. How it really works?
An OS running segmentation and Paging will have the following structures:
3.2.1. Segment Selector:
This represents an index on the Global/Local Descriptor Table. It contains 3 fields that represent the index on the descriptor table, a bit to identify if that segment is present on Global or Local descriptor table and a privilege level.
3.2.2. Segment Register:
A CPU register used to store segment selector. Usually ( on a x86 machine) there is at least the register CS (Code Segment) and DS (Data Segment).
3.2.3. Segment descriptors:
This structure contains the data regarding a segment, like his base address, size ( in pages or in bytes), the access privilege, the information of if this segment is present on the memory or not, etc. (for all the fields, search for segment descriptors on google)
3.2.4. Global/Local Descriptor Table:
The table that contains several segment descriptors. So this structure holds all the segment descriptors for the system. If the table is Global, it can hold other things like Local Descriptor Tables descriptors, Task State Segment Descriptors and Call Gate descriptors (I’ll not explain these here, please, google them). If the table is Local, it will only ( as far as I know) holds user related segment descriptors.
3.3. How it works?
So, to access a memory, first the program needs to access a segment. To do this a segment selector is loaded into a segment register and then a Global or Local Descriptor table (depending of field on the segment selector). This is the reason of the fully memory address be SEGMENT REGISTER: ADDRESS , like CS:ADDRESS -> 001B:0044BF7A. Now the OS goes to the G/LDT and (using the index field of the segment selector) finds the segment descriptor of the address trying to access. Then it checks if the segment if present, the protection and if everything is ok it goes to the address stated on the “base field” (of the descriptor) + the address offset . If paging is not enabled, the system goes directly into the real memory, but with paging on the address is treated as a virtual address and it goes to the Page directory. The base address + the offset are called Linear Address and will be interpreted as 3 fields: Directory+Page+offset. So on the Directory page, it will search for the directory entry specified on the “directory” field of the linear address, this entry points to the page table and the field “page” of the linear address is used to find the page, this entry points to a frame page and the offset is used to find the exactly address that the program want to access.
3.4. Modern Operational Systems
Modern OSes "do not use" segmentation. Its in quotes because they use 4 segments: Kernel Code Segment, Kernel Data Segment, User Code Segment and User Data Segment. What does it means is that all user's processes have the same code and data segments (so the same segment selector). The segments only change when going from user to kernel.
So, all the path explained on the section 3.3. occurs, but they use the same segments and, since the page tables are individual per process, a page fault is difficult to happen.
Hope this helps and if there is any mistakes or some more details ( I was I bit generic on some parts) please, feel free to comment and reply.
Thank you guys
Danilo PC
They only use paging for memory protection, while they use segmentation for other purposes (like storing thread-local data).

How does a program look in memory?

How is a program (e.g. C or C++) arranged in computer memory? I kind of know a little about segments, variables etc, but basically I have no solid understanding of the entire structure.
Since the in-memory structure may differ, let's assume a C++ console application on Windows.
Some pointers to what I'm after specifically:
Outline of a function, and how is it called?
Each function has a stack frame, what does that contain and how is it arranged in memory?
Function arguments and return values
Global and local variables?
const static variables?
Thread local storage..
Links to tutorial-like material and such is welcome, but please no reference-style material assuming knowledge of assembler etc.
Might this be what you are looking for:
http://en.wikipedia.org/wiki/Portable_Executable
The PE file format is the binary file structure of windows binaries (.exe, .dll etc). Basically, they are mapped into memory like that. More details are described here with an explanation how you yourself can take a look at the binary representation of loaded dlls in memory:
http://msdn.microsoft.com/en-us/magazine/cc301805.aspx
Edit:
Now I understand that you want to learn how source code relates to the binary code in the PE file. That's a huge field.
First, you have to understand the basics about computer architecture which will involve learning the general basics of assembly code. Any "Introduction to Computer Architecture" college course will do. Literature includes e.g. "John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach" or "Andrew Tanenbaum, Structured Computer Organization".
After reading this, you should understand what a stack is and its difference to the heap. What the stack-pointer and the base pointer are and what the return address is, how many registers there are etc.
Once you've understood this, it is relatively easy to put the pieces together:
A C++ object contains code and data, i.e., member variables. A class
class SimpleClass {
int m_nInteger;
double m_fDouble;
double SomeFunction() { return m_nInteger + m_fDouble; }
}
will be 4 + 8 consecutives bytes in memory. What happens when you do:
SimpleClass c1;
c1.m_nInteger = 1;
c1.m_fDouble = 5.0;
c1.SomeFunction();
First, object c1 is created on the stack, i.e., the stack pointer esp is decreased by 12 bytes to make room. Then constant "1" is written to memory address esp-12 and constant "5.0" is written to esp-8.
Then we call a function that means two things.
The computer has to load the part of the binary PE file into memory that contains function SomeFunction(). SomeFunction will only be in memory once, no matter how many instances of SimpleClass you create.
The computer has to execute function SomeFunction(). That means several things:
Calling the function also implies passing all parameters, often this is done on the stack. SomeFunction has one (!) parameter, the this pointer, i.e., the pointer to the memory address on the stack where we have just written the values "1" and "5.0"
Save the current program state, i.e., the current instruction address which is the code address that will be executed if SomeFunction returns. Calling a function means pushing the return address on the stack and setting the instruction pointer (register eip) to the address of the function SomeFunction.
Inside function SomeFunction, the old stack is saved by storing the old base pointer (ebp) on the stack (push ebp) and making the stack pointer the new base pointer (mov ebp, esp).
The actual binary code of SomeFunction is executed which will call the machine instruction that converts m_nInteger to a double and adds it to m_fDouble. m_nInteger and m_fDouble are found on the stack, at ebp - x bytes.
The result of the addition is stored in a register and the function returns. That means the stack is discarded which means the stack pointer is set back to the base pointer. The base pointer is set back (next value on the stack) and then the instruction pointer is set to the return address (again next value on the stack). Now we're back in the original state but in some register lurks the result of the SomeFunction().
I suggest, you build yourself such a simple example and step through the disassembly. In debug build the code will be easy to understand and Visual Studio displays variable names in the disassembly view. See what the registers esp, ebp and eip do, where in memory your object is allocated, where the code is etc.
What a huge question!
First you want to learn about virtual memory. Without that, nothing else will make sense. In short, C/C++ pointers are not physical memory addresses. Pointers are virtual addresses. There's a special CPU feature (the MMU, memory management unit) that transparently maps them to physical memory. Only the operating system is allowed to configure the MMU.
This provides safety (there is no C/C++ pointer value you can possibly make that points into another process's virtual address space, unless that process is intentionally sharing memory with you) and lets the OS do some really magical things that we now take for granted (like transparently swap some of a process's memory to disk, then transparently load it back when the process tries to use it).
A process's address space (a.k.a. virtual address space, a.k.a. addressable memory) contains:
a huge region of memory that's reserved for the Windows kernel, which the process isn't allowed to touch;
regions of virtual memory that are "unmapped", i.e. nothing is loaded there, there's no physical memory assigned to those addresses, and the process will crash if it tries to access them;
parts the various modules (EXE and DLL files) that have been loaded (each of these contains machine code, string constants, and other data); and
whatever other memory the process has allocated from the system.
Now typically a process lets the C Runtime Library or the Win32 libraries do most of the super-low-level memory management, which includes setting up:
a stack (for each thread), where local variables and function arguments and return values are stored; and
a heap, where memory is allocated if the process calls malloc or does new X.
For more about the stack is structured, read about calling conventions. For more about how the heap is structured, read about malloc implementations. In general the stack really is a stack, a last-in-first-out data structure, containing arguments, local variables, and the occasional temporary result, and not much more. Since it is easy for a program to write straight past the end of the stack (the common C/C++ bug after which this site is named), the system libraries typically make sure that there is an unmapped page adjacent to the stack. This makes the process crash instantly when such a bug happens, so it's much easier to debug (and the process is killed before it can do any more damage).
The heap is not really a heap in the data structure sense. It's a data structure maintained by the CRT or Win32 library that takes pages of memory from the operating system and parcels them out whenever the process requests small pieces of memory via malloc and friends. (Note that the OS does not micromanage this; a process can to a large extent manage its address space however it wants, if it doesn't like the way the CRT does it.)
A process can also request pages directly from the operating system, using an API like VirtualAlloc or MapViewOfFile.
There's more, but I'd better stop!
For understanding stack frame structure you can refer to
http://en.wikipedia.org/wiki/Call_stack
It gives you information about structure of call stack, how locals , globals , return address is stored on call stack
Another good illustration
http://www.cs.uleth.ca/~holzmann/C/system/memorylayout.pdf
It might not be the most accurate information, but MS Press provides some sample chapters of of the book Inside Microsoft® Windows® 2000, Third Edition, containing information about processes and their creation along with images of some important data structures.
I also stumbled upon this PDF that summarizes some of the above information in an nice chart.
But all the provided information is more from the OS point of view and not to much detailed about the application aspects.
Actually - you won't get far in this matter with at least a little bit of knowledge in Assembler. I'd recoomend a reversing (tutorial) site, e.g. OpenRCE.org.

Resources