In Linux, the mmap(2) man page explains that an anonymous mapping
. . . is not backed by any file; its contents are initialized to zero.
The FreeBSD mmap(2) man page does not make a similar guarantee about zero-filling, though it does promise that bytes after the end of a file in a non-anonymous mapping are zero-filled.
Which flavors of Unix promise to return zero-initialized memory from anonymous mmaps? Which ones return zero-initialized memory in practice, but make no such promise on their man pages?
It is my impression that zero-filling is partially for security reasons. I wonder if any mmap implementations skip the zero-filling for a page that was mmapped, munmapped, then mmapped again by a single process, or if any implementations fill a newly mapped page with pseudorandom bits, or some non-zero constant.
P.S. Apparently, even brk and sbrk used to guarantee zero-filled pages. My experiments on Linux seem to indicate that, even if full pages are zero-filled upon page fault after a sbrk call allocates them, partial pages are not:
#include <unistd.h>
#include <stdio.h>
int main() {
const intptr_t many = 100;
char * start = sbrk(0);
sbrk(many);
for (intptr_t i = 0; i < many; ++i) {
start[i] = 0xff;
}
printf("%d\n",(int)start[many/2]);
sbrk(many/-2);
sbrk(many/2);
printf("%d\n",(int)start[many/2]);
sbrk(-1 * many);
sbrk(many/2);
printf("%d\n",(int)start[0]);
}
Which flavors of Unix promise to return zero-initialized memory from anonymous mmaps?
GNU/Linux
As you said in your question, the Linux version of mmap promises to zero-fill anonymous mappings:
MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initialized to zero.
NetBSD
The NetBSD version of mmap promises to zero-fill anonymous mappings:
MAP_ANON
Map anonymous memory not associated with any specific file. The file descriptor is not used for creating MAP_ANON regions, and must be specified as -1. The mapped memory will be zero filled.
OpenBSD
The OpenBSD manpage of mmap does not promise to zero-fill anonymous mappings. However, Theo de Raadt (prominent OpenBSD developer), declared in November 2019 on the OpenBSD mailing list:
Of course it is zero filled. What else would it be? There are no plausible alternatives.
I think it detracts from the rest of the message to say something so obvious.
And the other OpenBSD developers did not contradict him.
IBM AIX
The AIX version of mmap promises to zero-fill anonymous mappings:
MAP_ANONYMOUS
Specifies the creation of a new, anonymous memory region that is initialized to all zeros.
HP-UX
According to nixdoc.net, the HP-UX version of mmap promises to zero-fill anonymous mappings:
If MAP_ANONYMOUS is set in flags, a new memory region is created and initialized to all zeros.
Solaris
The Solaris version of mmap promises to zero-fill anonymous mappings:
When MAP_ANON is set in flags, and fildes is set to -1, mmap() provides a direct path to return anonymous pages to the caller. This operation is equivalent to passing mmap() an open file descriptor on /dev/zero with MAP_ANON elided from the flags argument.
This Solaris man page gives us a way to get zero-filled memory pages without relying on the behavior of mmap used with the MAP_ANONYMOUS flag: do not use the MAP_ANONYMOUS flag, and create a mapping backed by the /dev/zero file. It would be useful to know the list of Unix-like operating systems providing the /dev/zero file, to see if this approach is more portable than using the MAP_ANONYMOUS flag (neither /dev/zero nor MAP_ANONYMOUS are POSIX).
Interestingly, the Wikipedia article about /dev/zero claims that MAP_ANONYMOUS was introduced to remove the need of opening /dev/zero when creating an anonymous mapping.
It's hard to say which ones promise what without simply exhaustively enumerating all man pages or other release documentation, but the underlying code that handles MAP_ANON is (usually? always?) also used to map in bss space for executables, and bss space needs to be zero-filled. So it's pretty darn likely.
As for "giving you back your old values" (or some non-zero values but most likely, your old ones) if you unmap and re-map, it certainly seems possible, if some system were to be "lazy" about deallocation. I have only used a few systems that support mmap in the first place (BSD and Linux derivatives) and neither one is lazy that way, at least, not in the kernel code handling mmap.
The reason sbrk might or might not zero-fill a "regrown" page is probably tied to history, or lack thereof. The current FreeBSD code matches with what I recall from the old, pre-mmap days: there are two semi-secret variables, minbrk and curbrk, and both brk and sbrk will only invoke SYS_break (the real system call) if they are moving curbrk to a value that is at least minbrk. (Actually, this looks slightly broken: brk has the at-least behavior but sbrk just adds its argument to curbrk and invokes SYS_break. Seems harmless since the kernel checks, in sys_obreak() in /sys/vm/vm_unix.c, so a too-negative sbrk() will fail with EINVAL.)
I'd have to look at the Linux C library (and then perhaps kernel code too) but it may simply ignore attempts to "lower the break", and merely record a "logical break" value in libc. If you have mmap() and no backwards compatibility requirements, you can implement brk() and sbrk() entirely in libc, using anonymous mappings, and it would be trivial to implement both of them as "grow-only", as it were.
Related
#include <iostream>
int main(int argc, char** argv) {
int* heap_var = new int[1];
/*
* Page size 4KB == 4*1024 == 4096
*/
heap_var[1025] = 1;
std::cout << heap_var[1025] << std::endl;
return 0;
}
// Output: 1
In the above code, I allocated 4 bytes of space in the heap. Now as the OS maps the virtual memory to system memory in pages (which are 4KB each), A block of 4KB in my virtual mems heap would get mapped to the system mem. For testing I decided I would try to access other addresses in my allocated page/heap-block and it worked, however I shouldn't have been allowed to access more than 4096 bytes from the start (which means index 1025 as an int variable is 4 bytes).
I'm confused why I am able to access 4*1025 bytes (More than the size of the page that has been allocated) from the start of the heap block and not get a seg fault.
Thanks.
The platform allocator likely allocated far more than the page size is since it is planning to use that memory "bucket" for other allocation or is likely keeping some internal state there, it is likely that in release builds there is far more than just a page sized virtual memory chunk there. You also don't know where within that particular page the memory has been allocated (you can find out by masking some bits) and without mentioning the platform/arch (I'm assuming x86_64) there is no telling that this page is even 4kb, it could be a 2MB "huge" page or anything alike.
But by accessing outside array bounds you're triggering undefined behavior like crashes in case of reads or data corruption in case of writes.
Don't use memory that you don't own.
I should also mention that this is likely unrelated to C++ since the new[] operator usually just invokes malloc/calloc behind the scenes in the core platform library (be that libSystem on OSX or glibc or musl or whatever else on Linux, or even an intercepting allocator). The segfaults you experience are usually from guard pages around heap blocks or in absence of guard pages there simply using unmapped memory.
NB: Don't try this at home: There are cases where you may intentionally trigger what would be considered undefined behavior in general, but on that specific platform you may know exactly what's there (a good example is abusing pthread_t opaque on Linux to get tid without an overhead of an extra syscall, but you have to make sure you're using the right libc, the right build type of that libc, the right version of that libc, the right compiler that it was built with etc).
It doesn't seem like it would be difficult to associate ranges with segments of memory. Then have an assembly instruction which treats 2 integers as "location" & "offset" (another for "data" if setting), and returns the data and error code. This would mean no longer having to make a choice between speed and security/safety when working with arrays.
Another example might be a function which verifies that instructions originating in a particular memory range cannot physically access memory outside that range. If all hardware connected to the motherboard had this capability (and were made to be compatible with each other), it would be trivial to make perfect virtual machines that run at nearly the same speed as the physical machine.
Dustin Soodak
Yes.
Decades ago, Lisp machines performed simultaneous validation checks (e.g. type checks and bounds checks) as the program ran with the assumption the program and state were valid, jumping "back in time" if the check failed - unfortunately this ability to get "free" runtime validation was lost when conventional (i.e. x86) machines became dominant.
https://en.wikipedia.org/wiki/Lisp_machine
Lisp Machines ran the tests in parallel with the more conventional single instruction additions. If the simultaneous tests failed, then the result was discarded and recomputed; this meant in many cases a speed increase by several factors. This simultaneous checking approach was used as well in testing the bounds of arrays when referenced, and other memory management necessities (not merely garbage collection or arrays).
Fortunately we're finally learning from the past and slowly, and by piecemeal, reintroducing those innovations - Intel's "MPX" (Memory Protection eXtensions) for x86 were introduced in Skylake-generation processors for hardware bounds-checking - though it isn't perfect.
(x86 is a regression in other ways too: IBM's mainframes had true hardware-accelerated system virtualization in the 1980s - we didn't get it on x86 until 2005 with Intel's "VT-x" and AMD's "AMD-V" extensions).
x86 BOUND
Technically, x86 does have hardware bounds-checking: the the BOUND instruction was introduced in 1982 in the Intel 80188 (as well as the Intel 286 and above, but not the Intel 8086, 8088 or 80186 processors).
While the BOUND instruction does provide hardware bounds-checking, I understand it indirectly caused performance issues because it breaks the hardware branch predictor (according to a Reddit thread, but I'm unsure why), but also because it requires the bounds to be specified in a tuple in memory - that's terrible for performance - I understand at runtime it's no faster than manually having the instructions to do an "if index not in range [x,y] then signal the BR exception to the program or OS" (so you might imagine the BOUND instruction was added for the convenience of people who coded assembly by-hand, which was quite common in the 1980s).
The BOUND instruction is still present in today's processors, but it was not included in AMD64 (x64) - likely for the performance reasons I explained above, and also because likely very few people were using it (and compilers could trivially replace it with a manual bounds check, that might have better performance anyway, as that could use registers).
Another disadvantage to storing the array bounds in memory is that code elsewhere (that wasn't subject to BOUNDS checking) could overwrite the previously written bounds for another pointer and circumvent the check that way - this is mostly a problem with code that intentionally tries to disable safety features (i.e. malware), but if the bounds were stored in the stack - and given how easy it is to corrupt the stack, it has even less utility.
Intel MPX
Intel MPX was introduced in Skylake architecture in 2015 and should be present in all Skylake and subsequent processor models in the mainstream Intel Core family (including Xeon, and non-SoC versions of Celeron and Pentium). Intel also implemented MPX in the Goldmont architecture (Atom, and SoC versions of Celeron and Pentium) from 2016 onwards.
MPX is superior to BOUND in that it provides dedicated registers to store the bounds range so the bounds-check should be almost zero-cost compared to BOUND which required a memory access. On the Intel 486 the BOUND instruction takes 7 cycles (compare to CMP which takes only 2 cycles even if the operand was a memory address). In Skylake the MPX equivalent (BNDMK, BNDCL and BNDCU) are all 1-cycle instructions and BNDMK can be amortized as it only needs to be called once for each new pointer).
I cannot find any information on wherever or not AMD has implemented their own version of MPX yet (as of June 2017).
Critical thoughts on MPX
Unfortunately the current state of MPX is not all that rosy - a recent paper by Oleksenko, Kuvaiskii, et al. in February 2017 "Intel MPX Explained" (PDF link: caution: not yet peer-reviewed) is a tad critical:
Our main conclusion is that Intel MPX is a promising technique that is not yet practical for widespread adoption. Intel MPX’s performance overheads are still high (~50% on average), and the supporting infrastructure has bugs which may cause compilation or runtime errors. Moreover, we showcase the design limitations of Intel MPX: it cannot detect temporal errors, may have false positives and false negatives in multithreaded code, and its restrictions
on memory layout require substantial code changes for some programs.
Also note that compared to the Lisp Machines of yore, Intel MPX is still executed inline - whereas in Lisp Machines (if my understanding is correct) bounds checks happened concurrently in hardware with a retroactive jump backwards if the check failed; thus, so-long as a running program's pointers do not point to out-of-bounds locations then there would be an absolutely zero runtime performance cost, so if you have this C code:
char arr[10];
arr[9] = 'a';
arr[8] = 'b';
Then under MPX then this would be executed:
Time Instruction Notes
1 BNDMK arr, arr+9 Set bounds 0 to 9.
2 BNDCL arr Check `arr` meets lower-bound.
3 BNDCU arr Check `arr` meets upper-bound.
4 MOV 'a' arr+9 Assign 'a' to arr+9.
5 MOV 'a' arr+8 Assign 'a' to arr+8.
But on a Lisp machine (if it were magically possible to compile C to Lisp...), then the program-reader-hardware in the computer has the ability to execute additional "side" instructions concurrently with the "actual" instructions, allowing the "side" instructions to instruct the computer to disregard the results from the "actual" instructions in the event of an error:
Time Actual instruction Side instruction
1 MOV 'A' arr+9 ENSURE arr+9 BETWEEN arr, arr+9
2 MOV 'A' arr+8 ENSURE arr+8 BETWEEN arr, arr+9
I understand the instructions-per-cycle for the "side" instructions are not the same as the "Actual" instructions - so the side-check for the instruction at Time=1 might only complete after the "Actual" instructions have already progressed on to Time=3 - but if the check failed then it would pass the instruction pointer of the failed instruction to the exception handler that would direct the program to disregard the results of the instructions executed after Time=1. I don't know how they could achieve that without massive amounts of memory or some mandatory execution pauses, possibly memory-fencing too -
that's outside the scope of my answer, but it is at least theoretically possible.
(Note in this contrived example I'm using constexpr index values that a compiler can prove will never be out-of-bounds so would omit the MPX checks entirely - so pretend they're user-supplied variables instead :) ).
I'm not an expert in x86 (or have any experience in microprocessor design, spare a CS500-level course I took at UW and didn't do the homework for...) but I don't believe concurrent execution of bounds-checks nor "time travel" is possible with x86's current design, despite the extant implementation of out-of-order execution - I might be wrong, however. I speculate that if all pointer-types were promoted to 3-tuples ( struct BoundedPointer<T> { T* ptr, T* min, T* max } - which technically already happens with MPX and other software-based bounds-checks as every guarded pointer has its bounds defined when BNDMK is called) then the protection could be provided for free by the MMU - but now pointers will consume 24 bytes of memory, each, instead of the current 8 bytes - or compare to the measly 4 bytes under 32-bit x86 - RAM is plentiful, but still a finite resource that shouldn't be wasted.
MPX in GCC
GCC supported for MPX from version 5.0 to 9.1 ( https://gcc.gnu.org/wiki/Intel%20MPX%20support%20in%20the%20GCC%20compiler ) when it was removed due to its maintenance burden.
MPX in Visual Studio / Visual C++
Visual Studio 2015 Update 1 (2015.1) added "experimental" support for MPX with the /d2MPX switch ( https://blogs.msdn.microsoft.com/vcblog/2016/01/20/visual-studio-2015-update-1-new-experimental-feature-mpx/ ). Support is still present in Visual Studio 2017 but Microsoft has not announced if it's considered a mainstream (i.e. non-experimental) feature yet.
MPX in Clang / LLVM
Clang has partially supported manual use of MPX in the past, but that support was fully removed in version 10.0
As of July 2021, LLVM still seems capable of outputting MPX instructions, but I can't see any evidence of an MPX "pass".
MPX in Intel C/C++ Compiler
The Intel C/C++ Compiler has supported MPX since version 15.0.
The XL compilers available on the IBM POWER processors on the Little Endian Linux, Big Endian Linux or AIX operating systems have a different implementation of array bounds checking.
Using the -qcheck or its synonym -C option turns on various kinds of checking. -qcheck=bounds checks array bounds. When this is used, the compilers check that every array reference has a valid subscript.
The hardware instruction used is a conditional trap, comparing the subscript to the upper limit and trapping if the subscript is too large or too small. In C and C++ the lower limit is 0. In Fortran it defaults to 1 but can be any integer. When it is not zero, the lower limit is subtracted from the subscript being checked, and the check compares that to the upper limit minus the lower limit.
When the limit is known at compile time and small enough, a conditional trap immediate instruction is enough. When the limit is calculated at execution time or is greater than 65535, a conditional trap instruction comparing two registers is needed.
The performance impact is small for several reasons:
1. The conditional trap instructions are fast.
2. They are executed in a standard integer pipeline. Since most POWER CPUs have 2 or 4 integer pipelines, there is usually an otherwise empty slot to put the trap in, so it is often essentially zero cost.
3. When it can the compiler optimizer moves the conditional trap out of loops so it is executed only once, checking all loop iterations at once.
4. When it can prove the actual subscript cannot exceed the limit, the optimizer discards the instruction.
5. Also when it can prove the subscript will also be invalid, the optimizer uses an unconditional trap.
6. If necessary -qcheck can be used during testing and skipped for production builds, but the overhead is small enough that's not usually necessary.
If my memory is correct, one long ago paper reported a 2% slowdown in one case and 0% in another. Since that CPU had only one integer pipeline, the slowdown should be significantly less with modern CPUs.
Other checking using the same mechanism is available to detect dereferencing NULL pointers, dividing an integer by zero, using an uninitialized auto variable, specially written asserts, etc.
This doesn't include all kinds of invalid memory usage, but it does handle the most common kind, does it very efficiently, and is very easy to use.
GCC supports -fbounds-check for similar purposes, but at this time it is only available for the Fortran front end (gfortran).
Why is the memory address 0x0 reserved, and for what? I am having trouble understanding for what exactly, thank you for helping
It is mostly a convention, and it is implementation specific.
The C language standard (C99 or C11) -and some other programming languages such as Lisp- has the notion of null pointer which cannot be dereferenced (that would be undefined behavior, segmentation fault) and is different of any other pointer (to some valid memory location). Tony Hoare modestly called that notion "my billion dollar mistake", and some languages (Haskell, Ocaml) have some tagged unions types (e.g. 'a option in Ocaml) instead.
Most implementations (but not all) represent the null pointer by address 0.
In practice, on a desktop, laptop or tablet, a user-mode C program runs in some virtual address space where the page containing the address 0 is not mapped. (On some Linux, you perhaps could mmap(2) with MAP_FIXED the address 0, but that would be poor taste...)
In some embedded microcontrollers (e.g. AVR), address 0 could be used.
In theory (and in the past), addresses might be more complex than a number... (in the 1980s, e.g. x86 memory segmentation on i286, and iAPX432 addressing, Rekursiv architecture, etc...)
Read several books and web pages on C programming, microprocessor architectures & instruction sets, operating system principles, virtual memory, MMUs.
It has been a common practice on paged memory systems not to map the first (zeroth) page by default. This is a convention normally enforced by the linker. When the program loader reads the executable file, it never gets an instruction to map the first logical page.
The reason for this is to detect null pointer errors.
int *whatever = 0 ;
. . . .
*whatever = 10 ;
will cause an access violation.
That said, it is usually possible for a process to map the first (zeroth) page after execution starts and, in some cases, you can specify linker options allowing program sections to be based there.
Prequisites
POSIX.1 2008 specifies the setrlimit() and getrlimit() functions. Various constants are provided for the resource argument, some of which are reproduced below for easier understaning of my question.
The following resources are defined:
(...)
RLIMIT_DATA
This is the maximum size of a data segment of the process, in bytes. If this limit is exceeded, the malloc() function shall fail with errno set to [ENOMEM].
(...)
RLIMIT_STACK
This is the maximum size of the initial thread's stack, in bytes. The implementation does not automatically grow the stack beyond this limit. If this limit is exceeded, SIGSEGV shall be generated for the thread. If the thread is blocking SIGSEGV, or the process is ignoring or catching SIGSEGV and has not made arrangements to use an alternate stack, the disposition of SIGSEGV shall be set to SIG_DFL before it is generated.
RLIMIT_AS
This is the maximum size of total available memory of the process, in bytes. If this limit is exceeded, the malloc() and mmap() functions shall fail with errno set to [ENOMEM]. In addition, the automatic stack growth fails with the effects outlined above.
Furthermore, POSIX.1 2008 defines data segment like this:
3.125 Data Segment
Memory associated with a process, that can contain dynamically allocated data.
I understand that the RLMIT_DATA resource was traditionally used to denote the maximum amount of memory that can be assigned to a process with the brk() function. Recent editions of POSIX.1 do no longer specify this function and many operating systems (e.g. Mac OS X) do not support this function as a system call. Instead it is emulated with a variant of mmap() which is not part of POSIX.1 2008.
Questions
I am a little bit confused about the semantic and use of the RLIMIT_DATA resource. Here are the concrete questions I have:
Can the stack be part of the data segment according to this specification?
The standard says about RLIMIT_DATA: “If this limit is exceeded, the malloc() function shall fail with errno set to [ENOMEM].” Does this mean that memory allocated with malloc() must be part of the data segment?
On Linux, memory allocated with mmap() does not count towards the data segment. Only memory allocated with brk() or sbrk() is part of the data segment. Recent versions of the glibc use a malloc() implementation that allocates all its memory with mmap(). The value of RLIMIT_DATA thus has no effect on the amount of memory you can allocate with this implementation of malloc().
Is this a violation of POSIX.1 2008?
Do other platforms exhibit similar behavior?
The standard says about RLIMIT_AS: "If this limit is exceeded, the malloc() and mmap() functions shall fail with errno set to [ENOMEM]." As the failure of mmap() is not specified for RLIMIT_DATA, I conclude that memory obtained from mmap() does not count towards the data segment.
Is this assumption true? Does this only apply to non-POSIX variants of mmap()?
FreeBSD also shares the problem of malloc(3) being implemented using mmap(2) in the default malloc implementation. I ran into this when porting a product from FreeBSD 6 to 7, where the switch happened. We switched the default limit for each process from RLIMIT_DATA=512M to RLIMIT_VMEM=512M, i.e. limit the virtual memory allocation to 512MB.
As for whether this violates POSIX, I don't know. My gut feeling is that lots of things violate POSIX and a 100% POSIX compliant system is as rare as a strictly-confirming C compiler.
EDIT: heh, and now I see that FreeBSD's name RLIMIT_VMEM is non-standard; they define RLIMIT_AS as RLIMIT_VMEM for POSIX compatibility.
How is a program (e.g. C or C++) arranged in computer memory? I kind of know a little about segments, variables etc, but basically I have no solid understanding of the entire structure.
Since the in-memory structure may differ, let's assume a C++ console application on Windows.
Some pointers to what I'm after specifically:
Outline of a function, and how is it called?
Each function has a stack frame, what does that contain and how is it arranged in memory?
Function arguments and return values
Global and local variables?
const static variables?
Thread local storage..
Links to tutorial-like material and such is welcome, but please no reference-style material assuming knowledge of assembler etc.
Might this be what you are looking for:
http://en.wikipedia.org/wiki/Portable_Executable
The PE file format is the binary file structure of windows binaries (.exe, .dll etc). Basically, they are mapped into memory like that. More details are described here with an explanation how you yourself can take a look at the binary representation of loaded dlls in memory:
http://msdn.microsoft.com/en-us/magazine/cc301805.aspx
Edit:
Now I understand that you want to learn how source code relates to the binary code in the PE file. That's a huge field.
First, you have to understand the basics about computer architecture which will involve learning the general basics of assembly code. Any "Introduction to Computer Architecture" college course will do. Literature includes e.g. "John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach" or "Andrew Tanenbaum, Structured Computer Organization".
After reading this, you should understand what a stack is and its difference to the heap. What the stack-pointer and the base pointer are and what the return address is, how many registers there are etc.
Once you've understood this, it is relatively easy to put the pieces together:
A C++ object contains code and data, i.e., member variables. A class
class SimpleClass {
int m_nInteger;
double m_fDouble;
double SomeFunction() { return m_nInteger + m_fDouble; }
}
will be 4 + 8 consecutives bytes in memory. What happens when you do:
SimpleClass c1;
c1.m_nInteger = 1;
c1.m_fDouble = 5.0;
c1.SomeFunction();
First, object c1 is created on the stack, i.e., the stack pointer esp is decreased by 12 bytes to make room. Then constant "1" is written to memory address esp-12 and constant "5.0" is written to esp-8.
Then we call a function that means two things.
The computer has to load the part of the binary PE file into memory that contains function SomeFunction(). SomeFunction will only be in memory once, no matter how many instances of SimpleClass you create.
The computer has to execute function SomeFunction(). That means several things:
Calling the function also implies passing all parameters, often this is done on the stack. SomeFunction has one (!) parameter, the this pointer, i.e., the pointer to the memory address on the stack where we have just written the values "1" and "5.0"
Save the current program state, i.e., the current instruction address which is the code address that will be executed if SomeFunction returns. Calling a function means pushing the return address on the stack and setting the instruction pointer (register eip) to the address of the function SomeFunction.
Inside function SomeFunction, the old stack is saved by storing the old base pointer (ebp) on the stack (push ebp) and making the stack pointer the new base pointer (mov ebp, esp).
The actual binary code of SomeFunction is executed which will call the machine instruction that converts m_nInteger to a double and adds it to m_fDouble. m_nInteger and m_fDouble are found on the stack, at ebp - x bytes.
The result of the addition is stored in a register and the function returns. That means the stack is discarded which means the stack pointer is set back to the base pointer. The base pointer is set back (next value on the stack) and then the instruction pointer is set to the return address (again next value on the stack). Now we're back in the original state but in some register lurks the result of the SomeFunction().
I suggest, you build yourself such a simple example and step through the disassembly. In debug build the code will be easy to understand and Visual Studio displays variable names in the disassembly view. See what the registers esp, ebp and eip do, where in memory your object is allocated, where the code is etc.
What a huge question!
First you want to learn about virtual memory. Without that, nothing else will make sense. In short, C/C++ pointers are not physical memory addresses. Pointers are virtual addresses. There's a special CPU feature (the MMU, memory management unit) that transparently maps them to physical memory. Only the operating system is allowed to configure the MMU.
This provides safety (there is no C/C++ pointer value you can possibly make that points into another process's virtual address space, unless that process is intentionally sharing memory with you) and lets the OS do some really magical things that we now take for granted (like transparently swap some of a process's memory to disk, then transparently load it back when the process tries to use it).
A process's address space (a.k.a. virtual address space, a.k.a. addressable memory) contains:
a huge region of memory that's reserved for the Windows kernel, which the process isn't allowed to touch;
regions of virtual memory that are "unmapped", i.e. nothing is loaded there, there's no physical memory assigned to those addresses, and the process will crash if it tries to access them;
parts the various modules (EXE and DLL files) that have been loaded (each of these contains machine code, string constants, and other data); and
whatever other memory the process has allocated from the system.
Now typically a process lets the C Runtime Library or the Win32 libraries do most of the super-low-level memory management, which includes setting up:
a stack (for each thread), where local variables and function arguments and return values are stored; and
a heap, where memory is allocated if the process calls malloc or does new X.
For more about the stack is structured, read about calling conventions. For more about how the heap is structured, read about malloc implementations. In general the stack really is a stack, a last-in-first-out data structure, containing arguments, local variables, and the occasional temporary result, and not much more. Since it is easy for a program to write straight past the end of the stack (the common C/C++ bug after which this site is named), the system libraries typically make sure that there is an unmapped page adjacent to the stack. This makes the process crash instantly when such a bug happens, so it's much easier to debug (and the process is killed before it can do any more damage).
The heap is not really a heap in the data structure sense. It's a data structure maintained by the CRT or Win32 library that takes pages of memory from the operating system and parcels them out whenever the process requests small pieces of memory via malloc and friends. (Note that the OS does not micromanage this; a process can to a large extent manage its address space however it wants, if it doesn't like the way the CRT does it.)
A process can also request pages directly from the operating system, using an API like VirtualAlloc or MapViewOfFile.
There's more, but I'd better stop!
For understanding stack frame structure you can refer to
http://en.wikipedia.org/wiki/Call_stack
It gives you information about structure of call stack, how locals , globals , return address is stored on call stack
Another good illustration
http://www.cs.uleth.ca/~holzmann/C/system/memorylayout.pdf
It might not be the most accurate information, but MS Press provides some sample chapters of of the book Inside Microsoft® Windows® 2000, Third Edition, containing information about processes and their creation along with images of some important data structures.
I also stumbled upon this PDF that summarizes some of the above information in an nice chart.
But all the provided information is more from the OS point of view and not to much detailed about the application aspects.
Actually - you won't get far in this matter with at least a little bit of knowledge in Assembler. I'd recoomend a reversing (tutorial) site, e.g. OpenRCE.org.