Why is the memory address 0x0 reserved, and for what? - memory

Why is the memory address 0x0 reserved, and for what? I am having trouble understanding for what exactly, thank you for helping

It is mostly a convention, and it is implementation specific.
The C language standard (C99 or C11), like some other programming languages such as Lisp, has the notion of a null pointer, which cannot be dereferenced (doing so is undefined behavior; in practice it often causes a segmentation fault) and which is different from any other pointer (to some valid memory location). Tony Hoare modestly called that notion "my billion dollar mistake", and some languages (Haskell, OCaml) have tagged union types (e.g. 'a option in OCaml) instead.
Most implementations (but not all) represent the null pointer by address 0.
In practice, on a desktop, laptop or tablet, a user-mode C program runs in some virtual address space where the page containing address 0 is not mapped. (On some Linux systems, you perhaps could mmap(2) the address 0 with MAP_FIXED, but that would be in poor taste...)
In some embedded microcontrollers (e.g. AVR), address 0 could be used.
In theory (and in the past), addresses might be more complex than a number... (in the 1980s, e.g. x86 memory segmentation on i286, and iAPX432 addressing, Rekursiv architecture, etc...)
Read several books and web pages on C programming, microprocessor architectures & instruction sets, operating system principles, virtual memory, MMUs.

It has been a common practice on paged memory systems not to map the first (zeroth) page by default. This is a convention normally enforced by the linker. When the program loader reads the executable file, it never gets an instruction to map the first logical page.
The reason for this is to detect null pointer errors.
int *whatever = 0;
/* ... */
*whatever = 10;
will cause an access violation.
That said, it is usually possible for a process to map the first (zeroth) page after execution starts and, in some cases, you can specify linker options allowing program sections to be based there.

Related

MIPS location of registers

I think I have a fairly basic MIPS question but am still getting my head wrapped around how addressing works in mips.
My question is: What is the address of the register $t0?
I am looking at the following memory allocation diagram from the MIPS "green sheet"
I had two ideas:
The register $t0 has a register number of 8 so I'm wondering if it would have an address of 0x0000 0008 and be in the reserved portion of the memory block.
Or would it fall in the Static Data Section and have an address of 0x1000 0008?
I know that MARS and different assemblers might start the addressing differently as described in this related question:
How is the la instruction translated in MIPS?
I am trying to understand what the "base" address is for register $t0 so I have a better understanding of how offset(base) addressing works.
For example what the address of 8($t0) would be
Thanks for the help!
Feature        Registers    Memory
Count          very few     vast
Speed          fast         slow
Named          yes          no
Addressable    no           yes
There are typically 32 or fewer registers for programmers to use.  On a 32-bit machine we can address 2^32 different bytes of memory.
Registers are fast, while memory is slow, potentially taking dozens of cycles depending on cache features & conditions.
On a load-store machine, registers can be used directly in most instructions (by naming them in the machine code instruction), whereas memory access requires a separate load or store instruction.  Computational instructions on such a machine typically allow naming up to 3 registers (e.g. target, source1, source2).  Memory operands have to be brought into registers for computation (and sometimes moved back to memory).
Registers can be named in instructions, but they do not have addresses and cannot be indexed.  On MIPS, no register can be found as an alias at some address in memory.  It is hard to put even a smallish array (e.g. an array of 10) in registers because they have no addresses and cannot be indexed.  Memory has numerical addresses, so we can rely on storing arrays and objects in a predictable pattern of addresses.  (Memory generally doesn't have names, just addresses; however, there are usually special memory locations for working with various I/O devices, and, as you note, memory is partitioned into sections that have starting (and ending) addresses.)
To be clear, memory-based aliases have been designed into some processors of the past.  The HP/1000 (circa the 1970s-80s), for example, had 2 registers (A & B), and they had aliases at memory locations 0 and 1, respectively.  However, this aliasing of CPU registers to memory is generally no longer done on modern processors.
For example what the address of 8($t0) would be
8($t0) refers to the memory address of (the contents of register $t0) + 8.  With proper usage, the program fragment would be using $t0 as a pointer, that is, as a variable that holds a memory address.

what is the difference between small memory model and large memory model?

what difference does it make when i choose 'large memory model' instead of 'small memory model' inside Turbo C compiler ?
how does that change behavior of my program ?
regards,
essbeev.
It refers to the very old concept of 16-bit memory models. 32-bit and 64-bit systems know nothing about these memory models.
So, returning to your questions: small declares that pointers are 16 bits long and can address only 64 KB, so by default your code and your data must each fit in a single 64 KB segment. To address another part of memory, you must explicitly declare the pointer as FAR. large declares that pointers to code or data are 32 bits long, so they are FAR by default.
I hope you will not dwell on these questions for long, since this is an obsolete concept.
The 8086 processor has 20-bit physical addressing using a combination of a 16-bit segment register and a 16-bit offset. You could pack both into a 32-bit FAR pointer, or you could assume a default segment register and store just the lower 16 bits in a NEAR pointer.
The difference between the small and large models is simply whether pointers are by default NEAR or FAR when not explicitly specified.

Address Error in Assembly (ColdFire MCF5307)

Taking my first course in assembly language, I am frustrated with cryptic error messages during debugging... I acknowledge that the following information will not be enough to find the cause of the problem (given my limited understanding of the assembly language, ColdFire(MCF5307, M68K family)), but I will gladly take any advice.
...
jsr out_string
Address Error (format 0x04 vector 0x03 fault status 0x1 status reg 0x2700)
I found a similar question on http://forums.freescale.com/freescale/board/message?board.id=CFCOMM&thread.id=271, regarding on ADDRESS ERROR in general.
The answer to the question states that the address error is because the code is "incorrectly" trying to execute on a non-aligned boundary (or accessing non-aligned memory).
So my questions will be:
What does it mean to "incorrectly" try to execute on a non-aligned boundary/memory? An example would help a lot.
What is a non-aligned boundary/memory?
How would you approach fixing this problem, assuming you have few debugging techniques available (e.g. using breakpoints and trace)?
First of all, it is possible that this isn't the instruction causing the error. Be sure to see if the previous or next instruction could have caused it. However, assuming that exception handlers and debuggers have improved:
An alignment exception is what occurs when, say, 32-bit (4-byte) data is retrieved from an address that is not a multiple of 4 bytes. For example, if variable x is 32 bits wide and sits at address 2:
const1: dc.w someconstant
x: dc.l someotherconstant
Then the instruction
move.l x,%d0
would cause a data alignment fault on a processor that requires 4-byte alignment for 32-bit accesses. (The 68000 and 68010 are more lenient: they fault only when a word or longword access uses an odd address.) The 68020 eliminated this restriction and performs the unaligned access, but at the cost of decreased performance. I'm not aware of the jsr (jump to subroutine) instruction requiring alignment, but it's not unreasonable and it's easy to arrange: before each function, insert the assembler's alignment directive:
.align long
func: ...
It has been a long time since I've used a 68K family processor, but I can give you some hints.
Trying to execute on an unaligned boundary means executing code at an odd address, for example if out_string were at an address with the low bit set.
The same holds true for a data access to memory of 2 or 4 byte data. I'm not sure if the Coldfire supports byte access to odd memory addresses, but the other 68K family members did.
The address error occurs on the instruction that causes the error in all cases.
Find out what instruction is there. If the pc matches (or is close) then it is an unaligned execution. If it is a memory access, e.g. move.w d0,(a0), then check to see what address is being read/written, in this case the one pointed at by a0.
I just wanted to add that this is very good stuff to figure out. I program high end medical imaging devices in my day job, but occasionally I need to get down to this level. I have found and fixed more than one COTS OS problem by being able to track down just this sort of problem.

How does a program look in memory?

How is a program (e.g. C or C++) arranged in computer memory? I kind of know a little about segments, variables etc, but basically I have no solid understanding of the entire structure.
Since the in-memory structure may differ, let's assume a C++ console application on Windows.
Some pointers to what I'm after specifically:
Outline of a function, and how is it called?
Each function has a stack frame, what does that contain and how is it arranged in memory?
Function arguments and return values
Global and local variables?
const static variables?
Thread local storage..
Links to tutorial-like material and such is welcome, but please no reference-style material assuming knowledge of assembler etc.
Might this be what you are looking for:
http://en.wikipedia.org/wiki/Portable_Executable
The PE file format is the binary file structure of windows binaries (.exe, .dll etc). Basically, they are mapped into memory like that. More details are described here with an explanation how you yourself can take a look at the binary representation of loaded dlls in memory:
http://msdn.microsoft.com/en-us/magazine/cc301805.aspx
Edit:
Now I understand that you want to learn how source code relates to the binary code in the PE file. That's a huge field.
First, you have to understand the basics about computer architecture which will involve learning the general basics of assembly code. Any "Introduction to Computer Architecture" college course will do. Literature includes e.g. "John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach" or "Andrew Tanenbaum, Structured Computer Organization".
After reading this, you should understand what a stack is and its difference to the heap. What the stack-pointer and the base pointer are and what the return address is, how many registers there are etc.
Once you've understood this, it is relatively easy to put the pieces together:
A C++ object contains code and data, i.e., member variables. A class
class SimpleClass {
public:
    int m_nInteger;
    double m_fDouble;
    double SomeFunction() { return m_nInteger + m_fDouble; }
};
will occupy 4 + 8 consecutive bytes in memory (ignoring any alignment padding the compiler may insert between or after the members). What happens when you do:
SimpleClass c1;
c1.m_nInteger = 1;
c1.m_fDouble = 5.0;
c1.SomeFunction();
First, object c1 is created on the stack, i.e., the stack pointer esp is decreased by 12 bytes to make room. Then the constant 1 is written 12 bytes below the old stack pointer (which is now the top of the stack) and the constant 5.0 is written 8 bytes below the old stack pointer.
Then we call a function, which means two things.
The computer has to load the part of the binary PE file into memory that contains function SomeFunction(). SomeFunction will only be in memory once, no matter how many instances of SimpleClass you create.
The computer has to execute function SomeFunction(). That means several things:
Calling the function also implies passing all parameters, often this is done on the stack. SomeFunction has one (!) parameter, the this pointer, i.e., the pointer to the memory address on the stack where we have just written the values "1" and "5.0"
Save the current program state, i.e., the current instruction address which is the code address that will be executed if SomeFunction returns. Calling a function means pushing the return address on the stack and setting the instruction pointer (register eip) to the address of the function SomeFunction.
Inside function SomeFunction, the old stack is saved by storing the old base pointer (ebp) on the stack (push ebp) and making the stack pointer the new base pointer (mov ebp, esp).
The actual binary code of SomeFunction is executed which will call the machine instruction that converts m_nInteger to a double and adds it to m_fDouble. m_nInteger and m_fDouble are found on the stack, at ebp - x bytes.
The result of the addition is stored in a register and the function returns. That means the stack is discarded which means the stack pointer is set back to the base pointer. The base pointer is set back (next value on the stack) and then the instruction pointer is set to the return address (again next value on the stack). Now we're back in the original state but in some register lurks the result of the SomeFunction().
I suggest, you build yourself such a simple example and step through the disassembly. In debug build the code will be easy to understand and Visual Studio displays variable names in the disassembly view. See what the registers esp, ebp and eip do, where in memory your object is allocated, where the code is etc.
What a huge question!
First you want to learn about virtual memory. Without that, nothing else will make sense. In short, C/C++ pointers are not physical memory addresses. Pointers are virtual addresses. There's a special CPU feature (the MMU, memory management unit) that transparently maps them to physical memory. Only the operating system is allowed to configure the MMU.
This provides safety (there is no C/C++ pointer value you can possibly make that points into another process's virtual address space, unless that process is intentionally sharing memory with you) and lets the OS do some really magical things that we now take for granted (like transparently swap some of a process's memory to disk, then transparently load it back when the process tries to use it).
A process's address space (a.k.a. virtual address space, a.k.a. addressable memory) contains:
a huge region of memory that's reserved for the Windows kernel, which the process isn't allowed to touch;
regions of virtual memory that are "unmapped", i.e. nothing is loaded there, there's no physical memory assigned to those addresses, and the process will crash if it tries to access them;
the various modules (EXE and DLL files) that have been loaded (each of these contains machine code, string constants, and other data); and
whatever other memory the process has allocated from the system.
Now typically a process lets the C Runtime Library or the Win32 libraries do most of the super-low-level memory management, which includes setting up:
a stack (for each thread), where local variables and function arguments and return values are stored; and
a heap, where memory is allocated if the process calls malloc or does new X.
For more about how the stack is structured, read about calling conventions. For more about how the heap is structured, read about malloc implementations. In general the stack really is a stack, a last-in-first-out data structure, containing arguments, local variables, and the occasional temporary result, and not much more. Since it is easy for a program to write straight past the end of the stack (the common C/C++ bug after which this site is named), the system libraries typically make sure that there is an unmapped page adjacent to the stack. This makes the process crash instantly when such a bug happens, so it's much easier to debug (and the process is killed before it can do any more damage).
The heap is not really a heap in the data structure sense. It's a data structure maintained by the CRT or Win32 library that takes pages of memory from the operating system and parcels them out whenever the process requests small pieces of memory via malloc and friends. (Note that the OS does not micromanage this; a process can to a large extent manage its address space however it wants, if it doesn't like the way the CRT does it.)
A process can also request pages directly from the operating system, using an API like VirtualAlloc or MapViewOfFile.
There's more, but I'd better stop!
For understanding stack frame structure you can refer to
http://en.wikipedia.org/wiki/Call_stack
It gives you information about the structure of the call stack and how locals, globals, and the return address are stored on it.
Another good illustration
http://www.cs.uleth.ca/~holzmann/C/system/memorylayout.pdf
It might not be the most accurate information, but MS Press provides some sample chapters of the book Inside Microsoft® Windows® 2000, Third Edition, containing information about processes and their creation along with images of some important data structures.
I also stumbled upon this PDF that summarizes some of the above information in a nice chart.
But all the provided information is more from the OS point of view and not too detailed about the application aspects.
Actually, you won't get far in this matter without at least a little bit of knowledge of assembler. I'd recommend a reversing (tutorial) site, e.g. OpenRCE.org.

How does a virtual machine work?

I've been looking into how programming languages work, and some of them have so-called virtual machines. I understand that this is some form of emulation of the programming language within another programming language, and that it works like how a compiled language would be executed, with a stack. Did I get that right?
With the proviso that I did, what bamboozles me is that many non-compiled languages allow variables with "liberal" type systems. In Python for example, I can write this:
x = "Hello world!"
x = 2**1000
Strings and big integers are completely unrelated and occupy different amounts of space in memory, so how can this code even be represented in a stack-based environment? What exactly happens here? Is x pointed to a new place on the stack and the old string data left unreferenced? Do these languages not use a stack? If not, how do they represent variables internally?
Your question should probably be titled "How do dynamic languages work?"
That's simple: they store the variable's type information along with its value in memory. This is done not only in interpreted or JIT-compiled languages but also in natively compiled languages such as Objective-C.
In most VM languages, variables can be conceptualized as pointers (or references) to memory in the heap, even if the variable itself is on the stack. For languages that have primitive types (int and bool in Java, for example) those may be stored on the stack as well, but they can not be assigned new types dynamically.
Ignoring primitive types, all variables that exist on the stack have their actual values stored in the heap. Thus, if you dynamically reassign a value to them, the original value is abandoned (and the memory cleaned up via some garbage collection algorithm), and the new value is allocated in a new bit of memory.
The VM has nothing to do with the language. Any language can run on top of a VM (the Java VM has hundreds of languages already).
A VM enables a different kind of "assembly language" to be run, one that is more fit to adapting a compiler to. Everything done in a VM could be done in a CPU, so think of the VM like a CPU. (Some actually are implemented in hardware).
It's extremely low level, and in many cases heavily stack based--instead of registers, machine-level math is all relative to locations relative to the current stack pointer.
With normal compiled languages, many instructions are required for a single step. An addition might look like: "Grab the item at a point relative to the stack pointer into reg a, grab another into reg b, add reg a and b, put reg a into a place relative to the stack pointer."
The VM does all this with a single, short instruction, possibly one or two bytes, instead of 4 or 8 bytes PER INSTRUCTION in machine language (depending on 32- or 64-bit architecture), which (guessing) should mean around 16 or 32 bytes of x86 machine code for 1-2 bytes of bytecode. (I could be wrong; my last x86 coding was in the 80286 era.)
Microsoft used (probably still uses) VMs in their office products to reduce the amount of code.
The procedure for creating the VM code is the same as creating machine language, just a different processor type essentially.
VMs can also implement their own security, error recovery and memory mechanisms that are very tightly related to the language.
Some of my description here is summary and from memory. If you want to explore the bytecode definition yourself, it's kinda fun:
http://java.sun.com/docs/books/jvms/second_edition/html/Instructions2.doc.html
The key to many of the 'how do VMs handle variables like this or that' really comes down to metadata... The meta information stored and then updated gives the VM a much better handle on how to allocate and then do the right thing with variables.
In many cases this is the type of overhead that can really get in the way of performance. However, modern day implementations, etc have come a long way in doing the right thing.
As for your specific questions - treating variables as vanilla objects / etc ... comes down to reassigning / reevaluating meta information on new assignments - that's why x can look one way and then the next.
To answer a part of your questions, I'd recommend a google tech talk about python, where some of your questions concerning dynamic languages are answered; for example what a variable is (it is not a pointer, nor a reference, but in case of python a label).

Resources