Who loads the BIOS and the memory map during boot-up

Who loads the BIOS and the memory map during boot-up - memory

For the BIOS, Wikipedia states:
The address of the BIOS's memory is located such that it will be executed when the computer is first started up. A jump instruction then directs the processor to start executing code in the BIOS.
I know that BIOS lives in non-volatile memory. But it would have to be loaded into the RAM for it to be executed. So who loads the BIOS into RAM ?
I have also read that a memory map is loaded at start-up. Does the BIOS load this memory map ? Where is is stored ?

At initial power on, the BIOS is executed directly from ROM. The ROM chip is mapped to a fixed location in the processor's memory space (this is typically a feature of the chipset). When the x86 processor comes out of reset, it immediately begins executing from 0xFFFFFFF0.
However, executing directly from ROM is quite slow, so usually one of the first things the BIOS does is to copy and decompress the BIOS code into RAM, and it executes from there. Of course, the memory controller must be initialized first! The BIOS takes care of that beforehand.
The memory map layout will vary from system to system. At power-on, the BIOS will query the attached PCI/PCIe devices, determine what resources are needed, and place them in the memory map at the optimal location. If everything is working properly, memory-mapped devices should not overlap with RAM. (Note that on a 64-bit system with >3GB of RAM, things get complicated because you need a "hole" in the middle of RAM for your 32-bit PCI/PCIe devices. Some early x64 BIOSes and chipsets had issues with this.)

Related

Memory allocation during startup of a C++ program

I sometimes read that the code segment is placed into ROM/FLASH. Others state that it is also loaded into RAM.
Is my understanding correct that it is common to place it into FLASH primary memory in case of an embedded system? And what are the advantages? I assume the startup of the program will be faster but since the FLASH memory is much slower it would be better to additionally load it from FLASH to RAM during the startup phase when RAM usage does not matter?

Sometimes you don't have enough memory for your program, so you just leave it in ROM or flash. On a system flush with memory you just load everything into RAM, it's much faster.
Some embedded CPUs have 2K of memory, but 2MB of flash. As an example the RP2040 has 264KB of SRAM (RAM) but 2MB of Flash memory for your programs. That's a lot bigger than the memory footprint.
Flash is slow compared to modern DRAM but in an embedded environment the CPU isn't always that fast either. The RP2040 only runs at 133MHz, so it won't notice the difference between flash latency and SRAM latency like a chip running in the 2GHz range might. It's clocked 15x slower.
If you want to explore this more, embedded CPUs like the RP2040 are really cheap, some less than $1, so you can experiment on them and see how it plays out in real life without having to spend much money at all.

Generally, RAM is much faster than flash. Where to run the code from however, depends on the system. On most traditional embedded systems, you don't execute from RAM.
On low-end embedded systems (8 and 16-bitters) you always keep all code in flash and there won't be a performance difference between executing from RAM or flash. Such systems typically don't have a MMU nor protection against writing to the code area, so running code from RAM is highly dangerous since bugs can write straight into physical memory. Also, these systems tend to have very limited RAM.
On mid-range embedded systems (Cortex M etc) where you start to clock the core faster than the flash can keep up with, you need to introduce wait states, where the CPU waits for the flash to read. Typically you need wait states when you go beyond somewhere around 40-50MHz system clock on modern systems. The higher the clock, the more wait states you need.
Such systems do not typically execute code from RAM either, since they usually don't need extreme performance. And they typically don't have a lot of RAM either. In some cases like mid-range Power PC, you'll have instruction cache, which helps a lot in compensating for the slower flash, since instructions can be pre-loaded from flash to cache by branch prediction.
On high-end systems (Cortex A, x86 etc) there will be lots of RAM available for the purpose of executing the code from there and then you are expected to do so. On these systems, cache rather serves the purpose of speeding up access to RAM.
Historically, RAM was also much more prone to electromagnetic interference and could also lose charge over time unless you kept writing to the cells, so you didn't want to keep code in RAM for those reasons alone. That's not much of an issue today though.

x86 protected mode memory management

I'm newibe of x86 cpu.
I read all materials about memory management of protected mode in x86.
the materials are Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A, System Programming Guide, Part 1
I believe I understand the many steps when cpu is accessing memory.
: selector register is index of segment descriptor table, and the entry of descriptor table is base of the segment, and linear address is addition of the base of the segment and 32bit offset.
But, what I'am confusing about is, it seems to me that CPU cannot know which memory address it will be access at the first time until the all steps above is finished. If CPU want to access specific memory address, It must know the selector value, and offset. But my question is how does it know ?? only information does CPU know is memory address it want to access doesn't it??
How does CPU know the input(selector value, offset) already when it only knows the output(memory address)??

... by
Microprocessor Real Time Clocks or Timer Chips,
periodic function called 'clock signal'
by Memory Controller Hub
Advanced Configuration and Power Interface (ACPI)
ROM, a non-volatile memory inside chips (RealMode Memory Map)
The Local Descriptor Table (LDT) is a memory table used in the x86 architecture in protected mode and containing memory segment descriptors: start in linear memory, size, executability, writability, access privilege, actual presence in memory, etc.
Interrupt descriptor table, is a data structure used by the x86 architecture to implement an interrupt vector table. The IDT is used by the processor to determine the correct response to interrupts and exceptions.
Intel 8259 is a Programmable Interrupt Controller (PIC) designed for the Intel 8085 and Intel 8086 microprocessors. The initial part was 8259, a later A suffix version was upward compatible and usable with the 8086 or 8088 processor. The 8259 combines multiple interrupt input sources into a single interrupt output to the host microprocessor, extending the interrupt levels available in a system beyond the one or two levels found on the processor chip
You also missing real mode
look also DOS_Protected_Mode_Interface & Virtual Control Program Interface
How timer chip control reset line of CPU ?
See also OSCILLATOR CIRCUIT WITH SIGNAL BUFFERING AND START-UP CIRCUITRYfrom Google Patents
real time clock
The CPU 'start' executing code stored in ROM on the motherboard at address FFFF0
The routine test the central hardware, search for video ROM
...
So.. is it not the CPU that 'start' because is power supply line that 'starts'
The power supply signal is sent to the motherboard, where it is received by the processor timer chip that controls the reset line to the processor.
How does the BIOS detect RAM ? See also serial presence detect, power-on self-test (POST)
BIOS is a 16-bit program running in real mode
The BIOS begins its POST when the CPU is reset. The first memory location the CPU tries to execute is known as the reset vector. In the case of a hard reboot, the northbridge will direct this code fetch (request) to the BIOS located on the system flash memory. For a warm boot, the BIOS will be located in the proper place in RAM and the northbridge will direct the reset vector call to the RAM
What is this reset vector ?
The reset vector is the default location a central processing unit will go to find the first instruction it will execute after a reset.
The reset vector is a pointer or address, where the CPU should always begin as soon as it is able to execute instructions. The address is in a section of non-volatile memory initialized to contain instructions to start the operation of the CPU, as the first step in the process of booting the system containing the CPU.
The reset vector for the 8086 processor is at physical address FFFF0h (16 bytes below 1 MB). The value of the CS register at reset is FFFFh and the value of the IP register at reset is 0000h to form the segmented address FFFFh:0000h, which maps to physical address FFFF0h.
About northbridge
A northbridge or host bridge is one of the two chips in the core logic chipset architecture on a PC motherboard, the other being the southbridge. Unlike the southbridge, northbridge is connected directly to the CPU via the front-side bus (FSB)
Sources:
"80386 Programmer's Reference Manual" (PDF). Intel. 1990. Section 10.1 Processor State After Reset
"80386 Programmer's Reference Manual" (PDF). Intel. 1990. Section 10.2.3 First Instruction,

Paged memory vs Pinned memory in memory copy [duplicate]

I observe substantial speedups in data transfer when I use pinned memory for CUDA data transfers. On linux, the underlying system call for achieving this is mlock. From the man page of mlock, it states that locking the page prevents it from being swapped out:
mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully;
In my tests, I had a fews gigs of free memory on my system so there was never any risk that the memory pages could've been swapped out yet I still observed the speedup. Can anyone explain what's really going on here?, any insight or info is much appreciated.

CUDA Driver checks, if the memory range is locked or not and then it will use a different codepath. Locked memory is stored in the physical memory (RAM), so device can fetch it w/o help from CPU (DMA, aka Async copy; device only need list of physical pages). Not-locked memory can generate a page fault on access, and it is stored not only in memory (e.g. it can be in swap), so driver need to access every page of non-locked memory, copy it into pinned buffer and pass it to DMA (Syncronious, page-by-page copy).
As described here http://forums.nvidia.com/index.php?showtopic=164661
host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.
I can also recommend to check cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. HostAlloc says that cuda driver can detect pinned memory:
The driver tracks the virtual memory ranges allocated with this(cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().

CUDA use DMA to transfer pinned memory to GPU. Pageable host memory cannot be used with DMA because they may reside on the disk.
If the memory is not pinned (i.e. page-locked), it's first copied to a page-locked "staging" buffer and then copied to GPU through DMA.
So using the pinned memory you save the time to copy from pageable host memory to page-locked host memory.

If the memory pages had not been accessed yet, they were probably never swapped in to begin with. In particular, newly allocated pages will be virtual copies of the universal "zero page" and don't have a physical instantiation until they're written to. New maps of files on disk will likewise remain purely on disk until they're read or written.

A verbose note on copying non-locked pages to locked pages.
It could be extremely expensive if non-locked pages are swapped out by OS on a busy system with limited CPU RAM. Then page fault will be triggered to load pages into CPU RAM through expensive disk IO operations.
Pinning pages can also cause virtual memory thrashing on a system where CPU RAM is precious. If thrashing happens, the throughput of CPU can be degraded a lot.

Use of Virtual Memory

What happens if a page is present in Virtual Memory, but not in main memory?
How is it executed?
Is the program loaded into the Main Memory from the virtual Memory? If it is loaded to Main Memory from Virtual Memory, that that would be an IO operation since it is on disk.Then what is the use of Virtual Memory , if anyways we have to make an IO operation to execute it.
And when use program generates logical address , and MMU maps it to physical address , and if that address is not present in Main Memory , then does OS check in Virtual Memory??
Thanks in advance

Let me start by saying that this is a very simplified explanation, not the definite guide to virtual memory;
Virtual memory basically gives your process the illusion that it's the only thing running in the memory space of the computer. When the process accesses a virtual memory page, the MMU translates it into a physical memory access. If the physical memory page does not yet exist (or isn't in physical memory), the process is suspended and the operating system is notified and can add the page to memory (for example by fetching it from disk) before resuming the process again.
One reason for virtual memory is that the process doesn't have to worry too much how much memory it uses and doesn't have to change if you for example expand physical memory on the machine, it can just work as if it had all the memory it can address and have the operating system solve how the actual memory is used.
The reason it doesn't (usually) slow the computer to a crawl is that many processes don't use big parts of their memory at all times, if a memory page isn't accessed in an hour, the physical memory can be put to much better use during that hour than to be kept active. Of course, the more memory your processes actively use continuously, the slower your process will appear to run.

How does the kernel know about segment fault?

When a segment fault occurs, it means I access memory which is not allocated or protected.But How does the kernel or CPU know it? Is it implemented by the hardware? What data structures need the CPU to look up? When a set of memory is allocated, what data structures need to be modified?

The details will vary, depending on what platform you're talking about, but typically the MMU will generate an exception (interrupt) when you attempt an invalid memory access and the kernel will then handle this as part of an interrupt service routine.

A seg fault generally happens when a process attempts to access memory that the CPU cannot physically address. It is the hardware that notifies the OS about a memory access violation. The OS kernel then sends a signal to the process which caused the exception

To answer the second part of your question, again it depends on hardware and OS. In a typical system (i.e. x86) the CPU consults the segment registers (via the global or local descriptor tables) to turn the segment relative address into a virtual address (this is usually, but not always, a no-op on modern x86 operating systems), and then (the MMU does this bit really, but on x86 its part of the CPU) consults the page tables to turn that virtual address into a physical address. When it encounters a page which is not marked present (the present bit is not set in the page directory or tables) it raises an exception. When the OS handles this exception, it will either give up (giving rise to the segfault signal you see when you make a mistake or a panic) or it will modify the page tables to make the memory valid and continue from the exception. Typically the OS has some bookkeeping which says which pages could be valid, and how to get the page. This is how demand paging occurs.

It all depends on the particular architecture, but all architectures with paged virtual memory work essentially the same. There are data structures in memory that describe the virtual-to-physical mapping of each allocated page of memory. For every memory access, the CPU/MMU hardware looks up those tables to find the mapping. This would be horribly slow, of course, so there are hardware caches to speed it up.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart