VGA and integrated graphics theory - memory

I don't really want to know the ins and outs of VGA, but rather the basic principle of how it works (and with integrated graphics). The Intel website says -
So this stolen memory is used as the frame buffer for the VGA adapter, and any reads/writes by the VGA graphics controller go to and come from there?
Example system with 1MB stolen VGA memory-
So if the above system was running in VGA mode and something was written to the legacy VGA address range (0xA0000 - 0xBFFFF), what would the process be?
Currently my understanding is that the memory controller would forward it from the CPU to the VGA adapter, which would then use the graphics translation table (GTT) to translate it into a physical address at the top of DRAM, in the range 03F0_0000h - 03FF_FFFFh?
Would this mean that the legacy VGA memory range 0xA0000 - 0xBFFFF is not accessible in DRAM, as the VGA adapter is using that address range for MMIO?
If anyone could help with those questions it would be greatly appreciated,
Thanks.

It has been quite a few years since I wrote anything directly for VGA, so take that in mind.
The old legacy stuff (CGA/EGA/VGA) mapped all VRAM memory access to two segments only (2 x 64 KByte):
graphic modes
A000:0000 - A000:FFFF
text modes
B800:0000 - B800:FFFF
So both of those 64 KByte chunks of memory are not directly accessible; instead the VGA forwards its own memory there. Integrated cards with shared memory do not have their own memory, so the chipset takes it from global memory (usually from the top of the address space). In that case, yes, that memory is not accessible by the HW (unless some feature of the chipset is used). The space in global memory is usually remapped or used for shadowing ROMs.
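To make the first part concrete, here is a minimal sketch (assuming a 16-bit DOS compiler such as Turbo/Borland C) of writing a pixel in mode 13h: the CPU simply writes into segment A000h and the chipset forwards the access to the VGA's own memory (or to the stolen region, on integrated graphics).
/* Hedged sketch, 16-bit real-mode DOS assumed: set 320x200x256 mode 13h
   via the video BIOS and write one pixel through the A000h window. */
#include <dos.h>
int main(void)
{
    unsigned char far *vram = (unsigned char far *)MK_FP(0xA000, 0x0000);
    union REGS r;

    r.x.ax = 0x0013;              /* INT 10h, AH=00h: set video mode 13h */
    int86(0x10, &r, &r);

    vram[100 * 320 + 160] = 15;   /* plot one white pixel at (160,100) */
    return 0;
}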
gfx-BIOS
All legacy gfx cards have their own BIOS (FLASH/EEPROM/EPROM/PROM) memory. I can't remember exactly how that works, but as I remember the expansion BIOS area starts around
C000:0000
where all BIOS-capable HW maps its BIOS memory (not only gfx cards, and not necessarily an entire segment in size).
Now there are many gfx modes that need more than 64 KB of VRAM, so you call the gfx BIOS to map the appropriate memory segment to A000:0000, or set it through control registers by IO operations on the gfx IO ports. The gfx card remaps its memory and then you can use it ...
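For the bank-switching part, a hedged sketch (16-bit DOS compiler assumed, names mine) using the VBE "Display Window Control" call, function 4F05h, to move the 64 KB window at A000:0000 to another part of VRAM; legacy SVGA cards also offer vendor-specific I/O ports for the same job.
/* Move window A to the given bank; bank is in window-granularity units,
   which real code should read from the VBE mode info block. */
#include <dos.h>
void set_vesa_bank(unsigned bank)
{
    union REGS r;
    r.x.ax = 0x4F05;   /* VBE: Display Window Control        */
    r.x.bx = 0x0000;   /* BH=00h set window, BL=00h window A */
    r.x.dx = bank;     /* window position in granularity units */
    int86(0x10, &r, &r);
}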
VESA
VESA VRAM can be accessed in the same way as on the old legacy gfx stuff, but VESA adds LFB (linear frame buffer) support, which can map the entire VRAM into memory, not just a single segment, and it can also use extended memory (with just base memory it would not be of much use).
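A hedged sketch of asking the VBE BIOS where that linear frame buffer lives (16-bit DOS compiler assumed; offsets follow the VBE 2.0 ModeInfoBlock layout): function 4F01h returns the mode info block, and PhysBasePtr at offset 28h is the physical LFB address used when the mode is later set with the LFB bit (bit 14) via function 4F02h.
#include <dos.h>
#include <string.h>
unsigned long get_lfb_address(unsigned mode)
{
    unsigned char info[256];        /* VBE ModeInfoBlock buffer */
    union REGS r;
    struct SREGS s;

    memset(info, 0, sizeof(info));
    r.x.ax = 0x4F01;                /* VBE: Return Mode Information */
    r.x.cx = mode;
    s.es   = FP_SEG((void far *)info);
    r.x.di = FP_OFF((void far *)info);
    int86x(0x10, &r, &r, &s);

    return *(unsigned long *)(info + 0x28);   /* PhysBasePtr */
}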
As I wrote before, it has been some years since I dealt with this stuff, so if I am wrong please edit or add a comment ...


DirectX RenderContext RAM/VRAM

I have 8 GB of VRAM (GPU) and 16 GB of normal RAM. When allocating (creating) many large textures, let's say 4096x4096, I eventually run out of VRAM; however, from what I can see, it then creates them in RAM instead. Whenever you need to render with (or to) one, it seems to transfer the render context from RAM to VRAM in order to do so. When normally accessing many render contexts over and over every frame (60 fps etc.), the PC lags out as it tries to transfer very large amounts back and forth. However, as long as only a small number of render contexts that have not been used recently (i.e. still in RAM, not VRAM) are referenced each second, there should not be an issue performance-wise. The question is whether this information is correct?
DirectX will allocate DEFAULT pool resources from video RAM and/or the PCIe aperture RAM which can both be accessed by the GPU directly. Often render targets must be in video RAM, and generally video RAM is faster memory--although it greatly depends on the exact architecture of the graphics card.
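As a rough illustration of the DEFAULT pool mentioned above, here is a hedged D3D11 sketch (the device pointer is assumed to exist; error handling omitted) that creates one of the 4096x4096 textures from the question, roughly 64 MB each at 32 bits per pixel:
#include <d3d11.h>
ID3D11Texture2D* CreateLargeTexture(ID3D11Device* device)
{
    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width            = 4096;
    desc.Height           = 4096;
    desc.MipLevels        = 1;
    desc.ArraySize        = 1;
    desc.Format           = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.Usage            = D3D11_USAGE_DEFAULT;        // GPU-accessible pool
    desc.BindFlags        = D3D11_BIND_SHADER_RESOURCE;

    ID3D11Texture2D* tex = nullptr;
    device->CreateTexture2D(&desc, nullptr, &tex);      // may fail or page when over-committed
    return tex;
}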
What you are describing is the 'over-commit' scenario where you have allocated more resources than actually fit in the GPU-accessible resources. In this case, DirectX 11 makes a 'best-effort' which generally involves changing virtual memory mapping to get the scene to render, but the performance is obviously quite poor compared to the more normal situation.
DirectX 12 leaves dealing with 'over-commit' up to the application, much like everything else about DirectX 12 where generally "runtime magic behavior" has been removed. See docs for details on this behavior, as well as this sample

x86 protected mode memory management

I'm a newbie to the x86 CPU.
I have read the material about protected-mode memory management on x86.
the materials are Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A, System Programming Guide, Part 1
I believe I understand the steps the CPU takes when accessing memory:
the selector register is an index into the segment descriptor table, the entry of the descriptor table holds the base of the segment, and the linear address is the sum of the segment base and the 32-bit offset.
But what I'm confused about is that it seems the CPU cannot know which memory address it will access until all of the steps above are finished. If the CPU wants to access a specific memory address, it must know the selector value and the offset. But my question is: how does it know them? The only information the CPU has is the memory address it wants to access, doesn't it?
How does the CPU know the inputs (selector value, offset) when it only knows the output (memory address)?
... by
Microprocessor Real Time Clocks or Timer Chips,
a periodic signal called the 'clock signal'
by Memory Controller Hub
Advanced Configuration and Power Interface (ACPI)
ROM, a non-volatile memory inside chips (RealMode Memory Map)
The Local Descriptor Table (LDT) is a memory table used in the x86 architecture in protected mode and containing memory segment descriptors: start in linear memory, size, executability, writability, access privilege, actual presence in memory, etc.
The interrupt descriptor table (IDT) is a data structure used by the x86 architecture to implement an interrupt vector table. The IDT is used by the processor to determine the correct response to interrupts and exceptions.
Intel 8259 is a Programmable Interrupt Controller (PIC) designed for the Intel 8085 and Intel 8086 microprocessors. The initial part was 8259, a later A suffix version was upward compatible and usable with the 8086 or 8088 processor. The 8259 combines multiple interrupt input sources into a single interrupt output to the host microprocessor, extending the interrupt levels available in a system beyond the one or two levels found on the processor chip
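Since the question is really about how a selector and offset become a linear address, here is a hedged sketch (field names mine, layout per the IA-32 SDM) of what the CPU effectively does with a GDT/LDT descriptor entry once the selector has picked it:
#include <stdint.h>
struct seg_descriptor {
    uint16_t limit_low;
    uint16_t base_low;        /* base bits 0..15  */
    uint8_t  base_mid;        /* base bits 16..23 */
    uint8_t  access;          /* type, DPL, present bit */
    uint8_t  limit_hi_flags;  /* limit bits 16..19, G/D flags */
    uint8_t  base_high;       /* base bits 24..31 */
};
uint32_t linear_address(const struct seg_descriptor *d, uint32_t offset)
{
    uint32_t base = d->base_low
                  | ((uint32_t)d->base_mid  << 16)
                  | ((uint32_t)d->base_high << 24);
    return base + offset;     /* linear address, after limit/rights checks */
}
The selector and offset are not derived from the final address; they are supplied by the instruction and the segment registers loaded earlier, and the linear address is the output of this calculation.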
You are also missing real mode.
Look also at DOS_Protected_Mode_Interface & Virtual Control Program Interface.
How does the timer chip control the reset line of the CPU?
See also OSCILLATOR CIRCUIT WITH SIGNAL BUFFERING AND START-UP CIRCUITRY from Google Patents
real time clock
The CPU 'starts' executing code stored in ROM on the motherboard at address FFFF0h.
The routine tests the central hardware and searches for video ROM.
...
So... it is not the CPU that 'starts'; it is the power supply line that 'starts' it.
The power supply signal is sent to the motherboard, where it is received by the processor timer chip that controls the reset line to the processor.
How does the BIOS detect RAM? See also serial presence detect and power-on self-test (POST).
BIOS is a 16-bit program running in real mode
The BIOS begins its POST when the CPU is reset. The first memory location the CPU tries to execute is known as the reset vector. In the case of a hard reboot, the northbridge will direct this code fetch (request) to the BIOS located on the system flash memory. For a warm boot, the BIOS will be located in the proper place in RAM and the northbridge will direct the reset vector call to the RAM
What is this reset vector?
The reset vector is the default location a central processing unit will go to find the first instruction it will execute after a reset.
The reset vector is a pointer or address, where the CPU should always begin as soon as it is able to execute instructions. The address is in a section of non-volatile memory initialized to contain instructions to start the operation of the CPU, as the first step in the process of booting the system containing the CPU.
The reset vector for the 8086 processor is at physical address FFFF0h (16 bytes below 1 MB). The value of the CS register at reset is FFFFh and the value of the IP register at reset is 0000h to form the segmented address FFFFh:0000h, which maps to physical address FFFF0h.
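A quick check of that arithmetic (real-mode physical address = segment * 16 + offset), as a tiny C snippet:
#include <stdio.h>
int main(void)
{
    unsigned long cs = 0xFFFF, ip = 0x0000;
    printf("%05lXh\n", (cs << 4) + ip);   /* prints FFFF0h */
    return 0;
}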
About northbridge
A northbridge or host bridge is one of the two chips in the core logic chipset architecture on a PC motherboard, the other being the southbridge. Unlike the southbridge, the northbridge is connected directly to the CPU via the front-side bus (FSB).
Sources:
"80386 Programmer's Reference Manual" (PDF). Intel. 1990. Section 10.1 Processor State After Reset
"80386 Programmer's Reference Manual" (PDF). Intel. 1990. Section 10.2.3 First Instruction,

Is there device side pointer of host memory for kernel use in OpenCL (like CUDA)?

In CUDA, we can achieve kernel-managed data transfer from host memory to device shared memory by using a device-side pointer to host memory. Like this:
int *a,*b,*c; // host pointers
int *dev_a, *dev_b, *dev_c; // device pointers to host memory
…
cudaHostGetDevicePointer(&dev_a, a, 0); // no memcpy to device needed now; device pointers are used instead
cudaHostGetDevicePointer(&dev_b, b, 0);
cudaHostGetDevicePointer(&dev_c, c, 0);
…
//kernel launch
add<<<B,T>>>(dev_a,dev_b,dev_c);
// dev_a, dev_b, dev_c are passed into kernel for kernel accessing host memory directly.
In the above example, kernel code can access host memory via dev_a, dev_b and dev_c. The kernel can use these pointers to move data from the host into shared memory directly, without staging it through global memory.
But it seems that this is mission impossible in OpenCL? (local memory in OpenCL is the counterpart of shared memory in CUDA)
You can find an exactly identical API in OpenCL.
How it works on CUDA:
According to this presentation and the official documentation.
The money quote about cudaHostGetDevicePointer:
Passes back device pointer of mapped host memory allocated by
cudaHostAlloc or registered by cudaHostRegister.
CUDA cudaHostAlloc with cudaHostGetDevicePointer works exactly like CL_MEM_ALLOC_HOST_PTR with MapBuffer works in OpenCL. Basically, if it's a discrete GPU the results are cached on the device, and if it's a GPU that shares memory with the host it will use that memory directly. So there is no actual 'zero copy' operation with a discrete GPU in CUDA.
The function cudaHostGetDevicePointer does not take raw malloc'ed pointers in, which is exactly the limitation in OpenCL. From the API user's point of view those two are exactly identical approaches, allowing the implementation to do pretty much identical optimizations.
With discrete GPU the pointer you get points to an area where the GPU can directly transfer stuff in via DMA. Otherwise the driver would take your pointer, copy the data to the DMA area and then initiate the transfer.
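A hedged OpenCL sketch of that CL_MEM_ALLOC_HOST_PTR + map approach (context, queue and kernel are assumed to exist; error checking omitted); from the kernel's side the buffer is just an ordinary __global argument:
#include <CL/cl.h>
void run(cl_context ctx, cl_command_queue queue, cl_kernel kernel, size_t n)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                n * sizeof(int), NULL, &err);

    /* Map the buffer so the host can fill it without an explicit copy. */
    int *host = (int *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                          0, n * sizeof(int), 0, NULL, NULL, &err);
    for (size_t i = 0; i < n; ++i)
        host[i] = (int)i;
    clEnqueueUnmapMemObject(queue, buf, host, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);  /* plain __global int* in the kernel */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(buf);
}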
However, in OpenCL 2.0 that is explicitly possible, depending on the capabilities of your devices. With the finest-granularity sharing you can use arbitrarily malloc'ed host pointers and even share atomics with the host, so you could even dynamically control the kernel from the host while it is running.
http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf
See page 162 for the shared virtual memory spec. Do note that when you write kernels, even these are still just __global pointers from the kernel's point of view.
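As a rough sketch of the coarse-grained variant of that shared virtual memory (context, queue and kernel assumed to exist; fine-grained system SVM additionally requires the corresponding bits in CL_DEVICE_SVM_CAPABILITIES):
#include <CL/cl.h>
void run_svm(cl_context ctx, cl_command_queue queue, cl_kernel kernel, size_t n)
{
    int *data = (int *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(int), 0);

    /* Coarse-grained SVM still needs map/unmap around host access. */
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(int), 0, NULL, NULL);
    for (size_t i = 0; i < n; ++i)
        data[i] = (int)i;
    clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

    clSetKernelArgSVMPointer(kernel, 0, data);   /* still a __global pointer in the kernel */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);
    clSVMFree(ctx, data);
}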

ARM memory mapping: INT15 equivalent? Standard way to query memory map?

On PC-architectures (where the presence of the BIOS and the usage of it is pretty much standardized), you can discover the size of the RAM memory, as well as its reserved/free for use regions by using the INT15 BIOS interrupt, function 0xE820.
Since I'm passionate about low-level programming and after programming Intel architectures for approximately 6 months, I decided I should try and learn how other architectures work. So I've started digging into ARM development. I've got 2 boards I'm currently working on: Olimex A20 OlinuXino-MICRO and Samsung Arndale's Exynos 5250. What I'm trying to do is to port a hypervisor I've developed for Intel architectures to these two boards. I am now in the stage of trying to programmatically detect the memory map of the system in a reliable and acceptably standardized way (I would prefer not to write entirely different code for different ARM boards). But so far, I find the relevant documentation to be a little bit confusing.
On the Olimex A20 I've got a Cortex-A7 ARM CPU.
The PDF found here: http://infocenter.arm.com/help/topic/com.arm.doc.den0001c/DEN0001C_principles_of_arm_memory_maps.pdf , which applies to Cortex-A7 and other CPUs, states at page 14 that the memory addressing space from 1GB-to-2GB is reserved for Memory-Mapped I/O devices, whereas the Olimex-A20 documentation found at this link https://github.com/OLIMEX/OLINUXINO/blob/master/HARDWARE/A20-PDFs/A20%20User%20Manual%202013-03-22.pdf?raw=true states at page 21 that the memory addressing space from 1GB-to-3GB is DDR-II/DDR-III memory.
Am I simply confused or is there an inconsistency between these two documents?
The memory maps on ARM chips are highly chip-specific. There is also usually nothing like a BIOS, so your bootloader or hypervisor will have to figure out the memory layout on its own.
Generally you'd need to work with the SDRAM controller to query and initialize the installed SDRAM chips. This is a non-trivial and, once again, very chip-specific process. You should check the code of bootloaders (e.g. U-Boot) you have available for your chips and look for the memory init code.
However, in many cases the memory "map" (start of RAM and its size) is simply hardcoded for each board the bootloader is ported to, since it's unlikely to change at every boot.
Historically, ARM boot-loaders pass information to the Linux kernel using ATAG structures as described in Booting ARM Linux. At a minimum the boot-loader is expected to initialize the RAM in the system and pass ATAG_MEM structures describing where the RAM lives in the address space. Interpreting these structures would give you some of the information you need, but it doesn't tell you anything about any peripheral devices. In this booting method the machine type is used to trigger platform-specific code to initialize the rest of the hardware.
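A hedged sketch of walking that ATAG list (structure layout per the Booting ARM Linux document; names here are illustrative): each ATAG_MEM entry, tag 0x54410002, describes one bank of RAM by size and physical start address.
#include <stdint.h>

#define ATAG_NONE 0x00000000
#define ATAG_MEM  0x54410002

struct atag_header { uint32_t size; uint32_t tag; };   /* size is in 32-bit words */
struct atag_mem    { uint32_t size; uint32_t start; };

static void walk_atags(const uint32_t *p)
{
    for (;;) {
        const struct atag_header *h = (const struct atag_header *)p;
        if (h->tag == ATAG_NONE || h->size == 0)
            break;                                      /* end of the list */
        if (h->tag == ATAG_MEM) {
            const struct atag_mem *m = (const struct atag_mem *)(h + 1);
            /* m->start .. m->start + m->size is one usable RAM bank */
            (void)m;
        }
        p += h->size;                                   /* advance by size-in-words */
    }
}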
The new way of doing this is through the Flattened Device Tree. The device tree originated with OpenFirmware and, besides describing the RAM mapping, can also describe the rest of the hardware and peripherals.
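A hedged libfdt-based sketch of pulling the same information out of a flattened device tree (cell counts are assumed to be 1 for brevity; real code should honour #address-cells and #size-cells from the root node):
#include <stdint.h>
#include <libfdt.h>

static void read_memory_node(const void *fdt)
{
    int node = fdt_path_offset(fdt, "/memory");
    if (node < 0)
        return;                                  /* no /memory node found */

    int len = 0;
    const fdt32_t *reg = (const fdt32_t *)fdt_getprop(fdt, node, "reg", &len);
    if (reg && len >= 8) {
        uint32_t start = fdt32_to_cpu(reg[0]);   /* physical start of RAM */
        uint32_t size  = fdt32_to_cpu(reg[1]);   /* size of the RAM bank  */
        (void)start; (void)size;
    }
}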

Realistic data rate over PCI bus using DMA?

What is the realistic data transfer rate over a 32-bit/33 MHz PCI bus? We need to transfer 32K 32-bit samples from a PCI card to an Intel CPU running Windows. I would think the block would transfer in 1 msec, but it is taking 40 msec. The PCI board has a PLX PCI-9056. We are accessing card memory with a virtual address, but our CPU is bricked out, which makes me think the data rate is being held up by CPU involvement. If we go to DMA, will we transfer in closer to 1 msec? The reason I have my doubts is that the PXI SDK User Manual states:
"BAR space memory read/write is generally slow in relative terms. Reads are typically only 2-4MB/s."
You should check if you can enable burst mode and continuous burst, such that multiple DWords can be transmitted without new address cycles. This makes things much faster. The PLX PCI9056 supports this option, but it must be set by SW accordingly.
We have data rates up to 90 MB/s with DMA Master Transfer on our custom designed frame grabber card.
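For scale, a rough calculation of the numbers in the question (the rates are illustrative: ~3 MB/s for programmed BAR reads per the quoted manual, 90 MB/s as quoted above, 132 MB/s theoretical 32-bit/33 MHz peak):
#include <stdio.h>
int main(void)
{
    const double bytes = 32768.0 * 4.0;   /* 32K 32-bit samples = 128 KiB */
    printf("PIO  @ 3   MB/s: %.1f ms\n", bytes / 3e6   * 1e3);   /* ~43.7 ms */
    printf("DMA  @ 90  MB/s: %.1f ms\n", bytes / 90e6  * 1e3);   /* ~1.5 ms  */
    printf("Peak @ 132 MB/s: %.1f ms\n", bytes / 132e6 * 1e3);   /* ~1.0 ms  */
    return 0;
}
So the observed ~40 msec is consistent with CPU-driven BAR reads, and bursting DMA should indeed bring it down to roughly a millisecond.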
