Create a DirectX staging texture backed by a custom RAM allocation


I am the author of Looking Glass (https://looking-glass.io) and I am looking for a way to improve our DXGI Desktop Duplication capture performance. This question is specifically about how to avoid an extra CPU memory copy as part of our pipeline.
If you are not familiar with Looking Glass, what we are doing is using a virtual hardware device (IVSHMEM) to map a block of shared memory into a Windows Virtual Machine. We then use this shared RAM to pass the captured desktop back to the host so that it can be rendered on screen. We do this so that we can acquire the video output of a GPU that has been passed through to the guest by means of VFIO and integrate it into the Linux desktop.
Currently, our pipeline works as follows:
AcquireNextFrame
CopyResource to staging texture (system RAM)
map staging texture (ID3D11DeviceContext::Map)
memcpy to IVSHMEM memory
unmap staging texture
Is it possible to avoid this extra copy by creating a DX11 staging texture backed by the IVSHMEM device directly, removing the need to copy the texture again?
I.e.:
AcquireNextFrame
CopyResource to texture backed by IVSHMEM RAM
If this needs to be done in kernel space (the IVSHMEM driver) this is possible, but it would be preferable if there is a userspace method of performing this.
Note, the IVSHMEM device is simply a dumb virtual device that provides the shared memory as one of its base address registers (BARs).
Edit: for reference here is the existing code we are using:
https://github.com/gnif/LookingGlass/blob/master/host/platform/Windows/capture/DXGI/src/dxgi.c

Related

Zynq 7020: some memory is unreadable

I have a Zynq 7020 chip with 250 MB of DDR memory attached to it, configured in ECC mode (so effectively 125 MB). It's attached to NAND flash memory and has a series of bootloaders which eventually load VxWorks to run some stuff.
We are about to do a test which will require me to read all the memory, flash, and FPGA configuration memory on the device after execution.
I have another [small] program that I will install via JTAG after the run and have it write out the rest of the RAM, then all the flash and FPGA configuration memory. This program is compiled by the Xilinx SDK and is bare-metal (no OS/bootloader).
When I load this program, I reset the processor (JTAG command), run a ps7_init.tcl script that sets all the CPU registers to a good configuration as set by Vivado, load the elf file onto the device, then run the processor. This program then tries to read memory starting at the address 0x0, but it crashes quickly. I told it to start at an address of 1 MB (1<<20) (because I know there's some weird memory map stuff at the beginning, so I tried this just in case) and it reads a little bit more then crashes again.
The crash appears to be the CPU not letting me read these areas of RAM.
Why not? How do I make it so I can read every byte of the 125 MB of RAM I have?

There is a template project that Xilinx provides that is close to exactly what you are trying to do. It is the "Memory Tests" template, and it runs through attached memory ranges and tests read/write operations. By default it tests DDR memory range and ps7_ram_1. The linker script for the application puts the program in ps7_ram_0, and doesn't test that range since you can't overwrite instruction and data memory for the application.
Code for the template can be found here:
<SDK installation directory>\data\embeddedsw\lib\sw_apps\memory_tests
I would recommend creating a new project from this template, and changing the memorytest.c file to fit your need.
To answer your question more directly: you are likely running into problems with the processor's MMU (memory management unit) and cache management. If your application is being loaded into DDR, then it is possible that it is blocking you from accessing application instruction and data memory. If your application is loaded into OCM (like the template), there may be access problems with the memory in cache. If you disable the data cache using
Xil_DCacheDisable();
then you should be able to read from the entire DDR memory space (as long as it exists). Make sure that you configure your application's linker script (*.ld) so that the application knows which memory devices are out there, along with their base addresses and sizes.
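As a rough illustration of the read loop such a dump program needs, here is a minimal sketch. The function name checksum_region is hypothetical, and the base address and size are plain parameters; on the target they would come from the linker script, and Xil_DCacheDisable() would be called first as described above. Nothing here is taken from the Xilinx template.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read every 32-bit word in [base, base + size_bytes)
 * and return the sum of the words. The pointer is volatile so the compiler
 * must perform each read instead of optimising the loop away. */
uint32_t checksum_region(const volatile uint32_t *base, size_t size_bytes)
{
    uint32_t sum = 0;
    size_t words = size_bytes / sizeof(uint32_t);

    for (size_t i = 0; i < words; i++) {
        sum += base[i];   /* one real memory access per word */
    }
    return sum;
}
```

Because the region is passed in, the same loop can be tried on an ordinary host buffer before pointing it at the DDR range on the board. A real dump would send each word out over a UART or JTAG instead of summing it.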

What if processes don't fit in memory?

If processes don't fit in memory, what moves them in and out of memory so they can run?
This question is based on operating system memory management theory.
I have read about the purpose of the memory management unit. Is this related to swapping?

The operating system will use a memory management technique called virtual memory.
This is when a computer compensates for a shortage of physical memory by temporarily transferring pages (fixed-size blocks of memory) of data from RAM to a backing store. RAM is much faster than secondary storage, so when the computer has to fall back on secondary storage the user will notice it running slower.
The operating system's virtual memory manager is responsible for managing this. It uses techniques such as moving pages that have not been referenced in a while into secondary storage (your hard disk, for example), and when a page in secondary storage is required it moves that page back from secondary to primary memory.
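The idea of evicting pages that have not been referenced in a while can be sketched as a least-recently-used (LRU) policy. This is a toy simulation only: the names lru_mem, lru_init and lru_access and the three-frame "RAM" are assumptions made for illustration, and real kernels use cheaper approximations of LRU.

```c
#include <stddef.h>

#define NUM_FRAMES 3   /* assumed: a tiny "RAM" with three page frames */

typedef struct {
    int page[NUM_FRAMES];                /* page held by each frame, -1 if empty */
    unsigned long last_used[NUM_FRAMES]; /* logical time of the last reference */
    unsigned long clock;                 /* incremented on every access */
    unsigned faults;                     /* number of page faults so far */
} lru_mem;

void lru_init(lru_mem *m)
{
    for (int i = 0; i < NUM_FRAMES; i++) {
        m->page[i] = -1;
        m->last_used[i] = 0;
    }
    m->clock = 0;
    m->faults = 0;
}

/* Reference a page. Returns 0 on a hit, 1 on a page fault. */
int lru_access(lru_mem *m, int page)
{
    int victim = 0;

    m->clock++;
    for (int i = 0; i < NUM_FRAMES; i++) {
        if (m->page[i] == page) {        /* already resident: just touch it */
            m->last_used[i] = m->clock;
            return 0;
        }
    }
    /* Fault: take the first empty frame, otherwise the least recently used. */
    for (int i = 0; i < NUM_FRAMES; i++) {
        if (m->page[i] == -1) {
            victim = i;
            break;
        }
        if (m->last_used[i] < m->last_used[victim]) {
            victim = i;
        }
    }
    m->page[victim] = page;              /* "swap in" the requested page */
    m->last_used[victim] = m->clock;
    m->faults++;
    return 1;
}
```

Each fault here stands for a transfer between the backing store and RAM, which is the slow path the user notices.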
Another point is that most modern apps will page themselves out, for example when they are minimised, to free up memory for the other applications that are running.

How to load two images onto a zynq zedboard

I'm trying to stitch two images on an FPGA using a Xilinx ZedBoard (Zynq-7000). I couldn't find any material on how to load two images onto the board and then get an output with the images placed side by side. Any leads are greatly appreciated.

That board has ARM processors, typically running Linux, so at least you won't have any problems getting images onto the board: over gigabit Ethernet, on an SD card, or on a memory stick in the USB OTG port. You don't really want to implement that channel yourself on the FPGA side; it would just be a waste of time.
To process the images using the FPGA part (assuming that's the point of the task), connect your FPGA hardware to the ARM system via an AXI interface and memory-map it into the Linux application's address space.
You don't need to store entire images in FPGA block memory: the pictures will probably be too large to fit into the available FPGA resources, and the FPGA can access the Linux-side memory (the large SDRAM) via the FPGA-to-SDRAM bridge.
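Before moving anything into FPGA logic, it can help to have a software reference for the side-by-side placement itself, running on the ARM side. The following is a minimal sketch assuming 8-bit grayscale images of equal height stored row by row; the function name stitch_side_by_side is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Place two images next to each other, left image first.
 * Both inputs are 8-bit grayscale, row-major, with the same height.
 * out must have room for (left_w + right_w) * height bytes. */
void stitch_side_by_side(const uint8_t *left, size_t left_w,
                         const uint8_t *right, size_t right_w,
                         size_t height, uint8_t *out)
{
    size_t out_w = left_w + right_w;

    for (size_t y = 0; y < height; y++) {
        /* Each output row is the left row followed by the right row. */
        memcpy(out + y * out_w, left + y * left_w, left_w);
        memcpy(out + y * out_w + left_w, right + y * right_w, right_w);
    }
}
```

The row-by-row structure is also how a streaming hardware version would likely work, since it only ever needs one row of each image at a time rather than the whole picture.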

Can two processes share the same GPU memory? (CUDA)

In the CPU world one can do this via a memory map. Can something similar be done for the GPU?
If two processes can share the same CUDA context, I think it would be trivial: just pass the GPU memory pointer around. Is it possible to share the same CUDA context between two processes?
Another possibility I can think of is to map device memory to memory-mapped host memory. Since it's memory mapped, it can be shared between two processes. Does this make sense, is it possible, and is there any overhead?

CUDA MPS effectively allows CUDA activity emanating from 2 or more processes to behave as if they share the same context on the GPU. (For clarity: CUDA MPS does not cause two or more processes to share the same context. However the work scheduling behavior appears similar to what you would observe if the work were emanating from the same process and therefore the same context.) However this won't provide for what you are asking for:
can two processes share the same GPU memory?
One method to achieve this is via CUDA IPC (interprocess communication) API.
This will allow you to share an allocated device memory region (i.e. a memory region allocated via cudaMalloc) between multiple processes. This answer contains additional resources to learn about CUDA IPC.
However, according to my testing, this does not enable sharing of host pinned memory regions (e.g. a region allocated via cudaHostAlloc) between multiple processes. The memory region itself can be shared using ordinary IPC mechanisms available for your particular OS, but it cannot be made to appear as "pinned" memory in 2 or more processes (according to my testing).

Where is the memory map configuration stored?

Assume there is an MCU (like the Cypress PSoC 4 chip I'm using). It contains flash memory (to store firmware) and RAM (probably SRAM) inside the chip. I understand that even these two components need to be memory mapped in order for the processing unit to access them.
However, the flash memory and SRAM should be mapped every time the MPU is powered on, right?
Then where is the configuration for memory map stored?
Is it somehow hardwired inside the MPU? Or is it stored in a separately hidden small piece of RAM?
I once thought that the memory map info should be located at the front of the firmware, but this doesn't make sense because the firmware is stored in the flash, and the MPU would have no idea where the flash is mapped to. So, I think this is a wrong idea.
By the way, is a memory map even configurable?

Yes, it is hardwired in the MCU at boot. Some MCUs allow remapping once up and running, but in order to boot, the flash/ROM has to be mapped to a known place. A sane design would also have the on-chip SRAM mapped and ready to use at a known location on boot.
Some use straps (pins externally hardwired high or low) to manipulate how the MCU boots, and sometimes that includes a different mapping. A single strap could, for example, choose between mapping a bootloader ROM and mapping the user flash into the boot space of the processor. That would be documented, along with the other mapping choices, in the chip vendor's documentation for the part.
Some MCUs allow you, in software after boot, to move RAM into the vector/exception table area so you can manipulate it at run time and not be limited to what was in the flash at boot. Some MCUs go so far as to have an MMU-like feature, but I have a hard time calling those MCUs, as they can run at hundreds of MHz and have floating point units, caches, etc. Technically they are an SoC with RAM and flash on chip, so they are classified as MCUs.
Your thinking is sane: the flash and SRAM mappings are in logic, and at reset you can know where things will be. It is in the documentation for that product.
