I have a USB3 camera, and I need the captured images loaded into a DirectX texture. Currently I'm just doing it in my code in user mode - grab images and upload them to the GPU - which of course costs some CPU overhead and a delay of ~5-7 milliseconds.
The new PC has an NVIDIA Quadro GPU, which supports GPUDirect. As I understand it, this allows faster memory sharing between the GPU and the CPU, but I'm not sure how I can take advantage of it. To capture images directly into GPU buffers, does the camera driver need to support it? Or is it something I can configure in my code?
Your premise is wrong:
"as I understand, this allows faster memory sharing between the GPU and the CPU"
GPUDirect is an NVIDIA technology that supports PCIe peer-to-peer transfers between devices, not between the CPU and the GPU. What it removes is a CPU copy through system memory when transferring data between two PCIe devices.
To use GPUDirect in your case, you would need a capture card connected via PCIe (your camera would be connected to the capture card). Then you can transfer frames directly to the GPU.
I can't comment on whether it will be faster than the USB transfer.
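For contrast, here is a minimal sketch (not the poster's code; it assumes CuPy and NumPy are available) of the conventional path being described: the frame passes through a host staging buffer before the PCIe transfer, and that staging copy is exactly what GPUDirect eliminates between two PCIe devices.

import numpy as np
import cupy as cp

# Stand-in for a frame grabbed from the camera in user mode
frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)

# Page-locked (pinned) host memory speeds up the host-to-device DMA,
# but the CPU still performs the copy into the staging buffer.
pinned = cp.cuda.alloc_pinned_memory(frame.nbytes)
staging = np.frombuffer(pinned, np.uint8, frame.nbytes).reshape(frame.shape)
staging[...] = frame             # CPU copy into the pinned staging buffer
gpu_frame = cp.asarray(staging)  # DMA over PCIe into GPU memory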
So I installed the GPU version of TensorFlow on a Windows 10 machine with a GeForce GTX 980 graphics card on it.
Admittedly, I know very little about graphics cards, but according to dxdiag it does have:
4060 MB of dedicated memory (VRAM), and
8163 MB of shared memory,
for a total of about 12224 MB.
What I noticed, though, is that this "shared" memory seems to be pretty much useless. When I start training a model, the VRAM fills up, and if the memory requirement exceeds those 4 GB, TensorFlow crashes with a "resource exhausted" error message.
I CAN, of course, prevent reaching that point by choosing a suitably small batch size, but I do wonder whether there's a way to make use of those "extra" 8 GB of RAM, or if that's it and TensorFlow requires the memory to be dedicated.
Shared memory is an area of the main system RAM reserved for graphics. References:
https://en.wikipedia.org/wiki/Shared_graphics_memory
https://www.makeuseof.com/tag/can-shared-graphics-finally-compete-with-a-dedicated-graphics-card/
https://youtube.com/watch?v=E5WyJY1zwcQ
This type of memory is what integrated graphics (e.g. the Intel HD series) typically use.
This memory is not on your NVIDIA GPU, and CUDA can't use it. TensorFlow can't use it when running on the GPU because CUDA can't use it, and can't use it when running on the CPU either, because it's reserved for graphics.
Even if CUDA could use it somehow, it wouldn't be useful, because system RAM bandwidth is around 10x less than GPU memory bandwidth, and you would still have to get the data to and from the GPU over the slow (and high-latency) PCIe bus.
Bandwidth numbers for reference:
GeForce GTX 980: 224 GB/s
DDR4 on a desktop motherboard: approx. 25 GB/s
PCIe x16: 16 GB/s
This doesn't take into account latency. In practice, running a GPU compute task on data which is too big to fit in GPU memory and has to be transferred over PCIe every time it is accessed is so slow for most types of compute that doing the same calculation on CPU would be much faster.
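To make the gap concrete, here is a quick back-of-envelope sketch in Python using the figures above (the 1 GB working set is a hypothetical example):

data_gb = 1.0                          # hypothetical 1 GB working set
bandwidths = {"GPU memory (224 GB/s)": 224,
              "System RAM (25 GB/s)": 25,
              "PCIe x16 (16 GB/s)": 16}
for name, gbps in bandwidths.items():
    # time for one full pass over the data at that bandwidth
    print(f"{name}: {data_gb / gbps * 1e3:.1f} ms per pass")

That is ~4.5 ms per pass in GPU memory versus ~62.5 ms just to move the data across PCIe, before latency is even considered.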
Why do you see that kind of memory being allocated when you have an NVIDIA card in your machine? Good question. I can think of a couple of possibilities:
(a) You have both NVIDIA and Intel graphics drivers active (e.g. as happens when running different displays on both). Uninstall the Intel drivers and/or disable Intel HD graphics in the BIOS, and the shared memory will disappear.
(b) NVIDIA is using it, e.g. as extra texture memory. It might also not be real memory at all, but just a memory-mapped area that corresponds to GPU memory. Look in the advanced settings of the NVIDIA driver for a setting that controls this.
In any case, no, there isn't anything there that TensorFlow can use.
CUDA can make use of system RAM as well: in CUDA, memory shared between VRAM and RAM is called unified memory. However, TensorFlow does not allow it, for performance reasons.
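For the curious, here is a minimal sketch of what unified (managed) memory looks like from Python, assuming CuPy is installed - this is a CUDA/CuPy feature, not something TensorFlow exposes:

import cupy as cp

# Route all CuPy allocations through cudaMallocManaged; the driver can then
# page data between VRAM and system RAM on demand (with a performance cost).
cp.cuda.set_allocator(cp.cuda.malloc_managed)
a = cp.zeros((4096, 4096), dtype=cp.float32)  # backed by managed memory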
I had the same problem. My VRAM is 6 GB but only 4 GB was detected. I read about TensorFlow limiting GPU memory, then tried this code, and it works:

import tensorflow as tf

# Setting a GPU memory limit
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to allocating only 6 GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=6144)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Note: if you have 10 GB of VRAM, then try a memory limit of 10240.
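A related option in the same experimental config API is to let the allocation grow on demand instead of fixing a limit up front; a minimal sketch:

import tensorflow as tf

# Allocate GPU memory as needed rather than grabbing it all at startup
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)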
Well, that's not entirely true. You're right about lowering the batch size, but it depends on what model type you are training. If you train XSeg, it won't use the shared memory, but when you get into SAEHD training, you can place your model optimizers on the CPU (instead of the GPU), as well as the learning dropout, which then lets you take advantage of that shared memory for those optimizations while saving the dedicated GPU memory for your model resolution and batch size.
So it may seem like that shared memory is useless, but play with your settings and you'll see that, for certain settings, it's just a matter of redistributing the right tasks. You'll have increased iteration times, but you'll be utilizing that shared memory one way or another. I had to experiment a lot to find what worked with my GPU, and there were some surprising revelations. This is an old post, but I bet you've figured it out by now, hopefully.
In a PCIe configuration, devices have dedicated addresses and send data to each other in peer-to-peer mode - every device can write whenever it wants, and the switches take care of forwarding the data correctly. There is no need for a "bus master" that decides when and how data will be transmitted.
How does DMA come into play in such a configuration? To me it seems that DMA is an outdated feature that is not needed in a PCIe configuration. Every device can send data to main memory, or read from it - obviously main memory will always be the "slave" in such operations.
Or is there some other functionality of DMA, which I am missing?
Thank you in advance!
When a device other than a CPU accesses memory that is attached to a CPU, this is called direct memory access (DMA). So any PCIe read or write requests issued from PCIe devices constitute DMA operations. This can be extended with 'device to device' or 'peer to peer' DMA where devices perform reads and writes against each other without involving the CPU or system memory.
There are two main advantages of DMA: First, DMA operations can move data into and out of memory with minimal CPU load, improving software efficiency. Second, the CPU can only issue reads and writes of whatever the CPU word size is, which results in very poor throughput over the PCIe bus due to TLP headers and other protocol overheads. Devices directly issuing read and write requests can issue read and write operations with much larger payloads, resulting in higher throughput and more efficient use of the bus bandwidth.
So, DMA is absolutely not obsolete or outdated - basically all high-performance devices connected over PCIe will use DMA to use the bus efficiently.
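As a rough illustration of the second point, here is a back-of-envelope sketch in Python; the ~24 bytes of per-packet overhead is an approximation (TLP header plus sequence number, LCRC, and framing), and real figures vary by PCIe generation and configuration:

overhead = 24                 # approx. bytes of protocol overhead per TLP
for payload in (4, 64, 256):  # CPU word-sized vs. typical DMA payloads
    efficiency = payload / (payload + overhead)
    print(f"{payload:3d}-byte payload: {efficiency:.0%} of raw link bandwidth")

A 4-byte payload spends most of the link on headers (~14% efficiency), while a 256-byte DMA payload reaches over 90%.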
I don't really want to know the ins and outs of VGA, but rather the basic principle of how it works (with integrated graphics). The Intel website says -
So this stolen memory is used as the frame buffer for the VGA adapter, and any reads/writes by the VGA graphics controller go to and come from there?
Example system with 1 MB of stolen VGA memory -
So if the above system was running in VGA mode and something was written to the legacy VGA address range (0xA0000 - 0xBFFFF), what would the process be?
Currently my understanding is that the memory controller forwards it from the CPU to the VGA adapter, which then uses the graphics translation table (GTT) to translate it into a physical address at the top of DRAM, in the range 03F0_0000h - 03FF_FFFFh?
Would this mean that the legacy VGA memory range 0xA0000 - 0xBFFFF is not accessible in DRAM, since the VGA adapter is using that address range for MMIO?
If anyone could help with those questions it would be greatly appreciated,
Thanks.
It has been quite a few years since I wrote anything directly for VGA, so bear that in mind.
The old legacy stuff (CGA/EGA/VGA) mapped all VRAM access to just two segments (2 x 64 KByte):
graphic modes: A000:0000 - A000:FFFF
text modes: B800:0000 - B800:FFFF
So both of those 64 KByte chunks of address space are not directly accessible in RAM; instead, the VGA card maps its own memory there. Integrated cards with shared memory do not have their own memory, so the chipset takes it from main memory (usually from the top of the address space). In that case, yes, that memory is not accessible by HW (unless some feature of the chipset is used). That space in main memory is usually remapped or used for shadowing ROMs.
gfx-BIOS
All legacy gfx cards have their own BIOS (FLASH/EEPROM/EPROM/PROM) memory. I can't remember exactly how it works, but as I recall the expansion BIOS area starts around
C000:0000
where all HW with an expansion BIOS maps its BIOS memory (not only gfx cards, and not necessarily a whole segment in size).
Many gfx modes need more than 64 KB of VRAM, so you either call the gfx BIOS to map the appropriate memory bank to A000:0000 or set it through control registers with IO operations on the gfx IO ports. The gfx card remaps its memory and then you can use it ...
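To illustrate the bank switching just described, here is a small sketch of the arithmetic (the 640x480, 8 bpp, 64 KB-bank figures are assumptions, typical of VESA mode 0x101):

WIDTH, BANK_SIZE = 640, 64 * 1024

def pixel_location(x, y):
    offset = y * WIDTH + x        # linear offset into VRAM
    bank = offset // BANK_SIZE    # which 64 KB bank to map at A000:0000
    window = offset % BANK_SIZE   # offset within the mapped window
    return bank, window

print(pixel_location(0, 0))       # (0, 0)     - first bank
print(pixel_location(639, 479))   # (4, 45055) - fifth bank

Every time the bank number changes, the program has to remap the window before touching the pixel.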
VESA
VESA VRAM can be accessed in the same way as on the old legacy gfx stuff, but VESA adds LFB (linear frame buffer) support, which can map the entire VRAM into memory (not just a single segment) and can also use extended memory (in base memory alone it would not be of much use).
As I wrote before, it has been some years since I dealt with this stuff, so if I am wrong please edit or add a comment ...
What is the realistic data transfer rate over a 32-bit/33 MHz PCI bus? We need to transfer 32K 32-bit samples from a PCI card to an Intel CPU running Windows. I would think the block would transfer in 1 msec, but it is taking 40 msec. The PCI board has a PLX PCI-9056. We are accessing the card memory through a virtual address, but our CPU is maxed out, which makes me think the data rate is being held back by CPU involvement. If we go to DMA, will we transfer in closer to 1 msec? The reason I have my doubts is that the PLX SDK User Manual states:
"BAR space memory read/write is generally slow in relative terms. Reads are typically only 2-4MB/s."
You should check whether you can enable burst mode and continuous burst, so that multiple DWords can be transmitted without new address cycles. This makes things much faster. The PLX PCI9056 supports this option, but it must be enabled by software accordingly.
We get data rates of up to 90 MB/s with DMA master transfers on our custom-designed frame grabber card.
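For context, a quick back-of-envelope check in Python of the numbers in the question (132 MB/s is the theoretical 32-bit x 33 MHz peak; 3 MB/s is the middle of the quoted 2-4 MB/s BAR read figure):

size_mb = 32 * 1024 * 4 / 1e6   # 32K 32-bit samples ~= 0.13 MB
for label, rate in (("burst/DMA", 132.0), ("BAR reads", 3.0)):
    print(f"{label}: {size_mb / rate * 1e3:.1f} ms")   # ~1.0 ms vs ~43.7 ms

The ~44 ms figure for word-by-word BAR reads lines up closely with the 40 ms actually observed, which supports moving to DMA.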