Where is a texture created with gl.createTexture stored? - memory

When creating a new texture using createTexture from the WebGL context, where is it stored? Does it use purely the GPU?
Is it in main memory or in the graphics card's memory? How is the performance in terms of resources (especially CPU load)?
Thanks!

gl.createTexture provides you with a GPU resource id; when and if that texture is actually allocated in GPU memory is up to the GPU driver.

Need for CPU-GPU sync if write and read happens on different memory pages of MTLBuffer?

I am using an MTLBuffer in Metal that I created by allocating several memory pages (using vm_allocate) with
device.makeBuffer(bytesNoCopy:length:options:deallocator:).
I write the buffer with the CPU, and the GPU only reads it. I know that, in general, I need to synchronise between the CPU and the GPU.
However, I know more precisely where in the MTLBuffer the writes (by the CPU) and reads (by the GPU) happen, and in my case the writes land on different memory pages than the reads (within a given time interval).
My question: Do I need to sync between the CPU and the GPU even if the data being written and the data being read are on different memory pages (but in the same MTLBuffer)? Intuitively, I would think not, but MTLBuffer is a bit opaque and I don't really know what kind of processing the GPU actually does with it, or what requirements it has.
Additional info: This is a question for iOS and MTLStorageMode is shared.
Thank you very much for help!
Assuming the buffer was created with MTLStorageModeManaged, you can use the didModifyRange(_:) method to sync from CPU to GPU for only a portion (a page, for example) of the buffer.

How to allocate GpuMat in shared (not dedicated) GPU memory? [duplicate]

So I installed the GPU version of TensorFlow on a Windows 10 machine with a GeForce GTX 980 graphics card on it.
Admittedly, I know very little about graphics cards, but according to dxdiag it does have:
4060 MB of dedicated memory (VRAM), and
8163 MB of shared memory,
for a total of about 12224 MB.
What I noticed, though, is that this "shared" memory seems to be pretty much useless. When I start training a model, the VRAM fills up, and if the memory requirement exceeds those 4 GB, TensorFlow crashes with a "resource exhausted" error message.
I CAN, of course, prevent reaching that point by choosing a suitably small batch size, but I do wonder whether there's a way to make use of those "extra" 8 GB of RAM, or if that's it and TensorFlow requires the memory to be dedicated.
Shared memory is an area of the main system RAM reserved for graphics. References:
https://en.wikipedia.org/wiki/Shared_graphics_memory
https://www.makeuseof.com/tag/can-shared-graphics-finally-compete-with-a-dedicated-graphics-card/
https://youtube.com/watch?v=E5WyJY1zwcQ
This type of memory is what integrated graphics, e.g. the Intel HD series, typically use.
This memory is not on your NVIDIA GPU, and CUDA can't use it. TensorFlow can't use it when running on the GPU because CUDA can't use it, and it can't use it when running on the CPU either, because it's reserved for graphics.
Even if CUDA could use it somehow, it wouldn't be very useful, because system RAM bandwidth is around 10x lower than GPU memory bandwidth, and you would still have to get the data to and from the GPU over the slow (and high-latency) PCIe bus.
Bandwidth numbers for reference:
GeForce GTX 980: 224 GB/s
DDR4 on a desktop motherboard: approx. 25 GB/s
PCIe x16: 16 GB/s
This doesn't take into account latency. In practice, running a GPU compute task on data which is too big to fit in GPU memory and has to be transferred over PCIe every time it is accessed is so slow for most types of compute that doing the same calculation on CPU would be much faster.
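As a rough back-of-the-envelope sketch (using the approximate figures above; real throughput varies with hardware, drivers and access patterns), here is how long one pass over a hypothetical 1 GB working set would take at each bandwidth:
# Rough transfer-time estimate for a hypothetical 1 GB working set,
# using the approximate bandwidth figures quoted above.
data_gb = 1.0
bandwidths_gb_per_s = {
    "GPU memory (GTX 980, ~224 GB/s)": 224.0,
    "Desktop DDR4 (~25 GB/s)": 25.0,
    "PCIe x16 (~16 GB/s)": 16.0,
}
for name, bw in bandwidths_gb_per_s.items():
    # time = size / bandwidth, converted to milliseconds
    print(f"{name}: {data_gb / bw * 1000:.1f} ms per pass over the data")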
Why do you see that kind of memory being allocated when you have an NVIDIA card in your machine? Good question. I can think of a couple of possibilities:
(a) You have both the NVIDIA and Intel graphics drivers active (e.g. as happens when running different displays on both). Uninstall the Intel drivers and/or disable Intel HD graphics in the BIOS, and the shared memory will disappear.
(b) NVIDIA is using it. This may be, for example, extra texture memory. It could also not be real memory, but just a memory-mapped area that corresponds to GPU memory. Look in the advanced settings of the NVIDIA driver for a setting that controls this.
In any case, no, there isn't anything there that TensorFlow can use.
CUDA can make use of system RAM as well: memory that is automatically migrated between VRAM and system RAM is called unified (managed) memory in CUDA. However, TensorFlow does not use it, for performance reasons.
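As a small illustration of what unified (managed) memory looks like from Python (just a sketch, assuming Numba with CUDA support and a version that provides cuda.managed_array; this is unrelated to how TensorFlow manages its memory):
import numpy as np
from numba import cuda
@cuda.jit
def double_in_place(arr):
    # Each thread doubles one element of an array living in unified memory.
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] *= 2.0
# Unified (managed) memory: visible to both CPU and GPU; the driver migrates pages on demand.
data = cuda.managed_array(1024, dtype=np.float32)
data[:] = np.arange(1024, dtype=np.float32)    # written by the CPU
threads = 128
blocks = (data.size + threads - 1) // threads
double_in_place[blocks, threads](data)         # read and written by the GPU
cuda.synchronize()                             # make the GPU writes visible to the CPU
print(data[:4])                                # [0. 2. 4. 6.]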
I had the same problem. My VRAM is 6 GB, but only 4 GB was detected. I read some code about TensorFlow limiting GPU memory, tried it, and it works:
import tensorflow as tf

# Set a GPU memory limit
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 6 GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=6144)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Note: if you have 10 GB of VRAM, then try a memory limit of 10240.
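A related option, if you'd rather not hard-code a limit, is to let TensorFlow grow its GPU allocation on demand (a minimal sketch using the same tf.config.experimental API):
import tensorflow as tf
# Let TensorFlow allocate GPU memory on demand instead of using a fixed limit.
# This must be called before any GPU has been initialized.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)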
Well, that's not entirely true. You're right in terms of lowering the batch size, but it will depend on what model type you are training. If you train XSeg, it won't use the shared memory, but when you get into SAEHD training you can put your model optimizers on the CPU (instead of the GPU), as well as your learning dropout rate, which then lets you take advantage of that shared memory for those optimizations while saving the dedicated GPU memory for your model resolution and batch size. So it may seem like that shared memory is useless, but play with your settings and you'll see that, for certain settings, it's just a matter of redistributing the right tasks. You'll have increased iteration times, but you'll be utilizing that shared memory in one way or another. I had to experiment a lot to find what worked with my GPU, and there were some surprising revelations. This is an old post, but I bet you've figured it out by now, hopefully.

Nvidia GPUDirect and camera capturing to GPU

I have a USB3 camera, and I need the captured images to be loaded into a DirectX texture. Currently I'm just doing it in my code in user mode: grab the images and upload them to the GPU, which, of course, costs some CPU overhead and a delay of ~5-7 milliseconds.
The new PC has an Nvidia Quadro GPU, which supports GPUDirect. As I understand it, this allows faster memory sharing between the GPU and the CPU, but I'm not sure how I can take advantage of it. In order to capture the images directly into GPU buffers, does the camera driver need to support it? Or is it something I can configure in my code?
Your premise is wrong
as I understand, this allows faster memory sharing between the GPU and
the CPU
GPUDirect is an NVIDIA technology that supports PCIe-to-PCIe transfers between devices, not between the CPU and the GPU. In fact, it removes a CPU copy when transferring data between two PCIe devices.
To use GPUDirect in your case, you need a capture card connected via PCIe (your camera would be connected to the capture card). Then you can transfer frames directly to the GPU.
I can't comment on whether it would be faster than the USB transfer.

Relation between OpenCL memory architecture and GPU's physical memory/caches (L1/L2...)?

Is there any direct relation between the OpenCL memory architecture:
Local/Global/Constant/Private memory
and the GPU's physical memory and caches?
For example, for a GPU card that has 1 GB of memory plus L1 and L2 caches, how are these related to local/global/... memory?
Or are local/constant/private memory allocated from global memory?
Thanks
The OpenCL specification doesn't really discuss caching of memory. Most modern graphics cards do have some sort of caching for global memory, but this is not guaranteed on older cards. However, here is an overview of the different memory types.
Private Memory - This memory is kept in registers, per work-item. GPUs have very large register files per compute unit. However, values can spill out of registers into slower memory if needed. Private memory is what you get by default when you declare variables in a kernel.
Local Memory - Memory local to and shared by the work-group. This memory typically lives on the compute unit itself and cannot be read or written by other work-groups. It typically has very low latency on GPU architectures (on CPU architectures, this memory is simply part of your system memory). It is typically used as a manual cache for global memory. Local memory is specified with the __local qualifier.
Constant Memory - Part of global memory, but read-only and therefore aggressively cacheable. __constant is used to define this type of memory.
Global Memory - This is the main memory of the GPU. __global is used to place data in the global memory space.
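To make the mapping concrete, here is a minimal sketch that touches all four address spaces (assuming pyopencl is installed and an OpenCL device is available; the kernel and variable names are made up for illustration):
import numpy as np
import pyopencl as cl
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
kernel_src = """
__kernel void scale(__global const float *in_buf,   // global: main GPU memory
                    __global float *out_buf,
                    __constant float *factor,        // constant: read-only, cacheable
                    __local float *tile)             // local: shared per work-group
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float x = in_buf[gid];                 // private: per-work-item registers
    tile[lid] = x;
    barrier(CLK_LOCAL_MEM_FENCE);
    out_buf[gid] = tile[lid] * factor[0];
}
"""
prg = cl.Program(ctx, kernel_src).build()
n, local_size = 256, 64
a = np.arange(n, dtype=np.float32)
factor = np.array([2.0], dtype=np.float32)
mf = cl.mem_flags
in_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
factor_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=factor)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)
prg.scale(queue, (n,), (local_size,), in_buf, out_buf, factor_buf,
          cl.LocalMemory(local_size * np.dtype(np.float32).itemsize))
result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)
print(result[:4])   # [0. 2. 4. 6.]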

What is the maximum size of the texture memory on a modern GPU?

We believe that texture memory is part of global memory; is this true? If so, how much can you allocate? (Indirectly: how much is there?)
And is it true that all multiprocessors can read from the texture memory at the same time?
Texture data is contained in CUDA arrays, and CUDA arrays are allocated out of global memory; so however much global memory is still free (you can call cuMemGetInfo() to see how much free memory is left) is available for allocation as textures.
It's impossible to know how much memory is consumed by a given CUDA array - obviously it has to be at least Width*Height*Depth*sizeof(Texel), but it may take more because the driver has to do an allocation that conforms to the hardware's alignment requirements.
Texture limits for different compute capabilities can be found in the CUDA Programming Guide, available at the NVIDIA CUDA website.
For a given device, one can query device capabilities including texture limits using the cudaGetDeviceProperties function.
Allocation depends on the amount of available global memory and on how fragmented that memory is, so there is no easy way to tell whether a given allocation will be successful or not, especially when working with large textures.
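As a sketch of how you might query these numbers from Python (assuming pycuda is installed; the attribute names mirror the driver API's CU_DEVICE_ATTRIBUTE_* constants, and the 4096x4096 texture is just a made-up example):
import pycuda.driver as cuda
cuda.init()
dev = cuda.Device(0)
ctx = dev.make_context()
try:
    # Equivalent of cuMemGetInfo(): how much global memory is still free.
    free_bytes, total_bytes = cuda.mem_get_info()
    print(f"free global memory: {free_bytes / 2**20:.0f} of {total_bytes / 2**20:.0f} MiB")
    # Texture limits are exposed as device attributes.
    attrs = dev.get_attributes()
    max_w = attrs[cuda.device_attribute.MAXIMUM_TEXTURE2D_WIDTH]
    max_h = attrs[cuda.device_attribute.MAXIMUM_TEXTURE2D_HEIGHT]
    print(f"max 2D texture size: {max_w} x {max_h}")
    # Lower bound for a hypothetical 4096x4096 float texture: width*height*sizeof(texel);
    # the driver may pad for alignment, so the real footprint can be larger.
    print(f"minimum footprint of a 4096x4096 float texture: {4096 * 4096 * 4 / 2**20:.0f} MiB")
finally:
    ctx.pop()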
