I have a large dataset, say 5 GB, and I am doing stream-wise processing on the data. Now I need to figure out how much data I can send to the GPU at a time for processing, so that I can utilize the GPU memory to the fullest.
Also, if my RAM is not sufficient to hold/process 5 GB of data, what is the workaround for this?
A pipelined application might use 3 buffers on the GPU: one to hold the data currently being transferred to the GPU (from the host), one to hold the data currently being processed by the GPU, and one to hold the data (results) currently being transferred from the GPU (to the host).
This implies that your application processing can be broken into "chunks". This is true for many applications that work on large data sets.
CUDA streams enable the developer to write code that allows these 3 operations (transfer to, process, transfer from) to run simultaneously.
There is no specific number that defines the size of the buffers in the above scenario. Certainly, a straightforward implementation would create 3 buffers, each of which is smaller than 1/3 of the total memory on the GPU, leaving some memory left over for overhead and other data that may need to live in GPU memory. So if your GPU has 5 GB, you might be able to run with three 1 GB buffers. But there is no tool like deviceQuery that will tell you this; it is not a property of the device.
You may want to carefully read the programming guide section linked above, as well as review the CUDA simple streams sample code.
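As a rough illustration of that chunked, multi-stream pattern, here is a minimal sketch (the chunk size, stream count, and kernel are made-up placeholders, and error checking is omitted):

    // Sketch: process a large host array in fixed-size chunks, using one
    // stream per in-flight chunk so H2D copy, kernel work, and D2H copy overlap.
    #include <cuda_runtime.h>
    #include <algorithm>

    __global__ void process(float *data, size_t n)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                    // placeholder "processing"
    }

    int main()
    {
        const size_t totalElems = 64u * 1024 * 1024;   // stand-in for the full data set
        const size_t chunkElems = 4u * 1024 * 1024;    // must fit comfortably in GPU memory
        const int    numStreams = 3;

        float *hBuf;                                   // pinned host memory, required for async copies
        cudaHostAlloc((void**)&hBuf, totalElems * sizeof(float), cudaHostAllocDefault);

        float        *dBuf[numStreams];
        cudaStream_t  stream[numStreams];
        for (int s = 0; s < numStreams; ++s) {
            cudaMalloc((void**)&dBuf[s], chunkElems * sizeof(float));
            cudaStreamCreate(&stream[s]);
        }

        for (size_t off = 0, c = 0; off < totalElems; off += chunkElems, ++c) {
            int    s = (int)(c % numStreams);
            size_t n = std::min(chunkElems, totalElems - off);
            cudaMemcpyAsync(dBuf[s], hBuf + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, stream[s]);
            process<<<(unsigned)((n + 255) / 256), 256, 0, stream[s]>>>(dBuf[s], n);
            cudaMemcpyAsync(hBuf + off, dBuf[s], n * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < numStreams; ++s) {
            cudaStreamDestroy(stream[s]);
            cudaFree(dBuf[s]);
        }
        cudaFreeHost(hBuf);
        return 0;
    }

Reusing a device buffer for a later chunk on the same stream is safe because operations within a stream execute in issue order, so the new host-to-device copy cannot start until the previous device-to-host copy from that buffer has completed.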
I am using an MTLBuffer in Metal that I created by allocating several memory pages (using vm_allocate) with
device.makeBuffer(bytesNoCopy:length:options:deallocator:).
I write the buffer with CPU and the GPU only reads it. I know that generally I need to synchronise between CPU and GPU.
However, I know more about where in the MTLBuffer the writes (by the CPU) and reads (by the GPU) happen, and in my case the writes go to different memory pages than the reads (in a given time interval).
My question: Do I need to sync between CPU and GPU even if the relevant data that is written and read lives on different memory pages (but in the same MTLBuffer)? Intuitively, I would think not, but MTLBuffer is a bit opaque and I don't really know what kind of processing the GPU actually does with the MTLBuffer, or what requirements it has.
Additional info: This is a question for iOS and MTLStorageMode is shared.
Thank you very much for help!
Assuming the buffer was created with MTLStorageModeManaged, you can use the function didModifyRange to sync CPU to GPU for only a portion (a page for example) of the buffer.
I have an MTLBuffer that is using memory allocated by the CPU and thus shared by both the CPU and the GPU.
Per Apple's suggestion I am using triple buffering to remove latency that might be caused by one processor waiting on the other to finish.
My vertex data changes every frame so every frame I am writing to one section of the array with the CPU and reading a different section with the GPU.
What I would like to do is read some of the values that the GPU is currently also reading as they save me some time doing calculations for the section of the buffer the CPU is writing to.
Essentially this is because the current frame's data is dependent on the previous frame's data.
Is this valid? Can the CPU and the GPU be reading from the same portion of memory at once since memory is shared on iOS?
I think that's valid and safe, for two reasons. First, CPUs actually often have to read in order to write. Things like caches and memory buses don't allow for access to RAM at the granularity we usually think of (byte or even register size). In order to write, it usually has to read a larger chunk from memory, modify just the part written, and then (eventually) write the larger chunk back to memory. So, even the approach where you don't explicitly read from parts of the buffer that the GPU is reading and you only write to parts that the GPU isn't accessing can, in theory, still be implicitly reading from parts of the buffer that the GPU is reading. Since we're not given the info we'd need to reliably avoid that, I'd say it isn't considered a problem.
Second, no warning is given about what you describe in Apple's docs. There's the "Maintaining Coherency Between CPU and GPU Memory" section in the article about resource objects. That only discusses the case where either the CPU or the GPU is modifying shared data, not where both are just reading.
Then there's the "Resource Storage Modes and Device Memory Models" section describing the new storage modes introduced with iOS 9 and macOS 10.11. And the docs for MTLResourceStorageModeShared itself. Again, there's mention of reading vs. writing, but none about reading vs. reading.
If there were a problem with simultaneous reading, I think Apple would have discussed it.
By pressing F12 and then Esc in Chrome, you can see a few options to tick. One of them is "Show FPS meter", which allows us to see GPU memory usage in real time.
I have a few questions regarding this GPU memory usage:
Does this GPU memory refer to the memory the webpage needs to store its code: variables, methods, images, cached videos, etc.? Is that correct?
Is there a reason why it has an upper bound of 512 MB? Is there a way to reduce or increase it?
How much GPU memory usage is enough to see considerable slowdown on browser navigation?
If I have an array with millions of elements (just hypothetically), and I splice all the elements in the array, will it free the memory that was in use? Or will it not "really" free the memory, requiring an additional step to actually wipe it out?
1. What is stored in GPU memory
Although there are no hard-set rules on the type of data that can be stored in GPU memory, the bulk of GPU memory generally contains single-frame resources like textures, multi-frame resources like vertex buffer and index buffer data, and compiled programmable-shader code fragments. So while in theory it is possible to store videos in GPU memory, as well as all kinds of other bulk data, in practice, for every streamed video only a bunch of frames will ever be in GPU RAM.
The main reason for this soft-selection of texture-like data sets is that a GPU is a parallel hardware architecture, and it expects the data to be compatible with that philosophy, which means that there are no inter-dependencies between sets of data (i.e. pixels). Decoding images from a video stream is more or less the same as resolving interdependence between data-blocks.
2. Is 512 MB enough for everyone?
No. It's probably based on your hardware.
3. When does GPU memory become slow?
You have to know that some parts of the GPU memory are so fast you can't even start to appreciate the speed. There is nothing wrong with the speed of a GPU card. What matters is the time it takes to get the data IN that memory in the first place. That is called bandwidth, and the operations usually need to be synchronized. In that case, the driver will lock the Northbridge bus so that data can flow from main memory into GPU memory, and this locking + transfer takes quite some time.
So to answer the question, once it is uploaded, the GUI will remain fast, no matter how much more memory is used on the GPU card. The only things that can slow it down are changes to the GUI, and other GPU processes that take time to complete and may interfere with rendering operations.
4. Does splicing an array free its memory?
I'm not quite sure what you mean by splicing. GPU memory is freed by applications that release it using the appropriate API calls. If you want to render your GPU memory blank, you'd have to grab the GPU handles of the resources first, upload 'clear' data into them, and then release the handles again, but (for normal single-threaded GPU applications) you can only do that in your own process context.
I'm currently working on an HLSL shader that is limited by global memory bandwidth. I need to coalesce as much memory as possible in each memory transaction. Based on the guidelines from NVIDIA for CUDA and OpenCL (DirectCompute documentation is quite lacking), the largest memory transaction size for compute capability 2.0 is 128 bytes, while the largest word that can be accessed is 16 bytes. Global memory accesses can be coalesced when the data being accessed by the threads in a warp fall into the same 128 byte segment. With this in mind, wouldn't structured buffers be detrimental for memory coalescing if the structure is larger than 16 bytes?
Suppose you have a structure of two float4's, call them A and B. You can access either A or B, but not both, in a single memory transaction for an instruction issued in a non-divergent warp. The layout of the memory would look like ABABABAB. If you're trying to read consecutive structures into shared memory, wouldn't memory bandwidth be wasted by storing the data in this manner? For example, you can only access the A elements, but the hardware coalesces the memory transaction so it reads in 128 bytes of consecutive data, half of which is the B elements. Essentially, you're wasting half of your memory bandwidth. Wouldn't it be better to store the data like AAAABBBB, which is a structure of buffers instead of a buffer of structures? Or is this handled by the L1 cache, where the B elements are cached so you can access them faster when the next instruction is to read them? The only other solution would be to have even-numbered threads access the A elements, while odd-numbered threads access the B elements.
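To make the two layouts concrete, here is the same idea written as CUDA-style code (the struct and kernel names are just for illustration; the HLSL structured-buffer case is analogous):

    // Buffer of structures: A and B interleaved in memory as ABABAB...
    // A warp reading only .a touches twice as many 128-byte segments as it needs.
    struct Elem { float4 a; float4 b; };

    __global__ void readA_interleaved(const Elem *in, float4 *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i].a;   // 32-byte stride between consecutive threads
    }

    // Structure of buffers: all A's contiguous (AAAA...), B's in a separate array.
    // Consecutive threads read consecutive 16-byte words, so each transaction is fully used.
    __global__ void readA_separate(const float4 *a, float4 *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a[i];
    }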
If memory bandwidth is indeed wasted, I don't see why anyone would use structured buffers other than for convenience. Hopefully I explained this well enough for someone to understand. I would ask this on the NVIDIA developer forums, but I think they're still down. Visual Studio keeps crashing when I try to run the NVIDIA Nsight frame profiler, so it's difficult to see how the memory bandwidth is affected by changes in how the data is stored. P.S., has anyone been able to successfully run the NVIDIA Nsight frame profiler?
If my algorithm is bottlenecked by host to device and device to host memory transfers, is the only solution a different or revised algorithm?
There are a couple things you can try to mitigate the PCIe bottleneck:
Asynchronous transfers - permit overlapping computation and bulk transfer
Mapped memory - allows a kernel to stream data to/from the GPU during execution
Note that neither of these techniques makes the transfer go faster; they just reduce the time the GPU is waiting on the data to arrive.
With the cudaMemcpyAsync API function you can initiate a transfer, launch one or more kernels that do not depend on the result of the transfer, synchronize the host and device, and then launch kernels that were waiting on the transfer to complete. If you can structure your algorithm such that you're doing productive work while the transfer is taking place, then asynchronous copies are a good solution.
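A rough sketch of that pattern (the kernels, buffers, and launch sizes here are assumed placeholders, not a complete program):

    // Overlap an H2D transfer with work that does not depend on it.
    // copyStream and computeStream are two non-default streams.
    void overlappedStep(float *d_data, const float *h_data, size_t nbytes, float *d_other,
                        cudaStream_t copyStream, cudaStream_t computeStream)
    {
        // Start the transfer without blocking the host (h_data must be pinned memory).
        cudaMemcpyAsync(d_data, h_data, nbytes, cudaMemcpyHostToDevice, copyStream);

        // This kernel does not touch d_data, so it can run while the copy is in flight.
        independentKernel<<<128, 256, 0, computeStream>>>(d_other);

        // Wait for the copy to finish...
        cudaStreamSynchronize(copyStream);

        // ...then launch the work that depends on the transferred data.
        dependentKernel<<<128, 256, 0, computeStream>>>(d_data);
    }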
With the cudaHostAlloc API function you can allocate host memory that can be read and written directly by the GPU. The reason this is faster is that a block that needs host data only needs to wait for a small portion of the data to be transferred. In contrast, the usual approach makes all blocks wait until the entire transfer is complete. Mapped memory essentially breaks a big monolithic transfer into a bunch of smaller copy operations, so the latency is reduced.
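A minimal sketch of the mapped (zero-copy) approach, where nbytes, n, grid, block, and streamKernel are made-up placeholders:

    // Mapped ("zero-copy") host memory: the kernel reads host RAM directly
    // over PCIe instead of waiting for one big explicit copy to finish.
    cudaSetDeviceFlags(cudaDeviceMapHost);            // call before the CUDA context is created

    float *h_in, *d_in;
    cudaHostAlloc((void**)&h_in, nbytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_in, h_in, 0); // device-side alias of the same pages

    // ... fill h_in on the host ...

    streamKernel<<<grid, block>>>(d_in, n);           // blocks fetch only the data they need
    cudaDeviceSynchronize();
    cudaFreeHost(h_in);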
You can read more about these topics in Sections 3.2.6 and 3.2.7 of the CUDA Programming Guide and Section 3.1 of the CUDA Best Practices Guide. Chapter 3 of the OpenCL Best Practices Guide explains how to use these features in OpenCL.
You really need to do the math to be certain that you're going to be doing enough processing on the GPU to make it worthwhile transferring data between host and GPU. Ideally you do this at the design stage, before doing any coding, since it can be a deal-breaker.