HLSL: Memory coalescing with structured buffers - memory

I'm currently working on an HLSL shader that is limited by global memory bandwidth. I need to coalesce as much memory as possible in each memory transaction. Based on the guidelines from NVIDIA for CUDA and OpenCL (DirectCompute documentation is quite lacking), the largest memory transaction size for compute capability 2.0 is 128 bytes, while the largest word that can be accessed is 16 bytes. Global memory accesses can be coalesced when the data being accessed by the threads in a warp fall into the same 128 byte segment. With this in mind, wouldn't structured buffers be detrimental for memory coalescing if the structure is larger than 16 bytes?
Suppose you have a structure of two float4's, call them A and B. You can access either A or B, but not both in a single memory transaction for an instruction issued in a non-divergent warp. The layout of the memory would look like ABABABAB. If you're trying to read consecutive structures into shared memory, wouldn't memory bandwidth be wasted by storing the data in this manner? For example, you can only access the A elements, but the hardware coalesces the memory transaction so it reads in 128 bytes of consecutive data, half of which is the B elements. Essentially, you're wasting half of your memory bandwidth. Wouldn't it be better to store the data like AAAABBBB, which is a structure of buffers instead of a buffer of structures? Or is this handled by the L1 cache, where the B elements are cached so you can access them faster when the next instruction is to read in the B elements? The only other solution would be to have even numbered threads access the A elements, while odd numbered elements access the B elements.
If memory bandwidth is indeed wasted, I don't see why anyone would use structured buffers other than for convenience. Hopefully I explained this well enough so someone could understand. I would ask this on the NVIDIA developer forums, but I think they're still down. Visual Studio keeps crashing when I try to run the NVIDIA Nsight frame profiler, so it's difficult to see how the memory bandwidth is affected by changes in how the data is stored. P.S., has anyone been able to successfuly run the NVIDIA Nsight frame profiler?

Related

iOS memory shared between CPU and GPU and what that means for reading

I have a MTLBuffer that is using memory that is allocated by the cpu and thus shared by both the cpu and the GPU.
Per Apple's suggestion I am using triple buffering to remove latency that might be caused by one processor waiting on the other to finish.
My vertex data changes every frame so every frame I am writing to one section of the array with the CPU and reading a different section with the GPU.
What I would like to do is read some of the values that the GPU is currently also reading as they save me some time doing calculations for the section of the buffer the CPU is writing to.
Essentially this is because the current frame's data is dependent on the previous frames data.
Is this valid? Can the CPU and the GPU be reading from the same portion of memory at once since memory is shared on iOS?
I think that's valid and safe, for two reasons. First, CPUs actually often have to read in order to write. Things like caches and memory buses don't allow for access to RAM at the granularity we usually think of (byte or even register size). In order to write, it usually has to read a larger chunk from memory, modify just the part written, and then (eventually) write the larger chunk back to memory. So, even the approach where you don't explicitly read from parts of the buffer that the GPU is reading and you only write to parts that the GPU isn't accessing can, in theory, still be implicitly reading from parts of the buffer that the GPU is reading. Since we're not given the info we'd need to reliably avoid that, I'd say it isn't considered a problem.
Second, no warning is given about what you describe in Apple's docs. There's the "Maintaining Coherency Between CPU and GPU Memory" section in the article about resource objects. That only discussed the case where either the CPU or GPU are modifying shared data, not where both are just reading.
Then there's the "Resource Storage Modes and Device Memory Models" section describing the new storage modes introduced with iOS 9 and macOS 10.11. And the docs for MTLResourceStorageModeShared itself. Again, there's mention of reading vs. writing, but none about reading vs. reading.
If there were a problem with simultaneous reading, I think Apple would have discussed it.

How does browser GPU memory usage works?

By pressing F12 and then Esc on Chrome, you can see a few options to tick. One of them is show FPS meter, which allows us to see GPU memory usage in real time.
I have a few questions regarding this GPU memory usage:
This GPU memory means the memory the webpage needs to store its code: variables, methods, images, cached videos, etc. Is this right to affirm?
Is there a reason as to why it has an upper bound of 512 Mb? Is there a way to reduce or increase it?
How much GPU memory usage is enough to see considerable slowdown on browser navigation?
If I have an array with millions of elements (just hypothetically), and I splice all the elements in the array, will it free the memory that was in use? Or will it not "really" free the memory, requiring an additional step to actually wipe it out?
1. What is stored in GPU memory
Although there are no hard-set rules on the type of data that can be stored in GPU-memory, the bulk of GPU memory generally contains single-frame resources like textures, multi-frame resources like vertex buffers and index buffer data, and programmable-shader compiled code fragments. So while in theory it is possible to store video's in GPU memory, as well as all kinds of other bulk data, in practice, for every streamed video only a bunch of frames will ever be in GPU-ram.
The main reason for this soft-selection of texture-like data sets is that a GPU is a parallel hardware architecture, and it expects the data to be compatible with that philosophy, which means that there are no inter-dependencies between sets of data (i.e. pixels). Decoding images from a video stream is more or less the same as resolving interdependence between data-blocks.
2. Is 512MB enough for everyone?
No. It's probably based on your hardware.
3. When does GPU memory become slow?
You have to know that some parts of the GPU memory are so fast you can't even start to appreciate the speed. There is nothing wrong with the speed of a GPU card. What matters is the time it takes to get the data IN that memory in the first place. That is called bandwidth, and the operations usually need to be synchronized. In that case, the driver will lock the Northbridge bus so that data can flow from main memory into GPU memory, and this locking + transfer takes quite some time.
So to answer the question, once it is uploaded, the GUI will remain fast, no matter how much more memory is used on the GPU card. The only thing that can slow it down, are changes to the GUI, and other GPU processes taking time to complete that may interfere with rendering operations.
4. Splicing ram memory frees it up?
I'm not quite sure what you mean by splicing. GPU memory is freed by applications that release that memory by using the API calls to do that. If you want to render you GPU memory blank, you'd have to grab the GPU handles of the resources first, upload 'clear' data into them, and then release the handles again, but (for normal single-threaded GPU applications) you can only do that in your own process context.

shared memory vs texture memory in opencl

I am writing deinterlacing code in Opencl. I am reading the pixels using read_imageui() API in the local memory.
Just like the code at:
https://opencl-book-samples.googlecode.com/svn-history/r29/trunk/src/Chapter_19/oclFlow/lkflow.cl
As per my understanding when we read pixels using this API we are reading from the Texture memory. I am doubtful that using the pixels first in shared memory will help me gaining any speed as Texture memory already acts as cache and provides fast access to data.
Can anyone clarify my doubt ?
If you can fit all your data in private memory after reading it with read_imageui, you should definitely do that. Keep in mind that you only have 256 bytes of private memory per work item if your kernel compiles SIMD16 and 512 bytes if it compiles SIMD8.
Whether you should use local memory or not really depends on the access pattern. Indeed, Samplers have their own L1 and L2 caches, so if your data accesses always hit the caches, you should be fine. Remember, that local memory is banked, so you have 16 banks from which you can fetch 4 bytes at a time, which means that you get full bandwidth if you hit all 16 banks from all work items in one hardware thread (typically 16 or 8 of them). So, you might have a situation where you are better off reading image data into local memory first and then accessing local memory in an orderly fashion. Good example of this are algorithms like SIFT or SURF, where you access image in such a way that sampler cache really does not help much (you still get sampler interpolation benefits), but then you place all that data in local memory and access it repeatedly in a fairly regular pattern.
In general, that's true. However, even a cached read from a texture might be slower than a read from shared local memory, so for an algorithm that makes many overlapped reads from adjacent locations could still benefit somewhat from using shared local memory. However, it will make the kernel more complicated, so for many cases (and certainly during algorithm development) just rely on the cached texture reads.

Dynamic Array Memory Allocation Strategies

I've written a 32bit program using a dynamic array to store a list of triangles with an unknown count. My current strategy is to estimate a very large number of triangles and then trim the list when all the triangles are created. In some cases I'll only allocate memory once in others I'll need to add to the allocation.
With a very large data set I'm running out of memory when my application is memory usage is about 1.2GB and since the allocation step is so large I feel like I may be fragmenting memory.
Looking at FastMM (memory manager) I see these constants which would suggest one of these as a good size to increment by.
ChunkSize = 64 * 1024;
MaximumSmallBlockSize = 32752;
LargeBlockGranularity = 64 * 1024;
Would one of these be an optimal size for increasing the size of an array?
Eventually this program will become 64bit but we're not quite ready for that step.
Your real problem here is not that you are running out of memory, but that the memory allocator cannot find a large enough block of contiguous address space. Some simple things you can do to help include:
Execute the code in a 64 bit process.
Add the LARGEADDRESSAWARE PE flag so that your process gets a 4GB address space rather than 2GB.
Beyond that the best you can do is allocate smaller blocks so that you avoid the requirement to store your large data structure in contiguous memory. Allocate memory in blocks. So, if you need 1GB of memory, allocate 64 blocks of size 16MB, for instance. The exact block size that you use can be tuned to your needs. Larger blocks result in better allocation performance, but smaller blocks allow you to use more address space.
Wrap this up in a container that presents an array like interface to the consumer, but internally stores the memory in non-contiguous blocks.
As far as I know, dynamic arrays in Delphi use contiguous address space (at least in the virtual memory address space.)
Since you are running out of memory at 1.2 gb, I guess that's the point where the memory manager can't find a block contiguous memory large enough to fit a larger array.
One way you can work around this limitation would be to implement your array as a collection of smaller array of (lets say) 200 mb in size. That should give you some more headroom before you hit the memory cap.
From the 1.2 gb value, I would guess your program isn't compiled to be "large address aware". You can see here how to compile your application like this.
One last trick would be to actually save the array data in a file. I use this trick for one of my application where I needed to load a few GB of images to be displayed in a grid. What I did was to create a file with the attribute FILE_ATTRIBUTE_TEMPORARY and FILE_FLAG_DELETE_ON_CLOSE and saved/loaded images from the resulting file. From CreateFile documentation:
A file is being used for temporary storage. File systems avoid writing
data back to mass storage if sufficient cache memory is available,
because an application deletes a temporary file after a handle is
closed. In that case, the system can entirely avoid writing the data.
Otherwise, the data is written after the handle is closed.
Since it makes use of cache memory, I believe it allows an application to use memory beyond the 32 bits limitation since the cache is managed by the OS and (as far as I know) not mapped inside the process' virtual memory space. After doing this change, performance were still pretty good. But I can't say if performances would still be good enough for your needs.

Maximum data a GPU can take?

I have a large dataset, say, 5 GB and I am doing stream-wise processing on the data, now, I need to figure out on how much data I can send to GPU at a time for processing, so that I can make utilization of GPU memory to the fullest.
Also, if my RAM is not sufficient to do processing/hold on 5 GB of data, what is the work-around for this?
A pipelined application might use 3 buffers on the GPU. One buffer is used to hold the data currently being transferred to the GPU (from the host), one buffer to hold the data currently being processed by the GPU, and one buffer to hold the data(results) currently being transferred from the GPU (to the host).
This implies that your application processing can be broken into "chunks". This is true for many applications that work on large data sets.
CUDA streams enable the developer to write code that allows these 3 operations (transfer to, process, transfer from) to run simultaneously.
There is no specific number that defines the size of the buffers in the above scenario. Certainly, a straightforward implementation would create 3 buffers, each of which is smaller than 1/3 of the total memory on the GPU, leaving some memory left over for overhead and other data that may need to live in GPU memory. So if your GPU has 5GB, you might be able to run with three 1GB buffers. But there is no tool like deviceQuery that will tell you this; it is not a property of the device.
You may want to read carefully the above linked programming guide section, as well as review the CUDA simple streams sample code.

Resources