I allocate a data block on the GPU, and I have an algorithm that generates new data to replace the old. The new buffer has the same size. One solution is to bring the old data back to the CPU and then erase it, but I think that would be highly inefficient and very slow. Is it possible to overwrite the old elements with the new data at the same location?
If your kernel accepts a pointer to a buffer region, you may be able to simply pass the original data pointer to that kernel, so that your input data is overwritten by the kernel's results.
Or, if your algorithm requires a separate buffer, you can use cudaMemcpy with cudaMemcpyDeviceToDevice to copy the results stored in that buffer over the region of memory holding your input data, overwriting it in the process; the data never leaves the GPU.
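A minimal sketch of both options; the kernel name regenerate, the buffer names, and the doubling arithmetic are placeholders for your own generation algorithm:

#include <cuda_runtime.h>

__global__ void regenerate(float* data, int n)    // overwrites the old values in place
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;          // placeholder for your generation algorithm
}

int main()
{
    const int n = 1 << 20;
    float *d_data = NULL, *d_scratch = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMalloc((void **)&d_scratch, n * sizeof(float));

    // Option 1: the kernel receives the original pointer and overwrites it in place.
    regenerate<<<(n + 255) / 256, 256>>>(d_data, n);

    // Option 2: results were produced into a scratch buffer; copy them over the
    // old data without ever going back to the CPU.
    cudaMemcpy(d_data, d_scratch, n * sizeof(float), cudaMemcpyDeviceToDevice);

    cudaFree(d_data);
    cudaFree(d_scratch);
    return 0;
}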
The documentation for setVertexBytes says:
Use this method for single-use data smaller than 4 KB. Create a MTLBuffer object if your data exceeds 4 KB in length or persists for multiple uses.
What exactly does single-use mean?
For example, if I have a uniforms struct which is less than 4 KB (and is updated every frame), is it better to use a triple-buffering technique or to simply use setVertexBytes?
From what I understand, using setVertexBytes copies the data every time into a MTLBuffer that Metal manages. That sounds slower than triple buffering.
But then, if I have different objects, each with its own uniforms, I would have to triple-buffer everything, since it's all dynamically updated.
And if I have a material that updates rarely but is passed to the shader every frame, would it be better to keep it in a buffer or pass it as a pointer using setVertexBytes?
It's not necessarily the case that Metal manages a distinct resource into which this data is written. As user Columbo notes in their comment, some hardware allows constant data to be recorded directly into command buffer memory, from which it can be subsequently read by a shader.
As always, you should profile in order to find the difference between the two approaches on your target hardware, but if the amount of data you're pushing per draw call is small, you might very well find that using setVertexBytes:... is faster than writing into a buffer and calling setVertexBuffer:....
For data that doesn't vary every frame (your slow-varying material use case), it may indeed be more efficient to keep that data in a buffer (double- or triple-buffered) rather than using setVertexBytes:....
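As a rough illustration (using Apple's metal-cpp C++ wrapper so the snippet stays in C++; the Uniforms struct, the buffer indices, and frameIndex are assumptions, not anything from the question), the two bindings might look like this:

#include <Metal/Metal.hpp>

struct Uniforms { float mvp[16]; };               // well under the 4 KB guideline

void encodeDraw(MTL::RenderCommandEncoder* enc,
                MTL::Buffer* tripleBuffered[3],   // persistent or slow-varying data
                const Uniforms& perFrame,
                int frameIndex)
{
    // Small, single-use, per-draw data: let Metal stage the bytes for you.
    enc->setVertexBytes(&perFrame, sizeof(perFrame), /*index*/ 1);

    // Larger or slow-varying data: bind one slot of a double/triple-buffered MTLBuffer.
    enc->setVertexBuffer(tripleBuffered[frameIndex % 3], /*offset*/ 0, /*index*/ 2);

    // ... encode drawPrimitives / drawIndexedPrimitives as usual ...
}

Whether the setVertexBytes path wins still comes down to profiling on your target hardware, as noted above.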
I'm currently writing my own graphics framework for DirectX12 (I've already written several DirectX 11 frameworks for personal game engines), and I'm currently trying to copy the methods used in the recent Hitman game for resource binding.
I'm confused about the best way to handle per-object resource binding for the SRV/CBV/UAV heap. I've watched several GDC presentations, and they all seem to gloss over this.
Only one SRV/CBV/UAV heap can be bound at a time, and switching the currently bound heap in the middle of a command list can hurt performance on some hardware by forcing a flush. Given that, what is the best way to handle updating the heap with new descriptors? To me, it seems like each command list would:
Get a hold of a SRV/CBV/UAV heap for itself.
For each object in a subset of objects, create descriptors on the heap pointing to per-object data that was placed into a separate upload heap.
Afterwards, another command list takes this filled descriptor heap and binds it, then issues draw calls mixed with SetGraphicsRootDescriptorTable in order to move through the current descriptor heap.
This being said, several sources online (including another SO post) suggest using one large SRV/CBV/UAV heap and copying into it using CPU-visible heaps. I'm assuming they're not attempting to use the asynchronous CopyDescriptors, but rather CopyBufferRegion. I tried using CopyBufferRegion to update data per-object, but to me this seems under-performant with so many transitions between D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER and D3D12_RESOURCE_STATE_COPY_DEST. Am I misunderstanding something? Any clarity would be appreciated.
CopyDescriptors is not asynchronous; it is an immediate CPU operation. For volatile descriptors it can happen any time before the command list is executed (even after the command-list operation that uses it has been recorded); for static descriptors (root signature 1.1) the descriptors have to be ready at the point of use.
The usual approach is to have one large descriptor heap, keep a portion of it for static descriptors, and use the rest as a ring buffer, allocating descriptor-table offsets on demand and copying in the descriptors needed for each draw/compute operation.
CopyBufferRegion has nothing to do with this. Remember that mapping a buffer is also an immediate operation, so you likewise ring-buffer a big chunk of memory for your per-object constant buffers and cycle through it. The only caveat is that you must make sure you do not overwrite memory or descriptors that may still be in use, so you have to fence to guard against that case.
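A condensed sketch of the descriptor ring-buffer idea (all names are invented, error handling is omitted, and the fence that prevents overwriting in-flight ranges is only marked with a comment):

#include <d3d12.h>

struct DescriptorRing
{
    ID3D12DescriptorHeap* heap = nullptr;   // shader-visible CBV/SRV/UAV heap
    UINT increment = 0;
    UINT capacity  = 0;
    UINT head      = 0;                     // next free slot; a fence must keep this valid

    void Init(ID3D12Device* device, UINT numDescriptors)
    {
        D3D12_DESCRIPTOR_HEAP_DESC desc = {};
        desc.Type           = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
        desc.NumDescriptors = numDescriptors;
        desc.Flags          = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
        device->CreateDescriptorHeap(&desc, __uuidof(ID3D12DescriptorHeap), (void**)&heap);
        increment = device->GetDescriptorHandleIncrementSize(desc.Type);
        capacity  = numDescriptors;
    }

    // Copy `count` descriptors from a CPU-only (non-shader-visible) heap into the ring
    // and return the GPU handle to pass to SetGraphicsRootDescriptorTable.
    D3D12_GPU_DESCRIPTOR_HANDLE Stage(ID3D12Device* device,
                                      D3D12_CPU_DESCRIPTOR_HANDLE src, UINT count)
    {
        if (head + count > capacity) head = 0;   // wrap; real code fences before reusing slots

        D3D12_CPU_DESCRIPTOR_HANDLE dstCpu = heap->GetCPUDescriptorHandleForHeapStart();
        dstCpu.ptr += SIZE_T(head) * increment;
        device->CopyDescriptorsSimple(count, dstCpu, src,
                                      D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);

        D3D12_GPU_DESCRIPTOR_HANDLE dstGpu = heap->GetGPUDescriptorHandleForHeapStart();
        dstGpu.ptr += UINT64(head) * increment;
        head += count;
        return dstGpu;
    }
};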
I am confused about __local memory in OpenCL.
I read in the spec that the data flow has to be from the host to __global memory, and then to __local memory.
But I also see some kernel function like this:
__kernel void foo(__local float * a)
I was wondering how the data is transferred directly into __local memory in this way?
Thanks.
It is not possible to fill a local buffer from the host side, so you have to follow the flow host -> __global -> __local.
A local buffer can be created either from the host side, by passing it as a kernel parameter, or on the GPU side, by declaring it inside the kernel.
Creating the local buffer from the host side has the advantage that you can decide its size just before the kernel is run, which can be important if the local buffer size needs to be different each time the kernel runs.
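A minimal host-side sketch of the kernel-parameter route (kernel and variable names are made up, and error checking is omitted): the __local argument is sized with clSetKernelArg by passing the byte count and a NULL pointer, and the kernel itself performs the __global -> __local copy.

#include <CL/cl.h>

// Kernel source shown for reference; build it with clCreateProgramWithSource as usual.
const char* kSource = R"(
__kernel void foo(__global const float* in,
                  __local  float*       tile)    // size chosen by the host
{
    size_t lid = get_local_id(0);
    tile[lid] = in[get_global_id(0)];            // host -> __global -> __local
    barrier(CLK_LOCAL_MEM_FENCE);
    // ... work on tile ...
}
)";

void launch(cl_kernel kernel, cl_command_queue queue, cl_mem inputBuf,
            size_t globalSize, size_t localSize)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputBuf);
    // For a __local argument, pass the byte size and a NULL data pointer:
    clSetKernelArg(kernel, 1, localSize * sizeof(float), NULL);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                           0, NULL, NULL);
}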
Local memory is not visible to anything but a single work-group, and may be allocated as the work-group is dispatched by hardware on many architectures. Hardware that can mix multiple work-groups from different kernels on each CU will allow the scheduling component to chunk up the local memory for each of the groups being issued. It doesn't exist before the group is launched, and does not exist after the group terminates. The size of this region is what you pass in as other answers have pointed out.
The upshot is that, on many architectures, the only way to fill local memory from the host would be for the compiler to insert kernel code that copies the data in from global memory. Given that as the basis, it isn't any worse in terms of performance for the programmer to do it manually, and it gives more control over exactly what happens. You don't end up in a situation where the compiler always generates copy code and copies more than was really necessary because the API didn't make clear which memory was copy-in and which was not.
In summary, you cannot fill local memory in any automated way. In practice you will rarely want to, because doing it manually gives you the opportunity to only put the result of a first stage into local, removing extra copy operations, or to transform the data on the way in to local, allowing padding or data transposition to remove bank conflicts and so on.
As @doqtor said, the size of a local memory kernel parameter can be specified with clSetKernelArg calls.
Fortunately, OpenCL 1.2+ supports VLAs (variable-length arrays), so a local memory kernel parameter is not required any more.
We often have the case where we need to stream textures to the graphics card (in the game case: terrain; in my case: images from different input sources such as cameras, capture cards, or videos).
Of course in camera case, I receive my data in a separate thread, but still need to upload that data to the GPU for display.
I know 2 models for it.
Use a dynamic resource:
You create a dynamic texture with the same size and format as your input image. When you receive a new image, you set a flag saying an upload is needed, then use Map on the device context to upload the texture data (with eventual double buffering, of course).
The advantage is that you have a single memory location, so you don't get memory fragmentation over time.
The drawback is that you need to upload in the immediate context, so the upload has to happen in your render loop.
Use immutable and load/discard
In that case you upload from the image-receiving thread: create a new resource, push the data into it, and discard the old resource.
The advantage is that the upload should be stall-free (no need for the immediate context; you can keep running your command list while the texture is uploading), and the resource can be put to use with a simple trigger once it is available (to swap the SRV).
The drawback is that you can fragment memory over time by constantly allocating and freeing resources (at 30 fps for a standard camera, for example).
Also you have to deal with throttling yourself (but that part is not a big deal).
So is there something I missed in those techniques, or is there an even better way to handle this?
These are the two main methods of updating textures in D3D11.
However, the assumption that the first method will not result in memory usage patterns identical to the second depends on the driver, and is likely not true. You would use D3D11_MAP_WRITE_DISCARD if you are overwriting the whole image (which sounds like what you are doing), meaning that the current contents of the buffer become undefined. However, that is only true from the CPU's point of view: the contents are retained for the GPU if they are potentially used in a pending draw operation. Most (maybe all?) drivers will actually allocate new storage for the write location of the mapped texture in this case; otherwise command-buffer processing would need to stall. The same holds if you do not use the discard flag. Instead, when the map command is processed in the command buffer, the resource's backing storage is updated to the memory returned from Map in D3D11_MAPPED_SUBRESOURCE.
Also, it is not true that you must update dynamic textures on the immediate context, only that if you update them on a deferred context you must use the D3D11_MAP_WRITE_DISCARD flag. This means you could update the texture on a worker thread, if you are overwriting the entire texture.
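For reference, a bare-bones sketch of that path (the function and parameter names are illustrative, and error handling is minimal); the row-by-row copy is there because the driver's RowPitch can be larger than a tightly packed source pitch:

#include <d3d11.h>
#include <cstring>

void UploadFrame(ID3D11DeviceContext* ctx, ID3D11Texture2D* dynamicTex,
                 const unsigned char* srcPixels, UINT height, UINT srcRowPitch)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(dynamicTex, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        unsigned char* dst = static_cast<unsigned char*>(mapped.pData);
        for (UINT y = 0; y < height; ++y)                 // respect the driver's RowPitch
            std::memcpy(dst + y * mapped.RowPitch,
                        srcPixels + y * srcRowPitch,
                        srcRowPitch);
        ctx->Unmap(dynamicTex, 0);
    }
}

Because the whole subresource is discarded, ctx here may be a deferred context, as described above.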
The bottom line is that, since the CPU/GPU system on a PC is not a unified memory system, there will be synchronization issues updating GPU resources coming from the CPU.
I'm using a Tesla, and for the first time I'm running low on CPU memory instead of GPU memory! Hence, I thought I could cut my host memory usage by switching all integers to shorts (all my values are below 255).
However, I want my device memory to use integers, since the memory access is faster. So is there a way to copy my host memory (in short) to my device global memory (in int)? I guess this won't work:
short *buf_h = new short[100];
int *buf_d = NULL;
cudaMalloc((void **)&buf_d, 100*sizeof(int));
cudaMemcpy( buf_d, buf_h, 100*sizeof(short), cudaMemcpyHostToDevice );
Any ideas? Thanks!
There isn't really a way to do what you are asking directly. The CUDA API doesn't support "smart copying" with padding or alignment, or "deep copying" of nested pointers, or anything like that. Memory transfers require linear host and device memory, and alignment must be the same between source and destination memory.
Having said that, one approach to circumvent this restriction would be to copy the host short data to an allocation of short2 on the device. Your device code can then retrieve a short2 containing two packed shorts, extract the value it needs, and cast that value to int. This gives the code 32-bit memory transactions per thread, allowing memory coalescing and (if you are using Fermi GPUs) good L1 cache hit rates, because adjacent threads within a block would be reading the same 32-bit word. On non-Fermi GPUs, you could probably use a shared memory scheme to efficiently retrieve all the values for a block using coalesced reads.
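A sketch of that idea (the kernel name and sizes are made up, and n is assumed even so the short2 count works out):

#include <cuda_runtime.h>

__global__ void consume(const short2* packed, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        short2 pair = packed[i / 2];            // one 32-bit load, coalesced across the warp
        short  s    = (i & 1) ? pair.y : pair.x;
        out[i] = (int)s;                        // widen to int on the fly
    }
}

int main()
{
    const int n = 100;                          // matches the question's example size
    short* buf_h = new short[n];
    for (int i = 0; i < n; ++i) buf_h[i] = (short)i;

    short2* buf_d = NULL;
    int*    out_d = NULL;
    cudaMalloc((void **)&buf_d, (n / 2) * sizeof(short2));   // same byte count as n shorts
    cudaMalloc((void **)&out_d, n * sizeof(int));
    cudaMemcpy(buf_d, buf_h, n * sizeof(short), cudaMemcpyHostToDevice);

    consume<<<(n + 255) / 256, 256>>>(buf_d, out_d, n);
    cudaDeviceSynchronize();

    cudaFree(buf_d); cudaFree(out_d); delete[] buf_h;
    return 0;
}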