Running a compute kernel on a portion of an MTLBuffer?

I am populating an MTLBuffer with float2 vectors. The buffer is being created and populated like this:
struct Particle {
    var position: float2
    ...
}
let particleCount = 100000
let bufferSize = MemoryLayout<Particle>.stride * particleCount
particleBuffer = device.makeBuffer(length: bufferSize)!
var pointer = particleBuffer.contents().bindMemory(to: Particle.self, capacity: particleCount)
pointer = pointer.advanced(by: currentParticles)
pointer.pointee.position = [x, y]
In my Metal file the buffer is being accessed like this:
struct Particle {
    float2 position;
    ...
};
kernel void compute(device Particle *particles [[buffer(0)]],
                    uint id [[thread_position_in_grid]] … )
I need to be able to run the compute kernel over a given range of the MTLBuffer. For example, is it possible to run the kernel on, say, elements 50,000 through 75,000 only?
It seems like the offset parameter lets me specify the start position, but there is no corresponding length parameter.
I see there is this call:
setBuffers(_:offsets:range:)
Does the range specify what portion of the buffer to run over? It seems that the range specifies which buffer slots are set, not the range of values to use.

A compute shader doesn't run "on" a buffer (or portion). It runs on a grid, which is an abstract concept. As far as Metal is concerned, the grid isn't related to a buffer or anything else.
A buffer may be an input that your compute shader uses, but how it uses it is up to you. Metal doesn't know or care.
Here's my answer to a similar question.
So, the dispatch command you encode using a compute command encoder governs how many times your shader is invoked. It also dictates what thread_position_in_grid (and related) values each invocation receives. If your shader correlates each grid position to an element of an array backed by a buffer, then the number of threads you specify in your dispatch command governs how much of the buffer you end up accessing. (Again, this is not something Metal dictates; it's implicit in the way you code your shader.)
Now, to start at the 50,000th element, using an offset on the buffer binding to make that the effective start of the buffer from the point of view of the shader is a good approach. But it would also work to just add 50,000 to the index the shader computes when it accesses the buffer. And, if you only want to process 25,000 elements (75,000 minus 50,000), then just dispatch 25,000 threads.
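As a concrete sketch in Swift (assuming an existing commandBuffer, pipelineState, and the particleBuffer from the question; the host-side names here are illustrative):

let start = 50_000
let count = 25_000   // 75,000 - 50,000

let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipelineState)

// Offset the binding so that particles[0] in the shader is element 50,000.
// For device-address-space buffers the offset must be aligned to the
// element type's alignment (constant buffers have stricter rules).
encoder.setBuffer(particleBuffer,
                  offset: MemoryLayout<Particle>.stride * start,
                  index: 0)

// Dispatch exactly 25,000 threads. dispatchThreads requires support for
// non-uniform threadgroup sizes; otherwise round up to whole threadgroups
// and bounds-check the thread ID in the shader.
let width = pipelineState.threadExecutionWidth
encoder.dispatchThreads(MTLSize(width: count, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: width, height: 1, depth: 1))
encoder.endEncoding()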

Related

Using Metal discard_fragment() to discard individual samples in an MSAA attachment

For an MSAA attachment, the following simple Metal fragment shader is meant to be run in multiple render passes, once per sample, to fill the stencil attachment with a different reference value per sample. It does not work as expected, and effectively fills all stencil pixel samples with the reference value on each render pass.
struct _10
{
    int _m0;
};

struct main0_out
{
    float gl_FragDepth [[depth(any)]];
};

fragment main0_out main0(constant _10& _12 [[buffer(0)]], uint gl_SampleID [[sample_id]])
{
    main0_out out = {};
    if (gl_SampleID != _12._m0)
    {
        discard_fragment();
    }
    out.gl_FragDepth = 0.5;
    return out;
}
The problem seems to be using discard_fragment() on a per-sample basis. The intended operation of discarding one sample but writing another does not occur. Instead, the sample is never discarded, regardless of the comparison value passed in the buffer.
In fact, from what I can tell from GPU capture shader tracing results, it appears that the entire if-discard clause is optimized away by the Metal compiler. My guess is that Metal probably recognizes the disconnect between per-sample invocations and discard_fragment(), and removes it, but I can't be sure.
I can't find any Metal documentation on discard_fragment() and its use with MSAA, so I can't tell whether discard_fragment() is supposed to work with individual sample invocations in that environment, or whether it can only discard the entire fragment (which admittedly the function name implies, but what does that mean for per-sample invocations?).
Does the logic and intention of this shader make sense? Is discard_fragment() supposed to work with individual sample invocations? And why would the Metal compiler possibly be removing the discard operation from my shader?

How many times is the vertex shader called with metal?

I've been learning some basic Metal rendering, and I am stuck on some basic concepts:
I know we can send vertex data to a shader using:
renderEncoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
And then we can retrieve it in the shader with:
vertex float4 basic_vertex(const device VertexIn* vertexIn [[ buffer(0) ]], unsigned int vid [[ vertex_id ]])
As I understand it, the vertex function will be called once per each vertex, and vertex_id will update on each call to contain the vertex index.
The question is: where does that vertex_id come from?
I could also send the shader more data of different sizes:
renderEncoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
renderEncoder.setVertexBuffer(vertexBuffer2, offset: 0, index: 1)
If vertexBuffer has 3 elements and vertexBuffer2 has 10 elements... how many times is the vertex function called? 10?
Thanks!
That's determined by the draw call you make on the render command encoder. Take the simplest draw method:
drawPrimitives(type:vertexStart:vertexCount:)
The vertexCount determines how many times your vertex function is called. The vertex IDs passed to the vertex function are those in the range from vertexStart to vertexStart + vertexCount - 1.
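For example, this call (renderEncoder assumed) invokes the vertex function 3 times, with vertex IDs 0, 1, and 2:

renderEncoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: 3)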
If you consider another draw method:
drawPrimitives(type:vertexStart:vertexCount:instanceCount:)
That goes over the same range of vertex IDs. However, it calls your vertex function vertexCount * instanceCount times. There will be instanceCount calls with the vertex ID being vertexStart. The instance ID will range from 0 to instanceCount - 1 for those calls. Likewise, there will be instanceCount calls with the vertex ID being vertexStart + 1 (assuming vertexCount >= 2), one with each instance ID in [0..instanceCount-1]. Etc.
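As a sketch, drawing 4 vertices with 3 instances invokes the vertex function 4 * 3 = 12 times:

renderEncoder.drawPrimitives(type: .triangleStrip,
                             vertexStart: 0,
                             vertexCount: 4,
                             instanceCount: 3)
// Each invocation sees a vertex_id in 0...3 and an instance_id in 0...2.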
The other draw methods have various other options, but they mostly don't affect how many times the vertex function is called. For example, baseInstance shifts the range of the instance IDs, but not its size.
The various drawIndexedPrimitives() methods get the specific vertex IDs from a buffer instead of enumerating all vertex IDs in a range. That buffer may contain a given vertex ID in multiple locations. For that case, I don't think it's defined whether the vertex function might be called multiple times for the same vertex ID and instance ID. Metal will presumably try to avoid duplicating effort, but it might end up actually being faster to just call the vertex function for every index in the index buffer even if multiple such indexes end up being the same vertex ID.
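For instance (index buffer name assumed), the vertex IDs here come from indexBuffer rather than from a contiguous range:

renderEncoder.drawIndexedPrimitives(type: .triangle,
                                    indexCount: 36,   // e.g. 12 triangles
                                    indexType: .uint16,
                                    indexBuffer: indexBuffer,
                                    indexBufferOffset: 0)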
The relationship between vertexes and data in buffers you pass to the vertex processing stage is entirely up to you. You don't have to pass any buffers, at all. For example, a vertex function could generate vertex information completely computationally just from the vertex ID and instance ID.
It is pretty common, of course, for at least some of the buffers to contain arrays of per-vertex data that are indexed into using the vertex ID. Other buffers might be uniform data that's the same for all vertexes (that is, you don't index into that buffer using the vertex ID). Metal itself doesn't know this, though.

How to access neighbouring elements correctly in a Metal kernel function

I used Metal for an interpolation task. I wrote the kernel function as follows:
kernel void kf_interpolation(device short *dst,
                             device uchar *src,
                             uint id [[thread_position_in_grid]])
{
    dst[id] = src[id - 1] + src[id] + src[id + 1];
}
That kernel function did not give the expected values, and I found the cause: src[id-1] was always 0, which is a wrong value. However, src[id+1] contained the right value. The question is how I can correctly access neighbouring elements, e.g. [id-1], in kernel functions. Thanks in advance.
The most efficient way to handle edge cases like this is usually to grow your source array at each end and offset the indices. So for N calculations, allocate your src array with N+2 elements, fill elements 1 through N (inclusive) with the source data, and set elements 0 and N+1 to whatever you want the edge condition to be.
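A minimal sketch of that padded layout on the CPU side (device and a sourceData: [UInt8] array are assumed):

let n = sourceData.count
let srcBuffer = device.makeBuffer(length: n + 2)!   // uchar elements, 1 byte each
let p = srcBuffer.contents().bindMemory(to: UInt8.self, capacity: n + 2)
p[0] = 0                                     // left edge condition
p[n + 1] = 0                                 // right edge condition
for i in 0..<n { p[i + 1] = sourceData[i] }  // real data in elements 1...N
// The kernel then computes dst[id] = src[id] + src[id + 1] + src[id + 2],
// so no thread reads out of bounds.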
An even more efficient method would be to use MTLTextures instead of MTLBuffers. MTLTextures have an addressing mode attached to them which causes the hardware to automatically substitute either zero or the nearest valid texel when you read off the edge of a texture. They can also do linear interpolation in hardware for free, which can be of great help for resampling, assuming bilinear interpolation is good enough for you. If not, I recommend looking at MPSImageLanczosScale as an alternative.
You can make a MTLTexture from a MTLBuffer. The two will alias the same pixel data.
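A hedged sketch of that aliasing (the exact pixel format, usage, and the alignment of offset and bytesPerRow depend on your data and device):

let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .r8Unorm,
                                                    width: n,
                                                    height: 1,
                                                    mipmapped: false)
desc.usage = .shaderRead
// offset and bytesPerRow must satisfy the device's linear-texture
// alignment rules; see minimumLinearTextureAlignment(for:).
let srcTexture = srcBuffer.makeTexture(descriptor: desc, offset: 0, bytesPerRow: n)!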

Dynamic output from compute shader

If I am generating 0-12 triangles in a compute shader, is there a way I can stream them to a buffer that will then be used for rendering to screen?
My current strategy is:
create a buffer of float3 with size threads * 12, so it can store the maximum possible number of triangles;
write to the buffer using an index that depends on the thread position in the grid, so there are no race conditions.
If I want to render from this though, I would need to skip the empty memory. It sounds ugly, but there is probably no other way currently. I know geometry shaders on desktop APIs can have variable-length output, but I wonder if/how games on iOS generate variable-length data on the GPU.
UPDATE 1:
As soon as I wrote the question, I thought about the possibility of using a second buffer that would point out how many triangles are available for each block. The vertex shader would then process all vertices of all triangles of that block.
This does not solve the unused-memory problem, though, and since I have a large number of threads, the total memory wasted would be considerable.
What you're looking for is the Metal equivalent of D3D's "AppendStructuredBuffer". You want a type that can have structures added to it atomically.
I'm not familiar with Metal, but it does support atomic operations such as add, which is all you really need to roll your own append buffer. Initialise a counter to 0, then have each thread atomically add 1 to the counter and use the returned original value as the index to write to in your buffer.
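In Metal terms, a hedged sketch: the kernel would declare device atomic_uint *counter [[buffer(1)]] and claim a slot with atomic_fetch_add_explicit(counter, 1, memory_order_relaxed), using the returned value as its write index. The host side (names assumed) only has to allocate and zero the counter:

let counterBuffer = device.makeBuffer(length: MemoryLayout<UInt32>.stride)!
counterBuffer.contents().storeBytes(of: 0 as UInt32, as: UInt32.self)
encoder.setBuffer(counterBuffer, offset: 0, index: 1)
// After the command buffer completes, the counter holds the number of
// triangles actually appended, so only that many need to be drawn.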

Random access to D3D11 buffer with R8G8B8A8_UNorm format in HLSL

I have a D3D11 buffer with a few million elements that is supposed to hold data in the R8G8B8A8_UNorm format.
The desired behavior is the following: One shader calculates a vec4 and writes it to the buffer in a random access pattern. In the next pass, another shader reads the data in a random access pattern and processes them further.
My best guess would be to create an UnorderedAccessView with the R8G8B8A8_UNorm format. But how do I declare the RWBuffer<?> in HLSL, and how do I write to and read from it? Is it necessary to declare it as RWBuffer<uint> and do the packing from vec4 to uint manually?
In OpenGL I would create a buffer and a buffer texture. Then I can declare an imageBuffer with the rgba8 format in the shader, access it with imageLoad and imageStore, and the hardware does all the conversions for me. Is this possible in D3D11?
This is a little tricky due to a lot of different gotchas, but you should be able to do something like this.
In your shader that writes to the buffer declare:
RWBuffer<float4> WriteBuf : register( u1 );
Note that it is bound to register u1 instead of u0. Unordered access views (UAV) must start at slot 1 because the u# register is also used for render targets.
To write to the buffer just do something like:
WriteBuf[0] = float4(0.5, 0.5, 0, 1);
Note that you must write all 4 values at once.
In your C++ code, you must create an unordered access buffer, and bind it to a UAV. You can use the DXGI_FORMAT_R8G8B8A8_UNORM format. When you write 4 floats to it, the values will automatically be converted and packed. The UAV can be bound to the pipeline using OMSetRenderTargetsAndUnorderedAccessViews.
In your shader that reads from the buffer declare a read only buffer:
Buffer<float4> ReadBuf : register( t0 );
Note that this buffer uses t0 because it will be bound as a shader resource view (SRV) instead of UAV.
To access the buffer use something like:
float4 val = ReadBuf[0];
In your C++ code, you can bind the same buffer you created earlier to an SRV instead of a UAV. The SRV can be bound to the pipeline using PSSetShaderResources and can also be created with DXGI_FORMAT_R8G8B8A8_UNORM.
You can't bind the same buffer to the pipeline as both an SRV and a UAV at the same time. So you must bind the UAV first and run your first shader pass; then unbind the UAV, bind the SRV, and run the second shader pass.
There are probably other ways to do this as well. Note that all of this requires shader model 5.
