How to implement/use atomic counter in Metal fragment shader?

I want to implement an A-Buffer algorithm for order-independent-transparency in my Metal application. The description of the technique mentions using an atomic counter. I've never used one of these or even heard of them. I just read about atomic variables in the Metal Shading Language Specification, but I can't figure out how to actually implement or use one.
Does anyone have experience with these in Metal? Can you point me to an example of how to set up and use a simple integer counter? Basically each render pass I need to be able to increment an integer from within the fragment shader, starting from zero. This is used to index into the A-Buffer.
Thanks!

Well, your question lacks sufficient detail for me to provide much more than a general overview. You might consider adding an incomplete shader function, with pseudo-code where you're not sure how to implement something.
Anyway, an atomic counter is a variable of type atomic_uint (or atomic_int if you need sign). To be useful, the variable needs to be shared across a particular address space. Your example sounds like it needs device address space. So, you would want a device variable backed by a buffer. You would declare it as:
fragment FragmentOut my_fragment_func(device atomic_uint &counter [[buffer(0)]], ...)
{
...
}
You could also use a struct type for the parameter and have a field of the struct be your atomic_uint variable.
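For example, a sketch of the struct variant might look like this (the struct and field names are just illustrative):
struct ABufferCounters {
    atomic_uint fragmentCount;
};

fragment FragmentOut my_fragment_func(device ABufferCounters &counters [[buffer(0)]], ...)
{
...
}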
To atomically increment the atomic variable by 1 and obtain the prior value, you could do this:
uint value = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
The initial value of the atomic variable is taken from the contents of the buffer at a point before the draw or dispatch command is executed. It's not documented as such in the spec, but the size and bit-interpretation of an atomic type seems to match the corresponding non-atomic type. That is, you would write a uint (a.k.a. unsigned int or uint32_t) to the buffer to initialize an atomic_uint.
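On the host side, a minimal Swift sketch of backing and resetting that counter might look like the following (the device, renderEncoder, and the buffer index are assumptions based on the shader signature above; shared storage is assumed so the CPU can write the initial value directly):
// Create a 4-byte buffer to back the atomic_uint. Shared storage lets the CPU reset it.
let counterBuffer = device.makeBuffer(length: MemoryLayout<UInt32>.stride,
                                      options: .storageModeShared)!

// Before encoding each render pass, reset the counter to zero.
counterBuffer.contents().storeBytes(of: UInt32(0), as: UInt32.self)

// Bind it at fragment buffer index 0 to match [[buffer(0)]] in the fragment function.
renderEncoder.setFragmentBuffer(counterBuffer, offset: 0, index: 0)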

Related

How to integrate in Swift using vDSP

I'm trying to find a replacement for SciPy's cumtrapz function in Swift. I found something called vDSP_vtrapzD, but I have no idea how to use it. This is what I've done so far:
import Accelerate
var f1: [Double] = [<some data>]
var tdata: [Double] = [<time vector>]
var output = [Double](unsafeUninitializedCapacity:Int(f1.count), initializingWith: {_, _ in})
vDSP_vtrapzD(&f1, 1, &tdata, &output, 1, vDSP_Length(f1.count))
You're close, but you're using Array.init(unsafeUninitializedCapacity:initializingWith:) incorrectly. From its documentation:
Discussion
Inside the closure, set the initializedCount parameter to the number of elements that are initialized by the closure. The memory in the range buffer[0..<initializedCount] must be initialized at the end of the closure’s execution, and the memory in the range buffer[initializedCount...] must be uninitialized. This postcondition must hold even if the initializer closure throws an error.
This API is a more unsafe (but more performant) counterpart to Array.init(repeating:count:), which allocates an array of a fixed size and spends the time to initialize all its contents. This has two potential drawbacks:
1. If the purpose of the array is to provide a buffer to write a result into, then initializing it prior to that is redundant and wasteful
2. If the result you put into that buffer ends up being smaller than your array, you need to remember to manually "trim" the excess off by copying it into a new array.
Array.init(unsafeUninitializedCapacity:initializingWith:) improves upon this by:
1. Asking you for the maximum capacity you might possibly need
2. Giving you a temporary buffer with that capacity. Importantly, it's uninitialized. This makes it faster, but also more dangerous (risk of buffer underflow errors) if used incorrectly.
3. You then tell it exactly how much of that temporary buffer you actually used
4. It will automatically copy that much of the buffer into the final array, and return that as the result.
You're using Array.init(unsafeUninitializedCapacity:initializingWith:) as if it were Array.init(repeating:count:). To use it correctly, you would put your initialization logic inside the initializer parameter, like so:
let result = Array<Double>(unsafeUninitializedCapacity: f1.count, initializingWith: { resultBuffer, count in
    assert(f1.count == tdata.count)
    vDSP_vtrapzD(
        &f1,                       // Double-precision real input vector.
        1,                         // Address stride for A.
        &tdata,                    // Pointer to double-precision real input scalar: step size.
        resultBuffer.baseAddress!, // Double-precision real output vector.
        1,                         // Address stride for C.
        vDSP_Length(f1.count)      // The number of elements to process.
    )
    count = f1.count // This tells Swift how many elements of the buffer to copy into the resultant Array
})
FYI, there's a nice Swift version of vDSP_vtrapzD that you can see here: https://developer.apple.com/documentation/accelerate/vdsp/integration_functions. The variant that returns the result uses the unsafeUninitializedCapacity initializer.
On a related note, there's also a nice Swift API to Quadrature: https://developer.apple.com/documentation/accelerate/quadrature-smu
simon
I've been grappling with using DSP in the last week too! Alexander's solution above was helpful. However I'd like to add a couple of things.
vDSP_vtrapzD only allows you to integrate an array of values against a fixed step increment, which is a scalar value. The function doesn't allow you to pass a varying time value for each increment.
The example solution confused me a little as it does a check to make sure the time array is the same size as the data array. This is not necessary as only the first value in the time vector will be used by the vDSP_vtrapzD function.
It's a shame that vDSP_vtrapzD only takes a scalar as a fixed time step. In my experience this doesn't reflect reality when working with time-based data from sensors that don't emit data at precisely the same increments.
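If you do need a varying time step per sample, the cumulative trapezoid is simple enough to write directly in Swift without vDSP; a rough sketch (the function name and signature are mine, not part of Accelerate):
func cumulativeTrapezoid(_ y: [Double], times t: [Double]) -> [Double] {
    precondition(y.count == t.count && !y.isEmpty)
    var result = [Double](repeating: 0, count: y.count)
    for i in 1..<y.count {
        let dt = t[i] - t[i - 1]  // varying step between samples
        result[i] = result[i - 1] + 0.5 * (y[i] + y[i - 1]) * dt
    }
    return result
}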

DirectCompute: How to read from a RWTexture2D<float4>?

I have the following buffer:
RWTexture2D<float4> Output : register(u0);
This buffer is used by a compute shader for rendering a computed image.
To write a pixel in that texture, I just use code similar to this:
Output[XY] = SomeFunctionReturningFloat4(SomeArgument);
This works very well and my computed image is correctly rendered on screen.
Now at some stage in the compute shader, I would like to read back an already computed pixel and process it again.
Output[XY] = SomeOtherFunctionReturningFloat4(Output[XY]);
The compiler returns an error:
error X3676: typed UAV loads are only allowed for single-component 32-bit element types
Any help appreciated.
In compute shaders, data access is restricted for some data types, and the rules are not at all intuitive or straightforward. In your case, you use a
RWTexture2D<float4>
That is a typed UAV with format DXGI_FORMAT_R32G32B32A32_FLOAT.
This format is supported for UAV typed stores, but not for UAV typed loads.
Basically, you can write to it, but you can't read from it. UAV typed loads only support single-component 32-bit formats, in your case DXGI_FORMAT_R32_FLOAT (32 bits and that's all).
Your code should run if you use a RWTexture2D<float> but I suppose this is not enough for you.
Possible workarounds that spring to my minds are:
1. using 4 different RWTexture2D<float>, one for each component
2. using 2 different textures, RWTexture2D<float4> to write your values and Texture2D<float4> to read from
3. Use a RWStructuredBuffer instead of the texture.
I don’t know your code so I don’t know if solutions 1. and 2. could be viable. However, I strongly suggest going for 3. and using StructuredBuffer. A RWStructuredBuffer can hold any type of struct and can easily cover all your needs. To be honest, in compute shaders I almost only use them to pass data. If you need the final output to be a texture, you can do all your calculations on the buffer, then copy the results on the texture when you’re done. I would add that drivers often use CompletePath to access RWTexture2D data, and FastPath to access RWStructuredBuffer data, making the former awfully slower than the latter.
Reference for data type access is here. Scroll down to UAV typed load.
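To illustrate option 3, a rough HLSL sketch might look like the following (the Params constant buffer, the thread-group size, and the entry-point name are assumptions; SomeOtherFunctionReturningFloat4 stands in for your own function):
cbuffer Params : register(b0)
{
    uint Width;   // image width in pixels, supplied by the application
};

// Structured buffers can be both read and written from a compute shader.
RWStructuredBuffer<float4> Output : register(u0);

[numthreads(8, 8, 1)]
void CSMain(uint3 XY : SV_DispatchThreadID)
{
    uint idx = XY.y * Width + XY.x;                               // linearize the 2D coordinate
    Output[idx] = SomeOtherFunctionReturningFloat4(Output[idx]);  // read-modify-write is fine here
}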

What is the correct sequence for uploading a uniform block?

The example page at https://www.lighthouse3d.com/tutorials/glsl-tutorial/uniform-blocks/ has this:
uniformBlockBinding()
bindBuffer()
bufferData()
bindBufferBase()
But conceptually, wouldn't this be more correct?
bindBuffer()
bufferData()
uniformBlockBinding()
bindBufferBase()
The idea being that uploading to a buffer (bindBuffer+bufferData) should be agnostic about what the buffer will be used for - and then, separately, uniformBlockBinding()+bindBufferBase() would be used to update those uniforms, per shader, when the relevant buffer has changed?
Adding answer since the accepted answer has lots of info irrelevant to WebGL2
At init time you call uniformBlockBinding. For the given program, it sets up which uniform buffer bind point that particular program will pull a particular uniform block's data from.
At render time you call bindBufferRange or bindBufferBase to bind a specific buffer to a specific uniform buffer index bind point
If you also need to upload new data to that buffer you can then call bufferData
In pseudo code
// at init time
for each uniform block
  gl.uniformBlockBinding(program, indexOfBlock, indexOfBindPoint)

// at render time
for each uniform block
  gl.bindBufferRange(gl.UNIFORM_BUFFER, indexOfBindPoint, buffer, offset, size)
  if (need to update data in buffer)
    gl.bufferData/gl.bufferSubData(gl.UNIFORM_BUFFER, data, ...)
Note that there is no “correct” sequence. How you update your buffers is really up to you. Since you might store data for multiple uniform blocks in a single buffer at different offsets, calling gl.bufferData/gl.bufferSubData like above is really not “correct” either; it's just one way of hundreds.
WebGL2 (OpenGL ES 3.0) does not support the layout(binding = x) mentioned in the accepted answer. There is also no such thing as glGenBuffers in WebGL2.
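For concreteness, the pseudo code above translates to roughly the following WebGL2 calls (the block name 'Uniforms', bind point 0, byteSize, and uniformData are illustrative):
// at init time
const blockIndex = gl.getUniformBlockIndex(program, 'Uniforms');
gl.uniformBlockBinding(program, blockIndex, 0);               // block -> bind point 0

const ubo = gl.createBuffer();
gl.bindBuffer(gl.UNIFORM_BUFFER, ubo);
gl.bufferData(gl.UNIFORM_BUFFER, byteSize, gl.DYNAMIC_DRAW);  // allocate storage once

// at render time
gl.bindBufferBase(gl.UNIFORM_BUFFER, 0, ubo);                 // bind the buffer to bind point 0
gl.bufferSubData(gl.UNIFORM_BUFFER, 0, uniformData);          // upload new values if they changed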
Neither is "more correct" than the other; they all work. But if you're talking about separation of concerns, the first one better emphasizes correct separation.
glUniformBlockBinding modifies the program; it doesn't affect the nature of the buffer object or context buffer state. Indeed, by all rights, that call shouldn't even be in the same function; it's part of program object setup. In a modern GL tutorial, they would use layout(binding=X) to set the binding, so the function wouldn't even appear. For older code, it should be set to a known, constant value after creating the program and then left alone.
So calling the function between allocating storage for the buffer and binding it to an indexed bind point for use creates the impression that they should be calling glUniformBlockBinding every frame, which is the wrong impression.
And speaking of wrong impressions, glBindBufferBase shouldn't even be called there. The rest of that code is buffer setup code; it should only be done once, at the beginning of the application. glBindBufferBase should be called as part of the rendering process, not the setup process. In a good application, that call shouldn't be anywhere near the glGenBuffers call.
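For reference, that qualifier looks like this in the shader (the block name is illustrative; it requires desktop GL 4.2+ or OpenGL ES 3.1+, so it is not available in WebGL2, as noted above):
layout(std140, binding = 2) uniform Matrices
{
    mat4 viewProjection;
};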

Meaning and Implications of InternalFormat, Format, and Type parameter for WebGL Textures

In WebGL, calls to texSubImage2D and readPixels require Format and Type parameters. In addition, texSubImage2D requires an InternalFormat parameter. While it is easy to find documentation about which combinations of these parameters are valid, it is unclear exactly what these parameters mean and how to go about using them efficiently, particularly given that some internal formats can be paired with multiple types, e.g.
R16F/HALF_FLOAT vs R16F/FLOAT or GL_R11F_G11F_B10F/FLOAT vs GL_R11F_G11F_B10F/GL_UNSIGNED_INT_10F_11F_11F_REV (where the notation I am using is InternalFormat/Type)
Also, both of these API calls can be used with a pixels parameter that can be a TypedArray -- in this case it is unclear which choices of TypedArray are valid for a given InternalFormat/Format/Type combo (and which choice is optimal in terms of avoiding casting).
For instance, is it true that the internal memory used by the GPU per texel is determined solely by the InternalFormat -- either in an implementation-dependent way (e.g. WebGL1 unsized formats) or, for some newly added InternalFormats in WebGL2, in a fully specified way?
Are the Format and Type parameters related primarily to how data is marshalled into and out of ArrayBuffers? For instance, if I use GL_R11F_G11F_B10F/GL_UNSIGNED_INT_10F_11F_11F_REV, does this mean I should be passing texSubImage2D a Uint32Array with each element of the array having its bits carefully twiddled in JavaScript, whereas if I use GL_R11F_G11F_B10F/FLOAT then I should use a Float32Array with three times the number of elements as the prior case, and WebGL will handle the bit twiddling for me? Does WebGL try to check that the TypedArray I have passed is consistent with the Format/Type I have chosen, or does it operate directly on the underlying ArrayBuffer? Could I have used a Float64Array in the last instance? And what to do about HALF_FLOAT?
It looks like the bulk of the question can be answered by referring to section 3.7.6 Texture Objects of the WebGL2 Spec. In particular, the table found in the documentation for texImage2D clarifies which TypedArray is required for each Type:
TypedArray   WebGL Type
----------   ----------
Int8Array    BYTE
Uint8Array   UNSIGNED_BYTE
Int16Array   SHORT
Uint16Array  UNSIGNED_SHORT
Uint16Array  UNSIGNED_SHORT_5_6_5
Uint16Array  UNSIGNED_SHORT_5_5_5_1
Uint16Array  UNSIGNED_SHORT_4_4_4_4
Int32Array   INT
Uint32Array  UNSIGNED_INT
Uint32Array  UNSIGNED_INT_5_9_9_9_REV
Uint32Array  UNSIGNED_INT_2_10_10_10_REV
Uint32Array  UNSIGNED_INT_10F_11F_11F_REV
Uint32Array  UNSIGNED_INT_24_8
Uint16Array  HALF_FLOAT
Float32Array FLOAT
My guess is that:
InternalFormat determines how much GPU memory is used to store the texture.
Format and Type govern how data is marshalled into and out of the texture from JavaScript:
Type determines which kind of TypedArray must be used.
Format plus the pixelStorei parameters (section 6.10) determine how many elements the TypedArray will need and which elements will actually be used (will things be tightly packed, will some rows be padded, etc.).
To do:
Work out details for:
encoding/decoding some of the more obscure Type values to and from JavaScript.
calculating typed array size requirements and stride info given Type, Format, and pixelStorei parameters.
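As a concrete illustration of the Format/Type marshalling guess above, here are the two upload paths for R11F_G11F_B10F (width and height are assumed; error handling omitted):
const tex = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, tex);
gl.texStorage2D(gl.TEXTURE_2D, 1, gl.R11F_G11F_B10F, width, height);

// Path 1: let WebGL do the packing -- three floats per texel.
const floats = new Float32Array(width * height * 3);
gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0, width, height, gl.RGB, gl.FLOAT, floats);

// Path 2: pre-packed 10F/11F/11F values -- one uint32 per texel, bit-twiddled in JS.
const packed = new Uint32Array(width * height);
gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0, width, height, gl.RGB,
                 gl.UNSIGNED_INT_10F_11F_11F_REV, packed);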

Direct3D 10 Hardware Instancing using Structured Buffers

I am trying to implement hardware instancing with Direct3D 10+ using Structured Buffers for the per instance data but I've not used them before.
I understand how to implement instancing when combining the per vertex and per instance data into a single structure in the Vertex Shader - i.e. you bind two vertex buffers to the input assembler and call the DrawIndexedInstanced function.
Can anyone tell me the procedure for binding the input assembler and making the draw call etc. when using Structured Buffers with hardware instancing? I can't seem to find a good example of it anywhere.
It's my understanding that Structured Buffers are bound as ShaderResourceViews, is this correct?
Yup, that's exactly right. Just don't put any per-instance vertex attributes in your vertex buffer or your input layout; instead, create a ShaderResourceView of the buffer and set it on the vertex shader stage. You can then use the SV_InstanceID semantic to query which instance you're on and just fetch the relevant struct from your buffer.
StructuredBuffers are very similar to normal buffers. The only differences are that you specify the D3D11_RESOURCE_MISC_BUFFER_STRUCTURED flag on creation, fill in StructureByteStride, and when you create a ShaderResourceView the Format is DXGI_FORMAT_UNKNOWN (the format is specified implicitly by the struct in your shader).
StructuredBuffer<MyStruct> myInstanceData : register(t0);
is the syntax in HLSL for a StructuredBuffer and you just access it using the [] operator like you would an array.
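For example, a vertex shader fetching per-instance data might look roughly like this (the InstanceData layout and names are assumed; view/projection transforms are omitted):
struct InstanceData
{
    float4x4 world;   // per-instance world transform (assumed layout)
};

StructuredBuffer<InstanceData> myInstanceData : register(t0);

float4 VSMain(float3 pos : POSITION, uint instanceID : SV_InstanceID) : SV_Position
{
    // Fetch this instance's data straight from the structured buffer.
    InstanceData inst = myInstanceData[instanceID];
    return mul(float4(pos, 1.0f), inst.world);
}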
Is there anything else that's unclear?
