In Metal, can you reuse buffer argument table indexes during a pass?

I see example code in which different buffers are put at the same index during a single render pass. Like this:
renderEncoder.setVertexBuffer(firstBuffer, offset: 0, index: 0)
renderEncoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: vcount1)
renderEncoder.setVertexBuffer(secondBuffer, offset: 0, index: 0)
renderEncoder.drawPrimitives(type: .point, vertexStart: 0, vertexCount: vcount2)
The index parameter is an index into the "buffer argument table", which has 31 entries, so the legal values are 0 to 30.
But I also see documentation that says you can't change the contents of a buffer until after the GPU completes its work on the given render pass.
So, is the above code legal and not prone to any timing issues?
If so, I guess that means the limit of 31 is a limit on how many buffers you can use in a single draw call, not how many buffers you can use in a single pass, aka MTLCommandBuffer. Correct?

You can't change the contents of the buffers themselves, meaning the MTLBuffer objects. What you can change is which buffers are bound. When you call setVertexBuffer, the command encoder remembers which buffer you bound at that index until you bind nil or another buffer. Every time you issue a draw command (like drawPrimitives) or a dispatch command (like dispatchThreadgroups), the current bindings are "saved", and you can go ahead and bind new buffers (and also textures).
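So the code in the question is legal. A minimal sketch of the distinction, with placeholder buffer and count names:
encoder.setVertexBuffer(triangleBuffer, offset: 0, index: 0)
encoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: triangleCount)

// Safe: rebinding index 0 doesn't disturb the draw encoded above.
encoder.setVertexBuffer(pointBuffer, offset: 0, index: 0)
encoder.drawPrimitives(type: .point, vertexStart: 0, vertexCount: pointCount)

// NOT safe: rewriting triangleBuffer.contents() here, while the GPU
// may still be reading it for this pass.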

Weird case with MemoryLayout using a struct protocol, different size reported

I'm working on a drawing engine using Metal. I am reworking it from a previous version, so I'm starting from scratch.
I was getting the error: Execution of the command buffer was aborted due to an error during execution. Caused GPU Hang Error (IOAF code 3)
After some debugging I placed the blame on my drawPrimitives routine, and I found the case quite interesting.
I will have a variety of brushes, all of which will work with specific vertex info.
So I said, why not? Have all the brushes respond to a protocol.
The protocol for the vertices will be this:
protocol MetalVertice {}
And the Vertex info used by this specific brush will be:
struct PointVertex: MetalVertice {
    var pointId: UInt32
    let relativePosition: UInt32
}
The brush can be called either with vertices previously created or with a function that creates those vertices. Either way, the real drawing happens in the vertex function:
var vertices: [PointVertex] = [PointVertex].init(repeating: PointVertex(pointId: 0,
                                                                        relativePosition: 0),
                                                 count: totalVertices)
for (verticeIdx, pointIndex) in pointsIndices.enumerated() {
    vertices[verticeIdx].pointId = UInt32(pointIndex)
}
for vertice in vertices {
    print("size: \(MemoryLayout.size(ofValue: vertice))")
}
self.renderVertices(vertices: vertices,
                    forStroke: stroke,
                    inDrawing: drawing,
                    commandEncoder: commandEncoder)
return vertices
}
func renderVertices(vertices: [MetalVertice], forStroke stroke: LFStroke,
                    inDrawing drawing: LFDrawing, commandEncoder: MTLRenderCommandEncoder) {
    if vertices.count > 1 {
        print("vertices to write: \(vertices.count)")
        print("stride: \(MemoryLayout<PointVertex>.stride)")
        print("size of array \(MemoryLayout.size(ofValue: vertices))")
        for vertice in vertices {
            print("ispointvertex: \(vertice is PointVertex)")
            print("size: \(MemoryLayout.size(ofValue: vertice))")
        }
    }
    let vertexBuffer = LFDrawing.device.makeBuffer(bytes: vertices,
                                                   length: MemoryLayout<PointVertex>.stride * vertices.count,
                                                   options: [])
This was the issue: running this specific code produces these results in the console:
size: 8
size: 8
vertices to write: 2
stride: 8
size of array 8
ispointvertex: true
size: 40
ispointvertex: true
size: 40
In the previous function the size of the vertices is 8 bytes, but for some reason, when they enter the next function, they turn into 40 bytes, so the buffer is constructed incorrectly.
If I change the function signature to:
func renderVertices(vertices: [PointVertex], forStroke stroke: LFStroke, inDrawing drawing:LFDrawing, commandEncoder: MTLRenderCommandEncoder) {
The vertices are correctly reported as 8 bytes long, and the draw routine works as intended.
Am I missing something? Is the MetalVertice protocol introducing some noise?
In order to fulfill the requirement that value types conforming to protocols be able to perform dynamic dispatch (and also in part to ensure that containers of protocol types are able to assume that all of their elements are of uniform size), Swift uses what are called existential containers to hold the data of protocol-conforming value types alongside metadata that points to the concrete implementations of each protocol. If you've heard the term protocol witness table, that's what's getting in your way here.
The particulars of this are beyond the scope of this answer, but you can check out this video and this post for more info.
The moral of the story is: don't assume that Swift will lay out your structs as-written. Swift can reorder struct members and add padding or arbitrary metadata, and it gives you practically no control over this. Instead, declare the structs you need to use in your Metal code in a C or Objective-C file and import them via a bridging header. If you want to use protocols to make it easier to address your structs polymorphically, you need to be prepared to copy them member-wise into your regular old C structs and prepared to pay the memory cost that convenience entails.
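As a minimal sketch of the sizes involved (assuming a 64-bit platform; the flatten helper below is hypothetical, not part of the question's code):
protocol MetalVertice {}

struct PointVertex: MetalVertice {
    var pointId: UInt32
    let relativePosition: UInt32
}

// The concrete struct is just two 4-byte fields.
print(MemoryLayout<PointVertex>.stride)   // 8

// The existential box adds a 3-word inline buffer plus type metadata
// and a witness-table pointer: 40 bytes on a 64-bit platform.
print(MemoryLayout<MetalVertice>.size)    // 40

// Copying back to the concrete type before building the MTLBuffer
// restores the 8-byte layout:
func flatten(_ vertices: [MetalVertice]) -> [PointVertex] {
    return vertices.compactMap { $0 as? PointVertex }
}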

Repeating a Function x times in the same Line in Google Sheets

I have been researching a solution to a problem that I just can not seem to avoid, and have yet to find a solution.
In brief, I am trying to calculate unique probabilities that lead to a "1 or 0" for more than one variable, but all in one cell.
Here is my working code line that represents the probability of just one variable:
=sum(if(randbetween(1,100) > subtotal(1,L23), 0, 1))
What I am trying to figure out is how to repeat this function x times, with it yielding a different RANDBETWEEN number each time, all in one cell.
As my x variable can represent 10 different independent variables at this time, and stem over 30 specific formula lengths for each IV, utilizing the preset workaround would lead me to create hundreds of cells of data. I obviously do not want that clutter.
If code worked the way I wanted it to, the best formula-esque way I would describe what I wanted to happen is this:
=sum(repeatuniqueformula(sum(if(randbetween(1,100) > subtotal(1,L23), 0, 1)), x))
Simplified, the relevant question gathered from a problem-by-problem analysis:
How to repeat a function within a single formula line such that RANDBETWEEN recalculates each time.
Sub-information: if you simply multiply the function by, let's say, 6, it will multiply the answer of the RANDBETWEEN function without recalculating:
=sum(if(randbetween(1,100) > subtotal(1,L23), 0, 1)*6)
Alternatively, I could do a workaround and create other cells with individual randbetween functions, but that causes a lot of manual work due to having to adjust the number of times a function in a line is repeated.
=sum(if(Q2 > subtotal(1,L15), 0, 1),if(Q3 > subtotal(1,L15), 0, 1),if(Q4 > subtotal(1,L15), 0, 1),if(Q5 > subtotal(1,L15), 0, 1),if(Q6 > subtotal(1,L15), 0, 1),if(Q7 > subtotal(1,L15), 0, 1),if(Q8 > subtotal(1,L15), 0, 1))
The alternative is both cluttered and takes a lot of effort to maintain, as changing the value of x changes how many copies of
if(Q2 > subtotal(1,L15), 0, 1)
I would need.
In order to get what you want to happen (=sum(repeatuniqueformula(sum(if(randbetween(1,100) > subtotal(1,L23), 0, 1)), x))), you will have to create a custom function using Google Apps Script, but x should be replaced by a number or by a reference to a cell holding a value or a formula that returns one.
References
https://developers.google.com/apps-script/guides/sheets
https://developers.google.com/apps-script/guides/sheets/functions
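As an illustration, here is a minimal Apps Script sketch of such a custom function. Custom functions can't call sheet functions like RANDBETWEEN directly, so Math.random() stands in for it, and the SUBTOTAL(1, L23) result is passed in as an ordinary cell value; the name REPEATUNIQUEFORMULA is hypothetical:
/**
 * Runs the 1-or-0 trial x times, drawing a fresh random number from
 * 1 to 100 on every iteration, and returns the sum of the successes.
 * Example use in a cell: =REPEATUNIQUEFORMULA(L23, 10)
 * @customfunction
 */
function REPEATUNIQUEFORMULA(threshold, x) {
  var total = 0;
  for (var i = 0; i < x; i++) {
    var r = Math.floor(Math.random() * 100) + 1; // stands in for RANDBETWEEN(1,100)
    total += (r > threshold) ? 0 : 1;
  }
  return total;
}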

When to release a Vertex Array Object?

What are the guidelines for releasing a Vertex Array Object, e.g. binding to null?
Anecdotally, it seems that I can have similar shaders and only need to release after some grouping... or is the best practice to do it after every grouped shader?
I guess there's another possibility of doing it after every draw call, even if they are batched by shader, but I don't think that's necessary...
It's not clear what you're asking. "When to release a texture?" When you're done using it? I think you mean "unbind", not "release". "Release" in most programming contexts means to delete from memory, or at least to allow deletion from memory.
Assuming you mean when to unbind a Vertex Array Object (VAO) the truth is you never have to unbind a VAO.
As explained elsewhere, VAOs contain all attribute state AND the ELEMENT_ARRAY_BUFFER binding, so conceptually:
currentVAO = {
  elementArrayBuffer: someIndexBuffer,
  attributes: [
    { enabled: true,  size: 3, type: gl.FLOAT, stride: 0, offset: 0, buffer: someBuffer, },
    { enabled: true,  size: 3, type: gl.FLOAT, stride: 0, offset: 0, buffer: someBuffer, },
    { enabled: true,  size: 3, type: gl.FLOAT, stride: 0, offset: 0, buffer: someBuffer, },
    { enabled: false, size: 3, type: gl.FLOAT, stride: 0, offset: 0, buffer: null, },
    { enabled: false, size: 3, type: gl.FLOAT, stride: 0, offset: 0, buffer: null, },
    { enabled: false, size: 3, type: gl.FLOAT, stride: 0, offset: 0, buffer: null, },
    ...
    ... up to MAX_VERTEX_ATTRIBS ...
  ],
};
As long as you remember that gl.bindBuffer(gl.ELEMENT_ARRAY_BUFFER, someBuffer) affects state inside the current VAO, not global state like gl.bindBuffer(gl.ARRAY_BUFFER), you'll be fine.
I think that's the most confusing part. Most WebGL methods make it clear what's being affected: gl.bufferXXX affects buffers, gl.texXXX affects textures, gl.renderbufferXXX affects renderbuffers, gl.framebufferXXX affects framebuffers, gl.vertexXXX affects vertex attributes (the VAO), etc. But gl.bindBuffer is different, at least in this case: it affects global state when binding to ARRAY_BUFFER, but it affects VAO state when binding to ELEMENT_ARRAY_BUFFER.
My suggestion would be to follow these steps, in this order, during initialization (a concrete sketch follows the render-time list):
for each object
1. create VAO
2. create vertex buffers and fill with data
3. setup all attributes
4. create index buffers (ELEMENT_ARRAY_BUFFER) and fill with data
At render time
for each object
1. use program (if program is different)
2. bind VAO for object (if different)
3. set uniforms and bind textures
4. draw
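For concreteness, a minimal WebGL2 sketch of those two sequences. The names program, positionLoc, positions, indices, matrixLoc, and matrix are placeholders, not from the question:
// --- initialization (per object) ---
const vao = gl.createVertexArray();
gl.bindVertexArray(vao);                                    // 1. create and bind the VAO
const positionBuffer = gl.createBuffer();                   // 2. vertex buffer, filled with data
gl.bindBuffer(gl.ARRAY_BUFFER, positionBuffer);
gl.bufferData(gl.ARRAY_BUFFER, positions, gl.STATIC_DRAW);
gl.enableVertexAttribArray(positionLoc);                    // 3. attributes (recorded in the VAO)
gl.vertexAttribPointer(positionLoc, 3, gl.FLOAT, false, 0, 0);
const indexBuffer = gl.createBuffer();                      // 4. index buffer (also recorded in the VAO)
gl.bindBuffer(gl.ELEMENT_ARRAY_BUFFER, indexBuffer);
gl.bufferData(gl.ELEMENT_ARRAY_BUFFER, indices, gl.STATIC_DRAW);

// --- render time (per object) ---
gl.useProgram(program);                                     // 1. program
gl.bindVertexArray(vao);                                    // 2. VAO (restores attributes + indices)
gl.uniformMatrix4fv(matrixLoc, false, matrix);              // 3. uniforms and textures
gl.drawElements(gl.TRIANGLES, indices.length, gl.UNSIGNED_SHORT, 0); // 4. draw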
What's important to remember is that if you ever call gl.bindBuffer(gl.ELEMENT_ARRAY_BUFFER, ...), you're affecting the current VAO.
Why might I want to bind a null VAO? Mostly because I often forget the paragraph above, since before VAOs the ELEMENT_ARRAY_BUFFER binding was global state. So when I forget that and randomly bind some ELEMENT_ARRAY_BUFFER so that I can put some indices in it, I've just changed the ELEMENT_ARRAY_BUFFER binding in the current VAO. Probably not something I wanted to do. By binding null, say after initializing all my objects and after my render loop, I'm less likely to cause that bug.
Also note that if I do want to update the indices of some geometry, meaning I want to call gl.bufferData or gl.bufferSubData, I can make sure I'm affecting the correct buffer in one of 2 ways: one, by binding that buffer to ELEMENT_ARRAY_BUFFER and then calling gl.bufferData; the other, by binding the appropriate VAO.
If that didn't make sense, assume I had 3 VAOs:
// pseudo code
forEach([sphere, cube, torus])
    create vao
    create buffers and fill with data
    create indices (ELEMENT_ARRAY_BUFFER)
    fill out attributes
Now that I have 3 shapes, let's say I want to change the indices in the sphere. I could do this in 2 ways.
One, I could bind the sphere's ELEMENT_ARRAY_BUFFER directly:
gl.bindBuffer(gl.ELEMENT_ARRAY_BUFFER, sphereElementArrayBuffer)
gl.bufferData(gl.ELEMENT_ARRAY_BUFFER...); // update the indices
This has the issue that if some other VAO is bound, I've just changed its ELEMENT_ARRAY_BUFFER binding.
Two, I could just bind the sphere's VAO, since it already has the ELEMENT_ARRAY_BUFFER bound:
gl.bindVertexArray(sphereVAO);
gl.bufferData(gl.ELEMENT_ARRAY_BUFFER, ...); // update the indices
This seems safer IMO.
To reiterate, ELEMENT_ARRAY_BUFFER binding is part of VAO state.

How is data laid out in RAM?

I have a basic architecture-based question. How are multi-dimensional arrays laid out in memory? Is it correct that the data is laid out linearly? If so, is it correct that in row-major order the data is stored row by row (first row, then second row, ...), and in column-major order it is stored column by column?
Thanks
The representation of an array depends upon the programming language. Most languages (the C abortion and its progeny being notable exceptions) represent arrays using a descriptor. The descriptor specifies the number of dimensions, the upper and lower bounds of each dimension, and where the data is located.
Usually, all the data for the array is stored contiguously. Even when stored contiguously, the ordering depends upon the language. In some languages [0, 0, 0] is stored next to [1, 0, 0] (column major, e.g., FORTRAN). In others [0, 0, 0] is next to [0, 0, 1], while [0, 0, 0] and [1, 0, 0] are far apart (row major, e.g., Pascal). Some languages, such as Ada, leave the ordering up to the compiler implementation.
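To make the two orderings concrete, here is a small C sketch (C itself is row major; the column-major formula is shown for comparison):
#include <stdio.h>

#define R 3   /* rows */
#define C 4   /* columns */

int main(void) {
    int flat[R * C];

    /* Row-major: element (i, j) lives at linear offset i * C + j,
       so neighbors along a row are adjacent in memory. */
    for (int i = 0; i < R; ++i)
        for (int j = 0; j < C; ++j)
            flat[i * C + j] = 10 * i + j;

    printf("%d %d\n", flat[0 * C + 0], flat[0 * C + 1]); /* (0,0) then (0,1): prints 0 1 */

    /* Column-major (e.g., FORTRAN) would instead use offset j * R + i,
       making neighbors along a column adjacent. */
    return 0;
}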
Each array is stored in sequence, naturally. It makes no sense to spread data all over the place.
Example in C:
int matrix[10][10];
matrix[9][1] = 1234;
printf("%d\n", matrix[9][1]); // prints 1234
printf("%d\n", ((int*)matrix)[9 * 10 + 1]); // prints 1234
Of course there is nothing enforcing you to organize data this way, if you want to make a mess you can do it.
For example, if instead of using an array of arrays you decide to dynamically allocate your matrix:
int **matrix;
matrix = malloc(10 * sizeof(int*));
for (int i = 0; i < 10; ++i)
    matrix[i] = malloc(10 * sizeof(int));
The above example is most likely still stored in sequence, but certainly not in a contiguous manner, because there are 11 different memory blocks allocated and the memory manager is free to allocate them wherever it makes sense to it.

Wrong semaphore in case of OpenCL usage

Solution:
Finally I was able to solve, or at least find a good workaround for, my problem.
This kind of semaphore doesn't work on NVIDIA.
I think this comment is right.
So I decided to use atomic_add(), which is a mandatory part of OpenCL 1.1.
I have a resultBuffer array and a resultBufferSize global variable, and the latter is set to zero.
When I have results (my result is always!! x numbers), I simply call
position = atomic_add(resultBufferSize, x);
and I can be sure that no one else writes into the buffer between position and position + x.
Don't forget that the global variable must be volatile.
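For illustration, a minimal kernel-side sketch of that reservation pattern, with x = 3 and placeholder names (not the code from the question):
__kernel void collectResults(volatile __global uint *resultBufferSize,
                             __global uint *resultBuffer)
{
    uint myResults[3] = {1, 2, 3};                    /* always x (= 3) numbers here */
    /* Reserve 3 slots atomically; no other work-item can claim them. */
    uint position = atomic_add(resultBufferSize, 3);
    for (uint i = 0; i < 3; ++i)
        resultBuffer[position + i] = myResults[i];    /* race-free private slice */
}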
When the threads run into endless loops, the resource is not available, hence the -5 error code when reading the buffer.
Update:
When I read back:
oclErr |= clEnqueueReadBuffer(cqCommandQueue, cm_inputNodesArraySizes, CL_TRUE, 0, lastMapCounter*sizeof(cl_uint), (void*)&inputNodesArraySizes, 0, NULL, NULL);
The value of lastMapCounter changes. It's strange, because in the OpenCL code I do nothing to it, and I take care of the sizes: what I specified at buffer creation, what I copy in, and what I read back are all the same. And a hidden buffer overflow can cause many strange things indeed.
End of update
I wrote the following code, and there is a bug in it. I want a semaphore around changing the resultBufferSize global variable (for now I just want to try how it works), and I expect to get back a big number (since supposedly each worker writes something). But I always get 3, or sometimes errors. There is no logic to how the compiler works.
__kernel void findCircles(__global uint *inputNodesArray,
                          __global uint *inputNodesArraySizes,
                          uint lastMapCounter,
                          __global uint *resultBuffer,
                          __global uint *resultBufferSize,
                          volatile __global uint *sem)
{
    for (; atom_xchg(sem, 1) > 0;)
        ;                                   /* spin until the lock is acquired */
    (*resultBufferSize) = (*resultBufferSize) + 3;
    atom_xchg(sem, 0);
}
I get -48 during the kernel execution, and sometimes it's OK but I get -5 when I want to read back the buffer (the size buffer).
Do you have any idea where I can find the bug?
NVIDIA OpenCL 1.1 is what's being used.
Of course on the host I configure everything well:
uint32 resultBufferSize = 0;
uint32 sem;
cl_mem cmresultBufferSize = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE,
                                           sizeof(uint32), NULL, &ciErrNum);
cl_mem cmsem = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE,
                              sizeof(uint32), NULL, &ciErrNum);
ciErrNum = clSetKernelArg(ckKernel, 4, sizeof(cl_mem), (void*)&cmresultBufferSize);
ciErrNum = clSetKernelArg(ckKernel, 5, sizeof(cl_mem), (void*)&cmsem);
ciErrNum |= clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 1, NULL,
                                   &szGlobalWorkSize, &szLocalWorkSize, 0, NULL, NULL);
ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmresultBufferSize, CL_TRUE, 0,
                               sizeof(uint32), (void*)&resultBufferSize, 0, NULL, NULL);
(In the case of this code the kernel is OK, and the last read returns -5.)
I know you have come to a conclusion on this, but I want to point out two things:
1) The semaphore is non-portable because it isn't SIMD safe, as pointed out in the linked thread.
2) The memory model is not strong enough to give a meaning to the code. The update of the result buffer could move out of the critical section - nothing in the model says otherwise. At the very least you'd need fences, but the language around fences in the 1.x specs is also fairly weak. You'd need an OpenCL 2.0 implementation to be confident that this aspect is safe.
