For example, if I want to do a grayscale transformation, I need to set up my threadsPerGroup and thread group in the following way.
NSUInteger maxTotalThreadsPerThreadgroup = [self.computePipelineState maxTotalThreadsPerThreadgroup];
MTLSize threadgroupCounts = MTLSizeMake(threadExecutionWidth * 2, threadExecutionWidth * 2, 1);
MTLSize threadsPerThreadGroup = MTLSizeMake([self.texture width] / threadgroupCounts.width + 1,
                                            [self.texture height] / threadgroupCounts.height + 1,
                                            1);
I know the image will be chopped into different blocks and each one will be processed by one threadgroup. But it seems that, in the kernel, we just read the 2D texture and then write out the processed texture.
So the question is: how is the image chopped into different blocks? How do we know that each block of the image gets assigned to a thread to process? Is this done by Metal itself, or do we need to manually assign each block to each threadgroup by using the gid?
Metal doesn't know or care whether your shader is operating on an image. It doesn't "chop" the image or anything like that.
A compute shader is processed over a "grid". The grid is an abstraction. It's an arbitrary way for you to organize the work. Metal doesn't assign any significance to the grid, such as associating a position in the grid with a pixel in an image.
Such an association, if it exists, is implicit in how your shader code behaves. Yes, that is largely based on what the shader does with thread_position_in_grid, thread_position_in_threadgroup, thread_index_in_threadgroup, etc.
So, if you're using a gid variable with the thread_position_in_grid attribute, and you use its coordinates as image coordinates, then that usage is what dictates that each grid position corresponds to an image pixel. Once you do that, then it follows that each thread group corresponds to a block of the image, since a thread group is just a block of grid positions. Again, though, this is not something that Metal is doing, it's something that your shader is doing.
You could do something entirely different and Metal wouldn't care.
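To make this concrete, here is a minimal sketch of what such a grayscale kernel might look like, assuming the usual one-thread-per-pixel convention; the kernel name, texture indices, and luminance weights are illustrative, not taken from the question:
#include <metal_stdlib>
using namespace metal;

// Hypothetical grayscale kernel: each thread handles the pixel whose coordinates
// equal its position in the grid. That mapping exists only because we use gid as
// a texture coordinate below; Metal itself attaches no meaning to it.
kernel void grayscale(texture2d<float, access::read>  inTexture  [[texture(0)]],
                      texture2d<float, access::write> outTexture [[texture(1)]],
                      uint2 gid [[thread_position_in_grid]])
{
    // Guard against threads outside the image, since the grid is usually rounded
    // up to a whole number of threadgroups.
    if (gid.x >= outTexture.get_width() || gid.y >= outTexture.get_height())
        return;

    float4 color = inTexture.read(gid);
    float gray = dot(color.rgb, float3(0.299, 0.587, 0.114));
    outTexture.write(float4(gray, gray, gray, color.a), gid);
}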
Related
What is the fastest way we can count how many transparent pixels exist in CIImage/UIImage?
My first thought, if we're talking about efficiency, is to use a Metal kernel (via CIColorKernel or similar), but I can't understand how to use it to output a "count".
Also other ideas I had in mind:
Use some kind of average color to calculate it: the "redder" the image, the more it's filled with pixels? Maybe some kind of linear calculation depending on the image size (using the CIAreaAverage CIFilter)?
Count pixels one by one and check the RGB values?
Using Metal parallel capabilities, similar to this post: Counting coloured pixels on the GPU - Theory?
Scale down the image and then count? Or do any of the other approaches suggested above on the scaled-down version, and then multiply the result back up according to the scale-down proportions?
What is the fastest way to achieve this count?
To answer your question of how to do it in Metal: you would use a device atomic_int.
Essentially you create an Int MTLBuffer and pass it to your kernel and increment it with atomic_fetch_add_explicit.
Create buffer once:
var bristleCounter = 0
counterBuffer = device.makeBuffer(bytes: &bristleCounter, length: MemoryLayout<Int>.size, options: [.storageModeShared])
Reset the counter to 0 and bind the counter buffer:
var z = 0
counterBuffer.contents().copyMemory(from: &z, byteCount: MemoryLayout<Int>.size)
kernelEncoder.setBuffer(counterBuffer, offset: 0, index: 0)
Kernel:
kernel void myKernel (device atomic_int *counter [[buffer(0)]]) {}
Increment counter in Kernel (and get the value):
int newCounterValue = atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
Get the counter on the CPU side:
kernelEncoder.endEncoding()
kernelBuffer.commit()
kernelBuffer.waitUntilCompleted()
//Counter from kernel now in counterBuffer
let bufPointer = counterBuffer.contents().load(as: Int.self)
print("Counter: \(bufPointer)")
What you want to perform is a reduction operation, which is not necessarily well-suited for the GPU due to its massively parallel nature. I'd recommend not writing a reduction operation for the GPU yourself, but rather use some highly optimized built-in APIs that Apple provides (like CIAreaAverage or the corresponding Metal Performance Shaders).
The most efficient way depends a bit on your use case, specifically where the image comes from (loaded via UIImage/CGImage or the result of a Core Image pipeline?) and where you'd need the resulting count (on the CPU/Swift side or as an input for another Core Image filter?).
It also depends on whether the pixels can be semi-transparent (alpha neither 0.0 nor 1.0).
If the image is on the GPU and/or the count should be used on the GPU, I'd recommend using CIAreaAverage. The alpha value of the result reflects the percentage of transparent pixels: the average alpha is the fraction of opaque pixels, so one minus that value is the fraction of transparent ones. Note that this only works if there are no semi-transparent pixels.
The next best solution is probably just iterating the pixel data on the CPU. It might be a few million pixels, but the operation itself is very fast so this should take almost no time. You could even use multi-threading by splitting the image up in chunks and use concurrentPerform(...) of DispatchQueue.
A last, but probably overkill solution would be to use Accelerate (this would make #FlexMonkey happy): Load the image's pixel data into a vDSP buffer and use the sum or average methods to calculate the percentage using the CPU's vector units.
Clarification
When I was saying that a reduction operation is "not necessarily well-suited for the GPU", I meant to say that it's rather complicated to implement in an efficient way and by far not as straightforward as a sequential algorithm.
The check whether a pixel is transparent or not can be done in parallel, sure, but the results need to be gathered into a single value, which requires multiple GPU cores reading and writing values into the same memory. This usually requires some synchronization (and thereby hinders parallel execution) and incurs latency cost due to access to the shared or global memory space. That's why efficient gather algorithms for the GPU usually follow a multi-step tree-based approach. I can highly recommend reading NVIDIA's publications on the topic (e.g. here and here). That's also why I recommended using built-in APIs when possible since Apple's Metal team knows how to best optimize these algorithms for their hardware.
There is also an example reduction implementation in Apple's Metal Shading Language Specification (pp. 158) that uses simd_shuffle intrinsics for efficiently communicating intermediate values down the tree. The general principle is the same as described by NVIDIA's publications linked above, though.
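To give a rough flavor of that approach (this is not the spec's example, it assumes a GPU that supports SIMD-group permute functions, and the names are made up), a kernel might first sum within each SIMD group via simd_shuffle_down and only then touch device memory:
#include <metal_stdlib>
using namespace metal;

// Hypothetical tree-style reduction: each SIMD group sums its per-thread values
// with shuffles, and only one lane per SIMD group performs an atomic add.
kernel void reduceTransparent(texture2d<float, access::read> inTexture [[texture(0)]],
                              device atomic_int *counter [[buffer(0)]],
                              uint2 gid [[thread_position_in_grid]],
                              uint lane [[thread_index_in_simdgroup]],
                              uint simdSize [[threads_per_simdgroup]])
{
    // No early return here, so every lane participates in the shuffles.
    int isTransparent = 0;
    if (gid.x < inTexture.get_width() && gid.y < inTexture.get_height()) {
        isTransparent = (inTexture.read(gid).a == 0.0) ? 1 : 0;
    }

    // After this loop, lane 0 holds the sum for the whole SIMD group.
    for (uint offset = simdSize / 2; offset > 0; offset /= 2) {
        isTransparent += simd_shuffle_down(isTransparent, offset);
    }

    if (lane == 0) {
        atomic_fetch_add_explicit(counter, isTransparent, memory_order_relaxed);
    }
}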
If the image contains semitransparent pixels, it can be easily preprocessed to make all pixels with alpha below certain threshold fully transparent, or fully opaque otherwise. Then the CIAreaAverage could be applied, as was originally suggested in the question, and finally the approximate number of the fully opaque pixels can be calculated by multiplying alpha component of the result by the image size.
For pre-processing we could use a trivial CIColorKernel like this:
half4 clampAlpha(coreimage::sample_t color) {
    half4 out = half4(color);
    out.a = step(half(0.99), out.a);
    return out;
}
(Choose whatever threshold you like instead of 0.99)
To get the alpha component out of the output of CIAreaAverage we could do something like this:
let context = CIContext(options: [.workingColorSpace: NSNull(), .outputColorSpace: NSNull()])
var color: [Float] = [0, 0, 0, 0]
context.render(output,
               toBitmap: &color,
               rowBytes: MemoryLayout<Float>.size * 4,
               bounds: CGRect(origin: .zero, size: CGSize(width: 1, height: 1)),
               format: .RGBAf,
               colorSpace: nil)
// color[3] contains alpha component of the result
With that approach everything is done on GPU while taking advantage of its inherent parallelism.
BTW, check this app out https://apps.apple.com/us/app/filter-magic/id1594986951. It lets you play with every single CoreImage filter out there.
I'm using a Metal shader to draw many particles onto the screen. Each particle has its own position (which can change), and often two particles have the same position. How can I check whether the texture2d I write into already has a pixel at a certain position? (I want to make sure that I only draw a particle at a position if no particle has been drawn there yet, because I get ugly flickering when many particles are drawn at the same position.)
I've tried outTexture.read(particlePosition), but this obviously doesn't work, because of the texture access qualifier, which is access::write.
Is there a way I can have read and write access to a texture2d at the same time? (If there isn't, how could I still solve my problem?)
There are several approaches that could work here. In concurrent systems programming, what you're talking about is termed first-write wins.
1) If the particles only need to preclude other particles from being drawn (and aren't potentially obscured by other elements in the scene in the same render pass), you can write a special value to the depth buffer to signify that a fragment has already been written to a particular coordinate. For example, you'd enable depth testing with a compare function of Less and depth writes turned on, clear the depth buffer to some distant value (like 1.0), and then write a value of 0.0 to the depth buffer in the fragment function (a sketch of such a fragment function follows below). The first fragment at a pixel passes the test (0.0 < 1.0) and stores 0.0; any subsequent fragment at that pixel fails the depth test and is not drawn.
2) Use framebuffer read-back. On iOS, Metal allows you to read from the currently-bound primary renderbuffer by attributing a parameter to your fragment function with [[color(0)]]. This parameter will contain the current color value in the renderbuffer, which you can test against to determine whether it has been written to. This does require you to clear the texture to a predetermined color that will never otherwise be produced by your fragment function, so it is more limited than the above approach, and possibly less performant.
All of the above applies whether you're rendering to a drawable's texture for direct presentation to the screen, or to some offscreen texture.
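As a sketch of what the shader side of approach 1 could look like (the struct and function names are made up, and the depth state, compare function, and clear value are configured on the CPU as described above):
#include <metal_stdlib>
using namespace metal;

struct ParticleVaryings {
    float4 position [[position]];
    float4 color;
};

struct ParticleFragmentOut {
    float4 color [[color(0)]];
    float  depth [[depth(any)]];   // explicit depth output from the fragment function
};

// Hypothetical fragment function: every particle fragment claims its pixel by
// writing depth 0.0. With the compare function Less and the buffer cleared to 1.0,
// only the first fragment at each pixel passes the depth test and gets drawn.
fragment ParticleFragmentOut particleFragment(ParticleVaryings in [[stage_in]])
{
    ParticleFragmentOut out;
    out.color = in.color;
    out.depth = 0.0;
    return out;
}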
To answer the read-and-write part: you can specify read/write access for the output texture like this:
texture2d<float, access::read_write> outTexture [[texture(1)]],
Also, your texture descriptor must specify the usage:
textureDescriptor?.usage = [.shaderRead, .shaderWrite]
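For completeness, a kernel using such a read_write texture could check the current contents before overwriting, roughly like this; the buffer indices, the transparent-black clear color, and the one-thread-per-particle layout are assumptions, and note that read/write texture access is only available for certain pixel formats and GPU tiers:
#include <metal_stdlib>
using namespace metal;

// Hypothetical particle kernel: one thread per particle, skipping the write if the
// destination pixel already holds something other than the clear color.
kernel void drawParticles(texture2d<float, access::read_write> outTexture [[texture(1)]],
                          constant float2 *positions     [[buffer(0)]],
                          constant float4 &particleColor [[buffer(1)]],
                          constant uint   &particleCount [[buffer(2)]],
                          uint id [[thread_position_in_grid]])
{
    if (id >= particleCount)
        return;

    uint2 pixel = uint2(positions[id]);
    float4 existing = outTexture.read(pixel);

    // Note: this read-then-write is not atomic, so two threads can still race on
    // the same pixel; it reduces, but does not eliminate, double writes.
    if (all(existing == float4(0.0))) {
        outTexture.write(particleColor, pixel);
    }
}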
In iOS or OS X, what texture coordinates are used in a Metal Shading Language kernel function? For example, given an MTLTexture and uint2 gid [[thread_position_in_grid]], are gid.x and gid.y between 0..1 (x and y are floats) or 0..inTexture.get_width() (x and y are integers)?
Thanks in Advance
thread_position_in_grid is an index (an integer) in the grid that takes values in the ranges you specify in dispatchThreadgroups:threadsPerThreadgroup:. It's up to you to decide how many thread groups you want, and how many threads per group.
If you configure the dispatch so that threadsPerGroup.width * numThreadgroups.width == inputImage.width and threadsPerGroup.height * numThreadgroups.height == inputImage.height, then a position in the grid is a non-normalized (integer) pixel coordinate.
Each launch of a compute shader in Metal is accompanied by a dense rectangular 3D grid of thread IDs. The dimensions of the grid are set when you call [MTLComputeCommandEncoder dispatchThreadgroups:threadsPerThreadgroup:]. You can, for example, have a threadgroup size of {16,16,1} (256 threads in a threadgroup as a 16x16x1 square) and a threadgroup count of {1,2,1}, which will cause two threadgroups to be launched with a total of 512 threads in the shape {16,32,1}. These are the integers that appear at the top of your kernel as [[thread_position_in_grid]]. The thread position is the way that you tell which thread you are, just like the threadID parameter passed to a block by dispatch_apply().
Metal specifies no mapping from [[thread_position_in_grid]] to coordinates in a texture. This is done by you, in software, in your compute shader. If you want to read every other pixel in a region of a texture at some offset in the image, then you need to multiply the thread ID by two and add an offset in your kernel before passing the new coordinate to texture2d.sample. Since Metal cannot launch partial threadgroups, it is up to you to make sure that unneeded threads do no harm. For example, when applied to a smaller texture, the full size of your 32x64 launch might cause you to write off the end of your texture. In this case you must check the thread ID to see whether the thread would write off the end, and either return out of the shader or skip the texture write for that thread.
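As a concrete, hypothetical illustration of that kind of mapping and guard (using read instead of sample for simplicity; the names and the factor of two are arbitrary):
#include <metal_stdlib>
using namespace metal;

// Hypothetical kernel showing a custom threadID-to-texel mapping: read every other
// pixel of a region starting at `origin`, writing the results densely into `dst`.
kernel void sampleEveryOtherPixel(texture2d<float, access::read>  src [[texture(0)]],
                                  texture2d<float, access::write> dst [[texture(1)]],
                                  constant uint2 &origin [[buffer(0)]],
                                  uint2 tid [[thread_position_in_grid]])
{
    // The mapping from thread ID to texture coordinate is entirely up to us.
    uint2 coord = origin + tid * 2;

    // Guard against threads beyond either texture, since threadgroups are launched
    // whole and may overshoot the image.
    if (coord.x >= src.get_width() || coord.y >= src.get_height() ||
        tid.x >= dst.get_width() || tid.y >= dst.get_height())
        return;

    dst.write(src.read(coord), tid);
}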
thread_position_in_grid is always made up of unsigned integers, and it can be 16- or 32-bit and 1D, 2D, or 3D, but none of these options are related to texture coordinates. It may be helpful to ask another, related question, because you seem to be conflating the idea of textures and kernel functions.
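For example, all of these parameter declarations are valid, and if you do want normalized 0..1 coordinates for sampling you have to derive them yourself; the kernel below is a hypothetical sketch of doing exactly that:
#include <metal_stdlib>
using namespace metal;

// Valid ways to receive the thread position include:
//   ushort  gid [[thread_position_in_grid]]   // 16-bit, 1D
//   uint2   gid [[thread_position_in_grid]]   // 32-bit, 2D
//   ushort3 gid [[thread_position_in_grid]]   // 16-bit, 3D
//
// Hypothetical kernel deriving normalized coordinates from the integer grid
// position in order to sample a texture.
kernel void sampleNormalized(texture2d<float, access::sample> src [[texture(0)]],
                             texture2d<float, access::write>  dst [[texture(1)]],
                             uint2 gid [[thread_position_in_grid]])
{
    if (gid.x >= dst.get_width() || gid.y >= dst.get_height())
        return;

    constexpr sampler s(coord::normalized, filter::linear, address::clamp_to_edge);

    // Convert the integer grid position into 0..1 texture coordinates.
    float2 uv = (float2(gid) + 0.5) / float2(dst.get_width(), dst.get_height());
    dst.write(src.sample(s, uv), gid);
}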
I have two sets of vertexes used as a line strip:
Vertexes1
Vertexes2
It's important to know that these vertexes have previously unknown values, as they are dynamic.
I want to make an animated transition (morph) between these two. I have come up with two different ways of doing this:
Option 1:
Set a Time uniform in the vertex shader that goes from 0 to 1, where I can do something like this:
// Inside main() in the vertex shader
float originX = Position.x;
float destinationX = DestinationVertexPosition.x;
float interpolatedX = originX + (destinationX - originX) * Time;
gl_Position.x = interpolatedX;
As you probably see, this has one problem: How do I get the "DestinationVertexPosition" in there?
Option 2:
Make the interpolation calculation outside the vertex shader, where I loop through each vertex and create a third vertex set for the interpolated values, and use that to render:
// Pre render
// Use this vertex set to render
InterpolatedVertexes
for (unsigned int i = 0; i < vertexCount; i++) {
    float originX = Vertexes1[i].x;
    float destinationX = Vertexes2[i].x;
    float interpolatedX = originX + (destinationX - originX) * Time;
    InterpolatedVertexes[i].x = interpolatedX;
}
I have highly simplified these two code snippets, just to make the idea clear.
Now, from the two options, I feel like the first one is definitely better in terms of performance, since everything happens at the shader level and I don't have to create a new set of vertexes each time the Time is updated.
So, now that the introduction to the problem has been covered, I would appreciate any of the following three things:
A discussion of better ways of achieving the desired results in OpenGL ES 2 (iOS).
A discussion about how Option 1 could be implemented properly, either by providing the "DestinationVertexPosition" or by modifying the idea somehow, to better achieve the same result.
A discussion about how Option 2 could be implemented.
In ES 2 you specify such attributes as you like, so there's no problem with specifying attributes for both the origin and destination positions and doing the linear interpolation between them in the vertex shader. However, you really shouldn't do it component by component as your code suggests: GPUs are vector processors, and the mix GLSL function will do the linear blend you want. So e.g. (with obvious inefficiencies and assumptions)
int sourceAttribute = glGetAttribLocation(shader, "sourceVertex");
glVertexAttribPointer(sourceAttribute, 3, GL_FLOAT, GL_FALSE, 0, sourceLocations);
int destAttribute = glGetAttribLocation(shader, "destVertex");
glVertexAttribPointer(destAttribute, 3, GL_FLOAT, GL_FALSE, 0, destLocations);
And:
gl_Position = vec4(mix(sourceVertex, destVertex, Time), 1.0);
Your two options here have a trade off: supply twice the geometry once and interpolate between that, or supply only one set of geometry, but do so for each frame. You have to weigh geometry size vs. upload bandwidth.
Given my experience with iOS devices, I'd highly recommend option 1. Uploading new geometry on every frame can be extremely expensive on these devices.
If the vertices are constant, you can upload them once into one or two vertex buffer objects (VBOs) with the GL_STATIC_DRAW flag set. The PowerVR SGX series has hardware optimizations for dealing with static VBOs, so they are very fast to work with after the initial upload.
As far as how to upload two sets of vertices for use in a single shader, geometry is just another input attribute for your shader. You could have one, two, or more sets of vertices fed into a single vertex shader. You just define the attributes using code like
attribute vec3 startingPosition;
attribute vec3 endingPosition;
and interpolate between them using code like
vec3 finalPosition = startingPosition * (1.0 - fractionalProgress) + endingPosition * fractionalProgress;
Edit: Tommy points out the mix() operation, which I'd forgotten about and is a better way to do the above vertex interpolation.
In order to inform your shader program as to where to get the second set of vertices, you'd use pretty much the same glVertexAttribPointer() call for the second set of geometry as the first, only pointing to that VBO and attribute.
Note that you can perform this calculation as a vector, rather than breaking out all three components individually. This doesn't get you much with a highp default precision on current PowerVR SGX chips, but could be faster on future ones than doing this one component at a time.
You might also want to look into other techniques used for vertex skinning, because there might be other ways of animating vertices that don't require two full sets of vertices to be uploaded.
The one case that I've heard where option 2 (uploading new geometry on each frame) might be preferable is in specific cases where using the Accelerate framework to do vector manipulation of the geometry ends up being faster than doing the skinning on-GPU. I remember the Unity folks were talking about this once, but I can't remember if it was for really small or really large sets of geometry. Option 1 has been faster in all the cases I've worked with myself.
In Direct3D 9, I'm trying to modify a surface thus:
Given a rectangle, for each of the pixels in the given surface within the rectangle's bounds, each of the channels (R, G, B, A) would be multiplied by a certain (float) value to either dim or brighten it.
How would I go about doing this? Preferably I want to avoid using LockRect (especially as it seems to not work with the default pool).
If you want to update a Surface's pixels directly, you can use Device.UpdateTexture. This copies a Texture created in Pool.SystemMemory to a Texture created in Pool.Default.
But this doesn't sound like what you want to be doing. Use an Effect to hardware-accelerate this instead. If you would like to know how, I can show you.