Can I process an image row- (or column-)wise using Metal? - ios

I'm trying to create an implementation of seam carving which will run on the GPU using Metal. The dynamic programming part of the algorithm requires the image to be processed row-by-row (or column-by-column), so I know it's not perfectly suited to the GPU, but I figure that with images potentially thousands of pixels wide/tall, it will still benefit from parallelisation.
My two ideas were either
write a shader that uses 2D textures, but ensure that Metal computes over the image in the correct order, finishing one row before starting the next
write a shader using 1D textures, then manually pass in each row of the image to compute; ideally creating a view into the 2D textures rather than having to copy the data into separate 1D textures
I am still new to Metal and Swift, so my naïve attempts at both of these did not work. For option 1, I tried to dispatch threads using a threadgroup of Nx1x1, but the resulting texture just comes back all zeros. Besides, I am not convinced this is even right in theory: even if I tell it to use a threadgroup of height one, I'm not sure I can guarantee it will start on the first row. For option 2, I simply couldn't find a nice way to create a 1D view into a 2D texture by row/column. The documentation seems to suggest that Metal does not like giving you access to the underlying data, but I wouldn't be surprised if there were a way to do this.
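One pattern I'm now considering for option 1 is to enforce the row order from the CPU by encoding one dispatch per row; my understanding is that successive dispatches on a default (serial) compute encoder execute in order, though I'm not certain of that. A minimal sketch, with a made-up kernel name (carveRow) and made-up variable names:

```swift
import Metal

// Sketch: enforce row-by-row order by encoding one dispatch per row.
// Dispatches on a serial compute encoder run in order, so row r should be
// complete before row r+1 starts. "carveRow" is a placeholder kernel name;
// the kernel is expected to bounds-check gid.x against the image width.
func encodeRowPasses(queue: MTLCommandQueue,
                     pipeline: MTLComputePipelineState,  // built from "carveRow"
                     energy: MTLTexture,                 // input energy map
                     cost: MTLTexture) {                 // accumulated-cost output
    guard let commands = queue.makeCommandBuffer(),
          let encoder = commands.makeComputeCommandEncoder() else { return }
    encoder.setComputePipelineState(pipeline)
    encoder.setTexture(energy, index: 0)
    encoder.setTexture(cost, index: 1)

    let w = pipeline.threadExecutionWidth
    let threadsPerGroup = MTLSize(width: w, height: 1, depth: 1)
    let groupsPerRow = MTLSize(width: (energy.width + w - 1) / w, height: 1, depth: 1)

    for row in 0..<energy.height {
        var r = UInt32(row)
        // The kernel reads row r-1 of the cost texture and writes row r.
        encoder.setBytes(&r, length: MemoryLayout<UInt32>.size, index: 0)
        encoder.dispatchThreadgroups(groupsPerRow, threadsPerThreadgroup: threadsPerGroup)
    }
    encoder.endEncoding()
    commands.commit()
}
```

Even if that is correct, it's a lot of dispatches for a tall image, so I'd still welcome a better approach.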
Thanks for any ideas!

Related

How to implement Tiling with Compute shader in Metal?

I am new to Metal and the compute world. With a basic understanding of GPU programming, I want to optimize the texel reads of my input image/data. I have already tried combining multiple reads and processing those pixels together, but somehow that makes things even worse.
I have tried combining multiple reads in many ways: reading 4 pixels in a row, in a column, in a 2x2 block, and 4 pixels each from a different quadrant of the image, all without any gain.
I found during experiments that a simple read of a 4K image takes around 4.5 ms on an iPhone X, 11, or Pro. So I want to reduce that time in order to speed up the whole process.
I came to know that tile local storage is quite fast, and I think I could read multiple texels into that memory and process them in one go later (in the next pass, or once the reads complete); a sketch of what I mean follows my questions below.
My questions:
How can I implement Tiling with Compute Shader in Metal? Is there a sample or example code?
What optimization is possible here?
Is there any other way to optimize texel reads to make them faster?
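For context, the kind of tile-based read I have in mind is a compute kernel where each 16x16 threadgroup cooperatively loads its tile into threadgroup memory, synchronises, and then works from that fast copy. This is only a sketch; all of the names are mine, and the kernel just copies data through to show the structure:

```swift
import Metal

// Placeholder kernel source: each 16x16 threadgroup stages one tile of the
// input in threadgroup memory before doing any per-pixel work.
let tiledKernelSource = """
#include <metal_stdlib>
using namespace metal;

#define TILE 16

kernel void tiledRead(texture2d<float, access::read>  src [[texture(0)]],
                      texture2d<float, access::write> dst [[texture(1)]],
                      uint2 gid [[thread_position_in_grid]],
                      uint2 lid [[thread_position_in_threadgroup]])
{
    threadgroup float4 tile[TILE][TILE];

    // One cooperative read per thread into fast threadgroup memory.
    if (gid.x < src.get_width() && gid.y < src.get_height()) {
        tile[lid.y][lid.x] = src.read(gid);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Real processing would happen here, reading neighbours from tile
    // instead of issuing further texture fetches.
    if (gid.x < dst.get_width() && gid.y < dst.get_height()) {
        dst.write(tile[lid.y][lid.x], gid);
    }
}
"""

// Compile the source at runtime; dispatch with 16x16 threadgroups so that
// lid stays within the tile bounds.
func makeTiledPipeline(device: MTLDevice) throws -> MTLComputePipelineState {
    let library = try device.makeLibrary(source: tiledKernelSource, options: nil)
    return try device.makeComputePipelineState(function: library.makeFunction(name: "tiledRead")!)
}
```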

What is the correct way to store per-pixel persistence data in the Metal compute kernel?

I am trying to implement the MoG background subtraction algorithm based on the OpenCV CUDA implementation.
What I need is to maintain a set of Gaussian parameters independently for each pixel location across multiple frames. Currently I am just allocating a single big MTLBuffer to do the job, and on every frame I have to invoke the commandEncoder.setBuffer API. Is there a better way? I read about imageblocks but I am not sure if they are relevant.
Also, I would be really happy if you could spot anything that shouldn't be directly translated from CUDA to Metal.
Allocate an 8-bit texture and store intermediate values into it in your compute shader. After this texture is rendered, you can rebind it as an input texture to whatever other passes need to read from it in the rest of the render. You can find a very detailed example of this sort of thing in this GitHub example project of a parallel prefix sum on top of Metal; it also shows how to write XCTest regression tests for your Metal shaders. GitHub MetalPrefixSum
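For what it's worth, the buffer approach you describe is also workable as-is: allocate the state buffer once, and re-issuing setBuffer each frame only records the binding; it does not copy anything. A sketch of per-pixel persistent state (the struct layout and names here are illustrative, not taken from OpenCV):

```swift
import Metal

// Illustrative per-pixel state kept in one MTLBuffer that lives across frames.
let mogKernelSource = """
#include <metal_stdlib>
using namespace metal;

struct GaussianState {      // one per pixel; persists between dispatches
    float mean;
    float variance;
    float weight;
};

kernel void updateModel(texture2d<float, access::read> frame [[texture(0)]],
                        device GaussianState *state    [[buffer(0)]],
                        constant uint &width           [[buffer(1)]],
                        uint2 gid [[thread_position_in_grid]])
{
    if (gid.x >= frame.get_width() || gid.y >= frame.get_height()) return;
    device GaussianState &s = state[gid.y * width + gid.x];
    float lum = frame.read(gid).r;
    // Placeholder update rule; the real MoG update goes here.
    s.mean = mix(s.mean, lum, 0.05f);
}
"""

// Allocate once, at startup; the contents persist because nothing overwrites them.
func makeStateBuffer(device: MTLDevice, width: Int, height: Int) -> MTLBuffer? {
    let stride = MemoryLayout<Float>.size * 3      // matches GaussianState above
    return device.makeBuffer(length: width * height * stride, options: .storageModePrivate)
}
```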

Making a large lookup table in OpenGL ES on iOS

I am using OpenGL ES on iOS to do some general data processing. Currently I am trying to make a large lookup table (~1M elements) of float values accessed by integer indexes, and I would like it to be 1D (though 2D works). I have learnt that using a texture/sampler is probably the way to do that, but my remaining questions are:
Sampler or texture, which is more efficient? What would be the parameter settings to achieve optimal results (like those configured in glTexParameteri())?
I know I can use a 1-texel-high 2D sampler/texture as a 1D one, but out of curiosity I wonder: were 1D samplers/textures removed in iOS ES3? I cannot find the method glTexImage1D() nor the parameter GL_TEXTURE_1D with ES3/gl.h imported.
OpenGL ES does not have 1D textures. Never did in any previous version, and still doesn't up to the most recent version (3.2). And I very much doubt it ever will.
At least in my opinion, that's no big loss. You can do anything you could have done with a 1D texture using a 2D texture of height 1. The only minor inconvenience is that you have to pass in some more sampling attributes, and a second texture coordinate when you sample the texture in your GLSL code.
For the sizes you're looking at, you'll have the same problem with a 2D texture of height 1 that you would have faced with 1D textures as well: You're limited by the maximum texture size. This is given by the value you can query with glGetIntegerv(GL_MAX_TEXTURE_SIZE, ...). Typical values for relatively recent mobile platforms are 2K to 8K. Based on the published docs, it looks like the limit is 4096 on recent Apple platforms (A7 to A9).
There is nothing I can think of that would give you a much larger range in a single dimension. There is an EXT_texture_buffer extension that targets your use case, but I don't see it in the list of supported extensions for iOS.
So the best you can probably do is store the data in a 2D texture, and use div/mod arithmetic to split your large 1D index into 2 texture coordinates.
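In code, that split is just integer division and modulo. Shown in Swift for concreteness; the same two expressions translate directly to GLSL:

```swift
// Split a flat 1D index into 2D texel coordinates for a texture of the given
// width (e.g. 4096, the limit mentioned above).
func texelCoord(for index: Int, width: Int) -> (x: Int, y: Int) {
    (x: index % width, y: index / width)
}

// If sampling with normalized coordinates, address the texel centres:
func normalizedCoord(for index: Int, width: Int, height: Int) -> (u: Float, v: Float) {
    let (x, y) = texelCoord(for: index, width: width)
    return (u: (Float(x) + 0.5) / Float(width),
            v: (Float(y) + 0.5) / Float(height))
}
```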

Compute Shader, Buffer or texture

I'm trying to implement fluid dynamics using compute shaders. The article I'm working from performs a series of passes on a texture, since it was written before compute shaders existed.
Would it be faster to do each pass on a texture or a buffer? The final pass would have to be applied to a texture anyway.
I would recommend using whichever dimensionality of resource fits the simulation. If it's a 1D simulation, use a RWBuffer, if it's a 2D simulation use a RWTexture2D and if it's a 3D simulation use a RWTexture3D.
There appear to be stages in the algorithm that you linked that make use of bilinear filtering. If you restrict yourself to using a Buffer you'll have to issue 4 or 8 memory fetches (depending on 2D or 3D) and then more instructions to calculate the weighted average. Take advantage of the hardware's ability to do this for you where possible.
Another thing to be aware of is that data in textures is not laid out row by row (linearly) as you might expect; instead it is laid out so that neighbouring texels are as close to one another in memory as possible. This can be called tiling or swizzling, depending on whose documentation you read. For that reason, unless your simulation is one-dimensional, you may well get far better cache coherency on reads/writes from a resource whose layout most closely matches the dimensions of the simulation.
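To make the filtering point concrete, this is roughly the work a shader has to do by hand when bilinearly sampling from a flat buffer; a texture sampler does all of it in fixed-function hardware. A Swift sketch of the 2D case (edge handling simplified):

```swift
// Manual bilinear sample from a row-major buffer: four fetches plus lerps.
func bilinear(_ data: [Float], width: Int, height: Int, x: Float, y: Float) -> Float {
    let x0 = max(0, min(width - 2, Int(x)))
    let y0 = max(0, min(height - 2, Int(y)))
    let fx = x - Float(x0)          // horizontal weight
    let fy = y - Float(y0)          // vertical weight
    let i = y0 * width + x0
    let top    = data[i]         * (1 - fx) + data[i + 1]         * fx
    let bottom = data[i + width] * (1 - fx) + data[i + width + 1] * fx
    return top * (1 - fy) + bottom * fy
}
```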

Speed of ComputeShader vs. PixelShader

I've got a question regarding ComputeShader compared to PixelShader.
I want to do some processing on a buffer, and this is possible with both a pixel shader and a compute shader; now I wonder if there is any advantage to either one over the other, specifically when it comes to speed. I've had issues with both in getting to use just 8-bit values, but I should be able to work around that.
Every data point in the output will be calculated using a total of 8 data points surrounding it (an MxN matrix), so I'd think this would be perfect for a pixel shader, since the different outputs don't influence each other at all.
But I was unable to find any benchmarks comparing the two, and now I wonder which one I should aim for. My only target is speed.
From what I understand, shaders are shaders in the sense that they are just programs run by a lot of threads over data. Therefore, in general there should not be any difference in terms of computing power/speed between doing calculations in the pixel shader and in the compute shader. However:
To do calculations in the pixel shader you have to massage your data so that it looks like image data. This means you have to draw a quad first of all, and also that your output must have the 'shape' of a pixel (a float4, basically). This data must then be interpreted by your app into something useful.
If you're using the compute shader you can completely control the number of threads, whereas for pixel shaders the thread counts have to correspond to valid resolutions. You can also input and output data in any format you like and take advantage of accelerated format conversion using UAVs (I think).
I'd recommend using compute shaders, since they are meant for general-purpose computation and are a lot easier to work with. Your overall application will probably be faster too, even if the actual shader computation time is about the same, simply because you can avoid some of the hoops you have to jump through just to get pixel shaders to do what you want.
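Since this page is iOS-tagged: the same point about controlling thread counts applies in Metal, where a compute pass takes an arbitrary grid instead of a rasterised quad. A sketch of the usual ceiling-divide dispatch, sized from the pipeline's own limits:

```swift
import Metal

// Cover an outputWidth x outputHeight grid with threadgroups sized from the
// pipeline's limits; the kernel is expected to bounds-check the overhang.
func dispatch(encoder: MTLComputeCommandEncoder,
              pipeline: MTLComputePipelineState,
              outputWidth: Int, outputHeight: Int) {
    encoder.setComputePipelineState(pipeline)
    let w = pipeline.threadExecutionWidth
    let h = pipeline.maxTotalThreadsPerThreadgroup / w
    let perGroup = MTLSize(width: w, height: h, depth: 1)
    let groups = MTLSize(width: (outputWidth + w - 1) / w,
                         height: (outputHeight + h - 1) / h,
                         depth: 1)
    encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: perGroup)
}
```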
