I'm trying to implement fluid dynamics using compute shaders. The article performs a series of passes on a texture, since it was written before compute shaders existed.
Would it be faster to do each pass on a texture or a buffer? The final pass would have to be applied to a texture anyway.
I would recommend using whichever dimensionality of resource fits the simulation: if it's a 1D simulation, use an RWBuffer; if it's a 2D simulation, use an RWTexture2D; and if it's a 3D simulation, use an RWTexture3D.
There appear to be stages in the algorithm you linked that make use of bilinear filtering. If you restrict yourself to a buffer, you'll have to issue 4 or 8 memory fetches (for the 2D or 3D case respectively) and then more instructions to calculate the weighted average. Take advantage of the hardware's ability to do this for you where possible.
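To make that cost concrete, here is roughly the arithmetic a single bilinear fetch does for you in hardware. This is only an illustrative C++ sketch over a row-major float grid stored in a flat buffer; the function and variable names are made up for the example:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // What one hardware bilinear fetch does for you: four loads plus the
    // weighted average (eight loads for the trilinear / 3D case).
    float bilinearFromBuffer(const std::vector<float>& grid, int width, int height,
                             float u, float v)
    {
        float x = u * width  - 0.5f;   // normalised -> texel space
        float y = v * height - 0.5f;
        int   x0 = (int)std::floor(x);
        int   y0 = (int)std::floor(y);
        float fx = x - x0;
        float fy = y - y0;

        auto at = [&](int xi, int yi) {
            xi = std::clamp(xi, 0, width  - 1);
            yi = std::clamp(yi, 0, height - 1);
            return grid[yi * width + xi];      // plain row-major indexing
        };

        // Four fetches plus three lerps, done for free by a filtered texture read.
        float top    = at(x0, y0)     * (1.0f - fx) + at(x0 + 1, y0)     * fx;
        float bottom = at(x0, y0 + 1) * (1.0f - fx) + at(x0 + 1, y0 + 1) * fx;
        return top * (1.0f - fy) + bottom * fy;
    }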
Another thing to be aware of is that data in textures is not laid out row by row (linearly) as you might expect; instead it's laid out so that neighbouring texels are as close to one another in memory as possible. This can be called tiling or swizzling depending on whose documentation you read. For that reason, unless your simulation is one-dimensional, you may well get far better cache coherency on reads/writes from a resource whose layout most closely matches the dimensions of the simulation.
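If you're curious what such a layout can look like, here is an illustrative Morton (Z-order) index in C++. Real hardware layouts are vendor-specific and undocumented; this only shows how neighbouring texels can end up close together in the 1D address space, unlike the row-major y * width + x indexing of a plain buffer:

    #include <cstdint>

    // One common tiling scheme: interleave the bits of x and y so that texels
    // that are neighbours in 2D are also close together in memory.
    // Purely illustrative; actual GPU layouts vary by vendor.
    uint32_t mortonIndex(uint16_t x, uint16_t y)
    {
        uint32_t index = 0;
        for (int bit = 0; bit < 16; ++bit) {
            index |= (uint32_t)((x >> bit) & 1u) << (2 * bit);
            index |= (uint32_t)((y >> bit) & 1u) << (2 * bit + 1);
        }
        return index;
    }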
I'm trying to create an implementation of seam carving which will run on the GPU using Metal. The dynamic programming part of the algorithm requires the image to be processed row by row (or column by column), so I know it's not perfectly suited to the GPU, but I figure that with images potentially thousands of pixels wide/tall, it will still benefit from parallelisation.
My two ideas were either
write a shader that uses 2D textures, but ensure that Metal computes over the image in the correct order, finishing one row before starting the next
write a shader using 1D textures, then manually pass in each row of the image to compute; ideally creating a view into the 2D textures rather than having to copy the data into separate 1D textures
I am still new to Metal and Swift, so my naïve attempts at both of these did not work. For option 1, I tried to dispatch threads using a threadgroup of Nx1x1, but the resulting texture just comes back all zeros. Besides, I'm not convinced this is even right in theory: even if I tell it to use a threadgroup of height one, I'm not sure I can guarantee it will start on the first row. For option 2, I simply couldn't find a nice way to create a 1D view into a 2D texture by row/column. The documentation seems to suggest that Metal doesn't like giving you access to the underlying data, but I wouldn't be surprised if there was a way to do this.
Thanks for any ideas!
I am trying to implement the MoG background subtraction algorithm based on the OpenCV CUDA implementation.
What I need is to maintain a set of Gaussian parameters independently for each pixel location across multiple frames. Currently I am just allocating a single big MTLBuffer to do the job, and on every frame I have to invoke the commandEncoder.setBuffer API. Is there a better way? I read about imageblocks but I am not sure if they are relevant.
Also, I would be really happy if you could spot anything that shouldn't be directly translated from CUDA to Metal.
Allocate an 8-bit texture and store the intermediate values into it in your compute shader. Then, after this texture has been written, you can rebind it as an input texture to whatever other passes need to read from it during the rest of the rendering. You can find a very detailed example of this sort of thing in this GitHub example project of a parallel prefix sum on top of Metal, which also shows how to write XCTest regression tests for your Metal shaders: GitHub MetalPrefixSum
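A rough sketch of the allocate-once-and-rebind idea, shown here with the metal-cpp C++ wrapper purely for illustration (the Swift calls map one to one; device, computeEncoder, nextEncoder, width and height are placeholders):

    #include <Metal/Metal.hpp>

    // Create the intermediate texture once, with both read and write usage.
    MTL::TextureDescriptor* desc = MTL::TextureDescriptor::texture2DDescriptor(
        MTL::PixelFormatR8Unorm, width, height, false);
    desc->setUsage(MTL::TextureUsageShaderRead | MTL::TextureUsageShaderWrite);
    MTL::Texture* intermediate = device->newTexture(desc);

    // Pass 1: the compute kernel writes its intermediate values into it.
    computeEncoder->setTexture(intermediate, 0);
    // ... dispatch threadgroups, end encoding ...

    // Later passes: rebind the very same texture as a read-only input.
    nextEncoder->setTexture(intermediate, 0);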
The title is pretty much the question. I'm trying to understand how CPU and GPU cooperation works.
I'm developing my game with cocos2d. It is a game engine, so it redraws the whole screen 60 times per second. Every node in cocos2d draws its own set of triangles. Usually you set the vertices for the triangles after performing the node transforms (from node space to world space) on the CPU side. I've realised I can do it on the GPU side in the vertex shader by passing the model-view-projection matrix as a uniform.
I see CPU time decrease by ~1 ms and GPU time increase by ~0.5 ms.
Can I consider this as a performance gain?
In other words: if something can be done on the GPU side, are there any reasons you shouldn't do it?
The only time you shouldn't do something on the GPU side is if you need the result (in easily accessible form) on the CPU side to further the simulation.
Take your example. Assume you have four 250 KB meshes which represent a hierarchy of body parts (a skeleton), and that you are using a 4x4 matrix of floats (64 bytes) for each mesh's transformation. You could either:
Each frame, perform the mesh transformation calculations on the application side (CPU) and then upload the four meshes to the GPU. This would result in roughly 1000 KB of data being sent to the GPU per frame.
When the application starts, upload the data for the four meshes to the GPU (in a rest/identity pose). Then each frame, when you make the render call, calculate only the new matrices for each mesh (position/rotation/scale), upload those matrices to the GPU, and perform the transformation there. This results in ~256 bytes being sent to the GPU per frame.
As you can see, even if the data in the example is fabricated, the main advantage is that you are minimizing the amount of data being transferred between CPU and GPU on a per frame basis.
The only time you would prefer the first option is if your application needs the results of the transformation to do some other work. The GPU is very efficient (especially at processing vertices in parallel), but it isn't that easy to get information back from it (and then it's usually in the form of a texture, i.e. a render target). One concrete example of this 'further work' might be performing collision checks on the transformed mesh positions.
edit
To some extent*, you can tell where the data is stored based on how you are calling the OpenGL API. Here is a quick run-down:
Vertex Arrays
glVertexPointer(...)
glDrawArrays(...)
With this method you pass an array of vertices from the CPU to the GPU each frame. The vertices are processed sequentially in the order they appear in the array. There is a variation of this method (glDrawElements) which lets you specify indices.
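A minimal sketch of that path (legacy client-side arrays; the triangle data and names are made up for the example):

    #include <GL/gl.h>

    /* Client-side vertex array: the data lives in CPU memory and is
       re-sent to the GPU every time you draw. */
    GLfloat triangle[] = {
        -0.5f, -0.5f, 0.0f,
         0.5f, -0.5f, 0.0f,
         0.0f,  0.5f, 0.0f,
    };

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, triangle);  /* points at application memory */
    glDrawArrays(GL_TRIANGLES, 0, 3);           /* vertices travel CPU -> GPU here */
    glDisableClientState(GL_VERTEX_ARRAY);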
VBOs
glBindBuffer(...)
glBufferData(...)
glDrawElements(...)
VBOs allow you to store the mesh data on the GPU (see the note below). In this way, you don't need to send the mesh data to the GPU each frame, only the transformation data.
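A rough sketch of that split, assuming a GL context and shader program are already set up (vbo, ibo, vertexData, indexData, modelMatrixLocation and friends are placeholder names):

    #include <GL/glew.h>   // or whichever loader you use

    // At start-up: upload the mesh once; it stays resident on the GPU.
    GLuint vbo, ibo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertexBytes, vertexData, GL_STATIC_DRAW);
    glGenBuffers(1, &ibo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, indexBytes, indexData, GL_STATIC_DRAW);

    // Each frame: only the 64-byte matrix crosses the bus, not the mesh.
    glUniformMatrix4fv(modelMatrixLocation, 1, GL_FALSE, modelMatrix);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);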
*Although we can indicate where our data should be stored, the OpenGL specification does not actually dictate how vendors implement this. That means we can give hints that our vertex data should be stored in VRAM, but ultimately it is down to the driver!
Good reference links for this stuff are:
OpenGL ref page: https://www.opengl.org/sdk/docs/man/html/start.html
OpenGL explanations: http://www.songho.ca/opengl
Java OpenGL concepts for rendering: http://www.java-gaming.org/topics/introduction-to-vertex-arrays-and-vertex-buffer-objects-opengl/24272/view.html
I'm writing a pixel shader that has the property that, for a given quad, the returned values vary only along the u axis; i.e. for a fixed u, the color output is constant as v varies.
The computation to calculate the color at a pixel is relatively expensive: it takes multiple samples per pixel, loops, etc.
Is there a way to take advantage of the v-invariance property? If I were doing this on a CPU I'd obviously just cache the values once calculated, but I guess that doesn't apply because of the parallelism. It might be possible for me to move the texture generation to the CPU side and have the shader access a Texture1D, but I'm not sure how fast that would be.
Is there a paradigm that fits this situation on GPU cards?
cheers
Storing your data in a 1D texture and sampling it in your pixel shader looks like a good solution. The GPU will be able to use its texture caching, taking advantage of the fact that many of your pixels read the same value from the 1D texture. This should be really fast; texture fetching and caching is one of the main reasons your GPU is so efficient at rendering.
It is common practice to make a trade-off between calculating a value in the pixel shader and using a lookup-table texture. You are doing complex calculations by the sound of it, so using a lookup texture will certainly improve performance.
Note that you could still generate this texture on the GPU; there is no need to move it to the CPU. Just render to this 1D texture using your existing shader code as a prepass.
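For example, here is an OpenGL-flavoured sketch of that prepass; the same structure works in D3D with a 1D render target, and lutWidth / lutTexture are placeholder names:

    // Create the 1D lookup texture once, with linear filtering.
    GLuint lutTexture;
    glGenTextures(1, &lutTexture);
    glBindTexture(GL_TEXTURE_1D, lutTexture);
    glTexImage1D(GL_TEXTURE_1D, 0, GL_RGBA16F, lutWidth, 0, GL_RGBA, GL_FLOAT, nullptr);
    glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

    // Prepass: attach it to a framebuffer and run the expensive shader once per u.
    GLuint fbo;
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture1D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_1D, lutTexture, 0);
    glViewport(0, 0, lutWidth, 1);
    // ... draw a full-width strip with the existing expensive shader here ...

    // Main pass: the quad's pixel shader then just samples the result, e.g.
    //   color = texture(lut, uv.x);   // GLSL, sampler1D bound to lutTexture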
For simplicity of the problem let's consider spheres. Let's say I have a sphere, and before execution I know the radius, the position and the triangle count. Let's also say the triangle count is sufficiently large (e.g. ~50k triangles).
Would it generally be faster to create this sphere mesh beforehand and stream all 50k triangles to the graphics card, or would it be faster to send a single point (representing the centre of the sphere) and use tessellation and geometry shaders to build the sphere on the GPU?
Would it still be faster if I had 100 of these spheres in different positions? Can I use hull/geometry shaders to create something which I can then combine with instancing?
Tessellation is certainly valuable, especially when combined with displacement from a heightmap, but the isolated scenario described in your question makes it hard to give a complete answer.
Before using tessellation you would want to know that you are going to become CPU poly/triangle bound and therefore need to start using the GPU to help you increase the overall triangle count of your game/scene. Calculations are very fast on the GPU, so yes, using multiple subdivision levels of tessellation is advisable if you are going to do it, though sometimes I've been happy just subdividing 3-4 times from a 200-triangle plane.
Tessellation is mainly used for environmental/static scene meshes so that you can spend your triangles on characters and other moving/animated models without becoming CPU bound.
Check out engines like Unity3D and CryEngine for tessellation examples to help with the learning curve.
I just so happen to be working with this at the same time.
In terms of FPS, the pre-computed method would be faster in this situation, since you can upload one giant 50k-triangle sphere payload (like any other model) and draw it in multiple places from there.
The tessellation method would be slower, since all the triangles would be generated from a formula, multiple times per frame.
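To make "draw it in multiple places" concrete, here is an OpenGL-flavoured sketch of instancing the one uploaded sphere mesh. instanceVbo, modelMatrices, indexCount and the attribute locations 3-6 are assumptions for the example:

    // Per-instance model matrices live in their own buffer and reach the
    // vertex shader as four vec4 attributes that advance once per instance.
    glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(float) * 16 * instanceCount,
                 modelMatrices, GL_DYNAMIC_DRAW);
    for (int i = 0; i < 4; ++i) {
        glEnableVertexAttribArray(3 + i);
        glVertexAttribPointer(3 + i, 4, GL_FLOAT, GL_FALSE, sizeof(float) * 16,
                              (void*)(sizeof(float) * 4 * i));
        glVertexAttribDivisor(3 + i, 1);   // step per instance, not per vertex
    }

    // One draw call renders every sphere from the same 50k-triangle mesh.
    glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, nullptr,
                            instanceCount);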