Efficient pixel shader when only u-axis varies? - directx

I'm writing a pixel shader that has the property where for a given quad the values returned only vary by u-axis value. I.e. for a fixed u, then the color output is constant as v varies.
The computation to calculate the color at a pixel is relatively expensive - i.e. does multiple samples per pixel / loops etc..
Is there a way to take advantage of the v-invariance property? If I was doing this on a CPU then you'd obviously just cache the values once calculated but guess that doesn't apply because of parallelism. It might be possible for me to move the texture generation to the CPU side and have the shader access a Texture1D but I'm not sure how fast that will be.
Is there a paradigm that fits this situation on GPU cards?
cheers

Storing your data in a 1D texture and sampling it in your pixel shader looks like a good solution. Your GPU will be able to use texture caching features, allowing it to make use of the fact that many of you pixels are using the same value from your 1D texture. This should be really fast, texture fetching and caching is one of the main reasons your gpu is so efficient at rendering.
It is commonpractice to make a trade-off between calculating the value in the pixel shader, or using a lookup table texture. You are doing complex calculations by the sound of it, so using a lookup texture with certainly improve performance.
Note that you could still generate this texture by the GPU, there is no need to move it to the CPU. Just render to this 1D texture using your existing shader code as a prepass.

Related

For batch rendering multiple similar objects which is more performant, drawArrays(TRIANGLE_STRIP) with "degenerate triangles" or drawArraysInstanced?

MDN states that:
Fewer, larger draw operations will generally improve performance. If
you have 1000 sprites to paint, try to do it as a single drawArrays()
or drawElements() call.
It's common to use "degenerate triangles" if you need to draw
discontinuous objects as a single drawArrays(TRIANGLE_STRIP) call.
Degenerate triangles are triangles with no area, therefore any
triangle where more than one point is in the same exact location.
These triangles are effectively skipped, which lets you start a new
triangle strip unattached to your previous one, without having to
split into multiple draw calls.
However, it is also commmonly recommended that for multiple similar objects one should use instanced rendered. For webGl2 something like drawArraysInstanced() or for webGl1 drawArrays with the ANGLE_instanced_arrays extension activated.
For my personal purposes I need to render a large amount of rectangles of the same width in a 2d plane but with varying heights (webgl powered charting application). So any recommendation particular to my usecase is valuable.
Degenerate triangles are generally faster than drawArraysInstanced but there's arguably no reason to use degenerate triangles when you can just make quads with no degenerate triangles.
While it's probably true that degenerate triangles are slightly faster than quads you're unlikely to notice that difference. In fact I suspect it wold be difficult to create an example in WebGL that would show the difference.
To be clear I'm suggesting manually instanced quads. If you want to draw 1000 quads put 1000 quads in a single vertex buffer and draw all with 1 draw call using either drawElements or drawArrays
On the other hand instanced quads using drawArraysInstances might be the most convenient way depending on what you are trying to do.
If it was me though I'd first test without optimization, drawing 1 quad per draw call unless I already knew I was going to draw > 1000 quads. Then I'd find some low-end hardware and see if it's too slow. Most GPU apps get fillrate bound (drawing pixels) before they get vertex bound so even on a slow machine drawing lots of quads might be slow in a way that optimizing vertex calculation won't fix the issue.
You might find this and/or this useful
You can take as a given that the performance of rendering has been optimized by the compiler and the OpenGL core.
static buffers
If you have a buffers that are static then there is generally an insignificant performance difference between the techniques mentioned. Though different hardware (GPUs) will favor one technique over another, but there is no way to know what type of GPU you are running on.
Dynamic buffers
If however when the buffers are dynamic then you need to consider the transfer of data from the CPU RAM to the GPU RAM. This transfer is a slow point and on most GPU's will stop rendering as the data is moved (Messing up concurrent rendering advantages).
On average anything that can be done to reduce the size of the buffers moved will improve the performance.
2D Sprites Triangle V Triangle_Strip
At the most basic 2 floats per vertex (x,y for 2D sprites) you need to modify and transfer a total of 6 verts per quad for gl.TRIANGLE (6 * 2 * b = 48bytes per quad. where b is bytes per float (4)). If you use (gl.TRIANGLE_STRIP) you need to move only 4 verts for a single quad, but for more than 1 you need to create the degenerate triangle each of which requires an additional 2 verts infront and 2 verts behind. So the size per quad is (8 * 2 * 4 = 64bytes per quad (actual can drop 2verts lead in and 2 lead out, start and end of buffer))
Thus for 1000 sprites there are 12000 doubles (64Bit) that are converted to Floats (32Bit) then transfer is 48,000bytes for gl.TRIANGLE. For gl.TRIANGLE_STRIP there are 16,000 doubles for a total of 64,000bytes transferred
There is a clear advantage when using triangle over triangle strip in this case. This is compounded if you include additional per vert data (eg texture coords, color data, etc)
Draw Array V Element
The situation changes when you use drawElements rather than drawArray as the verts used when drawing elements are located via the indices buffer (a static buffer). In this case you need only modify 4Verts per quad (for 1000 quads modify 8000 doubles and transfer 32,000bytes)
Instanced V modify verts
Now using elements we have 4 verts per quad (modify 8 doubles, transfer 32bytes).
Using drawArray or drawElement and each quad has a uniform scale, be rotated, and a position (x,y), using instanced rendering each quad needs only 4 doubles per vert, the position, scale, and rotation (done by the vertex shader).
In this case we have reduced the work load down to (for 1000 quads) modify 4,000 doubles and transfer 16,000bytes
Thus instanced quads are the clear winner in terms of alleviating the transfer and JavaScript bottle necks.
Instanced elements can go further, in the case where it is only position needed, and if that position is only within a screen you can position a quad using only 2 shorts (16bit Int) reducing the work load to modify 2000 ints (32bit JS Number convert to shorts which is much quicker than the conversion of Double to Float)) and transfer only 4000bytes
Conclusion
It is clear in the best case that instanced elements offer up to 16times less work setting and transferring quads to the GPU.
This advantage does not always hold true. It is a balance between the minimal data required per quad compared to the minimum data set per vert per quad (4 verts per quad).
Adding additional capabilities per quad will alter the balance, so will how often you modify the buffers (eg with texture coords you may only need to set the coords once when not using instanced, by for instanced you need to transfer all the data per quad each time anything for that quad has changed (Note the fancy interleaving of instance data can help)
There is also the hardware to consider. Modern GPUs are much better at state changes (transfer speeds), in these cases its all in the JavaScript code where you can gain any significant performance increase. Low end GPUs are notoriously bad at state changes, though optimal JS code is always important, reducing the data per quad is where the significant performance is when dealing with low end devices

If I use vertex shader to do all operations on object, then constant buffer can be empty?

The program cycle is
Update();
UpdatePipeline();
In Update() constant buffer for each object, that after transformations, has this object world matrix is copied to GPU upload heap. And in UpdatePipeline(), among other things, installed shaders are called. Because we do all matrix transformation using CPU, vertex shader just returns position, right? If yes - is it true that performance will increase?
Now I want to do all transformations using GPU, i.e. via vertex shader. It means that in Update() I just should call memcpy() with an empty constant buffer as a source?
A constant buffer is just a buffer to move data from the CPU to the GPU. Whether you use one, or how many you use, and what you are using them for, is up to you.
The most common and most simple use-case for a constant buffer is to move a transformation matrix to the GPU. That matrix is indeed calculated by the CPU, and the vertex shader uses that matrix to transform the positions in the vertex buffer from local to screen space. This allows the CPU to move an object around without the need to update the (usually quite big) vertex buffer.
Whether performance increases depends on your hardware, your code, and - most importantly - what you are comparing the performance with. Since I don't know your current code, nor what exact changes you intent to do, I can't even guess if it will increase or not.
Also, even though I don't know your code, just by the way you phrased your question I would assume that you defenitly don't want to use a constant buffer as a source for any operation on the CPU.

Compute Shader, Buffer or texture

I'm trying to implement fluid dynamics using compute shaders. In the article there are a series of passes done on a texture since this was written before compute shaders.
Would it be faster to do each pass on a texture or buffer? The final pass would have to be applied to a texture anyways.
I would recommend using whichever dimensionality of resource fits the simulation. If it's a 1D simulation, use a RWBuffer, if it's a 2D simulation use a RWTexture2D and if it's a 3D simulation use a RWTexture3D.
There appear to be stages in the algorithm that you linked that make use of bilinear filtering. If you restrict yourself to using a Buffer you'll have to issue 4 or 8 memory fetches (depending on 2D or 3D) and then more instructions to calculate the weighted average. Take advantage of the hardware's ability to do this for you where possible.
Another thing to be aware of is that data in textures is not laid out row by row (linearly) as you might expect, instead it's laid in such a way that neighbouring texels are as close to one another in memory as possible; this can be called Tiling or Swizzling depending on whose documentation you read. For that reason, unless your simulation is one-dimensional, you may well get far better cache coherency on reads/writes from a resource whose layout most closely matches the dimensions of the simulation.

iOS OpenGL ES 2.0 VBO confusion

I'm attempting to render a large number of textured quads on the iPhone. To improve render speeds I've created a VBO that I leverage to render my objects in a single draw call. This seems to work well, but I'm new to OpenGL and have run into issues when it comes to providing a unique transform for each of my quads (ultimately I'm looking for each quad to have a custom scale, position and rotation).
After a decent amount of Googling, it appears that the standard means of handling this situation is to pass a uniform matrix to the vertex shader and to have each quad take care of rendering itself. But this approach seems to negate the purpose of the VBO, by ultimately requiring a draw call per object.
In my mind, it makes sense that each object should keep it's own model view matrix, using it to transform, scale and rotate the object as necessary. But applying separate matrices to objects in a VBO has me lost. I've considered two approaches:
Send the model view matrix to the vertex shader as a non-uniform attribute and apply it within the shader.
Or transform the vertex data before it's stored in the VBO and sent to the GPU
But the fact that I'm finding it difficult to find information on how best to handle this leads me to believe I'm confusing the issue. What's the best way of handling this?
This is the "evergreen" question (a good one) on how to optimize the rendering of many simple geometries (a quad is in fact 2 triangles, 6 vertices most of the time unless we use a strip).
Anyway, the use of VBO vs VAO in this case should not mean a significant advantage since the size of the data to be transferred on the memory buffer is rather low (32 bytes per vertex, 96 bytes per triangle, 192 per quad) which is not a big effort for nowadays memory bandwidth (yet it depends on How many quads you mean. If you have 20.000 quads per frame then it would be a problem anyway).
A possible approach could be to batch the drawing of the quads by building a new VAO at each frame with the different quads positioned in your own coordinate system. Something like shifting the quads vertices to the correct position in a "virtual" mesh origin. Then you just perform a single draw of the newly creates mesh in your VAO.
In this way, you could batch the drawing of multiple objects in fewer calls.
The problem would be if your quads need to "scale" and "rotate" and not just translate, you can compute it with CPU the actual vertices position but it would be way to costly in terms of computing power.
A simple suggestion on top of the way you transfer the meshes is to use a texture atlas for all the textures of your quads, in this way you will need a much lower (if not needed at all) texture bind operation which might be costly in rendering operations.

Speed of ComputeShader vs. PixelShader

I've got a question regarding ComputeShader compared to PixelShader.
I want to do some processing on a buffer, and this is possible both with a pixel shader and a compute shader, and now I wonder if there is any advantage in either over the other one, specifically when it comes to speed. I've had issues with either getting to use just 8 bit values, but I should be able to work-around that.
Every data point in the output will be calculated from using in total 8 data points surrounding it (MxN matrix), so I'd think this would be perfect for a pixel shader, since the different outputs don't influence each other at all.
But I was unable to find any benchmarkings to compare the shaders, and now I wonder which one I should aim for. Only target is the speed.
From what i understand, shaders are shaders in the sense that they are just programs run by alot of threads on data. Therefore, in general there should not be any diffrence in terms of computing power/speed doing calculations in the pixel shader as opposed to the compute shader. However..
To do calculations on the pixelshader you have to massage your data so that it looks like image data, this means you have to draw a quad first of all, but also that your output must have the 'shape' of a pixel (float4 basically). This data must then be interpreted by you app into something useful
if you're using the computeshader you can completly control the number of threads to use where as for pixel shaders they have to be valid resolutions. Also you can input and output data in any format you like and take advantage of accelerated conversion using UAVs (i think)
i'd recommend using computeshaders since they are ment for doing general purpose computation and are alot easier to work with. Your over all application will probably be faster too, even if the actual shader computation time is about the same, just because you can avoid some of the hoops you have to jump through just through to get pixel shaders to do what you want.

Resources