If I use the vertex shader to do all operations on an object, can the constant buffer be empty? (DirectX)

The program cycle is
Update();
UpdatePipeline();
In Update(), the constant buffer for each object, which after transformation holds that object's world matrix, is copied to the GPU upload heap. And in UpdatePipeline(), among other things, the installed shaders are invoked. Since we do all matrix transformations on the CPU, the vertex shader just returns the position, right? If so, is it true that performance will increase?
Now I want to do all transformations on the GPU, i.e. in the vertex shader. Does that mean that in Update() I should just call memcpy() with an empty constant buffer as the source?

A constant buffer is just a buffer to move data from the CPU to the GPU. Whether you use one, or how many you use, and what you are using them for, is up to you.
The most common and most simple use-case for a constant buffer is to move a transformation matrix to the GPU. That matrix is indeed calculated by the CPU, and the vertex shader uses that matrix to transform the positions in the vertex buffer from local to screen space. This allows the CPU to move an object around without the need to update the (usually quite big) vertex buffer.
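To make that split concrete, here is a small CPU-side sketch in Python (conceptual only, not DirectX API code): the per-frame "constant buffer" payload is just a 16-float world matrix, while the vertex buffer is uploaded once and never rewritten, no matter how many vertices the mesh has.

```python
# Conceptual sketch: moving an object by updating a 4x4 world matrix
# (what you'd put in a constant buffer) instead of rewriting every vertex.

def mat_vec(m, v):
    # Multiply a 4x4 row-major matrix by a 4-component column vector.
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

# Static vertex buffer: uploaded to the GPU once.
vertices = [[0, 0, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]

# Per-frame "constant buffer" payload: just 16 floats (64 bytes),
# regardless of how many vertices the mesh has.
world = translation(5, 0, 0)

# What the vertex shader does per vertex, on the GPU:
transformed = [mat_vec(world, v) for v in vertices]
print(transformed[1])  # [6, 0, 0, 1]
```

The point of the sketch: each frame the CPU recomputes and re-uploads only `world`; the per-vertex multiplication happens on the GPU.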
Whether performance increases depends on your hardware, your code, and - most importantly - what you are comparing the performance with. Since I don't know your current code, nor what exact changes you intend to make, I can't even guess whether it will increase or not.
Also, even though I don't know your code, just from the way you phrased your question I would assume that you definitely don't want to use a constant buffer as a source for any operation on the CPU.

Related

Speed difference between updating texture or updating buffers

I'm interested in the speed of updating a texture versus updating a buffer in WebGL.
(I think the performance would be mostly the same in OpenGL ES 2.)
If I need to update a texture or a buffer once per frame, both containing the same amount of data in bytes, which is better for performance?
The buffer usage would be DYNAMIC_DRAW, and the buffers would be drawn via index buffers.
This is really up to the device/driver/browser. There's no general answer. One device or driver might be faster for buffers, another for textures. There's also the question of access: buffers don't have random access, textures do. So if you need random access, your only option is a texture.
One example of a driver optimization: if you replace the entire buffer or texture, the driver may simply create a new buffer or texture internally and start using it when appropriate. If it doesn't do this, and you update a buffer or texture that is currently in use (commands have already been issued to draw with it but have not yet executed), then the driver has to stall your program and wait for those pending commands to finish before it can replace the contents. This also suggests that gl.bufferData can be faster than gl.bufferSubData, and gl.texImage2D faster than gl.texSubImage2D, but it's only can be. Again, it's up to the driver what it does, and which optimizations it can and can't, does and doesn't, perform.
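The "replace the whole thing" optimization (often called buffer orphaning) can be sketched in Python. All class and method names here are hypothetical stand-ins for what a real driver does internally; they are not WebGL API.

```python
# Sketch of the "orphaning" optimization described above: when the app
# replaces a whole buffer while the GPU may still be reading the old
# contents, the driver can allocate fresh storage instead of stalling.
# All names here are hypothetical; real drivers do this internally.

class FakeDriver:
    def __init__(self):
        self.storage = {}       # buffer id -> bytes currently backing it
        self.in_flight = set()  # buffers referenced by unexecuted draw calls
        self.stalls = 0

    def draw(self, buf_id):
        self.in_flight.add(buf_id)

    def buffer_data(self, buf_id, data):
        # Full replacement: safe to orphan the old storage.
        if buf_id in self.in_flight:
            self.in_flight.discard(buf_id)  # old storage lives on for the GPU
        self.storage[buf_id] = data

    def buffer_sub_data(self, buf_id, offset, data):
        # Partial update: the driver must wait for pending draws to finish.
        if buf_id in self.in_flight:
            self.stalls += 1
            self.in_flight.discard(buf_id)
        old = self.storage[buf_id]
        self.storage[buf_id] = old[:offset] + data + old[offset + len(data):]

d = FakeDriver()
d.buffer_data(1, b"aaaa")
d.draw(1)
d.buffer_data(1, b"bbbb")      # whole-buffer replacement: orphaned, no stall
d.draw(1)
d.buffer_sub_data(1, 0, b"c")  # partial update of an in-flight buffer: stall
print(d.stalls)  # 1
```

This is why a full gl.bufferData *can* beat gl.bufferSubData: only the partial update forces synchronization in this model.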
As for WebGL vs OpenGL ES 2: WebGL is more strict. You mentioned index buffers. Well, WebGL has to validate index buffers: when you draw, it has to check that all the indices in your buffer are in range for the currently bound and used attribute buffers. WebGL implementations cache this info so they don't have to do it again, but if you update an index buffer, the cache for that buffer is cleared. So in that case, updating textures would probably be faster than updating index buffers. On the other hand, it comes back to usage: if you're putting vertex positions in a texture and looking them up in the vertex shader, versus using them from a buffer, then while updating the texture might be faster, rendering vertices via texture lookups is likely slower. And "too slow" is, again, up to your app and the device/driver, etc.
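The index-validation caching can be sketched like this in Python; the class and field names are illustrative, not anything from the WebGL spec, but the invalidation behavior mirrors what the answer describes.

```python
# Sketch of the index-validation caching WebGL implementations do:
# cache the max index per index buffer, so draw calls only re-scan
# the buffer after it has been updated. Names are illustrative.

class IndexBuffer:
    def __init__(self, indices):
        self.indices = list(indices)
        self._max_cache = None  # invalidated on update
        self.scans = 0          # how many full validation passes ran

    def update(self, indices):
        self.indices = list(indices)
        self._max_cache = None  # cache cleared; next draw re-validates

    def max_index(self):
        if self._max_cache is None:
            self.scans += 1
            self._max_cache = max(self.indices)
        return self._max_cache

def draw_is_valid(index_buffer, vertex_count):
    # Every index must address a vertex that actually exists.
    return index_buffer.max_index() < vertex_count

ib = IndexBuffer([0, 1, 2, 2, 1, 3])
print(draw_is_valid(ib, 4))  # True, costs one full scan
print(draw_is_valid(ib, 4))  # True, served from the cache
ib.update([0, 1, 2, 4])
print(draw_is_valid(ib, 4))  # False: index 4 is out of range
print(ib.scans)  # 2
```

Updating the index buffer is what forces the second scan; an update to a texture would not.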

Efficient pixel shader when only u-axis varies?

I'm writing a pixel shader that has the property that, for a given quad, the returned values vary only with the u-axis value. That is, for a fixed u, the color output is constant as v varies.
The computation to calculate the color at a pixel is relatively expensive, i.e. it takes multiple samples per pixel, runs loops, etc.
Is there a way to take advantage of the v-invariance property? If I were doing this on a CPU then you'd obviously just cache the values once calculated, but I guess that doesn't apply because of parallelism. It might be possible for me to move the texture generation to the CPU side and have the shader access a Texture1D, but I'm not sure how fast that would be.
Is there a paradigm that fits this situation on GPU cards?
cheers
Storing your data in a 1D texture and sampling it in your pixel shader looks like a good solution. Your GPU will be able to use its texture caching, taking advantage of the fact that many of your pixels use the same value from the 1D texture. This should be really fast; texture fetching and caching is one of the main reasons your GPU is so efficient at rendering.
It is common practice to make a trade-off between calculating the value in the pixel shader and using a lookup-table texture. By the sound of it you are doing complex calculations, so using a lookup texture will certainly improve performance.
Note that you could still generate this texture on the GPU; there is no need to move the work to the CPU. Just render to the 1D texture using your existing shader code as a prepass.
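The lookup-table idea can be illustrated with a CPU-side Python sketch; `expensive()` is a made-up stand-in for the multi-sample loop in the real shader, and the resolution is arbitrary.

```python
import math

# Sketch of the lookup-table idea: precompute the expensive per-u result
# once (the "1D texture"), then every pixel in a column reuses it.
# expensive() is a stand-in for the costly multi-sample shader loop.

def expensive(u):
    return sum(math.sin(u * k) for k in range(1, 64))  # pretend this is costly

WIDTH, HEIGHT = 256, 256

# "Prepass": one evaluation per u value, not one per pixel.
lut = [expensive(x / WIDTH) for x in range(WIDTH)]

def shade(x, y):
    # The pixel shader reduces to a single 1D texture fetch.
    return lut[x]

# Every pixel in a column gets the same value; only WIDTH evaluations ran
# instead of WIDTH * HEIGHT.
assert shade(10, 0) == shade(10, HEIGHT - 1)
print(len(lut))  # 256
```

The saving is the ratio HEIGHT : 1 on the expensive computation, which is exactly the v-invariance being exploited.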

How do PowerVR GPUs provide a depth buffer?

iOS devices use a PowerVR graphics architecture. The PowerVR architecture is a tile-based deferred rendering model. The primary benefit of this model is that it does not use a depth buffer.
However, I can access the depth buffer on my iOS device. Specifically, I can use an offscreen Frame Buffer Object to turn the depth buffer into a color texture and render it.
If the PowerVR architecture doesn't use a depth buffer, how is it that I'm able to render a depth buffer?
It is true that a tile-based renderer doesn't need a traditional depth buffer in order to work.
A TBR splits the screen into tiles and completely renders the contents of each tile using fast on-chip memory to store the temporary colors and depths. Then, when the tile is finished, the final values are moved to the actual framebuffer. Depth values in a depth buffer are traditionally temporary, because they are only used for hidden surface removal; in this case they can be discarded entirely once the tile has been rendered.
That means that tile-based renderers effectively don't need a full-screen depth buffer in slower video memory, saving both bandwidth and memory.
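Some back-of-the-envelope arithmetic shows why that bandwidth matters. The resolution, overdraw factor, and frame rate below are made-up example numbers, not measurements.

```python
# Back-of-the-envelope arithmetic for the bandwidth saving above: the
# depth traffic an on-chip, discarded depth buffer avoids. Resolution,
# overdraw, and frame rate are made-up example numbers.

width, height = 1920, 1080
bytes_per_depth = 4            # 32-bit depth values
overdraw = 3                   # average depth tests per pixel per frame
fps = 60

# Immediate-mode GPU: every depth test reads and writes video memory.
imr_bytes_per_frame = width * height * bytes_per_depth * overdraw * 2
# Tile-based GPU discarding depth after each tile: no external traffic.
tbr_bytes_per_frame = 0

saved_gb_per_s = imr_bytes_per_frame * fps / 1e9
print(round(saved_gb_per_s, 1))  # 3.0 GB/s of memory traffic avoided
```

Even with these modest numbers, several GB/s of traffic disappears, which is significant on a mobile memory bus.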
The Metal API exposes this functionality directly, allowing you to set the storeAction of the depth buffer to a 'don't care' value, meaning that the resulting depth values will not be written back to main memory.
The exception is when you need the depth buffer contents after rendering (e.g. for a deferred renderer, or as a source for some algorithm that operates on depth values). In that case the hardware must ensure that the depth values are stored in the framebuffer for you to use.
Tile-based deferred rendering, as the name says, works on a tile-by-tile basis. Each portion of the screen is loaded into internal caches, processed, and written out again. The hardware takes the list of triangles overlapping the current tile and the current depth values, comes up with a new set of colours and depths from those, and then writes them all out again. So whereas a completely dumb GPU might do one read and one write to every relevant depth-buffer value per triangle, the PowerVR will do one read and one write per batch of geometry, with its ray-casting-style algorithm doing the rest in between.
It isn't really possible to implement OpenGL on a 'pure' tile-based deferred renderer because the depth buffer needs to be readable. It also generally isn't efficient to make the depth buffer write only because the colour buffer is readable at any time (either explicitly via glReadPixels or implicitly according to whenever you present the frame buffer as per your OS's mechanisms), meaning that the hardware may have to draw a scene, then draw more onto it.
PowerVR does use a depth buffer, but in a different way than a regular (immediate-mode rendering) GPU.
The deferred part of tile-based deferred rendering means that the triangles for a given scene are first processed (shaded, transformed, clipped, etc.) and saved into an intermediate buffer. Only after the entire scene has been processed are the tiles rendered one by one.
Having all the processed triangles in one buffer allows the hardware to perform hidden surface removal - removing the triangles that will end up being hidden/overdrawn by other triangles. This significantly reduces the number of rendered triangles, resulting in improved performance and reduced power consumption.
Hidden surface removal typically uses something called a tag buffer as well as a depth buffer. (Both are small on-chip memories, as they store one tile at a time.)
Not sure why you're saying that PowerVR doesn't use a depth buffer. My guess is that it's just a "marketing" way of saying that there is no need to perform expensive writes and reads from system memory in order to do depth testing.
p.s
Just to add to Tommy's answer: the primary benefits of tile-based deferred rendering are:
Since fragments are processed a tile at a time, all color/depth/stencil buffer reads and writes go to fast on-chip memory. While the color buffer still has to be read from/written to system memory once per tile, in many cases the depth and stencil buffers need to be written to system memory only if they are required for later use (as in your use case). System memory traffic is a significant source of power consumption, so you can see how this reduces it.
Deferred rendering enables hidden surface removal. Fewer rendered triangles means less fragment processing, which means fewer texture memory accesses.
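The fragment-count saving from hidden surface removal can be sketched in Python. The per-pixel fragment lists and depths below are arbitrary example data.

```python
# Sketch of why hidden surface removal cuts fragment work: with the
# whole tile's triangle list available, only the frontmost fragment
# per pixel is shaded. Depths and colors are arbitrary example values.

# Per-pixel lists of (depth, color) fragments, in submission order.
tile = {
    (0, 0): [(0.9, "red"), (0.3, "blue"), (0.6, "green")],
    (1, 0): [(0.5, "red")],
}

def immediate_mode(tile):
    # Shade every fragment that passes the depth test, as it arrives.
    shaded = 0
    for frags in tile.values():
        depth = 1.0
        for d, _ in frags:
            if d < depth:
                shaded += 1  # shaded now, possibly overdrawn later
                depth = d
    return shaded

def deferred_hsr(tile):
    # Resolve visibility first, then shade exactly one fragment per pixel.
    return sum(1 for frags in tile.values() if frags)

print(immediate_mode(tile))  # 3: red is shaded, then blue overdraws it
print(deferred_hsr(tile))    # 2: one shaded fragment per covered pixel
```

The gap widens with overdraw: deferred HSR shades one fragment per covered pixel regardless of how many triangles stack up there.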

OpenGL batching and instance uniqueness

I've been working on improving my OpenGL ES 2.0 render performance by introducing batching; specifically one creates a RenderBatch, specifying a texture and a shader (for now) upon creation. This sets the state into a VAO to allow for inexpensive state switching. I started the implementation looking something like this:
batch = RenderBatch.new "SpriteSheet", "FlatShader"
batch.begin GL_TRIANGLE_STRIP
batch.addGeometry Geometry.newFromFile "Billboard"
batch.end
batch.render renderEngine
But then it hit me: my Billboard file has vertices that are meant to be scaled and translated for specific instance usage. So I added a transform argument to the addGeometry call.
batch.addGeometry(Geometry.newFromFile("Billboard"), myObject.transform)
This solves the problem of scaling, translating, and rotating the vertices, but it does so by first looking up the vertex information, transforming it by the transform matrix, and then inserting it into the batch data. While this works, it seems inefficient: it is CPU-intensive and doesn't take advantage of the GPU's transformation power. However, it works, so it's not that big of a deal. (It would be nice to have a better way to do this, though.)
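That CPU-side approach can be sketched in Python (conceptual, not the actual engine code); each instance's vertices are transformed by its own matrix before being appended to one shared array, so the whole batch draws with a single call.

```python
# Sketch of the CPU-side batching described above: each instance's
# vertices are transformed by its own matrix before being appended to
# one shared vertex array, so the whole batch needs only one draw call.

def mat_vec(m, v):
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

# A unit quad standing in for the "Billboard" geometry file.
billboard = [[0, 0, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1], [1, 1, 0, 1]]

batch = []
for instance_transform in (translation(0, 0, 0), translation(10, 0, 0)):
    # The CPU cost the question mentions: every vertex is transformed here.
    batch.extend(mat_vec(instance_transform, v) for v in billboard)

print(len(batch))  # 8 vertices, one draw call covers both billboards
print(batch[4])    # [10, 0, 0, 1]
```

This makes the trade-off visible: one draw call, but O(vertices) CPU work and a VBO refill whenever any instance moves.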
However, I've run into a roadblock: texture coordinates may need to be different for each instance as well, and that means I would have to pass in a texture transformation matrix, and now this is feeling hacky.
Is there an easier way to handle this kind of transformation to existing data using shaders that does not limit the geometry/models given and is easily extensible to use normal maps, UV maps, and other fancy tricks? Thanks!
It seems to me that what you are talking about are shader uniforms. Normally you would set up the vertex data and attributes for each batch in a VBO and a VAO. Then, in your render method, you switch to the correct VAO and set up the shader uniforms. These normally include a model-view-projection matrix to transform vertices into clip space (which will typically change nearly every frame), the correct texture to use, etc.
This is efficient because the unchanging vertex data is held in GPU memory, the VAO takes care of cheap state switching, and only the uniforms, which generally change often, are sent to the GPU each render call.
If you are batching multiple objects that require separate model-view-projection matrices, then you have a few options:
perform a separate draw call for each batch that requires its own model-view-projection matrix
use an array of model-view-projection matrices as a uniform, with a per-object attribute providing the index of the matrix to use
transform the vertices on the CPU and refill the VBO with the updated data
The first method is the preferred solution; it is efficient and simple. The slow part of issuing lots of draw calls is generally getting data from the CPU to the GPU; if you already have the vertex data in VBOs, then the overhead of one draw call per object is not a big deal. It also solves the problem of providing different uniforms per object based on object properties: in each object's render method, the relevant properties are set up as uniforms before the draw call is made. If each object requires different data to be sent to the GPU, how else could it work?
This is a trade-off: the cost of state changes due to insufficient batching versus the cost of transforming on the CPU. There is no single best solution; it depends on how much of your scene is static, how much is dynamic, and how it is laid out.
A common solution is to put static objects, whose transformations relative to each other never change, into a single VBO or a few VBOs (if they use different textures, vertex formats, etc.), completely pre-transformed. This is done once before rendering, not each frame. Dynamic objects (players, monsters, whatever) are then rendered individually, with transformation done in the vertex shader.
You can still optimize for state changes by roughly ordering the drawing of the individual objects by textures and programs.
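The rough ordering suggestion can be sketched in Python; the draw list, object names, and state key are made-up examples, but they show how sorting by (program, texture) reduces the number of state switches issued.

```python
# Sketch of the rough ordering suggested above: sort dynamic draws by
# (program, texture) so consecutive draws share state and fewer
# switches are issued. Object and asset names are made up.

draws = [
    {"name": "goblin",  "program": "skinned",  "texture": "goblin.png"},
    {"name": "player",  "program": "skinned",  "texture": "player.png"},
    {"name": "goblin2", "program": "skinned",  "texture": "goblin.png"},
    {"name": "flame",   "program": "particle", "texture": "fire.png"},
]

def state_changes(ordered):
    # Count how many times the (program, texture) binding must change.
    changes, current = 0, None
    for d in ordered:
        key = (d["program"], d["texture"])
        if key != current:
            changes += 1
            current = key
    return changes

sorted_draws = sorted(draws, key=lambda d: (d["program"], d["texture"]))
print(state_changes(draws))         # 4: every draw switches state
print(state_changes(sorted_draws))  # 3: both goblins share one binding
```

With more objects sharing materials, the saving grows; the sort itself is cheap compared to redundant binds.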

Speed of ComputeShader vs. PixelShader

I've got a question regarding ComputeShader compared to PixelShader.
I want to do some processing on a buffer, which is possible with both a pixel shader and a compute shader, and now I wonder if either has an advantage over the other, specifically when it comes to speed. With both I've had issues being limited to 8-bit values, but I should be able to work around that.
Every data point in the output is calculated from a total of 8 data points surrounding it (an MxN matrix), so I'd think this would be perfect for a pixel shader, since the different outputs don't influence each other at all.
But I was unable to find any benchmarks comparing the two, and now I wonder which one I should aim for. The only target is speed.
From what I understand, shaders are shaders in the sense that they are just programs run by a lot of threads on data. Therefore, in general, there should not be any difference in computing power/speed between doing calculations in the pixel shader and in the compute shader. However...
To do calculations in the pixel shader you have to massage your data so that it looks like image data. This means you have to draw a quad first of all, and also that your output must have the 'shape' of a pixel (basically a float4). This data must then be interpreted by your app into something useful.
If you're using the compute shader, you can completely control the number of threads to use, whereas pixel shaders have to run at valid resolutions. You can also input and output data in any format you like, and take advantage of accelerated format conversion using UAVs (I think).
I'd recommend using compute shaders, since they are meant for general-purpose computation and are a lot easier to work with. Your overall application will probably be faster too, even if the actual shader computation time is about the same, just because you can avoid some of the hoops you have to jump through to get pixel shaders to do what you want.
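The thread-count flexibility can be illustrated with simple dispatch arithmetic in Python; the element count and group size are arbitrary example values.

```python
import math

# Sketch of the thread-count flexibility mentioned above: a compute
# dispatch covers exactly the elements you have (rounded up to whole
# thread groups), while a pixel-shader pass must rasterize a target
# at some valid WxH resolution.

def compute_dispatch(num_elements, group_size=64):
    # Number of thread groups needed to cover num_elements.
    return (num_elements + group_size - 1) // group_size

def pixel_pass_threads(num_elements):
    # A full-screen pass runs one pixel-shader invocation per texel,
    # so the data must be padded up to some WxH texture.
    w = math.ceil(math.sqrt(num_elements))
    h = math.ceil(num_elements / w)
    return w * h

n = 1000
print(compute_dispatch(n))        # 16 groups of 64 threads
print(compute_dispatch(n) * 64)   # 1024 threads, 24 idle
print(pixel_pass_threads(n))      # 1024 texels for a 32x32 target
```

Here the waste happens to be similar, but the compute path also skips the quad setup, rasterization, and float4 packing described above.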
