DirectX world/view matrix multiplication - should it take place on the GPU or the CPU?

I am new to DirectX, but I have been surprised that in most examples I have seen, the world matrix and view matrix are multiplied as part of the vertex shader, rather than being multiplied on the CPU and the result passed to the shader.
For rigid objects this means you multiply the same two matrices once for every single vertex of the object. I know that the GPU can do this in parallel over a number of vertices (I don't really have an idea how many), but isn't this really inefficient, or am I just missing something? I am still new and clueless.

In general, you want to do it on the CPU. However, DirectX 9 has the concept of "preshaders", which means that this multiplication will be done on the CPU up-front. This has been removed from newer APIs, but it may very well be relevant for the examples you're looking at.
Moreover, modern GPUs are extremely fast at ALU operations compared to memory access. With a modestly complex vertex shader (with a texture fetch, maybe), the math required for the matrix multiplication effectively comes for free, so the authors might not have even bothered.
Anyway, the best practice is to pre-multiply everything constant on the CPU. The same applies to moving work from pixel shaders into vertex shaders (if something is constant across a triangle, don't compute it per-pixel).
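As a rough sketch of the "pre-multiply on the CPU" advice in D3D11-era C++ with DirectXMath (the constant-buffer struct and function names here are illustrative, not taken from the examples you mention):

    #include <DirectXMath.h>
    using namespace DirectX;

    // Hypothetical per-object constants: one combined matrix instead of three.
    struct PerObjectConstants
    {
        XMFLOAT4X4 worldViewProj;
    };

    void UpdatePerObject(PerObjectConstants& cb,
                         const XMMATRIX& world,
                         const XMMATRIX& view,
                         const XMMATRIX& proj)
    {
        // Multiply once per object on the CPU...
        XMMATRIX wvp = XMMatrixMultiply(XMMatrixMultiply(world, view), proj);
        // ...and store it transposed for HLSL's default column-major packing.
        XMStoreFloat4x4(&cb.worldViewProj, XMMatrixTranspose(wvp));
    }

The vertex shader then only has to do a single mul(position, worldViewProj) per vertex, no matter how many matrices went into the combination.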

Well, that doesn't sound clueless to me at all; you are absolutely right!
I don't know exactly what examples you have been looking at, but in general you'd pass precalculated matrices as much as possible; that is what semantics like WORLDVIEW (and, even more appropriate for simple shaders, WORLDVIEWPROJECTION) are for.
Exceptions could be cases where the shader code needs access to the separate matrices as well (but even then I'd usually pass the combined matrices too)... or perhaps those examples were all about illustrating matrix multiplication. :-)
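With the D3D9-era effect framework, feeding a combined matrix through that semantic might look roughly like this (a sketch only; 'effect' is assumed to be an ID3DXEffect* created from your .fx file, and error handling is omitted):

    D3DXMATRIX world, view, proj, worldView, worldViewProj;
    // ... world, view and proj filled in elsewhere ...
    D3DXMatrixMultiply(&worldView, &world, &view);
    D3DXMatrixMultiply(&worldViewProj, &worldView, &proj);

    // Look up the parameter annotated with the WORLDVIEWPROJECTION semantic
    // in the .fx file and set it once per object.
    D3DXHANDLE hWvp = effect->GetParameterBySemantic(NULL, "WORLDVIEWPROJECTION");
    effect->SetMatrix(hWvp, &worldViewProj);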

Related

Optimize OpenGL ES 2.0 drawing on iOS

I have this huge model (a helix) created with 2 million vertices at once, and a few million more indices specifying which vertices to use.
I am pretty sure this is a very bad way to draw so many vertices.
I need some hints to where I should start to optimize this?
I thought about copying one turn of my helix (the vertices) and offsetting its z. But in the end I would still be drawing a lot of triangles at once...
How naive are you currently being? As per rickster's comment, there's a serious case of potential premature optimisation here: the correct way to optimise is to find the actual bottlenecks and to widen those.
Knee-jerk thoughts:
Minimise memory bandwidth. Pack your vertices into the smallest space they can fit into (i.e. limit precision where it is acceptable to do so) and make sure all the attributes that describe a single vertex are stored contiguously (i.e. the individual arrays themselves will be interleaved).
Consider breaking your model up to achieve that aim. Instanced drawing, as rickster suggests, is a good idea if the model is sufficiently repetitive. You might also consider what you can do with 65536-vertex segments, since that lets you use 16-bit indices and cuts your index size in half (see the sketch after this list).
Use triangle strips if it allows you to specify the geometry in substantially fewer indices, even if you have to add degenerate triangles.
Consider where the camera will be. Do you really need that level of detail all the way around? Will the whole thing ever even be on screen? If not, then consider level-of-detail solutions and subdivision for culling (both outside the viewport and, within it, via occlusion queries).
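A minimal sketch of the packing/interleaving and 16-bit-index points for ES 2.0 (the attribute locations and the exact quantisation are assumptions; adapt them to your own shader):

    #include <GLES2/gl2.h>
    #include <cstddef>

    // One interleaved, size-reduced vertex: 16 bytes instead of 32+ with full floats.
    struct PackedVertex
    {
        GLshort  position[3];  // quantised position, rescaled in the vertex shader
        GLshort  pad;          // keep 4-byte alignment
        GLbyte   normal[3];    // normals fit comfortably in 8 bits (normalised)
        GLbyte   pad2;
        GLushort texcoord[2];  // normalised 16-bit UVs
    };

    void SetVertexLayout()
    {
        // Assumes a VBO is already bound to GL_ARRAY_BUFFER.
        const GLsizei stride = sizeof(PackedVertex);
        // Attribute locations 0/1/2 belong to a hypothetical shader.
        glVertexAttribPointer(0, 3, GL_SHORT,          GL_FALSE, stride, (void*)offsetof(PackedVertex, position));
        glVertexAttribPointer(1, 3, GL_BYTE,           GL_TRUE,  stride, (void*)offsetof(PackedVertex, normal));
        glVertexAttribPointer(2, 2, GL_UNSIGNED_SHORT, GL_TRUE,  stride, (void*)offsetof(PackedVertex, texcoord));
        glEnableVertexAttribArray(0);
        glEnableVertexAttribArray(1);
        glEnableVertexAttribArray(2);
    }

    // Keeping each segment under 65536 vertices lets the index buffer use
    // GL_UNSIGNED_SHORT, halving index memory compared to 32-bit indices:
    //   glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);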

Which is faster: creating a detailed mesh before execution or tessellating?

For simplicity of the problem let's consider spheres. Let's say I have a sphere, and before execution I know the radius, the position and the triangle count. Let's also say the triangle count is sufficiently large (e.g. ~50k triangles).
Would it be faster generally to create this sphere mesh before hand and stream all 50k triangles to the graphics card, or would it be faster to send a single point (representing the centre of the sphere) and use tessellation and geometry shaders to build the sphere on the GPU?
Would it still be faster if I had 100 of these spheres in different positions? Can I use hull/geometry shaders to create something which I can then combine with instancing?
Tessellation is certainly valuable, especially when combined with displacement from a heightmap. The isolated scenario described in your question is unlikely to fully answer it on its own.
Before using tessellation you would need to know that you are becoming CPU poly/triangle bound and therefore need to start using the GPU to increase the overall triangle count of your game/scene. Calculations are very fast on the GPU, so yes, using multiple tessellation subdivision levels is advisable if you are going to do it... though sometimes I've been happy with just subdividing 3-4 times from a 200-triangle plane.
Mainly, tessellation is used for environmental/static mesh scene objects so that you can spend your triangles on characters and other moving/animated models without becoming CPU bound.
Check out engines like Unity3D and CryEngine for tessellation examples to help with the learning curve.
I just so happen to be working with this at the same time.
In terms of FPS, the pre-computed method would be faster in this situation, since you can upload one giant 50K-triangle sphere payload (like any other model) and draw it in multiple places from there. The tessellation method would be slower, since all the triangles would be generated from a formula, multiple times per frame.
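If the 100 spheres only differ by position, that pre-built mesh pairs naturally with instancing; a D3D11-flavoured sketch (the vertex/instance structs and the input layout are assumptions):

    #include <d3d11.h>

    struct SphereVertex { float position[3]; float normal[3]; }; // illustrative layout
    struct InstanceData { float worldPosition[3]; };             // per-sphere centre

    // Draw one pre-built sphere mesh at 'instanceCount' positions in a single call.
    // The input layout is assumed to read slot 0 per-vertex and slot 1 per-instance.
    void DrawSpheres(ID3D11DeviceContext* context,
                     ID3D11Buffer* sphereVB, ID3D11Buffer* sphereIB, UINT indexCount,
                     ID3D11Buffer* instanceVB, UINT instanceCount)
    {
        UINT strides[2] = { sizeof(SphereVertex), sizeof(InstanceData) };
        UINT offsets[2] = { 0, 0 };
        ID3D11Buffer* buffers[2] = { sphereVB, instanceVB };

        context->IASetVertexBuffers(0, 2, buffers, strides, offsets);
        context->IASetIndexBuffer(sphereIB, DXGI_FORMAT_R32_UINT, 0);

        // ~150K indices for a 50K-triangle sphere, drawn instanceCount times.
        context->DrawIndexedInstanced(indexCount, instanceCount, 0, 0, 0);
    }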

Speed of ComputeShader vs. PixelShader

I've got a question regarding ComputeShader compared to PixelShader.
I want to do some processing on a buffer, and this is possible with both a pixel shader and a compute shader; now I wonder if there is any advantage of one over the other, specifically when it comes to speed. I've had issues with either of them only getting to use 8-bit values, but I should be able to work around that.
Every data point in the output will be calculated using a total of 8 data points surrounding it (an MxN matrix), so I'd think this would be perfect for a pixel shader, since the different outputs don't influence each other at all.
But I was unable to find any benchmarks comparing the two, and now I wonder which one I should aim for. The only target is speed.
From what I understand, shaders are shaders in the sense that they are just programs run by a lot of threads on data. Therefore, in general there should not be any difference in terms of computing power/speed between doing calculations in the pixel shader and in the compute shader. However...
To do calculations in the pixel shader you have to massage your data so that it looks like image data; this means you have to draw a quad first of all, but also that your output must have the 'shape' of a pixel (basically a float4). This data must then be interpreted by your app into something useful.
If you're using the compute shader you can completely control the number of threads to use, whereas for pixel shaders the thread counts have to correspond to a valid resolution. Also, you can input and output data in any format you like and take advantage of accelerated conversion using UAVs (I think).
I'd recommend using compute shaders, since they are meant for general-purpose computation and are a lot easier to work with. Your overall application will probably be faster too, even if the actual shader computation time is about the same, just because you can avoid some of the hoops you have to jump through to get pixel shaders to do what you want.
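To make the "control over the number of threads" point concrete, the C++ side of a compute-shader pass over a width x height buffer might look roughly like this (the 8x8 group size is assumed to match a [numthreads(8, 8, 1)] declaration in the HLSL; the names are illustrative):

    #include <d3d11.h>

    // Runs a compute shader that reads from an SRV in slot t0 and writes to a UAV in u0.
    void RunFilter(ID3D11DeviceContext* context,
                   ID3D11ComputeShader* cs,
                   ID3D11ShaderResourceView* inputSRV,
                   ID3D11UnorderedAccessView* outputUAV,
                   UINT width, UINT height)
    {
        context->CSSetShader(cs, nullptr, 0);
        context->CSSetShaderResources(0, 1, &inputSRV);
        UINT initialCount = 0;
        context->CSSetUnorderedAccessViews(0, 1, &outputUAV, &initialCount);

        // Round up so partial 8x8 tiles at the edges are still covered.
        context->Dispatch((width + 7) / 8, (height + 7) / 8, 1);

        // No full-screen quad, no render target, no pixel-format massaging required.
    }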

Calculation of vertex normals in DirectX

As a learning experience, I'm writing an immediate mode Managed DirectX 9 application.
I'm manually calculating Vertex normals across all triangles in a scene to allow smooth Gouraud shading.
This works as expected, but I'm guessing this is not the most efficient approach. Is it possible to get the GPU to do this for me?
You could in theory generate the vertex normals inside the vertex shader. That involves computation every single time you render a mesh using that shader though, so why not generate them in advance?
If you mean you want to generate them in advance of rendering, but use the GPU instead of the CPU, I would say that it's not worth the bother of speeding up something you are only going to do once. Besides, I'm not sure if DX9 has a way to get computed vertex information back from a shader (DX10 does).
All in all, the best thing to do in most cases is the traditional approach: compute vertex normals in the program that saves the data files containing the meshes, as a pre-computation step. Usually you already have them if the mesh came from a 3D package like Max or Maya, because there is artistic information in the normals; unless you know the whole mesh is supposed to be perfectly smooth (or faceted), they're not computable in the general case.
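For reference, the usual CPU pre-computation is just "accumulate face normals, then normalise". A minimal sketch in C++ with DirectXMath (the vertex/index layout is hypothetical; the Managed DirectX version would be the C# equivalent):

    #include <DirectXMath.h>
    #include <cstdint>
    #include <vector>
    using namespace DirectX;

    // Average the face normals of every triangle touching a vertex (smooth shading).
    // Area weighting falls out naturally because the un-normalised cross product
    // is proportional to the triangle's area.
    std::vector<XMFLOAT3> ComputeVertexNormals(const std::vector<XMFLOAT3>& positions,
                                               const std::vector<uint32_t>& indices)
    {
        std::vector<XMFLOAT3> normals(positions.size(), XMFLOAT3(0.0f, 0.0f, 0.0f));

        for (size_t i = 0; i + 2 < indices.size(); i += 3)
        {
            uint32_t i0 = indices[i], i1 = indices[i + 1], i2 = indices[i + 2];
            XMVECTOR p0 = XMLoadFloat3(&positions[i0]);
            XMVECTOR p1 = XMLoadFloat3(&positions[i1]);
            XMVECTOR p2 = XMLoadFloat3(&positions[i2]);
            XMVECTOR faceNormal = XMVector3Cross(p1 - p0, p2 - p0);

            const uint32_t tri[3] = { i0, i1, i2 };
            for (uint32_t v : tri)
            {
                XMVECTOR n = XMLoadFloat3(&normals[v]) + faceNormal;
                XMStoreFloat3(&normals[v], n);
            }
        }
        for (auto& n : normals)
            XMStoreFloat3(&n, XMVector3Normalize(XMLoadFloat3(&n)));

        return normals;
    }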

XNA/DirectX: Should you always use indices?

I'm implementing billboards for vegetation where a billboard is of course a single quad consisting of two triangles. The vertex data is stored in a vertex buffer, but should I bother with indices? I understand that the savings on things like terrain can be huge in terms of vertices sent to the graphics card when you use indices, but using indices on billboards means that I'll have 4 vertices per quad rather than 6, since each quad is completely separate from the others.
And is it possible that the use of indices actually reduces performance because there is an extra level of indirection? Or isn't that of any significance at all?
I'm asking this because using indices would slightly complicate matters, and I'm curious whether I'd be doing extra work that just makes things slower (whether only in theory or actually noticeable in practice).
This is using XNA, but should apply to DirectX.
Using indices not only saves on bandwidth, by sending less data to the card, but also reduces the amount of work the vertex shader has to do. The results of the vertex shader can be cached if there is an index to use as a key.
If you render lots of this billboarded vegetation and don't change your index buffer, I think you should see a small gain.
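To make the "4 vertices plus 6 indices per quad" layout concrete, generating the index buffer for independent quads is only a few lines; shown here in C++ for illustration, the XNA version being the same pattern with short indices (the winding order is an assumption, flip it for your cull mode):

    #include <cstdint>
    #include <vector>

    // Build indices for quadCount independent quads whose vertices are laid out
    // four at a time (0-1-2-3 per quad, two triangles each).
    // 16-bit indices cap a single buffer at 16384 quads (65536 vertices).
    std::vector<uint16_t> BuildQuadIndices(uint16_t quadCount)
    {
        std::vector<uint16_t> indices;
        indices.reserve(quadCount * 6);
        for (uint16_t q = 0; q < quadCount; ++q)
        {
            uint16_t base = static_cast<uint16_t>(q * 4);
            // Triangle 1: 0-1-2, triangle 2: 0-2-3
            indices.push_back(base + 0); indices.push_back(base + 1); indices.push_back(base + 2);
            indices.push_back(base + 0); indices.push_back(base + 2); indices.push_back(base + 3);
        }
        return indices;
    }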
When it comes to very primitive geometry it might not make sense to use indices; I wouldn't even bother about performance in that case, since even modest hardware will render millions of triangles a second.
Now, technically, you don't know how the hardware will handle the data internally; it might convert it to indexed form anyway, because that's the most common representation of geometry.
