Once more: triangle strips vs triangle lists - DirectX

I decided to build my engine on triangle lists after reading (a while ago) that indexed triangle lists perform better because fewer draw calls are needed. Today I stumbled on 0xFFFFFFFF, which in DX is considered a strip-cut index, so you can draw multiple strips in one call. Does this mean that triangle lists no longer hold superior performance?

It is possible to draw multiple triangle strips in a single draw call using degenerate triangles, which have an area of zero. A strip cut is made by simply repeating the last vertex of the previous strip and the first vertex of the next strip, adding two elements per strip break (the duplicated vertices show up as a short run of zero-area triangles).
New in Direct3D 10 are the strip-cut index (for indexed geometry) and the RestartStrip HLSL function. Both can be used to replace the degenerate-triangles method, effectively cutting down the bandwidth cost: instead of two indices per cut, only one is needed.
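To make the difference concrete, here is a minimal sketch (C++, with made-up vertex numbering) of the same two sub-strips packed into one index buffer, first with degenerate triangles and then with a strip-cut index:

    #include <cstdint>
    #include <vector>

    // Two triangle strips, A = {0,1,2,3} and B = {4,5,6,7}, drawn in one call.

    // Pre-D3D10 method: repeat the last index of A and the first index of B.
    // The two duplicated indices turn into a run of zero-area triangles.
    std::vector<uint32_t> withDegenerates = { 0, 1, 2, 3,  3, 4,  4, 5, 6, 7 };

    // D3D10 method: a single strip-cut index (0xFFFFFFFF for 32-bit indices)
    // restarts the strip, costing one element per cut instead of two.
    std::vector<uint32_t> withStripCut = { 0, 1, 2, 3,  0xFFFFFFFF,  4, 5, 6, 7 };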
Expressiveness
Can any primitive list be converted to an equivalent strip and vice versa? Strip-to-list conversion is of course trivial. For list-to-strip conversion we have to assume that we can cut the strip; then we can map each primitive in the list to a one-primitive sub-strip, though this would not be useful.
So, at least for triangle primitives, strips and lists have always had the same expressiveness. Before Direct3D 10, strip cuts in line strips were not possible, so line strips and line lists actually were not equally expressive.
Memory and Bandwidth
How much data needs to be sent to the GPU? In order to compare the methods we need to be able to calculate the number of elements needed for a certain topology.
Primitive List Formula
N ... total number of elements (vertices or indices)
P ... total number of primitives
n ... elements per primitive (point => 1, line => 2, triangle => 3)
N = Pn
Primitive Strip Formula
N, P, n ... same as above
S ... total number of sub-strips
o ... primitive overlap
c ... strip cut penalty
N = P(n-o) + So + c(S-1)
Primitive overlap describes the number of elements shared by adjacent primitives. In a classical triangle strip a triangle reuses two vertices of the previous primitive, so the overlap is 2. In a line strip only one vertex is shared between lines, so the overlap is 1. A triangle strip using an overlap of 1 is theoretically possible but has no representation in Direct3D.
Strip cut penalty is the number of elements needed to start a new sub-strip; it depends on the method used. Using strip-cut indices the penalty is 1, since one index separates two strips. Using degenerate triangles the penalty is 2, since two duplicated vertices are needed per strip break.
From these formulas we can deduce that it depends on the geometry which method needs the least space.
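As a quick worked example (the numbers are made up): take P = 100 triangles split into S = 10 sub-strips with the usual overlap o = 2. The little C++ program below just evaluates the formulas:

    #include <cstdio>

    int main()
    {
        int P = 100, n = 3, o = 2, S = 10;
        int list       = P * n;                              // 300 elements
        int stripIndex = P * (n - o) + S * o + 1 * (S - 1);  // c = 1 -> 129
        int stripDegen = P * (n - o) + S * o + 2 * (S - 1);  // c = 2 -> 138
        printf("list: %d, strip-cut: %d, degenerate: %d\n",
               list, stripIndex, stripDegen);
        return 0;
    }

Here the strips win clearly, but with many tiny sub-strips (large S) the advantage shrinks and eventually disappears.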
Caching
One important property of strips is the high temporal locality of the data. When a new primitive is assembled, each vertex needs to be fetched from GPU memory; for a triangle this has to be done three times. Since memory access is slow, processors use multiple levels of caches, and in the best case the data needed is already stored in the cache, reducing the access time. For triangle strips the last two vertices of the previous primitive are reused, almost guaranteeing that two of the three vertices are already present in the cache.
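You can get a feel for this with a toy cache model (a plain FIFO; real GPU post-transform caches differ in size and policy, so treat this purely as an illustration):

    #include <algorithm>
    #include <deque>
    #include <vector>

    // Counts how many indices of a primitive stream miss a FIFO cache of
    // the given size. Strip-ordered streams revisit recent vertices, so
    // they miss far less often than badly ordered lists.
    int countCacheMisses(const std::vector<int>& indices, size_t cacheSize)
    {
        std::deque<int> cache;
        int misses = 0;
        for (int idx : indices) {
            if (std::find(cache.begin(), cache.end(), idx) == cache.end()) {
                ++misses;
                cache.push_back(idx);
                if (cache.size() > cacheSize)
                    cache.pop_front();
            }
        }
        return misses;
    }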
Ease of Use
As stated above, converting a list to a strip is very simple. The problem is converting a list to an efficient primitive strip by reducing the number of sub-strips. For simple procedurally generated geometry (e.g. heightfield terrains) this is usually achievable. Writing a converter for existing meshes might be more difficult.
Conclusion
The introduction of Direct3D 10 does not have much impact on the strip vs. list question. Line strips are now equally expressive, and strips get a slight data reduction. In any case, when using strips you always gain the most by reducing the number of sub-strips.

On modern hardware with pre- and post-transform vertex caches, tri-stripping is not a win over indexed triangle lists. The only time you would really use tri-stripping is for non-indexed primitives generated by something where the strips are trivial to compute, such as a terrain system.
Instead, you should do vertex cache optimization of indexed triangle lists for best performance. The Hoppe algorithm is implemented in DirectXMesh, or you can look at Tom Forsyth's alternative algorithm.
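A rough sketch of what that looks like with DirectXMesh (assuming its OptimizeFacesLRU and ReorderIB entry points; check the headers of your version for the exact signatures):

    #include <DirectXMesh.h>
    #include <cstdint>
    #include <vector>

    // Reorder an indexed triangle list for better post-transform cache use.
    bool OptimizeIndexList(std::vector<uint32_t>& indices)
    {
        const size_t nFaces = indices.size() / 3;

        // Compute a face reordering (Forsyth-style LRU heuristic).
        std::vector<uint32_t> faceRemap(nFaces);
        if (FAILED(DirectX::OptimizeFacesLRU(indices.data(), nFaces,
                                             faceRemap.data())))
            return false;

        // Apply the remap to produce the optimized index buffer.
        std::vector<uint32_t> optimized(indices.size());
        if (FAILED(DirectX::ReorderIB(indices.data(), nFaces,
                                      faceRemap.data(), optimized.data())))
            return false;

        indices.swap(optimized);
        return true;
    }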

Related

Color data for each degenerate triangle

I want to specify a color for each strip in my triangle strip separated by degenerate triangles. As of now, I am sending that data with each vertex, so my vertices consist of: [ PosX, PosY, PosZ, ColorX, ColorY, ColorZ, ColorW ]. However, the color data is constant for every vertex until it may change after a degenerate triangle, and therefore I am wasting 8 bytes per vertex (Wide P3 colors).
Obviously using many draw calls instead of degenerate triangles would solve the memory issue (by using a uniform buffer to store the color for each draw call), but it would also create a massive performance overhead, which is unwanted.
Since I only have simple shapes, degenerate triangles are better than index buffers in my case. Although the current implementation of repeating the color data for all vertices in each strip works, I would like to know if there is a more efficient way to pass this data to the GPU. I could not find much about this topic on Google (presumably because degenerate triangles are not very common nowadays). If anyone could give me some insight on how the memory usage may be optimised without worsening the performance (maybe by using another buffer? I imagine I would need to be able to tell in the shader which strip, separated by degenerate triangles, I am currently drawing), it would be highly appreciated.
I am using Swift 4 and Metal.

Implementation of image dilation and erosion

I am trying to figure out an efficient way of implementing image dilation and erosion for binary images. As far as I understand it, the naive way would be:
loop through the image
    if pixel is 1
        loop through the neighborhood based on the structuring
        element's (SE) height and width
            (dilate) substitute each pixel of the image with the value
                     in the corresponding location of the SE
            (erode)  check if the whole neighborhood equals the SE;
                     if so keep all the pixels, else delete the centre
so this means that for each pixel I have to loop through the SE as well, making this O(N*M*W*H) for an N×M image and a W×H structuring element.
Is there a more elegant way of doing this?
Yes, there are!
First you want to decompose (if possible) your structuring element into segments (a square, for example, is the composition of a vertical and a horizontal segment). Then you perform erosion/dilation only on the segments, which already decreases the complexity.
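For illustration, here is a minimal C++ sketch of that decomposition for binary dilation with a (2*rx+1) x (2*ry+1) rectangle: a horizontal 1-D pass followed by a vertical one, dropping the cost from O(W*H*rx*ry) to roughly O(W*H*(rx+ry)). All names are illustrative:

    #include <cstdint>
    #include <vector>

    std::vector<uint8_t> dilateRect(const std::vector<uint8_t>& img,
                                    int w, int h, int rx, int ry)
    {
        std::vector<uint8_t> tmp(img.size(), 0), out(img.size(), 0);

        // Horizontal pass: set a pixel if any pixel within rx columns is set.
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                for (int dx = -rx; dx <= rx; ++dx) {
                    int xx = x + dx;
                    if (xx >= 0 && xx < w && img[y * w + xx]) {
                        tmp[y * w + x] = 1;
                        break;
                    }
                }

        // Vertical pass on the intermediate image completes the rectangle.
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                for (int dy = -ry; dy <= ry; ++dy) {
                    int yy = y + dy;
                    if (yy >= 0 && yy < h && tmp[yy * w + x]) {
                        out[y * w + x] = 1;
                        break;
                    }
                }

        return out;
    }

Erosion is the dual: replace "any neighbor set" with "all neighbors set".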
Now for the erosion/dilation parts, you have different solutions:
If you work only on 8-bit images and do not use C/C++, use an implementation based on histograms in order to keep track of the minimum/maximum value. See this remarkable work here; the author even adds "landmarks" in order to reduce the number of operations.
If you use C/C++ and work on different types of image encodings, then you can use fast comparisons (SSE2, SSE4 and auto-vectorization), as is the case in the SMIL library. In this case you compare row with row instead of working pixel by pixel, using hardware acceleration. It seems to be the fastest library around.
A last approach, slower but working for all types of encoding, is to use the Lemmonier algorithm. It is implemented by the fulguro library.
For disk-shaped structuring elements there is nothing "fast"; you have to use the basic algorithm. For hexagonal structuring elements you can work row by row, but it cannot be parallelized.

Degenerate vertices and GL_LINE_STRIP

I'm on iOS 5.1
I was trying to display several batches of lines from the same vertex array and I wanted to separate them using degenerate vertices, but it does not seem to work: a line is drawn between each batch of vertices.
Googling the problem gave me results saying that degenerate vertices are not compatible with GL_LINE_STRIP, but I'm not really sure about it. Can someone confirm that? And what's the alternative?
As far as I know, you can only draw one continuous line with a single vertex array and GL_LINE_STRIP. The alternative is to use GL_LINES, which treats each vertex pair as an independent line segment. To get a contiguous line, duplicate the last vertex of the previous segment as the start of the next segment in your vertex array.
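A minimal sketch of that expansion (the Vertex layout and function name are placeholders):

    #include <vector>

    struct Vertex { float x, y, z; };  // stand-in for your real layout

    // Expand one line strip into independent GL_LINES segments by
    // duplicating the interior vertices; several expanded strips can then
    // share a single vertex array and a single draw call.
    std::vector<Vertex> stripToLines(const std::vector<Vertex>& strip)
    {
        std::vector<Vertex> lines;
        for (size_t i = 0; i + 1 < strip.size(); ++i) {
            lines.push_back(strip[i]);      // segment start
            lines.push_back(strip[i + 1]);  // segment end, reused as next start
        }
        return lines;
    }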
One possibility that might come to mind would be to use vertices with special values (like infinity, or a w of 0), but those will most probably just get rendered as normal points at some extreme distance (and thus you get weird clipped lines), so this won't work in general.
When drawing triangle strips, you can use degenerate triangles to restart the strip. This works by duplicating a vertex (or better, two consecutive ones), which results in a triangle (or rather several) that degenerates to a line and thus has zero area (and is not drawn). But look at a line strip: when duplicating a vertex there, you get a line that degenerates to a point (and is thus not drawn), yet the moment the next sub-strip starts you add a new vertex, and since two distinct vertices always make a valid line, duplicating vertices cannot produce a line-strip restart.
So there is no real way to put multiple line strips into a single draw call using degenerate vertices (though modern desktop GL has other ways to do it). The best idea would probably be to just use an ordinary line set (GL_LINES), as Drewsmits suggests. Yes, you will roughly double the number of vertices (if your strips are very long), but the reduced driver overhead from batching may well outweigh the additional memory and copying cost.
Whereas you can't use degenerate vertices in a line strip, you can use a primitive restart index (usually the maximum index value possible for the numeric type), also called a strip-cut index in Direct3D parlance. It can be configured using glPrimitiveRestartIndex.
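For completeness, a sketch of how that looks in desktop OpenGL (3.1 or later; this is not available in OpenGL ES 2.0 and therefore not on iOS 5.1). The buffer setup is abbreviated:

    // Cut a line strip inside one indexed draw via a primitive restart index.
    glEnable(GL_PRIMITIVE_RESTART);
    glPrimitiveRestartIndex(0xFFFF);  // must match the index type below

    GLushort indices[] = { 0, 1, 2, 0xFFFF, 3, 4, 5 };  // two strips, one call
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);  // ibo assumed created earlier
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices,
                 GL_STATIC_DRAW);
    glDrawElements(GL_LINE_STRIP, 7, GL_UNSIGNED_SHORT, nullptr);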

How to divide a runtime procedural generated world into chunks

I've been thinking of making a top-down 2D game with a pseudo-infinite runtime procedural generated world. I've read several articles about procedural generation and, maybe I've misread or misunderstood them, but I have yet to come across one explaining how to divide the world into chunks (like Minecraft apparently does).
Obviously, I need to generate only the part of the world that the player can currently see. If my game is tile-based, for example, I could divide the world into n*n chunks. If the player were at the border of such a chunk, I would also generate the adjacent chunk(s).
What I can't figure out is how exactly to take a procedural world-generation algorithm and use it on only one chunk at a time. For example, if I have an algorithm that generates a big structure (e.g. a castle, forest or river) that would spread across many chunks, how can I adjust it to generate only one chunk, and afterwards the adjacent chunks?
I apologize if I completely missed something obvious. Thank you in advance!
Study the Midpoint displacement algorithm. Note that the points all along one side are based on the starting values of the corners. You can calculate them without knowing the rest of the grid.
I used this approach to generate terrain, where I needed the edges of each 'chunk' of terrain to line up with the adjacent chunks. Using a variation of the midpoint displacement algorithm, I made the height of each point along the edge of a chunk depend only on the values at the two corners. When I needed randomness, I seeded a random number generator with data from the two corners. This way any two adjacent chunks can be generated independently and the edges are guaranteed to match.
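A minimal sketch of the seeding trick (all names are illustrative, not from the original implementation): any displacement on a shared edge is derived only from the two corners, so both chunks compute an identical border without ever seeing each other.

    #include <cstdint>
    #include <random>

    // Mix the two corner seeds into one (any decent hash works). Pass the
    // corners in a canonical order (e.g. lower coordinate first) so both
    // chunks combine them identically.
    uint64_t combineSeeds(uint64_t a, uint64_t b)
    {
        uint64_t h = a * 0x9E3779B97F4A7C15ull + b;
        h ^= h >> 33; h *= 0xFF51AFD7ED558CCDull; h ^= h >> 33;
        return h;
    }

    // Midpoint height of an edge whose endpoints have heights h0 and h1.
    // The random displacement is seeded purely by the corner seeds, so the
    // same edge always displaces the same way, whichever chunk asks.
    float edgeMidpointHeight(float h0, float h1,
                             uint64_t seed0, uint64_t seed1, float roughness)
    {
        std::mt19937_64 rng(combineSeeds(seed0, seed1));
        std::uniform_real_distribution<float> d(-roughness, roughness);
        return 0.5f * (h0 + h1) + d(rng);
    }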
You can use height-map approaches for other things too. Instead of height, the data could determine vegetation type, population density, etc. Instead of chunks of a height map where the hills and valleys match up, you can have a vegetation map where the forests match up.
It certainly takes some creative programming for any kind of complex world.

XNA/DirectX: Should you always use indices?

I'm implementing billboards for vegetation where a billboard is of course a single quad consisting of two triangles. The vertex data is stored in a vertex buffer, but should I bother with indices? I understand that the savings on things like terrain can be huge in terms of vertices sent to the graphics card when you use indices, but using indices on billboards means that I'll have 4 vertices per quad rather than 6, since each quad is completely separate from the others.
And is it possible that the use of indices actually reduces performance because there is an extra level of indirection? Or isn't that of any significance at all?
I'm asking this because using indices would slightly complicate matters and I'm curious to know if I'm not doing extra work that just makes things slower (whether just in theory or actually noticeable in practice).
This is using XNA, but should apply to DirectX.
Using indices not only saves bandwidth by sending less data to the card, but also reduces the amount of work the vertex shader has to do: the results of the vertex shader can be cached, with the index serving as the key.
If you render lots of this billboarded vegetation and don't change your index buffer, I think you should see a small gain.
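For reference, the usual indexed layout for a batch of quads looks like this (a sketch; the function name is made up, and 16-bit indices cap one buffer at 16384 quads):

    #include <cstdint>
    #include <vector>

    // Builds six indices per quad over four shared vertices: two triangles,
    // (0,1,2) and (0,2,3), relative to the quad's first vertex.
    std::vector<uint16_t> buildQuadIndices(size_t quadCount)
    {
        static const uint16_t pattern[6] = { 0, 1, 2, 0, 2, 3 };
        std::vector<uint16_t> indices;
        indices.reserve(quadCount * 6);
        for (size_t q = 0; q < quadCount; ++q) {
            const uint16_t base = static_cast<uint16_t>(q * 4);
            for (uint16_t p : pattern)
                indices.push_back(static_cast<uint16_t>(base + p));
        }
        return indices;
    }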
For very primitive geometry it might not make any sense to use indices; I wouldn't even bother about performance in that case, since even modest hardware will render millions of triangles a second.
Now, technically, you don't know how the hardware will handle the data internally; it might convert them to indices anyway, because that's the most popular form of geometry representation.
