I'm doing depth peeling with a very simple fragment function:
struct VertexOut {
    float4 position [[ position ]];
};

fragment void depthPeelFragment(VertexOut in [[ stage_in ]],
                                depth2d<float, access::read> previousDepth [[ texture(0) ]])
{
    float4 p = in.position;
    if (!is_null_texture(previousDepth) && p.z <= previousDepth.read(uint2(p.xy)))
    {
        discard_fragment();
    }
}
(My depth buffer pixel format is MTLPixelFormatDepth32Float)
This works well on my Mac. In each pass I submit the same geometry, and eventually no more fragments are written and the process terminates. For example, with a test sphere, there are two passes with the same number of fragments written each pass (front and back hemispheres).
However, on the iPad the process does not terminate. Some fragments (not all), despite having been rendered in the previous pass, are not discarded in subsequent passes.
What platform differences could cause this?
Is the z-coordinate of the position attribute always the value written to the depth buffer?
According to an Apple engineer, it's not a logarithmic depth buffer.
Note that I cannot simply limit the number of passes (I'm not using this for OIT).
Update:
Here's what the depth texture looks like on the 3rd pass via GPU capture (the green represents the bounds of rendered geometry):
The distribution of points makes this look like a floating point accuracy issue.
Furthermore, if I add an epsilon to the previous depth value:
p.z <= previousDepth.read(uint2(p.xy)) + .0000001
then the process does terminate on the iPad. However, the results aren't accurate enough for downstream use.
Related
The Metal Shading Language includes a lot of mathematical functions, but it seems most of the code in the official Metal documentation just uses it to map vertices from pixel space to clip space, like this:
RasterizerData out;
out.clipSpacePosition = vector_float4(0.0, 0.0, 0.0, 1.0);
float2 pixelSpacePosition = vertices[vertexID].position.xy;
vector_float2 viewportSize = vector_float2(*viewportSizePointer);
out.clipSpacePosition.xy = pixelSpacePosition / (viewportSize / 2.0);
out.color = vertices[vertexID].color;
return out;
Apart from GPGPU work using kernel functions for parallel computation, what can a vertex function do? Some examples would help. In a game, if all vertex positions are calculated by the CPU, why does the GPU still matter? What does a vertex function usually do?
Vertex shaders compute properties for vertices. That's their point. In addition to vertex positions, they also calculate lighting normals at each vertex. And potentially texture coordinates. And various material properties used by lighting and shading routines. Then, in the fragment processing stage, those values are interpolated and sent to the fragment shader for each fragment.
In general, you don't modify vertices on the CPU. In a game, you'd usually load them from a file into main memory, put them into a buffer and send them to the GPU. Once they're on the GPU, you pass them to the vertex shader on each frame along with model, view, and projection matrices. A single buffer containing the vertices of, say, a tree or a car's wheel might be used multiple times. Each time, all the CPU sends is the model, view, and projection matrices. The model matrix is used in the vertex shader to reposition and scale the vertices in world space. The view matrix then moves and rotates the world so that the virtual camera is at the origin and facing the appropriate way. Then the projection matrix transforms the vertices into clip space.
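For example, a minimal Metal vertex function along those lines might look something like this (the struct names, attribute indices, and Uniforms layout are just illustrative assumptions, not code from any particular project):

struct VertexIn {
    float3 position [[ attribute(0) ]];
    float3 normal   [[ attribute(1) ]];
    float2 texCoord [[ attribute(2) ]];
};

struct VertexOut {
    float4 position [[ position ]];
    float3 worldNormal;
    float2 texCoord;
};

struct Uniforms {
    float4x4 model;
    float4x4 view;
    float4x4 projection;
};

vertex VertexOut basic_vertex(VertexIn in [[ stage_in ]],
                              constant Uniforms& uniforms [[ buffer(1) ]])
{
    VertexOut out;
    // model -> world, world -> view, view -> clip space
    float4 worldPos = uniforms.model * float4(in.position, 1.0);
    out.position = uniforms.projection * uniforms.view * worldPos;
    // rotate the normal into world space (assumes no non-uniform scale)
    out.worldNormal = (uniforms.model * float4(in.normal, 0.0)).xyz;
    out.texCoord = in.texCoord;
    return out;
}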
There are other things a vertex shader can do, too. You can pass in vertices that are in a grid in the x-y plane, for example. Then in your vertex shader you can sample a texture and use that to generate the z-value. This gives you a way to change the geometry using a height map.
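A rough sketch of that height-map idea (the buffer indices, sampler setup, and the assumption that the grid spans 0..1 so positions double as texture coordinates are all made up for the example):

// Displace a flat x-y grid using a height texture sampled in the vertex function.
vertex float4 heightfield_vertex(const device float2* gridPositions [[ buffer(0) ]],
                                 constant float4x4& mvp            [[ buffer(1) ]],
                                 texture2d<float> heightMap        [[ texture(0) ]],
                                 uint vid                          [[ vertex_id ]])
{
    constexpr sampler heightSampler(filter::linear, address::clamp_to_edge);
    float2 xy = gridPositions[vid];
    // assume the grid spans 0..1 in both axes so it can be reused directly as UVs;
    // explicit LOD because implicit derivatives aren't available in a vertex function
    float height = heightMap.sample(heightSampler, xy, level(0.0)).r;
    return mvp * float4(xy.x, xy.y, height, 1.0);
}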
On older hardware (and some lower-end mobile hardware) it was expensive to do calculations on a texture coordinate before using it to sample from a texture, because you lose some cache coherency. For example, if you wanted to sample several pixels in a column, you might loop over them, adding an offset to the current texture coordinate and then sampling with the result. One trick was to do the calculation on the texture coordinates in the vertex shader, have them automatically interpolated before being sent to the fragment shader, and then do a normal look-up in the fragment shader. (I don't think this is an optimization on modern hardware, but it was a big win on some older models.)
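Something like this sketch, for instance, where the offsets are applied per vertex and the fragment function only does straight look-ups (all names and the single-axis offset are hypothetical):

// Compute the offset tap coordinates per vertex; the rasterizer interpolates them,
// so the fragment function only does plain, non-dependent texture reads.
struct TapsOut {
    float4 position [[ position ]];
    float2 tap0;
    float2 tap1;
    float2 tap2;
};

vertex TapsOut taps_vertex(const device float2* positions [[ buffer(0) ]],
                           const device float2* uvs       [[ buffer(1) ]],
                           constant float& texelHeight    [[ buffer(2) ]],
                           uint vid                       [[ vertex_id ]])
{
    TapsOut out;
    out.position = float4(positions[vid], 0.0, 1.0);
    out.tap0 = uvs[vid];
    out.tap1 = uvs[vid] + float2(0.0, texelHeight);
    out.tap2 = uvs[vid] + float2(0.0, 2.0 * texelHeight);
    return out;
}

fragment half4 taps_fragment(TapsOut in [[ stage_in ]],
                             texture2d<half> tex [[ texture(0) ]])
{
    constexpr sampler s(filter::linear);
    // three straight look-ups, no math on the coordinates here
    return (tex.sample(s, in.tap0) + tex.sample(s, in.tap1) + tex.sample(s, in.tap2)) / 3.0h;
}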
First, I'll address this statement
In a game, if all vertex positions are calculated by the CPU, why does the GPU still matter? What does a vertex function usually do?
I don't believe I've seen anyone calculating positions on the CPU for meshes that will later be rendered on a GPU. It's slow, you would need to get all that data from the CPU to the GPU (which means copying it across a bus if you have a dedicated GPU), and it's just not that flexible. Much more than vertex positions is required to produce any meaningful image, and calculating all of it on the CPU is just wasteful, since the CPU doesn't care about this data for the most part.
The sole purpose of a vertex shader is to provide the rasterizer with primitives that are in clip space. But there are some other uses, mostly tricks based on different GPU features.
For example, vertex shaders can write data out to buffers, so you can stream out transformed geometry if you don't want to transform it again later, for instance when multi-pass rendering uses the same geometry in more than one pass.
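In Metal, vertex functions can write to device buffers (on feature sets that support function buffer writes), so a sketch of that stream-out idea could look like this (buffer indices and names are made up):

// Transform once, keep the result: the vertex function returns the clip-space
// position for this pass and also writes it to a device buffer so a later pass
// can reuse it without re-running the transform.
vertex float4 transform_and_capture(const device packed_float3* inPositions [[ buffer(0) ]],
                                    device float4* capturedPositions        [[ buffer(1) ]],
                                    constant float4x4& mvp                  [[ buffer(2) ]],
                                    uint vid                                [[ vertex_id ]])
{
    float4 clipPos = mvp * float4(float3(inPositions[vid]), 1.0);
    capturedPositions[vid] = clipPos;   // streamed-out copy for later passes
    return clipPos;
}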
You can also use a vertex shader to output just one triangle that covers the whole screen, so that the fragment shader gets called once per pixel for the whole screen (but, honestly, you are better off using compute (kernel) shaders for this).
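The usual trick for that single full-screen triangle is to derive the three corners from the vertex id, so no vertex buffer is needed at all; a sketch in Metal:

// One triangle that covers the whole screen, generated purely from the vertex id.
vertex float4 fullscreen_triangle(uint vid [[ vertex_id ]])
{
    // vid 0,1,2 -> (0,0), (2,0), (0,2); after the scale/offset below this becomes
    // a triangle large enough to cover the whole -1..1 clip-space square
    float2 p = float2((vid << 1) & 2, vid & 2);
    return float4(p * 2.0 - 1.0, 0.0, 1.0);
}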
You can also write data out from the vertex shader without generating any visible primitives, by emitting degenerate triangles. You can use this to compute bounding boxes: using atomic operations you can update min/max positions and read them back at a later stage. This is useful for light culling, frustum culling, tile-based processing and many other things.
But, and it's a BIG BUT, you can do most of this in a compute shader without making the GPU run the whole vertex-assembly pipeline. That means you can do full-screen effects using just a compute shader (instead of a vertex and fragment shader and the many pipeline stages in between, such as the rasterizer, primitive culling, depth testing and output merging). You can calculate bounding boxes and do light culling or frustum culling in a compute shader.
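For instance, a full-screen effect as a compute kernel might be as simple as this sketch (the textures and the tint applied are purely illustrative):

// Sketch of a full-screen effect done as a compute kernel instead of a draw call.
kernel void fullscreen_tint(texture2d<half, access::read>  src [[ texture(0) ]],
                            texture2d<half, access::write> dst [[ texture(1) ]],
                            uint2 gid [[ thread_position_in_grid ]])
{
    // guard against the dispatch grid being slightly larger than the texture
    if (gid.x >= dst.get_width() || gid.y >= dst.get_height())
        return;

    half4 c = src.read(gid);
    dst.write(half4(c.rgb * half3(1.0h, 0.8h, 0.8h), c.a), gid);
}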
There are reasons to fire up the whole rendering pipeline instead of just running a compute shader, for example if you will actually use the triangles output from the vertex shader, or if you aren't sure how primitives are laid out in memory and need the vertex assembler to do the heavy lifting of assembling them. But, getting back to your point, almost all of the reasonable uses for a vertex shader involve outputting primitives in clip space. If you aren't using the resulting primitives, it's probably best to stick to compute shaders.
I have been testing WebGL to see whether I can batch-draw polygons in a particular way. I am going to simplify the use case, but it goes something along the lines of the following:
First, my vertices are simply:
vertices = [v0_xyz, v1_xyz, ... vn_xyz]
In my case, each vertex must have a z value in the range (0 - 100) (I pick 100 arbitrarily) because I want all of those vertices to be depth tested against each other using those z values. On batch N + 1, I am limited to depth values (0 - 100) again, but I need the vertices in this batch to be guaranteed to be drawn atop all previous batches (layers of vertices). In other words, vertices within each batch are depth tested against each other, but each batch is simply drawn atop the previous one as if there were no depth testing between batches.
At first I was going to try drawing to a texture with framebuffer and depth-buffer attachments, drawing that to the canvas, and repeating for the next group of vertices, but I realized that I might be able to do just this:
// pseudocode
function drawBuffers()
    // clear both the color and the depth
    gl.clearDepth(1.0);
    gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);

    // iterate over all vertex batches
    for each vertexBatch in vertexBatches do
        // draw the batch with depth testing
        gl.draw(vertexBatch);

        // clear the depth buffer
        /* QUESTION: does this guarantee that subsequent batches
           will be drawn atop previous batches, or will the pixels be written at
           random (sometimes underneath, sometimes above)?
        */
        gl.clearDepth(1.0);
        gl.clear(gl.DEPTH_BUFFER_BIT);
    endfor
end drawBuffers
I tested the above by drawing two overlapping quads, clearing the depth buffer, translating left and in negative z (in an attempt to "go under" the previous batch), and drawing the two overlapping quads again. I think this works, because I see that the second pair of quads is drawn in front of the first pair even though its z values are behind the previous pair's z values.
I am not certain that my test is reliable though. Could there be some undefined behavior involved? Is it just a coincidence that my test works as a result of the clearDepth setting and shapes?
May I have clarification so I can confirm whether my method will work for sure?
Thank you.
Since WebGL is based on OpenGL ES, see the OpenGL ES 1.1 Full Specification, section 4.1.6 Depth Buffer Test, page 104:
The depth buffer test discards the incoming fragment if a depth comparison fails.
....
The comparison is specified with
void DepthFunc( enum func );
This command takes a single symbolic constant: one of NEVER, ALWAYS, LESS, LEQUAL, EQUAL, GREATER, GEQUAL, NOTEQUAL. Accordingly, the depth buffer test passes never, always, if the incoming fragment’s zw value is less than, less than or equal to, equal to, greater than, greater than or equal to, or not equal to the depth value stored at the location given by the incoming fragment’s (xw, yw) coordinates.
This means, if the clear value for the depth buffer glClearDepth is 1.0 (1.0 is the initial value)
gl.clearDepth(1.0);
and the depth buffer is cleared
gl.clear(gl.DEPTH_BUFFER_BIT);
and the depth function glDepthFunc is LESS or LEQUAL (LESS is the initial value)
gl.enable(gl.DEPTH_TEST);
gl.depthFunc(gl.LEQUAL);
then the next fragment drawn to any (xw, yw) coordinate will pass the depth test and will overwrite the fragment stored at that location.
(Of course, blending (gl.BLEND) has to be disabled and the fragment has to lie inside the clip volume.)
In OpenGL, depth buffer values are calculated based on the near and far clipping planes of the scene. (Reference: Getting the true z value from the depth buffer)
How does this work in WebGL? My understanding is that WebGL is unaware of my scene's far and near clipping planes. The near and far clipping planes are used to calculate my projection matrix, but I never tell WebGL what they are explicitly so it can't use them to calculate depth buffer values.
How does WebGL set values in the depth buffer when my scene is rendered?
WebGL (like modern OpenGL and OpenGL ES) gets the depth value from the value you supply to gl_Position.z in your vertex shader (you can also write directly to the depth buffer using certain extensions, but that's far less common).
There is no scene in WebGL nor in modern OpenGL. The concept of a scene is part of legacy OpenGL, left over from the early 90s and long since deprecated. It doesn't exist in OpenGL ES (the OpenGL that runs on Android, iOS, ChromeOS, Raspberry Pi, WebGL, etc.).
Modern OpenGL and WebGL are just rasterization APIs. You write shaders which are small functions that run on the GPU. You provide those shaders with data through attributes (per iteration data), uniforms (global variables), textures (2d/3d arrays), varyings (data passed from vertex shaders to fragment shaders).
The rest is up to you and what your supplied shader functions do. Modern OpenGL and WebGL are for all intents and purposes just generic computing engines with certain limits. To get them to do anything is up to you to supply shaders.
See webglfundamentals.org for more.
In the Q&A you linked to, it's the programmer-supplied shaders that decide to use frustum math to set gl_Position.z. The frustum math is supplied by the programmer. WebGL/GL don't care how gl_Position.z is computed, only that, after division by gl_Position.w, it ends up between -1.0 and +1.0, so how to take a value from the depth buffer and go back to Z is solely up to how the programmer decided to calculate it in the first place.
This article covers the most commonly used math for setting gl_Position.z when rendering 3d with WebGL/OpenGL. Based on your question though I'd suggest reading the preceding articles linked at the beginning of that one.
As for what actual values get written to the depth buffer, it's
ndcZ = gl_Position.z / gl_Position.w;
depthValue = (far - near) / 2 * ndcZ + (far + near) / 2
near and far default to 0 and 1 respectively (you can set them with gl.depthRange), so assuming they are 0 and 1:
ndcZ = gl_Position.z / gl_Position.w;
depthValue = 0.5 * ndcZ + 0.5
That depthValue is then in the 0 to 1 range and converted to whatever bit depth the depth buffer has. It's common to have a 24-bit depth buffer, so
bitValue = depthValue * (2^24 - 1)
I've got the shader below (pieces removed for length and clarity) and would like to find a better way to do this. I would like to send an array of textures, whose size is variable, to my Metal shader. I'll do some calculations on the vertex positions and then figure out which texture to use.
Currently I have just hard-coded things and used several if statements, but this is ugly (and I'm guessing not fast). Is there any way I can compute i and then use i as a texture subscript (like tex[i].sample)?
// Current code - it's ugly
fragment half4 SimpleTextureFragment(VertexOut inFrag [[ stage_in ]],
                                     texture2d<half> tex0 [[ texture(0) ]],
                                     texture2d<half> tex1 [[ texture(1) ]],
                                     texture2d<half> tex2 [[ texture(2) ]],
                                     ...
                                     texture2d<half> texN [[ texture(N) ]]
                                     )
{
    constexpr sampler quad_sampler;
    int i = (Compute_Correct_Texture_to_Use);

    half4 color;
    if (i == 0)
    {
        color = tex0.sample(quad_sampler, inFrag.tex_coord);
    }
    else if (i == 1)
    {
        color = tex1.sample(quad_sampler, inFrag.tex_coord);
    }
    ...
    else if (i == N)
    {
        color = texN.sample(quad_sampler, inFrag.tex_coord);
    }
    return color;
}
You are right that your method will not be fast. Best case, the shader will have lots of branching (which is not good); worst case, the shader will actually sample from ALL your textures and then discard the results it does not use (which is even slower).
This is not a case that GPUs handle particularly well, so my advice would be to slightly refactor your approach to be more GPU friendly. Without knowing more about what you are doing at a higher level, my first suggestion would be to consider using 2d array textures.
2D array textures essentially merge X 2D textures into a single texture with X slices. You only have to pass a single texture to Metal, and you can calculate which slice to sample from in the shader exactly as you are already doing, but with this approach you get rid of all the 'if' branches and only need to call sample once, like this: tex.sample(my_sampler, inFrag.tex_coord, i);
If your textures are all the same size and format, this will work very easily. You just have to copy each of your 2D textures into a slice of the array texture. If your textures differ in size or format, you may have to work around that, possibly by stretching some so that they all end up with the same dimensions.
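Putting that together, the fragment function from the question might reduce to something like this sketch (carrying the slice index through the stage-in struct is an assumption here; you could equally compute it in the fragment function as you do now):

struct VertexOut {
    float4 position [[ position ]];
    float2 tex_coord;
    uint   slice [[ flat ]];   // which array slice this quad should sample
};

fragment half4 SimpleTextureFragment(VertexOut inFrag [[ stage_in ]],
                                     texture2d_array<half> tex [[ texture(0) ]])
{
    constexpr sampler quad_sampler;
    // one sample call replaces the whole if/else chain
    return tex.sample(quad_sampler, inFrag.tex_coord, inFrag.slice);
}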
See here for docs: https://developer.apple.com/library/ios/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Mem-Obj/Mem-Obj.html#//apple_ref/doc/uid/TP40014221-CH4-SW10
(look for 'Texture Slices')
Also here: https://developer.apple.com/library/prerelease/ios/documentation/Metal/Reference/MTLTexture_Ref/index.html#//apple_ref/c/econst/MTLTextureType2DArray
Metal shader languages docs here: https://developer.apple.com/library/ios/documentation/Metal/Reference/MetalShadingLanguageGuide/std-lib/std-lib.html#//apple_ref/doc/uid/TP40014364-CH5-SW17 (look for '2D Texture Array')
I'm trying to render a large number of very small 2D quads as fast as possible on an Apple A7 GPU using the Metal API. Researching that GPU's triangle throughput numbers, e.g. here, and from Apple quoting >1M triangles on screen during their keynote demo, I'd expect to be able to render something like 500,000 such quads per frame at 60fps. Perhaps a bit less, given that all of them are visible (on screen, not hidden by z-buffer) and tiny (tricky for the rasterizer), so this likely isn't a use case that the GPU is super well optimized for. And perhaps that Apple demo was running at 30fps, so let's say ~200,000 should be doable. Certainly 100,000 ... right?
However, in my test app the max is just ~20,000 -- more than that and the framerate drops below 60 on an iPad Air. With 100,000 quads it runs at 14 fps, i.e. a throughput of 2.8M triangles/sec (compare that to the 68.1M onscreen triangles quoted in the AnandTech article!).
Even if I make the quads a single pixel small, with a trivial fragment shader, performance doesn't improve. So we can assume this is vertex bound, and the GPU report in Xcode agrees ("Tiler" is at 100%). The vertex shader is trivial as well, doing nothing but a little scaling and translation math, so I'm assuming the bottleneck is some fixed-function stage...?
Just for some more background info: I'm rendering all the geometry with a single instanced draw call, with one quad per instance, i.e. 4 vertices per instance. The quads' positions are read from a separate buffer indexed by instance id in the vertex shader. I've tried a few other methods as well (non-instanced with all vertices pre-transformed, instanced+indexed, etc.), but that didn't help. There are no complex vertex attributes, buffer/surface formats, or anything else I can think of that seems likely to hit a slow path in the driver/GPU (though I can't be sure, of course). Blending is off. Pretty much everything else is in the default state (viewport, scissor, ztest, culling, etc.).
The application is written in Swift, though hopefully that doesn't matter ;)
What I'm trying to understand is whether the performance I'm seeing is expected when rendering quads like this (as opposed to a "proper" 3D scene), or whether some more advanced techniques are needed to get anywhere close to the advertised triangle throughputs. What do people think is the likely limiting bottleneck here?
Also, if anyone knows any reason why this might be faster in OpenGL than in Metal (I haven't tried, and can't think of any reason), then I'd love to hear it as well.
Thanks
Edit: adding shader code.
vertex float4 vertex_shader(
    const constant float2* vertex_array [[ buffer(0) ]],
    const device QuadState* quads       [[ buffer(1) ]],
    constant const Parms& parms         [[ buffer(2) ]],
    unsigned int vid                    [[ vertex_id ]],
    unsigned int iid                    [[ instance_id ]] )
{
    float2 v = vertex_array[vid] * 0.5f;
    v += quads[iid].position;

    // ortho cam and projection transform
    v += parms.cam.position;
    v *= parms.cam.zoom * parms.proj.scaling;

    return float4(v, 0, 1.0);
}

fragment half4 fragment_shader()
{
    return half4(0.773, 0.439, 0.278, 0.4);
}
Without seeing your Swift/Objective-C code I cannot be sure, but my guess is you are spending too much time in your instancing setup. Instancing is useful when you have a model with hundreds of triangles in it, not just two.
Try creating a vertex buffer with 1000 quads in it and see if the performance increases.
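One way to try that direction, sketched against the shader in the question (untested; the six-entry corner buffer and the non-instanced draw of 6*N vertices are assumptions):

// Sketch: non-instanced variant of the shader above. The corner buffer holds the
// six corners of a unit quad (two triangles); the quad index comes from vertex_id,
// so the per-instance overhead disappears and one draw covers many quads.
vertex float4 vertex_shader_noninstanced(
    const constant float2* corner_array [[ buffer(0) ]],   // 6 entries
    const device QuadState* quads       [[ buffer(1) ]],
    constant const Parms& parms         [[ buffer(2) ]],
    unsigned int vid                    [[ vertex_id ]] )
{
    unsigned int quadIndex   = vid / 6;
    unsigned int cornerIndex = vid % 6;

    float2 v = corner_array[cornerIndex] * 0.5f;
    v += quads[quadIndex].position;

    // same ortho camera and projection transform as before
    v += parms.cam.position;
    v *= parms.cam.zoom * parms.proj.scaling;

    return float4(v, 0, 1.0);
}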