I've got the shader below (pieces removed for length and clarity), and would like to find a better way to do this. I would like to send an array of textures, whose size is variable, to my Metal shader. I'll do some calculations on the vertex positions and then figure out which texture to use.
Currently I have just hard-coded things and used several if statements, but this is ugly (and I'm guessing not fast). Is there any way I can compute i and then use i as a texture subscript (like tex[i].sample)?
// Current code - it's ugly
fragment half4 SimpleTextureFragment(VertextOut inFrag [[stage_in]],
                                     texture2d<half> tex0 [[ texture(0) ]],
                                     texture2d<half> tex1 [[ texture(1) ]],
                                     texture2d<half> tex2 [[ texture(2) ]],
                                     ...
                                     texture2d<half> texN [[ texture(N) ]])
{
    constexpr sampler quad_sampler;
    half4 color;

    int i = (Compute_Correct_Texture_to_Use);
    if(i == 0)
    {
        color = tex0.sample(quad_sampler, inFrag.tex_coord);
    }
    else if(i == 1)
    {
        color = tex1.sample(quad_sampler, inFrag.tex_coord);
    }
    ...
    else if(i == N)
    {
        color = texN.sample(quad_sampler, inFrag.tex_coord);
    }
    return color;
}
You are right that your method will not be fast. Best case, the shader will have lots of branching (which is not good); worst case, the shader will actually sample from ALL of your textures and then discard the results it does not use (which will be even slower).
This is not a case that GPUs handle particularly well, so my advice would be to refactor your approach slightly to be more GPU-friendly. Without knowing more about what you are doing at a higher level, my first suggestion would be to consider using 2D array textures.
A 2D array texture essentially merges X 2D textures into a single texture with X slices. You only have to pass a single texture to Metal, and you can calculate which slice to sample from in the shader exactly as you are already doing. With this approach you get rid of all the 'if' branches and only need to call sample once, like this: tex.sample( my_sampler, inFrag.tex_coord, i );
If your textures are all the same size and format, this will work very easily: you just copy each of your 2D textures into a slice of the array texture (a host-side sketch follows the links below). If your textures differ in size or format, you may have to work around that, possibly by stretching some of them so they all end up with the same dimensions.
See here for docs: https://developer.apple.com/library/ios/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Mem-Obj/Mem-Obj.html#//apple_ref/doc/uid/TP40014221-CH4-SW10
(look for 'Texture Slices')
Also here: https://developer.apple.com/library/prerelease/ios/documentation/Metal/Reference/MTLTexture_Ref/index.html#//apple_ref/c/econst/MTLTextureType2DArray
Metal shader languages docs here: https://developer.apple.com/library/ios/documentation/Metal/Reference/MetalShadingLanguageGuide/std-lib/std-lib.html#//apple_ref/doc/uid/TP40014364-CH5-SW17 (look for '2D Texture Array')
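For the host side, here is a minimal sketch of building such an array texture, assuming all source images are the same size and already decoded to BGRA8 bytes. Only the MTL* calls are Metal API; makeArrayTexture, device, imagePixelData, width and height are placeholder names for whatever you already have.

import Metal

// Build one type2DArray texture and fill each slice from CPU pixel data.
func makeArrayTexture(device: MTLDevice,
                      imagePixelData: [[UInt8]],   // one BGRA8 byte array per image
                      width: Int, height: Int) -> MTLTexture? {
    let descriptor = MTLTextureDescriptor()
    descriptor.textureType = .type2DArray          // one texture, many slices
    descriptor.pixelFormat = .bgra8Unorm
    descriptor.width = width
    descriptor.height = height
    descriptor.arrayLength = imagePixelData.count  // one slice per source image

    guard let texture = device.makeTexture(descriptor: descriptor) else { return nil }

    let region = MTLRegionMake2D(0, 0, width, height)
    let bytesPerRow = width * 4
    for (slice, pixels) in imagePixelData.enumerated() {
        pixels.withUnsafeBytes { buffer in
            // Upload this image into its slice of the array texture.
            texture.replace(region: region, mipmapLevel: 0, slice: slice,
                            withBytes: buffer.baseAddress!,
                            bytesPerRow: bytesPerRow,
                            bytesPerImage: bytesPerRow * height)
        }
    }
    return texture
}

On the shader side the parameter then becomes texture2d_array<half> tex [[ texture(0) ]], and the computed i is passed as the third argument to sample, as shown above.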
I'm writing some metal code that draws a skybox. I'd like for the depth output by the vertex shader to always be 1, but of course, I'd also like the vertices to be drawn in their correct positions.
In OpenGL, you could use glDepthRange(1,1) to have the depth always be written out as 1.0 in this scenario. I don't see anything similar in Metal. Does such a thing exist? If not, is there another way to always output 1.0 as the depth from the vertex shader?
What I'm trying to accomplish is drawing the scenery first and then drawing the skybox to avoid overdraw. If I just set the z component of the outgoing vertex to 1.0, then the geometry doesn't draw correctly, obviously. What are my options here?
Looks like you can specify the fragment shader output (return value) format roughly like so:
struct MyFragmentOutput {
    // color attachment 0
    float4 color_att [[color(0)]];
    // depth attachment
    float depth_att [[depth(depth_argument)]];
};
as seen in the section "Fragment Function Output Attributes" on page 88 of the Metal Shading Language Specification (https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf). Looks like 'any' is a working value for depth_argument (see here for more: In metal how to clear the depth buffer or the stencil buffer?).
Then you would set your fragment shader to use that format:
fragment MyFragmentOutput interestingShaderFragment
// instead of: fragment float4 interestingShaderFragment
and finally just write to the depth buffer in your fragment shader:
MyFragmentOutput out;
out.color_att = float4(rgb_color_here, 1.0);
out.depth_att = 1.0;
return out;
Tested and it worked.
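One host-side detail worth checking (my assumption, not part of the answer above): if the scenery pass clears the depth buffer to 1.0, a skybox that also writes exactly 1.0 only passes the depth test with a lessEqual compare function, and the pipeline needs a depth attachment pixel format declared so the [[depth(...)]] output has somewhere to go. A sketch, with device and pipelineDescriptor as placeholder names:

// Depth state for the skybox pass: a written depth of 1.0 must still pass
// against a buffer that was cleared to 1.0, so use lessEqual rather than less.
let depthDescriptor = MTLDepthStencilDescriptor()
depthDescriptor.depthCompareFunction = .lessEqual
depthDescriptor.isDepthWriteEnabled = false   // writes not needed if the skybox is drawn last
let skyboxDepthState = device.makeDepthStencilState(descriptor: depthDescriptor)

// The render pipeline must declare the depth attachment format it writes to.
pipelineDescriptor.depthAttachmentPixelFormat = .depth32Float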
Here's my fragment shader.
constexpr sampler ColorSampler;
fragment half4 T(VertexOut Vertex [[stage_in]],
                 texture2d_array<float> Texture [[texture(0)]]) {
    return half4(Vertex.Color, 1) *
           half4(Texture.sample(ColorSampler, Vertex.TexCoord.xy / Vertex.TexCoord.z, 0));
}
This compiles. However, how do I pass a texture2d_array<float> from the Swift side that this shader will accept? I have a variable called Textures of type [MTLTexture?], but how do I pass that as an array to texture buffer 0? This doesn't work.
CommandEncoder.setFragmentTextures(Textures, range: 0..<Textures.count)
It doesn't work because it's putting each texture in the texture buffer of its index. How do I properly pass a texture array to the fragment function?
How do I do it?
And is what I'm trying even a good idea?
And how am I expected to research if almost no in-depth content about this exists?
setFragmentTextures does indeed bind each texture individually, one texture index per element.
If you want to use texture2d_array in your shader, you'll have to create a single MTLTexture with MTLTextureType type2DArray and copy your existing textures into its slices.
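Something along these lines should work, assuming every element of Textures is non-nil and all of them share the same size and pixel format (device and commandQueue stand in for your existing objects):

// Describe one array texture with as many slices as there are source textures.
let sources = Textures.compactMap { $0 }
let descriptor = MTLTextureDescriptor()
descriptor.textureType = .type2DArray
descriptor.pixelFormat = sources[0].pixelFormat
descriptor.width = sources[0].width
descriptor.height = sources[0].height
descriptor.arrayLength = sources.count
let arrayTexture = device.makeTexture(descriptor: descriptor)!

// Blit each existing 2D texture into its slice of the array texture.
let blitCommandBuffer = commandQueue.makeCommandBuffer()!
let blit = blitCommandBuffer.makeBlitCommandEncoder()!
for (slice, source) in sources.enumerated() {
    blit.copy(from: source, sourceSlice: 0, sourceLevel: 0,
              sourceOrigin: MTLOrigin(x: 0, y: 0, z: 0),
              sourceSize: MTLSize(width: source.width, height: source.height, depth: 1),
              to: arrayTexture, destinationSlice: slice, destinationLevel: 0,
              destinationOrigin: MTLOrigin(x: 0, y: 0, z: 0))
}
blit.endEncoding()
blitCommandBuffer.commit()

// Bind the single array texture; the shader sees it as texture2d_array at texture(0).
CommandEncoder.setFragmentTexture(arrayTexture, index: 0)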
I want to optimize the fragment shader performance. Currently my fragment shader is
fragment half4 fragmen_shader_texture(VertexOutTexture vIn [[stage_in]],
                                      texture2d<half> texture [[texture(0)]]) {
    constexpr sampler defaultSampler;
    half4 color = half4(texture.sample(defaultSampler, vIn.textureCoordinates));
    return color;
}
Its only task is to return the texture color. Is there any way to optimize this further?
There are no options for optimizing the fragment shader itself as far as I can tell; it's doing virtually nothing other than sampling the texture. However, depending on your situation, there might still be scope for optimization by:
Reducing bandwidth usage with a more compact texture format (565 or 4444 instead of 8888, or better still 4-bit or 2-bit PVRTC).
Making sure that alpha blending is disabled if it is not required.
If the texture has lots of 'empty space' (e.g. a particle texture with a central circular blob and blank corners), making the geometry fit it more tightly, for instance by rendering it as an octagon rather than as a quad.
Enabling mipmapping if there's any possibility the image can be minified, while disabling more expensive mipmapping options like trilinear/anisotropic filtering (a host-side sketch of the format and mipmapping points follows).
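A rough sketch of the texture-format and mipmapping points, assuming a 16-bit 565 source image; device, commandQueue, width and height are placeholder names:

// Ask for a compact 16-bit format and allocate a full mip chain.
let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .b5g6r5Unorm,
                                                          width: width, height: height,
                                                          mipmapped: true)
let texture = device.makeTexture(descriptor: descriptor)!

// ... fill level 0 with texture.replace(region:mipmapLevel:withBytes:bytesPerRow:) ...

// Let the GPU build the smaller mip levels.
let commandBuffer = commandQueue.makeCommandBuffer()!
let blit = commandBuffer.makeBlitCommandEncoder()!
blit.generateMipmaps(for: texture)
blit.endEncoding()
commandBuffer.commit()

// Keep the sampler cheap: nearest mip selection instead of trilinear
// (this applies if you create the sampler on the CPU side rather than
// as a constexpr sampler in the shader).
let samplerDescriptor = MTLSamplerDescriptor()
samplerDescriptor.minFilter = .linear
samplerDescriptor.magFilter = .linear
samplerDescriptor.mipFilter = .nearest
let sampler = device.makeSamplerState(descriptor: samplerDescriptor)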
I'm trying to render a large number of very small 2D quads as fast as possible on an Apple A7 GPU using the Metal API. Researching that GPU's triangle throughput numbers, e.g. here, and from Apple quoting >1M triangles on screen during their keynote demo, I'd expect to be able to render something like 500,000 such quads per frame at 60fps. Perhaps a bit less, given that all of them are visible (on screen, not hidden by z-buffer) and tiny (tricky for the rasterizer), so this likely isn't a use case that the GPU is super well optimized for. And perhaps that Apple demo was running at 30fps, so let's say ~200,000 should be doable. Certainly 100,000 ... right?
However, in my test app the max is just ~20,000 -- more than that and the framerate drops below 60 on an iPad Air. With 100,000 quads it runs at 14 fps, i.e. at a throughput of 2.8M triangles/sec (compare that to the 68.1M onscreen triangles quoted in the AnandTech article!).
Even if I make the quads a single pixel small, with a trivial fragment shader, performance doesn't improve. So we can assume this is vertex bound, and the GPU report in Xcode agrees ("Tiler" is at 100%). The vertex shader is trivial as well, doing nothing but a little scaling and translation math, so I'm assuming the bottleneck is some fixed-function stage...?
Just for some more background info, I'm rendering all the geometry using a single instanced draw call, with one quad per instance, i.e. 4 vertices per instance. The quad's positions are applied from a separate buffer that's indexed by instance id in the vertex shader. I've tried a few other methods as well (non-instanced with all vertices pre-transformed, instanced+indexed, etc), but that didn't help. There are no complex vertex attributes, buffer/surface formats, or anything else I can think of that seems likely to hit a slow path in the driver/GPU (though I can't be sure of course). Blending is off. Pretty much everything else is in the default state (things like viewport,scissor,ztest,culling,etc).
The application is written in Swift, though hopefully that doesn't matter ;)
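For concreteness, my guess at the Swift side of the draw call described above (encoder, unitQuadBuffer, positionsBuffer, parmsBuffer and quadCount are placeholder names):

// One quad per instance, 4 vertices per instance, positions indexed by instance id.
encoder.setVertexBuffer(unitQuadBuffer, offset: 0, index: 0)   // 4 corner vertices
encoder.setVertexBuffer(positionsBuffer, offset: 0, index: 1)  // per-quad QuadState data
encoder.setVertexBuffer(parmsBuffer, offset: 0, index: 2)      // camera/projection parms
encoder.drawPrimitives(type: .triangleStrip, vertexStart: 0,
                       vertexCount: 4, instanceCount: quadCount)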
What I'm trying to understand is whether the performance I'm seeing is expected when rendering quads like this (as opposed to a "proper" 3D scene), or whether some more advanced techniques are needed to get anywhere close to the advertised triangle throughputs. What do people think is likely the limiting bottleneck here?
Also, if anyone knows any reason why this might be faster in OpenGL than in Metal (I haven't tried, and can't think of any reason), then I'd love to hear it as well.
Thanks
Edit: adding shader code.
vertex float4 vertex_shader(
    const constant float2* vertex_array [[ buffer(0) ]],
    const device QuadState* quads [[ buffer(1) ]],
    constant const Parms& parms [[ buffer(2) ]],
    unsigned int vid [[ vertex_id ]],
    unsigned int iid [[ instance_id ]] )
{
    float2 v = vertex_array[vid] * 0.5f;
    v += quads[iid].position;

    // ortho cam and projection transform
    v += parms.cam.position;
    v *= parms.cam.zoom * parms.proj.scaling;

    return float4(v, 0, 1.0);
}

fragment half4 fragment_shader()
{
    return half4(0.773, 0.439, 0.278, 0.4);
}
Without seeing your Swift/Objective-C code I cannot be sure, but my guess is that you are spending too much on per-instance overhead. Instancing is useful when each instance is a model with hundreds of triangles in it, not two.
Try creating a vertex buffer with 1000 quads in it and see if the performance increases.
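One way to read that suggestion, sketched under my own assumptions (device and quadPositions are placeholder names): expand every quad into six vertices (two triangles) in one big vertex buffer on the CPU and issue a single non-instanced draw for all of them.

import Metal

struct QuadVertex {
    var position: SIMD2<Float>
}

// Expand each quad center into two triangles and pack them into one MTLBuffer.
func makeQuadVertexBuffer(device: MTLDevice,
                          quadPositions: [SIMD2<Float>],
                          halfSize: Float) -> MTLBuffer? {
    var vertices: [QuadVertex] = []
    vertices.reserveCapacity(quadPositions.count * 6)
    for center in quadPositions {
        let corners: [SIMD2<Float>] = [
            SIMD2(-halfSize, -halfSize), SIMD2( halfSize, -halfSize), SIMD2( halfSize,  halfSize),
            SIMD2(-halfSize, -halfSize), SIMD2( halfSize,  halfSize), SIMD2(-halfSize,  halfSize),
        ]
        for corner in corners {
            vertices.append(QuadVertex(position: center + corner))
        }
    }
    return device.makeBuffer(bytes: vertices,
                             length: vertices.count * MemoryLayout<QuadVertex>.stride,
                             options: [])
}

// Draw: encoder.setVertexBuffer(buffer, offset: 0, index: 0)
//       encoder.drawPrimitives(type: .triangle, vertexStart: 0,
//                              vertexCount: quadPositions.count * 6)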
I'm trying to implement the technique described at : Compositing Images with Depth.
The idea is to use an existing texture (loaded from an image) as a depth mask, to basically fake 3D.
The problem I face is that glDrawPixels is not available in OpenGL ES. Is there a way to accomplish the same thing on the iPhone?
The depth buffer is harder to get at in OpenGL ES than you might think: not only is glDrawPixels absent, but gl_FragDepth has been removed from GLSL. So you can't write a custom fragment shader to spool values to the depth buffer as you might push colours.
The most obvious solution is to pack your depth information into a texture and to use a custom fragment shader that does a depth comparison between the fragment it generates and one looked up from a texture you supply. Only if the generated fragment is closer is it allowed to proceed. The normal depth buffer will catch other cases of occlusion and — in principle — you could use a framebuffer object to create the depth texture in the first place, giving you a complete on-GPU round trip, though it isn't directly relevant to your problem.
Disadvantages are that drawing will cost you an extra texture unit and textures use integer components.
EDIT: for the purposes of keeping the example simple, suppose you were packing all of your depth information into the red channel of a texture. That'd give you a really low precision depth buffer, but just to keep things clear, you could write a quick fragment shader like:
void main()
{
    // write a value to the depth map
    gl_FragColor = vec4(gl_FragCoord.w, 0.0, 0.0, 1.0);
}
That stores depth in the red channel. So you've partially recreated the old depth texture extension: you'll have an image with brighter red in pixels that are closer and darker red in pixels that are further away. I think that in your question, you'd actually load this image from disk.
To then use the texture in a future fragment shader, you'd do something like:
uniform sampler2D depthMap;

void main()
{
    // read the stored value from the depth map
    // (gl_FragCoord.xy is in window coordinates, so in practice you'd scale it
    // into [0, 1] or use a varying texture coordinate here)
    lowp vec4 colourFromDepthMap = texture2D(depthMap, gl_FragCoord.xy);

    // discard the current fragment if it is less close than the stored value
    if(colourFromDepthMap.r > gl_FragCoord.w) discard;

    ... set gl_FragColor appropriately otherwise ...
}
EDIT2: you can see a much smarter mapping from depth to an RGBA value here. To tie in directly to that document, OES_depth_texture definitely isn't supported on the iPad or on the third generation iPhone. I've not run a complete test elsewhere.