I have a high number of variables in my fragment shader: 30 uniforms (mostly vec4) and about 20 local variables (vec3, float, vec4). It runs just fine on an iPhone 5S, but I have a serious problem on an iPhone 4: GPU time is 1 s / frame, and 98% of that time is shader run time.
According to the Apple documentation:
OpenGL ES limits the number of each variable type you can use in a
vertex or fragment shader. The OpenGL ES specification doesn’t require
implementations to provide a software fallback when these limits are
exceeded; instead, the shader simply fails to compile or link. When
developing your app you must ensure that no errors occur during shader
compilation, as shown in Listing 10-1.
But I don't quite understand this. Do they provide a software fallback or not? I get no errors during compilation or linking of the shader, and yet performance is poor. I have commented out almost everything, leaving just two texture lookups and the directional-light computation, and changed the other functions to simply return vec4(0,0,0,0).
The limit on uniforms is much higher than what you are using. GLSL ES (2.0) requires 512 scalar uniform components per vertex shader (though ES describes this in terms of the number of vectors: 128). Assuming all 30 of your uniforms were vec4, you would still have storage for 98 more.
The relevant limits are gl_MaxVertexUniformVectors and gl_MaxFragmentUniformVectors. Implementations are only required to support 16 in the fragment shader, but most will far exceed the minimum - check the values yourself. Query the limits from GL ES rather than trying to figure them out in your GLSL program with some Frankenstein shader code ;)
OpenGL ES 2.0 Shading Language - Appendix A: Limitations - p. 113
const mediump int gl_MaxVertexAttribs = 8;
const mediump int gl_MaxVertexUniformVectors = 128;
const mediump int gl_MaxVaryingVectors = 8;
const mediump int gl_MaxVertexTextureImageUnits = 0;
const mediump int gl_MaxCombinedTextureImageUnits = 8;
const mediump int gl_MaxTextureImageUnits = 8;
const mediump int gl_MaxFragmentUniformVectors = 16;
const mediump int gl_MaxDrawBuffers = 1;
In fact, it would be a good idea to query all of the GLSL program / shader limits just to get a better idea of the constraints you need to work under for your target software/hardware. It is better to plan ahead than to wait to address these things until your program blows up.
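For example, here is a minimal sketch in C of querying those limits at runtime. It assumes a current OpenGL ES 2.0 context already exists, and the helper name is just illustrative (on iOS the header is <OpenGLES/ES2/gl.h> rather than the Khronos one):

#include <GLES2/gl2.h>   /* on iOS: #include <OpenGLES/ES2/gl.h> */
#include <stdio.h>

/* Print the implementation-dependent shader limits for the current context. */
static void print_shader_limits(void)
{
    GLint vertUniforms = 0, fragUniforms = 0, varyings = 0, texUnits = 0;
    glGetIntegerv(GL_MAX_VERTEX_UNIFORM_VECTORS,   &vertUniforms);
    glGetIntegerv(GL_MAX_FRAGMENT_UNIFORM_VECTORS, &fragUniforms);
    glGetIntegerv(GL_MAX_VARYING_VECTORS,          &varyings);
    glGetIntegerv(GL_MAX_TEXTURE_IMAGE_UNITS,      &texUnits);
    printf("max vertex uniform vectors:   %d\n", vertUniforms);
    printf("max fragment uniform vectors: %d\n", fragUniforms);
    printf("max varying vectors:          %d\n", varyings);
    printf("max fragment texture units:   %d\n", texUnits);
}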
As for software fallbacks, I doubt it. This is an embedded environment, there is not much need for such a thing. When developing the actual software on a PC/Mac, they usually ship with a reference software implementation mostly for testing purposes. Individual components may sometimes fallback to software to overcome hardware limitations, but that is necessary because of the wide variety of hardware in Apple's Mac line alone. But when you are writing an app that is specifically written for a single specification of hardware it is generally acceptable to give a complete failure if you try to do something that exceeds the limitations (which you are expected to be familiar with).
Related
What's the best practice for precision qualifiers (lowp, mediump, highp)? Is there any performance difference?
What's the best practice for them?
For the most part these only matter on mobile. The spec says an implementation can always use a higher precision, so on desktop both the vertex shader and the fragment shader always run in highp. (I know of no desktop GPU for which this is not true.)
From the spec section 4.5.2
4.5.2 Precision Qualifiers
...
Precision qualifiers declare a minimum range and precision that the underlying implementation must use
when storing these variables. Implementations may use greater range and precision than requested, but
not less.
For mobile and tablets there are several answers. There is no single best; it's up to you:
use the lowest precision you can that still does what you need it to do.
use highp and ignore the perf issues and the old phones where it doesn't work
use mediump and ignore the bugs (See below)
check if the user's device supports highp, and if not, use different shaders with fewer features (see the sketch just below).
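As a rough sketch of that last option in C, using the standard GL ES 2.0 call glGetShaderPrecisionFormat (the helper name is mine); an unsupported precision reports zero range and zero precision:

#include <GLES2/gl2.h>

/* Returns nonzero if the fragment shader stage supports highp floats. */
static int fragment_highp_supported(void)
{
    GLint range[2] = {0, 0};
    GLint precision = 0;
    glGetShaderPrecisionFormat(GL_FRAGMENT_SHADER, GL_HIGH_FLOAT, range, &precision);
    return range[0] != 0 || range[1] != 0 || precision != 0;
}

WebGL exposes the same query as gl.getShaderPrecisionFormat.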
In WebGL, vertex shaders default to highp; fragment shaders have no default, so you have to specify one. Further, highp in the fragment shader is an optional feature and some mobile GPUs don't support it. I don't know what percentage that is in 2019. AFAIK most or maybe even all phones shipping in 2019 support highp, but older phones (2011, 2012, 2013) don't.
From the spec:
The vertex language requires any uses of lowp, mediump and highp to compile and link without error.
The fragment language requires any uses of lowp and mediump to compile without error. Support for
highp is optional.
An example of a place you generally need highp: Phong-shaded point lights usually need it. So, for example, you might use only directional lights on a system that doesn't support highp, or you might use only directional lights on mobile for performance.
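Relatedly, the precision choice can also be made inside the shader itself via the standard GLSL ES macro GL_FRAGMENT_PRECISION_HIGH, which is predefined whenever fragment highp is available. A sketch of that preamble, written here as a C string literal:

/* Common fragment-shader preamble: prefer highp, fall back to mediump. */
static const char *frag_preamble =
    "#ifdef GL_FRAGMENT_PRECISION_HIGH\n"
    "precision highp float;\n"
    "#else\n"
    "precision mediump float;\n"
    "#endif\n";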
Is there any performance difference?
Yes, but as it says above, an implementation is free to use a higher precision. So if you use mediump on a desktop GPU you won't see any perf difference, since it's really using highp anyway. On mobile you will see a perf difference, at least in 2019. You may also see where your shaders really needed highp.
Consider a Phong shader set to use mediump. On desktop, since mediump is actually highp, it works. On mobile, where mediump is actually mediump, it breaks.
An example where mediump would be fine, at least in the fragment shader, is most 2D games.
I am trying to digest these two links:
https://www.khronos.org/opengl/wiki/Rendering_Pipeline_Overview
https://www.khronos.org/opengl/wiki/Vertex_Shader
The pipeline overview says that the vertex shader runs before primitive assembly.
The second one mentions this:
A vertex shader is (usually) invariant with its input. That is, within a single Drawing Command, two vertex shader invocations that get the exact same input attributes will return binary identical results. Because of this, if OpenGL can detect that a vertex shader invocation is being given the same inputs as a previous invocation, it is allowed to reuse the results of the previous invocation, instead of wasting valuable time executing something that it already knows the answer to.
OpenGL implementations generally do not do this by actually comparing the input values (that would take far too long). Instead, this optimization typically only happens when using indexed rendering functions. If a particular index is specified more than once (within the same Instanced Rendering), then this vertex is guaranteed to result in the exact same input data.
Therefore, implementations employ a cache on the results of vertex shaders. If an index/instance pair comes up again, and the result is still in the cache, then the vertex shader is not executed again. Thus, there can be fewer vertex shader invocations than there are vertices specified.
So if I have two quads with two triangles each:
indexed:
verts: { 0 1 2 3 }
tris: { 0 1 2 }
{ 1 2 3 }
soup:
verts: { 0 1 2 3 4 5 }
tris: { 0 1 2 }
{ 3 4 5 }
and perhaps a vertex shader that looks like this:
uniform mat4 mvm;
uniform mat4 pm;
attribute vec3 position;

void main () {
    vec4 res;
    for ( int i = 0; i < 256; i++ ) {
        res = pm * mvm * vec4(position, 1.);
    }
    gl_Position = res;
}
Should I care that one has 4 vertices while the other has 6? Is this even consistent from GPU to GPU? Will one invoke the vertex shader 4 times vs. 6? How is this affected by the cache:
If an index/instance pair comes up again, and the result is still in the cache...
How is the primitive count related to performance here? In both cases I have the same number of primitives.
In the case of a very simple fragment shader, but an expensive vertex shader:
void main () {
    gl_FragColor = vec4(1.);
}
And a tessellated quad (100x100 segments): can I say that the indexed version will run faster, or can run faster, or can I say nothing at all?
Like everything with GPUs, according to the spec you can say nothing; it's up to the driver and the GPU. In reality, though, in your example 4 vertices will run faster than 6 pretty much everywhere.
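For context, this is roughly how the two submissions from the question would be issued through the GL ES C API (vertex attribute setup omitted; client-side indices used purely for brevity):

#include <GLES2/gl2.h>

/* Indexed version: 4 unique vertices, 6 indices. The shared vertices
   (1 and 2) are what the post-transform cache can reuse. */
static void draw_indexed(void)
{
    static const GLushort indices[6] = { 0, 1, 2, 1, 2, 3 };
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, indices);
}

/* "Soup" version: 6 independent vertices, nothing for the cache to reuse. */
static void draw_soup(void)
{
    glDrawArrays(GL_TRIANGLES, 0, 6);
}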
Search for "vertex order optimization" and lots of articles come up:
Linear-Speed Vertex Cache Optimisation
Triangle Order Optimization
AMD Triangle Order Optimization Tool
Triangle Order Optimization for Graphics Hardware Computation Culling
Unrelated, but another example of the spec vs. reality: according to the spec, depth testing happens AFTER the fragment shader runs (otherwise you couldn't set gl_FragDepth in the fragment shader). In reality, though, as long as the results are the same the driver/GPU can do whatever it wants, so fragment shaders that don't set gl_FragDepth or discard fragments are depth tested first and only run if the test passes.
In OpenGL ES it is possible to set the precision of uniforms and attributes using lowp/mediump/highp. Is there something like this in Metal?
The Metal shading language supports the half data type (see section 2.1 of the spec). It's defined there as:
A 16-bit floating-point. The half data type must conform to the IEEE 754 binary16 storage format.
This makes it pretty much equivalent to mediump.
There isn't really an equivalent to lowp in Metal. However, that's no real loss, because I believe Metal-capable iOS GPUs don't benefit from lowp anyway and just perform lowp operations at mediump.
I am trying to learn WebGL and I have the following fragment shader. So far I have actually managed to get my PC to reset spontaneously, and also to get Windows to inform me that my graphics driver crashed and restarted. All through JS in a browser!
Now I have progressed to the point where nothing happens at all; the WebGL renderer just goes into hibernation. The code below isn't intended to do anything useful; I am just learning syntax, so don't worry about the fact that it's not going to put anything on the screen. The question is: why does this kill my GPU?
precision mediump float;

uniform sampler2D tex;

void main(void)
{
    const int gsl = 1024;
    vec4 texel = vec4(0.5, 0.5, 0.5, 1.0);
    for (int i = 0; i < gsl; i++)
    {
        float xpos = mod(float(i), 256.0);
        float ypos = float(i) / 256.0;
        vec2 vTextureCoord = vec2(xpos, ypos);
        texel = texture2D(tex, vTextureCoord);
    }
    gl_FragColor = texel;
}
Most likely it's because the shader is too slow.
Unlike CPUs, GPUs do not have preemptible multitasking (at least not yet). That means when you give a GPU something to do, it has to run it to completion. There's no interrupting it like you can with a CPU.
So, for example, if you ask a GPU to draw 1,000,000 fullscreen polygons, even a fast GPU will take several seconds, during which time it can do nothing else. Similarly, if you give it a very expensive per-pixel fragment shader and draw a lot of pixels with it, it will take a very long time during which the GPU can't be interrupted. If you gave it something that took 30 minutes, the user could not use their machine for 30 minutes.
The solution is that the OS times how long each GPU operation takes. If an operation takes too long (say 2-3 seconds), the OS just resets the GPU. At that point the OS has no idea how far the GPU got in the current operation. A good OS/driver then kills only the one context that issued the bad draw call; an older OS kills all contexts across all programs.
Note, of course, that "too long" depends on the GPU. A fast GPU can do things in moments that a slow GPU might take seconds to do. Also, different GPUs have different types of optimizations.
TL;DR: Your shader probably "crashed" because it ran too slowly and the OS reset the GPU.
Alright, so this has been bugging me for a while now, and I could not find anything on MSDN that goes into the specifics that I need.
This is more of a 3 part question, so here it goes:
1-) When creating the swap chain, applications specify a backbuffer pixel format, most often either B8G8R8A8 or R8G8B8A8. That gives 8 bits per color channel, so a total of 4 bytes is used per pixel... so why does the pixel shader have to return a color as a float4, when a float4 is actually 16 bytes?
2-) When binding textures to the pixel shader, my textures are in DXGI_FORMAT_B8G8R8A8_UNORM format, but why does the sampler need a float4 per pixel to work?
3-) Am I missing something here? Am I overthinking this, or what?
Please provide links to support your claims, preferably from MSDN!
GPUs are designed to perform calculations on 32-bit floating-point data, at least if they want to support D3D11. As of D3D10 you can also perform 32-bit signed and unsigned integer operations. There's no requirement or language support for types smaller than 4 bytes in HLSL, so there's no "byte"/"char" or "short" for 1- and 2-byte integers or lower-precision floating point.
Any DXGI formats that use the "FLOAT", "UNORM" or "SNORM" suffix are non-integer formats, while "UINT" and "SINT" are unsigned and signed integer formats. Any reads performed by the shader on the first three types will be provided to the shader as 32-bit floating point, irrespective of whether the original format was 8-bit UNORM/SNORM or 10/11/16/32-bit floating point. Data in vertices is usually stored at a lower precision than full-fat 32-bit floating point to save memory, but by the time it reaches the shader it has already been converted to 32-bit float.
On output (to UAVs or render targets) the GPU converts the "float" or "uint" data to whatever format the target was created with. If you try outputting float4(4.4, 5.5, 6.6, 10.1) to a target that is 8-bit normalised, it will simply be clamped to (1.0, 1.0, 1.0, 1.0) and consume only 4 bytes per pixel.
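As a rough sketch of what that 8-bit UNORM round trip amounts to (the exact rounding rules are defined by the D3D spec; this is just the general idea, written in C):

#include <stdint.h>

/* Write path: float -> 8-bit UNORM. Clamp to [0, 1], scale by 255, round.
   float4(4.4, 5.5, 6.6, 10.1) therefore ends up stored as (255, 255, 255, 255). */
static uint8_t float_to_unorm8(float v)
{
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return (uint8_t)(v * 255.0f + 0.5f);
}

/* Read path (what the shader sees): 8-bit UNORM -> 32-bit float in [0, 1]. */
static float unorm8_to_float(uint8_t v)
{
    return (float)v / 255.0f;
}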
So to answer your questions:
1) Because shaders only operate on 32-bit types, but the GPU will convert/truncate your output as necessary to store it in the resource you currently have bound, according to that resource's format. It would be madness to have special keywords and types for every format the GPU supported.
2) The "sampler" doesn't "need a float4 per pixel to work". I think you're mixing up your terminology. Declaring the texture as Texture2D<float4> is really just stating that the texture has four components and is of a format that is not an integer format. "float" doesn't necessarily mean the source data is 32-bit float (or actually even floating point), but merely that the data has a fractional component to it (e.g. 0.54, 1.32). Equally, declaring a texture as Texture2D<uint4> doesn't necessarily mean the source data is 32-bit unsigned, but rather that it contains four components of unsigned integer data. However, the data will be returned to you converted to 32-bit float or 32-bit integer for use inside the shader.
3) You're missing the fact that the GPU decompresses textures / vertex data on reads and compresses them again on writes. The amount of storage used for your vertex/texture data is only as much as the format you create the resource with, and has nothing to do with the fact that the shader operates on 32-bit floats / integers.