I am trying to learn WebGL and I have the following fragment shader. So far I have managed to make my PC reset spontaneously, and to make Windows tell me that my graphics driver crashed and restarted. All through JS in a browser!
Now I have progressed to the point where nothing happens at all; the WebGL renderer just goes into hibernation. The code below isn't intended to do anything useful. I am just learning the syntax, so don't worry that it won't put anything on the screen. The question is: why does this kill my GPU?
precision mediump float;
uniform sampler2D tex;

void main(void)
{
    const int gsl=1024;
    vec4 texel=vec4(0.5, 0.5, 0.5, 1.0);
    // Every fragment performs gsl texture lookups.
    for(int i = 0; i < gsl; i++)
    {
        float xpos=mod(float(i),256.0);
        float ypos=float(i)/256.0;
        vec2 vTextureCoord=vec2(xpos,ypos);
        texel= texture2D(tex, vTextureCoord);
    }
    gl_FragColor = texel;
}
Most likely it's because the shader is too slow.
Unlike CPUs, GPUs do not have preemptable multitasking (at least not yet). That means when you give a GPU something to do it has to do it to completion. There's no interrupting it like you can with a CPU.
So, for example, if you ask a GPU to draw 1000000 fullscreen polygons, even a fast GPU will take several seconds, during which time it can do nothing else. Similarly, if you give it a very expensive per-pixel fragment shader and draw a lot of pixels with it, it will take a very long time during which the GPU can't be interrupted. If you gave it something that took 30 minutes, the user could not use their machine for 30 minutes.
The solution is that the OS times how long each GPU operation takes. If one takes too long (like 2-3 seconds), then the OS just resets the GPU. At that point the OS has no idea how far the GPU got in the current operation. A good OS/driver then just kills the one context that issued the bad draw call. An older OS kills all contexts across all programs.
Note, of course, that "too long" depends on the GPU. A fast GPU can do things in moments and a slow GPU might take seconds. Also, different GPUs have different types of optimizations.
TL;DR: Your shader probably crashed because it runs too slow and the OS reset the GPU.
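To get a rough feel for why this shader can trip the reset, here is a back-of-the-envelope sketch in Python. The canvas size and the fetch throughput are hypothetical numbers picked for illustration, not measurements; only the loop bound of 1024 and the timeout of a couple of seconds come from the discussion above.

```python
def texture_fetches_per_draw(width, height, loop_iterations):
    """Fetches issued by one fullscreen draw: every pixel runs the whole loop."""
    return width * height * loop_iterations


def estimated_seconds(fetches, fetches_per_second):
    """Time for one draw under an assumed, constant fetch throughput."""
    return fetches / fetches_per_second


def may_trigger_reset(seconds, watchdog_seconds=2.0):
    """True if a single draw would outlast the OS timeout (2 s assumed here)."""
    return seconds > watchdog_seconds


if __name__ == "__main__":
    # Hypothetical 1920x1080 canvas; 1024 is the loop bound from the shader.
    fetches = texture_fetches_per_draw(1920, 1080, 1024)
    # Hypothetical throughput of 1e9 fetches per second for a slow GPU.
    seconds = estimated_seconds(fetches, 1e9)
    print(fetches, round(seconds, 2), may_trigger_reset(seconds))
```

The point of the sketch is only that the work scales with pixels times loop iterations, so a smaller canvas or a shorter loop is the first thing to try.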
I am trying to digest these two links:
The pipeline overview says that vertex shader runs before the primitive assembly.
The second one mentions this:
A vertex shader is (usually) invariant with its input. That is, within a single Drawing Command, two vertex shader invocations that get the exact same input attributes will return binary identical results. Because of this, if OpenGL can detect that a vertex shader invocation is being given the same inputs as a previous invocation, it is allowed to reuse the results of the previous invocation, instead of wasting valuable time executing something that it already knows the answer to.
OpenGL implementations generally do not do this by actually comparing the input values (that would take far too long). Instead, this optimization typically only happens when using indexed rendering functions. If a particular index is specified more than once (within the same Instanced Rendering), then this vertex is guaranteed to result in the exact same input data.
Therefore, implementations employ a cache on the results of vertex shaders. If an index/instance pair comes up again, and the result is still in the cache, then the vertex shader is not executed again. Thus, there can be fewer vertex shader invocations than there are vertices specified.
So if I have two quads with two triangles each:
verts: { 0 1 2 3 }
tris: { 0 1 2 }
{ 1 2 3 }
verts: { 0 1 2 3 4 5 }
tris: { 0 1 2 }
{ 3 4 5 }
and perhaps a vertex shader that looks like this:
uniform mat4 mvm;
uniform mat4 pm;
attribute vec3 position;
void main (){
    vec4 res;
    for ( int i = 0; i < 256; i++ ){
        res = pm * mvm * vec4(position,1.);
    }
    gl_Position = res;
}
Should I care that one has 4 vertices while the other one has 6? Does this even hold from GPU to GPU: will one invoke the vertex shader 4 times and the other 6? How is this affected by the cache:
If an index/instance pair comes up again, and the result is still in the cache...
How is the primitive number related to performance here? In both cases I have the same number of primitives.
In the case of a very simple fragment shader, but an expensive vertex shader:
void main(){
    gl_FragColor = vec4(1.);
}
And given a tessellated quad (100x100 segments), can I say that the indexed version will run faster, or can run faster, or maybe say nothing?
Like everything in GPUs, according to the spec you can say nothing. It's up to the driver and GPU. In reality, though, in your example 4 vertices will run faster than 6 pretty much everywhere.
Search for vertex order optimization and lots of articles come up:
Linear-Speed Vertex Cache Optimisation
Triangle Order Optimization
AMD Triangle Order Optimization Tool
Triangle Order Optimization for Graphics Hardware Computation Culling
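As a rough illustration of why index reuse matters, the sketch below counts vertex shader invocations under a simple FIFO post-transform cache. The FIFO policy and the 16-entry size are assumptions made for the example; real hardware uses its own cache design, so treat the counts as a model, not a prediction.

```python
from collections import deque


def count_vertex_invocations(indices, cache_size):
    """Count vertex shader runs for an index stream under a FIFO cache model.

    cache_size=0 models no reuse at all; cache_size=None models an
    unbounded cache (every unique index is shaded exactly once).
    """
    cache = deque(maxlen=cache_size)
    invocations = 0
    for idx in indices:
        if idx in cache:
            continue  # cache hit: the earlier result is reused
        invocations += 1
        cache.append(idx)  # a full FIFO drops its oldest entry
    return invocations


def grid_indices(segments):
    """Index list for a quad tessellated into segments x segments cells."""
    row = segments + 1
    indices = []
    for y in range(segments):
        for x in range(segments):
            a = y * row + x   # top-left corner of the cell
            b = a + 1         # top-right
            c = a + row       # bottom-left
            d = c + 1         # bottom-right
            indices += [a, b, c, b, d, c]
    return indices


if __name__ == "__main__":
    print(count_vertex_invocations([0, 1, 2, 1, 2, 3], 16))  # indexed quad
    print(count_vertex_invocations([0, 1, 2, 3, 4, 5], 16))  # 6 separate verts
    grid = grid_indices(100)
    print(len(grid), count_vertex_invocations(grid, 16))
```

In this model the order of the triangles decides how often an index is still in the cache when it comes up again, which is what the articles listed above try to optimize.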
Unrelated, but another example of the spec vs. reality: according to the spec, depth testing happens AFTER the fragment shader runs (otherwise you couldn't set gl_FragDepth in the fragment shader). In reality, though, as long as the results are the same the driver/GPU can do whatever it wants, so fragment shaders that neither set gl_FragDepth nor discard fragments are depth tested first and only run if the test passes.
I've got a MPSImageGaussianBlur object doing work on each frame of a compute pass (Blurring the contents of an intermediate texture).
While the app is still running at 60fps no problem, I see an increase of ~15% in CPU usage when enabling the blur pass. I'm wondering if this is normal?
I'm just curious as to what could be going on under the hood of MPSImageGaussianBlur's encodeToCommandBuffer: operation that would see so much CPU utilization. In my (albeit naive) understanding, I'd imagine there would just be some simple encoding along the lines of:
MPSImageGaussianBlur.encodeToCommandBuffer: pseudo-method:
func encodeToCommandBuffer(commandBuffer: MTLCommandBuffer, sourceTexture: MTLTexture, destinationTexture: MTLTexture) {
    let encoder = commandBuffer.computeCommandEncoder()
    encoder.setTexture(sourceTexture, atIndex: 0)
    encoder.setTexture(destinationTexture, atIndex: 1)
    // kernel weights would be built at initialization and
    // present here as a `kernelWeights` property
    encoder.setTexture(self.kernelWeights, atIndex: 2)
    let threadgroupsPerGrid = ...
    let threadsPerThreadgroup = ...
    encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    encoder.endEncoding()
}
Most of the 'performance magic' would be implemented on the algorithms running in the compute kernel function. I can appreciate that bit because performance (on the GPU) is pretty fantastic independent of the blurRadius I initialize the MPSImageGaussianBlur with.
Some probably irrelevant details about my specific setup:
MPSImageGaussianBlur initialized with blur radius 8 pixels.
The texture I'm blurring is 128 by 128 pixels.
Performing all rendering in an MTKViewDelegate's drawInMTKView: method.
I hope this question is somewhat clear in its intent.
MPSImageGaussianBlur is internally a complex multipass algorithm. It spends some time allocating textures out of its internal texture cache to hold the intermediate data. There is the overhead of multiple kernel launches to be managed. Also, some resources like the Gaussian blur kernel weights need to be set up. When you commit the command buffer, all these textures need to be wired down (iOS) and some other work needs to be done. So it is not quite as simple as you imagine.
The texture you are using is small enough that the relatively fixed CPU overhead can start to become an appreciable part of the time.
Filing a radar on the CPU cost of MPSImageGaussianBlur would cause Apple to spend an hour or two looking at whether something can be improved, and would be worth your time.
I honestly would not be surprised if, under the hood, the GPU was being used less than you would think for the kernel. In my first experiences with Metal compute I found performance underwhelming and fell back on NEON again. It was counterintuitive. I really wouldn't be surprised if the CPU hit was NEON. I saw the same using the MPS Gaussian blur. It would be nice to get this confirmed. NEON has a lot of memory and instruction features that are friendlier to this use case.
Also, an indicator that this might be the case is that these filters don't run on OS X Metal. If it were just compute shaders, I'm sure they could run. But NEON code can't run on the simulator.
I made a simple experiment by implementing a naive char search algorithm that searches 1,000,000 rows of 50 characters each (a 50 million char map) on both the CPU and the GPU (using the iOS 8 Metal compute pipeline).
The CPU implementation uses a simple loop; the Metal implementation gives each kernel invocation 1 row to process (source code below).
To my surprise, the Metal implementation is on average 2-3 times slower than the simple, linear CPU version (if I use 1 core) and 3-4 times slower if I employ 2 cores (each of them searching half of the database)!
I experimented with different numbers of threads per group (16, 32, 64, 128, 512), yet I still get very similar results.
iPhone 6:
CPU 1 core: approx 0.12 sec
CPU 2 cores: approx 0.075 sec
GPU: approx 0.35 sec (release mode, validation disabled)
I can see the Metal shader spending more than 90% of its time accessing memory (see below).
What can be done to optimise it?
Any insights will be appreciated, as there are not many sources on the internet (besides the standard Apple programming guides) providing details on memory access internals and trade-offs specific to the Metal framework.
Host code gist:
Kernel (shader) code:
GPU frame capture profiling results:
The GPU shader is also striding vertically through memory, whereas the CPU is moving horizontally. Consider the addresses actually touched more or less concurrently by each thread executing in lockstep in your shader as you read charTable. The GPU will probably run a good deal faster if your charTable matrix is transposed.
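The effect of the layout can be sketched by counting how many cache lines a group of neighbouring threads touches when each thread reads the same column of its own row. The 32-thread group, the 64-byte cache line, and the one-byte-per-character storage are assumptions for illustration; the 50-character rows and 1,000,000 rows come from the question.

```python
def row_major_address(row, col, row_length):
    """Byte offset when each row's characters are stored contiguously."""
    return row * row_length + col


def transposed_address(row, col, row_count):
    """Byte offset when each column's characters are stored contiguously."""
    return col * row_count + row


def cache_lines_touched(addresses, line_bytes=64):
    """Number of distinct cache lines covered by a set of byte addresses."""
    return len({addr // line_bytes for addr in addresses})


if __name__ == "__main__":
    rows = range(32)  # 32 neighbouring threads, one row each (assumed)
    col = 0           # all of them read the same column in lockstep
    original = [row_major_address(r, col, 50) for r in rows]
    transposed = [transposed_address(r, col, 1_000_000) for r in rows]
    print(cache_lines_touched(original), cache_lines_touched(transposed))
```

With the original layout the reads are 50 bytes apart, so the group spreads over many lines; with the transposed layout the same reads sit next to each other.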
Also, because this code executes in a SIMD fashion, each GPU thread will probably have to run the loop to the full search phrase length, whereas the CPU will get to take advantage of early outs. The GPU code might actually run a little faster if you remove the early outs and just keep the code simple. Much depends on the search phrase length and likelihood of a match.
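The early-out point can be illustrated with a small cost model, assuming threads run in lockstep groups so that every thread in a group pays for the slowest one. The group size and the iteration counts below are made-up values for the example.

```python
def independent_cost(iterations_per_thread):
    """Total loop iterations if every thread could stop on its own (CPU-like)."""
    return sum(iterations_per_thread)


def lockstep_cost(iterations_per_thread, group_size):
    """Total iterations when each group runs as long as its slowest thread."""
    total = 0
    for start in range(0, len(iterations_per_thread), group_size):
        group = iterations_per_thread[start:start + group_size]
        total += max(group) * len(group)
    return total


if __name__ == "__main__":
    # Three threads bail out after 1 character, one runs all 8 (hypothetical).
    iterations = [1, 1, 1, 8]
    print(independent_cost(iterations), lockstep_cost(iterations, 4))
```

Under this model the early outs save almost nothing unless every thread in a group exits early, which is why removing them may cost little on the GPU.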
I'll take a guess too: the GPU isn't optimized for if/else. It doesn't predict branches (it probably executes both sides), so try to rewrite the algorithm in a more linear way, without conditionals or with as few as possible.
I have a high number of variables in my fragment shader: 30 uniforms (mostly vec4) and about 20 variables (vec3, float, vec4) within the shader. It runs just fine on an iPhone 5S, but I have a serious problem on an iPhone 4. GPU time is 1 s per frame, and 98% of that time is shader run time.
According to the Apple API documentation:
OpenGL ES limits the number of each variable type you can use in a
vertex or fragment shader. The OpenGL ES specification doesn’t require
implementations to provide a software fallback when these limits are
exceeded; instead, the shader simply fails to compile or link. When
developing your app you must ensure that no errors occur during shader
compilation, as shown in Listing 10-1.
But I don't quite understand this. Do they provide a software fallback or not? I have no errors during compilation or linking of the shader, and yet performance is poor. I have commented almost everything out and left just 2 texture lookups and the directional light computation. I changed the other functions to return just vec4(0,0,0,0).
The limitation on uniforms is much higher than that. GLSL ES (2.0) requires 512 scalar uniform components per vertex shader (though ES describes this in terms of the number of vectors -- 128). Assuming all 30 of your uniforms were vec4, you would still have enough storage for 98 more.
The relevant limits are gl_MaxVertexUniformVectors and gl_MaxFragmentUniformVectors. Implementations are only required to support 16 in the fragment shader, but most will far exceed the minimum - check the values yourself. Query the limits from GL ES rather than trying to figure them out in your GLSL program with some Frankenstein shader code ;)
OpenGL ES 2.0 Shading Language - Appendix A: Limitations - pp. 113
const mediump int gl_MaxVertexAttribs = 8;
const mediump int gl_MaxVertexUniformVectors = 128;
const mediump int gl_MaxVaryingVectors = 8;
const mediump int gl_MaxVertexTextureImageUnits = 0;
const mediump int gl_MaxCombinedTextureImageUnits = 8;
const mediump int gl_MaxTextureImageUnits = 8;
const mediump int gl_MaxFragmentUniformVectors = 16;
const mediump int gl_MaxDrawBuffers = 1;
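The arithmetic behind these limits can be checked with a short sketch. It assumes, as the answer does, that each of the 30 uniforms occupies one vec4 slot, and it uses only the spec minimums listed above; the values a real device reports have to be queried at runtime and are usually higher.

```python
# Minimum values guaranteed by the GLSL ES 1.00 built-in constants above.
SPEC_MINIMUMS = {
    "gl_MaxVertexUniformVectors": 128,
    "gl_MaxFragmentUniformVectors": 16,
}


def vectors_remaining(limit, vectors_used):
    """Uniform vector slots left over (negative means over budget)."""
    return limit - vectors_used


def fits(limit, vectors_used):
    """True if the uniforms fit within the given vector limit."""
    return vectors_used <= limit


if __name__ == "__main__":
    used = 30  # the 30 uniforms from the question, one vec4 slot each (assumed)
    for name, limit in SPEC_MINIMUMS.items():
        print(name, vectors_remaining(limit, used), fits(limit, used))
```

Against the vertex shader minimum there are 98 slots to spare, but 30 vectors is above the fragment shader minimum of 16, which is why checking the limit actually reported by the device matters here.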
In fact, it would be a good idea to query all of the GLSL program / shader limits just to get a better idea of the constraints you need to work under for your target software/hardware. It is better to plan ahead than to wait to address these things until your program blows up.
As for software fallbacks, I doubt it. This is an embedded environment, so there is not much need for such a thing. When developing the actual software on a PC/Mac, they usually ship with a reference software implementation, mostly for testing purposes. Individual components may sometimes fall back to software to overcome hardware limitations, but that is necessary because of the wide variety of hardware in Apple's Mac line alone. When you are writing an app specifically for a single hardware specification, it is generally acceptable to fail completely if you try to do something that exceeds the limitations (which you are expected to be familiar with).
Apple says in their Best Practices For Shaders to avoid branching if possible, and especially branching on values calculated within the shader. So I replaced some if statements with the built-in clamp() function. My question is, are clamp(), min(), and max() likely to be more efficient, or are they merely convenience (i.e. macro) functions that simply expand to if blocks?
I realize the answer may be implementation dependent. In any case, the functions are obviously cleaner and make plain the intent, which the compiler could do something with.
Historically speaking GPUs have supported per-fragment instructions such as MIN and MAX for much longer than they have supported arbitrary conditional branching. One example of this in desktop OpenGL is the GL_ARB_fragment_program extension (now superseded by GLSL) which explicitly states that it doesn't support branching, but it does provide instructions for MIN and MAX as well as some other conditional instructions.
I'd be pretty confident that all GPUs will still have dedicated hardware for these operations given how common min(), max() and clamp() are in shaders. This isn't guaranteed by the specification because an implementation can optimize code however it sees fit, but in the real world you should use GLSL's built-in functions rather than rolling your own.
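For reference, the GLSL specification defines clamp(x, minVal, maxVal) as min(max(x, minVal), maxVal). The Python sketch below only shows that the branch-free form and a hand-written if chain compute the same values; it says nothing about relative cost on a GPU, which depends on the hardware and compiler.

```python
def clamp_with_branches(x, lo, hi):
    """Hand-rolled clamp using conditionals."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


def clamp_with_min_max(x, lo, hi):
    """Clamp expressed as min(max(x, lo), hi), matching the GLSL definition."""
    return min(max(x, lo), hi)


if __name__ == "__main__":
    for value in (-0.5, 0.0, 0.25, 1.0, 1.5):
        print(value, clamp_with_branches(value, 0.0, 1.0),
              clamp_with_min_max(value, 0.0, 1.0))
```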
The only exception would be if your conditional was being used to avoid a large amount of additional fragment processing. At some point the cost of a branch will be less than the cost of running all the code in the branch, but the balance here will be very hardware dependent and you'd have to benchmark to see if it actually helps in your application on its target hardware. Here's the kind of thing I mean:
void main() {
    vec3 N = ...;
    vec3 L = ...;
    float NDotL = dot(N, L);
    if (NDotL > 0.0)
    {
        // Lots of very intensive code for an awesome shadowing algorithm that we
        // want to avoid wasting time on if the fragment is facing away from the light
    }
}
Just clamping NDotL to 0-1 and then always processing the shadow code on every fragment only to multiply through your final shadow term by NDotL is a lot of wasted effort if NDotL was originally <= 0, and we can theoretically avoid this overhead with a branch. The reason this kind of thing is not always a performance win is that it is very dependent on how the hardware implements shader branching.