WebGL: how to avoid a long shader compile stalling the tab

I have a giant shader that takes more than a minute to compile and completely stalls the whole browser during the process. As far as I know, shader compilation cannot be made asynchronous so that you could run other WebGL commands while waiting for the compilation to be done.
I already tried the following:
don't use that particular shader for some time - this doesn't work, because most other WebGL commands will wait for it to finish, even if that shader program is never made active
use another context - same as above; even WebGL commands from another context cause the stall
use OffscreenCanvas in a web worker - this doesn't avoid the stall either; even though it runs in a worker, it stalls the whole browser. Even if I wait a few minutes after the command to link the program before issuing any other WebGL command, the browser stalls (as if nothing had been happening during that time)
Another problem is that it sometimes crashes WebGL (context loss), which kills all contexts on the page (or in the worker).
Is there something I can do to avoid stalling the browser?
Can I split my shader into multiple parts and compile them separately?
This is what my program initialization looks like; can it be changed somehow?
// create and compile both shaders, then link them into a program
let vertexShader = gl.createShader(gl.VERTEX_SHADER);
let fragmentShader = gl.createShader(gl.FRAGMENT_SHADER);
let program = gl.createProgram();
gl.shaderSource(vertexShader, vertexSource);
gl.shaderSource(fragmentShader, fragmentSource);
gl.compileShader(vertexShader);
gl.compileShader(fragmentShader);
gl.attachShader(program, vertexShader);
gl.attachShader(program, fragmentShader);
gl.linkProgram(program);
gl.useProgram(program);
// querying the link status / info log forces the browser to wait for compile + link
let status = gl.getProgramParameter(program, gl.LINK_STATUS);
let programLog = gl.getProgramInfoLog(program);
Waiting for minutes after the call to linkProgram doesn't help, even in a worker.
As a final thing to note: I can have, e.g., a Windows game using OpenGL running that is not affected by this (the game keeps running; I start compiling this shader in the browser and the game continues to run fine while the browser stalls).

Update
Chromium added the KHR_parallel_shader_compile extension, which allows you to query whether a shader is done compiling.
Unfortunately, only Chromium-based browsers (Chrome/Edge/Brave/etc.) have implemented it as of January 2021.
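Here is a minimal sketch of how the extension might be used; the linkProgramAsync helper and the polling loop are my own illustration, not part of the extension, but COMPLETION_STATUS_KHR is the query it adds:

// Sketch: poll COMPLETION_STATUS_KHR each frame instead of blocking on
// LINK_STATUS right after linkProgram.
const ext = gl.getExtension('KHR_parallel_shader_compile');

function linkProgramAsync(gl, vertexSource, fragmentSource) {
  const vs = gl.createShader(gl.VERTEX_SHADER);
  gl.shaderSource(vs, vertexSource);
  gl.compileShader(vs);
  const fs = gl.createShader(gl.FRAGMENT_SHADER);
  gl.shaderSource(fs, fragmentSource);
  gl.compileShader(fs);
  const program = gl.createProgram();
  gl.attachShader(program, vs);
  gl.attachShader(program, fs);
  gl.linkProgram(program);

  return new Promise((resolve, reject) => {
    function check() {
      // COMPLETION_STATUS_KHR becomes true once the driver has finished
      // compiling/linking; until then, avoid any query that would block.
      if (ext && !gl.getProgramParameter(program, ext.COMPLETION_STATUS_KHR)) {
        requestAnimationFrame(check);
        return;
      }
      if (gl.getProgramParameter(program, gl.LINK_STATUS)) {
        resolve(program);
      } else {
        reject(new Error(gl.getProgramInfoLog(program)));
      }
    }
    check();
  });
}

If the extension is missing, the code above still works; it just blocks at the first LINK_STATUS query as before.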
Original answer
There is no good solution.
Browsers on Windows use DirectX because OpenGL doesn't ship by default on many machines and because lots of other features needed for the browser are incompatible with OpenGL.
DirectX takes a long time to compile shaders. Only Microsoft can fix that. Microsoft has provided source to an HLSL shader compiler but it only works with DX12.
Some people suggest allowing webpages to provide binary shaders, but that's never going to happen, for two very important reasons:
They aren't portable.
A webpage would have to provide hundreds or thousands of variations of binary shaders: one for every type of GPU * every type of driver * every platform (iOS, Android, Pi, Mac, Windows, Linux, Fire, ...). Webpages are supposed to load everywhere, so shader binaries are not a solution for the web.
It would be a huge security issue.
Having users download random binary blobs that are handed to the OS/GPU to execute would be a huge source of exploits.¹
Note that some browsers (Chrome in particular) cache shader binaries locally behind the scenes, but that doesn't help first-time compilation.
So basically, at the moment there is no solution. You can make simpler shaders or compile fewer of them at once. People have asked for an async extension to compile shaders, but there's been no movement.
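On the "compile fewer of them at once" point: if a page has many shaders (rather than one giant one), you can at least stagger the links across frames so no single frame pays for all of them. A rough sketch, with a queue and helper names of my own invention:

// Sketch: link at most one program per frame. Each link still blocks,
// but only for that one program, so the page stays more responsive.
// This does NOT help when a single shader alone takes a minute.
const pendingPrograms = []; // entries: { vsSource, fsSource, onReady }

function queueProgram(vsSource, fsSource, onReady) {
  pendingPrograms.push({ vsSource, fsSource, onReady });
}

function linkOnePerFrame(gl) {
  if (pendingPrograms.length > 0) {
    const { vsSource, fsSource, onReady } = pendingPrograms.shift();
    const vs = gl.createShader(gl.VERTEX_SHADER);
    gl.shaderSource(vs, vsSource);
    gl.compileShader(vs);
    const fs = gl.createShader(gl.FRAGMENT_SHADER);
    gl.shaderSource(fs, fsSource);
    gl.compileShader(fs);
    const program = gl.createProgram();
    gl.attachShader(program, vs);
    gl.attachShader(program, fs);
    gl.linkProgram(program);
    onReady(program);
  }
  requestAnimationFrame(() => linkOnePerFrame(gl));
}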
Here's a thread from 2 years ago
https://www.khronos.org/webgl/public-mailing-list/public_webgl/1702/msg00039.php
Just a personal opinion, but I'm guessing the reason there isn't much movement toward an async extension is that it's way more work to implement than it sounds, and that plenty of sites with complex shaders exist and seem to work.
¹ The shaders you pass to WebGL as GLSL text are compiled by the browser, checked for all kinds of issues, and rejected if any of the WebGL rules are broken. They are then rewritten to be safe, with bug workarounds inserted, variable names rewritten, clamping instructions added, loops sometimes unrolled - all kinds of things to make sure you can't crash the driver. You can use the WEBGL_debug_shaders extension to see the shader that's actually sent to the driver.
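For example, dumping the translated source of the fragmentShader from the question's code is a couple of lines (assuming the extension is available; the variable names here are mine):

// Sketch: print the driver-facing source the browser generated for a shader.
const debugExt = gl.getExtension('WEBGL_debug_shaders');
if (debugExt) {
  console.log(debugExt.getTranslatedShaderSource(fragmentShader));
}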
A binary shader is a blob you give to the driver; you have no chance to inspect it or verify it's not doing something bad, since it's a driver-proprietary binary. There is no documentation on what's in it or its format, and it can change with every GPU and every driver. You just have to trust the driver, and drivers are not trustworthy. On top of which, it's untrusted code executing on your machine. It would be no different from downloading random .exes and executing them, and therefore it won't happen.
As for WebGPU: no, there is no more security risk than with WebGL. Even if it uses a binary format, that binary format will be for WebGPU itself, not for the driver. WebGPU will read the binary, check that all the rules are followed, then generate a shader that matches the user's GPU. That generated shader could be GLSL, HLSL, MetalSL, SPIR-V, whatever works, but just like WebGL it will write a shader only after verifying all the rules are followed, and the shader it writes will, just like WebGL's, include workarounds, clamping, and whatever else is needed to make the shader safe. Note that as of today, 2018/11/30, the shader format for WebGPU is undecided: Google and Mozilla are pushing for a subset of SPIR-V in binary, while Apple and Microsoft are pushing for WHLSL, a variation of HLSL in text.
Note that when the browser says "Rats! WebGL hit a snag", that doesn't mean the driver crashed. Rather, it nearly always means the GPU was reset for taking too long. In Chrome (not sure about other browsers), when Chrome asks the GPU (via the driver) to do something, it starts a timer. If the GPU doesn't finish within 2-5 seconds (not sure of the actual timeout), Chrome will kill the GPU process. This includes compiling shaders, and since it's DirectX that takes the longest to compile, this issue comes up most on DirectX.
On Windows, even if Chrome didn't do this, Windows does. That's mostly because most GPUs (maybe all, in 2018) cannot multitask the way a CPU can. If you give them 30 minutes of work, they will do it without interruption for 30 minutes, which would basically freeze your machine, since your machine needs the GPU to draw application windows etc. In the past, Windows got around this by, just like Chrome, resetting the GPU if something took too long. It used to be that Linux and Mac would just freeze for those 30 minutes, or crash the OS, since the OS would expect to be able to draw graphics and couldn't. Sometime in the last 8 years Mac and Linux got better at this. In any case, Chrome needs to be proactive, so it uses its own timer and kills things that take too long.

Related

Should a CUDA stream be waited on for completion even if the output data is to be sent to OpenGL instead of the CPU?

This is a general question, and although I use OpenCV as a framework, the question is broader than OpenCV's realm.
I am developing an image processing tool that gets an image from a webcam (yielding a host-memory cv::Mat), uploads it to GPU device memory with CUDA (i.e. a cv::GpuMat), does some processing using CUDA to get a result finalCudaMat, and finally sends the result to OpenGL (i.e. cv::ogl::Buffer::mapDevice + finalCudaMat.copyTo(mappedOglBuffer)). Everything works as intended.
Since the whole process involves multiple steps, I use a CUDA stream object (cv::cuda::Stream) to make the CUDA calls asynchronous and not wait on every single operation to finish on the CPU side. Now, if one were instead to eventually copy the result back to a CPU matrix (i.e. finalCudaMat.download(finalCpuMat)), as in the customary situation, a wait on the stream is typically required (cudaStream.waitForCompletion()) to ensure the result is ready before using the CPU-side matrix.
In my case, the result never gets back to the CPU, as it continues to be rendered on the screen (a few OpenGL operations and shaders are also involved).
On one hand, it might be appropriate to wait for the CUDA work to finish before starting to copy the GpuMat to the OpenGL buffer. If I add the wait on the stream, everything works fine and the CUDA operations take ~2.5 ms.
On the other hand, it feels like I don't need to wait for completion of the stream (all the results are consumed by the GPU anyway; the CPU is never involved again). So I can remove the cudaStream.waitForCompletion() call before performing finalCudaMat.copyTo(mappedOglBuffer), and everything seems to work fine. The whole CUDA processing operation (basically any GPU task minus the OpenGL-related parts) apparently takes ~1.8 ms for me.
In the past I have had bad experiences with not properly synchronizing GPU work when two different APIs were involved (e.g. do something in Direct3D 9, don't wait for it to finish, then copy the resulting texture to a Direct3D 10 texture; on some frames the image clearly comes out empty or torn).
At this point the difference is tiny and doesn't affect my 60 FPS throughput. But I wonder whether I am technically doing the correct thing by removing the wait-on-stream operation. Any thoughts on this? Or maybe a document regarding OpenGL/CUDA interop that could help me?
The rules are defined in this document: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#graphics-interoperability
In particular it says that
Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results.
That's a very strong hint that the needed synchronization is performed by cudaGraphicsUnmapResources, which is confirmed by its documentation:
This function provides the synchronization guarantee that any CUDA work issued in stream before cudaGraphicsUnmapResources() will complete before any subsequently issued graphics work begins.
So you won't need to make the CPU wait on CUDA completion, but you must call cudaGraphicsUnmapResources which will put the appropriate barrier in the asynchronous instruction stream. Note that unlike your CPU transfer code, this call goes after CUDA copies data into the OpenGL buffer.
As Ben Voigt already pointed out, CUDA requires explicit synchronization with OpenGL (or any other graphics API that interoperates with it). This used to be something of a chore, where one had to submit callbacks to the compute stream and use them to manually work with e.g. OpenGL fences.
However, with the advent of Vulkan and with it support for external resources (and OpenGL extensions for the same), you can in fact synchronize between CUDA and OpenGL command streams by having both sides import platform-native semaphores (cudaImportExternalSemaphore, GL_EXT_semaphore) and use them for mutual synchronization. It usually still involves a whole round trip through the CPU-side driver, but since that part has to manage the command streams anyway, it's not really an efficiency issue.

Alternatives to D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS?

This is a follow-on to this question about using the DX11VideoRenderer sample (a replacement for the EVR that uses DirectX 11 instead of the EVR's DirectX 9).
I've been trying to track down why it uses so much more CPU than the EVR. Task Manager shows me that most of that time is kernel mode.
Using profiling tools, I see that a LOT of time is being spent in numerous calls to NtDelayExecution (aka Sleep). How many calls? ~100,000 over the course of ~12 seconds. Ok, yeah, I'm sending a lot of frames in those 12 seconds, but that's still a lot of calls, every one of which requires a kernel mode transition.
The call stack shows the last call in "my" code is to IDXGISwapChain1::Present(0, 0). The actual call seems to be Sleep(0) and comes from nvwgf2umx.dll (which is why this question is tagged NVidia: hopefully someone there can call up the code and see what the logic is behind such frequent calls).
I couldn't quite figure out why it would need to do /any/ Sleeping during Present. It's not like we wait for vertical retrace anymore, is it? But the other reason to use Sleep has to do with yielding to other threads. Which led me to a serious clue:
If I use D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS, the CPU utilization drops. Along with some other fixes, the DX11 version is now faster and uses less CPU time than the DX9 version (which is what I would hope/expect). Profiling shows that Sleep has dropped from >30% to <1%.
Unfortunately, this page tells me:
This flag is not recommended for general use.
Oh.
So, any ideas on how to get decent performance without using debug flags?

Is WebGL Shader Caching Possible?

My question is similar to Saving/Loading compiled WebGL shaders, but I don't want to pre-compile my shaders. Instead, I just want the browser to store the shaders it compiles longer than the default. Right now, every time I refresh the page the shaders have to be recompiled.
I understand the security and portability issues raised in answers like this one and this one. It seems that these are both non-issues assuming that the browser is caching shaders that it compiled for my web app.
Assuming the same OS + browser + GPU + driver combination, is there a way to make the browser cache compiled shaders in such a way that shader compilation will not be required after each time the page is refreshed?
There is nothing the user can do to force the browser to cache shaders. It's up to the browser to implement shader caching and to decide when to use it. Further, the browser relies on the OS to provide a way to cache shaders so if the OS doesn't support it then of course the browser can't either. As an example, currently on MacOS, WebGL runs on top of OpenGL and OpenGL on MacOS provides no way to cache shaders.
For example, search for 'BINARY' in this official Apple OpenGL feature table and you'll see the number of binary formats available for caching is 0. In other words, you can't cache OpenGL shaders on MacOS.
I don't know Metal that well; it's possible that some future version of WebGL could be written on top of Metal, and maybe Metal provides a way.
Chrome can cache shaders. Here's the code for caching them. But it can't if the OS doesn't support it.
Then there's the question of when to clear or not use the cache. Should the cache be cleared if the user presses 'refresh'? Note that 'refresh' is a signal from the user NOT to cache the page. There are many ways to revisit a page: click a link to it again, pick it from a bookmark, enter it in the URL bar. None of these clear the cache. Clicking the 'Refresh' button, AFAIK, bypasses the cache for at least the specific request (i.e., the page itself) but not for the things the page references.
Should the cache be cleared if the user chooses to empty the browser's normal cache of web resources? Clearly the cache should be cleared any time the driver changes version numbers. There may be other reasons to clear it as well, since the browser needs to make sure it never delivers a bad or out-of-date shader.
As for Windows, I believe DirectX allows caching shaders, and Chrome, via ANGLE, caches them. A quick test on Windows seems to bear this out. Going to shadertoy.com, the first time I load the page it takes a while; the next time it doesn't. Another test: pick a complex shader on Shadertoy, edit some constant in the shader (for example, change 1.0 to 1.01), press the compile button and note the compile time. Now change it back to 1.0 and press compile again. In my tests the second compile takes much less time, suggesting the shader was cached.
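You can run a rough version of the same test programmatically by timing the blocking LINK_STATUS query on a first load versus a reload (the helper below is my own sketch; what it measures depends entirely on whether the browser/OS cache kicks in):

// Sketch: crude measurement of compile+link time. Compare a cold page load
// with a reload of the same page to see whether caching seems to happen.
function timeCompile(gl, vsSource, fsSource) {
  const start = performance.now();
  const vs = gl.createShader(gl.VERTEX_SHADER);
  gl.shaderSource(vs, vsSource);
  gl.compileShader(vs);
  const fs = gl.createShader(gl.FRAGMENT_SHADER);
  gl.shaderSource(fs, fsSource);
  gl.compileShader(fs);
  const program = gl.createProgram();
  gl.attachShader(program, vs);
  gl.attachShader(program, fs);
  gl.linkProgram(program);
  gl.getProgramParameter(program, gl.LINK_STATUS); // forces the wait
  console.log('compile+link took', (performance.now() - start).toFixed(1), 'ms');
  return program;
}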
I have no idea if Firefox caches shaders. Safari doesn't since it only runs on platforms that don't support caching.

OpenGL ES apps appear to run MUCH FASTER when profiling in Instruments

I'm scared to ask this question because it doesn't include specifics and doesn't have any code samples, but that's because I've encountered it on three entirely different apps that I've worked on in the past few weeks, and I'm thinking specific code might just cloud the issue.
Scoured the web and found no reference to the phenomenon I'm encountering, so I'm just going to throw this out there and hope someone else has seen the same thing:
The 'problem' is that all the iOS OpenGL apps I've built, to a man, run MUCH FASTER when I'm profiling them in Instruments than when they're running standalone. As in, a frame rate roughly twice as fast (jumping from, e.g., 30 fps to 60 fps). This is measured both with a code-timing loop and by watching the apps run. Instruments appears to be doing something magical.
This is on a device, not the iOS simulator.
If I profile my OpenGL apps and upload to a device (specifically, an iPad 3 running iOS 5.1) via Instruments, the frame rate is just flat-out much, much faster than running standalone. There appears to be no frame skipping or shenanigans like that. It simply does the same computation at around twice the speed.
Although I'm not including any code samples, just assume I'm doing the normal stuff. OpenGL ES 2.0, with VBOs and VAOs. Multithreading some computationally intensive code areas with dispatch queues/blocks. Nothing exotic or crazy.
I'd just like to know if anyone has experienced anything vaguely similar. If not, I'll just head back to my burrow and continue stabbing myself in the leg with a fork.
It could be that when you profile, a release build is used (by default), whereas a debug build is used when you just hit Run.

Super slow Image processing on Android tablet

I am trying to implement the SLIC superpixel algorithm on an Android tablet (SLIC).
I ported the code, which is in C++, to work in the Android environment using the STL library and so on. What the application does is take an image from the camera and send the data to be processed in native code.
I got the app running, but the problem is that it takes 20-30 seconds to process a single frame (640 x 400), while on my notebook the Visual Studio build finishes almost instantly!
I checked for memory leaks; there aren't any... Is there anything that might make the computation time much more expensive than under VS2010 on the notebook?
I know this question might be very open-ended and not really specific, but I'm really in the dark too. Hope you guys can help.
Thanks
PS. I checked the running time of each step; I think the execution time of every line of code just went up. I don't see any specific function that takes way longer than usual.
PPS. Do you think any of the following may cause the slowness?
Memory size: investigated; during the native run, the GC doesn't report much pause time
STL library: not investigated yet; is it possible that functions like vector, max and min in the STL cause a significant slowdown?
The Android environment itself?
The lower hardware specification of the Android tablet (Acer Iconia Tab: 1 GHz Nvidia Tegra 250 dual-core processor and 1 GB of RAM)
Would it be better to run it in Java?
PPPS. If you have time, please check out the code.
I've taken a look at your code and can make the following recommendations:
First of all, you need to add the line APP_ABI := armeabi-v7a to your Application.mk file. Otherwise your code is compiled for the old ARMv5TE architecture, where you have no FPU (all floating-point arithmetic is emulated), fewer registers available, and so on.
Your SLIC implementation makes intensive use of double floating-point values for computation. You should replace them with float wherever possible, because ARM still lacks hardware support for the double type.
