WebGL vs PyOpenGL

I've now been assigned to try to integrate some 3D rendering that was done in WebGL into PyOpenGL. I have some samples of both, but right from the start I've run into somewhat of a dilemma: for one of the objects, which has a huge number of vertices, the WebGL version runs much better than the PyOpenGL one. I'm mostly curious whether this is normal or whether it's an implementation issue.
regards,
Bogdan

Is your PyOpenGL implementation using VBOs for rendering the geometry?
The main performance issues we ran into when implementing WebGL were JS->C++ call overhead, type conversions and GC runs, which is why WebGL uses Typed Arrays for data and VBOs for rendering: Typed Arrays reduce the need for type conversions and are potentially faster to GC than JS arrays, whereas VBOs minimize the number of API calls and CPU->GPU traffic.
With PyOpenGL I'd imagine the main issue to be type conversions, but you shouldn't run into that with VBOs, hence the question.
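For reference, a minimal sketch of the VBO path in PyOpenGL, assuming a current GL context and a shader program with its position attribute at location 0 (the function and variable names are illustrative, not taken from your code):

```python
import numpy as np
from OpenGL.GL import (
    GL_ARRAY_BUFFER, GL_STATIC_DRAW, GL_FLOAT, GL_FALSE, GL_TRIANGLES,
    glGenBuffers, glBindBuffer, glBufferData, glVertexAttribPointer,
    glEnableVertexAttribArray, glDisableVertexAttribArray, glDrawArrays,
)

def create_static_vbo(vertices):
    """Upload a float32 numpy array to the GPU once, at load time."""
    vbo = glGenBuffers(1)
    glBindBuffer(GL_ARRAY_BUFFER, vbo)
    # Handing glBufferData a numpy array copies the whole block in one call,
    # avoiding per-element Python->C type conversions.
    glBufferData(GL_ARRAY_BUFFER, vertices.nbytes, vertices, GL_STATIC_DRAW)
    glBindBuffer(GL_ARRAY_BUFFER, 0)
    return vbo

def draw_static_vbo(vbo, vertex_count):
    """Draw from GPU-resident data; nothing is converted or copied per frame."""
    glBindBuffer(GL_ARRAY_BUFFER, vbo)
    glEnableVertexAttribArray(0)
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, None)
    glDrawArrays(GL_TRIANGLES, 0, vertex_count)
    glDisableVertexAttribArray(0)
    glBindBuffer(GL_ARRAY_BUFFER, 0)

# vertices = np.asarray(model_data, dtype=np.float32)  # model_data is hypothetical
# vbo = create_static_vbo(vertices)
```

If the current code instead passes Python lists (or makes per-vertex calls) every frame, that alone can explain a large gap against WebGL.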

Related

WebGL Constructive Solid Geometry to Static Vertices

I'm super new to 3D graphics (I've been trying to learn actual WebGL instead of using a framework) and I'm now in the constructive solid geometry phase. I know sites like tinkercad.com use CSG with WebGL, but they recompute your design every time you load the page instead of doing the subtraction, addition and intersection of primitive objects once and then storing the resulting vertices for later use. I'm curious if anybody knows why they do it that way (maybe just to conserve resources on the server?), and whether there is some straightforward way of extracting those vertices right before the draw call, maybe a built-in function of WebGL. I haven't found anything so far: when I try logging the object data from gl.bufferData() I get multiple Float32Arrays (one for each object that was unioned together) instead of one complete set of vertices.
By the way, the only GitHub project I've found with CSG for WebGL is https://github.com/evanw/csg.js/ and it's pretty straightforward; however, it uses a framework, and I was curious whether you know of any other CSG WebGL code out there that doesn't rely on one. I'd like to write it myself either way, but just being able to see what others have done would be nice.
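Not a WebGL-specific answer, but "do the CSG once and store the end-design vertices" really just means flattening the per-object arrays into one vertex array and persisting it somewhere. A rough sketch of that idea in Python/numpy terms, purely to illustrate the concept (your code is JavaScript, and the file name here is made up):

```python
import numpy as np

def flatten_and_cache(per_object_vertices, path="csg_result.npy"):
    """Concatenate the per-object vertex arrays produced by the CSG step into
    one flat float32 array and save it, so the CSG never has to be re-run."""
    merged = np.concatenate(
        [np.asarray(v, dtype=np.float32).ravel() for v in per_object_vertices]
    )
    np.save(path, merged)
    return merged

def load_cached(path="csg_result.npy"):
    """Reload the pre-computed vertices: this is what you would upload with a
    single bufferData call instead of one call per unioned object."""
    return np.load(path)
```

In WebGL the equivalent would be concatenating the Float32Arrays you are seeing in gl.bufferData() and serializing the result (for example to JSON or a binary file) at design time.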

Understanding Core Video's CVPixelBufferPool and CVOpenGLESTextureCache semantics

I'm refactoring my iOS OpenGL-based rendering pipeline. My pipeline consists of many rendering steps, hence I need a lot of intermediate textures to render to and read from. Those textures are of various types (unsigned byte and half float) and may possess a different number of channels.
To save memory and allocation effort I recycled textures that were used by previous steps in the pipeline and are no longer needed. In my previous implementation I did that on my own.
In my new implementation I want to use the APIs provided by the Core Video framework for that; especially since they provide much faster access to the texture memory from the CPU. I understand that the CVOpenGLESTextureCache allows me to create OpenGL textures out of CVPixelBuffers that can be created directly or using a CVPixelBufferPool. However, I am unable to find any documentation describing how they really work and how they play together.
Here are the things I want to know:
For getting a texture from the CVOpenGLESTextureCache I always need to provide a pixel buffer. Why is it called a "cache" if I need to provide the memory anyway and am not able to retrieve an old, unused texture?
The CVOpenGLESTextureCacheFlush function "flushes currently unused resources". How does the cache know if a resource is "unused"? Are textures returned to the cache when I release the corresponding CVOpenGLESTextureRef? The same question applies to the CVPixelBufferPool.
Am I able to maintain textures with different properties (type, # channels, ...) in one texture cache? Does it know whether a texture can be re-used or needs to be created, depending on my request?
CVPixelBufferPools seem to be able to manage only buffers of the same size and type. This means I need to create one dedicated pool for each texture configuration I'm using, correct?
I'd be really happy if at least some of those questions could be clarified.
Yes, well you will not actually be able to find anything. I looked and looked and the short answer is you just need to test things out to see how the implementation functions. You can find my blog post on the subject along with example code at opengl_write_texture_cache. Basically, it seems that the way it works is that the texture cache object "holds" on to the association between the buffer (in the pool) and the OpenGL texture that is bound when a triangle render is executed. The result is that the same buffer should not be returned by the pool until after OpenGL is done with it. In the weird case of some kind of race condition, the pool might get 1 buffer larger to account for a buffer that is held too long. What is really nice about the texture cache API is that one only needs to write to the data buffer once, as opposed to calling an API like glTexImage2D() which would "upload" the data to the graphics card.

OpenGL (ES 2.0) VBO Performances in a Shared Memory Architecture

I am a desktop GL developer, and I am starting to explore the mobile world.
To avoid misunderstandings, or welcome-but-trivial replies, I can humbly say that I am pretty well aware of the GL and GL|ES machinery.
The short question is: if we are using GL|ES 2.0 in a shared memory architecture, what's the point behind using VBOs against client-side arrays?
More in detail:
Vertex buffers are raw chunks of memory; the driver cannot really optimize anything, because the access pattern depends on: 1) how the application configures the vertex data layout, 2) how a vertex shader consumes the buffer content, and 3) the fact that we can have lots of vertex shaders operating in different ways and sourcing the same buffer differently.
Alignment: individual VBO storage could start at addresses that are optimal for the underlying GL system; what if I simply force client-side array allocations to those boundaries (i.e., respect the alignment best practices)?
Tile-Based Rendering vs. Immediate Mode architectures should not come into play: to my understanding, this is not related to my question (i.e., memory access).
I understand that using VBOs can have your code run better/faster in future platforms/hardware without modifying it, but this is not the focus of this question.
At the same time, I realize that using VBOs in a shared memory architecture doubles memory usage (if, for some reason, you have to keep the vertex data at your disposal) and costs you a memcpy of the data.
As with interleaved vertex arrays, VBO usage has received a great deal of hype in developers' forums/blogs/official tech notes without any data (i.e., benchmarks) supporting those claims.
Is VBO usage worth it on shared memory architectures?
Do client-side arrays work well?
What do you think/know about this?
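For concreteness, here is roughly what the two paths look like, sketched in PyOpenGL terms (the question is about GL|ES 2.0 in C, so treat this purely as an illustration; a current context, a bound program with the position attribute at location 0, and a float32 numpy vertex array are assumed):

```python
from OpenGL.GL import (
    GL_ARRAY_BUFFER, GL_FLOAT, GL_FALSE, GL_TRIANGLES,
    glBindBuffer, glEnableVertexAttribArray, glVertexAttribPointer, glDrawArrays,
)

def draw_client_side(vertices, count):
    """Client-side array path: the driver reads (and generally copies) the
    data out of application memory on every single draw call."""
    glBindBuffer(GL_ARRAY_BUFFER, 0)          # no VBO bound
    glEnableVertexAttribArray(0)
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, vertices)
    glDrawArrays(GL_TRIANGLES, 0, count)

def draw_from_vbo(vbo, count):
    """VBO path: the data was handed over once; the draw call only references
    it, so the driver is free to place and align it however it likes."""
    glBindBuffer(GL_ARRAY_BUFFER, vbo)
    glEnableVertexAttribArray(0)
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, None)
    glDrawArrays(GL_TRIANGLES, 0, count)
```

The difference the question is probing is whether, on a shared memory architecture, the per-draw hand-over in the first function actually costs anything once the data is already well aligned.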
I can report that using VBOs to store vertex data on Android devices gave me zero performance improvement. I tried it on Adreno, Mali-400 and PowerVR GPUs. However, we still use VBOs, since it is considered best practice for OpenGL ES.
You can find notes about this in our article (Vertex Buffer Objects paragraph).
According to this report, even holding SMA constant it depends on both the OpenGL implementation (some VBO work is secretly done on the CPU) and the size of the VBOs:
http://sarofax.wordpress.com/2011/07/10/vbo-vertex-buffer-object-limitations-on-ios-devices/
I will tell you what I know about the iOS platform.
VBOs really do improve performance.
A VBO is perfect if you have static geometry: once the data is copied, there is no additional overhead on every draw call. Client-side arrays (CA) copy your data from client memory to "GPU memory" on every draw call, and the driver may also have to realign the data if you didn't.
A VBO can be mapped via glMapBuffer. Mapping is an asynchronous operation, so it has almost no overhead, but remember: when you map/unmap your buffer, it's better to use it two frames after the unmap operation, to avoid synchronization (see the sketch below).
Apple engineers claim that VBOs will perform better than client-side arrays on SGX hardware, even if you re-upload the data every frame; I don't know the details.
VBOs are the best practice, and client-side arrays are effectively deprecated. Better to keep pace with modern trends and stay as cross-platform as possible.
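To make the mapping point concrete, here is a rough sketch of refilling a per-frame VBO via glMapBuffer, written with PyOpenGL calls (on GL|ES 2.0 the map call comes from the OES_mapbuffer extension; the orphaning glBufferData call is a common companion trick, not something the answer above prescribes):

```python
import ctypes
from OpenGL.GL import (
    GL_ARRAY_BUFFER, GL_DYNAMIC_DRAW, GL_WRITE_ONLY,
    glBindBuffer, glBufferData, glMapBuffer, glUnmapBuffer,
)

def refill_vbo_mapped(vbo, vertices):
    """Rewrite a VBO that changes every frame; vertices is a float32 numpy array.

    Re-specifying the storage first ("orphaning") lets the driver hand back a
    fresh block if the GPU is still reading last frame's data, so the map does
    not have to wait for that read to finish."""
    glBindBuffer(GL_ARRAY_BUFFER, vbo)
    glBufferData(GL_ARRAY_BUFFER, vertices.nbytes, None, GL_DYNAMIC_DRAW)
    ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY)   # write-only mapping
    ctypes.memmove(ptr, vertices.ctypes.data, vertices.nbytes)
    glUnmapBuffer(GL_ARRAY_BUFFER)
    glBindBuffer(GL_ARRAY_BUFFER, 0)
```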

Can you prewarm a shader on a background thread with its own context?

I am developing a large game that streams in level data (including shaders) as you move through the game world. I do not want to have hitches in my frame rate as shaders are compiled/linked or on the first time they are used.
I have my shader compilation and linking working on a separate thread with its own OpenGL context. But I have not been able to get the prewarming of the shaders to work on the separate thread (so that there is no performance hit when a shader is first used).
Prewarming is really not mentioned anywhere in the iOS or OpenGL docs. It is, however, mentioned in the OpenGL ES Analyzer (one of the instruments available when profiling from Xcode). In this tool I get a "Shader Compiled Outside of Prewarming Phase" warning each time something is rendered with a shader that has not been used to render anything before. The "Extended detail" says this:
"OpenGL ES Analyzer detected a shader compilation that is not part of an initial prewarming phase. Shader compilation can be a time consuming operation. To avoid them, prewarm all shaders used for rendering. To do this, make a prewarming pass when your application launches and execute a drawing call with each of the shader programs to be used, using any gl state settings the shader program will be used in conjunction with. States such as blending, color mask, logic ops, multisampling, texture formats, and point primitive state can all affect shader compilation."
The term "compilation" is a little confusing here. The vertex and fragment shaders have already been compiled and the program has been linked. But the first time something is rendered with a given OpenGL state it does some more work on the shader to optimize it for that state I guess.
I have code to pre-warm the shaders by rendering a zero-sized triangle before their first use (a sketch of that kind of pass is below).
If I compile, link and pre-warm the shaders on the main thread with the same OpenGL context as the normal rendering, then it works. However, if I do it on the background thread with its separate OpenGL context, it does not work (it still gets the Analyzer warning on first use).
So... it could be that prewarming a shader on a separate context has no effect on other contexts. Or it could be that I don't have all the same state set up in the separate context. There is a lot of potential OpenGL state that might need to be set up. I'm using an offscreen render buffer on the background thread, so that could be considered part of the state.
Has anyone succeeded in getting prewarming working on a background thread?
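Roughly, the prewarm pass I am describing looks like this (sketched with PyOpenGL-style calls rather than my actual GLES code; program_states and apply_state are stand-ins for however you pair each program with the render state it will be used with):

```python
from OpenGL.GL import (
    GL_ARRAY_BUFFER, GL_FLOAT, GL_FALSE, GL_TRIANGLES,
    glUseProgram, glBindBuffer, glEnableVertexAttribArray,
    glVertexAttribPointer, glDrawArrays,
)

def prewarm_shaders(program_states, dummy_vbo):
    """Issue one throwaway draw per (program, state) combination so the driver
    finishes its state-dependent shader work before real rendering starts.

    dummy_vbo holds three identical vertices, i.e. a zero-area triangle, so
    nothing visible is produced."""
    for program, apply_state in program_states:
        glUseProgram(program)
        apply_state()                       # blending, color mask, etc.
        glBindBuffer(GL_ARRAY_BUFFER, dummy_vbo)
        glEnableVertexAttribArray(0)
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, None)
        glDrawArrays(GL_TRIANGLES, 0, 3)    # degenerate triangle, no pixels
    glUseProgram(0)
```

The open question is whether issuing exactly this on the background context does anything for the main context.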
To be honest with you, I was quite ignorant on this matter until yesterday, even though I have been working on my engine optimization for a while. So, first of all, thank you for the tip :).
Since then I have studied the shader warming topic and have not found much around.
I found a mention in the official AMD documentation, in a document titled "ATI OpenGL Programming and Optimization Guide":
http://developer.amd.com/media/gpu_assets/ATI_OpenGL_Programming_and_Optimization_Guide.pdf
This is an excerpt which refers to warming the shaders:
Quote:
While the R500 natively supports flow control in the fragment shading unit, the R300 and R400 asics does not. Static flow control for the R300 and R400 is emulated by the driver compiling out unused conditionals and unrolling loops based on the set constants. Even though the R500 asics family natively support flow control, the driver will still attempt to compile out static flow conditions enabling it to reorganize shader instructions for better instruction scheduling. The driver will also try to cache away the compiled shader for a specific static flow condition set in anticipation for its reuse. So when writing a fragment program that uses static flow control, it is recommended to “warm” the shader cache by rendering a dummy triangle on the very first frame that uses the common static conditional permutations relevant for the life of the shader.
The best explanation I have found around is the following:
http://fgiesen.wordpress.com/2011/07/01/a-trip-through-the-graphics-pipeline-2011-part-1/
Quote:
Incidentally, this is also the reason why you’ll often see a delay the first time you use a new shader or resource; a lot of the creation/compilation work is deferred by the driver and only executed when it’s actually necessary (you wouldn’t believe how much unused crap some apps create!). Graphics programmers know the other side of the story – if you want to make sure something is actually created (as opposed to just having memory reserved), you need to issue a dummy draw call that uses it to “warm it up”. Ugly and annoying, but this has been the case since I first started using 3D hardware in 1999 – meaning, it’s pretty much a fact of life by this point, so get used to it. :)
This presentation mentions how Crytek handled it in the Far Cry engine, though it is mostly related to DirectX.
http://www.powershow.com/view/11f2b1-MzUxN/Far_Cry_and_DirectX_flash_ppt_presentation
I hope these links help in some way.

Newbie to GPU programming: what to learn?

I am rendering a certain scene to an off-screen frame buffer (FBO) and then I'm reading the rendered image using glReadPixels() for processing on the CPU. The processing involves some very simple scanning routines and extraction of data.
After profiling I realized that most of what my application does is spend time in glReadPixels() - more than 50% of the time. So the natural step is to move the processing to the GPU so that the data would not have to be copied.
So my question is - what would be the best way to program such a thing to the GPU?
GLSL?
CUDA?
Anything else I'm not currently aware of?
The main requirement is that it has access to the rendered off-screen frame buffers (or texture data, since it is possible to render to a texture) and can output some information to the CPU, on the order of 1-2 KB per frame.
You might find the answers in the "Intro to GPU programming" questions useful.
-Adam
There are a number of pointers to getting started with GPU programming in other questions, but if you have an application that is already built using OpenGL, then probably your question really is "which one will interoperate with OpenGL"?
After all, your whole point is to avoid the overhead of reading your FBO back from the GPU to the CPU with glReadPixels(). If, for example, you had to read it back anyway, then copy the data into a CUDA buffer, then transfer it back to the GPU using CUDA APIs, there wouldn't be much point.
So you need a GPGPU package that will take your OpenGL FBO object as an input directly, without any extra copying.
That would probably rule out everything except GLSL.
I'm not 100% sure whether CUDA has any way of operating directly on an OpenGL buffer object, but I don't think it has that feature.
I am sure that ATI's Stream SDK doesn't do that. (Although it will interoperate with DirectX.)
I doubt that the DirectX 11 "technology preview" with compute shaders has that feature, either.
EDIT: Follow-up: it looks like CUDA, at least in its most recent version, has some support for OpenGL interoperability. If so, that's probably your best bet.
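If GLSL is the route, the usual shape of the solution is: render the scene to a texture, run the scanning/extraction as a fragment-shader pass into a much smaller FBO, and glReadPixels only that small result. A rough sketch in PyOpenGL terms (shader compilation, binding the scene texture and the analysis shader itself are left to the caller; draw_fullscreen_quad and analysis_program are hypothetical placeholders):

```python
import numpy as np
from OpenGL.GL import (
    GL_TEXTURE_2D, GL_RGBA, GL_UNSIGNED_BYTE, GL_NEAREST,
    GL_TEXTURE_MIN_FILTER, GL_TEXTURE_MAG_FILTER,
    GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
    glGenTextures, glBindTexture, glTexImage2D, glTexParameteri,
    glGenFramebuffers, glBindFramebuffer, glFramebufferTexture2D,
    glViewport, glUseProgram, glReadPixels,
)

def create_result_target(width=64, height=8):
    """Tiny FBO the analysis shader writes into: 64x8 RGBA = 2 KB per frame."""
    tex = glGenTextures(1)
    glBindTexture(GL_TEXTURE_2D, tex)
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, None)
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST)
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST)
    fbo = glGenFramebuffers(1)
    glBindFramebuffer(GL_FRAMEBUFFER, fbo)
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, tex, 0)
    glBindFramebuffer(GL_FRAMEBUFFER, 0)
    return fbo, (width, height)

def run_analysis_pass(fbo, size, analysis_program, draw_fullscreen_quad):
    """Scan the scene texture on the GPU, then read back only the small result."""
    width, height = size
    glBindFramebuffer(GL_FRAMEBUFFER, fbo)
    glViewport(0, 0, width, height)
    glUseProgram(analysis_program)      # fragment shader samples the scene texture
    draw_fullscreen_quad()              # two triangles covering the viewport
    data = glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE)
    glBindFramebuffer(GL_FRAMEBUFFER, 0)
    return np.frombuffer(data, dtype=np.uint8)
```

The read-back here is a couple of kilobytes instead of a full frame, which is the whole point of moving the scan onto the GPU.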
I recently found this: Modern GPU.
You may find OpenAI Triton useful.

Resources