Vulkan push constant vs uniform buffer update

I am reading the Vulkan book now and have a question about push constants versus UBO updates.
After I set up all the pipeline and descriptor stuff, I basically just need to copy my data into the UBO buffer, e.g. with memcpy, and I am done.
I can understand the issue here: the whole pipeline needs to wait until this "buffer" is ready before its content can change, so it will be slow.
On the other hand, when I use push constants there is no such problem, although the storage is small (say 256 bytes).
So far so good.
However, on second thought, I find that if I am updating the UBO, I don't need to change or re-record the command buffer; I can submit the old CB since it's still the same.
If I want to update by using push constants instead, I have to reset the CB, record it again and then submit it.
So won't this be an issue? How can I tell which one is faster?
Thanks.

Lots of people get confused on this issue, because the Vulkan Tutorial pre-records commands and Vulkan Guide re-records commands every frame.
When people say to use push constants for per-frame changing data like transform matrices and time data, there's the implicit assumption that you are recording command buffers every frame. Push constants essentially hitch a ride with the rest of your commands when submitted, which is also how they work without explicit synchronization or cache flushing.
Now, in a lot of scenarios, re-recording command buffers can be easier and not significantly more costly than re-use. And indeed, re-using command buffers when things change can be a real pain to manage. Command buffers are meant to be fast to record. Still, the Vulkan tutorial went with pre-recording everything, which is also a valid approach though potentially harder to maintain at scale.
At the time it was created, the Vulkan Tutorial was essentially one of the only resources available for learning Vulkan in a structured manner. Even though command buffers are quick to record, pre-recording them eliminates even more CPU overhead and exemplifies Vulkan's "Never be draw call limited again" mantra of eliminating CPU overhead in graphics applications.
As for the speed comparison, you'll have to benchmark, but I would not necessarily choose one or the other for "speed" reasons. If you pre-record, you don't want to re-orient your entire rendering architecture just to take advantage of push constants. If you don't pre-record, there's no reason not to use push constants; they are just straight up easier to deal with.
It seems like you are currently pre-recording, so I would not bother with push constants at all for this kind of data. I would also not focus on these kinds of issues until you are more familiar with Vulkan: it is very easy to get caught in the weeds with optimization, and strategies for optimization are nowhere near as uniform as in the CPU space.
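To make the two update paths concrete, here is a minimal C++ sketch of the CPU side. `FrameConstants` and `updateMappedUbo` are hypothetical names, not taken from the question or from any tutorial, and the sketch assumes the UBO memory is host-visible and persistently mapped. The push-constant path is only indicated in a comment because it needs a command buffer that is currently being recorded. Note that the Vulkan specification only guarantees 128 bytes of push-constant space (`maxPushConstantsSize`); some implementations report more, such as the 256 bytes mentioned in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical per-frame data: one 4x4 matrix plus a time value.
struct FrameConstants {
    float mvp[16];  // 64 bytes
    float time;     // 4 bytes
    float pad[3];   // explicit padding to a multiple of 16 bytes
};

// 128 bytes is the minimum maxPushConstantsSize the Vulkan spec guarantees.
constexpr std::size_t kGuaranteedPushConstantBytes = 128;
static_assert(sizeof(FrameConstants) <= kGuaranteedPushConstantBytes,
              "FrameConstants would not fit in the guaranteed push-constant range");

// Pre-recorded command buffers: the commands stay the same, only the
// contents of the (assumed persistently mapped) UBO memory change.
// Making sure the GPU is not still reading this memory is the caller's job.
inline void updateMappedUbo(void* mappedUbo, const FrameConstants& data) {
    assert(mappedUbo != nullptr);
    std::memcpy(mappedUbo, &data, sizeof(data));
}

// Per-frame recording: the same bytes travel inside the command buffer.
// While recording, the call would look like this (all handles come from
// your own setup code):
//   vkCmdPushConstants(cmd, pipelineLayout, VK_SHADER_STAGE_VERTEX_BIT,
//                      0, sizeof(FrameConstants), &data);
```

Either way the payload is the same struct; what differs is whether it lives in memory you must synchronize yourself or rides along with commands you re-record each frame.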

Related

Should I programmatically put computation-heavy tasks on a separate thread on iOS to utilize multiple cores?

I am making a real-time image processing app on iOS with my team. I am handling the custom computation kernel (mostly on the CPU rather than the GPU) and my teammates deal with the GUI. When I tested my kernel in a toy app, the core (ignoring any IO overhead) ran steadily at 100ms per image. However, when put into the fully functioning app, it slowed down to 500ms per image.
I have checked that the data is pretty much the same and I am only measuring time consumed within the kernel, on the same iPhone 6. There is hardly any other computation in the full app, so I am not sure what is holding it back. Though GPU processing is definitely an alternative and I am working on it, I would like to know if there are any tricks I can use for now.
Currently, there is no explicit multi-threading in the computation part, so my simple guess is: should I programmatically put the computation part on a separate thread so the second core can be utilized?
[Update]
It turns out that I made some mistakes in packaging my code as a library, since copying the source code over directly works fine. I have not figured out my problem yet and am going to post it as a separate question.
GPU Acceleration
This massively depends on the tasks you're performing. The GPU is good at a specific subset of tasks, and simply utilising it can sometimes even slow things down.
A lot of image-based tasks that are part of the Quartz framework etc. are GPU accelerated (like blurring). Also, if you use a library like OpenCV you get GPU acceleration on certain tasks out of the box.
Unless you're a real pro I would avoid using the GPU specifically and let the frameworks and libraries you use do that for you.
Concurrency
It will certainly help to put intensive tasks on a background thread. Just be aware of what it entails (i.e. you can't make any UIKit calls from a background thread).
The answer heavily depends on how you do the processing. Some methods in the SDK perform their job in a background thread, while others require the caller to create and use one.
In general, in the case of drawing, most methods require you to create one explicitly. This is especially important for the ones that perform their work on the CPU (e.g. using CoreGraphics to draw within a drawRect method). If you're using methods that use the GPU for the processing, then creating threads won't be of much use since the CPU won't be the cause of the bottleneck.
If you want to determine why your app slows down, use Instruments. (Time Profiler for CPU and Core Animation for drawing)
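As a rough illustration of the "move the heavy work off the main thread" advice, here is a minimal sketch in C++ using `std::async`. `processImage` is a hypothetical stand-in for the custom kernel from the question, and `processImageAsync` is likewise an invented name. On iOS you would more likely use GCD (`dispatch_async`) and hop back to the main queue before touching UIKit; the structure is the same.

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for the CPU kernel: inverts an 8-bit image.
inline std::vector<std::uint8_t> processImage(std::vector<std::uint8_t> pixels) {
    for (auto& p : pixels) {
        p = static_cast<std::uint8_t>(255 - p);
    }
    return pixels;
}

// Starts the kernel on another thread and returns immediately.
// The caller keeps the future and collects the result later,
// so the calling (UI) thread is not blocked while the kernel runs.
inline std::future<std::vector<std::uint8_t>>
processImageAsync(std::vector<std::uint8_t> pixels) {
    assert(!pixels.empty());
    return std::async(std::launch::async, processImage, std::move(pixels));
}
```

This keeps the calling thread responsive, but it does not by itself make a single image finish sooner. To use a second core for one image, the work would have to be split (for example by rows) across two such tasks.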

OpenCL - use image object with local memory

I'm trying to program with OpenCL.
There are two types of memory object.
One is the buffer and the other is the image.
Some blogs, web sites and white papers say 'an image object is a little bit faster than a buffer because of the cache'.
I'm trying to use an image object, and the reason is 'clamp': it would make the kernel code simpler and faster (in my opinion).
My question is: is it possible to use an image object together with local memory, and is it faster than using a buffer object with local memory?
Data -> image object -> copy to local memory -> operations -> write back to another image object.
As far as I understand, I cannot use the async_work_group_copy instruction for local memory in this case,
so I have to copy and synchronize manually for local memory, which will add a lot of overhead.
The only real answer to that is "it depends". On most implementations there is little real value in using async_work_group_copy. Image reads may have slightly higher latency than buffer reads when there is a cache hit, but you may get better cache behaviour from them on some architectures. Clamping, address calculation and filtering are effectively free operations performed by dedicated hardware that you'd otherwise have to shift into shader code when using buffers, so that reduces your read latency and may increase throughput.
If you are going to get big caching benefits from images, local memory may just get in the way. The extra cost of writing to it, synchronizing, reading from it, calculating addresses and so on may cost you.
Sadly this is just one of those things you'll have to experiment with on your target architectures.
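To show what "clamping is free with images" actually saves, here is a CPU-side C++ sketch of the clamp-to-edge address calculation that kernel code has to do by hand when reading from a plain buffer. With an image object and a sampler that uses `CLK_ADDRESS_CLAMP_TO_EDGE`, the hardware does the equivalent work. `readClampedToEdge` is a hypothetical helper written for this illustration, and a row-major single-channel float layout is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reads pixel (x, y) from a row-major buffer, clamping out-of-range
// coordinates to the nearest edge pixel.
inline float readClampedToEdge(const std::vector<float>& buf,
                               int width, int height, int x, int y) {
    assert(width > 0 && height > 0);
    assert(buf.size() == static_cast<std::size_t>(width) *
                         static_cast<std::size_t>(height));
    const int cx = std::min(std::max(x, 0), width - 1);
    const int cy = std::min(std::max(y, 0), height - 1);
    return buf[static_cast<std::size_t>(cy) * static_cast<std::size_t>(width) +
               static_cast<std::size_t>(cx)];
}
```

Every buffer read near a border pays for these min/max operations and the index multiply, which is the cost the answer refers to when comparing buffers against images.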

glDrawElements massive cpu usage on iOS

Hardware: iPad2
Software: OpenGL ES 2.0 C++
glDrawElements seems to take up about 25% of the CPU, making the CPU time 18ms and the GPU time 10ms per frame.
When I don't use an index buffer and use glDrawArrays, it speeds up and glDrawArrays barely shows up in the profiler. Everything else is the same, except that glDrawArrays has more verts because I have to duplicate verts in the VBO without the index buffer.
so far:
virtually the same amount of state changes between the two methods
vertex structure is two floats(8 bytes).
index buffer is 16-bit (tried 32-bit as well)
GL_STATIC_DRAW for both buffers
buffers don't change after load
the same VBO and index buffer are rendered multiple times per frame, with different offsets and sizes
no OpenGL errors
So it looks like it's doing a software fallback of some sort. But I can't figure out what would cause OpenGL to fall back.
There are a few things that immediately jump to mind that might affect speed the way you describe.
For one, many commands are issued passively to reduce the number of bus transfers. They are queued up and wait for the next batch transfer. State changes, texture changes, and similar commands all accumulate. It is possible that the draw commands are triggering a larger transfer in one case but not in the other, or that you are triggering more frequent transfers in one case than in the other.
For another, your specific models might be better organized for one or the other draw call. You need to look at how big they are, whether they reuse index values, and whether they are optimized or reordered for rendering. glDrawArrays may require more data to be transferred, but if your models are small the overhead may not be much of a concern. Draw frequency becomes important: you want to send off calls frequently to keep the card busy and let your CPU do other work rather than letting them accumulate in the command buffer waiting to be sent, but this needs to be balanced since there is a cost to those transfers.
And to top it off, indexed values can benefit from cache effects when they are frequently reused, while linearly accessed arrays benefit from cache effects precisely because they are accessed linearly, so you need to know your data: different types of data benefit from different methods.
Even Apple seems to be unsure which method to use.
Up until iOS 7, the OpenGL ES Programming Guide for iOS said:
For best performance, your models should be submitted as a single unindexed triangle strip using glDrawArrays with as few duplicated vertices as possible. If your models require many vertices to be duplicated (...), you may obtain better performance using a separate index buffer and calling glDrawElements instead. ... For best results, test your models using both indexed and unindexed triangle strips, and use the one that performs the fastest.
But their updated OpenGL ES Programming Guide for iOS that applies to iOS8 offers the opposite:
For best performance, your models should be submitted as a single indexed triangle strip. To avoid specifying data for the same vertex multiple times in the vertex buffer, use a separate index buffer and draw the triangle strip using the glDrawElements function
It looks like in your case you have just tried both, and found that one method is better suited for your data.
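Since the trade-off depends on how much vertex reuse a model has, a back-of-the-envelope comparison of the data each path stores can be a useful first check. The C++ sketch below uses a hypothetical quad-grid mesh and invented helper names, and assumes triangle lists rather than strips to keep the arithmetic simple. The 8-byte vertex and 16-bit index sizes come from the question.

```cpp
#include <cassert>
#include <cstddef>

struct MeshBytes {
    std::size_t unindexed;  // glDrawArrays: every triangle corner is stored
    std::size_t indexed;    // glDrawElements: unique vertices plus indices
};

// Grid of quadsX * quadsY quads drawn as a triangle list
// (2 triangles, i.e. 6 corners, per quad).
inline MeshBytes gridMeshBytes(std::size_t quadsX, std::size_t quadsY,
                               std::size_t vertexBytes, std::size_t indexBytes) {
    assert(quadsX > 0 && quadsY > 0);
    const std::size_t corners = quadsX * quadsY * 6;
    const std::size_t uniqueVerts = (quadsX + 1) * (quadsY + 1);
    return {corners * vertexBytes,
            uniqueVerts * vertexBytes + corners * indexBytes};
}
```

For a 10x10 grid with 8-byte vertices and 2-byte indices, `gridMeshBytes(10, 10, 8, 2)` gives 4800 bytes unindexed against 2168 bytes indexed. Byte counts alone say nothing about driver overhead, so measuring both paths, as the guide suggests, is still the deciding step.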

Determining available video memory

When developing an OpenGL program, is there a way to poll from the system to find out just how many megabytes are available to store textures, etc?
Or is the standard approach these days to just allocate memory and forget about everything?
Although the official stance remains "you don't need to know, you don't want to know, and it would not help you anyway", luckily at least two IHVs have shown a little more insight lately and offer extensions to query that information:
NVX_gpu_memory_info
ATI_meminfo
One nice thing about these extensions is that they have a least common denominator which is just what most people need, and you don't need to query extension support or do anything special, as they both work via glGetIntegerv.
In the easiest case, you can just initialize an array of 4 integers to zero (or some minimum default value that you'll assume in case the extensions don't work), then you call glGetIntegerv twice (with GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX and TEXTURE_FREE_MEMORY_ATI, respectively), and finally call glGetError to clear the error state. glGetIntegerv does not modify the pointed-to memory if it fails, nor does it crash or do anything else bad -- it merely sets the error state to GL_INVALID_ENUM.
Both extensions return a value in the first array position; the ATI one returns some values in the other 3 too.
n.b.: glAreTexturesResident has not been supported for almost a decade on mainstream hardware, in the same manner as texture priorities. The common mantra is that the driver writer knows much better than you anyway.
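The query sequence described above can be sketched in C++ as follows. To keep the sketch free of any GL headers, the two GL entry points are passed in as parameters; that indirection, the function name `queryFreeVideoMemoryKiB` and the constant names are assumptions of this example, not part of either extension. The enum values are the ones listed in the NVX_gpu_memory_info and ATI_meminfo specifications, and both extensions report sizes in kilobytes.

```cpp
#include <cassert>

// Enum values from the NVX_gpu_memory_info and ATI_meminfo extension specs.
constexpr unsigned int kGpuMemoryInfoCurrentAvailableVidmemNVX = 0x9049;
constexpr unsigned int kTextureFreeMemoryATI = 0x87FC;

// getIntegerv / getError are passed in so the sketch has no GL dependency;
// in a real program pass glGetIntegerv and glGetError (with a current context).
// Returns free video memory in KiB, or fallbackKiB if neither query works.
template <typename GetIntegerv, typename GetError>
int queryFreeVideoMemoryKiB(GetIntegerv getIntegerv, GetError getError,
                            int fallbackKiB) {
    assert(fallbackKiB >= 0);
    // A failed query leaves the array untouched, so the default survives.
    int values[4] = {fallbackKiB, 0, 0, 0};
    getIntegerv(kGpuMemoryInfoCurrentAvailableVidmemNVX, values);
    getIntegerv(kTextureFreeMemoryATI, values);
    // Drain the error queue: each unsupported enum leaves GL_INVALID_ENUM behind.
    while (getError() != 0) {
    }
    return values[0];
}
```

A call would look like `queryFreeVideoMemoryKiB(glGetIntegerv, glGetError, 64 * 1024)`, where the last argument is whatever minimum you are willing to assume on drivers that support neither extension.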
OpenGL doesn't give you this information. And frankly, there's little benefit, simply because today we have multitasking operating systems. The OpenGL driver is responsible for swapping texture data to and from system memory if there's demand for it.
What OpenGL can do for you is tell you whether the textures you've uploaded are still resident in fast memory. The function is called "glAreTexturesResident". You can use this to gradually upload stuff to the GPU until you've filled up the GPU's memory. But keep in mind that you're not the only user of the GPU.

What are possible causes of IDirect3DVertexBuffer9::Lock failing?

In error reports from some end users of our game I have quite often seen the following behaviour: IDirect3DVertexBuffer9::Lock fails, and the returned error code is D3DERR_NOTAVAILABLE.
Once this happens, quite frequently (but not always) it is followed by the CreateTexture or CreateVertexBuffer call failing with error D3DERR_OUTOFVIDEOMEMORY.
What are possible reasons for a vertex buffer lock failure? Could the virtual memory address space be exhausted, or what?
Based on the DIRECTXDEV response by Chuck Walbourn from Microsoft, besides "out of address space" another cause could be "out of paged pool".
Alternatively, on Windows XP this could indicate you have hit the limits of paged pool kernel memory. Typically this happens when you create a lot of Direct3D resources (textures, etc.)
We DO create a lot of Direct3D resources.
This is what I posted to DirectXDev: ;)
Have you checked how much memory your application is using? (Be sure to select the Virtual Memory column in Task Manager!). My guess would be memory fragmentation based issues causing you to, as you suggest, run out of address space.
It could, however, be a driver bug ...
Does the debug runtime provide any useful information?
Edit: The only other thing I can think of is that the aperture memory has run out. I don't know how this works with PCI Express, but on AGP you can set the aperture size. I've no idea how to check whether it is full, however. I suspect the error you are seeing is reporting that it's full. Are you doing lots of locks with the Discard flag? If so, it's possible that these are creating tonnes of new allocations in the aperture and causing you to run out of memory there. This is pure guesswork, however.
I'd guess that if this is happening with only some of your users, it is those on the lower end machines. If things run slowly then you can end up with a lot of data buffered in the command buffer. This will make control laggy and "could", at a guess, lead to the problem you are seeing. You may want to try making sure the command buffer never gets too long. If you make sure the first lock of every frame is done without the discard flag (i.e. flags set to 0), then this will cause the pipeline to stall until the vertex buffer has been rendered and bring the command buffer back in sync with you. This will cause a slowdown, as the command buffering will not be able to smooth out frame rate spikes as easily ...
Anyway ... that's just a guess!
The point raised about running out of memory is valid. We need some details on the Lock() call to be sure, but for example if the buffer is in the DEFAULT pool and it's dynamic (D3DLOCK_DISCARD flag passed), it's quite possible that your driver tries to find an unused piece of memory to return (because it double or triple buffers internally) and fails because, as you discover yourself soon after, video memory is exhausted.

Resources