Metal best practices - changing renderPipelineDescriptor during render - iOS

During my rendering pipeline I would like to use a few shaders and in some cases modify parameters on the MTLRenderPipelineDescriptor object (for example, change blending functions).
As I see it, I have two options:
Create and precompile one MTLRenderPipelineState for each combination of parameters (vertex shader, fragment shader, blending, etc.). I can end up with many such state objects because there could be many combinations.
Create and compile new MTLRenderPipelineState objects during the rendering process.
Which of the options would be better? Are there any other options I am missing?

For best practice (and best performance), you should follow your Option 1.
In its Transient and Non-transient Objects in Metal section, the Metal Programming Guide is quite clear about which objects should be considered transient or non-transient, and that non-transient objects should be cached and reused.
For the MTLRenderPipelineState object in particular, here's what the guide has to say in the Creating a Render Pipeline State section:
A render pipeline state object is a long-lived persistent object that
can be created outside of a render command encoder, cached in advance,
and reused across several render command encoders. When describing the
same set of graphics state, reusing a previously created render
pipeline state object may avoid expensive operations that re-evaluate
and translate the specified state to GPU commands.

Option #1 is better.
With option #2 it isn't clear if you are thinking of discarding the object at the end of each rendering pass or if you would cache it and use it next time you require that permutation.
The former would be a very bad idea, but the latter would be a good-enough, pragmatic approach if the number of possible permutations your code has to support is very large, while the number you are actually going to use in any given run is relatively small and you have no easy way of determining it in advance. This sort of scenario isn't ideal, but it can easily be imagined in the context of writing engine-level code which has to expose a lot of flexibility to project-level code.
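To make the caching approach concrete, here is a minimal Swift sketch. The PipelineKey and PipelineCache names are made up for illustration (they are not part of Metal): each MTLRenderPipelineState is compiled the first time its combination of parameters is requested and reused from then on, and the same cache can be warmed up front if the combinations are known in advance.

import Metal

// Hypothetical key describing one combination of pipeline parameters.
struct PipelineKey: Hashable {
    var vertexFunction: String
    var fragmentFunction: String
    var blendingEnabled: Bool
}

final class PipelineCache {
    private let device: MTLDevice
    private let library: MTLLibrary
    private var cache: [PipelineKey: MTLRenderPipelineState] = [:]

    init(device: MTLDevice, library: MTLLibrary) {
        self.device = device
        self.library = library
    }

    // Returns a cached pipeline state, compiling it only on first use.
    func pipelineState(for key: PipelineKey,
                       pixelFormat: MTLPixelFormat = .bgra8Unorm) throws -> MTLRenderPipelineState {
        if let cached = cache[key] { return cached }

        let descriptor = MTLRenderPipelineDescriptor()
        descriptor.vertexFunction = library.makeFunction(name: key.vertexFunction)
        descriptor.fragmentFunction = library.makeFunction(name: key.fragmentFunction)
        descriptor.colorAttachments[0].pixelFormat = pixelFormat
        descriptor.colorAttachments[0].isBlendingEnabled = key.blendingEnabled
        if key.blendingEnabled {
            descriptor.colorAttachments[0].sourceRGBBlendFactor = .sourceAlpha
            descriptor.colorAttachments[0].destinationRGBBlendFactor = .oneMinusSourceAlpha
        }

        let state = try device.makeRenderPipelineState(descriptor: descriptor)
        cache[key] = state
        return state
    }
}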

Related

Metal in-place operation for `MTLComputeCommandEncoder` on iOS

Is it possible to have a Metal compute function that processes a texture in-place on iOS? I have noticed that some MPS image filters support in-place processing, and was wondering if there is a way to accomplish this with custom kernels.
Specifically, I am looking to combine two textures into one using a blend function. I can easily do this by making the first texture a render target and using a shader to write the second one on top, but it feels like overkill since both textures are the same size.
Yes, you can take a texture parameter with the access::read_write access qualifier, and read and write it within the same kernel function invocation. You'll need to ensure that the texture is created with both the .shaderRead and .shaderWrite usage flags. Additionally, note that writes are not guaranteed to be visible to subsequent reads by the same thread unless you call the fence() function after the write.
By the way, MetalPerformanceShaders kernels that are able to operate "in-place" don't necessarily use read_write textures; it's often the case that they use auxiliary textures and buffers and do their work across multiple passes. Per the documentation, any kernel can fail to operate in-place for any number of reasons, so you should always provide a fallback allocator to handle such cases.
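As a host-side illustration in Swift (the function names and the "blendInPlace" kernel are assumptions, not part of Metal or MPS): the texture must be created with both shader-read and shader-write usage, and the kernel's target texture parameter would be declared with access::read_write on the shader side. Note that read_write support for a given pixel format depends on the device's readWriteTextureSupport tier.

import Metal

// Create a texture that a kernel can both read and write.
func makeReadWriteTexture(device: MTLDevice, width: Int, height: Int) -> MTLTexture? {
    let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba8Unorm,
                                                        width: width,
                                                        height: height,
                                                        mipmapped: false)
    // Both flags are required for a texture bound with access::read_write.
    // (rgba8Unorm read_write access requires readWriteTextureSupport tier 2.)
    desc.usage = [.shaderRead, .shaderWrite]
    return device.makeTexture(descriptor: desc)
}

// Encode a hypothetical "blendInPlace" kernel that blends `source` into `target`.
func encodeInPlaceBlend(commandBuffer: MTLCommandBuffer,
                        pipeline: MTLComputePipelineState,
                        target: MTLTexture,
                        source: MTLTexture) {
    guard let encoder = commandBuffer.makeComputeCommandEncoder() else { return }
    encoder.setComputePipelineState(pipeline)
    encoder.setTexture(target, index: 0)   // access::read_write in the kernel
    encoder.setTexture(source, index: 1)   // access::read in the kernel

    let threadsPerGroup = MTLSize(width: 8, height: 8, depth: 1)
    let groups = MTLSize(width: (target.width + 7) / 8,
                         height: (target.height + 7) / 8,
                         depth: 1)
    encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
    encoder.endEncoding()
}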

Metal: Do I need multiple RenderPipelines to have multiple shaders?

I am very new to Metal, so bear with me as I am transitioning from the ugly state machine calls of OpenGL to modern graphics frameworks. I really want to make sure I understand how everything works and fits together.
I have read most of Apple's documentation, but it does a better job of describing the function of individual components than how they come together.
Essentially, I am trying to understand whether multiple renderPipelines and renderEncoders are needed in my situation.
To describe my pipeline at a high level here is what goes on:
Retrieve the previous frame's contents from an offscreen texture that was rendered to and draw some new contents onto it.
Switch to rendering on the screen. Draw the texture from step 1 to the screen.
Do some post processing (in native resolution).
Draw the UI on top as quads. (Essentially a repeat of step 2.)
So in essence there will be the following vertex/fragment shader pairs:
Draw the entities (step 1)
Draw quads in a specified area (steps 2 and 4)
Post-processing shader 1 (step 3), which uses different inputs than post-processing shader 2 and can't be done in the same shader
Post-processing shader 2 (step 3), which uses different inputs than post-processing shader 1 and can't be done in the same shader
There will be the following texture groups:
Texture for each UI element
Texture for the offscreen drawing done in step 1
Potentially more offscreen textures will be used in post processing, depending on Metal's performance
Ultimately, my confusions are these:
Q1. Render pipelines take only one vertex and one fragment function, so does this mean I need to have 4 render pipelines even though I only have 3 unique steps in my drawing procedure?
Q2. How am I supposed to use multiple pipelines in one encoder? Wouldn't each successive call to .setRenderPipelineState override the previous one?
Q3. Would you recommend keeping all of my .setFragmentTexture calls right after creating my encoder, or do I need to set those only right before they are needed?
Q4. Is it valid to keep my depthState constant even as I switch between pipelineStates? How do I ensure that my entities on step 1 are rendered with depth but make sure depth information is lost between frames so entities are all on top of the previous contents?
Q5. What do I do with render step 3 where I have two post processing steps? Do those have to be separate pipelines?
Q6. How can I efficiently build my pipeline knowing that steps 2 and 4 are essentially the same just with different inputs?
I guess it would help me if someone would walk me through what renderPipelineObjects I will need and for what. It would also be useful to understand what some of the renderCommandEncoder commands might look like at a pseudocode level.
Q1. Render Pipelines take only one vertex and one fragment function, so does this mean I need to have 4 render pipelines even though I only have 3 unique steps to my drawing procedure?
If there are 4 unique combinations of shader functions, then it's not correct that you "only have 3 unique steps to my drawing procedure". In any case, yes, you need a separate render pipeline state object for each unique combination of shader functions (as well as for any other attribute of the render pipeline state descriptor that you need to change).
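As a rough sketch (the function names fullscreenVertex, blitFragment, and postProcessFragment are placeholders): two pipeline states can share the same vertex function while differing in their fragment function, but each unique combination still gets its own MTLRenderPipelineState.

func makePipelines(device: MTLDevice, library: MTLLibrary) throws
        -> (blit: MTLRenderPipelineState, post: MTLRenderPipelineState) {
    let blitDesc = MTLRenderPipelineDescriptor()
    blitDesc.vertexFunction = library.makeFunction(name: "fullscreenVertex")
    blitDesc.fragmentFunction = library.makeFunction(name: "blitFragment")
    blitDesc.colorAttachments[0].pixelFormat = .bgra8Unorm

    let postDesc = MTLRenderPipelineDescriptor()
    postDesc.vertexFunction = library.makeFunction(name: "fullscreenVertex")   // shared vertex shader
    postDesc.fragmentFunction = library.makeFunction(name: "postProcessFragment")
    postDesc.colorAttachments[0].pixelFormat = .bgra8Unorm

    // One state object per unique descriptor configuration.
    return (try device.makeRenderPipelineState(descriptor: blitDesc),
            try device.makeRenderPipelineState(descriptor: postDesc))
}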
Q2. How am I supposed to use multiple pipelines in one encoder? Wouldn't each successive call to .setRenderPipelineState override the previous one?
When you send a draw method to the render command encoder, that draw command is encoded with all of the relevant current state and written to the command buffer. If you later change the render pipeline state associated with the encoder that doesn't affect previously-encoded commands, it only affects subsequently-encoded commands.
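For example (the pipeline, buffer, and texture names are hypothetical), both draws below are encoded by the same encoder, and each picks up whatever pipeline state and fragment textures are set at the moment it is encoded:

func encodeFrame(encoder: MTLRenderCommandEncoder,
                 entityPipeline: MTLRenderPipelineState,
                 quadPipeline: MTLRenderPipelineState,
                 entityVertices: MTLBuffer,
                 entityVertexCount: Int,
                 offscreenTexture: MTLTexture) {
    // Draws encoded from here on use entityPipeline.
    encoder.setRenderPipelineState(entityPipeline)
    encoder.setVertexBuffer(entityVertices, offset: 0, index: 0)
    encoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: entityVertexCount)

    // Switching state only affects draws encoded after this point.
    encoder.setRenderPipelineState(quadPipeline)
    encoder.setFragmentTexture(offscreenTexture, index: 0)
    encoder.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: 4)
}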
Q3. Would you recommend keeping all of my .setFragmentTexture calls right after creating my encoder, or do I need to set those only right before they are needed?
You only need to set them before the draw command that uses them is encoded. Beyond that, it doesn't much matter when you set them. I'd do whatever makes for the clearest, most readable code.
Q4. Is it valid to keep my depthState constant even as I switch between pipelineStates?
Yes. If it weren't valid, there wouldn't be separate methods to set them independently; there would be a single method that sets both.
How do I ensure that my entities on step 1 are rendered with depth but make sure depth information is lost between frames so entities are all on top of the previous contents?
Configure the loadAction for the depth attachment in the render pass descriptor to clear with an appropriate value (e.g. 1.0). If you're using multiple render command encoders, only do this for the first one, of course. Likewise, the render pass descriptor of the last (or only) render command encoder can/should use a storeAction of .dontCare.
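A possible configuration, sketched in Swift with a hypothetical shared depthTexture: the first encoder of the frame clears depth, and the last one declines to store it, so no depth information survives into the next frame.

// First render command encoder of the frame: start with fresh depth.
func makeFirstPassDescriptor(depthTexture: MTLTexture) -> MTLRenderPassDescriptor {
    let pass = MTLRenderPassDescriptor()
    pass.depthAttachment.texture = depthTexture
    pass.depthAttachment.loadAction = .clear
    pass.depthAttachment.clearDepth = 1.0
    pass.depthAttachment.storeAction = .store   // later encoders in this frame may still need it
    return pass
}

// Last render command encoder of the frame: depth is not needed afterwards.
func makeLastPassDescriptor(depthTexture: MTLTexture) -> MTLRenderPassDescriptor {
    let pass = MTLRenderPassDescriptor()
    pass.depthAttachment.texture = depthTexture
    pass.depthAttachment.loadAction = .load
    pass.depthAttachment.storeAction = .dontCare
    return pass
}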
Q5. What do I do with render step 3 where I have two post processing steps? Do those have to be separate pipelines?
Well, the description of your scenario is kind of vague. But, if you want to use a different shader function, then, yes, you need to use a different render pipeline state object.
Q6. How can I efficiently build my pipeline knowing that steps 2 and 4 are essentially the same just with different inputs?
Again, your description is entirely too vague to know how to answer this. In what ways are those steps the same? In what ways are they different? What do you mean about different inputs?
In any case, just do what seems like the simplest, most direct way even if it seems like it might be inefficient. Worry about optimizations later. When that time comes, open a new question and show your actual working code and ask specifically about that.

Metal Best Practice: Triple-buffering – Textures too?

In the Metal Best Practices Guide, it states that for best performance one should "implement a triple buffering model to update dynamic buffer data," and that "dynamic buffer data refers to frequently updated data stored in a buffer."
Does an MTLTexture qualify as "frequently updated data stored in a buffer" if it needs to be updated every frame? All the examples in the guide above focus on MTLBuffers.
I notice Apple's implementation in MetalKit has a concept of a nextDrawable, so perhaps that's what's happening here?
If a command could be in flight and it could access (read/sample/write) the texture while you're modifying that same texture on the CPU (e.g. using one of the -replaceRegion:... methods or by writing to a backing IOSurface), then you will need a multi-buffering technique, yes.
If you're only modifying the texture on the GPU (by rendering to it, writing to it from a shader function, or using blit command encoder methods to copy to it), then you don't need multi-buffering. You may need to use a texture fence within the shader function or you may need to call -textureBarrier on the render command encoder between draw calls, depending on exactly what you're doing.
Yes, nextDrawable provides a form of multi-buffering. In this case, it's not due to CPU access, though. You're going to render to one texture while the previously-rendered texture may still be on its way to the screen. You don't want to use the same texture for both because the new rendering could overdraw the texture just before it's put on screen, thus showing corrupt results.
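A rough sketch of that multi-buffering technique for a CPU-updated texture (the TextureRing type and its method names are invented for illustration): keep a small ring of textures guarded by a semaphore, so the CPU never writes into a texture that an in-flight command buffer might still be reading.

import Dispatch
import Metal

final class TextureRing {
    static let count = 3   // triple buffering
    private let textures: [MTLTexture]
    private let inflightSemaphore = DispatchSemaphore(value: TextureRing.count)
    private var index = 0

    init?(device: MTLDevice, descriptor: MTLTextureDescriptor) {
        var made: [MTLTexture] = []
        for _ in 0..<TextureRing.count {
            guard let texture = device.makeTexture(descriptor: descriptor) else { return nil }
            made.append(texture)
        }
        textures = made
    }

    // Call once per frame: blocks until one of the textures is no longer
    // referenced by an in-flight command buffer, then hands it out for
    // CPU-side updates (e.g. replaceRegion).
    func nextTexture() -> MTLTexture {
        inflightSemaphore.wait()
        index = (index + 1) % TextureRing.count
        return textures[index]
    }

    // Call on the command buffer that reads the texture handed out above,
    // before committing it; the semaphore is released once the GPU is done.
    func signalWhenCompleted(_ commandBuffer: MTLCommandBuffer) {
        let semaphore = inflightSemaphore
        commandBuffer.addCompletedHandler { _ in semaphore.signal() }
    }
}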

ELKI: Normalization undo for result

I am using the ELKI MiniGUI to run LOF. I have found out how to normalize the data before running with -dbc.filter, but I would like to look at the original data records, not the normalized ones, in the output.
It seems that there is some flag called -normUndo, which can be set if using the command-line, but I cannot figure out how to use it in the MiniGUI.
This functionality used to exist in ELKI, but has effectively been removed (for now).
Only a few normalizations ever supported this; most would fail.
There is no longer a well-defined "end" with the visualization. Some users will want to visualize the normalized data, others not.
It requires carrying normalization information along, which makes the data structures more complex (although the hierarchical approach we have now would allow this again).
Due to the numerical imprecision of floating-point math, you would frequently not get back exactly the same values you put in.
Keeping the original data in memory may be too expensive for some use cases, so we would need to add another parameter, "keep non-normalized data"; furthermore, you would need to choose which data (normalized or non-normalized) to use for analysis, and which for visualization. This would not be hard with a full-blown GUI, but you are looking at a command-line interface. (This is easy to do with Java, too...)
We would of course appreciate patches that contribute such functionality to ELKI.
The easiest workaround is this: add a (non-numerical) label column, and you can then identify the original objects in your original data by this label.

boost lockfree spsc_queue cache memory access

I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where a pointer to a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to access the element, does it not need to access the pointed-to object itself by finding and loading it from either the L3 cache or across the interconnect (if the other thread is on a different CPU socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ is principally a pay-for-what-you-need ecosystem.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
Enter Ringbuffers
Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, you just have to maintain the indices of the buffer's front and end positions atomically. The container can, at its leisure, update an internal "dirty front index" while it copies ahead of the external front (the copy can use relaxed memory ordering). Only once the whole copy is known to have completed is the external front index updated. This update needs to be in acq_rel/cst memory order [1].
As long as the container is able to guard the invariant that the front never fully wraps around and reaches the back, this is a sweet deal. I think this idea was popularized by the Disruptor library (of LMAX fame). You get mechanical sympathy from:
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple of) physical cache lines
all the data is local unless the POD contains raw references outside that record
How Does Boost's spsc_queue Actually Do This?
Yes, spsc_queue stores the raw element values in a contiguous, aligned block of memory (e.g. from compile_time_sized_ringbuffer, which underlies spsc_queue with a statically supplied maximum capacity):
typedef typename boost::aligned_storage<max_size * sizeof(T),
                                        boost::alignment_of<T>::value
                                       >::type storage_type;
storage_type storage_;

T * data()
{
    return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there is only the "internal" index on either the read or the write side. This is possible because there is only one writing thread and only one reading thread, which means that there could only be more space at the end of a write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::uninitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the uninitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)
All in all, we see a best-in-class implementation of the ring-buffer idea.
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing frequently used data right there in the value, and referencing the less frequently used data using a pointer (the cost of the pointer will be low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can achieve memory access patterns there too.
Let your profiler guide you.
[1] I suppose that for SPSC, acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself; I recommend anyone else follow my example :)
