I am very new to metal so bear with me as I am transitioning from the ugly state machine calls of OpenGL to modern graphics frameworks. I really want to make sure I understand how everything works and works together.
I have read most of Apples documentation but it does a better job describing the function of individual components than how they come together.
I am trying to understand essentially whether I should have multiple renderPipelines and renderEncoders are needed in my situation.
To describe my pipeline at a high level here is what goes on:
Retrieve the previous frame's contents from an offscreen texture that was rendered to and draw some new contents onto it.
Swith to rendering on the screen. Draw the texture from step 1 to the screen.
Do some post processing (in native resolution).
Draw the UI ontop as quads. (essentailly a repeat of 2)
So in essence there will be the following vertex/fragment shader pairs
Draw the entities (step 1)
Draw quads on a specefied area (step 2 and 4)
Post processing shader 1 (step 3) uses different inputs than D and cant be done in the same shader
Post processing shader 2 (step 3) uses different inputs than C and can't be done in the same shader
There will be the following texture groups
Texture for each UI element
Texture for the offscreen drawing done in step 1
Potentially more offscreen textures will be used in post processing depening on metals preformance
Ultimately my confusions are this:
Q1. Render Pipelines take only one vertex and one fragment function so does this mean I need have 4 render pipelines even though I only have 3 unique steps to my drawing procedure?
Q2. How am I supposed to use multiple pipelines in one encoder? Wouldn't each sucessive call on .setRenderPipelineState override the previous one?
Q3. Would you recommend keeping all of my .setFragmentTexture calls right after creating my encoder or do I need to set those only right before they are needed.
Q4. Is it valid to keep my depthState constant even as I switch between pipelineStates? How do I ensure that my entities on step 1 are rendered with depth but make sure depth information is lost between frames so entities are all on top of the previous contents?
Q5. What do I do with render step 3 where I have two post processing steps? Do those have to be seperate pipelines?
Q6. How can I efficiently build my pipeline knowing that steps 2 and 4 are essentially the same just with different inputs?
I guess it would help me if someone would walk me through what renderPipelineObjects I will need and for what. It would also be useful to understand what some of the renderCommandEncoder commands might look like at a psuedocode level.
Q1. Render Pipelines take only one vertex and one fragment function so does this mean I need have 4 render pipelines even though I only have 3 unique steps to my drawing procedure?
If there are 4 unique combinations of shader functions, then it's not correct that you "only have 3 unique steps to my drawing procedure". In any case, yes, you need a separate render pipeline state object for each unique combination of shader functions (as well as for any other attribute of the render pipeline state descriptor that you need to change).
Q2. How am I supposed to use multiple pipelines in one encoder? Wouldn't each sucessive call on .setRenderPipelineState override the previous one?
When you send a draw method to the render command encoder, that draw command is encoded with all of the relevant current state and written to the command buffer. If you later change the render pipeline state associated with the encoder that doesn't affect previously-encoded commands, it only affects subsequently-encoded commands.
Q3. Would you recommend keeping all of my .setFragmentTexture calls right after creating my encoder or do I need to set those only right before they are needed.
You only need to set them before the draw command that uses them is encoded. Beyond that, it doesn't much matter when you set them. I'd do whatever makes for the clearest, most readable code.
Q4. Is it valid to keep my depthState constant even as I switch between pipelineStates?
Yes, or there wouldn't be separate methods to set them independently. There would be a method to set both.
How do I ensure that my entities on step 1 are rendered with depth but make sure depth information is lost between frames so entities are all on top of the previous contents?
Configure the loadAction for the depth attachment in the render pass descriptor to clear with an appropriate value (e.g. 1.0). If you're using multiple render command encoders, only do this for the first one, of course. Likewise, the render pass descriptor of the last (or only) render command encoder can/should use a storeAction of .dontCare.
Q5. What do I do with render step 3 where I have two post processing steps? Do those have to be seperate pipelines?
Well, the description of your scenario is kind of vague. But, if you want to use a different shader function, then, yes, you need to use a different render pipeline state object.
Q6. How can I efficiently build my pipeline knowing that steps 2 and 4 are essentially the same just with different inputs?
Again, your description is entirely too vague to know how to answer this. In what ways are those steps the same? In what ways are they different? What do you mean about different inputs?
In any case, just do what seems like the simplest, most direct way even if it seems like it might be inefficient. Worry about optimizations later. When that time comes, open a new question and show your actual working code and ask specifically about that.
Related
I'm writing a graphics engine in Metal and I'm using the stencil buffer to mask the volumes covered by the Spherical Harmonic lights in the scene. I use two shaders for that, and I need 3 draw calls per light: one for the back faces, another for the front faces, and a final draw call with a different shader to actually render the light.
But, if I understood well Metal documentation, you need to define all your passes "statically", that is, you need a different Render Command Encoder for every shader and render surfaces configurations that you use. Is this correct?
That means that I ended up creating this loop for my lights, which feels quite horrible because I'm creating lots of encoders,
for l in shLights {
let descStencil = createLightAccumulationRenderPass()
guard let encoderStencil = commandBuffer.makeRenderCommandEncoder(descriptor: descStencil) else {
continue
}
drawSHLightStencil(l, encoder: encoderStencil)
encoderStencil.endEncoding()
let descColor = createLightAccumulationRenderPass()
guard let encoderColor = commandBuffer.makeRenderCommandEncoder(descriptor: descColor) else {
continue
}
drawSHLight(l, encoder: encoderColor)
encoderColor.endEncoding()
}
The full code is here: https://github.com/endavid/VidEngine/blob/master/VidFramework/VidFramework/sdk/gfx/plugins/DeferredLightingPlugin.swift (drawSHLights function)
And if you need more context on how this is used, please check this blogpost: http://endavid.com/index.php?entry=85
I also tried reusing the encoders, but if you don't call endEncoding, Metal crashes on the next call of makeRenderCommandEncoder.
Is it possible to combine those encoders in any way?
Edit:
I've taken a GPU capture so it's easier to see the whole render pipeline. Here's a screenshot,
It's quite small, but I've put some labels on top. The white labels correspond to the stuff in the loop. There are 3 lights in the scene, and 3 spheres are being lit by them.
But, if I understood well Metal documentation, you need to define all your passes "statically", that is, you need a different Render Command Encoder for every shader and render surfaces configurations that you use. Is this correct?
No, that's not entirely correct. You'll notice that there are some attributes of a render command encoder that are specified via the MTLRenderPassDescriptor at the time you create the encoder, and there are other attributes that are set via accessors on the encoder after it has been created. The former are immutable for the lifetime of the encoder. The latter can be changed.
So, you do need a new command encoder if you change the render targets (attachments). But you do not need a new command encoder to change the shaders. The shaders are specified by the render pipeline state and can be changed on an existing command encoder using setRenderPipelineState(_:).
It is definitely true that you should create your render pipeline state objects once in the lifetime of the app, if at all possible. But you can reuse them as often as needed after that.
Finally, I would not worry too much about creating multiple render command encoders. They are designed to be relatively cheap to create. So, while expending a bit of effort to consolidate all work which can be done with a given encoder together is fine, don't bend over backwards to try to make something "simpler" when it runs against the grain of how things work.
What i'm doing is GPGPU on WebGL and I don't know the access pattern which I'd be talking about applies to general graphics and gaming programs. In our code, frequently, we come across data which needs to be summarized or reduced per output texel. A very simple example is matrix multiplication during which, for every output texel, your return a value which is a dot product of a row of one input and a column of the other input.
This has been the sore point of our performance because of not so much the computation but multiplied data access. So I've been trying to find a pattern of reads or data layouts which would expedite this operation and I have been completely unsuccessful.
I will be describing some assumptions and some schemes below. The sample code for all these are under https://github.com/jeffsaremi/webgl-experiments
Unfortunately due to size I wasn't able to use the 'snippet' feature of StackOverflow. NOTE: All examples write to console not the html page.
Base matmul implementation: Example: [2,3]x[3,4]->[2,4] . This produces in a simplistic form 2 textures of (w:3,h:2) and (w:4,h:3). For each output texel I will be reading along the X axis of the left texture but going along the Y axis of the right texture. (see webgl-matmul.html)
Assuming that GPU accesses data similar to CPU -- that is block by block -- if I read along the width of the texture I should be hitting the cache pretty often.
For this, I'd layout both textures in a way that I'd be doing dot products of corresponding rows (along texture width) only. Example: [2,3]x[4,3]->[2,4] . Note that the data for the right texture is now transposed so that for each output texel I'd be doing a dot product of one row from the left and one row from the right. (see webgl-matmul-shared-alongX.html)
To ensure that the above assumption is indeed working, I created a negative test also. In this test I'd be reading along the Y axis of both left and right textures which should have the worst performance ever. Data is pre-transposed so that the results make sense. Example: [3,2]x[3,4]->[2,4]. (see webgl-matmul-shared-alongY.html).
So I ran these -- and I hope you could do as well to see -- and I found no evidence to support existence or non-existence of such caching behavior. You need to run each example a few times to get consistent results for comparison.
Then I came along this paper http://fileadmin.cs.lth.se/cs/Personal/Michael_Doggett/pubs/doggett12-tc.pdf which in short claims that the GPU caches data in blocks (or tiles as I call them).
Based on this promising lead I created a version of matmul (or dot product) which uses blocks of 2x2 to do its calculation. Prior to using this of course I had to rearrange my inputs into such layout. The cost of that re-arrangement is not included in my comparison. Let's say I could do that once and run my matmul many times after. Even this scheme did not contribute anything to the performance if not taking something away. (see webgl-dotprod-tiled.html).
A this point I am completely out of ideas and any hints would be appreciated.
thanks
During my rendering pipeline i would like to user a few shaders and in some cases modify parameters on the MTLRenderPipelineDescriptor object (for example, change blending functions).
As i see it, i have 2 options:
Create and precompile one MTLRenderPipelineState for each combination of parameters (vertex shader, fragment shader, blending, etc).I can have many such state objects because there could be many combinations.
Create and compile new MTLRenderPipelineState objects during the rendering process.
Which of the options would be better? Are there any other options i am missing.
For best-practice (and best performance), you should follow your Option 1.
In the Transient and Non-transient Objects in Metal section, The Metal Programming Guide is quite clear about which objects should be considered transient or non-transient, and that non-transient objects should be cached and reused.
For the MTLRenderPipelineState object in particular, here's what the guide has to say in the Creating a Render Pipeline State section:
A render pipeline state object is a long-lived persistent object that
can be created outside of a render command encoder, cached in advance,
and reused across several render command encoders. When describing the
same set of graphics state, reusing a previously created render
pipeline state object may avoid expensive operations that re-evaluate
and translate the specified state to GPU commands.
Option #1 is better.
With option #2 it isn't clear if you are thinking of discarding the object at the end of each rendering pass or if you would cache it and use it next time you require that permutation.
The former would be a very bad idea but the latter would be a good enough, pragmatic approach if the number of possible permutations your code has to support is very large, but the number you are actually going to use in any given run is relatively small and you have no easy way of determining it in advance. This sort of scenario isn't ideal, but can easily be imagined in the context of writing engine-level code which has to expose a lot of flexibility to project-level code.
I'm trying to implement one complex algorithm using GPU. The only problem is HW limitations and maximum available feature level is 9_3.
Algorithm is basically "stereo matching"-like algorithm for two images. Because of mentioned limitations all calculations has to be performed in Vertex/Pixel shaders only (there is no computation API available). Vertex shaders are rather useless here so I considered them as pass-through vertex shaders.
Let me shortly describe the algorithm:
Take two images and calculate cost volume maps (basically conterting RGB to Grayscale -> translate right image by D and subtract it from the left image). This step is repeated around 20 times for different D which generates Texture3D.
Problem here: I cannot simply create one Pixel Shader which calculates
those 20 repetitions in one go because of size limitation of Pixel
Shader (max. 512 arithmetics), so I'm forced to call Draw() in a loop
in C++ which unnecessary involves CPU while all operations are done on
the same two images - it seems to me like I have one bottleneck here. I know that there are multiple render targets but: there are max. 8 targets (I need 20+), if I want to generate 8 results in one pixel shader I exceed it's size limit (512 arithmetic for my HW).
Then I need to calculate for each of calculated textures box filter with windows where r > 9.
Another problem here: Because window is so big I need to split box filtering into two Pixel Shaders (vertical and horizontal direction separately) because loops unrolling stage results with very long code. Manual implementation of those loops won't help cuz still it would create to big pixel shader. So another bottleneck here - CPU needs to be involved to pass results from temp texture (result of V pass) to the second pass (H pass).
Then in next step some arithmetic operations are applied for each pair of results from 1st step and 2nd step.
I haven't reach yet here with my development so no idea what kind of bottlenecks are waiting for me here.
Then minimal D (value of parameter from 1st step) is taken for each pixel based on pixel value from step 3.
... same as in step 3.
Here basically is VERY simple graph showing my current implementation (excluding steps 3 and 4).
Red dots/circles/whatever are temporary buffers (textures) where partial results are stored and at every red dot CPU is getting involved.
Question 1: Isn't it possible somehow to let GPU know how to perform each branch form up to the bottom without involving CPU and leading to bottleneck? I.e. to program sequence of graphics pipelines in one go and then let the GPU do it's job.
One additional question about render-to-texture thing: Does all textures resides in GPU memory all the time even between Draw() method calls and Pixel/Vertex shaders switching? Or there is any transfer from GPU to CPU happening... Cuz this may be another issue here which leads to bottleneck.
Any help would be appreciated!
Thank you in advance.
Best regards,
Lukasz
Writing computational algorithms in pixel shaders can be very difficult. Writing such algorithms for 9_3 target can be impossible. Too much restrictions. But, well, I think I know how to workaround your problems.
1. Shader repetition
First of all, it is unclear, what do you call "bottleneck" here. Yes, theoretically, draw calls in for loop is a performance loss. But does it bottleneck? Does your application really looses performance here? How much? Only profilers (CPU and GPU) can answer. But to run it, you must first complete your algorithm (stages 3 and 4). So, I'd better stick with current solution, and started to implement whole algorithm, then profile and than fix performance issues.
But, if you feel ready to tweaks... Common "repetition" technology is instancing. You can create one more vertex buffer (called instance buffer), which will contains parameters not for each vertex, but for one draw instance. Then you do all the stuff with one DrawInstanced() call.
For you first stage, instance buffer can contain your D value and index of target Texture3D layer. You can pass-through them from vertex shader.
As always, you have a tradeof here: simplicity of code to (probably) performance.
2. Multi-pass rendering
CPU needs to be involved to pass results from temp texture (result of
V pass) to the second pass (H pass)
Typically, you do chaining like this, so no CPU involved:
// Pass 1: from pTexture0 to pTexture1
// ...set up pipeline state for Pass1 here...
pContext->PSSetShaderResources(slot, 1, pTexture0); // source
pContext->OMSetRenderTargets(1, pTexture1, 0); // target
pContext->Draw(...);
// Pass 2: from pTexture1 to pTexture2
// ...set up pipeline state for Pass1 here...
pContext->PSSetShaderResources(slot, 1, pTexture1); // previous target is now source
pContext->OMSetRenderTargets(1, pTexture2, 0);
pContext->Draw(...);
// Pass 3: ...
Note, that pTexture1 must have both D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_RENDER_TARGET flags. You can have multiple input textures and multiple render targets. Just make sure, that every next pass knows what previous pass outputs.
And if previous pass uses more resources than current, don't forget to unbind unneeded, to prevent hard-to-find errors:
pContext->PSSetShaderResources(2, 1, 0);
pContext->PSSetShaderResources(3, 1, 0);
pContext->PSSetShaderResources(4, 1, 0);
// Only 0 and 1 texture slots will be used
3. Resource data location
Does all textures resides in GPU memory all the time even between
Draw() method calls and Pixel/Vertex shaders switching?
We can never know that. Driver chooses appropriate location for resources. But if you have resources created with DEFAULT usage and 0 CPU access flag, you can be almost sure it will always be in video memory.
Hope it helps. Happy coding!
I want to get the properly rendered projection result from a Stage3D framework that presents something of a 'gray box' interface via its API. It is gray rather than black because I can see this critical snippet of source code:
matrix3D.copyFrom (renderable.getRenderSceneTransform (camera));
matrix3D.append (viewProjection);
The projection rendering technique that perfectly suits my needs comes from a helpful tutorial that works directly with AGAL rather than any particular framework. Its comparable rendering logic snippet looks like this:
cube.mat.copyToMatrix3D (drawMatrix);
drawMatrix.prepend (worldToClip);
So, I believe the correct, general summary of what is going on here is that both pieces of code are setting up the proper combined matrix to be sent to the Vertex Shader where that matrix will be a parameter to the m44 AGAL operation. The general description is that the combined matrix will take us from Object Local Space through Camera View Space to Screen or Clipping Space.
My problem can be summarized as arising from my ignorance of proper matrix operations. I believe my failed attempt to merge the two environments arises precisely because the semantics of prepending one matrix to another is not, and is never intended to be, equivalent to appending that matrix to the other. My request, then, can be summarized in this way. Because I have no control over the calling sequence that the framework will issue, e.g., I must live with an append operation, I can only try to fix things on the side where I prepare the matrix which is to be appended. That code is not black-boxed, but it is too complex for me to know how to change it so that it would meet the interface requirements posed by the framework.
Is there some sequence of inversions, transformations or other manuevers which would let me modify a viewProjection matrix that was designed to be prepended, so that it will turn out right when it is, instead, appended to the Object's World Space coordinates?
I am providing an answer more out of desperation than sure understanding, and still hope I will receive a better answer from those more knowledgeable. From Dunn and Parberry's "3D Math Primer" I learned that "transposing the product of two matrices is the same as taking the product of their transposes in reverse order."
Without being able to understand how to enter text involving superscripts, I am not sure if I can reduce my approach to a helpful mathematical formulation, so I will invent a syntax using functional notation. The equivalency noted by Dunn and Parberry would be something like:
AB = transpose (B) x transpose (A)
That comes close to solving my problem, which problem, to restate, is really just a problem arising out of the fact that I cannot control the behavior of the internal matrix operations in the framework package. I can, however, perform appropriate matrix operations on either side of the workflow from local object coordinates to those required by the GPU Vertex Shader.
I have not completed the test of my solution, which requires the final step to be taken in the AGAL shader, but I have been able to confirm in AS3 that the last 'un-transform' does yield exactly the same combined raw data as the example from the author of the camera with the desired lens properties whose implementation involves prepending rather than appending.
BA = transpose (transpose (A) x transpose (B))
I have also not yet tested to see if these extra calculations are so processing intensive as to reduce my application frame rate beyond what is acceptable, but am pleased at least to be able to confirm that the computations yield the same result.