I'm looking at implementing a Metal render loop as a chain of multiple
KCS (kernel/compute) shaders and VFS (vertex/fragment) shaders:
texture -> [KCS -> VFS -> KCS -> VFS] --\
                                         --> [KCS -> KCS] --> presentable
texture -> [KCS -> VFS -> KCS -> VFS] --/
The output of one shader is the input to the next. The two sets of 4 alternating shaders are combined near the end, as shown.
If I'm thinking this through correctly, I'll need up to 10 distinct pipeline descriptors to make this happen, along with a lot of completion handlers in which the next shader is dispatched.
I haven't shown it, but a subregion of the final presentable output will also be fed into a separate MTKView (via a vertex/fragment shader).
Any hints would be appreciated.
You only need different descriptors if they would have different values. That is, if any of the KCS steps you show use the same compute shader function, then they typically can share a descriptor. (There are other properties of MTLComputePipelineDescriptor, but they're less commonly used.)
For the VFS steps, the descriptor is more complex and so they'd have to be equal across all properties to be shared.
Of course, you should be creating your pipeline state objects once in the lifetime of your app, if you can. Avoid creating them for every render loop.
You definitely don't want to use completion handlers to dispatch the next step. That will stall the pipeline horribly (make the CPU and GPU wait for each other, repeatedly). Just encode the steps in order into a command buffer. Any given draw or dispatch won't proceed until any previous draw/dispatch that might write to its inputs has completed.
Is it possible to have a Metal compute function that processes a texture in-place on iOS? I have noticed that some MPS image filters support in-place processing, and was wondering if there is a way to accomplish this with custom kernels.
Specifically, I am looking to combine two textures into one using a blend function. I am easily able to do this by making the first texture a render target and using a shader to write the second one on top, but it feels like an overkill since both textures are the same size.
Yes, you can take a texture parameter with the access::read_write attribute, and read and write it within the same kernel function invocation. You'll need to ensure that the texture is created with both the .shaderRead and .shaderWrite usage flags. Additionally, note that writes are not guaranteed to be seen by any subsequent reads by the same thread unless you call the texture's fence() function after the write.
By the way, MetalPerformanceShaders kernels that are able to operate "in-place" don't necessarily use read_write textures; it's often the case that they use auxiliary textures and buffers and do their work across multiple passes. Per the documentation, any kernel can fail to operate in-place for any number of reasons, so you should always provide a fallback allocator to handle such cases.
I am using IMFSourceReader with hardware acceleration enabled to decode videos and read them into my application. After the ReadSample call, I get hold of the IDirect3DSurface9 from the IMFSample. At this point, I use the LockRect() call to access the raw bytes and copy them into my application's buffer.
I would like to perform additional operations on the GPU such as transpose and a possible conversion of the image data from row-major order to column-major order.
Is there a Blt operation I can set up to do this?
I came across the ID3DXBaseEffect interface but I am not sure that is applicable in my case.
Would appreciate any inputs.
Dinesh
With IDirect3DSurface9, you can use a shader (ID3DXBaseEffect).
To do it directly on the GPU, before copying the raw bytes to your application, I would try this:
Call IMFSourceReader::GetServiceForStream to query for MR_VIDEO_ACCELERATION_SERVICE and IDirect3DDeviceManager9.
Use IDirect3DDeviceManager9 to obtain the IDirect3DDevice9 (IDirect3DDeviceManager9::LockDevice).
Use the IDirect3DDevice9, the IDirect3DSurface9, a new render target and a shader, as usual with DirectX.
Copy the raw bytes from the final render target (after the shader has been applied).
EDIT
See here: mofo7777 github
Under MediaFoundationTransform > MFTDirectxAware > MFTVideoShaderEffect, I show the concept.
I am very new to Metal, so bear with me as I transition from the ugly state machine calls of OpenGL to modern graphics frameworks. I really want to make sure I understand how everything works and fits together.
I have read most of Apple's documentation, but it does a better job of describing the function of individual components than how they come together.
I am essentially trying to understand whether multiple render pipelines and render encoders are needed in my situation.
To describe my pipeline at a high level here is what goes on:
1. Retrieve the previous frame's contents from an offscreen texture that was rendered to and draw some new contents onto it.
2. Switch to rendering on the screen. Draw the texture from step 1 to the screen.
3. Do some post processing (in native resolution).
4. Draw the UI on top as quads (essentially a repeat of step 2).
So in essence there will be the following vertex/fragment shader pairs:
A. Draw the entities (step 1)
B. Draw quads on a specified area (steps 2 and 4)
C. Post processing shader 1 (step 3); uses different inputs than D and can't be done in the same shader
D. Post processing shader 2 (step 3); uses different inputs than C and can't be done in the same shader
There will be the following texture groups:
Texture for each UI element
Texture for the offscreen drawing done in step 1
Potentially more offscreen textures will be used in post processing, depending on Metal's performance
Ultimately my confusions are this:
Q1. Render Pipelines take only one vertex and one fragment function so does this mean I need to have 4 render pipelines even though I only have 3 unique steps to my drawing procedure?
Q2. How am I supposed to use multiple pipelines in one encoder? Wouldn't each successive call on .setRenderPipelineState override the previous one?
Q3. Would you recommend keeping all of my .setFragmentTexture calls right after creating my encoder or do I need to set those only right before they are needed?
Q4. Is it valid to keep my depthState constant even as I switch between pipelineStates? How do I ensure that my entities on step 1 are rendered with depth but make sure depth information is lost between frames so entities are all on top of the previous contents?
Q5. What do I do with render step 3 where I have two post processing steps? Do those have to be separate pipelines?
Q6. How can I efficiently build my pipeline knowing that steps 2 and 4 are essentially the same just with different inputs?
I guess it would help me if someone would walk me through what renderPipelineObjects I will need and for what. It would also be useful to understand what some of the renderCommandEncoder commands might look like at a pseudocode level.
Q1. Render Pipelines take only one vertex and one fragment function so does this mean I need to have 4 render pipelines even though I only have 3 unique steps to my drawing procedure?
If there are 4 unique combinations of shader functions, then it's not correct that you "only have 3 unique steps to my drawing procedure". In any case, yes, you need a separate render pipeline state object for each unique combination of shader functions (as well as for any other attribute of the render pipeline state descriptor that you need to change).
Q2. How am I supposed to use multiple pipelines in one encoder? Wouldn't each successive call on .setRenderPipelineState override the previous one?
When you send a draw method to the render command encoder, that draw command is encoded with all of the relevant current state and written to the command buffer. If you later change the render pipeline state associated with the encoder, that doesn't affect previously-encoded commands; it only affects subsequently-encoded commands.
Q3. Would you recommend keeping all of my .setFragmentTexture calls right after creating my encoder or do I need to set those only right before they are needed?
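To make that concrete, here is a toy model in C++ of an encoder that snapshots its current state into each draw as the draw is encoded. It is only a sketch of the idea: ToyEncoder, EncodedDraw and encodeTwoDraws are made-up names, and none of this is Metal API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and are not Metal API.
struct EncodedDraw {
    std::string pipeline;  // pipeline state captured when the draw was encoded
    int vertexCount;
};

class ToyEncoder {
public:
    // Changes the state that later draws will capture.
    void setPipeline(const std::string& name) { current_ = name; }

    // Records the draw together with a copy of the state that is current now.
    void draw(int vertexCount) {
        assert(!current_.empty() && "set a pipeline before drawing");
        commands_.push_back({current_, vertexCount});
    }

    const std::vector<EncodedDraw>& commands() const { return commands_; }

private:
    std::string current_;
    std::vector<EncodedDraw> commands_;
};

// Encodes two draws with different pipelines through a single encoder.
std::vector<EncodedDraw> encodeTwoDraws() {
    ToyEncoder enc;
    enc.setPipeline("entities");
    enc.draw(300);
    enc.setPipeline("quads");  // does not touch the draw already recorded
    enc.draw(6);
    return enc.commands();
}
```

The first recorded draw still carries "entities" after the second setPipeline call, which is the behaviour described above: a state change only applies to draws encoded after it.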
You only need to set them before the draw command that uses them is encoded. Beyond that, it doesn't much matter when you set them. I'd do whatever makes for the clearest, most readable code.
Q4. Is it valid to keep my depthState constant even as I switch between pipelineStates?
Yes, or there wouldn't be separate methods to set them independently. There would be a method to set both.
How do I ensure that my entities on step 1 are rendered with depth but make sure depth information is lost between frames so entities are all on top of the previous contents?
Configure the loadAction for the depth attachment in the render pass descriptor to .clear, with an appropriate clearDepth value (e.g. 1.0). If you're using multiple render command encoders, only do this for the first one, of course. Likewise, the depth attachment in the render pass descriptor of the last (or only) render command encoder can/should use a storeAction of .dontCare.
Q5. What do I do with render step 3 where I have two post processing steps? Do those have to be separate pipelines?
Well, the description of your scenario is kind of vague. But, if you want to use a different shader function, then, yes, you need to use a different render pipeline state object.
Q6. How can I efficiently build my pipeline knowing that steps 2 and 4 are essentially the same just with different inputs?
Again, your description is entirely too vague to know how to answer this. In what ways are those steps the same? In what ways are they different? What do you mean about different inputs?
In any case, just do what seems like the simplest, most direct way even if it seems like it might be inefficient. Worry about optimizations later. When that time comes, open a new question and show your actual working code and ask specifically about that.
I'm trying to implement a complex algorithm on the GPU. The only problem is a hardware limitation: the maximum available feature level is 9_3.
The algorithm is basically a "stereo matching"-like algorithm for two images. Because of the limitation mentioned above, all calculations have to be performed in vertex/pixel shaders only (there is no compute API available). Vertex shaders are rather useless here, so I treat them as pass-through vertex shaders.
Let me shortly describe the algorithm:
1. Take two images and calculate cost volume maps (basically converting RGB to grayscale, then translating the right image by D and subtracting it from the left image). This step is repeated around 20 times for different values of D, which generates a Texture3D.
Problem here: I cannot simply create one pixel shader which calculates those 20 repetitions in one go because of the pixel shader size limit (max. 512 arithmetic instructions), so I'm forced to call Draw() in a loop in C++, which unnecessarily involves the CPU while all operations are done on the same two images. This seems like one bottleneck to me. I know that there are multiple render targets, but there are at most 8 targets (I need 20+), and if I generate 8 results in one pixel shader I exceed its size limit (512 arithmetic instructions on my hardware).
2. Then, for each of the calculated textures, I need to calculate a box filter with a window where r > 9.
Another problem here: because the window is so big, I need to split the box filtering into two pixel shaders (vertical and horizontal direction separately), because the loop unrolling stage results in very long code. Manually implementing those loops won't help, because it would still create too big a pixel shader. So there is another bottleneck here: the CPU needs to be involved to pass the results from the temp texture (result of the V pass) to the second pass (H pass).
3. In the next step, some arithmetic operations are applied to each pair of results from the 1st and 2nd steps.
I haven't reached this point in my development yet, so I have no idea what kind of bottlenecks are waiting for me here.
4. Then the minimal D (the value of the parameter from the 1st step) is taken for each pixel, based on the pixel value from step 3.
... same as in step 3.
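For reference, the cost volume, box filter and minimum-D steps described above can be sketched on the CPU roughly as follows. This is only a sketch under several assumptions that are not in the description: images are already grayscale, the "subtraction" is an absolute difference, reads outside the image are clamped to the border, and the unspecified arithmetic of step 3 is skipped, so the minimum is taken directly over the filtered cost. Gray, costSlice, boxPass and disparityMap are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper type: a row-major grayscale image.
struct Gray {
    int w, h;
    std::vector<float> px;
    float at(int x, int y) const { return px[y * w + x]; }
};

static int clampi(int v, int lo, int hi) { return std::min(std::max(v, lo), hi); }

// Cost slice for one D: |left(x, y) - right(x - d, y)| (absolute difference is assumed).
Gray costSlice(const Gray& left, const Gray& right, int d) {
    assert(left.w == right.w && left.h == right.h);
    Gray out{left.w, left.h, std::vector<float>(left.px.size())};
    for (int y = 0; y < left.h; ++y)
        for (int x = 0; x < left.w; ++x) {
            int xr = clampi(x - d, 0, right.w - 1);  // clamp reads to the border
            out.px[y * left.w + x] = std::fabs(left.at(x, y) - right.at(xr, y));
        }
    return out;
}

// One 1D box pass of radius r: horizontal for (dx, dy) = (1, 0), vertical for (0, 1).
Gray boxPass(const Gray& in, int r, int dx, int dy) {
    Gray out{in.w, in.h, std::vector<float>(in.px.size())};
    for (int y = 0; y < in.h; ++y)
        for (int x = 0; x < in.w; ++x) {
            float sum = 0.0f;
            for (int k = -r; k <= r; ++k)
                sum += in.at(clampi(x + k * dx, 0, in.w - 1),
                             clampi(y + k * dy, 0, in.h - 1));
            out.px[y * in.w + x] = sum / float(2 * r + 1);
        }
    return out;
}

// For every D in [0, maxD]: cost slice, H pass, V pass, then keep the per-pixel minimum.
std::vector<int> disparityMap(const Gray& left, const Gray& right, int maxD, int r) {
    std::vector<int> best(left.px.size(), 0);
    std::vector<float> bestCost(left.px.size(), std::numeric_limits<float>::max());
    for (int d = 0; d <= maxD; ++d) {
        Gray cost = boxPass(boxPass(costSlice(left, right, d), r, 1, 0), r, 0, 1);
        for (std::size_t i = 0; i < cost.px.size(); ++i)
            if (cost.px[i] < bestCost[i]) {
                bestCost[i] = cost.px[i];
                best[i] = d;
            }
    }
    return best;
}
```

Each call to costSlice or boxPass corresponds to one full-screen draw with a temporary texture in between, which is where the Draw() loop and the separate V and H passes come from.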
Here is a very simple graph showing my current implementation (excluding steps 3 and 4).
The red dots are temporary buffers (textures) where partial results are stored, and at every red dot the CPU gets involved.
Question 1: Isn't it possible somehow to let the GPU know how to perform each branch from top to bottom without involving the CPU and creating a bottleneck? I.e. to program a sequence of graphics pipelines in one go and then let the GPU do its job.
One additional question about render-to-texture: do all textures reside in GPU memory all the time, even between Draw() calls and pixel/vertex shader switches? Or is there any transfer from GPU to CPU happening? That might be another issue leading to a bottleneck.
Any help would be appreciated!
Thank you in advance.
Best regards,
Lukasz
Writing computational algorithms in pixel shaders can be very difficult. Writing such algorithms for a 9_3 target can be impossible; there are too many restrictions. But I think I know how to work around your problems.
1. Shader repetition
First of all, it is unclear what you call a "bottleneck" here. Yes, theoretically, issuing draw calls in a for loop costs some performance. But is it actually a bottleneck? Does your application really lose performance here? How much? Only profilers (CPU and GPU) can answer that. But to run them, you must first complete your algorithm (stages 3 and 4). So I would stick with the current solution, implement the whole algorithm, then profile, and only then fix performance issues.
But if you feel ready for tweaks... The common technique for "repetition" is instancing. You can create one more vertex buffer (called an instance buffer), which contains parameters not for each vertex but for each draw instance. Then you do all the work with one DrawInstanced() call.
For your first stage, the instance buffer can contain your D value and the index of the target Texture3D layer. You can pass them through from the vertex shader.
As always, you have a tradeoff here: simplicity of code versus (probable) performance.
2. Multi-pass rendering
CPU needs to be involved to pass results from temp texture (result of V pass) to the second pass (H pass)
Typically, you do chaining like this, so no CPU involved:
// pTextureNSRV / pTextureNRTV: shader-resource and render-target views of pTextureN (assumed names)
// Pass 1: from pTexture0 to pTexture1
// ...set up pipeline state for Pass 1 here...
pContext->PSSetShaderResources(slot, 1, &pTexture0SRV); // source
pContext->OMSetRenderTargets(1, &pTexture1RTV, nullptr); // target
pContext->Draw(...);
// Pass 2: from pTexture1 to pTexture2
// ...set up pipeline state for Pass 2 here...
pContext->OMSetRenderTargets(1, &pTexture2RTV, nullptr); // rebind the target first, so pTexture1 is no longer an output
pContext->PSSetShaderResources(slot, 1, &pTexture1SRV); // previous target is now source
pContext->Draw(...);
// Pass 3: ...
Note that pTexture1 must have both the D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_RENDER_TARGET flags. You can have multiple input textures and multiple render targets. Just make sure that each pass knows what the previous pass outputs.
And if the previous pass used more resources than the current one, don't forget to unbind the ones you no longer need, to prevent hard-to-find errors:
ID3D11ShaderResourceView* const nullSRVs[3] = { nullptr, nullptr, nullptr };
pContext->PSSetShaderResources(2, 3, nullSRVs); // unbind slots 2, 3 and 4
// Only texture slots 0 and 1 remain bound
3. Resource data location
Does all textures resides in GPU memory all the time even between Draw() method calls and Pixel/Vertex shaders switching?
We can never know that for sure. The driver chooses an appropriate location for each resource. But if you create resources with DEFAULT usage and CPU access flags set to 0, you can be almost sure they will always be in video memory.
Hope it helps. Happy coding!
I'm on iOS 6 (and iOS 7 too, if that makes any difference) and GL ES 2.0.
The idea is for a CAEAGLLayer to have a dynamic chain of shader-based filters that processes its contents property and displays the final result. Filters can be added / removed at any point in the chain.
So far I have come up with an implementation, but I'm wondering if there are better ways to go about it. My implementation roughly works this way:
A base filter class from which concrete filters inherit, creating a shader program (vertex / fragment combo) for whatever filter / imaging they implement.
A CAEAGLLayer subclass which implements the filter chain and to which filters are added. The high-level processing algorithm is:
// 1 - Assume whenever the layer's content property is changed to an image, a copy of the image gets stored in a sourceImage property.
// 2 - Assume changing the content property or adding / removing an image unit triggers this algorithm.
// 3 - Assume the whole filter chain basically processes a quad with position and texture coordinates thru a VBO.
// 4 - Assume all shader programs (by shader program I mean a vertex and fragment shader pair in a single program) have access to texture unit 0.
// 5 - Assume P shader programs.
load sourceImage into a texture object bound to GL_TEXTURE_2D on texture unit GL_TEXTURE0
attach bound texture object to GL_FRAMEBUFFER GL_COLOR_ATTACHMENT0 (so we are doing render-to-texture, which will be accessible to fragment shaders)
for p = program identifier 0 up to P - 2:
    glUseProgram(p)
    glDrawArrays()
attach GL_RENDERBUFFER to GL_FRAMEBUFFER GL_COLOR_ATTACHMENT0 (GL_RENDERBUFFER in turn has its storage set to the layer itself)
p = program identifier P - 1 (last program in the chain)
glUseProgram(p)
glDrawArrays()
present GL_RENDERBUFFER onscreen
This approach seems to work so far, but there are a number of things I'm wondering about:
Best way to implement adding / removing of filters:
Adding and removing programs seems the most logical approach right now. However this means one program per plugin and switching between all of these at render time. I wonder how these other approaches would compare:
Attaching / detaching shader-pairs and re-linking a single composite program, instead of adding / removing programs. The OpenGL ES 2.0 Programming Guide says you cannot do it. However, since desktop GL allows for multiple shader objects in one program, I'm anyway curious if it would be a better approach if ES supported it.
Keeping the filters in text format (their code within a function other than main) and instead compiling them all into a monolithic shader pair (with an added main of course) each time a filter is added / removed.
Best way to implement per-filter caching:
Right now, adding / removing any number of filters at any point in the chain requires running all programs again to render the final image. It would be nice, however, if I could somehow cache the output of each filter. That way, removing, adding or bypassing a filter would only require running the filters past the point of insertion / deletion / bypassing in the chain. I can think of a naive approach: on each program pass, bind a different texture object to GL_TEXTURE0 and to the GL_COLOR_ATTACHMENT0 of the framebuffer. In this way I can keep the output of every filter around. However, creating a new texture, binding it and changing the framebuffer attachment once per filter seems inefficient.
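The bookkeeping side of that caching idea can be sketched independently of GL. Below is a minimal C++ sketch, assuming an "image" is just a vector of floats and a filter is a CPU function; in the real layer each cached Image would be a texture and each pass a draw call. CachedChain, Image and Filter are illustrative names, not part of any framework.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-ins: in the GL version each cached Image would be a texture.
using Image = std::vector<float>;
using Filter = std::function<Image(const Image&)>;

class CachedChain {
public:
    explicit CachedChain(Image source) : source_(std::move(source)) {}

    // Inserts a filter at `index` and drops cached outputs from that point on.
    void insert(std::size_t index, Filter f) {
        assert(index <= filters_.size());
        filters_.insert(filters_.begin() + index, std::move(f));
        invalidateFrom(index);
    }

    // Removes the filter at `index` and drops cached outputs from that point on.
    void remove(std::size_t index) {
        assert(index < filters_.size());
        filters_.erase(filters_.begin() + index);
        invalidateFrom(index);
    }

    // Runs only the filters whose output is not cached; returns the final image.
    const Image& render() {
        for (std::size_t i = cache_.size(); i < filters_.size(); ++i) {
            const Image& in = (i == 0) ? source_ : cache_[i - 1];
            Image out = filters_[i](in);
            cache_.push_back(std::move(out));
            ++passesRun_;
        }
        return filters_.empty() ? source_ : cache_.back();
    }

    // Total number of filter passes executed so far.
    int passesRun() const { return passesRun_; }

private:
    void invalidateFrom(std::size_t index) {
        if (cache_.size() > index) cache_.resize(index);
    }

    Image source_;
    std::vector<Filter> filters_;
    std::vector<Image> cache_;  // cache_[i] is the output of filters_[i]
    int passesRun_ = 0;
};
```

With three filters rendered once, removing the last one costs no extra passes, and inserting a filter at position 1 re-runs only positions 1 and 2. The cost is one retained image per filter, which is the same memory trade-off as keeping one texture per filter.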
I don't have much to say about the filter output caching problem, but as for filter switching... The EXT_separate_shader_objects extension is designed to solve this very problem, and it's supported on every device that runs iOS 5.0 or later. Here's a brief overview:
There's a new convenience API for compiling shader programs that also takes care of making them "separable":
_vertexProgram = glCreateShaderProgramvEXT(GL_VERTEX_SHADER, 1, &source);
Program Pipeline Objects manage your program state and let you mix and match already-compiled shaders:
GLuint _ppo;
glGenProgramPipelinesEXT(1, &_ppo);
glBindProgramPipelineEXT(_ppo);
glUseProgramStagesEXT(_ppo, GL_VERTEX_SHADER_BIT_EXT, _vertexProgram);
glUseProgramStagesEXT(_ppo, GL_FRAGMENT_SHADER_BIT_EXT, _fragmentProgram);
Mixing and matching shaders can make attribute binding a pain, so you can specify that in the shader (likewise for varyings):
#extension GL_EXT_separate_shader_objects : enable
layout(location = 0) attribute vec4 position;
layout(location = 1) attribute vec3 normal;
Uniforms are set for the shader program they belong to:
glProgramUniformMatrix3fvEXT(_vertexProgram, u_normalMatrix, 1, 0, _normalMatrix.m);