Performing a transpose operation on a DirectX Surface Buffer

I am using IMFSourceReader with hardware acceleration enabled to decode videos and read them into my application. After the ReadSample call, I get hold of the IDirect3DSurface9 from the IMFSample. At this point, I use the LockRect() call to access the raw bytes and copy them into my application's buffer.
I would like to perform additional operations on the GPU such as transpose and a possible conversion of the image data from row-major order to column-major order.
Is there a Blt operation I can set up to do this?
I came across the ID3DXBaseEffect interface but I am not sure that is applicable in my case.
Would appreciate any inputs.
Dinesh

With IDirect3DSurface9, you can use a shader (ID3DXBaseEffect).
To do it directly on the GPU, before copying the raw bytes into your application, I would try this (a code sketch for the first two steps follows the list):
Call IMFSourceReader::GetServiceForStream to query for MR_VIDEO_ACCELERATION_SERVICE and IDirect3DDeviceManager9.
Use IDirect3DDeviceManager9 to obtain the IDirect3DDevice9 (IDirect3DDeviceManager9::LockDevice).
Use the IDirect3DDevice9, the IDirect3DSurface9, a new render target and a shader, as usual with DirectX.
Copy the raw bytes from the final render target (after the shader has been applied).
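For steps 1 and 2, a minimal sketch (the helper name GetDecoderDevice is mine; error handling and the matching UnlockDevice/CloseDeviceHandle calls are left to the caller):

#include <d3d9.h>
#include <dxva2api.h>    // IDirect3DDeviceManager9
#include <evr.h>         // MR_VIDEO_ACCELERATION_SERVICE
#include <mfreadwrite.h> // IMFSourceReader

HRESULT GetDecoderDevice(IMFSourceReader* pReader,
                         IDirect3DDeviceManager9** ppManager,
                         HANDLE* phDevice,
                         IDirect3DDevice9** ppDevice)
{
    // Step 1: ask the reader for the device manager it decodes with.
    HRESULT hr = pReader->GetServiceForStream(
        (DWORD)MF_SOURCE_READER_FIRST_VIDEO_STREAM,
        MR_VIDEO_ACCELERATION_SERVICE,
        IID_PPV_ARGS(ppManager));
    if (FAILED(hr))
        return hr;

    // Step 2: open a device handle and lock the shared Direct3D 9 device.
    hr = (*ppManager)->OpenDeviceHandle(phDevice);
    if (SUCCEEDED(hr))
        hr = (*ppManager)->LockDevice(*phDevice, ppDevice, TRUE);

    // The caller must call UnlockDevice and CloseDeviceHandle once the GPU work
    // (render target + shader + readback) has been submitted.
    return hr;
}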
EDIT
See here: mofo7777 on GitHub.
Under MediaFoundationTransform > MFTDirectxAware > MFTVideoShaderEffect, I show the concept.

Related

Metal in-place operation for `MTLComputeCommandEncoder` on iOS

Is it possible to have a Metal compute function that processes a texture in-place on iOS? I have noticed that some MPS image filters support in-place processing, and was wondering if there is a way to accomplish this with custom kernels.
Specifically, I am looking to combine two textures into one using a blend function. I am easily able to do this by making the first texture a render target and using a shader to write the second one on top, but it feels like overkill since both textures are the same size.
Yes, you can take a texture parameter with the access::read_write attribute, and read and write it within the same kernel function invocation. You'll need to ensure that the texture is created with both the .shaderRead and .shaderWrite usage flags. Additionally, note that writes are not guaranteed to be seen by any subsequent reads by the same thread unless you call the flush() function after the write.
By the way, MetalPerformanceShaders kernels that are able to operate "in-place" don't necessarily use read_write textures; it's often the case that they use auxiliary textures and buffers and do their work across multiple passes. Per the documentation, any kernel can fail to operate in-place for any number of reasons, so you should always provide a fallback allocator to handle such cases.
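For reference, a minimal Metal Shading Language sketch of the read_write approach described above (the kernel name and the 50/50 blend are placeholders):

#include <metal_stdlib>
using namespace metal;

// Assumes the destination texture was created with both .shaderRead and
// .shaderWrite usage, and that the device supports read_write access for
// this pixel format.
kernel void blendInPlace(texture2d<float, access::read_write> dst [[texture(0)]],
                         texture2d<float, access::read>       src [[texture(1)]],
                         uint2 gid [[thread_position_in_grid]])
{
    if (gid.x >= dst.get_width() || gid.y >= dst.get_height())
        return;

    float4 base = dst.read(gid);            // pixel already in the destination
    float4 over = src.read(gid);
    dst.write(mix(base, over, 0.5f), gid);  // blend and write back in place
}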

DirectCompute: How to read from a RWTexture2D<float4>?

I have the following buffer:
RWTexture2D<float4> Output : register(u0);
This buffer is used by a compute shader for rendering a computed image.
To write a pixel in that texture, I just use code similar to this:
Output[XY] = SomeFunctionReturningFloat4(SomeArgument);
This works very well and my computed image is correctly rendered on screen.
Now at some stage in the compute shader, I would like to read back an
already computed pixel and process it again.
Output[XY] = SomeOtherFunctionReturningFloat4(Output[XY]);
The compiler returns an error:
error X3676: typed UAV loads are only allowed for single-component 32-bit element types
Any help appreciated.
In compute shaders, data access is limited for some data types, and not always intuitive or straightforward. In your case, you use a
RWTexture2D<float4>
That is a typed UAV of format DXGI_FORMAT_R32G32B32A32_FLOAT.
This format is supported for UAV typed stores, but it is not supported for UAV typed loads.
Basically, you can write to it, but not read from it. UAV typed loads only support single-component 32-bit formats, in your case DXGI_FORMAT_R32_FLOAT (32 bits and that's all).
Your code should run if you use a RWTexture2D<float>, but I suppose this is not enough for you.
Possible workarounds that spring to my mind are:
1. Using 4 different RWTexture2D<float>, one for each component
2. Using 2 different textures: a RWTexture2D<float4> to write your values and a Texture2D<float4> to read from
3. Using a RWStructuredBuffer instead of the texture (sketched below)
I don't know your code, so I don't know if solutions 1. and 2. are viable. However, I strongly suggest going for 3. and using a StructuredBuffer. A RWStructuredBuffer can hold any type of struct and can easily cover all your needs. To be honest, in compute shaders I almost only use them to pass data. If you need the final output to be a texture, you can do all your calculations on the buffer, then copy the results to the texture when you're done. I would add that drivers often use CompletePath to access RWTexture2D data and FastPath to access RWStructuredBuffer data, making the former awfully slower than the latter.
Reference for data type access is here. Scroll down to UAV typed load.
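A minimal HLSL sketch of workaround 3 (the Pixel struct, the Width/Height constants and the placeholder math are illustrative only):

struct Pixel
{
    float4 color;
};

RWStructuredBuffer<Pixel> Accum  : register(u0); // read-modify-write is allowed here
RWTexture2D<float4>       Output : register(u1); // typed UAV: store only

cbuffer Params : register(b0)
{
    uint Width;
    uint Height;
};

[numthreads(8, 8, 1)]
void CSMain(uint3 XY : SV_DispatchThreadID)
{
    if (XY.x >= Width || XY.y >= Height)
        return;

    uint idx = XY.y * Width + XY.x;

    // Read back an already computed value and process it again.
    float4 previous = Accum[idx].color;
    Accum[idx].color = 1.0f - previous; // stand-in for SomeOtherFunctionReturningFloat4

    // Copy the final result into the typed UAV (store only, no load needed).
    Output[XY.xy] = Accum[idx].color;
}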

Metal Best Practice: Triple-buffering – Textures too?

In the Metal Best Practices Guide, it states that for best performance one should "implement a triple buffering model to update dynamic buffer data," and that "dynamic buffer data refers to frequently updated data stored in a buffer."
Does an MTLTexture qualify as "frequently updated data stored in a buffer" if it needs to be updated every frame? All the examples in the guide above focus on MTLBuffers.
I notice Apple's implementation in MetalKit has a concept of a nextDrawable, so perhaps that's what's happening here?
If a command could be in flight and it could access (read/sample/write) the texture while you're modifying that same texture on the CPU (e.g. using one of the -replaceRegion:... methods or by writing to a backing IOSurface), then you will need a multi-buffering technique, yes.
If you're only modifying the texture on the GPU (by rendering to it, writing to it from a shader function, or using blit command encoder methods to copy to it), then you don't need multi-buffering. You may need to use a texture fence within the shader function or you may need to call -textureBarrier on the render command encoder between draw calls, depending on exactly what you're doing.
Yes, nextDrawable provides a form of multi-buffering. In this case, it's not due to CPU access, though. You're going to render to one texture while the previously-rendered texture may still be on its way to the screen. You don't want to use the same texture for both because the new rendering could overdraw the texture just before it's put on screen, thus showing corrupt results.

Cascading Image filters in opencl without using Texture Memory

I am working on a custom device that supports OpenCL 1.2 Embedded Profile and does not have Image support or Texture Memory. I have to pass an image through a Sobel filter and then a Median filter. What could be the best (fastest) way of doing this? Can I avoid having to send the image back to the host after Sobel filter and then reading it back on the device for Median filter? Where to store the intermediate image, global memory, local memory or elsewhere?
You can keep the buffer in the global memory of the device between kernel calls to avoid the extra copies. When you create the buffer, make sure you use the flag CL_MEM_READ_WRITE; this will allow the Sobel kernel to write to it and the median kernel to read from it afterward. You can get away with two buffers, but I would use three if memory is not a restriction.
Create 3 buffers; call them whatever you'd like (originalBuff, middleBuff, finalBuff).
Copy the image data to originalBuff.
Optionally set the other buffers to an all-zero state (this can be done on the device by the kernels which write to these buffers).
Call the Sobel filter kernel with params (originalBuff, middleBuff).
Call the median kernel with params (middleBuff, finalBuff).
Read finalBuff back to the host.
I left out the other steps, such as creating context/program/queue/etc.. in order to focus on the answer to your question.
Read about clCreateBuffer here.
EDIT:
I have not tried the flag 'CL_MEM_HOST_NO_ACCESS' before, but I think it is worth a try. In my example, middleBuff might benefit from this flag. Like most opencl features, any possible benefit would be implementation-dependent.
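A host-side sketch of that chaining (sobelKernel, medianKernel, width, height and the 8-bit grayscale image format are placeholders; error checks omitted):

// Keep the intermediate image in device global memory between the two kernels.
cl_int err;
size_t bytes = width * height * sizeof(cl_uchar);

cl_mem originalBuff = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     bytes, hostImage, &err);
cl_mem middleBuff   = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS, // host-no-access per the edit above
                                     bytes, NULL, &err);
cl_mem finalBuff    = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

clSetKernelArg(sobelKernel, 0, sizeof(cl_mem), &originalBuff);
clSetKernelArg(sobelKernel, 1, sizeof(cl_mem), &middleBuff);
clSetKernelArg(medianKernel, 0, sizeof(cl_mem), &middleBuff);
clSetKernelArg(medianKernel, 1, sizeof(cl_mem), &finalBuff);

size_t global[2] = { width, height };
clEnqueueNDRangeKernel(queue, sobelKernel, 2, NULL, global, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, medianKernel, 2, NULL, global, NULL, 0, NULL, NULL); // same in-order queue: runs after Sobel

clEnqueueReadBuffer(queue, finalBuff, CL_TRUE, 0, bytes, hostResult, 0, NULL, NULL); // only the final image returns to the host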

DirectX11: Pass data from ComputeShader to VertexShader?

Is it possible to apply a filter to the geometry data that is to be rendered using Compute Shader and then use the result as an input buffer in the Vertex Shader? That would save me the trouble (&time) of reading back the data.
Any help is much appreciated.
Yes, absolutely. First you create two identical ID3D11Buffers of structures using the BIND_VERTEX_BUFFER, BIND_SHADER_RESOURCE and BIND_UNORDERED_ACCESS bind flags, plus the associated UAVs and SRVs.
The first step is to apply your filter to the input source buffer and write to the destination buffer during your compute pass.
Then, during the draw pass, you just have to bind the destination buffer to the IA stage. You can do some ping-pong between the two buffers if you need to accumulate computations on the vertices (I assume that by filter you mean a functional map, referring to the functional-programming term).
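A hedged sketch of that setup (the float3 vertex layout, filterCS, srcSRV and the raw-view flags are assumptions for illustration; check that your feature level accepts UAVs on vertex buffers):

#include <d3d11.h>
#include <DirectXMath.h>

// Destination buffer: usable both as a compute-shader UAV and as a vertex buffer.
D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = vertexCount * sizeof(DirectX::XMFLOAT3);
desc.Usage     = D3D11_USAGE_DEFAULT;
desc.BindFlags = D3D11_BIND_VERTEX_BUFFER | D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS;

ID3D11Buffer* dstBuffer = nullptr;
device->CreateBuffer(&desc, nullptr, &dstBuffer);

D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
uavDesc.Format             = DXGI_FORMAT_R32_TYPELESS;  // raw views require R32_TYPELESS
uavDesc.ViewDimension      = D3D11_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.NumElements = vertexCount * 3;           // number of 32-bit words
uavDesc.Buffer.Flags       = D3D11_BUFFER_UAV_FLAG_RAW;

ID3D11UnorderedAccessView* dstUAV = nullptr;
device->CreateUnorderedAccessView(dstBuffer, &uavDesc, &dstUAV);

// Compute pass: filter the source vertices into the destination buffer.
context->CSSetShader(filterCS, nullptr, 0);
context->CSSetShaderResources(0, 1, &srcSRV);
context->CSSetUnorderedAccessViews(0, 1, &dstUAV, nullptr);
context->Dispatch((vertexCount + 63) / 64, 1, 1);

// Unbind the UAV, then feed the same buffer to the IA stage for the draw pass.
ID3D11UnorderedAccessView* nullUAV = nullptr;
context->CSSetUnorderedAccessViews(0, 1, &nullUAV, nullptr);
UINT stride = sizeof(DirectX::XMFLOAT3), offset = 0;
context->IASetVertexBuffers(0, 1, &dstBuffer, &stride, &offset);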

Resources