SharpDX MapSubresource is vertically flipped (C#, DirectX)

How can I flip the SharpDX.DataBox without converting it to a bitmap?
I'm making a screen recording using SharpDX and Media Foundation. Below is the code showing how I get the DataBox.
mapSource = device.ImmediateContext.MapSubresource(screenTexture, 0,SharpDX.Direct3D11.MapMode.Read, SharpDX.Direct3D11.MapFlags.None);
But when I pass the mapSource to MediaFoundation.NET, the resulting video is vertically flipped.
IMFSample sample = null;
IMFMediaBuffer buffer = null;
IntPtr data = new IntPtr();
int bufferMaxLength;
int bufferCurrentLength;
int hr = (int)MFExtern.MFCreateMemoryBuffer(frameSizeBytes, out buffer);
if (Succeeded(hr)) hr = (int)buffer.Lock(out data, out bufferMaxLength, out bufferCurrentLength);
if (Succeeded(hr))
{
hr = (int)MFExtern.MFCopyImage(data, videoWidth * BYTES_PER_PIXEL, mapSource.DataPointer, videoWidth * BYTES_PER_PIXEL, videoWidth * BYTES_PER_PIXEL, videoHeight);
}
if (Succeeded(hr)) hr = (int)buffer.Unlock();
if (Succeeded(hr)) hr = (int)buffer.SetCurrentLength(frameSizeBytes);
if (Succeeded(hr)) hr = (int)MFExtern.MFCreateSample(out sample);
if (Succeeded(hr)) hr = (int)sample.AddBuffer(buffer);
if (Succeeded(hr)) hr = (int)sample.SetSampleTime(frame.prevRecordingDuration.Ticks);//(TICKS_PER_SECOND * frames / VIDEO_FPS);
if (Succeeded(hr)) hr = (int)sample.SetSampleDuration((frame.recordDuration-frame.prevRecordingDuration).Ticks);
if (Succeeded(hr)) hr = (int)sinkWriter.WriteSample(streamIndex, sample);
if (Succeeded(hr)) frames++;
COMBase.SafeRelease(sample);
COMBase.SafeRelease(buffer);

There is a mistake in your code in the call to MFCopyImage. According to the MFCopyImage documentation, you must pass
_In_ LONG lDestStride,
_In_ const BYTE *pSrc,
_In_ LONG lSrcStride,
where lDestStride and lSrcStride are the width in bytes of the memory used to store one line of pixels. Your computation videoWidth * BYTES_PER_PIXEL is not correct, because for Windows RGB formats the stride can be wider than videoWidth * BYTES_PER_PIXEL. You must compute the destination stride with the function MFGetStrideForBitmapInfoHeader; the source stride you can get from your image source code. I do not know your code, but for my project I used
D3D11_MAPPED_SUBRESOURCE resource;
UINT subresource = D3D11CalcSubresource(0, 0, 0);
ctx->Map(mDestImage, subresource, D3D11_MAP_READ_WRITE, 0, &resource);
LOG_INVOKE_MF_FUNCTION(MFCopyImage,
aPtrData,
mStride,
(BYTE*)resource.pData,
resource.RowPitch,
mWidthInBytes, // width of one row in bytes (name assumed; the last two arguments follow the MFCopyImage signature)
mHeight);      // number of rows to copy
Regards.
P.S. The destination stride mStride can be negative, which means the rows must be written from the last line to the first. This can be done by changing the destination pointer: aPtrData += (mHeight - 1)*mStride;
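Putting this together for the original question, here is a minimal sketch of the copy that flips the frame during the transfer. It is untested and mixes names from both snippets above (aPtrData, resource, videoWidth, videoHeight, BYTES_PER_PIXEL); it also assumes a tightly packed destination buffer, so in general compute the destination stride with MFGetStrideForBitmapInfoHeader as described:
// Sketch: point the destination at its last row and pass a negative destination
// stride, so MFCopyImage writes the rows bottom-up and the image is flipped vertically.
const LONG srcStride = static_cast<LONG>(resource.RowPitch); // stride reported by Map()
const LONG rowBytes = videoWidth * BYTES_PER_PIXEL;          // bytes actually used per row
BYTE* destLastRow = aPtrData + (videoHeight - 1) * rowBytes; // last row of the destination
HRESULT hr = MFCopyImage(destLastRow,                  // destination pointer (last row)
                         -rowBytes,                    // negative destination stride: walk upwards
                         (const BYTE*)resource.pData,  // source: mapped texture
                         srcStride,                    // source stride (RowPitch)
                         rowBytes,                     // width of one row in bytes
                         videoHeight);                 // number of rows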

Related

Color conversion from DXGI_FORMAT_B8G8R8A8_UNORM to NV12 in GPU using DirectX11 pixel shaders

I'm working on code to capture the desktop using Desktop Duplication and encode it to h264 using an Intel hardware MFT. The encoder only accepts NV12 format as input. I have a DXGI_FORMAT_B8G8R8A8_UNORM to NV12 converter (https://github.com/NVIDIA/video-sdk-samples/blob/master/nvEncDXGIOutputDuplicationSample/Preproc.cpp) that works fine and is based on the DirectX VideoProcessor.
The problem is that the VideoProcessor on certain Intel graphics hardware supports conversions only from DXGI_FORMAT_B8G8R8A8_UNORM to YUY2 and not to NV12; I have confirmed this by enumerating the supported formats through GetVideoProcessorOutputFormats. Although the VideoProcessor Blt succeeds without any errors, the frames in the output video are slightly pixelated; I can notice it if I look at it closely.
I guess the VideoProcessor has simply failed over to the next supported output format (YUY2), and I'm unknowingly feeding it to the encoder, which thinks the input is NV12 as configured. There is no failure or major corruption of frames because the differences between NV12 and YUY2, such as byte order and subsampling, are small. Also, I don't have pixelation problems on hardware that supports NV12 conversion.
So I decided to do the color conversion using pixel shaders, based on this code (https://github.com/bavulapati/DXGICaptureDXColorSpaceConversionIntelEncode/blob/master/DXGICaptureDXColorSpaceConversionIntelEncode/DuplicationManager.cpp). I'm able to make the pixel shaders work, and I have also uploaded my code here (https://codeshare.io/5PJjxP) for reference (simplified as much as possible).
Now I'm left with two separate channels, luma and chroma, as ID3D11Texture2D textures, and I'm really confused about how to efficiently pack the two separate channels into one ID3D11Texture2D so that I can feed it to the encoder. Is there a way to efficiently pack the Y and UV channels into a single ID3D11Texture2D on the GPU? I'm really tired of CPU-based approaches: they are costly and don't offer the best possible frame rates. In fact, I'm reluctant to even copy the textures to the CPU. I'm looking for a way to do it on the GPU without any back-and-forth copies between CPU and GPU.
I have been researching this for quite some time without any progress; any help would be appreciated.
/**
* This method is incomplete. It's just a template of what I want to achieve.
*/
HRESULT CreateNV12TextureFromLumaAndChromaSurface(ID3D11Texture2D** pOutputTexture)
{
HRESULT hr = S_OK;
try
{
//Copying from GPU to CPU. Bad :(
m_pD3D11DeviceContext->CopyResource(m_CPUAccessibleLuminanceSurf, m_LuminanceSurf);
D3D11_MAPPED_SUBRESOURCE resource;
UINT subresource = D3D11CalcSubresource(0, 0, 0);
HRESULT hr = m_pD3D11DeviceContext->Map(m_CPUAccessibleLuminanceSurf, subresource, D3D11_MAP_READ, 0, &resource);
BYTE* sptr = reinterpret_cast<BYTE*>(resource.pData);
BYTE* dptrY = nullptr; // point to the address of Y channel in output surface
//Store Image Pitch
int m_ImagePitch = resource.RowPitch;
int height = GetImageHeight();
int width = GetImageWidth();
for (int i = 0; i < height; i++)
{
memcpy_s(dptrY, m_ImagePitch, sptr, m_ImagePitch);
sptr += m_ImagePitch;
dptrY += m_ImagePitch;
}
m_pD3D11DeviceContext->Unmap(m_CPUAccessibleLuminanceSurf, subresource);
//Copying from GPU to CPU. Bad :(
m_pD3D11DeviceContext->CopyResource(m_CPUAccessibleChrominanceSurf, m_ChrominanceSurf);
hr = m_pD3D11DeviceContext->Map(m_CPUAccessibleChrominanceSurf, subresource, D3D11_MAP_READ, 0, &resource);
sptr = reinterpret_cast<BYTE*>(resource.pData);
BYTE* dptrUV = nullptr; // point to the address of UV channel in output surface
m_ImagePitch = resource.RowPitch;
height /= 2;
width /= 2;
for (int i = 0; i < height; i++)
{
memcpy_s(dptrUV, m_ImagePitch, sptr, m_ImagePitch);
sptr += m_ImagePitch;
dptrUV += m_ImagePitch;
}
m_pD3D11DeviceContext->Unmap(m_CPUAccessibleChrominanceSurf, subresource);
}
catch(HRESULT){}
return hr;
}
Draw NV12:
//
// Draw frame for NV12 texture
//
HRESULT DrawNV12Frame(ID3D11Texture2D* inputTexture)
{
HRESULT hr;
// If window was resized, resize swapchain
if (!m_bIntialized)
{
HRESULT Ret = InitializeNV12Surfaces(inputTexture);
if (!SUCCEEDED(Ret))
{
return Ret;
}
m_bIntialized = true;
}
m_pD3D11DeviceContext->CopyResource(m_ShaderResourceSurf, inputTexture);
D3D11_TEXTURE2D_DESC FrameDesc;
m_ShaderResourceSurf->GetDesc(&FrameDesc);
D3D11_SHADER_RESOURCE_VIEW_DESC ShaderDesc;
ShaderDesc.Format = FrameDesc.Format;
ShaderDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2D;
ShaderDesc.Texture2D.MostDetailedMip = FrameDesc.MipLevels - 1;
ShaderDesc.Texture2D.MipLevels = FrameDesc.MipLevels;
// Create new shader resource view
ID3D11ShaderResourceView* ShaderResource = nullptr;
hr = m_pD3D11Device->CreateShaderResourceView(m_ShaderResourceSurf, &ShaderDesc, &ShaderResource);
IF_FAILED_THROW(hr);
m_pD3D11DeviceContext->PSSetShaderResources(0, 1, &ShaderResource);
// Set resources
m_pD3D11DeviceContext->OMSetRenderTargets(1, &m_pLumaRT, nullptr);
m_pD3D11DeviceContext->PSSetShader(m_pPixelShaderLuma, nullptr, 0);
m_pD3D11DeviceContext->RSSetViewports(1, &m_VPLuminance);
// Draw textured quad onto render target
m_pD3D11DeviceContext->Draw(NUMVERTICES, 0);
m_pD3D11DeviceContext->OMSetRenderTargets(1, &m_pChromaRT, nullptr);
m_pD3D11DeviceContext->PSSetShader(m_pPixelShaderChroma, nullptr, 0);
m_pD3D11DeviceContext->RSSetViewports(1, &m_VPChrominance);
// Draw textured quad onto render target
m_pD3D11DeviceContext->Draw(NUMVERTICES, 0);
// Release shader resource
ShaderResource->Release();
ShaderResource = nullptr;
return S_OK;
}
Init shaders:
void SetViewPort(D3D11_VIEWPORT* VP, UINT Width, UINT Height)
{
VP->Width = static_cast<FLOAT>(Width);
VP->Height = static_cast<FLOAT>(Height);
VP->MinDepth = 0.0f;
VP->MaxDepth = 1.0f;
VP->TopLeftX = 0;
VP->TopLeftY = 0;
}
HRESULT MakeRTV(ID3D11RenderTargetView** pRTV, ID3D11Texture2D* pSurf)
{
if (*pRTV)
{
(*pRTV)->Release();
*pRTV = nullptr;
}
// Create a render target view
HRESULT hr = m_pD3D11Device->CreateRenderTargetView(pSurf, nullptr, pRTV);
IF_FAILED_THROW(hr);
return S_OK;
}
HRESULT InitializeNV12Surfaces(ID3D11Texture2D* inputTexture)
{
ReleaseSurfaces();
D3D11_TEXTURE2D_DESC lOutputDuplDesc;
inputTexture->GetDesc(&lOutputDuplDesc);
// Create shared texture for all duplication threads to draw into
D3D11_TEXTURE2D_DESC DeskTexD;
RtlZeroMemory(&DeskTexD, sizeof(D3D11_TEXTURE2D_DESC));
DeskTexD.Width = lOutputDuplDesc.Width;
DeskTexD.Height = lOutputDuplDesc.Height;
DeskTexD.MipLevels = 1;
DeskTexD.ArraySize = 1;
DeskTexD.Format = lOutputDuplDesc.Format;
DeskTexD.SampleDesc.Count = 1;
DeskTexD.Usage = D3D11_USAGE_DEFAULT;
DeskTexD.BindFlags = D3D11_BIND_SHADER_RESOURCE;
HRESULT hr = m_pD3D11Device->CreateTexture2D(&DeskTexD, nullptr, &m_ShaderResourceSurf);
IF_FAILED_THROW(hr);
DeskTexD.Format = DXGI_FORMAT_R8_UNORM;
DeskTexD.BindFlags = D3D11_BIND_RENDER_TARGET;
hr = m_pD3D11Device->CreateTexture2D(&DeskTexD, nullptr, &m_LuminanceSurf);
IF_FAILED_THROW(hr);
DeskTexD.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
DeskTexD.Usage = D3D11_USAGE_STAGING;
DeskTexD.BindFlags = 0;
hr = m_pD3D11Device->CreateTexture2D(&DeskTexD, NULL, &m_CPUAccessibleLuminanceSurf);
IF_FAILED_THROW(hr);
SetViewPort(&m_VPLuminance, DeskTexD.Width, DeskTexD.Height);
HRESULT Ret = MakeRTV(&m_pLumaRT, m_LuminanceSurf);
if (!SUCCEEDED(Ret))
return Ret;
DeskTexD.Width = lOutputDuplDesc.Width / 2;
DeskTexD.Height = lOutputDuplDesc.Height / 2;
DeskTexD.Format = DXGI_FORMAT_R8G8_UNORM;
DeskTexD.Usage = D3D11_USAGE_DEFAULT;
DeskTexD.CPUAccessFlags = 0;
DeskTexD.BindFlags = D3D11_BIND_RENDER_TARGET;
hr = m_pD3D11Device->CreateTexture2D(&DeskTexD, nullptr, &m_ChrominanceSurf);
IF_FAILED_THROW(hr);
DeskTexD.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
DeskTexD.Usage = D3D11_USAGE_STAGING;
DeskTexD.BindFlags = 0;
hr = m_pD3D11Device->CreateTexture2D(&DeskTexD, NULL, &m_CPUAccessibleChrominanceSurf);
IF_FAILED_THROW(hr);
SetViewPort(&m_VPChrominance, DeskTexD.Width, DeskTexD.Height);
return MakeRTV(&m_pChromaRT, m_ChrominanceSurf);
}
HRESULT InitVertexShader(ID3D11VertexShader** ppID3D11VertexShader)
{
HRESULT hr = S_OK;
UINT Size = ARRAYSIZE(g_VS);
try
{
IF_FAILED_THROW(m_pD3D11Device->CreateVertexShader(g_VS, Size, NULL, ppID3D11VertexShader));;
m_pD3D11DeviceContext->VSSetShader(m_pVertexShader, nullptr, 0);
// Vertices for drawing whole texture
VERTEX Vertices[NUMVERTICES] =
{
{ XMFLOAT3(-1.0f, -1.0f, 0), XMFLOAT2(0.0f, 1.0f) },
{ XMFLOAT3(-1.0f, 1.0f, 0), XMFLOAT2(0.0f, 0.0f) },
{ XMFLOAT3(1.0f, -1.0f, 0), XMFLOAT2(1.0f, 1.0f) },
{ XMFLOAT3(1.0f, -1.0f, 0), XMFLOAT2(1.0f, 1.0f) },
{ XMFLOAT3(-1.0f, 1.0f, 0), XMFLOAT2(0.0f, 0.0f) },
{ XMFLOAT3(1.0f, 1.0f, 0), XMFLOAT2(1.0f, 0.0f) },
};
UINT Stride = sizeof(VERTEX);
UINT Offset = 0;
D3D11_BUFFER_DESC BufferDesc;
RtlZeroMemory(&BufferDesc, sizeof(BufferDesc));
BufferDesc.Usage = D3D11_USAGE_DEFAULT;
BufferDesc.ByteWidth = sizeof(VERTEX) * NUMVERTICES;
BufferDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
BufferDesc.CPUAccessFlags = 0;
D3D11_SUBRESOURCE_DATA InitData;
RtlZeroMemory(&InitData, sizeof(InitData));
InitData.pSysMem = Vertices;
// Create vertex buffer
IF_FAILED_THROW(m_pD3D11Device->CreateBuffer(&BufferDesc, &InitData, &m_VertexBuffer));
m_pD3D11DeviceContext->IASetVertexBuffers(0, 1, &m_VertexBuffer, &Stride, &Offset);
m_pD3D11DeviceContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
D3D11_INPUT_ELEMENT_DESC Layout[] =
{
{ "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
{ "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT, 0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 }
};
UINT NumElements = ARRAYSIZE(Layout);
hr = m_pD3D11Device->CreateInputLayout(Layout, NumElements, g_VS, Size, &m_pVertexLayout);
m_pD3D11DeviceContext->IASetInputLayout(m_pVertexLayout);
}
catch (HRESULT) {}
return hr;
}
HRESULT InitPixelShaders()
{
HRESULT hr = S_OK;
// Refer https://codeshare.io/5PJjxP for g_PS_Y & g_PS_UV blobs
try
{
UINT Size = ARRAYSIZE(g_PS_Y);
hr = m_pD3D11Device->CreatePixelShader(g_PS_Y, Size, nullptr, &m_pPixelShaderChroma);
IF_FAILED_THROW(hr);
Size = ARRAYSIZE(g_PS_UV);
hr = m_pD3D11Device->CreatePixelShader(g_PS_UV, Size, nullptr, &m_pPixelShaderLuma);
IF_FAILED_THROW(hr);
}
catch (HRESULT) {}
return hr;
}
I am experimenting with this RGBA to NV12 conversion on the GPU only, using DirectX11.
This is a good challenge. I'm not familiar with DirectX11, so this is my first experiment.
Check this project for updates: D3D11ShaderNV12
In my current implementation (may not be the last), here is what I do:
Step 1: use a DXGI_FORMAT_B8G8R8A8_UNORM as input texture
Step 2: make a 1st pass shader to get 3 textures (Y:Luma, U:ChromaCb and V:ChromaCr): see YCbCrPS2.hlsl
Step 3: Y is DXGI_FORMAT_R8_UNORM, and is ready for final NV12 texture
Step 4: UV needs to be downsampled in a 2nd pass shader: see ScreenPS2.hlsl (using linear filtering)
Step 5: a third pass shader to sample Y texture
Step 6: a fourth pass shader to sample the UV texture using a shift texture (I think other techniques could be used)
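For reference, the first-pass shader (Step 2) is essentially the standard RGB to YCbCr conversion; here is the same arithmetic as a small CPU-side sketch, assuming full-range BT.601 coefficients (the actual YCbCrPS2.hlsl may use different ones):
// Sketch: full-range BT.601 RGB -> YCbCr, the per-pixel math of a first-pass shader.
struct YCbCr { float y, cb, cr; };
YCbCr RgbToYCbCr601(float r, float g, float b) // r, g, b in [0, 255]
{
    YCbCr o;
    o.y  =  0.299f    * r + 0.587f    * g + 0.114f    * b;
    o.cb = -0.168736f * r - 0.331264f * g + 0.5f      * b + 128.0f;
    o.cr =  0.5f      * r - 0.418688f * g - 0.081312f * b + 128.0f;
    return o;
}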
My final texture is not DXGI_FORMAT_NV12, but a similar DXGI_FORMAT_R8_UNORM texture. My computer runs Windows 7, so DXGI_FORMAT_NV12 is not supported. I will try later on another computer.
The process with pictures:

OpenCL: Access the proper index by using get_global_id()

Hi,
I am coding in OpenCL.
I am converting a C function that works on a 2D array, with loops starting from i=1 and j=1. Please find it below.
cv::Mat input; //Input :having some data in it ..
//Image input size is :input.rows=288 ,input.cols =640
cv::Mat output(input.rows-2,input.cols-2,CV_32F); //Output buffer
//Image output size is :output.rows=286 ,output.cols =638
This is a code Which I want to modify in OpenCL:
for(int i=1;i<output.rows-1;i++)
{
for(int j=1;j<output.cols-1;j++)
{
float xVal = input.at<uchar>(i-1,j-1)-input.at<uchar>(i-1,j+1)+ 2*(input.at<uchar>(i,j-1)-input.at<uchar>(i,j+1))+input.at<uchar>(i+1,j-1) - input.at<uchar>(i+1,j+1);
float yVal = input.at<uchar>(i-1,j-1) - input.at<uchar>(i+1,j-1)+ 2*(input.at<uchar>(i-1,j) - input.at<uchar>(i+1,j))+input.at<uchar>(i-1,j+1)-input.at<uchar>(i+1,j+1);
output.at<float>(i-1,j-1) = xVal*xVal+yVal*yVal;
}
}
...
Host code :
//Input Image size is :input.rows=288 ,input.cols =640
//Output Image size is :output.rows=286 ,output.cols =638
OclStr->global_work_size[0] =(input.cols);
OclStr->global_work_size[1] =(input.rows);
size_t outBufSize = (output.rows) * (output.cols) * 4;//4 as I am copying all 4 uchar values into one float variable space
cl_mem cl_input_buffer = clCreateBuffer(
OclStr->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR ,
(input.rows) * (input.cols),
static_cast<void *>(input.data), &OclStr->returnstatus);
cl_mem cl_output_buffer = clCreateBuffer(
OclStr->context, CL_MEM_WRITE_ONLY| CL_MEM_USE_HOST_PTR ,
(output.rows) * (output.cols) * sizeof(float),
static_cast<void *>(output.data), &OclStr->returnstatus);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 0, sizeof(cl_mem), (void *)&cl_input_buffer);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 1, sizeof(cl_mem), (void *)&cl_output_buffer);
OclStr->returnstatus = clEnqueueNDRangeKernel(
OclStr->command_queue,
OclStr->objkernel,
2,
NULL,
OclStr->global_work_size,
NULL,
0,
NULL,
NULL
);
clEnqueueMapBuffer(OclStr->command_queue, cl_output_buffer, true, CL_MAP_READ, 0, outBufSize, 0, NULL, NULL, &OclStr->returnstatus);
kernel Code :
__kernel void Sobel_uchar (__global uchar *pSrc, __global float *pDstImage)
{
const uint cols = get_global_id(0)+1;
const uint rows = get_global_id(1)+1;
const uint width= get_global_size(0);
uchar Opsoble[8];
Opsoble[0] = pSrc[(cols-1)+((rows-1)*width)];
Opsoble[1] = pSrc[(cols+1)+((rows-1)*width)];
Opsoble[2] = pSrc[(cols-1)+((rows+0)*width)];
Opsoble[3] = pSrc[(cols+1)+((rows+0)*width)];
Opsoble[4] = pSrc[(cols-1)+((rows+1)*width)];
Opsoble[5] = pSrc[(cols+1)+((rows+1)*width)];
Opsoble[6] = pSrc[(cols+0)+((rows-1)*width)];
Opsoble[7] = pSrc[(cols+0)+((rows+1)*width)];
float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
pDstImage[(cols-1)+(rows-1)*width] = gx*gx + gy*gy;
}
Here I am not able to get the output as expected.
I have a couple of questions:
My for loop starts from i=1 instead of zero, so how can I get the proper index using get_global_id() in the x and y directions?
What is going wrong in my kernel code above? :(
I suspect there is a problem with the buffer stride, but I haven't been able to pin it down even after breaking my head over it for a whole day :(
I have observed that with the logic below, the output skips one or two frames after a sequence of some 7/8 frames.
I have added a screenshot of my output compared with the reference output.
My logic above only does partial Sobel filtering on my input. I changed the width to:
const uint width = get_global_size(0)+1;
Your suggestions are most welcome !!!
It looks like you may be fetching values in (y,x) order in your OpenCL version. Also, you need to add 1 to the global ID to replicate your for loops starting from 1 rather than 0.
I don't know why there is an unused iOffset variable. Maybe your bug is related to this? I removed it in my version.
Does this kernel work better for you?
__kernel void simple(__global uchar *pSrc, __global float *pDstImage)
{
const uint i = get_global_id(0) +1;
const uint j = get_global_id(1) +1;
const uint width = get_global_size(0) +2;
uchar Opsoble[8];
Opsoble[0] = pSrc[(i-1) + (j - 1)*width];
Opsoble[1] = pSrc[(i-1) + (j + 1)*width];
Opsoble[2] = pSrc[i + (j-1)*width];
Opsoble[3] = pSrc[i + (j+1)*width];
Opsoble[4] = pSrc[(i+1) + (j - 1)*width];
Opsoble[5] = pSrc[(i+1) + (j + 1)*width];
Opsoble[6] = pSrc[(i-1) + (j)*width];
Opsoble[7] = pSrc[(i+1) + (j)*width];
float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
pDstImage[(i-1) + (j-1)*(width-2)] = gx*gx + gy*gy ; // output row stride is the output width (input width - 2)
}
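If you use this kernel, note that it derives the source width from get_global_size(0) + 2, so the global work size should be the output dimensions (638 x 286 here), not the input. A host-side sketch using the question's variables:
// Sketch: enqueue one work item per output pixel, so get_global_size(0)+2 == input.cols.
OclStr->global_work_size[0] = output.cols; // 638
OclStr->global_work_size[1] = output.rows; // 286
OclStr->returnstatus = clEnqueueNDRangeKernel(OclStr->command_queue, OclStr->objkernel,
                                              2, NULL, OclStr->global_work_size,
                                              NULL, 0, NULL, NULL);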
I am a bit apprehensive about posting an answer suggesting optimizations to your kernel, seeing as the original output has not been reproduced exactly as of yet. There is a major improvement available to be made for problems related to image processing/filtering.
Using local memory will help you out by reducing the number of global reads by a factor of eight, as well as grouping the global writes together for potential gains with the single write-per-pixel output.
The kernel below reads a block of up to 34x34 from pSrc, and outputs a 32x32(max) area of the pDstImage. I hope the comments in the code are enough to guide you in using the kernel. I have not been able to give this a complete test, so there could be changes required. Any comments are appreciated as well.
__kernel void sobel_uchar_wlocal (__global uchar *pSrc, __global float *pDstImage, uint2 dimDstImage)
{
//call this kernel with a 1-dimensional work group size: 32x1
//calculates a 32x32 region of output with 32 work items
const uint wid = get_local_id(0);
const uint wid_1 = wid+1; // corrected for the calculation step
const uint2 gid = (uint2)(get_group_id(0),get_group_id(1));
const uint localDim = get_local_size(0);
const uint srcWidth = dimDstImage.x + 2; //source row stride: the output is 2 pixels narrower than the input
const uint2 globalTopLeft = (uint2)(localDim * gid.x, localDim * gid.y); //position in pDstImage this group writes to
//dimLocalBuff is clipped at the right and bottom edges of the image, where the work group may run over the border
uint2 dimLocalBuff = (uint2)(localDim,localDim);
if(dimDstImage.x - globalTopLeft.x < dimLocalBuff.x){
dimLocalBuff.x = dimDstImage.x - globalTopLeft.x;
}
if(dimDstImage.y - globalTopLeft.y < dimLocalBuff.y){
dimLocalBuff.y = dimDstImage.y - globalTopLeft.y;
}
int i,j;
//save the output tile plus its 1-pixel halo into local memory; srcBuff[column][row]
__local uchar srcBuff[34][34]; //34^2 uchar = 1156 bytes
for(j=0;j<(int)dimLocalBuff.y+2;j++){
for(i=(int)wid;i<(int)dimLocalBuff.x+2;i+=(int)localDim){
srcBuff[i][j] = pSrc[(globalTopLeft.x+i)+(globalTopLeft.y+j)*srcWidth];
}
}
barrier(CLK_LOCAL_MEM_FENCE); //all loads must finish before any work item reads its neighbours
//compute output and store locally
__local float dstBuff[32][32]; //32^2 float = 4096 bytes
if(wid < dimLocalBuff.x){
for(i=0;i<(int)dimLocalBuff.y;i++){
//output pixel (wid,i) of this tile is centred on srcBuff[wid_1][i+1]
float gx = srcBuff[wid_1-1][i]-srcBuff[wid_1+1][i]+2*(srcBuff[wid_1-1][i+1]-srcBuff[wid_1+1][i+1])+srcBuff[wid_1-1][i+2]-srcBuff[wid_1+1][i+2];
float gy = srcBuff[wid_1-1][i]-srcBuff[wid_1-1][i+2]+2*(srcBuff[wid_1][i]-srcBuff[wid_1][i+2])+srcBuff[wid_1+1][i]-srcBuff[wid_1+1][i+2];
dstBuff[wid][i] = gx*gx + gy*gy;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
//copy results to the global output (row stride is the output width)
for(j=0;j<(int)dimLocalBuff.y;j++){
for(i=(int)wid;i<(int)dimLocalBuff.x;i+=(int)localDim){
pDstImage[(globalTopLeft.x+i)+(globalTopLeft.y+j)*dimDstImage.x] = dstBuff[i][j];
}
}
}
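A host-side sketch of how this kernel might be enqueued, assuming it has been built into OclStr->objkernel and that the output is 638 x 286: one 32x1 work group per 32x32 output tile.
// Sketch: 32x1 work groups, each covering a 32x32 tile of the 638x286 output.
size_t local_size[2]  = { 32, 1 };
size_t global_size[2] = { ((638 + 31) / 32) * 32,  // 640 = 20 groups of 32 work items
                          (286 + 31) / 32 };       // 9 tile rows, 1 work item high each
cl_uint2 dimDst;
dimDst.s[0] = 638;
dimDst.s[1] = 286;
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 2, sizeof(cl_uint2), &dimDst);
OclStr->returnstatus = clEnqueueNDRangeKernel(OclStr->command_queue, OclStr->objkernel,
                                              2, NULL, global_size, local_size, 0, NULL, NULL);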

How to capture the desktop in OpenCV (ie. turn a bitmap into a Mat)?

I want to use OpenCV to process my desktop as if it were a video stream.
I am familiar with OpenCV.
I am not familiar with the Windows API.
I realize there are other ways to capture the screen, but for the purposes of my question, I need it to be done using OpenCV.
Here is my (super naive) code:
HWND hDesktopWnd;
HDC hDesktopDC;
hDesktopWnd=GetDesktopWindow();
hDesktopDC=GetDC(hDesktopWnd);
// get the height and width of the screen
int height = GetSystemMetrics(SM_CYVIRTUALSCREEN);
int width = GetSystemMetrics(SM_CXVIRTUALSCREEN);
// create a bitmap
HBITMAP hbDesktop = CreateCompatibleBitmap( hDesktopDC, width, height);
Mat src(height,width,CV_8UC4);
src.data = (uchar*)hbDesktop;
imshow("output",src); //fails :(
There are similar questions on StackOverflow, but they are either for the old-style OpenCV or for the Android operating system.
I'm on Windows 7 x64, OpenCV 2.4.3.
Thanks to anyone who can answer this question.
After MUCH trial and error, I managed to write a function to do it. Here it is for anyone else who might want it:
#include "stdafx.h"
#include "opencv2/core/core.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include <opencv2/highgui/highgui.hpp>
#include <Windows.h>
#include <iostream>
#include <string>
using namespace std;
using namespace cv;
Mat hwnd2mat(HWND hwnd){
HDC hwindowDC,hwindowCompatibleDC;
int height,width,srcheight,srcwidth;
HBITMAP hbwindow;
Mat src;
BITMAPINFOHEADER bi;
hwindowDC=GetDC(hwnd);
hwindowCompatibleDC=CreateCompatibleDC(hwindowDC);
SetStretchBltMode(hwindowCompatibleDC,COLORONCOLOR);
RECT windowsize; // get the height and width of the screen
GetClientRect(hwnd, &windowsize);
srcheight = windowsize.bottom;
srcwidth = windowsize.right;
height = windowsize.bottom/2; //change this to whatever size you want to resize to
width = windowsize.right/2;
src.create(height,width,CV_8UC4);
// create a bitmap
hbwindow = CreateCompatibleBitmap( hwindowDC, width, height);
bi.biSize = sizeof(BITMAPINFOHEADER); //http://msdn.microsoft.com/en-us/library/windows/window/dd183402%28v=vs.85%29.aspx
bi.biWidth = width;
bi.biHeight = -height; //this is the line that makes it draw upside down or not
bi.biPlanes = 1;
bi.biBitCount = 32;
bi.biCompression = BI_RGB;
bi.biSizeImage = 0;
bi.biXPelsPerMeter = 0;
bi.biYPelsPerMeter = 0;
bi.biClrUsed = 0;
bi.biClrImportant = 0;
// use the previously created device context with the bitmap
SelectObject(hwindowCompatibleDC, hbwindow);
// copy from the window device context to the bitmap device context
StretchBlt( hwindowCompatibleDC, 0,0, width, height, hwindowDC, 0, 0,srcwidth,srcheight, SRCCOPY); //change SRCCOPY to NOTSRCCOPY for wacky colors !
GetDIBits(hwindowCompatibleDC,hbwindow,0,height,src.data,(BITMAPINFO *)&bi,DIB_RGB_COLORS); //copy from hwindowCompatibleDC to hbwindow
// avoid memory leak
DeleteObject (hbwindow); DeleteDC(hwindowCompatibleDC); ReleaseDC(hwnd, hwindowDC);
return src;
}
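For completeness, a small usage sketch (assuming the function above) that treats the desktop as a video stream:
// Sketch: capture the desktop repeatedly and display it with OpenCV.
int main()
{
    HWND hDesktop = GetDesktopWindow();
    while (true)
    {
        Mat frame = hwnd2mat(hDesktop);
        imshow("desktop", frame);
        if (waitKey(30) >= 0) break; // press any key to stop
    }
    return 0;
}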
A better way is to allocate the memory for the pixels only once, so the only copy done here is the one made by BitBlt:
int main()
{
int x_size = 800, y_size = 600; // <-- Your res for the image
HBITMAP hBitmap; // <-- The image represented by hBitmap
Mat matBitmap; // <-- The image represented by mat
// Initialize DCs
HDC hdcSys = GetDC(NULL); // Get DC of the target capture..
HDC hdcMem = CreateCompatibleDC(hdcSys); // Create compatible DC
void *ptrBitmapPixels; // <-- Pointer variable that will contain the potinter for the pixels
// Create hBitmap with Pointer to the pixels of the Bitmap
BITMAPINFO bi; HDC hdc;
ZeroMemory(&bi, sizeof(BITMAPINFO));
bi.bmiHeader.biSize = sizeof(BITMAPINFOHEADER);
bi.bmiHeader.biWidth = x_size;
bi.bmiHeader.biHeight = -y_size; //negative so (0,0) is at top left
bi.bmiHeader.biPlanes = 1;
bi.bmiHeader.biBitCount = 32;
hdc = GetDC(NULL);
hBitmap = CreateDIBSection(hdc, &bi, DIB_RGB_COLORS, &ptrBitmapPixels, NULL, 0);
// ^^ The output: hBitmap & ptrBitmapPixels
// Set hBitmap in the hdcMem
SelectObject(hdcMem, hBitmap);
// Set matBitmap to point to the pixels of the hBitmap
matBitmap = Mat(y_size, x_size, CV_8UC4, ptrBitmapPixels, 0);
// ^^ note: first it is y, then it is x. very confusing
// * SETUP DONE *
// Now update the pixels using BitBlt
BitBlt(hdcMem, 0, 0, x_size, y_size, hdcSys, 0, 0, SRCCOPY);
// Just to do some image processing on the pixels.. (Dont have to to this)
Mat matRef = matBitmap(Range(100, 200), Range(100, 200));
// y1 y2 x1 x2
bitwise_not(matRef, matRef); // Invert the colors in this x1,x2,y1,y2
// Display the results through Mat
imshow("Title", matBitmap);
// Wait until some key is pressed
waitKey(0);
return 0;
}
Note that no error handling is done here, to keep the example simple to understand, but you have to do error handling in your code!
Hope this helps

Direct X Sprite Rendering Issue

I have been writing my own library using Direct X and have hit an odd issue. Whilst trying to render an animating sprite I am simply seeing a big black square:
I have stepped through the code obsessively and have concluded that it must be something about the loading of the actual sprites, because everything that I can see in my code is fine. Obviously, I cannot step into the functions such as BltFast, and so cannot tell if my sprite surfaces are being blitted onto the backbuffer successfully.
Here are my load and render functions for the sprite:
SPRITE::LOAD
/**
* loads a bitmap file and copies it to a directdraw surface
*
* #param pID wait
* #param pFileName name of the bitmap file to load into memory
*/
void Sprite::Load (const char *pID, const char *pFileName)
{
// initialises the member variables with the new image id and file name
mID = pID;
mFileName = pFileName;
// creates the necessary variables
HBITMAP tHBM;
BITMAP tBM;
DDSURFACEDESC2 tDDSD;
IDirectDrawSurface7 *tDDS;
// stores bitmap image into HBITMAP handler
tHBM = static_cast<HBITMAP> (LoadImage (NULL, pFileName, IMAGE_BITMAP, 0, 0, LR_LOADFROMFILE | LR_CREATEDIBSECTION));
GetObject (tHBM, sizeof (tBM), &tBM);
// create surface for the HBITMAP to be copied onto
ZeroMemory (&tDDSD, sizeof (tDDSD));
tDDSD.dwSize = sizeof (tDDSD);
tDDSD.dwFlags = DDSD_CAPS | DDSD_HEIGHT | DDSD_WIDTH;
tDDSD.ddsCaps.dwCaps = DDSCAPS_OFFSCREENPLAIN;
tDDSD.dwWidth = tBM.bmWidth;
tDDSD.dwHeight = tBM.bmHeight;
DirectDraw::GetInstance ()->DirectDrawObject()->CreateSurface (&tDDSD, &tDDS, NULL);
// copying bitmap image onto surface
CopyBitmap(tDDS, tHBM, 0, 0, 0, 0);
// deletes bitmap image now that it has been used
DeleteObject(tHBM);
// stores the new width and height of the image
mSpriteWidth = tBM.bmWidth;
mSpriteHeight = tBM.bmHeight;
// sets the address of the bitmap surface to this temporary surface with the new bitmap image
mBitmapSurface = tDDS;
}
SPRITE::RENDER
/**
* renders the sprites surface to the back buffer
*
* #param pBackBuffer surface to render the sprite to
* #param pX x co-ordinate to render to (default is 0)
* #param pY y co-ordinate to render to (default is 0)
*/
void Sprite::Render (LPDIRECTDRAWSURFACE7 &pBackBuffer, float pX, float pY)
{
if (mSpriteWidth > 800) mSpriteWidth = 800;
RECT tFrom;
tFrom.left = tFrom.top = 0;
tFrom.right = mSpriteWidth;
tFrom.bottom = mSpriteHeight;
// bltfast parameters are (position x, position y, dd surface, draw rect, wait flag)
// pBackBuffer->BltFast (0 + DirectDraw::GetInstance()->ScreenWidth(), 0, mBitmapSurface, &tFrom, DDBLTFAST_WAIT);
pBackBuffer->BltFast (static_cast<DWORD>(pX + DirectDraw::GetInstance()->ScreenWidth()),
static_cast<DWORD>(pY), mBitmapSurface, &tFrom, DDBLTFAST_WAIT);
}
The surfaces were simply not in a compatible format.
Here's the fixed copy-bitmap function (DDCopyBitmap), which I now call in the Load function:
extern "C" HRESULT
DDCopyBitmap(IDirectDrawSurface7 * pdds, HBITMAP hbm, int x, int y,
int dx, int dy)
{
HDC hdcImage;
HDC hdc;
BITMAP bm;
DDSURFACEDESC2 ddsd;
HRESULT hr;
if (hbm == NULL || pdds == NULL)
return E_FAIL;
//
// Make sure this surface is restored.
//
pdds->Restore();
//
// Select bitmap into a memoryDC so we can use it.
//
hdcImage = CreateCompatibleDC(NULL);
if (!hdcImage)
OutputDebugString("createcompatible dc failed\n");
SelectObject(hdcImage, hbm);
//
// Get size of the bitmap
//
GetObject(hbm, sizeof(bm), &bm);
dx = dx == 0 ? bm.bmWidth : dx; // Use the passed size, unless zero
dy = dy == 0 ? bm.bmHeight : dy;
//
// Get size of surface.
//
ddsd.dwSize = sizeof(ddsd);
ddsd.dwFlags = DDSD_HEIGHT | DDSD_WIDTH;
pdds->GetSurfaceDesc(&ddsd);
if ((hr = pdds->GetDC(&hdc)) == DD_OK)
{
StretchBlt(hdc, 0, 0, ddsd.dwWidth, ddsd.dwHeight, hdcImage, x, y,
dx, dy, SRCCOPY);
pdds->ReleaseDC(hdc);
}
DeleteDC(hdcImage);
return hr;
}
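In Sprite::Load, the copy step then becomes a call to this helper instead of the old CopyBitmap (sketch, using the question's local variables; passing 0 for dx/dy makes the helper use the bitmap's own size):
// copying bitmap image onto the surface using the format-safe helper
DDCopyBitmap(tDDS, tHBM, 0, 0, 0, 0);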

Copying memory allocated by cudaMallocPitch

Can cudaMemcpy be used for memory allocated with cudaMallocPitch? If not, can you tell me which function should be used? cudaMallocPitch returns linear memory, so I suppose that cudaMemcpy should be used.
You certainly could use cudaMemcpy to copy pitched device memory, but it would be more usual to use cudaMemcpy2D. An example of a pitched copy from host to device would look something like this:
#include "cuda.h"
#include <assert.h>
typedef float real;
int main(void)
{
cudaFree(0); // Establish context
// Host array dimensions
const size_t dx = 300, dy = 300;
// For the CUDA API width and pitch are specified in bytes
size_t width = dx * sizeof(real), height = dy;
// Host array allocation
real * host = new real[dx * dy];
size_t pitch1 = dx * sizeof(real);
// Device array allocation
// pitch is determined by the API call
real * device;
size_t pitch2;
assert( cudaMallocPitch((real **)&device, &pitch2, width, height) == cudaSuccess );
// Sample memory copy - note source and destination pitches can be different
assert( cudaMemcpy2D(device, pitch2, host, pitch1, width, height, cudaMemcpyHostToDevice) == cudaSuccess );
// Destroy context
assert( cudaDeviceReset() == cudaSuccess );
return 0;
}
(note: untested, caveat emptor and all that.....)
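Copying back is symmetric, which is also why cudaMemcpy2D is preferable to a plain cudaMemcpy here: a plain cudaMemcpy would have to transfer pitch2 * height bytes, padding included, leaving that padding embedded in the host array. A sketch of the reverse direction with the same variables:
// Sketch: copy the pitched device array back into the tightly packed host array.
assert( cudaMemcpy2D(host, pitch1, device, pitch2, width, height,
                     cudaMemcpyDeviceToHost) == cudaSuccess );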
