Create 1D Texture in DirectX 11 - directx

I am trying to create a 1D texture in DirectX 11 with this code:
PARAMETER: ID3D11Device* pDevice
D3D11_TEXTURE1D_DESC text1_desc;
::ZeroMemory(&text1_desc, sizeof(D3D11_TEXTURE1D_DESC));
text1_desc.Width = 258;
text1_desc.MipLevels = 2;
text1_desc.ArraySize = 2;
text1_desc.Usage = D3D11_USAGE_IMMUTABLE;
text1_desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
text1_desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
FLOAT* pData = new FLOAT[text1_desc.MipLevels * text1_desc.ArraySize * text1_desc.Width];
D3D11_SUBRESOURCE_DATA sr_data;
::ZeroMemory(&sr_data, sizeof(D3D11_SUBRESOURCE_DATA));
sr_data.pSysMem = pData;
ID3D11Texture1D* pTexture1D = nullptr;
auto hr = pDevice->CreateTexture1D(&text1_desc, &sr_data, &pTexture1D);
When text1_desc.MipLevels = 1 and text1_desc.ArraySize = 1 everything is good.
When text1_desc.MipLevels = 0 or text1_desc.MipLevels > 1 it raises Unhandled exception at 0x000007FEE6D14CC0 (nvwgf2umx.dll): 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.
Can anyone help me to solve this problem?

Mip-levels of '0' is a problem since it results in an allocation size of '0'. You need to figure out the number of mip-levels that will get generated for the given input width. So for 0, you need something like:
size_t mipLevels = 1;
size_t width = 258;
while ( width > 1 )
{
    width >>= 1;
    ++mipLevels;
}
The second thing to note is that you have to pass an array of D3D11_SUBRESOURCE_DATA instances, not just one, if you are creating a complex resource. There is one D3D11_SUBRESOURCE_DATA per sub-resource, so the array needs to be mipLevels * text1_desc.ArraySize in length. You only ever allocate one, so the driver reads past the end of it, which is why you get a runtime fault.
You should look at DirectXTex for code that works with all kinds of Direct3D 11 textures.
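As a rough sketch of both points, the helpers below compute the full mip count for a width and lay out one initial-data entry per sub-resource. The names SubresourceInit, fullMipCount, mipWidth and buildInitData are hypothetical and only for illustration; SubresourceInit stands in for D3D11_SUBRESOURCE_DATA so the example compiles without the Direct3D headers. It assumes 4-byte texels packed slice by slice with the largest mip first, and entries ordered as mip + slice * mipLevels, which is the order D3D11CalcSubresource uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for D3D11_SUBRESOURCE_DATA; a 1D texture only needs pSysMem.
struct SubresourceInit {
    const void* pSysMem;  // first texel of this sub-resource
    std::size_t texels;   // width of this mip level (kept only for illustration)
};

// Number of levels in a full mip chain for a 1D texture of the given width.
std::size_t fullMipCount(std::size_t width) {
    std::size_t levels = 1;
    while (width > 1) {
        width >>= 1;
        ++levels;
    }
    return levels;
}

// Width of one mip level: halved per level, never smaller than 1 texel.
std::size_t mipWidth(std::size_t width, std::size_t level) {
    const std::size_t w = width >> level;
    return w > 0 ? w : 1;
}

// Builds one entry per sub-resource (mipLevels * arraySize entries in total).
// Assumption: `texels` holds every slice back to back, each slice storing
// its mips from largest to smallest.
std::vector<SubresourceInit> buildInitData(const std::uint32_t* texels,
                                           std::size_t width,
                                           std::size_t mipLevels,
                                           std::size_t arraySize) {
    std::vector<SubresourceInit> init;
    init.reserve(mipLevels * arraySize);
    const std::uint32_t* cursor = texels;
    for (std::size_t slice = 0; slice < arraySize; ++slice) {
        for (std::size_t mip = 0; mip < mipLevels; ++mip) {
            const std::size_t w = mipWidth(width, mip);
            init.push_back({cursor, w});
            cursor += w;
        }
    }
    return init;
}
```

With the real API, the equivalent array of D3D11_SUBRESOURCE_DATA would be passed as the second argument of CreateTexture1D in place of the single sr_data.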

Related

How to speed up YUV conversion for a fast SkiaSharp camera preview?

I'm writing some code to render a camera preview using SkiaSharp. This is cross-platform, but I came across a problem while writing the implementation for Android.
I needed to convert YUV_420_888 to RGB8888 because that's what SkiaSharp supports and with the help of this thread, somehow managed to show decent quality images to my SkiaSharp canvas. The problem is the speed. At best I can get about 8 fps but usually it's just 4 or 5 fps. It turned out the biggest factor is the conversion. I now have about 3 versions of my ToRGB converter. I've even ended up trying "unsafe" code and parallel loops. I'll just show you my best one yet.
private unsafe byte[] ToRgb(byte[] yValuesArr, byte[] uValuesArr,
byte[] vValuesArr, int uvPixelStride, int uvRowStride)
{
var width = PixelSize.Width;
var height = PixelSize.Height;
var rgb = new byte[width * height * 4];
var partitions = Partitioner.Create(0, height);
Parallel.ForEach(partitions, range =>
{
var (item1, item2) = range;
Parallel.For(item1, item2, y =>
{
for (var x = 0; x < width; x++)
{
var yIndex = x + width * y;
var currentPosition = yIndex * 4;
var uvIndex = uvPixelStride * (x / 2) + uvRowStride * (y / 2);
fixed (byte* rgbFixed = rgb)
fixed (byte* yValuesFixed = yValuesArr)
fixed (byte* uValuesFixed = uValuesArr)
fixed (byte* vValuesFixed = vValuesArr)
{
var rgbPtr = rgbFixed;
var yValues = yValuesFixed;
var uValues = uValuesFixed;
var vValues = vValuesFixed;
var yy = *(yValues + yIndex);
var uu = *(uValues + uvIndex);
var vv = *(vValues + uvIndex);
var rTmp = yy + vv * 1436 / 1024 - 179;
var gTmp = yy - uu * 46549 / 131072 + 44 - vv * 93604 / 131072 + 91;
var bTmp = yy + uu * 1814 / 1024 - 227;
rgbPtr = rgbPtr + currentPosition;
*rgbPtr = (byte) (rTmp < 0 ? 0 : rTmp > 255 ? 255 : rTmp);
rgbPtr++;
*rgbPtr = (byte) (gTmp < 0 ? 0 : gTmp > 255 ? 255 : gTmp);
rgbPtr++;
*rgbPtr = (byte) (bTmp < 0 ? 0 : bTmp > 255 ? 255 : bTmp);
rgbPtr++;
*rgbPtr = 255;
}
}
});
});
return rgb;
}
You can also find it on my repo, along with the part where I render the output to SkiaSharp.
For a preview size of 1440x1080, running on my phone, this code takes about 120ms to finish. Even if all the other parts are optimized, the most I can get from that is 8fps. And no, it's not my hardware because the built-in camera app runs smoothly. By the way 1440x1080 is the output of my ChooseOptimalSize algorithm that I got from the mono-droid examples of android's Camera2 API. I don't know if it's the best way or if it lacks logic on detecting the fps and sizing down the preview to make it faster.
Does SkiaSharp support GPU drawing? If you connect the camera to a SurfaceTexture, you can use the preview frames as GL textures and render them efficiently into an OpenGL scene.
Even if not, you may still get faster results by sending the frames to the GPU and reading them back to the CPU with something like glReadPixels, as that will do an RGB conversion on the GPU.
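If the conversion has to stay on the CPU, the arithmetic from the question can at least be isolated and run as one sequential pass per row. The C++ sketch below is only an illustration under assumptions: yuvToRgba and convertFrame are hypothetical names, the coefficients are copied from the question, and the Y plane is assumed to be tightly packed (row stride equal to width), as the question's indexing implies. It makes no claim about the speed it would reach on any particular phone.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rgba {
    std::uint8_t r, g, b, a;
};

// Clamps an intermediate value to the 0..255 range of a byte.
static std::uint8_t clampToByte(int v) {
    return static_cast<std::uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Fixed-point conversion using the same coefficients as the question's code.
Rgba yuvToRgba(int y, int u, int v) {
    const int r = y + v * 1436 / 1024 - 179;
    const int g = y - u * 46549 / 131072 + 44 - v * 93604 / 131072 + 91;
    const int b = y + u * 1814 / 1024 - 227;
    return {clampToByte(r), clampToByte(g), clampToByte(b), 255};
}

// Converts a whole frame. The row-dependent part of the chroma index is
// computed once per row instead of once per pixel.
std::vector<std::uint8_t> convertFrame(const std::uint8_t* yPlane,
                                       const std::uint8_t* uPlane,
                                       const std::uint8_t* vPlane,
                                       int width, int height,
                                       int uvPixelStride, int uvRowStride) {
    std::vector<std::uint8_t> rgba(static_cast<std::size_t>(width) * height * 4);
    std::uint8_t* out = rgba.data();
    for (int row = 0; row < height; ++row) {
        const std::uint8_t* yRow = yPlane + static_cast<std::size_t>(row) * width;
        const std::size_t uvRow = static_cast<std::size_t>(uvRowStride) * (row / 2);
        for (int col = 0; col < width; ++col) {
            const std::size_t uvIndex =
                uvRow + static_cast<std::size_t>(uvPixelStride) * (col / 2);
            const Rgba px = yuvToRgba(yRow[col], uPlane[uvIndex], vPlane[uvIndex]);
            *out++ = px.r;
            *out++ = px.g;
            *out++ = px.b;
            *out++ = px.a;
        }
    }
    return rgba;
}
```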

Changing values of MTLBuffer after created

I would like to be able to define a MTLBuffer and populate data directly to the buffer (or as efficiently as possible).
If I do the following, the values used in the shader are 1.0 and 2.0 (for X and Y respectively), not 3.0 and 4.0 which are set after the MTLBuffer is created.
int bufferLength = 128 * 128;
float pointBuffer[bufferLength * 2]; // 2 for X and Y
//Populate array with test values
for (int i = 0; i < (bufferLength * 2); i += 2) {
pointBuffer[i] = 1.0; //X
pointBuffer[i + 1] = 2.0; //Y
}
id<MTLBuffer> pointDataBuffer = [device newBufferWithBytes:&pointBuffer length:sizeof(pointBuffer) options:MTLResourceOptionCPUCacheModeDefault];
//Populate array with updated test values
for (int i = 0; i < (bufferLength * 2); i += 2) {
pointBuffer[i] = 3.0; //X
pointBuffer[i + 1] = 4.0; //Y
}
//In the (Swift) class with the pipeline:
commandEncoder!.setBuffer(pointDataBuffer, offset: 0, index: 4)
Based on the docs, it seems like I need to call didModifyRange: but pointDataBuffer does not seem to recognize the selector.
Is there a way to update the array without having to recreate the MTLBuffer?
-newBufferWithBytes:... makes a copy of the passed in bytes. It does not keep referencing them. So, subsequent changes to pointBuffer do not affect it.
However, buffers like this one (whose storage mode is not private) provide access to their storage through the -contents method. So, you could do something like this:
float *points = pointDataBuffer.contents;
for (int i = 0; i < (bufferLength * 2); i += 2) {
points[i] = 3.0; //X
points[i + 1] = 4.0; //Y
}
Be careful, though. The CPU and GPU operate asynchronously relative to each other. If there might be commands being processed by the GPU that reference the buffer, then modifying it from the CPU may interfere with the operation of those commands. So, you'll want to take steps to synchronize access to the buffer or otherwise avoid simultaneous CPU and GPU access.
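The copy-versus-reference behaviour described above can be illustrated without Metal at all. The sketch below is a plain C++ analogy, not the Metal API: FloatBuffer is a hypothetical class whose constructor copies its input (as -newBufferWithBytes: does) and whose contents() exposes the buffer's own storage (as -contents does). It ignores GPU synchronization entirely.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CPU-visible GPU buffer; NOT the Metal API.
class FloatBuffer {
public:
    // Copies `count` floats out of `values`; the source is not referenced afterwards.
    FloatBuffer(const float* values, std::size_t count)
        : storage_(values, values + count) {}

    // Pointer to the buffer's own storage.
    float* contents() { return storage_.data(); }

private:
    std::vector<float> storage_;
};

// Editing the source array after creation does not reach the buffer.
float firstValueAfterSourceEdit() {
    float points[4] = {1.0f, 2.0f, 1.0f, 2.0f};
    FloatBuffer buffer(points, 4);
    points[0] = 3.0f;              // changes the local array only
    return buffer.contents()[0];   // still the copied value
}

// Writing through contents() changes what the buffer holds.
float firstValueAfterContentsEdit() {
    float points[4] = {1.0f, 2.0f, 1.0f, 2.0f};
    FloatBuffer buffer(points, 4);
    buffer.contents()[0] = 3.0f;   // changes the buffer's storage
    return buffer.contents()[0];
}
```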

realloc() in a 64bit iOS device

When I use the C function realloc(p,size) in my project, the code runs well in both the simulator and on an iPhone 5.
However, when the code is running on an iPhone 6 plus, some odd things happen. The contents a points to are changed! What's worse, the output is different every time the function runs!
Here is the test code:
#define LENGTH (16)
- (void)viewDidLoad {
[super viewDidLoad];
char* a = (char*)malloc(LENGTH+1);
a[0] = 33;
a[1] = -14;
a[2] = 76;
a[3] = -128;
a[4] = 25;
a[5] = 49;
a[6] = -45;
a[7] = 56;
a[8] = -36;
a[9] = 56;
a[10] = 125;
a[11] = 44;
a[12] = 26;
a[13] = 79;
a[14] = 29;
a[15] = 66;
//print the binary contents pointed to by a; printing each char as an int also shows the difference
[self printInByte:a];
realloc(a, LENGTH);
[self printInByte:a];
free(a);
}
-(void) printInByte: (char*)t{
for(int i = 0;i < LENGTH ;++i){
for(int j = -1;j < 16;(t[i] >> ++j)& 1?printf("1"):printf("0"));
printf(" ");
if (i % 4 == 3) {
printf("\n");
}
}
}
One more thing, when LENGTH is not 16, it runs well with assigning up to a[LENGTH-1]. However, if LENGTH is exactly 16, things go wrong.
Any ideas will be appreciated.
From the realloc man page:
The realloc() function tries to change the size of the allocation pointed to by ptr to size, and returns ptr. If there is not enough room to enlarge the memory allocation pointed to by ptr, realloc() creates a new allocation, copies as much of the old data pointed to by ptr as will fit to the new allocation, frees the old allocation, and returns a pointer to the allocated memory.
So, you must write: a = realloc(a, LENGTH); because whenever the existing block can't be resized in place, the data is moved to another location and the address changes. That is why realloc returns a pointer. After such a move the old value of a points to freed memory, so reading through it or passing it to free() is undefined behavior.
You need to do:
a = realloc(a, LENGTH);
The reason is that realloc() may move the block, so you need to update your pointer to the new address:
http://www.cplusplus.com/reference/cstdlib/realloc/
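A minimal sketch of the pattern both answers describe, written in C++ with the C allocation functions. resizeBlock and contentsSurviveResize are hypothetical helper names. The result of realloc is kept in a temporary first, so the original block is not lost if the call fails; newSize is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Resizes *block to newSize bytes (newSize > 0 assumed).
// On success *block is updated to the possibly moved address and true is returned.
// On failure *block is left untouched and still has to be freed by the caller.
bool resizeBlock(char** block, std::size_t newSize) {
    void* moved = std::realloc(*block, newSize);
    if (moved == nullptr) {
        return false;
    }
    *block = static_cast<char*>(moved);
    return true;
}

// Mirrors the question: allocate LENGTH + 1 bytes, fill LENGTH of them,
// shrink to LENGTH, and check the data through the returned pointer.
bool contentsSurviveResize() {
    const std::size_t length = 16;
    char* a = static_cast<char*>(std::malloc(length + 1));
    if (a == nullptr) {
        return false;
    }
    char expected[16];
    for (std::size_t i = 0; i < length; ++i) {
        a[i] = static_cast<char>(static_cast<int>(i) * 7 - 50);
        expected[i] = a[i];
    }
    const bool resized = resizeBlock(&a, length);
    const bool same = resized && std::memcmp(a, expected, length) == 0;
    std::free(a);  // a is valid here whether or not the resize succeeded
    return same;
}
```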

Parallel Image Processing in OpenMP - Splitting Image

I have a function defined by Intel IPP to operate on an Image / Region of Image.
The inputs to the function are the pointer to the image, parameters that define the size to process, and the parameters of the filter.
The IPP function is single threaded.
Now, I have an image of size M x N.
I want to apply the filter on it in parallel.
The main idea is simple: break the image into 4 sub-images which are independent of each other.
Apply the filter to each sub-image and write the result to a sub-block of an empty image, where each thread writes to a distinct set of pixels.
It's really like processing 4 images, each on its own core.
This is the program I'm doing it with:
void OpenMpTest()
{
const int width = 1920;
const int height = 1080;
Ipp32f input_image[width * height];
Ipp32f output_image[width * height];
IppiSize size = { width, height };
int step = width * sizeof(Ipp32f);
/* Splitting the image */
IppiSize section_size = { width / 2, height / 2};
Ipp32f* input_upper_left = input_image;
Ipp32f* input_upper_right = input_image + width / 2;
Ipp32f* input_lower_left = input_image + (height / 2) * width;
Ipp32f* input_lower_right = input_image + (height / 2) * width + width / 2;
Ipp32f* output_upper_left = output_image;
Ipp32f* output_upper_right = output_image + width / 2;
Ipp32f* output_lower_left = output_image + (height / 2) * width;
Ipp32f* output_lower_right = output_image + (height / 2) * width + width / 2;
Ipp32f* input_sections[4] = { input_upper_left, input_upper_right, input_lower_left, input_lower_right };
Ipp32f* output_sections[4] = { output_upper_left, output_upper_right, output_lower_left, output_lower_right };
/* Filter Params */
Ipp32f pKernel[7] = { 1, 2, 3, 4, 3, 2, 1 };
omp_set_num_threads(4);
#pragma omp parallel for
for (int i = 0; i < 4; i++)
ippiFilterRow_32f_C1R(
input_sections[i], step,
output_sections[i], step,
section_size, pKernel, 7, 3);
}
Now, the issue is that I see no gain compared with processing the whole image in single-threaded mode.
I tried changing the image size and the filter size, and nothing changes the picture.
The most I could gain was nothing significant (10-20%).
I thought it might be because I can't "promise" each thread that the zone it received is read-only.
Nor can I let it know that the memory location it writes to belongs only to itself.
I read about defining variables as private and shared, yet I couldn't find a guide on dealing with arrays and pointers.
What would be the proper way to deal with pointers and sub-arrays in OpenMP?
How does the performance of threaded IPP compare?
Assuming no race conditions, performance problems with writing to shared arrays are most likely to occur in cache lines where part of the line is written by one thread and another part is read by another.
It's likely to require a data region larger than about 10 megabytes before full parallel speedup is seen.
You would need deeper analysis, e.g. by Intel VTune Amplifier, to see whether memory bandwidth or data overlaps are limiting performance.
Using Intel IPP Filter, the best solution was using:
int height = dstRoiSize.height;
int width = dstRoiSize.width;
Ipp32f *pSrc1, *pDst1;
int nThreads, cH, cT;
#pragma omp parallel shared( pSrc, pDst, nThreads, width, height, kernelSize,\
xAnchor, cH, cT ) private( pSrc1, pDst1 )
{
#pragma omp master
{
nThreads = omp_get_num_threads();
cH = height / nThreads;
cT = height % nThreads;
}
#pragma omp barrier
{
int curH;
int id = omp_get_thread_num();
pSrc1 = (Ipp32f*)( (Ipp8u*)pSrc + id * cH * srcStep );
pDst1 = (Ipp32f*)( (Ipp8u*)pDst + id * cH * dstStep );
if( id != ( nThreads - 1 )) curH = cH;
else curH = cH + cT;
ippiFilterRow_32f_C1R( pSrc1, srcStep, pDst1, dstStep,
width, curH, pKernel, kernelSize, xAnchor );
}
}
Thank You.
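The row-band split used above can be sketched without IPP. In the C++ example below, splitRows reproduces the cH / cT arithmetic (the last band absorbs the remainder), and filterRows is a hypothetical stand-in for ippiFilterRow_32f_C1R: a 3-tap horizontal sum with clamped edges, chosen only so the example is self-contained. Because the filter is purely horizontal, no band reads rows that belong to another band. All names are illustrative assumptions, and bandCount is assumed to be at least 1.

```cpp
#include <cstddef>
#include <vector>

struct RowBand {
    int firstRow;  // index of the band's first image row
    int rowCount;  // number of rows in the band
};

// Splits `height` rows into `bandCount` contiguous bands.
// Every band gets height / bandCount rows; the last one also takes the remainder.
std::vector<RowBand> splitRows(int height, int bandCount) {
    std::vector<RowBand> bands;
    const int base = height / bandCount;
    const int extra = height % bandCount;
    for (int id = 0; id < bandCount; ++id) {
        const int rows = (id == bandCount - 1) ? base + extra : base;
        bands.push_back({id * base, rows});
    }
    return bands;
}

// Hypothetical stand-in for the IPP row filter: 3-tap horizontal sum, edges clamped.
void filterRows(const float* src, float* dst, int width, int rows) {
    for (int y = 0; y < rows; ++y) {
        const float* in = src + static_cast<std::size_t>(y) * width;
        float* out = dst + static_cast<std::size_t>(y) * width;
        for (int x = 0; x < width; ++x) {
            const float left = in[x > 0 ? x - 1 : x];
            const float right = in[x + 1 < width ? x + 1 : x];
            out[x] = left + in[x] + right;
        }
    }
}

// Filters the image band by band. dst must already hold width * height floats.
// When compiled with OpenMP the bands run in parallel; otherwise the loop is serial.
void filterImageInBands(const std::vector<float>& src, std::vector<float>& dst,
                        int width, int height, int bandCount) {
    const std::vector<RowBand> bands = splitRows(height, bandCount);
#if defined(_OPENMP)
#pragma omp parallel for
#endif
    for (int b = 0; b < bandCount; ++b) {
        const std::size_t offset =
            static_cast<std::size_t>(bands[b].firstRow) * width;
        filterRows(src.data() + offset, dst.data() + offset, width, bands[b].rowCount);
    }
}
```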

OpenCL :Access proper index by using globalid(.)

Hi,
I am coding in OpenCL.
I am converting a C function that iterates over a 2D array starting from i=1 and j=1. Please find it below.
cv::Mat input; //Input :having some data in it ..
//Image input size is :input.rows=288 ,input.cols =640
cv::Mat output(input.rows-2,input.cols-2,CV_32F); //Output buffer
//Image output size is :output.rows=286 ,output.cols =638
This is the code that I want to port to OpenCL:
for(int i=1;i<output.rows-1;i++)
{
for(int j=1;j<output.cols-1;j++)
{
float xVal = input.at<uchar>(i-1,j-1)-input.at<uchar>(i-1,j+1)+ 2*(input.at<uchar>(i,j-1)-input.at<uchar>(i,j+1))+input.at<uchar>(i+1,j-1) - input.at<uchar>(i+1,j+1);
float yVal = input.at<uchar>(i-1,j-1) - input.at<uchar>(i+1,j-1)+ 2*(input.at<uchar>(i-1,j) - input.at<uchar>(i+1,j))+input.at<uchar>(i-1,j+1)-input.at<uchar>(i+1,j+1);
output.at<float>(i-1,j-1) = xVal*xVal+yVal*yVal;
}
}
...
Host code :
//Input Image size is :input.rows=288 ,input.cols =640
//Output Image size is :output.rows=286 ,output.cols =638
OclStr->global_work_size[0] =(input.cols);
OclStr->global_work_size[1] =(input.rows);
size_t outBufSize = (output.rows) * (output.cols) * 4;//4 as I am copying all 4 uchar values into one float variable space
cl_mem cl_input_buffer = clCreateBuffer(
OclStr->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR ,
(input.rows) * (input.cols),
static_cast<void *>(input.data), &OclStr->returnstatus);
cl_mem cl_output_buffer = clCreateBuffer(
OclStr->context, CL_MEM_WRITE_ONLY| CL_MEM_USE_HOST_PTR ,
(output.rows) * (output.cols) * sizeof(float),
static_cast<void *>(output.data), &OclStr->returnstatus);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 0, sizeof(cl_mem), (void *)&cl_input_buffer);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 1, sizeof(cl_mem), (void *)&cl_output_buffer);
OclStr->returnstatus = clEnqueueNDRangeKernel(
OclStr->command_queue,
OclStr->objkernel,
2,
NULL,
OclStr->global_work_size,
NULL,
0,
NULL,
NULL
);
clEnqueueMapBuffer(OclStr->command_queue, cl_output_buffer, true, CL_MAP_READ, 0, outBufSize, 0, NULL, NULL, &OclStr->returnstatus);
kernel Code :
__kernel void Sobel_uchar (__global uchar *pSrc, __global float *pDstImage)
{
const uint cols = get_global_id(0)+1;
const uint rows = get_global_id(1)+1;
const uint width= get_global_size(0);
uchar Opsoble[8];
Opsoble[0] = pSrc[(cols-1)+((rows-1)*width)];
Opsoble[1] = pSrc[(cols+1)+((rows-1)*width)];
Opsoble[2] = pSrc[(cols-1)+((rows+0)*width)];
Opsoble[3] = pSrc[(cols+1)+((rows+0)*width)];
Opsoble[4] = pSrc[(cols-1)+((rows+1)*width)];
Opsoble[5] = pSrc[(cols+1)+((rows+1)*width)];
Opsoble[6] = pSrc[(cols+0)+((rows-1)*width)];
Opsoble[7] = pSrc[(cols+0)+((rows+1)*width)];
float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
pDstImage[(cols-1)+(rows-1)*width] = gx*gx + gy*gy;
}
With this I am not able to get the expected output.
I have some questions:
My for loop starts from i=1 instead of zero, so how can I get the proper index using get_global_id() in the x and y directions?
What is going wrong in my kernel code above?
I suspect there is a problem with the buffer stride, but I have not been able to get any further after working on it for a whole day.
I have observed that with the logic below the output skips one or two frames after a sequence of some 7 or 8 frames.
I have added a screenshot of my output compared with the reference output.
The logic above performs only a partial Sobel on my input. I changed the width to:
const uint width = get_global_size(0)+1;
Your suggestions are most welcome!
It looks like you may be fetching values in (y,x) format in your opencl version. Also, you need to add 1 to the global id to replicate your for loops starting from 1 rather than 0.
I don't know why there is an unused iOffset variable. Maybe your bug is related to this? I removed it in my version.
Does this kernel work better for you?
__kernel void simple(__global uchar *pSrc, __global float *pDstImage)
{
const uint i = get_global_id(0) +1;
const uint j = get_global_id(1) +1;
const uint width = get_global_size(0) +2; // input row stride; assumes the global work size equals the output size (638 x 286)
uchar Opsoble[8];
Opsoble[0] = pSrc[(i-1) + (j - 1)*width];
Opsoble[1] = pSrc[(i-1) + (j + 1)*width];
Opsoble[2] = pSrc[i + (j-1)*width];
Opsoble[3] = pSrc[i + (j+1)*width];
Opsoble[4] = pSrc[(i+1) + (j - 1)*width];
Opsoble[5] = pSrc[(i+1) + (j + 1)*width];
Opsoble[6] = pSrc[(i-1) + (j)*width];
Opsoble[7] = pSrc[(i+1) + (j)*width];
float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
pDstImage[(i-1) + (j-1)*(width-2)] = gx*gx + gy*gy ; // output rows are 2 pixels narrower than input rows
}
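To check the index arithmetic without a GPU, one work-item can be emulated on the CPU. The C++ sketch below assumes the global work size equals the output size (638 x 286 in the question), that the input is 2 pixels wider and taller than the output, and that both buffers are tightly packed. sobelWorkItem and sobelReference are hypothetical names. The point it illustrates is that the source and destination need different row strides.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulates one work-item. gidX and gidY play the role of get_global_id(0) and
// get_global_id(1); outWidth plays the role of get_global_size(0).
void sobelWorkItem(const std::uint8_t* src, float* dst,
                   std::size_t gidX, std::size_t gidY, std::size_t outWidth) {
    const std::size_t srcWidth = outWidth + 2;  // input row stride
    const std::size_t col = gidX + 1;           // the +1 replicates loops starting at 1
    const std::size_t row = gidY + 1;
    auto at = [&](std::size_t r, std::size_t c) {
        return static_cast<int>(src[c + r * srcWidth]);
    };
    const int gx = at(row - 1, col - 1) - at(row - 1, col + 1)
                 + 2 * (at(row, col - 1) - at(row, col + 1))
                 + at(row + 1, col - 1) - at(row + 1, col + 1);
    const int gy = at(row - 1, col - 1) - at(row + 1, col - 1)
                 + 2 * (at(row - 1, col) - at(row + 1, col))
                 + at(row - 1, col + 1) - at(row + 1, col + 1);
    // Output row stride is outWidth, not srcWidth.
    dst[gidX + gidY * outWidth] = static_cast<float>(gx * gx + gy * gy);
}

// Runs every work-item in turn. srcWidth and srcHeight must both be at least 3.
std::vector<float> sobelReference(const std::vector<std::uint8_t>& src,
                                  std::size_t srcWidth, std::size_t srcHeight) {
    const std::size_t outWidth = srcWidth - 2;
    const std::size_t outHeight = srcHeight - 2;
    std::vector<float> dst(outWidth * outHeight);
    for (std::size_t y = 0; y < outHeight; ++y) {
        for (std::size_t x = 0; x < outWidth; ++x) {
            sobelWorkItem(src.data(), dst.data(), x, y, outWidth);
        }
    }
    return dst;
}
```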
I am a bit apprehensive about posting an answer suggesting optimizations to your kernel, seeing as the original output has not been reproduced exactly as of yet. There is a major improvement available to be made for problems related to image processing/filtering.
Using local memory will help you out by reducing the number of global reads by a factor of eight, as well as grouping the global writes together for potential gains with the single write-per-pixel output.
The kernel below reads a block of up to 34x34 from pSrc, and outputs a 32x32(max) area of the pDstImage. I hope the comments in the code are enough to guide you in using the kernel. I have not been able to give this a complete test, so there could be changes required. Any comments are appreciated as well.
__kernel void sobel_uchar_wlocal (__global const uchar *pSrc, __global float *pDstImage, const uint2 dimDstImage)
{
    //call this kernel with a 1-dimensional work group size: 32x1
    //each group of 32 work items calculates a 32x32 region of the output
    //assumes pSrc is 2 pixels wider and taller than pDstImage, and that every work group starts inside the output image
    const uint wid = get_local_id(0);
    const uint wid_1 = wid+1; // column in srcBuff, offset by the 1-pixel border
    const uint2 gid = (uint2)(get_group_id(0),get_group_id(1));
    const uint localDim = get_local_size(0);
    const uint srcWidth = dimDstImage.x + 2; // row stride of pSrc
    const uint2 globalTopLeft = (uint2)(localDim * gid.x, localDim * gid.y); //top-left corner of this tile in pDstImage
    //dimLocalBuff is clipped at the right and bottom edges of the image, where the work group may run over the border
    uint2 dimLocalBuff = (uint2)(localDim,localDim);
    if(dimDstImage.x - globalTopLeft.x < dimLocalBuff.x){
        dimLocalBuff.x = dimDstImage.x - globalTopLeft.x;
    }
    if(dimDstImage.y - globalTopLeft.y < dimLocalBuff.y){
        dimLocalBuff.y = dimDstImage.y - globalTopLeft.y;
    }
    uint i,j;
    //save region of data (tile plus border) into local memory, indexed [row][column]
    __local uchar srcBuff[34][34]; //34^2 uchar = 1156 bytes
    for(j=0;j<dimLocalBuff.y+2;j++){
        for(i=wid;i<dimLocalBuff.x+2;i+=localDim){
            srcBuff[j][i] = pSrc[(globalTopLeft.x+i) + (globalTopLeft.y+j)*srcWidth];
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE); //every work item must finish loading before any of them reads srcBuff
    //compute output and store locally; each work item handles one column of the tile
    __local float dstBuff[32][32]; //32^2 float = 4096 bytes
    if(wid < dimLocalBuff.x){
        for(j=0;j<dimLocalBuff.y;j++){
            const uint r = j+1; // row in srcBuff, offset by the border
            float gx = srcBuff[r-1][wid_1-1]-srcBuff[r-1][wid_1+1]+2*(srcBuff[r][wid_1-1]-srcBuff[r][wid_1+1])+srcBuff[r+1][wid_1-1]-srcBuff[r+1][wid_1+1];
            float gy = srcBuff[r-1][wid_1-1]-srcBuff[r+1][wid_1-1]+2*(srcBuff[r-1][wid_1]-srcBuff[r+1][wid_1])+srcBuff[r-1][wid_1+1]-srcBuff[r+1][wid_1+1];
            dstBuff[j][wid] = gx*gx + gy*gy;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    //copy results to output
    for(j=0;j<dimLocalBuff.y;j++){
        for(i=wid;i<dimLocalBuff.x;i+=localDim){
            pDstImage[(globalTopLeft.x+i) + (globalTopLeft.y+j)*dimDstImage.x] = dstBuff[j][i];
        }
    }
}

Resources