I'm having a problem with the implementation of multiple kernel functions in Metal in combination with Swift.
My goal is to implement a block-wise DCT transformation over an image. The DCT is implemented as two matrix multiplications:
J = H * I * H^-1
The following code shows the kernel functions themselves and the calls in the Swift code. If I run each kernel function on its own it works, but I can't manage to hand the output buffer of the first kernel function over to the second one. The second function therefore always returns a buffer filled with just zeros.
All image input and output buffers are 400x400 with RGB (a 16-bit integer for each component). The matrices are 8x8 with 16-bit integer entries.
Is there a special command needed to synchronize the buffer read and write accesses of the different kernel functions? Or am I doing something else wrong?
Thanks for your help
shaders.metal
struct Image3D16 {
    short data[400][400][3];
};

struct Matrix {
    short data[8 * 8];
};
kernel void dct1(device Image3D16 *inputImage [[buffer(0)]],
                 device Image3D16 *outputImage [[buffer(1)]],
                 device Matrix *mult [[buffer(2)]],
                 uint2 gid [[thread_position_in_grid]],
                 uint2 tid [[thread_position_in_threadgroup]]) {
    int red = 0, green = 0, blue = 0;
    for (int x = 0; x < 8; x++) {
        short r = inputImage->data[gid.x - tid.x + x][gid.y][0];
        short g = inputImage->data[gid.x - tid.x + x][gid.y][1];
        short b = inputImage->data[gid.x - tid.x + x][gid.y][2];
        red   += r * mult->data[tid.x * 8 + x];
        green += g * mult->data[tid.x * 8 + x];
        blue  += b * mult->data[tid.x * 8 + x];
    }
    outputImage->data[gid.x][gid.y][0] = red;
    outputImage->data[gid.x][gid.y][1] = green;
    outputImage->data[gid.x][gid.y][2] = blue;
}
kernel void dct2(device Image3D16 *inputImage [[buffer(0)]],
                 device Image3D16 *outputImage [[buffer(1)]],
                 device Matrix *mult [[buffer(2)]],
                 uint2 gid [[thread_position_in_grid]],
                 uint2 tid [[thread_position_in_threadgroup]]) {
    int red = 0, green = 0, blue = 0;
    for (int y = 0; y < 8; y++) {
        short r = inputImage->data[gid.x][gid.y - tid.y + y][0];
        short g = inputImage->data[gid.x][gid.y - tid.y + y][1];
        short b = inputImage->data[gid.x][gid.y - tid.y + y][2];
        red   += r * mult->data[tid.y * 8 + y];
        green += g * mult->data[tid.y * 8 + y];
        blue  += b * mult->data[tid.y * 8 + y];
    }
    outputImage->data[gid.x][gid.y][0] = red;
    outputImage->data[gid.x][gid.y][1] = green;
    outputImage->data[gid.x][gid.y][2] = blue;
}
ViewController.swift
...
let commandBuffer = commandQueue.commandBuffer()
let computeEncoder1 = commandBuffer.computeCommandEncoder()
computeEncoder1.setComputePipelineState(computeDCT1)
computeEncoder1.setBuffer(input, offset: 0, atIndex: 0)
computeEncoder1.setBuffer(tmpBuffer3D1, offset: 0, atIndex: 1)
computeEncoder1.setBuffer(dctMatrix1, offset: 0, atIndex: 2)
computeEncoder1.dispatchThreadgroups(blocks, threadsPerThreadgroup: dctSize)
computeEncoder1.endEncoding()
let computeEncoder2 = commandBuffer.computeCommandEncoder()
computeEncoder2.setComputePipelineState(computeDCT2)
computeEncoder2.setBuffer(tmpBuffer3D1, offset: 0, atIndex: 0)
computeEncoder2.setBuffer(output, offset: 0, atIndex: 1)
computeEncoder2.setBuffer(dctMatrix2, offset: 0, atIndex: 2)
computeEncoder2.dispatchThreadgroups(blocks, threadsPerThreadgroup: dctSize)
computeEncoder2.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
I found the error. My kernel function tried to read outside of its allocated memory. Metal's reaction is to stop executing all following commands in the command buffer. The output was therefore always zero, because the computation was never performed. The application's GPU usage drops, which can be used to detect this kind of error.
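For anyone hitting the same symptom: once commandBuffer.waitUntilCompleted() returns, the failure can also be surfaced programmatically. A minimal sketch using the current Swift API (it assumes the same commandBuffer as in the code above):

if commandBuffer.status == .error {
    // An out-of-bounds access typically shows up here as a command buffer
    // execution error instead of producing any output.
    print("Command buffer failed: \(String(describing: commandBuffer.error))")
}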
Related
I'm trying to do background segmentation of a live video using CoreML. I used DeepLabV3 as provided by Apple. The model works OK, even though it already takes 100 ms to process a 513x513 image. I then want to display the output, which is a 513x513 array of Int32. Converting it into an image, as done in CoreMLHelpers, takes 300 ms, and I'm looking for a much faster way to display the results. I was thinking that maybe it'd be faster to somehow dump this into an OpenGL or Metal texture.
What is the best way to handle MLMultiArray for live inputs?
My answer is based on processing the MLMultiArray in Metal
Create an MTLBuffer:
let device = MTLCreateSystemDefaultDevice()!
let segmentationMaskBuffer: MTLBuffer = device.makeBuffer(length: segmentationHeight * segmentationWidth * MemoryLayout<Int32>.stride)!
Copy MLMultiArray to MTLBuffer:
memcpy(segmentationMaskBuffer.contents(), mlOutput.semanticPredictions.dataPointer, segmentationMaskBuffer.length)
Setup Metal related variables:
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "binaryMask")!
let computePipeline = try! device.makeComputePipelineState(function: function)
create a struct for segmentation size:
let segmentationWidth = 513
let segmentationHeight = 513

struct MixParams {
    var width: Int32 = Int32(segmentationWidth)
    var height: Int32 = Int32(segmentationHeight)
}
create an output texture:
let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .bgra8Unorm, width: width, height: height, mipmapped: false)
textureDescriptor.usage = [.shaderRead, .shaderWrite]
let outputTexture = device.makeTexture(descriptor: textureDescriptor)!
pass the MTLBuffer and output texture to the kernel function:
var params = MixParams()
let buffer = commandQueue.makeCommandBuffer()!
let maskCommandEncoder = buffer.makeComputeCommandEncoder()!
maskCommandEncoder.setTexture(outputTexture, index: 1)
maskCommandEncoder.setBuffer(segmentationMaskBuffer, offset: 0, index: 0)
maskCommandEncoder.setBytes(&params, length: MemoryLayout<MixParams>.size, index: 1)

let w = computePipeline.threadExecutionWidth
let h = computePipeline.maxTotalThreadsPerThreadgroup / w
let threadGroupSize = MTLSizeMake(w, h, 1)
let threadGroups = MTLSizeMake(
    (depthWidth + threadGroupSize.width - 1) / threadGroupSize.width,
    (depthHeight + threadGroupSize.height - 1) / threadGroupSize.height, 1)

maskCommandEncoder.setComputePipelineState(computePipeline)
maskCommandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupSize)
maskCommandEncoder.endEncoding()
buffer.commit()
buffer.waitUntilCompleted()
write your kernel function in Shaders.metal file:
#include <metal_stdlib>
using namespace metal;
#include <CoreImage/CoreImage.h>
struct MixParams {
    int segmentationWidth;
    int segmentationHeight;
};
static inline int get_class(float2 pos, int width, int height, device int* mask) {
    const int x = int(pos.x * width);
    const int y = int(pos.y * height);
    return mask[y * width + x];
}

static float get_person_probability(float2 pos, int width, int height, device int* mask) {
    return get_class(pos, width, height, mask) == 15;
}
kernel void binaryMask(
    texture2d<float, access::write> outputTexture [[texture(1)]],
    device int* segmentationMask [[buffer(0)]],
    constant MixParams& params [[buffer(1)]],
    uint2 gid [[thread_position_in_grid]])
{
    float width = outputTexture.get_width();
    float height = outputTexture.get_height();
    if (gid.x >= width || gid.y >= height) return;

    const float2 pos = float2(float(gid.x) / width,
                              float(gid.y) / height);
    const float is_person = get_person_probability(pos, params.segmentationWidth,
                                                   params.segmentationHeight,
                                                   segmentationMask);
    float4 outPixel;
    if (is_person < 0.5f) {
        outPixel = float4(0.0, 0.0, 0.0, 0.0);
    } else {
        outPixel = float4(1.0, 1.0, 1.0, 1.0);
    }
    outputTexture.write(outPixel, gid);
}
Finally, get the CIImage from the output texture after the command buffer has finished:
let kciOptions: [CIImageOption: Any] = [CIImageOption.colorSpace: CGColorSpaceCreateDeviceRGB()]
let maskImage = CIImage(mtlTexture: outputTexture, options: kciOptions)!.oriented(.downMirrored)
Instead of outputting an MLMultiArray you can change the model to make it output an image of type CVPixelBuffer. Then you can use CVMetalTextureCacheCreateTextureFromImage to turn the pixel buffer into an MTLTexture. (I think this works but I don't recall if I ever tried it. Not all pixel buffer objects can be turned into textures and I'm not sure if Core ML outputs a CVPixelBuffer object with the "Metal compatibility flag" turned on.)
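If you do go that route, a minimal sketch might look like this (untested assumption on my part; it presumes the model now outputs a Metal-compatible CVPixelBuffer and reuses the device from above):

import CoreVideo

var textureCache: CVMetalTextureCache?
CVMetalTextureCacheCreate(kCFAllocatorDefault, nil, device, nil, &textureCache)

func texture(from pixelBuffer: CVPixelBuffer) -> MTLTexture? {
    let width = CVPixelBufferGetWidth(pixelBuffer)
    let height = CVPixelBufferGetHeight(pixelBuffer)
    var cvTexture: CVMetalTexture?
    // .bgra8Unorm assumes a BGRA pixel buffer; a one-channel mask would use .r8Unorm instead.
    CVMetalTextureCacheCreateTextureFromImage(kCFAllocatorDefault, textureCache!, pixelBuffer, nil,
                                              .bgra8Unorm, width, height, 0, &cvTexture)
    return cvTexture.flatMap { CVMetalTextureGetTexture($0) }
}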
Alternatively, you can write a compute kernel that takes the MLMultiArray and converts it to a texture, which then gets drawn into a Metal view. This has the advantage that you can apply all kinds of effects to the segmentation map in the compute kernel at the same time.
How can I calculate the mean and variance of an image with 16 channels using Metal?
I want to calculate the mean and variance of each channel separately.
For example:
kernel void meanandvariance(texture2d_array<float, access::read> in [[texture(0)]],
                            texture2d_array<float, access::write> out [[texture(1)]],
                            ushort3 gid [[thread_position_in_grid]],
                            ushort tid [[thread_index_in_threadgroup]],
                            ushort3 tg_size [[threads_per_threadgroup]]) {
}
There's probably a way to do this by creating a sequence of texture views on the input texture array and output texture array, encoding an MPSImageStatisticsMeanAndVariance kernel invocation for each slice.
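For reference, a rough sketch of that MPS-based route (not tested here; it assumes the device, commandBuffer, and 2D-array sourceTexture used in the driver code further down):

import MetalPerformanceShaders

let meanAndVariance = MPSImageStatisticsMeanAndVariance(device: device)
let resultDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba32Float,
                                                                width: 2, height: 1, mipmapped: false)
resultDescriptor.usage = .shaderWrite

for slice in 0..<sourceTexture.arrayLength {
    // A texture view exposes one slice of the array as an ordinary 2D texture.
    let sliceView = sourceTexture.makeTextureView(pixelFormat: sourceTexture.pixelFormat,
                                                  textureType: .type2D,
                                                  levels: 0..<1,
                                                  slices: slice..<(slice + 1))!
    let resultTexture = device.makeTexture(descriptor: resultDescriptor)!
    // MPS writes the mean to (0, 0) and the variance to (1, 0) of the destination.
    meanAndVariance.encode(commandBuffer: commandBuffer,
                           sourceTexture: sliceView,
                           destinationTexture: resultTexture)
}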
But let's take a look at how to do it ourselves. There are many different possible approaches, so I chose one that was simple and used some interesting results from statistics.
Essentially, we'll do the following:
1. Write a kernel that can produce a subset mean and variance for a single row of the image.
2. Write a kernel that can produce an overall mean and variance from the partial results from step 1.
Here are the kernels:
kernel void compute_row_mean_variance_array(texture2d_array<float, access::read> inTexture [[texture(0)]],
                                            texture2d_array<float, access::write> outTexture [[texture(1)]],
                                            uint3 tpig [[thread_position_in_grid]])
{
    uint row = tpig.x;
    uint slice = tpig.y;
    uint width = inTexture.get_width();
    if (row >= inTexture.get_height() || slice >= inTexture.get_array_size()) { return; }

    float4 mean(0.0f);
    float4 var(0.0f);
    for (uint col = 0; col < width; ++col) {
        float4 rgba = inTexture.read(ushort2(col, row), slice);
        // http://datagenetics.com/blog/november22017/index.html
        float weight = 1.0f / (col + 1);
        float4 oldMean = mean;
        mean = mean + (rgba - mean) * weight;
        var = var + (rgba - oldMean) * (rgba - mean);
    }
    var = var / width;

    outTexture.write(mean, ushort2(row, 0), slice);
    outTexture.write(var, ushort2(row, 1), slice);
}
kernel void reduce_mean_variance_array(texture2d_array<float, access::read> inTexture [[texture(0)]],
                                       texture2d_array<float, access::write> outTexture [[texture(1)]],
                                       uint3 tpig [[thread_position_in_grid]])
{
    uint width = inTexture.get_width();
    uint slice = tpig.x;
    // https://arxiv.org/pdf/1007.1012.pdf
    float4 mean(0.0f);
    float4 meanOfVar(0.0f);
    float4 varOfMean(0.0f);
    for (uint col = 0; col < width; ++col) {
        float weight = 1.0f / (col + 1);
        float4 oldMean = mean;
        float4 submean = inTexture.read(ushort2(col, 0), slice);
        mean = mean + (submean - mean) * weight;
        float4 subvar = inTexture.read(ushort2(col, 1), slice);
        meanOfVar = meanOfVar + (subvar - meanOfVar) * weight;
        varOfMean = varOfMean + (submean - oldMean) * (submean - mean);
    }
    float4 var = meanOfVar + varOfMean / width;

    outTexture.write(mean, ushort2(0, 0), slice);
    outTexture.write(var, ushort2(1, 0), slice);
}
In summary, to achieve step 1, we use an "online" (incremental) algorithm to calculate the partial mean/variance of the row in a way that's more numerically-stable than just adding all the pixel values and dividing by the width. My reference for writing this kernel was this post. Each thread in the grid writes its row's statistics to the appropriate column and slice of an intermediate texture array.
To achieve step 2, we need to find a statistically-sound way of computing the overall statistics from the partial results. This is quite simple in the case of finding the mean: the mean of the population is the mean of the means of the subsets (this holds when the sample size of each subset is the same; in the general case, the overall mean is a weighted sum of the subset means). The variance is trickier, but it turns out that the variance of the population is the sum of the mean of the variances of the subsets and the variance of the means of the subsets (the same caveat about equally-sized subsets applies here). This is a convenient fact that we can combine with our incremental approach above to produce the final mean and variance of each slice, which is written to the corresponding slice of the output texture.
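To make that combination rule concrete, here is a small CPU-side sanity check in plain Swift (hypothetical numbers, two equally-sized subsets):

let subsets: [[Double]] = [[1, 2, 3, 4], [5, 6, 7, 8]]
let all = subsets.flatMap { $0 }

func mean(_ xs: [Double]) -> Double { xs.reduce(0, +) / Double(xs.count) }
func variance(_ xs: [Double]) -> Double {   // population variance
    let m = mean(xs)
    return xs.reduce(0) { $0 + ($1 - m) * ($1 - m) } / Double(xs.count)
}

let subsetMeans = subsets.map(mean)
let subsetVars = subsets.map(variance)

// Direct computation over the whole population:
let directMean = mean(all)          // 4.5
let directVar = variance(all)       // 5.25

// Combined from the subsets:
// mean of the population == mean of the subset means
let combinedMean = mean(subsetMeans)                        // 4.5
// population variance == mean of subset variances + variance of subset means
let combinedVar = mean(subsetVars) + variance(subsetMeans)  // 1.25 + 4.0 == 5.25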
For completeness, here's the Swift code I used to drive these kernels:
let library = device.makeDefaultLibrary()!
let meanVarKernelFunction = library.makeFunction(name: "compute_row_mean_variance_array")!
let meanVarComputePipelineState = try! device.makeComputePipelineState(function: meanVarKernelFunction)
let reduceKernelFunction = library.makeFunction(name: "reduce_mean_variance_array")!
let reduceComputePipelineState = try! device.makeComputePipelineState(function: reduceKernelFunction)
let width = sourceTexture.width
let height = sourceTexture.height
let arrayLength = sourceTexture.arrayLength
let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba32Float, width: width, height: height, mipmapped: false)
textureDescriptor.textureType = .type2DArray
textureDescriptor.arrayLength = arrayLength
textureDescriptor.width = height
textureDescriptor.height = 2
textureDescriptor.usage = [.shaderRead, .shaderWrite]
let partialResultsTexture = device.makeTexture(descriptor: textureDescriptor)!
textureDescriptor.width = 2
textureDescriptor.height = 1
textureDescriptor.usage = .shaderWrite
let destTexture = device.makeTexture(descriptor: textureDescriptor)!
let commandBuffer = commandQueue.makeCommandBuffer()!
let computeCommandEncoder = commandBuffer.makeComputeCommandEncoder()!
computeCommandEncoder.setComputePipelineState(meanVarComputePipelineState)
computeCommandEncoder.setTexture(sourceTexture, index: 0)
computeCommandEncoder.setTexture(partialResultsTexture, index: 1)
let meanVarGridSize = MTLSize(width: sourceTexture.height, height: sourceTexture.arrayLength, depth: 1)
let meanVarThreadgroupSize = MTLSizeMake(meanVarComputePipelineState.threadExecutionWidth, 1, 1)
let meanVarThreadgroupCount = MTLSizeMake((meanVarGridSize.width + meanVarThreadgroupSize.width - 1) / meanVarThreadgroupSize.width,
(meanVarGridSize.height + meanVarThreadgroupSize.height - 1) / meanVarThreadgroupSize.height,
1)
computeCommandEncoder.dispatchThreadgroups(meanVarThreadgroupCount, threadsPerThreadgroup: meanVarThreadgroupSize)
computeCommandEncoder.setComputePipelineState(reduceComputePipelineState)
computeCommandEncoder.setTexture(partialResultsTexture, index: 0)
computeCommandEncoder.setTexture(destTexture, index: 1)
let reduceThreadgroupSize = MTLSizeMake(1, 1, 1)
let reduceThreadgroupCount = MTLSizeMake(arrayLength, 1, 1)
computeCommandEncoder.dispatchThreadgroups(reduceThreadgroupCount, threadsPerThreadgroup: reduceThreadgroupSize)
computeCommandEncoder.endEncoding()
let destTexture2DDesc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba32Float, width: 2, height: 1, mipmapped: false)
destTexture2DDesc.usage = .shaderWrite
let destTexture2D = device.makeTexture(descriptor: destTexture2DDesc)!
meanVarKernel.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture2D, destinationTexture: destTexture2D)
#if os(macOS)
let blitCommandEncoder = commandBuffer.makeBlitCommandEncoder()!
blitCommandEncoder.synchronize(resource: destTexture)
blitCommandEncoder.synchronize(resource: destTexture2D)
blitCommandEncoder.endEncoding()
#endif
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
In my experiments, this program produced the same results as MPSImageStatisticsMeanAndVariance, give or take some differences on the order of 1e-7. It was also 2.5x slower than MPS on my Mac, probably due in part to failure to exploit latency hiding with granular parallelism.
#include <metal_stdlib>
using namespace metal;

kernel void instance_norm(constant float4* scale [[buffer(0)]],
                          constant float4* shift [[buffer(1)]],
                          texture2d_array<float, access::read> in [[texture(0)]],
                          texture2d_array<float, access::write> out [[texture(1)]],
                          ushort3 gid [[thread_position_in_grid]],
                          ushort tid [[thread_index_in_threadgroup]],
                          ushort3 tg_size [[threads_per_threadgroup]]) {
    ushort width = in.get_width();
    ushort height = in.get_height();
    const ushort thread_count = tg_size.x * tg_size.y;

    threadgroup float4 shared_mem [256];

    float4 sum = 0;
    for (ushort xIndex = gid.x; xIndex < width; xIndex += tg_size.x) {
        for (ushort yIndex = gid.y; yIndex < height; yIndex += tg_size.y) {
            sum += in.read(ushort2(xIndex, yIndex), gid.z);
        }
    }
    shared_mem[tid] = sum;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Reduce to 32 values
    sum = 0;
    if (tid < 32) {
        for (ushort i = tid + 32; i < thread_count; i += 32) {
            sum += shared_mem[i];
        }
    }
    shared_mem[tid] += sum;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Calculate mean
    sum = 0;
    if (tid == 0) {
        ushort top = min(ushort(32), thread_count);
        for (ushort i = 0; i < top; i += 1) {
            sum += shared_mem[i];
        }
        shared_mem[0] = sum / (width * height);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    const float4 mean = shared_mem[0];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Variance
    sum = 0;
    for (ushort xIndex = gid.x; xIndex < width; xIndex += tg_size.x) {
        for (ushort yIndex = gid.y; yIndex < height; yIndex += tg_size.y) {
            sum += pow(in.read(ushort2(xIndex, yIndex), gid.z) - mean, 2);
        }
    }
    shared_mem[tid] = sum;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Reduce to 32 values
    sum = 0;
    if (tid < 32) {
        for (ushort i = tid + 32; i < thread_count; i += 32) {
            sum += shared_mem[i];
        }
    }
    shared_mem[tid] += sum;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Calculate variance
    sum = 0;
    if (tid == 0) {
        ushort top = min(ushort(32), thread_count);
        for (ushort i = 0; i < top; i += 1) {
            sum += shared_mem[i];
        }
        shared_mem[0] = sum / (width * height);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    const float4 sigma = sqrt(shared_mem[0] + float4(1e-4));
    float4 multiplier = scale[gid.z] / sigma;
    for (ushort xIndex = gid.x; xIndex < width; xIndex += tg_size.x) {
        for (ushort yIndex = gid.y; yIndex < height; yIndex += tg_size.y) {
            float4 val = in.read(ushort2(xIndex, yIndex), gid.z);
            out.write(clamp((val - mean) * multiplier + shift[gid.z], -10.0, 10.0), ushort2(xIndex, yIndex), gid.z);
        }
    }
}
This is how Bender implements it, but I do not think it is correct. Can anybody help?
https://github.com/xmartlabs/Bender/blob/master/Sources/Metal/instanceNorm.metal
I want to use template matching in OpenCV to get the similarity of two images. As we all know, template matching is usually used to find smaller image parts in a bigger one. Here is my question: I find that when the template image and the source image are the same size, the result matrix returned by matchTemplate() is always 0, even if the two images are exactly the same.
Can template matching in OpenCV deal with two same-sized images?
Perhaps I should apologize first: the value of the matrix is indeed zero after normalization, as long as the two pictures are of the same size. I was wrong about that:)
Check out this page:
OpenCV - Normalize
Part of the OpenCV source code:
void cv::normalize( InputArray _src, OutputArray _dst, double a, double b,
                    int norm_type, int rtype, InputArray _mask )
{
    Mat src = _src.getMat(), mask = _mask.getMat();
    double scale = 1, shift = 0;
    if( norm_type == CV_MINMAX )
    {
        double smin = 0, smax = 0; // Records the maximum and minimum value in the _src matrix
        double dmin = MIN( a, b ), dmax = MAX( a, b );
        minMaxLoc( _src, &smin, &smax, 0, 0, mask ); // Find the minimum and maximum value
        scale = (dmax - dmin)*(smax - smin > DBL_EPSILON ? 1./(smax - smin) : 0);
        shift = dmin - smin*scale;
    }
    //...
    if( !mask.data )
        src.convertTo( dst, rtype, scale, shift );
    else
    {
        //...
    }
}
Since there is only one element in the result array, smin = smax = result[0][0]
scale = (dmax - dmin) * (smax - smin > DBL_EPSILON ? 1./(smax - smin) : 0)
      = (1 - 0) * 0
      = 0
shift = dmin - smin * scale
      = 0 - result[0][0] * 0
      = 0
After that, void Mat::convertTo(OutputArray m, int rtype, double alpha, double beta) applies the formula m(x,y) = saturate_cast<rType>(alpha * (*this)(x,y) + beta). (saturate_cast has nothing to do with your problem, so we can ignore it for now.)
When you call normalize( result, result, 0, 1, NORM_MINMAX, -1, Mat() ), whatever the element in the matrix is, it will execute src.convertTo( dst, rtype, scale, shift ); with scale = 0, shift = 0.
In this convertTo function,
alpha = 0, beta = 0
result[0][0] = result[0][0] * alpha + beta
             = result[0][0] * 0 + 0
             = 0
So, whatever the value in the result matrix is:
As long as the image and the template are the same size, the result matrix will be 1x1, and after normalization it will become [0].
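In other words, skip the normalization step and read the single raw score directly; with a normalized method such as TM_CCOEFF_NORMED the value is already a similarity score in [-1, 1]. A minimal sketch (hypothetical file names):

#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
    cv::Mat image = cv::imread("a.png", cv::IMREAD_GRAYSCALE);      // hypothetical paths
    cv::Mat templ = cv::imread("b.png", cv::IMREAD_GRAYSCALE);      // same size as image

    cv::Mat result;                                                 // 1x1 when the sizes match
    cv::matchTemplate(image, templ, result, cv::TM_CCOEFF_NORMED);  // no normalize() afterwards

    std::cout << "similarity: " << result.at<float>(0, 0) << std::endl;  // ~1.0 for identical images
    return 0;
}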
Hi,
I am coding in OpenCL.
I am converting a C function whose 2D loops start from i=1 and j=1. Please find the code below.
cv::Mat input; //Input :having some data in it ..
//Image input size is :input.rows=288 ,input.cols =640
cv::Mat output(input.rows-2,input.cols-2,CV_32F); //Output buffer
//Image output size is :output.rows=286 ,output.cols =638
This is the code which I want to port to OpenCL:
for (int i = 1; i < output.rows - 1; i++)
{
    for (int j = 1; j < output.cols - 1; j++)
    {
        float xVal = input.at<uchar>(i-1,j-1) - input.at<uchar>(i-1,j+1) + 2*(input.at<uchar>(i,j-1) - input.at<uchar>(i,j+1)) + input.at<uchar>(i+1,j-1) - input.at<uchar>(i+1,j+1);
        float yVal = input.at<uchar>(i-1,j-1) - input.at<uchar>(i+1,j-1) + 2*(input.at<uchar>(i-1,j) - input.at<uchar>(i+1,j)) + input.at<uchar>(i-1,j+1) - input.at<uchar>(i+1,j+1);
        output.at<float>(i-1,j-1) = xVal*xVal + yVal*yVal;
    }
}
...
Host code :
//Input Image size is :input.rows=288 ,input.cols =640
//Output Image size is :output.rows=286 ,output.cols =638
OclStr->global_work_size[0] =(input.cols);
OclStr->global_work_size[1] =(input.rows);
size_t outBufSize = (output.rows) * (output.cols) * 4;//4 as I am copying all 4 uchar values into one float variable space
cl_mem cl_input_buffer = clCreateBuffer(
OclStr->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR ,
(input.rows) * (input.cols),
static_cast<void *>(input.data), &OclStr->returnstatus);
cl_mem cl_output_buffer = clCreateBuffer(
OclStr->context, CL_MEM_WRITE_ONLY| CL_MEM_USE_HOST_PTR ,
(output.rows) * (output.cols) * sizeof(float),
static_cast<void *>(output.data), &OclStr->returnstatus);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 0, sizeof(cl_mem), (void *)&cl_input_buffer);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 1, sizeof(cl_mem), (void *)&cl_output_buffer);
OclStr->returnstatus = clEnqueueNDRangeKernel(
OclStr->command_queue,
OclStr->objkernel,
2,
NULL,
OclStr->global_work_size,
NULL,
0,
NULL,
NULL
);
clEnqueueMapBuffer(OclStr->command_queue, cl_output_buffer, true, CL_MAP_READ, 0, outBufSize, 0, NULL, NULL, &OclStr->returnstatus);
kernel Code :
__kernel void Sobel_uchar (__global uchar *pSrc, __global float *pDstImage)
{
    const uint cols = get_global_id(0)+1;
    const uint rows = get_global_id(1)+1;
    const uint width = get_global_size(0);
    uchar Opsoble[8];
    Opsoble[0] = pSrc[(cols-1)+((rows-1)*width)];
    Opsoble[1] = pSrc[(cols+1)+((rows-1)*width)];
    Opsoble[2] = pSrc[(cols-1)+((rows+0)*width)];
    Opsoble[3] = pSrc[(cols+1)+((rows+0)*width)];
    Opsoble[4] = pSrc[(cols-1)+((rows+1)*width)];
    Opsoble[5] = pSrc[(cols+1)+((rows+1)*width)];
    Opsoble[6] = pSrc[(cols+0)+((rows-1)*width)];
    Opsoble[7] = pSrc[(cols+0)+((rows+1)*width)];
    float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
    float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
    pDstImage[(cols-1)+(rows-1)*width] = gx*gx + gy*gy;
}
Here I am not able to get the output as expected.
I have a few questions:
1. My for loop starts from i=1 instead of zero. How can I get the proper index using get_global_id() in the x and y directions?
2. What is going wrong in my kernel code above?
I suspect there is a problem with the buffer stride, but I have not been able to figure it out after breaking my head over it for a whole day.
I have observed that with the logic above, the output skips one or two frames after a sequence of about 7-8 frames.
I have added a screenshot of my output compared with the reference output.
My logic above only does partial Sobel filtering on my input. I changed the width to:
const uint width = get_global_size(0)+1;
Your suggestions are most welcome!
It looks like you may be fetching values in (y,x) format in your opencl version. Also, you need to add 1 to the global id to replicate your for loops starting from 1 rather than 0.
I don't know why there is an unused iOffset variable. Maybe your bug is related to this? I removed it in my version.
Does this kernel work better for you?
__kernel void simple(__global uchar *pSrc, __global float *pDstImage)
{
    const uint i = get_global_id(0) + 1;
    const uint j = get_global_id(1) + 1;
    const uint width = get_global_size(0) + 2;
    uchar Opsoble[8];
    Opsoble[0] = pSrc[(i-1) + (j - 1)*width];
    Opsoble[1] = pSrc[(i-1) + (j + 1)*width];
    Opsoble[2] = pSrc[i + (j-1)*width];
    Opsoble[3] = pSrc[i + (j+1)*width];
    Opsoble[4] = pSrc[(i+1) + (j - 1)*width];
    Opsoble[5] = pSrc[(i+1) + (j + 1)*width];
    Opsoble[6] = pSrc[(i-1) + (j)*width];
    Opsoble[7] = pSrc[(i+1) + (j)*width];
    float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
    float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
    pDstImage[(i-1) + (j-1)*width] = gx*gx + gy*gy;
}
I am a bit apprehensive about posting an answer suggesting optimizations to your kernel, seeing as the original output has not been reproduced exactly as of yet. There is a major improvement available to be made for problems related to image processing/filtering.
Using local memory will help you out by reducing the number of global reads by a factor of eight, as well as grouping the global writes together for potential gains with the single write-per-pixel output.
The kernel below reads a block of up to 34x34 from pSrc, and outputs a 32x32(max) area of the pDstImage. I hope the comments in the code are enough to guide you in using the kernel. I have not been able to give this a complete test, so there could be changes required. Any comments are appreciated as well.
__kernel void sobel_uchar_wlocal (__global uchar *pSrc, __global float *pDstImage, const uint2 dimDstImage)
{
    //call this kernel with a 1-dimensional work group size: 32x1
    //calculates a 32x32 region of output with 32 work items
    const uint wid = get_local_id(0);
    const uint wid_1 = wid + 1; // corrected for the calculation step
    const uint2 gid = (uint2)(get_group_id(0), get_group_id(1));
    const uint localDim = get_local_size(0);
    const uint2 globalTopLeft = (uint2)(localDim * gid.x, localDim * gid.y); //position in pSrc to copy from/to
    const uint srcWidth = dimDstImage.x + 2; //assumes pSrc is two pixels wider and taller than pDstImage

    //dimLocalBuff is used for the right and bottom edges of the image, where the work group may run over the border
    uint2 dimLocalBuff = (uint2)(localDim, localDim);
    if (dimDstImage.x - globalTopLeft.x < dimLocalBuff.x) {
        dimLocalBuff.x = dimDstImage.x - globalTopLeft.x;
    }
    if (dimDstImage.y - globalTopLeft.y < dimLocalBuff.y) {
        dimLocalBuff.y = dimDstImage.y - globalTopLeft.y;
    }

    int i, j;

    //save region of data into local memory
    __local uchar srcBuff[34][34]; //34^2 uchar = 1156 bytes
    for (j = -1; j < (int)dimLocalBuff.y + 1; j++) {
        for (i = (int)wid - 1; i < (int)dimLocalBuff.x + 1; i += (int)localDim) {
            srcBuff[i + 1][j + 1] = pSrc[(globalTopLeft.x + i + 1) + (globalTopLeft.y + j + 1) * srcWidth];
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    //compute output and store locally
    __local float dstBuff[32][32]; //32^2 float = 4096 bytes
    if (wid < dimLocalBuff.x) {
        for (i = 0; i < (int)dimLocalBuff.y; i++) {
            float gx = srcBuff[wid_1-1][i] - srcBuff[wid_1+1][i] + 2*(srcBuff[wid_1-1][i+1] - srcBuff[wid_1+1][i+1]) + srcBuff[wid_1-1][i+2] - srcBuff[wid_1+1][i+2];
            float gy = srcBuff[wid_1-1][i] - srcBuff[wid_1-1][i+2] + 2*(srcBuff[wid_1][i] - srcBuff[wid_1][i+2]) + srcBuff[wid_1+1][i] - srcBuff[wid_1+1][i+2];
            dstBuff[wid][i] = gx*gx + gy*gy;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    //copy results to output
    for (j = 0; j < (int)dimLocalBuff.y; j++) {
        for (i = (int)wid; i < (int)dimLocalBuff.x; i += (int)localDim) {
            pDstImage[(globalTopLeft.x + i) + (globalTopLeft.y + j) * dimDstImage.x] = dstBuff[i][j];
        }
    }
}
I have a huge image (about 63000 x 63000 pixels = 3969 megapixels).
What I have done so far: I decided to make tiles of 1024 x 1024 pixels and do my calculations based on these tiles, resulting in a 62 x 62 tile grid.
(This works out very well and has the advantage of making the image viewable with zoom in and zoom out; only the visible tiles are downsized, for example.)
But what I need now are the contours of the huge image!
I use the OpenCV function findContours to detect contours on each one of the tiles.
I have added some overlap to the tiles so I get overlapping contours (1 pixel overlap).
I used the offset parameter of findContours to shift the contours to the right position in the "virtual total image".
Here are some screenshots I made from a demo application.
What I want is this:
Now my questions:
Is it possible to stitch the contours? My worst case is a contour which covers the total image. Is there some library that can do this?
Is there a library which works on a compressed version of the total image (RLE, for example)?
Is there a way to make OpenCV findContours work on 1-bit binary images?
Here's the code that calls findContours:
// Surf2DTiledData ...a gobject based class used for 2d tile management and viewing..
Surf2DTiledData* td = (Surf2DTiledData*)in_td;
int nr_hor_tiles = surf2_d_tiled_data_get_nr_hor_tiles(td);
int nr_ver_tiles = surf2_d_tiled_data_get_nr_ver_tiles(td);
int tile_size_x = surf2_d_tiled_data_get_tile_width(td);
int tile_size_y = surf2_d_tiled_data_get_tile_height(td);

contouring_data_obj = surf2_d_tiled_data_get_ContouringData(td);
p_contours = contouring_data_obj->p_contours;
p_border_contours = contouring_data_obj->p_border_contours;
g_return_if_fail(p_border_contours != NULL);
g_return_if_fail(p_contours != NULL);

for (y = 0; y < nr_ver_tiles; y++){
    int x;
    for (x = 0; x < nr_hor_tiles; x++){
        int idx = x + y*nr_hor_tiles;
        CvMemStorage *mem = contouring_data_obj->contour_storage[idx];
        CvMat _src;
        CvSeq *contours = NULL;
        uchar* dataBuffer = (uchar*)p_data[x][y];

        // the idea is to have some extra space available for the overlap
        // detection of contours!
        // the extra space is needed for the algorithm to check for
        // overlaps of contours later on!
#define VIRT_BORDER_EXTEND 2
        int virtual_x = x * tile_size_x - VIRT_BORDER_EXTEND;
        int virtual_y = y * tile_size_y - VIRT_BORDER_EXTEND;
        int virtual_width = tile_size_x + VIRT_BORDER_EXTEND * 2;
        int virtual_height = tile_size_y + VIRT_BORDER_EXTEND * 2;
        int x_off = -VIRT_BORDER_EXTEND;
        int y_off = -VIRT_BORDER_EXTEND;
        if (virtual_x < 0) {
            virtual_width += virtual_x;
            virtual_x = 0;
            x_off = 0;
        }
        if (virtual_y < 0) {
            virtual_height += virtual_y;
            virtual_y = 0;
            y_off = 0;
        }
        if ((virtual_x + virtual_width) > (nr_hor_tiles*tile_size_x)) {
            virtual_width = nr_hor_tiles*tile_size_x - virtual_x;
        }
        if ((virtual_y + virtual_height) > (nr_ver_tiles*tile_size_y)) {
            virtual_height = nr_ver_tiles*tile_size_y - virtual_y;
        }
        CvMat* _roi_mat = get_roi_mat(td,
                                      virtual_x, virtual_y,
                                      virtual_width, virtual_height);
        // Use either this:
        //mem = cvCreateMemStorage(0);
        if (_roi_mat){
            // CV_LINK_RUNS => different algorithm!!!!
            int tile_off_x = tile_size_x * x;
            int tile_off_y = tile_size_y * y;
            CvPoint contour_shift = cvPoint(x_off + tile_off_x, y_off + tile_off_y);
            int n = cvFindContours(_roi_mat, mem, &contours, sizeof(CvContour), CV_RETR_LIST, CV_CHAIN_APPROX_SIMPLE, contour_shift);
            cvReleaseMat(&_roi_mat);
            p_contours[x][y] = contours;
        }
        //cvReleaseMemStorage(&mem);
    }
}
Later I used OpenGL to make textures out of the tiles, and for every tile there is a quad.
The OpenCV contours are not drawn, as this could be too slow for now, but I draw their bounding boxes, which are drawn in OpenGL too.