Increase speed of OpenCL image processing - image-processing

I think the execution time of my kernel is too high. It's job is it to just blend two images together using either addition, subtraction, division or multiplication.
#define SETUP_KERNEL(name, operator)\
__kernel void name(__read_only image2d_t image1,\
__read_only image2d_t image2,\
__write_only image2d_t output,\
const sampler_t sampler1,\
const sampler_t sampler2,\
float opacity)\
{\
int2 xy = (int2) (get_global_id(0), get_global_id(1));\
float2 normalizedCoords = convert_float2(xy) / (float2) (get_image_width(output), get_image_height(output));\
float4 pixel1 = read_imagef(image1, sampler1, normalizedCoords);\
float4 pixel2 = read_imagef(image2, sampler2, normalizedCoords);\
write_imagef(output, xy, (pixel1 * opacity) operator pixel2);\
}
SETUP_KERNEL(div, /)
SETUP_KERNEL(add, +)
SETUP_KERNEL(mult, *)
SETUP_KERNEL(sub, -)
As you can see I use macros to quickly define the different kernels. (Should I better use functions for that?)
Somehow the kernel managed to take 3ms on a GTX 970.
What can I do to increase the performance of this particular kernel?
Should I split it into different programs?

Bilinear interpolation is 2x-3x slower than nearest neighbor way. Are you sure you are not using nearest neighbor in opengl?
What it does in background(by the sampler) is something like:
R1 = ((x2 – x)/(x2 – x1))*Q11 + ((x – x1)/(x2 – x1))*Q21
R2 = ((x2 – x)/(x2 – x1))*Q12 + ((x – x1)/(x2 – x1))*Q22
After the two R values are calculated, the value of P can finally be calculated by a weighted average of R1 and R2.
P = ((y2 – y)/(y2 – y1))*R1 + ((y – y1)/(y2 – y1))*R2
The calculation will have to be repeated for the red, green, blue, and optionally the alpha component of.
http://supercomputingblog.com/graphics/coding-bilinear-interpolation/
Or it is simply Nvidia implemented fastpath for opengl and complete path for opencl image access. For example, for amd, image writes are complete path, smaller than 32bit data accesses are complete path, image reads are fastpath.
Another option: Z-order is better suited to compute-divergence of those image data and opencl's non Z-order(suspicious, maybe not) is worse.

Division is often expenisve so
I suggest moving calculation of normalizedCoords to host side.
On host side:
float normalized_x[output_width]; // initialize with [0..output_width-1]/output_width
float normalized_y[output_height]; // initialize with [0..output_height-1]/output_height
Change kernel to:
#define SETUP_KERNEL(name, operator)\
__kernel void name(__read_only image2d_t image1,\
__read_only image2d_t image2,\
__write_only image2d_t output,\
global float *normalized_x, \
global float *normalized_y, \
const sampler_t sampler1,\
const sampler_t sampler2,\
float opacity)\
{\
int2 xy = (int2) (get_global_id(0), get_global_id(1));\
float2 normalizedCoords = (float2) (normalized_x[xy.x],normalized_y[xy.y] );\
float4 pixel1 = read_imagef(image1, sampler1, normalizedCoords);\
float4 pixel2 = read_imagef(image2, sampler2, normalizedCoords);\
write_imagef(output, xy, (pixel1 * opacity) operator pixel2);\
}
Also you can try not using normalized cooardinates by using the same technique.
This would be more beneficial if the sizes of the input images dont change often.

Related

How can I find the brightest point in a CIImage (in Metal maybe)?

I created a custom CIKernel in Metal. This is useful because it is close to real-time. I am avoiding any cgcontext or cicontext that might lag in real time. My kernel essentially does a Hough transform, but I can't seem to figure out how to read the white points from the image buffer.
Here is kernel.metal:
#include <CoreImage/CoreImage.h>
extern "C" {
namespace coreimage {
float4 hough(sampler src) {
// Math
// More Math
// eventually:
if (luminance > 0.8) {
uint2 position = src.coord()
// Somehow add this to an array because I need to know the x,y pair
}
return float4(luminance, luminance, luminance, 1.0);
}
}
}
I am fine if this part can be extracted to a different kernel or function. The caveat to CIKernel, is its return type is a float4 representing the new color of a pixel. Ideally, instead of a image -> image filter, I would like an image -> array sort of deal. E.g. reduce instead of map. I have a bad hunch this will require me to render it and deal with it on the CPU.
Ultimately I want to retrieve the qualifying coordinates (which there can be multiple per image) back in my swift function.
FINAL SOLUTION EDIT:
As per suggestions of the answer, I am doing large per-pixel calculations on the GPU, and some math on the CPU. I designed 2 additional kernels that work like the builtin reduction kernels. One kernel returns a 1 pixel high image of the highest values in each column, and the other kernel returns a 1 pixel high image of the normalized y-coordinate of the highest value:
/// Returns the maximum value in each column.
///
/// - Parameter src: a sampler for the input texture
/// - Returns: maximum value in for column
float4 maxValueForColumn(sampler src) {
const float2 size = float2(src.extent().z, src.extent().w);
/// Destination pixel coordinate, normalized
const float2 pos = src.coord();
float maxV = 0;
for (float y = 0; y < size.y; y++) {
float v = src.sample(float2(pos.x, y / size.y)).x;
if (v > maxV) {
maxV = v;
}
}
return float4(maxV, maxV, maxV, 1.0);
}
/// Returns the normalized coordinate of the maximum value in each column.
///
/// - Parameter src: a sampler for the input texture
/// - Returns: normalized y-coordinate of the maximum value in for column
float4 maxCoordForColumn(sampler src) {
const float2 size = float2(src.extent().z, src.extent().w);
/// Destination pixel coordinate, normalized
const float2 pos = src.coord();
float maxV = 0;
float maxY = 0;
for (float y = 0; y < size.y; y++) {
float v = src.sample(float2(pos.x, y / size.y)).x;
if (v > maxV) {
maxY = y / size.y;
maxV = v;
}
}
return float4(maxY, maxY, maxY, 1.0);
}
This won't give every pixel where luminance is greater than 0.8, but for my purposes, it returns enough: the highest value in each column, and its location.
Pro: copying only (2 * image width) bytes over to the CPU instead of every pixel saves TONS of time (a few ms).
Con: If you have two major white points in the same column, you will never know. You might have to alter this and do calculations by row instead of column if that fits your use-case.
FOLLOW UP:
There seems to be a problem in rendering the outputs. The Float values returned in metal are not correlated to the UInt8 values I am getting in swift.
This unanswered question describes the problem.
Edit: This answered question provides a very convenient metal function. When you call it on a metal value (e.g. 0.5) and return it, you will get the correct value (e.g. 128) on the CPU.
Check out the filters in the CICategoryReduction (like CIAreaAverage). They return images that are just a few pixels tall, containing the reduction result. But you still have to render them to be able to read the values in your Swift function.
The problem for using this approach for your problem is that you don't know the number of coordinates you are returning beforehand. Core Image needs to know the extend of the output when it calls your kernel, though. You could just assume a static maximum number of coordinates, but that all sounds tedious.
I think you are better off using Accelerate APIs for iterating the pixels of your image (parallelized, super efficiently) on the CPU to find the corresponding coordinates.
You could do a hybrid approach where you do the per-pixel heavy math on the GPU with Core Image and then do the analysis on the CPU using Accelerate. You can even integrate the CPU part into your Core Image pipeline using a CIImageProcessorKernel.

Convert YUV4:4:4 to YUV4:2:2 images

There is a lot of information on the internet about the differences between YUV4:4:4 to YUV4:2:2 formats, however, I can not find anything that tells how to convert the YUV4:4:4 to YUV4:2:2. Since such conversion is performed using software, I was hoping that there should be some developers that have done it and could direct me to the sources that describe the conversion algorithm. Of course, the software code would be nice to have, but having the access to the theory would be sufficient enough to write my own software. Specifically, I would like to know pixel structure and how the bytes are managed during conversion.
I found several similar questions like this and this, however, could not get my question answered. Also, I posted this question on the Photography forum, and they considered it as a software question.
The reason why you can't find specific description, is that there are many ways to do it.
Lets start from Wikipedia: https://en.wikipedia.org/wiki/Chroma_subsampling#4:2:2
4:4:4:
Each of the three Y'CbCr components have the same sample rate, thus there is no chroma subsampling. This scheme is sometimes used in high-end film scanners and cinematic post production.
and
4:2:2:
The two chroma components are sampled at half the sample rate of luma: the horizontal chroma resolution is halved. This reduces the bandwidth of an uncompressed video signal by one-third with little to no visual difference.
Note: Terms YCbCr and YUV are used interchangeably.
https://en.wikipedia.org/wiki/YCbCr
Y′CbCr is often confused with the YUV color space, and typically the terms YCbCr and YUV are used interchangeably, leading to some confusion; when referring to signals in video or digital form, the term "YUV" mostly means "Y′CbCr".
Data memory ordering:
Again there is more than one format.
Intel IPP documentation defines two main categories: "Pixel-Order Image Formats" and "Planar Image Formats".
There is a nice documentation here: https://software.intel.com/en-us/node/503876
Refer here: http://www.fourcc.org/yuv.php#NV12 for YUV pixel arrangement formats.
Refer here: http://scc.ustc.edu.cn/zlsc/sugon/intel/ipp/ipp_manual/IPPI/ippi_ch6/ch6_image_downsampling.htm#ch6_image_downsampling for downsampling description.
Let's assume "Pixel-Order" format:
YUV 4:4:4 data order: Y0 U0 V0 Y1 U1 V1 Y2 U2 V2 Y3 U3 V3
YUV 4:2:2 data order: Y0 U0 Y1 V0 Y2 U1 Y3 V1
Each element is a single byte, and Y0 is the lower byte in memory.
The 4:2:2 data order described above is named UYVY or YUY2 pixel format.
Conversion algorithms:
"Naive sub-sampling":
"Throw" every second U/V component:
Take U0, and throw U1, take V0 and throw V1...
Source: Y0 U0 V0 Y1 U1 V1 Y2 U2 V2
Destination: Y0 U0 Y1 V0 Y2 U2 Y3 V2
I can't recommend it, since it causes aliasing artifacts.
Average each U/V pair:
Take Destination U0 equals source (U0+U1)/2, same for V0...
Source: Y0 U0 V0 Y1 U1 V1 Y2 U2 V2
Destination: Y0 (U0+U1)/2 Y1 (V0+V1)/2 Y2 (U2+U3)/2 Y3 (V2+V3)/2
Use other interpolation method for down-sampling U and V (cubic interpolation for example).
Usually you will not be able to see any differences compared to simple average.
C implementation:
The question is not tagged as C, but I think the following C implementation may be helpful.
The following code converts pixel-ordered YUV 4:4:4 to pixel-ordered YUV 4:2:2 by averaging each U/V pair:
//Convert single row I0 from pixel-ordered YUV 4:4:4 to pixel-ordered YUV 4:2:2.
//Save the result in J0.
//I0 size in bytes is image_width*3
//J0 size in bytes is image_width*2
static void ConvertRowYUV444ToYUV422(const unsigned char I0[],
const int image_width,
unsigned char J0[])
{
int x;
//Process two Y,U,V triples per iteration:
for (x = 0; x < image_width; x += 2)
{
//Load source elements
unsigned char y0 = I0[x*3]; //Load source Y element
unsigned int u0 = (unsigned int)I0[x*3+1]; //Load source U element (and convert from uint8 to uint32).
unsigned int v0 = (unsigned int)I0[x*3+2]; //Load source V element (and convert from uint8 to uint32).
//Load next source elements
unsigned char y1 = I0[x*3+3]; //Load source Y element
unsigned int u1 = (unsigned int)I0[x*3+4]; //Load source U element (and convert from uint8 to uint32).
unsigned int v1 = (unsigned int)I0[x*3+5]; //Load source V element (and convert from uint8 to uint32).
//Calculate destination U, and V elements.
//Use shift right by 1 for dividing by 2.
//Use plus 1 before shifting - round operation instead of floor operation.
unsigned int u01 = (u0 + u1 + 1) >> 1; //Destination U element equals average of two source U elements.
unsigned int v01 = (v0 + v1 + 1) >> 1; //Destination U element equals average of two source U elements.
J0[x*2] = y0; //Store Y element (unmodified).
J0[x*2+1] = (unsigned char)u01; //Store destination U element (and cast uint32 to uint8).
J0[x*2+2] = y1; //Store Y element (unmodified).
J0[x*2+3] = (unsigned char)v01; //Store destination V element (and cast uint32 to uint8).
}
}
//Convert image I from pixel-ordered YUV 4:4:4 to pixel-ordered YUV 4:2:2.
//I - Input image in pixel-order data YUV 4:4:4 format.
//image_width - Number of columns of image I.
//image_height - Number of rows of image I.
//J - Destination "image" in pixel-order data YUV 4:2:2 format.
//Note: The term "YUV" referees to "Y'CbCr".
//I is pixel ordered YUV 4:4:4 format (size in bytes is image_width*image_height*3):
//YUVYUVYUVYUV
//YUVYUVYUVYUV
//YUVYUVYUVYUV
//YUVYUVYUVYUV
//
//J is pixel ordered YUV 4:2:2 format (size in bytes is image_width*image_height*2):
//YUYVYUYV
//YUYVYUYV
//YUYVYUYV
//YUYVYUYV
//
//Conversion algorithm:
//Each element of destination U is average of 2 original U horizontal elements
//Each element of destination V is average of 2 original V horizontal elements
//
//Limitations:
//1. image_width must be a multiple of 2.
//2. I and J must be two separate arrays (in place computation is not supported).
static void ConvertYUV444ToYUV422(const unsigned char I[],
const int image_width,
const int image_height,
unsigned char J[])
{
//I0 points source row.
const unsigned char *I0; //I0 -> YUYVYUYV...
//J0 and points destination row.
unsigned char *J0; //J0 -> YUYVYUYV
int y; //Row index
//In each iteration process single row.
for (y = 0; y < image_height; y++)
{
I0 = &I[y*image_width*3]; //Input row width is image_width*3 bytes (each pixel is Y,U,V).
J0 = &J[y*image_width*2]; //Output row width is image_width*2 bytes (each two pixels are Y,U,Y,V).
//Process single source row into single destination row
ConvertRowYUV444ToYUV422(I0, image_width, J0);
}
}
Planar representation of YUV 4:2:2
Planar representation may be more intuitive than "Pixel-Order" format.
In planar representation each color channel is represented as a separate matrix, which can be displayed as an image.
Example:
Original image in RGB format (before converting to YUV):
Image channels in YUV 4:4:4 format:
(Left YUV triple is represented in gray levels, and right YUV triple is represented using false colors).
Image channels in YUV 4:2:2 format (after horizontal Chroma subsampling):
(Left YUV triple is represented in gray levels, and right YUV triple is represented using "false colors").
As you can see, in 4:2:2 format, the U an V channels are down-sampled (shrunk) in the horizontal axis.
Remark:
The "false colors" representation of U and V channels is used for emphasizing that Y is the Luma channel and U and V are the Chrominance channels.
Higher order interpolation and Anti-Aliasing filter:
Following MATLAB code sample shows how to perform down-sampling with higher order interpolation and Anti-Aliasing filter.
The sample also shows the down-sampling method used by FFMPEG.
Note: you don't need to know MATLAB programming in order to understand the samples.
You do need some knowledge of image filtering by convolution between a Kernel and an image.
%Prepare the input:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
load('mandrill.mat', 'X', 'map'); %Load input image
RGB = im2uint8(ind2rgb(X, map)); %Convert to RGB (the mandrill sample image is an indexed image)
YUV = rgb2ycbcr(RGB); %Convert from RGB to YUV (MATLAB function rgb2ycbcr uses BT.601 conversion formula)
%Separate YUV to 3 planes (Y plane, U plane and V plane)
Y = YUV(:, :, 1);
U = YUV(:, :, 2);
V = YUV(:, :, 3);
U = double(U); %Work in double precision instead of uint8.
[M, N] = size(Y); %Image size is N columns by M rows.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Linear interpolation without Anti-Aliasing filter:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Horizontal down-sampling U plane using Linear interpolation (without Anti-Aliasing filter).
%Simple averaging is equivalent to linear interpolation.
U2 = (U(:, 1:2:end) + U(:, 2:2:end))/2;
refU2 = imresize(U, [M, N/2], 'bilinear', 'Antialiasing', false); %Use MATLAB imresize function as reference
disp(['Linear interpolation max diff = ' num2str(max(abs(double(U2(:)) - double(refU2(:)))))]); %Print maximum difference.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Cubic interpolation without Anti-Aliasing filter:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Horizontal down-sampling U plane using Cubic interpolation (without Anti-Aliasing filter).
%Following operations are equivalent to cubic interpolation:
%1. Convolution with filter kernel [-0.125, 1.25, -0.125]
%2. Averaging pair elements
fU = imfilter(U, [-0.125, 1.25, -0.125], 'symmetric');
U2 = (fU(:, 1:2:end) + fU(:, 2:2:end))/2;
U2 = max(min(U2, 240), 16); %Limit to valid range of U elements (valid range of U elements in uint8 format is [16, 240])
refU2 = imresize(U, [M, N/2], 'cubic', 'Antialiasing', false); %Use MATLAB imresize function as reference
refU2 = max(min(refU2, 240), 16); %Limit to valid range of U elements
disp(['Cubic interpolation max diff = ' num2str(max(abs(double(U2(:)) - double(refU2(:)))))]); %Print maximum difference.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Linear interpolation with Anti-Aliasing filter:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Horizontal down-sampling U plane using Linear interpolation with Anti-Aliasing filter.
%Remark: The Anti-Aliasing filter is the filter used by MATLAB specific implementation of 'bilinear' imresize.
%Following operations are equivalent to Linear interpolation with Anti-Aliasing filter:
%1. Convolution with filter kernel [0.25, 0.5, 0.25]
%2. Averaging pair elements
fU = imfilter(U, [0.25, 0.5, 0.25], 'symmetric');
U2 = (fU(:, 1:2:end) + fU(:, 2:2:end))/2;
refU2 = imresize(U, [M, N/2], 'bilinear', 'Antialiasing', true); %Use MATLAB imresize function as reference
disp(['Linear interpolation with Anti-Aliasing max diff = ' num2str(max(abs(double(U2(:)) - double(refU2(:)))))]); %Print maximum difference.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Cubic interpolation with Anti-Aliasing filter:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Horizontal down-sampling U plane using Cubic interpolation with Anti-Aliasing filter.
%Remark: The Anti-Aliasing filter is the filter used by MATLAB specific implementation of 'cubic' imresize.
%Following operations are equivalent to Linear interpolation with Anti-Aliasing filter:
%1. Convolution with filter kernel [-0.0234375, -0.046875, 0.2734375, 0.59375, 0.2734375, -0.046875, -0.0234375]
%2. Averaging pair elements
h = [-0.0234375, -0.046875, 0.2734375, 0.59375, 0.2734375, -0.046875, -0.0234375];
fU = imfilter(U, h, 'symmetric');
U2 = (fU(:, 1:2:end) + fU(:, 2:2:end))/2;
U2 = max(min(U2, 240), 16); %Limit to valid range of U elements
refU2 = imresize(U, [M, N/2], 'cubic', 'Antialiasing', true); %Use MATLAB imresize function as reference
refU2 = max(min(refU2, 240), 16); %Limit to valid range of U elements
disp(['Cubic interpolation with Anti-Aliasing max diff = ' num2str(max(abs(double(U2(:)) - double(refU2(:)))))]); %Print maximum difference.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%FFMPEG implementation of horizontal down-sampling U plane.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%FFMPEG uses cubic interpolation with Anti-Aliasing filter (different filter kernel):
%Remark: I didn't check the source code of FFMPEG to verify the values of the filter kernel.
%I can't tell how FFMPEG actually implements the conversion.
%Following operations are equivalent to FFMPEG implementation (with minor differences):
%1. Convolution with filter kernel [-115, -231, 1217, 2354, 1217, -231, -115]/4096
%2. Averaging pair elements
h = [-115, -231, 1217, 2354, 1217, -231, -115]/4096;
fU = imfilter(U, h, 'symmetric');
U2 = (fU(:, 1:2:end) + fU(:, 2:2:end))/2;
U2 = max(min(U2, 240), 16); %Limit to valid range of U elements (FFMPEG actually doesn't limit the result)
%Save Y,U,V planes to file in format supported by FFMPEG
f = fopen('yuv444.yuv', 'w');
fwrite(f, Y', 'uint8');
fwrite(f, U', 'uint8');
fwrite(f, V', 'uint8');
fclose(f);
%For executing FFMPEG within MATLAB, download FFMPEG and place the executable in working directory (ffmpeg.exe for Windows)
%FFMPEG converts source file in YUV444 format to destination file in YUV422 format.
if isunix
[status, cmdout] = system(['./ffmpeg -y -s ', num2str(N), 'x', num2str(M), ' -pix_fmt yuv444p -i yuv444.yuv -pix_fmt yuv422p yuv422.yuv']);
else
[status, cmdout] = system(['ffmpeg.exe -y -s ', num2str(N), 'x', num2str(M), ' -pix_fmt yuv444p -i yuv444.yuv -pix_fmt yuv422p yuv422.yuv']);
end
f = fopen('yuv422.yuv', 'r');
refY = (fread(f, [N, M], '*uint8'))';
refU2 = (fread(f, [N/2, M], '*uint8'))'; %Read down-sampled U plane (FFMPEG result from file).
refV2 = (fread(f, [N/2, M], '*uint8'))';
fclose(f);
%Limit to valid range of U elements.
%In FFMPEG down-sampled U and V may exceed valid range (there is probably a way to tell FFMPEG to limit the result).
refU2 = max(min(refU2, 240), 16);
%Difference exclude first column and last column (FFMPEG treats the margins different than MATLAB)
%Remark: There are minor differences due to rounding (I guess).
disp(['FFMPEG Cubic interpolation with Anti-Aliasing max diff = ' num2str(max(max(abs(double(U2(:, 2:end-1)) - double(refU2(:, 2:end-1))))))]);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Examples for different kind of down-sampling methods.
Linear interpolation versus Cubic interpolation with Anti-Aliasing filter:
In the first example (mandrill) there are no visible differences.
In the second example (circle and rectangle) there are minor visible differences.
The third example (lines) demonstrates aliasing artifacts.
Remark: displayed images where up-sampled from YUV422 to YUV444 using Cubic interpolation and converted from YUV444 to RGB.
Linear interpolation versus Cubic with Anti-Aliasing (mandrill):
Linear interpolation versus Cubic with Anti-Aliasing (circle and rectangle):
Linear interpolation versus Cubic with Anti-Aliasing (demonstrates Aliasing artifacts):

Adding a scalar to a Mat object

So I'm trying to add a scalar value to all elements of a Mat object in openCV, however for raw_t_ubit8 and raw_t_ubit16 types I get wrong results. Here's the code.
Mat A;
//Initialize Mat A;
A = A + 0.1;
The Matrix is initially
The result of the addition is exactly the same matrix. This problem does not occur when I try to add scalars to raw_t_real types of matrices. By raw_t_ubit8 I mean the depth is CV_8UC1
If, as you mentioned in the comments, the contained values are scaled in the matrix to fit the integer domain 0..255, then you should also scale the scalar value you sum. Namely:
A = A + cv::Scalar(round(0.1 * 255) );
Or even better:
A += cv::Scalar(round(0.1 * 255) );
Note that cv::Scalar, as pointed out in the comments by Miki, is in any case made from a double (it's a cv::Scalar_<double>).
The rounding could be omitted, but then you leave the choice on how to convert your double into integer to the function implementation.
You should also check what happens when the values saturate.
Documentation for Opencv matrix expressions.
As stated in the comments and in #Antonio's answer, you can't add 0.1 to an integer.
If you are using CV_8UC1 matrices, but you want to work with floating points values, you should multiply by 255.
Mat1b A; // <-- type CV_8UC1
...
A += 0.1 * 255;
If the result of the operation need to be casted, as in this case, then ultimately saturated_cast is called.
This is equivalent to #Antonio's answer, but it results in cleaner code (at least for me).
The same code will be used, either if you sum a double or a Scalar. A Scalar object will be created in both ways using:
template<typename _Tp> inline
Scalar_<_Tp>::Scalar_(_Tp v0)
{
this->val[0] = v0;
this->val[1] = this->val[2] = this->val[3] = 0;
}
However if you need to sum exactly 0.1 to your matrix (and not to scale it by 255), you need to convert your matrix to CV_32FC1:
#include <opencv2/opencv.hpp>
using namespace cv;
int main(int, char** argv)
{
Mat1b A = (Mat1b(3,3) << 1,2,3,4,5,6,7,8,9);
Mat1f F;
A.convertTo(F, CV_32FC1);
F += 0.1;
return 0;
}

Drawing marching ants using directx

I have to draw a selection feedback like Photoshop in my directx application. I came across an algorithm on wikipedia to do this. But, I am not sure if its the right way to do this especially if my selection area could be any arbitrary geometry. Has someone implemented it using Directx? Any hints are much appreciated.
Based on my comment here is a simple pixel shader to achieve the wanted result:
float4 PS( float4 pos : SV_POSITION) : SV_Target
{
float w = ((int)(pos.x + pos.y + t) % 8);
return (w < 4 ? float4(0,0,0,1) : float4(1,1,1,1));
}
x and y are added to produce the diagonal stripe pattern. You can imagine it as follows: If y is constant and x increases by 1, w also increases by 1. The same applies for y. So for w to stay constant, you have to go (x+1, y-1) or (x-1, y+1) (or other step sizes). We use the % operator to produce a periodicity of 8 pixels. The first half period is filled black and the second half white.
This is an equivalent, but more performant shader. It uses bit operations instead of modulo and comparisons.
float4 PS( float4 pos : SV_POSITION) : SV_Target
{
int w = ((int)(pos.x + pos.y + t) & 4);
return float4(w,w,w,1);
}

Laplacian of gaussian filter use

This is a formula for LoG filtering:
(source: ed.ac.uk)
Also in applications with LoG filtering I see that function is called with only one parameter:
sigma(σ).
I want to try LoG filtering using that formula (previous attempt was by gaussian filter and then laplacian filter with some filter-window size )
But looking at that formula I can't understand how the size of filter is connected with this formula, does it mean that the filter size is fixed?
Can you explain how to use it?
As you've probably figured out by now from the other answers and links, LoG filter detects edges and lines in the image. What is still missing is an explanation of what σ is.
σ is the scale of the filter. Is a one-pixel-wide line a line or noise? Is a line 6 pixels wide a line or an object with two distinct parallel edges? Is a gradient that changes from black to white across 6 or 8 pixels an edge or just a gradient? It's something you have to decide, and the value of σ reflects your decision — the larger σ is the wider are the lines, the smoother the edges, and more noise is ignored.
Do not get confused between the scale of the filter (σ) and the size of the discrete approximation (usually called stencil). In Paul's link σ=1.4 and the stencil size is 9. While it is usually reasonable to use stencil size of 4σ to 6σ, these two quantities are quite independent. A larger stencil provides better approximation of the filter, but in most cases you don't need a very good approximation.
This was something that confused me too, and it wasn't until I had to do the same as you for a uni project that I understood what you were supposed to do with the formula!
You can use this formula to generate a discrete LoG filter. If you write a bit of code to implement that formula, you can then to generate a filter for use in image convolution. To generate, say a 5x5 template, simply call the code with x and y ranging from -2 to +2.
This will generate the values to use in a LoG template. If you graph the values this produces you should see the "mexican hat" shape typical of this filter, like so:
(source: ed.ac.uk)
You can fine tune the template by changing how wide it is (the size) and the sigma value (how broad the peak is). The wider and broader the template the less affected by noise the result will be because it will operate over a wider area.
Once you have the filter, you can apply it to the image by convolving the template with the image. If you've not done this before, check out these few tutorials.
java applet tutorials more mathsy.
Essentially, at each pixel location, you "place" your convolution template, centred at that pixel. You then multiply the surrounding pixel values by the corresponding "pixel" in the template and add up the result. This is then the new pixel value at that location (typically you also have to normalise (scale) the output to bring it back into the correct value range).
The code below gives a rough idea of how you might implement this. Please forgive any mistakes / typos etc. as it hasn't been tested.
I hope this helps.
private float LoG(float x, float y, float sigma)
{
// implement formula here
return (1 / (Math.PI * sigma*sigma*sigma*sigma)) * //etc etc - also, can't remember the code for "to the power of" off hand
}
private void GenerateTemplate(int templateSize, float sigma)
{
// Make sure it's an odd number for convenience
if(templateSize % 2 == 1)
{
// Create the data array
float[][] template = new float[templateSize][templatesize];
// Work out the "min and max" values. Log is centered around 0, 0
// so, for a size 5 template (say) we want to get the values from
// -2 to +2, ie: -2, -1, 0, +1, +2 and feed those into the formula.
int min = Math.Ceil(-templateSize / 2) - 1;
int max = Math.Floor(templateSize / 2) + 1;
// We also need a count to index into the data array...
int xCount = 0;
int yCount = 0;
for(int x = min; x <= max; ++x)
{
for(int y = min; y <= max; ++y)
{
// Get the LoG value for this (x,y) pair
template[xCount][yCount] = LoG(x, y, sigma);
++yCount;
}
++xCount;
}
}
}
Just for visualization purposes, here is a simple Matlab 3D colored plot of the Laplacian of Gaussian (Mexican Hat) wavelet. You can change the sigma(σ) parameter and see its effect on the shape of the graph:
sigmaSq = 0.5 % Square of σ parameter
[x y] = meshgrid(linspace(-3,3), linspace(-3,3));
z = (-1/(pi*(sigmaSq^2))) .* (1-((x.^2+y.^2)/(2*sigmaSq))) .*exp(-(x.^2+y.^2)/(2*sigmaSq));
surf(x,y,z)
You could also compare the effects of the sigma parameter on the Mexican Hat doing the following:
t = -5:0.01:5;
sigma = 0.5;
mexhat05 = exp(-t.*t/(2*sigma*sigma)) * 2 .*(t.*t/(sigma*sigma) - 1) / (pi^(1/4)*sqrt(3*sigma));
sigma = 1;
mexhat1 = exp(-t.*t/(2*sigma*sigma)) * 2 .*(t.*t/(sigma*sigma) - 1) / (pi^(1/4)*sqrt(3*sigma));
sigma = 2;
mexhat2 = exp(-t.*t/(2*sigma*sigma)) * 2 .*(t.*t/(sigma*sigma) - 1) / (pi^(1/4)*sqrt(3*sigma));
plot(t, mexhat05, 'r', ...
t, mexhat1, 'b', ...
t, mexhat2, 'g');
Or simply use the Wavelet toolbox provided by Matlab as follows:
lb = -5; ub = 5; n = 1000;
[psi,x] = mexihat(lb,ub,n);
plot(x,psi), title('Mexican hat wavelet')
I found this useful when implementing this for edge detection in computer vision. Although not the exact answer, hope this helps.
It appears to be a continuous circular filter whose radius is sqrt(2) * sigma. If you want to implement this for image processing you'll need to approximate it.
There's an example for sigma = 1.4 here: http://homepages.inf.ed.ac.uk/rbf/HIPR2/log.htm

Resources