I'm using OpenCV with cv::Mat objects, and I need to know the number of bytes that my matrix occupies in order to pass it to a low-level C API. It seems that OpenCV's API doesn't have a method that returns the number of bytes a matrix uses, and I only have a raw uchar *data public member with no member that contains its actual size.
How can one find a cv::Mat size in bytes?
The common answer is to calculate the total number of elements in the matrix and multiply it by the size of each element, like this:
// Given cv::Mat named mat.
size_t sizeInBytes = mat.total() * mat.elemSize();
This will work in conventional scenarios, where the matrix was allocated as a contiguous chunk in memory.
But consider the case where the system imposes an alignment constraint on the number of bytes per row of the matrix. In that case, if mat.cols * mat.elemSize() is not properly aligned, each row is padded, mat.isContinuous() is false, and the previous size calculation is wrong: mat.total() still reports the same number of elements, although the underlying buffer is larger!
The correct answer, then, is to find the size of each matrix row in bytes, and multiply it with the number of rows:
size_t sizeInBytes = mat.step[0] * mat.rows;
Read more about the step member in the cv::Mat documentation.
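A ROI view into a larger matrix is another common way to end up with a non-continuous Mat; here's a minimal sketch (the printed values assume the sizes chosen below) comparing the two formulas:
#include <opencv2/core.hpp>
#include <iostream>

int main() {
    cv::Mat big(100, 100, CV_8UC3);               // parent matrix, continuous
    cv::Mat roi = big(cv::Rect(0, 0, 50, 50));    // view: rows keep the parent's stride

    std::cout << std::boolalpha << roi.isContinuous() << "\n";  // false
    std::cout << roi.total() * roi.elemSize() << "\n";          // 7500: payload bytes only
    std::cout << roi.step[0] * roi.rows << "\n";                // 15000: stride-aware size, as computed above
    return 0;
}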
In OpenCV, the type of the elements in a cv::Mat object could be, for instance, CV_32FC1 or CV_32FC3, which represent 32-bit floating point with one channel and 32-bit floating point with three channels, respectively.
As I understand it, the CV_32FC3 type can be used to represent color images which have blue, green and red channels plus an alpha channel used to represent transparency, with each channel getting 8 bits.
I'm wondering how the bits are allocated in the CV_32FC1 type, when there's only one channel?
32F means a 32-bit float. The number after the C is the number of channels, so CV_32FC3 means 3 floats per pixel, while CV_32FC1 is one float only. What those floats mean is up to you and is not explicitly stored in the Mat.
The float is stored in memory just as it would be in a regular C program (IEEE 754, in the machine's native byte order, typically little endian).
A classical BGR image (the default channel ordering in OpenCV) would be CV_8UC3: an 8-bit unsigned integer per channel, in three channels.
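As a quick sanity check, here's a minimal sketch that prints the per-pixel size and channel count for a few of these types:
#include <opencv2/core.hpp>
#include <iostream>

int main() {
    cv::Mat gray(4, 4, CV_32FC1);   // one 32-bit float per pixel
    cv::Mat color(4, 4, CV_32FC3);  // three 32-bit floats per pixel
    cv::Mat bgr(4, 4, CV_8UC3);     // classic 8-bit BGR image

    std::cout << gray.elemSize()  << " " << gray.channels()  << "\n";  // 4 bytes, 1 channel
    std::cout << color.elemSize() << " " << color.channels() << "\n";  // 12 bytes, 3 channels
    std::cout << bgr.elemSize()   << " " << bgr.channels()   << "\n";  // 3 bytes, 3 channels
    return 0;
}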
I am going through the Metal iOS Swift example trying to understand the triple buffering practice they suggest. This is shown inside of the demo for the uniform animations.
As I understand it, aligned memory simply starts at an address that is a multiple of some byte count the device prefers. What confuses me is this line of code:
// The 256 byte aligned size of our uniform structure
let alignedUniformsSize = (MemoryLayout<Uniforms>.size & ~0xFF) + 0x100
They use it to find the aligned size, in bytes, of the Uniforms struct. I'm confused about why there are bitwise operations here; I'm really not sure what they do.
If it helps, this aligned size is used to create a buffer like this. I'm fairly sure the buffer allocates suitably aligned memory automatically and is thereafter used as the storage location for the uniforms.
let buffer = self.device.makeBuffer(length:alignedUniformsSize * 3, options:[MTLResourceOptions.storageModeShared])
So essentially, rather than going through the trouble of allocating byte-aligned memory themselves, they let Metal do it for them.
Is there any reason that the strategy they used for alignedUniformsSize would not work for other types such as Int or Float?
Let's talk first about why you'd want aligned buffers, then we can talk about the bitwise arithmetic.
Our goal is to allocate a Metal buffer that can store three (triple-buffered) copies of our uniforms (so that we can write to one part of the buffer while the GPU reads from another). In order to read from each of these three copies, we supply an offset when binding the buffer, something like currentBufferIndex * uniformsSize. Certain Metal devices require these offsets to be multiples of 256, so we instead need to use something like currentBufferIndex * alignedUniformsSize as our offset.
How do we "round up" an integer to the next highest multiple of 256? We can do it by dropping the lowest 8 bits of the "unaligned" size, effectively rounding down, then adding 256, which gets us the next highest multiple. The rounding down part is achieved by bitwise ANDing with the complement (~) of 0xFF (255), which zeroes out the low 8 bits. The rounding up is done by just adding 0x100, which is 256.
Interestingly, if the base size is already aligned, this technique spuriously rounds up anyway (e.g., from 256 to 512). For the cost of an integer divide, you can avoid this waste:
let alignedUniformsSize = ((MemoryLayout<Uniforms>.size + 255) / 256) * 256
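To make the difference concrete, here's a small sketch of the two rounding formulas side by side (written in C++ purely to illustrate the arithmetic, which is the same in Swift):
#include <cstdio>

// Round up the way the sample does: clear the low 8 bits (round down),
// then add a full 256-byte step.
int alignAddFullStep(int size) { return (size & ~0xFF) + 0x100; }

// Round up only when the size is not already a multiple of 256.
int alignUp(int size) { return ((size + 255) / 256) * 256; }

int main() {
    for (int size : {1, 255, 256, 257, 300}) {
        std::printf("size %3d -> %4d vs %4d\n", size, alignAddFullStep(size), alignUp(size));
    }
    // size   1 ->  256 vs  256
    // size 255 ->  256 vs  256
    // size 256 ->  512 vs  256   (the bitwise form pads an extra 256 bytes here)
    // size 257 ->  512 vs  512
    // size 300 ->  512 vs  512
    return 0;
}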
I have a bitmap in RGB format, i.e. 24 bits per pixel. How can I create a cv::Mat object so that I minimize data copying while making sure the channel order is treated correctly, given that the default order in OpenCV is BGR?
In general you can use RGB order as usual; just remember to convert to the correct color space when needed.
You can create a Mat header with no copies using:
int rows = ...
int cols = ...
uchar* rgb_buffer = ...
cv::Mat3b rgb_image(rows, cols, reinterpret_cast<cv::Vec3b*>(rgb_buffer));
Few OpenCV functions assume that matrix data is in BGR (rather than RGB) order, and you can always operate on your data with custom processing that accounts for the RGB order.
The fact that OpenCV images are in BGR order is mostly a matter of input / output (basically imshow, imread, imwrite, and the like).
You can always convert your image with cvtColor(..., RGB2<whatever>) in case you need to switch color space. This won't be a performance issue, since the data would be copied anyway.
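For instance, a minimal sketch (with a hypothetical rgb_buffer) that wraps the data without copying and only converts where a BGR-consuming function needs it:
#include <opencv2/imgproc.hpp>

// Wrap the hypothetical rgb_buffer without copying, convert to BGR only
// when an I/O-style function (imshow, imwrite, ...) actually needs it.
void process_rgb(uchar* rgb_buffer, int rows, int cols) {
    cv::Mat rgb(rows, cols, CV_8UC3, rgb_buffer);   // header only, no data copy
    cv::Mat bgr;
    cv::cvtColor(rgb, bgr, cv::COLOR_RGB2BGR);      // the copy happens here
    // cv::imwrite("out.png", bgr);                 // e.g. write out the BGR version
}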
RGB buffer:
int rows, cols;
uchar* input_RGB;
Mat buffer declaration:
Mat image(rows, cols, CV_8UC3, input_RGB);
https://docs.opencv.org/master/d6/d6d/tutorial_mat_the_basic_image_container.html
I'm working on image processing with CUDA and I have a question about pixel processing.
What is typically done with the boundary pixels of an image when applying an m x m convolution filter?
With a 3 x 3 convolution kernel, ignoring the 1-pixel boundary of the image is easier to deal with, especially when the code is improved with shared memory. Indeed, in this case, one does not need to check whether a given pixel has its whole neighbourhood available (e.g. the pixel at coordinate (0, 0) has no left, upper-left, or upper neighbours). However, removing the 1-pixel boundary of the original image gives only a partial result.
In contrast, I'd like to process all the pixels within the image, even when using shared-memory optimizations, e.g. loading a 16 x 16 tile but computing only the inner 14 x 14. In this case too, ignoring the boundary pixels produces clearer code.
What is usually done in this case?
Does anyone usually take my approach of ignoring the boundary pixels?
Of course, I'm aware the answer depends on the type of problem, i.e. adding two images pixel-wise does not have this problem.
Thanks in advance.
A common approach to dealing with border effects is to pad the original image with extra rows and columns based on your filter size (see the index-mapping sketch after this list). Some common choices for the padded values are:
A constant (e.g. zero)
Replicate the first and last row / column as many times as needed
Reflect the image at the borders (e.g. column[-1] = column[1], column[-2] = column[2])
Wrap the image values (e.g. column[-1] = column[width-1], column[-2] = column[width-2])
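A minimal sketch of how those choices map to per-axis index remapping (helper names are just illustrative, and offsets are assumed smaller than n):
#include <algorithm>

// Remap an out-of-range 1-D coordinate i into [0, n); apply to the row and
// column indices separately before sampling the image.
inline int clampIndex(int i, int n)   { return std::min(std::max(i, 0), n - 1); }                      // replicate edge
inline int reflectIndex(int i, int n) { if (i < 0) i = -i; if (i >= n) i = 2 * n - 2 - i; return i; }   // mirror: column[-1] = column[1]
inline int wrapIndex(int i, int n)    { return ((i % n) + n) % n; }                                     // periodic: column[-1] = column[n-1]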
tl;dr: It depends on the problem you're trying to solve -- there is no solution for this that applies to all problems. In fact, mathematically speaking, I suspect there may be no "solution" at all since I believe it's an ill-posed problem you're forced to deal with.
(Apologies in advance for my reckless abuse of mathematics)
To demonstrate let's consider a situation where all pixel components and kernel values are assumed to be positive. To get an idea of how some of these answers could lead us astray let's further think about a simple averaging ("box") filter. If we set values outside the boundary of the image to zero then this will clearly drag down the average at every pixel within ceil(n/2) (manhattan distance) of the boundary. So you'll get a "dark" border on your filtered image (assuming a single intensity component or RGB colorspace -- your results will vary by colorspace!). Note that similar arguments can be made if we set the values outside the boundary to any arbitrary constant -- the average will tend towards that constant. A constant of zero might be appropriate if the edges of your typical image tend towards 0 anyway. This is also true if we consider more complex filter kernels like a gaussian however the problem will be less pronounced because the kernel values tend to decrease quickly with distance from the center.
Now suppose that instead of using a constant we choose to repeat the edge values. This is the same as making a border around the image and copying rows, columns, or corners enough times to ensure the filter stays "inside" the new image. You could also think of it as clamping/saturating the sample coordinates. This has problems with our simple box filter because it overemphasizes the values of the edge pixels. A set of edge pixels will appear more than once yet they all receive the same weight w=(1/(n*n)).
Suppose we sample an edge pixel with value K 3 times. That means its contribution to the average is:
K*w + K*w + K*w = K*3*w
So effectively that one pixel has a higher weight in the average. Note that since this is an average filter the weight is a constant over the kernel. However this argument applies to kernels with weights that vary by position too (again: think of the gaussian kernel..).
Suppose we wrap or reflect the sampling coordinates so that we're still using values from within the boundary of the image. This has some valuable advantages over using a constant but isn't necessarily "correct" either. For instance, how many photos do you take where the objects at the upper border are similar to those at the bottom? Unless you're taking pictures of mirror-smooth lakes I doubt this is true. If you're taking pictures of rocks to use as textures in games wrapping or reflecting could be appropriate. I'm sure there are significant points to be made here about how wrapping and reflecting will likely reduce any artifacts that result from using a fourier transform. However this comes back to the same idea: that you have a periodic signal which you do not wish to distort by introducing spurious new frequencies or overestimating the amplitude of existing frequencies.
So what can you do if you're filtering photos of bright red rocks beneath a blue sky? Clearly you don't want to add orange-ish haze in the blue sky and blue-ish fuzz on the red rocks. Reflecting the sample coordinate works because we expect similar colors to those pixels found at the reflected coordinates... unless, just for the sake of argument, we imagine the filter kernel is so big that the reflected coordinate would extend past the horizon.
Let's go back to the box filter example. An alternative with this filter is to stop thinking about using a static kernel and think back to what this kernel was meant to do. An averaging/box filter is designed to sum the pixel components then divide by the number of pixels summed. The idea is that this smooths out noise. If we're willing to trade a reduced effectiveness in suppressing noise near the boundary we can simply sum fewer pixels and divide by a correspondingly smaller number. This can be extended to filters with similar what-I-will-call-"normalizing" terms -- terms that are related to the area or volume of the filter. For "area" terms you count the number of kernel weights that are within the boundary and ignore those weights that are not. Then use this count as the "area" (which might involve an extra multiplication). For volume (again: assuming positive weights!) simply sum the kernel weights. This idea is probably awful for derivative filters because there are fewer pixels to compete with the noisy pixels and differentials are notoriously sensitive to noise. Also, some filters have been derived by numeric optimization and/or empirical data rather than from ab-initio/analytic methods and thus may lack a readily apparent "normalizing" factor.
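For instance, a minimal 1-D sketch of such a renormalizing box filter might look like this:
#include <vector>

// Sketch: box filter that renormalizes at the border by dividing by the
// number of in-bounds taps instead of the full kernel size.
std::vector<float> boxFilter1D(const std::vector<float>& in, int radius) {
    const int n = static_cast<int>(in.size());
    std::vector<float> out(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        int count = 0;
        for (int k = -radius; k <= radius; ++k) {
            const int j = i + k;
            if (j >= 0 && j < n) { sum += in[j]; ++count; }  // skip out-of-bounds taps
        }
        out[i] = sum / static_cast<float>(count);            // the "area" term shrinks at the edges
    }
    return out;
}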
Your question is somewhat broad and I believe it mixes two problems:
dealing with boundary conditions;
dealing with halo regions.
The first problem (boundary conditions) is encountered, for example, when computing the convolution between an image and a 3 x 3 kernel. When the convolution window comes across the boundary, one has the problem of extending the image outside of its boundaries.
The second problem (halo regions) is encountered, for example, when loading a 16 x 16 tile within shared memory and one has to process the internal 14 x 14 tile to compute second order derivatives.
For the second issue, a useful reference is the following question: Analyzing memory access coalescing of my CUDA kernel.
Concerning the extension of a signal outside of its boundaries, a useful tool is provided in this case by texture memory, thanks to its different addressing modes; see The different addressing modes of CUDA textures.
Below, I'm providing an example of how a median filter can be implemented with periodic boundary conditions using texture memory.
#include <stdio.h>
#include <stdlib.h>
#include "TimingGPU.cuh"
#include "Utilities.cuh"
texture<float, 1, cudaReadModeElementType> signal_texture;
#define BLOCKSIZE 32
/*************************************************/
/* KERNEL FUNCTION FOR MEDIAN FILTER CALCULATION */
/*************************************************/
__global__ void median_filter_periodic_boundary(float * __restrict__ d_vec, const unsigned int N){

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        // --- Normalized coordinates, so that cudaAddressModeWrap provides the periodic extension
        float signal_center = tex1D(signal_texture, ((float)tid + 0.5f) / (float)N);
        float signal_before = tex1D(signal_texture, ((float)tid - 0.5f) / (float)N);
        float signal_after  = tex1D(signal_texture, ((float)tid + 1.5f) / (float)N);

        printf("%i %f %f %f\n", tid, signal_before, signal_center, signal_after);

        // --- Median of the three samples
        d_vec[tid] = fmaxf(fminf(signal_center, signal_before),
                           fminf(fmaxf(signal_center, signal_before), signal_after));
    }
}
/********/
/* MAIN */
/********/
int main() {

    const int N = 10;

    // --- Input host array declaration and initialization
    float *h_arr = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) h_arr[i] = (float)i;

    // --- Output host and device array vectors
    float *h_vec = (float *)malloc(N * sizeof(float));
    float *d_vec; gpuErrchk(cudaMalloc(&d_vec, N * sizeof(float)));

    // --- CUDA array declaration and texture memory binding; CUDA array initialization
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    //Alternatively
    //cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray *d_arr; gpuErrchk(cudaMallocArray(&d_arr, &channelDesc, N, 1));
    gpuErrchk(cudaMemcpyToArray(d_arr, 0, 0, h_arr, N * sizeof(float), cudaMemcpyHostToDevice));

    // --- Wrap (periodic) addressing only takes effect with normalized coordinates,
    //     so configure the texture before binding and read it with normalized coordinates in the kernel
    signal_texture.normalized = true;
    signal_texture.addressMode[0] = cudaAddressModeWrap;
    cudaBindTextureToArray(signal_texture, d_arr);

    // --- Kernel execution
    median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_vec, d_vec, N * sizeof(float), cudaMemcpyDeviceToHost));

    for (int i = 0; i < N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    printf("Test finished\n");

    return 0;
}
I'm currently trying to compute the FFT of an image via fftw_plan_dft_2d.
To use this function, I'm linearizing the image data into the in array and calling the function mentioned above (and detailed below):
fftw_plan fftw_plan_dft_2d(int n0, int n1,
fftw_complex *in, fftw_complex *out,
int sign, unsigned flags);
The function fills a complex array, out, whose size equals the number of pixels in the original image.
Do you know if this is the proper way of computing the 2D FFT of an image? If so, what does the data within out represent? I.e., where are the high and low frequency values in the array?
Thanks,
djs22
A 2D FFT is equivalent to applying a 1D FFT to each row of the image in one pass, followed by 1D FFTs on all the columns of the output from the first pass.
The output of a 2D FFT is just like the output of a 1D FFT, except that you have complex values over two dimensions (x and y) rather than one. The DC (zero-frequency) term is at index (0, 0), spatial frequency increases with the x and y indices up to the Nyquist frequency at N/2, and indices beyond that correspond to negative frequencies; so the low frequencies sit at the corners of the array unless you apply an fftshift-style reordering.
There's a section in the FFTW manual which covers the organisation of the real-to-complex 2D FFT output data, assuming that's what you're using.
It is.
Try to compute 2 plans:
plan1 = fftw_plan_dft_2d(image->rows, image->cols, in, fft, FFTW_FORWARD, FFTW_ESTIMATE);
plan2 = fftw_plan_dft_2d(image->rows, image->cols, fft, ifft, FFTW_BACKWARD, FFTW_ESTIMATE);
You'll obtain the original data in ifft, scaled by rows * cols, since FFTW's transforms are unnormalized; divide each element by image->rows * image->cols to recover the input exactly.
Hope it helps :)
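A minimal sketch of that round trip (with hypothetical buffer and function names), including the normalization on the backward transform:
#include <fftw3.h>

// Round-trip a rows x cols image through forward and backward 2D DFTs.
// FFTW's transforms are unnormalized, so the backward result must be divided
// by rows * cols to recover the input.
void fft_roundtrip(const double* pixels, int rows, int cols) {
    const int n = rows * cols;
    fftw_complex* in   = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex* fft  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex* ifft = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n);

    fftw_plan plan1 = fftw_plan_dft_2d(rows, cols, in,  fft,  FFTW_FORWARD,  FFTW_ESTIMATE);
    fftw_plan plan2 = fftw_plan_dft_2d(rows, cols, fft, ifft, FFTW_BACKWARD, FFTW_ESTIMATE);

    for (int i = 0; i < n; i++) { in[i][0] = pixels[i]; in[i][1] = 0.0; }  // real image, zero imaginary part

    fftw_execute(plan1);   // fft[r * cols + c]: DC at (0, 0), negative frequencies above rows/2 and cols/2
    fftw_execute(plan2);   // ifft now holds the original values scaled by rows * cols

    // e.g. ifft[i][0] / n == pixels[i] (up to floating-point error)

    fftw_destroy_plan(plan1); fftw_destroy_plan(plan2);
    fftw_free(in); fftw_free(fft); fftw_free(ifft);
}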