I am trying to implement computer vision algorithms on my NVIDIA GPU with OpenCV. I am using OpenCV 2.4 and am currently writing very simple programs to get accustomed to it. I wrote simple code that transposes a matrix and runs Canny edge detection on the GPU. The program runs perfectly, but I need to deallocate the memory on both the CPU and the GPU. My code is below:
#include <cstdio>
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>

int main(int argc, char *argv[])
{
    cv::Mat src, dest, dest_1;                      // host (CPU) images
    cv::gpu::GpuMat im_source, im_dest, im_dest_1;  // device (GPU) images

    // Report how many CUDA-capable devices OpenCV can see.
    int k = cv::gpu::getCudaEnabledDeviceCount();
    printf("%d\n", k);

    src = cv::imread("lena.jpg", 0);                // 0 = load as grayscale
    cv::imshow("lena_org", src);
    im_source.upload(src);                          // host -> device copy

    cv::gpu::transpose(im_source, im_dest);
    im_dest.download(dest);                         // device -> host copy
    cv::imshow("lena_trans", dest);

    cv::gpu::Canny(im_source, im_dest_1, 100, 100, 3, false);
    im_dest_1.download(dest_1);
    cv::imshow("lena_edge", dest_1);

    cv::waitKey();
    return 0;
}
From the code above, I believe the memory is not being freed on either the CPU or the GPU. Searching the internet, I came across cv::Mat::release for the CPU side and cv::gpu::GpuMat::release for the GPU side, but I don't understand how I should use these functions in my code to free both my CPU and GPU memory. It would be very helpful if someone could guide me through the correct usage of the release APIs so that I can free the memory successfully. Thanks for all your support.
The destructor for cv::Mat objects frees the memory automatically by calling the release function you describe, and cv::gpu::GpuMat behaves the same way for device memory. At the level of your code, you shouldn't have to worry about it: once a matrix leaves scope, it is destroyed, and its data is freed when no other matrix header still refers to it.
If you want to drop your reference to the data manually, you can call, for example, src.release(). There is a good tutorial on memory management in the OpenCV documentation.
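To make the reference-counting behaviour concrete, here is a minimal stand-alone sketch. It deliberately does not use OpenCV: the hypothetical `Header` type only models a cv::Mat-style header sharing one reference-counted buffer, with std::shared_ptr standing in for OpenCV's internal counter. All names in it are invented for illustration.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a cv::Mat-style header (NOT OpenCV code):
// several headers may share one reference-counted pixel buffer.
struct Header {
    std::shared_ptr<std::vector<unsigned char> > data;

    Header() {}
    explicit Header(std::size_t bytes)
        : data(std::make_shared<std::vector<unsigned char> >(bytes)) {}

    // Models release(): drop this header's reference only.
    void release() { data.reset(); }

    // True once this header no longer refers to any buffer.
    bool empty() const { return !data; }

    // Number of headers currently sharing the buffer (0 if released).
    long refCount() const { return data ? data.use_count() : 0; }
};
```

In this model, copying a header and then calling release() on the original leaves the buffer alive, because the copy still refers to it. The storage is freed only when the last header is released or goes out of scope, which is why explicit release() calls are rarely needed.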
I've got an MPSImageGaussianBlur object doing work on each frame of a compute pass (blurring the contents of an intermediate texture).
While the app is still running at 60fps no problem, I see an increase of ~15% in CPU usage when enabling the blur pass. I'm wondering if this is normal?
I'm just curious as to what could be going on under the hood of MPSImageGaussianBlur's encodeToCommandBuffer: operation that would see so much CPU utilization. In my (albeit naive) understanding, I'd imagine there would just be some simple encoding along the lines of:
MPSImageGaussianBlur.encodeToCommandBuffer: pseudo-method :
func encodeToCommandBuffer(commandBuffer: MTLCommandBuffer, sourceTexture: MTLTexture, destinationTexture: MTLTexture) {
    let encoder = commandBuffer.computeCommandEncoder()
    encoder.setComputePipelineState(...)
    encoder.setTexture(sourceTexture, atIndex: 0)
    encoder.setTexture(destinationTexture, atIndex: 1)
    // kernel weights would be built at initialization and
    // present here as a `kernelWeights` property
    encoder.setTexture(self.kernelWeights, atIndex: 2)
    let threadgroupsPerGrid = ...
    let threadsPerThreadgroup = ...
    encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    encoder.endEncoding()
}
Most of the 'performance magic' would be implemented in the algorithms running in the compute kernel function. I can appreciate that part, because performance on the GPU is pretty fantastic regardless of the blurRadius I initialize the MPSImageGaussianBlur with.
Some probably irrelevant details about my specific setup:
MPSImageGaussianBlur initialized with blur radius 8 pixels.
The texture I'm blurring is 128 by 128 pixels.
Performing all rendering in an MTKViewDelegate's drawInMTKView: method.
I hope this question is somewhat clear in its intent.
MPSImageGaussianBlur is internally a complex multipass algorithm. It spends some time allocating textures from its internal texture cache to hold the intermediate data. There is the overhead of managing multiple kernel launches. Some resources, such as the Gaussian blur kernel weights, also need to be set up. When you commit the command buffer, all these textures need to be wired down (iOS) and some other work needs to be done. So it is not quite as simple as you imagine.
The texture you are using is small enough that the relatively fixed CPU overhead can start to become an appreciable part of the time.
Filing a radar on the CPU cost of MPSImageGaussianBlur would prompt Apple to spend an hour or two checking whether something can be improved, and would be worth your time.
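As an illustration of the kind of CPU-side setup mentioned above, here is a minimal sketch of building normalized 1-D Gaussian weights. It uses the generic textbook formula and is not Apple's implementation; the function name `gaussianWeights` and the choice of sigma for a given radius are assumptions made for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized 1-D Gaussian weights for taps -radius..+radius.
// Generic textbook formula; NOT Apple's implementation. How sigma relates
// to the radius is left to the caller (an assumption of this sketch).
std::vector<float> gaussianWeights(int radius, float sigma) {
    std::vector<float> w(static_cast<std::size_t>(2 * radius + 1));
    float sum = 0.0f;
    for (int i = -radius; i <= radius; ++i) {
        float x = static_cast<float>(i);
        float g = std::exp(-(x * x) / (2.0f * sigma * sigma));
        w[static_cast<std::size_t>(i + radius)] = g;
        sum += g;
    }
    // Normalize so the weights sum to 1 and the blur preserves brightness.
    for (std::size_t k = 0; k < w.size(); ++k) w[k] /= sum;
    return w;
}
```

For a radius of 8 this yields 17 weights. The work is trivial on its own, but it is one of several fixed per-filter costs that do not shrink when the texture is small.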
Honestly, I would not be surprised if, under the hood, the GPU were being used less than you would think for this kernel. In my first experiences with Metal compute I found the performance underwhelming and fell back on NEON, which was counterintuitive. I really wouldn't be surprised if the CPU hit was NEON code; I saw the same thing using the MPS Gaussian blur. It would be nice to get this confirmed. NEON has a lot of memory and instruction features that are friendlier to this use case.
Another indicator that this might be the case is that these filters don't run on OS X Metal. If they were just compute shaders, I'm sure they could run there, but NEON code can't run on the simulator.
Is there a way to directly copy previously allocated CUDA device data into an OpenCV GPU Mat? I would like to copy my data, previously initialized and filled by CUDA, into the OpenCV GPU Mat. I want to do so because I want to solve a linear system of equations Ax = B by computing the inverse of the matrix A using OpenCV.
What I want to do is something like this:
float *dPtr;
gpuErrchk( cudaMalloc( (void**) &dPtr, sizeof(float) * height * width));
gpuErrchk( cudaMemset(dPtr, 0, sizeof(float) * height * width));
// modify dPtr in some way on the GPU
modify_dPtr();
// copy previously allocated and modified dPtr into OpenCV GPU mat?
// process GPU mat later, e.g. do a matrix inversion operation.
// extract raw pointer from GPU mat
EDIT:
The OpenCV documentation provides a GPU upload function.
Can the device pointer just be passed into that function as a parameter? If not, is there no other way to do such a data transfer? I don't want to copy data back and forth between the host and device memory, do my computation on a normal OpenCV Mat container, and copy back the results; my application is real-time. I am assuming that since there is no .at() function for a GPU Mat, as in the normal OpenCV Mat, there is no way to access the element at a particular location in the matrix? Also, does an explicit matrix inversion operation exist for the GPU Mat? The documentation does not provide a GPU Mat inv() function.
Just as talonmies posted in the comments, there is a constructor in the GpuMat header that creates a GpuMat header pointing to previously allocated CUDA device data. This is what I used:
cv::gpu::GpuMat dst(height, width, CV_32F, dPtr);
There is no need to work out the step size here: the constructor's step argument defaults to Mat::AUTO_STEP, which assumes tightly packed rows (width times the element size), and that matches a plain cudaMalloc allocation like the one above. No data is copied; the GpuMat only wraps the pointer, so the caller remains responsible for freeing it with cudaFree.
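The step arithmetic behind this can be sketched in a few lines. The helper names `packedStep` and `byteOffset` are hypothetical and not part of OpenCV or CUDA; they only spell out how a row-major buffer is addressed.

```cpp
#include <cstddef>

// Bytes per row when rows are tightly packed, as with a plain cudaMalloc
// of width * height elements (no padding between rows).
inline std::size_t packedStep(std::size_t cols, std::size_t elemSize) {
    return cols * elemSize;
}

// Byte offset of element (row, col) in a row-major buffer whose rows are
// `step` bytes apart. For padded (pitched) allocations, `step` is the pitch.
inline std::size_t byteOffset(std::size_t row, std::size_t col,
                              std::size_t step, std::size_t elemSize) {
    return row * step + col * elemSize;
}
```

If the buffer had instead come from cudaMallocPitch, rows would be padded, and the returned pitch would have to be passed as the constructor's explicit step argument rather than relying on the default.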
Hopefully, when the support for OpenCV GPU functions becomes better, this post may be useful to someone.
EDIT
Another (probably) useful way is to utilize unified memory in CUDA. Pass the data into an OpenCV GPU and CPU mat, and continue operations from there.
Does Tegra K1 support RenderScript on the GPU? I wrote a sample RS kernel and ran it on a MiPad, but CPU usage reaches 95% on average. The kernel looks like this:
#pragma version(1)
#pragma rs java_package_name(com.example.android.rs.hellocomputendk)
#pragma rs_fp_relaxed
void root(const uchar4 *v_in, uchar4 *v_out) {
v_out->xyzw = v_in->xyzw;
}
The allocation flags are:
RS_ALLOCATION_USAGE_SHARED | RS_ALLOCATION_USAGE_SCRIPT,
The official PDF says the Tegra K1 GPU supports RS, so I don't know where I am going wrong.
Thanks
Did you check the GPU utilization? You could try NVIDIA Nsight Tegra.
Is the high CPU utilization per core or per processor? If it is per processor, this might indicate that RS has parallelized the task across the cores.
Are you using the Tegra Android Development Pack?
It may be that NVIDIA supports RenderScript only on the CPU side. Since the K1 has a CUDA-based GPU, the logic for putting arbitrary code on the GPU may not be implemented.
The GPU may be used in kernels that do image-processing work, like here.
I have a program that uses OpenCV and oclMat.
When I run it, my PC gets slow and sometimes freezes.
I suspect that too much GPU memory is in use, so my question is: how can I release the memory allocated by OpenCV oclMat objects? I execute four kernels.
I create the oclMat objects, call each kernel, and pass the matrices to it. The result is an oclMat that is used by the following kernel. M3, M4, M8, M9, M5 and M10 are matrices that hold data inside the kernels. I am not using local memory (the target device does not support it), so I use these oclMat objects as data holders instead: all the temporary data calculated inside a kernel is stored in them, the way local memory would have been used. I do not need them in the next kernel, so I want to free them. What is the process for doing that? The sequence is something like this:
oclmat M1,M2,M3,M4
kernel1 (M1,M2,M3,M4)
oclmat M5,M6
kernel2(M4,M5,M6)
oclmat M7,M8,M9
kernel3(M6,M7,M8,M9)
oclmat M10,M11
kernel2(M9,M10,M11)
I noticed that a new data structure cv::Matx was added to the new OpenCV version, intended for small matrices of known size at compilation time, for example
cv::Matx31f // matrix 3x1 of float type
Checking the documentation I saw that most of matrix operations are available, but still I don't see the advantages of using this new type instead of the old cv::Mat.
When should I use Matx instead of Mat?
Short answer: cv::Mat uses the heap to store its data, while cv::Matx uses the stack.
A cv::Mat uses dynamic memory allocation (on the heap). This is appropriate for big matrices (like images) and lets you do things like shallow copies of a matrix, which is the default behavior of cv::Mat.
However, for the small matrices that cv::Matx is designed for, heap allocation would be very expensive compared to doing the same thing on the stack. I have seen a block of math reduce processing time by over 75% by switching to using stack-allocated types (e.g. cv::Point and cv::Matx) instead of cv::Mat.
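The idea of a fixed-size, stack-allocated matrix can be sketched with a small template. This is not OpenCV's cv::Matx; the `SmallMat` type below is a hypothetical illustration of why knowing the dimensions at compile time removes the need for heap allocation.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size matrix (NOT OpenCV's cv::Matx): the dimensions are
// template parameters, so the elements live inside the object itself and a
// local variable of this type needs no heap allocation.
template <typename T, std::size_t M, std::size_t N>
struct SmallMat {
    std::array<T, M * N> val;  // row-major storage

    SmallMat() : val() {}      // value-initialize all elements to zero

    T& operator()(std::size_t i, std::size_t j) { return val[i * N + j]; }
    const T& operator()(std::size_t i, std::size_t j) const { return val[i * N + j]; }
};

// Matrix product; mismatched inner dimensions fail to compile because the
// shapes are part of the types.
template <typename T, std::size_t M, std::size_t K, std::size_t N>
SmallMat<T, M, N> operator*(const SmallMat<T, M, K>& a, const SmallMat<T, K, N>& b) {
    SmallMat<T, M, N> c;
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < K; ++k)
                c(i, j) += a(i, k) * b(k, j);
    return c;
}
```

Creating and multiplying such objects in a tight loop involves no allocator calls at all, which is the likely source of the kind of speed-up described above when many tiny matrices are created.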
It's about memory management: avoiding wasted memory (which matters in some cases), or simply reserving memory for an object you'll use later.
That's how I understand it; maybe someone else can give a better explanation.
This is a late late answer, but it is still an interesting question!
dom's answer is quite accurate, and the heap/stack reference in user1460044's is also interesting.
From a practical point of view, I wouldn't use Matx (or Vec) unless it were strictly necessary. The major advantages of Matx are:
Using the stack (efficient![1])
Initialization.
The problem is that, in the end, you will have to move your Matx data into a Mat to do most things, and so you will be back on the heap again.
On the other hand, the "cool initialization" of a Matx can be done in a normal Mat:
// Matx initialization:
Matx31f A(1.f,2.f,3.f);
// Mat initialization:
Mat B = (Mat_<float>(3,1) << 1.f, 2.f, 3.f);
There is also a difference in initialization beyond the heap/stack distinction. If you try to put 5 values into the Matx31, it will crash (runtime exception), while calling Mat_::operator<< with 5 values will only store the first three.
[1] Efficient if your program has to create lots of matrices of less than ~10 elements. In that case use Matx matrices.
There are 2 other reasons why I prefer Matx to Mat:
Readability: people reading the code can immediately see the size of the matrices, for example:
cv::Matx34d transform = ...;
It's clear that this is a 3x4 matrix, so it contains a 3D transformation of type (R,t), where R is a rotation matrix (as opposed to say, axis-angle).
Similarly, accessing an element is more natural with transform(i,j) vs transform.at<double>(i,j).
Easy debugging. Since the elements for Matx are allocated on the stack in an array of known length, IDEs or debuggers can display the entire contents nicely when stepping through the code.