Leaking memory in detectMultiScale when detecting faces - iOS

I'm using the following function (successfully) to detect faces using OpenCV on iOS, but it seems to leak 4-5 MB of memory every second according to Instruments.
The function is called from processFrame() regularly.
By a process of elimination, it's the line that calls detectMultiScale on face_cascade that's causing the problem.
I've tried surrounding sections with an autoreleasepool (as I've had problems before with memory not being released on non-UI threads when doing video processing), but that didn't make a difference.
I've also tried forcing the faces vector to release its memory, but again to no avail.
Does anyone have any ideas?
- (bool)detectAndDisplay :(Mat)frame
{
    BOOL bFaceFound = false;
    vector<cv::Rect> faces;
    Mat frame_gray;

    cvtColor(frame, frame_gray, CV_BGRA2GRAY);
    equalizeHist(frame_gray, frame_gray);

    // the following line leaks 5Mb of memory per second
    face_cascade.detectMultiScale(frame_gray, faces, 1.1, 2, 0 | CV_HAAR_SCALE_IMAGE, cv::Size(100, 100));

    for(unsigned int i = 0; i < faces.size(); ++i)
    {
        rectangle(frame, cv::Point(faces[i].x, faces[i].y),
                  cv::Point(faces[i].x + faces[i].width, faces[i].y + faces[i].height),
                  cv::Scalar(0,255,255));
        bFaceFound = true;
    }

    return bFaceFound;
}
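For clarity, by "forcing the faces vector to release its memory" I mean the usual shrink-to-fit swap idiom, roughly:

vector<cv::Rect>().swap(faces); // swap with an empty temporary so the capacity is actually freed

That didn't make any difference either.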

I am using the same source code as you and am suffering from exactly the same problem - memory leakage. The only differences are: I use Qt5 for Windows, and I am loading separate .jpg images (thousands of them, actually). I have tried the same techniques to prevent the crashes, but in vain. I wonder whether you have already solved the problem?
A similar issue is described here (bold paragraph, at the very bottom of the page), however that write-up is for the previous version of the OpenCV interface. The author says:
The above code (function detect and draw) has a serious memory leak when run in an infinite for loop for real time face detection.
My humble guess is that the leak is caused by badly handled resources inside the detectMultiScale() method. I have not checked it yet, but the cvHaarDetectObjects() method explained here might be a better solution (although using an old version of OpenCV is probably not the best idea).
Combined with a suggestion from the previous link (add this line at the end of the operations: cvReleaseMemStorage(&storage)), the leak should be plugged.
Writing this post made me want to try this out, so I will let you know as soon as I know whether this works or not.
EDIT: My guess was probably wrong. Try simply clearing the 'faces' vector after every detection instead of releasing its memory. I have been running the script for quite some time now, a few hundred faces detected, and still no sign of problems.
EDIT 2: Yep, this is it. Just add faces.clear() after every detection. Everything will work just fine. Cheers.
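To spell that out: if you reuse one 'faces' vector across frames or images (as in my loop over .jpg files), the fix is just this (a rough sketch; loadNextImage and the drawing code are placeholders):

vector<cv::Rect> faces;
Mat frame, frame_gray;
while (loadNextImage(frame)) // hypothetical image/frame source
{
    cvtColor(frame, frame_gray, CV_BGRA2GRAY);
    equalizeHist(frame_gray, frame_gray);
    face_cascade.detectMultiScale(frame_gray, faces, 1.1, 2, 0 | CV_HAAR_SCALE_IMAGE, cv::Size(100, 100));
    // ... draw the rectangles as in the question ...
    faces.clear(); // drop the detections before the next image
}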

Related

OpenCV CUDA API very slow at the first call

I am using cuda::resize to resize a vector of images (stored in GpuMat).
Profiling shows the first call takes ~15 ms, while the rest take only ~0.3 ms, so I want to ask whether there are ways to shorten the time of the first call.
Here is my code (simplified):
for (int i = 0; i < num_images; ++i)
{
    full_img = v_GpuMat[i].clone(); // vGpuMat is vector of images in cuda::GpuMat
    seam_scale = 0.4377;
    cuda::resize(full_img, img, Size(), seam_scale, seam_scale, INTER_LINEAR);
}
Thank you very much.
CUDA device memory allocations and copying data between device and host are very slow. Please try to allocate memory and load data outside the main loop. Cloning a matrix allocates new device memory each time; try copying data instead of cloning, which should speed up your code.
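A sketch of that suggestion, reusing the names from the question (v_GpuMat, num_images, seam_scale) and assuming all frames share the same size and type, so copyTo can reuse the destination's device memory after the first iteration:

cv::cuda::GpuMat full_img, img; // allocated on the first iteration, then reused
const double seam_scale = 0.4377;

for (int i = 0; i < num_images; ++i)
{
    v_GpuMat[i].copyTo(full_img); // copy instead of clone: no fresh device allocation after the first pass
    cv::cuda::resize(full_img, img, cv::Size(), seam_scale, seam_scale, cv::INTER_LINEAR);
}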
After checking the result in the Nvidia Visual Profiler, I found that it is cudaLaunchKernel that takes ~20 ms, and only the first time it is called.
If you have to run a continuous process like me, one solution can be a dry run before you process your own tasks. For this sample, making a warm-up cuda::resize call before the loop makes things much faster.
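Something like this is what I mean by a dry run (a sketch; the dummy size and type are arbitrary):

// Pay the one-off kernel/module load cost on a throwaway image before the real work
cv::cuda::GpuMat dummy_src(64, 64, CV_8UC3), dummy_dst;
cv::cuda::resize(dummy_src, dummy_dst, cv::Size(), 0.5, 0.5, cv::INTER_LINEAR);

// ... now run the real loop from the question; its first iteration no longer pays the ~15 ms ...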

How can I debug an inscrutable "internal error" in a Metal compute shader? (Arbitrary code changes trigger it)

I have a Metal compute shader for iOS that has started generating: "Error Domain=AGXMetal Code=1 "Compiler encountered an internal error" errors during newComputePipelineStateWithFunction().
The errors are consistent from run to run but seem to be triggered by almost arbitrary modifications to unrelated code. What I mean is: while trying to debug why my most recently added chunk of code was ostensibly causing this problem, I found that removing seemingly arbitrary and unrelated lines of code or constructs would make the error go away.
I'm wondering if I may have hit some size or complexity limit of the compiler.
My shader function is less than 200 lines of code, structured as a few C functions, and doesn't allocate much memory, but it does have some loops and passes around some buffer pointers. Up to a certain point it was all working perfectly, and the more recently added code is just more of the same.
My questions are:
1) What exactly is the compiler for the compute pipeline doing here (that wasn't done when my default.metallib was generated), and is there any hope of gathering more debug info from it?
2) If this is some kind of size or complexity issue with the code, does anyone have ideas on how I might restructure it to mitigate this? Does any of this make sense?
It's going to be tough to post example code for this but I will try to update with it if a solution doesn't appear first.
EDIT:
So what I've done is painstakingly reduced and simplified my code until I have a relatively compact example that illustrates the problem. This was not as straightforward as it sounds because so many seemingly minor changes cause the problem to disappear, and yet it always returns when complexity is added back.
Please keep in mind that this is not about what the code below does - the failure happens when setting up the compute pipeline long before it runs. If you see something obviously wrong please let me know but other than that the code is only meant to be representative.
The shader below fails on an A9 processor (iPhone 6s or 6s Plus) but runs on an A7 (first-generation iPad Air).
void myFunc( device int *ibuff0, thread int *ibuff1)
{
    int counter = 0;
    float fbuff0[8];

    for( int i = 0; i < 8; i++) {
        if ( ibuff0[0] == 42 ) {
            fbuff0[counter++] = 0.0;
        }
    }

    float val = fbuff0[0];

    if ( distance( float2(0.0f, 0.0f), float2(val, val) ) < 42.0f) {
        ibuff1[0] = 0;
    }
}

kernel void myKernelFunc( device int *ibuff0 [[ buffer(0) ]] )
{
    int ibuff1[8];
    myFunc( ibuff0, ibuff1 );
}
What's interesting is how many ways you can fix the above. To name a few: 1) inline myFunc (either manually or using the inline keyword). 2) Comment out either buffer assignment. 3) Substitute a local buffer for the device buffer. 4) Comment out the for-loop, leaving the body. Also the distance function call is not magic here and you can substitute any non-inlineable function that uses 'val' there.
By the way, here is a completely bogus one-liner version that fails on both A9 and A7 processors:
kernel void myKernelFunc() {
    while ( true ) { }
}
Some more thoughts -
I'm assuming what I'm running into here is some bug or limitation in the Metal compiler's attempt to map these conditional and loop structures onto the kind of dynamically uniform code that can run on the GPU. But I don't know why I got as far as I did in my own code before hitting this wall, since the sample above does not seem any more complicated than the kind of thing I was doing successfully.
Now that I have a sample I may file a bug with Apple (as some suggested). But I wanted to share here in case someone has an idea.
UPDATE:
The easiest way that I've found to work around this problem is to manually inline some of my functions.
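For reference, the manually inlined version of the sample kernel above is just myFunc's body folded into the kernel, roughly:

kernel void myKernelFunc( device int *ibuff0 [[ buffer(0) ]] )
{
    int ibuff1[8];

    // body of myFunc folded in by hand
    int counter = 0;
    float fbuff0[8];
    for( int i = 0; i < 8; i++) {
        if ( ibuff0[0] == 42 ) {
            fbuff0[counter++] = 0.0;
        }
    }
    float val = fbuff0[0];
    if ( distance( float2(0.0f, 0.0f), float2(val, val) ) < 42.0f) {
        ibuff1[0] = 0;
    }
}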

WebGL performance problems with a >65k vertex mesh on a MacBook Pro

The following model has good performance on several low-end machines:
http://examples.x3dom.org/example/x3dom_sofaGirl.html
However on a MacBook Pro with Nvidia GT 650m the framerate is very low. I thought it was because the MacBook does not have the OES_element_index_uint extension, but the extension shows up if I do:
document.createElement("canvas").getContext("experimental-webgl").getSupportedExtensions();
Restructuring the mesh below 65K solves the problem. Is there any way to have good performance without restructuring?
I installed an application (gfxCardStatus) that disabled the GT 650m and forced the use of integrated graphics only. Now everything works fine. Is this a driver bug?
I found another 3D scene that runs faster on the dedicated GPU than on the integrated one:
http://examples.x3dom.org/binaryGeo/oilrig_demo/index.html
I think this is because it consists of many small meshes. Also, when I run this scene I can hear the GPU fan spin up; it did not with the sofaGirl scene.
First off, WebGL is not limited to 65k vertices per draw call. gl.drawElements has a 64k index limit, though there is an extension that removes that limit. gl.drawArrays has no such limit.
I don't know why it's slow, but looking at a frame in the WebGL Inspector, X3DOM is using gl.drawArrays.
I dug a little more. I tried using the Web Tracing Framework as well as Chrome's profiler. It showed a lot of time spent in gl.readPixels.
To see if that was the issue, I opened the JavaScript console and replaced gl.readPixels with a no-op like this:
In JavaScript Console:
// find the canvas
c = document.getElementsByTagName("canvas")[0]
// get the webgl context for that canvas
gl = c.getContext("webgl")
// replace readPixels with a no-op
gl.readPixels = function(x, y, w, h, f, t, b) {
    var s = w * h * 4;
    for (var ii = 0; ii < s; ++ii) {
        b[ii] = 0;
    }
};
That removed readPixels from showing up in the profiler, but the sample didn't run any faster.
Next I tried hacking drawArrays to draw less.
In the JavaScript Console:
// save off the original drawArrays so we can call it
window.origDrawArrays = gl.drawArrays.bind(gl)
// patch in our own that draws less
gl.drawArrays = function(t, o, c) { window.origDrawArrays(t, o, 50000); }
What do you know, now it runs super fast. Hmm. Maybe it is a driver bug. It was being asked to draw 1070706 vertices, but that hardly seems like a large number for an NVidia GT 650m.
So, I don't know why, but I felt like looking into this issue. I wrote a native app to display the same data; it runs at 60fps easily. I checked the integrated graphics in WebGL like the OP said: also 60fps easily. The NVidia GT 650m is at around ~1fps.
I also checked Safari and Firefox. They run it slowly too. The common thing there is ANGLE: they all use ANGLE to rewrite shaders. Maybe there's an issue there, since the same shader ran fine in my native test. Of course the native test isn't doing exactly the same things as WebGL, but still, it's not just that it's drawing 1M polys.
So I filed a bug:
https://code.google.com/p/chromium/issues/detail?id=437150

Golang async face detection

I'm using an OpenCV binding library for Go and trying to asynchronously detect objects in 10 images, but I keep getting a panic. Detecting only 4 images never fails.
var wg sync.WaitGroup

for j := 0; j < 10; j++ {
    wg.Add(1)
    go func(i int) {
        image := opencv.LoadImage(strconv.Itoa(i) + ".jpg")
        defer image.Release()

        faces := cascade.DetectObjects(image)
        fmt.Println((len(faces) > 0))
        wg.Done()
    }(j)
}

wg.Wait()
I'm fairly new to OpenCV and Go and am trying to figure out where the problem lies. I'm guessing some resource is being exhausted, but which one?
Each time you call DetectObjects, the underlying implementation of OpenCV builds a tree of classifiers and stores them inside of cascade. You can see part of the handling of these chunks of memory at https://github.com/Itseez/opencv/blob/master/modules/objdetect/src/haar.cpp line 2002.
Your original code only had one cascade as a global. Each new goroutine calling DetectObjects used the same root cascade. Each new image would have freed the old memory and rebuilt a new tree, and eventually they would stomp on each other's memory and cause a dereference through 0, causing the panic.
Moving the allocation of the cascade inside the goroutine allocates a new one for each DetectObjects call, so they do not share any memory.
The fact that it never happened on 4 images but failed on 5 images is the nature of computing. You got lucky with 4 images and never saw the problem. You always saw the problem on 5 images because exactly the same thing happened each time (regardless of concurrency).
Repeating the same image multiple times doesn't cause the cascade tree to be rebuilt. If the image didn't change, why redo the work? It's an optimization in OpenCV for handling multiple image frames.
The problem seemed to be having the cascade as a global variable.
Once I moved
cascade := opencv.LoadHaarClassifierCascade("haarcascade_frontalface_alt.xml")
into the goroutine all was fine.
You are not handling a nil image:
image := opencv.LoadImage(strconv.Itoa(i) + ".jpg")
if image == nil {
    // handle error
}

How to get the page size

I was asked this question in an interview. Please tell me the answer:
You have no documentation for the kernel. You only know that your kernel supports paging.
How will you find the page size? There is no flag or macro that can tell you the page size.
I was given the hint that you can use time to get the answer. I still have no clue.
Run code like the following:
for (int stride = 1; stride < maxpossiblepagesize; stride += searchgranularity) {
    char* somemem = (char*)malloc(veryverybigsize * stride);
    starttime = getcurrentveryaccuratetime();
    for (char* pos = somemem; pos < somemem + veryverybigsize * stride; pos += stride) {
        // iterate over "veryverybigsize" chunks of size "stride"
        *pos = 'Q'; // just write something to force the page back into physical memory
    }
    endtime = getcurrentveryaccuratetime();
    printf("stride %u, runtime %u\n", stride, endtime - starttime);
    free(somemem);
}
Graph the results with stride on the X axis and runtime on the Y axis. There should be a point at stride=pagesize, where the performance no longer drops.
This works by incurring a number of page faults. Once stride surpasses pagesize, the number of faults ceases to increase, so the program's performance no longer degrades noticeably.
If you want to be cleverer, you could exploit the fact that the mprotect system call must work on whole pages. Try it with something smaller, and you'll get an error. I'm sure there are other "holes" like that, too - but the code above will work on any system which supports paging and where disk access is much more expensive than RAM access. That would be every seminormal modern system.
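A sketch of the mprotect idea, assuming a Linux/POSIX system: mmap returns a page-aligned address, and mprotect rejects an address that isn't page-aligned, so (assuming a power-of-two page size) the first power-of-two offset it accepts is the page size.

#include <cstdio>
#include <sys/mman.h>

int main() {
    const size_t region = 1 << 22; // 4 MiB scratch mapping, assumed larger than any page
    void* mem = mmap(NULL, region, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;
    char* base = (char*)mem;

    // base is page-aligned, so base + offset is page-aligned only when
    // offset is a multiple of the page size.
    for (size_t offset = 1; offset < region; offset <<= 1) {
        if (mprotect(base + offset, 1, PROT_READ) == 0) {
            printf("page size is probably %zu bytes\n", offset);
            break;
        }
    }
    munmap(base, region);
    return 0;
}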
It looks to me like a question about 'how does paging actually work'
They want you to explain the impact that changing the page size will have on the execution of the system.
I am a bit rusty on this stuff, but when memory is full, the system starts page swapping, which slows everything down. So you want to run something that fills up memory to different sizes and measure the time it takes to do a task. At some point there will be a jump, where the time taken to do the task suddenly increases.
Like I said, I am a bit rusty on the implementation details, but I'm pretty sure that is the shape of the answer they were after.
Whatever answer they were expecting, it would almost certainly be a brittle solution. For one thing, you can have multiple page sizes, so any answer you get for one small allocation may be irrelevant for the next multi-megabyte allocation (see things like Linux's Large Page support).
I suspect the question was more aimed at seeing how you approached the problem than at the final solution you came up with.
By the way, this question isn't about Linux, because there you do have documentation, as well as POSIX compliance, for which you just call sysconf(_SC_PAGE_SIZE).
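For reference, that one-liner looks like:

#include <cstdio>
#include <unistd.h>

int main() {
    long page_size = sysconf(_SC_PAGE_SIZE); // page size in bytes
    printf("page size: %ld bytes\n", page_size);
    return 0;
}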
