C++/CLI mixed types are not supported - OpenCV

I am trying to use C++/CLI with System::Threading for OpenCV detection software I have. I have several Haar cascades that I want to run simultaneously in separate threads. I am trying to follow the instructions from here. I have noticed that when I create a ref class, I can't define OpenCV objects as class members. For example, these are the members of my ref class Detection, where I hold the OpenCV objects through pointers:
private:
    Mat *image;
    CascadeClassifier *cascade;
    double scale;
    int neighbors;
public:
    Detection(cv::Mat &img, cv::CascadeClassifier &cas, double sc, int neigh)
    {
        image = new cv::Mat(img);
        cascade = new cv::CascadeClassifier(cas);
        scale = sc;
        neighbors = neigh;
    }
    void detect_faces() {
        Mat gray_image;
        cv::cvtColor((*image), gray_image, CV_BGR2GRAY);
        cv::equalizeHist(gray_image, gray_image);
        std::vector<cv::Rect> faces1;
        (*cascade).detectMultiScale(gray_image, faces1, scale, neighbors, 0 | CASCADE_SCALE_IMAGE, Size(3, 3), Size(190, 190));
        // 'faces' is not declared in the posted snippet; it is assumed to exist elsewhere
        faces.insert(faces.end(), faces1.begin(), faces1.end());
    }
Main function:
int main()
{
    Mat image = imread(...);
    cv::CascadeClassifier face_cascade1;
    face_cascade1.load("cascades/lbpcascade_profileface.xml");
    Detection^ obj1 = gcnew Detection(image, face_cascade1, 1.01, 5);
    ThreadStart^ myThreadDelegate1 = gcnew ThreadStart(obj1, &Detection::detect_faces);
    Thread^ Thread1 = gcnew Thread(myThreadDelegate1);
    Thread1->Start(); // the .NET method is Start, with a capital S
    ...//the rest threads
}
This code seems to work. However, as Berak mentioned, I shouldn't make a copy of the CascadeClassifier. Is there something else I could do? Does this implementation waste time? Moreover, is there a way to move detectMultiScale into the main function?

C++/CLI ref classes can only hold .NET objects or primitive types as members.
If you want to have a native C++ member, you'll have to hold a pointer to it (which can be compiled to an integer type big enough to hold the pointer).
Now if you want to check whether the threads are running concurrently, you can do two things:
1. Pause the program and look at the Threads window (Debug -> Windows -> Threads). Here you can see all the working threads, but you might be unlucky and "miss" the moment when both threads are working.
2. Add a long sleep() in each thread, just before it starts its work. Then pause the program as in (1); if you see the threads waiting on the sleep(), it means they're running concurrently.

Related

C++ vector with iterator can't be destroyed?

Recently I ran into a problem: while the program is running, its memory usage keeps increasing, and when the program is closed the memory returns to its normal level. Obviously, it's a memory leak. After some work I have located the code responsible, but I don't know why it leaks. The program's workflow is simple:
first, use the lidar API to get point cloud and image data;
then pass the data to a TBB flow graph for processing;
finally, use the Open3D API to visualize them.
In the first step, the lidar's own API uses asio to asynchronously invoke callback functions that deliver the data, so I create some tbb concurrent_queue objects to store the data, and an align function to match clouds and images by timestamp. The problem is in the align function. In it, I create a vector<shared_ptr<open3d::..::PointCloud>> and use an iterator to store point cloud elements. However, I found that when the function completes, the shared_ptr use count doesn't decrease. A similar but simpler example looks like this:
#include <iostream>
#include <memory>
#include <utility>
#include <vector>
std::pair<std::shared_ptr<int>, int> helper() {
    auto a = std::make_shared<int>(90);
    auto c = 100;
    std::vector<std::pair<std::shared_ptr<int>, int>> container;
    container.reserve(5);          // capacity is 5, but size() is still 0
    auto iter = container.begin();
    for (int i = 0; i < 3; i++) {
        *iter = std::make_pair(a, c);  // writes past size(): undefined behavior
        iter++;
    }
    return *(iter-1);
}
int main() {
    auto b = helper();
    std::cout << "shared_ptr use count: " << std::get<0>(b).use_count() << std::endl;
    return 0;
}
On Ubuntu 20.04 with gcc 9.4, the printed result is shared_ptr use count: 4.
Why isn't the vector automatically destroyed when the function completes? I hope someone can kindly explain this problem.
Thanks @Retired Ninja! The root of the problem is that vector::reserve only reserves capacity; it does not create any elements, so the vector's size after reserve is still 0. The iterator operations that follow therefore write to memory that holds no constructed elements, which is undefined behavior. The result happens to reach the main function with the right value, but the vector's destructor never destroys the three copies written through the iterator because it does not know about them, so the shared_ptr use count cannot drop to 1 after the function call.
To solve the problem, change reserve to resize, which actually creates the elements so that the iterator points to valid objects. Alternatively, avoid the iterator: use push_back and return back().
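As a minimal sketch of the two fixes described above (the names helper_resize, helper_push_back and Entry are made up for this illustration and are not from the original post), both variants below leave exactly one owner of the shared_ptr once the function has returned:

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical alias, introduced only to keep the signatures short.
using Entry = std::pair<std::shared_ptr<int>, int>;

// Fix 1: resize() creates real elements, so writing through the iterator is valid.
Entry helper_resize() {
    auto a = std::make_shared<int>(90);
    std::vector<Entry> container;
    container.resize(3);                 // size() == 3, elements are value-initialized
    auto iter = container.begin();
    for (int i = 0; i < 3; ++i) {
        *iter = std::make_pair(a, 100);  // assigns to an existing element
        ++iter;
    }
    return *(iter - 1);                  // copy of the last element
}

// Fix 2: keep reserve() for capacity, but create the elements with push_back().
Entry helper_push_back() {
    auto a = std::make_shared<int>(90);
    std::vector<Entry> container;
    container.reserve(5);                // capacity only, size() is still 0
    for (int i = 0; i < 3; ++i) {
        container.push_back(std::make_pair(a, 100));
    }
    assert(container.size() == 3);
    return container.back();
}
```

In both variants the vector knows about all three elements, so its destructor releases them when the function returns and only the returned copy keeps the int alive.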

passing file descriptors from the main process to its threads

I have a simple question about passing file descriptors from a process to its threads. I'm almost sure, but I need confirmation: are file descriptors treated as normal integers, so that they can be passed, for example as an array of integers, to a thread through the pthread_create() thread argument? Thanks
The rough definition of the term "process" could be "a memory space with at least one thread". In other words, all threads within the same process share a memory space.
Now, file descriptors are basically indices that reference objects within a table that belongs to the process. Since the objects belong to the process, and the threads operate inside the process, the threads can refer to these objects via their index ("file descriptor").
Yes, file descriptors are just integers and so can be passed as function arguments like any other variable. They will still refer to the same files, because the open files are shared by all the threads in a process.
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
struct files {
    int count;
    int* descriptors;
};
void* worker(void* p)
{
    struct files *f = (struct files*)p;
    // ...
    return NULL;
}
int main(void)
{
    struct files f;
    f.count = 4;
    f.descriptors = (int*)malloc(sizeof(int) * f.count);
    f.descriptors[0] = open("...", O_RDONLY);
    // ...
    pthread_t t;
    pthread_create(&t, NULL, worker, &f);
    // ...
    pthread_join(t, NULL); // f lives on main's stack, so join before it goes out of scope
    free(f.descriptors);
    return 0;
}

Reading and Writing Structs to and from Arduino's EEPROM

I'm trying to write data structures defined in C to my Arduino Uno board's non-volatile memory, so the values of the struct will be retained after the power goes off or the board is reset.
To my understanding, the only way to do this (while the sketch is running) would be to write to the Arduino's EEPROM. Although I can write individual bytes (this sets a byte with value 1 at address 0):
eeprom_write_byte(0,1);
I am stuck trying to write a whole struct:
typedef struct NewProject_Sequence {
    NewProject_SequenceId sequenceId;
    NewProject_SequenceLength maxRange;
    NewProject_SequenceLength minRange;
    NewProject_SequenceLength seqLength;
    NewProject_SceneId sceneList[5];
} NewProject_Sequence;
Because of the EEPROM's limit of 100,000 writes, I don't want to write to the Arduino in a loop going through each byte, for this will probably use it up pretty fast. Does anyone know a more efficient way of doing this, either with EEPROM or if there's a way to write to PROGMEM while the sketch is running? (without using the Arduino Library, just C).
RESOLVED
I ended up writing two custom functions -- eepromWrite and eepromRead. They are listed below:
void eepromRead(uint16_t addr, void* output, uint16_t length) {
    uint8_t* src;
    uint8_t* dst;
    src = (uint8_t*)addr;
    dst = (uint8_t*)output;
    for (uint16_t i = 0; i < length; i++) {
        *dst++ = eeprom_read_byte(src++);
    }
}
void eepromWrite(uint16_t addr, void* input, uint16_t length) {
    uint8_t* src;
    uint8_t* dst;
    src = (uint8_t*)input;
    dst = (uint8_t*)addr;
    for (uint16_t i = 0; i < length; i++) {
        eeprom_write_byte(dst++, *src++);
    }
}
They would be used like this:
uint16_t currentAddress = 0; // example EEPROM start address
struct {
    uint16_t x;
    uint16_t y;
} data, output;
// Both functions take a pointer to the buffer, so pass the address of the struct
eepromWrite(currentAddress, &data, sizeof(data));
eepromRead(currentAddress, &output, sizeof(output));
Several solutions, which can also be combined:
1. Set up a timer event to store the values periodically, rather than back to back.
2. Use a checksum, and increment the initial offset each time you write. When reading, try each offset in turn until you find a valid checksum. This spreads your data across the entire address range, increasing the EEPROM's life. Modern flash drives do this.
3. Catch the unit turning off by using an external brown-out detector (BOD) to trigger an interrupt that quickly writes the EEPROM. You can then also use the internal BOD to prevent corruption by stopping writes before the supply falls below a safe writing voltage. For this to work, the external threshold must be significantly higher than the internal one. The time available for writing before complete shutdown can be increased by increasing the VCC capacitance. The external BOD compares the voltage ahead of VCC, not VCC itself.
Here is a video explaining how to enable the internal BOD for an ATtiny; it is nearly identical for the other ATmegas. Video
The Arduino EEPROM library provides get/put functions that are able to read and write structs...
Link to EEPROM.put(...)
The write is made only when a byte has changed.
So, using put/get is the solution to your problem.
I'm using these in a large (25k) project without any problem.
And as already said, I use a timer so that I write periodically rather than on every change.
Detecting power-off is also a very good way to do this.
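To illustrate the "write only when a byte has changed" idea without the Arduino library, here is a host-side sketch. FakeEeprom is a hypothetical stand-in for the real EEPROM so that the logic can be run on a PC, and eepromUpdate and Settings are names made up for this example. On the AVR itself, avr-libc's eeprom_update_byte and eeprom_update_block provide the same behaviour.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-in for the EEPROM; counts physical cell writes.
struct FakeEeprom {
    std::array<std::uint8_t, 1024> cells;
    std::size_t writes = 0;

    FakeEeprom() { cells.fill(0xFF); }   // erased EEPROM cells read as 0xFF

    std::uint8_t read(std::size_t addr) const { return cells[addr]; }
    void write(std::size_t addr, std::uint8_t value) {
        cells[addr] = value;
        ++writes;                        // every call wears the cell once
    }
};

// Writes `length` bytes starting at `addr`, skipping cells that already hold the value.
void eepromUpdate(FakeEeprom& ee, std::size_t addr, const void* input, std::size_t length) {
    assert(addr + length <= ee.cells.size());
    const std::uint8_t* src = static_cast<const std::uint8_t*>(input);
    for (std::size_t i = 0; i < length; ++i) {
        if (ee.read(addr + i) != src[i]) {
            ee.write(addr + i, src[i]);
        }
    }
}

// Reads `length` bytes starting at `addr` into `output`.
void eepromRead(const FakeEeprom& ee, std::size_t addr, void* output, std::size_t length) {
    assert(addr + length <= ee.cells.size());
    std::uint8_t* dst = static_cast<std::uint8_t*>(output);
    for (std::size_t i = 0; i < length; ++i) {
        dst[i] = ee.read(addr + i);
    }
}

// Example payload, mirroring the two-field struct used earlier in the thread.
struct Settings {
    std::uint16_t x;
    std::uint16_t y;
};
```

Storing an unchanged Settings value a second time costs no cell writes in this sketch, and changing one field only rewrites the bytes that differ. The write limit applies per cell, so this is what keeps a periodic save from wearing the EEPROM out.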

OpenCV parallel_for not using multiple processors

I just saw in the new OpenCV 2.4.3 that they added a universal parallel_for. So following this example, I tried to implement it myself. I got it all functioning with my code, but when I timed its processing vs a similar loop done in a typical serial fashion with a regular "for" command, the results were insignificantly faster, or often a tiny bit slower!
I thought maybe this had something to do with my pushing into vectors or something (I'm a pretty big noob to parallel processing), so I set up a test loop of just running through a big number and it still doesn't work.
Code:
class Parallel_Test : public cv::ParallelLoopBody
{
private:
    double* const mypointer;
public:
    Parallel_Test(double* pointer)
        : mypointer(pointer) {
    }
    void operator() (const Range& range) const
    {
        // This overload needs to be here, otherwise the class is considered abstract.
        // qDebug()<<"This should never be called";
    }
    void operator() (const cv::BlockedRange& range) const
    {
        for (int x = range.begin(); x < range.end(); ++x) {
            mypointer[x] = x;
        }
    }
};
// TODO: loop over pixels in parallel
double t = (double)getTickCount();
// Test parallel looping at all
double data1[1000000];
cv::parallel_for(BlockedRange(0, 1000000), Parallel_Test(data1));
t = ((double)getTickCount() - t)/getTickFrequency();
qDebug() << "Parallel TEST time " << t << endl;
t = (double)getTickCount();
for (int i = 0; i < 1000000; i++) {
    data1[i] = i;
}
t = ((double)getTickCount() - t)/getTickFrequency();
qDebug() << "SERIAL Scan time " << t << endl;
output:
Parallel TEST time 0.00415479
SERIAL Scan time 0.00204597
Wow! I found the answer! "parallel_for" and "parallel_for_" (with a trailing underscore!) are totally different. You need the trailing underscore to make it work! Otherwise it will just run your loop in serial, and you will have to use a BlockedRange instead of a Range! AHH!
Thanks to @Daniil Osokin and especially @Vladislav Vinogradov for pointing this out!
So again, your code will need to look something like this:
cv::parallel_for_(Range(0, 1000000), Parallel_Test(data1));
More updated details at: http://answers.opencv.org/question/3730/how-to-use-parallel_for/
The problem is most likely that your loop body is too small.
It appears all you are doing is assigning a value to each element of an array.
You really need to think of a parallel for as an inefficient for loop. The work inside each iteration needs to be large enough that you wouldn't dream of getting speedups by unrolling the loop, because in addition to the usual decrement, compare and jump, you also have a few interlocked instructions and perhaps a virtual function call or two and some allocations.
So instead of a single assignment, try doing a good amount of real math or work on a large array of data.

Timeout in CUDA? / fermi / gtx465

I am using CUDA SDK 3.1 on MS VS2005 with GPU GTX465 1 GB. I have such a kernel function:
__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
    int holo_x = blockIdx.x*20 + threadIdx.x;
    int holo_y = blockIdx.y*20 + threadIdx.y;
    float k=2.0f*3.14f/0.000000054f;
    if (firstTime[0]==1.0f)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X]=0.0f;
    }
    for (int i=0; i<pointsNumber[0]; i++)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X]=pIntensity[holo_x+holo_y*MAX_FINAL_X]+A[i]*cosf(k*sqrtf(pow(holo_x-X[i],2.0f)+pow(holo_y-Y[i],2.0f)+pow(Z[i],2.0f)));
    }
    __syncthreads();
}
and this is function which calls kernel function:
extern "C" void go2(float *pDATA, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
    dim3 blockGridRows(MAX_FINAL_X/20,MAX_FINAL_Y/20);
    dim3 threadBlockRows(20, 20);
    CRT_GPU_2<<<blockGridRows, threadBlockRows>>>(pDATA, X, Y, Z, pIntensity,firstTime, pointsNumber);
    CUT_CHECK_ERROR("multiplyNumbersGPU() execution failed\n");
    CUDA_SAFE_CALL( cudaThreadSynchronize() );
}
I am loading all the parameters for this function in a loop (for example, 4096 elements for each parameter in one loop iteration). In total I want to run this kernel for 32768 elements for each parameter after all loop iterations.
MAX_FINAL_X is 1920 and MAX_FINAL_Y is 1080.
When I start the algorithm, the first iteration goes very fast, and after one or two more iterations I get a CUDA timeout error. I used this algorithm on a GTX260 GPU and it did better as far as I remember...
Could you help me? Maybe I am making some mistake with respect to the new Fermi architecture in this algorithm?
It would be better to call CUT_CHECK_ERROR after cudaThreadSynchronize(), because kernels run asynchronously and you must wait for the kernel to finish to learn about its errors... Maybe in the second iteration you are receiving an error from the first kernel launch.
Be sure that you have a valid number in the most interesting variable, pointsNumber[0] (a bad value might cause a very long inner loop).
You could also improve the speed of your kernel function:
Use better blocks. A 20x20 thread configuration will cause very slow memory access (see the Programming Guide and Best Practices). Try to use 16x16 blocks. Note that 1080 is not a multiple of 16, so the grid size would have to be rounded up and the kernel would need a bounds check.
Do not use the pow(..., 2.0) function. It's faster to use a SQR macro (#define SQR(x) ((x)*(x)))
You don't use shared memory, so __syncthreads() is not required.
PS: You could also pass parameters to CUDA kernels by value, not only through pointers. Speed will be the same.
PPS: please improve the code's readability... Right now you must edit six places to change the block configuration... Inside the kernel you could use the blockDim variable, and you could use constants in the go2 function.
You could also use a bool firstTime - it would be MUCH better than a float.
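The arithmetic advice above can be tried out on the host before touching the kernel. The following is only a plain C++ mirror of the per-pixel sum, not CUDA code; pixelIntensity and sqr are names made up for this sketch, and taking the wave number k and the point count by value is an assumption that follows the "pass value parameters" suggestion. It squares by multiplication instead of calling pow, and it accumulates into a local variable so that the output would be written only once per pixel:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Squaring by multiplication; an inline function avoids the precedence
// and double-evaluation pitfalls that a macro can have.
inline float sqr(float v) { return v * v; }

// Host-side mirror of the sum one GPU thread computes for pixel (holo_x, holo_y).
float pixelIntensity(int holo_x, int holo_y,
                     const std::vector<float>& A, const std::vector<float>& X,
                     const std::vector<float>& Y, const std::vector<float>& Z,
                     float k)
{
    assert(A.size() == X.size() && X.size() == Y.size() && Y.size() == Z.size());
    float sum = 0.0f;  // local accumulator instead of repeated reads and writes of the output array
    for (std::size_t i = 0; i < A.size(); ++i) {
        const float r = std::sqrt(sqr(holo_x - X[i]) + sqr(holo_y - Y[i]) + sqr(Z[i]));
        sum += A[i] * std::cos(k * r);
    }
    return sum;
}
```

In the kernel, the same structure would mean keeping sum in a register, using cosf and sqrtf, and storing to pIntensity once after the loop.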
Is your GPU connected to a display? If so, I believe the default is that kernel execution will be aborted after 5 seconds. You can check whether kernel execution will time out by using cudaGetDeviceProperties (the kernelExecTimeoutEnabled field) - see reference page
In the kernel's loop you write to the same array that you read from. For global memory this is the worst usage pattern, because warps from different blocks wait for each other.

Resources