OpenCV parallel_for not using multiple processors - opencv

I just saw in the new OpenCV 2.4.3 that they added a universal parallel_for. So following this example, I tried to implement it myself. I got it all functioning with my code, but when I timed its processing vs a similar loop done in a typical serial fashion with a regular "for" command, the results were insignificantly faster, or often a tiny bit slower!
I thought maybe this had something to do with my pushing into vectors or something (I'm a pretty big noob to parallel processing), so I set up a test loop of just running through a big number and it still doesn't work.
Code:
class Parallel_Test : public cv::ParallelLoopBody
{
private:
double* const mypointer;
public:
Parallel_Test(double* pointer)
: mypointer(pointer){
}
void operator() (const Range& range) const
{
//This constructor needs to be here otherwise it is considered an abstract class.
// qDebug()<<"This should never be called";
}
void operator ()(const cv::BlockedRange& range) const
{
for (int x = range.begin(); x < range.end(); ++x){
mypointer[x]=x;
}
}
};
//TODO Loop pixels in parallel
double t = (double)getTickCount();
//TEST PARALELL LOOPING AT ALL
double data1[1000000];
cv::parallel_for(BlockedRange(0, 1000000), Parallel_Test(data1));
t = ((double)getTickCount() - t)/getTickFrequency();
qDebug() << "Parallel TEST time " << t << endl;
t = (double)getTickCount();
for(int i =0; i<1000000; i++){
data1[i]=i;
}
t = ((double)getTickCount() - t)/getTickFrequency();
qDebug() << "SERIAL Scan time " << t << endl;
output:
Parallel TEST time 0.00415479
SERIAL Scan time 0.00204597

Wow! I found the answer! "parallel_for" and "parallel_for_" (with a trailing underscore!) are totally different. You need the trailing underscore to make it work! Otherwise it will just run your loop in serial and you will have to use a BLOCKEDRANGE instead of a range! AHH!
Thanks to #Daniil Osokin and especially #Vladislav Vinogradov for pointing this out!
So again you code will need to look something like this:
cv::parallel_for_(Range(0, 1000000), Parallel_Test(data1));
More updated details at: http://answers.opencv.org/question/3730/how-to-use-parallel_for/

The problem is most likely that your loop body is too small.
It appears all you are doing is assigning a pointer in one vector to another.
You really need to think of a parallel for as an inefficient for loop, that is the work inside each iteration needs to be large enough so that you wouldn't dream of getting speedups by unrolling the loop because in addition to the usual decrement, compare and jump that can go on you also have a few interlocked instructions and perhaps a virtual function call or two and some allocations.
So instead of copying a pointer try doing a good amount of real math or work on a large array of data.

Related

c++ vecotr with iterator can't be destroyed?

Recently I have faced a problem when the program running the memory keep increasing, and when program is closed the memory would restore normal level. Obviously, it's a memory leak. After some work, I have located the code responsible, but I don't know why? The program's work flow is simple:
first use lidar api to get point cloud and image data;
then transport to next tbb flow graph to process these data;
finally use open3d api to visualzie them.
In the first step, the lidar itself's api use asio to asynchronously invoke some callback function to transport data, so I create some tbb concurrent_queue to store these data, and a align function to match cloud and image with timestamp. The problem is in the align function. In the function, I create a vector<shared_ptr<open3d::..::PointCloud>> and use iterator to store point cloud elements. However, I found when the function complete, the shared_ptr use count don't reduce . Similar but simpler example code like this:
std::pair<std::shared_ptr<int>, int> helper() {
auto a = std::make_shared<int>(90);
auto c = 100;
std::vector<std::pair<std::shared_ptr<int>, int>> container;
container.reserve(5);
auto iter = container.begin();
for (int i = 0; i < 3; i++) {
*iter = std::make_pair(a, c);
iter++;
}
return *(iter-1);
}
int main() {
auto b = helper();
std::cout << "shared_ptr use count: " << std::get<0>(b).use_count() << std::endl;
return 0;
}
Ubuntu 20.04 + gcc 9.4, the print result is shared_ptr use count: 4.
Why the vector can't be auto destroyed when function is completed? Hope someone kindly explain this problem.
Thanks #Retired Ninja! The root of the problem is vector.reserve just reserve capacity not physical space. So the vector space after reserve is 0. The following iterator operation is assumed point to some undifined memory. While the result can be transport to main function with no value error, the shared_ptr use count can't reduce to 1 after function call.
To solve the problem, One can just modify the reserve to resize, which can change physical space of the vector and iterator point to defined memory space. Or avoid use iterator, just use push_back and return back().

Loading/Storing to XMFLOAT4 faster than using XMVECTOR?

I'm going through the DirectX Math/XNA Math library, and I got curious when I read about the alignment requirements for XMVECTOR (Now DirectX::XMVECTOR), and how it is expected of you to use XMFLOAT* for members instead, using XMLoad* and XMStore* when performing mathematical operations. I was specifically curious about the tradeoffs, so I did an experiment, as I'm sure many others have, and tested to see exactly how much you lose having to load and store the vectors for each operation. This is the resulting code:
#include <Windows.h>
#include <chrono>
#include <cstdint>
#include <DirectXMath.h>
#include <iostream>
using std::chrono::high_resolution_clock;
#define TEST_COUNT 1000000000l
int main(void)
{
DirectX::XMVECTOR v1 = DirectX::XMVectorSet(1, 2, 3, 4);
DirectX::XMVECTOR v2 = DirectX::XMVectorSet(2, 3, 4, 5);
DirectX::XMFLOAT4 x{ 1, 2, 3, 4 };
DirectX::XMFLOAT4 y{ 2, 3, 4, 5 };
std::chrono::system_clock::time_point start, end;
std::chrono::milliseconds duration;
// Test with just the XMVECTOR
start = high_resolution_clock::now();
for (uint64_t i = 0; i < TEST_COUNT; i++)
{
v1 = DirectX::XMVectorAdd(v1, v2);
}
end = high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
DirectX::XMFLOAT4 z;
DirectX::XMStoreFloat4(&z, v1);
std::cout << std::endl << "z = " << z.x << ", " << z.y << ", " << z.z << std::endl;
std::cout << duration.count() << " milliseconds" << std::endl;
// Now try with load/store
start = high_resolution_clock::now();
for (uint64_t i = 0; i < TEST_COUNT; i++)
{
v1 = DirectX::XMLoadFloat4(&x);
v2 = DirectX::XMLoadFloat4(&y);
v1 = DirectX::XMVectorAdd(v1, v2);
DirectX::XMStoreFloat4(&x, v1);
}
end = high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << std::endl << "x = " << x.x << ", " << x.y << ", " << x.z << std::endl;
std::cout << duration.count() << " milliseconds" << std::endl;
}
Running a debug build yields the output:
z = 3.35544e+007, 6.71089e+007, 6.71089e+007
25817 milliseconds
x = 3.35544e+007, 6.71089e+007, 6.71089e+007
84344 milliseconds
Okay, so about thrice as slow, but does anyone really take perf tests on debug builds seriously? Here are the results when I do a release build:
z = 3.35544e+007, 6.71089e+007, 6.71089e+007
1980 milliseconds
x = 3.35544e+007, 6.71089e+007, 6.71089e+007
670 milliseconds
Like magic, XMFLOAT4 runs almost three times faster! Somehow the tables have turned. Looking at the code, this makes no sense to me; the second part runs a superset of the commands that the first part runs! There must be something going wrong, or something I am not taking into account. It is difficult to believe that the compiler was able to optimize the second part nine-fold over the much simpler, and theoretically more efficient first part. The only reasonable explanations I have involve either (1) cache behavior, (2) some crazy out of order execution that XMVECTOR can't take advantage of, (3) the compiler is making some insane optimizations, or (4) using XMVECTOR directly has some implicit inefficiency that was able to be optimized away when using XMFLOAT4. That is, the default way the compiler loads and stores XMVECTORs from memory is less efficient than XMLoad* and XMStore*. I attempted to inspect the disassembly, but I'm not all that familiar with X86 and/or SSE2 and Visual Studio does some crazy optimizations making it difficult to follow along with the source code. I also tried the Visual Studio performance analysis tool, but that didn't help as I can't figure out how to make it show the disassembly instead of the code. The only useful information I get out of that is that the first call to XMVectorAdd accounts for ~48.6% of all cycles while the second call to XMVectorAdd accounts for ~4.4% of all cycles.
EDIT:
After some more debugging, here is the assembly for the code that gets run inside of the loop. For the first part:
002912E0 movups xmm1,xmmword ptr [esp+18h] <-- HERE
002912E5 add ecx,1
002912E8 movaps xmm0,xmm2 <-- HERE
002912EB adc esi,0
002912EE addps xmm0,xmm1
002912F1 movups xmmword ptr [esp+18h],xmm0 <-- HERE
002912F6 jne main+60h (0291300h)
002912F8 cmp ecx,3B9ACA00h
002912FE jb main+40h (02912E0h)
And for the second part:
00291400 add ecx,1
00291403 addps xmm0,xmm1
00291406 adc esi,0
00291409 jne main+173h (0291413h)
0029140B cmp ecx,3B9ACA00h
00291411 jb main+160h (0291400h)
Note that these two loops are indeed nearly identical. The only difference is that the first for loop appears to be the one doing the loading and storing! It would appear as though Visual Studio made a ton of optimizations since x and y were on the stack. Changing them both to be on the heap (thus the writes must happen), and the machine code is now identical. Is this generally the case? Is there really no negative side effect to using the storage classes? Other than the fully optimized versions I suppose.
If you define
DirectX::XMVECTOR v3 = DirectX::XMVectorSet(2, 3, 4, 5);
and use v3 instead v1 as a result:
...
for (uint64_t i = 0; i < TEST_COUNT; i++)
{
v3 = DirectX::XMVectorAdd(v1, v2);
}
you got code faster then 2-nd part code using XMLoadFloat4 and XMStoreFloat4
Firstly, don't use Visual Studio's "high-resolution clock" for perf timing. You should use QueryPerformanceCounter instead. See Connect.
SIMD performance is difficult to measure in these micro tests because the overhead of loading up vector data can often dominate with such trivial ALU usage. You really need to do something substantial with the data to see the benefits. Also keep in mind that depending on your compiler settings, the compiler itself may be using the 'scalar' SIMD functionality or even auto-vectoring such trivial code loops.
You are also seeing some issues with the way you are generating your test data. You should create something larger than a single vector on the heap and use that as your source/dest.
PS: The best way to create 'static' XMVECTOR data is to use the XMVECTORF32 type.
static const DirectX::XMVECTORF32 v1 = { 1, 2, 3, 4 };
Note that if you want to have the load/save conversions between XMVECTOR and XMFLOATx to be "automatic", take a look at SimpleMath in the DirectX Tool Kit. You just use types like SimpleMath::Vector4 in your data structures, and the implicit conversion operators take care of calling XMLoadFloat4 / XMStoreFloat4 for you.

CUDA kernels and memory access (one kernel doesn't execute entirely and the next doesn't get launched)

I'm having trouble here. I launch two kernels , check if some value is the one expected (memcpy to the host), if it is I stop, if it isn't I launch the two kernels again.
the first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
int obj = threadIdx.x;
int ant = blockIdx.x;
int id = threadIdx.x + blockIdx.x * blockDim.x;
*(data->added) = 1;
while(*(data->added) == 1)
{
*(data->added) = 0;
//check if obj fits
int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));
if(obj == 0)
printf("ant %d going..\n", ant);
__syncthreads();
...
The code goes on after this. But that printf never gets printed, that syncthreads is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually throws bugs in the code, i just removed it for now. This "added" variable isn't the smartest thing to do but it's faster than the alternative, which is checking if any variable within an array is some value on the host and deciding to keep iterating or not.
The getElement, simply does the matrix memory calculation with the pitch to access the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right on the beginning, and it doesn't get printed either and after the printf it sets a variable on the device memory, and this memory is copied to the CPU after the kernel finished, and it isn't the right value when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances, when I launch 8 blocks and 512 threads, it runs OK. 32 blocks, 512 threads, OK. But 8 blocks and 1024 threads, and this happens, the kernel doesn't work, neither 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: tried removing the "added" variable and the while loop, so it should execute just once. Still doesnt work, nothing gets printed, even if the printf is right after the three initial lines and the next kernel also doesn't print anything.
edit: another thing, I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with 512 maximum or check on how higher I can put this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't due to the problem mentioned above, so the output never shows up. You could have figured out this by testing the return codes all CUDA function calls for errors.

char buffers comparison

i have two char buffers which i am trying to compare parts of them. i am having a weird problem. i have the following code:
char buffer1[50], buffer2[60];
// Get buffer1 and buffer2 from the network by reading sockets
for(int i = 0; i < 20; i++)
{
if(buffer1[15+i] != buffer2[25+i])
{
printf("%c", buffer1[15+i]);
printf("%c", buffer2[25+i]);
printf("%02x", (unsigned char)buffer1[15+i]);
printf("%02x", (unsigned char)buffer2[25+i]);
break;
}
}
The above code is a simplified version of my actual code which I didnt copy-paste here because its too long. Just in case this might help, I got those two buffer over the network by reading sockets.
The problem is the loop breaks even when both the buffers are the same. To check what is in the buffers, I added the two print statements inside the if statement. And the weird thing is is, the printf statements both print the same value for %c and %02x, but the comparison fails and the loop breaks.
(Disclaimer: I'm not a C/++ expert)
It seems to me like the data is changing while you're looking at it. Two quick questions come to mind:
If you run this in the debugger, and go over the loop step-by-step, does it still happen? If it doesn't, then I strongly suspect my second question will lead you to the answer.
Is the read operation asynchronous? It seems like data is still being read while you're inside the for loop, meaning you didn't wait for the read to finish.
The only thing I see is a timing issue. If they are not the same on the if statement and they are the same on the print statement someone changed them in between.

Ordering Output in MPI

in a simple MPI program I have used a column wise division of a large matrix.
How can I order the output so that each matrix appears next to the other ordered ?
I have tried this simple code the effect is quite different from the wanted:
for(int i=0;i<10;i++)
{
for(int k=0;k<numprocs;k++)
{
if (my_id==k){
for(int j=1;j<10;j++)
printf("%d",data[i][j]);
}
MPI_Barrier(com);
}
if(my_id==0)
printf("\n");
}
Seems that each process has his own stdout and so is impossible to have ordered lines output without sending all the data to one master which will print out. Is my guess true ? Or what I'm doing wrong ?
You guessed right. The MPI standard does not specify how stdout from different nodes should be collected for printing at the originating process. It is often the case that when multiple processes are doing prints the output will get merged in an unspecified way. fflush doesn't help.
If you want the output ordered in a certain way, the most portable method would be to send the data to the master process for printing.
For example, in pseudocode:
if (rank == 0) {
print_col(0);
for (i = 1; i < comm_size; i++) {
MPI_Recv(buffer, .... i, ...);
print_col(i);
}
} else {
MPI_Send(data, ..., 0, ...);
}
Another method which can sometimes work would be to use barries to lock step processes so that each process prints in turn. This of course depends on the MPI Implementation and how it handles stdout.
for(i = 0; i < comm_size; i++) {
MPI_Barrier(MPI_COMM_WORLD);
if (i == rank) {
printf(...);
}
}
Of course, in production code where the data is too large to print sensibly anyway, data is eventually combine by having each process writing to a separate file and merged separately, or using MPI I/O (defined in the MPI2 standards) to coordinate parallel writes.
I produced ordered output to a file before using the exact same method. You could try printing to a temporary file, printing the contents of said file and then deleting it.
Have the root processor do all of the printing. Use MPI_Send/MPI_Recv or MPI_Gather (or whatever) to send the data in turn from each processor to the root.
To solve this problem you can use short sleep. I use and then it works in 99%
printf("text nr 1\n");
MPI_Barrier(MPI_COMM_WORLD);
usleep(100);
printf("text nr 2\n");
It's not very elegant but works.

Resources