CUDA global memory access speed

Here is some simple CUDA code. I am testing the time it takes to access global memory, for both reads and writes.
Below is the kernel function (test1()):
__global__ void test1(int *direct_map)
{
    int index = 10;
    int index2;
    for (int j = 0; j < 1024; j++)
    {
        index2 = direct_map[index];
        direct_map[index] = -1;
        index = index2;
    }
}
direct_map is a 683*1024 linear matrix, and each pixel holds an offset value used to access another pixel. index and index2 are not contiguous addresses.
This kernel function takes about 600 microseconds.
But if I delete the line
direct_map[index] = -1;
it takes just 27 microseconds.
I think the value of direct_map[index] has already been read from global memory by
index2 = direct_map[index];
so it should be sitting in the L2 cache. Therefore, the write direct_map[index] = -1; should be fast.
I also tested random writes to global memory (test2()), which take about 120 microseconds.
__global__ void test2(int *direct_map)
{
    int index = 10;
    for (int j = 0; j < 1024; j++)
    {
        direct_map[index] = -1;
        index = j*683 + j/3 - 1;
    }
}
So I don't understand why test1() takes more than 600 microseconds.
Thank you.

When you delete the code line:
direct_map[index] = -1;
your kernel isn't doing anything useful. With the store removed, the kernel no longer affects any global state, so from the compiler's perspective the remaining code is effectively dead, and the compiler can eliminate most of the code associated with the kernel launch.
You can verify this by dumping the assembly the compiler generates in each case, for example with cuobjdump -sass myexecutable.
Anytime you make a small change to the code and see a large change in timing, you should suspect that the change you made has allowed the compiler to make different optimization decisions.
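If you want to time the reads on their own without the compiler discarding them, one option is to keep a single store that consumes the result of the load chain. Here is a minimal sketch of that idea (test1_reads and the out parameter are illustrative names, not from the original code):
__global__ void test1_reads(const int *direct_map, int *out)
{
    int index = 10;
    for (int j = 0; j < 1024; j++)
    {
        index = direct_map[index];  // dependent load: each read feeds the next
    }
    out[0] = index;                 // one store keeps the whole chain observable
}
Because the final store depends on every load in the chain, none of the reads can be optimized away, yet the kernel performs only a single write.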


Why does allocating a float in Metal's threadgroup address space give different results depending on the hardware?

I have recently been working on a soft-body physics simulation based on the following paper. The implementation uses points and springs and involves calculating the volume of the shape which is then used to calculate the pressure that is to be applied to each point.
On my MacBook Pro (2018, 13") I used the following code to calculate the volume for each soft-body in the simulation since all of the physics for the springs and mass points were being handled by a separate threadgroup:
// Gauss's theorem
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);
// No memory fence is applied, and threadgroup_barrier
// acts only as an execution barrier.
threadgroup_barrier(mem_flags::mem_none);
threadgroup float volume = 0;
// Only do this calculation once on the first thread in the threadgroup.
if (threadIndexInThreadgroup == 0) {
    for (uint i = 0; i < threadsPerThreadgroup; ++i) {
        volume += shared_memory[i];
    }
}
// mem_none is probably all that is necessary here.
threadgroup_barrier(mem_flags::mem_none);
// Do calculations that depend on volume.
With shared_memory being passed to the kernel function as a threadgroup buffer:
threadgroup float* shared_memory [[ threadgroup(0) ]]
This worked well until, much later, I ran the code on an iPhone and on an M1 MacBook: the simulation broke down completely, with the soft bodies disappearing shortly after the application started.
The solution to this problem was to store the result of the volume sum in a threadgroup buffer, threadgroup float* volume [[ threadgroup(2) ]], and do the volume calculation as follows:
// -*- Volume calculation -*-
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);
threadgroup_barrier(mem_flags::mem_none);
if (threadIndexInThreadgroup == 0) {
    auto sum = shared_memory[0];
    for (uint i = 1; i < threadsPerThreadgroup; ++i) {
        sum += shared_memory[i];
    }
    *volume = sum;
}
threadgroup_barrier(mem_flags::mem_none);
float epsilon = 0.000001;
float pressurev = rAB * pressure * divide(1.0, *volume + epsilon);
My question is: why would the initial method work on my MacBook but not on the other hardware, and is the new version now the correct way of doing this? If it is wrong to allocate a float in the threadgroup address space like this, then what is the point of being able to do so?
As a side note, I am using mem_flags::mem_none since it seems unnecessary to enforce an ordering of memory operations to threadgroup memory in this case. I just want to make sure each thread has written to shared_memory at this point, but the order in which they do so shouldn't matter. Is this assumption correct?
You should use mem_flags::mem_threadgroup, but I think the main problem is that Metal can't initialize threadgroup memory with an initializer like that; the spec is unclear about this.
Try:
threadgroup float volume;
// Only do this calculation once on the first thread in the threadgroup.
if (threadIndexInThreadgroup == 0) {
    volume = 0;
    for (uint i = 0; i < threadsPerThreadgroup; ++i) {
        volume += shared_memory[i];
    }
}
If you don't want to use a threadgroup buffer, the correct way to do this is the following:
// -*- Volume calculation -*-
threadgroup float volume = 0;
// Gauss's theorem
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);
// No memory fence is applied, and threadgroup_barrier
// acts only as an execution barrier.
threadgroup_barrier(mem_flags::mem_none);
if (threadIndexInThreadgroup == 0) {
    volume = shared_memory[0];
    // Index using signed int types rather than unsigned.
    for (int i = 1; i < int(threadsPerThreadgroup); ++i) {
        volume += shared_memory[i];
    }
}
threadgroup_barrier(mem_flags::mem_none);
You can use either threadgroup_barrier(mem_flags::mem_none) or threadgroup_barrier(mem_flags::mem_threadgroup); it appears to make no difference.

Compute shader hangs when setting two pixels while looping RWTexture2D (DirectX11, SM5)

I'm trying to run some basic cellular automata in a compute shader (DirectCompute), but without double buffering, so I'm using an unordered access view of a RWTexture2D<uint> for the data. However, I'm hitting a really strange hang/crash, and I was able to reduce it to a very small snippet that reproduces the issue:
int w = 256;
for (int x = 0; x < w; ++x)
{
    for (int y = 1; y < w; ++y)
    {
        if (map[int2(x, y - 1)])
        {
            map[int2(x, y)] = 10;
            map[int2(x, y - 1)] = 30;
        }
    }
}
where map is a RWTexture2D<uint>.
If I remove the if or one of the assignments, it works. I thought it could be some kind of limit, so I tried looping over just 1/4 of the texture, but the problem persists. The code is dispatched with (1,1,1), and the kernel's numthreads is (1,1,1) too. In my real-world scenario I want to loop from bottom to top and fill the voids (0) with the pixel I'm currently looping over (think of a "falling sand" kind of effect), so it can't be parallelized except across columns, since each pixel depends on the one below it.
I don't understand what is causing the shader to hang, though; there's no error or anything, it simply hangs and never even times out.
EDIT:
After some further investigation, I came across something really intriguing: when I pass that w value in a constant buffer, it all works fine. I have no idea what would cause that. Maybe it's a compiler optimization gone wrong, perhaps an attempt to unroll the loop, and passing the value in a constant buffer disables it; however, I'm compiling the shaders in debug with no optimization, so I don't know.
I've had issues declaring variables at global scope like this before. I believe it's because the variable is not static const (so declare it as static const and it should work). Most likely the compiler is treating it as a constant buffer (with some default naming), and since you're not binding a buffer its contents are undefined, which causes undefined results. So the following code should work:
static const int w = 256;
for (int x = 0; x < w; ++x)
{
    for (int y = 1; y < w; ++y)
    {
        if (map[int2(x, y - 1)])
        {
            map[int2(x, y)] = 10;
            map[int2(x, y - 1)] = 30;
        }
    }
}
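For completeness, the workaround from your edit, passing w in a constant buffer, also has to be wired up on the host side. A rough D3D11 sketch (the device, context, and b0 slot are assumptions for illustration; the matching HLSL declaration is shown as a comment):
// HLSL side, for reference:  cbuffer Params : register(b0) { int w; };
D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = 16;                      // constant buffer sizes must be multiples of 16 bytes
desc.Usage = D3D11_USAGE_DEFAULT;
desc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;

int params[4] = { 256, 0, 0, 0 };         // w plus padding up to 16 bytes
D3D11_SUBRESOURCE_DATA init = { params, 0, 0 };

ID3D11Buffer* cb = nullptr;
device->CreateBuffer(&desc, &init, &cb);  // assumes an ID3D11Device* device
context->CSSetConstantBuffers(0, 1, &cb); // assumes an ID3D11DeviceContext* context, slot b0
Either approach works; static const bakes the value into the shader, while the constant buffer lets you change it without recompiling.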

How to use gpu::pyrdown in opencv?

I found that using pyrDown and pyrUp leaves my DownUp matrix full of zeros for some odd reason. However, when I do the same thing on the CPU, the results are perfectly fine.
NOTE: I'm using opencv4tegra on the Jetson TK1, if that matters at all.
for (int i = 0; i < Pyramid_Size; i++) {
    cv::gpu::pyrDown(DownUp, DownUp);
}
for (int i = 0; i < Pyramid_Size; i++) {
    cv::gpu::pyrUp(DownUp, DownUp);
}
Anyone know why this may be?
EDIT:
DownUp.upload(Input);
GpuMat buffer;
DownUp.copyTo(buffer);
for (int i = 0; i < Pyramid_Size; i++, DownUp.copyTo(buffer)) {
    cv::gpu::pyrDown(buffer, DownUp);
}
for (int i = 0; i < Pyramid_Size; i++, DownUp.copyTo(buffer)) {
    cv::gpu::pyrUp(buffer, DownUp);
    GpuMat a = GpuMat(DownUp.size(), CV_32F);
    a.setTo(20.0f);
    cv::gpu::add(DownUp, a, DownUp);
}
This is now working in my code, but it is SIGNIFICANTLY slower than the CPU version: the GPU version takes around 1.6-2 seconds total to run, while the CPU takes 0.1 seconds.
I also noticed that sending the data from host to device takes much longer than simply processing it on the CPU. Is there any way in OpenCV to speed this up? I'm definitely doing something wrong; even large 5 MP images are faster to down/upsample on the CPU.
Neither gpu::pyrDown nor gpu::pyrUp in OpenCV 2.4 can operate in-place. Please use separate GpuMat objects for input and output.
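Given that constraint, one way to avoid the extra copyTo in every iteration is to ping-pong between two GpuMats. A minimal sketch of the idea (the wrapper function is illustrative; Input and Pyramid_Size are the names from the question):
#include <opencv2/gpu/gpu.hpp>   // OpenCV 2.4 gpu module
#include <algorithm>             // std::swap

void downUpSample(const cv::Mat& Input, int Pyramid_Size, cv::Mat& result)
{
    cv::gpu::GpuMat a, b;
    a.upload(Input);                  // one host-to-device transfer
    for (int i = 0; i < Pyramid_Size; i++) {
        cv::gpu::pyrDown(a, b);       // distinct source and destination
        std::swap(a, b);              // shallow header swap, no pixel copy
    }
    for (int i = 0; i < Pyramid_Size; i++) {
        cv::gpu::pyrUp(a, b);
        std::swap(a, b);
    }
    a.download(result);               // one device-to-host transfer at the end
}
Keeping the data on the GPU and transferring only once in each direction also addresses the host-to-device overhead you measured; for small images the transfer alone can easily dominate the CPU version's total runtime.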

SIGSEGV Error occurs on the actual device but not on ios simulator

I am the developer of a Cydia tweak named CountdownCenter. I was trying to create a digital clock appearance for my widget, so I built a sample app that does this perfectly. I then transferred the code into my tweak, but it causes a SIGSEGV crash on my iPhone. After some testing I found the part responsible for the crash, but I just can't see what's wrong, as this code works in a normal app. Can you help me, please?
Here is the code:
int digitarray[10];
int c = 0;
digitarray[0] = 0;
digitarray[1] = 0;
while (secon > 0) {
    int digitt = secon % 10;
    digitarray[c] = digitt;
    secon /= 10;
    c++;
}
lbl.text = [NSString stringWithFormat:@"%d", digitarray[0]];
[self selectimage:digitarray[0] img:numview10];
SIGSEGV usually means that you're trying to access memory that you are not allowed to access. (By the way, are you testing this with a release build?) For example, maybe here (there are some other similar places too):
c = 0;
while (secon > 0) {
    int digitt = secon % 10;
    digitarray[c] = digitt;
    secon /= 10;
    c++;
}
There are a few possibilities:
secon (a horrible variable name, by the way...) is a float or double, in which case the while loop is either never entered or never left
c grows so large that it indexes beyond digitarray's bounds (see the sketch below)
To track this down, I would recommend putting a breakpoint at the beginning of the code and stepping through it.
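The second possibility is cheap to rule out. A minimal sketch of a bounds-guarded version (variable names taken from the question; 10 is the declared size of digitarray):
int digitarray[10] = {0};          // zero-initialize every slot up front
int c = 0;
while (secon > 0 && c < 10) {      // never index past the end of digitarray
    digitarray[c] = secon % 10;
    secon /= 10;
    c++;
}
If the crash disappears with the guard in place, secon was producing more digits than digitarray can hold, and the original loop was writing past the end of the array.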

Column sum of Opencv Matrix elements

I need to compute the sum of the elements in each column separately. Currently I'm using the code below, where cross_corr is the matrix to be summed:
Mat cross_corr_summed;
for (int i = 0; i < cross_corr.cols; i++)
{
    double column_sum = 0;
    for (int k = 0; k < cross_corr.rows; k++)
    {
        column_sum += cross_corr.at<float>(k, i);
    }
    cross_corr_summed.push_back(column_sum);
}
The problem is that my program takes quite a long time to run, and this is one of the parts I suspect of causing it.
Can you suggest a faster implementation? Thanks!
You need a cv::reduce:
cv::reduce(cross_corr, cross_corr_summed, 0, CV_REDUCE_SUM, CV_32S);
If you know that your data is continuous and single-channel, you can access the matrix data directly:
int width = cross_corr.cols;
float* data = (float*)cross_corr.data;
Mat cross_corr_summed;
for (int i = 0; i < cross_corr.cols; i++)
{
    double column_sum = 0;
    for (int k = 0; k < cross_corr.rows; k++)
    {
        column_sum += data[i + k * width];
    }
    cross_corr_summed.push_back(column_sum);
}
which will be faster than your use of .at<float>(). In general I avoid using .at() whenever possible because it is slower than direct access.
Also, although cv::reduce() (suggested by Andrey) is much more readable, I have found it is slower than even your implementation in some cases.
Another alternative is to sum each column with cv::sum:
Mat originalMatrix;
Mat columnSum;
for (int i = 0; i < originalMatrix.cols; i++)
    columnSum.push_back(cv::sum(originalMatrix.col(i))[0]);
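If you want to check which variant is fastest on your own data, a quick timing harness using cv::getTickCount is easy to put together. A minimal sketch (the 1024x1024 CV_32F matrix is an arbitrary test size, not from the question):
#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
    cv::Mat m(1024, 1024, CV_32F);
    cv::randu(m, 0.0f, 1.0f);                          // fill with random test data

    int64 t0 = cv::getTickCount();
    cv::Mat summed;
    cv::reduce(m, summed, 0, CV_REDUCE_SUM, CV_32F);   // dim 0 collapses rows, giving column sums
    double ms = (cv::getTickCount() - t0) * 1000.0 / cv::getTickFrequency();

    std::cout << "cv::reduce: " << ms << " ms" << std::endl;
    return 0;
}
Swap the body between the timing calls for the manual loop or the cv::sum version to compare them under identical conditions.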
