prefetching pd (4 double) into __m256d register - avx

I want to prefetch some data using AVX. I was checking the Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/), but it only lists _mm_prefetch(...) under SSE. Does anyone know a workaround for AVX?
Update 19.02.15:
Maybe I am misunderstanding the purpose of prefetching, so I want to describe the problem in a bit more detail:
#include <x86intrin.h>
...
__m128 x0 = ...;
...
// doing some vector operations ...
for (int i=0; i<ndiv4; ++i) {
    _mm_prefetch((char*) (y+4*i+8), _MM_HINT_NTA); // prefetch data for two iterations later
    __m128 x1 = _mm_load_ps(x+4*i); // aligned load
    __m128 x2 = _mm_mul_ps(x0,x1); // x0 defined earlier
    _mm_store_ps(x+4*i,x2); // store aligned back
}
(I know that the prefetch might not necessarily help in this case.)
My question is whether, and how, I could do the same thing using __m256d registers and the pd intrinsics.

I think the literal answer to "how I could do it using __m256d registers and pd" would be this:
for (int i=0; i<ndiv4; ++i) {
    _mm_prefetch((char*) (y+4*i+8), _MM_HINT_NTA); // prefetch data for two iterations later
    __m256d x1 = _mm256_load_pd(x+4*i); // aligned load
    __m256d x2 = _mm256_mul_pd(x0,x1); // x0 defined earlier
    _mm256_store_pd(x+4*i,x2); // store aligned back
}
Changing "_ps" to "_pd", "128" to "256", and "4" to "8" as appropriate. Given that you're consuming data twice as fast, though, the prefetch stride might need to be adjusted a bit, but that's a bit of a black art that's best accomplished with benchmarking...

Related

Why does allocating a float in Metal's threadgroup address space give different results depending on the hardware?

I have recently been working on a soft-body physics simulation based on the following paper. The implementation uses points and springs and involves calculating the volume of the shape, which is then used to calculate the pressure to be applied to each point.
On my MacBook Pro (2018, 13") I used the following code to calculate the volume for each soft-body in the simulation since all of the physics for the springs and mass points were being handled by a separate threadgroup:
// Gauss's theorem
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);
// No memory fence is applied, and threadgroup_barrier
// acts only as an execution barrier.
threadgroup_barrier(mem_flags::mem_none);
threadgroup float volume = 0;
// Only do this calculation once on the first thread in the threadgroup.
if (threadIndexInThreadgroup == 0) {
    for (uint i = 0; i < threadsPerThreadgroup; ++i) {
        volume += shared_memory[i];
    }
}
// mem_none is probably all that is necessary here.
threadgroup_barrier(mem_flags::mem_none);
// Do calculations that depend on volume.
With shared_memory being passed to the kernel function as a threadgroup buffer:
threadgroup float* shared_memory [[ threadgroup(0) ]]
This worked well until, much later on, I ran the code on an iPhone and an M1 MacBook: the simulation broke down completely, with the soft bodies disappearing fairly quickly after starting the application.
The solution to this problem was to store the result of the volume sum in a threadgroup buffer, threadgroup float* volume [[ threadgroup(2) ]], and do the volume calculation as follows:
// -*- Volume calculation -*-
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);
threadgroup_barrier(mem_flags::mem_none);
if (threadIndexInThreadgroup == 0) {
    auto sum = shared_memory[0];
    for (uint i = 1; i < threadsPerThreadgroup; ++i) {
        sum += shared_memory[i];
    }
    *volume = sum;
}
threadgroup_barrier(mem_flags::mem_none);
float epsilon = 0.000001;
float pressurev = rAB * pressure * divide(1.0, *volume + epsilon);
My question is: why would the initial method work on my MacBook but not on other hardware, and is this now the correct way of doing this? If it is wrong to allocate a float in the threadgroup address space like this, then what is the point of being able to do so?
As a side note, I am using mem_flags::mem_none since it seems unnecessary to ensure the correct ordering of memory operations to threadgroup memory in this case. I just want to make sure each thread has written to shared_memory at this point but the order in which they do so shouldn't matter. Is this assumption correct?
You should use mem_flags::mem_threadgroup, but I think the main problem is that Metal can't initialize threadgroup memory to zero like that; the spec is unclear about this.
Try:
threadgroup float volume;
// Only do this calculation once on the first thread in the threadgroup.
if (threadIndexInThreadgroup == 0) {
    volume = 0;
    for (uint i = 0; i < threadsPerThreadgroup; ++i) {
        volume += shared_memory[i];
    }
}
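For reference, here is a minimal sketch of what that suggestion looks like with the barriers written out, using mem_threadgroup for both of them (all names are taken from the question; this is only an illustration, not something tested on the hardware in question):

// Each thread writes its partial contribution.
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);

// Make every thread's write to shared_memory visible before the reduction.
threadgroup_barrier(mem_flags::mem_threadgroup);

threadgroup float volume;
if (threadIndexInThreadgroup == 0) {
    volume = 0;
    for (uint i = 0; i < threadsPerThreadgroup; ++i) {
        volume += shared_memory[i];
    }
}

// Make thread 0's write to volume visible to the rest of the threadgroup.
threadgroup_barrier(mem_flags::mem_threadgroup);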
If you don't want to use a threadgroup buffer, the correct way to do this is the following:
// -*- Volume calculation -*-
threadgroup float volume = 0;
// Gauss's theorem
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);
// No memory fence is applied, and threadgroup_barrier
// acts only as an execution barrier.
threadgroup_barrier(mem_flags::mem_none);
if (threadIndexInThreadgroup == 0) {
    volume = shared_memory[0];
    // Index memory with signed int types rather than unsigned.
    for (int i = 1; i < int(threadsPerThreadgroup); ++i) {
        volume += shared_memory[i];
    }
}
threadgroup_barrier(mem_flags::mem_none);
You can use either threadgroup_barrier(mem_flags::mem_none) or threadgroup_barrier(mem_flags::mem_threadgroup); it appears to make no difference.

Compute shader hangs when setting two pixels while looping RWTexture2D (DirectX11, SM5)

I'm trying to perform some basic cellular automata in a compute shader (DirectCompute) but without double buffering, so I'm using an unordered access view to a RWTexture2D<uint> for the data. However, I'm hitting a really strange hang/crash, and I could make a very small snippet that reproduces the issue:
int w = 256;
for (int x = 0; x < w; ++x)
{
    for (int y = 1; y < w; ++y)
    {
        if (map[int2(x, y - 1)])
        {
            map[int2(x, y)] = 10;
            map[int2(x, y - 1)] = 30;
        }
    }
}
where map is RWTexture2D<uint>.
If I remove the if or one of the assignments, it works. I thought it could be some kind of limit, so I tried looping over just 1/4 of the texture, but the problem persists. The code is dispatched with (1,1,1) and the kernel's numthreads is (1,1,1) too. In my real-world scenario I want to loop from bottom to top and fill the voids (0) with the pixel I'm currently looping over (think of a "falling sand" kind of effect), so it can't be parallel except across columns, since each pixel depends on the one below it.
I don't understand what is causing the shader to hang, though; there's no error or anything, it simply hangs and never even times out.
EDIT:
After some further investigation, I came across something really intriguing: when I pass that w value in a constant buffer, it all works fine. I have no idea what would cause that. Maybe it's some compiler optimization gone wrong, maybe it tries to unroll the loop and that causes the issue, and passing the value in a constant buffer disables that; however, I'm compiling the shaders in debug with no optimization, so I don't know.
I've had issues declaring variables in global scope like this before. I believe it's because it's not static const (so declare it as static const and it should work). Most likely, the compiler is treating it as a constant buffer variable (with some default naming), and its contents are undefined since you're not binding a buffer, which causes undefined results. So the following code should work:
static const int w = 256;
for (int x = 0; x < w; ++x)
{
    for (int y = 1; y < w; ++y)
    {
        if (map[int2(x, y - 1)])
        {
            map[int2(x, y)] = 10;
            map[int2(x, y - 1)] = 30;
        }
    }
}
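For comparison, the constant-buffer workaround mentioned in the edit would look roughly like this (SimParams and the register slots are made-up names; the application has to create and bind a matching buffer holding w):

cbuffer SimParams : register(b0)
{
    int w; // filled in by the application, e.g. with 256
};

RWTexture2D<uint> map : register(u0);

[numthreads(1, 1, 1)]
void main()
{
    for (int x = 0; x < w; ++x)
    {
        for (int y = 1; y < w; ++y)
        {
            if (map[int2(x, y - 1)])
            {
                map[int2(x, y)] = 10;
                map[int2(x, y - 1)] = 30;
            }
        }
    }
}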

Is there a way I could use cv::Mat data while destroying the cv::Mat holding it?

It's certainly a strange question, but I would like to use the data from a cv::Mat that OpenCV computed inside a custom object that only needs a pointer to it.
It's like:
void func(const cv::Mat& a, void* pa) {
    cv::Mat b = a;
    b /= 256;
    pa = b.data;
    // b is destroyed, but I still want to use the data
}
OpenCV handles smart pointers, but I don't know the library too well. Is there a way to add a reference to the data pointer somehow?
memcpy is the option I'm using right now.
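No answer is recorded here, but one common way to handle this, sketched below, relies on the fact that cv::Mat is reference counted: any Mat header that still points at the buffer keeps the data alive, so returning (or storing) the Mat itself avoids both the dangling pointer and the memcpy (the changed signature is purely illustrative):

#include <opencv2/core.hpp>

// Returning the Mat keeps its reference count above zero, so the caller can
// use result.data for as long as it keeps the returned Mat around.
cv::Mat func(const cv::Mat& a)
{
    cv::Mat b = a.clone(); // own a copy so the division doesn't touch 'a'
    b /= 256;
    return b;              // cheap: only the header is copied, data is shared
}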

Column sum of Opencv Matrix elements

I need to compute the sum of the elements in each column separately.
This is what I'm using now (cross_corr is the matrix to be summed):
Mat cross_corr_summed;
for (int i=0;i<cross_corr.cols;i++)
{
    double column_sum = 0;
    for (int k = 0; k < cross_corr.rows; k++)
    {
        column_sum += cross_corr.at<float>(k, i);
    }
    cross_corr_summed.push_back(column_sum);
}
The problem is that my program takes quite a long time to run, and this is one of the parts I suspect of causing it.
Can you suggest any faster implementation?
Thanks!
You need cv::reduce:
cv::reduce(cross_corr, cross_corr_summed, 0, CV_REDUCE_SUM, CV_32S);
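For completeness, a small self-contained sketch of that call (the input matrix here is just toy data; in OpenCV 3 and later the flag is spelled cv::REDUCE_SUM):

#include <opencv2/core.hpp>

int main()
{
    // Toy stand-in for cross_corr: 3 rows x 4 cols of floats.
    cv::Mat cross_corr = (cv::Mat_<float>(3, 4) <<
        1, 2,  3,  4,
        5, 6,  7,  8,
        9, 10, 11, 12);

    // dim = 0 collapses the rows, giving a 1 x 4 row of per-column sums.
    cv::Mat cross_corr_summed;
    cv::reduce(cross_corr, cross_corr_summed, 0, cv::REDUCE_SUM, CV_32F);
    // cross_corr_summed is now [15, 18, 21, 24].
    return 0;
}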
If you know that your data is continuous and single-channel, you can access the matrix data directly:
int width = cross_corr.cols;
float* data = (float*)cross_corr.data;
Mat cross_corr_summed;
for (int i=0;i<cross_corr.cols;i++)
{
    double column_sum = 0;
    for (int k = 0; k < cross_corr.rows; k++)
    {
        column_sum += data[i + k*width];
    }
    cross_corr_summed.push_back(column_sum);
}
which will be faster than your use of .at<float>(). In general I avoid using .at() whenever possible because it is slower than direct access.
Also, although cv::reduce() (suggested by Andrey) is much more readable, I have found it is slower than even your implementation in some cases.
Mat originalMatrix;
Mat columnSum;
for (int i = 0; i < originalMatrix.cols; i++)
    columnSum.push_back(cv::sum(originalMatrix.col(i))[0]);

Problem assigning values to Mat array in OpenCV 2.3 - seems simple

Using the new API for OpenCV 2.3, I am having trouble assigning values to a Mat array (or, say, an image) inside a loop. Here is the code snippet I am using:
int paddedHeight = 256 + 2*padSize;
int paddedWidth = 256 + 2*padSize;
int n = 266; // padded height or width
cv::Mat fx = cv::Mat(paddedHeight,paddedWidth,CV_64FC1);
cv::Mat fy = cv::Mat(paddedHeight,paddedWidth,CV_64FC1);
float value = -n/2.0f;
for(int i=0;i<n;i++)
{
    for(int j=0;j<n;j++)
        fx.at<cv::Vec2d>(i,j) = value++;
    value = -n/2.0f;
}
meshElement = -n/2.0f;
for(int i=0;i<n;i++)
{
    for(int j=0;j<n;j++)
        fy.at<cv::Vec2d>(i,j) = value;
    value++;
}
Now in the first loop, as soon as j = 133, I get an exception that seems to be related to the depth of the image. I can't figure out what I am doing wrong here.
Please advise! Thanks!
You are accessing the data as 2-component double vectors (using .at<cv::Vec2d>()), but you created the matrices to contain only 1-component doubles (using CV_64FC1). Either create the matrices with two components per element (CV_64FC2) or, what seems more appropriate for your code, access the values as plain doubles using .at<double>(). It explodes exactly at j=133 because that is half the width of your image: when a matrix of 1-component elements is treated as containing 2-component vectors, it is effectively only half as wide.
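A sketch of that simpler fix, keeping the question's matrices but accessing them with the element type they were actually created with (only the fx loop is shown; fy changes the same way):

cv::Mat fx = cv::Mat(paddedHeight, paddedWidth, CV_64FC1);

double value = -n / 2.0;
for (int i = 0; i < n; i++)
{
    for (int j = 0; j < n; j++)
        fx.at<double>(i, j) = value++; // double matches CV_64FC1, no Vec2d needed
    value = -n / 2.0;
}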
Or maybe you can merge these two matrices into one, containing two components per element, but this depends on the way you are going to use these matrices in the future. In this case you can also merge the two loops together and really set a 2-component vector:
cv::Mat f = cv::Mat(paddedHeight,paddedWidth,CV_64FC2);
float yValue = -n/2.0f;
for(int i=0;i<n;i++)
{
    float xValue = -n/2.0f;
    for(int j=0;j<n;j++)
    {
        f.at<cv::Vec2d>(i,j)[0] = xValue++;
        f.at<cv::Vec2d>(i,j)[1] = yValue;
    }
    ++yValue;
}
This might produce a better memory accessing scheme if you always need both values, the one from fx and the one from fy, for the same element.
