Compute shader hangs when setting two pixels while looping RWTexture2D (DirectX11, SM5)

I'm trying to run some basic cellular automata in a compute shader (DirectCompute), but without double buffering, so I'm using an unordered access view to a RWTexture2D<uint> for the data. However, I'm hitting a really strange hang/crash, and I was able to reduce it to a very small snippet that reproduces the issue:
int w = 256;
for (int x = 0; x < w; ++x)
{
    for (int y = 1; y < w; ++y)
    {
        if (map[int2(x, y - 1)])
        {
            map[int2(x, y)] = 10;
            map[int2(x, y - 1)] = 30;
        }
    }
}
where map is a RWTexture2D<uint>.
If I remove the if or one of the assignments, it works. I thought it could be some kind of limit, so I tried looping over just a quarter of the texture, but the problem persists. The code is dispatched with (1,1,1), and the kernel's numthreads is (1,1,1) as well. In my real-world scenario I want to loop from bottom to top and fill the voids (0) with the pixel I'm currently looping over (think of a "falling sand" kind of effect), so it can't be parallelized except across columns, since each cell depends on the pixel below it.
I don't understand what is causing the shader to hang, though. There's no error or anything; it simply hangs and doesn't even time out.
EDIT:
After some further investigation, I came across something really intriguing: when I pass that w value in a constant buffer, it all works fine. I have no idea what would cause that. Maybe it's a compiler optimization gone wrong, perhaps an attempt to unroll the loop that causes some issue, and passing the value in a constant buffer disables it. However, I'm compiling the shaders in debug with no optimizations, so I don't know.

I've had issues declaring variables in global scope like this before. I believe it's because the variable is not static const (so declare it as static const and it should work). Most likely the compiler is treating it as a constant buffer (with some default name), and since you never bind a buffer to it, its contents are undefined, which causes undefined results. So the following code should work:
static const int w = 256;
for (int x = 0; x < w; ++x)
{
    for (int y = 1; y < w; ++y)
    {
        if (map[int2(x, y - 1)])
        {
            map[int2(x, y)] = 10;
            map[int2(x, y - 1)] = 30;
        }
    }
}
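Alternatively, if the size needs to change at runtime, bind it through an explicit constant buffer, which matches what the question's EDIT found to work. A minimal sketch, assuming a (1,1,1) dispatch; the cbuffer name, register slots, and entry point name are illustrative:

RWTexture2D<uint> map : register(u0);

// Hypothetical constant buffer; name and register are placeholders.
cbuffer SimParams : register(b0)
{
    int w;
};

[numthreads(1, 1, 1)]
void CSMain()
{
    for (int x = 0; x < w; ++x)
    {
        for (int y = 1; y < w; ++y)
        {
            if (map[int2(x, y - 1)])
            {
                map[int2(x, y)] = 10;
                map[int2(x, y - 1)] = 30;
            }
        }
    }
}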

Related

Having an issue with pretty simple C code

I am developing an app in Xcode and have to write a bit of C for an algorithm. Here is part of the C code:
double dataTag[M][N];
// dataTag initialized to values.....

double w[N]; // This is outside the for loop, at the top level of the method
for (int i = 0; i < N; i++) {
    w[i] = pow(10.0, dataTag[2][i] / 10.0 / b);
}

// This is inside the for loop.....
double disErr[N];
// disErr set and values confirmed with printArray...
double transedEstSetDrv[N][M];
// transedEstSetDrv set and values confirmed with printArray...
double stepGrad[M] = {0, 0, 0};
for (int j = 0; j < M; j++) {
    double dotProductResult[M];
    dotProductOfArrays(w, disErr, dotProductResult, N);
    stepGrad[j] = sumOfArrayMultiplication(transedEstSetDrv[j], dotProductResult, M);
}

// Print array to console to confirm values
NSLog(@"%f %f %f", stepGrad[0], stepGrad[1], stepGrad[2]); // <-- if this is present, the algorithm gives different results
//Continue calculations......
So this is part of an algorithm in C which runs inside a for loop. The weird part is the NSLog that prints the stepGrad array: depending on whether I comment out the NSLog call or not, the algorithm as a whole gives different results.
It would be great if someone gave some debugging suggestions.
Thanks!
UPDATE 1:
Simplified the example (it still has the same issue) and added more code to illustrate the problem.
UPDATE 2:
Removed the length_of_array function and replaced it with a known number for simplicity.
So I will answer my own question.
Thanks to the comment from @Klas Lindbäck, I fixed the issue, which was caused by not initializing a C array declared inside a for loop. I went over all the arrays before and after the problematic code and did a
memset(a_c_array, 0, sizeof(a_c_array));
after the declaration of each array. That now works fine. Thank you for all your help!
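For reference, a zero initializer at the point of declaration achieves the same thing without a separate memset call. A minimal sketch, assuming N and M are compile-time constants as in the simplified example (initializers are not allowed on variable-length arrays):

double disErr[N] = {0};                 // all N elements zero-initialized
double transedEstSetDrv[N][M] = {{0}};  // works for multidimensional arrays too
double dotProductResult[M] = {0};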

CUDA global memory access speed

Here is some simple CUDA code.
I am testing global memory access times, both reads and writes.
Below is the first kernel function (test1()).
__global__ void test1(int *direct_map)
{
    int index = 10;
    int index2;
    for (int j = 0; j < 1024; j++)
    {
        index2 = direct_map[index];
        direct_map[index] = -1;
        index = index2;
    }
}
direct_map is a 683*1024 linear matrix, and each pixel holds an offset used to access another pixel.
index and index2 are not contiguous addresses.
This kernel function takes about 600 microseconds.
But if I delete the line
direct_map[index] = -1;
it takes just 27 microseconds.
I think the value of direct_map[index] has already been read from global memory by
index2 = direct_map[index];
so it should be resident in the L2 cache, and the subsequent direct_map[index] = -1; should be fast.
I also tested random writes to global memory (test2()).
It takes about 120 microseconds.
__global__ void test2(int *direct_map)
{
    int index = 10;
    for (int j = 0; j < 1024; j++)
    {
        direct_map[index] = -1;
        index = j*683 + j/3 - 1;
    }
}
So I don't know why test1() takes more than 600 microseconds.
Thank you.
When you delete the code line:
direct_map[index] = -1;
your kernel isn't doing anything useful. The compiler can recognize this and eliminate most of the code associated with the kernel launch. That modification to the kernel code means that the kernel no longer affects any global state and the code is effectively useless, from the compiler's perspective.
You can verify this by dumping the assembly code that the compiler generates in each case, for example with cuobjdump -sass myexecutable.
Anytime you make a small change to the code and see a large change in timing, you should suspect that the change you made has allowed the compiler to make different optimization decisions.
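One way to time the dependent reads on their own, without the compiler discarding them, is to keep the final value observable, for example by storing it once at the end. A minimal sketch under that assumption; the extra out parameter and kernel name are illustrative, not part of the original code:

// Hypothetical variant of test1: the load chain stays live because its
// result is written out once, so the compiler cannot eliminate it.
__global__ void test1_reads_only(const int *direct_map, int *out)
{
    int index = 10;
    for (int j = 0; j < 1024; j++)
    {
        index = direct_map[index]; // dependent loads, as in test1
    }
    *out = index; // a single store keeps the whole chain observable
}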

Column sum of Opencv Matrix elements

I need to compute the sum of the elements in each column separately.
This is what I'm using now; the matrix cross_corr is the one to be summed:
Mat cross_corr_summed;
for (int i = 0; i < cross_corr.cols; i++)
{
    double column_sum = 0;
    for (int k = 0; k < cross_corr.rows; k++)
    {
        column_sum += cross_corr.at<float>(k, i);
    }
    cross_corr_summed.push_back(column_sum);
}
The problem is that my program takes quite a long time to run, and this is one of the parts I suspect of causing it.
Can you suggest a faster implementation?
Thanks!
You need cv::reduce:
cv::reduce(cross_corr, cross_corr_summed, 0, CV_REDUCE_SUM, CV_32S);
If you know that your data is continuous and single-channeled, you can access the matrix data directly:
int width = cross_corr.cols;
float* data = (float*)cross_corr.data;
Mat cross_corr_summed;
for (int i = 0; i < cross_corr.cols; i++)
{
    double column_sum = 0;
    for (int k = 0; k < cross_corr.rows; k++)
    {
        column_sum += data[i + k*width];
    }
    cross_corr_summed.push_back(column_sum);
}
which will be faster than your use of .at<float>(). In general I avoid .at() whenever possible because it is slower than direct access.
Also, although cv::reduce() (suggested by Andrey) is much more readable, I have found it to be slower than even your implementation in some cases. Another alternative is to sum each column with cv::sum():
Mat originalMatrix;
Mat columnSum;
for (int i = 0; i < originalMatrix.cols; i++)
    columnSum.push_back(cv::sum(originalMatrix.col(i))[0]);

Problem assigning values to Mat array in OpenCV 2.3 - seems simple

Using the new API for OpenCV 2.3, I am having trouble assigning values to a Mat array (or say image) inside a loop. Here is the code snippet I am using:
int paddedHeight = 256 + 2*padSize;
int paddedWidth = 256 + 2*padSize;
int n = 266; // padded height or width

cv::Mat fx = cv::Mat(paddedHeight, paddedWidth, CV_64FC1);
cv::Mat fy = cv::Mat(paddedHeight, paddedWidth, CV_64FC1);

float value = -n/2.0f;
for (int i = 0; i < n; i++)
{
    for (int j = 0; j < n; j++)
        fx.at<cv::Vec2d>(i,j) = value++;
    value = -n/2.0f;
}

meshElement = -n/2.0f;
for (int i = 0; i < n; i++)
{
    for (int j = 0; j < n; j++)
        fy.at<cv::Vec2d>(i,j) = value;
    value++;
}
Now, in the first loop, as soon as j = 133 I get an exception which seems to be related to the depth of the image. I can't figure out what I am doing wrong here.
Please Advise! Thanks!
You are accessing the data as 2-component double vectors (using .at<cv::Vec2d>()), but you created the matrices to contain only 1-component doubles (using CV_64FC1). Either create the matrices to contain two components per element (with CV_64FC2) or, what seems more appropriate to your code, access the values as plain doubles, using .at<double>(). It explodes exactly at j = 133 because that is half the width of your image: a matrix of 1-component elements, when treated as containing 2-component vectors, is only half as wide.
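For illustration, a minimal sketch of the .at<double>() fix for the first loop, reusing the declarations from the question and leaving everything else unchanged:

float value = -n/2.0f;
for (int i = 0; i < n; i++)
{
    for (int j = 0; j < n; j++)
        fx.at<double>(i,j) = value++; // 1-component access matches CV_64FC1
    value = -n/2.0f;
}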
Or maybe you can merge these two matrices into one containing two components per element, though this depends on how you are going to use these matrices later. In that case you can also merge the two loops and really set a 2-component vector:
cv::Mat f = cv::Mat(paddedHeight, paddedWidth, CV_64FC2);
float yValue = -n/2.0f;
for (int i = 0; i < n; i++)
{
    float xValue = -n/2.0f;
    for (int j = 0; j < n; j++)
    {
        f.at<cv::Vec2d>(i,j)[0] = xValue++;
        f.at<cv::Vec2d>(i,j)[1] = yValue;
    }
    ++yValue;
}
This might produce a better memory access pattern if you always need both values (the one from fx and the one from fy) for the same element.

randomness from spritemap in spritebatch

I'm sure this is easily fixed, and I have searched both high and low, traversed the net east, west, north and south, but to no avail...
My problem is this: I'm in the middle of trying to make a Bejeweled clone, just to get started in XNA. However, I'm stuck on the random plotting of gems/icons/pictures.
This is what I have.
First, a generated list of positions, a Random, and a Rectangle:
List<Vector2> platser = new List<Vector2>();
Random slump = new Random();
Rectangle bildsourcen;

protected override void Initialize()
{
    for (int j = 0; j < 5; j++)
    {
        for (int i = 0; i < 5; i++)
        {
            platser.Add(new Vector2((i*100), (j*100)));
        }
    }
    base.Initialize();
}
Pretty straight-forward.
I have also loaded a texture with 5 icons/gems/pictures -> 5*100 px = a width of 500 px.
allImage = Content.Load<Texture2D>("icons/all");
Then comes the "error".
protected override void Draw(GameTime gameTime)
{
    GraphicsDevice.Clear(Color.CornflowerBlue);
    spriteBatch.Begin();

    int x = slump.Next(2);
    bildsourcen = new Rectangle((x * 100), 0, 100, 100);

    for (int i = 0; i < platser.Count; i++)
    {
        spriteBatch.Draw(allImage, new Rectangle((int)platser[i].X, (int)platser[i].Y, 100, 100), bildsourcen, Color.White);
    }
So, there is my code. And this is what happens:
I want it to randomly pick a part of my image and plot it at the given coords taken from the Vector2 list. However, it puts the same image at all coords and keeps randomly replacing them, not with random images but with the same one, so the whole board keeps flickering. That is, instead of generating 15231 and keeping it frozen, it puts 11111 one second and 33333 the next.
Does anybody understand what I'm trying to describe? I'm almost at the point of pulling my own hair out. The cat's hair has already been pulled...
Thanks in advance.
The Draw function is called once each frame. This line:
int x = slump.Next(2);
is generating a random number (either 0 or 1 in this case) each frame, hence the flicker.
The line after that selects a sprite from your sprite atlas based on that number (specifically it specifies the rectangle containing that sprite). And in the loop that follows you're drawing multiple copies of that sprite (always the same image).
You should be doing all of your game logic in your Update function. That function gives you a time, and you will probably want to wait for a certain amount of time to pass before you generate a random block: keep accumulating the time that passes between each Update until it reaches some threshold, as sketched below. The exact mechanics of when you generate your random block are up to you.
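A minimal sketch of that accumulation pattern, assuming a field for the elapsed time; the field name and the one-second threshold are illustrative:

TimeSpan elapsed = TimeSpan.Zero;

protected override void Update(GameTime gameTime)
{
    // Accumulate frame time until the threshold is reached.
    elapsed += gameTime.ElapsedGameTime;
    if (elapsed >= TimeSpan.FromSeconds(1)) // example threshold
    {
        elapsed = TimeSpan.Zero;
        // generate the next random block here
    }
    base.Update(gameTime);
}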
Of course, that is not to mention that there are other flaws in the structure of your code. Bejeweled is played on a fixed-size board with different coloured blocks (each block can be represented by a number from 1 to X). The location of the blocks should be implicit in your data structure, so you don't need to generate your platser list.
So your Game class should have something like:
const int BoardWidth = 10;
const int BoardHeight = 10;
int[,] board = new int[BoardWidth, BoardHeight];
Then in your Initialize function you should fill board and perhaps use 0 as an empty space and 1 to X to represent your colours, like so:
for(int x = 0; x < BoardWidth; x++) for(int y = 0; y < BoardHeight; y++)
{
    board[x,y] = slump.Next(1, 6); // gives 5 different sprites
}
Then in Update wait for user input or a time-out before modifying the board (depending on your gameplay).
Then in your Draw function do something like this:
for(int x = 0; x < BoardWidth; x++) for(int y = 0; y < BoardHeight; y++)
{
    if(board[x,y] == 0) continue; // don't render an empty space

    Vector2 position = new Vector2(100*x, 100*y);
    Rectangle bildsourcen = new Rectangle(100*(board[x,y]-1), 0, 100, 100);
    sb.Draw(allImage, position, bildsourcen, Color.White);
}
