How to CUDA-ize code when all cores require global memory access

Without CUDA, my code is just two for loops that calculate the distance between all pairs of coordinates in a system and sort those distances into bins.
The problem with my CUDA version is that apparently threads can't write to the same global memory locations at the same time (race conditions?). The values I end up getting for each bin are incorrect because only one of the threads ended up writing to each bin.
__global__ void computePcf(
    double const * const atoms,
    double * bins,
    int numParticles,
    double dr) {

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numParticles - 1) {
        for (int j = i + 1; j < numParticles; j++) {
            double r = distance(&atoms[3*i + 0], &atoms[3*j + 0]);
            int binNumber = floor(r/dr);
            // Problem line right here.
            // This memory address is modified by multiple threads
            bins[binNumber] += 2.0;
        }
    }
}
So... I have no clue what to do. I've been Googling and reading about shared memory, but the problem is that I don't know what memory area I'm going to be accessing until I do my distance computation!
I know this is possible, because a program called VMD uses the GPU to speed up this computation. Any help (or even ideas) would be greatly appreciated. I don't need this optimized, just functional.

How many bins[] are there?
Is there some reason that bins[] need to be of type double? It's not obvious from your code. What you have is essentially a histogram operation, and you may want to look at fast parallel histogram techniques. Thrust may be of interest.
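To sketch the Thrust route hinted at above (a sketch only, assuming the pairwise distances have already been computed into a device vector; this is the sort-then-vectorized-search idiom from Thrust's histogram example):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>

// d_r: all pairwise distances; d_counts: numBins ints for the result.
// Distances beyond numBins*dr are simply not counted in this sketch.
void histogramWithThrust(thrust::device_vector<double> &d_r,
                         thrust::device_vector<int> &d_counts,
                         int numBins, double dr) {
    // 1. Sort the distances so each bin occupies a contiguous range.
    thrust::sort(d_r.begin(), d_r.end());
    // 2. Build the upper edge of each bin: dr, 2*dr, ..., numBins*dr.
    thrust::device_vector<double> d_edges(numBins);
    thrust::sequence(d_edges.begin(), d_edges.end(), dr, dr);
    // 3. For every edge, count how many distances fall below it.
    thrust::upper_bound(d_r.begin(), d_r.end(),
                        d_edges.begin(), d_edges.end(),
                        d_counts.begin());
    // 4. Difference the cumulative counts to get per-bin counts.
    thrust::adjacent_difference(d_counts.begin(), d_counts.end(),
                                d_counts.begin());
}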
There are several possible avenues to consider with your code:
See if there is a way to restructure your algorithm to arrange computations in such a way that a given group of threads (or bin computations) are not stepping on each other. This might be accomplished based on sorting distances, perhaps.
Use atomics. This should solve your problem, but will likely be costly in terms of execution time (though since it's so simple, you might want to give it a try). In place of this:
bins[binNumber] += 2.0;
Something like this:
int * bins,
...
atomicAdd(bins+binNumber, 2);
You can still do this if bins are of type double, it's just a bit more complicated. Refer to the documentation for the example of how to do atomicAdd on a double.
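For reference, the atomicCAS-based double-precision atomicAdd from the CUDA C Programming Guide looks essentially like this (on compute capability 6.0 and newer, atomicAdd(double*, double) exists natively and this workaround is unnecessary):

__device__ double atomicAddDouble(double *address, double val) {
    // Reinterpret the double as a 64-bit integer so atomicCAS can operate on it.
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Retry if another thread changed the value between the read and the CAS.
    } while (assumed != old);
    return __longlong_as_double(old);
}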
If the number of bins is small (maybe a few thousand, or less) then you could create a few sets of bins that are updated by multiple threadblocks, and then use a reduction operation (adding the sets of bins together, element by element) at the end of the processing sequence. In this case, you might want to consider using a smaller number of threads or threadblocks, each of which processes multiple elements, by putting an additional loop in your kernel code, so that after each particle processing is complete, the loop jumps to the next particle by adding gridDim.x*blockDim.x to the i variable, and repeating the process. Since each thread or threadblock has its own local copy of the bins, it can do this without stepping on other threads' accesses.
For example, suppose I only needed 1000 bins of type int. I could create 1000 sets of bins, which would only take up about 4 megabytes. I could then give each of 1000 threads its own bin set; since each thread updates only its own bin set, it requires no atomics and cannot interfere with any other thread. By having each thread loop through multiple particles, I can still effectively keep the machine busy this way. When all the particle-binning is done, I then have to add my 1000 bin-sets together, perhaps with a separate kernel call, as sketched below.
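Here is a minimal sketch of that idea (assumptions: binSets is a zero-initialized array of numSets * numBins ints, computePcfPrivate is launched with exactly numSets threads in total, and distance() is the helper from the question):

__global__ void computePcfPrivate(const double *atoms, int *binSets,
                                  int numParticles, double dr, int numBins) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    int *myBins = binSets + tid * numBins;  // this thread's private bin set
    // Grid-stride loop: each thread processes particles tid, tid+stride, ...
    for (int i = tid; i < numParticles - 1; i += stride) {
        for (int j = i + 1; j < numParticles; j++) {
            double r = distance(&atoms[3*i], &atoms[3*j]);
            myBins[(int)(r / dr)] += 2;  // private copy, so no atomics needed
        }
    }
}

// Final reduction: one thread per bin sums that bin across all bin sets.
__global__ void reduceBins(const int *binSets, int *bins,
                           int numSets, int numBins) {
    int b = blockDim.x * blockIdx.x + threadIdx.x;
    if (b < numBins) {
        int sum = 0;
        for (int s = 0; s < numSets; s++)
            sum += binSets[s * numBins + b];
        bins[b] = sum;
    }
}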

Related

How to efficiently create a large vector of items initialized to the same value?

I'm looking to allocate a vector of small-sized structs.
This takes 30 milliseconds and increases linearly:
let v = vec![[0, 0, 0, 0]; 1024 * 1024];
This takes tens of microseconds:
let v = vec![0; 1024 * 1024];
Is there a more efficient solution to the first case? I'm okay with unsafe code.
Fang Zhang's answer is correct in the general case. The code you asked about is a little bit special: it could use alloc_zeroed, but it does not. As Stargateur also points out in the question comments, with future language and library improvements it is possible both cases could take advantage of this speedup.
This usually should not be a problem. Initializing a whole big vector at once probably isn't something you do extremely often. Big allocations are usually long-lived, so you won't be creating and freeing them in a tight loop -- the cost of initializing the vector will only be paid rarely. Sooner than resorting to unsafe, I would take a look at my algorithms and try to understand why a single memset is causing so much trouble.
However, if you happen to know that all-bits-zero is an acceptable initial value, and if you absolutely cannot tolerate the slowdown, you can do an end-run around the standard library by calling alloc_zeroed and creating the Vec using from_raw_parts. Vec::from_raw_parts is unsafe, so you have to be absolutely sure the size and alignment of the allocated memory is correct. Since Rust 1.44, you can use Layout::array to do this easily. Here's an example:
pub fn make_vec() -> Vec<[i8; 4]> {
    let layout = std::alloc::Layout::array::<[i8; 4]>(1_000_000).unwrap();
    // I copied the following unsafe code from Stack Overflow without understanding
    // it. I was advised not to do this, but I didn't listen. It's my fault.
    unsafe {
        Vec::from_raw_parts(
            std::alloc::alloc_zeroed(layout) as *mut _,
            1_000_000,
            1_000_000,
        )
    }
}
See also
How to perform efficient vector initialization in Rust?
vec![0; 1024 * 1024] is a special case. If you change it to vec![1; 1024 * 1024], you will see performance degrade dramatically.
Typically, for a non-zero element e, vec![e; n] will clone the element n times, which is the major cost. For an element equal to 0, the allocator can hand back memory the operating system has already zeroed, which is much faster.
So the answer to your question is no.

Is it safe for an OpenCL kernel to randomly write to a __global buffer?

I want to run an instrumented OpenCL kernel to get some execution metrics. More specifically, I have added a hidden global buffer which will be initialized from the host code with N zeros. Each of the N values are integers and they represent a different metric, which each kernel instance will increment in a different manner, depending on its execution path.
A simplistic example:
__kernel void test(__global int *a, __global int *hiddenCounter) {
    if (get_global_id(0) == 0) {
        // do stuff and then increment the appropriate counter (random numbers here)
        hiddenCounter[0] += 3;
    }
    else {
        // do stuff...
        hiddenCounter[1] += 5;
    }
}
After the kernel execution is complete, I need the host code to aggregate (a simple element-wise vector addition) all the hiddenCounter buffers and print the appropriate results.
My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer (which will definitely happen in my project). Do I need to enforce some kind of synchronization? Or is this impossible with __global arguments and I need to change it to __private? Will I be able to aggregate __private buffers from the host code afterwards?
My question is whether there are race conditions when multiple kernel instances try to write to the same index of the hiddenCounter buffer
The answer to this is emphatically yes, your code will be vulnerable to race conditions as currently written.
Do I need to enforce some kind of synchronization?
Yes, you can use global atomics for this purpose. All but the most ancient GPUs will support this (anything supporting OpenCL 1.2, or the cl_khr_global_int32_base_atomics and similar extensions).
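Applied to the example kernel from the question, that would look roughly like this (atomic_add on 32-bit global integers is a built-in from OpenCL 1.1 onward; on OpenCL 1.0 the equivalent atom_add comes from the extension):

__kernel void test(__global int *a, __global int *hiddenCounter) {
    if (get_global_id(0) == 0) {
        // do stuff, then increment the counter atomically
        atomic_add(&hiddenCounter[0], 3);
    } else {
        // do stuff...
        atomic_add(&hiddenCounter[1], 5);
    }
}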
Note that this will have a non-trivial performance overhead. Depending on your access patterns and frequency, collecting intermediate results in private or local memory and writing them out to global memory at the end of the kernel may be faster. (In the local case, the whole work group would share just one global atomic call for each updated cell - you'll need to use local atomics or a reduction algorithm to accumulate the values from individual work items across the group though.)
Another option is to use a much larger global memory buffer, with counters for each work item or group. In that case, you will not need atomics to write to them, but you will subsequently need to combine the values on the host. This uses much more memory, obviously, and likely more memory bandwidth too - modern GPUs should cache accesses to your small hiddenCounter buffer. So you'll need to work out/try which is the lesser evil in your case.
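A sketch of that larger-buffer variant (an assumption here: counters holds NUM_METRICS ints per work item, zero-initialized by the host, which sums the slices per metric after the kernel finishes):

#define NUM_METRICS 2

__kernel void test(__global int *a, __global int *counters) {
    size_t gid = get_global_id(0);
    // Each work item owns its own slice of the buffer, so no atomics
    // are needed and no work item can interfere with another.
    __global int *my = counters + gid * NUM_METRICS;
    if (gid == 0) {
        my[0] += 3;
    } else {
        my[1] += 5;
    }
}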

Iterative programming using PCollectionViews

I wish to create a PCollection of say one hundred thousand objects (maybe even a million) and then apply an operation to it a million times in a for-loop on the same data, but with DIFFERENT values for the PCollectionView calculated on each iteration of the loop. Is this a use-case that Dataflow can handle reasonably well? Is there a better way to achieve this? My concern is that PCollectionView has too much overhead, but it could be that this used to be a problem a year ago and is now a use-case that Dataflow supports well. In my case, I can hardcode the number of iterations of the for-loop (as I believe Dataflow can't handle a number of iterations that is dynamically determined at run-time). Here's some pseudocode:
PCollection<KV<Integer, RowVector>> rowVectors = ...;
PCollectionView<Map<Integer, Float>> vectorX = ...;
for (int i = 0; i < 1000000; i++) {
    PCollection<KV<Integer, Float>> dotProducts =
        rowVectors.apply(ParDo.of(new DoDotProduct()).withSideInputs(vectorX));
    vectorX = dotProducts.apply(View.asMap());
}
Unfortunately we only support up to 1000 transformations / stages. This would require 1000000 (or however many iterations your for-loop runs) stages.
Also, you are correct in that we don't allow changes to the graph after the pipeline begins running.
If you want to do fewer than 1000 iterations, then using a map side input can work, but you have to limit the number of map lookups you do per RowVector. You can do this by ensuring that each lookup fetches a whole column instead of walking the map for each RowVector. In this case you'd represent your matrix as a PCollectionView of a Map<ColumnIndex, Iterable<KV<RowIndex, RowValue>>>.

CUDA: are access times for texture memory similar to coalesced global memory?

My kernel threads access a linear character array in a coalesced fashion. If I map the array to texture I don't see any speedup. The running times are almost the same. I'm working on a Tesla C2050 with compute capability 2.0 and read somewhere that global accesses are cached. Is that true? Perhaps that is why I am not seeing a difference in the running time.
The array in the main program is
char *dev_database = NULL;
cudaMalloc( (void**) &dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );
and I bind it to texture texture<char> texdatabase with
cudaBindTexture(NULL, texdatabase, dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );
Each thread then reads a character ch = tex1Dfetch(texdatabase, p + id) where id is threadIdx.x + blockIdx.x * blockDim.x and p is an offset.
I'm binding only once and dev_database is a large array. Actually I found that if the size is too large the bind fails. Is there a limit on the size of the array to bind? Thanks very much.
There are several possibilities for why you don't see any difference in performance, but the most likely is that this memory access is not your bottleneck. If it is not your bottleneck, making it faster will have no effect on performance.
Regarding caching: for this case, since you are reading only bytes, each warp reads 32 contiguous bytes, which means four warps map to each 128-byte cache line. Assuming few cache conflicts, you will get up to 4x reuse from the cache. So if this memory access is a bottleneck, it is conceivable that the texture cache might not benefit you more than the general-purpose cache.
You should first determine if you are bandwidth bound and if this data access is the culprit. Once you have done that, then optimize your memory accesses. Another tactic to consider is to access 4 to 16 chars per thread per load (using a char4 or int4 struct with byte packing/unpacking) rather than one per thread to increase the number of memory transactions in flight at a time -- this can help to saturate the global memory bus.
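To illustrate the packed-load idea, a hypothetical sketch (assuming the buffer size is a multiple of 4 so the char pointer can be reinterpreted as char4, which is safely aligned when the buffer comes from cudaMalloc):

__global__ void readPacked(const char4 *database, int numChar4s, int *out) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < numChar4s) {
        // One 4-byte load instead of four 1-byte loads keeps more data
        // in flight per memory transaction.
        char4 c = database[id];
        out[id] = c.x + c.y + c.z + c.w;  // stand-in for the real per-byte work
    }
}

You would then launch over a quarter as many elements, e.g. readPacked<<<(numChar4s + 255) / 256, 256>>>((const char4 *)dev_database, numChar4s, d_out).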
There is a good presentation by Paulius Micikevicius from GTC 2010 that you might want to watch. It covers both analysis-driven optimization and the specific concept of memory transactions in flight.

Running time and memory

If you cannot see the code of a function but know what arguments it takes, is it possible to find its running time and memory usage? If so, how would you do it? Is there a way to use Big O in this case?
No, it's not possible to find either the memory or performance of a function by just looking at its parameters. For example, the same function
void DoSomething(int x, int y, int z);
can be implemented in O(1) time and memory:
void DoSomething(int x, int y, int z) { }
or as a very, very expensive function taking O(x*y*z):
void DoSomething(int x, int y, int z)
{
    int a = 0;
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                a++;
            }
        }
    }
    Console.WriteLine(a);
}
And many other possibilities. So, it's not possible to find how expensive the function is.
Am I allowed to run the function at all? Multiple times?
I would execute the function with a range of parameter values and measure the running time and (if possible) the memory consumption for each run. Then, assuming the function takes n argument, I would plot each data point on an n+1-dimensional plot and look for trends from there.
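A minimal C++ sketch of that measure-and-plot approach (the DoSomething body below is a hypothetical stub standing in for the black box; in the real scenario you would link against the function you cannot see):

#include <chrono>
#include <iostream>

// Hypothetical stub; replace with the actual opaque function.
void DoSomething(int x, int y, int z) {
    volatile long long a = 0;
    for (int i = 0; i < x; i++)
        for (int j = 0; j < y; j++)
            for (int k = 0; k < z; k++)
                a = a + 1;
}

int main() {
    // Double the input each round and record wall-clock time; plotting
    // these points reveals the growth trend (linear, quadratic, cubic, ...).
    for (int n = 16; n <= 256; n *= 2) {
        auto start = std::chrono::steady_clock::now();
        DoSomething(n, n, n);
        auto end = std::chrono::steady_clock::now();
        std::cout << n << "\t"
                  << std::chrono::duration<double>(end - start).count()
                  << " s\n";
    }
}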
First of all, it is an interview question, so you'd better never say no.
If I were in the interview, here is my approach.
I may ask the interviewer a few questions, as an interview is meant to be interactive.
Because I cannot see the code, I suppose I can at least run it, hopefully, multiple times. This would be my first question: can I run it? (If I cannot run it, then I can do literally nothing with it, and I give up.)
What is the function used for? This may give a hint of the complexity, if the function is written sanely.
What are the types of the arguments? Are some of them primitive types? Try some combinations of them. Are some of them "complex" (e.g. containers)? Try some different size combinations. Are some of them related (e.g. one is a container and another is the size of that container)? That can save some test runs. Besides, I hope the legal ranges of the arguments are given, so I won't waste time on illegal guesses. Finally, testing some edge cases may help.
Can you wrap the function in timing code? Something like this:
clock_t start = clock();
// call the function
clock_t end = clock();
double seconds = (double)(end - start) / CLOCKS_PER_SEC;
Being an interview question, you should never answer like "no it cannot be done".
What you need is the ability to run the code. Once you can run the code, call the same function with different parameters and measure the memory and time required. You can then plot these data and get a good estimate.
For Big-O style estimates you can follow the same approach: plot the results against the data set size, then try to fit the curve to known complexity curves like n, n^2, n^3, n*log(n), (n^2)*log(n), etc., using a least-squares fit.
Lastly, remember that all these methods are approximations only.
No, you cannot; that would solve the Halting Problem, since the code might run endlessly (O(infinity)). Thus, solving this problem would also solve the Halting Problem, which is of course proven to be impossible.
