Analyse code for spatial and temporal locality - spatial

Hi, I have some questions regarding spatial and temporal locality. I have read in the course theory that:
spatial locality
If one item is referenced, addresses close by are likely to be referenced soon.
temporal locality
An item that is referenced at one point in time tends to be referenced again soon.
OK, but how do I see that in the code? I think I understand the concept of temporal locality, but I don't understand spatial locality yet. For instance, in this loop:
for (i = 0; i < 20; i++)
    for (j = 0; j < 10; j++)
        a[i] = a[i]*j;
The inner loop accesses the same memory address, a[i], ten times, so that's an example of temporal locality, I guess. But is there spatial locality in the above loop as well?

Of course. For instance, after referencing a[5] you are about to reference a[6].
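To make both effects concrete, here is a small Python sketch (Python standing in for the C loop above) that records which element each iteration touches:

```python
# Trace of the memory accesses made by the loop above (Python standing in
# for the C code). Each append records which element of a is touched.
a = [1.0] * 20
accesses = []
for i in range(20):
    for j in range(10):
        accesses.append(i)      # a[i] is read and written here
        a[i] = a[i] * j

# Temporal locality: the same element is touched ten times in a row.
print(accesses[:12])            # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Spatial locality: successive outer iterations touch adjacent elements,
# which sit next to each other in memory (often in the same cache line).
print(sorted(set(accesses))[:5])   # [0, 1, 2, 3, 4]
```

The temporal reuse comes from the inner loop, the spatial reuse from the outer one.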

Iterative programming using PCollectionViews

I wish to create a PCollection of, say, one hundred thousand objects (maybe even a million), and apply an operation to it a million times in a for-loop on the same data, but with DIFFERENT values for the PCollectionView calculated on each iteration of the loop. Is this a use case that Dataflow can handle reasonably well? Is there a better way to achieve this? My concern is that PCollectionView has too much overhead, but it could be that this used to be a problem a year ago and is now a use case that Dataflow supports well. In my case, I can hardcode the number of iterations of the for-loop (as I believe that Dataflow can't handle a number of iterations that is determined dynamically at run-time). Here's some pseudocode:
PCollection<KV<Integer, RowVector>> rowVectors = ...
PCollectionView<Map<Integer, Float>> vectorX;
for (int i = 0; i < 1000000; i++) {
    PCollection<KV<Integer, Float>> dotProducts =
        rowVectors.apply(ParDo.of(new DoDotProduct()).withSideInputs(vectorX));
    vectorX = dotProducts.apply(View.asMap());
}
Unfortunately, we only support up to 1000 transformations/stages. This would require 1000000 stages (or however many iterations your for-loop makes).
Also, you are correct that we don't allow changes to the graph after the pipeline begins running.
If you want to do fewer than 1000 iterations, then using a map side input can work, but you have to limit the number of map lookups you do per RowVector. You can do this by ensuring that each lookup fetches a whole column instead of walking the map for each RowVector. In that case you'd represent your matrix as a PCollectionView of a Map<ColumnIndex, Iterable<KV<RowIndex, RowValue>>>.
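For intuition, the per-iteration work the pseudocode describes — each row dotted with the current side-input vector, with the results becoming the next side input — can be sketched in plain Python (all names and the toy data here are made up to mirror the pseudocode, not the Dataflow API):

```python
# Plain-Python model of one pipeline stage per loop iteration: every row
# computes a dot product against the side-input vector, and the results
# become the next iteration's side input (the View.asMap() step).

def dot_product(row, vector_x):
    # row: dict column -> value; vector_x: dict column -> float
    return sum(v * vector_x.get(col, 0.0) for col, v in row.items())

row_vectors = {               # stands in for PCollection<KV<Integer, RowVector>>
    0: {0: 1.0, 1: 2.0},
    1: {0: 3.0, 1: 4.0},
}
vector_x = {0: 1.0, 1: 1.0}   # initial side input

for _ in range(3):            # the hardcoded for-loop over stages
    dot_products = {i: dot_product(row, vector_x)
                    for i, row in row_vectors.items()}
    vector_x = dot_products   # equivalent of rebuilding the map view
```

Each pass through the loop corresponds to one extra stage in the pipeline graph, which is why the 1000-stage limit bites.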

SIMD load in loop with descending index

I am just wondering how SIMD extensions implement the vector load in loop with descending index.
For example, we have a loop like
for (i = N; i >= 0; i--)
But a vector load fetches consecutive memory starting from the low address. In such a situation, is the vector load followed by a vector shuffle to place each element in the correct lane?
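One common approach matches the guess in the question: do an ordinary contiguous (ascending) vector load, then a shuffle/permute that reverses the lanes. A NumPy sketch standing in for a 4-lane vector register (`load_descending` is a made-up name for illustration):

```python
import numpy as np

# Model of a 4-lane vector load in a loop with a descending index:
# load a contiguous 4-element block with an ordinary ascending load,
# then reverse the lanes so lane 0 holds the highest-index element.

a = np.arange(16, dtype=np.float32)    # the array the loop walks backwards

def load_descending(arr, top):
    block = arr[top - 3: top + 1]      # contiguous ascending vector load
    return block[::-1]                 # the shuffle: reverse the lanes

v = load_descending(a, 15)
print(v)   # [15. 14. 13. 12.]
```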

How to create a confusion matrix without any packages

How do I create a confusion matrix without any packages? I want to be able to see the logic of creating one. This can be any language or pseudo-code.
Technically, a confusion matrix is just a regular matrix.
Just compute the intersection sizes and then label the rows and columns as desired.
Well, an option (maybe not the best in performance, but great for understanding the concept) is this one, which was the first one I implemented:
true_positives = 0
true_negatives = 0
false_negatives = 0
false_positives = 0
for i in range(len(predictions)):
    if predictions[i] == 1 and real_values[i] == 1:
        true_positives += 1
    if predictions[i] == 0 and real_values[i] == 0:
        true_negatives += 1
    if predictions[i] == 0 and real_values[i] == 1:
        false_negatives += 1
    if predictions[i] == 1 and real_values[i] == 0:
        false_positives += 1
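The same counting idea generalizes to more than two classes without any packages: keep a matrix where entry [actual][predicted] counts each (real, predicted) pair. A minimal sketch:

```python
def confusion_matrix(real_values, predictions, num_classes):
    # matrix[actual][predicted] counts how often class `actual`
    # was predicted as class `predicted`.
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for real, pred in zip(real_values, predictions):
        matrix[real][pred] += 1
    return matrix

real = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
m = confusion_matrix(real, pred, 3)
# Diagonal entries are the correct predictions; in the binary case this
# reduces to the four counters above.
```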

How to CUDA-ize code when all cores require global memory access

Without CUDA, my code is just two for loops that calculate the distance between all pairs of coordinates in a system and sort those distances into bins.
The problem with my CUDA version is that threads apparently can't safely write to the same global memory location at the same time (a race condition?). The values I end up getting for each bin are incorrect because only one of the threads ends up writing to each bin.
__global__ void computePcf(
    double const * const atoms,
    double * bins,
    int numParticles,
    double dr) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < numParticles - 1) {
    for (int j = i + 1; j < numParticles; j++) {
      double r = distance(&atoms[3*i + 0], &atoms[3*j + 0]);
      int binNumber = floor(r/dr);
      // Problem line right here.
      // This memory address is modified by multiple threads
      bins[binNumber] += 2.0;
    }
  }
}
So... I have no clue what to do. I've been Googling and reading about shared memory, but the problem is that I don't know what memory area I'm going to be accessing until I do my distance computation!
I know this is possible, because a program called VMD uses the GPU to speed up this computation. Any help (or even ideas) would be greatly appreciated. I don't need this optimized, just functional.
How many bins[] are there?
Is there some reason that bins[] need to be of type double? It's not obvious from your code. What you have is essentially a histogram operation, and you may want to look at fast parallel histogram techniques. Thrust may be of interest.
There are several possible avenues to consider with your code:
See if there is a way to restructure your algorithm to arrange computations in such a way that a given group of threads (or bin computations) are not stepping on each other. This might be accomplished based on sorting distances, perhaps.
Use atomics. This should solve your problem, but will likely be costly in terms of execution time (though since it's so simple, you might want to give it a try). In place of this:
bins[binNumber] += 2.0;
Something like this:
int * bins,
...
atomicAdd(bins+binNumber, 2);
You can still do this if bins are of type double, it's just a bit more complicated. Refer to the documentation for the example of how to do atomicAdd on a double.
If the number of bins is small (maybe a few thousand, or less), then you could create a few sets of bins that are updated by multiple threadblocks, and then use a reduction operation (adding the sets of bins together, element by element) at the end of the processing sequence. In this case, you might want to consider using a smaller number of threads or threadblocks, each of which processes multiple elements: put an additional loop in your kernel code so that after each particle is processed, the loop jumps to the next particle by adding gridDim.x*blockDim.x to the i variable, and repeats the process. Since each thread or threadblock has its own local copy of the bins, it can do this without stepping on other threads' accesses.
For example, suppose I only needed 1000 bins of type int. I could create 1000 sets of bins, which would only take up about 4 megabytes. I could then give each of 1000 threads its own bin set, and each of the 1000 threads would have its own bin set to update, requiring no atomics, since it could not interfere with any other thread. By having each thread loop through multiple particles, I can still effectively keep the machine busy this way. When all the particle-binning is done, I then have to add my 1000 bin-sets together, perhaps with a separate kernel call.
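The privatization scheme above can be modeled outside CUDA; this Python/NumPy sketch (toy data, made-up sizes) shows each "thread" filling its own private bin set and a final element-wise reduction:

```python
import numpy as np

# Model of the privatized-histogram idea: each "thread" owns a private
# copy of the bins and accumulates into it without contention; a final
# element-wise reduction sums the copies into the global result.

num_threads = 4
num_bins = 10
dr = 1.0
rng = np.random.default_rng(0)
distances = rng.uniform(0.0, 10.0, size=1000)   # stand-in for pair distances

private_bins = np.zeros((num_threads, num_bins))
for t in range(num_threads):
    # Grid-stride style partition: thread t handles elements t, t+4, t+8, ...
    for r in distances[t::num_threads]:
        private_bins[t, int(r // dr)] += 2.0    # no other thread touches row t

# The reduction step: add the bin sets together, element by element.
bins = private_bins.sum(axis=0)
```

No atomics are needed while filling, because each row of `private_bins` has a single writer; only the final sum combines them.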

Running time and memory

If you cannot see the code of a function but know what arguments it takes, is it possible to find its running time and memory usage? If so, how would you do it? Is there a way to use Big O in this case?
No, it's not possible to find either the memory usage or the performance of a function just by looking at its parameters. For example, the same function
void DoSomething(int x, int y, int z);
can be implemented in O(1) time and memory:
void DoSomething(int x, int y, int z) { }
or as a very, very expensive function taking O(x*y*z) time:
void DoSomething(int x, int y, int z)
{
    int a = 0;
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                a++;
            }
        }
    }
    Console.WriteLine(a);
}
And many other possibilities. So, it's not possible to find how expensive the function is.
Am I allowed to run the function at all? Multiple times?
I would execute the function with a range of parameter values and measure the running time and (if possible) the memory consumption for each run. Then, assuming the function takes n arguments, I would plot each data point on an (n+1)-dimensional plot and look for trends from there.
First of all, it is an interview question, so you'd better never say no.
If I were in the interview, here is my approach.
I may ask the interviewer a few questions, as an interview is meant to be interactive.
Because I cannot see the code, I suppose I can at least run it, hopefully, multiple times. This would be my first question: can I run it? (If I cannot run it, then I can do literally nothing with it, and I give up.)
What is the function used for? This may give a hint of the complexity, if the function is written sanely.
What are the types of the arguments? Are some of them primitive types? Try some combinations of them. Are some of them "complex" (e.g. containers)? Try combinations of different sizes. Are some of them related (e.g. one is a container and another is its size)? If so, some test runs can be saved. Besides, I hope the legal ranges of the arguments are given, so I won't waste time on illegal guesses. Last, testing some marginal cases may help.
Can you wrap the function call in timing code? Something like this:
start = clock();
// call the function
end = clock();
time = (double)(end - start) / CLOCKS_PER_SEC;
Since this is an interview question, you should never answer "no, it cannot be done".
What you need is the ability to run the code. Once you can run it, call the same function with different parameters and measure the time and memory required. You can then plot these data and get a good estimate.
For Big-O-type estimates, you can follow the same approach: plot the results with respect to the data set size, then try to fit the curve to known complexity curves like n, n^2, n^3, n*log(n), (n^2)*log(n), etc. using a least-squares fit.
Lastly, remember that all these methods give approximations only.
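The fitting step can be sketched with NumPy: for each candidate complexity f, fit the single scale factor c in t ≈ c·f(n) by least squares and keep the candidate with the smallest residual (the candidate list and the data here are illustrative):

```python
import numpy as np

# Fit measured running times against candidate complexity curves.
# For each candidate f, the best scale c minimizing sum((t - c*f(n))^2)
# is c = (f·t)/(f·f); we keep the candidate with the smallest residual.

candidates = {
    "n":       lambda n: n,
    "n^2":     lambda n: n ** 2,
    "n log n": lambda n: n * np.log(n),
}

def best_fit(sizes, times):
    sizes = np.asarray(sizes, dtype=float)
    times = np.asarray(times, dtype=float)
    best = None
    for name, f in candidates.items():
        x = f(sizes)
        c = (x @ times) / (x @ x)              # least-squares scale factor
        residual = np.sum((times - c * x) ** 2)
        if best is None or residual < best[1]:
            best = (name, residual)
    return best[0]

# Fake measurements that grow like n^2 (for illustration only).
sizes = [100, 200, 400, 800]
times = [1e-4 * n ** 2 for n in sizes]
print(best_fit(sizes, times))   # n^2
```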
No, you cannot: this would solve the Halting Problem, since the code might run endlessly (O(infinity)). Thus, solving this problem in general would also solve the Halting Problem, which is of course proven to be impossible.
