I have this loop
void f1(unsigned char *data, unsigned int size) {
    unsigned int A[256] = {0u};
    for (register unsigned int i = 0u; i < size; i++) {
        ++A[data[i]];
    }
    ...
Is there any way to vectorize it manually?
Since multiple entries in data[i] might contain the same value, I don't see how this could be vectorized simply: several lanes would have to increment the same counter in A, so the updates conflict. The point of vectorization is that each element is independent of the other elements and so can be computed in parallel, but your algorithm doesn't allow that. "Vectorize" is not the same thing as "make go faster."
What you seem to be building here is a histogram, and iOS has built-in, optimized support for that. You can create a single-channel, single-row image and use vImageHistogramCalculation_Planar8 like this:
void f1(unsigned char *data, unsigned int size) {
    unsigned long A[256] = {0u};
    vImage_Buffer src = { data, 1, size, size };
    vImage_Error err = vImageHistogramCalculation_Planar8(&src, A, kvImageDoNotTile);
    if (err != kvImageNoError) {
        // error
    }
    ...
}
Be careful about assuming this is always a win, though. It depends on the size of your data. Making a function call is very expensive, so it can take several million bytes of data to make it worth it. If you're computing this on smaller sets than that, then a simple, compiler-optimized loop is often the best approach. You need to profile this on real devices to see which is faster for your purposes.
Just make sure to allow the compiler to apply all vectorizing optimizations by turning on -Ofast (Fastest, Aggressive). That won't matter in this case because your loop can't be simply vectorized. But in general, -Ofast allows the compiler to apply vectorizing optimizations in cases where it might slightly grow code size (which isn't allowed under the default -Os). -Ofast also allows a little sloppiness in how floating-point math is performed, so it should not be used in cases where strict IEEE floating-point conformance is required (but this is almost never the case for iOS apps, so -Ofast is almost always the correct setting).
The optimisation the compiler would attempt to do here is to parallelize ++A[data[i]]
It cannot do so because the contents of A depend on the previous iteration of the loop.
You could break this dependency by using one frequency array (A) per way of parallelism, and then computing the sum of these at the end. Here I assume two-way parallelism and that size is even.
void f1(const unsigned char * const data, unsigned int size) {
    unsigned int A0[256] = {0u};
    unsigned int A1[256] = {0u};
    for (unsigned int i = 0u; i < size / 2u; i++) {
        ++A0[data[2u*i]];
        ++A1[data[2u*i + 1u]];
    }
    for (unsigned int i = 0u; i < 256u; ++i) {
        A0[i] = A0[i] + A1[i];
    }
}
Does this win you much? There's only one way to find out - try it and measure the results. I suspect that the Accelerate framework will do much better than this, even for relatively small values of size. It's also optimised at run-time for the target architecture.
Compilers are pretty smart, but there are things you can do in C or C++ to help the compiler:
Apply const wherever possible: It's then obvious which data is invariant.
Identify pointers to non-overlapping memory regions with the restrict (__restrict in C++) qualifier. Without knowing this, the compiler must assume a write through one pointer potentially alters data that could be read with another. clang will in fact generate run-time checks and code paths for both the overlapping and non-overlapping cases, but there will be limits to this, and you can probably reduce code size by being explicit (see the sketch after this list).
I doubt the register qualifier for i makes any difference.
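Here is a minimal sketch of what the qualifier buys you (the function and the copy-and-scale operation are illustrative, not from the question): with the promise that dst and src never alias, the compiler can vectorize without emitting a run-time overlap check.
void scaleCopy(float * __restrict dst, const float * __restrict src, unsigned int n)
{
    // dst and src are promised not to overlap, so the loads and stores in this
    // loop can be reordered and vectorized without a run-time disambiguation check.
    for (unsigned int i = 0u; i < n; i++) {
        dst[i] = 2.0f * src[i];
    }
}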
Related
Seems like there are a lot of questions on here about moving double (or int, or float, etc) 2d arrays from host to device. This is NOT my question.
I have already moved all of the data onto the GPU, and the __global__ kernel calls several __device__ functions.
In these device kernels, I have tried the following:
To allocate:
__device__ double** matrixCreate(int rows, int cols, double initialValue)
{
    double** temp;
    temp = (double**)malloc(rows * sizeof(double*));
    for (int j = 0; j < rows; j++) {
        temp[j] = (double*)malloc(cols * sizeof(double));
    }
    // Set initial values
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            temp[i][j] = initialValue;
        }
    }
    return temp;
}
To deallocate:
__device__ void matrixDestroy(double** temp, int rows)
{
    for (int j = 0; j < rows; j++) {
        free(temp[j]);
    }
    free(temp);
}
For single-dimension arrays the __device__ mallocs work great, but I can't seem to keep things stable in the multidimensional case. By the way, the variables are sometimes used like this:
double** z=matrixCreate(2,2,0);
double* x=z[0];
However, care is always taken to ensure no calls to free are done with active data. The code is actually an adaptation of CPU-only code, so I know nothing funny is going on with the pointers or memory. Basically I'm just re-defining the allocators and throwing a __device__ on the serial portions. I just want to run the whole serial bit 10000 times, and the GPU seems like a good way to do it.
++++++++++++++ UPDATE +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Problem solved by Vyas. As per the CUDA specifications, the device heap size is initially set to 8 MB; if your mallocs exceed this, Nsight will not launch and the kernel crashes. Use the following in host code.
float increaseHeap=10;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, size[0]*increaseHeap);
Worked for me!
The GPU side malloc() is a suballocator from a limited heap. Depending on the number of allocations, it is possible the heap is being exhausted. You can change the size of the backing heap using cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size). For more info see the CUDA programming guide.
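As a rough sketch of how that fits together (the 64 MB figure and helper names are illustrative; the device function is just the question's matrixCreate with NULL checks added, so heap exhaustion shows up as an error instead of a crash):
#include <cstddef>

// Host side, before any kernel launch that calls malloc() on the device:
void growDeviceHeap(size_t bytes)        // e.g. 64 * 1024 * 1024
{
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes);
}

// Device side: malloc() returns NULL when the heap is exhausted.
__device__ double** matrixCreateChecked(int rows, int cols)
{
    double** temp = (double**)malloc(rows * sizeof(double*));
    if (temp == NULL) return NULL;
    for (int j = 0; j < rows; j++) {
        temp[j] = (double*)malloc(cols * sizeof(double));
        if (temp[j] == NULL) return NULL;   // out of device heap
    }
    return temp;
}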
I'm trying to use the CAMPARY library (CudA Multiple Precision ARithmetic librarY). I've downloaded the code and included it in my project. Since it supports both cpu and gpu, I'm starting with cpu to understand how it works and make sure it does what I need. But the intent is to use this with CUDA.
I'm able to instantiate an instance and assign a value, but I can't figure out how to get things back out. Consider:
#include <time.h>
#include "c:\\vss\\CAMPARY\\Doubles\\src_cpu\\multi_prec.h"

int main()
{
    const char *value = "123456789012345678901234567";
    multi_prec<2> a(value);
    a.prettyPrint();
    a.prettyPrintBin();
    a.prettyPrintBin_UnevalSum();
    char *cc = a.prettyPrintBF();
    printf("\n%s\n", cc);
    free(cc);
}
Compiles, links, runs (VS 2017). But the output is pretty unhelpful:
Prec = 2
Data[0] = 1.234568e+26
Data[1] = 7.486371e+08
Prec = 2
Data[0] = 0x1.987bf7c563caap+86;
Data[1] = 0x1.64fa5c3800000p+29;
0x1.987bf7c563caap+86 + 0x1.64fa5c3800000p+29;
1.234568e+26 7.486371e+08
Printing each of the doubles like this might be easy to do, but it doesn't tell you much about the value of the 128-bit number being stored. Performing highly accurate computations is of limited value if there's no way to output the results.
In addition to just printing out the value, eventually I also need to convert these numbers to ints (I'm willing to try it all in floats if there's a way to print, but I fear that both accuracy and speed will suffer). Unlike MPIR (which doesn't support CUDA), CAMPARY doesn't have any associated multi-precision int type, just floats. I can probably cobble together what I need (mostly just add/subtract/compare), but only if I can get the integer portion of CAMPARY's values back out, which I don't see a way to do.
CAMPARY doesn't seem to have any docs, so it's conceivable these capabilities are there, and I've simply overlooked them. And I'd rather ask on the CAMPARY discussion forum/mail list, but there doesn't seem to be one. That's why I'm asking here.
To sum up:
Is there any way to output the 128-bit (multi_prec<2>) values from CAMPARY?
Is there any way to extract the integer portion from a CAMPARY multi_prec? Perhaps one of the (many) math functions in the library that I don't understand computes this?
There are really only 2 possible answers to this question:
There's another (better) multi-precision library that works on CUDA that does what you need.
Here's how to modify this library to do what you need.
The only people who could give the first answer are CUDA programmers. Unfortunately, if there were such a library, I feel confident talonmies would have known about it and mentioned it.
As for #2, why would anyone update this library if they weren't a CUDA programmer? There are other, much better multi-precision libraries out there. The ONLY benefit CAMPARY offers is that it supports CUDA. Which means the only people with any real motivation to work with or modify the library are CUDA programmers.
And, as the CUDA programmer with the most vested interest in solving this, I did figure out a solution (albeit an ugly one). I'm posting it here in the hopes that the information will be of value to future CAMPARY programmers. There's not much information out there for this library, so this is a start.
The first thing you need to understand is how CAMPARY stores its data. And, while not complex, it isn't what I expected. Coming from MPIR, I assumed that CAMPARY stored its data pretty much the same way: a fixed size exponent followed by an arbitrary number of bits for the mantissa.
But nope, CAMPARY went a different way. Looking at the code, we see:
private:
    double data[prec];
Now, I assumed that this was just an arbitrary way of reserving the number of bits they needed. But no, they really do use prec doubles. Like so:
multi_prec<8> a("2633716138033644471646729489243748530829179225072491799768019505671233074369063908765111461703117249");
// Looking at a in the VS debugger:
[0] 2.6337161380336443e+99 const double
[1] 1.8496577979210756e+83 const double
[2] 1.2618399223120249e+67 const double
[3] -3.5978270144026257e+48 const double
[4] -1.1764513205926450e+32 const double
[5] -2479038053160511.0 const double
[6] 0.00000000000000000 const double
[7] 0.00000000000000000 const double
So, what they are doing is storing the max amount of precision possible in the first double, then the remainder is used to compute the next double and so on until they encompass the entire value, or run out of precision (dropping the least significant bits). Note that some of these are negative, which means the sum of the preceding values is a bit bigger than the actual value and they are correcting it downward.
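To make the representation concrete, here is a tiny standalone illustration of the same idea, not CAMPARY code: a value that a single double cannot hold exactly is kept as an unevaluated sum of two doubles.
#include <cstdio>

int main()
{
    // A value like 1e20 + 3.25 cannot be held in one double: the 3.25 is rounded away.
    double single = 1e20 + 3.25;        // == 1e20 exactly
    // Stored as a two-term expansion, nothing is lost: the pair (hi, lo) represents
    // the value as an unevaluated sum, just like multi_prec's data[] array.
    double hi = 1e20;
    double lo = 3.25;
    std::printf("single double : %.17g\n", single);
    std::printf("expansion     : %.17g + %.17g\n", hi, lo);
    return 0;
}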
With this in mind, we return to the question of how to print it.
In theory, you could just add all these together to get the right answer. But kinda by definition, we already know that C doesn't have a datatype to hold a value this size. But other libraries do (say MPIR). Now, MPIR doesn't work on CUDA, but it doesn't need to. You don't want to have your CUDA code printing out data. That's something you should be doing from the host anyway. So do the computations with the full power of CUDA, cudaMemcpy the results back, then use MPIR to print them out:
#define MPREC 8
void ShowP(const multi_prec<MPREC> value)
{
    multi_prec<MPREC> temp(value), temp2;
    // from mpir at mpir.org
    mpf_t mp, mp2;
    mpf_init2(mp, value.getPrec() * 64); // Make sure we reserve enough room
    mpf_init(mp2);                       // Only needs to hold one double.
    const double *ptr = value.getData();
    mpf_set_d(mp, ptr[0]);
    for (int x = 1; x < value.getPrec(); x++)
    {
        // MPIR doesn't have an mpf_add_d, so we need to load the value into
        // an mpf_t.
        mpf_set_d(mp2, ptr[x]);
        mpf_add(mp, mp, mp2);
    }
    // Using base 10, write the full precision (0) of mp to stdout.
    mpf_out_str(stdout, 10, 0, mp);
    mpf_clears(mp, mp2, NULL);
}
Used with the number stored in the multi_prec above, this outputs the exact same value. Yay.
It's not a particularly elegant solution. Having to add a second library just to print a value from the first is clearly sub-optimal. And this conversion can't be all that speedy either. But printing is typically done (much) less frequently than computing. If you do an hour's worth of computing and a handful of prints, the performance doesn't much matter. And it beats the heck out of not being able to print at all.
CAMPARY has a lot of shortcomings (undocumented, unsupported, unmaintained). But for people who need multi-precision numbers on CUDA (especially if you need sqrt), it's the best option I've found.
Without CUDA, my code is just two for loops that calculate the distance between all pairs of coordinates in a system and sort those distances into bins.
The problem with my CUDA version is that apparently threads can't write to the same global memory locations at the same time (race conditions?). The values I end up getting for each bin are incorrect because only one of the threads ended up writing to each bin.
__global__ void computePcf(
    double const * const atoms,
    double * bins,
    int numParticles,
    double dr) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numParticles - 1) {
        for (int j = i + 1; j < numParticles; j++) {
            double r = distance(&atoms[3*i + 0], &atoms[3*j + 0]);
            int binNumber = floor(r/dr);
            // Problem line right here.
            // This memory address is modified by multiple threads
            bins[binNumber] += 2.0;
        }
    }
}
So... I have no clue what to do. I've been Googling and reading about shared memory, but the problem is that I don't know what memory area I'm going to be accessing until I do my distance computation!
I know this is possible, because a program called VMD uses the GPU to speed up this computation. Any help (or even ideas) would be greatly appreciated. I don't need this optimized, just functional.
How many bins[] are there?
Is there some reason that bins[] need to be of type double? It's not obvious from your code. What you have is essentially a histogram operation, and you may want to look at fast parallel histogram techniques. Thrust may be of interest.
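For what it's worth, here is a sketch of the classic Thrust histogram recipe (sort, vectorized search, adjacent difference). It assumes you can afford to materialize all the pairwise distances in device memory first, which is N*(N-1)/2 values and may not be practical for large N; the function name and bin-edge handling are illustrative.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>

void histogramWithThrust(thrust::device_vector<double>& r,   // all pairwise distances
                         double dr, int numBins,
                         thrust::device_vector<int>& bins)   // output counts
{
    thrust::sort(r.begin(), r.end());
    // Upper edge of each bin: dr, 2*dr, ..., numBins*dr.
    thrust::device_vector<double> edges(numBins);
    thrust::sequence(edges.begin(), edges.end(), dr, dr);
    // For each edge, count how many distances fall at or below it (cumulative counts)...
    bins.resize(numBins);
    thrust::upper_bound(r.begin(), r.end(), edges.begin(), edges.end(), bins.begin());
    // ...then take adjacent differences to get per-bin counts. Values exactly on a
    // bin edge land in the lower bin here, a minor difference from floor(r/dr).
    thrust::adjacent_difference(bins.begin(), bins.end(), bins.begin());
}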
There are several possible avenues to consider with your code:
See if there is a way to restructure your algorithm to arrange computations in such a way that a given group of threads (or bin computations) are not stepping on each other. This might be accomplished based on sorting distances, perhaps.
Use atomics. This should solve your problem, but will likely be costly in terms of execution time (though since it's so simple, you might want to give it a try). In place of this:
bins[binNumber] += 2.0;
Something like this:
int * bins,
...
atomicAdd(bins+binNumber, 2);
You can still do this if bins are of type double; it's just a bit more complicated. Refer to the documentation for an example of how to do atomicAdd on a double.
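For reference, here is a sketch along the lines of the atomicCAS loop shown in the CUDA programming guide (the helper name atomicAddDouble is mine; on devices of compute capability 6.0 and newer, atomicAdd has a native double overload and this workaround isn't needed):
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Loop until no other thread changed the value between our read and our CAS.
    } while (assumed != old);
    return __longlong_as_double(old);
}

// In the kernel, the problem line then becomes:
//     atomicAddDouble(&bins[binNumber], 2.0);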
If the number of bins is small (maybe a few thousand or less) then you could create a few sets of bins that are updated by multiple threadblocks, and then use a reduction operation (adding the sets of bins together, element by element) at the end of the processing sequence. In this case, you might want to consider using a smaller number of threads or threadblocks, each of which processes multiple elements, by putting an additional loop in your kernel code, so that after each particle processing is complete, the loop jumps to the next particle by adding gridDim.x*blockDim.x to the i variable, and repeating the process. Since each thread or threadblock has its own local copy of the bins, it can do this without stepping on other threads' accesses.
For example, suppose I only needed 1000 bins of type int. I could create 1000 sets of bins, which would only take up about 4 megabytes. I could then give each of 1000 threads its own bin set to update, which would not require atomics, since no thread could interfere with any other. By having each thread loop through multiple particles, I can still effectively keep the machine busy this way. When all the particle-binning is done, I then have to add my 1000 bin-sets together, perhaps with a separate kernel call.
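Here is a minimal sketch of that privatized-bins approach (NUM_BINS, the one-set-per-thread layout, the kernel names, and the launch shapes are illustrative assumptions; distance() is the __device__ function from your code):
#define NUM_BINS 1000

// binSets points to numSets * NUM_BINS ints, zero-initialized; thread t only
// ever touches its own slice, so no atomics are needed.
__global__ void binParticles(const double* __restrict__ atoms,
                             int* __restrict__ binSets,
                             int numParticles, double dr, int numSets)
{
    int t = blockDim.x * blockIdx.x + threadIdx.x;
    if (t >= numSets) return;
    int* myBins = binSets + t * NUM_BINS;   // this thread's private bin set

    // Each thread loops over its share of the particles (t, t+numSets, ...).
    for (int i = t; i < numParticles - 1; i += numSets) {
        for (int j = i + 1; j < numParticles; j++) {
            double r = distance(&atoms[3*i], &atoms[3*j]);   // distance() as in the question
            int binNumber = (int)floor(r / dr);
            if (binNumber < NUM_BINS) myBins[binNumber] += 2;
        }
    }
}

// Second pass: fold every private bin set into set 0, one thread per bin.
__global__ void reduceBinSets(int* binSets, int numSets)
{
    int b = blockDim.x * blockIdx.x + threadIdx.x;
    if (b >= NUM_BINS) return;
    int sum = 0;
    for (int s = 0; s < numSets; s++)
        sum += binSets[s * NUM_BINS + b];
    binSets[b] = sum;
}

// Illustrative launches:
//   binParticles<<< (numSets + 255)/256, 256 >>>(atoms, binSets, numParticles, dr, numSets);
//   reduceBinSets<<< (NUM_BINS + 255)/256, 256 >>>(binSets, numSets);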
The following two pieces of code work fine when the optimization level is -O0.
But when the optimization level is anything other than -O0, the first crashes at some point while the second does not. Could you please explain why?
1.
unsigned char* _pos = ...;
double result;
*((int*)&result) = *((int*)_pos);
2.
unsigned char* _pos = ...;
double result;
int* curPos = (int*)_pos;
int* resultPos = (int*)&result;
*resultPos = *curPos;
EDIT:
By the way, this code is in an inlined function. When the function is not inlined, there is no crash even with optimizations.
The code here actually yields several problems at once.
First, as it was said before, the code violates the aliasing rules and thus the result is undefined per the standard. So, strictly speaking, the compiler can do a bunch of things while optimizing (this is actually your case when the code mentioned above is inlined).
Second (and I believe this is the actual problem here): casting char* to int* increases the assumed alignment of the pointer. According to your platform ABI, char can be 1-byte aligned, but int must be at least 4-byte aligned (double is 8-byte aligned, by the way). The system can sometimes tolerate unaligned loads, but not always; e.g. on arm/darwin it can tolerate 4-byte unaligned loads, but not 8-byte ones. The latter case can happen when the compiler decides to merge two consecutive loads/stores into one. Since you bumped the assumed alignment of the pointer, the compiler might deduce that everything is suitably aligned and generate such 8-byte loads.
So, in short - fix your code :) In this particular case memcpy / memmove will help you.
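A minimal sketch of the memcpy version (the wrapper function is just for illustration; it copies the same four bytes the snippets above did, but without the aliasing violation or the alignment assumption):
#include <cstring>

void copyIntBytes(const unsigned char *_pos, double *result)
{
    // Byte-wise copy: no int* alias is created and no alignment is assumed.
    // Compilers lower this to a single load/store when it is safe to do so.
    std::memcpy(result, _pos, sizeof(int));
    // Note: like the original snippets, this only writes 4 of result's 8 bytes.
}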
If you cannot see the code of a function but know what arguments it takes, is it possible to find its running time and memory usage? If so, how would you do it? Is there a way to use Big O in this case?
No, it's not possible to find either the memory or performance of a function by just looking at its parameters. For example, the same function
void DoSomething(int x, int y, int z);
can be implemented in O(1) time and memory:
void DoSomething(int x, int y, int z) { }
or as a very, very expensive function taking O(x*y*z):
void DoSomething(int x, int y, int z)
{
    int a = 0;
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                a++;
            }
        }
    }
    Console.WriteLine(a);
}
And many other possibilities. So, it's not possible to find how expensive the function is.
Am I allowed to run the function at all? Multiple times?
I would execute the function with a range of parameter values and measure the running time and (if possible) the memory consumption for each run. Then, assuming the function takes n arguments, I would plot each data point on an (n+1)-dimensional plot and look for trends from there.
First of all, it is an interview question, so you'd better never say no.
If I were in the interview, here is my approach.
I may ask the interviewer a few questions, as an interview is meant to be interactive.
Because I cannot see the code, I suppose I can at least run it, hopefully, multiple times. This would be my first question: can I run it? (If I cannot run it, then I can do literally nothing with it, and I give up.)
What is the function used for? This may give a hint of the complexity, if the function is written sanely.
What are the types of the arguments? Are some of them primitive types? Try some combinations of them. Are some of them "complex" (e.g. containers)? Try different size combinations. Are some of them related (e.g. one is a container and another is the size of that container)? If so, some test runs can be saved. Besides, I hope the legal ranges of the arguments are given, so I won't waste time on illegal guesses. Lastly, testing some edge cases may help.
Can you run the function from code? Something like this:
clock_t start = clock();
// call the function
clock_t end = clock();
double seconds = (double)(end - start) / CLOCKS_PER_SEC;  // needs <time.h>
Since this is an interview question, you should never answer "no, it cannot be done".
What you need is the ability to run the code. Once you can run the code, call the same function with different parameters and measure the memory and time required. You can then plot these data and get a good estimate.
For Big-O-type estimates, you can follow the same approach and plot the results with respect to the data-set size. Then try to fit this curve to known complexity curves like n, n^2, n^3, n*log(n), (n^2)*log(n), etc., using a least-squares fit.
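A minimal sketch of that measure-and-compare idea (blackBox is a stand-in for the function under test; in the real scenario you would call the unknown function instead):
#include <chrono>
#include <cmath>
#include <cstdio>

// Stand-in for the black-box function whose complexity we are probing.
void blackBox(int n)
{
    volatile long long acc = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            acc += i ^ j;
}

int main()
{
    for (int n = 1000; n <= 16000; n *= 2) {
        auto t0 = std::chrono::steady_clock::now();
        blackBox(n);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        // Divide the time by candidate growth rates; whichever ratio stays roughly
        // constant as n grows is the best-fitting complexity class.
        std::printf("n=%6d  t=%9.2f ms  t/n=%.3g  t/(n log n)=%.3g  t/n^2=%.3g\n",
                    n, ms, ms / n, ms / (n * std::log((double)n)), ms / ((double)n * n));
    }
    return 0;
}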
Lastly, remember that all these methods are approximations only.
No, you cannot; this would solve the Halting Problem, since the code might run endlessly, i.e. O(infinity). Thus, solving this problem would also solve the Halting Problem, which is of course proven to be impossible.