Running time and memory

If you cannot see the code of a function, but know that it takes arguments, is it possible to find its running time and memory usage? If so, how would you do it? Is there a way to use Big O in this case?

No, it's not possible to determine either the memory use or the running time of a function just by looking at its parameters. For example, the same function
void DoSomething(int x, int y, int z);
can be implemented in O(1) time and memory:
void DoSomething(int x, int y, int z) { }
or as a very, very expensive function taking O(x*y*z) time:
void DoSomething(int x, int y, int z)
{
    int a = 0;
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                a++;
            }
        }
    }
    Console.WriteLine(a);
}
And there are many other possibilities. So it's not possible to tell how expensive the function is from its signature alone.

Am I allowed to run the function at all? Multiple times?
I would execute the function with a range of parameter values and measure the running time and (if possible) the memory consumption for each run. Then, assuming the function takes n arguments, I would plot each data point on an (n+1)-dimensional plot and look for trends from there.

First of all, it is an interview question, so you'd better never say no.
If I were in the interview, here is my approach.
I would ask the interviewer a few questions, as an interview is meant to be interactive.
Because I cannot see the code, I suppose I can at least run it, hopefully multiple times. This would be my first question: can I run it? (If I cannot run it, then I can do literally nothing with it, and I give up.)
What is the function used for? This may give a hint of the complexity, if the function is written sanely.
What are the types of the arguments? Are some of them primitive types? Try some combinations of values. Are some of them "complex" (e.g. containers)? Try some different size combinations. Are some of them related (e.g. one is a container and another is the size of that container)? If so, some test runs can be skipped. Besides, I hope the legal ranges of the arguments are given, so I won't waste time on illegal guesses. Finally, testing some edge cases may help.

Can you run the function from your own code? Something like this:
start = clock();
//call the function;
end = clock();
time = end-start;

Since this is an interview question, you should never answer with something like "no, it cannot be done".
What you need is the ability to run the code. Once you can run the code, call the same function with different parameters and measure the memory and time required. You can then plot this data and get a good estimate.
For big-O notation you can follow the same approach and plot the results with respect to the data set size. Then try to fit this curve to the known complexity curves such as n, n^2, n^3, n*log(n), (n^2)*log(n), etc. using a least-squares fit.
Lastly, remember that all these methods are approximations only.
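The measure-and-fit idea can be sketched as follows. This is a minimal illustration, not a definitive method: time_call and fit_exponent are hypothetical helper names, and the fit assumes the cost behaves like c*n^k, so it estimates only the exponent k (it will not distinguish n from n*log(n)).

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Wall-clock time of one call to f, in seconds.
template <typename F>
double time_call(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Least-squares slope of log(time) against log(size).
// If time ~ c * n^k, the slope approximates k.
// Assumes both vectors have the same length (at least 2) and positive entries.
double fit_exponent(const std::vector<double>& sizes,
                    const std::vector<double>& times) {
    const double m = static_cast<double>(sizes.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        const double x = std::log(sizes[i]);
        const double y = std::log(times[i]);
        sx += x;
        sy += y;
        sxx += x * x;
        sxy += x * y;
    }
    return (m * sxy - sx * sy) / (m * sxx - sx * sx);
}
```

In practice you would fill times by calling time_call on the black-box function for each size, repeating each measurement several times to reduce noise.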

No, you cannot. Doing so would solve the Halting Problem, since the code might run endlessly (O(infinity)). Solving this problem would therefore also solve the Halting Problem, which has been proven to be impossible.

Related

Outputting values from CAMPARY

I'm trying to use the CAMPARY library (CudA Multiple Precision ARithmetic librarY). I've downloaded the code and included it in my project. Since it supports both cpu and gpu, I'm starting with cpu to understand how it works and make sure it does what I need. But the intent is to use this with CUDA.
I'm able to instantiate an instance and assign a value, but I can't figure out how to get things back out. Consider:
#include <stdio.h>   // printf
#include <stdlib.h>  // free
#include "c:\\vss\\CAMPARY\\Doubles\\src_cpu\\multi_prec.h"
int main()
{
    const char *value = "123456789012345678901234567";
    multi_prec<2> a(value);
    a.prettyPrint();
    a.prettyPrintBin();
    a.prettyPrintBin_UnevalSum();
    char *cc = a.prettyPrintBF();
    printf("\n%s\n", cc);
    free(cc);
}
Compiles, links, runs (VS 2017). But the output is pretty unhelpful:
Prec = 2
Data[0] = 1.234568e+26
Data[1] = 7.486371e+08
Prec = 2
Data[0] = 0x1.987bf7c563caap+86;
Data[1] = 0x1.64fa5c3800000p+29;
0x1.987bf7c563caap+86 + 0x1.64fa5c3800000p+29;
1.234568e+26 7.486371e+08
Printing each of the doubles like this might be easy to do, but it doesn't tell you much about the value of the 128-bit number being stored. Performing highly accurate computations is of limited value if there's no way to output the results.
In addition to just printing out the value, eventually I also need to convert these numbers to ints (I'm willing to try it all in floats if there's a way to print, but I fear that both accuracy and speed will suffer). Unlike MPIR (which doesn't support CUDA), CAMPARY doesn't have any associated multi-precision int type, just floats. I can probably cobble together what I need (mostly just add/subtract/compare), but only if I can get the integer portion of CAMPARY's values back out, which I don't see a way to do.
CAMPARY doesn't seem to have any docs, so it's conceivable these capabilities are there, and I've simply overlooked them. And I'd rather ask on the CAMPARY discussion forum/mail list, but there doesn't seem to be one. That's why I'm asking here.
To sum up:
Is there any way to output the 128-bit (multi_prec<2>) values from CAMPARY?
Is there any way to extract the integer portion from a CAMPARY multi_prec? Perhaps one of the (many) math functions in the library that I don't understand computes this?

There are really only 2 possible answers to this question:
There's another (better) multi-precision library that works on CUDA that does what you need.
Here's how to modify this library to do what you need.
The only people who could give the first answer are CUDA programmers. Unfortunately, if there were such a library, I feel confident talonmies would have known about it and mentioned it.
As for #2, why would anyone update this library if they weren't a CUDA programmer? There are other, much better multi-precision libraries out there. The ONLY benefit CAMPARY offers is that it supports CUDA. Which means the only people with any real motivation to work with or modify the library are CUDA programmers.
And, as the CUDA programmer with the most vested interest in solving this, I did figure out a solution (albeit an ugly one). I'm posting it here in the hopes that the information will be of value to future CAMPARY programmers. There's not much information out there for this library, so this is a start.
The first thing you need to understand is how CAMPARY stores its data. And, while not complex, it isn't what I expected. Coming from MPIR, I assumed that CAMPARY stored its data pretty much the same way: a fixed size exponent followed by an arbitrary number of bits for the mantissa.
But nope, CAMPARY went a different way. Looking at the code, we see:
private:
double data[prec];
Now, I assumed that this was just an arbitrary way of reserving the number of bits they needed. But no, they really do use prec doubles. Like so:
multi_prec<8> a("2633716138033644471646729489243748530829179225072491799768019505671233074369063908765111461703117249");
// Looking at a in the VS debugger:
[0] 2.6337161380336443e+99 const double
[1] 1.8496577979210756e+83 const double
[2] 1.2618399223120249e+67 const double
[3] -3.5978270144026257e+48 const double
[4] -1.1764513205926450e+32 const double
[5] -2479038053160511.0 const double
[6] 0.00000000000000000 const double
[7] 0.00000000000000000 const double
So, what they are doing is storing the max amount of precision possible in the first double, then the remainder is used to compute the next double and so on until they encompass the entire value, or run out of precision (dropping the least significant bits). Note that some of these are negative, which means the sum of the preceding values is a bit bigger than the actual value and they are correcting it downward.
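The following is a small illustration of the same storage idea on a much smaller scale. It is not CAMPARY code: TwoDoubles and split are hypothetical names, and the sketch only splits a 64-bit integer into two doubles whose exact sum is the original value. It shows why a later term can be negative: the leading double is rounded to nearest, so it may overshoot.

```cpp
#include <cstdint>

// A value represented as the unevaluated sum hi + lo.
struct TwoDoubles {
    double hi;  // nearest double to the value
    double lo;  // exact remainder, may be negative
};

// Split a 64-bit integer into two doubles.
// Assumes |v| < 2^62 so that casting hi back to int64_t cannot overflow.
TwoDoubles split(std::int64_t v) {
    const double hi = static_cast<double>(v);  // rounds to nearest
    const double lo = static_cast<double>(v - static_cast<std::int64_t>(hi));
    return {hi, lo};
}
```

For example, 2^54 + 3 cannot be stored in one double. The nearest double is 2^54 + 4, so the second term is the correction -1.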
With this in mind, we return to the question of how to print it.
In theory, you could just add all these together to get the right answer. But kinda by definition, we already know that C doesn't have a datatype to hold a value this size. But other libraries do (say MPIR). Now, MPIR doesn't work on CUDA, but it doesn't need to. You don't want to have your CUDA code printing out data. That's something you should be doing from the host anyway. So do the computations with the full power of CUDA, cudaMemcpy the results back, then use MPIR to print them out:
#define MPREC 8
void ShowP(const multi_prec<MPREC> value)
{
    multi_prec<MPREC> temp(value), temp2;
    // from mpir at mpir.org
    mpf_t mp, mp2;
    mpf_init2(mp, value.getPrec() * 64); // Make sure we reserve enough room
    mpf_init(mp2); // Only needs to hold one double.
    const double *ptr = value.getData();
    mpf_set_d(mp, ptr[0]);
    for (int x = 1; x < value.getPrec(); x++)
    {
        // MPIR doesn't have a mpf_add_d, so we need to load the value into
        // an mpf_t.
        mpf_set_d(mp2, ptr[x]);
        mpf_add(mp, mp, mp2);
    }
    // Using base 10, write the full precision (0) of mp, to stdout.
    mpf_out_str(stdout, 10, 0, mp);
    mpf_clears(mp, mp2, NULL);
}
Used with the number stored in the multi_prec above, this outputs the exact same value. Yay.
It's not a particularly elegant solution. Having to add a second library just to print a value from the first is clearly sub-optimal. And this conversion can't be all that speedy either. But printing is typically done (much) less frequently than computing. If you do an hour's worth of computing and a handful of prints, the performance doesn't much matter. And it beats the heck out of not being able to print at all.
CAMPARY has a lot of shortcomings (undocumented, unsupported, unmaintained). But for people who need multi-precision numbers on CUDA (especially if you need sqrt), it's the best option I've found.

Best way to access/store persistent data in CUDA along multiple kernel calls [duplicate]

I have 2 very similar kernel functions, in the sense that the code is nearly the same, but with a slight difference. Currently I have 2 options:
Write 2 different methods (but very similar ones)
Write a single kernel and put the code blocks that differ in an if/else statement
How much will an if statement affect my algorithm performance?
I know that there is no branch divergence, since all threads in all blocks will take the same path: either the if or the else.
So will a single if statement decrease my performance if the kernel function is called a lot of times?

You have a third alternative, which is to use C++ templating and make the variable which is used in the if/switch statement a template parameter. Instantiate each version of the kernel you need, and then you have multiple kernels doing different things with no branch divergence or conditional evaluation to worry about, because the compiler will optimize away the dead code and the branching with it.
Perhaps something like this:
template<int action>
__global__ void kernel()
{
    switch(action) {
        case 1:
            // First code
            break;
        case 2:
            // Second code
            break;
    }
}
template void kernel<1>();
template void kernel<2>();
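A plain C++ analogue of the same pattern is sketched below, so it can be tried without a GPU. The names apply and dispatch and the two placeholder computations are made up for illustration. In CUDA, apply would be the __global__ kernel and dispatch would be the host code that chooses which instantiation to launch.

```cpp
#include <stdexcept>

// Variant selected at compile time. Because Action is a constant in each
// instantiation, the compiler can drop the cases that do not apply.
template <int Action>
int apply(int x) {
    switch (Action) {
        case 1:
            return x + 1;  // placeholder for "first code"
        case 2:
            return x * 2;  // placeholder for "second code"
    }
    return x;
}

// The runtime decision is made once, outside the hot code path.
int dispatch(int action, int x) {
    switch (action) {
        case 1:
            return apply<1>(x);
        case 2:
            return apply<2>(x);
        default:
            throw std::invalid_argument("unknown action");
    }
}
```

The cost of this design is one compiled copy of the code per instantiated value, which is usually acceptable when there are only a few variants.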

It will slightly decrease your performance, especially if it's in an inner loop, since you're wasting an instruction issue slot every so often, but it's not nearly as much as if a warp were divergent.
If it's a big deal, it may be worth moving the condition outside the loop, however. If the warp is truly divergent, though, think about how to remove the branching: e.g., instead of
if (i>0) {
    x = 3;
} else {
    x = y;
}
try
x = ((i>0)*3) | ((i<=0)*y);
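As a quick check of the idea, here is a sketch that puts a branching and a branch-free version side by side. The function names are hypothetical. The branch-free form relies on the two conditions being exact complements, (i > 0) and (i <= 0), so exactly one term is nonzero. Compilers may already produce predicated code for such a simple branch, so measure before relying on this.

```cpp
// Reference version with an explicit branch.
int select_branch(int i, int y) {
    if (i > 0) {
        return 3;
    }
    return y;
}

// Branch-free version: each comparison yields 0 or 1, and the two
// conditions are complementary, so exactly one product survives.
int select_branchless(int i, int y) {
    return ((i > 0) * 3) + ((i <= 0) * y);
}
```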

Trying to rewrite this so it doesn't violate principles discussed in Code Complete, 2nd edition

function profit(){
int totalSales=0;
for (int i=0; i<12;i++) // computer yearly sales
totalSales+=montlysales[i];
return get_profit_from_sales(totalsales);
}
So I've already determined that the 12 in the for loop should be a named constant instead of a bare integer, and that montlysales should be passed into the function as a parameter, so that a check can be run to see whether the length of the sales array equals the number of months, which is also twelve.
I'm not sure if those are all the violations of the principles. I feel the last line
return get_profit_from_sales(totalsales)
is wrong, and it's really bothering me because I can't figure out why it bothers me. I think I might have missed something else, too.
Can anyone help me verify?

Summary - you should refactor out the call to another function and make this function so that it is pure and does only one thing, reducing complexity, and improving your ability to reason abstractly about your program and its correctness.
Your spidey sense is tingling and you should trust it - you are correct, but what is wrong is subtle.
Routines are best when they do one thing, and one thing only. So purity of vision is important in the prime imperative, management of complexity -- it allows our brains to be able to juggle more things because they are simpler. That is, you can just look at the function and know what it does, and you don't have to say, "it totals the sales, but it also calls another function at the end", which sort of clouds its "mission".
This is also part of functional programming, which I feel languages have to adopt in order to implement the prime imperative spoken of in Code Complete. Functional programming has as one of its tenets "no side effects", which is similar to "one mission" or "one purpose". What I've done to your code can also be seen as making it more functional: just inputs and outputs, and nothing in or out via any other path.
Note also that function get_profit() reads like pseudocode, making it somewhat self-documenting, and you don't have to delve into any of the functions to understand what the function does or how it does it (ideally).
So, here is my idea of what I explained above (loosely coded, and not checked!).
function get_total_sales(int montlysales[]){
    int totalSales=0;
    for (int i=0; i<12;i++) // compute yearly sales
        totalSales+=montlysales[i];
    return totalSales;
}
function calc_profit(int all_sales, int all_commissions, int all_losses){
    // your calculations here
    int profit = all_sales - all_commissions - all_losses; // ... etc.
    return profit;
}
function get_profit(){
    int montlysales[] = read_from_disk();
    int total_sales = get_total_sales(montlysales);
    int tot_commissions = get_tot_commissions();
    int tot_losses = get_tot_losses();
    int profit_made = calc_profit(total_sales, tot_commissions, tot_losses);
    return profit_made;
}
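For comparison, here is one way the same split could look in compilable C++. It also applies the two points raised in the question: a named constant instead of the literal 12, and the sales passed in as a parameter. The names kMonthsPerYear, MonthlySales and total_sales are illustrative, and the profit formula is the same placeholder as above, not a real accounting rule.

```cpp
#include <array>
#include <cstddef>
#include <numeric>

// Named constant instead of a magic number.
constexpr std::size_t kMonthsPerYear = 12;

// A fixed-size array makes the "exactly twelve months" rule part of the
// type, so no runtime length check is needed.
using MonthlySales = std::array<int, kMonthsPerYear>;

// Does one thing: totals the monthly figures.
int total_sales(const MonthlySales& monthly_sales) {
    return std::accumulate(monthly_sales.begin(), monthly_sales.end(), 0);
}

// Does one thing: combines already-computed totals (placeholder formula).
int calc_profit(int sales, int commissions, int losses) {
    return sales - commissions - losses;
}
```

Both functions depend only on their arguments, which makes them easy to test in isolation.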
I read Code Complete about once a year, as coding is truly subtle at times, because it is so multidimensional. This has been very helpful to me. Regards - Stephen

How to optimize math with 2D vectors?

I’m working on an iOS app that performs some calculations with an array of a thousand objects. The objects have properties with x and y coordinates, velocity along the x and y axes, and a couple of other x-y properties. There is some math to calculate the interaction between the physical objects represented by the objects in the array. The math is pretty straightforward: basically it calculates the forces applied to the objects, the speed, and the change in position (x, y) for each object. I wrote the code using regular scalar math in Objective-C. It worked fine on the iPhone 5s, but it was too slow on other devices like the iPhone 4, 4s, 5 and iPad mini. I found that the most time-consuming operations were calculating the distance between 2 points and calculating the length of a vector, as shown below, which involves taking a square root:
float speed = sqrtf(pow(self.dynamic.speed_x,2)+pow(self.dynamic.speed_y,2));
So I had to do something to make the calculations quicker. I rewrote the code so that the properties holding the coordinates of the objects, and properties such as velocity that were represented by X and Y components, became vectors of type GLKVector2. I was hoping this would make calculations such as the distance between 2 vectors (or points, as I understand it) and the addition and subtraction of vectors significantly faster, thanks to special vector functions like GLKVector2Distance, GLKVector2Normalize, GLKVector2Add, etc. However, it didn’t help much in terms of performance because, I believe, to put an object with properties of type GLKVector2 into the array I had to use NSValue, and I also had to decode the GLKVector2 values back from the object in the array to perform the vector calculations. Below is the code from the calculation method in the object’s implementation:
GLKVector2 currentPosition;
[self.currentPosition getValue:&currentPosition];
GLKVector2 newPosition;
// calculations with vectors. Result is in newPosition.
self.currentPosition = [NSValue value:&newPosition withObjCType:@encode(GLKVector2)];
Moreover, after I rewrote the code to use GLKVector2, I got memory warnings, and after running for some time the application sometimes crashes.
I spent several days trying to find a way to do the calculations faster. I looked at vecLib and OpenGL, but have not found a solution that I could understand. I have a feeling that I might have to write the code in C and integrate it somehow into Objective-C, but I don’t understand how to integrate it with the array of objects without using NSValue thousands of times.
I would greatly appreciate it if anyone could advise me on what direction I should look in. Maybe there is a library that can be easily used in Objective-C with groups of objects stored in arrays?

Learn how to use Instruments. Any performance optimisation is totally pointless unless you measure speed before and after your changes. An array of 1000 objects is nothing, so you might be worrying about nothing. Or slowdowns are in a totally different place. Use Instruments.
x * x is a multiplication. powf (x, 2.0) is an expensive function call that probably takes anywhere between 20 and 100 times longer.
GLKVector2 is a primitive (it is a union). Trying to stash it into an NSValue is totally pointless and wastes probably 100 times more time than just storing it directly.
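The last two points can be illustrated with a small sketch, written here in plain C++ (which can be mixed into an Objective-C project as Objective-C++). Vec2, Body, length, advance and step are hypothetical names, not GLKit API. The vectors are stored by value in a contiguous array, with no boxing, and the length uses multiplication instead of a power function.

```cpp
#include <cmath>
#include <vector>

struct Vec2 {
    float x;
    float y;
};

// Position and velocity stored by value, no wrapper objects.
struct Body {
    Vec2 position;
    Vec2 velocity;
};

// Length using plain multiplications instead of pow().
inline float length(Vec2 v) {
    return std::sqrt(v.x * v.x + v.y * v.y);
}

// Move one body forward by dt seconds.
inline Body advance(Body b, float dt) {
    b.position.x += b.velocity.x * dt;
    b.position.y += b.velocity.y * dt;
    return b;
}

// Update all bodies in place; the data is contiguous in memory.
inline void step(std::vector<Body>& bodies, float dt) {
    for (Body& b : bodies) {
        b = advance(b, dt);
    }
}
```

Whether this is actually faster in a given app is something to confirm with a profiler, as the answer above says.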

Here is an answer to your question about integrating physics calculations in C with your current Objective-C class. If you have a fragment of C code in your .m file which may look like
static CGPoint point[MAX_OBJ];
static int n_points = 0;
with a corresponding function in plain C as you suggested for simulating physical interactions that acts on point to update object positions, as in
void tick_world() {
for (int k = 0; k < n_points; k ++) {
float speed = sqrtf((point[k].x*point[k].x) + (point[k].y*point[k].y));
...
}
}
then, your Objective-C class Moving for the moving object could contain a pointer to a particular CGPoint in point that you would define in the interface (probably the corresponding *.h file):
@interface Moving : NSObject {
    ...
    CGPoint *pos;
}
When handling the init message, you can then grab and initialize the next available element in point. If your objects persist throughout run time, this could be done very simply by
@implementation
...
-(id)initAtX:(float)x0 Y:(float)y0 {
    self = [super init];
    if (self) {
        if (n_points == MAX_OBJ) {
            [self release];
            return nil;
        }
        pos = point + n_points ++;
        pos->x = x0;
        pos->y = y0;
    }
    return self;
}
If your Moving objects do not persist, you might want to think of a smart way to recycle slots after destruction. For example, you could initialize all x of point with NAN, and use this as a way to locate a free slot. In your dealloc, you would then set pos->x = NAN.
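The slot-recycling idea can be sketched in C++ like this. PointPool, acquire, release and the capacity of 4 are made-up names and values for illustration. A slot counts as free while its x is NaN, so a live object must never store NaN in x.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>

struct Point {
    float x;
    float y;
};

constexpr std::size_t kMaxObj = 4;  // illustrative capacity

class PointPool {
public:
    // Mark every slot as free by storing NaN in x.
    PointPool() {
        for (Point& p : points_) {
            p.x = std::numeric_limits<float>::quiet_NaN();
        }
    }

    // Find the first free slot, initialize it, and return it.
    // Returns nullptr when the pool is full.
    Point* acquire(float x0, float y0) {
        for (Point& p : points_) {
            if (std::isnan(p.x)) {
                p.x = x0;
                p.y = y0;
                return &p;
            }
        }
        return nullptr;
    }

    // Hand a slot back so a later acquire can reuse it.
    void release(Point* p) {
        p->x = std::numeric_limits<float>::quiet_NaN();
    }

private:
    Point points_[kMaxObj];
};
```

The linear search is fine for small pools. For thousands of objects with frequent churn, a free list would avoid scanning.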

It sounds like you're up against a couple common problems: 1) fast math in a high-level language, and 2) a meta-problem: whether to get the benefit of others' work (OpenGL, and several ideas listed here) in exchange for a steep learning curve as the developer.
Especially for this subject matter, I think the trade is pretty good in favor of using a library. For many (e.g. Eigen), most of the learning curve is about integration of Objective C and C++, which you can quickly put behind you.
(As an aside, computing the square of the distance between objects is often sufficient for making comparisons. If that works in your app, you can save cycles by replacing distance(a,b) with distanceSquared(a,b).)
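A minimal sketch of that aside, with hypothetical names: to test whether two points are within a radius r of each other, compare the squared distance with r*r and skip the square root entirely. This assumes r is non-negative.

```cpp
struct Point {
    float x;
    float y;
};

// Squared Euclidean distance: two subtractions, two multiplications, one add.
inline float distance_squared(Point a, Point b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Equivalent to distance(a, b) <= r for r >= 0, without calling sqrt.
inline bool within_radius(Point a, Point b, float r) {
    return distance_squared(a, b) <= r * r;
}
```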

How to CUDA-ize code when all cores require global memory access

Without CUDA, my code is just two for loops that calculate the distance between all pairs of coordinates in a system and sort those distances into bins.
The problem with my CUDA version is that apparently threads can't write to the same global memory locations at the same time (race conditions?). The values I end up getting for each bin are incorrect because only one of the threads ended up writing to each bin.
__global__ void computePcf(
        double const * const atoms,
        double * bins,
        int numParticles,
        double dr) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numParticles - 1) {
        for (int j = i + 1; j < numParticles; j++) {
            double r = distance(&atoms[3*i + 0], &atoms[3*j + 0]);
            int binNumber = floor(r/dr);
            // Problem line right here.
            // This memory address is modified by multiple threads
            bins[binNumber] += 2.0;
        }
    }
}
So... I have no clue what to do. I've been Googling and reading about shared memory, but the problem is that I don't know what memory area I'm going to be accessing until I do my distance computation!
I know this is possible, because a program called VMD uses the GPU to speed up this computation. Any help (or even ideas) would be greatly appreciated. I don't need this optimized, just functional.

How many bins[] are there?
Is there some reason that bins[] need to be of type double? It's not obvious from your code. What you have is essentially a histogram operation, and you may want to look at fast parallel histogram techniques. Thrust may be of interest.
There are several possible avenues to consider with your code:
See if there is a way to restructure your algorithm to arrange computations in such a way that a given group of threads (or bin computations) are not stepping on each other. This might be accomplished based on sorting distances, perhaps.
Use atomics. This should solve your problem, but will likely be costly in terms of execution time (but since it's so simple, you might want to give it a try). In place of this:
bins[binNumber] += 2.0;
Something like this:
int * bins,
...
atomicAdd(bins+binNumber, 2);
You can still do this if bins are of type double, it's just a bit more complicated. Refer to the documentation for the example of how to do atomicAdd on a double.
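As I understand it, the usual way to build an atomic add for a type that lacks one is a compare-and-swap retry loop. The sketch below shows that technique on the CPU with std::atomic<double> and std::thread. It is an analogue for illustration, not CUDA code, and atomic_add and hammer_one_bin are hypothetical names.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Add delta to target atomically with a compare-and-swap retry loop.
void atomic_add(std::atomic<double>& target, double delta) {
    double expected = target.load();
    // If another thread changed target in the meantime, the exchange fails,
    // expected is refreshed with the current value, and we try again.
    while (!target.compare_exchange_weak(expected, expected + delta)) {
    }
}

// Demo driver: several threads add 2.0 to the same bin many times.
double hammer_one_bin(int num_threads, int adds_per_thread) {
    std::atomic<double> bin(0.0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&bin, adds_per_thread] {
            for (int k = 0; k < adds_per_thread; ++k) {
                atomic_add(bin, 2.0);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return bin.load();
}
```

With a plain `bin += 2.0` in place of atomic_add, updates from different threads could be lost, which is the symptom described in the question.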
If the number of bins is small (maybe a few thousand, or less), then you could create a few sets of bins that are updated by multiple threadblocks, and then use a reduction operation (adding the sets of bins together, element by element) at the end of the processing sequence. In this case, you might want to consider using a smaller number of threads or threadblocks, each of which processes multiple elements, by putting an additional loop in your kernel code, so that after each particle's processing is complete, the loop jumps to the next particle by adding gridDim.x*blockDim.x to the i variable and repeating the process. Since each thread or threadblock has its own local copy of the bins, it can do this without stepping on other threads' accesses.
For example, suppose I only needed 1000 bins of type int. I could create 1000 sets of bins, which would only take up about 4 megabytes. I could then give each of 1000 threads its own bin set to update, and no atomics would be required, since no thread could interfere with any other thread. By having each thread loop through multiple particles, I can still effectively keep the machine busy this way. When all the particle-binning is done, I then have to add my 1000 bin sets together, perhaps with a separate kernel call.
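The structure of that approach is sketched below as a sequential C++ simulation: each "worker" stands in for one GPU thread, and num_workers plays the role of gridDim.x*blockDim.x. The function name and its parameters are made up for illustration. The sketch shows only the data layout, the strided loop, and the final reduction, not actual parallel execution.

```cpp
#include <cstddef>
#include <vector>

// bin_of_item[i] is the bin index already computed for item i.
// Assumes every index is in the range [0, num_bins).
std::vector<int> histogram_privatized(const std::vector<int>& bin_of_item,
                                      int num_bins, int num_workers) {
    // One private set of bins per worker, so no two workers share a counter.
    std::vector<std::vector<int>> local(
        static_cast<std::size_t>(num_workers),
        std::vector<int>(static_cast<std::size_t>(num_bins), 0));

    for (int w = 0; w < num_workers; ++w) {
        // Strided loop: worker w handles items w, w + num_workers, ...
        for (std::size_t i = static_cast<std::size_t>(w);
             i < bin_of_item.size();
             i += static_cast<std::size_t>(num_workers)) {
            ++local[static_cast<std::size_t>(w)]
                   [static_cast<std::size_t>(bin_of_item[i])];
        }
    }

    // Reduction: add the private bin sets together, element by element.
    std::vector<int> total(static_cast<std::size_t>(num_bins), 0);
    for (const std::vector<int>& bins : local) {
        for (std::size_t b = 0; b < total.size(); ++b) {
            total[b] += bins[b];
        }
    }
    return total;
}
```

Memory use grows as num_workers times num_bins, which is why the answer suggests this only when the number of bins is small.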
