I have some places in my code which look like this:
var i = 0
for c in vertexStates[0] {
    // this operation is costly (it encapsulates 4 linear interpolations inside)
    currentVertexes.append(vertexStates[1][i++].interpolateTo(c, alpha: factor))
}
And I know that there are more than 1000 vertexes in each vertexStates[index] array for sure (maybe up to 3000). What are the best practices for optimizing (vectorizing) such operations? Should I figure out how to do it in multiple threads? Will the profit from multi-threading outweigh the overhead? Maybe there are other ways of doing such operations faster?
I need a general approach for optimizing such operations (in my case, producing one array from two other arrays, where order matters), no matter whether 3000 counts as long or not. My iPhone 6 Plus CPU is loaded at 65% during this operation, so I can predict the 4s will show very poor results, even though I haven't tested it yet.
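For reference, here is the same loop without the manual index, using zip and reserveCapacity (a sketch, not measured; interpolateTo is the same method as above):

currentVertexes.reserveCapacity(vertexStates[0].count)
for (c, v) in zip(vertexStates[0], vertexStates[1]) {
    // zip pairs the two states element by element; same costly call as above
    currentVertexes.append(v.interpolateTo(c, alpha: factor))
}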
100 isn't very long. 300 isn't very long. 100,000 is where we can start arguing whether something is very long.
Did you measure how long things take? What is the slowest device your code could run on? If you run on iOS 7, how well does it run on an iPhone 4? If you run on iOS 8 or 9 only, how well does it run on a 4s or an iPad 2?
The first step is measuring. Post with results.
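For example, a minimal measurement (a sketch; CACurrentMediaTime comes from QuartzCore):

import QuartzCore

let t0 = CACurrentMediaTime()
// ... run the interpolation loop from the question ...
let t1 = CACurrentMediaTime()
print("interpolation took \((t1 - t0) * 1000) ms")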
I have 93 arrays. Each array has about 18 values on average.
I need to compute the product of these arrays.
So I have a two-dimensional array that stores these 93 arrays.
Here is what I try to do:
DATASET.first.product(*DATASET[1..-1])
Ruby returns
RangeError: too big to product
Does anyone know a workaround for this?
Some way to chunk them?
What you want is impossible.
The product of 93 arrays with ~18 elements each is an array with approximately 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 elements, each of which is a 93-element array.
This means you need 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 * 93 * 64 bits of memory to store it, which is roughly 409181424703974032246417423764355843507744515806959861294687507916928986312485983626178977220953161016576988305520852992 bytes. That is about 40 orders of magnitude more than the number of particles in the universe. In other words, even if you were to convert the entire universe into RAM, you would still need to find a way to store on the order of 827180612553027 yobibytes on each and every particle in the universe; that is about 6000000000000000000000000 times the information content of the World Wide Web and 10000000000000000000000 times the information content of the dark web.
    Does anyone know a workaround for this? Some way to chunk them?
Even if you process them in chunks, that doesn't change the fact that you still need to process 51147678087996754030802177970544480438468064475869982661835938489616123289060747953272372152619145127072123538190106624 elements. Even if you were able to process one element per CPU instruction (which is unrealistic, you will probably need dozens if not hundreds of instructions), and even if each instruction only takes one clock cycle (which is unrealistic, on current mainstream CPUs, each instruction takes multiple clock cycles), and even if you had a terahertz CPU (which is unrealistic, the fastest current CPUs top out at 5 GHz), and even if your CPU had a million cores (which is unrealistic, even GPUs only have a couple of thousand extremely simple cores), and even if your motherboard had a million sockets (which is unrealistic, mainstream motherboards only have a maximum of 4 sockets, and even the biggest supercomputers only have 10 million cores in total), and even if you had a million of those computers in a cluster, and even if you had a million of those clusters in a supercluster, and even if you had a million friends that also have a supercluster like this, it would still take you about 1621000000000000000000000000000000000000000000000000000000000000000000 years to iterate through them.
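You can verify the scale with your exact array lengths instead of the ~18 average (assuming DATASET is the two-dimensional array from the question):

combinations = DATASET.map(&:length).reduce(:*)
bytes_needed = combinations * DATASET.length * 8   # 93 values of 64 bits each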
Right, so as it is hopefully clear that this should not be attempted, I'll take a risk and attempt solving your actual problem.
You've mentioned in the comments that you need this array for property testing - I'll take a massive leap of faith here and assume you want to test that every possible combination satisfies some conditions - and this is the mistake here, as the number of possible combinations is just... large...
Instead, you can test that some of the combinations work. You can easily generate a short, randomized list of combinations using:
Array.new(num) { DATASET.map(&:sample) }
Where num is the number of combinations you want to test. Note that there is a chance some of the entries will be duplicated - but given your dataset size, the chances are comparable to colliding UUIDs and can be safely ignored.
Generating such a subset of possible solutions is much easier, faster and, most importantly, possible. Since the output is randomized, it will test slightly different combinations on each run, so remember to have some randomization setup in your test suite if you want to be able to recreate failures.
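A sketch of how this fits into a test run; some_property is a placeholder for whatever condition you are actually checking:

srand 1234   # fix the seed so a failing run can be reproduced
Array.new(1000) { DATASET.map(&:sample) }.each do |combination|
  raise "property violated for #{combination.inspect}" unless some_property(combination)
end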
I've implemented a minimal test compute shader in order to get a feel for the performance of Metal atomic functions, specifically atomic_fetch_add_explicit and atomic_compare_exchange_weak_explicit using atomic_uint in the device address space.
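For context, the kernel is essentially of this shape (a reconstruction of what I described, not the exact code; the loop count is varied between 1 and 10 in the tests below):

#include <metal_stdlib>
using namespace metal;

// One device-memory counter, incremented by every thread.
kernel void atomicAddTest(device atomic_uint *counter [[buffer(0)]],
                          uint tid [[thread_position_in_grid]])
{
    for (uint k = 0; k < 10; ++k) {
        atomic_fetch_add_explicit(counter, 1u, memory_order_relaxed);
    }
}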
Running a compute kernel which executes atomic_fetch_add_explicit once per thread, across 307200 threads, takes around 50 µs on an iPhone 12 Pro. Running atomic_fetch_add_explicit 10 times per thread instead takes about 650 µs, which makes sense. However, I'm confused by the Xcode shader profiler's line-by-line performance metrics:
In the profiler's line-by-line view (screenshot not reproduced here), blue, red and yellow mean arithmetic, synchronisation and control flow, respectively. According to Apple's documentation, this result means the atomic function calls each take zero percent of the function's total elapsed time:
    The statistics for lines in the function body indicate the time as a percent of the function's total elapsed time.
However, clearly that's not the case, as my time expenditure grows roughly ten-fold as I go from one to ten calls per thread.
Is this an Xcode bug, or am I missing something here? Answers appreciated, answers with sources doubly appreciated. :)
I'm using Z3 as a black box to find all possible combinations of some real-world objects with C# code like this:
while (solver.Check() == Status.SATISFIABLE)
{
    SATModel = solver.Model;
    ....
    // invert the Model
    ....
    solver.Assert(InvertedModel);
}
For most of my problems the program is working fine, but now I have a bigger problem, where there would be 8.5E+64 possible combinations without constraints.
I'm starting with some 6000 constraints.
What I observe is that the Check call takes less than 0.02 seconds at the beginning and builds up slowly. After 100000 found solutions it already takes 1 second per iteration, and after 130000 iterations I measure 2 seconds.
Is there an easy way to improve the performance?
It's not unreasonable that the solver takes longer and longer as the asserted constraints accumulate. But to make sure it's not some sort of memory leak on the C# side, you should check that the time spent in your while loop really goes to the Check part and not to the invert/assert part. If you determine z3 is the responsible party, perhaps filing it at https://github.com/Z3Prover/z3/issues might solicit a better answer from the developers.
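For example, a sketch using System.Diagnostics.Stopwatch (invertedModel stands in for whatever your elided inversion code produces):

var checkTime = new Stopwatch();
var assertTime = new Stopwatch();

while (true)
{
    checkTime.Start();
    var status = solver.Check();
    checkTime.Stop();
    if (status != Status.SATISFIABLE)
        break;

    var model = solver.Model;
    // ... build invertedModel from the model, as in the question ...

    assertTime.Start();
    solver.Assert(invertedModel);
    assertTime.Stop();
}
Console.WriteLine($"Check: {checkTime.Elapsed}, invert/assert: {assertTime.Elapsed}");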
I'm a non-programmer trying to assess the time spent in (OpenCV) functions. We have an AD converter which comes with a counter that is able to count external signals (e.g. from a function generator) with a frequency of 1 MHz, i.e. 1 µs resolution. The current counter status can be queried with a function cbIn32(..., unsigned long *pointertovalue).
So my idea was to query the counter status before and after calling the function of interest and then calculate the difference. However, doubts came up when I calculated the difference without a function call in between, which revealed relatively high fluctuations (values between 80 and 400 µs or so). I wondered if calculating the average time for calling cbIn32() (approx. 180 µs) and subtracting this from the putative time spent in the function of interest is a valid solution.
So my first two questions:
Is that approach generally feasible or useless?
Where do the fluctuations come from?
Alternatively, we tried using getTickCount(), which seemed to deliver reasonable values. But checking forums revealed that it supposedly has a low resolution of about 10 ms, which would be unsatisfactory (100 µs resolution would be appreciated). However, the values we got were in the sub-ms range.
This brings me to the next questions:
How can the time assessed for a function with getTickCount() be in the microseconds range, when the resolution is around 10 ms?
Should I trust the obtained values or not?
I also tried it with gprof, but it gave me "no time accumulated", although I am sure that the time spent in a function containing OpenCV-related calls is at least a few milliseconds. I even tried rebuilding OpenCV with ENABLE_PROFILING=ON, but got the same result. I read somewhere that you need to build static OpenCV libraries to enable profiling, but I am not sure if this would improve the situation. So the question here is:
What do I have to do so that gprof also "sees" opencv functions?
The next alternative would be the QueryPerformanceCounter() function of the WinAPI. I don't know how to use it, but I would fight my way through if you recommend it. Questions about that approach:
Will it be problematic because of multiple cores?
If yes, is there an "easy" way to handle that problem?
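From what I've read, the basic pattern would be something like this (a sketch; functionOfInterest is a placeholder for the OpenCV call I want to measure):

#include <windows.h>
#include <cstdio>

void functionOfInterest();  // placeholder for the call under test

int main() {
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   // counter ticks per second
    QueryPerformanceCounter(&t0);
    functionOfInterest();
    QueryPerformanceCounter(&t1);
    printf("elapsed: %.1f us\n",
           (t1.QuadPart - t0.QuadPart) * 1e6 / freq.QuadPart);
    return 0;
}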
I also tried it with Very Sleepy, but it somehow exits too early (it worked fine with other .exe files).
Newbie-friendly answers would be very much appreciated. My goal is to find the easiest approach with the highest precision. I'm working on Win7 64-bit, Eclipse with MinGW.
Thx for your help...
I am running a program on a university server that has 4 dual-core AMD Opteron(tm) 2210 HE processors, and the OS is Linux version 2.6.27.25-78.2.56.fc9.x86_64. My program implements Conway's Game of Life and it runs using pthreads and OpenMP. I timed the parallel part of the program using the gettimeofday() function with 1-8 threads. But the timings don't seem right. I get the biggest time using 1 thread (as expected), then the time gets smaller. But the smallest time I get is when I use 4 threads.
Here is an example when I use an array 1000x1000.
Using 1 thread ~ 9.62 sec, 2 threads ~ 4.73 sec, 3 ~ 3.64 sec, 4 ~ 2.99 sec, 5 ~ 4.19 sec, 6 ~ 3.84 sec, 7 ~ 3.34 sec, 8 ~ 3.12 sec.
The above timings are with pthreads. When I use OpenMP the timings are smaller but follow the same pattern.
I expected the time to keep decreasing from 1 to 8 threads because of the 4 dual-core CPUs: since there are 4 CPUs with 2 cores each, 8 threads should be able to run at the same time. Does it have to do with the operating system the server runs?
I also tested the same programs on another server that has 7 dual-core AMD Opteron(tm) 8214 processors and runs Linux version 2.6.18-194.3.1.el5. There the timings are what I expected: they get smaller from 1 thread (the biggest) to 8 (the smallest execution time).
The program implements the Game of Life correctly, both with pthreads and with OpenMP; I just can't figure out why the timings look like the example I posted. So, in conclusion, my questions are:
1) Does the number of threads that can run at the same time depend on the number of cores of the CPUs? Only on the number of CPUs, regardless of how many cores each one has? On both of these plus the operating system?
2) Does it have to do with the way I divide the 1000x1000 array among the threads? But if it did, wouldn't the OpenMP code, which divides the work itself, show a different pattern of timings?
3) What could be the reason for such timings?
This is the code I use with openmp:
#include <stdio.h>
#include <sys/time.h>
#include <omp.h>

#define Row (1000+2)   /* 1000x1000 grid plus a one-cell border */
#define Col (1000+2)

int num;
int (*temp)[Col];
int (*a1)[Col];
int (*a2)[Col];

int main() {
    int i, j, l, sum;
    int array1[Row][Col], array2[Row][Col];
    struct timeval tim;
    double start, end;

    /* clear both generations */
    for (i = 0; i < Row; i++)
        for (j = 0; j < Col; j++) {
            array1[i][j] = 0;
            array2[i][j] = 0;
        }

    /* initial live cells */
    array1[3][16] = 1;
    array1[4][16] = 1;
    array1[5][15] = 1;
    array1[6][15] = 1;
    array1[6][16] = 1;
    array1[7][16] = 1;
    array1[5][14] = 1;
    array1[4][15] = 1;

    a1 = array1;
    a2 = array2;

    printf("\nGive number of threads:");
    scanf("%d", &num);

    gettimeofday(&tim, NULL);
    start = tim.tv_sec + (tim.tv_usec / 1000000.0);

    omp_set_num_threads(num);
    #pragma omp parallel private(l, i, j, sum)
    {
        printf("Number of Threads:%d\n", omp_get_num_threads());
        for (l = 0; l < 100; l++) {   /* 100 generations */
            #pragma omp for
            for (i = 1; i < (Row - 1); i++) {
                for (j = 1; j < (Col - 1); j++) {
                    /* count the eight neighbours */
                    sum = a1[i-1][j-1] + a1[i-1][j] + a1[i-1][j+1]
                        + a1[i][j-1]                + a1[i][j+1]
                        + a1[i+1][j-1] + a1[i+1][j] + a1[i+1][j+1];
                    if ((a1[i][j] == 1) && (sum == 2 || sum == 3))
                        a2[i][j] = 1;   /* survives */
                    else if ((a1[i][j] == 1) && (sum < 2))
                        a2[i][j] = 0;   /* dies of isolation */
                    else if ((a1[i][j] == 1) && (sum > 3))
                        a2[i][j] = 0;   /* dies of overcrowding */
                    else if ((a1[i][j] == 0) && (sum == 3))
                        a2[i][j] = 1;   /* birth */
                    else if (a1[i][j] == 0)
                        a2[i][j] = 0;
                }   /* end of iteration j */
            }   /* end of iteration i */
            #pragma omp barrier
            #pragma omp single
            {
                /* swap the generation buffers */
                temp = a1;
                a1 = a2;
                a2 = temp;
            }
            #pragma omp barrier
        }   /* end of iteration l */
    }   /* end of parallel region */

    gettimeofday(&tim, NULL);
    end = tim.tv_sec + (tim.tv_usec / 1000000.0);
    printf("\nTime Elapsed:%.6lf\n", end - start);
    printf("all ok\n");
    return 0;
}
Timings with the OpenMP code:
a) System with 7 dual-core CPUs:
Using 1 thread ~ 7.72 sec, 2 threads ~ 4.53 sec, 3 ~ 3.64 sec, 4 ~ 2.24 sec, 5 ~ 2.02 sec, 6 ~ 1.78 sec, 7 ~ 1.59 sec, 8 ~ 1.44 sec.
b) System with 4 dual-core CPUs:
Using 1 thread ~ 9.06 sec, 2 threads ~ 4.86 sec, 3 ~ 3.49 sec, 4 ~ 2.61 sec, 5 ~ 3.98 sec, 6 ~ 3.53 sec, 7 ~ 3.48 sec, 8 ~ 3.32 sec.
Above are the timings I get.
One thing you have to remember is that you're doing this on a shared-memory architecture. The more loads/stores you try to do in parallel, the more likely you are to hit contention over memory access, which is a relatively slow operation. So in my experience, typical applications don't benefit from more than about 6 cores. (This is anecdotal; I could go into a lot of detail, but I don't feel like typing. Suffice to say, take these numbers with a grain of salt.)
Try instead to minimize access to shared resources if possible, and see what that does to your performance. Otherwise, optimize for what you've got, and remember this:
Throwing more cores at a problem does not mean it will go quicker. Like with taxation, there's a curve: at some point the number of cores starts becoming a detriment to getting the most performance out of your program. Find that "sweet spot", and use it.
You write:
    The above timings are with pthreads. When I use OpenMP the timings are smaller but follow the same pattern.
Congratulations, you have discovered the pattern which all parallel programs follow! If you plot execution time against the number of processors, the curve eventually flattens out and starts to rise; you reach a point where adding more processors slows things down.
The interesting question is how many processors you can profitably use, and the answer depends on many factors. @jer has pointed out some of the factors which affect the scalability of programs on shared-memory computers. Other factors, principally the ratio of communication to computation, ensure that the shape of the performance curve will be the same on distributed-memory computers too.
The other factor which is important when measuring the parallel scalability of your program is the problem size(s) you use. How does your performance curve change when you try a grid of 1414 x 1414 cells (about twice the work)? I would expect that the curve will be below the curve for the problem on 1000 x 1000 cells and will flatten out later.
For further reading, Google for Amdahl's Law and Gustafson's Law.
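In Amdahl's terms, if a fraction p of the run parallelises perfectly and the rest is serial, the best speedup on N processors is bounded by (the p below is illustrative, not measured from your program):

S(N) = 1 / ((1 - p) + p/N)

e.g. with p = 0.95:  S(4) ≈ 3.5,  S(8) ≈ 5.9,  S(infinity) = 1/0.05 = 20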
It could be that your sysadmin is controlling how many threads you can execute simultaneously or how many cores you run on. I don't know if it is possible at the sysadmin level, but it sure is possible to tell a process that.
Or your algorithm could be using the L2 cache poorly. Hyper-threading, or whatever they call it now, works best when one thread is doing something that takes a long time and the other thread is not. Accessing memory not in the L2 cache is SLOW, and a thread doing so will stall while it waits. This is just one example of where the time to run multiple threads on a single core comes from. A quad-core memory bus might allow each core to access some of the RAM at the same time, but not each thread in each core. If both threads go for RAM, they are basically running sequentially. So that could be where your 4 comes from.
You might look to see if you can change your loops so they operate on contiguous RAM. If you break the problem into small blocks of data that fit in your L2 cache and iterate through those blocks, you might get 8x; a sketch of the idea follows below. If you search for Intel's optimization guides for their latest processors, they talk about these issues.
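A sketch of that blocking idea applied to the loops from the question (BLOCK is a tuning parameter; pick it so a tile of a1 and a2 fits in the L2 cache):

#define BLOCK 64   /* tile edge length; tune per cache size */

for (int ii = 1; ii < Row - 1; ii += BLOCK)
    for (int jj = 1; jj < Col - 1; jj += BLOCK)
        /* update one BLOCK x BLOCK tile before moving on */
        for (int i = ii; i < ii + BLOCK && i < Row - 1; i++)
            for (int j = jj; j < jj + BLOCK && j < Col - 1; j++) {
                int sum = a1[i-1][j-1] + a1[i-1][j] + a1[i-1][j+1]
                        + a1[i][j-1]                + a1[i][j+1]
                        + a1[i+1][j-1] + a1[i+1][j] + a1[i+1][j+1];
                /* same Game of Life rule as the if-chain in the question */
                a2[i][j] = (sum == 3 || (a1[i][j] == 1 && sum == 2)) ? 1 : 0;
            }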