I am trying to compute 10 billion digits of Pi with a program called Y-Cruncher. When I try to compute Pi, it comes up with this issue and does not move for a long while. Any help would be appreciated.
I want to process values from InfluxDB in Grafana.
The end goal is to show how many miles the current vehicle has traveled in a given time frame.
You can use the formula: average velocity * time.
Does anyone have a good approach for this?
So what I'm thinking is: I can use the mean() function to get the average speed over each fixed period of time, derive the corresponding mileage from it, and then add all the mileage together. How do I do that?
What if you only use SQL?
1.) InfluxDB uses InfluxQL, not SQL.
2.) Your approach of average velocity * time is inaccurate.
3.) Use suitable InfluxDB functions; I would say INTEGRAL() is the best function for this case, plus some basic arithmetic (see the example query below). Don't expect 100% accuracy. Accuracy depends heavily on the metric sampling, e.g. with 1-minute sampling: what if the vehicle is driving for 59 seconds but happens to be standing still at the one second when the sample is taken? So don't be surprised when even 10-second sampling is inaccurate.
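For illustration only (the measurement and field names, vehicle_metrics and speed, are assumptions; adapt them to your schema), an InfluxQL query along these lines integrates speed over time to approximate distance:

SELECT INTEGRAL("speed", 1h) AS miles_traveled
FROM "vehicle_metrics"
WHERE time >= now() - 24h

If speed is stored in miles per hour, integrating with a unit of 1h gives miles directly; for km/h or m/s, choose the integration unit accordingly or scale the result with basic arithmetic.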
I am using a JMeter plugin to show CPU- and memory-related performance graphs. It shows me the metrics based on the thread group settings and a REST client call to the server using the server agent. But I am not able to understand the metrics on the X and Y axes for CPU and memory. Please explain how this works.
X axis: the X-axis value is the cumulative elapsed time of the current test. In the graphs below, the test started 2 minutes and 5 seconds ago, so the value is 00:02:05.
Example 1: if the test started 30 minutes ago, then the X-axis value is 00:30:00.
Example 2: if the test started 20 minutes and 20 seconds ago, then the X-axis value is 00:20:20.
Y axis: a bit tricky, because in the graph below the "localhost Memory (*100)" utilization reads 9000, which means the actual utilization is only 90%; the plotted value is the actual value multiplied by 100 (90 * 100 = 9000).
"localhost CPU (*1000)" means the plotted CPU value is the actual value (4) multiplied by 1000.
The reason Memory is multiplied by 100 and CPU by 1000 is that network utilization is around 30000, so the scaling keeps all the series visible on the same axis.
In the graph above, memory utilization is ~9000 (~90 * 100); the actual utilization is only ~90, as the graph below confirms.
The same applies to CPU utilization, as shown below.
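In other words, to recover the actual values from the plotted ones, divide by the scale factor shown in the series name, for example:

actual memory utilization = plotted value / 100   (9000 / 100 = 90 %)
actual CPU value          = plotted value / 1000  (4000 / 1000 = 4)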
I intend to use the BeagleBone to sample a shaped signal on the order of 1 microsecond. I need to fit the signal afterwards, and therefore I would like a sampling rate of, let's say, 10 MHz, something that seems feasible with the PRU and libpruio. The point is, looking at the ADC specifications, it seems there is a limit at 200 kHz. Is my reasoning correct?
Thanks.
You'll need additional hardware for a sampling rate of 10 MHz! Neither libpruio nor the BBB hardware is designed to work at that speed.
The ADC subsystem in the AM335x CPU is clocked at 24 MHz and needs 15 cycles per sample (14 in continuous mode). This leads to a maximum sample rate of 1.6 (1.71) MSamples/s (see the quick check below). See the SRM, chapter 12, for details.
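As a back-of-the-envelope check of those numbers (plain C arithmetic, not libpruio code):

#include <stdio.h>

int main(void) {
  const double adc_clock_hz  = 24e6; /* AM335x ADC clock                 */
  const double cycles_single = 15;   /* cycles per sample, normal mode   */
  const double cycles_cont   = 14;   /* cycles per sample, continuous    */

  printf("max rate (normal):     %.2f MSamples/s\n", adc_clock_hz / cycles_single / 1e6);
  printf("max rate (continuous): %.2f MSamples/s\n", adc_clock_hz / cycles_cont / 1e6);
  return 0;
}

which prints roughly 1.60 and 1.71 MSamples/s.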
The problem is getting the samples into host memory. I couldn't get this working faster than ~250 kSamples/s (by CPU access; I didn't try DMA).
As long as you don't need more values than the FIFO can hold, you can sample a single line at maximum 1.7 MHz.
BR
I made a simple experiment by implementing a naive character-search algorithm that searches 1,000,000 rows of 50 characters each (a 50-million-character map) on both the CPU and the GPU (using the iOS 8 Metal compute pipeline).
The CPU implementation uses a simple loop; the Metal implementation gives each kernel invocation 1 row to process (source code below).
To my surprise, the Metal implementation is on average 2-3 times slower than the simple, linear CPU version (if I use 1 core) and 3-4 times slower if I employ 2 cores (each of them searching half of the database)!
I experimented with different threads per group (16, 32, 64, 128, 512) yet still get very similar results.
iPhone 6:
CPU 1 core: approx 0.12 sec
CPU 2 cores: approx 0.075 sec
GPU: approx 0.35 sec (release mode, validation disabled)
I can see the Metal shader spending more than 90% of its time accessing memory (see below).
What can be done to optimise it?
Any insights will be appreciated, as there are not many sources on the internet (besides the standard Apple programming guides) providing details on memory-access internals and trade-offs specific to the Metal framework.
METAL IMPLEMENTATION DETAILS:
Host code gist:
https://gist.github.com/lukaszmargielewski/0a3b16d4661dd7d7e00d
Kernel (shader) code:
https://gist.github.com/lukaszmargielewski/6b64d06d2d106d110126
GPU frame capture profiling results:
The GPU shader is also striding vertically through memory, whereas the CPU is moving horizontally. Consider the addresses actually touched more or less concurrently by each thread executing in lockstep in your shader as you read charTable. The GPU will probably run a good deal faster if your charTable matrix is transposed.
Also, because this code executes in a SIMD fashion, each GPU thread will probably have to run the loop to the full search phrase length, whereas the CPU will get to take advantage of early outs. The GPU code might actually run a little faster if you remove the early outs and just keep the code simple. Much depends on the search phrase length and likelihood of a match.
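A minimal sketch of the transposed-layout idea (the buffer names, the match counter, and the row/phrase parameters are assumptions, since the actual kernel is in the gist): store the table so that element (row, col) lives at charTableT[col * rowCount + row]; then at every loop step, neighbouring threads in a SIMD group read neighbouring bytes, and the loop runs to the full phrase length with no early out.

#include <metal_stdlib>
using namespace metal;

// Hypothetical kernel: one thread searches one row of the transposed table.
// At loop step i, adjacent threads (adjacent rows) touch adjacent addresses,
// which coalesces the memory accesses.
kernel void searchTransposed(const device char  *charTableT [[buffer(0)]],
                             const device char  *phrase     [[buffer(1)]],
                             device atomic_uint *matchCount [[buffer(2)]],
                             constant uint      &rowCount   [[buffer(3)]],
                             constant uint      &rowLength  [[buffer(4)]],
                             constant uint      &phraseLen  [[buffer(5)]],
                             uint row [[thread_position_in_grid]])
{
    if (row >= rowCount) return;
    for (uint start = 0; start + phraseLen <= rowLength; start++) {
        bool match = true;
        // No early break: every thread runs the full phrase length,
        // keeping the SIMD group in step (per the note about early outs above).
        for (uint i = 0; i < phraseLen; i++) {
            match = match && (charTableT[(start + i) * rowCount + row] == phrase[i]);
        }
        if (match) {
            atomic_fetch_add_explicit(matchCount, 1u, memory_order_relaxed);
        }
    }
}

Whether this actually wins depends on the data set and the phrase length, as noted above; profiling both layouts on the device is the only way to be sure.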
I'll take my guess too: the GPU isn't optimized for if/else and it doesn't predict branches (it probably executes both sides), so try to rewrite the algorithm in a more linear way without any conditionals, or reduce them to the bare minimum.
My CPU is Intel Core2 Duo T5550, GPU is GeForce 8400M G. CUDA version 5.5.22, OpenCV version 2.4.8.
The test code is as follows:
double t = (double)getTickCount();
gpu::threshold(src, dst, thres, binMax, THRESH_BINARY);  // only the threshold call itself is timed
t = ((double)getTickCount() - t) / getTickFrequency();   // elapsed time in seconds
cout << "Times passed in seconds: " << t << endl;
For a 3648*2736 image, the results are:
CPU: Times passed in seconds: 0.0136336
GPU: Times passed in seconds: 0.0217714
Thanks!
Perhaps this is not surprising.
Your GeForce 8400M G is an old mobile card with only 8 cores (see the GeForce 8M series specifications), so you cannot extract much parallelism out of it.
Bluntly speaking, GPUs have an advantage over multicore CPUs when you can extract massive parallelism across a large number of cores. In other words, to quickly build an Egyptian pyramid with slow workers (the GPU cores) you need a large number of workers. If you have only a very few slow workers (8 in your case), then perhaps it is better to have even fewer (2 CPU cores, for example) but much faster workers.
EDIT
I just remembered that I bumped into this post:
Finding minimum in GPU slower than CPU
which may help convince you that bad implementations (as underlined by Abid Rahman and Mailerdaimon) may lead to GPU code that is slower than CPU code. The situation is even worse if, as pointed out in the answer to the post above, you are also hosting the X display on your already limited GeForce 8400M G card.
In addition to what @JackOLantern said:
Every copy operation involving the GPU takes time! A lot of time compared to just computing on the CPU. This is why @Abid Rahman K's comment is a good idea: he suggested testing again with more complex code. The advantage of the GPU is fast parallel processing; one of its disadvantages is the relatively slow transfer rate when copying data to and from the GPU.
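To make the transfer cost visible, a sketch along these lines could be compared against the kernel-only timing from the question (the image path and threshold values are placeholders; the calls are from the OpenCV 2.4 gpu module):

#include <iostream>
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/gpu/gpu.hpp>

using namespace cv;
using namespace std;

int main()
{
    Mat src = imread("test.jpg", CV_LOAD_IMAGE_GRAYSCALE);  // placeholder input image
    Mat dst;
    double thres = 128, binMax = 255;                        // placeholder threshold values

    double t = (double)getTickCount();
    gpu::GpuMat d_src, d_dst;
    d_src.upload(src);                                       // host -> device copy
    gpu::threshold(d_src, d_dst, thres, binMax, THRESH_BINARY);
    d_dst.download(dst);                                     // device -> host copy
    t = ((double)getTickCount() - t) / getTickFrequency();
    cout << "GPU including transfers: " << t << " s" << endl;
    return 0;
}

On a tiny kernel like threshold, the two copies can easily dominate, which is why a more complex computation is needed before the GPU pays off.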