Why do memkind and numactl improve program performance so much? - memory

According to the course https://www.coursera.org/learn/parallelism-ia/home/welcome, there is an example that tries to illustrate the improvement gained from the memkind API by using hbw_posix_memalign((void**)&pixel, 64, sizeof(P)*width*height);. I think this API only provides aligned memory allocation, so I do not understand why it improves the GFLOPS so much, as shown below.
The relevant part of the code is as follows. Here the memory that stores img_in is allocated with the memkind API.
template<typename P>
void ApplyStencil(ImageClass<P> & img_in, ImageClass<P> & img_out) {
  const int width  = img_in.width;
  const int height = img_in.height;
  P * in  = img_in.pixel;
  P * out = img_out.pixel;

#pragma omp parallel for
  for (int i = 1; i < height-1; i++)
#pragma omp simd
    for (int j = 1; j < width-1; j++) {
      P val = -in[(i-1)*width + j-1] -   in[(i-1)*width + j] - in[(i-1)*width + j+1]
              -in[(i  )*width + j-1] + 8*in[(i  )*width + j] - in[(i  )*width + j+1]
              -in[(i+1)*width + j-1] -   in[(i+1)*width + j] - in[(i+1)*width + j+1];

      val = (val < 0   ? 0   : val);
      val = (val > 255 ? 255 : val);

      out[i*width + j] = val;
    }
}
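For context, hbw_posix_memalign() has the same alignment semantics as posix_memalign(), but it asks memkind to place the allocation in high-bandwidth memory (MCDRAM on a Knights Landing system) rather than ordinary DDR. Below is a minimal sketch of the two allocation paths; the helper function and its name are mine, not part of the course code.

#include <cstddef>
#include <cstdlib>
#include <hbwmalloc.h>   // memkind's high-bandwidth-memory allocator

// Hypothetical helper: allocate a 64-byte-aligned pixel buffer either in
// high-bandwidth memory (via memkind) or in ordinary DDR.
float* allocate_pixels(std::size_t width, std::size_t height, bool use_hbm) {
    void* p = nullptr;
    const std::size_t bytes = sizeof(float) * width * height;
    const int rc = use_hbm
        ? hbw_posix_memalign(&p, 64, bytes)   // place the buffer in MCDRAM (if available)
        : posix_memalign(&p, 64, bytes);      // same alignment, but in DDR
    return rc == 0 ? static_cast<float*>(p) : nullptr;
}
// Free with hbw_free() for the HBM path and free() for the DDR path.

So the alignment is the same in both cases; what differs is which physical memory the working set ends up in.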
My questions are the following:
Is it only because fewer memory operations are needed to get our data, and that alone improves the performance by almost 5 times?
In terms of numactl, based on the Linux documentation, it allows us to bind processes to specific nodes or CPUs. When we use the command numactl -m 1, we get about a 5x improvement. I am not sure whether the improvement comes from the NUMA communication delay.
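For reference, numactl -m 1 (i.e. --membind=1) forces every allocation of the process onto NUMA node 1; on a Knights Landing system booted in flat mode, that node is typically the MCDRAM. A rough programmatic equivalent using libnuma is sketched below; the buffer size and node number are just placeholders.

#include <cstddef>
#include <numa.h>   // libnuma; link with -lnuma

int main() {
    // No NUMA support visible to libnuma: bail out.
    if (numa_available() < 0)
        return 1;

    const std::size_t bytes = 1024 * 1024;   // placeholder size
    // Allocate this buffer on NUMA node 1, roughly what "numactl -m 1"
    // does for every allocation made by the process.
    void* buf = numa_alloc_onnode(bytes, 1);
    // ... touch/use buf here ...
    numa_free(buf, bytes);
    return 0;
}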

Related

What is the significance of 1.000000015047466e+30?

I was attempting to do JIT compilation on a PyTorch-based module from an NLP library and I saw that one of the generated fused CUDA kernels mentions the number 1.000000015047466e+30:
#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void fused_add_mul_mul_sub(float* tattn_mask1_1, float* tac_2, float* tbd_2, float* output_1, float* aten_mul_1) {
  {
    if (blockIdx.x < 1ll ? 1 : 0) {
      if ((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x) < 169ll ? 1 : 0) {
        if (blockIdx.x < 1ll ? 1 : 0) {
          float v = __ldg(tattn_mask1_1 + (long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x));
          aten_mul_1[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = v * 1.000000015047466e+30f;
        }
      }
    }
    if ((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x) < 2028ll ? 1 : 0) {
      float v_1 = __ldg(tac_2 + (long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x));
      float v_2 = __ldg(tbd_2 + 12ll * (((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)) % 169ll) + ((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)) / 169ll);
      float v_3 = __ldg(tattn_mask1_1 + ((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)) % 169ll);
      output_1[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = (v_1 + v_2) * 0.125f - v_3 * 1.000000015047466e+30f;
    }
  }
}
It feels like this should be some sort of FLOAT_MAX constant, but I don't know any numerical type that would have this as a limit.
Google searching for this constant just yields a handful of results that seem to suggest it may be a physical constant of some kind:
This MATLAB forum link suggests it may be a default value for some physical phenomena
This Github repo on whole cell electrophysiology seems to use it as a max limit of some kind.
This Opensea NFT link shows that the number is somehow used as a parameter for some kind of fractal art?
I'm absolutely baffled because I'm doing natural language processing using CUDA and have absolutely no clue why this number is appearing in my CUDA kernel code and why it's also being used in hard science research code. Does this number have some special floating point properties or something? Is it serving as a numerical stability factor somehow? Any leads would be greatly appreciated.
1.000000015047466e+30f is 10^30 rounded to the nearest value representable in float (IEEE-754 binary32, also called “single precision”), then rounded to 16 decimal digits, then formatted as a float literal.
Thus, it likely originated as 1e30 or another representation of 10^30 that was subsequently converted to float and then used to produce new source code.
(The float nearest 10^30 is exactly 1000000015047466219876688855040.)
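A quick way to check this (a small C++ sketch, not taken from the original kernel) is to round 1e30 to float and print it back with 16 significant digits:

#include <cstdio>

int main() {
    float f = 1e30f;                    // 10^30 rounded to the nearest binary32 value
    std::printf("%.16g\n", (double)f);  // prints 1.000000015047466e+30
    std::printf("%.0f\n", (double)f);   // prints 1000000015047466219876688855040
    return 0;
}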

Why doesn't this Fibonacci Number function work in O(log N)?

So, computing the i-th Fibonacci number in O(log N), without matrices.
N(i)                                  // i-th Fibonacci number
  = N(i-1) + N(i-2)                   // by definition
  = (N(i-2) + N(i-3)) + N(i-2)        // unwrap N(i-1)
  = 2*N(i-2) + N(i-3)                 // reduce the equation
  = 2*(N(i-3) + N(i-4)) + N(i-3)      // unwrap N(i-2)
  // and so on
  = 3*N(i-3) + 2*N(i-4)
  = 5*N(i-4) + 3*N(i-5)
  = 8*N(i-5) + 5*N(i-6)
  = N(k)*N(i-k) + N(k-1)*N(i-k-1)
Now we write a recursive function where at each step we take k ≈ i/2.
static long N(long i)
{
    if (i < 2) return 1;
    long k = i/2;
    return N(k) * N(i - k) + N(k - 1) * N(i - k - 1);
}
Where is the fault?
This gives a recurrence for the running time: T(n) = 4T(n/2) + O(1) (disregarding the fact that the numbers get bigger, so the O(1) does not even hold). It is clear from this that T(n) is not in O(log n); instead, by the master theorem, T(n) is in O(n^2).
Btw, this is even slower than the trivial algorithm to calculate all Fibonacci numbers up to n.
The four N calls inside the function each have an argument of around i/2, so the depth of the recursion is roughly log2(n). But because each call generates four more, the bottom 'layer' of calls contains about 4^(log2 n) = n^(log2 4) = n^2 calls. Thus, the fault is that N calls itself four times. With only two calls, as in the O(n) version below, it would be O(n). I don't know of any way to do this with only one call, which would be O(log n).
An O(n) version based on this formula would be:
static long N(long i) {
    if (i < 2) {
        return 1;
    }
    long k = i/2;
    long val1 = N(k-1);
    long val2 = N(k);
    if (i % 2 == 0) {
        return val2*val2 + val1*val1;
    }
    return val2*(val2+val1) + val1*val2;
}
which makes two N calls per invocation, making it O(n).
import java.util.Scanner;

public class fibonacci {
    public static int count = 0;

    public static void main(String[] args) {
        Scanner scan = new Scanner(System.in);
        int i = scan.nextInt();
        System.out.println("value of i = " + i);
        int result = fun(i);
        System.out.println("final result is " + result);
    }

    public static int fun(int i) {
        count++;
        System.out.println("fun is called and count is " + count);
        if (i < 2) {
            System.out.println("function returned");
            return 1;
        }
        int k = i/2;
        int part1 = fun(k);
        int part2 = fun(i-k);
        int part3 = fun(k-1);
        int part4 = fun(i-k-1);
        return ((part1*part2) + (part3*part4)); /* RESULT WILL BE SAME FOR BOTH METHODS */
        //return ((fun(k)*fun(i-k)) + (fun(k-1)*fun(i-k-1)));
    }
}
I tried to code the problem you defined in Java. What I observed is that the complexity of the above code is not exactly O(N^2) but somewhat less than that. But by convention the worst-case complexity is O(N^2), once other factors such as computation (division, multiplication) and comparison time are included.
The output of the above code tells me how many times the function fun(int i) is called.
OUTPUT
So, including the time taken for the comparison, division and multiplication operations, the worst-case time complexity is O(N^2), not O(log N).
OK, if we apply the usual analysis technique for recursive Fibonacci programs, we end up with a simple recurrence
T(N) = 4*T(N/2) + O(1)
where O(1) is some constant time.
So let's apply the master method to this recurrence.
According to the master method, for
T(n) = a*T(n/b) + f(n) where a >= 1 and b > 1
there are the following three cases:
If f(n) = Θ(n^c) where c < log_b(a), then T(n) = Θ(n^(log_b a))
If f(n) = Θ(n^c) where c = log_b(a), then T(n) = Θ(n^c * log n)
If f(n) = Θ(n^c) where c > log_b(a), then T(n) = Θ(f(n))
In our recurrence a = 4, b = 2 and c = 0.
Case 1 applies, since c < log_b(a), i.e. 0 < 2 (log base 2 of 4 equals 2),
hence T(n) = O(n^2).
For more information about how the master method works please visit: Analysis of Algorithms
Your idea is correct, and it will perform in O(log n) provided you don't compute the same formula over and over again. The whole point of having N(k) * N(i-k) is to have k = i - k, so you only have to compute one of them instead of two. But if you only call recursively, you are performing the computation twice.
What you need is called memoization. That is, store every value that you have already computed, and if it comes up again, you get it in O(1).
Here's an example:
const int MAX = 10000;

// memoization array
int f[MAX] = {0};

// Return the n-th Fibonacci number using memoization
// (question's convention: N(0) = N(1) = 1, so the formula below holds)
int fib(int n) {
    // Base case, matching the convention used in the question
    if (n < 2)
        return (f[n] = 1);

    // If fib(n) is already computed
    if (f[n]) return f[n];

    int k = n/2;

    // Applying your formula
    f[n] = fib(k) * fib(n - k) + fib(k - 1) * fib(n - k - 1);

    return f[n];
}
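For comparison, here is a different O(log n) route (not the memoized version above): the fast-doubling identities F(2k) = F(k)*(2*F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2, sketched in C++ with the standard convention F(0) = 0, F(1) = 1:

#include <cstdint>
#include <cstdio>
#include <utility>

// Fast-doubling sketch: returns {F(n), F(n+1)}.
// Only one recursive call per level, so the depth is about log2(n).
static std::pair<std::uint64_t, std::uint64_t> fibPair(std::uint64_t n) {
    if (n == 0) return {0, 1};                 // {F(0), F(1)}
    auto [a, b] = fibPair(n / 2);              // a = F(k), b = F(k+1), k = n/2
    std::uint64_t c = a * (2 * b - a);         // F(2k)
    std::uint64_t d = a * a + b * b;           // F(2k+1)
    if (n % 2 == 0) return {c, d};             // n = 2k
    return {d, c + d};                         // n = 2k+1
}

int main() {
    std::printf("%llu\n", (unsigned long long)fibPair(10).first);  // prints 55
}

Like the memoized version, the values overflow 64-bit integers quickly (around n = 93), so the O(log n) claim only holds while the arithmetic stays constant-time.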

Re-implementing Mueller and Müller clock recovery with control_loop

I'm currently implementing symbol time recovery blocks. The idea is to be able to choose different TEDs (Gardner, zero-crossing, early-late, maximum-likelihood, etc.). In blocks like M&M recovery, the gain parameters of the loop are expressed explicitly (gain_omega and gain_mu), which can be difficult to get right. The control_loop class is, however, more convenient: loop characteristics can be specified by a "loop bandwidth" and a "damping factor" (zeta). So my first test started with re-implementing the M&M clock recovery with a control loop. The work function of this block is shown below (comments are mine).
int
clock_recovery_mm_ff_impl::general_work(int noutput_items,
                                        gr_vector_int &ninput_items,
                                        gr_vector_const_void_star &input_items,
                                        gr_vector_void_star &output_items)
{
  const float *in = (const float *)input_items[0];
  float *out = (float *)output_items[0];

  int ii = 0;                                   // input index
  int oo = 0;                                   // output index
  int ni = ninput_items[0] - d_interp->ntaps(); // don't use more input than this
  float mm_val;

  while (oo < noutput_items && ii < ni) {
    // produce output sample
    out[oo] = d_interp->interpolate(&in[ii], d_mu);                           // interpolation
    mm_val = slice(d_last_sample) * out[oo] - slice(out[oo]) * d_last_sample; // error calculation
    d_last_sample = out[oo];

    // loop filtering
    d_omega = d_omega + d_gain_omega * mm_val;                                        // frequency
    d_omega = d_omega_mid + gr::branchless_clip(d_omega - d_omega_mid, d_omega_lim);  // bound the frequency
    d_mu = d_mu + d_omega + d_gain_mu * mm_val;                                       // phase

    ii += (int)floor(d_mu);    // basepoint index
    d_mu = d_mu - floor(d_mu); // fractional interval
    oo++;
  }

  consume_each(ii);
  return oo;
}
Here is my code. First, the control loop is initialized in the constructor:
loop(new gr::blocks::control_loop(0.02,(1 + d_omega_relative_limit)*omega,
(1 - d_omega_relative_limit)*omega))
First of all, I would like to clear up a couple of doubts I have regarding the PLL (the control_loop above) in symbol timing recovery, particularly the phase and frequency ranges (which are in turn used for wrapping). Taking an analogy from the Costas loop: the carrier phase is wrapped between -2pi and +2pi, and the frequency offset is tracked between -1 and +1. It is quite straightforward to see why. Unfortunately, I can't get my head around phase and frequency tracking in symbol recovery. In the M&M block, frequency is tracked between (1 + omega_relative_limit)*omega and (1 - omega_relative_limit)*omega, where omega is simply the number of samples per symbol. Phase is tracked between 0 and omega. I don't understand why this is so, and why the M&M block doesn't wrap it. Any ideas here will be appreciated.
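For reference, here is a minimal self-contained sketch (not the GNU Radio class itself) of a second-order loop shaped like gr::blocks::control_loop, showing how its frequency/phase correspond to d_omega/d_mu and how alpha/beta play the roles of gain_mu/gain_omega. The gain formulas follow the commonly cited GNU Radio derivation; treat them, and the mid-range frequency seeding, as assumptions to verify against the control_loop sources of your installed version.

// Sketch of a proportional-plus-integrator loop like gr::blocks::control_loop.
struct SimpleControlLoop {
    float d_loop_bw, d_damping;
    float d_alpha, d_beta;        // phase (proportional) and frequency (integral) gains
    float d_phase, d_freq;
    float d_max_freq, d_min_freq;

    SimpleControlLoop(float loop_bw, float max_freq, float min_freq,
                      float damping = 0.7071f)
      : d_loop_bw(loop_bw), d_damping(damping),
        d_phase(0.0f),
        d_freq(0.5f * (max_freq + min_freq)),  // seed mid-range; the real class may
                                               // expect an explicit set_frequency()
        d_max_freq(max_freq), d_min_freq(min_freq)
    {
        float denom = 1.0f + 2.0f * d_damping * d_loop_bw + d_loop_bw * d_loop_bw;
        d_alpha = (4.0f * d_damping * d_loop_bw) / denom;   // plays the role of gain_mu
        d_beta  = (4.0f * d_loop_bw * d_loop_bw) / denom;   // plays the role of gain_omega
    }

    // Same structure as the explicit M&M update:
    //   d_freq  <-> d_omega (samples per symbol), d_phase <-> d_mu (fractional interval)
    void advance_loop(float error) {
        d_freq  += d_beta * error;
        d_phase += d_freq + d_alpha * error;
    }

    void frequency_limit() {
        if (d_freq > d_max_freq)      d_freq = d_max_freq;
        else if (d_freq < d_min_freq) d_freq = d_min_freq;
    }
};

Note that, unlike the Costas-loop case, the "phase" here is a sample-position offset rather than an angle, which is why the M&M block reduces d_mu with floor() (keeping the fractional interval and advancing the input index) instead of wrapping modulo 2*pi.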
And here is my work function
int
debug_time_recovery_pam_test_1_impl::general_work(int noutput_items,
                                                  gr_vector_int &ninput_items,
                                                  gr_vector_const_void_star &input_items,
                                                  gr_vector_void_star &output_items)
{
  const float *in = (const float *)input_items[0];
  float *out = (float *)output_items[0];

  int ii = 0;                                   // input index
  int oo = 0;                                   // output index
  int ni = ninput_items[0] - d_interp->ntaps(); // don't use more input than this
  float mm_val;

  while (oo < noutput_items && ii < ni) {
    // produce output sample
    out[oo] = d_interp->interpolate(&in[ii], d_mu);

    // calculate the error
    mm_val = slice(d_last_sample) * out[oo] - slice(out[oo]) * d_last_sample;
    d_last_sample = out[oo];

    // loop filtering
    loop->advance_loop(mm_val);   // filter the error
    loop->frequency_limit();      // stop the frequency from wandering too far

    // loop phase and frequency
    d_omega = loop->get_frequency();
    d_mu = loop->get_phase();

    //d_omega = d_omega + d_gain_omega * mm_val;
    //d_omega = d_omega_mid + gr::branchless_clip(d_omega-d_omega_mid, d_omega_lim);
    //d_mu = d_mu + d_omega + d_gain_mu * mm_val;

    ii += (int)floor(d_mu);    // basepoint index
    d_mu = d_mu - floor(d_mu); // fractional interval
    oo++;
  }

  // Tell runtime system how many output items we produced.
  consume_each(ii);
  return oo;
}
I have tried to use the block in a GFSK demodulator and I got this error:
python: /build/gnuradio-bJXzXK/gnuradio-3.7.9.1/gnuradio-runtime/include/gnuradio/buffer.h:177: unsigned int gr::buffer::index_add(unsigned int, unsigned int): Assertion `s < d_bufsize' failed.
The first Google search regarding this error suggests that I'm somehow "abusing" the scheduler, since this error comes from somewhere below the API. I think my calculation of d_omega and d_mu from the control loop is a bit naive, but unfortunately I don't know any other way of doing it. Another alternative would be to use a modulo-1 counter (incrementing or decrementing), but I want to explore this option first.

Bad access, multithreading, GCD, Swift

I am trying to translate some sample code from Objective-C into Swift!
I got it all working except for the multithreading part, which is crucial to this simulation.
For some reason, when I start using multiple threads, I get bad-access errors, specifically when getting or setting elements of the arrays.
This class is instantiated inside of a static class.
var screenWidthi: Int = 0
var screenHeighti: Int = 0
var poolWidthi: Int = 0
var poolHeighti: Int = 0
var rippleSource: [GLfloat] = []
var rippleDest: [GLfloat] = []

func update()
{
    let queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)
    dispatch_apply(Int(poolHeighti), queue, {(y: size_t) -> Void in
        //for y in 0..<poolHeighti
        //{
        let pw = self.poolWidthi
        for x in 1..<(pw - 1)
        {
            let ai: Int = (y    ) * (pw + 2) + x + 1
            let bi: Int = (y + 2) * (pw + 2) + x + 1
            let ci: Int = (y + 1) * (pw + 2) + x
            let di: Int = (y + 1) * (pw + 2) + x + 2
            let me: Int = (y + 1) * (pw + 2) + x + 1

            let a = self.rippleSource[ai]
            let b = self.rippleSource[bi]
            let c = self.rippleSource[ci]
            let d = self.rippleSource[di]

            var result = (a + b + c + d) / 2.0 - self.rippleDest[me]
            result -= result / 32.0
            self.rippleDest[me] = result
        }
    })
}
It is important to note that there is also another loop that should run on a different thread right after this one; it accesses the same arrays. That said, it still gets bad access without the second loop being on another thread, so I feel it is irrelevant to show here.
Could you please tell me what is going on that causes this crash to happen at random-ish times rather than on the first pass?
If you want a reference, here is what it was like in Objective-C:
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
dispatch_apply(poolHeight, queue, ^(size_t y) {
    for (int x = 0; x < poolWidth; x++)
    {
        float a = rippleSource[(y)*(poolWidth+2) + x+1];
        float b = rippleSource[(y+2)*(poolWidth+2) + x+1];
        float c = rippleSource[(y+1)*(poolWidth+2) + x];
        float d = rippleSource[(y+1)*(poolWidth+2) + x+2];

        float result = (a + b + c + d)/2.f - rippleDest[(y+1)*(poolWidth+2) + x+1];
        result -= result/32.f;

        rippleDest[(y+1)*(poolWidth+2) + x+1] = result;
    }
});
How do you ensure that variables can be accessed from different threads? How about static members?
I only know how to print out the call stack before the app crashes; after the crash, the only way I know to get at the call stack is to look at the threads. Let me know if there is a different way I should do this.
NOTE: I noticed something weird. I put a print statement in each loop so I could see what x and y coordinate it was processing and whether the crash was consistent. Obviously that brought the fps down to well under 1 fps, but I noticed it has yet to crash. The program runs perfectly so far, without any bad access, just at under 1 fps.
The Apple code is using a C-style array; these are "thread safe" when used appropriately, as the Apple code does.
Swift and Objective-C arrays are not thread-safe, and this is the cause of your issues. You need to implement some form of access control for the array.
A simple method is to associate a GCD serial queue with each array, then dispatch writes to the array asynchronously onto that queue and dispatch reads synchronously. This is simple but reduces concurrency; to make it better, read Mike Ash.
Mike Ash is good if you need to understand the issues, and for Swift code you can look at this question; read all the answers and comments.
HTH

Math in Dart - working with numbers other than base 10

I was trying to do some math in something other than base 10, and wanted to get some input from the Dart community on possible directions I can take.
So, for example, suppose p = 2. Then, looking at 100 in base 2:
100 = (1 · 2^6) + (1 · 2^5) + (0 · 2^4) + (0 · 2^3) + (1 · 2^2) + (0 · 2^1) + (0 · 2^0)
so
100 base 2 = 1100100
Add 1 to each digit and multiply the answers together:
(1 + 1)(1 + 1)(0 + 1)(0 + 1)(1 + 1)(0 + 1)(0 + 1) = 8
Try it for 3:
100 = (1 · 3^4) + (0 · 3^3) + (2 · 3^2) + (0 · 3^1) + (1 · 3^0)
so
100 base 3 = 10201
(1 + 1)(0 + 1)(2 + 1)(0 + 1)(1 + 1) = 12
In Dart, I came up with the following:
https://gist.github.com/anonymous/11345381
int rows = 1000000000;
int total = 0;
for (int i = 0; i < rows; i++) {
  total += i.toRadixString(7)
            .split('')
            .map((s) => int.parse(s) + 1)
            .reduce((prev, next) => prev * next);
}
print('${total} results');
This does work, and it seems to be OK for a small number of rows, but for larger numbers it is really quite slow.
As shown, I am converting an int to a string (resulting in a string representation of the number), splitting it, mapping the split characters back to ints, adding 1 to each int, and then multiplying them all together.
Am I missing something when it comes to working with numbers in Dart in bases other than 10?
Going through the string seems inefficient. How about doing it directly?
int digitProduct(int n, int base) { // non-negative n and base only
  int product = 1;
  while (n > 0) {
    int digit = n % base;
    n = n ~/ base;
    product *= (digit + 1);
  }
  return product;
}

main() {
  int rows = 1000000000;
  int total = 0;
  for (int i = 0; i < rows; i++) {
    total += digitProduct(i, 7);
  }
  print("${total} results");
}
Also, 1000000000 is a pretty big number. Even at a microsecond per row, it will still take roughly 16 minutes to complete.
Division by arbitrary bases is slow. If you hardcode the base to be 7, it'll probably be significantly faster (I expect ~90% of the time spent in digitProduct to be used on ~/ and %).
