What is the significance of 1.000000015047466e+30? - machine-learning

I was attempting to do JIT compilation on a pytorch-based module from an NLP library and I saw that one of the generated fused CUDA kernel code implementations mentions the number 1.000000015047466e+30:
#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)
template<typename T>
__device__ T maximum(T a, T b) {
return isnan(a) ? a : (a > b ? a : b);
}
template<typename T>
__device__ T minimum(T a, T b) {
return isnan(a) ? a : (a < b ? a : b);
}
extern "C" __global__
void fused_add_mul_mul_sub(float* tattn_mask1_1, float* tac_2, float* tbd_2, float* output_1, float* aten_mul_1) {
{
if (blockIdx.x<1ll ? 1 : 0) {
if ((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)<169ll ? 1 : 0) {
if (blockIdx.x<1ll ? 1 : 0) {
float v = __ldg(tattn_mask1_1 + (long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x));
aten_mul_1[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = v * 1.000000015047466e+30f;
} } }if ((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)<2028ll ? 1 : 0) {
float v_1 = __ldg(tac_2 + (long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x));
float v_2 = __ldg(tbd_2 + 12ll * (((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)) % 169ll) + ((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)) / 169ll);
float v_3 = __ldg(tattn_mask1_1 + ((long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)) % 169ll);
output_1[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = (v_1 + v_2) * 0.125f - v_3 * 1.000000015047466e+30f;
}}
}
It feels like this should be some sort of FLOAT_MAX constant, but I don't know any numerical type that would have this as a limit.
Google searching for this constant just yields a handful of results that seem to suggest it may be a physical constant of some kind:
This MATLAB forum link suggests it may be a default value for some physical phenomena
This Github repo on whole cell electrophysiology seems to use it as a max limit of some kind.
This Opensea NFT link shows that the number is somehow used as a parameter for some kind of fractal art?
I'm absolutely baffled because I'm doing natural language processing using CUDA and have absolutely no clue why this number is appearing in my CUDA kernel code and why it's also being used in hard science research code. Does this number have some special floating point properties or something? Is it serving as a numerical stability factor somehow? Any leads would be greatly appreciated.

1.000000015047466e+30f is 10^30 rounded to the nearest value representable in float (IEEE-754 binary32, also called “single precision”), then rounded to 16 decimal digits, then formatted as a float literal.
Thus, it likely originated as 1e30 or another representation of 10^30 that was subsequently converted to float and then used to produce new source code.
(The float nearest 10^30 is exactly 1000000015047466219876688855040.)
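The rounding chain described above can be reproduced with a short sketch. It assumes Python's struct module as a stand-in for the conversion to float; the code generator that actually emitted the kernel is not reproduced here, and the variable names are illustrative only.

```python
import struct

# Round 1e30 to the nearest IEEE-754 binary32 value by packing it as a
# 4-byte float and unpacking it again. The result is held as a double,
# which can represent every binary32 value exactly.
as_float32 = struct.unpack("f", struct.pack("f", 1e30))[0]

# Exact integer value of that binary32 number.
exact_value = int(as_float32)

# 16 significant decimal digits: one before the point and 15 after it,
# followed by the "f" suffix of a float literal.
literal = format(as_float32, ".15e") + "f"

print(exact_value)
print(literal)
```

Under these assumptions, the two printed values match the exact integer quoted above and the literal that appears in the generated kernel.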

Related

Why does memkind and numactl improve program performance a lot?

According to the course https://www.coursera.org/learn/parallelism-ia/home/welcome, there is one example that illustrates the improvement from the memkind API by using hbw_posix_memalign((void**)&pixel, 64, sizeof(P)*width*height); I thought this API only provides aligned memory allocation, so I do not understand why it improves the GFLOP/s so much.
The relevant part of the code is as follows. Here the memory that stores img_in is allocated by the memkind API.
template<typename P>
void ApplyStencil(ImageClass<P> & img_in, ImageClass<P> & img_out) {
const int width = img_in.width;
const int height = img_in.height;
P * in = img_in.pixel;
P * out = img_out.pixel;
#pragma omp parallel for
for (int i = 1; i < height-1; i++)
#pragma omp simd
for (int j = 1; j < width-1; j++) {
P val = -in[(i-1)*width + j-1] - in[(i-1)*width + j] - in[(i-1)*width + j+1]
-in[(i )*width + j-1] + 8*in[(i )*width + j] - in[(i )*width + j+1]
-in[(i+1)*width + j-1] - in[(i+1)*width + j] - in[(i+1)*width + j+1];
val = (val < 0 ? 0 : val);
val = (val > 255 ? 255 : val);
out[i*width + j] = val;
}
}
My questions are as follows:
Is it only because we need fewer memory operations to get our data that the performance improves almost 5 times?
As for numactl, according to the Linux documentation, it allows us to bind processes to specific nodes or CPUs. When we use the command numactl -m 1, we get a 5x improvement. I am not sure whether the improvement comes from reduced NUMA communication delay.

Unstable long calculations

Trying to use Dafny as a CAS to check that some algebraic calculations are correct.
Dafny does a good job, except that it gets unstable for longer ones, failing to verify some very easy steps even when spoon-fed.
calc == {
// a dozen lines...
k * (a + b) * (a + b);
k * (a * a + b * b + 2 * a * b); // fails
}
When Dafny fails there, I try to help it, in vain.
calc == {
// a dozen lines...
k * (a + b) * (a + b);
calc == {
(a + b) * (a + b);
a * a + b * b + 2 * a * b;
}
k * (a * a + b * b + 2 * a * b); // still fails
}
I've also tried to put the substituted expression in a variable or even a lemma, with the same result.
Is there some way to tell Dafny which part of an expression we want to substitute?

Why doesn't this Fibonacci Number function work in O(log N)?

So the Fibonacci number for log (N) — without matrices.
N_i // i-th Fibonacci number
= N_{i-1} + N_{i-2} // by definition
= (N_{i-2} + N_{i-3}) + N_{i-2} // unwrap N_{i-1}
= 2*N_{i-2} + N_{i-3} // reduce the equation
= 2*(N_{i-3} + N_{i-4}) + N_{i-3} // unwrap N_{i-2}
// And so on
= 3*N_{i-3} + 2*N_{i-4}
= 5*N_{i-4} + 3*N_{i-5}
= 8*N_{i-5} + 5*N_{i-6}
= N_k*N_{i-k} + N_{k-1}*N_{i-k-1}
Now we write a recursive function, where at each step we take k ≈ i/2.
static long N(long i)
{
if (i < 2) return 1;
long k=i/2;
return N(k) * N(i - k) + N(k - 1) * N(i - k - 1);
}
Where is the fault?
You get a recursion formula for the effort: T(n) = 4T(n/2) + O(1) (disregarding the fact that the numbers get bigger, so the O(1) does not even hold). It's clear from this that T(n) is not in O(log(n)). Instead, by the master theorem, T(n) is in O(n^2).
Btw, this is even slower than the trivial algorithm to calculate all Fibonacci numbers up to n.
The four N calls inside the function each have an argument of around i/2. So the depth of the stack of N calls is roughly log2(n), but because each call generates four more, the bottom 'layer' of calls has 4^(log2 n) = n^2 calls, which is O(n^2). Thus, the fault is that N calls itself four times. With only two calls, as in the conventional iterative method, it would be O(n). I don't know of any way to do this with only one call, which could be O(log n).
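The growth described above can be checked empirically. The following is a minimal sketch that ports the question's function to Python and counts calls; the names fib_four_calls and count_calls are made up for this illustration.

```python
def fib_four_calls(i, counter):
    """Question's recurrence with N(0) = N(1) = 1; counter[0] tallies calls."""
    counter[0] += 1
    if i < 2:
        return 1
    k = i // 2
    return (fib_four_calls(k, counter) * fib_four_calls(i - k, counter)
            + fib_four_calls(k - 1, counter) * fib_four_calls(i - k - 1, counter))


def count_calls(i):
    """Total number of calls made while evaluating N(i)."""
    counter = [0]
    fib_four_calls(i, counter)
    return counter[0]


# Doubling the argument should multiply the call count by roughly 4,
# which is the signature of quadratic growth.
for n in (32, 64, 128, 256):
    print(n, count_calls(n), count_calls(2 * n) / count_calls(n))
```

If the printed ratio settles near 4, the call count is growing like n^2 rather than log n.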
An O(n) version based on this formula would be:
static long N(long i) {
if (i<2) {
return 1;
}
long k = i/2;
long val1;
long val2;
val1 = N(k-1);
val2 = N(k);
if (i%2==0) {
return val2*val2+val1*val1;
}
return val2*(val2+val1)+val1*val2;
}
which makes 2 N calls per function, making it O(n).
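For comparison, one recursive call per level is enough if each call returns two consecutive values instead of one. The sketch below uses the well-known fast-doubling identities F(2m) = F(m)*(2*F(m+1) - F(m)) and F(2m+1) = F(m)^2 + F(m+1)^2, which is a different technique from the formula in the question. It is written in Python, the helper name fib_pair is an assumption of this example, and the O(log n) bound counts arithmetic operations only, ignoring that the numbers themselves grow.

```python
def fib_pair(n):
    """Return (F(n), F(n+1)) with F(0) = 0, F(1) = 1, using one recursive call."""
    if n == 0:
        return (0, 1)
    a, b = fib_pair(n // 2)   # a = F(m), b = F(m+1), where m = n // 2
    c = a * (2 * b - a)       # F(2m)
    d = a * a + b * b         # F(2m+1)
    return (d, c + d) if n % 2 else (c, d)


def N(i):
    """Question's indexing, N(0) = N(1) = 1, so N(i) = F(i+1)."""
    return fib_pair(i)[1]


print([N(i) for i in range(8)])
```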
import java.util.Scanner;
public class fibonacci {
public static int count=0;
public static void main(String[] args) {
Scanner scan = new Scanner(System.in);
int i = scan.nextInt();
System.out.println("value of i ="+ i);
int result = fun(i);
System.out.println("final result is " +result);
}
public static int fun(int i) {
count++;
System.out.println("fun is called and count is "+count);
if(i < 2) {
System.out.println("function returned");
return 1;
}
int k = i/2;
int part1 = fun(k);
int part2 = fun(i-k);
int part3 = fun(k-1);
int part4 = fun(i-k-1);
return ((part1*part2) + (part3*part4)); /*RESULT WILL BE SAME FOR BOTH METHODS*/
//return ((fun(k)*fun(i-k))+(fun(k-1)*fun(i-k-1)));
}
}
I tried to code the problem you defined in Java. What I observed is that the complexity of the above code is not quite O(N^2) but somewhat less. By convention, however, the worst-case complexity is O(N^2), including other factors such as the time for computation (division, multiplication) and comparison.
The output of the above code shows how many times the function fun(int i) is called.
OUTPUT
So, including the time taken for comparison, division, and multiplication operations, the worst-case time complexity is O(N^2), not O(log N).
If we apply the technique used to analyze the recursive Fibonacci program, we end up with a simple equation:
T(N) = 4* T(N/2) + O(1)
where O(1) is some constant time.
So let's apply the master method to this equation.
According to the master method,
T(n) = aT(n/b) + f(n) where a >= 1 and b > 1
There are the following three cases:
If f(n) = Θ(n^c) where c < log_b(a), then T(n) = Θ(n^(log_b a))
If f(n) = Θ(n^c) where c = log_b(a), then T(n) = Θ(n^c log n)
If f(n) = Θ(n^c) where c > log_b(a), then T(n) = Θ(f(n))
In our equation a = 4, b = 2 and c = 0.
Case 1 applies, since c < log_b(a), i.e. 0 < 2 (log base 2 of 4 equals 2),
hence T(n) = O(n^2).
For more information about how the master method works, please visit: Analysis of Algorithms
Your idea is correct, and it will perform in O(log n) provided you don't compute the same formula over and over again. The whole point of having N(k) * N(i-k) is to have (k = i - k) so you only have to compute one instead of two. But if you only call recursively, you are performing the computation twice.
What you need is called memoization. That is, store every value that you have already computed, and if it comes up again, you get it in O(1).
Here's an example:
const int MAX = 10000;
// memoization array; 0 means "not computed yet"
int f[MAX] = {0};
// Return the nth Fibonacci number using memoization, with the question's
// convention fib(0) = fib(1) = 1 (the formula below relies on it).
// Assumes 0 <= n < MAX; the int result overflows for n > 45.
int fib(int n) {
// Base case
if (n < 2)
return 1;
// If fib(n) is already computed
if (f[n]) return f[n];
// Split near the middle: k = floor(n/2)
int k = n/2;
// Applying your formula
f[n] = fib(k) * fib(n - k) + fib(k - 1) * fib(n - k - 1);
return f[n];
}
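To see how few distinct arguments memoization actually has to evaluate, here is a hedged sketch in Python (chosen so that integer overflow does not get in the way). It keeps the question's convention N(0) = N(1) = 1; the function name fib_memo and the dictionary cache are assumptions of this example, not part of the answer above.

```python
def fib_memo(i, memo=None):
    """Question's recurrence with N(0) = N(1) = 1, caching each result."""
    if memo is None:
        memo = {}
    if i < 2:
        return 1
    if i not in memo:
        k = i // 2
        memo[i] = (fib_memo(k, memo) * fib_memo(i - k, memo)
                   + fib_memo(k - 1, memo) * fib_memo(i - k - 1, memo))
    return memo[i]


cache = {}
value = fib_memo(10_000, cache)
# Number of distinct arguments that were actually computed.
print(len(cache))
```

Each level of the recursion only touches a few neighbouring values around i/2, i/4, and so on, so the cache stays small: its size grows with the number of levels, which is about log2(i).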

Re-implementing Muller and Mueller clock recovery with control_loop

I'm currently implementing symbol timing recovery blocks. The idea is to be able to choose different TEDs (Gardner, zero-crossing, early-late, maximum-likelihood, etc.). In blocks like M&M recovery, the gain parameters of the loop are expressed explicitly (gain_omega and gain_mu), which can be difficult to get right. The control_loop class is, however, more convenient (loop characteristics can be specified by "loop bandwidth" and "damping factor" (zeta)). So my first test started with the re-implementation of the M&M clock recovery with a control loop. The work function of this block is shown below (comments are mine):
clock_recovery_mm_ff_impl::general_work(int noutput_items,
gr_vector_int &ninput_items,
gr_vector_const_void_star &input_items,
gr_vector_void_star &output_items)
{
const float *in = (const float *)input_items[0];
float *out = (float *)output_items[0];
int ii = 0; // input index
int oo = 0; // output index
int ni = ninput_items[0] - d_interp->ntaps(); // don't use more input than this
float mm_val;
while(oo < noutput_items && ii < ni ) {
// produce output sample
out[oo] = d_interp->interpolate(&in[ii], d_mu); //Interpolation
mm_val = slice(d_last_sample) * out[oo] - slice(out[oo]) * d_last_sample; // Error calculation
d_last_sample = out[oo];
//Loop filtering
d_omega = d_omega + d_gain_omega * mm_val; //Frequency
d_omega = d_omega_mid + gr::branchless_clip(d_omega-d_omega_mid, d_omega_lim); //Bound the frequency
d_mu = d_mu + d_omega + d_gain_mu * mm_val; //Phase
ii += (int)floor(d_mu); // Basepoint index
d_mu = d_mu - floor(d_mu); // Fractional interval
oo++;
}
consume_each(ii);
return oo;
}
Here is my code. First, the control loop is initialized in the constructor:
loop(new gr::blocks::control_loop(0.02,(1 + d_omega_relative_limit)*omega,
(1 - d_omega_relative_limit)*omega))
First of all, I would like to clear up a couple of doubts I have regarding the PLL (the control_loop above) in symbol timing recovery, particularly the phase and frequency ranges (which are in turn used for wrapping). Taking an analogy from the Costas loop: carrier phase is wrapped between -2pi and +2pi and the frequency offset is tracked between -1 and +1. It is quite straightforward to see why. Unfortunately, I can't get my head around phase and frequency tracking in symbol recovery. In the M&M block, frequency is tracked between (1 + omega_relative_limit)*omega and (1 - omega_relative_limit)*omega, where omega is simply the number of samples per symbol. Phase is tracked between 0 and omega. I don't understand why this is so and why the M&M block doesn't wrap it. Any ideas here will be appreciated.
And here is my work function:
debug_time_recovery_pam_test_1_impl::general_work (int noutput_items,
gr_vector_int &ninput_items,
gr_vector_const_void_star &input_items,
gr_vector_void_star &output_items)
{
// Tell runtime system how many output items we produced.
const float *in = (const float *)input_items[0];
float *out = (float *)output_items[0];
int ii = 0; // input index
int oo = 0; // output index
int ni = ninput_items[0] - d_interp->ntaps(); // don't use more input than this
float mm_val;
while(oo < noutput_items && ii < ni ) {
// produce output sample
out[oo] = d_interp->interpolate(&in[ii], d_mu);
//Calculating error
mm_val = slice(d_last_sample) * out[oo] - slice(out[oo]) * d_last_sample;
d_last_sample = out[oo];
//Loop filtering
loop->advance_loop(mm_val); // Filter the error
loop->frequency_limit(); //Stop frequency from wandering too far
//Loop phase and frequency
d_omega = loop->get_frequency();
d_mu = loop->get_phase();
//d_omega = d_omega + d_gain_omega * mm_val;
//d_omega = d_omega_mid + gr::branchless_clip(d_omega-d_omega_mid, d_omega_lim);
//d_mu = d_mu + d_omega + d_gain_mu * mm_val;
ii += (int)floor(d_mu); // Basepoint index
d_mu = d_mu - floor(d_mu);//Fractional interval
oo++;
}
consume_each(ii);
return oo;
}
I have tried to use the block in a GFSK demodulator and I got this error:
python: /build/gnuradio-bJXzXK/gnuradio-3.7.9.1/gnuradio-runtime/include/gnuradio/buffer.h:177: unsigned int gr::buffer::index_add(unsigned int, unsigned int): Assertion `s < d_bufsize' failed.
The first Google result for this error suggests that I'm somehow "abusing" the scheduler, since the error comes from somewhere below the API. I think my calculation of d_omega and d_mu from the control loop is a bit naive, but unfortunately I don't know any other way of doing it. Another alternative would be to use a modulo-1 counter (incrementing or decrementing), but I want to explore this option first.

Integer overflow in Fibonacci number

I was solving this CodeChef problem on Fibonacci numbers. It says the number can have up to 1000 digits, so why doesn't that cause integer overflow in the tester's solution when it scans the input and stores it in an unsigned long long int? I can't understand how the solution works. Below are the problem and the tester's solution.
The Head Chef has been playing with Fibonacci numbers for long . He has learnt several tricks related to Fibonacci numbers . Now he wants to test his chefs in the skills .
A fibonacci number is defined by the recurrence :
f(n) = f(n-1) + f(n-2) for n > 2
and f(1) = 0
and f(2) = 1 .
Given a number A , determine if it is a fibonacci number.
Input
The first line of the input contains an integer T denoting the number of test cases. The description of T test cases follows.
The only line of each test case contains a single integer A denoting the number to be checked .
Output
For each test case, output a single line containing "YES" if the given number is a fibonacci number , otherwise output a single line containing "NO" .
Constraints
1 ≤ T ≤ 1000
1 ≤ number of digits in A ≤ 1000
The sum of number of digits in A in all test cases <= 10000.
Example
Input:
3
3
4
5
Output:
YES
NO
YES
Tester's solution:
#include <iostream>
#include <cstdio>
#include <algorithm>
#include <set>
#include <cstring>
using namespace std;
int const mx = 6666;
set <unsigned long long> f;
unsigned long long fib[mx + 10];
char s[mx + 1];
int main(){
// freopen("input.txt", "r", stdin);
// freopen("output.txt", "w", stdout);
fib[0] = 0;
fib[1] = 1;
f.insert(1);
f.insert(0);
int i;
for (i = 2; i <= mx; i++){
fib[i] = fib[i - 1] + fib[i - 2];
f.insert(fib[i]);
}
int tc;
cin>>tc;
while (tc--){
unsigned long long n = 0, ten = 10;
cin>>s;
int len = strlen(s);
for (i = 0; i < len; i++){
char q = s[i];
unsigned long long a = q - '0';
n = n * ten + a;
}
if (f.find(n) == f.end()) printf("NO\n");
else printf("YES\n");
}
return 0;
}
From cplusplus you will see that,
ULLONG_MAX, the maximum value for an object of type unsigned long long int, is 18446744073709551615 (2^64-1) or greater.
The actual value depends on the particular system and library implementation, but shall reflect the limits of these types in the target platform.
The above information is just to let you know it's a BIG number. However, the limit I mentioned is not the reason no overflow occurs.
Most probably, the judge's input file does not contain any input that can cause an overflow.
And it's still possible to construct such input while fulfilling the conditions,
1 ≤ T ≤ 1000
1 ≤ number of digits in A ≤ 1000
The sum of number of digits in A in all test cases <= 10000.
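A further possibility, which the answer above does not state and which is offered here only as an assumption: unsigned arithmetic in C++ does not overflow in the undefined sense but wraps around, modulo 2^64 for a 64-bit unsigned long long. In that case both the precomputed table and the parsed input are reduced modulo 2^64 in the same way, so every true Fibonacci number still matches, while a non-Fibonacci number could in principle collide with a table entry. The sketch below emulates that behaviour in Python; MASK, looks_like_fibonacci and exact_fib are names invented for this illustration.

```python
MASK = (1 << 64) - 1  # emulate arithmetic modulo 2**64

# Fibonacci numbers reduced modulo 2**64, as the tester's table would hold them
# (indices 0..6666, matching mx = 6666 in the C++ code).
fib_mod = {0, 1}
a, b = 0, 1
for _ in range(2, 6667):
    a, b = b, (a + b) & MASK
    fib_mod.add(b)


def looks_like_fibonacci(digits):
    """Parse a decimal string with wrapping arithmetic, then look it up."""
    n = 0
    for ch in digits:
        n = (n * 10 + int(ch)) & MASK
    return n in fib_mod


def exact_fib(k):
    """Exact F(k) with F(0) = 0, F(1) = 1, using Python's big integers."""
    x, y = 0, 1
    for _ in range(k):
        x, y = y, x + y
    return x


print(looks_like_fibonacci("3"), looks_like_fibonacci("4"), looks_like_fibonacci("5"))
print(looks_like_fibonacci(str(exact_fib(200))))  # far larger than 2**64
```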
