Outputting values from CAMPARY - arbitrary-precision

I'm trying to use the CAMPARY library (CudA Multiple Precision ARithmetic librarY). I've downloaded the code and included it in my project. Since it supports both cpu and gpu, I'm starting with cpu to understand how it works and make sure it does what I need. But the intent is to use this with CUDA.
I'm able to instantiate an instance and assign a value, but I can't figure out how to get things back out. Consider:
#include <time.h>
#include "c:\\vss\\CAMPARY\\Doubles\\src_cpu\\multi_prec.h"
int main()
{
const char *value = "123456789012345678901234567";
multi_prec<2> a(value);
a.prettyPrint();
a.prettyPrintBin();
a.prettyPrintBin_UnevalSum();
char *cc = a.prettyPrintBF();
printf("\n%s\n", cc);
free(cc);
}
Compiles, links, runs (VS 2017). But the output is pretty unhelpful:
Prec = 2
Data[0] = 1.234568e+26
Data[1] = 7.486371e+08
Prec = 2
Data[0] = 0x1.987bf7c563caap+86;
Data[1] = 0x1.64fa5c3800000p+29;
0x1.987bf7c563caap+86 + 0x1.64fa5c3800000p+29;
1.234568e+26 7.486371e+08
Printing each of the doubles like this might be easy to do, but it doesn't tell you much about the value of the 128 number being stored. Performing highly accurate computations is of limited value if there's no way to output the results.
In addition to just printing out the value, eventually I also need to convert these numbers to ints (I'm willing to try it all in floats if there's a way to print, but I fear that both accuracy and speed will suffer). Unlike MPIR (which doesn't support CUDA), CAMPARY doesn't have any associated multi-precision int type, just floats. I can probably cobble together what I need (mostly just add/subtract/compare), but only if I can get the integer portion of CAMPARY's values back out, which I don't see a way to do.
CAMPARY doesn't seem to have any docs, so it's conceivable these capabilities are there, and I've simply overlooked them. And I'd rather ask on the CAMPARY discussion forum/mail list, but there doesn't seem to be one. That's why I'm asking here.
To sum up:
Is there any way to output the 128bit ( multi_prec<2> ) values from CAMPARY?
Is there any way to extract the integer portion from a CAMPARY multi_prec? Perhaps one of the (many) math functions in the library that I don't understand computes this?

There are really only 2 possible answers to this question:
There's another (better) multi-precision library that works on CUDA that does what you need.
Here's how to modify this library to do what you need.
The only people who could give the first answer are CUDA programmers. Unfortunately, if there were such a library, I feel confident talonmies would have known about it and mentioned it.
As for #2, why would anyone update this library if they weren't a CUDA programmer? There are other, much better multi-precision libraries out there. The ONLY benefit CAMPARY offers is that it supports CUDA. Which means the only people with any real motivation to work with or modify the library are CUDA programmers.
And, as the CUDA programmer with the most vested interest in solving this, I did figure out a solution (albeit an ugly one). I'm posting it here in the hopes that the information will be of value to future CAMPARY programmers. There's not much information out there for this library, so this is a start.
The first thing you need to understand is how CAMPARY stores its data. And, while not complex, it isn't what I expected. Coming from MPIR, I assumed that CAMPARY stored its data pretty much the same way: a fixed size exponent followed by an arbitrary number of bits for the mantissa.
But nope, CAMPARY went a different way. Looking at the code, we see:
private:
double data[prec];
Now, I assumed that this was just an arbitrary way of reserving the number of bits they needed. But no, they really do use prec doubles. Like so:
multi_prec<8> a("2633716138033644471646729489243748530829179225072491799768019505671233074369063908765111461703117249");
// Looking at a in the VS debugger:
[0] 2.6337161380336443e+99 const double
[1] 1.8496577979210756e+83 const double
[2] 1.2618399223120249e+67 const double
[3] -3.5978270144026257e+48 const double
[4] -1.1764513205926450e+32 const double
[5] -2479038053160511.0 const double
[6] 0.00000000000000000 const double
[7] 0.00000000000000000 const double
So, what they are doing is storing the max amount of precision possible in the first double, then the remainder is used to compute the next double and so on until they encompass the entire value, or run out of precision (dropping the least significant bits). Note that some of these are negative, which means the sum of the preceding values is a bit bigger than the actual value and they are correcting it downward.
With this in mind, we return to the question of how to print it.
In theory, you could just add all these together to get the right answer. But kinda by definition, we already know that C doesn't have a datatype to hold a value this size. But other libraries do (say MPIR). Now, MPIR doesn't work on CUDA, but it doesn't need to. You don't want to have your CUDA code printing out data. That's something you should be doing from the host anyway. So do the computations with the full power of CUDA, cudaMemcpy the results back, then use MPIR to print them out:
#define MPREC 8
void ShowP(const multi_prec<MPREC> value)
{
multi_prec<MPREC> temp(value), temp2;
// from mpir at mpir.org
mpf_t mp, mp2;
mpf_init2(mp, value.getPrec() * 64); // Make sure we reserve enough room
mpf_init(mp2); // Only needs to hold one double.
const double *ptr = value.getData();
mpf_set_d(mp, ptr[0]);
for (int x = 1; x < value.getPrec(); x++)
{
// MPIR doesn't have a mpf_add_d, so we need to load the value into
// an mpf_t.
mpf_set_d(mp2, ptr[x]);
mpf_add(mp, mp, mp2);
}
// Using base 10, write the full precision (0) of mp, to stdout.
mpf_out_str(stdout, 10, 0, mp);
mpf_clears(mp, mp2, NULL);
}
Used with the number stored in the multi_prec above, this outputs the exact same value. Yay.
It's not a particularly elegant solution. Having to add a second library just to print a value from the first is clearly sub-optimal. And this conversion can't be all that speedy either. But printing is typically done (much) less frequently than computing. If you do an hour's worth of computing and a handful of prints, the performance doesn't much matter. And it beats the heck out of not being able to print at all.
CAMPARY has a lot of shortcomings (undoced, unsupported, unmaintained). But for people who need mp numbers on CUDA (especially if you need sqrt), it's the best option I've found.

Related

How to efficiently create a large vector of items initialized to the same value?

I'm looking to allocate a vector of small-sized structs.
This takes 30 milliseconds and increases linearly:
let v = vec![[0, 0, 0, 0]; 1024 * 1024];
This takes tens of microseconds:
let v = vec![0; 1024 * 1024];
Is there a more efficient solution to the first case? I'm okay with unsafe code.
Fang Zhang's answer is correct in the general case. The code you asked about is a little bit special: it could use alloc_zeroed, but it does not. As Stargateur also points out in the question comments, with future language and library improvements it is possible both cases could take advantage of this speedup.
This usually should not be a problem. Initializing a whole big vector at once probably isn't something you do extremely often. Big allocations are usually long-lived, so you won't be creating and freeing them in a tight loop -- the cost of initializing the vector will only be paid rarely. Sooner than resorting to unsafe, I would take a look at my algorithms and try to understand why a single memset is causing so much trouble.
However, if you happen to know that all-bits-zero is an acceptable initial value, and if you absolutely cannot tolerate the slowdown, you can do an end-run around the standard library by calling alloc_zeroed and creating the Vec using from_raw_parts. Vec::from_raw_parts is unsafe, so you have to be absolutely sure the size and alignment of the allocated memory is correct. Since Rust 1.44, you can use Layout::array to do this easily. Here's an example:
pub fn make_vec() -> Vec<[i8; 4]> {
let layout = std::alloc::Layout::array::<[i8; 4]>(1_000_000).unwrap();
// I copied the following unsafe code from Stack Overflow without understanding
// it. I was advised not to do this, but I didn't listen. It's my fault.
unsafe {
Vec::from_raw_parts(
std::alloc::alloc_zeroed(layout) as *mut _,
1_000_000,
1_000_000,
)
}
}
See also
How to perform efficient vector initialization in Rust?
vec![0; 1024 * 1024] is a special case. If you change it to vec![1; 1024 * 1024], you will see performance degrades dramatically.
Typically, for non-zero element e, vec![e; n] will clone the element n times, which is the major cost. For element equal to 0, there is other system approach to init the memory, which is much faster.
So the answer to your question is no.

How to keep track of the seed

So in Lua it's common knowledge that you can use math.randomseed but it's also obvious that math.random sets the seed as well (calling it twice does not return the same result), what does it set it to, and how can I keep track of it, and if it's impossible, please explain why that is so.
This is not a Lua question, but general question on how some RNG algorithm works.
First, Lua don't have their own RNG - they just output you (slightly mangled) value from RNG of underlying C library. Most RNG implementations do not reveal you their inner state, but sometimes you can caclulate it yourself.
For example when you use Lua on Windows, you'll be using LCG-based RNG from MS C library. The numbers you get is a slice of seed, not full value. There are two ways you can deal with that:
If you know how many times you called random, you can just take initial seed value, feed it to your copy of the same algorithm with same constants that are hardcoded in MS library and get exact value of seed.
If you don't, but you can be sure that nobody interferes in between your two calls to random, you can get two generated numbers, and reverse LCG algorithm by shifting bits back to their place. This will leave you with several missing bits (with one more bit thanks to Lua mangling) that you will need to simply bruteforce - just reiterate over all missing bits until your copy of algorithm produces exactly same two "random" numbers you've recorded before. That will be current seed stored inside library's RNG as well. Well programmed solution in Lua can bruteforce this in about 0.2-0.5s on somewhat dated PC - I did it past. Here's example on Crypto.SE talking about this task in more details: Predicting values from a Linear Congruential Generator.
First approach can be used with any other RNG algorithm that doesn't use any real entropy, second with most RNGs that don't mask too much bits in slice to make bruteforcing unreasonable.
Real answer though is: you don't need to keep track of seed at all. What you want is probably something else.
If you set a seed all numbers math.random() generates are pseudo-random (This is always the case as the system will generate a seed by itself).
math.randomseed(4)
print(math.random())
print(math.random())
math.randomseed(4)
print(math.random())
Outputs
0.50827539156303
0.75454387490399
0.50827539156303
So if you reset the seed to the same value you can predict all values that are going to come up to the maximum number of consecutive values that you already generated using that seed.
What the seed does not do is keep the output of math.random() the same. It would be the same if you kept resetting it to the same value.
An analogy as an example
Imagine the random number is an integer between 0 and 9 (instead of a double between 0 and 1).
math.random() could traverse pi's decimals from an arbitrary starting position (default could be system time).
What you do when you use set.seed() is (not literally, this is an analogy as mentioned) set the starting decimals of where in pi you are going to retrieve your numbers.
If you now reset the seed to the same starting position the numbers are going to be the same as the last time you reset the starting position.
You will know the numbers of to the last call, after that you can't be certain anymore.

What algorithm can I use to turn a drunkards walk into a correlated RNG?

I'm a novice programmer (the only reason I say this is because I'm not super familiar with all the terms yet) and I'm trying to make walls generate in respect to the wall before it. I've posted a question about it on here before
Randomly generated tunnel walls that don't jump around from one to the next
and sort of got the answer. What I was mainly looking for was the for loop that was used (I think). Th problem is I didn't know how to implement it properly without getting errors.
My problem ended up being "I couldn't figure out how to inc. this in to it. I have 41 walls altogether that i'm using and the walls are named Left1 and Right1. i had something like this
CGFloat Left1 = 14; for( int i = 0; i < 41; i++ ){
CGFloat offset = (CGFloat)arc4random_uniform(2*100) - 100;
Left1 += offset;
Right1 = Left1 + 100;
but it was telling me as a yellow text that Local declaration of "Left1" hides instance variable and then in a red text it says "Assigning to 'UIImageView *__strong' from incompatible type 'float'. i'm not sure how to fix this"
and I wasn't sure how to fix it. I realize (I think) that arc4random and arc4random_uniform are pretty much the same thing, as far as i know, with slight differences, but not the difference i'm looking for.
As I said before, i'm pretty novice so any example would really be helpful, especially with the variables i'm trying to use. Thank you.
You want a "hashing" function, and preferably a "cryptographic" one because they tend to be significantly higher quality - at the expense of requiring additional CPU resources. But on modern hardware the extra CPU power usually isn't a problem.
The basic idea is you can give any data to the function, and it will spit out a completely random result, but always the same result if you provide the same input.
Have a read up on them here:
http://en.wikipedia.org/wiki/Hash_function
http://en.wikipedia.org/wiki/Cryptographic_hash_function
There are hundreds of different algorithms in common use, which is best will depend on what you need.
Personally I recommend sha256. A quick search of "sha256 ios" here on stack overflow will show you how to make one, with the CommonCrypto library. The gist is you should create an NSString or NSData object that contains every offset, then run the entire thing through sha256. The result will be a perfectly random 256 bit number.
If 256 bits is too much, just cut it up. For example you could grab just the first 16 bits of the number, and you will have a perfectly random 16 bit number.

OpenCL code behavior is different for AMD vs NVIDIA cards

I have a constant at the top of my code...
__constant uint uintmaxx = (uint)( (((ulong)1)<<32) - 1 );
It compiles fine on AMD and NVIDIA OpenCL compilers... then executes.
(correct) on ATI cards, returns... 4294967295 or (all 32 bits = 1)
(wrong) on NVIDIA cards, returns... 2147483648 or (only 32'nd bit = 1)
I also tried -1 + 1<<32 and it worked on ATI but not NVIDIA.
What gives? Am I just missing something?
While I'm on the topic of OpenCL compiler differences, does anyone know a good resource that lists the compiler differences between AMD and NVIDIA?
OpenCL conveniently provides that for you already. You can use the predefined UINT_MAX in your kernel code and the implementation will guarantee that it holds the correct value.
However there is also nothing wrong in the method you use. The spec guarantees that uint is 32bits and ulong 64bits, ints are twos complement and everything that is not explicitly mentioned works exactly as is written in C99 spec.
Even just this should work and give you the correct result:
uint uintmaxx = -1;
It seems that NVidia just has a broken compiler, if not I really hope I'll be corrected on the issue. The really odd part there is that how on earth the 32nd bit is 1? Shift to left by 32 moves the original bit to the 33rd place. So what on earth places a bit in the 32nd spot? The only thing I got in my mind is that they don't respect operator ordering at all and transform the formula into (ulong)1 << (32-1) or something like that.
You probably should file a bug report. But to be frank considering that they hate OpenCL as much as Microsoft hates OpenGL, if not even more, I wouldn't really anticipate fast response times.
I fully agree with #sharpneli answer. But just try this:
__constant uint uintmaxx = -1;
And like sharpneli said, use the UINT_MAX macro, it is the safer way.

The importance of using a 16bit integer

How seriously do developers think about using a 16bit integer when writing code? I've been using 32bit integers ever since I've been programming and I don't really think about using 16bit.
Its so easy to declare a 32bit int because its the default for most languages.
Whats the upside of using a 16bit integer apart from a little memory saved?
Now that we have cars, we don't walk or ride horses as much, but we still do walk and ride horses.
There is less need to use shorts these days. In a lot of situations the cost of disk space and availability of RAM mean that we no longer need to squeeze every last bit of storage out of computers as we did 20 years ago, so we can sacrifice a bit of storage efficiency in order to save on development/maintenance costs.
However, where large amounts of data are used, or we are working with systems with small memories (e.g. embedded controllers) or when we are transmitting data over networks, using 32 or 64 bits to represent a 16-bit value is just a waste of memory/bandwidth. It doesn't matter how much memory you have, wasting half or three quarters of it would just be stupid.
APIs/interfaces (e.g. TCP/IP port numbers) and algorithms that require manipulation (e.g. rotation) of 16-bit values.
I was interested in the relative performance so I wrote this small test program to perform a very simple test of the speed of allocating, using, and freeing a significant amount of data in both int and short format.
I run the tests several times in case caching and so on are affected.
#include <iostream>
#include <windows.h>
using namespace std;
const int DATASIZE = 1000000;
template <typename DataType>
long long testCount()
{
long long t1, t2;
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
DataType* data = new DataType[DATASIZE];
for(int i = 0; i < DATASIZE; i++) {
data[i] = 0;
}
delete[] data;
QueryPerformanceCounter((LARGE_INTEGER*)&t2);
return t2-t1;
}
int main()
{
cout << "Test using short : " << testCount<short>() << " ticks.\n";
cout << "Test using int : " << testCount<int>() << " ticks.\n";
cout << "Test using short : " << testCount<short>() << " ticks.\n";
cout << "Test using int : " << testCount<int>() << " ticks.\n";
cout << "Test using short : " << testCount<short>() << " ticks.\n";
cout << "Test using int : " << testCount<int>() << " ticks.\n";
cout << "Test using short : " << testCount<short>() << " ticks.\n";
}
and here are the results on my system (64 bit quad core system running windows7 64 bit, but the program is a 32 bit program built using VC++ express 2010 beta in release mode)
Test using short : 3672 ticks.
Test using int : 7903 ticks.
Test using short : 4321 ticks.
Test using int : 7936 ticks.
Test using short : 3697 ticks.
Test using int : 7701 ticks.
Test using short : 4222 ticks.
This seems to show that there are significant performance advantages at least in some cases to using short instead of int when there is a large amount of data. I realise that this is far from being a comprehensive test, but it's some evidence that not only do they use less space but they can be faster to process too at least in some applications.
when there is memory constraints short can help u lot. for e.g. while coding for embedded systems, u need to consider the memory.
16-bit values are still in great demand (though unsigned would do - don't really need signed).
For example,
16 bit Unicode - UTF-16/UCS-2.
16 bit graphics - especially for embedded devices.
16 bit checksums - for UDP headers and similar.
16 Bit devices - e.g. many norflash devices are 16 bit.
You might need to wrap at 65535.
You might need to work with a message sent from a device which includes fields which are 16 bit. Using 32 bit integers in this case would cause you to be accessing bits at the wrong offset in the message.
You might be working on an embedded 16 bit micro, or an embedded 8 bit micro. Hint: not all processors are x86, 32 bit.
This is really important in database development, because sometimes people are using a lot more space than is really needed (e.g. using int when small would have been sufficient). When you have tables with millions of rows this can be important factor in e.g. database size and queries. I would recommend people using always the appropriate datatype for columns.
I also try to use the correct datatype for other development, I know it can be a pain dealing with long and small (pretty convenient to have everyting int) but I think it pays off in the end, for example when serializing objects.
you ask: Any good reason to keep them around?
Since you say 'language-agnostic' the answer is a 'certainly yes'.
The computer CPU still works with bytes, words, full registers and whatnot, no matter how much these 'data types' are abstracted by some programming languages. There will always be situations where the code needs to 'touch the metal'.
It's hardly a little memory saved [read: 50%] when you allocate memory for a large number of numeric values. Common uses are:
COM and external device interop
Reducing memory consumption for large arrays where each number will never exceed a couple thousands in magnitude
Unique hashes for pairs of objects, where no more than ~65K objects are needed (hash values can only be 32-bit ints, but note that hash table types must transform the value for internal representations so collisions are still likely, but equality can be based on exact hash matches)
Speed up algorithms that rely on structs (smaller sized value types translates to increased performance when they are copied around in memory)
In large arrays, "little memory saved" could instead be "much memory saved".
The use of 16 bit integers is primarily for when you need to encode things for transmission over a network, for saving on hard disk, etc. without using up any more space than necessary. It might also occasionally be useful to save memory if you have a very large array of integers, or a lot of objects that contain integers.
Use of 16 bit integers without there being a good memory saving reason is pretty pointless. And 16 bit local variables are most often silently implemented with 32 or 64 bit integers anyway.
you have probably been using the 16 bit datatype more often than you knew. The char datatype in both C# and Java are 16 bit. Unicode is typically stored in a 16bit datatype.
The question should really be why we need a 16-bit primitive data type, and the answer would be that there is an awful lot of data out there which is naturally represented in 16 bits. One ubiquitous example is audio, e.g. CD audio is represented as streams of 16 bit signed integers.
16 bits is still plenty big enough to hold pixel channel values (e.g. R, G, or B). Most pixels only use 8 bits to store a channel, but Photoshop has a 16-bit mode that professionals use.
In other words, a pixel might be defined as struct Pixel16 { short R, G, B, A; } or an image might be defined as separate channels of struct Channel16 { short channel[]; }
I think most people use the default int on their platform. However there are times when you have to communicate with older systems or libraries that are expecting 16 bit or even eight bit integers (thank god we don't have to worry about 12 bit integers any more). This is especially true for databases. Also, if you're doing bit masking or bit shifting, you might have an algorithm that specifies the length of the integer. By default, and on platforms where memory is cheap, you should probably use integers sized to your processor.
Those 2 bytes add up. Your data types eventually become part of array or databases or messages, they go into data files. It adds up to a lot of wasted space and on embedded systems it can make a huge difference.
When we do peer review of our code at work, if something is sized incorrectly, it will be written as a discrepancy and must be corrected. If we find something that has a range of 1-1000 using an int32_t, it has to be corrected. The range must also be documented in a comment. Our department does not allow use of int, long, etc, we must use int32_t, int16_t, uint16_t, etc. so that the expected size is documented.
uint16_t conicAngle; // angle in tenths of a degree (range 0..3599)
or in Ada:
type Amplitude is range 0 .. 255; // signal amplitude from FPGA
Get in the habit of using what you need and no more and documenting what you need (if the language doesn't support it).
We are currently in the process of fixing a performance problem by resizing the data types in several messages, they have 32 bit fields that could be 8 or 16 bit. By resizing them appropriately we can reduce the message rate in half and improve our data throughput to meet the requirements.
Once upon a time, in the land of Earth, there existed devices called computers.
In the early days following the invention of "computers," there was limited storage in memory for fancy things like numbers and strings.
Billy, a programmer, was encouraged by the evil Wizard (his boss) to use the least amount of memory that he could!
Then one day, memory sizes got large enough that everyone could use 32-bit numbers if they wanted!
I could continue on, but all the other obvious things were already covered.

Resources