Storing data with large input - storage

There is a problem on a competitive programming site (HackerRank) in which the input number is in the range of 10^18. So, is it possible to store 10^18 in Java? If yes, which data type should be used?

For some easy HackerRank problems, BigInteger or BigDecimal do work for extremely large inputs, but they usually don't work in moderate/difficult problems, as they tend to reduce performance, and a high number of test cases with extremely large inputs can cause a timeout.
In such cases you will need to go for different storage techniques, e.g. an array of int, each element of the array representing a digit of the large input. You will then need to do digit-based arithmetic on the array for your computations.
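To make the digit-array approach concrete, here is a minimal Java sketch (class and method names are mine, and it assumes non-negative inputs given as decimal strings): each int in the array holds one digit, and addition is done digit by digit with a carry.

public class DigitArrayDemo {

    // Convert a decimal string to an int array, least significant digit first.
    static int[] toDigits(String s) {
        int[] d = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            d[i] = s.charAt(s.length() - 1 - i) - '0';
        }
        return d;
    }

    // Add two digit arrays and return the sum as a new digit array.
    static int[] add(int[] a, int[] b) {
        int n = Math.max(a.length, b.length) + 1;   // +1 for a possible final carry
        int[] sum = new int[n];
        int carry = 0;
        for (int i = 0; i < n; i++) {
            int s = carry;
            if (i < a.length) s += a[i];
            if (i < b.length) s += b[i];
            sum[i] = s % 10;
            carry = s / 10;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] result = add(toDigits("999999999999999999"), toDigits("1"));
        StringBuilder out = new StringBuilder();
        for (int i = result.length - 1; i >= 0; i--) out.append(result[i]);
        System.out.println(out);   // 1000000000000000000
    }
}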

There is no real need to be careful here: the compiler will reject any integer literal that exceeds Long.MAX_VALUE, so you cannot accidentally pass an overflowed value to BigInteger.valueOf(long). Furthermore, it's easy to construct a BigInteger much greater than that, say:
BigInteger.valueOf(10).pow(10000)
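For the 10^18 range in the question itself, note that a primitive long already covers it, since Long.MAX_VALUE is 9,223,372,036,854,775,807 (about 9.22*10^18); a small sketch of reading the input either way:

import java.math.BigInteger;
import java.util.Scanner;

public class LargeInputDemo {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);

        // A long holds values up to about 9.22 * 10^18, so an input around
        // 10^18 fits in a primitive type:
        long n = in.nextLong();

        // If the input may exceed Long.MAX_VALUE, read it as a BigInteger instead:
        BigInteger big = in.nextBigInteger();

        System.out.println(n);
        System.out.println(big);
    }
}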

Related

Neo4j floating point sum different results

I am using neo4j to calculate some statistics on a data set. For that I am often using sum on a floating point value. I am getting different results depending on the circumstances. For example, a query that does this:
...
WITH foo
ORDER BY foo.fooId
RETURN SUM(foo.Weight)
returns a different result than the query that simply does the sum:
...
RETURN SUM(foo.Weight)
The differences are minuscule (293.07724195098984 vs 293.07724195099007), but it is enough to make simple equality checks fail. Another example: a different instance of the database, loaded with the same data using the same loading process, can produce the same issue (the DBs might not be 1:1, the load order of some relations might be different). I took the raw values that Neo4j sums (by simply removing the SUM()) and verified that they are the same in all cases (different DBs, and ordered/not ordered).
What are my options here? I don't mind losing some precision (I already tried to cut down the precision from 15 to 12 decimal places but that did not seem to work), but I need the results to match up.
Because of rounding errors, floating point addition is not associative: (a+b)+c != a+(b+c).
The result of every operation is rounded to fit the float encoding's constraints, so (a+b)+c is computed as round(round(a+b) + c), while a+(b+c) is computed as round(a + round(b+c)).
As an obvious illustration, consider the operation 2^-100 + 1 - 1. If evaluated as (2^-100 + 1) - 1, it returns 0, because 1 + 2^-100 would require more precision than IEEE 754 floats or doubles provide and can only be encoded as 1.0, whereas 2^-100 + (1 - 1) correctly returns 2^-100, which can be encoded by either floats or doubles.
This is a trivial example, but such rounding errors can occur after every operation, and they explain why floating point operations are not associative.
Databases generally do not return data in a guaranteed order, and depending on the actual order the operations will be done differently; that explains the behaviour you are seeing.
In general, for this reason, it is not a good idea to do equality comparisons on floats. The usual advice is to replace a==b with a check that abs(a-b) is "sufficiently" small.
"Sufficiently" may depend on your algorithm. Floats carry roughly 6-7 significant decimal digits and doubles 15-16 (and doubles are presumably what your DB uses). Depending on the number of computations, the last 1-3 digits may be affected.
The best is probably to use
abs(a-b)<relative-error*max(abs(a),abs(b))
where relative-error must be adjusted to your problem. Something around 10^-13 is probably right, but you must experiment, as rounding errors depend on the number of computations, on the dispersion of the values, and on what you consider "equal" for your problem.
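As an illustration of both points, here is a small sketch (Java is used purely for demonstration; the nearlyEqual name and the 10^-13 tolerance are assumptions that must be tuned to your data):

public class FloatCompare {
    // Relative-error comparison, as suggested above.
    static boolean nearlyEqual(double a, double b, double relativeError) {
        return Math.abs(a - b) < relativeError * Math.max(Math.abs(a), Math.abs(b));
    }

    public static void main(String[] args) {
        // Floating point addition is not associative:
        double x = (1e-20 + 1.0) - 1.0;   // 0.0, because 1e-20 is lost when added to 1.0
        double y = 1e-20 + (1.0 - 1.0);   // 1e-20
        System.out.println(x == y);       // false

        // The two sums from the question compare as equal under a relative tolerance:
        System.out.println(nearlyEqual(293.07724195098984, 293.07724195099007, 1e-13));  // true
    }
}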
Look at this site for a discussion of comparison methods. And read What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg, which discusses, among other things, these problems.

Julia: efficient memory allocation

My program is memory-hungry, so I need to save as much memory as I can.
When you assign an integer value to a variable, the type of the value will always be Int64, whether it's 0 or +2^63-1 or -2^63.
I couldn't find a smart way to efficiently allocate memory, so I wrote a function that looks like this (in this case for integers):
function right_int(n)
    types = [Int8, Int16, Int32, Int64, Int128]
    for t in reverse(types)      # try Int128 first, then progressively smaller types
        try
            n = t(n)             # narrow n if it fits in t
        catch InexactError       # note: this binds any exception to the name InexactError; it does not filter by type
            break                # stop at the first type that cannot hold n
        end
    end
    n
end
a = right_int(parse(Int,eval(readline(STDIN))))
But I don't think this is a good way to do it.
I also have a related problem: what's an efficient way of operating with numbers without worrying about typemins and typemaxs? Convert each operand to BigInt and then apply right_int?
You're missing the forest for the trees. right_int is type unstable. Type stability is a key concept in reducing allocations and making Julia fast. By trying to "right-size" your integers to save space, you're actually causing more allocations and higher memory use. As a simple example, let's try making a "right-sized" array of 100 integers from 1-100. They're all small enough to fit in Int8, so that's just 100 bytes plus the array header, right?
julia> @allocated [right_int(i) for i=1:100]
26496
Whoa, 26,496 bytes! Why didn't that work? And why is there so much overhead? The key is that Julia cannot infer what the type of right_int might be, so it has to support any type being returned:
julia> typeof([right_int(i) for i=1:100])
Array{Any,1}
This means that Julia can't pack the integers densely into the array, and instead represents them as pointers to 100 individually "boxed" integers. These boxes tell Julia how to interpret the data that they contain, and that takes quite a bit of overhead. This doesn't just affect arrays, either — any time you use the result of right_int in any function, Julia can no longer optimize that function and ends up making lots of allocations. I highly recommend you read more about type stability in this very good blog post and in the manual's performance tips.
As far as which integer type to use: just use Int unless you know you'll be going over 2 billion. In the cases where you know you need to support huge numbers, use BigInt. It's notable that creating a similar array of BigInt uses significantly less memory than the "right-sized" array above:
julia> @allocated [big(i) for i=1:100]
6496

Larger than Unsigned Long Long

I'm working on an iOS Objective C app where you accumulate a large amount of wealth. By the end of the app, the amount of money users can accumulate is more than a long long can handle. What data type should I use instead? I know I could use an unsigned long, but that only adds a little bit more. I need users to have like 6 more digits to be safe, so instead of the max being 18,446,744,073,709,551,615 (about 1.8x10^19), it would be ideal to have something like 1.8x10^25 as my maximum value.
Precision isn't actually all that important in the end, but it would definitely save me time not to have to do more than just change data types throughout my application. Any ideas?
Short Answer
Go for a 3rd party library.
Long Answer
When dealing with large numbers, probably the most fundamental design decision is: how am I going to represent the large number?
Will it be a string, an array, a list, or a custom (homegrown) storage class?
Once that decision is made, the actual math operations can be broken down into smaller parts and then executed with native language types such as int or Integer.
Even with strings there is a limit on the number of characters, or digits, in the number, as indicated here:
What is the maximum possible length of a .NET string?
You might also want to check: Arbitrary-precision arithmetic

Use Trie or SortedSet for Dictionary?

I had some questions about usage of Tries/SortedSets for a dictionary.
Which is more efficient for lookups?
Which is more efficient for virtual memory?
Are there any other advantages/disadvantages of either structure when utilized for a dictionary?
No need to answer all three, just looking for some good responses and source material if you have any. Thanks.
Lookups in a Trie are blazing fast, as they just require O(length of key) comparisons, and are almost as fast as it's possible to be. A SortedSet is generally implemented using balanced binary search trees, which would perform many more comparisons, in the worst case O(height of tree) string comparisons. So the Trie is the clear winner here.
Virtual memory efficiency can be seen as how fast the data structure can be loaded into memory. A SortedSet takes up space proportional to the number of elements. It's implemented using pointers, which is bad for loading efficiency. That can be improved by serializing it and storing it in an array, but that increases the space needed.
A Trie in its most simple form takes a lot of memory. It's also implemented using pointers, which is again bad for loading efficiency. Even if serialized, it takes a large amount of memory. But there are interesting alternatives here which compress the trie while giving the same performance. Radix tries take significantly less memory. Even better, a DAWG (directed acyclic word graph) overlaps common suffixes and prefixes and compresses the dictionary by a huge amount. After compression, the DAWG could take less space than your dictionary itself. It is implemented using an array, so it's fast to load too.
In the end, if you have a static dictionary, a DAWG would be the best way to go; otherwise it depends.
A trie sees keys as sequences; it is a prefix tree. You can get all words starting with a prefix very fast, so using a trie you can perform auto-completion and auto-correction efficiently. Some keys, like floating point numbers, could lead to long chains in the trie, which is bad. A SortedSet sees keys as comparable items, so it is easy to partition the elements. Both SortedSet and Trie can offer the keys in alphabetical order, but I guess SortedSet would be much faster.
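For reference, a minimal trie sketch in Java (class and method names are mine): lookup walks one node per character, i.e. O(length of key), as described above.

import java.util.HashMap;
import java.util.Map;

public class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.isWord = true;
    }

    public boolean contains(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return false;
        }
        return cur.isWord;
    }
}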

Lookup table size reduction

I have an application in which I have to store a couple of million integers in a lookup table. Obviously I cannot keep that amount of data in memory: this has to run on an embedded system, so I am very limited in space. I would therefore like to ask about recommended methods for reducing the size of the lookup table. I cannot use function approximation such as neural networks; the values need to be in a table. The range of the integers is not known at the moment. When I say integers I mean 32-bit values.
Basically the idea is to use some compression method to reduce the amount of memory without losing much precision. This has to run in hardware, so the computational overhead cannot be very high.
In my algorithm I have to access one value of the table, do some operations with it, and afterwards update the value. In the end what I need is a function to which I pass an index and get a value back, and another function to write a value into the table.
I found one method called tile coding, which is based on several lookup tables. Does anyone know any other methods?
Thanks.
I'd look at the types of numbers you need to store and pull out the information that's common for many of them. For example, if they're tightly clustered, you can take the mean, store it, and store the offsets. The offsets will have fewer bits than the original numbers. Or, if they're more or less uniformly distributed, you can store the first number and then store the offset to the next number.
It would help to know what your key is to look up the numbers.
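A hedged sketch of the offset idea (class and method names are mine): store one shared 32-bit base, such as the mean, plus a 16-bit offset per entry, which halves the table size as long as every value stays within roughly +/-32,767 of the base.

public class OffsetTable {
    private final int base;        // shared base value, e.g. the mean of the data
    private final short[] offsets; // 16 bits per entry instead of 32

    public OffsetTable(int base, int size) {
        this.base = base;
        this.offsets = new short[size];
    }

    public int get(int index) {
        return base + offsets[index];
    }

    public void set(int index, int value) {
        int delta = value - base;
        if (delta < Short.MIN_VALUE || delta > Short.MAX_VALUE)
            throw new IllegalArgumentException("value too far from base");
        offsets[index] = (short) delta;
    }
}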
I need more detail on the problem. If you cannot store the real value of the integers but instead an approximation, that means you are going to throw away some of the data (detail), correct? I think you are looking for a hash, which can be an art form in itself. For example, say you have 32-bit values: one hash would be to take the 4 bytes and XOR them together, resulting in a single 8-bit value, reducing your storage by a factor of 4 but also losing the real value of the original data. Typically you could go further and use only a few of those 8 bits, say the lower 4, and reduce the value further.
I think the real question is: either you need the data or you don't. If you need the data, you need to compress it or find more memory to store it. If you don't, then use a hash of some sort to reduce the number of bits until you reach the amount of memory you have for storage.
Read http://www.cs.ualberta.ca/~sutton/RL-FAQ.html
"Function approximation" refers to the
use of a parameterized functional form
to represent the value function
(and/or the policy), as opposed to a
simple table."
Perhaps that applies. Also, update your question with additional facts -- don't merely answer in the comments.
Edit.
A bit array can easily store a bit for each of your millions of numbers. Let's say you have numbers in the range of 1 to 8 million. In a single megabyte of storage you can have a 1 bit for each number in your set and a 0 for each number not in your set.
If you have numbers in the range of 1 to 32 million, you'll require 4Mb of memory for a big table of all 32M distinct numbers.
See my answer to Modern, high performance bloom filter in Python? for a Python implementation of a bit array of unlimited size.
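If plain presence/absence per number is all you need, java.util.BitSet already implements the bit-array idea; a small sketch (the 8-million range is just an assumption for illustration):

import java.util.BitSet;

public class BitArrayDemo {
    public static void main(String[] args) {
        // One bit per possible value: covering 0..8,388,607 (about 8M numbers)
        // costs roughly 1 MB of memory, as described above.
        BitSet present = new BitSet(8 * 1024 * 1024);

        present.set(42);          // mark 42 as a member
        present.set(1_000_000);   // mark 1,000,000 as a member

        System.out.println(present.get(42));   // true
        System.out.println(present.get(43));   // false
    }
}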
If you are merely looking for the presence of the number in question, a Bloom filter might be what you are looking for. Honestly, though, your question is fairly vague and confusing. It would help to explain what the Q values are and what you do with them once you find them in the table.
If your set of integers is homogeneous, then you could try a hash table, because there is a trick you can use to cut the size of the stored integers, in your case, in half.
Assume the integer n, because its set is homogeneous, can serve as its own hash. Assume you have 0x10000 (65,536) buckets. Each bucket index is iBucket = n & 0xFFFF. Each item in a bucket then only needs to store 16 bits, since the low 16 bits are already given by the bucket index. The other thing you have to do to keep the data small is to store the count of items in the bucket and use an array to hold them; a linked list would be too large and slow. When you iterate the array looking for a match, remember you only need to compare the 16 bits that are stored.
So assume a bucket is a pointer to the array plus a count; on a 32-bit system, that is 64 bits at most. If the number of ints were small enough we might be able to do some fancy things and use 32 bits per bucket. 64K buckets * 8 bytes = ~524 KB, and 2 million 16-bit shorts = 4 MB. So this gives you a way to look up the ints with about 40% compression.
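A rough sketch of that bucketing trick (the class and method names are mine, and it only stores presence, not an associated value): the low 16 bits pick the bucket, so each stored entry needs only the high 16 bits.

import java.util.Arrays;

public class HalvedIntSet {
    private static final int BUCKETS = 1 << 16;          // 65,536 buckets, indexed by the low 16 bits
    private final short[][] buckets = new short[BUCKETS][];

    public void add(int n) {
        int b = n & 0xFFFF;                // bucket index = low 16 bits
        short hi = (short) (n >>> 16);     // only the high 16 bits get stored
        short[] arr = buckets[b];
        if (arr == null) {
            buckets[b] = new short[] { hi };
        } else {
            for (short s : arr) if (s == hi) return;     // already present
            arr = Arrays.copyOf(arr, arr.length + 1);
            arr[arr.length - 1] = hi;
            buckets[b] = arr;
        }
    }

    public boolean contains(int n) {
        short[] arr = buckets[n & 0xFFFF];
        if (arr == null) return false;
        short hi = (short) (n >>> 16);
        for (short s : arr) if (s == hi) return true;    // compare only the stored 16 bits
        return false;
    }
}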
