Julia: efficient memory allocation

My program is memory-hungry, so I need to save as much memory as I can.
When you assign an integer value to a variable, the type of the value will always be Int64 (on a 64-bit system), whether it's 0, +2^63-1, or -2^63.
I couldn't find a smart way to efficiently allocate memory, so I wrote a function that looks like this (in this case for integers):
function right_int(n)
    types = [Int8, Int16, Int32, Int64, Int128]
    for t in reverse(types)
        try
            n = t(n)
        catch InexactError
            break
        end
    end
    n
end

a = right_int(parse(Int, eval(readline(STDIN))))
But I don't think this is a good way to do it.
I also have a related problem: what's an efficient way of operating with numbers without worrying about typemin and typemax? Convert each operand to BigInt and then apply right_int?

You're missing the forest for the trees. right_int is type unstable. Type stability is a key concept in reducing allocations and making Julia fast. By trying to "right-size" your integers to save space, you're actually causing more allocations and higher memory use. As a simple example, let's try making a "right-sized" array of 100 integers from 1-100. They're all small enough to fit in Int8, so that's just 100 bytes plus the array header, right?
julia> @allocated [right_int(i) for i=1:100]
26496
Whoa, 26,496 bytes! Why didn't that work? And why is there so much overhead? The key is that Julia cannot infer what the return type of right_int might be, so it has to support any type being returned:
julia> typeof([right_int(i) for i=1:100])
Array{Any,1}
This means that Julia can't pack the integers densely into the array, and instead represents them as pointers to 100 individually "boxed" integers. These boxes tell Julia how to interpret the data that they contain, and that takes quite a bit of overhead. This doesn't just affect arrays, either — any time you use the result of right_int in any function, Julia can no longer optimize that function and ends up making lots of allocations. I highly recommend you read more about type stability in this very good blog post and in the manual's performance tips.
As far as which integer type to use: just use Int unless you know you'll be going over 2 billion. In the cases where you know you need to support huge numbers, use BigInt. It's notable that creating a similar array of BigInt uses significantly less memory than the "right-sized" array above:
julia> @allocated [big(i) for i=1:100]
6496
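If you genuinely know the value range in advance, the type-stable way to save memory is to pick one concrete element type for the whole collection rather than re-typing each value. A small sketch in the same REPL style as above (I won't quote an @allocated figure since it varies by version, but it is on the order of the 100 bytes of data you'd expect plus the array header):
julia> small = Int8[i for i = 1:100];

julia> typeof(small)
Array{Int8,1}
Because the element type is concrete, Julia stores the values inline and unboxed, and functions consuming small stay type stable.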

Related

What is the difference between loadu_ps and set_ps when using unformatted data?

I have some data that isn't stored as a structure of arrays. What is the best practice for loading the data into registers?
__m128 _mm_set_ps (float e3, float e2, float e1, float e0)
// or
__m128 _mm_loadu_ps (float const* mem_addr)
With _mm_loadu_ps, I'd copy the data into a temporary stack array, vs. copying the data as values directly. Is there a difference?
It can be a tradeoff between latency and throughput, because separate stores into an array will cause a store-forwarding stall when you do a vector load. So it's high latency, but throughput could still be ok, and it doesn't compete with surrounding code for the vector shuffle execution unit. So it can be a throughput win if the surrounding code also has shuffle operations, vs. 3 shuffles to insert 3 elements into an XMM register after a scalar load of the first one. Either way it's still a lot of total uops, and that's another throughput bottleneck.
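For reference, the temp-array version under discussion looks something like this (a sketch; the function name is made up):
#include <xmmintrin.h>

static inline __m128 gather4_via_stack(float x0, float x1, float x2, float x3)
{
    float tmp[4] = { x0, x1, x2, x3 };  /* four scalar stores... */
    return _mm_loadu_ps(tmp);           /* ...then one 16-byte reload: same result as
                                           _mm_set_ps(x3, x2, x1, x0), but the wide load
                                           of fresh narrow stores is what hits the
                                           store-forwarding stall */
}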
Most compilers like gcc and clang do a pretty good job with _mm_set_ps () when optimizing with -O3, whether the inputs are in memory or registers. I'd recommend it, except in some special cases.
The most common missed-optimization with _mm_set is when there's some locality between the inputs. e.g. don't do _mm_set_ps(a[i+2], a[i+3], a[i+0], a[i+1]), because many compilers will use their regular pattern without taking advantage of the fact that 2 pairs of elements are contiguous in memory. In that case, use (the intrinsics for) movsd and movhps to load in two 64-bit chunks. (Not movlps: it merges into an existing register instead of zeroing the high elements, so it has a false dependency on the old contents while movsd zeros the high half.) Or use a shufps if some reordering is needed between or within the 64-bit chunks.
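In intrinsic form, that two-chunk load could look like this (a sketch, assuming the two pairs start at a+i and a+j; _mm_load_sd compiles to movsd and zeroes the upper lanes, _mm_loadh_pi to movhps):
#include <xmmintrin.h>   /* _mm_loadh_pi */
#include <emmintrin.h>   /* _mm_load_sd, _mm_castpd_ps */

static inline __m128 load_two_pairs(const float *a, int i, int j)
{
    /* movsd: a[i], a[i+1] into the low half, upper lanes zeroed.
       The double* cast is just a 64-bit type pun over two floats. */
    __m128 v = _mm_castpd_ps(_mm_load_sd((const double *)(a + i)));
    /* movhps: merge a[j], a[j+1] into the high half. */
    return _mm_loadh_pi(v, (const __m64 *)(a + j));
}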
The "regular pattern" that compilers use will usually be movss / insertps from memory if compiling with SSE4, or movss loads and unpcklps shuffles to combine pairs and then another unpcklps, unpcklpd, or movlhps to shuffle into one register. Or a shufps or shufpd if the compiler likes to waste code-side on immediate shuffle-control operands instead of using fixed shuffles intelligently.
See also Agner Fog's optimization guides for some handy tables of data-movement instructions to get a better idea of what the compiler has to work with, and how stuff performs. Note that Haswell and later can only do 1 shuffle per clock. Also other links in the x86 tag wiki.
There's no really cheap way for a compiler or human to do this, in the general case when you have 4 separate scalars that aren't contiguous in memory at all. Or for register inputs, where it can't optimize the way they're generated in registers in the first place to have some of them already packed together. (e.g. for function args passed in registers to a function that can't / doesn't inline.)
Anyway, it's not a big deal unless you have this inside an inner loop. In that case, definitely worry about it (and check the compiler's asm output to see if it made a mess or could do better if you program the gather yourself with intrinsics that map to single instructions like _mm_load_ss / _mm_shuffle_ps).
If possible, rearrange your data layout to make data contiguous in at least small chunks / stripes. (See https://stackoverflow.com/tags/sse/info, specifically these slides.) But sometimes one part of the program needs the data one way, and the other needs another. Choose the layout that's good for the case that needs to be faster, or that runs more often, or whatever, and suck it up and do the best you can for the other part of the program. :P Possibly transpose / convert once to set up for multiple SIMD operations, but extra passes over data with no computation just suck up time and can hurt your computational intensity (how much ALU work you do for each time you load data into registers) more than they help.
And BTW, actual gather instructions (like AVX2 vgatherdps) are not very fast; even on Skylake it's probably not worth using a gather instruction for four 32-bit elements at known locations. On Broadwell / Haswell, gather is definitely not worth using for this.

Largest amount of entries in lua table

I am trying to build a Sieve of Eratosthenes in Lua. I tried several things, but I find myself confronted with the following problem:
Lua's tables are too small for this scenario. If I just want to create a table with all numbers (see example below), the table is too "small" even with only 1/8 (...) of the number (the number is pretty big, I admit)...
max = 600851475143
numbers = {}
for i = 1, max do
    table.insert(numbers, i)
end
If I execute this script on my Windows machine, there is an error message saying: C:\Program Files (x86)\Lua\5.1\lua.exe: not enough memory. I tried it with Lua 5.3 running on my Linux machine as well; there the error was just "Killed". So it is pretty obvious that Lua can't handle that number of entries.
I don't really know whether it is simply impossible to store that many entries in a Lua table, or whether there is an easy solution for this (I tried using a long string as well...). And what exactly is the largest number of entries a Lua table can hold?
Update: Would it be possible to somehow manually allocate more memory for the table?
Update 2 (solution to the second question): The second question is an easy one. I just tested it by increasing the count until the program broke: 33,554,432 (2^25) entries fit in one one-dimensional table on my 12 GB RAM system. Why 2^25? 64 bits per number * 2^25 is 2^31 bits, i.e. 256 MB of raw number data; the practical ceiling seems to come from the 2 GB of address space available to the 32-bit Lua for Windows build, together with per-entry overhead and the reallocations that happen while the table grows.
P.S. You may have noticed that this number is from Project Euler Problem 3. Yes, I am trying to solve that one. Please don't give specific hints (..). Thank you :)
The Sieve of Eratosthenes only requires one bit per number, representing whether the number has been marked non-prime or not.
One way to reduce memory usage would be to use bitwise math to represent multiple bits in each table entry. Current Lua implementations have intrinsic support for bitwise-or, -and etc. Depending on the underlying implementation, you should be able to represent 32 or 64 bits (number flags) per table entry.
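A rough sketch of that packing (assuming Lua 5.3's native integer bitwise operators; the BitSet name and the 64-flags-per-slot choice are just illustrative):
local BitSet = {}
BitSet.__index = BitSet

function BitSet.new()
    return setmetatable({ words = {} }, BitSet)
end

-- Mark number n (1-based), e.g. as composite.
function BitSet:set(n)
    local word = (n - 1) // 64 + 1
    local bit  = (n - 1) % 64
    self.words[word] = (self.words[word] or 0) | (1 << bit)
end

-- True if n has been marked.
function BitSet:get(n)
    local word = (n - 1) // 64 + 1
    local bit  = (n - 1) % 64
    return ((self.words[word] or 0) >> bit) & 1 == 1
end
Sieving up to N then needs roughly N/64 table slots instead of N.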
Another option would be to use one or more very long strings instead of a table. You only need a linear array, which is really what a string is. Just have a long string with "t" or "f", or "0" or "1", at every position.
Caveat: String manipulation in Lua always involves duplication, which rapidly turns into n² or worse complexity in terms of performance. You wouldn't want one continuous string for the whole massive sequence, but you could probably break it up into blocks of a thousand, or of some power of 2. That would reduce your memory usage to 1 byte per number while minimizing the overhead.
Edit: After noticing a point made elsewhere, I realized your maximum number is so large that, even with a bit per number, your memory requirements would optimally be about 73 gigabytes, which is extremely impractical. I would recommend following the advice Piglet gave in their answer, to look at Jon Sorenson's version of the sieve, which works on segments of the space instead of the whole thing.
I'll leave my suggestion, as it still might be useful for Sorenson's sieve, but yeah, you have a bigger problem than you realize.
Lua uses double-precision floats to represent numbers. That's 64 bits per number.
600851475143 numbers would require almost 4.5 terabytes of memory.
So it's not Lua's or its tables' fault. The error message even says
not enough memory
You just don't have enough RAM to allocate that much.
If you had read the linked Wikipedia article carefully, you would have found the following section:
As Sorenson notes, the problem with the sieve of Eratosthenes is not the number of operations it performs but rather its memory requirements.[8] For large n, the range of primes may not fit in memory; worse, even for moderate n, its cache use is highly suboptimal. The algorithm walks through the entire array A, exhibiting almost no locality of reference.
A solution to these problems is offered by segmented sieves, where only portions of the range are sieved at a time.[9] These have been known since the 1970s, and work as follows
...

Storing data with large input

There is a problem on a competitive programming site (HackerRank) in which the input number is of the order of 10^18. So, is it possible to store 10^18 in Java? If yes, which data type should be used?
For some easy HackerRank problems, BigInteger or BigDecimal do work for extremely large inputs, but they usually don't work in moderate/difficult problems, as they tend to reduce performance, and a high number of test cases with extremely large inputs can cause a timeout.
In such cases, you will need to go for different storage techniques, e.g. an array of int, with each element of the array representing a digit of the large input. You will then need to do digit-based arithmetic on the array for your computations.
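A rough sketch of that digit-array idea (not from the original answer; digits are stored least-significant first):
class DigitMath {
    // Adds two non-negative numbers held as decimal digit arrays,
    // least-significant digit first.
    static int[] add(int[] a, int[] b) {
        int n = Math.max(a.length, b.length);
        int[] sum = new int[n + 1];
        int carry = 0;
        for (int i = 0; i < n; i++) {
            int d = carry
                  + (i < a.length ? a[i] : 0)
                  + (i < b.length ? b[i] : 0);
            sum[i] = d % 10;   // keep one digit
            carry = d / 10;    // propagate the rest
        }
        sum[n] = carry;        // may leave a leading zero digit; trim if needed
        return sum;
    }
}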
There's no real need to be careful: Long.MAX_VALUE is about 9.22 × 10^18, so 10^18 fits in a plain long, and BigInteger.valueOf(long) can't silently be fed anything larger anyway, since a literal that exceeds Long.MAX_VALUE is a compilation error. Furthermore, it's easy to construct a BigInteger much greater, say BigInteger.valueOf(10).pow(10000).
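Putting the two together, a minimal sketch (the class name is made up for the example):
import java.math.BigInteger;

public class LargeInput {
    public static void main(String[] args) {
        long n = 1_000_000_000_000_000_000L;              // 10^18, fits in a long
        BigInteger square = BigInteger.valueOf(n).multiply(BigInteger.valueOf(n));
        System.out.println(square);                       // 10^36, needs BigInteger
    }
}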

Does using single-case discriminated union types have implications on performance?

It is nice to have a wrapper for every primitive value, so that there is no way to misuse it. I suspect this convenience comes at a price. Is there a performance drop? Should I rather use bare primitive values instead if the performance is a concern?
Yes, there's going to be a performance drop when using single-case union types to wrap primitive values. Union cases are compiled into classes, so you'll pay the price of allocating (and later, collecting) the class and you'll also have an additional indirection each time you fetch the value held inside the union case.
Depending on the specifics of your application, and how often you'll incur these additional overheads, it may still be worth doing if it makes your code safer and more modular.
I've written a lot of performance-sensitive code in F#, and my personal preference is to use F# unit-of-measure types whenever possible to "tag" primitive types (e.g., ints). This keeps them from being misused (thanks to the F# compiler's type checker) but also avoids any additional run-time overhead, since the measure types are erased when the code is compiled. If you want some examples of this, I've used this design pattern extensively in my fsharp-tools projects.
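To sketch what that looks like (the userId measure here is just an example, not from fsharp-tools):
[<Measure>] type userId

let uid : int<userId> = 42<userId>   // tagged at compile time; misuse is a type error
let raw : int = int uid              // the tag is erased at run time, so this is free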
Jack has much more experience with writing high-performance F# code than I do, so I think his answer is absolutely right (I also think the idea to use units of measure is pretty interesting).
Just to put things in context, I wrote a really basic test (using just F# Interactive - so things may differ in Release mode) to compare the performance. It allocates an array of wrapped (vs. non-wrapped) int values. This is probably the scenario where non-wrapped types are really a good choice, because the array will be just a continuous block of memory.
#time
// Using a simple wrapped `int` type
type Foo = Foo of int
let foos = Array.init 1000000 Foo

// Add all 'foos' 1k times and ignore the results
for i in 0 .. 1000 do
    let mutable total = 0
    for Foo f in foos do total <- total + f
On my machine, the for loop takes on average something around 1050ms. Now, the unwrapped version:
let bars = Array.init 1000000 id
for i in 0 .. 1000 do
    let mutable total = 0
    for b in bars do total <- total + b
On my machine, this takes about 700ms.
So, there is certainly some performance penalty, but perhaps smaller than one would expect (some 33%). And this is looking at a test that does virtually nothing else than unwrap the values in a loop. In code that does something useful, the overhead would be a lot smaller.
This may be an issue if you're writing high-performance code, something that will process lots of data, or something that takes some time and that users will run frequently (like compilers & tools). On the other hand, if your application is not performance critical, then this is not likely to be a problem.
From F# 4.1 onwards adding the [<Struct>] attribute to suitable single case discriminated unions will increase the performance and reduce the number of memory allocations performed.
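A minimal sketch (the UserId name is only illustrative):
[<Struct>]
type UserId = UserId of int

let uid = UserId 42       // stored inline, no heap allocation
let (UserId raw) = uid    // unwrapping is a plain field read
An array of such struct-wrapped values is laid out like an array of the underlying ints, so the array test above should no longer show the boxing overhead.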

Working with large arrays - OutOfRam

I have an algorithm where I create two bi-dimensional arrays like this:
TYPE
TPtrMatrixLine = array of byte;
TCurMatrixLine = array of integer;
TPtrMatrix = array of TPtrMatrixLine;
TCurMatrix = array of TCurMatrixLine;
function x
var
PtrsMX: TPtrMatrix;
CurMx : TCurMatrix;
begin
{ Try to allocate RAM }
SetLength(PtrsMX, RowNr+1, ColNr+1);
SetLength(CurMx , RowNr+1, ColNr+1);
for all rows do
for all cols do
FillMatrixWithData; <------- CPU intensive task. It could take up to 10-20 min
end;
The two matrices have always the same dimension.
Usually there are only 2000 lines and 2000 columns in the matrix, but sometimes it can go as high as 25000x6000, so for both matrices I need something like 146.5 + 586.2 = 732.8 MB of RAM.
The problem is that the two blocks need to be contiguous so in most cases, even if 500-600MB of free RAM doesn't seem much on a modern computer, I run out of RAM.
The algorithm fills the cells of the array with data based on the neighbors of that cell. The operations are just additions and subtractions.
TCurMatrixLine is the one that takes a lot of RAM, since it uses integers to store data. Unfortunately, the stored values may have a sign, so I cannot use Word instead of Integer. SmallInt is too small (my values are bigger than SmallInt allows, but smaller than Word). I hope that if there is any other way to implement this, it will not add a lot of overhead, since processing a matrix with so many lines/columns already takes a lot of time. In other words, I hope that decreasing the memory requirements will not increase the processing time.
Any idea how to decrease the memory requirements?
[I use Delphi 7]
Update
Somebody suggested that each row of my array should be an independent uni-dimensional array.
I create as many rows (arrays) as I need and store them in a TList. Sounds very good. Obviously there will be no problem allocating such small memory blocks. But I am afraid it will have a gigantic impact on speed. I currently use
TCurMatrixLine = array of integer;
TCurMatrix = array of TCurMatrixLine;
because it is faster than TCurMatrix = array of array of integer (because of the way the data is placed in memory). So, breaking the array into independent lines may affect the speed.
The suggestion of using a signed 2 byte integer will greatly aid you.
Another useful tactic is to mark your exe as being LARGE_ADDRESS_AWARE by adding {$SetPEFlags IMAGE_FILE_LARGE_ADDRESS_AWARE} to your .dpr file. This will only help if you are running on 64 bit Windows and will increase your address space from 2GB to 4GB.
It may not work on Delphi 7 (I seem to recall you are using D7) and you must be using FastMM since the old Borland memory manager isn't compatible with large address space. If $SetPEFlags isn't available you can still mark the exe with EDITBIN.
If you still encounter difficulties then yet another trick is to do allocate smaller sub-blocks of memory and use a wrapper class to handle mapping indices to the appropriate sub-block and offset within. You can use a default index property to make this transparent to the calling code.
Naturally a block allocated approach like this does incur some processing overhead but it's your best bet if you are having troubles with getting contiguous blocks.
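A bare-bones sketch of such a wrapper, here with each row as its own sub-block and a default property so the calling code still reads like a 2D array (all names are illustrative, not from the original code):
type
  TIntRow    = array of Integer;
  TIntMatrix = array of TIntRow;

  TBlockMatrix = class
  private
    FRows: TIntMatrix;
    function GetCell(Row, Col: Integer): Integer;
    procedure SetCell(Row, Col: Integer; const Value: Integer);
  public
    constructor Create(RowCount, ColCount: Integer);
    property Cells[Row, Col: Integer]: Integer read GetCell write SetCell; default;
  end;

constructor TBlockMatrix.Create(RowCount, ColCount: Integer);
var
  i: Integer;
begin
  inherited Create;
  SetLength(FRows, RowCount);
  for i := 0 to RowCount - 1 do
    SetLength(FRows[i], ColCount);  { each row is a separate, modest allocation }
end;

function TBlockMatrix.GetCell(Row, Col: Integer): Integer;
begin
  Result := FRows[Row][Col];
end;

procedure TBlockMatrix.SetCell(Row, Col: Integer; const Value: Integer);
begin
  FRows[Row][Col] := Value;
end;
Swapping the per-row scheme for fixed-size sub-blocks later only means changing Create, GetCell and SetCell; callers keep writing M[Row, Col].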
If the absolute values of the elements of CurMx fit in a Word, then you can store them in a Word and use another array of Boolean for the sign. That saves 1 byte per element.
Have you considered manually allocating the data structure on the heap?
...and have you measured how this would affect the memory usage and the performance?
Using the heap might actually increase speed and reduce memory usage, because you can avoid the whole array being copied from one memory segment to another (e.g. if your FillMatrixWithData is declared with a non-const open array parameter).

Resources