How much slower is handling an iolist/deep list than handling a flat list? - erlang

I see that Section 5.3 of the Erlang Efficiency Guide recommends leaving a non-flat list as it is when it is used as an iolist, because the penalty of not flattening is smaller than the cost of flattening. Is there any quantitative example of the speed difference?

When a deep list contains n elements, performing lists:flatten/1 on it requires Θ(n) time and, worse, Θ(n) memory allocations. How slow that is on your machine is a function of many variables; measure and ye shall know.
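To see where that penalty comes from, compare the two strategies on a hand-rolled deep list. Below is a minimal C sketch (the Node layout and both function names are invented for illustration; Erlang's actual term representation is different): emitting straight from the deep structure touches every cell once and allocates nothing, while flattening first must also build a brand-new cons cell per element before any output can happen.

#include <stdio.h>
#include <stdlib.h>

/* Invented deep-list cell: either a byte (leaf) or a pair of children. */
typedef struct Node {
    int is_leaf;
    unsigned char byte;        /* valid when is_leaf  */
    struct Node *head, *tail;  /* valid when !is_leaf */
} Node;

/* "iolist" style: walk the tree and emit leaves directly; zero allocations. */
static void write_deep(const Node *n, FILE *out)
{
    if (n == NULL) return;
    if (n->is_leaf) { fputc(n->byte, out); return; }
    write_deep(n->head, out);
    write_deep(n->tail, out);
}

/* "flatten first" style: a fresh cons cell per element (the leaves are
   shared), i.e. Θ(n) time and Θ(n) allocations, the cost lists:flatten/1
   pays before a single byte can be written. */
static Node *flatten(Node *n, Node *rest)
{
    if (n == NULL) return rest;
    if (n->is_leaf) {
        Node *cell = malloc(sizeof *cell);
        cell->is_leaf = 0;
        cell->head = n;
        cell->tail = rest;
        return cell;
    }
    return flatten(n->head, flatten(n->tail, rest));
}

The allocation difference, not the traversal, is the point: both functions visit every cell, but only flatten calls the allocator n times (and leaves n cells of garbage behind once the output is written).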

Related

Advantages and disadvantages of linked list/binary trees as structures for storing and searching for data

I need to write up an evaluation of the advantages and disadvantages of linked lists and binary trees as structures for storing and searching data, but I'm a little bit lost on what those advantages and disadvantages could be.
Any help would be greatly appreciated, thanks!
Let's think about how a linked list works vs. how a binary tree works.
For a doubly linked list we have elements like:
Head <-> 5 <-> 10 <-> 4 <-> Tail
If we want to add an element, we can easily add it at the head or tail of the list: point the new element at the endpoint and at whatever the endpoint currently points to, then update both of those to point at the new element (making sure to correctly assign previous and next). I've glossed over it here, but there are lots of good resources available if you search for insertion into a linked list. This operation has O(1) time complexity. Do some research on insertion into (balanced) binary trees; it will take much longer.
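Concretely, the head insertion just described looks like this in C (a minimal sketch; it assumes the Head and Tail endpoints from the diagram are sentinel nodes, and the names are mine). Nothing in it depends on the length of the list, which is why it is O(1):

#include <stdlib.h>

typedef struct DNode {
    int value;
    struct DNode *prev, *next;
} DNode;

/* Insert right after the head sentinel: four pointer updates, O(1). */
static void push_front(DNode *head, int value)
{
    DNode *n = malloc(sizeof *n);
    n->value = value;
    n->prev  = head;
    n->next  = head->next;
    head->next->prev = n;
    head->next = n;
}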
Now what about searching? In the linked list above, if I want to find an element I have to walk through the entire list from one end until I hit the value I want. This leads to O(n) time complexity. However, if we have a balanced binary tree, we can check whether what we are looking for is higher or lower than the value that is roughly in the middle. If it's higher, we can eliminate the lower half of the numbers; then we repeat the same step with the remaining numbers. This cuts the number of remaining candidates roughly in half at every step and leads to significantly reduced search times.
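That halving step looks like this in C (a sketch for a balanced tree of ints; keeping the tree balanced is a separate topic):

typedef struct TNode {
    int value;
    struct TNode *left, *right;
} TNode;

/* Each comparison discards roughly half of the remaining nodes, so on a
   balanced tree the loop runs O(log n) times; compare the O(n) walk a
   linked list forces on you. */
static TNode *bst_find(TNode *root, int target)
{
    while (root != NULL) {
        if (target == root->value)
            return root;
        root = (target < root->value) ? root->left : root->right;
    }
    return NULL; /* not present */
}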
There are many, many ways to evaluate this, and they depend on the particular implementations. My advice is to choose which implementation of each you want to focus on, compare the time complexity of the operations, and then consider alternative implementations and what they do to those complexities.
For binary trees, consider balanced and unbalanced.
For linked lists, look at doubly linked, singly linked, and with and without a pointer to the tail (essentially stack vs. queue).
If you have questions about specific implementations and how they compare, let us know and we will do our best to clear that up.

What is the difference between loadu_ps and set_ps when using unformatted data?

I have some data that isn't stored as a structure of arrays. What is the best practice for loading such data into registers?
__m128 _mm_set_ps (float e3, float e2, float e1, float e0)
// or
__m128 _mm_loadu_ps (float const* mem_addr)
With _mm_loadu_ps I'd have to copy the data into a temporary stack array first, vs. passing the values directly with _mm_set_ps. Is there a difference?
It can be a tradeoff between latency and throughput, because separate stores into an array will cause a store-forwarding stall when you do a vector load. So it's high latency, but throughput could still be ok, and it doesn't compete with surrounding code for the vector shuffle execution unit. So it can be a throughput win if the surrounding code also has shuffle operations, vs. 3 shuffles to insert 3 elements into an XMM register after a scalar load of the first one. Either way it's still a lot of total uops, and that's another throughput bottleneck.
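For reference, the store/reload alternative looks like this (a sketch; a compiler that can see the stores is free to turn this back into shuffles):

#include <immintrin.h>

/* Four scalar stores followed by one wide reload. Cheap in shuffle-port
   terms, but the vector load can't be store-forwarded from four narrow
   stores, so expect a store-forwarding stall (high latency, ok throughput). */
static inline __m128 via_stack(float a, float b, float c, float d)
{
    float tmp[4] = { a, b, c, d };
    return _mm_loadu_ps(tmp);
}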
Most compilers like gcc and clang do a pretty good job with _mm_set_ps() when optimizing with -O3, whether the inputs are in memory or registers. I'd recommend it, except in some special cases.
The most common missed-optimization with _mm_set is when there's some locality between the inputs. e.g. don't do _mm_set_ps(a[i+2], a[i+3], a[i+0], a[i+1]), because many compilers will use their regular pattern without taking advantage of the fact that 2 pairs of elements are contiguous in memory. In that case, use (the intrinsics for) movsd and movhps to load in two 64-bit chunks. (Not movlps: it merges into an existing register instead of zeroing the high elements, so it has a false dependency on the old contents, while movsd zeros the high half.) Or a shufps if some reordering is needed between or within the 64-bit chunks.
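For the simplest contiguous-pairs case (no reordering needed within the pairs), the movsd + movhps combination can be written with intrinsics like this sketch; the float-to-double pointer cast is the usual idiom for requesting a 64-bit FP load of two packed floats:

#include <immintrin.h>

/* Build { p[0], p[1], q[0], q[1] } from two contiguous pairs.
   _mm_load_sd  -> movsd:  loads p[0..1], zeroing the upper half
                           (so no false dependency on the old register).
   _mm_loadh_pi -> movhps: merges q[0..1] into the upper half. */
static inline __m128 load_two_pairs(const float *p, const float *q)
{
    __m128 lo = _mm_castpd_ps(_mm_load_sd((const double *)p));
    return _mm_loadh_pi(lo, (const __m64 *)q);
}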
The "regular pattern" that compilers use will usually be movss / insertps from memory if compiling with SSE4, or movss loads and unpcklps shuffles to combine pairs and then another unpcklps, unpcklpd, or movlhps to shuffle into one register. Or a shufps or shufpd if the compiler likes to waste code-side on immediate shuffle-control operands instead of using fixed shuffles intelligently.
See also Agner Fog's optimization guides for some handy tables of data-movement instructions to get a better idea of what the compiler has to work with, and how stuff performs. Note that Haswell and later can only do 1 shuffle per clock. Also other links in the x86 tag wiki.
There's no really cheap way for a compiler or human to do this, in the general case when you have 4 separate scalars that aren't contiguous in memory at all. Or for register inputs, where it can't optimize the way they're generated in registers in the first place to have some of them already packed together. (e.g. for function args passed in registers to a function that can't / doesn't inline.)
Anyway, it's not a big deal unless you have this inside an inner loop. In that case, definitely worry about it (and check the compiler's asm output to see if it made a mess or could do better if you program the gather yourself with intrinsics that map to single instructions like _mm_load_ss / _mm_shuffle_ps).
If possible, rearrange your data layout to make data contiguous in at least small chunks / stripes. (See https://stackoverflow.com/tags/sse/info, specifically these slides.) But sometimes one part of the program needs the data one way, and the other part needs another. Choose the layout that's good for the case that needs to be faster, or that runs more often, or whatever, and suck it up and do the best you can for the other part of the program. :P Possibly transpose / convert once to set up for multiple SIMD operations, but extra passes over data with no computation just suck up time and can hurt your computational intensity (how much ALU work you do for each time you load data into registers) more than they help.
And BTW, actual gather instructions (like AVX2 vgatherdps) are not very fast; even on Skylake it's probably not worth using a gather instruction for four 32-bit elements at known locations. On Broadwell / Haswell, gather is definitely not worth using for this.

Golang: Zero Garbage propagation or efficient use of memory

From time to time I come across concepts like zero garbage or efficient use of memory. As an example, in the Features section of the well-known package httprouter you can see the following:
Zero Garbage: The matching and dispatching process generates zero bytes of garbage. In fact, the only heap allocations that are made, is by building the slice of the key-value pairs for path parameters. If the request path contains no parameters, not a single heap allocation is necessary.
This package also shows very good benchmark results compared to the standard library's http.ServeMux:
BenchmarkHttpServeMux 5000 706222 ns/op 96 B/op 6 allocs/op
BenchmarkHttpRouter 100000 15010 ns/op 0 B/op 0 allocs/op
As far as I understand, the second one (per the table) does no heap allocation and averages zero allocations per iteration.
The question: I want to gain a basic understanding of memory management: when the garbage collector allocates and deallocates memory, what the benchmark numbers mean (the last two columns of the table), and how people know when something is allocated on the heap.
I'm absolutely new in memory management, so it's really difficult to understand what's going on "under the hood". The articles I've read:
https://golang.org/ref/mem
https://golang.org/doc/effective_go.html
http://gribblelab.org/CBootcamp/7_Memory_Stack_vs_Heap.html
http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)
The garbage collector doesn't allocate memory :-), it only deallocates it. Go's garbage collector is evolving; for the details, have a look at the design document https://docs.google.com/document/d/16Y4IsnNRCN43Mx0NZc5YXZLovrHvvLhK_h0KN8woTO4/preview?sle=true and follow the discussion on the golang mailing lists.
The last two columns in the benchmark output are dead simple: how many bytes have been allocated in total, and how many allocations have happened during one iteration of the benchmark code. (These allocations are done by your code, not by the garbage collector.) As every allocation is a potential creation of garbage, reducing these numbers can be a design goal.
When are things allocated on the heap? Whenever the Go compiler decides to! The compiler tries to allocate on the stack, but sometimes it must use the heap, especially when a value escapes its local, stack-based scope. This escape analysis is currently undergoing rework, so it is not easy to tell which values will be heap- or stack-allocated, especially as this changes from compiler version to compiler version.
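Since the C article in your reading list (the gribblelab one) is about exactly this split, here is what it looks like when the choice is explicit, as in C; Go's escape analysis makes the equivalent decision for you. (A toy sketch; the function names are mine.)

#include <stdlib.h>

/* Stack: the variable lives in this function's frame and is gone on
   return; no allocator involved, no garbage created. */
int on_stack(void)
{
    int x = 42;
    return x;
}

/* Heap: malloc'd memory outlives the call and must be freed (or, in a
   GC'd language like Go, collected). Returning the address of a local
   is exactly the kind of thing that forces a Go local to "escape" to
   the heap. */
int *on_heap(void)
{
    int *p = malloc(sizeof *p);
    *p = 42;
    return p;
}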
I wouldn't be too obsessed with avoiding allocations until your benchmarks show too much GC overhead.

Does functional programming take up more memory?

Warning! possibly a very dumb question
Does functional programming eat up more memory than procedural programming?
I mean... if your objects (data structures, whatever) are all immutable, don't you end up having more objects in memory at a given time?
Doesn't this eat up more memory?
It depends on what you're doing. With functional programming you don't have to create defensive copies, so for certain problems it can end up using less memory.
Many functional programming languages also have good support for laziness, which can further reduce memory usage as you don't create objects until you actually use them. This is arguably something that's only correlated with functional programming rather than a direct cause, however.
Persistent values, which functional languages encourage but which can also be implemented in an imperative language, make sharing a no-brainer.
The generally accepted idea is that with a garbage collector there is some amount of wasted space at any given time (blocks that are already unreachable but not yet collected). On the other hand, without a garbage collector you very often end up copying values that are immutable and could have been shared, just because it's too much of a mess to decide who is responsible for freeing the memory after use.
These ideas are expanded on a bit in this experience report, which does not claim to be an objective study but only anecdotal evidence.
Apart from avoiding defensive copies by the programmer, a very smart implementation of pure functional programming languages like Haskell or Standard ML (which lack physical pointer equality) can actively recover sharing of structurally equal values in memory, e.g. as part of the memory management and garbage collection.
Thus you can have automatic hash consing provided by your programming language runtime-system.
Compare this with objects in Java: object identity is an integral part of the language definition. Even just exchanging one immutable String for another poses semantic problems.
There is indeed at least a tendency to regard memory as an abundant resource (which, in fact, it is in most cases), but this applies to modern programming as a whole.
With multiple cores, parallel garbage collectors, and RAM available in the gigabytes, one now concentrates on different aspects of a program than in earlier times, when every byte one could save counted. Remember the (probably apocryphal) Bill Gates line that 640K ought to be enough for anybody?
I know that I'm quite late to this question.
Functional languages do not, in general, use more memory than imperative or OO languages; it depends more on the code you write. Yes, F#, SML, Haskell and the like have immutable values (not variables), but for all of them it goes without saying that if you update, for example, a singly linked list, only what is necessary is rebuilt.
Say you have a list of 5 elements, and you remove the first 3 and add new elements in front. The new list will simply take the pointer to the fourth element and point at that data, i.e. reuse it, as seen below:
old list: [x0, x1, x2] --\
                          [x3, x4]   <- shared tail
new list: [y0, y1] ------/
If it were an imperative language, we could not do this, because the values x3 and x4 could very well change over time, and so could the list [x3,x4]. And if the 3 removed elements are not used afterwards, the memory they occupy can be reclaimed right away, in contrast to unused slots in an array.
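To make the diagram concrete, here is a minimal C sketch of immutable cons cells with a shared tail (Cons and the helper are invented for illustration; the integer values stand in for x0..x4 and y0..y1). Because cells are never written after creation, handing out the same tail twice is safe:

#include <stdlib.h>

typedef struct Cons {
    int head;
    const struct Cons *tail;
} Cons;

static const Cons *cons(int head, const Cons *tail)
{
    Cons *c = malloc(sizeof *c);
    c->head = head;
    c->tail = tail;
    return c;
}

/* old_list = [x0,x1,x2,x3,x4], new_list = [y0,y1,x3,x4]: building the
   new list allocates only the two y cells; the x3..x4 tail is reused. */
static void example(void)
{
    const Cons *shared   = cons(3, cons(4, NULL));            /* [x3,x4] */
    const Cons *old_list = cons(0, cons(1, cons(2, shared)));
    const Cons *new_list = cons(10, cons(11, shared));
    (void)old_list; (void)new_list;
}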
That all data are immutable (except for IO) is a strength. It turns data-flow analysis from a non-trivial problem into a trivial one. Combined with an often very strong type system, this gives the compiler a lot of information about the code, which it can use for optimizations that are normally blocked by undecidability. Most often the compiler turns values that are recomputed recursively and discarded on each iteration into a mutable computation. These two things are behind the saying that if your program compiles, it will work (with some assumptions).
If you look at the language Rust (not functional), just learning about its borrow system will teach you a lot about how and when things can be shared safely. It is a language that is painful to write code in unless you like your computer screaming at you that you are an idiot. Rust is, for the most part, the distillation of more than 40 years of research on programming languages and type theory. I mention Rust because, despite the pain of writing in it, it promises that if your program compiles, there will be no dangling pointers and no data races, even in concurrent programs. It can make that promise because it builds on much of the research done on functional programming languages.
For a more complex example of functional programming using less memory: I wrote a lexer/parser interpreter (like a generator, but without the need to emit a code file). When computing the states of the DFA (deterministic finite automaton), it uses immutable sets. Because it computes new sets from already computed ones, my code allocates less memory, simply because it borrows already-known data points instead of copying them into each new set.
To wrap it up: yes, functional programming can use more memory than imperative programming, but most likely it is because you are using the wrong abstraction to mirror the problem, i.e. if you try to do it the imperative way in a functional language, it will hurt you.
Try this book. It doesn't have much on memory management, but it is a good place to start if you want to learn compiler theory, and yes, it is legal to download; I have asked Torben, who is my old professor.
http://hjemmesider.diku.dk/~torbenm/Basics/
I'll throw my hat in the ring here. The short answer to the question is no, because a value being immutable does not mean it has to stay stored in memory. For example, let's take this toy program:
x = 2
x = x * 3
x = x * 2
print(x)
It uses mutation to compute new values. Compare this to the same program, which does not use mutation:
x = 2
y = x * 3
z = y * 2
print(z)
At first glance, it appears this requires 3x the memory of the first program! However, just because a value is immutable doesn't mean it needs to be stored in memory. In the case of the second program, after y is computed, x is no longer necessary, because it isn't used for the rest of the program, and can be garbage collected, or removed from memory. Similarly, after z is computed, y can be garbage collected. So, in principle, with a perfect garbage collector, after we execute the third line of code, I only need to have stored z in memory.
Another oft-worried-about source of memory consumption in functional languages is deep recursion, for example calculating a large factorial.
calc_fact(x):
    if x > 1:
        return x * calc_fact(x - 1)
    else:
        return x
If I run calc_fact(100000), I could implement it in a way that requires keeping 100000 pending calls in memory, or I could use tail-call elimination: after rewriting the function with an accumulator so that the recursive call is the last thing the function does, only the most recently computed value needs to be kept instead of every stacked frame. For less straightforward recursion you can resort to trampolining. So for functional languages which support this, recursion does not need to be a source of massive memory consumption either. However, not all nominally functional languages do (JavaScript engines, for example, generally don't perform it).
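For what it's worth, here is that accumulator rewrite as a C sketch (gcc and clang at -O2 typically compile this into a loop, so it runs in constant stack space; the huge result would of course overflow, the point here is only the stack behavior):

/* Tail-recursive factorial: the recursive call is the very last thing
   the function does, so the compiler can reuse the current stack frame
   instead of pushing 100000 of them. */
static unsigned long long fact(unsigned long long x, unsigned long long acc)
{
    if (x > 1)
        return fact(x - 1, acc * x); /* tail call: nothing left to do after */
    return acc;
}
/* usage: fact(n, 1) */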

Creating a linked list using CUDA

Is it possible to create a linked list on a GPU using CUDA?
I am trying to do this and I am encountering some difficulties.
If I can't allocate dynamic memory in a CUDA kernel, then how can I create a new node and add it to the linked list?
You really don't want to do this if you can help it. If you can't get away from linked lists, the best thing you can do is emulate them via arrays, using array indices rather than pointers for your links.
There are some valid use cases for linked lists on a GPU. Consider using a Skip List as an alternative as they provide faster operations. There are examples of highly concurrent Skip List algorithms available via Google searches.
Check out this link, http://www.cse.iitk.ac.in/users/mainakc/lockfree.html/, for CUDA code plus PDF and PPT presentations on a number of lock-free CUDA data structures.
Linked lists can be constructed in parallel using a reduction-style approach, assuming ALL members are known at construction time. Each thread starts by connecting 2 nodes; then half the threads connect the 2-node segments together, and so on, halving the number of active threads on each iteration. This builds a list in log2 N steps.
Memory allocation is a constraint. Pre-allocate all the nodes in an array on the host; then you can use array subscripts in place of pointers. That has the advantage that list traversal is valid on both the GPU and the host.
For concurrency you need to use CUDA atomic operations: an atomic add/increment to count the nodes used from the node array, and compare-and-swap to set the links between nodes.
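Putting those two paragraphs together, here is a hedged sketch in CUDA C (all names invented, error handling omitted): a pre-allocated pool addressed by integer indices, atomicAdd to hand out nodes, and an atomicCAS retry loop to push onto a shared list head.

/* Node pool: indices replace pointers, so a list built on the device can
   also be walked on the host after copying the arrays back. */
struct NodePool {
    int *value;    /* value[i]: payload of node i                     */
    int *next;     /* next[i]:  index of the following node, -1 = end */
    int *used;     /* running count of nodes handed out               */
    int  capacity;
};

/* Grab the next free slot with an atomic increment. */
__device__ int alloc_node(NodePool p, int value)
{
    int i = atomicAdd(p.used, 1);
    if (i >= p.capacity) return -1;   /* pool exhausted */
    p.value[i] = value;
    p.next[i]  = -1;
    return i;
}

/* Lock-free push of node i onto the list whose head index lives at *head. */
__device__ void push_front(NodePool p, int *head, int i)
{
    int old = *head;
    do {
        p.next[i] = old;                 /* link node to the current head */
        old = atomicCAS(head, old, i);   /* try to install i as new head  */
    } while (old != p.next[i]);          /* lost the race: retry          */
}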
Again, carefully consider the use case and access patterns. Using one large linked list is very serial; using hundreds of small linked lists is more parallel. And expect the memory accesses to be uncoalesced unless care is taken to allocate connected nodes in adjacent memory locations.
I agree with Paul: linked lists are a very 'serial' way of thinking. Forget what you've learned about serial operations and just do everything at once. :)
Take a look at Thrust for parallel implementations of common operations.
