In this SO thread, I learned that keeping a reference to a seq on a large collection will prevent the entire collection from being garbage-collected.
First, that thread is from 2009. Is this still true in "modern" Clojure (v1.4.0 or v1.5.0)?
Second, does this issue also apply to lazy sequences? For example, would (def s (drop 999 (seq (range 1000)))) allow the garbage collector to retire the first 999 elements of the sequence?
Lastly, is there a good way around this issue for large collections? In other words, if I had a vector of, say, 10 million elements, could I consume the vector in such a way that the consumed parts could be garbage collected? What about if I had a hashmap with 10 million elements?
The reason I ask is that I'm operating on fairly large data sets, and I am having to be more careful not to retain references to objects, so that the objects I don't need can be garbage collected. As it is, I'm encountering a java.lang.OutOfMemoryError: GC overhead limit exceeded error in some cases.
It is always the case that if you "hold onto the head" of a sequence then Clojure will be forced to keep everything in memory. It doesn't have a choice: you are still keeping a reference to it.
However, the "GC overhead limit exceeded" error isn't the same as an out-of-memory error; it's more likely a sign that you are running a fictitious workload that is creating and discarding objects so fast that it tricks the GC into thinking it is overloaded.
See:
GC overhead limit exceeded
If you put an actual workload on the items being processed, I suspect you will see that this error won't happen any more. You can easily process lazy sequences that are larger than available memory in this case.
Concrete collections like vectors and hash maps are a different matter, however: these are not lazy, so they must always be held completely in memory. If you have datasets larger than memory then your options include:
Use lazy sequences and don't hold onto the head
Use specialised collections that support lazy loading (Datomic uses some structures like this I believe)
Treat the data as an event stream (using something like Storm)
Write custom code to partition the data into chunks and process them one at a time.
If you hold the head of a sequence in a binding, then you are correct: it can't be GC'd (and that's true for every version of Clojure). If you are processing a large number of results, why do you need to hold onto the head?
As for a way around it, yes! A lazy-seq implementation can GC parts that have already been processed and are no longer directly referenced from the binding. Just ensure that you are not holding onto the head of the sequence.
This one's kind of an open-ended design question, I'm afraid.
Anyway: I have a big two-dimensional array of stuff. This array is mutable and is accessed by a bunch of threads. For now I've just been dealing with this as an Arc<Mutex<Vec<Vec<--owned stuff-->>>>, which has been fine.
The problem is that stuff is about to grow considerably in size, and I'll want to start holding references rather than complete structures. I could do this by inverting everything and going to Vec<Vec<Arc<Mutex<T>>>>, but I feel like that would be a ton of overhead, especially because each thread would need a complete copy of the grid rather than a single Arc/Mutex.
What I want to do is have this be an array of references, but somehow communicate that the items being referenced all live long enough according to a single top-level Arc or something similar. Is that possible?
As an aside, is Vec even the correct data type for this? For the grid in particular I really want a large, fixed-size block of memory that will live for the entire length of the program once it's initialized, and has a lot of reference locality (along either dimension.) Is there something else/more specialized I should be using?
EDIT: Giving some more specifics on my code (away from home, so this is rough):
What I want:
Outer scope initializes a bunch of Ts and somehow collectively ensures they live long enough (that's the hard part)
Outer scope initializes a grid: Something<Vec<Vec<&T>>> that stores references to the Ts
Outer scope creates a bunch of threads and passes grid to them
Threads dive in and out of some sort of (probably RW) lock on grid, reading the Ts and changing the &Ts in the process.
What I have:
Outer thread creates a grid: Arc<RwLock<Vec<Vec<T>>>>
Arc::clone(&grid)s are passed to individual threads
Read-heavy threads mostly share the lock and sometimes kick each other out for the writes.
The only problem with this is that the grid is storing actual Ts, which might be problematically large. (Don't worry too much about the RwLock/thread-exclusivity stuff; I think it's orthogonal to the question unless something about it jumps out at you.)
What I don't want to do:
Top level creates a bunch of Arc<Mutex<T>> for individual T
Top level creates a grid: Vec<Vec<Arc<Mutex<T>>>> and passes it to threads
The problem with that is that I worry about the size of Arc/Mutex on every grid element (I've been going up to 2000x2000 so far and may go larger). Also while the threads would lock each other out less (only if they're actually looking at the same square), they'd have to pick up and drop locks way more as they explore the array, and I think that would be worse than my current RwLock implementation.
Let me start off with your "aside" question, as I feel it's the one that can be answered:
As an aside, is Vec even the correct data type for this? For the grid in particular I really want a large, fixed-size block of memory that will live for the entire length of the program once it's initialized, and has a lot of reference locality (along either dimension.) Is there something else/more specialized I should be using?
The documentation of std::vec::Vec specifies that the layout is essentially a pointer with size information. That means that any Vec<Vec<T>> is a pointer to a densely packed array of Vec headers, each pointing to its own densely packed array of Ts. So if "block of memory" means a contiguous block to you, then no, Vec<Vec<T>> cannot give you that. If that is part of your requirements, you'd have to deal with a datatype (let's call it Grid) that is basically a (pointer, n_rows, n_columns) and decide for yourself whether the layout should be row-first or column-first.
The next part is that if you want different threads to mutate e.g. columns/rows of your grid at the same time, Arc<Mutex<Grid>> won't cut it, but you already figured that out. You should get clarity on whether you can split your problem such that each thread only operates on rows OR columns. Remember that if any thread holds a &mut Row, no other thread must hold a &mut Column: there will be an overlapping element, and it will be very easy for you to create data races. If you can assign a static range of rows to a thread (e.g. thread 1 processes rows 1-3, thread 2 processes rows 4-6, etc.), that should make your life considerably easier. To get into "row-wise" processing if it doesn't arise naturally from the problem, you might consider breaking it into e.g. a row-wise step, where all threads operate on rows only, and then a column-wise step, possibly repeating those.
Speculative starting point
I would suggest that your main thread holds the Grid struct, which will almost inevitably be implemented with some unsafe methods, e.g. get_row(usize), get_row_mut(usize) if you can split your problem into rows/columns, or get(usize, usize) and get_mut(usize, usize) if you can't. I cannot tell you what exactly these should return, but they might even be custom references into Grid, which:
can only be obtained when the usual borrowing rules are fulfilled (e.g. by blocking the thread until any other GridRefMut is dropped)
implement Drop such that you don't create a deadlock
Every thread holds an Arc<Grid>, and can draw cells/rows/columns for reading/mutating out of the grid as needed, while the grid itself keeps book of the references being created and dropped.
The downside of this approach is that you basically implement a runtime borrow-checker yourself. It's tedious and probably error-prone. You should browse crates.io before you do that, but your problem sounds specific enough that you might not find a fitting solution, let alone one that's sufficiently documented.
[This is a followup to a related question here.]
My code has a loop like the following:
Document d(kObjectType);
while (not done() and getNewStuff(d)) {
    process(d);
    d.RemoveAllMembers();
}
While this code produces the desired results, each iteration appears to allocate new memory for the members of the document; and, when the code is run in limited configurations, it exhausts all memory and terminates prematurely.
The document which is populated by getNewStuff() is arbitrarily complex (that is, it may contain nested objects, arrays, arrays of objects, etc.), and it uses the default allocation method. The answer to the previous question indicated that when the nested objects and/or arrays were destroyed, their storage would be returned to the allocator (and not to the system). This explains why the process memory consumption doesn't drop when the members are removed. (However, I've confirmed via Valgrind that there is no "memory leak", because the memory is all still properly accounted for, and, when the process exits, it is all properly freed.)
However, it doesn't explain why the process memory consumption continues to increase over time. It seems clear that the call to RemoveAllMembers() is not making the memory underlying those members available for reuse by subsequent AddMembers() (etc.) calls.
My question is: what do I need to do, in addition to or instead of calling RemoveAllMembers(), to allow my Document object (d) to be reused on the next iteration?
(By the way, I tried moving the declaration of d to the inside of the loop, and this does indeed produce the desired memory behavior, but I would rather avoid the overhead of destructing and re-constructing d on each iteration.)
Thanks!
In terms of big-O runtime, it seems that both data structures (a circular buffer and a linked list) have, in the "average" case:
O(1) insertion/removal at the start and end
O(n) insertion/removal at an arbitrary index
O(1) lookup of the start and end
Advantages of circular buffer:
O(1) lookup of an arbitrary index, instead of O(n)
Doesn't need to create nodes, thus doesn't need a dynamic allocation on each insertion
Faster traversal due to better cache locality and prefetching
Faster removal due to vectorization (e.g. using memmove) to fill the gap
Typically needs less space (because in a linked list, each node has to store pointers to the next and/or previous node)
Advantages of linked list:
Easier to get O(1) insertion/removal at a specific place (e.g., you can get it for a position midway through the list, given an iterator to it). Circular buffers can do it, but it's more complicated (see the sketch after this list)
O(1) insertion in the worst case, unlike circular buffers, which are O(n) in the worst case (when the buffer needs to grow)
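(For concreteness, here is a small sketch of that difference, using std::list and std::vector as stand-ins for a linked list and a contiguous buffer:)

#include <iterator>
#include <list>
#include <vector>

int main() {
    std::list<int> lst = {1, 2, 4};
    auto pos = std::next(lst.begin(), 2);   // iterator to the 4
    lst.insert(pos, 3);                     // O(1): just relinks two pointers

    std::vector<int> vec = {1, 2, 4};
    vec.insert(vec.begin() + 2, 3);         // O(n): shifts the tail of the buffer
}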
Based on this list, it seems to me that circular buffers are a far better choice in almost every case. Am I missing something?
The MCS lock is one of the most scalable lock designs there is. A thread uses an atomic compare-and-exchange to attempt to seize the lock. If that works, it is done. If it doesn't work, the thread uses an atomic exchange to enqueue itself at the tail of the list of waiters.
There is no way to do a similar thing with circular buffers without locks or a more complicated use of atomic instructions.
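For concreteness, here is a minimal sketch of the classic MCS acquire/release in C++ (illustrative only, not production-hardened; in this classic form a single unconditional exchange covers both the "seize" and the "enqueue" steps described above):

#include <atomic>

struct MCSLock {
    struct Node {
        std::atomic<Node*> next{nullptr};
        std::atomic<bool>  locked{false};
    };

    std::atomic<Node*> tail{nullptr};

    void lock(Node& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        // Enqueue ourselves at the tail; if the queue was empty, we own the lock.
        Node* prev = tail.exchange(&me, std::memory_order_acq_rel);
        if (prev != nullptr) {
            me.locked.store(true, std::memory_order_relaxed);
            prev->next.store(&me, std::memory_order_release);
            // Each waiter spins on its own node, which is what makes MCS scale.
            while (me.locked.load(std::memory_order_acquire)) { /* spin */ }
        }
    }

    void unlock(Node& me) {
        Node* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No visible successor: try to swing the tail back to "unlocked".
            Node* expected = &me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel))
                return;
            // A successor is mid-enqueue; wait for it to link itself.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) { /* spin */ }
        }
        succ->locked.store(false, std::memory_order_release);
    }
};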
Because of performance issues when moving a code from static to dynamic allocation, I started to wonder how memory allocation is managed in a Fortran code.
Specifically, in this question, I wonder whether the order or syntax used for the allocate statement makes any difference. That is, does it make any difference to allocate vectors like:
allocate(x(DIM),y(DIM))
versus
allocate(x(DIM))
allocate(y(DIM))
The syntax suggests that in the first case the program would allocate all the space for the vectors at once, possibly improving performance, while in the second case it must allocate the space for one vector at a time, in such a way that they could end up far from each other. If not, that is, if the syntax does not make any difference, I wonder if there is a way to control that allocation (for instance, allocating a single block for all the space and using pointers to address the allocated space as multiple variables).
Finally, I realize I don't even know one thing: does an allocate statement guarantee that at least a single vector occupies contiguous space in memory (or does it just do the best it can)?
From the point of view of the language standard, both ways of writing it are possible. The compiler is free to allocate the arrays wherever it wants. It normally calls malloc() to allocate some piece of memory and makes the allocatable arrays from that piece.
Whether it might allocate a single piece of memory for two different arrays in a single allocate statement is up to the compiler, but I haven't heard about any compiler doing that.
I just verified that my gfortran calls __builtin_malloc twice in this case.
Another issue has already been pointed out by High Performance Mark. Even when malloc() successfully returns, the actual memory pages might still not be assigned. On Linux, that happens only when you first access the array.
I don't think it is too important if those arrays are close to each other in memory or not anyway. The CPU can cache arrays from different regions of address space if it needs them.
Is there a way to control the allocation? Yes, you can override malloc with your own allocator that does some clever things. This may be used to always have memory aligned to 32 bytes, or for similar purposes (example). Whether you will improve the performance of your code by allocating things close to each other is questionable, but you can give it a try. (Of course, this is a completely compiler-dependent thing; a compiler doesn't have to use malloc() at all, but mostly they do.) Unfortunately, this only works when the calls to malloc are not inlined.
There are (at least) two issues here, firstly the time taken to allocate the memory and secondly the locality of memory in the arrays and the impact of this on performance. I don't know much about the actual allocation process, although the links suggested by High Performance Mark and the answer by Vadimir F cover this.
From your question, it seems you are more interested in cache hits and the memory locality given by the arrays being next to each other. I would guess there is no guarantee that either allocate statement ensures the two arrays end up next to each other in memory. This is based on the treatment of components of a derived type, about which the Fortran 2003 standard (MAY 2004 WORKING DRAFT, J3/04-007) says:
NOTE 4.20
Unless the structure includes a SEQUENCE statement, the use of this terminology in no way implies that these components are stored in this, or any other, order. Nor is there any requirement that contiguous storage be used.
From the discussion with Vadimir F, if you put allocatable arrays in a type and use the sequence keyword, e.g.
type botharrays
   SEQUENCE
   double precision, dimension(:), allocatable :: x, y
end type
this DOES NOT ensure they are allocated adjacently in memory. For static arrays or lots of variables, a sequence type sounds like it may work like your idea of "allocating a vector for all space and using pointers to address the space allocated as multiple variables". I think common blocks (Fortran 77) allowed you to specify the relationship between the memory locations of arrays and variables, but they don't work with allocatable arrays either.
In short, I think this means you cannot ensure that two allocated arrays are adjacent in memory. Even if you could, I don't see how this would result in fewer cache misses or improved performance. Even if you typically use the two together, unless the arrays are small enough that a single read can bring parts of both arrays into the cache (assuming reads are allowed to go beyond array bounds), you won't benefit from the memory locality.
I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where a pointer to a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to use the element, does it not need to dereference the pointer, finding and loading the pointed-to object from either the L3 cache or across the interconnect (if the other thread is on a different CPU socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ is principally a pay-for-what-you-need ecosystem.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
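To get a feel for that restriction, here is a tiny illustration (assuming C++17; Big is just a made-up stand-in for a message type wider than a register):

#include <atomic>
#include <cstdint>
#include <iostream>

struct Big { std::uint64_t a, b, c, d; };   // 32 bytes: wider than any native register

int main() {
    // A register-sized value maps onto native atomic instructions...
    std::cout << std::atomic<std::uint64_t>::is_always_lock_free << '\n';   // 1 on mainstream targets
    // ...but std::atomic<Big> typically has to fall back to an internal lock.
    std::cout << std::atomic<Big>::is_always_lock_free << '\n';             // usually 0
}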
Enter Ringbuffers
Indeed, any node-based container would just require pointer swaps for all modifying operations, and those are trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, you would just have to maintain the indices of the buffer front and end positions atomically. The container could, at its leisure, update an internal "dirty front index" while it copies ahead of the external front (the copy can use relaxed memory ordering). Only once the whole copy is known to have completed is the external front index updated. This update needs to be in acq_rel/cst memory order[1].
As long as the container is able to guard the invariant that the front never fully wraps around and catches up with the back, this is a sweet deal (a minimal sketch follows the list below). I think this idea was popularized in the Disruptor library (of LMAX fame). You get mechanical sympathy from:
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple of) physical cache lines
all the data is local unless the POD contains raw references outside that record
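To make the index scheme above concrete, here is a minimal single-producer/single-consumer ring buffer sketch (names and the modulo-based wrap-around are made up for illustration; boost::lockfree::spsc_queue is considerably more careful about padding, ranges, and wrap-around):

#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t N>
class SpscRing {
    std::array<T, N> buf_{};                // one slot stays unused to tell "full" from "empty"
    std::atomic<std::size_t> head_{0};      // next slot to read  (written only by the consumer)
    std::atomic<std::size_t> tail_{0};      // next slot to write (written only by the producer)

public:
    bool push(const T& v) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;                                // full
        buf_[t] = v;                                     // copy the payload first...
        tail_.store(next, std::memory_order_release);    // ...then publish it
        return true;
    }

    std::optional<T> pop() {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;                         // empty
        T v = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return v;
    }
};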
How Does Boost's spsc_queue Actually Do This?
Yes, spsc_queue stores the raw element values in a contiguous, aligned block of memory (e.g. from compile_time_sized_ringbuffer, which underlies spsc_queue when the maximum capacity is supplied statically):
typedef typename boost::aligned_storage<max_size * sizeof(T),
                                        boost::alignment_of<T>::value
                                       >::type storage_type;

storage_type storage_;

T * data()
{
    return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there is only the "internal" index on either the read or the write side. This is possible because there's only one writing thread and also only one reading thread, which means that there could only be more space at the end of a write operation than anticipated.
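As an aside, the same false-sharing protection can be expressed without a manual padding array, assuming C++17 and a standard library that defines std::hardware_destructive_interference_size:

#include <atomic>
#include <cstddef>
#include <new>

struct Indices {
    // Each index gets its own cache line, so producer and consumer don't ping-pong it.
    alignas(std::hardware_destructive_interference_size) std::atomic<std::size_t> write_index{0};
    alignas(std::hardware_destructive_interference_size) std::atomic<std::size_t> read_index{0};
};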
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::uninitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the uninitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)
All in all, we see a best-in-class take on the ring buffer idea.
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing the frequently used data right there in the value, and referencing the less frequently used data via a pointer (the cost of the pointer is low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so that you can control the memory access patterns there too.
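For example (a hypothetical layout, not part of the boost API; the field names are made up), a hot-path message that fits one 64-byte cache line and carries an "attachment" pointer for the rarely touched detail record might look like:

#include <cstdint>
#include <boost/lockfree/spsc_queue.hpp>

struct alignas(64) Tick {
    std::uint64_t timestamp_ns;   // hot fields, read on every pop
    double        price;
    std::uint32_t size;
    std::uint32_t flags;
    const void*   attachment;     // cold detail record, dereferenced only when needed
};
static_assert(sizeof(Tick) == 64, "one element per cache line on a typical 64-bit target");

// The element is copied into the ring buffer by value; only `attachment`
// costs an extra indirection on the consumer side, and only if it is used.
boost::lockfree::spsc_queue<Tick, boost::lockfree::capacity<1024>> tick_queue;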
Let your profiler guide you.
[1] I suppose for SPSC acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend anyone else to follow my example :)