Some other questions here have been about the issue of being able to compress only a part/chunk of a large file of compressed data. Allowing some sort of "random access decompression". Bzip2 has always been among the recommendations for such a feature.
Reading about bzip on Wikipedia and on some document refered to as the informal specification it was not completely clear at what level this feature to separately decompress a part of the bzip2 file occurs. There seems to be two options, a) it is on the level of BzipStreams and b) it is even on the level of StreamBlocks (of which to my understading there can be one or more inside of a BzipStream).
BZipFile:=BZipStream+
└──BZipStream:=StreamHeader StreamBlock* StreamFooter
├──StreamHeader:=HeaderMagic Version Level
├──StreamBlock:=BlockHeader BlockTrees BlockData
│ ├──BlockHeader:=BlockMagicBlockCRC Randomized OrigPtr
│ └──BlockTrees:=SymMapNumTrees NumSels Selectors Trees
│ ├──SymMap:=MapL1 MapL2{1,16}
│ ├──Selectors:=Selector{NumSels}
│ └──Trees:=(BitLen Delta{NumSyms}{NumTrees}
└──StreamFooter:=FooterMagic StreamCRCPadding
Albeit the bzip2 is praised often, it seems to me the fact of having the archive data not being byte-aligned, but bit-aligned within each BzipStream, whould suggest that the separate decompression of individual blocks has not been something that was supposed to happen, though I cannot be sure and hence this question :)
Update
A look onto the man bzip2recover manual page tells
bzip2 compresses files in blocks, usually 900kbytes long. Each
block is handled independently. If a media or transmission error
causes a multi-block .bz2 file to become damaged, it may be possible
to recover data from the undamaged blocks in the file.
The compressed representation of each block is delimited by a
48-bit pattern, which makes it possible to find the block boundaries
with rea‐ sonable certainty. Each block also carries its own 32-bit
CRC, so dam‐ aged blocks can be distinguished from undamaged ones.
which might strongly suggest that each block can be decompressed separately. Is this correct?
Related
I have some large files in a local binary format, which contains many 3D (or 4D) arrays as a series of 2D chunks. The order of the chunks in the files is random (could have chunk 17 of variable A, followed by chunk 6 of variable B, etc.). I don't have control over the file generation, I'm just using the results. Fortunately the files contain a table of contents, so I know where all the chunks are without having to read the entire file.
I have a simple interface to lazily load this data into dask, and re-construct the chunks as Array objects. This works fine - I can slice and dice the array, do calculations on them, and when I finally compute() the final result the chunks get loaded from file appropriately.
However, the order that the chunks are loaded is not optimal for these files. If I understand correctly, for tasks where there is no difference of cost (in terms of # of dependencies?), the local threaded scheduler will use the task keynames as a tie-breaker. This seems to cause the chunks to be loaded in their logical order within the Array. Unfortunately my files do not follow the logical order, so this results in many seeks through the data (e.g. seek halfway through the file to get chunk (0,0,0) of variable A, then go back near the beginning to get chunk (0,0,1) of variable A, etc.). What I would like to do is somehow control the order that these chunks get read, so they follow the order in the file.
I found a kludge that works for simple cases, by creating a callback function on the start_state. It scans through the tasks in the 'ready' state, looking for any references to these data chunks, then re-orders those tasks based on the order of the data on disk. Using this kludge, I was able to speed up my processing by a factor of 3. I'm guessing the OS is doing some kind of read-ahead when the file is being read sequentially, and the chunks are small enough that several get picked up in a single disk read. This kludge is sufficient for my current usage, however, it's ugly and brittle. It will probably work against dask's optimization algorithm for complex calculations. Is there a better way in dask to control which tasks win in a tie-breaker, in particular for loading chunks from disk? I.e., is there a way to tell dask, "all things being equal, here's the relative order I'd like you to process this group of chunks?"
Your assessment is correct. As of 2018-06-16 there is not currently any way to add in a final tie breaker. In the distributed scheduler (which works fine on a single machine) you can provide explicit priorities with the priority= keyword, but these take precedence over all other considerations.
I use recon_alloc:memory(allocated_types) and get info like below.
34> recon_alloc:memory(allocated_types).
[{binary_alloc,1546650440},
{driver_alloc,21504840},
{eheap_alloc,28704768840},
{ets_alloc,526938952},
{fix_alloc,145359688},
{ll_alloc,403701800},
{sl_alloc,688968},
{std_alloc,67633992},
{temp_alloc,21504840}]
The eheap_alloc is using 28G. But sum up with heap_size of all process
>lists:sum([begin {_, X}=process_info(P, heap_size), X end || P <- processes()]).
683197586
Only 683M !Any idea where is the 28G ?
You are not comparing the right values. From erlang:process_info
{heap_size, Size}
Size is the size in words of youngest heap generation of the
process. This generation currently include the stack of the process.
This information is highly implementation dependent, and may change if
the implementation change.
recon_alloc:memory(allocated_types) is in bytes by default. You can change it using set_unit. It is not the memory that is currently used but it is the memory reserved by the VM grouped into different allocators. You can use recon_alloc:memory(used) instead. More details in allocator() - Recon Library
Searching through the Erlang source code for the eheap_alloc keyword I didn't come up with much. The most relevant piece of code was this XML code from erts_alloc.xml (https://github.com/erlang/otp/blob/172e812c491680fbb175f56f7604d4098cdc9de4/erts/doc/src/erts_alloc.xml#L46):
<tag><c>eheap_alloc</c></tag>
<item>Allocator used for Erlang heap data, such as Erlang process heaps.</item>
This says that process heaps are stored in eheap_alloc but it doesn't say what else is stored in eheap_alloc. The eheap_alloc stores everything your application needs to run along with some extra memory along with some additional space, so the VM doesn't have to request more memory from the OS every time something needs to be added. There are things the VM must keep in memory that aren't associated with a specific process. For example, large binaries, even though they may used within a process, are not stored inside that processes heap. They are stored in a shared process binary heap called binary_alloc. The binary heap, along with the process heaps and some extra memory, are what make up eheap_alloc.
In your case it looks like you have a lot of memory in your binary_alloc. binary_alloc is probably using a significant portion of your eheap_alloc.
For more details on binary handling checkout these pages:
http://blog.bugsense.com/post/74179424069/erlang-binary-garbage-collection-a-love-hate
http://www.erlang.org/doc/efficiency_guide/binaryhandling.html#id65224
I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where the pointer of a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to access the element, does it not need to access the pointer itself by finding and loading the element either from either the L3 cache or across the interconnect (if the other thread is on a different cpu socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ principally a pay-for-what-you-need eco-system.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
Enter Ringbuffers
Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, one would just have to maintain the index of the buffer front and end positions atomically. The container could, at leisure update in internal 'dirty front index' while it copies ahead of the external front. (The copy can use relaxed memory ordering). Only as soon as the whole copy is known to have completed, the external front index is updated. This update needs to be in acq_rel/cst memory order[1].
As long as the container is able to guard the invariant that the front never fully wraps around and reaches back, this is a sweet deal. I think this idea was popularized in the Disruptor Library (of LMAX fame). You get mechanical resonance from
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple) physical cache lines
all the data is local unless the POD contains raw references outside that record
How Does Boost's spsc_queue Actually Do This?
Yes, spqc_queue stores the raw element values in a contiguous aligned block of memory: (e.g. from compile_time_sized_ringbuffer which underlies spsc_queue with statically supplied maximum capacity:)
typedef typename boost::aligned_storage<max_size * sizeof(T),
boost::alignment_of<T>::value
>::type storage_type;
storage_type storage_;
T * data()
{
return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there are only the "internal" index on either read or write side. This is possible because there's only one writing thread and also only one reading thread, which means that there could only be more space at the end of write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::unitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the unitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)
All in all, we see a best-in-class possible idea for a ringbuffer
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing frequently used data right there in the value, and referencing the less-of-used data using a pointer (the cost of the pointer will be low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can achieve memory access patterns there too.
Let your profiler guide you.
[1] I suppose say for spsc acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend anyone else to follow my example :)
I'm reading What Every Programmer Should Know About Memory pdf by Ulrich Drepper. At the beginning of part 6 theres's a code fragment:
#include <emmintrin.h>
void setbytes(char *p, int c)
{
__m128i i = _mm_set_epi8(c, c, c, c,
c, c, c, c,
c, c, c, c,
c, c, c, c);
_mm_stream_si128((__m128i *)&p[0], i);
_mm_stream_si128((__m128i *)&p[16], i);
_mm_stream_si128((__m128i *)&p[32], i);
_mm_stream_si128((__m128i *)&p[48], i);
}
With such a comment right below it:
Assuming the pointer p is appropriately aligned, a call to this
function will set all bytes of the addressed cache line to c. The
write-combining logic will see the four generated movntdq instructions
and only issue the write command for the memory once the last
instruction has been executed. To summarize, this code sequence not
only avoids reading the cache line before it is written, it also
avoids polluting the cache with data which might not be needed soon.
What bugs me is the that in comment to the function it is written that it "will set all bytes of the addressed cache line to c" but from what I understand of stream intrisics they bypass caches - there is neither cache reading nor cache writing. How would this code access any cache line? The second bolded fragment says sotheming similar, that the function "avoids reading the cache line before it is written". As stated above I don't see any how and when the caches are written to. Also, does any write to cache need to be preceeded by a cache write? Could someone clarify this issue to me?
When you write to memory, the cache line where you write must first be loaded into the caches in case you only write the cache line partially.
When you write to memory, stores are grouped in store buffers. Typically once the buffer is full, it will be flushed to the caches/memory. Note that the number of store buffers is typically small (~4). Consecutive writes to addresses will use the same store buffer.
The streaming read/write with non-temporal hints are typically used to reduce cache pollution (often with WC memory). The idea is that a small set of cache lines are reserved on the CPU for these instructions to use. Instead of loading a cache line into the main caches, it is loaded into this smaller cache.
The comment supposes the following behavior (but I cannot find any references that the hardware actually does this, one would need to measure or a solid source and it could vary from hardware to hardware):
- Once the CPU sees that the store buffer is full and that it is aligned to a cache line, it will flush it directly to memory since the non-temporal write bypasses the main cache.
The only way this would work is if the merging of the store buffer with the actual cache line written happens once it is flushed. This is a fair assumption.
Note that if the cache line written is already in the main caches, the above method will also update them.
If regular memory writes were used instead of non-temporal writes, the store buffer flushing would also update the main caches. It is entirely possible that this scenario would also avoid reading the original cache line in memory.
If a partial cache line is written with a non-temporal write, presumably the cache line will need to be fetched from main memory (or the main cache if present) and could be terribly slow if we have not read the cache line ahead of time with a regular read or non-temporal read (which would place it into our separate cache).
Typically the non-temporal cache size is on the order of 4-8 cache lines.
To summarize, the last instruction kicks in the write because it also happens to fill up the store buffer. The store buffer flush can avoid reading the cache line written to because the hardware knows the store buffer is contiguous and aligned to a cache line. The non-temporal write hint only serves to avoid populating the main cache with our written cache line IF and only IF it wasn't already in the main caches.
I think this is partly a terminology question: The passage you quote from Ulrich Drepper's article isn't talking about cached data. It's just using the term "cache line" for an aligned 64B block.
This is normal, and especially useful when talking about a range of hardware with different cache-line sizes. (Earlier x86 CPUs, as recently as PIII, had 32B cache lines, so using this terminology avoids hard-coding that microarch design decision into the discussion.)
A cache-line of data is still a cache-line even if it's not currently hot in any caches.
I don't have references under my fingers to prove what I am saying, but my understanding is this: the only unit of transfer over the memory bus is cache lines, whether they go into the cache or to some special registers. So indeed, the code you pasted fills a cache line, but it is a special cache line that does not reside in cache. Once all bytes of this cache line have been modified, the cache line is send directly to memory, without passing through the cache.
I had an pre-interview task, which I have completed and the solution works, however I was marked down and did not get an interview due to having used a TADODataset. I basically imported a CSV file which populated the dataset, the data had to be processed in a specific way, so I used Filtering and Sorting of the dataset to make sure that the data was ordered in the way I wanted it and then I did the logic processing in a while loop. The feedback that was received said that this was bad as it would be very slow for large files.
My main question here is if using an in memory dataset is slow for processing large files, what would have been better way to access the information from the csv file. Should I have used String Lists or something like that?
It really depends on how "big" and the available resources(in this case RAM) for the task.
"The feedback that was received said that this was bad as it would be very slow for large files."
CSV files are usually used for moving data around(in most cases that I've encountered files are ~1MB+ up to ~10MB, but that's not to say that others would not dump more data in CSV format) without worrying too much(if at all) about import/export since it is extremely simplistic.
Suppose you have a 80MB CSV file, now that's a file you want to process in chunks, otherwise(depending on your processing) you can eat hundreds of MB of RAM, in this case what I would do is:
while dataToProcess do begin
// step1
read <X> lines from file, where <X> is the max number of lines
you read in one go, if there are less lines(i.e. you're down to 50 lines and X is 100)
to process, then you read those
// step2
process information
// step3
generate output, database inserts, etc.
end;
In the above case, you're not loading 80MB of data into RAM, but only a few hundred KB, and the rest you use for processing, i.e. linked lists, dynamic insert queries(batch insert), etc.
"...however I was marked down and did not get an interview due to having used a TADODataset."
I'm not surprised, they were probably looking to see if you're capable of creating algorithm(s) and provide simple solutions on the spot, but without using "ready-made" solutions.
They were probably thinking of seeing you use dynamic arrays and creating one(or more) sorting algorithm(s).
"Should I have used String Lists or something like that?"
The response might have been the same, again, I think they wanted to see how you "work".
The interviewer was quite right.
The correct, scalable and fastest solution on any medium file upwards is to use an 'external sort'.
An 'External Sort' is a 2 stage process, the first stage being to split each file into manageable and sorted smaller files. The second stage is to merge these files back into a single sorted file which can then be processed line by line.
It is extremely efficient on any CSV file with over say 200,000 lines. The amount of memory the process runs in can be controlled and thus dangers of running out of memory can be eliminated.
I have implemented many such sort processes and in Delphi would recommend a combination of TStringList, TList and TQueue classes.
Good Luck