How to check and release the memory occupied by DolphinDB stream engine? - memory

I have already unsubscribed a table in DolphinDB, but when I executed the function getStreamEngineStat().AsofJoinEngine, there was still memory occupied by the engine.
Does the function return the real-time memory information on the stream engine? How can I check the current memory status and release the memory?

getStreamEngineStat().AsofJoinEngine does return the real-time memory information.
Undefining the subscription is not equivalent to dropping the engine. In your case, if you want to release the memory occupied by the engine, you need first use dropStreamEngine and set the variable returned by createEngine to be NULL.
Specific examples are as follows:
Suppose you define the engine as follows:
ajEngine=createAsofJoinEngine(name="aj1", leftTable=trades, rightTable=quotes, outputTable=prevailingQuotes, metrics=<[price, bid, ask, abs(price-(bid+ask)/2)]>, matchingColumn=`sym, timeColumn=`time, useSystemTime=false, delayedTime=1)
After undefining the subscription, you can correctly release memory by the following code:
dropStreamEngine("aj1") // 1.release the specified engine
ajEngine = NULL // 2. Release the engine handler from memory
If you find that the memory of the asofjoin engine is too large in the subscription, you can also specify the parameter garbageSize to clean up historical data that is no longer needed. The size of garbageSize needs to be set case by case, roughly according to how many records each key will have per hour.
The principle is: First, garbageSize is set for the key, that is, the data within each key group will be cleaned up. If garbageSize is too small, frequent cleaning of historical data will bring unnecessary overhead; if garbageSize is too large, it may not reach the threshold of garbageSize and cannot trigger cleaning, resulting in a leftover of unwanted historical data.
Therefore, it is recommended to clean up once per hour and set the garbageSize with a rough estimate.

Related

Time required to access the memory locations in the same cache line

Consider the big box in the following figure as a cache and the block as a single cache line inside the cache.
The CPU fetched the data (first 4 elements of the array A) from RAM into the cache block.
Now, my question is, does it takes exactly same time to perform read/write operations on all the 4 memory locations (A[0], A[1], A[2] and A[3]) in the cache block or is it approximately same?
PS: I am expecting an answer for ideal case where runtime to perform any read/write operation on any memory location is not affected by the operating system jitter on user processes or applications.
With the line already hot in cache, time is constant for access to any aligned word in the cache. The hardware that handles the offset-within-line part of an address doesn't have to iterate through to the right position or anything, it just MUXes those bytes to the output.
If the line was not already hot in cache, then it depends on the design of the cache. If the CPU doesn't transfer around whole lines at once over a wide bus, then one / some words of the line will arrive before others. A cache that supports early-restart can let the load complete as soon as the needed word arrives.
A critical-word-first bus and memory allow that word to be the first one transferred for a demand-miss. Otherwise they arrive in some fixed order, and a cache miss on the last word of the line could take an extra few cycles.
Related:
Does cacheline size affect memory access latency?
if cache miss happens, the data will be moved to register directly or first moved to cache then to register?
which is optimal a bigger block cache size or a smaller one?

Why does Prometheus consume so much memory?

I'm using Prometheus 2.9.2 for monitoring a large environment of nodes.
As part of testing the maximum scale of Prometheus in our environment, I simulated a large amount of metrics on our test environment.
My management server has 16GB ram and 100GB disk space.
During the scale testing, I've noticed that the Prometheus process consumes more and more memory until the process crashes.
I've noticed that the WAL directory is getting filled fast with a lot of data files while the memory usage of Prometheus rises.
The management server scrapes its nodes every 15 seconds and the storage parameters are all set to default.
I would like to know why this happens, and how/if it is possible to prevent the process from crashing.
Thank you!
The out of memory crash is usually a result of a excessively heavy query. This may be set in one of your rules. (this rule may even be running on a grafana page instead of prometheus itself)
If you have a very large number of metrics it is possible the rule is querying all of them. A quick fix is by exactly specifying which metrics to query on with specific labels instead of regex one.
This article explains why Prometheus may use big amounts of memory during data ingestion. If you need reducing memory usage for Prometheus, then the following actions can help:
Increasing scrape_interval in Prometheus configs.
Reducing the number of scrape targets and/or scraped metrics per target.
P.S. Take a look also at the project I work on - VictoriaMetrics. It can use lower amounts of memory compared to Prometheus. See this benchmark for details.
Because the combination of labels lies on your business, the combination and the blocks may be unlimited, there's no way to solve the memory problem for the current design of prometheus!!!! But i suggest you compact small blocks into big ones, that will reduce the quantity of blocks.
Huge memory consumption for TWO reasons:
prometheus tsdb has a memory block which is named: "head", because head stores all the series in latest hours, it will eat a lot of memory.
each block on disk also eats memory, because each block on disk has a index reader in memory, dismayingly, all labels, postings and symbols of a block are cached in index reader struct, the more blocks on disk, the more memory will be cupied.
in index/index.go, you will see:
type Reader struct {
b ByteSlice
// Close that releases the underlying resources of the byte slice.
c io.Closer
// Cached hashmaps of section offsets.
labels map[string]uint64
// LabelName to LabelValue to offset map.
postings map[string]map[string]uint64
// Cache of read symbols. Strings that are returned when reading from the
// block are always backed by true strings held in here rather than
// strings that are backed by byte slices from the mmap'd index file. This
// prevents memory faults when applications work with read symbols after
// the block has been unmapped. The older format has sparse indexes so a map
// must be used, but the new format is not so we can use a slice.
symbolsV1 map[uint32]string
symbolsV2 []string
symbolsTableSize uint64
dec *Decoder
version int
}
We used the prometheus version 2.19 and we had a significantly better memory performance. This Blog highlights how this release tackles memory problems. i will strongly recommend using it to improve your instance resource consumption.

OpenCL memory consistency

I have a question concerning the OpenCL memory consistency model. Consider the following kernel:
__kernel foo() {
__local lmem[1];
lmem[0] = 1;
lmem[0] += 2;
}
In this case, is any synchronization or memory fence necessary to ensure that lmem[0] == 3?
According to section 3.3.1 of the OpenCL specification,
within a work-item memory has load / store consistency.
To me, this says that the assignment will always be executed before the increment.
However, section 6.12.9 defines the mem_fence function as follows:
Orders loads and stores of a work-item executing a kernel. This means that loads and stores preceding the mem_fence will be committed to memory before any loads and stores following the mem_fence.
Doesn't this contradict section 3.3.1? Or maybe my understanding of load / store consistency is wrong? I would appreciate your help.
As long as only one work-item performs read/write access to a local memory cell, that work-item has a consistent view of it. Committing to memory using a barrier is only necessary to propagate writes to other work-items in the work-group. For example, an OpenCL implementation would be permitted to keep any changes to local memory in private registers until a barrier is encountered. Within the work-item, everything would appear fine, but other work-items would never see these changes. This is how the phrase "committed to memory" should be interpreted in 6.12.9.
Essentially, the interaction between local memory and barriers boils down to this:
Between barriers:
Only one work-item is allowed read/write access to a local memory cell.OR
Any number of work-items in a work-group is allowed read-only access to a local memory cell.
In other words, no work-item may read or write to a local memory cell which is written to by another work-item after the last barrier.

What truly means by Persistent and Transient Column in Allocation Instrumentation Template in Xcode

I am trying to understand , what is meaning of transient and persistent Column in Allocation Template . From the tutorial http://www.raywenderlich.com/97886/instruments-tutorial-with-swift-getting-started I have found
"The Persistent column keeps a count of the number of objects of each type that currently exist in memory. The Transient column shows the number of objects that have existed but have since been deallocated. Persistent objects are using up memory, transient objects have had their memory released.
"
According to the explanation above , From the selected row in Statistics table from the picture , it can be said , 2 objects of NSFileManager currently exist in memory and 19 no. of objects are created and already have been released.
But what it means for optimization or performance issues for iOS App ?
Something like , here the total no of transient object in 19 which is considerably a large number , it should be small if possible for increasing app's effective memory usability or Something else ?
Optimization for performance in short means keeping your app alive and responsive.
The key metric for optimization is not transient or persistent count for one object.
Based on the information your NSFileManager is using 16 Bytes for each object.
So it's 32 currently persistent (2 * 16) and 336 (21 * 16) Total.
A high persistent memory indicates that your current footprint is very high for the given object. A high total memory indicates that your footprint in past might have been high (if subset of those allocation were simultaneous)
While optimizing you should focus on mainly two aspects:
1. How much is the minimum memory foot print when your app loads.
2. How much is the maximum memory foot print. (You need to come up with use cases to figure out this one).
As your memory footprint increases your app slows down in performance because of multiple page swaps done by OS to free up memory. You can track this by VM tracker instrument. Optimization means keeping your average memory footprint lower that point.
Persistent objects are using up memory, transient objects have had their memory released.
The first says # Persistent. This is the number of persistent objects that are being strongly referenced in your project at this moment in time. The second says # Transient. This is the number of deallocated objects that used to be strongly retained but now no longer exist. This is handy because it lets you know if an object is being cleaned up properly or if an object in no longer retained in a particular moment in time. The third says # Total. This is the total count of persistent and transient objects added together.

boost lockfree spsc_queue cache memory access

I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where the pointer of a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to access the element, does it not need to access the pointer itself by finding and loading the element either from either the L3 cache or across the interconnect (if the other thread is on a different cpu socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ principally a pay-for-what-you-need eco-system.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
Enter Ringbuffers
Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, one would just have to maintain the index of the buffer front and end positions atomically. The container could, at leisure update in internal 'dirty front index' while it copies ahead of the external front. (The copy can use relaxed memory ordering). Only as soon as the whole copy is known to have completed, the external front index is updated. This update needs to be in acq_rel/cst memory order[1].
As long as the container is able to guard the invariant that the front never fully wraps around and reaches back, this is a sweet deal. I think this idea was popularized in the Disruptor Library (of LMAX fame). You get mechanical resonance from
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple) physical cache lines
all the data is local unless the POD contains raw references outside that record
How Does Boost's spsc_queue Actually Do This?
Yes, spqc_queue stores the raw element values in a contiguous aligned block of memory: (e.g. from compile_time_sized_ringbuffer which underlies spsc_queue with statically supplied maximum capacity:)
typedef typename boost::aligned_storage<max_size * sizeof(T),
boost::alignment_of<T>::value
>::type storage_type;
storage_type storage_;
T * data()
{
return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there are only the "internal" index on either read or write side. This is possible because there's only one writing thread and also only one reading thread, which means that there could only be more space at the end of write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::unitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the unitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)
All in all, we see a best-in-class possible idea for a ringbuffer
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing frequently used data right there in the value, and referencing the less-of-used data using a pointer (the cost of the pointer will be low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can achieve memory access patterns there too.
Let your profiler guide you.
[1] I suppose say for spsc acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend anyone else to follow my example :)

Resources