I'm using Prometheus 2.9.2 for monitoring a large environment of nodes.
As part of testing the maximum scale of Prometheus in our environment, I simulated a large number of metrics in our test environment.
My management server has 16 GB of RAM and 100 GB of disk space.
During the scale testing, I've noticed that the Prometheus process consumes more and more memory until the process crashes.
I've noticed that the WAL directory is getting filled fast with a lot of data files while the memory usage of Prometheus rises.
The management server scrapes its nodes every 15 seconds and the storage parameters are all set to default.
I would like to know why this happens, and how/if it is possible to prevent the process from crashing.
Thank you!
The out-of-memory crash is usually the result of an excessively heavy query. The query may be defined in one of your rules (it may even be running on a Grafana page instead of in Prometheus itself).
If you have a very large number of metrics, it is possible the rule is querying all of them. A quick fix is to specify exactly which metrics to query, using specific label matchers instead of a regex.
This article explains why Prometheus may use large amounts of memory during data ingestion. If you need to reduce Prometheus's memory usage, the following actions can help:
Increasing scrape_interval in Prometheus configs.
Reducing the number of scrape targets and/or scraped metrics per target.
P.S. Also take a look at the project I work on, VictoriaMetrics. It can use less memory than Prometheus. See this benchmark for details.
Because the label combinations depend on your business data, the number of combinations (and therefore of blocks) is effectively unbounded, so there is no way to fully solve the memory problem with the current design of Prometheus. I do suggest compacting small blocks into big ones, though, as that will reduce the number of blocks.
There is huge memory consumption for two reasons:
The Prometheus TSDB has an in-memory block named the "head". Because the head stores all the series from the latest hours, it eats a lot of memory.
Each block on disk also eats memory, because every block on disk keeps an index reader in memory; dismayingly, all labels, postings, and symbols of a block are cached in the index reader struct, so the more blocks there are on disk, the more memory is occupied.
in index/index.go, you will see:
type Reader struct {
    b ByteSlice
    // Close that releases the underlying resources of the byte slice.
    c io.Closer
    // Cached hashmaps of section offsets.
    labels map[string]uint64
    // LabelName to LabelValue to offset map.
    postings map[string]map[string]uint64
    // Cache of read symbols. Strings that are returned when reading from the
    // block are always backed by true strings held in here rather than
    // strings that are backed by byte slices from the mmap'd index file. This
    // prevents memory faults when applications work with read symbols after
    // the block has been unmapped. The older format has sparse indexes so a map
    // must be used, but the new format is not so we can use a slice.
    symbolsV1        map[uint32]string
    symbolsV2        []string
    symbolsTableSize uint64
    dec              *Decoder
    version          int
}
We used Prometheus version 2.19 and had significantly better memory performance. This blog post highlights how that release tackles the memory problems. I strongly recommend using it to improve your instance's resource consumption.
I am writing an algorithm in which all blocks read the same address. For example, we have a list = [1, 2, 3, 4], and all blocks read it and store it in their own shared memory... My tests show that the more blocks read it, the slower it gets... I guess no broadcast happens here? Any idea how I can make it faster? Thank you!
I learned from a previous post that the read can be broadcast within one warp, but it seems this cannot happen across different warps... (Actually, in my case the threads within one warp are not reading the same location...)
Once a list element has been accessed by the first warp of an SM unit, the second warp in the same SM unit gets it from cache and broadcasts it to all SIMT lanes. But a warp on another SM unit may not have it in its L1 cache, so it fetches it from L2 into L1 first.
It is similar with __constant__ memory, but it requires the same address to be accessed by all threads. Its latency is closer to register access. __constant__ memory is like the instruction cache: you get more performance when all threads do the same thing.
For example, if you have a Gaussian filter that iterates over the same coefficient list of the filter on all threads, it is better to use constant memory. Using shared memory does not have much of an advantage here, since the filter array is not scanned randomly. Shared memory is better when the filter array content is different per block or if it needs random access.
You can also combine constant memory and shared memory: get half of the list from constant memory and the other half from shared memory. This should let 1024 threads hide the latency of one memory type behind the other.
If the list is small enough, you can use registers directly (the indices have to be known at compile time). But this increases register pressure and may decrease occupancy, so be careful about that.
Some old CUDA architectures (in the case of the FMA operation) required one operand to be fetched from constant memory and the other operand from a register to achieve better performance in compute-bottlenecked algorithms.
In a test with a 12000-float filter applied to all threads' inputs, the shared-memory version with 128 threads per block completed the work in 330 milliseconds, while the constant-memory version completed it in 260 milliseconds. L1 access performance was the real bottleneck in both versions, so the real constant-memory performance is even better, as long as the index is similar for all threads.
I have already unsubscribed from a table in DolphinDB, but when I executed the function getStreamEngineStat().AsofJoinEngine, there was still memory occupied by the engine.
Does the function return the real-time memory information on the stream engine? How can I check the current memory status and release the memory?
getStreamEngineStat().AsofJoinEngine does return the real-time memory information.
Undefining the subscription is not equivalent to dropping the engine. In your case, if you want to release the memory occupied by the engine, you first need to call dropStreamEngine and then set the variable returned when creating the engine to NULL.
Specific examples are as follows:
Suppose you define the engine as follows:
ajEngine=createAsofJoinEngine(name="aj1", leftTable=trades, rightTable=quotes, outputTable=prevailingQuotes, metrics=<[price, bid, ask, abs(price-(bid+ask)/2)]>, matchingColumn=`sym, timeColumn=`time, useSystemTime=false, delayedTime=1)
After undefining the subscription, you can correctly release memory by the following code:
dropStreamEngine("aj1") // 1.release the specified engine
ajEngine = NULL // 2. Release the engine handler from memory
If you find that the asof join engine uses too much memory during the subscription, you can also specify the parameter garbageSize to clean up historical data that is no longer needed. The value of garbageSize needs to be set case by case, roughly according to how many records each key receives per hour.
The principle is: garbageSize is applied per key, that is, the data within each key group is cleaned up independently. If garbageSize is too small, frequent cleaning of historical data brings unnecessary overhead; if it is too large, the threshold may never be reached and cleaning is never triggered, leaving unneeded historical data behind.
Therefore, it is recommended to aim for roughly one cleanup per hour and to set garbageSize based on that estimate.
Why is primary and cache memory divided into blocks?
Hi, I was just posed this question. I haven't been able to find a detailed explanation that covers both primary memory and cache memory; if you have one, it would be greatly appreciated :)
Thank you
Cache blocks exploit locality of reference, of which there are two kinds. Temporal locality: after you reference location x, you are likely to access location x again shortly. Spatial locality: after you reference location x, you are likely to access nearby locations (x+1, ...) shortly.
If you use a value at some distant data center, you are likely to reuse that value, so it is copied geographically closer (a ~150 ms round trip). If you use a value in disk block x, you are likely to reuse disk block x, so it is kept in memory (~20 ms). If you use a value on memory page x, you are likely to reuse memory page x, so the translation of its virtual address to its physical address is kept in the TLB cache. If you use a particular memory location x, you are likely to reuse it and its neighbors, so it is kept in cache.
Cache memory is very small (L1D on an M1 is 192 kB) and DRAM is very big (8 GB on an M1 Air). L1D cache is much faster than DRAM, maybe 5 cycles versus maybe 200 cycles. I wish this table were in cycles and included registers, but it gives a useful overview of latencies:
https://gist.github.com/jboner/2841832
The moral of this is to pack data into aligned structures that fit. If you access memory randomly instead, you will miss in the cache, the TLB, the virtual page cache, ... and everything will be excruciatingly slow.
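As a rough illustration (a minimal Go sketch with a made-up matrix size, not tied to any particular machine), traversing the same array row by row touches neighbouring elements that share cache blocks, while traversing it column by column jumps a whole row ahead on every access and loses the spatial locality:
package main

import (
    "fmt"
    "time"
)

const n = 4096 // hypothetical matrix dimension

func main() {
    grid := make([]int32, n*n) // one contiguous block, row-major layout

    // Row-major traversal: consecutive accesses stay within the same cache block.
    start := time.Now()
    var sum int64
    for i := 0; i < n; i++ {
        for j := 0; j < n; j++ {
            sum += int64(grid[i*n+j])
        }
    }
    fmt.Println("row-major:   ", time.Since(start), sum)

    // Column-major traversal: each access jumps n*4 bytes ahead, so almost
    // every access lands in a different cache block (and often a different page).
    start = time.Now()
    sum = 0
    for j := 0; j < n; j++ {
        for i := 0; i < n; i++ {
            sum += int64(grid[i*n+j])
        }
    }
    fmt.Println("column-major:", time.Since(start), sum)
}
On most machines the second loop is noticeably slower, even though both loops do exactly the same arithmetic.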
Most things in computer systems are divided into chunks of fixed sizes: bytes, words, cache blocks, pages.
One reason for this is that while hardware can do many things at once, it is hardware and thus can generally only do what it was designed for. Making bytes blocks of 8 bits, and words blocks of 4 bytes (32-bit systems) or 8 bytes (64-bit systems), is something we can design hardware to do, and mostly in parallel.
Not using fixed-sized chunks or blocks, on the other hand, can make things much more difficult for the hardware, so, data structures like strings — an example of data that is highly variable in length — are usually handled with software loops.
Usually these fixed sizes are powers of 2 (32, 64, etc.), because division and modulus, which are very useful operations, are easy to do in binary for powers of 2.
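For example, here is a tiny Go sketch (the 64-byte block size is just an assumption) showing that splitting an address into a block number and an offset within the block is only a shift and a mask when the block size is a power of 2:
package main

import "fmt"

const blockSize = 64 // assumed cache-block size, a power of 2

func main() {
    addr := uint64(0x12345)

    blockNumber := addr / blockSize // the compiler emits addr >> 6
    offset := addr % blockSize      // the compiler emits addr & 0x3f

    // The same split written explicitly with bit operations:
    fmt.Println(blockNumber == addr>>6, offset == addr&(blockSize-1)) // true true
    fmt.Println(blockNumber, offset)
}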
In summary, we must subdivide data into blocks because we cannot treat all the data as one lump (hardware-wise, at least), and treating all data as individual bits is also too cumbersome. So we chunk or group data into blocks as appropriate for the various levels of hardware to deal with in parallel.
I use recon_alloc:memory(allocated_types) and get info like the output below.
34> recon_alloc:memory(allocated_types).
[{binary_alloc,1546650440},
{driver_alloc,21504840},
{eheap_alloc,28704768840},
{ets_alloc,526938952},
{fix_alloc,145359688},
{ll_alloc,403701800},
{sl_alloc,688968},
{std_alloc,67633992},
{temp_alloc,21504840}]
The eheap_alloc is using 28G. But summing up the heap_size of all processes:
>lists:sum([begin {_, X}=process_info(P, heap_size), X end || P <- processes()]).
683197586
Only 683M! Any idea where the 28G is?
You are not comparing the right values. From erlang:process_info
{heap_size, Size}
Size is the size in words of youngest heap generation of the
process. This generation currently include the stack of the process.
This information is highly implementation dependent, and may change if
the implementation change.
recon_alloc:memory(allocated_types) is in bytes by default; you can change that with set_unit. It is not the memory that is currently used but the memory reserved by the VM, grouped into the different allocators. You can use recon_alloc:memory(used) instead. More details in allocator() - Recon Library.
Searching through the Erlang source code for the eheap_alloc keyword I didn't come up with much. The most relevant piece of code was this XML code from erts_alloc.xml (https://github.com/erlang/otp/blob/172e812c491680fbb175f56f7604d4098cdc9de4/erts/doc/src/erts_alloc.xml#L46):
<tag><c>eheap_alloc</c></tag>
<item>Allocator used for Erlang heap data, such as Erlang process heaps.</item>
This says that process heaps are stored in eheap_alloc, but it doesn't say what else is stored there. The eheap_alloc stores everything your application needs to run, plus some additional space, so the VM doesn't have to request more memory from the OS every time something needs to be added. There are also things the VM must keep in memory that aren't associated with a specific process. For example, large binaries, even though they may be used within a process, are not stored inside that process's heap; they are stored in a shared binary heap called binary_alloc. The binary heap, along with the process heaps and some extra memory, is what makes up eheap_alloc.
In your case it looks like you have a lot of memory in your binary_alloc. binary_alloc is probably using a significant portion of your eheap_alloc.
For more details on binary handling checkout these pages:
http://blog.bugsense.com/post/74179424069/erlang-binary-garbage-collection-a-love-hate
http://www.erlang.org/doc/efficiency_guide/binaryhandling.html#id65224
From time to time I come across concepts like zero garbage, efficient use of memory, and so on. As an example, in the Features section of the well-known package httprouter you can see the following:
Zero Garbage: The matching and dispatching process generates zero bytes of garbage. In fact, the only heap allocations that are made, is by building the slice of the key-value pairs for path parameters. If the request path contains no parameters, not a single heap allocation is necessary.
This package also shows very good benchmark results compared to the standard library's http.ServeMux:
BenchmarkHttpServeMux 5000 706222 ns/op 96 B/op 6 allocs/op
BenchmarkHttpRouter 100000 15010 ns/op 0 B/op 0 allocs/op
As far as I understand, the second one (from the table) makes no heap allocations and averages zero allocations per iteration.
The question: I want to gain a basic understanding of memory management: when the garbage collector allocates/deallocates memory, what the benchmark numbers mean (the last two columns of the table), and how people know when the heap is being allocated.
I'm absolutely new to memory management, so it's really difficult to understand what's going on "under the hood". The articles I've read:
https://golang.org/ref/mem
https://golang.org/doc/effective_go.html
http://gribblelab.org/CBootcamp/7_Memory_Stack_vs_Heap.html
http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)
The garbage collector doesn't allocate memory :-), it only deallocates it. Go's garbage collector is evolving; for the details, have a look at the design document https://docs.google.com/document/d/16Y4IsnNRCN43Mx0NZc5YXZLovrHvvLhK_h0KN8woTO4/preview?sle=true and follow the discussion on the golang mailing lists.
The last two columns in the benchmark output are dead simple: how many bytes were allocated in total and how many allocations happened during one iteration of the benchmark code. (These allocations are made by your code, not by the garbage collector.) Since any allocation is a potential creation of garbage, reducing these numbers can be a design goal.
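For instance, here is a minimal benchmark sketch (the benchmark names are made up for illustration). Calling b.ReportAllocs(), or running go test -bench=. -benchmem, makes the framework print exactly those two columns:
package demo

import (
    "fmt"
    "strconv"
    "testing"
)

// BenchmarkSprintf builds a brand-new string on every iteration, so it
// reports at least 1 allocs/op and a non-zero B/op.
func BenchmarkSprintf(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _ = fmt.Sprintf("%d", i)
    }
}

// BenchmarkAppendInt reuses one preallocated buffer, so a typical run
// reports 0 B/op and 0 allocs/op, just like the HttpRouter line above.
func BenchmarkAppendInt(b *testing.B) {
    b.ReportAllocs()
    buf := make([]byte, 0, 32)
    for i := 0; i < b.N; i++ {
        buf = strconv.AppendInt(buf[:0], int64(i), 10)
    }
}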
When are things allocated on the heap? Whenever the Go compiler decides to! The compiler tries to allocate on the stack, but sometimes it must use the heap, especially when a value escapes from the local, stack-based scope. This escape analysis is currently undergoing rework, so it is not easy to tell which values will be heap- or stack-allocated, especially as this changes from compiler version to compiler version.
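A short sketch of that decision (the functions are hypothetical, not from httprouter): running go build -gcflags=-m on code like this prints the compiler's escape decisions.
package escape

// sumLocal only uses its array inside the function, so the compiler can
// keep it on the stack; -m reports no escape for buf.
func sumLocal() int {
    var buf [64]int
    for i := range buf {
        buf[i] = i
    }
    total := 0
    for _, v := range buf {
        total += v
    }
    return total
}

// newBuf returns a pointer to its local array, so the value outlives the
// call and the compiler puts it on the heap (the -m output reports
// something like "moved to heap: buf").
func newBuf() *[64]int {
    var buf [64]int
    return &buf
}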
I wouldn't be too obsessed with avoiding allocations until your benchmarks show too much GC overhead.
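One simple way to get a feel for that overhead (a sketch, not a full profiling setup) is to read the runtime's own memory statistics; running the program with GODEBUG=gctrace=1 prints a line per collection as well.
package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    fmt.Printf("heap currently in use: %d bytes\n", m.HeapAlloc)
    fmt.Printf("allocated over lifetime: %d bytes\n", m.TotalAlloc)
    fmt.Printf("completed GC cycles: %d\n", m.NumGC)
    fmt.Printf("total GC pause time: %d ns\n", m.PauseTotalNs)
}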