Redis Memory Optimization suggestions - memory

I have a Redis master and 2 slaves. All 3 are currently on the same Unix server. The memory used by the 3 instances is approximately 3.5 GB, 3 GB and 3 GB. There are about 275000 keys in the Redis db. About 4000 are hashes. 1 set has 100000 values. 1 list has 275000 keys in it. It's a list of hashes and sets. The server has a total memory of 16 GB. Currently 9.5 GB is used. Persistence is currently off. The rdb file is written once a day by a forced background save. Please provide any suggestions for optimizations. The max-ziplist configuration is currently at its defaults.

Optimizing Hashes
First, let's look at the hashes. Two important questions: how many elements are in each hash, and what is the largest value in those hashes? A hash uses the memory-efficient ziplist representation if the following condition is met:
len(hash) < hash-max-ziplist-entries && length-of-largest-field(hash) < hash-max-ziplist-value
You should increase the two settings in redis.conf based on your data, but don't increase them to more than 3-4 times the default.
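To make the condition above concrete, here is a minimal Python sketch that checks whether a hash would stay inside both limits. It is only an illustration, not Redis code: fits_ziplist is a made-up helper name, 512 and 64 are the stock redis.conf defaults for the two settings, and the sketch assumes a limit is exceeded only when the size is strictly greater than the configured number.

```python
# Stock redis.conf defaults for hash-max-ziplist-entries / hash-max-ziplist-value.
DEFAULT_MAX_ENTRIES = 512
DEFAULT_MAX_VALUE = 64


def fits_ziplist(fields, max_entries=DEFAULT_MAX_ENTRIES, max_value=DEFAULT_MAX_VALUE):
    """Return True if a hash with these fields stays within both limits.

    `fields` maps field names to values, both as bytes.
    Assumption: a limit is exceeded only when the size is strictly greater
    than the configured number.
    """
    if len(fields) > max_entries:
        return False
    # Longest field name or value in the hash (0 for an empty hash).
    longest = max((max(len(k), len(v)) for k, v in fields.items()), default=0)
    return longest <= max_value


small = {b"name": b"alice", b"city": b"paris"}
large_value = {b"bio": b"x" * 200}

print(fits_ziplist(small))                       # True
print(fits_ziplist(large_value))                 # False with the default 64-byte limit
print(fits_ziplist(large_value, max_value=256))  # True after raising the limit
```

Running something like this over a sample of your real hashes gives you the two numbers asked about above (entry count and largest value) before you touch redis.conf.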
Optimizing Sets
A set with 100000 members cannot be optimized unless you provide additional details on your use case. Some general strategies, though:
Maybe use HyperLogLog - Are you using the set to count unique elements? If the only commands you run are sadd and scard - maybe you should switch to a hyperloglog.
Maybe use Bloom Filter - Are you using the set to check for existence of a member? If the only commands you run are sadd and sismember - maybe you should implement a bloom filter and use it instead of the set.
How big is each element? - Set members should be small. If you are storing big objects, you are perhaps doing something incorrect.
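As a rough illustration of the Bloom filter idea, here is a minimal pure-Python sketch. It is not a Redis feature and not production code: the BloomFilter class, the SHA-256 double-hashing scheme and the sizing (about 10 bits per member with 7 hash positions, which gives a false-positive rate of roughly 1%) are all choices made for this example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two halves of one SHA-256 digest
        # (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Sized for the 100000-member set from the question: 10 bits per member.
bf = BloomFilter(num_bits=1_000_000, num_hashes=7)
for i in range(100_000):
    bf.add(f"member:{i}")

print("member:42" in bf)  # True: added members are always reported as present
print(len(bf.bits))       # 125000 bytes for the whole filter
```

The trade-off is that you can no longer list or remove members, and a small fraction of lookups for absent members will wrongly answer "present", so this only fits the sadd/sismember-only case described above.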
Optimizing Lists
A single list with 275000 elements seems wrong. It is going to be slow to access elements in the center of the list. Are you sure a list is the right data structure for your use case?
Change list-compress-depth to 1 or higher. Read about this setting in redis.conf - there are tradeoffs. But for a list of 275000 elements, you certainly want to enable compression.
Tools
Use the open source redis-rdb-tools to analyze your data set (disclaimer: I am the author of this tool). It will tell you how much memory each key is taking, which will help you decide where to concentrate your efforts.
You can also refer to this memory optimization cheat sheet.
What else?
You have provided very few details on your use case. The best savings come from picking the right data structure for your use case. I'd encourage you to update your question with more details on what you are storing within the hash / list / set.

We applied the following configuration, which reduced the memory footprint by 40%:
list-max-ziplist-entries 2048
list-max-ziplist-value 10000
list-compress-depth 1
set-max-intset-entries 2048
hash-max-ziplist-entries 2048
hash-max-ziplist-value 10000
Also, we increased the RAM on the linux server and that helped us with the Redis memory issues.

max-series-per-database limit exceeded clarification needed / how to calculate number of series in use

We recently started to encounter this error:
{"error":"partial write: max-series-per-database limit exceeded: (1000000) dropped=1"}
When writing metric data like this:
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1103,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
I know that Influx recommends you keep your series cardinality low, and our impression was that series cardinality would mean keeping each tag individually to a small number of values. e.g. we felt comfortable sending instance_id=1103 as a tag, because we know that there will never be more than 2000 distinct instance_id tag values.
But after running into this error... I'm afraid maybe I was mistaken here. Do we actually need to keep the cardinality of all possible combinations of all tags low? e.g. do these two things count as two separate series towards the 1,000,000 default max, because the instance_id is different?
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1111,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=2222,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
If those count as two separate series... then is there a better way to structure this data in Influx? 1,000,000 total seems like a tiny amount if each separate combination of tags is a separate series...
Does InfluxDB 2.x help with this?
Is there a better tool that can handle a large number of tags and not bump into limits like this?
There is no way to figure out what data was not recorded. Update the max-series-per-database configuration to be more than 1M in order to stop dropping data.
This error can be an indication that you are creating a lot of series. I saw some documentation on why that isn't great.
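On the question of what counts as a series: as far as I understand InfluxDB, a series is identified by the measurement together with the full tag set, so two points that differ only in instance_id do belong to two separate series. Here is a small Python sketch that counts series that way from line-protocol points. The parsing is deliberately simplified (it assumes no escaped spaces or commas), and series_key / count_series are names made up for this example; the sample points are shortened versions of the ones in the question.

```python
def series_key(line):
    """Return (measurement, sorted tag pairs) for one line-protocol point.

    Simplification: assumes no escaped spaces or commas in names or values.
    """
    head = line.split(" ", 1)[0]           # "measurement,tag1=a,tag2=b"
    measurement, *tags = head.split(",")
    return measurement, tuple(sorted(tags))


def count_series(lines):
    """Count distinct measurement + tag set combinations."""
    return len({series_key(line) for line in lines})


points = [
    "resque_job,environment=beta,instance_id=1111 seconds_spent_job=0.2 1649203450783000002",
    "resque_job,environment=beta,instance_id=2222 seconds_spent_job=0.2 1649203450783000002",
    "resque_job,environment=beta,instance_id=2222 seconds_spent_job=0.9 1649203450783000003",
]
print(count_series(points))  # 2: the third point reuses the second point's tag set
```

The worst case is the product of the per-tag cardinalities: with 2000 distinct instance_id values, only 500 distinct combinations of the remaining tags are enough to reach 1,000,000 series.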
Hope this helps!

How to space out influxdb continuous query execution?

I have many InfluxDB continuous queries (CQs) that downsample data over various time periods. At one point, the load became high and InfluxDB ran out of memory while executing the continuous queries.
Say I have 10 CQs and all 10 execute in InfluxDB at the same time. That impacts memory heavily. I am not sure whether there is any way to space them out evenly or add some delay so that the CQs execute one by one. My speculation is that executing all the CQs at the same time makes InfluxDB crash. All the CQs are specified in the InfluxDB config. I hope there is a way to include a time delay between the CQs in the config, but I don't know exactly how to do that. One sample CQ:
CREATE CONTINUOUS QUERY "cq_volume_reads" ON "metrics"
BEGIN
SELECT sum(reads) as reads INTO rollup1.tire_volume FROM
"metrics".raw.tier_volume GROUP BY time(10m),*
END
And also I don't know whether this is the best way to resolve the problem. Any thoughts on this approach or suggesting any better approach will be much appreciated. It would be great to get suggestions in using debugging tools for influxdb as well. Thanks!
@Rajan - A few comments:
The canonical documentation for CQs is here. Much of what I'm suggesting is from there.
Are you using back-referencing? I see your example CQ uses GROUP BY time(10m),* - the * wildcard is usually used with backreferences. Otherwise, I don't believe you need to include the * to indicate grouping by all tags - it should already be grouped by all tags.
If you are using backreferences, that runs the CQ for each measurement in the metrics database. This is potentially very many CQ executions at the same time, especially if you have many CQ defined this way.
You can set offsets with GROUP BY time(10m, <offset>) but this also impacts the time interval used for your aggregation function (sum in your example) so if your offset is 1 minute then timestamps will be a sum of data between e.g. 13:11->13:21 instead of 13:10 -> 13:20. This will offset execution but may not work for your downsampling use case. From a signal processing standpoint, a 1 minute offset wouldn't change the validity of the downsampled data, but it might produce unwanted graphical display problems depending on what you are doing. I do suggest trying this option.
Otherwise, you can try to reduce the number of downsampling CQs to reduce memory pressure or downsample on a larger timescale (e.g. 20m) or lastly, increase the hardware resources available to InfluxDB.
For managing memory usage, look at this post. There are not many adjustments in 1.8 but there are some.
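To see what the offset in GROUP BY time(10m, <offset>) does to the aggregation windows, here is a small Python sketch of the bucket arithmetic. It is a model of the behaviour described above, not InfluxDB code: it assumes buckets are aligned to multiples of the interval and then shifted by the offset, and bucket_bounds / hhmm are helper names invented for this example.

```python
def bucket_bounds(ts, interval, offset=0):
    """Return (start, end) of the time bucket containing ts (all in seconds)."""
    start = (ts - offset) // interval * interval + offset
    return start, start + interval


def hhmm(seconds):
    """Format seconds since midnight as HH:MM."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}"


ts = 13 * 3600 + 15 * 60  # a point written at 13:15
for offset in (0, 60):    # no offset, then a 1 minute offset
    start, end = bucket_bounds(ts, 600, offset)
    print(f"offset={offset}s -> {hhmm(start)} to {hhmm(end)}")
# offset=0s -> 13:10 to 13:20
# offset=60s -> 13:11 to 13:21
```

Giving each CQ a different offset would stagger when the queries run, at the cost of each rollup covering slightly shifted windows as shown.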

Redis optimal hash set entry size

I have some questions regarding the optimal entry size setting for Redis hash sets.
In this example memory-optimization they use 100 hash entries per key but use hash-max-zipmap-entries 256? Why not hash-max-zipmap-entries 100 or 128?
On the Redis website (above link) they used a max hash entry size of 100, but in this post instagram, they mention 1000 entries. So does this mean the optimal setting is a function of the product of hash-max-zipmap-entries & hash-max-zipmap-value? (i.e. in this case Instagram has smaller hash values than the memory optimization example?)
Your comments/clarifications are much appreciated.
The key is, from here:
manipulating the compact versions of these [ziplist] structures can become slow as they grow longer
and
[as ziplists grow longer] fetching/updating individual fields of a HASH, Redis will have to decode many individual entries, and CPU caches won’t be as effective
So to your questions
This page just shows an example and I doubt the author gave much thought to the exact values. In real life, IF you wanted to take advantage of ziplists, and you knew your number of entries per hash was <100, then setting it at 100, 128 or 256 would make no difference. hash-max-zipmap-entries is only the LIMIT over which you're telling Redis to change the encoding from ziplist to a real hash table.
There may be some truth in your "product of hash-max-zipmap-entries & hash-max-zipmap-value" idea, but I'm speculating. More importantly, first you have to define "optimal" based on what you want to do. If you want to do lots of HSET/HGETs in a large ziplist, it will be slower than if you used a hash table. But if you never get/update single fields and only ever do HMSET/HGETALL on a key, large ziplists wouldn't slow you down. The Instagram 1000 was THEIR optimal number based on THEIR specific data, use cases, and Redis function call frequencies.
You encouraged me to read both links and it seems that you are asking for "default value for hash table size".
I don't think that it's possible to say that one number is universal for all possibilities. The described mechanism is similar to standard hash mapping. Look at http://en.wikipedia.org/wiki/Hash_table
If you have a small hash table, many different hash values map to the same bucket, and an equality comparison is then needed to find the item within that bucket.
On the other hand, a large hash table allocates a lot of memory and leaves many slots empty. But this scales well, because lookups are O(1) and there is no equality search for the item.
In general, the size of the table IMHO depends on the overall count of all elements you expect to put into the table, and it also depends on the diversity of the keys. I mean, if every hash starts with "0001", not even size=100000 would help you.

Dimensioning a redis set

We want to use redis for one of our data stores. We have a hard time "guessing" what the size of that redis store will be and we're hoping someone can come up with the right help.
This store will exclusively be built using Sorted Sets. Each set will have a key that will be an integer between 1 and 10^10. We currently have about 8M keys, but we expect to reach 30M 'quickly'.
Each set will have a variable number of elements, but the average is 17 elements, with a max of 135 and a min of 0. (Let me know if we need to provide other numbers, like st. dev.).
The elements in the sorted set will be strings. Now we want them to be the shortest string possible (5 or 6 chars?), but still avoid collisions. The scores will be timestamps.
We currently have about 500 writes/sec, but expect to grow that 10 times, and we currently have 3000 reads/sec and expect to grow that also 10 times.
We will also use the "dump" strategy rather than AOF.
Our goal is to use a single (yet big) Redis master store (and maybe some slave stores). How much RAM should we allocate to our Redis instance?
If you use Redis 2.6, you can benefit from the ziplist memory optimization applied to zset, because most of your zsets have a small number of items.
To calculate the memory you need, you can simply fill an instance with a small number of keys corresponding to your requirements and extrapolate. For this use case, memory consumption will grow linearly with the number of keys.
I have just tried it on my system, I get 30 MB per 100000 keys (following your specifications), which results in 9 GB of memory required for 30M keys. You need to take some margin, and include some space for COW memory spent at save time.
A 12 GB server would probably work if you are careful.
A 16 GB server will be just fine.
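The extrapolation is plain linear scaling, which a few lines of Python make explicit. This is only a sketch of the arithmetic above: estimate_memory_gb is a made-up helper, and the headroom parameter is a hypothetical knob for the margin and COW space, not a number taken from any measurement.

```python
def estimate_memory_gb(sample_mb, sample_keys, target_keys, headroom=0.0):
    """Scale a measured sample linearly to the target key count.

    `headroom` is a hypothetical extra fraction (e.g. 0.5 for +50%) reserved
    for margin and copy-on-write memory during saves.
    """
    base_mb = sample_mb * target_keys / sample_keys
    return base_mb * (1 + headroom) / 1000


# 30 MB measured for 100000 keys, extrapolated to 30M keys.
print(estimate_memory_gb(30, 100_000, 30_000_000))  # 9.0
```

Measure the sample on your own data with your real member lengths, since the 30 MB figure depends on them.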

"Smart" / Economical Data Storage Techniques?

I would like to store millions of data lines that looks like this:
key, value
key is an integer in the range of (0 to 5,000,000); all values are unique;
value is an unsigned int16 value (0 to 65535)
the key is to store the data while taking the LEAST AMOUNT OF DISK SPACE, and yet, be able to query the data. can you think of any algorithms / smart schemes for data storage that would be helpful?
just in case it matters, I use Linux.
One option, if the key values are not important data but rather just index data, would be to use a flat file of bits (with a descriptive header). Every 16 bits is a value, and the nth value would then be (n - 1) * 16 bits from the end of the header.
Additionally, if the key value does matter, a fixed-size flat file of about 10MB would allow for the entire key space to be stored without storing actual keys. The 16 bits that are at the (n - 1) * 16 offset would be that key's value.
That would probably be the least space intensive method for storage, as it would be only the data that is literally required. (Though, if you are only interested in, say, 100k values and one has a key of 5 million, you do end up with a lot of wasted space, which wouldn't be there with an actual key/value addressing system. So this methodology only achieves minimum disk storage for sets of tightly grouped values or for very many values, over about the 2 million mark.)
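A minimal Python sketch of the flat-file layout described above, under a few assumptions made for this example: there is no header (HEADER_SIZE is a hypothetical constant set to 0), values are stored little-endian, and because the keys in the question start at 0 the offset is key * 2 bytes rather than (n - 1) * 16 bits. write_value and read_value are names invented here.

```python
import struct
import tempfile

RECORD = struct.Struct("<H")  # one little-endian unsigned 16-bit value
HEADER_SIZE = 0               # hypothetical: this sketch has no header


def write_value(f, key, value):
    """Store value in the slot that belongs to key."""
    f.seek(HEADER_SIZE + key * RECORD.size)
    f.write(RECORD.pack(value))


def read_value(f, key):
    """Read the slot for key; slots never written read back as 0."""
    f.seek(HEADER_SIZE + key * RECORD.size)
    data = f.read(RECORD.size)
    return RECORD.unpack(data)[0] if len(data) == RECORD.size else 0


with tempfile.TemporaryFile() as f:
    write_value(f, 5_000_000, 65535)
    write_value(f, 42, 1234)
    print(read_value(f, 42))         # 1234
    print(read_value(f, 5_000_000))  # 65535
    f.seek(0, 2)                     # jump to the end to get the file size
    print(f.tell())                  # 10000002 bytes, about 10 MB
```

One caveat of this layout: a missing key and a stored value of 0 look the same, so if both can occur you would need a separate presence bitmap or a reserved sentinel value.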
How do you plan to use the stored data, with random or sequential access? For sequential access you can use any archiving algorithm, e.g. LZMA. Random access doesn't leave you a lot of room for improvement.
Can you see any patterns in this data? E.g. if the differences between adjacent keys/values are often small, you can store only the packed differences. There are a million other possible approaches.
[EDIT] also you can check techniques used for data compression in network communication
[EDIT1] and you can check this Google Code Integer Array Compression project
This depends on the operations and the data. I would first recommend "just using a database" (a simple key-value store such as BDB/EhCache [read: Key Value store], for instance :-)
Mimisbrunnr also has a good answer if all the keys are used.
If the keys are near constant/read-only and only a relatively small percent of the keys are used, consider the use of a (disk-based) Heap data-structure (very similar to an Array-based Heap; Heaps need not be Array-based). Robert Sedgewick had a good book from the late 80's that had a very lean implementation, but I forget the name. A Heap will be more beneficial when compared to a flat index with a smaller proportion of used keys and at full-load will have worse storage requirements.
(If abstracted, the used method could be switched and/or a hybrid heap with indexed/sequenced leaf-node values could be used [along with Huffman encoding or whatnot], but that is just adding far more complications. Keep it simple ... hence first suggestion of an existing key/value store ;-)
Happy coding.
Have you considered using a database designed for mobile devices such as SQL Server Compact, or another similar database? These will have a small footprint on the disk, while still providing the full search power you need.
Another example of a compact database engine is KeyDB for linux:
http://3d2f.com/programs/11-989-keydb-download.shtml
