max-series-per-database limit exceeded clarification needed / how to calculate number of series in use - influxdb

We recently started to encounter this error:
{"error":"partial write: max-series-per-database limit exceeded: (1000000) dropped=1"}
When writing metric data like this:
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1103,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
I know that Influx recommends you keep your series cardinality low, and our impression was that series cardinality would mean keeping each tag individually to a small number of values. e.g. we felt comfortable sending instance_id=1103 as a tag, because we know that there will never be more than 2000 distinct instance_id tag values.
But after running into this error... I'm afraid maybe I was mistaken here. Do we actually need to keep the cardinality of all possible combinations of all tags low? e.g. do these two things count as two separate series towards the 1,000,000 default max, because the instance_id is different?
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1111,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=2222,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
If those count as two separate series... then is there a better way to structure this data in Influx? 1,000,000 total seems like a tiny amount if each separate combination of tags is a separate series...
Does InfluxDB 2.x help with this?
Is there a better tool that can handle a large number of tags and not bump into limits like this?

There is no way to figure out what data was not recorded. Update the max-series-per-database configuration to be more than 1M in order to stop dropping data.
This can be an indication that you are creating a lot of series. i saw some documentation on why that isn't great.
Hope this helps!

Related

How to space out influxdb continuous query execution?

I have many influxdb continuous queries(CQ) used to downsample data over a period of time on several occasions. At one point, the load became high and influxdb went to out of memory at the time of executing continuous queries.
Say I have 10 CQ and all the 10 CQ execute in influxdb at a time. That impacts the memory heavily. I am not sure whether there is any way to evenly space out or have some delay in executing each CQ one by one. My speculation is executing all the CQ at the same time makes a influxdb crash. All the CQ are specified in influxdb config. I hope there may be a way to include time delay between the CQ in the influx config. I didn't know exactly how to include the time delay in the config. One sample CQ:
CREATE CONTINUOUS QUERY "cq_volume_reads" ON "metrics"
BEGIN
SELECT sum(reads) as reads INTO rollup1.tire_volume FROM
"metrics".raw.tier_volume GROUP BY time(10m),*
END
And also I don't know whether this is the best way to resolve the problem. Any thoughts on this approach or suggesting any better approach will be much appreciated. It would be great to get suggestions in using debugging tools for influxdb as well. Thanks!
#Rajan - A few comments:
The canonical documentation for CQs is here. Much of what I'm suggesting is from there.
Are you using back-referencing? I see your example CQ uses GROUP BY time(10m),* - the * wildcard is usually used with backreferences. Otherwise, I don't believe you need to include the * to indicate grouping by all tags - it should already be grouped by all tags.
If you are using backreferences, that runs the CQ for each measurement in the metrics database. This is potentially very many CQ executions at the same time, especially if you have many CQ defined this way.
You can set offsets with GROUP BY time(10m, <offset>) but this also impacts the time interval used for your aggregation function (sum in your example) so if your offset is 1 minute then timestamps will be a sum of data between e.g. 13:11->13:21 instead of 13:10 -> 13:20. This will offset execution but may not work for your downsampling use case. From a signal processing standpoint, a 1 minute offset wouldn't change the validity of the downsampled data, but it might produce unwanted graphical display problems depending on what you are doing. I do suggest trying this option.
Otherwise, you can try to reduce the number of downsampling CQs to reduce memory pressure or downsample on a larger timescale (e.g. 20m) or lastly, increase the hardware resources available to InfluxDB.
For managing memory usage, look at this post. There are not many adjustments in 1.8 but there are some.

InfluxDB WHERE clause on a 'High Cardinality' field (or a tag)

I'm playing with InfluxDB and trying to experiment it for a vehicle speed tracking usecase.
Every vehicle's speed at a given time is stored as a data point.
I'm modelling "vehicle_registration" as a tag and other values as fields. I'd want the where clause to be applied on the "vehicle_registration" and it got to be quick. Therefore I'm taking advantage of the indexing capabilities on a tag by default.
But the biggest stumbling block for me is that the tags need to have a lower cardinality.
What are the recommendations here? I want a high cardinal field to be applied in a "where" clause and the queries should be quick.
Any advice?
High cardinality means higher memory requirement. So it really depends what high cardinality means in your use case. 1k will be probably fine for 8GB memory, but 1M will be probably problem for 8GB. The best option is to try it. Simulate it and you will see real memory requirements. Then you will be able to configure proper sizing for InfluxDB based on that (and your budget of course).
Or you can try TSI https://docs.influxdata.com/influxdb/v1.8/concepts/tsi-details/

How to ensure insert rate 1 insert per second when using ClickhouseIO

I'm using Apache Beam Java SDK to process events and write them to the Clickhouse Database.
Luckily there is ready to use ClickhouseIO.
ClickhouseIO accumulates elements and inserts them in batch, but because of the parallel nature of the pipeline it still results in a lot of inserts per second in my case. I'm frequently receiving "DB::Exception: Too many parts" or "DB::Exception: Too much simultaneous queries" in Clickhouse.
Clickhouse documentation recommends doing 1 insert per second.
Is there a way I can ensure this with ClickhouseIO?
Maybe some KV grouping before ClickhouseIO.Write or something?
It looks like you interpret these errors not quite correct:
DB::Exception: Too many parts
It means that insert affect more partitions than allowed (by default this value is 100, it is managed by parameter max_partitions_per_insert_block).
So either the count of affected partition is really large or the PARTITION BY-key was defined pretty granular.
How to fix it:
try to group the INSERT-batch such way it contains data related to less than 100 partitions
try to reduce the size of insert-block (if it quite huge) - withMaxInsertBlockSize
increase the limit max_partitions_per_insert_block in SQL-query (like this, INSERT .. SETTINGS max_partitions_per_insert_block=300 (I think ClickhouseIO should have the ability to set custom options on query level)) or on server-side by modifying userprofile-settings
DB::Exception: Too much simultaneous queries
This one managed by param max_concurrent_queries.
How to fix it:
reduce the count of concurrent queries by Beam means
increase the limit on the server-side in userprofile- or server-settings (see https://github.com/ClickHouse/ClickHouse/issues/7765)

Influx index and high cardinality

I have a high throughput system. I found out that since many events has the same timestamp, influx had overwritten many events.
Therefore I tried moving from milliseconds to nanoseconds, but since I am using JAVA, I couldn't get the real clock based nanoseconds.
I came up with this solution:
I created a new tag called "descriptor" which for each event I insert a random number between 1-1000. These values are fixed and the probability for the same timestamp with the same random descriptor value is very low. This fixes my problem and I can see all the events.
My question is wether it is OK to use these 1000 values - since this is a tag and I understand it can mess up my index and my performance?
Regards, Ido
As the random "descriptors" are completely uncorrelated to other event tags, in the worst case this could increase your series cardinality by 3 orders of magnitude. This is because each existing series (s) will potentially split into up to 1000 unique series (s,1),(s,2),...,(s,1000).
How much of a problem this is will depend on your existing series cardinality. Increasing from 10 to 10,000 is probably no big deal. Increasing from 100,000 to 100,000,000 is more likely to be an issue. You would need to experiment and profile to see.
An alternative approach might be to encode the "descriptor" in the microsecond and/or nanosecond component(s) of the timestamp (as you're not using them anyway) to make them unique.

Is there a cleverer Ruby algorithm than brute-force for finding correlation in multidimensional data?

My platform here is Ruby - a webapp using Rails 3.2 in particular.
I'm trying to match objects (people) based on their ratings for certain items. People may rate all, some, or none of the same items as other people. Ratings are integers between 0 and 5. The number of items available to rate, and the number of users, can both be considered to be non-trivial.
A quick illustration -
The brute-force approach is to iterate through all people, calculating differences for each item. In Ruby-flavoured pseudo-code -
MATCHES = {}
for each (PERSON in (people except USER)) do
for each (RATING that PERSON has made) do
if (USER has rated the item that RATING refers to) do
MATCHES[PERSON's id] += difference between PERSON's rating and USER's rating
end
end
end
lowest values in MATCHES are the best matches for USER
The problem here being that as the number of items, ratings, and people increase, this code will take a very significant time to run, and ignoring caching for now, this is code that has to run a lot, since this matching is the primary function of my app.
I'm open to cleverer algorithms and cleverer databases to achieve this, but doing it algorithmically and as such allowing me to keep everything in MySQL or PostgreSQL would make my life a lot easier. The only thing I'd say is that the data does need to persist.
If any more detail would help, please feel free to ask. Any assistance greatly appreciated!
Check out the KD-Tree. It's specifically designed to speed up neighbour-finding in N-Dimensional spaces, like your rating system (Person 1 is 3 units along the X axis, 4 units along the Y axis, and so on).
You'll likely have to do this in an actual programming language. There are spatial indexes for some DBs, but they're usually designed for geographic work, like PostGIS (which uses GiST indexing), and only support two or three dimensions.
That said, I did find this tantalizing blog post on PostGIS. I was then unable to find any other references to this, but maybe your luck will be better than mine...
Hope that helps!
Technically your task is matching long strings made out of characters of a 5 letter alphabet. This kind of stuff is researched extensively in the area of computational biology. (Typically with 4 letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198 then you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching / scoring of sequences.
Is your data sparse? With rating, most of the time not every user rates every object.
Naively comparing each object to every other is O(n*n*d), where d is the number of operations. However, a key trick of all the Hadoop solutions is to transpose the matrix, and work only on the non-zero values in the columns. Assuming that your sparsity is s=0.01, this reduces the runtime to O(d*n*s*n*s), i.e. by a factor of s*s. So if your sparsity is 1 out of 100, your computation will be theoretically 10000 times faster.
Note that the resulting data will still be a O(n*n) distance matrix, so strictl speaking the problem is still quadratic.
The way to beat the quadratic factor is to use index structures. The k-d-tree has already been mentioned, but I'm not aware of a version for categorical / discrete data and missing values. Indexing such data is not very well researched AFAICT.

Resources