InfluxDB WHERE clause on a 'High Cardinality' field (or a tag) - influxdb

I'm playing with InfluxDB and experimenting with it for a vehicle speed tracking use case.
Every vehicle's speed at a given time is stored as a data point.
I'm modelling "vehicle_registration" as a tag and the other values as fields. I want the WHERE clause to be applied on "vehicle_registration", and it has to be quick. Therefore I'm taking advantage of the fact that tags are indexed by default.
But the biggest stumbling block for me is that tags are supposed to have low cardinality.
What are the recommendations here? I want to filter on a high-cardinality value in a WHERE clause, and the queries should be quick.
Any advice?

High cardinality means a higher memory requirement, so it really depends on what high cardinality means in your use case. 1k will probably be fine for 8GB of memory, but 1M will probably be a problem for 8GB. The best option is to try it: simulate your workload and you will see the real memory requirements. Then you will be able to size InfluxDB properly based on that (and on your budget, of course).
Or you can try TSI (the Time Series Index): https://docs.influxdata.com/influxdb/v1.8/concepts/tsi-details/

Related

Finding optimum CPU limit for docker containers in Splunk

I'm using Splunk to monitor my applications.
I also store resource statistics in Splunk.
Goal: I want to find the optimum CPU limit for each container.
How do I write a query that finds an optimum CPU limit? Or, the other question: should I?
Concern 1: Say I customize my query to use MAX(CPU). That doesn't mean my container will be running at that level most of the time, so I might set an unnecessarily high limit for my containers.
Let me explain: if MAX(CPU) gives a CPU limit value of 10, that peak might have been caused by a bulk operation. My container's expected usage may be around 1.2 all the time, except for that single operation. So using the MAX value won't work.
Concern 2: Say I use the AVG(CPU) value instead, and it is 2. How many of my operations will have to wait, and for how many minutes, after this change? How many of them are going to time out? It may create a lot of side effects. How do I decide the real average value? What parameters should be used?
Is it possible to include such conditions in the query? Or do I need an AI to decide it? :)
Here are my given parameters:
path=statistics.cpus_system_time_secs
path=statistics.cpus_user_time_secs
path=statistics.cpus_nr_periods
path=statistics.cpus_nr_throttled
path=statistics.cpus_throttled_time_secs
path=statistics.cpus_limit
I bet you can ask better questions than me. Let's discuss.
"Optimum" is going to depend greatly on your own environment (resources available, application priority, etc)
You probably want to look at a combination of the following factors:
avg(CPU)
max(CPU) (and time spent there)
min(CPU) (and time spent there)
I suspect your "optimum" limit is going to be some percentage below your max, but only if you're spending 'a lot' of time maxed out.
And, of course, being "maxed" may not matter if other containers are running acceptably.
Keep in mind that once you set that limit, your max will drop (as, likely, will your avg).
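The factors listed above (avg, max, min, and the time spent near the max) can be sketched in a few lines of Python. This is only an illustration, not a Splunk query: `summarize_cpu` and its `near_max_fraction` threshold are hypothetical names, and the samples are assumed to be CPU usage values (in cores) that have already been extracted from the statistics.

```python
from statistics import mean


def summarize_cpu(samples, near_max_fraction=0.9):
    """Summarize CPU usage samples (in cores).

    near_max_fraction is a hypothetical threshold: a sample counts as
    "near max" when it is at least this fraction of the observed maximum.
    """
    if not samples:
        raise ValueError("need at least one sample")
    peak = max(samples)
    # Fraction of samples spent close to the observed peak.
    near_peak = sum(1 for s in samples if s >= near_max_fraction * peak)
    return {
        "avg": mean(samples),
        "max": peak,
        "min": min(samples),
        "time_near_max": near_peak / len(samples),
    }


# The scenario from the question: steady usage of 1.2 with one bulk spike to 10.
samples = [1.2] * 99 + [10.0]
summary = summarize_cpu(samples)
print(summary["max"], summary["time_near_max"])
```

With these samples only 1% of the time is spent near the peak, which is the kind of evidence that a limit well below max(CPU) may be acceptable.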

max-series-per-database limit exceeded clarification needed / how to calculate number of series in use

We recently started to encounter this error:
{"error":"partial write: max-series-per-database limit exceeded: (1000000) dropped=1"}
When writing metric data like this:
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1103,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
I know that Influx recommends you keep your series cardinality low, and our impression was that series cardinality would mean keeping each tag individually to a small number of values. e.g. we felt comfortable sending instance_id=1103 as a tag, because we know that there will never be more than 2000 distinct instance_id tag values.
But after running into this error... I'm afraid maybe I was mistaken here. Do we actually need to keep the cardinality of all possible combinations of all tags low? e.g. do these two things count as two separate series towards the 1,000,000 default max, because the instance_id is different?
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1111,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=2222,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
If those count as two separate series... then is there a better way to structure this data in Influx? 1,000,000 total seems like a tiny amount if each separate combination of tags is a separate series...
Does InfluxDB 2.x help with this?
Is there a better tool that can handle a large number of tags and not bump into limits like this?
There is no way to figure out what data was not recorded. Update the max-series-per-database configuration to be more than 1M in order to stop dropping data.
This can be an indication that you are creating a lot of series. I saw some documentation on why that isn't great.
Hope this helps!
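To make the question about tag combinations concrete: in InfluxDB 1.x a series is identified by its measurement plus its full tag set, so two points that differ only in instance_id belong to two different series. The sketch below counts series that way. It is a simplified illustration, not InfluxDB's parser: `series_key`, `count_series` and `worst_case_cardinality` are hypothetical helpers, the points are shortened versions of the ones in the question, the parsing assumes no escaped commas or spaces, and the per-tag counts in the last call are made-up numbers.

```python
import math


def series_key(line):
    """Return measurement plus sorted tag set for one line-protocol point.

    Simplification: assumes no escaped commas or spaces in the line.
    """
    head = line.split(" ", 1)[0]          # "measurement,tag1=v1,tag2=v2"
    measurement, *tags = head.split(",")
    if not tags:
        return measurement
    return measurement + "," + ",".join(sorted(tags))


def count_series(lines):
    """Count distinct series keys among the given points."""
    return len({series_key(line) for line in lines})


def worst_case_cardinality(distinct_values_per_tag):
    """Upper bound: every combination of tag values actually occurs."""
    return math.prod(distinct_values_per_tag)


points = [
    "resque_job,environment=beta,instance_id=1111 seconds_spent_job=0.2 1649203450783000002",
    "resque_job,environment=beta,instance_id=2222 seconds_spent_job=0.2 1649203450783000002",
    "resque_job,environment=beta,instance_id=1111 seconds_spent_job=0.5 1649203450783000003",
]
print(count_series(points))                    # 2
print(worst_case_cardinality([2000, 3, 12]))   # 72000
```

The upper bound is a product, which is why several moderately sized tags can exceed 1,000,000 series even when each tag on its own looks small.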

How to space out influxdb continuous query execution?

I have many InfluxDB continuous queries (CQs) that are used to downsample data over a period of time on several occasions. At one point the load became high and InfluxDB ran out of memory while executing the continuous queries.
Say I have 10 CQs and all 10 execute in InfluxDB at the same time. That impacts memory heavily. I am not sure whether there is any way to space them out evenly, or to add some delay so that the CQs execute one by one. My speculation is that executing all the CQs at the same time makes InfluxDB crash. All the CQs are specified in the InfluxDB config. I hope there is a way to include a time delay between the CQs in the config, but I don't know exactly how to do it. One sample CQ:
CREATE CONTINUOUS QUERY "cq_volume_reads" ON "metrics"
BEGIN
SELECT sum(reads) as reads INTO rollup1.tire_volume FROM
"metrics".raw.tier_volume GROUP BY time(10m),*
END
I also don't know whether this is the best way to resolve the problem. Any thoughts on this approach, or suggestions for a better one, would be much appreciated. It would be great to get suggestions on debugging tools for InfluxDB as well. Thanks!
@Rajan - A few comments:
The canonical documentation for CQs is here. Much of what I'm suggesting is from there.
Are you using back-referencing? I see your example CQ uses GROUP BY time(10m),* - the * wildcard is usually used with backreferences. Otherwise, I don't believe you need to include the * to indicate grouping by all tags - it should already be grouped by all tags.
If you are using backreferences, that runs the CQ for each measurement in the metrics database. This is potentially very many CQ executions at the same time, especially if you have many CQ defined this way.
You can set offsets with GROUP BY time(10m, <offset>), but this also shifts the time interval used for your aggregation function (sum in your example). If your offset is 1 minute, each point will be a sum of data between e.g. 13:11 and 13:21 instead of 13:10 and 13:20. This will offset execution but may not work for your downsampling use case. From a signal processing standpoint, a 1 minute offset wouldn't change the validity of the downsampled data, but it might produce unwanted graphical display problems depending on what you are doing. I do suggest trying this option.
Otherwise, you can try to reduce the number of downsampling CQs to reduce memory pressure or downsample on a larger timescale (e.g. 20m) or lastly, increase the hardware resources available to InfluxDB.
For managing memory usage, look at this post. There are not many adjustments in 1.8 but there are some.
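The effect of the GROUP BY time() offset on bucket boundaries, as described above, can be reproduced with a little date arithmetic. This is a sketch of the arithmetic only, not InfluxDB's implementation; `bucket_bounds` is a hypothetical helper and the sample timestamp is arbitrary.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_bounds(ts, interval, offset=timedelta(0)):
    """Return (start, end) of the time bucket that contains ts.

    Buckets are aligned to the Unix epoch and then shifted by offset.
    """
    shifted = ts - EPOCH - offset
    start = EPOCH + offset + (shifted // interval) * interval
    return start, start + interval


ts = datetime(2022, 4, 6, 13, 15, tzinfo=timezone.utc)
ten_min = timedelta(minutes=10)

print(bucket_bounds(ts, ten_min))                        # 13:10 to 13:20
print(bucket_bounds(ts, ten_min, timedelta(minutes=1)))  # 13:11 to 13:21

# One way to stagger 10 CQs: give each a different offset, one minute apart.
offsets = [timedelta(minutes=i) for i in range(10)]
```

Each CQ would then aggregate over a slightly different window, which is the trade-off mentioned above.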

Influx index and high cardinality

I have a high-throughput system. I found out that, since many events have the same timestamp, InfluxDB had overwritten many of them.
Therefore I tried moving from milliseconds to nanoseconds, but since I am using Java, I couldn't get real clock-based nanoseconds.
I came up with this solution:
I created a new tag called "descriptor", and for each event I insert a random number between 1 and 1000. The set of values is fixed, and the probability of two events having the same timestamp and the same random descriptor value is very low. This fixes my problem and I can see all the events.
My question is whether it is OK to use these 1000 values, since this is a tag and I understand it can mess up my index and my performance.
Regards, Ido
As the random "descriptors" are completely uncorrelated to other event tags, in the worst case this could increase your series cardinality by 3 orders of magnitude. This is because each existing series (s) will potentially split into up to 1000 unique series (s,1),(s,2),...,(s,1000).
How much of a problem this is will depend on your existing series cardinality. Increasing from 10 to 10,000 is probably no big deal. Increasing from 100,000 to 100,000,000 is more likely to be an issue. You would need to experiment and profile to see.
An alternative approach might be to encode the "descriptor" in the microsecond and/or nanosecond component(s) of the timestamp (as you're not using them anyway) to make them unique.
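A minimal sketch of that alternative, in Python for brevity (the same arithmetic works in Java). Note that it swaps the random descriptor for a per-millisecond sequence counter, which guarantees uniqueness instead of making collisions merely unlikely. `NanoStamper` is a hypothetical helper, and it assumes a single writer whose millisecond timestamps never go backwards.

```python
NANOS_PER_MILLI = 1_000_000


class NanoStamper:
    """Turn millisecond timestamps into unique nanosecond timestamps.

    Assumption: one writer, and epoch_ms values arrive in non-decreasing order.
    """

    def __init__(self):
        self._last_ms = None
        self._seq = 0

    def stamp(self, epoch_ms):
        if epoch_ms == self._last_ms:
            # Same millisecond as the previous event: use the next sub-ms slot.
            self._seq += 1
            if self._seq >= NANOS_PER_MILLI:
                raise OverflowError("more than 1,000,000 events in one millisecond")
        else:
            self._last_ms = epoch_ms
            self._seq = 0
        return epoch_ms * NANOS_PER_MILLI + self._seq


stamper = NanoStamper()
print(stamper.stamp(1649203450783))  # 1649203450783000000
print(stamper.stamp(1649203450783))  # 1649203450783000001
print(stamper.stamp(1649203450784))  # 1649203450784000000
```

Because the uniqueness lives in the timestamp rather than in a tag, the series cardinality stays unchanged.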

Why is there this Capacity Limit on Nodes and Relationships in neo4j?

I wonder why neo4j has a capacity limit on nodes and relationships. The limit on nodes and relationships is 2^35, which is a "little" bit more than the "normal" 2^32 integer. Common SQL databases, for example MySQL, store their primary keys as int (2^32) or bigint (2^64). Can you explain to me the advantages of this decision? In my opinion this is a key decision point when choosing a database.
It is an artificial limit. They are going to remove it in the not-too-distant future, although I haven't heard any official ETA.
Often enough, you run into hardware limits on a single machine before you actually hit this limit.
The current option is to manually shard your graphs to different machines. Not ideal for some use cases, but it works in other cases. In the future they'll have a way to shard data automatically--no ETA on that either.
Update:
I've learned a bit more about neo4j storage internals. The reason the limits are exactly what they are is that the id numbers are stored on disk as pointers in several places (node records, relationship records, etc.). To increase the limit by another power of 2, they'd need to add 1 byte per node and 1 byte per relationship; the ids are currently packed as far as they will go without needing more bytes on disk. Learn more at this great blog post:
http://digitalstain.blogspot.com/2010/10/neo4j-internals-file-storage.html
Update 2:
I've heard that in 2.1 they'll be increasing these limits to around another order of magnitude higher than they currently are.
As of neo4j 3.0, all of these constraints are removed.
Dynamic pointer compression expands Neo4j’s available address space as needed, making it possible to store graphs of any size. That’s right: no more 34 billion node limits!
For more information visit http://neo4j.com/blog/neo4j-3-0-massive-scale-developer-productivity.
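As a quick arithmetic check on the numbers in this thread, 2^35 is about 34.4 billion, which matches the "34 billion" figure quoted above and is 8 times the range of a 32-bit integer. The variable names below are just for illustration.

```python
old_limit = 2 ** 35      # node/relationship limit discussed in the question
int32_range = 2 ** 32    # the "normal" 32-bit integer range

print(f"{old_limit:,}")          # 34,359,738,368
print(old_limit // int32_range)  # 8
```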
