Influx index and high cardinality - influxdb

I have a high throughput system. I found out that since many events has the same timestamp, influx had overwritten many events.
Therefore I tried moving from milliseconds to nanoseconds, but since I am using JAVA, I couldn't get the real clock based nanoseconds.
I came up with this solution:
I created a new tag called "descriptor" which for each event I insert a random number between 1-1000. These values are fixed and the probability for the same timestamp with the same random descriptor value is very low. This fixes my problem and I can see all the events.
My question is wether it is OK to use these 1000 values - since this is a tag and I understand it can mess up my index and my performance?
Regards, Ido

As the random "descriptors" are completely uncorrelated to other event tags, in the worst case this could increase your series cardinality by 3 orders of magnitude. This is because each existing series (s) will potentially split into up to 1000 unique series (s,1),(s,2),...,(s,1000).
How much of a problem this is will depend on your existing series cardinality. Increasing from 10 to 10,000 is probably no big deal. Increasing from 100,000 to 100,000,000 is more likely to be an issue. You would need to experiment and profile to see.
An alternative approach might be to encode the "descriptor" in the microsecond and/or nanosecond component(s) of the timestamp (as you're not using them anyway) to make them unique.

Related

max-series-per-database limit exceeded clarification needed / how to calculate number of series in use

We recently started to encounter this error:
{"error":"partial write: max-series-per-database limit exceeded: (1000000) dropped=1"}
When writing metric data like this:
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1103,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
I know that Influx recommends you keep your series cardinality low, and our impression was that series cardinality would mean keeping each tag individually to a small number of values. e.g. we felt comfortable sending instance_id=1103 as a tag, because we know that there will never be more than 2000 distinct instance_id tag values.
But after running into this error... I'm afraid maybe I was mistaken here. Do we actually need to keep the cardinality of all possible combinations of all tags low? e.g. do these two things count as two separate series towards the 1,000,000 default max, because the instance_id is different?
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1111,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=2222,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
If those count as two separate series... then is there a better way to structure this data in Influx? 1,000,000 total seems like a tiny amount if each separate combination of tags is a separate series...
Does InfluxDB 2.x help with this?
Is there a better tool that can handle a large number of tags and not bump into limits like this?
There is no way to figure out what data was not recorded. Update the max-series-per-database configuration to be more than 1M in order to stop dropping data.
This can be an indication that you are creating a lot of series. i saw some documentation on why that isn't great.
Hope this helps!

InfluxDB WHERE clause on a 'High Cardinality' field (or a tag)

I'm playing with InfluxDB and trying to experiment it for a vehicle speed tracking usecase.
Every vehicle's speed at a given time is stored as a data point.
I'm modelling "vehicle_registration" as a tag and other values as fields. I'd want the where clause to be applied on the "vehicle_registration" and it got to be quick. Therefore I'm taking advantage of the indexing capabilities on a tag by default.
But the biggest stumbling block for me is that the tags need to have a lower cardinality.
What are the recommendations here? I want a high cardinal field to be applied in a "where" clause and the queries should be quick.
Any advice?
High cardinality means higher memory requirement. So it really depends what high cardinality means in your use case. 1k will be probably fine for 8GB memory, but 1M will be probably problem for 8GB. The best option is to try it. Simulate it and you will see real memory requirements. Then you will be able to configure proper sizing for InfluxDB based on that (and your budget of course).
Or you can try TSI https://docs.influxdata.com/influxdb/v1.8/concepts/tsi-details/

Plot raw time series with Kibanan Timelion

I might not get something. How can I plot a raw time series with Timelion without applying any further aggregation? Just the raw data of a field over time that I have in an index. Of course I select the proper time window for the data.
I was trying to achieve the same thing, but didn't fully get what I wanted, but maybe these steps will help you.
My data was on by minute basis, so I don't want any more frequent fragmentation. Selecting interval = 1m helps only for short periods of time, but adding "interval=1m" into .es() block works on long periods, too.
To have lines not to return to 0 in between points, use .es().fit(carry)
.es().scale_interval(1m).fit(scale) - this is my chart to return to 0 if there were no data for certain period rather than carrying the line on the same level.
.es(metric=max:value_field) helps not to sum up the values, but show the max of the aggregated set.
My charts are still weirdly aggregated, but maybe it'll help someone.
Useful links:
Sparse time series in timelion
https://www.elastic.co/blog/sparse-timeseries-and-timelion
Scaling issue 1
https://discuss.elastic.co/t/diferent-value-on-y-axis-depending-on-time-interval/67785
Scaling issue 2
https://discuss.elastic.co/t/timelion-giving-wrong-metric-aggregate-value-on-enlarging/132789
Scaling issue 3
https://discuss.elastic.co/t/re-timelion-giving-wrong-metric-aggregate-value-on-enlarging/132925

Converting a apriori object to a list taking more time even for small number of data

I am working on a data set of more than 22,000 records, and when I tried it with the apriori model, it's taking way too much time even for small number of records like 20. Is there a problem in my code or Is there a faster way to convert the asscocians into a list quickly? The code I used is below.
for i in range(0, 20):
transactions.append([str(dataset.values[i,j]) for j in range(0, 543)])
from apyori import apriori
associations = apriori(transactions, min_support=0.004, min_confidence=0.3, min_lift=3, min_length=2)
result = list(associations)
It's difficult to assess without your data, but the complexity of Apriori is based on a number of factors, including your support threshold, number of transactions, number of items, average/max transaction length, etc.
In cases where even a small number of transactions is taking a long time to run it's often a matter of too low of a minimum support. When support is very low (near 0) the algorithm is effectively still brute forcing, since it has to look at all possible combinations of items, of every length. This is the equivalent of a mathematical power set, which is exponential. For just 41 items you're actually trying 2^41 -1 possible combinations, which is just over 1.1 TRILLION possibilities.
I recommend starting with a "high" min_support at first (e.g. 0.20) and then working your way down slowly. It's easier to test things that take seconds at first than ones that'll take a long time.
Other important note: There is no min_length parameter in Apyori. I'm not sure where everyone's getting that from (you're not alone in thinking there is one), unless it's this one random blog post I found. The parameters are as follows (straight from the code):
Keyword arguments:
min_support -- The minimum support of relations (float).
min_confidence -- The minimum confidence of relations (float).
min_lift -- The minimum lift of relations (float).
max_length -- The maximum length of the relation (integer).
For what it's worth, I wrote unofficial docs for Apyori that can be found here.

What should i do to maintain performance of a mobile app which is using database?

I'm building an app using database.
I have a words table and everytime user types something, this app will record and update word the database.
And the frequency field will be auto increase after user enter one matched word.
But the trouble is user type day by day and i afraid the search performance will be reduce after times and also the Int field will reach to the limit (max limit Int) someday.
So, i limit the database to around less than 50.000 records.
I delete less-used records after a certain time.
But i don't know how to deal with frequency Int field of each word?
How to know exactly frequency usage of each word without increasing the field forever?
I recommend that you use a logarithmic scale for the frequency values. That's what is often done in situations like this. See Wikipedia to learn about logarithmic scales.
For example, if you have a word MAN that has a frequency of 15, the value you store in the database would be log(15) ~= 1.17609125906.
If you then find 4 new occurrences of MAN, then you want to add 4 to the field. You cannot add the log values directly because log(x)+log(y)=log(x*y). (See the Logarithm Rules section of this article for more information on log rules.)
Instead -- assuming you use a base 10 logarithm, you would use this formula:
SET frequency = log(10^frequency+4)
Depending on the length of your words, the few bytes for the frequency don't matter. With an unsigned four bytes integer, you can count up to more than two billion, which is way above the number of words what the user can type in in their whole lifespan.
So may want to go for two or three bytes, but the savings may be negligible.
Anyway, there are the following approaches for preventing overflow:
You can detect it, and then undo the operations, scale everything down by some factor of two, and then redo.
You can periodically check all your numbers and do the scaling when approaching the limit.
You can do a probabilistic update like below.
Probabilistic update
Instead of simply incrementing the frequency every time by one, you do it only with a probability which gets lower and lower as the counter grows. For example, you can do the increment with a probability of 1.0 / (oldValue + 1) or 2 ** -oldValue. The latter leads to a logarithmic growth, but, unlike the idea in the other answer, it works.
There are obviously some disadvantages due to the randomness and precision loss, but when all you care about is the relative frequency, it should be good enough.

Resources