InfluxDB - what's shard group duration - time-series

I have created a one-year retention policy in InfluxDB, and the shard group duration was automatically set to 168h.
This is how my retention policies look now:
This is how my shards look now:
What does it mean for my data that the shard's end time is set one week ahead?

It means that all of the data written to the database st_test and the retention policy a_year with a timestamp between 2016-10-03 and 2016-10-10 will be stored in shard 16.
A retention policy is a container for shards. Each shard in the retention policy will hold one week's worth of data, and after one year that shard will expire and be removed.
See the shard documentation for more information.
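If you want to check those boundaries yourself, the shard metadata from your screenshot can also be listed with a plain InfluxQL statement (in 1.x the output includes the database, retention policy, shard ID, start time, end time and expiry time):
SHOW SHARDS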

In order to understand shard group durations, you need to understand their relation to the retention policy duration.
The retention policy DURATION determines how long InfluxDB keeps the data, while the SHARD DURATION clause determines the time range covered by a shard group.
A single shard group covers a specific time interval; InfluxDB determines that time interval by looking at the DURATION of the relevant retention policy (RP). The table below outlines the default relationship between the DURATION of an RP and the time interval of a shard group:
RP duration less than 2 days: shard group duration of 1 hour
RP duration between 2 days and 6 months: shard group duration of 1 day
RP duration greater than 6 months: shard group duration of 7 days (168h, as in your case)
When you create a retention policy, you can override that default shard duration with the SHARD DURATION clause:
CREATE RETENTION POLICY <retention_policy_name> ON <database_name> DURATION <duration> REPLICATION <n> [SHARD DURATION <duration>] [DEFAULT]
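For example, a one-year policy that makes the automatically chosen one-week shard groups explicit could look like this (a sketch reusing the st_test database and a_year policy names from above; adjust to your setup):
CREATE RETENTION POLICY "a_year" ON "st_test" DURATION 52w REPLICATION 1 SHARD DURATION 1w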

Related

Can InfluxDB have Continuous Queries with same source & target measurements but with different/new tags?

Below is the scenario against which I have this question.
Requirement:
Pre-aggregate time series data within InfluxDB at granularities of seconds, minutes, hours, days and weeks for each sensor in a device.
Current Proposal:
Create five Continuous Queries (one for each granularity level, i.e. seconds, minutes, ...) for each sensor of a device, writing into a different retention policy than that of the raw time series data, when the device is onboarded.
Limitation with Current Proposal:
With an increasing number of devices/sensors (time series data sources), InfluxDB will get bloated with too many Continuous Queries (which is not recommended), and this will take a toll on the InfluxDB instance itself.
Question:
To avoid the above problems, is it possible to create Continuous Queries on the same source measurement (i.e. the raw time series measurement) and differentiate the aggregates within that measurement by introducing new tags that distinguish the Continuous Query results from the raw time series data?
Example:
CREATE CONTINUOUS QUERY "strain_seconds" ON "database"
RESAMPLE EVERY 5s FOR 1m
BEGIN
SELECT MEAN("strain_top") AS "STRAIN_TOP_MEAN" INTO "database"."raw"."strain" FROM "database"."raw"."strain" GROUP BY time(1s),*
END
As far as I know, and have seen from the docs, it's not possible to apply new tags in continuous queries.
If I've understood the requirements correctly this is one way you could approach it.
CREATE CONTINUOUS QUERY "strain_seconds" ON "database"
RESAMPLE EVERY 5s FOR 1m
BEGIN
SELECT MEAN("strain_top") AS "STRAIN_TOP_MEAN" INTO "database"."raw"."strain" FROM "database"."strain_seconds_retention_policy"."strain" GROUP BY time(1s),*
END
This would save the data in the same measurement but in a different retention policy, strain_seconds_retention_policy. When you do a select, you specify the corresponding retention policy to select from.
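For example, to read the downsampled points you would name the retention policy explicitly in the query (a sketch reusing the names from the query above):
SELECT "STRAIN_TOP_MEAN" FROM "database"."strain_seconds_retention_policy"."strain" WHERE time > now() - 1h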
Note that it is not possible to select from several retention policies at the same time. If you don't specify one, the default one is used (not all of them). If that is something you need, then another approach would be required.
I don't quite get why you'd need to define a continuous query per device and per sensor. You only need to define five (one each for seconds, minutes, hours, days and weeks) and do a GROUP BY * (all), which you already do. As long as the source data point has a tag with the ID of the corresponding device and sensor, the resampled data point will have it too. Any newly added devices (data) will just be processed automatically by those five queries and saved into the corresponding retention policies.
If you do want to apply additional tags, you could process the data outside the database in a custom script and write it back with any additional tags you need, instead of using continuous queries.

InfluxDB Continuous Query running on entire time series data

If my interpretation is correct, then according to the documentation provided here: InfluxDB Downsampling, when we downsample data using a Continuous Query running every 30 minutes, it runs only on the previous 30 minutes of data.
Relevant part of the document:
Use the CREATE CONTINUOUS QUERY statement to generate a CQ:
CREATE CONTINUOUS QUERY "cq_30m" ON "food_data" BEGIN
SELECT mean("website") AS "mean_website",mean("phone") AS "mean_phone"
INTO "a_year"."downsampled_orders"
FROM "orders"
GROUP BY time(30m)
END
That query creates a CQ called cq_30m in the database food_data. cq_30m tells InfluxDB to calculate the 30-minute average of the two fields website and phone in the measurement orders and in the DEFAULT RP two_hours. It also tells InfluxDB to write those results to the measurement downsampled_orders in the retention policy a_year with the field keys mean_website and mean_phone. InfluxDB will run this query every 30 minutes for the previous 30 minutes.
When I create a Continuous Query, it actually runs on the entire dataset, not just the previous 30 minutes. My question is: does this happen only the first time, after which it runs on the previous 30 minutes of data instead of the entire dataset?
I understand that the query itself uses GROUP BY time(30m), which means it will return all the data grouped into 30-minute intervals, but does this also hold true for the Continuous Query? If so, should I then include a filter so the Continuous Query only processes the last 30 minutes of data?
What you have described is expected functionality.
Schedule and coverage
Continuous queries operate on real-time data. They use the local server’s timestamp, the GROUP BY time() interval, and InfluxDB database’s preset time boundaries to determine when to execute and what time range to cover in the query.
CQs execute at the same interval as the cq_query’s GROUP BY time() interval, and they run at the start of the InfluxDB database’s preset time boundaries. If the GROUP BY time() interval is one hour, the CQ executes at the start of every hour.
When the CQ executes, it runs a single query for the time range between now() and now() minus the GROUP BY time() interval. If the GROUP BY time() interval is one hour and the current time is 17:00, the query’s time range is between 16:00 and 16:59.999999999.
So it should only process the last 30 minutes.
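For illustration, with the cq_30m example above and a wall-clock time of 17:00, the single query that InfluxDB runs is roughly equivalent to the following (a sketch; the date in the time bounds is purely illustrative):
SELECT mean("website") AS "mean_website", mean("phone") AS "mean_phone"
INTO "a_year"."downsampled_orders"
FROM "orders"
WHERE time >= '2019-01-25T16:30:00Z' AND time < '2019-01-25T17:00:00Z'
GROUP BY time(30m)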
It's a good point about the first run.
I did manage to find a snippet from an old document:
Backfilling Data
In the event that the source time series already has data in it when you create a new downsampled continuous query, InfluxDB will go back in time and calculate the values for all intervals up to the present. The continuous query will then continue running in the background for all current and future intervals.
https://influxdbcom.readthedocs.io/en/latest/content/docs/v0.8/api/continuous_queries/#backfilling-data
That would explain the behaviour you have found.

Unable to insert data in influxdb

Could you please help to resolve the below error?
INSERT blocks,color=blue number=90 1547095860
ERR: {"error":"partial write: points beyond retention policy dropped=1"}
Note: 1547095860 is 1/10/2019, 10:21:00 AM Standard Time.
I am trying to insert this data today, i.e. 25th Jan 2019.
My DB settings are as in the attached image.
Thanks!
Based on the screenshot provided, I suppose you're attempting to insert data into one of the databases named LINSERVER?
By default, if no retention policy (RP) is specified for a database, the system will use the autogen RP, which has an infinite retention period. In other words, you can insert data with a timestamp anywhere from the start of epoch time onwards and it will be preserved.
If a retention policy of, for example, 7d is defined for a database, this means that you can only insert data within a limited period of one week. Data points that fall outside the retention period are considered expired and will be "garbage collected" by the database.
This also means that you cannot backfill data outside the retention period. My guess is that your database has a relatively short retention period, which caused the point you just inserted to be considered expired, hence the error message "points beyond retention policy dropped=1".
Solution: either use the default autogen RP or make the duration of your database's retention policy longer, depending on your use case.
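To verify and fix this, you can list the policies and, if needed, extend the short one (a sketch: "LINSERVER" is the database name guessed from the screenshot and "short_rp" is a hypothetical policy name, so substitute your own):
SHOW RETENTION POLICIES ON "LINSERVER"
ALTER RETENTION POLICY "short_rp" ON "LINSERVER" DURATION 52w REPLICATION 1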

influx: all old data got deleted after applying retention policy

I had data from the last 7 days in InfluxDB. I applied a retention policy and suddenly all the data got deleted. I have a single instance of InfluxDB running.
CREATE RETENTION POLICY stats_30_day ON server_stats DURATION 30d REPLICATION 1
ALTER RETENTION POLICY stats_30_day ON server_stats DURATION 30d REPLICATION 1 default
Any idea what went wrong?
You changed which retention policy is the default for your database. The data written under the previous default retention policy is still there; it just isn't returned when you query without naming a retention policy, so you'll have to specify that retention policy explicitly. See:
https://docs.influxdata.com/influxdb/v0.13/troubleshooting/frequently_encountered_issues/#missing-data-after-creating-a-new-default-retention-policy
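For example, assuming the old data was written under the previous default policy autogen, you can still read it by naming that policy explicitly ("cpu" here is a hypothetical measurement name; substitute your own):
SELECT * FROM "server_stats"."autogen"."cpu" WHERE time > now() - 7d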

TTL settings for rollups tables in OpsCenter keyspace?

By default, rollup tables like rollups60, rollups300, rollups7200 and rollups86400 have the value 0 for default_time_to_live, which means the data never expires. But as per the OpsCenter Metrics blog, using Cassandra's built-in TTL support, OpsCenter expires the columns in the rollups60 column family after 7 days, the rollups300 column family after 4 weeks, the rollups7200 column family after 1 year, and the data in the rollups86400 column family never expires.
What is the reason behind this setting, and where do we set the TTL for these tables?
Since OpsCenter data is growing, shouldn't we have TTLs defined for the rollup tables at the table level?
But in opscenterd.conf the default values are listed below:
[cassandra_metrics]
1min_ttl = 86400
5min_ttl = 604800
2hr_ttl = 2419200
Which setting takes precedence over the other?
There are defaults, defined in the documentation, that apply if the settings are not set anywhere:
1min_ttl: Sets the time in seconds to expire 1-minute data points. The default value is 604800 (7 days).
5min_ttl: Sets the time in seconds to expire 5-minute data points. The default value is 2419200 (28 days).
2hr_ttl: Sets the time in seconds to expire 2-hour data points. The default value is 31536000 (365 days).
24hr_ttl: Sets the time to expire 24-hour data points. The default value is 0, or never.
If you don't set them, the defaults are used; if you do override them in the [cassandra_metrics] section of opscenterd.conf, the overridden values take precedence. When the agent on a node stores a rollup for a period, it will include whatever TTL it is associated with, i.e. (not exactly how OpsCenter does it, but for demonstration purposes):
INSERT INTO rollups60 (key, timestamp, value) VALUES (...) USING TTL 604800;
In your example you lowered the TTLs, which would decrease the amount of data stored. So:
1) You set a lower TTL to decrease the amount of data stored on disk. You can configure it as you mentioned in your ticket, although the compaction strategy can affect this significantly.
2) There is a default TTL setting on the tables, but there really isn't much difference between setting it per query and having it on the table. Doing an ALTER TABLE is pretty expensive if you need to change it, compared to just changing the value of the TTL on the inserts. If you have issues with obsolete data in the tables, try switching to LeveledCompactionStrategy (note this increases I/O on compactions, but probably not noticeably).
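If you do want to set it at the table level anyway, it is a single statement per rollup table; a minimal CQL sketch, assuming the default "OpsCenter" keyspace name (adjust to your cluster):
ALTER TABLE "OpsCenter".rollups60 WITH default_time_to_live = 604800;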
According to https://docs.datastax.com/en/latest-opsc/opsc/configure/opscChangingPerformanceDataExpiration_t.html:
"Edit the cluster_name.conf file."
Chris, you suggested editing opscenterd.conf.
