I changed the retention policy of a database to keep data for only one day, and after that I dropped all shards created from the autogen RP, but InfluxDB still uses a huge amount of storage in the /var/lib/influxdb/data/<DB NAME>/_series folder, and it keeps growing.
How can I release the storage related to the deleted points?
I set a retention policy on an InfluxDB database, but it seems that it's not working. Old data is not actually deleted.
I expect it to drop data older than 1 month.
I figured it out. The old data was still in the old retention policy, "autogen". I set "autogen" to 30d instead of "0s" (which means infinite) and it worked. Basically, when you create a retention policy on an existing database, new data is stored under that retention policy from then on; the old data is still stored under the old retention policy.
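For reference, a minimal sketch using the influx CLI; the database name "mydb" is a placeholder for your own:

    # Inspect the existing retention policies and their durations
    influx -execute 'SHOW RETENTION POLICIES ON "mydb"'
    # Shorten the default "autogen" policy from infinite (0s) to 30 days
    influx -execute 'ALTER RETENTION POLICY "autogen" ON "mydb" DURATION 30d'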
I want to migrate 20 GB of Neo4j graph data to AWS Neptune. How much Neo4j DB downtime is needed for the entire data migration, and how do I handle data loss during the downtime?
Two blog posts that you may want to review include both an initial baseline migration [1] and how to capture changes and perform incremental updates [2].
So long as your existing queries and connection methods work on both platforms (without any modification), you could potentially leverage these two methods to have very minimal downtime.
[1] https://aws.amazon.com/blogs/database/migrating-a-neo4j-graph-database-to-amazon-neptune-with-a-fully-automated-utility/
[2] https://aws.amazon.com/blogs/database/change-data-capture-from-neo4j-to-amazon-neptune-using-amazon-managed-streaming-for-apache-kafka/
I have a Grafana Windows server where we have integrated Hyper-V snapshot-related information as well as CPU and memory usage of the HVs. I can see the folder below on our Grafana Windows server:
C:\InfluxDB\data\telegraf\autogen
Under this autogen folder, I can see multiple subfolders with .tsm files. A new subfolder is created every 7 days, and each is around 4 to 5 GB. There are many files in this autogen folder, from 2nd Feb 2017 to 14 Mar 2018, using around 225 GB of space in total.
What you see:
autogen is the default Retention Policy (RP) auto-created by InfluxDB; it has an infinite retention duration. All datapoints in InfluxDB are logically stored in shards. Physically, shard data is compressed and stored in .tsm files. Shards are unified into shard groups. Each shard group covers a specific time range, defined by the so-called shard group duration, and stores the datapoints belonging to that time interval. By default, for an RP with a retention duration greater than 6 months, the shard group duration is set to 7 days.
For more info, see the docs on the storage engine.
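You can inspect this on your own instance: the influx CLI will list every shard along with the time range it covers and when it expires. A quick sketch:

    # List all shards per database, with their start/end times and expiry
    influx -execute 'SHOW SHARDS'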
Regarding your questions:
"Is there anyway we can shrink the size of autogen file?"
Probably not. The only thing you can do is rely on InfluxDB's internal compression. The docs say it may improve if you increase the shard group duration.
*Note, however, that because InfluxDB drops whole shards rather than individual datapoints, increasing the shard duration means your data is kept until the entire shard falls outside the current retention duration; only then is it dropped. If you have an infinite retention duration, this doesn't matter. This leads us to the second question.
"Is it possible to delete the old file under autogen folder?"
If you can afford to lose old data, or can't afford that much storage space, InfluxDB lets you specify a data Retention Policy (RP), already mentioned above. Basically, all your measurements are associated with a specific RP, and data is deleted as soon as its retention duration comes to an end. So if you specify an RP of 1 year, InfluxDB will automatically delete all datapoints older than now() - 1 year. An RP is the standard (and pretty obvious) way of dealing with storage issues. A logical continuation of the RP idea is to group and aggregate your data over longer discrete time intervals (downsampling). In InfluxDB this is achieved with Continuous Queries (CQ); see the sketch after this paragraph. You can read more about data retention and downsampling in the docs.
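A sketch of both ideas using the influx CLI. The database name "telegraf" matches the folder in the question; the measurement "cpu" and field "usage_idle" are standard Telegraf names but may differ in your setup:

    # Create a 1-year retention policy (52w ~ 1 year) and make it the
    # default for new raw writes
    influx -execute 'CREATE RETENTION POLICY "one_year" ON "telegraf" DURATION 52w REPLICATION 1 DEFAULT'
    # Downsample raw datapoints into hourly means kept under the
    # infinite "autogen" policy
    influx -execute 'CREATE CONTINUOUS QUERY "cq_cpu_hourly" ON "telegraf" BEGIN SELECT mean("usage_idle") AS "usage_idle" INTO "telegraf"."autogen"."cpu_hourly" FROM "cpu" GROUP BY time(1h), * END'

With this layout, raw data expires after a year while the hourly aggregates are kept forever.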
In conclusion, storage limitations are inevitable, and properly configured retention policies are the way to go.
I am currently refining a Cassandra backup solution.
I am stuck on the question of whether I should save both incremental_backups AND the commitlog_archive.
If I understand correctly, restoring from either
Snapshot + Incremental Backups + Commitlog (only those after the last flush)
OR
Snapshot + Commitlog from archive
should end in the same set of data, right?
Or is the latter option much slower because replaying takes longer than just checking SSTable integrity?
Should I keep both?
I would prefer incremental backups over commit logs.
Incremental backups produce hard links to immutable SSTables, which can then be replayed back into a live Cassandra cluster using sstableloader. When incremental backups are enabled (they are disabled by default), Cassandra hard-links each flushed SSTable into a backups directory under the keyspace data directory. The disadvantage is that incremental backup is all or nothing: it is not possible to select a subset of column families. As mentioned, the ability to restore into a live Cassandra cluster, even into a different column family, makes incremental backups superior. You also have to manage incremental backup space yourself, because there is no utility to clean up incremental backups over time or to do a rebase. A sketch of the relevant commands follows below.
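A minimal sketch; the host address, staging path, keyspace, and table names are placeholders:

    # Turn incremental backups on for a running node (off by default); the
    # persistent equivalent is incremental_backups: true in cassandra.yaml
    nodetool enablebackup
    # Verify the current backup status
    nodetool statusbackup
    # Stage the hard-linked SSTables from the backups directory into the
    # <keyspace>/<table> layout that sstableloader expects, then stream
    # them into the live cluster
    sstableloader -d 10.0.0.1 /tmp/restore/my_keyspace/my_table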
The advantage of commit logs is that they provide a point-in-time restore capability. To restore from commit logs, you go back to the latest incremental backup or the latest snapshot (in your former case), stop the database, clear the existing commit logs, copy in the archived commit logs written since that incremental backup or snapshot, and run the roll-forward utility to bring the database to the exact point in time that you require.
However, if you use only commit logs, your database downtime is going to be higher, since there are more commit logs to process while the database is down. So I would use the incremental backup approach and then apply commit logs on top.
Lastly, it is better to use a professional tool here rather than hacking this together on your own; from experience with multiple customers, both the first and second approaches are fraught with potential for error.
It depends on your restore requirements. If you want to restore to a particular time, then you will need the snapshot, the incremental backups, and the commit logs.
Why commit logs?
Suppose you take a snapshot at 11:00 am today and you want to restore to 11:30 am today. If you use only incremental backups that cover 11:30 am,
then there is a possibility that you will end up with some extra or missing data.
For example, if someone deletes a row at 11:31 am and your incremental backups contain an SSTable flushed at 11:32 am, then after the restore you will find that row as a tombstone, which is wrong for the chosen restore time.
So for a point-in-time restore you need to process commit logs along with the full snapshot and incremental backups, as sketched below.
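For completeness, point-in-time replay is configured through Cassandra's commitlog_archiving.properties. A minimal sketch, where the archive path and timestamp are placeholders; restore_point_in_time uses the format yyyy:MM:dd HH:mm:ss and is interpreted as GMT:

    # Write a commit log archiving/restore configuration
    cat > /etc/cassandra/commitlog_archiving.properties <<'EOF'
    archive_command=/bin/cp %path /backup/commitlog_archive/%name
    restore_command=/bin/cp -f %from %to
    restore_directories=/backup/commitlog_archive
    restore_point_in_time=2018:03:14 11:30:00
    EOF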
Is it possible to automatically clear old data in InfluxDB? Say, some configuration option to keep records for 1 month only? I store quite a lot of statistics on my server, so to keep from running out of free storage I would like to have such a feature.
Yes, it's simple: just set a retention policy with the duration you want (for example, 7 days), and InfluxDB will drop expired shards automatically.
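A sketch with the influx CLI; the database name "mydb" is a placeholder, and the duration is 30d to match the one-month case in the question:

    # Keep only the last 30 days of data; DEFAULT routes new writes here
    influx -execute 'CREATE RETENTION POLICY "one_month" ON "mydb" DURATION 30d REPLICATION 1 DEFAULT'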