Batch write with different retention policies in InfluxDB - influxdb

I'm trying to write batch data to InfluxDB with different retention policies, but I could not find a way to do it without grouping the points by retention policy and then sending each group as a separate batch.

It is currently not possible to write data with different retention policies in the same batch.
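A minimal sketch of the workaround you describe, assuming the influxdb-python 1.x client and made-up retention policy names and points: group the points by retention policy, then issue one write per group.

```python
# Sketch of the workaround: group points by retention policy and write one
# batch per policy. Retention policy names and points below are made up.
from collections import defaultdict
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

# Each point is tagged with the retention policy it should be written under.
points_with_rp = [
    ("one_week", {"measurement": "cpu", "fields": {"value": 0.64}}),
    ("one_year", {"measurement": "cpu_daily", "fields": {"value": 0.58}}),
    ("one_week", {"measurement": "cpu", "fields": {"value": 0.71}}),
]

# Group the points by retention policy...
batches = defaultdict(list)
for rp, point in points_with_rp:
    batches[rp].append(point)

# ...then send one write_points() call (one batch) per retention policy.
for rp, points in batches.items():
    client.write_points(points, retention_policy=rp)
```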

Related

Allow User to Extract Data Dumps From DW

We use Synapse in Azure as our warehouse and create reports in Power BI for our users on top of this. We currently have a request to move all of the data dumps from our production system onto our warehouse DB, as some of them are causing performance issues in production when run. We've been looking to re-do these as reports in Power BI, however in some instances we still need to provide the "raw" data in CSV/Excel format. This has thrown up an issue, as some of these extracts are above 150k rows, so we can't use Power BI to provide the extract because it has a limit on the rows it can export.

Our solution would be to build a process that runs against the DB and spits out a file into SharePoint for the user to consume, which we can do; however, we're unsure how we could give the user a way to trigger the extract. One way I was thinking of doing it would be using Power Apps, but I'm wondering if there is an easier way someone on here might be able to suggest? I just need to provide pages with various buttons that trigger extracts to SharePoint from Azure when clicked, which can be controlled by security in some way. Any advice would be appreciated.
Paginated Report Export doesn't have that row limit.
See, e.g.:
https://learn.microsoft.com/en-us/power-bi/collaborate-share/service-automate-paginated-integration
Or you can use ADF Copy Activity to create .csv extracts.
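If you end up triggering the export yourself rather than through Power Automate, a rough sketch of the flow against the Power BI "Export To File" REST API for paginated reports could look like the following; the workspace/report IDs, the access token and the polling interval are placeholders, and error handling is omitted.

```python
# Rough sketch: export a paginated report to CSV via the Power BI REST API
# and download the result. IDs, token and sleep interval are placeholders.
import time
import requests

BASE = "https://api.powerbi.com/v1.0/myorg"
GROUP_ID = "<workspace-id>"
REPORT_ID = "<paginated-report-id>"
HEADERS = {"Authorization": "Bearer <access-token>"}

# 1. Start the export job.
resp = requests.post(
    f"{BASE}/groups/{GROUP_ID}/reports/{REPORT_ID}/ExportTo",
    headers=HEADERS,
    json={"format": "CSV"},
)
export_id = resp.json()["id"]

# 2. Poll until the export finishes.
while True:
    status = requests.get(
        f"{BASE}/groups/{GROUP_ID}/reports/{REPORT_ID}/exports/{export_id}",
        headers=HEADERS,
    ).json()
    if status["status"] in ("Succeeded", "Failed"):
        break
    time.sleep(5)

# 3. Download the file; the bytes can then be pushed to SharePoint by
#    whatever process you build around this.
if status["status"] == "Succeeded":
    data = requests.get(
        f"{BASE}/groups/{GROUP_ID}/reports/{REPORT_ID}/exports/{export_id}/file",
        headers=HEADERS,
    ).content
    with open("extract.csv", "wb") as f:
        f.write(data)
```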

Replay events and ingest them in AWS Timestream

I can't ingest records into AWS Timestream if the timestamp is outside the memory store retention window. This means I can't implement functionality where I replay messages, process them, and ingest them when there has been an issue. Are there any solutions for this?
Currently there is not a way to ingest records that are outside of the memory store retention window. Would you be able to create a table with a memory store retention window large enough to cover the period over which you expect to be correcting data?
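As a sketch of that suggestion, assuming boto3 and made-up database/table names and durations, you would size the memory store retention when creating the table so that it spans the replay window:

```python
# Sketch: create a Timestream table whose memory store retention is large
# enough to cover the window you expect to replay/correct. Names and
# durations below are placeholders.
import boto3

client = boto3.client("timestream-write", region_name="us-east-1")

client.create_table(
    DatabaseName="my_database",
    TableName="my_table",
    RetentionProperties={
        # Records can only be ingested while their timestamp falls inside
        # the memory store window, so size it to span the replay period.
        "MemoryStoreRetentionPeriodInHours": 24 * 14,  # 14 days
        "MagneticStoreRetentionPeriodInDays": 365,
    },
)
```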

DB folder utilising a lot of space, creating a space issue

I have a Grafana Windows server where we have integrated Hyper-V snapshot-related info as well as CPU and memory usage of the HVs, etc. I can see the below folder on our Grafana Windows server:
C:\InfluxDB\data\telegraf\autogen
Under this autogen folder, I can see multiple subfolders with .tsm files. A file is created every 7 days, and each folder is around 4 to 5 GB in size. There are many files in this autogen folder from 2nd Feb 2017 to 14 Mar 2018, which are using around 225 GB of space.
What you see:
autogen is a default Retention Policy (RP) auto-created by InfluxDB and has an infinite retention duration. All datapoints in Influx are logically stored in shards. Physically, shard data is compressed and stored in .tsm files. Shards are grouped into shard groups. Each shard group covers a specific time range, defined by the so-called shard group duration, and stores the datapoints belonging to that time interval. By default, for an RP with a retention duration > 6 months, the shard group duration is set to 7 days.
For more info, see the docs on the storage engine.
Regarding your questions:
"Is there anyway we can shrink the size of autogen file?"
Probably no. The only thing you can do is to rely on InfluxDB internal compression. Here they say that it may be improved if you increase shard duration.
*Although, because InfluxDB drop the whole shard rather then separate datapoints, the increase of shard duration will make your data to be stored until the whole shard goes out of scope of current retention duration and only then it will be dropped. Though, if you have an infinite retention duration it doesn't matter. This leads us to the second question.
"Is it possible to delete the old file under autogen folder?"
If you can afford loosing old data or can't afford to much storage space InfluxDB lets to specify data Retention Policy (RP), already mentioned above. Basically, all your measurements are associated with a specific RP and the data will be deleted as soon as retention duration comes to the end. So if you specify a RP of 1 year, InfluxDB will automatically delete all datapoints older then now() - 1 year. RP is a standard (and pretty obvious) way of dealing with storage issues. A logical continuation of RP idea is to group and aggregate your data over time over longer discrete time intervals (downsampling). In Influx it can be achieved with Continuous Queries (CQ). You can read more of data retention and downsamping here.
In conclusion, storage limitation are inevitable and properly configured retention policies is the way to go.
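If it helps, here is a small sketch with the influxdb-python client of capping retention and adding a downsampling CQ; the database, policy, measurement and field names are illustrative only.

```python
# Sketch: cap retention at one year and downsample raw CPU data into a
# longer-lived policy. Database, policy, measurement and field names are
# illustrative only.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="telegraf")

# Keep raw data for 52 weeks instead of forever; new writes that don't name
# a policy will go here once it is the default.
client.create_retention_policy(
    name="one_year", duration="52w", replication="1",
    database="telegraf", default=True,
)

# A longer-lived policy to hold the downsampled series.
client.create_retention_policy(
    name="five_years", duration="260w", replication="1", database="telegraf",
)

# Continuous query: roll raw "cpu" points up into hourly means.
client.create_continuous_query(
    name="cq_cpu_1h",
    select='SELECT mean("usage_idle") AS "usage_idle" '
           'INTO "telegraf"."five_years"."cpu_1h" '
           'FROM "cpu" GROUP BY time(1h), *',
    database="telegraf",
)
```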

How to cache a channel aggregation feed that can be customized for each user?

We already use Redis in our development stack and I prefer to use it, but I know that neo4j has some great tools for this.
There are about 14 channels that publish content every day.
There are about 1M users, and every user can customize his or her own feed to aggregate data from a combination of these channels.
Maybe the "graphity model" is for you.

Multiple exports using Google Dataflow

Not sure whether this is the right place to ask, but I am currently trying to run a Dataflow job that will partition a data source into multiple chunks in multiple places. However, I feel that if I try to write to too many tables at once in one job, the Dataflow job is more likely to fail with an HTTP transport exception, and I assume there is some bound on how much I/O, in terms of sources and sinks, I can wrap into one job?
To avoid this scenario, the best solution I can think of is to split this one job into multiple Dataflow jobs, but that would mean processing the same data source multiple times (once per Dataflow job). That is okay for now, but ideally I want to avoid it if my data source later grows huge.
Therefore I am wondering whether there is any rule of thumb for how many data sources and sinks I can group into one steady job? And is there any other better solution for my use case?
From the Dataflow service description of structuring user code:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
In general, Dataflow should be relatively resilient. You can Partition your data based on the location you would like it output to. The writes to these output locations will be automatically divided into bundles, and any bundle which fails to be written will be retried.
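As a sketch of that Partition approach with the Apache Beam Python SDK (the bucket paths and routing rule below are placeholders for your own logic):

```python
# Sketch of the Partition approach: split one PCollection by destination and
# write each partition to its own location. Paths and the routing rule are
# placeholders for your own logic.
import apache_beam as beam

OUTPUTS = [
    "gs://my-bucket/out/a/part",
    "gs://my-bucket/out/b/part",
    "gs://my-bucket/out/c/part",
]

def by_destination(line, num_partitions):
    # Placeholder routing rule: replace with e.g. parsing the record and
    # choosing a partition based on one of its fields.
    return hash(line) % num_partitions

with beam.Pipeline() as p:
    lines = p | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*")
    parts = lines | "Split" >> beam.Partition(by_destination, len(OUTPUTS))
    for i, part in enumerate(parts):
        _ = part | f"Write{i}" >> beam.io.WriteToText(OUTPUTS[i])
```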
If the location you want to write to is not already supported you can look at writing a custom sink. The docs there describe how to do so in a way that is fault tolerant.
There is a bound on how many sources and sinks you can have in a single job. Do you have any details on how many you expect to use? If it exceeds the limit, there are also ways to use a single custom sink instead of several sinks, depending on your needs.
If you have more questions, feel free to comment. In addition to knowing more about what you're looking to do, it would help to know if you're planning on running this as a Batch or Streaming job.
Our solution to this was to write a custom GCS sink that supports partitions, though with the responses I got I'm unsure whether that was the right thing to do or not. See: Writing Output of a Dataflow Pipeline to a Partitioned Destination
