How to downsample data in AWS TimeStream - time-series

I understand AWS TimeStream allows data to be moved to different types of storage based on retention period but we also need data to be downsampled based on retention period.
For example:
48 hours, one second granularity
30 days, one minute granularity
10 years, one hour granularity
How can this be achieved?

I don't think Timestream currently supports that at the storage level. Time-series databases are designed on the assumption that you write data once and change it very seldom. Given that intent, you would handle this kind of granularity change at query time, for example with the bin() function.
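As a rough illustration of the query-time approach, here is a minimal Python sketch that builds a Timestream query using bin(). The database name "my_db", the table name "my_table" and the choice of avg() over measure_value::double are assumptions for the example, not something from the question; adjust them to your schema.

```python
def build_downsample_query(database, table, bin_size, lookback):
    """Return a Timestream SQL string that averages values per time bin.

    bin_size and lookback use Timestream interval literals such as 1s, 1m, 1h, 30d.
    """
    return (
        f"SELECT bin(time, {bin_size}) AS binned_time, "
        f"avg(measure_value::double) AS avg_value "
        f'FROM "{database}"."{table}" '
        f"WHERE time >= ago({lookback}) "
        f"GROUP BY bin(time, {bin_size}) "
        f"ORDER BY binned_time"
    )


# Hypothetical names; one-minute granularity over the last 30 days,
# matching the second tier in the question.
QUERY_1M = build_downsample_query("my_db", "my_table", "1m", "30d")
print(QUERY_1M)
```

The raw data stays at full resolution in storage; only the result set is downsampled, so this does not reduce storage cost by itself.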

Related

How to calculate application availability (SLA)

I have standard ASP.NET MVC project and I need to calculate application availability to find out our SLA level. So, I need to get something like this for our web application.
Information from my hosting provider
System Availability: 99.9860%
Total Uptime: 30d 10h:22m:44s
Total Downtime: 0d 0h:6m:9s
Total Reboots: 3
Mean Time Between Reboots: 10.15 days
But I need to calculate availability for application. So, the question is
How to calculate ASP.NET MVC application availability in proper way?
Maybe someone has already implemented that, or any suggestion how to do that, any help will be appreciated.
Where to start?
My first thought was Application Insights and its availability tests. The problem is that the minimum test frequency is 5 minutes, and I need more precise measurements.
Next, I could create a tool that calls my app every second and collects the results. The drawback is a very large number of requests.
I could also read some performance counters from IIS or something similar. I need to investigate whether that is possible.
I know the question is possibly too broad, but I didn't find any information about implementing application availability measurement. What do you think about that?
It would take too long to explain everything that can be done, so I'll keep it short.
Usually you define all these details in a Service Level Agreement, where you also define the availability target (e.g. 99 %), which also accounts for planned downtime. A 99 % availability target means the app is running, with its functionality as described in the document, with at most approx. 87.6 h of downtime per year. Here is an SLA uptime calculator.
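The arithmetic behind those numbers can be sketched in a few lines of Python. The function names are made up for this example; the inputs are the 99 % target from this answer and the uptime/downtime figures quoted from the hosting provider in the question.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h, ignoring leap years


def allowed_downtime_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Downtime budget implied by an availability target over a period."""
    return period_hours * (100 - target_percent) / 100


def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a d/h/m/s duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def availability_percent(uptime_seconds, downtime_seconds):
    """Availability = uptime / (uptime + downtime), as a percentage."""
    return 100 * uptime_seconds / (uptime_seconds + downtime_seconds)


# Figures from the hosting provider report in the question.
uptime = to_seconds(days=30, hours=10, minutes=22, seconds=44)
downtime = to_seconds(minutes=6, seconds=9)

print(f"{allowed_downtime_hours(99):.1f} h")               # 87.6 h
print(f"{availability_percent(uptime, downtime):.4f} %")   # 99.9860 %
```

The second result reproduces the provider's "System Availability" figure, which suggests they use the same uptime / total time definition.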
The normal interval is 5 minutes, as you say, but if you can prove by using an external site or service that the suppliers are not meeting the requirements, you calculate your loss (revenue loss, labor costs, etc.) and claim the money from them. I assume you already have a Business Impact Analysis (BIA); if not, you should do one.
OK, now to the programming / DevOps part. I usually develop applications and services with this in mind and have them report their status to a third-party service like NewRelic, Uptrends or similar. I also use a self-made service for this, because some requirements call for delivering data at least once a second with a hard deadline. In my solution I use WebSockets to send data in both directions on a schedule, on an event, or when needed. A benefit is that you can send the status (good or bad), let's say, every 500 ms, and you will know within one second if the app has failed (≈ 499 ms + 500 ms).
Using a service like this you can measure uptime, custom events of interest, possible errors and a ton of other metrics within a second. Usually it takes 5-100 ms, but WCET/WCRT is hard to estimate.
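To make the heartbeat idea concrete, here is a minimal sketch of how a receiving service might turn heartbeat arrival times into a downtime estimate. This is not the implementation described above; the 1000 ms gap threshold and the rule "a gap minus one normal interval counts as downtime" are assumptions chosen for illustration.

```python
def downtime_ms(heartbeats_ms, interval_ms=500, max_gap_ms=1000):
    """Estimate downtime from heartbeat arrival times (in milliseconds).

    A gap longer than max_gap_ms between consecutive heartbeats is treated
    as an outage; the part of the gap beyond one normal interval is counted
    as downtime.
    """
    total = 0
    for prev, cur in zip(heartbeats_ms, heartbeats_ms[1:]):
        gap = cur - prev
        if gap > max_gap_ms:
            total += gap - interval_ms
    return total


def availability(heartbeats_ms, interval_ms=500, max_gap_ms=1000):
    """Fraction of the observed window (first to last heartbeat) that was up.

    Needs at least two heartbeats with different timestamps.
    """
    observed = heartbeats_ms[-1] - heartbeats_ms[0]
    return 1 - downtime_ms(heartbeats_ms, interval_ms, max_gap_ms) / observed


# Heartbeats every 500 ms, with one 3-second silence between 1000 and 4000.
beats = [0, 500, 1000, 4000, 4500]
print(downtime_ms(beats))   # 2500
print(availability(beats))
```

The shorter the heartbeat interval, the smaller the unobserved window in which you have to guess what happened.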
To answer your question: you cannot calculate application availability from so few measure points. One check every 5 minutes covers approx. 12 seconds per hour, and you cannot base any reliable calculation on that. You can assume everything was OK between the measure points, but that is called guessing. I have made implementations (for banks) that use 14 400 measure points per hour in order to provide 500 ms accuracy.
I hope you got an answer that helps you with your problem.

DB folder utilising lot of space creating space issue

I have a Grafana Windows server where we have integrated Hyper-V snapshot information as well as CPU and memory usage of the HVs, etc. I can see the folder below on our Grafana Windows server:
C:\InfluxDB\data\telegraf\autogen
Under this autogen folder I can see multiple subfolders with .tsm files. A new file is created every 7 days, and each subfolder is around 4 to 5 GB in size. There are many files in this autogen folder, from 2nd Feb 2017 to 14 Mar 2018, which use around 225 GB of space.
What you see:
autogen is a default Retention Policy (RP) auto-created by InfluxDB, and it has an infinite data retention duration. All datapoints in InfluxDB are logically stored in shards. Physically, shard data is compressed and stored in .tsm files. Shards are grouped into shard groups. Each shard group covers a specific time range, defined by the so-called shard group duration, and stores the datapoints belonging to that time interval. By default, for an RP with a retention duration > 6 months, the shard group duration is set to 7 days.
For more info see docs on storage engine.
Regarding your questions:
"Is there anyway we can shrink the size of autogen file?"
Probably not. The only thing you can do is rely on InfluxDB's internal compression. According to the InfluxDB docs, compression may improve if you increase the shard duration.
Note, however, that InfluxDB drops whole shards rather than separate datapoints, so increasing the shard duration means your data is kept until the whole shard falls outside the current retention duration, and only then is it dropped. With an infinite retention duration this doesn't matter. This leads us to the second question.
"Is it possible to delete the old file under autogen folder?"
If you can afford to lose old data, or can't afford too much storage space, InfluxDB lets you specify a data Retention Policy (RP), already mentioned above. Basically, all your measurements are associated with a specific RP, and data is deleted as soon as its retention duration ends. So if you specify an RP of 1 year, InfluxDB will automatically delete all datapoints older than now() - 1 year. An RP is the standard (and pretty obvious) way of dealing with storage issues. A logical continuation of the RP idea is to group and aggregate your data over longer discrete time intervals as it ages (downsampling). In InfluxDB this can be achieved with Continuous Queries (CQ). You can read more about data retention and downsampling here.
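As a sketch of what that could look like for the setup in the question, the Python snippet below just assembles InfluxQL (1.x) statements as strings, which you would then run in the influx shell. The database name "telegraf" and the RP name "autogen" come from the folder path in the question; the RP name "downsampled", the measurement "cpu" and the field "usage_idle" are assumptions for illustration.

```python
def alter_retention_statement(rp_name, database, duration):
    """InfluxQL to change the retention duration of an existing RP.

    Warning: shortening the duration lets InfluxDB drop shards older than it.
    """
    return f'ALTER RETENTION POLICY "{rp_name}" ON "{database}" DURATION {duration}'


def create_retention_statement(rp_name, database, duration):
    """InfluxQL to create an additional (non-default) RP."""
    return (
        f'CREATE RETENTION POLICY "{rp_name}" ON "{database}" '
        f"DURATION {duration} REPLICATION 1"
    )


def continuous_query_statement(database, source_rp, target_rp,
                               measurement, field, interval):
    """InfluxQL for a CQ that writes per-interval means into another RP."""
    return (
        f'CREATE CONTINUOUS QUERY "cq_{measurement}_{interval}" ON "{database}" BEGIN '
        f'SELECT mean("{field}") AS "{field}" '
        f'INTO "{database}"."{target_rp}"."{measurement}_{interval}" '
        f'FROM "{database}"."{source_rp}"."{measurement}" '
        f"GROUP BY time({interval}), * END"
    )


statements = [
    # Keep raw data for one year only (the example duration from the answer).
    alter_retention_statement("autogen", "telegraf", "52w"),
    # Hypothetical RP that keeps the downsampled data forever.
    create_retention_statement("downsampled", "telegraf", "INF"),
    # Hypothetical measurement/field: hourly means of cpu.usage_idle.
    continuous_query_statement("telegraf", "autogen", "downsampled",
                               "cpu", "usage_idle", "1h"),
]
for statement in statements:
    print(statement)
```

A CQ only processes new data as it arrives, so data that already exists would need a one-off SELECT ... INTO backfill before the raw shards are allowed to expire.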
In conclusion, storage limitations are inevitable, and properly configured retention policies are the way to go.

Should I store a global counter or an aggregated value in a TSDB

This question is really about the data schema. I have a program which has a bunch of discrete events, and I want to get beautiful graphs out.
From my knowledge, I understand that I should really keep a counter of the number of events that have occurred, and on a regular interval, transfer that cumulative counter to the TSDB (as part of a cron job or similar).
What I currently have, though, is a system where the monitor, on a regular interval, tells the TSDB how many events occurred during that interval (a fixed, hard-coded value!).
Which of these two design patterns is better? What are the factors that affect that decision? Do I have a counter value here or is it just a measurement?
I have various concerns, including but not limited to the efficiency of the monitoring tool.
You tagged the question with InfluxDB but it seems like what you are really asking about is the collection agent. For that I would look at Telegraf.
StatsD is also a really great lightweight API that is available for most major languages now, from which you can efficiently emit different types of stats (counters, timings, etc); either for every event or at a sample rate you define.
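For reference, a StatsD counter is just a small text datagram sent over UDP. The sketch below hand-rolls one in Python without a client library; the metric name "app.events" is made up, and the host and port (127.0.0.1:8125, the conventional StatsD port) are assumptions about where your agent listens.

```python
import random
import socket


def format_counter(name, value=1, sample_rate=1.0):
    """Build a StatsD counter datagram, e.g. b"app.events:1|c|@0.1"."""
    payload = f"{name}:{value}|c"
    if sample_rate < 1.0:
        payload += f"|@{sample_rate}"
    return payload.encode("ascii")


def send_counter(name, value=1, sample_rate=1.0, host="127.0.0.1", port=8125):
    """Fire-and-forget a counter increment over UDP.

    With sample_rate < 1 only that fraction of calls is actually sent;
    the receiver scales the value back up using the @rate suffix.
    """
    if sample_rate < 1.0 and random.random() >= sample_rate:
        return
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, value, sample_rate), (host, port))


# Usage (hypothetical metric name): call once per event in the application.
# send_counter("app.events")
print(format_counter("app.events", 1, 0.1))  # b'app.events:1|c|@0.1'
```

Because UDP is connectionless, the application does not block or fail if the agent is down, which is a large part of why this approach is cheap for the monitored program.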
I implemented a solution that gathers metrics emitted from my app using StatsD, metrics that are pulled (JMX queries), and the basic host-level stats you get for free with Telegraf. Every host (30+) runs a single Telegraf instance, which delivers its stats to a centralized InfluxDB server at some interval (e.g. 30 seconds).
So with an approach like that you get a good balance of performance and data precision.
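On the counter-versus-per-interval question itself, one factor is what happens when a report is lost. The following Python sketch (not tied to any particular TSDB; the reset handling is an assumption about how the program restarts) derives per-interval counts from a cumulative counter and shows that a dropped sample costs resolution but not events.

```python
def deltas_from_counter(samples):
    """Convert cumulative counter readings into per-interval event counts.

    A reading lower than its predecessor is treated as a counter reset
    (e.g. the process restarted at zero), so the new reading is the delta.
    """
    deltas = []
    for prev, cur in zip(samples, samples[1:]):
        deltas.append(cur - prev if cur >= prev else cur)
    return deltas


readings = [0, 5, 12, 20]
print(deltas_from_counter(readings))        # [5, 7, 8]

# If the third report never reaches the TSDB, the total is still correct:
with_gap = [0, 5, 20]
print(deltas_from_counter(with_gap))        # [5, 15]

# A restart between the third and fourth reading:
print(deltas_from_counter([0, 5, 12, 3]))   # [5, 7, 3]
```

If the monitor reports per-interval counts instead, a lost report means those events are gone for good, whereas the cumulative counter only needs the query side to compute differences.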

Parse 100GB File Storage Limit

Hey guys, so I developed a social network on iOS and used Parse for the back end. Our app has taken off and over 50,000 images have been posted in ten days. Aside from hitting the 600 req/sec API limit soon, it appears we might fill up the 100 GB storage limit even sooner. Does this limit (file storage) reset monthly, or are you done once you hit 100 GB? It seems like a tiny amount of storage for a PaaS company.
According to the Parse.com website, you receive 2TB of file storage with any package, not 100GB. If you're asking whether they give you an additional 2TB each month, the answer is no. At the beginning of the next month you are still using the space; it does not reset (unlike, for example, bandwidth). This is the case with (probably) all cloud (SaaS, IaaS, PaaS, etc.) providers. You can increase the amount of file storage for 10c/GB per month.
As for database storage, it seems that 100GB is the hard limit. Again, being storage, you do not get an extra 100GB per month.
If your database is larger than 100GB and you are hitting more than 600 req/sec (averaged over a minute - i.e. 36000 req/min) then you may want to consider building your own infrastructure, perhaps in AWS or similar, so you can scale it properly. You may also want to consider moving your uploaded images outside of the database if they are not already - DB storage is considerably more expensive than file storage - both in terms of cost and performance.
Parse.com has larger plans available - up to a point.
HOWEVER - if you are going to be doing 600 requests a second (wow, since that's 50 MILLION requests a day) you'll need to look at two possibilities:
You can keep your requests under this limit by using local caching, streamlined calls, etc.
You will eventually need to migrate off of the Parse ecosystem.
If memory serves, there used to be an option to get a custom plan with more requests/second. It seems to have disappeared from the Pricing page, to be replaced with this:
What is the cost for an app with a burst limit above 600 requests per second? What happens if I require more than 600 requests/second?
We do not provide custom plans for apps that require more than 600 requests per second.
UPDATE: It looks like there is also a hard limit of 100GB of file storage...
The overage rate for database size is $10/GB but we only allow increases in increments of 20GB. When you exceed 20GB of database size we will increase your soft limit to 40GB and begin charging you an incremental $200/month. When you hit your soft limit of 40GB we will increase your soft limit to 60GB... and so on up to a hard limit of 100GB.
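Read literally, that pricing text gives a simple step function. Here is a small Python sketch of one interpretation: $10/GB charged in whole 20GB steps above the first 20GB, up to the 100GB hard limit. The behaviour exactly at a boundary (e.g. at 40GB) is ambiguous in the quoted text, so the choice made here is an assumption.

```python
import math

INCREMENT_GB = 20     # soft limit grows in steps of 20GB
HARD_LIMIT_GB = 100   # no growth beyond this
RATE_PER_GB = 10      # USD per GB per month


def monthly_overage_usd(db_size_gb):
    """Monthly overage charge for a given database size, in USD.

    Assumption: a size exactly on a boundary (20, 40, ...) still fits in
    the current soft limit.
    """
    if db_size_gb > HARD_LIMIT_GB:
        raise ValueError("database size exceeds the 100GB hard limit")
    steps = max(1, math.ceil(db_size_gb / INCREMENT_GB))
    soft_limit = steps * INCREMENT_GB
    return (soft_limit - INCREMENT_GB) * RATE_PER_GB


for size in (10, 25, 45, 95):
    print(size, monthly_overage_usd(size))  # 0, 200, 400, 800
```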

Retention policy to aggregate several metrics with regular expression in graphite

We are storing metrics having build number in the metric name. Here is the format of the metric in graphite.
latency.<host>.<request>.<buildNumber>.average
The issue with the above format is that buildNumber is an ever-changing value; in our case it changes every week because of the release cycle. This results in a new storage file (.wsp) every week, and since Whisper allocates space upfront, we never fully utilize the space because of the changing build number.
I know disk space is a cheap resource, but still, at some point I think we will have a lot of unused space.
For example, if each metric file is 10 MB and we are sending 5000 different latency metrics, then for a particular build number we will use up 50 GB. If we send a new build number every week, 1 TB of disk space will be filled in 20 weeks, which is roughly 5 months: (1 TB = 1000 GB) / (50 GB per week) = 20 weeks.
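The same estimate as a small Python helper, in case you want to plug in other file sizes or release cadences. The function name and parameters are made up for the example, and it uses decimal units (1 GB = 1000 MB) as in the calculation above.

```python
def weeks_until_full(disk_gb, file_mb, metrics_per_build, builds_per_week=1):
    """Weeks until the disk fills when every build creates a fresh set of
    pre-allocated Whisper files (decimal units: 1 GB = 1000 MB)."""
    gb_per_week = file_mb * metrics_per_build * builds_per_week / 1000
    return disk_gb / gb_per_week


# Numbers from the example: 10 MB files, 5000 metrics, 1 TB disk.
print(weeks_until_full(disk_gb=1000, file_mb=10, metrics_per_build=5000))  # 20.0
```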
The above problem could be solved if we could aggregate multiple metrics from the last month into one. Is there any way of specifying a retention policy where multiple metrics are merged into one using some aggregation method?
Or is there any way for tackling this kind of problem in graphite?
If you use the Ceres storage engine for Graphite instead of using Whisper, you will avoid the problems of pre-allocation of space. https://github.com/graphite-project/ceres
I don't believe you can, during downsampling, merge multiple metrics with a specified aggregation. However, you can do this at the point of ingestion via aggregation-rules.conf. Documentation can be found here: http://graphite.readthedocs.org/en/latest/config-carbon.html#aggregation-rules-conf
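To illustrate what such a rule would do with the metric names from the question, the Python sketch below holds a candidate rule line (in the documented "output_template (frequency) = method input_pattern" form, which is read by carbon-aggregator) and mimics the name rewrite with a regex. The "all" node name and the 60-second frequency are assumptions, and the regex is only a stand-in for carbon's own pattern matching.

```python
import re

# Candidate line for aggregation-rules.conf (hypothetical "all" node, 60 s window):
RULE = ("latency.<host>.<request>.all.average (60) = "
        "avg latency.<host>.<request>.*.average")

# Regex stand-in for the input pattern: <host> and <request> capture one
# dot-separated node each, and the build-number node matches anything.
_INPUT = re.compile(
    r"^latency\.(?P<host>[^.]+)\.(?P<request>[^.]+)\.[^.]+\.average$"
)


def aggregate_name(metric):
    """Return the build-independent metric name, or None if it doesn't match."""
    match = _INPUT.match(metric)
    if match is None:
        return None
    return f"latency.{match['host']}.{match['request']}.all.average"


print(aggregate_name("latency.web01.login.1234.average"))
print(aggregate_name("latency.web01.login.1235.average"))
```

Both build numbers collapse into the same output name, so only one .wsp file per host/request pair keeps growing. Since the input values are already averages, the result is an average of averages, which may or may not be acceptable for your latency graphs.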
