We are using InfluxDB. Basically, I wish to store my data for one year in total, so I created a default RETENTION POLICY for 365d.
I also want to set up downsampling. I store my events every second, but I don't need them at second resolution for more than 3 months. Basically, after 3 months I prefer to sum the data into daily buckets, and after 6 months I prefer to sum the data at weekly resolution.
I understood from this doc that I can downsample using "continuous queries".
What I also understood is that I have to downsample into another table. This will make my query process much harder because I will have to decide which table to query and maybe combine the data of several tables.
Can I somehow create a downsampling process that keeps all the data in the same table?
Regards,
Ido
To summarize the comments on this question: with InfluxDB, you cannot have multiple retention policies within a single table (measurement). You will have to query across multiple retention policies. You might want to subscribe to this GitHub feature request for Intelligent Rollups and Querying of Aggregated Data.
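For reference, a minimal InfluxQL sketch of what that approach looks like in practice; the database name "mydb", measurement "events", field "value", and the exact durations are assumed here for illustration, not taken from the question:

    -- Raw per-second data kept for 3 months (default policy), daily rollups kept for a year
    CREATE RETENTION POLICY "three_months" ON "mydb" DURATION 90d REPLICATION 1 DEFAULT
    CREATE RETENTION POLICY "one_year" ON "mydb" DURATION 365d REPLICATION 1

    -- Continuous query that sums the raw points into daily buckets under the longer policy
    CREATE CONTINUOUS QUERY "cq_daily_sum" ON "mydb"
    BEGIN
      SELECT sum("value") AS "value"
      INTO "mydb"."one_year"."events_daily"
      FROM "mydb"."three_months"."events"
      GROUP BY time(1d), *
    END

    -- Reads have to name the retention policy (and hence the "table") explicitly, e.g.
    SELECT sum("value") FROM "mydb"."one_year"."events_daily" WHERE time > now() - 30d

A second continuous query writing weekly sums into another retention policy would work the same way; the point of the feature request above is that the engine will not pick the right rollup for you at query time.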
Tag values and series cardinality
InfluxDB creates a new series for every combination of (tag, value) pairs that it sees. An example in the documentation shows this with a tag called email. Series cardinality is a limiting factor on performance. Independent tags have a multiplicative effect on series cardinality.
My data
I process data that naturally breaks down into something I call groups. Think of it like an advertising network that processes customers' ads, where a customer is a "group". I'd like to track how much time and how many resources different groups take to process. I currently have about 1,000 groups and I'm working on growth planning, so let's suppose I might soon have tens or hundreds of thousands. There are other tags with tens or hundreds of values (e.g., hostname). These things are all important to being able to understand our data.
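To put the multiplicative effect in concrete terms: 100,000 group values combined with just 100 hostname values would already allow up to 100,000 × 100 = 10,000,000 distinct series, before any other tags are counted.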
I currently have about half a million series. I don't think I have a lot of data. I'm running InfluxDB 1.2.4; it looks like our InfluxDB version isn't updated very frequently.
My question
This seems like a relatively ordinary need, but it also seems to be one that is going to get me in trouble with InfluxDB.
Am I confused that I'm heading for pain?
Is there a better way to address this need?
Am I outright using the wrong tool?
My Rails app allows users to set up a data feed (typically a REST API) and pulls in results at specific intervals to allow the user to later filter/sort/chart/export the data. An example could be pulling a stock price every 15 minutes and saving its value and a timestamp as a row in a table.
Since there could be many users with many feeds set up, I'm trying to determine the best way to handle all of this data in Rails.
I feel like I should stay away from one large mega table with a feed_id on each row since there could be millions and millions of rows very quickly (50 users with 5 feeds running every 15 minutes would be 25,000 rows per day). Will this get unwieldy too quickly or am I underestimating the power of Rails/Postgres? What is the limit?
Another option I came up with was giving each feed its own table – create a table when the feed is added and save the data there. In discussions I've read it seems like dynamic table creation is frowned upon except in special circumstances and I'm wondering if this one fits the mold.
The last option would be adding a second database - potentially NoSQL like MongoDB. I'd rather keep everything in one DB if possible but if that really will yield the best performance and reliability I'd give it a go.
I would love to hear people's experience and opinions on tackling something like this with Rails.
25,000 rows per day comes to about 10 million per year. In this case you're well within the limits of PostgreSQL for many years. Stock prices are mostly numeric, so, if I were you, I'd have a simple SQL table for all this data. Just avoid extra-long rows (text columns) and you should be fine.
In the future you could further extend your solution with partitioning (e.g. monthly or yearly) or move older data to an archive.
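For illustration, a minimal sketch of what that could look like with PostgreSQL declarative partitioning (version 11 or later); the table and column names here are invented, not from the question:

    -- Parent table partitioned by the reading timestamp
    CREATE TABLE feed_results (
        feed_id     bigint      NOT NULL,
        recorded_at timestamptz NOT NULL,
        value       numeric     NOT NULL
    ) PARTITION BY RANGE (recorded_at);

    CREATE INDEX ON feed_results (feed_id, recorded_at);

    -- One partition per month, created ahead of time or by a scheduled job
    CREATE TABLE feed_results_2017_01 PARTITION OF feed_results
        FOR VALUES FROM ('2017-01-01') TO ('2017-02-01');

    -- Later, archiving or deleting a whole month is just dropping its partition
    DROP TABLE feed_results_2017_01;

Queries that filter on recorded_at will only touch the relevant partitions, so the "one big table" approach keeps working as the row count grows.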
I have a couple of thousand time-series covering several years at second-granularity. I'd like to store the data in a suitable DB (i.e. one that scales well and can retain all data at original granularity, e.g. Druid, openTSDB or similar). The goal is to be able to view the data in a browser (e.g. by entering a time frame and ideally having zoom/pan functionality).
To limit the number of data points that my webserver needs to handle, I'd like to have functionality which seems to work out of the box for Graphite/Grafana (which, if I understand correctly, is not a good choice for long-term retention of data):
a time-series chart in Grafana will limit data by querying aggregations from Graphite (e.g. return the mean value over 30m buckets when zoomed out, while showing all data when zoomed in).
Now the questions:
are there existing visualization tools for time-series DBs that provide this functionality?
are there existing charting frameworks that allow me to customize the data queried per zoom level?
Feedback on the choice of DB is also welcome (open-source preferred).
You can absolutely store multiple years of data in Graphite. The issue you'll have is that Graphite selects the aggregation level to read from by locating the highest-resolution archive that covers the requested interval, so you can't automatically take advantage of aggregation to both have efficient long-term graphs and the ability to drill down to the raw data for a time period in the past.
One way to get around this problem is to use carbon-aggregator to generate multiple output series with different intervals from your input series, so you can have my.metric.raw, my.metric.10min, my.metric.1hr, etc. You'd combine that with a carbon schema that defines a matching interval and retention for each of the series, so my.metric.raw is stored at 1-second resolution, .10min at 10-minute resolution, etc.
If you do that, then in Grafana you can use a template variable to choose which interval you want to graph from: you'd define a variable $aggregation with options raw, 10min, etc., and write your queries like my.metric.$aggregation.
That will give you the performance that you need with the ability to drill into the raw data.
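As a rough illustration only (the metric names, intervals, retentions, and file names below are examples, not a recommendation), the two pieces of configuration might look something like this:

    # aggregation-rules.conf (read by carbon-aggregator)
    # output_metric (interval_in_seconds) = method input_pattern
    my.metric.10min (600) = avg my.metric.raw
    my.metric.1hr (3600) = avg my.metric.raw

    # storage-schemas.conf (read by carbon-cache)
    [metrics_raw]
    pattern = \.raw$
    retentions = 1s:90d

    [metrics_10min]
    pattern = \.10min$
    retentions = 10m:2y

    [metrics_1hr]
    pattern = \.1hr$
    retentions = 1h:5y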
That said, we generally find that while everyone thinks they want lots of historical data at high granularity, it's almost never actually used and is typically an unneeded expense. That may not be the case for you, but think carefully about the actual use-cases when designing the system.
What is the best way to store data in such a way that I can get real-time answers to queries like "give me the count of failed transactions over the last 2 weeks" or "give me the count of accounts created in the last 2 years"? Counting the rows every time is not an option, as the number of individual entries in the table is huge and a count may take hours to compute.
I am only interested in finding aggregates in real time in a rolling-window fashion. Also, I do not want to retain data older than 2 years and want it to be removed automatically.
Is there any standard way of solving this problem? Would services like Redshift/Kinesis be helpful?
Thanks in anticipation.
For most data warehousing solutions, we construct aggregate tables with resolutions down to the business date, which makes reporting on 2 or more years of data very fast. Kinesis can certainly help Redshift ingest data at a high throughput, which would then allow you to update the day's aggregation counts in real time. The only difficulty with this approach is that you need to know ahead of time what aggregations you want to report on so you can set them up, but a decent business analyst should be able to provide you with the majority of the covering metrics at the start.
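A minimal sketch of that pattern; the table, columns, and status values below are invented for illustration:

    -- Daily aggregate table, one row per business date and status
    CREATE TABLE daily_transaction_counts (
        business_date date        NOT NULL,
        status        varchar(16) NOT NULL,
        txn_count     bigint      NOT NULL,
        PRIMARY KEY (business_date, status)
    );

    -- "Failed transactions in the last 2 weeks" becomes a scan of at most 14 tiny rows
    SELECT SUM(txn_count)
    FROM daily_transaction_counts
    WHERE status = 'FAILED'
      AND business_date >= CURRENT_DATE - 14;

    -- The 2-year retention is then a cheap scheduled delete
    DELETE FROM daily_transaction_counts
    WHERE business_date < CURRENT_DATE - 730;

The current day's row is the one the streaming ingest keeps incrementing as events arrive.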
I'm trying to establish whether Amazon SimpleDB is suitable for a subset of data I have.
I have thousands of deployed autonomous sensor devices recording data.
Each sensor device essentially reports a couple of values four times an hour each day, over months and years. I need to keep all of this data for historic statistical analysis. Generally, it is write once, read many times. Server-based applications run regularly to query the data to infer other information.
The rows of data today, in SQL, look something like this:
(id, device_id, utc_timestamp, value1, value2)
Our existing MySQL solution is not going to scale up much further, with tens of millions of rows. We query things like "tell me the sum of all the value1 yesterday" or "show me the average of value2 in the last 8 hours". We do this in SQL but can happily change to doing it in code. SimpleDB's "eventual consistency" appears fine for our purposes.
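For concreteness, those rollups look roughly like this in SQL today; the table name "readings" is assumed, with the columns shown above:

    -- "sum of all the value1 yesterday"
    -- (utc_timestamp is quoted since it can clash with the built-in UTC_TIMESTAMP function)
    SELECT SUM(value1)
    FROM readings
    WHERE `utc_timestamp` >= CURDATE() - INTERVAL 1 DAY
      AND `utc_timestamp` <  CURDATE();

    -- "average of value2 in the last 8 hours"
    SELECT AVG(value2)
    FROM readings
    WHERE `utc_timestamp` >= NOW() - INTERVAL 8 HOUR;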
I'm reading up all I can and am about to start experimenting with our AWS account, but it's not clear to me how the various SimpleDB concepts (items, domains, attributes, etc.) relate to our domain.
Is SimpleDB an appropriate vehicle for this and what would a generalised approach be?
PS: We mostly use Python, but this shouldn't matter when considering this at a high level. I'm aware of the boto library at this point.
Edit:
Continuing to search on solutions for this I did come across Stack Overflow question What is the best open source solution for storing time series data? which was useful.
Just following up on this one many months later...
I did actually have the opportunity to speak to Amazon directly about this last summer, and eventually got access to the beta programme for what eventually became DynamoDB, but was not able to talk about it.
I would recommend it for this sort of scenario, where you need a primary key and what might be described as a secondary index/range, e.g. timestamps. This allows you much greater confidence in searches, i.e. "show me all the data for device X between Monday and Friday".
We haven't actually moved to this yet for various reasons but do still plan to.
http://aws.amazon.com/dynamodb/
In my opinion, Amazon SimpleDB, as well as Microsoft Azure Tables, is a fine solution as long as your queries are quite simple. As soon as you try to do things that are a complete non-issue on relational databases, like aggregates, you begin to run into trouble. So if you are going to do some heavy reporting it might get messy.
It sounds like your problem may be best handled by a round-robin database (RRD). An RRD stores time-variable data in such a way that the file size never grows beyond its initial setting. It's extremely cool and very useful for generating graphs and time-series information.
I agree with Oliver Weichhold that a cloud-based database solution will handle the use case you described. You can spread your data across multiple SimpleDB domains (like partitions) and store your data in a way that lets most of your queries be executed from a single domain without having to traverse the entire database. Defining your partition strategy will be key to the success of moving towards a cloud-based DB. Data set partitioning is discussed here.