Trying to copy measurement data from one measurement to another in a different InfluxDB database, but not finding any queries for this
I have a query to copy from one measurement to another measurement within the same database, but I need to copy into a different database.
Kindly suggest...
You can specify the database name within the INTO clause. See https://docs.influxdata.com/influxdb/v1.7/query_language/data_exploration/#the-into-clause.
INTO <database_name>.<retention_policy_name>.<measurement_name>: writes data to a fully qualified measurement. Fully qualify a measurement by specifying its database and retention policy.
INTO <database_name>..<measurement_name>: writes data to a measurement in a user-specified database and the DEFAULT retention policy.
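For example, a cross-database copy can be written like this; the database names db_old and db_new, the retention policies, and the measurement name are placeholders for your own:

SELECT * INTO "db_new"."autogen"."my_measurement" FROM "db_old"."autogen"."my_measurement" GROUP BY *

The GROUP BY * keeps tags as tags in the target measurement; without it, the original tags are written as fields.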
Related
We know that Prometheus has three phases of data storage:
In-memory: this is where the most recent data is kept. It allows for fast queries using PromQL since the data is in RAM. [Am I wrong?]
After a few hours, the in-memory data is persisted to disk in the form of blocks.
After the data retention period is over, data is stored in remote storage.
I wanted to ask whether it is efficient to query the data stored in remote storage. If I need to monitor a lot of metrics for my org, do I need Grafana Mimir, which handles up to 1 billion active metrics?
Also, as a side question, how many MBs/GBs of metrics can Prometheus store before the retention period is over?
Sparingly, yes. Prometheus won't like it if you try to query over a few years, for example, since it will go to remote storage for everything, but getting metrics from storage for an hour is easy and won't be a problem.
How many MBs/GBs of metrics can Prometheus store? It's irrelevant. The retention period is independent of the amount of data stored. You can store 100 MB in a day or 100 GB in a day; it doesn't matter. What will matter is cardinality.
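For illustration only, this is roughly how remote read is wired up in prometheus.yml; the endpoint URL below is a placeholder for whichever remote storage you use:

remote_read:
  - url: "http://remote-storage.example:9201/read"   # placeholder endpoint
    read_recent: false  # default: ranges fully covered by local storage are answered locally

With read_recent left at false, Prometheus only reaches out to remote storage for time ranges the local TSDB no longer holds, which matches the "query it sparingly" advice above.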
We are using InfluxDB at different industrial sites, where we log up to 10,000 values at sample rates ranging from 1 Hz to 1000 Hz, from 3-5 different machines, resulting in something like 1 GB of data per hour.
The logging is handled by simple HTTP line-protocol calls to an InfluxDB 1.8 server, running on a Xeon 2.5 GHz 10-core machine with 64 GB of RAM and a 6 TB SSD RAID 5 array.
Right now the values are stored in the same database, with a measurement for each machine, under a retention policy of 20 weeks with a shard duration of 1 week.
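(In InfluxQL 1.x terms, that setup corresponds roughly to the statement below; the database and policy names are placeholders.)

CREATE RETENTION POLICY "rp_20w" ON "machine_data" DURATION 20w REPLICATION 1 SHARD DURATION 1w DEFAULT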
The data is mostly visualized through Grafana.
Many people query the database at once through multiple Grafana dashboards, which tends to be fairly slow when large amounts of data are retrieved. No cross-measurement calculations are performed; it is only visual plots.
Will I get any read-speed benefits from using multiple databases instead of a single database with multiple measurements?
When reading data from a database, does InfluxDB need to "open" files containing data from all measurements in order to find the data of a specific measurement?
I'm trying to evaluate Citus and Greenplum for use as a data warehouse. The general idea is that data from multiple OLTP systems will be integrated in real time via Kafka Connect into a central warehouse for analytical queries.
How does Citus compare to Greenplum in this respect? I have read that Citus has some SQL limitations, e.g. correlated subqueries are not supported if the correlation is not on the distribution column; does Greenplum have similar SQL limitations? Will Greenplum work well if data is being streamed into it (as opposed to batch updates)? I'm just having this feeling that Greenplum is more analytics-focused and can sacrifice some OLTP-specific things, which Citus cannot afford since it positions itself as HTAP (not OLAP). Citus also positions itself as a solution for sub-second query times, which is not necessary for my use case; several seconds (up to 5) per query will be satisfactory.
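For illustration, with made-up tables where orders is distributed by customer_id, the restriction described above concerns queries like this one, where the subquery correlates on the distribution column and is therefore fine:

SELECT o.customer_id, o.total
FROM orders o
WHERE o.total > (
    SELECT avg(o2.total)
    FROM orders o2
    WHERE o2.customer_id = o.customer_id  -- correlation on the distribution column
);

Correlating on some other column instead (say, o2.region = o.region) is the pattern that Citus reportedly restricts.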
I am not aware of any SQL limitations in Greenplum like the one you mention above. In some cases, e.g. CUBE or the percentile_* ordered-set aggregate functions, GPORCA, the Greenplum database query optimiser, will fall back to the PostgreSQL query optimiser, and these queries won't be as performant as GPORCA-enabled queries - but you would still get a response to your query.
Getting streaming data in vs. batch updates is another matter: using Kafka Connect with JDBC would work out of the box, but it won't take any advantage of the parallel, distributed nature of Greenplum, as all your data would have to pass through the coordinator.
What would be optimal is to use something like the Greenplum Streaming Server (GPSS), which writes the data delivered from the client directly into the segments of the Greenplum Database cluster, allowing maximum parallelism and the best stream-loading performance.
Beam's GroupByKey groups records by key across all partitions and outputs a single iterable per key per window. This "brings associated data together into one location".
Is there a way I can group records by key locally, so that I still get a single iterable per key per window as the output, but only over the local records in the partition instead of a global group-by-key over all locations?
If I understand your question correctly, you don't want to transfer data over the network if a part of it (a partition) was produced on the same machine and can therefore be grouped locally.
Normally, Beam doesn't give you details of where and how your code will run, since that can vary depending on the runner/engine/resource manager. However, if you can fetch some unique information about your worker (like its hostname, IP, or MAC address), then you can use it as part of your key and group all related data by it. Quite likely these data partitions won't then be moved to other machines, since all the needed input data is already sitting on the same machine and can be processed locally. Though, AFAIK, there is no 100% guarantee of that.
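A minimal sketch of that idea in the Beam Python SDK, assuming the worker's hostname is a sufficiently unique token (which, as noted, the Beam model does not guarantee):

import socket

import apache_beam as beam


def key_by_worker(kv):
    # Prefix the original key with a worker-unique token so that
    # GroupByKey only merges records that carry the same token,
    # i.e. records that were keyed on the same worker.
    key, value = kv
    return ((socket.gethostname(), key), value)


with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create([("a", 1), ("a", 2), ("b", 3)])
     | "KeyByWorker" >> beam.Map(key_by_worker)
     | "GroupLocally" >> beam.GroupByKey()
     | "Print" >> beam.Map(print))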
I'm looking into our company using Prometheus to gather stats from our experiments which run on Kubernetes. There's a plan to use labels to mark the name of specific experiments in our cloud / cluster. This means we will generate a lot of labels which will hog storage over time. When the associated time series have expired, will the labels also be deleted?
tldr; From an operational perspective, Prometheus does not differentiate between time-series names and labels; by deleting your experiment data you will effectively recover the labels you created.
What follows is only relevant to Prometheus >= 2.0
Prometheus stores a time series for each unique combination of metric name, label, and label value. So my_metric{my_tag="a"}, my_metric{my_tag="b"}, and your_metric{} are all just different time series; there is nothing special about labels or label values vs. metric names.
Furthermore, Prometheus stores data in 2-hour blocks on disk. So any labels you've created do not affect the operation of your database after two hours, except for on-disk storage size, and query performance if you actually access that older data. Both of these concerns are addressed once your data is purged. Experiment away!
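As a hedged illustration only (the experiment label name is made up, and the TSDB admin API has to be enabled with --web.enable-admin-api), purging an experiment's series and then cleaning up the tombstones looks roughly like this:

curl -g -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={experiment="exp-42"}'
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'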