Is it possible to treat different measurements in InfluxDB with different retention policies?
This is entirely possible with InfluxDB. To do this, you'll need to create a database that has two retention policies and then write each sensor's data to the associated retention policy.
Example:
$ influx
> create database mydb
> create retention policy rp_1 on mydb duration 1h replication 1
> create retention policy rp_2 on mydb duration 2h replication 1
Now that our retention policies have been created, we simply write data in the following manner:
Sensor 1 will write data to rp_1
curl "http://localhost:8086/write?db=mydb&rp=rp_1" --data-binary SOMEDATA
Sensor 2 will write data to rp_2
curl "http://localhost:8086/write?db=mydb&rp=rp_2" --data-binary SOMEDATA
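The same works from client libraries; for example, a minimal sketch with the official Python client, whose write_points accepts a retention_policy argument (the measurement and field names here are made up):

from influxdb import InfluxDBClient

client = InfluxDBClient('localhost', 8086, database='mydb')

point = [{'measurement': 'sensor_readings',  # hypothetical measurement
          'fields': {'value': 0.64}}]

# Sensor 1's data goes to rp_1, sensor 2's to rp_2
client.write_points(point, retention_policy='rp_1')
client.write_points(point, retention_policy='rp_2')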
Related
If I create multiple retention policies on a measurement, when I query the measurement, do I have to use
select * from RetentionPolicy.measurement
It looks to me like points under different retention policies have to be queried with the retention policy name as a prefix on the measurement name?
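For what it's worth, that matches InfluxDB 1.x behavior: an unqualified query reads only the database's DEFAULT retention policy, and any other retention policy must be fully qualified. Continuing the rp_1/rp_2 example above with the Python client:

# Unqualified: reads mydb's DEFAULT retention policy only
client.query('SELECT * FROM "sensor_readings"')

# Fully qualified: reads the same measurement under rp_2
client.query('SELECT * FROM "rp_2"."sensor_readings"')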
So, I can plainly see per-container stats in the Insights tab of AKS. These must come from somewhere, but I can only find per-node stats when querying logs/metrics. How can I query this (in order to build a workbook)?
That data is in the Perf table in the LogManagement section.
The documentation page on How to query logs from Azure Monitor for containers has example queries you can start with.
Querying this data takes a bit of parsing, because the Computer field always shows the name of the node the data was gathered from, not the pod. To get pod/container-specific data, you have to look at records with ObjectName == 'K8SContainer' and parse the InstanceName field, which contains the data you need. InstanceName is built like this: /subscriptions/SUBSCRIPTIONID/resourceGroups/RESOURCEGROUPNAME/providers/Microsoft.ContainerService/managedClusters/CLUSTERNAME/PODUID/CONTAINERNAME. Given that, we can parse out the PodUid and join with KubePodInventory to get the identifying information for that pod.
Here's an example query:
Perf
| where ObjectName == 'K8SContainer' and TimeGenerated > ago(1m)
| extend PodUid = tostring(split(InstanceName, '/', 9)[0]), Container = tostring(split(InstanceName, '/', 10)[0])
| join kind=leftouter (KubePodInventory | summarize arg_max(TimeGenerated, *) by PodUid) on PodUid
| project TimeGenerated, ClusterName, Namespace, Pod = Name, Container, PodIp, Node = Computer, CounterName, CounterValue
This query produces a result that should contain the data you need.
As a side note - the Computer field always shows the node name because that's where the OMS agent runs. It gathers statistics at the node level, but those stats include the memory and CPU usage of each cgroup, the isolation and limiting technology that backs containers' CPU and memory, just as namespaces are used to separate networking, filesystems, and process IDs.
To create indexes, GeoMesa creates multiple tables in HBase. I have a few questions:
What does GeoMesa do to ensure these tables stay in sync?
What will the impact on GeoMesa queries be if the index tables are not in sync?
What happens (with write calls) if GeoMesa is not able to write to one of the index tables?
Is synchronization between tables best-effort, or does GeoMesa ensure the availability of data with eventual consistency?
I am planning to use GeoMesa with HBase (backed by S3) to store my geospatial data; the data size can grow to terabytes, even petabytes.
I am investigating how reliable GeoMesa is in terms of synchronization between the primary and index tables.
HBase Tables:
catalog1
catalog1_node_id_v4 (Main Table)
catalog1_node_z2_geom_v5 (Index Table)
catalog1_node_z3_geom_lastUpdateTime_v6 (Index Table)
catalog1_node_attr_identifier_geom_lastUpdateTime_v8 (Index Table)
Geomesa Schema
geomesa-hbase describe-schema -c catalog1 -f node
INFO Describing attributes of feature 'node'
key | String
namespace | String
identifier | String (Attribute indexed)
versionId | String
nodeId | String
latitude | Integer
longitude | Integer
lastUpdateTime | Date (Spatio-temporally indexed)
tags | Map
geom | Point (Spatio-temporally indexed) (Spatially indexed)
User data:
geomesa.index.dtg | lastUpdateTime
geomesa.indices | z3:6:3:geom:lastUpdateTime,z2:5:3:geom,id:4:3:,attr:8:3:identifier:geom:lastUpdateTime
GeoMesa does not do anything to sync indices - generally this should be taken care of in your ingest pipeline.
If you have a reliable feature ID tied to a given input feature, then you can write that feature multiple times without causing duplicates. During ingest, if a batch of features fails due to a transient issue, then you can just re-write them to ensure that the indices are correct.
For HBase, when you call flush or close on a feature writer, the pending mutations will be sent to the cluster. Once that method returns successfully, then the data has been persisted to HBase. If an exception is thrown, you should re-try the failed features. If there are subsequent HBase failures, you may need to recover write-ahead logs (WALs) as per standard HBase operation.
A feature may also fail to be written due to validation (e.g. a null geometry). In this case, you would not want to re-try the feature as it will never ingest successfully. If you are using the GeoMesa converter framework, you can pre-validate features to ensure that they will ingest ok.
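As a concrete illustration of that retry logic, here is a minimal sketch in Python; write_batch and validate are hypothetical placeholders for your own pipeline's write and validation steps, not GeoMesa API calls:

import time

def ingest_with_retries(features, write_batch, validate, max_retries=3):
    # Pre-validate: drop features that can never ingest (e.g. a null
    # geometry), since re-trying those would fail forever
    pending = [f for f in features if validate(f)]
    for attempt in range(max_retries):
        try:
            # write_batch stands in for writing and flushing a feature
            # writer; once it returns, the mutations are persisted
            write_batch(pending)
            return
        except IOError:
            # Transient failure: stable feature IDs make re-writes
            # idempotent, so the whole batch can simply be retried
            time.sleep(2 ** attempt)
    raise RuntimeError('batch failed after %d attempts' % max_retries)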
If you do not have an ingest pipeline already, you may want to check out geomesa-nifi, which will let you convert and validate input data, and retry failures automatically through NiFi flows.
Question
Is it possible to delete measurement data using a time range, for a specific retention policy?
DELETE
FROM "SensorData"."Quarantine"./.*/
WHERE "time" >= '2018-02-28T02:26:08.0000000Z'
AND "time" <= '2018-02-28T02:27:08.0000000Z'
is our current attempt at a query to drop all data in a time range; however, DELETE doesn't appear to accept a database or retention policy qualifier.
Possible XY Problem
The reason I ask (I suspect it's an unsolved XY problem; see github://influxdata/influxdb#8088) is step 3 below.
We have a database called SensorData that has a primary-buffer default retention policy of 30d, so we don't run out of disk space.
However, if the sensors register an 'exceedance', we are required to keep that data, plus an hour either side, for evidence. We call this Quarantine.
We have so far implemented this as a retention policy called Quarantine.
So we have Primary and Quarantine, and possibly, in the future, some sort of high-frequency retention policy that might be downsampled to Primary.
The XY problem is: "How do you transactionally copy/move/change the retention policy on some recorded data in Influx?"
Our solution (after failing to find one) was, e.g.:
1. Create a temp DB, named such as to uniquely identify an in-progress quarantine operation:
create "TempDB_Quarantine_"+startUnixTime+"_"+endUnixTime
2. Copy the data from Primary to TempDB:
Copy Primary -> TempDB
3. Delete the data from Primary:
Delete Primary
4. Copy the data from TempDB to Quarantine:
Copy TempDB -> Quarantine
5. Drop TempDB:
Drop TempDB
This would allow rollback for a failed operation, or rollback/resume in the case of a crash.
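To make the flow concrete, here is a hedged sketch using the official Python client; the epoch window, the measurement name, and TempDB's autogen retention policy are all assumptions (SELECT INTO does accept fully qualified db.rp.measurement targets, and method='POST' needs a reasonably recent influxdb-python):

from influxdb import InfluxDBClient

# Hypothetical quarantine window, in unix seconds
start, end = 1519784768, 1519784828
tmp = 'TempDB_Quarantine_%d_%d' % (start, end)
window = 'WHERE time >= %ds AND time <= %ds' % (start, end)

client = InfluxDBClient('localhost', 8086, database='SensorData')
client.create_database(tmp)  # step 1
# step 2: copy Primary -> TempDB (data-modifying queries must be POSTed;
# GROUP BY * keeps tags as tags instead of rewriting them as fields)
client.query('SELECT * INTO "%s"."autogen"."MeasurementName" '
             'FROM "Primary"."MeasurementName" %s GROUP BY *' % (tmp, window),
             method='POST')
# step 3: delete from Primary (note: DELETE hits every RP in SensorData)
client.query('DELETE FROM "MeasurementName" %s' % window, method='POST')
# step 4: copy TempDB -> Quarantine
client.query('SELECT * INTO "Quarantine"."MeasurementName" '
             'FROM "%s"."autogen"."MeasurementName" %s GROUP BY *' % (tmp, window),
             method='POST')
client.drop_database(tmp)  # step 5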
Chronograf was being really funky with parsing the query, causing a lot of confusion.
Influx (as of 1.4) does not have the ability to delete data for a specific retention policy, and Chronograf did not have the ability to parse the delete command without a database specified.
What ended up working was (via an API) calling
DELETE FROM /.*/ WHERE "time" >= '2018-02-28T02:26:08.0000000Z' AND "time" <= '2018-02-28T02:27:08.0000000Z'
The database isn't specified in the query because it was supplied elsewhere in the API call.
This is expected to be equivalent to calling use SensorData on the line before in the CLI.
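For reference, a minimal sketch of that API call with the official Python client (the database is supplied to the client rather than in the query):

from influxdb import InfluxDBClient

client = InfluxDBClient('localhost', 8086, database='SensorData')
# Data-modifying statements must be sent as POST requests
client.query("DELETE FROM /.*/ "
             "WHERE \"time\" >= '2018-02-28T02:26:08.0000000Z' "
             "AND \"time\" <= '2018-02-28T02:27:08.0000000Z'",
             method='POST')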
So for now the workaround is to just delete the data for all RPs and hope you don't need a high-frequency retention policy in the future.
If you just want to change the retention policy on some data range, I suggest copying that data range into another retention policy:
USE "SensorData"
SELECT *
INTO "Quarantine"."MeasurementName"
FROM "Primary"."MeasurementName"
WHERE "time" >= '2018-02-28T02:26:08.0000000Z'
AND "time" <= '2018-02-28T02:27:08.0000000Z'
The data will be deleted from "Primary"."MeasurementName" as usual once the duration specified for the "Primary" RP ends (30 days), while the copied range will be preserved in the "Quarantine" RP.
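If you want to run the copy programmatically, a minimal sketch with the official Python client (adding GROUP BY * preserves tags as tags; without it, SELECT INTO rewrites tags as fields):

from influxdb import InfluxDBClient

client = InfluxDBClient('localhost', 8086, database='SensorData')
# INTO queries modify data, so they must be POSTed
client.query('SELECT * INTO "Quarantine"."MeasurementName" '
             'FROM "Primary"."MeasurementName" '
             "WHERE \"time\" >= '2018-02-28T02:26:08.0000000Z' "
             "AND \"time\" <= '2018-02-28T02:27:08.0000000Z' "
             'GROUP BY *',
             method='POST')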
And if you want to delete the data from Primary immediately, you can try the following:
USE "SensorData"."Primary"
DELETE
FROM "MeasurementName"
WHERE "time" >= '2018-02-28T02:26:08.0000000Z'
AND "time" <= '2018-02-28T02:27:08.0000000Z'
Specifying the database and retention policy in DELETE is not supported via InfluxQL. I hope it will be in the future, or in IFQL.
For now I would recommend using a different measurement for the aggregated data.
Is there a way to add a tag to an existing entry in an InfluxDB measurement? If not in the existing measurement, is there a way to insert the records with a new tag into a new measurement?
Currently I have a set of measurements that should probably be entries in a single measurement, where their current measurement names become tag values in the superset of the merged measurements.
e.g.
show measurements
measurement1
measurement2
measurement3
measurement4
should instead be tags on the data in each measurement, unioned to form a single measurement joinedmeasurement with an indexed tag whose values are measurement1, measurement2, ...
It would have to be done manually via queries.
For example, in python using the official client:
from influxdb import InfluxDBClient

client = InfluxDBClient('localhost', database='my_db')

measurement = 'measurement1'
db_data = client.query('select value from %s' % measurement)

# Re-tag each point with its source measurement; note that 'tags' must be
# a dict of key/value pairs (the tag key 'source' is an arbitrary choice)
data_to_write = [{'measurement': 'joinedmeasurement',
                  'tags': {'source': measurement},
                  'time': d['time'],
                  'fields': {'value': d['value']},
                  }
                 for d in db_data.get_points()]

client.write_points(data_to_write)
And so on for the rest of the measurements. You can run the above in a loop to do all of them in one go, as sketched below.
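For instance, a minimal sketch of that loop, pulling the measurement list from the server (get_list_measurements is part of the official client; the tag key 'source' matches the example above):

from influxdb import InfluxDBClient

client = InfluxDBClient('localhost', database='my_db')

# Migrate every measurement into joinedmeasurement in one pass;
# get_list_measurements() returns dicts like {'name': 'measurement1'}
for m in client.get_list_measurements():
    measurement = m['name']
    if measurement == 'joinedmeasurement':
        continue  # skip the target measurement when re-running
    db_data = client.query('select value from "%s"' % measurement)
    data_to_write = [{'measurement': 'joinedmeasurement',
                      'tags': {'source': measurement},
                      'time': d['time'],
                      'fields': {'value': d['value']},
                      }
                     for d in db_data.get_points()]
    client.write_points(data_to_write)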
Consider using named fields, though, in addition to tags. The above example only uses one field, but you can have as many as you want.
This improves performance further, though fields are not indexed, so do not use them for values your queries filter on.