Retention Policy doesn't delete data - influxdb

I’m new to InfluxDB and I want to implement a Retention Policy (RP) for my logs.
I loaded some static data using Telegraf and created an RP for it:
CREATE DATABASE test WITH DURATION 60m
but it is not deleting my previous logs.
I have observed that InfluxDB stores timestamps in UTC, whereas my Telegraf server uses system time. Could that be an issue?

I would check two things using the Influx CLI. First, check the retention policies on your DB.
> SHOW RETENTION POLICIES
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
autogen 1h0m0s 1h0m0s 1 true
For example, I can see my autogen policy has a duration of 1 hour and a shardGroupDuration of 1 hour.
Second, check the shards.
> SHOW SHARDS
name: tester
id database retention_policy shard_group start_time end_time expiry_time owners
-- -------- ---------------- ----------- ---------- -------- ----------- ------
130 tester autogen 130 2018-02-20T21:00:00Z 2018-02-20T22:00:00Z 2018-02-20T23:00:00Z
131 tester autogen 131 2018-02-20T22:00:00Z 2018-02-20T23:00:00Z 2018-02-21T00:00:00Z
132 tester autogen 132 2018-02-20T23:00:00Z 2018-02-21T00:00:00Z 2018-02-21T01:00:00Z
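A shard is dropped as a whole once its expiry_time (its end_time plus the retention policy duration) has passed; individual points are not deleted one by one. The retention check also runs only periodically (every 30 minutes by default), so data can remain visible for a while after it becomes older than the policy duration.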
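In the SHOW SHARDS output above, each expiry_time is the shard's end_time plus the 1h policy duration. A minimal Python sketch of that arithmetic follows. The helper names are hypothetical and not part of any InfluxDB client, and the caller supplies the reference time that expiry is compared against:

```python
from datetime import datetime, timedelta, timezone

def shard_expiry(end_time, retention):
    """Return end_time + retention for a 'YYYY-MM-DDTHH:MM:SSZ' timestamp."""
    end = datetime.strptime(end_time, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return end + retention

def is_expired(end_time, retention, reference_time):
    """True when reference_time is later than the shard's expiry time."""
    return reference_time > shard_expiry(end_time, retention)

# Shard 130 from the output above: end_time 22:00, policy duration 1h.
expiry = shard_expiry("2018-02-20T22:00:00Z", timedelta(hours=1))
print(expiry.isoformat())  # 2018-02-20T23:00:00+00:00
```

This matches the expiry_time column for shard 130 (2018-02-20T23:00:00Z).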

Related

InfluxDB 1.7.2 - Top X over time

I’m new to InfluxDB. I’m using it to store ntopng timeseries data.
ntopng writes a measurement called asn:traffic that stores how many bytes were sent and received for an ASN.
> show tag keys from "asn:traffic"
name: asn:traffic
tagKey
------
asn
ifid
> show field keys from "asn:traffic"
name: asn:traffic
fieldKey fieldType
-------- ---------
bytes_rcvd float
bytes_sent float
>
I can run a query to see the data rate in bps for a specific ASN:
> SELECT non_negative_derivative(mean("bytes_rcvd"), 1s) * 8 FROM "asn:traffic" WHERE "asn" = '2906' AND time >= now() - 12h GROUP BY time(30s) fill(none)
name: asn:traffic
time non_negative_derivative
---- -----------------------
1550294640000000000 30383200
1550294700000000000 35639600
...
...
...
>
However, what I would like to do is create a query that I can use to return the top N ASNs by data rate and plot that on a Grafana graph. Sort of like this example that is using ELK.
I've tried a few variants from posts here and elsewhere, but I haven't been able to get what I'm after. For example, this query I think gets me closer to where I want to be, but there are no values in asn:
> select top(bps,asn,10) from (SELECT non_negative_derivative(mean(bytes_rcvd), 1s) * 8 as bps FROM "asn:traffic" WHERE time >= now() - 12h GROUP BY time(30s) fill(none))
name: asn:traffic
time top asn
---- --- ---
1550299860000000000 853572800
1550301660000000000 1197327200
1550301720000000000 1666883866.6666667
1550310780000000000 674889600
1550329320000000000 20979431866.666668
1550332740000000000 707015600
1550335920000000000 2066646533.3333333
1550336820000000000 618554933.3333334
1550339280000000000 669084933.3333334
1550340300000000000 704147333.3333334
>
I then thought that perhaps the subquery also needs to select asn; however, that produces an error about mixing queries:
> select top(bps,asn,10) from (SELECT asn, non_negative_derivative(mean(bytes_rcvd), 1s) * 8 as bps FROM "asn:traffic" WHERE time >= now() - 12h GROUP BY time(30s) fill(none))
ERR: mixing aggregate and non-aggregate queries is not supported
>
Anyone have any thoughts on a solution?
EDIT 1
Per the suggestion by George Shuklin, modifying the query to include asn in GROUP BY displays ASN in the CLI output, but that doesn't translate in Grafana. I'm expecting a stacked graph with each layer of the stacked graph being one of the top 10 asn results.
Try making asn a tag; then you can use GROUP BY time(30s), "asn" (double quotes, since single quotes denote a string literal in InfluxQL), and that tag will be available in the outer query.
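Once asn is part of GROUP BY, each ASN comes back as its own series, and picking the top N becomes a ranking step over whole series. A rough Python sketch of that ranking done client-side follows. It assumes ranking by mean rate, which the post does not specify. The function name is hypothetical and the sample data is made up, except for ASN 2906 and its two values:

```python
from statistics import mean

def top_n_series(series_by_tag, n):
    """Keep the n series with the highest mean value.

    series_by_tag maps a tag value (e.g. an ASN) to a non-empty list of
    (time, bps) points.
    """
    ranked = sorted(
        series_by_tag.items(),
        key=lambda item: mean(value for _, value in item[1]),
        reverse=True,
    )
    return dict(ranked[:n])

# Hypothetical sample: ASNs 64500 and 64501 are placeholders.
series = {
    "2906": [(1550294640000000000, 30383200.0), (1550294700000000000, 35639600.0)],
    "64500": [(1550294640000000000, 1000.0), (1550294700000000000, 3000.0)],
    "64501": [(1550294640000000000, 90000000.0), (1550294700000000000, 70000000.0)],
}
top = top_n_series(series, 2)
print(list(top))  # ['64501', '2906']
```

Each surviving key would then correspond to one layer of the stacked graph.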

How to query from an Influx database with an absent field?

I have a measurement gathered by Telegraf. It has the following structure:
name: smart_device
fieldKey fieldType
-------- ---------
exit_status integer
health_ok boolean
read_error_rate integer
seek_error_rate integer
temp_c integer
udma_crc_errors integer
When I query this database I can do this:
> select * from smart_device where "health_ok" = true limit 1
name: smart_device
time capacity device enabled exit_status health_ok host model read_error_rate seek_error_rate serial_no temp_c udma_crc_errors wwn
---- -------- ------ ------- ----------- --------- ---- ----- --------------- --------------- --------- ------ --------------- ---
15337409500 2000398934016 sda Enabled 0 true osd21 Hitachi HDS722020ALA330 0 0 JK11A4B8JR2EGW 38 0 5000cca222e6384f
and this:
> select * from smart_device limit 1
name: smart_device
time capacity device enabled exit_status health_ok host model read_error_rate seek_error_rate serial_no temp_c udma_crc_errors wwn
---- -------- ------ ------- ----------- --------- ---- ----- --------------- --------------- --------- ------ --------------- ---
1533046990 sda 0 osd21
But when I try to filter out records with empty health_ok, I get empty output:
> select * from smart_device where "health_ok"!= true
>
How can I select measurements with empty (no? null?) health_ok?
Unfortunately, there is currently no way to do this using InfluxQL. InfluxDB is schemaless in the way a document-oriented database is: points in the same measurement can have different sets of fields. Therefore there is no concept of null for a field; the point simply does not have that field. For example, suppose there are 4 rows in the measurement cost:
> select * from cost
name: cost
time isok type value
---- ---- ---- -----
1533970927859614000 true 1 100
1533970938243629700 true 2 101
1533970949371761100 3 103
1533970961571703900 2 104
As you can see, two rows have isok=true and two rows have no field named isok. The only thing you can select is the rows that do have the isok field, with this query:
> select isok from cost
name: cost
time isok
---- ----
1533970927859614000 true
1533970938243629700 true
Since InfluxQL currently does not support subqueries in the WHERE clause, there is no way to query for rows with no isok field. (If InfluxDB supported this type of query, you could write something like SELECT * FROM cost WHERE time NOT IN (SELECT isok FROM cost).)
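The NOT IN idea can be emulated outside the database by running both queries and subtracting one result from the other. The sketch below is only an illustration in Python. It uses the four cost rows from above as plain dictionaries, the function name is hypothetical, and it assumes timestamps are unique within the measurement (with several series you would need to key on time plus tag set):

```python
def rows_missing_field(all_rows, rows_with_field):
    """Return the rows of all_rows whose timestamp is absent from rows_with_field."""
    seen = {row["time"] for row in rows_with_field}
    return [row for row in all_rows if row["time"] not in seen]

# Stand-in for the result of: select * from cost
all_rows = [
    {"time": 1533970927859614000, "isok": True, "type": 1, "value": 100},
    {"time": 1533970938243629700, "isok": True, "type": 2, "value": 101},
    {"time": 1533970949371761100, "type": 3, "value": 103},
    {"time": 1533970961571703900, "type": 2, "value": 104},
]
# Stand-in for the result of: select isok from cost
with_isok = [
    {"time": 1533970927859614000, "isok": True},
    {"time": 1533970938243629700, "isok": True},
]

missing = rows_missing_field(all_rows, with_isok)
print([row["value"] for row in missing])  # [103, 104]
```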
This is not exactly an answer to the original question, but I found a special trick for Kapacitor.
If the query is executed by Kapacitor, it has a special node, default, which lets you add missing fields/tags with a given value.
For the health_ok query it will look like this (TICKscript):
var data = stream
    |from()
        .measurement('smart_device')
    |default()
        .field('health_ok', FALSE)
This lets you assume that if health_ok is missing, it is FALSE.
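The effect of the default node on a single point can be pictured as filling in a key only when it is absent. This is a small Python illustration of that behaviour with a hypothetical helper, not Kapacitor code:

```python
def apply_default(fields, name, value):
    """Return a copy of fields where name is set to value only if it was missing."""
    filled = dict(fields)
    filled.setdefault(name, value)
    return filled

# A point without health_ok gets the default; an existing value is left alone.
print(apply_default({"exit_status": 0}, "health_ok", False))
# {'exit_status': 0, 'health_ok': False}
print(apply_default({"exit_status": 0, "health_ok": True}, "health_ok", False))
# {'exit_status': 0, 'health_ok': True}
```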

influxdb query for 5 top cpu usage

I run shared web hosting using CloudLinux.
From it, I can get a bunch of performance metrics.
So, my InfluxDB layout is:
measurement : lve
fields : CPU,EP,IO,IOPS,MEM,MEMPHY,NETI,NETO,NPROC,fEP,fMEM,fMEMPHY,fNPROC,lCPU,lCPUW,lEP,lIO,lIOPS,lMEM,lMEMPHY,lNETI,lNETO,lNPROC,nCPU
tags : xpool, host, user (where : xpool is the xen-pool uid, host is the CloudLinux hostname, user is the shared hosting username)
Data is gathered every 5 seconds.
How should I write a query to:
Select records from a specific xpool+host, and
get the 5 unique usernames that produce the top CPU usage in a 5-minute period?
There are hundreds of usernames, but I want only the top 5.
Note: Something like example 4 of TOP() from https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#top, except that the expected result is:
name: h2o_feet
time top location
---- --- --------
2015-08-18T00:00:00Z 8.12 coyote_creek
2015-08-18T00:54:00Z 2.054 santa_monica
Rather than :
name: h2o_feet
time top location
---- --- --------
2015-08-18T00:48:00Z 7.11 coyote_creek
2015-08-18T00:54:00Z 6.982 coyote_creek
2015-08-18T00:54:00Z 2.054 santa_monica
2015-08-18T00:24:00Z 7.635 coyote_creek
2015-08-18T00:30:00Z 7.5 coyote_creek
2015-08-18T00:36:00Z 7.372 coyote_creek
2015-08-18T00:00:00Z 8.12 coyote_creek
2015-08-18T00:06:00Z 8.005 coyote_creek
2015-08-18T00:12:00Z 7.887 coyote_creek
Since '8.12' is the highest value of 'coyote_creek' and '2.054' is the highest value of 'santa_monica'
Sincerely
-bino-
A subquery could probably help. For example, this is from a database populated by Telegraf:
SELECT top,host FROM (SELECT TOP(usage_user, 1) AS top, host from cpu WHERE time > now() -1m GROUP BY host)
It will output something like:
name: cpu
time top host
---- --- ----
1527489800000000000 1.4937106918238994 1.host.tld
1527489808000000000 0.3933910306845004 2.host.tld
1527489810000000000 4.17981072555205 3.host.tld
1527489810000000000 0.8654602675059009 4.host.tld
The inner query is:
SELECT TOP(usage_user, 1) AS top, host from cpu WHERE time > now() -1m GROUP BY host
It uses TOP to return only 1 item per host, using the field usage_user.
Then, to "pretty print" the result, an outer query is used:
SELECT top,host FROM (...)
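What TOP(..., 1) with GROUP BY on a tag computes is the single highest point per tag value. Here is a minimal Python sketch of that logic, applied to the h2o_feet sample from the question. The function name and the field name "level" are assumptions made for illustration:

```python
def top_per_tag(points, tag, field):
    """Return the highest-valued point for each distinct tag value, ordered by time."""
    best = {}
    for point in points:
        key = point[tag]
        if key not in best or point[field] > best[key][field]:
            best[key] = point
    return sorted(best.values(), key=lambda p: p["time"])

points = [
    {"time": "2015-08-18T00:48:00Z", "level": 7.11, "location": "coyote_creek"},
    {"time": "2015-08-18T00:54:00Z", "level": 6.982, "location": "coyote_creek"},
    {"time": "2015-08-18T00:54:00Z", "level": 2.054, "location": "santa_monica"},
    {"time": "2015-08-18T00:24:00Z", "level": 7.635, "location": "coyote_creek"},
    {"time": "2015-08-18T00:30:00Z", "level": 7.5, "location": "coyote_creek"},
    {"time": "2015-08-18T00:36:00Z", "level": 7.372, "location": "coyote_creek"},
    {"time": "2015-08-18T00:00:00Z", "level": 8.12, "location": "coyote_creek"},
    {"time": "2015-08-18T00:06:00Z", "level": 8.005, "location": "coyote_creek"},
    {"time": "2015-08-18T00:12:00Z", "level": 7.887, "location": "coyote_creek"},
]

for p in top_per_tag(points, "location", "level"):
    print(p["time"], p["level"], p["location"])
# 2015-08-18T00:00:00Z 8.12 coyote_creek
# 2015-08-18T00:54:00Z 2.054 santa_monica
```

This reproduces the two-row result the question asks for. Limiting it to the top 5 users would be a further sort and slice over these per-tag maxima.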

Influxdb select data from a specific shard

I would like to know if it is possible somehow from the influx CLI to select the data of a specific shard. I would also like to select the series within two timestamps, but I haven't yet found how. Any input would be appreciated, thank you.
Q: I would like to know if it is possible somehow from the influx CLI to select the data of a specific shard.
A: As of InfluxDB 1.3 this is not possible. However, you should be able to work out what data lives in a given shard.
Query to get the shards information:
show shards
It should tell you the start and end date time of the data (across all series in the database) contained in each shard.
For instance
Given Shard info:
id database retention_policy shard_group start_time end_time expiry_time owners
-- -------- ---------------- ----------- ---------- -------- ----------- ------
123 mydb autogen 123 2012-11-26T00:00:00Z 2012-12-03T00:00:00Z 2012-12-03T00:00:00Z
124 mydb autogen 124 2013-01-14T00:00:00Z 2013-01-21T00:00:00Z 2013-01-21T00:00:00Z
125 mydb autogen 125 2013-04-29T00:00:00Z 2013-05-06T00:00:00Z 2013-05-06T00:00:00Z
Given Measurements:
name: measurements
name
----
measurement_abc
measurement_def
measurement_123
Shard 123 will contain all of the data across the measurements above that falls between the start time of 2012-11-26T00:00:00Z and the end time of 2012-12-03T00:00:00Z. That is, running drop shard 123 would make data in that range disappear across all of those measurements.
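Since a shard is just a time range, "selecting a shard" comes down to a time-bounded query per measurement. The Python sketch below uses the shard table above to find which shard covers a timestamp and to build such a query string. The helper names are hypothetical, and treating start_time as inclusive and end_time as exclusive is an assumption:

```python
from datetime import datetime, timezone

# (id, start_time, end_time) copied from the shard info above.
SHARDS = [
    (123, "2012-11-26T00:00:00Z", "2012-12-03T00:00:00Z"),
    (124, "2013-01-14T00:00:00Z", "2013-01-21T00:00:00Z"),
    (125, "2013-04-29T00:00:00Z", "2013-05-06T00:00:00Z"),
]

def parse_utc(ts):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' timestamp as an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def shard_for(timestamp, shards=SHARDS):
    """Return the id of the shard whose [start, end) range covers timestamp, or None."""
    t = parse_utc(timestamp)
    for shard_id, start, end in shards:
        if parse_utc(start) <= t < parse_utc(end):
            return shard_id
    return None

def range_query(measurement, start, end):
    """Build an InfluxQL query selecting one measurement between two timestamps."""
    template = """SELECT * FROM "{m}" WHERE time >= '{s}' AND time < '{e}'"""
    return template.format(m=measurement, s=start, e=end)

print(shard_for("2012-11-28T12:00:00Z"))  # 123
print(range_query("measurement_abc", "2012-11-26T00:00:00Z", "2012-12-03T00:00:00Z"))
```

The same time-bounded WHERE clause also covers the second part of the question, selecting series between two timestamps.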

How can I revise my schema design to support a group by query with a field?

I have a simple set of records that I wish to filter by domain, count and group by urls. These are both fields:
time | domain | url
------------------- | ----------- | ----
1500163196000000000 | www.foo.com | /bar
1500163197000000000 | www.foo.com | /bar
1500163198000000000 | www.foo.com | /baz
When I made a query to group the URL counts, it seems to have grouped all records:
SELECT count(url) FROM logs GROUP BY url
name: logs
tags: url=
time count
---- -----
0 3
How can I revise my schema design to support this group by query? If I turned url and domain into a tag, then that means I have no value and can't insert the data.
For anything that needs to be searched on (filters, group by etc), it needs to be a tag (indexed field).
At least one field is required by the DB when writing and it defaults to using a field called value if none is set. So you can set value=0 as your single field if you do not actually want to query it.
