Influxdb select data from a specific shard - influxdb

I would like to know whether it is possible to select the data of a specific shard from the influx CLI. I would also like to select the series between two timestamps, but I haven't yet found out how. Any input would be appreciated, thank you.

Q: I would like to know if it is possible somehow from the influx CLI to select the data of a specific shard.
A: As of InfluxDB 1.3 this is not possible. However, you should be able to work out what data lives in a given shard.
Query to get the shard information:
show shards
It should tell you the start and end time of the data (across all series in the database) contained in each shard.
For instance, given this shard info:
id database retention_policy shard_group start_time end_time expiry_time owners
-- -------- ---------------- ----------- ---------- -------- ----------- ------
123 mydb autogen 123 2012-11-26T00:00:00Z 2012-12-03T00:00:00Z 2012-12-03T00:00:00Z
124 mydb autogen 124 2013-01-14T00:00:00Z 2013-01-21T00:00:00Z 2013-01-21T00:00:00Z
125 mydb autogen 125 2013-04-29T00:00:00Z 2013-05-06T00:00:00Z 2013-05-06T00:00:00Z
Given Measurements:
name: measurements
name
----
measurement_abc
measurement_def
measurement_123
Shard 123 will contain all of the data, across the measurements listed above, that falls between the start time 2012-11-26T00:00:00Z and the end time 2012-12-03T00:00:00Z. That is, running drop shard 123 would make the data in that range disappear from all of those measurements.
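The question also asked how to select series between two timestamps. Since a shard is identified by its time window, a WHERE clause on time over that window returns the same points the shard holds for a measurement. Below is a minimal sketch that only builds the InfluxQL string; the helper name shard_window_query is hypothetical, and treating end_time as exclusive is an assumption based on one shard's end_time matching the next shard's start_time.

```python
def shard_window_query(measurement: str, start_time: str, end_time: str) -> str:
    """Build an InfluxQL query for one measurement in [start_time, end_time)."""
    # InfluxQL uses double quotes for identifiers and single quotes for time strings.
    return (
        f'SELECT * FROM "{measurement}" '
        f"WHERE time >= '{start_time}' AND time < '{end_time}'"
    )


# Time window of shard 123, taken from the show shards output above.
query = shard_window_query(
    "measurement_abc", "2012-11-26T00:00:00Z", "2012-12-03T00:00:00Z"
)
print(query)
```

The printed statement can be pasted into the influx CLI; repeat it per measurement to cover everything the shard contains.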

Related

What are series and bucket in InfluxDb

While trying to understand different concepts of InfluxDB I came across this documentation, which compares its terms with those of an SQL database.
An InfluxDB measurement is similar to an SQL database table.
InfluxDB tags are like indexed columns in an SQL database.
InfluxDB fields are like unindexed columns in an SQL database.
InfluxDB points are similar to SQL rows.
But there are a couple of other terms I came across that I could not clearly understand, and I am wondering whether there is an SQL equivalent for them.
Series
Bucket
From what I understand from the documentation, a series is the collection of data that share a retention policy, measurement, and tag set.
Does this mean a series is a subset of data in a database table? Or is it like a database view?
I could not find any documentation explaining buckets. I guess this is a new concept in the 2.0 release.
Can someone please clarify these two concepts?
I have summarized my understanding below:
A bucket is a named location with a retention policy where time-series data is stored.
A series is a logical grouping of data defined by a shared measurement, tag set, and field key.
A measurement is similar to an SQL database table.
A tag is similar to indexed columns in an SQL database.
A field is similar to unindexed columns in an SQL database.
A point is similar to SQL row.
For example, a SQL table workdone:
+--------------------+--------+---------------------+-----------+
| Email              | Status | time                | Completed |
+--------------------+--------+---------------------+-----------+
| lorr@influxdb.com  | start  | 1636775801000000000 | 76        |
| lorr@influxdb.com  | finish | 1636775868000000000 | 120       |
| marv@influxdb.com  | start  | 1636775801000000000 | 0         |
| marv@influxdb.com  | finish | 1636775868000000000 | 20        |
| cliff@influxdb.com | start  | 1636775801000000000 | 54        |
| cliff@influxdb.com | finish | 1636775868000000000 | 56        |
+--------------------+--------+---------------------+-----------+
The columns Email and Status are indexed.
Hence:
Measurement: workdone
Tags: Email, Status
Field: Completed
Series (Cardinality = 3 x 2 = 6):
Measurement: workdone; Tags: Email: lorr@influxdb.com, Status: start; Field: Completed
Measurement: workdone; Tags: Email: lorr@influxdb.com, Status: finish; Field: Completed
Measurement: workdone; Tags: Email: marv@influxdb.com, Status: start; Field: Completed
Measurement: workdone; Tags: Email: marv@influxdb.com, Status: finish; Field: Completed
Measurement: workdone; Tags: Email: cliff@influxdb.com, Status: start; Field: Completed
Measurement: workdone; Tags: Email: cliff@influxdb.com, Status: finish; Field: Completed
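The series count above can be checked with a short sketch in plain Python (no InfluxDB client involved). It assumes the six rows of the workdone example and derives each series key from the measurement, the tag values, and the field key; the variable names are illustrative only.

```python
# Rows of the workdone example: (Email, Status, time, Completed).
rows = [
    ("lorr@influxdb.com", "start", 1636775801000000000, 76),
    ("lorr@influxdb.com", "finish", 1636775868000000000, 120),
    ("marv@influxdb.com", "start", 1636775801000000000, 0),
    ("marv@influxdb.com", "finish", 1636775868000000000, 20),
    ("cliff@influxdb.com", "start", 1636775801000000000, 54),
    ("cliff@influxdb.com", "finish", 1636775868000000000, 56),
]

MEASUREMENT = "workdone"
FIELD_KEY = "Completed"

# A series key is the measurement plus the tag set plus the field key;
# the timestamp and the field value are not part of it.
series = {
    (MEASUREMENT, email, status, FIELD_KEY)
    for email, status, _time, _value in rows
}

print(len(series))
```

Adding more points for an existing Email/Status pair would not change the count; adding a new tag value would.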
Splitting a logical series across multiple buckets may not improve performance, and it may complicate Flux queries because they then need to include multiple buckets.
The InfluxDB document that you link to has an example of what a series is, even if they don't label it as such. In InfluxDB, you can think of each combination of measurement and tags as being in its own "table". The documentation splits it like this.
This table in SQL:
+---------+---------+---------------------+--------------+
| park_id | planet | time | #_foodships |
+---------+---------+---------------------+--------------+
| 1 | Earth | 1429185600000000000 | 0 |
| 2 | Saturn | 1429185601000000000 | 3 |
+---------+---------+---------------------+--------------+
Becomes these two Series in InfluxDb:
name: foodships
tags: park_id=1, planet=Earth
----
name: foodships
tags: park_id=2, planet=Saturn
...etc...
This has implications when you query for the data, and is also the reason for the recommendation not to use tags whose values have high cardinality. For example, if you had a tag of temperature (especially if it was precise to multiple decimal places), then InfluxDB would be creating a "table" for each potential combination of tag values.
A Bucket is much easier to understand. It's just a combination of a database with a retention policy. In previous versions of InfluxDb these were separate concepts which have now been combined.
According to the InfluxDB glossary:
Bucket
A bucket is a named location where time-series data is stored in InfluxDB 2.0. In InfluxDB 1.8+, each combination of a database and a retention policy (database/retention-policy) represents a bucket. Use the InfluxDB 2.0 API compatibility endpoints included with InfluxDB 1.8+ to interact with buckets.
Series
A logical grouping of data defined by shared measurement, tag set, and field key.

influxdb query for 5 top cpu usage

I run shared web hosting on CloudLinux.
From it, I can get a number of performance metrics.
So, my InfluxDB schema is:
measurement : lve
fields : CPU,EP,IO,IOPS,MEM,MEMPHY,NETI,NETO,NPROC,fEP,fMEM,fMEMPHY,fNPROC,lCPU,lCPUW,lEP,lIO,lIOPS,lMEM,lMEMPHY,lNETI,lNETO,lNPROC,nCPU
tags : xpool, host, user (where xpool is the Xen pool UID, host is the CloudLinux hostname, and user is the shared-hosting username)
Data is gathered every 5 seconds.
What query would:
Select records from a specific xpool+host, and
get the 5 unique usernames that produced the top CPU usage in a 5-minute period?
There are hundreds of usernames, but I only want the top 5.
Note: This is similar to example 4 of TOP() from https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#top, except that the expected result is:
name: h2o_feet
time top location
---- --- --------
2015-08-18T00:00:00Z 8.12 coyote_creek
2015-08-18T00:54:00Z 2.054 santa_monica
Rather than :
name: h2o_feet
time top location
---- --- --------
2015-08-18T00:48:00Z 7.11 coyote_creek
2015-08-18T00:54:00Z 6.982 coyote_creek
2015-08-18T00:54:00Z 2.054 santa_monica
2015-08-18T00:24:00Z 7.635 coyote_creek
2015-08-18T00:30:00Z 7.5 coyote_creek
2015-08-18T00:36:00Z 7.372 coyote_creek
2015-08-18T00:00:00Z 8.12 coyote_creek
2015-08-18T00:06:00Z 8.005 coyote_creek
2015-08-18T00:12:00Z 7.887 coyote_creek
This is because '8.12' is the highest value of 'coyote_creek' and '2.054' is the highest value of 'santa_monica'.
Sincerely
-bino-
A subquery could probably help. For example, this is from a database populated by Telegraf:
SELECT top,host FROM (SELECT TOP(usage_user, 1) AS top, host from cpu WHERE time > now() - 1m GROUP BY host)
It will output something like:
name: cpu
time top host
---- --- ----
1527489800000000000 1.4937106918238994 1.host.tld
1527489808000000000 0.3933910306845004 2.host.tld
1527489810000000000 4.17981072555205 3.host.tld
1527489810000000000 0.8654602675059009 4.host.tld
The inner query is:
SELECT TOP(usage_user, 1) AS top, host from cpu WHERE time > now() - 1m GROUP BY host
It uses TOP on the field usage_user to get only 1 item per host.
Then, to "pretty print" the result, an outer query is wrapped around it:
SELECT top,host FROM (...)
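The two-step idea (highest value per tag, then keep the best N of those) can be illustrated in plain Python. This is only a sketch of the intended result, not of how InfluxDB evaluates the query; it reuses the h2o_feet rows quoted in the question, and the function name top_per_tag is hypothetical.

```python
# (time, value, location) rows copied from the TOP() example output in the question.
points = [
    ("2015-08-18T00:48:00Z", 7.11, "coyote_creek"),
    ("2015-08-18T00:54:00Z", 6.982, "coyote_creek"),
    ("2015-08-18T00:54:00Z", 2.054, "santa_monica"),
    ("2015-08-18T00:24:00Z", 7.635, "coyote_creek"),
    ("2015-08-18T00:30:00Z", 7.5, "coyote_creek"),
    ("2015-08-18T00:36:00Z", 7.372, "coyote_creek"),
    ("2015-08-18T00:00:00Z", 8.12, "coyote_creek"),
    ("2015-08-18T00:06:00Z", 8.005, "coyote_creek"),
    ("2015-08-18T00:12:00Z", 7.887, "coyote_creek"),
]


def top_per_tag(points, n):
    """Keep the highest point per tag value, then return the n highest of those."""
    best = {}
    for time, value, tag in points:
        # Inner step: comparable to TOP(field, 1) ... GROUP BY tag.
        if tag not in best or value > best[tag][1]:
            best[tag] = (time, value, tag)
    # Outer step: rank the per-tag maxima and keep the first n.
    return sorted(best.values(), key=lambda p: p[1], reverse=True)[:n]


result = top_per_tag(points, 5)
for row in result:
    print(row)
```

With only two locations in the sample, asking for 5 returns the two rows the question lists as the expected result.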

influxdb query for grafana with grouped data

I'm using influxdb to store some service metrics. These are simple metrics, such as read bytes or active connections. Then, with grafana, I'm composing some visualizations on top of this.
Displaying something such as 'read bytes' is quite simple: it's basically summing up values, grouped by a time interval.
SELECT sum("value") FROM "bytesReceived" WHERE $timeFilter GROUP BY time($__interval) fill(0)
It's on the 'active connections' that I'm having some trouble figuring out. These are tcp sockets connected to a service, where the measurement is the number of connected sockets; this is updated whenever a socket connects or disconnects.
If I had only one instance of the service, this would be easy, I would just do something like
SELECT last("value") FROM "activeConnections" WHERE $timeFilter GROUP BY time($__interval) fill(0)
The thing is that there are multiple instances of the service, which are created dynamically. The measurement is written with the additional tag 'host', that is populated with an id for the runtime service.
So, let's get into the data points.
select * from activeConnections where time > '2018-05-16T16:00:00Z' and time < '2018-05-16T16:10:00Z'
This spits out something like
time host value
---- ---- -----
1526486436041433600 58e5bd04a313 5
1526486438158741000 58e5bd04a313 4
1526486438712713000 58e5bd04a313 3
1526486811218129000 29b39780fd7b 4
So, as you can see, we end up with 3 connections on one host and 4 on another. The problem at hand is displaying that data merged as a whole, where that last line should be 7, for example.
I tried grouping data by host
select last(value) from activeConnections where time > '2018-05-16T16:00:00Z' and time < '2018-05-16T16:10:00Z' group by host
which gives me the last value for each group
name: activeConnections
tags: host=29b39780fd7b
time last
---- ----
1526486811218129000 4
name: activeConnections
tags: host=58e5bd04a313
time last
---- ----
1526486706993942700 3
Also tried using a subquery
select * from ( select last(value) from activeConnections where time > '2018-05-16T16:00:00Z' and time < '2018-05-16T16:10:00Z' group by host )
But I get the same problem, where I don't know how to group things nicely for grafana with a time interval.
Does anyone care to comment and help? It would be much appreciated.
Ok,
I seem to have found a solution. It's a shame that Grafana doesn't support sub-queries, so the query needs to be inserted manually with raw view. There's an issue open here.
So, what I needed was a way to group all the hosts value into a single plot line. That can be achieved with the following query:
SELECT sum("value") FROM (SELECT last("value") as "value" FROM "activeConnections" WHERE $timeFilter GROUP BY time($__interval), "host") GROUP BY time($__interval) fill(previous)
I was close before, but failed to notice that if you don't give the inner select's result a name, it comes out as "last" by default. So I was trying to sum up "value", but that field didn't exist outside the sub-query.
Hope this helps someone. Thank you Yuri, for your comment. It pointed me in the right direction.
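The calculation being aimed at (take the last value per host, then add the hosts together) can be sketched in plain Python using the four sample points from the question. This illustrates the desired merged result only; it is not a reproduction of how InfluxDB evaluates the query. In particular, carrying each host's last value forward into later buckets is an assumption about the intended behaviour, and the 1-minute interval and the function name are hypothetical.

```python
from collections import defaultdict

# (timestamp_ns, host, value) rows copied from the select * output in the question.
points = [
    (1526486436041433600, "58e5bd04a313", 5),
    (1526486438158741000, "58e5bd04a313", 4),
    (1526486438712713000, "58e5bd04a313", 3),
    (1526486811218129000, "29b39780fd7b", 4),
]

INTERVAL_NS = 60 * 1_000_000_000  # hypothetical 1-minute bucket


def total_connections(points, interval_ns):
    """Return [(bucket_start_ns, total)], carrying each host's last value forward."""
    # Step 1: last value per (bucket, host); later points overwrite earlier ones.
    last_in_bucket = defaultdict(dict)
    for ts, host, value in sorted(points):
        bucket = ts - ts % interval_ns
        last_in_bucket[bucket][host] = value
    # Step 2: walk the buckets in time order, remembering each host's latest value.
    latest = {}
    totals = []
    for bucket in sorted(last_in_bucket):
        latest.update(last_in_bucket[bucket])
        totals.append((bucket, sum(latest.values())))
    return totals


totals = total_connections(points, INTERVAL_NS)
print(totals)
```

Only buckets that contain at least one point are emitted here; the second bucket adds the 4 connections of the new host to the 3 still open on the first one.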

How can you tell if an InfluxDB database contains data?

I'm currently trying to count the number of rows in an InfluxDB, but the following fails.
SELECT count(*) FROM "TempData_Quarantine_1519835017000_1519835137000"..:MEASUREMENT";
with the message
InfluxData API responded with status code=BadRequest, response={"error":"error parsing query: found :, expected ; at line 1, char 73"}
To my understanding this query should be checking all measurements and counting them?
(I inherited this code from someone else, so apologies for not understanding it better)
If you only need a yes/no answer to the question of whether an InfluxDB database contains data, then just do
select count(*) from /.*/
If the current retention policy in the current database is empty (contains 0 rows), it will return nothing at all. Otherwise it will return something like this:
name: api_calls
time count_value
---- -----------
0 5
name: cpu
time count_value
---- -----------
0 1
You can also specify the retention policy explicitly:
SELECT count(*) FROM "TempData_Quarantine_1519835017000_1519835137000"./.*/
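When the count query is sent over HTTP rather than typed into the CLI, the same yes/no check can be made on the JSON body. The sketch below assumes the response shape of the InfluxDB 1.x /query endpoint, where a statement that matches nothing comes back without a "series" key; both sample bodies are hypothetical, and the function name contains_data is illustrative.

```python
import json


def contains_data(response_body: str) -> bool:
    """Return True if any statement in a /query JSON response returned a series."""
    payload = json.loads(response_body)
    return any(result.get("series") for result in payload.get("results", []))


# Hypothetical response bodies, shaped like InfluxDB 1.x /query output.
empty = '{"results": [{"statement_id": 0}]}'
non_empty = (
    '{"results": [{"statement_id": 0, "series": [{"name": "cpu", '
    '"columns": ["time", "count_value"], "values": [[0, 1]]}]}]}'
)

print(contains_data(empty))
print(contains_data(non_empty))
```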

Retention Policy doesn't delete data

I'm new to InfluxDB and I want to implement a retention policy (RP) for my logs.
I loaded some static data using Telegraf and created an RP for it:
CREATE DATABASE test WITH DURATION 60m
but it is not deleting my previous logs.
I have observed that InfluxDB stores data in UTC, whereas my Telegraf server uses system time. Could that be an issue?
I would check two things using the Influx CLI. First, check the retention policies on your DB.
> SHOW RETENTION POLICIES
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
autogen 1h0m0s 1h0m0s 1 true
For example, I can see my autogen policy has a duration of 1 hour and a shardGroupDuration of 1 hour.
Second, check the shards.
> SHOW SHARDS
name: tester
id database retention_policy shard_group start_time end_time expiry_time owners
-- -------- ---------------- ----------- ---------- -------- ----------- ------
130 tester autogen 130 2018-02-20T21:00:00Z 2018-02-20T22:00:00Z 2018-02-20T23:00:00Z
131 tester autogen 131 2018-02-20T22:00:00Z 2018-02-20T23:00:00Z 2018-02-21T00:00:00Z
132 tester autogen 132 2018-02-20T23:00:00Z 2018-02-21T00:00:00Z 2018-02-21T01:00:00Z
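Data is removed a whole shard at a time, and only once the current time (in UTC) has passed that shard's expiry_time. In the output above the expiry_time is the shard's end_time plus the 1h retention duration, so points can stay visible for longer than the duration itself.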
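The relationship between end_time, the retention duration, and expiry_time in the SHOW SHARDS output can be checked with a small sketch. It assumes, as the three rows above suggest, that expiry_time equals end_time plus the policy duration; the helper names are hypothetical and nothing here talks to InfluxDB.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(text: str) -> datetime:
    """Parse a UTC timestamp such as 2018-02-20T22:00:00Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def expiry_time(end_time: str, duration: timedelta) -> datetime:
    """Assumed rule: a shard expires one retention duration after its end_time."""
    return parse_utc(end_time) + duration


def is_expired(end_time: str, duration: timedelta, now: datetime) -> bool:
    """True once 'now' has reached the shard's expiry time."""
    return now >= expiry_time(end_time, duration)


one_hour = timedelta(hours=1)

# Shard 130 from the output above: end_time 22:00, expiry_time 23:00.
print(expiry_time("2018-02-20T22:00:00Z", one_hour).isoformat())

# Half an hour after the shard closed it is still within the retention window.
now = parse_utc("2018-02-20T22:30:00Z")
print(is_expired("2018-02-20T22:00:00Z", one_hour, now))
```

Comparing against a timezone-aware UTC value avoids the local-time confusion raised in the question.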
