Memory issue when selecting 14582 rows in InfluxDB - influxdb

I have a raw measurement with:
fields:
  delta: float
  fetch_characteristic: string
tags:
  conso_prod
  fetch_method
  meter_id
  operation_id
  source
  timestep
  to_compute
  unit
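For illustration, a hypothetical point in line protocol matching this schema (all values are made up):
raw,conso_prod=prod,fetch_method=api,meter_id=m1,operation_id=op1,source=s1,timestep=60,to_compute=true,unit=kWh delta=1.5,fetch_characteristic="x" 1600000000000000000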
After inserting in InfluxDB, I want to make a query:
SELECT * FROM raw WHERE to_compute=true
This query will run every day; the initial run should select 14582 rows, and each later run will select just the last day's rows, so only a few.
I tried to optimize series cardinality, but it didn't give me any improvement.
Running this query takes 4 GB of RAM. How can I optimize it?
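Since only the most recent day is needed after the initial load, bounding the query by time is the usual way to keep memory down, because InfluxDB then only reads the shards covering that range. A minimal sketch of the daily query, assuming only the previous day's points are wanted (note that to_compute is a tag, and tag values in InfluxQL are strings, so the comparison needs single quotes):
SELECT * FROM raw WHERE "to_compute" = 'true' AND time >= now() - 1d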

Related

How to split a master sheet of email addresses

OK, so the need: I have about 3700 lines of email addresses, names, schools, and professions (those are the column headers). I want to split this sheet into 4 sheets with 1000 lines each (I understand one will be short), but here is the catch: I can only have 25 lines/emails from each school per sheet. So how would someone go about doing this? Keep in mind each sheet needs to have its own unique emails, not repeated on the other sheets.
There are two problems here, since I don't know how many schools are on the list, nor whether it is even feasible to keep each school under 25 rows per sheet (for example, if there are only 30 schools, it would be impossible to fill 1000-row batches).
First task:
Distribute the database into 4 sheets, 1000 rows each:
It's simple.
Let's say my data has 4 columns, from A to D.
I make sheets named 1-1000, 1001-2000, etc.
In each one I put a formula:
=query(Master!A1:D,"select * limit 1000 offset 0")
=query(Master!A1:D,"select * limit 1000 offset 1000")
=query(Master!A1:D,"select * limit 1000 offset 2000")
=query(Master!A1:D,"select * limit 1000 offset 3000")
Etc.
In order to limit the number of occurrences of each school, I have to count those occurrences and define the minimal page number on which a given student can be displayed (for example, the 17th student from a certain school can be on the 1st page, but the 27th can be on the 2nd page at the earliest, and the 60th on the third or later).
Once I determine the minimal page number, I can sort my data by it and display the rows accordingly:
In this situation the query on each subsequent page gets additional parameters:
=query(Master!A1:G,"select A,B,C,D order by G limit 1000 offset 0")
I use column G for sorting, but I don't display it.
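As a hedged sketch of how such a helper column could be filled, assuming the school name sits in column C of the Master sheet (the column letter is hypothetical): a running count of each school, divided by 25 and rounded up, gives the minimal page number. Placed in Master!G2 and copied down:
=CEILING(COUNTIF(C$2:C2,C2)/25)
This yields 1 for the 17th student of a school, 2 for the 27th, and 3 for the 60th, matching the rule above.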
You can find my solution here:
https://docs.google.com/spreadsheets/d/1TP6MlMmLiUExOELFhgZnti7LR7VQouMg3h-X7QRcHzQ/copy
The names were generated randomly with a Polish name generator.

Is it possible to get a percentile on aggregated data in InfluxDB?

Is it possible to get a percentile on aggregated data in InfluxDB?
Say my data is:
db,label1=value1 measure1_count=20 measure1_mean=0.8 140000000000
db,label1=value1 measure1_count=8 measure1_mean=0.9 140000001000
db,label1=value1 measure1_count=15 measure1_mean=0.4 140000002000
Is it possible to do a percentile on the above data in InfluxDB 1.x/2.x?
InfluxDB provides the MEDIAN() aggregate function for calculating the median:
select MEDIAN(Value) from ProcessData group by TagName
Note: MEDIAN() is nearly equivalent to PERCENTILE(field_key, 50), except MEDIAN() returns the average of the two middle field values if the field contains an even number of values.
https://docs.influxdata.com/influxdb/v1.8/query_language/functions/#median
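For other percentiles, InfluxQL also has PERCENTILE(); a minimal sketch against the same measurement:
SELECT PERCENTILE("Value", 95) FROM ProcessData GROUP BY TagName
Note, however, that both functions need the raw values; an exact percentile generally cannot be recovered from pre-aggregated count/mean pairs like the data above.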

InfluxDB: Starting cumulative_sum() from zero / aggregate grouping required for cumulative_sum and non_negative_difference

Using InfluxDB, I'm trying to produce an output that shows cumulative rainfall for a time period, starting from zero.
The rainfall sensor outputs a cumulative rainfall amount, but resets to zero on power-failure, restart etc.
My first query component uses non_negative_difference() to show the increments.
select
non_negative_difference(rain) as nnd
FROM
weather
WHERE
$time_query
...yields an increment per raw data point, for example:
2018-06-01T14:21:00.926Z 0
2018-06-01T14:22:02.959Z 0.30000000000000426
2018-06-01T14:23:04.992Z 0.3999999999999986
2018-06-01T14:24:07.024Z 0.10000000000000142
2018-06-01T14:25:09.059Z 0.19999999999999574
2018-06-01T14:26:11.094Z 0
2018-06-01T14:27:13.127Z 0.10000000000000142
2018-06-01T14:28:15.158Z 0.20000000000000284
2018-06-01T14:29:20.027Z 0.09999999999999432
2018-06-01T14:30:22.476Z 0.10000000000000142
2018-06-01T14:30:53.918Z 0.6000000000000014
2018-06-01T14:31:55.968Z 0.5
2018-06-01T14:32:58.007Z 0.5
2018-06-01T14:34:00.046Z 0.20000000000000284
2018-06-01T14:35:02.075Z 0.3999999999999986
2018-06-01T14:36:04.102Z 0.3999999999999986
2018-06-01T14:37:06.136Z 0.20000000000000284
2018-06-01T14:38:08.201Z 0
So far so good.
I'm now trying to stitch these readings back into a cumulative total, starting from zero for the intended period.
I can use cumulative_sum() for this, for example:
SELECT
cumulative_sum(nnd)
FROM
(SELECT
non_negative_difference(rain) as nnd
FROM
weather
WHERE
$time_query )
which yields:
2018-06-01T14:21:00.926Z 0
2018-06-01T14:22:02.959Z 0.30000000000000426
2018-06-01T14:23:04.992Z 0.7000000000000028
2018-06-01T14:24:07.024Z 0.8000000000000043
2018-06-01T14:25:09.059Z 1
2018-06-01T14:26:11.094Z 1
2018-06-01T14:27:13.127Z 1.1000000000000014
2018-06-01T14:28:15.158Z 1.3000000000000043
2018-06-01T14:29:20.027Z 1.3999999999999986
2018-06-01T14:30:22.476Z 1.5
2018-06-01T14:30:53.918Z 2.1000000000000014
2018-06-01T14:31:55.968Z 2.6000000000000014
2018-06-01T14:32:58.007Z 3.1000000000000014
2018-06-01T14:34:00.046Z 3.3000000000000043
2018-06-01T14:35:02.075Z 3.700000000000003
2018-06-01T14:36:04.102Z 4.100000000000001
2018-06-01T14:37:06.136Z 4.300000000000004
2018-06-01T14:38:08.201Z 4.300000000000004
Looking good!
Now I'd like to group it up into more distinct time buckets, for nice graphing.
Let's try....
SELECT
cumulative_sum(max(nnd))
FROM (SELECT
non_negative_difference(rain) as nnd
FROM
weather
WHERE
$time_query)
GROUP BY
time(5m)
and I get an error: ERR: aggregate function required inside the call to non_negative_difference
But I cannot find a reasonable way of adding aggregates and groupings to non_negative_difference() that does not affect the accuracy of the differencing function itself.
The only thing I've been able to do is use a dummy aggregate SUM() over time groups that are smaller than the sensor period. But this isn't robust enough for my liking (and I'm still not sure it is 100% correct).
Is it correct that I must have both queries as aggregate queries?
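For reference, a sketch of the dummy-aggregate workaround described above, assuming the sensor reports roughly once a minute, so each 30s inner bucket holds at most one point and SUM() passes the values through unchanged:
SELECT cumulative_sum(sum(nnd)) FROM (SELECT non_negative_difference(sum(rain)) AS nnd FROM weather WHERE $time_query GROUP BY time(30s)) GROUP BY time(5m)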
I was trying to do this very thing for my weather station. Instead of having the weather station calculate the cumulative value I wanted Grafana to do it. The solution that worked for me is the advanced syntax Yuri Lachin mentions in his comments.
With InfluxDB you can use CUMULATIVE_SUM(), but the basic syntax doesn't allow you to group by time (only by tag). The "advanced syntax", however, lets you have a time series by nesting an aggregate function like MEAN() or SUM().
Here's the function I am using in Grafana to get a cumulative rainfall total for a selected time period:
SELECT CUMULATIVE_SUM(MEAN("rainfall")) FROM "weather" WHERE $timeFilter GROUP BY time(1h) fill(0)
The GROUP BY is, of course, flexible. I was interested in hourly rainfall so I grouped by 1h. You can group by the time period you find most interesting.
Using this query, the rainfall will start from zero for the period you select in Grafana. In the Seattle area we had measurable rain (I know, shocker) on 8/6/2020 and 8/8/2020. If I set my date range to include both dates, the graph shows just under 0.2 mm total rainfall. If I switch my graph to 8/8 and 8/9, the total is just under 1 mm.
Note: I was also interested in seeing the individual bucket tips, so I included those as bars on the second Y-axis.
For more detail see: https://docs.influxdata.com/influxdb/v1.8/query_language/functions/#advanced-syntax-7

Elasticsearch group by ranges of time in Rails

I have a large dataset in which I need to group values based on the created_at time.
The requirement is that they're grouped into 5-minute intervals.
I think that this should do it, but it doesn't seem to work:
self.search(aggs: {created_at: {date_histogram: {field: 'created_at', interval: '5m'}}})
This is the search query:
curl http://localhost:9200/prices_development/_search?pretty -d '{"query":{"match_all":{}},"size":1000,"from":0,"aggs":{"created_at":{"date_histogram":{"field":"created_at","interval":"5m"}}},"timeout":"11s","_source":false}'
It just gives me back the entire set of data though, i.e. every record.
How can I get back the data set, grouped into 5-minute intervals?
Aggregation results are present in the aggregations field of the response, not in hits.
Set size inside aggs to limit the number of aggregations.
If you only want aggregation results, set the outer size to 0.
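For example, a sketch of the same request with the outer size set to 0, so the response contains only the aggregation buckets (under aggregations.created_at.buckets):
curl http://localhost:9200/prices_development/_search?pretty -d '{"size":0,"aggs":{"created_at":{"date_histogram":{"field":"created_at","interval":"5m"}}}}'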

Copy from a measurement to another measurement in InfluxDB

Having the following statement:
SELECT * INTO ZZZD FROM P4978
Output:
name: result
time                 written
----                 -------
1970-01-01T00:00:00Z 231
Using:
SELECT * FROM ZZZD
I get only 7 lines even though 231 lines were written. I can't figure out why there are only 7 lines. Is there some setting, or is this a defect? I'm not able to copy more than 7 lines from one measurement to another.
If you want an exact copy use:
SELECT * INTO ZZZD FROM P4978 group by *
If you don't specify group by *, the tags turn into fields; points that then share the same timestamp overwrite one another, which is why only a handful of rows survive.
You can verify the tags with show tag keys from ZZZD and the fields with show field keys from ZZZD.
Source: https://docs.influxdata.com/influxdb/v1.5/query_language/continuous_queries/#common-issues-with-basic-syntax (scroll to issue 4).
The INTO clause works with InfluxDB 0.12 and above:
SELECT * INTO ZZZD FROM P4978
This will work.
The INTO clause is intended for use with downsampling continuous queries. Kapacitor is a better tool for copying data from one measurement to another.
