Elasticsearch group by ranges of time in Rails - ruby-on-rails

I have a large dataset in which I need to group values based on the created_at time.
Requirements are that they're grouped by 5 minute intervals.
I think that this should do it, but it doesn't seem to work:
self.search(aggs: {created_at: {date_histogram: {field: 'created_at', interval:
'5m'}}})
This is the search query:
curl http://localhost:9200/prices_development/_search?pretty -d '{"query":
{"match_all":{}},"size":1000,"from":0,"aggs":{"created_at":{"date_histogram":
{"field":"created_at","interval":"5m"}}},"timeout":"11s","_source":false}'
It just gives me back the entire set of data though. i.e. every record:
How can I to get back the data set, grouped by 5 minutes intervals?

Aggregation results are present in aggregationsfield of the response, not hits.
Set size inside aggs to limit number of aggregations.
If you only want aggregation results then set outer size to 0

Related

Google Sheets: Average percentage using multiple conditions

I would like to get an average percentage out of my sample, however, I need to use several conditions. I tried to use the AVERAGE and AVERAGEIF together with FILTER but everything returns an error and I think I'm incorrectly "merging" formulas.
You can find my test sheet here.
The rules I need to apply:
The score for individual rows is possible to find in the "Data" sheet in cell N and the total results should be visible in the sheet "Calculation" cell E.
As the sample is huge in real life, I need to filter out several pieces of information and add conditions:
to filter out all items where the code/ID starts with 0: Data!A:A&"", "^0.+"
to filter out all items that are matching the date in the Calculation sheet: Data!C:C=$B3
to filter all items with the specific name: Data!B:B=$A3
Any idea how to get the average % out of items with specific filters?
UPDATE
Expected results: I want to see the total average for a specific date, name, and ID, and let's say I would use these filters, then I would see only the final average percentage.
Test =100%
Test = 0%
Test = 100%
Total Average %: 66.7%
Also, I think the best way would be to use AVERAGEIFS, but I'm getting the error "Array arguments to AVERAGEIFS are of different size".
=AVERAGEIFS(Data!N:N,Data!B:B=$A3,Data!C:C=$B3,Data!A:A&"", "^0.+")
=IFERROR(AVERAGEIFS(Data!N3:N,Data!B3:B,A3,Data!C3:C,B3,ARRAYFORMULA(if(LEN(Data!A3:A),REGEXMATCH(Data!A3:A,"^0.+"),"")),TRUE),"")
or
=IFERROR(AVERAGE(FILTER(Data!N3:N,Data!B3:B=A3,Data!C3:C=B3,REGEXMATCH(Data!A3:A,"^0.+"))),"")
or
=IFERROR(INDEX(QUERY({Data!A3:C,Data!N3:N},"select avg(Col4) where Col1 starts with '0' and Col2 = '"&A3&"' and Col3 = '"&B3&"'"),2,0),"")

How to split a master sheet of email addresses

Ok so the need - I have about 3700 lines of email addresses, names, schools, and professions(those are column headers) I want to split this sheet into 4 with 1000 lines(I understand one will be short) in each but here is the catch I can only have 25 lines/emails from each school. So how would someone go about doing this? Keep in mind each sheet needs to have its own unique emails not repeated on the other sheets.
There are 2 problems here and as I don't know how many schools are on the list and if it's possible to have always less than 25 people from one school (for example - if there are only 30 schools, it would be impossible to distribute them in 1000 row batches).
First task:
Distribute database into 4 sheets, 1000 rows each:
It's simple.
Let's say my data has 4 columns from A to D
I make sheets named 1-1000, 1001-2000, etc.
In each one I put a formula:
1)
=query(Master!A1:D,"select * limit 1000 offset 0")
=query(Master!A1:D,"select * limit 1000 offset 1000")
=query(Master!A1:D,"select * limit 1000 offset 2000")
=query(Master!A1:D,"select * limit 1000 offset 3000")
Etc.
In order to limit number of occurences of each schools, I have to count these occurences and define what is the minimal page number on which this student can be displayed (for example - 17th student from certain school can be on 1st page, but 27th can be at least on 2nd page. 60th student can be on third or further.
When I determine minimal page number, I can sort my data accordingly and display sorted by minimal number:
In this situation my query on next pages have additional parameters:
=query(Master!A1:G,"select A,B,C,D order by G limit 1000 offset 0")
I use column G for sorting, but I don't display it.
You can find my solution here:
https://docs.google.com/spreadsheets/d/1TP6MlMmLiUExOELFhgZnti7LR7VQouMg3h-X7QRcHzQ/copy
Names are generated randomly from polish names generator.

Google Sheets Avg Query on empty columns (AVG_SUM_ONLY_NUMERIC)

Google Sheets average (avg) Query will fail with error AVG_SUM_ONLY_NUMERIC if any column in the dataset is empty. How you can overcome this?
Essentially, this occurs as the query is being run on a dynamically generated data set, therefore it's impossible to know what columns are empty beforehand. Moreover the query output "layout" must not change, so, if a column is empty, the query should return blank or 0 as for the faulty empty column.
Let's give it a look
Scenario: a Google Sheet is being used to insert markings for students tests.
When a single test is done by students, teacher assigns multiple grades for it. For instance, one marking for writing, one for comprehension, etc.
The sheet should finally build columns containing an average for all the markings assigned within the same date.
For instance, in the above sheet (link here), columns with markings given on December 16th (cols B,G,M,R,V) should be averaged in column AE.
Thanks to brilliant user Marikamitsos, this is achieved with the following query in cell AE4:
=ARRAYFORMULA(QUERY(TRANSPOSE(QUERY(TRANSPOSE(FILTER(B4:Z,B3:Z3=AE3)),
"select "&TEXTJOIN(",", 1, IF(LEN(A4:A),
"avg(Col"&ROW(A4:A)-ROW(A4)+1&")", )))&""),
"select Col2")*1)
How does the above works?
Dataset is filtered by date
Filtered dataset is transposed and an avg Query is run on it
Result dataset is being queried again to easily filter out labels
All this works fine until a student has no markings for a given date, as occurs in cell AG4: student Bob has no markings for October's 28th test, and the query will throw an error AVG_SUM_ONLY_NUMERIC.
Could there be a way to insert a 0 in the filtered dataset FILTER(B4:Z,B3:Z3=AE3) so that ONLY empty rows will be set to 0? This would prevent the query to fail, while avoiding altering the dataset layout.
Or could there be a way to ignore zeroes in avg query?
NOTE: students cannot be graded with '0' when skipping a test!
See if this works
=ARRAYFORMULA(QUERY(TRANSPOSE(QUERY(TRANSPOSE(FILTER(B4:Z+0,B3:Z3=AG3)), "select "&TEXTJOIN(",", 1, IF(LEN(A4:A), "avg(Col"&ROW(A4:A)-ROW(A4)+1&")", )))&""),"select Col2")*1)

Google Sheets Query Group By / First-N-Per-Group

I'm trying to find a simple solution for first-n-per-group.
I have a table of data, first column dates and rest data. I want to group based around the date, as multiple entries per date are allowed. For the second column some numbers, but want the FIRST record.
Currently the aggregate function I could possibly use is MIN() but that will return the lowest value and not the first.
A B
01/01/2018 10
01/01/2018 15
02/01/2018 10
02/01/2018 2
02/01/2018 100
02/01/2018 20
03/01/2018 5
03/01/2018 2
Desired output
A B
01/01/2018 10
02/01/2018 10
03/01/2018 5
Current results using MIN() - undesired
A B
01/01/2018 10
02/01/2018 2
03/01/2018 2
It's a shame there isn't a FIRST() aggregate function in Google Sheets, which would make this a lot easier.
I saw a couple of examples of using the Row Number and ArrayQuery, but that doesn't seem to work for me. There are about 5000 rows of data so trying to keep this as efficient as possible, and not have to recalculate the entire sheet on any change, each taking a few seconds.
Currently I have this, which appends a third column with the Row Number:
=query({A1:B, arrayformula(row(A1:B))}, "select min(Col1),min(Col2) group by Col1")
Thanks
EDIT 1
A suggested solution was =SORTN(A:B,2^99,2,1,1), which is a clean simple one. However, this requires a large range of "free space" to display the returned dataset. Imagine 3000+ rows.
I was hoping for a QUERY() -based solution, as I wanted to do further operations with the results. Specifically, count the occurrences of distinct values.
For example: I wanted a returned dataset of
A B
01/01/2018 10
02/01/2018 10
03/01/2018 5
Yet I want to count the occurrences of those values (and then ignoring the dates). For example:
B C
10 2
5 1
Perhaps I've confused the situation by using numbers? the "data" in ColB is TEXT (short 3 letter codes), however I used numbers to show I couldn't use MIN() function as that returns the numerically lowest value.
So in brief:
Go through all rows (3000+ rows) and group by the FIRST row of a particular date
return the FIRST value of that row
COUNT() all unique occurrences of those FIRST values, disregarding the date. Just a list with the unique values and their count (again, only the first one of any particular day)
=SORTN(A:B,2^99,2,1,1)
If your data is sorted as in the sample, You can easily remove duplicates with SORTN()

InfluxDB: Starting cumulative_sum() from zero / aggregate grouping required for cumulative_sum and non_negative_difference

Using InfluxDB, I'm trying produce an output that shows cumulative rainfall for a time period, that starts from zero.
The rainfall sensor outputs a cumulative rainfall amount, but resets to zero on power-failure, restart etc.
My first query component uses non_negative_difference() to show the increments.
select
non_negative_difference(rain) as nnd
FROM
weather
WHERE
$time_query
.... yields an increment per raw data point, for example:
2018-06-01T14:21:00.926Z 0
2018-06-01T14:22:02.959Z 0.30000000000000426
2018-06-01T14:23:04.992Z 0.3999999999999986
2018-06-01T14:24:07.024Z 0.10000000000000142
2018-06-01T14:25:09.059Z 0.19999999999999574
2018-06-01T14:26:11.094Z 0
2018-06-01T14:27:13.127Z 0.10000000000000142
2018-06-01T14:28:15.158Z 0.20000000000000284
2018-06-01T14:29:20.027Z 0.09999999999999432
2018-06-01T14:30:22.476Z 0.10000000000000142
2018-06-01T14:30:53.918Z 0.6000000000000014
2018-06-01T14:31:55.968Z 0.5
2018-06-01T14:32:58.007Z 0.5
2018-06-01T14:34:00.046Z 0.20000000000000284
2018-06-01T14:35:02.075Z 0.3999999999999986
2018-06-01T14:36:04.102Z 0.3999999999999986
2018-06-01T14:37:06.136Z 0.20000000000000284
2018-06-01T14:38:08.201Z 0
So far so good.
I'm now trying to stitch these readings back to cumulative total, starting from zero for the intended period.
I can use cumulative_sum() for this, for example:
SELECT
cumulative_sum(nnd)
FROM
(SELECT
non_negative_difference(rain) as nnd
FROM
weather
WHERE
$time_query )
which yields:
2018-06-01T14:21:00.926Z 0
2018-06-01T14:22:02.959Z 0.30000000000000426
2018-06-01T14:23:04.992Z 0.7000000000000028
2018-06-01T14:24:07.024Z 0.8000000000000043
2018-06-01T14:25:09.059Z 1
2018-06-01T14:26:11.094Z 1
2018-06-01T14:27:13.127Z 1.1000000000000014
2018-06-01T14:28:15.158Z 1.3000000000000043
2018-06-01T14:29:20.027Z 1.3999999999999986
2018-06-01T14:30:22.476Z 1.5
2018-06-01T14:30:53.918Z 2.1000000000000014
2018-06-01T14:31:55.968Z 2.6000000000000014
2018-06-01T14:32:58.007Z 3.1000000000000014
2018-06-01T14:34:00.046Z 3.3000000000000043
2018-06-01T14:35:02.075Z 3.700000000000003
2018-06-01T14:36:04.102Z 4.100000000000001
2018-06-01T14:37:06.136Z 4.300000000000004
2018-06-01T14:38:08.201Z 4.300000000000004
Looking good!
Now I'd like to group it up into more distinct time buckets, for nice graphing.
Let's try....
SELECT
cumulative_sum(max(nnd))
FROM (SELECT
non_negative_difference(rain) as nnd
FROM
weather
WHERE
$time_query)
GROUP BY
time(5m)
and I get an error: ERR: aggregate function required inside the call to non_negative_difference
But I cannot find a reasonable way of adding aggregates and groupings to non_negative_difference() that do not affect the accuracy of the differencing function itself.
The only thing I've been able to do is a dummy aggregate SUM() over time groups that are smaller than the sensor period. But this isn't robust enough for my liking - (and i'm still not sure it is 100% correct)
Is it correct that I must have both queries as aggregate queries?
I was trying to do this very thing for my weather station. Instead of having the weather station calculate the cumulative value I wanted Grafana to do it. The solution that worked for me is the advanced syntax Yuri Lachin mentions in his comments.
With InfluxDB you can use CUMULATIVE_SUM(), but the basic syntax doesn't allow you to group by time (only by tag). The "advanced syntax", however, allows you to to have a time series by nesting an aggregate function like MEAN() or SUM().
Here's the function I am using in Grafana to get a cumulative rainfall total for a selected time period:
SELECT CUMULATIVE_SUM(MEAN("rainfall")) FROM "weather" WHERE $timeFilter GROUP BY time(1h) fill(0).
The GROUP BY is, of course, flexible. I was interested in hourly rainfall so I grouped by 1h. You can group by the time period you find most interesting.
Using this query the rainfall will start from zero for period you select in Grafana. In the Seattle area we had measurable rain (I know, shocker) on 8/6/2020 and 8/8/2020. If I set my date range to include both dates the graph shows just under .2mm total rainfall:
If I switch my graph to 8/8 and 8/9 the total is just under 1mm:
Note: I was also interested in seeing the individual bucket tips so included those as bars on the second Y-axis.
For more detail see: https://docs.influxdata.com/influxdb/v1.8/query_language/functions/#advanced-syntax-7

Resources