I am very new to InfluxDB and curious how GROUP BY works. For example, how can I execute the following MySQL query in InfluxDB:
select mean(cputime), vm from CPU group by vm;
First of all, please remember: this is NOT a relational database, and InfluxQL is NOT SQL (even though it looks so familiar).
In particular:
1) You can't select aggregate and non-aggregate values in the same query (whether that non-aggregate value is a field or a tag). Yes, even with grouping.
2) Effectively, you can group only by tags (plus a special kind of grouping by time intervals).
So, considering "vm" is a tag, your query as written is not valid.
While this:
select mean(cputime) from CPU group by vm
is valid, I'd strongly discourage you (and anyone else) from running queries without time restrictions: aside from being fairly meaningless, this will slow everything down dramatically as the time series grows.
So something like this:
select mean(cputime) from CPU where time > now() - 15m group by vm
or even this:
select mean(cputime) from CPU where time > now() - 90m group by time(15m), vm
is going to be way better.
Related
I have developed a project using InfluxDB and I am currently trying to understand why my Influx container keeps crashing due to OOM exits.
The way I designed my database is quite basic. I have several buildings, and for each building I need to store time-based values. So I created a database for each building, and a measurement for each type of value (for example, energy consumption).
I do not use tags at all, because with the design described above, all that is left to store is the float values and their timestamps. I like this design because every building is completely separated from the others (as they should be), and if I want to get data from one of them, I just need to connect to the building's database (or bucket) and query it like so:
SELECT * FROM field1,field2 WHERE time>d1 and time<d2
According to this Influx article, if I understand correctly (English isn't my first language), I have a cardinality of:
3 buildings (buckets/databases) * 1000 fields (measurements) * 1 (default tag?) = 3000
This doesn't seem like much, so I think I am misunderstanding something.
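For what it's worth, on InfluxDB 1.4 and later the measured series cardinality can be read back directly with InfluxQL, which is an easy way to check an estimate like the one above (a sketch; building_1 is just a placeholder for one of the per-building databases):
SHOW SERIES CARDINALITY ON building_1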
Admittedly, most of my database experience is relational. One of the tenets in that space is to avoid moving data over the network, which manifests in using something like:
select * from person order by last_name limit 10
which will presumably order and limit within the database engine, versus using something like:
select * from person
and subsequently ordering and taking the top 10 at the client, which could have disastrous effects if there are a million person records.
So, with Gremlin (from Groovy), if I do something like:
g.V().has('#class', 'Person').order{println('!'); it.a.last_name <=> it.b.last_name}[0..9]
I am seeing the ! printed, so I am assuming that this is bringing all Person records into the address space of my client prior to the order and limit steps, which is not the desired effect.
Do my options for processing queries entirely in the database engine become product-specific (e.g. for OrientDB, perhaps submitting the query in their flavor of SQL), or is there something about Gremlin that I am missing?
If you want the implementer's query optimizer to kick in, you need to use as many Gremlin steps as possible and avoid pure Groovy/in-memory processing of your graph traversals.
You're most likely looking for something like this (as of TinkerPop v3.2.0):
g.V().has('#class', 'Person').order().by('last_name', incr).limit(10)
If you find yourself using lambdas, chances are high that the same thing can be done with pure Gremlin steps. Favor Gremlin steps over lambdas.
See TinkerPop v3.2.0 documentation:
Order By step
Limit step
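If you want to see what the graph provider will actually execute, one option (a sketch, not part of the original answer) is to ask the traversal for its explanation:
g.V().has('#class', 'Person').order().by('last_name', incr).limit(10).explain()
This prints the traversal strategies applied and the final, provider-rewritten traversal, so you can verify that the ordering and limiting happen in the engine rather than in your client.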
I am building a dashboard using InfluxDB. I have a source which generates approx. 2000 points per minute. Each point has 5 tags, 6 fields. There is only one measurement.
Everything works fine for about 24 hours, but as the data size grows I am not able to run any queries on Influx. For example, right now I have approximately 48 hours of data and even a basic select brings down InfluxDB:
select count(field1) from measurementname
It times out with the error:
ERR: Get http://localhost:8086/query?db=dbname&q=select+count%28field1%29+from+measuementname: EOF
Configuration:
InfluxDB version: 0.10.1, default configuration
OS version: Ubuntu 14.04.2 LTS
Hardware: 30 GB RAM, 4 vCPUs, 150 GB HDD
Some Background:
I have a dashboard and a web app querying InfluxDB. The web app lets a user query the DB based on tag1 or tag2.
Tags:
tag1 - unique for each record. Used in a where clause in the web app to get the record based on this field.
tag2 - unique for each record. Used in a where clause in the web app to get the record based on this field.
tag3 - used in group by. Think of it as departmentid tying a bunch of employees.
tag4 - used in group by. Think of it as departmentid tying a bunch of employees.
tag5 - used in group by. Values 0 or 1 or 2.
Pasting the answer from the influxdb@googlegroups.com mailing list: https://groups.google.com/d/msgid/influxdb/b4fb503e-18a5-4bd5-84b1-632dc4950747%40googlegroups.com?utm_medium=email&utm_source=footer
tag1 - unique for each record.
tag2 - unique for each record.
This is a poor schema. You are creating a new series for every record, which puts a punishing load on the database. Each series must be indexed, and the entire index currently must reside in RAM. I suspect you are running out of memory after 48 hours because of series cardinality, and the query is just the last straw, not the actual cause of the low RAM situation.
It is very bad practice to use unique values in tags. You can still use fields in the WHERE clause; they just aren't as performant, and the damage to your system is much less than having a unique series for every point.
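To make that concrete (a sketch; the measurement name and value are made up), once tag1 and tag2 are written as fields rather than tags, the web app's lookups become ordinary field comparisons:
select * from measurementname where tag1 = 'some-unique-id' and time > now() - 1h
Such a query scans the points in the time range instead of using the tag index, which is slower per query, but series cardinality then stays bounded by tag3/tag4/tag5.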
https://docs.influxdata.com/influxdb/v0.10/concepts/schema_and_data_layout/
https://docs.influxdata.com/influxdb/v0.10/guides/hardware_sizing/#when-do-i-need-more-ram
I noticed that Impala's "Estimated Per-Host Requirements" grow dramatically when my queries use a "group by" with several fields. I suppose it calculates the maximum resources needed for a join:
EXPLAIN select field1, field2
from mytable where field1=123
group by field1, field2
order by field1, field2
limit 100;
I would like to know if there is a way to reduce the value estimated by Impala, because the resources actually needed (300 MB) were far lower than the amount estimated (300 GB).
It is important to say that "field1" and "field2" are Strings.
Unfortunately it is difficult to estimate the required memory from the limited statistics available at query planning time, especially for aggregations and joins, whose memory use depends on the selectivity of the grouping/join expressions.
Firstly, are you sure you have up-to-date statistics on the table(s) you're using? Run COMPUTE STATS [table] to do so.
If you still have this issue with correct stats, you can set the mem_limit=XM query option to tell Impala that the query shouldn't use more than X MB of memory, so it will request that amount from Llama rather than the estimate from planning. If you're sure the query doesn't use more than 300 MB, you can issue set mem_limit=300M; and then issue your query. If you're running other queries afterwards from the same session, clear the query option again.
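Put together, an impala-shell session following that advice might look like this (a sketch; mytable and 300M are just the values taken from this question):
COMPUTE STATS mytable;
SET MEM_LIMIT=300M;
select field1, field2 from mytable where field1=123 group by field1, field2 order by field1, field2 limit 100;
SET MEM_LIMIT=0;
Setting mem_limit back to 0 removes the per-query limit for the rest of the session.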
Yes, yes, yes! I know! It is totally wrong to pass raw SQL in Rails like the command below, but I promise :) it is just for benchmarking purposes.
@medications = TestPharmOrderMain.
  select("brand_name, form, dose, generic_name as alternative, sum(order_count) as total_count, sum(order_cost) as total_cost").
  group("brand_name, form, dose, generic_name").
  limit(5)
The PostgreSQL database this REST service runs against has two million rows, and it takes about four minutes to return the JSON from this query, which makes it impossible to develop against.
Is there a way I can change this query to, for example, only look at the first twenty rows in the DB instead of two million, so it runs faster for my dev purposes?
If this is for dev purposes, do the smart thing and create a tiny data set that is representative of the whole system. You can do this with a CREATE TABLE AS SELECT statement:
create table my_test_table as
select brand_name, form, dose, generic_name as alternative,
       sum(order_count) as total_count,
       sum(order_cost) as total_cost
from table
group by brand_name, form, dose, generic_name
limit 5
Now you can point your test query at my_test_table; it will only have 5 records and will therefore be quite fast.
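For example (assuming the table created above), the dev version of the query no longer needs the GROUP BY or the SUMs, because the aggregates are already materialised as columns:
select brand_name, form, dose, alternative, total_count, total_cost from my_test_table;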
You can also offload this to something like DBUnit, which is essentially a framework laid on top of xUnit, so it can be easily integrated into your testing, which I presume is done with RubyUnit.