I am new to InfluxDB and the TICK environment, so maybe this is a basic question, but I have not found how to do this. I have installed InfluxDB 1.7.2, with Telegraf listening to an MQTT server that receives JSON data generated by different devices. I have Chronograf to visualize the data that is being received.
The JSON data is a very simple message indicating the generating device as a string and some numeric values detected. I have created some graphs showing the number of messages received in 5-minute intervals by one of the devices.
SELECT count("devid") AS "Device" FROM "telegraf"."autogen"."mqtt_consumer" WHERE time > :dashboardTime: AND "devid"='D9BB' GROUP BY time(5m) FILL(null)
As you can see, in this query I am setting the device id by hand. I can put this query alone in a graph or combine multiple similar queries for different devices, but that limits me to identifying the devices in advance.
Is it possible to obtain the results grouped by the values contained in devid? In SQL this would mean something like GROUP BY "devid", but I have not been able to make it work.
Any ideas?
You can use GROUP BY "devid" if devid is a tag in the measurement schema. If devid is the only tag, the number of unique devid values equals the number of time series in the "telegraf"."autogen"."mqtt_consumer" measurement. It is typically unnecessary to store the same value as both a tag and a field. You can think of the set of tags in a measurement as a compound unique index (key) in a conventional SQL database.
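For example, if devid is a tag, grouping by it gives one series per device in a single query. Note that InfluxQL cannot count tag values, so the count has to target a field; the field name "value" below is an assumption about your schema:

```sql
SELECT count("value") AS "messages"
FROM "telegraf"."autogen"."mqtt_consumer"
WHERE time > :dashboardTime:
GROUP BY time(5m), "devid" FILL(null)
```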
Below is the scenario against which I have this question.
Requirement:
Pre-aggregate time series data within InfluxDB at granularities of seconds, minutes, hours, days and weeks for each sensor in a device.
Current Proposal:
Create five Continuous Queries (one for each granularity level, i.e. seconds, minutes, ...) for each sensor of a device, in a different retention policy from that of the raw time series data, when the device is onboarded.
Limitation with Current Proposal:
With an increasing number of devices/sensors (time series data sources), InfluxDB will get bloated with too many Continuous Queries (which is not recommended), and this will take a toll on the InfluxDB instance itself.
Question:
To avoid the above problems, is it possible to create Continuous Queries on the same source measurement (i.e. the raw time series measurement), with the aggregates differentiated within the measurement by new tags that distinguish the Continuous Query results from the raw time series data?
Example:
CREATE CONTINUOUS QUERY "strain_seconds" ON "database"
RESAMPLE EVERY 5s FOR 1m
BEGIN
SELECT MEAN("strain_top") AS "STRAIN_TOP_MEAN" INTO "database"."raw"."strain" FROM "database"."raw"."strain" GROUP BY time(1s),*
END
As far as I know, and have seen from the docs, it's not possible to apply new tags in continuous queries.
If I've understood the requirements correctly this is one way you could approach it.
CREATE CONTINUOUS QUERY "strain_seconds" ON "database"
RESAMPLE EVERY 5s FOR 1m
BEGIN
SELECT MEAN("strain_top") AS "STRAIN_TOP_MEAN" INTO "database"."strain_seconds_retention_policy"."strain" FROM "database"."raw"."strain" GROUP BY time(1s),*
END
This saves the data into the same measurement but under a different retention policy, strain_seconds_retention_policy. When you run a SELECT, you specify the retention policy to read from.
Note that it is not possible to select from several retention policies in a single query. If you don't specify one, the default policy is used (not all of them). If that is something you need, a different approach is required.
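A later SELECT then names the retention policy explicitly, for example:

```sql
SELECT "STRAIN_TOP_MEAN"
FROM "database"."strain_seconds_retention_policy"."strain"
WHERE time > now() - 1h
```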
I don't quite see why you'd need to define a continuous query per device and per sensor. You only need to define five (one per granularity: seconds, minutes, hours, days, weeks) and GROUP BY * (all tags), which you already do. As long as the source data point carries tags identifying the device and sensor, the resampled data point will carry them too. Data from any newly added devices will just be processed automatically by those five queries and saved into the corresponding retention policies.
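Concretely, one of those five generic queries might look like this (the retention policy names are hypothetical); GROUP BY time(1m),* keeps every tag, so a single query covers all devices and sensors:

```sql
CREATE CONTINUOUS QUERY "strain_minutes" ON "database"
BEGIN
  SELECT MEAN("strain_top") AS "STRAIN_TOP_MEAN"
  INTO "database"."minutes_rp"."strain"
  FROM "database"."raw"."strain"
  GROUP BY time(1m), *
END
```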
If you do want to apply additional tags, you could process the data outside the database in a custom script and write it back with any additional tags you need, instead of using continuous queries
UPDATE: it seems that the recently released org.apache.beam.sdk.io.hbase-2.6.0 includes the HBaseIO.readAll() API. I tested it in Google Cloud Dataflow and it seems to be working. Will there be any issues or pitfalls with using HBaseIO directly in a Google Cloud Dataflow setting?
BigtableIO.read takes a PBegin as input. I am wondering if there is anything like SpannerIO's readAll API, where BigtableIO's read could take a PCollection of ReadOperations (e.g. Scans) as input and produce a PCollection<Result> from those ReadOperations.
I have a use case where I need multiple prefix scans, each with a different prefix, and the number of rows sharing a prefix can be small (a few hundred) or large (a few hundred thousand). If nothing like readAll is available, I am thinking of having a DoFn perform a 'limit' scan and, if the limit scan doesn't reach the end of the key range, split the range into smaller chunks. In my case the key space is uniformly distributed, so the number of remaining rows can be estimated well from the last scanned row (assuming all keys smaller than the last scanned key are returned by the scan).
Apology if similar questions have been asked before.
HBaseIO is not compatible with the Bigtable HBase connector due to its region locator logic, and we haven't implemented the SplittableDoFn API for Bigtable yet.
How big are your rows? Are they small enough that scanning a few hundred thousand rows can be handled by a single worker?
If yes, then I'll assume that the expensive work you are trying to parallelize is further down in your pipeline. In this case, you can:
create a subclass of AbstractCloudBigtableTableDoFn
in the DoFn, use the provided client directly, issuing scan for each prefix element
Each row resulting from the scan should be assigned a shard id and emitted as a KV(shard id, row). The shard id should be an incrementing integer mod some multiple of the number of workers.
Then do a GroupByKey after the custom DoFn to fan out the shards. This is important: without the GroupByKey, a single worker would have to process all of the rows emitted for a prefix.
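The actual steps above live in a Java Beam DoFn (a subclass of AbstractCloudBigtableTableDoFn); as a language-neutral illustration, here is a minimal Python sketch of the round-robin shard-id assignment (all names hypothetical):

```python
from itertools import count

def assign_shards(rows, num_workers, fanout_factor=4):
    """Yield (shard_id, row) pairs for the rows of one prefix scan.

    The shard id is an incrementing integer mod a multiple of the
    worker count, so a later GroupByKey fans rows out across workers.
    """
    num_shards = num_workers * fanout_factor
    counter = count()
    for row in rows:
        yield (next(counter) % num_shards, row)
```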
If your rows are big and you need to split each prefix scan across multiple workers then you will have to augment the above approach:
in main(), issue a SampleRowKeys request, which will give rough split points
insert a step in your pipeline before the manual scanning DoFn to split the prefixes using the results from SampleRowKeys, i.e. if the prefix is 'a' and SampleRowKeys contains 'ac', 'ap', 'aw', then the ranges it should emit would be [a, ac), [ac, ap), [ap, aw), [aw, b). Assign a shard id and group by it.
feed the prefixes to manual scan step from above.
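Using the answer's own example, the range-splitting step can be sketched like this (plain Python with a hypothetical helper; a real pipeline would emit these ranges from a DoFn, and the end-of-range computation is simplified):

```python
def split_prefix(prefix, sample_keys):
    """Split the key range covered by `prefix` at each sampled row key
    that falls inside it, returning a list of (start, end) sub-ranges."""
    # End of the prefix range: the smallest key sorting after every key
    # that starts with `prefix` (simplified by bumping the last char).
    end = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    inside = [k for k in sorted(sample_keys) if prefix < k < end]
    bounds = [prefix] + inside + [end]
    # Adjacent bounds form half-open [start, end) scan ranges.
    return list(zip(bounds, bounds[1:]))
```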
I have four singlestat panels which show my used space on different hosts (every host has also different type_instances):
The query for one of this singlestats is the following:
Question: Is there a way to create a fifth singlestat panel which shows the sum of the other four singlestats? (The sum of all "storj_value" where type=shared.)
The InfluxDB query language does not currently support aggregations across measurements (e.g. JOINs). It is possible with Kapacitor, but that requires writing code so that new aggregated values for all the measurements are written back to the DB, and those then need to be queried separately.
The only current option is to use an API that does support cross-metric functions, for example Graphite with an InfluxDB storage back-end, such as InfluxGraph.
The two APIs are quite different - Influx's is query language based, Graphite is not - and tagged InfluxDB data will need to be configured as a Graphite metric path via templates, see configuration examples.
After that, Graphite functions that act across series can be used, in particular for the above question, sumSeries.
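With such a metric-path mapping in place, the fifth panel's target could be a single Graphite expression along these lines (the metric path here is purely illustrative and depends on your template configuration):

```
sumSeries(storj.*.shared.storj_value)
```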
In my app users can save sales reports for given dates. What I want to do now is to query the database and select only the latest sales reports (all those reports that have the maximum date in my table).
I know how to sort all reports by date and to select the one with the highest date - however I don't know how to retrieve multiple reports with the highest date.
How can I achieve that? I'm using Postgres.
Something like this?
SalesReport.where(date: SalesReport.maximum('date'))
EDIT: Just to bring visibility to @muistooshort's comment below, you can reduce the two queries to a single query (with a subselect) using the following form:
SalesReport.where(date: SalesReport.select('MAX(date)'))
If there is a lot of latency between your web host and your database host, this could halve execution times. It is almost always the preferred form.
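For reference, the subselect form issues a single SQL statement roughly equivalent to the following (assuming a sales_reports table backing the model):

```sql
SELECT * FROM sales_reports
WHERE date = (SELECT MAX(date) FROM sales_reports);
```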
You can get the maximum date to search for matching reports:
max_date = Report.maximum('date')
reports = Report.where(date: max_date)
I am trying to create a continuous query in InfluxDB. The query should compute hits per second by doing (1/response time) on the value I am already getting in another series (say series1).
Here is the query:
select (1000/value) as value from series1 group by time(1s) into api.HPS;
My problem is that the query "select (1000/value) as value from series1 group by time(1s)" works fine and returns results, but as soon as I turn it into a continuous query, it starts giving me a parse error.
Please help.
Hard to give any concrete advice without the actual parse error returned and perhaps the relevant log lines. Try providing those to the mailing list at influxdb@googlegroups.com or email them to support@influxdb.com.
There's an email on the Google Group that might be relevant, too. https://groups.google.com/d/msgid/influxdb/c99217b3-fdab-4684-b656-a5f5509ed070%40googlegroups.com
Have you tried using whitespace between the values and the operator? E.g. select (1000 / value) AS value....