InfluxDB design choices and query - time-series

I am currently writing a program that does anomaly detection on system calls by creating a database of “normal behaviour”. The goal now is to visualize the data being collected. There are about 200 occurences a minute where a process does a systemcall. Currently, I am writing these system calls to an InfluxDB where the name of the process that does the calls is a tag and the system calls it does are fields.
A datapoint now is a timestamp, measurement, the processname, a call it did
I want to visualize the occurences of certain calls a process did in a specific amount of time, for example:
How many times did process x call ‘fopen’ in the last 5 minutes?
Is my schema sufficient for this? How would a query for this look like?
Thank you in advance.

Related

How to send non aggregated metric to Influx from Springboot application?

I have a SpringBoot application that is under moderate load. I want to collect metric data for a few of the operations of my app. I am majorly interested in Counters and Timers.
I want to count the number of times a method was invoked (# of invocation over a window, for example, #invocation over last 1 day, 1 week, or 1 month)
If the method produces any unexpected result increase failure count and publish a few tags with that metric
I want to time a couple of expensive methods, i.e. I want to see how much time did that method took, and also I want to publish a few tags with metrics to get more context
I have tried StatsD-SignalFx and Micrometer-InfluxDB, but both these solutions have some issues I could not solve
StatsD aggregates the data over flush window and due to aggregation metric tags get messed up. For example, if I send 10 events in a flush window with different tag values, and the StatsD agent aggregates those events and publishes only one event with counter = 10, then I am not sure what tag values it's sending with aggregated data
Micrometer-InfluxDB setup has its own problems, one of them being micrometer sending 0 values for counters if no new metric is produced and in that fake ( 0 value counter) it uses same tag values from last valid (non zero counter)
I am not sure how, but Micrometer also does some sort of aggregation at the client-side in MeterRegistry I believe, because I was getting a few counters with a value of 0.5 in InfluxDB
Next, I am planning to explore Micrometer/StatsD + Telegraf + Influx + Grafana to see if it suits my use case.
Questions:
How to avoid metric aggregation till it reaches the data store (InfluxDB). I can do the required aggregation in Grafana
Is there any standard solution to the problem that I am trying to solve?
Any other suggestion or direction for my use case?

Can InfluxDB have Continuous Queries with same source & target measurements but with different/new tags?

Below is the scenario against which I have this question.
Requirement:
Pre-aggregate time series data within influxDb with granularity of seconds, minutes, hours, days & weeks for each sensor in a device.
Current Proposal:
Create five Continuous Queries (one for each granularity level i.e. Seconds, minutes ...) for each sensor of a device in a different retention policy as that of the raw time series data, when the device is onboarded.
Limitation with Current Proposal:
With increased number of device/sensor (time series data source), the influx will get bloated with too many Continuous Queries (which is not recommended) and will take a toll on the influxDb instance itself.
Question:
To avoid the above problems, is there a possibility to create Continuous Queries on the same source measurement (i.e. raw timeseries measurement) but the aggregates can be differentiated within the measurement using new tags introduced to differentiate the results from Continuous Queries from that of the raw time series data in the measurement.
Example:
CREATE CONTINUOUS QUERY "strain_seconds" ON "database"
RESAMPLE EVERY 5s FOR 1m
BEGIN
SELECT MEAN("strain_top") AS "STRAIN_TOP_MEAN" INTO "database"."raw"."strain" FROM "database"."raw"."strain" GROUP BY time(1s),*
END
As far as I know, and have seen from the docs, it's not possible to apply new tags in continuous queries.
If I've understood the requirements correctly this is one way you could approach it.
CREATE CONTINUOUS QUERY "strain_seconds" ON "database"
RESAMPLE EVERY 5s FOR 1m
BEGIN
SELECT MEAN("strain_top") AS "STRAIN_TOP_MEAN" INTO "database"."raw"."strain" FROM "database"."strain_seconds_retention_policy"."strain" GROUP BY time(1s),*
END
This would save the data in the same measurement but a different retention policy - strain_seconds_retention_policy. When you do a select you specify the corresponding retention policy from which to select.
Note that, it is not possible to perform a select from several retention policies at the same time. If you don't specify one, the default one is used (and not all of them). If it is something you need then another approach could be used.
I don't quite get why you'd need to define a continuous query per device and per sensor. You only need to define five (1 per seconds, minutes, hours, days, weeks) and do a group by * (all) which you already do. As long as the source datapoint has a tag with the id for the corresponding device and sensor, the resampled datapoint will have it too. Any newly added devices (data) will just be processed automatically by those 5 queries and saved into the corresponding retention policies.
If you do want to apply additional tags, you could process the data outside the database in a custom script and write it back with any additional tags you need, instead of using continuous queries

InfluxDB: query to calculate average of StatsD "executionTime" values

I'm sending metrics in StatsD format to Telegraf, which forwards them to InfluxDB 0.9.
I'm measuring execution times (of some event) from multiple hosts. The measurement is called "execTime", and the tag is "host". Once Telegraf gets these numbers, it calculates mean/upper/lower/count, and stores them in separate measurements.
Sample data looks like this in influxdb:
TIME...FIELD..............HOST..........VALUE
t1.....execTime.count.....VM1...........3
t1.....execTime.mean......VM1...........15
t1.....execTime.count.....VM2...........6
t1.....execTime.mean......VM2...........22
(So at time t1, there were 3 events on VM1, with mean execution time 15ms, and on VM2 there were 6 events, and the mean execution time was 22ms)
Now I want to calculate the mean of the operation execution time across both hosts at time t1. Which is (3*15 + 6*22)/(3+6) ms.
But since the count and mean values are in two different series, I can't simply use "select mean(value) from execTime.mean"
Do I need to change my schema, or can I do this with the current setup?
What I need is essentially a new series, which is a combination of the execTime.count and execTime.mean across all hosts. Instead of calculating this on-the-fly, the best approach seems to be to actually create the series along with the others.
So now I have two timer stats being generated on each host for each event:
1. one event with actual hostname for the 'host' tag
2. second event with one tag "host=all"
I can use the first set of series to check mean execution times per host. And the second series gives me the mean time for all hosts combined.
It is possible to do mathematical operations on fields from two different series, provided both series are members of the same measurement. I suspect your schema is non-optimized for your use case.

Custom Date queries using Cumulocity API

Is it possible to aggregate measurements or create custom queries beyond the standard dateFrom dateTo queries?
As an example, I have measurements which have a time delta of 1 minute (2015-01-01T05:05:00, 2015-01-01T05:05:00, 2015-01-01T05:05:00, ...) and I would like to query the measurements at 15 minute intervals (2015-01-01T05:15:00, 2015-01-01T05:30:00, 2015-01-01T05:45:00, ...)
So far I have only come up with these solutions:
Using the standard api request as in
https://tenant.cumulocity.com/measurement/measurements?dateFrom=2015-10-01&dateTo=2015-11-05
and then throwing away most of the data will use a massive amount of time loading the data.
Using cep (cumulocity event language) to generate a new measurement every 15 minutes using the nearest 1 minute measurement seems like a bit of overkill and not very elegant.
Batch requesting the exact minute
https://tenant.cumulocity.com/measurement/measurements?dateFrom=2015-11-05T05:15:00%2B01:00&dateTo=2015-11-05T05:16:00%2B01:00
which will in a massive amount of API requests and also does not seem very efficient.
Use the /measurements/series endpoint which will only give me all series, even those I do not want, as well as only having the aggregation options hourly and daily (as far as I can tell).
Is there a better way of doing this?
you have captured nearly all of the mechanisms that are currently available. There is one more possibility -- not sure if this is an option for you:
Mark the fifteenth measurement when sending it from the device, using e.g. a different type.
I would normally use 2. It's actually quite efficient, it's similar to a materialized view in traditional SQL, plus you can use the data everywhere and in all widgets.
Good luck :-)
Cheers,
André
I would prefer the CEP solution. The rule wouldn't be that complicated. You would of course then store these measurements twice which is not that nice but having your desired measurement with a specific type or fragment will give you the fastest way to query it.
Instead of copying the measurement you could just add a special fragment to the measurement every 15 min in the CEP rule. You cannot update measurements so you would have to delete the measurement incoming every 15 min and then create a new measurement with exactly the same values but add a fragement (e.g. "aggregatedMeasurement": {}).
Your query then looks like this:
https://tenant.cumulocity.com/measurement/measurements?dateFrom=2015-10-01&dateTo=2015-11-05&fragmentType=aggregatedMeasurement
One more idea for point 3:
You could use SmartREST to create a template with the query string and leave the dateFrom and dateTo as placeholders.
From the client side you then would have to make only one request using the bulking feature in SmartREST.
On the server side this would still be transformed into the single requests so you wouldn't gain anything in speed.

InfluxDB performance

For my case, I need to capture 15 performance metrics for devices and save it to InfluxDB. Each device has a unique device id.
Metrics are written into InfluxDB in the following way. Here I only show one as an example
new Serie.Builder("perfmetric1")
.columns("time", "value", "id", "type")
.values(getTime(), getPerf1(), getId(), getType())
.build()
Writing data is fast and easy. But I saw bad performance when I run query. I'm trying to get all 15 metric values for the last one hour.
select value from perfmetric1, perfmetric2, ..., permetric15
where id='testdeviceid' and time > now() - 1h
For an hour, each metric has 120 data points, in total it's 1800 data points. The query takes about 5 seconds on a c4.4xlarge EC2 instance when it's idle.
I believe InfluxDB can do better. Is this a problem of my schema design, or is it something else? Would splitting the query into 15 parallel calls go faster?
As #valentin answer says, you need to build an index for the id column for InfluxDB to perform these queries efficiently.
In 0.8 stable you can do this "indexing" using continuous fanout queries. For example, the following continuous query will expand your perfmetric1 series into multiple series of the form perfmetric1.id:
select * from perfmetric1 into perfmetric1.[id];
Later you would do:
select value from perfmetric1.testdeviceid, perfmetric2.testdeviceid, ..., permetric15.testdeviceid where time > now() - 1h
This query will take much less time to complete since InfluxDB won't have to perform a full scan of the timeseries to get the points for each testdeviceid.
Build an index on id column. Seems that he engine uses full scan on table to retrieve data. By splitting your query in 15 threads, the engine will use 15 full scans and the performance will be much worse.

Resources