InfluxDB: Group rows with same timestamp

Assume a DB with the following data records:
2018-04-12T00:00:00Z value=1000 [series=distance]
2018-04-12T00:00:00Z value=10 [series=signal_quality]
2018-04-12T00:01:00Z value=1100 [series=distance]
2018-04-12T00:01:00Z value=0 [series=signal_quality]
There is one field called value. Square brackets denote the tags (further tags omitted). As you can see, the data is captured in different data records instead of using multiple fields on the same record.
Given the above structure, how can I query the time series of distances, filtered by signal quality? The goal is to only get distance data points back when the signal quality is above a fixed threshold (e.g. 5).

"Given the above structure", there's no way to do it in plain InfluxDB.
Please keep in mind - Influx is NONE of a relational DB, it's different, despite query language looks familiar.
Again, given that structure - you can proceed with Kapacitor, as it was already mentioned.
But I strongly suggest you to rethink the structure, if it is possible, if you're able to control the way the metrics are collected.
If that is not an option - here's the way: spin a simple job in Kapacitor that will just join the two points into one basing on time (check this out for how), and then drop it into new measurement.
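A minimal TICKscript sketch of such a join job (the database mydb, retention policy autogen, and source measurement metrics are assumptions; adjust to your schema):

var distance = stream
    |from()
        .database('mydb')
        .retentionPolicy('autogen')
        .measurement('metrics')
        .where(lambda: "series" == 'distance')

var quality = stream
    |from()
        .database('mydb')
        .retentionPolicy('autogen')
        .measurement('metrics')
        .where(lambda: "series" == 'signal_quality')

// join points from the two streams that share a timestamp
distance
    |join(quality)
        .as('dist', 'qual')
        .tolerance(1s)
    // flatten the prefixed joined fields into plain names
    |eval(lambda: "dist.value", lambda: "qual.value")
        .as('distance', 'signal_quality')
    // write the combined point into a new measurement
    |influxDBOut()
        .database('mydb')
        .retentionPolicy('autogen')
        .measurement('DistanceQualityTogether')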
The resulting data point would then look like this:
DistanceQualityTogether,tag1=if,tag2=you,tag3=need,tag4=em distance=1000,signal_quality=10 1523491200000000000
The rest is straightforward with such a measurement.
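With the combined measurement in place, the original filtering question becomes a plain InfluxQL query (measurement and field names taken from the example above):

SELECT "distance" FROM "DistanceQualityTogether" WHERE "signal_quality" > 5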
But again, if you can configure your metrics to be sent in this combined form in the first place, better do it.

Related

Will an additional, rarely changing tag have a relevant "cost"?

In InfluxDB (Cloud) I store measurements from many weather sensors.
So, each InfluxDB data point looks like this:
weather,deviceId=1234 temperature=21.3,humidity=66 1559260800000000000
To the existing single deviceId tag I'd like to add a second tag, position, resulting in points such as...
weather,deviceId=1234,position=243 temperature=21.3,humidity=66 1559260800000000000
For a given deviceId the position would change very rarely, but it can happen.
When querying sensor data, the deviceId and position would always be filtered together.
Will this additional tag noticeably increase the billed storage (GB-hr) or negatively affect performance, or is InfluxDB able to optimize/compress this additional tag?
Some more context: some sensors might be reused and placed at a different location. It would not make much sense to analyze data of a single sensor across different positions, hence data would always be filtered like "sensor 1234 at position 243". As this adds a fifth value to an otherwise relatively small data point, I'm worried that it might "cost" too much.
The short answer is no for both storage size and write/read performance in your case. That means you should be fine adding the additional tag.
Tag values are stored only once and always as strings. See more details here.
Since the values of position are limited and far fewer than those of deviceId, the storage overhead will be negligible. See more details here.
Regarding the write/read performance, it all comes down to series cardinality, which is simply
The number of unique measurement, tag set, and field key combinations in an InfluxDB bucket.
Again, since values of position are limited and change very rarely, it won't add much cardinality to the database. Therefore, the write/read performance won't be impacted that much.
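As a rough worked example (all numbers assumed for illustration): with 1,000 devices and one measurement, the old schema holds about 1,000 series. If each device sits at one position at a time and, say, 10% of devices are relocated once over their lifetime, the new schema holds about 1,100 series. That is a marginal increase, far below the millions of series where cardinality typically becomes a problem.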
You could check the series cardinality of the old schema and new schema as follows to make the comparison:
Query series cardinality in the weather measurement:
import "influxdata/influxdb"
influxdb.cardinality(
bucket: "yourBucket",
start: -1y,
predicate: (r) => r._measurement == "weather",
)
See more details here.

Performance difference between measure name and tags

I have an IoT application where all data comes from different sensors with a standard payload; all that changes is the variable ID, a four-digit hex string.
I currently use something like data.varID as my measurement name. The varID is also a tag, even if that is redundant. But this is somewhat inconvenient, as sometimes I want to be able to easily query data across more than one varID.
I have tried to find the answer to this question but cannot find it: what's the difference between
having lots of data.varID measurements
or
having a single data measurement with varID as a tag?
As I understand it, both are equivalent in terms of the number of time series in the database, so is there any other consideration?
The types of queries I usually need are simple:
SELECT "value" FROM "db1"."autogen"."data.org1.global.5051" WHERE time > now() - 24h AND ("device"='d--0000-0000-0000-0acf' OR "device"='d--0000-0000-0000-0ace')
so basically getting data for a given variable across devices for a period of time. But in some cases I also want to get more than one variable at a time, which is why I would like to instead do something like:
SELECT "value" FROM "db1"."autogen"."data.org1" WHERE time > now() - 24h AND ("device"='d--0000-0000-0000-0acf' OR "device"='d--0000-0000-0000-0ace') AND ("variable"='5051' OR "variable"='5052')
but in that case I would be putting everything in a single measurement, with "device", "variable" (and a couple of other things) as tags.
So, is there anything else I need to consider before switching to having a single measurement for my whole database?
Since nobody was able to answer this question, I will answer it as best I understand it.
There does not seem to be any performance difference between one large measurement vs. many smaller measurements.
But there is a critical difference, which in our case ended up forcing us into multiple measurements:
While our different measurements share the same base fields, some measurements can have additional fields.
The problem is that fields seem to be associated with the measurement itself, so if we add
data,device=0bd8,var=5053 value=10 1574173550390000
data,device=0bd8,var=5053 value=10 1574173550400000
data,device=0bd8,var=5054 foo=12,value=10 1574173550390000
data,device=0bd8,var=5055 bar=10,value=10 1574173550390000
the fact that var 5054 has a foo field and 5055 has a bar field means that when you query any variable, you will get both foo and bar (set to None if they don't exist):
{'foo': None, 'bar': None}
This means that if you have 100 variables and each adds, say, 5 custom fields, you will end up with 500 fields every time you query. And while this is not a storage issue, the fact that the fields are associated with the measurement means the JSON object returned by the database keeps growing with every field ever written to it, even when most fields are set to None.
If the schema were identical across all measurements, then there seems to be no difference between using a single data measurement (with different tags) vs. multiple data.<var> measurements.
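One further note if you do stay with multiple data.<var> measurements: InfluxQL accepts a regular expression in the FROM clause, so several of them can still be queried at once. A sketch using the measurement naming from the question's examples:

SELECT "value" FROM /^data\.org1\.global\./ WHERE time > now() - 24h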

"Transactional safety" in influxDB

We have a scenario where we want to frequently change the tag of a (single) measurement value.
Our goal is to create a database for storing prognosis values. It should never lose data and should track changes to already written data, such as corrections or overwrites.
Our current plan is to have an additional field "write_ts", which indicates at which point in time the measurement value was inserted or changed, and a tag "version" which is updated with each change.
Furthermore the version '0' should always contain the latest value.
name: temperature
-----------------
time                  write_ts (field)  current_mA (field)  version (tag)  machine (tag)
2015-10-21T19:28:08Z  1445506564        25                  0              injection_molding_1
So let's assume I have an updated prognosis value for this example.
So, I do:
SELECT curr_measurement
INSERT curr_measurement with new tag (version = 1)
DROP curr_measurement
// then
INSERT new_measurement with version = 0
Now my question:
If I lose the connection for whatever reason between the SELECT, INSERT, and DROP:
I would get duplicate records.
(Or, if I do SELECT, DROP, INSERT, I lose data.)
Is there any method to prevent that?
Transactions don't exist in InfluxDB
InfluxDB is a time-series database, not a relational database. Its main use case is not one where users are editing old data.
In a relational database that supports transactions, you are protecting yourself against UPDATE and similar operations. Data comes in, existing data gets changed, you need to reliably read these updates.
The main use case in time-series databases is a lot of raw data coming in, followed by some filtering or transforming to other measurements or databases. Picture a one-way data stream. In this scenario, there isn't much need for transactions, because old data isn't getting updated much.
How you can use InfluxDB
In cases like yours, where there is additional data being calculated based on live data, it's common to place this new data in its own measurement rather than as a new field in a "live data" measurement.
As for version tracking and reliably getting updates:
1) Does the version number tell you anything the write_ts number doesn't? Consider not using it, if it's simply a proxy for write_ts. If version only ever increases, it might be duplicating the info given by write_ts, minus the usefulness of knowing when the change was made. If version is expected to decrease from time to time, then it makes sense to keep it.
2) Similarly, if you're keeping old records: does write_ts tell you anything that the time value doesn't?
3) Logging. Do you need to overwrite (update) values? Or can you get what you need by adding new lines, increasing write_ts or version as appropriate? The latter is the more "InfluxDB-ish" approach.
4) Reading values. You can read all values as they change with updates. If a client app only needs to know the latest value of something that's being updated (and the time it was updated), querying becomes something like:
SELECT LAST(write_ts), current_mA, machine FROM temperature
You could also try grouping the machine values together:
SELECT LAST(*) FROM temperature GROUP BY machine
So what happens instead of transactions?
In InfluxDB, inserting a point with the same measurement, tag set, and timestamp overwrites any existing values of the same field keys, and adds new field keys. So when duplicate entries are written, the last write "wins".
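For example, writing these two lines of line protocol (nanosecond timestamp chosen for illustration):

temperature,machine=injection_molding_1 current_mA=25 1445506564000000000
temperature,machine=injection_molding_1 current_mA=26 1445506564000000000

leaves exactly one point at that timestamp, with current_mA=26.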
So instead of the traditional SELECT-then-UPDATE approach, it's more like: SELECT from A, calculate on the result, then INSERT it into B, possibly with a new timestamp.
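In the influx CLI that pattern looks roughly like this (temperature_prognosis is a hypothetical target measurement; the computed value 27 stands in for your client-side calculation):

> SELECT current_mA, write_ts FROM temperature WHERE "machine" = 'injection_molding_1'
> INSERT temperature_prognosis,machine=injection_molding_1 current_mA=27,write_ts=1445506700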
Personally, I've found InfluxDB excellent for its ability to accept streams of data from all directions, and its simple protocol and schema-free storage mean that new data sources are almost trivial to add. But if my use case has old data being regularly updated, I use a relational database.
Hope that clears up the differences.

Using one tag per value in InfluxDB

I am reading some legacy code and found that all the inserts into InfluxDB are done this way (simplified here):
influxDB.write(Point.measurement(myeasurement)
        .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
        .addField("myfield", 123)
        .tag("rnd", String.valueOf(Math.random() * 100000))
        .build());
As you can guess, the value of the "rnd" tag is different for each point, which means we can end up with 100k different tag values. Actually, for now we have fewer points than that, so we end up with a different tag value for every single data point...
I am not an expert in InfluxDB, but my understanding is that tags are used by InfluxDB to group related values together, like partitions or shards in other tools. 100k tag values seems like a lot...
Is that as horrible as I think it is? Or is there any chance that this kind of insert may be useful for something?
EDIT: I just realized that Math.random() returns a double, so the * 100000 does not actually cap the number of distinct values, and neither does the String.valueOf(). In effect there is one series in the database per data point; I can't imagine how that could be a good thing :(
It is bad and unnecessary.
Unnecessary, because each point that you write to InfluxDB is already uniquely identified by its timestamp plus the set of applied tag values.
Bad, because each distinct set of tag values creates a separate series, and InfluxDB keeps an index over all series. Having a unique tag value for each data point will grow your system resource requirements and slow down the database. Unless you only have a handful of data points, in which case you don't really need a time-series database anyway, or you just don't care.
As the OP said, tags are meant for grouping and filtering.
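If the random value is genuinely needed as data, a minimal sketch of the fix is to store it as a field instead of a tag; fields are not indexed, so they cannot inflate series cardinality:

influxDB.write(Point.measurement(myeasurement)
        .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
        .addField("myfield", 123)
        // a field, unlike a tag, does not create a new series per value
        .addField("rnd", Math.random() * 100000)
        .build());

If the value is not needed at all, simply drop the .tag() call.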
Here are some good reads on the topic
https://docs.influxdata.com/influxdb/v1.7/concepts/tsi-details/
https://www.influxdata.com/blog/path-1-billion-time-series-influxdb-high-cardinality-indexing-ready-testing/
According to the documentation, "the upper bound [for series, i.e. unique tag value sets] is usually somewhere between 1 - 4 million series depending on the machine used", and at one series per point that is easily a single day's worth of high-resolution data.

Querying temporal data in Neo4j

There are several possible ways I can think of to store and then query temporal data in Neo4j. Looking at an example of being able to search for recurring events and any exceptions, I can see two possibilities:
One easy option would be to create a node for each occurrence of the event. Whilst it is easy to construct a Cypher query to find all events on a day, in a range, etc. (sketched below), this could create a lot of unnecessary nodes. On the other hand, it would make it very easy to change an individual event's time, location, etc., because there is already a node holding the basic information.
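For the node-per-occurrence option, a range query is straightforward. A minimal sketch, where the Event/Occurrence labels, the OCCURS_AT relationship, and the timestamp property are all assumed names:

// find all event occurrences within a time range
MATCH (e:Event)-[:OCCURS_AT]->(o:Occurrence)
WHERE o.timestamp >= $rangeStart AND o.timestamp < $rangeEnd
RETURN e.name, o.timestamp
ORDER BY o.timestamp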
The second option is to store the recurrence temporal pattern as a property of the event node. This would greatly reduce the number of nodes within the graph. When searching for events on a specific date or within a range, all nodes that meet the start/end date (plus any other) criteria could be returned to the client. It then boils down to iterating through the results to pluck out the subset whose temporal pattern gives a date within the search range, then comparing that against any exceptions and merging (or ignoring) the results as necessary (this could probably be partially achieved when pulling the initial result set as part of the query).
Whilst the second option is the one I would choose currently, it seems quite inefficient in that it processes the data twice, albeit a smaller subset the second time. Even a plugin to Neo4j would probably result in two passes through the data, but the processing would be done on the database server rather than the requesting client.
What I would like to know is whether it is possible to use Cypher or Neo4j to do this processing as part of the initial query?
Whilst I'm not 100% sure I understand your requirement, I'd have a look at this blog post; perhaps you'll find a bit of inspiration there: http://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html
