I have developed a project using InfluxDB and I am currently trying to understand why my Influx container keeps crashing with out-of-memory (OOM) kills.
The way I designed my database is quite basic. I have several buildings, and for each building I need to store time-based values. So I created a database for each building, and a measurement for each type of value (for example, energy consumption).
I do not use tags at all, because with the design described above, all I have left to store is the float values and their timestamp index. I like this design because every building is completely separated from the others (as they should be), and if I want to get data from one of them, I just need to connect to the building's database (or bucket) and query it like so:
SELECT * FROM field1,field2 WHERE time>d1 and time<d2
According to this Influx article, if I understand correctly (English isn't my first language), I have a cardinality of:
3 buildings (buckets/databases) * 1000 fields (measurements) * 1 (default tag?) = 3000 cardinality
This doesn't seem to be much, thus I think I misunderstand something.
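The arithmetic above can be written out explicitly. This is just a sketch using the question's numbers; note that in classic InfluxDB, series cardinality is driven by measurements and unique tag sets, and whether fields count depends on the version and tooling doing the counting:

```python
# Rough cardinality estimate using the question's own formula and numbers.
# Fields are included only because the linked article's formula counts them;
# they do not multiply series cardinality in all InfluxDB versions.
buckets = 3            # one database/bucket per building
fields = 1000          # one field/measurement per value type
tag_combinations = 1   # no tags in this schema
cardinality = buckets * fields * tag_combinations
print(cardinality)  # 3000
```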
Related
I have an IoT application where all data comes from different sensors with a standard payload; all that changes is the variable ID, a four-digit hex string.
I currently use something like data.varID as my measurement name. The varID is also a tag, even if redundant. But this is somewhat inconvenient, as sometimes I want to be able to easily query data across more than one varID.
I have tried to find the answer to this question but cannot find it: what’s the difference between
having lots of data.varID measurements
Or
have a single data measurement with varID as a tag
As I understand, both would be equivalent in terms of the number of time series in the database so is there any other consideration?
The types of queries I usually need are simple:
SELECT "value" FROM "db1"."autogen"."data.org1.global.5051" WHERE time > now() - 24h AND ("device"='d--0000-0000-0000-0acf' OR "device"='d--0000-0000-0000-0ace')
so basically, getting data for a given variable across devices for a period of time. But in some cases I also want to get more than one variable at a time, which is why I would like to instead do something like:
SELECT "value" FROM "db1"."autogen"."data.org1" WHERE time > now() - 24h AND ("device"='d--0000-0000-0000-0acf' OR "device"='d--0000-0000-0000-0ace') AND ("variable"='5051' OR "variable"='5052')
but at this time, I would be putting everything on a single measurement, with "device", "variable" (and a couple other things) as tags.
So, is there anything I need to consider before switching to a single measurement for my whole database?
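For comparison, here is a sketch of the same reading encoded under the two candidate schemas as line protocol strings (names follow the question; the timestamp and value are made up):

```python
# Hypothetical reading encoded two ways; only the schema differs.
ts = 1574173550390000000  # nanosecond timestamp (made up)

# Option A: variable ID baked into the measurement name
option_a = f"data.org1.global.5051,device=d--0000-0000-0000-0acf value=10 {ts}"

# Option B: one measurement, variable ID as a tag
option_b = f"data.org1,device=d--0000-0000-0000-0acf,variable=5051 value=10 {ts}"

print(option_a)
print(option_b)
```

Both encodings produce the same number of series; the difference is that option B lets a single query filter or group across variables with a tag predicate.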
Since nobody was able to answer this question, I will answer it as best I now understand it.
There does not seem to be any performance difference between one large measurement vs. several smaller measurements.
But there is a critical difference, which in our case, ended up forcing us into multiple measurements:
In our case, while the different measurements share the same core fields, some measurements can have additional fields.
The problem is that fields appear to be associated with the measurement itself, so if we add
data,device=0bd8,var=5053 value=10 1574173550390000
data,device=0bd8,var=5053 value=10 1574173550400000
data,device=0bd8,var=5054 foo=12,value=10 1574173550390000
data,device=0bd8,var=5055 bar=10,value=10 1574173550390000
the fact that var 5054 has a foo field and 5055 has a bar field means that when you query any variable, you will get both foo and bar (set to None where they don't exist):
{'foo': None, 'bar': None}
This means that if you have 100 variables, and each adds, say, 5 custom fields, you will end up with 500 fields in every query result. And while this is not a storage issue, the fact that the fields are associated with the measurement means the JSON objects returned by the database keep growing as fields are added, even with most fields set to None.
If the schema were identical across all measurements, then there seems to be no difference between using a single data measurement (with different tags) vs. multiple data.<var> measurements.
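The field-merging behaviour described above can be simulated client-side. This is a sketch with made-up points, not a real InfluxDB query; it just models how field keys written under one measurement are unioned into a single schema:

```python
# Hypothetical points, mirroring the line protocol samples above.
points = [
    {"measurement": "data", "tags": {"var": "5053"}, "fields": {"value": 10}},
    {"measurement": "data", "tags": {"var": "5054"}, "fields": {"foo": 12, "value": 10}},
    {"measurement": "data", "tags": {"var": "5055"}, "fields": {"bar": 10, "value": 10}},
]

# The measurement's schema is the union of every field key ever written to it.
schema = sorted({k for p in points for k in p["fields"]})

# Querying var=5053 still returns every key, padded with None.
row = {k: points[0]["fields"].get(k) for k in schema}
print(row)  # {'bar': None, 'foo': None, 'value': 10}
```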
I am reading some legacy code and I found that all the inserts into InfluxDB are done this way (simplified code here):
influxDB.write(Point.measurement(myMeasurement)
        .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
        .addField("myfield", 123)
        .tag("rnd", String.valueOf(Math.random() * 100000))
        .build());
As you can guess, the value of the tag "rnd" is different for each point, which means that we can have 100k different tag values. Actually, for now we have fewer points than that, so we end up with one distinct tag value per point...
I am not an expert in InfluxDB, but my understanding is that tags are used by InfluxDB to group related values together, like partitions or shards in other tools. 100k tag values seems like a lot...
Is that as horrible as I think it is? Or is there any possibility that this kind of insert may be useful for something?
EDIT: I just realized that Math.random() returns a double, so the * 100000 does not actually cap the number of distinct values, and neither does the String.valueOf(). In practice, there is one series in the database per data point; I can't imagine how that could be a good thing :(
It is bad and unnecessary.
Unnecessary, because each point that you write to InfluxDB is already uniquely identified by its timestamp plus the set of applied tag values.
Bad, because each distinct set of tag values creates a separate series. InfluxDB keeps an index over the series, so having a unique tag value for each data point will grow your system resource requirements and slow down the database. (Unless you don't have that many data points; but then you either don't really need a time-series database, or just don't care.)
As the OP said, tags are meant for grouping by and filtering.
Here are some good reads on the topic
https://docs.influxdata.com/influxdb/v1.7/concepts/tsi-details/
https://www.influxdata.com/blog/path-1-billion-time-series-influxdb-high-cardinality-indexing-ready-testing/
According to the documentation, "the upper bound [for series or unique tag values] is usually somewhere between 1 - 4 million series depending on the machine used", which is easily a day's worth of high-resolution data.
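A quick sketch of why a per-point random tag is the worst case: every distinct tag value becomes its own series, so the series index grows with the data itself (hypothetical numbers, mirroring the Math.random() tag above):

```python
import random

# Each unique (measurement, tag value) pair is a separate series.
# With a fresh random tag on every point, series count ~= point count.
random.seed(1)  # fixed seed so the sketch is repeatable
points = [("mymeasurement", str(random.random() * 100000)) for _ in range(10_000)]
series = set(points)
print(len(series))  # essentially one series per point
```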
Assume a DB with the following data records:
2018-04-12T00:00:00Z value=1000 [series=distance]
2018-04-12T00:00:00Z value=10 [series=signal_quality]
2018-04-12T00:01:00Z value=1100 [series=distance]
2018-04-12T00:01:00Z value=0 [series=signal_quality]
There is one field called value. Square brackets denote the tags (further tags omitted). As you can see, the data is captured in different data records instead of using multiple fields on the same record.
Given the above structure, how can I query the time series of distances, filtered by signal quality? The goal is to only get distance data points back when the signal quality is above a fixed threshold (e.g. 5).
"Given the above structure", there's no way to do it in plain InfluxDB.
Please keep in mind: InfluxDB is not a relational database; it is different, even though the query language looks familiar.
Again, given that structure, you can proceed with Kapacitor, as was already mentioned.
But I strongly suggest you rethink the structure if possible, i.e. if you are able to control the way the metrics are collected.
If that is not an option, here's the way: spin up a simple job in Kapacitor that will just join the two points into one based on time (check this out for how), and then drop the result into a new measurement.
The data point would look like this, then:
DistanceQualityTogether,tag1=if,tag2=you,tag3=need,tag4=em distance=1000,signal_quality=10 2018-04-12T00:00:00Z
The rest is obvious with such a measurement.
But again, if you can configure your metrics to be sent like this - better do it.
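If Kapacitor is not available, the same join-then-filter can be done client-side after two separate queries. A sketch using the question's sample values; `THRESHOLD` is the fixed quality threshold from the question:

```python
# Hypothetical results of two queries, keyed by timestamp.
distance = {"2018-04-12T00:00:00Z": 1000, "2018-04-12T00:01:00Z": 1100}
quality  = {"2018-04-12T00:00:00Z": 10,   "2018-04-12T00:01:00Z": 0}

THRESHOLD = 5  # keep distance points only where signal quality exceeds this

# Join on timestamp, then filter by the paired quality value.
filtered = {t: d for t, d in distance.items() if quality.get(t, 0) > THRESHOLD}
print(filtered)  # {'2018-04-12T00:00:00Z': 1000}
```

This is exactly the join the Kapacitor job performs server-side; doing it in the client trades extra data transfer for a simpler deployment.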
A web application has 4 types of users. I want to track the response time of each type of user.
Solution 1 : Create a measurement with tag of user type which gives me a cardinality of 4.
Solution 2: Create a measurement for each type of user, which also gives a cardinality of 4 (one series coming from each measurement).
Suppose I am not interested in combining the data of the different user types, so issuing multiple queries against InfluxDB is not a problem.
What is the outcome of each solution in terms of performance, storage and memory? Which one is the influxdb way?
go with solution 1:
https://docs.influxdata.com/influxdb/v1.6/concepts/schema_and_data_layout/#don-t-encode-data-in-measurement-names
Don’t encode data in measurement names
In general, taking this step will simplify your queries. InfluxDB queries merge data that fall within the same measurement; it’s better to differentiate data with tags than with detailed measurement names.
My use case for InfluxDB is storing and trending process data coming from different PLCs. I visualize this data using Grafana. In a first pilot, I used the schema design guidelines from InfluxDB, using a generic measurement name and separating the different value sources by means of tags.
For example, I have 2 pumps in the 'acid' pump group and 2 pumps in the 'caustic' pump group, of which I record the pressure:
- pump_pressure {pump: pump_1, group: acid}
- pump_pressure {pump: pump_2, group: acid}
- pump_pressure {pump: pump_1, group: caustic}
- pump_pressure {pump: pump_2, group: caustic}
In my use case, the end user wants to be able to make their own trends, using Grafana for example. While this way of recording the data conforms to the schema design guidelines of InfluxDB (I think), it is very confusing for non-technical people who are not used to working with and thinking in SQL-like languages.
Therefore, I'm tempted to store the data in the way they are used to, which is the general way of working in similar products (historians):
- ACID_pump_1_pressure
- ACID_pump_2_pressure
- CAUSTIC_pump_1_pressure
- CAUSTIC_pump_2_pressure
This would make it much easier for the end user to make trends, as one measurement = one data source, and they don't have to worry about WHERE and GROUP BY clauses.
Can anyone point me to some clues about the impact of the latter approach on InfluxDB performance and storage? Will the data take more space this way? Please note that the latter method can lead to a few thousand measurements, but their cardinality would all be 1.
There is no reason you can't do that if it fits your use-case better. The guidelines that you start with are there because it unlocks the full power of InfluxDB's tagging capability.
There will be no performance or storage implications. Internally, InfluxDB creates a new series for each unique series "key", where the key is the combination of the measurement name and the tag key/value pairs.
ie, each of these is a separate series:
pump_pressure,pump=pump_1,group=acid
pump_pressure,pump=pump_2,group=acid
pump_pressure,pump=pump_1,group=caustic
pump_pressure,pump=pump_2,group=caustic
also, each of these is a separate series:
ACID_pump_1_pressure
ACID_pump_2_pressure
CAUSTIC_pump_1_pressure
CAUSTIC_pump_2_pressure
EDIT, source: I work at InfluxData
EDIT 2: this being said, I also agree fully with #srikanta, and I would recommend keeping the tags but finding another solution for how the users interact with the db (or educating them).
Indeed you can go with this approach; however, it is not scalable. What if the number of pumps increases? The approach still works, since the number of measurements stays equal to the number of time series, but it becomes a pain to manage.
If the problem is the non-technical users' interaction with SQL-like queries, then that should be addressed with a different approach, not by altering the "schema" of the database.
Some more insights --> https://blog.zhaw.ch/icclab/influxdb-design-guidelines-to-avoid-performance-issues/