I am doing some research on InfluxDB and found out that it uses an underlying Key-Value store for storage (like LevelDB, RocksDB, etc.).
I would like a mental model of what kind of keys are created for storing the time series data.
I am guessing something along the lines of "starting timestamp -> list of values...." but would like a more precise explanation about that.
InfluxDB works a little differently.
An InfluxDB database stores points. A point has four components: a measurement, a tagset, a fieldset, and a timestamp.
The measurement provides a way to associate related points that might have different tagsets or fieldsets. The tagset is a dictionary of key-value pairs to store metadata with a point. The fieldset is a set of typed scalar values—the data being recorded by the point.
The serialization format for points is defined by the [line protocol] (which includes additional examples and explanations if you’d like to read more detail). An example point from the specification helps to explain the terminology:
temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035
The measurement is temperature.
The tagset is machine=unit42,type=assembly. The keys, machine and type, in the tagset are called tag keys. The values, unit42 and assembly, in the tagset are called tag values.
The fieldset is internal=32,external=100. The keys, internal and external, in the fieldset are called field keys. The values, 32 and 100, in the fieldset are called field values.
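As a sketch, the example line above can be assembled from its components with plain string formatting. This is a hypothetical helper for illustration only, not the official client; it omits the line protocol's escaping rules (commas, spaces, quotes) and field type suffixes:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class LineProtocolSketch {
    // Serialize one point as: measurement,tagset fieldset timestamp
    static String toLineProtocol(String measurement, Map<String, String> tags,
                                 Map<String, String> fields, long timestampNs) {
        String tagset = tags.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
        String fieldset = fields.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
        return measurement + "," + tagset + " " + fieldset + " " + timestampNs;
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("machine", "unit42");
        tags.put("type", "assembly");
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("internal", "32");
        fields.put("external", "100");
        // Reconstructs the example point from the specification
        System.out.println(toLineProtocol("temperature", tags, fields, 1434055562000000035L));
    }
}
```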
You can check out this post for the full explanation of the internals. https://www.influxdata.com/blog/influxdb-internals-101-part-one/
Is it possible to deserialize a subset of fields from a large object serialized using Apache Avro without deserializing all the fields? I'm using GenericDatumReader and the GenericRecord contains all the fields.
I'm pretty sure you can't do it using GenericDatumReader, but my question is whether it is possible given the binary format of Avro.
Conceptually, binary serialization of Avro data is in-order and depth-first. As you traverse the data, record fields are serialized one after the other, lists are serialized from the top to the bottom, etc.
Within one object, there are no markers to separate fields, no tags to identify specific fields, and no index into the binary data to help quickly scan to specific fields.
Depending on your schema, you could write custom code to skip some kinds of data: for example, if a field is a LIST of FIXED bytes, you could read the size of the list and jump over the data to the next field. This is pretty specific, though, and wouldn't work for most Avro types (notably, integers are variable-length when encoded).
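To illustrate why only fixed-width data is skippable, here is a toy sketch of the idea, not the Avro SDK and not Avro's actual encoding (real Avro uses zig-zag varints for lengths and block-encoded arrays): given a simplified layout of [count][count fixed-size items][trailing int], you can jump over the items without decoding them.

```java
import java.nio.ByteBuffer;

public class SkipFixedSketch {
    static final int FIXED_SIZE = 16; // e.g. a "fixed" type of 16 bytes

    // Toy layout: [int count][count * FIXED_SIZE bytes][int trailingField].
    // Because each item has a known width, we can skip the whole block.
    static int readTrailingField(ByteBuffer buf) {
        int count = buf.getInt();
        buf.position(buf.position() + count * FIXED_SIZE); // jump, no decoding
        return buf.getInt();
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 3 * FIXED_SIZE + 4);
        buf.putInt(3);                     // three items follow
        buf.put(new byte[3 * FIXED_SIZE]); // opaque fixed-size items
        buf.putInt(42);                    // the field we actually want
        buf.flip();
        System.out.println(readTrailingField(buf)); // prints 42
    }
}
```

With variable-length encodings (varint integers, length-prefixed strings inside nested records) this shortcut disappears: you must decode each value just to learn where the next one starts.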
Even in that unlikely case, I don't believe there are any helpers in the Java SDK that would be useful.
In brief, Avro isn't designed to do that, and you're probably not going to find a satisfactory way to do a projection on your schema without deserializing the entire object. If you have a collection, a column-oriented format like Parquet is probably the right tool!
It is possible if the fields you want to read occur first in the record. We do this in some cases where we want to read only the header fields of an object, not the full data which follows.
You can create a "subset" schema containing just those first fields, and pass this to GenericDatumReader. Avro will deserialise those fields, and anything which comes after will be ignored, because the schema doesn't "know" about it.
But this won't work for the general case where you want to pick out fields from within the middle of a record.
I am reading some legacy code and I found that all the inserts into InfluxDB are done this way (simplified here):
influxDB.write(Point.measurement(myeasurement)
    .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
    .addField("myfield", 123)
    .tag("rnd", String.valueOf(Math.random() * 100000))
    .build());
As you can guess, the value of the "rnd" tag is different for each point, which means we can have 100k different tag values. Actually, for now we have fewer values than that, so we should end up with one distinct tag value per point...
I am not an expert in InfluxDB, but my understanding is that tags are used to group related values together, like partitions or shards in other tools. 100k tag values seems like a lot...
Is that as horrible as I think it is? Or is there any possibility that this kind of insert may be useful for something?
EDIT: I just realized that Math.random() returns a double, so the * 100000 is useless, as is the String.valueOf(). Effectively there is one series in the database per value; I can't imagine how that could be a good thing :(
It is bad and unnecessary.
Unnecessary, because each point you write to InfluxDB is already uniquely identified by its timestamp plus its set of tag values.
Bad, because each distinct set of tag values creates a separate series, and InfluxDB keeps an index over all series. Having a unique tag value for each data point will grow your system resource requirements and slow down the database. Unless you don't have that many data points; but then you don't really need a time series database, or just don't care.
As the OP said, tags are used for grouping and filtering.
Here are some good reads on the topic
https://docs.influxdata.com/influxdb/v1.7/concepts/tsi-details/
https://www.influxdata.com/blog/path-1-billion-time-series-influxdb-high-cardinality-indexing-ready-testing/
According to the documentation, the upper bound [for series, i.e. unique tag value combinations] is usually somewhere between 1 and 4 million series depending on the machine used, which is easily a day's worth of high-resolution data.
I've recently started using Redshift for housing millions of data points with a schema that looks like the following:
create table metrics (
    name  varchar(100),
    value decimal(18,4),
    time  timestamp
) sortkey (name, time);
(The real schema is a bit more complex, but this will satisfy for my question)
I'm wondering if it makes sense to normalize my metric name (currently varchar(100)) by mapping it to an integer and storing only the integer (e.g. {id: 1, name: metric1}). The cardinality of name is ~100. Adding a mapping would make the application logic quite a bit more complex, since it has many streams of input. Also, querying would require the reverse mapping.
In a traditional sql database, this would be an obvious YES, but I'm not certain how Redshift handles this as it's a columnar data store. I think it would be nice to have in general, but I would assume that Redshift would/could do some similar mapping underneath the hood since certain columns in any table have lower cardinality than others.
The answer is no. Redshift makes excellent use of compression and will store very few duplicates of your name field.
However you do need to ensure that you are making good use of Redshift's compression options. This section in the docs should tell you all you need to know: http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html
TL;DR: Run ANALYZE COMPRESSION on your table to see what compression Redshift recommends, create a new table using those encodings, and insert your data into that table.
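A sketch of that workflow, assuming the metrics table from the question (the bytedict encoding shown is illustrative; use whatever ANALYZE COMPRESSION actually recommends for your data):

```sql
-- Ask Redshift what encodings it would recommend for the existing data
analyze compression metrics;

-- Recreate the table with the recommended encodings applied
create table metrics_encoded (
    name  varchar(100) encode bytedict,
    value decimal(18,4),
    time  timestamp
) sortkey (name, time);

insert into metrics_encoded select * from metrics;
```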
Your best option is to continue to use the varchar data type, as you have here, but apply the "bytedict" compression encoding. Internally, this is the same as creating a lookup table, but it could actually be faster, since Redshift natively understands and manages its own dictionary and maps from int to string internally during column decoding.
Here is the bytedict doc reference:
http://docs.aws.amazon.com/redshift/latest/dg/c_Byte_dictionary_encoding.html
Another option that could give you good performance/storage savings for your use cases is runlength:
http://docs.aws.amazon.com/redshift/latest/dg/c_Runlength_encoding.html
I am reading http://en.wikipedia.org/wiki/Domain-driven_design right now, and I just need 2 quick examples so I understand what 'value objects' and 'services' are in the context of DDD.
Value Objects: An object that describes a characteristic of a thing. Value Objects have no conceptual identity. They are typically read-only objects and may be shared using the Flyweight design pattern.
Services: When an operation does not conceptually belong to any object. Following the natural contours of the problem, you can implement these operations in services. The Service concept is called "Pure Fabrication" in GRASP.
Value objects: can someone give me a simple example of this, please?
Services: so if it isn't an object/entity, and doesn't belong to repositories/factories, then it's a service? I don't understand this.
The archetypical example of a Value Object is Money. It's very conceivable that if you build an international e-commerce application, you will want to encapsulate the concept of 'money' into a class. This will allow you to perform operations on monetary values - not only basic addition, subtraction and so forth, but possibly also currency conversions between USD and, say, Euro.
Such a Money object has no inherent identity - it contains the values you put into it, and when you dispose of it, it's gone. Additionally, two Money objects containing 10 USD are considered identical even if they are separate object instances.
Other examples of Value Objects are measurements such as length, which might contain a value and a unit, such as 9.87 km or 3 feet. Again, besides simply containing the data, such a type will likely offer conversion methods to other measurements and so forth.
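The Money example could be sketched like this (a minimal illustration; a real implementation would use BigDecimal, handle rounding, and validate the currency code):

```java
import java.util.Objects;

public final class Money {
    private final long amountCents;
    private final String currency;

    public Money(long amountCents, String currency) {
        this.amountCents = amountCents;
        this.currency = currency;
    }

    public Money add(Money other) {
        if (!currency.equals(other.currency)) {
            throw new IllegalArgumentException("Currency mismatch");
        }
        // Value Objects are typically immutable: operations return new instances
        return new Money(amountCents + other.amountCents, currency);
    }

    // Equality is based on the contained values, not object identity
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Money)) return false;
        Money m = (Money) o;
        return amountCents == m.amountCents && currency.equals(m.currency);
    }

    @Override
    public int hashCode() {
        return Objects.hash(amountCents, currency);
    }

    public static void main(String[] args) {
        Money a = new Money(1000, "USD");
        Money b = new Money(1000, "USD");
        System.out.println(a.equals(b)); // true: same values, distinct instances
        System.out.println(a.add(b).equals(new Money(2000, "USD"))); // true
    }
}
```

Note there is no id field anywhere: two Money objects holding 10 USD are interchangeable, which is exactly what "no conceptual identity" means.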
Services, on the other hand, are types that perform an important Domain operation but don't really fit into the other, more 'noun'-based concepts of the Domain. You should strive to have as few Services as possible, but sometimes a Service is the best way of encapsulating an important Domain Concept.
You can read more about Value Objects, Services and much more in the excellent book Domain-Driven Design, which I can only recommend.
Value Objects: a typical example is an address. Equality is based on the values of the object, hence the name, and not on identity. That means that for instance 2 Person objects have the same address if the values of their Address objects are equal, even if the Address objects are 2 completely different objects in memory or have a different primary key in the database.
Services: offer actions that do not necessarily belong to a specific domain object but do act upon domain objects. As an example, I'm thinking of a service that sends e-mail notifications in an online shop when the price of a product drops below a certain price.
InfoQ has a free book on DDD (a summary of Eric Evan's book): http://www.infoq.com/minibooks/domain-driven-design-quickly
This is a great example of how to identify Value Objects vs Entities. My other post also gives another example.
Assume a data structure Person used for a contact database. The fields of the structure should be configurable, so that users can add user-defined fields to the structure and even change existing fields. So basically there should be a configuration file like
FieldNo  FieldName  DataType  DefaultValue
0        Name       String    ""
1        Age        Integer   "0"
...
The program should then load this file, manage the dynamic data structure (dynamic not in a "change during runtime" way, but in a "user can change via configuration file" way) and allow easy and type-safe access to the data fields.
I have already implemented this, storing information about each data field in a static array and storing only the changed values in the objects.
My question: Is there any pattern describing that situation? I guess that I'm not the first one running into the problem of creating a user-adjustable class?
Thanks in advance. Tell me if the question is not clear enough.
I've had a quick look through "Patterns of Enterprise Application Architecture" by Martin Fowler, and the Metadata Mapping pattern describes (at a quick glance) what you are describing.
An excerpt...
"A Metadata Mapping allows developers to define the mappings in a simple tabular form, which can then be processed by generic code to carry out the details of reading, inserting and updating the data."
HTH
I suggest looking at the various Object-Relational patterns in Martin Fowler's Patterns of Enterprise Application Architecture, available here. This is the list of patterns it covers here.
The best fit to your problem appears to be metadata mapping here. There are other patterns, Mapper, etc.
The normal way to handle this is for the class to have a list of user-defined records, each of which consists of a list of user-defined fields. The configuration information for this can easily be stored in a database table containing a type id, field type, etc. The actual data is then stored in a simple table with the data represented as (objectid + field index)/string pairs; you convert the strings to and from the real type when you read or write the database.
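A minimal sketch of that approach (all names here are hypothetical): field definitions loaded from the configuration file, with per-record values stored sparsely as strings and converted to the declared type on access.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DynamicRecordSketch {
    // One row of the configuration file: FieldNo FieldName DataType DefaultValue
    record FieldDef(int index, String name, String type, String defaultValue) {}

    // A record stores only the values that differ from the defaults
    static class DynamicRecord {
        private final List<FieldDef> schema;
        private final Map<Integer, String> changed = new HashMap<>();

        DynamicRecord(List<FieldDef> schema) { this.schema = schema; }

        void set(int fieldIndex, String value) { changed.put(fieldIndex, value); }

        String getString(int fieldIndex) {
            return changed.getOrDefault(fieldIndex, schema.get(fieldIndex).defaultValue());
        }

        int getInt(int fieldIndex) { // convert on access, per the declared type
            return Integer.parseInt(getString(fieldIndex));
        }
    }

    public static void main(String[] args) {
        List<FieldDef> schema = List.of(
                new FieldDef(0, "Name", "String", ""),
                new FieldDef(1, "Age", "Integer", "0"));
        DynamicRecord person = new DynamicRecord(schema);
        person.set(0, "Alice");
        System.out.println(person.getString(0)); // Alice
        System.out.println(person.getInt(1));    // 0 (the default)
    }
}
```

Persisting `changed` then reduces to writing (objectid, field index, string) rows, exactly the sparse table layout described above.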