Slow query with 22 million points - influxdb

I have 1 TB of text data.
I installed InfluxDB on a machine with 240 GB of RAM and 32 CPUs.
I have only inserted around 22 million points into one measurement, with one tag and 110 fields.
When I run a query (select id from ts limit 1), it takes more than 20 seconds, and this is not good.
Can you please help me with what I should do to get good performance?

How many series do you have?
Maybe your problem comes from here:
https://docs.influxdata.com/influxdb/v1.2/concepts/schema_and_data_layout/#don-t-have-too-many-series
"Tags containing highly variable information like UUIDs, hashes, and random strings will lead to a large number of series in the database, known colloquially as high series cardinality. High series cardinality is a primary driver of high memory usage for many database workloads."
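A minimal sketch of how you could check the series count with the Python influxdb client (host, port and database name are assumptions, not from the question):

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

# SHOW SERIES returns one row per series; its length is the series cardinality.
series = list(client.query("SHOW SERIES").get_points())
print(f"series count: {len(series)}")

# On InfluxDB 1.4+ the server can compute this directly:
# print(list(client.query("SHOW SERIES CARDINALITY").get_points()))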

Related

dask pivot_table seems very slow

I'm collating a simple pivot table of which position was owned by which fund on which date. It reads a fairly large parquet file per day, extracting just 3 columns. The result for 1 year is ~100K rows and 365 columns, each element containing NaN or a single short string (for "fund").
This takes 46 seconds for one year of data, if I fall back to pandas before creating the pivot table.
import dask.dataframe as dd
ddf = dd.read_parquet("s3://.../2021*.parquet", columns=["as_of_date", "position_id", "fund"]).categorize(columns=['as_of_date'])
ddf.compute().pivot_table(index='position_id', columns='as_of_date', values='fund', aggfunc='first')
However, that means I'm pushing a lot of data around; I would think it would be more efficient, or at least as efficient, to have dask do the pivot table:
ddf = dd.read_parquet("s3://.../2021*.parquet", columns=["as_of_date", "position_id", "fund"]).categorize(columns=['as_of_date'])
ddf.pivot_table(index='position_id', columns='as_of_date', values='fund', aggfunc='first').compute()
But that takes 12 minutes and, if I give it more than one year of data, dies with a KilledWorker exception.
Any guidance on how to diagnose what's going on here? I'm not clear why it struggles so much with the pivot_table. FWIW, there are no duplicates, but pivot() doesn't exist in dask, so aggfunc='first' is the simplest placeholder I can think of to make it work.
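One way to see where the time and memory go is the distributed scheduler's performance report; a minimal sketch, assuming a local cluster (the report filename is illustrative, and the S3 path is the elided one from the question):

import dask.dataframe as dd
from dask.distributed import Client, performance_report

client = Client()  # local cluster; its dashboard also shows per-task memory use
with performance_report(filename="pivot-report.html"):
    ddf = dd.read_parquet(
        "s3://.../2021*.parquet",  # path as given in the question
        columns=["as_of_date", "position_id", "fund"],
    ).categorize(columns=["as_of_date"])
    ddf.pivot_table(
        index="position_id", columns="as_of_date", values="fund", aggfunc="first"
    ).compute()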
Performing some operations on dask dataframes can currently lead to processing bottlenecks. These bottlenecks can cause memory problems or inefficient data shuffling. Pivoting is one such operation. This can be seen in the DAG for a sample pivot:
from dask.datasets import timeseries
df = timeseries(end="2000-01-05").categorize(columns=["name"])
pivot = df.pivot_table(index="id", columns="name", values="x")
pivot.visualize()
As the number of partitions increases, the DAG gets more involved. To see this, try increasing the end="2000-01-05" kwarg to a later date, as in the sketch below.
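A minimal sketch of that, assuming graphviz is installed for the rendering (the output filename is illustrative):

from dask.datasets import timeseries

# A later end date means more daily partitions and a noticeably larger pivot DAG.
df = timeseries(end="2000-01-31").categorize(columns=["name"])
pivot = df.pivot_table(index="id", columns="name", values="x")
print(df.npartitions)             # number of input partitions
pivot.visualize("pivot_dag.svg")  # render the task graph to a file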

Iterating over large amounts of data in InfluxDB

I am looking for an efficient way to iterate over the full data of an InfluxDB table with ~250 million entries.
I am currently paginating the data using the OFFSET and LIMIT clauses, but this takes a lot of time for higher offsets.
SELECT * FROM diff ORDER BY time LIMIT 1000000 OFFSET 0
takes 21 seconds, whereas
SELECT * FROM diff ORDER BY time LIMIT 1000000 OFFSET 40000000
takes 221 seconds.
I am using the Python influxdb wrapper to send the requests.
Is there a way to optimize this or stream the whole table?
UPDATE: Remembering the timestamp of the last received data point, and then using WHERE time >= last_timestamp on the next query, reduces the query time for higher offsets drastically (query time is always ~25 secs). This is rather cumbersome, however, because if two data points share the same timestamp, some results might be present on two pages of data, which has to be detected somehow.
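A minimal sketch of that time-based pagination with the Python influxdb client (host, port and database name are assumptions; "diff" is the measurement from the question):

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

last_time = "1970-01-01T00:00:00Z"
total = 0
while True:
    q = ("SELECT * FROM diff "
         f"WHERE time >= '{last_time}' "
         "ORDER BY time LIMIT 1000000")
    points = list(client.query(q).get_points())
    if not points:
        break
    total += len(points)  # replace with the real per-batch processing
    # Points sharing the boundary timestamp reappear on the next page and
    # must be de-duplicated, as noted above.
    if points[-1]["time"] == last_time:
        break  # the whole page shares one timestamp; stop to avoid looping
    last_time = points[-1]["time"]
print(total)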
You should use Continuous Queries or Kapacitor. Can you elaborate on your use case: what are you doing with the stream of data?

InfluxDB TOP function poor performance

I'm using InfluxDB and I'm trying to query values in it with TOP() function.
Here is an example of a request:
SELECT TOP("duration", 2) AS "top_duration" FROM "range" WHERE "time" > '2017-11-23T15:23:32.243Z' AND "contract" = 'A0000544' AND "type" = 'PRESENCE' AND "room" = '3908' AND "endTime" < 80785557 AND "startTime" > 28630649
In the measurement, contract, type and room are tags; duration, startTime and endTime are fields.
I have around 37 866 326 points in range, but only 78 962 for contract 'A0000544' and 10 487 for room '3908'.
This request takes several seconds and I'm trying to reduce the processing time.
I tried creating another measurement to reduce my sample, keeping only the biggest "duration" values.
I kept only 4 066 728 points but the processing time was the same.
When I keep only the points for that contract in the measurement, the request takes around 300 ms.
I don't understand why there is such a large difference in execution time against the nearly empty measurement, yet no difference at all with the filtered measurement.
Am I missing something? Are there any other possible optimisations?
This is just an assumption, but maybe filtering by field rather than by tags alone, plus having 3 fields in a single measurement, is a performance killer. Fields are not indexed, so filtering by fields requires a full table scan. Besides, multiple fields per data point create multiple index entries.
I am not sure of the solution... Probably InfluxDB was not designed for such a complex table schema.

Should PAX be in the Flight dimension or the Fact Sales table?

I need to build a data mart using Power Pivot for a duty-free shop at an airport.
The sales manager analyzes sales data by flight number and by PAX, the number of people per flight.
So I don't know where to put PAX: in DimFlight or FactSales. It is additive, right?
Please explain why, and into which table I should put PAX. DimFlight might include airline, flight_no, date, and PAX. A flight may also land at the airport more than once a day.
PAX is a fact describing a measurable value of a specific flight event. It should be in the fact table, not in the flight dimension. I would expect total capacity to be an attribute of the plane dimension associated with the flight event. (Flight number would likely be a degenerate dimension, as it doesn't really own any attributes.) However, PAX itself should be a measure in the fact table.
You can generate a junk dimension that has the banding mentioned by @Luis Leal to do some capacity analytics. You can even create a numbers dimension with an attribute for each group level so you can do more detailed banding, for example an attribute for 1s, 10s, 100s, 1000s, etc. You can also calculate the filled capacity of the flight and point to the numbers dimension so you can group flights by 80% full, 90% full, etc.
Nothing stops you from modeling it as both a dimension and a measure, so you can store it both in a dimension table and as a measure in a fact table. If you store it as a measure in the fact table, you can perform several analyses by the other possible dimensions and get insights such as averages, max, min, and totals by x or y dimension, which would be very difficult if you stored it only in the dimension table.
On the other hand, storing it in the dimension table enables additional "perspectives" of analysis; for example, a common approach is to store "interval" columns in the dimension table with values like "from 1 to 1000 pax", "from 1001 to 2000", and so on. This column is calculated at ETL time depending on the value of PAX (see the sketch below). So why not use both?
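A minimal sketch of such an ETL-time banding column with pandas (the band edges and flight rows are illustrative, not from the question):

import pandas as pd

flights = pd.DataFrame({"flight_no": ["XY100", "XY200", "XY300"],
                        "pax": [850, 1450, 2300]})
# Assign each flight to a PAX interval; the edges mirror the example above.
bins = [0, 1000, 2000, 5000]
labels = ["1-1000", "1001-2000", "2001-5000"]
flights["pax_band"] = pd.cut(flights["pax"], bins=bins, labels=labels)
print(flights)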

MonetDB - left/right joins too slow than inner join

I have been comparing MySQL with MonetDB. Obviously, queries that took minutes in MySQL executed in a matter of seconds in Monet.
However, I ran into a real blocker with joins.
I have 2 tables, each with 150 columns. Among these (150+150) columns, around 60 are of CHARACTER LARGE OBJECT type. Both tables are populated with around 50,000 rows, with data in all 150 columns. The average length of data in a CLOB-type column is 9,000 characters (varying from 2 to 20,000). The primary keys of both tables have the same values, and the join is always based on the primary key. The rows are inserted in ascending order on the primary key by default.
When I ran an inner-join query on these two tables with about 5 criteria and a limit of 1000, Monet processed the query in 5 seconds, which is impressive compared to MySQL's 19 seconds.
But when I ran the same query with the same criteria and limit using a left or right join, Monet took around 5 minutes, which is clearly way behind MySQL (just 22 seconds).
I went through the logs using the TRACE statement, but the traces of the inner and left joins are more or less the same, except that the time for each action is far higher in the left-join trace.
Also, the time taken to execute the same join query varies by 2 or 3 seconds when run at different times.
Having read a lot about Monet's speed compared to traditional relational row-based DBs, I feel I am missing something, but I couldn't figure out what.
Can anyone please tell me why there is such a huge difference in query execution time and how I can prevent it?
Grateful for any help. Thanks a lot in advance.
P.S.: I am running Monet on a MacBook Pro with a 2.3 GHz quad-core Core i7 and 8 GB of RAM.
