dask pivot_table seems very slow

I'm collating a simple pivot table showing which position was owned by which fund on which date. It reads a fairly large parquet file per day, extracting just 3 columns. The result for one year is ~100K rows by 365 columns, each element containing NaN or a single short string (the "fund").
This takes 46 seconds for one year of data if I fall back to pandas before creating the pivot table:
ddf = dd.read_parquet("s3://.../2021*.parquet", columns=["as_of_date", "position_id", "fund"]).categorize(columns=['as_of_date'])
ddf.compute().pivot_table(index='position_id', columns='as_of_date', values='fund', aggfunc='first')
However, that seems like I'm pushing a lot of data around; I would think it would be more efficient, or at least as efficient, to have dask do the pivot table:
ddf = dd.read_parquet("s3://.../2021*.parquet", columns=["as_of_date", "position_id", "fund"]).categorize(columns=['as_of_date'])
ddf.pivot_table(index='position_id', columns='as_of_date', values='fund', aggfunc='first').compute()
But that takes 12 minutes, and if I give it more than one year of data, it dies with a KilledWorker exception.
Any guidance on how to diagnose what's going on here? I'm not clear why it struggles so much with the pivot_table. FWIW, there are no duplicates, but pivot() doesn't exist in dask, so aggfunc='first' is the simplest placeholder I can think of to make it work.

Performing some operations on dask DataFrames can currently lead to processing bottlenecks. These bottlenecks can cause memory problems or inefficient data shuffling. Pivoting is one such operation. This can be seen in the DAG for a sample pivot:
from dask.datasets import timeseries
df = timeseries(end="2000-01-05").categorize(columns=["name"])
pivot = df.pivot_table(index="id", columns="name", values="x")
pivot.visualize()
As the number of partitions increases, the DAG gets more involved. To see that, try increasing the kwarg end="2000-01-05" to a later date.
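If it helps to quantify that, here is a rough sketch using the same sample data as above; it prints the partition count and the number of tasks in the pivot's graph as the time range grows (the later end dates are arbitrary):
from dask.datasets import timeseries

# Compare partition count and task count in the pivot's graph as the
# time range (and hence the number of partitions) increases.
for end in ("2000-01-05", "2000-01-15", "2000-02-01"):
    df = timeseries(end=end).categorize(columns=["name"])
    pivot = df.pivot_table(index="id", columns="name", values="x")
    print(end, "partitions:", df.npartitions,
          "tasks:", len(pivot.__dask_graph__()))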

Related

Influxdb speed up query over long time periods with group by

I write sensor data to an InfluxDB database every second. Displaying weekly, monthly, or yearly summaries in Grafana is quite slow, since it needs to query many thousands of values.
To speed things up, I was thinking about using a cron job to run queries like:
select mean(sensor1) into data_avg_1h from data where time > start and time <= end group by time(1h)
select mean(sensor1) into data_avg_1d from data where time > start and time <= end group by time(1d)
select mean(sensor1) into data_avg_1w from data where time > start and time <= end group by time(1w)
This would mean I need more storage, but queries would run much faster.
Is this a bodge job or acceptable, and is there a cleverer way to do something like that?
Yes, it is perfectly OK, and it is also recommended to downsample the data as you describe in the question.
However, instead of using a cron job, it is better to use the Continuous Query feature of InfluxDB to achieve the same result.
See the Downsampling and Continuous Queries documentation.
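For example, a minimal sketch of such a continuous query issued through the Python influxdb client; the database name "mydb", the CQ name, and the stored count field are assumptions, while the measurement and field names follow the question:
from influxdb import InfluxDBClient

# Assumed connection details; "mydb" is a placeholder database name.
client = InfluxDBClient(host="localhost", port=8086, database="mydb")

# Roll sensor1 up to hourly means automatically, and keep the point count
# so longer-period averages can be weighted later.
client.query(
    'CREATE CONTINUOUS QUERY "sensor1_avg_1h" ON "mydb" BEGIN '
    "SELECT mean(sensor1) AS mean_sensor1, count(sensor1) AS count_sensor1 "
    "INTO data_avg_1h FROM data GROUP BY time(1h) END"
)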
Please be aware that when you store the average for a short period and later want to calculate the average for a longer period from that downsampled data, you will have to calculate a weighted average. Otherwise you will be calculating an average of averages, which may not equal the average calculated from the original data.
This is because each downsampled average value may be based on a different number of data points.
So while calculating the mean at each interval, also store the number of data points received in that interval. This way you will be able to calculate the weighted average.
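A minimal sketch of that weighted average, assuming each downsampled row stores both the mean and the number of raw points it came from (the column names are illustrative):
# Two hourly rollups: a full hour and a partial hour.
hourly = [
    {"mean_sensor1": 20.0, "count_sensor1": 3600},
    {"mean_sensor1": 22.0, "count_sensor1": 1800},
]

total = sum(r["count_sensor1"] for r in hourly)
weighted = sum(r["mean_sensor1"] * r["count_sensor1"] for r in hourly) / total

# A plain average of the two means would give 21.0, while the weighted
# mean over the raw points is (20*3600 + 22*1800) / 5400 ≈ 20.67.
print(weighted)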

Slow query with 22 million points

I have 1 TB of text data.
I installed InfluxDB on a machine (240 GB RAM, 32 CPUs).
I have only inserted around 22 million points into one measurement, with one tag and 110 fields.
When I run a query (select id from ts limit 1), it takes more than 20 seconds, and this is not good.
Can you please help me with what I should do to get good performance?
How many series do you have?
Maybe your problem comes from this:
https://docs.influxdata.com/influxdb/v1.2/concepts/schema_and_data_layout/#don-t-have-too-many-series
Tags containing highly variable information like UUIDs, hashes, and random strings will lead to a large number of series in the database, known colloquially as high series cardinality. High series cardinality is a primary driver of high memory usage for many database workloads.
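One way to check the series count, sketched with the Python influxdb client; the connection details and database name are assumptions, while the measurement name "ts" is from the question:
from influxdb import InfluxDBClient

# Assumed connection details; adjust to your setup.
client = InfluxDBClient(host="localhost", port=8086, database="mydb")

# SHOW SERIES lists every series key in the measurement; the length of the
# result is the series cardinality for "ts".
series = list(client.query('SHOW SERIES FROM "ts"').get_points())
print("series in ts:", len(series))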

Iterating over large amounts of data in InfluxDB

I am looking for an efficient way to iterate over the full data of an InfluxDB table with ~250 million entries.
I am currently paginating the data using the OFFSET and LIMIT clauses; however, this takes a lot of time for higher offsets.
SELECT * FROM diff ORDER BY time LIMIT 1000000 OFFSET 0
takes 21 seconds, whereas
SELECT * FROM diff ORDER BY time LIMIT 1000000 OFFSET 40000000
takes 221 seconds.
I am using the Python influxdb wrapper to send the requests.
Is there a way to optimize this or stream the whole table?
UPDATE: Remembering the timestamp of the last received data point, and then using WHERE time >= last_timestamp on the next query, reduces the query time for higher offsets drastically (query time is always ~25 seconds). This is rather cumbersome, however, because if two data points share the same timestamp, some results might be present on two pages of data, which has to be detected somehow.
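A rough sketch of that time-cursor approach with the Python influxdb client, including a crude check for rows repeated at the boundary timestamp; the connection details, page size, and the use of the whole row as a dedup key are assumptions, while the measurement "diff" is from the question:
from influxdb import InfluxDBClient

# Assumed connection details; adjust to your setup.
client = InfluxDBClient(host="localhost", port=8086, database="mydb")

page_size = 1_000_000
last_time = "1970-01-01T00:00:00Z"
boundary = set()  # rows already seen at last_time on the previous page
total = 0

while True:
    result = client.query(
        f"SELECT * FROM diff WHERE time >= '{last_time}' LIMIT {page_size}"
    )
    rows = [r for r in result.get_points()
            if not (r["time"] == last_time and frozenset(r.items()) in boundary)]
    if not rows:
        break

    total += len(rows)  # replace with whatever consumes the page

    last_time = rows[-1]["time"]
    boundary = {frozenset(r.items()) for r in rows if r["time"] == last_time}

print("rows iterated:", total)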
You should use Continuous Queries or Kapacitor. Can you elaborate on your use case and what you're doing with the stream of data?

InfluxDB performance

For my case, I need to capture 15 performance metrics for devices and save them to InfluxDB. Each device has a unique device id.
Metrics are written into InfluxDB in the following way. Here I only show one as an example:
new Serie.Builder("perfmetric1")
.columns("time", "value", "id", "type")
.values(getTime(), getPerf1(), getId(), getType())
.build()
Writing data is fast and easy. But I saw bad performance when I ran queries. I'm trying to get all 15 metric values for the last hour:
select value from perfmetric1, perfmetric2, ..., permetric15
where id='testdeviceid' and time > now() - 1h
For one hour, each metric has 120 data points, so across the 15 metrics that's 1,800 data points in total. The query takes about 5 seconds on a c4.4xlarge EC2 instance when it's idle.
I believe InfluxDB can do better. Is this a problem of my schema design, or is it something else? Would splitting the query into 15 parallel calls go faster?
As @valentin's answer says, you need to build an index on the id column for InfluxDB to perform these queries efficiently.
In 0.8 stable you can do this "indexing" using continuous fanout queries. For example, the following continuous query will expand your perfmetric1 series into multiple series of the form perfmetric1.id:
select * from perfmetric1 into perfmetric1.[id];
Later you would do:
select value from perfmetric1.testdeviceid, perfmetric2.testdeviceid, ..., permetric15.testdeviceid where time > now() - 1h
This query will take much less time to complete since InfluxDB won't have to perform a full scan of the timeseries to get the points for each testdeviceid.
Build an index on the id column. It seems that the engine uses a full scan on the table to retrieve the data. By splitting your query into 15 threads, the engine will do 15 full scans and the performance will be much worse.

Data warehouse fact measurements that cannot be meaningfully aggregated over time?

Is there an example of a time-varying numerical quantity that might be in a data warehouse that cannot be meaningfully aggregated over time? If so why?
Stock levels cannot, because they represent a value that is already an aggregation at a particular moment in time.
If you have ten items in stock today and ten yesterday, and ten in stock every day this week, you cannot add them up to "70" meaningfully for the whole week, unless you are measuring something like space utilisation efficiency.
Other examples: bank balance, or speed of flywheel, or time since overhaul.
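A tiny pandas illustration of the stock-level point above: ten items in stock every day for a week can't be summed into a meaningful "70", but the average or the closing level is reportable (the dates are arbitrary):
import pandas as pd

# Ten items in stock on each of seven days.
stock = pd.Series(10, index=pd.date_range("2021-01-04", periods=7, freq="D"))

print(stock.sum())     # 70   -- not a meaningful weekly stock level
print(stock.mean())    # 10.0 -- average daily stock level
print(stock.iloc[-1])  # 10   -- closing stock level for the week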
Many subatomic processes can be observed using our notion of "time" but probably wouldn't make much sense when aggregated. This is because our notion of "time" doesn't make much sense at the quantum level.
