Joining large tables in ClickHouse: out of memory or slow

I have 3 large tables (>100 GB with millions of rows each): events, page_views, and sessions. These tables are connected via 1-n relationships, see table setup below. I'm trying to create a denormalized events_wide table that contains a row for each event, where the corresponding page_views and sessions columns are included as well. The idea is to eliminate the joins needed for complex analytics queries, since these joins are slow.
I created a materialized view events_mv which joins the page_views and sessions table to the events table. Whenever a new event is inserted into events, the materialized view should insert a row into events_wide, joining the page_view and session automatically. However, when I insert a single new event, the query either doesn't finish or terminates with an out of memory error.
Even running this simple join query from events to page_views results in an out of memory error: Memory limit (for user) exceeded: would use 99.21 GiB. I use a ClickHouse Cloud production instance with 24+ GB RAM:
SELECT
-- Select columns from events and page_views
FROM events AS e
LEFT JOIN page_views AS p ON p.property_id = e.property_id AND p.id = e.page_view_id
LIMIT 3;
I tried different primary key orderings for the 3 tables (property_id, created_at, id) vs (property_id, id, created_at), different join algorithms (partial_merge, auto, grace_hash), ANY LEFT JOIN, without success. Maybe using UUIDs instead of numeric IDs is part of the problem, but I can't change the UUIDs unfortunately.
This is my table setup with the (property_id, id, created_at) primary keys:
CREATE TABLE events
(
id UUID,
created_at DateTime('UTC'),
property_id Int,
page_view_id Nullable(UUID),
session_id Nullable(UUID),
...
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(created_at)
PRIMARY KEY (property_id, id, created_at)
ORDER BY (property_id, id, created_at);
CREATE TABLE page_views
(
id UUID,
created_at DateTime('UTC'),
modified_at DateTime('UTC'),
property_id Int,
session_id Nullable(UUID),
...
) ENGINE = ReplacingMergeTree(modified_at)
PARTITION BY toYYYYMM(created_at)
PRIMARY KEY (property_id, id, created_at)
ORDER BY (property_id, id, created_at);
CREATE TABLE sessions
(
id UUID,
created_at DateTime('UTC'),
modified_at DateTime('UTC'),
property_id Int,
...
) ENGINE = ReplacingMergeTree(modified_at)
PARTITION BY toYYYYMM(created_at)
PRIMARY KEY (property_id, id, created_at)
ORDER BY (property_id, id, created_at);
CREATE TABLE events_wide
(
id UUID,
created_at DateTime('UTC'),
property_id Int,
page_view_id Nullable(UUID),
session_id Nullable(UUID),
...
-- page_views columns
p_created_at DateTime('UTC'),
p_modified_at DateTime('UTC'),
...
-- sessions columns
s_created_at DateTime('UTC'),
s_modified_at DateTime('UTC'),
...
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
PRIMARY KEY (property_id, created_at)
ORDER BY (property_id, created_at, id);
CREATE MATERIALIZED VIEW events_mv TO events_wide AS
SELECT
e.id AS id,
e.created_at AS created_at,
e.session_id AS session_id,
e.property_id AS property_id,
e.page_view_id AS page_view_id,
...
-- page_views columns
p.created_at AS p_created_at,
p.modified_at AS p_modified_at,
...
-- sessions columns
s.created_at AS s_created_at,
s.modified_at AS s_modified_at ,
...
FROM events AS e
LEFT JOIN page_views AS p ON p.property_id = e.property_id AND p.id = e.page_view_id
LEFT JOIN sessions AS s ON s.property_id = e.property_id AND s.id = e.session_id
SETTINGS join_algorithm = 'partial_merge';

ClickHouse doesn't have a full query optimizer, so the right-hand tables of a join need to be filtered before the join is performed. Otherwise, the full right-hand tables are pulled into memory to execute the join, which causes the issues you're experiencing.
Using the example you've provided:
WITH events_block AS (
SELECT * FROM events LIMIT 3
)
SELECT e.*, p.* FROM events_block AS e
LEFT JOIN (
SELECT * FROM page_views
WHERE (property_id, id) IN (
SELECT property_id, page_view_id FROM events_block
)
) AS p ON p.property_id = e.property_id AND p.id = e.page_view_id;
This may seem odd for a single join operation, but materialized views are processed in blocks of inserted rows, and pre-filtering the right-hand tables like this prevents the full tables from being loaded into memory on every insert.
So rewriting the materialized view as follows should do the trick:
CREATE MATERIALIZED VIEW events_mv TO events_wide AS
SELECT
e.id AS id,
e.created_at AS created_at,
e.session_id AS session_id,
e.property_id AS property_id,
e.page_view_id AS page_view_id,
...
-- page_views columns
p.created_at AS p_created_at,
p.modified_at AS p_modified_at,
...
-- sessions columns
s.created_at AS s_created_at,
s.modified_at AS s_modified_at,
...
FROM events AS e
LEFT JOIN (
SELECT * FROM page_views
WHERE (property_id, id) IN (
SELECT property_id, page_view_id FROM events
)
) AS p ON p.property_id = e.property_id AND p.id = e.page_view_id
LEFT JOIN (
SELECT * FROM sessions
WHERE (property_id, id) IN (
SELECT property_id, session_id FROM events
)
) AS s ON s.property_id = e.property_id AND s.id = e.session_id;
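Keep in mind that a materialized view only processes rows inserted after it is created. To backfill events that already exist, you can run a one-off INSERT with the same filtered-join shape. This is a sketch: it assumes the elided "..." columns are filled in to match events_wide, and restricts to one month at a time to bound memory (202301 is a placeholder):

```sql
-- Hypothetical one-off backfill into events_wide, one month at a time.
INSERT INTO events_wide
SELECT
    e.id, e.created_at, e.property_id, e.page_view_id, e.session_id,
    p.created_at AS p_created_at, p.modified_at AS p_modified_at,
    s.created_at AS s_created_at, s.modified_at AS s_modified_at
FROM events AS e
LEFT JOIN (
    SELECT * FROM page_views
    WHERE (property_id, id) IN (
        SELECT property_id, page_view_id FROM events
        WHERE toYYYYMM(created_at) = 202301
    )
) AS p ON p.property_id = e.property_id AND p.id = e.page_view_id
LEFT JOIN (
    SELECT * FROM sessions
    WHERE (property_id, id) IN (
        SELECT property_id, session_id FROM events
        WHERE toYYYYMM(created_at) = 202301
    )
) AS s ON s.property_id = e.property_id AND s.id = e.session_id
WHERE toYYYYMM(e.created_at) = 202301;
```

Running it partition by partition keeps each join's right-hand side small, the same principle the materialized view relies on.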

To avoid expensive joins in this case, what about keeping the single denormalized super table events_wide with all the columns, but feeding it from three different materialized views?
Each materialized view inserts its own source table's columns into events_wide, filling the columns that are not present with zeros or NULLs.
For instance:
CREATE MATERIALIZED VIEW events_to_events_wide TO events_wide AS
SELECT
e.id AS id,
e.created_at AS created_at,
e.session_id AS session_id,
e.property_id AS property_id,
e.page_view_id AS page_view_id,
...
-- page_views columns as null or zeros
...
-- sessions columns as null or zeros
...
FROM events AS e;
CREATE MATERIALIZED VIEW page_views_to_events_wide TO events_wide AS
SELECT
e.id AS id,
e.created_at AS created_at,
e.modified_at AS modified_at,
...
-- events columns as null or zeros
...
-- sessions columns as null or zeros
...
FROM page_views AS e
...
Then you have a single table containing all the records, which you can aggregate or analyze as needed without joins.
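For example, a query counting row origins per session can then read a single table. This is a hypothetical sketch: it assumes the "missing columns as zeros" convention, i.e. rows that came from events have p_modified_at = 0, while rows that came from page_views have it set, and that page_views rows carry their session_id into events_wide:

```sql
-- Hypothetical: count events vs. page views per session without a join.
SELECT
    session_id,
    countIf(p_modified_at  = toDateTime(0)) AS event_rows,
    countIf(p_modified_at != toDateTime(0)) AS page_view_rows
FROM events_wide
GROUP BY session_id;
```

Conditional aggregates like countIf take the place of the join: each row already knows which source table it came from by which columns are zero-filled.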

In order to keep the memory footprint manageable, you can try breaking the JOIN process down into two separate join operations.
You can achieve this by chaining materialized views:
First join events with page_views in a dedicated MV, let's say events_with_pv
Then join events_with_pv with sessions into the final MV events_wide
You can use join_algorithm = 'auto' and let ClickHouse automatically decide which algorithm to use.
Materializing the intermediate state reduces the size of the state needed to compute each JOIN. Something to keep in mind is that you want the right-hand table of your JOINs to be the smaller one.
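A sketch of the chaining, with column lists abbreviated. The intermediate step needs its own target table (events_with_pv here, a name chosen for illustration), because a materialized view can only fire on inserts into a real table:

```sql
-- Intermediate table: events joined with page_views only.
CREATE TABLE events_with_pv
(
    id UUID,
    created_at DateTime('UTC'),
    property_id Int,
    session_id Nullable(UUID),
    p_created_at DateTime('UTC')
    -- ... remaining events and page_views columns
) ENGINE = MergeTree()
ORDER BY (property_id, created_at, id);

-- Step 1: fires on inserts into events.
CREATE MATERIALIZED VIEW events_to_events_with_pv TO events_with_pv AS
SELECT e.id, e.created_at, e.property_id, e.session_id,
       p.created_at AS p_created_at
FROM events AS e
LEFT JOIN page_views AS p
    ON p.property_id = e.property_id AND p.id = e.page_view_id
SETTINGS join_algorithm = 'auto';

-- Step 2: fires on inserts into events_with_pv.
CREATE MATERIALIZED VIEW events_with_pv_to_wide TO events_wide AS
SELECT e.*, s.created_at AS s_created_at, s.modified_at AS s_modified_at
FROM events_with_pv AS e
LEFT JOIN sessions AS s
    ON s.property_id = e.property_id AND s.id = e.session_id
SETTINGS join_algorithm = 'auto';
```

Each step then only ever joins one block of fresh rows against one right-hand table, instead of holding the state for both joins at once.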

Related

Rails query to order records by single association row column

Say I have a books table that has id and name columns and a has_many book_ratings association. The book_ratings table has id, rating, rating_date, and book_id columns and belongs_to :book. I'm displaying the books in a table with 3 columns: the book's name, the highest rating for the book, and the most recent rating for the book. I want each column to be sortable, and I'm using the Pagy gem for pagination. I want to make a single call to each table to avoid N+1 queries.
I made this query which includes the max_rating and sorts by max_rating:
#pagy, #books = pagy(
Book.joins(:book_ratings)
.select('books.*,MAX(rating) as max_rating')
.order('max_rating')
.group('books.id')
)
With each book record, I can call book.max_rating to display the max_rating in the table. I cannot think of a way to also include the rating with the latest date and sort the records by that rating. I know I can use MAX(rating_date) to get the date of the book_rating with the most recent date, but I need a way to include the rating value of the book_rating with the latest date and also be able to order the book records by those values.
Any ideas?
To get this to work, I ended up using a select method and then chaining two joins methods. Here's what I ended up using:
Book.select('books.*,t2.rating as latest_rating,t4.rating as highest_rating')
.joins(
"INNER JOIN (
SELECT book_ratings.rating, book_ratings.book_id, book_ratings.rating_date
FROM book_ratings
JOIN (
SELECT book_id, MAX(rating_date) as max_date
FROM book_ratings GROUP BY book_id
) t1
ON t1.book_id = book_ratings.book_id
AND t1.max_date = book_ratings.rating_date
) t2
ON t2.book_id = books.id"
)
.joins(
"INNER JOIN (
SELECT book_ratings.rating, book_ratings.book_id
FROM book_ratings
JOIN (
SELECT book_id, MAX(rating) as max_rating
FROM book_ratings
GROUP BY book_id
) t3
ON t3.book_id = book_ratings.book_id
AND t3.max_rating = book_ratings.rating
) t4
ON t4.book_id = books.id"
)
.distinct
Calling .to_sql on that query gives this SQL:
SELECT DISTINCT books.*,t2.rating as latest_rating,t4.rating as highest_rating
FROM "books"
INNER JOIN (
SELECT book_ratings.rating, book_ratings.book_id, book_ratings.rating_date
FROM book_ratings
JOIN (
SELECT book_id, MAX(rating_date) as max_date
FROM book_ratings
GROUP BY book_id
) t1
ON t1.book_id = book_ratings.book_id
AND t1.max_date = book_ratings.rating_date
) t2
ON t2.book_id = books.id
INNER JOIN (
SELECT book_ratings.rating, book_ratings.book_id
FROM book_ratings
JOIN (
SELECT book_id, MAX(rating) as max_rating
FROM book_ratings
GROUP BY book_id
) t3
ON t3.book_id = book_ratings.book_id
AND t3.max_rating = book_ratings.rating
) t4
ON t4.book_id = books.id
When iterating through this collection, latest_rating and highest_rating can be called on each object to get the value from the rating column of the corresponding join. The relation can also be sorted with .order('latest_rating asc') (or desc) or .order('highest_rating asc') (or desc).

Find n most referenced records by foreign_key in related table

I have a table skills and a table programs_skills which references skill_id as a foreign key, I want to retrieve the 10 most present skills in table programs_skills (I need to count the number of occurrence of skill_id in programs_skills and then order it by descending order).
I wrote this in my skill model:
def self.most_used(limit)
Skill.find(
ActiveRecord::Base.connection.execute(
'SELECT programs_skills.skill_id, count(*) FROM programs_skills GROUP BY skill_id ORDER BY count DESC'
).to_a.first(limit).map { |record| record['skill_id'] }
)
end
This is working but I would like to find a way to perform this query in a more elegant, performant, "activerecord like" way.
Could you help me rewrite this query ?
Just replace your query by:
WITH
T AS
(
SELECT skill_id, COUNT(*) AS NB, RANK() OVER(ORDER BY COUNT(*) DESC) AS RNK
FROM programs_skills
GROUP BY skill_id
)
SELECT skill_id, NB
FROM T
WHERE RNK <= 10
This uses a CTE and a window function.
ProgramsSkills.select("skill_id, COUNT(*) AS nb_skills")
.group(:skill_id).order("nb_skills DESC")
.limit(limit).pluck(:skill_id)
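For reference, the grouping query the ActiveRecord relation above runs underneath looks along these lines (assuming the join table is named programs_skills):

```sql
SELECT skill_id, COUNT(*) AS nb_skills
FROM programs_skills
GROUP BY skill_id
ORDER BY nb_skills DESC
LIMIT 10;
```

A single GROUP BY with ORDER BY and LIMIT does the counting, ranking, and truncation in the database, instead of loading every row into Ruby first.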

MySQL self-referencing JOIN

I have two tables:
Customers (
int(11) Id,
varchar(255) Name,
int(11) Referred_ID -- a reference to an Id in this same table
-- (no key on that field)
)
the other table being:
Invoices (
int(11) Id,
date Billing_date,
int(11) Customer_ID
)
I want to select Id, Billing_date of the invoice AND most important, customer's Name this customer refers to.
Now I'm only able to select his referrer's ID by a query like this one:
SELECT Invoices.Id, Invoices.Billing_date, Customers.Name, Referred_ID
FROM Invoices
INNER JOIN Customers ON Invoices.Customer_Id = Customers.Id;
How should I modify my query to replace that Referred_ID by its owner name?
It's a MySQL version from around 2015, by the way.
You could use the Customers table twice, with an alias for joining the referrer:
SELECT Invoices.Id, Invoices.Billing_date, Customers.Name, Referred.Name
FROM Invoices
INNER JOIN Customers ON Invoices.Customer_Id = Customers.Id
INNER JOIN Customers Referred on Referred.id = Customers.Referred_ID;
Use the Customers table twice in the join:
SELECT Invoices.Id, Invoices.Billing_date,
c1.Name as customername,
c1.Referred_ID,
c2.Name as refername
FROM Invoices INNER JOIN Customers c1 ON Invoices.Customer_Id = c1.Id
join Customers c2 on c2.Id = c1.Referred_ID
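Note that both answers use an INNER JOIN for the referrer, which drops invoices whose customer has no Referred_ID. If those invoices should still appear (with a NULL referrer name), a LEFT JOIN variant would be:

```sql
SELECT Invoices.Id, Invoices.Billing_date,
       c1.Name AS CustomerName,
       c2.Name AS ReferrerName
FROM Invoices
INNER JOIN Customers c1 ON Invoices.Customer_ID = c1.Id
LEFT JOIN Customers c2 ON c2.Id = c1.Referred_ID;
```

The self-join works the same way; the LEFT JOIN just keeps the invoice row when the second lookup finds no match.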

Rails Postgres query to exclude any results that contain one of three records on join

This is a hard problem to describe, but I have a Rails query where I join another table, and I want to exclude any results where the joined records match one of three conditions.
I have a Device model that relates to a CarUserRole model/record. Each CarUserRole record contains one of three :role values - "owner", "monitor", "driver". I want to return any results where there is no related CarUserRole record with role: "owner". How would I do that?
This was my first attempt -
Device.joins(:car_user_roles).where('car_user_roles.role = ? OR car_user_roles.role = ? AND car_user_roles.role != ?', 'monitor', 'driver', 'owner')
Here is the sql -
"SELECT \"cars\".* FROM \"cars\" INNER JOIN \"car_user_roles\" ON \"car_user_roles\".\"car_id\" = \"cars\".\"id\" WHERE (car_user_roles.role = 'monitor' OR car_user_roles.role = 'driver' AND car_user_roles.role != 'owner')"
Update
I should mention that a device sometimes has multiple CarUserRole records. A device can have an "owner" and a "driver" CarUserRole. I should also note that they can only have one owner.
Answer
I ended up going with #Reub's solution via our chat -
where(CarUserRole.where("car_user_roles.car_id = cars.id").where(role: 'owner').exists.not)
Since the car_user_roles table can have multiple records with the same car_id, an inner join can result in the join table having multiple rows for each row in the cars table. So, for a car that has 3 records in the car_user_roles table (monitor, owner and driver), there will be 3 records in the join table (each record having a different role). Your query will filter out the row where the role is owner, but it will match the other two, resulting in that car being returned as a result of your query even though it has a record with role as 'owner'.
Let's first try to form an SQL query for the result that you want. We can then convert this into a Rails query.
SELECT * FROM cars WHERE NOT EXISTS (SELECT id FROM car_user_roles WHERE role='owner' AND car_id = cars.id);
The above is sufficient if you want devices which do not have any car_user_role with role as 'owner'. But this can also give you devices which have no corresponding record in car_user_roles. If you want to ensure that the device has at least one record in car_user_roles, you can add the following to the above query.
AND EXISTS (SELECT id FROM car_user_roles WHERE role IN ('monitor', 'driver') AND car_id = cars.id);
Now, we need to convert this into a Rails query.
Device.where(
CarUserRole.where("car_user_roles.car_id = cars.id").where(role: 'owner').exists.not
).where(
CarUserRole.where("car_user_roles.car_id = cars.id").where(role: ['monitor', 'driver']).exists
).all
You could also try the following if your Rails version supports exists?:
Device.joins(:car_user_roles).exists?(role: ['monitor', 'driver']).exists?(role: 'owner').not.select('cars.*').distinct
Select the distinct cars
SELECT DISTINCT (cars.*) FROM cars
Use a LEFT JOIN to pull in the car_user_roles
LEFT JOIN car_user_roles ON cars.id = car_user_roles.car_id
Select only the cars that DO NOT contain an 'owner' car_user_role
WHERE NOT EXISTS(SELECT NULL FROM car_user_roles WHERE cars.id = car_user_roles.car_id AND car_user_roles.role = 'owner')
Select only the cars that DO contain either a 'driver' or 'monitor' car_user_role
AND (car_user_roles.role IN ('driver','monitor'))
Put it all together:
SELECT DISTINCT (cars.*) FROM cars LEFT JOIN car_user_roles ON cars.id = car_user_roles.car_id WHERE NOT EXISTS(SELECT NULL FROM car_user_roles WHERE cars.id = car_user_roles.car_id AND car_user_roles.role = 'owner') AND (car_user_roles.role IN ('driver','monitor'));
Edit:
Execute the query directly from Rails and return only the found object IDs
ActiveRecord::Base.connection.execute(sql).collect { |x| x['id'] }

Using multiple column names in where with RoR and ActiveRecord

I want to produce the following sql using active record.
WHERE (column_name1, column_name2) IN (SELECT ....)
I don't know how to do this is active record.
I've tried these so far
where('column_name1, column_name2' => {})
where([:column_name1, :column_name2] => {})
This is the full query I'd like to create
SELECT a, Count(1)
FROM table
WHERE ( a, b ) IN (SELECT a,
Max(b)
FROM table
GROUP BY a)
GROUP BY a
HAVING Count(1) > 1
I've already written a scope for the subquery
Thanks in advance.
WHERE (column_name1, column_name2) IN (SELECT ....) is a row-value construct that some databases (MySQL and PostgreSQL, for example) do accept, but ActiveRecord has no built-in syntax for it, and other databases reject it.
An approximation in plain SQL would be:
WHERE column_name1 IN (select ....) OR column_name2 IN (select ...)
Note that this matches each column independently rather than as a pair, so it can return more rows than the tuple form.
The same query can be used directly in the active record:
where("column_name1 IN (select ...) OR column_name2 IN (select...)")
Avoiding duplication:
selected_values = select ...
where("column_name1 IN (?) OR column_name2 IN (?)", selected_values, selected_values)
So I decided to use an inner join to gain the same functionality. Here is my solution.
select(:column1, 'Count(1)').
joins("INNER JOIN (#{subquery.to_sql}) AS table2 ON
table1.column1=table2.column1
AND table1.column2=table2.column2")
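With a subquery selecting the column and its MAX grouped by the first column, the SQL this generates would look roughly like the following (table1/table2/column1/column2 are the placeholder names from the snippet above):

```sql
SELECT table1.column1, COUNT(1)
FROM table1
INNER JOIN (
    SELECT column1, MAX(column2) AS column2
    FROM table1
    GROUP BY column1
) AS table2
  ON table1.column1 = table2.column1
 AND table1.column2 = table2.column2
GROUP BY table1.column1
HAVING COUNT(1) > 1;
```

The inner join against the grouped subquery keeps only the rows whose (column1, column2) pair matches, which reproduces the tuple-IN semantics without row-value syntax.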
