I have 3 large tables (>100 GB with millions of rows each): events, page_views, and sessions. These tables are connected via 1-n relationships, see table setup below. I'm trying to create a denormalized events_wide table that contains a row for each event, where the corresponding page_views and sessions columns are included as well. The idea is to eliminate the joins needed for complex analytics queries, since these joins are slow.
I created a materialized view events_mv which joins the page_views and sessions table to the events table. Whenever a new event is inserted into events, the materialized view should insert a row into events_wide, joining the page_view and session automatically. However, when I insert a single new event, the query either doesn't finish or terminates with an out of memory error.
I use a ClickHouse Cloud production instance with 24+ GB RAM. Even this simple join from events to page_views fails with an out-of-memory error (Memory limit (for user) exceeded: would use 99.21 GiB):
SELECT
-- Select columns from events and page_views
FROM events AS e
LEFT JOIN page_views AS p ON p.property_id = e.property_id AND p.id = e.page_view_id
LIMIT 3;
I tried different primary key orderings for the 3 tables (property_id, created_at, id) vs (property_id, id, created_at), different join algorithms (partial_merge, auto, grace_hash), ANY LEFT JOIN, without success. Maybe using UUIDs instead of numeric IDs is part of the problem, but I can't change the UUIDs unfortunately.
This is my table setup with the (property_id, id, created_at) primary keys:
CREATE TABLE events
(
id UUID,
created_at DateTime('UTC'),
property_id Int,
page_view_id Nullable(UUID),
session_id Nullable(UUID),
...
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(created_at)
PRIMARY KEY (property_id, id, created_at)
ORDER BY (property_id, id, created_at);
CREATE TABLE page_views
(
id UUID,
created_at DateTime('UTC'),
modified_at DateTime('UTC'),
session_id Nullable(UUID),
...
) ENGINE = ReplacingMergeTree(modified_at)
PARTITION BY toYYYYMM(created_at)
PRIMARY KEY (property_id, id, created_at)
ORDER BY (property_id, id, created_at);
CREATE TABLE sessions
(
id UUID,
created_at DateTime('UTC'),
modified_at DateTime('UTC'),
property_id Int,
...
) ENGINE = ReplacingMergeTree(modified_at)
PARTITION BY toYYYYMM(created_at)
PRIMARY KEY (property_id, id, created_at)
ORDER BY (property_id, id, created_at);
CREATE TABLE events_wide
(
id UUID,
created_at DateTime('UTC'),
property_id Int,
page_view_id Nullable(UUID),
session_id Nullable(UUID),
...
-- page_views columns
p_created_at DateTime('UTC'),
p_modified_at DateTime('UTC'),
...
-- sessions columns
s_created_at DateTime('UTC'),
s_modified_at DateTime('UTC'),
...
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
PRIMARY KEY (property_id, created_at)
ORDER BY (property_id, created_at, id);
CREATE MATERIALIZED VIEW events_mv TO events_wide AS
SELECT
e.id AS id,
e.created_at AS created_at,
e.session_id AS session_id,
e.property_id AS property_id,
e.page_view_id AS page_view_id,
...
-- page_views columns
p.created_at AS p_created_at,
p.modified_at AS p_modified_at,
...
-- sessions columns
s.created_at AS s_created_at,
s.modified_at AS s_modified_at,
...
FROM events AS e
LEFT JOIN page_views AS p ON p.property_id = e.property_id AND p.id = e.page_view_id
LEFT JOIN sessions AS s ON s.property_id = e.property_id AND s.id = e.session_id
SETTINGS join_algorithm = 'partial_merge';
ClickHouse doesn't have a proper optimizer, so the right-hand tables of a join need to be filtered before the join is performed. Otherwise, the full right-hand tables are loaded into memory to perform the join, which causes the issues you're experiencing.
Using the example you've provided:
WITH events_block AS (
SELECT * FROM events LIMIT 3
)
SELECT e.*, p.* FROM events_block AS e
LEFT JOIN (
SELECT * FROM page_views
WHERE (property_id, id) IN (
SELECT property_id, page_view_id FROM events_block
)
) AS p ON p.property_id = e.property_id AND p.id = e.page_view_id;
This may seem odd if you think of a single join operation, but materialized views are processed in blocks of inserted rows. Filtering the right-hand tables by the keys in the current block prevents the full right-hand tables from being loaded into memory on every insert.
So rewriting the materialized view as follows will do the trick:
CREATE MATERIALIZED VIEW events_mv TO events_wide AS
SELECT
e.id AS id,
e.created_at AS created_at,
e.session_id AS session_id,
e.property_id AS property_id,
e.page_view_id AS page_view_id,
...
-- page_views columns
p.created_at AS p_created_at,
p.modified_at AS p_modified_at,
...
-- sessions columns
s.created_at AS s_created_at,
s.modified_at AS s_modified_at,
...
FROM events AS e
LEFT JOIN (
SELECT * FROM page_views
WHERE (property_id, id) IN (
SELECT property_id, page_view_id FROM events
)
) AS p ON p.property_id = e.property_id AND p.id = e.page_view_id
LEFT JOIN (
SELECT * FROM sessions
WHERE (property_id, id) IN (
SELECT property_id, session_id FROM events
)
) AS s ON s.property_id = e.property_id AND s.id = e.session_id
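The filtering pattern above can be checked on toy data. The following is only a sketch in Python with an in-memory SQLite database, not ClickHouse: it shows that pre-filtering the right-hand table with a tuple IN subquery returns the same rows as the plain join while shrinking the right side. It says nothing about ClickHouse memory usage. The url column and all sample rows are hypothetical, and the events table here stands in for one inserted block.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id TEXT, property_id INTEGER, page_view_id TEXT);
CREATE TABLE page_views (id TEXT, property_id INTEGER, url TEXT);
INSERT INTO page_views VALUES
    ('pv1', 1, '/home'), ('pv2', 1, '/pricing'), ('pv3', 2, '/docs');
-- Stand-in for the block of newly inserted events (hypothetical rows).
INSERT INTO events VALUES ('e1', 1, 'pv2'), ('e2', 2, NULL);
""")

# Plain join: the whole page_views table is the right-hand side.
plain_join = """
SELECT e.id, p.url FROM events AS e
LEFT JOIN page_views AS p
  ON p.property_id = e.property_id AND p.id = e.page_view_id
ORDER BY e.id
"""

# Filtered join: only page_views rows referenced by the events block remain.
filtered_join = """
SELECT e.id, p.url FROM events AS e
LEFT JOIN (
    SELECT * FROM page_views
    WHERE (property_id, id) IN (SELECT property_id, page_view_id FROM events)
) AS p
  ON p.property_id = e.property_id AND p.id = e.page_view_id
ORDER BY e.id
"""

plain_rows = conn.execute(plain_join).fetchall()
filtered_rows = conn.execute(filtered_join).fetchall()

# Size of the pre-filtered right-hand side.
right_side_rows = conn.execute("""
SELECT COUNT(*) FROM page_views
WHERE (property_id, id) IN (SELECT property_id, page_view_id FROM events)
""").fetchone()[0]

print(plain_rows == filtered_rows, right_side_rows)
```

Both queries return the same rows here, but the filtered version joins against one page_views row instead of three.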
To avoid expensive joins in this case, what about keeping the single denormalized table events_wide with all the columns, but populating it from three different materialized views?
Each materialized view would insert the columns of its own source table into events_wide and fill the columns that belong to the other tables with zeros or nulls.
For instance:
CREATE MATERIALIZED VIEW events_to_events_wide TO events_wide AS
SELECT
e.id AS id,
e.created_at AS created_at,
e.session_id AS session_id,
e.property_id AS property_id,
e.page_view_id AS page_view_id,
...
-- page_views columns as null or zeros
...
-- sessions columns as null or zeros
...
FROM events AS e;
CREATE MATERIALIZED VIEW page_views_to_events_wide TO events_wide AS
SELECT
e.id AS id,
e.created_at AS created_at,
e.modified_at AS modified_at,
...
-- events columns as null or zeros
...
-- sessions columns as null or zeros
...
FROM page_views AS e
...
Then you have a single table with all records you can aggregate or perform the analysis you need without joins.
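To illustrate how such a table can be queried without joins, here is a small sketch in Python with in-memory SQLite rather than ClickHouse. The reduced column set (page_view_id as the shared key, event_id, p_url) and the sample rows are assumptions made for the example. Rows from different sources are combined with a GROUP BY on the shared key instead of a JOIN. In ClickHouse, non-Nullable columns would hold default values such as zeros instead of NULL, so the aggregate functions would need to account for that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events_wide (
    page_view_id TEXT,  -- shared key (hypothetical choice)
    event_id     TEXT,  -- filled only by rows coming from events
    p_url        TEXT   -- filled only by rows coming from page_views
);
-- Rows as the events view would write them: page_views columns are NULL.
INSERT INTO events_wide VALUES
    ('pv1', 'e1', NULL), ('pv1', 'e2', NULL), ('pv2', 'e3', NULL);
-- Rows as the page_views view would write them: events columns are NULL.
INSERT INTO events_wide VALUES
    ('pv1', NULL, '/home'), ('pv2', NULL, '/pricing');
""")

# Aggregates skip NULLs, so one GROUP BY merges both kinds of rows.
rows = conn.execute("""
SELECT page_view_id, MAX(p_url) AS url, COUNT(event_id) AS event_count
FROM events_wide
GROUP BY page_view_id
ORDER BY page_view_id
""").fetchall()

print(rows)
```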
To keep the memory footprint manageable, you can try breaking the JOIN down into two separate join operations.
You can achieve this by chaining materialized views:
First, join events with page_views in a dedicated MV, say events_with_pv.
Then, join events_with_pv with sessions in the final MV that writes to events_wide.
You can use join_algorithm = 'auto' and let ClickHouse decide which algorithm to use.
Materializing the intermediate result reduces the size of the state needed to compute each JOIN. Keep in mind that you want the right table of each JOIN to be the smaller one.
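The two-step idea can be sketched as follows. This uses Python with in-memory SQLite and plain CREATE TABLE ... AS statements in place of ClickHouse materialized views, so it only shows the data flow, not the memory behaviour. The url and browser columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id TEXT, page_view_id TEXT, session_id TEXT);
CREATE TABLE page_views (id TEXT, url TEXT);
CREATE TABLE sessions (id TEXT, browser TEXT);
INSERT INTO events VALUES ('e1', 'pv1', 's1'), ('e2', NULL, 's1');
INSERT INTO page_views VALUES ('pv1', '/home');
INSERT INTO sessions VALUES ('s1', 'firefox');

-- Step 1: materialize events joined with page_views only.
CREATE TABLE events_with_pv AS
SELECT e.id, e.session_id, p.url AS p_url
FROM events AS e
LEFT JOIN page_views AS p ON p.id = e.page_view_id;

-- Step 2: join the intermediate table with sessions into the final table.
CREATE TABLE events_wide AS
SELECT ep.id, ep.p_url, s.browser AS s_browser
FROM events_with_pv AS ep
LEFT JOIN sessions AS s ON s.id = ep.session_id;
""")

rows = conn.execute("SELECT * FROM events_wide ORDER BY id").fetchall()
print(rows)
```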
Say I have a books table that has id and name columns and a has_many book_ratings association. The book_ratings table has id, rating, rating_date, and book_id columns and belongs_to :book. I'm displaying the books in a table with 3 columns: the book's name, the highest rating for the book, and the most recent rating for the book. I want each column to be sortable, and I'm using the Pagy gem for pagination. I want to make a single call to each table to avoid N+1 queries.
I made this query which includes the max_rating and sorts by max_rating:
@pagy, @books = pagy(
Book.joins(:book_ratings)
.select('books.*,MAX(rating) as max_rating')
.order('max_rating')
.group('books.id')
)
With each book record, I can call book.max_rating to display the max_rating in the table. I cannot think of a way to also include the rating with the latest date and sort the records by that rating. I know I can use MAX(rating_date) to get the date of the book_rating with the most recent date, but I need a way to include the rating value of the book_rating with the latest date and also be able to order the book records by those values.
Any ideas?
To get this to work, I ended up using a select method and then chaining two joins methods. Here's what I ended up using:
Book.select('books.*,t2.rating as latest_rating,t4.rating as highest_rating')
.joins(
"INNER JOIN (
SELECT book_ratings.rating, book_ratings.book_id, book_ratings.rating_date
FROM book_ratings
JOIN (
SELECT book_id, MAX(rating_date) as max_date
FROM book_ratings GROUP BY book_id
) t1
ON t1.book_id = book_ratings.book_id
AND t1.max_date = book_ratings.rating_date
) t2
ON t2.book_id = books.id"
)
.joins(
"INNER JOIN (
SELECT book_ratings.rating, book_ratings.book_id
FROM book_ratings
JOIN (
SELECT book_id, MAX(rating) as max_rating
FROM book_ratings
GROUP BY book_id
) t3
ON t3.book_id = book_ratings.book_id
AND t3.max_rating = book_ratings.rating
) t4
ON t4.book_id = books.id"
)
.distinct
Calling .to_sql on that query gives this SQL:
SELECT DISTINCT books.*,t2.rating as latest_rating,t4.rating as highest_rating
FROM "books"
INNER JOIN (
SELECT book_ratings.rating, book_ratings.book_id, book_ratings.rating_date
FROM book_ratings
JOIN (
SELECT book_id, MAX(rating_date) as max_date
FROM book_ratings
GROUP BY book_id
) t1
ON t1.book_id = book_ratings.book_id
AND t1.max_date = book_ratings.rating_date
) t2
ON t2.book_id = books.id
INNER JOIN (
SELECT book_ratings.rating, book_ratings.book_id
FROM book_ratings
JOIN (
SELECT book_id, MAX(rating) as max_rating
FROM book_ratings
GROUP BY book_id
) t3
ON t3.book_id = book_ratings.book_id
AND t3.max_rating = book_ratings.rating
) t4
ON t4.book_id = books.id
When iterating through this collection, latest_rating and highest_rating can be called on each object, and each returns the value of the rating column for the corresponding row. The aliases can also be used for sorting with .order('latest_rating asc') (or desc) or .order('highest_rating asc') (or desc).
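To see the query shape at work outside Rails, here is a sketch that runs equivalent SQL against an in-memory SQLite database from Python. The book names, ratings, and dates are made-up sample data, and SQLite is only a stand-in for whatever database the application uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book_ratings (
    id INTEGER PRIMARY KEY, rating INTEGER, rating_date TEXT, book_id INTEGER
);
INSERT INTO books VALUES (1, 'Book A'), (2, 'Book B');
INSERT INTO book_ratings (rating, rating_date, book_id) VALUES
    (5, '2023-01-01', 1), (2, '2023-06-01', 1),
    (3, '2023-02-01', 2), (4, '2023-03-01', 2);
""")

query = """
SELECT DISTINCT books.*, t2.rating AS latest_rating, t4.rating AS highest_rating
FROM books
INNER JOIN (
    -- rating of the row with the latest rating_date per book
    SELECT book_ratings.rating, book_ratings.book_id
    FROM book_ratings
    JOIN (SELECT book_id, MAX(rating_date) AS max_date
          FROM book_ratings GROUP BY book_id) t1
      ON t1.book_id = book_ratings.book_id
     AND t1.max_date = book_ratings.rating_date
) t2 ON t2.book_id = books.id
INNER JOIN (
    -- highest rating per book
    SELECT book_ratings.rating, book_ratings.book_id
    FROM book_ratings
    JOIN (SELECT book_id, MAX(rating) AS max_rating
          FROM book_ratings GROUP BY book_id) t3
      ON t3.book_id = book_ratings.book_id
     AND t3.max_rating = book_ratings.rating
) t4 ON t4.book_id = books.id
ORDER BY latest_rating ASC
"""

rows = conn.execute(query).fetchall()
print(rows)
```

Book A's latest rating is 2 even though its highest is 5, so it sorts first when ordering by latest_rating.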
I want to join two tables in CodeIgniter, and both tables have an id column.
I want the id column from both the comment and users tables in the result.
I wrote the code below:
$this->db->select('users.name as user_full_name, users.id as userid', false);
$this->db->from('users');
$this->db->select()
->from('comment')
->where('project_id', $projectId)
->where('user_id', $user_id)
->join('users', 'comment.user_id_from =userid')
->order_by("comment.id", "asc");
return $this->db->get()->result_array();
But I get the following error, and I don't know why:
Error Number: 1064
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '* FROM (users, comment) JOIN users ON comment.user_id_from =userid W' at line 1
SELECT users.name as user_full_name, users.id as userid, * FROM (users, comment) JOIN users ON comment.user_id_from =userid WHERE project_id = '3' AND user_id = '84' ORDER BY comment.id ASC
Please show me how to solve it.
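Try this:
// Alias both id columns so they do not overwrite each other in the result.
$this->db->select('*, comment.id as comment_id, users.id as userid, users.name as user_full_name');
$this->db->from('comment');
// Join on the column from the question: comment.user_id_from references users.id.
$this->db->join('users', 'users.id = comment.user_id_from');
$this->db->where('comment.project_id', $projectId);
$this->db->where('comment.user_id', $user_id);
$this->db->order_by('comment.id', 'asc');
return $this->db->get()->result_array();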
It returns all the data from the users and comment tables.
Maybe this CodeIgniter query will help you out.
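The SQL that such a query builder call is meant to produce can be tried in isolation. Below is a sketch in Python with in-memory SQLite (not MySQL and not CodeIgniter); the table layout follows the question, while the body column and the sample rows are assumptions. It shows that aliasing each id column keeps both values available in the result rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE comment (
    id INTEGER PRIMARY KEY, project_id INTEGER,
    user_id INTEGER, user_id_from INTEGER, body TEXT
);
INSERT INTO users VALUES (7, 'Alice'), (84, 'Bob');
INSERT INTO comment VALUES
    (100, 3, 84, 7, 'first'), (101, 3, 84, 7, 'second');
""")

query = """
SELECT comment.id AS comment_id,
       users.id   AS userid,
       users.name AS user_full_name
FROM comment
JOIN users ON users.id = comment.user_id_from
WHERE comment.project_id = ? AND comment.user_id = ?
ORDER BY comment.id ASC
"""

rows = [dict(r) for r in conn.execute(query, (3, 84))]
print(rows)
```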
This is a hard problem to describe, but I have a Rails query where I join another table, and I want to exclude any results where the joined table contains a particular one of three values.
I have a Device model that relates to a CarUserRole model/record. Each CarUserRole record has one of three :role values: "owner", "monitor", or "driver". I want to return any results where there is no related CarUserRole record with role: "owner". How would I do that?
This was my first attempt -
Device.joins(:car_user_roles).where('car_user_roles.role = ? OR car_user_roles.role = ? AND car_user_roles.role != ?', 'monitor', 'driver', 'owner')
Here is the sql -
"SELECT \"cars\".* FROM \"cars\" INNER JOIN \"car_user_roles\" ON \"car_user_roles\".\"car_id\" = \"cars\".\"id\" WHERE (car_user_roles.role = 'monitor' OR car_user_roles.role = 'driver' AND car_user_roles.role != 'owner')"
Update
I should mention that a device sometimes has multiple CarUserRole records. A device can have an "owner" and a "driver" CarUserRole. I should also note that they can only have one owner.
Answer
I ended up going with @Reub's solution via our chat -
where(CarUserRole.where("car_user_roles.car_id = cars.id").where(role: 'owner').exists.not)
Since the car_user_roles table can have multiple records with the same car_id, an inner join can result in the join table having multiple rows for each row in the cars table. So, for a car that has 3 records in the car_user_roles table (monitor, owner and driver), there will be 3 records in the join table (each record having a different role). Your query will filter out the row where the role is owner, but it will match the other two, resulting in that car being returned as a result of your query even though it has a record with role as 'owner'.
Let's first try to form an SQL query for the result that you want. We can then convert this into a Rails query.
SELECT * FROM cars WHERE NOT EXISTS (SELECT id FROM car_user_roles WHERE role='owner' AND car_id = cars.id);
The above is sufficient if you want devices which do not have any car_user_role with role as 'owner'. But this can also give you devices which have no corresponding record in car_user_roles. If you want to ensure that the device has at least one record in car_user_roles, you can add the following to the above query.
AND EXISTS (SELECT id FROM car_user_roles WHERE role IN ('monitor', 'driver') AND car_id = cars.id);
Now, we need to convert this into a Rails query.
Device.where(
CarUserRole.where("car_user_roles.car_id = cars.id").where(role: 'owner').exists.not
).where(
CarUserRole.where("car_user_roles.car_id = cars.id").where(role: ['monitor', 'driver']).exists
).all
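You could also try the following if your Rails version supports where.not (Rails 4 and later). Note that exists? returns a boolean, so it cannot be chained into a relation:
Device.joins(:car_user_roles).where(car_user_roles: { role: ['monitor', 'driver'] }).where.not(id: CarUserRole.where(role: 'owner').select(:car_id)).distinct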
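The difference between the inner-join attempt and the NOT EXISTS form can be seen on a few rows. This is a sketch in Python with in-memory SQLite, using made-up data: car 1 has an owner and a driver, car 2 has a monitor and a driver, and car 3 has no roles at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cars (id INTEGER PRIMARY KEY);
CREATE TABLE car_user_roles (id INTEGER PRIMARY KEY, car_id INTEGER, role TEXT);
INSERT INTO cars VALUES (1), (2), (3);
INSERT INTO car_user_roles (car_id, role) VALUES
    (1, 'owner'), (1, 'driver'), (2, 'monitor'), (2, 'driver');
""")

def ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# The original attempt: car 1 still matches through its 'driver' row.
inner_join_ids = ids("""
SELECT DISTINCT cars.id FROM cars
INNER JOIN car_user_roles ON car_user_roles.car_id = cars.id
WHERE (car_user_roles.role = 'monitor' OR car_user_roles.role = 'driver'
       AND car_user_roles.role != 'owner')
ORDER BY cars.id
""")

# NOT EXISTS alone: excludes car 1 but also returns car 3, which has no roles.
not_exists_ids = ids("""
SELECT id FROM cars
WHERE NOT EXISTS (SELECT id FROM car_user_roles
                  WHERE role = 'owner' AND car_id = cars.id)
ORDER BY id
""")

# NOT EXISTS plus EXISTS: no owner, and at least one monitor or driver.
both_ids = ids("""
SELECT id FROM cars
WHERE NOT EXISTS (SELECT id FROM car_user_roles
                  WHERE role = 'owner' AND car_id = cars.id)
  AND EXISTS (SELECT id FROM car_user_roles
              WHERE role IN ('monitor', 'driver') AND car_id = cars.id)
ORDER BY id
""")

print(inner_join_ids, not_exists_ids, both_ids)
```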
Select the distinct cars
SELECT DISTINCT cars.* FROM cars
Use a LEFT JOIN to pull in the car_user_roles
LEFT JOIN car_user_roles ON cars.id = car_user_roles.car_id
Select only the cars that DO NOT contain an 'owner' car_user_role
WHERE NOT EXISTS(SELECT NULL FROM car_user_roles WHERE cars.id = car_user_roles.car_id AND car_user_roles.role = 'owner')
Select only the cars that DO contain either a 'driver' or 'monitor' car_user_role
AND (car_user_roles.role IN ('driver','monitor'))
Put it all together:
SELECT DISTINCT cars.* FROM cars LEFT JOIN car_user_roles ON cars.id = car_user_roles.car_id WHERE NOT EXISTS(SELECT NULL FROM car_user_roles WHERE cars.id = car_user_roles.car_id AND car_user_roles.role = 'owner') AND (car_user_roles.role IN ('driver','monitor'));
Edit:
Execute the query directly from Rails and return only the found object IDs
ActiveRecord::Base.connection.execute(sql).collect { |x| x['id'] }
I have two tables that look like this:
Products: id, category, name, description, active
Sales_sheets: id, product_id, link
product_id is a foreign key that references the id column of the products table.
I wrote a prepared statement JOIN like this which works:
SELECT p.name, p.description, s.link FROM products AS p
INNER JOIN sales_sheets AS s ON p.id = s.product_id WHERE active=1 AND category=?
Basically, a product can have a link to a PDF, but not every product will have a sales sheet. So if I try to bring up a product that doesn't have a sales sheet attached to it, the query always returns no rows.
So I thought I'd have to use a LEFT OUTER JOIN in place of the INNER JOIN, but that returns no rows too. Am I naming the tables in the wrong order? I've never had to use an OUTER JOIN before.
SELECT p.name, p.description, s.link FROM products p
LEFT JOIN sales_sheets s ON p.id = s.product_id
WHERE active = 1 && category = ?
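For comparison, here is a sketch of both join types on hypothetical data, run from Python against in-memory SQLite rather than MySQL. With products as the left table, the LEFT JOIN form keeps the product that has no sales sheet and returns NULL (None in Python) for its link. On this data, at least, the table order used in the query above behaves as intended, so the cause of the empty result may lie elsewhere, for example in the bound category value or the data itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    id INTEGER PRIMARY KEY, category TEXT, name TEXT,
    description TEXT, active INTEGER
);
CREATE TABLE sales_sheets (id INTEGER PRIMARY KEY, product_id INTEGER, link TEXT);
-- Hypothetical rows: only the first product has a sales sheet.
INSERT INTO products VALUES
    (1, 'tools', 'Hammer', 'Claw hammer', 1),
    (2, 'tools', 'Saw', 'Hand saw', 1);
INSERT INTO sales_sheets VALUES (1, 1, 'hammer.pdf');
""")

template = """
SELECT p.name, s.link FROM products AS p
{join} sales_sheets AS s ON p.id = s.product_id
WHERE p.active = 1 AND p.category = ?
ORDER BY p.id
"""

inner_rows = conn.execute(template.format(join="INNER JOIN"), ("tools",)).fetchall()
left_rows = conn.execute(template.format(join="LEFT JOIN"), ("tools",)).fetchall()

print(inner_rows)
print(left_rows)
```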