How do I clean up my join_table and remove duplicate entries? - ruby-on-rails

I have 2 models - Question and Tag - which have a HABTM between them, and they share a join table questions_tags.
Feast your eyes on this badboy:
1.9.3p392 :011 > Question.count
(852.1ms) SELECT COUNT(*) FROM "questions"
=> 417
1.9.3p392 :012 > Tag.count
(197.8ms) SELECT COUNT(*) FROM "tags"
=> 601
1.9.3p392 :013 > Question.connection.execute("select count(*) from questions_tags").first["count"].to_i
(648978.7ms) select count(*) from questions_tags
=> 39919778
I am assuming that the questions_tags join table contains a bunch of duplicate records - otherwise, I have no idea why it would be so large.
How do I clean up that join table so that it only has uniq content? Or how do I even check to see if there are duplicate records in there?
Edit 1
I am using PostgreSQL, this is the schema for the join_table questions_tags
create_table "questions_tags", :id => false, :force => true do |t|
t.integer "question_id"
t.integer "tag_id"
end
add_index "questions_tags", ["question_id"], :name => "index_questions_tags_on_question_id"
add_index "questions_tags", ["tag_id"], :name => "index_questions_tags_on_tag_id"

I'm adding this as a new answer since it's a lot different from my last. This one doesn't assume that you have an id column on the join table. This creates a new table, selects unique rows into it, then drops the old table and renames the new one. This will be much faster than anything involving a subselect.
foo=# select * from questions_tags;
question_id | tag_id
-------------+--------
1 | 2
2 | 1
2 | 2
1 | 1
1 | 1
(5 rows)
foo=# select distinct question_id, tag_id into questions_tags_tmp from questions_tags;
SELECT 4
foo=# select * from questions_tags_tmp;
question_id | tag_id
-------------+--------
2 | 2
1 | 2
2 | 1
1 | 1
(4 rows)
foo=# drop table questions_tags;
DROP TABLE
foo=# alter table questions_tags_tmp rename to questions_tags;
ALTER TABLE
foo=# select * from questions_tags;
question_id | tag_id
-------------+--------
2 | 2
1 | 2
2 | 1
1 | 1
(4 rows)

Delete tag associations with bad tag reference
DELETE FROM questions_tags
WHERE NOT EXISTS ( SELECT 1
FROM tags
WHERE tags.id = questions_tags.tag_id);
Delete tag associations with bad question reference
DELETE FROM questions_tags
WHERE NOT EXISTS ( SELECT 1
FROM questions
WHERE questions.id = questions_tags.question_id);
Delete duplicate tag associations
DELETE FROM questions_tags
USING ( SELECT qt3.user_id, qt3.question_id, MIN(qt3.id) id
FROM questions_tags qt3
GROUP BY qt3.user_id, qt3.question_id
) qt2
WHERE questions_tags.user_id=qt2.user_id AND
questions_tags.question_id=qt2.question_id AND
questions_tags.id != qt2.id
Note:
Please test the SQL's in your development environment before trying them on your production environment.

Related

Select rows with no match in join table with where condition

In a Rails app with Postgres I have a users, jobs and followers join table. I want to select jobs that are not followed by a specific user. But also jobs with no rows in the join table.
Tables:
users:
id: bigint (pk)
jobs:
id: bigint (pk)
followings:
id: bigint (pk)
job_id: bigint (fk)
user_id: bigint (fk)
Data:
sandbox_development=# SELECT id FROM jobs;
id
----
1
2
3
(3 rows)
sandbox_development=# SELECT id FROM users;
id
----
1
2
sandbox_development=#
SELECT id, user_id, job_id FROM followings;
id | user_id | job_id
----+---------+--------
1 | 1 | 1
2 | 2 | 2
(2 rows)
Expected result
# jobs
id
----
2
3
(2 rows)
Can I create a join query that is the equivalent of this?
sandbox_development=#
SELECT j.id FROM jobs j
WHERE NOT EXISTS(
SELECT 1 FROM followings f
WHERE f.user_id = 1 AND f.job_id = j.id
);
id
----
2
3
(2 rows)
Which does the job but is a PITA to create with ActiveRecord.
So far I have:
Job.joins(:followings).where(followings: { user_id: 1 })
SELECT "jobs".* FROM "jobs"
INNER JOIN "followings"
ON "followings"."job_id" = "jobs"."id"
WHERE "followings"."user_id" != 1
But since its an inner join it does not include jobs with no followers (job id 3). I have also tried various attempts at outer joins that either give all the rows or no rows.
In Rails 5, You can use #left_outer_joins with where not to achieve the result. Left joins doesn't return null rows. So, We need to add nil conditions to fetch the rows.
Rails 5 Query:
Job.left_outer_joins(:followings).where.not(followings: {user_id: 1}).or(Job.left_outer_joins(:followings).where(followings: {user_id: nil}))
Alternate Query:
Job.left_outer_joins(:followings).where("followings.user_id != 1 OR followings.user_id is NULL")
Postgres Query:
SELECT "jobs".* FROM "jobs" LEFT OUTER JOIN "followings" ON "followings"."job_id" = "jobs"."id" WHERE "followings"."user_id" != 1 OR followings.user_id is NULL;
I'm not sure I understand, but this has the output you want and use outer join:
SELECT j.*
FROM jobs j LEFT JOIN followings f ON f.job_id = j.id
LEFT JOIN users u ON u.id = f.user_id AND u.id = 1
WHERE u.id IS NULL;

How to find posts tagged with more than one tag in Rails and Postgresql

I have the models Post, Tag, and PostTag. A post has many tags through post tags. I want to find posts that are exclusively tagged with more than one tag.
has_many :post_tags
has_many :tags, through: :post_tags
For example, given this data set:
posts table
--------------------
id | title |
--------------------
1 | Carb overload |
2 | Heart burn |
3 | Nice n Light |
tags table
-------------
id | name |
-------------
1 | tomato |
2 | potato |
3 | basil |
4 | rice |
post_tags table
-----------------------
id | post_id | tag_id |
-----------------------
1 | 1 | 1 |
2 | 1 | 2 |
3 | 2 | 1 |
4 | 2 | 3 |
5 | 3 | 1 |
I want to find posts tagged with tomato AND basil. This should return only the "Heart burn" post (id 2). Likewise, if I query for posts tagged with tomato AND potato, it should return the "Carb overload" post (id 1).
I tried the following:
Post.joins(:tags).where(tags: { name: ['basil', 'tomato'] })
SQL
SELECT "posts".* FROM "posts"
INNER JOIN "post_tags" ON "post_tags"."post_id" = "posts"."id"
INNER JOIN "tags" ON "tags"."id" = "post_tags"."tag_id"
WHERE "tags"."name" IN ('basil', 'tomato')
This returns all three posts because all share the tag tomato. I also tried this:
Post.joins(:tags).where(tags: { name 'basil' }).where(tags: { name 'tomato' })
SQL
SELECT "posts".* FROM "posts"
INNER JOIN "post_tags" ON "post_tags"."post_id" = "posts"."id"
INNER JOIN "tags" ON "tags"."id" = "post_tags"."tag_id"
WHERE "tags"."name" = 'basil' AND "tags"."name" = 'tomato'
This returns no records.
How can I query for posts tagged with multiple tags?
You may want to review the possible ways to write this kind of query in this answer for applying conditions to multiple rows in a join. Here is one possible option for implementing your query in Rails using 1B, the sub-query approach...
Define a query in the PostTag model that will grab up the Post ID values for a given Tag name:
# PostTag.rb
def self.post_ids_for_tag(tag_name)
joins(:tag).where(tags: { name: tag_name }).select(:post_id)
end
Define a query in the Post model that will grab up the Post records for a given Tag name, using a sub-query structure:
# Post.rb
def self.for_tag(tag_name)
where("id IN (#{PostTag.post_ids_for_tag(tag_name).to_sql})")
end
Then you can use a query like this:
Post.for_tag("basil").for_tag("tomato")
Use method .includes, like this:
Item.where(xpto: "test")
.includes({:orders =>[:suppliers, :agents]}, :manufacturers)
Documentation to .includes here.

Rails ActiveRecord distinct count of attribute greater than number of occurrences

I have a a table author_comments with a fields author_name, comment and brand id.
I would like to get the number (count) of records where the author has more than N (2) records for a given brand.
For example,
author_comments
author_name comment brand
joel "loves donuts" 1
joel "loves cookies" 1
joel "loves oranges" 1
fred "likes bananas" 2
fred "likes tacos" 2
fred "likes chips" 2
joe "thinks this is cool" 1
sally "goes to school" 1
sally "is smart" 1
sally "plays soccer" 1
In this case my query should return 2 for brand 1 and 1 for brand 2.
I'm interested in the best performing option here, not getting all the records from the db and sorting through them in ruby, I can do this. I'm looking for best way using active record constructs or sql.
Update:
Here is the SQL:
SELECT author_name, COUNT(*) AS author_comments
FROM fan_comments
WHERE brand_id =269998788
GROUP BY author_name
HAVING author_comments > 2;
Should I just do find_by_sql?
You can define the same query using active record constructions:
FanComments.all(
:select => 'author_name, count(*) as author_comments',
:group => 'author_name',
:having => 'author_comments > 2') # in rails 2
or:
FanComments.
select('author_name, count(*) as author_comments').
group('author_name').
having('author_comments > 2') # in rails 3
FanComment.group(:author_name).count

How to count group by rows in rails?

When I use User.count(:all, :group => "name"), I get multiple rows, but it's not what I want. What I want is the count of the rows. How can I get it?
Currently (18.03.2014 - Rails 4.0.3) this is correct syntax:
Model.group("field_name").count
It returns hash with counts as values
e.g.
SurveyReport.find(30).reports.group("status").count
#=> {
"pdf_generated" => 56
}
User.count will give you the total number of users and translates to the following SQL: SELECT count(*) AS count_all FROM "users"
User.count(:all, :group => 'name') will give you the list of unique names, along with their counts, and translates to this SQL: SELECT count(*) AS count_all, name AS name FROM "users" GROUP BY name
I suspect you want option 1 above, but I'm not clear on what exactly you want/need.
Probably you want to count the distinct name of the user?
User.count(:name, :distinct => true)
would return 3 if you have user with name John, John, Jane, Joey (for example) in the database.
________
| name |
|--------|
| John |
| John |
| Jane |
| Joey |
|________|
Try using User.find(:all, :group => "name").count
Good luck!
I found an odd way that seems to work. To count the rows returned from the grouping counts.
User Table Example
________
| name |
|--------|
| Bob |
| Bob |
| Joe |
| Susan |
|________|
Counts in the Groups
User.group(:name).count
# SELECT COUNT(*) AS count_all
# FROM "users"
# GROUP BY "users"."name"
=> {
"Bob" => 2,
"Joe" => 1,
"Susan" => 1
}
Row Count from the Counts in the Groups
User.group(:name).count.count
=> 5
Something Hacky
Here's something interesting I ran into, but it's quite hacky as it will add the count to every row, and doesn't play too well in active record land. I don't remember if I was able to get this into an Arel / ActiveRecord query.
SELECT COUNT(*) OVER() AS count, COUNT(*) AS count_all
FROM "users"
GROUP BY "users"."name"
[
{ count: 3, count_all: 2, name: "Bob" },
{ count: 3, count_all: 1, name: "Joe" },
{ count: 3, count_all: 1, name: "Susan" }
]

activerecord has_many :through find with one sql call

I have a these 3 models:
class User < ActiveRecord::Base
has_many :permissions, :dependent => :destroy
has_many :roles, :through => :permissions
end
class Permission < ActiveRecord::Base
belongs_to :role
belongs_to :user
end
class Role < ActiveRecord::Base
has_many :permissions, :dependent => :destroy
has_many :users, :through => :permissions
end
I want to find a user and it's roles in one sql statement, but I can't seem to achieve this:
The following statement:
user = User.find_by_id(x, :include => :roles)
Gives me the following queries:
User Load (1.2ms) SELECT * FROM `users` WHERE (`users`.`id` = 1) LIMIT 1
Permission Load (0.8ms) SELECT `permissions`.* FROM `permissions` WHERE (`permissions`.user_id = 1)
Role Load (0.8ms) SELECT * FROM `roles` WHERE (`roles`.`id` IN (2,1))
Not exactly ideal. How do I do this so that it does one sql query with joins and loads the user's roles into memory so saying:
user.roles
doesn't issue a new sql query
Loading the Roles in a separate SQL query is actually an optimization called "Optimized Eager Loading".
Role Load (0.8ms) SELECT * FROM `roles` WHERE (`roles`.`id` IN (2,1))
(It is doing this instead of loading each role separately, the N+1 problem.)
The Rails team found it was usually faster to use an IN query with the associations looked up previously instead of doing a big join.
A join will only happen in this query if you add conditions on one of the other tables. Rails will detect this and do the join.
For example:
User.all(:include => :roles, :conditions => "roles.name = 'Admin'")
See the original ticket, this previous Stack Overflow question, and Fabio Akita's blog post about Optimized Eager Loading.
As Damien pointed out, if you really want a single query every time you should use join.
But you might not want a single SQL call. Here's why (from here):
Optimized Eager Loading
Let’s take a look at this:
Post.find(:all, :include => [:comments])
Until Rails 2.0 we would see something like the following SQL query in the log:
SELECT `posts`.`id` AS t0_r0, `posts`.`title` AS t0_r1, `posts`.`body` AS t0_r2, `comments`.`id` AS t1_r0, `comments`.`body` AS t1_r1 FROM `posts` LEFT OUTER JOIN `comments` ON comments.post_id = posts.id
But now, in Rails 2.1, the same command will deliver different SQL queries. Actually at least 2, instead of 1. “And how can this be an improvement?” Let’s take a look at the generated SQL queries:
SELECT `posts`.`id`, `posts`.`title`, `posts`.`body` FROM `posts`
SELECT `comments`.`id`, `comments`.`body` FROM `comments` WHERE (`comments`.post_id IN (130049073,226779025,269986261,921194568,972244995))
The :include keyword for Eager Loading was implemented to tackle the dreaded 1+N problem. This problem happens when you have associations, then you load the parent object and start loading one association at a time, thus the 1+N problem. If your parent object has 100 children, you would run 101 queries, which is not good. One way to try to optimize this is to join everything using an OUTER JOIN clause in the SQL, that way both the parent and children objects are loaded at once in a single query.
Seemed like a good idea and actually still is. But for some situations, the monster outer join becomes slower than many smaller queries. A lot of discussion has been going on and you can take a look at the details at the tickets 9640, 9497, 9560, L109.
The bottom line is: generally it seems better to split a monster join into smaller ones, as you’ve seen in the above example. This avoid the cartesian product overload problem. For the uninitiated, let’s run the outer join version of the query:
mysql> SELECT `posts`.`id` AS t0_r0, `posts`.`title` AS t0_r1, `posts`.`body` AS t0_r2, `comments`.`id` AS t1_r0, `comments`.`body` AS t1_r1 FROM `posts` LEFT OUTER JOIN `comments` ON comments.post_id = posts.id ;
+-----------+-----------------+--------+-----------+---------+
| t0_r0 | t0_r1 | t0_r2 | t1_r0 | t1_r1 |
+-----------+-----------------+--------+-----------+---------+
| 130049073 | Hello RailsConf | MyText | NULL | NULL |
| 226779025 | Hello Brazil | MyText | 816076421 | MyText5 |
| 269986261 | Hello World | MyText | 61594165 | MyText3 |
| 269986261 | Hello World | MyText | 734198955 | MyText1 |
| 269986261 | Hello World | MyText | 765025994 | MyText4 |
| 269986261 | Hello World | MyText | 777406191 | MyText2 |
| 921194568 | Rails 2.1 | NULL | NULL | NULL |
| 972244995 | AkitaOnRails | NULL | NULL | NULL |
+-----------+-----------------+--------+-----------+---------+
8 rows in set (0.00 sec)
Pay attention to this: do you see lots of duplications in the first 3 columns (t0_r0 up to t0_r2)? Those are the Post model columns, the remaining being each post’s comment columns. Notice that the “Hello World” post was repeated 4 times. That’s what a join does: the parent rows are repeated for each children. That particular post has 4 comments, so it was repeated 4 times.
The problem is that this hits Rails hard, because it will have to deal with several small and short-lived objects. The pain is felt in the Rails side, not that much on the MySQL side. Now, compare that to the smaller queries:
mysql> SELECT `posts`.`id`, `posts`.`title`, `posts`.`body` FROM `posts` ;
+-----------+-----------------+--------+
| id | title | body |
+-----------+-----------------+--------+
| 130049073 | Hello RailsConf | MyText |
| 226779025 | Hello Brazil | MyText |
| 269986261 | Hello World | MyText |
| 921194568 | Rails 2.1 | NULL |
| 972244995 | AkitaOnRails | NULL |
+-----------+-----------------+--------+
5 rows in set (0.00 sec)
mysql> SELECT `comments`.`id`, `comments`.`body` FROM `comments` WHERE (`comments`.post_id IN (130049073,226779025,269986261,921194568,972244995));
+-----------+---------+
| id | body |
+-----------+---------+
| 61594165 | MyText3 |
| 734198955 | MyText1 |
| 765025994 | MyText4 |
| 777406191 | MyText2 |
| 816076421 | MyText5 |
+-----------+---------+
5 rows in set (0.00 sec)
Actually I am cheating a little bit, I manually removed the created_at and updated_at fields from the all the above queries in order for you to understand it a little bit clearer. So, there you have it: the posts result set, separated and not duplicated, and the comments result set with the same size as before. The longer and more complex the result set, the more this matters because the more objects Rails would have to deal with. Allocating and deallocating several hundreds or thousands of small duplicated objects is never a good deal.
But this new feature is smart. Let’s say you want something like this:
>> Post.find(:all, :include => [:comments], :conditions => ["comments.created_at > ?", 1.week.ago.to_s(:db)])
In Rails 2.1, it will understand that there is a filtering condition for the ‘comments’ table, so it will not break it down into the small queries, but instead, it will generate the old outer join version, like this:
SELECT `posts`.`id` AS t0_r0, `posts`.`title` AS t0_r1, `posts`.`body` AS t0_r2, `posts`.`created_at` AS t0_r3, `posts`.`updated_at` AS t0_r4, `comments`.`id` AS t1_r0, `comments`.`post_id` AS t1_r1, `comments`.`body` AS t1_r2, `comments`.`created_at` AS t1_r3, `comments`.`updated_at` AS t1_r4 FROM `posts` LEFT OUTER JOIN `comments` ON comments.post_id = posts.id WHERE (comments.created_at > '2008-05-18 18:06:34')
So, nested joins, conditions, and so forth on join tables should still work fine. Overall it should speed up your queries. Some reported that because of more individual queries, MySQL seems to receive a stronger punch CPU-wise. Do you home work and make your stress tests and benchmarks to see what happens.
Including a model loads the datas. But makes a second query.
For what you want to do, you should use the :joins parameter.
user = User.find_by_id(x, :joins => :roles)

Resources