Rails Arel Cohort Analysis - ruby-on-rails

I'm trying to do a cohort analysis query in Rails but running into trouble with the correct way to group by the last action date.
I want to end up with rows of the following data for something like this: http://www.quickcohort.com/
count first_action last_action
for all users who registered in the last year. first_action and last_action are truncated to the nearest month.
Getting the counts grouped by the first_action is easy enough, but when I try to extend it to include the last_action I encounter
ActiveRecord::StatementInvalid: PGError: ERROR: aggregates not allowed in GROUP BY clause
Here's what I have so far
User
.select("COUNT(*) AS count,
date_trunc('month', users.created_at) AS first_action,
MAX(date_trunc('month', visits.created_at)) AS last_action # <= Problem
")
.joins(:visits)
.group("first_action, last_action") # TODO: Subquery ?
.order("first_action ASC, last_action ASC")
.where("users.created_at >= date_trunc('month', CAST(? AS timestamp))", 12.months.ago)
The visits table tracks all visits users make to the site. Using the latest visit as the last action seems like it should be easy, but I'm having trouble forming it into SQL.
I'm also open to other solutions if there are better ways, but it seems like a single SQL query would be most performant.

I think you need to do this in a subquery. Something like:
select first_action, last_action, count(1)
from (
  select
    min(date_trunc('month', visits.created_at)) as first_action,
    max(date_trunc('month', visits.created_at)) as last_action
  from visits
  join users on users.id = visits.user_id
  where users.created_at >= ?
  group by visits.user_id
) as user_visits
group by first_action, last_action;
I'm not sure what the most elegant way would be to do this in ARel, but I think it'd be something like this. (Might just be easier to use the SQL directly.)
def date_trunc_month(field)
  Arel::Nodes::NamedFunction.new(
    'date_trunc', [Arel.sql("'month'"), field])
end
def max(*expressions)
  Arel::Nodes::Max.new(expressions)
end
def min(*expressions)
  Arel::Nodes::Min.new(expressions)
end
users = User.arel_table
visits = Visit.arel_table
# One row per user: the month of the first and of the last visit.
user_visits = visits.
  join(users).on(visits[:user_id].eq(users[:id])).
  where(users[:created_at].gteq(12.months.ago)).
  group(users[:id]).
  project(
    users[:id],
    min(date_trunc_month(visits[:created_at])).as('first_visit'),
    max(date_trunc_month(visits[:created_at])).as('last_visit')
  ).
  as('user_visits')
# Count users per (first_visit, last_visit) pair.
cohort_data = users.
  join(user_visits).on(users[:id].eq(user_visits[:id])).
  group(user_visits[:first_visit], user_visits[:last_visit]).
  project(
    user_visits[:first_visit],
    user_visits[:last_visit],
    Arel::Nodes::Count.new([1]).as('count')
  )
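The two-step shape of the subquery approach (aggregate per user first, then group the per-user rows) may be easier to see in plain Ruby. This is only an in-memory sketch, not an ActiveRecord or Arel query; the sample data and the names VISITS_BY_USER, month_start, user_visit_months and cohort_counts are made up for illustration.

```ruby
require 'date'

# Hypothetical sample data: user id => dates of that user's visits.
VISITS_BY_USER = {
  1 => [Date.new(2023, 1, 5), Date.new(2023, 3, 9)],
  2 => [Date.new(2023, 1, 20), Date.new(2023, 2, 11), Date.new(2023, 3, 1)],
  3 => [Date.new(2023, 2, 2)]
}.freeze

# Ruby stand-in for date_trunc('month', ...).
def month_start(date)
  Date.new(date.year, date.month, 1)
end

# Inner query: one row per user with the first and last visit month.
def user_visit_months(visits_by_user)
  visits_by_user.map do |user_id, dates|
    months = dates.map { |d| month_start(d) }
    { user_id: user_id, first_visit: months.min, last_visit: months.max }
  end
end

# Outer query: count users per (first_visit, last_visit) pair.
def cohort_counts(visits_by_user)
  user_visit_months(visits_by_user)
    .group_by { |row| [row[:first_visit], row[:last_visit]] }
    .transform_values(&:size)
end
```

Calling cohort_counts(VISITS_BY_USER) returns a hash keyed by [first_visit, last_visit] pairs. The aggregate (min/max) is computed in the first step, which is why the second step can group by it without hitting the "aggregates not allowed in GROUP BY" error.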

Related

How to get a most recent value group by year by using SQL

I have a Company model that has_many Statement.
class Company < ActiveRecord::Base
has_many :statements
end
I want to get the statements that have the latest date field for each fiscal_year_end value.
I implemented the function like this:
c = Company.first
c.statements.to_a.group_by{|s| s.fiscal_year_end }.map{|k,v| v.max_by(&:date) }
It works OK, but if possible I want to use an ActiveRecord query (SQL), so that I don't need to load unnecessary instances into memory.
How can I write it by using SQL?
select t.username, t.date, t.value
from MyTable t
inner join (
select username, max(date) as MaxDate
from MyTable
group by username
) tm on t.username = tm.username and t.date = tm.MaxDate
For these kinds of things, I find it helpful to get the raw SQL working first, and then translate it into ActiveRecord afterwards. It sounds like a textbook case of GROUP BY:
SELECT fiscal_year_end, MAX(date) AS max_date
FROM statements
WHERE company_id = 1
GROUP BY fiscal_year_end
Now you can express that in ActiveRecord like so:
c = Company.first
c.statements.
group(:fiscal_year_end).
order(nil). # might not be necessary, depending on your association and Rails version
select("fiscal_year_end, MAX(date) AS max_date")
The reason for order(nil) is to prevent ActiveRecord from adding ORDER BY id to the query. Rails 4+ does this automatically. Since you aren't grouping by id, it will cause the error you're seeing. You could also order(:fiscal_year_end) if that is what you want.
That will give you a bunch of Statement objects. They will be read-only, and every attribute will be nil except for fiscal_year_end and the magically-present new field max_date. These instances don't represent specific statements, but statement "groups" from your query. So you can do something like this:
- @statements_by_fiscal_year_end.each do |s|
  %tr
    %td= s.fiscal_year_end
    %td= s.max_date
Note there is no n+1 query problem here, because you fetched everything you need in one query.
If you decide that you need more than just the max date, e.g. you want the whole statement with the latest date, then you should look at your options for the greatest n per group problem. For raw SQL I like LATERAL JOIN, but the easiest approach to use with ActiveRecord is DISTINCT ON.
Oh one more tip: For debugging weird errors, I find it helpful to confirm what SQL ActiveRecord is trying to use. You can use to_sql to get that:
c = Company.first
puts c.statements.
group(:fiscal_year_end).
select("fiscal_year_end, MAX(date) AS max_date").
to_sql
In that example, I'm leaving off order(nil) so you can see that ActiveRecord is adding an ORDER BY clause you don't want.
For example, if you want to group statements by the start of the month, you could use this:
@company = Company.first
@statements = @company.statements.order('due_at, id').limit(50)
Then group them as you want:
@monthly_statements = @statements.group_by { |statement| statement.due_at.beginning_of_month }
Building upon Bharat's answer you can do this type of query in Rails using find_by_sql in this way:
Statement.find_by_sql ["SELECT t.* FROM statements t INNER JOIN (
SELECT fiscal_year_end, MAX(date) AS MaxDate FROM statements GROUP BY fiscal_year_end
) tm ON t.fiscal_year_end = tm.fiscal_year_end AND
t.date = tm.MaxDate WHERE t.company_id = ?", company.id]
Note the last where part to make sure the statements belong to a specific company instance, and that this is called from the class. I haven't tested this with the array form, but I believe you can turn this into a scope and use it like this:
# In Statement model
scope :latest_from_fiscal_year, lambda { |enterprise_id|
  find_by_sql([..., enterprise_id]) # Query above
}
# Wherever you need these statements for a particular company
company = Company.find(params[:id])
latest_statements = Statement.latest_from_fiscal_year(company.id)
Note that if you somehow need the latest statements for all companies, this will most likely leave you with an N+1 queries problem. But that is a beast for another day.
Note: If anyone else has a way to have this query work on the association without using the last where part (company.statements.latest_from_year and such) let me know and I'll edit this, in my case in rails 3 it just pulled em from the whole table without filtering.

Ruby on Rails Active Record Query

Employers and Jobs. Employers have many jobs. Jobs have a boolean field started.
I am trying to query and find a count for Employers that have more than one job that is started.
How do I do this?
Employer.first.jobs.where(started: true).count
Do I use a loop with a counter or is there a way I can do it with a query?
Thanks!
You can have condition on join
Employer.joins(:jobs).where(jobs: {started: true}).count
You could create a scope like this in your Employer model:
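def self.with_started_job
  joins(:jobs)
    .where(jobs: { started: true })
    .group('employers.id')
    .having('COUNT(jobs.id) > 1')
end
Then, to get the number of employers that have more than one started job, you can use Employer.with_started_job.count.size. With a group clause, count returns a hash keyed by employers.id, so size gives the number of employers.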
What is missing is a group by clause. Use .group() then count. Something like Employer.select("employers.id,count(*)").joins(:jobs).where("jobs.started = 1").group("employers.id")
The query joins both tables, eliminates the records that are false, then it counts the total records for each employer.id when grouped together.
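The filter, group, count sequence described above can be sketched in plain Ruby to make each SQL clause's role visible. This is an in-memory illustration only, not an ActiveRecord query; JobRow, busy_employer_ids and busy_employer_count are hypothetical names.

```ruby
# Hypothetical stand-in for a row of the jobs table.
JobRow = Struct.new(:employer_id, :started)

# Ids of employers with more than one started job.
def busy_employer_ids(jobs)
  jobs.select(&:started)                                             # WHERE jobs.started
      .group_by(&:employer_id)                                       # GROUP BY employer id
      .select { |_employer_id, started_jobs| started_jobs.size > 1 } # HAVING COUNT(*) > 1
      .keys
end

# Number of such employers (the outer COUNT).
def busy_employer_count(jobs)
  busy_employer_ids(jobs).size
end
```

Note that the final count is taken over the groups that survive the HAVING step, not over the joined rows.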
Time for exploratory programming!
Given I don't know SQL really well, my way of doing this may not be optimal. I've been unable to use two aggregations without using a subquery. So I split the task in two:
Fetch all employers who have more than one job started
Count the number of entries in the result set
All on the database level, of course! And without raw SQL, so using Arel here and there. Here's what I've come up with:
class Employer < ActiveRecord::Base
has_many :jobs
# I explored my possibilities using this method: fetches
# all the employers and number of jobs each has started.
# Looks best> Employer.with_started_job_counts.map(&:attributes)
# In the final method this one is not used, it's just handy.
def self.with_started_job_counts
jobs = Job.arel_table
joins(:jobs).select(arel_table[Arel.star],
jobs[:id].count.as('job_count'))
.merge(Job.where(started: true))
.group(:id)
end
# Alright. Now try to apply it. Seems to work alright.
def self.busy
jobs = Job.arel_table
joins(:jobs).merge(Job.where(started: true))
.group(:id)
.having(jobs[:id].count.gt 1)
end
# This is one of the tricks I've learned while fiddling
# with Arel. Counts any relations on the database level.
def self.generic_count(subquery)
from(subquery).count
end
# So now... we get this.
def self.busy_count
generic_count(busy)
end
# ...and it seems to get what we need!
end
Resulting SQL is...large. Not huge, but you may have to solve performance issues with it.
SELECT COUNT(*)
FROM (
SELECT "employers".*
FROM "employers"
INNER JOIN "jobs" ON "jobs"."employer_id" = "employers"."id"
WHERE "jobs"."started" = ?
GROUP BY "employers"."id"
HAVING COUNT("jobs"."id") > 1
) subquery [["started", "t"]]
...still, it does seem to get the result.

How to query the result of a method in Ruby (rails)

I'm struggling with a particular feature.
In my users model I have a method to work out their age based on the current date less their first order date.
I'd like to be able to find all users who are older than X days. I can find active users by querying a column called state for 'active' users. But I'm unsure how to query the result of the age method to find users older than X.
Does anyone have any ideas?
Many thanks and seasons greetings.
Edit:
In PostgreSQL I would write:
WITH
firstbill as (
SELECT
DISTINCT(user_id) as customer,
DATE(MIN(billed_at)) as first_order
FROM orders
WHERE state = 'shipped'
GROUP BY 1
ORDER BY 1)
SELECT count(*)
FROM
(SELECT *, (current_date - first_order) as age
FROM firstbill
JOIN users on users.id = customer) as t2
WHERE age >= 21
I have tried using User.find_by_sql("above query"), but that returns an array rather than an ActiveRecord relation, which makes any further joins a little harder.
You cannot really query for the return value of a method. Because to do so, you need to load all users and then call that method on every user, like this User.all.select(&:your_method?). That will be very slow if you have many users.
But for your particular example you can write something like this to let the database return the correct users (assuming you have a first_order column on your users table):
User.where('first_order <= ?', 90.days.ago)
or
User.where('first_order <= ?', 1.month.ago)
I think the following statement should return the same users as your PostgreSQL example:
User.
  select('users.*, MIN(DATE(orders.billed_at)) AS first_order_on').
  joins('INNER JOIN orders ON orders.user_id = users.id'). # just `joins(:orders)` with `has_many :orders` on User
  where('orders.state = ?', 'shipped').
  group('users.id').
  having('MIN(DATE(orders.billed_at)) <= ?', 21.days.ago.to_date)
Solved by using
scope :acquired, -> { joins(:orders).where("orders.state = ?", "shipped") }
scope :older_than_age, ->(age) {
  acquired.group("users.id").having("(current_date - date(min(orders.shipped_at))) >= ?", age)
}
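For comparison, here is the same "age since first shipped order" filter written as plain Ruby over in-memory rows. It is a sketch of the logic the GROUP BY / HAVING query performs in the database, not a replacement for it; OrderRow, first_shipped_order_dates and user_ids_older_than are hypothetical names, and billed_at is assumed to be a Date.

```ruby
require 'date'

# Hypothetical stand-in for a row of the orders table.
OrderRow = Struct.new(:user_id, :state, :billed_at)

# user_id => date of that user's first shipped order.
def first_shipped_order_dates(orders)
  orders.select { |o| o.state == 'shipped' }                    # WHERE state = 'shipped'
        .group_by(&:user_id)                                    # GROUP BY user_id
        .transform_values { |rows| rows.map(&:billed_at).min }  # MIN(billed_at)
end

# Ids of users whose first shipped order is at least min_age_days old.
def user_ids_older_than(orders, min_age_days, today: Date.today)
  first_shipped_order_dates(orders)
    .select { |_user_id, first_order| (today - first_order) >= min_age_days }
    .keys
end
```

Doing this in Ruby means loading every order, which is exactly what the database-side scope avoids; the sketch is only meant to show what the query computes.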

Sequel -- How To Construct This Query?

I have a users table, which has a one-to-many relationship with a user_purchases table via the foreign key user_id. That is, each user can make many purchases (or may have none, in which case he will have no entries in the user_purchases table).
user_purchases has only one other field that is of interest here, which is purchase_date.
I am trying to write a Sequel ORM statement that will return a dataset with the following columns:
user_id
date of the users SECOND purchase, if it exists
So users who have not made at least 2 purchases will not appear in this dataset. What is the best way to write this Sequel statement?
Please note I am looking for a dataset with ALL users returned who have >= 2 purchases
Thanks!
EDIT FOR CLARITY
Here is a similar statement I wrote to get users and their first purchase date (as opposed to 2nd purchase date, which I am asking for help with in the current post):
DB[:users].join(:user_purchases, :user_id => :id)
.select{[:user_id, min(:purchase_date)]}
.group(:user_id)
You don't seem to be worried about the dates, just the counts so
DB[:user_purchases].group_and_count(:user_id).having(:count > 1).all
will return a list of user_ids and counts where the count (of purchases) is >= 2. Something like
[{:count=>2, :user_id=>1}, {:count=>7, :user_id=>2}, {:count=>2, :user_id=>3}, ...]
If you want to get the users with that, the easiest way with Sequel is probably to extract just the list of user_ids and feed that back into another query:
DB[:users].where(:id => DB[:user_purchases].group_and_count(:user_id).
having(:count > 1).all.map{|row| row[:user_id]}).all
Edit:
I felt like there should be a more succinct way and then I saw this answer (from Sequel author Jeremy Evans) to another question using select_group and select_more : https://stackoverflow.com/a/10886982/131226
This should do it without the subselect:
DB[:users].
left_join(:user_purchases, :user_id=>:id).
select_group(:id).
select_more{count(:purchase_date).as(:purchase_count)}.
having(:purchase_count > 1)
It generates this SQL
SELECT `id`, count(`purchase_date`) AS 'purchase_count'
FROM `users` LEFT JOIN `user_purchases`
ON (`user_purchases`.`user_id` = `users`.`id`)
GROUP BY `id` HAVING (`purchase_count` > 1)
Generally, this could be the SQL query that you need:
SELECT u.id, up1.purchase_date FROM users u
LEFT JOIN user_purchases up1 ON u.id = up1.user_id
LEFT JOIN user_purchases up2 ON u.id = up2.user_id AND up2.purchase_date < up1.purchase_date
GROUP BY u.id, up1.purchase_date
HAVING COUNT(up2.purchase_date) = 1;
Try converting that to sequel, if you don't get any better answers.
The date of the user's second purchase would be the second row retrieved if you do an order_by(:purchase_date) as part of your query.
To access that, do a limit(2) to constrain the query to two results then take the [-1] (or last) one. So, if you're not using models and are working with datasets only, and know the user_id you're interested in, your (untested) query would be:
DB[:user_purchases].where(:user_id => user_id).order_by(:purchase_date).limit(2).all[-1]
Here's some output from Sequel's console:
DB[:user_purchases].where(:user_id => 1).order_by(:purchase_date).limit(2).sql
=> "SELECT * FROM user_purchases WHERE (user_id = 1) ORDER BY purchase_date LIMIT 2"
Add the appropriate select clause:
.select(:user_id, :purchase_date)
and you should be done:
DB[:user_purchases].select(:user_id, :purchase_date).where(:user_id => 1).order_by(:purchase_date).limit(2).sql
=> "SELECT user_id, purchase_date FROM user_purchases WHERE (user_id = 1) ORDER BY purchase_date LIMIT 2"
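Since the question asks for all users with at least two purchases rather than a single known user_id, here is a plain-Ruby sketch of the expected result, which can be handy for checking whatever Sequel query you end up with. It works on in-memory hashes shaped like user_purchases rows; second_purchase_dates is a hypothetical helper, not part of Sequel.

```ruby
require 'date'

# Maps user_id => date of the second purchase.
# Users with fewer than two purchases are dropped.
def second_purchase_dates(purchases)
  purchases
    .group_by { |row| row[:user_id] }
    .transform_values { |rows| rows.map { |row| row[:purchase_date] }.sort[1] }
    .compact
end
```

For each user the purchase dates are sorted and the element at index 1 is taken; it is nil for users with a single purchase, and compact removes those entries.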

Rails 3. How to perform a "where" query by a virtual attribute?

I have two models: ScheduledCourse and ScheduledSession.
scheduled_course has_many scheduled_sessions
scheduled_session belongs_to scheduled_course
ScheduledCourse has a virtual attribute...
def start_at
s = ScheduledSession.where("scheduled_course_id = ?", self.id).order("happening_at ASC").limit(1)
s[0].happening_at
end
... the start_at virtual attribute checks all the ScheduledSessions that belongs to the ScheduledCourse and it picks the earliest one. So start_at is the date when the first session happens.
Now I need to write a query in the controller that gets only the courses that start today or later. I also need another query that gets only past courses.
I can't do the following because start_at is a virtual attribute
@scheduled_courses = ScheduledCourse.where('start_at >= ?', Date.today).page(params[:page])
@scheduled_courses = ScheduledCourse.where('start_at <= ?', Date.today)
SQLite3::SQLException: no such column: start_at: SELECT "scheduled_courses".* FROM "scheduled_courses" WHERE (start_at >= '2012-03-13') LIMIT 25 OFFSET 0
You can't perform SQL queries on columns that aren't in the database. You should consider making this a real database column if you intend to do queries on it instead of a fake column; but if you want to select items from this collection, you can still do so. You just have to do it in Ruby.
ScheduledCourse.page(params[:page]).find_all {|s| s.start_at >= Date.today}
Veraticus is right; You cannot use virtual attributes in queries.
However, I think you could just do:
ScheduledCourse.joins(:scheduled_sessions).where('scheduled_sessions.happening_at >= ?', Date.today)
It will join the tables together by matching ids, and then you can look at the 'happening_at' column, which is what your 'start_at' attribute really is.
Disclaimer: Untested, but should work.
I wonder if this would be solved by a subquery ( the subquery being to find the earliest date first). If so, perhaps the solution here might help point in a useful direction...
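To make the "earliest date first" idea concrete, here is a plain-Ruby sketch of what such a subquery would compute: the minimum happening_at per course, followed by the date filter. It runs on in-memory rows and is not an ActiveRecord query; SessionRow, course_start_dates, upcoming_course_ids and past_course_ids are hypothetical names, and happening_at is assumed to be a Date.

```ruby
require 'date'

# Hypothetical stand-in for a row of the scheduled_sessions table.
SessionRow = Struct.new(:scheduled_course_id, :happening_at)

# scheduled_course_id => date of the earliest session (the start_at value).
def course_start_dates(sessions)
  sessions.group_by(&:scheduled_course_id)
          .transform_values { |rows| rows.map(&:happening_at).min }
end

# Courses whose first session is today or later.
def upcoming_course_ids(sessions, today: Date.today)
  course_start_dates(sessions).select { |_id, start_at| start_at >= today }.keys
end

# Courses whose first session is before today.
def past_course_ids(sessions, today: Date.today)
  course_start_dates(sessions).select { |_id, start_at| start_at < today }.keys
end
```

The filter is applied to the per-course minimum, so a course that started last week but still has future sessions counts as past. A plain join with a condition on happening_at would instead match any course with at least one future session.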

Resources