BigQuery - LEFT JOIN two queries, with the second using most recent dates

I have a job that runs two queries. The first query looks at future events and which user is in each event, from one table. The second query looks at historical events and, from a second table, builds stats for the user in each event based on all the previous events that user was in.
I am having trouble joining the two queries. The goal of the join is to attach to each future row the most recent stats for that user, where the stats row can't be newer than the date the future event occurs on. If the user hasn't been in any previous event, the join should just return null for the query-2 columns.
Below are an example table for the future query, a table for the historical query, and the ideal output of the join.
Future:

user     date       event code   event info
User 1   1/26/2023  5596         info_5596
User 2   1/26/2023  5586         info_5586
User 3   1/26/2023  5582         info_5582
User 1   1/20/2023  5492         info_5492
User 1   1/2/2023   5341         info_5341
User 2   1/2/2023   5333         info_5333
Historical:

user     date       stat 1   stat 2   event code   event info
User 1   1/25/2023  10       52       4352         info_4352
User 2   1/25/2023  11       22       4332         info_4332
User 2   1/12/2023  2        45       4298         info_4298
User 3   1/12/2023  8        88       4111         info_4111
User 1   1/12/2023  7        67       4050         info_4050
User 3   1/2/2023   3        91       4000         info_4000
User 1   1/1/2023   6        15       3558         info_3558
Output of the JOIN:

user     date future   stat 1   stat 2   event code future   event info future
User 1   1/26/2023     10       52       5596                info_5596
User 2   1/26/2023     11       22       5586                info_5586
User 3   1/26/2023     8        88       5582                info_5582
User 1   1/20/2023     7        67       5492                info_5492
User 1   1/2/2023      6        15       5341                info_5341
User 2   1/2/2023      null     null     5333                info_5333
I tried using a subquery in the join, but BigQuery said that it is unsupported; my attempt is below. I also tried using MAX() directly, but the join did not accept that either.
Another option I am considering is to join the two datasets before ever calculating the query-2 stats, then filter. I already have a large query written for each, though, so I would prefer not to start over.
SELECT DISTINCT
  A.*,
  B.stat1, B.stat2
FROM future AS A
LEFT JOIN historic AS B
ON (
  A.date = (SELECT MAX(recent_historic.date) FROM historic AS recent_historic WHERE recent_historic.user = A.user)
  AND
  A.user = B.user
)
ORDER BY A.date
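One workaround that avoids a subquery in the ON clause entirely is to LEFT JOIN every historical row that is not newer than the future event's date, then keep only the latest match per future row using ROW_NUMBER() and BigQuery's QUALIFY clause. A minimal sketch, assuming the column names from the example tables (user, date, event_code, stat1, stat2) and that user, date, and event code uniquely identify a future row:

SELECT
  F.*,
  H.stat1,
  H.stat2
FROM future AS F
LEFT JOIN historic AS H
  ON H.user = F.user
  AND H.date <= F.date  -- stats may not be newer than the future event
WHERE TRUE              -- QUALIFY must be paired with WHERE, GROUP BY, or HAVING
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY F.user, F.date, F.event_code
  ORDER BY H.date DESC  -- newest qualifying stats row wins
) = 1
ORDER BY F.date

Future rows with no qualifying historical row survive the LEFT JOIN as a single all-NULL match, so they still appear in the result with null stats, matching the desired output above.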

Related

How to fix group in complex query in rails or do a join on two separate query results and then query again

I have been stuck on this problem for a while at work. The data shown has been radically changed, since I just need the general idea of how to approach the problem and cannot share the actual schema of the tables.
I have a table Users and another table Membership. Each user has a one-to-many relationship with Membership through the user_membership table. Mock-ups of the tables are shown below:
users:

id   name    email
1    John    john@gmail.com
2    James   james@gmail.com
...  ...     ...

user_memberships:

id   user_id   membership_id
1    2         1
2    1         2
3    1         3
4    1         4
5    1         5
...  ...       ...

memberships:

id   created_at
1    31st Dec 2021
2    1st Jan 2022
3    2nd Jan 2022
4    3rd Jan 2022
5    4th Jan 2022
...  ...
I have some rather complex querying that returns an ActiveRecord::Relation, i.e.:
users = Users.select(....)
I then need to chain the above query with another query that gives each user their latest membership created_at date, i.e.:
<User, id: 1, name: John, email: john@gmail.com, latest_membership_created_at: 4th Jan 2022>
<User, id: 2, name: James, email: james@gmail.com, latest_membership_created_at: 31st Dec 2021>
My approach:
users = users.joins(user_memberships: :membership).merge(User.all).group(:id).select('membership.*, MAX(membership.created_at) AS membership_created_at_raw')
I get an error:
Query 1 ERROR: ERROR: column "users.id" must appear in the GROUP BY clause or be used in an aggregate function...
Qn 1: Is there any way I can fix this?
On a related note, is it also possible to join the results of two queries? I am thinking perhaps I can group the user_membership table by user_id and join that with the membership table. Something like:
users_created_at = User.all.joins(user_memberships: :membership).group(:id).select('users.id, MAX(memberships.created_at) AS membership_created_at_raw')
Qn 2: Can we then somehow do an inner join between users and users_created_at using Rails?
Thank you!
You can do it like this:
User.select(
'DISTINCT ON (users.id) users.id, memberships.*'
).joins(
user_memberships: :membership
).order('users.id, memberships.created_at DESC')
Or as a raw SQL query:
SELECT DISTINCT ON (users.id) users.id, memberships.*
FROM users
LEFT JOIN user_memberships ON user_memberships.user_id = users.id
LEFT JOIN memberships ON memberships.id = user_memberships.membership_id
ORDER BY users.id, memberships.created_at DESC
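As for the error in Qn 1: once the query is grouped by users.id, every selected column must either be aggregated or appear in the GROUP BY, and memberships.* is neither. If only the latest date per user is needed, a portable alternative is a plain aggregate; a sketch, with a hypothetical alias:

SELECT users.id, MAX(memberships.created_at) AS latest_membership_created_at
FROM users
LEFT JOIN user_memberships ON user_memberships.user_id = users.id
LEFT JOIN memberships ON memberships.id = user_memberships.membership_id
GROUP BY users.id

Note that DISTINCT ON in the answer above is PostgreSQL-specific: it keeps the first row per users.id according to the ORDER BY, which is why the ORDER BY must begin with users.id before memberships.created_at DESC.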

Rails order by range first

I have a model named Group which has many users.
Group contains min_age and max_age fields describing the minimum and maximum ages of its users.
Each user has settings where they set a preferred age group, like 18 to 25.
When a user searches for groups, I have to order groups with ages between 18 and 25 first, and then the rest.
I am doing it with two queries, like:
groups = Group.where("min_age >=? AND max_age <=?", setting.min_age, setting.max_age)
+ Group.where("min_age <? OR max_age >?", setting.min_age, setting.max_age)
It works, but I have too many other filters and I want to cut down the number of queries.
Is it possible to do this in single query?
You can do that by ordering matching records before records that do not match:
Group.order("CASE WHEN min_age >= #{setting.min_age} AND max_age <= #{setting.max_age} THEN 1 ELSE 2 END")
See: SQL CASE statement.
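In plain SQL, the ordering trick looks like this (a sketch, assuming the groups table and a preferred range of 18 to 25):

SELECT *
FROM groups
ORDER BY CASE
  WHEN min_age >= 18 AND max_age <= 25 THEN 1  -- preferred range sorts first
  ELSE 2
END

Since setting.min_age and setting.max_age are interpolated directly into the SQL string, make sure they are trusted integers (or sanitize them) before building the order clause this way.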

Complex Left Outer Self-Join in Rails 5

I have a list_events table where I want to get the latest event per user per list within a certain time window. Here's an example of the table:
id   user_id   list_id   event   created_at
1    5         1         sub     13:45
2    1         1         sub     14:01
3    1         2         sub     14:02
4    3         1         sub     14:03
5    4         1         sub     14:04
6    1         1         unsub   14:05
The last events per user for list 1 between 14:00 and 15:00 would be...
id   user_id   list_id   event   created_at
4    3         1         sub     14:03
5    4         1         sub     14:04
6    1         1         unsub   14:05
In my Rails 5 model I've written the query like so:
list.events
  .joins("
    left outer join list_events b
    on list_events.user_id = b.user_id
    and list_events.list_id = b.list_id
    and list_events.created_at < b.created_at
  ")
  .where("b.user_id is null")
  .where(created_at: start_time..end_time)
This works fine, but I'm wondering if there's a way to write it without hand-coding the join. I notice Rails has a left_outer_joins method, but there's no way to specify a custom ON clause with it. Perhaps with a belongs_to?
Also, is there a way to alias list_events as a while still being able to take advantage of the list.events relationship abstraction?
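For reference, the query above is the classic greatest-per-group anti-join; as standalone SQL it looks roughly like this (list id and time window hard-coded from the example):

SELECT a.*
FROM list_events a
LEFT OUTER JOIN list_events b
  ON b.user_id = a.user_id
  AND b.list_id = a.list_id
  AND b.created_at > a.created_at  -- b is any later event by the same user on the same list
WHERE b.id IS NULL                 -- no later event exists, so a is the latest
  AND a.list_id = 1
  AND a.created_at BETWEEN '14:00' AND '15:00'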

SPSS descriptives long data

I am trying to run descriptives (means/frequencies) on data that are in long format/repeated measures. For example, for one participant I have:
Participant   Age
ID 1          25
ID 1          25
ID 1          25
ID 1          25
ID 2          30   (second participant, etc.)
So SPSS reads that as an N of 5 and uses that to compute the mean. I want SPSS to ignore repeated cases (read ID 1's data as one person and ignore the other three rows). How do I do this?
Assuming the ages are always identical across all occurrences of the same ID, what you should do is aggregate (Data => Aggregate) your data into a separate dataset, in which you take only the first age for each ID. Then you can analyse age in the new dataset with no repetitions.
You can use this syntax:
DATASET DECLARE OneLinePerID.
AGGREGATE /OUTFILE='OneLinePerID' /BREAK=ID /age=FIRST(age).
DATASET ACTIVATE OneLinePerID.
MEANS age.

Informix Query Tuning

I have a table called lead, which has about 500 thousand records, and we need the following query to execute:
SELECT SKIP 300000 FIRST 75 *
FROM lead
WHERE (enrollment_period IS NULL
       OR enrollment_period IN ('FT2015','F16','SUM2016','FALL2016','FALL2017','SP17'))
ORDER BY created_on DESC
The table lead has the id column as its primary key and thus a clustered index on that column. The query was taking about 12-13 minutes. When I added a non-clustered index on the created_on and enrollment_period columns, it came down to 4-5 minutes. Then I changed the clustered index from the id column to this new index, and execution time came down further, to about 50 seconds.
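For reference, the index changes described above would look roughly like this in Informix (the index name is hypothetical):

-- composite index matching the ORDER BY and the filter column
CREATE INDEX ix_lead_created_enrollment
  ON lead (created_on DESC, enrollment_period);

-- physically reorder the table rows to follow this index
ALTER INDEX ix_lead_created_enrollment TO CLUSTER;

Beyond indexing, SKIP 300000 still forces the engine to walk past 300,000 ordered rows, so for deep pages keyset pagination (remembering the last created_on seen and filtering on it) usually helps more than any further index change.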
Is there any further optimization available for this query? Overall, is there any other change that would make it execute faster?
Thanks in Advance,
Manohar