BigQuery year-over-year join

I built a query that uses a join to layer on year-over-year (YoY) data. As soon as I add more dimensions, the YoY total no longer matches. Am I applying the USING clause properly?
When I add browser or device, the total is different.
WITH CURRENT_PERIOD as (SELECT
date,
py_date,
browser,
device,
sum(ga_sessions) as visit,
FROM (SELECT
device.browser as browser,
device.deviceCategory as device,
PARSE_DATE("%Y%m%d", date) as date,
DATE_SUB(PARSE_DATE("%Y%m%d", date), INTERVAL 364 DAY) as py_date,
totals.visits as ga_sessions,
FROM `xx.ga_sessions_20210824`)
GROUP BY date, py_date, browser, device
),
PREV_YEAR as (SELECT
py_date,
browser,
device,
sum(py_ga_sessions) as py_visit,
FROM ( SELECT
PARSE_DATE("%Y%m%d", date) as py_date,
totals.visits as py_ga_sessions,
device.browser as browser,
device.deviceCategory as device
FROM `xx.ga_sessions_20200824`)
GROUP BY py_date, browser, device
),
DATAPULL as (SELECT *
FROM CURRENT_PERIOD LEFT OUTER JOIN PREV_YEAR USING(py_date, browser, device) ORDER BY date DESC
)
SELECT * FROM DATAPULL
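One possible cause, sketched here with Python's built-in sqlite3 rather than BigQuery: a LEFT JOIN keeps every current-period row but drops prior-year rows whose (py_date, browser, device) combination has no match in the current period, so the joined prior-year total can shrink as dimensions are added. The table contents below are made up for illustration, and only the browser dimension is shown.

```python
import sqlite3

# Hypothetical miniature of CURRENT_PERIOD / PREV_YEAR (numbers are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE current_period (py_date TEXT, browser TEXT, visit INTEGER);
CREATE TABLE prev_year (py_date TEXT, browser TEXT, py_visit INTEGER);
INSERT INTO current_period VALUES
  ('2020-08-24', 'Chrome', 50), ('2020-08-24', 'Safari', 30);
-- 'Opera' had visits last year but has none in the current period.
INSERT INTO prev_year VALUES
  ('2020-08-24', 'Chrome', 40), ('2020-08-24', 'Safari', 20),
  ('2020-08-24', 'Opera', 5);
""")

# Prior-year total straight from the source table.
true_total = conn.execute("SELECT SUM(py_visit) FROM prev_year").fetchone()[0]

# Prior-year total after joining on date AND browser: the Opera row is lost.
joined_total = conn.execute("""
    SELECT SUM(py_visit)
    FROM current_period LEFT JOIN prev_year USING (py_date, browser)
""").fetchone()[0]

# Joining on date only (after aggregating both sides) keeps the full total.
date_only_total = conn.execute("""
    SELECT SUM(p.py_visit)
    FROM (SELECT py_date, SUM(visit) AS visit
          FROM current_period GROUP BY py_date) c
    LEFT JOIN (SELECT py_date, SUM(py_visit) AS py_visit
               FROM prev_year GROUP BY py_date) p USING (py_date)
""").fetchone()[0]

print(true_total, joined_total, date_only_total)
```

If this is what is happening in the real data, a FULL OUTER JOIN (with COALESCE on the date columns) would be one way to keep the unmatched prior-year rows.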

Related

Snowflake: Joining a Table with Effective Dates and older records are showing NULL

Summary:
In Snowflake I have a table that records the maximum quantity of an item, which changes every so often. I want to join the maximum quantity that was in effect on a given date (EFFECTIVE_DATE). This is a simplified example; in my real table, items also "expire" when they are removed.
CREATE OR REPLACE TABLE ITEM
(
Item VARCHAR(10),
Quantity Number(5,0),
EFFECTIVE_DATE DATE
)
;
CREATE OR REPLACE TABLE REPORT
(
INVOICE_DATE DATE,
ITEM VARCHAR(10)
)
;
INSERT INTO REPORT
VALUES
('2021-02-01', '100'),
('2021-09-10', '100')
;
INSERT INTO ITEM
VALUES
('100', '10', '2021-01-01'),
('101', '15', '2021-01-01'),
('100', '5', '2021-09-01')
;
SELECT * FROM REPORT t1
LEFT JOIN
(
SELECT * FROM ITEM
QUALIFY ROW_NUMBER() OVER (PARTITION BY ITEM ORDER BY EFFECTIVE_DATE desc) = 1
) t2 on t1.ITEM = t2.ITEM AND t1.INVOICE_DATE <= t2.EFFECTIVE_DATE
;
Returns
INVOICE_DATE,ITEM,ITEM,QUANTITY,EFFECTIVE_DATE
2021-02-01,100,100,5,2021-09-01
2021-09-10,100,NULL,NULL,NULL
How do I fix this so I no longer get NULL entries from my join?
Thank you for reading this!
I am hoping to get a result like this:
INVOICE_DATE,ITEM,ITEM,QUANTITY,EFFECTIVE_DATE
2021-02-01,100,100,10,2021-01-01
2021-09-10,100,100,5,2021-09-01
The issue is with your data and your expectations. Your query is this:
SELECT * FROM REPORT t1
LEFT JOIN
(
SELECT * FROM ITEM
QUALIFY ROW_NUMBER() OVER (PARTITION BY ITEM ORDER BY EFFECTIVE_DATE desc) = 1
) t2 on t1.ITEM = t2.ITEM AND t1.INVOICE_DATE <= t2.EFFECTIVE_DATE
;
which requires that the INVOICE_DATE be less than or equal to the EFFECTIVE_DATE of the ITEM. This isn't the case, though. 2021-09-10 is greater than 2021-09-01, so you don't get a join hit, which is why you get NULLs. It's also why your other record doesn't match your expectations: the QUALIFY keeps only the latest row per item (effective 2021-09-01), so the 2021-02-01 invoice can only match that row, never the 2021-01-01 one.
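One way to get the expected result is to reverse the comparison and pick, for each invoice, the latest item row whose EFFECTIVE_DATE is on or before the INVOICE_DATE. The sketch below uses Python's built-in sqlite3 instead of Snowflake, stores dates as ISO strings, and uses a correlated subquery instead of QUALIFY; treat it as an illustration of the idea rather than a drop-in Snowflake query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item TEXT, quantity INTEGER, effective_date TEXT);
CREATE TABLE report (invoice_date TEXT, item TEXT);
INSERT INTO report VALUES ('2021-02-01', '100'), ('2021-09-10', '100');
INSERT INTO item VALUES
  ('100', 10, '2021-01-01'),
  ('101', 15, '2021-01-01'),
  ('100', 5,  '2021-09-01');
""")

# For each invoice, join the item row with the greatest effective_date
# that is not later than the invoice date.
rows = conn.execute("""
    SELECT r.invoice_date, r.item, i.quantity, i.effective_date
    FROM report r
    LEFT JOIN item i
      ON i.item = r.item
     AND i.effective_date = (
           SELECT MAX(i2.effective_date)
           FROM item i2
           WHERE i2.item = r.item
             AND i2.effective_date <= r.invoice_date)
    ORDER BY r.invoice_date
""").fetchall()

for row in rows:
    print(row)
```

With the sample data from the question, each invoice picks up the quantity that was in effect on its date.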

How to get data from measurement by group by column_name and max time?

Here is the query I am trying to execute:
select cpu_number from vm_details where ro_id='8564a08b-9208-45bf-9758-7d64fe1f91a3' group by entity_uuid;
SELECT mt.*
FROM MyTable mt INNER JOIN
(
SELECT ID, MIN(Record_Date) MinDate
FROM MyTable
GROUP BY ID
) t ON mt.ID = t.ID AND mt.Record_Date = t.MinDate
This gets the minimum date per ID, and then joins back to the table to fetch the full rows for those dates. The only time you would get duplicates is if the same ID has more than one row with that minimum Record_Date.
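The duplicate case mentioned above can be seen in a small sketch using Python's built-in sqlite3. The table contents are invented for illustration; ID 2 deliberately has two rows sharing its minimum Record_Date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MyTable (ID INTEGER, Record_Date TEXT, Value TEXT);
INSERT INTO MyTable VALUES
  (1, '2021-01-01', 'a'),
  (1, '2021-02-01', 'b'),
  (2, '2021-03-01', 'c'),
  (2, '2021-03-01', 'd'),  -- ties with the row above on the minimum date
  (2, '2021-04-01', 'e');
""")

# Same shape as the query above: minimum date per ID, joined back to the rows.
rows = conn.execute("""
    SELECT mt.*
    FROM MyTable mt
    INNER JOIN (
        SELECT ID, MIN(Record_Date) AS MinDate
        FROM MyTable
        GROUP BY ID
    ) t ON mt.ID = t.ID AND mt.Record_Date = t.MinDate
    ORDER BY mt.ID, mt.Value
""").fetchall()

for row in rows:
    print(row)
```

ID 1 yields one row, while ID 2 yields two because of the tie. Swapping MIN for MAX gives the latest row per ID instead.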

Query optimization for index page

I have a System and a Report model. System has_many reports and Report belongs_to system. Each daily report consists of 175 records per system.
I need a query on my system#index page that lists all systems ordered by their most recent report creation. This was my first attempt.
@systems = System.joins('LEFT JOIN reports ON reports.system_id = systems.id').group('systems.id').order('MAX(reports.created_at) ASC')
This lists systems with a report (System Load (2.1ms)) but sorted by system_id not by report created_at.
Second attempt
@systems = System.joins(:reports).where("reports.created_at = (SELECT MAX(created_at) FROM reports p group by system_id having p.system_id = reports.system_id)").order('reports.created_at DESC')
This query does the job, but is really slow (System Load (546.2ms)), despite having an index on reports.created_at.
Third attempt
@systems = System.joins(:reports).where("reports.id = (SELECT MAX(id) FROM reports p group by system_id having p.system_id = reports.system_id)").order('reports.id DESC')
Also does the job, slightly faster than the second attempt (System Load (468.3ms)) but still not fast enough.
Any tips?
EDIT 03032017
I did the numbers on a small test dataset
old query
SELECT s.* FROM systems s
JOIN reports r ON r.system_id = s.id
WHERE r.created_at = (
SELECT MAX(created_at)
FROM reports p
group by p.system_id
having p.system_id = r.system_id)
ORDER BY r.id DESC
Time: 622.683 ms
Philip Couling solution (clean, returns only systems with reports)
SELECT systems.*
FROM systems
JOIN (
SELECT reports.system_id
, MAX(reports.created_at) created
FROM reports
GROUP BY reports.system_id
) AS r_date ON systems.id = r_date.system_id
ORDER BY r_date.created;
Time: 1.434 ms
BookofGreg solution (will give me all systems, report or no report)
select systems.* from systems order by updated_at;
Time: 0.253 ms
I couldn't get systemjack's solution to work.
Fastest solution: bookofgreg
Cleanest solution: philip couling
Thanks for your input.
An index on (reports.system_id, reports.created_at) might make this work efficiently:
@systems = System.joins(:reports).where("reports.created_at = (SELECT MAX(created_at) FROM reports p WHERE p.system_id = reports.system_id)").order('reports.created_at DESC')
Alternatively...
Your second piece of code:
System.joins(:reports).where("reports.created_at = (SELECT MAX(created_at) FROM reports p group by system_id having p.system_id = reports.system_id)").order('reports.created_at DESC')
expands to roughly:
SELECT systems.*
FROM systems
JOIN reports ON systems.id = reports.system_id
WHERE reports.created_at = (
SELECT MAX(created_at)
FROM reports p
group by p.system_id
having p.system_id = reports.system_id
)
ORDER BY reports.created_at DESC
Notice how it has to look at reports twice. Also, because the subquery is correlated (p.system_id = reports.system_id), the nested query may be run once for each row of the outer query.
Ideally you want to get a list of system_ids and their latest report dates first:
SELECT reports.system_id
, MAX(reports.created_at) created
FROM reports
GROUP BY reports.system_id
And then join to that:
SELECT systems.*
FROM systems
JOIN (
SELECT reports.system_id
, MAX(reports.created_at) created
FROM reports
GROUP BY reports.system_id
) AS r_date ON systems.id = r_date.system_id
ORDER BY r_date.created
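As a rough check of the derived-table join above, here is a small sketch using Python's built-in sqlite3 (not the asker's actual database). The sample rows and the index name idx_reports_system_created are made up; the composite index mirrors the (system_id, created_at) suggestion, though whether it helps depends on the real database's planner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE systems (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE reports (id INTEGER PRIMARY KEY, system_id INTEGER, created_at TEXT);
-- Hypothetical index name; columns follow the (system_id, created_at) suggestion.
CREATE INDEX idx_reports_system_created ON reports (system_id, created_at);
INSERT INTO systems VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
INSERT INTO reports (system_id, created_at) VALUES
  (1, '2017-03-01'), (1, '2017-03-03'), (2, '2017-03-02');
""")

# Aggregate reports once per system, then join systems to that small result.
names = [row[0] for row in conn.execute("""
    SELECT systems.name
    FROM systems
    JOIN (
        SELECT reports.system_id, MAX(reports.created_at) AS created
        FROM reports
        GROUP BY reports.system_id
    ) AS r_date ON systems.id = r_date.system_id
    ORDER BY r_date.created
""")]

print(names)
```

Because this is an inner join, 'gamma' (which has no reports) does not appear in the result.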
One possible solution, if you do not need the report data on the page, is to add after_save -> { self.system.touch } to Report, so the system is touched whenever a report is saved. This will cause the System's updated_at to take on the time the Report was updated.
This means you can just sort the Systems by their updated_at without joining at all.
This solution assumes that there is no other way to update System. If there is, you can specify a separate timestamp column to order on, like after_save -> { self.system.touch(:report_cached_updated_at) }
http://api.rubyonrails.org/classes/ActiveRecord/Persistence.html#method-i-touch
A window function might perform well for you. Not sure how to implement this in Rails, but the query to get the latest report for each system could look like:
select * from (
select s.*, r.system_id, r.created_at as report_created_at,
row_number() OVER (PARTITION BY s.id ORDER BY r.created_at desc) AS rn
from systems s
left join reports r on r.system_id = s.id
) t where (t.rn = 1 OR t.system_id is null)
The check for null is there because you have a left join in your example so you must want systems even if there is no report.
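or simpler (but I'm less sure the syntax is right; many databases, PostgreSQL included, only allow window functions in the SELECT list and ORDER BY, so this form may be rejected):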
select *
from systems s
left join reports r on r.system_id = s.id
having (r.system_id is null
OR row_number() OVER (PARTITION BY s.id ORDER BY r.created_at desc) = 1)
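The subquery form of the window-function approach can be tried out with Python's built-in sqlite3 (assuming the bundled SQLite is 3.25 or newer, which is when window functions were added). The sample rows and the alias names rn and report_created_at are only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE systems (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE reports (id INTEGER PRIMARY KEY, system_id INTEGER, created_at TEXT);
INSERT INTO systems VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
INSERT INTO reports (system_id, created_at) VALUES
  (1, '2017-03-01'), (1, '2017-03-03'), (2, '2017-03-02');
""")

# Number each system's reports from newest to oldest, then keep row 1.
rows = conn.execute("""
    SELECT name, report_created_at
    FROM (
        SELECT s.name,
               r.created_at AS report_created_at,
               ROW_NUMBER() OVER (
                   PARTITION BY s.id ORDER BY r.created_at DESC
               ) AS rn
        FROM systems s
        LEFT JOIN reports r ON r.system_id = s.id
    ) AS t
    WHERE rn = 1
    ORDER BY name
""").fetchall()

for row in rows:
    print(row)
```

In this sketch the LEFT JOIN produces exactly one row (with NULL report columns) for a system without reports, so that row gets rn = 1 and 'gamma' is kept even without an explicit NULL check.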

BigQuery - Join

I'm trying to join two databases on the ID. The first database, on price quotes, does not have the website data, so I want to join it in from the logs database. However, the ID is not unique in the logs database; the website from the first chronological appearance of each ID is the right one.
When I run the query below, I get:
Resources exceeded during query execution.
Hence I don't know whether the problem is the code or something else.
Thanks
SELECT ID, user,busWeek, count(*) as num FROM [datastore.quotes]
Join (
select objectID, first(website) from (
select objectID, website, date from [datastore.allLogs]
order by date) group by objectID)
as Logs
on ID = objectID
group by ID,user,busWeek
Can you try:
SELECT ID, user,busWeek, count(*) as num FROM [datastore.quotes]
Join EACH (
select objectID, first(website) from (
select objectID, website, date from [datastore.allLogs]
order by date) group EACH by objectID)
as Logs
on ID = objectID
group by ID,user,busWeek
Note the 'EACH' - that keyword won't be needed in the future, but it's still useful today.
I think the issue is the ORDER BY. It brings the whole calculation onto one node, which causes the "Resources exceeded" message. I understand you need it to get the first (by date) website for each object.
Try rewriting the select inside the join so that it is partitioned, for example using window functions with OVER(PARTITION BY ... ORDER BY).
That way, I think, there is a chance the work can be done in parallel.
See below for reference:
Window Functions
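To illustrate the partitioned rewrite, here is a sketch using Python's built-in sqlite3 (SQLite 3.25+ for window functions), not BigQuery itself. FIRST_VALUE with PARTITION BY picks the earliest website per object without a global sort. The column names log_date, user_name and bus_week, and all sample rows, are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE quotes (id INTEGER, user_name TEXT, bus_week TEXT);
CREATE TABLE all_logs (object_id INTEGER, website TEXT, log_date TEXT);
INSERT INTO quotes VALUES (1, 'ann', 'W1'), (2, 'bob', 'W1');
INSERT INTO all_logs VALUES
  (1, 'b.com', '2015-01-02'),
  (1, 'a.com', '2015-01-01'),  -- earliest log entry for object 1
  (2, 'c.com', '2015-01-03');
""")

# Earliest website per object_id, computed per partition, then joined to quotes.
rows = conn.execute("""
    SELECT q.id, q.user_name, l.website
    FROM quotes q
    JOIN (
        SELECT DISTINCT
               object_id,
               FIRST_VALUE(website) OVER (
                   PARTITION BY object_id ORDER BY log_date
               ) AS website
        FROM all_logs
    ) AS l ON q.id = l.object_id
    ORDER BY q.id
""").fetchall()

for row in rows:
    print(row)
```

The ordering happens inside each object_id partition, which is the property the answer above relies on for parallel execution.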

Limit query results

I have an app that includes music charts to showcase the top tracks (it shows the top 10).
However, I'm trying to limit the charts so that any particular user cannot have more than one track on the top charts at the same time.
If you need any more info, just let me know.
You can use the row_number() function which gives a running number that resets when the user id changes. Then you can use that in a WHERE clause to create a per-user-limit:
SELECT * FROM (
SELECT COALESCE(sum(plays.clicks), 0),
row_number() OVER (PARTITION BY users.id ORDER BY COALESCE(sum(plays.clicks), 0) DESC),
users.id AS user_id,
tracks.*
FROM tracks
JOIN plays
ON tracks.id = plays.track_id
AND plays.created_at > now() - interval '14 days'
INNER JOIN albums
ON tracks.album_id = albums.id
INNER JOIN users
ON albums.user_id = users.id
GROUP BY users.id, tracks.id
ORDER BY 1 desc) sq1
WHERE row_number <= 1 LIMIT 10;
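The per-user limit can be checked on a tiny dataset with Python's built-in sqlite3 (SQLite 3.25+ for window functions). This is a simplified sketch: the albums table and the 14-day filter are left out, tracks carry a user_id directly, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tracks (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
CREATE TABLE plays (track_id INTEGER, clicks INTEGER);
INSERT INTO tracks VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 2, 'C');
INSERT INTO plays VALUES (1, 10), (1, 5), (2, 12), (3, 7);
""")

# Step 1: total clicks per track. Step 2: rank each user's tracks by total.
# Step 3: keep only each user's best track, then take the overall top 10.
chart = conn.execute("""
    WITH track_totals AS (
        SELECT t.id, t.title, t.user_id, SUM(p.clicks) AS total
        FROM tracks t
        JOIN plays p ON p.track_id = t.id
        GROUP BY t.id, t.title, t.user_id
    ),
    ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY total DESC
               ) AS rn
        FROM track_totals
    )
    SELECT title, user_id, total
    FROM ranked
    WHERE rn = 1
    ORDER BY total DESC
    LIMIT 10
""").fetchall()

for entry in chart:
    print(entry)
```

User 1 has two tracks with plays, but only the higher-scoring one ('A') reaches the chart. Raising the rn threshold allows more tracks per user.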

Resources