bigQuery - Join - join

I'm trying to join two tables on the ID. The first table, with price quotes, does not have the website data, so I want to join it in from the logs table. However, in the logs table the ID is not unique; the website from the chronologically first appearance of each ID is the right one.
When I run the query below, I get:
Resources exceeded during query execution.
Hence I don't know whether the problem is the code or something else.
Thanks
SELECT ID, user,busWeek, count(*) as num FROM [datastore.quotes]
Join (
select objectID, first(website) from (
select objectID, website, date from [datastore.allLogs]
order by date) group by objectID)
as Logs
on ID = objectID
group by ID,user,busWeek

Can you try:
SELECT ID, user,busWeek, count(*) as num FROM [datastore.quotes]
Join EACH (
select objectID, first(website) from (
select objectID, website, date from [datastore.allLogs]
order by date) group EACH by objectID)
as Logs
on ID = objectID
group by ID,user,busWeek
Note the 'EACH' - that keyword won't be needed in the future, but it's still useful today.

I think the issue is the ORDER BY. It forces the whole calculation onto one node, which causes the "Resources exceeded" message. I understand you need it to get the first (by date) website for each object.
Try rewriting this select (inside the join) so that it can be partitioned,
for example using window functions with OVER(PARTITION BY ... ORDER BY ...).
That way, I think, you have a chance of the work being done in parallel.
See below for reference
Window Functions
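To illustrate the window-function idea, here is a minimal sketch. It uses Python's sqlite3 module (SQLite 3.25 or later) rather than BigQuery, so the syntax BigQuery needs may differ, and the sample rows in allLogs are made up for the example. ROW_NUMBER() numbers the rows of each objectID by date, and keeping row 1 gives the first website per object without a global ORDER BY.

```python
import sqlite3

# Hypothetical sample rows standing in for [datastore.allLogs].
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE allLogs (objectID INTEGER, website TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO allLogs VALUES (?, ?, ?)",
    [
        (1, "b.example", "2014-02-01"),
        (1, "a.example", "2014-01-01"),
        (2, "c.example", "2014-03-01"),
        (2, "d.example", "2014-03-05"),
    ],
)

# Number the rows inside each objectID partition by date, then keep row 1.
first_sites = conn.execute(
    """
    SELECT objectID, website
    FROM (
        SELECT objectID, website,
               ROW_NUMBER() OVER (PARTITION BY objectID ORDER BY date) AS rn
        FROM allLogs
    )
    WHERE rn = 1
    ORDER BY objectID
    """
).fetchall()
print(first_sites)
```

The inner select is what would replace the ordered sub-select inside the join; whether it avoids the resource limit on the real table is something you would have to try.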

Related

Ruby - ActiveRecord - Select one record per 'group' based on a specific column value

I have this table:
User
Name     Role
Mason    Engineer
Jackson  Engineer
Mason    Supervisor
Jackson  Supervisor
Graham   Engineer
Graham   Engineer
There can be exact duplicates (same Name/Role combination). Ignore comments about primary key.
I am writing a query that will give the distinct values from 'Name' column, with the corresponding 'Role'. To select the corresponding 'Role', if there is a 'Supervisor' role for a name, that record is returned. Otherwise, a record with the 'Engineer' role should be returned if it exists.
For the above table, the expected result is:
Name     Role
Mason    Supervisor
Jackson  Supervisor
Graham   Engineer
I tried ordering 'Role' in descending order, so that I can group by Name, Role and pick the first item - it will be a 'Supervisor' role if present, else the 'Engineer' role - which matches my expectation.
I also tried User.select('DISTINCT ON (name) *').order(Role: :desc) - I am not seeing this clause in the SQL query that gets executed.
I also tried another approach: get all valid Name, Role combinations and then process them offline, iterating over the result set and using if-else to decide which row to display.
However, I am interested in anything that is efficient and does not overdo this handling.
I am new to Ruby and therefore reaching out.
If I wanted to do this in pure SQL, I would have to use GROUP BY.
SELECT Name, MAX(Role) FROM User GROUP BY Name
So one method would be to execute this SQL statement against the base connection.
ActiveRecord::Base.connection.execute("SELECT Name, MAX(Role) FROM User GROUP BY Name")
That would provide exactly the data you need, though it wouldn't be returned as ActiveRecord models. If you need those models then I would use find_by_sql and do an inner join to provide the records.
User.find_by_sql("SELECT User.* FROM User INNER JOIN (SELECT Name AS n, MAX(Role) AS r FROM User GROUP BY Name) U2 ON User.Name = U2.n AND User.Role = U2.r")
Unfortunately that would provide both records for Graham.
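Here is a small sketch of both queries run against the sample table, using Python's sqlite3 module instead of ActiveRecord purely so that it is self-contained. It relies on 'Supervisor' sorting after 'Engineer' alphabetically, which is the only reason MAX(Role) picks the right role here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "User" (Name TEXT, Role TEXT)')
conn.executemany(
    'INSERT INTO "User" VALUES (?, ?)',
    [
        ("Mason", "Engineer"), ("Jackson", "Engineer"),
        ("Mason", "Supervisor"), ("Jackson", "Supervisor"),
        ("Graham", "Engineer"), ("Graham", "Engineer"),
    ],
)

# One row per name; MAX(Role) works only because 'Supervisor' > 'Engineer'.
grouped = conn.execute(
    'SELECT Name, MAX(Role) FROM "User" GROUP BY Name ORDER BY Name'
).fetchall()

# Joining back to the table returns full rows, including exact duplicates.
joined = conn.execute(
    """
    SELECT u.Name, u.Role FROM "User" u
    INNER JOIN (SELECT Name AS n, MAX(Role) AS r FROM "User" GROUP BY Name) u2
      ON u.Name = u2.n AND u.Role = u2.r
    ORDER BY u.Name
    """
).fetchall()

print(grouped)
print(joined)
```

The grouped query gives one row per name, while the join returns Graham twice because the two identical rows both match.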

How to fix DF-JOIN-002 Error in Azure Data Factory (Only two Join conditions allowed)

I have a data flow with a Union on two tables then joining the results of the Union to another table. I keep receiving the following error when I try debugging the pipeline or previewing the data.
DF-JOIN-002 at Join 'Join1'(Line 40/Col 26): Only 2 join condition(s) allowed
I'm basically trying to build a pipeline to automate this query:
SELECT DISTINCT k.acct_id, s.Id, Email, FirstName, LastName FROM table_3 s
INNER JOIN
( (SELECT acct_id, event_date FROM table_1)
UNION (SELECT acct_id, event_date FROM table_2)) k
ON k.acct_id = s.Archtics_acct_id__c
WHERE event_date = 'xxxx-xx-xx'
I figured it out after some time. The Join activity was still linked to another source, even though only two sources were selected in the join settings. I had to delete the Join activity and add it again.

DB2 joins difficulties

I have the following situation (simplified):
2 BiTemp Tables
basicdata (id, btmp_tsd, name, prename)
extendeddata (id, btmp_tsd, basicid, codename, codevalue)
In extendeddata, there can be multiple entries for one basicdata row, each with a different codename and value.
I have to create an SQL to select all rows which have changed since a specified time. For the basicdata table this is relatively simple:
SELECT ID, BTMP_TSD, NAME, PRENAME
FROM BASICDATA BD
WHERE BTMP_TSD =
(SELECT MAX(BTMP_TSD)
FROM BASICDATA BD2
WHERE BD2.ID = BD.ID
AND BD2.BTMP_TSD > :MINTSD
AND BD2.BTMP_TSD <= :MAXTSD
)
ORDER BY ID
WITH UR
Now I need to join the second table to get the codevalue for the codename 'test'. The problem is that it may not exist; in that case, the row should be returned anyway. But if there is a row that is not within the time range, I should not get a result.
I hope I was able to explain my issue. Joins are one of the things I still don't see through...
Edit:
Okay here's a sample
basicdata:
id,btmp_tsd,name,prename
1,2013-05-25,test,user
2,2013-06-26,user,two
3,2013-06-26,peter,hans
1,2013-06-20,test,us3r
2,2013-10-30,us3r,two
extendeddata:
id,btmp_tsd,basicid,codename,codevalue
1,2013-05-25,1,superadmin,1
2,2013-06-26,3,admin,1
3,2013-11-25,1,superadmin,0
Given these entries, if I want all user IDs which have had any changes since 2013-10-01, I should get:
User1 (because the extendeddata superadmin entry had a change)
User2 (had a name change, and I want him even though he has no entry in the extendeddata table)
not User3 (he has entries in both tables, but they are not in the specified range)
The following query should do what you want.
select *
from basicdata b left outer join extendeddata e on b.id=e.basicid
where b.btmp_tsd >= '2013-10-01'
or e.btmp_tsd >= '2013-10-01'
DISCLAIMER: I didn't test the sql. So syntax might not be 100% perfect.
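As a quick check, here is a sketch that runs the same join on the sample data from the question. It uses Python's sqlite3 module rather than DB2, stores the timestamps as ISO date strings (an assumption that makes string comparison equal date comparison), and selects DISTINCT b.id instead of * only to make the result easy to read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE basicdata (id INTEGER, btmp_tsd TEXT, name TEXT, prename TEXT);
CREATE TABLE extendeddata (id INTEGER, btmp_tsd TEXT, basicid INTEGER,
                           codename TEXT, codevalue TEXT);
INSERT INTO basicdata VALUES
  (1,'2013-05-25','test','user'), (2,'2013-06-26','user','two'),
  (3,'2013-06-26','peter','hans'), (1,'2013-06-20','test','us3r'),
  (2,'2013-10-30','us3r','two');
INSERT INTO extendeddata VALUES
  (1,'2013-05-25',1,'superadmin','1'), (2,'2013-06-26',3,'admin','1'),
  (3,'2013-11-25',1,'superadmin','0');
""")

# The LEFT OUTER JOIN keeps basicdata rows that have no extendeddata entry;
# the OR keeps a row if either side changed on or after the cutoff.
changed_ids = [row[0] for row in conn.execute("""
    SELECT DISTINCT b.id
    FROM basicdata b LEFT OUTER JOIN extendeddata e ON b.id = e.basicid
    WHERE b.btmp_tsd >= '2013-10-01' OR e.btmp_tsd >= '2013-10-01'
    ORDER BY b.id
""")]
print(changed_ids)
```

On this data the query returns users 1 and 2 and leaves out user 3, which matches the expected result in the question. It does not yet restrict extendeddata to the codename 'test' or apply an upper time bound.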

Selecting distinct through join

We have 2 tables: users and statuses
The status table has a user_id, status and occured_on. The status is either 'removed' or 'added' and occured_on is the date the user was removed or added.
I need the current added users. That is, all the (distinct) users whose newest status record is 'added'.
I'm using Rails, and have tried:
User
.joins(:statuses)
.where('statuses.status = ?', 'added')
.order('statuses.occured_on DESC')
.uniq
Which translates to the SQL:
SELECT DISTINCT users.*
FROM users
INNER JOIN statuses
ON statuses.user_id = users.id
WHERE statuses.status = 'added'
ORDER BY statuses.occured_on DESC
That gives me the error:
PG::Error: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 1: ...statuses.status = 'added') ORDER BY statuses.oc...
I'd be happy knowing either the Rails code that would work or the straight SQL.
Also, I'd prefer no sub-selects if possible.
Consider the following database schema change:
StatusTable:
StatusId
Status
UserId
ActiveFrom
ActiveTo
Afterwards you can add additional checks such as:
CONSTRAINT chk_from_to CHECK (ActiveFrom <= ActiveTo)
Then your query would look something like:
SELECT users.*
FROM users
JOIN statuses ON UserId = users.user_id AND ActiveFrom < CURRENT_TIMESTAMP AND ActiveTo > CURRENT_TIMESTAMP
WHERE statuses.Status = 'active'
With such structure you might need to change the way you change statuses, but from my own experience, this structure is much more flexible, and easier to query.
SELECT * FROM users INNER JOIN statuses ON users.id=statuses.user_id WHERE statuses.status='added' ORDER BY statuses.occured_on
After clarification, I don't think the schema is well designed for your goal. Can you clarify why you want the status change history contained in that table? My general approach to this would be that active users should be contained in a table called projects_users, containing project_id, user_id. When they are "removed" they should be removed from that table. Logs of the actions - adding and removing users from projects - should be stored in a separate table.
There's no good way that I'm aware of to write this query given your current design. Even if you fix the errors (the following, which is essentially what you have, runs error-free in MySQL):
SELECT DISTINCT `users`.* FROM `users`
INNER JOIN `projects_users`
ON `users`.`id`=`projects_users`.`user_id`
WHERE `status`='added'
ORDER BY `projects_users`.`occured_on` DESC
it still won't get you the correct results. The ORDER BY clause will just get you the most recent change to "added", it won't guarantee there is not a more recent "removed" action. To do that you'd need to compare the date of each most recent added record to the date of the most recent removed record, for each user, a nightmare.
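For what it's worth, the "newest record per user" comparison can be sketched without a sub-select by using a self anti-join, which is a different technique from the queries above. The sketch below uses Python's sqlite3 module with made-up sample rows, and it assumes a user never has two status records with the same occured_on value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE statuses (user_id INTEGER, status TEXT, occured_on TEXT);
INSERT INTO users VALUES (1,'ann'), (2,'bob'), (3,'cy');
INSERT INTO statuses VALUES
  (1,'added','2013-01-01'),
  (2,'added','2013-01-01'), (2,'removed','2013-02-01'),
  (3,'added','2013-01-01'), (3,'removed','2013-02-01'), (3,'added','2013-03-01');
""")

# For each status row s, look for a newer row of the same user. If none
# exists (newer.user_id IS NULL), s is that user's latest status.
current = conn.execute("""
    SELECT users.id, users.name
    FROM users
    INNER JOIN statuses s ON s.user_id = users.id
    LEFT JOIN statuses newer
           ON newer.user_id = s.user_id AND newer.occured_on > s.occured_on
    WHERE s.status = 'added' AND newer.user_id IS NULL
    ORDER BY users.id
""").fetchall()
print(current)
```

Here bob is excluded because his latest record is 'removed', while cy is included because he was re-added afterwards.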

using SQL aggregate functions with JOINs

I have two tables - tool_downloads and tool_configurations. I am trying to retrieve the most recent build date for each tool in my database. The layout of the DB is simple. One table called tool_downloads keeps track of when a tool is downloaded. Another table is called tool_configurations and stores the actual data about the tool. They are linked together by the tool_conf_id.
If I run the following query which omits dates, I get back 200 records.
SELECT DISTINCT a.tool_conf_id, b.tool_conf_id
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
ORDER BY a.tool_conf_id
When I try to add in date information I get back hundreds of thousands of records! Here is the query that fails horribly.
SELECT DISTINCT a.tool_conf_id, max(a.configured_date) as config_date, b.configuration_name
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
ORDER BY a.tool_conf_id
I know the problem has something to do with group-bys/aggregate data and joins. I can't really search google since I don't know the name of the problem I'm encountering. Any help would be appreciated.
Solution is:
SELECT b.tool_conf_id, b.configuration_name, max(a.configured_date) as config_date
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
GROUP BY b.tool_conf_id, b.configuration_name
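A minimal sketch of that query on made-up sample rows, run through Python's sqlite3 module so it is self-contained (the table contents and the added ORDER BY are assumptions for the example). Grouping by the non-aggregated columns is what collapses the downloads to one row per tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tool_configurations (tool_conf_id INTEGER, configuration_name TEXT);
CREATE TABLE tool_downloads (tool_conf_id INTEGER, configured_date TEXT);
INSERT INTO tool_configurations VALUES (1,'alpha'), (2,'beta');
INSERT INTO tool_downloads VALUES
  (1,'2012-01-01'), (1,'2012-03-01'), (2,'2012-02-01');
""")

# Every selected column is either in GROUP BY or inside an aggregate,
# so MAX() is computed once per tool rather than per download.
latest = conn.execute("""
    SELECT b.tool_conf_id, b.configuration_name,
           MAX(a.configured_date) AS config_date
    FROM tool_downloads a
    JOIN tool_configurations b ON a.tool_conf_id = b.tool_conf_id
    GROUP BY b.tool_conf_id, b.configuration_name
    ORDER BY b.tool_conf_id
""").fetchall()
print(latest)
```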

Resources