I have the following code and output:
SELECT CustomerCategoryName, COUNT(a.CustomerID) AS CustomersInThisCategory
FROM Sales.Customers AS a
RIGHT JOIN Sales.CustomerCategories AS b on a.CustomerCategoryID = b.CustomerCategoryID
GROUP BY CustomerCategoryName
ORDER BY CustomersInThisCategory DESC
This generates the following output:
When I add the following COUNT aggregate funcation and Inner Join:
SELECT CustomerCategoryName, COUNT(a.CustomerID) AS CustomersInThisCategory, COUNT(c.OrderID) AS Orders
FROM Sales.Customers AS a
RIGHT JOIN Sales.CustomerCategories AS b on a.CustomerCategoryID = b.CustomerCategoryID
INNER JOIN Sales.Orders AS c ON a.CustomerID = c.CustomerID
GROUP BY CustomerCategoryName
ORDER BY CustomersInThisCategory DESC
The output changes to:
I am not sure as to why the CustomersInThisCategory column is changing to the same as the Orders column? I'm also not sure why the results in the first ouput with 0 values are being removed in the second query as I still have the Right join present.
Any feedback would be much appreciated.
For your first query, count ( distinct a.customerId) should give you the unique customer ids in a category.
Regarding your second question, right join is performed before inner join. So the inner join will splice the records, for which a match is not found.
Hope my answer helps.
Related
I am relatively new to bigquery and think I have an aliasing problem but can't work out what it is. Essentially, I have two tables and while the first table has the majority of the required information the second table has a date of birth that I need to join. I have written the below query and the two initial SELECT statements work in isolation and appear to return the expected values. However, when attempting to join the two tables I get an error stating:
Unrecognized name: t1_teams at [10:60]
WITH table_1 AS (SELECT competition_name, stat_season_name,
matchdata_Date, t1_teams.name, t1_players.Position, CAST(REGEXP_REPLACE(t1_players.uID, r'[a-zA-Z]', '') AS NUMERIC) AS Player_ID1, t1_players.First, t1_players.Last
FROM `prod.feed1`,
UNNEST(teams) AS t1_teams, UNNEST(t1_teams.Players) as t1_players),
table_2 AS (SELECT t2_players.uID AS Player_ID2, t2_players.stat_birth_date
FROM `prod.feed2`,
UNNEST(players) AS t2_players)
SELECT competition_name, stat_season_name, matchdata_Date, t1_teams.name, t1_players.Position, t1_players.uID, t1_players.First, t1_players.Last, t2_players.stat_birth_date
FROM table_1
LEFT JOIN table_2
ON Player_ID1 = Player_ID2
WHERE competition_name = "EPL"
AND stat_season_name = "Season 2018/2019"
Any help in steering me in the right direction would be greatly appreciated as my reading of the bigquery documentation and other searches have drawn a blank.
The problem is here:
WITH table_1 AS (
SELECT
competition_name,
stat_season_name,
matchdata_Date,
-- this line
t1_teams.name,
...
You're selecting t1_teams.name, so you end up with just name an an output column from the select list. If you want to refer to t1_teams later, then select that instead:
WITH table_1 AS (
SELECT
competition_name,
stat_season_name,
matchdata_Date,
-- this line
t1_teams,
...
I have a sql query that I'd like to optimize. I'm not the designer of the database, so I have no way of altering structure, indexes or stored procedures.
I have a table that consists of invoices (called faktura) and each invoice has a unique invoice id. If we have to cancel the invoice a secondary invoice is created in the same table but with a field ("modpartfakturaid") referring to the original invoice id.
Example of faktura table:
invoice 1: Id=152549, modpartfakturaid=null
invoice 2: Id=152592, modpartfakturaid=152549
We also have a table called "BHLFORLINIE" which consists of services rendered to the customer. Some of the services have already been invoiced and match a record in the invoice (FAKTURA) table.
What I'd like to do is get a list of all services that either does not have an invoice yet or does not have an invoice that's been cancelled.
What I'm doing now is this:
`SELECT
dbo.BHLFORLINIE.LeveringsDato AS treatmentDate,
dbo.PatientView.Navn AS patientName,
dbo.PatientView.CPRNR AS patientCPR
FROM
dbo.BHLFORLINIE
INNER JOIN dbo.BHLFORLOEB
ON dbo.BHLFORLOEB.BhlForloebID = dbo.BHLFORLINIE.BhlForloebID
INNER JOIN dbo.PatientView
ON dbo.PatientView.PersonID = dbo.BHLFORLOEB.PersonID
INNER JOIN dbo.HENVISNING
ON dbo.HENVISNING.BhlForloebID = dbo.BHLFORLOEB.BhlForloebID
LEFT JOIN dbo.FAKTURA
ON dbo.BHLFORLINIE.FakturaId = FAKTURA.FakturaId
WHERE
(dbo.BHLFORLINIE.LeveringsDato >= '2017-01-01' OR dbo.BHLFORLINIE.FakturaId IS NULL) AND
dbo.BHLFORLINIE.ProduktNr IN (110,111,112,113,8050,4001,4002,4003,4004,4005,4006,4007,4008,4009,6001,6002,6003,6004,6005,6006,6007,6008,7001,7002,7003,7004,7005,7006,7007,7008) AND
((dbo.FAKTURA.FakturaType = 0 AND
dbo.FAKTURA.FakturaID NOT IN (
SELECT FAKTURA.ModpartFakturaID FROM FAKTURA WHERE FAKTURA.ModpartFakturaID IS NOT NULL
)) OR
dbo.FAKTURA.FakturaType IS NULL)
GROUP BY
dbo.PatientView.CPRNR,
dbo.PatientView.Navn,
dbo.BHLFORLINIE.LeveringsDato`
Is there a smarter way of doing this? Right now the added the query performs three times slower because of the "not in" subquery.
Any help is much appreciated!
Peter
You can use an outer join and check for null values to find non matches
SELECT customer.name, invoice.id
FROM invoices i
INNER JOIN customer ON i.customerId = customer.customerId
LEFT OUTER JOIN invoices i2 ON i.invoiceId = i2.cancelInvoiceId
WHERE i2.invoiceId IS NULL
I have two queries that return the same data.
Query1, which is normal join takes a long time to execute:
SELECT TOP 1000 bigtable.*, tbl1.name, tb2.name FROM
bigtable INNER JOIN tbl1 on bigtable.id1 = tbl1.id1 AND
INNER JOIN tbl2 on tbl1.id1 = tbl2.id1
order by bigtable.id desc
Query2 that uses a sub-query returns fairly quickly:
SELECT subtable.*, tbl1.name, tb2.name FROM
(SELECT TOP 1000 FROM bigtable) subtable
INNER JOIN tbl1 on subtable.id1 = tbl1.id1 AND
INNER JOIN tbl2 on tbl1.id1 = tbl2.id1
order by subtable.id desc
bigtable contains 100k rows or so. tbl1 is a very small table (less than 10 rows). I would rather not use subqueries. If I skip the order by clause, both queries run quickly. I have tried adding indexes to the fields being joined, adding a DESC index on id etc. but nothing seems to help.
Any help is appreciated!
===> Update:
This turned out to be an non-issue. After creating another table similar to tbl1 with the same rows, I found that the Query1 ran under a second (with the copied table). Rebuilt stats on tbl1 and it fixed it.
I think that the two queries are not equivalent - try to write the second one as
SELECT subtable.*, tbl1.name, tb2.name FROM
(SELECT TOP 1000 FROM bigtable order by bigtable.id desc) subtable
INNER JOIN tbl1 on subtable.id1 = tbl1.id1 AND
INNER JOIN tbl2 on tbl1.id1 = tbl2.id1
order by subtable.id desc
I expect the expensive operation to be the ordering of the big table, which is now present in both versions.
I have a hive table A with 5 columns, the first column(A.key) is the key and I want to keep all 5 columns. I want to select 2 columns from B, say B.key1 and B.key2 and 2 columns from C, say C.key1 and C.key2. I want to join these columns with A.key = B.key1 and B.key2 = C.key1
What I want is a new external table D that has the following columns. B.key2 and C.key2 values should be given NULL if no matching happened.
A.key, A_col1, A_col2, A_col3, A_col4, B.key2, C.key2
What should be the correct hive query command? I got a max split error for my initial try.
Does this work?
create external table D as
select A.key, A.col1, A.col2, A.col3, A.col4, B.key2, C.key2
from A left outer join B on A.key = B.key1 left outer join C on A.key = C.key2;
If not, could you post more info about the "max split error" you mentioned? Copy+paste specific error message text is good.
I have a nightly job that runs and computes some data in hive. It is partitioned by day.
Fields:
id bigint
rank bigint
Yesterday
output/dt=2013-10-31
Today
output/dt=2013-11-01
I am trying to figure out if there is a easy way to get incremental changes between today and yesterday
I was thinking about doing a left outer join but not sure what that looks like since its the same table
This is what it might looks like when there are different tables
SELECT * FROM a LEFT OUTER JOIN b
ON (a.id=b.id AND a.dt='2013-11-01' and b.dt='2-13-10-31' ) WHERE a.rank!=B.rank
But on the same table it is
SELECT * FROM a LEFT OUTER JOIN a
ON (a.id=a.id AND a.dt='2013-11-01' and a.dt='2-13-10-31' ) WHERE a.rank!=a.rank
Suggestions?
This would work
SELECT a.*
FROM A a LEFT OUTER JOIN A b ON a.id = b.id
WHERE a.dt='2013-11-01' AND b.dt='2013-10-31' AND <your-rank-conditions>;
Efficiently, this would span 1 MapReduce job only.
So I figured it out... Using Subqueries and Joins
select * from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN
(select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null
The above will give you the diff..
You can take the diff and figure out what you need to ADD/UPDATE/REMOVE
UPDATE If a.rank!=null and b.rank!=null i.e rank changed
DELETE IF a.rank=null and b.rank!=null i.e the user is no longer ranked
ADD if a.rank!=null and b.rank=null i.e this is a new user