How to fix DF-JOIN-002 Error in Azure Data Factory (Only two Join conditions allowed)

I have a data flow that unions two tables and then joins the result of the Union to a third table. I keep receiving the following error when I debug the pipeline or preview the data:
DF-JOIN-002 at Join 'Join1'(Line 40/Col 26): Only 2 join condition(s) allowed
I'm basically trying to build a pipeline to automate this query:
SELECT DISTINCT k.acct_id, s.Id, Email, FirstName, LastName FROM table_3 s
INNER JOIN
( (SELECT acct_id, event_date FROM table_1)
UNION (SELECT acct_id, event_date FROM table_2)) k
ON k.acct_id = s.Archtics_acct_id__c
WHERE event_date = 'xxxx-xx-xx'

I figured it out after some time. I had to delete the Join activity and add it again: it was still linked to another source, even though only two sources were selected in the join settings.

Related

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.

Not sure why you would get that error. One workaround is to use NOT EXISTS:
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
FROM vegetables v
WHERE v.name = f.name)
or a LEFT JOIN:
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL
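
The workarounds above can be checked against the question's sample data. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption: it compares result sets only and does not reproduce Hive's query plan or its Cartesian-product error). It also adds a NULL row to vegetables, which is not in the original data, to show one way NOT IN differs from the other forms: a NULL in the subquery makes NOT IN return no rows at all, so an engine cannot always plan it the same way as IN.

```python
import sqlite3

# SQLite is an assumed stand-in for Hive; only the result sets are compared.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foods (name TEXT);
    CREATE TABLE vegetables (name TEXT);
    INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
    INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
""")

QUERIES = {
    "NOT IN": """
        SELECT f.name FROM foods f
        WHERE f.name NOT IN (SELECT v.name FROM vegetables v)
        ORDER BY f.name
    """,
    "NOT EXISTS": """
        SELECT f.name FROM foods f
        WHERE NOT EXISTS (SELECT 1 FROM vegetables v WHERE v.name = f.name)
        ORDER BY f.name
    """,
    "LEFT JOIN": """
        SELECT f.name FROM foods f
        LEFT JOIN vegetables v ON v.name = f.name
        WHERE v.name IS NULL
        ORDER BY f.name
    """,
}

def run_all():
    """Run every query variant and return {label: [names]}."""
    return {label: [row[0] for row in conn.execute(sql)]
            for label, sql in QUERIES.items()}

# With the original sample data all three forms agree.
without_null = run_all()

# Hypothetical extra row: a NULL in the subquery empties the NOT IN result.
conn.execute("INSERT INTO vegetables VALUES (NULL)")
with_null = run_all()

print(without_null)
print(with_null)
```

If the column in the subquery can contain NULLs, NOT EXISTS or the LEFT JOIN form is usually the one that matches the intent "rows with no match in the other table".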

You get a Cartesian join because that is what Hive does in this case. The vegetables table is very small (just one row), so it is broadcast to perform the cross join (most probably a map join; check the plan). Hive performs the cross (map) join first and then applies the filter. The explicit LEFT JOIN syntax with a filter, as @VamsiPrabhala suggested, forces a left join, but in this case it works the same, because the table is very small and the CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see exactly what is happening.

Optimizing SQL query using JOIN instead of NOT IN

I have a sql query that I'd like to optimize. I'm not the designer of the database, so I have no way of altering structure, indexes or stored procedures.
I have a table that consists of invoices (called faktura) and each invoice has a unique invoice id. If we have to cancel the invoice a secondary invoice is created in the same table but with a field ("modpartfakturaid") referring to the original invoice id.
Example of faktura table:
invoice 1: Id=152549, modpartfakturaid=null
invoice 2: Id=152592, modpartfakturaid=152549
We also have a table called "BHLFORLINIE" which consists of services rendered to the customer. Some of the services have already been invoiced and match a record in the invoice (FAKTURA) table.
What I'd like to do is get a list of all services that either do not have an invoice yet or have an invoice that has not been cancelled.
What I'm doing now is this:
SELECT
dbo.BHLFORLINIE.LeveringsDato AS treatmentDate,
dbo.PatientView.Navn AS patientName,
dbo.PatientView.CPRNR AS patientCPR
FROM
dbo.BHLFORLINIE
INNER JOIN dbo.BHLFORLOEB
ON dbo.BHLFORLOEB.BhlForloebID = dbo.BHLFORLINIE.BhlForloebID
INNER JOIN dbo.PatientView
ON dbo.PatientView.PersonID = dbo.BHLFORLOEB.PersonID
INNER JOIN dbo.HENVISNING
ON dbo.HENVISNING.BhlForloebID = dbo.BHLFORLOEB.BhlForloebID
LEFT JOIN dbo.FAKTURA
ON dbo.BHLFORLINIE.FakturaId = FAKTURA.FakturaId
WHERE
(dbo.BHLFORLINIE.LeveringsDato >= '2017-01-01' OR dbo.BHLFORLINIE.FakturaId IS NULL) AND
dbo.BHLFORLINIE.ProduktNr IN (110,111,112,113,8050,4001,4002,4003,4004,4005,4006,4007,4008,4009,6001,6002,6003,6004,6005,6006,6007,6008,7001,7002,7003,7004,7005,7006,7007,7008) AND
((dbo.FAKTURA.FakturaType = 0 AND
dbo.FAKTURA.FakturaID NOT IN (
SELECT FAKTURA.ModpartFakturaID FROM FAKTURA WHERE FAKTURA.ModpartFakturaID IS NOT NULL
)) OR
dbo.FAKTURA.FakturaType IS NULL)
GROUP BY
dbo.PatientView.CPRNR,
dbo.PatientView.Navn,
dbo.BHLFORLINIE.LeveringsDato
Is there a smarter way of doing this? Right now the query runs three times slower because of the added "not in" subquery.
Any help is much appreciated!
Peter

You can use an outer join and check for NULL values to find non-matches:
SELECT customer.name, i.invoiceId
FROM invoices i
INNER JOIN customer ON i.customerId = customer.customerId
LEFT OUTER JOIN invoices i2 ON i.invoiceId = i2.cancelInvoiceId
WHERE i2.invoiceId IS NULL
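
As a rough check of this self-join idea against the two invoices from the question, here is a minimal sketch using Python's sqlite3 module (an assumption: the real database is SQL Server, and the table is reduced to the two columns that matter). Invoice 152600 is a hypothetical third row added so that there is one invoice that is neither cancelled nor a cancellation.

```python
import sqlite3

# Reduced, hypothetical version of the faktura table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE faktura (
        fakturaid        INTEGER PRIMARY KEY,
        modpartfakturaid INTEGER          -- id of the invoice this one cancels
    );
    INSERT INTO faktura VALUES
        (152549, NULL),      -- original invoice, later cancelled
        (152592, 152549),    -- cancellation of 152549
        (152600, NULL);      -- hypothetical invoice that was never cancelled
""")

# Original approach: exclude ids that appear as a cancellation target.
not_in_ids = [row[0] for row in conn.execute("""
    SELECT f.fakturaid FROM faktura f
    WHERE f.fakturaid NOT IN (
        SELECT modpartfakturaid FROM faktura
        WHERE modpartfakturaid IS NOT NULL)
    ORDER BY f.fakturaid
""")]

# Anti-join: keep invoices for which no cancelling row "c" exists.
anti_join_ids = [row[0] for row in conn.execute("""
    SELECT f.fakturaid FROM faktura f
    LEFT JOIN faktura c ON c.modpartfakturaid = f.fakturaid
    WHERE c.fakturaid IS NULL
    ORDER BY f.fakturaid
""")]

print(not_in_ids, anti_join_ids)
```

Both forms return the same ids here; whether the join form is actually faster on the real tables depends on the indexes and the optimizer, so it is worth comparing the execution plans.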

Unusual Joins SQL

I am having to convert code written by a former employee to work in a new database. In doing so I came across some joins I have never seen and do not fully understand how they work or if there is a need for them to be done in this fashion.
The joins look like this:
From Table A
Join(Table B
Join Table C
on B.Field1 = C.Field1)
On A.Field1 = B.Field1
Does this code function differently from something like this:
From Table A
Join Table B
On A.Field1 = B.Field1
Join Table C
On B.Field1 = C.Field1
If there is a difference please explain the purpose of the first set of code.
All of this is done in SQL Server 2012. Thanks in advance for any help you can provide.

I could create a temp table and then join that. But why use up the cycles/RAM on additional storage and indexes if I can just do it on the fly?
I ran across this scenario today in SSRS - a user wanted to see all the Individuals granted access through an AD group. The user was using a cursor and some temp tables to get the users out of AD and then joining the user to each SSRS object (Folders, reports, linked reports) associated with the AD group. I simplified the whole thing with Cross Apply and a sub query.
GroupMembers table
GroupName
UserID
UserName
AccountType
AccountTypeDesc
SSRSOjbects_Permissions table
Path
PathType
RoleName
RoleDesc
Name (AD group name)
The query needs to return each individual in an AD group associated with each report. Basically a Cartesian product of users to reports within a subset of data. The easiest way to do this looks like this:
select
G.GroupName, G.UserID, G.UserName, G.AccountType, G.AccountTypeDesc,
[Path], PathType, RoleName, RoleDesc
from
GroupMembers G
cross apply
(select
[Path], PathType, RoleName, RoleDesc
from
SSRSOjbects_Permissions
where
Name = G.GroupName) S;
You could achieve this with a temp table and some outer joins, but why waste system resources?

I have seen this kind of join before: it is the MS Access style of handling multi-table joins. In MS Access you need to nest each subsequent join in its own level of parentheses. So, for example, this T-SQL join:
SELECT a.columna, b.columnb, c.columnc
FROM tablea AS a
LEFT JOIN tableb AS b ON a.id = b.id
LEFT JOIN tablec AS c ON a.id = c.id
you should convert to this:
SELECT a.columna, b.columnb, c.columnc
FROM ((tablea AS a) LEFT JOIN tableb AS b ON a.id = b.id) LEFT JOIN tablec AS c ON a.id = c.id
So, yes, I believe you are right in your assumption
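
For the original question (nested versus flat joins), a small experiment can show when the two forms agree. The sketch below uses Python's sqlite3 module rather than SQL Server 2012 (an assumption), with three hypothetical single-column tables a, b and c. With plain inner joins the nested and flat forms return the same rows; once the outer join is a LEFT JOIN, the nesting changes the result, which is one common reason to write joins this way.

```python
import sqlite3

# Hypothetical tables; column names are unique so no qualifiers are needed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER);
    CREATE TABLE b (b_id INTEGER);
    CREATE TABLE c (c_id INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (1), (2);
    INSERT INTO c VALUES (1);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Inner joins only: flat and nested forms are equivalent.
inner_flat = run("""
    SELECT a_id, b_id, c_id
    FROM a JOIN b ON a_id = b_id
           JOIN c ON b_id = c_id
    ORDER BY a_id
""")
inner_nested = run("""
    SELECT a_id, b_id, c_id
    FROM a JOIN (b JOIN c ON b_id = c_id) ON a_id = b_id
    ORDER BY a_id
""")

# LEFT JOIN: the flat form drops a_id = 2 (the later inner join filters it),
# while the nested form keeps it with NULLs for b and c.
left_flat = run("""
    SELECT a_id, b_id, c_id
    FROM a LEFT JOIN b ON a_id = b_id
           JOIN c ON b_id = c_id
    ORDER BY a_id
""")
left_nested = run("""
    SELECT a_id, b_id, c_id
    FROM a LEFT JOIN (b JOIN c ON b_id = c_id) ON a_id = b_id
    ORDER BY a_id
""")

print(inner_flat, inner_nested, left_flat, left_nested)
```

So if the inherited code uses only inner joins, the nesting should be safe to flatten; if any of the joins are outer joins, the parentheses may be doing real work.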

Access 'Not Equal To' Join

I have two tables, neither with a primary id. The same combination of fields uniquely identifies the records in each and makes the records between the two tables relate-able (I think).
I need a query to combine all the records from one table and only the records from the second not already included from the first table. How do I do this using 'not equal to' joins on multiple fields? My results so far only give me the records of the first table, or no records at all.

Try the following:
SELECT ECDSlides.[Supplier Code], ECDSlides.[Supplier Name], ECDSlides.Commodity
FROM ECDSlides LEFT JOIN [Mit Task Details2] ON (ECDSlides.Commodity = [Mit Task Details2].Commodity) AND (ECDSlides.[Supplier Code] = [Mit Task Details2].[Supplier Code])
WHERE [Mit Task Details2].Commodity Is Null;

This might be what you are looking for
SELECT fieldA,fieldB FROM tableA
UNION
SELECT fieldA,fieldB FROM tableB
UNION removes duplicate rows automatically; UNION ALL does not.
If, for some reason, you still get exact duplicates that are not removed, you could try this:
SELECT DISTINCT * FROM (
SELECT fieldA,fieldB FROM tableA
UNION
SELECT fieldA,fieldB FROM tableB
) AS subquery
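
The difference between UNION and UNION ALL can be seen on a tiny example. This is a sketch using Python's sqlite3 module instead of Access (an assumption), with hypothetical contents for tableA and tableB that share one identical row.

```python
import sqlite3

# Hypothetical data: the row (2, 'y') exists in both tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (fieldA INTEGER, fieldB TEXT);
    CREATE TABLE tableB (fieldA INTEGER, fieldB TEXT);
    INSERT INTO tableA VALUES (1, 'x'), (2, 'y');
    INSERT INTO tableB VALUES (2, 'y'), (3, 'z');
""")

# UNION keeps one copy of each distinct row.
union_rows = conn.execute("""
    SELECT fieldA, fieldB FROM tableA
    UNION
    SELECT fieldA, fieldB FROM tableB
    ORDER BY fieldA
""").fetchall()

# UNION ALL keeps every row from both inputs, duplicates included.
union_all_rows = conn.execute("""
    SELECT fieldA, fieldB FROM tableA
    UNION ALL
    SELECT fieldA, fieldB FROM tableB
    ORDER BY fieldA
""").fetchall()

print(union_rows)
print(union_all_rows)
```

Note that UNION compares whole rows, so it only deduplicates when every selected field matches, which fits the case where the same combination of fields identifies a record in both tables.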

bigQuery - Join

I'm trying to join two tables on the ID. The first table, with price quotes, has no website data, so I want to join it in from the logs table. However, in the logs table the ID is not unique; the first chronological appearance of the ID carries the right website.
When I run the query below, I get:
Resources exceeded during query execution.
Hence I don't know whether the problem is the code or something else.
Thanks
SELECT ID, user,busWeek, count(*) as num FROM [datastore.quotes]
Join (
select objectID, first(website) from (
select objectID, website, date from [datastore.allLogs]
order by date) group by objectID)
as Logs
on ID = objectID
group by ID,user,busWeek

Can you try:
SELECT ID, user,busWeek, count(*) as num FROM [datastore.quotes]
Join EACH (
select objectID, first(website) from (
select objectID, website, date from [datastore.allLogs]
order by date) group EACH by objectID)
as Logs
on ID = objectID
group by ID,user,busWeek
Note the 'EACH' - that keyword won't be needed in the future, but it's still useful today.

I think the issue is the ORDER BY. It forces the whole calculation onto one node, which causes the "Resources exceeded" message. I understand you need it to get the first website (by date) for each object.
Try rewriting the select inside the join so that it can be partitioned, for example with window functions using OVER(PARTITION BY ... ORDER BY ...).
That way, I think, the work has a chance of running in parallel.
See the BigQuery documentation on window functions for reference.
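
The partitioned approach can be sketched as follows. This uses Python's sqlite3 module rather than BigQuery (an assumption; it needs SQLite 3.25 or later for window functions), with a hypothetical allLogs table and made-up rows. ROW_NUMBER() numbers the log rows within each objectID by date, and the outer query keeps only the first one, so no global ORDER BY over the whole table is required.

```python
import sqlite3

# Hypothetical log rows; object 1 appears twice, the earlier row should win.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE allLogs (objectID INTEGER, website TEXT, date TEXT);
    INSERT INTO allLogs VALUES
        (1, 'b.example', '2020-01-02'),
        (1, 'a.example', '2020-01-01'),
        (2, 'c.example', '2020-01-03');
""")

# Number rows per objectID by date and keep the earliest one.
first_sites = conn.execute("""
    SELECT objectID, website
    FROM (
        SELECT objectID, website,
               ROW_NUMBER() OVER (PARTITION BY objectID ORDER BY date) AS rn
        FROM allLogs
    )
    WHERE rn = 1
    ORDER BY objectID
""").fetchall()

print(first_sites)
```

The result of the inner query can then be joined to the quotes table on ID = objectID in place of the ordered subquery from the question.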
