Impala join with or query - join

I am trying to perform a join in impala as such:
Select * from Table1 t1
left outer join Table2 t2 on (t1.column1 = t2.column1 OR t1.column2 = t2.column2)
But I get the following error:
NotImplementedException: Join with 't2' requires at least one conjunctive equality precidate.
To perform a Cartesian product between two tables, use a CROSS JOIN.
I have tried using a CROSS JOIN but it does not work either.
Is it possible to perform or queries on a join in Impala? Is there a work around?
I have tried it using and AND query and it runs successfully.
Any help or advice is appriciated.

As suggested on the Impala JIRA, you can trying rewriting your query with a UNION ALL clause. Unfortunately you'll have to do the deduplication following the UNION ALL manually.

Related

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
Not sure why you would get that error. One work around is to use not exists.
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
FROM vegetables v
WHERE v.name = f.name)
or a left join
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL
You got cartesian join because this is what Hive does in this case. vegetables table is very small (just one row) and it is being broadcasted to perform the cross (most probably map-join, check the plan) join. Hive does cross (map) join first and then applies filter. Explicit left join syntax with filter as #VamsiPrabhala said will force to perform left join, but in this case it works the same, because the table is very small and CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see what is exactly happening.

Hive join query to list columns from only one table

I am writing a hive query to join two tables; table1 and table2. In the result I just need all columns from table1 and no columns from table2.
I know the solution where I can select all the columns manually by specifying table1.column1, table1.column2.. and so on in the select statement. But I have about 22 columns in table 1. Also, I have to do the same for multiple other tables ans its painful process.
I tried using "SELECT table1.*", but I get a parse exception.
Is there a better way to do it?
Hive 0.13 onwards the following query syntax works:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
This query will select all columns from a. So instead of typing all the column names (making the query cumbersome), it is a better idea to use tablealias.*

JPQL join two entities with no direct relations

I have an issue: When I am trying to join two tables which do not have a foreign key or a direct entity relation through my java code within themselves. I am using the below JPQL query: -
SELECT p FROM P p, OM orgm WHERE p.o.id = orgm.o.id and p.u.id = orgm.u.id and orgm.ma = true and p.u.id = ? AND p.o.id IN (:oId);
But this turns to a MySQL query which has a "cross join" which obviously is expensive.
What I need is to make sure that a similar query gives me an inner join MySQL query between the two tables.
I am trying to make usage of the "WITH" clause but seems that it doesn't work with inner join.
Please revert what can be done in this scenario.
Thanks in advance.

Excluding data using NOT IN in the join using DB2, SQL PL

I need assistance in formulating the correct approach to a query.
I have staff members that I need to give work to. If they're not available on a date, they're excluded from the group of staff members that can get work. I think it's clear what I'm trying to do, but it's incorrect syntax:
INNER JOIN mySchema."STAFF" S
ON RS.STAFF_ID = S.STAFF_ID
AND RS.STAFF_ID NOT IN (SELECT SU.STAFF_ID
FROM mySchema."STAFF_UNAVAIL" SU
WHERE SU.UNAVAIL_DT = OUTSTANDING_DATE)
Any ideas on how one could achieve a NOT IN in a join without actually doing it in the join?
put it in a where clause after the joins
INNER JOIN mySchema."STAFF" S
ON RS.STAFF_ID = S.STAFF_ID
...any other joins...
WHERE RS.STAFF_ID NOT IN (SELECT SU.STAFF_ID
FROM mySchema."STAFF_UNAVAIL" SU
WHERE SU.UNAVAIL_DT = OUTSTANDING_DATE)

using SQL aggregate functions with JOINs

I have two tables - tool_downloads and tool_configurations. I am trying to retrieve the most recent build date for each tool in my database. The layout of the DB is simple. One table called tool_downloads keeps track of when a tool is downloaded. Another table is called tool_configurations and stores the actual data about the tool. They are linked together by the tool_conf_id.
If I run the following query which omits dates, I get back 200 records.
SELECT DISTINCT a.tool_conf_id, b.tool_conf_id
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
ORDER BY a.tool_conf_id
When I try to add in date information I get back hundreds of thousands of records! Here is the query that fails horribly.
SELECT DISTINCT a.tool_conf_id, max(a.configured_date) as config_date, b.configuration_name
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
ORDER BY a.tool_conf_id
I know the problem has something to do with group-bys/aggregate data and joins. I can't really search google since I don't know the name of the problem I'm encountering. Any help would be appreciated.
Solution is:
SELECT b.tool_conf_id, b.configuration_name, max(a.configured_date) as config_date
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
GROUP BY b.tool_conf_id, b.configuration_name

Resources