Pig script join multiple keys with or - join

Hi I have been trying to run this piece of pig script.
meddata = join meddata by (icddiag1 or icddiag2) left outer, codes by code;
with the intent of joining the meddata file if icddiag1 matches codes or if icddiag2 matches codes. I know I can join multiple times and do an unioin, but is there an easier way to do this?

Related

Conditional join of two tables in Knime

I am new to Knime Analytics.
I have two tables and I need to join them not by equality, but by the difference between the values of two fields. (In sql it would look like "table1 join table2 on abs(table1.mass - table2.mass)<0.005 ") but in nodes I have found only node, that join by equality.
Are there any nodes for conditional joining tables or something like this?
The only way I can think of to do this is as follows. Use a Cross Joiner node to join all rows of each table to all rows of the second table. Now use a Java Snippet Row Filter on the joined table, with the following snippet code
return Math.abs($mass$.doubleValue() - $mass (#1)$.doubleValue()) < 0.005;
(Assuming that both incoming tables have a column called 'mass', which will become 'mass' and 'mass (#1)' after the Cross Joiner

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
Not sure why you would get that error. One work around is to use not exists.
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
FROM vegetables v
WHERE v.name = f.name)
or a left join
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL
You got cartesian join because this is what Hive does in this case. vegetables table is very small (just one row) and it is being broadcasted to perform the cross (most probably map-join, check the plan) join. Hive does cross (map) join first and then applies filter. Explicit left join syntax with filter as #VamsiPrabhala said will force to perform left join, but in this case it works the same, because the table is very small and CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see what is exactly happening.

Impala join with or query

I am trying to perform a join in impala as such:
Select * from Table1 t1
left outer join Table2 t2 on (t1.column1 = t2.column1 OR t1.column2 = t2.column2)
But I get the following error:
NotImplementedException: Join with 't2' requires at least one conjunctive equality precidate.
To perform a Cartesian product between two tables, use a CROSS JOIN.
I have tried using a CROSS JOIN but it does not work either.
Is it possible to perform or queries on a join in Impala? Is there a work around?
I have tried it using and AND query and it runs successfully.
Any help or advice is appriciated.
As suggested on the Impala JIRA, you can trying rewriting your query with a UNION ALL clause. Unfortunately you'll have to do the deduplication following the UNION ALL manually.

Trying to create one table from four

I'm stuck trying to create a query that pulls results from at least three different tables with many to many relationships.
I want to end up with a table that lists cases, the outcomes and complaints.
All cases may have none, one or multiple outcomes, same relationship applies to the complaints. I want to be able to have the case listed once, then subsequent columns to list all the outcomes and complaints related to that case.
I have tried GROUP_CONCAT to get the outcomes in one column instead of repeating the cases but when I use UNION to combine the outcomes and complaints one column header overwrites the other.
Any help appreciated and here's the link to the fiddle http://sqlfiddle.com/#!2/d111e/2/0
I suggest you START with this this query structure:
SELECT
c.caseID, c.caseTitle, c.caseSynopsis /* if more columns ... add to group by also */
, group_concat(co.concern)
, group_concat(re.resultText)
FROM caseSummaries AS c
LEFT JOIN JNCT_CONCERNS_CASESUMMARY AS JCC ON c.caseID = JCC.caseSummary_FK
LEFT JOIN CONCERNS AS co ON JCC.concerns_FK = co.concernsID
LEFT JOIN JNCT_RESULT_CASESUMMARY AS JRC ON c.caseID = JRC.caseSummary_FK
LEFT JOIN RESULTS AS re ON JRC.result_FK = re.result_ID
GROUP BY
c.caseID, c.caseTitle, c.caseSynopsis /* add more ... here also */
;
Treat the table caseSummaries as the most important and then everything else "hangs off" that.
Please note that although MySQL will allow it, you should place EVERY non-aggregating column that you include in the select clause into the group by clause also.
also see: http://sqlfiddle.com/#!2/2d1a79/7

using SQL aggregate functions with JOINs

I have two tables - tool_downloads and tool_configurations. I am trying to retrieve the most recent build date for each tool in my database. The layout of the DB is simple. One table called tool_downloads keeps track of when a tool is downloaded. Another table is called tool_configurations and stores the actual data about the tool. They are linked together by the tool_conf_id.
If I run the following query which omits dates, I get back 200 records.
SELECT DISTINCT a.tool_conf_id, b.tool_conf_id
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
ORDER BY a.tool_conf_id
When I try to add in date information I get back hundreds of thousands of records! Here is the query that fails horribly.
SELECT DISTINCT a.tool_conf_id, max(a.configured_date) as config_date, b.configuration_name
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
ORDER BY a.tool_conf_id
I know the problem has something to do with group-bys/aggregate data and joins. I can't really search google since I don't know the name of the problem I'm encountering. Any help would be appreciated.
Solution is:
SELECT b.tool_conf_id, b.configuration_name, max(a.configured_date) as config_date
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
GROUP BY b.tool_conf_id, b.configuration_name

Resources