Conditional join of two tables in Knime - join

I am new to Knime Analytics.
I have two tables and I need to join them not by equality, but by the difference between the values of two fields. (In sql it would look like "table1 join table2 on abs(table1.mass - table2.mass)<0.005 ") but in nodes I have found only node, that join by equality.
Are there any nodes for conditional joining tables or something like this?

The only way I can think of to do this is as follows. Use a Cross Joiner node to join all rows of each table to all rows of the second table. Now use a Java Snippet Row Filter on the joined table, with the following snippet code
return Math.abs($mass$.doubleValue() - $mass (#1)$.doubleValue()) < 0.005;
(Assuming that both incoming tables have a column called 'mass', which will become 'mass' and 'mass (#1)' after the Cross Joiner

Related

proc sql inner join behavior and required select statements

I recently started using SAS, only receiving a basic training that didn't cover proc sql. I'd like to read up a bit more on SAS sql when I have the time.
For now, I found a solution to what I wanted to do, but I'm having difficulties understanding what is happening.
My issue started when I wanted to find out which subjects in my dataset have a certain value for all their records. I made use of my previously written snippet of code that I thought I understood. I just tried adding a couple more variables and group by statements:
data have;
input subject:$1. myvar:1. mycount:1.;
datalines;
a 1 1
a 0 2
a 0 3
b 1 1
b 0 2
b 1 3
c 1 1
c 1 2 /*This subject has myvar = 1 for all its observations*/
;
run;
*find subjects;
proc sql;
create table want as
/* select*/
/* distinct x.subject */
/* from */
(select distinct subject, count(myvar) as myvar_c
from have where myvar = 1 group by subject) x,
(select distinct subject, max(mycount) as max_c
from have group by subject) y
where x.subject = y.subject and x.myvar_c = y.max_c;
quit;
When removing the commented 'select distinct x.subject from' in the create table statement, the above code works as should.
However, I've previously also created another piece of code, to select all subjects in my dataset that have two types of records:
data have2;
input subject:$1. mytype:1.;
datalines;
a 1
a 0
a 0
b 1
b 0
b 1
c 1
c 1 /*This subject doesn't have two types of records in all its observations*/
;
run;
*Find subjects;
proc sql;
create table want2 as select
distinct x.subject from
have2 x,
(select distinct subject, count(distinct mytype) as mytype_c from have2 group by subject) y
where y.mytype_c = 2 and x.subject = y.subject;
quit;
Which is similar, but didn't require the additional select statement. The first code has 3 select statements, the second code only requires two select statements.
Can someone inform me why this is exactly required?
Or link me some good documentation that lists the specifications of these types of joins - can anyone also inform me of the specific name of this type of join where you only use a comma?
while I'm writing, also see that could've used my code I initially wrote to find subjects that have only 1 type of record and tweak it for my current issue >.< but still would like to know what is happening in the first example.
The SQL join construct
FROM ONE, TWO, THREE, …
is known as a CROSS JOIN and is a join without criteria. The comma (,) syntax is less prevalent today and the following construct is recommended
FROM ONE
CROSS JOIN TWO
CROSS JOIN THREE
The result set is a cartesian product and the number of rows is the product of the number of rows in the cross joined tables.
When the query has criteria (WHERE clause) the join is an INNER JOIN.
The SAS documentation for Proc SQL is a good starting point and includes examples.
joined-table Component
Joins a table with itself or with other tables or views.
…
Table of Contents
Syntax
Required Arguments
Optional Argument
Details
Types of Joins
Joining Tables
Table Limit
Specifying the Rows to Be Returned
Table Aliases
Joining a Table with Itself
Inner Joins
Outer Joins
Cross Joins
Union Joins
Natural Joins
Joining More Than Two Tables
Comparison of Joins and Subqueries
General tip:
If you want to fool around (fiddle) with SQL queries in a browser, try visiting
SQL Fiddle web site.

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
Not sure why you would get that error. One work around is to use not exists.
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
FROM vegetables v
WHERE v.name = f.name)
or a left join
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL
You got cartesian join because this is what Hive does in this case. vegetables table is very small (just one row) and it is being broadcasted to perform the cross (most probably map-join, check the plan) join. Hive does cross (map) join first and then applies filter. Explicit left join syntax with filter as #VamsiPrabhala said will force to perform left join, but in this case it works the same, because the table is very small and CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see what is exactly happening.

Hive join query to list columns from only one table

I am writing a hive query to join two tables; table1 and table2. In the result I just need all columns from table1 and no columns from table2.
I know the solution where I can select all the columns manually by specifying table1.column1, table1.column2.. and so on in the select statement. But I have about 22 columns in table 1. Also, I have to do the same for multiple other tables ans its painful process.
I tried using "SELECT table1.*", but I get a parse exception.
Is there a better way to do it?
Hive 0.13 onwards the following query syntax works:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
This query will select all columns from a. So instead of typing all the column names (making the query cumbersome), it is a better idea to use tablealias.*

Issues with joining multiple tables

I'm really struggling at the moment trying to work out how to join multiple tables without duplicating data.
At the moment I have 8 tables that I was wanted to get various information from per member of staff like the below:
SDQ score, Goal scores, CHI score, number of appointments, number of dna appointments
The tables and field I can see to join are as follows
tblSDQ - Assessed_By_Staff_ID
tblGoals - Recorded_By_Staff_ID
tblCHI - Recorded_By_Staff_ID
tblReferral - Staff_ID
tblStaff - Staff_ID
tblDiaryAppointment - needs to connect to tblDiaryAppointmentClinician using Clinician_Invitee_Staff_ID
I hope someone can help or advice. I just don't know if it's even possible to join all these tables using the same field, or if its possible to join them but then return a number of entries but then just count others?
Syntax depends on a rdbms you are using.
You could use join with specified join fields from both tables:
select bla-bla
from table1
join table2 on ( table1.fileld_name1 = table2.fileld_name2 )
https://dev.mysql.com/doc/refman/5.0/en/join.html
if you need outer join (to show nulls for optional tables data) you could use this:
join table2 on ( table1.fileld_name1 = table2.fileld_name2 or table2.field_name2 is null )
to join with couns you could use subqueries like this
join ( select field_name3, coint(*) as cnt from table3 goup by field_name3 ) AS table3_counts
...
where ( table3_counts.field_name3 = ... or table3_counts.field_name3 is null )
https://dev.mysql.com/doc/refman/5.0/en/from-clause-subqueries.html
PS: Joins are often slow. It's better to denormalize tables to eliminate joins and gain performance. Or do simple selects and join in backend code.

JPQL join two entities with no direct relations

I have an issue: When I am trying to join two tables which do not have a foreign key or a direct entity relation through my java code within themselves. I am using the below JPQL query: -
SELECT p FROM P p, OM orgm WHERE p.o.id = orgm.o.id and p.u.id = orgm.u.id and orgm.ma = true and p.u.id = ? AND p.o.id IN (:oId);
But this turns to a MySQL query which has a "cross join" which obviously is expensive.
What I need is to make sure that a similar query gives me an inner join MySQL query between the two tables.
I am trying to make usage of the "WITH" clause but seems that it doesn't work with inner join.
Please revert what can be done in this scenario.
Thanks in advance.

Resources