RapidMiner multiple Join operator - join

I want to Join (Inner Join) three CSV datasets in RapidMiner. Right now I am using two Join operatos ((Dataset1 Join Dataset2) Join Dataset3).
Is there any operator or method to Join multiple operators simultaneously?

The short answer is no.
However, you could "roll your own" by using the Sub Process operator and place inside that the required number of Join operators. The resulting single operator would look and behave like a single operator.

Related

generating join condition dynamically in pyspark

Can someone suggest a way to pass a listofJoinColumns and a condition to joins in pyspark.
e.g. I need the columns to be joined on to be dynamically taken from a list and also want to pass another condition on the join. Something similar to this done in scala is explained here: generating join condition dynamically in spark/scala
I am looking for a similar solution in pyspark.
I understand that I can use the join e.g.
a.join(b , listofjoincolumns, how="inner")
but I want to pass a join condition as well:
I want to call it as
a.join(b , listofjoincolumns and join condition, how="inner")
Can someone please suggest a way to do so in pyspark.
Try to convert the list of join columns to a join condition itself:
from functools import reduce
from operator import and_
df_a.join(df_b, reduce(and_,
[df_a[col] == df_b[col] for col in listofcols],
joinCond)
)

Predicate Pushdown vs On Clause

When performing a join in Hive and then filtering the output with a where clause, the Hive compiler will try to filter data before the tables are joined. This is known as predicate pushdown (http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/)
For example:
SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6
Rows from table a which have some_name = 6 will be filtered before performing the join, if push down predicates are enabled(hive.optimize.ppd).
However, I have also learned recently that there is another way of filtering data from a table before joining it with another table(https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/).
One can provide the condition in the ON clause, and table a will be filtered before the join is performed
For example:
SELECT * FROM a JOIN b ON a.some_id=b.some_other_id AND a.some_name=6
Do both of these provide the predicate pushdown optimization?
Thank you
Both are valid and in case of INNER JOIN and PPD both will work the same. But these methods works differently in case of OUTER JOINS
ON join condition works before join.
WHERE is applied after join.
Optimizer decides is Predicate push-down applicable or not and it may work, but in case of LEFT JOIN for example with WHERE filter on right table, the WHERE filter
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id
WHERE b.some_name=6 --Right table filter
will restrict NULLs, and LEFT JOIN will be transformed into INNER JOIN, because if b.some_name=6, it cannot be NULL.
And PPD does not change this behavior.
You can still do LEFT JOIN with WHERE filter if you add additional OR condition allowing NULLs in the right table:
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id
WHERE b.some_name=6 OR b.some_other_id IS NULL --allow not joined records
And if you have multiple joins with many such filtering conditions the logic like this makes your query difficult to understand and error prune.
LEFT JOIN with ON filter does not require additional OR condition because it filters right table before join, this query works as expected and easy to understand:
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id and b.some_name=6
PPD still works for ON filter and if table b is ORC, PPD will push the predicate to the lowest possible level to the ORC reader and will use built-in ORC indexes for filtering on three levels: rows, stripes and files.
More on the same topic and some tests: https://stackoverflow.com/a/46843832/2700344
So, PPD or not PPD, better use explicit ANSI syntax with ON condition and ON filtering if possible to keep the query as simple as possible and avoid converting to INNER JOIN unintentionally.

How are the mappers decided while running a Hive Map Join?

This is stated on the wiki page of Apache Hive:
If all but one of the tables being joined are small, the join can be performed as a map only job.The query
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
does not need a reducer. For every mapper of A, B is read completely.
How are the number of mappers decided if one of the tables being joined is small but the other is large enough to go out of a single mapper's resources?
Will the join automatically turn into a non-map join then?
The other table cannot be too large.
It is being streamed through the mapper(s).

How to use cogroup for join

I have this join.
A = Join smallTableBigEnoughForInMemory on (F1,F2) RIGHT OUTER, massive on (F1,F2);
B = Join anotherSmallTableBigforInMemory on (F1,F3 ) RIGHT OUTER, massive on (F1,F3);
Since both joins are using one common key, I was wondering if COGROUP can be used for joining data efficiently. Please note this is a RIGHT outer join.
I did think about cogrouping on F1, but small tables has multiple combination ( 200-300) on single key so I have not used join using single key.
I think partitioning may help but data has skew and I am not sure how to use it in Pig
You are looking for Pig's implementation of fragment-replicate joins. See the O'Reilly book Programming Pig for more details about the different join implementations. (See in particular Chapter 8, Making Pig Fly.)
In a fragment-replicate join, no reduce phase is required because each record of the large input is streamed through the mapper, matched up with any records of the small input (which is entirely in memory), and output. However, you must be careful not to do this kind of join with an input which will not fit into memory -- Pig will issue an error and the job will fail.
In Pig's implementation, the large input must be given first, so you will actually be doing a left outer join. Just tack on "using 'replicated'":
A = JOIN massive BY (F1,F2) LEFT OUTER, smallTableBigEnoughForInMemory BY (F1,F2) USING 'replicated';
The join for B will be similar.

SQLite3 Database Query Optimization

I want a result by combining 4 tables. Previously I was using 4 different queries and to improve performance, started with joining the tables and querying from single table. But there was no improvement in performance.
I later learnt that SQLite translates join statements to "where clause" and I can directly use "Where" clause instead of join that would save some CPU time.
But the problem with "Where" clause is if one condition out of four fails, the result set is null. I want a table with rest of the columns (that matches other conditions) filled and not an empty table if one condition fails. Is there a way to acheive this? Thanks!
Have you considered using LEFT OUTER JOIN ?
for example
SELECT Customers.AcctNumber, Customers.Custname, catalogsales.InvoiceNo
FROM Customers
LEFT OUTER JOIN catalogsales ON Customers.Acctnumber = catalogsales.AcctNumber
In this example if there are not any matching rows in "catalogsales", then it will still return the data from the "left" table, which in this case is "Customers"
Without example SQL it's hard to know what you've tried.

Resources