Generating a join condition dynamically in PySpark

Can someone suggest a way to pass both a listofJoinColumns and an additional condition to a join in PySpark?
For example, I need the join columns to be taken dynamically from a list, and I also want to pass another condition to the join. Something similar, done in Scala, is explained here: generating join condition dynamically in spark/scala
I am looking for a similar solution in PySpark.
I understand that I can join on a list of columns, e.g.
a.join(b, listofjoincolumns, how="inner")
but I want to pass a join condition as well.
I would like to call it as something like
a.join(b, listofjoincolumns and join condition, how="inner")
Can someone please suggest a way to do this in PySpark?

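Try converting the list of join columns into a join condition itself, and combine it with the extra condition:
from functools import reduce
from operator import and_

# One equality test per shared column name
equalities = [df_a[c] == df_b[c] for c in listofcols]
# Fold the tests together with &, seeding the fold with the extra condition
df_a.join(df_b, reduce(and_, equalities, joinCond), how="inner")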
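The same idea can be wrapped in a small helper. This is only a minimal sketch: build_join_condition is a hypothetical name, not a PySpark function, and it assumes only that the two DataFrames support df[col], == and &, as PySpark DataFrames and Columns do.

```python
from functools import reduce
from operator import and_


def build_join_condition(left, right, columns, extra=None):
    """Return left[c] == right[c] for every c in columns, ANDed with extra.

    Hypothetical helper for illustration; not part of PySpark.
    """
    equalities = [left[c] == right[c] for c in columns]
    if extra is not None:
        # Seed the fold with the extra condition so it is ANDed in as well.
        return reduce(and_, equalities, extra)
    if not equalities:
        raise ValueError("need at least one join column or an extra condition")
    return reduce(and_, equalities)
```

With the names from the answer, it would be called as df_a.join(df_b, build_join_condition(df_a, df_b, listofcols, joinCond), how="inner"). Note that joining on a condition, unlike joining on a list of column names, keeps the join columns from both DataFrames in the result.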

Related

Joining 2 large relations in Rascal

I'm trying to join two relations in Rascal, much like a SQL join, with the following code:
rel[loc,loc,loc] methodInvocationsWithClass = {around 40000 tuples};
rel[loc,loc] declaredClassHierarchy = {around 20000 tuples};
{ <from,to,class,super> | <from,to,class> <- methodInvocationsWithClass, <sub,super> <- declaredClassHierarchy, class == sub };
While this does exactly what I need, it appears to work well only on small relations and doesn't scale.
Is there perhaps a more efficient alternative way to accomplish this?
Indeed, we have the join keyword for this. Lots of other useful relational operations are also supported, either by keywords or by functions in the Relation module.
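As a language-neutral illustration of why the comprehension in the question is slow, here is a sketch in Python (not Rascal). It assumes the cost comes from testing every pair of tuples, roughly 40000 × 20000 comparisons, and avoids that by indexing one relation by its key first. The function and variable names are made up for the example, and it makes no claim about how Rascal implements its own operators.

```python
from collections import defaultdict


def join_on_class(invocations, hierarchy):
    """Join (from, to, class) tuples with (sub, super) tuples where class == sub."""
    # Index the smaller relation once: sub -> list of its supers.
    supers_by_sub = defaultdict(list)
    for sub, sup in hierarchy:
        supers_by_sub[sub].append(sup)
    # Each invocation now needs one dictionary lookup instead of a full scan.
    return {
        (frm, to, cls, sup)
        for frm, to, cls in invocations
        for sup in supers_by_sub.get(cls, ())
    }
```

The work is then proportional to the sizes of the two inputs plus the size of the result, rather than to the product of the two input sizes.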

RapidMiner multiple Join operator

I want to join (inner join) three CSV datasets in RapidMiner. Right now I am using two Join operators ((Dataset1 Join Dataset2) Join Dataset3).
Is there any operator or method to join multiple datasets simultaneously?
The short answer is no.
However, you could "roll your own" by using the Sub Process operator and placing the required number of Join operators inside it. The result would look and behave like a single operator.

how to implement full outer join with lucene query time join?

For the example shown here, if I want to find all articles together with all the comments present for each article, i.e. something like a full outer join, how can it be done with Lucene's query-time join?
Is it possible that, in the syntax of createJoinQuery
JoinUtil.createJoinQuery(fromField, multipleValuesPerDocument, toField, fromQuery, searcher);
fromQuery will return all the distinct values for fromField?

How do I join multiple hive queries?

I am trying to join a simple query with a very ugly query that resolves to a single line. They have a date and a userid in common but nothing else. Alone both queries work but for the life of me I cannot get them to work together. Can someone assist me in how I would do this?
Fixed it. When you union queries in Hive, it looks like each query needs to return the same number of fields.

SQLite3 Database Query Optimization

I want a result that combines 4 tables. Previously I was using 4 different queries; to improve performance, I started joining the tables and querying them as a single table. But there was no improvement in performance.
I later learnt that SQLite translates join statements into a "where" clause, so I can use a "where" clause directly instead of a join, which would save some CPU time.
But the problem with the "where" clause is that if one condition out of the four fails, the result set is empty. I want a table with the rest of the columns (those that match the other conditions) filled in, not an empty table, if one condition fails. Is there a way to achieve this? Thanks!
Have you considered using LEFT OUTER JOIN?
For example:
SELECT Customers.AcctNumber, Customers.Custname, catalogsales.InvoiceNo
FROM Customers
LEFT OUTER JOIN catalogsales ON Customers.AcctNumber = catalogsales.AcctNumber
In this example, if there are no matching rows in "catalogsales", the query still returns the data from the "left" table, which in this case is "Customers".
Without example SQL it's hard to know what you've tried.
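The behaviour of the query above can be checked with a small script using Python's built-in sqlite3 module. This is only a sketch: the table and column names come from the example, but the column types and the two sample customers are invented for illustration.

```python
import sqlite3

# In-memory database with made-up sample data: Bob has no catalog sales.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (AcctNumber INTEGER, Custname TEXT);
    CREATE TABLE catalogsales (AcctNumber INTEGER, InvoiceNo TEXT);
    INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO catalogsales VALUES (1, 'INV-100');
""")

rows = conn.execute("""
    SELECT Customers.AcctNumber, Customers.Custname, catalogsales.InvoiceNo
    FROM Customers
    LEFT OUTER JOIN catalogsales ON Customers.AcctNumber = catalogsales.AcctNumber
    ORDER BY Customers.AcctNumber
""").fetchall()

print(rows)  # [(1, 'Ann', 'INV-100'), (2, 'Bob', None)]
conn.close()
```

The customer without a matching sale is still returned, with NULL (None in Python) in the InvoiceNo column, whereas an inner join would drop that row.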
