How to do a left anti join when the left DataFrame is aggregated in PySpark

I need to do a left anti join and flatten the table, as efficiently as possible, because the right table is massive.
The first table is small, around 1,000-10,000 rows:
The second table is massive (billions of rows):
The desired outcome is:
It is a kind of left anti join, but not exactly.
I tried joining the worker table with the first table and then doing a left anti join, and it works. But duplicating the entire category list for each employee seems inefficient and creates a huge amount of duplicated data.

If one side of your join is small (like your first table), you can use the broadcast function. It can improve performance considerably, because the small table is shipped to every executor instead of shuffling the large one. You can read more about it at:
https://sparkbyexamples.com/pyspark/pyspark-broadcast-variables/ or in the PySpark documentation.
The code below does what you want:
from pyspark.sql import functions as F
a_df.show()
b_df.show()
b_df.join(F.broadcast(a_df), ['category', 'item'], 'left_anti').show()
+--------+----+
|category|item|
+--------+----+
|       A| 123|
|       A| 456|
|       B|  19|
|       B|  20|
|       B|  66|
|       C|  38|
+--------+----+
+------+--------+----+
|worker|category|item|
+------+--------+----+
|   jon|       B|  19|
|   jon|       B|  20|
|   jon|       B|  45|
| danni|       B|  18|
| danni|       B|  19|
| danni|       B|  80|
| danni|       B|  94|
|    al|       A| 123|
|   ben|       C| 100|
|   ben|       C|  50|
+------+--------+----+
+--------+----+------+
|category|item|worker|
+--------+----+------+
|       B|  45|   jon|
|       B|  18| danni|
|       B|  80| danni|
|       B|  94| danni|
|       C| 100|   ben|
|       C|  50|   ben|
+--------+----+------+
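To make the mechanics concrete, here is a minimal plain-Python sketch of what a broadcast left anti join does conceptually: the small table becomes an in-memory set of keys, and the large table is streamed past it once. The function name left_anti_join and the tuple layout are illustrative assumptions, not Spark APIs. The data mirrors the tables above.

```python
def left_anti_join(big_rows, small_rows):
    """Keep rows of big_rows whose (category, item) key is absent from small_rows.

    big_rows:   iterable of (worker, category, item) tuples (the huge table)
    small_rows: iterable of (category, item) tuples (the small, "broadcast" table)
    """
    # Build the lookup once from the small side, as a broadcast join would.
    small_keys = set(small_rows)
    # Stream the big side and keep only the rows that have no match.
    return [row for row in big_rows if (row[1], row[2]) not in small_keys]


# Hypothetical in-memory stand-ins for a_df and b_df.
a_rows = [("A", 123), ("A", 456), ("B", 19), ("B", 20), ("B", 66), ("C", 38)]
b_rows = [
    ("jon", "B", 19), ("jon", "B", 20), ("jon", "B", 45),
    ("danni", "B", 18), ("danni", "B", 19), ("danni", "B", 80), ("danni", "B", 94),
    ("al", "A", 123), ("ben", "C", 100), ("ben", "C", 50),
]

result = left_anti_join(b_rows, a_rows)
for row in result:
    print(row)
```

Because only the small side is held in memory, the cost is a single pass over the large table, which is the reason broadcasting helps when one side has only a few thousand rows.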

Related

PySpark - How can I rename column names based on a lookup table?

I have two tables as follows:
Table 1:
Table 2:
I want to replace the column names of Table 1 with the date column from Table 2.
The final output should look like the below table:
All help is appreciated!
Thank You!
I assume that table 2 is not very large, since it only holds column-name mappings; otherwise collecting it to the driver could cause memory issues. Try this:
tst1=sqlContext.createDataFrame([(1,2,3,4,5,6,7,8),(5,6,7,8,9,10,11,12),(13,14,15,16,17,18,19,20)],["a","b","c","d","e","f","g","h"])
tst2=sqlContext.createDataFrame([('a','apple'),('b','ball'),('c','cat'),('d','dog'),('e','elephant'),('f','fox'),('g','goat'),('h','hat')],["short","long"])
tst1.show()
+---+---+---+---+---+---+---+---+
|  a|  b|  c|  d|  e|  f|  g|  h|
+---+---+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|  7|  8|
|  5|  6|  7|  8|  9| 10| 11| 12|
| 13| 14| 15| 16| 17| 18| 19| 20|
+---+---+---+---+---+---+---+---+
# Collect table 2 to the driver to extract the mapping
tst_cl = tst2.collect()
# Get the old and new names of the columns
old_name = [str(row[0]) for row in tst_cl]
new_name = [str(row[1]) for row in tst_cl]
# Select the columns in mapping order, then rename them positionally
tst_rn = tst1.select(old_name).toDF(*new_name)
tst_rn.show()
+-----+----+---+---+--------+---+----+---+
|apple|ball|cat|dog|elephant|fox|goat|hat|
+-----+----+---+---+--------+---+----+---+
|    1|   2|  3|  4|       5|  6|   7|  8|
|    5|   6|  7|  8|       9| 10|  11| 12|
|   13|  14| 15| 16|      17| 18|  19| 20|
+-----+----+---+---+--------+---+----+---+
Once you have collected the column mappings, you can use any of the renaming techniques described here: PySpark - rename more than one column using withColumnRenamed
Hint: Should you face order mismatch issues during collect (mostly you won't, but in case you want to be triple sure), consider combining the mapping columns of table 2 with F.array() and then collecting. The extraction of the names has to change slightly:
from pyspark.sql import functions as F
tst_array = tst2.withColumn("name_array", F.array(F.col('short'), F.col('long')))
tst_clc = tst_array.collect()
# name_array is the third column (index 2) and holds [short, long]
old_name = [str(row[2][0]) for row in tst_clc]
new_name = [str(row[2][1]) for row in tst_clc]
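Another way to sidestep any ordering concern is to pair old and new names through a dictionary and drive the rename from the DataFrame's own column order. The sketch below is plain Python so that it runs without a Spark session. The helper name renamed_columns is hypothetical, and the shuffled mapping rows are made up to mimic rows arriving in a different order.

```python
def renamed_columns(columns, mapping_rows):
    """Return the new name for every column, in the order of `columns`.

    columns:      column names of the DataFrame to rename, e.g. tst1.columns
    mapping_rows: (old_name, new_name) pairs, e.g. the rows from tst2.collect()
    """
    # Pair old and new names by key, so row order in the lookup table is irrelevant.
    lookup = {str(old): str(new) for old, new in mapping_rows}
    # Columns without a mapping keep their current name.
    return [lookup.get(col, col) for col in columns]


columns = ["a", "b", "c", "d", "e", "f", "g", "h"]
# Deliberately shuffled to mimic rows coming back in a different order.
mapping_rows = [("h", "hat"), ("b", "ball"), ("a", "apple"), ("d", "dog"),
                ("c", "cat"), ("g", "goat"), ("f", "fox"), ("e", "elephant")]

new_names = renamed_columns(columns, mapping_rows)
print(new_names)
```

With the DataFrames above, the result could then presumably be applied as tst1.toDF(*renamed_columns(tst1.columns, tst2.collect())), since collected Row objects unpack like tuples.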

MySQL join: get rows where two values are set or no value at all

I have two tables:
table "songs"
|id|title |
|1 |song 1|
|2 |song 2|
|3 |song 3|
table "tags"
|id|song_id|type |tag |
|1 |1 |season|christmas|
|2 |1 |time |morning |
|3 |2 |season|summer |
|4 |2 |time |morning |
|5 |2 |time |night |
|6 |3 |time |morning |
For example, I have three tags of type "season": "christmas", "easter", "valentine". I also have tags of type "time": "morning", "afternoon", "evening" and "night".
How can I get all songs that have tag type "season" with tag "christmas", or no "season" tag type at all? Other tag types may still be present.
I have written this:
SELECT s.title,t.* FROM songs AS s LEFT JOIN tags AS t ON s.id=t.song_id WHERE (t.type='season' AND t.tag='christmas') OR type ... ?
but I don't know how to write the rest of the query so that type "season" must be absent while other types are still allowed.
Example:
query 1: search for songs that have both season=christmas and time=morning tags.
query 2: songs that have no season tag and only time=night.
I hope you understand what I'm trying to achieve.
Your First Query:
SELECT s1.id, s1.title, s1.type AS "type1", s1.tag AS "tag1", s2.type AS "type2", s2.tag AS "tag2"
FROM (SELECT songs.id, songs.title, tags.type, tags.tag
FROM songs JOIN tags ON songs.id = tags.song_id
WHERE tags.type = 'season' AND tags.tag = 'christmas') AS s1
JOIN (SELECT songs.id, songs.title, tags.type, tags.tag
FROM songs JOIN tags ON songs.id = tags.song_id
WHERE tags.type = 'time' AND tags.tag = 'morning') AS s2
ON s1.id = s2.id
Your Second Query:
SELECT songs.id, songs.title, tags.type, tags.tag
FROM songs JOIN tags ON songs.id = tags.song_id
WHERE tags.type = 'time' AND tags.tag = 'morning'
AND songs.id NOT IN (SELECT song_id FROM tags WHERE tags.type = 'season')
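For the original wording of the question (season=christmas, or no season tag at all), one possible formulation combines EXISTS and NOT EXISTS. The sketch below runs it against the sample data using Python's built-in sqlite3 module purely as a convenient test bed. This is an assumption for illustration, not MySQL itself, although the query uses only standard SQL that MySQL should also accept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE songs (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE tags (id INTEGER PRIMARY KEY, song_id INTEGER, type TEXT, tag TEXT);
INSERT INTO songs VALUES (1, 'song 1'), (2, 'song 2'), (3, 'song 3');
INSERT INTO tags VALUES
    (1, 1, 'season', 'christmas'), (2, 1, 'time', 'morning'),
    (3, 2, 'season', 'summer'),    (4, 2, 'time', 'morning'),
    (5, 2, 'time', 'night'),       (6, 3, 'time', 'morning');
""")

# A song qualifies if it is tagged season=christmas,
# or if it has no tag of type 'season' at all.
query = """
SELECT s.id, s.title
FROM songs AS s
WHERE EXISTS (SELECT 1 FROM tags AS t
              WHERE t.song_id = s.id AND t.type = 'season' AND t.tag = 'christmas')
   OR NOT EXISTS (SELECT 1 FROM tags AS t
                  WHERE t.song_id = s.id AND t.type = 'season')
ORDER BY s.id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 'song 1'), (3, 'song 3')]
```

Song 2 is excluded because it carries a season tag other than christmas, while song 3 is kept because it has no season tag.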

Select using concat in single table by joining parent & child id's

I have one table with 3 columns, as below:
+---------------------------------------+
| id | name | parent_id |
+---------------------------------------+
| -1 | / | |
| 1 | Organization | -1 |
| 2 | United States | 1 |
| 3 | Business Analyst | 1 |
| 4 | Human Resources | 1 |
| 5 | Benefits Manager | 4 |
| 6 | Metropolitan Plant | 2 |
| 7 | Administration | 6 |
+---------------------------------------+
And my query is like this
SELECT CONCAT(parent.name, '/', child.name) AS path
FROM table_name AS child INNER JOIN table_name AS parent
ON child.id = parent.parent_id
I am expecting output as below.
/Organization
/Organization/United States
/Organization/Business Analyst
/Organization/Human Resources
/Organization/Human Resources/Benefits Manager
/Organization/United States/Metropolitan Plant
/Organization/United States/Metropolitan Plant/Administration
OK, there might be a more elegant way to do this, especially using loops, but with what immediately comes to mind, you may need several joins. Is the maximum depth low? I hope so. Here's an idea, but it's messy and may require a lot of spool space depending on your data size:
SELECT CONCAT(C.path2, '/', D.name) AS path3
FROM
(SELECT B.id, CONCAT(A.path1, '/', B.name) AS path2
FROM
(SELECT child.id, CONCAT(parent.name, '/', child.name) AS path1
FROM table_name AS parent LEFT JOIN table_name AS child
ON child.parent_id = parent.id) AS A
LEFT JOIN table_name AS B
ON A.id = B.parent_id) AS C
LEFT JOIN table_name AS D
ON C.id = D.parent_id
The above code would only take it up to 3 levels. If something better comes to mind, I'll post it.
Suspect you're expected to use a hierarchical query here
WITH foo (id, parent_id, name, fullpath)
AS (SELECT id,
parent_id,
name,
'/' AS fullpath
FROM table_name
WHERE parent_id IS NULL
UNION ALL
SELECT m.id,
m.parent_id,
m.name,
f.fullpath || m.name || '/' AS fullpath
FROM foo f JOIN table_name m ON (m.parent_id = f.id))
SELECT fullpath FROM foo
WHERE id > 0
That'll be pretty close.
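As a runnable illustration of the recursive approach, the sketch below loads the sample table into an in-memory SQLite database through Python's sqlite3 module. SQLite is an assumption chosen for convenience; it requires the RECURSIVE keyword and supports the || concatenation operator. The slash is placed before each name rather than after it, so the paths have no trailing separator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_name (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
INSERT INTO table_name VALUES
    (-1, '/', NULL),
    (1, 'Organization', -1),
    (2, 'United States', 1),
    (3, 'Business Analyst', 1),
    (4, 'Human Resources', 1),
    (5, 'Benefits Manager', 4),
    (6, 'Metropolitan Plant', 2),
    (7, 'Administration', 6);
""")

# Anchor: the root row (parent_id IS NULL) starts with an empty path.
# Recursive step: append '/' plus the child's name to the parent's path.
query = """
WITH RECURSIVE foo (id, fullpath) AS (
    SELECT id, '' FROM table_name WHERE parent_id IS NULL
    UNION ALL
    SELECT m.id, f.fullpath || '/' || m.name
    FROM foo AS f JOIN table_name AS m ON m.parent_id = f.id
)
SELECT fullpath FROM foo WHERE id > 0 ORDER BY id
"""
paths = [row[0] for row in conn.execute(query)]
for path in paths:
    print(path)
```

Unlike the nested joins, this handles any depth of hierarchy, and the id > 0 filter drops the root row itself from the output.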

Multiple selection select boxes with Ruby on Rails

I'm using 4 tables to fill a chart, similar to these:
#colors #types #fuels
----------------- --------------------- -------------------
| ID | Color | | ID | Type | | ID | Fuel |
----------------- --------------------- -------------------
| 1 | 'Red' | | 1 | 'SUV' | | 1 | 'Gasoline' |
| 2 | 'Green' | | 2 | 'Truck' | | 2 | 'Diesel' |
| 3 | 'Blue' | | 3 | 'Sports Car' | | 3 | 'Electric' |
| 4 | 'Yellow' | | 4 | 'Compact' | | 4 | 'Hybrid' |
----------------- --------------------- -------------------
The last table is an orders table, similar to this:
#orders
--------------------------------------------------------
| order_id | color_id | type_id | fuel_id  | purchases |
--------------------------------------------------------
| 1 | 2 | 1 | 4 | 2 |
| 2 | 1 | 4 | 1 | 4 |
| 3 | 2 | 2 | 2 | 6 |
| 4 | 4 | 1 | 4 | 2 |
| 5 | 1 | 4 | 2 | 1 |
| 6 | 3 | 3 | 3 | 1 |
| 7 | 3 | 3 | 3 | 2 |
| 8 | 2 | 1 | 1 | 2 |
--------------------------------------------------------
I have a controller that pulls data from all of them to make the chart. So far so good. Now, I want to let the user pick one or more attributes from each table, to make a dynamic orders page.
My approach is to show 3 select boxes (listboxes) that allow the user to make a selection, and based on this, the #orders table would be modified. The modifications would land on a different action, say:
def newtable
...
end
I know how to do this via SQL, but I'm not too sure how to properly show these listboxes using RoR. My idea is to pick one, several, or ALL the elements of each table.
Would form_for do the trick? I was trying to use it, but I don't have a model to base the query on, and I'm not sure how to create one (or if that approach is actually viable).
Thanks.
Well, I'm not familiar with Rails 4 yet, but I can provide an answer for Rails 3.
First, make the necessary associations.
Then, in your Orders controller, define a new method like:
def create
# ... your logic ...
end
In your Order form page:
<%= form_tag(:action => 'create') do %>
<%= collection_select :order, :color_id, @colors, 'id', 'Color', {:label => 'Select a Color', :prompt => 'Select'} %>
<%= collection_select :order, :type_id, @types, 'id', 'Type', {:label => 'Select a Type', :prompt => 'Select'} %>
<%= collection_select :order, :fuel_id, @fuels, 'id', 'Fuel', {:label => 'Select a Fuel', :prompt => 'Select'} %>
<%= submit_tag "Create" %>
<% end %>
Hope this works.
Note: This is just sample code for your understanding.
<%= f.select(:color, @color.collect {|p| [ p.name, p.id ] },
{ :prompt => "Please select"},
{ :multiple => true, :size => 4 }) %>
To configure your controller to accept multiple parameters: Click Me
You can create a separate model as a ruby class to handle this if you wish, but only if you find the other models getting too cramped. Hope that helps

How to make a relation between two tables using has_one when neither of them stores the id of the other table

I have a table named Strategy which looks like this:
+----+---+---+
| id | a | b |
+----+---+---+
| 1  | 1 | 2 |
| 2  | 2 | 3 |
| 3  | 2 | 4 |
| 4  | 4 | 1 |
| 5  | 4 | 4 |
+----+---+---+
and I have a class named Plan which looks like:
class Plan < ActiveRecord::Base
attr_accessible :a, :b
has_one :strategy ...................
end
Now I want to fill in this empty has_one relation. The Strategy table has columns a and b, and each row has a different combination of the two, as shown in the table. I don't know how to build a relationship between the two tables, because Strategy doesn't belong to Plan and Plan doesn't store the Strategy id. The Strategy table only contains static values. But a Plan has a and b values, and I want to get the Strategy id from the plan's a and b values using a has_one relation. Is it possible to do that in Rails, or do I need to fall back on some old logic?
You can build the association manually.
In your Plan model:
def strategy
@strategy ||= Strategy.where("a = ? AND b = ?", a, b).first
end
This will let you reference my_plan.strategy in your code.
EDIT: answer updated to include Chandranshu's excellent suggestions: memoization, and handling the case of more than one strategy with duplicate a, b values.
