PySpark - How can I rename column names based on a lookup table?

I have two tables as follows:
Table 1:
Table 2:
I want to replace the column names of Table 1 with the values in the date column of Table 2.
The final output should look like the below table:
All help is appreciated!
Thank You!

I assume Table 2 is small, since it only holds column-name mappings; otherwise collecting it to the driver could cause memory problems. Try this:
tst1=sqlContext.createDataFrame([(1,2,3,4,5,6,7,8),(5,6,7,8,9,10,11,12),(13,14,15,16,17,18,19,20)],["a","b","c","d","e","f","g","h"])
tst2=sqlContext.createDataFrame([('a','apple'),('b','ball'),('c','cat'),('d','dog'),('e','elephant'),('f','fox'),('g','goat'),('h','hat')],["short","long"])
tst1.show()
+---+---+---+---+---+---+---+---+
|  a|  b|  c|  d|  e|  f|  g|  h|
+---+---+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|  7|  8|
|  5|  6|  7|  8|  9| 10| 11| 12|
| 13| 14| 15| 16| 17| 18| 19| 20|
+---+---+---+---+---+---+---+---+
# Collect table 2 to the driver to extract the mapping
tst_cl = tst2.collect()
# Get the old and new column names
old_name = [str(row[0]) for row in tst_cl]
new_name = [str(row[1]) for row in tst_cl]
# Select the columns in lookup order, then rename them positionally
tst_rn = tst1.select(old_name).toDF(*new_name)
tst_rn.show()
+-----+----+---+---+--------+---+----+---+
|apple|ball|cat|dog|elephant|fox|goat|hat|
+-----+----+---+---+--------+---+----+---+
|    1|   2|  3|  4|       5|  6|   7|  8|
|    5|   6|  7|  8|       9| 10|  11| 12|
|   13|  14| 15| 16|      17| 18|  19| 20|
+-----+----+---+---+--------+---+----+---+
Once you have collected the column mappings, you can use any of the renaming techniques described in: PySpark - rename more than one column using withColumnRenamed
Hint: collect() will most likely return the pairs in a consistent order, but if you want to be triple sure that the old and new names cannot get mismatched, combine each mapping pair into a single array column with F.array() and then collect. The extraction of the names changes slightly:
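from pyspark.sql import functions as F
# Pack each (short, long) pair into one array column before collecting
tst_array = tst2.withColumn("name_array", F.array(F.col('short'), F.col('long')))
tst_clc = tst_array.collect()
# name_array is the third column (index 2): [old name, new name]
old_name = [str(row[2][0]) for row in tst_clc]
new_name = [str(row[2][1]) for row in tst_clc]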
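The collected mapping can also be turned into a dictionary, which sidesteps ordering concerns and leaves any column that is missing from the lookup untouched. The following is only a minimal sketch in plain Python so that it can be checked without a cluster; new_column_names is a hypothetical helper (not a PySpark function), and the sample column "x" is made up to show the fallback behaviour.

```python
def new_column_names(columns, mapping_rows):
    """Return the renamed columns in the DataFrame's original order.

    columns      -- the current column names, e.g. tst1.columns
    mapping_rows -- (old_name, new_name) pairs, e.g. the rows from tst2.collect()
    Columns without an entry in the lookup keep their current name.
    """
    mapping = {str(old): str(new) for old, new in mapping_rows}
    return [mapping.get(col, col) for col in columns]


# Hypothetical sample data: the lookup is deliberately out of order
# and has no entry for column "x".
columns = ["a", "b", "c", "x"]
mapping_rows = [("c", "cat"), ("a", "apple"), ("b", "ball")]

names = new_column_names(columns, mapping_rows)
print(names)  # ['apple', 'ball', 'cat', 'x']

# With Spark available, the rename itself would then be: tst1.toDF(*names)
```

Because the names are looked up per column rather than matched by position, the order in which collect() returns the lookup rows no longer matters.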

Related

How to do a left anti join when the left DataFrame is aggregated in PySpark

I need to do a left anti join and flatten the table in the most efficient way possible, because the right table is massive.
The first table has around 1,000-10,000 rows:
and the second table is massive (billions of rows):
The desired outcome is:
It is a kind of left anti join, but not exactly.
I tried joining the worker table with the first table and then doing a left anti join, and it works. But duplicating the entire category list for each employee sounds inefficient and creates a huge duplication of data.
If one side of your join is small (like your first table), you can use the broadcast function, which can have a huge impact on performance. You can read more about it at:
https://sparkbyexamples.com/pyspark/pyspark-broadcast-variables/ or in the PySpark documentation.
The code below should do what you want:
from pyspark.sql import functions as F
a_df.show()
b_df.show()
b_df.join(F.broadcast(a_df), ['category', 'item'], 'left_anti').show()
+--------+----+
|category|item|
+--------+----+
|       A| 123|
|       A| 456|
|       B|  19|
|       B|  20|
|       B|  66|
|       C|  38|
+--------+----+
+------+--------+----+
|worker|category|item|
+------+--------+----+
|   jon|       B|  19|
|   jon|       B|  20|
|   jon|       B|  45|
| danni|       B|  18|
| danni|       B|  19|
| danni|       B|  80|
| danni|       B|  94|
|    al|       A| 123|
|   ben|       C| 100|
|   ben|       C|  50|
+------+--------+----+
+--------+----+------+
|category|item|worker|
+--------+----+------+
|       B|  45|   jon|
|       B|  18| danni|
|       B|  80| danni|
|       B|  94| danni|
|       C| 100|   ben|
|       C|  50|   ben|
+--------+----+------+
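To make the join semantics concrete, here is a minimal plain-Python sketch (not Spark) of what a left anti join on ['category', 'item'] computes for the sample rows shown above. The helper name left_anti is hypothetical, and the set built from the small table only loosely mirrors the idea behind broadcasting: the small side is held in memory as a lookup while the large side is scanned once.

```python
# Small table (a_df): the (category, item) pairs to exclude.
a_rows = [("A", 123), ("A", 456), ("B", 19), ("B", 20), ("B", 66), ("C", 38)]

# Large table (b_df): (worker, category, item).
b_rows = [
    ("jon", "B", 19), ("jon", "B", 20), ("jon", "B", 45),
    ("danni", "B", 18), ("danni", "B", 19), ("danni", "B", 80), ("danni", "B", 94),
    ("al", "A", 123),
    ("ben", "C", 100), ("ben", "C", 50),
]


def left_anti(left_rows, right_keys):
    """Keep the left rows whose (category, item) key has no match on the right."""
    lookup = set(right_keys)  # the small side, held in memory
    return [
        (category, item, worker)
        for worker, category, item in left_rows
        if (category, item) not in lookup
    ]


result = left_anti(b_rows, a_rows)
for row in result:
    print(row)
```

The six rows printed correspond to the third table above; al's row is dropped because (A, 123) exists in the small table.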

MySQL join and get rows where two values are set or no value at all

I have two tables:
table "songs"
|id|title |
|1 |song 1|
|2 |song 2|
|3 |song 3|
table "tags"
|id|song_id|type |tag |
|1 |1 |season|christmas|
|2 |1 |time |morning |
|3 |2 |season|summer |
|4 |2 |time |morning |
|5 |2 |time |night |
|6 |3 |time |morning |
For example, I have three tags of type "season": "christmas", "easter" and "valentine". I also have tags of type "time": "morning", "afternoon", "evening" and "night".
How can I get all songs that have the tag type "season" with the tag "christmas", or no "season" tag at all? The songs may still have tags of other types.
I have written this:
SELECT s.title,t.* FROM songs AS s LEFT JOIN tags AS t ON s.id=t.song_id WHERE (t.type='season' AND t.tag='christmas') OR type ... ?
but I don't know how to write the rest of the query so that the "season" type must be absent while other types are still allowed.
Examples:
query 1: find songs that have both the season=christmas and time=morning tags.
query 2: find songs that have no season tag and have time=night.
I hope you understand what I'm trying to achieve.
Your First Query:
SELECT s1.id, s1.title,
       s1.type AS "type1", s1.tag AS "tag1",
       s2.type AS "type2", s2.tag AS "tag2"
FROM (SELECT songs.id, songs.title, tags.type, tags.tag
      FROM songs JOIN tags ON songs.id = tags.song_id
      WHERE tags.type = 'season' AND tags.tag = 'christmas') AS s1
JOIN (SELECT songs.id, songs.title, tags.type, tags.tag
      FROM songs JOIN tags ON songs.id = tags.song_id
      WHERE tags.type = 'time' AND tags.tag = 'morning') AS s2
  ON s1.id = s2.id
Your Second Query:
SELECT songs.id, songs.title, tags.type, tags.tag
FROM songs JOIN tags ON songs.id = tags.song_id
WHERE tags.type = 'time' AND tags.tag = 'night'
  AND songs.id NOT IN (SELECT song_id FROM tags WHERE tags.type = 'season')
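The question's original case (season = christmas, or no season tag at all) can be sketched by left-joining only the season tags and keeping songs whose season tag is christmas or missing. This is a hedged illustration rather than a tested MySQL answer: it runs the query against an in-memory SQLite database from Python, on the assumption that the same standard SQL behaves the same way in MySQL, and the table definitions are guessed from the sample data in the question.

```python
import sqlite3

# In-memory database populated with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE songs (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE tags (id INTEGER PRIMARY KEY, song_id INTEGER, type TEXT, tag TEXT);
INSERT INTO songs VALUES (1, 'song 1'), (2, 'song 2'), (3, 'song 3');
INSERT INTO tags VALUES
  (1, 1, 'season', 'christmas'), (2, 1, 'time', 'morning'),
  (3, 2, 'season', 'summer'),    (4, 2, 'time', 'morning'),
  (5, 2, 'time', 'night'),       (6, 3, 'time', 'morning');
""")

# Join only the 'season' tags: a song without any season tag gets a NULL row,
# so "t.id IS NULL" means "no season tag at all". Other tag types are ignored.
QUERY = """
SELECT DISTINCT s.id, s.title
FROM songs AS s
LEFT JOIN tags AS t ON t.song_id = s.id AND t.type = 'season'
WHERE t.tag = 'christmas' OR t.id IS NULL
ORDER BY s.id
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [(1, 'song 1'), (3, 'song 3')]
```

Song 2 is excluded because its season tag is summer, while song 3 is kept because it has no season tag even though it has a time tag.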

Rails validation In attribute table

I am new to RoR.
I am building a classified ads app, and I have the following tables in my database:
(some fields have been removed for simplicity)
Table Users
This table stores all the users.
user_id
name
email
password
Table Ads
This table stores all the ads.
ad_id
users_user_id (FK)
title
desc
cat_id (FK)
created_at
Sample data:
------------------------------------------------------------------------------
| ad_id | users_user_id | title | desc | cat_id | created_at |
------------------------------------------------------------------------------
| 1 | 1 | iphone 4 | brand new | 2 | 30-11-2015 |
------------------------------------------------------------------------------
Table categories
This table stores all the available categories. cat_id in the ads table relates to cat_id in this table.
cat_id
category
parent_cid
Sample data:
-------------------------------------------
|cat_id| category | parent_cid |
-------------------------------------------
|1 | Electronics | NULL |
|2 | Mobile Phone | 1 |
|3 | Apartments | NULL |
|4 | Apartments - Sale | 3 |
-------------------------------------------
Table ads_attribute
This table contains all the available attributes for a particular category. Relates to categories table.
attr_id
cat_id (FK)
attr_label
attr_name
Sample data:
-----------------------------------------------------------
|attr_id | cat_id | attr_label | attr_name |
-----------------------------------------------------------
|1 | 2 | Operating System | Operating_System |
|2 | 2 | Is Touch Screen | Touch_Screen |
|3 | 2 | Manufacturer | Manufacturer |
|4 | 3 | Bedrooms | Bedrooms |
|5 | 3 | Total Area | Area |
|6 | 3 | Posted By | Posted_By |
-----------------------------------------------------------
Table ads_attr_value
This table stores the attribute value for each ad in ads table.
attr_val_id
attr_id (FK)
ad_id
attr_val
Sample data:
---------------------------------------------
|attr_val_id | attr_id | ad_id | attr_val |
---------------------------------------------
|1 | 1 | 1 | Ios 8 |
|2 | 2 | 1 | 1 |
|3 | 3 | 1 | Apple |
---------------------------------------------
What is the best way (the Rails way) to validate the data before storing it in the ads_attr_value table, given that the values come from select fields and the user can easily change them, for example from "Ios 8" to "blabla"?
I've thought of storing all the possible values for each attribute in a new table and then checking whether a value sent by the user exists in that table before storing it in ads_attr_value. What do you think? I am sure there is a better way. Thanks for sharing.
The Rails way would probably be to define your relationships with ActiveRecord associations: http://guides.rubyonrails.org/association_basics.html.
You could then easily define on your model:
class AdsAttrVal < ActiveRecord::Base
  belongs_to :ad
  validates :ad, presence: true
end
However, please keep in mind that the Rails way to name a table's primary key is "id", not "model_id" as you did ("user_id", "ad_id"). My example assumes that the Rails conventions are followed.
You have to specify the validations you want inside <yourModel>.rb (the model file). For example, if you want to validate that ad_id is a number, add the numericality option to the validates statement, as shown below:
class AdsAttrValue < ActiveRecord::Base
  validates :ad_id, numericality: true
  # validate that attr_val has one of the permitted values
  validate :my_custom_validation
  def my_custom_validation
    # your validation logic goes here
    # all the fields of the object being saved are accessible here
    if attr_val == something # "something" is a placeholder for your own condition
      # do something, e.g. record a failure with errors.add
    end
  end
end
Note that Rails' built-in validations end with an s (validates), while your own custom validations do not (validate).
These validations run before the object is stored in the database; if the object does not comply, it is not saved. You can add errors in your own validation to let the user know what went wrong. For more detail, read up on validations in Ruby on Rails.

How to make a relation between two table using has_one but none of them stores id of its adjacent table

I have a table named Strategy which looks like this:
+----+---+---+
| id | a | b |
+----+---+---+
| 1  | 1 | 2 |
| 2  | 2 | 3 |
| 3  | 2 | 4 |
| 4  | 4 | 1 |
| 5  | 4 | 4 |
+----+---+---+
and I have a class named Plan which looks like this:
class Plan < ActiveRecord::Base
  attr_accessible :a, :b
  has_one :strategy ...................
end
Now I want to fill in this empty has_one relation. The Strategy table has columns a and b, and each row always holds a different combination of the two, as shown in the table. I don't know how to build a relationship between the two tables, because Strategy doesn't belong to Plan and Plan doesn't store the Strategy id. The Strategy table only contains some static values. But Plan has its own a and b values, and I want the id of the Strategy row that matches a plan's a and b values through a has_one relation. Is it possible to do that in Rails, or do I need to fall back on some older logic?
You can build the association manually.
In your Plan model:
def strategy
  @strategy ||= Strategy.where("a = ? AND b = ?", a, b).first
end
This will let you reference my_plan.strategy in your code.
EDIT answer updated to include Chandranshu's excellent suggestions... memoization and handling the case of more than one strategy with duplicate a, b values

How to delete a record from only the associated table with a has_and_belongs_to_many relationship

I have two models, hotel and theme, which have a has_and_belongs_to_many relationship with each other.
The third table is named hotels_themes, and I want to delete a record only from that table.
hotels_themes;
+----------+----------+
| hotel_id | theme_id |
+----------+----------+
| 8 | 4 |
| 9 | 5 |
| 11 | 2 |
| 11 | 4 |
| 11 | 6 |
| 12 | 2 |
| 12 | 5 |
+----------+----------+
I want to delete the record that matches a given hotel_id and theme_id,
like the SQL query: delete from hotels_themes where hotel_id=9 and theme_id=5
Use the method delete added to HABTM collections:
hotel = Hotel.find(hotel_id)
theme = Theme.find(theme_id)
hotel.themes.delete(theme)
You can also empty out the association on either model instance, which removes all of that instance's rows from hotels_themes, depending on what you are trying to remove. For example:
hotel.themes = []
# or
theme.hotels = []
