Fuzzy join on multiple columns with Spark - join

I have two Spark RDDs without a common key that I need to join.
The first RDD comes from a Cassandra table and contains the reference set of items (id, item_name, item_type, item_size), for example: (1, 'item 1', 'type_a', 20).
The second RDD is imported each night from another system and contains roughly the same data, but without an id and in raw form (raw_item_name, raw_type, raw_item_size), for example ('item 1.', 'type a', 20).
Now I need to join those two RDDs based on the similarity of the data. Right now the size of the RDDs is about 10,000, but it will grow in the future.
My current solution (sketched below) is: a cartesian join of both RDDs, then calculating the distance between the reference and raw attributes for each row, then grouping by id and selecting the best match.
At this size the solution works, but I'm afraid that in the future the cartesian join might simply be too big.
What would be a better solution?
I tried to look at Spark MLlib but didn't know where to start, which algorithm to use, etc. Any advice will be greatly appreciated.
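For reference, a minimal sketch of the cartesian approach described above (Scala; the record layout and the similarity function are illustrative assumptions, not the original code):

import org.apache.spark.rdd.RDD

case class RefItem(id: Long, name: String, itemType: String, size: Int)
case class RawItem(name: String, itemType: String, size: Int)

// Placeholder similarity combining a crude name match and a size distance.
def similarity(ref: RefItem, raw: RawItem): Double = {
  val nameScore = if (ref.name.equalsIgnoreCase(raw.name.stripSuffix("."))) 1.0 else 0.0
  val sizeScore = 1.0 / (1.0 + math.abs(ref.size - raw.size))
  nameScore + sizeScore
}

def bestMatches(refs: RDD[RefItem], raws: RDD[RawItem]): RDD[(Long, (RawItem, Double))] =
  refs.cartesian(raws)                                  // all (ref, raw) pairs -- O(n * m)
    .map { case (ref, raw) => (ref.id, (raw, similarity(ref, raw))) }
    .reduceByKey((a, b) => if (a._2 >= b._2) a else b)  // keep the best-scoring raw item per id

(Note that the reduceByKey keeps only one best match per id; the cartesian step is the part that grows quadratically.)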

Related

bucketing in non equi join in hive

Currently Hive does support non-equi joins.
But as the cross product becomes pretty huge, I was wondering what the options are for joining a large fact table (257 billion rows, 37 TB) with a relatively smaller (8.7 GB) dimension table.
In the case of an equi join I can make it work easily with proper bucketing on the join column(s), using the same number of buckets for a sort-merge bucket map join and practically converting it to a map join. But this won't be of any advantage for a non-equi join, because the matching values will also sit in other buckets, practically triggering a shuffle, i.e. a reduce phase.
If anyone has any thoughts on how to overcome this, please suggest.
If the dimension table fits in memory, you can create a custom user-defined function (UDF) as stated here, and perform the non-equi join in memory.
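A minimal sketch of that idea: a Hive UDF (written here in Scala) that loads the dimension table into memory once and answers range lookups, so the fact table is never cross-joined against it. The file path, column layout, and range semantics are illustrative assumptions.

import org.apache.hadoop.hive.ql.exec.UDF
import scala.io.Source

// Hypothetical range-lookup UDF over an exported copy of the small dimension table.
class RangeLookupUDF extends UDF {
  // (dim_id, lower_bound, upper_bound) rows, loaded lazily on first use and kept in memory
  private lazy val dim: Array[(String, Long, Long)] =
    Source.fromFile("/tmp/dim_table.tsv").getLines()
      .map(_.split("\t"))
      .map(f => (f(0), f(1).toLong, f(2).toLong))
      .toArray

  // Return the id of the first dimension row whose [lower, upper] range contains the value,
  // or null if none matches.
  def evaluate(value: Long): String =
    dim.collectFirst { case (id, lo, hi) if value >= lo && value <= hi => id }.orNull
}

Registered with CREATE TEMPORARY FUNCTION, it could then be called in the SELECT over the fact table, replacing the non-equi join condition with a per-row lookup.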

How to design this spark join

I need to join two big RDDs, potentially twice. Any help designing these joins is appreciated.
Here is the problem:
The first RDD is (productIdA, productIdB, similarity) and its size is about 100 GB.
The second RDD is (customerId, productId, boughtPrice) and its size is about 35 GB.
The result RDD I want is (productIdA, productIdB, similarity, customerIds who bought both product A and B).
Because I cannot broadcast either RDD (both of them are quite big), my design is to aggregate the second RDD by product id and then join it with the first RDD twice, but I get huge shuffle spill and all kinds of errors (OOM, or running out of space because of the shuffle). Putting the errors aside, I would like to know whether there is a better way to achieve the same result. Thanks
Do you have a row for every product pairing in the first RDD?
If you do (or it's close), then you might want to do something like the following: group the second RDD by customerId, create an element for every pairing, then rearrange and group that RDD by pairing, then group to get a list of customerIds, and finally join to add in the similarity (see the sketch below).
(Whether this will result in more or less math depends, I think, on the distribution of the number of products purchased per customer.)
As zero323's comment also implies, once you have the pairings from grouping on customerId, it might be cheaper to recalculate the similarity than to join against a huge dataset.
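A rough sketch of that pairing approach (Scala; the field types and the final join back to the similarity RDD are assumptions for illustration):

import org.apache.spark.rdd.RDD

// purchases: (customerId, productId); similarities: ((productIdA, productIdB), similarity)
def customersPerPair(purchases: RDD[(Long, Long)],
                     similarities: RDD[((Long, Long), Double)])
  : RDD[((Long, Long), (Double, Iterable[Long]))] = {

  val pairs = purchases
    .groupByKey()                                  // customerId -> all products they bought
    .flatMap { case (customer, products) =>
      val ps = products.toSeq.distinct.sorted
      for (a <- ps; b <- ps if a < b)              // every unordered product pairing
        yield ((a, b), customer)
    }
    .groupByKey()                                  // (productIdA, productIdB) -> customerIds

  similarities.join(pairs)                         // attach the precomputed similarity
}

If the similarity can be recomputed cheaply, the last join can be dropped and the similarity calculated directly from the pairing, as suggested above.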

Redshift performance tuning on a JOIN query

I'm having trouble with performance on the following query:
SELECT [COLUMNS] FROM TABLE A JOIN TABLE B ON [KEYS]
If I remove the join, leaving only the SELECT, the query takes seconds. With the join, it takes 30 minutes.
Table sizes are A (844,082,912 rows) and B (1,540,379,815 rows).
The distribution and sort keys are equivalent to the join KEYS.
Looking at the AWS graphs, I see (attached) one node that hits roughly 100% CPU utilisation for a short time.
Looking at the system table (svv_diskusage) I am not sure what I am seeing (attached), as it does not indicate (as far as I can tell) that one node holds much more data than the others.
If the issue is a faulty distribution, how can I see it?
Is it something else?
Here https://aws.amazon.com/articles/8341516668711341 (Uneven Distribution) you can see an example of the same graph style: one node working harder than the others, which indicates your data is not evenly distributed.
Regarding svv_diskusage, it describes the values stored in each slice. If the slices are not used relatively evenly, that is an indicator of a bad distribution key. Try the following query to get a higher-level view of the distribution among nodes rather than slices:
select owner, host, diskno, used, capacity,
       (used - tossed) / capacity::numeric * 100 as pctused
from stv_partitions
order by owner;

To double-check which distribution and sort keys the table was actually created with, you can also inspect its definition:

set search_path to '$user', 'public', 'ic';
select * from pg_table_def where tablename = '{TableNameHere}';

Google Big Query: How to select subset of smaller table using nested fake join?

I would like to solve the problem from How can I join two tables using intervals in Google Big Query? by selecting a subset of the smaller table.
I wanted to use the solution by @FelipeHoffa based on the row_number function (Row number in BigQuery?).
I have created a nested query as follows:
SELECT a.DevID DeviceId,
       a.device_make OS
FROM
  (SELECT device_id DevID, device_make, A, lat, long, is_gps
   FROM [Data.PlacesMaster]
   WHERE not device_id is null and is_gps is true) a
JOIN
  (SELECT ROW_NUMBER() OVER() row_number,
          top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A, count
   FROM (SELECT top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A, COUNT(*) count
         FROM [Karol.fast_food_box]
         GROUP BY (....?)
         ORDER BY COUNT DESC,
         WHERE row_number BETWEEN 1000 AND 2000)) b ON a.A = b.A
WHERE (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
  AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
GROUP EACH BY DeviceId,
              OS
Could you help me finalise it, please? I cannot break the smaller table with a GROUP BY; I need consistency between the two tables and to select only the items from MASTER.table whose lat/long fit into a given bounding box of the smaller table. I really need to match lat/long against a box; my solution from How can I join two tables using intervals in Google Big Query? works only for small tables (approx. 1000 to 2000 rows), hence this question. Thank you in advance.
It looks like you're applying two approaches at once: (1) split a table into chunks of rows and run on each chunk, and (2) include a field, "A", tagging your boxes and your points with 'regions' that you can equi-join on. Approach (1) just does the same total work in more pieces (and adds complication), so I would suggest focusing on approach (2), which cuts the work down to roughly quadratic within each 'region' rather than quadratic in the size of the whole world.
So the key thing is what values your A takes on, and how many points and boxes carry each A value. For example, if A is a country code, that has the right logical structure, but it probably doesn't help enough once you get a lot of data in any one country. If it goes down to the state or province code, that gets you one step farther. Quantized lat/long grid cells generalize better. Sooner or later you have to deal with objects that fall across a region edge, which can be somewhat tricky. I would use a lat/long grid myself.
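For illustration, a quantized grid-cell key could be derived roughly like this (a sketch in Scala; the 0.1-degree cell size and the key format are assumptions, and a box that spans several cells has to be emitted once per cell it overlaps):

// Quantize a coordinate to a grid cell id; points and boxes sharing a cell id
// can then be equi-joined on it.
def cellId(lat: Double, long: Double, cellSizeDeg: Double = 0.1): String = {
  val latCell  = math.floor(lat / cellSizeDeg).toInt
  val longCell = math.floor(long / cellSizeDeg).toInt
  s"$latCell:$longCell"
}

// All cell ids covered by a bounding box, so the box can participate in every cell it touches.
def boxCellIds(topLeftLat: Double, topLeftLong: Double,
               bottomRightLat: Double, bottomRightLong: Double,
               cellSizeDeg: Double = 0.1): Seq[String] =
  for {
    latCell  <- math.floor(bottomRightLat / cellSizeDeg).toInt to math.floor(topLeftLat / cellSizeDeg).toInt
    longCell <- math.floor(topLeftLong / cellSizeDeg).toInt to math.floor(bottomRightLong / cellSizeDeg).toInt
  } yield s"$latCell:$longCell"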
What A values are you using? In your data, what is the A value with the maximum (number of points * number of boxes)?

Hadoop's Map-side join implements Hash join?

I am trying to implement a hash join in Hadoop.
However, Hadoop already seems to have a map-side join and a reduce-side join implemented.
What is the difference between these techniques and a hash join?
Map-side Join
In a map-side (fragment-replicate) join, you hold one dataset in memory (say, in a hash table) and join on the other dataset, record by record. In Pig, you'd write
edges_from_list = JOIN a_follows_b BY user_a_id, some_list BY user_id using 'replicated';
taking care that the smaller dataset is on the right. This is extremely efficient, as there is no network overhead and minimal CPU demand.
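The same idea written out directly (a sketch in Scala rather than actual Hadoop/Pig internals: the small side is loaded into a hash map on each mapper and the large side is streamed past it):

// Fragment-replicate / map-side hash join: build a hash table over the small
// dataset, then probe it once per record of the large dataset.
def mapSideJoin[K, A, B](small: Seq[(K, A)], large: Iterator[(K, B)]): Iterator[(K, (B, A))] = {
  val buildSide: Map[K, A] = small.toMap            // must fit in memory on every mapper
  large.flatMap { case (k, b) =>
    buildSide.get(k).map(a => (k, (b, a)))          // inner join: drop unmatched large-side records
  }
}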
Reduce Join
In a reduce-side join, you group on the join key using Hadoop's standard merge sort, giving you, for each key, the cogrouped sets of values:
<user_id {A, B, F, ..., Z}, {A, C, G, ..., Q}>
and emit a record for every pair of an element from the first set with an element from the second set:
[A user_id A]
[A user_id C]
...
[A user_id Q]
...
[Z user_id Q]
You should design your keys so that the dataset with the fewest records per key comes first -- you need to hold the first group in memory and stream the second one past it. In Pig, for a standard join you accomplish this by putting the largest dataset last (as opposed to the fragment-replicate join, where the in-memory dataset is given last).
Note that for a map-side join the entirety of the smaller dataset must fit in memory. In a standard reduce-side join, only each key's group must fit in memory (actually, each key's group except the last one, which can be streamed). It's possible to avoid even this restriction, but it requires care; look, for example, at the skewed join in Pig.
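A sketch of what the reducer does for one key under this scheme (Scala; tagging values with which dataset they came from is assumed to have happened in the map phase):

// Reduce-side join for a single key: the first (smaller) group is buffered in memory,
// the second (larger) group is streamed and paired against it.
def reduceSideJoinForKey[K, A, B](key: K,
                                  firstGroup: Iterator[A],
                                  secondGroup: Iterator[B]): Iterator[(A, K, B)] = {
  val buffered = firstGroup.toVector                 // held in memory; keep the smaller side here
  secondGroup.flatMap(b => buffered.iterator.map(a => (a, key, b)))
}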
Merge Join
Finally, if both datasets are stored in total-sorted order on the join key, you can do a merge join on the map side. As in the reduce-side join, you do a merge sort to cogroup on the join key, and then project (flatten) back out into pairs.
Because of this, when generating a frequently-read dataset it is often a good idea to do a total sort in the last pass. Zebra and other databases may also give you total-sorted input for (almost) free.
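The core of that merge join, sketched in Scala (both inputs are assumed to already be sorted by the join key; only one key's group per side is held at a time, and it is written recursively for clarity):

// Merge join over two inputs that are both sorted by the join key.
def mergeJoin[K, A, B](left: List[(K, A)], right: List[(K, B)])
                      (implicit ord: Ordering[K]): List[(K, (A, B))] = (left, right) match {
  case ((kl, _) :: _, (kr, _) :: _) if ord.lt(kl, kr) => mergeJoin(left.tail, right)  // advance the side that is behind
  case ((kl, _) :: _, (kr, _) :: _) if ord.gt(kl, kr) => mergeJoin(left, right.tail)
  case ((k, _) :: _, _ :: _) =>
    val (lGroup, lRest) = left.span(_._1 == k)       // cogroup the current key on both sides
    val (rGroup, rRest) = right.span(_._1 == k)
    val pairs = for ((_, a) <- lGroup; (_, b) <- rGroup) yield (k, (a, b))
    pairs ::: mergeJoin(lRest, rRest)                // flatten out the pairs and continue
  case _ => Nil                                      // one side exhausted: no more matches
}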
Both of these Hadoop joins are merge joins, which require (explicit) sorting beforehand.
A hash join, on the other hand, does not require sorting; instead it partitions the data by some hash function.
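For contrast, a sketch of a partitioned hash join (Scala; both inputs are split by the same hash function and each partition pair is joined in memory, with no sorting involved):

// Partitioned hash join: co-partition both inputs by hash(key) % numPartitions,
// then run an in-memory hash join per partition.
def hashJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)], numPartitions: Int = 8): Seq[(K, (A, B))] = {
  def part(k: K): Int = ((k.hashCode % numPartitions) + numPartitions) % numPartitions

  val leftParts  = left.groupBy { case (k, _) => part(k) }
  val rightParts = right.groupBy { case (k, _) => part(k) }

  (0 until numPartitions).flatMap { p =>
    // hash table over this partition of the left input
    val build: Map[K, Seq[A]] =
      leftParts.getOrElse(p, Seq.empty).groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    rightParts.getOrElse(p, Seq.empty).flatMap { case (k, b) =>
      build.getOrElse(k, Seq.empty).map(a => (k, (a, b)))
    }
  }
}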
A detailed discussion can be found in the section "Relational Joins" of Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, a well-written book that is free and open source.
