How to design this Spark join

I need to join two big RDDs, potentially twice. Any help designing these joins is appreciated.
Here is the problem:
The first RDD is (productIdA, productIdB, similarity) and its size is about 100 GB.
The second RDD is (customerId, productId, boughtPrice) and its size is about 35 GB.
The result RDD I want is (productIdA, productIdB, similarity, customerIds who bought both product A and product B).
Because both RDDs are too big to broadcast, my design is to aggregate the second RDD by product ID and then join it with the first RDD twice, but I get huge shuffle spill and all kinds of errors (OOM, or out of disk space because of the shuffle). Putting the errors aside, I would like to know whether there is a better way to achieve the same result. Thanks.

Do you have a row for every product pairing in the first RDD?
If you do (or it's close), you might group the second RDD by customerId, create an element for every pairing of products that customer bought, re-key and group that RDD by pairing to get a list of customerIds per pairing, and then join to add in the similarity.
(Whether or not this will result in more or less math depends, I think, on the distribution of number of products purchased per customer.)
Like zero323's comment also implies, once you have the pairings from grouping on customerId, it might be cheaper to recalculate the similarity than to join on a huge dataset.
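The steps described in this answer can be sketched in plain Python on small in-memory lists. This is only an illustration of the data flow, not Spark code: the function names `customers_per_pair` and `attach_similarity` are made up for this example, and in Spark each step would correspond to a distributed grouping, flat-mapping, or join operation.

```python
from collections import defaultdict
from itertools import combinations

def customers_per_pair(purchases):
    """purchases: iterable of (customer_id, product_id) tuples (hypothetical layout)."""
    # Step 1: group products by customer.
    products_by_customer = defaultdict(set)
    for customer_id, product_id in purchases:
        products_by_customer[customer_id].add(product_id)

    # Step 2: emit one element per product pairing, then group by pairing.
    customers_by_pair = defaultdict(set)
    for customer_id, products in products_by_customer.items():
        for pair in combinations(sorted(products), 2):
            customers_by_pair[pair].add(customer_id)
    return customers_by_pair

def attach_similarity(similarities, customers_by_pair):
    """similarities: iterable of (product_a, product_b, similarity) tuples."""
    rows = []
    for a, b, sim in similarities:
        key = tuple(sorted((a, b)))  # pairings are stored in sorted order
        rows.append((a, b, sim, sorted(customers_by_pair.get(key, ()))))
    return rows

purchases = [("c1", "p1"), ("c1", "p2"), ("c2", "p1"), ("c2", "p2"), ("c2", "p3")]
similarities = [("p1", "p2", 0.9), ("p2", "p3", 0.4), ("p1", "p4", 0.1)]
rows = attach_similarity(similarities, customers_per_pair(purchases))
```

The number of pairings emitted in step 2 grows quadratically with the number of products per customer, which is why the distribution of purchases per customer matters.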

Related

Combining additive and semi-additive facts in a single report

I'm working on a quarterly report. The report should look something like this:
Column            | Calculation                                          | Source table
------------------+------------------------------------------------------+----------------
Start_Balance     | Sum at start of time period                          | Account_balance
Sell Transactions | Sum of all sell values between the two time periods  | Transactions
Buy Transactions  | Sum of all buy values between the two time periods   | Transactions
End Balance       | Sum at the end of time period                        | Account_balance
So, for example:
Calculation       | Sum
------------------+------
Start_Balance     | 1000
Sell Transactions | 500
Buy Transactions  | 750
End Balance       | 1250
The problem here is that I'm working with a relational star schema in which one of the facts is semi-additive and the other is additive, so they behave differently on the time dimension.
In my case I'm using Cognos Analytics, but I think this problem applies to any BI tool. What would be the best practice for dealing with this issue? I'm certain I can come up with some SQL query that combines these two tables into one table which the report reads from, but this doesn't seem like best practice, or is it? Another approach would be to create some measures in the BI tool. I'm not a big fan of this approach because it seems to be the least sustainable one, and I'm unfamiliar with it.
For Cognos, you can stitch the tables.
The technique has to do with how Cognos aggregates.
Framework Manager joins are typically 1 to n for describing the relationship: in a star schema the fact table sits in the middle and represents the n, while the outer tables that describe or group the data represent the 1.
Fact tables (quantitative data, the stuff you want to sum) should be on the many side of the relationship.
Descriptive tables (qualitative data, the stuff you want to describe or group by) should be on the 1 side instead of the many.
To stitch, we have multiple tables that we want to treat as facts.
Take the common tables that you would use for grouping, like the period (there are probably some others, like company or customer).
Connect each of the fact tables to the common tables (aka dimensions) like this:
Account_balance N to 1 Company
Account_balance N to 1 Period
Account_balance N to 1 Customer
Transactions N to 1 Company
Transactions N to 1 Period
Transactions N to 1 Customer
This will cause Cognos to perform a full outer join with a coalesce, allowing you to handle the fact tables even though they have different levels of granularity.
Remember that with an outer join you may have to handle nulls, and you may need to use the summary filter depending on your reporting needs.
You will want to include the common tables in your report, which might conflict with how you want the report to look.
An easy workaround is to add them to the layout and then set the box type property to none, so the SQL behaves the way you want and the report looks the way you want.
You'll probably need to set up determinants in the Framework Manager model. The following does a good job of explaining this:
https://www.ibm.com/docs/en/cognos-analytics/11.0.0?topic=concepts-multiple-fact-multiple-grain-queries
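
As a rough illustration of what the stitched query amounts to, the sketch below aggregates each fact to the shared grain (the period) on its own and then combines the results with a full outer join on the period key. It is plain Python, not what Cognos generates; the row layouts, the `stitch` name, and the rule that a balance is represented by the last snapshot in the period are assumptions made for this example.

```python
def stitch(balance_rows, transaction_rows):
    """balance_rows: (period, day, balance); transaction_rows: (period, kind, amount).

    Both layouts are hypothetical and chosen only for this illustration.
    """
    # Semi-additive fact: keep the last snapshot per period, never sum over time.
    end_balance = {}
    for period, day, balance in sorted(balance_rows, key=lambda r: (r[0], r[1])):
        end_balance[period] = balance

    # Additive fact: sum freely within each period.
    totals = {}
    for period, kind, amount in transaction_rows:
        totals.setdefault(period, {"sell": 0, "buy": 0})[kind] += amount

    # Full outer join on the shared period key; a missing side becomes None (null).
    report = []
    for period in sorted(set(end_balance) | set(totals)):
        t = totals.get(period, {})
        report.append((period, end_balance.get(period), t.get("sell"), t.get("buy")))
    return report

balances = [("Q1", 1, 1000), ("Q1", 90, 1250)]
transactions = [("Q1", "sell", 500), ("Q1", "buy", 750), ("Q2", "buy", 100)]
report = stitch(balances, transactions)
```

The Q2 row has no balance, which is the kind of null the outer join can introduce and the report has to handle.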

Fuzzy join on multiple columns with Spark

I have two Spark RDDs without a common key that I need to join.
The first RDD comes from a Cassandra table and contains a reference set of items (id, item_name, item_type, item_size), for example: (1, 'item 1', 'type_a', 20).
The second RDD is imported each night from another system and contains roughly the same data, without an id and in raw form (raw_item_name, raw_type, raw_item_size), for example: ('item 1.', 'type a', 20).
Now I need to join those two RDDs based on the similarity of the data. Right now the size of each RDD is about 10000 rows, but it will grow in the future.
My current solution is: a cartesian join of both RDDs, then calculating the distance between the ref and raw attributes for each row, then grouping by id and selecting the best match.
At this size the solution works, but I'm afraid that in the future the cartesian join might become too big.
What would be a better solution?
I tried to look at Spark MLlib but didn't know where to start, which algorithm to use, etc. Any advice will be greatly appreciated.
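
For reference, the cartesian approach described in the question can be sketched in plain Python as follows. This is not the asker's code: the scoring function, its equal weighting of the three attributes, and the use of `difflib.SequenceMatcher` as the string distance are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import product

def similarity(ref, raw):
    """Hypothetical score in [0, 1]: the mean of name, type and size agreement."""
    _id, name, item_type, size = ref
    raw_name, raw_type, raw_size = raw
    name_score = SequenceMatcher(None, name.lower(), raw_name.lower()).ratio()
    type_score = SequenceMatcher(None, item_type.lower(), raw_type.lower()).ratio()
    size_score = 1.0 if size == raw_size else 0.0
    return (name_score + type_score + size_score) / 3

def best_matches(refs, raws):
    """Compare every ref with every raw row and keep the best raw row per ref id."""
    best = {}
    for ref, raw in product(refs, raws):  # the cartesian join
        score = similarity(ref, raw)
        if ref[0] not in best or score > best[ref[0]][1]:
            best[ref[0]] = (raw, score)
    return best

refs = [(1, "item 1", "type_a", 20), (2, "widget", "type_b", 5)]
raws = [("item 1.", "type a", 20), ("widgett", "type b", 5)]
matches = best_matches(refs, raws)
```

The loop over `product(refs, raws)` performs len(refs) * len(raws) comparisons, which is the quadratic cost the question is worried about.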

Redshift performance tuning on a JOIN query

I'm having trouble with performance on the following query:
SELECT [COLUMNS] FROM TABLE A JOIN TABLE B ON [KEYS]
If I remove the join, leaving only the select the query takes seconds. With the join, it takes 30 minutes.
Table sizes are A (844,082,912) & B (1,540,379,815) rows.
Distribution and sort keys are equivalent to the join KEYS.
Looking at the AWS graphs, I see (attached) one node that reaches about 100% CPU utilisation for a short time.
Looking at the system table (svv_diskusage), I am not sure what I am seeing (attached), as it does not indicate (as far as I can tell) whether one node has much more data than the others.
If the issue is faulty distribution, how can I see it?
Is it something else?
Here https://aws.amazon.com/articles/8341516668711341 (Uneven Distribution) you can see an example of the same graph style: one node is working harder than the others, which indicates your data is not evenly distributed.
Regarding svv_diskusage, it describes the values stored in each slice. If the slices are not used relatively evenly, that indicates a bad distribution key. Try the following query to get a higher-level view of the distribution among nodes rather than slices:
select owner, host, diskno, used, capacity,
(used-tossed)/capacity::numeric *100 as pctused
from stv_partitions order by owner;
set search_path to '$user', 'public', 'ic';
select * from pg_table_def where tablename = '{TableNameHere}';

Google Big Query: How to select subset of smaller table using nested fake join?

I would like to solve the problem described in "How can I join two tables using intervals in Google Big Query?" by selecting a subset of the smaller table.
I wanted to use the solution by @FelipeHoffa that uses the row_number function (see "Row number in BigQuery?").
I have created a nested query as follows:
SELECT a.DevID DeviceId,
a.device_make OS
FROM
(SELECT device_id DevID, device_make, A, lat, long, is_gps
FROM [Data.PlacesMaster] WHERE not device_id is null and is_gps is true) a JOIN (select ROW_NUMBER() OVER() row_number,top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A, count from (SELECT top_left_lat, top_left_long, bottom_right_lat,bottom_right_long, A, COUNT(*) count from [Karol.fast_food_box]
GROUP BY (....?)
ORDER BY COUNT DESC,
WHERE row_number BETWEEN 1000 AND 2000)) b ON a.A=b.A
WHERE (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
GROUP EACH BY DeviceId,
OS
Could you help me finalise it, please? I cannot break up the smaller table with "group by"; I need to keep the two tables consistent and select only the items with lat, long from the MASTER table that fit into a given bounding box of the smaller table. I really need to match lat, long to a box. My solution from "How can I join two tables using intervals in Google Big Query?" works only for small tables (approx. 1000 to 2000 rows), hence this issue. Thank you in advance.
It looks like you're applying two approaches at once: 1) split a table into chunks of rows, and run on each, and 2) include a field, "A", tagging your boxes and your points into 'regions', that you can equi-join on. Approach (1) just does the same total work in more pieces (also, it's adding complication), so I would suggest focusing on approach (2), which cuts the work down to be ~quadratic in each 'region' rather than quadratic in the size of the whole world.
So the key thing is what values your A takes on, and how many points and boxes carry each A value. For example, if A is a country code, that has the right logical structure, but it probably doesn't help enough once you get a lot of data in any one country. If it goes to the state or province code, that gets you one step farther. Quantized lat/long grid cells generalize better. Sooner or later you do have to deal with falling across a region edge, which can be somewhat tricky. I would use a lat/long grid myself.
What A values are you using? In your data, what is the A value with the maximum (number of points * number of boxes)?
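
A minimal sketch of the lat/long grid idea, in plain Python rather than BigQuery SQL. The cell size `CELL`, the tuple layouts, and the function names are assumptions for this example. Each box is registered in every grid cell it overlaps, which is one simple way to deal with boxes that fall across a cell edge; each point is then only compared with the boxes registered in its own cell.

```python
import math
from collections import defaultdict

CELL = 1.0  # grid cell size in degrees; a hypothetical tuning knob

def cell_of(lat, lon):
    """Quantize a coordinate to the grid cell that contains it."""
    return (math.floor(lat / CELL), math.floor(lon / CELL))

def index_boxes(boxes):
    """boxes: (box_id, top_left_lat, top_left_long, bottom_right_lat, bottom_right_long)."""
    index = defaultdict(list)
    for box in boxes:
        _, tl_lat, tl_lon, br_lat, br_lon = box
        lat_lo, lon_lo = cell_of(br_lat, tl_lon)
        lat_hi, lon_hi = cell_of(tl_lat, br_lon)
        # Register the box in every cell it overlaps.
        for i in range(lat_lo, lat_hi + 1):
            for j in range(lon_lo, lon_hi + 1):
                index[(i, j)].append(box)
    return index

def match_points(points, boxes):
    """points: (point_id, lat, long). Returns (point_id, box_id) pairs."""
    index = index_boxes(boxes)
    matches = []
    for point_id, lat, lon in points:
        # Only boxes sharing the point's cell are candidates (the equi-join on A).
        for box_id, tl_lat, tl_lon, br_lat, br_lon in index.get(cell_of(lat, lon), []):
            if br_lat <= lat <= tl_lat and tl_lon <= lon <= br_lon:
                matches.append((point_id, box_id))
    return matches

boxes = [("b1", 10.5, 20.5, 9.5, 21.5)]
points = [("p1", 10.2, 21.1), ("p2", 9.7, 20.7), ("p3", 11.0, 20.7)]
pairs = match_points(points, boxes)
```

The sketch assumes top_left_lat >= bottom_right_lat and top_left_long <= bottom_right_long, matching the BETWEEN conditions in the question's query.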

Hadoop's Map-side join implements Hash join?

I am trying to implement a hash join in Hadoop.
However, Hadoop seems to have a map-side join and a reduce-side join already implemented.
What is the difference between these techniques and a hash join?
Map-side Join
In a map-side (fragment-replicate) join, you hold one dataset in memory (in say a hash table) and join on the other dataset, record-by-record. In Pig, you'd write
edges_from_list = JOIN a_follows_b BY user_a_id, some_list BY user_id using 'replicated';
taking care that the smaller dataset is on the right. This is extremely efficient, as there is no network overhead and minimal CPU demand.
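
The idea behind the replicated join can be sketched in plain Python: build a hash table from the small dataset, then stream the large one past it. This is an illustration only, not how Pig or Hadoop implement it, and the function name and tuple layouts are made up for the example.

```python
def replicated_join(large, small, large_key, small_key):
    """Join two iterables of tuples; `small` must fit in memory."""
    # Build phase: hash the small dataset on its join key.
    table = {}
    for rec in small:
        table.setdefault(rec[small_key], []).append(rec)
    # Probe phase: stream the large dataset record by record.
    for rec in large:
        for match in table.get(rec[large_key], ()):
            yield rec, match

a_follows_b = [("u1", "u2"), ("u3", "u4"), ("u1", "u5")]
some_list = [("u1",), ("u9",)]
edges_from_list = list(replicated_join(a_follows_b, some_list, 0, 0))
```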
Reduce Join
In a reduce-side join, you group on the join key using Hadoop's standard merge sort, which gives one group per key holding the values from each dataset:
<user_id {A, B, F, ..., Z}, { A, C, G, ..., Q} >
and emit a record for every pair of an element from the first set with an element from the second set:
[A user_id A]
[A user_id C]
...
[A user_id Q]
...
[Z user_id Q]
You should design your keys so that the dataset with the fewest records per key comes first -- you need to hold the first group in memory and stream the second one past it. In Pig, for a standard join you accomplish this by putting the largest dataset last. (As opposed to the fragment-replicate join, where the in-memory dataset is given last).
Note that for a map-side join the entirety of the smaller dataset must fit in memory. In a standard reduce-side join, only each key's groups must fit in memory (actually each key's group except the last one). It's possible to avoid even this restriction, but it requires care; look for example at the skewed join in Pig.
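
A single-process sketch of the reduce-side join, assuming each record is a (key, value) pair. Tagging records with 0 or 1 and sorting on (key, tag) stands in for the shuffle, so that within each key the first dataset's records arrive before the second's; the names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def reduce_side_join(first, second):
    """first, second: iterables of (key, value). Yields (first_value, key, second_value)."""
    # Tag each record with its source so the first dataset sorts ahead within a key.
    tagged = [(k, 0, v) for k, v in first] + [(k, 1, v) for k, v in second]
    tagged.sort(key=itemgetter(0, 1))  # stands in for the shuffle's merge sort
    for key, group in groupby(tagged, key=itemgetter(0)):
        buffered = []  # the first dataset's group for this key, held in memory
        for _, tag, value in group:
            if tag == 0:
                buffered.append(value)
            else:
                # Stream the second dataset's records past the buffered group.
                for left in buffered:
                    yield (left, key, value)

first = [("user_id", "A"), ("user_id", "B")]
second = [("user_id", "C"), ("user_id", "Q"), ("other_id", "X")]
joined = list(reduce_side_join(first, second))
```

Only `buffered` is held in memory per key, which is why the dataset with the fewest records per key should come first.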
Merge Join
Finally, if both datasets are stored in total-sorted order on the join key, you can do a merge join on the map side. Same as the reduce-side join, you do a merge sort to cogroup on the join key, and then project (flatten) back out on the pairs.
Because of this, when generating a frequently-read dataset it's often a good idea to do a total sort in the last pass. Zebra and other databases may also give you total-sorted input for (almost) free.
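
A minimal sketch of a merge join over two lists that are already sorted by key, assuming (key, value) records. It advances two cursors in step and, on a key match, emits every pairing of the two runs of equal keys; the function name is made up for this example.

```python
def merge_join(left, right):
    """left, right: lists of (key, value) already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of records sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Flatten the co-grouped runs back out into pairs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
merged = merge_join(left, right)
```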
Both of these Hadoop joins are merge joins, which require an (explicit) sort beforehand.
A hash join, on the other hand, does not require sorting; it partitions the data with a hash function instead.
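
To make the contrast concrete, here is a single-process sketch of a partitioned hash join, assuming (key, value) records. Both inputs are split with the same hash function, so matching keys land in the same partition, and each partition pair is then joined with an in-memory hash table and no sorting. The function name and the default partition count are arbitrary choices for this example.

```python
def partitioned_hash_join(left, right, num_partitions=4):
    """left, right: iterables of (key, value). Returns a list of (key, left_value, right_value)."""
    # Partition phase: route each record by the hash of its join key.
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for k, v in left:
        left_parts[hash(k) % num_partitions].append((k, v))
    for k, v in right:
        right_parts[hash(k) % num_partitions].append((k, v))

    out = []
    for lp, rp in zip(left_parts, right_parts):
        # Build a hash table on one side of the partition, probe with the other.
        table = {}
        for k, v in lp:
            table.setdefault(k, []).append(v)
        for k, v in rp:
            for lv in table.get(k, ()):
                out.append((k, lv, v))
    return out

left = [("k1", "a"), ("k2", "b"), ("k1", "c")]
right = [("k1", "x"), ("k3", "y")]
result = sorted(partitioned_hash_join(left, right))
```

The output order depends on how keys hash into partitions, so the example sorts the result before using it.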
Detailed discussion can be found in section "Relational Joins" in Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, a well-written book that is free and open source.
