How are block nested loop joins optimised for disk reads?

Quick question on reasonable optimisations on BNLJ:
Say we have the following relations:
A(a,b,d)
B(c,m,n,ww,d)
And the following query:
SELECT A.a, A.b, B.c
FROM A JOIN B USING (d)
WHERE A.b > 800 AND B.m = 'zack'
In this case, all the possible optimisations on the relations are the following:
π[A.a, A.b, A.d](σ[A.b > 800](A))
π[B.c, B.d](σ[B.m = 'zack'](B))
Where we first have selection, then projection.
And now we have two ways to go about this: A as the outer relation and B as the inner, or B as the outer and A as the inner.
Ok, I know all the stuff about buffer pages and the formulas.
However, I am getting confused about one thing.
Do we first apply the optimisations to BOTH relations (regardless of which loop they end up in), store the 'optimised' pages in memory, and then fetch them to perform the join (so that in both cases, A-B or B-A, we use the optimised pages for BOTH)?
Or do we apply the optimisations only to the relation we choose as the outer loop, and not apply ANY optimisations to the relation we choose as the inner loop?
Or do we apply the optimisations to both as we go along?
And if our optimised version of the outer relation is bigger than the buffer space, can we still store the 'filtered' records in memory and rely on BNLJ to pick them up during consecutive passes of the outer loop?
So, in general, what is the whole process of making sure that BNLJ is executed as efficiently as possible for the query above?
Thanks!
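For reference, the scenario being asked about can be pictured with a small sketch: one common approach applies the selections and projections to both relations on the fly, as their pages are read into the buffer, rather than materialising filtered copies of either relation first. The Python below is only an illustrative sketch of that idea; the page layout, the buffer parameter B_buffer_pages, and the field names are assumptions, not any particular DBMS.

def filtered_pages(pages, predicate, project):
    # Read a relation page by page, keeping only the tuples that pass the
    # pushed-down selection and only the columns the query needs.
    for page in pages:
        kept = [project(t) for t in page if predicate(t)]
        if kept:
            yield kept

def bnlj_a_outer(pages_of_A, pages_of_B, B_buffer_pages):
    # A is the outer relation, B the inner; B_buffer_pages - 2 pages hold the
    # current outer block, one page holds the inner, one page holds output.
    results = []
    block = []          # filtered/projected tuples from up to B-2 outer pages
    pages_in_block = 0

    def join_block():
        # One full scan of the (filtered, projected) inner relation per outer
        # block -- this is where the reduction in disk reads comes from.
        for inner_page in filtered_pages(pages_of_B,
                                         lambda t: t["m"] == "zack",
                                         lambda t: {"c": t["c"], "d": t["d"]}):
            for bt in inner_page:
                for at in block:
                    if at["d"] == bt["d"]:
                        results.append((at["a"], at["b"], bt["c"]))

    for outer_page in filtered_pages(pages_of_A,
                                     lambda t: t["b"] > 800,
                                     lambda t: {"a": t["a"], "b": t["b"], "d": t["d"]}):
        block.extend(outer_page)
        pages_in_block += 1
        if pages_in_block == B_buffer_pages - 2:
            join_block()
            block, pages_in_block = [], 0
    if block:
        join_block()
    return results

In this sketch neither relation is fully materialised in its 'optimised' form; whether a real engine materialises the filtered outer relation or pipelines it like this depends on the optimiser.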

Related

Presto Multi table Join with Broadcast Join Distribution

I have 3 tables:
A
- id1
- data
B
- id1
- id2
- data
C
- id2
- data
Table A is very small, while tables B and C are potentially huge.
Table B has the joining keys for both tables A and C, so it has to be present in the first join.
From what I understand about joins in Presto, when cost-based optimizations are not enabled, the order of join execution is the order in which the joins are declared.
Also, we would obviously want to have the smaller table A in the first join operation, as that will reduce the data size.
So, this means that the first join will be between tables A and B.
But if I want to perform a distributed join, then the build side (right side) of the join should be the smaller table.
So, when I come to the second join, between the result of AxB and C, the right side of the join inevitably ends up being the larger table.
I'm very curious how people generally handle such a scenario in Presto. If the build side for the distributed join had been the left side, then it would have followed naturally that we always order the smaller tables to the left.
The ideas of performing joins in the order they are declared and expecting the right-side table to be smaller for distributed joins seem contradictory.
Presto generally performs the join in the declared order (when cost-based optimizations are off), but it tries to avoid cross joins if possible. If you run EXPLAIN on your query, you should be able to see the actual join order.
For the example above, you could avoid the cross joins manually by forcing a right-associative join with parentheses, similar to how arithmetic grouping works (e.g., a - (b - c)):
WITH
a(x) AS (VALUES(1)),
b(x,y) AS (VALUES (1,'a')),
c(y) AS (VALUES 'a')
SELECT *
FROM c JOIN (b JOIN a USING (x)) USING (y)

Extra match clause causes very slow neo4j query

I have the following query, which is very slow:
MATCH (defn :WORKFLOW_DEFINITION { uid: '48a72b6b-6791-4da9-8d8f-dc4375d0fe2d' }),
(instance :WORKFLOW_INSTANCE :SUCCEEDED) -[:DEFINED_BY]-> (defn)
WITH instance, defn
ORDER BY instance.createdAt, instance.uid
LIMIT 1000
OPTIONAL MATCH (instance) -[:INPUT]-> (input :ARTIFACT) <-[:DEFINES_INPUT]- (inputDefn :WORKFLOW_ARTIFACT) <-[:TAKES_INPUT]- (defn)
WITH instance, defn,
collect([input.uid, inputDefn.label, input.bucket, input.key, input.blobUid]) AS inputs
RETURN instance, defn, inputs
The structure of my data is as follows:
- WORKFLOW_DEFINITION elements define workflows; each has edges (via TAKES_INPUT relationships) to one-to-many WORKFLOW_ARTIFACTs defining the different inputs the definition takes.
- WORKFLOW_INSTANCE elements are instances of workflow definitions; they link to their parent definition via DEFINED_BY.
- Each WORKFLOW_INSTANCE takes one-to-many ARTIFACTs (representing real files existing on disk), linked via INPUT relationships.
- The number of input ARTIFACTs a WORKFLOW_INSTANCE takes equals the number of WORKFLOW_ARTIFACTs its parent WORKFLOW_DEFINITION links to.
- Each input ARTIFACT links to exactly one of the parent's WORKFLOW_ARTIFACTs (the WORKFLOW_ARTIFACT provides data on how the ARTIFACT is consumed when the workflow is run).
I currently have a handful of WORKFLOW_DEFINITIONs. Most WORKFLOW_DEFINITIONs have a few WORKFLOW_INSTANCEs, but one (the one I'm interested in querying) has ~2000 WORKFLOW_INSTANCEs. The definition has three WORKFLOW_ARTIFACTs, and hence each WORKFLOW_INSTANCE for this definition has three ARTIFACTs linked to the WORKFLOW_ARTIFACTs.
When I run the query above it requires 50449598 total db hits and takes 16654 ms.
On the other hand, the following query, omitting only the '<-[:TAKES_INPUT]- (defn)' at the end of line 6, requires only 55598 total db hits and takes 82 ms. Here's the fast query with that bit omitted:
MATCH (defn :WORKFLOW_DEFINITION { uid: '48a72b6b-6791-4da9-8d8f-dc4375d0fe2d' }),
(instance :WORKFLOW_INSTANCE :SUCCEEDED) -[:DEFINED_BY]-> (defn)
WITH instance, defn
ORDER BY instance.createdAt, instance.uid
LIMIT 1000
OPTIONAL MATCH (instance) -[:INPUT]-> (input :ARTIFACT) <-[:DEFINES_INPUT]- (inputDefn :WORKFLOW_ARTIFACT)
WITH instance, defn,
collect([input.uid, inputDefn.label, input.bucket, input.key, input.blobUid]) AS inputs
RETURN instance, defn, inputs
Here are the profiled query plans for the slow query (slow query plan image) and for the fast query (fast query plan image).
Why is that final edge back to the WORKFLOW_DEFINITION (to ensure we're getting the correct WORKFLOW_ARTIFACT, since an ARTIFACT may be used in different workflows) making the query so slow? Where's this combinatorial-explosion-looking db hits number coming from?
Thanks in advance for your help, and please let me know if there's anything else I can do to clarify!
Looking at the two query plans, there is a difference in how they process the expansions.
If you look at the flow of operations in the slower query plan, you can see that it expands from defn to inputDefn, next from inputDefn to input, and finally performs an Expand(Into) from instance to input.
The faster query doesn't consider defn when it begins processing the right-hand side (since it's not involved in any expansions of the OPTIONAL MATCH), so it can't use that same approach. Instead it expands from the other direction: instance to input, then input to inputDefn. Traversal in this direction turns out to be much more efficient.
The takeaway is that traversing from input to inputDefn is efficient; it looks to be a 1:1 correlation. However, inputDefn to input seems to be a horrible expansion; inputDefn nodes appear to be supernodes, with an average of 2400 relationships to input nodes each.
As for how to fix this: while there aren't planner hints for explicitly avoiding an expansion in a certain direction, we may be able to use join hints to nudge the planner toward that effect (you can try the join hint on different variables and EXPLAIN the plan to ensure you're avoiding inputDefn-to-input expansions).
If we add the hint USING JOIN ON inputDefn, the planner will expand toward inputDefn from both sides and do a hash join to get the final result set. This avoids the costly outgoing expansion from inputDefn to input.
MATCH (defn :WORKFLOW_DEFINITION { uid: '48a72b6b-6791-4da9-8d8f-dc4375d0fe2d' }),
(instance :WORKFLOW_INSTANCE :SUCCEEDED) -[:DEFINED_BY]-> (defn)
WITH instance, defn
ORDER BY instance.createdAt, instance.uid
LIMIT 1000
OPTIONAL MATCH (instance) -[:INPUT]-> (input :ARTIFACT) <-[:DEFINES_INPUT]- (inputDefn :WORKFLOW_ARTIFACT) <-[:TAKES_INPUT]- (defn)
USING JOIN ON inputDefn
WITH instance, defn,
collect([input.uid, inputDefn.label, input.bucket, input.key, input.blobUid]) AS inputs
RETURN instance, defn, inputs

Pivot Table type of query in Cypher (in one pass)

I am trying to perform the following query in one pass, but I have concluded that it is impossible and that it would furthermore lead to some form of "nested" structure, which is never good news in terms of performance.
I may however be missing something here, so I thought I might ask.
The underlying data structure is a many-to-many relationship between two entities A<---0:*--->B
The end goal is to obtain how many times objects of entity B are assigned to objects of entity A within a specific time interval, as a percentage of the total assignments.
It is exactly this latter part of the question that causes the headache.
Entity A contains an item_date field
Entity B contains an item_category field.
The presentation of the results can be expanded into a table whose columns are the distinct item_date values and whose rows are the normalised counts for the different item_category values. I am just mentioning this for clarity; the query does not have to return the results in that exact form.
My Attempt:
with 12*30*24*3600 as window_length, "1980-1-1" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
with window_length, date_step, count(r) as total_count
unwind ["code_A", "code_B", "code_C"] as the_code
[MATCH THE PATTERN AGAIN TO COUNT THE SPECIFIC `item_code` THIS TIME]
I am finding it difficult to express this in one pass because it requires the equivalent of two independent GROUP BY-like clauses right after the definition of the graph pattern. You can't express these two in parallel, so you have to unwind them. My worry is that this leads to two evaluations: one for the total count and one for the partial count. The bit I am trying to optimise is finding some way of re-writing the query so that it does not have to count nodes it has "captured" before, but this is very difficult given the implied way the aggregate functions are applied to a set.
Basically, any attribute that is not an aggregate function becomes the stratification variable. I have to say here that a plain, simple double stratification ("grab everything, produce one level of counts by item_date, produce another level of counts by item_code") does not work for me because there is NO WAY to control the width of window_length. This means that I cannot compare two time periods with different rates of item_code assignments because the time periods are not equal :(
Please note that retrieving the counts of item_code and then normalising by the sum of those particular codes within a period of time (externally to Cypher) would not lead to accurate percentages, because the normalisation there would be with respect to that particular subset of item_code rather than the total.
Is there a way to perform a simultaneous count of r within a time period but then (somehow) re-use the already matched a, b subsets of nodes to evaluate a partial count of those specific b's that match (b {item_code: the_code})-[r2:RELATOB]-(a) where a.item_date...?
If not, then I am going to move to the next fastest thing, which is to perform two independent queries (one for the total count, one for the partials) and then do the division externally :/
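For intuition only, the one-pass computation described above (a per-code count as a fraction of the total count inside the same time window) can be written procedurally as a single scan that accumulates both counters at once; the Python below is just a sketch of that idea, with hypothetical records carrying item_date (as epoch seconds) and item_category:

from collections import defaultdict

def one_pass_percentages(assignments, window_starts, window_length, codes):
    # assignments: iterable of {"item_date": epoch_seconds, "item_category": str}
    totals = defaultdict(int)        # window start -> total assignments
    partials = defaultdict(int)      # (window start, code) -> assignments for that code
    for a in assignments:
        for start in window_starts:
            if start <= a["item_date"] < start + window_length:
                totals[start] += 1                              # plays the role of count(r)
                if a["item_category"] in codes:
                    partials[(start, a["item_category"])] += 1  # the CASE WHEN ... THEN 1
                break                                           # windows are disjoint
    return {(start, code): partials[(start, code)] / totals[start]
            for start in window_starts if totals[start]
            for code in codes}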
The solution proposed by Tomaz Bratanic in the comment is (I think) along these lines:
with 1*30*24*3600 as window_length,
"1980-01-01" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
unwind ["code_A","code_B","code_C"] as the_code
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
return the_code, date_step, tofloat(sum(case when b.item_category=the_code then 1 else 0 end))/count(r) as perc_count order by date_step asc
This:
- is working;
- does exactly what I was after (after some minor modifications);
- even fills in the missing values with zero, because that ELSE 0 effectively forces a zero even when no count data exists.
But under realistic conditions it is at least 30 seconds slower (no it is not; please see the edit) than what I am currently using, which re-matches. (And no, it is not because of the extra data that is now returned as the missing values are filled in; this is raw query time.)
I thought that it might be worth attaching the query plans here:
This is the plan for applying the same pattern twice (the fast way of doing it):
This is the plan for performing the count in one pass (the slow way of doing it):
I might check how the time scales with the input data later on; maybe the two scale at different rates, but at this point the "one-pass" version already seems slower than the "two-pass" one, and frankly I cannot see how it could get any faster with more data. This is already a simple count over 12 months and 3 categories distributed amongst approximately 18k items.
Hope this might help others too.
EDIT:
While I had done this originally, there was another modification that I did not include, where the second UNWIND goes AFTER the MATCH. This brings the time more than 20 seconds below the "double match", since the UNWIND now affects the RETURN rather than causing multiple executions of the same query. The query now becomes:
with 1*30*24*3600 as window_length,
"1980-01-01" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
unwind ["code_A","code_B","code_C"] as the_code
return the_code, date_step, tofloat(sum(case when b.item_category=the_code then 1 else 0 end))/count(r) as perc_count order by date_step asc
And here is the execution plan for it too:
- Original double match: approximately 55790 ms
- One pass, both UNWINDs BEFORE the MATCH: 82306 ms
- One pass, second UNWIND after the MATCH: 23461 ms

How to use cogroup for join

I have these two joins:
A = JOIN smallTableBigEnoughForInMemory BY (F1,F2) RIGHT OUTER, massive BY (F1,F2);
B = JOIN anotherSmallTableBigforInMemory BY (F1,F3) RIGHT OUTER, massive BY (F1,F3);
Since both joins use one common key, I was wondering whether COGROUP can be used to join the data efficiently. Please note these are RIGHT outer joins.
I did think about cogrouping on F1, but the small tables have multiple combinations (200-300) for a single key, so I have not used a join on a single key.
I think partitioning may help, but the data has skew and I am not sure how to use it in Pig.
You are looking for Pig's implementation of fragment-replicate joins. See the O'Reilly book Programming Pig for more details about the different join implementations. (See in particular Chapter 8, Making Pig Fly.)
In a fragment-replicate join, no reduce phase is required because each record of the large input is streamed through the mapper, matched up with any records of the small input (which is entirely in memory), and output. However, you must be careful not to do this kind of join with an input which will not fit into memory -- Pig will issue an error and the job will fail.
In Pig's implementation, the large input must be given first, so you will actually be doing a left outer join. Just tack on "using 'replicated'":
A = JOIN massive BY (F1,F2) LEFT OUTER, smallTableBigEnoughForInMemory BY (F1,F2) USING 'replicated';
The join for B will be similar.
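Conceptually (this is plain Python, not Pig), a fragment-replicate join just builds an in-memory hash table from the small input and streams the large input past it, with no reduce phase; here is a rough sketch of the left-outer variant above, with a made-up key function and record shapes:

from collections import defaultdict

def replicated_left_outer_join(massive, small, key):
    # key(record) -> the join key, e.g. lambda r: (r["F1"], r["F2"])
    table = defaultdict(list)
    for s in small:                    # the replicated input: must fit in memory
        table[key(s)].append(s)
    for m in massive:                  # the large input is streamed (the map side)
        matches = table.get(key(m))
        if matches:
            for s in matches:
                yield (m, s)
        else:
            yield (m, None)            # left outer: keep unmatched rows of massive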

Hadoop's Map-side join implements Hash join?

I am trying to implement a hash join in Hadoop.
However, Hadoop seems to already have a map-side join and a reduce-side join implemented.
What is the difference between these techniques and a hash join?
Map-side Join
In a map-side (fragment-replicate) join, you hold one dataset in memory (in, say, a hash table) and join it against the other dataset, record by record. In Pig, you'd write:
edges_from_list = JOIN a_follows_b BY user_a_id, some_list BY user_id using 'replicated';
taking care that the smaller dataset is on the right. This is extremely efficient, as there is no network overhead and minimal CPU demand.
Reduce Join
In a reduce-side join, you group on the join key using Hadoop's standard merge sort:
<user_id {A, B, F, ..., Z}, { A, C, G, ..., Q} >
and emit a record for every pair of an element from the first set with an element from the second set:
[A user_id A]
[A user_id C]
...
[A user_id Q]
...
[Z user_id Q]
You should design your keys so that the dataset with the fewest records per key comes first -- you need to hold the first group in memory and stream the second one past it. In Pig, for a standard join you accomplish this by putting the largest dataset last. (As opposed to the fragment-replicate join, where the in-memory dataset is given last).
Note that for a map-side join the entirety of the smaller dataset must fit in memory. In a standard reduce-side join, only each key's group must fit in memory (actually, for each key, every group except the last one). It's possible to avoid even this restriction, but it requires care; look, for example, at the skewed join in Pig.
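As a rough, non-MapReduce illustration of the reduce-side mechanics described above: the shuffle/sort delivers, for each key, one group per input; the reducer holds the first group in memory and streams the second past it. A minimal Python sketch (the grouping step stands in for Hadoop's shuffle):

from collections import defaultdict

def group_by_key(records, key):
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return groups

def reduce_side_join(first, second, key):
    # 'first' should be the dataset with the fewest records per key.
    groups_first = group_by_key(first, key)    # stands in for the shuffle/sort
    groups_second = group_by_key(second, key)
    for k, held in groups_first.items():       # this group is held in memory
        for b in groups_second.get(k, []):     # the other group is streamed past it
            for a in held:
                yield (a, k, b)                # e.g. [A user_id C]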
Merge Join
Finally, if both datasets are stored in total-sorted order on the join key, you can do a merge join on the map side. As in the reduce-side join, you do a merge sort to cogroup on the join key, and then project (flatten) back out over the pairs.
Because of this, when generating a frequently-read dataset it's often a good idea to do a total sort in the last pass. Zebra and other databases may also give you total-sorted input for (almost) free.
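In plain Python terms (ignoring the input formats and splits a real map-side merge join has to handle), the merge step is just two cursors advanced in lockstep over inputs already sorted by the join key; a small sketch:

def merge_join(sorted_a, sorted_b):
    # Each input is a list of (key, value) pairs in total-sorted order by key.
    i = j = 0
    while i < len(sorted_a) and j < len(sorted_b):
        ka, kb = sorted_a[i][0], sorted_b[j][0]
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            group_a = []                       # gather the full group for this key
            while i < len(sorted_a) and sorted_a[i][0] == ka:
                group_a.append(sorted_a[i][1])
                i += 1
            group_b = []
            while j < len(sorted_b) and sorted_b[j][0] == ka:
                group_b.append(sorted_b[j][1])
                j += 1
            for a in group_a:                  # flatten back out over the pairs
                for b in group_b:
                    yield (ka, a, b)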
Both of these Hadoop joins are merge joins, which require an (explicit) sort beforehand.
A hash join, on the other hand, does not require sorting; instead it partitions the data by some hash function.
A detailed discussion can be found in the "Relational Joins" section of Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, a well-written book that is free and open source.
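For contrast with the merge-based approach, here is a minimal Python sketch of the partitioned hash join idea mentioned above: both inputs are partitioned with the same hash function so matching keys land in the same partition pair, and each pair is then joined with an in-memory hash table, with no sorting anywhere. Names and the partition count are illustrative.

from collections import defaultdict

def partition(records, key, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for r in records:
        parts[hash(key(r)) % num_partitions].append(r)
    return parts

def partitioned_hash_join(left, right, key, num_partitions=8):
    left_parts = partition(left, key, num_partitions)
    right_parts = partition(right, key, num_partitions)
    for lpart, rpart in zip(left_parts, right_parts):
        table = defaultdict(list)              # build side, one partition at a time
        for l in lpart:
            table[key(l)].append(l)
        for r in rpart:                        # probe side
            for l in table.get(key(r), []):
                yield (l, r)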
