Semi-join approach: - step 1: Table R's machine sends ("C" values: c1, c2, c3, c4) to table S - step 2: Table S's machine receives this information and sends corresponding rows( {c2, d1}, {c2, d3} ) - step 3: Table R's machine receives this information and completes the join resulting in ( {a2, b2, c2, d1}, {a2, b2, c2, d3} ) Send all S approach: - step 1: Table S's machine sends entire table ( {c2, d1}, {c5, d2}, {c2, d3}, {c6, d4} ) - step 2: Table R's machine receives this information and does the join which results in ( {a2, b2, c2, d1}, {a2, b2, c2, d3} ) In the first approach, we send 8 data items, but we must communicate twice between machines, while in the second approach we also send 8 data items, but there is only one communication between machines.(Edited: 2020-04-29)
TABLE R A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 TABLE S C D c2 d1 c5 d2 c2 d3 c6 d4 First, we compute semi-join: (S is on the left side because it's smaller.) S (SEMI-JOIN) R
Which evaluates to: S (NATURAL-JOIN) (PROJECT(R, C))First, we perfrom (PROJECT(R, C)), which is: (The machine with R computes this) </pre>
(PROJECT(R, C)) C c2 c5 c2 c6
Next, we join: Bag join is assumed. (The machine with S computes this) Which evaluates to: A B C a2 b2 c2 a2 b2 c2
Next, we send this table to the machine with the table R, and we compute the final join. A B C D a2 b2 c2 d1 a2 b2 c2 d3(Edited: 2020-04-29)
Step 1: S sends over column C (4 entries) Step 2: R computes R semi-join S {[a2, b2, c2]} and sends that to S Step 3: S uses the result to compute the natural join {[a2, b2, c2, d1], [a2, b2, c2, d3]}
This method sends 4 entries over and then 3 entries back over (2 communications, 7 entries sent total) A traditional method would have S send over its entire table to R (1 communication, 8 entries sent total)(Edited: 2020-04-29)
Semi-join: S has the smaller table. 1. Machine with R computes π_{C}(R) and sends it to S. π_{C}(R)= {c1, c2, c3, c4} 2. Machine with S computes S⋉R sends it to the machine with R. S⋉R = {(c2, d1), (c2,d3)} 3. Machine with R uses S⋉R to compute S(C,D)⋈R(A,B,C). S(C,D)⋈R(A,B,C) = {(a2,b2,c2,d1), (a2,b2,c2,d3)} Naive 1. S has a smaller table. 2. A copy of S is sent to R and the join is computed there. {(c2,d1), (c5, d2), (c2, d3), (c6, d4)} sent to R. R computes the join: R⋈S = {(a2, b2, c2, d1), (a2, b2, c2, d3)} The semi-join approach costs seven entries and two times of communication. The naive way costs 8 entries and one time of communication.
-Step 1: S sends over column C, about 4 columns -Step 2: Computes R semi-join S {[a2, b2, c2]} and sends that to S -Step 3: S received the result and compute natural join {[a2, b2, c2, d1], [a2, b2, c2, d3]} -To compare: With semi-join, 7 columns are sent total where as with traditional method 8 columns are needed.
First R would send to S its matching attribute column C with the values that might join with S: c1, c2, c3, c4. Then S computes the join and sends it to R: (c2, d1), (c2, d3). Then R uses that to compute the join: (a2, b2, c2, d1), (a2, b2, c2, d3).
A comparison of this method versus just simply sending over all of S is that we have more communication overhead because we need to send data two times(first the potential columns that may join, then the semi-join) instead of just once. For this current example, 3 columns are being sent for the semi-join method whereas just sending over S would have sent 2 columns.(Edited: 2020-05-01)