Semi-join approach:
- step 1: Table R's machine sends ("C" values: c1, c2, c3, c4) to table S
- step 2: Table S's machine receives this information and sends corresponding rows( {c2, d1}, {c2, d3} )
- step 3: Table R's machine receives this information and completes the join resulting in ( {a2, b2, c2, d1}, {a2, b2, c2, d3} )
Send all S approach:
- step 1: Table S's machine sends entire table ( {c2, d1}, {c5, d2}, {c2, d3}, {c6, d4} )
- step 2: Table R's machine receives this information and does the join which results in ( {a2, b2, c2, d1}, {a2, b2, c2, d3} )
In the first approach, we send 8 data items, but we must communicate twice between machines, while in the second approach we also send 8 data items, but there is only one communication between machines.
(Edited: 2020-04-29) TABLE R A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 TABLE S C D c2 d1 c5 d2 c2 d3 c6 d4 First, we compute semi-join: (S is on the left side because it's smaller.) S (SEMI-JOIN) R
Which evaluates to: S (NATURAL-JOIN) (PROJECT(R, C))First, we perfrom (PROJECT(R, C)), which is: (The machine with R computes this) </pre>
(PROJECT(R, C)) C c2 c5 c2 c6
Next, we join: Bag join is assumed. (The machine with S computes this) Which evaluates to: A B C a2 b2 c2 a2 b2 c2
Next, we send this table to the machine with the table R, and we compute the final join. A B C D a2 b2 c2 d1 a2 b2 c2 d3(Edited: 2020-04-29)
Step 1: S sends over column C (4 entries)
Step 2: R computes R semi-join S {[a2, b2, c2]} and sends that to S
Step 3: S uses the result to compute the natural join {[a2, b2, c2, d1], [a2, b2, c2, d3]}
This method sends 4 entries over and then 3 entries back over (2 communications, 7 entries sent total) A traditional method would have S send over its entire table to R (1 communication, 8 entries sent total)(Edited: 2020-04-29)
Semi-join:
S has the smaller table.
1. Machine with R computes π_{C}(R) and sends it to S.
π_{C}(R)= {c1, c2, c3, c4}
2. Machine with S computes S⋉R sends it to the machine with R.
S⋉R = {(c2, d1), (c2,d3)}
3. Machine with R uses S⋉R to compute S(C,D)⋈R(A,B,C).
S(C,D)⋈R(A,B,C) = {(a2,b2,c2,d1), (a2,b2,c2,d3)}
Naive
1. S has a smaller table.
2. A copy of S is sent to R and the join is computed there.
{(c2,d1), (c5, d2), (c2, d3), (c6, d4)} sent to R.
R computes the join: R⋈S = {(a2, b2, c2, d1), (a2, b2, c2, d3)}
The semi-join approach costs seven entries and two times of communication. The naive way costs 8 entries and one
time of communication.
-Step 1: S sends over column C, about 4 columns
-Step 2: Computes R semi-join S {[a2, b2, c2]} and sends that to S
-Step 3: S received the result and compute natural join {[a2, b2, c2, d1], [a2, b2, c2, d3]}
-To compare:
With semi-join, 7 columns are sent total where as with traditional method 8 columns are needed.
First R would send to S its matching attribute column C with the values that might join with S: c1, c2, c3, c4. Then S computes the join and sends it to R: (c2, d1), (c2, d3). Then R uses that to compute the join: (a2, b2, c2, d1), (a2, b2, c2, d3).
A comparison of this method versus just simply sending over all of S is that we have more communication overhead because we need to send data two times(first the potential columns that may join, then the semi-join) instead of just once. For this current example, 3 columns are being sent for the semi-join method whereas just sending over S would have sent 2 columns.(Edited: 2020-05-01)