[ Prev ]
2019-11-04

-- Oct 30 In-Class Exercise Thread
 Suppose we can index 50,000 documents before we run out of memory and need to merge with a disk based index that uses logarithmic merging. We want to index a billion pages, using 6 machines each responsible for having the index for 1/6 of the documents.
 We have:
 Generation 0 = 50000 * 2^0
 Generation 1 = 50000 * 2^0
 Generation 2= 50000 * 2^1
 Generation 3 = 50000 * 2^2
 Generation 4 = 50000 * 2^3
 .......
 Generation n: 50000 * 2^(n-1) = 50000* 3340
 Hence n = log(3340) + 1 ~= 13
 What's the largest generation number that will be used? - 13 generations
 What is the most generations that will need to be merged when we add have to flush the in memory partition to disk? - 12 generations
(Edited: 2019-11-04)
Suppose we can index 50,000 documents before we run out of memory and need to merge with a disk based index that uses logarithmic merging. We want to index a billion pages, using 6 machines each responsible for having the index for 1/6 of the documents. We have: Generation 0 = 50000 * 2^0 Generation 1 = 50000 * 2^0 Generation 2= 50000 * 2^1 Generation 3 = 50000 * 2^2 Generation 4 = 50000 * 2^3 ....... Generation n: 50000 * 2^(n-1) = 50000* 3340 Hence n = log(3340) + 1 ~= 13 What's the largest generation number that will be used? - 13 generations What is the most generations that will need to be merged when we add have to flush the in memory partition to disk? - 12 generations

-- Oct 30 In-Class Exercise Thread
 Total number of documents = 1000000000
 Total number of machines = 6
 Pages processed by each machine = 167 Million approximately 
 
 Generation 1 = 50k*2 = 100k
 Generation 2 = 50k*4 = 400k
 Generation 3 = 50k*8 = 800k
 and so on
 
 Generation n => 50k x 2^(n-1) = 167M
 n-1 = log2(167M/5000) = 12 approx
 n = 13 
 Therefore 13 generations
 So largest number of generation = 13
 Total number of merges = 12
Total number of documents = 1000000000 Total number of machines = 6 Pages processed by each machine = 167 Million approximately Generation 1 = 50k*2 = 100k Generation 2 = 50k*4 = 400k Generation 3 = 50k*8 = 800k and so on Generation n => 50k x 2^(n-1) = 167M n-1 = log2(167M/5000) = 12 approx n = 13 Therefore 13 generations So largest number of generation = 13 Total number of merges = 12

-- Oct 30 In-Class Exercise Thread
 Total number of documents = 10^9
 Number of machines = 6
 Total documents processed by a machine = (10^9)/6 = 167m
 Gen 0: 50k * 2^0
 Gen 1: 50k * 2^0
 Gen 2: 50k * 2^1
 Gen 3: 50k * 2^2
 Gen n: 50k * 2^(n-1)
 Total number of iterations = n = log(167m/50k) + 1 =  12.7 = 13 
 Total number of merges = n-1 = 12
Total number of documents = 10^9 Number of machines = 6 Total documents processed by a machine = (10^9)/6 = 167m Gen 0: 50k * 2^0 Gen 1: 50k * 2^0 Gen 2: 50k * 2^1 Gen 3: 50k * 2^2 Gen n: 50k * 2^(n-1) Total number of iterations = n = log(167m/50k) + 1 = 12.7 = 13 Total number of merges = n-1 = 12

-- Oct 30 In-Class Exercise Thread
Total number of documents to be indexed = 10^9
Number of machines = 6. Hence each machine is responsible for 1.67 * 10^8pages
Each machine can index 50000 pages before running out of memory
For a machine
G0 will have 50000 documents, G1 100k, G2 200k, G3 400k, G4 800k and so on
It can be seen from above that,
Gn = 50000 * 2^(n-1)
So,
1.67 * 10^8 = 50000 * 2^(n-1)
n ~ 13
The largest generation number that will be used is 13
The number of merges required to get the final index will be 12(n-1)
 
(Edited: 2019-11-04)
Total number of documents to be indexed = 10^9 Number of machines = 6. Hence each machine is responsible for 1.67 * 10^8pages Each machine can index 50000 pages before running out of memory For a machine G0 will have 50000 documents, G1 100k, G2 200k, G3 400k, G4 800k and so on It can be seen from above that, Gn = 50000 * 2^(n-1) So, 1.67 * 10^8 = 50000 * 2^(n-1) n ~ 13 The largest generation number that will be used is 13 The number of merges required to get the final index will be 12(n-1)

-- Oct 30 In-Class Exercise Thread
Number of docs = 10^6
 Number of machines = 6
 Pages per machine = 10^6/6= ~167 Million
 
 Generation 1 = 50k*2 = 100k
 Generation 2 = 50k*4 = 400k
. . .
 
 Generation n : 50k x 2^(n-1) = 167M
 n-1 = log2(167M/5000) = ~12
 n = 13 
 What's the largest generation number that will be used? 
13
 What is the most generations that will need to be merged when we add have to flush the in memory partition to disk?
12
Number of docs = 10^6 Number of machines = 6 Pages per machine = 10^6/6= ~167 Million Generation 1 = 50k*2 = 100k Generation 2 = 50k*4 = 400k . . . Generation n : 50k x 2^(n-1) = 167M n-1 = log2(167M/5000) = ~12 n = 13 What's the largest generation number that will be used? 13 What is the most generations that will need to be merged when we add have to flush the in memory partition to disk? 12

-- Oct 30 In-Class Exercise Thread
number of document for each machine = 1 billion/6 = ~167 Million
50000*2^(n-1) for the nth generation, assuming the in-memory generation is generation 0.
50000*2^(n-1) = 167 Million
log(167 Million/50000) = ~11.7
the largest generation number that will be used 11.7 + 1 = ~13
the most generations that will need to be merged: 12
(Edited: 2019-11-04)
number of document for each machine = 1 billion/6 = ~167 Million 50000*2^(n-1) for the nth generation, assuming the in-memory generation is generation 0. 50000*2^(n-1) = 167 Million log(167 Million/50000) = ~11.7 the largest generation number that will be used 11.7 + 1 = ~13 the most generations that will need to be merged: 12

-- Oct 30 In-Class Exercise Thread
Total documents = 1,000,000,000 Each machine is responsible for indexing roughly 167,000,000 documents Number of generations - 1 = log(167,000,000/50,000) = 11.7 Thus, the maximum generation number = 12 + 1 = 13 Number of merges = 13-1 = 12
<nowiki>Total documents = 1,000,000,000 Each machine is responsible for indexing roughly 167,000,000 documents Number of generations - 1 = log(167,000,000/50,000) = 11.7 Thus, the maximum generation number = 12 + 1 = 13 Number of merges = 13-1 = 12 </nowiki>

-- Oct 30 In-Class Exercise Thread
The total number of machines is 6 and each machine has 1/6 of the index therefore the number of pages that exist in each machine is 10^6/6 which is approximately equal to 167 million pages.
Generation 1: 50k*2 = 100k Generation 2: 50k^4 = 400k
Since generation n is equal to 50k * 2^(n-1), the total number of iterations is equal to log(167m/50k) + 1 = 13 The total number of merges is equal to n - 1 = 13 - 1 = 12
The total number of machines is 6 and each machine has 1/6 of the index therefore the number of pages that exist in each machine is 10^6/6 which is approximately equal to 167 million pages. Generation 1: 50k*2 = 100k Generation 2: 50k^4 = 400k Since generation n is equal to 50k * 2^(n-1), the total number of iterations is equal to log(167m/50k) + 1 = 13 The total number of merges is equal to n - 1 = 13 - 1 = 12

User Icon
-- Oct 30 In-Class Exercise Thread
 Total documents to index = 1 billion  (1000000000)
 Total 6 independent machines 
 Each Machines is responsible of = 1 billion / 6 
                                 = 1.67 * 10^8 documents
 Each Machine have the capacity of up to 50K documents
 G0 = 50,000 * 2^0 = 50K
 G1 = 50,000 * 2^1 = 100K
 G2 = 50,000 * 2^2 = 200K
 G3 = 50,000 * 2^3 = 400k
 .........
 .......
 .......
 For the nth generation
 n -1 = log2(1.67*10^8/50K)
 n -1 = log2(3340)
 n- 1 = 11.7
 n = 12.7
 n  ~ 13
 Hence the largest generation number would be 13  
 
Total documents to index = 1 billion (1000000000) Total 6 independent machines Each Machines is responsible of = 1 billion / 6 = 1.67 * 10^8 documents Each Machine have the capacity of up to 50K documents G0 = 50,000 * 2^0 = 50K G1 = 50,000 * 2^1 = 100K G2 = 50,000 * 2^2 = 200K G3 = 50,000 * 2^3 = 400k ......... ....... ....... For the nth generation n -1 = log2(1.67*10^8/50K) n -1 = log2(3340) n- 1 = 11.7 n = 12.7 n ~ 13 Hence the largest generation number would be 13

-- Oct 30 In-Class Exercise Thread
Total no. of machines = 6 Total no. of docs = 10^9
 Total processed docs by a machine = (10^9)/6 = 167m
 Generation 0: 50k * 2^0
 Generation 1: 50k * 2^0
 Generation2: 50k * 2^1
.......... .............
 Generation n: 50k * 2^(n-1)
 Total iterations = n = log(167m/50k) + 1 =  12.7 = 13 
 Total merges = n-1 = 12
Total no. of machines = 6 Total no. of docs = 10^9 Total processed docs by a machine = (10^9)/6 = 167m Generation 0: 50k * 2^0 Generation 1: 50k * 2^0 Generation2: 50k * 2^1 .......... ............. Generation n: 50k * 2^(n-1) Total iterations = n = log(167m/50k) + 1 = 12.7 = 13 Total merges = n-1 = 12
X