2019-12-09

CS267 Fall 2019 Practice Final.

Post your solutions for practice final in this thread.

-- CS267 Fall 2019 Practice Final
 Group Members: Rajendra Maharjan, Tushar Panpaliya	
Q9. Briefly explain how the page rank algorithm might be implemented
     as a map reduce job.	
 
   Since the question does not specify whether a single round or
   multiple rounds are expected, we present an algorithm with
   multiple rounds.
   Let epsilon be a constant used as the stopping criterion
   for the iteration. The algorithm goes like this:
   do {

       1. Run a map-reduce job to compute A * current_r,
          where A is the normalized adjacency matrix.
          Store the result in the Hadoop Distributed File
          System (HDFS) in the form of (nid, node) pairs,
          where each node has its page_rank field set to
          the value computed by this operation.

       2. Run a map-reduce job to compute the dangling-node
          correction to current_r.
          Store the result in HDFS as (nid, node) pairs,
          where each node has its page_rank field set to
          the value computed by this operation.

       3. Run a map-reduce job to compute the teleportation
          correction to current_r.
          Store the result in HDFS as (nid, node) pairs,
          where each node has its page_rank field set to
          the value computed by this operation.

       4. Send all the (nid, node) pairs computed above to a
          reducer job. Grouping by nid, the reducer outputs
          (nid, node) pairs in which page_rank equals the sum
          of the three page_rank values. Store the final
          result in the form of (nid, node) pairs.

       5. Run a map-reduce job to compute
          len = || current_r - old_r ||
   } while (len > epsilon)
   
  output the nodes with their page ranks
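A minimal single-machine sketch of the map-reduce style of step 1 (computing A * current_r) on a toy graph; the adjacency list, function names, and data layout here are illustrative choices, not the Hadoop API:

```python
from collections import defaultdict

# Toy graph as an adjacency list: nid -> list of out-neighbors.
graph = {1: [2, 3], 2: [3], 3: [1]}
rank = {1: 1/3, 2: 1/3, 3: 1/3}  # current_r, uniform start

def mapper(nid, neighbors):
    # Emit each out-neighbor's share of this node's rank
    # (one node's contribution to the A * current_r product).
    share = rank[nid] / len(neighbors)
    for m in neighbors:
        yield m, share

def reducer(pairs):
    # Group emitted pairs by nid and sum the contributions,
    # mirroring the grouping-by-nid reducer in step 4.
    out = defaultdict(float)
    for nid, share in pairs:
        out[nid] += share
    return dict(out)

pairs = [p for nid, nbrs in graph.items() for p in mapper(nid, nbrs)]
new_rank = reducer(pairs)
```

In a real Hadoop job the (nid, share) pairs would be written to and shuffled through HDFS between the map and reduce phases rather than held in a Python list.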
        
        
   
(Edited: 2019-12-09)

-- CS267 Fall 2019 Practice Final
Q5) a) γ code
The γ codeword for a positive integer k has two components. The first component is called the selector, and the second component is the body. The selector contains the length of the body in unary representation, and the body is the binary representation of k. A unary code is a run of 0's followed by a 1.
E.g. k = 5
Body = 101
Selector = 001
Selector and body looks like this: 001 101
Since the body always begins with a 1 bit, its first bit is redundant, and we can save one bit by omitting it.
γ(5) = 00101
 					
b) Golomb code:
Steps are as follows:
1. Choose an integer M, the modulus.
2. Split each Δ-value k into two components, the quotient q(k) and the remainder r(k):
q(k) = ⌊(k - 1) / M⌋, r(k) = (k - 1) mod M
3. Encode k by writing q(k) + 1 in unary representation, followed by r(k) as a ⌊log2(M)⌋-bit or ⌈log2(M)⌉-bit binary number.
 
Q6) An incremental collection allows new documents to be added to the collection, and logarithmic merging is well suited to such document insertion. Logarithmic merging ensures that the number of index partitions remains small and that the degradation of query response time stays acceptable. Logarithmic merge maintains a set of index partitions, each carrying a label g, the generation number of that partition. Whenever the indexing process runs out of memory, it creates a new on-disk index partition; this new inverted file receives the provisional label 1. The indexing process then checks whether two index partitions share the same label g′; if so, it merges them into a new partition with label g′ + 1. This is repeated until no collisions remain. E.g., the indexing process runs out of memory after scanning 3 posting lists, so a new partition is created with label 1. The indexing process runs out of memory again and the new partition is also labeled 1; since a partition labeled 1 already exists, the two are merged to form a partition labeled 2 and the 1-partitions are deleted.
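A toy simulation of the label bookkeeping described above (the set-of-labels representation is my own simplification; real partitions hold posting lists):

```python
def add_partition(labels):
    # A new on-disk partition receives the provisional label 1;
    # while a partition with the same label g already exists,
    # merge the two into a single partition labeled g + 1.
    g = 1
    while g in labels:
        labels.remove(g)  # the colliding g-partition is consumed by the merge
        g += 1
    labels.add(g)
    return labels

parts = set()
for _ in range(3):       # three memory flushes, as in the example above
    add_partition(parts)
```

After three flushes the partitions carry labels {1, 2}, matching the worked example: the second flush collides on label 1 and produces a 2, and the third flush adds a fresh 1.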
		
Group Members : Parth Patel, Aparna Kale, Vy Tran
(Edited: 2019-12-09)

-- CS267 Fall 2019 Practice Final
Group members: Aniket Chandak, Charulata Lodha, Sunhera Paul
Q8.
  Number of machines: n = 5
  Number of results to return: k = 10
  We want Machine 1 to return half the results,
  hence l = (1/2) * k = 5.
  The probability that exactly l of the top m results are found by one node is
  b(n, m, l) = C(m, l) * (1/n)^l * (1 - 1/n)^(m - l)
             = C(10, 5) * (1/5)^5 * (1 - 1/5)^5
             ≈ 0.0264
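The arithmetic can be checked directly with the binomial formula (function name is my own):

```python
from math import comb

def b(n, m, l):
    # Probability that exactly l of the top m results
    # land on one particular machine out of n.
    return comb(m, l) * (1 / n) ** l * (1 - 1 / n) ** (m - l)

p = b(5, 10, 5)  # the values from the answer above
```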
Q7: Resource Description for Scan 09-Dec-2019 (1).pdf
(Edited: 2019-12-09)

-- CS267 Fall 2019 Practice Final
Question4
Group Member: Sushant Mane, Bhumika Matharu, Qianhui Fan
Resource Description for 191576003415_.pic.jpg
(Edited: 2019-12-12)
2019-12-11

-- CS267 Fall 2019 Practice Final
 Group Member: Sushant Mane, Bhumika Kaur Matharu, Qianhui Fan
  
 Q3	:
 
 The trec_eval software takes a trec_rel_file and a trec_top_file as input.
 The trec_rel_file contains human judgments of document relevance.
 trec_rel_file has the following format:
 qid  iter  docno  rel
 
 qid = query id
 iter = iteration number
 docno = document number
 rel = relevance
 The trec_top_file contains the results a particular IR system computed on a set of queries.
 trec_top_file has the following format:
 query_id iter docno rank sim run_id
 
 query_id = query id
 sim = similarity
 run_id = string which gets printed out with the output
 Columns are space-delimited.
 The trec_eval software produces output stating statistics such as precision, recip_rank,
 map score, and iprec_at_recall, each in the following format:
 run_id all run_name
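A few illustrative lines in each format (the query ids, document numbers, scores, and run name below are made up for the example):

```text
# trec_rel_file: qid iter docno rel
301 0 FBIS3-10082 1
301 0 FBIS3-10169 0

# trec_top_file: query_id iter docno rank sim run_id
301 Q0 FBIS3-10082 1 42.38 myrun
301 Q0 FBIS3-10169 2 41.05 myrun
```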
  
2019-12-12

-- CS267 Fall 2019 Practice Final
Other Team Members(Sorry, I forgot your name :(
 Q1:
Resource Description for WeChat Image_2019121301045_1jpg.jpg
 Q2:
Resource Description for WeChat Image_20191212220559.jpg
(Edited: 2019-12-13)

-- CS267 Fall 2019 Practice Final
Question 10
Members: Prathamesh Kumkar, Ashutosh Kale, Rahul Shulka
Batch filtering:
Filtering is the process of evaluating documents on an ongoing basis according to some standing information need. For example, a news filter might deliver articles on mental health to a health-care professional. Sometimes, for a filtering task, the documents to be searched do not yet exist. If it is possible to wait for a large number of documents to arrive, we can accumulate them into a corpus, index them, and apply a search method such as BM25 to yield a ranked list. We can then repeat the process for each new batch of documents in turn. This approach is known as batch filtering. From the user’s perspective, batch filtering yields a viable solution if enough documents arrive within a suitable time interval.
Aggregate P@k:
One drawback of size-specific precision P@k is that the number of relevant documents can vary between batches, while we always use the same value for k. Instead, we can use the fact that BM25 orders documents by a relevance score s and, in a given batch, present to the user the documents with s > t for some threshold t. We choose t so that k = ρN documents have score s > t. The results of each batch are entered successively into a common priority queue. Thus, aggregate P@k is the precision of the top k elements of the queue once all batches are processed.

-- CS267 Fall 2019 Practice Final
Team members- Aditi, Harshit, Xincheng
Resource Description for WeChat Image_20191212220559.jpg Resource Description for WeChat Image_2019121301045_1jpg.jpg