Yioop - PHP Search Engine

-- CS267 Fall 2019Practice Midterm 1

7)

Resource Description for WhatsApp Image 2019-10-04 at 14.38.00.jpeg

Authors:

Parth Patel

Shreya Bhajikhaye

(Edited: 2019-10-04)

7) ((resource:WhatsApp Image 2019-10-04 at 14.38.00.jpeg|Resource Description for WhatsApp Image 2019-10-04 at 14.38.00.jpeg)) Authors: Parth Patel Shreya Bhajikhaye

-- CS267 Fall 2019Practice Midterm 1

Problem 10:

Briefly explain and give example of the following (1pt each): (a) sort-based dictionary, (b) per-term index, (c) move-to-front heuristic, (d) merge based index construction.

(a) sort-based dictionary:

  - all terms that appear in collection are stored in sorted array or in lexicographical order.

 +---------+----------+----------+------------+------------+------------+--------------+
 | (aaa,0) | (aab, 4) | (aac, 8) | (term, 12) | (wait, 16) | (waze, 20) |  (xerox, 24) |
 +---------+----------+----------+------------+------------+------------+--------------+

(b) per-term index

 - in order to carry out efficient random access operations on the postings lists,
 each list is equipped with an auxiliary data structure which is stored on disk, 
 at the beginning of respective posting list.
 
 +-----+-----+-----+-----+
 | TF: | 1   | 51  | 101 |
 +-----+-----+-----+-----+
 | 1   | 34  | 43  | 48  |
 +-----+-----+-----+-----+
 | 51  | 69  | 72  | 89  |
 +-----+-----+-----+-----+
 | 101 | 112 | 123 | 134 |
 +-----+-----+-----+-----+

 (c) move-to-front heuristic
  - If a term is frequent in the text collection, then it should be at the beginning of its chain in the hash table. 
  - Like insert-at-back, the move-to-front heuristic inserts new terms at the end of their respective chain.
  - In addition, however, whenever a term lookup occurs and the term descriptor is not found at the beginning of the chain, it is relocated and moved to the front.
  - This way, when the next lookup for the term takes place, the term’s dictionary entry is still at (or near) the beginning of its chain in the hash table.

 (d) merge based index construction
 - Merge-based indexing is a generalization of the in-memory index construction method, building inverted lists by means of hash table lookup.

Sushant Mane | Ashutosh Kale

(Edited: 2019-10-04)

Problem 10: Briefly explain and give example of the following (1pt each): (a) sort-based dictionary, (b) per-term index, (c) move-to-front heuristic, (d) merge based index construction. (a) sort-based dictionary: - all terms that appear in collection are stored in sorted array or in lexicographical order. +---------+----------+----------+------------+------------+------------+--------------+ | (aaa,0) | (aab, 4) | (aac, 8) | (term, 12) | (wait, 16) | (waze, 20) | (xerox, 24) | +---------+----------+----------+------------+------------+------------+--------------+ (b) per-term index - in order to carry out efficient random access operations on the postings lists, each list is equipped with an auxiliary data structure which is stored on disk, at the beginning of respective posting list. +-----+-----+-----+-----+ | TF: | 1 | 51 | 101 | +-----+-----+-----+-----+ | 1 | 34 | 43 | 48 | +-----+-----+-----+-----+ | 51 | 69 | 72 | 89 | +-----+-----+-----+-----+ | 101 | 112 | 123 | 134 | +-----+-----+-----+-----+ (c) move-to-front heuristic - If a term is frequent in the text collection, then it should be at the beginning of its chain in the hash table. - Like insert-at-back, the move-to-front heuristic inserts new terms at the end of their respective chain. - In addition, however, whenever a term lookup occurs and the term descriptor is not found at the beginning of the chain, it is relocated and moved to the front. - This way, when the next lookup for the term takes place, the term’s dictionary entry is still at (or near) the beginning of its chain in the hash table. (d) merge based index construction - Merge-based indexing is a generalization of the in-memory index construction method, building inverted lists by means of hash table lookup. Sushant Mane | Ashutosh Kale

-- CS267 Fall 2019Practice Midterm 1

Team Members: Qianhui Fan, Xincheng Yuan

Question 5: Suppose the schema independent posting list for bob looks like 1, 3, 120, 1000, 1200, 1201, 1202, 1300. Explain how next(bob, 2) would be computed using galloping search as the implementation for next.

 next("bob", 2):
   initialized variables: 
   P["bob"]= [1, 3, 120, 1000, 1200, 1201, 1202, 1300] // posting list of
                                                       // "bob"
   l["bob"] = 8 //find length of p["bob"], 
                //in the array of length of posting lists
   next_pos = 0
   P["bob"] = 0 => false
   P["bob"][7] = 1300 < 2 => false => continue
   P["bob"][0] = 1 > 2 => false: 
      next_pos = 1, continue
   next_pos > 0 => true, 
   P["bob"][0] = 1 <= 2 => true: 
      low = next_pos - 1 => low = 1 - 1= 0
      jump = 1
      high = low + jump => high = 0 + 1 = 1
      high > 8 => false :
      next_pos = binarySearch ( "bob" , low, high , 2) 
                => binarySearch ("bob", 0 , 1 , 2)
                => next_pos =1
   return: P["bob"][1] => return 3

(Edited: 2019-10-04)

Team Members: Qianhui Fan, Xincheng Yuan Question 5: Suppose the schema independent posting list for bob looks like 1, 3, 120, 1000, 1200, 1201, 1202, 1300. Explain how next(bob, 2) would be computed using galloping search as the implementation for next. next("bob", 2): initialized variables: P["bob"]= [1, 3, 120, 1000, 1200, 1201, 1202, 1300] // posting list of // "bob" l["bob"] = 8 //find length of p["bob"], //in the array of length of posting lists next_pos = 0 P["bob"] = 0 => false P["bob"][7] = 1300 < 2 => false => continue P["bob"][0] = 1 > 2 => false: next_pos = 1, continue next_pos > 0 => true, P["bob"][0] = 1 <= 2 => true: low = next_pos - 1 => low = 1 - 1= 0 jump = 1 high = low + jump => high = 0 + 1 = 1 high > 8 => false : next_pos = binarySearch ( "bob" , low, high , 2) => binarySearch ("bob", 0 , 1 , 2) => next_pos =1 return: P["bob"][1] => return 3

-- CS267 Fall 2019Practice Midterm 1

Team Members: Rahul Shukla and Prathamesh Kumkar

Problem 10: a) sort based dictionary: In a sort-based dictionary, in which all terms that appear in the text collection are arranged in a sorted array or in a search tree. Lookup operations are realized through tree traversal.

b)per-term index: A per-term index is a disk-based data structure that is part of the header of a posting list. It contains a copy of a subset of the postings in the posting list -- for example, every 5000th posting -- and the location in the posting list for these items.

(c) move-to-front heuristic: In a hash-based dictionary, when a term lookup is done and the items are not found at the front of the chain move it to the front. This takes O(1) overhead and tends to keep frequent items at the front of the chain.

(d) merge based index construction -This approach is an extension of the in-memory hash-based index approach. If the index is small enough to fit in memory the two approaches will be the same. If the index is too big to fit in memory, the in-memory index is written to the disk into a file called a partition. The in-memory index is wiped, and we continue indexing as if from scratch. After going through the collection one has a sequence of partitions on disk. Terms in this set-up are their own ID, posting lists in each partition are sorted in lexicographical order of their terms. The final stage of the algorithm is to then merge each of the partitions into the final index.

Team Members: Rahul Shukla and Prathamesh Kumkar Problem 10: a) sort based dictionary: In a sort-based dictionary, in which all terms that appear in the text collection are arranged in a sorted array or in a search tree. Lookup operations are realized through tree traversal. b)per-term index: A per-term index is a disk-based data structure that is part of the header of a posting list. It contains a copy of a subset of the postings in the posting list -- for example, every 5000th posting -- and the location in the posting list for these items. (c) move-to-front heuristic: In a hash-based dictionary, when a term lookup is done and the items are not found at the front of the chain move it to the front. This takes O(1) overhead and tends to keep frequent items at the front of the chain. (d) merge based index construction -This approach is an extension of the in-memory hash-based index approach. If the index is small enough to fit in memory the two approaches will be the same. If the index is too big to fit in memory, the in-memory index is written to the disk into a file called a partition. The in-memory index is wiped, and we continue indexing as if from scratch. After going through the collection one has a sequence of partitions on disk. Terms in this set-up are their own ID, posting lists in each partition are sorted in lexicographical order of their terms. The final stage of the algorithm is to then merge each of the partitions into the final index.

-- CS267 Fall 2019Practice Midterm 1

Group Members: Xincheng Yuan, Qianhui Fan

 Question 9 for bonus: Suppose n=4 what would be the character n-grams for the word 
 salad? (1pt) What is stopping? (1pt) What is stemming (1pt)? Give an example of th kinds 
 of rules used by a Porter stemmer.
 -4-grams for salad:
  .sal , sala, alad, lad.
 - Stopping	: the process of filtering out function words that have no well-defined 
   meanings while indexing under a language model, to reduce index size and speed up query response time.
 - Stemming	: to reduce words to their base/root form.
 - Porter stemmer uses rules such as 
  sses -> ss
  ies -> i
  ss  -> ss
  s   ->

(Edited: 2019-10-09)

Group Members: Xincheng Yuan, Qianhui Fan Question 9 for bonus: Suppose n=4 what would be the character n-grams for the word salad? (1pt) What is stopping? (1pt) What is stemming (1pt)? Give an example of th kinds of rules used by a Porter stemmer. -4-grams for salad: .sal , sala, alad, lad. - '''Stopping''': the process of filtering out function words that have no well-defined meanings while indexing under a language model, to reduce index size and speed up query response time. - '''Stemming''': to reduce words to their base/root form. - Porter stemmer uses rules such as sses -> ss ies -> i ss -> ss s ->

-- CS267 Fall 2019Practice Midterm 1

Authors: Parth Patel, Shreya Bhajikhaye

Question 9 for bonus: Suppose n=4 what would be the character n-grams for the word salad? (1pt) What is stopping? (1pt) What is stemming (1pt)? Give an example of th kinds of rules used by a Porter stemmer.

n = 4

Character n-grams for the word “salad” would be:

.sal

sala

alad

lad.

Stopping : Stopping is removal of stop words from query before doing an index lookup. 

Stemming : Stemming considers the morphology, or internal structure, of terms, reducing each term to a root form for comparison. 

Porter stemmer uses the prefix of the given term and performs operations on that.

sses -> ss

ies -> i

ss -> ss

s ->

(Edited: 2019-10-09)

Authors: Parth Patel, Shreya Bhajikhaye Question 9 for bonus: Suppose n=4 what would be the character n-grams for the word salad? (1pt) What is stopping? (1pt) What is stemming (1pt)? Give an example of th kinds of rules used by a Porter stemmer. n = 4 Character n-grams for the word “salad” would be: .sal sala alad lad. ---- Stopping : Stopping is removal of stop words from query before doing an index lookup.  ---- Stemming : Stemming considers the morphology, or internal structure, of terms, reducing each term to a root form for comparison.  ---- Porter stemmer uses the prefix of the given term and performs operations on that. sses -> ss ies -> i ss -> ss s ->