Yioop - PHP Search Engine

CS267 Fall 2019Practice Midterm 1.

Post your solutions to practice midterm 1

-- CS267 Fall 2019Practice Midterm 1

3. Give the algorithm from class for phrase search of the phrase given by terms t[1], ..., t[n] after the location position, using our inverted index ADT. (4pts).

  phaseSearch(t[1],t[2],......,t[n], position)
	u = position
	for i = 1 to n do
	   u = next(t[i], u)
	if u == infinity 
	   return infinity
	v = u
	for i = n-1 to 1 do
	   v = prev(t[i], v)   
	if u-v == n-1 then
	   return [u,v]
	else
	   phaseSearch(t[1],t[2],......,t[n],u)

Group Members : Tushar Panpaliya & Rajendra Maharjan

(Edited: 2019-10-04)

''' 3. Give the algorithm from class for phrase search of the phrase given by terms t[1], ..., t[n] after the location position, using our inverted index ADT. (4pts). ''' phaseSearch(t[1],t[2],......,t[n], position) u = position for i = 1 to n do u = next(t[i], u) if u == infinity return infinity v = u for i = n-1 to 1 do v = prev(t[i], v) if u-v == n-1 then return [u,v] else phaseSearch(t[1],t[2],......,t[n],u) Group Members : Tushar Panpaliya & Rajendra Maharjan

-- CS267 Fall 2019Practice Midterm 1

Problem 2 Suppose the phrase book reviews occurs 1000 times in a corpus of 900,000 bigrams, whereas, the term book appears 2000 times in our corpus. What is the 0th order bigram language model probability of book reviews? (1pt answer, 1pt work). What is the first order language model probability? (1pt answer, 1pt work).

Solution -

 total_bigrams= 900,000
 frequency('book reviews') = 1000
 frequency('books') = 2000

 M0('book reviews') = 2000 / 900000 = 0.01111 ~ 0.01

 M1('reviews | book') = 1000 / 2000 = 0.5

Group Members - Ashutosh Kale, Sushant Mane

(Edited: 2019-10-04)

Problem 2 Suppose the phrase book reviews occurs 1000 times in a corpus of 900,000 bigrams, whereas, the term book appears 2000 times in our corpus. What is the 0th order bigram language model probability of book reviews? (1pt answer, 1pt work). What is the first order language model probability? (1pt answer, 1pt work). Solution - total_bigrams= 900,000 frequency('book reviews') = 1000 frequency('books') = 2000 M0('book reviews') = 2000 / 900000 = 0.01111 ~ 0.01 M1('reviews | book') = 1000 / 2000 = 0.5 Group Members - Ashutosh Kale, Sushant Mane

-- CS267 Fall 2019Practice Midterm 1

 Group Members :- Bhumika Kaur Matharu
                  Charulata Lodha

 8. Give the Proximity Ranking algorithm discussed in class.

  nextCover(t[1],.., t[n], position) 
 {
     v:= max_(1≤ i ≤ n)(next(t[i], position));
     if(v == infty)
         return [ infty, infty];
     u := min_(1≤ i ≤ n)(prev(t[i], v+1))
     if(docid(u) == docid(v) ) then 
         return [u,v]; 
        // covers need to be in the same document
     else
         return nextCover(t[1],.., t[n], u);
 }

 rankProximity(t[1],.., t[n], k)
 // t[] term vector
 // k number of results to return 
 {
     u := - infty;
     [u,v] := nextCover(t[1],.., t[n], u);
     d := docid(u);
     score := 0;
     j := 0;
     while( u < infty) do
         if(d < docid(u) ) then
         // if docid changes record info about last docid
             j := j + 1;
             Result[j].docid := d;
             Result[j].score := score;
             d := docid(u);
             score := 0;
         score := score + 1/(v - u +1);
         [u, v] := nextCover(t[1],.., t[n], u);
     if(d < infty) then
         // record last score if not recorded
         j := j + 1;
         Result[j].docid := d;
         Result[j].score := score;
     sort Result[1..j] by score;
     return Result[1..k];   
 }

(Edited: 2019-10-04)

Group Members :- Bhumika Kaur Matharu Charulata Lodha '''8. Give the Proximity Ranking algorithm discussed in class.''' nextCover(t[1],.., t[n], position) { v:= max_(1≤ i ≤ n)(next(t[i], position)); if(v == infty) return [ infty, infty]; u := min_(1≤ i ≤ n)(prev(t[i], v+1)) if(docid(u) == docid(v) ) then return [u,v]; // covers need to be in the same document else return nextCover(t[1],.., t[n], u); } rankProximity(t[1],.., t[n], k) // t[] term vector // k number of results to return { u := - infty; [u,v] := nextCover(t[1],.., t[n], u); d := docid(u); score := 0; j := 0; while( u < infty) do if(d < docid(u) ) then // if docid changes record info about last docid j := j + 1; Result[j].docid := d; Result[j].score := score; d := docid(u); score := 0; score := score + 1/(v - u +1); [u, v] := nextCover(t[1],.., t[n], u); if(d < infty) then // record last score if not recorded j := j + 1; Result[j].docid := d; Result[j].score := score; sort Result[1..j] by score; return Result[1..k]; }

-- CS267 Fall 2019Practice Midterm 1

Team Members : Charulata Lodha & Bhumika Mathura

Q9. (i)Suppose n=4 what would be the character n-grams for the word salad? (1pt)

 Ans: 
   .sal
   sala
   alad
   lad.

 (ii)What is stopping? (1pt) 	
  Stopping is removal of stop words from query before doing an index lookup.
  Stopwords, usually include the function words. 
  In addition to function words, a list of stopwords might include single letters, digits, and other common terms, 
  such as the  state-of-being verb .
  Function words are words that have no well-defined meanings in and of themselves; 
  rather, they modify other words or indicate   grammatical relationships. 
  In English, function words include prepositions, articles, pronouns and articles, and conjunctions. 
  Function words are usually among the most frequently occurring words in any language.

  (iii)What is stemming (1pt)? 	
   Stemming considers the morphology, or internal structure, of terms, reducing each term to a root form for comparison.
   For  example, “orienteering” and “orienteers” might reduce to the root form “orienteer”; “runs” and “running” might 
   reduce to  “run”. 
  The program/function that does this is called a stemmer.

 (iv) Give an example of the kinds of rules used by a Porter stemmer.	
   The Porter Stemmer operates by applying a sequence of rewrites against the current suffix stem. For example,
   sses -> ss
   ies -> i
   ss -> ss
   s ->
   All the rules are applied in sequence, if there was a change they are reapplied until no change.
   The result is the stemmed term.

(Edited: 2019-10-04)

Team Members : Charulata Lodha & Bhumika Mathura ''' Q9.''' '''(i)Suppose n=4 what would be the character n-grams for the word salad? (1pt) ''' Ans: .sal sala alad lad. '''(ii)What is stopping? (1pt) ''' Stopping is removal of stop words from query before doing an index lookup. Stopwords, usually include the function words. In addition to function words, a list of stopwords might include single letters, digits, and other common terms, such as the state-of-being verb . Function words are words that have no well-defined meanings in and of themselves; rather, they modify other words or indicate grammatical relationships. In English, function words include prepositions, articles, pronouns and articles, and conjunctions. Function words are usually among the most frequently occurring words in any language. '''(iii)What is stemming (1pt)? ''' Stemming considers the morphology, or internal structure, of terms, reducing each term to a root form for comparison. For example, “orienteering” and “orienteers” might reduce to the root form “orienteer”; “runs” and “running” might reduce to “run”. The program/function that does this is called a stemmer. '''(iv) Give an example of the kinds of rules used by a Porter stemmer.''' The Porter Stemmer operates by applying a sequence of rewrites against the current suffix stem. For example, sses -> ss ies -> i ss -> ss s -> All the rules are applied in sequence, if there was a change they are reapplied until no change. The result is the stemmed term.

-- CS267 Fall 2019Practice Midterm 1

Group Members :- Snehal Dhumal

                  Aniket Chandak

Q4. Data Structure Needed : Associative Array E.g. $inverted_idx=["hello"=>[1,3,45,67,89]];

Code Snippet:

function first($term) {

  
  return $inverted_idx[$term][0];

}

function last($term) {

  $length=count($inverted_idx[$term])-1;
  return $inverted_idx[$term][$length];

}

Group Members :- Snehal Dhumal Aniket Chandak Q4. Data Structure Needed : Associative Array E.g. $inverted_idx=["hello"=>[1,3,45,67,89]]; Code Snippet: function first($term) { return $inverted_idx[$term][0]; } function last($term) { $length=count($inverted_idx[$term])-1; return $inverted_idx[$term][$length]; }

-- CS267 Fall 2019Practice Midterm 1

Problem 6: Definitions of DocID Index, Frequency Index, Positional Index, and Schema-Independent Index are provided below.

    1. The DocID Index	 is an inverted index where the postings list of each term contains the document identifiers of all documents in which the term appears.

    2. The Frequency Index	 is an inverted index where the postings list of each term is of the form (d, ft,d) where d is defined as the document identifier and ft,d is defined as the frequency of the term, t, in document, d.

    3. The Positional Index	 is an inverted index where the postings list of each term is of the form (d, ft,d, <p1,…pt,f>) where d is defined as the document identifier, ft,d is defined as the frequency of the term, t, in document, d, and p1,…pt,d is the list of positions of the term t, in the document, d.

    4. The Schema-Independent Index	 is an inverted index where the postings list of each term simply contains the positions of the terms in the document.

Group Members: Amala Chirayil, Aparna Kale, Sunhera Paul

(Edited: 2019-10-04)

Problem 6: Definitions of DocID Index, Frequency Index, Positional Index, and Schema-Independent Index are provided below. 1. The '''DocID Index''' is an inverted index where the postings list of each term contains the document identifiers of all documents in which the term appears. 2. The '''Frequency Index''' is an inverted index where the postings list of each term is of the form (d, ft,d) where d is defined as the document identifier and ft,d is defined as the frequency of the term, t, in document, d. 3. The '''Positional Index''' is an inverted index where the postings list of each term is of the form (d, ft,d, <p1,…pt,f>) where d is defined as the document identifier, ft,d is defined as the frequency of the term, t, in document, d, and p1,…pt,d is the list of positions of the term t, in the document, d. 4. The '''Schema-Independent Index''' is an inverted index where the postings list of each term simply contains the positions of the terms in the document. Group Members: Amala Chirayil, Aparna Kale, Sunhera Paul

-- CS267 Fall 2019Practice Midterm 1

Team Members: Prathamesh Kumkar and Rahul Shukla

Problem 1:

Resource Description for Annotation 2019-10-04 142528.png

Team Members: Prathamesh Kumkar and Rahul Shukla Problem 1: ((resource:Annotation 2019-10-04 142528.png|Resource Description for Annotation 2019-10-04 142528.png))

-- CS267 Fall 2019Practice Midterm 1

 Q9. (i)Suppose n=4 what would be the character n-grams for the word salad? (1pt)

 n = 4 Salad

 character n grams for the word salad would be

 .sala
 sala
 alad
 lad.

 # Stopping :

 Stopping words are the usually the most frequently occuring words in any language 
 that have no well- defined meanings, rather they modify words of indicate grammatical relationships.
 In enlish langugage there words are the prepositions, articles, pronouns and conjunctions

 For e.g : a , an the , at etc.

 # Stemming is the process which considers the morphology or the internal structure of the query
 terms and try to match the root words .

 For e.g. orienteering  matches  to orienteers
         running matches to runs

 Rules of the porter stemmer :

 The Porter Stemmer operates on the suffix of the given term . For example,
   sses -> ss
   ies -> i
   ss -> ss
   s ->

Group Member Name : Rajendra Maharjan , Tushar Panpaliya

(Edited: 2019-10-05)

Q9. (i)Suppose n=4 what would be the character n-grams for the word salad? (1pt) n = 4 Salad character n grams for the word salad would be .sala sala alad lad. # Stopping : Stopping words are the usually the most frequently occuring words in any language that have no well- defined meanings, rather they modify words of indicate grammatical relationships. In enlish langugage there words are the prepositions, articles, pronouns and conjunctions For e.g : a , an the , at etc. # Stemming is the process which considers the morphology or the internal structure of the query terms and try to match the root words . For e.g. orienteering matches to orienteers running matches to runs Rules of the porter stemmer : The Porter Stemmer operates on the suffix of the given term . For example, sses -> ss ies -> i ss -> ss s -> Group Member Name : Rajendra Maharjan , Tushar Panpaliya

-- CS267 Fall 2019Practice Midterm 1

Group Members: Amala Chirayil, Aparna Kale, Sunhera Paul

10) Briefly explain and give an example of the following (1pt each): (a) sort-based dictionary, (b) per-term index, (c) move-to-front heuristic, (d) merge based index construction.

a) sort based dictionary: It is a dictionary in which all the terms appearing in the corpus are arranged in the sorted order. Terms are stored in an array or in a search tree. If stored in a tree, tree traversals are used to traverse the dictionary.

e.g.

[a] -> list

[b] -> list

b)per-term index: In order to provide efficient random access into any given on-disk postings list, each list has to be equipped

with a data structure, which is referred to as the per-term index. This data structure is stored on disk, at the beginning of the

respective postings list. It has a copy of a subset of the postings in the list, for instance, a copy of every 6, 000th posting.

When we access the posting list for the given term t, before performing any actual index operations on the list, the search

engine loads t’s per-term index into memory. Random-access operations of the type required by the next method can then be

carried out by performing a binary search on the in-memory array representing t ’s per-term index, followed by loading up to

6,000 postings from the candidate range into memory. After that random access is performed.

(c) move-to-front heuristic: Consider that we are using hash tables. The collisions in hash tables are resolved by chaining. In

this case, to accelerate the dictionary search, when a term lookup is done and the items it is not found at the front of the chain

then it has to be moved to the front. The overhead is O(1) and tends to keep frequent items at the front of the chain. This is

called as a move-to-front heuristic.

e.g. We have elements, 1,2,3,12,3 and hash function is mod2

1-> 1,3

2-> 2,12

For 3,

1->3,1

2->2,12

(d) merge based index construction - Merge-based indexing is a generalization of the in-memory index construction method by

building inverted lists by means of hash table lookup. If the collection for which an index is being created is small enough for the

index to fit completely into main memory, then merge-based indexing behaves exactly like in-memory indexing. If the text

collection is too large to be indexed completely in main memory, then the indexing process performs a dynamic partitioning of

the collection. That is, it starts building an in-memory index. As soon as it runs out of memory (or when a predefined memory

utilization threshold is reached), it builds an on-disk inverted file by transferring the in-memory index data to disk, deletes the

in-memory index, and continues indexing. This procedure is repeated until the whole collection has been indexed. The result of

the process outlined above is a set of inverted files, each representing a certain part of the whole collection. Each such

subindex is referred to as an index partition. In a final step, the index partitions are merged into the final index, representing the

entire text collection. The final result thus represents a merge based index construction.

For eg.

Index A: [[merge, [2,3,4,5]], (sort,[2,4])]

Index B: [[merge, [6,9]], (element, [15,42])]

Merged Index Construction: [[ merge, [2,3,4,5,6,9] ], [ sort, [2,4] ], [ element, [15, 42] ]

(Edited: 2019-10-04)

'''Group Members: Amala Chirayil, Aparna Kale, Sunhera Paul ''' 10) Briefly explain and give an example of the following (1pt each): (a) sort-based dictionary, (b) per-term index, (c) move-to-front heuristic, (d) merge based index construction. a) sort based dictionary: It is a dictionary in which all the terms appearing in the corpus are arranged in the sorted order. Terms are stored in an array or in a search tree. If stored in a tree, tree traversals are used to traverse the dictionary. e.g. [a] -> list [b] -> list b)per-term index: In order to provide efficient random access into any given on-disk postings list, each list has to be equipped with a data structure, which is referred to as the per-term index. This data structure is stored on disk, at the beginning of the respective postings list. It has a copy of a subset of the postings in the list, for instance, a copy of every 6, 000th posting. When we access the posting list for the given term t, before performing any actual index operations on the list, the search engine loads t’s per-term index into memory. Random-access operations of the type required by the next method can then be carried out by performing a binary search on the in-memory array representing t ’s per-term index, followed by loading up to 6,000 postings from the candidate range into memory. After that random access is performed. (c) move-to-front heuristic: Consider that we are using hash tables. The collisions in hash tables are resolved by chaining. In this case, to accelerate the dictionary search, when a term lookup is done and the items it is not found at the front of the chain then it has to be moved to the front. The overhead is O(1) and tends to keep frequent items at the front of the chain. This is called as a move-to-front heuristic. e.g. We have elements, 1,2,3,12,3 and hash function is mod2 1-> 1,3 2-> 2,12 For 3, 1->3,1 2->2,12 (d) merge based index construction - Merge-based indexing is a generalization of the in-memory index construction method by building inverted lists by means of hash table lookup. If the collection for which an index is being created is small enough for the index to fit completely into main memory, then merge-based indexing behaves exactly like in-memory indexing. If the text collection is too large to be indexed completely in main memory, then the indexing process performs a dynamic partitioning of the collection. That is, it starts building an in-memory index. As soon as it runs out of memory (or when a predefined memory utilization threshold is reached), it builds an on-disk inverted file by transferring the in-memory index data to disk, deletes the in-memory index, and continues indexing. This procedure is repeated until the whole collection has been indexed. The result of the process outlined above is a set of inverted files, each representing a certain part of the whole collection. Each such subindex is referred to as an index partition. In a final step, the index partitions are merged into the final index, representing the entire text collection. The final result thus represents a merge based index construction. For eg. Index A: [[merge, [2,3,4,5]], (sort,[2,4])] Index B: [[merge, [6,9]], (element, [15,42])] Merged Index Construction: [[ merge, [2,3,4,5,6,9] ], [ sort, [2,4] ], [ element, [15, 42] ]