2018-11-14

Midterm_2 Solution.

Post your Midterm 2 Solution here.

-- Midterm_2 Solution
Team Tyler, Riti, Priyanka
Q5. "Lines spoken by Caesar in the play Julius Caesar"
 ("<LINE>" ... "</LINE>") <-
 ((("<SPEECH>" ... "</SPEECH>") ->
 (("<SPEAKER>" ... "</SPEAKER>") -> "caesar")) <-
 (("<PLAY>" ... "</PLAY>") -> 
 (("<TITLES>" ... "</TITLE>") ->
 "Julius Caesar")))
A line contained in a [speech which contains the speaker "caesar"], which is itself contained in [a play whose title contains "Julius Caesar"].
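As a rough cross-check (not the GCL evaluation itself), the same selection can be expressed directly over the play's XML markup. A minimal Python sketch, where the file name and the exact casing of the tags and speaker text are assumptions:

    import xml.etree.ElementTree as ET

    play = ET.parse("j_caesar.xml").getroot()           # hypothetical file name; <PLAY> is the root
    title = play.findtext("TITLE") or ""
    if "Julius Caesar" in title:                        # a play containing the title "Julius Caesar"
        for speech in play.iter("SPEECH"):              # speeches contained in that play
            speakers = [s.text or "" for s in speech.findall("SPEAKER")]
            if any(s.strip().lower() == "caesar" for s in speakers):    # speech containing the speaker "caesar"
                for line in speech.findall("LINE"):     # lines contained in that speech
                    print(line.text)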
 
Q6. trec_eval needs the following files: a trec_rel_file, which records whether a given document was judged relevant to a query, and a trec_top_file, which identifies the search-engine run that produced the results and gives their ranks and similarity scores.
 
trec_rel_file format: query_id iter docno rel
 where, 
 query_id = id of query
 iter = iteration number
 docno = the document being judged
 rel = human judgement of whether docno was relevant to the query with this query_id
 
 trec_top_file format: query_id iter docno rank similarity run_id
 where, 
 query_id, iter, docno - same as above
 rank - rank of doc for query based on similarity score
 similarity - the score assigned to the document by the ranking algorithm
 run_id - the id of the run
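A minimal sketch of what one record in each file looks like (the query id, docno, score and run id below are made-up illustrative values; the literal Q0 is the conventional filler for the iteration column in result files):

    # trec_rel_file:  query_id iter docno rel
    with open("trec_rel_file", "w") as f:
        f.write("301 0 FBIS3-10082 1\n")                 # document judged relevant (rel = 1) for query 301

    # trec_top_file: query_id iter docno rank similarity run_id
    with open("trec_top_file", "w") as f:
        f.write("301 Q0 FBIS3-10082 1 4.237 myrun\n")    # ranked 1st with similarity 4.237 by run "myrun"

trec_eval would then be run roughly as trec_eval trec_rel_file trec_top_file to compute the evaluation measures.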

-- Midterm_2 Solution
Q9 and Q10: Team: Jonathan, Steven, Yaoyan
No 9
First we split the index into generations 1, 2, 3, etc. Each time the in-memory postings are flushed, a new partition is created and stored as generation 1. Once that is done, we check whether another generation-1 partition already exists. If it does, the two generation-1 partitions are merged into a single generation-2 partition and the original generation-1 partitions are discarded. The process repeats upward through the higher generations until there is at most one partition per generation.
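A minimal in-memory sketch of this merge loop, assuming partitions are just dicts mapping term -> sorted postings (merge_partitions stands in for the real on-disk merge):

    def merge_partitions(a, b):
        """Stand-in for the on-disk merge of two partitions (term -> sorted postings)."""
        merged = dict(a)
        for term, postings in b.items():
            merged[term] = sorted(merged.get(term, []) + postings)
        return merged

    def add_partition(generations, new_partition):
        """generations maps generation number -> partition (at most one per generation)."""
        gen = 1
        while gen in generations:                       # another partition of this generation exists
            new_partition = merge_partitions(generations.pop(gen), new_partition)
            gen += 1                                    # the merged result belongs to the next generation
        generations[gen] = new_partition                # invariant restored: one partition per generation
        return generations

    # Usage: flush in-memory postings as generation-1 partitions, one at a time.
    gens = {}
    for part in [{"do": [0]}, {"you": [1]}, {"quarrel": [2]}, {"sir": [3]}]:
        add_partition(gens, part)
    print(sorted(gens))                                 # [3]: four generation-1 partitions end up as one generation-3 partition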
No. 10
LMJM(q,d) = ∑_(t∈q) q_t · log(1 + ((1−λ)/λ) · (f_t,d / l_d) · (l_C / l_t)), where
           q_t denotes the number of occurrences of term t in the query
           λ controls the relative weight given to the document and collection language models
           f_t,d denotes the frequency of term t in document d
           l_d denotes the length of document d
           l_C denotes the total number of tokens in the corpus
           l_t denotes the total number of occurrences of term t in the corpus
LMD(q,d) = ∑_(t∈q) q_t · log(1 + (f_t,d / μ) · (l_C / l_t)) − n · log(1 + l_d / μ), where
           all other parameters are the same as above,
           n denotes the number of terms in the query q,
           and μ is the Dirichlet smoothing parameter, a pseudo-count that effectively increases the length of document d
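A minimal sketch of both scoring functions using the same symbols (the toy statistics below are made-up illustrative values):

    from math import log

    def lm_jm(query, f_td, l_d, l_C, l_t, lam=0.5):
        """Jelinek-Mercer smoothed language-model score of one document."""
        score = 0.0
        for t, q_t in query.items():                    # q_t: occurrences of term t in the query
            if f_td.get(t, 0) > 0:                      # terms with f_t,d = 0 contribute log(1) = 0
                score += q_t * log(1 + ((1 - lam) / lam) * (f_td[t] / l_d) * (l_C / l_t[t]))
        return score

    def lm_d(query, f_td, l_d, l_C, l_t, mu=1000):
        """Dirichlet smoothed language-model score of one document."""
        n = sum(query.values())                         # n: number of terms in the query
        score = -n * log(1 + l_d / mu)
        for t, q_t in query.items():
            if f_td.get(t, 0) > 0:
                score += q_t * log(1 + (f_td[t] / mu) * (l_C / l_t[t]))
        return score

    query = {"julius": 1, "caesar": 1}
    f_td  = {"julius": 2, "caesar": 5}                  # term frequencies in document d
    l_t   = {"julius": 40, "caesar": 120}               # total occurrences of each term in the corpus
    print(lm_jm(query, f_td, l_d=300, l_C=100_000, l_t=l_t))
    print(lm_d(query, f_td, l_d=300, l_C=100_000, l_t=l_t))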

-- Midterm_2 Solution
Q3 and Q4 submitted by Anand, Anjitha and Neha
3. In sort-based index construction we read each term in the corpus and emit <termID, position> pairs, which are written immediately to disk. After all the terms have been written, the on-disk records are sorted in increasing order of termID.
 Example: Corpus: "do you quarrel"
  position = 0, T -> do, termID = 6, R0 -> <6,0>
  position = 1, T -> you, termID = 15, R1 -> <15,1>
  position = 2, T -> quarrel, termID = 2, R2 -> <2,2>
  Disk -> <6,0>, <15,1>, <2,2>
  After sorting by termID, Disk -> <2,2>, <6,0>, <15,1>
4. Score_BM25(q,d) = ∑_(t∈q) IDF(t) · TF_BM25(t,d), where
  IDF(t) = log(N / N_t) and TF_BM25(t,d) = f_t,d · (k1 + 1) / (f_t,d + k1 · (1 − b + b · (l_d / l_avg)))
  k1 and b are usually set to 1.2 and 0.75 respectively.
 Example: Suppose there are 2 documents d1, d2 and a query q = (t1, t2), with N_t1 = N_t2 = 1, f_t1,d1 = f_t2,d1 = 10, and l_d1 = l_d2 = l_avg.
  Score_BM25(q, d1) = log_2(2/1) · (10 · (1.2+1)) / (10 + 1.2 · (1 − b + b · 1)) + log_2(2/1) · (10 · (1.2+1)) / (10 + 1.2 · (1 − b + b · 1)) = 2 · 1.96 = 3.92
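A minimal sketch that re-derives the worked BM25 example, assuming the logarithm is taken base 2 (which makes IDF(t) = log_2(2/1) = 1 and reproduces the 2 · 1.96 figure):

    from math import log2

    def bm25_term(N, N_t, f_td, l_d, l_avg, k1=1.2, b=0.75):
        idf = log2(N / N_t)                                           # IDF(t) = log2(N / N_t)
        tf  = f_td * (k1 + 1) / (f_td + k1 * (1 - b + b * (l_d / l_avg)))
        return idf * tf

    # Two query terms, each with N_t = 1, f_t,d1 = 10, and l_d1 = l_avg.
    score = sum(bm25_term(N=2, N_t=1, f_td=10, l_d=1.0, l_avg=1.0) for _ in ("t1", "t2"))
    print(round(score, 2))                                            # 3.93 (the answer rounds per term: 2 * 1.96 = 3.92)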

-- Midterm_2 Solution
  Name: Harita, Preethi, Avinash, Xianghong Sun
   
Q-1 How would the codepoint Ԉ, U+0508, be encoded in UTF-8?
 First convert the codepoint to binary: 101 0000 1000 (an 11-bit string)
 Since 11 bits fit in the 2-byte UTF-8 pattern 110xxxxx 10xxxxxx, fill the x's with the bits 10100 001000
 Final answer: 11010100 10001000
 Converted to hex: D4 88
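A quick check of the hand encoding using Python's built-in UTF-8 codec:

    encoded = "\u0508".encode("utf-8")                   # the codepoint U+0508
    print(" ".join(f"{byte:08b}" for byte in encoded))   # 11010100 10001000
    print(encoded.hex().upper())                         # D488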
Q-2 Briefly explain the following concepts: (a) per-term index and (b) dictionary interleaving
Per-Term Index: A per-term index is a disk-based data structure that is part of the header of a posting list. It contains a copy of a subset of the postings in the posting list -- for example, every 5000th posting -- together with each one's location in the posting list. When accessing the posting list, this structure is read into memory. Searches are first done on the per-term index; then only the relevant block of 5000 postings is read into memory. Elements of the per-term index are sometimes called synchronization points, and the whole per-term index method is sometimes called self-indexing.
Dictionary Interleaving: A technique for organizing the dictionary on disk when it is not small enough to fit into the main memory of a single machine. In an interleaved dictionary, all entries are stored on disk, and each entry is placed right before its posting list. This allows the search engine to fetch a dictionary entry and its posting list in one sequential read operation. In addition, copies of some dictionary entries are kept in memory. When the search engine needs to determine the location of a term's posting list, it performs a binary search on the sorted list of in-memory dictionary entries followed by a sequential scan of the data found between two entries.
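A minimal in-memory sketch of the per-term index idea (synchronization points). In a real engine the blocks stay on disk and only the selected block is read; the Python list below just stands in for that:

    from bisect import bisect_right

    BLOCK = 5000                                        # "every 5000th posting"

    def build_sync_points(postings, k=BLOCK):
        """Every k-th posting paired with the offset where its block starts."""
        return [(postings[i], i) for i in range(0, len(postings), k)]

    def next_geq(postings, sync_points, target, k=BLOCK):
        """Smallest posting >= target, scanning only one block of k postings."""
        keys = [p for p, _ in sync_points]
        idx = max(bisect_right(keys, target) - 1, 0)    # binary search over the synchronization points
        start = sync_points[idx][1]
        for p in postings[start:start + k]:             # read just the relevant block
            if p >= target:
                return p
        return None

    postings = list(range(0, 1_000_000, 7))             # made-up posting list
    sync = build_sync_points(postings)
    print(next_geq(postings, sync, 123_456))            # 123459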
(Edited: 2018-11-14)

-- Midterm_2 Solution
Problem # 8 - Shehba, Neeraj, Tim
  • vByte: In order to avoid expensive bit-fiddling operations, we want to encode each Δ-value using an integral number of bytes. The easiest way to do this is to split the binary representation of each Δ-value into 7-bit chunks and prepend to each such chunk a continuation flag -- a single bit that indicates whether the current chunk is the last one or whether there are more to follow. The resulting method is called vByte. The vByte-encoded representation of the doclist L = (1624, 1650, 1876, 1972, 2356, ...), Δ(L) = (1624, 26, 226, 96, 384, ...) is 1 1011000 0 0001100 0 0011010 1 1100010 0 0000001 0 1100000 1 0000000 0 0000011 ... (a small encoder sketch reproducing this bit string follows the list).
  • Simple-9: Simple-9 is an example of a word-aligned code, which works on multiples of 32 bits. In this coding scheme, the first 4 bits of a 32-bit word serve as a selector. The remaining 28 bits are split into chunks according to the particular selector. For example, if the top four bits are 0000, then 28 1-bit numbers are encoded in the word; if the selector is 0001, 14 2-bit numbers can be encoded; and so on.
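A minimal vByte encoder sketch that reproduces the bit string in the vByte item above (the flag convention, 1 = more chunks follow and 0 = last chunk, is the one used in the answer):

    def vbyte_encode(delta):
        groups = []
        while True:
            chunk, delta = delta & 0x7F, delta >> 7     # low 7 bits, then the rest
            flag = 1 if delta else 0                    # 1 = more chunks follow, 0 = last chunk
            groups.append(f"{flag} {chunk:07b}")
            if not delta:
                return groups

    for d in (1624, 26, 226, 96, 384):
        print(" ".join(vbyte_encode(d)))
    # 1 1011000 0 0001100
    # 0 0011010
    # 1 1100010 0 0000001
    # 0 1100000
    # 1 0000000 0 0000011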
(Edited: 2018-11-15)
2018-11-16

-- Midterm_2 Solution
Question 7 - Shehba, Neeraj, Tim
Gamma coding:
  1. This coding is generally used to encode a delta postings list.
  2. We write (number of bits − 1) in unary using 0's, followed by a 1, followed by the number in binary.
  3. So initially number 5 would be encoded as 001 101.
  4. Notice the high-order bit of the binary code for the number is redundant given that we know the length of the number, so we can drop this bit to get the actual encoding. For 5, 001 101 becomes 001 01.
Golomb code:
  1. The Δ-gaps in the postings list are expected to follow a geometric distribution. The gaps are plotted on a graph as probability vs. `⌊log_2(Δ)⌋ + 1` (the number of bits needed to write each gap in binary).
  2. The probability is given by `N_T/N`, where `N_T` is the number of documents in which the term T appears and N is the total number of documents.
  3. From the graph we can determine which gap length occurs with the highest frequency. This common length is used to pick the modulus for the Golomb code.
  4. Procedure to encode a golomb code is:
    • determine an appropriate modulus
    • split each Δ-value k into two components: a quotient q(k) and a remainder r(k) where:
      `q(k) = ⌊(k−1) / M⌋`, `r(k) = (k−1) mod M`
    • encode k by writing q(k)+1 in unary followed by r(k) as a `⌊log_2(M)⌋`-bit or `⌈log_2(M)⌉`-bit number.
  The optimal modulus can be calculated using the formula `M_(opt) = ⌈log_2(2−N_T/N) / (−log_2(1−N_T/N))⌉`.
  Example: encoding the number 5, assuming the modulus is 4:
    q(k) = ⌊(5−1)/4⌋ = 1 and r(k) = (5−1) mod 4 = 0,
  so the number is encoded as 01 00.
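A minimal sketch of both codes, reproducing the worked examples above (5 -> 001 01 for gamma, and 5 with modulus 4 -> 01 00 for Golomb). For simplicity the remainder is always written with `⌈log_2(M)⌉` bits rather than the shorter truncated form:

    from math import ceil, log2

    def gamma_encode(k):
        body = format(k, "b")[1:]                 # binary of k with the redundant leading 1 dropped
        return "0" * len(body) + "1" + body       # (number of bits - 1) zeros, a 1, then the body

    def golomb_encode(k, M):
        q, r = (k - 1) // M, (k - 1) % M
        unary = "0" * q + "1"                     # q(k) + 1 in unary
        bits = ceil(log2(M))                      # sketch: always use the wider fixed width
        return unary + format(r, "0" + str(bits) + "b")

    print(gamma_encode(5))                        # 00101, i.e. "001 01" above
    print(golomb_encode(5, 4))                    # 0100,  i.e. "01 00" above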
(Edited: 2018-11-16)