2021-05-14

Homework 5 Explanation.

I asked the Professor to explain a little bit about what he wants us to store in the index file and he replied with this:
 The HW description gives an example document map.
 After the document map, it is up to you whether to store posting lists next or the dictionary.
 For example, if you wrote the posting lists next, this would be followed by each term's posting list. As you write these, remember the offsets.
 Then you could start writing the dictionary.
 This could be a sequence of pairs term followed by a 4byte integer offset (you could do gaps here as well, but let’s keep it easy) for the posting list for that term.
 After this write the sequence of integers for the starts of each term (these integers are offsets into the file where the give term begins).
 Finally, write two more 4byte integers: the total number of terms in the dictionary, the integer offset into the file where the posting lists begin.
 To look up a term. You could fseek to 8bytes from the end of the file (size of file -8). Read the two integers.
 Use total number of terms in the dictionary number to bound your binary search to find a given term.
 Once you found the term and look up its posting list, given my screwy format for the doc map, you can use the offset to where the posting lists begin to let you binary search for a sequence \0 followed by the id you are looking for.
 This could have been made faster if the posting list stored byte offsets into the doc_map rather than id’s, but oh well.
(Edited: 2021-05-14)
I asked the Professor to explain a little bit about what he wants us to store in the index file and he replied with this: The HW description gives an example document map. After the document map, it is up to you whether to store posting lists next or the dictionary. For example, if you wrote the posting lists next, this would be followed by each term's posting list. As you write these, remember the offsets. Then you could start writing the dictionary. This could be a sequence of pairs term followed by a 4byte integer offset (you could do gaps here as well, but let’s keep it easy) for the posting list for that term. After this write the sequence of integers for the starts of each term (these integers are offsets into the file where the give term begins). Finally, write two more 4byte integers: the total number of terms in the dictionary, the integer offset into the file where the posting lists begin. To look up a term. You could fseek to 8bytes from the end of the file (size of file -8). Read the two integers. Use total number of terms in the dictionary number to bound your binary search to find a given term. Once you found the term and look up its posting list, given my screwy format for the doc map, you can use the offset to where the posting lists begin to let you binary search for a sequence \0 followed by the id you are looking for. This could have been made faster if the posting list stored byte offsets into the doc_map rather than id’s, but oh well.

-- Homework 5 Explanation
What exactly is a delta list ?
Delta list = difference list. It is a list that stores a sequence of gaps. The format for the gaps and auxiliary info I want is described in the rest of the sentence.
(Edited: 2021-05-14)
What exactly is a ''delta list''? Delta list = difference list. It is a list that stores a sequence of gaps. The format for the gaps and auxiliary info I want is described in the rest of the sentence.
X