-- Practice Midterm Problems Thread
By Vrinda and Sriramm
-
-
Q10. Briefly explain and give example of the following (1pt each): (a) sort-based dictionary, (b) per-term index, (c) move-to-front heuristic, (d) merge based index construction.
-
-
(a) sort-based dictionary - In sort-based dictionaries, the terms that appear in the text collection are sorted in the form of an array or a search tree. Traversal is done through tree traversal. The problems with sort based dictionaries are that all the values should be of the same size. To overcome this problem, we use two arrays- a primary array that contains integers which point to a secondary array containing the actual terms.
-
-
For example:
-
-
Sorted dictionary in memory looks like:
-
-
absence -> 2345, absent -> 26783, …, shake -> 456, shaken -> 1672, shakespeare -> 89541, … zygote -> 6734, zygotes -> 1333.
-
-
Dictionary on disk has “absence” and “absent” nearby and similarly “shake”, “shaken”, “shakespeare”. So, this sort-based lookup makes searching for terms like “absen*” and “shake*” easier.
-
(b) per-term index - per-term index is a set of positions of the terms that are kept as a synchronisation points (something like checkpoints) such that we do not have to go through the entire partitions of the disks to find the position.
For example, the per term indices for “widget” in a document are 12334, 13456, 34598. So, consider that the disk has three partitions. There are 15 other positions at which the term occurs. If we need next(“widget”, 25987), we would know that we have to start searching between 13456 and 34598.
-
(c) move-to-front heuristic - In the “move-to-front” heuristic, the terms are inserted at the back of the list. The terms will be moved to the front of the lists when they are searched. So, the next time when the term is looked up or searched, it will be present in the beginning of the list.
For example, if a new term(in our context) such as “Aztec” is encountered, it is initially inserted at the end of the list. When the term is searched for the first time, it is moved to the front of the posting list, such that when the lookup happens again, the term will be present in the front of the list.
-
(d)Merge-based index - merge-based index is similar to hash-based indexing, in the way that if the index list is small, it can fit into the memory (main memory) and if not, it uses dynamic partitioning to build an in-memory index and as soon as it runs out of memory, it builds an on-disk inverted file by transferring the in-memory index data to disk, deletes the in-memory index, and continues indexing. This procedure is repeated until the whole collection has been indexed, resulting in a set of index files, each representing a part of the whole collection, called an index partition. In a final step, the index partitions are merged into the final index. When we do lookups, we have to search through these index partitions of disk. The posting lists are sorted lexicographically and stored in the disk partitions.
For Example: Let the max amount of in-memory words be 6.
Text -
Doc1: “Do you quarrel, sir?
Doc2: “If you do, stay here, if not, then go.”
Doc3: “Try sir.”
Partition 1:
do - [(1, 1)]
you - [(1, 2)]
quarrel - [(1, 3)]
sir - [(1, 4)]
Partition 2:
if - [(2, 1), (2, 6)]
you - [(2, 2)]
do - [(2, 3)]
stay - [(2, 4)]
here - [(2, 5)]
not - [(2, 7)]
then - [(2, 8)]
go - [(2, 9)]
Partition 3:
try - [(3, 1)]
sir - [(3, 2)]
Recombining:
{
do: [(1, 1), (2, 3)],
go: [(2, 9)],
here: [(2, 5)],
if: [(2, 1), (2, 6)],
not: [(2, 7)],
quarrel: [(1, 3)],
sir: [(1, 4), (3, 2)],
stay: [(2, 4)],
then: [(2, 8)],
try: [(3, 1)],
you: [(1, 2), (2, 2)]
}
(
Edited: 2021-03-22)
By Vrinda and Sriramm
; :
Q10. Briefly explain and give example of the following (1pt each): (a) sort-based dictionary, (b) per-term index, (c) move-to-front heuristic, (d) merge based index construction.
; :
(a) sort-based dictionary - In sort-based dictionaries, the terms that appear in the text collection are sorted in the form of an array or a search tree. Traversal is done through tree traversal. The problems with sort based dictionaries are that all the values should be of the same size. To overcome this problem, we use two arrays- a primary array that contains integers which point to a secondary array containing the actual terms.
; :
For example:
; :
Sorted dictionary in memory looks like:
; :
absence -> 2345, absent -> 26783, …, shake -> 456, shaken -> 1672, shakespeare -> 89541, … zygote -> 6734, zygotes -> 1333.
; :
Dictionary on disk has “absence” and “absent” nearby and similarly “shake”, “shaken”, “shakespeare”. So, this sort-based lookup makes searching for terms like “absen*” and “shake*” easier.
;
:
(b) per-term index - per-term index is a set of positions of the terms that are kept as a synchronisation points (something like checkpoints) such that we do not have to go through the entire partitions of the disks to find the position.
; :
For example, the per term indices for “widget” in a document are 12334, 13456, 34598. So, consider that the disk has three partitions. There are 15 other positions at which the term occurs. If we need next(“widget”, 25987), we would know that we have to start searching between 13456 and 34598.
; :
;
:
(c) move-to-front heuristic - In the “move-to-front” heuristic, the terms are inserted at the back of the list. The terms will be moved to the front of the lists when they are searched. So, the next time when the term is looked up or searched, it will be present in the beginning of the list.
; :
For example, if a new term(in our context) such as “Aztec” is encountered, it is initially inserted at the end of the list. When the term is searched for the first time, it is moved to the front of the posting list, such that when the lookup happens again, the term will be present in the front of the list.
; :
;
:
(d)Merge-based index - merge-based index is similar to hash-based indexing, in the way that if the index list is small, it can fit into the memory (main memory) and if not, it uses dynamic partitioning to build an in-memory index and as soon as it runs out of memory, it builds an on-disk inverted file by transferring the in-memory index data to disk, deletes the in-memory index, and continues indexing. This procedure is repeated until the whole collection has been indexed, resulting in a set of index files, each representing a part of the whole collection, called an index partition. In a final step, the index partitions are merged into the final index. When we do lookups, we have to search through these index partitions of disk. The posting lists are sorted lexicographically and stored in the disk partitions.
; :
For Example: Let the max amount of in-memory words be 6.
; :
Text -
; :
Doc1: “Do you quarrel, sir?
; :
Doc2: “If you do, stay here, if not, then go.”
; :
Doc3: “Try sir.”
; :
Partition 1:
; :
do - [(1, 1)]
; :
you - [(1, 2)]
; :
quarrel - [(1, 3)]
; :
sir - [(1, 4)]
; :
Partition 2:
; :
if - [(2, 1), (2, 6)]
; :
you - [(2, 2)]
; :
do - [(2, 3)]
; :
stay - [(2, 4)]
; :
here - [(2, 5)]
; :
not - [(2, 7)]
; :
then - [(2, 8)]
; :
go - [(2, 9)]
; :
Partition 3:
; :
try - [(3, 1)]
; :
sir - [(3, 2)]
; :
Recombining:
; :
{
do: [(1, 1), (2, 3)],
go: [(2, 9)],
here: [(2, 5)],
if: [(2, 1), (2, 6)],
not: [(2, 7)],
quarrel: [(1, 3)],
sir: [(1, 4), (3, 2)],
stay: [(2, 4)],
then: [(2, 8)],
try: [(3, 1)],
you: [(1, 2), (2, 2)]
}