-- CS267 Fall 2019Practice Midterm 1
Group Members: Amala Chirayil, Aparna Kale, Sunhera Paul
10) Briefly explain and give an example of the following (1pt each): (a) sort-based dictionary, (b) per-term index, (c) move-to-front heuristic, (d) merge based index construction.
a) sort based dictionary: It is a dictionary in which all the terms appearing in the corpus are arranged in the sorted order. Terms are stored in an array or in a search tree. If stored in a tree, tree traversals are used to traverse the dictionary.
e.g.
[a] -> list
[b] -> list
b)per-term index: In order to provide efficient random access into any given on-disk postings list, each list has to be equipped
with a data structure, which is referred to as the per-term index. This data structure is stored on disk, at the beginning of the
respective postings list. It has a copy of a subset of the postings in the list, for instance, a copy of every 6, 000th posting.
When we access the posting list for the given term t, before performing any actual index operations on the list, the search
engine loads t’s per-term index into memory. Random-access operations of the type required by the next method can then be
carried out by performing a binary search on the in-memory array representing t ’s per-term index, followed by loading up to
6,000 postings from the candidate range into memory. After that random access is performed.
(c) move-to-front heuristic: Consider that we are using hash tables. The collisions in hash tables are resolved by chaining. In
this case, to accelerate the dictionary search, when a term lookup is done and the items it is not found at the front of the chain
then it has to be moved to the front. The overhead is O(1) and tends to keep frequent items at the front of the chain. This is
called as a move-to-front heuristic.
e.g. We have elements, 1,2,3,12,3 and hash function is mod2
1-> 1,3
2-> 2,12
For 3,
1->3,1
2->2,12
(d) merge based index construction - Merge-based indexing is a generalization of the in-memory index construction method by
building inverted lists by means of hash table lookup. If the collection for which an index is being created is small enough for the
index to fit completely into main memory, then merge-based indexing behaves exactly like in-memory indexing. If the text
collection is too large to be indexed completely in main memory, then the indexing process performs a dynamic partitioning of
the collection. That is, it starts building an in-memory index. As soon as it runs out of memory (or when a predefined memory
utilization threshold is reached), it builds an on-disk inverted file by transferring the in-memory index data to disk, deletes the
in-memory index, and continues indexing. This procedure is repeated until the whole collection has been indexed. The result of
the process outlined above is a set of inverted files, each representing a certain part of the whole collection. Each such
subindex is referred to as an index partition. In a final step, the index partitions are merged into the final index, representing the
entire text collection. The final result thus represents a merge based index construction.
For eg.
Index A: [[merge, [2,3,4,5]], (sort,[2,4])]
Index B: [[merge, [6,9]], (element, [15,42])]
Merged Index Construction: [[ merge, [2,3,4,5,6,9] ], [ sort, [2,4] ], [ element, [15, 42] ]
(
Edited: 2019-10-04)
'''Group Members: Amala Chirayil, Aparna Kale, Sunhera Paul
'''
10) Briefly explain and give an example of the following (1pt each): (a) sort-based dictionary, (b) per-term index, (c) move-to-front heuristic, (d) merge based index construction.
a) sort based dictionary: It is a dictionary in which all the terms appearing in the corpus are arranged in the sorted order. Terms are stored in an array or in a search tree. If stored in a tree, tree traversals are used to traverse the dictionary.
e.g.
[a] -> list
[b] -> list
b)per-term index: In order to provide efficient random access into any given on-disk postings list, each list has to be equipped
with a data structure, which is referred to as the per-term index. This data structure is stored on disk, at the beginning of the
respective postings list. It has a copy of a subset of the postings in the list, for instance, a copy of every 6, 000th posting.
When we access the posting list for the given term t, before performing any actual index operations on the list, the search
engine loads t’s per-term index into memory. Random-access operations of the type required by the next method can then be
carried out by performing a binary search on the in-memory array representing t ’s per-term index, followed by loading up to
6,000 postings from the candidate range into memory. After that random access is performed.
(c) move-to-front heuristic: Consider that we are using hash tables. The collisions in hash tables are resolved by chaining. In
this case, to accelerate the dictionary search, when a term lookup is done and the items it is not found at the front of the chain
then it has to be moved to the front. The overhead is O(1) and tends to keep frequent items at the front of the chain. This is
called as a move-to-front heuristic.
e.g. We have elements, 1,2,3,12,3 and hash function is mod2
1-> 1,3
2-> 2,12
For 3,
1->3,1
2->2,12
(d) merge based index construction - Merge-based indexing is a generalization of the in-memory index construction method by
building inverted lists by means of hash table lookup. If the collection for which an index is being created is small enough for the
index to fit completely into main memory, then merge-based indexing behaves exactly like in-memory indexing. If the text
collection is too large to be indexed completely in main memory, then the indexing process performs a dynamic partitioning of
the collection. That is, it starts building an in-memory index. As soon as it runs out of memory (or when a predefined memory
utilization threshold is reached), it builds an on-disk inverted file by transferring the in-memory index data to disk, deletes the
in-memory index, and continues indexing. This procedure is repeated until the whole collection has been indexed. The result of
the process outlined above is a set of inverted files, each representing a certain part of the whole collection. Each such
subindex is referred to as an index partition. In a final step, the index partitions are merged into the final index, representing the
entire text collection. The final result thus represents a merge based index construction.
For eg.
Index A: [[merge, [2,3,4,5]], (sort,[2,4])]
Index B: [[merge, [6,9]], (element, [15,42])]
Merged Index Construction: [[ merge, [2,3,4,5,6,9] ], [ sort, [2,4] ], [ element, [15, 42] ]