-- Midterm_2 Solution
Q3 and Q4 submitted by Anand, Anjitha and Neha
3. In sort based index construction we read each term in the corpus and emit pairs, which are written immediately to disk. After all the terms are written to disk, the on-disk records are sorted in the lexicographical order of unique termIDs.
Example:
Corpus: do you quarrel
position = 0
T -> do
termID = 6
R0 ->
position = 1
T -> you
termID = 15
R1 ->
position = 2
T -> quarrel
termID = 2
R1 ->
Disk -> , ,
After sorting lexicographically,
DIsk -> ,,
4. ScoreBM25q,d= IDF(t)*TFBM25(t,d)
IDFt=log(NNt) and TFBM25=ft,d*(k1+1)ft,d+k1*(1-b+b*(ldlavg))
k1 and b are usually set of 1.2 and 0.75 respectively
Example-
Suppose there are 2 documents namely d1, d2
Query q = (t1, t2)
Nt1 = Nt2 =1
ft1,d1 = 10 ft2,d1 = 10
ld1 = ld2 = lavg
ScoreBM25(q, d1) = log 21 *10*1.2+110+1.21-b+b*1 + log 21 *10*1.2+110+1.21-b+b*1
= 2*1.96
=3.92
<nowiki> Q3 and Q4 submitted by Anand, Anjitha and Neha
3. In sort based index construction we read each term in the corpus and emit <termID, position> pairs, which are written immediately to disk. After all the terms are written to disk, the on-disk records are sorted in the lexicographical order of unique termIDs.
Example:
Corpus: do you quarrel
position = 0
T -> do
termID = 6
R0 -> <6,0>
position = 1
T -> you
termID = 15
R1 -> <15,1>
position = 2
T -> quarrel
termID = 2
R1 -> <2,2>
Disk -> <6,0>, <15,1>, <2,2>
After sorting lexicographically,
DIsk -> <2,2>,<6,0>, <15,1>
4. ScoreBM25q,d= IDF(t)*TFBM25(t,d)
IDFt=log(NNt) and TFBM25=ft,d*(k1+1)ft,d+k1*(1-b+b*(ldlavg))
k1 and b are usually set of 1.2 and 0.75 respectively
Example-
Suppose there are 2 documents namely d1, d2
Query q = (t1, t2)
Nt1 = Nt2 =1
ft1,d1 = 10 ft2,d1 = 10
ld1 = ld2 = lavg
ScoreBM25(q, d1) = log 21 *10*1.2+110+1.21-b+b*1 + log 21 *10*1.2+110+1.21-b+b*1
= 2*1.96
=3.92
</nowiki>