-- Apr 28 In-Class Exercise Thread
N = Size of corpus = 500000
l_avg = Average length of a query in our corpus of queries = 3
l_t = Number of occurrences of the terms "California", "Business", "Tax" and "Return" = 104, 501, 254, 607
DFR score = Sum over every term t in query q of q_t x (1 - P_2,t) x (-log P_1,t)
Or, DFR score = Sum over every t in q of q_t x (log(1 + l_t/N) + f_t,d x log(1 + N/l_t)) / (f_t,d + 1), where log is base 2
Substituting f_t,d with their normalized version:
f'_t,d = f_t,d * log(1 + l_avg/l_d)
We get,
DFR score = Sum over every t in q of q_t * (log(1 + l_t/N) + f'_t,d * log(1 + N/l_t)) / (f'_t,d + 1)
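The normalized formula above can be sketched in Python as follows. This is only a sketch under some assumptions: logarithms are taken base 2 (which is what the worked numbers in this post are consistent with), q_t and f_t,d default to 1, and the function names (normalized_tf, dfr_term, dfr_score) are made up here for illustration.

```python
import math

def normalized_tf(f_td, l_avg, l_d):
    """f'_t,d = f_t,d * log2(1 + l_avg / l_d)."""
    return f_td * math.log2(1 + l_avg / l_d)

def dfr_term(q_t, f_norm, l_t, n):
    """Contribution of one term: q_t * (log2(1 + l_t/N) + f' * log2(1 + N/l_t)) / (f' + 1)."""
    return q_t * (math.log2(1 + l_t / n) + f_norm * math.log2(1 + n / l_t)) / (f_norm + 1)

def dfr_score(term_counts, n, l_avg, l_d, q_t=1, f_td=1):
    """Sum the per-term contributions; term_counts holds l_t for each query term.

    Assumption: every term has the same q_t and f_t,d (1 by default).
    """
    f_norm = normalized_tf(f_td, l_avg, l_d)
    return sum(dfr_term(q_t, f_norm, l_t, n) for l_t in term_counts)
```

For example, dfr_score([104, 501, 254], 500000, l_avg=3, l_d=3) evaluates the formula for three terms with those occurrence counts and a length-3 text.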
a) For "California Business Tax"
For the first query (f_t,d = 1, l_d = 3), f'_t,d for every term is 1 * log(1 + 3/3) = 1
DFR = Sum of:
1 * (log(1+104/500,000)+ 1 * log(1 + 500,000/104)) / (1 + 1) = 6.116
1 * (log(1+501/500,000)+ 1 * log(1 + 500,000/501)) / (1 + 1) = 4.983
1 * (log(1+254/500,000)+ 1 * log(1 + 500,000/254)) / (1 + 1) = 5.472
DFR score = 6.116 + 4.983 + 5.472 = 16.571
b) For "California Business Tax Return":
For the second query (f_t,d = 1, l_d = 4), f'_t,d for every term is 1 * log(1 + 3/4) = log(7/4) ≈ 0.807
DFR = Sum of:
1 * (log(1+104/500,000)+ log(7/4) * log(1 + 500,000/104)) / (log(7/4) + 1) = 5.464
1 * (log(1+501/500,000)+ log(7/4) * log(1 + 500,000/501)) / (log(7/4) + 1) = 4.452
1 * (log(1+254/500,000)+ log(7/4) * log(1 + 500,000/254)) / (log(7/4) + 1) = 4.889
1 * (log(1+607/500,000)+ log(7/4) * log(1 + 500,000/607)) / (log(7/4) + 1) = 4.329
DFR score = 5.464 + 4.452 + 4.889 + 4.329 = 19.133
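To double-check the arithmetic for both queries, here is a small self-contained script. It is a sketch that assumes base-2 logarithms, q_t = f_t,d = 1 for every term, and l_d equal to the number of words in the query (so l_d = 4 gives a normalized frequency of log2(1 + 3/4), roughly 0.807). The names term_score and score_query are hypothetical helpers, not from the course material.

```python
import math

N = 500_000   # size of corpus
L_AVG = 3     # average query length
TERM_COUNTS = {"California": 104, "Business": 501, "Tax": 254, "Return": 607}  # l_t

def term_score(l_t, l_d, q_t=1, f_td=1):
    """DFR contribution of one term, using the length-normalized frequency."""
    f_norm = f_td * math.log2(1 + L_AVG / l_d)
    return q_t * (math.log2(1 + l_t / N) + f_norm * math.log2(1 + N / l_t)) / (f_norm + 1)

def score_query(query):
    """Return (per-term scores, total score); l_d is assumed to be the word count."""
    terms = query.split()
    l_d = len(terms)
    per_term = {t: term_score(TERM_COUNTS[t], l_d) for t in terms}
    return per_term, sum(per_term.values())

for query in ("California Business Tax", "California Business Tax Return"):
    per_term, total = score_query(query)
    print(query)
    for term, value in per_term.items():
        print(f"  {term:<12}{value:.3f}")
    print(f"  {'DFR score':<12}{total:.3f}")
```

The printed per-term values and totals should agree with the hand computation in a) and b) to three decimals.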
(Edited: 2021-05-02)