2021-04-28

Apr 28 In-Class Exercise Thread.

Please post your solutions to the Apr 28 In-Class Exercise to this thread.
Best,
Chris

-- Apr 28 In-Class Exercise Thread
DFR_california_3 = (log(1 + (104 / 500000)) + (1 * log(1 + (3 / 3))) * log(1 + (500000 / 104))) / ((1 * log(1 + (3 / 3))) + 1) = (0.00030005 + 12.23143) / 2 = 6.1159

DFR_business_3 = (log(1 + (501 / 500000)) + (1 * log(1 + (3 / 3))) * log(1 + (500000 / 501))) / ((1 * log(1 + (3 / 3))) + 1) = (0.0014449 + 9.96435) / 2 = 4.9829

DFR_tax_3 = (log(1 + (254 / 500000)) + (1 * log(1 + (3 / 3))) * log(1 + (500000 / 254))) / ((1 * log(1 + (3 / 3))) + 1) = (0.0007327 + 10.94362) / 2 = 5.4722

DFR_california_4 = (log(1 + (104 / 500000)) + (1 * log(1 + (3 / 4))) * log(1 + (500000 / 104))) / ((1 * log(1 + (3 / 4))) + 1) = (0.00030005 + (0.80735 * 12.23143)) / 1.80735 = 5.4640

DFR_business_4 = (log(1 + (501 / 500000)) + (1 * log(1 + (3 / 4))) * log(1 + (500000 / 501))) / ((1 * log(1 + (3 / 4))) + 1) = (0.0014449 + (0.80735 * 9.96435)) / 1.80735 = 4.4519

DFR_tax_4 = (log(1 + (254 / 500000)) + (1 * log(1 + (3 / 4))) * log(1 + (500000 / 254))) / ((1 * log(1 + (3 / 4))) + 1) = (0.0007327 + (0.80735 * 10.94362)) / 1.80735 = 4.8890

DFR_return = (log(1 + (607 / 500000)) + (1 * log(1 + (3 / 4))) * log(1 + (500000 / 607))) / ((1 * log(1 + (3 / 4))) + 1) = (0.0017504 + (0.80735 * 9.68777)) / 1.80735 = 4.3285


The DFR score of a query is the sum of its per-term scores:

1) DFR_california_business_tax = DFR_california_3 + DFR_business_3 + DFR_tax_3 = 6.1159 + 4.9829 + 5.4722 = 16.5710

2) DFR_california_business_tax_return = DFR_california_4 + DFR_business_4 + DFR_tax_4 + DFR_return = 5.4640 + 4.4519 + 4.8890 + 4.3285 = 19.1334

-- Apr 28 In-Class Exercise Thread
Given N = 500000, l_avg = 3, l_t = 104, 501, 254, 607 and noticing that f_t,d for each term is 1:
Query: "California Business Tax" (l_d = 3)
f'_t,d = f_t,d * log(1 + l_avg/l_d) = 1 * log(1 + 3/3) = 1
For each term, (1-P_2)(-log P_1) = (log(1 + l_t/N) + f'_t,d * log(1 + N/l_t))/(f'_t,d + 1) with document length normalization
"California"
(log(1 + 104/500000) + 1 * log(1 + 500000/104))/(1 + 1) = (0.000300049364453 + 12.2314289005)/2 = 6.11586447493
"Business"
(log(1 + 501/500000) + 1 * log(1 + 500000/501))/(1 + 1) = (0.0014448566786 + 9.96434663281)/2 = 4.98289574474
"Tax"
(log(1 + 254/500000) + 1 * log(1 + 500000/254))/(1 + 1) = (0.000732702989965 + 10.9436165855)/2 = 5.47217464424
DFR Score = 6.11586447493 + 4.98289574474 + 5.47217464424 = 16.5709348639
Query: "California Business Tax Return" (l_d = 4)
f'_t,d = f_t,d * log(1 + l_avg/l_d) = 1 * log(1 + 3/4) = 0.807354922058
For each term, (1-P_2)(-log P_1) = (log(1 + l_t/N) + f'_t,d * log(1 + N/l_t))/(f'_t,d + 1) with document length normalization
"California"
(log(1 + 104/500000) + 0.807354922058 * log(1 + 500000/104))/(0.807354922058 + 1) = (0.000300049364453 + 0.807354922058 * 12.2314289005)/1.80735492206 = 5.4640094513
"Business"
(log(1 + 501/500000) + 0.807354922058 * log(1 + 500000/501))/(0.807354922058 + 1) = (0.0014448566786 + 0.807354922058 * 9.96434663281)/1.80735492206 = 4.45192532887
"Tax"
(log(1 + 254/500000) + 0.807354922058 * log(1 + 500000/254))/(0.807354922058 + 1) = (0.000732702989965 + 0.807354922058 * 10.9436165855)/1.80735492206 = 4.88897632145
"Return"
(log(1 + 607/500000) + 0.807354922058 * log(1 + 500000/607))/(0.807354922058 + 1) = (0.00175036952018 + 0.807354922058 * 9.68776623259)/1.80735492206 = 4.32854445226
DFR Score = 5.4640094513 + 4.45192532887 + 4.88897632145 + 4.32854445226 = 19.1334555539
(Edited: 2021-04-28)
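
The two totals above can be cross-checked with a short script. This is only a sketch, not part of the exercise handout: the names `term_score` and `dfr_score` are made up for illustration, and it assumes base-2 logarithms, which is what the intermediate values in this thread imply (for example log(1 + 3/3) = 1).

```python
import math

N = 500_000   # corpus size given in the exercise
L_AVG = 3     # average query length given in the exercise
L_T = {"california": 104, "business": 501, "tax": 254, "return": 607}


def term_score(l_t, f_td, l_d, n=N, l_avg=L_AVG):
    """(1 - P_2)(-log P_1) for one term, using the length-normalized f_td."""
    f_norm = f_td * math.log2(1 + l_avg / l_d)
    return (math.log2(1 + l_t / n) + f_norm * math.log2(1 + n / l_t)) / (f_norm + 1)


def dfr_score(query):
    """Sum the per-term scores, treating the query itself as the document."""
    terms = query.lower().split()
    l_d = len(terms)
    # Assumption: every term occurs once, so q_t = f_td = 1. That holds for
    # the two exercise queries but not for arbitrary input.
    return sum(term_score(L_T[t], 1, l_d) for t in terms)


if __name__ == "__main__":
    for q in ("California Business Tax", "California Business Tax Return"):
        print(f"{q}: {dfr_score(q):.4f}")
```

If the base-2 assumption is right, the printed totals should land close to the 16.571 and 19.133 reported in this thread.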

-- Apr 28 In-Class Exercise Thread
Important constants:
- The average length of a query in our corpus of queries is 3 = l_avg
- The number of occurrences of the terms "California", "Business", "Tax", and "Return" in the whole corpus are respectively 104, 501, 254, 607 = l_t for each respective term
DFR score of a query is given by:
Sum over every t in q of (q_t * (1-P_2,t) * (-logP_1,t))
This can be simplified to:
Sum over every t in q of (q_t * (log(1 + l_t/N) + f_t,d * log(1 + N/l_t)) / (f_t,d + 1))
We can substitute f_t,d with the normalized version given by:
f'_t,d = f_t,d * log(1 + l_avg/l_d)
This gives us the final equation for DFR score as:
Sum over every t in q of (q_t * (log(1 + l_t/N) + f'_t,d * log(1 + N/l_t)) / (f'_t,d + 1))
The two queries are:
- California Business Tax
- California Business Tax Return
In this example the document is the query.
q_t is 1 for every term in both queries since each word appears once in each query.
f'_t,d for the first query for every term is 1 * log(1 + 3/3) = 1
f'_t,d for the second query for every term is 1 * log(1 + 3/4) = log(7/4) ≈ 0.807
1st query:
1 * (log(1+104/500,000) + 1 * log(1 + 500,000/104)) / (1 + 1) = 6.116
1 * (log(1+501/500,000) + 1 * log(1 + 500,000/501)) / (1 + 1) = 4.983
1 * (log(1+254/500,000) + 1 * log(1 + 500,000/254)) / (1 + 1) = 5.472
DFR score for query 1 is 6.116 + 4.983 + 5.472 = 16.571 (16.570934864 before rounding the terms)
2nd query:
1 * (log(1+104/500,000) + log(7/4) * log(1 + 500,000/104)) / (log(7/4) + 1) = 5.464
1 * (log(1+501/500,000) + log(7/4) * log(1 + 500,000/501)) / (log(7/4) + 1) = 4.452
1 * (log(1+254/500,000) + log(7/4) * log(1 + 500,000/254)) / (log(7/4) + 1) = 4.889
1 * (log(1+607/500,000) + log(7/4) * log(1 + 500,000/607)) / (log(7/4) + 1) = 4.329
DFR score for query 2 is 5.464 + 4.452 + 4.889 + 4.329 = 19.134 (19.1334555539 before rounding the terms)
(Edited: 2021-04-28)
2021-05-02

-- Apr 28 In-Class Exercise Thread
 DFR score = Sum over every t in q of (q_t * (log(1 + l_t/N) + f'_t,d * log(1 + N/l_t)) / (f'_t,d + 1))
 f'_t,d = f_t,d * log(1 + l_avg/l_d)
 First Query = "California Business Tax"
 f'(t, d) = 1 * log(1 + 3/3) = 1
 For term california in Query "California Business Tax"
 1 * (log(1 + 104/500000) + 1 * log(1 + 500000/104)) / (1 + 1)
 = (log(1.0002) + log(4809.69)) / 2
 = 12.2317 / 2 = 6.116
 For term business in Query "California Business Tax"
 1 * (log(1 + 501/500000) + 1 * log(1 + 500000/501)) / (1 + 1)
 = (log(1.0010) + log(999.00)) / 2 = 4.983
 For term tax in Query "California Business Tax"
 1 * (log(1 + 254/500000) + 1 * log(1 + 500000/254)) / (1 + 1)
 = (log(1.0005) + log(1969.5)) / 2 = 5.472
 First query DFR score = 6.116 + 4.983 + 5.472 = 16.571
 Second Query = "California Business Tax Return"
 f'(t, d) = 1 * log(1 + 3/4)
         = 0.8074
 For term california in Query "California Business Tax Return"
 1 * (log(1 + 104/500000) + 0.8074 * log(1 + 500000/104)) / (0.8074 + 1) = 5.464
 For term business in Query "California Business Tax Return"
 1 * (log(1 + 501/500000) + 0.8074 * log(1 + 500000/501)) / (0.8074 + 1) = 4.452
 For term tax in Query "California Business Tax Return"
 1 * (log(1 + 254/500000) + 0.8074 * log(1 + 500000/254)) / (0.8074 + 1) = 4.889
 For term return in Query "California Business Tax Return"
 1 * (log(1 + 607/500000) + 0.8074 * log(1 + 500000/607)) / (0.8074 + 1) = 4.329
 Second query DFR score = 5.464 + 4.452 + 4.889 + 4.329 = 19.134 (19.133 before rounding the terms)
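
The length normalization step can be checked in isolation. Below is a small sketch (the helper name `normalized_tf` is made up here) that assumes base-2 logarithms, since that is the base for which log(1 + 3/3) comes out to exactly 1 as in the post above.

```python
import math


def normalized_tf(f_td, l_avg, l_d):
    """Length-normalized term frequency: f_td * log2(1 + l_avg / l_d)."""
    return f_td * math.log2(1 + l_avg / l_d)


if __name__ == "__main__":
    # l_d = 3 is the three-word query, l_d = 4 the four-word query.
    for l_d in (3, 4):
        print(l_d, round(normalized_tf(1, 3, l_d), 4))
```

A query longer than the average gets a normalized frequency below 1, which is why every per-term score drops when "Return" is added.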

-- Apr 28 In-Class Exercise Thread
N = Size of corpus = 500000
l_avg = Average length of a query in our corpus of queries = 3
l_t = Number of occurrences of the terms "California", "Business", "Tax" and "Return" = 104, 501, 254, 607
DFR score = Sum over every term t in query q of (q_t x (1-P_2,t) x (-logP_1,t))
Or, DFR score = Sum over every t in q of (q_t x (log(1 + l_t/N) + f_t,d x log(1 + N/l_t)) / (f_t,d + 1))
Substituting f_t,d with its normalized version: f'_t,d = f_t,d * log(1 + l_avg/l_d)
We get, DFR score = Sum over every t in q of (q_t * (log(1 + l_t/N) + f'_t,d * log(1 + N/l_t)) / (f'_t,d + 1))
a) For "California Business Tax"
f'_t,d for the first query for every term is 1 * log(1 + 3/3) = 1
DFR = Sum of:
1 * (log(1+104/500,000)+ 1 * log(1 + 500,000/104)) / (1 + 1) = 6.116
1 * (log(1+501/500,000)+ 1 * log(1 + 500,000/501)) / (1 + 1) = 4.983
1 * (log(1+254/500,000)+ 1 * log(1 + 500,000/254)) / (1 + 1) = 5.472
DFR score = 6.116 + 4.983 + 5.472 = 16.571
b) For "California Business Tax Return":
f'_t,d for the second query for every term is 1 * log(1 + 3/4) = log(7/4) ≈ 0.807
DFR = Sum of:
1 * (log(1+104/500,000) + log(7/4) * log(1 + 500,000/104)) / (log(7/4) + 1) = 5.464
1 * (log(1+501/500,000) + log(7/4) * log(1 + 500,000/501)) / (log(7/4) + 1) = 4.452
1 * (log(1+254/500,000) + log(7/4) * log(1 + 500,000/254)) / (log(7/4) + 1) = 4.889
1 * (log(1+607/500,000) + log(7/4) * log(1 + 500,000/607)) / (log(7/4) + 1) = 4.329
DFR score = 5.464 + 4.452 + 4.889 + 4.329 = 19.133
(Edited: 2021-05-02)
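
For readers wondering where the closed form comes from, one way to get there is sketched below. It assumes the usual choices for this DFR variant, which the post does not state explicitly: P_1 is a geometric-style distribution with mean l_t/N, and P_2 follows Laplace's law of succession.

```latex
\begin{align*}
P_1 &= \frac{1}{1 + l_t/N}\left(\frac{l_t/N}{1 + l_t/N}\right)^{f_{t,d}}, \\
-\log P_1 &= \log\!\left(1 + \frac{l_t}{N}\right) + f_{t,d}\,\log\!\left(1 + \frac{N}{l_t}\right), \\
1 - P_2 &= 1 - \frac{f_{t,d}}{f_{t,d} + 1} = \frac{1}{f_{t,d} + 1}, \\
(1 - P_2)(-\log P_1) &= \frac{\log\!\left(1 + \frac{l_t}{N}\right) + f_{t,d}\,\log\!\left(1 + \frac{N}{l_t}\right)}{f_{t,d} + 1}.
\end{align*}
```

Replacing f_{t,d} with f'_{t,d} = f_{t,d} \log(1 + l_avg/l_d) then gives the length-normalized formula used in the calculations.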

-- Apr 28 In-Class Exercise Thread
N = Size of corpus => 500000
l_avg = Average length of a query in our corpus of queries => 3
l_t = Number of occurrences of the terms "California", "Business", "Tax" and "Return" = 104, 501, 254, 607
DFR score = Sum over every term t in query q of (q_t x (1-P_2,t) x (-logP_1,t))
Or, DFR score = Sum over every t in q of (q_t x (log(1 + l_t/N) + f_t,d x log(1 + N/l_t)) / (f_t,d + 1))
f'_t,d = f_t,d * log(1 + l_avg/l_d)
DFR score = Sum over every t in q of (q_t * (log(1 + l_t/N) + f'_t,d * log(1 + N/l_t)) / (f'_t,d + 1))
California Business Tax: f'_t,d for every term is 1 * log(1 + 3/3) = 1
1 * (log(1+104/500,000) + 1 * log(1 + 500,000/104)) / (1 + 1) = 6.116
1 * (log(1+501/500,000) + 1 * log(1 + 500,000/501)) / (1 + 1) = 4.983
1 * (log(1+254/500,000) + 1 * log(1 + 500,000/254)) / (1 + 1) = 5.472
DFR score(query 1) is 6.116 + 4.983 + 5.472 = 16.571 (16.570934864 before rounding the terms)
California Business Tax Return: f'_t,d for every term is 1 * log(1 + 3/4) = log(7/4) ≈ 0.807
1 * (log(1+104/500,000) + log(7/4) * log(1 + 500,000/104)) / (log(7/4) + 1) = 5.464
1 * (log(1+501/500,000) + log(7/4) * log(1 + 500,000/501)) / (log(7/4) + 1) = 4.452
1 * (log(1+254/500,000) + log(7/4) * log(1 + 500,000/254)) / (log(7/4) + 1) = 4.889
1 * (log(1+607/500,000) + log(7/4) * log(1 + 500,000/607)) / (log(7/4) + 1) = 4.329
DFR score(query 2) is 5.464 + 4.452 + 4.889 + 4.329 = 19.134 (19.1334555539 before rounding the terms)
(Edited: 2021-05-02)

-- Apr 28 In-Class Exercise Thread
Size of corpus(N) = 500000
 
Average length of a query in our corpus of queries (l_avg ) = 3
 
Number of occurrences of the terms (l_t) "California", "Business", "Tax" and "Return" in the whole corpus are respectively 104, 501, 254, 607
 
DFR score = Sum over every term t in query q of (q_t x (1-P_2,t) x (-logP_1,t)), or DFR score = Sum over every t in q of (q_t x (log(1 + l_t/N) + f_t,d x log(1 + N/l_t)) / (f_t,d + 1))
 
f'_t,d = f_t,d * log(1 + l_avg/l_d), so DFR score = Sum over every t in q of (q_t * (log(1 + l_t/N) + f'_t,d * log(1 + N/l_t)) / (f'_t,d + 1))
 
a) Query: "California Business Tax"
For term “California”: 1 * (log(1+104/500,000)+ 1 * log(1 + 500,000/104)) / (1 + 1) = 6.116
For term “Business”: 1 * (log(1+501/500,000)+ 1 * log(1 + 500,000/501)) / (1 + 1) = 4.983
For term “Tax”: 1 * (log(1+254/500,000)+ 1 * log(1 + 500,000/254)) / (1 + 1) = 5.472
DFR score for the query “California Business Tax” = 6.116 + 4.983 + 5.472 = 16.571
 
b) Query: "California Business Tax Return":
For term “California”: 1 * (log(1+104/500,000) + log(7/4) * log(1 + 500,000/104)) / (log(7/4) + 1) = 5.464
For term “Business”: 1 * (log(1+501/500,000) + log(7/4) * log(1 + 500,000/501)) / (log(7/4) + 1) = 4.452
For term “Tax”: 1 * (log(1+254/500,000) + log(7/4) * log(1 + 500,000/254)) / (log(7/4) + 1) = 4.889
For term “Return”: 1 * (log(1+607/500,000) + log(7/4) * log(1 + 500,000/607)) / (log(7/4) + 1) = 4.329
DFR score for the query “California Business Tax Return” = 5.464 + 4.452 + 4.889 + 4.329 = 19.133
(Edited: 2021-05-02)
2021-05-03

-- Apr 28 In-Class Exercise Thread
corpus = 500,000
ave = 3
california = 104
business = 501
tax = 254
return = 607
DFR = (1-P2)(-logP1)
Q1. california business tax
california
(log(1 + 104/500,000) + ftd’ * log(1 + 500,000/104)) / (ftd’ + 1)
ftd’ = ftd * log(1 + 3/3) = ftd * 1 = 1
business
(log(1 + 501/500,000) + ftd’ * log(1 + 500,000/501)) / (ftd’ + 1)
ftd’ = ftd * log(1 + 3/3) = ftd * 1 = 1
tax
(log(1 + 254/500,000) + ftd’ * log(1 + 500,000/254)) / (ftd’ + 1)
ftd’ = ftd * log(1 + 3/3) = ftd * 1 = 1
Sum (california business tax) = 6.116 + 4.983 + 5.472 = 16.571
Q2. california business tax return
california
(log(1 + 104/500,000) + ftd’ * log(1 + 500,000/104)) / (ftd’ + 1)
ftd’ = ftd * log(1+3/4) = .807
business
(log(1 + 501/500,000) + ftd’ * log(1 + 500,000/501)) / (ftd’ + 1)
ftd’ = ftd * log(1+3/4) = .807
tax
(log(1 + 254/500,000) + ftd’ * log(1 + 500,000/254)) / (ftd’ + 1)
ftd’ = ftd * log(1+3/4) = .807
return
(log(1 + 607/500,000) + ftd’ * log(1 + 500,000/607)) / (ftd’ + 1)
ftd’ = ftd * log(1+3/4) = .807
Sum (california business tax return) = 5.464 + 4.452 + 4.889 + 4.329 = 19.134 (19.133 before rounding the terms)
(Edited: 2021-05-03)

-- Apr 28 In-Class Exercise Thread
corpus_queries = N = 500,000
lavg = 3
occurrences = lt
california = 104
business = 501
tax = 254
return = 607
DFR = (1 - P₂)(-log(P₁)) = (log(1 + lt/N) + f_td * log(1 + N/lt)) / (f_td + 1)
f'_td = f_td * log(1 + lavg/ld)
-------------------------------------------------------------------------------
1. california business tax
ld = 3
f'_td = f_td * log(1 + 3/3) = f_td * 1 = 1 * 1 = 1
california DFR = (log(1 + 104/500,000) + f'_td * log(1 + 500,000/104)) / (f'_td + 1)
= (log(1 + 104/500,000) + 1 * log(1 + 500,000/104)) / (1 + 1)
= 12.2317 / 2 = 6.1158 ≈ 6.116
business DFR = (log(1 + 501/500,000) + f'_td * log(1 + 500,000/501)) / (f'_td + 1)
= (log(1 + 501/500,000) + 1 * log(1 + 500,000/501)) / (1 + 1)
= 9.9657 / 2 = 4.9828 ≈ 4.983
tax DFR = (log(1 + 254/500,000) + f'_td * log(1 + 500,000/254)) / (f'_td + 1)
= (log(1 + 254/500,000) + 1 * log(1 + 500,000/254)) / (1 + 1)
= 10.9443 / 2 = 5.4721 ≈ 5.472
Sum(california business tax) = california + business + tax = 6.116 + 4.983 + 5.472 = 16.571
-------------------------------------------------------------------------------
2. california business tax return
ld = 4
f'_td = f_td * log(1 + 3/4) = f_td * 0.807 = 1 * 0.807 = 0.807
california DFR = (log(1 + 104/500,000) + f'_td * log(1 + 500,000/104)) / (f'_td + 1)
= (log(1 + 104/500,000) + 0.807 * log(1 + 500,000/104)) / (0.807 + 1)
= 9.8710 / 1.807 = 5.4626 ≈ 5.463
business DFR = (log(1 + 501/500,000) + f'_td * log(1 + 500,000/501)) / (f'_td + 1)
= (log(1 + 501/500,000) + 0.807 * log(1 + 500,000/501)) / (0.807 + 1)
= 8.0426 / 1.807 = 4.4508 ≈ 4.451
tax DFR = (log(1 + 254/500,000) + f'_td * log(1 + 500,000/254)) / (f'_td + 1)
= (log(1 + 254/500,000) + 0.807 * log(1 + 500,000/254)) / (0.807 + 1)
= 8.8322 / 1.807 = 4.8877 ≈ 4.888
return DFR = (log(1 + 607/500,000) + f'_td * log(1 + 500,000/607)) / (f'_td + 1)
= (log(1 + 607/500,000) + 0.807 * log(1 + 500,000/607)) / (0.807 + 1)
= 7.8197 / 1.807 = 4.3274 ≈ 4.327
Sum(california business tax return) = california + business + tax + return = 5.463 + 4.451 + 4.888 + 4.327 = 19.129
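
The second total here (19.129) is slightly below the 19.133 in the other posts. A likely explanation is that f'_td was rounded to 0.807 before being plugged in. The sketch below compares the two cases; the helper name `score_with` is made up for illustration, and base-2 logarithms are assumed because they match the intermediate values in this thread.

```python
import math

N = 500_000
L_T = [104, 501, 254, 607]  # california, business, tax, return


def score_with(f_norm):
    """Sum of per-term DFR scores for a given normalized term frequency."""
    return sum(
        (math.log2(1 + l_t / N) + f_norm * math.log2(1 + N / l_t)) / (f_norm + 1)
        for l_t in L_T
    )


if __name__ == "__main__":
    exact = score_with(math.log2(1 + 3 / 4))  # unrounded f'
    rounded = score_with(0.807)               # f' rounded to three decimals
    print(f"exact f':   {exact:.3f}")
    print(f"rounded f': {rounded:.3f}")
    print(f"difference: {exact - rounded:.4f}")
```

Carrying f' at full precision until the final sum avoids the small drift in the third decimal place.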