2021-05-11

May 12 In-Class Exercise Thread.

Please post your solution to the May 12 In-Class Exercise Thread.
Best,
Chris
2021-05-12

-- May 12 In-Class Exercise Thread
Kevin, Sriramm, Mustafa
1) Set a threshold so that, from the whole corpus, we return only 100 documents (for P@100). Since X docs have score > N - X, setting X = 100 gives a threshold of 1000 - 100 = 900.
2) Next, look at each batch of 100; for a given batch, find the documents with score greater than 900. Let Y be this set of returned documents, and let rel_Y be the subset of Y that is (human-determined) relevant. The precision for the batch is |rel_Y| / |Y| (a toy numeric check follows the list).
3) Compute the average of these precision scores over all batches. That is the aggregate precision @ 100.
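As a toy check of step 2 (the counts here are made up for illustration): suppose a batch has 10 documents scoring over 900, of which 7 are judged relevant.

Y_size = 10        # |Y|: documents in this batch with score > 900 (made-up count)
rel_Y_size = 7     # |rel_Y|: those documents judged relevant (made-up count)
batch_precision = rel_Y_size / Y_size   # 0.7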
(Edited: 2021-05-12)

-- May 12 In-Class Exercise Thread
Total number of documents = N = 1000
Batch size = X = 100
Threshold (from the scoring function) = N - X = 900
precision_list = []
for each batch in corpus:
  • results_list = []
  • relevant_list = []
  • for each doc in batch:
     - If the score for doc is greater than 900:
       * Add doc to results_list
       * If doc is also relevant, add it to relevant_list
  • Precision for this batch = len(relevant_list)/len(results_list)
  • Append this precision value to precision_list
Once we have gone through all the batches, average the precision scores in precision_list to get the aggregate P@100.
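For example, if the ten batch precisions came out as the (made-up) values below, the final averaging step would be:

precision_list = [0.7, 0.5, 0.8, 0.6, 0.9, 0.7, 0.4, 0.8, 0.6, 0.5]
aggregate_p_at_100 = sum(precision_list) / len(precision_list)   # 0.65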
(Edited: 2021-05-12)

-- May 12 In-Class Exercise Thread
P@100
N = 1000
Batch size = 100
10 batches
If X is 100, then exactly 100 documents have a score over 900, so our threshold is 900 (see the quick check below).
Then, for each batch, find all documents that have a score greater than 900; the precision for that batch is the number of relevant documents returned divided by the total number of documents returned.
Finally, the average of all these batch precisions is the aggregate P@100.
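A quick check that the threshold returns exactly 100 documents across the whole corpus, assuming (as the exercise setup suggests) one distinct integer score from 1 through 1000 per document:

scores = range(1, 1001)                    # one distinct score per document (assumed)
returned = [s for s in scores if s > 900]
print(len(returned))                       # 100 documents pass the threshold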
(Edited: 2021-05-12)

-- May 12 In-Class Exercise Thread
def aggregate_at_k(corpus, score, relevant, N=1000, k=100, X=100):
	# corpus: iterable of batches; score(doc) and relevant(doc) are
	# supplied by the caller. k is the batch size; X is the number of
	# documents scoring above the threshold t = N - X.
	t = N - X
	precision_scores = []
	for batch in corpus:
		Y = []       # documents returned from this batch (score > t)
		rel_Y = []   # returned documents judged relevant
		for document in batch:
			if score(document) > t:
				Y.append(document)
				if relevant(document):
					rel_Y.append(document)
		batch_precision = len(rel_Y) / len(Y) if Y else 0.0
		precision_scores.append(batch_precision)
	return sum(precision_scores) / len(precision_scores)
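A quick sanity check with a made-up toy corpus of two batches, where each document is a (score, is_relevant) pair:

toy_corpus = [
	[(950, True), (920, False), (400, True)],
	[(980, True), (910, True), (100, False)],
]
print(aggregate_at_k(toy_corpus, score=lambda d: d[0], relevant=lambda d: d[1]))
# batch precisions are 1/2 and 2/2, so this prints 0.75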
2021-05-16

-- May 12 In-Class Exercise Thread
In this example we let X be 100, so N - X = 900.
Now, for every doc we check whether its score is greater than 900 and whether it is relevant. If the score is greater than 900, we add the doc to the results list; if it is also relevant, we add it to the relevant_results list as well.
The precision for a batch is the length of the relevant_results list divided by the length of the results list.
Similarly, the precision of every batch can be calculated, and all the batch precisions can be stored in a batchwise_precision list; averaging that list gives the aggregate P@100.
(Edited: 2021-05-16)

-- May 12 In-Class Exercise Thread
For N=1000 and batch=100,
threshold=N-X=1000-100=900 -> Only 100 documents will have a score over 900

For each batch, store documents with score > threshold,
Calculate precision for each batch:
Precision = fraction of the result that is relevant =
|relevant ∩ result| / |result| (see the set-based sketch below)

Aggregate P@100 score = average(precision for each batch)
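One way to read that formula in code, using Python sets and made-up document IDs:

result = {1, 2, 3, 4}        # docs in one batch with score > 900 (hypothetical IDs)
relevant = {2, 4, 7}         # human-judged relevant docs (hypothetical IDs)
precision = len(relevant & result) / len(result)   # 2/4 = 0.5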
(Edited: 2021-05-16)

-- May 12 In-Class Exercise Thread
To compute the aggregate P@100 for a corpus of N = 1000 with batch size 100, we first compute the threshold N - X = 1000 - 100 = 900; only documents scoring above it are returned. We look at each batch of 100 documents and store the documents with a score over 900. We calculate the precision of each batch by dividing the number of human-determined relevant documents among those stored by the number of documents stored for that batch. We do that for every batch, and the average precision of the 10 batches is the aggregate P@100.

-- May 12 In-Class Exercise Thread
 Total number of documents N = 1000
 X = 100
 N-X = 1000-100 = 900
 for each batch,
   for each document
     if the document's score is greater than 900, add it to the results list
     if it is also relevant, add it to the relevant list
   precision for this batch = number of relevant documents returned / total number of documents returned
   store this precision
 finally, average the stored precisions to find the aggregate precision@100

-- May 12 In-Class Exercise Thread
Give a concrete procedure for computing aggregate P@100.
N = 1000 documents in the corpus; X = 100 documents per batch; N / X = 1000 / 100 = 10 batches; threshold N - X = 1000 - 100 = 900 from the scoring function.
Steps:
1. Compute the threshold: N - X = 1000 - 100 = 900. Exactly 100 documents have a score over 900.
2. Iterate over each batch (10 batches).
3. Iterate over each document in the batch.
4. Collect the documents with a score over 900.
5. Determine the relevance of each collected document.
6. After iterating over every document, calculate the precision of the batch.
7. After iterating over every batch, calculate the average precision of all the batches.
8. The result is the aggregate P@100.