Yioop - PHP Search Engine

Practice Midterm Problems Thread.

Hey Everyone,

Post your solutions to the practice midterm problems to this thread. Remember to list all your teammates.

Best,

Chris

Hey Everyone, Post your solutions to the practice midterm problems to this thread. Remember to list all your teammates. Best, Chris

-- Practice Midterm Problems Thread

Devangi, Pranjali, Mustafa

3. Consider the "As I was going to St Ives ..." riddle, explain how the nextPhrase algorithm would work on inputs:
nextPhrase("had", "seven", 10). As there are variants to the riddle, tell me which one you are using as you go
through the solution.

    As I was going to St. Ives,

    I met a man with seven wives,

    Each wife had seven sacks,

    Each sack had seven cats,

    Each cat had seven kits:

    Kits, cats, sacks, and wives,

    How many were there going to St. Ives?

After removing punctuations, and giving positions:

1 as
2 i
3 was
4 going
5 to
6 st
7 ives
8 i
9 met
10 a
11 man
12 with
13 seven
14 wives
15 each
16 wife
17 had
18 seven
19 sacks
20 each
21 sack
22 had
23 seven
24 cats
25 each
26 cat
27 had
28 seven
29 kits
30 kits
31 cats
32 sacks
33 and
34 wives
35 how
36 many
37 were
38 there
39 going
40 to
41 st
42 ives

Dictionary:

As : 1
I : 2, 8
Was: 3
Going: 4
To: 5, 40
St: 6, 41
Ives: 7, 42
Met : 9
A : 10
Man : 11
With: 12
Seven : 13, 18, 23, 28
Wives : 14, 16, 34

nextPhrase(“had”, “seven”, 10):

For every term in (“had”, “seven”)

term = “had”
v = next(“had”, 10)
v = 17

term = “seven”
v = next(“seven”, 17)
v = 18

U = 18

For every term but last in (“had”, “seven”) (reverse)

term = “seven”
u = prev(“had”, 18)
u = 17
V - u = 18 - 17 = 1 (== n - 1 == 2-1)

Therefore, return [17, 18]

(Edited: 2021-03-22)

Devangi, Pranjali, Mustafa 3. Consider the "As I was going to St Ives ..." riddle, explain how the nextPhrase algorithm would work on inputs: nextPhrase("had", "seven", 10). As there are variants to the riddle, tell me which one you are using as you go through the solution. As I was going to St. Ives, I met a man with seven wives, Each wife had seven sacks, Each sack had seven cats, Each cat had seven kits: Kits, cats, sacks, and wives, How many were there going to St. Ives? After removing punctuations, and giving positions: 1 as 2 i 3 was 4 going 5 to 6 st 7 ives 8 i 9 met 10 a 11 man 12 with 13 seven 14 wives 15 each 16 wife 17 had 18 seven 19 sacks 20 each 21 sack 22 had 23 seven 24 cats 25 each 26 cat 27 had 28 seven 29 kits 30 kits 31 cats 32 sacks 33 and 34 wives 35 how 36 many 37 were 38 there 39 going 40 to 41 st 42 ives Dictionary: As : 1 I : 2, 8 Was: 3 Going: 4 To: 5, 40 St: 6, 41 Ives: 7, 42 Met : 9 A : 10 Man : 11 With: 12 Seven : 13, 18, 23, 28 Wives : 14, 16, 34 nextPhrase(“had”, “seven”, 10): For every term in (“had”, “seven”) term = “had” v = next(“had”, 10) v = 17 term = “seven” v = next(“seven”, 17) v = 18 U = 18 For every term but last in (“had”, “seven”) (reverse) term = “seven” u = prev(“had”, 18) u = 17 V - u = 18 - 17 = 1 (== n - 1 == 2-1) Therefore, return [17, 18]

-- Practice Midterm Problems Thread

By Vrinda and Sriramm:

Q9.Suppose `n=3` what would be the character `n`-grams for the word toast? (1pt) What is stopping? (1pt) What is stemming (1pt)? Give an example of the kinds of rules used by a Porter stemmer.\

Toast - .toast.

.to

toa

oas

ast

st.

Stopping: The removal of unnecessary function words from the text that provide no extra meaning to the text, rather only modifying or indicating grammatical relationships; is called stopping.

Stemming

Stemming considers the morphology, or internal structure, of terms, and reduces each term to a root form for comparison based on its morphology, for example, a query term such as “orienteering” to match an occurrence of “orienteers”, or “runs” to match “running”.

Rules used by Porter Stemmer

sses -> ss

ies -> i

(Edited: 2021-03-22)

By Vrinda and Sriramm: Q9.Suppose @BT@n=3@BT@ what would be the character @BT@n@BT@-grams for the word toast? (1pt) What is stopping? (1pt) What is stemming (1pt)? Give an example of the kinds of rules used by a Porter stemmer.\ ; Toast - .toast. : * .to * toa * oas * ast * st. ; Stopping: The removal of unnecessary function words from the text that provide no extra meaning to the text, rather only modifying or indicating grammatical relationships; is called stopping. ; Stemming: Stemming considers the morphology, or internal structure, of terms, and reduces each term to a root form for comparison based on its morphology, for example, a query term such as “orienteering” to match an occurrence of “orienteers”, or “runs” to match “running”. ; Rules used by Porter Stemmer: *sses -> ss *ies -> i

-- Practice Midterm Problems Thread

Devangi, Pranjali, Mustafa
4. Suggest a data structure that could be used to implement an inverted index in PHP, and give a working code
snippet to show how to implement first($t), last($t) from our inverted index ADT.

According to the slides: "All PHP arrays can be viewed as associative arrays".
We will thus use an associative array for our inverted index and indexed arrays for the posting list within the index.

// assume our associated array inverted index is globally accessible and referred to as $inverted_index

function first($t) {

	if (array_key_exists($t, $inverted_index))

		return reset($inverted_index[$t]);

	return false;

}

function last($t) {

	if (array_key_exists($t, $inverted_index))

		return end($inverted_index[$t]);

	return false;

}

(Edited: 2021-03-22)

Devangi, Pranjali, Mustafa 4. Suggest a data structure that could be used to implement an inverted index in PHP, and give a working code snippet to show how to implement first($t), last($t) from our inverted index ADT. According to the slides: "All PHP arrays can be viewed as associative arrays". We will thus use an associative array for our inverted index and indexed arrays for the posting list within the index. // assume our associated array inverted index is globally accessible and referred to as $inverted_index function first($t) { if (array_key_exists($t, $inverted_index)) return reset($inverted_index[$t]); return false; } function last($t) { if (array_key_exists($t, $inverted_index)) return end($inverted_index[$t]); return false; }

-- Practice Midterm Problems Thread

Austin Anderson, Justin Chang, Vaishnavi Nagireddy

6.Define the following terms (1pt each): (a) docid index, (b) frequency index, (c) positional index, and (d) schema-independent index.

Docid index: A docid index is the simplest type of inverted index. For each term, it contains the document identifiers of all documents in which the term appears.

Example: let’s consider some terms “a”,”am”,”good”. A docid index consists of the document id which gives the document number in which the term appears.

Terms	Docid list
a	1;3
am	1;4
good	2;4

Frequency index: In a frequency index, each entry in an inverted list consists of two components: a document ID and a term frequency value. Each posting in a frequency index is of the form (d, ft_d).

Example: If we have a document containing terms in it, our frequency index looks like

Term	Frequency list
a	1,3
cool	1,2
the	1,4

Positional index: A positional index consists of postings of the form (document,frequency of term in that document, (term’s posting list)) and it also supports all search operations supported by a frequency index.

Example: A positional list for terms “better”,”do”,”good”,”man” will look like

Terms	Positional list
better	(1,3,(2,6,14))
do	(1,2,(12,16))
good	(1,1,(9))
man	(1,1,(5))

Schema-independent index: A schema-independent index considers the whole corpus as a single document. So, it contains word's positions of the individual term occurrences.

Example: A schema-independent index for our terms “a”,”am”,”as” will be like

Term	Schema Independant
a	1;21
am	5;14
as	7;19

(Edited: 2021-03-22)

Austin Anderson, Justin Chang, Vaishnavi Nagireddy 6.Define the following terms (1pt each): (a) docid index, (b) frequency index, (c) positional index, and (d) schema-independent index. Docid index: A docid index is the simplest type of inverted index. For each term, it contains the document identifiers of all documents in which the term appears. Example: let’s consider some terms “a”,”am”,”good”. A docid index consists of the document id which gives the document number in which the term appears. {| |- | Terms || Docid list |- | a || 1;3 |- | am || 1;4 |- | good || 2;4 |} Frequency index: In a frequency index, each entry in an inverted list consists of two components: a document ID and a term frequency value. Each posting in a frequency index is of the form (d, ft_d). Example: If we have a document containing terms in it, our frequency index looks like {| |- | Term || Frequency list |- | a || 1,3 |- | cool || 1,2 |- | the || 1,4 |} Positional index: A positional index consists of postings of the form (document,frequency of term in that document, (term’s posting list)) and it also supports all search operations supported by a frequency index. Example: A positional list for terms “better”,”do”,”good”,”man” will look like {| |- | Terms || Positional list |- | better || (1,3,(2,6,14)) |- | do || (1,2,(12,16)) |- | good || (1,1,(9)) |- | man || (1,1,(5)) |} Schema-independent index: A schema-independent index considers the whole corpus as a single document. So, it contains word's positions of the individual term occurrences. Example: A schema-independent index for our terms “a”,”am”,”as” will be like {| |- | Term || Schema Independant |- | a || 1;21 |- | am || 5;14 |- | as || 7;19 |}

-- Practice Midterm Problems Thread

Question 1

by Kevin Schumacher, William Wang, Paaras Chad

a.) Probability Ranking Principle definition: A query response is most useful to a user if it ranks the documents in descending relevance. The query “search engine” could return search pages and have informational pages describing how a search engine works. This is an example of probability ranking principle because the most relevant items (actual search engines) are at the top while slightly less relevant items like how a search engine works are just below it in relevancy and in result order.

b.)Specificity definition: We only want relevant information. Query “Covid-19” returns the CDC’s Covid response page. This example of specificity because the CDC is the greatest authority on Covid and will therefore have the most relevant information.

c.)Exhaustivity definition: Give all of the possible results on a topic. The query “All dogs go to heaven” returns all documents that contain the phrase. This is an example of exhaustivity because all of the possible results are given in the output.

d.) Novelty definition: Don’t just repeat information, each result should give new information. For the query “Covid-19 news” it should return articles like case count in March and another article case count in February. This is an example of novelty because the results doesn’t return 50 articles from about the case count in March.

(Edited: 2021-03-22)

Question 1 by Kevin Schumacher, William Wang, Paaras Chad a.) Probability Ranking Principle definition: A query response is most useful to a user if it ranks the documents in descending relevance. The query “search engine” could return search pages and have informational pages describing how a search engine works. This is an example of probability ranking principle because the most relevant items (actual search engines) are at the top while slightly less relevant items like how a search engine works are just below it in relevancy and in result order. b.)Specificity definition: We only want relevant information. Query “Covid-19” returns the CDC’s Covid response page. This example of specificity because the CDC is the greatest authority on Covid and will therefore have the most relevant information. c.)Exhaustivity definition: Give all of the possible results on a topic. The query “All dogs go to heaven” returns all documents that contain the phrase. This is an example of exhaustivity because all of the possible results are given in the output. d.) Novelty definition: Don’t just repeat information, each result should give new information. For the query “Covid-19 news” it should return articles like case count in March and another article case count in February. This is an example of novelty because the results doesn’t return 50 articles from about the case count in March.

-- Practice Midterm Problems Thread

Austin, Vaishnavi, Justin

		Prev(term, pos)
		If( term doesn’t exist OR first occurrence of term is after pos)
			Return -inf
		If(last occurrence of term is before current)
			Return last occurrence of term
			
		Initialize high to the length of term's posting list
		Initialize jump to -1
		Initialize low to high + jump
		
		While (low is greater than 0 and the positon of the term at index low is greater than or equal to current)
			Set high to be equal to low
			Set jump to be twice its previous value
			Set low to be itself plus jump
		If (low is less than 0)
			Set low to be 0
		Use binarySearch on the window of low->high to find the highest index of term whose position is before current
		Return the position of the term at index returned by binarySearch

(Edited: 2021-03-22)

Austin, Vaishnavi, Justin Prev(term, pos) If( term doesn’t exist OR first occurrence of term is after pos) Return -inf If(last occurrence of term is before current) Return last occurrence of term Initialize high to the length of term's posting list Initialize jump to -1 Initialize low to high + jump While (low is greater than 0 and the positon of the term at index low is greater than or equal to current) Set high to be equal to low Set jump to be twice its previous value Set low to be itself plus jump If (low is less than 0) Set low to be 0 Use binarySearch on the window of low->high to find the highest index of term whose position is before current Return the position of the term at index returned by binarySearch

-- Practice Midterm Problems Thread

Question 2

by Kevin Schumacher, William Wang, Paaras Chad

Relentless, insatiable deadlines!

This manuscript's still full of red lines.

First I'll sweat through the edits

and check all the credits

then chill with my favorite red wine

28 total words

All bigrams have a M1 of 1. Except for M1(“red lines.”), M1(“the edits”), M1(“ the credits”), M1(“red wine”) all of these are equal to 1/2.

The MLE of the second line is (1/28)^6 • (1/14) = (1/6746464256) = 1.48 • 10 ^ (-10)

(Edited: 2021-03-22)

Question 2 by Kevin Schumacher, William Wang, Paaras Chad Relentless, insatiable deadlines! This manuscript's still full of red lines. First I'll sweat through the edits and check all the credits then chill with my favorite red wine 28 total words All bigrams have a M1 of 1. Except for M1(“red lines.”), M1(“the edits”), M1(“ the credits”), M1(“red wine”) all of these are equal to 1/2. The MLE of the second line is (1/28)^6 • (1/14) = (1/6746464256) = 1.48 • 10 ^ (-10)

-- Practice Midterm Problems Thread

By Vrinda and Sriramm

: Q10. Briefly explain and give example of the following (1pt each): (a) sort-based dictionary, (b) per-term index, (c) move-to-front heuristic, (d) merge based index construction.
: (a) sort-based dictionary - In sort-based dictionaries, the terms that appear in the text collection are sorted in the form of an array or a search tree. Traversal is done through tree traversal. The problems with sort based dictionaries are that all the values should be of the same size. To overcome this problem, we use two arrays- a primary array that contains integers which point to a secondary array containing the actual terms.
: For example:
: Sorted dictionary in memory looks like:
: absence -> 2345, absent -> 26783, …, shake -> 456, shaken -> 1672, shakespeare -> 89541, … zygote -> 6734, zygotes -> 1333.
: Dictionary on disk has “absence” and “absent” nearby and similarly “shake”, “shaken”, “shakespeare”. So, this sort-based lookup makes searching for terms like “absen*” and “shake*” easier.

(b) per-term index - per-term index is a set of positions of the terms that are kept as a synchronisation points (something like checkpoints) such that we do not have to go through the entire partitions of the disks to find the position.

For example, the per term indices for “widget” in a document are 12334, 13456, 34598. So, consider that the disk has three partitions. There are 15 other positions at which the term occurs. If we need next(“widget”, 25987), we would know that we have to start searching between 13456 and 34598.

(c) move-to-front heuristic - In the “move-to-front” heuristic, the terms are inserted at the back of the list. The terms will be moved to the front of the lists when they are searched. So, the next time when the term is looked up or searched, it will be present in the beginning of the list.

For example, if a new term(in our context) such as “Aztec” is encountered, it is initially inserted at the end of the list. When the term is searched for the first time, it is moved to the front of the posting list, such that when the lookup happens again, the term will be present in the front of the list.

(d)Merge-based index - merge-based index is similar to hash-based indexing, in the way that if the index list is small, it can fit into the memory (main memory) and if not, it uses dynamic partitioning to build an in-memory index and as soon as it runs out of memory, it builds an on-disk inverted file by transferring the in-memory index data to disk, deletes the in-memory index, and continues indexing. This procedure is repeated until the whole collection has been indexed, resulting in a set of index files, each representing a part of the whole collection, called an index partition. In a final step, the index partitions are merged into the final index. When we do lookups, we have to search through these index partitions of disk. The posting lists are sorted lexicographically and stored in the disk partitions.

For Example: Let the max amount of in-memory words be 6.

Text -

Doc1: “Do you quarrel, sir?

Doc2: “If you do, stay here, if not, then go.”

Doc3: “Try sir.”

Partition 1:

do - [(1, 1)]

you - [(1, 2)]

quarrel - [(1, 3)]

sir - [(1, 4)]

Partition 2:

if - [(2, 1), (2, 6)]

you - [(2, 2)]

do - [(2, 3)]

stay - [(2, 4)]

here - [(2, 5)]

not - [(2, 7)]

then - [(2, 8)]

go - [(2, 9)]

Partition 3:

try - [(3, 1)]

sir - [(3, 2)]

Recombining:

{ do: [(1, 1), (2, 3)], go: [(2, 9)], here: [(2, 5)], if: [(2, 1), (2, 6)], not: [(2, 7)], quarrel: [(1, 3)], sir: [(1, 4), (3, 2)], stay: [(2, 4)], then: [(2, 8)], try: [(3, 1)], you: [(1, 2), (2, 2)] }

(Edited: 2021-03-22)

By Vrinda and Sriramm ; : Q10. Briefly explain and give example of the following (1pt each): (a) sort-based dictionary, (b) per-term index, (c) move-to-front heuristic, (d) merge based index construction. ; : (a) sort-based dictionary - In sort-based dictionaries, the terms that appear in the text collection are sorted in the form of an array or a search tree. Traversal is done through tree traversal. The problems with sort based dictionaries are that all the values should be of the same size. To overcome this problem, we use two arrays- a primary array that contains integers which point to a secondary array containing the actual terms. ; : For example: ; : Sorted dictionary in memory looks like: ; : absence -> 2345, absent -> 26783, …, shake -> 456, shaken -> 1672, shakespeare -> 89541, … zygote -> 6734, zygotes -> 1333. ; : Dictionary on disk has “absence” and “absent” nearby and similarly “shake”, “shaken”, “shakespeare”. So, this sort-based lookup makes searching for terms like “absen*” and “shake*” easier. ; : (b) per-term index - per-term index is a set of positions of the terms that are kept as a synchronisation points (something like checkpoints) such that we do not have to go through the entire partitions of the disks to find the position. ; : For example, the per term indices for “widget” in a document are 12334, 13456, 34598. So, consider that the disk has three partitions. There are 15 other positions at which the term occurs. If we need next(“widget”, 25987), we would know that we have to start searching between 13456 and 34598. ; : ; : (c) move-to-front heuristic - In the “move-to-front” heuristic, the terms are inserted at the back of the list. The terms will be moved to the front of the lists when they are searched. So, the next time when the term is looked up or searched, it will be present in the beginning of the list. ; : For example, if a new term(in our context) such as “Aztec” is encountered, it is initially inserted at the end of the list. When the term is searched for the first time, it is moved to the front of the posting list, such that when the lookup happens again, the term will be present in the front of the list. ; : ; : (d)Merge-based index - merge-based index is similar to hash-based indexing, in the way that if the index list is small, it can fit into the memory (main memory) and if not, it uses dynamic partitioning to build an in-memory index and as soon as it runs out of memory, it builds an on-disk inverted file by transferring the in-memory index data to disk, deletes the in-memory index, and continues indexing. This procedure is repeated until the whole collection has been indexed, resulting in a set of index files, each representing a part of the whole collection, called an index partition. In a final step, the index partitions are merged into the final index. When we do lookups, we have to search through these index partitions of disk. The posting lists are sorted lexicographically and stored in the disk partitions. ; : For Example: Let the max amount of in-memory words be 6. ; : Text - ; : Doc1: “Do you quarrel, sir? ; : Doc2: “If you do, stay here, if not, then go.” ; : Doc3: “Try sir.” ; : Partition 1: ; : do - [(1, 1)] ; : you - [(1, 2)] ; : quarrel - [(1, 3)] ; : sir - [(1, 4)] ; : Partition 2: ; : if - [(2, 1), (2, 6)] ; : you - [(2, 2)] ; : do - [(2, 3)] ; : stay - [(2, 4)] ; : here - [(2, 5)] ; : not - [(2, 7)] ; : then - [(2, 8)] ; : go - [(2, 9)] ; : Partition 3: ; : try - [(3, 1)] ; : sir - [(3, 2)] ; : Recombining: ; : { do: [(1, 1), (2, 3)], go: [(2, 9)], here: [(2, 5)], if: [(2, 1), (2, 6)], not: [(2, 7)], quarrel: [(1, 3)], sir: [(1, 4), (3, 2)], stay: [(2, 4)], then: [(2, 8)], try: [(3, 1)], you: [(1, 2), (2, 2)] }