Yioop - PHP Search Engine

Dec 2 In-Class Exercise Thread.

Post your solutions to the Dec 2 In-Class Exercise to this thread.

Best,

Chris

-- Dec 2 In-Class Exercise Thread

We will need to create a term list that resembles an inverted index. Each term maps to a list of all the skipgrams in which it is included. Each skipgram consists of the two words before and the two words after the term.

To find how many skipgrams two terms share, each skipgram can be a String or ArrayList object (assuming we are doing this in Java) that contains the four terms. Given two terms t1 and t2, we can use the "contains" method to check in how many of t1's skipgrams t2 is found, increasing the tally with each positive occurrence. The final tally is the number of skipgrams they share. The other parameter (total number of skipgrams of t1 + total number of skipgrams of t2) can be found easily by looking up the sizes of the skipgram lists mapped to each of the two terms and adding them together.
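
A minimal sketch of this idea, written in Python rather than Java for brevity. The function names, the whitespace tokenizer, and the "*" padding at sentence boundaries are assumptions made for illustration; the post itself only specifies a term-to-skipgram mapping and a "contains" check.

```python
from collections import defaultdict


def build_skipgram_index(sentences, pad="*"):
    """Map each term to the list of its skipgrams.

    A skipgram here is a tuple of the two words before and the two
    words after the term (padding with `pad` is an assumption).
    """
    index = defaultdict(list)
    for sentence in sentences:
        words = sentence.lower().split()
        padded = [pad, pad] + words + [pad, pad]
        for i in range(2, len(padded) - 2):
            context = (padded[i - 2], padded[i - 1],
                       padded[i + 1], padded[i + 2])
            index[padded[i]].append(context)
    return index


def count_shared(index, t1, t2):
    """Count how many of t1's skipgrams contain the term t2."""
    return sum(1 for context in index.get(t1, []) if t2 in context)


def total_skipgrams(index, t1, t2):
    """Sum of the sizes of the two terms' skipgram lists."""
    return len(index.get(t1, [])) + len(index.get(t2, []))
```

Both quantities come from simple lookups in the index, so the cost of comparing two terms is proportional to the length of t1's skipgram list.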

-- Dec 2 In-Class Exercise Thread

 To calculate the common skipgrams -> make a list of skipgrams for each term in a 
 hashtable -> sort each list alphabetically -> when comparing two terms (say t1, t2) 
 take their respective skipgram lists -> compare the 1st skipgram of t1 with the 1st 
 skipgram of t2: if they match, increase the common-skipgram counter and advance in 
 both lists; if they don't match, advance in the list whose current skipgram comes 
 first alphabetically (so if t1's 1st skipgram comes before t2's 1st skipgram, next 
 compare t1's 2nd skipgram with t2's 1st) -> repeat, recursively or in a loop, until 
 the end of either list is reached.
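
The comparison described above is essentially the merge step of merge sort. A minimal Python sketch follows, using a loop instead of recursion; the function name is hypothetical and the inputs are assumed to be already sorted lists of skipgram strings.

```python
def count_common_sorted(skipgrams_t1, skipgrams_t2):
    """Count skipgrams common to two alphabetically sorted lists."""
    i = j = common = 0
    while i < len(skipgrams_t1) and j < len(skipgrams_t2):
        if skipgrams_t1[i] == skipgrams_t2[j]:
            # Match: count it and advance in both lists.
            common += 1
            i += 1
            j += 1
        elif skipgrams_t1[i] < skipgrams_t2[j]:
            # t1's current skipgram comes first, so advance in t1.
            i += 1
        else:
            # t2's current skipgram comes first, so advance in t2.
            j += 1
    return common
```

Each step advances at least one pointer, so the pass takes time linear in the combined length of the two lists once they are sorted.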

(Edited: 2020-12-03)

-- Dec 2 In-Class Exercise Thread

Preparation before computing the distance function:

Split all the text in the files into sentences.
Read those sentences and make intermediate skipgrams of 5 terms each, padding with asterisks at the start and end of each sentence.
Map the middle term of each intermediate skipgram to its list of skipgrams, as described in the HW description. This gives a Map<String, List<String>>.
Sort the map so that keys are in descending order of the size of their skipgram lists.
Take the n most frequent terms, by number of skipgrams, that are not in the skip words list.
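
The preparation steps above could be sketched roughly as follows, in Python rather than Java. The sentence splitter, the tokenizer regex, the function names, and the choice to store each skipgram as the four context terms joined by spaces are all assumptions for illustration, not the format required by the HW description.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive sentence splitter: break on runs of '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def skipgram_map(sentences, pad="*"):
    """Map each middle term to the list of its skipgrams."""
    table = defaultdict(list)
    for sentence in sentences:
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        padded = [pad, pad] + words + [pad, pad]
        for i in range(2, len(padded) - 2):
            window = padded[i - 2:i + 3]  # intermediate 5-term skipgram
            middle = window[2]
            # Keep the four surrounding terms as the stored skipgram.
            table[middle].append(" ".join(window[:2] + window[3:]))
    return table


def top_terms(table, n, skip_words=frozenset()):
    """Return the n terms with the most skipgrams, ignoring skip words."""
    ranked = sorted(table, key=lambda term: len(table[term]), reverse=True)
    return [term for term in ranked if term not in skip_words][:n]
```

Ranking the keys by list size, instead of physically reordering the map, gives the same ordering without needing a sorted map type.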

Now we have the list of terms to feed into the distance function. To calculate the distance between t1 and t2, we fetch their respective lists of skipgrams. Then we can call retainAll on t1's list, passing t2's list, to get the common skipgrams. Note that retainAll modifies the list in place, so it should be called on a copy, and wrapping t2's list in a HashSet keeps the membership checks fast. The number of common skipgrams is the size of the resulting list, while the total number of skipgrams is found by adding the sizes of the two original lists.
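
As a rough Python analogue of the retainAll step (a swapped-in illustration, not the Java call itself), the sketch below keeps the entries of t1's list that also occur in t2's list and leaves both inputs untouched. The function name is hypothetical, and only the two counts are computed, since the distance formula itself is defined in the HW description.

```python
def common_and_total(skipgrams_t1, skipgrams_t2):
    """Return (number of common skipgrams, total number of skipgrams)."""
    lookup = set(skipgrams_t2)  # constant-time membership tests
    # Like retainAll, keep every entry of t1's list that occurs in t2's
    # list, but build a new list so the original is not modified.
    common = [s for s in skipgrams_t1 if s in lookup]
    total = len(skipgrams_t1) + len(skipgrams_t2)
    return len(common), total
```

As with retainAll on a list, a skipgram that appears several times in t1's list is counted once per occurrence.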

(Edited: 2020-12-06)

-- Dec 2 In-Class Exercise Thread

- Split the text file into sentences.
- Turn the sentences into intermediary skipgrams.
- Store the terms and their associated skipgrams in a hashmap.
- To calculate the similarity, we can use Python's numpy.isin() function to obtain the common skipgrams.