Yioop - PHP Search Engine

Mar 3 In-Class Exercise Thread.

Post your solution to the Mar 3 In-Class Exercise to this thread.

Best,

Chris


-- Mar 3 In-Class Exercise Thread

kid pup kiss

An example where stemming would increase recall but decrease precision: if java and javascript were both stemmed to java, a search for one would also return documents about the other, despite the two being different topics.


-- Mar 3 In-Class Exercise Thread

 kid pup and kiss
 Any situation where multiple words with different meanings are mapped to the same token will decrease precision and increase recall.


-- Mar 3 In-Class Exercise Thread

Normalization: "kids puppies and kisses"

Porter Stemmer Stemming: "kid puppi kiss"

An example could be "canned" vs "cannibal".
Stemming both of these could cut them down to a shared stem such as "can" or "cann". Our index would then store documents featuring the word "canned" alongside documents featuring "cannibal". A search for "canned food" could then return a result about a cannibal's diet, which is not what we would want to see.
The same is true in reverse: if we wanted to learn about cannibals, we would also see results about canned food, such as a Thanksgiving food drive. In short, the number of results returned is higher, but the fraction of those results that are actually relevant to our query is lower.
In more technical terms, stemming this way increases recall because more of the relevant documents in the corpus are now retrieved, so the ratio of relevant results returned to total relevant documents in the corpus is higher.
However, the ratio of relevant results to total results returned is lower because we are also returning irrelevant results, hence our precision decreases.

(Edited: 2021-03-03)


-- Mar 3 In-Class Exercise Thread

kid puppi kiss

Example of increased recall and decreased precision:

Searching for "shooting up meth" becomes "shoot up meth" once stemming is applied.
"shoots up meth" also matches that query under stemming, which increases recall,
but the query now also matches "shooting" in the sense of firing a gun, which decreases precision.

(Edited: 2021-03-03)


-- Mar 3 In-Class Exercise Thread

Normalized: kids puppies and kisses

Normalized and stemmed: kid puppi and kiss

Example where stemming increases recall and decreases precision

"stemming" vs "stems"

If we apply stemming to these terms, we get "stem". This increases recall because we get more results with "stem" as a keyword. However, if we were looking specifically for stemming (this exercise) we may get irrelevant results involving plants, which decreases precision.
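The effect described above can be sketched in a few lines of Python. This is only an illustration: `toy_stem` is a hypothetical suffix stripper written for this example (it is not the Porter stemmer), and the two documents are made up.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical toy stemmer (NOT Porter): strip "ing" or "s",
    then collapse a trailing doubled consonant ("stemm" -> "stem")."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

# Made-up documents: one about stemming in IR, one about plants.
docs = {
    1: "stemming reduces words to a root form",
    2: "plant stems carry water to the leaves",
}

def build_index(docs, normalize):
    """Map each normalized token to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

exact = build_index(docs, str.lower)   # no stemming
stemmed = build_index(docs, toy_stem)  # with the toy stemmer

exact_hits = sorted(exact["stemming"])
stemmed_hits = sorted(stemmed[toy_stem("stemming")])
print(exact_hits, stemmed_hits)
```

Without stemming, the query "stemming" only matches document 1. With the toy stemmer, both "stemming" and "stems" are indexed under "stem", so the plant document is returned as well.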

(Edited: 2021-03-03)


-- Mar 3 In-Class Exercise Thread

normalized: kids puppies and kisses

stemming: kid puppi and kiss

Higher recall, lower precision example: fight and fighter. "fight" refers to an event. "fighter" refers to a participant in a fight. When searching for the biography of a fighter, you may receive results for the fighter's upcoming fights before the biography.

(Edited: 2021-03-03)


-- Mar 3 In-Class Exercise Thread

Normalized phrase: "kids puppies and kisses"

Stemmed phrase: "kid puppi and kiss"

Docs: "The core of the club is creating a community.", "Communism is not the antithesis of capitalism."

Here, recall is increased but precision decreases because “commun” can be mapped to two different meanings: community and communism. So if the query is “community”:

Recall = Relevant Results/Relevant = 1/1 = 1

Precision = Relevant Results/Results = 1/2 = 0.5
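The arithmetic above can be checked with a small Python sketch. The helper `precision_recall` and the document ids are assumptions made for illustration: doc 1 is the "community" sentence and doc 2 is the "Communism" sentence. Doc 3 is a hypothetical extra document that mentions "communities", added only to show how recall can rise when stemming is switched on.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Two-document case from the post: stemming retrieves both docs,
# but only doc 1 is relevant to the query "community".
two_docs = precision_recall(retrieved={1, 2}, relevant={1})

# Hypothetical doc 3 ("communities") is also relevant to the query.
# Exact matching on "community" finds only doc 1.
no_stemming = precision_recall(retrieved={1}, relevant={1, 3})
# Matching on the shared stem "commun" finds all three docs.
with_stemming = precision_recall(retrieved={1, 2, 3}, relevant={1, 3})

print(two_docs, no_stemming, with_stemming)
```

In the three-document variant, recall goes from 0.5 to 1 while precision drops from 1 to 2/3, which is the trade-off the exercise asks about.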


-- Mar 3 In-Class Exercise Thread

Normalization: "kids puppies and kisses"

Porter Stemmer Stemming: "kid puppi kiss"

Ex: "note" vs "notepad".

This is a concrete case where two words with different meanings (a note is a short piece of writing on paper or in a file, while a notepad is a physical item or a piece of software) are stemmed to the same term (note), which will increase recall and decrease precision (a lower ratio of relevant results to total results).

(Edited: 2021-03-03)


-- Mar 3 In-Class Exercise Thread

Normalization = kids puppies and kisses

Porter Stemmer = kid puppi kiss

Example: Players and Plays

"Plays" refers to theatrical shows, where the participants are called 'actors' rather than 'players', while "players" refers to people taking part in sports. If both stem to "play", a search for one also returns documents about the other, thus increasing recall and decreasing precision.
