-- Mar 3 In-Class Exercise Thread
Normalization: "kids puppies and kisses"
Porter Stemmer Stemming: "kid puppi kiss"
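To make the stemming line above concrete, here is a minimal sketch that applies only step 1a of the Porter algorithm (the plural-suffix rules), which is enough to reproduce this particular output. The function names and the `STOPWORDS` set are made up for illustration, and dropping "and" as a stop word is an assumption made so the result matches the line above. A full Porter stemmer has several more steps.

```python
# Sketch of Porter step 1a only (plural suffixes); not a full Porter stemmer.
STOPWORDS = {"and"}  # assumption: "and" is dropped as a stop word


def porter_step_1a(word: str) -> str:
    """Apply the four suffix rules of Porter step 1a, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]  # sses -> ss
    if word.endswith("ies"):
        return word[:-2]  # ies -> i
    if word.endswith("ss"):
        return word       # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]  # s -> (removed)
    return word


def stem_phrase(text: str) -> str:
    """Stem every non-stop-word token of an already normalized phrase."""
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(porter_step_1a(t) for t in tokens)


print(stem_phrase("kids puppies and kisses"))  # kid puppi kiss
```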
An example could be "canned" vs "cannibal".
An aggressive stemmer could cut both of these down to the same stem, such as "can" or "cann".
This means that our index would store documents featuring the word "canned" under the same term as documents featuring "cannibal".
A search for "canned food" could then return a result about a cannibal's diet, which is not what
we would want to see.
The reverse also holds: if we wanted to learn about cannibals, we would also see results about
canned food, such as a Thanksgiving food drive. In short, more results are returned, but a smaller
fraction of those results is actually relevant to our query.
In more technical terms, stemming this way increases recall: more of the relevant documents in the corpus now match the query, so the fraction
of relevant results retrieved to total relevant documents in the corpus goes up (or at least does not drop).
However, precision decreases: the result set now also includes irrelevant documents, so the fraction of
relevant results to total results returned is lower.
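The trade-off can be checked on a toy example. The sketch below is illustrative only: the four documents, the relevance judgments, and `crude_stem` (which just keeps the first three letters and is not the Porter stemmer) are all made up to mimic an over-aggressive stemmer. It uses the standard definitions, precision = relevant retrieved / total retrieved and recall = relevant retrieved / total relevant.

```python
# Toy corpus and relevance judgments for the query term "canned" (made up).
DOCS = {
    1: "canned food drive",
    2: "canned soup recipes",
    3: "cannibal diet history",
    4: "cans of food",
}
RELEVANT = {1, 2, 4}


def crude_stem(word):
    """Hypothetical over-aggressive stemmer: keep the first three letters."""
    return word[:3]


def retrieve(query_term, stem=None):
    """Return ids of documents containing the (optionally stemmed) term."""
    key = stem(query_term) if stem else query_term
    hits = set()
    for doc_id, text in DOCS.items():
        terms = text.split()
        if stem:
            terms = [stem(t) for t in terms]
        if key in terms:
            hits.add(doc_id)
    return hits


def precision_recall(retrieved, relevant):
    """precision = TP / retrieved, recall = TP / relevant."""
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall


exact = retrieve("canned")
stemmed = retrieve("canned", stem=crude_stem)
print(sorted(exact), precision_recall(exact, RELEVANT))
print(sorted(stemmed), precision_recall(stemmed, RELEVANT))
```

Without stemming, only documents 1 and 2 match, so precision is 1.0 and recall is 2/3. With the crude stemmer all four documents match, so recall rises to 1.0 while precision falls to 0.75, because the "cannibal" document is now retrieved as well.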
(Edited: 2021-03-03)