-- Practice Midterm
-- Damanpreet Kaur and Divyashree Jayaram
10) Fenestrate
For character 5-grams, the word is split into the following terms, where . marks the start and the end of the word:
.fene, fenes, enest, nestr, estra, strat, trate, rate.
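The split above can be reproduced with a short sketch. The function name `char_ngrams` and its parameters are made up for illustration, and it assumes the word is padded with one boundary marker (".") on each side before the sliding window is applied.

```python
def char_ngrams(word, n=5, boundary="."):
    """Return the overlapping character n-grams of a word.

    The word is padded with one boundary marker on each side first
    (an assumption that matches the answer above).
    """
    padded = boundary + word + boundary
    if len(padded) < n:
        # Word too short for a full window: return it as a single term.
        return [padded]
    # Slide a window of width n over the padded word, one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("fenestrate"))
```

A word of length k padded this way yields k + 2 - n + 1 terms, so "fenestrate" (10 letters) gives 8 character 5-grams.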
Stopping - In IR systems, stopwords are stripped from the query before the index look-up. Stopwords often include function words: words that have no well-defined meaning in and of themselves, but instead modify other words or indicate grammatical relationships.
Stopping is also applied to the corpus before the index is created.
Stopping reduces the index size and also decreases query processing time.
For instance, the query "A and B" would be converted to "A B".
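As a rough illustration of stopping, the sketch below filters a whitespace-tokenized query against a stopword set. The list `STOPWORDS` and the function `remove_stopwords` are hypothetical; real systems use longer, curated lists and the same filter would be applied to the corpus at indexing time.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"and", "the", "of", "to", "in", "is"}


def remove_stopwords(text):
    """Split text on whitespace and drop tokens that are stopwords."""
    tokens = text.split()
    # Compare case-insensitively but keep the original spelling of kept tokens.
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]


print(" ".join(remove_stopwords("A and B")))
```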
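Stemming - The IR system reduces each term to its root form so that different forms of the same word match. For example, "runs" and "running" both reduce to "run", so a query for "runs" matches "running".
Porter stemmer rules (step 1a; only the first matching rule is applied)
sses => ss
ies => i
ss => ss (Stop execution)
s => (empty; the final s is removed)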
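The four rules listed above can be sketched as an ordered suffix table. This covers only these rules, not the full Porter stemmer, and the function name `porter_step_1a` is chosen here for illustration.

```python
def porter_step_1a(word):
    """Apply the four suffix rules above; the first matching rule wins."""
    rules = [
        ("sses", "ss"),
        ("ies", "i"),
        ("ss", "ss"),  # matches and stops, so the plain "s" rule is not tried
        ("s", ""),
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word  # no rule applies


for w in ["caresses", "ponies", "caress", "runs"]:
    print(w, "=>", porter_step_1a(w))
```

Listing "ss" before "s" is what keeps a word such as "caress" from losing its final s.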
****************************************************************************
6)
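Each positional index entry has the form <d, f, <p1, p2, ..., pf>>: the document number d, the frequency f of the term in d, and the list of positions at which the term occurs in d.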
function nextTermInDoc(t, off, <n', f, <p1, p2, ..., pf>>)
{
    // P[1..f] = positions <p1, p2, ..., pf> taken from the posting for document n'
    // f = number of positions in that posting (the length of P)
    static c = [];  // c[t] = index in P returned by the previous call for term t
    current := off;
    if (f == 0 || P[f] <= current) then
        return infty;
    if (P[1] > current) then
        c[t] := 1;
        return P[c[t]];
    if (c[t] > 1 && c[t] <= f && P[c[t] - 1] <= current) then
        low := c[t] - 1;
    else
        low := 1;
    jump := 1;
    high := low + jump;
    while (high < f && P[high] <= current) do
        low := high;
        jump := 2 * jump;
        high := low + jump;
    if (high > f) then
        high := f;
    c[t] := binarySearch(t, low, high, current);  // smallest index in (low, high] with P[index] > current
    return P[c[t]];
}
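The galloping search used in the pseudocode above can be sketched in runnable form as follows. This is a minimal version under stated assumptions: indices are 0-based, the cached index is passed in as a `hint` argument instead of being kept in a static array, and the standard-library `bisect_right` stands in for `binarySearch`. The name `gallop_next` is hypothetical.

```python
import math
from bisect import bisect_right


def gallop_next(positions, current, hint=0):
    """Return (index, value) of the first entry greater than current.

    positions is a sorted list; hint is the index returned by an earlier call.
    Returns (None, math.inf) when no entry is greater than current.
    """
    n = len(positions)
    if n == 0 or positions[-1] <= current:
        return None, math.inf
    if positions[0] > current:
        return 0, positions[0]
    # Reuse the hint only if the entry just before it is still <= current.
    if 0 < hint <= n - 1 and positions[hint - 1] <= current:
        low = hint - 1
    else:
        low = 0
    jump = 1
    high = low + jump
    # Double the step until positions[high] > current or the end is reached.
    while high < n - 1 and positions[high] <= current:
        low = high
        jump *= 2
        high = low + jump
    high = min(high, n - 1)
    # The answer now lies in (low, high]; finish with a binary search.
    idx = bisect_right(positions, current, low, high + 1)
    return idx, positions[idx]


positions = [3, 8, 15, 21, 40, 57]
print(gallop_next(positions, 10))
print(gallop_next(positions, 21, hint=2))
```

Passing the previous index back as `hint` keeps a sequence of increasing look-ups close to the last answer, which is the purpose of the cache `c[t]` in the pseudocode.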
function nextDoc(t, current)
{
    // P[1..l[t]] = document numbers of the postings of t, taken from
    //   <term: <n', f, <p1, p2, ..., pf>>, <n'', f'', <p1'', p2'', ..., pf''>>, ... k entries of pos index>
    // l[t] = number of postings (documents) for t
    static c = [];  // c[t] = index in P returned by the previous call for term t
    if (l[t] == 0 || P[l[t]] <= current) then
        return infty;
    if (P[1] > current) then
        c[t] := 1;
        return <P[c[t]], f, <p1, p2, ...>>;
    if (c[t] > 1 && P[c[t] - 1] <= current) then
        low := c[t] - 1;
    else
        low := 1;
    jump := 1;
    high := low + jump;
    while (high < l[t] && P[high] <= current) do
        low := high;
        jump := 2 * jump;
        high := low + jump;
    if (high > l[t]) then
        high := l[t];
    c[t] := binarySearch(t, low, high, current);
    return <P[c[t]], f, <p1, p2, ...>>;  // the whole posting, as next() expects
}
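function next(t, n:m)
{
    posting := nextDoc(t, n - 1);  // first posting with document number >= n
    if (posting == infty) then
        return infty;
    <n', f, <p1, p2, ...>> := posting;
    if (n == n') then
        off := m;
    else
        off := -infty;
    m' := nextTermInDoc(t, off, <n', f, <p1, p2, ...>>);
    if (m' == infty) then  // no occurrence after m in document n
        posting := nextDoc(t, n');
        if (posting == infty) then
            return infty;
        <n', f, <p1, p2, ...>> := posting;
        m' := nextTermInDoc(t, -infty, <n', f, <p1, p2, ...>>);
    return n':m';
}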
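To check the behaviour of next(t, n:m) on concrete data, here is a small sketch over a toy positional index. It is an illustration, not the answer's method: the index contents, the names `INDEX`, `END` and `next_occurrence` are hypothetical, and plain binary search from the standard `bisect` module replaces the galloping search to keep the example short.

```python
import math
from bisect import bisect_left, bisect_right

# Hypothetical toy index: term -> list of (doc_id, sorted positions), sorted by doc_id.
INDEX = {
    "index": [(1, [4, 9]), (3, [2]), (7, [1, 5, 8])],
}

END = (math.inf, math.inf)  # stands in for "infty" in the pseudocode


def next_occurrence(index, term, doc, offset):
    """Return the first (doc, position) of term strictly after doc:offset, or END."""
    postings = index.get(term, [])
    doc_ids = [d for d, _ in postings]  # rebuilt per call; fine for a sketch
    i = bisect_left(doc_ids, doc)  # first posting with doc_id >= doc
    while i < len(postings):
        d, positions = postings[i]
        # Within the starting document only positions after offset count.
        after = offset if d == doc else -math.inf
        j = bisect_right(positions, after)  # first position > after
        if j < len(positions):
            return d, positions[j]
        i += 1  # nothing left in this document, move to the next one
    return END


print(next_occurrence(INDEX, "index", 1, 4))
print(next_occurrence(INDEX, "index", 1, 9))
print(next_occurrence(INDEX, "index", 7, 8))
```

The second call shows the case where the current document has no later occurrence, so the search continues with the first position of the next document.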
(Edited: 2022-03-14)