-- Final Practice
Q 2 and 9)
Team members: Avinash, Preethi
Q 2)
The basic DFR formula assumes that all documents have the same length and is given by
`(log(1+(l_t)/N)+f_(t,d) log(1+(N)/(l_t)))/(f_(t,d)+1)` ---- 1
Document length normalization is incorporated into equation 1 by adjusting the term frequency, as proposed by Amati and van Rijsbergen:
`f'_(t,d)=f_(t,d)⋅log(1+(l_(avg))/(l_d)) ` -----2
Substituting equation 2 into equation 1, i.e. replacing `f_(t,d)` with `f'_(t,d)`, gives the length-normalized DFR equation
`(log(1+(l_t)/N)+f_(t,d)⋅log(1+(l_(avg))/(l_d))⋅log(1+N/(l_t)))/(f_(t,d)⋅log(1+(l_(avg))/(l_d))+1)`
Here
`l_t` – total occurrences of the term in the collection
`l_(avg)` – Average length of the documents
N – Number of documents
`l_d` – Length of the document
`f_(t,d)` – Frequency of the term in the document
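As a rough check of the two equations, here is a minimal Python sketch. The function name `dfr_score` and its parameter names are made up for illustration, and base-2 logarithms are an assumption (the answer above does not state the base). With base 2, a document of exactly average length has `log(1+1) = 1`, so its adjusted frequency equals the raw frequency.

```python
import math


def dfr_score(f_td, l_t, n_docs, l_d=None, l_avg=None):
    """Score one term in one document with the DFR formula (equation 1).

    If both l_d and l_avg are given, the term frequency is first
    length-normalized as in equation 2.
    Assumption: base-2 logarithms; the source does not specify the base.
    """
    if l_d is not None and l_avg is not None:
        # Equation 2: f' = f * log(1 + l_avg / l_d)
        f_td = f_td * math.log2(1 + l_avg / l_d)
    # Equation 1, evaluated with the (possibly adjusted) term frequency
    numerator = math.log2(1 + l_t / n_docs) + f_td * math.log2(1 + n_docs / l_t)
    return numerator / (f_td + 1)


# Illustrative numbers only: a term occurring 50 times in a 1000-document collection
basic = dfr_score(3, 50, 1000)
average_length = dfr_score(3, 50, 1000, l_d=100, l_avg=100)
long_document = dfr_score(3, 50, 1000, l_d=200, l_avg=100)
```

For a rare term like this one, the score increases with the adjusted frequency, so the same raw count in a document twice the average length scores lower than in an average-length document.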
Q 9)
The only matrix used in the HITS algorithm is the adjacency matrix L. The hub and authority updates amount to repeated multiplication by the symmetric positive semidefinite matrices `LL^T` (hubs) and `L^TL` (authorities), which is what leads to convergence across iterations in HITS.
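The HITS iteration described above can be sketched in a few lines of Python. This is only an illustration: the function name `hits`, the fixed iteration count, the 2-norm scaling, and the three-node graph are all assumptions, not part of the answer, and the graph is assumed to have at least one edge.

```python
import math


def hits(adj, iterations=50):
    """Run HITS on an adjacency matrix given as a list of rows.

    adj[i][j] == 1 means page i links to page j.
    Returns (hubs, authorities), each scaled to unit 2-norm.
    """
    n = len(adj)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority update: a = L^T h
        auths = [sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        # Hub update: h = L a (so one full step multiplies h by L L^T)
        hubs = [sum(adj[i][j] * auths[j] for j in range(n)) for i in range(n)]
        # Rescale so the vectors do not grow without bound
        a_norm = math.sqrt(sum(a * a for a in auths))
        h_norm = math.sqrt(sum(h * h for h in hubs))
        auths = [a / a_norm for a in auths]
        hubs = [h / h_norm for h in hubs]
    return hubs, auths


# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2
example_adj = [[0, 1, 1],
               [0, 0, 1],
               [0, 0, 0]]
hubs, auths = hits(example_adj)
```

In this small graph, page 2 receives the most links and ends up with the largest authority score, while page 0 has the most out-links and ends up as the strongest hub.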
In SALSA, we normalize the rows of the adjacency matrix L to form `L_r` and normalize the columns of L to form `L_c`; rows or columns that are all zero are left as zero. We then form the hub matrix H = `L_rL_c^T` and the authority matrix A = `L_c^TL_r`. This normalization makes SALSA immune to topic drift.
For example, L = `[[0,1,0,0],[1,0,1,0],[0,1,1,1],[0,0,0,0]]` is the adjacency matrix used in HITS.
For SALSA, this is transformed to
`L_r` = `[[0,1,0,0],[1/2,0,1/2,0],[0,1/3,1/3,1/3],[0,0,0,0]]` and
`L_c` = `[[0,1/2,0,0],[1,0,1/2,0],[0,1/2,1/2,1],[0,0,0,0]]`
which are used in SALSA to compute the authority and hub vectors.
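The normalization in the example can be reproduced with exact fractions. The following is a small sketch using only the Python standard library; the helper names (`row_normalize`, `col_normalize`, `transpose`, `matmul`) are made up for illustration.

```python
from fractions import Fraction


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def row_normalize(m):
    """Divide each nonzero row by its sum; all-zero rows stay zero."""
    result = []
    for row in m:
        total = sum(row)
        result.append([Fraction(x, total) if total else Fraction(0) for x in row])
    return result


def col_normalize(m):
    """Divide each nonzero column by its sum; all-zero columns stay zero."""
    return transpose(row_normalize(transpose(m)))


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


# Adjacency matrix from the example
L = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 1, 1],
     [0, 0, 0, 0]]

L_r = row_normalize(L)
L_c = col_normalize(L)

H = matmul(L_r, transpose(L_c))   # SALSA hub matrix, L_r L_c^T
A = matmul(transpose(L_c), L_r)   # SALSA authority matrix, L_c^T L_r
```

Each row of H that belongs to a page with out-links sums to 1, and each row of A that belongs to a page with in-links sums to 1, so both behave like Markov chain transition matrices on those pages.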
(Edited: 2018-12-12)