2018-11-27

Hw5.

I have some questions about the programming part of the homework.
1. For the first program, do we index only the .txt files in the path_to_folder_to_index folder, and not other file types?
2. There's no path_to_folder_to_index parameter for the second program. How do we find path_to_folder_to_index in the second program? The output file from the first program does not say anything about where the files are. Or am I misunderstanding the question?
3. The BM25 we implemented was disjunctive, but the homework needs it to be conjunctive. Do we need to make it conjunctive for this program?

-- Hw5
  1. There might be other files in the folder besides .txt files, but you should not read the non-text files.
  2. You don't need path_to_folder_to_index. The output of the first program tells us the doc ids and file names, so we could find which file corresponds to a given doc id if we wanted. For the trec eval software input we don't need that.
  3. Yes
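
A minimal sketch of point 1, assuming "text files" means files whose names end in .txt (the thread does not say whether the check should be by extension or by content). The function name `list_txt_files` is hypothetical and not part of the assignment.

```python
from pathlib import Path


def list_txt_files(path_to_folder_to_index):
    """Return the .txt files directly inside the folder, in sorted order.

    Assumption: a file counts as a text file if its extension is .txt
    (case-insensitive). Subfolders and all other files are skipped.
    """
    folder = Path(path_to_folder_to_index)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".txt"
    )
```

Sorting the paths gives a stable order, which is convenient if doc ids are assigned by position in this list.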

-- Hw5
I got 2 and 3.
For the first one, does it mean we need to read all files (not only the .txt files) in path_to_folder_to_index, check whether each one is text-based, and then index the text-based files? Am I correct?
2018-12-05

-- Hw5
For Problem 7.3, is the expectation that we figure out the performance gain hidden in the O-notation AND do the proof as described in the original problem? Or just figure out the performance gain hidden in the O-notation?

-- Hw5
@xianghong yes

-- Hw5
@sshahab You should prove a `Theta` result, not just an O result. The goal is not to figure out the performance gain hidden in the constants so much as to show the performance gain is within a constant factor.
2018-12-07

-- Hw5
For the first part of the programming assignment, we need Nt and N in order to calculate Mopt. If Nt and N are the same, the formula for Mopt gives an error. What value of M should we choose in that case?

-- Hw5
@andvish93 There are two solutions.
First, because that term appears in every document, we don't need to use any bits to encode the delta list, because the gap is always 1. You can just encode the f_t,d.
Second, just set M to be 0, so you always use 1 bit to encode each gap.
(Edited: 2018-12-08)
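
A rough sketch of the second option, for illustration only. The thread never states the Mopt formula, so this assumes the common Golomb-parameter estimate Mopt = ceil(log(2 - Nt/N) / -log(1 - Nt/N)); check it against the formula in the homework. The function name `golomb_modulus` and the `fallback` parameter are hypothetical.

```python
import math


def golomb_modulus(n_t, n, fallback=0):
    """Return M for a term that occurs in n_t of n documents.

    Assumption: Mopt = ceil(log(2 - p) / -log(1 - p)) with p = n_t / n.
    When n_t == n, log(1 - p) is log(0), so the formula cannot be
    evaluated; in that case return the fallback (0, as suggested above).
    """
    if n_t <= 0 or n_t > n:
        raise ValueError("expected 0 < n_t <= n")
    if n_t == n:
        # The term is in every document, so every gap is 1.
        return fallback
    p = n_t / n
    return math.ceil(math.log(2 - p) / -math.log(1 - p))
```

For example, `golomb_modulus(10, 10)` takes the fallback branch instead of raising a math error. With the first option, the caller would instead skip the delta list entirely for such a term and store only the f_t,d values.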