2022-11-21

Producing Text Files From Wiki.

Hi Professor,
We are asked to create a folder of 10 text files using text from 10 different Wikipedia pages. How would you like us to do this part? Could you elaborate on the task? Do we copy and paste all the text from a Wiki page? Do we keep punctuation? More details about how we should process the text would be very helpful.

-- Producing Text Files From Wiki
Copy and paste works. Your 5-gram processing code will then strip the punctuation before making the 5-grams.
Best,
Chris
2022-11-24

-- Producing Text Files From Wiki
Hi Professor, is a small sample of the Wiki text okay? Do we need all text from a Wiki page?

-- Producing Text Files From Wiki
Train on all of the main content of the page. You don't need to train on the nav elements at the sides and top. More data will generally give better results, and it is the computer, not you, doing the work.
Best,
Chris
(Edited: 2022-11-24)

-- Producing Text Files From Wiki
Also, Professor, do we treat numbers and other symbols as just text? How would you like us to handle characters that are not alphabetic?

-- Producing Text Files From Wiki
Split into sentences. Delete punctuation. Otherwise, treat non-whitespace characters as text.
Best,
Chris
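
The three rules above could be sketched in Python roughly as follows. This is only an illustration under some assumptions that are not from the thread: a sentence is taken to end at `.`, `!`, or `?` followed by whitespace, "punctuation" is taken to mean the ASCII set in `string.punctuation`, and the function name `clean_sentences` is hypothetical.

```python
import re
import string

# Assumption: ASCII punctuation only. Wikipedia text may also contain
# characters such as curly quotes or en dashes, which this table leaves in place.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_sentences(text):
    """Split text into sentences, delete punctuation, keep everything else."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    raw = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = []
    for sentence in raw:
        no_punct = sentence.translate(_PUNCT_TABLE)
        # Collapse runs of whitespace; digits and other symbols stay as text.
        words = no_punct.split()
        if words:
            cleaned.append(" ".join(words))
    return cleaned
```

The split is deliberately simple and will break on abbreviations such as "U.S. Army", so a real submission may want a more careful sentence splitter.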
2022-11-28

-- Producing Text Files From Wiki
Hi Professor, do we add two underscores at the start and end of every sentence, as shown on the hw5 instruction page?
(Edited: 2022-11-28)

-- Producing Text Files From Wiki
Yes
(Edited: 2022-11-28)
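
For illustration, padding a cleaned sentence with two underscores on each side and then taking 5-grams might look like the sketch below. The hw5 instruction page is not quoted in this thread, so the details are assumptions: the 5-grams are taken over characters (the thread does not say whether they are character or word 5-grams), and the function name `padded_five_grams` and its parameters are hypothetical.

```python
def padded_five_grams(sentence, n=5, pad="__"):
    """Return the character n-grams of one cleaned sentence, padded with `pad`."""
    # Assumption: padding is two underscores at the start and the end,
    # as confirmed in the thread; character-level grams are a guess.
    padded = f"{pad}{sentence}{pad}"
    # Slide a window of width n across the padded sentence.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

With this sketch, `padded_five_grams("hi there")` starts with `"__hi "` and ends with `"ere__"`, so the underscores mark where a sentence begins and ends.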