-- Maximizing performance
I would be surprised if it was 20,000 links per minute. You probably meant per hour.
My guess is you could increase the speed by running more fetchers. For a setup like the one above, you could run 1 queue server and 4 or 5 fetchers. A single queue server can handle about 100,000 - 200,000 urls/hour; a single fetcher maxes out at around 20,000. My guess is your current limiting factor is how fast the fetchers can process pages. Fetchers are responsible for initial page summarization and inverted index creation -- not just downloading pages. The indices produced by the fetchers are periodically merged by the queue server. Under Page Options, the byte range to download, the choice of summarizer, and the max page summary length will all factor into how long it takes to process a single page.
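To see why more fetchers help, here is a back-of-the-envelope estimate using only the figures from this post (20,000 pages/hour per fetcher, 100,000 - 200,000 urls/hour per queue server). The exact function and constant names are illustrative, and real rates will vary with hardware and Page Options settings:

```python
# Rough crawl-rate estimate. Rates are the ballpark figures quoted above;
# actual throughput depends on hardware and Page Options settings
# (byte range to download, summarizer choice, max summary length).

FETCHER_RATE = 20_000       # pages/hour a single fetcher maxes out at
QUEUE_SERVER_MAX = 200_000  # high end of what one queue server can handle

def crawl_rate(num_fetchers):
    """Fetchers are the bottleneck until their combined rate
    exceeds what one queue server can schedule and merge."""
    return min(num_fetchers * FETCHER_RATE, QUEUE_SERVER_MAX)

for n in (1, 4, 5, 10):
    print(n, "fetchers ->", crawl_rate(n), "pages/hour")
```

So with 4 or 5 fetchers you land around 80,000 - 100,000 pages/hour, still comfortably within what a single queue server can absorb.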
Best,
Chris
(Edited: 2016-12-01)