2016-11-30

Maximizing performance.

I'm running Yioop on a server with a 150GB SSD, a 6-core processor, 8GB of RAM, and a 300Mb/s connection. Yioop seems to max out at 20,000 links per minute, which seems very low. What are some ways I can increase the indexing speed?

-- Maximizing performance
I would be surprised if it was 20,000 links per minute; you probably meant per hour. My guess is you could increase the speed by running more fetchers. For a setup like the above, you could run 1 queue server and 4 or 5 fetchers. A single queue server can handle about 100,000 - 200,000 urls/hour, while a single fetcher maxes out at around 20,000 urls/hour. My guess is your current limiting factor is how fast the fetchers can process pages. Fetchers are responsible for initial page summarization and inverted index creation -- not just downloading pages. Indices produced by the fetchers are periodically merged by the queue server. Under Page Options, the byte range to download, the choice of summarizer, and the max page summary length all factor into how long it takes to process a single page.
Best,
Chris
(Edited: 2016-12-01)
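
To make the arithmetic in the reply concrete, here is a rough back-of-the-envelope sketch. It is not Yioop code: the function name is hypothetical, and it assumes fetcher throughput adds up linearly until the queue server's capacity is reached, which real crawls may not achieve. The only figures taken from the reply are the 20,000 urls/hour per fetcher and the 100,000 - 200,000 urls/hour per queue server.

```python
# Figures quoted in the reply above (urls per hour).
FETCHER_MAX_URLS_PER_HOUR = 20_000
QUEUE_SERVER_LOW = 100_000
QUEUE_SERVER_HIGH = 200_000


def estimated_urls_per_hour(num_fetchers, queue_server_cap):
    """Estimate crawl throughput for one queue server (hypothetical helper).

    Assumes each fetcher contributes its full 20,000 urls/hour and that
    the queue server is the only other bottleneck.
    """
    return min(num_fetchers * FETCHER_MAX_URLS_PER_HOUR, queue_server_cap)


if __name__ == "__main__":
    for n in (1, 4, 5):
        low = estimated_urls_per_hour(n, QUEUE_SERVER_LOW)
        high = estimated_urls_per_hour(n, QUEUE_SERVER_HIGH)
        print(f"{n} fetcher(s): {low:,} to {high:,} urls/hour")
```

Under these assumptions, 4 fetchers come to about 80,000 urls/hour, and 5 fetchers reach the lower end of what a single queue server can handle, which fits the suggestion of 4 or 5 fetchers for this machine.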