-- Crawl Delay
Those two constants have nothing to do with crawl delay. NUM_MULTI_CURL_PAGES is the number of URLs a fetcher will try to download at the same time (using curl's multi interface). Suppose it is 100. Then Yioop downloads a batch of 100 pages in one go, processes those pages, downloads the next batch of 100, and so on. It processes pages (extracting content, etc.) serially in a single thread, sleeping FETCHER_PROCESS_DELAY microseconds between pages. So if FETCHER_PROCESS_DELAY is 10000, it sleeps for 1/100 of a second between processing pages. This prevented overheating on one of my older Linux boxes.
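To illustrate, here is a simplified sketch of that cycle (not the actual Fetcher.php code; the constant values and the downloadBatch/processPage helpers are made up for the example):

```php
<?php
// Hypothetical sketch of the fetch/process cycle. The real
// Fetcher.php logic is more involved; constant values are examples.
define("NUM_MULTI_CURL_PAGES", 100);    // pages per download batch
define("FETCHER_PROCESS_DELAY", 10000); // microseconds between pages

function downloadBatch($batch)
{
    // Stand-in for the curl_multi download of a whole batch at once
    return $batch;
}

function processPage($page)
{
    // Stand-in for extracting links, text, etc. from one page
}

function crawlLoop($urls)
{
    $processed = 0;
    // Download in batches of NUM_MULTI_CURL_PAGES at a time...
    foreach (array_chunk($urls, NUM_MULTI_CURL_PAGES) as $batch) {
        $pages = downloadBatch($batch);
        // ...then process each downloaded page serially, one thread
        foreach ($pages as $page) {
            processPage($page);
            $processed++;
            usleep(FETCHER_PROCESS_DELAY); // 10000 us = 1/100 s pause
        }
    }
    return $processed;
}
```

The point of the usleep() between pages is just to give the CPU a breather during the serial processing phase; it is unrelated to how politely the crawler hits any given host.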
The following lines in Fetcher.php try to detect hosts that are getting swamped:
if (($response_code >= 400 && $response_code != 404) ||
    $response_code < 100) {
    // < 100 will capture failures to connect, which are returned
    // as strings
    $was_error = true;
    $this->hosts_with_errors[$host]++;
}
If a host accumulates more than DOWNLOAD_ERROR_THRESHOLD errors, it is treated as if it had a crawl delay of ERROR_CRAWL_DELAY.
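In effect, the error count feeds back into the scheduling decision roughly like this (DOWNLOAD_ERROR_THRESHOLD and ERROR_CRAWL_DELAY are real Yioop constants, but the function and the constant values here are illustrative, not the actual code):

```php
<?php
// Hypothetical sketch of how per-host error counts translate into
// an enforced crawl delay. Values below are examples only.
define("DOWNLOAD_ERROR_THRESHOLD", 10); // example error count
define("ERROR_CRAWL_DELAY", 50);        // example delay, in seconds

function effectiveCrawlDelay($host, $hosts_with_errors, $robot_delay)
{
    $errors = $hosts_with_errors[$host] ?? 0;
    if ($errors > DOWNLOAD_ERROR_THRESHOLD) {
        // Swamped host: back off as if robots.txt had demanded
        // at least ERROR_CRAWL_DELAY seconds between requests
        return max($robot_delay, ERROR_CRAWL_DELAY);
    }
    return $robot_delay; // otherwise honor the normal delay
}
```

So a host that keeps returning errors gets throttled automatically, even if its robots.txt specifies no crawl delay at all.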
Best,
Chris
(Edited: 2019-04-05)