-- Crawler Halting
Hi Chris...
I'm revisiting Yioop for a project (which, ironically, is Canadian-focused like findcan.ca). I have 3.8.1 running on four machines: one is configured as the scheduler/name server, and the other three are running four fetchers each.
My problem is crawling, or more specifically limiting the crawl. If I go into Crawl Options and select "Use Yioop Defaults", everything starts up and within a few minutes I have thousands of pages. If I select "use options below" and put in a few sites, everything seems to work. However, as soon as I specify domain entries under "Allowed to Crawl Sites", things stop working: the crawler just sits idle and never starts crawling.
More specific examples of what my crawler options look like:
Allowed to Crawl Sites:
domain:.bell.ca
domain:.canada.com
[nothing in exception list]
Seed Sites:
http://bell.ca/
http://canada.com/
Is there something wrong with my configuration? I want to crawl the seed sites and, through the "Allowed to Crawl Sites" entries, include any URLs at those domains.
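To make the intent concrete, here is a minimal Python sketch of the matching I expect from those entries. It is only an illustration under my own assumptions, not Yioop's actual code: the function names (`host_of`, `intended_match`, `plain_suffix_match`) are hypothetical, and I don't know which comparison, if either, Yioop applies.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def intended_match(url, rule):
    """What I expect: the domain itself and any subdomain are allowed."""
    domain = rule[len("domain:"):].lstrip(".").lower()
    host = host_of(url)
    return host == domain or host.endswith("." + domain)


def plain_suffix_match(url, rule):
    """A literal 'host ends with the rule text' test (an assumption)."""
    suffix = rule[len("domain:"):].lower()
    return host_of(url).endswith(suffix)


# The entries and seeds from my crawl options.
ALLOWED = ["domain:.bell.ca", "domain:.canada.com"]
SEEDS = ["http://bell.ca/", "http://canada.com/"]

for seed in SEEDS:
    print(seed,
          any(intended_match(seed, rule) for rule in ALLOWED),
          any(plain_suffix_match(seed, rule) for rule in ALLOWED))
```

Under a literal suffix comparison, the bare hosts bell.ca and canada.com would not match entries that begin with a dot, while a host such as www.bell.ca would. Whether that has anything to do with the idle crawl is something I have not verified.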
Thank you!
Paul
(Edited: 2016-11-02)