2016-11-02

-- Crawler Halting
Hi Chris...
I'm revisiting using Yioop for a project (which ironically is Canadian focused, like findcan.ca) ... 3.8.1 running on four machines. One machine is configured as the scheduler/name server, and the other three machines are each running four fetchers.
My problem, though, is crawling - more specifically, limiting the crawl. If I go into Crawl Options and select "Use Yioop Defaults", everything starts up and within a few minutes I have thousands of pages. If I select "Use Options Below" and put in a few sites, everything seems to work. However, if I specify domain entries in "Allowed To Crawl Sites", things stop working ... the crawl just sits idle and never fetches anything.
More specific examples of what my crawler options look like:
Allowed to Crawl Sites:
domain:.bell.ca
domain:.canada.com
[nothing in exception list]
Seed Sites:
http://bell.ca/
http://canada.com/
Is there something wrong with my configuration? I want to crawl the seed sites and, through the "Allowed To Crawl Sites" entries, include any URLs at those domains ....
Thank you! Paul
(Edited: 2016-11-02)

-- Crawler Halting
Maybe switch:
 domain:.bell.ca
 domain:.canada.com 
to
 domain:bell.ca
 domain:canada.com 
Best,
Chris
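A plausible reason the leading dot matters, assuming the `domain:` rule is applied as a suffix match on the host (this is an illustrative sketch, not Yioop's actual source): with `.bell.ca`, the bare seed host `bell.ca` does not end with `.bell.ca`, so only subdomains would pass, and the seeds themselves get filtered out before the crawl can expand.

```python
# Hypothetical suffix-matching check to illustrate the dotted vs. undotted
# domain: rules discussed above. Not Yioop's real implementation.
def host_matches(host: str, rule: str) -> bool:
    """'bell.ca' matches the bare host and any subdomain;
    '.bell.ca' matches only hosts ending in '.bell.ca' (subdomains)."""
    if rule.startswith("."):
        return host.endswith(rule)
    return host == rule or host.endswith("." + rule)

# With the dotted rule, the seed host itself fails the check:
print(host_matches("bell.ca", ".bell.ca"))      # False
print(host_matches("www.bell.ca", ".bell.ca"))  # True
# Without the dot, both the bare host and its subdomains pass:
print(host_matches("bell.ca", "bell.ca"))       # True
print(host_matches("www.bell.ca", "bell.ca"))   # True
```

Under this reading, dropping the dot lets the seeds `http://bell.ca/` and `http://canada.com/` match their own allowed-domain rules, so the crawl can start.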

-- Crawler Halting
Thanks Chris... I tried this with a few domains now, and it seems to be working. I had previously tried without the "." in front - maybe I misunderstood the documentation that references it. However, when I try this with a larger set of domains (about 160 lines), the crawl never starts - could this be a PHP memory setting somewhere, maybe? (I'm just using default settings, and each server has 32G of RAM.) I'm going to try it again overnight with a larger list of domains... Thanks again!
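If a PHP memory limit is suspected, one thing worth checking (a general PHP troubleshooting step, not something the thread confirms is the cause): Yioop's server and fetcher processes run under the PHP CLI, which reads its own php.ini, so the limit to inspect is the CLI one rather than the web server's.

```shell
# Show the memory_limit the PHP CLI is actually running with:
php -r 'echo ini_get("memory_limit"), PHP_EOL;'
# Show which ini file(s) the CLI loads, so you know where to raise it:
php --ini
```

If the reported limit is low (e.g. 128M), raising `memory_limit` in the CLI php.ini and restarting the queue server and fetchers would rule this factor out.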

-- Crawler Halting
findcan.ca had 154 allowed-to-crawl domains when it was crawling, so the number of lines itself is probably not the problem.
Best,
Chris