2017-03-14

Trouble-shoot QueueServer.php and Fetcher.php.

Hi,
I'm freshly installing the latest Yioop version.
I built a new CentOS 7.3.1611 server, added an Apache virtual host running PHP 5.4 under mod_fcgid, etc.
I then git cloned Yioop, put it into the main web directory, fixed the AllowOverride setting, and went to the web interface at /admin.
The questions I have are:
1. The main web directory shows me a blank page, which is odd; I could see nothing in the Apache logs to indicate any problems (usually blank pages mean PHP errors or the like).
2. I got into the /admin path and started to configure based on your online Install guide. I turn ON the QueueServer and Fetcher: nothing appears in the logs. I turn on the media server: a log file is generated.
I then try the CLI approach using:
php QueueServer.php terminal
and it jumps back to the shell prompt immediately. The same goes for php Fetcher.php terminal.
Using terminal, I would expect it to sit there until I press CTRL-C (as documented).
So something is wrong and debugging it is difficult when no output is produced.
So for point 2, how can I debug this?
Or maybe this doesn't work too well under mod_fcgid? Note that the files and directories are completely owned by the mod_fcgid user/group.
Any advice?
Thanks.

-- Trouble-shoot QueueServer.php and Fetcher.php
I would guess that a blank page means some kind of PHP error, like you said. Do you have a separate PHP log file, and did the error appear there? Often if you use fcgi there is a separate log file for that as well. I haven't really tested under fcgi, so I can't say whether it would work or not; my guess is that it shouldn't be too hard to get it working. The fact that QueueServer.php immediately crashes suggests to me that you are missing some PHP function that Yioop needs, or a lack of memory. Maybe popen? I am not sure, though, why system error messages aren't going to the terminal in this case. Can you change that?
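One quick way to check, assuming a stock PHP CLI build (these flags and functions are standard PHP, not anything Yioop-specific), is to force errors onto the terminal and look at what's disabled:
    # force all PHP errors to display, then show the script's exit status
    php -d display_errors=1 -d error_reporting=E_ALL QueueServer.php terminal
    echo "exit status: $?"
    # check that popen is available, the memory limit, and any disabled functions
    php -r 'var_dump(function_exists("popen"), ini_get("memory_limit"), ini_get("disable_functions"));'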
2017-03-16

-- Trouble-shoot QueueServer.php and Fetcher.php
Hi. Thank you for your reply.
I couldn't find any way to resolve the issue.
I installed a fresh CentOS 7 server and followed your instructions exactly as displayed in the CentOS Linux install guide. I got exactly the same problem.
I then decided to upgrade PHP from 5.4 (default) to 5.6 using the webtatic repo.
I did this, and the Yioop webpage finally displayed successfully.
I then tried running the "php QueueServer.php terminal" and it works. Same for Fetcher.
I then went back to the older system and upgraded to PHP 5.6 (this time trying the remi repo), tested "php QueueServer.php terminal", and it worked. Unfortunately, when I try to access the homepage it tries to download the PHP file instead of interpreting it. This is still running under mod_fcgid, so it might not be Yioop at this stage.
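(If it turns out to be the handler mapping, a "download instead of execute" symptom usually means the .php handler was lost when the PHP package changed; a minimal mod_fcgid mapping is something along these lines, where the wrapper path is illustrative and depends on the PHP package:)
    # map .php files to mod_fcgid instead of serving them as plain files
    AddHandler fcgid-script .php
    # wrapper path is illustrative; depends on where your package puts php-cgi
    FcgidWrapper /usr/bin/php-cgi .php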
So in summary so far, from my tests, the latest CentOS 7 with PHP 5.4 simply doesn't work with the latest Yioop version.
I'll continue to run tests tonight and report back here.
(Edited: 2017-03-16)

-- Trouble-shoot QueueServer.php and Fetcher.php
Update.
For some reason the remi repo doesn't work well with mod_fcgid. I removed it, replaced it with webtatic's PHP 5.6, and all seems to be working now.
I'm able to access the main web interface and run a QueueServer and Fetcher, and I started a crawl of one website.
I then started 2 more fetchers and added two more websites (totaling 3 websites and 3 fetchers) and watched my web hosting server overload, its load average climbing from around 2 all the way to 180 before I stopped the crawler, after which it came back down.
The 3 websites I crawled with the 3 fetchers are all on the one web hosting server. Each of the 3 websites runs an instance of WordPress, so each has a MySQL backend (using memcached, opcache, etc.).
It seems the web server hosting the websites had its RAM used up, which affected disk I/O. So even though CPU usage was low, the RAM exhaustion caused the server to get hit hard by the crawls.
I then ran it again to watch what was happening, and saw the CPUs go to 100% and the load average climb again, so I had to stop the crawlers.
I'm not sure how the crawlers work, but is there a way to limit how many GET requests they make at any given time on a server?
Or do I just use one fetcher?
(Edited: 2017-03-16)

-- Trouble-shoot QueueServer.php and Fetcher.php
Hi Michael,
In your src/configs folder, create a file LocalConfig.php with tweaks to Yioop's defined constants. Some values you could try playing with are:
    /**
     * Delay in microseconds between processing pages to try to avoid
     * CPU overheating. On some systems, you can set this to 0.
     */
    nsconddefine('FETCHER_PROCESS_DELAY', 10000);
    /** number of multi curl page requests in one go */
    nsconddefine('NUM_MULTI_CURL_PAGES', 100);
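To lighten the load on the sites you crawl, you would raise the delay and lower the number of simultaneous requests; for instance (these values are illustrative, not tuned for your setup):
    // half a second between processed pages instead of the default 10 milliseconds
    nsconddefine('FETCHER_PROCESS_DELAY', 500000);
    // at most 10 simultaneous curl requests instead of 100
    nsconddefine('NUM_MULTI_CURL_PAGES', 10);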
Hope this helps, Chris

-- Trouble-shoot QueueServer.php and Fetcher.php
Great thanks. I've done that and will test another crawl soon.
Hmm... I went to the main webpage and saw this at the top of the browser:
http://prntscr.com/ekwo0f
which was from the LocalConfig.php file containing:
    $ cat LocalConfig.php
    nsconddefine('FETCHER_PROCESS_DELAY', 10000);
    nsconddefine('NUM_MULTI_CURL_PAGES', 100);
Realising it was a plain text file rather than a PHP script, I then changed it to this:
    <?php
    /**
     * Delay in microseconds between processing pages to try to avoid
     * CPU overheating. On some systems, you can set this to 0.
     */
    nsconddefine('FETCHER_PROCESS_DELAY', 10000);
    /** number of multi curl page requests in one go */
    nsconddefine('NUM_MULTI_CURL_PAGES', 100);
    ?>
refreshed the webpage and it's blank! At least this time there's an error in the Apache log showing:
[Fri Mar 17 11:10:52.861826 2017] [fcgid:warn] [pid 27596] [client x.x.x.x:61823] mod_fcgid: stderr: PHP Fatal error: Call to undefined function nsconddefine() in /somepath/src/configs/LocalConfig.php on line 6
Any ideas on this one? The installation of Yioop is from a git clone following the previous install instructions, so technically I'm not sure what version I am actually running.
Please advise.
Lastly, if this config file ends up working, and the web server receiving the crawl still overloads, do I just increase the values some more?
Thanks.
(Edited: 2017-03-16)
2017-03-17

-- Trouble-shoot QueueServer.php and Fetcher.php
Hi Chris. As an FYI, I was reading about this on Yandex and then here:
https://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive
The Crawl-delay directive is used by some crawlers. I'm not sure if you have this implemented in Yioop, but if not you may wish to consider it.
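For reference, the directive goes in the crawled site's robots.txt and applies per user agent; assuming the robot is named YioopBot (i.e. whatever the Crawl Robot Name is set to), it would look like:
    # YioopBot is a placeholder for your Crawl Robot Name
    User-agent: YioopBot
    Crawl-delay: 10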
(Edited: 2017-03-18)
2017-03-18

-- Trouble-shoot QueueServer.php and Fetcher.php
Make sure to have a namespace at the top of the file:
    /**
     * Local configuration overrides
     * @package configs
     */
    namespace seekquarry\yioop\configs;
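Putting it together, the whole LocalConfig.php would look something like this (the constant values are just the examples from earlier; tune them for your setup):
    <?php
    /**
     * Local configuration overrides
     * @package configs
     */
    namespace seekquarry\yioop\configs;

    /** delay in microseconds between processing pages */
    nsconddefine('FETCHER_PROCESS_DELAY', 10000);
    /** number of multi curl page requests in one go */
    nsconddefine('NUM_MULTI_CURL_PAGES', 100);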
Yes, Yioop implements Crawl-delay.
Best, Chris
(Edited: 2017-03-18)
2017-03-19

-- Trouble-shoot QueueServer.php and Fetcher.php
Hi Chris. OK, I added that section to the top of the file, and it seems to be loading fine now (I also added the PHP header/footer bits).

I will test this again shortly.

I will also test the crawl-delay feature in the robots.txt file; I'm currently defining it for only my user agent.

Just to confirm: the "user agent" entry I put into robots.txt is the equivalent of the "Crawl Robot Name" in Yioop's settings?

Again, handy stuff to know. Thanks.
(Edited: 2017-03-19)
2017-03-23

-- Trouble-shoot QueueServer.php and Fetcher.php
Yes, the value in the Crawl Robot Name field is the user agent the robot will identify itself as. You can double-check this by looking at your access logs, if you crawl a site you control.
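For example, something like the following, with your own robot name and whatever log path your Apache uses:
    # YioopBot and the log path are placeholders for your setup
    grep "YioopBot" /var/log/httpd/access_log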
Best, Chris