2022-07-26

Yioop 9 on Windows, PHP 7.4 and Apache 2.4 not crawling.

Hey guys,
I just installed Yioop on my Windows machine to crawl a folder that has a ton of HTML files in it, but when I ran a test crawl I got a lot of these:
WARNING: Invalid argument supplied for foreach() at line 1298 in D:\Apache24\htdocs\search\work_directory\app\locale\en_US\resources\Tokenizer.php
  in seekquarry\yioop\locale\en_US\resources\Tokenizer->stemPhrase, line 1274 in D:\Apache24\htdocs\search\work_directory\app\locale\en_US\resources\Tokenizer.php 
  called from seekquarry\yioop\locale\en_US\resources\Tokenizer->extractTripletByType, line 562 in D:\Apache24\htdocs\search\work_directory\app\locale\en_US\resources\Tokenizer.php 
  called from seekquarry\yioop\locale\en_US\resources\Tokenizer->rearrangeTripletsByType, line 999 in D:\Apache24\htdocs\search\work_directory\app\locale\en_US\resources\Tokenizer.php 
  called from seekquarry\yioop\locale\en_US\resources\Tokenizer->extractTripletsPhrases, line 228 in D:\Apache24\htdocs\search\src\library\PhraseParser.php 
  called from seekquarry\yioop\library\PhraseParser->extractPhrasesInLists, line 2038 in D:\Apache24\htdocs\search\src\controllers\components\CrawlComponent.php
And nothing gets indexed.
Any idea what could be happening and how to fix it?
Thanks.
(Edited: 2022-07-26)

-- Yioop 9 on Windows, PHP 7.4 and Apache 2.4 not crawling
Are you doing a web crawl to crawl this folder, or are you doing some kind of archive crawl? Also, the warning above occurs right after an mb_split function call. Do you have the multibyte functions installed? (My guess is you do; otherwise, more serious issues would have occurred.) The above is a warning, not a fatal crash, so it probably would not have prevented indexing. I just added an empty check before that line number. I will probably push some notice/warning fixes as version 9.0.1 soon.
If you could tell me how you did the crawl (the crawl options you chose, how you started and stopped the crawl, and how you then set the index to use for search results), I can try to help you further.
Best,
Chris
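
For readers wondering what an "empty check" before a loop looks like, here is a minimal sketch in Python (Yioop itself is PHP, and the actual fix is not shown in this thread). PHP's mb_split is documented to return an array, or false on failure, and looping over a non-array is what produces the "Invalid argument supplied for foreach()" warning. The functions split_terms and stem_phrase below are hypothetical stand-ins, not Yioop's code, and the lowercasing is only a placeholder for real stemming.

```python
def split_terms(phrase):
    """Hypothetical splitter: returns a list of terms, or False on bad input.

    This imitates an "array or false" return convention for illustration.
    """
    if not isinstance(phrase, str):
        return False
    return phrase.split()


def stem_phrase(phrase):
    """Process each term, guarding against a failed or empty split."""
    terms = split_terms(phrase)
    # The guard: False and an empty list are both falsy, so we never
    # try to loop over something that is not a non-empty list.
    if not terms:
        return []
    # Placeholder "stemming" (assumption): just lowercase each term.
    return [term.lower() for term in terms]
```

Without the guard, iterating over the False return value would raise a TypeError in Python, which is roughly the counterpart of the PHP warning quoted in the first post.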

-- Yioop 9 on Windows, PHP 7.4 and Apache 2.4 not crawling
It is a web crawl of files on the same server, basically a crawl of http://localhost.
I added it; in page processing I selected htm and html files, and in Seed sites I put the URL where the files are.
Did I miss something?
(Edited: 2022-07-26)

-- Yioop 9 on Windows, PHP 7.4 and Apache 2.4 not crawling
Here is the line from the Apache log:
stderr: PHP Notice: file_get_contents(): read of 8192 bytes failed with errno=13 Permission denied in D:\\Apache24\\htdocs\\search\\src\\controllers\\FetchController.php on line 97
But the folder where the files are is set so that everyone can read them.

-- Yioop 9 on Windows, PHP 7.4 and Apache 2.4 not crawling
That file is used to keep track of the last time a fetcher has spoken with the name server. It probably should be readable and writable. Can you maybe include a low-res screenshot of your Manage Machines activity page, your Manage Crawls activity page, and also your Crawl Options?
Best,
Chris
(Edited: 2022-07-26)
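
One way to confirm whether a particular file can actually be opened for both reading and writing is sketched below in Python. This is only an illustration under assumptions: the thread does not name the file that FetchController.php reads at line 97, so the path to test is left to the reader, and probe_read_write is a hypothetical helper name. The check is only meaningful when run as the same user account that Apache and PHP run under.

```python
import errno


def probe_read_write(path):
    """Try to open `path` for reading and writing without changing it.

    Returns "ok" on success, otherwise the symbolic errno name
    (for example "EACCES" for errno 13, permission denied).
    """
    try:
        # "r+b" opens an existing file for read and write; it does not
        # create or truncate the file.
        with open(path, "r+b"):
            return "ok"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
```

Calling probe_read_write on the suspect file and getting "EACCES" back would match the errno=13 reported in the Apache log, while "ENOENT" would mean the path does not exist at all.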

-- Yioop 9 on Windows, PHP 7.4 and Apache 2.4 not crawling
I stopped the crawl, but these are my options:
I made the folder being crawled read-write for everyone, but still nothing.
[Screenshots: Crawl options 2.png, Crawl options.png]

-- Yioop 9 on Windows, PHP 7.4 and Apache 2.4 not crawling
Forgot these:
[Screenshots: machines.png, crawls_status.png]
(Edited: 2022-07-27)

-- Yioop 9 on Windows, PHP 7.4 and Apache 2.4 not crawling
The test page is not giving me any warnings anymore, but it's giving me messages about redirecting to https and others about http2. So let me ask: is Yioop compatible with HTTP/2? If not, you might want to update it to be compatible, especially now that HTTP/3 is coming out soon.

-- Yioop 9 on Windows, PHP 7.4 and Apache 2.4 not crawling
Yioop is dependent on curl, so as long as you have a modern copy of the curl library installed, it should be HTTP/2 capable. I haven't added HTTP/3 support yet. You are starting your crawl from an http:// seed site, so if that page redirects to https:// you might get a redirect message. Again, that wouldn't explain why you are getting no search results. On the other hand, when I look at your page options, it looks like you have created a classifier that you are using to classify which web pages should be indexed. Did you try turning that off? Also, if you only check htm and html, it will only index web pages with those extensions. So the page https://foo.com/ will not be indexed, because it does not have either extension, but https://foo.com/index.html would be indexed. The screening is done before any attempt at downloading the page, so at that point the MIME type is not known. You can check unknown if you think it is okay to crawl https://foo.com/ . Also, I would tend to check at a minimum .txt as well.
Chris
(Edited: 2022-07-27)
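
The extension screening described in the post above can be illustrated with a small Python sketch. It is not Yioop's implementation (Yioop is written in PHP); the function names and the use of the string "unknown" for URLs without an extension are assumptions made here to mirror the checkbox mentioned in the post. The point it shows is that the decision is made from the URL alone, before anything is downloaded.

```python
import posixpath
from urllib.parse import urlparse


def url_extension(url):
    """Return the lowercase extension of the URL path, or "unknown" if none."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1]  # ".html" for "/index.html", "" for "/"
    return ext[1:].lower() if ext else "unknown"


def passes_screen(url, allowed):
    """Decide from the URL alone whether a page would be fetched."""
    return url_extension(url) in allowed


allowed = {"htm", "html"}
for url in ("https://foo.com/", "https://foo.com/index.html"):
    print(url, passes_screen(url, allowed))
```

With only htm and html allowed, https://foo.com/ is screened out and https://foo.com/index.html passes. Adding "unknown" to the allowed set lets the extensionless URL through as well, which corresponds to checking unknown in the crawl options.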
2022-07-27

-- Yioop 9 on Windows, PHP 7.4 and Apache 2.4 not crawling
I had classifiers off before and it was doing the same thing. I only added one to see if it improved things, but nothing changed, so I turned it off again. The seed site is already entered as https, so no redirect should happen; I don't know why it is redirecting.
I also chose htm and html pages because the folder is full of them; there is no other file type in it.