Yioop - PHP Search Engine

2020-10-03

Yioop Update.

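Yioop 7 was released over the summer. The changelogs for versions 7 and 7.1 show what is new:

Changelog 7.1: https://www.seekquarry.com/p/Changelog#Changes%20in%20Version%207.1
Changelog 7: https://www.seekquarry.com/p/Changelog#Changes%20in%20Version%207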

I have also done several smaller-scale crawls, the largest being about 125 million pages, to try to understand how to improve indexing and crawling.

These crawls started from a much larger set of seed sites than I was using in 2014/2015, when Yioop did its only billion-page crawl so far. For that crawl I used about 200 to 300 popular seed sites, whereas for the more recent crawls I used about 10 times that. These sites were from a much broader pool of regions, so the hope was I could see how well the internationalization indexing features of Yioop worked. I also looked at top sites and didn't immediately delete them if they were porn sites. I am not sure whether it is for these reasons or just because website developers are getting more sophisticated in their SEO attempts, but my crawler seems to be discovering way more link farms than in the past.

So in the 2020 crawler climate you have the following bad situation for third-party crawlers: many legit sites (government/company) restrict crawling to Google and Bing (and not always Bing), locking in the Google monopoly, while many non-legit sites have link farms that suck in general web crawlers.

I can't really do too much about the Google situation except to plead with website owners not to block well-behaved bots by default. Or, if they see a bot that gives a contact URL and they have an issue with the bot, to at least try contacting the owner to see if the issue can be fixed.
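To make the lock-in pattern above concrete, here is a small sketch of what such a restriction typically looks like in a robots.txt file and how a third-party crawler would read it. The file contents and the crawler name `ExampleCrawler` are made up for illustration and are not taken from any particular site or from Yioop itself; the parsing uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not copied from a real site):
# Googlebot and Bingbot may fetch everything, every other bot is shut out.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if ROBOTS_TXT lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.gov/reports"
    for bot in ("Googlebot", "Bingbot", "ExampleCrawler"):
        print(bot, allowed(bot, url))
```

A crawler that honors robots.txt never gets to show such a site that it behaves well, which is why the default matters so much.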

I have been working on an improved Bloom-filter-based detection algorithm for link farms, which I will continue to experiment with on smallish crawls. Yioop at its peak had six Mac minis; it is currently down to four. So I will probably continue to do smaller-scale crawls (a couple hundred million pages) over the next few months to hone this technology before trying a larger crawl again.

Another thing I am working on is improving the indexing. Currently, in addition to the actual pages downloaded, Yioop creates small mini-pages for each link discovered. This makes it easier to do certain queries, like finding the words associated with a page but not on the page, or showing potential search results for pages not yet downloaded. However, it makes Yioop indexes about 10 times larger than they otherwise would be. This is basically because these mini-pages get listed in each posting list, and there are at least 10 times more links discovered than pages downloaded. I am reworking the code to achieve most of these abilities by adding additional meta words to the Yioop index instead of creating these mini-documents. I am also rebalancing what is done in the fetcher versus what is done in the indexer, which should hopefully make certain merging operations in the indexer work better. Lots of Yioop coding ahead before version 8 in mid 2021!
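As background on the Bloom filter mentioned above in connection with link farms: this post does not describe the actual detection algorithm, so the following is only a generic sketch of the data structure plus one hypothetical heuristic built on it. The heuristic (flagging pages whose set of outgoing link hosts has probably been seen before) and all names such as `link_pattern` and `is_repeated_pattern` are assumptions for illustration, not Yioop code.

```python
import hashlib
from urllib.parse import urlsplit


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from one SHA-256 digest
        # using double hashing: h1 + i * h2 (mod num_bits).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def link_pattern(urls) -> str:
    """Hypothetical fingerprint: the sorted set of hosts a page links to."""
    hosts = sorted({urlsplit(u).hostname or "" for u in urls})
    return "|".join(hosts)


def is_repeated_pattern(seen: BloomFilter, urls) -> bool:
    """Report whether this outlink pattern was probably seen, then record it."""
    pattern = link_pattern(urls)
    repeated = pattern in seen
    seen.add(pattern)
    return repeated
```

The appeal for a crawler on a handful of machines is that the filter uses a fixed amount of memory no matter how many patterns it has recorded; the price is occasional false positives, so a hit is best treated as a hint rather than proof that a page belongs to a link farm.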

(Edited: 2020-10-03)
