PHP Search Engine

=A Brief Description of Yioop Bot=

==How to Identify Yioop Bot==

Presumably, you arrived at this site because you noticed traffic from a User-Agent that identified itself with the string:

<pre> Mozilla/5.0 (compatible; YioopBot; +https://www.yioop.com/bot.php) </pre>

If the IP address was also in the range 173.13.143.73 to 173.13.143.78, then you have come to the right place to find out who was probably crawling your site. If it was a different IP address, then someone else is hijacking my crawler's name.
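As a rough illustration of the check described above, a site operator could test a logged request against both the User-Agent string and the address range. This is only a sketch: the function name `looks_like_yioop_bot` is hypothetical, and it assumes the range is inclusive at both ends.

```python
import ipaddress

# Inclusive range quoted in the text above (assumed inclusive at both ends).
YIOOP_FIRST = ipaddress.IPv4Address("173.13.143.73")
YIOOP_LAST = ipaddress.IPv4Address("173.13.143.78")


def looks_like_yioop_bot(user_agent: str, remote_ip: str) -> bool:
    """Return True only if the User-Agent names YioopBot and the IP is in range."""
    if "YioopBot" not in user_agent:
        return False
    try:
        ip = ipaddress.IPv4Address(remote_ip)
    except ValueError:
        # Not a valid IPv4 address, so it cannot be in the range.
        return False
    return YIOOP_FIRST <= ip <= YIOOP_LAST
```

A request that carries the YioopBot User-Agent but fails this check is likely from someone else borrowing the name.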

==Who runs Yioop Bot==

I am Chris Pollett. A couple of years ago I began experimenting on my home machines to create a 25 million page index. I chose 25 million as my target because this was the number of pages crawled in the original 1998 paper on Google. I finally achieved this goal in early October, 2011. My next goal was to crawl 100 million pages on my home machines, the number crawled by Nutch in its 2003 demonstration. This was achieved by Yioop between February 5 and March 14, 2012. Here is a list of the longer crawls I have conducted or am conducting:

'''Jul. 6, 2022 - '''. Sequence of test crawls.
'''May 24, 2019 - '''. Sequence of test crawls.
'''Oct. 10, 2014 - Oct. 15, 2015'''. A billion pages.
'''Jul. 31, 2013 - Nov. 11, 2013'''. 334 million pages.
'''Dec. 17, 2012 - Mar. 14, 2013'''. 276 million pages.
'''May, 2012 - July, 2012'''. 250 million pages.

My crawls are used in an actual search engine available at [[https://www.yioop.com/|https://www.yioop.com/]]. This site gets queries from around the world. The GPLv3 source code of this search engine and the crawler are available for download at [[https://www.seekquarry.com/|https://www.seekquarry.com/]]. If you are really bored, you can actually test this software on your site to confirm or refute what is described below. If you find bugs, it would be nice to drop me a line at the address at the end of this article.

==How Often Yioop Bot Crawls a Site==

Yioop Bot is currently run sporadically (not continuously) on a small number of machines. Each machine has about 4-6 fetcher processes. Each fetcher has open at most 100-300 connections at any given time. In a typical situation, these connections would not all be to the same host.

==How you can Change how Yioop Bot Crawls your Site==

Yioop Bot understands robots.txt files (the name has to be robots.txt, not robot.txt) and will obey commands in them, except for commands that prevent the crawling of the host page (aka landing page) of a site. That is, if you have a command that blocks a url like http://foo.com/somepath, Yioop will honor it; but Yioop might still download the page http://foo.com/ . A robots.txt must be placed in the root folder of your website for its instructions to be followed. Yioop does not look in subfolders for robots.txt files. A simple robots.txt file to block Yioop Bot from crawling anything other than the host url page, the cool_stuff folder, and its subfolders might look like:

<pre>
User-agent: '''''YioopBot'''''
Disallow: /
Allow: /cool_stuff/
</pre>

YioopBot also obeys HTML ROBOTS meta tags whose content is among none, noindex, nofollow, noarchive, nosnippet. An example HTML page using the noindex and nofollow directives might look as follows:

<pre>
<!DOCTYPE html >
<html>
<head><title>Meta Robots Example</title>
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />
</head>
<body>
<p>Stuff robots shouldn't put in their index.
<a href="/somewhere">A link that nofollow will prevent from being followed</a></p>
</body>
</html>
</pre>

YioopBot does not use Open Directory or Yahoo! Directory data, so noodp and noydir are implicitly supported. YioopBot matches these meta tag values case-insensitively.
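To make the meta tag handling concrete, here is a minimal sketch of how a crawler could pull the recognized directives out of a page, ignoring case. It uses Python's standard `html.parser` and is not Yioop's actual code; the names `RobotsMetaParser` and `robots_directives` are hypothetical.

```python
from html.parser import HTMLParser

# Directive names listed in the text above.
KNOWN_DIRECTIVES = {"none", "noindex", "nofollow", "noarchive", "nosnippet"}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # Split the comma-separated content and compare in lower case.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token in KNOWN_DIRECTIVES:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

Fed the example page above, `robots_directives` would report both noindex and nofollow even though the page writes them in upper case.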

Within the head of the document one can also specify a canonical page corresponding to the current page using the [[https://en.wikipedia.org/wiki/Canonical_link_element|rel canonical syntax]]. For example, a link element with rel="canonical" and an href giving the canonical URL, such as http://my.canonical.page.com/, might appear on a page with url http://my.canonical.page.com/?t=gratuitous_token to indicate that this page and the canonical page are the same.

Within HTML documents Yioop Bot honors anchor rel="nofollow" directives. For example, the following link would not be followed by Yioop Bot:

<pre> <a href="/somewhere_else" rel="nofollow" >This link would not be followed by YioopBot</a> </pre>

Yioop Bot further understands the Crawl-delay extension to the robots.txt standard and also Sitemap directives. For example,

<pre>
User-agent: '''''YioopBot'''''
Crawl-Delay: 10 # YioopBot will wait 10 seconds between requests
Sitemap: http://www.mycoolsite.com/mycoolsitemap.xml.gz #YioopBot will eventually download
</pre>
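If you want to sanity-check such a file before deploying it, Python's standard `urllib.robotparser` can read both directives. Note that this is Python's parser, not Yioop's, so it only shows how the file is read by one common implementation; `site_maps()` needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The same directives as in the example above, without the comments.
ROBOTS_TXT = """\
User-agent: YioopBot
Crawl-delay: 10
Sitemap: http://www.mycoolsite.com/mycoolsitemap.xml.gz
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Delay in seconds that applies to the YioopBot user agent.
delay = parser.crawl_delay("YioopBot")
# Sitemap URLs are collected independently of any User-agent block.
sitemaps = parser.site_maps()
```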

For non-HTML pages, you can control how Yioop Bot indexes, follows links, and how Yioop! displays results from these pages in the [[https://www.yioop.com/|Yioop! Web site]] by using an X-Robots-Tag HTTP header. For example, if your web server sent as part of its HTTP Response header before the actual page data of say a PDF file, the following

<pre> X-Robots-Tag: nosnippet </pre>

then, if the PDF appeared as part of search results, there would be no snippet text under the link in the search results. If you want to specify a canonical link for a non-HTML document, you can use an HTTP header like:

<pre> Link: <http://my.canonical.page.com/sub_dir/my.pdf>; rel="canonical" </pre>
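On the server side, the two headers above could be assembled along the following lines. This is a sketch under assumptions: `robots_headers` is a hypothetical helper, and joining several directives with commas follows the common convention for X-Robots-Tag rather than anything stated here about Yioop.

```python
def robots_headers(directives, canonical_url=None):
    """Build (name, value) header pairs for a non-HTML response."""
    headers = []
    if directives:
        # Assumption: multiple directives are comma-separated in one header.
        headers.append(("X-Robots-Tag", ", ".join(directives)))
    if canonical_url:
        headers.append(("Link", f'<{canonical_url}>; rel="canonical"'))
    return headers
```

The resulting pairs can be handed to whatever mechanism your web server or framework uses to set response headers.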

==More Specifics on robots.txt and Meta Tag Handling==

When processing a robots.txt file, if Disallow and Allow lines are in conflict, YioopBot gives preference to the Allow directive over the Disallow directive as the default behavior of robots.txt is to allow everything except what is explicitly disallowed.
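The Allow-over-Disallow preference can be sketched as below. This is an illustration of the rule as stated, using plain prefix matching, and not Yioop's actual implementation; `is_allowed` is a hypothetical name, and the special handling of the host page is left out.

```python
def is_allowed(path, allow_paths, disallow_paths):
    """Allow wins whenever both an Allow and a Disallow path match."""
    # Empty patterns are skipped: an empty Disallow line blocks nothing.
    if any(p and path.startswith(p) for p in allow_paths):
        return True
    # Default behavior: everything is allowed unless explicitly disallowed.
    return not any(p and path.startswith(p) for p in disallow_paths)
```

With Allow paths `["/cool_stuff/"]` and Disallow paths `["/"]`, a path under /cool_stuff/ is permitted while every other non-root path is blocked.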

If a webpage has a noindex meta tag, then it won't show up in search results, provided that Yioop! has actually downloaded the page. If Yioop! hasn't downloaded the page, or is forbidden from downloading the page by a robots.txt file, it is possible for a link to the page to show up in search results. This could happen if another page links to the given page, and Yioop has extracted this link and its text and used them in search results. One can check if a URL has been downloaded by typing a query info:URL into Yioop! and seeing the results.

When processing a robots.txt file, YioopBot first looks for YioopBot User-agent blocks and extracts all of the Allow and Disallow paths listed in them. On success, these form the paths that YioopBot will use to restrict its access to your site. If it cannot find any such block, it searches case-insensitively for User-Agent names that may contain the wildcard * and that match YioopBot's name, for example, *oop*, *Bot*, etc. It then parses all of these blocks and uses them to restrict its access to your site. In particular, if you have a block "User-Agent: *" followed by allow and disallow rules, and no blocks for YioopBot, then these paths will be what YioopBot uses and honors.
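The two-stage block selection described above might be sketched as follows. The representation of blocks as `(user_agent, rules)` pairs and the names `wildcard_match` and `select_blocks` are assumptions for illustration, as is treating the first-stage name comparison as case-insensitive.

```python
import re

BOT_NAME = "YioopBot"


def wildcard_match(pattern: str, name: str) -> bool:
    """Case-insensitive match where * stands for any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, name, flags=re.IGNORECASE) is not None


def select_blocks(blocks):
    """blocks: list of (user_agent, rules) pairs in file order."""
    # Stage 1: blocks that name the bot explicitly.
    exact = [rules for agent, rules in blocks
             if agent.lower() == BOT_NAME.lower()]
    if exact:
        return exact
    # Stage 2: fall back to every block whose wildcard name matches.
    return [rules for agent, rules in blocks
            if wildcard_match(agent, BOT_NAME)]
```

Under this sketch, a file with only a `*` block applies that block to the bot, while a file that also has a YioopBot block uses the YioopBot block alone.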

Sitemap directives as per the [[http://www.sitemaps.org/protocol.html#informing|Sitemap specification]] are not associated with any particular User-Agent. So Yioop processes, to the extent that it does, any such directive it finds.

Prior to March, 2012 (v 0.86), YioopBot did not understand * or $ in Allow and Disallow paths. "*" and "$" are Google, Yahoo, and Bing supported extensions to the original robots.txt specification: * matches any sequence of characters and $ marks the end of the URL. As of March, 2012, YioopBot does understand these extensions. So, for example, one can block access to pages on your site containing a query string by having a Disallow path such as:

<pre> Disallow: /*? </pre>
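One common way to implement these two extensions is to translate each path into a regular expression, with * standing for any run of characters and a trailing $ anchoring the end of the URL. The sketch below shows that approach; it is not taken from Yioop's source, and `path_matches` is a hypothetical name.

```python
import re


def path_pattern_to_regex(pattern: str):
    """Translate a robots.txt path with * and trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start, so unanchored patterns act as prefixes.
    return path_pattern_to_regex(pattern).match(path) is not None
```

With this translation, `/*?` matches `/index.php?id=3` but not `/index.php`, which is the query-string blocking described above.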

Yioop! makes use of the [[https://curl.haxx.se/|cURL]] libraries to download web pages. Prior to March, 2012 (v0.86), Yioop! used cURL's automatic following of redirects. This meant that Yioop! sometimes followed URL shortened links or other redirects to a page whose robots.txt would have denied it access. Since March, 2012, Yioop! does not use this feature of cURL and for a redirect response instead extracts a link that has to go through the same queuing and robots.txt checking as all other links.
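The post-March-2012 behavior amounts to treating a redirect as one more discovered link rather than fetching its target immediately. A minimal sketch of that idea follows, with a plain deque standing in for the crawl queue; `handle_response` and the set of status codes are assumptions for illustration, not Yioop's code.

```python
from collections import deque
from urllib.parse import urljoin

# Assumed set of HTTP status codes treated as redirects.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def handle_response(url, status, headers, queue):
    """Queue a redirect target instead of following it; return True if queued."""
    location = headers.get("Location")
    if status in REDIRECT_CODES and location:
        # Resolve relative targets against the requesting URL. The target is
        # only queued, so it still goes through robots.txt checks later.
        queue.append(urljoin(url, location))
        return True
    return False
```

Because the target is only queued, a shortened link that points at a disallowed page is rejected at the same robots.txt check as any other link.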

==How Quickly does Yioop Bot Change its Behavior==

When my machines are crawling for longer than one day, they cache the robots.txt file. They use the cached directives for 24 hours before requesting the robots.txt file again. So if you change your robots.txt file, it might take a little while before the changes are noticed by my crawler.
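The caching behavior can be pictured as a small time-to-live cache keyed by host. The sketch below is illustrative only: `RobotsCache`, the `fetch` callable, and the injectable `clock` are hypothetical, and only the 24-hour lifetime comes from the text above.

```python
import time

# Cache lifetime stated in the text above: 24 hours.
ROBOTS_TTL_SECONDS = 24 * 60 * 60


class RobotsCache:
    """Caches robots.txt text per host and re-fetches after the TTL."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch      # callable: host -> robots.txt text
        self._clock = clock      # injectable so the cache can be tested
        self._entries = {}       # host -> (fetched_at, robots_text)

    def get(self, host):
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[0] < ROBOTS_TTL_SECONDS:
            return cached[1]
        # Missing or expired: fetch again and remember when we did.
        text = self._fetch(host)
        self._entries[host] = (now, text)
        return text
```

In this model a change to robots.txt is picked up on the first request made at least 24 hours after the cached copy was fetched.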

==Contact Info==

If you have any questions about my crawler, please feel free to contact me (chris@pollett.org).