Yioop - Old PHP Search Engine Blog

Describes what crawls Yioop performed pre-2014.

Nov. 15, 2013

I stopped the July 31 crawl on Nov. 11. Over the next few weeks I will try to improve the query performance of results returned from this crawl and do software updates on the machines involved. Then I will release Version 0.98 of the Yioop software. The crawl got to 333649911 pages -- so more than 1/3 billion pages. It is my largest crawl to date. The crawl lasted 104 days. During the crawling period I periodically stopped the crawl to improve the crawler's stability, so 14 days were lost to these upgrades. The last upgrade was Oct. 11, and Yioop crawled nonstop without incident for the last month of the crawl. While it was crawling, it averaged 3.7 million pages/day or 154 thousand pages/hour. If one includes downtime, the speed was 3.2 million pages/day. Some of the slower overall rate was because the start/stop mechanism Yioop had in case of a crash was initially much slower than it was later in the crawl, after some brain improvements. At the end of the crawl Yioop still seemed to be reporting a crawl speed of 180 thousand pages/hour.
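The averages quoted above follow directly from the page count, the crawl length, and the days lost to upgrades. A minimal Python sketch of that arithmetic is below; the function and variable names are illustrative, and only the three input figures come from the post.

```python
# Figures quoted in the Nov. 15, 2013 entry.
TOTAL_PAGES = 333_649_911
CRAWL_DAYS = 104     # total length of the crawl, as stated
DOWNTIME_DAYS = 14   # days lost to stability upgrades, as stated


def crawl_rates(total_pages, crawl_days, downtime_days):
    """Return average crawl rates with and without downtime."""
    active_days = crawl_days - downtime_days
    per_day_active = total_pages / active_days
    return {
        "active_days": active_days,
        "pages_per_day_active": per_day_active,
        "pages_per_hour_active": per_day_active / 24,
        "pages_per_day_overall": total_pages / crawl_days,
    }


rates = crawl_rates(TOTAL_PAGES, CRAWL_DAYS, DOWNTIME_DAYS)
for name, value in rates.items():
    print(f"{name}: {value:,.0f}")
```

With 90 active days this gives roughly 3.7 million pages/day and 154 thousand pages/hour while crawling, and roughly 3.2 million pages/day over the full 104 days, matching the figures in the entry.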

Aug. 15, 2013

I started a new large scale crawl on July 31. I have been starting and stopping it as I continue to test out the index format introduced with Version 0.96. It has reached 52ish million pages now. I figure this is large enough to switch over to it as the default index from the slightly broken old 226 million page index. This crawl is not using any classifiers. Some experiments were conducted on a couple million page crawls with classifiers. I still want to experiment some more to make sure the results are improved by a classifier before using it in a large scale crawl. On the other hand, I wanted to get a replacement crawl up on the demo site to replace the one damaged by the hard drive failure, so I figured I'd go for a vanilla crawl. It's currently crawling at around 200-250 thousand urls/hour; however, since I started I have probably stopped the crawl for 3 days.

Jul. 24, 2013

Version 0.96 is out today. This version includes a new hybrid inverted index/suffix tree indexing scheme which should make calculating search results from the next crawl I do faster (it doesn't affect old crawls). Yioop can now make use of ETag and Expires information when deciding whether to download a URL it has seen before. Yioop now also supports the creation of classifiers using active learning. These can be used to label and add scoring information to documents during a crawl. Version 0.96 also includes improvements to the RSS feed news_updater and a segmenter for Chinese.
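To illustrate how ETag and Expires information can inform a re-download decision, here is a minimal Python sketch. It is not Yioop's PHP code: the function name, the three outcome labels, and the shape of the cached header dictionary are assumptions made for the example. The only HTTP facts relied on are that Expires carries an HTTP date and that a stored ETag can be sent back in an If-None-Match request header.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def revalidation_plan(cached_headers, now):
    """Decide how to treat a URL seen before (hypothetical helper).

    Returns (action, request_headers) where action is one of:
      "skip"        - the stored copy has not expired yet
      "conditional" - re-request, letting the server answer 304 if unchanged
      "fetch"       - no usable cache information, download normally
    """
    expires = cached_headers.get("Expires")
    if expires:
        try:
            if parsedate_to_datetime(expires) > now:
                return ("skip", {})
        except (TypeError, ValueError):
            pass  # unparsable or naive date: fall through to the ETag check
    etag = cached_headers.get("ETag")
    if etag:
        return ("conditional", {"If-None-Match": etag})
    return ("fetch", {})


now = datetime(2013, 7, 24, tzinfo=timezone.utc)
print(revalidation_plan({"Expires": "Thu, 25 Jul 2013 00:00:00 GMT"}, now))
print(revalidation_plan({"ETag": '"abc"'}, now))
```

Checking Expires first avoids any network traffic for pages that are still fresh; the ETag path still costs a request but lets the server skip sending the body.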
About a month ago, one of the six machines in the guest bedroom running the Yioop demo site had a hard drive failure. Each of these machines has a 4TB drive, but as I am mainly self-financing this project and drives are not free, I don't have back-ups of the web crawls I do. The upshot is that I lost a portion of the 276 million page crawl I had done. Around 228 million are left, and although results are served from them, they are a little funky. Since the failure, I have updated the Wikipedia, Open Directory, and UTZoo archive crawls that were also being hosted on yioop.com. There was also a May 2012 crawl of 250 million pages that was damaged, which I will likely just delete. I am currently preparing to do a new large scale crawl. I want to train some classifiers to use in this new crawl first, but I expect to begin in a couple of days.

Apr. 4, 2013

Version 0.94 is out today. The documentation and revised seekquarry.com site should be up later today. This version adds a simple language called Page Rules for controlling how data is extracted from web pages during the summary creation phase of indexing. It also adds the ability to index records coming from a database query, and it adds a generic text importer which works on plain text, gzip'd, and bzip'd text records, for example, access log records, emails, etc. As a test of these new facilities, I created my first test index of the UTZoo Usenet Archive from 1981-1991. This first pass is rather crude, but I hope to improve it over the coming months. This index is available for selection using settings.
The current default index in Yioop is 276 million pages, obtained between Dec 17 and Mar 14 (It was resumed for half a day sometime later in March for a test).
Other features in this new version of Yioop are Atom support as a News Feed Search Source, a dedicated process news_updater.php for handling news updates, and a better algorithm for distributing archive data during an archive crawl. Many other minor improvements were made.

Jan. 4, 2013

Happy New Year! I am finally releasing Version 0.92 of Yioop today. This version supports doing archive crawls of crawl mixes. This should make query performance of crawl mixes much better. Cache pages of search results now have a new history UI that allows you to search cache pages in all indexes you have, much like the way the Internet Archive does. Yioop now supports single character edit spell corrections on searches after they have been done, and it has an API for transliterating between Roman and other scripts. Query performance has been improved over previous versions and lots of minor bugs have been fixed.
Since the end of November, I have been discovering bugs through the experiments with findcan.ca. I then did a complete archive crawl of the English Wikipedia. This took about three days. One can see the results by going to Yioop settings and selecting the Wikipedia crawl. At some point I should improve Yioop's MediaWiki processor because it doesn't really understand most wiki mark-up and this shows on cached versions of pages. In any case, after doing the Wikipedia archive crawl, I started a new general crawl on Yioop on Dec 17. It is around 76 million pages. I will probably let it go for the next few months (crossing fingers). The older 250 million page crawl from last May-July is still available on the Yioop site as well under settings.

Nov. 23, 2012

It was around Nov. 4 that I actually began a larger scale crawl, and I let it crawl to about 70 million pages. My brother wanted to mess around with crawling on findcan, so I suspended my crawl as they use the same machines and I didn't want to completely bring those machines to their knees.
In the meantime I will look at the results so far. There are two parameters I typically try to tune: the starting sites and the page extraction code. The latter code determines which words on a page to extract for indexing -- this can have a big influence on the final results. Yioop by default only downloads the first 50,000 bytes of every page. To keep the index smaller it doesn't index every word in these 50,000 bytes. Instead, it grabs title text and h1 tags up to a threshold number of bytes and splits those out into a field known as title text. It then takes meta description tag text, div tag text, paragraph tag text, and so on, out to a larger byte threshold. It puts these into another field known as description. In addition, link text is treated separately. When a query comes in, a weighted score of these components is used to do the final ranking. "Grabbing title text", "to a threshold", etc. is vague; how this is made precise greatly affects the quality of the final result and is what I am working on now.
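The field split and weighted ranking described above can be sketched roughly as follows. This is an illustrative Python sketch, not Yioop's extraction code: the byte thresholds for the title and description fields, the field weights, and the term-count scoring are all made-up assumptions; only the 50,000 byte page limit and the three kinds of field (title, description, link text) come from the post.

```python
PAGE_BYTE_LIMIT = 50_000        # from the post: only this much of a page is downloaded
TITLE_BYTE_LIMIT = 200          # hypothetical threshold
DESCRIPTION_BYTE_LIMIT = 2_000  # hypothetical threshold
FIELD_WEIGHTS = {"title": 4.0, "description": 1.0, "links": 2.0}  # hypothetical


def truncate_bytes(text, limit):
    """Cut text to at most `limit` UTF-8 bytes without splitting a character."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def build_summary(title_parts, description_parts, link_texts):
    """Collect already-extracted tag text into the three indexed fields."""
    return {
        "title": truncate_bytes(" ".join(title_parts), TITLE_BYTE_LIMIT),
        "description": truncate_bytes(" ".join(description_parts),
                                      DESCRIPTION_BYTE_LIMIT),
        "links": " ".join(link_texts),
    }


def score(summary, query):
    """Weighted count of query terms per field (a stand-in for real ranking)."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = summary[field].lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total


summary = build_summary(["Yioop Search"],
                        ["An open source PHP search engine"],
                        ["yioop demo"])
print(score(summary, "yioop"), score(summary, "search"))
```

Changing any of the thresholds or weights changes which pages win for a query, which is the tuning problem the entry describes.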

Nov. 2, 2012

It turns out that in my testing -- at least on my laptop -- adding a per-term index didn't seem to speed things up very much over implementing next() using just galloping search, so I went back to the latter. I was able to speed up query performance a bit by optimizing my code that extracts postings from the byte strings they occur in. It could still be improved somewhat. I think I might break down and write a PHP extension at some point to make these lower level things faster -- I am still trying, though, to do everything in pure PHP for now. I am going to begin a new large-scale crawl later today.
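For readers unfamiliar with the technique, a next() operation built on galloping (exponential) search over a sorted posting list can be sketched as below. This is a generic Python illustration, not Yioop's implementation: Yioop works on packed byte strings in PHP, whereas this sketch assumes a plain sorted list of document ids, and the function name and `start` parameter are made up for the example.

```python
from bisect import bisect_right


def next_posting(postings, current, start=0):
    """Return (index, doc_id) of the first posting > current, or None.

    `postings` is a sorted list of doc ids; `start` is where the previous
    call left off, so repeated calls only ever move forward.
    """
    n = len(postings)
    if start >= n or postings[-1] <= current:
        return None
    if postings[start] > current:
        return start, postings[start]
    # Gallop: double the step until we pass `current` or run off the end.
    low = start   # invariant: postings[low] <= current
    step = 1
    while low + step < n and postings[low + step] <= current:
        low += step
        step *= 2
    high = min(low + step, n - 1)  # postings[high] > current
    # Binary search inside the bracket found by galloping.
    idx = bisect_right(postings, current, low, high + 1)
    return idx, postings[idx]


postings = [2, 5, 9, 14, 20, 31, 47]
print(next_posting(postings, 9))            # (3, 14)
print(next_posting(postings, 30, start=2))  # (5, 31)
```

The cost of a call grows with the logarithm of the distance skipped rather than the length of the whole list, which is why it can compete with a separate per-term index when the lists being intersected advance in small jumps.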

Oct. 22, 2012

Two of the SSDs in Yioop were Crucial m4 512GB SSDs and there seems to be a bug in their firmware so that they cease to work after 5200 hours. I am going to try to update the firmware later in the week. Till then the Yioop search results will seem a bit weird as they are partitioned by document and only coming from one of three machines. -- This was fixed as of Oct. 23.

Oct. 19, 2012

It's been a while since my last Yioop blog post, so I figure I should post. I currently have the hardware to do a crawl at least twice as big as the last May crawl. I am doing some testing with the findcan.ca site to see if I can fine tune its download speed and also to see what happens when I start trying to do a more extensive crawl with Yioop. Yioop currently is running a slightly more recent version of the software than Version 0.90, so it supports News results and these are updated hourly. Currently, I am working on query performance. One dumb thing I had been doing was to validate the index time stamp provided on all queries to Yioop. The point of this was that, for some reason, the Googlebot and some other crawlers kept trying to hit links on my site to old indexes that no longer existed -- so I validated the timestamp and gave an error in this case. For an external request of this type, this makes sense; however, the code was dumbly being applied when Yioop was talking between its own machines. Removing this check on internal requests approximately halved the query response time on single term queries. Yioop's performance on multi-term queries for terms that it doesn't treat as n-grams is still somewhat weak. For Version 0.92, I am going to tweak the index format so that the posting lists make use of a per-term index. My hope is that this will speed things up a lot in the multi-term case. After I've coded this, and done a little more tuning, I will try to do my next large scale crawl -- at least 1/2 billion pages; I might be able to get to 0.8-1 billion with the hard drive capacities that I have.

July 28, 2012

The crawl has reached 1/4 billion now. From what I've seen so far, with my current hardware I could probably crawl out to 1/2 a billion. I am stopping the crawl now though and trying to tweak my code to improve it and also to improve my initial seed site info. For about 7 days during the last month and a half I had to stop the crawl for various mishaps. The machine with Yioop on it had issues towards the end of June and was replaced, which meant I had to set up the new machine -- I took the original back to the Apple store and they were very awesome about replacing it for me. Kudos Apple. The current machine uses a separate drive for the OS than for Yioop and this seems to help a lot. The second issue was that Yioop was screwing up in the parsing of Bing's robots.txt file -- nobody from Bing wrote me -- but I am trying to watch out for these. I stopped for a couple of days while I was figuring out what was going on. It turned out the robots.txt parser was okay (which it should be as I had a bunch of unit tests for it); the problem was that when I fed in a URL to do a check, I was stripping some of the query string. With this fixed, I resumed the crawl. I will probably spend the next couple of months looking at the results of this crawl, trying to improve Yioop before doing a new large crawl. I might also try to get some more hardware to see if I could do a billion page crawl. It has been more than a decade since Google grew past a billion pages (midway through 2000?). With Moore's Law and gigabit connections to the home on the horizon, it should soon be quite feasible for the average household to do such a crawl.

June 12, 2012

I had a couple more false starts before starting the crawl for real on May 11. Thereafter the crawl proceeded relatively smoothly up to the first 100 million so far. I was typically running around 16 fetchers and typical crawl speeds were 150-220 thousand/hour. The fastest I have seen is 300 thousand, the slowest around 80 thousand -- this happened not infrequently when a bunch of fetchers crashed and I forgot to re-start them. One major glitch in the crawl so far was how I had organized the SSD drives on yioop.com. On the two other queue servers the drive for the Yioop installation was not the OS drive, so the OS drive had plenty of space. On yioop.com they were one and the same. With the February crawl taking up most of the space on this drive, yioop.com was running out of swap space and crashed a couple of times. This meant the dictionary portion of the index was messed up and needed to be reindexed at some point. To fix the swap space issue I got rid of some useless stuff on the drive and continued crawling. I then reindexed at around 98 million, but the code for reindex didn't set the permissions for the web server to write to the dictionary when it was done, so I lost a couple of days before I realized this and fixed stuff. The 0.86 version of the code needed the summaries in addition to the dictionary and posting lists to be on SSD to achieve decent performance. This hogged a lot of expensive resources. Version 0.88, which is getting close to being released, only needs the dictionary and posting lists to be on SSD to achieve the same performance. After the 98 million page stop I reorganized folders, moving summaries off the SSD for the February crawl and moving the May crawl's dictionary and posting lists onto SSD. I should have plenty of SSD space now until 200 million, when I will push February completely off SSD (leaving it on disk as a just-in-case index). With that I should be able to go to 300 million with my existing set-up without compromising performance.

May 6, 2012

False start there. It turns out that the create_function code that I used to remove anonymous functions to regain 5.2 compatibility for serving search results also introduced a memory leak. This caused fetchers to crash very frequently. This has now been fixed and I am now restarting the crawl. I have created a 0.861 tag for the fix.

May 5, 2012

Starting a new crawl. This crawl adds a first pass at trying to detect "safe" versus "non-safe" web pages. Yioop doesn't at this point implement a filter on such a notion by default. By adding safe:true as a search term in this new crawl one should be able to (hopefully) remove many sites with explicit content. This crawl also adds the meta word media:video. Detection in this case is based on url sniffing. I am leaving the February crawl results as the default crawl results until this crawl gets to 100 million. Then I will switch. While the crawl is ongoing, search results returned for searches on the new crawl will only have the top tier of the dictionary. This can represent as little as a half of all page data collected until the final closing of the crawl. My hope is this crawl will be at least a few hundred million pages. I still don't have the hard drive capacity to get to a billion, but will try to add capacity over the summer.

May 2, 2012

Hammering away at the last few bugs before releasing v0.86. I am doing a test crawl with findcan.ca. This uses two of the same queue servers as Yioop does. I have been debugging some performance issues that resulted from my more careful robots.txt handling. I seem to have gotten the speed back and a little more. I want to work a little on how the results from various queue_servers are combined before pushing the next release, hopefully sometime this week.

April 15, 2012

Replace Yioop Hardware. Yioop was down intermittently for a couple days because the logic board on the Mac Mini acting as the web head died. This is now fixed.

March 30, 2012

Test Crawl. Starting a 1-2 day crawl to test new multi-curl code. Also testing, on a larger scale, new support for * and $ in robots.txt, X-Robots-Tag headers, etc.
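As a rough illustration of what * and $ support in robots.txt rules involves, the sketch below converts a rule path into a regular expression: * matches any run of characters, a trailing $ anchors the rule at the end of the URL, and a rule without $ is a prefix match. This is a Python sketch with made-up function names, not Yioop's parser. It matches against the path together with the query string, since the July 28, 2012 entry describes a bug caused by stripping part of the query string before the check.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt rule path using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * stand for "any characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern, path_and_query):
    """True if the rule applies to the given path (including any query string)."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


print(path_matches("/private/*.php$", "/private/a/b.php"))      # True
print(path_matches("/private/*.php$", "/private/a/b.php?x=1"))  # False
print(path_matches("/*?session=", "/page?session=abc"))         # True
```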

February 5 - March 14 Crawl

This crawl was initially started with Version 0.822 of Yioop. It was done using three Mac Minis -- two 2011 Mac Minis with i7 processors which I had bought, and one 2010 Mac Mini donated by my brother (Thanks, Al!). According to http://speedtest.net, my Comcast small business internet gets about 17-20Mbps down. During the crawl 100162534 pages were downloaded and 2526333234 urls extracted. For most of the crawl, Yioop was downloading between 100,000-150,000 pages/hour or about 3 million pages/day. The size of this crawl is comparable to the first demo of Nutch in 2003 or about 4 times the size of Google as described in the original Google paper of 1998. This was the first extensive crawl using Yioop to make use of multiple queue servers. An earlier trial crawl on http://findcan.ca using two queue servers downloaded 10 million pages. The February 5 - March 14 crawl is also the largest crawl using the Yioop software that I am aware of. The previous largest crawl was conducted in December using a single queue server, in which 30 million pages were downloaded.
Because this crawl was more extensive and faster than previous crawls I have conducted, it served as a good learning experience...
About two days into the crawl I was contacted by someone at projects.latimes.com because Yioop was crawling some of their more obscure, non-cached school pages too quickly. I added this site to Yioop's do not crawl list. At their suggestion I also added to the 0.84 version of Yioop the ability to detect if it is getting a lot of errors or slow pages from a web site, and if so, to act as if it had received a Crawl-Delay instruction from a robots.txt file (even though it hadn't). It also seemed useful to be able to quickly search on which sites were responding with which HTTP error codes, so I added a code: meta-word to Version 0.84 of Yioop to make this easier.
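The "act as if a Crawl-Delay had been received" behavior can be sketched as a per-host counter of bad responses. The Python below is only an illustration of the idea: the thresholds, the delay value, and the class and method names are all hypothetical and are not taken from Yioop.

```python
from collections import defaultdict

ERROR_THRESHOLD = 5   # hypothetical: bad responses tolerated before backing off
SLOW_SECONDS = 10.0   # hypothetical: responses slower than this count as bad
IMPLIED_DELAY = 30    # hypothetical: seconds of delay to impose on the host


class HostHealth:
    """Track errors and slow pages per host and derive an effective crawl delay."""

    def __init__(self):
        self.bad_responses = defaultdict(int)

    def record(self, host, status_code, elapsed_seconds):
        """Count a response as bad if it is an HTTP error or took too long."""
        if status_code >= 400 or elapsed_seconds > SLOW_SECONDS:
            self.bad_responses[host] += 1

    def crawl_delay(self, host, robots_delay=0):
        """Delay to honor: robots.txt's own value, or an implied one if struggling."""
        if self.bad_responses[host] >= ERROR_THRESHOLD:
            return max(robots_delay, IMPLIED_DELAY)
        return robots_delay


health = HostHealth()
for _ in range(5):
    health.record("example.com", 503, 0.2)
print(health.crawl_delay("example.com"), health.crawl_delay("other.example"))
```

A real crawler would also need to let the counter decay over time so that a site that recovers is crawled at normal speed again; that is left out here to keep the sketch short.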
The next thing (Feb 16) that happened during the crawl was that I noticed my IP had been banned by slashdot.org (which I read reasonably frequently). After contacting them to get my IP unblocked, I added them to my do not crawl list. From my e-mail exchanges with the slashdot people, it seems that if a robot downloads more than around 5000 pages/day it can get blocked. So I added to Version 0.84 the ability to set, on a per site basis, a maximum number of pages to download/hour. Also, because it seemed useful to be able to say what I downloaded from a site, and when, I increased the resolution of the date: meta word from the day level of resolution to the second level. In Version 0.84, this meta word combined with the site: meta word can be used to tell, for any given domain and any given second, how many pages were downloaded.
Version 0.822 of Yioop relied on the DNS caching that comes with cURL. In particular, different fetchers wouldn't know the addresses that others had cached. During the crawl, I got the feeling that my home internet was becoming unusable; my browser would take forever contacting the DNS server to look up web-site addresses. So I added to Version 0.84 a queue server-based caching mechanism. On Feb 27 or so, one of my machines decided it was tired out and conked out. As it happened on a teaching day, I stopped the crawl for two days. I then looked at the relevant hard drive to try to understand what happened. After making sure things were okay, I upgraded the software on all of the queue_servers. So the new statistics and DNS caching were added then (this affects some of the results on the statistics page). With DNS caching, I wasn't able to notice the crawl going on while using the internet -- the internet became more usable again.
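Two of the Version 0.84 additions described above, the per-site hourly download cap and the shared DNS cache, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Yioop's code: the class names, the one-hour time bucketing, the TTL, and the injectable resolver and clock (used so the cache can be exercised without network access) are all made up for the example.

```python
import socket
import time
from collections import defaultdict


class HourlyBudget:
    """Allow at most `limits[host]` downloads per host per clock hour."""

    def __init__(self, limits, default_limit=None):
        self.limits = limits
        self.default_limit = default_limit  # None means "no cap"
        self.counts = defaultdict(int)      # (host, hour bucket) -> pages so far

    def allow(self, host, now):
        limit = self.limits.get(host, self.default_limit)
        bucket = (host, int(now // 3600))
        if limit is not None and self.counts[bucket] >= limit:
            return False
        self.counts[bucket] += 1
        return True


class DnsCache:
    """One cache shared by all fetchers, so each host is resolved once per TTL."""

    def __init__(self, resolver=socket.gethostbyname, ttl_seconds=3600,
                 clock=time.time):
        self.resolver = resolver
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}  # host -> (address, expiry time)

    def lookup(self, host):
        now = self.clock()
        cached = self.entries.get(host)
        if cached and cached[1] > now:
            return cached[0]
        address = self.resolver(host)
        self.entries[host] = (address, now + self.ttl_seconds)
        return address


budget = HourlyBudget({"slashdot.org": 2})
print([budget.allow("slashdot.org", t) for t in (0, 10, 20, 3600)])
```

A cap of roughly 200 pages/hour would keep a crawler under the 5000 pages/day figure mentioned for slashdot; the limit of 2 in the usage line is just to make the behavior visible.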
On March 9, I was contacted by modularfords.com saying that my robot was disobeying their robots.txt directives. I stopped the crawl to investigate. Yioop was parsing the robots.txt files okay, but I found a bug in the way it was inserting rules into the robot Bloom filter used to check where Yioop is allowed to crawl. This was fixed. In Version 0.82, Yioop only stored robots.txt files in a non-human-readable Bloom filter; it did not keep caches of these files. It seemed to me it would be useful to know exactly what robots.txt file was given to Yioop at a given time, so I modified Yioop so that it would cache these pages. I set this up like sitemaps, so Yioop extracts only meta words from such pages, not English or other language words which might appear in the index. I added a new meta word path:, so that path:/robots.txt finds all pages with urls ending with /robots.txt in the index. This combined with site: allows one to look up the robots.txt downloaded for a site. After making these changes, I restarted the crawl and crawled the remaining ten million or so pages.
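To show why a Bloom filter of robots rules is compact but not human readable, here is a small Python sketch. It is not Yioop's data structure: the filter size, the number of hash functions, the use of SHA-256, and the scheme of storing "host + disallowed path prefix" keys and then testing every prefix of a URL's path are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; membership tests can give false positives only."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions per item by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def disallow(robot_filter, host, path_prefix):
    """Record a Disallow rule as a host + path-prefix key (assumed scheme)."""
    robot_filter.add(host + path_prefix)


def is_disallowed(robot_filter, host, path):
    """A path is blocked if any of its prefixes was recorded for the host."""
    return any((host + path[:end]) in robot_filter
               for end in range(1, len(path) + 1))


robot_filter = BloomFilter()
disallow(robot_filter, "example.com", "/private/")
print(is_disallowed(robot_filter, "example.com", "/private/page.html"))
```

Only bits are stored, so the original rules cannot be read back out of the filter, which is the motivation given above for also caching the robots.txt files themselves. A bug in how keys are built on insertion, as opposed to in parsing, would silently make rules fail to match, consistent with the kind of problem the entry describes.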