2015-07-21

Upgrading Yioop To a Better Archiving System?

I really do love Yioop, but because it stores the crawl and settings data in the Work Directory, instead of using something like MySQL or the even better Hadoop to store the data, it is extremely slow even on small crawls, and can't possibly stand up against other search solutions like Nutch or Elasticsearch. So I thought I might as well suggest as a goal for the next few updates that the project move away from the Work Directory and onto a more conventional, faster system like Hadoop, which might allow Yioop to gain some popularity.
Also, as a note: in its current state, even with an SSD, it takes around 30 seconds to rank 10 results for a new search term.

-- Upgrading Yioop To a Better Archiving System?
Can you give some specific stats on the crawl you did (size, etc.) and the version of Yioop you were using when you saw 30-second queries? Give the specific queries. It is much faster than that. As mentioned in the docs, for full web page storage Yioop stores data more efficiently than MySQL. Hadoop is a platform for doing MapReduce jobs. It does have a file system component, but storing things in it would not solve any query speed issues. Yioop is more directly comparable to having a pre-built Nutch/Lucene/Solr stack where you don't have to write code to get things working together.
(Edited: 2015-07-21)

-- Upgrading Yioop To a Better Archiving System?
Hi, thanks for the extra info. The tests I had been doing earlier that resulted in 30-second searches were mainly image searches on the Yioop.com search engine. To give an example, if you search "Google" in the image sub-search, it took around 27 seconds. I'm not sure if Yioop has built-in caching, so here is a screenshot: http://i.imgur.com/65cHnSM.png As for the exact crawl stats, it was the current default crawl for Yioop.com. I should also note that Yioop does seem to be quite snappy with smaller indexes, such as one I started earlier today of 20,000 pages with an average response time of around 0.09 seconds, so I was wrong to say that in my earlier post.
(Edited: 2015-07-21)

-- Upgrading Yioop To a Better Archiving System?
That crawl is currently 700 million pages/20 billion links. It is not being served from SSD, and the same six Mac Minis are currently doing the crawl. Image search will also tend to be slower than normal search. The speed will improve significantly once I am done with this crawl and if I switch the dictionary portion of the index to SSD (I am actually saving up to get enough SSD for this, as I am doing this project out of pocket). Yioop does usually cache queries (except when I turn that option off). Those same six machines are also serving the findcan.ca index; however, that index is on SSD, although it is only around 20 million pages. It might give you a better feel for the speed when SSD is in use.

-- Upgrading Yioop To a Better Archiving System?
Yes, well, you're probably correct that it will be faster on SSDs, and I must admit it's a big job for just six Mac Minis. I also wanted to ask: in Yioop's settings you can set how much of a web page you want to download, but after the page is initially downloaded and indexed, is the full content kept on the server, or does Yioop delete some of the content and just keep the keywords found during indexing? Assuming the cache is turned off.
(Edited: 2015-07-21)

-- Upgrading Yioop To a Better Archiving System?
If, under Page Options, Cache whole crawled pages is turned off, then Yioop will not keep whole caches of pages. Yioop will then store, for each page, a gzip'd summary of that page up to the byte limit specified in Page Options. It also stores data from the summary in the inverted index. The latter data is around the same size as the summary.
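The storage scheme described above can be sketched roughly like this (a hypothetical illustration of the idea in Python, not Yioop's actual PHP internals; the limit value here is an assumption, not a Yioop default):

```python
import gzip

# Illustrative only: stands in for the Page Options summary byte limit.
MAX_SUMMARY_BYTES = 2000

def make_stored_summary(page_text: str) -> bytes:
    """Truncate the page summary to the byte limit, then gzip it for storage."""
    raw = page_text.encode("utf-8")[:MAX_SUMMARY_BYTES]
    return gzip.compress(raw)

def read_stored_summary(blob: bytes) -> str:
    """Decompress a stored summary back to text.

    errors="replace" guards against a truncation landing mid-character.
    """
    return gzip.decompress(blob).decode("utf-8", errors="replace")
```

So with whole-page caching off, only this capped, compressed summary (plus the inverted-index data built from it) survives the crawl, not the full page.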

-- Upgrading Yioop To a Better Archiving System?
Okay Thanks!
(Edited: 2015-07-21)

-- Upgrading Yioop To a Better Archiving System?
Just to be clear, by "byte limit" you're referring to the Max Summary Length and not the Byte Range to Download, correct?

-- Upgrading Yioop To a Better Archiving System?
Yes, the Max Page Summary Length in Bytes is how many bytes it will gzip. The Byte Range to Download is there so that Yioop doesn't get stuck downloading items that are too big.
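To illustrate how the two limits relate (names and values here are illustrative assumptions, not Yioop's internals): the download range caps how much of a page is ever fetched, and the summary length then caps, separately and much more tightly, how much of that is kept.

```python
# Illustrative values only, not Yioop defaults.
DOWNLOAD_BYTE_RANGE = 500_000  # stop fetching a page past this many bytes
MAX_SUMMARY_BYTES = 2_000      # how much of the page's summary is stored

def fetch_page(stream: bytes) -> bytes:
    """Simulate the download cap: only the first range of bytes is fetched."""
    return stream[:DOWNLOAD_BYTE_RANGE]

def summarize(page: bytes) -> bytes:
    """Simulate the storage cap: the kept summary is much smaller still."""
    return page[:MAX_SUMMARY_BYTES]
```

In other words, the download range protects the fetcher from huge files, while the summary length controls how much ends up on disk per page.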