-- Option to exclude duplicate URLs with # appended?
Hi Spurdo,
For manually entered seed sites on the Crawl Options page, Yioop trusts what the user wants as far as fragments go (i.e., it won't delete them, and it will download each fragment URL separately). However, when Yioop is crawling and canonicalizing the URLs to crawl next, it does not include the fragment. So if it discovers two links on a page, http://foo.org/#A and http://foo.org/#B, it would only schedule http://foo.org/ to crawl. To avoid the issue you mention above, don't enter fragments when entering seed sites.
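In case it helps, here is a minimal sketch of that fragment-stripping step (this is just an illustration in Python, not Yioop's actual canonicalization code, which does more than this):

```python
from urllib.parse import urldefrag

def canonicalize(url):
    # Drop the #fragment, keeping the rest of the URL intact.
    base, _fragment = urldefrag(url)
    return base

# Two links discovered on a page that differ only in fragment...
links = ["http://foo.org/#A", "http://foo.org/#B"]

# ...collapse to a single URL to schedule for crawling.
to_schedule = sorted({canonicalize(u) for u in links})
print(to_schedule)  # ['http://foo.org/']
```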
Best,
Chris