2016-11-30

Option to exclude duplicate URLs with # appended?

I noticed that the crawler treats URLs that differ only in a trailing #fragment as distinct URLs, which causes thousands of duplicate URLs on some sites. Is there an option to strip the fragment from the end of a URL?
E.g., http://google.com/#someelement to http://google.com/

-- Option to exclude duplicate URLs with # appended?
Hi Spurdo,
For manually entered seed sites on the Crawl Options page, Yioop trusts what the user wants as far as fragments go (i.e., it won't delete them, and it will download the fragment URLs separately). However, when it is crawling and canonicalizing URLs to crawl next, it does not include the fragment. So if it discovers two links on a page, http://foo.org/#A and http://foo.org/#B, it would only schedule http://foo.org/ to crawl. To avoid the issue you mention above, don't enter fragments when entering seed sites.
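The canonicalization step described above can be sketched in Python; this uses the standard library's urldefrag and is only an illustration of the idea, not Yioop's actual implementation:

```python
from urllib.parse import urldefrag

def canonicalize(url: str) -> str:
    """Strip the #fragment so links that differ only in their
    fragment collapse to a single schedulable URL."""
    base, _fragment = urldefrag(url)
    return base

# Two discovered links that differ only in fragment map to one URL.
print(canonicalize("http://foo.org/#A"))  # http://foo.org/
print(canonicalize("http://foo.org/#B"))  # http://foo.org/
```

A crawler's scheduler can then deduplicate on the canonicalized form, so http://foo.org/ is fetched once no matter how many fragment variants appear in the link graph.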
Best, Chris