-- How to turn the Yioop crawler into a topical web crawler?
Originally Posted By: jeffseka1
Hello, thank you very much for the quick response. According to the explanation above, together with the "Writing an Indexing Plugin" documentation, it seems that something I presumed would be fairly simple to implement is almost impossible as far as the Yioop script is concerned.
I have spent quite some time studying the code, mostly the fetcher class, looking closely at how I could implement this functionality from my custom plugin.
These are my findings. First, after my custom plugin has processed a page (done all the relevancy checks), the crawl still carries on processing that page until "postProcessing($index_name)" is called, even when the page is clearly irrelevant to what I am looking for. Second, there is no easy way of storing this irrelevant URL so that the crawler never downloads it again in later crawls, since we have already examined it and it is not the type of information we want.
All I am trying to index is sites that talk about colleges as a topic, period. If the page currently being crawled is from Facebook or Amazon, and the plugin identifies it as completely irrelevant to what we want, why should I wait for the crawl to finish? Can't I just return null or a negative value from "pageProcessing($page, $url)" so that the calling method knows it is time to move on to the next available page? Also, which method should I call to add this irrelevant link to the list of links that we never have to crawl again?
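To make the request concrete, here is a rough sketch of the behavior I have in mind. It is written in Python rather than Yioop's PHP and does not use any Yioop API. The names `page_processing`, `crawl`, `fetch`, `blocked`, and the keyword list are all made up for illustration, and the keyword count is only a stand-in for a real relevancy check.

```python
from typing import Callable, Iterable, List, Optional, Set

# Hypothetical topic keywords; a real plugin would use a better relevancy test.
KEYWORDS = ("college", "university", "campus")


def page_processing(page: str, url: str) -> Optional[dict]:
    """Return a summary for relevant pages, or None to signal 'skip this page'."""
    text = page.lower()
    hits = sum(text.count(word) for word in KEYWORDS)
    if hits == 0:
        return None  # irrelevant: the caller should stop working on this page
    return {"url": url, "score": hits}


def crawl(urls: Iterable[str],
          fetch: Callable[[str], str],
          blocked: Set[str]) -> List[dict]:
    """Fetch each URL once, index relevant pages, remember irrelevant ones."""
    indexed = []
    for url in urls:
        if url in blocked:
            continue  # already judged irrelevant, never download it again
        result = page_processing(fetch(url), url)
        if result is None:
            blocked.add(url)  # store the URL so later passes skip it
            continue  # move straight on to the next available page
        indexed.append(result)
    return indexed
```

In this sketch, `blocked` would have to be saved between crawls (for example to a file) for the "never crawl again" part to work across runs. What I am asking is whether Yioop offers an equivalent of the `None` return and the `blocked` set.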