2013-09-04

How to turn yioop crawler into a topical web crawler?

Originally Posted By: jeffseka1
Hello, first let me explain the problem.
I want to crawl and index only web sites that talk about a specific topic or that have specific information, let's say only sites about colleges.
I want to have a relevancy plugin such that, during crawling or indexing, it can check each page before it is saved and decide, by some means of relevancy calculation, whether it is relevant to the information we are looking for.
If the page does not meet our relevancy requirements, we don't save it, and its URL is added to the list of URLs not to crawl.

My question is: where in the script can I get the page and check its relevancy? If it is relevant, I let the script carry on; if not, I drop that page and put its URL in the not-to-crawl-again list.

Which class or method should I override?
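(For illustration only: the "some means of relevancy calculation" asked about above could be something as simple as the keyword-overlap check sketched below. This is not Yioop code; the function name, keyword list, and threshold are made-up values.)

```php
<?php
/**
 * Illustrative relevancy check, not part of Yioop: score a page's text
 * against a list of topic keywords and accept it if enough of them occur.
 */
function isOnTopic($page_text, $topic_terms, $min_fraction = 0.3)
{
    $text = mb_strtolower($page_text);
    $hits = 0;
    foreach ($topic_terms as $term) {
        if (mb_strpos($text, mb_strtolower($term)) !== false) {
            $hits++;
        }
    }
    return ($hits / count($topic_terms)) >= $min_fraction;
}

// Example: a page counts as college-related if it mentions at least 30%
// of these (made-up) terms.
$college_terms = array("college", "university", "campus", "admission",
    "tuition", "undergraduate", "faculty");
var_dump(isOnTopic("Apply for undergraduate admission to our college campus.",
    $college_terms)); // bool(true)
```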

-- How to turn yioop crawler into a topical web crawler?
The method where stuff happens is processFetchPages in fetcher.php. If you write an indexing plugin, it will be called during the $processor->handle() call in this method. Currently, page rules are processed next, and finally classifier labeling is done. I will probably reorder the last two. I could add a check there for whether a site has the label self::DELETE and, if so, remove it from further processing. You would then just need to write your plugin and make sure it adds that label. If that seems like a reasonable feature to invoice $50 for, let me know via PM. I am trying to move to a model whereby Yioop can generate a little consulting money.
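(A rough skeleton of the kind of plugin described above is sketched below. The pageProcessing($page, $url) and postProcessing($index_name) hooks are the ones referred to in this thread; the class name, the getProcessors() hook, the keyword check, and the way the off-topic decision is recorded are assumptions for illustration only, since the self::DELETE handling mentioned above is only a proposal at this point.)

```php
<?php
/**
 * Sketch of a topical-crawl indexing plugin. Only pageProcessing() and
 * postProcessing() are the hooks discussed in this thread; everything
 * else here is illustrative.
 */
class CollegeTopicPlugin extends IndexingPlugin implements CrawlConstants
{
    /** URLs judged off-topic in the current fetch batch (illustrative). */
    var $off_topic_urls = array();

    /**
     * Called for each downloaded page from processFetchPages() in
     * fetcher.php, after the page processor's handle() call. Returning
     * NULL means the plugin adds no extra sub-documents of its own.
     */
    function pageProcessing($page, $url)
    {
        if (!$this->isOnTopic($page)) {
            // This is where the proposed "delete this page" label would be
            // attached so the fetcher could drop the page before storing it.
            $this->off_topic_urls[] = $url;
        }
        return NULL;
    }

    /** Illustrative stand-in for whatever relevancy calculation is used. */
    function isOnTopic($page)
    {
        $terms = array("college", "university", "campus", "admission",
            "tuition", "undergraduate");
        $hits = 0;
        foreach ($terms as $term) {
            if (mb_stripos($page, $term) !== false) {
                $hits++;
            }
        }
        return $hits / count($terms) >= 0.3;
    }

    /** Called once after a crawl batch; nothing to do in this sketch. */
    function postProcessing($index_name)
    {
    }

    /** Page processors this plugin should run under (assumed hook). */
    static function getProcessors()
    {
        return array("HtmlProcessor");
    }
}
?>
```
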
2013-09-05

-- How to turn yioop crawler into a topical web crawler?
Originally Posted By: jeffseka1
Hello, thank you very much for the quick response. According to the explanation above, together with the documentation "Writing an Indexing Plugin", it seems that something I presumed would be fairly simple to implement is almost not possible as far as the Yioop script is concerned.

Well, I have spent quite some time studying the code, mostly the Fetcher class, looking closely at how I can implement this functionality from my custom plugin.

Here is my finding: after the page has been processed (all the relevancy checks done) by my custom plugin, even when the processed page is plainly irrelevant to what I am looking for, the crawl still carries on processing it until postProcessing($index_name) is called. Secondly, there is no easy way of storing this irrelevant URL for later crawls so that the crawler never downloads it again, since we have already examined it and it is not the kind of information we want.

All I am trying to index are sites talking about colleges as a topic, period. If the current crawled page is Facebook or Amazon and the plugin picks it up as completely irrelevant to what we want, why should I wait for the crawl to finish with it? Can't I just return null or something negative from pageProcessing($page, $url), so the calling method knows it is time to move on to the next available page in the loop? Also, which method should I call to add this irrelevant link to the links that we shall never have to crawl again?

-- How to turn yioop crawler into a topical web crawler?
I swapped page rules and classifiers in the latest git version of Yioop. This means there are currently two slightly inelegant ways to do this (rather than the one before) and a feature request. The current ways are: (1) write a plugin that adds a label to the page's data marking it for deletion, or (2) use a classifier to add such a label. Then you could use a Yioop page rule to nuke most of the contents of the page so that it would not appear under any search in the index. This is somewhat inelegant in that a stub of the page would still be stored, wasting a little bit of space. All of this would occur before any postProcessing; I am not sure where you are getting that from. The feature request, which makes sense to get around to at some point, is to fix a name for a label that would mean not to even bother storing a stub.
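(To make the proposed feature concrete: the check being talked about for processFetchPages might look roughly like the fragment below. None of this exists in the distribution at the time of these posts; self::LABELS, self::DELETE, and $site_pages are placeholder names, and only the idea of dropping a labeled page before a stub is stored comes from the thread.)

```php
// Hypothetical fragment inside Fetcher::processFetchPages() in fetcher.php,
// run after plugins and classifiers have had a chance to label each page.
// self::LABELS, self::DELETE, and $site_pages are placeholder names.
foreach ($site_pages as $i => $site) {
    if (!empty($site[self::LABELS]) &&
            in_array(self::DELETE, $site[self::LABELS])) {
        // Off-topic page: discard it entirely so no stub is stored and
        // none of its links are scheduled for further crawling.
        unset($site_pages[$i]);
    }
}
```
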
2013-09-09

-- How to turn yioop crawler into a topical web crawler?
Hey Jeff,
Just in case you didn't see my PM, your plugin is ready.

Best,
Chris