-- wordfilter plugin / feature request
Hey trs-eric,
I've modified Web Scrapers in Yioop now to also support regex. As an example, one can write:
FILTER_LIST_HTML5=r/\<!doctype\s*html\s*\>/i
Here the r/ indicates that the right-hand side of the equality is a regex, not an xpath expression. If /\<!doctype\s*html\s*\>/i matches the current document, then FILTER_LIST_HTML5 gets the value 1; otherwise it gets the value 0. Since this is a FILTER_LIST_ term, if its value is not empty (i.e., not 0) it will get added to the summary before the WordPlugin runs. So a Word Plugin Rule like:
FILTER_LIST_HTML5:JUSTFOLLOW
will cause the crawler to just follow the links but not add the page to the index.
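To make the flow above concrete, here is a rough Python sketch of how such a rule could be evaluated. It is not Yioop's actual code (Yioop is written in PHP); the function names evaluate_regex_rule and summarize, and the dictionary used for the summary, are made up for illustration. It only mirrors the behaviour described above: the regex yields 1 or 0, and a non-empty FILTER_LIST_ value triggers the word rule's action.

```python
import re
from typing import Tuple


def evaluate_regex_rule(rule: str, document: str) -> Tuple[str, int]:
    """Evaluate a rule of the form NAME=r/PATTERN/FLAGS against a document.

    Hypothetical helper: returns (NAME, 1) if the pattern matches, else (NAME, 0).
    """
    name, _, rhs = rule.partition("=")
    if not rhs.startswith("r/"):
        raise ValueError("right-hand side is not marked as a regex with r/")
    # Split the pattern from the trailing flags at the last slash.
    pattern, sep, flag_text = rhs[2:].rpartition("/")
    if not sep:
        raise ValueError("regex is missing its closing /")
    # Only the i (case-insensitive) flag is handled in this sketch.
    flags = re.IGNORECASE if "i" in flag_text else 0
    return name, 1 if re.search(pattern, document, flags) else 0


def summarize(document: str, scraper_rule: str, word_rule: str) -> dict:
    """Apply one scraper rule, then one FIELD:ACTION word rule (assumed shape)."""
    summary = {"ROBOT_METAS": []}
    name, value = evaluate_regex_rule(scraper_rule, document)
    summary[name] = value
    field, _, action = word_rule.partition(":")
    # The action applies only when the field's value is non-empty (not 0).
    if summary.get(field):
        summary["ROBOT_METAS"].append(action)
    return summary


SCRAPER_RULE = r"FILTER_LIST_HTML5=r/\<!doctype\s*html\s*\>/i"
WORD_RULE = "FILTER_LIST_HTML5:JUSTFOLLOW"

html5_page = "<!DOCTYPE html><html><body>hi</body></html>"
html4_page = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"><html></html>'

print(summarize(html5_page, SCRAPER_RULE, WORD_RULE))
print(summarize(html4_page, SCRAPER_RULE, WORD_RULE))
```

For the HTML5 page the sketch sets FILTER_LIST_HTML5 to 1 and puts JUSTFOLLOW into ROBOT_METAS; for the HTML 4.01 page the value is 0 and ROBOT_METAS stays empty.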
To see that it works, on the most recent git repository version of Yioop, you can go to Page Options, then the Test Options tab, and try it out on various pages. When a page is HTML5, you will see FILTER_LIST_HTML5 with value 1 appear as one of the fields after Page Rules are applied. You will also see that ROBOT_METAS has JUSTFOLLOW as one of its elements. On the other hand, if the page is not HTML5, FILTER_LIST_HTML5 has value 0, and ROBOT_METAS will be empty. You can add other extract-variable regexes for different flavors of HTML as desired.
Best,
Chris