16/04/2022

wordfilter plugin / feature request.

Is it possible to filter HTML in the wordfilter plugin, or do you think this would require a new plugin?
I would like to do something like:
+<!DOCTYPE html>:JUSTFOLLOW
to filter out HTML5 documents, but it doesn't seem to have much effect on the search results.
Any idea how I could filter in or out based on some HTML text?
17/04/2022

-- wordfilter plugin / feature request
Hey trs-eric,
That seems like a good suggestion. Right now, it doesn't support that, but I could add support in the next version of Yioop. The check is done in the checkFilter method in src/library/indexing_plugins/WordFilterPlugin.php. It is probably a fix of fewer than 10 lines. I'll let you know when I have made the change in the git repository.
Best,
Chris

-- wordfilter plugin / feature request
Hey trs-eric,
Thinking about it a bit more, the word plugin is applied to titles and descriptions after page summarization has been done, so there won't be any tags at that stage. What probably makes more sense is to use a Web Scraper with an appropriate XPath to extract new fields whose names begin with either FILTER_TERM_... or FILTER_LIST_.... I've now modified the word plugin so that if a field FILTER_TERM_... is found, for example FILTER_TERM_DOCTYPE, then the term FILTER_TERM_DOCTYPE is added to the summarized description before the word plugin is run. If a field FILTER_LIST_... is found, then the terms in its value are added to the summarized description before the word plugin is run. This lets you write a word plugin rule that filters what you want without any change to the word plugin syntax.
Best, Chris
(Edited: 17/04/2022)
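For anyone following along, here is a rough sketch of the behaviour described above. It is written in Python purely for illustration and is not Yioop's actual PHP code. The function name add_filter_terms, the "description" key, and splitting list values on whitespace are all assumptions on my part.

```python
def add_filter_terms(summary: dict) -> dict:
    """Return a copy of a page summary with filter terms appended to its
    description, mimicking the behaviour described in the post above.

    Hypothetical sketch: the "description" key and whitespace splitting
    of FILTER_LIST_ values are assumptions, not Yioop's real internals.
    """
    extra_terms = []
    for field, value in summary.items():
        if field.startswith("FILTER_TERM_"):
            # For a FILTER_TERM_ field, the field name itself is the term.
            extra_terms.append(field)
        elif field.startswith("FILTER_LIST_"):
            # For a FILTER_LIST_ field, the terms come from its value.
            extra_terms.extend(str(value).split())

    result = dict(summary)
    if extra_terms:
        description = summary.get("description", "")
        result["description"] = " ".join([description] + extra_terms).strip()
    return result


# Example summary as a scraper might produce it (field values are made up).
page = {
    "description": "A page",
    "FILTER_TERM_DOCTYPE": "html",
    "FILTER_LIST_TAGS": "foo bar",
}
augmented = add_filter_terms(page)
```

Once the description has been augmented like this, a word filter rule can match on a term such as FILTER_TERM_DOCTYPE, even though the original markup was discarded during summarization.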
22/04/2022

-- wordfilter plugin / feature request
Hey, thanks for the update! Any idea how to actually set this up? I'm not sure how to get the doctype in a web scraper signature field, or really how to fill out the rest of this scraper to do what you describe. Any tips?