Yioop - PHP Search Engine

wordfilter plugin / feature request.

Is it possible to filter HTML in the wordfilter plugin? Or do you think this will require a new plugin?

I would like to do something like:

+<!DOCTYPE html>:JUSTFOLLOW

I tried this to filter out HTML5 documents, but it doesn't seem to have much effect on the search results.

Any idea how I could filter in or out based on some HTML text?

-- wordfilter plugin / feature request

Hey trs-eric,

That seems like a good suggestion. Right now, it doesn't support that, but I could add support for the next version of Yioop. The code where it checks is in src/library/indexing_plugins/WordFilterPlugin.php in the checkFilter method. It is probably less than a 10 line fix. I'll let you know when I have done the change in the git repository.

Best,

Chris

-- wordfilter plugin / feature request

Hey trs-eric,

Thinking about it a bit more, the word plugin is applied to titles and descriptions after page summarization has been done, so there won't be any tags at that stage. What probably makes more sense is to use a Web Scraper to extract new fields, beginning with either FILTER_TERM_... or FILTER_LIST_..., using an appropriate XPath. I've modified the word plugin now so that if a field FILTER_TERM_... is found, for example, FILTER_TERM_DOCTYPE, then the term FILTER_TERM_DOCTYPE is added to the summarized description before the word plugin is run. If a field FILTER_LIST_... is found, then the terms in its value are added to the summarized description before the word plugin is run. This allows you to write an appropriate word plugin rule to filter what you want, without changing the word plugin syntax.
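
As a rough illustration of the behavior described above, here is a small Python sketch. It is not Yioop's actual PHP code: the function name add_filter_terms, the "description" key, and the FILTER_LIST_TOPICS field are hypothetical, and the handling of values is an assumption based only on the description in this post.

```python
def add_filter_terms(summary):
    """Return a copy of summary with filter terms appended to its description.

    Assumed behavior, following the post: a FILTER_TERM_* field contributes
    its own name; a FILTER_LIST_* field contributes the terms in its value.
    """
    extra = []
    for field, value in summary.items():
        if field.startswith("FILTER_TERM_"):
            # The field name itself becomes a term in the description.
            extra.append(field)
        elif field.startswith("FILTER_LIST_"):
            # Each whitespace-separated term of the value is added.
            extra.extend(str(value).split())
    result = dict(summary)
    parts = [summary.get("description", "")] + extra
    result["description"] = " ".join(parts).strip()
    return result


page = {
    "description": "An example page.",
    "FILTER_TERM_DOCTYPE": "html",
    "FILTER_LIST_TOPICS": "news sports",  # hypothetical field
}
print(add_filter_terms(page)["description"])
```

A word plugin rule can then match on a term such as FILTER_TERM_DOCTYPE exactly as it would on any other word in the description.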

Best, Chris

(Edited: 2022-04-17)

-- wordfilter plugin / feature request

Hey, thanks for the update! Any idea how to actually set this up? I'm not sure how to get the doctype in a web scraper signature field, or really how to fill out the rest of this scraper to do what you describe. Any tips?

-- wordfilter plugin / feature request

Hey, sorry to bother you, but is it possible to get an example configuration? I'm afraid I don't understand how to implement your recommendation.

-- wordfilter plugin / feature request

Hey TRS-Eric,

As I understand it, you want to filter based on HTML. Web Scrapers are only run on HTML content as it is. So you could add a Web Scraper with the signature:

 //html

Then put nothing under Text XPath or Delete XPath. Finally, under Extract Fields, you could have the line:

 FILTER_TERM_HTML=//html

Then in the word plugin:

 [domain_youd_like_to_apply_stuff_to]
 FILTER_TERM_HTML:JUSTFOLLOW

This will work in the next version of Yioop (9.0), or in the version currently in the git repository. Note that this filters all HTML pages. Or did you want exactly HTML5?

(Edited: 2022-06-11)

-- wordfilter plugin / feature request

Oh thanks! Yes, I want to filter out HTML5 and possibly weight other versions of HTML. Thanks!

-- wordfilter plugin / feature request

Hey trs-eric,

I've modified Web Scrapers in Yioop now to also support regex. As an example, one can write:

 FILTER_LIST_HTML5=r/\<!doctype\s*html\s*\>/i

Here the r/ prefix indicates that the right-hand side of the equality is a regex, not an XPath expression. If /\<!doctype\s*html\s*\>/i matches the current document, then FILTER_LIST_HTML5 gets the value 1; otherwise it gets the value 0. Since this is a FILTER_LIST_ term, if its value is not empty (i.e., not 0), it will get added to the summary before the WordPlugin runs. So a Word Plugin rule like:

 FILTER_LIST_HTML5:JUSTFOLLOW

will cause the crawler to just follow the links but not add the page to the index. To see that it works, on the most recent git repository version of Yioop, you can go to Page Options, then the Test Options tab, and try it out on various pages. When a page is HTML5, you will see that FILTER_LIST_HTML5 with value 1 appears as one of the fields after Page Rules are applied. You will also see that ROBOT_METAS has JUSTFOLLOW as one of its elements. On the other hand, if the page is not HTML5, FILTER_LIST_HTML5 has value 0, and ROBOT_METAS will be empty. You can add other extract-variable regexes for different flavors of HTML as desired.
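
To get a feel for which pages the regex above would flag before setting up a crawl, it can be tried outside Yioop. The Python sketch below is only an approximation of the matching step, not Yioop's implementation. The FILTER_LIST_HTML4 and FILTER_LIST_XHTML names and their patterns are hypothetical examples of "other flavors" and have not been tested in Yioop; only the FILTER_LIST_HTML5 pattern comes from this post (written here without the backslashes before < and >, which Python does not need).

```python
import re

# Field name -> compiled regex. Only the HTML5 entry is from the thread;
# the HTML4 and XHTML entries are illustrative assumptions.
DOCTYPE_RULES = {
    "FILTER_LIST_HTML5": re.compile(r'<!doctype\s*html\s*>', re.IGNORECASE),
    "FILTER_LIST_HTML4": re.compile(
        r'<!doctype\s+html\s+public\s+"-//W3C//DTD HTML 4', re.IGNORECASE),
    "FILTER_LIST_XHTML": re.compile(
        r'<!doctype\s+html\s+public\s+"-//W3C//DTD XHTML', re.IGNORECASE),
}


def extract_fields(page_source):
    """Give each field the value 1 if its regex matches the page, else 0."""
    return {
        name: 1 if pattern.search(page_source) else 0
        for name, pattern in DOCTYPE_RULES.items()
    }


html5_page = "<!DOCTYPE html>\n<html><body>hi</body></html>"
html4_page = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
              '"http://www.w3.org/TR/html4/strict.dtd">\n<html></html>')

print(extract_fields(html5_page))
print(extract_fields(html4_page))
```

Note that the HTML5 pattern does not match the HTML 4.01 doctype, because after "HTML" it expects only optional whitespace and then the closing ">".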

Best,

Chris
