2020-04-29

Indexing Russian sites.

Hi Chris, I apologize in advance for my English, but I am facing a problem. When crawling some Russian sites where the language is not explicitly indicated at the beginning of the page, I get encoding problems. Is there a way to force UTF-8? My index is built only from English and Russian sites. I would also like to ask how to disable different results for different Yioop languages.
Thanks.

-- Indexing Russian sites
Under the Manage Locales activity you can either disable or delete other locales. This won't prevent pages in other locales from appearing in an index if you crawl them. Probably the easiest way to prevent this would be to use a word filter plugin to detect whether a page has common English or Russian stop words and, if so, index it; otherwise, just follow the links on the page. I have a somewhat old tutorial on how that plugin works:
https://www.yioop.com/?c=group&a=wiki&group_id=20&arg=media&page_id=26&n=04%20Niche%20or%20Subject%20Specific%20Crawling%20With%20Yioop.mp4
Best,
Chris
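The stop-word idea described above can be sketched roughly as follows. This is only an illustration in Python, not Yioop's WordFilterPlugin: the word lists, the `threshold` value, and the function names `stop_word_ratio` and `should_index` are all assumptions made up for the example.

```python
import re

# Tiny illustrative stop-word lists (assumptions, not Yioop's own lists).
ENGLISH_STOP_WORDS = {"the", "and", "of", "to", "in", "is", "that", "for"}
RUSSIAN_STOP_WORDS = {"и", "в", "не", "на", "что", "с", "это", "как"}

# On str input, \w matches Unicode letters in Python 3, so Cyrillic
# words are extracted the same way as Latin ones.
WORD_PATTERN = re.compile(r"\w+")


def stop_word_ratio(text, stop_words):
    """Return the fraction of words in text that appear in stop_words."""
    words = [w.lower() for w in WORD_PATTERN.findall(text)]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in stop_words)
    return hits / len(words)


def should_index(text, threshold=0.1):
    """Guess whether a page is English or Russian.

    True means "index the page"; False means "only follow its links".
    The threshold is a hypothetical value chosen for this sketch.
    """
    return (
        stop_word_ratio(text, ENGLISH_STOP_WORDS) >= threshold
        or stop_word_ratio(text, RUSSIAN_STOP_WORDS) >= threshold
    )
```

A page whose text yields `should_index(text) == False` would still have its links followed, so the crawl can pass through pages in other languages without adding them to the index.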

Attachments: Снимок.JPG, Снимок2.JPG
Thanks for the answer, Chris! Unfortunately, the plugin detects common characters, but only English is displayed in the output; Russian characters are replaced by "?". For instance: "???????? Apache2.4.torrent ????"
(Edited: 2020-04-29)
2020-05-16

-- Indexing Russian sites
Okay. My goal is to have a new version of Yioop out by mid-June. I will add checking the UTF-8 compatibility of the regexes in WordFilterPlugin.php to my list of bugs to try to fix. My guess is that some preg function needs a /u at the end of its regex.
Best,
Chris
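To illustrate the kind of problem a missing /u can cause: without that modifier, PHP's preg functions treat the pattern and subject as raw bytes, so a class like \w does not reliably cover multi-byte UTF-8 letters. The sketch below shows the analogous byte-versus-character difference in Python; it is only an analogy and not the actual WordFilterPlugin.php code, and the sample string and function names are invented for the example.

```python
import re


def words_bytewise(text):
    """Extract words by matching on UTF-8 bytes.

    With a bytes pattern, \\w only covers ASCII letters, digits and "_",
    so Cyrillic words are skipped entirely.
    """
    raw = text.encode("utf-8")
    return [w.decode("ascii") for w in re.findall(rb"\w+", raw)]


def words_unicode(text):
    """Extract words by matching on characters.

    With a str pattern, \\w covers Unicode letters, so Cyrillic words
    are found as well.
    """
    return re.findall(r"\w+", text)


# Hypothetical sample text, not taken from the screenshots in the thread.
sample = "Скачать Apache2.4.torrent бесплатно"
ascii_only = words_bytewise(sample)
all_words = words_unicode(sample)
```

Here `ascii_only` holds just the Latin and digit fragments, while `all_words` also includes the two Russian words, which mirrors the symptom of only English text surviving.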