2014-10-30

Indexing language.

Originally Posted By: brujula
Hello, first of all, and thanks for this great search engine. Our project is to create a search engine, but only for content written in Basque (eu-ES). How can I restrict it to only this content?

-- Indexing language
That sounds like a fun project. If I had to do it, I would use Manage Locales to add a new Basque locale. I would then try to translate as many of the user-facing strings as I could into Basque: those beginning with search_, settings_, group_, and wiki_ (in that order). Then at least the text Yioop shows to the end user would be in Basque.

Next, I would gather a collection of the better-known Basque web pages. This would probably include Wikipedia, any Basque chat sites (the discussion board features of Yioop will continue to improve in Version 1.2), Basque businesses, sports teams, travel sites, etc. These would be the initial seed sites for my crawl. I would use the word filter plugin so that I only crawl pages containing certain words I would expect to see on a Basque page. This filters pages before summarization, which is faster.

If I wanted to be more sophisticated, I could train a Basque classifier under Manage Classifiers. At crawl time I could then use this classifier to add a meta word to pages that were classified as Basque. Finally, I could use the Page Field Extraction Language to look for this meta word and only index a page if it had it.
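To make the word-filter idea concrete, here is a minimal standalone sketch in Python. It is not Yioop's word filter plugin or any part of its API. It only illustrates the heuristic of keeping a page when enough of its words are ones you would expect on a Basque page. The word list, the function names, and the thresholds are all assumptions chosen for illustration, and you would want to tune them against real pages.

```python
import re

# Assumed marker list: a handful of very common Basque function words.
# This is illustrative only and should be extended and tuned on real data.
BASQUE_MARKERS = frozenset({
    "eta", "da", "ez", "bat", "du", "dira", "ere", "baina", "edo",
    "hau", "zen", "dute", "beste", "izan", "egin", "bere",
})

# Runs of letters (Unicode-aware), excluding digits and underscores.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def basque_marker_ratio(text):
    """Return the fraction of word tokens in text that are marker words."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in BASQUE_MARKERS)
    return hits / len(tokens)


def looks_basque(text, min_ratio=0.1, min_tokens=5):
    """Hypothetical filter: keep a page only if it has enough tokens and
    a large enough share of them are Basque marker words."""
    if len(TOKEN_RE.findall(text)) < min_tokens:
        return False
    return basque_marker_ratio(text) >= min_ratio


if __name__ == "__main__":
    pages = [
        "Kaixo, hau da gure webgunea eta euskaraz idatzita dago, "
        "baina ez da oso handia.",
        "This is a page about search engines and how to crawl the web.",
    ]
    for page in pages:
        print(looks_basque(page), "-", page[:40])
```

A ratio threshold rather than a single required word makes the filter less likely to accept a page in another language that happens to contain one short word such as "da" or "ez".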