2020-01-14

Crawling Again.

Happy New Year!
In August 2019 I did a couple of 75-million-page crawls and then went back to the drawing board to figure out how to improve both the crawling and the search results.
One problem I noticed, which I hadn't seen in previous crawls, was the prevalence of link farms, mainly in China. These sites consist of a collection of interlinking URLs, all of the form:
 some_hash_string.domain.name
or
 some_random_lists_words.domain.name
All of the pages in the link farm tend to link back to a particular page that is being promoted. Link farms are hard to train a crawler to avoid, but I have added some code that tries to detect and skip them.
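To give a sense of the kind of check involved, here is a rough Python sketch of a heuristic that flags link-farm-style hostnames, either hash-like subdomains or long runs of glued-together dictionary words. This is only an illustration of the idea, not the code actually added to Yioop; the thresholds and the toy word list are made-up assumptions.

 import math
 import re
 from collections import Counter

 def shannon_entropy(s):
     """Bits per character; long hash-like strings tend to score high."""
     counts = Counter(s)
     total = len(s)
     return -sum((c / total) * math.log2(c / total) for c in counts.values())

 def looks_like_link_farm_host(host, common_words):
     """Flag hosts whose leftmost label looks like a hash string or a run of
     dictionary words glued together, e.g. a1b2c3d4e5f6.example.com or
     cheap-shoes-best-price-deals.example.com."""
     label = host.split(".")[0].lower()
     # Hash-like: long, alphanumeric, high entropy.
     if len(label) >= 12 and re.fullmatch(r"[a-z0-9]+", label):
         if shannon_entropy(label) > 3.5:
             return True
     # Random-word-list-like: several dictionary words joined by hyphens/underscores.
     parts = re.split(r"[-_]", label)
     if len(parts) >= 4 and all(p in common_words for p in parts):
         return True
     return False

 # Example usage with a toy word list:
 words = {"cheap", "shoes", "best", "price", "deals"}
 print(looks_like_link_farm_host("a1b2c3d4e5f6.example.cn", words))                  # True
 print(looks_like_link_farm_host("cheap-shoes-best-price-deals.example.cn", words))  # True
 print(looks_like_link_farm_host("www.example.cn", words))                           # False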
I have also written some code to extract information useful for crawling from Wikipedia dumps. I wrote a script which, using Wikipedia's usage statistics, finds the x most popular Wikipedia pages, extracts the first paragraph of each of their articles, and adds these to a Yioop Search Wiki. These search wiki results are then displayed as a callout on queries related to the wiki page, much as other major search engines do. From the same dump I also extract infobox website URLs for these popular pages and add them to my starting seed sites for crawls.
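For the curious, here is a rough Python sketch of the general approach, not the actual script: rank pages by view counts from a pageview dump, then pull a first paragraph and any infobox website URL out of raw article wikitext. The dump line format, the field positions, and the wikitext handling shown here are simplifying assumptions; real wikitext really wants a proper parser.

 import bz2
 import heapq
 import re

 def top_pages(pageview_file, lang_code="en", n=1000):
     """Rank pages by view count from a Wikipedia pageview dump.
     Assumes space-separated lines of the form: <project> <title> <views> <bytes>
     (an assumption -- check the format of the dump you actually download)."""
     counts = {}
     with bz2.open(pageview_file, "rt", encoding="utf-8", errors="ignore") as f:
         for line in f:
             parts = line.split(" ")
             if len(parts) < 3 or parts[0] != lang_code:
                 continue
             title, views = parts[1], parts[2]
             if views.isdigit():
                 counts[title] = counts.get(title, 0) + int(views)
     return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

 def first_paragraph_and_website(wikitext):
     """Pull a plain-ish first paragraph and any infobox 'website' URL from
     raw article wikitext.  Very rough: skips template and heading lines."""
     website = None
     m = re.search(r"\|\s*website\s*=\s*(?:\{\{URL\|)?(\S+?)[\}\|\s]", wikitext, re.I)
     if m:
         website = m.group(1)
     for line in wikitext.splitlines():
         line = line.strip()
         if line and not line.startswith(("{{", "|", "}}", "[[File:", "=")):
             text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", line)  # [[a|b]] -> b
             text = re.sub(r"'{2,}", "", text)                              # bold/italics
             return re.sub(r"\{\{[^}]*\}\}", "", text).strip(), website
     return "", website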
All of this is being done for each of the 23 languages Yioop comes with, so hopefully this will mean a greater diversity of search results when someone searches in another language. I have used Yandex.translate (https://translate.yandex.com/) to translate all the missing static string translations for these other locales. As these are static translations, no calls to Yandex are made now that the translation process is finished. The only languages I know are English, French (I used to be quasi-fluent), and Chinese (limited), so it is hard for me to judge how good or bad these localizations are. I know for French there were a fair number of issues, some of which I have addressed by hand-editing the translations. If people notice things that need to be changed, please let me know. For Chinese, Forrest Sun has made some improvements to how word segmentation works, so hopefully this will improve Chinese search results in future crawls.
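The translation pass amounts to a one-time batch job along these lines. This is a hypothetical sketch, not Yioop's actual localization code: the translate() helper and the JSON string-file layout are stand-ins for whatever service client and locale storage you actually use.

 import json

 def translate(text, target_lang):
     """Hypothetical stand-in for a call to an online translation service
     (e.g. Yandex.Translate); replace with a real API client."""
     raise NotImplementedError

 def fill_missing_translations(en_path, locale_path, target_lang):
     """One-time batch pass: any string identifier present in the English
     file but empty or missing in the locale file gets a machine translation.
     The JSON layout here is an assumption, not Yioop's actual format."""
     with open(en_path, encoding="utf-8") as f:
         english = json.load(f)
     with open(locale_path, encoding="utf-8") as f:
         locale = json.load(f)
     for ident, text in english.items():
         if not locale.get(ident):
             locale[ident] = translate(text, target_lang)
     with open(locale_path, "w", encoding="utf-8") as f:
         json.dump(locale, f, ensure_ascii=False, indent=2)
     # After this runs once, the translations are static -- no further API calls.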
Over the next couple of weeks, I am going to try some new test crawls to see how these enhancements affect search results.
Chris