2019-02-12

Archive Crawl Question.

Chris,
I think I know the answer to this but just want to make sure.
If I do an archive crawl of a "mix", then after it completes, is it okay to delete all the previous components of the mix?
Also, when a crawl is deleted - is the data actually deleted from my server?
Thanks, Anthony
(Edited: 2019-02-12)
2019-02-13

-- Archive Crawl Question
bump
2019-02-16

-- Archive Crawl Question
Yes, to both questions. Each crawl (including archive crawls) has a timestamp. If it is a single-machine crawl, then all index data for the crawl is stored in a folder:
 work_directory/cache/IndexDataTHE_TIMESTAMP.
If you crawl on one instance of Yioop and obtain such a folder, you can copy it to some other instance of Yioop, and that instance can then be used to serve search results for the crawl. This is handy if you want to serve search results from a shared hosting setup.
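To make the copy step concrete, here is a minimal sketch. It assumes the default work_directory layout described above; the two cache paths are stand-ins created with mktemp so the example runs anywhere, and the timestamp is a made-up placeholder.

```shell
# Stand-ins for two Yioop instances' cache directories (assumption:
# default work_directory layout, placeholder timestamp).
SRC=$(mktemp -d)   # pretend: yioop_a/work_directory/cache
DST=$(mktemp -d)   # pretend: yioop_b/work_directory/cache
TIMESTAMP=1549900000

# A finished crawl's index bundle on the source instance.
mkdir "$SRC/IndexData$TIMESTAMP"

# Copy the whole bundle to the other instance's cache folder;
# that instance can then serve search results for this crawl.
cp -r "$SRC/IndexData$TIMESTAMP" "$DST/"
ls "$DST"
```

In a real setup you would use scp or rsync between machines rather than a local cp, but the idea is the same: the IndexData folder is self-contained, so moving it moves the crawl.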
 
All queue data (what to crawl next) is kept in
 work_directory/cache/QueueBundleTHE_TIMESTAMP.
 
Fetchers, the processes that actually perform the downloading of pages, keep their processing info in files of the form
 work_directory/cache/FETCHER_NUM-ArchiveTHE_TIMESTAMP. 
 
In the dev version of Yioop, one can have repeating crawls and multiple crawls running at the same time, so there is also a CHANNEL_NUM in front of the -Archive name, as well as a new kind of DoubleIndexData index folder.
 
Data sent from fetchers but not yet processed by queue servers is held in
 work_directory/schedules/ScheduleDataTHE_TIMESTAMP.
Best,
Chris
(Edited: 2019-02-16)

-- Archive Crawl Question
Thanks Chris!
I consider myself to be a fairly technically minded person, and I honestly cannot grasp how this all works. It is truly amazing that this can be done without the need for large MySQL databases.
This is the best free search software available, and I appreciate all your efforts.
Anthony