2019-02-12

Archive Crawl Question.

Chris,
I think I know the answer to this but just want to make sure.
If I do an archive crawl of a "mix", then after it completes, is it okay to delete all the previous components of the mix?
Also, when a crawl is deleted - is the data actually deleted from my server?
Thanks, Anthony
(Edited: 2019-02-12)
2019-02-13

-- Archive Crawl Question
bump
2019-02-16

-- Archive Crawl Question
Yes, to both questions. Each crawl (including archive crawls) has a timestamp. If it is a single-machine crawl, then all index data for the crawl is stored in a folder:
 work_directory/cache/IndexDataTHE_TIMESTAMP.
If you crawl on one instance of Yioop and obtain such a folder, you can copy it to some other instance of Yioop, and that instance can then be used to serve search results for the crawl. This is handy if you want to serve search results from a shared hosting setup.
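To make the copy step concrete, here is a minimal sketch. It assumes the default work_directory layout described above; the two cache paths are stand-ins created with mktemp so the example runs anywhere, and the timestamp is a made-up placeholder.

```shell
# Stand-ins for two Yioop instances' cache directories (assumption:
# default work_directory layout, placeholder timestamp).
SRC=$(mktemp -d)   # pretend: yioop_a/work_directory/cache
DST=$(mktemp -d)   # pretend: yioop_b/work_directory/cache
TIMESTAMP=1549900000

# A finished crawl's index bundle on the source instance.
mkdir "$SRC/IndexData$TIMESTAMP"

# Copy the whole bundle to the other instance's cache folder;
# that instance can then serve search results for this crawl.
cp -r "$SRC/IndexData$TIMESTAMP" "$DST/"
ls "$DST"
```

In a real setup you would use scp or rsync between machines rather than a local cp, but the idea is the same: the IndexData folder is self-contained, so moving it moves the crawl.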
 
All queue data (what to crawl next) is kept in
 work_directory/cache/QueueBundleTHE_TIMESTAMP.
 
Fetchers, the processes that actually perform the downloading of pages, keep their processing info in files of the form
 work_directory/cache/FETCHER_NUM-ArchiveTHE_TIMESTAMP. 
 
In the dev version of Yioop, one can have repeating crawls and multiple crawls running at the same time, so there is also a CHANNEL_NUM in front of the -Archive name, as well as a new kind of DoubleIndexData index folder.
 
Data sent from fetchers but not yet processed by queue servers is held in
 work_directory/schedules/ScheduleDataTHE_TIMESTAMP.
Best,
Chris
(Edited: 2019-02-16)

-- Archive Crawl Question
Thanks Chris!
I consider myself to be a fairly technically minded person, and I honestly cannot grasp how this all works. It is truly amazing that this can be done without the need for large MySQL databases.
This is the best free search software available, and I appreciate all your efforts.
Anthony