org.archive.crawler.prefetch.RuntimeLimitEnforcer_en.utf8


description: A processor that halts further progress once a fixed amount of time has elapsed since the start of a crawl. This processor can be configured per host, but note that Heritrix does not track runtime per host separately. Especially when using facilities like the BdbFrontier's hold-queues, the actual amount of time spent crawling a host may have little relation to the total elapsed time. Using overrides and/or refinements only makes sense with the 'Block URIs' end operation; the pause and terminate operations have global impact once encountered.

end-operation-description: The action that the processor takes once the runtime has elapsed.
Operation: Pause job - Pauses the crawl. Increasing the runtime duration makes it possible to resume the crawl. Attempts to resume the crawl without modifying the runtime will cause it to be paused again immediately.
Operation: Terminate job - Terminates the job. Equivalent to using the max-time setting on the CrawlController.
Operation: Block URIs - Blocks each URI with a -5002 (blocked by custom processor) fetch status code. This will cause all the queued URIs to wind up in the crawl.log.

runtime-seconds-description: The amount of time, in seconds, that the crawl will be allowed to run before this processor performs its 'end operation'.

controller-description: The CrawlController used to stop or pause crawls when the runtime limit is reached.

statistics-tracker-description: The StatisticsTracker used to determine how long the crawl has been running.
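
The decision described by these settings can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the Heritrix implementation: the class RuntimeLimitSketch, its enums, and the decide method are hypothetical names, and treating the limit as exceeded only when elapsed time is strictly greater than runtime-seconds is also an assumption. Only the three end operations and the -5002 status code come from the descriptions above.

```java
public class RuntimeLimitSketch {
    /** The three end operations named in end-operation-description. */
    public enum EndOperation { PAUSE, TERMINATE, BLOCK_URIS }

    /** Hypothetical result type: what happens to the URI currently being processed. */
    public enum Outcome { PROCEED, PAUSE_CRAWL, TERMINATE_CRAWL, BLOCK_URI }

    /** Fetch status quoted in the text for "blocked by custom processor". */
    public static final int BLOCKED_BY_CUSTOM_PROCESSOR = -5002;

    /**
     * Decides what to do with one URI.
     * elapsedSeconds is the total crawl duration (global, not per host);
     * runtimeSeconds is the configured limit.
     * Assumption: the limit counts as exceeded only when elapsed > limit.
     */
    public static Outcome decide(long elapsedSeconds, long runtimeSeconds, EndOperation op) {
        if (elapsedSeconds <= runtimeSeconds) {
            return Outcome.PROCEED;
        }
        switch (op) {
            case PAUSE:
                return Outcome.PAUSE_CRAWL;      // global impact
            case TERMINATE:
                return Outcome.TERMINATE_CRAWL;  // global impact
            default:
                return Outcome.BLOCK_URI;        // only this URI is affected
        }
    }

    public static void main(String[] args) {
        long limit = 3600;    // one hour
        long elapsed = 4000;  // crawl has run past the limit

        // Resuming without changing the limit pauses again immediately.
        System.out.println(decide(elapsed, limit, EndOperation.PAUSE));
        // Increasing the limit lets the crawl proceed.
        System.out.println(decide(elapsed, 7200, EndOperation.PAUSE));
        // Block URIs marks each URI with the blocked status code.
        if (decide(elapsed, limit, EndOperation.BLOCK_URIS) == Outcome.BLOCK_URI) {
            System.out.println("fetch status " + BLOCKED_BY_CUSTOM_PROCESSOR);
        }
    }
}
```

Because the sketch compares against total elapsed time, a per-host override of the limit changes only which URIs are blocked; it does not measure time spent on that host, which matches the caveat in the description.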