All Downloads are FREE. Search and download functionalities are using the official Maven repository.

org.archive.crawler.frontier.AbstractFrontier_en.utf8 Maven / Gradle / Ivy

The newest version!
rules-description:
The list of canonicalization rules to apply to URIs before making decisions
about them.


manager-description:
The sheet manager used to configure this BdbFrontier.  It will also be used
to look up context-specific settings for discovered URIs.


seeds-description:
The seeds module used to prepare the seeds for crawling.


logger-module-description:
The logger module of the crawl; used to log various error conditions.


delay-factor-description:
How many multiples of last fetch elapsed time to wait before recontacting 
same server. 


force-queue-assignment-description:
queue assignment to force onto CrawlURIs; intended to be overridden 


respect-crawl-delay-description:
Whether to respect a 'Crawl-Delay' (in seconds) given in a site's
robots.txt; default is true

max-delay-ms-description:
never wait more than this long, regardless of multiple 


max-per-host-bandwidth-usage-kb-sec-description:
maximum per-host bandwidth usage 


max-retries-description:
maximum times to emit a CrawlURI without final disposition 

outbound-queue-capacity-description:
Capacity of the outbound queue carrying eligible URIs from the frontier
thread to the worker ToeThreads.

inbound-queue-multiple-description:
Capacity of the inbound queue carrying updates to the frontier thread from
other threads, as a multiple of the outbound queue capacity. 

min-delay-ms-description:
always wait this long after one completion before recontacting same 
server, regardless of multiple 


preference-embed-hops-description:
number of hops of embeds (ERX) to bump to front of host queue 


queue-assignment-policy-description:
Defines how to assign URIs to queues. Can assign by host, by ip, and into 
one of a fixed set of buckets (1k). 


recovery-log-enabled-description:
Recover log on or off attribute. 


retry-delay-seconds-description:
for retryable problems, seconds to wait before a retry 


source-tag-seeds-description:
Whether to tag seeds with their own URI as a heritable 'source' String, 
which will be carried-forward to all URIs discovered on paths originating 
from that seed. When present, such source tags appear in the 
second-to-last crawl.log field. 


total-bandwidth-usage-kb-sec-description:
maximum overall bandwidth usage 


controller-description:
The crawl controller using this Frontier.


queue-assignment-policy-description:
The queue assignment policy to use.


recovery-dir-description:
The directory that the ignored seeds report and the recovery log will live in.


scratch-dir-description:
The directory where temporary files will be written.

scope-description:
The decide rule used to determine whether or not URIs are in the scope of 
the crawl. Here in the frontier, it is only applied to bulk URI imports, 
not usual discoveries. This value should be shared with the other Scoper
processors for consistency. 

server-cache-description:
The server cache to use for the frontier. 

user-agent-provider-description:
The UserAgentProvider to use for the frontier. 




© 2015 - 2024 Weber Informatics LLC | Privacy Policy