
org.archive.modules.net.RobotsHonoringPolicy_en.utf8


This project contains some of the configurable modules used within the Heritrix application to crawl the web. The modules in this project can be used in applications other than Heritrix, however.

There is a newer version: 3.6.0
description:
Robots honoring policy.


user-agents-description:
Alternate user-agent values to consider using for the 'most-favored-set' 
policy. 


custom-robots-description:
Custom robots rules to use if the policy type is 'custom'. Write them as 
you would an actual robots.txt file. 


masquerade-description:
Whether to masquerade as another user agent when obeying the rules 
declared for it. Only relevant if the policy type is 'most-favored' or 
'most-favored-set'. 


type-description:
Policy type. The 'classic' policy simply obeys all robots.txt rules for 
the configured user-agent. The 'ignore' policy ignores all robots rules. 
The 'custom' policy allows you to specify a policy, in robots.txt format, 
as a setting. The 'most-favored' policy will crawl a URL if the 
robots.txt allows any user-agent to crawl it. The 'most-favored-set' 
policy requires you to supply a list of alternate user-agents, and for 
every page, if any agent of the set is allowed, the page will be crawled. 
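
The five policy types described above can be sketched as a single decision function. This is only an illustrative sketch, not the Heritrix implementation: the class `RobotsPolicySketch`, its `Type` enum, and the methods `allowedFor` and `shouldCrawl` are hypothetical names, and robots.txt is assumed to be reduced to a map from user-agent to disallowed path prefixes, with "*" as the wildcard group. Real robots.txt handling (Allow lines, longest-match rules, case folding) is left out.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the policy types; NOT the Heritrix API or implementation.
final class RobotsPolicySketch {

    enum Type { CLASSIC, IGNORE, CUSTOM, MOST_FAVORED, MOST_FAVORED_SET }

    // Assumed model: user-agent -> disallowed path prefixes; "*" is the wildcard group.
    static boolean allowedFor(Map<String, List<String>> rules, String agent, String path) {
        // Use the agent's own group if present, otherwise fall back to "*".
        List<String> disallowed = rules.getOrDefault(agent, rules.get("*"));
        if (disallowed == null) {
            return true; // no applicable group: nothing is disallowed
        }
        for (String prefix : disallowed) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    static boolean shouldCrawl(Type type, String agent, List<String> alternates,
                               Map<String, List<String>> siteRules,
                               Map<String, List<String>> customRules, String path) {
        switch (type) {
            case IGNORE:
                return true; // robots rules are not consulted at all
            case CUSTOM:
                // The operator-supplied rules replace the site's robots.txt.
                return allowedFor(customRules, agent, path);
            case MOST_FAVORED:
                // Crawl if any user-agent named in the site's robots.txt may fetch the path.
                if (siteRules.isEmpty()) {
                    return true;
                }
                for (String named : siteRules.keySet()) {
                    if (allowedFor(siteRules, named, path)) {
                        return true;
                    }
                }
                return false;
            case MOST_FAVORED_SET:
                // Crawl if any agent from the supplied alternate list may fetch the path.
                for (String alt : alternates) {
                    if (allowedFor(siteRules, alt, path)) {
                        return true;
                    }
                }
                return false;
            case CLASSIC:
            default:
                // Obey the site's rules for the configured user-agent.
                return allowedFor(siteRules, agent, path);
        }
    }
}
```

In this sketch the 'masquerade' setting is not modelled; it would only affect which user-agent string is sent once a 'most-favored' or 'most-favored-set' match is found.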





