org.archive.io.warc package
Experimental WARC writer and readers. The code and the specification are subject
to change with no guarantee of backward compatibility; that is, newer readers
may not be able to parse WARCs written with older writers. This package
contains prototype code for revision 0.12 of the WARC specification.
See the latest revision
for the current state. (The version 0.10 code and its documentation have been moved
into the v10 subpackage.)
Implementation Notes
Tools
Initial implementations of the Arc2Warc and Warc2Arc tools can be found in
Heritrix, at org.archive.io.Arc2Warc and org.archive.io.Warc2Arc respectively.
Pass --help to either tool to learn how to use it.
TODO
- Is the MIME-Version header needed? MIME parsers (the Python email library
and JavaMail) seem fine without it.
- Should we write out a Content-Transfer-Encoding header? (Currently we do
not.) The spec needs a section that is explicit about our interpretation of
MIME and our deviations from it (e.g. Content-Transfer-Encoding should be
assumed to be binary for WARCs; multipart is not disallowed, but it is not
encouraged either).
- Minor: Should the lead-in to an ARCRecord be WARC-Version: 0.12, in the
style of MIME-Version: 1.0, rather than WARC/0.12?
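
To make the last TODO item concrete, here is a minimal sketch of a reader that
accepts either lead-in form. The class WarcLeadIn and its parseVersion method are
hypothetical names invented for this illustration; they are not part of
org.archive.io.warc, and the sketch assumes the version is the first
whitespace-delimited token after "WARC/" or the value of a "WARC-Version" header.

```java
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of org.archive.io.warc. */
public final class WarcLeadIn {
    private static final String SLASH_PREFIX = "WARC/";
    private static final String HEADER_NAME = "WARC-Version";

    private WarcLeadIn() {}

    /**
     * Extracts the version from a record's first line, accepting either
     * "WARC/0.12" or "WARC-Version: 0.12". Returns empty if the line
     * matches neither form or carries no version.
     */
    public static Optional<String> parseVersion(String firstLine) {
        String line = firstLine.trim();

        // Slash form: the version is the first token after "WARC/".
        if (line.startsWith(SLASH_PREFIX)) {
            String rest = line.substring(SLASH_PREFIX.length());
            String version = rest.split("\\s+", 2)[0];
            return version.isEmpty() ? Optional.empty() : Optional.of(version);
        }

        // Header form: name and value separated by a colon, as in MIME-Version.
        int colon = line.indexOf(':');
        if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase(HEADER_NAME)) {
            String version = line.substring(colon + 1).trim();
            return version.isEmpty() ? Optional.empty() : Optional.of(version);
        }

        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parseVersion("WARC/0.12"));          // Optional[0.12]
        System.out.println(parseVersion("WARC-Version: 0.12")); // Optional[0.12]
        System.out.println(parseVersion("MIME-Version: 1.0"));  // Optional.empty
    }
}
```

With the header form, the version can be read by the same name/value parsing as
every other header; the slash form needs its own branch, which is the trade-off
the TODO item raises.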