org.archive.io.warc package


Experimental WARC writer and readers. Code and specification are subject to change
with no guarantee of backward compatibility; that is, newer readers
may not be able to parse WARCs written with older writers. This package
contains prototyping code for revision 0.12 of the WARC specification.
See the latest revision of the specification
for its current state. (Version 0.10 code and its documentation have been moved into the
v10 subpackage.)


Implementation Notes

Tools

Initial implementations of Arc2Warc and Warc2Arc tools can be found in Heritrix, at org.archive.io.Arc2Warc and org.archive.io.Warc2Arc respectively. Pass --help to learn how to use each tool.

TODO

  • Is the MIME-Version header needed? MIME parsers seem fine without it (Python email lib and JavaMail).
  • Should we write out a Content-Transfer-Encoding header? (Currently we do not.) The spec needs a section that is explicit about our interpretation of MIME and our deviations from it (e.g. Content-Transfer-Encoding should be assumed binary for WARCs; multipart is not disallowed, but not encouraged).
  • Minor: Use WARC-Version: 0.12, like MIME-Version: 1.0, rather than WARC/0.12 as the lead-in to an ARCRecord?
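
The first TODO item can be checked informally with the Python email lib it mentions. The following is a minimal sketch, not part of this package: the header block is a hypothetical example modelled on the TODO items (a WARC-Version line, no MIME-Version, no Content-Transfer-Encoding), and the helper name parse_headers is an assumption made for illustration. It only shows that a MIME-style parser accepts such a block; it is not a WARC reader.

```python
from email import policy
from email.parser import BytesParser

# Hypothetical record, for illustration only: the field names and values
# are assumptions modelled on the TODO items, not output of the writer.
RAW_RECORD = (
    b"WARC-Version: 0.12\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)


def parse_headers(raw: bytes):
    """Parse a MIME-style header block plus body with the stdlib parser."""
    # compat32 is the legacy policy; it does not require a MIME-Version header.
    return BytesParser(policy=policy.compat32).parsebytes(raw)


msg = parse_headers(RAW_RECORD)

# Absent headers are reported as None rather than raising an error.
mime_version = msg["MIME-Version"]
transfer_encoding = msg["Content-Transfer-Encoding"]
content_type = msg.get_content_type()
body = msg.get_payload(decode=True)

print("MIME-Version:", mime_version)
print("Content-Transfer-Encoding:", transfer_encoding)
print("Content-Type:", content_type)
print("Body:", body)
```

The parser here leaves the body undecoded when no Content-Transfer-Encoding is present, which matches the "assume binary" reading suggested above, but that is a property of this one library and not a statement about the spec.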



