Provides transparent, multi-threaded read/write access to archive
files (ZIP, TAR, etc.) and their entries as if they were (virtual)
directories and files.
Archive files may be arbitrarily nested and the nesting level is
only limited by heap and file system size.
Basic Operations
In order to create a new archive file, the client application can simply use
{@link de.schlichtherle.io.File#mkdir()}.
In order to delete it, {@link de.schlichtherle.io.File#delete()} can be used.
As with a regular directory, this is only possible if the archive file is empty.
Alternatively, the client application could use {@link de.schlichtherle.io.File#deleteAll()}
in order to delete the virtual directory in one go, regardless of its contents.
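For illustration, here's a minimal sketch of this life cycle (the path {@code "archive.zip"} is a hypothetical example):
import de.schlichtherle.io.File;

File archive = new File("archive.zip");
archive.mkdir();     // create an empty archive file
// ... populate the virtual directory ...
archive.deleteAll(); // delete it regardless of its contents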
To read an archive entry, the client application can simply create a {@link de.schlichtherle.io.FileInputStream}
or a {@link de.schlichtherle.io.FileReader} with the path or a {@link de.schlichtherle.io.File}
instance as its constructor parameter. Note that you cannot create a {@code FileInputStream}
or a {@code FileReader} to read an archive file itself (unless it's a false
positive, i.e. a regular file or directory with an archive file suffix).
Likewise, to write an archive entry, the client application can simply create
a {@link de.schlichtherle.io.FileOutputStream} or a {@link de.schlichtherle.io.FileWriter}
with the path or a {@link de.schlichtherle.io.File} instance as its constructor
parameter. Note that you cannot create a {@code FileOutputStream} or a
{@code FileWriter} to write an archive file itself (unless it's a false positive,
i.e. a regular file or directory with an archive file suffix).
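For illustration, here's a minimal sketch of both operations (the paths are hypothetical examples):
import de.schlichtherle.io.FileInputStream;
import de.schlichtherle.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

// Write the archive entry "readme.txt" within "archive.zip".
OutputStream out = new FileOutputStream("archive.zip/readme.txt");
try {
    out.write("Hello archive!".getBytes());
} finally {
    out.close();
}

// Read it back.
InputStream in = new FileInputStream("archive.zip/readme.txt");
try {
    // ... read from in ...
} finally {
    in.close();
}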
However, if the client application just needs to copy data, using one of the
copy methods in the {@code File} class
is highly recommended over using {@code FileInputStream} or {@code FileOutputStream} directly.
These methods use asynchronous I/O (though they return synchronously), pooled big
buffers, pooled threads (on JSE 5 and later) and do not need to decompress/recompress
archive entry data when copying from one archive file to another for supported archive
types. In addition, they are guaranteed to fail gracefully, while many Java apps
fail to close their streams if an {@code IOException} occurs.
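For example, here's a sketch which assumes the {@code copyTo(java.io.File)} method (one of the copy methods in the {@code File} class; consult its Javadoc for the full set and exact signatures):
import de.schlichtherle.io.File;

// Copy an entry from one archive file to another. For supported archive
// types, the entry data is not decompressed and recompressed in transit.
File src = new File("src.zip/readme.txt");
File dst = new File("dst.zip/readme.txt");
if (!src.copyTo(dst))
    System.err.println("Copy failed!");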
Note that there is no equivalent to {@code java.io.RandomAccessFile} in this
package because it's impossible to seek within compressed archive entry data.
Using Archive Entry Streams
When using streams, the client application should always close them in a
{@code finally}-block like this:
FileOutputStream out = new FileOutputStream(file);
try {
    // Do I/O here...
} finally {
    out.close(); // ALWAYS close the stream!
}
This ensures that the stream is closed even if an exception occurs.
Note that for various (mostly archive driver specific) reasons, the {@code close()}
method may throw an {@code IOException}, too. The client application needs
to deal with this appropriately, for example by enclosing the entire block with
another {@code try-catch}-block like this:
try {
    FileOutputStream out = new FileOutputStream(file);
    try {
        // Do I/O here...
    } finally {
        out.close(); // ALWAYS close the stream!
    }
} catch (IOException ex) {
    ex.printStackTrace();
}
This idiom is not at all specific to TrueZIP: Streams often utilize OS resources
such as file descriptors, database or network connections. All OS resources are
limited, however, and sometimes they are even exclusively allocated to a stream,
so a stream should always be closed again as soon as possible, especially in long
running server applications (relying on {@code finalize()} to do this during
garbage collection is unsafe). Unfortunately, many Java applications and libraries
fail in this respect.
TrueZIP is affected by open archive entry streams in the following ways:
- Archive drivers provided by third parties may restrict the number of open
input or output entry streams for an archive file. If this is exceeded, any
attempt to open another entry stream results in a {@link de.schlichtherle.io.FileBusyException}.
- When unmounting an archive file (see below), depending
on the parameters, TrueZIP may choose to force the closing of any open entry
streams or not. If the entry streams are not forced to close, the archive
file cannot get unmounted and an {@link de.schlichtherle.io.ArchiveBusyException}
is thrown. If the entry streams are forced to close however, the archive file
is unmounted and an {@link de.schlichtherle.io.ArchiveBusyWarningException}
is thrown to indicate that subsequent I/O operations on these entry streams
(other than {@code close()}) will fail with an {@link de.schlichtherle.io.ArchiveEntryStreamClosedException}.
Neither solution is optimal.
In order to prevent these exceptions, TrueZIP automatically closes entry streams
when they are garbage collected. However, the client application should never rely
on this because the delay and order in which streams are processed by the finalizer
thread is not specified and any unwritten data gets lost in output streams.
Atomicity of File System Operations
In general, a file system operation is either atomic or not. In its strict
sense, an atomic operation meets the following conditions:
- The operation either completely succeeds or completely fails. If it fails,
the state of the file system is not changed.
- Third parties can't monitor or influence the
changes as they are in progress. They can only see the result.
All reliable file system implementations meet the first condition and so does
TrueZIP. However, the situation is different for the second condition:
- TrueZIP's virtual file system implementation is running in a JVM process,
so other processes could monitor and influence changes in progress.
- TrueZIP's recognition of archive files is configurable, so other {@code File}
instances could monitor and influence changes in progress.
- TrueZIP caches state information about archive files on the heap and in
temporary files, so other definitions of the classes in this package which have
been loaded by other class loaders could monitor and influence changes in progress.
This implies that TrueZIP cannot provide any operations which are atomic in the
strict sense. However, many file system operations in this package are declared
to be virtually atomic in their Javadoc. A virtually atomic operation
meets the following conditions:
- The operation either completely succeeds or completely fails. If it fails,
the state of the (virtual) file system is not changed.
- If the path does not contain any archive files, the operation is always
delegated to the real file system and third parties can't monitor or influence
the changes as they are in progress. They can only see the result.
- Otherwise, all {@code File} instances which recognize the same set
of archive files in the path and share the same definition of classes in this
package can't monitor or influence the changes as they are in progress. They
can only see the result.
These conditions apply regardless of whether the {@code File} instances
are used by different threads or not. In other words, TrueZIP is as thread safe
as you could expect from a real file system.
Updating Archive Files
To provide random read/write access to archive files, TrueZIP needs to associate
some state with every recognized archive file on the heap and in the folder for temporary
files while the client application is operating on the VFS.
TrueZIP automatically mounts the VFS from an archive file on first
access. The client application can then operate on the VFS in an arbitrary manner.
Finally, an archive file must get unmounted in order to update it with the
accumulated modifications. Note that an archive entry gets modified by any operation
which creates, modifies or deletes it.
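To illustrate this life cycle (the path is a hypothetical example):
import de.schlichtherle.io.File;
import de.schlichtherle.io.FileOutputStream;
import java.io.OutputStream;

File entry = new File("archive.zip/readme.txt"); // no mounting yet
entry.isFile();                                  // first access mounts the VFS
OutputStream out = new FileOutputStream(entry);  // modifies the entry
try {
    out.write("Hello!".getBytes());
} finally {
    out.close();
}
File.umount(); // explicit unmount: updates the archive file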
Explicit vs. Implicit Unmounting
Archive file unmounting is performed semi-automatically:
- Explicit unmounting happens when the client application calls {@link
de.schlichtherle.io.File#umount} or {@link de.schlichtherle.io.File#update}.
- Implicit unmounting happens when the JVM terminates (by a JVM shutdown
hook) or when the client application modifies an archive entry more than once.
The latter case is also called implicit remounting, because the VFS is
immediately mounted again in order to continue the operation.
Explicit unmounting is required to support third-party access to an archive file
(see below) or to monitor progress (see
below). It also allows some control over any exceptions
thrown: Both {@code umount()} and {@code update()} may throw an {@link
de.schlichtherle.io.ArchiveWarningException} or an {@link de.schlichtherle.io.ArchiveException}.
The client application may catch these exceptions and act on them individually (see
below).
However, calling {@code umount()} or {@code update()} too often may
increase the overall runtime: On each call, all remaining entries in the archive
file are copied to the archive file again if the archive file already existed.
If the client application explicitly unmounts the archive file after each modification,
this may lead to an overall runtime of {@code O(s*s)}, where {@code s}
is the size of the archive file in bytes (see below).
In comparison, implicit unmounting provides the best performance because archive
files are only updated if there's really a need to. It also works reliably: The
JVM shutdown hook is always run unless the JVM crashes
(note
that an uncaught throwable terminates the JVM, but does not crash
it - a JVM crash is an extremely rare situation which indicates a bug in the JVM
implementation, not a bug in the JRE or the application). Furthermore, it obviates
the need to introduce a call to {@code umount()} or {@code update()} in
legacy applications.
The disadvantage is that the client application cannot easily detect
and deal with any exceptions thrown as a result of updating an
archive file:
Depending on where the implicit unmount happens, either an
arbitrary {@link java.io.IOException} is thrown, a boolean value
is returned, or - when called from the JVM shutdown hook - just a
stack trace is printed.
In addition, updating an existing archive file takes linear runtime
(see below). However, using long running
JVM shutdown hooks is generally discouraged: They can't use
{@link java.util.logging}, they can't use a GUI to monitor
progress (see below) and they can only
get debugged on JSE 5 or later.
Third Party Access
Because TrueZIP associates some state with any archive file which the client
application accesses for reading and/or writing, it requires exclusive access to these
archive files until they get unmounted again.
Third parties must not concurrently access these archive
files or their entries unless the precautions outlined
below have been taken!
In this context, third parties are:
- Instances of the class {@code java.io.File} which are not instances
of the class {@code de.schlichtherle.io.File}.
- Instances of the class {@code de.schlichtherle.io.File} which do not
recognize the same set of archive files in the path due to the use of a differently
working {@link de.schlichtherle.io.ArchiveDetector}.
- Other definitions of the classes in this package which have been loaded
by different class loaders.
- Other system processes.
As a rule of thumb, the same archive file or entry within an archive file should
not be accessed by different {@code File} classes ({@code java.io.File}
versus {@code de.schlichtherle.io.File}) or {@code File} instances with
different {@code ArchiveDetector} parameters. This ensures that the state associated
with an archive file is not shadowed or bypassed.
To ensure that all {@code File} instances recognize the same set of archive
files in a path, it's recommended not to use constructors or methods of
the {@code File} class with explicit {@code ArchiveDetector} parameters
unless there is good reason to.
To ensure that all {@code File} instances share the same definition of classes
in this package, it's recommended to add TrueZIP's JAR to the boot class path or
the extension class path.
If the prerequisites for these recommendations don't apply or if the recommendations
can't be followed, the client application may call {@link de.schlichtherle.io.File#umount}
({@link de.schlichtherle.io.File#update} will not work) to perform an explicit
unmount. This clears all state information so that the third party can then safely
access any archive file. In addition, the client application must make sure not
to access the same archive file or any of its entries in any way while the third
party is still accessing it.
Failure to comply to these guidelines may result in
unpredictable behavior and may even cause loss of data!
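For example, before handing an archive file over to another system process (the external command is a hypothetical example):
import de.schlichtherle.io.File;

// ... operate on "archive.zip" via the VFS ...
File.umount(); // clear all state so the third party can safely access it
// This application must not touch "archive.zip" or its entries now!
Process third = Runtime.getRuntime().exec(
        new String[] { "unzip", "-l", "archive.zip" });
third.waitFor(); // afterwards, this application may access it again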
Exception Handling
{@code umount()} and {@code update()} are guaranteed to process
all archive files which are in use or have been touched by the client application.
However, processing some of these archive files may fail for a number of I/O related
reasons. Hence, during processing, a sequential chain of archive exceptions
is constructed and thrown upon termination unless it's empty. Note that sequential
exception chaining is a concept which is completely orthogonal to Java's general
exception cause chaining: In a sequential archive exception chain, each archive
exception may still have a chain of other exceptions as its cause (most likely
{@code IOException}s).
Archive exceptions fall into two categories:
- The class {@link de.schlichtherle.io.ArchiveWarningException} is the root
of all warning exception types. These exceptions are thrown if an archive file
has been completely updated, but some warning conditions apply. No data has
been lost.
- Its super class {@link de.schlichtherle.io.ArchiveException} is the root
of all other exception types (unless it's an {@code ArchiveWarningException}
again). These exceptions are thrown if an archive file could not get updated
completely. This implies loss of some or all data in the respective archive
file.
Note that the effect which is indicated by an archive exception is local: An
exception thrown when processing an archive file does not imply an archive exception
or loss of data when processing another archive file.
When the archive exception chain is thrown by these methods, it's first sorted
according to (1) descending order of priority and (2) ascending order of appearance,
and the resulting head exception is then thrown. Since {@code ArchiveWarningException}s
have a lower priority than {@code ArchiveException}s, they are always pushed
back to the end of the chain, so that an application can use the following simple
idiom to detect whether only warnings or at least one severe error has occurred:
try {
    File.umount(); // with or without parameters
} catch (ArchiveWarningException oops) {
    // Only instances of the class ArchiveWarningException exist in
    // the sequential chain of exceptions. We decide to ignore this.
} catch (ArchiveException ouch) {
    // At least one exception occurred which is not just an
    // ArchiveWarningException. This is a severe situation that
    // needs to be handled.

    // Print the sequential chain of exceptions in order of
    // descending priority and ascending appearance.
    //ouch.printStackTrace();

    // Print the sequential chain of exceptions in order of
    // appearance instead.
    ouch.sortAppearance().printStackTrace();
}
Note that the {@link java.lang.Throwable#getMessage()} method (and hence {@link
java.lang.Throwable#printStackTrace()}) will concatenate the detail messages of the
exceptions in the sequential chain in the given order.
Performance Considerations
Unmounting a modified archive file is a linear runtime operation: If the size
of the resulting archive file is s bytes, the operation always completes
in O(s), even if only a single, small archive entry has been modified
within a very large archive file. Unmounting an unmodified or newly created archive
file is a constant runtime operation: It always completes in O(1). These
magnitudes are independent of whether unmounting was performed explicitly or implicitly.
Now if the client application modifies each entry in a loop and accidentally
triggers unmounting the archive file on each iteration, then the overall runtime
increases to O(s*s)! Here's an example:
String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) {                       // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile();                          // O(1)
    File.umount();                                  // O(i + 1) !!
}
// Overall: O(n*n) !!!
The poor runtime is caused by calling {@code umount()} within the loop. Moving
it out of the loop fixes the issue:
String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) {                       // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile();                          // O(1)
}
File.umount(); // new file: O(1); modified: O(n)
// Overall: O(n)
In essence: The client application should never call {@code umount()}
or {@code update()} in a loop which modifies an archive file.
The situation gets more complicated with implicit remounting: If a file entry
shall get modified which already has been modified before, TrueZIP implicitly remounts
the archive file in order to avoid writing duplicated entries to it (which would
waste space and may even confuse other utilities). Here's an example:
String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) {                       // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile();                          // First modification: O(1)
    entry.createNewFile();                          // Second modification triggers remount: O(i + 1) !!
}
// Overall: O(n*n) !!!
Each call to {@code createNewFile()} is a modification operation. Hence,
on the second call to this method, TrueZIP needs to do an implicit remount which
writes all entries in the archive file created so far to disk again.
Unfortunately, a modification operation is not always so easy to spot. Consider
the following example to create an archive file with empty entries which all share
the same last modification time:
long time = System.currentTimeMillis();
String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) {                       // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile();                          // First modification: O(1)
    entry.setLastModified(time);                    // Second modification triggers remount: O(i + 1) !!
}
// Overall: O(n*n) !!!
When {@code setLastModified()} gets called, the entry has already been written
and so an implicit remount is triggered, which writes all entries in the archive
file created so far to disk again.
Detail: This deficiency is caused by the archive file formats: All currently
supported archive types require an entry's meta data (including the last
modification time) to be written to the archive file before its content. So if the meta data is
to be modified, the archive entry and hence the whole archive file needs to get
rewritten, which is what the implicit remount is doing.
To avoid accidental remounting when copying data, you should consider using the
advanced copy methods instead. These methods
are easy to use, work reliably and provide superior performance.
Monitoring Progress
When unmounting, the client application can monitor the progress from another thread
using {@link de.schlichtherle.io.File#getLiveArchiveStatistics()}. The returned
instance is a proxy which returns live statistics about the update process.
Here's an example of how to monitor unmounting progress on the standard error output
after an initial delay of two seconds:
import java.text.MessageFormat;
import de.schlichtherle.io.ArchiveStatistics;
import de.schlichtherle.io.File;

class ProgressMonitor extends Thread {
    Long[] args = new Long[2];
    ArchiveStatistics liveStats = File.getLiveArchiveStatistics();

    ProgressMonitor() {
        setPriority(Thread.MAX_PRIORITY);
        setDaemon(true);
    }

    public void run() {
        boolean run = false;
        for (long sleep = 2000; ; sleep = 200, run = true) {
            try {
                Thread.sleep(sleep);
            } catch (InterruptedException shutdown) {
                break;
            }
            showProgress();
        }
        if (run) {
            showProgress();
            System.err.println();
        }
    }

    void showProgress() {
        // Round up to kilobytes.
        args[0] = new Long(
                (liveStats.getUpdateTotalByteCountRead() + 1023) / 1024);
        args[1] = new Long(
                (liveStats.getUpdateTotalByteCountWritten() + 1023) / 1024);
        System.err.print(MessageFormat.format(
                "Top level archive IO: {0} / {1} KB \r", args));
    }

    void shutdown() {
        interrupt();
        try {
            join();
        } catch (InterruptedException interrupted) {
            interrupted.printStackTrace();
        }
    }
}

// ...
ProgressMonitor monitor = new ProgressMonitor();
monitor.start();
try {
    File.umount();
} finally {
    monitor.shutdown();
}
Conclusions
Here are some guidelines to find the right balance between performance and control:
- When the JVM terminates, calling {@code umount()}
is recommended in order to handle exceptions explicitly, but not required because
TrueZIP's JVM shutdown hook takes care of unmounting anyway and prints the stack trace
of any exceptions on the standard error output.
- Otherwise, in order to achieve best performance, {@code umount()} or
{@code update()} should not get called unless either
third party access or explicit
exception handling is required.
- For the same reason, these methods should never get called in a
loop which modifies an archive file.
- {@code umount()} is generally preferred over {@code update()}
for safety reasons.
Miscellaneous
Virtual Directories in Archive Files
The top level entries in an archive file form its root directory. The
root directory is never written to the output when an archive file is modified.
To the client application, the root directory behaves like any other directory
and is addressed by naming the archive file in a path: For example, the client application
may list its contents by calling {@link de.schlichtherle.io.File#list()} or {@link
de.schlichtherle.io.File#listFiles()}.
The root directory receives its last modification time from the archive file
whenever it's read. Likewise, the archive file will receive the root directory's
last modification time whenever it's written.
While this is a proper emulation of the behavior of real file systems, it may
confuse users if only entries which are located one level or more below the root
directory have been changed in an existing archive file: In this case, the last
modification time of the root directory is not updated and hence the archive file's
last modification time will not reflect the changes in the deeper directory levels.
As a workaround, the client application can use the idiom {@code {@link de.schlichtherle.io.File#isArchive()}
&& {@link de.schlichtherle.io.File#isDirectory()}} to detect an archive file
and explicitly change the last modification time of its root directory by calling
{@link de.schlichtherle.io.File#setLastModified(long)}.
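For instance:
import de.schlichtherle.io.File;

File file = new File("archive.zip");
if (file.isArchive() && file.isDirectory()) // a true archive file
    file.setLastModified(System.currentTimeMillis());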
An archive may contain directories for which no entry is present in the file
although they contain at least one member in their directory tree for which an entry
is actually present in the file. Similarly, if {@link de.schlichtherle.io.File#isLenient}
returns {@code true} (which is the default), an archive entry may be created
in an archive file although its parent directory hasn't been explicitly created
by calling {@link de.schlichtherle.io.File#mkdir} before.
Such a directory is called a ghost directory: Like the root directory,
a ghost directory is not written to the output whenever an archive file is modified.
This is to mimic the behavior of most archive utilities which do not create archive
entries for directories.
To the client application, a ghost directory behaves like a regular directory
with the exception that its last modification time returned by {@link de.schlichtherle.io.File#lastModified()}
is {@code 0L}. If the client application sets the last modification time explicitly
using {@link de.schlichtherle.io.File#setLastModified(long)}, then the ghost directory
reincarnates as a regular directory and will be output to the archive file.
Note that a ghost directory can only exist within an archive file, but not every
directory within an archive file is actually a ghost directory.
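For illustration, the following sketch creates an entry deep within an archive file without calling {@code mkdir()} first, so that its parent directories become ghost directories (assuming {@code File.isLenient()} returns {@code true}, which is the default; the paths are hypothetical):
import de.schlichtherle.io.File;
import de.schlichtherle.io.FileOutputStream;
import java.io.OutputStream;

// "a" and "b" become ghost directories: they exist in the VFS, but no
// entries are written for them when the archive file is updated.
OutputStream out = new FileOutputStream("archive.zip/a/b/c.txt");
try {
    out.write("data".getBytes());
} finally {
    out.close();
}
File ghost = new File("archive.zip/a/b");
assert ghost.isDirectory();
assert ghost.lastModified() == 0L; // the mark of a ghost directory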
Entry Names in Archive Files
File paths may be composed of elements which either refer to regular nodes in
the real file system (directories, files or special files), including top level
archive files, or refer to entries within an archive file.
As usual in Java, elements in a path which refer to regular nodes may be case
sensitive or not in TrueZIP's VFS, depending on the real file system and/or the
platform.
However, elements in a path which refer to archive entries are always case sensitive.
This enables the client application to address all files in existing archive files,
regardless of the operating system they've been created on.
For existing archive files, redundant elements in entry names such as the empty
string ({@code ""}), the dot ({@code "."}) directory, or the dot-dot ({@code ".."})
directory are removed in the VFS when the archive file is read and not
retained when the archive file is modified.
If an entry name contains characters which have no representation in the character
set of the corresponding archive file type, then all file operations to
create the archive entry will fail gracefully according to the documented contract
of the respective operation. This is to protect the client application from creating
archive entries which cannot get encoded and decoded again correctly. For example,
the Euro sign (€) does not have a representation in the IBM437 character set and
hence cannot be used for entries in ordinary ZIP files unless TrueZIP's configuration
is customized to use another charset.
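Here's a hedged sketch of what this might look like (the exact failure mode depends on the contract of the operation used):
import de.schlichtherle.io.File;
import java.io.IOException;

File entry = new File("archive.zip", "\u20ac.txt"); // Euro sign in name
try {
    if (!entry.createNewFile())
        System.err.println("Could not create the entry!");
} catch (IOException ex) {
    // The entry name cannot be encoded in IBM437, so the operation
    // fails gracefully instead of writing a corrupt entry name.
    ex.printStackTrace();
}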
If an archive file contains entries with absolute entry names, such as /readme.txt
rather than readme.txt, the client application cannot address these entries
using the VFS in this package. However, these entries are retained like any other
entry whenever the client application modifies the archive file. This should not
pose a problem, as absolute entry names should never be used anyway and I'm not
aware of any recent tools which would allow creating them.
If an archive file contains both a file and a directory entry with the same name,
it's up to the individual methods how they behave in this case. This can only happen
with archive files created by external tools. Both {@link de.schlichtherle.io.File#isDirectory()}
and {@link de.schlichtherle.io.File#isFile()} will return {@code true} in this
case, and in fact they are the only methods the client application can rely upon
to act properly in this situation: Many other methods use a combination of {@code
isDirectory()} and {@code isFile()} calls and will show undefined
behavior.
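For instance, a sketch of detecting this situation (the path is a hypothetical example):
import de.schlichtherle.io.File;

File node = new File("broken.zip/name");
if (node.isFile() && node.isDirectory()) {
    // Both a file and a directory entry named "name" exist in the
    // archive file - consider fixing it with another tool.
}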
The good news is that both the file and the directory coexist in the virtual
archive file system implemented by this package. Thus, whenever the archive file
is modified, both entries will be retained and no data gets lost. This allows you
to use another tool to fix the issue in the archive file. TrueZIP never allows the
client application to create such an archive file, however.