de.schlichtherle.io
Provides transparent, multi-threaded read/write access to archive files (ZIP, TAR, etc.) and their entries as if they were (virtual) directories and files. Archive files may be arbitrarily nested and the nesting level is only limited by heap and file system size.


Contents

  1. Basic Operations
  2. Atomicity of File System Operations
  3. Updating Archive Files
  4. Miscellaneous

Basic Operations

In order to create a new archive file, the client application can simply use {@link de.schlichtherle.io.File#mkdir()}.

In order to delete it, {@link de.schlichtherle.io.File#delete()} can be used. Similar to a regular directory this is only possible if the archive file is empty. Alternatively, the client application could use {@link de.schlichtherle.io.File#deleteAll()} in order to delete the virtual directory in one go, regardless of its contents.
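For example (a minimal sketch, assuming {@code "archive.zip"} is recognized as an archive file by the default {@link de.schlichtherle.io.ArchiveDetector}):

File archive = new File("archive.zip");
archive.mkdir();     // creates the archive file with an empty root directory
// ... create, read, update or delete entries here ...
archive.deleteAll(); // deletes the archive file regardless of its contents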

To read an archive entry, the client application can simply create a {@link de.schlichtherle.io.FileInputStream} or a {@link de.schlichtherle.io.FileReader} with the path or a {@link de.schlichtherle.io.File} instance as its constructor parameter. Note that you cannot create a {@code FileInputStream} or a {@code FileReader} to read an archive file itself (unless it's a false positive, i.e. a regular file or directory with an archive file suffix).

Likewise, to write an archive entry, the client application can simply create a {@link de.schlichtherle.io.FileOutputStream} or a {@link de.schlichtherle.io.FileWriter} with the path or a {@link de.schlichtherle.io.File} instance as its constructor parameter. Note that you cannot create a {@code FileOutputStream} or a {@code FileWriter} to write an archive file itself (unless it's a false positive, i.e. a regular file or directory with an archive file suffix).
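For example, the following sketch writes a new entry and then reads it back (the entry name and content are made up for illustration):

Writer writer = new FileWriter("archive.zip/readme.txt");
try {
    writer.write("Hello world!\n"); // creates the entry in the archive file
} finally {
    writer.close(); // ALWAYS close the stream!
}

Reader reader = new FileReader("archive.zip/readme.txt");
try {
    // ... read the entry's content here ...
} finally {
    reader.close(); // ALWAYS close the stream!
}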

If the client application just needs to copy data however, using one of the copy methods in the {@code File} class is highly recommended instead of using {@code File(In|Out)putStream} directly. These methods use asynchronous I/O (though they return synchronously), pooled big buffers, pooled threads (on JSE 5 and later) and do not need to decompress/recompress archive entry data when copying from one archive file to another for supported archive types. In addition, they are guaranteed to fail gracefully, while many Java apps fail to close their streams if an {@code IOException} occurs.
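For example (a hedged sketch; {@code copyTo} is one of the copy methods in the {@code File} class, and the paths are made up for illustration):

// Copy an entry from one archive file to another. For supported archive
// types, the entry data is transferred without decompression/recompression.
boolean ok = new File("archive.zip/entry.txt")
        .copyTo(new File("backup.zip/entry.txt")); // assumed to return false on failure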

Note that there is no equivalent to {@code java.io.RandomAccessFile} in this package because it's impossible to seek within compressed archive entry data.

Using Archive Entry Streams

When using streams, the client application should always close them in a {@code finally}-block like this:

FileOutputStream out = new FileOutputStream(file);
try {
    // Do I/O here...
} finally {
    out.close(); // ALWAYS close the stream!
}

This ensures that the stream is closed even if an exception occurs.

Note that for various (mostly archive driver specific) reasons, the {@code close()} method may throw an {@code IOException}, too. The client application needs to deal with this appropriately, for example by enclosing the entire block with another {@code try-catch}-block like this:

try {
    FileOutputStream out = new FileOutputStream(file);
    try {
        // Do I/O here...
    } finally {
        out.close(); // ALWAYS close the stream!
    }
} catch (IOException ex) {
    ex.printStackTrace();
}

This idiom is not at all specific to TrueZIP: Streams often utilize OS resources such as file descriptors, database or network connections. All OS resources are limited, however, and sometimes they are even exclusively allocated to a stream, so a stream should always be closed again as soon as possible, especially in long running server applications (relying on {@code finalize()} to do this during garbage collection is unsafe). Unfortunately, many Java applications and libraries fail in this respect.

TrueZIP is affected by open archive entry streams in the following ways:

  • Archive drivers provided by third parties may restrict the number of open input or output entry streams for an archive file. If this limit is exceeded, any attempt to open another entry stream results in a {@link de.schlichtherle.io.FileBusyException}.
  • When unmounting an archive file (see below), depending on the parameters, TrueZIP may choose to force the closing of any open entry streams or not. If the entry streams are not forced to close, the archive file cannot get unmounted and an {@link de.schlichtherle.io.ArchiveBusyException} is thrown. If the entry streams are forced to close however, the archive file is unmounted and an {@link de.schlichtherle.io.ArchiveBusyWarningException} is thrown to indicate that subsequent I/O operations on these entry streams (other than {@code close()}) will fail with an {@link de.schlichtherle.io.ArchiveEntryStreamClosedException}. Neither solution is optimal.

In order to prevent these exceptions, TrueZIP automatically closes entry streams when they are garbage collected. However, the client application should never rely on this because the delay and order in which streams are processed by the finalizer thread is not specified and any unwritten data gets lost in output streams.


Atomicity of File System Operations

In general, a file system operation is either atomic or not. In its strict sense, an atomic operation meets the following conditions:

  1. The operation either completely succeeds or completely fails. If it fails, the state of the file system is not changed.
  2. Third parties can't monitor or influence the changes as they are in progress. They can only see the result.

All reliable file system implementations meet the first condition and so does TrueZIP. However, the situation is different for the second condition:

  • TrueZIP's virtual file system implementation is running in a JVM process, so other processes could monitor and influence changes in progress.
  • TrueZIP's recognition of archive files is configurable, so other {@code File} instances could monitor and influence changes in progress.
  • TrueZIP caches state information about archive files on the heap and in temporary files, so other definitions of the classes in this package which have been loaded by other class loaders could monitor and influence changes in progress.

This implies that TrueZIP cannot provide any operations which are atomic in the strict sense. However, many file system operations in this package are declared to be virtually atomic according to their Javadoc. A virtually atomic operation meets the following conditions:

  1. The operation either completely succeeds or completely fails. If it fails, the state of the (virtual) file system is not changed.
  2. If the path does not contain any archive files, the operation is always delegated to the real file system and third parties can't monitor or influence the changes as they are in progress. They can only see the result.
  3. Otherwise, all {@code File} instances which recognize the same set of archive files in the path and share the same definition of classes in this package can't monitor or influence the changes as they are in progress. They can only see the result.

These conditions apply regardless of whether the {@code File} instances are used by different threads or not. In other words, TrueZIP is as thread-safe as you could expect a real file system to be.


Updating Archive Files

To provide random read/write access to archive files, TrueZIP needs to associate some state with every recognized archive file on the heap and in the folder for temporary files while the client application is operating on the VFS.

TrueZIP automatically mounts the VFS from an archive file on the first access. The client application can then operate on the VFS in an arbitrary manner. Finally, an archive file must get unmounted in order to update it with the accumulated modifications. Note that an archive entry gets modified by any operation which creates, modifies or deletes it.

Explicit vs. Implicit Unmounting

Archive file unmounting is performed semi-automatically:

  • Explicit unmounting happens when the client application calls {@link de.schlichtherle.io.File#umount} or {@link de.schlichtherle.io.File#update}.
  • Implicit unmounting happens when the JVM terminates (by a JVM shutdown hook) or when the client application modifies an archive entry more than once. The latter case is also called implicit remounting, because the VFS is immediately mounted again in order to continue the operation.

Explicit unmounting is required to support third-party access to an archive file (see below) or to monitor progress (see below). It also allows some control over any exceptions thrown: Both {@code umount()} and {@code update()} may throw an {@link de.schlichtherle.io.ArchiveWarningException} or an {@link de.schlichtherle.io.ArchiveException}. The client application may catch these exceptions and act on them individually (see below).

However, calling {@code umount()} or {@code update()} too often may increase the overall runtime: On each call, all remaining entries in the archive file are copied to the archive file again if the archive file already existed. If the client application explicitly unmounts the archive file after each modification, this may lead to an overall runtime of {@code O(s*s)}, where {@code s} is the size of the archive file in bytes (see below).

In comparison, implicit unmounting provides the best performance because archive files are only updated if there's really a need to. It also works reliably: The JVM shutdown hook is always run unless the JVM crashes (note that an uncaught throwable terminates the JVM, but does not crash it; a JVM crash is an extremely rare situation which indicates a bug in the JVM implementation, not a bug in the JRE or the application). Furthermore, it obviates the need to introduce a call to {@code umount()} or {@code update()} in legacy applications.

The disadvantage is that the client application cannot easily detect and deal with any exceptions thrown as a result of updating an archive file: Depending on where the implicit unmount happens, either an arbitrary {@link java.io.IOException} is thrown, a boolean value is returned, or - when called from the JVM shutdown hook - just a stack trace is printed. In addition, updating an existing archive file takes linear runtime (see below). However, using long running JVM shutdown hooks is generally discouraged: They can't use {@link java.util.logging}, they can't use a GUI to monitor progress (see below) and they can only get debugged on JSE 5 or later.

Third Party Access

Because TrueZIP associates some state with any archive file which the client application accesses for reading and/or writing, it requires exclusive access to these archive files until they get unmounted again.

Third parties must not concurrently access these archive files or their entries unless the precautions outlined below have been taken!

In this context, third parties are:

  1. Instances of the class {@code java.io.File} which are not instances of the class {@code de.schlichtherle.io.File}.
  2. Instances of the class {@code de.schlichtherle.io.File} which do not recognize the same set of archive files in the path due to the use of a differently working {@link de.schlichtherle.io.ArchiveDetector}.
  3. Other definitions of the classes in this package which have been loaded by different class loaders.
  4. Other system processes.

As a rule of thumb, the same archive file or entry within an archive file should not be accessed by different {@code File} classes ({@code java.io.File} versus {@code de.schlichtherle.io.File}) or {@code File} instances with different {@code ArchiveDetector} parameters. This ensures that the state associated with an archive file is not shadowed or bypassed.

To ensure that all {@code File} instances recognize the same set of archive files in a path, it's recommended not to use constructors or methods of the {@code File} class with explicit {@code ArchiveDetector} parameters unless there is a good reason to.
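For example (a sketch; {@code someCustomDetector} stands for a hypothetical, differently configured {@link de.schlichtherle.io.ArchiveDetector} instance):

// Recommended: rely on the default ArchiveDetector everywhere.
File good = new File("archive.zip/entry.txt");

// Discouraged without good reason: a differently configured detector may
// recognize a different set of archive files in the same path.
//File risky = new File("archive.zip/entry.txt", someCustomDetector);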

To ensure that all {@code File} instances share the same definition of classes in this package, it's recommended to add TrueZIP's JAR to the boot class path or the extension class path.

If the prerequisites for these recommendations don't apply or if the recommendations can't be followed, the client application may call {@link de.schlichtherle.io.File#umount} ({@link de.schlichtherle.io.File#update} will not work) to perform an explicit unmount. This clears all state information so that the third party can then safely access any archive file. In addition, the client application must make sure not to access the same archive file or any of its entries in any way while the third party is still accessing it.
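For example (a sketch; {@code runExternalTool} stands for any hypothetical third-party access):

File.umount(); // clears all state, so the third party may safely access the archive file
runExternalTool("archive.zip"); // hypothetical third-party access
// Do not access "archive.zip" or any of its entries until the tool is done!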

Failure to comply with these guidelines may result in unpredictable behavior and may even cause loss of data!

Exception Handling

{@code umount()} and {@code update()} are guaranteed to process all archive files which are in use or have been touched by the client application. However, processing some of these archive files may fail for a number of I/O related reasons. Hence, during processing, a sequential chain of archive exceptions is constructed and thrown upon termination unless it's empty. Note that sequential exception chaining is a concept which is completely orthogonal to Java's general exception cause chaining: In a sequential archive exception chain, each archive exception may still have a chain of other exceptions as its cause (most likely {@code IOException}s).

Archive exceptions fall into two categories:

  1. The class {@link de.schlichtherle.io.ArchiveWarningException} is the root of all warning exception types. These exceptions are thrown if an archive file has been completely updated, but some warning conditions apply. No data has been lost.
  2. Its superclass {@link de.schlichtherle.io.ArchiveException} is the root of all other exception types (unless it's an {@code ArchiveWarningException} again). These exceptions are thrown if an archive file could not get updated completely. This implies loss of some or all data in the respective archive file.

Note that the effect which is indicated by an archive exception is local: An exception thrown when processing an archive file does not imply an archive exception or loss of data when processing another archive file.

When the archive exception chain is thrown by these methods, it's first sorted according to (1) descending order of priority and (2) ascending order of appearance, and the resulting head exception is then thrown. Since {@code ArchiveWarningException}s have a lower priority than {@code ArchiveException}s, they are always pushed back to the end of the chain, so that an application can use the following simple idiom to detect if only some warnings or at least one severe error has occurred:

try {
    File.umount(); // with or without parameters
} catch (ArchiveWarningException oops) {
    // Only instances of the class ArchiveWarningException exist in
    // the sequential chain of exceptions. We decide to ignore this.
} catch (ArchiveException ouch) {
    // At least one exception occurred which is not just an
    // ArchiveWarningException. This is a severe situation that
    // needs to be handled.

    // Print the sequential chain of exceptions in order of
    // descending priority and ascending appearance.
    //ouch.printStackTrace();

    // Print the sequential chain of exceptions in order of
    // appearance instead.
    ouch.sortAppearance().printStackTrace();
}

Note that the {@link java.lang.Throwable#getMessage()} method (and hence {@link java.lang.Throwable#printStackTrace()}) will concatenate the detail messages of the exceptions in the sequential chain in the given order.

Performance Considerations

Unmounting a modified archive file is a linear runtime operation: If the size of the resulting archive file is {@code s} bytes, the operation always completes in {@code O(s)}, even if only a single, small archive entry has been modified within a very large archive file. Unmounting an unmodified or newly created archive file is a constant runtime operation: It always completes in {@code O(1)}. These magnitudes are independent of whether unmounting was performed explicitly or implicitly.

Now if the client application modifies each entry in a loop and accidentally triggers unmounting the archive file on each iteration, then the overall runtime increases to {@code O(s*s)}! Here's an example:

String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // O(1)
    File.umount(); // O(i + 1) !!
}
// Overall: O(n*n) !!!

The bad runtime is because {@code umount()} is called within the loop. Moving it out of the loop fixes the issue:

String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // O(1)
}
File.umount(); // new file: O(1); modified: O(n)
// Overall: O(n)

In essence: The client application should never call {@code umount()} or {@code update()} from within a loop which modifies an archive file.

The situation gets more complicated with implicit remounting: If an archive entry which has already been modified shall get modified again, TrueZIP implicitly remounts the archive file in order to avoid writing duplicate entries to it (which would waste space and may even confuse other utilities). Here's an example:

String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // First modification: O(1)
    entry.createNewFile(); // Second modification triggers remount: O(i + 1) !!
}
// Overall: O(n*n) !!!

Each call to {@code createNewFile()} is a modification operation. Hence, on the second call to this method, TrueZIP needs to do an implicit remount which writes all entries in the archive file created so far to disk again.

Unfortunately, a modification operation is not always so easy to spot. Consider the following example to create an archive file with empty entries which all share the same last modification time:

long time = System.currentTimeMillis();
String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) { // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile(); // First modification: O(1)
    entry.setLastModified(time); // Second modification triggers remount: O(i + 1) !!
}
// Overall: O(n*n) !!!

When {@code setLastModified()} gets called, the entry has already been written and so an implicit remount is triggered, which writes all entries in the archive file created so far to disk again.

Detail: This deficiency is caused by archive file formats: All currently supported archive types require an entry's meta data (including the last modification time) to be written to the archive file before its content. So if the meta data is to be modified, the archive entry and hence the whole archive file needs to get rewritten, which is what the implicit remount is doing.

To avoid accidental remounting when copying data, you should consider using the advanced copy methods instead. These methods are easy to use, work reliably and provide superior performance.
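For example, a hedged sketch (assuming {@code copyAllTo}, the bulk counterpart of the copy methods mentioned above, which copies a whole directory tree in one pass):

// Each entry is written exactly once, so no implicit remount is triggered.
boolean ok = new File("dir").copyAllTo(new File("archive.zip"));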

Monitoring Progress

When unmounting, the client application can monitor the progress from another thread using {@link de.schlichtherle.io.File#getLiveArchiveStatistics()}. The returned instance is a proxy which returns live statistics about the update process.

Here's an example how to monitor unmounting progress on standard error output after an initial delay of two seconds:

class ProgressMonitor extends Thread {
    Long[] args = new Long[2];
    ArchiveStatistics liveStats = File.getLiveArchiveStatistics();

    ProgressMonitor() {
        setPriority(Thread.MAX_PRIORITY);
        setDaemon(true);
    }

    public void run() {
        boolean run = false;
        for (long sleep = 2000; ; sleep = 200, run = true) {
            try {
                Thread.sleep(sleep);
            } catch (InterruptedException shutdown) {
                break;
            }
            showProgress();
        }
        if (run) {
            showProgress();
            System.err.println();
        }
    }

    void showProgress() {
        // Round up to kilobytes.
        args[0] = new Long(
                (liveStats.getUpdateTotalByteCountRead() + 1023) / 1024);
        args[1] = new Long(
                (liveStats.getUpdateTotalByteCountWritten() + 1023) / 1024);
        System.err.print(MessageFormat.format(
                "Top level archive IO: {0} / {1} KB        \r", args));
    }

    void shutdown() {
        interrupt();
        try {
            join();
        } catch (InterruptedException interrupted) {
            interrupted.printStackTrace();
        }
    }
}

// ...

ProgressMonitor monitor = new ProgressMonitor();
monitor.start();
try {
    File.umount();
} finally {
    monitor.shutdown();
}

Conclusions

Here are some guidelines to find the right balance between performance and control:

  1. When the JVM terminates, calling {@code umount()} is recommended in order to handle exceptions explicitly, but not required because TrueZIP's JVM shutdown hook takes care of unmounting anyway and prints the stack trace of any exceptions to the standard error output.
  2. Otherwise, in order to achieve best performance, {@code umount()} or {@code update()} should not get called unless either third party access or explicit exception handling is required.
  3. For the same reason, these methods should never get called in a loop which modifies an archive file.
  4. {@code umount()} is generally preferred over {@code update()} for safety reasons.

Miscellaneous

Virtual Directories in Archive Files

The top level entries in an archive file form its root directory. The root directory is never written to the output when an archive file is modified.

To the client application, the root directory behaves like any other directory and is addressed by naming the archive file in a path: For example, the client application may list its contents by calling {@link de.schlichtherle.io.File#list()} or {@link de.schlichtherle.io.File#listFiles()}.

The root directory receives its last modification time from the archive file whenever it's read. Likewise, the archive file will receive the root directory's last modification time whenever it's written.

While this is a proper emulation of the behavior of real file systems, it may confuse users if only entries which are located one level or more below the root directory have been changed in an existing archive file: In this case, the last modification time of the root directory is not updated and hence the archive file's last modification time will not reflect the changes in the deeper directory levels.

As a workaround, the client application can use the idiom {@code isArchive() && isDirectory()} (see {@link de.schlichtherle.io.File#isArchive()} and {@link de.schlichtherle.io.File#isDirectory()}) to detect an archive file and explicitly change the last modification time of its root directory by calling {@link de.schlichtherle.io.File#setLastModified(long)}.
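For example:

File file = new File("archive.zip");
if (file.isArchive() && file.isDirectory()) // detects a (real) archive file
    file.setLastModified(System.currentTimeMillis());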

An archive may contain directories for which no entry is present in the file although they contain at least one member in their directory tree for which an entry is actually present in the file. Similarly, if {@link de.schlichtherle.io.File#isLenient} returns {@code true} (which is the default), an archive entry may be created in an archive file although its parent directory hasn't been explicitly created by calling {@link de.schlichtherle.io.File#mkdir} before.

Such a directory is called a ghost directory: Like the root directory, a ghost directory is not written to the output whenever an archive file is modified. This is to mimic the behavior of most archive utilities which do not create archive entries for directories.

To the client application, a ghost directory behaves like a regular directory with the exception that its last modification time returned by {@link de.schlichtherle.io.File#lastModified()} is {@code 0L}. If the client application sets the last modification time explicitly using {@link de.schlichtherle.io.File#setLastModified(long)}, then the ghost directory reincarnates as a regular directory and will be output to the archive file.

Mind that a ghost directory can only exist within an archive file, but not every directory within an archive file is actually a ghost directory.
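The following sketch illustrates the life cycle of a ghost directory (assuming {@link de.schlichtherle.io.File#isLenient} returns {@code true}):

File dir = new File("archive.zip/dir");
new File(dir, "file").createNewFile(); // makes "dir" a ghost directory
long time = dir.lastModified();        // 0L, since "dir" is a ghost
dir.setLastModified(System.currentTimeMillis()); // reincarnates "dir"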

Entry Names in Archive Files

File paths may be composed of elements which either refer to regular nodes in the real file system (directories, files or special files), including top level archive files, or refer to entries within an archive file.

As usual in Java, elements in a path which refer to regular nodes may be case sensitive or not in TrueZIP's VFS, depending on the real file system and/or the platform.

However, elements in a path which refer to archive entries are always case sensitive. This enables the client application to address all files in existing archive files, regardless of the operating system they've been created on.

For existing archive files, redundant elements in entry names such as the empty string ({@code ""}), the dot ({@code "."}) directory, or the dot-dot ({@code ".."}) directory are removed in the VFS when the archive file is read and not retained when the archive file is modified.

If an entry name contains characters which have no representation in the character set of the corresponding archive file type, then all file operations to create the archive entry will fail gracefully according to the documented contract of the respective operation. This is to protect the client application from creating archive entries which cannot get encoded and decoded again correctly. For example, the Euro sign (€) does not have a representation in the IBM437 character set and hence cannot be used for entries in ordinary ZIP files unless TrueZIP's configuration is customized to use another charset.

If an archive file contains entries with absolute entry names, such as {@code /readme.txt} rather than {@code readme.txt}, the client application cannot address these entries using the VFS in this package. However, these entries are retained like any other entry whenever the client application modifies the archive file. This should not impose problems, as absolute entry names should never be used anyway and I'm not aware of any recent tools which would allow creating them.

If an archive file contains both a file and a directory entry with the same name, it's up to the individual methods how they behave in this case. This can happen only with archive files created by external tools. Both {@link de.schlichtherle.io.File#isDirectory()} and {@link de.schlichtherle.io.File#isFile()} will return {@code true} in this case, and in fact they are the only methods the client application can rely upon to act properly in this situation: Many other methods use a combination of {@code isDirectory()} and {@code isFile()} calls and will show undefined behavior.

The good news is that both the file and the directory coexist in the virtual archive file system implemented by this package. Thus, whenever the archive file is modified, both entries will be retained and no data gets lost. This allows you to use another tool to fix the issue in the archive file. TrueZIP never allows the client application to create such an archive file, however.




