Provides transparent, multi-threaded read/write access to archive
files (ZIP, TAR, etc.) and their entries as if they were (virtual)
directories and files.
Archive files may be arbitrarily nested and the nesting level is
only limited by heap and file system size.
Basic Operations
In order to create a new archive file, the client application can simply use
{@link de.schlichtherle.io.File#mkdir()}.
In order to delete it, {@link de.schlichtherle.io.File#delete()} can be used.
As with a regular directory, this is only possible if the archive file is empty.
Alternatively, the client application could use {@link de.schlichtherle.io.File#deleteAll()}
in order to delete the virtual directory in one go, regardless of its contents.
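For illustration, here's a minimal sketch of this life cycle (the path {@code "archive.zip"} is a hypothetical example):
import de.schlichtherle.io.File;

File archive = new File("archive.zip");
archive.mkdir();     // create an empty archive file
// ... populate the virtual directory ...
archive.deleteAll(); // delete it regardless of its contents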
To read an archive entry, the client application can simply create a {@link de.schlichtherle.io.FileInputStream}
or a {@link de.schlichtherle.io.FileReader} with the path or a {@link de.schlichtherle.io.File}
instance as its constructor parameter. Note that you cannot create a {@code FileInputStream}
or a {@code FileReader} to read an archive file itself (unless it's a false
positive, i.e. a regular file or directory with an archive file suffix).
Likewise, to write an archive entry, the client application can simply create
a {@link de.schlichtherle.io.FileOutputStream} or a {@link de.schlichtherle.io.FileWriter}
with the path or a {@link de.schlichtherle.io.File} instance as its constructor
parameter. Note that you cannot create a {@code FileOutputStream} or a
{@code FileWriter} to write an archive file itself (unless it's a false positive,
i.e. a regular file or directory with an archive file suffix).
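For illustration, here's a minimal sketch of both operations (the paths are hypothetical examples):
import de.schlichtherle.io.FileInputStream;
import de.schlichtherle.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

// Write the archive entry "readme.txt" within "archive.zip".
OutputStream out = new FileOutputStream("archive.zip/readme.txt");
try {
    out.write("Hello archive!".getBytes());
} finally {
    out.close();
}

// Read it back.
InputStream in = new FileInputStream("archive.zip/readme.txt");
try {
    // ... read from in ...
} finally {
    in.close();
}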
However, if the client application just needs to copy data, using one of the
copy methods in the {@code File} class
is highly recommended over using {@code FileInputStream} or {@code FileOutputStream} directly.
These methods use asynchronous I/O (though they return synchronously), pooled big
buffers, pooled threads (on JSE 5 and later) and do not need to decompress/recompress
archive entry data when copying from one archive file to another for supported archive
types. In addition, they are guaranteed to fail gracefully, while many Java apps
fail to close their streams if an {@code IOException} occurs.
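For example, here's a sketch which assumes the {@code copyTo(java.io.File)} method (one of the copy methods in the {@code File} class; consult its Javadoc for the full set and exact signatures):
import de.schlichtherle.io.File;

// Copy an entry from one archive file to another. For supported archive
// types, the entry data is not decompressed and recompressed in transit.
File src = new File("src.zip/readme.txt");
File dst = new File("dst.zip/readme.txt");
if (!src.copyTo(dst))
    System.err.println("Copy failed!");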
Note that there is no equivalent to {@code java.io.RandomAccessFile} in this
package because it's impossible to seek within compressed archive entry data.
Using Archive Entry Streams
When using streams, the client application should always close them in a
{@code finally}-block like this:
FileOutputStream out = new FileOutputStream(file);
try {
    // Do I/O here...
} finally {
    out.close(); // ALWAYS close the stream!
}
This ensures that the stream is closed even if an exception occurs.
Note that for various (mostly archive driver specific) reasons, the {@code close()}
method may throw an {@code IOException}, too. The client application needs
to deal with this appropriately, for example by enclosing the entire block with
another {@code try-catch}-block like this:
try {
    FileOutputStream out = new FileOutputStream(file);
    try {
        // Do I/O here...
    } finally {
        out.close(); // ALWAYS close the stream!
    }
} catch (IOException ex) {
    ex.printStackTrace();
}
This idiom is not at all specific to TrueZIP: Streams often utilize OS resources
such as file descriptors, database or network connections. All OS resources are
limited, however, and sometimes they are even exclusively allocated to a stream,
so a stream should always be closed again as soon as possible, especially in long
running server applications (relying on {@code finalize()} to do this during
garbage collection is unsafe). Unfortunately, many Java applications and libraries
fail in this respect.
TrueZIP is affected by open archive entry streams in the following ways:
- Archive drivers provided by third parties may restrict the number of open
input or output entry streams for an archive file. If this is exceeded, any
attempt to open another entry stream results in a {@link de.schlichtherle.io.FileBusyException}.
- When unmounting an archive file (see below), depending
on the parameters, TrueZIP may choose to force the closing of any open entry
streams or not. If the entry streams are not forced to close, the archive
file cannot get unmounted and an {@link de.schlichtherle.io.ArchiveBusyException}
is thrown. If the entry streams are forced to close however, the archive file
is unmounted and an {@link de.schlichtherle.io.ArchiveBusyWarningException}
is thrown to indicate that subsequent I/O operations on these entry streams
(other than {@code close()}) will fail with an {@link de.schlichtherle.io.ArchiveEntryStreamClosedException}.
Neither solution is optimal.
In order to prevent these exceptions, TrueZIP automatically closes entry streams
when they are garbage collected. However, the client application should never rely
on this because the delay and order in which streams are processed by the finalizer
thread is not specified and any unwritten data gets lost in output streams.
Atomicity of File System Operations
In general, a file system operation is either atomic or not. In its strict
sense, an atomic operation meets the following conditions:
- The operation either completely succeeds or completely fails. If it fails,
the state of the file system is not changed.
- Third parties can't monitor or influence the
changes as they are in progress. They can only see the result.
All reliable file system implementations meet the first condition and so does
TrueZIP. However, the situation is different for the second condition:
- TrueZIP's virtual file system implementation is running in a JVM process,
so other processes could monitor and influence changes in progress.
- TrueZIP's recognition of archive files is configurable, so other {@code File}
instances could monitor and influence changes in progress.
- TrueZIP caches state information about archive files on the heap and in
temporary files, so other definitions of the classes in this package which have
been loaded by other class loaders could monitor and influence changes in progress.
This implies that TrueZIP cannot provide any operations which are atomic in the
strict sense. However, many file system operations in this package are declared
to be virtually atomic in their Javadoc. A virtually atomic operation
meets the following conditions:
- The operation either completely succeeds or completely fails. If it fails,
the state of the (virtual) file system is not changed.
- If the path does not contain any archive files, the operation is always
delegated to the real file system and third parties can't monitor or influence
the changes as they are in progress. They can only see the result.
- Otherwise, all {@code File} instances which recognize the same set
of archive files in the path and share the same definition of classes in this
package can't monitor or influence the changes as they are in progress. They
can only see the result.
These conditions apply regardless of whether the {@code File} instances
are used by different threads or not. In other words, TrueZIP is as thread safe
as you could expect from a real file system.
Updating Archive Files
To provide random read/write access to archive files, TrueZIP needs to associate
some state with every recognized archive file on the heap and in the folder for temporary
files while the client application is operating on the VFS.
TrueZIP automatically mounts the VFS from an archive file on first
access. The client application can then operate on the VFS in an arbitrary manner.
Finally, an archive file must get unmounted in order to update it with the
accumulated modifications. Note that an archive entry gets modified by any operation
which creates, modifies or deletes it.
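To illustrate this life cycle (the path is a hypothetical example):
import de.schlichtherle.io.File;
import de.schlichtherle.io.FileOutputStream;
import java.io.OutputStream;

File entry = new File("archive.zip/readme.txt"); // no mounting yet
entry.isFile();                                  // first access mounts the VFS
OutputStream out = new FileOutputStream(entry);  // modifies the entry
try {
    out.write("Hello!".getBytes());
} finally {
    out.close();
}
File.umount(); // explicit unmount: updates the archive file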
Explicit vs. Implicit Unmounting
Archive file unmounting is performed semi-automatically:
- Explicit unmounting happens when the client application calls {@link
de.schlichtherle.io.File#umount} or {@link de.schlichtherle.io.File#update}.
- Implicit unmounting happens when the JVM terminates (by a JVM shutdown
hook) or when the client application modifies an archive entry more than once.
The latter case is also called implicit remounting, because the VFS is
immediately mounted again in order to continue the operation.
Explicit unmounting is required to support third-party access to an archive file
(see below) or to monitor progress (see
below). It also allows some control over any exceptions
thrown: Both {@code umount()} and {@code update()} may throw an {@link
de.schlichtherle.io.ArchiveWarningException} or an {@link de.schlichtherle.io.ArchiveException}.
The client application may catch these exceptions and act on them individually (see
below).
However, calling {@code umount()} or {@code update()} too often may
increase the overall runtime: On each call, all remaining entries in the archive
file are copied to the archive file again if the archive file already existed.
If the client application explicitly unmounts the archive file after each modification,
this may lead to an overall runtime of {@code O(s*s)}, where {@code s}
is the size of the archive file in bytes (see below).
In comparison, implicit unmounting provides the best performance because archive
files are only updated if there's really a need to. It also works reliably: The
JVM shutdown hook is always run unless the JVM crashes
(note
that an uncaught throwable terminates the JVM, but does not crash
it - a JVM crash is an extremely rare situation which indicates a bug in the JVM
implementation, not a bug in the JRE or the application). Furthermore, it obviates
the need to introduce a call to {@code umount()} or {@code update()} in
legacy applications.
The disadvantage is that the client application cannot easily detect
and deal with any exceptions thrown as a result of updating an
archive file:
Depending on where the implicit unmount happens, either an
arbitrary {@link java.io.IOException} is thrown, a boolean value
is returned, or - when called from the JVM shutdown hook - just a
stack trace is printed.
In addition, updating an existing archive file takes linear runtime
(see below). However, using long running
JVM shutdown hooks is generally discouraged: They can't use
{@link java.util.logging}, they can't use a GUI to monitor
progress (see below) and they can only
get debugged on JSE 5 or later.
Third Party Access
Because TrueZIP associates some state with any archive file which the client
application accesses for reading and/or writing, it requires exclusive access to these
archive files until they get unmounted again.
Third parties must not concurrently access these archive
files or their entries unless the precautions outlined
below have been taken!
In this context, third parties are:
- Instances of the class {@code java.io.File} which are not instances
of the class {@code de.schlichtherle.io.File}.
- Instances of the class {@code de.schlichtherle.io.File} which do not
recognize the same set of archive files in the path due to the use of a differently
working {@link de.schlichtherle.io.ArchiveDetector}.
- Other definitions of the classes in this package which have been loaded
by different class loaders.
- Other system processes.
As a rule of thumb, the same archive file or entry within an archive file should
not be accessed by different {@code File} classes ({@code java.io.File}
versus {@code de.schlichtherle.io.File}) or {@code File} instances with
different {@code ArchiveDetector} parameters. This ensures that the state associated
with an archive file is not shadowed or bypassed.
To ensure that all {@code File} instances recognize the same set of archive
files in a path, it's recommended not to use constructors or methods of
the {@code File} class with explicit {@code ArchiveDetector} parameters
unless there is good reason to.
To ensure that all {@code File} instances share the same definition of classes
in this package, it's recommended to add TrueZIP's JAR to the boot class path or
the extension class path.
If the prerequisites for these recommendations don't apply or if the recommendations
can't be followed, the client application may call {@link de.schlichtherle.io.File#umount}
({@link de.schlichtherle.io.File#update} will not work) to perform an explicit
unmount. This clears all state information so that the third party can then safely
access any archive file. In addition, the client application must make sure not
to access the same archive file or any of its entries in any way while the third
party is still accessing it.
Failure to comply to these guidelines may result in
unpredictable behavior and may even cause loss of data!
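For example, before handing an archive file over to another system process (the external command is a hypothetical example):
import de.schlichtherle.io.File;

// ... operate on "archive.zip" via the VFS ...
File.umount(); // clear all state so the third party can safely access it
// This application must not touch "archive.zip" or its entries now!
Process third = Runtime.getRuntime().exec(
        new String[] { "unzip", "-l", "archive.zip" });
third.waitFor(); // afterwards, this application may access it again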
Exception Handling
{@code umount()} and {@code update()} are guaranteed to process
all archive files which are in use or have been touched by the client application.
However, processing some of these archive files may fail for a number of I/O related
reasons. Hence, during processing, a sequential chain of archive exceptions
is constructed and thrown upon termination unless it's empty. Note that sequential
exception chaining is a concept which is completely orthogonal to Java's general
exception cause chaining: In a sequential archive exception chain, each archive
exception may still have a chain of other exceptions as its cause (most likely
{@code IOException}s).
Archive exceptions fall into two categories:
- The class {@link de.schlichtherle.io.ArchiveWarningException} is the root
of all warning exception types. These exceptions are thrown if an archive file
has been completely updated, but some warning conditions apply. No data has
been lost.
- Its super class {@link de.schlichtherle.io.ArchiveException} is the root
of all other exception types (unless it's an {@code ArchiveWarningException}
again). These exceptions are thrown if an archive file could not get updated
completely. This implies loss of some or all data in the respective archive
file.
Note that the effect which is indicated by an archive exception is local: An
exception thrown when processing an archive file does not imply an archive exception
or loss of data when processing another archive file.
When the archive exception chain is thrown by these methods, it's first sorted
according to (1) descending order of priority and (2) ascending order of appearance,
and the resulting head exception is then thrown. Since {@code ArchiveWarningException}s
have a lower priority than {@code ArchiveException}s, they are always pushed
back to the end of the chain, so that an application can use the following simple
idiom to detect whether only warnings or at least one severe error has occurred:
try {
    File.umount(); // with or without parameters
} catch (ArchiveWarningException oops) {
    // Only instances of the class ArchiveWarningException exist in
    // the sequential chain of exceptions. We decide to ignore this.
} catch (ArchiveException ouch) {
    // At least one exception occurred which is not just an
    // ArchiveWarningException. This is a severe situation that
    // needs to be handled.

    // Print the sequential chain of exceptions in order of
    // descending priority and ascending appearance.
    //ouch.printStackTrace();

    // Print the sequential chain of exceptions in order of
    // appearance instead.
    ouch.sortAppearance().printStackTrace();
}
Note that the {@link java.lang.Throwable#getMessage()} method (and hence {@link
java.lang.Throwable#printStackTrace()}) will concatenate the detail messages of the
exceptions in the sequential chain in the given order.
Performance Considerations
Unmounting a modified archive file is a linear runtime operation: If the size
of the resulting archive file is s bytes, the operation always completes
in O(s), even if only a single, small archive entry has been modified
within a very large archive file. Unmounting an unmodified or newly created archive
file is a constant runtime operation: It always completes in O(1). These
magnitudes are independent of whether unmounting was performed explicitly or implicitly.
Now if the client application modifies each entry in a loop and accidentally
triggers unmounting the archive file on each iteration, then the overall runtime
increases to O(s*s)! Here's an example:
String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) {                       // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile();                          // O(1)
    File.umount();                                  // O(i + 1) !!
}
// Overall: O(n*n) !!!
The poor runtime is caused by calling {@code umount()} within the loop. Moving
it out of the loop fixes the issue:
String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) {                       // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile();                          // O(1)
}
File.umount(); // new file: O(1); modified: O(n)
// Overall: O(n)
In essence: The client application should never call {@code umount()}
or {@code update()} in a loop which modifies an archive file.
The situation gets more complicated with implicit remounting: If a file entry
shall get modified which already has been modified before, TrueZIP implicitly remounts
the archive file in order to avoid writing duplicated entries to it (which would
waste space and may even confuse other utilities). Here's an example:
String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) {                       // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile();                          // First modification: O(1)
    entry.createNewFile();                          // Second modification triggers remount: O(i + 1) !!
}
// Overall: O(n*n) !!!
Each call to {@code createNewFile()} is a modification operation. Hence,
on the second call to this method, TrueZIP needs to do an implicit remount which
writes all entries in the archive file created so far to disk again.
Unfortunately, a modification operation is not always so easy to spot. Consider
the following example to create an archive file with empty entries which all share
the same last modification time:
long time = System.currentTimeMillis();
String[] names = { "a", "b", "c" };
int n = names.length;
for (int i = 0; i < n; i++) {                       // n * ...
    File entry = new File("archive.zip", names[i]); // O(1)
    entry.createNewFile();                          // First modification: O(1)
    entry.setLastModified(time);                    // Second modification triggers remount: O(i + 1) !!
}
// Overall: O(n*n) !!!
When {@code setLastModified()} gets called, the entry has already been written
and so an implicit remount is triggered, which writes all entries in the archive
file created so far to disk again.
Detail: This deficiency is caused by the archive file formats: All currently
supported archive types require an entry's meta data (including the last
modification time) to be written to the archive file before its content. So if the meta data is
to be modified, the archive entry and hence the whole archive file needs to get
rewritten, which is what the implicit remount is doing.
To avoid accidental remounting when copying data, you should consider using the
advanced copy methods instead. These methods
are easy to use, work reliably and provide superior performance.
Monitoring Progress
When unmounting, the client application can monitor the progress from another thread
using {@link de.schlichtherle.io.File#getLiveArchiveStatistics()}. The returned
instance is a proxy which returns live statistics about the update process.
Here's an example of how to monitor unmounting progress on the standard error output
after an initial delay of two seconds:
import java.text.MessageFormat;
import de.schlichtherle.io.ArchiveStatistics;
import de.schlichtherle.io.File;

class ProgressMonitor extends Thread {
    Long[] args = new Long[2];
    ArchiveStatistics liveStats = File.getLiveArchiveStatistics();

    ProgressMonitor() {
        setPriority(Thread.MAX_PRIORITY);
        setDaemon(true);
    }

    public void run() {
        boolean run = false;
        for (long sleep = 2000; ; sleep = 200, run = true) {
            try {
                Thread.sleep(sleep);
            } catch (InterruptedException shutdown) {
                break;
            }
            showProgress();
        }
        if (run) {
            showProgress();
            System.err.println();
        }
    }

    void showProgress() {
        // Round up to kilobytes.
        args[0] = new Long(
                (liveStats.getUpdateTotalByteCountRead() + 1023) / 1024);
        args[1] = new Long(
                (liveStats.getUpdateTotalByteCountWritten() + 1023) / 1024);
        System.err.print(MessageFormat.format(
                "Top level archive IO: {0} / {1} KB \r", args));
    }

    void shutdown() {
        interrupt();
        try {
            join();
        } catch (InterruptedException interrupted) {
            interrupted.printStackTrace();
        }
    }
}

// ...
ProgressMonitor monitor = new ProgressMonitor();
monitor.start();
try {
    File.umount();
} finally {
    monitor.shutdown();
}
Conclusions
Here are some guidelines to find the right balance between performance and control:
- When the JVM terminates, calling {@code umount()}
is recommended in order to handle exceptions explicitly, but not required because
TrueZIP's JVM shutdown hook takes care of unmounting anyway and prints the stack trace
of any exceptions on the standard error output.
- Otherwise, in order to achieve best performance, {@code umount()} or
{@code update()} should not get called unless either
third party access or explicit
exception handling is required.
- For the same reason, these methods should never get called in a
loop which modifies an archive file.
- {@code umount()} is generally preferred over {@code update()}
for safety reasons.
Miscellaneous
Virtual Directories in Archive Files
The top level entries in an archive file form its root directory. The
root directory is never written to the output when an archive file is modified.
To the client application, the root directory behaves like any other directory
and is addressed by naming the archive file in a path: For example, the client application
may list its contents by calling {@link de.schlichtherle.io.File#list()} or {@link
de.schlichtherle.io.File#listFiles()}.
The root directory receives its last modification time from the archive file
whenever it's read. Likewise, the archive file will receive the root directory's
last modification time whenever it's written.
While this is a proper emulation of the behavior of real file systems, it may
confuse users if only entries which are located one level or more below the root
directory have been changed in an existing archive file: In this case, the last
modification time of the root directory is not updated and hence the archive file's
last modification time will not reflect the changes in the deeper directory levels.
As a workaround, the client application can use the idiom {@code {@link de.schlichtherle.io.File#isArchive()}
&& {@link de.schlichtherle.io.File#isDirectory()}} to detect an archive file
and explicitly change the last modification time of its root directory by calling
{@link de.schlichtherle.io.File#setLastModified(long)}.
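For instance:
import de.schlichtherle.io.File;

File file = new File("archive.zip");
if (file.isArchive() && file.isDirectory()) // a true archive file
    file.setLastModified(System.currentTimeMillis());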
An archive may contain directories for which no entry is present in the file
although they contain at least one member in their directory tree for which an entry
is actually present in the file. Similarly, if {@link de.schlichtherle.io.File#isLenient}
returns {@code true} (which is the default), an archive entry may be created
in an archive file although its parent directory hasn't been explicitly created
by calling {@link de.schlichtherle.io.File#mkdir} before.
Such a directory is called a ghost directory: Like the root directory,
a ghost directory is not written to the output whenever an archive file is modified.
This is to mimic the behavior of most archive utilities which do not create archive
entries for directories.
To the client application, a ghost directory behaves like a regular directory
with the exception that its last modification time returned by {@link de.schlichtherle.io.File#lastModified()}
is {@code 0L}. If the client application sets the last modification time explicitly
using {@link de.schlichtherle.io.File#setLastModified(long)}, then the ghost directory
reincarnates as a regular directory and will be output to the archive file.
Note that a ghost directory can only exist within an archive file, but not every
directory within an archive file is actually a ghost directory.
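For illustration, the following sketch creates an entry deep within an archive file without calling {@code mkdir()} first, so that its parent directories become ghost directories (assuming {@code File.isLenient()} returns {@code true}, which is the default; the paths are hypothetical):
import de.schlichtherle.io.File;
import de.schlichtherle.io.FileOutputStream;
import java.io.OutputStream;

// "a" and "b" become ghost directories: they exist in the VFS, but no
// entries are written for them when the archive file is updated.
OutputStream out = new FileOutputStream("archive.zip/a/b/c.txt");
try {
    out.write("data".getBytes());
} finally {
    out.close();
}
File ghost = new File("archive.zip/a/b");
assert ghost.isDirectory();
assert ghost.lastModified() == 0L; // the mark of a ghost directory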
Entry Names in Archive Files
File paths may be composed of elements which either refer to regular nodes in
the real file system (directories, files or special files), including top level
archive files, or refer to entries within an archive file.
As usual in Java, elements in a path which refer to regular nodes may be case
sensitive or not in TrueZIP's VFS, depending on the real file system and/or the
platform.
However, elements in a path which refer to archive entries are always case sensitive.
This enables the client application to address all files in existing archive files,
regardless of the operating system they've been created on.
For existing archive files, redundant elements in entry names such as the empty
string ({@code ""}), the dot ({@code "."}) directory, or the dot-dot ({@code ".."})
directory are removed in the VFS when the archive file is read and not
retained when the archive file is modified.
If an entry name contains characters which have no representation in the character
set of the corresponding archive file type, then all file operations to
create the archive entry will fail gracefully according to the documented contract
of the respective operation. This is to protect the client application from creating
archive entries which cannot get encoded and decoded again correctly. For example,
the Euro sign (€) does not have a representation in the IBM437 character set and
hence cannot be used for entries in ordinary ZIP files unless TrueZIP's configuration
is customized to use another charset.
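Here's a hedged sketch of what this might look like (the exact failure mode depends on the contract of the operation used):
import de.schlichtherle.io.File;
import java.io.IOException;

File entry = new File("archive.zip", "\u20ac.txt"); // Euro sign in name
try {
    if (!entry.createNewFile())
        System.err.println("Could not create the entry!");
} catch (IOException ex) {
    // The entry name cannot be encoded in IBM437, so the operation
    // fails gracefully instead of writing a corrupt entry name.
    ex.printStackTrace();
}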
If an archive file contains entries with absolute entry names, such as /readme.txt
rather than readme.txt, the client application cannot address these entries
using the VFS in this package. However, these entries are retained like any other
entry whenever the client application modifies the archive file. This should not
pose a problem, as absolute entry names should never be used anyway and I'm not
aware of any recent tools which would allow creating them.
If an archive file contains both a file and a directory entry with the same name,
it's up to the individual methods how they behave in this case. This can only happen
with archive files created by external tools. Both {@link de.schlichtherle.io.File#isDirectory()}
and {@link de.schlichtherle.io.File#isFile()} will return {@code true} in this
case, and in fact they are the only methods the client application can rely upon
to act properly in this situation: Many other methods use a combination of {@code
isDirectory()} and {@code isFile()} calls and will show undefined
behavior.
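For instance, a sketch of detecting this situation (the path is a hypothetical example):
import de.schlichtherle.io.File;

File node = new File("broken.zip/name");
if (node.isFile() && node.isDirectory()) {
    // Both a file and a directory entry named "name" exist in the
    // archive file - consider fixing it with another tool.
}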
The good news is that both the file and the directory coexist in the virtual
archive file system implemented by this package. Thus, whenever the archive file
is modified, both entries will be retained and no data gets lost. This allows you
to use another tool to fix the issue in the archive file. TrueZIP never allows the
client application to create such an archive file, however.