org.lwjgl.util.zstd.Zdict Maven / Gradle / Ivy

A fast lossless compression algorithm, targeting real-time compression scenarios at zlib-level and better compression ratios.
/*
 * Copyright LWJGL. All rights reserved.
 * License terms: https://www.lwjgl.org/license
 * MACHINE GENERATED FILE, DO NOT EDIT
 */
package org.lwjgl.util.zstd;

import org.jspecify.annotations.*;

import java.nio.*;

import org.lwjgl.*;

import org.lwjgl.system.*;

import static org.lwjgl.system.Checks.*;
import static org.lwjgl.system.MemoryUtil.*;

/**
 * Native bindings to the dictionary builder API of Zstandard (zstd).
 * 
 * Why should I use a dictionary?
 * 
 * Zstd can use dictionaries to improve compression ratio of small data. Traditionally small files don't compress well because there is very little
 * repetition in a single sample, since it is small. But, if you are compressing many similar files, like a bunch of JSON records that share the same
 * structure, you can train a dictionary ahead of time on some samples of these files. Then, zstd can use the dictionary to find repetitions that are
 * present across samples. This can vastly improve compression ratio.
 * 
 * When is a dictionary useful?
 * 
 * Dictionaries are useful when compressing many small files that are similar. The larger a file is, the less benefit a dictionary will have. Generally,
 * we don't expect dictionary compression to be effective past 100KB. And the smaller a file is, the more we would expect the dictionary to help.
 * 
 * How do I use a dictionary?
 * 
 * Simply pass the dictionary to the zstd compressor with {@link Zstd#ZSTD_CCtx_loadDictionary CCtx_loadDictionary}. The same dictionary must then be passed to the decompressor, using
 * {@link Zstd#ZSTD_DCtx_loadDictionary DCtx_loadDictionary}. There are other more advanced functions that allow selecting some options, see {@code zstd.h} for complete documentation.
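 *
 * A minimal sketch of both steps, assuming static imports from {@link Zstd} and {@code MemoryUtil}. The direct buffers {@code dict}, {@code src} and
 * {@code restored} are hypothetical and supplied by the caller; {@code restored} is assumed to be large enough for the original data:
 *
 * <pre>{@code
 * long       cctx       = ZSTD_createCCtx();
 * long       dctx       = ZSTD_createDCtx();
 * ByteBuffer compressed = memAlloc((int)ZSTD_compressBound(src.remaining()));
 * try {
 *     long rc = ZSTD_CCtx_loadDictionary(cctx, dict);
 *     if (ZSTD_isError(rc)) {
 *         throw new IllegalStateException(ZSTD_getErrorName(rc));
 *     }
 *     long compressedSize = ZSTD_compress2(cctx, compressed, src);
 *     if (ZSTD_isError(compressedSize)) {
 *         throw new IllegalStateException(ZSTD_getErrorName(compressedSize));
 *     }
 *     compressed.limit((int)compressedSize);
 *
 *     // The decompressor must be given the very same dictionary bytes.
 *     rc = ZSTD_DCtx_loadDictionary(dctx, dict);
 *     if (ZSTD_isError(rc)) {
 *         throw new IllegalStateException(ZSTD_getErrorName(rc));
 *     }
 *     long restoredSize = ZSTD_decompressDCtx(dctx, restored, compressed);
 *     if (ZSTD_isError(restoredSize)) {
 *         throw new IllegalStateException(ZSTD_getErrorName(restoredSize));
 *     }
 *     restored.limit((int)restoredSize);
 * } finally {
 *     memFree(compressed);
 *     ZSTD_freeDCtx(dctx);
 *     ZSTD_freeCCtx(cctx);
 * }
 * }</pre>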
 * 
 * What is a zstd dictionary?
 * 
 * A zstd dictionary has two pieces: Its header, and its content. The header contains a magic number, the dictionary ID, and entropy tables. These entropy
 * tables allow zstd to save on header costs in the compressed file, which really matters for small data. The content is just bytes, which are repeated
 * content that is common across many samples.
 * 
 * What is a raw content dictionary?
 * 
 * A raw content dictionary is just bytes. It doesn't have a zstd dictionary header, a dictionary ID, or entropy tables. Any buffer is a valid raw content
 * dictionary.
 * 
 * How do I train a dictionary?
 * 
 * Gather samples from your use case. These samples should be similar to each other. If you have several use cases, you could try to train one dictionary
 * per use case.
 * 
 * Pass those samples to {@link #ZDICT_trainFromBuffer trainFromBuffer} and that will train your dictionary. There are a few advanced versions of this function, but this is a great
 * starting point. If you want to further tune your dictionary you could try {@link #ZDICT_optimizeTrainFromBuffer_cover optimizeTrainFromBuffer_cover}. If that is too slow you can try
 * {@link #ZDICT_optimizeTrainFromBuffer_fastCover optimizeTrainFromBuffer_fastCover}.
 * 
 * If the dictionary training function fails, that is likely because you either passed too few samples, or a dictionary would not be effective for your
 * data. Look at the messages that the dictionary trainer printed: if they do not mention too few samples, then a dictionary would not be effective.
 * 
 * How large should my dictionary be?
 * 
 * A reasonable dictionary size, the {@code dictBufferCapacity}, is about 100KB. The zstd CLI defaults to a 110KB dictionary. You likely don't need a
 * dictionary larger than that. But, most use cases can get away with a smaller dictionary. The advanced dictionary builders can automatically shrink the
 * dictionary for you, and select the smallest size that doesn't hurt compression ratio too much. See the {@code shrinkDict} parameter. A smaller
 * dictionary can save memory, and potentially speed up compression.
 * 
 * How many samples should I provide to the dictionary builder?
 * 
 * We generally recommend passing ~100x the size of the dictionary in samples. A few thousand should suffice. Having too few samples can hurt the
 * dictionary's effectiveness. Having more samples will only improve the dictionary's effectiveness. But having too many samples can slow down the
 * dictionary builder.
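 *
 * As a rough worked example of this rule of thumb (the figures are illustrative assumptions, not measurements), using the 110KB zstd CLI default size
 * and a hypothetical count of 4,000 samples:
 *
 * <pre>{@code
 * long dictBufferCapacity = 110L * 1024;               // 112,640 bytes
 * long targetSampleBytes  = 100L * dictBufferCapacity; // 11,264,000 bytes, roughly 11 MB of samples in total
 * long averageSampleSize  = targetSampleBytes / 4000;  // 2,816 bytes per sample if spread over 4,000 samples
 * }</pre>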
 * 
 * How do I determine if a dictionary will be effective?
 * 
 * Simply train a dictionary and try it out. You can use zstd's built-in benchmarking tool to test the dictionary's effectiveness.
 * 
 * 
 * # Benchmark levels 1-3 without a dictionary
 * zstd -b1e3 -r /path/to/my/files
 * # Benchmark levels 1-3 with a dictionary
 * zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary
 * 
 * When should I retrain a dictionary?
 * 
 * You should retrain a dictionary when its effectiveness drops. Dictionary effectiveness drops as the data you are compressing changes. Generally, we do
 * expect dictionaries to "decay" over time, as your data changes, but the rate at which they decay depends on your use case. Internally, we regularly
 * retrain dictionaries, and if the new dictionary performs significantly better than the old dictionary, we will ship the new dictionary.
 * 
 * I have a raw content dictionary, how do I turn it into a zstd dictionary?
 * 
 * If you have a raw content dictionary, e.g. by manually constructing it, or using a third-party dictionary builder, you can turn it into a zstd
 * dictionary by using {@link #ZDICT_finalizeDictionary finalizeDictionary}. You'll also have to provide some samples of the data. It will add the zstd header to the raw content, which
 * contains a dictionary ID and entropy tables, which will improve compression ratio, and allow zstd to write the dictionary ID into the frame, if you so
 * choose.
 * 
 * Do I have to use zstd's dictionary builder?
 * 
 * No! You can construct dictionary content however you please, it is just bytes. It will always be valid as a raw content dictionary. If you want a zstd
 * dictionary, which can improve compression ratio, use {@link #ZDICT_finalizeDictionary finalizeDictionary}.
 * 
 * What is the attack surface of a zstd dictionary?
 * 
 * Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so zstd should never crash, or access out-of-bounds memory no matter what the
 * dictionary is. However, if an attacker can control the dictionary during decompression, they can cause zstd to generate arbitrary bytes, just like if
 * they controlled the compressed data.
 */
public class Zdict {

    static { LibZstd.initialize(); }

    public static final int
        ZDICT_CONTENTSIZE_MIN = 128,
        ZDICT_DICTSIZE_MIN    = 256;

    protected Zdict() {
        throw new UnsupportedOperationException();
    }

    // --- [ ZDICT_trainFromBuffer ] ---

    /** Unsafe version of: {@link #ZDICT_trainFromBuffer trainFromBuffer} */
    public static native long nZDICT_trainFromBuffer(long dictBuffer, long dictBufferCapacity, long samplesBuffer, long samplesSizes, int nbSamples);

    /**
     * Train a dictionary from an array of samples.
     * 
     * Redirects to {@link #ZDICT_optimizeTrainFromBuffer_fastCover optimizeTrainFromBuffer_fastCover}, single-threaded, with {@code d=8}, {@code steps=4}, {@code f=20}, and {@code accel=1}.
     * 
     * Samples must be stored concatenated in a single flat buffer {@code samplesBuffer}, supplied with an array of sizes {@code samplesSizes}, providing the
     * size of each sample, in order.
     * 
     * The resulting dictionary will be saved into {@code dictBuffer}.
     * 
     * Note: {@code ZDICT_trainFromBuffer()} requires about 9 bytes of memory for each input byte.
     * 
     * Tips:
     * 
     * 
     * In general, a reasonable dictionary has a size of ~100 KB.
     * It's possible to select a smaller or larger size, just by specifying {@code dictBufferCapacity}.
     * In general, it's recommended to provide a few thousand samples, though this can vary a lot.
     * It's recommended that the total size of all samples be about 100 times the target size of the dictionary.
     * 
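     * A minimal sketch of the layout described above, assuming static imports from {@code MemoryUtil}. The {@code samples} list is hypothetical (a
     * {@code List<byte[]>} gathered by the caller), and the 100 KB capacity simply follows the tip above:
     *
     * <pre>{@code
     * int totalBytes = 0;
     * for (byte[] sample : samples) {
     *     totalBytes += sample.length;
     * }
     *
     * ByteBuffer    samplesBuffer = memAlloc(totalBytes);
     * PointerBuffer samplesSizes  = memAllocPointer(samples.size());
     * ByteBuffer    dictBuffer    = memAlloc(100 * 1024);
     * try {
     *     for (byte[] sample : samples) {
     *         samplesBuffer.put(sample);       // samples stored back to back, no separators
     *         samplesSizes.put(sample.length); // one size per sample, in the same order
     *     }
     *     samplesBuffer.flip();
     *     samplesSizes.flip();
     *
     *     long result = ZDICT_trainFromBuffer(dictBuffer, samplesBuffer, samplesSizes);
     *     if (ZDICT_isError(result)) {
     *         throw new IllegalStateException(ZDICT_getErrorName(result));
     *     }
     *     dictBuffer.limit((int)result); // the dictionary occupies the first "result" bytes
     *     // ... persist or use dictBuffer here, before it is freed ...
     * } finally {
     *     memFree(dictBuffer);
     *     memFree(samplesSizes);
     *     memFree(samplesBuffer);
     * }
     * }</pre>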
     *
     * @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}.
     */
    @NativeType("size_t")
    public static long ZDICT_trainFromBuffer(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes) {
        if (CHECKS) {
            if (DEBUG) {
                check(samplesBuffer, getSamplesBufferSize(samplesSizes));
            }
        }
        return nZDICT_trainFromBuffer(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining());
    }

    // --- [ ZDICT_getDictID ] ---

    /** Unsafe version of: {@link #ZDICT_getDictID getDictID} */
    public static native int nZDICT_getDictID(long dictBuffer, long dictSize);

    /**
     * Extracts {@code dictID}.
     *
     * @return zero if error (not a valid dictionary)
     */
    @NativeType("unsigned int")
    public static int ZDICT_getDictID(@NativeType("void const *") ByteBuffer dictBuffer) {
        return nZDICT_getDictID(memAddress(dictBuffer), dictBuffer.remaining());
    }

    // --- [ ZDICT_isError ] ---

    public static native int nZDICT_isError(long errorCode);

    /** Tells whether a {@code size_t} result returned by a ZDICT function is an error code. */
    @NativeType("unsigned int")
    public static boolean ZDICT_isError(@NativeType("size_t") long errorCode) {
        return nZDICT_isError(errorCode) != 0;
    }

    // --- [ ZDICT_getErrorName ] ---

    public static native long nZDICT_getErrorName(long errorCode);

    /** Returns a readable string describing the specified error code. */
    @NativeType("char const *")
    public static @Nullable String ZDICT_getErrorName(@NativeType("size_t") long errorCode) {
        long __result = nZDICT_getErrorName(errorCode);
        return memASCIISafe(__result);
    }

    // --- [ ZDICT_trainFromBuffer_cover ] ---

    /** Unsafe version of: {@link #ZDICT_trainFromBuffer_cover trainFromBuffer_cover} */
    public static native long nZDICT_trainFromBuffer_cover(long dictBuffer, long dictBufferCapacity, long samplesBuffer, long samplesSizes, int nbSamples, long parameters);

    /**
     * Train a dictionary from an array of samples using the COVER algorithm.
     * 
     * Samples must be stored concatenated in a single flat buffer {@code samplesBuffer}, supplied with an array of sizes {@code samplesSizes}, providing the
     * size of each sample, in order.
     * 
     * The resulting dictionary will be saved into {@code dictBuffer}.
     * 
     * Note: {@code ZDICT_trainFromBuffer_cover()} requires about 9 bytes of memory for each input byte.
     * 
     * Tips:
     * 
     * 
     * In general, a reasonable dictionary has a size of ~100 KB.
     * It's possible to select a smaller or larger size, just by specifying {@code dictBufferCapacity}.
     * In general, it's recommended to provide a few thousand samples, though this can vary a lot.
     * It's recommended that the total size of all samples be about 100 times the target size of the dictionary.
     * 
     *
     * @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}.
     */
    @NativeType("size_t")
    public static long ZDICT_trainFromBuffer_cover(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes, @NativeType("ZDICT_cover_params_t") ZDICTCoverParams parameters) {
        if (CHECKS) {
            if (DEBUG) {
                check(samplesBuffer, getSamplesBufferSize(samplesSizes));
            }
        }
        return nZDICT_trainFromBuffer_cover(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining(), parameters.address());
    }

    // --- [ ZDICT_optimizeTrainFromBuffer_cover ] ---

    /** Unsafe version of: {@link #ZDICT_optimizeTrainFromBuffer_cover optimizeTrainFromBuffer_cover} */
    public static native long nZDICT_optimizeTrainFromBuffer_cover(long dictBuffer, long dictBufferCapacity, long samplesBuffer, long samplesSizes, int nbSamples, long parameters);

    /**
     * The same requirements as {@link #ZDICT_trainFromBuffer_cover trainFromBuffer_cover} hold for all the parameters except {@code parameters}.
     * 
     * This function tries many parameter combinations and picks the best parameters. {@code *parameters} is filled with the best parameters found, dictionary
     * constructed with those parameters is stored in {@code dictBuffer}.
     * 
     * 
     * All of the parameters {@code d}, {@code k}, {@code steps} are optional.
     * If {@code d} is non-zero then we don't check multiple values of {@code d}, otherwise we check {@code d = {6, 8}}.
     * If {@code steps} is zero, its default value is used.
     * If {@code k} is non-zero then we don't check multiple values of {@code k}, otherwise we check {@code steps} values of {@code k} in {@code [50, 2000]}.
     * 
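     * A minimal sketch of letting the optimizer choose {@code d} and {@code k}, assuming {@code MemoryStack.stackPush} is statically imported and that
     * {@code dictBuffer}, {@code samplesBuffer} and {@code samplesSizes} are hypothetical buffers already prepared by the caller:
     *
     * <pre>{@code
     * try (MemoryStack stack = stackPush()) {
     *     // A zeroed struct leaves d, k and steps at 0, so the optimizer searches over them.
     *     ZDICTCoverParams params = ZDICTCoverParams.calloc(stack);
     *
     *     long result = ZDICT_optimizeTrainFromBuffer_cover(dictBuffer, samplesBuffer, samplesSizes, params);
     *     if (ZDICT_isError(result)) {
     *         throw new IllegalStateException(ZDICT_getErrorName(result));
     *     }
     *     dictBuffer.limit((int)result);
     *
     *     // On success the struct holds the parameters that were selected.
     *     System.out.println("selected k=" + params.k() + ", d=" + params.d());
     * }
     * }</pre>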
     * 
     * Note: {@code ZDICT_optimizeTrainFromBuffer_cover()} requires about 8 bytes of memory for each input byte, plus about another 5 bytes of memory per
     * input byte for each thread.
     *
     * @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}. On success
     *         {@code *parameters} contains the parameters selected.
     */
    @NativeType("size_t")
    public static long ZDICT_optimizeTrainFromBuffer_cover(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes, @NativeType("ZDICT_cover_params_t *") ZDICTCoverParams parameters) {
        if (CHECKS) {
            if (DEBUG) {
                check(samplesBuffer, getSamplesBufferSize(samplesSizes));
            }
        }
        return nZDICT_optimizeTrainFromBuffer_cover(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining(), parameters.address());
    }

    // --- [ ZDICT_trainFromBuffer_fastCover ] ---

    /** Unsafe version of: {@link #ZDICT_trainFromBuffer_fastCover trainFromBuffer_fastCover} */
    public static native long nZDICT_trainFromBuffer_fastCover(long dictBuffer, long dictBufferCapacity, long samplesBuffer, long samplesSizes, int nbSamples, long parameters);

    /**
     * Train a dictionary from an array of samples using a modified version of the COVER algorithm.
     * 
     * Samples must be stored concatenated in a single flat buffer {@code samplesBuffer}, supplied with an array of sizes {@code samplesSizes}, providing the
     * size of each sample, in order. {@code d} and {@code k} are required. All other parameters are optional and use default values if not provided. The
     * resulting dictionary will be saved into {@code dictBuffer}.
     * 
     * Note: {@code ZDICT_trainFromBuffer_fastCover()} requires about 1 byte of memory for each input byte and additionally another {@code 6 * 2^f} bytes of
     * memory.
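     *
     * As a worked instance of this estimate (a sketch with illustrative figures: {@code inputBytes} is a hypothetical total sample size, and {@code f = 20}
     * is the value quoted for {@link #ZDICT_trainFromBuffer trainFromBuffer}):
     *
     * <pre>{@code
     * long inputBytes = 10L * 1024 * 1024;           // 10,485,760 bytes of samples
     * int  f          = 20;
     * long estimate   = inputBytes + 6L * (1L << f); // 10,485,760 + 6,291,456 = 16,777,216 bytes, about 16 MiB
     * }</pre>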
     * 
     * Tips: In general, a reasonable dictionary has a size of {@code ~100 KB}. It's possible to select a smaller or larger size, just by specifying
     * {@code dictBufferCapacity}. In general, it's recommended to provide a few thousand samples, though this can vary a lot. It's recommended that the total
     * size of all samples be about 100 times the target size of the dictionary.
     *
     * @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}.
     */
    @NativeType("size_t")
    public static long ZDICT_trainFromBuffer_fastCover(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes, @NativeType("ZDICT_fastCover_params_t") ZDICTFastCoverParams parameters) {
        if (CHECKS) {
            if (DEBUG) {
                check(samplesBuffer, getSamplesBufferSize(samplesSizes));
            }
        }
        return nZDICT_trainFromBuffer_fastCover(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining(), parameters.address());
    }

    // --- [ ZDICT_optimizeTrainFromBuffer_fastCover ] ---

    /** Unsafe version of: {@link #ZDICT_optimizeTrainFromBuffer_fastCover optimizeTrainFromBuffer_fastCover} */
    public static native long nZDICT_optimizeTrainFromBuffer_fastCover(long dictBuffer, long dictBufferCapacity, long samplesBuffer, long samplesSizes, int nbSamples, long parameters);

    /**
     * The same requirements as {@link #ZDICT_trainFromBuffer_fastCover trainFromBuffer_fastCover} hold for all the parameters except {@code parameters}.
     * 
     * This function tries many parameter combinations (specifically, {@code k} and {@code d} combinations) and picks the best parameters. {@code *parameters}
     * is filled with the best parameters found, dictionary constructed with those parameters is stored in {@code dictBuffer}.
     * 
     * 
     * All of the parameters {@code d}, {@code k}, {@code steps}, {@code f}, and {@code accel} are optional.
     * If {@code d} is non-zero then we don't check multiple values of {@code d}, otherwise we check {@code d = {6, 8}}.
     * If {@code steps} is zero, its default value is used.
     * If {@code k} is non-zero then we don't check multiple values of {@code k}, otherwise we check {@code steps} values of {@code k} in {@code [50, 2000]}.
     * If {@code f} is zero, default value of 20 is used.
     * If {@code accel} is zero, default value of 1 is used.
     * 
     * 
     * Note: {@code ZDICT_optimizeTrainFromBuffer_fastCover()} requires about 1 byte of memory for each input byte and additionally another {@code 6 * 2^f}
     * bytes of memory for each thread.
     *
     * @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}. On success
     *         {@code *parameters} contains the parameters selected.
     */
    @NativeType("size_t")
    public static long ZDICT_optimizeTrainFromBuffer_fastCover(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes, @NativeType("ZDICT_fastCover_params_t *") ZDICTFastCoverParams parameters) {
        if (CHECKS) {
            if (DEBUG) {
                check(samplesBuffer, getSamplesBufferSize(samplesSizes));
            }
        }
        return nZDICT_optimizeTrainFromBuffer_fastCover(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining(), parameters.address());
    }

    // --- [ ZDICT_finalizeDictionary ] ---

    /** Unsafe version of: {@link #ZDICT_finalizeDictionary finalizeDictionary} */
    public static native long nZDICT_finalizeDictionary(long dictBuffer, long dictBufferCapacity, long dictContent, long dictContentSize, long samplesBuffer, long samplesSizes, int nbSamples, long parameters);

    /**
     * Given a custom content as a basis for dictionary, and a set of samples, finalize dictionary by adding headers and statistics.
     * 
     * Samples must be stored concatenated in a flat buffer {@code samplesBuffer}, supplied with an array of sizes {@code samplesSizes}, providing the size of
     * each sample in order.
     * 
     * Notes:
     * 
     * 
     * {@code maxDictSize} (the capacity of {@code dictBuffer}) must be ≥ {@code dictContentSize}, and must be ≥ {@link #ZDICT_DICTSIZE_MIN DICTSIZE_MIN} bytes.
     * {@code ZDICT_finalizeDictionary()} will push notifications into {@code stderr} if instructed to, using {@code notificationLevel>0}.
     * {@code dictBuffer} and {@code dictContent} can overlap.
     * 
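     * A minimal sketch of a call that respects these notes, assuming static imports of {@code MemoryStack.stackPush} and {@code MemoryUtil}. The buffers
     * {@code rawContent}, {@code samplesBuffer} and {@code samplesSizes} are hypothetical and prepared by the caller; a real caller may want extra capacity
     * beyond this minimum to leave room for the added header:
     *
     * <pre>{@code
     * int        capacity   = Math.max(rawContent.remaining(), ZDICT_DICTSIZE_MIN); // satisfies both lower bounds
     * ByteBuffer dictBuffer = memAlloc(capacity);
     * try (MemoryStack stack = stackPush()) {
     *     ZDICTParams params = ZDICTParams.calloc(stack) // zeroed struct
     *         .notificationLevel(0);                     // 0: nothing is pushed to stderr
     *
     *     long result = ZDICT_finalizeDictionary(dictBuffer, rawContent, samplesBuffer, samplesSizes, params);
     *     if (ZDICT_isError(result)) {
     *         throw new IllegalStateException(ZDICT_getErrorName(result));
     *     }
     *     dictBuffer.limit((int)result);
     *     // ... persist or use dictBuffer here, before it is freed ...
     * } finally {
     *     memFree(dictBuffer);
     * }
     * }</pre>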
     *
     * @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}.
     */
    @NativeType("size_t")
    public static long ZDICT_finalizeDictionary(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer dictContent, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes, @NativeType("ZDICT_params_t") ZDICTParams parameters) {
        if (CHECKS) {
            if (DEBUG) {
                check(samplesBuffer, getSamplesBufferSize(samplesSizes));
            }
        }
        return nZDICT_finalizeDictionary(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(dictContent), dictContent.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining(), parameters.address());
    }

    private static long getSamplesBufferSize(PointerBuffer samplesSizes) {
        long bytes = 0L;
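        // PointerBuffer.get(int) is absolute, so offset by position() to sum exactly the remaining() sizes passed to the native call.
        for (int i = 0; i < samplesSizes.remaining(); i++) {
            bytes += samplesSizes.get(samplesSizes.position() + i);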
        }
        return bytes;
    }

}