org.lwjgl.util.zstd.Zdict Maven / Gradle / Ivy
Show all versions of lwjgl-zstd Show documentation
/*
* Copyright LWJGL. All rights reserved.
* License terms: https://www.lwjgl.org/license
* MACHINE GENERATED FILE, DO NOT EDIT
*/
package org.lwjgl.util.zstd;
import org.jspecify.annotations.*;
import java.nio.*;
import org.lwjgl.*;
import org.lwjgl.system.*;
import static org.lwjgl.system.Checks.*;
import static org.lwjgl.system.MemoryUtil.*;
/**
* Native bindings to the dictionary builder API of Zstandard (zstd).
*
* Why should I use a dictionary?
*
* Zstd can use dictionaries to improve compression ratio of small data. Traditionally small files don't compress well because there is very little
* repetition in a single sample, since it is small. But, if you are compressing many similar files, like a bunch of JSON records that share the same
* structure, you can train a dictionary on ahead of time on some samples of these files. Then, zstd can use the dictionary to find repetitions that are
* present across samples. This can vastly improve compression ratio.
*
* When is a dictionary useful?
*
* Dictionaries are useful when compressing many small files that are similar. The larger a file is, the less benefit a dictionary will have. Generally,
* we don't expect dictionary compression to be effective past 100KB. And the smaller a file is, the more we would expect the dictionary to help.
*
* How do I use a dictionary?
*
* Simply pass the dictionary to the zstd compressor with {@link Zstd#ZSTD_CCtx_loadDictionary CCtx_loadDictionary}. The same dictionary must then be passed to the decompressor, using
* {@link Zstd#ZSTD_DCtx_loadDictionary DCtx_loadDictionary}. There are other more advanced functions that allow selecting some options, see {@code zstd.h} for complete documentation.
*
* What is a zstd dictionary?
*
* A zstd dictionary has two pieces: Its header, and its content. The header contains a magic number, the dictionary ID, and entropy tables. These entropy
* tables allow zstd to save on header costs in the compressed file, which really matters for small data. The content is just bytes, which are repeated
* content that is common across many samples.
*
* What is a raw content dictionary?
*
* A raw content dictionary is just bytes. It doesn't have a zstd dictionary header, a dictionary ID, or entropy tables. Any buffer is a valid raw content
* dictionary.
*
* How do I train a dictionary?
*
* Gather samples from your use case. These samples should be similar to each other. If you have several use cases, you could try to train one dictionary
* per use case.
*
* Pass those samples to {@link #ZDICT_trainFromBuffer trainFromBuffer} and that will train your dictionary. There are a few advanced versions of this function, but this is a great
* starting point. If you want to further tune your dictionary you could try {@link #ZDICT_optimizeTrainFromBuffer_cover optimizeTrainFromBuffer_cover}. If that is too slow you can try
* {@link #ZDICT_optimizeTrainFromBuffer_fastCover optimizeTrainFromBuffer_fastCover}.
*
* If the dictionary training function fails, that is likely because you either passed too few samples, or a dictionary would not be effective for your
* data. Look at the messages that the dictionary trainer printed, if it doesn't say too few samples, then a dictionary would not be effective.
*
* How large should my dictionary be?
*
* A reasonable dictionary size, the {@code dictBufferCapacity}, is about 100KB. The zstd CLI defaults to a 110KB dictionary. You likely don't need a
* dictionary larger than that. But, most use cases can get away with a smaller dictionary. The advanced dictionary builders can automatically shrink the
* dictionary for you, and select a the smallest size that doesn't hurt compression ratio too much. See the {@code shrinkDict} parameter. A smaller
* dictionary can save memory, and potentially speed up compression.
*
* How many samples should I provide to the dictionary builder?
*
* We generally recommend passing ~100x the size of the dictionary in samples. A few thousand should suffice. Having too few samples can hurt the
* dictionaries effectiveness. Having more samples will only improve the dictionaries effectiveness. But having too many samples can slow down the
* dictionary builder.
*
* How do I determine if a dictionary will be effective?
*
* Simply train a dictionary and try it out. You can use zstd's built in benchmarking tool to test the dictionary effectiveness.
*
*
* # Benchmark levels 1-3 without a dictionary
* zstd -b1e3 -r /path/to/my/files
* # Benchmark levels 1-3 with a dictionary
* zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary
*
* When should I retrain a dictionary?
*
* You should retrain a dictionary when its effectiveness drops. Dictionary effectiveness drops as the data you are compressing changes. Generally, we do
* expect dictionaries to "decay" over time, as your data changes, but the rate at which they decay depends on your use case. Internally, we regularly
* retrain dictionaries, and if the new dictionary performs significantly better than the old dictionary, we will ship the new dictionary.
*
* I have a raw content dictionary, how do I turn it into a zstd dictionary?
*
* If you have a raw content dictionary, e.g. by manually constructing it, or using a third-party dictionary builder, you can turn it into a zstd
* dictionary by using {@link #ZDICT_finalizeDictionary finalizeDictionary}. You'll also have to provide some samples of the data. It will add the zstd header to the raw content, which
* contains a dictionary ID and entropy tables, which will improve compression ratio, and allow zstd to write the dictionary ID into the frame, if you so
* choose.
*
* Do I have to use zstd's dictionary builder?
*
* No! You can construct dictionary content however you please, it is just bytes. It will always be valid as a raw content dictionary. If you want a zstd
* dictionary, which can improve compression ratio, use {@link #ZDICT_finalizeDictionary finalizeDictionary}.
*
* What is the attack surface of a zstd dictionary?
*
* Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so zstd should never crash, or access out-of-bounds memory no matter what the
* dictionary is. However, if an attacker can control the dictionary during decompression, they can cause zstd to generate arbitrary bytes, just like if
* they controlled the compressed data.
*/
public class Zdict {
static { LibZstd.initialize(); }
public static final int
ZDICT_CONTENTSIZE_MIN = 128,
ZDICT_DICTSIZE_MIN = 256;
protected Zdict() {
throw new UnsupportedOperationException();
}
// --- [ ZDICT_trainFromBuffer ] ---
/** Unsafe version of: {@link #ZDICT_trainFromBuffer trainFromBuffer} */
public static native long nZDICT_trainFromBuffer(long dictBuffer, long dictBufferCapacity, long samplesBuffer, long samplesSizes, int nbSamples);
/**
* Train a dictionary from an array of samples.
*
* Redirect towards {@link #ZDICT_optimizeTrainFromBuffer_fastCover optimizeTrainFromBuffer_fastCover} single-threaded, with {@code d=8}, {@code steps=4}, {@code f=20}, and {@code accel=1}.
*
* Samples must be stored concatenated in a single flat buffer {@code samplesBuffer}, supplied with an array of sizes {@code samplesSizes}, providing the
* size of each sample, in order.
*
* The resulting dictionary will be saved into {@code dictBuffer}.
*
* Note: {@code ZDICT_trainFromBuffer()} requires about 9 bytes of memory for each input byte.
*
* Tips:
*
*
* - In general, a reasonable dictionary has a size of ~ 100 KB.
* - It's possible to select smaller or larger size, just by specifying {@code dictBufferCapacity}.
* - In general, it's recommended to provide a few thousands samples, though this can vary a lot.
* - It's recommended that total size of all samples be about ~x100 times the target size of dictionary.
*
*
* @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}.
*/
@NativeType("size_t")
public static long ZDICT_trainFromBuffer(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes) {
if (CHECKS) {
if (DEBUG) {
check(samplesBuffer, getSamplesBufferSize(samplesSizes));
}
}
return nZDICT_trainFromBuffer(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining());
}
// --- [ ZDICT_getDictID ] ---
/** Unsafe version of: {@link #ZDICT_getDictID getDictID} */
public static native int nZDICT_getDictID(long dictBuffer, long dictSize);
/**
* Extracts {@code dictID}.
*
* @return zero if error (not a valid dictionary)
*/
@NativeType("unsigned int")
public static int ZDICT_getDictID(@NativeType("void const *") ByteBuffer dictBuffer) {
return nZDICT_getDictID(memAddress(dictBuffer), dictBuffer.remaining());
}
// --- [ ZDICT_isError ] ---
public static native int nZDICT_isError(long errorCode);
@NativeType("unsigned int")
public static boolean ZDICT_isError(@NativeType("size_t") long errorCode) {
return nZDICT_isError(errorCode) != 0;
}
// --- [ ZDICT_getErrorName ] ---
public static native long nZDICT_getErrorName(long errorCode);
@NativeType("char const *")
public static @Nullable String ZDICT_getErrorName(@NativeType("size_t") long errorCode) {
long __result = nZDICT_getErrorName(errorCode);
return memASCIISafe(__result);
}
// --- [ ZDICT_trainFromBuffer_cover ] ---
/** Unsafe version of: {@link #ZDICT_trainFromBuffer_cover trainFromBuffer_cover} */
public static native long nZDICT_trainFromBuffer_cover(long dictBuffer, long dictBufferCapacity, long samplesBuffer, long samplesSizes, int nbSamples, long parameters);
/**
* Train a dictionary from an array of samples using the COVER algorithm.
*
* Samples must be stored concatenated in a single flat buffer {@code samplesBuffer}, supplied with an array of sizes {@code samplesSizes}, providing the
* size of each sample, in order.
*
* The resulting dictionary will be saved into {@code dictBuffer}.
*
* Note: {@code ZDICT_trainFromBuffer_cover()} requires about 9 bytes of memory for each input byte.
*
* Tips:
*
*
* - In general, a reasonable dictionary has a size of ~ 100 KB.
* - It's possible to select smaller or larger szie, just by specifying {@code dictBufferCapacity}.
* - In general, it's recommended to provide a few thousands samples, though this can vary a lot.
* - It's recommended that total size of all samples be about ~x100 times the target size of dictionary.
*
*
* @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}.
*/
@NativeType("size_t")
public static long ZDICT_trainFromBuffer_cover(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes, @NativeType("ZDICT_cover_params_t") ZDICTCoverParams parameters) {
if (CHECKS) {
if (DEBUG) {
check(samplesBuffer, getSamplesBufferSize(samplesSizes));
}
}
return nZDICT_trainFromBuffer_cover(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining(), parameters.address());
}
// --- [ ZDICT_optimizeTrainFromBuffer_cover ] ---
/** Unsafe version of: {@link #ZDICT_optimizeTrainFromBuffer_cover optimizeTrainFromBuffer_cover} */
public static native long nZDICT_optimizeTrainFromBuffer_cover(long dictBuffer, long dictBufferCapacity, long samplesBuffer, long samplesSizes, int nbSamples, long parameters);
/**
* The same requirements as {@link #ZDICT_trainFromBuffer_cover trainFromBuffer_cover} hold for all the parameters except {@code parameters}.
*
* This function tries many parameter combinations and picks the best parameters. {@code *parameters} is filled with the best parameters found, dictionary
* constructed with those parameters is stored in {@code dictBuffer}.
*
*
* - All of the parameters {@code d}, {@code k}, {@code steps} are optional.
* - If {@code d} is non-zero then we don't check multiple values of {@code }d, otherwise we check {@code d = {6, 8}}.
* - If {@code steps} is zero it defaults to its default value.
* - If {@code k} is non-zero then we don't check multiple values of {@code k}, otherwise we check steps values in {@code [50, 2000]}.
*
*
* Note: {@code ZDICT_optimizeTrainFromBuffer_cover()} requires about 8 bytes of memory for each input byte and additionally another 5 bytes of memory for
* each byte of memory for each thread.
*
* @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}. On success
* {@code *parameters} contains the parameters selected.
*/
@NativeType("size_t")
public static long ZDICT_optimizeTrainFromBuffer_cover(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes, @NativeType("ZDICT_cover_params_t *") ZDICTCoverParams parameters) {
if (CHECKS) {
if (DEBUG) {
check(samplesBuffer, getSamplesBufferSize(samplesSizes));
}
}
return nZDICT_optimizeTrainFromBuffer_cover(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining(), parameters.address());
}
// --- [ ZDICT_trainFromBuffer_fastCover ] ---
/** Unsafe version of: {@link #ZDICT_trainFromBuffer_fastCover trainFromBuffer_fastCover} */
public static native long nZDICT_trainFromBuffer_fastCover(long dictBuffer, long dictBufferCapacity, long samplesBuffer, long samplesSizes, int nbSamples, long parameters);
/**
* Train a dictionary from an array of samples using a modified version of COVER algorithm.
*
* Samples must be stored concatenated in a single flat buffer {@code samplesBuffer}, supplied with an array of sizes {@code samplesSizes}, providing the
* size of each sample, in order. {@code d} and {@code k} are required. All other parameters are optional, will use default values if not provided. The
* resulting dictionary will be saved into {@code dictBuffer}.
*
* Note: {@code ZDICT_trainFromBuffer_fastCover()} requires about 1 bytes of memory for each input byte and additionally another {@code 6 * 2^f} bytes of
* memory.
*
* Tips: In general, a reasonable dictionary has a size of {@code ~100 KB}. It's possible to select smaller or larger size, just by specifying
* {@code dictBufferCapacity}. In general, it's recommended to provide a few thousands samples, though this can vary a lot. It's recommended that total
* size of all samples be about {@code ~x100} times the target size of dictionary.
*
* @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}.
*/
@NativeType("size_t")
public static long ZDICT_trainFromBuffer_fastCover(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes, @NativeType("ZDICT_fastCover_params_t") ZDICTFastCoverParams parameters) {
if (CHECKS) {
if (DEBUG) {
check(samplesBuffer, getSamplesBufferSize(samplesSizes));
}
}
return nZDICT_trainFromBuffer_fastCover(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining(), parameters.address());
}
// --- [ ZDICT_optimizeTrainFromBuffer_fastCover ] ---
/** Unsafe version of: {@link #ZDICT_optimizeTrainFromBuffer_fastCover optimizeTrainFromBuffer_fastCover} */
public static native long nZDICT_optimizeTrainFromBuffer_fastCover(long dictBuffer, long dictBufferCapacity, long samplesBuffer, long samplesSizes, int nbSamples, long parameters);
/**
* The same requirements as {@link #ZDICT_trainFromBuffer_fastCover trainFromBuffer_fastCover} hold for all the parameters except {@code parameters}.
*
* This function tries many parameter combinations (specifically, {@code k} and {@code d} combinations) and picks the best parameters. {@code *parameters}
* is filled with the best parameters found, dictionary constructed with those parameters is stored in {@code dictBuffer}.
*
*
* - All of the parameters {@code d}, {@code k}, {@code steps}, {@code f}, and {@code accel} are optional.
* - If {@code d} is non-zero then we don't check multiple values of {@code d}, otherwise we check {@code d = {6, 8}}.
* - If {@code steps} is zero it defaults to its default value.
* - If {@code k} is non-zero then we don't check multiple values of {@code k}, otherwise we check steps values in {@code [50, 2000]}.
* - If {@code f} is zero, default value of 20 is used.
* - If {@code accel} is zero, default value of 1 is used.
*
*
* Note: {@code ZDICT_optimizeTrainFromBuffer_fastCover()} requires about 1 byte of memory for each input byte and additionally another {@code 6 * 2^f}
* bytes of memory for each thread.
*
* @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}. On success
* {@code *parameters} contains the parameters selected.
*/
@NativeType("size_t")
public static long ZDICT_optimizeTrainFromBuffer_fastCover(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes, @NativeType("ZDICT_fastCover_params_t *") ZDICTFastCoverParams parameters) {
if (CHECKS) {
if (DEBUG) {
check(samplesBuffer, getSamplesBufferSize(samplesSizes));
}
}
return nZDICT_optimizeTrainFromBuffer_fastCover(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining(), parameters.address());
}
// --- [ ZDICT_finalizeDictionary ] ---
/** Unsafe version of: {@link #ZDICT_finalizeDictionary finalizeDictionary} */
public static native long nZDICT_finalizeDictionary(long dictBuffer, long dictBufferCapacity, long dictContent, long dictContentSize, long samplesBuffer, long samplesSizes, int nbSamples, long parameters);
/**
* Given a custom content as a basis for dictionary, and a set of samples, finalize dictionary by adding headers and statistics.
*
* Samples must be stored concatenated in a flat buffer {@code samplesBuffer}, supplied with an array of sizes {@code samplesSizes}, providing the size of
* each sample in order.
*
* Notes:
*
*
* - {@code maxDictSize} must be ≥ {@code dictContentSize}, and must be ≥ {@link #ZDICT_DICTSIZE_MIN DICTSIZE_MIN} bytes.
* - {@code ZDICT_finalizeDictionary()} will push notifications into {@code stderr} if instructed to, using {@code notificationLevel>0}.
* - {@code dictBuffer} and {@code dictContent} can overlap.
*
*
* @return size of dictionary stored into {@code dictBuffer} (≤ {@code dictBufferCapacity}) or an error code, which can be tested with {@link #ZDICT_isError isError}.
*/
@NativeType("size_t")
public static long ZDICT_finalizeDictionary(@NativeType("void *") ByteBuffer dictBuffer, @NativeType("void const *") ByteBuffer dictContent, @NativeType("void const *") ByteBuffer samplesBuffer, @NativeType("size_t const *") PointerBuffer samplesSizes, @NativeType("ZDICT_params_t") ZDICTParams parameters) {
if (CHECKS) {
if (DEBUG) {
check(samplesBuffer, getSamplesBufferSize(samplesSizes));
}
}
return nZDICT_finalizeDictionary(memAddress(dictBuffer), dictBuffer.remaining(), memAddress(dictContent), dictContent.remaining(), memAddress(samplesBuffer), memAddress(samplesSizes), samplesSizes.remaining(), parameters.address());
}
private static long getSamplesBufferSize(PointerBuffer samplesSizes) {
long bytes = 0L;
for (int i = 0; i < samplesSizes.remaining(); i++) {
bytes += samplesSizes.get(i);
}
return bytes;
}
}