package com.github.tommyettinger.digital;

import java.util.Random;

/**
 * Different methods for distributing input {@code long} or {@code double} values from a given domain into specific
 * distributions, such as the normal distribution. {@link #probit(double)} and {@link #probitHighPrecision(double)} take
 * a double in the 0.0 to 1.0 range (typically exclusive, but not required to be), and produce a normal-distributed
 * double centered on 0.0 with standard deviation 1.0. {@link #normal(long)} takes a long in the entire range of
 * possible long values, and also produces a double centered on 0.0 with standard deviation 1.0. Similarly,
 * {@link #normalF(int)} takes an int in the entire range of possible int values, and produces a float centered on 0f
 * with standard deviation 1f. All of these methods preserve patterns in the input, so inputs close to the lowest
 * possible input (0.0 for probit(), {@link Long#MIN_VALUE} for normal(), {@link Integer#MIN_VALUE} for normalF()) will
 * produce the lowest possible output (-8.375 for probit(), normal(), and normalF()),
 * and similarly for the highest possible inputs producing the highest possible outputs.
 */
public final class Distributor {

    private Distributor() {}

    private static final double[] TABLE = new double[1024];
    private static final float[] TABLE_F = new float[1024];

    static {
        for (int i = 0; i < 1024; i++) {
            TABLE_F[i] = (float) (TABLE[i] = probitHighPrecision(0.5 + i * 0x1p-11));
        }
    }
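
    // Note on the tables above: TABLE[i] holds probitHighPrecision(0.5 + i * 0x1p-11), the normal-distribution
    // quantile for evenly-spaced probabilities from 0.5 (when i == 0) up to 0.5 + 1023.0/2048.0, in steps of 1/2048.
    // Only the non-negative half of the distribution is stored; normal() and normalF() mirror it with copySign().
    // TABLE_F holds the same values rounded to float, for use by normalF(int).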

    /**
     * A way of taking a double in the (0.0, 1.0) range and mapping it to a Gaussian or normal distribution, so high
     * inputs correspond to high outputs, and similarly for the low range. This is centered on 0.0 and its standard
     * deviation seems to be 1.0 (the same as {@link Random#nextGaussian()}). If this is given an input of 0.0
     * or less, it returns -8.375, which is slightly less than the result when given {@link Double#MIN_VALUE}. If it is
     * given an input of 1.0 or more, it returns 8.375, which is significantly larger than the result when given the
     * largest double less than 1.0 (this value is further from 1.0 than {@link Double#MIN_VALUE} is from 0.0). If
     * given {@link Double#NaN}, it returns whatever {@link Math#copySign(double, double)} returns for the arguments
     * {@code 8.375, Double.NaN}, which is implementation-dependent.
     * 
     * <br>
     * This uses an algorithm by Peter John Acklam, as implemented by Sherali Karimov. Karimov's original source and
     * further information on the algorithm are available online; Wikipedia's page on the probit function may help,
     * but is more likely to just be confusing.
     * <br>
     * Acklam's algorithm and Karimov's implementation are both competitive on speed with the Box-Muller Transform
     * and Marsaglia's Polar Method, but slower than Ziggurat and the {@link #normal(long)} method here. This isn't
     * quite as precise as Box-Muller or Marsaglia Polar, and can't produce min and max results as extreme as they
     * should be in the rare cases where they appear. If given a typical uniform random {@code double} that's
     * exclusive on 1.0, it won't produce a result higher than {@code 8.209536145151493}, and will only produce
     * results of at least {@code -8.209536145151493} if 0.0 is excluded from the inputs (if 0.0 is an input, the
     * result is {@code -8.375}). This requires a fair amount of floating-point multiplication and one division for
     * all {@code d} between 0 and 1 exclusive, but roughly 1/20 of the time it needs a {@link Math#sqrt(double)}
     * and {@link Math#log(double)} as well.
     * <br>
     * This can be used both as an optimization for generating Gaussian random values, and as a way of generating
     * Gaussian values that match a pattern present in the inputs (which you could have by using a sub-random
     * sequence as the input, such as those produced by a van der Corput, Halton, Sobol or R2 sequence). Most
     * methods of generating Gaussian values (e.g. Box-Muller and Marsaglia polar) do not have any way to preserve a
     * particular pattern. Note that if you don't need to preserve patterns in input, then either the Ziggurat
     * method (which is available and the default in the juniper library for pseudo-random generation) or the
     * Marsaglia polar method (which is the default in the JDK Random class) will perform better in each one's
     * optimal circumstances. The {@link #normal(long)} method here (using the Linnormal algorithm) both preserves
     * patterns in input (given a {@code long}) and is faster than Ziggurat, making it the quickest here, though at
     * some cost to precision.
     *
     * @param d should be between 0 and 1, exclusive, but other values are tolerated
     * @return a normal-distributed double centered on 0.0; all results will be between -8.375 and 8.375, both inclusive
     * @see #probitHighPrecision(double) There is a higher-precision, slower variant on this method.
     */
    public static double probit(final double d) {
        if (d <= 0 || d >= 1) {
            // Out-of-range inputs clamp to the extremes; copySign() picks the side based on d - 0.5.
            return Math.copySign(8.375, d - 0.5);
        } else if (d < 0.02425) {
            // Rational approximation for the lower tail.
            final double q = Math.sqrt(-2.0 * Math.log(d));
            return (((((-7.784894002430293e-03 * q - 3.223964580411365e-01) * q - 2.400758277161838e+00) * q
                    - 2.549732539343734e+00) * q + 4.374664141464968e+00) * q + 2.938163982698783e+00) /
                    ((((7.784695709041462e-03 * q + 3.224671290700398e-01) * q + 2.445134137142996e+00) * q
                            + 3.754408661907416e+00) * q + 1.0);
        } else if (0.97575 < d) {
            // Rational approximation for the upper tail, mirroring the lower tail.
            final double q = Math.sqrt(-2.0 * Math.log(1 - d));
            return -(((((-7.784894002430293e-03 * q - 3.223964580411365e-01) * q - 2.400758277161838e+00) * q
                    - 2.549732539343734e+00) * q + 4.374664141464968e+00) * q + 2.938163982698783e+00) /
                    ((((7.784695709041462e-03 * q + 3.224671290700398e-01) * q + 2.445134137142996e+00) * q
                            + 3.754408661907416e+00) * q + 1.0);
        }
        // Rational approximation for the central region.
        final double q = d - 0.5;
        final double r = q * q;
        return (((((-3.969683028665376e+01 * r + 2.209460984245205e+02) * r - 2.759285104469687e+02) * r
                + 1.383577518672690e+02) * r - 3.066479806614716e+01) * r + 2.506628277459239e+00) * q /
                (((((-5.447609879822406e+01 * r + 1.615858368580409e+02) * r - 1.556989798598866e+02) * r
                        + 6.680131188771972e+01) * r - 1.328068155288572e+01) * r + 1.0);
    }
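
    /**
     * A minimal usage sketch, not part of the library's API: this shows how {@link #probit(double)} turns a uniform
     * {@code double} into a normal-distributed one while keeping ordering (larger uniform inputs give larger
     * results). The method name {@code exampleProbitDraw} and the use of {@link Random} here are illustrative
     * assumptions, not anything the library prescribes.
     */
    static double exampleProbitDraw(Random rng) {
        // nextDouble() is uniform in [0.0, 1.0); probit() maps 0.0 to -8.375 and values near 1.0
        // toward (but not past) 8.209536145151493.
        return probit(rng.nextDouble());
    }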

    /**
     * Complementary error function, partial implementation; only valid for non-negative {@code x}.
     * See Wikipedia's page on the error function for more.
     *
     * @param x any non-negative double
     * @return a double between 0 and 1, inclusive
     */
    private static double erfcBase(double x) {
        return ((0.56418958354775629) / (x + 2.06955023132914151)) *
                ((x * (x + 2.71078540045147805) + 5.80755613130301624) / (x * (x + 3.47954057099518960) + 12.06166887286239555)) *
                ((x * (x + 3.47469513777439592) + 12.07402036406381411) / (x * (x + 3.72068443960225092) + 8.44319781003968454)) *
                ((x * (x + 4.00561509202259545) + 9.30596659485887898) / (x * (x + 3.90225704029924078) + 6.36161630953880464)) *
                ((x * (x + 5.16722705817812584) + 9.12661617673673262) / (x * (x + 4.03296893109262491) + 5.13578530585681539)) *
                ((x * (x + 5.95908795446633271) + 9.19435612886969243) / (x * (x + 4.11240942957450885) + 4.48640329523408675)) *
                Math.exp(-x * x);
    }

    /**
     * The complementary error function, equivalent to {@code 1.0 - erf(x)}.
     * This uses a different approximation that should have extremely low error (below {@code 0x1p-53}, or
     * {@code 1.1102230246251565E-16}).
     * See Wikipedia's page on the error function for more.
     *
     * @param x any finite double
     * @return a double between 0 and 2, inclusive
     */
    private static double erfc(double x) {
        // erfc(-x) == 2 - erfc(x), so the approximation only needs to cover the non-negative half.
        return x >= 0 ? erfcBase(x) : 2.0 - erfcBase(-x);
    }

    /**
     * This is the same as {@link #probit(double)}, except that it performs an additional step of post-processing to
     * bring the result even closer to the normal distribution. It also produces normal-distributed doubles (with
     * standard deviation 1.0) given inputs between 0.0 and 1.0, exclusive.
     *
     * @param d should be between {@link Double#MIN_NORMAL}, inclusive, and 1, exclusive; subnormal values may return NaN
     * @return a normal-distributed double centered on 0.0
     * @see #probit(double) There is a lower-precision, faster variant on this method, which this uses internally.
     */
    public static double probitHighPrecision(double d) {
        double x = probit(d);
        if (d > 0.0 && d < 1.0 && d != 0.5) {
            // One refinement step against the exact normal CDF, evaluated via erfc().
            double e = 0.5 * erfc(x * -0.7071067811865475) - d;          // -0.7071067811865475 == -1.0 / Math.sqrt(2.0)
            double u = e * 2.5066282746310002 * Math.exp((x * x) * 0.5); // 2.5066282746310002 == Math.sqrt(2 * Math.PI)
            x = x - u / (1.0 + x * u * 0.5);
        }
        return x;
    }
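
    /**
     * A small illustrative check, not part of the library: {@link #probitHighPrecision(double)} approximates the
     * inverse of the normal CDF, so feeding its result back through the CDF, written here as
     * {@code 0.5 * erfc(-x / sqrt(2))}, should roughly recover the original probability. The method name
     * {@code exampleCdfRoundTrip} is hypothetical.
     */
    static double exampleCdfRoundTrip(double p) {
        double x = probitHighPrecision(p);
        // For p in (0, 1), the returned value should be very close to p.
        return 0.5 * erfc(x * -0.7071067811865475);
    }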

    /**
     * Given any {@code long} as input, this maps the full range of non-negative long values to much of the
     * non-negative half of the range of the normal distribution with standard deviation 1.0, and maps all negative
     * long values to the mirrored negative half. Notably, an input of 0 will map to {@code 0.0}, an input of -1
     * will map to {@code -0.0}, and inputs of {@link Long#MIN_VALUE} and {@link Long#MAX_VALUE} will map to
     * {@code -8.375} and {@code 8.375}, respectively. If you only pass this small sequential inputs, there may be
     * no detectable difference between some outputs. This is meant to be given inputs with large differences (at
     * least millions) if very different outputs are desired.
     * <br>
     * The algorithm here can be called Linnormal; it is comparatively quite simple, and mostly relies on lookup
     * from a precomputed table of results of {@link #probitHighPrecision(double)}, followed by linear
     * interpolation. Values in the "trail" of the normal distribution, that is, those produced by long values in
     * the uppermost 1/2048 of all values or the lowermost 1/2048 of all values, are computed slightly differently.
     * Where the other parts of the distribution use the bottom 53 bits to make an interpolant between 0.0 and 1.0
     * and use it verbatim, values in the trail do all that and then square that interpolant, before going through
     * the same type of interpolation.
     * <br>
     * This is like the Ziggurat algorithm for producing normal-distributed doubles, but this preserves patterns in
     * the input. It uses a large table of the results of {@link #probitHighPrecision(double)}, and interpolates
     * between them linearly. This tends to be faster than Ziggurat at generating normal-distributed values, though
     * it probably has slightly worse quality. Since Ziggurat is already much faster than other common methods, such
     * as the Box-Muller Transform, the {@link #probit(double)} function, or the Marsaglia Polar Method (which Java
     * itself uses), this being faster than Ziggurat is a good thing. All methods of generating normal-distributed
     * variables while preserving input patterns are approximations, and this is slightly less accurate than some
     * alternatives (but better than the simplest approaches, like just summing many random variables and
     * re-centering around 0).
     *
     * @param n any long; input patterns will be preserved
     * @return a normal-distributed double, matching patterns in {@code n}
     */
    public static double normal(long n) {
        final long sign = n >> 63;
        n ^= sign; // flip negative inputs to non-negative; the sign is restored at the end
        final int top10 = (int) (n >>> 53);                    // top 10 bits of the magnitude select a table cell
        final double t = (n & 0x1FFFFFFFFFFFFFL) * 0x1p-53, v; // bottom 53 bits form the interpolant
        if (top10 == 1023) {
            // The trail: square the interpolant before blending toward the extreme value 8.375.
            v = t * t * (8.375 - 3.297193345691938) + 3.297193345691938;
        } else {
            final double s = TABLE[top10];
            v = t * (TABLE[top10 + 1] - s) + s; // linear interpolation between adjacent quantiles
        }
        return Math.copySign(v, sign);
    }
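
    /**
     * An illustrative sketch, not part of the library: because {@link #normal(long)} preserves patterns in its
     * input, feeding it a low-discrepancy sequence of longs yields normal-distributed values spread more evenly
     * than independent random draws. The Weyl-sequence increment below (2 to the 64 divided by the golden ratio)
     * and the method name {@code exampleStratifiedNormals} are assumptions made for this example only.
     */
    static double[] exampleStratifiedNormals(int count) {
        double[] results = new double[count];
        long state = 0L;
        for (int i = 0; i < count; i++) {
            state += 0x9E3779B97F4A7C15L; // wraps around the full long range, covering it evenly
            results[i] = normal(state);   // large, evenly-spaced input gaps give well-spread outputs
        }
        return results;
    }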

    /**
     * Given any {@code int} as input, this maps the full range of non-negative int values to much of the
     * non-negative half of the range of the normal distribution with standard deviation 1f, and maps all negative
     * int values to the mirrored negative half. Notably, an input of 0 will map to {@code 0f}, an input of -1 will
     * map to {@code -0f}, and inputs of {@link Integer#MIN_VALUE} and {@link Integer#MAX_VALUE} will map to
     * {@code -8.375f} and {@code 8.375f}, respectively. If you only pass this small sequential inputs, there may be
     * no detectable difference between some outputs. This is meant to be given inputs with large differences (at
     * least millions) if very different outputs are desired.
     * <br>
     * The algorithm here can be called Linnormal; it is comparatively quite simple, and mostly relies on lookup
     * from a precomputed table of results of {@link #probitHighPrecision(double)}, followed by linear
     * interpolation. Values in the "trail" of the normal distribution, that is, those produced by int values in the
     * uppermost 1/2048 of all values or the lowermost 1/2048 of all values, are computed slightly differently.
     * Where the other parts of the distribution use the bottom 21 bits to make an interpolant between 0.0 and 1.0
     * and use it verbatim, values in the trail do all that and then square that interpolant, before going through
     * the same type of interpolation.
     * <br>
     * This is like the Ziggurat algorithm for producing normal-distributed floats, but this preserves patterns in
     * the input. It uses a large table of the results of {@link #probitHighPrecision(double)}, and interpolates
     * between them linearly. This tends to be faster than Ziggurat at generating normal-distributed values, though
     * it probably has slightly worse quality. Since Ziggurat is already much faster than other common methods, such
     * as the Box-Muller Transform, the {@link #probit(double)} function, or the Marsaglia Polar Method (which Java
     * itself uses), this being faster than Ziggurat is a good thing. All methods of generating normal-distributed
     * variables while preserving input patterns are approximations, and this is slightly less accurate than some
     * alternatives (but better than the simplest approaches, like just summing many random variables and
     * re-centering around 0).
     *
     * @param n any int; input patterns will be preserved
     * @return a normal-distributed float, matching patterns in {@code n}
     */
    public static float normalF(int n) {
        final int sign = n >> 31;
        n ^= sign; // flip negative inputs to non-negative; the sign is restored at the end
        final int top10 = (n >>> 21);                 // top 10 bits of the magnitude select a table cell
        final float t = (n & 0x1FFFFF) * 0x1p-21f, v; // bottom 21 bits form the interpolant
        if (top10 == 1023) {
            // The trail: square the interpolant before blending toward the extreme value 8.375f.
            v = t * t * (8.375005f - 3.297193345691938f) + 3.297193345691938f;
        } else {
            final float s = TABLE_F[top10];
            v = t * (TABLE_F[top10 + 1] - s) + s; // linear interpolation between adjacent quantiles
        }
        return Math.copySign(v, sign);
    }
}
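
// Usage sketch (illustrative; the Random seed and variable names are assumptions, not part of this file):
//
//     Random rng = new Random(42L);
//     double g1 = Distributor.probit(rng.nextDouble()); // uniform (0,1) double -> normal double
//     double g2 = Distributor.normal(rng.nextLong());   // any long -> normal double, patterns preserved
//     float  g3 = Distributor.normalF(rng.nextInt());   // any int -> normal float, patterns preserved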



