hep.aida.tfloat.bin.QuantileFloatBin1D Maven / Gradle / Ivy

Go to download
Show more of this group Show more artifacts with this name
Show all versions of parallelcolt Show documentation
Parallel Colt is a multithreaded version of Colt - a library for high performance scientific computing in Java. It contains efficient algorithms for data analysis, linear algebra, multi-dimensional arrays, Fourier transforms, statistics and histogramming.
The newest version!
package hep.aida.tfloat.bin;

import cern.colt.list.tfloat.FloatArrayList;
import cern.jet.random.tfloat.engine.FloatRandomEngine;
import cern.jet.stat.tfloat.quantile.FloatQuantileFinder;
import cern.jet.stat.tfloat.quantile.FloatQuantileFinderFactory;

/**
 * 1-dimensional non-rebinnable bin holding float elements with
 * scalable quantile operations defined upon; Using little main memory, quickly
 * computes approximate quantiles over very large data sequences with and even
 * without a-priori knowledge of the number of elements to be filled;
 * Conceptually a strongly lossily compressed multiset (or bag); Guarantees to
 * respect the worst case approximation error specified upon instance
 * construction. First see the package
 * summary and javadoc tree view to get the
 * broad picture.
 * 
 * Motivation and Problem: Intended to help scale applications requiring
 * quantile computation. Quantile computation on very large data sequences is
 * problematic, for the following reasons: Computing quantiles requires sorting
 * the data sequence. To sort a data sequence the entire data sequence needs to
 * be available. Thus, data cannot be thrown away during filling (as done by
 * static bins like {@link StaticFloatBin1D} and {@link MightyStaticFloatBin1D}
 * ). It needs to be kept, either in main memory or on disk. There is often not
 * enough main memory available. Thus, during filling data needs to be streamed
 * onto disk. Sorting disk resident data is prohibitively time consuming. As a
 * consequence, traditional methods either need very large memories (like
 * {@link DynamicFloatBin1D}) or time consuming disk based sorting.
 * 

 * This class proposes to efficiently solve the problem, at the expense of
 * producing approximate rather than exact results. It can deal with infinitely
 * many elements without resorting to disk. The main memory requirements are
 * smaller than for any other known approximate technique by an order of
 * magnitude. They get even smaller if an upper limit on the maximum number of
 * elements ever to be added is known a-priori.
 * 

 * Approximation error: The approximation guarantees are parametrizable
 * and explicit but probabilistic, and apply for arbitrary value distributions
 * and arrival distributions of the data sequence. In other words, this class
 * guarantees to respect the worst case approximation error specified upon
 * instance construction to a certain probability. Of course, if it is specified
 * that the approximation error should not exceed some number very close
 * to zero, this class will potentially consume just as much memory as any of
 * the traditional exact techniques would do. However, for errors larger than
 * 10^-5, its memory requirements are modest, as shown by the table
 * below.
 * 

 * Main memory requirements: Given in megabytes, assuming a single
 * element (float) takes 8 byte. The number of elements required is
 * then MB*1024*1024/8.
 * 

 * Parameters:
 * 

 * epsilon - the maximum allowed approximation error on quantiles; in
 * [0.0,1.0]. To get exact rather than approximate quantiles, set
 * epsilon=0.0;
 * 
 * 
delta - the probability allowed that the approximation error fails
 * to be smaller than epsilon; in [0.0,1.0]. To avoid probabilistic
 * answers, set delta=0.0. For example, delta = 0.0001 is
 * equivalent to a confidence of 99.99%.
 * 
 * 
quantiles - the number of quantiles to be computed; in
 * [0,Integer.MAX_VALUE].
 * 
 * 
is N known? - specifies whether the exact size of the dataset over
 * which quantiles are to be computed is known.
 * 
 * 
N_max - the exact dataset size, if known. Otherwise, an upper
 * limit on the dataset size. If no upper limit is known set to infinity (
 * Long.MAX_VALUE).
 * 
 * N_max=inf - we are sure that exactly (known) or less than
 * (unknown) infinity elements will be added. 

 * N_max=10⁶ - we are sure that exactly (known) or
 * less than (unknown) 10⁶ elements will be added. 

 * N_max=10⁷ - we are sure that exactly (known) or
 * less than (unknown) 10⁷ elements will be added. 

 * N_max=10⁸ - we are sure that exactly (known) or
 * less than (unknown) 10⁸ elements will be added.
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 *  Required main memory [MB]
#quantiles 
 
 epsilon delta   N
 * unknown 
 * N known
 * 
N_max=inf N_max=10⁶ N_max=10⁷ N_max=10⁸ N_max=inf N_max=10⁶ N_max=10⁷ N_max=10⁸
 
    
any 0 any infinity
 * 7.6 76 762 infinity
 * 7.6 76 762
any 10^-1 0 infinity 0.003 0.005 0.006 0.03 0.003 0.005 0.006
10^-2 0.02 0.03 0.05 0.31 0.02 0.03 0.05
10^-3 0.12 0.2 0.3 2.7 0.12 0.2 0.3
10^-4 0.6 1.2 2.1 26.9 0.6 1.2 2.1
10^-5 2.5 6.4 11.6 205 2.5 6.4 11.6
10^-6 7.6 25.4 63.6 1758 7.6 25.4 63.6
   
100 
 * 10^-2
 10^{* -1} 0.033 0.021 0.03 0.03 0.020 0.020 0.020 0.020
10^{* -5} 0.038 0.021 0.03 0.04 0.024 0.020 0.020 0.020

 * 10^-3
 10^-1 0.48 0.12 0.2 0.3 0.32 0.12 0.2 0.3
10^-5 0.54 0.12 0.2 0.3 0.37 0.12 0.2 0.3

 * 10^-4
 10^-1 6.6 0.6 1.2 2.1 4.6 0.6 1.2 2.1
10^-5 7.2 0.6 1.2 2.1 5.2 0.6 1.2 2.1

 * 10^-5
 10^-1 86 2.5 6.4 11.6 63 2.5 6.4 11.6
10^-5 94 2.5 6.4 11.6 70 2.5 6.4 11.6
   

 * 10000 
 * 10^-2
 10^-1 0.04 0.02 0.03 0.04 0.02 0.02 0.02 0.02
10^-5 0.04 0.02 0.03 0.04 0.03 0.02 0.03 0.03

 * 10^-3
 10^-1 0.52 0.12 0.21 0.3 0.35 0.12 0.21 0.3
10^-5 0.56 0.12 0.21 0.3 0.38 0.12 0.21 0.3

 * 10^-4
 10^-1 7.0 0.64 1.2 2.1 5.0 0.64 1.2 2.1
10^-5 7.5 0.64 1.2 2.1 5.4 0.64 1.2 2.1

 * 10^-5
 10^-1 90 2.5 6.4 11.6 67 2.5 6.4 11.6
10^-5 96 2.5 6.4 11.6 71 2.5 6.4 11.6
     

 * #quantiles 
 * epsilon 
 * delta N_max=inf N_max=10⁶ N_max=10⁷ N_max=10⁸ N_max=inf N_max=10⁶ N_max=10⁷ N_max=10⁸
N
 * unknown N known
 * 

 * Required main memory [MB]
 * 
 * 
 * 
 * Implementation:
 * 

 * After: Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay, Random
 * Sampling Techniques for Space Efficient Online Computation of Order
 * Statistics of Large Datasets. Proc. of the 1999 ACM SIGMOD Int. Conf. on
 * Management of Data, Paper available 
 * here.
 * 

 * and
 * 

 * Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay, Approximate
 * Medians and other Quantiles in One Pass and with Limited Memory, Proc. of the
 * 1998 ACM SIGMOD Int. Conf. on Management of Data, Paper available 
 * here.
 * 

 * The broad picture is as follows. Two concepts are used: Shrinking and
 * Sampling. Shrinking takes a data sequence, sorts it and produces a
 * shrinked data sequence by picking every k-th element and throwing away all
 * the rest. The shrinked data sequence is an approximation to the original data
 * sequence.
 * 

 * Imagine a large data sequence (residing on disk or being generated in memory
 * on the fly) and a main memory block of n=b*k elements (
 * b is the number of buffers, k is the number of elements per
 * buffer). Fill elements from the data sequence into the block until it is full
 * or the data sequence is exhausted. When the block (or a subset of buffers) is
 * full and the data sequence is not exhausted, apply shrinking to lossily
 * compress a number of buffers into one single buffer. Repeat these steps until
 * all elements of the data sequence have been consumed. Now the block is a
 * shrinked approximation of the original data sequence. Treating it as if it
 * would be the original data sequence, we can determine quantiles in main
 * memory.
 * 

 * Now, the whole thing boils down to the question of: Can we choose b
 * and k (the number of buffers and the buffer size) such that
 * b*k is minimized, yet quantiles determined upon the block are
 * guaranteed to be away from the true quantiles no more than some
 * epsilon? It turns out, we can. It also turns out that the required
 * main memory block size n=b*k is usually moderate (see the table
 * above).
 * 

 * The theme can be combined with random sampling to further reduce main memory
 * requirements, at the expense of probabilistic guarantees. Sampling filters
 * the data sequence and feeds only selected elements to the algorithm outlined
 * above. Sampling is turned on or off, depending on the parametrization.
 * 

 * This quick overview does not go into important details, such as assigning
 * proper weights to buffers, how to choose subsets of buffers to shrink,
 * etc. For more information consult the papers cited above.
 * 
 * 

 * Time Performance:
 * 

 * 
Pentium Pro 200 Mhz, SunJDK 1.2.2, NT, java -classic,

 * filling 10 ⁴ elements at a time, reading 100 percentiles at a
 * time,

 * hasSumOfLogarithms()=false, hasSumOfInversions()=false,
 * getMaxOrderForSumOfPowers()=2

 * 
 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * 
 * Performance

 * Quantiles 
 * Epsilon 
 * Delta   Filling 

 * [#elements/sec]   Quantile computation

 * [#quantiles/sec]
N unknown,

 * N_max=inf N known,

 * N_max=10⁷ N unknown,

 * N_max=inf N known,

 * N_max=10⁷
     

 * 10⁴ 10 ^-1 10 ^{* -1} 
 * 
 * 1600000
 * 
 * 1300000 250000 130000
10 ^-2 360000 1200000 50000 20000
10 ^-3 150000 200000 3600 3000
10 ^-4 120000 170000 80 1000
 * 
 * 
 * @see cern.jet.stat.tfloat.quantile
 * @author [email protected]
 * @version 0.9, 03-Jul-99
 */
public class QuantileFloatBin1D extends MightyStaticFloatBin1D {
    /**
     * 
     */
    private static final long serialVersionUID = 1L;
    protected FloatQuantileFinder finder = null;

    /**
     * Not public; for use by subclasses only! Constructs and returns an empty
     * bin.
     */
    protected QuantileFloatBin1D() {
        super(false, false, 2);
    }

    /**
     * Equivalent to
     * new QuantileBin1D(false, Long.MAX_VALUE, epsilon, 0.001, 10000, new cern.jet.random.engine.DRand(new java.util.Date())
     * .
     */
    public QuantileFloatBin1D(float epsilon) {
        this(false, Long.MAX_VALUE, epsilon, 0.001f, 10000, new cern.jet.random.tfloat.engine.FRand(
                new java.util.Date()));
    }

    /**
     * Equivalent to
     * new QuantileBin1D(known_N, N, epsilon, delta, quantiles, randomGenerator, false, false, 2)
     * .
     */
    public QuantileFloatBin1D(boolean known_N, long N, float epsilon, float delta, int quantiles,
            FloatRandomEngine randomGenerator) {
        this(known_N, N, epsilon, delta, quantiles, randomGenerator, false, false, 2);
    }

    /**
     * Constructs and returns an empty bin that, under the given constraints,
     * minimizes the amount of memory needed.
     * 
     * Some applications exactly know in advance over how many elements
     * quantiles are to be computed. Provided with such information the main
     * memory requirements of this class are small. Other applications don't
     * know in advance over how many elements quantiles are to be computed.
     * However, some of them can give an upper limit, which will reduce main
     * memory requirements. For example, if elements are selected from a
     * database and filled into histograms, it is usually not known in advance
     * how many elements are being filled, but one may know that at most
     * S elements, the number of elements in the database, are filled.
     * A third type of application knowns nothing at all about the number of
     * elements to be filled; from zero to infinitely many elements may actually
     * be filled. This method efficiently supports all three types of
     * applications.
     * 
     * @param known_N
     *            specifies whether the number of elements over which quantiles
     *            are to be computed is known or not.
     *            
     * @param N
     *            if known_N==true, the number of elements over which
     *            quantiles are to be computed. if known_N==false, the
     *            upper limit on the number of elements over which quantiles are
     *            to be computed. In other words, the maximum number of elements
     *            ever to be added. If such an upper limit is a-priori unknown,
     *            then set N = Long.MAX_VALUE.
     *            

     * @param epsilon
     *            the approximation error which is guaranteed not to be exceeded
     *            (e.g. 0.001) (0 <= epsilon <= 1). To
     *            get exact rather than approximate quantiles, set
     *            epsilon=0.0;
     *            

     * @param delta
     *            the allowed probability that the actual approximation error
     *            exceeds epsilon (e.g. 0.0001) (0 <= delta <=
     *            1). To avoid probabilistic answers, set delta=0.0.
     *            For example, delta = 0.0001 is equivalent to a
     *            confidence of 99.99%.
     *            

     * @param quantiles
     *            the number of quantiles to be computed (e.g. 100) (
     *            quantiles >= 1). If unknown in advance, set this
     *            number large, e.g. quantiles >= 10000.
     *            

     * @param randomGenerator
     *            a uniform random number generator. Set this parameter to
     *            null to use a default generator seeded with the
     *            current time.
     *            

     *            The next three parameters specify additional capabilities
     *            unrelated to quantile computation. They are identical to the
     *            one's defined in the constructor of the parent class
     *            {@link MightyStaticFloatBin1D}.
     *            

     * @param hasSumOfLogarithms
     *            Tells whether {@link #sumOfLogarithms()} can return meaningful
     *            results. Set this parameter to false if measures of
     *            sum of logarithms, geometric mean and product are not
     *            required.
     *            

     * @param hasSumOfInversions
     *            Tells whether {@link #sumOfInversions()} can return meaningful
     *            results. Set this parameter to false if measures of
     *            sum of inversions, harmonic mean and sumOfPowers(-1) are not
     *            required.
     *            

     * @param maxOrderForSumOfPowers
     *            The maximum order k for which
     *            {@link #sumOfPowers(int)} can return meaningful results. Set
     *            this parameter to at least 3 if the skew is required, to at
     *            least 4 if the kurtosis is required. In general, if moments
     *            are required set this parameter at least as large as the
     *            largest required moment. This method always substitutes
     *            Math.max(2,maxOrderForSumOfPowers) for the parameter
     *            passed in. Thus, sumOfPowers(0..2) always returns
     *            meaningful results.
     */
    public QuantileFloatBin1D(boolean known_N, long N, float epsilon, float delta, int quantiles,
            FloatRandomEngine randomGenerator, boolean hasSumOfLogarithms, boolean hasSumOfInversions,
            int maxOrderForSumOfPowers) {
        super(hasSumOfLogarithms, hasSumOfInversions, maxOrderForSumOfPowers);
        this.finder = FloatQuantileFinderFactory.newFloatQuantileFinder(known_N, N, epsilon, delta, quantiles,
                randomGenerator);
        this.clear();
    }

    /**
     * Adds the part of the specified list between indexes from
     * (inclusive) and to (inclusive) to the receiver.
     * 
     * @param list
     *            the list of which elements shall be added.
     * @param from
     *            the index of the first element to be added (inclusive).
     * @param to
     *            the index of the last element to be added (inclusive).
     * @throws IndexOutOfBoundsException
     *             if
     *             list.size()>0 && (from<0 || from>to || to>=list.size())
     *             .
     */

    public synchronized void addAllOfFromTo(FloatArrayList list, int from, int to) {
        super.addAllOfFromTo(list, from, to);
        if (this.finder != null)
            this.finder.addAllOfFromTo(list, from, to);
    }

    /**
     * Removes all elements from the receiver. The receiver will be empty after
     * this call returns.
     */

    public synchronized void clear() {
        super.clear();
        if (this.finder != null)
            this.finder.clear();
    }

    /**
     * Returns a deep copy of the receiver.
     * 
     * @return a deep copy of the receiver.
     */

    public synchronized Object clone() {
        QuantileFloatBin1D clone = (QuantileFloatBin1D) super.clone();
        if (this.finder != null)
            clone.finder = (FloatQuantileFinder) clone.finder.clone();
        return clone;
    }

    /**
     * Computes the deviations from the receiver's measures to another bin's
     * measures.
     * 
     * @param other
     *            the other bin to compare with
     * @return a summary of the deviations.
     */

    public String compareWith(AbstractFloatBin1D other) {
        StringBuffer buf = new StringBuffer(super.compareWith(other));
        if (other instanceof QuantileFloatBin1D) {
            QuantileFloatBin1D q = (QuantileFloatBin1D) other;
            buf.append("25%, 50% and 75% Quantiles: " + relError(quantile(0.25f), q.quantile(0.25f)) + ", "
                    + relError(quantile(0.5f), q.quantile(0.5f)) + ", " + relError(quantile(0.75f), q.quantile(0.75f)));
            buf.append("\nquantileInverse(mean): " + relError(quantileInverse(mean()), q.quantileInverse(q.mean()))
                    + " %");
            buf.append("\n");
        }
        return buf.toString();
    }

    /**
     * Returns the median.
     */
    public float median() {
        return quantile(0.5f);
    }

    /**
     * Computes and returns the phi-quantile.
     * 
     * @param phi
     *            the percentage for which the quantile is to be computed. phi
     *            must be in the interval (0.0,1.0].
     * @return the phi quantile element.
     */
    public synchronized float quantile(float phi) {
        return quantiles(new FloatArrayList(new float[] { phi })).get(0);
    }

    /**
     * Returns how many percent of the elements contained in the receiver are
     * <= element. Does linear interpolation if the element is not
     * contained but lies in between two contained elements.
     * 
     * @param element
     *            the element to search for.
     * @return the percentage phi of elements <= element (
     *         0.0 <= phi <=1.0).
     */
    public synchronized float quantileInverse(float element) {
        return finder.phi(element);
    }

    /**
     * Returns the quantiles of the specified percentages. For implementation
     * reasons considerably more efficient than calling {@link #quantile(float)}
     * various times.
     * 
     * @param phis
     *            the percentages for which quantiles are to be computed. Each
     *            percentage must be in the interval (0.0,1.0].
     *            percentages must be sorted ascending.
     * @return the quantiles.
     */
    public synchronized FloatArrayList quantiles(cern.colt.list.tfloat.FloatArrayList phis) {
        return finder.quantileElements(phis);
    }

    /**
     * Returns how many elements are contained in the range
     * [minElement,maxElement]. Does linear interpolation if one or
     * both of the parameter elements are not contained. Returns exact or
     * approximate results, depending on the parametrization of this class or
     * subclasses.
     * 
     * @param minElement
     *            the minimum element to search for.
     * @param maxElement
     *            the maximum element to search for.
     * @return the number of elements in the range.
     */
    public int sizeOfRange(float minElement, float maxElement) {
        return Math.round(size() * (quantileInverse(maxElement) - quantileInverse(minElement)));
    }

    /**
     * Divides (rebins) a copy of the receiver at the given percentage
     * boundaries into bins and returns these bins, such that each bin
     * approximately reflects the data elements of its range.
     * 
     * The receiver is not physically rebinned (divided); it stays unaffected by
     * this operation. The returned bins are such that if one would have
     * filled elements into multiple bins instead of one single all encompassing
     * bin only, those multiple bins would have approximately the same
     * statistics measures as the one's returned by this method.
     * 

     * The split(...) methods are particularly well suited for
     * real-time interactive rebinning (the famous "scrolling slider" effect).
     * 

     * Passing equi-distant percentages like
     * (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) into this method will yield bins
     * of an equi-depth histogram, i.e. a histogram with bin boundaries
     * adjusted such that each bin contains the same number of elements, in this
     * case 20% each. Equi-depth histograms can be useful if, for example, not
     * enough properties of the data to be captured are known a-priori to be
     * able to define reasonable bin boundaries (partitions). For example, when
     * guesses about minimas and maximas are strongly unreliable. Or when
     * chances are that by focussing too much on one particular area other
     * important areas and characters of a data set may be missed.
     * 

     * Implementation:
     * 

     * The receiver is divided into s = percentages.size()-1 intervals
     * (bins). For each interval I, its minimum and maximum elements
     * are determined based upon quantile computation. Further, each interval
     * I is split into k equi-percent-distant subintervals
     * (sub-bins). In other words, an interval is split into subintervals such
     * that each subinterval contains the same number of elements.
     * 

     * For each subinterval S, its minimum and maximum are determined,
     * again, based upon quantile computation. They yield an approximate
     * arithmetic mean am = (min+max)/2 of the subinterval. A
     * subinterval is treated as if it would contain only elements equal to the
     * mean am. Thus, if the subinterval contains, say, n
     * elements, it is assumed to consist of n mean elements
     * (am,am,...,am). A subinterval's sum of elements, sum of squared
     * elements, sum of inversions, etc. are then approximated using such a
     * sequence of mean elements.
     * 

     * Finally, the statistics measures of an interval I are computed
     * by summing up (integrating) the measures of its subintervals.
     * 

     * Accuracy:
     * 

     * Depending on the accuracy of quantile computation and the number of
     * subintervals per interval (the resolution). Objects of this class compute
     * exact or approximate quantiles, depending on the parameters used upon
     * instance construction. Objects of subclasses may always compute
     * exact quantiles, as is the case for {@link DynamicFloatBin1D}. Most
     * importantly for this class QuantileBin1D, a reasonably small
     * epsilon (e.g. 0.01, perhaps 0.001) should be used upon instance
     * construction. The confidence parameter delta is less important,
     * you may find delta=0.00001 appropriate. 

     * The larger the resolution, the smaller the approximation error, up to
     * some limit. Integrating over only a few subintervals per interval will
     * yield very crude approximations. If the resolution is set to a reasonably
     * large number, say 10..100, more small subintervals are integrated,
     * resulting in more accurate results. 

     * Note that for good accuracy, the number of quantiles computable with the
     * given approximation guarantees should upon instance construction be
     * specified, so as to satisfy
     * 

     * quantiles > resolution * (percentages.size()-1)
     * 

     * 

     * Example:
     * 

     * resolution=2, percentList = (0.0, 0.1, 0.2, 0.5, 0.9, 1.0) means
     * the receiver is to be split into 5 bins: 

     * 

     * bin 0 ranges from [0%..10%) and holds the smallest 10% of
     * the sorted elements.
     * 
bin 1 ranges from [10%..20%) and holds the next smallest 10%
     * of the sorted elements.
     * 
bin 2 ranges from [20%..50%) and holds the next smallest 30%
     * of the sorted elements.
     * 
bin 3 ranges from [50%..90%) and holds the next smallest 40%
     * of the sorted elements.
     * 
bin 4 ranges from [90%..100%) and holds the largest 10% of
     * the sorted elements.
     * 
     * 
     * The statistics measures for each bin are to be computed at a resolution
     * of 2 subbins per bin. Thus, the statistics measures of a bin are the
     * integrated measures over 2 subbins, each containing the same amount of
     * elements:
     * 

     * bin 0 has a subbin ranging from [ 0%.. 5%) and a subbin
     * ranging from [ 5%..10%).
     * 
bin 1 has a subbin ranging from [10%..15%) and a subbin
     * ranging from [15%..20%).
     * 
bin 2 has a subbin ranging from [20%..35%) and a subbin
     * ranging from [35%..50%).
     * 
bin 3 has a subbin ranging from [50%..70%) and a subbin
     * ranging from [70%..90%).
     * 
bin 4 has a subbin ranging from [90%..95%) and a subbin
     * ranging from [95%..100%).
     * 
     * 
     * Lets concentrate on the subbins of bin 0.
     * 

     * Assume the subbin A=[0%..5%) has a minimum of 300
     * and a maximum of 350 (0% of all elements are less than 300, 5%
     * of all elements are less than 350).
     * 
Assume the subbin B=[5%..10%) has a minimum of 350
     * and a maximum of 550 (5% of all elements are less than 350, 10%
     * of all elements are less than 550).
     * 
     * 
     * Assume the entire data set consists of N=100 elements.
     * 

     * Then subbin A has an approximate mean of 300+350 / 2 = 325,
     * a size of N*(5%-0%) = 100*5% = 5 elements, an approximate sum of
     * 325 * 100*5% = 1625, an approximate sum of squares of
     * 325² * 100*5% = 528125, an approximate sum of
     * inversions of (1.0/325) * 100*5% = 0.015, etc.
     * 
Analogously, subbin B has an approximate mean of
     * 350+550 / 2 = 450, a size of N*(10%-5%) = 100*5% = 5
     * elements, an approximate sum of 450 * 100*5% = 2250, an
     * approximate sum of squares of 450² * 100*5% = 1012500
     * , an approximate sum of inversions of (1.0/450) * 100*5% = 0.01,
     * etc.
     * 
     * 
     * Finally, the statistics measures of bin 0 are computed by summing up
     * (integrating) the measures of its subintervals: Bin 0 has a size of
     * N*(10%-0%)=10 elements (we knew that already), sum of
     * 1625+2250=3875, sum of squares of
     * 528125+1012500=1540625, sum of inversions of
     * 0.015+0.01=0.025, etc. From these follow other measures such as
     * mean=3875/10=387.5, rms = sqrt(1540625 / 10)=392.5, etc. The
     * other bins are computes analogously.
     * 
     * @param percentages
     *            the percentage boundaries at which the receiver shall be
     *            split.
     * @param k
     *            the desired number of subintervals per interval.
     */
    public synchronized MightyStaticFloatBin1D[] splitApproximately(FloatArrayList percentages, int k) {
        /*
         * percentages = [p0, p1, p2, ..., p(size-2), p(size-1)] defines bins
         * [p0,p1), [p1,p2), ..., [p(size-2),p(size-1)) each bin is divided into
         * k equi-percent-distant sub bins (subintervals). e.g. k = 2 means
         * "compute" with a resolution (accuracy) of 2 subbins (subintervals)
         * per bin,
         * 
         * percentages = [0.1, 0.2, 0.3, ..., 0.9, 1.0] means bin 0 holds the
         * first 0.1-0.0=10% of the sorted elements, bin 1 holds the next
         * 0.2-0.1=10% of the sorted elements, ...
         * 
         * bins = [0.1, 0.2), [0.2, 0.3), ..., [0.9, 1.0) subBins = [0.1, 0.15,
         * 0.2, 0.25, 0.3, ....]
         * 
         * [0.1, 0.15) [0.15, 0.2) [0.3, 0.35) [0.35, 0.4)
         * 
         * [0.2, 0.25) [0.25, 0.3)
         * 
         */
        int percentSize = percentages.size();
        if (k < 1 || percentSize < 2)
            throw new IllegalArgumentException();

        float[] percent = percentages.elements();
        int noOfBins = percentSize - 1;

        // construct subintervals
        float[] subBins = new float[1 + k * (percentSize - 1)];
        subBins[0] = percent[0];
        int c = 1;

        for (int i = 0; i < noOfBins; i++) {
            float step = (percent[i + 1] - percent[i]) / k;
            for (int j = 1; j <= k; j++) {
                subBins[c++] = percent[i] + j * step;
            }
        }

        // compute quantile elements;
        float[] quantiles = quantiles(new FloatArrayList(subBins)).elements();

        // collect summary statistics for each bin.
        // one bin's statistics are the integrated statistics of its
        // subintervals.
        MightyStaticFloatBin1D[] splitBins = new MightyStaticFloatBin1D[noOfBins];
        int maxOrderForSumOfPowers = getMaxOrderForSumOfPowers();
        maxOrderForSumOfPowers = Math.min(10, maxOrderForSumOfPowers); // don't
        // compute
        // tons
        // of
        // measures

        int dataSize = this.size();
        c = 0;
        for (int i = 0; i < noOfBins; i++) { // for each bin
            float step = (percent[i + 1] - percent[i]) / k;
            float binSum = 0;
            float binSumOfSquares = 0;
            float binSumOfLogarithms = 0;
            float binSumOfInversions = 0;
            float[] binSumOfPowers = null;
            if (maxOrderForSumOfPowers > 2) {
                binSumOfPowers = new float[maxOrderForSumOfPowers - 2];
            }

            float binMin = quantiles[c++];
            float safe_min = binMin;
            float subIntervalSize = dataSize * step;

            for (int j = 1; j <= k; j++) { // integrate all subintervals
                float binMax = quantiles[c++];
                float binMean = (binMin + binMax) / 2;
                binSum += binMean * subIntervalSize;
                binSumOfSquares += binMean * binMean * subIntervalSize;
                if (this.hasSumOfLogarithms) {
                    binSumOfLogarithms += (Math.log(binMean)) * subIntervalSize;
                }
                if (this.hasSumOfInversions) {
                    binSumOfInversions += (1 / binMean) * subIntervalSize;
                }
                if (maxOrderForSumOfPowers >= 3)
                    binSumOfPowers[0] += binMean * binMean * binMean * subIntervalSize;
                if (maxOrderForSumOfPowers >= 4)
                    binSumOfPowers[1] += binMean * binMean * binMean * binMean * subIntervalSize;
                for (int p = 5; p <= maxOrderForSumOfPowers; p++) {
                    binSumOfPowers[p - 3] += Math.pow(binMean, p) * subIntervalSize;
                }

                binMin = binMax;
            }
            c--;

            // example: bin(0) contains (0.2-0.1) == 10% of all elements
            int binSize = Math.round((percent[i + 1] - percent[i]) * dataSize);
            float binMax = binMin;
            binMin = safe_min;

            // fill statistics
            splitBins[i] = new MightyStaticFloatBin1D(this.hasSumOfLogarithms, this.hasSumOfInversions,
                    maxOrderForSumOfPowers);
            if (binSize > 0) {
                splitBins[i].size = binSize;
                splitBins[i].min = binMin;
                splitBins[i].max = binMax;
                splitBins[i].sum = binSum;
                splitBins[i].sum_xx = binSumOfSquares;
                splitBins[i].sumOfLogarithms = binSumOfLogarithms;
                splitBins[i].sumOfInversions = binSumOfInversions;
                splitBins[i].sumOfPowers = binSumOfPowers;
            }
            /*
             * float binMean = binSum / binSize;
             * System.out.println("size="+binSize);
             * System.out.println("min="+binMin);
             * System.out.println("max="+binMax);
             * System.out.println("mean="+binMean);
             * System.out.println("sum_x="+binSum);
             * System.out.println("sum_xx="+binSumOfSquares);
             * System.out.println("rms="+Math.sqrt(binSumOfSquares / binSize));
             * System.out.println();
             */

        }
        return splitBins;
    }

    /**
     * Divides (rebins) a copy of the receiver at the given interval
     * boundaries into bins and returns these bins, such that each bin
     * approximately reflects the data elements of its range.
     * 
     * For each interval boundary of the axis (including -infinity and
     * +infinity), computes the percentage (quantile inverse) of elements less
     * than the boundary. Then lets
     * {@link #splitApproximately(FloatArrayList,int)} do the real work.
     * 
     * @param axis
     *            an axis defining interval boundaries.
     * @param k
     *            the desired number of subintervals per interval.
     */
    public synchronized MightyStaticFloatBin1D[] splitApproximately(hep.aida.tfloat.FloatIAxis axis, int k) {
        FloatArrayList percentages = new FloatArrayList(new hep.aida.tfloat.ref.FloatConverter().edges(axis));
        percentages.beforeInsert(0, Float.NEGATIVE_INFINITY);
        percentages.add(Float.POSITIVE_INFINITY);
        for (int i = percentages.size(); --i >= 0;) {
            percentages.set(i, quantileInverse(percentages.get(i)));
        }

        return splitApproximately(percentages, k);
    }

    /**
     * Returns a String representation of the receiver.
     */

    public synchronized String toString() {
        StringBuffer buf = new StringBuffer(super.toString());
        buf.append("25%, 50%, 75% Quantiles: " + quantile(0.25f) + ", " + quantile(0.5f) + ", " + quantile(0.75f));
        // buf.append("10%, 25%, 50%, 75%, 90% Quantiles: "+quantile(0.1) + ",
        // "+ quantile(0.25) + ", "+ quantile(0.5) + ", " + quantile(0.75) + ",
        // " + quantile(0.9));
        buf.append("\nquantileInverse(median): " + quantileInverse(median()));
        buf.append("\n");
        return buf.toString();
    }
}
Required main memory [MB]
#quantiles	epsilon	delta	N * unknown	*	N known *
#quantiles	epsilon	delta	N_max=inf		N_max=10⁶	N_max=10⁷	N_max=10⁸	N_max=inf	N_max=10⁶	N_max=10⁷	N_max=10⁸

any	0	any	infinity *		7.6	76	762	infinity *	7.6	76	762
any	10^-1	0	infinity		0.003	0.005	0.006	0.03	0.003	0.005	0.006
	10^-2				0.02	0.03	0.05	0.31	0.02	0.03	0.05
	10^-3				0.12	0.2	0.3	2.7	0.12	0.2	0.3
	10^-4				0.6	1.2	2.1	26.9	0.6	1.2	2.1
	10^-5				2.5	6.4	11.6	205	2.5	6.4	11.6
	10^-6				7.6	25.4	63.6	1758	7.6	25.4	63.6

100	* 10^-2	10^{* -1}	0.033		0.021	0.03	0.03	0.020	0.020	0.020	0.020
	* 10^-2	10^{* -5}	0.038		0.021	0.03	0.04	0.024	0.020	0.020	0.020
	* 10^-3	10^-1	0.48		0.12	0.2	0.3	0.32	0.12	0.2	0.3
	* 10^-3	10^-5	0.54		0.12	0.2	0.3	0.37	0.12	0.2	0.3
	* 10^-4	10^-1	6.6		0.6	1.2	2.1	4.6	0.6	1.2	2.1
	* 10^-4	10^-5	7.2		0.6	1.2	2.1	5.2	0.6	1.2	2.1
	* 10^-5	10^-1	86		2.5	6.4	11.6	63	2.5	6.4	11.6
	* 10^-5	10^-5	94		2.5	6.4	11.6	70	2.5	6.4	11.6

* 10000	* 10^-2	10^-1	0.04		0.02	0.03	0.04	0.02	0.02	0.02	0.02
	* 10^-2	10^-5	0.04		0.02	0.03	0.04	0.03	0.02	0.03	0.03
	* 10^-3	10^-1	0.52		0.12	0.21	0.3	0.35	0.12	0.21	0.3
	* 10^-3	10^-5	0.56		0.12	0.21	0.3	0.38	0.12	0.21	0.3
	* 10^-4	10^-1	7.0		0.64	1.2	2.1	5.0	0.64	1.2	2.1
	* 10^-4	10^-5	7.5		0.64	1.2	2.1	5.4	0.64	1.2	2.1
	* 10^-5	10^-1	90		2.5	6.4	11.6	67	2.5	6.4	11.6
	* 10^-5	10^-5	96		2.5	6.4	11.6	71	2.5	6.4	11.6

* #quantiles	* epsilon	* delta	N_max=inf		N_max=10⁶	N_max=10⁷	N_max=10⁸	N_max=inf	N_max=10⁶	N_max=10⁷	N_max=10⁸
* #quantiles	* epsilon	* delta	N * unknown		N known *
* Required main memory [MB]
Performance
* Quantiles	* Epsilon	* Delta	Filling * [#elements/sec]	Quantile computation * [#quantiles/sec]
* Quantiles	* Epsilon	* Delta	N unknown, * N_max=inf	N known, * N_max=10⁷	N unknown, * N_max=inf	N known, * N_max=10⁷

* 10⁴	10 ^-1	10 ^{* -1}	* * 1600000 * *	1300000	250000	130000
	10 ^-2		360000	1200000	50000	20000
	10 ^-3		150000	200000	3600	3000
	10 ^-4		120000	170000	80	1000